Various efforts have been made to “mass digitize” documents so that they are accessible to everyone. These efforts typically live in large databases, and OCR technology is steadily improving in accuracy and speed, which makes mass digitization more effective.
OCR technology is faster and more accurate
Optical character recognition (OCR) is a technology that converts scanned images of text into machine-readable, searchable form. Using this technology, businesses can save time, resources, and money; it also reduces redundant work and can improve security. OCR is used for a variety of applications, including data entry, document editing and reprinting, and converting printed books to electronic format.
OCR systems combine image optimisation with character recognition to identify text in documents, and the results are then corrected and proofread. It is important to remember that OCR is not perfect: its accuracy and speed are still limited, especially for older or degraded material.
OCR output can be difficult to interpret, especially when a document has been scanned at low resolution or is densely packed with text. To improve accuracy, check and optimise the image in the scanning software before recognition.
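As a rough illustration of that clean-up step, the sketch below uses the Pillow and pytesseract libraries (neither is named in the article; both are assumptions) to convert a scan to grayscale, stretch its contrast, and binarize it before recognition. The threshold and file name are illustrative and usually need tuning per source.

```python
# A minimal pre-processing sketch, assuming Pillow and the Tesseract engine
# via pytesseract; the threshold is illustrative, not a recommended value.
from PIL import Image, ImageOps
import pytesseract

def ocr_with_cleanup(path, threshold=160):
    img = Image.open(path)
    gray = ImageOps.grayscale(img)            # drop colour noise
    gray = ImageOps.autocontrast(gray)        # stretch the tonal range
    bw = gray.point(lambda p: 255 if p > threshold else 0)  # crude binarization
    return pytesseract.image_to_string(bw)

print(ocr_with_cleanup("scanned_page.png"))   # hypothetical file name
```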
Depending on the OCR system you are using, you may need to correct the results manually. The engine will skip characters it cannot read, so the output should be checked by a human reviewer.
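One way to support that human check, sketched below under the assumption that Tesseract is the engine, is to pull per-word confidence scores out of the recognizer and queue low-confidence words for manual review; the 70-point cutoff is arbitrary.

```python
# Flag low-confidence words for manual correction (a sketch, assuming
# pytesseract; the cutoff and file name are illustrative).
from PIL import Image
import pytesseract
from pytesseract import Output

def words_needing_review(path, min_conf=70):
    data = pytesseract.image_to_data(Image.open(path), output_type=Output.DICT)
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        # conf is -1 for non-text blocks; skip empty tokens as well
        if word.strip() and float(conf) != -1 and float(conf) < min_conf:
            flagged.append(word)
    return flagged

print(words_needing_review("scanned_page.png"))
```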
There are a number of commercial and open-source OCR systems available for the most common writing systems. Some OCR software uses a two-pass approach: the first pass recognizes the characters it is confident about, and the second pass uses those results to re-analyse the characters that remain uncertain, improving per-character accuracy.
For better accuracy, it is recommended to scan the documents at a resolution of at least 300 DPI. The image should be in a lossless file format to avoid losing information.
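A sketch of that advice, assuming Pillow: read the DPI recorded in a scan, resample anything below 300 DPI, and write the result as a lossless TIFF. Note that resampling cannot restore detail that was never captured; rescanning at 300 DPI is always preferable.

```python
# Enforce the 300 DPI / lossless-format advice (a sketch, assuming Pillow;
# file names are illustrative). Upsampling does not add real detail.
from PIL import Image

TARGET_DPI = 300

def ensure_300_dpi(src, dst):
    img = Image.open(src)
    dpi = img.info.get("dpi", (72, 72))[0]    # scans without metadata often mean 72
    if dpi < TARGET_DPI:
        scale = TARGET_DPI / dpi
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.LANCZOS)
    # LZW-compressed TIFF is lossless, so no pixel information is discarded
    img.save(dst, format="TIFF", dpi=(TARGET_DPI, TARGET_DPI), compression="tiff_lzw")

ensure_300_dpi("raw_scan.jpg", "scan_300dpi.tif")
```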
OCR software may also include a dictionary or a “training facility” to help the system recognize text. These types of programs are more accurate than OCR systems that do not incorporate a dictionary.
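For example, when Tesseract is the engine (an assumption, since the article names no specific software), its language models act as built-in dictionaries, and a custom word list can be supplied through the --user-words option so that domain terms are recognized more reliably; the file names below are hypothetical.

```python
# Supply a language model and a custom word list (a sketch, assuming
# pytesseract/Tesseract; "newspaper_terms.txt" is a hypothetical file).
from PIL import Image
import pytesseract

custom_config = "--user-words newspaper_terms.txt --psm 6"
text = pytesseract.image_to_string(
    Image.open("column_scan.png"),    # hypothetical scan of a newspaper column
    lang="eng",                       # the language model doubles as a dictionary
    config=custom_config,
)
print(text)
```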
Several commercial companies and national libraries have taken newspaper digitisation seriously. The results can be good, but they can also be poor. Some companies claim to improve OCR accuracy through manual adjustments, but that kind of manual intervention does not scale to mass digitisation.
Databases are the “home” of a variety of mass-digitization efforts
Various organizations are pouring millions into databases of historical newspapers; one of the largest is Chronicling America. Despite those sums, however, many publishers have yet to digitize their own newspapers, so libraries have been left to figure out how to make the printed page useful to today’s digitally savvy researchers. Some libraries have taken the lead while others are still playing catch-up, and there is no shortage of national libraries placing bets on the future of history and literacy.
The Library of Congress and the American Antiquarian Society are among the more prominent players. A few notable examples include Dünnhaupt digital and the IDS Project, a collaboration of more than a dozen libraries. The IDS Project has been around for many years, but the newest member of the family, the National WWII Museum, recently joined the fray.
The Library of Congress has a slew of notable achievements, most notably its work on the ICON newspaper database and on Chronicling America, a joint project with the National Endowment for the Humanities that includes hundreds of historic US newspapers. Other accomplishments include the Library of Congress’s American Women’s History Project and the National WWII Museum’s digitized collections from New Orleans and Washington, DC; the museum recently announced plans to digitize its entire collection. A small fraction of this digitized content, meanwhile, is being incorporated into the Chronicling America database. The most ambitious undertaking, however, is the joint effort of Europeana, the Library of Congress, and the University of Michigan’s Digital Library Program, which is already working on a large part of the ICON database. In the coming weeks these institutions will convene a cross-library meeting to discuss their respective efforts and identify areas for consolidation. A successful outcome would be an enhanced archival infrastructure that benefits a wide variety of scholarly communities.
Cost of digitization
Mass digitization is the large-scale conversion of books and other library materials into electronic form, typically by scanning them and applying optical character recognition (OCR). The goal is to create a digitized version of every item in the library, which can be a cost-effective way to preserve physical materials.
The process involves converting hard-copy records into digital files, either by scanning them or by migrating them from one system to another. The Internet Archive is a good example of a large-scale project that produces scanned page images.
Other large-scale digitization projects involve assembling complete sets of documents, like the JSTOR project, which aims to provide a complete run of each journal it covers. A similar project at the University of Michigan moved staggeringly fast, scanning millions of books in less than a decade.
The Stratford Institute, a new center for digital media based in Waterloo, Ontario, is another example; it has digitized nearly a thousand journals.
The cost of mass digitization will likely require private funding, and the challenge is finding the right balance between economies of scale and user satisfaction. For example, do you really need to digitize everything in the collection, or is it acceptable to hold some items back for longer?
Beyond the obvious question of how much money you will need, you also have to ask whether every item is a candidate for digitization at all. Some items are too fragile to scan; others have odd-sized plates or folded maps that complicate the workflow.
A digitized book of archival-quality page images runs to roughly 500 megabytes, which is manageable for storage and delivery. A digital surrogate may not capture everything in the original, however, and items can still be lost in the digital shuffle; recovering them could cost more than the initial digitization.
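A back-of-envelope calculation makes the figure concrete. Every number in the sketch below (page size, page count, bit depth, compression ratio) is an illustrative assumption rather than a figure from the article; with these inputs a book lands in the same few-hundred-megabyte range.

```python
# Rough storage estimate for a digitized book; all parameters are assumptions.
DPI = 300
PAGE_W_IN, PAGE_H_IN = 8.5, 11        # letter-sized page
BYTES_PER_PIXEL = 1                   # 8-bit grayscale master images
PAGES_PER_BOOK = 300
COMPRESSION_RATIO = 0.2               # lossless compression often shrinks ~5x

pixels_per_page = (PAGE_W_IN * DPI) * (PAGE_H_IN * DPI)
raw_mb_per_page = pixels_per_page * BYTES_PER_PIXEL / 1_000_000
book_mb = raw_mb_per_page * PAGES_PER_BOOK * COMPRESSION_RATIO

print(f"~{raw_mb_per_page:.1f} MB per raw page, ~{book_mb:.0f} MB per compressed book")
# Multiply the per-book figure by the collection size to budget total storage.
```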
The most important component of any large-scale project is planning. The best strategy is to identify which parts of the collection will deliver the most value: if you cannot afford to scan all of your archival material, digitize the items that matter most to your organization.
Permissibility under the fair use doctrine
Although the fair use doctrine is an open-ended concept, courts have shaped its application over the years. In weighing a fair use defense, they consider four statutory factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the work, including the licensing market. Courts may also ask whether the use was undertaken in good faith.
In the case of the Grateful Dead concert posters, the court found that reproducing the posters at reduced size as part of an illustrated biography of the band was a fair use.
The second case involved the Google Library project, which scanned books and other works in bulk to create a database that patrons can search. The district court ruled in Google’s favor on its fair use defense, finding among other things that the scanning had little adverse impact on the authors’ markets.
In addition, courts have generally favored nonprofit educational uses, based in part on good faith. For instance, if the digital license fee at stake is only a small proportion of the publisher’s overall revenue, the school is likely to prevail.
In recent interpretations the pendulum has swung away from the traditional market. Courts have instead emphasized transformative use, in which the original work is given a new purpose or meaning. Although the word “transformative” does not appear in the fair use statute, it has become a very important factor in determining the permissibility of mass digitization.
Using copyrighted material in a classroom setting ordinarily requires written permission from the copyright owner, but the fair use doctrine provides an exception that allows teachers to reproduce limited copies of copyrighted works for classroom use.
This exemption is subject to three limits: the number of preservation copies, the number of security copies, and the overall amount of copying allowed.
The fair use pendulum has swung from the traditional market toward the transformative, a shift the Supreme Court recognized in Campbell v. Acuff-Rose, which held that 2 Live Crew’s parody of “Pretty Woman” could be a transformative fair use.
Courts have also considered a use’s effect on the licensing market, including how demand for licensed copies would change if the copyrighted work were made publicly available, as well as the effect on the market for derivative works.