For digital delivery production, CRL scans derivative files for access purposes from the best available copies. In some cases, the quality of the source may result in a scan of lower-than-optimal quality. In almost every instance, all original source materials, whether paper or microform, are retained by CRL indefinitely and are thus available for rescanning if that becomes necessary or appropriate. In collaborative digitization efforts, the files are scanned at the highest standard available to the digital aggregator or publisher.
Some digital files from CRL collections are represented as page images only, but an OCR engine is applied to produce searchable text when the quality of the scan, the format of the document, and the language of the text are suitable.
General specifications for capture and access
- Master scans: TIFFs retained for archival use.
- Image capture: Minimum imaging specifications are 300 to 400 dpi, mostly bitonal, with some grayscale as needed for legibility. Images for select content (including the APCRL collection from ProQuest) are in full color.
- OCR: Uncorrected OCR (optical character recognition) is applied to provide searchable text whenever the format and quality of the original source will support it.
- Access files: Scanned documents are accessible as PDF files, combining page images with searchable text. For digital delivery content, multi-page PDF files were produced until 2014, and single-page PDF files have been produced since then. For the single-page PDF files, CRL’s DDS server produces multi-page PDFs on the fly, based on user-defined page ranges.
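The on-the-fly assembly of multi-page PDFs from single-page files can be illustrated with a minimal sketch. It is not a description of the DDS server itself: the range syntax ("3-5, 9"), the function names, and the file naming scheme are all assumptions made for illustration. The sketch only selects which single-page files a request would need; merging them into one PDF would be left to a PDF library.

```python
import re


def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Expand a range string such as "3-5, 9" into [3, 4, 5, 9].

    The comma-and-hyphen syntax is an assumed format, not CRL's.
    """
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        match = re.fullmatch(r"(\d+)(?:\s*-\s*(\d+))?", part)
        if match is None:
            raise ValueError(f"invalid page range: {part!r}")
        first = int(match.group(1))
        # A single page number is treated as a one-page range.
        last = int(match.group(2) or first)
        if first < 1 or last > page_count or first > last:
            raise ValueError(f"page range out of bounds: {part!r}")
        pages.extend(range(first, last + 1))
    return pages


def single_page_files(item_id: str, pages: list[int]) -> list[str]:
    """Map page numbers to single-page PDF names.

    The "<item>_<4-digit page>.pdf" pattern is hypothetical.
    """
    return [f"{item_id}_{page:04d}.pdf" for page in pages]


if __name__ == "__main__":
    wanted = parse_page_ranges("3-5, 9", page_count=12)
    for name in single_page_files("item123", wanted):
        print(name)
```

Storing one file per page keeps each request cheap in this arrangement: only the pages a user asks for need to be read and combined.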
Management of CRL digital assets
CRL has or controls three classes of digital assets:
- Digital assets generated by CRL
- Digital assets generated by programs under the CRL organizational umbrella
- Digital assets generated by CRL partnerships with other organizations, including publishers.
For digital assets generated by CRL, the supporting infrastructure consists of a network of local Windows-based servers operating in a virtualized environment. Master files are maintained locally, with a first copy stored online and a backup copy stored offline. Files from early digitization projects prior to 2009 are stored at Amazon Web Services. For digital assets generated by CRL partnerships with other organizations, CRL relies on the asset management expertise and resources of partners like LLMC-Digital, ProQuest, and NewsBank, and takes advantage of archiving at scale by specialized organizations like Portico, CLOCKSS, and Scholars Portal. In general, CRL obtains copies of digital assets produced through partnerships for safekeeping and eventual access through CRL’s servers.