Digitization Guidelines

These are SAOA’s technical guidelines for digital files derived from text-based materials (in print, microfilm, or microfiche) to be included in SAOA’s digital collections. Digitization providers (commercial entities as well as academic institutions) will be expected to conform to these specifications to ensure consistency of the digital materials for ingest into the SAOA digital asset management system. The following are the ideal specifications for ingesting image-based material into SAOA’s collections.

  1. At the outset of each project, the SAOA Project Manager will schedule a phone consultation with the digitization provider to help ensure that the digitization project conforms to the Digitization Guidelines laid out below. The digitization provider or content contributor should provide SAOA with:
    1. Estimates of the total number of images, the total number of volumes (for serials and multi-volume monographs), and, if possible, the total file size (in MB, GB, or TB), and

    2. Details regarding the condition of the print or microform material.

  2. Descriptive Metadata – the metadata should:
    1. Use one of the following metadata schemes: Dublin Core or MARC21.

    2. Be provided in one of the following metadata/catalog record file formats: MARC XML or CSV.

    3. Conform to SAOA’s metadata template (for example, for monographs vs serials).

    4. Include accurate holdings information for serials or multipart titles.

    5. Have already been provided as a sample set of records for SAOA staff to review during the proposal phase, as specified above.

NOTE: Data entered into Forum, whether typed or copied and pasted, will be encoded as UTF-8, because SAOA’s hosting platform defaults to UTF-8 encoding for data entry.

  1. Structural Metadata – appropriate structural metadata should be provided to help SAOA organize the image files and to allow for navigation within the item (for example, by chapter).
  2. Asset File Types – the following file types for each image of a given title should be provided:
    1. Master image files for preservation: TIFF images.

    2. Access files (image surrogates): JPEG, JPEG2000 (JP2), or PDF.
    3. OCR files (recommended, where available):
      1. .txt, and
      2. OCR XML or hOCR.
  3. Image Capture
    1. TIFF master image files – these files are for long term preservation purposes and are deposited in a dark archive.
      1. Resolution: 400 ppi to 600 ppi for new digitization.
      2. Uncompressed TIFF 6.0 images, in either “little endian” (IBM PC) or “big endian” (Mac) byte order.
      3. All files should pass JHOVE format validation as well-formed and valid.
      4. 24-bit color for new digitization. 8-bit grayscale may be acceptable for items already digitized or with no color content; grayscale files should have either no gray profile or Gray Gamma 2.2. No proprietary scanner profiles.
      5. One page per image.
    2. JPEG, JP2, and PDF access files (image surrogates) – these images are for presentation purposes and are ingested and hosted on SAOA’s platform for researchers to access.
      1. Resolution: keep the surrogate resolution the same as that of the master TIFF file if the surrogate file size meets the requirements (see File Size, below). In some cases, it may be acceptable to decrease the resolution of the surrogate to a minimum of 300 ppi in order to decrease the file size.
      2. Compression level: between 10:1 and 15:1, depending upon the dimensions and color of the original.
      3. File Size: the size of each access file should range from 0.5 MB (megabytes) to 2.5 MB, depending on various factors (size of the original item, format, content, color, darkness).
    3. Image quality: images should meet the following characteristics, many of which may be available as automated settings on the scanner as part of the image capture option (e.g. microfilm scanning). In exceptional cases, post-processing or correction might be necessary to:
      1. Achieve desired tone distribution
      2. Sharpen images to match appearance of the originals
      3. Crop and/or deskew the images, oriented to the text (not to the page)
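The way resolution, compression level, and file size interact in the access-file requirements above can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of SAOA's specification. It assumes a hypothetical 6 x 9 inch page scanned in 24-bit color, treats 1 MB as 1,000,000 bytes, and ignores file header overhead; the function names are illustrative only.

```python
def uncompressed_bytes(width_in, height_in, ppi, bits_per_pixel=24):
    """Raw pixel data size for one page image (header overhead ignored)."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bits_per_pixel // 8


def access_size_mb(width_in, height_in, ppi, ratio, bits_per_pixel=24):
    """Estimated access-file size in MB (assumed 1 MB = 1,000,000 bytes)
    when the raw data is compressed at ratio:1."""
    raw = uncompressed_bytes(width_in, height_in, ppi, bits_per_pixel)
    return raw / ratio / 1_000_000


# Hypothetical 6 x 9 inch page; compare the guideline's resolution and
# compression bounds against the 0.5 MB to 2.5 MB target range.
for ppi in (400, 300):
    for ratio in (10, 15):
        size = access_size_mb(6, 9, ppi, ratio)
        in_range = 0.5 <= size <= 2.5
        print(f"{ppi} ppi at {ratio}:1 -> {size:.2f} MB, in range: {in_range}")
```

Under these assumptions, a 400 ppi surrogate at 10:1 comes to roughly 2.59 MB, just over the 2.5 MB ceiling, while the same page at 15:1 or at 300 ppi fits. This illustrates why the guidelines allow the surrogate resolution to be lowered to 300 ppi in some cases.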
  4. File Naming
    1. Monographs (Single Volume)
      1. Format: titleID_YEAR_sequential image #.tif
      2. Example: 986786411_1915_00135.tif
        1. This would be for a monograph (single volume) published in 1915, 135th consecutive image.
    2. Monographs (Multi-Volume)
      1. Format: titleID_YEAR_VOLUME #_sequential image #.tif
      2. Example: 990512780_1918_003_00115.tif
        1. This would be for a monograph (multi-volume) published in 1918, volume 3, 115th consecutive image.
    3. Serials
      1. Numbered Issues:
        1. Format: titleID_YEAR_VOLUME #_ISSUE #_sequential image #.tif
        2. Example: 990312980_1915_002_001_00253.tif
          • This would be for a serial published in 1915, volume 2, issue 1, 253rd consecutive image.
      2. Dated Issues:
        1. Format: titleID_YEAR-MONTH-DAY_sequential image #.tif
        2. Example: 22123199_1921-12-24_00012.tif
          • This would be for a serial published on December 24, 1921, 12th consecutive image.
      3. Quarterly Issues:
        1. Format: titleID_YEAR_QUARTER_sequential image #.tif
        2. Example: 226114808_1895_Spring_00005.tif
          • This would be for a serial published in 1895, Spring issue, 5th consecutive image.
    4. FOR ALL THE ABOVE:
      1. File naming of the master and derivative access files must follow the same pattern.
        1. The .jp2 or .jpg derivative must have precisely the same filename as its corresponding master .tif file, except for the filename extension; e.g. "990512780_1918_003_00115.jp2" is derived from (corresponds to) the image "990512780_1918_003_00115.tif".
        2. For the Title ID, assign the OCLC#.
        3. Allow five digits for sequential image numbering and three digits each for the volume and issue numbers.
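The naming patterns above can be generated and checked programmatically. The following Python sketch is one possible way to do so and is not an SAOA tool: the function names, the keyword arguments, and the validation pattern are all assumptions made for illustration. The pattern accepts only the forms shown in the examples above (numeric title ID, four-digit year or YEAR-MONTH-DAY date, optional three-digit volume and issue or an alphabetic quarter label, five-digit image number).

```python
import re
from pathlib import Path


def image_filename(title_id, date, seq, volume=None, issue=None,
                   quarter=None, ext="tif"):
    """Build a filename following the patterns in the guidelines.

    date is either a year (1915) or a YEAR-MONTH-DAY string ("1921-12-24").
    """
    parts = [str(title_id), str(date)]
    if quarter is not None:
        parts.append(str(quarter))            # e.g. "Spring"
    if volume is not None:
        parts.append(f"{int(volume):03d}")    # three digits for volume
    if issue is not None:
        parts.append(f"{int(issue):03d}")     # three digits for issue
    parts.append(f"{int(seq):05d}")           # five digits for image number
    return "_".join(parts) + f".{ext}"


def derivative_name(master_name, ext="jp2"):
    """Same filename as the master, with only the extension changed."""
    return Path(master_name).with_suffix(f".{ext}").name


# Assumed validation pattern covering the example forms only.
NAME_RE = re.compile(
    r"\d+_\d{4}(-\d{2}-\d{2})?(_(\d{3}|[A-Za-z]+)){0,2}_\d{5}\.(tif|jp2|jpg|pdf)"
)


def is_valid_name(name):
    """True if name matches one of the forms shown in the examples."""
    return NAME_RE.fullmatch(name) is not None


print(image_filename("986786411", 1915, 135))
print(image_filename("990312980", 1915, 253, volume=2, issue=1))
print(image_filename("22123199", "1921-12-24", 12))
print(derivative_name("990512780_1918_003_00115.tif"))
```

The four calls reproduce the single-volume monograph, numbered-issue serial, and dated-issue serial examples above, plus the .jp2 name that corresponds to the multi-volume master file.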
  5. Folders
    1. For each title, there should be two separate folders containing the TIFF and JP2/JPG/PDF files, respectively.  
      1. Format:
        1. OCLC#_ShortTitle_TIFF 
        2. OCLC#_ShortTitle_JP2 
      2. For monographs, all files for the title should be contained in the main folder (above).
      3. For serials, use subfolders labeled by volume #, issue #, or year (depending on the title).
        1. Example (for issue 5) 
          • OCLC#_ShortTitle_TIFF 
            • 005
        2. Example (for volume 3, issue 7)
          • OCLC#_ShortTitle_TIFF
            • 003
              • 007
        3. Example (for the year 1915)
          • OCLC#_ShortTitle_TIFF
            • 1915
    2. For every folder, there should be identical names for the master TIFF files and the corresponding access files (such as JPG, JP2, or PDF), except for the file extensions. 
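The requirement that master and access files carry identical names can be checked before transfer. The following Python sketch is one possible check, not an SAOA tool. It assumes that the access folder mirrors the subfolder layout of the TIFF folder, which the guidelines do not state explicitly; the function name and the OCLC number 12345678 in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path

ACCESS_EXTS = {".jp2", ".jpg", ".pdf"}


def unmatched_files(tiff_root, access_root):
    """Compare a title's TIFF folder with its access folder.

    Returns two sorted lists of relative names without extensions:
    masters lacking an access file, and access files lacking a master.
    """
    tiff_root, access_root = Path(tiff_root), Path(access_root)
    masters = {
        p.relative_to(tiff_root).with_suffix("").as_posix()
        for p in tiff_root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".tif"
    }
    access = {
        p.relative_to(access_root).with_suffix("").as_posix()
        for p in access_root.rglob("*")
        if p.is_file() and p.suffix.lower() in ACCESS_EXTS
    }
    return sorted(masters - access), sorted(access - masters)


# Demonstration on a throwaway folder tree (volume 3, issue 7 of a serial).
with tempfile.TemporaryDirectory() as tmp:
    tiff_dir = Path(tmp) / "12345678_ShortTitle_TIFF" / "003" / "007"
    jp2_dir = Path(tmp) / "12345678_ShortTitle_JP2" / "003" / "007"
    tiff_dir.mkdir(parents=True)
    jp2_dir.mkdir(parents=True)
    for n in (1, 2):
        (tiff_dir / f"12345678_1915_003_007_{n:05d}.tif").touch()
    (jp2_dir / "12345678_1915_003_007_00001.jp2").touch()

    missing_access, missing_master = unmatched_files(
        tiff_dir.parents[1], jp2_dir.parents[1]
    )
    print("Masters without access file:", missing_access)
    print("Access files without master:", missing_master)
```

In the demonstration, the second TIFF has no corresponding JP2, so it is reported in the first list; the second list is empty.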
  6. File Transfer
    1. Acceptable methods of file transfer are via hard drive, USB drive, FTP, Dropbox, Google Drive, and CD.

 

Last updated: September 14, 2021
