CLOCKSS Audit Report 2018
Report Details
CLOCKSS: Detailed Audit Findings
The CLOCKSS Archive is a joint venture of publishers and libraries. Publishers of electronic journals, including Elsevier, Springer, Taylor & Francis, and Wiley, enable CLOCKSS to preserve the article contents of their journals on an ongoing basis. As of January 2018, CLOCKSS contained 30 million articles from 14,937 titles by 253 publishers. A total of 53 triggered titles were freely available on the CLOCKSS website.
Publishers provide their e-journal content to CLOCKSS for archiving. This is done in one of two ways: by allowing CLOCKSS to harvest that content directly from the publisher's website or by file transfer. With harvest, CLOCKSS crawls a publisher's site and harvests the same content that the publisher makes available online to readers. A crawl generates a submission information package (SIP) consisting of the journal content and appropriate metadata. With file transfer, FTP, rsync, or another file transfer mechanism is used to transfer “packages” of content and metadata from a publisher to CLOCKSS.
With both harvested and transferred content, each SIP typically represents the articles published since the previous harvest or transfer. The unit archived by CLOCKSS typically contains all article content published by a publisher during a defined period of time (such as journal year or volume) plus files containing metadata related to that content.
Ingested content is then stored as the original bits on a global network of 12 “nodes,” repositories maintained by participating universities, libraries and other organizations, each of which has certain specified obligations to CLOCKSS. The nodes, located in the U.S. (five nodes), Canada, the United Kingdom, Germany, Italy, Japan, Hong Kong, and Australia, are each obliged to store a complete version of the CLOCKSS Archive content. The nodes use LOCKSS technology to automatically and continually compare, or audit, their content against that held by other nodes and repair any differences.
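The compare-and-repair idea described above can be illustrated with a minimal Python sketch. This is not the LOCKSS protocol itself, which this report does not describe in detail; it assumes a simple majority vote over SHA-256 digests, and the function names and node labels are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest of one node's copy."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(copies: dict) -> list:
    """Compare copies held by several nodes and repair the outliers.

    `copies` maps a (hypothetical) node name to the bytes that node holds.
    Copies whose digest disagrees with the strict majority are overwritten
    with a majority copy. Returns the names of the repaired nodes.
    """
    digests = {node: digest(data) for node, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        # Assumption: without a strict majority no copy is trusted.
        raise ValueError("no majority; cannot decide which copy is good")
    good = next(data for node, data in copies.items() if digests[node] == winner)
    repaired = []
    for node in list(copies):
        if digests[node] != winner:
            copies[node] = good
            repaired.append(node)
    return repaired
```

Given three copies of which one has been damaged, the function would replace the damaged copy with the version held by the other two and report which node was repaired.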
In the event that access to the content through the publisher is disrupted for an extended period of time, CLOCKSS is authorized through its contracts with publishers to copy and transfer the content from the CLOCKSS Archive to selected host organizations. The host organizations then make the content available to the general public without charge under a Creative Commons license (or equivalent license). The University of Edinburgh, Stanford, Humboldt University, and the University of Alberta have agreed to serve as hosts and re-publish triggered content.
Archiving and other preservation activities are governed by a written contract between CLOCKSS and each publisher. The contract grants CLOCKSS archiving and certain other specified re-publishing rights. It binds the publisher to provide to CLOCKSS specified content, accompanying metadata, and a specified level of monetary support on an annual basis. The use of a Creative Commons license for triggered content permits anyone to re-publish CLOCKSS’ triggered titles. Some of CLOCKSS’ triggered content is in the Internet Archive (Annals of Clinical Psychiatry). [5]
In its audit, CRL determined that the CLOCKSS system operates as represented and appears to be generally well-designed and adequate to the preservation of the e-journal content currently archived. Moreover, the system is rigorously maintained.
The governance of the effort is structured to ensure accountability to CLOCKSS’s two major stakeholder communities: e-journal publishers and academic libraries. One of the strengths of CLOCKSS, in fact, is the deep engagement of the research library community in its planning and governance. This engagement is likely to ensure CLOCKSS' continued responsiveness to the needs of that community. The CLOCKSS funding model, moreover, is designed to enable the program to respond to changes in the amount, nature and value of the content archived.
In addition, two notable aspects of CLOCKSS operations became apparent in the audit that should be understood by current and prospective stakeholders. While not problematic enough to prevent certification, these matters could possibly have a bearing on future CLOCKSS services. The two notable aspects are described below with reference to the corresponding criteria in the TRAC checklist.
Notable Aspects of CLOCKSS
1. Repository has short- and long-term business-planning processes in place to sustain the repository over time. (TRAC A4.1)
The CLOCKSS funding model is designed to enable the enterprise to respond to changes in the nature, value, and amount of content archived. Each year, publishers pay a “means-based” annual fee, which is scaled to their total publishing revenue, plus a per-article fee based on the amount of content archived that year. This price structure enables CLOCKSS to absorb the growing costs of content management to a certain extent. However, as the amount and complexity of the content being managed inevitably increase, so will the cost of ingesting and managing that content. Those costs could require CLOCKSS to seek greater revenue from libraries and/or publishers.
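The two-part fee structure can be made concrete with a small Python sketch. The report gives no rates or formula, so everything numeric here is an assumption: the revenue-scaled component is modeled as a simple linear fraction of revenue (the actual scaling could be tiered), and `FeeSchedule`, `revenue_rate`, and `per_article_fee` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeeSchedule:
    revenue_rate: float     # hypothetical fraction of total publishing revenue
    per_article_fee: float  # hypothetical charge per article archived that year


def annual_fee(schedule: FeeSchedule,
               publishing_revenue: float,
               articles_archived: int) -> float:
    """Means-based component plus per-article component for one year."""
    means_based = schedule.revenue_rate * publishing_revenue
    volume_based = schedule.per_article_fee * articles_archived
    return means_based + volume_based
```

Under this model the volume-based component grows with the content ingested each year, which is the sense in which the structure lets revenue track part of the growth in content management costs.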
2. Repository has a documented process for testing understandability of the information content and bringing the information content up to the agreed level of understandability. (TRAC B2.10)
CLOCKSS warrants that it will ensure that the journal articles in its archive, once ingested, will continue to be “understandable” at the level of understandability that they possessed at the time of ingest. That warranty is based on four assumptions:
a) E-journal publishers create understandable, renderable content deliverable through web browsers; and, should problems with that content occur, readers will detect and report them, and publishers will correct them.
b) Web browsers will continue to be the primary rendering tool for e-journal content and will continue to render old web content as well as new web content over time. Formats that are not intended to be rendered by web browsers (such as Microsoft Office formats) are widely supported.
c) The rendering of those files in the archive in discipline-specific formats that are not intended to be displayed in a web browser is considered by CLOCKSS “a problem for the specific field” and not something for which an archive can provide a generic solution.
d) Emulation, rather than format migration, is expected to become increasingly easy, robust, and affordable and may be the preferred way to deliver content in an obsolete format if obsolescence ever occurs.
Assumption "a", that successful exposure of the actual journal content on the web is a guarantee of the renderability of that content, does not apply, however, to content ingested by CLOCKSS through file transfer, rather than direct web harvest. Yet in the view of the auditors, this strategy is technically reasonable and justifiable. CLOCKSS staff actively monitors work in the fields of digital preservation, format migration, and emulation to support this strategy. As evidence of that, CLOCKSS made minor changes to its policy of dealing with potential file format obsolescence during the 2014 audit.
The strategy is also prudent in terms of resource expenditure for a dark archive. Tracking formats over time and migrating them can be costly in terms of programming and development resources, computing time, data management, and disk storage. It is, therefore, reasonable to assume that dealing with what is likely to be a relatively small number of obsolete formats only once, at the time of a trigger event or at the time of delivery from a re-publishing site, with the technologies available at that time, may be a wiser use of resources than constantly and repeatedly monitoring and migrating un-triggered content in a dark archive. Also, since 2014 CLOCKSS has successfully provided the content of 53 triggered journals for access. The current state of CLOCKSS technology suggests that these strategies will work now and may improve in the future.
Other Findings
One additional area of concern is a condition of the right granted by publishers to CLOCKSS to re-publish triggered content. That condition is the lag time between a “trigger event” and the point at which CLOCKSS may republish the triggered content without the publisher’s consent. The lag time of up to six months specified in CLOCKSS' agreements with the publishers, although it is the norm with other repositories such as Portico, is not likely to be acceptable in fields such as medicine, where a hiatus of such duration would have a greater impact on users than a comparable disruption in access to a journal in the humanities or social sciences. However, the lead time the CLOCKSS archive currently requires for the technical process of triggering content is only two to four weeks, and CLOCKSS has demonstrated its ability to republish triggered content, with the agreement of the publisher, within that period. Over time, and as far as is reasonable, the archive should endeavor to tailor agreements with publishers to better accommodate use cases in all fields.
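The two release conditions summarized in note [5] (the rights owner's consent, or at least six consecutive months of unavailability) can be expressed as a short Python check. This is an illustrative sketch only: the function and parameter names are hypothetical, months are counted as whole calendar months, and the Board's good-faith determination required by the contract is not modeled.

```python
from datetime import date
from typing import Optional


def months_elapsed(start: date, end: date) -> int:
    """Whole calendar months between start and end (assumed convention)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


def may_republish(owner_consents: bool,
                  unavailable_since: Optional[date],
                  today: date,
                  lag_months: int = 6) -> bool:
    """True if either release condition is met."""
    if owner_consents:
        return True  # condition (i): consent removes the waiting period
    if unavailable_since is None:
        return False  # content is still available from a publisher
    # condition (ii): continuously unavailable for at least lag_months
    return months_elapsed(unavailable_since, today) >= lag_months
```

The `lag_months` parameter is exposed because this is the term the audit suggests tailoring by field: a shorter value for a medical title would shorten the wait before republication without consent.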
Re-publishing triggered content is not a core function of the CLOCKSS archive. Two institutions have agreed to serve as "host organizations" for such content: Stanford University Libraries and the University of Edinburgh's EDINA. The host organizations agree to "re-publish" the released content on the open web under a Creative Commons license that allows it to be re-hosted freely. It is then expected that the content will henceforth be maintained and made available by one or more additional organizations that have an interest in sustaining the material.
As of September 2018, CLOCKSS has successfully provided free and open access to the content of 53 triggered journals without incident. However, there are costs involved in the successful release and re-publishing of significant amounts of triggered content. Today the re-publishing host organizations and the CLOCKSS community support these expenses. As triggered content continues to grow, the costs will continue to increase. Those costs could increase significantly and suddenly, particularly if a large publisher fails or releases an enormous number of articles from many popular journals. For that reason, it would be prudent for CLOCKSS' management to develop detailed scenarios for future services.
It should also be noted here that CRL was not able to independently and comprehensively verify and monitor the presence and integrity of content in the CLOCKSS repository at a meaningful level of granularity. Verification and monitoring are a challenge inherent in "dark" archives because the content is not accessible. However, practices for auditing this dark content are emerging. CLOCKSS submits title- and volume-level metadata to the Keeper’s Registry and KBART, and provides access to issue-level metadata for CRL’s PAPR database. In addition, the illumination of its triggered content shows that CLOCKSS is successfully storing its content.
Rating
CRL assessed CLOCKSS on each of the three categories of criteria specified in TRAC and has assigned a level of certification for each. The numeric rating (below) is based on a scale of 1 through 5, with 5 being the highest level and 1 being the minimum certifiable level. (The minimal certification rating of 1 is assigned in instances where a repository has inconsistencies or deficiencies in areas that might lead to minor defects of a systemic or pervasive nature, but where no major flaws are evident.)
TRAC Category | CLOCKSS rating | Optimum rating
Organizational Infrastructure | 5 | 5
Digital Object Management | 4 | 5
Technologies, Technical Infrastructure, Security | 5 | 5
TOTAL | 14 | 15
[5] The circumstances under which content can be “re-published” by CLOCKSS are specified in the standard contract between CLOCKSS and publishers, as when either: (i) the owner of all rights to the Archived Content (including the copyrights) gives unconditional consent to the release of such Archived Content to the general public, or (ii) the Archived Content is determined in good faith by the Board to be unavailable from any publisher for at least six consecutive months.