Born Digital and Web-based News

This diagram from Megan Bernal’s presentation depicts a generalized distribution lifecycle of user-generated content for news organizations.

Having considered the challenges of print and broadcast media, the roundtable participants turned their attention to the issues surrounding born-digital news content. Megan Bernal, Associate Director for Library Information & Discovery Systems at DePaul University, spoke about the production and distribution of electronic news from the publishers’ perspective, drawing on her background as Director of Information Services at the Miami Herald Media Company from 2005 to 2009.

Bernal provided an overview of the report issued in 2011 by CRL, “Preserving News in the Digital Environment: Mapping the Newspaper Industry in Transition.” That report outlined the "lifecycle" of news content published in newspapers and online, providing an overview of news workflow and production systems, and offered a basis for a rational and effective strategy for libraries to preserve news in electronic formats.

Among the findings in the detailed report, Bernal highlighted the complexity of the modern news organization, with many actors and systems contributing disparate streams of information to produce and distribute news content. The actors include the parent companies, which make deals with content providers and web-application providers to receive and distribute content, metadata, or other information at multiple levels. Interactive divisions are often separate units, usually business or advertising driven.

Traditional methods of acquiring and maintaining news are not sufficient to the task of capturing the electronic record. The diversity of sources (such as licensed and third-party content), the dynamic nature of online content, and customized displays for individual users all work against capturing the "best edition" of an online publication. The proliferation of platforms, devices, and distribution networks used to read or access the news further complicates the ability to archive the user experience. Capturing a web-based PDF may preserve the look and feel of the print edition, but these files generally are not high-resolution copies and are stripped of much or all of the metadata generated in the production and layout systems and processes.

News enterprise systems employ a rich set of standards and metadata protocols for some types or levels of content that may be of interest to libraries and memory organizations. However, traditional news organizations are composed of various departments (production, business, editorial, advertising, and circulation) that may produce metadata for their own particular purposes, distinct or siloed from the other parts of the organization.

A news organization runs numerous core enterprise systems that implement separate production workflows and outputs (pagination systems and e-facsimiles; web production systems and web output; third-party systems for additional content, etc.). Some systems may be more integrated than others. From an archival perspective, Bernal stated that the potentially highest-impact “point of entry” for libraries would be within the editorial system, which contains the bulk of articles, metadata, (often) high-resolution photographs, and other related content. She suggested that attempting to make a deal with news producers is a feasible, if not easy, approach to harnessing some of this content.

Bernard Reilly, President, Center for Research Libraries, presented on “Legal Deposit Considerations in the Post-print Era.” In the print era, one or more copies of a work to be copyrighted by an author or publisher were often deposited in the national library. In return, the intellectual property of the work received certain legal protections. Since the 1940s, the Library of Congress and the British Library have accepted microforms as the primary form of legal deposit for newspapers to build their major newspaper collections.

Since 2000, new legislation has been passed in a number of Western countries allowing for, and in some cases even requiring, legal deposit of electronic publications, including websites, in their respective national libraries. The legislation allows the deposit requirement to be fulfilled by one or both of two means: 1) deposit by the publishers; and 2) direct harvesting from the web by the national library.

Reilly reported that despite these new statutory rights obtained by national libraries, actual legal deposit of electronic news is still limited. Many national e-deposit laws confined deposit to offline materials such as content on DVD and CD. Most libraries were not permitted to harvest website content that existed behind a paywall, which excluded subscription-based electronic news from capture. Other national libraries that targeted websites were harvesting periodic “snapshots” of the sites, sometimes as infrequently as once a year.

The British Library’s “Collecting Plans 2013–14” calls for harvesting “some 200 to 500 websites within scope . . . on a more frequent basis such as quarterly, monthly, weekly or even daily, in order to ensure that rapidly changing or updated content is archived adequately.”

The Bibliothèque Nationale de France is conducting e-deposit experiments with a major French regional newspaper, Ouest France, receiving comprehensive deposit of electronic news content directly. But this effort is limited to digital versions “exactly identical to the one distributed in printed form” and therefore does not capture the publisher’s web news output.

The U.S. Library of Congress (LC) does not yet have the statutory right to require electronic deposit, nor the capability to ingest and archive news in electronic form on a regular basis. Some news sites have been captured periodically as parts of occasional thematic web harvests done by LC or the Internet Archive. But these captures tend to be incomplete, are crawled over periods of several days, and often lack critical content such as multimedia and database-driven content.

In instances where national libraries do harvest news content from the web, under current terms of deposit that content can only be made available within the confines of the depository library’s physical facility.

Reilly observed that the scope of coverage and archiving of news content achieved in the print era through legal deposit programs is not likely to be provided in the digital age. This suggests that the preservation of those materials for long-term use will have to be ensured by others.

Mark Phillips, Assistant Dean for Digital Libraries at the University of North Texas (UNT), described various initiatives underway at academic institutions, including UNT, aimed at the preservation of digital news content. The “Chronicles in Preservation” project is a collaboration of the MetaArchive Cooperative, Educopia Institute, Chronopolis, and UNT, along with other academic institutions with collections of digital newspapers. The project aims to study and document the preservation practices of these institutions, and to model a distributed preservation framework to collaboratively preserve digital newspaper collections (both digitized and born-digital newspapers).

Participating institutions maintain a variety of digital news collections, including digitized historical newspapers, e-facsimiles of print newspapers, article “morgues,” and news websites dating back to the 1990s. These collections, predictably, contain an array of current and legacy content types and formats; are stored in a host of different systems; and have employed various metadata formats, OCR formats, and object identifier schemas over time. For many institutions, the systems put into place for more traditional digital objects do not scale to the size of the newspaper collections.

Based on surveys and in-depth interviews with participating institutions, the Chronicles in Preservation project is developing “Guidelines for Digital Preservation Readiness” to recommend specific practices to take advantage of available technologies and infrastructure that can bridge the gap in preservation readiness for institutions. The project team explored various means of “capacity building” and remediation at the partner institutions, which might be expanded to other institutions, publishers and/or content providers in subsequent phases of program activity.

Other deliverables of the Chronicles in Preservation project—still underway—include a comparative analysis of three leading distributed digital preservation approaches in the U.S., particularly in the context of preservation of newspaper holdings. The evaluation will include LOCKSS (implementation at the MetaArchive), iRODS (at Chronopolis), and CODA (at the University of North Texas).

The third primary deliverable will be the development of a set of “interoperability tools” to handle the exchange of content from partner repositories into the distributed preservation frameworks represented in this project. Details of the project are available through the project wiki at: http://metaarchive.org/neh/index.php/Main_Page.

The University of North Texas is engaged in a number of other efforts to collect, digitize, preserve, and provide access to newspaper content. UNT coordinates the Texas Digital Newspaper Program, supported in part by the National Digital Newspaper Program (NDNP). In the past few years, UNT has been attempting to work with publishers to receive and archive current born-digital content for preservation. Numerous smaller newspapers, Phillips has found, are eager to work with memory organizations to “move their legacy forward.” By offering repository services for the current PDF facsimiles, UNT has found that the publishers often become more willing to collaborate on other efforts, such as the digitization of backfiles. UNT is presently working with the Texas Press Association and is offering services to other states’ historical archives.