Signed, Sealed, Delivered…or Making, Stewarding, and Presenting Web Archives of Digital Publications: It Takes a Village

Thanks to collaboration between SUP, Webrecorder, and Stanford Digital Repository, SUP’s digital publications can be safely stored and simply delivered.
As previously announced, Stanford University Press has now established a template for the preservation packages of the projects published under its Mellon-funded digital initiative. One common feature of each publication’s preservation package is its web archive. The web archive files are stored in the Stanford Digital Repository and served through an embedded web archive player on a Press-hosted HTML page at archive.supdigital.org/[publication-title]. The collaboration required to achieve this integration between static archive site, web archive, and digital repository has been ongoing, and in this post we take a deeper dive into the various moving parts that came together to make it happen. This post features perspectives from each of the three collaborating groups: Stanford University Press, Webrecorder, and the Stanford Digital Repository.
Stanford University Press
by Jasmine Mulliken
When the Press first envisioned its now 5-year-old digital publishing program, it was assumed that anything we published on the web could be easily web archived. After all, in 2016 the Wayback Machine was already providing access to web pages long since dead and defunct, and Stanford Libraries, home of LOCKSS, had its own web archiving system known as the Stanford Web Archiving Portal (SWAP), described as “a searchable collection of websites archived by Stanford University.” With access to the specialized expertise available to us, we believed the archiving of these complex publications was solved. Our proposal to Mellon summed up the archiving question fairly succinctly in just a couple of paragraphs. On web archiving, we were especially succinct (and, we now realize, quite naive): “Stanford University Libraries will perform Web and data archiving of ISWs by the Stanford University Press.” No specific processes for this web archiving were outlined, though we did reiterate that “[p]eriodic web archiving of each site and data archiving of the underlying data … for each publication are essential elements of this archiving effort.” Over the five years since the proposal was funded, we have undertaken both data archiving and web archiving and seen just how complicated these processes can be. But more importantly, we’ve seen just how disconnected they can be.
Efforts to web archive our first publication, Nicholas Bauch’s Enchanting the Desert (2016), proved that SWAP, the system we assumed to be the obvious solution for the harvest, storage, and delivery of a web archive version of the project, just wasn’t delivering a faithful representation of the original project.
A side-by-side comparison of screenshots told the story: what rendered as designed in the live publication was showing up visibly broken in the web archive version.
After some investigation and conversations with Nicholas Taylor, then the web archiving guru at Stanford, and Ilya Kreymer, the developer of the at-the-time fairly new Webrecorder, I learned that the JavaScript in the project was presenting challenges to the OpenWayback-based SWAP system. I began testing Webrecorder, a pywb-based system, and got much better results. The archive looked and acted just like the original. But even an archive created with a Webrecorder crawl was not displaying correctly in SWAP. Even though we could store the web archive file in SDR, the integrated replay system could not display the content of that file.
We realized we’d need to rely on different systems for our code/data archives and our high-fidelity web archives. Though we continued to keep web archive files in SDR, we were not encouraging readers to open them in the connected SWAP interface. Instead, the files needed to be downloaded and then played back in the Webrecorder desktop player, which users had to install separately on their own.
As outlined elsewhere in this blog, we’ve identified the storage of publication data/code in the Stanford Digital Repository, and web archiving with the Webrecorder toolset, as two of our three preservation pathways. Until now, because of the limitations of SWAP, which was supposed to be a bridge between them, these two pathways have been parallel but not connected. But with some recent teamwork and minor edits, we can now create a web archive with the Webrecorder toolset, store the files in the Stanford Digital Repository, and then source it into an embedded web archive player on a simple static web page we can easily host and maintain at the Press. No download needed. Readers can now simply click a link on the publication’s archive page and see the web archive right in their browser. And even though everything going into this may sound complicated, the outcome is very simple and clean: a static archive page that plays the publication directly in the browser.
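For the curious, the “embedded web archive player” amounts to a small amount of markup on the static page. A minimal sketch, following the replayweb.page embedding pattern (the URLs here are placeholders, not SUP’s actual addresses):

```html
<!-- Minimal static page embedding a web archive with replayweb.page.
     URLs below are illustrative placeholders. -->
<!DOCTYPE html>
<html>
<head>
  <!-- Loads the script that defines the <replay-web-page> component -->
  <script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
</head>
<body>
  <!-- source: where the WACZ file lives; url: the page in the archive to show first -->
  <replay-web-page
    source="https://example.org/archives/publication.wacz"
    url="https://publication.example.org/">
  </replay-web-page>
</body>
</html>
```

Per the replayweb.page embedding documentation, the component also expects its service worker file (sw.js) to be hosted on the embedding site, which is part of what makes the page self-sufficient to maintain.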
From SUP’s perspective, it’s taken years of conversation and collaboration with both the SDR team and Webrecorder to identify the points of incompatibility and work around them to deliver to readers what appears to be a clean and simple archived version of these very complex publications. The following sections provide more insight into the technical work involved and how that work can now serve not just the next round of SUP publications but also a wider array of library collections.
Webrecorder
by Ilya Kreymer
The traditional approach to web archiving is automated: a web crawler visits selected sites over and over, expanding a growing collection. This use case makes sense for sites that change over time. Since archivists do not manage the sites they crawl, and so may not know when those sites change, the only option is to recrawl them on a regular basis.
However, there is a different set of web content which represents discrete published works, such as the SUP digital publications or many digital humanities projects.
The digital preservation goals for this type of content are typically to preserve the final, published site as accurately as possible and to deposit the final web archive into a digital repository.
Webrecorder tools have been designed to address and support this important use case.
The archiving process for complex digital publications can itself be involved, combining high-fidelity manual capture with selective one-time crawling of a publication.
Once the archive is created, it is stored in a single digital object, a WACZ (Web Archive Collection Zipped) file, which can be deposited into the Stanford Digital Repository just like any other digital object.
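Part of what makes this deposit workflow simple is that a WACZ file is just a standard ZIP with a documented internal layout. A small sketch, assembling a minimal WACZ-shaped archive in memory to show that layout (the file names follow the WACZ specification; the contents here are empty placeholders, not a real capture):

```python
# Sketch: a WACZ file is an ordinary ZIP following a documented layout.
# We build a minimal WACZ-shaped object in memory and list its members.
import io
import json
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as wacz:
    # Metadata manifest for the whole package (Frictionless Data style)
    wacz.writestr("datapackage.json", json.dumps({"profile": "data-package"}))
    # The captured HTTP traffic itself, in WARC format (placeholder bytes)
    wacz.writestr("archive/data.warc.gz", b"")
    # Lookup index pointing into the WARC records (placeholder)
    wacz.writestr("indexes/index.cdx.gz", b"")
    # List of entry pages the viewer can offer as starting points
    wacz.writestr("pages/pages.jsonl", "")

with zipfile.ZipFile(buf) as wacz:
    names = wacz.namelist()

print(names)
# ['datapackage.json', 'archive/data.warc.gz', 'indexes/index.cdx.gz', 'pages/pages.jsonl']
```

Because the container is a plain ZIP, a repository can treat it like any other opaque file, while a viewer that understands the layout can index into it directly.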
Traditional web archives require custom software (a replay system or ‘wayback machine’) to be run on web servers, creating an extra maintenance burden for the institutions that host them. One of the Webrecorder project’s key goals is to make web archives more accessible, and a key step toward this was reducing that burden by providing an alternative that loads web archives directly in the browser.
Today’s modern browsers can already directly load images, videos, PDFs and even 3D models through custom viewers, so why not web archives? Webrecorder’s replayweb.page system was built to provide exactly such a viewer for loading and rendering web archives directly in the browser. With this viewer, a web archive can be loaded from any server on the web, including existing repositories such as SDR, without requiring any additional server-side software. Only a small change was needed in SDR, a cross-origin configuration update to allow its content to be loaded from other domains such as replayweb.page, for web archives stored in SDR to become directly accessible via replayweb.page or any other site that embeds the viewer.
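That kind of “small change” is typically a cross-origin resource sharing (CORS) header on the server that delivers the files. A hypothetical nginx sketch, to give a sense of the scale of the change (this is not SDR’s actual configuration; paths and scope are placeholders):

```nginx
# Hypothetical: let a browser-based viewer hosted on another domain
# (e.g. replayweb.page) fetch stored WACZ files from this server.
location /files/ {
    add_header Access-Control-Allow-Origin "*";
    # The viewer uses HTTP Range requests to read only the parts of
    # the WACZ it needs, so allow the Range header cross-origin too.
    add_header Access-Control-Allow-Headers "Range";
}
```

The point is that nothing about the repository’s storage model changes; only the delivery layer learns to answer requests from another origin.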
By design, no additional work is needed in SDR, and more importantly, there is no additional maintenance cost for Stanford to maintain web archives in SDR compared to other types of objects. Since the web archive is just a type of file, the workflow for depositing discrete publications can be made similar to depositing other objects (videos, images, PDFs) into digital repositories, and web archives can become ‘first-class’ citizens alongside these other more established formats.
My hope is that Webrecorder tools such as replayweb.page will continue to lower the barriers for creating and maintaining web archives, and I hope that this collaboration with SUP and SDR can help encourage other institutions to explore similar approaches.
Stanford Digital Repository
by Andrew Berger
Web archives present a particular challenge for a repository like the SDR. As a general purpose repository, the SDR stores and provides access to a wide range of digital materials and content types. For many of the most common types of content found in digital collections, such as images, audiovisual materials, and documents, the SDR uses an embedded viewer to display content within a web page. These types of materials tend to be self-contained, with fairly clear boundaries: an image, an album, a traditional paged book, a set of audio or video files.
Web archiving at Stanford has pushed these boundaries in multiple ways: the files gathered from a single web archive crawl may contain a variety of file types from multiple different websites, and a single website may be captured multiple times over a period of months or years. The Stanford Web Archive Portal is geared more towards this type of archiving strategy – multiple, repeated, possibly wide-ranging crawls – than it is toward single captures of specific sites. In terms of the public interface, the approach taken by the SDR has been to link out to SWAP to provide access, while storing the underlying crawl files in the preservation system. An example of this is the record for the Stanford University Health Alerts website, spanning from 2016 to 2021.
In a way, SUP’s use of Webrecorder is both a departure and a return for SDR. It’s a departure in that making a single, complete web archive of a site is a different approach from running a series of automated crawls. But it’s also a return in that it shows how SDR can be used to present an archived website in an embedded viewer on a static webpage rather than being limited to the SWAP interface. What has been most remarkable to me as the SDR Repository Manager, having joined Stanford long after SUP’s work on this archiving process had begun, is how little the SDR needed to be modified in the end to support this use. Only some changes to the configuration of the server that delivers the files to the public were needed to enable the Webrecorder player. The ingest and storage process for these web archive files uses the same repository infrastructure that supports other deposited materials.
Conclusion
For web-based digital humanities projects, web archiving can extend the lifespan of a finished product or capture states of an ongoing project. While we’re using the process at SUP for the former, individual authors can incorporate the Webrecorder tools however they see fit. If they have access to their own institution’s repository, they may also be able to store their own web archives and provide access to them via an easy-to-maintain static page that embeds the replayweb.page player. Although the various parts (the creation of the archive, its accessioning, and its delivery) may each take a range of expertise, the key is to collaborate. Our experience brought together the expertise of the publisher, the tool developer, and the repository management team. Once everyone is involved in the conversation, it’s exciting how simply the pieces come together!