Archival Success!
Publications from SUP’s digital initiative now have nearly complete web-archive versions thanks to a 2020 partnership with Webrecorder.
With the renewal in 2019 of the Mellon grant supporting the continued publishing of interactive web-based digital monographs at SUP came a more defined focus on the archiving and preservation of those works. In particular, we wanted to expand on the web archiving work we began in our initial grant period, a process that turned out to be more complicated than we imagined when we first (naively) identified it as an obvious and simple solution to the challenges surrounding the persistence of web content. The purpose of web archiving our publications is to provide persistent versions that can replace the live publications once the inevitable advancements in server and browser technologies render them outdated.
As reported in previous blog posts (here, here, and more recently here), mainstream large-scale web archiving tools and systems, including those at Stanford Libraries, proved to be inadequate for capturing the range of interactive features at the heart of the works we were publishing. Specifically, tools like Archive-It, the Wayback Machine, and systems built on those frameworks, while great at capturing a lot of frequently updated top-level web content, were not as good at delving into the depths of a single, fixed, more complex work. So we were excited to learn in 2017 about Webrecorder, a web archiving toolset built on an alternative framework more amenable to the capture and replay of the kind of content we were publishing. Over the next two years, we developed an informal partnership with the Webrecorder team as we experimented with deep manual crawling of our publications. And when it became clear that Webrecorder was the right tool for our content, we formalized a partnership with their development team in our next grant proposal, naming them as a subawardee for the continued development of an improved toolset, including customized features, based on the one that had proven so effective at archiving our unique and varied digital publications.
At the beginning of 2020, work began on a year-long partnership to create web archives of our existing publications. These archives would serve as real-life use cases for the broader development of a web archiving toolset that could also serve future publications and benefit a wider user group. Starting with a set of objectives laid out in the 2019 grant proposal in collaboration with Webrecorder’s then associate director of partnerships, Anna Perricci, Ilya Kreymer and I mapped out a timeline for the year. The first couple of months would be devoted to scoping SUP’s six digital publications and identifying the components of their architecture; the next few months would be devoted to technical development; and the rest of the year would focus on documentation and dissemination of our work among the web archiving and publishing communities. What follows here is a broad description of the work completed so far, and you can read Ilya Kreymer’s more detailed account of that work and its implications for the web archiving community on the Webrecorder blog.
The January-March scoping phase involved identifying the challenges each publication presented and developing strategies for dealing with them. We compiled content files, server backups, metadata, and technical inventories and specs into a shared workspace so Ilya could get to know the publications in depth and begin mapping strategies for archiving. Once we had reviewed the projects and compiled summaries clarifying each publication’s unique features and challenges, the next few months were devoted to technical development of the tools needed to archive them. By the end of June, we began reviewing a nearly complete set of web-archived versions of our digital catalog to date. In just a few months, Ilya had built the tools to capture and replay the six projects we’ve published over the course of our digital publishing initiative. Though all of the projects are still alive and well in their current server-supported publication formats, those formats are already beginning to degrade, as all web projects eventually do. The archived versions developed during this partnership extend the life of those projects beyond their current iterations, and they do so in a format that is nearly indistinguishable from the original to most human readers.
In the new system, all web archives are packaged into a single file that is loaded directly by the browser and can be hosted on any static web server. This is possible due to two new developments: the ReplayWeb.page system developed by Webrecorder to load web archives directly in the browser, and a new web archive collection format, called WACZ, which can package all web archive components (WARCs, indexes, full-text search) in such a way that they can be loaded quickly on demand, even for large archives. See again the Webrecorder blog post for more technical details on this new format and system.
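For readers curious about what one of these packages looks like under the hood, here is a minimal sketch of how you might peek inside a WACZ file with Python. The file name is a hypothetical placeholder, and the layout noted in the comments follows the draft WACZ specification (WARC data, indexes, a page list, and a datapackage.json manifest); exact contents may vary by version.

```python
import json
import zipfile

# Hypothetical local archive file, used only for illustration.
WACZ_PATH = "publication.wacz"

# A WACZ package is a ZIP file that bundles everything a replay system needs:
# raw WARC data, lookup indexes, a list of captured pages, and a manifest.
with zipfile.ZipFile(WACZ_PATH) as wacz:
    # List every component in the package (e.g. archive/*.warc.gz,
    # indexes/*.cdx.gz, pages/pages.jsonl, datapackage.json).
    for name in wacz.namelist():
        print(name)

    # The datapackage.json manifest describes the packaged resources.
    with wacz.open("datapackage.json") as manifest:
        meta = json.load(manifest)
        print("WACZ version:", meta.get("wacz_version"))
        print("Resources:", len(meta.get("resources", [])))
```

Because everything lives in one file, an archive like this can sit on a static host like any other asset and be fetched piecemeal as the reader navigates, which is what makes on-demand loading of even large archives possible.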
Since three of our live publications use the Scalar platform, and Ilya had previously built a prototype that would capture and serve that system, he focused first on updating that prototype to support the newer Scalar version that two of the projects were using and on refining and improving the general scope of content capturability. Now, SUP’s three Scalar-based publications—When Melodies Gather, Black Quotidian, and Constructing the Sacred—have been successfully web archived with this system, and further improvements, such as full-text search, have been added just within the last week! There will still be improvements to some of the custom components of each project, especially where third-party content is a factor, but the framework is there to deliver nearly all of the content and features of a Scalar publication.
The next two works to be archived each employ their own custom platforms as live sites, but a common feature—both are hosted with Reclaim—served as an entry point to a roughly scalable workflow, at least where content was concerned. Reclaim’s cPanel backups served as the transferable content package to be ingested by the custom tools Ilya was developing to process and replay that content. By the end of June, both Filming Revolution and Chinese Deathscape had high-fidelity web archives.
While we anticipated that Chinese Deathscape, a Ruby on Rails project leveraging potentially millions of remote map tiles, would pose a particularly difficult challenge, much of the relevant content has been captured. The full text appears in its original presentation layout with connections to map plot points, but the map tiles themselves are understandably hit-or-miss depending on the zoom levels and overlay preferences of the user. This issue is a larger one for authors and publishers of this kind of content, as reliance on external data sources to generate maps is common due to a lack of resources for packaging that kind of material for fully local web hosting. It’s an example and an issue that could (and may!) warrant a blog post, if not a white paper, of its own.
Filming Revolution’s unique challenges included its more than 400 videos, which leverage the Vimeo API for serving embedded video in websites. While I had previously attempted manual crawling with Webrecorder to capture the project along with its hundreds of video clips, Ilya’s new system automated this process with better results than the manual capture could produce. The archived version of this publication is now indistinguishable from the original to a human reader, and we will be able to seamlessly roll over to the full-fidelity archive when browser and script updates render the live site inoperable.
The final publication, Enchanting the Desert, which I had previously compiled manually with Webrecorder, has also been converted to the WACZ format and rolled into our new archive catalog.
The tools developed during this work cycle are meant to be reusable not only by SUP for future publications, but also by a wider publishing and web archiving community. In true open-access, open-source spirit, the toolset developed out of this Mellon-funded collaboration is available, with documentation, via a public repository.
Next steps will involve deploying the replay system on the supDigital domain and adding branding to integrate with the look and feel of our program styles. We’ll deposit the WACZ archive packages into the Stanford Digital Repository for preservation, and they will serve as the source for the static HTML pages readers will access to view the projects for years to come. We’ll also explore the longevity question, evaluating the requirements of maintaining the archives, and create new archives of publications still in the pipeline. Webrecorder will further document and release the tools used for creating these archives and converting them to the new WACZ format. Finally, we’ll continue to report here on our discoveries and conclusions. And hopefully, in the temporary absence of on-site conferences, there will be an opportunity in the next few months to present virtual walk-throughs of our work and share our experiences with our colleagues in both the web archiving and digital publishing worlds.
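To make the static-hosting point concrete, here is a rough sketch of what one of those reader-facing pages could look like, generated with a small Python script. This is not our actual deployment: the file names, starting URL, and CDN path are placeholders, and it assumes ReplayWeb.page’s embeddable replay-web-page component, which also expects the ReplayWeb.page service worker (sw.js) to be hosted alongside the page per its embedding documentation.

```python
from pathlib import Path

# Placeholders for illustration only.
WACZ_FILE = "publication.wacz"      # archive package on the static host
START_URL = "https://example.org/"  # page the embed should open first
REPLAY_UI = "https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"  # ReplayWeb.page UI script

# ReplayWeb.page provides a <replay-web-page> web component that loads a WACZ
# directly in the browser, so the generated page can live on any static server.
# (The component also expects the ReplayWeb.page service worker, sw.js, at its
# replay path; that file is omitted from this sketch.)
page = f"""<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Web archive</title>
    <script src="{REPLAY_UI}"></script>
  </head>
  <body>
    <replay-web-page source="{WACZ_FILE}" url="{START_URL}"></replay-web-page>
  </body>
</html>
"""

Path("index.html").write_text(page)
print("Wrote index.html")
```

A page along these lines, branded for supDigital and pointed at a WACZ package preserved in the Stanford Digital Repository, is the kind of lightweight, server-independent reading interface we have in mind.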
The catalog of web archives now at our disposal raises renewed questions about the final output format for digital web-based scholarly publications. If any web project, regardless of its platform or framework, can be delivered online in a browser as a web archive, could this be a potential publishing standard? Would this format limit the range of tools authors want to be able to use to present their work digitally? And if so, are those limitations worth the resulting long-term stability of the publication? What new maintenance dependencies would be involved as a trade-off for standardization? Now that we’ve reached this milestone, we can start pushing these inquiries further even while we continue to invite innovative work. And we can do so with more assurance that we have at least extended the life of our existing publications a little beyond the typical lifespan of web content.