A few weeks ago I posted an interview with Webrecorder's Dragan Espenschied, in which he detailed the features and uses of the web-archiving tool developed by Rhizome. A fellow Mellon-funded project, Webrecorder has been especially intriguing to us because it is perhaps the most sharply focused solution for giving readers of our web-based interactive scholarly works a chance to experience a project in its original form once the live version begins to suffer the usual fate of web-based content. While we're exploring other approaches to this challenge, from emulation to documentation to repository storage and delivery, none centers so squarely on preserving the experience of interacting with web content. This type of content always seems a browser update away from breaking, although we do discuss how to safeguard against this fate in our Archivability guidelines. But if the commonplace is true that any given website has a life expectancy of two to five years, we need to begin planning for each project's afterlife as soon as it goes live. It's the only way to ensure we have enough time to produce the best, most functional, highest-fidelity archived version of it that we can.
Web archiving isn’t a new concept. The Internet Archive, perhaps the biggest collector and provider of what would otherwise be lost web content, has been doing it since 1996. Its Wayback Machine can call up just about anything its crawlers have captured online over the past 21 years. Examples range from a personal web page from the nineties, to a news site as it appeared on Sept. 11, 2001, to the official White House site from any day of the Obama presidency. With an archive of 279 billion individual web pages, it has become an invaluable collection of cultural heritage at a time when so much information is born, and lives, online. But the Internet Archive isn’t the only option when it comes to preserving online material.
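For readers who want to check programmatically whether the Wayback Machine holds a capture of a given page, the Internet Archive exposes a public "availability" endpoint. Below is a minimal, standard-library-only sketch of querying it; the endpoint and its `url`/`timestamp` parameters are documented by the Internet Archive, but the helper names here are our own.

```python
# Sketch: asking the Wayback Machine's availability API whether it
# holds a snapshot of a URL near a given date (timestamp is YYYYMMDD).
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build the availability-API request URL for a target page."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)

def closest_snapshot(url, timestamp=None):
    """Return the API's 'closest' snapshot record, or None if uncaptured."""
    with urllib.request.urlopen(availability_query(url, timestamp)) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

# e.g. closest_snapshot("stanford.edu", "20170101") returns a dict with
# the snapshot's archive URL and capture timestamp, if one exists.
```

The response is a small JSON object; the `closest` record, when present, includes the archived URL you can open directly in a browser.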
Even though Webrecorder is younger and less well known than the Wayback Machine, we’ve found it has some features that are especially well suited to our work. So as a follow-up to our earlier interview, I thought it might be worth going into a little more depth on some of the specific ways we’re experimenting with Webrecorder here at supDigital.
Here’s a visual example using our first supDigital publication, Enchanting the Desert. Above is a screenshot of the live web project.
And this is a side-by-side comparison of the web-archived versions in Webrecorder and the Wayback Machine. The archive looks quite a bit different in Webrecorder than it does in the Wayback Machine. The Webrecorder version, which you can see live here, takes a little while to load, but about 15-20 seconds after the load-wheel completes, the content comes up in its intended arrangement on the screen. Meanwhile, the Wayback Machine version, linked here, shows overlapping text and blank or erroneously loaded containers where content should be. (Note that this version sometimes works and other times appears as above; we are still trying to determine why the same WARC file produces inconsistent results.) The Wayback Machine version does actually contain the correct content links that are normally accessible on the landing page, but they are stripped of the scripting that defines their intended layout and display. Scrolling down the page reveals a long list of content links, none of which brings the content into the console.
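Part of what makes this replay discrepancy possible is that both tools read from the same underlying WARC file: a WARC is just a sequence of records, each with a header block and a captured payload, and it is the replay software, not the file, that decides how those payloads are reassembled in the browser. As a rough illustration of that record layout, here is a minimal, standard-library-only sketch of walking an uncompressed WARC stream (the sample record and target URI are invented for illustration; real tooling, such as the warcio library from the Webrecorder project, also handles gzip compression, revisit records, and much more).

```python
# Sketch: reading record headers from an uncompressed WARC stream.
# A WARC record is a version line, header fields, a blank line, and
# then Content-Length bytes of captured payload.
import io

SAMPLE_WARC = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.org/\r\n"      # invented sample URI
    "WARC-Date: 2017-06-01T12:00:00Z\r\n"
    "Content-Length: 20\r\n"
    "\r\n"
    "<html>hello</html>\r\n"
    "\r\n\r\n"
)

def read_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank separator lines between records
        assert line.strip().startswith("WARC/"), "expected a WARC version line"
        headers = {}
        while True:
            line = stream.readline().rstrip("\r\n")
            if not line:
                break  # a blank line ends the header block
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

for headers, payload in read_warc_records(io.StringIO(SAMPLE_WARC)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

The point of the sketch is simply that the archive file itself is inert and replay-agnostic, which is why the same WARC can look perfect in one replay system and broken in another.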
As publishers of the content we’re capturing, we’re in a unique position in terms of our rights and access to each of our web-based projects. We have direct access to every part of the projects, and we’re deeply familiar with all of their contents and features. While many archiving agencies work with material that has either been donated to them or that they’ve chosen to seek out and collect (as is often the case with open web content), we’ve known these projects from their infancy. That familiarity means we know where a work’s pieces live and how to access them. We are also invested in these projects’ longevity, with what sometimes feels like a parental interest in how they are perceived by the rest of the world. Web-archiving technology lets us capture that quintessential moment in a project’s life when it was officially released to the world, when it was at its prime. But unlike a photo behind glass or a film played in a museum on a loop, a useful web archive makes the experience of self-directed navigation and interaction with the work virtually indistinguishable from the original.
Much of the recent push behind improvements in web-archiving technology has come from an increased sense of responsibility to preserve our history. Efforts focused on capturing news and social media content are a huge driving force behind the need to step up the functionality of tools like Webrecorder and the Wayback Machine. Datathons around the world, like those organized by Archives Unleashed (also Mellon-funded), have increased in frequency and attendance, as has the urgency of making sure scientific data and facts aren’t deleted from hard drives or public memory. Digital scholarship is just as important to the historical record as Trump’s Twitter feed or a cnn.com news story, and as scholarly publishers we have a responsibility to help shape the capabilities of the technology we need to preserve that record. So we’ll continue to test and experiment with tools that offer a way not just to preserve the content we’re publishing but also to make it accessible online, in its native habitat, for as long as that habitat exists and to anyone interested in accessing it.
Jasmine Mulliken is Production and Preservation Manager, Digital Projects, at Stanford University Press. She coordinates the production and preservation workflow of born-digital projects, including recommending platforms and coding standards to authors, consulting with authors on projects’ technical attributes, and evaluating best practices for archiving and preservation.