More on Web Archiving

A few weeks ago I posted an interview with Webrecorder’s Dragan Espenschied in which he detailed the features and uses of the web-archiving tool developed by Rhizome. A fellow Mellon-funded project, Webrecorder has been especially intriguing to us because it is perhaps the solution most specifically focused on giving readers of our web-based interactive scholarly works a chance to experience a project in its original form once the live version begins to suffer the usual fate of web-based content. While there are other approaches to this challenge that we’re exploring, from emulation to documentation to repository storage and delivery, none is as squarely centered on preserving the experience of interacting with web content. This type of content always seems to be a browser update away from breaking, although we do discuss how to safeguard against this fate in our Archivability guidelines. But if the commonplace is true that any given website has a life expectancy of 2-5 years, we need to begin planning for each project’s afterlife as soon as it goes live. It’s the only way to ensure we have enough time to try to get the best, most functional, highest-fidelity archived version of it that we can.
Web archiving isn’t a new concept. The Internet Archive, perhaps the biggest collector and provider of what would otherwise be lost web content, has been doing it since 1996. Its Wayback Machine can call up just about anything its crawlers have captured online over the past 21 years. Examples range from a personal web page from the nineties, to a news site as it appeared on Sept. 11, 2001, to the official White House site from any day of the Obama presidency. With an archive of 279 billion individual web pages, it’s an increasingly invaluable collection of cultural heritage at a time when so much information is born and lives online. But the Internet Archive isn’t the only option when it comes to preserving online material.
Even though Webrecorder is younger and less well-known in most circles than the Wayback Machine, we’ve found it has some pretty great features for our work in particular. So as a follow-up to our earlier interview, I thought it might be worth going into a little more depth on some of the specific ways we’re experimenting with Webrecorder for our own work here at supDigital.
One of the advantages we’ve found in Webrecorder is its ability to capture some of the more complex JavaScript elements found in most modern websites. Without getting too technical, this functionality comes down to the underlying technology powering the capture and delivery of web content. While the Wayback Machine at archive.org takes the OpenWayback approach, Webrecorder is built on pywb, an open-source Python-based system that can more easily capture and replay the JavaScript features that make some sites difficult to process with OpenWayback.
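For readers who want to see what pywb does on its own, here’s a minimal sketch of replaying a downloaded WARC file locally using pywb’s documented command-line tools, wb-manager and wayback. The collection name and WARC filename are hypothetical placeholders rather than part of our actual workflow, and this is just one way to set up a local replay for testing.

    # A minimal sketch, not our production setup: replay a WARC locally with
    # pywb's command-line tools. The collection name and filename below are
    # hypothetical placeholders.
    import subprocess

    # Create a pywb collection and add the WARC file to it.
    subprocess.run(["wb-manager", "init", "enchanting"], check=True)
    subprocess.run(["wb-manager", "add", "enchanting", "enchanting-the-desert.warc.gz"], check=True)

    # Start pywb's replay server; by default the collection becomes browsable
    # at http://localhost:8080/enchanting/. This call blocks until the server
    # is stopped.
    subprocess.run(["wayback"], check=True)

Running the same WARC through a local pywb instance is a handy way to see the replay behavior that Webrecorder’s hosted service is built on.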

Here’s a visual example using our first supDigital publication, Enchanting the Desert. [Screenshot: the live web project.]

[Screenshot: the Wayback Machine’s web-archived version of the same state of Enchanting the Desert.]
Comparing the web-archived versions side by side, the archive looks quite a bit different in Webrecorder than it does in the Wayback Machine. The Webrecorder version, which you can see live here, takes a little while to load, but give it about 15-20 seconds after the loading wheel completes and the content comes up in its intended arrangement on the screen. Meanwhile, the Wayback Machine version, linked here, shows overlapping text and blank or erroneously loaded containers where content should be. (Note that this version works sometimes, while at other times it appears as in the screenshot above. We’re still working out why the same WARC file produces such inconsistent results.) The Wayback Machine version does contain the correct content links that are normally accessible on the landing page, but they’re stripped of the scripting that defines their intended layout and display. Scrolling down the page reveals a long list of content links, none of which works to bring the content into the console.
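For anyone curious about what is actually inside one of these WARC files, a small script can help. The sketch below uses warcio, the Webrecorder team’s open-source Python library for reading and writing WARCs, to tally the captured responses by media type; the filename is a hypothetical placeholder. A breakdown like this gives a rough sense of how much of an archive is JavaScript, JSON, and other dynamically loaded material rather than plain HTML, which is exactly the kind of content that trips up replay.

    # A minimal sketch using warcio to tally response records in a WARC file
    # by their declared Content-Type. The filename is a hypothetical placeholder.
    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator

    def content_type_breakdown(warc_path):
        """Count the captured HTTP responses in a WARC by media type."""
        counts = Counter()
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                ctype = record.http_headers.get_header("Content-Type") or "unknown"
                counts[ctype.split(";")[0].strip().lower()] += 1
        return counts

    for ctype, count in content_type_breakdown("enchanting-the-desert.warc.gz").most_common():
        print(f"{count:6d}  {ctype}")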
No doubt future updates to the Wayback Machine will address these gaps, and I saw evidence of this kind of work from developers and scholars at the 2017 Joint Conference on Digital Libraries. But for now Webrecorder seems to be the best solution for the kind of web content we’re producing, despite the challenges some of our forthcoming material is presenting in the tool’s current version. From our perspective as scholarly publishers, the complex JavaScript our authors are using in their digital arguments demands an approach that captures not just the content but also the relationships and arrangement of that content as part of the argument.
Another feature of Webrecorder we’re taking advantage of is the control we have over what goes into the final WARC file. Webrecorder’s interface lets us tell the capture process what to record simply by navigating the site ourselves, being sure to click on and activate each function or piece of content we want the archive to be able to “play back.” For us, that’s everything, far more than what might be captured in an automatic crawl by the Wayback Machine. To be fair, the Internet Archive’s Archive-It program offers its subscribers much more control and flexibility over what and when they crawl and capture, but along with the enhanced JavaScript capacity, the (for now) free recording sessions Webrecorder offers make this tool a bit more accessible and better suited to our particular needs.
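Since we’re driving the capture by hand, it’s natural to worry that a page we meant to click never made it into the file. One quick sanity check, sketched below with warcio, is to list the target URIs of the response records in the WARC and compare them against the pages we intended to capture; the filename and URLs here are hypothetical placeholders, not our actual capture list.

    # A minimal sketch: compare the URLs captured in a WARC against a list of
    # pages we intended to record. Filename and URLs are hypothetical placeholders.
    from warcio.archiveiterator import ArchiveIterator

    def captured_urls(warc_path):
        """Yield the WARC-Target-URI of every response record in the file."""
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == "response":
                    yield record.rec_headers.get_header("WARC-Target-URI")

    expected = {
        "https://example.org/project/",           # landing page
        "https://example.org/project/chapter-1",  # a content view we clicked through
    }
    found = set(captured_urls("webrecorder-session.warc.gz"))
    for url in sorted(expected - found):
        print("MISSING:", url)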
As publishers of the content we’re capturing, we’re in a unique position in terms of our rights and access to each of our web-based projects. We have direct access to every part of the projects, and we’re deeply familiar with all of their contents and features. While many archiving agencies work with material that has either been donated to them or that they’ve chosen to seek out and collect (as is often the case with open web content), we’ve known these projects from their infancy. This level of familiarity with a web-based work means we know where its pieces live and how to access them. As their publisher, we’re invested in the projects’ longevity and have what sometimes feels like a parental interest in how they are perceived by the rest of the world. Web-archiving technology lets us capture that quintessential moment in a project’s life when it was officially released to the world, when it was at its prime. But unlike a photo behind glass or a film played in a museum on a loop, a useful web archive makes the experience of self-directed navigation and interaction with the work virtually indistinguishable from the original.
Much of the recent push behind improvements in web-archiving technology has come from an increased sense of responsibility to preserve our history. Efforts focused on capturing news and social media content in particular are a huge driving force behind the need to step up the functionality of tools like Webrecorder and the Wayback Machine. Datathons around the world, like those organized by Archives Unleashed (also Mellon-funded), have increased in frequency and attendance, as has the urgency of making sure scientific data and facts aren’t deleted from hard drives or public memory. Digital scholarship is just as important to the historical record as Trump’s Twitter feed or a cnn.com news story, and as scholarly publishers we have a responsibility to help shape the capabilities of the technology we need to preserve that record. So we’ll continue to test and experiment with tools that offer a way not just to preserve the content we’re publishing but also to make it accessible online, in its native habitat, for as long as that habitat exists and to anyone interested in accessing it.