As is apparent by now, we spend a lot of time thinking about ways to preserve and archive the digital web-based projects we're publishing. While our technical guidelines give authors recommendations for building projects that are more easily sustained, the fact is that technology changes, and even the most rigid technical standards and requirements will be outdated when the next browser update is released, another version of HTML is unleashed, or the newest device for browsing the web drops. What is accepted as safe today will become tomorrow's vulnerability to hacking or decay. If there is one constant in technology, it's that five years is a lifetime. And for many web-based projects, five years is a pretty substantial life indeed.
So while we do what we can, based on what we know in the present, to make our publications viable in the future, we must also anticipate the inevitable moment when software and hardware updates keep a project's original content from moving smoothly from the press's server to the reader's browser, and we need to consider other ways of providing access. And since all of these projects are web-based, one of the most obvious solutions is web archiving.
A web archive is just what it sounds like: an archive of web-based material. But unlike the kind of archive I was used to exploring in grad school (the boxes of handwritten letters and ticket stubs and photographs and playbills that spent most of their lives in the nuclear safety of the Harry Ransom Center basement), a web archive can be perpetually accessible to any user with an internet connection. No scholarly credentials required.
I recently had the good fortune to snag an email interview with Dragan Espenschied, a member of the Rhizome team working on Webrecorder, and I'm thrilled to present that interview here in full so our readers can learn more about this promising resource directly from one of its developers.
JM is Jasmine Mulliken, Digital Production Associate at Stanford University Press (blog author)
JM: How did the Webrecorder project come together? Were there any preceding projects it follows up on? How long has it been going? How long do you expect it to continue?
DE: Webrecorder started as a project by Ilya Kreymer in 2014 and has been fully integrated at Rhizome since 2015, thanks to generous support from the Andrew W. Mellon Foundation. Rhizome is an arts organization dedicated to internet art and online culture, so we were looking for a way to do web archiving with the highest possible "fidelity" when we found that most web archiving tools focus on automation, "crawl scale," and HTML as "documents" rather than the highly interactive and dynamic content that is a reality today.
Rhizome's goal is to make this a fully self-sustaining project, so we expect to continue supporting Webrecorder for as long as somebody wants to do web archiving. It is also open-source software, so even if Rhizome were to go belly-up, the project wouldn't need to go under with us. Already today, you can run your own instance of the Webrecorder service. Additionally, Webrecorder allows users to download their collections as a single WARC file and access them with a desktop application, Webrecorder Player, which is part of our suite of tools and which we hope will enable a new kind of ownership of web archives. (https://github.com/webrecorder/webrecorderplayer-electron/blob/master/README.md)
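For readers curious what lives inside one of those downloads: a WARC file is essentially a plain-text container of capture records, each with its own headers and payload. The sketch below hand-builds and parses a single record using only the Python standard library. It is purely illustrative (the field values are made up); real archives should be handled with a dedicated library such as warcio.

```python
# A minimal, illustrative sketch of one WARC record, the unit of storage
# in the files Webrecorder produces. Hand-rolled here for clarity only.

record = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "WARC-Date: 2017-05-01T12:00:00Z\r\n"
    "Content-Length: 24\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n\r\nHello"
)

def parse_record(text):
    """Split one WARC record into its version line, header fields, and payload."""
    head, _, payload = text.partition("\r\n\r\n")  # first blank line ends the headers
    lines = head.split("\r\n")
    version = lines[0]
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return version, headers, payload

version, headers, payload = parse_record(record)
print(version)                     # WARC/1.1
print(headers["WARC-Target-URI"])  # https://example.com/
```

A real WARC file is simply a concatenation of many such records (requests, responses, metadata), which is why a whole collection can travel as one portable file.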
JM: What were you hoping Webrecorder would do that other web archiving services were not doing? How is Webrecorder different?
DE: Webrecorder's main difference from other tools is that it works "symmetrically": the same code is at work during archiving and during access, chiefly a standard browser.
We also want to make web archiving as a practice accessible to a broad audience. We strongly believe the web is where it's at, where everything of importance either happens directly or is represented, so web archiving should be in more people's hands instead of just the big institutions that can justify hiring the specialized staff to run a web archiving program.
JM: What kind of web content do you envision users of Webrecorder capturing?
DE: The classic Wayback Machine is a web archiving replay software that relies on other software to create the web archives.
Webrecorder is located in between a user’s browser and the whole web, so every request that a browser would generate, and the reply to that request from the web, is routed through the Webrecorder capturing mechanism. Since a real browser is used for capture, and users can activate certain procedures on the sites they are visiting, they can make sure to capture all the resources that are only revealed through human interactivity.
DE: The core replay engine in Webrecorder is actually a standalone tool, called pywb, which provides all of the features of OpenWayback with the fidelity of Webrecorder, allowing institutions to provide access to Webrecorder-created WARCs by hosting their own instance of pywb.
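As a sketch of what hosting an instance might look like, the commands below use pywb's command-line tools; the collection name and WARC filename are hypothetical, and the pywb documentation should be consulted for current options.

```shell
pip install pywb                       # installs the wb-manager and wayback tools

wb-manager init press-projects         # create a new collection (hypothetical name)
wb-manager add press-projects capture.warc.gz  # add a WARC downloaded from Webrecorder
wayback                                # serve the archive locally (port 8080 by default)
```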
Webrecorder in itself does not require emulation. On the most basic level, it tries to catch all the HTTP traffic that happens when somebody is using a website and stores it; that traffic is made up of requests and responses. On access later, it tries to find the best archived response for each request into the archive. For a single website, there can be hundreds of requests and responses to load all the images, embedded social media widgets, ads, and so forth.
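The store-then-find-the-best-response idea above can be sketched in a few lines. This toy model (all names are illustrative; real Webrecorder and pywb use WARC files plus CDX indexes, not an in-memory dict) keeps each captured response keyed by URL and capture time, and on replay returns the response captured closest to the requested moment:

```python
# Toy sketch of capture and replay: store (capture_time, body) per URL,
# then answer a replay request with the temporally closest capture.
from datetime import datetime

archive = {}  # url -> list of (capture_time, response_body)

def capture(url, response_body, when):
    """Record one response, as the capturing proxy would."""
    archive.setdefault(url, []).append((when, response_body))

def replay(url, when):
    """Return the archived response captured closest to `when`, or None."""
    captures = archive.get(url)
    if not captures:
        return None  # nothing archived for this URL
    return min(captures, key=lambda c: abs((c[0] - when).total_seconds()))[1]

capture("https://example.com/", "<html>v1</html>", datetime(2016, 1, 1))
capture("https://example.com/", "<html>v2</html>", datetime(2017, 1, 1))
print(replay("https://example.com/", datetime(2016, 12, 1)))  # <html>v2</html>
```

The closest-timestamp lookup is what lets a replayed page assemble its hundreds of subresources from captures made at slightly different moments.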
A current version of Chrome, for instance, cannot do anything with Java code contained in a web archive. The side project oldweb.today offers a set of legacy browsers for viewing web archives. (http://oldweb.today/)
JM: Do you anticipate a future where web-archiving will be able to capture all of a website’s interactive features and content and re-deliver that as it appeared/functioned at the moment it was captured?
DE: That depends. If I go to Google Maps and ask for a route from my home to my work, I can webrecord the process, but I will never get hold of all the background infrastructure that makes Google Maps so deeply interactive (https://google.com/maps/). This is not possible to capture, because the possibilities are infinite, and a lot of the computation involved happens at Google's server farms, invisible to the browser. In essence, if I wanted to preserve a highly interactive system, I'd need to preserve the backend infrastructure, which is potentially prohibitively expensive.
JM: Could a web recording essentially be a publication? (Taking into account that we’re publishing complete web-based scholarly works, it’s really interesting to consider a “capture” as the publication itself so that all the content and features exist in one WARC file that can be distributed in any browser for as long as WARCs are readable and perhaps even crawled by Google Scholar or other indexing agents.)
DE: The organization Net Freedom Pioneers has published web archives to users in Iran via satellite (https://www.toosheh.org/). Of course this is not an actual publication, since they mostly collect items from the web Iranians cannot access. Some academics like to submit WARCs with their research publications of web resources they have referenced. However, I haven’t heard of a publication that would directly publish to WARC. In general, web archives could be indexed by Google, but it is usually prevented because of cataloging issues. 🙂 (Search on the web is already hard, to search across different timestamps is even harder!)
However, I believe creating an online publication and immediately creating a web archive of each issue would be good practice.
Lots of web archiving happens by accident, as crawlers do not know exactly when they should be archiving a source, so they are usually programmed to revisit at set time intervals. Publishers themselves know best when their publication should be archived and could do it themselves with Webrecorder.
Jasmine Mulliken is Digital Production Associate at Stanford University Press. She coordinates the production and workflow of born-digital projects, including recommending platforms and coding standards to authors, consulting with authors on projects’ technical attributes, and evaluating best practices for archiving and preservation.