Saving the Internet: An Interview with Webrecorder’s Dragan Espenchied

Webrecorder logo, used with permission.

As is apparent by now, we spend a lot of time thinking about ways to preserve and archive the digital web-based projects we’re publishing. While our technical guidelines provide authors with recommendations for building projects that are more easily sustainable, the fact is that technology changes, and even the most rigid technical standards and requirements will be outdated when the next browser update is released or another version of HTML is unleashed or the newest device for browsing the web drops. What is accepted as safe today will become tomorrow’s vulnerability to hacking or decay. If there is a nature to technology, it’s that five years is a lifetime. And for many web-based projects, five years is a pretty substantial life indeed.

So while we do what we can–based on what we know in the present–to make our publications viable in the future, we must also anticipate the inevitable moment when we need to consider other ways of providing access to the original content when software and hardware updates stop it from smoothly moving from the press’s server to the reader’s browser. And since all of these projects are web-based, one of the most obvious solutions is web archiving.

A web archive is just what it sounds like—an archive of web-based material. But unlike the kind of archive I was used to exploring in grad school—the boxes of handwritten letters and ticket stubs and photographs and playbills from a long-closed box that spent most of its life in the nuclear safety of the Harry Ransom Center basement—a web archive can be perpetually accessible to any user with an internet connection. No scholarly credentials required.

While many people are already familiar with the Internet Archive’s Wayback Machine, a new web-archiving tool on the open-source market, Webrecorder, is in many ways an improvement on Wayback. For instance, our first publication, Enchanting the Desert, while still fully functional in a modern Chrome browser, is already broken in the Wayback Machine. Webrecorder, however, captures Enchanting the Desert in all its rich JavaScript glory. Webrecorder is a free open-source tool that puts archiving in the hands of the user, allowing them to capture all of a site’s moving parts so that it can later be viewed in a fully functional context.

I recently had the good fortune to snag an email interview with Dragan Espenchied, a member of the Rhizome team working on Webrecorder, and I’m thrilled to be able to present that interview here in full so our readers can learn more about this promising resource directly from one of its developers.

JM is Jasmine Mulliken, Digital Production Associate at Stanford University Press (blog author)

DE is Dragan Espenchied, Preservation Director for Rhizome at the New Museum

JM: How did the Webrecorder project come together? Were there any preceding projects it follows up on? How long has it been going? How long do you expect it to continue?

DE: Webrecorder started out as a project by Ilya Kreymer in 2014 and has been fully integrated at Rhizome since 2015, thanks to generous support from the Andrew W. Mellon Foundation. Rhizome is an arts organization dedicated to internet art and online culture, so we were looking for a way to do web archiving with the highest possible “fidelity” when we found that most web archiving tools are focused on automation, “crawl scale,” and HTML as “documents” instead of the highly interactive and dynamic content that is a reality today.

Rhizome’s goal is to make this a fully self-sustaining project in the future, so we expect to continue supporting Webrecorder for as long as somebody wants to do web archiving. It is also open-source software, so even if Rhizome were to go belly-up, the project wouldn’t need to go under with us. Already today, you can run your own instance of the Webrecorder service. Additionally, Webrecorder allows users to download their collections in a single WARC file and access them with a desktop application, Webrecorder Player, which is part of our suite of tools and which we hope will enable a new kind of ownership of web archives. (https://github.com/webrecorder/webrecorderplayer-electron/blob/master/README.md)

JM: What were you hoping Webrecorder would do that other web archiving services were not doing? How is Webrecorder different?

DE: Webrecorder’s main difference from other tools is that it works “symmetrically”: the same code is at work during archiving and access, mainly a standard browser.

We also want to make web archiving as a practice accessible to a broad audience. We strongly believe the web is where it’s at, where everything of importance either happens directly or is represented, so web archiving should be in more people’s hands instead of just those of big institutions that can justify hiring the specialized staff to run a web archiving program.

JM: What kind of web content do you envision users of Webrecorder capturing?

DE: Really anything that is on the web. We have users doing all kinds of non-traditional web archiving projects. Here is a little article describing some of the use-cases we presented with the latest release. (http://rhizome.org/editorial/2017/jul/12/whats-good-for-net-art-is-good-for-everyone/)

JM: As more publishers begin to venture into digital scholarship, we’re faced with the issue of longevity. It would seem that web archiving should offer a solution to preserving web content. The biggest problem for SUP and many self-publishing digital scholars, though, is that traditional crawling web-archiving tools (Open Wayback) don’t handle JavaScript very well. Webrecorder seems to be better at it. Can you talk about the differences between, for instance, the Wayback Machine and Webrecorder?

DE: The classic Wayback Machine is a web archiving replay software that relies on other software to create the web archives.

Webrecorder is located in between a user’s browser and the whole web, so every request that a browser would generate, and the reply to that request from the web, is routed through the Webrecorder capturing mechanism. Since a real browser is used for capture, and users can activate certain procedures on the sites they are visiting, they can make sure to capture all the resources that are only revealed through human interactivity.

Webrecorder has been built with the dynamic web in mind first and foremost, so it “intercepts” JavaScript calls before they are fired and then makes sure that they operate through the capture and replay part of the architecture. This operation is done in real time, in the browser, instead of trying to rewrite JavaScript before it is sent to the browser. It is impossible to “rewrite” a script for an archive in its switched-off state, as just a text resource; only during its execution can a script really be changed in that way.
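
Editor's sketch: the capture arrangement described above, a recorder sitting between a client and the web and storing every request together with its response, can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not Webrecorder's actual code: the names `Exchange`, `RecordingProxy`, and `fake_web` are hypothetical, and the live web is replaced by a small stand-in function so the example runs offline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Tuple


@dataclass
class Exchange:
    """One captured request/response pair (hypothetical structure)."""
    timestamp: str  # 14-digit UTC timestamp, e.g. "20170803101047"
    method: str
    url: str
    status: int
    body: bytes


# A fetcher takes (method, url) and returns (status, body).
Fetcher = Callable[[str, str], Tuple[int, bytes]]


@dataclass
class RecordingProxy:
    """Forwards each request to `fetch` and keeps a copy of the exchange."""
    fetch: Fetcher
    captured: List[Exchange] = field(default_factory=list)

    def request(self, method: str, url: str) -> Tuple[int, bytes]:
        status, body = self.fetch(method, url)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        self.captured.append(Exchange(stamp, method, url, status, body))
        # The caller still gets the unmodified response.
        return status, body


def fake_web(method: str, url: str) -> Tuple[int, bytes]:
    """Stand-in for the live web (an assumption, so the sketch runs offline)."""
    pages = {
        "http://example.com/": b"<html>hello</html>",
        "http://example.com/app.js": b"console.log('hi')",
    }
    if url in pages:
        return 200, pages[url]
    return 404, b""


proxy = RecordingProxy(fake_web)
proxy.request("GET", "http://example.com/")
proxy.request("GET", "http://example.com/app.js")
captured_urls = [exchange.url for exchange in proxy.captured]
print(captured_urls)
```

Because every request passes through `request`, anything the client asks for while a person clicks around ends up in `captured`, which is the point Espenchied makes about resources that are only revealed through interaction.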

JM: What technologies are you using for playback of the captured WARC files? Is emulation involved? If so, what all are you emulating (browser/version, JavaScript, etc.)?

DE: The core replay engine in Webrecorder is actually a standalone tool, called pywb, which provides all of the features of OpenWayback with the fidelity of Webrecorder, allowing institutions to provide access to Webrecorder-created WARCs by hosting their own instance of pywb.

Webrecorder in itself does not require emulation. On the most basic level, it tries to catch all the HTTP traffic that happens when somebody is using a web site and stores it: that is made up of requests and responses; on access later, it tries to find the best archived response when a request into the archive happens. For a single web site, there can be hundreds of requests and responses to load all images, embedded social media widgets, ads, and so forth.

However, since the software that makes the requests and interprets the responses is critical, Webrecorder can combine software emulation with web archiving. For example, here is a web archive accessed with a browser that can execute Java applets and produce this great 1990s lake effect. (https://webrecorder.io/despens/pieces/20161020063555$br:firefox:49/http://art.teleportacia.org/exhibition/merry_christmas/bridging_the_digital_divide/)

A current version of Chrome, for instance, cannot do anything with the Java code that is contained in the web archive. The side project oldweb.today offers a set of browsers for looking at web archives. (http://oldweb.today/)
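
Editor's sketch: the replay step described in this answer, finding "the best archived response when a request into the archive happens," can be illustrated as follows. The interview does not say how "best" is determined, so this sketch assumes it simply means the capture of the same URL that is closest in time to the requested moment; pywb's real matching is likely more involved. The `Archive` class and its methods are hypothetical names, and timestamps use the 14-digit form visible in the Webrecorder URLs quoted in this interview.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp, e.g. "20161020063555"


class Archive:
    """Stores captured response bodies per URL and replays the nearest one."""

    def __init__(self) -> None:
        self._records: Dict[str, List[Tuple[datetime, bytes]]] = {}

    def store(self, url: str, timestamp: str, body: bytes) -> None:
        captured_at = datetime.strptime(timestamp, TS_FORMAT)
        self._records.setdefault(url, []).append((captured_at, body))

    def replay(self, url: str, timestamp: str) -> Optional[bytes]:
        candidates = self._records.get(url)
        if not candidates:
            return None  # nothing archived for this URL
        wanted = datetime.strptime(timestamp, TS_FORMAT)
        # Assumption: "best" means smallest time distance to the request.
        _, body = min(candidates, key=lambda record: abs(record[0] - wanted))
        return body


archive = Archive()
archive.store("http://example.com/", "20161020063555", b"old page")
archive.store("http://example.com/", "20170803101047", b"new page")

nearest = archive.replay("http://example.com/", "20170101000000")
missing = archive.replay("http://example.com/other", "20170101000000")
print(nearest, missing)
```

A request for January 2017 is served from the October 2016 capture because it is nearer in time than the August 2017 one, while a URL that was never captured yields nothing at all.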

JM: Do you anticipate a future where web-archiving will be able to capture all of a website’s interactive features and content and re-deliver that as it appeared/functioned at the moment it was captured?

DE: That depends—if I go to Google Maps and ask for a route from my home to my work, I can webrecord the process, but will never get hold of all the background infrastructure that makes Google Maps so deeply interactive (https://google.com/maps/). This is not possible to capture, because the possibilities are infinite, and there is lots of computation involved that happens at Google’s server farms which is invisible to the browser. In essence, if I’d want to preserve a highly interactive system, I’d need to preserve the backend infrastructure, which is potentially prohibitively expensive.

However, if I am looking at a web site that does most of its activities in the user’s browser, it can be captured. An example is this meme generator that puts text over an image. (https://webrecorder.io/despens/a-virtual-roundtable-discussion/20170803101047/https://imgflip.com/memegenerator/Confession-Bear)

JM: Could a web recording essentially be a publication? (Taking into account that we’re publishing complete web-based scholarly works, it’s really interesting to consider a “capture” as the publication itself so that all the content and features exist in one WARC file that can be distributed in any browser for as long as WARCs are readable and perhaps even crawled by Google Scholar or other indexing agents.)

DE: The organization Net Freedom Pioneers has published web archives to users in Iran via satellite (https://www.toosheh.org/). Of course this is not an actual publication, since they mostly collect items from the web that Iranians cannot access. Some academics like to submit WARCs of the web resources they have referenced along with their research publications. However, I haven’t heard of a publication that would directly publish to WARC. In general, web archives could be indexed by Google, but it is usually prevented because of cataloging issues. 🙂 (Search on the web is already hard; to search across different timestamps is even harder!)

However, I believe creating an online publication and immediately creating a web archive of each issue would be good practice.

Lots of web archiving happens by accident, as crawlers do not know when exactly they should be archiving a source, so they’re usually programmed to check at certain time intervals. Publishers themselves know best when their publication should be archived and could do it themselves with Webrecorder.

 
