Zombies in the Archives

The Wayback Machine, a project of the Internet Archive.

It’s been widely noted that the typical website lasts roughly three to five years. One of the goals of SUP’s Mellon grant is to mitigate that inevitability by exploring a range of preservation approaches for the web-based works we’re publishing. While documentation is a necessary component of archiving digital content, an ideal archive would also offer readers the ability to fully experience the interactive qualities of a digital scholarly work. So I’ve spent the past week learning about what some people are doing to make digital content accessible, even after its three-to-five-year countdown has expired.

Just as last week’s blog post on documentation was going live, I was checking in at the 2017 Joint Conference on Digital Libraries, a meeting with a heavy focus on web archiving. If documentation offers a way to chronicle the experience of interacting with digital content that is no longer accessible, web archiving seeks to capture a snapshot of a website, even if it’s not quite as interactive as the original. Take, for example, the Wayback Machine, a project of the Internet Archive that crawls websites and records them so that readers can ostensibly visit a version of a webpage as it appeared on a specific date in history. It’s an important resource, especially at a time when information can disappear overnight, as if it never existed.

A Wayback Machine capture of whitehouse.gov on October 31, 2016 and the live site as of June 25, 2017.

At first glance, web archiving appears to be the perfect solution to capturing and saving web-based content. But it has its limitations. As a solution for preserving our digital projects, it’s still too unreliable. Its weak point is JavaScript, a programming language powering just about every active website on the internet today. JavaScript is considered one of the three fundamental building blocks of the web, along with HTML and CSS. It’s what makes web content interactive and dynamic. And while some publishers understandably might opt to limit authors’ use of dynamic JavaScript because of the difficulty it presents to web archiving tools, we’re more interested in challenging web archiving entities to improve their crawlers’ ability to handle JavaScript.


I was happy to hear from a couple of different presenters at the Digital Libraries conference that they’re working on just that (check out the conversation at #JCDL2017 on Twitter). For example, developers are writing scripts to prevent “zombies” from appearing in archived content. Zombies are anachronistic pieces of content, like ads, that are generated on an archived webpage by JavaScript. The glitch happens when an archived site pulls that extra content from current databases rather than the historic material that would have appeared on the site at the time it was captured. A popular example used by two different presenters at the conference was the news coverage of the 2008 presidential campaign. Although the news story headline and blurb are a recorded capture of the site as it appeared on that day, the ads generated by JavaScript referenced events that hadn’t taken place yet when the site was captured. The fix lies in additional scripts implemented in the crawling process.
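To make the zombie problem concrete, here is a minimal sketch of the kind of URL rewriting that archival replay systems perform: absolute links in captured HTML are redirected through the archive at the capture timestamp, so the page loads archived resources instead of reaching out to the live web. The regex, function name, and replay-URL pattern below are illustrative assumptions, not any specific crawler’s implementation (real systems like the Wayback Machine’s replay software handle far more cases, including URLs built dynamically by JavaScript, which is exactly where zombies slip through).

```python
import re

# Hypothetical replay endpoint; real archives each have their own URL scheme.
ARCHIVE_PREFIX = "https://web.archive.org/web"

def rewrite_urls(html: str, timestamp: str) -> str:
    """Rewrite absolute http(s) URLs in src/href attributes so they resolve
    through the archive at the capture timestamp, rather than fetching
    current content from the live web (the source of 'zombie' content)."""
    pattern = re.compile(r'(src|href)="(https?://[^"]+)"')

    def repl(match: re.Match) -> str:
        attr, url = match.group(1), match.group(2)
        return f'{attr}="{ARCHIVE_PREFIX}/{timestamp}/{url}"'

    return pattern.sub(repl, html)

# An ad script tag as it might appear in a page captured on election day 2008:
snippet = '<script src="https://ads.example.com/latest.js"></script>'
print(rewrite_urls(snippet, "20081104000000"))
```

Without this rewriting, the `latest.js` request would hit the ad server’s current database and render present-day ads inside a 2008 capture; with it, the request stays inside the archive. Static rewriting like this only catches URLs present in the HTML, which is why the presenters’ work focuses on the harder case of URLs that JavaScript assembles at runtime.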

The big focus still seems to be on accurately capturing news, social media content, and scientific data for historical reasons, and rightly so. It’s becoming more important than ever to ensure history and science are not manipulated, misrepresented, or altogether deleted. Hopefully, the advancements being made in web archiving will continue and can also benefit the scholarly content we’re publishing so that academic perspectives and arguments are preserved along with cultural and scientific data. Scholarly discourse is happening online, and it isn’t just CNN and Twitter that need to be recorded. We also need to capture and preserve the voices that are interpreting, analyzing, and making sense of this content, and publishing their insights in digital formats.
