Granular Web Archiving at IIPC

As is pretty clear by now, we’re spending a lot of time and energy in the pursuit of ensuring the digital work we’re publishing at SUP is just as long-lived as a typical scholarly monograph. We’ve zeroed in on three approaches, and the one that has been most successful so far is web archiving. So I’ve made it one of my priorities to engage with the web archiving world by making connections and keeping up with the latest technologies and tools.
It was in this context that I attended and presented at this year’s IIPC WAC (the International Internet Preservation Consortium’s Web Archiving Conference) in Wellington, New Zealand, November 12-16. My presentation was part of a panel that included Anna Perricci of Webrecorder and Sumitra Duncan, head of the New York Art Resources Consortium’s web archiving program. Moderated by fellow supDigital team member Nicole Coleman, who has written her own report of the conference here, our panel showcased the advantages of granular web archiving for creative and scholarly web content. Unlike many of the other conference sessions, which focused on large-scale crawling, mostly for dark archives, our session, titled “Capturing complex websites and publications with Webrecorder,” sought to push back against scale in favor of scope. Ultimately, I think what we brought to this year’s conversation was a case for thorough, thoughtful, comprehensive capture of web content, an approach that challenges the sweeping, scalable methods often used to gather large amounts of time-sensitive content very quickly.
Web archiving has been perhaps one of the strongest responses to the threat of disappearing news and data, a threat that has become frighteningly commonplace as more and more information is published in digital-only formats and as political leaders seek to distort history and fact. Being able to crawl and archive thousands of news articles in a single day, then, is certainly a legitimate expectation of web archiving applications and software, and it makes sense that most of the field’s focus is on this kind of content. It was reassuring to see just how many institutions were working to preserve truth and records of knowledge, and how much care they were putting into the legal and cultural sensitivities surrounding that kind of record building. But not everyone who needs to web archive needs to do it the same way.
As the first presenter in our panel, Anna Perricci outlined the affordances Webrecorder offers, as opposed to other systems, and how the team is evolving its tools to meet the needs not just of archivists but of content producers who anticipate the need to archive early in their production processes. She shared updates to the interface and previewed upcoming plans for collaboration and extended services. The presentation preceding our panel, by Ilya Kreymer, laid the technological foundation for these features, and Anna offered a reframing of them through the human-scale lens that Webrecorder has come to represent in the web archiving world.
As a scholarly publisher, we make it our priority to capture and reproduce the full content and experience of each digital publication we put out, and oftentimes these publications contain technologies beyond the typical scope of a scale-based crawler. Complex JavaScript, for instance, requires the affordances of the pywb framework rather than the OpenWayback system so commonly used by large institutions. My presentation highlighted the capabilities we’ve needed to leverage from Webrecorder as a pywb-based system when OpenWayback failed to capture and/or deliver all the intricacies of the innovative work we’re publishing.
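For readers curious what that pywb-based workflow looks like in practice, here is a minimal sketch of standing up a local replay of a Webrecorder capture. It assumes pywb 2.x is installed (pip install pywb); the collection name and WARC filename are hypothetical placeholders, not our actual publication archives.

```python
# Minimal sketch: local pywb replay of a Webrecorder capture.
# Assumptions: pywb 2.x is installed (`pip install pywb`); the collection
# name and WARC filename below are hypothetical placeholders.
import subprocess

collection = "sup-digital-publication"      # hypothetical collection name
warc_file = "webrecorder-capture.warc.gz"   # hypothetical Webrecorder export

# Create a pywb collection and add the captured WARC to its index.
subprocess.run(["wb-manager", "init", collection], check=True)
subprocess.run(["wb-manager", "add", collection, warc_file], check=True)

# Start the replay server; the archived publication is then browsable at
# http://localhost:8080/<collection>/<original-url>
subprocess.run(["wayback", "-p", "8080"], check=True)
```

The difference matters for us because pywb’s rewriting and fuzzy matching of the dynamic requests that JavaScript-heavy publications make is what keeps those publications working on replay.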
Sumitra Duncan wrapped up the presentations by bringing a similar perspective from the art world. Capturing artist websites presents challenges much like our own at SUP: artists, like digital scholars, by nature push the bounds of what is technically, interactively, and visually possible. Organizations like ours require a web archiving tool that allows for fine-tuned manual capture so we can ensure nothing is lost. Duncan explained that NYARC uses a combination of Archive-It and Webrecorder, capturing what it can through automation and then filling in the gaps automation necessarily leaves with a more granular approach.
Ultimately, the trade-off with granular control of a capture is that the process is time-intensive. As content producers with still only a few works released, we at SUP can afford to focus closely on each one. Most stakeholders in the web archiving world are still, understandably, focused on scale: grabbing as much as possible while it’s there, without necessarily planning immediately for replay or access to the archived content. But if our arguments carried any weight in the ongoing discussion, I hope it’s that the creators of web content should take a more active role in the longevity of what they’re putting online. Our perspective seemed to resonate and piqued the interest of the audience, and our post-presentation panel discussion yielded insightful questions and comments, provoked and encouraged by our moderator. Our presence at the conference, I hope, demonstrated that archiving needs to, and is beginning to, happen during production, and that as the creators of web content we have unique insight into the technical needs of that content.
I feel encouraged by the welcome our little panel received. The group in Wellington, including the hosts at the National Library of New Zealand, was remarkably accommodating and inclusive. Our representative in the welcoming waiata called us all “warriors of the internet.” I’m happy our program is counted among those protecting knowledge, and that we were able to share our perspective as content creators among people so passionate about preserving it.