More Web Archiving: Collaboration, Testing, and New Tools for Extending the Life of Digital Scholarship

Production and Preservation Manager Jasmine Mulliken participated in a panel at the International Internet Preservation Consortium’s Web Archiving Conference.

Last month, I had the honor of presenting SUP’s digital publications as a use case for Browsertrix Cloud, a browser-based crawling tool well suited to archiving complex, interactive digital scholarly publications. The panel, entitled Browser-Based Crawling For All: The Story So Far, was part of the program for IIPC’s Web Archiving Conference “online day.” The format of the online portion of the conference prioritized engagement: panel presenters pre-recorded their presentations, the videos were compiled into a playlist for attendees to watch in advance, and the panel then met live online for an hour of Q&A.

The panel’s abstract conveys the topic and the range of user experience cases presented:

Through the IIPC-funded “Browser-based crawling system for all” project, members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focused on making sure IIPC members can use these tools.

This online panel will provide an update on the project, emphasizing the experiences of IIPC members who have been experimenting with the tools. Four IIPC members who have been exploring Browsertrix Cloud in detail will present their experiences so far. What works well, what works less well, how the development process has been, and what the longer-term issues might be. The Q&A session will be used to explore the issues raised and encourage wider engagement and feedback from IIPC members.

The pre-recorded presentations began with an introduction from Anders Klindt Myrvoll & Ilya Kreymer. They provided updates on the project since it began in 2022 and talked about next steps.

The first user experience, “Testing Browsertrix Cloud at NLNZ,” was presented by Sholto Duncan:

In recent years the selective web harvesting programme at the National Library of New Zealand has broadened its crawling tools of choice in order to use the best one for the job, from primarily using Heritrix, through WCT, to now also regularly crawling with Webrecorder and Archive-It. This has allowed us to get the best capture possible, but these tools unfortunately still fall short in harvesting some of the richer, more dynamic, modern websites that are becoming commonplace. Other areas within the Library that often use web archiving processes for capturing web content have seen this same need for improved crawling tools. This has provided a range of users and diverse use cases for our Browsertrix Cloud testing.

The second user experience, “Improving the Web Archive Experience,” was presented by Lauren Ko from the University of North Texas Libraries:

With a focus on collecting the expiring websites of defunct federal government commissions, carrying out biannual crawls of its own subdomains, and participating in event-based crawling projects, since 2005 UNT Libraries has mostly carried out harvesting with Heritrix. However, in recent years, attempts to better archive increasingly challenging websites and social media have led to supplementing this crawling with a more manual approach using pywb’s record mode. Now hosting an instance of Browsertrix Cloud, UNT Libraries hopes to reduce the time spent on archiving such content that requires browser-based crawling. Additionally, the libraries expect the friendlier user interface Browsertrix Cloud provides to facilitate its use by more staff in the library, as a teaching tool in a web archiving course in the College of Information, and in a project collaborating with external contributors.

Pre-recorded presentation of the SUP use case by Jasmine Mulliken

The third user experience, subtitled “Crawling the Complex,” was presented by me, Jasmine Mulliken:

Web-based digital scholarship, like the kind produced under Stanford University Press’s Mellon-funded digital publishing initiative, is especially resistant to standard web archiving. Scholars choosing to publish outside the bounds of the print book are finding it challenging to defend their innovatively formatted scholarly research outputs to tenure committees, for example, because of the perceived ephemerality of web-based content. SUP is supporting such scholars by providing a pathway to publication that also ensures the longevity of their work in the scholarly record. This is in part achieved by SUP’s partnership with Webrecorder, which has now, using Browsertrix Cloud, produced web-archived versions of all eleven of SUP’s complex, interactive, monograph-length scholarly projects. These archived publications represent an important use case for Browsertrix Cloud that speaks to the needs of creators of web content who rely on web archiving tools as an added measure of value for the work they are contributing to the evolving innovative shape of the scholarly record.

The fourth user experience, “Integrating Browsertrix,” was presented by Andreas Predikaka & Antares Reich of the Austrian National Library:

Since the beginning of its web archiving project in 2008, the Austrian National Library has been using the Heritrix crawler integrated into NetarchiveSuite. For many websites in daily crawls, Heritrix is no longer sufficient, and it is necessary to improve the quality of our crawls. Tests quickly showed that Browsertrix does a very good job of fulfilling this requirement. But it is also important to us that the results of Browsertrix crawls are integrated into our overall working process. Using the Browsertrix API, it was possible to create a proof of concept of the necessary steps for this use case.
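The abstract above describes scripting Browsertrix through its API rather than the web interface. As a hedged sketch only — the deployment URL, endpoint path, and JSON field names below are assumptions for illustration, not the project's documented interface — integration code along these lines might look like:

```python
# Hypothetical sketch of driving a Browsertrix Cloud instance over its
# REST API, in the spirit of the proof of concept described above.
# BASE, the login path, and all payload field names are assumptions.
import json
import urllib.parse
import urllib.request

BASE = "https://browsertrix.example.org"  # hypothetical deployment URL


def login(username: str, password: str) -> str:
    """Exchange credentials for a bearer token (endpoint path assumed)."""
    body = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode()
    req = urllib.request.Request(f"{BASE}/api/auth/jwt/login", data=body)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]


def build_crawl_config(name: str, seed_url: str, depth: int = 1) -> dict:
    """Assemble a JSON body for a new crawl workflow (fields assumed)."""
    return {
        "name": name,
        "config": {"seeds": [{"url": seed_url, "depth": depth}]},
    }


def is_finished(state: str) -> bool:
    """Treat a crawl as done once it reaches a terminal state."""
    return state in {"complete", "failed", "canceled"}
```

A downstream workflow could poll a crawl's status with `is_finished` and, once complete, hand the resulting WACZ file off to the library's existing ingest process.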

The variety of Webrecorder’s users and use cases was certainly reflected in the panel’s presentations, which showed just how important its toolset has become not only to libraries but also to publishers, researchers, and authors of web content around the world.

The hour-long live Q&A session reinforced this diversity of interest, with topics ranging from technical updates and video embeds to scholarly context, archiving workflows, and quality assurance. My experience using Webrecorder tools — especially Browsertrix Cloud for crawling and ReplayWeb.page for presenting web-archived versions of SUP’s digital publications — along with the experience of presenting our work among these diverse colleagues, reinforces that choosing these tools and methods for our digital publication preservation initiative has been the right decision.

The full recorded live Q&A session, moderated by Meghan Lyon, is here:

You’re also welcome to view the slides from my portion of the presentation.
