From Publication to Digital Repository
Creating an archive of an interactive scholarly work’s publication components in the Stanford Digital Repository is a time-intensive and collaborative effort.
The source and content files of our first publication, Enchanting the Desert, have now been fully accessioned, deposited, and processed in the Stanford Digital Repository. Aside from the collection record itself and the referenced screencast, the contents of the collection will remain “dark”—in other words, citation-only and undiscoverable via the SDR search interface—until a time when the original publication or another interactive, high-fidelity version of it is no longer accessible. Hopefully, such a time is still a long way off, but it’s become part of a typical publishing workflow for us to mitigate the inherent risks of web publishing by preparing archival versions of the complex digital projects we produce.
In the spirit of documenting and sharing the work we’re doing to ensure the persistence of the digital scholarly work we’re publishing, here’s a simplified outline of what it took to establish an SDR collection for our first interactive scholarly work. I’m not a librarian or a repository specialist, so I’m undoubtedly glossing over some of the nuances and more complex processes. But suffice it to say I’ve learned a great deal about library infrastructure and digital collection building thanks to the many months I was able to spend working on this task.
1. The first logical step to building any collection is to prepare an inventory of the contents. For a web publication, the most logical place to start is the public_html folder that has been deployed to the publication server. Such a folder ideally contains all the project-level text, media, and code files needed for a project to function in a browser contemporary to its publication. Beyond those contents, we also include author-provided documentation to serve as a kind of “read me” for the project and thus the archive collection. We also include a WARC file, an object generated by the web archiving process. All this content is then listed in a spreadsheet, each row representing an object (basically a file from the public_html folder) along with columns of customized metadata for each object. This metadata varies by project. For Enchanting the Desert, for example, a project consisting of 1050 separate files, we provided original directory location, a short description of the file content, a list of associated files that would accompany the one listed on screen, and bibliographic citations.
Sample rows from the full dataset showing relationships between collection file content.
2. The next step was to use the organized data to register each item for deposit using Argo, Stanford’s open-source administrative “hydra head” for Fedora. Registration involves setting up basic collection parameters, and as the data is processed, the system assigns each object a “druid,” which will then be used to fill in content and additional metadata for each piece of the collection once it’s deposited. Learning this interface took the help of repository specialists in the library, and it involved some command-line processes, but having gone through it all once, and on such a large collection, we’re much better equipped for the next round of projects.
3. With the help of the same team of specialists, we then uploaded the 1050 files to Stanford’s Box service. From there, a specialist downloaded them into the repository, where they were linked with the druids generated in the registration process.
4. Next, we worked with metadata specialists in the library to expand the initial metadata we established during inventory and add more standard metadata like author/conceptor, media type, genre, etc. These fields were then linked to the collection and populated the records for each object when we loaded a MODS data file into Argo.
5. As persistent URLs (PURLs) were generated, we could begin viewing the product of our work in the form of front-facing online catalog records. That let us see pieces that were missing, formats that needed cleaning, and content that was not displaying correctly. Some of these issues could be handled by modifying batch or individual records back in the Argo interface, while others meant more processing work by specialists fluent in the automated backend processes. The screencast for the Enchanting the Desert collection, for instance, needed to be processed by a media specialist in the library to work with the embedded viewer in the object’s record.
6. While checks were being run and tweaks and edits applied, we needed to work out the access and permissions for the collection, which required coordinated effort between SUP’s rights manager and DLSS’s Product and Service manager. Deciding to keep the collection dark made some of the work we still needed to do less urgent. The access setting also reduced the quality of some of the media, since the embedded viewer would not be serving a purpose of delivery but rather just providing a preview. Thus, even now there are some media objects that will need adjustment before the collection can be made available to researchers, but with three new recently published projects lined up for accessioning in the next year, this work will need to wait. Changing access settings will fix some issues, but others are likely to only emerge once that happens.
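For anyone starting a similar inventory (step 1 above), the first pass over a public_html folder can be scripted before the descriptive columns are filled in by hand. What follows is only a minimal sketch, not the process we actually used: the function names and the column headers (filename, original_directory, description, associated_files, citation) are hypothetical stand-ins for the kinds of fields described above.

```python
import csv
from pathlib import Path

# Hypothetical column names, loosely modeled on the metadata described in step 1.
# The real spreadsheet's headers are not reproduced here.
FIELDNAMES = [
    "filename",
    "original_directory",
    "description",
    "associated_files",
    "citation",
]


def build_inventory(root):
    """Return one row (a dict) per file found under `root`.

    Only the filename and its directory relative to `root` are filled in;
    the descriptive columns are left blank for a person to complete.
    """
    root = Path(root)
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root)
        rows.append({
            "filename": path.name,
            # Files at the top level of the folder get "." as their directory.
            "original_directory": relative.parent.as_posix(),
            "description": "",
            "associated_files": "",
            "citation": "",
        })
    return rows


def write_inventory(rows, csv_path):
    """Write the inventory rows to a CSV file that opens in any spreadsheet tool."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_inventory(build_inventory("public_html"), "inventory.csv") would produce a starting spreadsheet with one row per file. The descriptions, associated files, and citations still have to be written by someone who knows the project.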
Ultimately, the SDR preservation approach, which is only one of three we’re either actively applying or exploring, is particularly time-intensive (the above workflow took about a year and a half) and relies on a variety of specialists across the Stanford Libraries. It also does not provide the full interactive experience of the publication. It’s decontextualized and disaggregated, each of its components preserved at the file level. But it’s perhaps the most secure and durable preservation method, as the bits are stored in an institutional repository with an infrastructure that is virtually guaranteed to persist even after web browsers and file formats come and go. So despite the time and coordinated effort necessary, and despite the loss of the original context and functionality, the process is worth it for the opportunity it offers researchers in a more distant future to recompile the bits and pieces into a working whole again. And in the shorter term, as we also pursue higher fidelity preservation solutions like web archiving and emulation, it provides us with safe storage from which we can provide access to the WARC or, we hope, draw the pieces needed to emulate the original publication.
Not only has taking Enchanting the Desert through the SDR accessioning and deposit process given us a template for future workflows, it has also helped us identify some wish list features for the SDR that could also benefit other library collections, not just archived components of SUP’s interactive scholarly works. It’s been interesting to consider how we might leverage SDR to store and deliver data to live projects and their various catalog/index records, for example. We can also imagine how an SDR PURL page could itself function as a persistent cover page if only it allowed for customized branding and was amenable to indexing crawlers. Typically these kinds of features have been and remain out of scope for digital repositories, but as publication formats evolve, these kinds of systems will need to be leveraged for new kinds of archiving purposes. Learning the system from a publisher’s perspective has been fascinating, and the conversations we’ve gotten to have with the system specialists along the way have generated some pretty wild and exciting ideas.