Creating an archive of an interactive scholarly work’s publication components in the Stanford Digital Repository is a time-intensive and collaborative effort.
The source and content files of our first publication,
Enchanting the Desert, have now been fully accessioned, deposited, and
processed in the Stanford Digital Repository. Aside from the collection record itself and
the referenced screencast, the contents of the collection will remain “dark”—in
other words, citation-only and undiscoverable via the SDR search interface—until
a time when the original publication or another interactive, high-fidelity
version of it is no longer accessible. Hopefully, such a time is still a long
way off, but it’s become part of a typical publishing workflow for us to
mitigate the inherent risks of web publishing by preparing archival versions of
the complex digital projects we produce.
In the spirit of documenting and sharing the work we’re
doing to ensure the persistence of the digital scholarly work we’re publishing,
here’s a simplified outline of what it took to establish an SDR collection for our
first interactive scholarly work. I’m not a librarian or a repository
specialist, so I’m undoubtedly glossing over some of the nuances and more
complex processes. But suffice it to say that I've learned a great deal about library
infrastructure and digital collection building thanks to the many months I was
able to spend working on this task.
1. The first logical step in building any collection is to prepare an inventory of its contents. For a web publication, the most logical place to start is the public_html folder deployed to the publication server. Ideally, such a folder contains all the project-level text, media, and code files needed for the project to function in a browser contemporary to its publication. Beyond those contents, we also include author-provided documentation to serve as a kind of "read me" for the project and thus for the archive collection, as well as a WARC file, an object generated by the web archiving process. All this content is then listed in a spreadsheet, each row representing an object (essentially a file from the public_html folder) along with columns of customized metadata for each object. This metadata varies by project. For Enchanting the Desert, a project consisting of 1,050 separate files, we provided each file's original directory location, a short description of its content, a list of associated files that would accompany it on screen, and bibliographic citations.
Sample rows from the full dataset showing relationships between collection file content.
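The inventory step above can be sketched in a few lines of script. This is a hypothetical illustration rather than the tool we actually used: it walks a local public_html folder and writes a starter spreadsheet whose description, associated-files, and citation columns would then be filled in by hand.

```python
import csv
from pathlib import Path

def build_inventory(root="public_html", out="inventory.csv"):
    """Walk the deployed site folder and write one CSV row per file,
    with empty columns for the metadata added during review."""
    root = Path(root)
    with open(out, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "description", "associated_files", "citation"])
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Record the file's location relative to public_html;
                # the remaining columns are completed by a human editor.
                writer.writerow([str(path.relative_to(root)), "", "", ""])
```

A starter sheet like this only captures what a script can know (the paths); the descriptive and relational columns are where the real editorial work lies.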
2. The next step was to use the organized data to register
each item for deposit using Argo, Stanford’s
open-source administrative “hydra head” for Fedora. Registration
involves setting up basic collection parameters, and as the data is processed,
the system assigns each object a “druid,” which will then be used to fill in
content and additional metadata for each piece of the collection once it’s
deposited. Learning this interface required the help of repository specialists in the library and involved some command-line processes, but having gone through it all once, and on such a large collection, we're much better equipped for the next round of projects.
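For context, a druid is simply a patterned identifier. A hypothetical sanity check like the one below, assuming the letter classes used in Stanford's public tooling (vowels and "l" excluded to avoid ambiguity), could catch malformed identifiers in a batch before they are used to link content:

```python
import re

# Druids look like "bb123cd4567", optionally prefixed with "druid:".
# Letter positions draw from a reduced alphabet (no vowels, no "l").
DRUID_RE = re.compile(r"^(druid:)?[b-df-hjkmnp-tv-z]{2}\d{3}[b-df-hjkmnp-tv-z]{2}\d{4}$")

def is_valid_druid(value):
    """Return True if value matches the druid identifier pattern."""
    return bool(DRUID_RE.match(value))
```

In practice the repository assigns druids itself; a check like this would only matter when reconciling spreadsheets of identifiers by hand.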
3. With the help of the same team of specialists, we then uploaded the 1,050 files to Stanford's Box service, from which a specialist transferred them into the repository and linked them with the druids generated in the registration process.
4. Next, we worked with metadata specialists in the library to expand the initial metadata we established during inventory and to add more standard fields like author/conceptor, media type, and genre. These fields were then linked to the collection, and the records for each object were populated by loading a MODS data file into Argo.
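MODS is an XML metadata format maintained by the Library of Congress. As a hypothetical illustration (not the actual records we supplied, which carry many more fields), a minimal MODS record for one object might be generated with Python's standard library like this:

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def minimal_mods(title, creator, type_of_resource, genre):
    """Build a minimal MODS record and return it as an XML string."""
    ET.register_namespace("", MODS_NS)  # serialize with a default namespace
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    name = ET.SubElement(mods, f"{{{MODS_NS}}}name")
    ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = creator
    ET.SubElement(mods, f"{{{MODS_NS}}}typeOfResource").text = type_of_resource
    ET.SubElement(mods, f"{{{MODS_NS}}}genre").text = genre
    return ET.tostring(mods, encoding="unicode")
```

Real records include identifiers, physical descriptions, related items, and more, and in practice metadata specialists produce them with dedicated tooling rather than by hand.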
5. As persistent URLs (PURLs) were generated and we began viewing the results of our work in the form of front-facing online catalog records, we could see pieces that were missing, formats that needed cleaning, and content that was not displaying correctly. Some of these issues could be handled by modifying batch or individual
records back in the Argo interface, while others meant more processing work by
specialists fluent in the automated backend processes. The screencast for the
Enchanting the Desert collection, for instance, needed to be processed by a
media specialist in the library to work with the embedded viewer in the catalog record.
6. At the same time that checks were being run and edits applied, we needed to work out access and permissions for the collection, which required coordinating between SUP's rights manager and DLSS's Product and Service manager. Deciding to keep the
collection dark made some of the work we still needed to do less urgent. The
access setting also reduced the quality of some of the media, since the embedded viewer would not be serving delivery so much as providing a preview. Thus, even now there are some
media objects that will need adjustment before the collection can be made
available to researchers, but with three new recently published projects lined
up for accessioning in the next year, this work will need to wait. Changing the access settings will fix some issues, but others are likely to emerge only once the collection is made available to researchers.
Ultimately, the SDR preservation approach, which is only one
of three we’re either actively applying or exploring, is particularly
time-intensive (the above workflow took about a year and a half) and relies on
a variety of specialists across the Stanford Libraries. It also does not
provide the full interactive experience of the publication. It’s
decontextualized and disaggregated, each of its components preserved at the
file level. But it’s perhaps the most secure and durable preservation method as
the bits are stored in an institutional repository with an infrastructure that
is virtually guaranteed to persist even after web browsers and file formats
come and go. So despite the time and coordinated effort necessary, and despite
the loss of the original context and functionality, the process is worth it for
the opportunity it offers researchers in a more distant future to recompile the
bits and pieces into a working whole again. And in the shorter term, as we also pursue higher-fidelity preservation solutions like web archiving and emulation, it provides us with safe storage from which we can offer access to the WARC or, we hope, draw the pieces needed to emulate the original publication.
Not only has taking Enchanting the Desert through the SDR accessioning and deposit process given us a template for future workflows, it has also helped us identify some wish-list features for the SDR that could benefit other library collections, not just archived components of SUP's interactive scholarly works. It's been interesting to consider how we might leverage the SDR to store and deliver data to live projects and their various catalog/index records, for example. We can also imagine how an SDR PURL page could itself function as a persistent cover page if only it allowed for customized branding and were amenable to indexing crawlers. These kinds of features have typically been, and remain, out of scope for digital repositories, but as publication formats evolve, such systems will need to be leveraged for new kinds of archiving purposes. Learning the system from a publisher's perspective has been fascinating, and the conversations we've gotten to have with the system specialists along the way have generated some pretty wild and exciting ideas.
Jasmine Mulliken is Production and Preservation Manager, Digital Projects, at Stanford University Press. She coordinates the production and preservation workflow of born-digital projects, including recommending platforms and coding standards to authors, consulting with authors on projects’ technical attributes, and evaluating best practices for archiving and preservation.