by Ann Cooper

Where are the data?

Good data are crucial to advancing the accuracy, usefulness, and quality of everything from scientific research to data journalism. If we value high-quality reproducible scientific research and journalism that furthers the cause of justice and transparency, then we need to allow for the collection, storage, and reuse of data. Ideally, the data used in these types of work would be open, and digital repositories make this access and reuse possible.

Think of data, and you may think of databases or research data sets. You may not immediately think of digital repositories. Digital repositories do make use of databases, though, and a data set is a type of digital material that can be stored and preserved in a digital repository. In addition to data sets, the digital objects in a repository can include scholarly production and born-digital or digitized archival or special collections.1 The former can include digital objects like academic journal articles and data sets, and would typically be found in an institutional repository or data repository. The latter can include an enormous variety of digital objects depending on the archives’ collecting focuses; examples are university records, drafts of a novel donated on floppy disks, and archived websites.

My experience is primarily with archival and special collections repositories, a particular type of digital repository that is designed to meet a certain set of preservation and security needs. The type of digital objects that could be included in these digital repositories might be the born-digital records of a local labor union chapter’s strike or records of a university’s landmark decision to make its campus more accessible. Archival and special collections digital materials support our knowledge of local history, buttress community identity, document struggles and political action, and preserve records to support government accountability.

To allow these data and digital materials to remain useful and usable, we have to preserve, maintain, and provide access to them over the long term. But these records are not accessible and usable over the long term by default; a digital repository maintained by people working as archivists, data producers, and repository developers will make preservation and access possible.

Putting together such a repository might sound like a big project, and it can be, but building a small digital preservation repository is within the capability of a determined technophile.

Into the digital repository

Digital repositories should be able to ensure integrity, authenticity, and usability of the objects they store over the long term. In order to meet these high-level needs, many digital repositories function according to the open archival information system—or OAIS—model, which is ISO reference model 14721:2012.

What this means is that the repository system needs to be able to ingest a submission information package, or SIP (often in a BagIt package), that contains the digital objects and metadata that describes and provides preservation reference points for the digital objects.2 Then the repository needs to be able to store this package as an archival information package, or AIP, and perform checks on the integrity of the digital objects over time. Normally, this means making sure that the hash value of the package established on ingest hasn’t changed.3 The repository can also record events like a format migration or changes to the arrangement of the files, when those events happened, and the agent responsible (human or system). Finally, the repository should produce a dissemination information package, or DIP, for end users to view or interact with.
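That ingest-time hash and the later integrity check are conceptually simple. As an illustration only—this is not how any particular repository system implements it, and the function names are my own—a short Python sketch might record SHA-256 digests for a package at ingest and recompute them later:

```python
import hashlib
from pathlib import Path

def package_manifest(package_dir):
    """At ingest: hash every file in a package.
    Returns {relative path: SHA-256 hex digest}."""
    manifest = {}
    for path in sorted(Path(package_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(package_dir))] = digest
    return manifest

def verify_fixity(package_dir, stored_manifest):
    """Later: recompute digests and report files that no longer match
    (or have gone missing since ingest)."""
    current = package_manifest(package_dir)
    return sorted(
        rel for rel, digest in stored_manifest.items()
        if current.get(rel) != digest
    )
```

A real repository would record the outcome of each check as preservation metadata (a fixity-check event with a timestamp and agent) rather than just returning a list, but the core loop is this: hash on the way in, re-hash on a schedule, and alert on any difference.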

Life with repository projects

I worked on digital repository solutions in a couple different jobs, both of which were in archival and special collections housed in university libraries. In both cases, the problem that we needed to solve was that the collections were accepting increasing amounts of born-digital material and the archivists were realizing that their under-described and unprocessed material on digital storage media was at risk. In general, archivists expect that born-digital material will not only continue to be accessioned into their collections, but that the amount of digital material generated by donors will grow.

What this meant for us was that we needed not only to safely extract the digital material from the storage media but also to have a place to preserve it…and we didn’t. That’s when I jumped in and put my perseverance and love of technology and digital preservation to work.

Many smaller library systems and new digital preservation programs have to make do with minimal resources. My experience certainly followed this pattern. Common project constraints include the number of people available to work on the system or answer questions when you encounter an obstacle, small budgets for equipment and software, and a need or pressure to get some kind of digital preservation environment operational quickly. The good news is that these are typical constraints, and there are ways to get a preservation repository working in these circumstances. The bad news is that depending on your situation, it might be time-consuming and frustrating.

Equipment and workflow

I needed to set up a workflow to process the digital material and prepare it for ingest into the repository. I also needed to set up a repository. To do that I required appropriate software and hardware tools in addition to secure, geographically distributed, and backed-up storage. My most successful experience putting together a system to accomplish all this used Amazon Web Services S3 for the access copies of the material and Glacier for the master copies.
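In today’s AWS, Glacier is also exposed as an S3 storage class, which makes the split between access copies and master copies essentially a one-parameter decision. A minimal sketch of that mapping, with placeholder bucket names and an invented function—pass the resulting parameters to boto3’s `put_object` or as `ExtraArgs` to `upload_file`:

```python
def upload_params(package_path, package_type):
    """Map a finished package to S3 upload parameters: DIPs (access
    copies) go to standard storage, AIPs (master copies) to the
    Glacier storage class. Bucket names here are placeholders."""
    if package_type == "AIP":
        return {"Bucket": "example-preservation-masters",
                "Key": f"aip/{package_path.name}",
                "StorageClass": "GLACIER"}
    if package_type == "DIP":
        return {"Bucket": "example-access-copies",
                "Key": f"dip/{package_path.name}",
                "StorageClass": "STANDARD"}
    raise ValueError(f"unknown package type: {package_type}")
```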

The hardware included:

  • A spare PC running Ubuntu Linux
  • A spare Mac laptop for digital forensics
  • A NAS server
  • An external hard drive
  • Hardware write blockers such as these from Tableau:
      • Forensic SATA/IDE Bridge
      • Forensic USB Bridge
  • 5 1/4 inch floppy disk drive
  • 3 1/2 inch floppy disk drive
  • Zip drive
  • CD / DVD drive

The software included:

I used the Ubuntu machine for Archivematica, which is an open-source archival processing and packaging tool that makes use of other open-source tools to run what it calls “microservices” on packages to assign UUIDs, get a hash value, identify file formats, make a SIP, and then store the DIP and AIP with attached documentation in whichever location(s) you have configured. I like Archivematica and would use it again, but it can be tricky when you are first trying to get it up and running.
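To make the microservice idea concrete, here is a toy stand-in—emphatically not Archivematica’s actual code—that does a simplified version of three of those steps for each file in a transfer: assign a UUID, compute a hash, and make a naive, extension-based guess at the format. The function name and record layout are invented for illustration:

```python
import hashlib
import mimetypes
import uuid
from pathlib import Path

def describe_files(transfer_dir):
    """Run toy 'microservices' over a transfer directory: UUID
    assignment, hashing, and format identification. Real tools
    (Siegfried, FIDO) identify formats by signature, not extension."""
    records = []
    for path in sorted(Path(transfer_dir).rglob("*")):
        if not path.is_file():
            continue
        records.append({
            "uuid": str(uuid.uuid4()),  # persistent identifier
            "path": str(path.relative_to(transfer_dir)),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            "format": mimetypes.guess_type(path.name)[0] or "unknown",
        })
    return records
```

Archivematica’s value is that it chains many such services together, records what each one did as preservation metadata, and packages the results—but each service is, at heart, a small per-file operation like this.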

The NAS server was the processing space for Archivematica and the holding area for finished AIPs and DIPs until I pushed them to their respective homes in AWS.

I used the Mac for some of the processing tools, like BitCurator and Curator’s Workbench. The external hard drive was connected to the Mac to use as processing space for BitCurator, which I ran in a VM. BitCurator makes a disk image of the source material before it allows the user to run reports and apply a suite of digital forensics tools. For larger storage sources, like external hard drives, it is really useful to have the space to make a disk image. I didn’t use BitCurator for everything, but depending on local needs and workflow, it is a very powerful archival digital forensics solution.

Overall, my workflow from receiving born-digital material to storing and preserving it was basic. There are a lot of ways to configure the different parts of a preservation and repository system, and a world of tools to use in the processing and preparation for ingest phase of the workflow. There are also some fantastic guides for best practices and local workflows in digital preservation. My system is just one way to do it, and it was tailored for the specific environment I was working in.

Without going into the strictly archival side of my workflow too much, this is what I did:

  1. Determine whether the digital storage media needs to be protected, and if so, connect using write blockers.
  2. Determine whether the digital material needs arrangement; if so, use Curator’s Workbench.4
  3. Curator’s Workbench copies the files into a project. It allows the user to see the original arrangement and to create and export a new arrangement. Save the new arrangement somewhere that Archivematica can ingest from.
  4. If you need to preview the files for arrangement and appraisal, use the native file browser.
  5. If you are keeping a separate project documentation folder, use NARA File Analyzer to generate a detailed manifest. This can also help with arrangement and appraisal decisions.
  6. Ingest the arranged digital material using Archivematica.
  7. You can automate pieces of the Archivematica ingest processes, or check through the steps manually.
  8. Archivematica will process the digital material and create a DIP and an AIP unless you only want one or the other. It will save these to the locations you mapped when you set up Archivematica.
  9. If you are using AWS or some other backed-up storage system and you move your finished packages to that location frequently, you can save the AIP and the DIP to your processing space, even if it isn’t on your organization’s domain and isn’t backed up nightly. My process was to save the AIP and DIP to the NAS server and then upload to AWS whatever AIPs and DIPs were on the server after I finished my digital processing work for the day.
  10. You are now finished with the preparation and storage phase! If you are able to complete these steps, you have a functioning digital preservation repository and a digital processing workflow.
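The end-of-day sweep in step 9 is easy to script. A sketch, assuming finished AIPs and DIPs land in aips/ and dips/ folders on the NAS as .7z files—the layout and names are assumptions for illustration, not my exact setup:

```python
from pathlib import Path

def packages_to_upload(nas_root):
    """Find finished packages waiting on the NAS at the end of the day.
    Assumes AIPs and DIPs land in 'aips/' and 'dips/' subdirectories
    as .7z files (an assumed layout); returns (kind, path) pairs
    ready to hand to an uploader."""
    queue = []
    for kind in ("aips", "dips"):
        for pkg in sorted(Path(nas_root, kind).glob("*.7z")):
            queue.append((kind[:-1].upper(), pkg))  # "AIP" or "DIP"
    return queue
```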

The solution I put together and used was not perfect, but it was adequate for our needs and I was able to do it myself (with greatly appreciated but only occasional help from a very busy library software developer). Using this approach worked well in that it didn’t require a large budget and it met the basic long-term preservation requirements for digital material. It also resulted in a workflow that one person can easily manage.

The point of preservation is ensuring continued access, though, and a weak point in the approach I used was the lack of it. Ideally, a project to build a digital repository would include a way to make the materials accessible to the public. Although the repositories I worked on fulfilled the long-term preservation needs, efficiently building an access layer required more skill than I had, so my repository solution didn’t include that essential piece. The access layer could be a locally built interface using something like Drupal, an open-source solution like Access to Memory, or any one of a number of hosted digital library access points. The first two are more configurable, but also require more time and work, so ultimately the right access layer will depend on the local circumstances and needs. One thing that these repository projects showed me is that if a project I’m leading has a functional requirement that I’m not able to deliver myself, then I need to find a way to get the people who do have the skills involved from the beginning.

Aside from my personal experiences doing digital preservation projects, the larger issue is that we won’t have digital collections—including the kinds of data that support scientific research, journalism, and open government—unless we preserve the material first. Whether we work as archivists, analysts, records managers, community activists, interested tinkerers, or something else entirely, we can take the first steps toward safely preserving digital material.

Ann is a data analyst and former digital archivist interested in digital preservation and access. She also spends a lot of time running and discovering new plant allergies.

  1. Digitized refers here to material that was made digital through scanning, filming, or other processes. 
  2. The metadata could include a hash value, the steps that were taken to package and process the material, the agents or tools that performed these steps, file size, format, creator, title, UUID, and local identifier. 
  3. A hash value is a type of checksum that can be calculated using one of several algorithms, such as SHA-256 or MD5. The algorithm processes a digital object to produce a number that identifies it and that will change if any aspect of the object changes, so it is often used to verify the integrity and authenticity of a digital object over time. 
  4. Arrangement is a term specific to archives and it refers to organizing the material (digital, physical, or hybrid) into series, sub-series, and folders according to either the existing order of the collection as it arrived in the archives or an applied order that tries to conserve the context of use.