by Chelsea Gunn

It is sometimes said that physical objects decay gracefully, while digital objects appear stable on the surface, decaying or corrupting suddenly and without warning. This can be a source of anxiety for those of us who document our personal and professional lives digitally. Many of us have had the experience of attempting to open an old text document that turns out to be incompatible with our modern word processing program, or attempting to open an image or audio file, only to learn that it has been corrupted, or the format is no longer supported.

At a recent conference on personal digital archiving, I attended an open community panel and discussion held at the local library. It was illuminating to hear the questions that members of the public had for the panel. Concerns about the ongoing preservation of and access to personal digital photographs and videos were by far the most common, though some questions about documents and archived email correspondence arose as well.

As an archivist, thinking about the functions and maintenance of legacy systems has taken up a significant chunk of my professional life (and, admittedly, a fair amount of my personal time). Just as it has been important to consider environmental factors for physical objects, such as what temperature and relative humidity create sustainable macro-environments for documents composed on fragile paper in acidic ink, it is vital to store legacy digital formats in electronic environments that will continue to support preservation and access over time. The exact specifications of a format can be easily obfuscated by graphical user interfaces, however, making this process a bit more intimidating at first. In order to understand how to best preserve and steward our digital objects into the future, we first need to understand what they actually are, how they were created, and what infrastructure is required to maintain them.

In the archival field, we approach collections using the guiding principles of context, content, and structure. When working with legacy digital formats, the files themselves provide the content. We may consider the structure at several different levels, from the bit level (the base sequence of ones and zeroes that makes up a given digital object) to the file level (the organizational arrangement of digital objects within a folder or file structure). Context, then, often involves understanding the environment in which the content was originally created. On what operating system or in what program was a digital file format meant to be run? When was it created, updated, or last opened? An operating system’s file information functions will shed light on most of these elements for an individual file, but other tools are useful in revealing the same kind of information at the batch or collection level.
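
As a rough illustration of gathering that file-level context for a whole folder at once, here is a minimal Python sketch using only the standard library. The function name `describe_files` and the dictionary keys are my own hypothetical choices, and the sketch reports only what the file system records (size, last modified, last accessed). It leaves out creation time because not every file system records it consistently.

```python
from datetime import datetime, timezone
from pathlib import Path


def describe_files(folder):
    """Collect basic file-system metadata for every file under `folder`.

    `describe_files` is a hypothetical helper written for this example.
    """
    rows = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue  # skip directories and other non-file entries
        info = path.stat()
        rows.append({
            "path": str(path.relative_to(folder)),
            "extension": path.suffix.lower(),
            "size_bytes": info.st_size,
            # Timestamps come straight from the file system, in UTC.
            "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
            "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc).isoformat(),
        })
    return rows
```

Calling `describe_files` on a folder copied from an old disk returns one record per file, which can then be sorted by extension or date to see what a collection actually contains.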

File format identification is likely not the most glamorous aspect of any work, but it is incredibly useful, and learning effective ways to better maintain and read files you created weeks, months, or years ago can feel deeply powerful. It has become an increasingly common task in archival work, but is useful in our personal file management activities as well. Whether you’ve recovered a box of floppy disks from your parents’ attic, or you’re migrating files to a new computer or hard drive, or you’ve hit thrift store gold and want to see what you might discover on a vintage computer, basic file identification and organization are crucial. In my experience, the value of good technical and descriptive metadata can also never be overstated.

For identifying larger batches of diversely formatted files, open source tools like DROID (Digital Record Object Identification) and FITS (File Information Tool Set) provide additional functionality, but for getting started with digital object format identification and metadata extraction, Exiftool (described below) does the job quite nicely. The advantage of running a series of comparatively bespoke operations to learn about your files, rather than automating the same processes through a consolidated suite of tools or GUI, is that each discrete action and its functionality become more familiar.
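
To make one of those discrete operations concrete: many formats can be recognized by the first few bytes of the file, regardless of the extension in the file name. The sketch below is a deliberately tiny illustration of that idea, not a description of how DROID, FITS, or Exiftool are implemented. The signature table covers only a handful of well-known formats, and the function name `identify_by_signature` is hypothetical.

```python
# A very small table of well-known leading byte sequences ("magic numbers").
# Real identification tools use far larger and more precise signature sets.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP container (also used by .docx, .odt, .epub)"),
]


def identify_by_signature(path):
    """Return a format label based on the file's first bytes, or None."""
    with open(path, "rb") as handle:
        header = handle.read(16)  # the signatures above are all shorter than this
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None  # unknown to this small table, not necessarily unreadable
```

A result of `None` only means the format is outside this short list. Plain text files, for instance, have no fixed signature at all.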

For me, the tool that meets most of my needs is Exiftool, a versatile open source program developed by Phil Harvey. Exiftool runs on Windows, macOS, and Linux, and is easy to use even if you have never run a program from the command line before. If that is the case, the brief Codecademy command line tutorial is a useful introduction. As the name suggests, Exiftool can be used for extracting EXIF metadata, but it is compatible with most other technical and descriptive metadata formats as well. For example, it can be used to export embedded metadata from images maintained through photograph organization programs like Picasa and Adobe Bridge, in which users may have tagged photos with the names of people or places featured in each image or collection. Alternatively, a simple list of the file formats and versions on a given drive can be exported as a spreadsheet, allowing the user to identify files that may need migration or reformatting.
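
For readers who would like to script that spreadsheet export, here is a minimal sketch that calls Exiftool from Python. It assumes Exiftool is already installed and available on your PATH. The options `-csv` (one row per file) and `-r` (recurse into subfolders) and the tags `FileType`, `FileTypeExtension`, and `MIMEType` are standard Exiftool features. The function names and the example folder name are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_exiftool_command(directory):
    """Return the Exiftool arguments for a recursive file-type listing as CSV."""
    return [
        "exiftool",
        "-csv",                # write results as comma-separated values
        "-r",                  # include files in subfolders
        "-FileType",           # format as identified by Exiftool
        "-FileTypeExtension",  # the usual extension for that format
        "-MIMEType",
        str(directory),
    ]


def export_format_list(directory, output_csv):
    """Run Exiftool over `directory` and save its CSV output to `output_csv`."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool was not found on PATH")
    # Exiftool may report a non-zero exit status when some files cannot be
    # read, so the output is kept rather than treated as a failure.
    result = subprocess.run(
        build_exiftool_command(directory),
        capture_output=True,
        text=True,
        check=False,
    )
    Path(output_csv).write_text(result.stdout, encoding="utf-8")
    return Path(output_csv)
```

Something like `export_format_list("floppy_backup", "formats.csv")` would then produce a file that opens in any spreadsheet program, with one row per file and a column for each requested tag.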

Exiftool will help you extract and read the metadata for your files, but more information is generally needed to ensure that files remain readable and accessible, regardless of format. An additional resource that helps in this department is the free PRONOM technical registry, developed by the UK’s National Archives. PRONOM is designed to work with DROID, but serves as a useful resource on its own. Users can search by file extension and filter by version in order to obtain additional information about their files. This includes listings of previous and subsequent versions of the software with which the file was created, which is crucial to the safe incremental migration of files. Because PRONOM has been developed with the digital preservation community in mind, risks to sustainability or migration are identified for each version of any given format.

In my own work and research, I find myself thinking less frequently about the distant past, and increasingly about how best to steward digital objects into the future. As someone beginning to work as an instructor in the information sciences, I want to contribute to a future in which I rarely hear the phrase “I just don’t understand technology.” Thinking about ways to eradicate that sentence from the narratives of young professionals, and especially women, is one of my ongoing objectives.

Programs like Exiftool provide users with a flexible toolkit for accessing, exporting, and manipulating metadata, allowing them to break down some of the obfuscation inherent in so many digital interfaces. One major step in overcoming intimidation in the face of new technology is learning simple strategies with which to approach, identify, and contextualize digital objects and systems. Fear of computers is just one more unnecessary barrier to preserving, accessing, and generally understanding our own records and documentation.

Chelsea Gunn is a doctoral student at the University of Pittsburgh’s School of Information Sciences, where she studies archives and how we arrange, describe, and understand personal records that are created in electronic environments.
