spreadsheet to finding aid — the game plan

Where I work, I have a lot of legacy data that isn’t accessible to researchers. Since we manage our description in an archival management system that exports to EAD-encoded finding aids (and we publish images through the finding aid), I want to get as much of that converted to EAD as I can.

My current project may be my white whale. Several times throughout the last decades, there have been description, re-description, and digitization projects for a particular collection of negatives in our repository. The images are described at the item level and they were digitized TEN years ago, but still, researchers don’t have access to these images. We still make them sit down with gloves and a light box, and frankly, these images don’t get as much use as they should.

The negatives were described at the item level. Titles were transcribed and possibly enhanced, with liberal use of brackets and abbreviations. I also have information about the photographer, subject (generally), negative number, something called a shoot number that I still can’t suss out, gauge, two different date fields in different combinations of m/d/y (usually the same, sometimes different, which is hard to fathom for an image), and information about the Kodak Gold CD onto which it was digitized (seriously). In many cases, there are many images with the same title, same date and same photographer.

So, I started with an EAD mapping and a plan.


  • Image title will map to <unittitle>. I’ll need to do some major clean-up to write out abbreviations, acronyms and initialisms. It will also be good to pay attention to brackets and delete or use alternate punctuation, as appropriate.
  • Photographer will map to <origination><persname>. There seem to be variants of names and initials — I’ll use faceting and clustering in OpenRefine to bring consistency to these. Since these are amateur photographers, the names will not be authorized.
  • After doing a lot of thinking, I think I’m going to use the subject column as the series structure of the finding aid. A previous version of this finding aid has series scope and contents notes based on these subjects. If this were a collection of more notable photographers, I would organize the finding aid that way, but this seems the best of bad options for a collection that researchers will approach by subject.
  • The negative number will map to <unitid>. In the case of images with the same caption, date and photographer, I’ll plan to have one component with multiple values in unitid (separated by a delimiter, since Archivists’ Toolkit doesn’t allow for multiple unitids, BLERGH), and an extent statement indicating how many images are in a component. I’ll figure out an excel formal to calculate the extent statement, and I’ll collapse the multiple components into one using the blank down and join multi-valued cells features in OpenRefine.
  • Gauge will map to <dimensions>.
  • Dates will map to <unitdate>. It looks like the first date is more complete than the second, but there are some cases where there’s a second date and not a first. I’ll use excel formulas to break each apart by year, month and date, and create a new date column that asks whether the first date is present and, if not, adds the second date. I’ll also use excel formulas to convert to our (not DACS compliant, unfortunately)  date format and also to create normalized date formatting.
  • I’ll wait until the finding aid is created to map from c @id to information about the digitized file, since Archivists’ Toolkit doesn’t import that information (ARRRGGHHHH).
  • There are also some lines for missing negatives (these should be moved to the collection-level scope and contents note, for lack of a better place) and lines for negative numbers that were never asssigned to an image (which should just be deleted).

Most of this work happens in Excel and OpenRefine. The Excel spreadsheet will then be imported into oXygen, and I’ll use an XSLT (already written, thank goodness) to convert this to EAD. I’ll then import the EAD to Archivists’ Toolkit, since the institution has decided to make this the canonical source of data. Finally, I’ll export the EAD =OUT= of AT and use split multi-valued cells / paste down in OpenRefine to get an item-level list of images WITH AT-created unitids, and use MS Access to map that to information about the digitized file in the original spreadsheet (using negative number as the primary key). This then gets sent to digital library folks, who do the <dao> linking.

Does this all sound complicated and stupid? It should, because it is. I’m fairly confident that I’m using the most efficient tools for this kind of clean-up, but I’m sure part of my process could be improved. I also think that our tools and processes at my institution could be better.

My thoughts on how to use our tools better and avoid these kinds of clean-up projects in the future.

  • Most important. Don’t describe at the item level. Why does archival theory seem to go out the window when we start working with images? Describe by roll, by event, by creator — give the researcher better clues about the context of creation. Photographic records, like all records, can be described in aggregate. It’s very difficult to do so, though, which context is destroyed and provenance isn’t documented. Especially in this case, where images are digitized, there’s no reason to create item-level metadata.
  • Let’s have everyone use the same platform and keep the metadata together right away. Wouldn’t it be great if each image file were assigned a URI and linked to the component in the <dao> field BY DIGITIZATION TECHNICIANS DURING SCANNING? Sure, maybe there would need to be some post-processing and some of this data would change, but it would be all together right away from the outset, instead of relying on the crazy internal logic of file systems or secondary spreadsheets.
  • The project isn’t finished until the project’s finished. A ten-year gap between digitization and publication in the finding aid is a big problem.
  • Archivists’ Toolkit isn’t a very good tool for archivists who want to do bulk manipulation of data. There, I said it. It would be nearly impossible to make these changes in AT — I need tools like Excel, OpenRefine, XSLT, and the xpath find/replace functions in oXygen to change this much data. Sure, I can export and then re-import, but AT doesn’t reliably round-trip EAD.
  • Maybe we shouldn’t be using AT as our canonical data source. It really don’t offer much added value from data clean-up point of view, beyond being able to make bulk changes in a few fields (the names and subject modules are particularly useful, although our finding aids don’t even link that data!). And frankly, I’m not nuts about the EAD that it spits out. WHY DO YOU STRIP C @ID during import?!?! Why can’t extent be repeated? Why can’t some notes (like <physfacet>, which really isn’t even a note) be repeated? Why not multiple <unitid>s? And as I look at our AT-produced finding aids, I find a LOT of data mistakes that, thinking about it, are pretty predictable. A lot of crap gets thrown into <unittitle>. There’s confusion about the difference between container attributes and <extent>, <physdesc>, and <physfacet> notes. I’m not endorsing the hand-coding of finding aids, but I think that there was some slippage between “oh good! We don’t have to use an XML editor!” and “Oh good! No one needs to pay attention to the underlying encoding!”

I’ll be sure to report back when this project is done. Until then, keep me in your thoughts.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s