Tutorial — How to Turn a Spreadsheet into the Contents List of an EAD-Encoded Finding Aid

There are a lot of reasons to turn a spreadsheet into a finding aid. Maybe your donor sent you an inventory of a new accession in Excel, and you want to modify and re-use that data. Maybe you’ve determined that Excel has data entry features that Archivists’ Toolkit doesn’t have, and you want to take advantage of those. Maybe it’s a good in-between step between a legacy Word document and structured data.

At Tamiment, we have an MS Word inventory for the National Guardian’s photographic prints that has never been a finding aid. And looking at the inventory, we noticed that there were a few things to clean up — none of the prints had date information, the box numbering was confusing (A and B boxes), and the same folder title was often repeated as many as ten times instead of appearing once with an extent statement indicating that there are ten folders. There had also been a project many years ago to label archival folders with information about these prints, but the prints were never re-foldered.

This became a good back-burner project for our awesome graduate student, Giana. And since I get a lot of questions about how to transform this kind of data, let’s go through all of the steps together. My documents are all on GitHub, so you should feel free to use or steal or modify what I’ve made available (which, indeed, I used and stole and modified from others).

First, I copied the text from MS Word to Excel (actually, Google Spreadsheets), used filters to delete blank rows, and created a new column for box numbers, which I pulled down so that there was a box number on each line. I then added column headers for Date Start, Date End, New Box Number (I knew that boxes might swell because of re-foldering, and I wanted to get rid of 14A and the like and make sure that all box numbers were whole numbers), Folder Number, Extent, and Notes (just in case).

Then, Giana did the real work. She re-foldered prints, verified folder names, copied date information, deleted repeating rows and added extent information, and made sure that everything was more-or-less right. Her result looked like this:

National Guardian — Spreadsheet

This looked pretty good, although there were one or two things to fix — we deleted empty folders from the spreadsheet and threw away the physical folders, too. I also looked through her notes and did some clean-up. Finally, we saw that there were some missing materials that our former director had pulled from the files and never returned — I removed them from the finding aid, but I’ll add a note about them to the scope and contents note at the end. Hopefully, they’ll turn up someday. I also went through and did general proofreading — although Giana had already done a really good first pass of this.

The last thing I did with the spreadsheet was update the column headings to map to EAD fields.

Next, I import this spreadsheet into Oxygen. Did you know that you can do this? It’s here:

Importing Excel into Oxygen

I set my import preferences so that the column headers are used as tag names, and elect not to include some columns.

Settings
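
If the import works the way I expect, each spreadsheet row comes out as a little XML record whose child elements are named after the column headings, which is why updating the headings to match EAD was worth doing first. The element names below are just my illustration (oXygen wraps each row in its own element, which I’m calling <row> here), not necessarily the headings we used, but a single imported row looks something like this:

    <row>
        <container_box>15</container_box>
        <container_folder>3</container_folder>
        <unittitle>Sample folder title</unittitle>
        <unitdate_start>1952</unitdate_start>
        <unitdate_end>1954</unitdate_end>
        <extent>2</extent>
    </row>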

Now that I have my data in XML, I use an XSLT stylesheet to turn it into an EAD container list.

If you’re following along at home, the first thing you want to do is look at the XSLT and get comfy with it. It’s basically working like a good, old-fashioned mail merge in Word/Excel — it asks whether an element is present, then pulls that element in if it is. I’ve gone through and added the text “folders” after the extent statement, for instance, and I’ve modified unitdate to conform with my column names.

XSLT
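
And if you want a feel for what it does without squinting at the screenshot, here is a stripped-down sketch of the pattern, not the actual stylesheet on GitHub, using the hypothetical element names from my import example above. One template matches each row, builds a <c> component, and uses xsl:if so that optional columns only show up when they have values:

    <xsl:template match="row">
      <c level="file">
        <did>
          <unittitle><xsl:value-of select="unittitle"/></unittitle>
          <xsl:if test="unitdate_start != ''">
            <unitdate>
              <xsl:value-of select="unitdate_start"/>-<xsl:value-of select="unitdate_end"/>
            </unitdate>
          </xsl:if>
          <xsl:if test="extent != ''">
            <physdesc><extent><xsl:value-of select="extent"/> folders</extent></physdesc>
          </xsl:if>
          <container type="box"><xsl:value-of select="container_box"/></container>
          <container type="folder"><xsl:value-of select="container_folder"/></container>
        </did>
      </c>
    </xsl:template>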

Next, I set up my transformation in Oxygen (the little wrench icon). I give the transformation a name and tell it to apply this file to whatever XML I’m working on. I use the Saxon-PE engine, and in the output tab I tell it to open the transformed file in the editor.

Transformation

Since my XSLT is kind of dumb (it doesn’t deal with one folder vs. several folders, or a situation where there might not be an end date), I have to do some mass find-and-replaces. But for this project, it’s easier to do this clean-up work than to write the logic into the XSLT.

find-replace
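
For what it’s worth, the logic I’m skipping would look roughly like this in the stylesheet: an xsl:choose for singular versus plural, and a test for a missing end date. This is only a sketch of what the find-and-replaces stand in for; for a one-off project like this, the find-and-replace is faster.

    <!-- the conditional logic I chose not to write -->
    <xsl:if test="extent != ''">
      <physdesc>
        <extent>
          <xsl:value-of select="extent"/>
          <xsl:choose>
            <xsl:when test="extent = '1'"> folder</xsl:when>
            <xsl:otherwise> folders</xsl:otherwise>
          </xsl:choose>
        </extent>
      </physdesc>
    </xsl:if>
    <xsl:choose>
      <xsl:when test="unitdate_end != ''">
        <unitdate>
          <xsl:value-of select="unitdate_start"/>-<xsl:value-of select="unitdate_end"/>
        </unitdate>
      </xsl:when>
      <xsl:otherwise>
        <unitdate><xsl:value-of select="unitdate_start"/></unitdate>
      </xsl:otherwise>
    </xsl:choose>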

Finally, I paste the file list into the EAD exported from the Toolkit, which has the collection-level information, and re-import. Voilà!
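
In case it helps to picture it, the transformed components get pasted inside the <dsc> of that collection-level export, roughly like this (heavily abbreviated, and the wrapper elements will vary with your export settings):

    <ead>
      <eadheader>...</eadheader>
      <archdesc level="collection">
        <did>...</did>
        <!-- collection-level notes from the AT export -->
        <dsc>
          <!-- paste the transformed container list here -->
          <c level="file">...</c>
          <c level="file">...</c>
        </dsc>
      </archdesc>
    </ead>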

AT Record

spreadsheet to finding aid — the game plan

Where I work, I have a lot of legacy data that isn’t accessible to researchers. Since we manage our description in an archival management system that exports to EAD-encoded finding aids (and we publish images through the finding aid), I want to get as much of that converted to EAD as I can.

My current project may be my white whale. Several times over the last few decades, there have been description, re-description, and digitization projects for a particular collection of negatives in our repository. The images are described at the item level and they were digitized TEN years ago, but still, researchers don’t have access to these images. We still make them sit down with gloves and a light box, and frankly, these images don’t get as much use as they should.

The negatives were described at the item level. Titles were transcribed and possibly enhanced, with liberal use of brackets and abbreviations. I also have information about the photographer, subject (generally), negative number, something called a shoot number that I still can’t suss out, gauge, two different date fields in different combinations of m/d/y (usually the same, sometimes different, which is hard to fathom for an image), and information about the Kodak Gold CD onto which it was digitized (seriously). In many cases, there are many images with the same title, same date and same photographer.

So, I started with an EAD mapping and a plan.

Mapping:

  • Image title will map to <unittitle>. I’ll need to do some major clean-up to write out abbreviations, acronyms, and initialisms. I’ll also need to pay attention to brackets and either delete them or use alternate punctuation, as appropriate.
  • Photographer will map to <origination><persname>. There seem to be variants of names and initials — I’ll use faceting and clustering in OpenRefine to bring consistency to these. Since these are amateur photographers, the names will not be authorized.
  • After a lot of thinking, I’m going to use the subject column as the series structure of the finding aid. A previous version of this finding aid has series scope and contents notes based on these subjects. If this were a collection of more notable photographers, I would organize the finding aid by photographer instead, but this seems the best of bad options for a collection that researchers will approach by subject.
  • The negative number will map to <unitid>. In the case of images with the same caption, date, and photographer, I’ll plan to have one component with multiple values in unitid (separated by a delimiter, since Archivists’ Toolkit doesn’t allow for multiple unitids, BLERGH), and an extent statement indicating how many images are in a component (there’s a sketch of the target component after this list). I’ll figure out an Excel formula to calculate the extent statement, and I’ll collapse the multiple components into one using the blank down and join multi-valued cells features in OpenRefine.
  • Gauge will map to <dimensions>.
  • Dates will map to <unitdate>. It looks like the first date is more complete than the second, but there are some cases where there’s a second date and not a first. I’ll use Excel formulas to break each apart by year, month, and day, and create a new date column that asks whether the first date is present and, if not, adds the second date. I’ll also use Excel formulas to convert to our (not DACS-compliant, unfortunately) date format and also to create normalized date formatting.
  • I’ll wait until the finding aid is created to map from c @id to information about the digitized file, since Archivists’ Toolkit doesn’t import that information (ARRRGGHHHH).
  • There are also some lines for missing negatives (these should be moved to the collection-level scope and contents note, for lack of a better place) and lines for negative numbers that were never assigned to an image (which should just be deleted).
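
To keep myself honest about this mapping, here’s a sketch of the kind of component I’m aiming for. The values are invented for illustration, and the exact arrangement will depend on what AT tolerates on import:

    <c level="file">
      <did>
        <unittitle>Sample transcribed caption</unittitle>
        <unitid>1024; 1025; 1026</unitid>
        <origination><persname>Sample, Photographer</persname></origination>
        <unitdate normal="1948-05-01">5/1/1948</unitdate>
        <physdesc>
          <extent>3 negatives</extent>
          <dimensions>35mm</dimensions>
        </physdesc>
      </did>
    </c>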

Most of this work happens in Excel and OpenRefine. The Excel spreadsheet will then be imported into oXygen, and I’ll use an XSLT (already written, thank goodness) to convert this to EAD. I’ll then import the EAD to Archivists’ Toolkit, since the institution has decided to make this the canonical source of data. Finally, I’ll export the EAD OUT of AT and use split multi-valued cells / fill down in OpenRefine to get an item-level list of images WITH AT-created unitids, and use MS Access to map that to information about the digitized file in the original spreadsheet (using negative number as the primary key). This then gets sent to digital library folks, who do the <dao> linking.

Does this all sound complicated and stupid? It should, because it is. I’m fairly confident that I’m using the most efficient tools for this kind of clean-up, but I’m sure part of my process could be improved. I also think that our tools and processes at my institution could be better.

My thoughts on how to use our tools better and avoid these kinds of clean-up projects in the future:

  • Most important: don’t describe at the item level. Why does archival theory seem to go out the window when we start working with images? Describe by roll, by event, by creator — give the researcher better clues about the context of creation. Photographic records, like all records, can be described in aggregate. It’s very difficult to do so, though, when context has been destroyed and provenance isn’t documented. Especially in this case, where the images are digitized, there’s no reason to create item-level metadata.
  • Let’s have everyone use the same platform and keep the metadata together from the start. Wouldn’t it be great if each image file were assigned a URI and linked to the component in the <dao> field BY DIGITIZATION TECHNICIANS DURING SCANNING? Sure, maybe there would need to be some post-processing and some of this data would change, but it would all be together from the outset, instead of relying on the crazy internal logic of file systems or secondary spreadsheets.
  • The project isn’t finished until the project’s finished. A ten-year gap between digitization and publication in the finding aid is a big problem.
  • Archivists’ Toolkit isn’t a very good tool for archivists who want to do bulk manipulation of data. There, I said it. It would be nearly impossible to make these changes in AT — I need tools like Excel, OpenRefine, XSLT, and the XPath find/replace functions in oXygen to change this much data. Sure, I can export and then re-import, but AT doesn’t reliably round-trip EAD.
  • Maybe we shouldn’t be using AT as our canonical data source. It really doesn’t offer much added value from a data clean-up point of view, beyond being able to make bulk changes in a few fields (the names and subjects modules are particularly useful, although our finding aids don’t even link that data!). And frankly, I’m not nuts about the EAD that it spits out. WHY DO YOU STRIP C @ID during import?!?! Why can’t extent be repeated? Why can’t some notes (like <physfacet>, which really isn’t even a note) be repeated? Why not multiple <unitid>s? And as I look at our AT-produced finding aids, I find a LOT of data mistakes that, thinking about it, are pretty predictable. A lot of crap gets thrown into <unittitle>. There’s confusion about the difference between container attributes and <extent>, <physdesc>, and <physfacet> notes. I’m not endorsing the hand-coding of finding aids, but I think that there was some slippage between “Oh good! We don’t have to use an XML editor!” and “Oh good! No one needs to pay attention to the underlying encoding!”

I’ll be sure to report back when this project is done. Until then, keep me in your thoughts.