The Value of Archival Description, Considered

This is a talk that I gave at the Radcliffe Workshop on Technology and Archival Processing on April 3, 2014. I hope you enjoy what I had to say. I think it dovetails nicely with the work the four of us do on this blog.


I’m very happy to be here today. As Ellen mentioned, my name is Maureen Callahan. I currently work at the Tamiment Library at New York University in a technical services role. In our context, which I know isn’t unique, almost all of our arrangment and description work is done by very new professionals or pre-professionals. This means that most of my job is teaching, coaching and supervising – and making sure that all of the workers I supervise have the infrastructure, support, and knowledge they need to meet our obligations to users and donors.

Because I work with pre-professionals, I think it’s important to be deliberate and take the time to explain the values behind archival description – what our obligations are, how to make our work transparent, what’s valuable and what isn’t, how we should be thinking about how we spend our time, and how to look at the finding aid that we’ve created from a researcher’s point of view.


When the organizers asked me to present, they included a few questions, questions that have been weighing on my mind too.

In their initial email, Ellen Shea and Mary O’Connell Murphy asked, “Is the product of a finding aid worthy of the time required to make them considering emerging technologies? Where do you think research guides might be headed in the future? How do you think they must change in order to improve access to archival collections and meet today’s user’s needs?”

Most provocatively, they asked, “What do researchers really want from finding aids? Do they want them at all?”

And I think that the answer is no. And maybe. And yes.


At its core, I think that this question gets at what is and isn’t valuable about what archivists do, and what might be good for us to pay more attention to.

So, what do finding aids do? Why do we create them?

OK, so we can start by looking at finding aids as a way to address the practical problem of giving potential researchers access to unique or rare material that can only be found in a single location, behind a locked door in a closed stacks. Until you come here and show us your ID and solemnly swear that you’re going to follow our rules, the finding aid is all you get. This is the deal. So, to answer the question of whether researchers want finding aids – no. They don’t. They want the records. But they get the guide first.


And many parts of a finding aid – the parts that we spend so much of our time creating – take this imperfect surrogate role. Many finding aids are built on the model of looking at a body of records, dividing it into groupings (either physically or intellectually, usually both), and then faithfully representing files in that grouping to a mind-numbing level of meticulous detail. I’m going to call this model a map.


And what this slide, which is based on an analysis of the finding aids at the Tamiment Library, will show you is that yes, this work is getting done. We have plenty of information about what the materials tell us about their titles and dates and how much we have of it. But this slide only tells us information about finding aids that have been created. I also know that backlogs are still a problem at a lot of repositories, especially mine. This “mapping” model of tedious representation, starting at the beginning and going to the end, means that often the end never comes. We have plenty of collections that aren’t represented at all. Is this serving our users? Does this meet our donors’ expectations? Can’t we find a better way?


I’m looking forward to hearing from speakers today and tomorrow who will talk about how we can get machines to do some of this mapping for us. Because, as far as I’m concerned, good riddance. I don’t think that archivists are just secretaries for dead people, and I welcome as much automation as we can get for this kind of direct representation of what the records tell us about themselves.


Indeed, it’s already happening. At my institution, we’re just starting to work through the process of accessioning electronic records, and I can already see how tools like Forensic Toolkit help us to get electronic records to describe themselves.

After all, electronic records are records. Digital archives are archives. This is our present, future, and poorly-served past. And in the case of electronic records, we have ways of transcending the problem of our collections being singly, uniquely sited, requiring a mapping of what’s inside.

But some collections are, indeed, unique and sited. Before going on, I want to be pragmatic about the idea of scanning everything that isn’t born-digital, that does require a certain degree of mapping. I think we should be scanning a lot, I think we should be scanning much more than we are, but I don’t think that we necessarily should be scanning everything. I think we should scan what the people want. The city archives of Amsterdam, which has the most complete and sophisticated scanning operations that I’ve encountered, has committed to providing researchers with what they want not by scanning everything (they estimate that it would take 406 years to do so for all 739 million pages in their holdings, even in an extremely robust production environment) but by scanning what the users want to see. After all, what if you want to see the 739 millionth scan? And in order to figure out what the people want, we need some minimal level of mapping. Not every file, not in crazy, tedious detail, but some indication of what’s in a collection

So, we’ve dispensed with much of the map. Didn’t that feel good? What else is a finding aid? What else does the archivist do? What else do our researchers need from us?


At the next level of abstraction, a really good finding aid is a guide. In this painting by Eugene Delacroix, we see Virgil leading Dante across the river Styx. I don’t want to take this metaphor too far, but I do think that there’s a role for the archivist to help researchers understand our materials by explaining the collections, pointing out pitfalls and rich veins of content, rather than just representing titles on folders.

I can see, in some contexts, that it makes sense for an archivist to spend quality time really understanding the records and explaining this understanding so that each researcher doesn’t have to wade through it every time. When I teach description, I urge workers to evaluate rather than represent records. For instance, does a correspondence series include long, juicy, hand-written letters wherein the writer pours his heart out? Or are they dictated carbon copies based on forms? A title of “Letter from John Doe to Jane Smith” doesn’t tell us this, but an archivist’s scope and content note can. It takes a lot of time to type “Correspondence” and the date a zillion times. Wouldn’t researchers prefer an aggregate description and date range with a nice, full note about what kinds of correspondence with what kinds of information she can expect therein? This is a choice to guide rather than map.

So here, we’re representing information about the collection that a researcher would need to spend a lot of time to discover on his own. And by the way, I’m not claiming a breakthrough. Seasoned archivists do this all the time. It’s also what Greene and Meissner were talking about in their 2005 article – our value is in our focus on the aggregate and the judgment required to make sense of records, rather than just representing them.

So to answer the original question, I would say that maybe, yes, maybe, researchers do want these kinds of finding aids where some of the sensemaking has already been done for them. The scale of archives is large, and it may indeed be inefficient to expect researchers to browse scanned document after scanned document to get a good understanding of what this all means together.

But there’s an even higher level of abstraction central to our role as archivists that should be included in our finding aids, which I rarely see documented comprehensively or well. This is the information about a collection that no amount of time with a collection will reveal to a researcher – it has to do with the archivists’ interventions into a collection, the collection’s custodial history, and the contexts of the records’ creation.


This last bit – getting to understand who created records, why they were created, and what they provide evidence of – really gets to the nature of research. These are the questions that historians and journalists and lawyers and all of the communities that use our collections ask – they don’t just see artifacts, they see evidence that can help them make a principled argument about what happened in the past. They want to know about reliability, authenticity, chain of custody, gaps, absences and silences.

This is the core work of archivists. This is what we talked about over and over again when I was in graduate school, and what has been drilled into me as the true value we, as archivists, add to the research process. We occupy a position of responsibility, of commitment to transparency and access. Researchers expect us to tell them this information, and we do a terrible job of doing so.

The above slide is based on the same corpus of finding aids at the Tamiment Library. While we did a great job of documenting what we saw before us, we did an abysmal job of explaining who gave us the collection and under what circumstances, how we changed the collection when we processed it, and what choices we made about what stays in the collection and what’s removed. And from what I can tell, it’s pretty consistent with the kinds of meta-analysis done by Dean and Wisser, and also by Bron, Proffitt, and Washburn in their recent articles analyzing EAD tag usage.


Like I say, communicating this in the finding aid is some of the most important work we do, and we do a pretty bad job of it. I have no reason to believe that my library is unique in this.

Because I also know, when I go to describe records, especially legacy collections that have sat unprocessed for a long time, I often have to do this by guessing. I’m like an archaeologist who tries to figure out the life of these documents before they came to me based on the traces left behind. It’s what I most want to explain, and what I often have the least evidence of.

This is an area where curators — collectors — whatever you call them — can intervene, where the best of breed are invaluable. After all, we’re not doing archaeology and working with the remains of long-dead civilizations. Creators, heirs or successors are usually around — they’re the ones who packed the boxes and dropped off the materials. Let’s make sure that we sit them down and talk with them then. Let’s make sure we’re getting all of the good stuff. Let’s make sure we really understand the nature of the records before we ask the processing archivist — usually a person fairly low in the organizational hierarchy, often a new professional, and almost always the person with the least access to the creator — to labor at reconstruction when just asking the creator might reveal all.

I have one short anecdote from my own repository to help illustrate this problem. In 1992, the Tamiment Library acquired the records of the Church League of America from Liberty University in Lynchburg Virginia. The Church League of America was a group created in the 1930s to oppose left-wing and social gospel influences in Christian thought and organizations through research and advocacy. The first iteration of the finding aid for this collection could be described as a messy map – a complicated rendering of the folder titles found in this extensive collection, without much explanation of what it all means and how it came.

Two years ago, before I came to Tamiment, my colleagues did a re-processing project. In doing so, they realized that these records had a rich history and diverse creators — far richer than what the finding aid had indicated. It turns out that the collection is an amalgamation of many creators’ work, including the files of the Wackenhut Corporation, which started as a private investigations firm and moved on to be government contractor for private prisons. The organization maintained files on four million suspected dissidents, including files originally created by Karl Barslaag, a former HUAC staff member, and only donated them to the Church League of America in 1975 as a way of side-stepping the Fair Credit Reporting Act.

Until re-processing happened, researchers had an incomplete picture of the relationship between private commerce and non-profit organizations that converged to become the lobbying arm of the anti-Communist religious right.

So back to our original question. Do researchers want finding aids qua finding aids? No, maybe, yes. They want the stuff, not descriptions of the stuff. They might want some help navigating the stuff. And they absolutely want all the help that they can get with uncovering the story behind the story.


Before I turn this over to Trevor, I want to add a brief coda about how we should be thinking of finding aids as discovery tools as long as we decide to have them.

Let’s start with a reality check – how are finding aids used? What do we know about information-seeking behavior around archival resources?

The first and most important thing that we know is that discovery happens through search engines. It is true that some sophisticated researchers know what kinds of records are held at what repositories – that the Tamiment Library holds records of labor and the radical left, or that Salman Rushdie’s papers are at Emory.  But I think that we can all agree that “just knowing” isn’t a good strategy to make sure that researchers discover our materials!


This was the understanding that we started with at Princeton (my previous job) when we decided to revise our finding aids portal. Previously, our finding aids looked like a lot of other finding aids – very, very long, often monograph-length webpages that give a map – and the better once (there were many better once there), would also be a good guide as well.

Basically, we decided to surrender to Google. We hoped that by busting apart the finding aid into the components that archivists create (collections, series, files and items), and letting Google index it all, our users would be able to come directly to the content that they want to find.


This is the dream. A researcher searches Google for George Kennan’s the Long Telegram, and we can give her exactly what she’s looking for, in the context of the rest of the papers. We also wanted the finding aid to be actionable – a researcher can ask a question about the material, request to see it in the reading room, and, if it had been scanned, would be able to look at images directly in the context of the finding aid.


In this case, you can see a report on Jack Ruby from Allen Dulles’s Warren Commission files.

While we’re putting so much effort into making our finding aids into structured data, let’s make our finding aids function as data. Let’s make it so that we can sort, filter, compare, comment and annotate. Why do we take our EAD, which we’ve painstakingly marked up, and render it in finding aids as flat HTML?



Let’s work together to take the next step, to think critically about the metadata we’re creating, and then make sure that it’s readable by the machines that present it to our users.

spreadsheet to finding aid — the game plan

Where I work, I have a lot of legacy data that isn’t accessible to researchers. Since we manage our description in an archival management system that exports to EAD-encoded finding aids (and we publish images through the finding aid), I want to get as much of that converted to EAD as I can.

My current project may be my white whale. Several times throughout the last decades, there have been description, re-description, and digitization projects for a particular collection of negatives in our repository. The images are described at the item level and they were digitized TEN years ago, but still, researchers don’t have access to these images. We still make them sit down with gloves and a light box, and frankly, these images don’t get as much use as they should.

The negatives were described at the item level. Titles were transcribed and possibly enhanced, with liberal use of brackets and abbreviations. I also have information about the photographer, subject (generally), negative number, something called a shoot number that I still can’t suss out, gauge, two different date fields in different combinations of m/d/y (usually the same, sometimes different, which is hard to fathom for an image), and information about the Kodak Gold CD onto which it was digitized (seriously). In many cases, there are many images with the same title, same date and same photographer.

So, I started with an EAD mapping and a plan.


  • Image title will map to <unittitle>. I’ll need to do some major clean-up to write out abbreviations, acronyms and initialisms. It will also be good to pay attention to brackets and delete or use alternate punctuation, as appropriate.
  • Photographer will map to <origination><persname>. There seem to be variants of names and initials — I’ll use faceting and clustering in OpenRefine to bring consistency to these. Since these are amateur photographers, the names will not be authorized.
  • After doing a lot of thinking, I think I’m going to use the subject column as the series structure of the finding aid. A previous version of this finding aid has series scope and contents notes based on these subjects. If this were a collection of more notable photographers, I would organize the finding aid that way, but this seems the best of bad options for a collection that researchers will approach by subject.
  • The negative number will map to <unitid>. In the case of images with the same caption, date and photographer, I’ll plan to have one component with multiple values in unitid (separated by a delimiter, since Archivists’ Toolkit doesn’t allow for multiple unitids, BLERGH), and an extent statement indicating how many images are in a component. I’ll figure out an excel formal to calculate the extent statement, and I’ll collapse the multiple components into one using the blank down and join multi-valued cells features in OpenRefine.
  • Gauge will map to <dimensions>.
  • Dates will map to <unitdate>. It looks like the first date is more complete than the second, but there are some cases where there’s a second date and not a first. I’ll use excel formulas to break each apart by year, month and date, and create a new date column that asks whether the first date is present and, if not, adds the second date. I’ll also use excel formulas to convert to our (not DACS compliant, unfortunately)  date format and also to create normalized date formatting.
  • I’ll wait until the finding aid is created to map from c @id to information about the digitized file, since Archivists’ Toolkit doesn’t import that information (ARRRGGHHHH).
  • There are also some lines for missing negatives (these should be moved to the collection-level scope and contents note, for lack of a better place) and lines for negative numbers that were never asssigned to an image (which should just be deleted).

Most of this work happens in Excel and OpenRefine. The Excel spreadsheet will then be imported into oXygen, and I’ll use an XSLT (already written, thank goodness) to convert this to EAD. I’ll then import the EAD to Archivists’ Toolkit, since the institution has decided to make this the canonical source of data. Finally, I’ll export the EAD =OUT= of AT and use split multi-valued cells / paste down in OpenRefine to get an item-level list of images WITH AT-created unitids, and use MS Access to map that to information about the digitized file in the original spreadsheet (using negative number as the primary key). This then gets sent to digital library folks, who do the <dao> linking.

Does this all sound complicated and stupid? It should, because it is. I’m fairly confident that I’m using the most efficient tools for this kind of clean-up, but I’m sure part of my process could be improved. I also think that our tools and processes at my institution could be better.

My thoughts on how to use our tools better and avoid these kinds of clean-up projects in the future.

  • Most important. Don’t describe at the item level. Why does archival theory seem to go out the window when we start working with images? Describe by roll, by event, by creator — give the researcher better clues about the context of creation. Photographic records, like all records, can be described in aggregate. It’s very difficult to do so, though, which context is destroyed and provenance isn’t documented. Especially in this case, where images are digitized, there’s no reason to create item-level metadata.
  • Let’s have everyone use the same platform and keep the metadata together right away. Wouldn’t it be great if each image file were assigned a URI and linked to the component in the <dao> field BY DIGITIZATION TECHNICIANS DURING SCANNING? Sure, maybe there would need to be some post-processing and some of this data would change, but it would be all together right away from the outset, instead of relying on the crazy internal logic of file systems or secondary spreadsheets.
  • The project isn’t finished until the project’s finished. A ten-year gap between digitization and publication in the finding aid is a big problem.
  • Archivists’ Toolkit isn’t a very good tool for archivists who want to do bulk manipulation of data. There, I said it. It would be nearly impossible to make these changes in AT — I need tools like Excel, OpenRefine, XSLT, and the xpath find/replace functions in oXygen to change this much data. Sure, I can export and then re-import, but AT doesn’t reliably round-trip EAD.
  • Maybe we shouldn’t be using AT as our canonical data source. It really don’t offer much added value from data clean-up point of view, beyond being able to make bulk changes in a few fields (the names and subject modules are particularly useful, although our finding aids don’t even link that data!). And frankly, I’m not nuts about the EAD that it spits out. WHY DO YOU STRIP C @ID during import?!?! Why can’t extent be repeated? Why can’t some notes (like <physfacet>, which really isn’t even a note) be repeated? Why not multiple <unitid>s? And as I look at our AT-produced finding aids, I find a LOT of data mistakes that, thinking about it, are pretty predictable. A lot of crap gets thrown into <unittitle>. There’s confusion about the difference between container attributes and <extent>, <physdesc>, and <physfacet> notes. I’m not endorsing the hand-coding of finding aids, but I think that there was some slippage between “oh good! We don’t have to use an XML editor!” and “Oh good! No one needs to pay attention to the underlying encoding!”

I’ll be sure to report back when this project is done. Until then, keep me in your thoughts.