Clean Up: Dates and OpenRefine

Here are some of the ways we are using OpenRefine to clean up dates in accession records based on our instructions. In some cases I have included more than one way to complete the same activity. There are probably more elegant ways to do this, so if you have other suggestions please let us know in the comments!

In this example we start by running formulas on our unitdateinclusive column, split the results into begin and end dates, and then return to format our date expression.

Periodically I use Edit Cells>Common Transforms>Trim leading and trailing whitespace just to keep things tidy. Be sure to run this anytime you’ll be using an expression that looks at the beginning or end of a cell.

Remove various characters, punctuation, and circa notations

Use Edit Cells>Transform with the following formulas:

  • value.replace('circa', ' ').replace('ca.', ' ').replace('Ca.', ' ').replace('c.', ' ').replace('c', ' ').replace('unknown', ' ').replace('n.d.', 'undated').replace(', undated', ' and undated').replace('no date', ' ')
    • You could use Facet>Text facet and enter one of the words to analyze the fields first. Then either use all or part of the formula above or make use of the cluster function. Note that the longer circa forms need to be replaced before the shorter ones, or 'c.' will mangle 'ca.' and 'circa' before they can match.
  • value.replace(/[?]/, ' ') or value.replace("?", " ") [These do the same thing, just showing different syntax you can use.]
  • value.replace("[", " ").replace("]", " ")
  • value.replace("'s", "s")
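A rough Python sketch of a clean-up chain like the one above can help sanity-check the replacements outside OpenRefine. This is an illustration, not the post's exact formula: it applies the longer circa forms first and leaves out the risky bare-'c' replacement, which would also strip the 'c' out of ordinary words.

```python
# Chained replacements, longest patterns first, mirroring the GREL above.
CLEANUPS = [
    ("circa", " "),
    ("ca.", " "),
    ("Ca.", " "),
    ("c.", " "),
    ("no date", " "),
    ("n.d.", "undated"),
    (", undated", " and undated"),
    ("unknown", " "),
    ("?", " "),
    ("[", " "),
    ("]", " "),
    ("'s", "s"),
]

def clean_date(value: str) -> str:
    for old, new in CLEANUPS:
        value = value.replace(old, new)
    return value.strip()

print(clean_date("circa 1950's?"))  # -> 1950s
print(clean_date("[1942]"))         # -> 1942
```

As in GREL, the order of the pairs matters: each replacement sees the output of the one before it.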

Convert to standardized dates

  • Edit Cells>Common Transforms>To text
  • Edit Cells>Common Transforms>To date

From here we’ll run a set of operations to give more conformity to the data. (This is where there must be better ways of doing this.)

  • Edit Cells>Common Transforms>To text
  • Change months to numbers with
    • value.replace('January ', '01/').replace('February ', '02/').replace('March ', '03/').replace('April ', '04/').replace('May ', '05/').replace('June ', '06/').replace('July ', '07/').replace('August ', '08/').replace('September ', '09/').replace('October ', '10/').replace('November ', '11/').replace('December ', '12/')
    • value.replace('Jan ', '01/').replace('Feb ', '02/').replace('Mar ', '03/').replace('Apr ', '04/').replace('May ', '05/').replace('Jun ', '06/').replace('Jul ', '07/').replace('Aug ', '08/').replace('Sept ', '09/').replace('Sep ', '09/').replace('Oct ', '10/').replace('Nov ', '11/').replace('Dec ', '12/')
  • Replace the comma and space after the day, before the year: value.replace(', ', '/')
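The month substitutions can be sketched the same way. One thing worth noticing: longer names have to be replaced before their abbreviations ('Sept ' before 'Sep ', 'September ' before either), or the shorter pattern will mangle the longer one. This is a hypothetical helper, not part of the post's workflow:

```python
# Month-name to month-number replacements, mirroring the GREL chains.
MONTHS = {
    "January": "01", "February": "02", "March": "03", "April": "04",
    "May": "05", "June": "06", "July": "07", "August": "08",
    "September": "09", "October": "10", "November": "11", "December": "12",
    "Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "Jun": "06",
    "Jul": "07", "Aug": "08", "Sept": "09", "Sep": "09", "Oct": "10",
    "Nov": "11", "Dec": "12",
}

def numeric_months(value: str) -> str:
    # Replace longer names first so 'Sep ' never clobbers 'September '.
    for name in sorted(MONTHS, key=len, reverse=True):
        value = value.replace(name + " ", MONTHS[name] + "/")
    # Then replace the comma-space between day and year.
    return value.replace(", ", "/")

print(numeric_months("September 21, 1987"))  # -> 09/21/1987
```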

We now have dates in a single, standard format, as opposed to the variations we saw in the Date Formats post.

Create begin and end date columns with this data:

  • Duplicate the column: Edit Column>Add column based on this column… and by supplying no formula we can copy over all the data to our new column.
  • Split the new column using Edit Column>Split into several columns… and designate "-" as our separator.
  • Rename the first column "date_1_begin" and the second "date_1_end".
  • Run Edit Cells>Transform: toString(toDate(value), "yyyy-MM-dd") on both columns to get the format required by ArchivesSpace, without having to deal with the day, hours, minutes, and seconds of the date format in OpenRefine.
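The split-and-reformat steps above could be sketched in Python as follows. The input formats handled here (mm/dd/yyyy, mm/yyyy, and bare years) are assumptions about what the earlier clean-up produces; like OpenRefine's toDate, this pads missing months and days to January 1.

```python
# Split a cleaned date on the hyphen into begin/end, then normalize each
# to the yyyy-mm-dd form ArchivesSpace expects.
from datetime import datetime

def split_dates(value: str):
    parts = [p.strip() for p in value.split("-", 1)]
    begin = parts[0]
    end = parts[1] if len(parts) > 1 else ""
    return begin, end

def to_iso(value: str) -> str:
    for fmt in ("%m/%d/%Y", "%m/%Y", "%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # leave anything unparseable for manual review

begin, end = split_dates("09/21/1987-03/05/1990")
print(to_iso(begin), to_iso(end))  # -> 1987-09-21 1990-03-05
```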

We’ll still want to review the date begin column to identify the rows that had two single dates indicated by commas, ampersands, semicolons, or some other designator. We would separate out the second date into “date 2” fields.


Still on our list to address:

  • Standardizing dates into our preferred local natural language for the date expression
  • Converting decades (e.g., 1920s) programmatically
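One possible programmatic take on the decade problem mentioned above: expand a value like 1920s into begin and end years. This is a hypothetical helper, not part of the workflow described in the post.

```python
# Turn a decade string such as '1920s' into a (begin, end) year pair.
import re

def decade_range(value: str):
    m = re.fullmatch(r"(\d{3})0s", value.strip())
    if not m:
        return None  # not a decade; leave for other handling
    start = int(m.group(1) + "0")
    return start, start + 9

print(decade_range("1920s"))  # -> (1920, 1929)
```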



OpenRefine and Messy Legacy Access Points in an Archivists’ Toolkit Database

After having read posts on this blog and articles concerning the use of OpenRefine to handle metadata stored in Excel and Access files, I found myself asking how this could be done with an Archivists’ Toolkit (MySQL) database. Since the literature was not forthcoming, I did my own experiment, which Maureen graciously offered me the space here to describe. Before attempting this on a larger scale, you may wish to create a local version of Archivists’ Toolkit on your computer to test it with. To do this in a working archives, contact your database administrator and work with her on determining how you’ll do the project.

For my experiment, I didn’t work on an active version of the database. Instead I duplicated my Archivists’ Toolkit database into something called `messydb` and temporarily linked it to my Archivists’ Toolkit software.

I chose to restrict my experiment to personal names, but with a little more time going through the database structure/export, I could have done all names and subjects. I seeded the database with 5 less-optimal versions of 3 names which already existed. I did this in Archivists’ Toolkit by opening 3 different records and linking new improper names to them. I also created a non-preferred form for one of the names, for the sake of a later part of the experiment. I backed up this problem database so that I could reload and repeat the experiment as many times as necessary.

Next, I had to write my queries.0 I began by working from the database export I’d used to create the duplicate to determine the tables and fields which would be relevant to the project. I determined that in the table `names` the field `nameId` was used to create a primary key for the name entry (used as a foreign key in other tables) and `sortName` was the best way to view a full version of the name. There’s also an important field called `nameType` which is used to designate that the name is a “Person” (vs. “Corporate Body” etc.). So, I wrote a query which would take the nameId and the sortName from any entry where the nameType was “Person” and put them into a comma-separated file.

SELECT nameId, sortName INTO OUTFILE '/names.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
FROM names
WHERE nameType = 'Person';


This resulted in the following file. I then opened it in Notepad++ so that I could quickly add comma-separated headers before importing it into OpenRefine. These weren’t strictly necessary, but I found adding them at this point helpful.
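Instead of hand-adding headers in Notepad++, the same thing can be scripted: prepend a header row to the exported file before loading it into OpenRefine. The file names here follow the post's example and are otherwise arbitrary.

```python
# Prepend a header row to the exported names.csv.
import csv

def add_headers(src="names.csv", dst="names_with_headers.csv"):
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["nameId", "sortName"])  # matches the SELECT above
        writer.writerows(rows)
```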

Adding Headers in Notepad ++

Working in OpenRefine

I then imported this file into OpenRefine and used its cluster and edit option, ticking off metaphone3 as the keying function.

The results I got were:

Burns, Jacob A.
Burns, Jacob, 1902-1993
Burns, Jacob

O'Connor, Sandra D.
O'Connor, Sandra Day 1930-

Barron, Jerome
Barron, Jerome A.

which is all well and good, but if you recall above I said that I’d put in three problem names for Jacob Burns. The name “Burns, J.” wasn’t caught by metaphone3, which didn’t parse the J as close enough to Jacob. Of course, J could also be Jerome or Jacqueline or James. I’ll come back to this at the end.
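For readers curious what key-collision clustering does under the hood, here is a toy version using OpenRefine's simpler 'fingerprint' key rather than metaphone3 (which requires an external library). Names that share a key land in the same candidate-duplicate group; like metaphone3, this cruder key also fails to group "Burns, J." with the others.

```python
# Toy key-collision clustering: bucket names by a normalized key.
from collections import defaultdict
import re

def fingerprint(name: str) -> str:
    # Lowercase, strip punctuation, sort the unique tokens.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(set(tokens)))

def cluster(names):
    buckets = defaultdict(list)
    for n in names:
        buckets[fingerprint(n)].append(n)
    # Only buckets with more than one member are candidate duplicates.
    return [group for group in buckets.values() if len(group) > 1]

print(cluster(["Burns, Jacob", "Jacob Burns", "Burns, J."]))
# -> [['Burns, Jacob', 'Jacob Burns']]
```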

Now that I’ve got most of the duplicates selected, it’s not as simple as using the editing function in OpenRefine to standardize the names. Even if you’re sure that these names are true duplicates, they must be edited within Archivists’ Toolkit. There are three ways to do it. No matter the method, I need to first flag all the names, sort to select all flagged names, export this view of the flagged names into an Excel file, and sort them by name in Excel.2 Now we have a list of names and corresponding nameIds from which to work.

Removing the Duplicates

The first method is to simply take the exported Excel file and work from it. This would involve going into the Names module of Archivists’ Toolkit and locating the finding aids attached to each improper name form. The archivist would double-check in each finding aid that this is really the same person, then replace it with the preferred name from the possible list. After it’d been removed from all linked records, the name could be deleted in the Names module.

The second method is one for which I’m still writing the specific SQL sequence (it involves 5 table joins and one temporary loop). The result will pull the following table.fields: resources.eadFaUniqueIdentifier (<eadid>), resources.findingAidTitle (<titleproper>), and names.sortName (display version of the name) into a list for any cases where the names.nameId is one of the potential duplicates. This could print into a neat list which the archivist could then use to view every finding aid where a problem name is linked without as much repetitive work as the first method would require.

The third method involves a mix of either the first or second and a SQL batch update. Using either the first or second method, the archivist would confirm that the names are true duplicates. Using the second method, for example, might allow her to easily view the finding aids online using the eadFaUniqueIdentifier/<eadid> and scroll through them to double check the name. Then she could follow these three steps to do SQL batch updates using the appropriate nameIds.

Removing Duplicates with SQL

As I begin this section, I feel compelled to remind everyone to check with your database administrator before trying this method. This may be outside the bounds of what she’s willing to do, and there are probably good reasons why. If she’s not able to help you with this, use methods one or two. You will also need her assistance to use the second method, but as it’s just running a query to generate a list of names and not altering anything in the database, she’s more likely to work with you on it.

Updating the Names

Archivists’ Toolkit’s database uses the linking table `archdescriptionnames` to handle links between the name records and the archival records. There are other ways to update this linking table, but the simplest query is the following three lines, where the number in the SET row is the nameId of the good version of the name and the number in the WHERE row is the nameId of the deprecated name. With this example, you’d have to run one query for each name, but a good macro or copy/paste setup could help you generate it pretty quickly.

UPDATE archdescriptionnames
SET primaryNameID=6
WHERE primaryNameID=10001;
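As noted above, a macro or copy/paste setup can generate one UPDATE per deprecated name; a short script can do the same from a mapping of deprecated nameIds to their preferred replacements. The IDs below are the post's examples; review the generated statements with your database administrator before running anything.

```python
# Generate one UPDATE statement per deprecated-name id.
def update_statements(mapping, table="archdescriptionnames"):
    # mapping: {deprecated_nameId: preferred_nameId}
    return [
        f"UPDATE {table} SET primaryNameID={good} WHERE primaryNameID={bad};"
        for bad, good in mapping.items()
    ]

for stmt in update_statements({10001: 6, 10002: 6}):
    print(stmt)
```

Passing table="nonpreferrednames" would generate the analogous statements for the non-preferred-names step below.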


Handling Any Non-Preferred Names

At this point, the main mission has been accomplished. All the deprecated names have been removed from the finding aids and have been replaced with the optimized version. However, if any non-preferred forms were created for those now-deprecated names, you’ll be unable to simply delete the unwanted names from your database without handling the non-preferred forms first. This part mirrors above. The query below will update each non-preferred name record that’s connected to the wrong name & connect it to the right one.

UPDATE nonpreferrednames
SET primaryNameID=6
WHERE primaryNameID=10001;

If you’d rather just delete the non-preferred names for any deprecated name, mimic the query below, but change `names` to `nonpreferrednames`.

Deleting Deprecated Names

Now that the deprecated names have been removed from records and disconnected from their non-preferred versions, they can be deleted. This is a very important step, since you don’t want someone using AT’s features later on to add the wrong name to their record.

DELETE FROM names
WHERE nameId=10001
OR nameId=10002
OR nameId=20001
OR nameId=20002;

Voila, you’re done!

Final Thoughts

Like all the other work done using metaphone3, this is only as good at catching duplications as the phonetic algorithm allows. In my case, it caught 5 out of 6 duplications and the duplicate it missed was rather different.


0. To run these queries on a local installation, navigate to your phpmyadmin in your browser, probably http://localhost/phpmyadmin/ then click on the database, click on the SQL tab at the top when viewing the database, and run your queries in the SQL box.

1. Line-by-line, this 1) pulls each nameId and sortName into a file named names.csv, which can be found at the drive root (C: in this case), 2) with commas between each field, 3) and encloses the contents of each field in " " (which keeps CSV software from thinking sort names like "Burns, Jacob" are two fields instead of one). It 4) pulls these fields from the table `names` 5) whenever the `nameType` field is “Person.” The order makes writing it out as an ordered description a little tricky, but is proper SQL order.

2. I could have done the final step in OpenRefine, but I found it wasn’t handling the alphabetical sort correctly.

3. Line-by-line, this tells the database to 1) update the table `archdescriptionnames` by 2) setting `primaryNameID` to the number given 3) in every row where `primaryNameID` currently equals the number in the last line. So if the old number occurs once, it’ll be replaced once. If it occurs 150 times, it’ll be replaced 150 times.

Case Study: Clean Data, Cool Project

Every now and then I get to work on a project from the very beginning, meaning that instead of cleaning up legacy data, I get to define and collect the data from scratch. Such was the case with one of Duke’s recent acquisitions, the records of the Southern Poverty Law Center Intelligence Project. Beginning in the 1970s, SPLC collected publications and ephemera from a wide range of right-wing and left-wing extremist groups. The Intelligence Project included groups monitored by SPLC for militia-like or Ku Klux Klan-like activities. There are also many organizations represented in the collection that are not considered “hate groups”– they simply made it onto SPLC’s radar and therefore into the Project’s records. The collection arrived at Duke in good condition, but very disorganized. Issues of various serial titles were spread across 90 record cartons with no apparent rhyme or reason. Inserted throughout were pamphlets, fliers, and correspondence further documenting the organizations and individuals monitored by SPLC.

What do you do when an archival collection arrives and consists mostly of printed materials and serials? In the past, Duke did one of two things: either pull out the books/serials and catalog them separately, or leave them in the archival collection and list them in the finding aid, sort of like a bibliography within a box list. This project was a great opportunity to try out something new. In consultation with our rare book and serials catalogers, we developed a hybrid plan to handle SPLC. Since we had to do an intensive sort of the collection anyway, I used that chance to pull out the serials and house each title separately. They are now being cataloged individually by our serials cataloger, which will get them into OCLC and therefore more publicly available than they would ever be if just buried in a list in the finding aid. She is also creating authority records for the various organizations and individuals represented in the collection, allowing us to build connections across the various groups as they merged and split over time. While she catalogs the serials, I have been archivally processing the non-serial pieces of the collection, tracking materials by organization and describing them in an AT finding aid. When all of the serials are cataloged, I will update the finding aid to include links to each title, so that although the printed materials have been physically separated from their archival cousins, the entire original collection will be searchable and integrated intellectually within the context of the SPLC Collection.

To further ensure that the SPLC serials did not lose their original provenance, we developed a template that our cataloger is applying to each record to keep the titles intellectually united with their original collection. All of the serials being cataloged are receiving 541 and 561 fields identifying them as part of the SPLC Collection within the Rubenstein Special Collections Library. We are also adding 710s for the Southern Poverty Law Center, and an 856 that includes a link to the SPLC collection guide. (Duke inserts all its finding aid links in the 856 field, but we rarely do this for non-manuscript catalog records.) The result is a catalog record for each serial that makes it blatantly obvious that the title was acquired through the SPLC Collection, and that there are other titles also present within the collection, should researchers care to check out the links. But, cataloging the serials this way also allows the researcher to find materials without necessarily searching for “SPLC.”


An example of one of the SPLC serials: The Crusader, a KKK publication.

Along with hammering out our various print and manuscript workflows to better meet the needs of this collection, we also saw it as an opportunity to create and collect data that would allow us to easily extract information from all the discrete catalog records we are creating. We are being as consistent as possible with controlled vocabularies. Our serials cataloger is adding various 7xxs to track each publisher using RBMS or LOC relator codes. LOC geographic headings are being added as 752s. We are also trying to be consistent in applying genre terms in the 655 field using the RBMS gathering term “Political Works.”


A view of the MARC fields from The Crusader’s catalog record.

Equally important, we are replicating this sort of data collection in the archival description of the non-serial portions of the SPLC Collection. When we finally reunite the serials with the finding aid, the same sort of geographic, subject, and publisher data will allow us to match up all of the fields and create relationships between an organization’s random fliers and its various newsletters.

Furthermore, my colleagues and I have dreams of going beyond a basic finding aid to create some sort of portal that will capitalize on our clean data to offer researchers a new way to access this collection. SPLC’s own website has a neat map of the various hate groups it has identified in the United States, but we would like to build something that specifically addresses the organizations and topics represented in this particular collection–after all, the Intelligence Project collected materials from all sorts of groups. We’re thinking about using something like Google Fusion Tables or some other online tool that can both map and sort the groups and their various agendas, but also connect back to the catalog records and collection guide so that researchers can quickly get to the original sources too.

I’ll have more to report on this cool project — and what we end up doing with our clean data — as it continues to progress over the next few months. Already, our serials cataloger has created 55 new OCLC records for various serial titles, and has replaced or enhanced another 140. She’s about halfway done with the cataloging part of the project. With so many of these groups being obscure, secretive, or short-lived, we believe that creating such thorough catalog records is worth our time and energy. Not only will it make the titles widely discoverable in OCLC, but hopefully it will build connections for patrons across the diverse organizations represented within this collection.

The Value of Archival Description, Considered

This is a talk that I gave at the Radcliffe Workshop on Technology and Archival Processing on April 3, 2014. I hope you enjoy what I had to say. I think it dovetails nicely with the work the four of us do on this blog.


I’m very happy to be here today. As Ellen mentioned, my name is Maureen Callahan. I currently work at the Tamiment Library at New York University in a technical services role. In our context, which I know isn’t unique, almost all of our arrangement and description work is done by very new professionals or pre-professionals. This means that most of my job is teaching, coaching and supervising – and making sure that all of the workers I supervise have the infrastructure, support, and knowledge they need to meet our obligations to users and donors.

Because I work with pre-professionals, I think it’s important to be deliberate and take the time to explain the values behind archival description – what our obligations are, how to make our work transparent, what’s valuable and what isn’t, how we should be thinking about how we spend our time, and how to look at the finding aid that we’ve created from a researcher’s point of view.


When the organizers asked me to present, they included a few questions, questions that have been weighing on my mind too.

In their initial email, Ellen Shea and Mary O’Connell Murphy asked, “Is the product of a finding aid worthy of the time required to make them considering emerging technologies? Where do you think research guides might be headed in the future? How do you think they must change in order to improve access to archival collections and meet today’s user’s needs?”

Most provocatively, they asked, “What do researchers really want from finding aids? Do they want them at all?”

And I think that the answer is no. And maybe. And yes.


At its core, I think that this question gets at what is and isn’t valuable about what archivists do, and what might be good for us to pay more attention to.

So, what do finding aids do? Why do we create them?

OK, so we can start by looking at finding aids as a way to address the practical problem of giving potential researchers access to unique or rare material that can only be found in a single location, behind a locked door in a closed stacks. Until you come here and show us your ID and solemnly swear that you’re going to follow our rules, the finding aid is all you get. This is the deal. So, to answer the question of whether researchers want finding aids – no. They don’t. They want the records. But they get the guide first.


And many parts of a finding aid – the parts that we spend so much of our time creating – take this imperfect surrogate role. Many finding aids are built on the model of looking at a body of records, dividing it into groupings (either physically or intellectually, usually both), and then faithfully representing files in that grouping to a mind-numbing level of meticulous detail. I’m going to call this model a map.


And what this slide, which is based on an analysis of the finding aids at the Tamiment Library, will show you is that yes, this work is getting done. We have plenty of information about what the materials tell us about their titles and dates and how much we have of it. But this slide only tells us information about finding aids that have been created. I also know that backlogs are still a problem at a lot of repositories, especially mine. This “mapping” model of tedious representation, starting at the beginning and going to the end, means that often the end never comes. We have plenty of collections that aren’t represented at all. Is this serving our users? Does this meet our donors’ expectations? Can’t we find a better way?


I’m looking forward to hearing from speakers today and tomorrow who will talk about how we can get machines to do some of this mapping for us. Because, as far as I’m concerned, good riddance. I don’t think that archivists are just secretaries for dead people, and I welcome as much automation as we can get for this kind of direct representation of what the records tell us about themselves.


Indeed, it’s already happening. At my institution, we’re just starting to work through the process of accessioning electronic records, and I can already see how tools like Forensic Toolkit help us to get electronic records to describe themselves.

After all, electronic records are records. Digital archives are archives. This is our present, future, and poorly-served past. And in the case of electronic records, we have ways of transcending the problem of our collections being singly, uniquely sited, requiring a mapping of what’s inside.

But some collections are, indeed, unique and sited. Before going on, I want to be pragmatic about the idea of scanning everything that isn’t born-digital and that does require a certain degree of mapping. I think we should be scanning a lot, I think we should be scanning much more than we are, but I don’t think that we necessarily should be scanning everything. I think we should scan what the people want. The city archives of Amsterdam, which has the most complete and sophisticated scanning operation that I’ve encountered, has committed to providing researchers with what they want not by scanning everything (they estimate that it would take 406 years to do so for all 739 million pages in their holdings, even in an extremely robust production environment) but by scanning what the users want to see. After all, what if you want to see the 739 millionth scan? And in order to figure out what the people want, we need some minimal level of mapping. Not every file, not in crazy, tedious detail, but some indication of what’s in a collection.

So, we’ve dispensed with much of the map. Didn’t that feel good? What else is a finding aid? What else does the archivist do? What else do our researchers need from us?


At the next level of abstraction, a really good finding aid is a guide. In this painting by Eugene Delacroix, we see Virgil leading Dante across the river Styx. I don’t want to take this metaphor too far, but I do think that there’s a role for the archivist to help researchers understand our materials by explaining the collections, pointing out pitfalls and rich veins of content, rather than just representing titles on folders.

I can see, in some contexts, that it makes sense for an archivist to spend quality time really understanding the records and explaining this understanding so that each researcher doesn’t have to wade through it every time. When I teach description, I urge workers to evaluate rather than represent records. For instance, does a correspondence series include long, juicy, hand-written letters wherein the writer pours his heart out? Or are they dictated carbon copies based on forms? A title of “Letter from John Doe to Jane Smith” doesn’t tell us this, but an archivist’s scope and content note can. It takes a lot of time to type “Correspondence” and the date a zillion times. Wouldn’t researchers prefer an aggregate description and date range with a nice, full note about what kinds of correspondence with what kinds of information she can expect therein? This is a choice to guide rather than map.

So here, we’re representing information about the collection that a researcher would need to spend a lot of time to discover on his own. And by the way, I’m not claiming a breakthrough. Seasoned archivists do this all the time. It’s also what Greene and Meissner were talking about in their 2005 article – our value is in our focus on the aggregate and the judgment required to make sense of records, rather than just representing them.

So to answer the original question, I would say that maybe, yes, maybe, researchers do want these kinds of finding aids where some of the sensemaking has already been done for them. The scale of archives is large, and it may indeed be inefficient to expect researchers to browse scanned document after scanned document to get a good understanding of what this all means together.

But there’s an even higher level of abstraction central to our role as archivists that should be included in our finding aids, which I rarely see documented comprehensively or well. This is the information about a collection that no amount of time with a collection will reveal to a researcher – it has to do with the archivists’ interventions into a collection, the collection’s custodial history, and the contexts of the records’ creation.


This last bit – getting to understand who created records, why they were created, and what they provide evidence of – really gets to the nature of research. These are the questions that historians and journalists and lawyers and all of the communities that use our collections ask – they don’t just see artifacts, they see evidence that can help them make a principled argument about what happened in the past. They want to know about reliability, authenticity, chain of custody, gaps, absences and silences.

This is the core work of archivists. This is what we talked about over and over again when I was in graduate school, and what has been drilled into me as the true value we, as archivists, add to the research process. We occupy a position of responsibility, of commitment to transparency and access. Researchers expect us to tell them this information, and we do a terrible job of doing so.

The above slide is based on the same corpus of finding aids at the Tamiment Library. While we did a great job of documenting what we saw before us, we did an abysmal job of explaining who gave us the collection and under what circumstances, how we changed the collection when we processed it, and what choices we made about what stays in the collection and what’s removed. And from what I can tell, it’s pretty consistent with the kinds of meta-analysis done by Dean and Wisser, and also by Bron, Proffitt, and Washburn in their recent articles analyzing EAD tag usage.


Like I say, communicating this in the finding aid is some of the most important work we do, and we do a pretty bad job of it. I have no reason to believe that my library is unique in this.

Because I also know, when I go to describe records, especially legacy collections that have sat unprocessed for a long time, I often have to do this by guessing. I’m like an archaeologist who tries to figure out the life of these documents before they came to me based on the traces left behind. It’s what I most want to explain, and what I often have the least evidence of.

This is an area where curators — collectors — whatever you call them — can intervene, where the best of breed are invaluable. After all, we’re not doing archaeology and working with the remains of long-dead civilizations. Creators, heirs or successors are usually around — they’re the ones who packed the boxes and dropped off the materials. Let’s make sure that we sit them down and talk with them then. Let’s make sure we’re getting all of the good stuff. Let’s make sure we really understand the nature of the records before we ask the processing archivist — usually a person fairly low in the organizational hierarchy, often a new professional, and almost always the person with the least access to the creator — to labor at reconstruction when just asking the creator might reveal all.

I have one short anecdote from my own repository to help illustrate this problem. In 1992, the Tamiment Library acquired the records of the Church League of America from Liberty University in Lynchburg Virginia. The Church League of America was a group created in the 1930s to oppose left-wing and social gospel influences in Christian thought and organizations through research and advocacy. The first iteration of the finding aid for this collection could be described as a messy map – a complicated rendering of the folder titles found in this extensive collection, without much explanation of what it all means and how it came.

Two years ago, before I came to Tamiment, my colleagues did a re-processing project. In doing so, they realized that these records had a rich history and diverse creators — far richer than what the finding aid had indicated. It turns out that the collection is an amalgamation of many creators’ work, including the files of the Wackenhut Corporation, which started as a private investigations firm and moved on to be government contractor for private prisons. The organization maintained files on four million suspected dissidents, including files originally created by Karl Barslaag, a former HUAC staff member, and only donated them to the Church League of America in 1975 as a way of side-stepping the Fair Credit Reporting Act.

Until re-processing happened, researchers had an incomplete picture of the relationship between private commerce and non-profit organizations that converged to become the lobbying arm of the anti-Communist religious right.

So back to our original question. Do researchers want finding aids qua finding aids? No, maybe, yes. They want the stuff, not descriptions of the stuff. They might want some help navigating the stuff. And they absolutely want all the help that they can get with uncovering the story behind the story.


Before I turn this over to Trevor, I want to add a brief coda about how we should be thinking of finding aids as discovery tools as long as we decide to have them.

Let’s start with a reality check – how are finding aids used? What do we know about information-seeking behavior around archival resources?

The first and most important thing that we know is that discovery happens through search engines. It is true that some sophisticated researchers know what kinds of records are held at what repositories – that the Tamiment Library holds records of labor and the radical left, or that Salman Rushdie’s papers are at Emory.  But I think that we can all agree that “just knowing” isn’t a good strategy to make sure that researchers discover our materials!


This was the understanding that we started with at Princeton (my previous job) when we decided to revise our finding aids portal. Previously, our finding aids looked like a lot of other finding aids – very, very long, often monograph-length webpages that give a map – and the better ones (there were many better ones there) would also be a good guide as well.

Basically, we decided to surrender to Google. We hoped that by busting apart the finding aid into the components that archivists create (collections, series, files and items), and letting Google index it all, our users would be able to come directly to the content that they want to find.


This is the dream. A researcher searches Google for George Kennan’s the Long Telegram, and we can give her exactly what she’s looking for, in the context of the rest of the papers. We also wanted the finding aid to be actionable – a researcher can ask a question about the material, request to see it in the reading room, and, if it had been scanned, would be able to look at images directly in the context of the finding aid.


In this case, you can see a report on Jack Ruby from Allen Dulles’s Warren Commission files.

While we’re putting so much effort into making our finding aids into structured data, let’s make our finding aids function as data. Let’s make it so that we can sort, filter, compare, comment and annotate. Why do we take our EAD, which we’ve painstakingly marked up, and render it in finding aids as flat HTML?



Let’s work together to take the next step, to think critically about the metadata we’re creating, and then make sure that it’s readable by the machines that present it to our users.