Clean Metadata for Non-Metadata Geeks

Over the past two years, Maureen, Carrie, Meghan, Cassie and their guests have turned this blog into a powerhouse of know-how around working smarter with archival metadata. Some of us really enjoy this type of work; we find it crazy satisfying and it aligns well with our worldviews. We acknowledge, with some pride, that we are metadata geeks. But not all archivists are like this, AND THAT’S TOTALLY OKAY. We all have different strengths, and not all archivists need to be data wranglers. But we can all produce clean metadata.


Just one of the awesome metadata jokes promulgated by AVPreserve’s button campaign

Today, though, I’m going to take a BIG step backward and talk for a few minutes about what we actually mean when we talk about “clean” data, and I’ll share a few basic things that any archivist can do to help prevent their painstakingly produced metadata from becoming someone else’s “clean up” project later.

As Maureen explained in the very first Chaos —> Order post, the raison d’être of all of this is to use computers to do what they do best, freeing the humans to do what they do best. Computers are really good at quickly accomplishing tasks like indexing, searching, replacing, linking, and slicing up information, so long as you can define a rule or pattern for the task. These are things that would take a human tens or hundreds of hours to do by hand, and they don’t require any of the higher-level processes that are still unique to humans, let alone the specialized training or knowledge of an archivist. Clean data is, quite simply, data that is easy for a computer to digest in order to accomplish these tasks.


Put a strategic plan on it!

People who know me will know I love strategic planning. Or, more accurately, I love good strategic planning and how a strategic plan can assist you in many other activities.

Given that our library’s strategic plan is a few years old and our dean is retiring in the spring, the functional areas of SCUA didn’t want to wait for the whole-library process to move forward. Luckily, there’s no rule that says you can’t have a strategic document for levels below the top, such as a division or department.

While we didn’t go through a full-blown strategic planning process, we had run many brainstorming, visioning, and planning activities over the last year and a half. Many of the projects in our document were already approved (officially or unofficially) and represented in individual and unit work plans.

Why did we need a plan then? When planning projects or allocating resources we seemed to encounter a few challenges. The biggest (to me) were a lack of understanding about:

  • The difference between work that is strategic and moves a program forward v. the prioritization of regular ongoing work/projects
    • ex: processing the so-and-so papers may be a high priority on the list of collections to process, but this does not necessarily make that specific processing project a strategic priority
  • How the work of the different functional areas within SCUA directly relates to one another and supports the work of the entire department, and how each unit/function can participate in meeting shared goals

We determined three strategic directions across our work:

  1. Optimize the user experience
  2. Increase access to collections
  3. Expand knowledge of our collections to new audiences

Check out the full Strategic Directions for SCUA Functional Areas 2014-2017.

Here’s how I’m hoping to use our strategic directions document:

  • Raising awareness about what we do, why we do it, and its value within SCUA and the Libraries
  • Assisting in developing annual work plans, deciding how we spend our time, and evaluating our progress
  • Prioritizing pop-up/new projects. Is it really a project that will move us forward? Does it have to happen right now? Can we approach it differently than before? What do we STOP doing from our strategic directions or regular/ongoing work to accommodate it?
  • Using it as a tool for updating specific policies, procedures, and workflows, highlighting how these changes support the activities and goals outlined in the strategic directions
  • Advocating for resources at various levels within the library. Our AUL has already said this document will be extremely helpful as the libraries start to discuss priorities for fiscal and human resources for FY16.

Also, a hat tip to UCLA’s Library Special Collections strategic plan! We liked their presentation/formatting, so we borrowed that for ours. Don’t reinvent the wheel!


Figuring Out What Has Been Done

It’s been a while since I last posted, and there’s a good reason for that — I’ve started an exciting new job as an archivist and metadata specialist at Yale. I miss my colleagues and friends at Tamiment every day, and I look forward to continued awesome things from them.

Here at Yale, I work in Manuscripts and Archives. The major project for the first year will be to migrate from Archivists’ Toolkit to ArchivesSpace. In anticipation of this migration, I’m learning about the department’s priorities for data clean-up and thinking through what I can do to help implement those goals.

The Goals

One of the first projects that was added to my list was to take a look at a project that has been ongoing for a while — cleaning up known errors from the conversion of EAD 1.0 to EAD 2002. Much of the work of fixing problems has already been done — my boss was hoping that I could do some reporting to determine what problems remain and in which finding aids they can be found.

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?
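Goal 3, at least, lends itself to a quick programmatic check. A minimal sketch in Python, assuming the (box, folder) pairs have already been parsed out of each finding aid (the EAD parsing itself isn’t shown, and the function name is mine):

```python
def folders_restart_per_box(containers):
    """Check whether folder numbers restart at 1 in each box.

    containers: list of (box, folder) tuples in document order.
    Returns True if numbering restarts per box (the desired pattern),
    False if folders appear to be numbered continuously across boxes.
    """
    last_box = None
    for box, folder in containers:
        if box != last_box:  # first folder of a new box
            if folder != 1:
                return False
            last_box = box
    return True

# Desired: Box 1, Folders 1-3; Box 2, Folders 1-2
assert folders_restart_per_box([(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)])
# Problem: folders numbered continuously across boxes
assert not folders_restart_per_box([(1, 1), (1, 2), (1, 3), (2, 4), (2, 5)])
```

Run against each finding aid, this would at least produce a list of candidates for closer review.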

Getting to the Files

Access to files that have been published to our portal is easy — they’re kept in a file directory that is periodically uploaded to the web. I also have a cache of the EAD 1.0 files, pre-transformation. Both were easy to pull down. But one of the questions I was asking was how these differ from what’s in the AT. It’s so, so easy to make changes in AT and forget to export to EAD.

If any of you know good ways to batch export EAD from AT, PLEASE LET ME KNOW. I have a pretty powerful machine and I know that folks here have worked on optimizing our AT database, but I could only export about 200 files at a time, for fear of the application hanging and eventually seizing up. So, I ran this in the background over the course of several days and worked on other stuff while I was waiting.

For some analyses, I wanted to exclude finding aids that aren’t published to our portal — for these, I copied the whole corpus to a separate directory. To get a list of which finding aids are internal-only, I very simply copied the resource record screen in AT (you can customize this to show internal-only finding aids as a column), which pastes very nicely into Excel.



Once in Excel, I filtered the list down to Internal Only = “TRUE”. From there, I used the same technique that I had used to kill our zombie finding aids at NYU. I made a text document called KillEAD.txt, which had a list of the internal-only finding aids, and I used the command

cat KillEAD.txt | tr -d '\r' | xargs echo rm | sh

to look through a list of files and delete the ones that are listed in that text document. (In case you’re wondering, I’m now using Cygwin, a Unix-like environment for Windows, and there are weird things that don’t play nicely with Windows, including the fact that Windows text documents append \r to the ends of lines to indicate carriage returns — hence the tr -d '\r'. Oh, also, I put this together with spit, baling wire, and a Google search — suggestions on better ways to do this kind of thing are appreciated.)
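For what it’s worth, the same delete-from-a-list step can be done in Python, which sidesteps the carriage-return problem entirely. A sketch, with a hypothetical function name — this is an alternative, not the command I actually ran:

```python
import os

def delete_listed_files(list_path, target_dir):
    """Delete files named in list_path (one filename per line) from target_dir.

    Strips Windows carriage returns and surrounding whitespace, and skips
    names that aren't present, so a stale list won't raise errors.
    """
    deleted = []
    with open(list_path) as f:
        for line in f:
            name = line.strip()  # removes \r\n as well as \n
            if not name:
                continue
            path = os.path.join(target_dir, name)
            if os.path.exists(path):
                os.remove(path)
                deleted.append(name)
    return deleted
```

Called as `delete_listed_files("KillEAD.txt", "published_ead/")`, it returns the list of files it actually removed, which is handy for double-checking.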

So, that’s step 0. Don’t worry, I’ll be blogging my approach to goals 1-7 in the days ahead. I have some ideas about how I’ll do much of it (1, 3, and 4 I know how to assess, 2 I have a harebrained plan for, 5-7 are still nascent), but any suggestions or brainstorming for approaching these problems would be MORE THAN WELCOME.

Clean Up: Dates and OpenRefine

Here are some of the ways we are using OpenRefine to clean up dates in accession records based on our instructions. In some cases I have included more than one way to complete the same activity. There are probably more elegant ways to do this, so if you have other suggestions please let us know in the comments!

In this example we start by running formulas on our unitdateinclusive column, split it out into begin and end dates, and then return to format our date expression.

Periodically I use Edit Cells>Common Transforms>Trim leading and trailing whitespace just to keep things tidy. Be sure to run this anytime you’ll be using an expression that looks at the beginning or end of a cell.

Remove various characters, punctuation, and circa notations

Use Edit Cells>Transform with the following formulas:

  • value.replace('circa', ' ').replace('Ca.', ' ').replace('ca.', ' ').replace('c.', ' ').replace('c', ' ').replace('unknown', ' ').replace('n.d.', 'undated').replace(', undated', ' and undated').replace('no date', ' ')  [Order matters: replace the longer notations like 'circa' and 'ca.' before the shorter 'c.' and 'c', and facet before using the bare 'c' replace — it is aggressive.]
    • You could use Facet>Text facet and enter one of the words to analyze the fields first. Then either use all or part of the formula above or make use of the cluster function.
  • value.replace(/[?]/, ' ') or value.replace("?", ' ')  [These do the same thing, just showing different syntax you can use.]
  • value.replace("[", ' ').replace("]", ' ')
  • value.replace("'s", 's')
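For anyone doing this same cleanup outside OpenRefine, the gist of the pass above can be sketched in Python with regular expressions. This is an approximation of the formulas, not a drop-in replacement — the exact characters and notations you strip should come from faceting your own data:

```python
import re

def clean_date_string(value):
    """Strip circa notations, question marks, and brackets from a raw date string."""
    value = re.sub(r"\b(circa|ca\.|c\.)\s*", "", value, flags=re.IGNORECASE)
    value = re.sub(r"\bno date\b|\bn\.d\.", "undated", value, flags=re.IGNORECASE)
    value = value.replace("?", "").replace("[", "").replace("]", "")
    value = value.replace("'s", "s")           # 1920's -> 1920s
    return re.sub(r"\s+", " ", value).strip()  # tidy leftover whitespace

assert clean_date_string("ca. 1942?") == "1942"
assert clean_date_string("[circa 1920's]") == "1920s"
```

Anchoring the circa notations with word boundaries avoids the ordering headaches of chained plain-string replaces.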

Convert to standardized dates

  • Edit Cells>Common Transforms>To text
  • Edit Cells>Common Transforms>To date

From here we’ll run a set of operations to give more conformity to the data. (This is where there must be better ways of doing this.)

  • Edit Cells>Common Transforms>To text
  • Change months to numbers with
    • value.replace('January ', '01/').replace('February ', '02/').replace('March ', '03/').replace('April ', '04/').replace('May ', '05/').replace('June ', '06/').replace('July ', '07/').replace('August ', '08/').replace('September ', '09/').replace('October ', '10/').replace('November ', '11/').replace('December ', '12/')
    • value.replace('Jan ', '01/').replace('Feb ', '02/').replace('Mar ', '03/').replace('Apr ', '04/').replace('May ', '05/').replace('Jun ', '06/').replace('Jul ', '07/').replace('Aug ', '08/').replace('Sept ', '09/').replace('Sep ', '09/').replace('Oct ', '10/').replace('Nov ', '11/').replace('Dec ', '12/')
  • Replace the comma and space between day and year: value.replace(', ', '/')
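Those long replace() chains can also be collapsed into a single lookup table if you step outside GREL. A Python sketch of the same month-to-number substitution (the handling of trailing periods and of 'Sept' alongside 'Sep' is my assumption, not part of the formulas above):

```python
import re

# Month names -> numbers, full names first
MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}
# Three-letter abbreviations map to the same numbers; add 'Sept' as well
MONTHS.update({name[:3]: n for name, n in MONTHS.items()})
MONTHS["Sept"] = 9

def months_to_numbers(value):
    """Replace month names/abbreviations with MM/ as in the GREL formulas."""
    # Longest alternatives first so 'September' wins over 'Sep'
    pattern = r"\b(" + "|".join(sorted(MONTHS, key=len, reverse=True)) + r")\.?\s+"
    return re.sub(pattern, lambda m: "%02d/" % MONTHS[m.group(1)], value)

assert months_to_numbers("January 5, 1942") == "01/5, 1942"
assert months_to_numbers("Sept 1950") == "09/1950"
```

One table, one pass, and no risk of an early replace mangling a longer month name.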

We now have dates in a single, standard format, as opposed to the variations we saw in the Date Formats post.

Create begin and end date columns with this data:

  • Duplicate the column: Edit Column>Add column based on this column… and by supplying no formula we can copy over all the data to our new column.
  • Split the new column using Edit Column>Split into several columns… and designate “-” as our separator.
  • Rename first column “date_1_begin” and second “date_1_end”.
  • Run Edit Cells>Transform: toString(toDate(value), "yyyy-MM-dd") on both columns to get the format required by ArchivesSpace, without having to deal with the day, hours, minutes, and seconds of OpenRefine’s date format.
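The split-and-normalize steps can be sketched in Python too, assuming the earlier formulas have already reduced dates to MM/DD/YYYY or bare-year forms (the function name is mine):

```python
from datetime import datetime

def begin_end(value):
    """Split a cleaned date string on '-' and normalize each side to YYYY-MM-DD.

    Bare years pass through unchanged; single dates get an empty end date.
    """
    parts = [p.strip() for p in value.split("-")]

    def normalize(part):
        try:
            return datetime.strptime(part, "%m/%d/%Y").strftime("%Y-%m-%d")
        except ValueError:
            return part  # already a bare year, or needs manual review

    begin = normalize(parts[0])
    end = normalize(parts[1]) if len(parts) > 1 else ""
    return begin, end

assert begin_end("01/05/1942") == ("1942-01-05", "")
assert begin_end("1942-1945") == ("1942", "1945")
```

Anything strptime refuses to parse falls through untouched, which flags it for the manual-review pile rather than silently mangling it.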

We’ll still want to review the date begin column to identify the rows that had two single dates separated by commas, ampersands, semicolons, “and,” or some other designator. We would separate the second date out into the “date 2” fields.


Still on our to-do list:

  • Standardizing dates into our preferred local natural language for date expression
  • Converting decades (ex: 1920s) programmatically
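The decades item is one we can already sketch programmatically. A minimal Python version, assuming the local rule is that 1920s means 1920 through 1929:

```python
import re

def decade_to_range(value):
    """Expand a decade like '1920s' into begin/end years (1920, 1929)."""
    m = re.fullmatch(r"(\d{3})0s", value.strip())
    if not m:
        return None  # not a decade; leave for other rules
    begin = int(m.group(1) + "0")
    return begin, begin + 9

assert decade_to_range("1920s") == (1920, 1929)
assert decade_to_range("1942") is None
```

If your local convention treats a decade more loosely (say, as circa), the begin/end arithmetic is the only line to change.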



The Beast

No, really, we call our homegrown archival management system the “Beast”. It is an Access database launched in the early-to-mid 2000s. Originally started as the locations register, the database evolved to support the creation of EAD finding aids published on ArchivesUM. A Java-based conversion program pulls information from the database into EAD elements in an XML file. Files are then checked by hand for XML and EAD compliance before upload to ArchivesUM. You can read all about its creation in “Taming the ‘Beast’: An Archival Management System Based on EAD.”*

At the time, the Beast did some good things. It launched local EAD implementation, got more finding aids online, and consolidated collection information into a central location. It made it easy to put up abstracts of new accessions or unprocessed collections.

Over the years the Beast has evolved, including further development of the conversion scripts, but new functionality has been pretty minimal over the past several years. The decision was made to wait for ArchivesSpace instead of making major changes to the Beast.

Here’s a highlight of general issues with the Beast and our associated practices:

  • It was built around local policies/practices at the time, making its functionality rigid in some cases. It was seen as a tool to get away from paper rather than as a source of reusable data.
  • While staff enter information using forms in Access for either “accessions” or “finding aids”, most of the information is stored in the same table making it difficult on the back end to know what you are looking at.
  • It is clunky and difficult to link multiple accessions with a collection description.
  • Some fields don’t map to the best EAD tag choice (ex: all extent information is dumped into <physdesc> and not <extent>.)
  • For container lists, the Beast/ArchivesUM stylesheet requires that your intellectual and physical order MATCH EXACTLY. This limits flexibility in description and processing at various levels and often requires spending too much time physically moving around materials.
  • Not all finding aids uploaded to ArchivesUM are EAD compliant (they are all well-formed XML). People fell on the side of getting the finding aid up instead of figuring out the EAD error and the required changes in the Beast.
  • We did not do a good job at quality control. We just didn’t. We didn’t use controlled fields when we could have (ex: linear feet is spelled ten ways) and didn’t enforce adherence to local policies (ex: dates entered in date fields don’t match the format of acceptable dates in our processing manual).

In future posts I’ll share how we are mapping the Beast fields to ArchivesSpace as well as the specific data cleanup issues we are facing.

*Jennie A. Levine, Jennifer Evans, and Amit Kumar, “Taming the ‘Beast’: An Archival Management System Based on EAD,” Journal of Archival Organization 4, no. 2 (2007): 63-98.

My thinking about all of this

Many moons ago, my friend the super-librarian Dianne and I used to hang out in Ann Arbor and talk about feminism and computers. She has FAR MORE formal training with computers (like, a lot compared to none) and I have a bit more training in feminist critical theory. We had a lot of fun, but I definitely got the better part of this deal, because I had the privilege of having Dianne shape my professional thinking. And here I share with you the single principle that will never leave me.

  1. Never have a human do something that a computer is better at, even if it takes longer to explain the task to the computer than it would take to do it by hand.

Whoa, right?

SURELY I can’t mean that every single time I want to transform a bunch of lines of something to somewhere else, I write an xslt or a bit of python or whatever, right? Well, no, I don’t, because I’m not as smart and disciplined as Dianne. But I should. Because the time I spend transforming data by hand only results in transformed data — but the time I spend learning how to do it programmatically results in me having another tool in my toolkit. I have seen some CRAZY examples of people in libraries having students transform data by hand (or worse, doing it themselves).

Confession bear time:

  • Once, in grad school, to keep me busy on the reference desk I had to manually find webpages for publishers of journals. This list was more than a thousand lines long. A computer can do this.
  • Once, in my last job, I made a student sit down with a spreadsheet and identify dates in title fields, pull them out, and put them in a different field. I’m so embarrassed to even admit that. A computer can do this.
  • In my life, I have done so much copying and pasting and pulling down of Excel cells. There are better ways. I have learned some of them. I need to learn more.
  • Because I didn’t start by learning tools that would help me verify the correctness of my data transformations, I have shot myself in the foot SO MANY TIMES. There are ways to put safeguards in place. I need to employ them more often.
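That date-pulling job, for instance, is exactly the kind of thing a few lines of regular expressions can do. A hedged sketch in Python, assuming the dates sit at the end of the title as a year or year range (real title fields will have messier cases):

```python
import re

def split_title_date(title):
    """Pull a trailing year or year range out of a title string.

    Returns (title_without_date, date) -- date is '' when none is found.
    """
    m = re.search(r",?\s*(\d{4}(?:\s*-\s*\d{4})?)\s*$", title)
    if not m:
        return title, ""
    return title[:m.start()].rstrip(" ,"), m.group(1)

assert split_title_date("Correspondence, 1942-1945") == ("Correspondence", "1942-1945")
assert split_title_date("Minutes") == ("Minutes", "")
```

Run down a spreadsheet column, that turns an afternoon of student copy-paste into a minute of review.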

We work with too much data too often to not figure out the right tools to deal with it. We need to stop repetitively manipulating data by hand, or at least cut back, and start thinking through what kinds of things we would need to tell a computer to do, even if we don’t yet speak the language.