Baby Steps in Using Openrefine to clean up collection data

As I mentioned in my last post, my latest collection management project is making sure that we have collection level records for everything in the repository, which I am doing through creating accession records in Archivists’ Toolkit (I chose accession records rather than resource records based on a set of legacy decisions about how the institution uses AT, if I was starting from scratch I probably would do that differently).  The piece of the puzzle that I’ve been working on recently is integrating the records of one of our sub-collections, the Bakhmeteff Archive of Russian and East European Culture, into AT.

The manuscript unit took over accessioning materials for this collection in about 2007, so collections  that have been acquired in the 7 or 8 years do have an AT record, as do about 60 older collections that were added as part of a separate project.  So, my first step was to figure out which Bakhmeteff Collections already had a collection level record and which ones did not.  Since there was not one comprehensive collection list, this involved checking our AT records against a whole slew of other data sources* to see what was already in the database and which collections still needed collection descriptions in Archivists’ Toolkit.

The next issue was to figure out the best way to identify duplicate records.  In looking at all of the source data it became clear very quickly that way that the title was expressed across all of the data sources I was working with varied wildly — sometimes expresses as “John Smith Papers” sometimes “Smith, John Papers” and, in the case of many of our older catalog records, just “Papers” with John Smith living in the 100 field and not reappearing in the 245.  Some sources used diacritical marks and some didn’t (always thorny, but with several hundred collections in Russian a potential dealbreaker).  Therefore I chose to use the collection number rather than title.  The one issue with that is that I was using AT accession records, not resource records, so the collection number was expressed as part of the title filed (I know, I know) and had to be broken out into its own column, but not a huge deal.  Once I had that as a master document I could combine this spreadsheet and my other data sources and then use Open Refine to facet my spreadsheet by ID number to identify (and eliminate) any collection that shows up both in AT and  in one of my data sources.  I then had a comprehensive list of collections not represented in AT so that I knew which collections needed to be added.  It’s not a perfect solution, but it is a down and dirty way to identify where I have work to do so that I am not having a student manually check every collection against different data sources to identify what needs a better record.   It also let me combine data from all sources to come up with a new master collection list to work.  Plus it was a good, baby-steps introduction to using OpenRefine.


*Since information was coming from so many sources, and because I didn’t trust the completeness of any of them, I was checking our AT holdings against accession registers, a collection number spreadsheet, our stack directory, and a list of collections that I asked one of our fantastic systems librarians to generate for that that queried our catalog for any record that was cataloged with a bibliographic level of c: collection in the Leader, and had a location code that tied it back to our library.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s