Clean Metadata for Non-Metadata Geeks

Over the past two years, Maureen, Carrie, Meghan, Cassie and their guests have turned this blog into a powerhouse of know-how around working smarter with archival metadata. Some of us really enjoy this type of work; we find it crazy satisfying and it aligns well with our worldviews. We acknowledge, with some pride, that we are metadata geeks. But not all archivists are like this, AND THAT’S TOTALLY OKAY. We all have different strengths, and not all archivists need to be data wranglers. But we can all produce clean metadata.

Just one of the awesome metadata jokes promulgated by AVPreserve's button campaign

Today, though, I’m going to take a BIG step backward and talk for a few minutes about what we actually mean when we talk about “clean” data, and I’ll share a few basic things that any archivist can do to help prevent their painstakingly produced metadata from becoming someone else’s “clean up” project later.

As Maureen explained in the very first Chaos —> Order post, the raison d'être of all of this is to use computers to do what they do best, freeing the humans to do what they do best. Computers are really good at quickly accomplishing tasks like indexing, searching, replacing, linking and slicing up information for which you can define a rule or pattern — things that would take a human tens or hundreds of hours to do by hand, and that don't require any of the higher-level processes still unique to humans, let alone the specialized training or knowledge of an archivist. Clean data is, quite simply, data that is easy for a computer to digest in order to accomplish these tasks.



Backlog Control — Known Unknowns

As part of a continuing attempt to understand our holdings, I’ve been writing a series of reports against our EAD. Of course, what’s in a finding aid only accounts for the stuff that someone bothered to record in the first place. To tackle undescribed collections, we’ve also been doing a shelfread project to get an understanding of what’s on our shelves.

Today, we take an “accessioning as processing” approach to accruals — we describe what’s there at the appropriate level of description at the time of accessioning, and we include a lot of the background information about how it came to us, what it all means, etc., to help make sense of it. This helps us avoid building a backlog.

In the past, however, there was a mysterious (to me) understanding of the nature of processed/unprocessed materials. We have many, many series of materials in collections (usually accruals) that may even have file-level inventories but are described as “unprocessed.” They don’t include essential information about immediate source of acquisition, creators, or what about these materials makes them hang together. I’m frankly not sure what my predecessors were waiting for — they did all the work of creating lots of description without doing any real explanation!

So, my boss wanted a sense of these known knowns — parts of collections that we need to at least give a better series title, or somehow take out of the limbo of “unprocessed”. She wanted to know how many series there were, which collections these series belong to, and how many boxes of stuff we’re talking about. It would also be great to know linear footage, but this is frankly unknowable from the data we have.

So, I wrote an xQuery. You can find it here. The xQuery looks for any series or subseries that has the string “unprocessed” in its title. From there, it reports out the distinct values of containers.
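In outline, the query looks something like this (a minimal sketch: the EAD 2002 namespace is standard, but the component paths and the report format are assumptions reconstructed from the description):

xquery version "1.0";
declare namespace ead = "urn:isbn:1-931666-22-9";

(: report every component whose title contains "unprocessed" :)
for $series in //ead:c[ead:did/ead:unittitle[contains(lower-case(.), 'unprocessed')]]
return
  <series>
    <title>{ $series/ead:did/ead:unittitle[1]/text() }</title>
    <containers>{ string-join(distinct-values($series//ead:container[@type eq 'Box']), ', ') }</containers>
  </series>

The result looks something like this: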

[Screenshot: report output showing each “unprocessed” series and its container values]

Perhaps you see the problem. Originally, I thought I just wanted a count of the distinct containers. My XPath for the variable that would give me box info (called footage here) looked like this:

$unprocessedfootage := count(distinct-values($series//ead:container[@type eq 'Box']))

The idea here was that it would take a series, get a census of the different boxes in that series, and count 'em up. But this gave me bad data. In the case of:

<container type="Box">10-17</container>

I would have “10-17” be considered one distinct value in the count, when really it represents 8 boxes. The report as I first had it was severely undercounting boxes.
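In hindsight, the ranges could have been expanded inside the query itself. A sketch, assuming every range is a simple “N-M” string (which real container data rarely guarantees):

$unprocessedfootage := sum(
  for $c in distinct-values($series//ead:container[@type eq 'Box'])
  return
    if (matches($c, '^\d+-\d+$'))
    then xs:integer(substring-after($c, '-')) - xs:integer(substring-before($c, '-')) + 1
    else 1
)

At the time, though, I took a different route.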

If I want to get a count of the distinct containers, I have to deal with ranges like 10-17. I started by importing this into OpenRefine and separating the multi-valued cells in the “unprocessed” column so that each number or range was in its own cell/row.

Then, I did some googling and came across this StackOverflow answer that explained how to enumerate the values in a range in Excel (this will give me 10, 11, 12, 13, 14, 15, 16 and 17 from 10-17). I exported from OpenRefine and brought the document into Excel, separated the ranges into two columns, and did a quick if/then statement to repeat single values in the second column. From there, I just ran the VBA code that was provided. I brought the document BACK into Refine and separated multi-valued cells again, and found out that we have 908 distinct boxes of “unprocessed” materials in 67 collections.

Now, happily, we know exactly how big of a mess our described “unprocessed” materials are, and we’re in a much better position to make good sense of them.

Update — 2014 May 7

@ostephens on Twitter very helpfully pointed out that the dumb VBA step can easily be avoided by doing the work in OpenRefine.

He was completely right, and was kind enough to give me the recipe.

After multi-valued cells were each in their own row/cell, I separated by “-” so that the beginning and end of each range was in its own column. Then, I created a new column based on the “First” column and did the following:

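In GREL terms, the expression was along these lines (a sketch reconstructed from the description; the “Last” column name for the end of each range is an assumption, and value here comes from the “First” column the new column is based on):

forEach(range(toNumber(value), toNumber(cells["Last"].value) + 1), n, n).join(",")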

On error, it copies the value from the original column so that my “enum” column is everything I need. Once I had the values enumerated, I split multi-value cells again and ended up with a much more beautiful process.

You can follow my steps by importing my OpenRefine project here.

Clean Up: Dates and OpenRefine

Here are some of the ways we are using OpenRefine to clean up dates in accession records based on our instructions. In some cases I have included more than one way to complete the same activity. There are probably more elegant ways to do this, so if you have other suggestions please let us know in the comments!

In this example we're starting by running formulas on our unitdateinclusive column, splitting it out into begin and end dates, and then returning to format our date expression.

Periodically I use Edit Cells>Common Transforms>Trim leading and trailing whitespace just to keep things tidy. Be sure to run this anytime you’ll be using an expression that looks at the beginning or end of a cell.

Remove various characters, punctuation, and circa notations

Use Edit Cells>Transform with the following formulas:

  • value.replace('circa', ' ').replace('Ca.', ' ').replace('ca.', ' ').replace('c.', ' ').replace('c', ' ').replace('unknown', ' ').replace('n.d.', 'undated').replace(', undated', ' and undated').replace('no date', ' ')
    • Note the order: the longer strings are replaced first, so that stripping a bare 'c' doesn't mangle 'circa' or 'ca.' before they can be matched. (A bare 'c' will also hit month names like December, so facet first and use only the pieces you need.)
    • You could use Facet>Text facet and enter one of the words to analyze the fields first. Then either use all or part of the formula above or make use of the cluster function.
  • value.replace(/[?]/, ' ')  or  value.replace("?", ' ')  [These do the same thing, just showing different syntax you can use.]
  • value.replace("[", ' ').replace("]", ' ')
  • value.replace("'s", "s")

Convert to standardized dates

  • Edit Cells>Common Transforms>To text
  • Edit Cells>Common Transforms>To date

From here we’ll run a set of operations to give more conformity to the data. (This is where there must be better ways of doing this.)

  • Edit Cells>Common Transforms>To text
  • Change months to numbers with:
    • value.replace('January ', '01/').replace('February ', '02/').replace('March ', '03/').replace('April ', '04/').replace('May ', '05/').replace('June ', '06/').replace('July ', '07/').replace('August ', '08/').replace('September ', '09/').replace('October ', '10/').replace('November ', '11/').replace('December ', '12/')
    • value.replace('Jan ', '01/').replace('Feb ', '02/').replace('Mar ', '03/').replace('Apr ', '04/').replace('May ', '05/').replace('Jun ', '06/').replace('Jul ', '07/').replace('Aug ', '08/').replace('Sept ', '09/').replace('Sep ', '09/').replace('Oct ', '10/').replace('Nov ', '11/').replace('Dec ', '12/')  (note that 'Sept ' has to be replaced before 'Sep ', or every "Sept" becomes "09/t")
  • Replace the comma and space after the day, before the year: value.replace(', ', '/')
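An alternative worth knowing: GREL's toDate() accepts one or more format strings, so a single transform can convert the cells that parse cleanly (a sketch; set the transform dialog's on-error option to “keep original” so non-matching cells are left alone):

toString(toDate(value, "MMMM d, yyyy", "MMM d, yyyy"), "MM/dd/yyyy")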

We now have dates in a single, standard format, as opposed to the variations we saw in the Date Formats post.

Create begin and end date columns with this data:

  • Duplicate the column: Edit Column>Add column based on this column… and by supplying no formula we can copy over all the data to our new column.
  • Split the new column using Edit Column>Split into several columns… and designate “-” as our separator.
  • Rename first column “date_1_begin” and second “date_1_end”.
  • Run Edit Cells>Transform: toString(toDate(value), "yyyy-MM-dd") on both columns to get them in the format required by ArchivesSpace, without having to deal with the day, hours, minutes, and seconds of the date format in OpenRefine.

We'll still want to review the date begin column to identify the rows that had two single dates indicated by commas, ampersands, semicolons, “and”, or some other designator. We would separate out the second date into the “date 2” fields.

Upcoming

  • Standardizing dates into our preferred local natural language for date expression
  • Converting decades (ex: 1920s) programmatically

Clean Up: Date Instructions for Accession Records

In an effort to be transparent (and highlight the volume of work) I’m attempting to document all of our cleanup instructions and decisions. For each ArchivesSpace field we’re including actions and questions as we explore our data. Some of the questions focus on how ArchivesSpace works or implements fields while others focus on our specific data, policies, or procedures. Over time, the questions turn into actions and we are continually updating our documentation as we work through these issues.

Below is what we have so far for dates of materials in accession records. We still have lots of questions (and will no doubt develop more) so feel free to provide any feedback or suggestions.

Actions

Dates from Beast fields unitdateinclusive and unitdatebulk will be parsed into date 1 and date 2 fields, each including:

  • date_1_begin
  • date_1_end
  • date_1_type (required by AS)
  • date_1_expression (required by AS if no normalized dates)
  • date_1_label (required by AS)

Adhere to required date formats and definitions for ArchivesSpace fields

  • Control begin and end fields for date format: YYYY, YYYY-MM, or YYYY-MM-DD
  • Single dates do not need a date end
  • Control date expression based on local convention [revising current local format]
  • Split date ranges using “-” as delimiter and designate as “inclusive” or “bulk” in date type, based on what column they came from. Use date 1 fields for inclusive and date 2 fields for bulk
  • For values with commas, such as “1950, 1950”, parse into two single dates using the date 1 fields and date 2 fields
  • Label defaults to “creation” unless otherwise specified

For dates that include “undated”

  • Keep “undated” as part of the whole statement in the date expression field.
  • Parse remaining dates as normal and remove “undated” from begin and end
  • ex: “1857-1919 and undated” remains as the date expression, 1857 goes to date_begin, 1919 goes to date_end, type is “inclusive”

Certainty

  • Assume that all collection dates in accession records are best estimates [Policy decision]
  • Remove all forms of “circa” from accession dates
  • Remove question marks and variations
  • Remove square brackets and variations indicating guessed dates
  • Remove “undated” if it is the only value
  • Remove “unknown” if it is the only value

For dates listed as decades

  • Control decades to year spans. Use the first and last year of decade.
  • 1940s-1950s becomes 1940-1959 in date expression, begin 1940, and end 1959
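Converting decades programmatically is on the upcoming list, but for a flavor of it, here is a GREL sketch for the simple single-decade case (assuming the cell holds exactly a decade string like “1940s”):

  • date begin: toNumber(value.replace("s", ""))
  • date end: toNumber(value.replace("s", "")) + 9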

If bulk dates are the exact same as inclusive, delete bulk dates

Questions

What date cutoffs do we use for partial decades?

  • ex: Late 1990s, mid 1970s, early 1930s
  • late = 1995-1999
  • mid = ???
  • early = 1930-1935

If bulk dates exist for single items, when to delete or not?

  • Will delete if same
  • Should we keep if there is a difference?
  • If there is a difference, what is the year cutoff? 1? 5? 10?

Are single dates with “undated” really single?

  • ex: “1941 and undated”

Can we have “bulk” dates that are “single”?

  • ex: 1989, type as “bulk” in ArchivesSpace?

For date expression, can we agree on the preferred date formats?

  • Start with guidelines in processing manual
  • Update and make suggestions for changes
  • Solicit comments/feedback
  • Finalize decisions
  • ex: 1880s, 18th-20th Century, Oct. 1976, Sept-9-1980, May 5, 1990

What if accession is a single item with a date range, but the abstract gives single date?

  • ex: Edwin Warfield accessions, range 1875-1912, abstract for accession says 1889 for a single item. Assuming that this date range was for all Warfield papers? Ignore and take date from abstract?

What do we do if we have more dates than fields?

  • ex: single dates of 1966, 1967, 1969 or 1930, 1950, and 2002 would parse to three single date fields
  • Version 1.0.7.1 currently only imports date 1 and date 2 in CSV accession template
  • When do we want to turn single dates into a range instead? How many years in between? Based on size of materials? Never, and instead develop a procedure for adding dates beyond the second into the ArchivesSpace record post-import?

In the next post we'll go through some of the specific ways we are executing the clean up actions.

Date Formats

Now that we've eliminated most of our duplicate bulk dates, let's take a look at the plethora of date formats in our accession records. Does your repository have local definitions for how to record collection dates? My guess is most would have something somewhere, even if not a formal written policy. We have a small section of our processing manual detailing preferred date formats and unacceptable date formats. It is supposed to apply to collection dates in accession records and in finding aids. Do people follow it? Sometimes. Usually not. Or they follow some of it, but not all of it. We also have dates created before the processing manual existed. The samples below are just from a portion of our accession records, so we might have additional formats yet to be discovered, but you'll get the idea.

Our date fields could contain three elements: year only, month and year, or month, day, and year. The type might be a single date, multiple single dates, range, multiple ranges, or a combination of these (although that isn’t specified). For dates in accession records I have already gone ahead and removed any variation of the word “circa”. There’s also a healthy amount of “unknown” and “undated” speckled throughout.

Sample unitdateinclusive values (Beast field), grouped by element and type:

Year, single:
  1909
  [1922]
  1636 (approx.)
  1920[?]
  1940?
  1946, undated
  1957(?)
  1999?

Year, multiple single:
  1913, 1963
  1945 or 1946
  1953, 1961, 1969, 1994
  1954, 1956, 1966-1967, 1971
  1958, 1960, 1962
  1966, 1967, 1969
  1967, 1968, 1969
  1969, 1970
  1995, 2000, undated

Year, range:
  1910-1950
  1920s
  1920s-1930s?
  1921-1981 and undated
  1940's-2006
  1980's-1990's
  2000-2001 (FY 2001)
  Early 1970s
  late 1980s-early 1990s
  undated, 1970s-2002

Year, multiple range:
  1920s, 1969-1975
  1932-1934, 1950s
  1937-1942; 1947-1950

Year, single and range:
  1928; 1938-1962 and undated
  1938, 1950-1951
  1950s-1960s, 1988
  2008 [1901-2002]

Month Year, single:
  November 1962
  April 2001?

Month Year, range:
  January 1977- November 1981
  May2005-January 2007
  Otober 1920-Marh 1921

Month, Day, Year, single:
  11/9/1911
  June 14, 1924
  Marh 8, 2006
  Otober 26, 1963

Month, Day, Year, multiple single:
  12/19/2005; 4/4/2006
  January 5, 2000,  July 12, 2000
  9/19 & 9/20/2007

Month, Day, Year, range:
  10/24-10/26/2008
  January 30, 2011-February 2, 2011
  Marh 22-24, 2001
  Otober 13, 1987-Deember 7, 1987

(Yes, the misspellings and stray spaces are really in the data.)

Here’s a summary of the issues:

  • Punctuation is not standard. Multiple dates may be separated with a period, comma, semi-colon, ampersand, or the word “and”.
  • We used a variety of methods to convey we were unsure of the date, such as ?, (?), [ ], [?], (approx.) in addition to all the circa variations. I’m guessing there are other dates we weren’t sure of, but we didn’t specify that.
  • Spacing isn't consistent. Sometimes there are no spaces around punctuation, other times one, two, or more spaces.
  • Spelling. Sometimes we just couldn't spell October or March (the most popular offenders, apparently).
  • Formats are all over the place, even comparing the same element and type. Ex: March 22-24, 2001 compared to March 22, 2001-March 24, 2001.
  • Use of decades was a common practice.
  • Providing single dates instead of ranges. Do we really need to say “1966, 1967, 1969” instead of “1966-1969” if we’re only missing 1968?

Next post we’ll talk about the instructions and rules we’re developing for cleaning this up and how we go about executing those decisions.

Baby Steps in Using OpenRefine to Clean Up Collection Data

As I mentioned in my last post, my latest collection management project is making sure that we have collection level records for everything in the repository, which I am doing through creating accession records in Archivists' Toolkit. (I chose accession records rather than resource records based on a set of legacy decisions about how the institution uses AT; if I were starting from scratch I would probably do that differently.) The piece of the puzzle that I've been working on recently is integrating the records of one of our sub-collections, the Bakhmeteff Archive of Russian and East European Culture, into AT.

The manuscript unit took over accessioning materials for this collection in about 2007, so collections that have been acquired in the past 7 or 8 years do have an AT record, as do about 60 older collections that were added as part of a separate project. So, my first step was to figure out which Bakhmeteff collections already had a collection level record and which ones did not. Since there was not one comprehensive collection list, this involved checking our AT records against a whole slew of other data sources* to see what was already in the database and which collections still needed collection descriptions in Archivists' Toolkit.

The next issue was to figure out the best way to identify duplicate records. In looking at all of the source data it became clear very quickly that the way the title was expressed varied wildly across the data sources I was working with — sometimes expressed as “John Smith Papers,” sometimes “Smith, John Papers” and, in the case of many of our older catalog records, just “Papers,” with John Smith living in the 100 field and not reappearing in the 245. Some sources used diacritical marks and some didn't (always thorny, but with several hundred collections in Russian a potential dealbreaker). Therefore I chose to use the collection number rather than the title. The one issue with that is that I was using AT accession records, not resource records, so the collection number was expressed as part of the title field (I know, I know) and had to be broken out into its own column, but that was not a huge deal.

Once I had that as a master document I could combine this spreadsheet and my other data sources and then use OpenRefine to facet my spreadsheet by ID number to identify (and eliminate) any collection that shows up both in AT and in one of my data sources. I then had a comprehensive list of collections not represented in AT, so that I knew which collections needed to be added. It's not a perfect solution, but it is a down-and-dirty way to identify where I have work to do, so that I am not having a student manually check every collection against different data sources to identify what needs a better record. It also let me combine data from all sources to come up with a new master collection list to work from. Plus it was a good, baby-steps introduction to using OpenRefine.

*Since information was coming from so many sources, and because I didn't trust the completeness of any of them, I was checking our AT holdings against accession registers, a collection number spreadsheet, our stack directory, and a list of collections that I asked one of our fantastic systems librarians to generate: a query of our catalog for any record cataloged with a bibliographic level of c: collection in the Leader and a location code that tied it back to our library.

Clean Up: Inclusive and Bulk Dates Comparison

Let’s start with a more straightforward cleanup issue identified during our accession record mapping. In this example, we’ll use a set of accession records (only a portion of our total) we have exported from the Beast into an Excel spreadsheet and focus on two fields: <unitdateinclusive> and <unitdatebulk>. We’ll map these to date fields in ArchivesSpace, but before we get to that let’s examine the data.

This spreadsheet contains 3361 accession records. 2685 rows have an inclusive date and 1908 rows include a bulk date. By sorting the spreadsheet by date and spot checking the data, we’ve come up with a list of inconsistent date formatting issues. One of the most pervasive habits was to always fill out both the inclusive and bulk dates, even if the values for each were the exact same. (For now, ignore the other date formatting issues in these examples.)

[Screenshot: accession rows with identical inclusive and bulk dates]

Supplying this information twice isn't necessary for our users and could be confusing to some (plus it is extra work for us!). DACS 2.4.10 suggests providing a bulk date when dates differ significantly from the inclusive dates, so we want to keep the bulk dates that are different from our inclusive dates while removing the duplicate values.

We could compare these by hand (done that before!) or use a formula in Excel to do the work for us:

=IF(A2=B2, "same", B2)

This formula asks if the value in <unitdateinclusive> equals the value in <unitdatebulk>. If they are equal, return the value “same” and if they are different return the value of <unitdatebulk>.

After dragging down the formula for the entire sheet I then copy the results of this new column to another one, utilizing the “paste values” feature to carry over the content and not the formula for the cell value.

[Screenshot: pasting values from the formula column]

I could have put nothing instead of “same” in my new column if the values were equal, but I wanted to know how many times these dates were equal. Sorting by my newbulkdate column I know that:

  • 777 rows only contained an inclusive date.
    • The formula as I have it would have returned a “0” here, because I didn’t tell it how to handle a blank cell in <unitdatebulk>.
    • Easy fix is to go back to my <unitdatebulk> column, find all the blank cells, and replace them with “empty.” Empty will carry forward with the formula. (I'm sure there is a way to handle this within the formula itself; see the sketch after this list if anyone wants to jump in.)
  • 567 rows were identified as having a different value in bulk date so these dates were retained.
  • 1341 rows had an identical value in inclusive and bulk dates and were not kept.
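For what it's worth, a nested IF that tests for the blank cell first would avoid the find-and-replace step while keeping the “empty” and “same” counts (a sketch, not tested against this particular sheet):

=IF(B2="", "empty", IF(A2=B2, "same", B2))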

I can now do a global replace on the newbulkdate column to replace “same” and “empty” with nothing. I then remove my original <unitdatebulk> column and my bulkformula column.

[Screenshot: only the bulk dates that differ from the inclusive dates remain]

Of course, this method only worked on cells where the characters were exactly the same. There will be bulk dates that are the same as inclusive dates that this didn't catch, such as values with different spacing or punctuation (ex: “c.2007” vs. “c. 2007”).

In other posts we'll look at more date cleanup questions and issues.