Clean Up: Dates and OpenRefine

Here are some of the ways we are using OpenRefine to clean up dates in accession records based on our instructions. In some cases I have included more than one way to complete the same activity. There are probably more elegant ways to do this, so if you have other suggestions please let us know in the comments!

In this example we’re starting with running formulas on our unitdateinclusive column, splitting this out into begin and end dates, and then will return to format our date expression.

Periodically I use Edit Cells>Common Transforms>Trim leading and trailing whitespace just to keep things tidy. Be sure to run this anytime you’ll be using an expression that looks at the beginning or end of a cell.

Remove various characters, punctuation, and circa notations

Use Edit Cells>Transform with the following formals:

  • value.replace(‘c.’, ‘ ‘).replace(‘c’, ‘ ‘).replace(‘ca.’, ‘ ‘).replace)(‘Ca.’, ‘ ‘)replace(‘circa’, ‘ ‘).replace(‘unknown’, ‘ ‘).replace(‘n.d.’, ‘undated’).replace(‘, undated’, ‘and undated’).replace(‘no date’, ‘ ‘)
    • You could use Facet>Text facet and enter one of the words to analyze the fields first. Then either use all or part of the formula above or make use of the cluster function.
  • value.replace(/[?]/,’ ‘)  or value.replace(“?”,’ ‘)  [These do the same things, just showing different syntax you can use.]
  • value.replace(“[“,’ ‘).replace(“]”, ‘ ‘)
  • value.replace(“‘s”, ‘s’)

Convert to standardized dates

  • Edit Cells>Common Transforms>To text
  • Edit Cells>Common Transforms>To date

From here we’ll run a set of operations to give more conformity to the data. (This is where there must be better ways of doing this.)

  • Edit>Cells>Common Transforms>To text
  • Change months to numbers with
    • value.replace(‘January ‘, ’01/’).replace(‘February ‘, ’02/’).replace(‘March ‘, ’03/’).replace(‘April ‘, ’04/’).replace(‘May ‘, ’05/’).replace(‘June’, ’06/’).replace(‘July ‘, ’07/’).replace(‘August ‘, ’08/’).replace(‘September ‘, ’09/’).replace(‘October ‘, ’10/’).replace(‘November ‘, ’11/’).replace(‘December ‘, ’12/’)
    • value.replace(‘Jan ‘, ’01/’).replace(‘Feb ‘, ’02/’).replace(‘Mar ‘, ’03/’).replace(‘Apr ‘, ’04/’).replace(‘May ‘, ’05/’).replace(‘Jun’, ’06/’).replace(‘Jul ‘, ’07/’).replace(‘Aug ‘, ’08/’).replace(‘Sep ‘, ’09/’).replace(‘Sept ‘, ’09/’).replace(‘Oct ‘, ’10/’).replace(‘Nov ‘, ’11/’).replace(‘Dec ‘, ’12/’)
  • Replace commas and space after day before year: value.replace(‘, ‘, ‘/’)

We now have dates that are in the same and standard format as opposed to the variations we saw in the Date Formats post.

Create begin and end date columns with this data:

  • Duplicate the column: Edit Column>Add column based on this column… and by supplying no formula we can copy over all the data to our new column.
  • Split the new column using Edit Column>Split into several columns… and designate “-” as our separator.
  • Rename first column “date_1_begin” and second “date_1_end”.
  • Run Edit Cells>Transform: toString(toDate(value),”yyyy-MM-dd”) on both columns to get in the format required by ArchivesSpace, without having to deal with the day, hours, minutes, and seconds of the date format in OpenRefine.

We’ll still want to review the date begin column to identify the rows that had two single dates indicated by commas, ampersands, semi colons, and, or some other designator. We would separate out the second date to “date 2” fields.

Upcoming

  • Standardizing dates in to our preferred local natural language for date expression
  • Converting decades (ex: 1920s) programmatically

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s