getting dates out of titles

Today, I’m taking a 43 thousand-line spreadsheet and turning it into a CRAAAAAAAAZZZZY finding aid. Even though the spreadsheet has a separate column for dates, there are often dates in the title column. They need to go. And luckily, they don’t need to be moved elsewhere (this is possible but slightly trickier).

I spent some quality time with my data and noticed that in every case, the date was formatted like the examples below:

  • Meli, Robert What Do You Think Is Causing the Fighting In Palestine 2/15/48
  • Mc Arteen, Patrick Do You Favor Henry Wallace’s Decision to Run for President In on a Third-party Ticket 1/18/48
  • Keeran, Vincent What Is Causing the Current Rise In Prices 10/12/47

Basically, these are interview pieces. There’s the interviewee’s name, then the title of the interview, then the date.

Since I was working in OpenRefine (formerly Google Refine), I knew that I could use GREL (the Google Refine Expression Language) to replace this with blank text using a regular expression.

Google Refine regular expression

The expression is value.replace(/[0-9]{1,2}\/[0-9]{1,2}\/\d{2}/,””) . Basically, this looks for a pattern of one or two numbers, then a slash, then one or two numbers, then a slash, then two numbers. It replaces it with nothing.

This is not the most elegant regex anyone has ever written. But it was pretty quick to figure out, and it worked!

Next, I want to figure out a way to add a period after the interviewee’s name. More on that when I figure it out!


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s