Adding Structure to a Word Document Using Regular Expressions

At Tamiment I’ve been working on a team project to describe our photographic collection backlog.  The New York Hotel and Motel Trades Council Photographs contains around 15 linear feet of materials, which are mostly in labeled folders and grouped by subject.  While my first instinct was to describe this collection at the box level, I discovered that there was a folder-level MS Word inventory that had been done on-site.  Maureen and I turned this Word document into the container list of an EAD finding aid.

[Image: the initial Word inventory]

As Maureen showed in her previous tutorial, the first step was copying this into a spreadsheet.  However, this inventory needed some extra cleanup before it looked how I wanted it to.  Problems included: dates at both the beginnings and ends of folder titles; subseries that ended in one box and started again two boxes later; and administrative “codes.”  In order to clean up the document and add structure, I used regular expressions in the find and replace tool of my word processor.

Regular Expressions

Regular expressions allow us to match patterns instead of specific words.  For example, instead of searching for a specific abbreviation, such as Misc., I could use the pattern [A-Za-z]+\. to match all abbreviations in my inventory.  Regular expressions are used in a number of programming languages and can also be used in the find and replace tool in several word processors.  In order to use this functionality of find and replace, you need to have “regular expressions” checked (usually under “more options”).  Not all word processors have the same functionality; I used LibreOffice Writer, which is free, open-source software.
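To try this pattern outside a word processor, here is a minimal sketch in Python. The sample text is made up for illustration, and Python's re module stands in for the find and replace tool; its syntax is close to LibreOffice's but not identical.

```python
import re

# Made-up sample text, not taken from the actual inventory.
text = "Misc. photographs, Dept. of Labor, Exec. Board meeting"

# One or more letters followed by a literal period.
abbreviation = re.compile(r"[A-Za-z]+\.")

matches = abbreviation.findall(text)
print(matches)
```

Note that the pattern matches any run of letters followed by a period, so it would also pick up the last word of a sentence.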

I learned how to use regular expressions through the tutorial “Understanding Regular Expressions” by Doug Knox on the Programming Historian, which I highly recommend checking out.

Here is a list of special characters, and combinations of special characters, that I used to edit and add structure to my inventory.  A full list of LibreOffice’s regular expressions can be found here.  Note: this exact syntax won’t work with Microsoft Word, which uses its own wildcard syntax.  More information about Microsoft Word’s find and replace can be found here.

.*           any character, occurring any number of times (including zero)
[A-Za-z]+    any capital or lowercase letter, occurring one or more times
^            beginning of a paragraph
$            end of a paragraph
^$           empty paragraph
\t           tab
[0-9]{2}     any digit, repeated exactly two times
\*           “\” indicates that the following special character (in this case, “*”) is a normal character
( )          in a search, defines the characters inside the parentheses as a reference
$1           in a replace, refers to the first reference defined in the search
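Several of these expressions can be tried out in a few lines of Python. This is only a sketch on a made-up sample: Python's re module writes references as \1 and \2 where LibreOffice uses $1 and $2, and it needs the re.MULTILINE flag before “^” and “$” apply to each line rather than to the whole text.

```python
import re

# Made-up sample: three "paragraphs", the middle one empty.
sample = "*P Banquet 52\n\nBox 12"

# \* matches a literal asterisk instead of meaning "zero or more".
asterisks = re.findall(r"\*", sample)

# [0-9]{2} matches exactly two digits in a row.
two_digits = re.findall(r"[0-9]{2}", sample)

# With re.MULTILINE, ^ anchors the match to the start of each line.
leading_words = re.findall(r"^[A-Za-z]+", sample, flags=re.MULTILINE)

# Parentheses define references; the replacement swaps them.
swapped = re.sub(r"^([A-Za-z]+) ([0-9]{2})$", r"\2 \1", "Box 12")

print(asterisks, two_digits, leading_words, swapped)
```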

Cleanup

First, I cleaned up the formatting.  I removed the bullet points and made sure everything was aligned left.  Though I could remove the blank lines easily in my spreadsheet, I removed them at this stage so the data would be easier to work with:

Search for

^$

Replace with

nothing
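For comparison, the same cleanup can be sketched in Python by treating each line as a paragraph and dropping the ones that match “^$”. The sample inventory below is invented for illustration.

```python
import re

# Made-up inventory text with blank lines between entries.
inventory = "Box 1\n\nBanquet 1952\n\n\nPicket line 1948\n"

paragraphs = inventory.splitlines()

# Keep only the paragraphs that are not empty (i.e. do not match ^$).
non_empty = [p for p in paragraphs if not re.search(r"^$", p)]

print(non_empty)
```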

The markers “*” and “*P” appeared at the beginning of some of the lines; these were for administrative purposes and I don’t want them in the published finding aid.  To remove these, I searched for them at the beginning of the line, which I indicated with “^”.  However, because “*” is a special character, I needed to use “\” before it to indicate that it was a normal character.  So my expressions looked like this:

Search for 

^\*P

Replace with

nothing

Search for

^\*

Replace with

nothing
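Here is a rough Python equivalent of these two replacements, run in the same order. The sample lines, including the space after each marker, are assumptions made for illustration and may not match the layout of the real inventory.

```python
import re

# Made-up lines; the space after each marker is an assumed layout.
inventory = "*P Banquet 1952\n* Picket line 1948\nParade 1950"

# Remove the longer marker first, then the bare asterisk.
step1 = re.sub(r"^\*P", "", inventory, flags=re.MULTILINE)
cleaned = re.sub(r"^\*", "", step1, flags=re.MULTILINE)

# Running the bare-asterisk pattern first would leave a stray "P" behind.
wrong_order = re.sub(r"^\*", "", inventory, flags=re.MULTILINE)

print(cleaned.splitlines())
print(wrong_order.splitlines()[0])
```

The order matters: “^\*” also matches the start of “*P”, so the more specific marker has to be removed first. Note too that “^\*P” would clip a title that begins with P if it directly followed a plain “*”.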

In this inventory, when a subseries continues into another box, it is indicated by “[Members continued]”.  Since in my spreadsheet it will be evident that these folders are all part of one series, I got rid of them using:

Search for

^\t\[.*continued\]$

Replace with

nothing
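A small Python sketch of the same idea, using made-up lines: any paragraph that is a tab followed by a bracketed note ending in “continued” is dropped, and everything else is kept.

```python
import re

# Tab, opening bracket, anything, then "continued]" at the end of the line.
continued = re.compile(r"^\t\[.*continued\]$")

# Made-up lines for illustration.
lines = [
    "\tMembers",
    "Banquet 1952",
    "\t[Members continued]",
    "Picket line 1948",
]

kept = [line for line in lines if not continued.search(line)]

print(kept)
```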

Adding Structure

When I copied and pasted the document into a spreadsheet, it was only one column.  I want the series title in the first column, subseries title in the second column, folder title in the third column, date in the fourth column, and box number in the fifth column.  I can add structure to this inventory by indicating columns with tabs.

I manually tabbed the subseries titles (which are indicated with italics) once.

To put the box number in the fifth column, I used

Search for

^Box

Replace with

\t\t\t\tBox

To put my folder titles in the third column, I used

Search for

^([A-Za-z]+)

Replace with

\t\t$1
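The two tabbing steps can be sketched in Python as follows. The sample lines are invented, and the subseries line is assumed to have been tabbed by hand already.

```python
import re

# Made-up lines; the subseries ("Members") already has its manual tab.
text = "\tMembers\nBox 3\nBanquet\nPicket line"

# Step 1: push box numbers into the fifth column (four leading tabs).
step1 = re.sub(r"^Box", "\t\t\t\tBox", text, flags=re.MULTILINE)

# Step 2: push any line that still starts with a letter into the
# third column (two leading tabs). \1 is Python's form of $1.
step2 = re.sub(r"^([A-Za-z]+)", r"\t\t\1", step1, flags=re.MULTILINE)

rows = step2.splitlines()
print(rows)
```

The box step has to come first here: once a “Box” line starts with tabs, the second pattern no longer matches it, whereas running the second pattern first would tab “Box” lines as if they were folder titles.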

Dates

Some of the dates are at the beginning of the folder title; I wanted them all at the end of the folder title.  Since I know that what I want to move is at the beginning of the line, I indicated this with “^”.  There are some four-digit numbers that are not years, and all dates are from the 20th century, so I made sure that the match began with “19” and was followed by two digits, so it looked like this: “19[0-9]{2}”.  Because I wanted to switch this value with the rest of the line, I indicated the rest of the line using “.*$” and surrounded each value in parentheses.  In my replace field, the value $1 refers to the first group in parentheses and the value $2 refers to the second group in parentheses.

Search for

^(19[0-9]{2}) (.*)$

Replace with

$2 $1
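Moving a leading year to the end of the line, as described in the paragraph above, can be sketched in Python like this. The folder titles are made up, and \1 and \2 are Python's form of $1 and $2.

```python
import re

# A year from the 1900s at the start of the line, then the rest of the line.
leading_year = re.compile(r"^(19[0-9]{2}) (.*)$")

def move_year_to_end(line):
    # Swap the two groups so the year follows the title.
    return leading_year.sub(r"\2 \1", line)

moved = move_year_to_end("1952 Banquet")
unchanged = move_year_to_end("3456 Picnic")  # four digits, but not 19xx

print(moved)
print(unchanged)
```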

Because I want the date in the next column over, I put a tab between the folder title and year by searching for “([A-Za-z]+) (19[0-9]{2})$” and replacing with “$1\t$2”.  The “\t” indicates a tab.

Search for

([A-Za-z]+) (19[0-9]{2})$

Replace with

$1\t$2
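The same replacement in a short Python sketch, again on a made-up line. Only the space between the last word of the title and the year is turned into a tab.

```python
import re

# Made-up folder line, already tabbed into the third column.
line = "\t\tAnnual Banquet 1952"

# Last word of the title, a space, then a 19xx year at the end of the line.
with_date_column = re.sub(r"([A-Za-z]+) (19[0-9]{2})$", r"\1\t\2", line)

print(repr(with_date_column))
```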

Into a Spreadsheet

I’m ready to put this data into a spreadsheet.  When pasting, click “paste special.”  Then check “unformatted text” and “separated by tabs.”  My spreadsheet looks like this:

[Image: the final spreadsheet]
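To see how the tabs turn into columns without opening a spreadsheet, the tabbed text can be read with Python's csv module. This is only a sketch on invented lines; it is not how the spreadsheet above was made.

```python
import csv
import io

# Made-up tabbed text: series, subseries, folder with date, box number.
tabbed = "Officers\n\tMembers\n\t\tBanquet\t1952\n\t\t\t\tBox 3\n"

# Each tab starts a new column, just as "separated by tabs" does on paste.
rows = list(csv.reader(io.StringIO(tabbed), delimiter="\t"))

for row in rows:
    print(row)
```

Each leading tab shows up as an empty cell, which is what places the folder title in the third column and the box number in the fifth.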

It still needs some further edits, which I can do in this spreadsheet or back in the word processor.  But it’s a lot better than before, with little editing by hand, and is a step closer to being turned into the container list of an EAD-encoded finding aid.


4 thoughts on “Adding Structure to a Word Document Using Regular Expressions”

  1. So helpful! Just used some of these steps to provide some structure to an 800-page folder list in Word. Was wondering if you developed other regular expressions for some of the date formatting? Or would be willing to offer some suggestions?

    It seems in your example above the date follows the folder title (or you moved it to be in that order). In my example, I have multiple date formats, but also lots of punctuation thrown in. I’ve tried building those into my expressions, but can’t quite get there.

    Here’s some examples:
    folder title, YYYY
    folder title, YYYY-YYYY
    folder title, [YYYY]
    folder title, [YY]YY
    folder title, June DD, YYYY

    Dates are from the 1900s, but there could be some outliers.

    • I’ve found it helpful to be as targeted as possible, so “1[89][0-9]{2}” is more likely to get only what you need than “[0-9]{4}”. To find the dates with months and days, I would use “[a-z]\, [a-z]+ [0-9]{1,2}\, 19[0-9]{2}”. If there aren’t always commas after folder titles, “[a-z] [a-z]+ [0-9]{1,2}\, 19[0-9]{2}” would work. To find bracketed dates, I would use “[a-z]+\, \[19” or “[a-z]+\, \[[0-9]{2}” if you’re worried about outliers. However, if removing the brackets is possible, I would do that first. (It might also be helpful to remove any unnecessary punctuation as a first step. If months are a major problem, which they have been for me, it might be worth it to do 12 find-and-replaces turning the month name into 3 letters or a number.)

      These are targeted towards adding tabs (or something else) between two expressions, but I’m not sure if that’s what you’re doing.


  2. Thanks, Bonnie! Just getting back to this now so will let you know how it goes. I will say, removing punctuation really helped with the dates without changing anything else. Had lots of “space dash space” between ranges that was messing things up.

