Exporting, Editing, and Importing EAD in Archivists’ Toolkit: A Checklist

Sometimes, it can be extremely helpful to take EAD XML files out of Archivists’ Toolkit to edit them.  Maybe you have a contents list that you generated from a spreadsheet, or maybe you want to quickly change 500 “otherlevel”s to “file”s.  Since there are so many small steps, I created a checklist.  Using the checklist will help to make sure that information doesn’t get lost and that the record looks like you want it to.
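For a bulk change like the “otherlevel” to “file” example, a short script can run the find and replace across the whole exported file.  The following is only a minimal sketch in Python, assuming the components are encoded as <c> or <c01> through <c12> elements with a level attribute, as in EAD 2002.  The function name and the sample tag are made up for illustration, and the result should be checked against your own export before re-importing.

```python
import re

# Matches the opening tag of a component (<c>, <c01> ... <c12>) up to and
# including level=" so that only the attribute value is replaced.
# The \b after the tag name keeps <container> and <chronlist> from matching.
COMPONENT_LEVEL = re.compile(
    r'(<c(?:0[1-9]|1[0-2])?\b[^>]*?\blevel=")otherlevel(")'
)

def relabel_otherlevel(ead_xml: str) -> str:
    """Change level="otherlevel" to level="file" on component tags."""
    return COMPONENT_LEVEL.sub(r'\1file\2', ead_xml)

# Hypothetical sample tag, not taken from a real finding aid.
sample = '<c02 id="ref12" level="otherlevel">'
print(relabel_otherlevel(sample))  # <c02 id="ref12" level="file">
```

Working on a copy of the exported file, and validating the edited file before importing it, keeps the original export available if the pattern catches more than intended.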

First, a word of caution: when the record is imported back into AT, it will overwrite all refids with new ones.  So if you’re using those refids elsewhere, this won’t work.  Additionally, before exporting the record, it’s important to copy down information that won’t be included in the export.  This includes any repository processing notes and linked accession records.  This is also why it’s important to make sure that “internal only” notes are included in the export.  Also, the file won’t re-import with barcode information, because barcodes are kept as non-valid attributes and violate the importer’s validation rules.

We found that when exporting, AT added information that we didn’t want when we re-imported the record, or imported information into different fields.  For example, at Tamiment, we use the Container Summary field on the “Basic Info” tab to record the container summary.  When this is exported, it maps to <extent> in <physdesc>.  When it’s re-imported into the Toolkit, it does not go into the container summary but becomes a General Physical Description note.  You can also change some of these in the EAD XML file, instead of after importing into AT.

You can find my checklist here or below.

Before Exporting EAD:

  • Write down which accession records are linked to the resource record
  • Record any information in repository processing note(s)
  • Do NOT check “Suppress components and notes when marked ‘internal only’” when exporting the original resource record

Before importing EAD:

  • If there are barcodes: do a find/replace on containers (using dot matches all) to delete barcodes
  • Make sure that the record it is replacing has been deleted

After importing EAD:

Basic Description:

  • Separate the prefix and numeric sections of the Resource Identifier into separate fields
  • Remove bulk dates from Date Expression field (this may also need to be done at the series or sub-series level)
  • Copy the text from the General Physical Description note into the Container Summary

Notes:

  • Remove General Physical Description note

Finding Aid Data:

  • Remove call number from Finding Aid Title field
  • Remove “Collection processed by” in Author field

Barcodes:

  • Re-enter barcodes

Accessions:

  • Re-link resource record to accession record(s)

Adding Structure to a Word Document Using Regular Expressions

At Tamiment I’ve been working on a team project to describe our photographic collection backlog.  The New York Hotel and Motel Trades Council Photographs contains around 15 linear feet of materials, which are mostly in labeled folders and grouped by subject.  While my first instinct was to describe this collection at the box level, I discovered that there was a folder-level MS Word inventory that had been done on-site.  Maureen and I turned this Word document into the container list of an EAD finding aid.

[Image: initial inventory]

As Maureen showed in her previous tutorial, the first step was copying this into a spreadsheet.  However, this inventory needed some extra cleanup before it looked how I wanted it to.  Problems included: dates at the beginning and ends of folder titles; subseries that ended in one box and started again two boxes later; and administrative “codes.”  In order to clean up the document and add structure, I used regular expressions in the find and replace in my word processor.

Regular Expressions

Regular expressions allow us to match patterns instead of specific words.  For example, instead of searching for a specific abbreviation, such as Misc., I could use the pattern [A-Za-z]+\. to match all abbreviations in my inventory.  Regular expressions are used in a number of programming languages and can also be used in the find and replace tool in several word processors.  In order to use this functionality of find and replace, you need to have “regular expressions” checked (usually under “more options”).  Not all word processors have the same functionality; I used LibreOffice Writer, which is free, open-source software.
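The abbreviation pattern can also be tried outside a word processor.  Here is a small sketch in Python, using made-up inventory lines (they are not from the actual inventory) to show what [A-Za-z]+\. picks up:

```python
import re

# Hypothetical folder titles, invented for illustration.
inventory_lines = [
    "Misc. correspondence",
    "Dept. of Labor meeting",
    "Annual dinner",
]

# One or more letters followed by a literal period.
abbreviation = re.compile(r"[A-Za-z]+\.")

found = [abbreviation.findall(line) for line in inventory_lines]
print(found)  # [['Misc.'], ['Dept.'], []]
```

Note that the pattern matches any word followed by a period, so a word at the end of a sentence would be caught as well.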

I learned how to use regular expressions through the tutorial “Understanding Regular Expressions” by Doug Knox on the Programming Historian, which I highly recommend checking out.

Here is a list of special characters, and combinations of special characters, that I used to edit and add structure to my inventory.  A full list of LibreOffice’s regular expressions can be found here.  Note: this exact syntax won’t work with Microsoft Word, which uses its own wildcards.  More information about Microsoft Word’s find and replace can be found here.

Expression    Meaning
.*            any character, repeated zero or more times
[A-Za-z]+     any uppercase or lowercase letter, occurring one or more times
^             beginning of a paragraph
$             end of a paragraph
^$            empty paragraph
\t            tab
[0-9]{2}      any digit, repeated exactly two times
\*            “\” indicates that the following special character (in this case, “*”) is a normal character
( )           in a search, defines the characters inside the parentheses as a reference
$1            in a replace, refers to the first reference defined in the search

Cleanup

First, I cleaned up the formatting.  I removed the bullet points and made sure everything was aligned left.  Though I could remove the blank lines easily in my spreadsheet, I removed them at this stage so the data would be easier to work with:

Search for

^$

Replace with

nothing

The markers “*” and “*P” appeared at the beginning of some of the lines; these were for administrative purposes and I didn’t want them in the published finding aid.  To remove them, I searched for them at the beginning of the line, which I indicated with “^”.  However, because “*” is a special character, I needed to use “\” before it to indicate that it was a normal character.  So my expressions looked like this:

Search for 

^\*P

Replace with

nothing

Search for

^\*

Replace with

nothing

In this inventory, when a subseries continues into another box, it is indicated by a line such as “[Members continued]”.  Since in my spreadsheet it will be evident that these folders are all part of one series, I got rid of these lines using:

Search for

^\t\[.*continued\]$

Replace with

nothing
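For anyone who prefers a script to a word processor, the cleanup steps above can be sketched in Python.  This is only an illustration under a few assumptions: the inventory has been saved as plain text with one paragraph per line, the function name and sample text are invented, and the blank-line step is moved to the end so that lines emptied by the other replacements are dropped too.  Python needs the MULTILINE flag for “^” and “$” to match at every line rather than only at the start and end of the whole text.

```python
import re

def clean_inventory(text: str) -> str:
    """Apply the cleanup searches to a plain-text inventory."""
    # Remove the administrative markers "*P" and "*" at the start of a line.
    # "*P" is handled first so that its "P" is not left behind.
    text = re.sub(r"^\*P", "", text, flags=re.MULTILINE)
    text = re.sub(r"^\*", "", text, flags=re.MULTILINE)
    # Empty out tab-indented lines such as "[Members continued]".
    text = re.sub(r"^\t\[.*continued\]$", "", text, flags=re.MULTILINE)
    # Drop empty lines (the equivalent of replacing ^$ with nothing).
    return "\n".join(line for line in text.split("\n") if line != "")

# Hypothetical sample, not the real inventory.
sample = "Box 1\n\n*Banquet\n*PDelegates\n\t[Members continued]\nStaff"
print(clean_inventory(sample))
```

One thing to watch for with the “*P” search: a line such as “*Picnic” also begins with “*P”, so a title starting with a capital P right after the marker would lose its first letter.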

Adding Structure

When I copied and pasted the document into a spreadsheet, it was only one column.  I want the series title in the first column, subseries title in the second column, folder title in the third column, date in the fourth column, and box number in the fifth column.  I can add structure to this inventory by indicating columns with tabs.

I manually tabbed the subseries titles (which are indicated with italics) once.

To put the box number in the fifth column, I used

Search for

^Box

Replace with

\t\t\t\tBox

To put my folder titles in the third column, I used

Search for

^([A-Za-z]+)

Replace with

\t\t$1
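The two replacements above translate fairly directly into Python, where a reference is written \1 rather than $1.  The sketch below is an illustration with an invented function name and sample lines; it assumes the text is plain text with one paragraph per line.  The order matters: the box lines are tabbed first, so they no longer begin with a letter when the folder-title search runs.  Any line that still begins with a letter is moved to the third column, so series titles would need to be handled separately.

```python
import re

def add_columns(text: str) -> str:
    """Indicate columns with leading tabs, following the two searches above."""
    # Box lines go in the fifth column: four leading tabs.
    text = re.sub(r"^Box", "\t\t\t\tBox", text, flags=re.MULTILINE)
    # Lines that still begin with a letter are treated as folder titles
    # and go in the third column: two leading tabs.
    text = re.sub(r"^([A-Za-z]+)", r"\t\t\1", text, flags=re.MULTILINE)
    return text

# Hypothetical sample: a box line, a manually tabbed subseries,
# a folder title, and a folder title that begins with a year.
sample = "Box 1\n\tMembers\nAnnual dinner\n1955 Convention"
print(repr(add_columns(sample)))
```

The line beginning with a year is left alone here, because it does not start with a letter.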

Dates

Some of the dates are at the beginning of the folder title; I wanted them all at the end of the folder title.  Since I know that what I want to move is at the beginning of the line, I indicated this with “^”.  There are some four-digit numbers that are not years, and all dates are from the 20th century, so I made sure that the match began with “19” and was followed by two digits, so it looked like this: “19[0-9]{2}”.  Because I wanted to switch this value with the rest of the line, I indicated the rest of the line using “.*$” and surrounded each value in parentheses.  In my replace field, the value $1 refers to the first group in parentheses and the value $2 refers to the second group in parentheses, so putting $2 before $1 swaps them.

Search for

^(19[0-9]{2}) (.*)$

Replace with

$2 $1

Because I want the date in the next column over, I put a tab between the folder title and year by searching for “([A-Za-z]+) (19[0-9]{2})$” and replacing with “$1\t$2”.  The “\t” indicates a tab.

Search for

([A-Za-z]+) (19[0-9]{2})$

Replace with

$1\t$2
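Here is a rough Python version of the two date steps as the prose describes them: first moving a year from the beginning of a title to the end, then putting a tab between the title and a trailing year.  The function names and sample titles are invented for illustration, and each function works on a single line.  Python writes the references as \1 and \2 instead of $1 and $2.

```python
import re

def move_leading_year(line: str) -> str:
    """Move a 20th-century year from the start of the line to the end."""
    # Group 1 is the year, group 2 is the rest of the line; swap them.
    return re.sub(r"^(19[0-9]{2}) (.*)$", r"\2 \1", line)

def split_trailing_year(line: str) -> str:
    """Replace the space before a trailing year with a tab."""
    return re.sub(r"([A-Za-z]+) (19[0-9]{2})$", r"\1\t\2", line)

# Hypothetical folder title.
title = move_leading_year("1955 Annual dinner")
print(repr(title))                       # 'Annual dinner 1955'
print(repr(split_trailing_year(title)))  # 'Annual dinner\t1955'
```

Because the second pattern expects letters immediately before the space, a title ending in a number or punctuation before the year (for example “Local 6 1955”) would not be split and would need to be fixed by hand.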

Into a Spreadsheet

I’m ready to put this data into a spreadsheet.  When pasting into the spreadsheet, choose “paste special.”  Then check “unformatted text” and “separated by tabs.”  My spreadsheet looks like this:

[Image: final spreadsheet]

It still needs some further edits, which I can do in this spreadsheet or back in the word processor.  But it’s a lot better than before, with little editing by hand, and is a step closer to being turned into the container list of an EAD-encoded finding aid.
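To see what “separated by tabs” does with the tabbed text, here is a small sketch using Python’s csv module.  The function name, the five-column width, and the sample lines are assumptions for illustration; each tab starts a new column, and short rows are padded with empty cells.

```python
import csv
import io

def to_rows(tabbed_text: str, width: int = 5) -> list:
    """Split tab-delimited text into rows padded to the same number of columns."""
    reader = csv.reader(
        io.StringIO(tabbed_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quotation marks in titles as ordinary text
    )
    return [row + [""] * (width - len(row)) for row in reader]

# Hypothetical lines: a folder title with its date, then a box number.
sample = "\t\tAnnual dinner\t1955\n\t\t\t\tBox 1"
for row in to_rows(sample):
    print(row)
```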