Getting into the guts of AT

It’s the thing that we keep saying — in order to deal with our masses of stuff better, we need better ways of understanding what we have. A lot of my questions aren’t just about what’s in our finding aids — they’re about the relationships between archival materials and other archival management functions — accessioning, digital object management, location management, container management. For instance, the following questions have come up in the past or could easily come up in the future:

  • Which collections are made up of accessions that came in before 1980?
  • Which collections have digital objects associated with them? What are the URIs of those digital objects?
  • I have a barcode for a box. Can you tell me the materials that are supposed to be in that box? What collection is this from?
  • We haven’t used this location listed in the location table since 2005! Are there any boxes associated with that location? What are they?

In order to answer these questions, I need to write reports that join different tables in Archivists’ Toolkit together. And this is a little bit tricky, because in their own way, components in the AT database are hierarchical (just like in an EAD-encoded finding aid). If I have an instance (a container with a barcode), and I want to know which collection it belongs to, I don’t have a direct relationship in the database. Instead, an instance is associated with a component. That component is associated with its parent component. There may be many ancestor components in between before we reach the top-most component, which is the one associated with the collection-level information in the resources table.

These relationships are made in SQL through what are called “joins”. And joining a table on itself (in some cases several times, recursively) is a huge friggin pain in the neck. So, after mucking around for a little while, the solution was to just ask someone smarter than me how he would handle this.

This is where my colleague Steelsen comes in — Steelsen introduced the idea of writing a stored procedure that would look for the top-most component instead of having to do this through joins. And then he wrote them for me, because he is a mensch of the first order. His procedures are here, and available to anyone who might find them useful. They have seriously revolutionized the way that I’ve been able to do reporting and solve problems.
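Under the hood, what those procedures do is walk up the tree: follow a component's parent link repeatedly until there is no parent, then read the resource off the top-most component. A rough sketch of that logic in Python (the dictionary and field names here are made-up stand-ins for the AT ResourcesComponents table; the actual procedures are SQL):

```python
# Toy stand-in for the ResourcesComponents table: each component records
# its parent component (None at the top) and, at the top level only,
# the resource it hangs off of. Field names are illustrative, not the
# real AT column names.
components = {
    301: {"parent": None, "resource_id": 42},   # series-level component
    302: {"parent": 301,  "resource_id": None}, # subseries
    303: {"parent": 302,  "resource_id": None}, # file-level component
}

def get_top_component(component_id):
    """Climb parent links until we reach a component with no parent."""
    while components[component_id]["parent"] is not None:
        component_id = components[component_id]["parent"]
    return component_id

def get_resource_from_component(component_id):
    """The resource is recorded on the top-most component."""
    return components[get_top_component(component_id)]["resource_id"]

print(get_top_component(303))            # 301
print(get_resource_from_component(303))  # 42
```

This is exactly the self-join-until-you-hit-the-top problem from above, just expressed as a loop instead of a pile of recursive joins.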

For instance, something that folks have been begging for is a barcode look-up tool — they have a barcode, and they want to know which collection it belongs to, what its call number is, which location it’s assigned to, and which components are associated with that box. So here’s what I wrote (the user indicates the barcode in the WHERE clause):

use schema;

SELECT CONCAT(r.resourceIdentifier1,
 ' ',
 LPAD(r.resourceIdentifier2, 4, '00')) 'Collection',
 r.title 'Collection Title',
 series.subdivisionIdentifier 'Series/Accession Number',
 series.title 'Series Title',
 rc.title 'Component Title',
 rc.dateExpression 'Component Date',
 adi.container1Type 'Container Type',
 adi.container1NumericIndicator BoxNum,
 adi.container1AlphaNumIndicator BoxAlpha,
 adi.container2NumericIndicator FolderNum,
 adi.container2AlphaNumIndicator FolderAlpha,
 adi.archDescriptionInstancesId InstanceID,
 adi.barcode Barcode,
 adi.userDefinedString1 'Voyager Info',
 loc.coordinate1NumericIndicator ShelfNum,
 loc.coordinate1AlphaNumIndicator ShelfAlpha
FROM
 ArchDescriptionInstances adi
JOIN
 ResourcesComponents rc ON adi.resourceComponentId = rc.resourceComponentId
LEFT JOIN
 LocationsTable loc ON adi.locationID = loc.locationID
JOIN
 Resources r ON r.resourceId = GETRESOURCEFROMCOMPONENT(rc.resourceComponentId)
LEFT JOIN
 ResourcesComponents series ON GETTOPCOMPONENT(rc.resourceComponentId) = series.resourceComponentID
WHERE
 adi.barcode = 39002042658774;

Here I use two of Steelsen’s procedures. With GETRESOURCEFROMCOMPONENT, I go up the tree from a component to find out which resource it belongs to and join that to the resources table. I use GETTOPCOMPONENT to help figure out which series a component belongs to (this assumes that the top-most component is a series, but that’s usually a safe bet for us).

I’m a SQL n00b, and this isn’t the most efficient query I’ve ever run, but I’m really happy with the results, which can be viewed in a spreadsheet here.

By changing the WHERE clause, I can find out all kinds of associated information about a location, a collection, a box, whatever. I can find out if barcodes have been assigned to components with different box numbers; I can find out if components with the same barcode have been assigned to more than one location. This set of procedures has really been a godsend to help me know more about the problems I’m fixing. So many thanks to Steelsen. I hope others find them useful too.

Figuring Out What Has Been Done

It’s been a while since I last posted, and there’s a good reason for that — I’ve started an exciting new job as an archivist and metadata specialist at Yale. I miss my colleagues and friends at Tamiment every day, and I look forward to continued awesome things from them.

Here at Yale, I work in Manuscripts and Archives. The major project for the first year will be to migrate from Archivists’ Toolkit to ArchivesSpace. In anticipation of this migration, I’m learning about the department’s priorities for data clean-up and thinking through what I can do to help implement those goals.

The Goals

One of the first projects that was added to my list was to take a look at a project that has been ongoing for a while — cleaning up known errors from the conversion of EAD 1.0 to EAD 2002. Much of the work of fixing problems has already been done — my boss was hoping that I could do some reporting to determine what problems remain and in which finding aids they can be found.

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?

Getting to the Files

Access to files that have been published to our portal is easy — they’re kept in a file directory that is periodically uploaded to the web. I also have a cache of the EAD 1.0 files, pre-transformation. Copies of both were easy to pull down. But one of the questions I was asking was how these differ from what’s in the AT. It’s so, so easy to make changes in AT and forget to export the EAD.

If any of you know good ways to batch export EAD from AT, PLEASE LET ME KNOW. I have a pretty powerful machine and I know that folks here have worked on optimizing our AT database, but I could only export about 200 files at a time, for fear of the application hanging and eventually seizing up. So, I ran this in the background over the course of several days and worked on other stuff while I was waiting.

For some analyses, I wanted to exclude finding aids that aren’t published to our portal — for these, I copied the whole corpus to a separate directory. To get a list of which finding aids are internal-only, I very simply copied the resource record screen in AT (you can customize this to show internal-only finding aids as a column), which pastes very nicely into Excel.

Once in Excel, I filtered down to the rows where Internal Only = “TRUE”. From there, I used the same technique that I had used to kill our zombie finding aids at NYU. I made a text document called KillEAD.txt, which lists the internal-only finding aids, and I used the command

cat KillEAD.txt | tr -d '\r' | xargs echo rm | sh

to loop through that list and delete each file it names. (In case you’re wondering, I’m now using a Unix-like environment for Windows called Cygwin, and there are weird things that don’t play nicely with Windows, including the fact that Windows text documents end lines with a carriage return, \r, which has to be stripped before the filenames will match. Oh, also, I put this together with spit and baling wire and a google search — suggestions on better ways to do this kind of thing are appreciated).
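In the spirit of those requested suggestions, here is one sturdier alternative, a few lines of Python that sidestep the carriage-return problem entirely (this is a sketch I'm offering, not the method from the post; the file and directory names in the demonstration are made up):

```python
from pathlib import Path
import tempfile

def kill_listed(list_file: Path, target_dir: Path) -> list:
    """Delete every file in target_dir named in list_file; return names deleted."""
    deleted = []
    for name in list_file.read_text().splitlines():
        # .strip() removes the Windows \r (and stray whitespace), which is
        # what the `tr -d '\r'` in the shell pipeline above takes care of.
        name = name.strip()
        if not name:
            continue
        victim = target_dir / name
        if victim.exists():
            victim.unlink()
            deleted.append(name)
    return deleted

# Tiny demonstration with throwaway files standing in for the EAD exports.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    for fname in ["mssa.ms.0001.xml", "mssa.ms.0002.xml"]:
        (d / fname).write_text("<ead/>")
    (d / "KillEAD.txt").write_text("mssa.ms.0001.xml\r\n")  # note the \r
    removed = kill_listed(d / "KillEAD.txt", d)
    print(removed)                            # ['mssa.ms.0001.xml']
    print((d / "mssa.ms.0002.xml").exists())  # True
```

Because nothing is piped through a shell, filenames with spaces or other awkward characters also come through unharmed.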

So, that’s step 0. Don’t worry, I’ll be blogging my approach to goals 1-7 in the days ahead. I have some ideas about how I’ll do much of it (1, 3, and 4 I know how to assess, 2 I have a harebrained plan for, 5-7 are still nascent), but any suggestions or brainstorming for approaching these problems would be MORE THAN WELCOME.

getting dates out of titles

Today, I’m taking a 43,000-line spreadsheet and turning it into a CRAAAAAAAAZZZZY finding aid. Even though the spreadsheet has a separate column for dates, there are often dates in the title column. They need to go. And luckily, they don’t need to be moved elsewhere (this is possible but slightly trickier).

I spent some quality time with my data and noticed that in every case, the date was formatted like the examples below:

  • Meli, Robert What Do You Think Is Causing the Fighting In Palestine 2/15/48
  • Mc Arteen, Patrick Do You Favor Henry Wallace’s Decision to Run for President In on a Third-party Ticket 1/18/48
  • Keeran, Vincent What Is Causing the Current Rise In Prices 10/12/47

Basically, these are interview pieces. There’s the interviewee’s name, then the title of the interview, then the date.

Since I was working in OpenRefine (formerly Google Refine), I knew that I could use GREL (the Google Refine Expression Language) to replace this with blank text using a regular expression.

Google Refine regular expression

The expression is value.replace(/[0-9]{1,2}\/[0-9]{1,2}\/\d{2}/,""). Basically, this looks for a pattern of one or two numbers, then a slash, then one or two numbers, then a slash, then two numbers. It replaces it with nothing.

This is not the most elegant regex anyone has ever written. But it was pretty quick to figure out, and it worked!
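The same pattern behaves identically in Python's re module, which makes it easy to sanity-check against the sample titles above (re.sub here stands in for GREL's value.replace; the extra .rstrip() trims the trailing space the replacement leaves behind, a small step the GREL version doesn't do):

```python
import re

# Same pattern as the GREL expression: one or two digits, a slash,
# one or two digits, a slash, two digits.
date_pattern = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/\d{2}")

titles = [
    "Meli, Robert What Do You Think Is Causing the Fighting In Palestine 2/15/48",
    "Keeran, Vincent What Is Causing the Current Rise In Prices 10/12/47",
]

for t in titles:
    cleaned = date_pattern.sub("", t).rstrip()  # drop the date, trim the gap
    print(cleaned)
```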

Next, I want to figure out a way to add a period after the interviewee’s name. More on that when I figure it out!