Round-tripping EAD, updated — how to not lose those barcodes

Several weeks ago, Bonnie published a hugely useful checklist of things to think about when round-tripping data from Archivists’ Toolkit to EAD and back again, with an eye toward minimizing data loss.

Don’t lose those barcodes!

In a comment, my colleague Mark gave a very sensible suggestion for how to avoid losing barcodes during that trip. When you export an EAD from AT, the container information looks something like this:

<c id="ref21" level="file">
    <did>
        <unittitle>Notecards</unittitle>
        <container id="cid1063028" type="Box" label="Mixed Materials (39002042652603)">7</container>
        <unitdate normal="1930/1940" type="inclusive">circa 1935-1950</unitdate>
    </did>
</c>

What I see here is that two separate fields in the AT database have been jammed together for EAD export. “Mixed materials” lives in the ArchDescriptionInstances table in AT as instanceType. 39002042652603 lives in the ArchDescriptionInstances table as barcode. During export, these are brought together as container @label.

So, the trick is to get them back into the fields and tables where they belong. By default, AT imports container @label to ArchDescriptionInstances.instanceType. You want the barcode information to end up in ArchDescriptionInstances.barcode instead. Here’s how.

I sometimes have bad luck getting AT to re-import when container @label="Mixed Materials (39002042652603)", so I do a big find/replace to turn that into @label="39002042652603".
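If you're scripting that find/replace rather than doing it in an editor, here's a rough sketch in Python; the regex and function are my own stand-ins, and it assumes 14-digit barcodes inside a parenthesized label.

```python
import re

# Sketch of the find/replace, assuming 14-digit barcodes: turn
# label="Mixed Materials (39002042652603)" into label="39002042652603"
# in the exported EAD before re-importing into AT.
LABEL = re.compile(r'label="[^"(]*\((\d{14})\)"')

def strip_label(ead_text):
    return LABEL.sub(r'label="\1"', ead_text)

sample = '<container id="cid1063028" type="Box" label="Mixed Materials (39002042652603)">7</container>'
print(strip_label(sample))
```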

Then, I import into AT.

Getting the barcodes where they belong is pretty easy, but it has to happen in SQL in the back-end database. If you haven’t written SQL before, you might want to run this by a database administrator, just to have another set of eyes on it. You DEFINITELY want to do this in the test or dev version of your database first.

Step 1: copy over barcodes from instanceType to barcode (this assumes 14-digit barcodes)

UPDATE ArchDescriptionInstances 
SET barcode = instanceType
WHERE instanceType REGEXP '^[0-9]{14}$';

Step 2: replace barcodes in instanceType with "Mixed materials". (This assumes that you want everything to be "Mixed materials"; I have a rant in the works about how inappropriate it is to have this label on containers anyway, but I’ll save that for another day.)

UPDATE ArchDescriptionInstances 
SET instanceType = 'Mixed materials'
WHERE instanceType REGEXP '^[0-9]{14}$';
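By the way, the order of these two steps matters: if you relabel instanceType first, the barcodes are gone before they've been copied. Here's a toy simulation in Python (not the real database, obviously; just dicts shaped like the two columns we care about) showing the happy path:

```python
import re

BARCODE = re.compile(r'^[0-9]{14}$')

# Toy stand-in for the two UPDATE statements, over dicts shaped like the
# ArchDescriptionInstances columns we care about.
rows = [{"instanceType": "39002042652603", "barcode": None}]

# Step 1: copy 14-digit values from instanceType into barcode.
for r in rows:
    if BARCODE.match(r["instanceType"]):
        r["barcode"] = r["instanceType"]

# Step 2: only then overwrite instanceType with the label.
for r in rows:
    if BARCODE.match(r["instanceType"]):
        r["instanceType"] = "Mixed materials"

print(rows[0])
```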

Congratulations! You haven’t lost your barcodes! You can now delete the previous resource record and enjoy your new-and-improved one.

But that’s not all…

Extra super-secret tips for folks keeping Voyager item information in Archivists’ Toolkit.

So, where I work, this wouldn’t be enough. We have a whole other set of data in the ArchDescriptionInstances table that isn’t serialized in EAD. Container type, location code, Voyager bibid, container restriction, and a Boolean recording whether this information has been exported to Voyager are all kept in the ArchDescriptionInstances table in user-defined fields. We then have a background program that sends this information to Voyager, where holdings/item records are created.

This means that if I round-trip EAD, that existing item information becomes disassociated from the descriptive information. This doesn’t work.

Note: If you’re also using the plug-in that Yale developed to keep Voyager holdings information in AT, could you please drop me a line? We’re starting to think about migration paths to ArchivesSpace.

But I had a brainwave: if I round-trip EAD with barcodes, I now have a key. The information that is blank in the newly imported record (what’s the Voyager bib/holdings ID? was it exported?) is still present, associated with the same barcode, in the record that existed before the EAD was round-tripped.

This means that no one has to re-enter the same information or re-export to Voyager. Check out my SQL update below — same warnings and caveats apply.

 

start transaction;

UPDATE schema.ArchDescriptionInstances AS ArchDesc
JOIN (
    SELECT
        barcode,
        userDefinedString1,
        userDefinedString2,
        locationId,
        userDefinedBoolean1,
        userDefinedBoolean2
    FROM schema.ArchDescriptionInstances
    WHERE userDefinedString1 = "{{{bibid}}}"
) AS newdata ON newdata.barcode = ArchDesc.barcode
SET
    ArchDesc.userDefinedString1 = newdata.userDefinedString1,
    ArchDesc.userDefinedString2 = newdata.userDefinedString2,
    ArchDesc.locationId = newdata.locationId,
    ArchDesc.userDefinedBoolean1 = newdata.userDefinedBoolean1,
    ArchDesc.userDefinedBoolean2 = newdata.userDefinedBoolean2
WHERE
    ArchDesc.userDefinedString1 = ""
    AND ArchDesc.userDefinedString2 = "";

commit;

When will restricted materials become available?

One morning recently, our records services archivist sent me an email. He was wondering if there was a way I could report to him on which materials in our university archives have restrictions whose end dates have passed. After all, this data is buried in access restriction notes all over finding aids; it would be very difficult to find this information by searching our finding aids portal or Archivists’ Toolkit.

This is exactly the kind of project that I love to do — it’s the intersection of archival functions, improved user experience, and metadata power tools.

In ArchivesSpace, restrictions have controlled date fields, and a report like this would be very easy in that environment! Unfortunately, AT and EAD only have a place for this information as free text in notes.

Time for an xquery!

xquery version "3.0";
 
declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";
declare namespace functx = "http://www.functx.com";

<restrictions>
{
  for $ead in ead:ead
  let $doc := base-uri($ead)
  return
    <document uri="{$doc}">
    {
      for $accessrestrict in $ead//ead:dsc//ead:accessrestrict/ead:p[matches(., '(19|20)[0-9]{2}')]
      let $series := $accessrestrict/ancestor::ead:c[@level = 'series' or @level = 'accession' or @level = 'accn']//ead:unitid
      let $dateseg := fn:substring-after($accessrestrict, 'until')
      for $x in $series
      return
        <lookhere location="{$x}">
          {$accessrestrict}
          <date>{$dateseg}</date>
        </lookhere>
    }
    </document>
}
</restrictions>

And now for the walk-through.

Working together, we determined that any end dates will be below the <dsc>. So this report asks for any access restriction note below the dsc that includes a date in the twentieth or twenty-first century.

The report tells me which series each access restriction note belongs to and which file it’s in. I also pull out any text after the word “until”, because I see that common practice is to say “These materials will be restricted until XXXX.”
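If it's easier to see that logic outside of XQuery, here's the same idea as a quick Python sketch (the function name and regex are mine; they just mirror the report's assumptions about four-digit years and the word "until"):

```python
import re

# Rough stand-in for the report's two checks on an accessrestrict note:
# does it mention a plausible year, and what year follows the word 'until'?
def restriction_end_year(note):
    m = re.search(r'until\D*((?:19|20)\d{2})', note)
    return int(m.group(1)) if m else None

print(restriction_end_year("These materials will be restricted until 2031."))
```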

From there, I was able to put this data into an Excel spreadsheet, do a bit of clean-up there, and give my colleague a sorted list of when particular series in collections are slated to be open.

Figuring Out What Has Been Done: Folder Numbers that Go On and On and On

What was the problem?

As part of a retrospective conversion project, paper-based finding aids were turned into structured data. A lot of this work was done in Excel, and one problem was a mistake with folder numbers — instead of folder numbers starting at number one at the beginning of each box, their numbering continues as the next box starts. For instance, instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.

How did I figure this out?

Since I’m new here and not overly familiar with numbering conventions, I approached it two ways. First, I wanted a list of finding aids that have really, really high folder numbers. This is really easy: I basically point to the folder and ask for the biggest one. Of course, fn:max() can only handle numbers that look like numbers, so I included a predicate [matches(.,'^[0-9]+$')] that only looks at folder numbers that are integers. This means that folder ranges and folders with letters in their names won’t be included, but it’s very unlikely that the biggest folder numbers in a collection would be styled this way.

xquery version "1.0";
 
declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";
 
<root>
{
  for $ead in ead:ead
  let $doc := base-uri($ead)
  return
    <document uri="{$doc}">
    {
      let $folder := $ead//ead:container[@type = "Folder"][matches(., '^[0-9]+$')]
      let $maxfolder := max($folder)
      return $maxfolder
    }
    </document>
}
</root>

Looking through this, there are a LOT of collections with really high folder numbers. When I dig in, I realize that in a lot of cases, this can be just because of typos (for instance, a person means to type folder “109” but accidentally types “1090”). But I thought it would be good to know, more generally, which boxes in a collection have a “Folder 1”.

xquery version "1.0";
 
declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";
 
<root>
{
  for $ead in ead:ead
  let $doc := base-uri($ead)
  return
    <document uri="{$doc}">
    {
      let $box := $ead//ead:container[@type = "Box"]
      let $folder1 := $box/following-sibling::ead:container[@type = "Folder"][. eq "1"]
      let $boxbelong := $folder1/preceding-sibling::ead:container[@type = "Box"]
      return $boxbelong
    }
    </document>
}
</root>

And, like a lot of places, practice has varied over time here. Sometimes folder numbering continues across boxes for each series. Sometimes it starts over for each box. Sometimes it goes through the whole collection. This could be tweaked to answer other questions. Which box numbers that aren’t Box 1 have a folder 1? How many/which boxes are in this collection, anyway?
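For anyone who'd rather see the logic than the XQuery, here's a toy sketch (names and data made up) of the "which boxes lack a Folder 1" check:

```python
# Toy sketch: given (box, folder) pairs parsed from a finding aid, list the
# boxes that have no folder 1 -- a hint that numbering runs on across boxes.
def boxes_without_folder_one(pairs):
    by_box = {}
    for box, folder in pairs:
        by_box.setdefault(box, set()).add(folder)
    return sorted(box for box, folders in by_box.items() if 1 not in folders)

# Box 2 starts at folder 21 instead of restarting at 1.
pairs = [("1", 1), ("1", 2), ("2", 21), ("2", 22)]
print(boxes_without_folder_one(pairs))
```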

From this, I got a good list of finding aids with really high folder numbers, which will help us fix a few dumb typos and identify finding aids that have erroneous numbering. We’re still on the fence regarding what to do about this (I think I would advocate just deleting the folder numbers, since I’m actually not a huge fan of them anyway), but we have a good start to understanding the scope of the problem.

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?  We know which finding aids are out of sync in regard to numbers of components and fixed arrangement statements.
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.  We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?  We now know which notes are duplicated from their children.

Figuring Out What Has Been Done: Making Sense of Versions

What was the problem?

We know that a lot of good work has been done to fix the problems we’ve identified, but there’s no quick-glance way of knowing whether a record has been exported from the AT and published to our finding aid database after the work was done. Since so many people touched so many records, the element of human error is inevitable — there are probably some finding aids that are fixed but not yet exported.

How did I figure this out?

First, I had to get a copy of each data set. I talked about that in a previous blog post. Then, I had to figure out some good metrics of change. Since our finding aids go through a transformation between AT and the finding aid database (and this transformation involves, I think, a human opening the xml editor), I didn’t want trivial changes to throw off my results. So, I couldn’t just get a hash of each of the documents and compare them.

Off the top of my head, I thought of two metrics that would give us a rough sense of what work has been done. We saw that there are only three finding aids that still have duplicate arrangement statements — let’s compare those to finding aids in the database that have duplicate arrangement statements.

I used the same xquery against both data sets. After comparing the two, I saw that there are 16 published finding aids with duplicate arrangement statements that no longer have this problem in the up-to-date data in AT.

Okay, that’s a start. But it’s only telling us whether fixes to this particular problem have been updated. So, let’s look more broadly. My first thought is that getting a count of components gives a good sense of whether work has been done on a finding aid.

This one was pretty straightforward. I just wrote an xquery that did a count of components at every level and told me which file each count was associated with. I used dumb Access to associate the files with each other, and then wrote a query showing, for each finding aid, how many components are published to the database and how many are in AT. From here, hopefully, we’ll be able to make a big update and get our files in sync.
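The comparison step itself is trivial once the counts exist; a minimal sketch, assuming the counts have been reduced to filename-to-count mappings (my shape, not how Access stores them):

```python
# Minimal sketch of the sync check: given {filename: component_count} for the
# published corpus and for the fresh AT export, report files whose counts differ.
def out_of_sync(published, at_export):
    return sorted(f for f in published if f in at_export and published[f] != at_export[f])

# Hypothetical filenames, purely for illustration.
published = {"mssa.ms.0001.xml": 120, "mssa.ms.0002.xml": 45}
at_export = {"mssa.ms.0001.xml": 120, "mssa.ms.0002.xml": 52}
print(out_of_sync(published, at_export))
```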

Obviously, it’s entirely possible that we have files out of sync that have the same numbers of components and the same numbers of arrangement statements. As I evaluate other known errors, I’ll be sure to evaluate them over both data sets to get a sense of what needs to be updated.

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?  We know which finding aids are out of sync in regard to numbers of components and fixed arrangement statements.
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.  We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures? This problem endures, although I’m working on some solutions based on helpful suggestions!

Figuring Out What Has Been Done: Duplicated Notes (And Documenting My Failed Attempt…)

What was the problem?

This time, I’m trying to track down see/see-also notes that, because of a problem in the original EAD 1.0 -> EAD 2002 transform, were duplicated to parent components. A lot of really good clean-up work was done with these — we want to know what’s left.

How did I figure this out?

I haven’t. This has totally failed and I could use some help.

I had a lot of thoughts initially about how to pull this off, and I was mostly concerned about whether oXygen would be able to handle any solution I came up with. Happily, my colleague Mark introduced me to BaseX, which is an XML database and XQuery processor. It’s awesome, and has been able to handle everything I’ve thrown at it thus far.

When it came down to it, I realized that I just wanted to find the note, then find a note in the parent component, and figure out if they were the same. I toyed with the idea of making a hash of each of these and comparing them, but it turned out that BaseX was able (I think) to handle the content of the note itself.

The xQuery is here, and the meaty bits are below.

declare variable $COLLECTION as document-node()* := db:open('MSSAAtExport');

for $note in $COLLECTION//ead:ead//ead:note//text()
let $doc := base-uri($note),
    $parent-note := $note/parent::ead:c/parent::ead:c/ead:note[1]//text()

Basically, I’ve put all of my EAD in a database in BaseX called “MSSAAtExport” (which is the best thing I could have done — it made everything fast and awesome). Then, I declared my main variable, $note (any <note> element, anywhere, although strictly speaking I’m only interested in notes in the <dsc>). I declared $doc as the file I’m in.

And here’s where I’m pretty sure my mistake is, so PLEASE CHECK: I created another variable for the note that might have been duplicated. Because that’s the problem, right? There are notes in components that were duplicated in parent components. So, $parent-note starts at $note, then goes up to its own <c>, then up to the parent <c>, and then down to the parent’s <note>. For both $note and $parent-note, I was hoping to simplify things by comparing just the text of the element, and not everything else.

Finally, I have the return statement.

return
  <dupes>
    <doc>{$doc}</doc>
    <results>{
      if ($note eq $parent-note)
      then "same"
      else "different"
    }</results>
  </dupes>

So, it came back with no matches. This is great, right? It means that there are no notes that were duplicated at the parent! Hooray!!!!

BUT, before congratulating myself too much, I decided to test it by inserting a note in the parent component of a component with a note, and checking to see if they came back the same. No dice. They came back as different, which means that there’s something wrong with this query.

The appeal

If you’re so inclined, please check my work and let me know where I messed up. Alternately, let me know if you have thoughts on a better/easier way to do this. Until then, I’ll keep chipping away at other reports in my list.

Figuring Out What Has Been Done

It’s been a while since I last posted, and there’s a good reason for that — I’ve started an exciting new job as an archivist and metadata specialist at Yale. I miss my colleagues and friends at Tamiment every day, and I look forward to continued awesome things from them.

Here at Yale, I work in Manuscripts and Archives. The major project for the first year will be to migrate from Archivists’ Toolkit to ArchivesSpace. In anticipation of this migration, I’m learning about the department’s priorities for data clean-up and thinking through what I can do to help implement those goals.

The Goals

One of the first projects that was added to my list was to take a look at a project that has been ongoing for a while — cleaning up known errors from the conversion of EAD 1.0 to EAD 2002. Much of the work of fixing problems has already been done — my boss was hoping that I could do some reporting to determine what problems remain and in which finding aids they can be found.

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?

Getting to the Files

Access to files that have been published to our portal is easy: they’re kept in a file directory that is periodically uploaded to the web. I also have a cache of the EAD 1.0 files, pre-transformation. Both were easy to pull down. But one of the questions I was asking was how these differ from what’s in the AT. It’s so, so easy to make changes in AT and forget to export to EAD.

If any of you know good ways to batch export EAD from AT, PLEASE LET ME KNOW. I have a pretty powerful machine and I know that folks here have worked on optimizing our AT database, but I could only export about 200 files at a time, for fear of the application hanging and eventually seizing up. So, I ran this in the background over the course of several days and worked on other stuff while I was waiting.

For some analyses, I wanted to exclude finding aids that aren’t published to our portal — for these, I copied the whole corpus to a separate directory. To get a list of which finding aids are internal-only, I very simply copied the resource record screen in AT (you can customize this to show internal-only finding aids as a column), which pastes very nicely into Excel.

[Screenshot: the AT resource record screen, with Internal Only shown as a column]

[Screenshot: the same list pasted into Excel]

Once in Excel, I filtered the list down to the rows where Internal Only = “TRUE”. From there, I used the same technique that I had used to kill our zombie finding aids at NYU. I made a text document called KillEAD.txt, containing a list of the internal-only finding aids, and I used the command

cat KillEAD.txt | tr -d '\r' | xargs echo rm | sh

to loop through the list and delete the files named in that text document. (In case you’re wondering, I’m now using a Unix-like environment called Cygwin, and there are weird things that don’t play nicely with Windows, including the fact that Windows text files end each line with a carriage return, \r, before the newline. Oh, also, I put this together with spit and baling wire and a Google search; suggestions on better ways to do this kind of thing are appreciated.)
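If you'd rather skip the shell entirely, here's one alternative sketch in Python that strips the Windows carriage returns itself; the function name is mine, and you'd point it at your own KillEAD.txt of full paths.

```python
import os

# Alternative to the cat/tr/xargs pipeline: read the kill list, strip the
# Windows '\r' (and the newline), and delete each listed file that exists.
def delete_listed(list_path):
    deleted = []
    with open(list_path) as fh:
        for line in fh:
            name = line.strip()  # drops '\r' and '\n'
            if name and os.path.exists(name):
                os.remove(name)
                deleted.append(name)
    return deleted
```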

So, that’s step 0. Don’t worry, I’ll be blogging my approach to goals 1-7 in the days ahead. I have some ideas about how I’ll do much of it (1, 3, and 4 I know how to assess, 2 I have a harebrained plan for, 5-7 are still nascent), but any suggestions or brainstorming for approaching these problems would be MORE THAN WELCOME.

Backlog Control — Known Unknowns

As part of a continuing attempt to understand our holdings, I’ve been writing a series of reports against our EAD. Of course, what’s in a finding aid only accounts for the stuff that someone bothered to record in the first place. To tackle undescribed collections, we’ve also been doing a shelfread project to get an understanding of what’s on our shelves.

Today, we take an “accessioning as processing” approach to accruals — we describe what’s there at the appropriate level of description at the time of accessioning, and we include a lot of the background information about how it came to us, what it all means, etc., to help make sense of it. This helps us avoid building a backlog.

In the past, however, there was a mysterious (to me) understanding of the nature of processed/unprocessed materials. We have many, many series of materials in collections (usually accruals) that may even have file-level inventories but are described as “unprocessed.” They don’t include essential information about immediate source of acquisition, creators, or what about these materials makes them hang together. I’m frankly not sure what my predecessors were waiting for — they did all the work of creating lots of description without doing any real explanation!

So, my boss wanted a sense of these known knowns — parts of collections that we need to at least give a better series title, or somehow take out of the limbo of “unprocessed”. She wanted to know how many series there were, which collections these series belong to, and how many boxes of stuff we’re talking about. It would also be great to know linear footage, but this is frankly unknowable from the data we have.

So, I wrote an xQuery. You can find it here. The xQuery looks for any series or subseries that has the string “unprocessed” in its title. From there, it reports out the distinct values of containers. The result looks something like this:

[Screenshot: the report output, showing container values like “10-17” under each series]

Perhaps you see the problem. Originally, I thought I just wanted to get a count of the distinct containers. My xpath for the variable that would give me box info (called footage here) originally looked like this:

$unprocessedfootage := count(distinct-values($series//ead:container[@type eq 'Box']))

The idea here was that it would take a series, get a census of the different boxes in that series, and count ’em up. But this gave me bad data. In the case of:

<container type="Box">10-17</container>

I would have “10-17” be considered one distinct value in the count, when really it represents 8 boxes. The report as I first had it was severely undercounting boxes.

If I want to get a count of the distinct containers, I have to deal with ranges like 10-17. I started by importing this into OpenRefine and separated the multi-valued cells in the “unprocessed” column so that each number or range was in its own cell/row.

Then, I did some googling and came across this StackOverflow answer that explained how to enumerate the values in a range in Excel (this will give me 10, 11, 12, 13, 14, 15, 16 and 17 from 10-17). I exported from OpenRefine and brought the document into Excel, separated the ranges into two columns, and did a quick if/then statement to repeat single values in the second column. From there, I just ran the VBA code that was provided. I brought the document BACK into Refine and separated multi-valued cells again, and found out that we have 908 distinct boxes of “unprocessed” materials in 67 collections.
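The enumeration step itself is simple enough to sketch; this is my own stand-in for the VBA/OpenRefine recipe, not the code I actually ran:

```python
# Stand-in for the range-expansion step: turn container values like '10-17'
# into the individual box numbers they represent, then count distinct boxes.
def expand(value):
    if "-" in value:
        start, end = value.split("-", 1)
        return list(range(int(start), int(end) + 1))
    return [int(value)]

boxes = set()
for v in ["1", "2", "10-17"]:
    boxes.update(expand(v))
print(len(boxes))
```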

Now, happily, we know exactly how big of a mess our described “unprocessed” materials are, and we’re in a much better position to make good sense of them.

Update — 2014 May 7

@ostephens on twitter very helpfully pointed out that the dumb VBA step can easily be avoided by doing the work in OpenRefine.

He was completely right and was kind enough to give me the recipe.

After multi-valued cells were each in their own row/cell, I separated by “-” so that the beginning and end of each range was in its own column. Then, I created a new column based on the “First” column and did the following:

[Screenshot: the OpenRefine GREL expression used to build the “enum” column]

On error, it copies the value from the original column so that my “enum” column is everything I need. Once I had the values enumerated, I split multi-value cells again and ended up with a much more beautiful process.

You can follow my steps by importing my OpenRefine project here.