Figuring Out What Has Been Done: Folder Numbers that Go On and On and On

What was the problem?

As part of a retrospective conversion project, paper-based finding aids were turned into structured data. A lot of this work was done in Excel, and one problem was a mistake with folder numbers — instead of folder numbers starting at number one at the beginning of each box, their numbering continues as the next box starts. For instance, instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.

How did I figure this out?

Since I’m new here and not overly familiar with numbering conventions, I approached it two ways. First, I want a list of finding aids that have really, really high folder numbers. This is really easy — I basically point to the folder and ask to return the biggest one. Of course, fn:max() can only handle numbers that look like numbers, so I included a predicate [matches(.,’^[0-9]+$’)] that only looks for folder numbers that are integers. This means that folder ranges and folders with letters in their name won’t be included, but it’s very unlikely that the biggest folder numbers in a collection would be styled this way.

xquery version "1.0";
 
declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";
 
<root>
{
 for $ead in ead:ead
 let $doc := base-uri($ead)
 return
 <document uri="{$doc}">
 {
 for $ead in $ead
 let $folder := $ead//ead:container[@type="Folder"][matches(.,'^[0-9]+$')]
 let $maxfolder := max($folder)
 return 
 $maxfolder
 }
 
 </document>
}
</root>

Looking through this, there are a LOT of collections with really high folder numbers. When I dig in, I realize that in a lot of cases, this can be just because of typos (for instance, a person means to type folder “109” but accidentally types “1090”). But I thought it would be good to know, more generally, which boxes in a collection have a “Folder 1”.

xquery version "1.0";
 
declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";
 
<root>
{
 for $ead in ead:ead
 let $doc := base-uri($ead)
 return
 <document uri="{$doc}">
 {
 for $ead in $ead
 let $box := $ead/(//ead:container[@type="Box"])
 let $folder1 := $box/following-sibling::ead:container[@type ="Folder"][. eq "1"]
 let $boxbelong := $folder1/preceding-sibling::ead:container[@type ="Box"]
 return
 $boxbelong
 }
 </document>
}
</root>

And, like a lot of places, practice has varied over time here. Sometimes folder numbering continues across boxes for each series. Sometimes it starts over for each box. Sometimes it goes through the whole collection. This could be tweaked to answer other questions. Which box numbers that aren’t Box 1 have a folder 1? How many/which boxes are in this collection, anyway?

From this, I got a good list of finding aids with really folder numbers that will help us fix a few dumb typos and identify finding aids that have erroneous numbering. We’re still on the fence regarding what to do about this (I think I would advocate just deleting the folder numbers, since I’m actually not a huge fan of them anyway), but we have a good start to understanding the scope of the problem.

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?  We know which finding aids are out of sync in regard to numbers of components and fixed arrangement statements.
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.  We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?  We now know which notes are duplicated from their children.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s