Figuring Out What Has Been Done: Folder Numbers that Go On and On and On

What was the problem?

As part of a retrospective conversion project, paper-based finding aids were turned into structured data. A lot of this work was done in Excel, and one problem was a mistake with folder numbers — instead of folder numbers starting at number one at the beginning of each box, their numbering continues as the next box starts. For instance, instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.

How did I figure this out?

Since I’m new here and not overly familiar with numbering conventions, I approached it two ways. First, I wanted a list of finding aids that have really, really high folder numbers. This is really easy: I basically point to the folder containers and ask for the biggest one. Of course, fn:max() can only handle values that look like numbers, so I included a predicate [matches(.,'^[0-9]+$')] that only keeps folder numbers that are integers. This means that folder ranges and folders with letters in their name won’t be included, but it’s very unlikely that the biggest folder numbers in a collection would be styled this way.

xquery version "1.0";

declare namespace ead = "urn:isbn:1-931666-22-9";

<root>
{
  (: Runs against whatever EAD documents are in the current context :)
  for $ead in ead:ead
  let $doc := base-uri($ead)
  (: Keep only folder values made up entirely of digits :)
  let $folder := $ead//ead:container[@type = "Folder"][matches(., '^[0-9]+$')]
  (: Cast to integers so max() compares numbers and returns a whole number :)
  let $maxfolder := max(for $f in $folder return xs:integer($f))
  return
    <document uri="{$doc}">{ $maxfolder }</document>
}
</root>

Looking through this, there are a LOT of collections with really high folder numbers. When I dig in, I realize that in a lot of cases, this can be just because of typos (for instance, a person means to type folder “109” but accidentally types “1090”). But I thought it would be good to know, more generally, which boxes in a collection have a “Folder 1”.

xquery version "1.0";

declare namespace ead = "urn:isbn:1-931666-22-9";

<root>
{
  for $ead in ead:ead
  let $doc := base-uri($ead)
  (: Folder containers whose value is exactly "1" :)
  let $folder1 := $ead//ead:container[@type = "Folder"][. eq "1"]
  (: The box container(s) listed before each of those folders under the same parent :)
  let $boxbelong := $folder1/preceding-sibling::ead:container[@type = "Box"]
  return
    <document uri="{$doc}">{ $boxbelong }</document>
}
</root>

And, like a lot of places, practice has varied over time here. Sometimes folder numbering continues across boxes for each series. Sometimes it starts over for each box. Sometimes it goes through the whole collection. This could be tweaked to answer other questions. Which box numbers that aren’t Box 1 have a folder 1? How many/which boxes are in this collection, anyway?
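One way the query above might be tweaked, as a rough sketch rather than the version actually run here: for each distinct box, find the lowest integer-like folder number and flag boxes that don't start at 1. It assumes that box and folder containers sit together in the same <did>, and the sample finding aid is made up for illustration.

```xquery
xquery version "1.0";

declare namespace ead = "urn:isbn:1-931666-22-9";

(: Hypothetical sample standing in for a real finding aid. :)
declare variable $sample :=
  <ead xmlns="urn:isbn:1-931666-22-9">
    <archdesc>
      <dsc>
        <c><did><container type="Box">1</container><container type="Folder">1</container></did></c>
        <c><did><container type="Box">1</container><container type="Folder">2</container></did></c>
        <c><did><container type="Box">2</container><container type="Folder">3</container></did></c>
        <c><did><container type="Box">2</container><container type="Folder">4</container></did></c>
      </dsc>
    </archdesc>
  </ead>;

<boxes>
{
  for $box in distinct-values(
    for $b in $sample//ead:container[@type = "Box"] return normalize-space($b)
  )
  (: Integer-like folder values that share a <did> with this box (assumption). :)
  let $folders :=
    for $f in $sample//ead:did[ead:container[@type = "Box"][normalize-space(.) eq $box]]
                /ead:container[@type = "Folder"][matches(., '^[0-9]+$')]
    return xs:integer($f)
  let $lowest := min($folders)
  return
    <box number="{$box}" lowest-folder="{$lowest}">{
      if ($lowest eq 1) then "starts at 1" else "check numbering"
    }</box>
}
</boxes>
```

On the sample, Box 1 comes back as "starts at 1" and Box 2, whose lowest folder is 3, as "check numbering". A box with no integer-like folders at all also lands in "check numbering", which seems like the safer default.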

From this, I got a good list of finding aids with really high folder numbers that will help us fix a few dumb typos and identify finding aids that have erroneous numbering. We’re still on the fence regarding what to do about this (I think I would advocate just deleting the folder numbers, since I’m actually not a huge fan of them anyway), but we have a good start to understanding the scope of the problem.

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?  We know which finding aids are out of sync in regard to numbers of components and fixed arrangement statements.
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.  We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?  We now know which notes are duplicated from their children.

Exporting, Editing, and Importing EAD in Archivists’ Toolkit: A Checklist

Sometimes, it can be extremely helpful to take EAD XML files out of Archivists’ Toolkit to edit them.  Maybe you have a contents list that you generated from a spreadsheet, or maybe you want to quickly change 500 “otherlevel”s to “file”s.  Since there are so many small steps, I created a checklist.  Using the checklist will help to make sure that information doesn’t get lost and that the record looks like you want it to.

First, a word of caution: when the record is imported back into AT, it will overwrite all refids with new ones.  So if you’re using those refids elsewhere, this won’t work.  Additionally, before exporting the record, it’s important to copy down information that won’t be included in the export.  This includes any repository processing notes and linked accession records.  This is also why it’s important to make sure that “internal only” notes are included in the export.  Also, the file won’t re-import with barcode information, because barcodes are kept as non-valid attributes and violate the importer’s validation rules.

We found that when exporting, AT added information that we didn’t want when we re-imported it, or imported information to different fields.  For example, at Tamiment, we use the Container Summary field on the “Basic Info” tab to record the container summary.  When this is exported, it maps to <extent> in <physdesc>.  When it’s re-imported into the Toolkit, it does not go into the container summary but becomes a Physical Description Note.  You can also change some of these in the EAD XML file, instead of after importing into AT.

You can find my checklist here or below

Before Exporting EAD:

  • Write down which accession records are linked to the resource record
  • Record any information in repository processing note(s)
  • Do NOT check “Suppress components and notes when marked ‘internal only’” when exporting the original resource record

Before importing EAD:

  • If there are barcodes: do a find/replace on containers (using dot matches all) to delete barcodes
  • Make sure that the record it is replacing has been deleted

After importing EAD:

Basic Description:

  • Separate the prefix and numeric sections of the Resource Identifier into separate fields
  • Remove bulk dates from Date Expression field (this may also need to be done at the series or sub-series level)
  • Copy the text from the General Physical Description note into the Container Summary

Notes:

  • Remove General Physical Description note

Finding Aid Data:

  • Remove call number from Finding Aid Title field
  • Remove “Collection processed by” in Author field

Barcodes:

  • Re-enter barcodes

Accessions:

  • Re-link resource record to accession record(s)

Repeating information at lower levels of description

Today I discovered a description pet peeve while testing how finding aid requesting will work with our Aeon implementation.

Highlighting practice of reusing a series title in a folder title.

Almost every single folder in this collection starts by repeating the series name, followed by more specific information for that particular folder. I know I’ve seen this before in our finding aids (and at previous institutions), but it’s pretty widespread in this example. You can view the full Anne St. Clair Wright papers finding aid.

Really, we don’t need to do this as it doesn’t add more value, makes the display more cluttered, and isn’t a good use of our time spent repeating information.

DACS covers this:

Principle 7.3: Information provided at each level of description must be appropriate to that level.

When a multilevel description is created, the information provided at each level of description must be relevant to the material being described at that level. This means that it is inappropriate to provide detailed information about the contents of files in a description of a higher level. Similarly, archivists should provide administrative or biographical information appropriate to the materials being described at a given level (e.g., a series). This principle also implies that it is undesirable to repeat information recorded at higher levels of description. Information that is common to the component parts should be provided at the highest appropriate level.

This principle is discussed in numerous articles on archival description (including on page 246 of Greene and Meissner’s 2005 MPLP article) and can be seen in many institutions’ processing manuals.

Going back to our processing manual, there really aren’t any explanations of the hierarchical relationships between levels of description or instructions stating that lower levels of description inherit description from above. There is some guidance on creating folder titles, but most of it has to do with formatting. There’s almost no explanation of how to develop series titles.

Adding this to the list of updates to make!

 

 

Figuring Out What Has Been Done: Making Sense of Versions

What was the problem?

We know that a lot of good work has been done to fix the problems we’ve identified, but there’s no quick-glance way of knowing whether a record has been exported from the AT and published to our finding aid database after the work was done. Since so many people touched so many records, the element of human error is inevitable — there are probably some finding aids that are fixed but not yet exported.

How did I figure this out?

First, I had to get a copy of each data set. I talked about that in a previous blog post. Then, I had to figure out some good metrics of change. Since our finding aids go through a transformation between AT and the finding aid database (and this transformation involves, I think, a human opening the xml editor), I didn’t want trivial changes to throw off my results. So, I couldn’t just get a hash of each of the documents and compare them.

Off the top of my head, I thought of two metrics that would give us a rough sense of what work has been done. We saw that there are only three finding aids that still have duplicate arrangement statements — let’s compare those to finding aids in the database that have duplicate arrangement statements.

I used the same xquery against both data sets. After comparing the two, I saw that there are 16 finding aids that are published that have duplicate arrangement statements that don’t have this problem in the up-to-date data in AT.
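To give a sense of the kind of query that can be run unchanged against both data sets, here is a sketch that flags finding aids with more than one collection-level <arrangement>. It assumes that "duplicate" means more than one <arrangement> element outside the <dsc>, which may not be exactly the test used here. The sample records, their identifiers, and local:has-duplicate-arrangement are all made up for illustration.

```xquery
xquery version "1.0";

declare namespace ead = "urn:isbn:1-931666-22-9";

(: Assumption: "duplicate" means more than one <arrangement> outside the <dsc>. :)
declare function local:has-duplicate-arrangement($ead as element()) as xs:boolean {
  count($ead/ead:archdesc//ead:arrangement[not(ancestor::ead:dsc)]) gt 1
};

(: Hypothetical sample records; in practice, each data set in turn. :)
declare variable $records := (
  <ead xmlns="urn:isbn:1-931666-22-9">
    <eadheader><eadid>sample.0001</eadid></eadheader>
    <archdesc>
      <arrangement><p>Arranged in two series.</p></arrangement>
      <arrangement><list><item>Series I</item><item>Series II</item></list></arrangement>
    </archdesc>
  </ead>,
  <ead xmlns="urn:isbn:1-931666-22-9">
    <eadheader><eadid>sample.0002</eadid></eadheader>
    <archdesc>
      <arrangement><p>Arranged alphabetically.</p></arrangement>
    </archdesc>
  </ead>
);

<root>
{
  for $ead in $records
  where local:has-duplicate-arrangement($ead)
  return <document eadid="{normalize-space($ead/ead:eadheader/ead:eadid)}"/>
}
</root>
```

On the sample, only sample.0001 is reported. Running the same thing over the AT export and the published files gives two lists of identifiers, and the finding aids that appear in only one list are the ones out of sync on this measure.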

Okay, that’s a start. But it’s only telling us whether fixes to this particular problem have been updated. So, let’s look more broadly. My first thought is that getting a count of components gives a good sense of whether work has been done on a finding aid.

This one was pretty straightforward. I just wrote an xquery that did a count of components at every level and told me which file this was associated with. I used dumb Access to associate the files with each other, and then wrote a query to see, for each finding aid, how many components are published to the database and how many are in AT. From here, hopefully, we’ll be able to make a big update and get our files in sync.
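The component count might look something like the sketch below. It is not the query actually used: it pairs the two versions of a finding aid by <eadid> inside XQuery instead of in Access, which is an assumption about how the files can be matched, and the sample data and local:component-count are hypothetical.

```xquery
xquery version "1.0";

declare namespace ead = "urn:isbn:1-931666-22-9";

(: Counts <c> and <c01> through <c12> at every level of one finding aid. :)
declare function local:component-count($ead as element()?) as xs:integer {
  count($ead//*[matches(local-name(), '^c(0[1-9]|1[0-2])?$')])
};

(: Hypothetical stand-in for the AT export. :)
declare variable $at-export := (
  <ead xmlns="urn:isbn:1-931666-22-9">
    <eadheader><eadid>sample.0001</eadid></eadheader>
    <archdesc><dsc><c01><c02/><c02/></c01></dsc></archdesc>
  </ead>
);

(: Hypothetical stand-in for the published files. :)
declare variable $published := (
  <ead xmlns="urn:isbn:1-931666-22-9">
    <eadheader><eadid>sample.0001</eadid></eadheader>
    <archdesc><dsc><c01><c02/></c01></dsc></archdesc>
  </ead>
);

<root>
{
  for $a in $at-export
  let $id := normalize-space($a/ead:eadheader/ead:eadid)
  (: Assumption: the published copy carries the same <eadid>. :)
  let $p := $published[normalize-space(ead:eadheader/ead:eadid) eq $id][1]
  let $at-count := local:component-count($a)
  let $pub-count := local:component-count($p)
  return
    <document eadid="{$id}" at="{$at-count}" published="{$pub-count}">{
      if ($at-count eq $pub-count) then "same count" else "out of sync"
    }</document>
}
</root>
```

In the sample, the AT copy has three components and the published copy has two, so the record is reported as "out of sync". A finding aid with no published match gets a published count of 0, which also shows up as out of sync.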

Obviously, it’s entirely possible that we have files out of sync that have the same numbers of components and the same numbers of arrangement statements. As I evaluate other known errors, I’ll be sure to evaluate them over both data sets to get a sense of what needs to be updated.

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?  We know which finding aids are out of sync in regard to numbers of components and fixed arrangement statements.
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.  We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures? This problem endures, although I’m working on some solutions based on helpful suggestions!

Figuring Out What Has Been Done: Duplicated Notes (And Documenting My Failed Attempt…)

What was the problem?

This time, I’m trying to track down see/see-also notes that, because of a problem in the original EAD 1.0 -> EAD 2002 transform, were duplicated to parent components. A lot of really good clean-up work was done with these — we want to know what’s left.

How did I figure this out?

I haven’t. This has totally failed and I could use some help.

I had a lot of thoughts initially about how to pull this off, and I was mostly concerned about how oXygen would be able to handle any solution I came up with. Happily, my colleague Mark introduced me to BaseX, which is an xml database and xQuery processor. It’s awesome, and has been able to handle everything I’ve thrown at it thus far.

When it came down to it, I realized that I just wanted to find the note, then find a note in the parent component, and figure out if they were the same. I toyed with the idea of making a hash of each of these and comparing them, but it turned out that BaseX was able (I think) to handle the content of the note itself.

The xQuery is here, and the meaty bits are below.

declare variable $COLLECTION as document-node()* := db:open('MSSAAtExport');

for $note in $COLLECTION//ead:ead//ead:note//text()
let $doc := base-uri($note),
$parent-note := $note/parent::ead:c/parent::ead:c/ead:note[1]//text()

Basically, I’ve put all of my EAD in a database in BaseX called “MSSAAtExport” (which is the best thing I could have done — it made everything fast and awesome). Then, I declared my main variable, $note (any <note> element, anywhere, although strictly speaking I’m only interested in notes in the <dsc>). I declared $doc as the file I’m in.

Finally, (and here’s where I’m pretty sure my mistake is, so PLEASE CHECK), I created another variable for the note that might have been duplicated. Because that’s the problem, right? There are notes in components that were duplicated in parent components. So, $parent-note starts at $note, then goes up to its own <c>, then goes up to the parent <c>, and then goes down to the parent <note>. For both the $note and $parent-note, I was hoping to simplify things by just comparing the text of the element, and not everything else.

Finally, I have the return statement.

return
<dupes>
<doc>{$doc}</doc>
<results>{if ($note eq $parent-note)
 then "same"
 else "different"
 }</results>
</dupes>

So, it came back with no matches. This is great, right? It means that there are no notes that were duplicated at the parent! Hooray!!!!

BUT, before congratulating myself too much, I decided to test it by inserting a note in the parent component of a component with a note, and checking to see if they came back the same. No dice. They came back as different, which means that there’s something wrong with this query.
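One possible explanation, offered as a guess rather than a confirmed diagnosis: $note is bound to text() nodes, and the parent of a text node inside a note is a <p> (or the <note> itself), never a <c>. That makes $note/parent::ead:c empty, so $parent-note is empty too. In XQuery, eq with an empty sequence on either side returns an empty sequence, which counts as false in the if, so every result falls through to "different". The sketch below binds $note to the <note> element instead and compares whitespace-normalized text. The sample data is made up, and it assumes unnumbered <c> components; with <c01>, <c02> and so on, the ead:c steps would need to change.

```xquery
xquery version "1.0";

declare namespace ead = "urn:isbn:1-931666-22-9";

(: Hypothetical sample standing in for the real database. :)
declare variable $sample :=
  <ead xmlns="urn:isbn:1-931666-22-9">
    <archdesc>
      <dsc>
        <c level="series">
          <did><unittitle>Series I</unittitle></did>
          <note><p>See also: Box 4</p></note>
          <c level="file">
            <did><unittitle>Letters</unittitle></did>
            <note><p>See also:   Box 4</p></note>
          </c>
          <c level="file">
            <did><unittitle>Diaries</unittitle></did>
            <note><p>Fragile</p></note>
          </c>
        </c>
      </dsc>
    </archdesc>
  </ead>;

(: Bind to the <note> element itself, not its text nodes. :)
for $note in $sample//ead:c/ead:note
(: From the note up to its own <c>, up to the parent <c>, down to its notes. :)
let $parent-notes := $note/parent::ead:c/parent::ead:c/ead:note
let $text := normalize-space($note)
return
  <dupes>
    <note>{ $text }</note>
    <results>{
      if (some $p in $parent-notes satisfies normalize-space($p) eq $text)
      then "same"
      else "different"
    }</results>
  </dupes>
```

On the sample, only the note under "Letters" comes back as "same". The series-level note and the "Diaries" note come back as "different". Using some ... satisfies instead of a bare eq also avoids the error that eq raises when the parent component has more than one note.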

The appeal

If you’re so inclined, please check my work and let me know where I messed up. Alternately, let me know if you have thoughts on a better/easier way to do this. Until then, I’ll keep chipping away at other reports in my list.