Figuring Out What Has Been Done: Making Sense of Versions

What was the problem?

We know that a lot of good work has been done to fix the problems we’ve identified, but there’s no quick-glance way of knowing whether a record has been exported from AT and published to our finding aid database after the work was done. Since so many people touched so many records, some human error is inevitable: there are probably some finding aids that are fixed but not yet exported.

How did I figure this out?

First, I had to get a copy of each data set. I talked about that in a previous blog post. Then, I had to figure out some good metrics of change. Since our finding aids go through a transformation between AT and the finding aid database (and this transformation involves, I think, a human opening the XML editor), I didn’t want trivial changes to throw off my results. So, I couldn’t just hash each document and compare the hashes.
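To make the hashing problem concrete, here is a minimal Python sketch. The two tiny documents are made-up stand-ins, not real finding aids: they have identical content, but one has been re-indented the way an XML editor might do on save.

```python
import hashlib
import xml.etree.ElementTree as ET


def sha256_of(text):
    """Hash the raw bytes of a document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def tag_sequence(text):
    """List element names in document order, ignoring whitespace."""
    return [el.tag for el in ET.fromstring(text).iter()]


# Hypothetical example documents: same content, different indentation.
at_export = "<ead><archdesc><did><unittitle>Papers</unittitle></did></archdesc></ead>"
published = (
    "<ead>\n  <archdesc>\n    <did><unittitle>Papers</unittitle></did>\n"
    "  </archdesc>\n</ead>"
)

# The hashes differ even though nothing meaningful changed.
print(sha256_of(at_export) == sha256_of(published))        # False
# A structural comparison sees the two documents as the same.
print(tag_sequence(at_export) == tag_sequence(published))  # True
```

This is why the metrics below look at structure (what elements are present, and how many) rather than at the raw files.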

Off the top of my head, I thought of two metrics that would give us a rough sense of what work has been done. We saw that there are only three finding aids that still have duplicate arrangement statements, so let’s compare those to the finding aids in the database that have duplicate arrangement statements.

I ran the same XQuery against both data sets. Comparing the results, I found 16 published finding aids that have duplicate arrangement statements even though the up-to-date data in AT no longer has this problem.
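The actual query isn't reproduced here, but the comparison can be sketched in Python under some stated assumptions. The sketch treats "duplicate" as more than one <arrangement> element sharing the same parent, and it assumes each data set has been loaded into a dict mapping a finding aid identifier to its EAD text. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any namespace prefix so namespaced and plain EAD both match."""
    return tag.rsplit("}", 1)[-1]


def has_duplicate_arrangement(xml_text):
    """True if any element has more than one <arrangement> child (assumed definition)."""
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        count = sum(1 for child in parent if local_name(child.tag) == "arrangement")
        if count > 1:
            return True
    return False


def fixed_but_unpublished(at_docs, published_docs):
    """Finding aids that are clean in AT but still duplicated in the published set.

    Both arguments are dicts of finding aid id -> EAD XML text (an assumption).
    """
    at_clean = {fa for fa, xml in at_docs.items() if not has_duplicate_arrangement(xml)}
    pub_dupes = {fa for fa, xml in published_docs.items() if has_duplicate_arrangement(xml)}
    return sorted(at_clean & pub_dupes)
```

Anything this returns is a candidate for re-export: the fix exists in AT, but the published copy still shows the old problem.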

Okay, that’s a start. But it’s only telling us whether fixes to this particular problem have been updated. So, let’s look more broadly. My first thought is that getting a count of components gives a good sense of whether work has been done on a finding aid.

This one was pretty straightforward. I wrote an XQuery that counted components at every level and reported which file each count belonged to. I used dumb Access to match up the files from the two data sets, and then wrote a query to show, for each finding aid, how many components are published to the database and how many are in AT. From here, hopefully, we’ll be able to make a big update and get our files in sync.
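For anyone who would rather skip Access, here is a rough Python equivalent of the count-and-match step. It is a sketch, not the query described above. It assumes components are the EAD elements <c> and <c01> through <c12>, and that each data set is a dict of finding aid id -> EAD text. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches the unnumbered <c> and the numbered <c01>..<c12> component elements.
COMPONENT = re.compile(r"c(0[1-9]|1[0-2])?")


def count_components(xml_text):
    """Count component elements at every level of one finding aid."""
    root = ET.fromstring(xml_text)
    return sum(
        1 for el in root.iter()
        if COMPONENT.fullmatch(el.tag.rsplit("}", 1)[-1])  # ignore namespace
    )


def component_mismatches(at_docs, published_docs):
    """Return (id, AT count, published count) wherever the two counts differ."""
    rows = []
    for fa_id in sorted(at_docs.keys() & published_docs.keys()):
        at_n = count_components(at_docs[fa_id])
        pub_n = count_components(published_docs[fa_id])
        if at_n != pub_n:
            rows.append((fa_id, at_n, pub_n))
    return rows
```

Only finding aids present in both sets are compared, which plays the same role as the join in Access. Files that exist on one side only would need a separate check.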

Obviously, it’s entirely possible that we have files out of sync that have the same numbers of components and the same numbers of arrangement statements. As I evaluate other known errors, I’ll be sure to evaluate them over both data sets to get a sense of what needs to be updated.

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?  We know which finding aids are out of sync in regard to numbers of components and fixed arrangement statements.
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.  We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures? This problem endures, although I’m working on some solutions based on helpful suggestions!

