Fighting Zombies — De-Duplicating Our Finding Aids

It’s one of those problems that drive everyone a little nutty. I mentioned in my previous posts that I’ve been looking across our finding aids to get a better sense of how well we’re meeting standards. In order to do this, I’ve had to request a copy of all finding aids from our digital library people each month, since I don’t have access to the xml files that our website is based on.

While doing this, I noticed that we had a LOT of files in the bundle digital libraries had sent us. Like, more than 2000. When really, we only have 1200 collections. And I noticed, too, that many of the files were named in a way that doesn’t meet our current conventions. Opening some of them up, it was clear that these were very, very old versions of finding aids that had since been updated, and that we had already asked to have deleted from the production environment.

We couldn’t figure out why the zombies kept coming back. So, digital libraries folks and I worked together and decided that I would identify the duplicates, they would flush everything from the production environment, and they would then re-index the current set, minus the list of duplicates I would send.

To do this, I had to think through which pieces of data I needed in order to determine which finding aids were versions of one another.

So, I wrote an xQuery. My xQuery asks for the following pieces of information:

  • What’s the call number for this finding aid? This is represented by the element <unitid> at the collection level. In our finding aids, we can’t consistently assume that the file name is the same as (or is a derivative of) the call number.
  • When was this finding aid updated? I’ll need to know this so that I can delete any finding aid that is out of date.
  • What file is it a part of? I’ll need this so that I can determine which files to delete.

Let’s walk through what my xQuery looks like, from the top.

xquery version "1.0";

 declare namespace ead="urn:isbn:1-931666-22-9";
 declare namespace xlink = "http://www.w3.org/1999/xlink";
 import module namespace functx="http://www.functx.com" 
     at "http://www.xqueryfunctions.com/xq/functx-1.0-doc-2007-01.xq";

This is all boilerplate that tells the computer that we’re writing an xQuery, and that it is querying data that is in the EAD and xlink namespaces. I’m also importing the functx module, which I don’t actually use here, but which is a useful library of commonly used xQuery functions.

Next, I tell my computer which files I’m working with.

declare variable $COLLECTION as document-node()* := collection("file:///Users/staff/Documents/dlts_findingaids_eads/tamwag/?recurse=yes;select=*.xml");

Here, I’m telling my computer where my EAD files live. I’m also saying that it should only look at xml files, and that if there are folders within this location, it should look at them too. I’m declaring the location of these files as a variable, called $COLLECTION.
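Before building the full report, a quick sanity check can confirm the mismatch between files and collections. The following is a minimal sketch, not part of my original query: the <summary>, <files>, and <callnumbers> element names are made up for illustration, and the ?recurse=yes;select=*.xml parameters on the collection URI are processor-specific (Saxon supports this form; other processors may not).

```xquery
xquery version "1.0";

declare namespace ead = "urn:isbn:1-931666-22-9";

declare variable $COLLECTION as document-node()* :=
    collection("file:///Users/staff/Documents/dlts_findingaids_eads/tamwag/?recurse=yes;select=*.xml");

(: <summary>, <files>, and <callnumbers> are hypothetical names for this check :)
<summary>
    <!-- how many XML files were loaded -->
    <files>{ count($COLLECTION) }</files>
    <!-- how many distinct collection-level call numbers they contain -->
    <callnumbers>{
        count(distinct-values(
            for $u in $COLLECTION/ead:ead/ead:archdesc/ead:did/ead:unitid
            return normalize-space($u)
        ))
    }</callnumbers>
</summary>
```

If the first number is much larger than the second, several files share a call number, which is exactly the zombie problem.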

Let’s jump to the bottom, where I tell the xQuery what my report should look like.

return
<doc>
<file>{$doc}</file>
{$callno}
{$datemodified}
</doc>

This report says that, for each document, I want the file name, the call number, and the date it was modified. To tell the report where to find that information, I define those variables in the lines that sit just above the return in the query:

for $i in $COLLECTION//ead:ead
let $callno := $i//ead:archdesc/ead:did/ead:unitid,
$doc := base-uri($i),
$datemodified := $i//ead:profiledesc/ead:creation/ead:date

The first thing I do is declare a jumping-off variable, called $i, that points to the root element of each EAD, <ead>. Then, to get the file name of that EAD within my file system, I use the function base-uri(). This fulfills one of the three pieces of information I’m looking for: I now know which file I’m talking about.

The next piece of information I want is the call number of the EAD. I declare a variable $callno (no reason to be mysterious with variable names!), and I tell the computer where in the EAD it can find this. In this case, I want the collection-level <unitid> (there may be other unitids in a finding aid, but I just want the collection-level one), which is found by going to the root of the EAD <ead>, then going to <archdesc>, then going immediately to <did>, then looking to <unitid>. I now have my call number.

Finally, I want to know the date that this file was created. This is stored in the EAD when the file is exported from Archivists’ Toolkit (and can be updated using a variety of methods based on institutions’ practices even if they don’t use AT). Since each export writes a fresh date, it works as a last-modified date, so I’ll call this variable $datemodified, and I tell the computer that it can be found under <profiledesc>, then <creation>, then <date>.
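Since the pieces above were shown out of order, here is a sketch of how they fit together as one query. Two things differ from the snippets above and are my assumptions for this sketch: the unused functx import and xlink namespace are left out, and the results are wrapped in a hypothetical <report> root element so that the output is a single well-formed XML document.

```xquery
xquery version "1.0";

declare namespace ead = "urn:isbn:1-931666-22-9";

(: Where the EAD files live; only *.xml files, including subfolders :)
declare variable $COLLECTION as document-node()* :=
    collection("file:///Users/staff/Documents/dlts_findingaids_eads/tamwag/?recurse=yes;select=*.xml");

(: <report> is a hypothetical wrapper, not part of the original query :)
<report>{
    (: one pass per finding aid, starting at its root <ead> element :)
    for $i in $COLLECTION//ead:ead
    let $callno := $i//ead:archdesc/ead:did/ead:unitid,            (: collection-level call number :)
        $doc := base-uri($i),                                      (: file the EAD came from :)
        $datemodified := $i//ead:profiledesc/ead:creation/ead:date (: export/creation date :)
    return
        <doc>
            <file>{ $doc }</file>
            { $callno }
            { $datemodified }
        </doc>
}</report>
```

The for and let clauses come first and the return comes last, which is the reverse of the order I explained them in.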

Whew! Now that that’s all together, I run the xquery, and get something that looks like this:

[Image: dedup screenshot]

See? For each document, I get the file name/path, I get the unitid, and I get the date.

So how do I analyze/de-dupe these? I’m sure that there are lots of ways to do this, but I want to see it all in a big table, so I import into Excel. Did you know that you can import xml into Excel? It’s under the Data tab.

From here, it’s pretty straightforward. I get the data into the columns I want, and I take a look at it. Below, do you see how aia_sullivan.xml and AIA.048-ead.xml have the same call number, but one was created two years before the other? I definitely want to keep the newer one and delete the older one.

[Image: dedup excel]

But boy, there are a lot here. I don’t want to do this by hand! First I do a multi-level sort, by call number (column D) and then by date (column C). Then I apply a formula to every row that compares it with the row below: if the two rows have the same call number, and the first was created before the second, the first row gets the value “delete”; otherwise it gets “keep”. Because of the sort, the newest version of each finding aid is always the last row in its group, so it ends up as “keep”. In Excel, it looks like this:

=IF(AND(D1=D2,C1<C2),"delete","keep")

Then, I simply filter that list so that I only see the “delete” values, I copy the list of file names with delete values, and I save that to a text file for digital libraries folks to delete.
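For anyone who would rather skip the spreadsheet, the same keep/delete decision can be sketched in xQuery. This is an illustration under stated assumptions, not what I actually ran: the sample rows, file names, call numbers, and dates below are all made up, the report is simplified to plain <unitid> and <date> elements without namespaces, and the dates are assumed to be in ISO yyyy-mm-dd form (real export dates may need cleaning up before they can be cast to xs:date).

```xquery
xquery version "1.0";

(: Hypothetical sample rows shaped like a simplified report; all values are invented :)
declare variable $REPORT :=
    <report>
        <doc><file>old_version.xml</file><unitid>EX 001</unitid><date>2011-03-01</date></doc>
        <doc><file>EX.001-ead.xml</file><unitid>EX 001</unitid><date>2013-04-15</date></doc>
        <doc><file>EX.002-ead.xml</file><unitid>EX 002</unitid><date>2012-07-09</date></doc>
    </report>;

(: for each distinct call number, find its newest date... :)
for $callno in distinct-values($REPORT/doc/unitid)
let $versions := $REPORT/doc[unitid = $callno]
let $newest := max($versions/date/xs:date(.))
(: ...and list every file for that call number that is older than the newest :)
for $old in $versions[xs:date(date) lt $newest]
return <delete>{ string($old/file) }</delete>
```

With the sample rows above, this should return a single <delete> element naming old_version.xml. As with the Excel formula, two files that share a call number and a date are both kept, so those would still need a look by hand.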
