Figuring Out What Has Been Done: Duplicated Notes (And Documenting My Failed Attempt…)

What was the problem?

This time, I’m trying to track down see/see-also notes that, because of a problem in the original EAD 1.0 -> EAD 2002 transform, were duplicated to parent components. A lot of really good clean-up work was done with these — we want to know what’s left.

How did I figure this out?

I haven’t. This has totally failed and I could use some help.

I had a lot of thoughts initially about how to pull this off, and I was mostly concerned about how oXygen would be able to handle any solution I came up with. Happily, my colleague Mark introduced me to BaseX, which is an XML database and XQuery processor. It’s awesome, and has been able to handle everything I’ve thrown at it thus far.

When it came down to it, I realized that I just wanted to find the note, then find a note in the parent component, and figure out if they were the same. I toyed with the idea of making a hash of each of these and comparing them, but it turned out that BaseX was able (I think) to handle the content of the note itself.

The xQuery is here, and the meaty bits are below.

declare variable $COLLECTION as document-node()* := db:open('MSSAAtExport');

for $note in $COLLECTION//ead:ead//ead:note//text()
let $doc := base-uri($note),
$parent-note := $note/parent::ead:c/parent::ead:c/ead:note[1]//text()

Basically, I’ve put all of my EAD in a database in BaseX called “MSSAAtExport” (which is the best thing I could have done — it made everything fast and awesome). Then, I declared my main variable, $note (any <note> element, anywhere, although strictly speaking I’m only interested in notes in the <dsc>). I declared $doc as the file I’m in.

Finally, (and here’s where I’m pretty sure my mistake is, so PLEASE CHECK), I created another variable for the note that might have been duplicated. Because that’s the problem, right? There are notes in components that were duplicated in parent components. So, $parent-note starts at $note, then goes up to its own <c>, then goes up to the parent <c>, and then goes down to the parent <note>. For both the $note and $parent-note, I was hoping to simplify things by just comparing the text of the element, and not everything else.

Finally, I have the return statement.

return
<dupes>
<doc>{$doc}</doc>
<results>{if ($note eq $parent-note)
 then "same"
 else "different"
 }</results>
</dupes>

So, it came back with no matches. This is great, right? It means that there are no notes that were duplicated at the parent! Hooray!!!!

BUT, before congratulating myself too much, I decided to test it by inserting a note in the parent component of a component with a note, and checking to see if they came back the same. No dice. They came back as different, which means that there’s something wrong with this query.

The appeal

If you’re so inclined, please check my work and let me know where I messed up. Alternatively, let me know if you have thoughts on a better/easier way to do this. Until then, I’ll keep chipping away at other reports in my list.

7 thoughts on “Figuring Out What Has Been Done: Duplicated Notes (And Documenting My Failed Attempt…)”

  1. Maureen,
    I saw you post this on Twitter, and since it is an XQuery question I couldn’t resist taking a look. I see a few possible gotchas.

    First, your path to the parent node needs an extra step, since you are initially starting out on the text of the ead:note. So the direct parent of your $note variable is actually ead:note (the parent of the text node).

    let $parent := $note/parent::ead:note/parent::ead:c/parent::ead:c/ead:note[1]//text()

    Should work better. Also I’m always skeptical about text nodes, so I would at least run a normalize-space() on both text nodes that you are comparing, and perhaps use matches() instead of eq for your comparison. matches() allows you to throw in some regex if you need it and may make comparisons easier.

    Hope that helps!
    -Winona

    • Oh boy, I can usually count on it being a problem with my xpath!
      I also think I’m going to add a predicate to $note so that I only get results for notes that have a parent note. I’ll keep you posted on my progress — I really appreciate the help.
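
    The suggestions in this thread (the extra path step, normalize-space(), and a predicate so that only notes with a parent-level note are checked) could fit together roughly as in the sketch below. This is not the query either of us actually ran. It binds $note to the note element rather than to its text nodes, which sidesteps the missing step, and that matters because note text usually sits inside a p, so the parent of a text node is often ead:p rather than ead:note. The namespace URI is assumed to be the EAD 2002 schema namespace, and $sample is hypothetical inline data standing in for the database.

    ```xquery
    (: Assumption: the export uses the EAD 2002 schema namespace. :)
    declare namespace ead = "urn:isbn:1-931666-22-9";

    (: Hypothetical sample data standing in for one finding aid. :)
    declare variable $sample :=
      <ead:ead>
        <ead:archdesc>
          <ead:dsc>
            <ead:c>
              <ead:note><ead:p>See also: Series II.</ead:p></ead:note>
              <ead:c>
                <ead:note><ead:p>See also:   Series II.</ead:p></ead:note>
              </ead:c>
              <ead:c>
                <ead:note><ead:p>Oversize.</ead:p></ead:note>
              </ead:c>
            </ead:c>
          </ead:dsc>
        </ead:archdesc>
      </ead:ead>;

    (: Only keep notes whose own c sits inside a parent c that also has a note. :)
    for $note in $sample//ead:c/ead:note[../parent::ead:c/ead:note]
    (: From the note element: up to its c, up to the parent c, down to that c's first note. :)
    let $parent-note := $note/parent::ead:c/parent::ead:c/ead:note[1]
    return
      <dupes>
        <note>{normalize-space($note)}</note>
        <results>{
          (: Compare whitespace-normalized string values, not raw text nodes. :)
          if (normalize-space($note) eq normalize-space($parent-note))
          then "same"
          else "different"
        }</results>
      </dupes>
    ```

    With this sample data, the first child note should come back as "same" (the extra spaces are normalized away) and the second as "different". Against the real database, $sample would be replaced by the opened collection.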

  2. Maureen,

    Here’s one way to do this, which relies on the powerful “deep-equal” function:

    I’ve removed the reference to the database in this example, but if you have the database already open in BaseX, you won’t need to use that db:open function. Also, this query should work whether or not you have c or c0X element names in the EAD corpus, but it still only looks for note elements. Since I don’t know how this duplication could have occurred, I’d think that you will probably want to check other EAD “notes,” too, like accessrestrict, relatedmaterial, etc.

    • Mark, this is very elegant, thank you. I didn’t know about fn:deep-equal — very handy. I’ll do a little bit of testing and let you know how this turns up.
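
    Mark’s query itself is not reproduced on this page, so the following is not his code. It is only a minimal sketch of how a deep-equal comparison between a note and the notes of its parent component might look, assuming the EAD 2002 namespace, components named c or c01 through c12, and hypothetical inline sample data with made-up id values.

    ```xquery
    (: Assumption: the export uses the EAD 2002 schema namespace. :)
    declare namespace ead = "urn:isbn:1-931666-22-9";

    (: Hypothetical sample data; the id values are invented for illustration. :)
    declare variable $sample :=
      <ead:ead>
        <ead:archdesc>
          <ead:dsc>
            <ead:c01 id="series1">
              <ead:note><ead:p>See also Series II.</ead:p></ead:note>
              <ead:c02 id="file1">
                <ead:note><ead:p>See also Series II.</ead:p></ead:note>
              </ead:c02>
              <ead:c02 id="file2">
                <ead:note><ead:p>Oversize.</ead:p></ead:note>
              </ead:c02>
            </ead:c01>
          </ead:dsc>
        </ead:archdesc>
      </ead:ead>;

    for $note in $sample//ead:note
    (: The note's own component: c, or c01 through c12. :)
    let $component := $note/parent::*[matches(local-name(), '^c(0[1-9]|1[0-2])?$')]
    (: The component one level up, if there is one. :)
    let $parent-component := $component/parent::*[matches(local-name(), '^c(0[1-9]|1[0-2])?$')]
    (: Keep the note only if some note on the parent component is deep-equal to it. :)
    where some $candidate in $parent-component/ead:note
          satisfies deep-equal($note, $candidate)
    return
      <duplicate component="{$component/@id}">{normalize-space($note)}</duplicate>
    ```

    With this sample, only the note in the component with id "file1" should be reported. One design point to keep in mind: deep-equal compares element names, attributes, and text exactly, so two notes that differ only in whitespace or in an attribute value will not match. To check other note-like elements, the ead:note steps could be widened to include names such as ead:accessrestrict or ead:relatedmaterial.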

  3. Hi Maureen,

    I might take a different approach, like querying all the note elements first and identifying duplicate values. Then, after isolating the duplicates, identify the location of the parent note nodes, and finally use an XQuery Update expression to delete the unwanted duplicate note nodes.

    Here’s a query I wrote for locating duplicate titles in MARCXML: https://gist.github.com/caschwartz/2789344. You could do something similar with the EAD notes.

    Thanks for sharing your xquery work, I’ve really enjoyed reading your posts!

    • Christine,

      Thank you — I think that this approach makes a lot of good sense! I’m so grateful for this community of x-enthusiasts to help me out of my dumb bind.

      Maureen

    • Also, I’ve been meaning to say that I’m sorry that we didn’t find each other back when I worked in Princeton, NJ. I know that there are quite a few folks at the libraries at the university who are doing interesting work with xquery — do you ever find yourself crossing paths with them?
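
    For reference, the first step of Christine’s suggestion above (collect all the notes, then find values that occur more than once) might be sketched with an XQuery 3.0 group by clause, as below. This is not the query from her gist. The namespace URI, the inline $sample data, and the id values are all assumptions for illustration, and the delete step is deliberately left out.

    ```xquery
    (: Assumption: the export uses the EAD 2002 schema namespace. :)
    declare namespace ead = "urn:isbn:1-931666-22-9";

    (: Hypothetical sample data; the id values are invented for illustration. :)
    declare variable $sample :=
      <ead:ead>
        <ead:archdesc>
          <ead:dsc>
            <ead:c id="series1">
              <ead:note><ead:p>See also Series II.</ead:p></ead:note>
              <ead:c id="file1">
                <ead:note><ead:p>See also   Series II.</ead:p></ead:note>
              </ead:c>
              <ead:c id="file2">
                <ead:note><ead:p>Oversize.</ead:p></ead:note>
              </ead:c>
            </ead:c>
          </ead:dsc>
        </ead:archdesc>
      </ead:ead>;

    for $note in $sample//ead:dsc//ead:note
    (: Group notes by their whitespace-normalized text. :)
    let $text := normalize-space($note)
    group by $text
    (: After grouping, $note holds every note that shares this text. :)
    where count($note) gt 1
    return
      <repeated count="{count($note)}">
        <text>{$text}</text>
        {
          (: List the component each repeated note belongs to. :)
          for $n in $note
          return <component>{string($n/parent::*/@id)}</component>
        }
      </repeated>
    ```

    With this sample, one group should be reported, containing the notes from "series1" and "file1". Run over a whole database, the grouping would probably also need a per-document key so that identical notes in unrelated finding aids are not lumped together, and the parent/child relationship would still need to be checked before deleting anything.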
