It’s a cliché to say that the use of EAD is only gradually shifting from a way to mark up text-based documents to a way of encoding archival data-as-data. But it’s a cliché for a reason.
At Tamiment, we’ve done a good job of making sure that any description we’ve created is present in our archival management system and online. This is an impressive and important accomplishment, and it couldn’t have happened without the work of dozens of staff and graduate students committed to Tamiment’s researchers and collections.
But now that step one* (get whatever data we have online so that people can discover our stuff) is done — CHECK — it’s time for step two. Let’s look at our legacy description and make sure that it’s more amenable to next-generation interfaces, data sharing, and re-use.
An article in the most recent issue of American Archivist by Katherine Wisser and Jackie Dean analyzes encoding practices across a corpus of more than 1,000 finding aids and finds, as they put it so diplomatically, that “the flexibility of the EAD structure is being taken full advantage of in practice.”** An article by Marc Bron, Bruce Washburn, and Merrilee Proffitt in the Code4Lib Journal discusses the consequences of this diversity of practice within the ArchiveGrid portal — broadly speaking, because the creators of ArchiveGrid can’t be assured that tags in finding aids will be uniformly present or similarly used, discovery tools can’t support the data (or vice versa).***
For the most part, we don’t really know the shape our data is in at Tamiment, but we can guess — legacy practices favored subject schemata instead of container information, allowed indices to flourish and propagate, and didn’t require standards-compliant titles or dates. We know that some data will be missing, some will be wrong, and some might not meet current practices for clarity and user-friendliness.
So, this post is about our first steps — understanding what shape our data is in so that we can get it to do what we need it to do. To do this, we had to decide what we wanted to know about our data. These questions fell into three categories:
- Do we have data in the places we should, as prescribed by our content and encoding standard?
- Do we have data in the places we should, as prescribed by our local standards?
- To the extent that we can determine this programmatically, is the content of these elements correct?
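To give a flavor of what the first two kinds of checks involve, here is a minimal sketch in Python rather than XQuery (our actual work is in XQuery; the element list below is illustrative, not DACS verbatim, and the sample record is invented):

```python
# Hypothetical sketch: check one EAD 2002 finding aid for the presence of
# a few DACS single-level elements. Element list is an illustration only.
import xml.etree.ElementTree as ET

NS = {"ead": "urn:isbn:1-931666-22-9"}  # EAD 2002 namespace

# Paths relative to <archdesc>; a sample, not the full DACS requirement.
REQUIRED = {
    "reference code": "ead:did/ead:unitid",
    "title": "ead:did/ead:unittitle",
    "date": "ead:did/ead:unitdate",
    "extent": "ead:did/ead:physdesc/ead:extent",
    "creator": "ead:did/ead:origination",
    "scope and content": "ead:scopecontent",
}

def audit(ead_xml: str) -> dict:
    """Return {element name: present?} for a single finding aid."""
    root = ET.fromstring(ead_xml)
    archdesc = root.find(".//ead:archdesc", NS)
    return {name: archdesc.find(path, NS) is not None
            for name, path in REQUIRED.items()}

# An invented, deliberately incomplete sample record.
SAMPLE_EAD = """<ead xmlns="urn:isbn:1-931666-22-9">
  <archdesc level="collection">
    <did>
      <unitid>TAM.001</unitid>
      <unittitle>Sample Collection</unittitle>
      <unitdate normal="1930/1960">1930-1960</unitdate>
    </did>
  </archdesc>
</ead>"""

print(audit(SAMPLE_EAD))  # title, id, and date present; extent etc. missing
```

Run over a whole corpus, per-file dictionaries like this roll up into the aggregate view described below.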
I’ve been tasked with investigating these questions and reporting back to our group. To do so, I wrote an XQuery that addresses the first two questions and gives us an aggregate view of all of our finding aids. In some cases, I ask for the content of an element — in other cases, I ask whether or not an element is present.
This XQuery looks at all of the DACS single-level optimum elements, but also looks at stuff that I just want to know — how many components are in a finding aid? How many components have dates? How many dates are normalized? How many components are “undated”? How many series or subseries are in a finding aid? Of those, how many have a scope and content note? Do we have all of the administrative information that isn’t required by DACS single-level but will be essential to getting collection control projects on track?
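The component-level counts can be sketched the same way; again, this is hypothetical Python standing in for our actual XQuery, and the sample record is invented:

```python
# Hypothetical sketch of the component-and-date counts described above.
import xml.etree.ElementTree as ET

NS = {"ead": "urn:isbn:1-931666-22-9"}  # EAD 2002 namespace

def component_stats(ead_xml: str) -> dict:
    root = ET.fromstring(ead_xml)
    # EAD 2002 allows unnumbered <c> and numbered <c01>..<c12> components.
    names = {"c"} | {f"c{n:02d}" for n in range(1, 13)}
    comps = [el for el in root.iter() if el.tag.split("}")[-1] in names]
    dates = [d for c in comps
             for d in c.findall("ead:did/ead:unitdate", NS)]
    return {
        "components": len(comps),
        "components_with_dates": sum(
            1 for c in comps
            if c.find("ead:did/ead:unitdate", NS) is not None),
        "normalized_dates": sum(1 for d in dates if d.get("normal")),
        "undated": sum(1 for d in dates
                       if (d.text or "").strip().lower() == "undated"),
    }

# Invented sample: one dated series with an undated folder, one series
# with no date at all.
SAMPLE_DSC = """<ead xmlns="urn:isbn:1-931666-22-9">
  <archdesc level="collection"><dsc>
    <c01><did><unittitle>Series I</unittitle>
      <unitdate normal="1930/1940">1930-1940</unitdate></did>
      <c02><did><unittitle>Folder 1</unittitle>
        <unitdate>undated</unitdate></did></c02>
    </c01>
    <c01><did><unittitle>Series II</unittitle></did></c01>
  </dsc></archdesc>
</ead>"""

print(component_stats(SAMPLE_DSC))
# 3 components; 2 with dates; 1 normalized; 1 "undated"
```

The real query produces counts like these for every finding aid at once, which is what makes corpus-wide remediation planning possible.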
I’ll post our results (they aren’t pretty) in my next blog post, and talk about our remediation plans, too.
*We may be done with step one, but we’re still working very hard on step zero — making sure that everything in our custody is described. Much more about this to come.
** Wisser, Katherine, and Jackie Dean. 2013. “EAD Tag Usage: Community Analysis of the Use of Encoded Archival Description Elements.” American Archivist 76 (2) (September 1): 542–566. http://archivists.metapress.com/content/X4H78GX76780Q072.
*** Bron, M., M. Proffitt, and B. Washburn. 2013. “Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems.” The Code4Lib Journal (22) (October 14). http://journal.code4lib.org/articles/8956.