Beast tables and relationships

I thought it would be helpful to spend a minute looking at our data before we dive into mapping Beast fields to ArchivesSpace.

I’ll note that while I refer to the Beast database, we actually have three separate Beast databases. This was due to size and performance issues as well as the department and unit structure of the Libraries (which has since changed). While the three databases share the same table and relationship structure, data entry practices and habits often varied among them, including adherence (or not) to our local requirements and to national content standards such as DACS.

Here’s a screenshot of the database structure:

Tables and relationships in the UMD Beast database

Fun, right? The meat of our information is stored in the archdescid table, which has an auto-generated primary key of archdescid. This field ties all of our tables together. If you are familiar with EAD, some of the field names will look familiar (unitid, abstract). Other fields tie information to other UMD systems (pid) or to our unit structure (corpname).

You will also notice that similar types of EAD information are not clustered together. EAD publication information is stored in its own table, yet the <address> for the repository, which appears in <publicationstmt>, is stored in the archdescid table. In current practice this does not matter, since a converter script assembles the EAD either way, but when you are trying to understand how the data fits together, it leaves you scratching your head as to why we did not store like information with like.
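To make that concrete, here is a minimal sketch (in Python, just for illustration) of how the tables hang together, assuming CSV exports of the Access tables; everything below except archdescid, unitid, and abstract is a hypothetical stand-in rather than the actual Beast schema.

```python
import pandas as pd

# Hypothetical CSV exports of two Beast tables; the real table names and
# columns (beyond archdescid, unitid, and abstract) will differ.
archdesc = pd.read_csv("archdescid.csv")      # core descriptive fields, one row per collection
publication = pd.read_csv("publication.csv")  # EAD publication statement fields

# archdescid is the auto-generated primary key that ties the tables together,
# so reassembling a full description is essentially a series of joins on it.
full = archdesc.merge(publication, on="archdescid", how="left")
print(full[["archdescid", "unitid", "abstract"]].head())
```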

In upcoming posts we’ll talk about how this information can map to ArchivesSpace accession records.

The Legacy Landscape

When thinking about legacy data, I have a whole lot of “legacy” to contend with: the Rare Book & Manuscript Library was officially established in 1930, though our collections date back to the founding of the University in 1754. There are 12 different units within the library, and, for a long time, collection management was practiced at the “unit” rather than the department level, with different curatorial units keeping their own collection lists and donor files. This all leaves me with a LOT of non-standardized data to wrangle.

This legacy data is scattered over various spreadsheets, shelf lists, catalog records, accession registers, and accessions databases (one homegrown, the other Archivists’ Toolkit). This means that, as much information as we have about our collections, it is all dispersed, and I’ve never really been able to answer questions like “How big is your collection?”, “How much of that is processed?”, or “How many manuscript collections does the RBML hold?” I know that’s not super unusual, but it can be frustrating; especially because we do know… sort of… I mean, let me get together a bunch of spreadsheets… and then count lines… and most of our collections are represented on at least one of those… well, except the ones that aren’t… or are there multiple times… and the binders – don’t forget the binders! I think a lot of you know what I mean. We HAVE collection data, we just don’t have it stored or structured in a way that can be used or queried particularly effectively, and what we do have is in no way standardized.

Getting intellectual control over the collection has been an ongoing project of mine since I started my position, but it has ramped up as we have started to think more seriously about making a transition to ArchivesSpace. If we are going to make a significant commitment to a new collection management system, the first step is being able to populate that system with reliable, complete, and standardized information about our collections. This has led me to spend a significant chunk of my time over the last year either updating our AT accession records to make sure that they comply with DACS’s requirements for single-level minimum records, or adding collection information from other sources into AT so that we have collection-level information about every collection in one place. I chose to go with minimum rather than optimum records since we are not using AT to generate finding aids or to hold the bulk of our descriptive metadata or any component metadata (more on that later!). My goal here is to get baseline intellectual control over our collection, so I am keeping the records lean. I am, however, adding both processing status and physical location in addition to the DACS-required elements so that I can more easily determine the size of our backlog, set processing priorities, and encourage myself and my colleagues to get used to looking in one central place for all key collection information. More about some of the strategies I’m using to make all of this happen in upcoming posts!
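For a flavor of what that auditing involves, here is a rough sketch of the kind of completeness check I mean, assuming the AT accession records have been exported to a spreadsheet; the column names are placeholders and the element list is abbreviated rather than a full statement of DACS single-level minimum.

```python
import pandas as pd

# Hypothetical export of Archivists' Toolkit accession records to CSV.
accessions = pd.read_csv("at_accessions.csv")

# A few DACS single-level minimum elements plus the two local additions
# mentioned above; the actual column names will differ.
required = ["identifier", "title", "date", "extent", "creator",
            "scope_content", "access_conditions", "languages",
            "processing_status", "physical_location"]

# Report how many records are missing each element, to size the cleanup work.
for field in required:
    blank = accessions[field].isna() | (accessions[field].astype(str).str.strip() == "")
    print(f"{field}: {blank.sum()} of {len(accessions)} records missing")
```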

The Beast

No, really, we call our homegrown archival management system the “Beast.” It is an Access database launched in the early to mid-2000s. Originally built as a locations register, the database evolved to support the creation of EAD finding aids published on ArchivesUM. A Java-based conversion program pulls information from the database into EAD elements in an XML file. Files are then checked by hand for XML and EAD compliance before being uploaded to ArchivesUM. You can read all about its creation in “Taming the ‘Beast’: An Archival Management System Based on EAD.”*

At the time, the Beast did some good things. It launched local EAD implementation, got more finding aids online, and consolidated collection information into a central location. It made it especially easy to put up abstracts of new accessions and unprocessed collections.

Over the years the Beast has evolved, including further development of the conversion scripts, but new functionality has been minimal for the past several years; the decision was made to wait for ArchivesSpace rather than make major changes to the Beast.

Here’s a rundown of the general issues with the Beast and our associated practices:

  • It was built around the local policies and practices of the time, which makes its functionality rigid in some cases. It was seen as a tool to get away from paper rather than as a source of reusable data.
  • While staff enter information using forms in Access for either “accessions” or “finding aids”, most of the information is stored in the same table, making it difficult on the back end to know what you are looking at.
  • It is clunky and difficult to link multiple accessions with a collection description.
  • Some fields don’t map to the best EAD tag choice (ex: all extent information is dumped into <physdesc> and not <extent>.)
  • For container lists, the Beast/ArchivesUM stylesheet requires that your intellectual and physical order MATCH EXACTLY. This limits flexibility in description and processing at various levels and often requires spending too much time physically moving around materials.
  • Not all finding aids uploaded to ArchivesUM are EAD compliant (they are all well-formed XML). People erred on the side of getting the finding aid up instead of figuring out the EAD error and the required changes in the Beast.
  • We did not do a good job at quality control. We just didn’t. We didn’t utilize controlled fields when we could have (ex: “linear feet” is spelled ten different ways; see the sketch after this list) and didn’t enforce adherence to local policies (ex: dates entered in date fields don’t match the format of acceptable dates in our processing manual).
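As a taste of what that cleanup will look like, here is a minimal sketch (in Python, for illustration) of normalizing the extent-unit spellings; the variant spellings below are assumptions, not an inventory of what is actually in the Beast.

```python
import re

# Assumed variant spellings pulled from an exported extent field; the real
# list from the Beast will be longer and messier.
samples = ["3 lin. ft.", "3 linear ft", "3 Linear Feet", "3 lin ft.", "3 linear feet"]

# Match the known ways of writing "linear feet", with or without periods.
VARIANTS = re.compile(r"\blin(?:ear)?\.?\s*(?:ft\.?|feet)", re.IGNORECASE)

def normalize_extent(value: str) -> str:
    """Rewrite any recognized 'linear feet' variant to the preferred spelling."""
    return VARIANTS.sub("linear feet", value).strip()

for s in samples:
    print(f"{s!r:20} -> {normalize_extent(s)!r}")
```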

In future posts I’ll share how we are mapping the Beast fields to ArchivesSpace as well as the specific data cleanup issues we are facing.

*Jennie A. Levine, Jennifer Evans, and Amit Kumar, “Taming the “Beast”: An Archival Management System Based on EAD,” Journal of Archival Organization, 4, no. 2 (2007): 63-98. http://dx.doi.org/10.1300/J201v04n03_05

Hello from Maryland

Like many of you, we are dealing with a ton of legacy data in various forms at the University of Maryland. I am nowhere near as advanced as Maureen is with OpenRefine, regular expressions, or XSLT, but I will be sharing my experiences as I learn new tools for managing our data more efficiently.

I am really excited, as we have some major infrastructure projects to work on this year, including implementing ArchivesSpace (coming from a homegrown system), redesigning ArchivesUM (our current finding aid interface), and implementing Aeon. As you can imagine, these are hefty projects that will completely change the way we operate.

Moving to a new archives content management system involves many moving components, stakeholders, policy decisions, and legacy data quandaries. It will give us an opportunity to update our policies, practices, and workflows and to bring our descriptions more in line with standards and best practices. A redesign of ArchivesUM is long overdue (we can talk about why in a future post), and we’ll be rebuilding both the user interface and the back-end administrative side.

I was part of a team that implemented Archivists’ Toolkit (AT) several years ago at the University of Oregon, so I have experience changing over to a new system. Looking back, I know we did a lot of cleanup work by hand that could have been automated. I also really wish we had shared more publicly about our work, including our detailed local AT documentation. You can view what we did share publicly:

  • Elizabeth Nielsen and Cassandra A. Schmitt, “A Joint Instance of the Archivists’ Toolkit as a Tool for Collaboration,” presentation at the annual meeting of the Society of American Archivists, Chicago, IL, August 22-27, 2011. http://ir.library.oregonstate.edu/xmlui/handle/1957/25253
  • Nathan Georgitis and Cassie Schmitt, “University of Oregon Archivists’ Toolkit Implementation,” A Webinar for Northwest Digital Archives Archivists’ Toolkit Interest Group, September 30, 2010. http://vimeo.com/15469318 [Confession: I have not watched this in years.]

While I will be posting on our legacy data cleanup, I will also be sharing our implementation plans, strategies, successes, and pitfalls as we venture into ArchivesSpace. I hope others will be able to benefit from this work. I know the AT@Yale blog was extremely helpful to me back in the day.

Looking forward to diving in!

How good is our data as data?

It’s a cliche to say that the adoption of EAD is only gradually moving from a way to mark up text-based documents to a way of encoding archival data-as-data. But it’s a cliche for a reason.

At Tamiment, we’ve done a good job of making sure that any description we’ve created is present in our archival management system and online. This is an impressive and important accomplishment, and couldn’t have been done if not for the work of dozens of staff and graduate students committed to Tamiment’s researchers and collections.

But now that step one* (get whatever data we have online so that people can discover our stuff) is done — CHECK — it’s time for step two. Let’s look at our legacy description and make sure that it’s more amenable to next-generation interfaces, data sharing, and re-use.

An article in the most recent issue of American Archivist by Katherine Wisser and Jackie Dean analyzes encoding practices across a corpus of more than 1,000 finding aids and finds, as they put it so diplomatically, that “the flexibility of the EAD structure is being taken full advantage of in practice”. ** An article by Marc Bron, Bruce Washburn and Merrilee Proffitt in the Code4Lib journal discusses the consequences of this diversity of practices within the ArchiveGrid portal — broadly speaking, because the creators of ArchiveGrid can’t be assured that tags in finding aids will be uniformly present or similarly used, discovery tools can’t support the data (or vice versa). ***

For the most part, we don’t really know the shape our data is in at Tamiment, but we can guess — legacy practices favored subject schemata instead of container information, allowed indices to flourish and propagate, and didn’t require standards-compliant titles or dates. We know that some data will be missing, some will be wrong, and some might not meet current practices for clarity and user-friendliness.

So, this post is about our first steps — understanding what shape our data is in so that we can make sure we can get it to do what we need it to. To do this, we had to decide what we wanted to know about our data. These questions fell into three categories:

  1. Do we have data in the places we should, as prescribed by our content and encoding standard?
  2. Do we have data in the places we should, as prescribed by our local standards?
  3. To the extent that we can determine this programmatically, is the content of these elements correct?

I’ve been tasked with investigating these questions and reporting back to our group. To do so, I wrote an XQuery that looks at questions 1 and 2 and gives us an aggregate view of all of our finding aids. In some cases, I ask for the content of an element — in other cases, I ask whether or not an element is present.

This XQuery looks at all of the DACS single-level optimum elements, but also looks at stuff that I just want to know — how many components are in a finding aid? How many components have dates? How many dates are normalized? How many components are “undated”? How many series or subseries are in a finding aid? Of those, how many have a scope and content note? Do we have all of the administrative information that isn’t required by DACS single-level but will be essential to getting collection control projects on track?
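For a sense of the kinds of aggregate checks this involves, here is a rough approximation sketched in Python with lxml rather than the actual XQuery; the directory path is a placeholder, only a few of the checks are shown, and it assumes namespaced EAD 2002 files.

```python
import glob
import re
from lxml import etree

EAD_NS = "urn:isbn:1-931666-22-9"               # EAD 2002 namespace
COMPONENT = re.compile(r"^c(0[1-9]|1[0-2])?$")  # matches c, c01 ... c12

totals = {"finding_aids": 0, "components": 0, "with_unitdate": 0,
          "normalized": 0, "undated": 0}

# Placeholder path to a directory of exported EAD finding aids.
for path in glob.glob("ead/*.xml"):
    totals["finding_aids"] += 1
    tree = etree.parse(path)
    for el in tree.iter(f"{{{EAD_NS}}}*"):
        if not COMPONENT.match(etree.QName(el).localname):
            continue
        totals["components"] += 1
        dates = el.findall(f"./{{{EAD_NS}}}did/{{{EAD_NS}}}unitdate")
        if dates:
            totals["with_unitdate"] += 1
        if any(d.get("normal") for d in dates):
            totals["normalized"] += 1
        if any((d.text or "").strip().lower() == "undated" for d in dates):
            totals["undated"] += 1

for key, value in totals.items():
    print(f"{key}: {value}")
```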

I’ll post our results (they aren’t pretty) in my next blog post, and talk about our remediation plans, too.

*We may be done with step one, but we’re still working very hard on step zero — making sure that everything in our custody is described. Much more about this to come.

** Wisser, Katherine, and Jackie Dean. 2013. “EAD Tag Usage: Community Analysis of the Use of Encoded Archival Description Elements.” American Archivist 76 (2) (September 1): 542–566. http://archivists.metapress.com/content/X4H78GX76780Q072.

*** Bron, M., M. Proffitt, and B. Washburn. 2013. “Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems.” The Code4Lib Journal (22) (October 14). http://journal.code4lib.org/articles/8956.

spreadsheet to finding aid — the game plan

Where I work, I have a lot of legacy data that isn’t accessible to researchers. Since we manage our description in an archival management system that exports to EAD-encoded finding aids (and we publish images through the finding aid), I want to get as much of that converted to EAD as I can.

My current project may be my white whale. Several times throughout the last decades, there have been description, re-description, and digitization projects for a particular collection of negatives in our repository. The images are described at the item level and they were digitized TEN years ago, but still, researchers don’t have access to these images. We still make them sit down with gloves and a light box, and frankly, these images don’t get as much use as they should.

The negatives were described at the item level. Titles were transcribed and possibly enhanced, with liberal use of brackets and abbreviations. I also have information about the photographer, subject (generally), negative number, something called a shoot number that I still can’t suss out, gauge, two different date fields in different combinations of m/d/y (usually the same, sometimes different, which is hard to fathom for an image), and information about the Kodak Gold CD onto which it was digitized (seriously). In many cases, there are many images with the same title, same date and same photographer.

So, I started with an EAD mapping and a plan.

Mapping:

  • Image title will map to <unittitle>. I’ll need to do some major clean-up to write out abbreviations, acronyms and initialisms. It will also be good to pay attention to brackets and delete or use alternate punctuation, as appropriate.
  • Photographer will map to <origination><persname>. There seem to be variants of names and initials — I’ll use faceting and clustering in OpenRefine to bring consistency to these. Since these are amateur photographers, the names will not be authorized.
  • After a lot of thinking, I’m going to use the subject column as the series structure of the finding aid. A previous version of this finding aid has series scope and contents notes based on these subjects. If this were a collection of more notable photographers, I would organize the finding aid by photographer instead, but this seems the best of bad options for a collection that researchers will approach by subject.
  • The negative number will map to <unitid>. In the case of images with the same caption, date, and photographer, I plan to have one component with multiple values in <unitid> (separated by a delimiter, since Archivists’ Toolkit doesn’t allow for multiple unitids, BLERGH), and an extent statement indicating how many images are in the component. I’ll figure out an Excel formula to calculate the extent statement, and I’ll collapse the multiple components into one using the blank down and join multi-valued cells features in OpenRefine (see the sketch after this list).
  • Gauge will map to <dimensions>.
  • Dates will map to <unitdate>. It looks like the first date is more complete than the second, but there are some cases where there’s a second date and not a first. I’ll use Excel formulas to break each apart by year, month, and day, and create a new date column that checks whether the first date is present and, if not, uses the second date. I’ll also use Excel formulas to convert to our (not DACS compliant, unfortunately) date format and to create normalized date formatting.
  • I’ll wait until the finding aid is created to map from c @id to information about the digitized file, since Archivists’ Toolkit doesn’t import that information (ARRRGGHHHH).
  • There are also some lines for missing negatives (these should be moved to the collection-level scope and contents note, for lack of a better place) and lines for negative numbers that were never assigned to an image (which should just be deleted).
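Here is a rough sketch of the collapsing and date logic from the mapping above, written in pandas just to make the steps concrete; in practice this will happen in Excel and OpenRefine, and the file name and column names below are placeholders rather than the real spreadsheet headers.

```python
import pandas as pd

# Placeholder file and column names standing in for the real spreadsheet.
df = pd.read_csv("negatives.csv", dtype=str)

# Prefer the first date field; fall back to the second when the first is blank
# (the same logic as the Excel formula described above).
first_ok = df["date1"].notna() & (df["date1"].str.strip() != "")
df["date"] = df["date1"].where(first_ok, df["date2"])

# Normalize m/d/y strings for an EAD @normal attribute; %y maps 00-68 to the
# 2000s, so push any "future" dates back a century for this 20th-century material.
parsed = pd.to_datetime(df["date"], format="%m/%d/%y", errors="coerce")
parsed = parsed.where(parsed <= pd.Timestamp.now(), parsed - pd.DateOffset(years=100))
df["date_normal"] = parsed.dt.strftime("%Y-%m-%d")

# Collapse rows that share title, date, and photographer into one component:
# join the negative numbers with a delimiter and count them for the extent note.
collapsed = (
    df.groupby(["title", "date", "photographer"], dropna=False)
      .agg(unitid=("negative_number", lambda s: "; ".join(s.dropna())),
           image_count=("negative_number", "size"))
      .reset_index()
)
collapsed["extent"] = collapsed["image_count"].astype(str) + " photographic negatives"
print(collapsed.head())
```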

Most of this work happens in Excel and OpenRefine. The Excel spreadsheet will then be imported into oXygen, and I’ll use an XSLT (already written, thank goodness) to convert this to EAD. I’ll then import the EAD to Archivists’ Toolkit, since the institution has decided to make this the canonical source of data. Finally, I’ll export the EAD OUT of AT and use split multi-valued cells / paste down in OpenRefine to get an item-level list of images WITH AT-created unitids, and use MS Access to map that to information about the digitized file in the original spreadsheet (using negative number as the primary key). This then gets sent to digital library folks, who do the <dao> linking.
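That last hop, from the AT export back to an item-level list matched against the original spreadsheet, would look roughly like this in pandas (standing in for the OpenRefine and MS Access steps); again, the file names and column names are placeholders.

```python
import pandas as pd

# Placeholder inputs: components exported back out of AT (with delimited
# unitids and the AT-assigned component ids) and the original spreadsheet.
components = pd.read_csv("at_export_components.csv", dtype=str)
originals = pd.read_csv("original_negatives.csv", dtype=str)

# Split the delimited unitid back into one row per negative
# (the OpenRefine "split multi-valued cells" / paste-down step).
items = (components.assign(negative_number=components["unitid"].str.split("; "))
                   .explode("negative_number"))

# Join to the original spreadsheet on negative number (the MS Access step)
# to pick up the digitized-file information needed for the <dao> links.
linked = items.merge(originals[["negative_number", "digital_file"]],
                     on="negative_number", how="left")

linked.to_csv("dao_worklist.csv", index=False)
```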

Does this all sound complicated and stupid? It should, because it is. I’m fairly confident that I’m using the most efficient tools for this kind of clean-up, but I’m sure part of my process could be improved. I also think that our tools and processes at my institution could be better.

My thoughts on how to use our tools better and avoid these kinds of clean-up projects in the future:

  • Most important: don’t describe at the item level. Why does archival theory seem to go out the window when we start working with images? Describe by roll, by event, by creator — give the researcher better clues about the context of creation. Photographic records, like all records, can be described in aggregate. It’s very difficult to do so, though, when context is destroyed and provenance isn’t documented. Especially in this case, where images are digitized, there’s no reason to create item-level metadata.
  • Let’s have everyone use the same platform and keep the metadata together right away. Wouldn’t it be great if each image file were assigned a URI and linked to the component in the <dao> field BY DIGITIZATION TECHNICIANS DURING SCANNING? Sure, maybe there would need to be some post-processing and some of this data would change, but it would all be together from the outset, instead of relying on the crazy internal logic of file systems or secondary spreadsheets.
  • The project isn’t finished until the project’s finished. A ten-year gap between digitization and publication in the finding aid is a big problem.
  • Archivists’ Toolkit isn’t a very good tool for archivists who want to do bulk manipulation of data. There, I said it. It would be nearly impossible to make these changes in AT — I need tools like Excel, OpenRefine, XSLT, and the XPath find/replace functions in oXygen to change this much data. Sure, I can export and then re-import, but AT doesn’t reliably round-trip EAD.
  • Maybe we shouldn’t be using AT as our canonical data source. It really doesn’t offer much added value from a data clean-up point of view, beyond being able to make bulk changes in a few fields (the names and subject modules are particularly useful, although our finding aids don’t even link that data!). And frankly, I’m not nuts about the EAD that it spits out. WHY DO YOU STRIP C @ID during import?!?! Why can’t extent be repeated? Why can’t some notes (like <physfacet>, which really isn’t even a note) be repeated? Why not multiple <unitid>s? And as I look at our AT-produced finding aids, I find a LOT of data mistakes that, thinking about it, are pretty predictable. A lot of crap gets thrown into <unittitle>. There’s confusion about the difference between container attributes and the <extent>, <physdesc>, and <physfacet> notes. I’m not endorsing the hand-coding of finding aids, but I think that there was some slippage between “Oh good! We don’t have to use an XML editor!” and “Oh good! No one needs to pay attention to the underlying encoding!”

I’ll be sure to report back when this project is done. Until then, keep me in your thoughts.

getting dates out of titles

Today, I’m taking a 43,000-line spreadsheet and turning it into a CRAAAAAAAAZZZZY finding aid. Even though the spreadsheet has a separate column for dates, there are often dates in the title column. They need to go. And luckily, they don’t need to be moved elsewhere (that would be possible but slightly trickier).

I spent some quality time with my data and noticed that in every case, the date was formatted like the examples below:

  • Meli, Robert What Do You Think Is Causing the Fighting In Palestine 2/15/48
  • Mc Arteen, Patrick Do You Favor Henry Wallace’s Decision to Run for President In on a Third-party Ticket 1/18/48
  • Keeran, Vincent What Is Causing the Current Rise In Prices 10/12/47

Basically, these are interview pieces. There’s the interviewee’s name, then the title of the interview, then the date.

Since I was working in OpenRefine (formerly Google Refine), I knew that I could use GREL (the Google Refine Expression Language) to replace this with blank text using a regular expression.

Google Refine regular expression

The expression is value.replace(/[0-9]{1,2}\/[0-9]{1,2}\/\d{2}/,""). Basically, this looks for a pattern of one or two digits, then a slash, then one or two digits, then a slash, then two digits. It replaces that pattern with nothing.

This is not the most elegant regex anyone has ever written. But it was pretty quick to figure out, and it worked!
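For anyone working outside of OpenRefine, the same pattern (minus the escaped slashes) works in Python’s re module; here it is run against a couple of the example titles above, with a strip() added to tidy up the space the date leaves behind.

```python
import re

# The same pattern as the GREL expression above: one or two digits, a slash,
# one or two digits, a slash, then two digits.
DATE = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/\d{2}")

titles = [
    "Meli, Robert What Do You Think Is Causing the Fighting In Palestine 2/15/48",
    "Keeran, Vincent What Is Causing the Current Rise In Prices 10/12/47",
]

for title in titles:
    # Remove the date and trim the space it leaves behind.
    print(DATE.sub("", title).strip())
```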

Next, I want to figure out a way to add a period after the interviewee’s name. More on that when I figure it out!