Repeating information at lower levels of description

Today I discovered a description pet peeve while testing how finding aid requesting will work with our Aeon implementation.

Highlighting practice of reusing a series title in a folder title.

Almost every single folder in this collection starts by repeating the series name, followed by more specific information for that particular folder. I know I’ve seen this before in our finding aids (and at previous institutions), but it’s pretty widespread in this example. You can view the full Anne St. Clair Wright papers finding aid.

Really, we don’t need to do this: it doesn’t add value, it makes the display more cluttered, and repeating information isn’t a good use of our time.

DACS covers this:

Principle 7.3: Information provided at each level of description must be appropriate to that level.

When a multilevel description is created, the information provided at each level of description must be relevant to the material being described at that level. This means that it is inappropriate to provide detailed information about the contents of files in a description of a higher level. Similarly, archivists should provide administrative or biographical information appropriate to the materials being described at a given level (e.g., a series). This principle also implies that it is undesirable to repeat information recorded at higher levels of description. Information that is common to the component parts should be provided at the highest appropriate level.

This principle is discussed in numerous articles on archival description (including on page 246 of Greene and Meissner’s 2005 MPLP article) and can be seen in many institutions’ processing manuals.

Going back to our processing manual, there really isn’t any explanation of the hierarchical relationships between levels of description, or any instruction stating that lower levels of description inherit description from above. There is some guidance on creating folder titles, but most of it has to do with formatting. There’s almost no explanation of how to develop series titles.

Adding this to the list of updates to make!



Our EAD — Standards Compliance

I mentioned in an earlier post that, in anticipation of our three big archival systems projects (migration to ArchivesSpace from Archivists’ Toolkit, implementation of Aeon, and re-design of our finding aids portal), we’re taking a cold, hard look at our archival data. After all, both Aeon and the finding aids portal will be looking directly at the EAD to perform functions — both use XSLT to display, manipulate, and transform the data.

So, there are some basic things we want to know. Will our data be good enough for Aeon to be able to turn EAD into a call slip (or add it to the proper processing queue, or know which reading room to send the call slip to)? Are our dates present and machine readable in such a way that the interface would be able to sort contents lists by date? And, while we’re at it, do our finding aids meet professional and local standards?

Let’s take a look at our criteria.

A single-level description with the minimum number of DACS elements must include:

  • Reference Code Element (2.1) — <unitid>
  • Name and Location of Repository Element (2.2) — <repository>
  • Title Element (2.3) — <unittitle>
  • Date Element (2.4) — <unitdate>
  • Extent Element (2.5) — <extent>
  • Name of Creator(s) Element (2.6) (if known) — <origination>
  • Scope and Content Element (3.1) — <scopecontent>
  • Conditions Governing Access Element (4.1) — <accessrestrict>
  • Languages and Scripts of the Material Element (4.5) — <langmaterial> (I decided to be generous and allow <langmaterial>/<language @langcode>, although I would prefer that there be content in the note)

For a descriptive record to meet DACS optimum standards, it must also include:

  • Administrative/Biographical History Element (2.7) — <bioghist>
  • Access points — <controlaccess>

At Tamiment, we’ve determined that the following elements must be included in a finding aid to meet local standards:

  • Physical location note — <physloc>
  • Restrictions on use note — <userestrict>
  • Immediate source of acquisition note — <acqinfo>
  • Appraisal note — <appraisal>
  • Abstract — <abstract>
  • Arrangement note — <arrangement>
  • Processing information note — <processinfo>
Our local standards also require that every series or subseries have a scope and content note, that every component have a title, date, and container, and that every date be normalized.
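Presence checks like these are easy to script. Here is a minimal sketch (not our actual audit code) of a Python pass over an EAD file that flags which required elements are missing. The element lists come straight from the tiers above; matching is by local name, so it works whether or not the file declares the EAD 2002 namespace.

```python
import xml.etree.ElementTree as ET

# Required elements, taken from the DACS minimum and local-standards lists above.
DACS_MINIMUM = ["unitid", "repository", "unittitle", "unitdate",
                "extent", "origination", "scopecontent",
                "accessrestrict", "langmaterial"]
LOCAL = ["physloc", "userestrict", "acqinfo", "appraisal",
         "abstract", "arrangement", "processinfo"]

def local_name(tag):
    """Strip a namespace prefix like {urn:isbn:1-931666-22-9}."""
    return tag.rsplit("}", 1)[-1]

def missing_elements(ead_xml, required):
    """Return the required elements that never appear in the document."""
    root = ET.fromstring(ead_xml)
    present = {local_name(el.tag) for el in root.iter()}
    return [name for name in required if name not in present]

# A toy finding aid (hypothetical) missing several DACS minimum elements.
sample = """<ead><archdesc><did>
  <unitid>TAM.001</unitid><unittitle>Sample papers</unittitle>
  <unitdate>1990-1995</unitdate><repository>Tamiment</repository>
</did></archdesc></ead>"""

print(missing_elements(sample, DACS_MINIMUM))
```

This only tests presence, not whether the content of an element is any good; checking content is a separate (and much harder) problem.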

I’ll talk about our reasons for these local standards in subsequent blog posts.

Finally, we’ve started thinking about which data elements must be present for us to be able to use the Aeon circulation system effectively. To print a call slip, a component in a finding aid needs the following information. Useful (but not required) fields are italicized:

  • Reference code element / call number — <unitid>. We have to know what collection the patron is requesting.
  • Repository note — <repository>. This should be a controlled string, so that the stylesheet knows which queue to send the call slip to. It may also be possible to do post-processing to add an attribute to this tag or a different tag, so that the string can vary but the attribute would be consistent enough for a computer to understand. In any case, we need SOME piece of controlled data telling us which reading room to visit to pull this material.
  • Container information — <container>. Every paged container should have a unique combination of call number and box number. There’s no good way to check this computationally — we’ve all seen crazy systems of double numbering, numbering each series, etc.
  • Collection title — <unittitle>. This is the title of the collection, which is useful for paging boxes.
  • Physical location note — <physloc>. This isn’t strictly necessary, but it is very useful to know whether boxes are onsite or offsite.
  • Access restrictions — <accessrestrict>. This is an operational requirement. By having the access restriction note, the page can see right away whether it’s okay to pull this box.
  • Fancy-pants scripting piece to add location information…. This would require a lot of data standardization (and probably data gathering, in some cases), but it would be great to have the location on the repository-eyes-only side of the call slip.
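To make the container requirement concrete, here is a rough sketch (hypothetical function names, and simplified to flat <c01> components, whereas real finding aids nest <c01> through <c12>) of flagging components that couldn’t produce a usable call slip: either no <container> at all, or a call number/box combination that repeats.

```python
import xml.etree.ElementTree as ET

def ln(tag):
    """Local name of a possibly-namespaced tag."""
    return tag.rsplit("}", 1)[-1]

def aeon_report(ead_xml):
    """Return (components with no container, duplicated (call number, box) pairs)."""
    root = ET.fromstring(ead_xml)
    # First <unitid> in document order is assumed to be the collection call number.
    call_no = next((e.text for e in root.iter() if ln(e.tag) == "unitid"), None)
    no_container, seen, duplicates = [], set(), []
    for c in (e for e in root.iter() if ln(e.tag).startswith("c0")):
        boxes = [b.text for b in c.iter() if ln(b.tag) == "container"]
        if not boxes:
            title = next((t.text for t in c.iter() if ln(t.tag) == "unittitle"), "?")
            no_container.append(title)
        for box in boxes:
            key = (call_no, box)
            if key in seen:
                duplicates.append(key)
            seen.add(key)
    return no_container, duplicates

# Toy example: two components share box 1, and one has no container.
sample = """<ead><archdesc><did><unitid>TAM.002</unitid></did><dsc>
<c01><did><unittitle>Correspondence</unittitle><container>1</container></did></c01>
<c01><did><unittitle>Clippings</unittitle><container>1</container></did></c01>
<c01><did><unittitle>Minutes</unittitle></did></c01>
</dsc></archdesc></ead>"""

print(aeon_report(sample))
```

Note that two folders legitimately sharing a box is normal; a real check would compare at the level of paged containers, not folders, which is part of why this is hard to verify computationally.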

So, how are we doing?


Frankly, I was pleasantly surprised. As you can see from the chart on the right, out of 1217 finding aids from that harvest, about two-thirds meet DACS single-level and optimum requirements. The reasons for failure vary: many are missing creator information, notes about the conditions governing access, and information about the language of the material. Happily, information about the historical context of the collection and the presence of access points is fairly common.

We also see that the vast majority of our finding aids will meet the requirements for Aeon compliance. The problem of components without containers is a big one, but is something that we’ve obviously dealt with using paper call slips, and will have to be a remediation priority. Once this is addressed, we still have the outstanding issue of how to consistently tell the computer where a finding aid is coming from. Once we decide how we want that data to look, we’ll be able to fix it programmatically.

Our most distressing number concerns local compliance, and the biggest offenders are physical location, immediate source of acquisition, and appraisal information. This reflects an overall trend in our repository of being careless with administrative information — we have very little information about when and how collections arrived and what interventions archivists made.

The requirement that appraisal information be included is extremely recent — unfortunately, this is the kind of information that is difficult to recover if not recorded at the time of processing. Hopefully, some information about appraisal may be included in processing information and separated materials notes.

For anyone interested in how our data breaks down, a chart is below.


Beast tables and relationships

I thought it would be helpful to spend a minute looking at our data before we dive into mapping Beast fields to ArchivesSpace.

I’ll note that while I refer to the Beast database, we actually have three separate Beast databases. This was due to size and performance issues as well as the department and unit structure of the Libraries (which has since changed). While the three databases share the same table and relationship structure, often the data entry practices and habits varied between them, including adherence (or not) to our local requirements and national content standards, such as DACS.

Here’s a screenshot of the database structure:

Tables and relationships in the UMD Beast database

Fun, right? The meat of our information is stored in the archdescid table, with an auto-generated primary key of archdescid. This field ties all of our tables together. If you are familiar with EAD, some of the field names will look familiar (unitid, abstract). Other fields tie information to other UMD systems (pid) or to unit structure (corpname).

You will also notice that similar types of EAD information are not clustered together. So, we have EAD publication information stored in its own table, yet the <address> for the repository, which appears in <publicationstmt>, is stored in the archdescid table. In current practice this does not matter, since we apply a converter script to create the EAD. But from the perspective of trying to understand how this data fits together, it has me scratching my head as to why we did not store like information together.

In upcoming posts we’ll talk about how this information can map to ArchivesSpace accession records.

The Beast

No, really, we call our homegrown archival management system the “Beast”. It is an Access database launched in the early-to-mid 2000s. Originally started as the locations register, the database evolved to support the creation of EAD finding aids published on ArchivesUM. A Java-based conversion program pulls information from the database into EAD elements in an XML file. Files are then checked by hand for XML and EAD compliance before upload to ArchivesUM. You can read all about its creation in “Taming the ‘Beast’: An Archival Management System Based on EAD.”*

At the time, the Beast did some good things. It launched local EAD implementation, got more finding aids online, and consolidated collection information into a central location. It did a great job at facilitating putting up abstracts of new accessions or unprocessed collections.

Over the years the Beast has evolved, including development of the conversion scripts, but there has been little new functionality in the past several years. The decision was made to wait for ArchivesSpace instead of making major changes to the Beast.

Here’s a highlight of general issues with the Beast and our associated practices:

  • It was built around the local policies/practices of the time, making its functionality rigid in some cases. It was seen as a tool to get away from paper rather than as a source of reusable data.
  • While staff enter information using forms in Access for either “accessions” or “finding aids”, most of the information is stored in the same table making it difficult on the back end to know what you are looking at.
  • It is clunky and difficult to link multiple accessions with a collection description.
  • Some fields don’t map to the best EAD tag choice (ex: all extent information is dumped into <physdesc> and not <extent>.)
  • For container lists, the Beast/ArchivesUM stylesheet requires that your intellectual and physical order MATCH EXACTLY. This limits flexibility in description and processing at various levels and often requires spending too much time physically moving around materials.
  • Not all finding aids uploaded to ArchivesUM are EAD compliant (though they are all well-formed XML). People fell on the side of getting the finding aid up instead of figuring out the EAD error and the required changes in the Beast.
  • We did not do a good job at quality control. We just didn’t. We didn’t use controlled fields when we could have (ex: linear feet is spelled ten ways) and didn’t enforce adherence to local policies (ex: dates entered in date fields don’t match the format of acceptable dates in our processing manual).

In future posts I’ll share how we are mapping the Beast fields to ArchivesSpace as well as the specific data cleanup issues we are facing.

*Jennie A. Levine, Jennifer Evans, and Amit Kumar, “Taming the ‘Beast’: An Archival Management System Based on EAD,” Journal of Archival Organization 4, no. 2 (2007): 63–98.

How good is our data as data?

It’s a cliche to say that adoption of EAD is only gradually moving from a way to mark up text-based documents to a way of encoding archival data-as-data. But it’s a cliche for a reason.

At Tamiment, we’ve done a good job of making sure that any description we’ve created is present in our archival management system and online. This is an impressive and important accomplishment, and couldn’t have been done if not for the work of dozens of staff and graduate students committed to Tamiment’s researchers and collections.

But now that step one* (get whatever data we have online so that people can discover our stuff) is done — CHECK — it’s time for step two. Let’s look at our legacy description and make sure that it’s more amenable to next-generation interfaces, data sharing, and re-use.

An article in the most recent issue of American Archivist by Katherine Wisser and Jackie Dean analyzes encoding practices across a corpus of more than 1,000 finding aids and finds, as they put it so diplomatically, that “the flexibility of the EAD structure is being taken full advantage of in practice”. ** An article by Marc Bron, Bruce Washburn and Merrilee Proffitt in the Code4Lib journal discusses the consequences of this diversity of practices within the ArchiveGrid portal — broadly speaking, because the creators of ArchiveGrid can’t be assured that tags in finding aids will be uniformly present or similarly used, discovery tools can’t support the data (or vice versa). ***

For the most part, we don’t really know the shape our data is in at Tamiment, but we can guess — legacy practices favored subject schemata instead of container information, allowed indices to flourish and propagate, and didn’t require standards-compliant titles or dates. We know that some data will be missing, some will be wrong, and some might not meet current practices for clarity and user-friendliness.

So, this post is about our first steps — understanding what shape our data is in so that we can make sure we can get it to do what we need it to. To do this, we had to decide what we wanted to know about our data. These questions fell into three categories:

  1. Do we have data in the places we should, as prescribed by our content and encoding standard?
  2. Do we have data in the places we should, as prescribed by our local standards?
  3. To the extent that we can determine this programmatically, is the content of these elements correct?

I’ve been tasked with investigating these questions and reporting back to our group. To do so, I wrote an XQuery that looks at questions 1 and 2 and gives us an aggregate view of all of our finding aids. In some cases, I ask for the content of an element — in other cases, I ask whether or not an element is present.

This XQuery looks at all of the DACS single-level optimum elements, but also looks at stuff that I just want to know — how many components are in a finding aid? How many components have dates? How many dates are normalized? How many components are “undated”? How many series or subseries are in a finding aid? Of those, how many have a scope and content note? Do we have all of the administrative information that isn’t required by DACS single-level but will be essential to getting collection control projects on track?
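The XQuery itself isn’t reproduced here, but the same kinds of counts can be sketched in Python (a simplified stand-in, assuming <c01>-style components and the @normal attribute as the marker of a normalized date):

```python
import xml.etree.ElementTree as ET

def ln(tag):
    """Local name of a possibly-namespaced tag."""
    return tag.rsplit("}", 1)[-1]

def finding_aid_stats(ead_xml):
    """Aggregate counts for one finding aid: components, dates,
    normalized dates, and components marked 'undated'."""
    root = ET.fromstring(ead_xml)
    dates = [e for e in root.iter() if ln(e.tag) == "unitdate"]
    return {
        "components": sum(1 for e in root.iter() if ln(e.tag).startswith("c0")),
        "dates": len(dates),
        "normalized": sum(1 for d in dates if "normal" in d.attrib),
        "undated": sum(1 for d in dates if (d.text or "").strip().lower() == "undated"),
    }

# Toy example: one normalized date, one 'undated' component.
sample = ("<ead><archdesc><dsc>"
          "<c01><did><unitdate normal='1990/1995'>1990-1995</unitdate></did></c01>"
          "<c01><did><unitdate>undated</unitdate></did></c01>"
          "</dsc></archdesc></ead>")

print(finding_aid_stats(sample))
```

Run over a whole directory of exported EAD files, per-file dictionaries like this roll up easily into the kind of aggregate view the XQuery produced.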

I’ll post our results (they aren’t pretty) in my next blog post, and talk about our remediation plans, too.

*We may be done with step one, but we’re still working very hard on step zero — making sure that everything in our custody is described. Much more about this to come.

** Wisser, Katherine, and Jackie Dean. 2013. “EAD Tag Usage: Community Analysis of the Use of Encoded Archival Description Elements.” American Archivist 76 (2) (September 1): 542–566.

*** Bron, M., M. Proffitt, and B. Washburn. 2013. “Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems.” The Code4Lib Journal (22) (October 14).