Clean up: Instructions for title in accession records

I’ll be tackling some of the core information in our ArchivesSpace accession records in a series of posts over the next few weeks. (I originally had multiple fields in a post, but it was getting way too long so I’m going to split things up a bit.)

Going in a somewhat logical order, we’ll start with the title field for accession records.

Actions:

Every accession requires an accession title. ArchivesSpace doesn’t require this, but it will be a local requirement (and it is a DACS-required field). To change our data, we’ve mostly used Excel and OpenRefine, including the clustering feature and some simple find-and-replace operations.

Make titles for the same materials consistent/match to facilitate management and searching.

  • ex: Records of the Office of the President, Office of the President, President’s Office records, and Office of the President records all clearly have the same originating department with the same name, but were formatted/entered differently.
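As a rough illustration of how variants like these can be grouped, here is a minimal Python sketch loosely modeled on the key-collision idea behind OpenRefine's clustering. The stopword and unit-term lists, and the names `title_key` and `cluster_titles`, are assumptions made for this example and not part of our actual workflow.

```python
import re
from collections import defaultdict

# Assumption: illustrative word lists, not a local standard.
UNIT_TERMS = {"records", "papers", "collection", "archives"}
STOPWORDS = {"of", "the"}

def title_key(title):
    """Build a crude matching key: lowercase, drop possessives, punctuation,
    stopwords, and archival-unit terms, then sort the remaining tokens."""
    text = title.lower().replace("\u2019", "'")   # normalize curly apostrophes
    text = re.sub(r"'s\b", "", text)              # "president's" -> "president"
    tokens = re.findall(r"[a-z0-9]+", text)
    kept = {t for t in tokens if t not in STOPWORDS and t not in UNIT_TERMS}
    return " ".join(sorted(kept))

def cluster_titles(titles):
    """Group titles that share the same key."""
    clusters = defaultdict(list)
    for t in titles:
        clusters[title_key(t)].append(t)
    return dict(clusters)

titles = [
    "Records of the Office of the President",
    "Office of the President",
    "President\u2019s Office records",
    "Office of the President records",
]
clusters = cluster_titles(titles)
for key, members in clusters.items():
    print(key, "->", members)
```

A key this aggressive will also merge titles that should stay separate, so each cluster still needs a human look before any bulk rename.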

Make titles DACS compliant when possible. See DACS 2.3 Title

  • Make first segment the person, family, or corporate body, put in direct order
  • Make second segment the archival unit
  • Replace “Archives” with “records”, “papers”, “collection”, or other appropriate term
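The reordering rules above can be sketched in Python. This is only a sketch under assumptions: the list of unit terms is illustrative, the function names are made up for the example, and the replacement term for “Archives” is passed in because choosing it takes an archivist’s judgment.

```python
import re

# Assumption: illustrative unit vocabulary; DACS allows other terms as well.
UNITS = ("records", "papers", "collection")
UNIT_FIRST = re.compile(r"^(%s) of (?:the )?(.+)$" % "|".join(UNITS), re.IGNORECASE)

def to_direct_order(title):
    """Rewrite '<Unit> of [the] <Name>' as '<Name> <unit>'.
    Titles that do not match that shape are returned unchanged."""
    title = title.strip()
    m = UNIT_FIRST.match(title)
    if not m:
        return title
    unit, name = m.group(1).lower(), m.group(2)
    return f"{name} {unit}"

def replace_archives(title, unit):
    """Swap a trailing 'Archives' for a unit term chosen by the archivist."""
    return re.sub(r"\bArchives$", unit, title.strip())

print(to_direct_order("Records of the Office of the President"))
print(replace_archives("Adelphi Citizens Association Archives", "records"))
```

Anything the pattern does not recognize is left alone, which keeps the script from mangling titles that need a manual decision.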

If there is a memorabilia item# in the accession number, use “memorabilia” in the title statement.

  • ex: “Harry Clifton Byrd papers” as the current title for a memorabilia accession. Rename “Harry Clifton Byrd memorabilia”.

For accessions titled only “memorabilia” add additional information from the abstract field when available.

  • ex: Abstract starts with “Democratic Party (Montgomery County).” Rename accession title to “Democratic Party (Montgomery County) memorabilia”.
  • ex: Abstract starts with “First Look Fair memorabilia” Rename accession title to “First Look Fair memorabilia”
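The two memorabilia rules could be expressed as a small function. This is a minimal sketch, assuming the caller already knows whether the accession number carries a memorabilia item number and that a person has picked the opening phrase of the abstract. The name `memorabilia_title`, its parameters, and the unit-term list are hypothetical.

```python
import re

# Assumption: illustrative list of unit terms that may end a title.
UNIT_PATTERN = re.compile(r"\s+(papers|records|collection)$", re.IGNORECASE)

def memorabilia_title(title, is_memorabilia, abstract_lead=None):
    """Apply the two memorabilia rules to one accession title.

    is_memorabilia: True when the accession number has a memorabilia item
        number (detecting that is left to the caller).
    abstract_lead: opening phrase of the abstract, chosen by a person.
    """
    title = title.strip()
    if not is_memorabilia:
        return title
    if title.lower() == "memorabilia":
        if not abstract_lead:
            return title                      # nothing available to add
        lead = abstract_lead.strip()
        if lead.lower().endswith("memorabilia"):
            return lead                       # abstract already names the unit
        return f"{lead} memorabilia"
    if title.lower().endswith("memorabilia"):
        return title
    # Swap a trailing unit term for "memorabilia", or append it if none is found.
    base = UNIT_PATTERN.sub("", title)
    return f"{base} memorabilia"

print(memorabilia_title("Harry Clifton Byrd papers", True))
print(memorabilia_title("memorabilia", True, "First Look Fair memorabilia"))
```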

 

Collections Assessment

Over the past two years (starting before I was hired) UMD has been conducting a “partial” collections assessment to determine the correct shelf location, descriptive state and products, and intellectual value of our archival collections. I use “partial” not because the project isn’t large or complicated, but because we started with collections that were listed in the Beast. I know we’re missing collections: I have heard people talk about them, but found no other information about them. Still, this collections assessment is a great first step that will (hopefully for a majority of our collections) paint a picture of where we stand and what our priorities are.

Our assessment consisted of three major parts.

Shelf Read. There had not been a comprehensive shelf read of archival collections since the materials moved into this building in the early 2000s. When work started on the other portions of the assessment, it quickly became clear that shelf reading would be essential. We also made our shelf numbering system consistent on each floor (some floors numbered their shelves differently!). The shelf read also included a large amount of triage accessioning. Over 600 collections weren’t entered into the Beast at all. 200 of these had accession numbers and were entered without a problem. At least another 200 didn’t even have accession numbers (we searched control files high and low for evidence). The remainder had various other problems, such as “duplicate” entries that partially matched other records, or boxes that had a name on them but no accession number, so there was a lot of comparing and reconciling work.

Descriptive State. Using some outputs from the Beast we identified the descriptive products available for each collection at the accession level. Since we rarely recorded (or updated) some of this information in the Beast, much of this work involved manually checking/looking for the products (I would do this process very differently now!) Initially “yes” or “no” values were used for some fields, but shortly into the process we added more information to help plan future work. Due to the lack of documentation, sometimes our answers are really guesses.

Product | Value
Inventory | yes, no, Word, Excel, PDF, Beast, database, paper
Abstract in ArchivesUM | yes, no, collection level (When can’t tell if accession is included)
Finding Aid | yes, no, Word, Excel, PDF, Beast, database, paper
Finding Aid in ArchivesUM | yes, no, collection level (When can’t tell if accession is included)
Type of Finding Aid | minimal, full [We dropped this one over time as we had no shared understanding of what was “full”, so what was marked as “minimal” to one person looked “full” to someone else.]
MARC record | yes, no

Intellectual Value. Based on other institutions’ examples (specifically Columbia’s), we adopted our own ranking system to designate a collection’s value. Curators assigned numbers from 1 to 5 (5 being the best) at the collection level (all accessions for a collection receive the same ranking) for both the intellectual/research value and a local value. The local value score helped account for issues such as exhibits, political importance, institutional value, donor stewardship, and other similar factors. We provided curators with brief descriptions of each ranking as well as questions to ask about the collections to help assign a value. Values can also be changed in the future based on changing priorities or new information.

Lessons learned about how we manage our data (some of this should be obvious!):

  • Shelf maintenance is extremely important. This function needs to be managed and documented in a common way instead of allowing multiple individuals to make different (often conflicting) decisions on these issues. (We don’t currently do this!)
  • We need to review our accessioning process. Lots of information (and collections) fell through the gaps. (This is in process.)
  • Have a standard location on your server for any inventories or finding aids. (We don’t currently do this! We do have some common places to find these files, but they are still very decentralized.)
  • Use a file naming convention for inventories, finding aids, and drafts. Also, please, please, please include the accession number or collection identifier in said file name. (We don’t currently do this! We have some good file naming for particular areas or from when particular people were on staff, but it is not consistent.)

In future posts I will discuss how we plan to use this data for project prioritization and planning, bringing it into ArchivesSpace, and adding rankings into our accessioning process.

As a plug, you should read the article, “Tackling the Backlog: Conducting a Collections Assessment on a Shoestring,” by colleagues Joanne Archer and Caitlin Wells in Management: Innovative Practices for Archives and Special Collections edited by Kate Theimer coming out this month.

Building a Case: Semantic URLs for Finding Aids

I’ve been working on the Beast project with Cassie as my final field study before graduation (sniff). My task has been to look at the resource records, analyze the EAD being produced by the Beast and how that will translate to ArchivesSpace, and identify the data cleanup steps needed to get us there. It’s been a fantastic learning experience, and I’m sad to be leaving now that the semester is ending.

One of the things I’ve spent a lot of time in graduate school thinking about is usability and user experience, and although it’s outside of the scope of my project, there’s something that’s been bugging me…

The Problem (as I see it)

When you look at a finding aid generated through ArchivesUM, the page URL looks something like this:

http://digital.lib.umd.edu/archivesum/actions.DisplayEADDoc.do?source=/MdU.ead.histms.0011.xml&style=ead

The ability to see what database actions are performed when a finding aid is called is of absolutely no use to the person viewing the page. The XML file title inserted in the middle is also useless, because it is related to none of the identifiers a researcher would use to access the collection.

By way of contrast, the URL for this post looks something like this:

https://icantiemyownshoes.wordpress.com/2014/05/08/semantic-urls-for-finding-aids/

WordPress, like many other sites, creates semantic URLs for each of the pages it generates.  It clearly identifies the source of the page, the date it was originally posted, and some human-readable form of the title, which can be altered by the author of the post.

Like the ArchivesUM URL, WordPress has provided a unique, static identifier for the information contained therein. Unlike its ArchivesUM counterpart, it provides the viewer with several important pieces of information, such as the date of publication and whether it is a sub-topic of the site. This has an impact on its findability, both on its own website and when it appears on a Google search results page. Users can quickly determine whether they find the source trustworthy, how new it is, and whether it references the topic they are interested in.

Semantic URLs also clearly identify one subpage of a website from another. This can have an impact on search engine optimization, as pages with similar long and indecipherable URLs may not be crawled.

Possible Solution

Let’s take a look at what some other repositories are doing with their URLs.  I’ve been using Princeton and Duke as touchstones throughout my project because they’re clearly thinking progressively about a lot of things, including URLs:

Princeton: http://findingaids.princeton.edu/collections/C0159

Duke: http://library.duke.edu/rubenstein/findingaids/africanamericanmisc/

In both cases, the last part of the URL is inserted using the <eadid> tag. Princeton uses its collection number, while Duke uses a shortened version of the collection title. Both are clean and easy to read. Users like it when they can easily understand a URL; it helps build confidence and trust in the website. It is arguable how practical these URLs are to the average user, because they’re not going to memorize them, but this consistent structure may be useful for archives staff. The same cannot be said for the current ArchivesUM URL standard.

I think the ArchivesSpace transition gives us the chance to adopt a new URL standard that looks something like this:

http://digital.lib.umd.edu/archivesum/finding-aids/<eadid>

As with the Duke example above, this would use the <eadid> tag to provide some version of the title of the collection. It could contain either a shortened title or the full title of the collection, e.g.:

http://digital.lib.umd.edu/archivesum/finding-aids/Adelphi-Citizens-Association-records

Admittedly, that could get a bit long for a semantic URL, but it would avoid confusion with similar records. This will be something to work out on the policy level.
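To make the idea concrete, here is a minimal Python sketch of turning a collection title into the last segment of such a URL. The function names and the optional `max_words` cutoff are assumptions for illustration; the base URL is just the proposed pattern above, not something that exists today.

```python
import re
import unicodedata
from typing import Optional

# The proposed (not yet real) URL pattern from this post.
BASE = "http://digital.lib.umd.edu/archivesum/finding-aids/"

def title_to_slug(title: str, max_words: Optional[int] = None) -> str:
    """Turn a collection title into a hyphen-separated, URL-safe string."""
    # Drop apostrophes first so "President's" does not split into two words.
    text = title.replace("'", "").replace("\u2019", "")
    # Strip accents and any remaining non-ASCII characters.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[A-Za-z0-9]+", text)
    if max_words is not None:
        words = words[:max_words]   # hypothetical policy knob for shortened titles
    return "-".join(words)

def finding_aid_url(title: str, max_words: Optional[int] = None) -> str:
    return BASE + title_to_slug(title, max_words)

print(finding_aid_url("Adelphi Citizens Association records"))
```

Shortening with something like `max_words` makes collisions between similar titles more likely, so whatever rule gets adopted would need a uniqueness check across all <eadid> values.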

The Pros of This Approach:

-Quickly conveys information about the repository and collection name

-Short and human-readable, which is an advantage for search engines

-Provides level of trust to user (which is admittedly hard to quantify)

-Can be done using the EAD ID field in ArchivesSpace

-Easy for reference archivists and researchers to identify collection by URL

-Removes “sausage making” database calls currently in URL, which reduces researcher confusion

-Can remain static even if database structure is changed

-If the finding aid is cross-linked on another page, users will have an idea where they’re going before they click on the link.

And the Cons:

-Can be confusing if collections have similar titles

-Need to set very clear rules for formatting, or automate it when converting from ArchivesSpace to finding aid

-Length is a concern: it should be short but still convey the information

-Requires policy changes for both the contents of <eadid> and the use of unique identifiers

The negatives to adopting a semantic URL approach are primarily in the implementation and can be mitigated by proper planning and clear policies. The positives boil down to the fact that for a relatively small amount of effort, we can have a huge impact on user experience as well as search engine optimization.

We are implementing ArchivesSpace in part to better serve our researchers into the future. It’s important, then, to consider everything about our EAD content, our finding aids, and our websites from the user’s perspective. The URL is the first thing a researcher will encounter, so why not start there?

 

Backlog Control — Known Unknowns

As part of a continuing attempt to understand our holdings, I’ve been writing a series of reports against our EAD. Of course, what’s in a finding aid only accounts for the stuff that someone bothered to record in the first place. To tackle undescribed collections, we’ve also been doing a shelfread project to get an understanding of what’s on our shelves.

Today, we take an “accessioning as processing” approach to accruals — we describe what’s there at the appropriate level of description at the time of accessioning, and we include a lot of the background information about how it came to us, what it all means, etc., to help make sense of it. This helps us avoid building a backlog.

In the past, however, there was a mysterious (to me) understanding of the nature of processed/unprocessed materials. We have many, many series of materials in collections (usually accruals) that may even have file-level inventories but are described as “unprocessed.” They don’t include essential information about immediate source of acquisition, creators, or what about these materials makes them hang together. I’m frankly not sure what my predecessors were waiting for — they did all the work of creating lots of description without doing any real explanation!

So, my boss wanted a sense of these known knowns — parts of collections that we need to at least give a better series title, or somehow take out of the limbo of “unprocessed”. She wanted to know how many series there were, which collections these series belong to, and how many boxes of stuff we’re talking about. It would also be great to know linear footage, but this is frankly unknowable from the data we have.

So, I wrote an xQuery. You can find it here. The xQuery looks for any series or subseries that has the string “unprocessed” in its title. From there, it reports out the distinct values of containers. The result looks something like this:

[Screenshot of the report output, 2014-05-06]

Perhaps you see the problem. Originally, I thought I just wanted to get a count of the distinct containers. My xpath for the variable that would give me box info (called footage here) originally looked like this:

$unprocessedfootage := count(distinct-values($series//ead:container[@type eq 'Box']))

The idea here was that it would take a series, get a census of the different boxes in that series, and count ’em up. But this gave me bad data. In the case of:

<container type="Box">10-17</container>

I would have “10-17” be considered one distinct value in the count, when really it represents 8 boxes. The report as I first had it was severely undercounting boxes.
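The undercount is easy to reproduce outside of xQuery. Below is a minimal Python sketch using the standard library's ElementTree against a made-up EAD fragment. The sample data, the function name, and the case-insensitive match on “unprocessed” are assumptions for this example; it is not a translation of my actual query.

```python
import xml.etree.ElementTree as ET

EAD_NS = {"ead": "urn:isbn:1-931666-22-9"}  # EAD 2002 schema namespace

# Made-up sample: one "unprocessed" series holding box 9 and boxes 10-17.
SAMPLE = """<ead xmlns="urn:isbn:1-931666-22-9">
  <archdesc level="collection"><dsc>
    <c01 level="series">
      <did><unittitle>Unprocessed accrual</unittitle></did>
      <c02><did><container type="Box">9</container></did></c02>
      <c02><did><container type="Box">10-17</container></did></c02>
    </c01>
  </dsc></archdesc>
</ead>"""

def unprocessed_box_values(root):
    """Collect distinct Box container strings from series or subseries
    whose unittitle contains 'unprocessed'."""
    values = set()
    for comp in root.iter():
        if comp.get("level") not in ("series", "subseries"):
            continue
        title = comp.find("ead:did/ead:unittitle", EAD_NS)
        if title is None or "unprocessed" not in "".join(title.itertext()).lower():
            continue
        for box in comp.findall(".//ead:container[@type='Box']", EAD_NS):
            values.add((box.text or "").strip())
    return values

values = unprocessed_box_values(ET.fromstring(SAMPLE))
# The naive count treats "10-17" as a single box, so it reports 2
# even though the sample describes 9 boxes.
print(sorted(values), len(values))
```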

If I want to get a count of the distinct containers, I have to deal with ranges like 10-17. I started by importing this into OpenRefine and separated the multi-valued cells in the “unprocessed” column so that each number or range was in its own cell/row.

Then, I did some googling and came across this StackOverflow answer that explained how to enumerate the values in a range in Excel (this will give me 10, 11, 12, 13, 14, 15, 16 and 17 from 10-17). I exported from OpenRefine and brought the document into Excel, separated the ranges into two columns, and did a quick if/then statement to repeat single values in the second column. From there, I just ran the VBA code that was provided. I brought the document BACK into Refine and separated multi-valued cells again, and found out that we have 908 distinct boxes of “unprocessed” materials in 67 collections.

Now, happily, we know exactly how big of a mess our described “unprocessed” materials are, and we’re in a much better position to make good sense of them.

Update — 2014 May 7

@ostephens on Twitter very helpfully pointed out that the dumb VBA step can easily be avoided by doing the work in OpenRefine.

He was completely right and was kind enough to give me the recipe.

After multi-valued cells were each in their own row/cell, I separated by “-” so that the beginning and end of each range was in its own column. Then, I created a new column based on the “First” column and did the following:

[Screenshot of the OpenRefine step, 2014-05-07]

On error, it copies the value from the original column so that my “enum” column is everything I need. Once I had the values enumerated, I split multi-value cells again and ended up with a much more beautiful process.
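For anyone who would rather script it, the same split, enumerate, and fall-back-on-error logic can be sketched in Python. To be clear, this is an analogue of the steps described here, not the expression from the screenshot. The function names and the comma separator for multi-valued cells are assumptions.

```python
def enumerate_cell(cell):
    """Expand '10-17' into ['10', ..., '17'].
    On any error (single value, non-numeric label), keep the original value."""
    value = cell.strip()
    first, sep, last = value.partition("-")
    try:
        start, end = int(first), int(last)
        if not sep or end < start:
            raise ValueError("not an ascending range")
        return [str(n) for n in range(start, end + 1)]
    except ValueError:
        return [value]

def distinct_boxes(cells, separator=","):
    """Split multi-valued cells, enumerate ranges, and collect distinct boxes."""
    boxes = set()
    for cell in cells:
        for part in cell.split(separator):
            if part.strip():
                boxes.update(enumerate_cell(part))
    return boxes

# Made-up cells: the second overlaps the first at box 17.
sample_cells = ["9, 10-17", "17, 18"]
boxes = distinct_boxes(sample_cells)
print(len(boxes))
```

One caveat: this deduplicates box numbers across everything passed in, so boxes from different collections would need to be keyed by collection first (for example, by running it once per collection).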

You can follow my steps by importing my OpenRefine project here.