Backlog Control — Known Unknowns

As part of a continuing attempt to understand our holdings, I’ve been writing a series of reports against our EAD. Of course, what’s in a finding aid only accounts for the stuff that someone bothered to record in the first place. To tackle undescribed collections, we’ve also been doing a shelfread project to get an understanding of what’s on our shelves.

Today, we take an “accessioning as processing” approach to accruals — we describe what’s there at the appropriate level of description at the time of accessioning, and we include a lot of the background information about how it came to us, what it all means, etc., to help make sense of it. This helps us avoid building a backlog.

In the past, however, there was a mysterious (to me) understanding of the nature of processed/unprocessed materials. We have many, many series of materials in collections (usually accruals) that may even have file-level inventories but are described as “unprocessed.” They don’t include essential information about immediate source of acquisition, creators, or what about these materials makes them hang together. I’m frankly not sure what my predecessors were waiting for — they did all the work of creating lots of description without doing any real explanation!

So, my boss wanted a sense of these known knowns — parts of collections that we need to at least give a better series title, or somehow take out of the limbo of “unprocessed”. She wanted to know how many series there were, which collections these series belong to, and how many boxes of stuff we’re talking about. It would also be great to know linear footage, but this is frankly unknowable from the data we have.

So, I wrote an xQuery. You can find it here. The xQuery looks for any series or subseries that has the string “unprocessed” in its title. From there, it reports out the distinct values of containers. The result looks something like this:

Screen Shot 2014-05-06 at 10.06.10 PM

Perhaps you see the problem. Originally, I thought I just wanted to get a count of the distinct containers. My xpath for the variable that would give me box info (called footage here) originally looked like this:

$unprocessedfootage := count(distinct-values($series//ead:container[@type eq ‘Box’]))

The idea here was that it would take a series, get a census of the different boxes in that series, and count ’em up. But this gave me bad data. In the case of :


I would have “10-17” be considered one distinct value in the count, when really it represents 8 boxes. The report as I first had it was severely undercounting boxes.

If I want to get a count of the distinct containers, I have to deal with ranges like 10-17. I started by importing this into OpenRefine and separated the multi-valued cells in the “unprocessed” column so that each number or range was in its own cell/row.

Then, I did some googling and came across this StackOverflow answer that explained how to enumerate the values in a range in Excel (this will give me 10, 11, 12, 13, 14, 15, 16 and 17 from 10-17). I exported from OpenRefine and brought the document into Excel, separated the ranges into two columns, and did a quick if/then statement to repeat single values in the second column. From there, I just ran the VBA code that was provided. I brought the document BACK into Refine and separated multi-valued cells again, and found out that we have 908 distinct boxes of “unprocessed” materials in 67 collections.

Now, happily, we know exactly how big of a mess our described “unprocessed” materials are, and we’re in a much better position to make good sense of them.

Update — 2014 May 7

@ostephens on twitter very helpfully pointed out that the dumb VBA step can easily be avoided by doing the work in OpenRefine.

He was completely right and was kind enough to give me the recipe

After multi-valued cells were each in their own row/cell, I separated by “-” so that the beginning and end of each range was in its own column. Then, I created a new column based on the “First” column and did the following:

Screen Shot 2014-05-07 at 10.29.10 AM

On error, it copies the value from the original column so that my “enum” column is everything I need. Once I had the values enumerated, I split multi-value cells again and ended up with a much more beautiful process.

You can follow my steps by importing my OpenRefine project here.


One thought on “Backlog Control — Known Unknowns

  1. If you wanted to stick with XQuery for string manipulation, here’s a solution to parse “10-17”:

    xquery version “1.0-ml”;

    (: 5/8/14 Build a sequence of numbers and count the number of items in the sequence 🙂

    fn:count(let $unprocessedfootage := “10-17”
    let $firstNum := xs:integer(fn:substring-before($unprocessedfootage, “-“))
    let $lastNum := xs:integer(fn:substring-after($unprocessedfootage, “-“))
    return for $box in ($firstNum to $lastNum)
    return $box)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s