On Containers

I’m here to talk about boxes. Get excited.

I’ve been spending a LOT of time lately thinking about containers — fixing them, modelling them, figuring out what they are and aren’t supposed to do. And I’ve basically come to the conclusion that as a whole, we spend too much time futzing with containers because we haven’t spent enough time figuring out what they’re for and what they do.

For instance, I wrote a blog post a couple of months ago about work we’re doing to remediate things that should not be happening with containers but are: barcodes assigned to two different containers, two different container types sharing the same barcode/identifier information, and so on. Considering the scale of our collections, the scale of these problems is mercifully slight, but these are the kinds of problems that turn into a crisis if a patron is expecting to find material in the box she ordered and the material simply isn’t there.

I’m also working with my colleagues here at Yale and our ArchivesSpace development vendor Hudson Molonglo to add functionality to ArchivesSpace so that it’s easier to work with containers as containers. I wrote a blog post about it on our ArchivesSpace blog. In short, we want to make it much easier to do stuff like assigning locations, assigning barcodes, indicating that container information has been exported to our ILS, etc. In order to do this, we need to know exactly how we want containers to relate to archival description and how they relate to each other.

As I’ve been doing this thinking about specific container issues, I’ve had some thoughts about containers in general. Here they are, in no particular order.

What are container numbers doing for us?

A container number is just a human-readable barcode, right? Something to uniquely identify a container? In other words, speaking in terms of the data model, isn’t this data that says something different but means the same thing? And is this possibly a point of vulnerability? At the end of the day, isn’t a container number something that we train users to care about when really they want the content they’ve identified? And don’t we have a much better system for getting barcodes to uniquely identify something than we do for box numbers?

In the days when humans were putting box numbers on a call slip and another human was reading that and using that information to interpret shelf location, it made sense to ask the patron to be explicit about which containers were associated with the actual thing that they want to see. But I think that we’ve been too good at training them (and training ourselves) to think in terms of box numbers (and, internally, locations) instead of creating systems that do all of that on the back end. Information about containers should be uniform, unadorned, reliable, and interact seamlessly with data systems. Boxes should be stored wherever is best for their size and climate, and that should be tracked in a locations database that interacts with the requesting database. And the actual information should be associated seamlessly with containers.

This means that instead of writing down a call number and box number and reading a note about how materials of this type are stored on-site and materials of another type are stored off-site, let’s take a lot of human error out of this. Let’s let patrons just click on what they want to see. Then, the system says “a-ha! There are so many connections in my database! This record is in box 58704728702861, which is stored in C-29 Row 11, Bay 2, Shelf 2. I’ll send this to the queue that prints a call slip so a page can get that right away!” And instead of storing box numbers and folder numbers in the person’s “shopping cart” of what she’s seen, let’s store unique identifiers for the archival description, so that if that same record gets re-housed into box 28704728702844 and moved to a different location, the patron doesn’t have to update her citation in any scholarly work she produces. Even if the collection gets re-processed, we could make sure that identifiers for stuff that’s truly the same persist.
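To make that concrete, here’s a minimal sketch of the lookup chain I have in mind. It is not how any real system does it: the component identifier `MS-0001/c00042`, the two dictionaries standing in for databases, and the second shelf location are all made up for illustration. The point is only that the patron’s request holds on to the description’s identifier, and the box and shelf get resolved at request time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    building: str
    row: int
    bay: int
    shelf: int


# Hypothetical lookup tables standing in for the description, container,
# and locations databases. The component identifier is invented.
component_to_barcode = {"MS-0001/c00042": "58704728702861"}
barcode_to_location = {"58704728702861": Location("C-29", 11, 2, 2)}


def call_slip(component_id: str) -> str:
    """Resolve a requested description to the box and shelf a page needs."""
    barcode = component_to_barcode[component_id]
    loc = barcode_to_location[barcode]
    return (f"{component_id}: box {barcode} at {loc.building} "
            f"Row {loc.row}, Bay {loc.bay}, Shelf {loc.shelf}")


def rehouse(component_id: str, new_barcode: str, new_location: Location) -> None:
    """Move the material to a new box; the component identifier is untouched."""
    component_to_barcode[component_id] = new_barcode
    barcode_to_location[new_barcode] = new_location


print(call_slip("MS-0001/c00042"))
# Re-house into a different box on a different (invented) shelf.
rehouse("MS-0001/c00042", "28704728702844", Location("C-30", 4, 1, 3))
print(call_slip("MS-0001/c00042"))
```

The same identifier produces a correct call slip before and after the re-housing, which is the whole reason to cite the description rather than the box.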

Also, don’t tell me that box numbers do a good job of giving cues about order and scale. There are waaaaaayyyyy better ways of doing that than making people infer relationships based on how much material fits into 0.42 linear feet.

We have the concepts. Our practice needs to catch up, and our tools do too.

Darn it, Archivists’ Toolkit, you do some dumb things with containers

Archival management systems are, obviously, a huge step up from managing this kind of information in disparate documents and databases. But I think that we’re still a few years away from our systems meeting their potential. And I really think that folks who do deep thinking about archival description and standards development need to insert themselves into these conversations.

Here’s my favorite example. You know that thing where you’re doing description in AT and you want to associate a container with the records that you just described in a component? You know how it asks you what kind of an instance you want to create? That is not a thing. This is just part of the AT data model — there’s nothing like this in DACS, nothing like it in EAD. Actual archival standards are smart enough to not say very much about boxes because they’re boxes and who cares? When it exports to EAD, it serializes as @label. LABEL. The pinnacle of semantic nothingness!

This is not a thing.

Like, WHY? I can see that this could be the moment where AT is asking you “oh, hey, do you want to associate this with a physical container in a physical place or do you want to associate it with a digital object on teh interwebz?” but there’s probably a better way of doing this.

My problem with this is that it has resulted in A LOT of descriptive malpractice. Practitioners who aren’t familiar with how this serializes in EAD think that they’re describing the content (“oh yes! I’ve done the equivalent of assigning a form/genre term and declaring in a meaningful way that these are maps!”) when really they’ve put a label on the container. The container is not the stuff! If you want to describe the stuff, you do that somewhere else!

Oh my gosh, my exclamation point count is pretty high right now. I’ll see if I can pull myself together and soldier on.

Maybe we should be more explicit about container relationships.

Now, pop quiz, if you have something that is in the physical collection and has also been microfilmed, how do you indicate that?

In Archivists’ Toolkit, there’s nothing clear about this. You can associate more than one instance with an archival description, but you can also describe levels of containers that (ostensibly) describe the same stuff, but happen to be a numbered item within a folder, within a box.

Anything can happen here.

So this means that in the scenario I mentioned above, it often happens that someone will put the reel number into container 3, making the database think that the reel is a child of the box.

But even if all of the data entry happens properly, EAD import into Archivists’ Toolkit takes any three <container> tags and, instead of making them siblings, brings the three together into a parent-child instance relationship like you see above. This helps maintain relationships between boxes and folders, but is a nightmare if you have a reel in there.
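For contrast, here’s a rough sketch of reading container relationships from explicit id and parent attributes on <container> instead of guessing from position. The sample component, its ids, and the reel number are all hypothetical, and this is not how AT’s importer works; it just shows that a box, its folder, and a reel can come out as two separate chains.

```python
import xml.etree.ElementTree as ET

# Hypothetical component: a folder in a box, plus a microfilm reel of the
# same material. The ids and numbers are made up for illustration.
SAMPLE = """
<c id="ref99" level="file">
  <container id="c1" type="Box">1</container>
  <container id="c2" parent="c1" type="Folder">2</container>
  <container id="c3" type="Reel">22</container>
</c>
"""


def container_chains(component: ET.Element) -> list[list[str]]:
    """Return one chain per top-level container, following @parent links."""
    containers = component.findall("container")

    # Index children by the id of the container they point to.
    children: dict[str, list[ET.Element]] = {}
    for c in containers:
        parent = c.get("parent")
        if parent is not None:
            children.setdefault(parent, []).append(c)

    def label(c: ET.Element) -> str:
        return f"{c.get('type')} {(c.text or '').strip()}"

    def walk(c: ET.Element, prefix: list[str]) -> list[list[str]]:
        path = prefix + [label(c)]
        kids = children.get(c.get("id"), [])
        if not kids:
            return [path]
        chains: list[list[str]] = []
        for kid in kids:
            chains.extend(walk(kid, path))
        return chains

    # Containers with no @parent are siblings, not links in one chain.
    chains: list[list[str]] = []
    for c in containers:
        if c.get("parent") is None:
            chains.extend(walk(c, []))
    return chains


print(container_chains(ET.fromstring(SAMPLE.strip())))
```

With this sample, the box and folder form one chain and the reel stands alone, so nothing makes the reel a child of the box.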

EAD has a way of representing these relationships, but the AT EAD export doesn’t really even do that properly.

 <c id="ref10" level="file">
     <unittitle>Potter, Hannah</unittitle>
     <unitdate normal="1851/1851">1851</unitdate>
     <container id="cid342284" type="Box" label="Mixed Materials (39002038050457)">1</container>
     <container parent="cid342284" type="Folder">2</container>
 </c>

 <c id="ref11" level="file">
     <unittitle>Potter, Horace</unittitle>
     <unitdate normal="1824/1824">1824</unitdate>
     <container id="cid342283" type="Box" label="Mixed Materials (39002038050457)">1</container>
     <container parent="cid342283" type="Folder">3</container>
 </c>

Here, we see that these box 1’s are the same: they have the same barcode (btw, see previous posts for help working out what to do with this crazy export and barcodes). But the container id makes it seem like these are two different things: they have two different container ids, and their folders refer to two different parents.

What we really want to say is “This box 1 is the same as the other box 1’s. It’s not the same as reel 22. Folder 2 is inside of box 1, and so is folder 3.” Once we get our systems to represent all of this, we can do much better automation, better reporting, and have a much more reliable sense of where our stuff is.
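As a rough illustration of the “this box 1 is the same as the other box 1’s” part, here’s a sketch that regroups the export above by barcode instead of by container id. It assumes the barcode is always the parenthesized number at the end of @label, which is what this export shows but is an assumption about any other data. The <dsc> wrapper and the closing tags are added only so the fragment parses.

```python
import re
import xml.etree.ElementTree as ET

# The two components from the export above, wrapped in a hypothetical <dsc>
# root and closed so that the fragment is well-formed.
EXPORT = """
<dsc>
  <c id="ref10" level="file">
    <container id="cid342284" type="Box" label="Mixed Materials (39002038050457)">1</container>
    <container parent="cid342284" type="Folder">2</container>
  </c>
  <c id="ref11" level="file">
    <container id="cid342283" type="Box" label="Mixed Materials (39002038050457)">1</container>
    <container parent="cid342283" type="Folder">3</container>
  </c>
</dsc>
"""

# Assumption: the barcode is the run of digits in parentheses ending @label.
BARCODE = re.compile(r"\((\d+)\)\s*$")


def boxes_by_barcode(root: ET.Element) -> dict:
    """Map each barcode to its box and the folders filed inside it."""
    id_to_barcode: dict[str, str] = {}
    boxes: dict[str, dict] = {}

    # First pass: every barcoded container, keyed by barcode, not by @id.
    for c in root.iter("container"):
        match = BARCODE.search(c.get("label", ""))
        if match and c.get("id"):
            barcode = match.group(1)
            id_to_barcode[c.get("id")] = barcode
            boxes.setdefault(
                barcode, {"box": f"{c.get('type')} {c.text}", "folders": []}
            )

    # Second pass: attach each child to the barcode its @parent resolves to.
    for c in root.iter("container"):
        barcode = id_to_barcode.get(c.get("parent"))
        if barcode:
            boxes[barcode]["folders"].append(f"{c.get('type')} {c.text}")
    return boxes


print(boxes_by_barcode(ET.fromstring(EXPORT.strip())))
```

Run against this fragment, the two container ids collapse into a single box under barcode 39002038050457, holding both Folder 2 and Folder 3.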

So if we want to be able to work with our containers as they actually are, we need to represent those properly in our technology. What should we be thinking about in our descriptive practice now that we’ve de-centered the box?

“Box” is not a level of description.

In ISAD(G) (explicitly) and DACS (implicitly), archivists are required to explain the level at which they’re describing aggregations of records. There isn’t a vocabulary for this, but traditionally, these levels include “collection”, “record group”, “series”, “file” and “item.” Note that “box” is not on this list or any other reasonable person’s list. I know everyone means well, and I would never discourage someone from processing materials in aggregate, but the term “box-level processing” is like nails on a chalkboard to me. As a concept, it should not be a thing. Now, series-level processing? Consider me on board! File-group processing? Awesome, sounds good! Do you want to break those file groups out into discrete groups of records that are often surrounded by a folder and hopefully are associated with distinctive terms, like proper nouns? Sure, if you think it will help and you don’t have anything better to do.

A box is usually just an accident of administrivia. I truly believe that archivists’ value is our ability to discern and describe aggregations of records. A box is not a meaningful aggregation, and describing it as such gives a false impression of the importance of one linear foot of material. I’d really love to see a push toward better series-level or file-group-level description, and less file-level mapping, especially for organizations’ records. Often, unless someone is doing a known item search, there’s nothing distinct enough about individual files as evidence (and remember, this is why we do processing: to provide access to and explain records that give evidence of the past) to justify sub-dividing them. I also think that this could help us think past unnecessary sorting and related housekeeping. Our job isn’t to make order from chaos*; it’s to explain records and their context of creation and use. If records were created chaotically and kept in a chaotic way, are we really illuminating anything by prescribing artificial order?

This kind of thinking will be increasingly important when our records aren’t tied to physical containers.

In conclusion, let’s leave the robot work to the robots.

If I never had to translate a call number to a shelf location again, it would be too soon (actually, we don’t do that at MSSA, but still). Let’s stop making our patrons care about boxes, and let’s start making our technology work for us.

* This blog’s title, Chaos –> Order, is not about bringing order to a chaotic past — it’s about bringing order to our repositories and to our work habits. In other words, get that beam out of your own eye, sucka, before you get your alphabetization on.


8 thoughts on “On Containers”

  1. Interesting–so are you suggesting that we should not perform any physical sorting of materials at any level, no matter how little they reflect a logical order?

    • This isn’t the theme of this post, but since you asked, no, I don’t think that there’s usually any good reason to do arrangement.

      I mean, what do you mean by “logical order”? Do you mean that they were part of a formal filing system and something was truly mis-filed? Okay, fine. But personal papers usually don’t have filing systems. What does arrangement get you that description doesn’t, other than giving the researcher an inaccurate presentation of how the creator interacted with his materials?

  2. Two comments:


    1a) We need unambiguous identifiers for content. We tend to use container identifiers as surrogate identifiers for their content because most content doesn’t have an attached, unambiguous identifier. An identifier that is not attached to the content will eventually become useless.

    So, containers it is, right? Please tell me you have another way of unambiguously attaching identifiers to content!

    1b) Nowhere in the profession have we dealt with the fact that container identifiers do double duty, as both content identifiers and as container identifiers. We need to be explicit about this.


    We had a massive legacy numbering system to cope with, and I attempt to describe below how we coped. Essentially, what is needed is to divorce location, container identifier, and archival resource identifiers from one another. It isn’t rocket science, but it takes an honest understanding that these are different things, even if the data that expresses them is often identical.

    Regarding Archivists’ Toolkit locations: basically, the data model has each instance “sitting” on a location as if there were no surrounding containers at all in the real world. In the real world we have boxes, we pull them, we track them, we put them back. If a system doesn’t address that, it’s a poor model. Despite our efforts, ArchivesSpace went forward with the AT model.

    Every archivist I’ve spoken with has “group boxes.” By this I mean containers that house components of many collections and/or the whole of many small collections. It’s not rocket science to manage this if the data model addresses the actual way we operate.

    I think we have the only archival holdings management system that deals with multi-level storage and retrieval. It’s a system in which a box can be just a box, and the content associated with it. The problem with implementing such a system is that it requires each archival unit of description and each piece of it that is physically separated to have an unambiguous identifier.

    Identifiers for described content (we call these control numbers) are associated in our system with identifiers for containers (we call these retrieval units). While most retrieval units (~210,000) are the same as the “box” of a collection and have exactly the same identifier as their content (that is Collection number and box number, see above about dual duty of this data), about 65,000 do not.

    Retrieval units can be stored inside other retrieval units. We can record something removed from any level of a retrieval unit as its own retrieval unit. Thus, the platinum print that is being loaned from a photograph collection for a museum exhibit can be recorded, removed, loaned, tracked, and returned without ever needing to be described.

    I look forward to seeing what you’ve come up with!

    • Hey Kate — I’m so glad that you commented!

      For 1, I would say that component-level URIs are a really good candidate for this kind of work. They just do the work of identifying the component-level record, and they’re separate from describing the physical place where the stuff is (which is especially important considering that so much of our stuff is increasingly not physical). PULFA has implemented this in a way that takes the EAD intellectual ID structure (http://findingaids.princeton.edu/collections/MC076/c01560) and I know that ArchivesSpace also manages components with unique identifiers that become URIs in the public interface. RESTful URIs are obviously the way to go, in my book, not just because they help solve the “identify the description” problem but because they’re also the direction of the web at large.

      With 2, I suppose that I would say that this sounds like a circulation function and that there are ways that I could imagine a circulation system like Aeon interacting with ArchivesSpace in a seamless way to track/update containers in their various locations. Are you folks Aeon users?

      Also, keep your eyes peeled on the ASpace@Yale blog — we’re working with an ASpace development vendor to model containers in such a way that you’ll be able to do more rigorous updating, tracking, managing locations, etc. This includes a need for managing “group boxes” as such (although we don’t have that practice at MSSA).

      I think you’re going to be happy with what we come up with, but until then let’s be sure to keep in touch to make sure that we’re all asking for the right things from these systems.

  3. Pingback: Chaos —> Order | Rehousing is Not Processing

  4. “Unless someone is doing a known item search” – glad you added that caveat, which is an important one, although not so important that it should complicate the reforms you propose.

    I’m sure there is a certain class of ‘power user’ that might need to track something down at the item level. Or maybe a user needs to re-examine key evidence. I’m sure there is some way to balance your new (semantic?) method of container labeling with precise, item-level citation. Even if that is just as simple as allowing a user to order a box based on the bar code number.

    • So, I really like Michelle Light’s statement in her EAD@10 talk about how we’re not really here to map everything that’s in our collection — we’re here to help people find stuff. I think that there are varying degrees of trade-offs where we can say “yes, this is the group of records with the kind of evidence you’re looking for. Welcome to doing research. Dig in.”

      Generally, when thinking about trade-offs, I’m a big fan of producing to the level of distinctive data. That is, if subject files are labelled, a file-level description is a great way to go, since there’s a lot of stuff for folks to find in a search environment. If files aren’t labelled and are only dated, that seems like a great candidate for rich series-level description.

      As far as distinct tracking/citations go, like I mentioned in reply to Kate’s comment, I really think we need to move away from citations that reference box numbers and toward persistent identifiers for our description. If the idea is to make research reproducible, I’d much rather give a person a URI pointer to the component record (which may become augmented over time but is fundamentally the same) than box or folder numbers, which seem far more capricious.

      • Great point about tradeoffs. As a counterpoint to my counterpoint, you discuss the durability of a persistent identifier. If someone has to re-produce research, you wouldn’t want them to rely on an outdated box number for a collection that has been rehoused (or a shelf location for a collection that has been physically moved). Maybe better for the researcher to take more time to locate an item than have a subjective or inaccurate citation to begin with.
