Getting into the guts of AT

It’s the thing that we keep saying — in order to deal with our masses of stuff better, we need better ways of understanding what we have. A lot of my questions aren’t just about what’s in our finding aids — they’re about the relationships between archival materials and other archival management functions — accessioning, digital object management, location management, container management. For instance, the following questions have come up in the past or could easily come up in the future:

  • Which collections are constituted of accessions that came in before 1980?
  • Which collections have digital objects associated with them? What are the URIs of those digital objects?
  • I have a barcode for a box. Can you tell me the materials that are supposed to be in that box? What collection is this from?
  • We haven’t used this location listed in the location table since 2005! Are there any boxes associated with that location? What are they?

In order to answer these questions, I need to write reports that join different tables in Archivists’ Toolkit together. And this is a little bit tricky, because in their own way, components in the AT database are hierarchical (just like in an EAD-encoded finding aid). If I have an instance (a container with a barcode), and I want to know which collection it belongs to, I don’t have a direct relationship in the database. Instead, an instance is associated with a component. That component is associated with its parent component. It may have a lot of ancestor components before the most high-level component is associated with the collection-level information in the resource table.

These relationships are made in sql through what are called “joins”. And joining a table on itself (in some cases several times, recursively) is a huge friggin pain in the neck. So, after mucking around for a little while, the solution was to just ask someone smarter than me how he would handle this.

This is where my colleague Steelsen comes in — Steelsen introduced the idea of writing a stored procedure that would look for the top-most component instead of having to do this through joins. And then he wrote them for me, because he is a mensch of the first order. His procedures are here, and available to anyone who might find them useful. They have seriously revolutionized the way that I’ve been able to do reporting and solve problems.

For instance, something that folks have been begging for is a barcode look-up tool — they have a barcode, and they want to know which collection it belongs to, what its call number is, which location it’s assigned to, and which components are associated with that box. So here’s what I wrote (the user indicates the barcode in the where statement):

use schema;
SELECT 
 CONCAT(r.resourceIdentifier1,
 ' ',
 LPAD(r.resourceIdentifier2, 4, '00')) 'Collection',
 r.title 'Collection Title',
 series.subdivisionIdentifier 'Series/Accession Number',
 series.title 'Series Title',
 rc.title 'Component Title',
 rc.dateExpression 'Component Date',
 adi.container1Type 'Container Type',
 adi.container1NumericIndicator BoxNum,
 adi.container1AlphaNumIndicator BoxAlpha,
 adi.container2NumericIndicator FolderNum,
 adi.container2AlphaNumIndicator FolderAlpha,
 adi.archDescriptionInstancesId InstanceID,
 adi.barcode Barcode,
 adi.userDefinedString1 'Voyager Info',
 loc.LocationId,
 loc.building,
 loc.room,
 loc.coordinate1NumericIndicator ShelfNum,
 loc.coordinate1AlphaNumIndicator ShelfAlpha
FROM
 ArchDescriptionInstances adi
 INNER JOIN
 ResourcesComponents rc ON adi.resourceComponentId = rc.resourceComponentId
 INNER JOIN
 LocationsTable loc ON adi.locationID = loc.locationID
 INNER JOIN
 Resources r ON r.resourceId = GETRESOURCEFROMCOMPONENT(rc.resourceComponentId)
 LEFT OUTER JOIN
 ResourcesComponents series ON GETTOPCOMPONENT(rc.resourceComponentId) = series.resourceComponentID
WHERE
 adi.barcode = 39002042658774;

Here I use two of Steelsen’s procedures. In GETRESOURCEFROMCOMPONENT, I go up the tree of a component to find out what resource it belongs to and join that to the resource. I use GETTOPCOMONENT to help figure out what series a component belongs to (this assumes that the top-most component is a series, but that’s usually a safe bet for us).

I’m a sql n00b, and this isn’t the most efficient query I’ve ever run, but I’m really happy with the results, which can be viewed in a spreadsheet here.

By changing the where statement, I can find out all kinds of associated information about a location, a collection, a box, whatever. I can find out if barcodes have been assigned to components with different box numbers; I can find out if components with the same barcode have been assigned to more than one location. This set of procedures has really been a godsend to help me know more about the problems I’m fixing. So many thanks to Steelsen. I hope others find them useful too.

Advertisements

Another Quick One — Locations where Accessions have been Assigned

If you assign accessions to locations, but move them around to a final home after processing, it may be helpful to see where your accessions were assigned and when the record was last touched. This query will help you do a little clean-up:

SELECT
Accessions.title,
Accessions.accessionNumber1,
Accessions.accessionNumber2,
Accessions.accessionNumber3,
Accessions.lastUpdated,
LocationsTable.building,
LocationsTable.room,
LocationsTable.coordinate1AlphaNumIndicator Shelf,
AccessionsLocations.accessionId,
AccessionsLocations.locationId
FROM
yale_dev.AccessionsLocations
INNER JOIN
Accessions ON Accessions.accessionId = AccessionsLocations.accessionId
JOIN
LocationsTable ON AccessionsLocations.locationId = LocationsTable.locationId

Here’s an example of some output. We may check, for instance, accessions from before 2015 to make sure that the accession location is still relevant. I hope this is useful to someone else!

Title Accn Date last modified loc accnID locID
Yale Guidance Nursery yearly reports 2010 A 085 2010-06-22 12:18:55 SML XXX X 7077 1933
Margenau, Henry, papers 2010 M 053 2010-11-16 15:51:37 SML XXX X 7078 1940

Quick Query — Finding Locations where Nothing is Assigned in Archivists’ Toolki

I just wrote a quick query to give records in the locations table in Archivists’ Toolkit that don’t have instances assigned to them. This sounds like a pretty common thing that folks want to see — here it is:

SELECT
*
FROM
LocationsTable loc
WHERE
loc.locationId BETWEEN 0 AND 10000
AND loc.locationId NOT IN (SELECT
locationId
FROM
ArchDescriptionInstances containers
WHERE
containers.locationId BETWEEN 0 AND 10000)
AND loc.locationId NOT IN (SELECT
locationId
FROM
AccessionsLocations accession
WHERE
accession.locationId BETWEEN 0 AND 10000);

dwn w/ abbrs

Maybe you’ve heard — library and archives content standards are NOT DOWN with abbreviations these days. This is part of an effort to recognize that we no longer have to fit our descriptions onto catalog cards, and that the less confusing jargon we present to our patrons, the better!

In my repository, we made an official switch from abbreviated months to spelled-out months. However, there was still an enormous corpus of legacy abbreviations in Archivists’ Toolkit.

This is the kind of problem that power tools are great at solving. Check out my sql, posted here, for a big, fat find-and-replace that goes through the date expression field in AT and makes everything better. It looks for month abbreviations, common misspellings, variations of the term “circa” and variations on “undated.” I have my eye on brackets, too, but it might be too hairy to deal with them right now. Please feel free to use this in your own repository.

Also, stay tuned for part two, where I dispose of contemptible manuscript tradition abbreviations!

On Containers

I’m here  to talk about boxes. Get excited.

I’ve been spending a LOT of time lately thinking about containers — fixing them, modelling them, figuring out what they are and aren’t supposed to do. And I’ve basically come to the conclusion that as a whole, we spend too much time futzing with containers because we haven’t spent enough time figuring out what they’re for and what they do.

For instance, I wrote a blog post a couple of months ago about work we’re doing to remediate stuff that should not but is happening with containers — barcodes being assigned to two different containers, two different container types with the same barcode/identifier information, etc. Considering the scale of our collections, the scale of these problems is mercifully slight, but these are the kinds of problems that turn into a crisis if a patron is expecting to find material in the box she ordered and the material simply isn’t there.

I’m also working with my colleagues here at Yale and our ArchivesSpace development vendor Hudson Molonglo to add functionality to ArchivesSpace so that it’s easier to work with containers as containers. I wrote a blog post about it on our ArchivesSpace blog. In short, we want to make it much easier to do stuff like assigning locations, assigning barcodes, indicating that container information has been exported to our ILS, etc. In order to do this, we need to know exactly how we want containers to relate to archival description and how they relate to each other.

As I’ve been doing this thinking about specific container issues, I’ve had some thoughts about containers in general. Here they are, in no particular order.

What are container numbers doing for us?

A container number is just a human-readable barcode, right? Something to uniquely identify a container? In other words, speaking in terms of the data model, isn’t this data that says something different but means the same thing? And is this possibly a point of vulnerability? At the end of the day, isn’t a container number  something that we train users to care about when really they want the content they’ve identified? And we have a much better system for getting barcodes to uniquely identify something than we do with box numbers?

In the days that humans were putting box numbers on a call slip and another human was reading that and using that information to interpret shelf location, it made sense to ask the patron to be explicit about which containers were associated with the actual thing that they want to see. But I think that we’ve been too good at training them (and training ourselves) to think in terms of box numbers (and, internally, locations) instead of creating systems that do all of that on the back end. Information about containers should be uniform, unadorned, reliable, and interact seamlessly with data systems. Boxes should be stored wherever is best for their size and climate, and that should be tracked in a locations database that interacts with the requesting database. And the actual information should be associated seamlessly with containers.

This means that instead of writing down a call number and box number and reading a note about how materials of this type are stored on-site and materials of another type are stored off-site, let’s take a lot of human error out of this. Let’s let them just click on what they want to see. Then, the system says “a-ha! There are so many connections in my database! This record is in box 58704728702861, which is stored in C-29 Row 11, Bay 2, Shelf 2. I’ll send this to the queue that prints a call slip so a page can get that right away!” And instead of storing box numbers and folder numbers in the person’s “shopping cart” of what she’s seen, let’s store unique identifiers for the archival description, so that if that same record get’s re-housed into box 28704728702844 and moved to a different location, the patron doesn’t have to update her citation in any scholarly work she produces. Even if the collection gets re-processed, we could make sure that identifiers for stuff that’s truly the same persists.

Also, don’t tell me that box numbers do a good job of giving cues about order and scale. There are waaaaaayyyyy better ways of doing that than making people infer relationships based on how much material fits into 0.42 linear feet.

We have the concepts. Our practice needs to catch up, and our tools do too.

Darn it, Archivists’ Toolkit, you do some dumb things with containers

Archival management systems are, obviously, a huge step up from managing this kind of information in disparate documents and databases. But I think that we’re still a few years away from our systems meeting their potential. And I really think that folks who do deep thinking about archival description and standards development need to insert themselves into these conversations.

Here’s my favorite example. You know that thing where you’re doing description in AT and you want to associate a container with the records that you just described in a component? You know how it asks you what kind of an instance you want to create? That is not a thing. This is just part of the AT data model — there’s nothing like this in DACS, nothing like it in EAD. Actual archival standards are smart enough to not say very much about boxes because they’re boxes and who cares? When it exports to EAD, it serializes as @label. LABEL. The pinnacle of semantic nothingness!

This is not a thing.

This is not a thing.

Like, WHY? I can see that this could be the moment where AT is asking you “oh, hey, do you want to associate this with a physical container in a physical place or do you want to associate it with a digital object on teh interwebz?” but there’s probably a better way of doing this.

My problem with this is that it has resulted in A LOT of descriptive malpractice. Practitioners who aren’t familiar with how this serializes in EAD think that they’re describing the content (“oh yes! I’ve done the equivalent of assigning a form/genre term and declaring in a meaningful way that these are maps!”) when really they’ve put a label on the container. The container is not the stuff! If you want to describe the stuff, you do that somewhere else!

Oh my gosh, my exclamation point count is pretty high right now. I’ll see if I can pull myself together and soldier on.

Maybe we should be more explicit about container relationships.

Now, pop quiz, if you have something that is in the physical collection and has also been microfilmed, how do you indicate that?

In Archivists’ Toolkit, there’s nothing clear about this. You can associate more than one instance with an archival description, but you can also describe levels of containers that (ostensibly) describe the same stuff, but happen to be a numbered item within a folder, within a box.

Anything can happen here.

Anything can happen here.

So this means that in the scenario I mentioned above, it often happens that someone will put the reel number into container 3, making the database think that the reel is a child of the box.

But even if all of the data entry happens properly, EAD import into Archivists’ Toolkit will take any three <container> tags and instead of making them siblings, brings the three together into parent-child instance relationship like you see above. This helps maintain relationships between boxes and folders, but is a nightmare if you have a reel in there.

EAD has a way of representing these relationships, but the AT EAD export doesn’t really even do that properly.

 <c id="ref10" level="file">
   <did>
     <unittitle>Potter, Hannah</unittitle>
     <unitdate normal="1851/1851">1851</unitdate>
     <container id="cid342284" type="Box" label="Mixed Materials (39002038050457)">1</container>
     <container parent="cid342284" type="Folder">2</container>
   </did>
 </c>

 <c id="ref11" level="file">
   <did>
     <unittitle>Potter, Horace</unittitle>
     <unitdate normal="1824/1824">1824</unitdate>
     <container id="cid342283" type="Box" label="Mixed Materials (39002038050457)">1</container>
     <container parent="cid342283" type="Folder">3</container>
   </did>
 </c>

Here, we see that these box 1’s are the same — they have the same barcode (btw, see previous posts for help working out what to do with this crazy export and barcodes). But the container id makes it seem like these are two different things — they have two different container id’s and their folders refer two two different parents.

What we really want to say is “This box 1 is the same as the other box 1’s. It’s not the same as reel 22. Folder 2 is inside of box 1, and so is folder 3.” Once we get our systems to represent all of this, we can do much better automation, better reporting, and have a much more reliable sense of where our stuff is.

So if we want to be able to work with our containers as they actually are, we need to represent those properly in our technology. What should we be thinking about in our descriptive practice now that we’ve de-centered the box?

“Box” is not a level of description.

In ISAD(G) (explicitly) and DACS (implicitly), archivists are required to explain the level at which they’re describing aggregations of records. There isn’t a vocabulary for this, but traditionally, these levels include “collection”, “record group”, “series”, “file” and “item.” Note that “box” is not on this list or any other reasonable person’s list. I know everyone means well, and I would never discourage someone from processing materials in aggregate, but the term “box-level processing” is like nails on a chalkboard to me. As a concept, it should not be a thing. Now, series-level processing? Consider me on board! File-group processing? Awesome, sounds good! Do you want to break those file groups out into discrete groups of records that are often surrounded by a folder and hopefully are associated with distinctive terms, like proper nouns? Sure, if you think it will help and you don’t have anything better to do.

A box is usually just an accident of administravia. I truly believe that archivists’ value is our ability to discern and describe aggregations of records — that box is not a meaningful aggregation, and describing it as such gives a false impression of the importance of one linear foot of material. I’d really love to see a push toward better series-level or file-group-level description, and less file-level mapping, especially for organizations’ records. Often, unless someone is doing a known item search, there’s nothing distinct enough about individual files as evidence (and remember, this is why we do processing — to provide access to and explain records that give evidence of the past) to justify sub-dividing them. I also think that this could help us think past unnecessary sorting and related housekeeping — our job isn’t to make order from chaos*, it’s to explain records and their context of creation of use. If records were created chaotically and kept in a chaotic way, are we really illuminating anything by prescribing artificial order?

This kind of thinking will be increasingly important when our records aren’t tied to physical containers.

In conclusion, let’s leave the robot work to the robots.

If I never had to translate a call number to a shelf location again, it would be too soon (actually, we don’t do that at MSSA, but still). Let’s stop making our patrons care about boxes, and let’s start making our technology work for us.


* This blog’s title, Chaos –> Order, is not about bringing order to a chaotic past — it’s about bringing order to our repositories and to our work habits. In other words, get that beam out of your own eye, sucka, before you get your alphabetization on.

 

Sniffing out problems — bad barcode information

Like many large repositories, we barcode all of our containers and keep them offsite. This is a pretty straightforward and on the whole error-resistant process, where an archivist uses a plug-in to Archivists’ Toolkit to select a container and then wands in a barcode to associate it with that container. That barcode then gets stored in an item record in our ILS, in the off-site storage system’s database, and on a barcode sheet. It’s also serialized as part of the EAD. When a patron wants a box, a series of systems and people  (including Aeon, our ILS, and our off-site storage workers) work behind the scenes to bring it here within two business days. It’s pretty magical, and it’s how we can provide a high quality of service to our patrons.

However, there are a few less fool-proof methods by which a barcode could be introduced to a database. A barcode could be entered as part of rapid data entry. Since container information is entered as part of each component, it’s possible to accidentally mistype the container number but wand the same barcode. It’s also possible for barcodes to be entered as part of EAD import (although they don’t make it easy) or as a SQL update.

We also have an extra legacy problem — a bad update (I don’t know if it was SQL or XSL) assigned barcodes based on container number, but didn’t check for container type. So, reel 1 was assigned the same barcode as box 1.

The diagnosis is pretty straightforward (although I attempted it a zillion much-harder ways first). As always, many many thanks to my colleague Mark, the brains behind this operation, who actually wrote this SQL query after seeing me flail about.

This query basically asks — okay. Give me a list of all of your distinct barcodes and container numbers. Do you have any barcodes that are associated with more than one container number? If so, that’s a huge problem! Report back that barcode!

Barcode 987654321 should only be associated with box 1 — if it’s associated with both box 1 AND box 2, that’s a real problem. It means either that somehow the barcode was duplicated (which we go to GREAT PAINS to prevent happening) or that there was a data entry error.

You can run this query against your own AT database to see if there were any barcodes assigned to more than one box.

select barcode,
group_concat(distinct coalesce(container1NumericIndicator, container1AlphaNumIndicator)) from ArchDescriptionInstances where barcode <> ''
group by barcode
having count(barcode) > 1
and count(distinct coalesce(container1NumericIndicator, container1AlphaNumIndicator)) > 1

Maybe you have a healthier relationship with the small stuff than I do, but this is the kind of thing that maintains a constant low-level stress buzz in the back of my head. I find it reassuring that we can run these kinds of reports/checks to know for sure that collection control is working the way it should.

Oh, and by the way, the diagnosis query for my reel/box problem is here:

SELECT barcode FROM
(
SELECT distinct barcode, container1Type FROM
ArchDescriptionInstances
) as tt
group by tt.barcode
having count(tt.barcode) > 1

I hope that someone out there finds this useful!

Round-tripping EAD, updated — how to not lose those barcodes

Several weeks ago, Bonnie published a hugely-useful checklist of things to think about when trying to round-trip data from Archivists’ Toolkit to EAD and back to Archivists’ Toolkit to help minimize data loss.

Don’t lose those barcodes!

In a comment, my colleague Mark gave a very sensible suggestion for how to not lose barcodes during that trip. When you export an EAD from AT, the container information looks something like this:

<c id="ref21" level="file">
    <did>
        <unittitle>Notecards</unittitle>
        <container id="cid1063028" type="Box" label="Mixed Materials (39002042652603)">7</container>
        <unitdate normal="1930/1940" type="inclusive">circa 1935-1950</unitdate>
    </did>
</c>

What I see here is that two separate fields in the AT database have been jammed together for EAD export. “Mixed materials” lives in the ArchDescriptionInstances table in AT as instanceType. 39002042652603 lives in the ArchDescriptionInstances table as barcode. During export, these are brought together as container @label.

So, the trick is to get them back into the fields in tables where they belong. By default, AT imports container @label to ArchDescriptionInstances.instanceType. You want to get the barcode information into barcode. Here’s how.

I sometimes have bad luck getting AT to re-import if container @label=”Mixed Materials (39002042652603)” — I do a big find/replace to make that look like @label=”39002042652603″

Then, I import into AT.

Getting the barcodes where they belong is pretty easy, but it has to happen in SQL in the back-end database. If you haven’t written SQL before, you might want to run this by a database administrator, just to have another set of eyes on it. You DEFINITELY want to do this in the test or dev version of your database first.

Step 1: copy over barcodes from instanceType to barcode (this assumes 14-digit barcodes)

UPDATE ArchDescriptionInstances 
SET barcode = instanceType
WHERE instanceType REGEXP '^[0-9]{14}$';

Step 2: replace barcodes in instanceType with “Mixed materials” (this assumes that you want everything to be “Mixed Materials”. I have a rant in the works about how inappropriate it is to have this label on containers anyway, but I’ll save that for another day).

UPDATE ArchDescriptionInstances 
SET instanceType = 'Mixed materials'
WHERE instanceType REGEXP '^[0-9]{14}$';

Congratulations! You haven’t lost your barcodes! You can now delete the previous resource record and enjoy your new-and-improved one.

But that’s not all…

Extra super-secret tips for folks keeping Voyager item information in Archivists’ Toolkit.

So, where I work, this wouldn’t be enough. We have a whole other set of data in the ArchDescriptionInstances table that isn’t serialized in EAD. Container type, location code, Voyager bibid, container restriction and a Boolean of whether this information has been exported to Voyager are kept in the ArchDescriptionInstances table in user defined fields. We then have a background program that sends this information to Voyager, where holdings/item records are created.

This means that if I round-trip EAD, this item information that exists is disassociated with the descriptive information. This doesn’t work.

Note: If you’re also using the plug-in that Yale developed to keep Voyager holdings information in AT, could you please drop me a line? We’re starting to think about migration paths to ArchivesSpace.

But I had a brainwave — if I round-trip EAD with barcodes, I now have a key. The same information associated with a barcode which is now blank (what’s the Voyager bib/holdings ID? was it exported?), is present in a different record with the same barcode from when the information was entered before, before the EAD was round-tripped.

This means that no one has to re-enter the same information or re-export to Voyager. Check out my SQL update below — same warnings and caveats apply.

 

start transaction;

UPDATE schema.ArchDescriptionInstances AS ArchDesc

JOIN (
SELECT
barcode,
userDefinedString1,
userDefinedString2,
locationId,
userDefinedBoolean1,
userDefinedBoolean2
FROM schema.ArchDescriptionInstances
WHERE userDefinedString1 = "{{{bibid}}}"
) AS newdata ON newdata.barcode=ArchDesc.barcode
SET
ArchDesc.userDefinedString1 = newdata.UserDefinedString1,
ArchDesc.userDefinedString2 = newdata.UserDefinedString2,
ArchDesc.locationId = newdata.locationId,
ArchDesc.userDefinedBoolean1 = newdata.userDefinedBoolean1,
ArchDesc.userDefinedBoolean2 = newdata.userDefinedBoolean2

WHERE
ArchDesc.userDefinedString1 = ""
AND ArchDesc.userDefinedString2 = "";

commit;