Clean Metadata for Non-Metadata Geeks

Over the past two years, Maureen, Carrie, Meghan, Cassie and their guests have turned this blog into a powerhouse of know-how around working smarter with archival metadata. Some of us really enjoy this type of work; we find it crazy satisfying and it aligns well with our worldviews. We acknowledge, with some pride, that we are metadata geeks. But not all archivists are like this, AND THAT’S TOTALLY OKAY. We all have different strengths, and not all archivists need to be data wranglers. But we can all produce clean metadata.

Just one of the awesome metadata jokes promulgated by AVPreserve's button campaign

Today, though, I’m going to take a BIG step backward and talk for a few minutes about what we actually mean when we talk about “clean” data, and I’ll share a few basic things that any archivist can do to help prevent their painstakingly produced metadata from becoming someone else’s “clean up” project later.

As Maureen explained in the very first Chaos —> Order post, the raison d’etre of all of this is to use computers to do what they do best, freeing the humans to do what they do best. Computers are really good at quickly accomplishing tasks like indexing, searching, replacing, linking and slicing up information for which you can define a rule or pattern. These are tasks that would take a human tens or hundreds of hours to do by hand, and they don't require any of the higher-level processes that are still unique to humans, let alone the specialized training or knowledge of an archivist. Clean data is, quite simply, data that is easy for a computer to digest in order to accomplish these tasks.

Getting into the guts of AT

It’s the thing that we keep saying — in order to deal with our masses of stuff better, we need better ways of understanding what we have. A lot of my questions aren’t just about what’s in our finding aids — they’re about the relationships between archival materials and other archival management functions — accessioning, digital object management, location management, container management. For instance, the following questions have come up in the past or could easily come up in the future:

  • Which collections are constituted of accessions that came in before 1980?
  • Which collections have digital objects associated with them? What are the URIs of those digital objects?
  • I have a barcode for a box. Can you tell me the materials that are supposed to be in that box? What collection is this from?
  • We haven’t used this location listed in the location table since 2005! Are there any boxes associated with that location? What are they?

In order to answer these questions, I need to write reports that join different tables in Archivists’ Toolkit together. And this is a little bit tricky, because in their own way, components in the AT database are hierarchical (just like in an EAD-encoded finding aid). If I have an instance (a container with a barcode), and I want to know which collection it belongs to, I don’t have a direct relationship in the database. Instead, an instance is associated with a component. That component is associated with its parent component. It may have a lot of ancestor components before the highest-level component is associated with the collection-level information in the resource table.

These relationships are made in SQL through what are called “joins”. And joining a table on itself (in some cases several times, recursively) is a huge friggin pain in the neck. So, after mucking around for a little while, the solution was to just ask someone smarter than me how he would handle this.
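
To give a sense of the pain, here's a rough sketch (mine, not anyone's production code) of what climbing just two levels of the hierarchy looks like with self-joins. I'm assuming the ResourcesComponents table has parentResourceComponentId and resourceId columns; check your own AT schema before trusting this.

-- Hypothetical sketch: find the parent and grandparent of one component.
-- Every additional level of hierarchy needs yet another self-join.
SELECT
 child.title AS ComponentTitle,
 parent.title AS ParentTitle,
 grandparent.title AS GrandparentTitle
FROM
 ResourcesComponents child
 LEFT OUTER JOIN
 ResourcesComponents parent ON child.parentResourceComponentId = parent.resourceComponentId
 LEFT OUTER JOIN
 ResourcesComponents grandparent ON parent.parentResourceComponentId = grandparent.resourceComponentId
WHERE
 child.resourceComponentId = 12345; -- a made-up component ID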

This is where my colleague Steelsen comes in — Steelsen introduced the idea of writing a stored procedure that would look for the top-most component instead of having to do this through joins. And then he wrote them for me, because he is a mensch of the first order. His procedures are here, and available to anyone who might find them useful. They have seriously revolutionized the way that I’ve been able to do reporting and solve problems.
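
Steelsen's actual procedures are linked above. Purely to illustrate the general idea (this is my own hedged sketch, not his code, and it assumes the same parentResourceComponentId and resourceId columns as above), a MySQL function that walks up the component tree might look something like this:

DELIMITER //
-- Hypothetical sketch only; not the real GETRESOURCEFROMCOMPONENT.
CREATE FUNCTION GET_RESOURCE_FROM_COMPONENT_SKETCH(componentId INT) RETURNS INT
READS SQL DATA
BEGIN
 DECLARE parentId INT;
 DECLARE resourceIdResult INT;
 climb: LOOP
  -- look up this component's parent and its (possibly null) resource link
  SELECT parentResourceComponentId, resourceId
   INTO parentId, resourceIdResult
   FROM ResourcesComponents
   WHERE resourceComponentId = componentId;
  IF parentId IS NULL THEN
   LEAVE climb; -- we have reached the top-most component
  END IF;
  SET componentId = parentId; -- move one level up and loop again
 END LOOP;
 RETURN resourceIdResult;
END //
DELIMITER ;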

For instance, something that folks have been begging for is a barcode look-up tool — they have a barcode, and they want to know which collection it belongs to, what its call number is, which location it’s assigned to, and which components are associated with that box. So here’s what I wrote (the user indicates the barcode in the WHERE clause):

use schema;
SELECT 
 CONCAT(r.resourceIdentifier1,
 ' ',
 LPAD(r.resourceIdentifier2, 4, '00')) 'Collection',
 r.title 'Collection Title',
 series.subdivisionIdentifier 'Series/Accession Number',
 series.title 'Series Title',
 rc.title 'Component Title',
 rc.dateExpression 'Component Date',
 adi.container1Type 'Container Type',
 adi.container1NumericIndicator BoxNum,
 adi.container1AlphaNumIndicator BoxAlpha,
 adi.container2NumericIndicator FolderNum,
 adi.container2AlphaNumIndicator FolderAlpha,
 adi.archDescriptionInstancesId InstanceID,
 adi.barcode Barcode,
 adi.userDefinedString1 'Voyager Info',
 loc.LocationId,
 loc.building,
 loc.room,
 loc.coordinate1NumericIndicator ShelfNum,
 loc.coordinate1AlphaNumIndicator ShelfAlpha
FROM
 ArchDescriptionInstances adi
 INNER JOIN
 ResourcesComponents rc ON adi.resourceComponentId = rc.resourceComponentId
 INNER JOIN
 LocationsTable loc ON adi.locationID = loc.locationID
 INNER JOIN
 Resources r ON r.resourceId = GETRESOURCEFROMCOMPONENT(rc.resourceComponentId)
 LEFT OUTER JOIN
 ResourcesComponents series ON GETTOPCOMPONENT(rc.resourceComponentId) = series.resourceComponentID
WHERE
 adi.barcode = 39002042658774;

Here I use two of Steelsen’s procedures. In GETRESOURCEFROMCOMPONENT, I go up the tree of a component to find out what resource it belongs to and join that to the resource. I use GETTOPCOMPONENT to help figure out what series a component belongs to (this assumes that the top-most component is a series, but that’s usually a safe bet for us).

I’m a SQL n00b, and this isn’t the most efficient query I’ve ever run, but I’m really happy with the results, which can be viewed in a spreadsheet here.

By changing the WHERE clause, I can find out all kinds of associated information about a location, a collection, a box, whatever. I can find out if barcodes have been assigned to components with different box numbers; I can find out if components with the same barcode have been assigned to more than one location. This set of procedures has really been a godsend to help me know more about the problems I’m fixing. So many thanks to Steelsen. I hope others find them useful too.
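
A couple of examples of that flexibility (these are my own variations with made-up values, so adjust the IDs and identifiers to your own data). To report on everything assigned to a single location record, swap the last line of the query above for:

 loc.locationId = 1933; -- made-up location ID

Or, to pull every instance in a single collection:

 r.resourceIdentifier1 = 'MS' AND r.resourceIdentifier2 = '123'; -- made-up call number parts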

Another Quick One — Locations where Accessions have been Assigned

If you assign accessions to locations, but move them around to a final home after processing, it may be helpful to see where your accessions were assigned and when the record was last touched. This query will help you do a little clean-up:

SELECT
Accessions.title,
Accessions.accessionNumber1,
Accessions.accessionNumber2,
Accessions.accessionNumber3,
Accessions.lastUpdated,
LocationsTable.building,
LocationsTable.room,
LocationsTable.coordinate1AlphaNumIndicator Shelf,
AccessionsLocations.accessionId,
AccessionsLocations.locationId
FROM
yale_dev.AccessionsLocations
INNER JOIN
Accessions ON Accessions.accessionId = AccessionsLocations.accessionId
JOIN
LocationsTable ON AccessionsLocations.locationId = LocationsTable.locationId

Here’s an example of some output. We may check, for instance, accessions from before 2015 to make sure that the accession location is still relevant. I hope this is useful to someone else!

Title | Accn | Date last modified | loc | accnID | locID
Yale Guidance Nursery yearly reports | 2010 A 085 | 2010-06-22 12:18:55 | SML XXX X | 7077 | 1933
Margenau, Henry, papers | 2010 M 053 | 2010-11-16 15:51:37 | SML XXX X | 7078 | 1940
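
If you want to zero in on those older assignments, one small tweak (mine, not part of the original query) is to tack a filter and a sort onto the end of the query above:

WHERE
 Accessions.lastUpdated < '2015-01-01'
ORDER BY Accessions.lastUpdated;

Keep in mind that lastUpdated tells you when the record was last touched, not when the accession arrived, so treat it as a rough screen rather than an accession date.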

Making DACS Dates

Manipulating date strings (which is the data type we usually have in archival description), particularly when you have a lot of legacy data, is a pain. I was working with a friend to update some legacy data in her finding aid, and it occurred to me that there isn’t a lot of direct guidance out there about how to manipulate dates with various tools. So, here’s a run-down of some of my methods — please feel free to add your own in the comments.

Why does this matter?

I’ll be honest, in a lot of situations, date formats don’t matter at all. I’ve said it before and I’ll say it again — we put a whole lot of effort into creating structured data, considering that most of us just flatten it into HTML and put it up as a webpage. However, there is a brighter tomorrow. With structured data, you can make far better interfaces, and there are really nice examples of places that let you do stuff with date data.

In the Princeton finding aids site, you can sort by title, date, or container. This means that in a series like this, in the George F. Kennan papers, where the archivist (or possibly creator) filed by title, this isn’t the only way to look through materials.

The order of materials as they are presented

The order of materials, sorted by me (the user) by date ascending.

Letting users sort by title or date means a few things — we can stop wasting time with alpha or chron arrangement and spend more of our energies on the true value that archivists add to description — context, meaning, transparency — without worrying that there’s too much for the researcher to sort through. It also means that we don’t have to presume that a researcher’s primary discovery vector is either time or title — we can let her choose for herself. Finally, and most importantly, we can let original arrangement schemes and organic order (the true intellectual basis of arrangement) reign supreme.

The other reason why date formats are important is because our content standard tells us they are. Now, I personally think that it’s actually far more important to associate an ISO-compliant date with a descriptive component, which can then be rendered any way you want, but since until recently our tools didn’t support that very well, I think that the DACS format of YYYY Month D brings us a step closer to easier date clean-up and extracting ISO compliant dates from date expressions.

Excel

Excel, odi et amo. Excel offers a GUI for programming-ish functions, but I find as I do more and more advanced stuff that I get frustrated by not knowing what exactly is happening with the underlying data. Dates are particularly frustrating, since Excel stores dates as a serial number starting with January 1, 1900. As an archivist who has PLENTY OF DATES from before then, this can lead to rage. There are a few ways to deal with this — if your dates are all 20th or 21st century, congratulations! You don’t have a problem. There are ways to get Excel to change the ways it assigns serial numbers, to allow for negative numbers, which lets you do the normal sorting and date re-formatting. Or, you can store everything as text and move each part of the date string to its own column to manipulate it.

So, an example of a clean-up project:

Excel Dates -- untouched

In this data, we have a bit to clean up. When I start a clean-up project, I usually start with a pencil-and-paper list of all of the steps that I need to go through before I change anything. This way, I see if I need to do research about how to do a step, and I can also see if there are dependencies in the data that may require me to sequence these steps in a particular way. When you’re first learning, it’s easy to jump right in without planning, but trust me — every time I’ve been burned by automation it’s because I didn’t plan. In a live data environment, you should always know what the computer’s going to do before you run a command, even if that command is just a formula in Excel. The flip side of this is, of course, that as long as you have good back-ups, you should feel free to experiment and try new things. Just make sure you make the effort to figure out what actually happened when your experimenting suddenly produces the results you want to see.

So, here’s my list of steps to perform on this data.

  • Check my encoding, which in this case just means which data is in which columns. Do you see the row where some of the date data is in the title column? It’s in row 4. I would probably survey the data and see how prevalent this kind of problem is. If it’s just a handful of errors, I’ll move the data over by hand. If there’s more, I’ll figure out a script/formula to automate this.
  • Check for unwanted characters. In this case, get rid of brackets. In case you haven’t heard, brackets are not a meaningful way of indicating uncertainty to researchers. There is a certainty attribute on <unitdate> for that, which can then be rendered in your institution’s EAD -> HTML stylesheet. However, my problem with brackets is more fundamental — in archival description, the date element is just a transcription of what we see on the record. We don’t actually know that this date represents anything. So in reality, these are all guesses to varying degrees of certainty, with the aim of giving the researcher some clue to time.
  • Fix the date format. DACS dates are YYYY Month D. (e.g., 2015 March 6)
  • Create an ISO date to serialize as an @normal attribute with <unitdate>

Let’s skip the obvious clean-up tasks and go straight to formatting dates. If everything is after 1900 (and if everything is a three-part date), this is really straightforward.

First, create a new column. Use the DATEVALUE formula to tell Excel to regard your date string as a date value — if your date string is in B2, your formula in C2 should be:

=DATEVALUE(B2)

Double-click on the bottom right corner of the cell to have that formula apply to the whole column.

Now that Excel knows that this is a date, you simply need to give it the format you want to see, in this case, yyyy mmmm d.

Choosing a custom date format.

This works great for three-part dates after 1900. If that’s not your situation, there are a few things you can do. One of my favorite methods is to filter the date list to each of the different date types and apply the custom date format to each of these (trying to apply a custom date format to a date that doesn’t fit the type will produce really confusing and bad results). Another option is to split the date into three different columns, treat each as text, and then bring them back together in the order you want with the CONCATENATE formula. Play around — Excel doesn’t make it easy, but there are lots of options.

OpenRefine

If you do a lot of data manipulation, I would definitely encourage you to stop torturing yourself and learn OpenRefine. I use it every day. OpenRefine uses something called GREL (Google Refine Expression Language — I wonder if they’ll be changing that to OREL now that this isn’t under the Google umbrella?), which is trickier to learn than Excel formulas but a lot more powerful and more in alignment with other programming languages. In fact, I should say that you only need to learn GREL for the fancy stuff — a lot of OpenRefine’s magic can be done through the GUI.

So, looking through this data set, I would do a lot of the same steps. One option is to just use the commands Edit Cells -> Common Transformations -> To Date, but unfortunately, most of these strings aren’t written in a way that OpenRefine recognizes as dates.

The best path forward is probably to split this date string apart and put it back together. You could split by whitespace and turn them into three columns, but since some dates are just a year, or a year and a month, you wouldn’t necessarily have each of the three parts of the date in the columns where you want them.

So, I’m going to tell OpenRefine what a year looks like and ask it to put the year in its own column.

This formula pulls the year from the date string and puts it into its own column.

In this formula, I’m partitioning the string by a four-digit number and then taking that part of the partition for my new column. In the case of the year, the formula is:

value.partition(/\d{4}/)[1]

For a month it’s:

value.partition(/[A-Za-z]+/)[1]

And for the day it’s:

value.partition(/\d{1,2}/)[1]

There may be a more elegant way of partitioning this all as one step, but I don’t yet know how!

Then, once you have each of these parts of the date in their own columns, they should look like this:

Each part of the date element is in its own column.

The final step is to put the pieces back together in the order you want them. You can do this by clicking on the Year column, and selecting “Create column based on this column.” Then, use GREL to put everything in the order that you want to see it.

The plus signs signify that everything should be smushed together — pay attention to the syntax of calling the value of columns.

The formula for this is:

value + " " + cells["Month"].value + " " + cells["Day"].value

And voila, you’ve turned your non-DACS date into a DACS-formatted date. You can use similar steps to make a column that creates an ISO-formatted date, too, although you’ll first have to convert months into two-digit numbers.

Finally, SQL

The two methods above require ETL — extract, transform, load. That is, you’re going to get data out of the database (or transform it into a tabbed sheet from xml) and then get it back into the database or the EAD (and then the database). There is a better way if you’re using Archivists’ Toolkit or ArchivesSpace, and it involves doing SQL updates. I’m going to punt on this for now, because I know that this will be a huge part of my future once we get into ArchivesSpace (I’ll also be creating normalized dates, which is data that Archivists’ Toolkit can’t store properly but ASpace can). So, stay tuned!

Sniffing out problems — bad barcode information

Like many large repositories, we barcode all of our containers and keep them offsite. This is a pretty straightforward and on the whole error-resistant process, where an archivist uses a plug-in to Archivists’ Toolkit to select a container and then wands in a barcode to associate it with that container. That barcode then gets stored in an item record in our ILS, in the off-site storage system’s database, and on a barcode sheet. It’s also serialized as part of the EAD. When a patron wants a box, a series of systems and people  (including Aeon, our ILS, and our off-site storage workers) work behind the scenes to bring it here within two business days. It’s pretty magical, and it’s how we can provide a high quality of service to our patrons.

However, there are a few less fool-proof methods by which a barcode could be introduced to a database. A barcode could be entered as part of rapid data entry. Since container information is entered as part of each component, it’s possible to accidentally mistype the container number but wand the same barcode. It’s also possible for barcodes to be entered as part of EAD import (although they don’t make it easy) or as a SQL update.

We also have an extra legacy problem — a bad update (I don’t know if it was SQL or XSL) assigned barcodes based on container number, but didn’t check for container type. So, reel 1 was assigned the same barcode as box 1.

The diagnosis is pretty straightforward (although I attempted it a zillion much-harder ways first). As always, many many thanks to my colleague Mark, the brains behind this operation, who actually wrote this SQL query after seeing me flail about.

This query basically asks — okay. Give me a list of all of your distinct barcodes and container numbers. Do you have any barcodes that are associated with more than one container number? If so, that’s a huge problem! Report back that barcode!

Barcode 987654321 should only be associated with box 1 — if it’s associated with both box 1 AND box 2, that’s a real problem. It means either that somehow the barcode was duplicated (which we go to GREAT PAINS to prevent happening) or that there was a data entry error.

You can run this query against your own AT database to see if there were any barcodes assigned to more than one box.

SELECT
 barcode,
 GROUP_CONCAT(DISTINCT COALESCE(container1NumericIndicator, container1AlphaNumIndicator))
FROM
 ArchDescriptionInstances
WHERE barcode <> ''
GROUP BY barcode
HAVING COUNT(barcode) > 1
 AND COUNT(DISTINCT COALESCE(container1NumericIndicator, container1AlphaNumIndicator)) > 1;

Maybe you have a healthier relationship with the small stuff than I do, but this is the kind of thing that maintains a constant low-level stress buzz in the back of my head. I find it reassuring that we can run these kinds of reports/checks to know for sure that collection control is working the way it should.

Oh, and by the way, the diagnosis query for my reel/box problem is here:

SELECT barcode FROM
(
 SELECT DISTINCT barcode, container1Type
 FROM ArchDescriptionInstances
) AS tt
GROUP BY tt.barcode
HAVING COUNT(tt.barcode) > 1;
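
If you also want to see which container types are sharing a barcode (so you can tell the reel 1/box 1 pairs apart from other problems), a variation like this should work — it's my own extension of the query above rather than the original:

SELECT
 barcode,
 GROUP_CONCAT(DISTINCT container1Type) ContainerTypes
FROM
 ArchDescriptionInstances
WHERE barcode <> ''
GROUP BY barcode
HAVING COUNT(DISTINCT container1Type) > 1;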

I hope that someone out there finds this useful!

Sorting, Adding, and Integrating Container Lists

We recently received a large (220 record carton) addition to one of our collections of publishers’ records. The files were in good order in well-labeled folders, so it wasn’t a matter of having to do any complicated arrangement and description work with the records, and we definitely weren’t refoldering; we were just compiling a box list and integrating that box list into the existing finding aid.

One of our students went through the containers and created an inventory, but then we hit a pretty typical challenge with these sorts of accessions, and with additions to existing collections more generally. I had a box list, but the boxes themselves weren’t received in any particular order. The records had been stored alphabetically, so the list I received (in a semi-structured Microsoft Word document) reflected that at the box level, but not as a whole. I wanted to take that box list and alphabetize the entire list by file name (which, since these are editorial files for a publisher, was also the last name of the author). Complicating this was the fact that there were nested files within the document, and I wanted to keep those grouped with their parent files and alphabetize only by the parent component. I also wanted to be able to integrate this list with the existing EAD container list, ideally by inserting the new files into the container list in proper alphabetical sequence.

Because part of my data (the existing EAD container list) was only available to me in xml, I first converted the Microsoft Word container list into EAD as well so that all of the data I was working with was at least in the same format. I then pasted all of the code together into one document.

Then I created a new OpenRefine project by opening my xml document in OpenRefine. When importing xml, OpenRefine asks the user to click on the first element to load, so I skipped my dsc tag (since I was working at the component level), selected my first <c> tag, and clicked “create project” to get something that looks like this:

[Screenshot: the new OpenRefine project built from the combined container list]

But then I need to scroll right, because I am dealing with hierarchical description with nested sub-components, and I want to alphabetize by that top-level hierarchy, which is represented here as the c-did-unittitle column.

[Screenshot: the project scrolled right to the c-did-unittitle column]

To sort the list alphabetically at this point, click on the arrow at the top of that column and sort by text, a-z.

Voila, the file list is now sorted alphabetically, with the nested subcomponents grouped together as part of the record for the parent component.

Up at the very top, go to “sort” and preserve the reordering permanently so that when you export, the list is still alphabetized (you can also do a custom export that will allow you to export only certain columns and to reorder as necessary). Awesome, so now I have a spreadsheet that has taken all of my data, integrated it, and alphabetized it.

Now, however, I need to get it back into xml so that I can insert it into my container list. We’ve covered this ground before: Mo has a stellar tutorial for this spreadsheet-to-EAD thing, and another intrepid Chaos —> Order commenter (this blog is one of the few places where you actually should read the comments) brought this fantastic site to my attention. You can also export what you have into an Excel sheet, import it directly into Oxygen, and massage it from there.

So I converted my spreadsheet back into EAD for upload to our finding aids site. This solution didn’t end up being perfect: in all of the format conversions I lost some nesting of components that I had to go back and deal with manually, and there was also some sloppy data entry that I didn’t catch before I started working through the data. But these are things I can solve either through better management or a bit more experimenting with conversion processes. Even though this wasn’t perfect, it was a fast solution for dealing with a very common and often troublesome scenario.

Round-tripping EAD, updated — how to not lose those barcodes

Several weeks ago, Bonnie published a hugely-useful checklist of things to think about when trying to round-trip data from Archivists’ Toolkit to EAD and back to Archivists’ Toolkit to help minimize data loss.

Don’t lose those barcodes!

In a comment, my colleague Mark gave a very sensible suggestion for how to not lose barcodes during that trip. When you export an EAD from AT, the container information looks something like this:

<c id="ref21" level="file">
    <did>
        <unittitle>Notecards</unittitle>
        <container id="cid1063028" type="Box" label="Mixed Materials (39002042652603)">7</container>
        <unitdate normal="1930/1940" type="inclusive">circa 1935-1950</unitdate>
    </did>
</c>

What I see here is that two separate fields in the AT database have been jammed together for EAD export. “Mixed materials” lives in the ArchDescriptionInstances table in AT as instanceType. 39002042652603 lives in the ArchDescriptionInstances table as barcode. During export, these are brought together as container @label.

So, the trick is to get them back into the fields in tables where they belong. By default, AT imports container @label to ArchDescriptionInstances.instanceType. You want to get the barcode information into barcode. Here’s how.

I sometimes have bad luck getting AT to re-import if container @label="Mixed Materials (39002042652603)" — I do a big find/replace to make that look like @label="39002042652603"

Then, I import into AT.

Getting the barcodes where they belong is pretty easy, but it has to happen in SQL in the back-end database. If you haven’t written SQL before, you might want to run this by a database administrator, just to have another set of eyes on it. You DEFINITELY want to do this in the test or dev version of your database first.

Step 1: copy over barcodes from instanceType to barcode (this assumes 14-digit barcodes)

UPDATE ArchDescriptionInstances 
SET barcode = instanceType
WHERE instanceType REGEXP '^[0-9]{14}$';

Step 2: replace barcodes in instanceType with “Mixed materials” (this assumes that you want everything to be “Mixed Materials”. I have a rant in the works about how inappropriate it is to have this label on containers anyway, but I’ll save that for another day).

UPDATE ArchDescriptionInstances 
SET instanceType = 'Mixed materials'
WHERE instanceType REGEXP '^[0-9]{14}$';
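
Before you move on, it's worth a quick sanity check that the two updates did what you expected. This is just my own back-of-the-envelope query, not part of the original workflow:

-- leftoverBarcodeTypes should be 0 after step 2;
-- populatedBarcodes should match the number of barcoded instances
SELECT
 SUM(instanceType REGEXP '^[0-9]{14}$') leftoverBarcodeTypes,
 SUM(barcode REGEXP '^[0-9]{14}$') populatedBarcodes
FROM ArchDescriptionInstances;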

Congratulations! You haven’t lost your barcodes! You can now delete the previous resource record and enjoy your new-and-improved one.

But that’s not all…

Extra super-secret tips for folks keeping Voyager item information in Archivists’ Toolkit.

So, where I work, this wouldn’t be enough. We have a whole other set of data in the ArchDescriptionInstances table that isn’t serialized in EAD. Container type, location code, Voyager bibid, container restriction and a Boolean of whether this information has been exported to Voyager are kept in the ArchDescriptionInstances table in user defined fields. We then have a background program that sends this information to Voyager, where holdings/item records are created.

This means that if I round-trip EAD, this item information gets disassociated from the descriptive information. This doesn’t work.

Note: If you’re also using the plug-in that Yale developed to keep Voyager holdings information in AT, could you please drop me a line? We’re starting to think about migration paths to ArchivesSpace.

But I had a brainwave — if I round-trip EAD with barcodes, I now have a key. The information that is now blank on the re-imported instances (what’s the Voyager bib/holdings ID? was it exported?) is still present in a different record with the same barcode, from when the information was entered before the EAD was round-tripped.

This means that no one has to re-enter the same information or re-export to Voyager. Check out my SQL update below — same warnings and caveats apply.

 

start transaction;

UPDATE schema.ArchDescriptionInstances AS ArchDesc

JOIN (
SELECT
barcode,
userDefinedString1,
userDefinedString2,
locationId,
userDefinedBoolean1,
userDefinedBoolean2
FROM schema.ArchDescriptionInstances
WHERE userDefinedString1 = "{{{bibid}}}"
) AS newdata ON newdata.barcode=ArchDesc.barcode
SET
ArchDesc.userDefinedString1 = newdata.UserDefinedString1,
ArchDesc.userDefinedString2 = newdata.UserDefinedString2,
ArchDesc.locationId = newdata.locationId,
ArchDesc.userDefinedBoolean1 = newdata.userDefinedBoolean1,
ArchDesc.userDefinedBoolean2 = newdata.userDefinedBoolean2

WHERE
ArchDesc.userDefinedString1 = ""
AND ArchDesc.userDefinedString2 = "";

commit;
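
To confirm that the join actually re-populated things, a quick count of barcoded instances that are still blank (again, my own hedged check, not part of the original update) should come back at or near zero:

SELECT COUNT(*) StillBlank
FROM schema.ArchDescriptionInstances
WHERE barcode <> ''
 AND userDefinedString1 = ''
 AND userDefinedString2 = '';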

SAA 2014 Sessions of Interest

Here are a few sessions (not comprehensive!) related to the content of this blog at SAA this week:

Wednesday, August 13
3:30pm – 5:00pm

Carrie: Friday, August 15 • 2:45pm – 3:45pm; SESSION 503 – How Are We Doing? Improving Access Through Assessment

Maureen: Friday, August 15 • 2:45pm – 3:45pm; SESSION 501 – Taken for Granted: How Term Positions Affect New Professionals and the Repositories That Employ Them

Meghan: Thursday, August 14 • 3:00pm – 3:30pm and Friday, August 15 • 4:00pm – 4:30pm; P05 PROFESSIONAL POSTER – Mapping Duke History with Historypin

Steve: Thursday, August 14 • 5:30pm – 7:30pm; Graduate Student Poster Presentations: ArchivesSpace and the Opportunity for Institutional Change

A very brief guide to deleting records with the ArchivesSpace API, from a total tyro

If you’ve ever used cURL before, you don’t need this.

Also, the videos and documentation that Hudson Molonglo put together are really stellar and recommended to anyone starting with this.

This guide is a true project-pad of my notes of how I did this. It might also be useful for those of us who never had formal training with scripting, but are in charge of the archival data in our repositories and appreciate power tools. Obviously, the problem with power tools is that you can cut your arm off. Use this carefully. Use in test/dev. Ask someone to check your work if you’re doing something truly crazy.

Here’s what I did

This came up for me because I had done a failed test migration (we think there’s a weird timestamp problem in the accessions table) and I wanted to delete the repository and all records in the repository in ASpace before trying again. As far as I can tell, there isn’t a great way to delete thousands of records in the user interface. So, the API seemed the way to go.

I figured this out by watching the video and reading the documentation on GitHub, and then doing a little extra googling around to learn more about curl options.

If you’re using a Mac, just fire up the terminal and get on with your life. I use a Windows PC at work, so I use Cygwin as a Unix emulator. The internet gave me good advice about how to add curl.exe.

Note: you won’t be able to do any of this unless you have admin access.

Let’s start with “Hello, World!”

$ curl 'http://test-aspace.yourenvironment.org:port/'

In this example, the url before the colon should be your ASpace instance (use test/dev!) and “port” should be your port. The response you get should basically just tell you that yes, you have communicated with this server.

Connect to the server

$ curl -F password='your password' 'http://test-aspace.yourenvironment.org:port/users/admin/login'

Here, you’re logging on as admin. The server will respond with a session token — go ahead and copy the token response and make it a variable, so you don’t have to keep track of it.

export TOKEN=cc0984b7bfa0718bd5c831b419cb8353c7545edb63b62319a69cdd29ea5775fa

Delete the records

Here, you definitely want to check the API documentation on GitHub. Basically, this tells you how to format the URI and the command to use. For instance, below, I wanted to delete an entire repository. I found out, though, that I couldn’t delete the repository if it had records that belonged to it. Since agents and subjects exist in ASpace without belonging to a repository, and since accessions and digital records hadn’t successfully migrated, I only needed to delete resource records.

$ curl -H "X-ArchivesSpace-Session: $TOKEN" -X "DELETE" 'http://test-aspace.yourenvironment.org:port/repositories/3/resources/[278-1693]'

So, I passed something to the header that gave my token ID, then I sent a command to delete some records. But which ones?

Let’s parse this URI. The first part is my ASpace test server, the port is my port.

The next thing to understand is that each repository, resource, accession, agent, whatever, has a numeric ID. URIs are formatted according to the record type and the ID. So, I go to repositories/3, because the resources I want to delete are in a particular repository, and that repository has the numeric ID of “3”. In order to find this out, you can look in the ASpace interface, or you can send a call to yoururl/repositories, which will give you a json response with id (and other) information about all of the repositories on your server.

After that, I tell curl which resource records I want to delete. There’s probably a better way, but I figured this out by sorting resources by date created, both ascending and descending, to find out what the first and last IDs are. I’d imagine, though, that if I didn’t want to look that up and I just asked for

'http://test-aspace.yourenvironment.org:port/repositories/3/resources/[1-2000]'

I would probably be okay, because it’s only deleting resource records in repository 3 and I want to get rid of all of those anyway. I’d get an error for resources that don’t exist in that repository, but it wouldn’t break anything. I had wondered if there are wildcards for curl, so that I could get ANY number after resources, but (according to some brief googling) it doesn’t look like there are.

What does this all mean?

Uh, I don’t know? I mean, the API is obviously very powerful and amazing, and I’m glad I didn’t have to figure out a way to delete those records in the interface. But I’m really just starting to dip my toe into the potential of this. I’m sure you can look forward to more updates.

Figuring Out What Has Been Done: Folder Numbers that Go On and On and On

What was the problem?

As part of a retrospective conversion project, paper-based finding aids were turned into structured data. A lot of this work was done in Excel, and one problem was a mistake with folder numbers — instead of folder numbers starting at number one at the beginning of each box, their numbering continues as the next box starts. For instance, instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.

How did I figure this out?

Since I’m new here and not overly familiar with numbering conventions, I approached it two ways. First, I wanted a list of finding aids that have really, really high folder numbers. This is really easy — I basically point to the folder and ask to return the biggest one. Of course, fn:max() can only handle numbers that look like numbers, so I included a predicate [matches(.,'^[0-9]+$')] that only looks for folder numbers that are integers. This means that folder ranges and folders with letters in their name won’t be included, but it’s very unlikely that the biggest folder numbers in a collection would be styled this way.

xquery version "1.0";
 
declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";
 
<root>
{
 for $ead in ead:ead
 let $doc := base-uri($ead)
 return
 <document uri="{$doc}">
 {
 for $ead in $ead
 let $folder := $ead//ead:container[@type="Folder"][matches(.,'^[0-9]+$')]
 let $maxfolder := max($folder)
 return 
 $maxfolder
 }
 
 </document>
}
</root>

Looking through this, there are a LOT of collections with really high folder numbers. When I dig in, I realize that in a lot of cases, this can be just because of typos (for instance, a person means to type folder “109” but accidentally types “1090”). But I thought it would be good to know, more generally, which boxes in a collection have a “Folder 1”.

xquery version "1.0";
 
declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";
 
<root>
{
 for $ead in ead:ead
 let $doc := base-uri($ead)
 return
 <document uri="{$doc}">
 {
 for $ead in $ead
 let $box := $ead/(//ead:container[@type="Box"])
 let $folder1 := $box/following-sibling::ead:container[@type ="Folder"][. eq "1"]
 let $boxbelong := $folder1/preceding-sibling::ead:container[@type ="Box"]
 return
 $boxbelong
 }
 </document>
}
</root>

And, like a lot of places, practice has varied over time here. Sometimes folder numbering continues across boxes for each series. Sometimes it starts over for each box. Sometimes it goes through the whole collection. This could be tweaked to answer other questions. Which box numbers that aren’t Box 1 have a folder 1? How many/which boxes are in this collection, anyway?

From this, I got a good list of finding aids with really high folder numbers that will help us fix a few dumb typos and identify finding aids that have erroneous numbering. We’re still on the fence regarding what to do about this (I think I would advocate just deleting the folder numbers, since I’m actually not a huge fan of them anyway), but we have a good start to understanding the scope of the problem.

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?  We know which finding aids are out of sync in regard to numbers of components and fixed arrangement statements.
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.  We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?  We now know which notes are duplicated from their children.