Clean Metadata for Non-Metadata Geeks

Over the past two years, Maureen, Carrie, Meghan, Cassie and their guests have turned this blog into a powerhouse of know-how around working smarter with archival metadata. Some of us really enjoy this type of work; we find it crazy satisfying and it aligns well with our worldviews. We acknowledge, with some pride, that we are metadata geeks. But not all archivists are like this, AND THAT’S TOTALLY OKAY. We all have different strengths, and not all archivists need to be data wranglers. But we can all produce clean metadata.

Just one of the awesome metadata jokes promulgated by AVPreserve‘s button campaign

Today, though, I’m going to take a BIG step backward and talk for a few minutes about what we actually mean when we talk about “clean” data, and I’ll share a few basic things that any archivist can do to help prevent their painstakingly produced metadata from becoming someone else’s “clean up” project later.

As Maureen explained in the very first Chaos —> Order post, the raison d’etre of all of this is to use computers to do what they do best, freeing the humans to do what they do best. Computers are really good at quickly accomplishing tasks like indexing, searching, replacing, linking and slicing up information for which you can define a rule or pattern, things that would take a human tens or hundreds of hours to do by hand and wouldn’t require any of the higher-level processes that are still unique to humans, let alone the specialized training or knowledge of an archivist. Clean data is, quite simply, data that is easy for a computer to digest in order to accomplish these tasks.

Archival Description for Web Archives

If you follow me on Twitter, you may have seen that the task I set out for myself this week was to devise a way to describe web archives using the tools available to me: Archivists’ Toolkit, Archive-It, DACS and EAD. My goals were both practical and philosophical: to create useful description, but also to bring archival principles to bear on the practice of web archiving in a way that is sometimes absent in discussions on the topic. And you may have seen that I was less than entirely successful.

Appropriate to the scope of my goals, the problems I encountered were also both practical and philosophical in nature:

  • I was simply dissatisfied with the options that my tools offered for recording information about web archives. There were a lot of “yeah, it kind of makes sense to put it in that field, but it could also go over here, and neither is a perfect fit” moments that I’m sure anyone doing this work has encountered. A Web Archiving Roundtable/TS-DACS white paper recommending best practices in this area would be fantastic, and may become reality.
  • More fundamentally, though, I came to understand that the units of arrangement, description and access typically used in web archives simply don’t map well onto traditional archival units of arrangement and description, particularly if one is concerned with preserving information about the creation of the archive itself, i.e., provenance.

Records Management for Discards

Maybe this is a familiar problem for some other archivists. You have a collection that you’ve just finished processing (maybe it’s a new acquisition, or maybe it’s been sitting around for a while) and you have some boxes of weeded papers left over, waiting to be discarded. But for some reason, usually one falling outside of your job purview, you are not able to discard them. Maybe the gift agreement insists that all discards be returned to the donor, and you can’t track down the donor without inviting another accession, and you just don’t have time or space for that right now. Maybe your library is about to renovate and move, and your curators are preoccupied with trying to install 10 exhibitions simultaneously. Maybe the acquisition was a high-value gift, for which the donor took a generous tax deduction, and your library is legally obligated to keep all parts of the gift for at least three years. Maybe your donor has vanished, the gift agreement is non-existent, or the discards are actually supposed to go to another institution and that institution isn’t ready to pay for them. The reasons don’t matter, really. You have boxes of archival material and you need to track them, but they aren’t a part of your archival collection any more. How do you manage these materials until the glorious day when you are actually able to discard them?

We’ve struggled with this at Duke for a long time, but it became a more pressing issue during our recent renovation and relocation. Boxes of discards couldn’t just sit in the stacks in a corner anymore; we had to send them to offsite storage, which meant they needed to be barcoded and tracked through our online catalog. We ended up attaching them to the collection record, which was not ideal. Because the rest of the collection was processed and available, we could not suppress the discard items from the public view of the catalog. (Discards Box 1 is not a pretty thing for our patrons to see.) Plus, it was too easy to attach them to the collection and then forget about the boxes, since they were out of sight in offsite storage. There was no easy way to regularly collect all the discard items for curators to review from across all our collections. It was messy and hard to use, and the items were never going to actually be discarded! This was no good.

I ended up making a Discards 2015 Collection, which is suppressed in the catalog and therefore not discoverable by patrons. All materials identified for discard in 2015 will be attached to this record. I also made an internal resource record in Archivists’ Toolkit (soon to be migrated to ArchivesSpace) that has a series for each collection with discards we are tracking for the year. It is linked to the AT accession records, if possible. In the resource record’s series descriptions, I record the details about the discards: what is being discarded, who processed it, who reviewed it, why we haven’t been able to discard it immediately, and when we expect to be able to discard the material (if known). The Discard Collection’s boxes are numbered, barcoded, and sent to offsite storage completely separated from their original collection — as it should be. No co-mingling, physically or intellectually! Plus, all our discards are tracked together, so from now on, I can remind our curators and other relevant parties at regular intervals about the boxes sitting offsite that need to be returned, shredded, sold, or whatever.

I’d love to hear other approaches to discards — this is a new strategy for us, so maybe I’ve missed something obvious that your institution has already solved. Let me know in the comments. Happy weeding, everyone!

Getting into the guts of AT

It’s the thing that we keep saying — in order to deal with our masses of stuff better, we need better ways of understanding what we have. A lot of my questions aren’t just about what’s in our finding aids — they’re about the relationships between archival materials and other archival management functions — accessioning, digital object management, location management, container management. For instance, the following questions have come up in the past or could easily come up in the future:

  • Which collections are constituted of accessions that came in before 1980?
  • Which collections have digital objects associated with them? What are the URIs of those digital objects?
  • I have a barcode for a box. Can you tell me the materials that are supposed to be in that box? What collection is this from?
  • We haven’t used this location listed in the location table since 2005! Are there any boxes associated with that location? What are they?

In order to answer these questions, I need to write reports that join different tables in Archivists’ Toolkit together. And this is a little bit tricky, because in their own way, components in the AT database are hierarchical (just like in an EAD-encoded finding aid). If I have an instance (a container with a barcode), and I want to know which collection it belongs to, I don’t have a direct relationship in the database. Instead, an instance is associated with a component. That component is associated with its parent component. It may have a lot of ancestor components before the most high-level component is associated with the collection-level information in the resource table.

These relationships are made in sql through what are called “joins”. And joining a table on itself (in some cases several times, recursively) is a huge friggin pain in the neck. So, after mucking around for a little while, the solution was to just ask someone smarter than me how he would handle this.

This is where my colleague Steelsen comes in — Steelsen introduced the idea of writing a stored procedure that would look for the top-most component instead of having to do this through joins. And then he wrote them for me, because he is a mensch of the first order. His procedures are here, and available to anyone who might find them useful. They have seriously revolutionized the way that I’ve been able to do reporting and solve problems.
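To make the idea concrete, here is a minimal Python sketch of "look for the top-most component" done as a loop instead of repeated self-joins. It is not Steelsen's procedure and does not touch the AT database: the `PARENT_OF` dictionary, its ids, and the function name are hypothetical stand-ins for a component table in which each row points at its parent.

```python
# Hypothetical, simplified stand-in for a component table:
# component id -> parent component id (None means "no parent component").
PARENT_OF = {
    10: None,  # a series
    11: 10,    # a subseries inside series 10
    12: 11,    # a file inside subseries 11
    20: None,  # a second series
}


def get_top_component(component_id, parent_of):
    """Follow parent links upward until reaching a component with no parent."""
    seen = set()
    current = component_id
    while parent_of[current] is not None:
        if current in seen:
            # Guard against bad data that would otherwise loop forever.
            raise ValueError(f"cycle detected at component {current}")
        seen.add(current)
        current = parent_of[current]
    return current
```

Calling `get_top_component(12, PARENT_OF)` climbs from the file to the subseries to the series. The loop runs as many times as the hierarchy is deep, which is exactly the part that is awkward to express with a fixed number of joins.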

For instance, something that folks have been begging for is a barcode look-up tool: they have a barcode, and they want to know which collection it belongs to, what its call number is, which location it’s assigned to, and which components are associated with that box. So here’s what I wrote (the user supplies the barcode in the WHERE clause):

use schema; -- replace "schema" with the name of your AT database
SELECT 
 CONCAT(r.resourceIdentifier1,
 ' ',
 LPAD(r.resourceIdentifier2, 4, '00')) 'Collection',
 r.title 'Collection Title',
 series.subdivisionIdentifier 'Series/Accession Number',
 series.title 'Series Title',
 rc.title 'Component Title',
 rc.dateExpression 'Component Date',
 adi.container1Type 'Container Type',
 adi.container1NumericIndicator BoxNum,
 adi.container1AlphaNumIndicator BoxAlpha,
 adi.container2NumericIndicator FolderNum,
 adi.container2AlphaNumIndicator FolderAlpha,
 adi.archDescriptionInstancesId InstanceID,
 adi.barcode Barcode,
 adi.userDefinedString1 'Voyager Info',
 loc.LocationId,
 loc.building,
 loc.room,
 loc.coordinate1NumericIndicator ShelfNum,
 loc.coordinate1AlphaNumIndicator ShelfAlpha
FROM
 ArchDescriptionInstances adi
 INNER JOIN
 ResourcesComponents rc ON adi.resourceComponentId = rc.resourceComponentId
 INNER JOIN
 LocationsTable loc ON adi.locationID = loc.locationID
 INNER JOIN
 Resources r ON r.resourceId = GETRESOURCEFROMCOMPONENT(rc.resourceComponentId)
 LEFT OUTER JOIN
 ResourcesComponents series ON GETTOPCOMPONENT(rc.resourceComponentId) = series.resourceComponentID
WHERE
 adi.barcode = 39002042658774;

Here I use two of Steelsen’s procedures. In GETRESOURCEFROMCOMPONENT, I go up the tree of a component to find out what resource it belongs to and join that to the resource. I use GETTOPCOMPONENT to help figure out what series a component belongs to (this assumes that the top-most component is a series, but that’s usually a safe bet for us).

I’m a sql n00b, and this isn’t the most efficient query I’ve ever run, but I’m really happy with the results, which can be viewed in a spreadsheet here.

By changing the WHERE clause, I can find out all kinds of associated information about a location, a collection, a box, whatever. I can find out if barcodes have been assigned to components with different box numbers; I can find out if components with the same barcode have been assigned to more than one location. This set of procedures has really been a godsend to help me know more about the problems I’m fixing. So many thanks to Steelsen. I hope others find them useful too.
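As one illustration of the last check mentioned above (the same barcode assigned to more than one location), here is a small sketch that runs against an in-memory SQLite database rather than a real AT database. The table and column names echo the query above, but the table is cut down to two columns and the barcodes are invented sample data.

```python
import sqlite3

# Toy table with invented rows; a real AT table has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ArchDescriptionInstances (barcode TEXT, locationId INTEGER)"
)
conn.executemany(
    "INSERT INTO ArchDescriptionInstances VALUES (?, ?)",
    [
        ("B001", 1),  # B001 appears twice, both times at location 1
        ("B001", 1),
        ("B002", 1),  # B002 appears at two different locations
        ("B002", 2),
    ],
)

# Group by barcode and keep only barcodes that map to more than one location.
conflicting = conn.execute(
    """
    SELECT barcode, COUNT(DISTINCT locationId) AS locations
    FROM ArchDescriptionInstances
    WHERE barcode IS NOT NULL
    GROUP BY barcode
    HAVING COUNT(DISTINCT locationId) > 1
    """
).fetchall()
```

With this sample data, `conflicting` holds only the `B002` row. The same GROUP BY / HAVING pattern should carry over to MySQL, though it has only been sketched here against SQLite.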

Another Quick One — Locations where Accessions have been Assigned

If you assign accessions to locations, but move them around to a final home after processing, it may be helpful to see where your accessions were assigned and when the record was last touched. This query will help you do a little clean-up:

SELECT
Accessions.title,
Accessions.accessionNumber1,
Accessions.accessionNumber2,
Accessions.accessionNumber3,
Accessions.lastUpdated,
LocationsTable.building,
LocationsTable.room,
LocationsTable.coordinate1AlphaNumIndicator Shelf,
AccessionsLocations.accessionId,
AccessionsLocations.locationId
FROM
yale_dev.AccessionsLocations
INNER JOIN
Accessions ON Accessions.accessionId = AccessionsLocations.accessionId
INNER JOIN
LocationsTable ON AccessionsLocations.locationId = LocationsTable.locationId;

Here’s an example of some output. We may check, for instance, accessions from before 2015 to make sure that the accession location is still relevant. I hope this is useful to someone else!

Title                                | Accn       | Date last modified  | loc       | accnID | locID
Yale Guidance Nursery yearly reports | 2010 A 085 | 2010-06-22 12:18:55 | SML XXX X | 7077   | 1933
Margenau, Henry, papers              | 2010 M 053 | 2010-11-16 15:51:37 | SML XXX X | 7078   | 1940

Quick Query: Finding Locations where Nothing is Assigned in Archivists’ Toolkit

I just wrote a quick query to give records in the locations table in Archivists’ Toolkit that don’t have instances assigned to them. This sounds like a pretty common thing that folks want to see — here it is:

SELECT
*
FROM
LocationsTable loc
WHERE
loc.locationId BETWEEN 0 AND 10000
AND loc.locationId NOT IN (SELECT
locationId
FROM
ArchDescriptionInstances containers
WHERE
containers.locationId BETWEEN 0 AND 10000)
AND loc.locationId NOT IN (SELECT
locationId
FROM
AccessionsLocations accession
WHERE
accession.locationId BETWEEN 0 AND 10000);
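One thing worth knowing about NOT IN: in standard SQL, if the subquery returns even one NULL, the NOT IN test never comes out true and the outer query returns no rows. The BETWEEN filters in the subqueries above drop NULL location ids, so that query avoids the problem. The sketch below shows an alternative phrasing with NOT EXISTS, which is not affected by NULLs. It runs against an in-memory SQLite database with invented rows; the table names follow the query above, but the columns are reduced to the minimum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE LocationsTable (locationId INTEGER PRIMARY KEY, building TEXT);
    CREATE TABLE ArchDescriptionInstances (instanceId INTEGER, locationId INTEGER);
    CREATE TABLE AccessionsLocations (accessionId INTEGER, locationId INTEGER);

    -- Invented sample data: location 1 holds a container, location 2 holds
    -- an accession, location 3 holds nothing. Instance 101 has no location.
    INSERT INTO LocationsTable VALUES (1, 'A'), (2, 'A'), (3, 'B');
    INSERT INTO ArchDescriptionInstances VALUES (100, 1), (101, NULL);
    INSERT INTO AccessionsLocations VALUES (500, 2);
    """
)

# Keep a location only if no instance and no accession points at it.
unused = conn.execute(
    """
    SELECT loc.locationId
    FROM LocationsTable loc
    WHERE NOT EXISTS (SELECT 1 FROM ArchDescriptionInstances adi
                      WHERE adi.locationId = loc.locationId)
      AND NOT EXISTS (SELECT 1 FROM AccessionsLocations al
                      WHERE al.locationId = loc.locationId)
    ORDER BY loc.locationId
    """
).fetchall()
```

Here `unused` contains only location 3, even though one instance row has a NULL location.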

Making DACS Dates

Manipulating date strings (which is the data type we usually have in archival description), particularly when you have a lot of legacy data, is a pain. I was working with a friend to update some legacy data in her finding aid, and it occurred to me that there isn’t a lot of direct guidance out there about how to manipulate dates with various tools. So, here’s a run-down of some of my methods — please feel free to add your own in the comments.

Why does this matter?

I’ll be honest, in a lot of situations, date formats don’t matter at all. I’ve said it before and I’ll say it again — we put a whole lot of effort into creating structured data, considering that most of us just flatten it into HTML and put it up as a webpage. However, there is a brighter tomorrow. With structured data, you can make far better interfaces, and there are really nice examples of places that let you do stuff with date data.

In the Princeton finding aids site, you can sort by title, date, or container. This means that in a series like this, in the George F. Kennan papers, where the archivist (or possibly creator) filed by title, this isn’t the only way to look through materials.

The order of materials as they are presented

The order of materials, sorted by me (the user) by date ascending.

Letting users sort by title or date means a few things — we can stop wasting time with alpha or chron arrangement and spend more of our energies on the true value that archivists add to description — context, meaning, transparency — without worrying that there’s too much for the researcher to sort through. It also means that we don’t have to presume that a researcher’s primary discovery vector is either time or title — we can let her choose for herself. Finally, and most importantly, we can let original arrangement schemes and organic order (the true intellectual basis of arrangement) reign supreme.

The other reason why date formats are important is because our content standard tells us they are. Now, I personally think that it’s actually far more important to associate an ISO-compliant date with a descriptive component, which can then be rendered any way you want, but since until recently our tools didn’t support that very well, I think that the DACS format of YYYY Month D brings us a step closer to easier date clean-up and extracting ISO compliant dates from date expressions.

Excel

Excel, odi et amo. Excel offers a GUI for programming-ish functions, but I find as I do more and more advanced stuff that I get frustrated by not knowing what exactly is happening with the underlying data. Dates are particularly frustrating, since Excel stores dates as a serial number starting with January 1, 1900. As an archivist who has PLENTY OF DATES from before then, this can lead to rage. There are a few ways to deal with this. If your dates are all 20th or 21st century, congratulations! You don’t have a problem. There are ways to get Excel to change the way it assigns serial numbers, to allow for negative numbers, which lets you do the normal sorting and date re-formatting. Or, you can store everything as text and move each part of the date string to its own column to manipulate it.
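For anyone curious about what the serial number actually is, here is a rough Python sketch of the arithmetic under Excel's default 1900 date system. The function names are made up for illustration. The offset of 1899-12-30 accounts for the fact that Excel counts a February 29, 1900 that never existed, so the simple conversion is only trusted here from serial 61 (1900 March 1) onward.

```python
from datetime import date, timedelta

# With this base date, serial 61 lands on 1900-03-01, matching Excel's
# default (1900) date system from that point forward.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert a whole-number Excel serial (1900 date system) to a date."""
    if serial < 61:
        raise ValueError("serials below 61 are affected by Excel's 1900 leap-year quirk")
    return EXCEL_EPOCH + timedelta(days=serial)


def date_to_excel_serial(d):
    """Return the serial for a date, or None where this simple offset does not apply."""
    serial = (d - EXCEL_EPOCH).days
    return serial if serial >= 61 else None
```

A date such as 1865 April 9 comes back as `None`, which is the practical reason for the text-column workaround: there is no positive serial for it to be stored as.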

So, an example of a clean-up project:

Excel Dates -- untouched

In this data, we have a bit to clean up. When I start a clean-up project, I usually start with a pencil-and-paper list of all of the steps that I need to go through before I change anything. This way, I see if I need to do research about how to do a step, and I can also see if there are dependencies in the data that may require me to sequence these steps in a particular way. When you’re first learning, it’s easy to jump right in without planning, but trust me: every time I’ve been burned by automation it’s because I didn’t plan. In a live data environment, you should always know what the computer’s going to do before you run a command, even if that command is just a formula in Excel. The flip side of this is, of course, that as long as you have good back-ups, you should feel free to experiment and try new things. Just make sure you make the effort to figure out what actually happened when your experimenting suddenly produces the results you want to see.

So, here’s my list of steps to perform on this data.

  • Check my encoding, which in this case just means which data is in which columns. Do you see the row where some of the date data is in the title column? It’s in row 4. I would probably survey the data and see how prevalent this kind of problem is. If it’s just a handful of errors, I’ll move the data over by hand. If there’s more, I’ll figure out a script/formula to automate this.
  • Check for unwanted characters. In this case, get rid of brackets. In case you haven’t heard, brackets are not a meaningful way of indicating uncertainty to researchers. There is a certainty attribute on <unitdate> for that, which can then be rendered in your institution’s EAD -> HTML stylesheet. However, my problem with brackets is more fundamental — in archival description, the date element is just a transcription of what we see on the record. We don’t actually know that this date represents anything. So in reality, these are all guesses to varying degrees of certainty, with the aim of giving the researcher some clue to time.
  • Fix the date format. DACS dates are YYYY Month D (e.g., 2015 March 6).
  • Create an ISO date to serialize as an @normal attribute with <unitdate>.

Let’s skip the obvious clean-up tasks and go straight to formatting dates. If everything is after 1900 (and if everything is a three-part date), this is really straightforward.

First, create a new column. Use the DATEVALUE formula to tell Excel to regard your date string as a date value — if your date string is in B2, your formula in C2 should be:

=DATEVALUE(B2)

Double-click on the bottom right corner of the cell to have that formula apply to the whole column.

Now that Excel knows that this is a date, you simply need to give it the format you want to see, in this case, yyyy mmmm d.

Choosing a custom date format.

This works great for three-part dates after 1900. If that’s not your situation, there are a few things you can do. One of my favorite methods is to filter the date list to each of the different date types and apply the custom date format to each of these (trying to apply a custom date format to a date that doesn’t fit the type will result in really confusing and bad results). Another option is to split the date into three different columns, treat each like text, and then bring them back together in the order you want with the CONCATENATE formula. Play around — Excel doesn’t make it easy, but there are lots of options.
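The "split it into parts, treat them as text, and put them back together" option can also be sketched outside Excel. The Python below assumes the incoming strings look like "Month D, YYYY", optionally wrapped in brackets; that input shape, the pattern, and the function name are assumptions for illustration, not something taken from the spreadsheet above. Because everything stays text, dates before 1900 are no different from any others.

```python
import re

# Assumed input shape: "Month D, YYYY", optionally wrapped in square brackets.
MONTH_DAY_YEAR = re.compile(r"^\[?\s*([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})\s*\]?$")


def to_dacs(date_string):
    """Reorder 'Month D, YYYY' into 'YYYY Month D'; return None for other shapes."""
    match = MONTH_DAY_YEAR.match(date_string.strip())
    if match is None:
        return None  # leave it for manual review rather than guess
    month, day, year = match.groups()
    # int(day) drops any leading zero, e.g. "06" becomes 6.
    return f"{year} {month} {int(day)}"
```

Anything that does not fit the expected shape returns `None` instead of a guess, which mirrors the advice above about filtering to one date type at a time before applying a format.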

OpenRefine

If you do a lot of data manipulation, I would definitely encourage you to stop torturing yourself and learn OpenRefine. I use it every day. OpenRefine uses something called GREL (Google Refine Expression Language — I wonder if they’ll be changing that to OREL now that this isn’t under the Google umbrella?), which is trickier to learn than Excel formulas but a lot more powerful and more in alignment with other programming languages. In fact, I should say that you only need to learn GREL for the fancy stuff — a lot of OpenRefine’s magic can be done through the GUI.

So, looking through this data set, I would do a lot of the same steps. One option is to just use the commands Edit Cells -> Common Transformations -> To Date, but unfortunately, most of these strings aren’t written in a way that OpenRefine understands them as dates.

The best path forward is probably to split this date string apart and put it back together. You could split by whitespace and turn them into three columns, but since some dates are just a year, or a year and a month, you wouldn’t necessarily have each of the three parts of the date in the columns where you want them.

So, I’m going to tell OpenRefine what a year looks like and ask it to put the year in its own column.

This formula pulls the year from the date string and puts it into its own column.

In this formula, I’m partitioning the string by a four-digit number and then taking that part of the partition for my new column. In the case of the year, the formula is:

value.partition(/\d{4}/)[1]

For a month it’s:

value.partition(/[A-Za-z]+/)[1]

And for the day it’s:

value.partition(/\d{1,2}/)[1]

There may be a more elegant way of partitioning this all as one step, but I don’t yet know how!
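One caveat on the day pattern: a bare \d{1,2} matches the first one or two digits it finds, so on a string where the year comes first, or where there is no day at all, it will most likely pick up the first two digits of the year. Here is a sketch of the same three extractions in Python (not GREL) in which the year and day patterns refuse to match inside a longer run of digits. The function and pattern names are invented for illustration, and the month pattern is still naive: it grabs the first run of letters, so a word like "circa" would be mistaken for a month.

```python
import re

# (?<!\d) and (?!\d) mean "not preceded / not followed by another digit",
# so the day pattern cannot bite off part of a four-digit year.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")
MONTH = re.compile(r"[A-Za-z]+")
DAY = re.compile(r"(?<!\d)\d{1,2}(?!\d)")


def split_date(value):
    """Return (year, month, day) as strings; missing parts come back as ''."""

    def first(pattern):
        match = pattern.search(value)
        return match.group(0) if match else ""

    return first(YEAR), first(MONTH), first(DAY)
```

The three parts come back in a fixed order regardless of how they were ordered in the original string, and partial dates simply leave the missing slots empty.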

Then, once you have each of these parts of the date in their own columns, they should look like this:

Each part of the date element is in its own column.

The final step is to put the pieces back together in the order you want them. You can do this by clicking on the Year column, and selecting “Create column based on this column.” Then, use GREL to put everything in the order that you want to see it.

The plus signs signify that everything should be smushed together — pay attention to the syntax of calling the value of columns.

The formula for this is:

value + " " + cells["Month"].value + " " + cells["Day"].value

And voila, you’ve turned your non-DACS date into a DACS-formatted date. You can use similar steps to make a column that creates an ISO-formatted date, too, although you’ll first have to convert months into two-digit numbers.
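The month-to-number step for the ISO version might look something like the following Python sketch. It assumes the year, month, and day are already in separate text values, that month names are spelled out in full in English, and that partial dates should come out as YYYY or YYYY-MM; the names used are illustrative only.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Map each month name to a zero-padded two-digit number: "March" -> "03".
MONTH_NUMBERS = {name: f"{number:02d}" for number, name in enumerate(MONTHS, start=1)}


def to_iso(year, month="", day=""):
    """Build YYYY, YYYY-MM, or YYYY-MM-DD from separate text parts."""
    parts = [year]
    if month:
        parts.append(MONTH_NUMBERS[month])  # raises KeyError on an unknown month
        if day:
            parts.append(f"{int(day):02d}")
    return "-".join(parts)
```

An abbreviated or misspelled month raises a `KeyError` rather than silently producing a bad date, which is usually what you want during clean-up.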

Finally, SQL

The two methods above require ETL — extract, transform, load. That is, you’re going to get data out of the database (or transform it into a tabbed sheet from xml) and then get it back into the database or the EAD (and then the database). There is a better way if you’re using Archivists’ Toolkit or ArchivesSpace, and it involves doing SQL updates. I’m going to punt on this for now, because I know that this will be a huge part of my future once we get into ArchivesSpace (I’ll also be creating normalized dates, which is data that Archivists’ Toolkit can’t store properly but ASpace can). So, stay tuned!