Clean Metadata for Non-Metadata Geeks

Over the past two years, Maureen, Carrie, Meghan, Cassie and their guests have turned this blog into a powerhouse of know-how around working smarter with archival metadata. Some of us really enjoy this type of work; we find it crazy satisfying and it aligns well with our worldviews. We acknowledge, with some pride, that we are metadata geeks. But not all archivists are like this, AND THAT’S TOTALLY OKAY. We all have different strengths, and not all archivists need to be data wranglers. But we can all produce clean metadata.

Just one of the awesome metadata jokes promulgated by AVPreserve's button campaign

Today, though, I’m going to take a BIG step backward and talk for a few minutes about what we actually mean when we talk about “clean” data, and I’ll share a few basic things that any archivist can do to help prevent their painstakingly produced metadata from becoming someone else’s “clean up” project later.

As Maureen explained in the very first Chaos —> Order post, the raison d'etre of all of this is to use computers to do what they do best, freeing the humans to do what they do best. Computers are really good at quickly accomplishing tasks like indexing, searching, replacing, linking and slicing up information for which you can define a rule or pattern: tasks that would take a human tens or hundreds of hours to do by hand, and that don't require any of the higher-level thinking that is still unique to humans, let alone the specialized training or knowledge of an archivist. Clean data is, quite simply, data that is easy for a computer to digest in order to accomplish these tasks.


Making DACS Dates

Manipulating date strings (which is the data type we usually have in archival description), particularly when you have a lot of legacy data, is a pain. I was working with a friend to update some legacy data in her finding aid, and it occurred to me that there isn’t a lot of direct guidance out there about how to manipulate dates with various tools. So, here’s a run-down of some of my methods — please feel free to add your own in the comments.

Why does this matter?

I’ll be honest, in a lot of situations, date formats don’t matter at all. I’ve said it before and I’ll say it again — we put a whole lot of effort into creating structured data, considering that most of us just flatten it into HTML and put it up as a webpage. However, there is a brighter tomorrow. With structured data, you can make far better interfaces, and there are really nice examples of places that let you do stuff with date data.

In the Princeton finding aids site, you can sort by title, date, or container. This means that in a series like this one from the George F. Kennan papers, where the archivist (or possibly the creator) filed by title, title order isn't the only way to look through the materials.

The order of materials as they are presented in the George F. Kennan finding aid.

The order of materials, sorted by me (the user) by date, ascending.

Letting users sort by title or date means a few things — we can stop wasting time with alpha or chron arrangement and spend more of our energies on the true value that archivists add to description — context, meaning, transparency — without worrying that there’s too much for the researcher to sort through. It also means that we don’t have to presume that a researcher’s primary discovery vector is either time or title — we can let her choose for herself. Finally, and most importantly, we can let original arrangement schemes and organic order (the true intellectual basis of arrangement) reign supreme.

The other reason why date formats are important is because our content standard tells us they are. Now, I personally think that it’s actually far more important to associate an ISO-compliant date with a descriptive component, which can then be rendered any way you want, but since until recently our tools didn’t support that very well, I think that the DACS format of YYYY Month D brings us a step closer to easier date clean-up and extracting ISO compliant dates from date expressions.

Excel

Excel, odi et amo (I hate it and I love it). Excel offers a GUI for programming-ish functions, but I find as I do more and more advanced stuff that I get frustrated by not knowing what exactly is happening with the underlying data. Dates are particularly frustrating, since Excel stores dates as a serial number starting with January 1, 1900. As an archivist who has PLENTY OF DATES from before then, this can lead to rage. There are a few ways to deal with this — if your dates are all 20th or 21st century, congratulations! You don't have a problem. There are ways to get Excel to change the way it assigns serial numbers to allow for negative numbers, which lets you do the normal sorting and date re-formatting. Or, you can store everything as text and move each part of the date string to its own column to manipulate it.
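
To see the problem in action, here are two hypothetical formulas you can try in empty cells (assuming the default 1900 date system and a US-style date locale): the first returns a serial number Excel can sort and re-format, while the second fails with #VALUE! because the date falls before Excel's epoch.

=DATEVALUE("3/6/1950")
=DATEVALUE("3/6/1850")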

So, an example of a clean-up project:

Excel Dates -- untouched

In this data, we have a bit to clean up. When I start a clean-up project, I usually start with a pencil-and-paper list of all of the steps that I need to go through before I change anything. This way, I see if I need to do research about how to do a step, and I can also see if there are dependencies in the data that may require me to sequence these steps in a particular way. When you're first learning, it's easy to jump right in without planning, but trust me — every time I've been burned by automation it's because I didn't plan. In a live data environment, you should always know what the computer's going to do before you run a command, even if that command is just a formula in Excel. The flip side of this is, of course, that as long as you have good back-ups, you should feel free to experiment and try new things. Just make sure you make the effort to figure out what actually happened when your experimenting suddenly produces the results you want to see.

So, here’s my list of steps to perform on this data.

  • Check my encoding, which in this case just means which data is in which columns. Do you see the row where some of the date data is in the title column? It’s in row 4. I would probably survey the data and see how prevalent this kind of problem is. If it’s just a handful of errors, I’ll move the data over by hand. If there’s more, I’ll figure out a script/formula to automate this.
  • Check for unwanted characters. In this case, get rid of brackets. In case you haven’t heard, brackets are not a meaningful way of indicating uncertainty to researchers. There is a certainty attribute on <unitdate> for that, which can then be rendered in your institution’s EAD -> HTML stylesheet. However, my problem with brackets is more fundamental — in archival description, the date element is just a transcription of what we see on the record. We don’t actually know that this date represents anything. So in reality, these are all guesses to varying degrees of certainty, with the aim of giving the researcher some clue to time.
  • Fix the date format. DACS dates are YYYY Month D. (e.g., 2015 March 6)
  • Create an ISO date to serialize as a @normal attribute on <unitdate> (a quick example of the target encoding follows this list)
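
To make the target encoding concrete, here's a hypothetical example of where this is all headed: the brackets are gone, the uncertainty lives in the certainty attribute, and the machine-readable date sits in @normal.

<unitdate certainty="approximate" normal="2015-03-06">2015 March 6</unitdate>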

Let’s skip the obvious clean-up tasks and go straight to formatting dates. If everything is after 1900 (and if everything is a three-part date), this is really straightforward.

First, create a new column. Use the DATEVALUE formula to tell Excel to regard your date string as a date value — if your date string is in B2, your formula in C2 should be:

=DATEVALUE(B2)

Double-click on the bottom right corner of the cell to have that formula apply to the whole column.

Now that Excel knows that this is a date, you simply need to give it the format you want to see, in this case, yyyy mmmm d.

Choosing a custom date format.
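
If you'd rather generate the re-formatted date as text in a new column (instead of just changing how the serial number is displayed), the TEXT function does roughly the same thing. This assumes your DATEVALUE result lives in C2:

=TEXT(C2, "yyyy mmmm d")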

This works great for three-part dates after 1900. If that’s not your situation, there are a few things you can do. One of my favorite methods is to filter the date list to each of the different date types and apply the custom date format to each of these (trying to apply a custom date format to a date that doesn’t fit the type will result in really confusing and bad results). Another option is to split the date into three different columns, treat each like text, and then bring them back together in the order you want with the CONCATENATE formula. Play around — Excel doesn’t make it easy, but there are lots of options.
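
As a sketch of that last option (the column letters are just assumptions about where you've put each piece), if your year, month name, and day live as text in D2, E2, and F2, either of these will stitch them back together, with TRIM cleaning up the extra spaces when a month or day is missing:

=TRIM(CONCATENATE(D2, " ", E2, " ", F2))
=TRIM(D2 & " " & E2 & " " & F2)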

OpenRefine

If you do a lot of data manipulation, I would definitely encourage you to stop torturing yourself and learn OpenRefine. I use it every day. OpenRefine uses something called GREL (Google Refine Expression Language — I wonder if they’ll be changing that to OREL now that this isn’t under the Google umbrella?), which is trickier to learn than Excel formulas but a lot more powerful and more in alignment with other programming languages. In fact, I should say that you only need to learn GREL for the fancy stuff — a lot of OpenRefine’s magic can be done through the GUI.

So, looking through this data set, I would do a lot of the same steps. One option is to just use the commands Edit Cells -> Common Transformations -> To Date, but unfortunately, most of these strings aren't written in a way that OpenRefine recognizes as dates.

The best path forward is probably to split this date string apart and put it back together. You could split by whitespace and turn them into three columns, but since some dates are just a year, or a year and a month, you wouldn’t necessarily have each of the three parts of the date in the columns where you want them.

So, I’m going to tell OpenRefine what a year looks like and ask it to put the year in its own column.

This formula pulls the year from the date string and puts it into its own column.

In this formula, I’m partitioning the string by a four-digit number and then taking that part of the partition for my new column. In the case of the year, the formula is:

value.partition(/\d{4}/)[1]

For a month it’s:

value.partition(/[A-Za-z]+/)[1]

And for the day it’s:

value.partition(/\d{1,2}/)[1]

There may be a more elegant way of partitioning this all as one step, but I don’t yet know how!

Then, once you have each of these parts of the date in their own columns, they should look like this:

Each part of the date element is in its own column.

The final step is to put the pieces back together in the order you want them. You can do this by clicking on the Year column, and selecting “Create column based on this column.” Then, use GREL to put everything in the order that you want to see it.

The plus signs signify that everything should be smushed together — pay attention to the syntax of calling the value of columns.

The formula for this is:

value + " " + cells["Month"].value + " " + cells["Day"].value

And voila, you’ve turned your non-DACS date into a DACS-formatted date. You can use similar steps to make a column that creates an ISO-formatted date, too, although you’ll first have to convert months into two-digit numbers.
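
For what it's worth, here is one possible GREL expression for that ISO column, assuming a full three-part date, the Year/Month/Day columns created above, and a version of OpenRefine whose toDate() accepts a format string (year-only and year-and-month dates would need their own branch):

toString(toDate(cells["Year"].value + " " + cells["Month"].value + " " + cells["Day"].value, "yyyy MMMM d"), "yyyy-MM-dd")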

Finally, SQL

The two methods above require ETL — extract, transform, load. That is, you’re going to get data out of the database (or transform it into a tabbed sheet from xml) and then get it back into the database or the EAD (and then the database). There is a better way if you’re using Archivists’ Toolkit or ArchivesSpace, and it involves doing SQL updates. I’m going to punt on this for now, because I know that this will be a huge part of my future once we get into ArchivesSpace (I’ll also be creating normalized dates, which is data that Archivists’ Toolkit can’t store properly but ASpace can). So, stay tuned!

OpenRefine and Messy Legacy Access Points in an Archivists’ Toolkit Database

After having read posts on this blog and articles concerning the use of OpenRefine to handle metadata stored in Excel and Access files, I found myself asking how this could be done with an Archivists’ Toolkit (MySQL) database. Since the literature was not forthcoming, I did my own experiment, which Maureen graciously offered me the space here to describe. Before attempting this on a larger scale, you may wish to create a local version of Archivists’ Toolkit on your computer to test it with. To do this in a working archives, contact your database administrator and work with her on determining how you’ll do the project.

For my experiment, I didn’t work on an active version of the database. Instead I duplicated my Archivists’ Toolkit database into something called `messydb` and temporarily linked it to my Archivists’ Toolkit software.

I chose to restrict my experiment to personal names, but with a little more time going through the database structure/export, I could have done all names and subjects. I seeded the database with 5 less-optimal versions of 3 names which already existed. I did this in Archivists’ Toolkit by opening 3 different records and linking new improper names to them. I also created a non-preferred form for one of the names, for the sake of a later part of the experiment. I backed up this problem database so that I could reload and repeat the experiment as many times as necessary.

Next, I had to write my queries.[0] I began by working from the database export I'd used to create the duplicate to determine the tables and fields which would be relevant to the project. I determined that in the table `names` the field `nameId` was used to create a primary key for the name entry (used as a foreign key in other tables) and `sortName` was the best way to view a full version of the name. There's also an important field called `nameType` which is used to designate that the name is a "Person" (vs. "Corporate Body" etc.). So, I wrote a query which would take the nameId and the sortName from any entry where the nameType was "Person" and put them into a comma-separated file.

SELECT nameId, sortName INTO OUTFILE '/names.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
FROM names
WHERE nameType = 'Person'

[1]

This resulted in the following file. I then opened it in Notepad++ so that I could quickly add comma-separated headers before importing the file into OpenRefine. These weren't necessary, but I found adding them at this point helpful.

Adding Headers in Notepad ++

Working in OpenRefine

I imported this file into OpenRefine and used OpenRefine's cluster and edit option, ticking off metaphone3 as the keying function.

The results I got were:

Burns, Jacob A.
Burns, Jacob, 1902-1993
Burns, Jacob

O'Connor, Sandra D.
O'Connor, Sandra Day 1930-

Barron, Jerome
Barron, Jerome A.

which is all well and good, but if you recall above I said that I’d put in three problem names for Jacob Burns. The name “Burns, J.” wasn’t caught by metaphone3, which didn’t parse the J as close enough to Jacob. Of course, J could also be Jerome or Jacqueline or James. I’ll come back to this at the end.

Now that I've got most of the duplicates selected, it's not as simple as using the editing function in OpenRefine to standardize the names. Even if you're sure that these names are true duplicates, they must be edited within Archivists' Toolkit. There are three ways to do it. No matter the method, I need to first flag all the names, sort to select all flagged names, export this view of the flagged names into an Excel file, and sort them by name in Excel.[2] Now we have a list of names and corresponding nameIds from which to work.

Removing the Duplicates

The first method is to simply take the exported Excel file and work from it. This would involve going into the Names module of Archivists’ Toolkit and locating the finding aids attached to each improper name form. The archivist would double-check in each finding aid that this is really the same person, then replace it with the preferred name from the possible list. After it’d been removed from all linked records, the name could be deleted in the Names module.

The second method is one for which I’m still writing the specific SQL sequence (it involves 5 table joins and one temporary loop). The result will pull the following table.fields: resources.eadFaUniqueIdentifier (<eadid>), resources.findingAidTitle (<titleproper>), and names.sortName (display version of the name) into a list for any cases where the names.nameId is one of the potential duplicates. This could print into a neat list which the archivist could then use to view every finding aid where a problem name is linked without as much repetitive work as the first method would require.
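
As a very rough sketch of the shape that query might take (the direct join from `archdescriptionnames` to `resources` via a resourceId column is my assumption about the schema, not something verified against the export, and the real version needs more joins to also catch names linked to resource components):

SELECT resources.eadFaUniqueIdentifier, resources.findingAidTitle, names.sortName
FROM names
JOIN archdescriptionnames ON archdescriptionnames.primaryNameID = names.nameId
JOIN resources ON resources.resourceId = archdescriptionnames.resourceId
WHERE names.nameId IN (10001, 10002, 20001, 20002);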

The third method involves a mix of either the first or second and a SQL batch update. Using either the first or second method, the archivist would confirm that the names are true duplicates. Using the second method, for example, might allow her to easily view the finding aids online using the eadFaUniqueIdentifier/<eadid> and scroll through them to double check the name. Then she could follow these three steps to do SQL batch updates using the appropriate nameIds.

Removing Duplicates with SQL

As I begin this section, I feel compelled to remind everyone to check with your database administrator before trying this method. This may be outside the bounds of what she’s willing to do, and there are probably good reasons why. If she’s not able to help you with this, use methods one or two. You will also need her assistance to use the second method, but as it’s just running a query to generate a list of names and not altering anything in the database, she’s more likely to work with you on it.

Updating the Names

Archivists’ Toolkit’s database uses the linking table `archdescriptionnames` to handle links between the name records and the archival records. There are other ways to update this linking table, but the simplest query is the following three lines, where the number in the SET row is the nameId of the good version of the name and the number in the WHERE row is the nameId of the deprecated name. With this example, you’d have to run one query for each name, but a good macro or copy/paste setup could help you generate it pretty quickly.

UPDATE archdescriptionnames
SET primaryNameID=6
WHERE primaryNameID=10001;

[3]
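
If you've got the nameId pairs lined up in a spreadsheet, say the deprecated nameId in column A and the preferred nameId in column B (those column positions are just an assumption), a formula like this will crank out one UPDATE statement per row that you can paste into your SQL box:

="UPDATE archdescriptionnames SET primaryNameID=" & B2 & " WHERE primaryNameID=" & A2 & ";"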

Handling Any Non-Preferred Names

At this point, the main mission has been accomplished. All the deprecated names have been removed from the finding aids and have been replaced with the optimized version. However, if any non-preferred forms were created for those now-deprecated names, you'll be unable to simply delete the unwanted names from your database without handling the non-preferred forms first. This part mirrors the update above. The query below will update each non-preferred name record that's connected to the wrong name and connect it to the right one.

UPDATE nonpreferrednames
SET primaryNameID=6
WHERE primaryNameID=10001;

If you’d rather just delete the non-preferred names for any deprecated name, mimic the query below, but change `names` to `nonpreferrednames`.

Deleting Deprecated Names

Now that the deprecated names have been removed from records and disconnected from their non-preferred versions, they can be deleted. This is a very important step, since you don't want someone using AT's features later on to add the wrong name to their record.

DELETE FROM names
WHERE nameID=10001
OR NameID=10002
OR NameID=20001
OR NameID=20002;

Voila, you’re done!

Final Thoughts

Like all the other work done using metaphone3, this is only as good at catching duplications as the phonetic algorithm allows. In my case, it caught 5 out of 6 duplications and the duplicate it missed was rather different.

Notes

0. To run these queries on a local installation, navigate to phpMyAdmin in your browser (probably http://localhost/phpmyadmin/), then click on the database, click on the SQL tab at the top when viewing the database, and run your queries in the SQL box.

1. Line-by-line, this 1) pulls each nameId and sortName into a file named names.csv, which can be found at the drive root (C: in this case), 2) with commas between each field, 3) and encloses the contents of each field in " " (which keeps CSV software from thinking sort names like "Burns, Jacob" are two fields instead of one). It 4) pulls these fields from the table `names` 5) whenever the `nameType` field is "Person." The order makes writing it out as an ordered description a little tricky, but it is proper SQL order.

2. I could do the final step in OpenRefine, but I found Excel wasn’t parsing the alphabetical sort.

3. Line-by-line, this tells the database to 1) update the table `archdescriptionnames` by 2) setting the `primaryNameID` field to the number in the SET line 3) in every row where the `primaryNameID` currently matches the number in the WHERE line. So if it occurs once, it'll replace it once. If it occurs 150 times, it'll replace it 150 times.

Fighting Zombies — De-Duplicating Our Finding Aids

It’s one of those problems that drive everyone a little nutty. I mentioned in my previous posts that I’ve been looking across our finding aids to get a better sense of how well we’re meeting standards. In order to do this, I’ve had to request a copy of all finding aids from our digital library people each month, since I don’t have access to the xml files that our website is based on.

While doing this, I noticed that we had a LOT of files in the bundle digital libraries had sent us. Like, more than 2000. When really, we only have 1200 collections. And I noticed, too, that many of the files were named in ways that don't meet our current conventions. Opening some of them up, it was clear that these were very, very old versions of finding aids that had since been updated — and that we had also requested be deleted from the production environment before.

We couldn’t figure out why the zombies kept coming back. So, digital libraries folks and I worked together and decided that I would identify the duplicates, they would flush everything from the production environment, and they would then re-index the current set, minus the list of duplicates I would send.

To do this, I had to think through which pieces of data I needed in order to determine which finding aids were versions of one another.

So, I wrote an xQuery. My xQuery asks for the following pieces of information:

  • What’s the call number for this finding aid? This is represented by the element <unitid> at the collection level. In our finding aids, we can’t consistently assume that the file name is the same as (or is a derivative of) the call number.
  • When was this finding aid updated? I’ll need to know this so that I can delete any finding aid that is out of date.
  • What file is it a part of? I’ll need this so that I can determine which files to delete.

Let’s walk through what my xQuery looks like, from the top.

xquery version "1.0";

 declare namespace ead="urn:isbn:1-931666-22-9";
 declare namespace xlink = "http://www.w3.org/1999/xlink";
 import module namespace functx="http://www.functx.com" 
     at "http://www.xqueryfunctions.com/xq/functx-1.0-doc-2007-01.xq";

This is all boilerplate that tells the computer that we’re writing an xQuery, and that it is querying data that is in the EAD and xlink namespaces. I’m also importing the functx module, which I don’t actually use in this context but is a useful library for commonly-used xQuery functions.

Next, I tell my computer which files I’m working with.

declare variable $COLLECTION as document-node()* := collection("file:///Users/staff/Documents/dlts_findingaids_eads/tamwag/?recurse=yes;select=*.xml");

Here, I’m telling my computer where my EAD files live. I’m also saying that it should only look at xml files, and that if there are folders within this location, it should look at them too. I’m declaring the location of these files as a variable, called $COLLECTION.

Let’s jump to the bottom, where I tell the xQuery what my report should look like.

return
<doc>
<file>{$doc}</file>
{$callno}
{$datemodified}
</doc>

This report says that, for each document, I want to know what the file name is, what the call number is, and when it was modified. In order to tell the report where to find that information, I define those variables earlier in the query.

for $i in $COLLECTION//ead:ead
let $callno := $i//ead:archdesc/ead:did/ead:unitid,
$doc := base-uri($i),
$datemodified := $i//ead:profiledesc/ead:creation/ead:date

The first thing I do is declare a jumping-off variable, called $i, that iterates over the root <ead> element of each document. Below that, to get the file name of each EAD within my file system, I use the function base-uri(). This fulfills one of the three pieces of information I'm looking for — I now know which file I'm talking about.

The next piece of information I want is the call number of the EAD. I declare a variable $callno (no reason to be mysterious with variable names!), and I tell the computer where in the EAD it can find this. In this case, I want the collection-level <unitid> (there may be other unitids in a finding aid, but I just want the collection-level one), which is found by going to the root of the EAD <ead>, then going to <archdesc>, then going immediately to <did>, then looking to <unitid>. I now have my call number.

Finally, I want to know the date that this file was created. This is stored in the EAD when the file is exported from Archivists' Toolkit (and can be updated using a variety of methods based on institutions' practices even if they don't use AT). I'll call this variable $datemodified, and I tell the computer that it can be found under <profiledesc>, then <creation>, then <date>.
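
For reference, here are all of those pieces assembled into one query, in the order the processor expects them (the prologue first, then the for/let clauses, then the return):

xquery version "1.0";

declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";
import module namespace functx="http://www.functx.com"
    at "http://www.xqueryfunctions.com/xq/functx-1.0-doc-2007-01.xq";

declare variable $COLLECTION as document-node()* := collection("file:///Users/staff/Documents/dlts_findingaids_eads/tamwag/?recurse=yes;select=*.xml");

for $i in $COLLECTION//ead:ead
let $callno := $i//ead:archdesc/ead:did/ead:unitid,
    $doc := base-uri($i),
    $datemodified := $i//ead:profiledesc/ead:creation/ead:date
return
<doc>
    <file>{$doc}</file>
    {$callno}
    {$datemodified}
</doc>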

Whew! Now that that’s all together, I run the xquery, and get something that looks like this:

dedup screenshot

See? For each document, I get the file name/path, I get the unitid, and I get the date.

So how do I analyze/de-dupe these? I'm sure that there are lots of ways to do this, but I want to see it all in a big table — so I import it into Excel. Did you know that you can import xml into Excel? It's under the data tab.

From here, it’s pretty straightforward. I get the data into the columns I want, and I take a look at it. Below, do you see how aia_sullivan.xml and AIA.048-ead.xml have the same call number, but one was created two years before the other? I definitely want to keep the newer one and delete the older one.

dedup excel

But boy, there are a lot here. I don't want to do this by hand! First I do a multi-level sort — by call number and then by date. Then I apply a formula that says if the first call number is the same as the second call number, and the first call number was created before the second call number, give the first call number the value of "delete" and the second call number the value of "keep". In Excel, it looks like this:

=IF(AND(D1=D2,C1<C2),"delete","keep")

Then, I simply filter that list so that I only see the “delete” values, I copy the list of file names with delete values, and I save that to a text file for digital libraries folks to delete.

spreadsheet to finding aid — the game plan

Where I work, I have a lot of legacy data that isn’t accessible to researchers. Since we manage our description in an archival management system that exports to EAD-encoded finding aids (and we publish images through the finding aid), I want to get as much of that converted to EAD as I can.

My current project may be my white whale. Several times throughout the last decades, there have been description, re-description, and digitization projects for a particular collection of negatives in our repository. The images are described at the item level and they were digitized TEN years ago, but still, researchers don’t have access to these images. We still make them sit down with gloves and a light box, and frankly, these images don’t get as much use as they should.

The negatives were described at the item level. Titles were transcribed and possibly enhanced, with liberal use of brackets and abbreviations. I also have information about the photographer, subject (generally), negative number, something called a shoot number that I still can’t suss out, gauge, two different date fields in different combinations of m/d/y (usually the same, sometimes different, which is hard to fathom for an image), and information about the Kodak Gold CD onto which it was digitized (seriously). In many cases, there are many images with the same title, same date and same photographer.

So, I started with an EAD mapping and a plan.

Mapping:

  • Image title will map to <unittitle>. I’ll need to do some major clean-up to write out abbreviations, acronyms and initialisms. It will also be good to pay attention to brackets and delete or use alternate punctuation, as appropriate.
  • Photographer will map to <origination><persname>. There seem to be variants of names and initials — I’ll use faceting and clustering in OpenRefine to bring consistency to these. Since these are amateur photographers, the names will not be authorized.
  • After doing a lot of thinking, I think I’m going to use the subject column as the series structure of the finding aid. A previous version of this finding aid has series scope and contents notes based on these subjects. If this were a collection of more notable photographers, I would organize the finding aid that way, but this seems the best of bad options for a collection that researchers will approach by subject.
  • The negative number will map to <unitid>. In the case of images with the same caption, date and photographer, I'll plan to have one component with multiple values in unitid (separated by a delimiter, since Archivists' Toolkit doesn't allow for multiple unitids, BLERGH), and an extent statement indicating how many images are in a component. I'll figure out an Excel formula to calculate the extent statement (see the sketch after this list), and I'll collapse the multiple components into one using the blank down and join multi-valued cells features in OpenRefine.
  • Gauge will map to <dimensions>.
  • Dates will map to <unitdate>. It looks like the first date is more complete than the second, but there are some cases where there's a second date and not a first. I'll use Excel formulas to break each apart by year, month and date, and create a new date column that asks whether the first date is present and, if not, adds the second date. I'll also use Excel formulas to convert to our (not DACS compliant, unfortunately) date format and also to create normalized date formatting.
  • I’ll wait until the finding aid is created to map from c @id to information about the digitized file, since Archivists’ Toolkit doesn’t import that information (ARRRGGHHHH).
  • There are also some lines for missing negatives (these should be moved to the collection-level scope and contents note, for lack of a better place) and lines for negative numbers that were never assigned to an image (which should just be deleted).
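
A sketch of the two formulas mentioned in the list above, with the caveat that the column letters are placeholders for wherever the title, photographer, and the two date fields actually live in the spreadsheet: the first counts how many rows share a title, photographer, and date (the basis of the extent statement), and the second prefers the first date column and falls back to the second when the first is blank.

=COUNTIFS($A:$A, A2, $B:$B, B2, $C:$C, C2)
=IF(D2<>"", D2, E2)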

Most of this work happens in Excel and OpenRefine. The Excel spreadsheet will then be imported into oXygen, and I’ll use an XSLT (already written, thank goodness) to convert this to EAD. I’ll then import the EAD to Archivists’ Toolkit, since the institution has decided to make this the canonical source of data. Finally, I’ll export the EAD =OUT= of AT and use split multi-valued cells / paste down in OpenRefine to get an item-level list of images WITH AT-created unitids, and use MS Access to map that to information about the digitized file in the original spreadsheet (using negative number as the primary key). This then gets sent to digital library folks, who do the <dao> linking.

Does this all sound complicated and stupid? It should, because it is. I’m fairly confident that I’m using the most efficient tools for this kind of clean-up, but I’m sure part of my process could be improved. I also think that our tools and processes at my institution could be better.

My thoughts on how to use our tools better and avoid these kinds of clean-up projects in the future:

  • Most important. Don't describe at the item level. Why does archival theory seem to go out the window when we start working with images? Describe by roll, by event, by creator — give the researcher better clues about the context of creation. Photographic records, like all records, can be described in aggregate. It's very difficult to do so, though, once context is destroyed and provenance isn't documented. Especially in this case, where images are digitized, there's no reason to create item-level metadata.
  • Let’s have everyone use the same platform and keep the metadata together right away. Wouldn’t it be great if each image file were assigned a URI and linked to the component in the <dao> field BY DIGITIZATION TECHNICIANS DURING SCANNING? Sure, maybe there would need to be some post-processing and some of this data would change, but it would be all together right away from the outset, instead of relying on the crazy internal logic of file systems or secondary spreadsheets.
  • The project isn’t finished until the project’s finished. A ten-year gap between digitization and publication in the finding aid is a big problem.
  • Archivists’ Toolkit isn’t a very good tool for archivists who want to do bulk manipulation of data. There, I said it. It would be nearly impossible to make these changes in AT — I need tools like Excel, OpenRefine, XSLT, and the xpath find/replace functions in oXygen to change this much data. Sure, I can export and then re-import, but AT doesn’t reliably round-trip EAD.
  • Maybe we shouldn't be using AT as our canonical data source. It really doesn't offer much added value from a data clean-up point of view, beyond being able to make bulk changes in a few fields (the names and subject modules are particularly useful, although our finding aids don't even link that data!). And frankly, I'm not nuts about the EAD that it spits out. WHY DO YOU STRIP C @ID during import?!?! Why can't extent be repeated? Why can't some notes (like <physfacet>, which really isn't even a note) be repeated? Why not multiple <unitid>s? And as I look at our AT-produced finding aids, I find a LOT of data mistakes that, thinking about it, are pretty predictable. A lot of crap gets thrown into <unittitle>. There's confusion about the difference between container attributes and <extent>, <physdesc>, and <physfacet> notes. I'm not endorsing the hand-coding of finding aids, but I think that there was some slippage between "oh good! We don't have to use an XML editor!" and "Oh good! No one needs to pay attention to the underlying encoding!"

I’ll be sure to report back when this project is done. Until then, keep me in your thoughts.