Making DACS Dates

Manipulating date strings (which is the data type we usually have in archival description), particularly when you have a lot of legacy data, is a pain. I was working with a friend to update some legacy data in her finding aid, and it occurred to me that there isn’t a lot of direct guidance out there about how to manipulate dates with various tools. So, here’s a run-down of some of my methods — please feel free to add your own in the comments.

Why does this matter?

I’ll be honest: in a lot of situations, date formats don’t matter at all. I’ve said it before and I’ll say it again — we put a whole lot of effort into creating structured data, considering that most of us just flatten it into HTML and put it up as a webpage. However, there is a brighter tomorrow. With structured data, you can build far better interfaces, and there are already nice examples of sites that let users actually do something with date data.

On the Princeton finding aids site, you can sort by title, date, or container. This means that in a series like the one below, from the George F. Kennan papers, where the archivist (or possibly the creator) filed by title, the filing order isn’t the only way to look through the materials.


The order of materials as they are presented


The order of materials, sorted by me (the user) by date ascending.

Letting users sort by title or date means a few things — we can stop wasting time with alpha or chron arrangement and spend more of our energies on the true value that archivists add to description — context, meaning, transparency — without worrying that there’s too much for the researcher to sort through. It also means that we don’t have to presume that a researcher’s primary discovery vector is either time or title — we can let her choose for herself. Finally, and most importantly, we can let original arrangement schemes and organic order (the true intellectual basis of arrangement) reign supreme.

The other reason why date formats are important is because our content standard tells us they are. Now, I personally think that it’s actually far more important to associate an ISO-compliant date with a descriptive component, which can then be rendered any way you want, but since until recently our tools didn’t support that very well, I think that the DACS format of YYYY Month D brings us a step closer to easier date clean-up and extracting ISO compliant dates from date expressions.

Excel

Excel, odi et amo (I hate it and I love it). Excel offers a GUI for programming-ish functions, but as I do more and more advanced stuff, I get frustrated by not knowing what exactly is happening with the underlying data. Dates are particularly frustrating, since Excel stores dates as a serial number starting with January 1, 1900. As an archivist who has PLENTY OF DATES from before then, this can lead to rage. There are a few ways to deal with this — if your dates are all 20th or 21st century, congratulations! You don’t have a problem. There are ways to get Excel to change the way it assigns serial numbers to allow for negative numbers, which lets you do the normal sorting and date re-formatting. Or, you can store everything as text and move each part of the date string to its own column to manipulate it.
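For what it’s worth, the 1900 floor is an Excel quirk, not a property of date data generally. As a quick (and hedged) illustration, a nineteenth-century date parses and reformats in Python with no workarounds at all:

from datetime import datetime

# Python's datetime has no 1900 floor, so a nineteenth-century date
# is no harder to handle than a twentieth-century one.
d = datetime.strptime("March 5, 1867", "%B %d, %Y")
print(d.date().isoformat())  # 1867-03-05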

So, an example of a clean-up project:

Excel Dates -- untouched

In this data, we have a bit to clean up. When I start a clean-up project, I usually start with a pencil-and-paper list of all of the steps that I need to go through before I change anything. This way, I see if I need to do research about how to do a step, and I can also see if there are dependencies in the data that may require me to sequence these steps in a particular way. When you’re first learning, it’s easy to jump right in without planning, but trust me — every time I’ve been burned by automation it’s because I didn’t plan. In a live data environment, you should always know what the computer’s going to do before you run a command, even if that command is just a formula in Excel. The flip side of this is, of course, that as long as you have good back-ups, you should feel free to experiment and try new things. Just make sure you make the effort to figure out what actually happened when your experimenting suddenly produces the results you want to see.

So, here’s my list of steps to perform on this data.

  • Check my encoding, which in this case just means which data is in which columns. Do you see the row where some of the date data is in the title column? It’s in row 4. I would probably survey the data and see how prevalent this kind of problem is. If it’s just a handful of errors, I’ll move the data over by hand. If there’s more, I’ll figure out a script/formula to automate this.
  • Check for unwanted characters. In this case, get rid of brackets. In case you haven’t heard, brackets are not a meaningful way of indicating uncertainty to researchers. There is a certainty attribute on <unitdate> for that, which can then be rendered in your institution’s EAD -> HTML stylesheet. However, my problem with brackets is more fundamental — in archival description, the date element is just a transcription of what we see on the record. We don’t actually know that this date represents anything. So in reality, these are all guesses to varying degrees of certainty, with the aim of giving the researcher some clue to time.
  • Fix the date format. DACS dates are YYYY Month D. (e.g., 2015 March 6)
  • Create an ISO date to serialize as an @normal attribute on <unitdate> (for a rough sense of what steps 2-4 look like in practice, see the Python sketch after this list)
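To make the target transformation concrete before we open Excel, here is a rough Python sketch of steps 2-4. It is an assumption-laden illustration (it only handles a full three-part date written like "[March 6, 2015]"), not a recipe from any of the tools discussed below.

from datetime import datetime

def dacs_and_iso(raw):
    """Sketch of steps 2-4: strip brackets, build the DACS expression,
    and derive an ISO value for @normal. Assumes a three-part date like
    '[March 6, 2015]'; year-only dates and ranges need more branches."""
    cleaned = raw.strip("[] ")
    parsed = datetime.strptime(cleaned, "%B %d, %Y")
    dacs = f"{parsed.year} {parsed.strftime('%B')} {parsed.day}"
    iso = parsed.strftime("%Y-%m-%d")
    return dacs, iso

print(dacs_and_iso("[March 6, 2015]"))  # ('2015 March 6', '2015-03-06')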

Let’s skip the obvious clean-up tasks and go straight to formatting dates. If everything is after 1900 (and if everything is a three-part date), this is really straightforward.

First, create a new column. Use the DATEVALUE formula to tell Excel to regard your date string as a date value — if your date string is in B2, your formula in C2 should be:

=DATEVALUE(B2)

Double-click on the bottom right corner of the cell to have that formula apply to the whole column.

Now that Excel knows that this is a date, you simply need to give it the format you want to see, in this case, yyyy mmmm d.


Choosing a custom date format.

This works great for three-part dates after 1900. If that’s not your situation, there are a few things you can do. One of my favorite methods is to filter the date list to each of the different date types and apply the custom date format to each of these (trying to apply a custom date format to a date that doesn’t fit the type will result in really confusing and bad results). Another option is to split the date into three different columns, treat each like text, and then bring them back together in the order you want with the CONCATENATE formula. Play around — Excel doesn’t make it easy, but there are lots of options.

OpenRefine

If you do a lot of data manipulation, I would definitely encourage you to stop torturing yourself and learn OpenRefine. I use it every day. OpenRefine uses something called GREL (Google Refine Expression Language — I wonder if they’ll be changing that to OREL now that this isn’t under the Google umbrella?), which is trickier to learn than Excel formulas but a lot more powerful and more in alignment with other programming languages. In fact, I should say that you only need to learn GREL for the fancy stuff — a lot of OpenRefine’s magic can be done through the GUI.

So, looking through this data set, I would do a lot of the same steps. One option is to just use the commands Edit Cells -> Common Transforms -> To Date, but unfortunately, most of these strings aren’t written in a way that OpenRefine recognizes as dates.

The best path forward is probably to split this date string apart and put it back together. You could split by whitespace and turn them into three columns, but since some dates are just a year, or a year and a month, you wouldn’t necessarily have each of the three parts of the date in the columns where you want them.

So, I’m going to tell OpenRefine what a year looks like and ask it to put the year in its own column.


This formula pulls the year from the date string and puts it into its own column.

In this formula, I’m partitioning the string by a four-digit number and then taking that part of the partition for my new column. In the case of the year, the formula is:

value.partition(/\d{4}/)[1]

For a month it’s:

value.partition(/[A-Za-z]+/)[1]

And for the day it’s:

value.partition(/\d{1,2}/)[1]

There may be a more elegant way of partitioning this all as one step, but I don’t yet know how!
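I don’t know of a single GREL step either, but for comparison, here is a hedged sketch of the same three extractions done in one pass outside OpenRefine, in Python. The lookarounds on the day pattern just keep it from grabbing the first two digits of the year.

import re

def split_date(date_str):
    """One-pass sketch of the three partitions above: pull the year,
    month, and day out of a date expression, wherever they appear."""
    year = re.search(r"\d{4}", date_str)
    month = re.search(r"[A-Za-z]+", date_str)
    day = re.search(r"(?<!\d)\d{1,2}(?!\d)", date_str)  # 1-2 digits not inside the year
    return tuple(m.group() if m else None for m in (year, month, day))

print(split_date("March 17, 1946"))  # ('1946', 'March', '17')
print(split_date("1946"))            # ('1946', None, None)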

Then, once you have each of these parts of the date in their own columns, they should look like this:


Each part of the date element is in its own column.

The final step is to put the pieces back together in the order you want them. You can do this by clicking on the Year column, and selecting “Create column based on this column.” Then, use GREL to put everything in the order that you want to see it.


The plus signs signify that everything should be smushed together — pay attention to the syntax of calling the value of columns.

The formula for this is:

value + " " + cells["Month"].value + " " + cells["Day"].value

And voila, you’ve turned your non-DACS date into a DACS-formatted date. You can use similar steps to make a column that creates an ISO-formatted date, too, although you’ll first have to convert months into two-digit numbers.
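Since that month-name conversion is the fiddly part, here is a hedged Python sketch of the ISO step, working from the DACS-style expression we just built. It assumes English month names and nothing more exotic than year, year-month, or year-month-day values; ranges and "undated" would need their own handling.

from datetime import datetime

def to_iso(dacs_date):
    """Sketch: turn a DACS-style expression into an ISO 8601 string
    suitable for @normal, padding out whichever parts are present."""
    for in_fmt, out_fmt in (("%Y %B %d", "%Y-%m-%d"), ("%Y %B", "%Y-%m"), ("%Y", "%Y")):
        try:
            return datetime.strptime(dacs_date, in_fmt).strftime(out_fmt)
        except ValueError:
            continue
    return None

print(to_iso("1946 March 17"))  # 1946-03-17
print(to_iso("1946 March"))     # 1946-03
print(to_iso("1946"))           # 1946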

Finally, SQL

The two methods above require ETL — extract, transform, load. That is, you’re going to get data out of the database (or transform it from XML into a tabbed sheet) and then get it back into the database or the EAD (and then the database). There is a better way if you’re using Archivists’ Toolkit or ArchivesSpace, and it involves doing SQL updates. I’m going to punt on this for now, because I know that this will be a huge part of my future once we get into ArchivesSpace (I’ll also be creating normalized dates, which is data that Archivists’ Toolkit can’t store properly but ASpace can). So, stay tuned!


Backlog Control — Known Unknowns

As part of a continuing attempt to understand our holdings, I’ve been writing a series of reports against our EAD. Of course, what’s in a finding aid only accounts for the stuff that someone bothered to record in the first place. To tackle undescribed collections, we’ve also been doing a shelfread project to get an understanding of what’s on our shelves.

Today, we take an “accessioning as processing” approach to accruals — we describe what’s there at the appropriate level of description at the time of accessioning, and we include a lot of the background information about how it came to us, what it all means, etc., to help make sense of it. This helps us avoid building a backlog.

In the past, however, there was a mysterious (to me) understanding of the nature of processed/unprocessed materials. We have many, many series of materials in collections (usually accruals) that may even have file-level inventories but are described as “unprocessed.” They don’t include essential information about immediate source of acquisition, creators, or what about these materials makes them hang together. I’m frankly not sure what my predecessors were waiting for — they did all the work of creating lots of description without doing any real explanation!

So, my boss wanted a sense of these known knowns — parts of collections that we need to at least give a better series title, or somehow take out of the limbo of “unprocessed”. She wanted to know how many series there were, which collections these series belong to, and how many boxes of stuff we’re talking about. It would also be great to know linear footage, but this is frankly unknowable from the data we have.

So, I wrote an XQuery; you can find it here. The XQuery looks for any series or subseries that has the string “unprocessed” in its title. From there, it reports out the distinct values of the containers. The result looks something like this:

The output of the report: each “unprocessed” series with its distinct container values.

Perhaps you see the problem. Originally, I thought I just wanted a count of the distinct containers. My XPath for the variable that would give me box info (called footage here) looked like this:

$unprocessedfootage := count(distinct-values($series//ead:container[@type eq 'Box']))

The idea here was that it would take a series, get a census of the different boxes in that series, and count ’em up. But this gave me bad data. In the case of:

<container type="Box">10-17</container>

I would have “10-17” be considered one distinct value in the count, when really it represents 8 boxes. The report as I first had it was severely undercounting boxes.

If I want to get a count of the distinct containers, I have to deal with ranges like 10-17. I started by importing the report into OpenRefine and separating the multi-valued cells in the “unprocessed” column so that each number or range was in its own cell/row.

Then, I did some googling and came across this StackOverflow answer that explained how to enumerate the values in a range in Excel (this will give me 10, 11, 12, 13, 14, 15, 16 and 17 from 10-17). I exported from OpenRefine and brought the document into Excel, separated the ranges into two columns, and did a quick if/then statement to repeat single values in the second column. From there, I just ran the VBA code that was provided. I brought the document BACK into Refine and separated multi-valued cells again, and found out that we have 908 distinct boxes of “unprocessed” materials in 67 collections.
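For anyone who would rather skip the Excel/VBA round trip entirely, the range expansion itself is only a few lines of code. Here is a hedged Python sketch, assuming the container values are plain numbers or simple numeric ranges (real data will have “10a” and friends to contend with):

def expand_boxes(containers):
    """Sketch of the range-expansion step: 'containers' is a list of
    strings like '9' or '10-17' pulled from the report."""
    boxes = set()
    for token in containers:
        if "-" in token:
            start, end = (int(part) for part in token.split("-", 1))
            boxes.update(range(start, end + 1))
        else:
            boxes.add(int(token))
    return boxes

print(len(expand_boxes(["9", "10-17", "12"])))  # 9 distinct boxes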

Now, happily, we know exactly how big of a mess our described “unprocessed” materials are, and we’re in a much better position to make good sense of them.

Update — 2014 May 7

@ostephens on Twitter very helpfully pointed out that the dumb VBA step can easily be avoided by doing the work in OpenRefine.

He was completely right and was kind enough to give me the recipe.

After multi-valued cells were each in their own row/cell, I separated by “-” so that the beginning and end of each range was in its own column. Then, I created a new column based on the “First” column and did the following:

The GREL expression used to create the “enum” column.

On error, it copies the value from the original column so that my “enum” column is everything I need. Once I had the values enumerated, I split multi-value cells again and ended up with a much more beautiful process.

You can follow my steps by importing my OpenRefine project here.

Clean Up: Dates and OpenRefine

Here are some of the ways we are using OpenRefine to clean up dates in accession records based on our instructions. In some cases I have included more than one way to complete the same activity. There are probably more elegant ways to do this, so if you have other suggestions please let us know in the comments!

In this example we’re starting by running formulas on our unitdateinclusive column, splitting it out into begin and end dates, and then returning to format our date expression.

Periodically I use Edit Cells>Common Transforms>Trim leading and trailing whitespace just to keep things tidy. Be sure to run this anytime you’ll be using an expression that looks at the beginning or end of a cell.

Remove various characters, punctuation, and circa notations

Use Edit Cells>Transform with the following formulas (a regex-based Python alternative is sketched after this list):

  • value.replace('c.', ' ').replace('c', ' ').replace('ca.', ' ').replace('Ca.', ' ').replace('circa', ' ').replace('unknown', ' ').replace('n.d.', 'undated').replace(', undated', 'and undated').replace('no date', ' ')
    • You could use Facet>Text facet and enter one of the words to analyze the fields first. Then either use all or part of the formula above or make use of the cluster function.
  • value.replace(/[?]/, ' ') or value.replace("?", ' ')  [These do the same thing, just showing different syntax you can use.]
  • value.replace("[", ' ').replace("]", ' ')
  • value.replace("'s", 's')
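If you prefer regular expressions to chained replace() calls, the same clean-up can be sketched outside OpenRefine. This Python version is an approximation of the recipe above, not a drop-in replacement; the ordering of the circa variants matters, and your data will surely have cases it misses.

import re

def clean_date_expression(value):
    """Sketch of the character clean-up: brackets, question marks,
    circa/unknown variants, and stray apostrophes."""
    value = re.sub(r"[\[\]?]", " ", value)  # brackets and question marks
    value = re.sub(r"\b(circa|ca\.|c\.|unknown|no date)\s*", " ", value, flags=re.I)
    value = re.sub(r"\bn\.d\.", "undated", value)
    value = value.replace("'s", "s")
    return re.sub(r"\s+", " ", value).strip()  # tidy leftover whitespace

print(clean_date_expression("[ca. March 6, 2015?]"))  # March 6, 2015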

Convert to standardized dates

  • Edit Cells>Common Transforms>To text
  • Edit Cells>Common Transforms>To date

From here we’ll run a set of operations to give more conformity to the data. (This is where there must be better ways of doing this; one alternative is sketched in Python after the list below.)

  • Edit Cells>Common Transforms>To text
  • Change months to numbers with
    • value.replace('January ', '01/').replace('February ', '02/').replace('March ', '03/').replace('April ', '04/').replace('May ', '05/').replace('June ', '06/').replace('July ', '07/').replace('August ', '08/').replace('September ', '09/').replace('October ', '10/').replace('November ', '11/').replace('December ', '12/')
    • value.replace('Jan ', '01/').replace('Feb ', '02/').replace('Mar ', '03/').replace('Apr ', '04/').replace('May ', '05/').replace('Jun ', '06/').replace('Jul ', '07/').replace('Aug ', '08/').replace('Sep ', '09/').replace('Sept ', '09/').replace('Oct ', '10/').replace('Nov ', '11/').replace('Dec ', '12/')
  • Replace the comma and space between day and year: value.replace(', ', '/')
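For what it’s worth, the month lookup does not have to be a chain of replace() calls. Here is a hedged Python sketch that leans on the calendar module for both full and abbreviated month names (“Sept” would still need its own entry):

import calendar

# Map full and abbreviated month names to zero-padded numbers.
MONTHS = {name: f"{i:02d}" for i in range(1, 13)
          for name in (calendar.month_name[i], calendar.month_abbr[i])}

def month_to_number(expression):
    """Replace any month name followed by a space with 'MM/'."""
    for name, num in sorted(MONTHS.items(), key=lambda kv: -len(kv[0])):
        expression = expression.replace(name + " ", num + "/")
    return expression

print(month_to_number("March 6, 2015"))  # 03/6, 2015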

We now have dates in a single, standard format, as opposed to the variations we saw in the Date Formats post.

Create begin and end date columns with this data:

  • Duplicate the column: Edit Column>Add column based on this column… and by supplying no formula we can copy over all the data to our new column.
  • Split the new column using Edit Column>Split into several columns… and designate “-” as our separator.
  • Rename the first column "date_1_begin" and the second "date_1_end".
  • Run Edit Cells>Transform: toString(toDate(value),"yyyy-MM-dd") on both columns to get them into the format required by ArchivesSpace, without having to deal with the day, hours, minutes, and seconds of the date format in OpenRefine (a Python sketch of the same begin/end derivation follows this list).
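For comparison, here is a hedged Python sketch of the same begin/end derivation, assuming the values have already been normalized to the MM/D/YYYY pattern produced above. Anything messier, like the two-single-date rows mentioned next, is out of scope for the sketch.

from datetime import datetime

def begin_end(date_range):
    """Sketch: split a cleaned-up value like '03/6/1920-08/12/1925' into
    ISO begin and end dates for date_1_begin / date_1_end."""
    parts = [p.strip() for p in date_range.split("-", 1)]
    iso = [datetime.strptime(p, "%m/%d/%Y").strftime("%Y-%m-%d") for p in parts]
    return (iso[0], iso[1]) if len(iso) == 2 else (iso[0], "")

print(begin_end("03/6/1920-08/12/1925"))  # ('1920-03-06', '1925-08-12')
print(begin_end("03/6/1920"))             # ('1920-03-06', '')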

We’ll still want to review the date begin column to identify the rows that had two single dates indicated by commas, ampersands, semicolons, “and,” or some other designator. We would separate the second date out into the “date 2” fields.

Upcoming

  • Standardizing dates into our preferred local natural language for date expression
  • Converting decades (ex: 1920s) programmatically

OpenRefine and Messy Legacy Access Points in an Archivists’ Toolkit Database

After having read posts on this blog and articles concerning the use of OpenRefine to handle metadata stored in Excel and Access files, I found myself asking how this could be done with an Archivists’ Toolkit (MySQL) database. Since the literature was not forthcoming, I did my own experiment, which Maureen graciously offered me the space here to describe. Before attempting this on a larger scale, you may wish to create a local version of Archivists’ Toolkit on your computer to test it with. To do this in a working archives, contact your database administrator and work with her on determining how you’ll do the project.

For my experiment, I didn’t work on an active version of the database. Instead I duplicated my Archivists’ Toolkit database into something called `messydb` and temporarily linked it to my Archivists’ Toolkit software.

I chose to restrict my experiment to personal names, but with a little more time going through the database structure/export, I could have done all names and subjects. I seeded the database with 5 less-optimal versions of 3 names which already existed. I did this in Archivists’ Toolkit by opening 3 different records and linking new improper names to them. I also created a non-preferred form for one of the names, for the sake of a later part of the experiment. I backed up this problem database so that I could reload and repeat the experiment as many times as necessary.

Next, I had to write my queries.[0] I began by working from the database export I’d used to create the duplicate to determine the tables and fields which would be relevant to the project. I determined that in the table `names` the field `nameId` was used to create a primary key for the name entry (used as a foreign key in other tables) and `sortName` was the best way to view a full version of the name. There’s also an important field called `nameType` which is used to designate that the name is a “Person” (vs. “Corporate Body” etc.). So, I wrote a query which would take the nameId and the sortName from any entry where the nameType was “Person” and put them into a comma-separated file.

SELECT nameId, sortName INTO OUTFILE '/names.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
FROM names
WHERE nameType = 'Person'

[1]

This resulted in the following file, which I then opened in Notepad++ so that I could quickly add comma-separated headers before importing into OpenRefine. The headers weren’t necessary, but I found adding them at this point helpful.

Adding headers in Notepad++

Working in OpenRefine

I imported this file into OpenRefine, then used OpenRefine’s cluster and edit option, ticking off metaphone3 as the keying function.

The results I got were:

Burns, Jacob A.
Burns, Jacob, 1902-1993
Burns, Jacob

O'Connor, Sandra D.
O'Connor, Sandra Day 1930-

Barron, Jerome
Barron, Jerome A.

which is all well and good, but if you recall above I said that I’d put in three problem names for Jacob Burns. The name “Burns, J.” wasn’t caught by metaphone3, which didn’t parse the J as close enough to Jacob. Of course, J could also be Jerome or Jacqueline or James. I’ll come back to this at the end.

Now that I’ve got most of the duplicates selected, it’s not as simple as using the editing function in OpenRefine to standardize the names. Even if you’re sure that these names are true duplicates, they must be edited within Archivists’ Toolkit. There are three ways to do it. No matter the method, I need to first flag all the names, sort to select all flagged names, export this view of the flagged names into an Excel file, and sort them by name in Excel.[2] Now we have a list of names and corresponding nameIds from which to work.

Removing the Duplicates

The first method is to simply take the exported Excel file and work from it. This would involve going into the Names module of Archivists’ Toolkit and locating the finding aids attached to each improper name form. The archivist would double-check in each finding aid that this is really the same person, then replace it with the preferred name from the possible list. After it’d been removed from all linked records, the name could be deleted in the Names module.

The second method is one for which I’m still writing the specific SQL sequence (it involves 5 table joins and one temporary loop). The result will pull the following table.fields: resources.eadFaUniqueIdentifier (<eadid>), resources.findingAidTitle (<titleproper>), and names.sortName (display version of the name) into a list for any cases where the names.nameId is one of the potential duplicates. This could print into a neat list which the archivist could then use to view every finding aid where a problem name is linked without as much repetitive work as the first method would require.

The third method involves a mix of either the first or second and a SQL batch update. Using either the first or second method, the archivist would confirm that the names are true duplicates. Using the second method, for example, might allow her to easily view the finding aids online using the eadFaUniqueIdentifier/<eadid> and scroll through them to double check the name. Then she could follow these three steps to do SQL batch updates using the appropriate nameIds.

Removing Duplicates with SQL

As I begin this section, I feel compelled to remind everyone to check with your database administrator before trying this method. This may be outside the bounds of what she’s willing to do, and there are probably good reasons why. If she’s not able to help you with this, use methods one or two. You will also need her assistance to use the second method, but as it’s just running a query to generate a list of names and not altering anything in the database, she’s more likely to work with you on it.

Updating the Names

Archivists’ Toolkit’s database uses the linking table `archdescriptionnames` to handle links between the name records and the archival records. There are other ways to update this linking table, but the simplest query is the following three lines, where the number in the SET row is the nameId of the good version of the name and the number in the WHERE row is the nameId of the deprecated name. With this example, you’d have to run one query for each name, but a good macro or copy/paste setup could help you generate it pretty quickly.

UPDATE archdescriptionnames
SET primaryNameID=6
WHERE primaryNameID=10001;

[3]
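If the confirmed list of duplicates gets long, the batch of queries can itself be generated by a short script rather than a macro. Here is a hedged Python sketch, assuming a hypothetical name_merges.csv with good_id and bad_id columns built from your exported, double-checked duplicate list:

import csv

# Hypothetical input: one row per deprecated name, with the nameId to
# keep (good_id) and the nameId to replace (bad_id).
with open("name_merges.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(
            "UPDATE archdescriptionnames "
            f"SET primaryNameID={int(row['good_id'])} "
            f"WHERE primaryNameID={int(row['bad_id'])};"
        )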

Handling Any Non-Preferred Names

At this point, the main mission has been accomplished. All the deprecated names have been removed from the finding aids and have been replaced with the optimized version. However, if any non-preferred forms were created for those now-deprecated names, you’ll be unable to simply delete the unwanted names from your database without handling the non-preferred forms first. This part mirrors the step above: the query below will update each non-preferred name record that’s connected to the wrong name and connect it to the right one.

UPDATE nonpreferrednames
SET primaryNameID=6
WHERE primaryNameID=10001;

If you’d rather just delete the non-preferred names for any deprecated name, mimic the query below, but change `names` to `nonpreferrednames`.

Deleting Deprecated Names

Now that the deprecated names have been removed from records and disconnected from their non-preferred versions, they can be deleted. This is a very important step, since you don’t want someone using AT’s features later on to add the wrong name to their record.

DELETE FROM names
WHERE nameId=10001
OR nameId=10002
OR nameId=20001
OR nameId=20002;

Voila, you’re done!

Final Thoughts

Like all the other work done using metaphone3, this is only as good at catching duplications as the phonetic algorithm allows. In my case, it caught 5 out of 6 duplications and the duplicate it missed was rather different.

Notes

0. To run these queries on a local installation, navigate to your phpMyAdmin in your browser (probably http://localhost/phpmyadmin/), then click on the database, click on the SQL tab at the top when viewing the database, and run your queries in the SQL box.

1. Line-by-line, this 1) pulls each nameId and sortName into a file named names.csv, which can be found at the drive root (C: in this case), 2) with commas between each field, 3) and encloses the contents of each field in " " (which keeps CSV software from thinking sort names like "Burns, Jacob" are two fields instead of one). It 4) pulls these fields from the table `names` 5) whenever the `nameType` field is "Person." The order makes writing it out as an ordered description a little tricky, but it is proper SQL order.

2. I could do the final step in OpenRefine, but I found Excel wasn’t parsing the alphabetical sort.

3. Line-by-line, this tells the database to 1) update the table `archdescriptionnames` by 2) inserting the `primaryNameID` number given 3) into every row that currently has the `primaryNameID` number from the last line. So if that number occurs once, it’ll replace it once. If it occurs 150 times, it’ll replace it 150 times.