OpenRefine and Messy Legacy Access Points in an Archivists’ Toolkit Database

After reading posts on this blog and articles about using OpenRefine to handle metadata stored in Excel and Access files, I found myself asking how the same could be done with an Archivists’ Toolkit (MySQL) database. Since the literature was not forthcoming, I ran my own experiment, which Maureen graciously offered me the space here to describe. Before attempting this on a larger scale, you may wish to create a local installation of Archivists’ Toolkit on your computer to test against. To do this in a working archives, contact your database administrator and work with her on determining how you’ll do the project.

For my experiment, I didn’t work on an active version of the database. Instead, I duplicated my Archivists’ Toolkit database into a copy called `messydb` and temporarily pointed my Archivists’ Toolkit software at it.
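If you want to do the duplication entirely in MySQL, a minimal sketch follows. It assumes the live database is named `at` (a stand-in for your actual database name) and that both databases live on the same server; `CREATE TABLE ... LIKE` preserves indexes and keys, and you’d repeat the last two statements for every table. A mysqldump and restore works just as well.

-- Create the scratch copy of the database.
CREATE DATABASE messydb;

-- Copy one table: structure first (LIKE preserves indexes and
-- keys), then data. Repeat for each table; `at` stands in for
-- the live database's actual name.
CREATE TABLE messydb.names LIKE at.names;
INSERT INTO messydb.names SELECT * FROM at.names;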

I chose to restrict my experiment to personal names, but with a little more time going through the database structure/export, I could have done all names and subjects. I seeded the database with 5 less-optimal versions of 3 names which already existed. I did this in Archivists’ Toolkit by opening 3 different records and linking new improper names to them. I also created a non-preferred form for one of the names, for the sake of a later part of the experiment. I backed up this problem database so that I could reload and repeat the experiment as many times as necessary.

Next, I had to write my queries.[0] I began by working from the database export I’d used to create the duplicate to determine the tables and fields that would be relevant to the project. I determined that in the table `names`, the field `nameId` serves as the primary key for the name entry (and as a foreign key in other tables), and `sortName` is the best way to view a full version of the name. There’s also an important field called `nameType`, which designates the name as a “Person” (vs. “Corporate Body,” etc.). So, I wrote a query to take the nameId and the sortName from any entry where the nameType was “Person” and put them into a comma-separated file.

SELECT nameId, sortName INTO OUTFILE '/names.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
FROM names
WHERE nameType = 'Person';

This query produced the following file.[1] I opened it in Notepad++ so that I could quickly add comma-separated headers before importing it into OpenRefine. The headers weren’t necessary, but I found adding them at this point helpful.

Adding Headers in Notepad++

Working in OpenRefine

I then imported this file into OpenRefine and used its cluster and edit option, selecting metaphone3 as the keying function.

The results I got were:

Burns, Jacob A.
Burns, Jacob, 1902-1993
Burns, Jacob

O'Connor, Sandra D.
O'Connor, Sandra Day 1930-

Barron, Jerome
Barron, Jerome A.

This is all well and good, but recall that I said I’d put in three problem names for Jacob Burns. The third, “Burns, J.,” wasn’t caught by metaphone3, which didn’t parse the J as close enough to Jacob. Of course, J could also be Jerome or Jacqueline or James. I’ll come back to this at the end.
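As an aside, if you want a rough in-database feel for what phonetic keying does, MySQL’s built-in SOUNDEX function is a much cruder cousin of metaphone3. Here is a minimal sketch against the same `names` table, surfacing any names that share a phonetic key:

-- Group person names by SOUNDEX key and list any key shared by
-- more than one name; those groups are candidate duplicates.
-- SOUNDEX is far blunter than metaphone3, so expect both more
-- misses and more false matches.
SELECT SOUNDEX(sortName) AS phoneticKey,
       GROUP_CONCAT(sortName SEPARATOR ' | ') AS candidates
FROM names
WHERE nameType = 'Person'
GROUP BY phoneticKey
HAVING COUNT(*) > 1;

This won’t replace OpenRefine’s clustering, but it’s a quick way to peek at the problem before exporting anything.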

Now that I’ve got most of the duplicates selected, it’s not as simple as using OpenRefine’s editing function to standardize the names. Even if you’re sure that these names are true duplicates, they must be edited within Archivists’ Toolkit. There are three ways to do it. Whichever method you choose, first flag all the names, sort to select all flagged names, export that view of the flagged names into an Excel file, and sort it by name in Excel.[2] Now we have a list of names and corresponding nameIds from which to work.

Removing the Duplicates

The first method is simply to work from the exported Excel file. This involves going into the Names module of Archivists’ Toolkit and locating the finding aids attached to each improper name form. The archivist double-checks in each finding aid that this is really the same person, then replaces the name with the preferred form from the list. Once it has been removed from all linked records, the name can be deleted in the Names module.

The second method is one for which I’m still writing the specific SQL sequence (it involves five table joins and one temporary loop). The result will pull the following table.fields into a list for any case where names.nameId is one of the potential duplicates: resources.eadFaUniqueIdentifier (<eadid>), resources.findingAidTitle (<titleproper>), and names.sortName (the display version of the name). This prints a neat list the archivist can then use to view every finding aid where a problem name is linked, without as much repetitive work as the first method requires; a simplified sketch of the idea follows.
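Until that fuller sequence is done, here is a minimal two-join sketch. The `resourceId` join column is an assumption on my part (it covers names linked at the resource level; names linked to components or accessions are what the extra joins are for), and the nameIds in the IN list are the hypothetical duplicates from this experiment.

-- List every finding aid linked to one of the suspect names.
SELECT r.eadFaUniqueIdentifier,  -- <eadid>
       r.findingAidTitle,        -- <titleproper>
       n.sortName                -- display version of the name
FROM archdescriptionnames adn
JOIN names n ON n.nameId = adn.primaryNameID
JOIN resources r ON r.resourceId = adn.resourceId
WHERE n.nameId IN (10001, 10002, 20001, 20002)
ORDER BY n.sortName, r.findingAidTitle;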

The third method mixes either the first or second method with a SQL batch update. Using the first or second method, the archivist would confirm that the names are true duplicates. The second method, for example, might allow her to easily view the finding aids online using the eadFaUniqueIdentifier/<eadid> and scroll through them to double-check the names. Then she could follow the three steps below to do SQL batch updates using the appropriate nameIds.

Removing Duplicates with SQL

As I begin this section, I feel compelled to remind everyone to check with your database administrator before trying this method. It may be outside the bounds of what she’s willing to do, and there are probably good reasons why. If she’s not able to help you with this, use method one or two. You will also need her assistance for the second method, but since that’s just running a query to generate a list of names, not altering anything in the database, she’s more likely to work with you on it.

Updating the Names

Archivists’ Toolkit’s database uses the linking table `archdescriptionnames` to handle links between name records and archival records. There are other ways to update this linking table, but the simplest query is the following three lines,[3] where the number in the SET row is the nameId of the good version of the name and the number in the WHERE row is the nameId of the deprecated name. With this example, you’d have to run one query for each deprecated name, but a good macro or copy/paste setup could help you generate them pretty quickly.

UPDATE archdescriptionnames
SET primaryNameID=6
WHERE primaryNameID=10001;

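If you’d rather fold all the replacements into one statement, a CASE expression can do the whole batch in a single pass. This is only a sketch: the deprecated nameIds on the left and the preferred nameIds on the right are the hypothetical ones from this experiment.

-- One-pass batch version of the update above. Each deprecated
-- nameId (left) is mapped to its preferred nameId (right).
UPDATE archdescriptionnames
SET primaryNameID = CASE primaryNameID
    WHEN 10001 THEN 6
    WHEN 10002 THEN 6
    WHEN 20001 THEN 7
    WHEN 20002 THEN 8
END
WHERE primaryNameID IN (10001, 10002, 20001, 20002);

The WHERE clause is not optional here: without it, any row whose primaryNameID doesn’t match a WHEN would have its primaryNameID set to NULL.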

Handling Any Non-Preferred Names

At this point, the main mission has been accomplished: all the deprecated names have been removed from the finding aids and replaced with the optimized versions. However, if any non-preferred forms were created for those now-deprecated names, you won’t be able to simply delete the unwanted names from your database without handling the non-preferred forms first. This step mirrors the one above. The query below updates each non-preferred name record that’s connected to the wrong name and connects it to the right one.

UPDATE nonpreferrednames
SET primaryNameID=6
WHERE primaryNameID=10001;

If you’d rather just delete the non-preferred names for any deprecated name, mimic the deletion query in the next section, but run it against `nonpreferrednames`, matching on primaryNameID as in the update above; a sketch follows.
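Here is a minimal sketch of that deletion, reusing the hypothetical deprecated nameIds from this experiment. Note that it matches on primaryNameID, the column the non-preferred records use to point at their primary name, rather than on the non-preferred records’ own ids:

-- Delete every non-preferred form attached to a deprecated name.
DELETE FROM nonpreferrednames
WHERE primaryNameID=10001
OR primaryNameID=10002;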

Deleting Deprecated Names

Now that the deprecated names have been removed from records and disconnected from their non-preferred versions, they can be deleted. This is a very important step, since you don’t want someone using AT’s features later on to add the wrong name to a record.
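Before running the delete, a quick sanity check is worth the few seconds it takes. This sketch, again using the experiment’s hypothetical nameIds, confirms that nothing still points at the deprecated names; both counts should come back zero:

-- Both counts should be 0 before the deprecated names are
-- deleted; a non-zero count means a link was missed above.
SELECT COUNT(*) FROM archdescriptionnames
WHERE primaryNameID IN (10001, 10002, 20001, 20002);

SELECT COUNT(*) FROM nonpreferrednames
WHERE primaryNameID IN (10001, 10002, 20001, 20002);

If both come back zero, the names themselves can go: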

DELETE FROM names
WHERE nameId=10001
OR nameId=10002
OR nameId=20001
OR nameId=20002;

Voila, you’re done!

Final Thoughts

Like all other work done using metaphone3, this approach is only as good at catching duplicates as the phonetic algorithm allows. In my case, it caught four of the five duplicates I seeded, and the one it missed (“Burns, J.”) was rather different from the others.

Notes

0. To run these queries on a local installation, navigate to phpMyAdmin in your browser (probably http://localhost/phpmyadmin/), click on the database, click the SQL tab at the top, and run your queries in the SQL box.

1. Line-by-line, this 1) pulls each nameId and sortName into a file named names.csv, which can be found at the drive root (C:\ in this case); 2) separates fields with commas; 3) encloses the contents of each field in double quotes (which keeps CSV software from reading sort names like “Burns, Jacob” as two fields instead of one); 4) pulls these fields from the table `names`; and 5) does so whenever the `nameType` field is “Person.” The order makes writing it out as an ordered description a little tricky, but it is proper MySQL order.

2. I could have done the sorting in OpenRefine before exporting, but I found the alphabetical sort wasn’t carrying through to the Excel file.

3. Line-by-line, this tells the database to 1) update the table `archdescriptionnames` by 2) setting `primaryNameID` to the number given, 3) in every row where `primaryNameID` currently equals the number in the last line. So if that number occurs once, it’ll be replaced once; if it occurs 150 times, it’ll be replaced 150 times.