It’s been a while since I last posted, and there’s a good reason for that — I’ve started an exciting new job as an archivist and metadata specialist at Yale. I miss my colleagues and friends at Tamiment every day, and I look forward to continued awesome things from them.
Here at Yale, I work in Manuscripts and Archives. The major project for the first year will be to migrate from Archivists’ Toolkit to ArchivesSpace. In anticipation of this migration, I’m learning about the department’s priorities for data clean-up and thinking through what I can do to help implement those goals.
One of the first projects that was added to my list was to take a look at a project that has been ongoing for a while — cleaning up known errors from the conversion of EAD 1.0 to EAD 2002. Much of the work of fixing problems has already been done — my boss was hoping that I could do some reporting to determine what problems remain and in which finding aids they can be found.
- Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?
- During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
- Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
- Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.
- The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
- See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
- Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?
Getting to the Files
Access to files that have been published to our portal is easy — they’re kept in a file directory that is periodically uploaded to the web. And I also have a cache of the EAD 1.0 files, pre-transformation. These were both easy to pull down copies of. But, one of the questions I was asking was how these differ from what’s in the AT. It’s so, so easy to make changes in AT and forget to export to EAD.
If any of you know good ways to batch export EAD from AT, PLEASE LET ME KNOW. I have a pretty powerful machine and I know that folks here have worked on optimizing our AT database, but I could only export about 200 files at a time, for fear of the application hanging and eventually seizing up. So, I ran this in the background over the course of several days and worked on other stuff while I was waiting.
For some analyses, I wanted to exclude finding aids that aren’t published to our portal — for these, I copied the whole corpus to a separate directory. To get a list of which finding aids are internal-only, I very simply copied the resource record screen in AT (you can customize this to show internal-only finding aids as a column), which pastes very nicely into Excel.
Once in Excel, I filtered the list to get a list of Internal Only = “TRUE”. From there, I used the same technique that I had used to kill our zombie finding aids at NYU. I made a text document called KillEAD.txt, which had a list of the internal-only finding aids, and I used the command
cat KillEAD.txt | tr -d '\r' | xargs echo rm | sh
to look through a list of files and delete the ones that are listed in that text document. (In case you’re wondering, I’m now using a Unix simulator called Cygwin and there are weird things that don’t play nicely with Windows, including the fact that Windows text documents append /r to the ends of lines to indicate carriage returns. Oh, also, I put this together with spit and bailing wire and a google search — suggestions on better ways to do this kind of thing are appreciated).
So, that’s step 0. Don’t worry, I’ll be blogging my approach to goals 1-7 in the days ahead. I have some ideas about how I’ll do much of it (1, 3, and 4 I know how to assess, 2 I have a harebrained plan for, 5-7 are still nascent), but any suggestions or brainstorming for approaching these problems would be MORE THAN WELCOME.