Why database structure matters

I could have titled this post “Why Accessioning Information Matters” too. You may remember back to my first post about the Beast I included a lovely map of the fields and tables.

I didn’t talk in great depth about all the associated issues, but one of the biggest challenges migrating data out has been the archdescid table. In some respects, it’s easy as most of the fields make sense, have a clear field to map to in ArchivesSpace, and we have ways to clean up the data when it’s in the wrong spot. However, one of the hardest quirks is that the archdescid table includes both accessions and resources (aka collection) information in the same table and fields.

Archdesc table from Beast Database on the wall. First brainstorming of accession/resource records.

Archdesc table from Beast Database on the wall. First brainstorming of accession/resource records.

Staff access the information through two different front end forms in Access. The “accessions form” and the “finding aid form” pull various fields to display. Not all fields are available in both forms, but all the information is stored in the same place.

Staff view of Beast to enter accessions or finding aid information.

Staff view of Beast to enter accessions or finding aid information.

This means in the process of creating a finding aid accessions information is OVERWRITTEN. (Part of this is also due to the fact that we haven’t used separate “collection numbers”, but instead use the earliest accession number to identify a group of materials. This took away a necessary step of creating a new record for a finding aid.) Sometimes overwriting information is fine because maybe we had sketchy dates and now we have better ones. BUT, sometimes this means we are overwriting very specific accessions information with aggregate information about a larger set of materials meant for a finding aid.

For example, maybe we get the first installment of Sally Smith’s papers of 2 linear feet in 2002 and create an accession record to document that donation. By 2010, we’ve gotten three more accretions and decided to process all four accessions together and create a finding aid. Our finding aid is built from the 2002 record in the Beast.

It’s now 2014 and we just got another accretion. We create the 2014 accession record and decide to add the materials right away to the processed collection. Because of how the archdescid table is set up every time we added information to the Sally Smith papers finding aid we altered the accession record for the 2002 accession as the data in that record serves as our finding aid data too. By the time we’re done adding in our 2014 materials, the 2002 accession record now says 25 linear feet, the volume of all five accessions together. We have now lost the specific accessions information for the oldest accession for this collection (unless it happens to live on paper buried in control files.)

So the question is, how do we retroactively create accession records from a possibly aggregate finding aid?

We have two major categories of these accessions:

  • Records where we think the finding aid only represents one accession.
    • These range in size from a few items to almost 100 linear feet. We’re pretty confident that the information for smaller collections equates to accessions, but can’t be sure and who knows what the real correlation to size is.
  • Records where we know there are additional accretions.
    • Accretions may or may not be represented in the finding aid. Sometimes we tracked this, sometimes we didn’t. If we tracked it, we did it in different ways. Sometimes we know that some of the accretions were added and others weren’t. Or that some were and are unsure of the others. There’s lots of variation here.

We spent a good chuck of time analyzing these records trying to determine what information was reliable for accessions. Here’s what we think is mostly reliable:

  • Title
  • Accession number
  • Accession date (There are little to no accession dates supplied for these records, so most will inherit the default date)
  • Dates of materials
  • Access and use restrictions
  • Donor name and contact information

Information that’s more of a crap shot:

  • Extents
  • Contents description (will populate from a combo of our abstract and scope notes, includes lots of bio info, going to be messy and not ideal)

Solution for now:

Import these accession records with the information available knowing that for some (most?) the information doesn’t accurately document the accession.

Add a general note to explain the data. Something like “This accession record was created from an aggregate finding aid and may not represent accurate accessions data. Accession specific data may be available in control files.”

Continue to refine data as accessions/resources are worked on through our normal processing/prioritization workflows.

What would you do? Have others dealt with a similar problem?

Advertisements

Figuring Out What Has Been Done: Double Arrangement Statements

What was the problem?

During the transformation from EAD 1.0 to EAD 2002, table of contents information was migrated to an arrangement note, along with the already-existing arrangement note. As a clean-up task, folks were going through and paring this down to just one collection-level arrangement statement. We want to know how much of this is left to do.

How did I figure this out?

Luckily, this one is pretty easy. I’m just identifying documents that have more than one arrangement tag in a section (I could have specifically asked for more than one arrangement tag in the archdesc, but it doesn’t really matter. If I found dupe arrangement notes elsewhere I would want to know about them). The xquery is here, and it basically identifies where my files are,

declare variable $COLLECTION as document-node()* := collection("file:///C:/Users/cmc279/Desktop/ATExport/?recurse=yes;select=*.xml");

and then asks if there’s a second arrangement element

for $i in $COLLECTION//ead:ead
let $arrangement := exists($i//ead:arrangement[2]),

(that ead:arrangement[2] is basically saying “show me all the second arrangement notes” and exists() is just a way of getting a true/false answer about whether something is there — so here I’m asking “is it true or false that there’s a second arrangement note?)

since I want to know which finding aid it’s in, I say

$doc := base-uri($i)

And then I write a return statement to give myself a report with those variables:

return
<doc>
<uri>{$doc}</uri>
<arrangement>{$arrangement}</arrangement>
</doc>

What did I find out?

The report from this xQuery looked like this:

reportArrangement

(At first you’ll get the red bar of doom — just enclose everything in a root element to fix that). From here, I just wanted to know which finding aids had arrangement = true. To get a sense of how many, you can use the xpath 2.0 bar at the top of oXygen.

count(//arrangement[contains(., "true")])

This is, very simply, counting how many of these arrangement elements (remember, the arrangement variable asked for whether a second arrangement note exists) came back true. In our case, ONLY THREE! The people who did this clean-up work were clearly very thorough.

If it were significantly more, I might export this to excel to get a table of which EADs came back true. But since it’s only three, I used xpath again and wrote:

//arrangement[contains(., "true")]

At the bottom of the screen, it tells me which those are, and I can click on each to see the finding aid URI.

doubleArrangementResults

Conclusions

This was a pretty easy one to work out — fortunately, there were only three problem files in the end. Wouldn’t you much rather write a report to find those three problems than read through each finding aid (and we have more than 3,000!)?

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.  We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?

 

Figuring Out What Has Been Done

It’s been a while since I last posted, and there’s a good reason for that — I’ve started an exciting new job as an archivist and metadata specialist at Yale. I miss my colleagues and friends at Tamiment every day, and I look forward to continued awesome things from them.

Here at Yale, I work in Manuscripts and Archives. The major project for the first year will be to migrate from Archivists’ Toolkit to ArchivesSpace. In anticipation of this migration, I’m learning about the department’s priorities for data clean-up and thinking through what I can do to help implement those goals.

The Goals

One of the first projects that was added to my list was to take a look at a project that has been ongoing for a while — cleaning up known errors from the conversion of EAD 1.0 to EAD 2002. Much of the work of fixing problems has already been done — my boss was hoping that I could do some reporting to determine what problems remain and in which finding aids they can be found.

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?
  2. During the transformation from 1.0 to 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from 1.0 to 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements.
  5. The content of <title> was stripped in all cases. Where were <title> elements in 1.0 and has all the work been done to add them back to 2002?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures?

Getting to the Files

Access to files that have been published to our portal is easy — they’re kept in a file directory that is periodically uploaded to the web. And I also have a cache of the EAD 1.0 files, pre-transformation. These were both easy to pull down copies of.  But, one of the questions I was asking was how these differ from what’s in the AT. It’s so, so easy to make changes in AT and forget to export to EAD.

If any of you know good ways to batch export EAD from AT, PLEASE LET ME KNOW. I have a pretty powerful machine and I know that folks here have worked on optimizing our AT database, but I could only export about 200 files at a time, for fear of the application hanging and eventually seizing up. So, I ran this in the background over the course of several days and worked on other stuff while I was waiting.

For some analyses, I wanted to exclude finding aids that aren’t published to our portal — for these, I copied the whole corpus to a separate directory. To get a list of which finding aids are internal-only, I very simply copied the resource record screen in AT (you can customize this to show internal-only finding aids as a column), which pastes very nicely into Excel.

ATInternal

InternalOnlyExcel

Once in Excel, I filtered the list to get a list of Internal Only = “TRUE”. From there, I used the same technique that I had used to kill our zombie finding aids at NYU. I made a text document called KillEAD.txt, which had a list of the internal-only finding aids, and I used the command

cat KillEAD.txt | tr -d '\r' | xargs echo rm | sh

to look through a list of files and delete the ones that are listed in that text document. (In case you’re wondering, I’m now using a Unix simulator called Cygwin  and there are weird things that don’t play nicely with Windows, including the fact that Windows text documents append /r to the ends of lines to indicate carriage returns. Oh, also, I put this together with spit and bailing wire and a google search — suggestions on better ways to do this kind of thing are appreciated).

So, that’s step 0. Don’t worry, I’ll be blogging my approach to goals 1-7 in the days ahead. I have some ideas about how I’ll do much of it (1, 3, and 4 I know how to assess, 2 I have a harebrained plan for, 5-7 are still nascent), but any suggestions or brainstorming for approaching these problems would be MORE THAN WELCOME.

Clean up: Instructions for accession date in accession records

Every accession record requires an accession date. ArchivesSpace requires this to save a record. Our processing manual instructs to include the date the materials arrived at UMD, but it’s buried in the Beast documentation as opposed to a policy/requirement. About 70% of the time there is a decent accession date in the Beast. We actually ended up checking our paper files and adding accession dates too. We weren’t going to do this originally, but ended up checking the paper for some other issues so went ahead and added an accession date when found.

Actions:

Every accession requires an accession date in YYYY-MM-DD format.

We have two date fields related to accession records. The “acqdate” field from the “source’ table that SHOULD specify the accession date of the materials. We also have the “dateoffirstentry” from the “archdesc” table that is the date the form in the Beast was entered. Usually, this date is later than the acqdate and pretty regularly it is much, much later.

If date in “acqdate” seems to be the same fiscal year as the accession number use this date.

When “acqdate” and “dateoffirstentry” both have similar dates, choose the earliest date, which will usually be in “acqdate”.

  • ex: “dateoffirstentry” is 6/18/2003 and “acqdate” is 06/11/2003. Use acqdate.

If no, “acqdate”, check “dateoffirstentry”. If this date seems to be the same fiscal year, use this date. If this field was obviously entered much later, do not use this date.

  • ex:  “dateoffirstentry” is 1/19/2003, no “acqdate”, accession number is 2003-19. Use dateoffirstentry.
  • ex: “dateoffirstentry” is 1/19/2003, no “acqdate”, accession number is 92-86. Do not use dateoffirstentry.

If you have a partial date:

  • Year and Month (August 2002)
    • Use 01 as the day value: 2002-08-01
    • (used first day of month to avoid different number of days in a month)
  • Season (Fall 2005 or Spring 2006)
    • Choose the end of the fiscal year or calendar year
    • Summer and Fall go to December 31, 2005
    • Winter and Spring go to June 30, 2006
  • Year only (2002, 1985)
    • Using the accession number, try to determine the fiscal year the materials came in.
    • If you can determine the fiscal year pick either the end of the fiscal year (June 30) or end of calendar year (December 31). This will help maintain that this accession came in during that fiscal year, but also attempt to ensure we know the calendar year the materials came in.
    • If the year matches the first part of the accession number, materials came in during the second half of the fiscal year, but the first half of the calendar year: January-June XXXX
      • Use June 30, XXXX
      • ex: 2002 with accession number 2002-172 is FY2002
        • FY2002 includes Jan-June 2002, use June 30, 2002
      • (used end of month to account for whole period)
    • If the year is one higher than the accession number, materials came in during the first half of the fiscal year, but the second half of the calendar year: July-December XXXX
      • Use December 31, XXXX
      • ex: 2003 with accession number 2002-267 is FY2003
        • FY2003 includes July-Dec 2002, use December 31, 2002
      • (used end of month to account for whole period)

If there is no date:

  • Use default date of 1856-01-01
    • Selected as the year of the charter of the Maryland Agricultural College.
    • Debated using the year of the accession number and selecting dummy months and days, but decided not to do this as it would negatively impact calendar year reporting.

What date to pick if there are multiple dates:

  • Unfortunately, a very common practice was to add new materials arriving to a recent accession instead of documenting each discreet group as their own accessions. There is no way to tell which materials arrived on which date so we can’t split this into two accession records.
    • Examples include:
      • September 11, 1990 and September 28, 1990
      • 7/27/2006 and 8/20/2006
    • Use the earliest date for the accession date.
    • In a general note state “Portions of this accession also arrived on [dates].”
  • Other times collections were not documented at the point of accession, so dates are generalized such as: circa 1990s.
    • Use default accession date of 1856-01-01
    • In a general note state “Accession date of materials may be [dates].”

 

 

 

 

Clean up: Instructions for accession identifier in accession records

Every accession record requires an unique accession number per local requirements. ArchivesSpace requires a value in the accession identifier field to save a record, which will greatly help in enforcing compliance for this field.

For the most part, our collections have accession numbers, though sometimes weird things happened. Most numbers look like this: 2014-18 or 92-113. There are a few variations with unit abbreviations as well as additional numbers or formatting.

Actions:

Format accession numbers according to the following:

  • accession_number_1 = fiscal year (2014 or 72)
  • accession_number_2 = number of accession for the year, will be sequential
    • Add leading zeros to second part of number. In OpenRefine: “000”[0,3-length(value)] + value
  • accession_number_3 = Letter code for collecting units
    • We are developing a letter code so we can quickly identify that an accession belongs to a particular collecting unit. We used an EAD tag for this in the Beast, which won’t fly anymore.
  • accession_number_4 = use for additional memorabilia or photograph numbers (more on that in another post, it’s a headache)

If more than one accession number is listed for a record:

  • Keep the oldest number for the record
  • Create a new record for each additional number, using the same title information
  • Include other information only if you know it is the same for each accession, ex: all the same donor
  • Add “described in Accession [acc#]” to the general information note. This will signal that the only description we have is contained in the earlier record. (Obviously, this is lousy for documenting accessions, but it’s the best we can do!)

If there is no accession number:

  • Consult the excel spreadsheet where accession numbers are assigned. Use the acqdate and dateoffirstentry fields to narrow down to a fiscal year and see if you can match up the information with an accession number.
  • Consult paper accession files (set ordered by accession name) to see if you can match something up.
  • Check the physical boxes to see if an accession number is on them (most of our boxes only have a title on them with no identifying number, a personal pet peeve.)
  • If nothing can be found anywhere, give accession a new number.
    • Add “retroactively assigned” to the general information note so we’ll know this was a backlog accession.
    • Write the accession number on box 1 of the materials.