Clean Up: Date Instructions for Accession Records

In an effort to be transparent (and to highlight the volume of work), I’m attempting to document all of our cleanup instructions and decisions. For each ArchivesSpace field, we’re including actions and questions as we explore our data. Some of the questions focus on how ArchivesSpace works or implements fields, while others focus on our specific data, policies, or procedures. Over time, the questions turn into actions, and we are continually updating our documentation as we work through these issues.

Below is what we have so far for dates of materials in accession records. We still have lots of questions (and will no doubt develop more) so feel free to provide any feedback or suggestions.

Actions

Dates from the Beast fields unitdateinclusive and unitdatebulk will be parsed into date 1 and date 2 field sets, each including the following (a rough sketch of this parsing logic follows these lists):

  • date_1_begin
  • date_1_end
  • date_1_type (required by AS)
  • date_1_expression (required by AS if no normalized dates)
  • date_1_label (required by AS)

Adhere to required date formats and definitions for ArchivesSpace fields

  • Control begin and end fields for date format: YYYY, YYYY-MM, or YYYY-MM-DD
  • Single dates do not need a date end
  • Control date expression based on local convention [revising current local format]
  • Split date ranges using “-” as the delimiter and designate them as “inclusive” or “bulk” in date type, based on which column they came from. Use date 1 fields for inclusive and date 2 fields for bulk
  • For values with commas, such as “1950, 1950”, parse to two single dates using the date 1 fields and date 2 fields
  • Label defaults to “creation” unless otherwise specified

For dates that include “undated”

  • Keep “undated” as part of the whole statement in the date expression field.
  • Parse the remaining dates as normal and omit “undated” from the begin and end fields
  • ex: “1857-1919 and undated” remains as the date expression, 1857 goes to date_begin, 1919 goes to date_end, type is “inclusive”
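
To make these actions concrete, here is a minimal Python sketch of how a single Beast date value might be parsed under the rules above. It is an illustration only, not our migration tooling; the function name and the returned field names are hypothetical, and it handles just the simple cases (a lone year, a hyphenated range, and a trailing “and undated”).

import re

def parse_beast_date(value, date_type):
    """Sketch: parse a simple Beast date string into ArchivesSpace-style fields."""
    expression = value.strip()
    # Keep "undated" in the expression, but drop it before parsing begin/end
    working = re.sub(r'\s*(?:and\s+)?undated\s*$', '', expression, flags=re.I).strip()

    fields = {
        'date_1_expression': expression,
        'date_1_label': 'creation',   # label defaults to "creation"
        'date_1_type': date_type,     # "inclusive" or "bulk", based on the source column
        'date_1_begin': '',
        'date_1_end': '',
    }
    if re.fullmatch(r'\d{4}-\d{4}', working):      # e.g. "1857-1919"
        fields['date_1_begin'], fields['date_1_end'] = working.split('-')
    elif re.fullmatch(r'\d{4}', working):          # e.g. "1941" -- single dates only need a begin
        fields['date_1_begin'] = working
    return fields

print(parse_beast_date('1857-1919 and undated', 'inclusive'))
# begin 1857, end 1919, expression "1857-1919 and undated", type "inclusive"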

Certainty

  • Assume that all collection dates in accession records are best estimates [Policy decision]
  • Remove all forms of “circa” from accession dates
  • Remove question marks and variations
  • Remove square brackets and variations indicating guessed dates
  • Remove “undated” if it is the only value
  • Remove “unknown” if it is the only value (one way to script these removals is sketched below)
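
As an illustration of how these certainty markers could be stripped in bulk, here is a small Python sketch. It is a rough approximation of the rules above (our actual cleanup happens in Excel and OpenRefine), and the pattern list would need to grow as we find more variations:

import re

def strip_certainty(value):
    """Sketch: remove circa variants, question marks, and square brackets."""
    v = re.sub(r'\b(?:circa|ca\.?|c\.)\s*', '', value, flags=re.I)   # circa, ca., c.
    v = re.sub(r'\(\?\)|\[\?\]|\?', '', v)                           # ?, (?), [?]
    v = re.sub(r'[\[\]]', '', v)                                     # stray square brackets
    v = v.strip(' .,;')
    if v.lower() in ('undated', 'unknown'):                          # only value left -> remove entirely
        return ''
    return v

print(strip_certainty('[1922]'))    # 1922
print(strip_certainty('1940?'))     # 1940
print(strip_certainty('undated'))   # (empty string)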

For dates listed as decades

  • Control decades to year spans, using the first and last year of the decade.
  • 1940s-1950s becomes 1940-1959 in the date expression, with begin 1940 and end 1959 (see the sketch below)
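
A quick sketch of that decade-to-span conversion in Python (again, just an illustration of the rule, not our actual tooling):

import re

def decade_to_span(value):
    """Sketch: turn '1940s' or '1940s-1950s' into a begin/end year span."""
    decades = re.findall(r"\b(\d{3})0['’]?s\b", value)
    if not decades:
        return None
    begin, end = decades[0] + '0', decades[-1] + '9'
    return begin, end, begin + '-' + end

print(decade_to_span('1940s-1950s'))   # ('1940', '1959', '1940-1959')
print(decade_to_span('1920s'))         # ('1920', '1929', '1920-1929')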

If bulk dates are the exact same as inclusive, delete bulk dates

Questions

What date cutoffs do we use for partial decades?

  • ex: Late 1990s, mid 1970s, early 1930s
  • late = 1995-1999
  • mid = ???
  • early = 1930-1935

If bulk dates exist for single items, when should we delete them?

  • Will delete if same
  • Should we keep if there is a difference?
  • If there is a difference, what is the year cutoff? 1? 5? 10? etc.

Are single dates with “undated” really single?

  • ex: “1941 and undated”

Can we have “bulk” dates that are “single”?

  • ex: 1989, type as “bulk” in ArchivesSpace?

For date expression, can we agree on the preferred date formats?

  • Start with guidelines in processing manual
  • Update and make suggestions for changes
  • Solicit comments/feedback
  • Finalize decisions
  • ex: 1880s, 18th-20th Century, Oct. 1976, Sept-9-1980, May 5, 1990

What if accession is a single item with a date range, but the abstract gives single date?

  • ex: Edwin Warfield accessions, range 1875-1912, but the abstract for the accession says 1889 for a single item. Was this date range meant for all of the Warfield papers? Should we ignore it and take the date from the abstract?

What do we do if we have more dates than fields?

  • ex: single dates of 1966, 1967, 1969 or 1930, 1950, and 2002 would parse to three single date fields
  • Version 1.0.7.1 currently only imports date 1 and date 2 in the CSV accession template
  • When do we want to turn single dates into a range instead? How many years in between? Based on the size of the materials? Never, and instead develop a procedure for adding dates beyond the second to the ArchivesSpace record after import?

 

In the next post we’ll go through some of the specific ways we are executing the clean up actions.

 


Date Formats

Now that we’ve eliminated most of our duplicate bulk dates, let’s take a look at the plethora of date formats in our accession records. Does your repository have local definitions for how to record collection dates? My guess is most would have something somewhere, even if not a formal written policy. We have a small section of our processing manual detailing preferred date formats and unacceptable date formats. It is supposed to apply to collection dates in accession records and in finding aids. Do people follow it? Sometimes. Usually not. Or they follow some of it, but not all of it. We also have dates created before the processing manual existed. The samples below are just from a portion of our accession records, so we might have additional formats yet to be discovered, but you’ll get the idea.

Our date fields could contain three elements: year only, month and year, or month, day, and year. The type might be a single date, multiple single dates, range, multiple ranges, or a combination of these (although that isn’t specified). For dates in accession records I have already gone ahead and removed any variation of the word “circa”. There’s also a healthy amount of “unknown” and “undated” speckled throughout.

Below are sample unitdateinclusive values (a Beast field), grouped by element and type:

Year, single

  • 1909
  • [1922]
  • 1636 (approx.)
  • 1920[?]
  • 1940?
  • 1946, undated
  • 1957(?)
  • 1999?

Year, multiple single

  • 1913, 1963
  • 1945 or 1946
  • 1953, 1961, 1969, 1994
  • 1954, 1956, 1966-1967, 1971
  • 1958, 1960, 1962
  • 1966, 1967, 1969
  • 1967, 1968, 1969
  • 1969, 1970
  • 1995, 2000, undated

Year, range

  • 1910-1950
  • 1920s
  • 1920s-1930s?
  • 1921-1981 and undated
  • 1940’s-2006
  • 1980’s-1990’s
  • 2000-2001 (FY 2001)
  • Early 1970s
  • late 1980s-early 1990s
  • undated, 1970s-2002

Year, multiple range

  • 1920s, 1969-1975
  • 1932-1934, 1950s
  • 1937-1942; 1947-1950

Year, single and range

  • 1928; 1938-1962 and undated
  • 1938, 1950-1951
  • 1950s-1960s, 1988
  • 2008 [1901-2002]

Month Year, single

  • November 1962
  • April 2001?

Month Year, range

  • January 1977- November 1981
  • May2005-January 2007
  • Otober 1920-Marh 1921

Month, Day, Year, single

  • 11/9/1911
  • June 14, 1924
  • Marh 8, 2006
  • Otober 26, 1963

Month, Day, Year, multiple single

  • 12/19/2005; 4/4/2006
  • January 5, 2000,  July 12, 2000
  • 9/19 & 9/20/2007

Month, Day, Year, range

  • 10/24-10/26/2008
  • January 30, 2011-February 2, 2011
  • Marh 22-24, 2001
  • Otober 13, 1987-Deember 7, 1987

Here’s a summary of the issues:

  • Punctuation is not standard. Multiple dates may be separated with a period, comma, semi-colon, ampersand, or the word “and”.
  • We used a variety of methods to convey we were unsure of the date, such as ?, (?), [ ], [?], (approx.) in addition to all the circa variations. I’m guessing there are other dates we weren’t sure of, but we didn’t specify that.
  • Spacing isn’t consistent. Sometimes there are no spaces around punctuation; other times there are one, two, or more spaces.
  • Spelling. Sometimes we just couldn’t spell October or March (the most popular offenders, apparently).
  • Formats are all over the place, even comparing the same element and type. Ex: March 22-24, 2001 compared to March 22, 2001-March 24, 2001.
  • Use of decades was a common practice.
  • Providing single dates instead of ranges. Do we really need to say “1966, 1967, 1969” instead of “1966-1969” if we’re only missing 1968?

Next post we’ll talk about the instructions and rules we’re developing for cleaning this up and how we go about executing those decisions.

Baby Steps in Using OpenRefine to Clean Up Collection Data

As I mentioned in my last post, my latest collection management project is making sure that we have collection-level records for everything in the repository, which I am doing by creating accession records in Archivists’ Toolkit. (I chose accession records rather than resource records based on a set of legacy decisions about how the institution uses AT; if I were starting from scratch, I would probably do that differently.)  The piece of the puzzle that I’ve been working on recently is integrating the records of one of our sub-collections, the Bakhmeteff Archive of Russian and East European Culture, into AT.

The manuscript unit took over accessioning materials for this collection in about 2007, so collections that have been acquired in the past 7 or 8 years do have an AT record, as do about 60 older collections that were added as part of a separate project.  So, my first step was to figure out which Bakhmeteff collections already had a collection-level record and which ones did not.  Since there was not one comprehensive collection list, this involved checking our AT records against a whole slew of other data sources* to see what was already in the database and which collections still needed collection descriptions in Archivists’ Toolkit.

The next issue was to figure out the best way to identify duplicate records.  In looking at all of the source data, it became clear very quickly that the way the title was expressed varied wildly across the data sources I was working with — sometimes expressed as “John Smith Papers,” sometimes “Smith, John Papers,” and, in the case of many of our older catalog records, just “Papers,” with John Smith living in the 100 field and not reappearing in the 245.  Some sources used diacritical marks and some didn’t (always thorny, but with several hundred collections in Russian a potential dealbreaker).  Therefore I chose to use the collection number rather than the title.  The one issue with that is that I was using AT accession records, not resource records, so the collection number was expressed as part of the title field (I know, I know) and had to be broken out into its own column, but that was not a huge deal.  Once I had that as a master document, I could combine this spreadsheet and my other data sources and then use OpenRefine to facet the spreadsheet by ID number to identify (and eliminate) any collection that shows up both in AT and in one of my data sources.  I then had a comprehensive list of collections not represented in AT, so that I knew which collections needed to be added.  It’s not a perfect solution, but it is a down-and-dirty way to identify where I have work to do, so that I am not having a student manually check every collection against different data sources to identify what needs a better record.  It also let me combine data from all sources to come up with a new master collection list to work from.  Plus, it was a good, baby-steps introduction to using OpenRefine.
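
For anyone curious, the faceting step boils down to a set comparison; here is a rough Python equivalent with hypothetical file and column names (the real work happened in OpenRefine, and the sources were messier than two tidy CSVs):

import csv

def ids_from(path):
    """Read a CSV and return the set of collection numbers it contains."""
    with open(path, newline='', encoding='utf-8') as f:
        return {row['collection_number'].strip() for row in csv.DictReader(f)}

in_at = ids_from('at_accessions.csv')          # collections already in Archivists' Toolkit
all_known = ids_from('combined_sources.csv')   # everything from the other data sources

missing_from_at = sorted(all_known - in_at)    # collections that still need AT records
print(len(missing_from_at), 'collections still need collection-level records')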

 

*Since information was coming from so many sources, and because I didn’t trust the completeness of any of them, I was checking our AT holdings against accession registers, a collection number spreadsheet, our stack directory, and a list of collections that I asked one of our fantastic systems librarians to generate for me, which queried our catalog for any record that was cataloged with a bibliographic level of c: collection in the Leader and had a location code that tied it back to our library.

Our EAD — Standards Compliance

I mentioned in an earlier post that in anticipation of our three big archival systems projects (migration to ArchivesSpace from Archivists’ Toolkit, implementation of Aeon, and re-design of our finding aids portal), we’re taking a cold, hard look at our archival data. After all, both Aeon and the finding aids portal will be looking directly at the EAD to perform functions — both use XSLT to display, manipulate, and transform the data.

So, there are some basic things we want to know. Will our data be good enough for Aeon to be able to turn EAD into a call slip (or add it to the proper processing queue, or know which reading room to send the call slip to)? Are our dates present and machine readable in such a way that the interface would be able to sort contents lists by date? And, while we’re at it, do our finding aids meet professional and local standards?

Let’s take a look at our criteria.

A single-level description with the minimum number of DACS elements must include (a sketch of how we might check for these programmatically follows the list):

  • Reference Code Element (2.1) — <unitid>
  • Name and Location of Repository Element (2.2) — <repository>
  • Title Element (2.3) — <unittitle>
  • Date Element (2.4) — <unitdate>
  • Extent Element (2.5) — <extent>
  • Name of Creator(s) Element (2.6) (if known) — <origination>
  • Scope and Content Element (3.1) — <scopecontent>
  • Conditions Governing Access Element (4.1) — <accessrestrict>
  • Languages and Scripts of the Material Element (4.5) — <langmaterial> (I decided to be generous and allow <langmaterial>/<language @langcode>, although I would prefer that there be content in the note)
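
As an illustration, here is one way a check for those minimum elements could be scripted against an EAD file. This is a sketch only (Python and lxml), not the tooling behind the numbers reported below; matching by local name sidesteps the EAD 2002 namespace.

from lxml import etree

# DACS single-level minimum, mapped to EAD element names
REQUIRED = ['unitid', 'repository', 'unittitle', 'unitdate', 'extent',
            'origination', 'scopecontent', 'accessrestrict', 'langmaterial']

def dacs_minimum_report(path):
    """Return the list of required elements missing from an EAD finding aid."""
    tree = etree.parse(path)
    return [tag for tag in REQUIRED
            if not tree.xpath('//*[local-name()="%s"]' % tag)]

print(dacs_minimum_report('findingaid.xml'))   # e.g. ['origination', 'langmaterial']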

For a descriptive record to meet DACS optimum standards, it must also include:

  • Administrative/Biographical History Element (2.7) — <bioghist>
  • Access points — <controlaccess>

At Tamiment, we’ve determined that the following elements must be included in a finding aid to meet local standards:

  • Physical location note — <physloc>
  • Restrictions on use note — <userestrict>
  • Immediate source of acquisition note — <acqinfo>
  • Appraisal note — <appraisal>
  • Abstract — <abstract>
  • Arrangement note — <arrangement>
  • Processing information note — <processinfo>
  • Our local standards also require that every series or subseries have a scope and content note, every component have a title, date and container, and every date be normalized.

I’ll talk about our reasons for these local standards in subsequent blog posts.

Finally, we’ve started thinking about which data elements must be present for us to be able to use the Aeon circulation system effectively. To print a call slip, a component in a finding aid needs the following information. Useful (but not required) fields are italicized:

  • Reference code element / call number — <unitid>. We have to know what collection the patron is requesting.
  • Repository note — <repository>. This should be a controlled string, so that the stylesheet knows which queue to send the call slip to. It may also be possible to do post-processing to add an attribute to this tag or a different tag, so that the string can vary but the attribute would be consistent enough for a computer to understand. In any case, we need SOME piece of controlled data telling us which reading room to visit to pull this material.
  • Container information — <container>. Every paged container should have a unique combination of call number and box number. There’s no good way to check this computationally — we’ve all seen crazy systems of double numbering, numbering each series, etc.
  • Collection title — <unittitle>. This is the title of the collection, which is useful for paging boxes.
  • Physical location note — <physloc>. This isn’t strictly necessary, but it is very useful to know whether boxes are onsite or offsite.
  • Access restrictions — <accessrestrict>. This is an operational requirement. By having the access restriction note, the page can see right away whether it’s okay to pull this box.
  • Fancy-pants scripting piece to add location information…. This would require a lot of data standardization (and probably data gathering, in some cases), but it would be great to have the location on the repository-eyes-only side of the call slip.

So, how are we doing?

[Chart: TamStandards]

Frankly, I was pleasantly surprised. As you can see from the chart on the right, out of 1217 finding aids from that harvest, about two-thirds meet DACS single-level and optimum requirements. The reasons for failure vary: many are missing creator information, notes about the conditions governing access, and information about the language of material. Happily, information about the historical context of the collection and the presence of access points is fairly common.

We also see that the vast majority of our finding aids will meet the requirements for Aeon compliance. The problem of components without containers is a big one, but is something that we’ve obviously dealt with using paper call slips, and will have to be a remediation priority. Once this is addressed, we still have the outstanding issue of how to consistently tell the computer where a finding aid is coming from. Once we decide how we want that data to look, we’ll be able to fix it programmatically.

Our most distressing number is for local compliance, and the biggest offenders are physical location, immediate source of acquisition, and appraisal information. This reflects an overall trend in our repository of being careless with administrative information — we have very little information about when and how collections came to us and what interventions archivists made.

The requirement that appraisal information be included is extremely recent — unfortunately, this is the kind of information that is difficult to recover if not recorded at the time of processing. Hopefully, some information about appraisal may be included in processing information and separated materials notes.

For anyone interested in how our data breaks down, a chart is below.

[Chart: TamElements]

Adding Structure to a Word Document Using Regular Expressions

At Tamiment I’ve been working on a team project to describe our photographic collection backlog.  The New York Hotel and Motel Trades Council Photographs contains around 15 linear feet of materials, which are mostly in labeled folders and grouped by subject.  While my first instinct was to describe this collection at the box level, I discovered that there was a folder-level MS Word inventory that had been done on-site.  Maureen and I turned this Word document into the container list of an EAD finding aid.

[Image: initialinventory]

As Maureen showed in her previous tutorial, the first step was copying this into a spreadsheet.  However, this inventory needed some extra cleanup before it looked how I wanted it to.  Problems included: dates at the beginnings and ends of folder titles; subseries that ended in one box and started again two boxes later; and administrative “codes.”  In order to clean up the document and add structure, I used regular expressions in my word processor’s find and replace tool.

Regular Expressions

Regular expressions allow us to match patterns instead of specific words.  For example, instead of searching for a specific abbreviation, such as Misc., I could use the pattern [A-Za-z]+\. to match all abbreviations in my inventory.  Regular expressions are used in a number of programming languages and can also be used in the find and replace tool in several word processors.  In order to use this functionality of find and replace, you need to have “regular expressions” checked (usually under “more options”).  Not all word processors have the same functionality; I used LibreOffice Writer, which is free, open-source software.

I learned how to use regular expressions through the tutorial “Understanding Regular Expressions” by Doug Knox on the Programming Historian, which I highly recommend checking out.

Here is a list of special characters, and combinations of special characters, that I used to edit and add structure to my inventory.  A full list of LibreOffice’s regular expressions can be found here.  Note: this exact syntax won’t work with Microsoft Word, which uses its own wildcards.  More information about Microsoft Word’s find and replace can be found here.

  • .* – any sequence of characters, including none
  • [A-Za-z]+ – capital and lowercase letters occurring one or more times
  • ^ – beginning of a paragraph
  • $ – end of a paragraph
  • ^$ – empty paragraph
  • \t – tab
  • [0-9]{2} – any digit, repeated exactly two times
  • \* – “\” indicates that the following special character (in this case, “*”) is a normal character
  • ( ) – in a search, defines the characters inside the parentheses as a reference
  • $1 – in a replace, refers to the first reference defined in the search

Cleanup

First, I cleaned up the formatting.  I removed the bullet points and made sure everything was aligned left.  Though I could remove the blank lines easily in my spreadsheet, I removed them at this stage so the data would be easier to work with:

Search for: ^$
Replace with: nothing

The markers “*” and “*P” appeared at the beginning of some of the lines; this was for administrative purposes and I don’t want it in the published finding aid.  To remove these, I searched for them at the beginning of the line, which I indicated with “^”.  However, because “*” is a special character, I needed to use “\” before it to indicate that it was a normal character.  So my expressions looked like this:

Search for: ^\*P
Replace with: nothing

Search for: ^\*
Replace with: nothing

In this inventory, when a subseries continues into another box, it is indicated by “[Members continued]”.  Since in my spreadsheet it will be evident that these folders are all part of one series, I got rid of them using:

Search for: ^\t\[.*continued\]$
Replace with: nothing

Adding Structure

When I copied and pasted the document into a spreadsheet, it was only one column.  I want the series title in the first column, subseries title in the second column, folder title in the third column, date in the fourth column, and box number in the fifth column.  I can add structure to this inventory by indicating columns with tabs.

I manually tabbed the subseries titles (which are indicated with italics) once.

To put the box number in the fifth column, I used

Search for: ^Box
Replace with: \t\t\t\tBox

To put my folder titles in the third column, I used

Search for: ^([A-Za-z]+)
Replace with: \t\t$1

Dates

Some of the dates are in the beginning of the folder title; I wanted them all in the end of the folder title.  Since I know that what I want to move is in the beginning of the line, I indicated this with “^”. There are some four-digit numbers that are not years, and all dates are from the 20th century, so I made sure that it began with “19” and was followed by two digits, so it looked like this: “19[0-9]{2}”.  Because I wanted to switch this value with the rest of the line, I indicated the rest of the line using “.*$” and surrounded each value in parentheses.  In my replace field, the value $1 refers to the first group in parentheses and the value $2 refers to the second group in parentheses.

Search for: ^(19[0-9]{2}) (.*)$
Replace with: $2 $1

Because I want the date in the next column over, I put a tab between the folder title and year by searching for “([A-Za-z]+) (19[0-9]{2})$” and replacing with “$1\t$2”.  The “\t” indicates a tab.

Search for: ([A-Za-z]+) (19[0-9]{2})$
Replace with: $1\t$2
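
If you prefer scripting to interactive find and replace, the same sequence of transformations could be run in Python. This is a rough sketch with hypothetical file names; I did this work in LibreOffice Writer, not a script:

import re

with open('inventory.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

cleaned = []
for line in lines:
    if re.fullmatch(r'\s*', line):                     # skip empty paragraphs (^$)
        continue
    line = re.sub(r'^\*P?', '', line)                  # strip the "*" and "*P" markers
    if re.fullmatch(r'\t\[.*continued\]', line):       # drop "[... continued]" rows
        continue
    line = re.sub(r'^Box', '\t\t\t\tBox', line)        # box numbers to the fifth column
    line = re.sub(r'^([A-Za-z]+)', r'\t\t\1', line)    # folder titles to the third column
    line = re.sub(r'^(19[0-9]{2}) (.*)$', r'\2 \1', line)          # move leading years to the end
    line = re.sub(r'([A-Za-z]+) (19[0-9]{2})$', r'\1\t\2', line)   # tab before a trailing year
    cleaned.append(line)

with open('inventory_tabbed.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(cleaned))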

Into a Spreadsheet

I’m ready to put this data into a spreadsheet.  When pasting, click “paste special,” then check “unformatted text” and “separated by tabs.”  My spreadsheet looks like this:

[Image: hotelworkersfinal]

It still needs some further edits, which I can do in this spreadsheet or back in the word processor.  But it’s a lot better than before, with little editing by hand, and is a step closer to being turned into the container list of an EAD-encoded finding aid.

Clean Up: Inclusive and Bulk Dates Comparison

Let’s start with a more straightforward cleanup issue identified during our accession record mapping. In this example, we’ll use a set of accession records (only a portion of our total) we have exported from the Beast into an Excel spreadsheet and focus on two fields: <unitdateinclusive> and <unitdatebulk>. We’ll map these to date fields in ArchivesSpace, but before we get to that let’s examine the data.

This spreadsheet contains 3361 accession records. 2685 rows have an inclusive date and 1908 rows include a bulk date. By sorting the spreadsheet by date and spot checking the data, we’ve come up with a list of inconsistent date formatting issues. One of the most pervasive habits was to always fill out both the inclusive and bulk dates, even if the values for each were the exact same. (For now, ignore the other date formatting issues in these examples.)

[Screenshot: same inclusive and bulk dates]

Supplying this information twice isn’t necessary for our users and could be confusing to some (plus it is extra work for us!). DACS 2.4.10 suggests providing a bulk date when dates differ significantly from the inclusive dates, so we want to keep the bulk dates that are different from our inclusive dates while removing the duplicate values.

We could compare these by hand (done that before!) or use a formula in Excel to do the work for us:

=IF(A2=B2, "same", B2)

This formula asks if the value in <unitdateinclusive> equals the value in <unitdatebulk>. If they are equal, return the value “same” and if they are different return the value of <unitdatebulk>.

After dragging the formula down the entire sheet, I copy the results of this new column to another one, using the “paste values” feature to carry over the content rather than the formula as the cell value.

[Screenshot: pasting values from formula]

I could have put nothing instead of “same” in my new column if the values were equal, but I wanted to know how many times these dates were equal. Sorting by my newbulkdate column I know that:

  • 777 rows only contained an inclusive date.
    • The formula as I have it would have returned a “0” here, because I didn’t tell it how to handle a blank cell in <unitdatebulk>.
    • Easy fix is to go back to my <unitdatebulk> column, find all the blank cells, and replace them with “empty.” Empty will carry forward with the formula. (I’m sure there is a way to handle this within the formula itself; one possibility is sketched after this list.)
  • 567 rows were identified as having a different value in bulk date so these dates were retained.
  • 1341 rows had an identical value in inclusive and bulk dates and were not kept.
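
For example, one way to handle the blank cells directly in the formula (a suggestion, not something I tested against the full sheet) is to check for an empty <unitdatebulk> first:

=IF(B2="", "", IF(A2=B2, "same", B2))

This returns a blank when there is no bulk date, “same” when the bulk date matches the inclusive date, and the bulk value otherwise.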

I can now do a global replace on the newbulkdate column to replace “same” and “empty” with nothing. I then remove my original <unitdatebulk> column and my bulkformula column.

[Screenshot: only different bulk dates remain]

Of course, this method only worked on cells where the characters were exactly the same. There will be bulk dates that are the same as the inclusive dates that this didn’t catch, such as values with different spacing or punctuation (ex: c.2007 vs. c. 2007).
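
One way to catch those near-duplicates would be to normalize both values before comparing, for instance by lowercasing and stripping spaces and punctuation. A rough Python sketch (our comparison stayed in Excel, so treat this as an idea rather than our method):

import re

def normalized(value):
    """Lowercase and strip spaces and common punctuation before comparing."""
    return re.sub(r'[\s.,;()\[\]?]', '', value.lower())

print(normalized('c.2007') == normalized('c. 2007'))   # True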

In other posts we’ll look at more date cleanup questions and issues.

History and Politics

I want to step away from ArchivesSpace migration and take a moment to summarize some of the legacy data at Special Collections and University Archives. Carrie did a great job of painting the legacy landscape at Columbia. Our situation is similar in many ways. One of her points couldn’t be more on target:

“We HAVE collection data, we just don’t have it stored or structured in a way that can be used or queried particularly effectively, and what we do have is in no way standardized.”

Until 2011, UMD maintained separate departments (and sometimes units within those departments) that were responsible for all the work pertaining to their collections. Curatorial units created and maintained data about their collections in their own ways, sometimes in ways the same as or similar to other units, but often not. Collections data lives in paper accession and control files, spreadsheets, Word documents, Access and FileMaker databases (for single collections or for similar types of materials), catalog records, finding aids, in someone’s head, etc. These files live on the server in different locations and generally without consistent file names. I’ll also throw in that we acquired the AFL-CIO records last fall, which come with thirty-plus years of collection data, including data from an archives management system.

In the summer of 2011 the following departments and units merged into one department:

  • Archives and Manuscripts department
    • Literature unit
    • Historical manuscripts unit
    • University Archives unit
  • Marylandia, Rare Books, and National Trust Library department
    • Marylandia and Rare Books unit
    • National Trust for Historic Preservation Library unit
  • Library of American Broadcasting department
  • National Public Broadcasting Archives department

Along with this move came the creation of “functional areas” that would manage specific common functions consistently across the new department. The Access Group became responsible for managing arrangement and description and associated functions for the entire department. Until I was hired in February 2013, there was not a person solely devoted to planning and managing this work, but rather multiple people on the access team who had other primary responsibilities outside of the team. The creation of my position is enabling SCUA to analyze our technical services operations, update our practices, and manage functions consistently.

It is interesting to note that, at least currently, there are three other special collections units in the Libraries (Gordon D. Prange Collection, Special Collections in the Performing Arts, and the International Piano Archives at Maryland) that operate outside of Special Collections and University Archives. SCUA provides some services to some of these units (the Beast database) and shares some policies/procedures (ex: the processing manual) with some of them.