Archival Description for Web Archives

If you follow me on Twitter, you may have seen that the task I set out for myself this week was to devise a way to describe web archives using the tools available to me: Archivists’ Toolkit, Archive-It, DACS and EAD. My goals were both practical and philosophical: to create useful description, but also to bring archival principles to bear on the practice of web archiving in a way that is sometimes absent in discussions on the topic. And you may have seen that I was less than entirely successful.

Appropriate to the scope of my goals, the problems I encountered were also both practical and philosophical in nature:

  • I was simply dissatisfied with the options that my tools offered for recording information about web archives. There were a lot of “yeah, it kind of makes sense to put it in that field, but it could also go over here, and neither is a perfect fit” moments that I’m sure anyone doing this work has encountered. A Web Archiving Roundtable/TS-DACS white paper recommending best practices in this area would be fantastic, and may become reality.
  • More fundamentally, though, I came to understand that the units of arrangement, description and access typically used in web archives simply don’t map well onto traditional archival units of arrangement and description, particularly if one is concerned with preserving information about the creation of the archive itself, i.e., provenance.



SAA 2014 Sessions of Interest

Here are a few sessions (not comprehensive!) related to the content of this blog at SAA this week:


Carrie: Friday, August 15 • 2:45pm – 3:45pm; SESSION 503 – How Are We Doing? Improving Access Through Assessment

Maureen: Friday, August 15 • 2:45pm – 3:45pm; SESSION 501 – Taken for Granted: How Term Positions Affect New Professionals and the Repositories That Employ Them

Meghan: Thursday, August 14 • 3:00pm – 3:30pm and Friday, August 15 • 4:00pm – 4:30pm; P05 PROFESSIONAL POSTER – Mapping Duke History with Historypin

Steve: Thursday, August 14 • 5:30pm – 7:30pm; Graduate Student Poster Presentations: ArchivesSpace and the Opportunity for Institutional Change

Figuring Out What Has Been Done: Folder Numbers that Go On and On and On

What was the problem?

As part of a retrospective conversion project, paper-based finding aids were turned into structured data. A lot of this work was done in Excel, and one problem was a mistake with folder numbers — instead of folder numbers restarting at one at the beginning of each box, the numbering continues into the next box. For instance, instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.

How did I figure this out?

Since I’m new here and not overly familiar with numbering conventions, I approached it in two ways. First, I wanted a list of finding aids that have really, really high folder numbers. This is really easy — I basically point to the folder elements and ask for the biggest one. Of course, fn:max() can only handle numbers that look like numbers, so I included a predicate [matches(.,'^[0-9]+$')] that only looks at folder numbers that are integers. This means that folder ranges and folders with letters in their names won’t be included, but it’s very unlikely that the biggest folder numbers in a collection would be styled this way.

xquery version "1.0";

declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";

(: for each EAD file in the corpus, report the highest integer folder number :)
<root>
{
 for $ead in ead:ead
 let $doc := base-uri($ead)
 (: keep only folder containers whose value is a bare integer, so max() can compare them as numbers :)
 let $folders := $ead//ead:container[@type="Folder"][matches(., '^[0-9]+$')]
 return
  <document uri="{$doc}">
  { max($folders) }
  </document>
}
</root>

Looking through this, there are a LOT of collections with really high folder numbers. When I dig in, I realize that in a lot of cases this is just a typo (for instance, a person means to type folder “109” but accidentally types “1090”). But I thought it would be good to know, more generally, which boxes in a collection have a “Folder 1”.

xquery version "1.0";

declare namespace ead="urn:isbn:1-931666-22-9";
declare namespace xlink = "http://www.w3.org/1999/xlink";

(: for each EAD file, list the boxes that contain a "Folder 1" :)
<root>
{
 for $ead in ead:ead
 let $doc := base-uri($ead)
 (: Folder containers whose value is exactly "1" :)
 let $folder1 := $ead//ead:container[@type="Folder"][. eq "1"]
 (: the Box container each of those folders sits alongside :)
 let $boxbelong := $folder1/preceding-sibling::ead:container[@type="Box"]
 return
  <document uri="{$doc}">
  { $boxbelong }
  </document>
}
</root>

And, like a lot of places, practice has varied over time here. Sometimes folder numbering continues across boxes for each series. Sometimes it starts over for each box. Sometimes it goes through the whole collection. This could be tweaked to answer other questions. Which box numbers that aren’t Box 1 have a folder 1? How many/which boxes are in this collection, anyway?
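
For instance, a lightly tweaked version of the query above can answer both of those questions at once: how many distinct boxes a finding aid has, and which boxes other than Box 1 contain a Folder 1. (A sketch, with the same context and namespace assumptions as the queries above.)

xquery version "1.0";

declare namespace ead="urn:isbn:1-931666-22-9";

<root>
{
 for $ead in ead:ead
 let $doc := base-uri($ead)
 (: how many distinct box numbers appear in this finding aid? :)
 let $boxcount := count(distinct-values($ead//ead:container[@type="Box"]))
 (: boxes other than Box 1 that contain a Folder 1 :)
 let $restart := $ead//ead:container[@type="Folder"][. eq "1"]
                 /preceding-sibling::ead:container[@type="Box"][not(. eq "1")]
 return
  <document uri="{$doc}" boxcount="{$boxcount}">
  { $restart }
  </document>
}
</root>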

From this, I got a good list of finding aids with really high folder numbers that will help us fix a few dumb typos and identify finding aids with erroneous numbering. We’re still on the fence about what to do with these (I think I would advocate just deleting the folder numbers, since I’m not a huge fan of them anyway), but we have a good start on understanding the scope of the problem.

Where are we with goals?

  1. Which finding aids from this project have been updated in Archivists’ Toolkit but have not yet been published to our finding aid portal?  We know which finding aids are out of sync in regard to numbers of components and fixed arrangement statements.
  2. During the transformation from EAD 1.0 to EAD 2002, the text inside of mixed content was stripped (bioghist/blockquote, scopecontent/blockquote, scopecontent/emph, etc.). How much of this has been fixed and what remains?
  3. Container information is sometimes… off. Folders will be numbered 1-n across all boxes — instead of Box 1, Folders 1-20; Box 2, Folders 1-15, etc., we have Box 1, Folders 1-20; Box 2, Folders 21-35.
  4. Because of changes from EAD 1.0 to EAD 2002, it was common to have duplicate arrangement information in 1.0 (once as a table of contents, once as narrative information). During the transformation, this resulted in two arrangement statements. We now know that only three finding aids have duplicate arrangement statements!
  5. The content of <title> was stripped in all cases. Where were <title> elements in the EAD 1.0 files, and has all the work been done to add them back to the 2002 versions?
  6. See/See Also references were (strangely) moved to parent components instead of where they belong. Is there a way of discovering the extent to which this problem endures?
  7. Notes were duplicated and moved to parent components. Again, is there a way of discovering the extent to which this problem endures? We now know which notes are duplicated from their children (a detection sketch follows this list).
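
For the duplicated-notes problem in item 7, a detection pass can compare each component’s note against the notes of its descendants. The sketch below uses scope and content notes as the example and assumes the same query context as the folder-number queries above:

xquery version "1.0";

declare namespace ead="urn:isbn:1-931666-22-9";

(: components whose scope and content note repeats, verbatim, a note found in one of their descendant components :)
for $note in //ead:dsc//ead:scopecontent
let $component := $note/parent::*
where some $child in $component//ead:scopecontent[not(. is $note)]
      satisfies normalize-space($child) = normalize-space($note)
return $component/ead:did/ead:unittitle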

Exporting, Editing, and Importing EAD in Archivists’ Toolkit: A Checklist

Sometimes, it can be extremely helpful to take EAD XML files out of Archivists’ Toolkit to edit them.  Maybe you have a contents list that you generated from a spreadsheet, or maybe you want to quickly change 500 “otherlevel”s to “file”s.  Since there are so many small steps, I created a checklist.  Using the checklist will help to make sure that information doesn’t get lost and that the record looks like you want it to.

First, a word of caution: when the record is imported back into AT, it will overwrite all refids with new ones.  So if you’re using those refids elsewhere, this won’t work.  Additionally, before exporting the record, it’s important to copy down information that won’t be included in the export.  This includes any repository processing notes and linked accession records.  This is also why it’s important to make sure that “internal only” notes are included in the export.  Also, the file won’t re-import with barcode information, because barcodes are kept as non-valid attributes and violate the importer’s validation rules.

We found that AT adds information on export that we don’t want on re-import, or maps information to different fields. For example, at Tamiment, we use the Container Summary field on the “Basic Info” tab to record the container summary. When this is exported, it maps to <extent> in <physdesc>. When it’s re-imported into the Toolkit, it does not go back into the Container Summary but becomes a Physical Description note. You can also fix some of these things in the EAD XML file itself, instead of after importing into AT.
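
For bulk edits of that sort, like the “otherlevel”-to-“file” change mentioned earlier, one option is an XQuery Update pass over the exported file rather than hand editing. This is only a sketch, and it assumes a processor that supports XQuery Update (BaseX or eXist, for example):

xquery version "3.0";

declare namespace ead="urn:isbn:1-931666-22-9";

(: change every component whose level is "otherlevel" to "file" in the exported EAD :)
for $level in //@level[. = "otherlevel"]
return replace value of node $level with "file"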

You can find my checklist here or below

Before Exporting EAD:

  • Write down which accession records are linked to the resource record
  • Record any information in repository processing note(s)
  • Do NOT check “Suppress components and notes when marked ‘internal only’” when exporting the original resource record

Before importing EAD:

  • If there are barcodes: do a find/replace on containers (using dot matches all) to delete the barcodes, or script it (see the sketch after this list)
  • Make sure that the record it is replacing has been deleted
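
The barcode cleanup can also be scripted rather than done as a manual find/replace. This sketch assumes the barcodes live in a hypothetical, non-valid @barcode attribute on <container>; substitute whatever attribute your AT export actually produces:

xquery version "3.0";

declare namespace ead="urn:isbn:1-931666-22-9";

(: delete the invalid barcode attributes that make the AT importer reject the file :)
(: @barcode is a guess at the attribute name; adjust to match your export :)
for $barcode in //ead:container/@barcode
return delete node $barcode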

After importing EAD:

Basic Description:

  • Separate the prefix and numeric sections of the Resource Identifier into separate fields
  • Remove bulk dates from Date Expression field (this may also need to be done at the series or sub-series level)
  • Copy the text from the General Physical Description note into the Container Summary

Notes:

  • Remove General Physical Description note

Finding Aid Data:

  • Remove call number from Finding Aid Title field
  • Remove “Collection processed by” in Author field

Barcodes

  • Re-enter barcodes

Accessions:

  • Re-link resource record to accession record(s)

Building a Case: Semantic URLs for Finding Aids

I’ve been working on the Beast project with Cassie as my final field study before graduation (sniff). My task has been to look at the resource records, analyze the EAD being produced by the Beast and how that will translate to ArchivesSpace, and identify the data cleanup steps needed to get us there. It’s been a fantastic learning experience, and I’m sad to be leaving now that the semester is ending.

One of the things I’ve spent a lot of time in graduate school thinking about is usability and user experience, and although it’s outside of the scope of my project, there’s something that’s been bugging me…

The Problem (as I see it)

When you look at a finding aid generated through ArchivesUM, the page URL looks something like this:

http://digital.lib.umd.edu/archivesum/actions.DisplayEADDoc.do?source=/MdU.ead.histms.0011.xml&style=ead

The ability to see which database actions are performed when a finding aid is called is of absolutely no use to the person viewing the page. The XML file name inserted in the middle is also useless, because it is related to none of the identifiers a researcher would use to access the collection.

By way of contrast, the URL for this post looks something like this:

https://icantiemyownshoes.wordpress.com/2014/05/08/semantic-urls-for-finding-aids/

WordPress, like many other sites, creates semantic URLs for each of the pages it generates.  It clearly identifies the source of the page, the date it was originally posted, and some human-readable form of the title, which can be altered by the author of the post.

Like the ArchivesUM URL, the WordPress URL provides a unique, static identifier for the information contained therein. Unlike its ArchivesUM counterpart, it gives the viewer several important pieces of information, such as the date of publication and whether the page is a sub-topic of the site. This has an impact on its findability, both on the website itself and when it appears on a Google search results page. Users are quickly able to determine whether they find the source trustworthy, how recent it is, and that it covers the topic they are interested in.

Semantic URLs also clearly identify one subpage of a website from another. This can have an impact on search engine optimization, as pages with similar long and indecipherable URLs may not be crawled.

Possible Solution

Let’s take a look at what some other repositories are doing with their URLs.  I’ve been using Princeton and Duke as touchstones throughout my project because they’re clearly thinking progressively about a lot of things, including URLs:

Princeton: http://findingaids.princeton.edu/collections/C0159

Duke: http://library.duke.edu/rubenstein/findingaids/africanamericanmisc/

In both cases, the last part of the URL is inserted using the <eadid> tag. Princeton uses its collection number, while Duke uses a shortened version of the collection title. Both are clean and easy to read. Users like it when they can easily understand a URL; it helps build confidence and trust in the website. It is arguable how practical these URLs are to the average user, since they’re not going to memorize them, but the consistent structure may be useful for archives staff. The same cannot be said for the current ArchivesUM URL standard.

I think the ArchivesSpace transition gives us the chance to adopt a new URL standard that looks something like this:

http://digital.lib.umd.edu/archivesum/finding-aids/<eadid>

As with the Duke example above, this would use the <eadid> tag to provide some version of the title of the collection. It could contain either a shortened title or the full title of the collection, e.g.:

http://digital.lib.umd.edu/archivesum/finding-aids/Adelphi-Citizens-Association-records

Admittedly, that could get a bit long for a semantic URL, but it would avoid confusion with similar records. This will be something to work out on the policy level.
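
To make the formatting question concrete, here is one way a slug could be derived from the collection title. This is only a sketch; whether the slug is stored in <eadid> or generated at publication time, and exactly how the title is normalized, are the policy decisions described above:

xquery version "1.0";

declare namespace ead="urn:isbn:1-931666-22-9";

(: turn the collection title into a URL slug, e.g.
   "Adelphi Citizens Association records" becomes "Adelphi-Citizens-Association-records" :)
for $title in //ead:archdesc/ead:did/ead:unittitle
let $slug := replace(replace(normalize-space($title), '[^A-Za-z0-9]+', '-'), '-$', '')
return concat('http://digital.lib.umd.edu/archivesum/finding-aids/', $slug)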

The Pros of This Approach:

-Quickly conveys information about the repository and collection name

-Short and human-readable, which is an advantage for search engines

-Provides level of trust to user (which is admittedly hard to quantify)

-Can be done using the EAD ID field in ArchivesSpace

-Easy for reference archivists and researchers to identify collection by URL

-Removes “sausage making” database calls currently in URL, which reduces researcher confusion

-Can remain static even if database structure is changed

-If the finding aid is cross-linked on another page, users will have an idea where they’re going before they click on the link.

And the Cons:

-Can be confusing if collections have similar titles

-Need to set very clear rules for formatting, or automate it when converting from ArchivesSpace to finding aid

-Length is a concern: the URL should be short but still convey the information

-Requires policy changes for both the contents of <eadid> and the use of unique identifiers

The negatives to adopting a semantic URL approach are primarily in the implementation and can be mitigated by proper planning and clear policies. The positives boil down to the fact that for a relatively small amount of effort, we can have a huge impact on user experience as well as search engine optimization.

We are implementing ArchivesSpace in part to better serve our researchers into the future. It’s important, then, to consider everything about our EAD content, our finding aids, and our websites from the user’s perspective. The URL is the first thing a researcher will encounter, so why not start there?

 

Case Study: Clean Data, Cool Project

Every now and then I get to work on a project from the very beginning, meaning that instead of cleaning up legacy data, I get to define and collect the data from scratch. Such was the case with one of Duke’s recent acquisitions, the records of the Southern Poverty Law Center Intelligence Project. Beginning in the 1970s, SPLC collected publications and ephemera from a wide range of right-wing and left-wing extremist groups. The Intelligence Project included groups monitored by SPLC for militia-like or Ku Klux Klan-like activities. There are also many organizations represented in the collection that are not considered “hate groups”; they simply made it onto SPLC’s radar and therefore into the Project’s records. The collection arrived at Duke in good condition, but very disorganized. Issues of various serial titles were spread across 90 record cartons with no apparent rhyme or reason. Inserted throughout were pamphlets, fliers, and correspondence further documenting the organizations and individuals monitored by SPLC.

What do you do when an archival collection arrives and consists mostly of printed materials and serials? In the past, Duke did one of two things: either pull out the books/serials and catalog them separately, or leave them in the archival collection and list them in the finding aid, sort of like a bibliography within a box list. This project was a great opportunity to try out something new. In consultation with our rare book and serials catalogers, we developed a hybrid plan to handle SPLC. Since we had to do an intensive sort of the collection anyway, I used that chance to pull out the serials and house each title separately. They are now being cataloged individually by our serials cataloger, which will get them into OCLC and therefore more publicly available than they would ever be if just buried in a list in the finding aid. She is also creating authority records for the various organizations and individuals represented in the collection, allowing us to build connections across the various groups as they merged and split over time. While she catalogs the serials, I have been archivally processing the non-serial pieces of the collection, tracking materials by organization and describing them in an AT finding aid. When all of the serials are cataloged, I will update the finding aid to include links to each title, so that although the printed materials have been physically separated from their archival cousins, the entire original collection will be searchable and integrated intellectually within the context of the SPLC Collection.

To further ensure that the SPLC serials did not lose their original provenance, we developed a template that our cataloger is applying to each record to keep the titles intellectually united with their original collection. All of the serials being cataloged are receiving 541 and 561 fields identifying them as part of the SPLC Collection within the Rubenstein Special Collections Library. We are also adding 710s for the Southern Poverty Law Center, and an 856 that includes a link to the SPLC collection guide. (Duke inserts all its finding aid links in the 856 field, but we rarely do this for non-manuscript catalog records.) The result is a catalog record for each serial that makes it blatantly obvious that the title was acquired through the SPLC Collection, and that there are other titles also present within the collection, should researchers care to check out the links. But, cataloging the serials this way also allows the researcher to find materials without necessarily searching for “SPLC.”


An example of one of the SPLC serials: The Crusader, a KKK publication.

Along with hammering out our various print and manuscript workflows to better meet the needs of this collection, we also saw it as an opportunity to create and collect data that would allow us to easily extract information from all the discrete catalog records we are creating. We are being as consistent as possible with controlled vocabularies. Our serials cataloger is adding various 7xxs to track each publisher using RBMS or LOC relator codes. LOC geographic headings are being added as 752s. We are also trying to be consistent in applying genre terms in the 655 field using the RBMS gathering term “Political Works.”


A view of the MARC fields from The Crusader’s catalog record.

Equally important, we are replicating this sort of data collection in the archival description of the non-serial portions of the SPLC Collection. When we finally reunite the serials with the finding aid, the same sort of geographic, subject, and publisher data will allow us to match up all of the fields and create relationships between an organization’s random fliers and its various newsletters.

Furthermore, my colleagues and I have dreams of going beyond a basic finding aid to create some sort of portal that will capitalize on our clean data to offer researchers a new way to access this collection. SPLC’s own website has a neat map of the various hate groups it has identified in the United States, but we would like to build something that specifically addresses the organizations and topics represented in this particular collection–after all, the Intelligence Project collected materials from all sorts of groups. We’re thinking about using something like Google Fusion Tables or some other online tool that can both map and sort the groups and their various agendas, but also connect back to the catalog records and collection guide so that researchers can quickly get to the original sources too.

I’ll have more to report on this cool project — and what we end up doing with our clean data — as it continues to progress over the next few months. Already, our serials cataloger has created 55 new OCLC records for various serial titles, and has replaced or enhanced another 140. She’s about halfway done with the cataloging part of the project. With so many of these groups being obscure, secretive, or short-lived, we believe that creating such thorough catalog records is worth our time and energy. Not only will it make the titles widely discoverable in OCLC, but hopefully it will build connections for patrons across the diverse organizations represented within this collection.

Our EAD — Standards Compliance

I mentioned in an earlier post that in anticipation of our three big archival systems projects (migration to ArchivesSpace from Archivists’ Toolkit, implementation of Aeon, and re-design of our finding aids portal), we’re taking a cold, hard look at our archival data. After all, both Aeon and the finding aids portal will look directly at the EAD to perform their functions — both use XSLT to display, manipulate, and transform the data.

So, there are some basic things we want to know. Will our data be good enough for Aeon to be able to turn EAD into a call slip (or add it to the proper processing queue, or know which reading room to send the call slip to)? Are our dates present and machine readable in such a way that the interface would be able to sort contents lists by date? And, while we’re at it, do our finding aids meet professional and local standards?

Let’s take a look at our criteria.

A single-level description with the minimum number of DACS elements must include:

  • Reference Code Element (2.1) — <unitid>
  • Name and Location of Repository Element (2.2) — <repository>
  • Title Element (2.3) — <unittitle>
  • Date Element (2.4) — <unitdate>
  • Extent Element (2.5) — <extent>
  • Name of Creator(s) Element (2.6) (if known) — <origination>
  • Scope and Content Element (3.1) — <scopecontent>
  • Conditions Governing Access Element (4.1) — <accessrestrict>
  • Languages and Scripts of the Material Element (4.5) — <langmaterial> (I decided to be generous and allow <langmaterial> with just a <language> child carrying a @langcode attribute, although I would prefer that there be content in the note)

For a descriptive record to meet DACS optimum standards, it must also include:

  • Administrative/Biographical History Element (2.7) — <bioghist>
  • Access points — <controlaccess>

At Tamiment, we’ve determined that the following elements must be included in a finding aid to meet local standards:

  • Physical location note — <physloc>
  • Restrictions on use note — <userestrict>
  • Immediate source of acquisition note — <acqinfo>
  • Appraisal note — <appraisal>
  • Abstract — <abstract>
  • Arrangement note — <arrangement>
  • Processing information note — <processinfo>
  • Our local standards also require that every series or subseries have a scope and content note, every component have a title, date and container, and every date be normalized.

I’ll talk about our reasons for these local standards in subsequent blog posts.
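
To see how a finding aid measures up against the DACS single-level minimum above, a query along these lines reports which required elements are missing from each file. This is only a sketch; it tests for element presence alone, and a fuller version would also handle the <language>/@langcode allowance and the optimum and local lists above:

xquery version "1.0";

declare namespace ead="urn:isbn:1-931666-22-9";

(: report which DACS single-level minimum elements are missing from each finding aid :)
for $ead in ead:ead
let $did := $ead/ead:archdesc/ead:did
let $missing := (
 if (empty($did/ead:unitid)) then "unitid" else (),
 if (empty($did/ead:repository)) then "repository" else (),
 if (empty($did/ead:unittitle)) then "unittitle" else (),
 if (empty($did/ead:unitdate)) then "unitdate" else (),
 if (empty($did//ead:extent)) then "extent" else (),
 if (empty($did/ead:origination)) then "origination" else (),
 if (empty($ead/ead:archdesc/ead:scopecontent)) then "scopecontent" else (),
 if (empty($ead/ead:archdesc/ead:accessrestrict)) then "accessrestrict" else (),
 if (empty($did/ead:langmaterial)) then "langmaterial" else ()
)
return
 <document uri="{base-uri($ead)}" missing="{string-join($missing, ', ')}"/>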

Finally, we’ve started thinking about which data elements must be present for us to be able to use the Aeon circulation system effectively. To print a call slip, a component in a finding aid needs the following information. Useful (but not required) fields are italicized; a sketch of a quick completeness check follows the list:

  • Reference code element / call number — <unitid>. We have to know what collection the patron is requesting.
  • Repository note — <repository>. This should be a controlled string, so that the stylesheet knows which queue to send the call slip to. It may also be possible to do post-processing to add an attribute to this tag or a different tag, so that the string can vary but the attribute would be consistent enough for a computer to understand. In any case, we need SOME piece of controlled data telling us which reading room to visit to pull this material.
  • Container information — <container>. Every paged container should have a unique combination of call number and box number. There’s no good way to check this computationally — we’ve all seen crazy systems of double numbering, numbering each series, etc.
  • Collection title — <unittitle>. This is the title of the collection, which is useful for paging boxes.
  • Physical location note — <physloc>. This isn’t strictly necessary, but it is very useful to know whether boxes are onsite or offsite.
  • Access restrictions — <accessrestrict>. This is an operational requirement. By having the access restriction note, the page can see right away whether it’s okay to pull this box.
  • Fancy-pants scripting piece to add location information…. This would require a lot of data standardization (and probably data gathering, in some cases), but it would be great to have the location on the repository-eyes-only side of the call slip.
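
One quick way to gauge the biggest Aeon gap we found, components that need paging but have no container, is a query along these lines. It is only a sketch: series- and subseries-level components legitimately lack containers, so they would need to be filtered out or accounted for when interpreting the counts.

xquery version "1.0";

declare namespace ead="urn:isbn:1-931666-22-9";

(: count described components in the contents list that have a title but nothing to page :)
for $ead in ead:ead
let $nocontainer := $ead//ead:dsc//ead:did[ead:unittitle][empty(ead:container)]
return
 <document uri="{base-uri($ead)}" missing-containers="{count($nocontainer)}"/>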

So, how are we doing?

[Chart: DACS and local standards compliance across the harvested finding aids]

Frankly, I was pleasantly surprised. As the chart shows, out of the 1,217 finding aids in that harvest, about two-thirds meet DACS single-level and optimum requirements. The reasons for failure vary: many are missing creator information, notes about the conditions governing access, and information about the language of the material. Happily, information about the historical context of the collection and the presence of access points is fairly common.

We also see that the vast majority of our finding aids will meet the requirements for Aeon compliance. The problem of components without containers is a big one, but is something that we’ve obviously dealt with using paper call slips, and will have to be a remediation priority. Once this is addressed, we still have the outstanding issue of how to consistently tell the computer where a finding aid is coming from. Once we decide how we want that data to look, we’ll be able to fix it programmatically.
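
Once we decide what that controlled data should look like, stamping it onto the EAD could be a small XQuery Update pass. In the sketch below, both the attribute (@label) and the value ("tamiment") are placeholders for whatever we eventually choose:

xquery version "3.0";

declare namespace ead="urn:isbn:1-931666-22-9";

(: add a controlled routing value to <repository> so a stylesheet knows which reading room gets the call slip :)
(: both the attribute (@label) and the value ("tamiment") are placeholders :)
for $repo in //ead:archdesc/ead:did/ead:repository[not(@label)]
return insert node attribute label { "tamiment" } into $repo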

Our most distressing number is about local compliance, and the biggest offenders are physical location, immediate source of acquisition, and appraisal information. This reflects an overall trend in our repository of being careless with administrative information — we have very little information about when and how collections came and what interventions archivists made.

The requirement that appraisal information be included is extremely recent — unfortunately, this is the kind of information that is difficult to recover if not recorded at the time of processing. Hopefully, some information about appraisal may be included in processing information and separated materials notes.

For anyone interested in how our data breaks down, a chart is below.

[Chart: compliance broken down by element]