Using grep to Control Vocabulary

The Archival Services department at the Center for Jewish History in New York provides processing services to the Center’s five partner organizations (American Jewish Historical Society, American Sephardi Federation, Leo Baeck Institute, Yeshiva University Museum and YIVO Institute for Jewish Research). The department is six full-time archivists, one part-time archivist, and a manager (that’s me).

Three archivists are currently processing the records of Hadassah, the Women’s Zionist Organization of America, which are on long-term deposit at the American Jewish Historical Society. The existing arrangement and description of the roughly 1,000 linear feet of materials varies widely (from item to record group level, and everything in between). The ultimate goal is folder-level control over the entire collection, using as much of the legacy description as possible. A high-level summary finding aid is available here: http://findingaids.cjh.org/?pID=2916671

As we process, we are trying to ensure that the terms in our narratives and assigned titles (names and places in particular) are consistent. This is tricky – there are many variant spellings and transliterations, and name changes abound as well. So we asked ourselves, how can we run a set of terms against a body of description?

I tried using an XSLT stylesheet, a Schematron, and XQuery via BaseX, but I kept running into problems with string processing. I’m sure there are many other ways to peel this grape, but eventually I tried using the Unix command-line program grep, and ultimately was successful. Most of this is stolen directly from a Stack Exchange post cleverly titled How to find multiple strings in files?.

We ended up with the following as our workflow, which we run across all the project finding aids periodically or as we add new terms to our list or create new finding aids.

  • First and foremost, come up with a list of preferred terms and their alternates. Save this compilation somewhere (we are using an email chain at the moment to do the work, and then saving a text file on the shared project drive). A couple of caveats here – if our preferred narrative term differs from the LC term, we use the LC term in a controlled tag like persname or corpname, and we introduce our preferred term in the narrative together with the alternates. The goal is to have consistent terms in our finding aids so they are easily searchable.
  • Create a text file that contains the alternate terms we want to avoid, one per line. Save this as patterns.txt.

A list of terms we want to avoid
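For illustration only – these are invented examples rather than terms from the project’s actual list – a patterns.txt of variant spellings might contain lines like:

Henriette Szold
Youth Alijah
Histadruth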

  • Save into one folder all the files you want to check – in our case, we started with the three completed RG-level finding aids in EAD, rg1.xml, rg5.xml, and rg13.xml. You can run the terms against any text file though, such as an HTML finding aid or a text document.
  • Create a virtual Unix machine (I used Oracle VirtualBox to create an Ubuntu 12.10 machine – this is the same way BitCurator is installed, so one could follow those instructions, substituting a regular Ubuntu disk image for the BitCurator image). NB THIS PART IS TRICKY. If you’ve never installed a virtual machine before, this could require some time and effort. Of course, if you are in a *nix environment, you can skip the VM entirely.
  • Boot up the virtual machine by highlighting the correct machine and hitting “Start.”

The virtual machine start screen

  • In the virtual machine, use the Devices menu to enable bi-directional copy-and-paste and bi-directional drag-and-drop. This allows you (in theory!) to move files and copy text between your Windows desktop and the VM. I always have a hard time getting this to work; sometimes I email myself files from inside the virtual machine.
  • Install aha (an ANSI to HTML converter) via the command line (Ctrl-Alt-T to open a terminal):

sudo apt-get install aha

  • Create a folder on the virtual machine desktop, and drag your patterns.txt and the finding aids to be checked over from your desktop.
  • Open the command line (Ctrl-Alt-T), and navigate to the folder where your files are found.
  • Type the following (this will change DOS/Windows line endings from CRLF to just LF; thanks to Google and the hundreds of people who have encountered and posted about this frustrating quirk!):

sed 's/\r$//' < patterns.txt > patterns_u.txt

  • You are now ready to run the key command, grep:

grep -n rg*.* -iHFf patterns_u.txt --color=always | aha > output.html

So what’s going on here?

grep is a powerful Unix tool for pattern matching
-n is a flag that prints the line numbers of the found terms in the output
rg*.* grabs all the files we are checking, in this case everything starting with rg
-i flag: make the search case-insensitive
-H flag: print the filename in the output (useful when looking at multiple files at once)
-F flag: read the patterns as fixed strings, not regular expressions
-f flag: read the search strings from a file, in this case patterns_u.txt
patterns_u.txt: our list of terms we are looking for
--color=always: this flag makes the output pretty, with ANSI colors
| aha > output.html: this pipes the standard terminal output through aha into pretty HTML

  • Examine the results by opening output.html in a browser; matched terms are in red:

The terms we seek, in red
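Under the hood, each hit in the raw terminal output follows grep’s filename:line number:matching line format, with the matched term colored; something like this (the line content here is invented for illustration):

rg1.xml:412:<unittitle>Correspondence with the Youth Alijah office</unittitle>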

And, voila, we can go back and examine where these terms occur, and see if they should be changed to the preferred term. Since this requires human judgment, it’s not automated, but it should be possible to add some find-and-replace functionality using a variety of tools (maybe a shell script that loops through the terms list and uses sed or awk to replace them?). But I leave that for brighter minds than mine.
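For the curious, here is a minimal sketch of what such a script might look like, assuming a hypothetical tab-separated file called replacements.txt with an alternate term and its preferred form on each line. It works on copies of the finding aids and does no escaping, so terms containing sed special characters like / or & would need extra care:

#!/bin/bash
# Sketch: swap each alternate term for its preferred term in copies of the finding aids.
# replacements.txt is a hypothetical tab-separated file: alternate<TAB>preferred
mkdir -p replaced
cp rg*.xml replaced/
while IFS=$'\t' read -r alternate preferred; do
    # skip blank lines
    [ -z "$alternate" ] && continue
    # note: unlike the grep search above, this replacement is case-sensitive
    sed -i "s/$alternate/$preferred/g" replaced/*.xml
done < replacements.txt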

Thanks to my processing team, Andrey Filimonov, Nicole Greenhouse and Patricia Glowinski, for working this out with me; to Maureen and the gang for letting me post this; and to Nicole for encouraging me to write it up.

How I learned to stop worrying and love the API

At University of Maryland, we are migrating from a local MS Access database appropriately named the Beast. We chose to begin our migration project with our accessions data. To get this data into ArchivesSpace we decided to use the csv importer since it seemed to be the easiest format to crosswalk our data to, and honestly, the only option for us at the time.


Okay. Let me catch my breath.

For us, it seemed that the lowest barrier for getting our accession data into ArchivesSpace was to use the csv importer. Since we could get our data out of the Beast in a spreadsheet format, this made the most sense at the time. (Oh, if we had only known.)

Our data was messier than we thought, so getting our data reconciled to the importer template had its fair share of hiccups. The clean-up is not the moral of this story, although a bit of summary may be useful: some of the issues were our own doing, such as missing accession numbers that required going back to the control files, and just missing data in general. Our other major issue was understanding the importer and the template. The documentation contained some competing messages regarding the list of columns and the importance (or unimportance) of column order, as well as unanticipated changes to the system that were not always reflected in the csv importer and template. We did finally manage to get a decent chunk of our data cleaned and in the template after almost a year of cleaning and restructuring thousands of records.

AND THEN. Just when we thought we had it all figured out, ArchivesSpace moved processing/processing status from collection management to events. Unfortunately, at the current time there is not a way to import event information via the CSV importer. So we were stuck. We had already invested a lot of time in cleaning up our accessions data and now had a pretty important piece of data that we could no longer ingest in that same manner.

In comes the ArchivesSpace API to save the day!!

[In hindsight, I wish we had just used the API for accessions in the first place, but when this process began we were all just wee babes and had nary a clue how to really use the API and really thought the csv importer was the only option for us. Oh how far we’ve come!]

So, we revised our process to:

  1. Clean accessions in Excel/OpenRefine
  2. Keep the processing data we would need to create the event record in a separate sheet to keep the data together
  3. Import accessions (minus the processing event info) using csv importer
  4. After successful import, have a bright-eyed student worker (thanks Emily!) do the thankless task (sorry Emily!) of recording the ID of each accession (which the API will need to associate the processing event with the correct accession) into that separate sheet mentioned in step 2
  5. Using the spreadsheet from step 4 as the source, create a spreadsheet that includes the accession ID and associated processing status with the rest of the information required for importing events (getting to know the various ArchivesSpace data schemas is important). To make life easier, you may want to just name the columns according to the schema elements to which they will map.
  6. Since the API wants this to be in a JSON file, I then upload this spreadsheet file into OpenRefine (see screenshot above). This gives me a chance to double check data, but most importantly, makes it REALLY easy for me to create the JSON file (I am not a programmer).
  7. Once I am happy with my data in OpenRefine, I go to Export > Templating, then I put in the custom template (see below) that I’ve created to match the data schemas (listed in step 5). Since some of it is boilerplate, I didn’t need to include it in the spreadsheet.


Here’s the template I developed based on the schemas for event, date, and linked records:


{"jsonmodel_type":{{jsonize(cells["jsonmodelType"].value)}},"event_type":{{jsonize(cells["event_type"].value)}},"external_ids":[],"external_documents":[],"linked_agents":[{"role":"executing_program","ref":"/agents/software/1"}],"linked_records":[{"role":"source","ref":"/repositories/2/accessions/{{jsonize(cells["linked_records"].value)}}"}],"repository":{"ref":"/repositories/2"},"date":{"label":{{jsonize(cells["label"].value)}},"date_type":{{jsonize(cells["date_type"].value)}},"expression":{{jsonize(cells["date"].value)}},"jsonmodel_type":"date"}}

Then export! Make sure to save the file with a memorable filename.
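If everything maps correctly, each line of the exported file should come out looking roughly like this (the values here are invented for illustration):

{"jsonmodel_type":"event","event_type":"processed","external_ids":[],"external_documents":[],"linked_agents":[{"role":"executing_program","ref":"/agents/software/1"}],"linked_records":[{"role":"source","ref":"/repositories/2/accessions/1234"}],"repository":{"ref":"/repositories/2"},"date":{"label":"event","date_type":"single","expression":"2015-06-01","jsonmodel_type":"date"}}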

I then open the file in a text editor (for me, TextWrangler does the trick) and I have to do two things: make sure all whitespace has been removed (using find and replace), and make sure there is one JSON record per line (a regex find and replace of \r). However, you should be able to create the template in such a way as to take care of this for you.

Then, I put together a little bash script that tells curl to take the json file that was just created, read it line by line and POST each line via the API.

#!/bin/bash

# POST each line of your_events.json to the ArchivesSpace API.
# Assumes the session ID has been exported as $TOKEN (see below) and that each
# line is a complete JSON record with no internal whitespace, since the for
# loop splits its input on whitespace.
url="http://test-aspace.yourenvironment.org:port/repositories/[repo#]/event"
for line in $(cat your_events.json);
do
    echo `curl -H "X-ArchivesSpace-Session: $TOKEN" -d "$line" $url`;
done

Now, I just need to transfer both the bash script and the JSON file from my local machine to the ArchivesSpace server (using the command scp <filename> <location>; if you’re like me, you may have needed to ask a sysadmin how to do this in the first place).

Make sure you have logged in, and exported the Session ID as $TOKEN. (I won’t walk you through that whole process of logging in, since Maureen outlines it so well here, as does the Bentley here.)
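For reference, that step boils down to something like the following sketch, assuming an admin login on the same hypothetical test host (the posts linked above walk through it properly):

# log in, pull the session ID out of the JSON response, and export it as $TOKEN
# so the curl call in the script above can use it
TOKEN=$(curl -s -F password="your_password" "http://test-aspace.yourenvironment.org:port/users/admin/login" | grep -o '"session":"[^"]*"' | cut -d'"' -f4)
export TOKEN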

Now, from the command line, all you need to do is:

bash curl_json.sh

And there you go. You should see lines streaming by telling you that events have been created.

If you don’t…or if the messages you see are errors and not successes, fear not. See if the message makes sense; often it will be an issue with a hard-to-catch format error in the JSON file, like a missing colon or comma, or an extra ‘/’ (I speak from experience). These are not always easy to suss out at first, and trust me, I spent a lot of time with trial and error figuring out what I was doing wrong (I am not a programmer, and still very, very new at this).

Figuring out how to get our processing event data into ArchivesSpace after hitting a major roadblock with the csv importer still feels like a great accomplishment. We were initially worried that we were going to have to either a) go without the data, or b) enter it manually. So to find a solution that doesn’t require too much manual work was satisfying, professionally speaking (did I mention I’m not a programmer and had never really dealt with APIs before?).

So to all of you out there working in ArchivesSpace, or in anything else, and you feel like you keep hitting a wall that’s a bit higher than what you’ve climbed before, keep at it! You’ll be amazed at what you can do.

Processing Levels: The Hows and Whys

It’s no surprise to anyone who has been reading this blog that I am a firm believer in building a processing program that relies heavily on minimal processing techniques, including developing a program that applies different levels of processing to collections, or parts of collections. Describing our collections is one of the most important things that we as archivists do, and also one of the most time-consuming and expensive. We want to make sure that our time and intellectual capital are being well spent, and I firmly believe that the thoughtful, intentional application of processing levels is a really important way to ensure that. This leads to more accessible collections, encourages collection-level thinking and analysis, and opens up archivists’ time to work on big, strategic projects.

Standards like DACS encourage this kind of collection-level thinking and support different levels of arrangement and description, but there’s not a lot of advice out there about how and when to apply each of these levels (though the University of California’s Efficient Processing Guidelines, as always, does a great job of this). How do we decide if a collection is a good candidate to describe at the collection level versus the file level? And what principles guide these decisions? Here I’ll give you some thoughts on the principles I’ve used to build a levels-based processing program, and then some criteria I use when making decisions about an individual collection.

Thinking Programmatically

Put Your Thinking Cap On:

Start by analyzing the records (their content and context) at a high level. What does the collection as a whole tell us? What are the major pieces of the collection and how do they relate to each other? Why does that matter? How can we best expose all of this to researchers? I’m not gonna lie: this is hard. This is way harder than making a container list. However, it brings a lot more value to the table. Archivists are trained to understand the ways that records are created, and to assess their potential value as evidence, information, and/or as symbols. Often, by doing this higher-level intellectual work at the outset, we can create very robust and meaningful description that exposes how the parts of the collection come together as a whole and how they function, without doing a significant amount of work in the collection.

Define Terms and Build Consensus:

Be clear about what you mean by a level of processing. It is critical that all stakeholders—archivists, curators, research services staff, donors—are on the same page about what it means for a collection to be arranged and described at a certain level. This means defining and documenting these terms and circulating them widely throughout the organization. It also means being clear about the amount of both time and money required to arrange and describe collections to different levels.

It’s also very important to involve institutional stakeholders in your decision-making process. Assessing stakeholder needs and managing expectations from the outset helps to ensure that processing projects go smoothly and end with happy stakeholders. In my institution this generally means that the archivists work with curators to make sure that we understand the value and needs of the collection, that they understand what I mean by a certain level of description, and that I can clearly communicate how more detailed processing of one collection impacts the time and resources available to address other projects that individual has a stake in.

Always Look For the Golden Minimum:

I always approach assigning processing levels by determining what the goals for a collection are (determined in conjunction with stakeholders!) and what path provides the lowest set of barriers to getting there. Greene and Meissner call this sweet spot, where you meet your stated needs with the lowest investment of time and resources, the golden minimum, and this should be the goal of all processing projects.

Processing is Iterative:

This is huge for me. I go back and tweak description ALL THE TIME. Sometimes I’ve straight up misjudged what amount of description was necessary to facilitate discovery; sometimes research interests and needs change and the old levels of arrangement or description don’t cut it anymore. Your needs change and evolve, researchers’ needs change and evolve over time, the institutional context changes, and sometimes you realize that something, for whatever reason, just isn’t working. You always have the option to go back into a collection and do more. You never, however, have the ability to recapture the time that you spent on a collection up front, so be thoughtful about how you apply that time to best meet the needs of the institution, the researchers, you and your colleagues, and the collection itself.

Arrangement and Description are Not the Same Thing:

And they don’t need to happen at the same level, nor at the same level across all parts of a collection. A collection can be arranged at the series level and described at the file level. Or completely unarranged but described at the series level. By breaking apart these two aspects of processing we have more flexibility in how we approach collections and make them available, and we can be more efficient and effective in managing each individual collection and serving our users.

Discovery and Access are Key:

At the end of the day, the main things to keep in mind when determining the most appropriate processing level are discovery and access. The main goal of any processing project is to give users enough information in our description to both identify the material they are most interested in, and to be able to put their hands on it. How much description is necessary to find relevant material in a collection? What do you need to know to successfully retrieve that relevant box?

Making Decisions at the Collection Level

Now that we know why we’re doing this, and what principles are guiding the application of processing levels, here are some criteria that I use to determine what the most appropriate levels of arrangement and description for a collection are:

  • Research Value and Use: If a collection has a high research value and you anticipate heavy and sustained use, it may well be worthwhile to invest additional time and resources into it at the outset. This is especially true if the collection is not well ordered or is difficult to access.
  • Institutional Priorities: While I tend to default towards more minimal processing most of the time, there are plenty of internal factors that may encourage me to apply more detailed levels of processing. A flagship collection in an area where we are trying to build collections, if material from a collection is going to be featured in an exhibition, how much staff time needs to be devoted to other projects, how administrators allocate resources—all of these may affect processing decisions.
  • Restrictions: If a collection has significant access or use restrictions, or if there is a high likelihood that there are materials in the collection that would fall under legal protections such as FERPA or HIPAA (especially if these items are scattered throughout the collection), you will need to arrange the collection at a more granular level to ensure that you are doing your due diligence to meet your legal obligations.
  • Material Type and Original (Dis)Order: The nature of a collection and the state in which a repository receives it will also, to some extent, determine the amount of archival intervention that it needs to be usable. If a collection arrives foldered, but entirely without a discernible order, it may require at least a series-level sorting to enable a researcher to navigate the collection and locate material of interest. This also means that a collection that arrives unfoldered or without any organization will require more granular processing in order to be able to provide meaningful access. If the material is pretty uniform, a collection-level description will probably suffice. In general, the greater the diversity of the collection, the more description is required to render the collection intelligible.
  • Size: I try not to make too many blanket decisions based solely on the size of a collection, but it can be a factor in determining processing levels. A collection that is only one box will not need a tremendous amount of description beyond the collection level because a researcher will only need to request one box to see the entirety of material available—tons of additional description is not going to aid in locating material. Conversely, a collection where one series spans hundreds of boxes will need additional file level description so that a user can isolate and access the part of that series that he or she needs.

These are some of the things that I take into consideration in my role as a manager at an academic special collection library. Other types of repositories and institutional contexts may well have other needs and different criteria. Feel free to add or expand in the comments!

What We Talk About When We (Don’t) Talk About Accessioning

Last week Meghan did a wonderful job explaining the benefits of formal processing plans. I’d like to back up a bit and talk about one of the sources of information an archivist turns to in developing a processing plan: the accession record. Actually, accessioning more generally. As a profession, we don’t chatter all that much about accessioning, which is a shame as it sets the tone for our stewardship of the materials in our care. It is the process by which we lay the foundation upon which all subsequent steps rely. Done well, it provides access to materials as soon as possible, allows for arrangement and further descriptive work to be built upon initial efforts, and limits the loss of knowledge about the materials so that we are not dependent upon some Proustian reverie to discern basic information like who sent us a collection and when. Done poorly, or not done at all, it is the root of all problems that linger and haunt future archivists. Dramatic, no?

Gandalf discovering the complex custodial history for the One Ring in the Minas Tirith archives, which has rather lax reading room rules.

Definitions for accessioning are ambiguous and don’t provide much in the way of guidance for carrying out the function in practical terms. To some it simply refers to the action whereby records are transferred and no more. If it were a verb form in this line of thinking, it would be the simple past, static at a specific moment in time (please be patient while you await my forthcoming Buzzfeed quiz “Which archival function are you based on your zodiac sign’s favorite grammatical tense?”). Instead of viewing accessioning as synonymous with transfer, let’s think of it as a process by which we examine, stabilize, and document what we know about materials upon their arrival. To continue with a labored analogy, the progressive rather than the simple verb aspect.

So why is accessioning so important, and what should we aim to accomplish through it?

  • To establish physical custody and baseline levels of control that make it possible to track the materials’ location(s), assess and address immediate preservation concerns, and identify less urgent needs to be remediated in the future. In the past month alone I’ve dealt with mold, book lice, and broken glass in recent accessions, not to mention slumping, crowding, and crushed boxes. Sometimes archives are gross. In thinking about the descriptive record, this part of the process gives us information on Extent and Processing Information.
  • To establish legal custody by assessing and documenting any restrictions on access and communicating intellectual property status. Descriptive elements related to this include Conditions Governing Access and Conditions Governing Use.
  • To establish intellectual control by identifying and documenting provenance, extant original order, as well as information about the content and context of the materials themselves. Relevant descriptive elements include Title, Date, Scope and Contents, and Immediate Source of Acquisition, amongst others.
  • To maintain a clear record of intervention with collection material, including appraisal, arrangement, descriptive, and preservation decisions and actions. Notes include Appraisal, Accruals, Destruction, and Processing Information.

We may not always know the same pieces of information for each accession, but we surely know more than what most collections management systems require in order to create a valid accession record (ArchivesSpace, for example, only requires an identifier and a date). The notes mentioned above come close to the requirements for DACS single-level description, and while most people use DACS in the service of creating access tools like finding aids, the standard is intended to be output-neutral. We can use it as a helpful guidepost for capturing and creating description in the accessioning process.

Let’s not stop there. Accessioning is an exemplar of the power of archival description, and how we can leverage that to provide access to some materials without performing arrangement. This follows the ethos of “Accessioning as Processing,” although in my opinion that phrase as a shorthand has recently become somewhat muddled with the idea of processing at the same time as accessioning. Instead, it’s a robust, access-driven approach that produces description sufficient in creating a baseline for access in some collections. During accessioning we are often able to get enough of a sense of a collection to create quality description rich in meaningful keywords that provides a reasonable range of materials for researchers. Machines can perform some of the basic types of arrangement like alphabetizing and chronological sorting, especially when we create clean metadata. It’s not boutique processing, and may often require further iterations in arrangement and description, but it allows for access sooner rather than later, and that’s at the crux of the public records tradition. And even if accessioning does not produce a public access tool – as this may not be appropriate in all cases – it should make it so that the next archivist who comes to the collection, whether for public services or arrangement and description, feels confident that she has all of the information about the materials that she needs. To close on a corny note, you can’t spell accessioning without “access.”

Access Restrictions that Promote Access

Access restrictions, if done well, are tools for ensuring that as much information as possible is made available as broadly as possible while still respecting and adhering to individual privacy, corporate confidentiality, legal requirements, cultural sensitivities, and agreements. In order to promote access, rather than present unnecessary barriers to it, restrictions on the availability of archival materials for research should follow these principles:


Unlock by Eric Bird from the Noun Project

  • They should be as broad as necessary to be practicable, but no broader. Where this point falls will vary between restriction types, collections and repositories, but as archivists we should champion increasing access whenever we can.
  • They should be clear, as concise as possible, and avoid jargon of any type. A typical user should be able to understand the access restrictions. Not sure if your restrictions pass this test? Why not ask a user? This isn’t just a usability issue; it’s an equal access issue.
  • They should spell out exceptions and make the implicit explicit. Publishing information about exceptions, appeals and alternatives that may exist helps ensure that all users have equal access to that information, and that learning about them does not require additional inquiry or personal interaction with a gatekeeper (er, an archivist).
  • They should acknowledge the role of professional judgment and enable appeal. In support of professional transparency and accountability, we need to explain restrictions well enough that researchers can understand both their basis and application, and challenge either element if they have good cause to believe our judgment is in error.

DACS gives some good guidance on what to include in an access restriction. In keeping with and expanding on that, a specific practice that I find helpful is to pay attention to the Five Ws and one H of access restrictions: who, what, where, when, why and how. Most access restrictions will not address all of these, but asking whether or not each applies can be useful when drafting restrictions.


Ethical Internships: Mentoring the Leaders We Need

I gave this talk last Friday to the Arizona Archives Association annual symposium — many thanks to that group for their excellent ideas and discussion, and for their strong sense of mission and values.


I wanted to start by explaining how excited I am to be here with you, and what it means to me to be an archivist speaking to a room of Arizona archivists. I grew up in Arizona, in Maricopa county in an area called Ahwatukee, which is a neighborhood on the south side of South Mountain, misnamed by the original white landowners for the Crow phrase for “land in the next valley.” Obviously the Crow people never lived anywhere near Arizona. The Crow are a northern plains tribe who lived in Wyoming and were forcibly moved to Montana. And so it is especially strange to me that the area was given a Crow name when we consider that Ahwatukee is bounded to the south by the Gila River Indian Community.


Crow (Apsaroke) Indians of Montana — “Holds the Enemy” by Edward Curtis. Library of Congress Prints and Photographs Division

What does it tell us about the regard Dr. and Mrs. Ames, the landowners who named the area, had for their American Indian neighbors that they used the language of a group far enough away to be largely irrelevant to their lives instead of the language of their immediate neighbors? I have to assume that they were caught up in popular romantic notions of American Indians, possibly best represented in the photographs of Edward Curtis, who aestheticized and fictionalized American Indians at precisely the moment when it was clear that there would be no more Indian wars and that the United States government’s program of forced removal had successfully met its intended ends.

This founding vignette resonates with me, because I see reverberations of it in my experience growing up in Ahwatukee. My middle school was named for the Akimel O’odham, the Pima people, who reside in Arizona, and our school donned bright turquoise and copper, vaguely pan-Indian pictographs. This was all done with a sharp lack of specificity; it gave the impression that American Indian culture is a stylistic flourish instead of a tradition, culture and worldview. Looking at it now, this divide between seeing American Indians as a people and seeing them as a trace on the now white-occupied land is especially cruel when you consider the persistent inequities that American Indians in Arizona encounter today. Indeed, during the last census there were only 738 American Indian-identified people living in Ahwatukee, which has the wealthiest and one of the whitest school districts in Arizona. I was surrounded by empty gestures to Indians but had no real contact with first Arizonans in my life. The land was empty of traces and traditions of people who had lived there, considered a tabula rasa onto which developers could build tract houses.

And so, growing up, I made the mistake that I think is pretty common among some Arizonans of assuming that there’s no history to be found here. I was participating in an act of mass forgetting.

A Case for Processing Proposals

It seems to me that processing proposals are the homework assignments of the archival processing world. We know we should do them, but sometimes we skip them. Sometimes we opt to have an informal meeting and vaguely talk about our plans rather than formalize them in a written document. Sometimes we look at a box of jumbled papers and think, well, by the time I write down what I’m going to do, I could be halfway done sorting this. So, I get it. In my experience, though, the act of writing a processing proposal is a useful exercise that forces you to think through your project. Even though I occasionally skip them too, I’ve never regretted writing a processing proposal. Whether you end up following it or not (more on that below), writing a proposal breaks down the act of processing into digestible components. It is a super valuable resource for your manager and other project stakeholders, which in turn makes your life a bit easier. And finally, it is important for institutional memory — just in case you happen to leave your job before the project is finished.

What should be included in a processing proposal?

  1. Basic information about the collection: Title, creator, dates, extent, formats present, languages present
  2. Collection provenance: Where did it come from? When? Who’s the collector and why did they acquire it?
  3. Collection condition and preservation concerns: Did it come from a moldy basement? Is it in beautiful file folders and perfectly arranged already? (Hopefully some version of the latter.) More specifically: Is there acidic paper? Rolled items? Photographs or negatives? Scrapbooks? Nitrate, glass negatives, or other highly fragile materials? Oversize materials or objects? Audiovisual materials? Electronic records?
  4. Collection “significance,” or some estimation of its research potential and value. This is a squishy value judgment, but I think it’s useful. Here’s why: we want to devote our energy and resources to processing collections at the appropriate level, and to providing a reasonable (but not excessive) amount of description. If you are working on a collection that you and the curator expect will be barely used, then you can probably process at a less-detailed level, which moves the collection through Technical Services much more quickly. By recording your reasoning here, you offer justification for your subsequent proposed rehousing, arrangement, and description scheme. It gives the other stakeholders an opportunity to respond to your judgment. Whether they agree or not, it is better to come together on these issues ahead of the processing itself.
  5. Restrictions. Always better to think about restricted material up front. Most of the time this field will relate to donor-specified or repository-instituted restrictions, but it could also stem from anticipated access issues. Plus, you can build on the “significance” field if you believe that the material is so valuable or vulnerable that it should require special housing or access policies.
  6. Accessions included: If you’re going to process a collection, best to make sure you have all the pieces. (Or at least acknowledge there are others besides what you will be processing.)
  7. Current arrangement and description. Has your repository owned this collection for generations? Oh dear, what did your forefathers do? Or, maybe it was recently acquired from the professor’s office. Oh dear, what did that sweet professor do?
  8. Additions expected? Use this proposal to have a conversation with your curator about their plans for this collection. If there are going to be annual additions, adjust your plans accordingly.
  9. Proposed arrangement: List proposed series and subseries, with title, brief description, size, date range, and anticipated processing level. This is where you can share alternative approaches to different arrangement schemes, if there are any — it would subsequently be discussed with the relevant stakeholders connected with this collection. It is also where you should describe any rehousing, reformatting, or disk imaging you will do as part of processing.
  10. Staff assigned, estimated time processing, special equipment, patron access during processing. Are you able to stop everything and help a patron who might show up and ask to see your collection? (Would it be possible to find what they wanted mid-processing?) If it’s not going to be feasible, it’s better to lay those ground rules in the proposal.
  11. Cost: I have found estimating processing costs to be challenging but ultimately worthwhile. Even if your repository is flush with cash, estimating anticipated supplies and staff time is putting a concrete value on Technical Services work, which translates well to both stakeholders and potential donors outside the department. It is especially important if you know there are reformatting, custom housing, or other “unusual” costs that will add to your project’s expenses. Round up.

What’s Next?

The proposal should be a collaborative document. It is most useful when all the relevant stakeholders have had a chance to weigh in and inform themselves about each of the issues it addresses — whether it be the proposed level of arrangement, anticipated timeline, or expected costs. Odds are the first draft of the proposal is not the “final” draft. It will be edited and tweaked, so mentally brace yourself for feedback. It is better to have that feedback before you begin the project! And it could be that the act of writing the proposal leads to further exploration about the issues surrounding a particular collection, which may mean a reevaluation of its feasibility or priority ranking in the processing queue. Again, that is a good thing. That means you have developed a useful, worthwhile proposal that allows you and your colleagues to thoughtfully allocate the resources you have available.

Save your proposal as a useful reference. As you begin processing the collection, revisit it to keep yourself (and your student employees) on point about the specific pieces you included regarding arrangement and description. It should be seen as a guide, however; if you realize during processing that your proposed arrangement is ridiculous (for whatever reason), consult with the relevant parties and update your proposal. I have also found that having some sort of evaluation or recap, appended to the end of the proposal, is a really useful practice for reflecting on how the project went. This can be collaborative or not; I tend to just reflect on what I actually did as opposed to what I had initially estimated. Lessons learned, etc. And again, I’m always trying to adjust the estimated processing costs to reflect reality. This is a moving target, of course, and changes with each collection. Every collection is different! But that’s what makes processing so fun.

Let me know in the comments if you have found other useful fields to include in your processing proposals. I’d love to hear what other people are doing.

Clean Metadata for Non-Metadata Geeks

Over the past two years, Maureen, Carrie, Meghan, Cassie and their guests have turned this blog into a powerhouse of know-how around working smarter with archival metadata. Some of us really enjoy this type of work; we find it crazy satisfying and it aligns well with our worldviews. We acknowledge, with some pride, that we are metadata geeks. But not all archivists are like this, AND THAT’S TOTALLY OKAY. We all have different strengths, and not all archivists need to be data wranglers. But we can all produce clean metadata.


Just one of the awesome metadata jokes promulgated by AVPreserve’s button campaign

Today, though, I’m going to take a BIG step backward and talk for a few minutes about what we actually mean when we talk about “clean” data, and I’ll share a few basic things that any archivist can do to help prevent their painstakingly produced metadata from becoming someone else’s “clean up” project later.

As Maureen explained in the very first Chaos —> Order post, the raison d’etre of all of this is to use computers to do what they do best, freeing the humans to do what they do best. Computers are really good at quickly accomplishing tasks like indexing, searching, replacing, linking and slicing up information for which you can define a rule or pattern, things that would take a human tens or hundreds of hours to do by hand and wouldn’t require any of the higher-level processes that are still unique to humans, let alone the specialized training or knowledge of an archivist. Clean data is, quite simply, data that is easy for a computer to digest in order to accomplish these tasks.


Archival Description for Web Archives

If you follow me on Twitter, you may have seen that the task I set out for myself this week was to devise a way to describe web archives using the tools available to me: Archivists’ Toolkit, Archive-It, DACS and EAD. My goals were both practical and philosophical: to create useful description, but also to bring archival principles to bear on the practice of web archiving in a way that is sometimes absent in discussions on the topic. And you may have seen that I was less than entirely successful.

Appropriate to the scope of my goals, the problems I encountered were also both practical and philosophical in nature:

  • I was simply dissatisfied with the options that my tools offered for recording information about web archives. There were a lot of “yeah, it kind of makes sense to put it in that field, but it could also go over here, and neither are a perfect fit” moments that I’m sure anyone doing this work has encountered. A Web Archiving Roundtable/TS-DACS white paper recommending best practices in this area would be fantastic, and may become reality.
  • More fundamentally, though, I came to understand that the units of arrangement, description and access typically used in web archives simply don’t map well onto traditional archival units of arrangement and description, particularly if one is concerned with preserving information about the creation of the archive itself, i.e., provenance.


Records Management for Discards

Maybe this is a familiar problem for some other archivists. You have a collection that you’ve just finished processing — maybe it’s a new acquisition, or maybe it’s been sitting around for awhile — and you have some boxes of weeded papers leftover, waiting to be discarded. But for some reason — a reason usually falling outside of your job purview — you are not able to discard them. Maybe the gift agreement insists that all discards be returned to the donor, and you can’t track down the donor without inviting another accession, and you just don’t have time or space for that right now. Maybe your library is about to renovate and move, and your curators are preoccupied with trying to install 10 exhibitions simultaneously. Maybe the acquisition was a high-value gift, for which the donor took a generous tax deduction, and your library is legally obligated to keep all parts of the gift for at least three years. Maybe your donor has vanished, the gift agreement is non-existent, or the discards are actually supposed to go to another institution and that institution isn’t ready to pay for them. The reasons don’t matter, really. You have boxes of archival material and you need to track them, but they aren’t a part of your archival collection any more. How do you manage these materials until the glorious day when you are actually able to discard them?

We’ve struggled with this at Duke for a long time, but it became a more pressing issue during our recent renovation and relocation. Boxes of discards couldn’t just sit in the stacks in a corner anymore; we had to send them to offsite storage, which meant they needed to be barcoded and tracked through our online catalog. We ended up attaching them to the collection record, which was not ideal. Because the rest of the collection was processed and available, we could not suppress the discard items from the public view of the catalog. (Discards Box 1 is not a pretty thing for our patrons to see.) Plus, it was too easy to attach them to the collection and then forget about the boxes, since they were out of sight in offsite storage. There was no easy way to regularly collect all the discard items for curators to review from across all our collections. It was messy and hard to use, and the items were never going to actually be discarded! This was no good.

I ended up making a Discards 2015 Collection, which is suppressed in the catalog and therefore not discoverable by patrons. All materials identified for discard in 2015 will be attached to this record. I also made an internal resource record in Archivists’ Toolkit (soon to be migrated to ArchivesSpace) that has a series for each collection with discards we are tracking for the year. It is linked to the AT accession records, if possible. In the resource record’s series descriptions, I record the details about the discards: what is being discarded, who processed it, who reviewed it, why we haven’t been able to discard it immediately, and when we expect to be able to discard the material (if known). The Discard Collection’s boxes are numbered, barcoded, and sent to offsite storage completely separated from their original collection — as it should be. No co-mingling, physically or intellectually! Plus, all our discards are tracked together, so from now on, I can remind our curators and other relevant parties at regular intervals about the boxes sitting offsite that need to be returned, shredded, sold, or whatever.

I’d love to hear other approaches to discards — this is a new strategy for us, so maybe I’ve missed something obvious that your institution has already solved. Let me know in the comments. Happy weeding, everyone!