If you follow me on Twitter, you may have seen that the task I set out for myself this week was to devise a way to describe web archives using the tools available to me: Archivists’ Toolkit, Archive-It, DACS and EAD. My goals were both practical and philosophical: to create useful description, but also to bring archival principles to bear on the practice of web archiving in a way that is sometimes absent in discussions on the topic. And you may have seen that I was less than entirely successful.
Appropriate to the scope of my goals, the problems I encountered were also both practical and philosophical in nature:
- I was simply dissatisfied with the options that my tools offered for recording information about web archives. There were a lot of “yeah, it kind of makes sense to put it in that field, but it could also go over here, and neither is a perfect fit” moments that I’m sure anyone doing this work has encountered. A Web Archiving Roundtable/TS-DACS white paper recommending best practices in this area would be fantastic, and may become reality.
- More fundamentally, though, I came to understand that the units of arrangement, description and access typically used in web archives simply don’t map well onto traditional archival units of arrangement and description, particularly if one is concerned with preserving information about the creation of the archive itself, i.e., provenance.
I was lucky enough to attend the first day of the Web Archiving Collaboration conference last week, where I had a bit of a professional revelation about the importance of documenting collection development and appraisal decisions for web archives. During a presentation about Warcbase, a tool he collaborated on with Jimmy Lin, a professor of computer science at the University of Maryland, historian Ian Milligan shared a particularly enlightening experience he had using a web archive created by a library.
The archive was on a subject highly relevant to his research interests and he did some interesting analyses of it, but ultimately was unable to answer questions about the dataset that would be necessary to pass peer review, such as “what are the biases in this data?” and “what was excluded from this dataset?” In a situation that will probably sound all-too-familiar to librarians and archivists, the person who had set up the web archive was no longer at the library that owned it and no documentation existed that would help him answer these questions.
Recently I’ve been working on seed definition and intense crawl scoping on some web collections, and the question of documenting what I felt were clearly appraisal decisions had already been percolating in the back of my mind. So when Ian made his comments, a fluorescent bulb exploded in my brain. I left convinced of the necessity of enhancing our descriptive practices to include this information, and thinking back to one of my favorite articles from grad school. Then I tried to put this into practice.
Although I’ve been talking about web archiving in general to this point, for my own practices I’m actually talking about Archive-It, and the rest of this discussion takes place in the context of Archive-It tools and interfaces. I’d be interested in hearing, however, whether and to what extent this argument applies to other tools and methods of web archiving.
Archive-It organizes web archiving activities and the resultant collections into Collections, Seeds, and Crawls.
- Each Seed is a single URL starting point for a web capture.
  - Seeds are generally thought of as the website that is the object of the capture, although the actual capture may include more or less depending on how it is scoped.
  - Seeds are instantiated in the current Wayback interface as Sites.
  - A DC-based interface for describing Seeds is available in Archive-It.
- A Collection is a unit of organization for Seeds.
  - Each Collection contains one or more Seeds.
  - The definition of a Collection currently has some impact on the capture and playback of websites in the Wayback interface.
  - Each Collection is assigned a unique identifying number by Archive-It.
  - A DC-based interface for describing Collections is available in Archive-It.
- A Crawl is the primary unit of collection activity.
  - A Crawl is logically equivalent to an archival accession.
  - Each Crawl contains one or more Seeds.
  - All Seeds in a Crawl must be from the same Collection.
  - Each Crawl is assigned a unique identifying number by Archive-It.
  - A Crawl is primarily defined by:
It seems clear that some of the critical information I want to be capturing and making accessible per my recent revelation is best described as features of Crawls, which are snapshots at specific points in time, rather than Seeds or Collections, both of which may change between those snapshots.
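To make the relationships concrete, here is a minimal sketch in Python of the three units as I understand them. To be clear, this is not Archive-It's API or data format: every class, field and function name is hypothetical, `scope_note` is my own invention for holding an appraisal decision, and the URLs and dates are made up. The point is only that a Crawl keeps a frozen copy of its seeds, while the Collection's seed list stays free to change.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Seed:
    url: str  # a single URL starting point for a capture


@dataclass
class Collection:
    collection_id: int  # stands in for the identifier Archive-It assigns
    title: str
    seeds: list[Seed]   # mutable: seeds come and go over time


@dataclass(frozen=True)
class Crawl:
    crawl_id: int            # stands in for the identifier Archive-It assigns
    collection_id: int       # all seeds in a crawl share one collection
    crawl_date: date
    seeds: tuple[Seed, ...]  # snapshot of the seeds at crawl time
    scope_note: str          # hypothetical free-text appraisal/scoping note


def record_crawl(collection, crawl_id, crawl_date, scope_note):
    """Snapshot a collection's current seeds as a Crawl (like an accession)."""
    if not collection.seeds:
        raise ValueError("a crawl needs at least one seed")
    return Crawl(crawl_id, collection.collection_id, crawl_date,
                 tuple(collection.seeds), scope_note)


campus_life = Collection(1, "Campus life",
                         [Seed("https://example.edu/chess-club")])
first = record_crawl(campus_life, 100, date(2015, 1, 15),
                     "Group websites and public Facebook pages only")

# The collection keeps changing after the crawl; the snapshot does not.
campus_life.seeds.append(Seed("https://example.edu/debate-society"))
```

After the last line, `campus_life.seeds` holds two seeds but `first.seeds` still holds one, which is exactly the kind of point-in-time information that gets lost when description attaches only to Seeds and Collections.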
For example, I am building a campus life collection that documents the public online presences of officially recognized student groups. I don’t expect that definition of the collection’s scope to change. However, currently, the collection’s scope includes group websites and public Facebook pages, but not those of national organizations that don’t include significant information about the local chapter; this scope may change as web archiving technology and platform popularity shift over time. Similarly, I fully expect that the list of student organizations that we capture will change as new groups are recognized and others cease to exist. I was unable to conceive of a way to represent these shifts in collecting activity without focusing on Crawls as a primary unit for description.
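As a rough illustration of why the Crawl matters here, the following Python sketch compares the seed lists of two crawls and reports what was added and what was dropped. It assumes you have somehow obtained each crawl's seed URLs as a plain list; the function name and the example URLs are hypothetical, and nothing here comes from Archive-It itself.

```python
def seed_changes(earlier, later):
    """Return (added, removed) seed URLs between two crawls, each sorted."""
    earlier_set, later_set = set(earlier), set(later)
    added = sorted(later_set - earlier_set)
    removed = sorted(earlier_set - later_set)
    return added, removed


# Made-up seed lists for two crawls of the same collection.
fall_crawl = [
    "https://example.edu/chess-club",
    "https://example.edu/film-society",
]
spring_crawl = [
    "https://example.edu/chess-club",
    "https://example.edu/debate-society",
]

added, removed = seed_changes(fall_crawl, spring_crawl)
print("Newly captured:", added)
print("No longer captured:", removed)
```

A comparison like this is only possible if each Crawl's seed list is kept somewhere as a unit, which is the crux of the problem.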
And there’s the rub: our current interfaces and methods of describing and accessing web archives privilege Collections and especially Seeds over Crawls. There is no way to view, browse, describe or even link to a complete Crawl as a unit of data with information on all of its relevant seeds, constraints, limits and expansions through the Archive-It Wayback interface. This information and more is readily available about individual Crawls through the Reports section of Archive-It’s administrative interface. However, once a Crawl is published, it ceases to exist as an aggregate entity that can be described and accessed, and instead exists only as captures of individual sites with no information about the collecting scope, constraints, limits or expansions that were in place at the time of the capture.
I’ve thought of at least a half dozen different ways of shoehorning the description of Crawls into archival descriptive tools, but have not come up with one that satisfied my goals, which include scalability (did I mention that the campus life collection currently has over 400 seeds?) and, you know, actual usefulness.
If anyone else out there has given thought to the issue of documenting collecting and appraisal activities for web archives, though, I’m very interested in hearing what solutions you’ve devised.