Archival Description for Web Archives

If you follow me on Twitter, you may have seen that the task I set out for myself this week was to devise a way to describe web archives using the tools available to me: Archivists’ Toolkit, Archive-It, DACS and EAD. My goals were both practical and philosophical: to create useful description, but also to bring archival principles to bear on the practice of web archiving in a way that is sometimes absent in discussions on the topic. And you may have seen that I was less than entirely successful.

Appropriate to the scope of my goals, the problems I encountered were also both practical and philosophical in nature:

  • I was simply dissatisfied with the options that my tools offered for recording information about web archives. There were a lot of “yeah, it kind of makes sense to put it in that field, but it could also go over here, and neither are a perfect fit” moments that I’m sure anyone doing this work has encountered. A Web Archiving Roundtable/TS-DACS white paper recommending best practices in this area would be fantastic, and may become reality.
  • More fundamentally, though, I came to understand that the units of arrangement, description and access typically used in web archives simply don’t map well onto traditional archival units of arrangement and description, particularly if one is concerned with preserving information about the creation of the archive itself, i.e., provenance.

I was lucky enough to attend the first day of the Web Archiving Collaboration conference last week, where I had a bit of a professional revelation about the importance of documenting collection development and appraisal decisions for web archives. During a presentation about Warcbase, a tool he collaborated on with Jimmy Lin, a professor of computer science at the University of Maryland, historian Ian Milligan shared a particularly enlightening experience he had using a web archive created by a library.

The archive was on a subject highly relevant to his research interests and he did some interesting analyses of it, but ultimately was unable to answer questions about the dataset that would be necessary to pass peer review, such as “what are the biases in this data?” and “what was excluded from this dataset?” In a situation that will probably sound all too familiar to librarians and archivists, the person who had set up the web archive was no longer at the library that owned it and no documentation existed that would help him answer these questions.

Recently I’ve been working on seed definition and intense crawl scoping on some web collections, and the question of documenting what I felt were clearly appraisal decisions had been percolating in the back of my mind already. So when Ian made his comments, a fluorescent bulb exploded in my brain, and I left convinced of the necessity of enhancing our descriptive practices to include this information, and thinking back to one of my favorite articles from grad school. Then I tried to put this into practice.

Although I’ve been talking about web archiving in general to this point, for my own practices I’m actually talking about Archive-It, and the rest of this discussion takes place in the context of Archive-It tools and interfaces. I’d be interested in hearing, however, whether and to what extent this argument applies to other tools and methods of web archiving.

Archive-It organizes web archiving activities and the resultant collections into Collections, Seeds, and Crawls.

  • Each Seed is a single URL starting point for a web capture.
    • Seeds are generally thought of as the website that is the object of the capture, although the actual capture may include more or less depending on how it is scoped.
    • Seeds are instantiated in the current Wayback interface as Sites.
    • A DC-based interface for describing Seeds is available in Archive-It.
  • Collection is a unit of organization for Seeds.
    • Each Collection contains one or more Seeds.
    • The definition of a Collection currently has some impact on the capture and playback of websites in the Wayback interface.
    • Each Collection is assigned a unique identifying number by Archive-It.
    • A DC-based interface for describing Collections is available in Archive-It.
  • Crawl is the primary unit of collection activity.
    • A Crawl is logically equivalent to an archival accession.
    • Each Crawl contains one or more Seeds.
    • All Seeds in a Crawl must be from the same Collection.
    • Each Crawl is assigned a unique identifying number by Archive-It.
    • A Crawl is primarily defined by its Seeds and by the scoping constraints, limits, and expansions in effect when it runs.
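
To make those relationships concrete, here is a minimal sketch of the model in Python. The class and field names are my own shorthand, not Archive-It’s actual API or schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Seed:
    """A single URL starting point for a web capture."""
    url: str

@dataclass
class Collection:
    """A unit of organization containing one or more Seeds."""
    collection_id: int                      # unique number assigned by Archive-It
    name: str
    seeds: List[Seed] = field(default_factory=list)

@dataclass
class Crawl:
    """The primary unit of collection activity, akin to an archival accession."""
    crawl_id: int                           # unique number assigned by Archive-It
    collection: Collection                  # all Seeds must be from one Collection
    seeds: List[Seed]                       # the Seeds active in this Crawl
    started: datetime                       # the snapshot's point in time

# A Crawl captures some or all of a Collection's Seeds at a point in time.
clubs = Collection(1234, "Campus Life", [Seed("http://example.edu/clubs/chess")])
crawl = Crawl(5678, clubs, clubs.seeds, datetime(2015, 6, 15))
```

The point of the sketch is the asymmetry it makes visible: Seeds and Collections persist and mutate, while each Crawl is a fixed snapshot of their state.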

It seems clear that some of the critical information I want to be capturing and making accessible per my recent revelation is best described as features of Crawls, which are snapshots at specific points in time, rather than Seeds or Collections, both of which may change between those snapshots.

For example, I am building a campus life collection that documents the public online presences of officially recognized student groups. I don’t expect that definition of the collection’s scope to change. However, the collection’s scope currently includes group websites and public Facebook pages, but not those of national organizations that don’t include significant information about the local chapter; this scope may change as web archiving technology and platform popularity shift over time. Similarly, I fully expect that the list of student organizations we capture will change as new groups are recognized and others cease to exist. I was unable to conceive of a way to represent these shifts in collecting activity without focusing on Crawls as a primary unit for description.

And there’s the rub: our current interfaces and methods of describing and accessing web archives privilege Collections and especially Seeds over Crawls. There is no way to view, browse, describe or even link to a complete Crawl as a unit of data with information on all of its relevant seeds, constraints, limits and expansions through the Archive-It Wayback interface. This information and more is readily available about individual Crawls through the Reports section of Archive-It’s administrative interface. However, once a Crawl is published, it ceases to exist as an aggregate entity that can be described and accessed, and instead exists only as captures of individual sites with no information about the collecting scope, constraints, limits or expansions that were in place at the time of the capture.

I’ve thought of at least a half dozen different ways of shoehorning the description of Crawls into archival descriptive tools, but have not come up with one that satisfies my goals, which include scalability (did I mention that the campus life collection currently has over 400 seeds?) and, you know, actual usefulness.
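
For what it’s worth, here is the kind of information I think a crawl-level record would need to carry, sketched as a plain Python dict. This is a hypothetical sketch of my own, not a DACS, EAD, or Archive-It format, and every field name and value is invented for illustration:

```python
# Hypothetical crawl-level description record: it preserves the appraisal
# decisions that were in force at capture time, which currently evaporate
# once a Crawl is published.
crawl_record = {
    "crawl_id": 5678,                       # Archive-It's crawl identifier
    "collection_id": 1234,                  # Archive-It's collection identifier
    "date": "2015-06-15",
    "seeds": [                              # Seeds active in this Crawl
        "http://example.edu/clubs/chess",
        "https://www.facebook.com/exampleclub",
    ],
    "scope_note": (
        "Public websites and Facebook pages of officially recognized "
        "student groups; national organizations excluded unless they "
        "include significant information about the local chapter."
    ),
    "constraints": ["exclude national-organization domains"],
    "limits": {"time_limit_hours": 72},
    "expansions": ["include documents linked from seed hosts"],
}
```

A record like this could be attached to each accession in a collection management system, so that a future researcher could ask what was included, what was excluded, and why, for any given snapshot.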

If anyone else out there has given thought to the issue of documenting collecting and appraisal activities for web archives, though, I’m very interested in hearing what solutions you’ve devised.

7 thoughts on “Archival Description for Web Archives”

  1. What are some of the ways you tried to shoehorn the description? Sometimes it’s more valuable to document dead ends than successes.

  2. Pingback: Editors’ Choice: Archival Description for Web Archives | Digital Humanities Now

  3. Thanks for this much needed post Christie. I definitely think there’s an opportunity here to better describe collections of web content. I also agree that documenting when a crawl is started, and with what parameters (seeds, scope, etc) and making it understandable to the researchers is key. The current de facto web archives discovery tool (wayback) doesn’t do a great job of this. But it’s early days for web archives, and there’s lots of room for improvement/innovation.

    I must admit, when I’ve heard Ian talk about dataset bias in collection development decisions I immediately thought he was talking primarily about the seeds: why was this website deemed worth collecting and another not?

    In your case it sounds like you may have a very clear mandate to collect officially recognized student groups. But I can’t help but wonder: are you collecting all of them? How do you keep up with the new ones? How easy is it to communicate the documentation strategy or collection development policy with the users of the web archives in the context of the content?

    So for me, the descriptions at the Collection and Seed level seem extremely important. I wonder if anyone has done a study of the ways archivists are using the Dublin Core metadata at those levels?

  4. Nice post. There have been some interesting conversations about metadata for web archives on the IIPC mailing list in the past couple of weeks as well. If interested: http://netpreserve.org/iipc-mailing-list. Prediction: 2015–2016 will be good years for theorizing metadata for web archives and designing solutions.

    We are considering similar questions about web archives and metadata, which take on many dimensions. We want metadata to facilitate access and research, and we want it to be appropriate for the kind of web-collecting activity we think we’re engaged in, which could be government record-keeping, library or archival acquisitions, or large-scale domain crawls. Each one will have different needs and different levels of description, but theoretically they should all co-exist nicely in a Dublin Core environment.

    I would consider seed- and collection-level descriptions to be most important, in part because these would tend to be the primary access points to web archives (apart from full-text search, I suppose). I see your point about the importance of crawl descriptions, though; I have mostly thought of the crawl as a technical process subordinate to the seed, but it is also a record of archival activities that determine the shape of a collection. Depending on how you set your constraints, a single seed crawl could result in a single page of HTML, or a multi-terabyte domain. Seed URLs provide points of access to collections, but in themselves they don’t tell the full story of what’s in a collection or why those decisions were made. For that, you might need to record not only crawl data, but accompanying descriptions.

    There is certainly work to be done in visualizing the complex relationships between seed URLs, crawls, and collections (and across collections). A big problem is that we want to conceptualize web archives as a collection of linked objects (seed URLs) with definable boundaries (collections), but the web is not like that and maybe we shouldn’t expect web archives to be either.

    As for dataset bias: Yikes! That’s a tough one, and it must be an issue across the digital humanities. In web archives, even the most precise and well-documented methodologies will yield all sorts of unexpected stuff, which is generally considered a good thing (better to overcrawl than undercrawl). I would be curious to know what would be required of a web archives dataset to make it peer review ready.

  5. I’m a bit late to the party, but I always like to think of digital records in terms of what their physical counterparts would be. There are lots of complex digital objects (think anything with interaction) that don’t *have* a physical counterpart, I know. But Instagram is pretty similar to a photo album and an email is basically just correspondence (and attachments are enclosures), etc. If you were collecting the paper material of student groups, how would you describe it? Would each group be a fonds or a series in a fonds of student organizations? Seed selection is like appraisal. Why did you keep what you did? Crawls are like accruals. When did stuff get transferred to you? You could even consult with the groups themselves about what seeds to select and that could inform both the donor information and the custodial history.

  6. Pingback: Made to Be Broken | Digital Preservation
