Over the past two years, Maureen, Carrie, Meghan, Cassie and their guests have turned this blog into a powerhouse of know-how around working smarter with archival metadata. Some of us really enjoy this type of work; we find it crazy satisfying and it aligns well with our worldviews. We acknowledge, with some pride, that we are metadata geeks. But not all archivists are like this, AND THAT’S TOTALLY OKAY. We all have different strengths, and not all archivists need to be data wranglers. But we can all produce clean metadata.
Today, though, I’m going to take a BIG step backward and talk for a few minutes about what we actually mean when we talk about “clean” data, and I’ll share a few basic things that any archivist can do to help prevent their painstakingly produced metadata from becoming someone else’s “clean up” project later.
As Maureen explained in the very first Chaos —> Order post, the raison d'être of all of this is to use computers to do what they do best, freeing the humans to do what they do best. Computers are really good at quickly accomplishing tasks like indexing, searching, replacing, linking and slicing up information for which you can define a rule or pattern. These are tasks that would take a human tens or hundreds of hours to do by hand, yet they don't require any of the higher-level judgment that is still unique to humans, let alone the specialized training or knowledge of an archivist. Clean data is, quite simply, data that is easy for a computer to digest in order to accomplish these tasks.
The data that you create today is someone’s future legacy data project, maybe even your own. So why not make that person’s job a little bit easier, and maybe even allow your data to be reused in ways that didn’t occur to you when you created it? While this list isn’t exhaustive, here are a few general guidelines that anyone — metadata geek or not — can use to produce cleaner metadata:
- **Structure your data.** Structured data makes it easy for a computer to identify each piece of information for what it is. A table with clearly defined columns and rows of information is structured data. Unstructured data has to be read and interpreted before meaning can be derived: a Word document whose hierarchy is implied by indentations, paragraphs, and lists is unstructured data. An Excel form optimized for printing, with a layout full of joined fields, page numbers and pretty headings, is likely semi-structured data. Using AT or another EAD authoring tool will help you structure your data, but it's no guarantee. If you ever have the urge to "put it in the notes field," you're probably creating unstructured data.
- **Atomize your data.** While you're structuring your data, make sure that each field contains exactly one characteristic of what you're describing (such as folder name) and that each entry contains no more than one value. This is sometimes complicated by the tools we have available for creating metadata; in their efforts to be widely applicable, they can make it really easy not to do this (I'm looking at you, AT, on both counts). But whenever possible, strive to avoid metadata fields that mix multiple distinct types of information, even if they're related.
- **Make your data machine-parsable.** It happens. You have to break guideline #2. For reasons that involve hours of debate and multiple spells of crying, you HAVE to include the date in the title field. It's OK. Well, it's not OK, but it happens. So now your job is to make sure that when someone wants to pull the dates out of the titles in the future, you don't make them cry in the process. How do you do that? Think through the rule that uniquely identifies each piece of information in the field. Do you always use a semicolon to separate the pieces? Are you sure you never use semicolons anywhere else in the field? Ever? OK, then, a machine will be able to pull those pieces apart into proper fields at a later date. Otherwise, you're creating a job for someone.
- **Whatever you do, be consistent.** This is both the simplest and the hardest-to-follow of the clean-metadata commandments. Computers deal best with consistent input, and it's easy for them to produce consistent output. Humans are usually really good at interpreting inconsistent input, and as a result it's usually really hard for us to produce consistent output. Compounding this, most of the tools you will have at your disposal as an archivist make it really easy to produce inconsistent metadata: absolutely nothing but you prevents you from entering "John Smith" one time, "Smith, John" another and "J. H. Smith" a third. Nobody is perfect, and all of us are susceptible to lapses of memory, so the single best tool I've found for maintaining consistency is to document ahead of time how you will use each field and how you will form the data that goes into it, then use that documentation to refresh your memory throughout the data-entry portion of your project. Standards and controlled vocabularies can help, but only if you use them consistently.
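To make the "structure your data" guideline concrete, here's a minimal Python sketch (with a made-up, hypothetical record) showing why a table beats prose: once each value sits in a labeled column, a program can find the date without interpreting anything.

```python
import csv
import io

# The same description twice. In the prose version, a program would have
# to "read" the sentence to find the date; in the tabular version it can
# simply look up the column by name. (Record contents are invented.)
prose = "Box 1, folder 3 holds correspondence with J. Smith from 1952."

structured = io.StringIO(
    "box,folder,title,date\n"
    "1,3,Correspondence with J. Smith,1952\n"
)
rows = list(csv.DictReader(structured))

# No interpretation needed: the field identifies itself.
print(rows[0]["date"])  # '1952'
```

The point isn't CSV specifically; any format where each piece of information is explicitly labeled (a database table, EAD elements used properly) gives a computer the same foothold.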
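The "atomize your data" guideline can be sketched the same way. In this hypothetical example, a single subjects field holding several terms forces every future user to guess the delimiter, while the atomized version needs no parsing at all:

```python
# One field, multiple values: anyone reusing this later has to discover
# that ";" is the separator (and hope it never appears inside a term).
messy = {"title": "Club photographs", "subjects": "Jazz; Chicago; Nightlife"}

# Atomized: one value per entry, nothing left to parse.
clean = {"title": "Club photographs",
         "subjects": ["Jazz", "Chicago", "Nightlife"]}

# Recovering the clean form from the messy one already requires a rule:
recovered = [s.strip() for s in messy["subjects"].split(";")]
print(recovered == clean["subjects"])
```

Every field you pack with multiple values is a small parsing job you're handing to someone downstream.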
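Here's what the machine-parsability guideline buys you, sketched in Python. Assuming the rule really does hold (title, then a semicolon used nowhere else, then a YYYY-YYYY date), a few lines of regex can split the field cleanly; the titles below are invented:

```python
import re

# Hypothetical titles that follow one documented rule:
# "<title>; <YYYY-YYYY>", with the semicolon used nowhere else.
titles = [
    "Board minutes; 1950-1955",
    "Annual reports; 1960-1972",
]

pattern = re.compile(r"^(?P<title>[^;]+); (?P<date>\d{4}-\d{4})$")

for t in titles:
    m = pattern.match(t)
    print(m.group("title"), "|", m.group("date"))
```

If even one title uses the semicolon differently, the rule (and the script) breaks, and a human has to eyeball every record. That's the cleanup project this guideline prevents.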
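And here's the consistency problem from the computer's point of view, using the "John Smith" example above. A quick tally shows that what a human reads as one person, a program indexes as three distinct agents:

```python
from collections import Counter

# Three forms of (presumably) the same person, entered on different days.
creators = ["Smith, John", "John Smith", "Smith, John", "J. H. Smith"]

counts = Counter(creators)
print(counts)          # three separate "people" as far as an index is concerned
print(len(counts))     # 3
```

A documented rule ("always Surname, Forename") applied at entry time collapses these into one; applied after the fact, it becomes someone's cleanup project.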
So there you have it, four easy-peasy steps, right? In practice, they’re almost impossible to achieve perfectly. Still, if you use these as guidelines, you’ll be on the right track, and this is one of those cases where every little bit really does help.
I’m sure I’ve missed a few great tips, so add your ideas below, or let me know if you have questions about these.