Using grep to Control Vocabulary

The Archival Services department at the Center for Jewish History in New York provides processing services to the Center’s five partner organizations (American Jewish Historical Society, American Sephardi Federation, Leo Baeck Institute, Yeshiva University Museum and YIVO Institute for Jewish Research). The department is six full-time archivists, one part-time archivist, and a manager (that’s me).

Three archivists are currently processing the records of Hadassah, the Women’s Zionist Organization of America, which are on long-term deposit at the American Jewish Historical Society. The existing arrangement and description of the roughly 1,000 linear feet of materials varies widely (from item to record group level, and everything in between). The ultimate goal is folder-level control over the entire collection, using as much of the legacy description as possible. A high-level summary finding aid is available here:

As we process, we are trying to ensure that the terms in our narratives and assigned titles (names and places in particular) are consistent. This is tricky – there are many variant spellings and transliterations, and name changes abound as well. So we asked ourselves, how can we run a set of terms against a body of description?

I tried using an XSLT stylesheet, a Schematron, and xQuery via BaseX, but I kept running into problems with string processing. I’m sure there are many other ways to peel this grape, but eventually I tried using the Unix command-line program grep, and ultimately was successful. Most of this is stolen directly from a Stack Exchange post cleverly titled How to find multiple strings in files?.

We ended up with the following as our workflow, which we run across all the project finding aids periodically or as we add new terms to our list or create new finding aids.

  • First and foremost, come up with a list of preferred terms and their alternates. Save this compilation somewhere (we are using an email chain at the moment to do the work, and then saving a text file on the shared project drive). A couple of caveats here – if our preferred narrative term differs from the LC term, we use the LC term in a controlled tag like persname or corpname, and we introduce our preferred term in the narrative together with the alternates. The goal is to have consistent terms in our finding aids so they are easily searchable.
  • Create a text file that contains the alternate terms we want to avoid, one per line. Save this as patterns.txt.

A list of terms we want to avoid

  • Save into one folder all the files you want to check – in our case, we  started with the three completed RG-level finding aids in EAD, rg1.xml, rg5.xml, and rg13.xml. You can run the terms against any text file though, such as an HTML finding aid or a text document.
  • Create a virtual Unix machine (I used the Oracle VirtualBox to create a Ubuntu 12.10 machine– this is the same way BitCurator is installed, so one could follow those instructions, substituting a regular Ubuntu disk image instead of the BitCurator image). NB THIS PART IS TRICKY. If you’ve never installed a virtual machine before, this could require some time and effort. Of course, if you are in a -nix environment, you can skip the VM entirely.
  • Boot up the virtual machine by highlighting the correct machine and hitting “Start.”

The virtual machine start screen

  • In the virtual machine, using the Devices menu, enable bi-directional copy-and-paste and bi-directional drag-and-drop in the virtual machine. This allows you (in theory!) to move files and copy text between your Windows desktop and the VM. I always have a hard time getting this to work; sometimes I email myself files from inside the virtual machine.
  • Install aha (an ANSI to HTML convertor) via the command-line (Ctrl-Alt T to open the command line):

sudo apt-get install aha

  • Create a folder on the virtual machine desktop, and drag your patterns.txt and the finding aids to be checked over from your desktop.
  • Open the command line (Ctrl-Alt T), and navigate to the folder where your files are found.
  • Type the following (this will change DOS/Windows line endings from CRLF to just LF; thanks to Google and the hundreds of people who have encountered and posted about this frustrating quirk!):

sed ‘s/\r$//’ < patterns.txt > patterns_u.txt

  • You are now ready to run the key command, grep:

grep -n rg*.* -iHFf patterns_u.txt –color=ALWAYS | aha > output.html

So what’s going on here?

grep is powerful unix tool for pattern matching
-n is a flag that prints line numbers of the found terms in the output
rg*.* grabs all files we are checking, in this case everything starting with rg
-i flag: make the search case-insensitive
-H flag: print the filename in the output (useful if looking multiple files at once)
-F flag: read the text strings as strings, nor regular expressions
-f flag: look at a file for text strings, in this case patterns_u.txt
patterns_u.txt: our list of terms we are looking for
–color=ALWAYS: this flag makes the output pretty, with ANSI colors
| aha > output.html: this pipes the standard terminal output to pretty HTML

  • Examine the results by opening output.html in a browser; matched terms are in red:

The terms we seek, in red

And, voila, we can go back and examine where these terms occur, and see if they should be changed to the preferred term. Since this requires human judgment, it’s not automated, but it should be possible to add some find-and-replace functionality using a variety of tools (maybe a shell script that loops through the terms list and uses sed or awk to replace them?). But I leave that for brighter minds than mine.

Thanks to my processing team, Andrey Filimonov, Nicole Greenhouse and Patricia Glowinski, for working this out with me; to Maureen and the gang for letting me post this; and to Nicole for encouraging me to write it up.

Records Management for Discards

Maybe this is a familiar problem for some other archivists. You have a collection that you’ve just finished processing — maybe it’s a new acquisition, or maybe it’s been sitting around for awhile — and you have some boxes of weeded papers leftover, waiting to be discarded. But for some reason — a reason usually falling outside of your job purview — you are not able to discard them. Maybe the gift agreement insists that all discards be returned to the donor, and you can’t track down the donor without inviting another accession, and you just don’t have time or space for that right now. Maybe your library is about to renovate and move, and your curators are preoccupied with trying to install 10 exhibitions simultaneously. Maybe the acquisition was a high-value gift, for which the donor took a generous tax deduction, and your library is legally obligated to keep all parts of the gift for at least three years. Maybe your donor has vanished, the gift agreement is non-existent, or the discards are actually supposed to go to another institution and that institution isn’t ready to pay for them. The reasons don’t matter, really. You have boxes of archival material and you need to track them, but they aren’t a part of your archival collection any more. How do you manage these materials until the glorious day when you are actually able to discard them?

We’ve struggled with this at Duke for a long time, but it became a more pressing issue during our recent renovation and relocation. Boxes of discards couldn’t just sit in the stacks in a corner anymore; we had to send them to offsite storage, which meant they needed to be barcoded and tracked through our online catalog. We ended up attaching them to the collection record, which was not ideal. Because the rest of the collection was processed and available, we could not suppress the discard items from the public view of the catalog. (Discards Box 1 is not a pretty thing for our patrons to see.) Plus, it was too easy to attach them to the collection and then forget about the boxes, since they were out of sight in offsite storage. There was no easy way to regularly collect all the discard items for curators to review from across all our collections. It was messy and hard to use, and the items were never going to actually be discarded! This was no good.

I ended up making a Discards 2015 Collection, which is suppressed in the catalog and therefore not discoverable by patrons. All materials identified for discard in 2015 will be attached to this record. I also made an internal resource record in Archivists’ Toolkit (soon to be migrated to ArchivesSpace) that has a series for each collection with discards we are tracking for the year. It is linked to the AT accession records, if possible. In the resource record’s series descriptions, I record the details about the discards: what is being discarded, who processed it, who reviewed it, why we haven’t been able to discard it immediately, and when we expect to be able to discard the material (if known). The Discard Collection’s boxes are numbered, barcoded, and sent to offsite storage completely separated from their original collection — as it should be. No co-mingling, physically or intellectually! Plus, all our discards are tracked together, so from now on, I can remind our curators and other relevant parties at regular intervals about the boxes sitting offsite that need to be returned, shredded, sold, or whatever.

I’d love to hear other approaches to discards — this is a new strategy for us, so maybe I’ve missed something obvious that your institution has already solved. Let me know in the comments. Happy weeding, everyone!

Getting into the guts of AT

It’s the thing that we keep saying — in order to deal with our masses of stuff better, we need better ways of understanding what we have. A lot of my questions aren’t just about what’s in our finding aids — they’re about the relationships between archival materials and other archival management functions — accessioning, digital object management, location management, container management. For instance, the following questions have come up in the past or could easily come up in the future:

  • Which collections are constituted of accessions that came in before 1980?
  • Which collections have digital objects associated with them? What are the URIs of those digital objects?
  • I have a barcode for a box. Can you tell me the materials that are supposed to be in that box? What collection is this from?
  • We haven’t used this location listed in the location table since 2005! Are there any boxes associated with that location? What are they?

In order to answer these questions, I need to write reports that join different tables in Archivists’ Toolkit together. And this is a little bit tricky, because in their own way, components in the AT database are hierarchical (just like in an EAD-encoded finding aid). If I have an instance (a container with a barcode), and I want to know which collection it belongs to, I don’t have a direct relationship in the database. Instead, an instance is associated with a component. That component is associated with its parent component. It may have a lot of ancestor components before the most high-level component is associated with the collection-level information in the resource table.

These relationships are made in sql through what are called “joins”. And joining a table on itself (in some cases several times, recursively) is a huge friggin pain in the neck. So, after mucking around for a little while, the solution was to just ask someone smarter than me how he would handle this.

This is where my colleague Steelsen comes in — Steelsen introduced the idea of writing a stored procedure that would look for the top-most component instead of having to do this through joins. And then he wrote them for me, because he is a mensch of the first order. His procedures are here, and available to anyone who might find them useful. They have seriously revolutionized the way that I’ve been able to do reporting and solve problems.

For instance, something that folks have been begging for is a barcode look-up tool — they have a barcode, and they want to know which collection it belongs to, what its call number is, which location it’s assigned to, and which components are associated with that box. So here’s what I wrote (the user indicates the barcode in the where statement):

use schema;
 ' ',
 LPAD(r.resourceIdentifier2, 4, '00')) 'Collection',
 r.title 'Collection Title',
 series.subdivisionIdentifier 'Series/Accession Number',
 series.title 'Series Title',
 rc.title 'Component Title',
 rc.dateExpression 'Component Date',
 adi.container1Type 'Container Type',
 adi.container1NumericIndicator BoxNum,
 adi.container1AlphaNumIndicator BoxAlpha,
 adi.container2NumericIndicator FolderNum,
 adi.container2AlphaNumIndicator FolderAlpha,
 adi.archDescriptionInstancesId InstanceID,
 adi.barcode Barcode,
 adi.userDefinedString1 'Voyager Info',
 loc.coordinate1NumericIndicator ShelfNum,
 loc.coordinate1AlphaNumIndicator ShelfAlpha
 ArchDescriptionInstances adi
 ResourcesComponents rc ON adi.resourceComponentId = rc.resourceComponentId
 LocationsTable loc ON adi.locationID = loc.locationID
 Resources r ON r.resourceId = GETRESOURCEFROMCOMPONENT(rc.resourceComponentId)
 ResourcesComponents series ON GETTOPCOMPONENT(rc.resourceComponentId) = series.resourceComponentID
 adi.barcode = 39002042658774;

Here I use two of Steelsen’s procedures. In GETRESOURCEFROMCOMPONENT, I go up the tree of a component to find out what resource it belongs to and join that to the resource. I use GETTOPCOMONENT to help figure out what series a component belongs to (this assumes that the top-most component is a series, but that’s usually a safe bet for us).

I’m a sql n00b, and this isn’t the most efficient query I’ve ever run, but I’m really happy with the results, which can be viewed in a spreadsheet here.

By changing the where statement, I can find out all kinds of associated information about a location, a collection, a box, whatever. I can find out if barcodes have been assigned to components with different box numbers; I can find out if components with the same barcode have been assigned to more than one location. This set of procedures has really been a godsend to help me know more about the problems I’m fixing. So many thanks to Steelsen. I hope others find them useful too.

Making DACS Dates

Manipulating date strings (which is the data type we usually have in archival description), particularly when you have a lot of legacy data, is a pain. I was working with a friend to update some legacy data in her finding aid, and it occurred to me that there isn’t a lot of direct guidance out there about how to manipulate dates with various tools. So, here’s a run-down of some of my methods — please feel free to add your own in the comments.

Why does this matter?

I’ll be honest, in a lot of situations, date formats don’t matter at all. I’ve said it before and I’ll say it again — we put a whole lot of effort into creating structured data, considering that most of us just flatten it into HTML and put it up as a webpage. However, there is a brighter tomorrow. With structured data, you can make far better interfaces, and there are really nice examples of places that let you do stuff with date data.

In the Princeton finding aids site, you can sort by title, date, or container. This means that in a series like this, in the George F. Kennan papers, where the archivist (or possibly creator) filed by title, this isn’t the only way to look through materials.

George Kennan Finding Aid

The order of materials as they are presented

George Kennan Finding Aid

The order of materials, sorted by me (the user) by date ascending.

Letting users sort by title or date means a few things — we can stop wasting time with alpha or chron arrangement and spend more of our energies on the true value that archivists add to description — context, meaning, transparency — without worrying that there’s too much for the researcher to sort through. It also means that we don’t have to presume that a researcher’s primary discovery vector is either time or title — we can let her choose for herself. Finally, and most importantly, we can let original arrangement schemes and organic order (the true intellectual basis of arrangement) reign supreme.

The other reason why date formats are important is because our content standard tells us they are. Now, I personally think that it’s actually far more important to associate an ISO-compliant date with a descriptive component, which can then be rendered any way you want, but since until recently our tools didn’t support that very well, I think that the DACS format of YYYY Month D brings us a step closer to easier date clean-up and extracting ISO compliant dates from date expressions.


Excel, odi et amo. Excel offers a GUI for programming-ish functions, but I find as I do more and more advanced stuff that I get frustrated by not knowing what exactly is happening with the underlying data. Dates are particularly frustrating, since Excel stores dates as a serial number starting with January 1, 1900. As an archivist who has PLENTY OF DATES from before then, this can lead to rage. There are a few ways to deal with this — if your dates are all 20th or 21st century, congratulations! You don’t have a problem. There are ways to get Excel to change the ways it assigns serial numbers, to allow for negative numbers, which let’s you do the normal sorting and date re-formatting. Or, you can store everything as text and move each part of the date string to its own column to manipulate it.

So, an example of a clean-up project:

Excel Dates -- untouched

In this data, we have a bit to clean up. When I start a clean-up project, I usually start with a pencil-and-paper list of all of the steps that I need to go through before I change anything. This way, I see if I need to do research about how to do a step, and I can also see if there are dependencies in the data that may require me to sequence these steps in a particular way. When you’re first learning, it’s easy to jump right in without planning, but trust me — every time I’ve been burned by automation it’s because I didn’t plan. In a live data environment, you should always know what the computer’s going to do before you run a command, even if that command is just a formula in Excel. The flip side of this is, of course, that as long as you have good back-ups, you should feel free to experiment and try new things. Just make sure you make the effort to figure out what actually happened when you’re experimenting suddenly produces the results you want to see.

So, here’s my list of steps to perform on this data.

  • Check my encoding, which in this case just means which data is in which columns. Do you see the row where some of the date data is in the title column? It’s in row 4. I would probably survey the data and see how prevalent this kind of problem is. If it’s just a handful of errors, I’ll move the data over by hand. If there’s more, I’ll figure out a script/formula to automate this.
  • Check for unwanted characters. In this case, get rid of brackets. In case you haven’t heard, brackets are not a meaningful way of indicating uncertainty to researchers. There is a certainty attribute on <unitdate> for that, which can then be rendered in your institution’s EAD -> HTML stylesheet. However, my problem with brackets is more fundamental — in archival description, the date element is just a transcription of what we see on the record. We don’t actually know that this date represents anything. So in reality, these are all guesses to varying degrees of certainty, with the aim of giving the researcher some clue to time.
  • Fix the date format. DACS dates are YYYY Month D. (e.g., 2015 March 6)
  • Create an ISO date to serialize as an @normal attribute with <unitdate>

Let’s skip the obvious clean-up tasks and go straight to formatting dates. If everything is after 1900 (and if everything is a three-part date), this is really straightforward.

First, create a new column. Use the DATEVALUE formula to tell Excel to regard your date string as a date value — if your date string is in B2, your formula in C2 should be:


Double-click on the bottom right corner of the cell to have that formula apply to the whole column.

Now that Excel knows that this is a date, you simply need to give it the format you want to see, in this case, yyyy mmmm d.

Choosing a custom date format

Choosing a custom date format.

This works great for three-part dates after 1900. If that’s not your situation, there are a few things you can do. One of my favorite methods is to filter the date list to each of the different date types and apply the custom date format to each of these (trying to apply a custom date format to a date that doesn’t fit the type will result in really confusing and bad results). Another option is to split the date into three different columns, treat each like text, and then bring them back together in the order you want with the CONCATENATE formula. Play around — Excel doesn’t make it easy, but there are lots of options.


If you do a lot of data manipulation, I would definitely encourage you to stop torturing yourself and learn OpenRefine. I use it every day. OpenRefine uses something called GREL (Google Refine Expression Language — I wonder if they’ll be changing that to OREL now that this isn’t under the Google umbrella?), which is trickier to learn than Excel formulas but a lot more powerful and more in alignment with other programming languages. In fact, I should say that you only need to learn GREL for the fancy stuff — a lot of OpenRefine’s magic can be done through the GUI.

So, looking through this data set, I would do a lot of the same steps. One option is to just use the commands Edit Cells -> Common Transformations -> To Date, but unfortunately, most of these strings aren’t written in a way that OpenRefine understands them as dates.

The best path forward is probably to split this date string apart and put it back together. You could split by whitespace and turn them into three columns, but since some dates are just a year, or a year and a month, you wouldn’t necessarily have each of the three parts of the date in the columns where you want them.

So, I’m going to tell OpenRefine what a year looks like and ask it to put the year in its own column.

This formula pulls the year from the date string and puts it into its own column.

This formula pulls the year from the date string and puts it into its own column.

In this formula, I’m partitioning the string by a four-digit number and then taking that part of the partition for my new column. In the case of the year, the formula is:


For a month it’s:


And for the day it’s:


There may be a more elegant way of partitioning this all as one step, but I don’t yet know how!

Then, once you have each of these parts of the date in their own columns, they should look like this:

Each part of the date element is in its own column.

Each part of the date element is in its own column.

The final step is to put the pieces back together in the order you want them. You can do this by clicking on the Year column, and selecting “Create column based on this column.” Then, use GREL to put everything in the order that you want to see it.

The plus signs signify that everything should be smushed together -- pay attention to the syntax of calling the value of columns.

The plus signs signify that everything should be smushed together — pay attention to the syntax of calling the value of columns.

The formula for this is:

value + " " + cells["Month"].value + " " + cells["Day"].value

And voila, you’ve turned your non-DACS date into a DACS-formatted date. You can use similar steps to make a column that creates an ISO-formatted date, too, although you’ll first have to convert months into two-digit numbers.

Finally, SQL

The two methods above require ETL — extract, transform, load. That is, you’re going to get data out of the database (or transform it into a tabbed sheet from xml) and then get it back into the database or the EAD (and then the database). There is a better way if you’re using Archivists’ Toolkit or ArchivesSpace, and it involves doing SQL updates. I’m going to punt on this for now, because I know that this will be a huge part of my future once we get into ArchivesSpace (I’ll also be creating normalized dates, which is data that Archivists’ Toolkit can’t store properly but ASpace can). So, stay tuned!

Sniffing out problems — bad barcode information

Like many large repositories, we barcode all of our containers and keep them offsite. This is a pretty straightforward and on the whole error-resistant process, where an archivist uses a plug-in to Archivists’ Toolkit to select a container and then wands in a barcode to associate it with that container. That barcode then gets stored in an item record in our ILS, in the off-site storage system’s database, and on a barcode sheet. It’s also serialized as part of the EAD. When a patron wants a box, a series of systems and people  (including Aeon, our ILS, and our off-site storage workers) work behind the scenes to bring it here within two business days. It’s pretty magical, and it’s how we can provide a high quality of service to our patrons.

However, there are a few less fool-proof methods by which a barcode could be introduced to a database. A barcode could be entered as part of rapid data entry. Since container information is entered as part of each component, it’s possible to accidentally mistype the container number but wand the same barcode. It’s also possible for barcodes to be entered as part of EAD import (although they don’t make it easy) or as a SQL update.

We also have an extra legacy problem — a bad update (I don’t know if it was SQL or XSL) assigned barcodes based on container number, but didn’t check for container type. So, reel 1 was assigned the same barcode as box 1.

The diagnosis is pretty straightforward (although I attempted it a zillion much-harder ways first). As always, many many thanks to my colleague Mark, the brains behind this operation, who actually wrote this SQL query after seeing me flail about.

This query basically asks — okay. Give me a list of all of your distinct barcodes and container numbers. Do you have any barcodes that are associated with more than one container number? If so, that’s a huge problem! Report back that barcode!

Barcode 987654321 should only be associated with box 1 — if it’s associated with both box 1 AND box 2, that’s a real problem. It means either that somehow the barcode was duplicated (which we go to GREAT PAINS to prevent happening) or that there was a data entry error.

You can run this query against your own AT database to see if there were any barcodes assigned to more than one box.

select barcode,
group_concat(distinct coalesce(container1NumericIndicator, container1AlphaNumIndicator)) from ArchDescriptionInstances where barcode <> ''
group by barcode
having count(barcode) > 1
and count(distinct coalesce(container1NumericIndicator, container1AlphaNumIndicator)) > 1

Maybe you have a healthier relationship with the small stuff than I do, but this is the kind of thing that maintains a constant low-level stress buzz in the back of my head. I find it reassuring that we can run these kinds of reports/checks to know for sure that collection control is working the way it should.

Oh, and by the way, the diagnosis query for my reel/box problem is here:

SELECT distinct barcode, container1Type FROM
) as tt
group by tt.barcode
having count(tt.barcode) > 1

I hope that someone out there finds this useful!

Converting Preliminary Inventories to Tables with Macros: Folder Titles and Numbers

In my last post, I talked about how we have used macros at Maryland to start converting preliminary inventories into more useful formats to prepare for eventual importation to ArchivesSpace and create EAD finding aids. Using another macro, I have replaced folder title rows containing “(2f)” or “(3f)” with multiple rows. Using these two macros together, each single line in our tables will represent a single folder. To complete the process, I used a formula that quickly creates folder numbers. Folder numbers will allow you to sort the spreadsheet data while preserving the physical order and location of each folder.

In a Word document, the 2f or 3f shorthand might help preserve space in an inventory or finding aid, or it might seem more logical or aesthetically pleasing. But in a table, we want an accurate representation of each unit (the folder). In the Athletic Media Relations example that I discussed in my last post, there were 67 rows that contained either 2f, 3f,or 4f:


Copying these lines one by one would have taken an excessive amount of time. Instead of copying a folder title over and over again, I automated the task by using this macro:

Sub two_f()
' Sub and End Sub are commands that say the macro will start and end
'Do is a command that marks the beginning of a loop
'Search the B column for 2f instances
Selection.Find(What:=", (2f)", After:=ActiveCell, LookIn:=xlFormulas, _
LookAt:=xlPart, SearchOrder:=xlByRows, SearchDirection:=xlNext, _
MatchCase:=False, SearchFormat:=False).Activate
'Create a new line and copy down
ActiveCell.Offset(1, 0).Rows("1:1").EntireRow.Select
Selection.Insert Shift:=xlDown, CopyOrigin:=xlFormatFromLeftOrAbove
'replace the “2f” from the line you copied and line you pasted
ActiveCell.Offset(-1, 1).Range("A1:A2").Select
Selection.Replace What:=", (2f)", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
Loop Until Cells.findnext is nothing
MsgBox "Conversion complete!"
End Sub

This macro can be adjusted for instances of “(3f)”, “(4f)”, etc. The following code shows the appropriate adjustments:

Sub three_f()
' Sub and End Sub are commands that say the macro will start and end
'Do is a command that marks the beginning of a loop
'Search the B column for 3f instances
Selection.Find(What:=", (3f)", After:=ActiveCell, LookIn:=xlFormulas, _
LookAt:=xlPart, SearchOrder:=xlByRows, SearchDirection:=xlNext, _
MatchCase:=False, SearchFormat:=False).Activate
'Create two new lines and copy down each time
ActiveCell.Offset(1, 0).Rows("1:1").EntireRow.Select
Selection.Insert Shift:=xlDown, CopyOrigin:=xlFormatFromLeftOrAbove
ActiveCell.Offset(1, 0).Rows("1:1").EntireRow.Select
Selection.Insert Shift:=xlDown, CopyOrigin:=xlFormatFromLeftOrAbove
'replace the “3f” from the line you copied and two lines you pasted
ActiveCell.Offset(-2, 1).Range("A1:A3").Select
Selection.Replace What:=", (3f)", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
Loop Until Cells.findnext is nothing
MsgBox "Conversion complete!"
End Sub

The code could also be adjusted accordingly to accommodate instances of 4f, 5f, etc. The macros will make the following adjustments to our example:


Within a few seconds, the macro created about 200 lines in this Athletic Media Relations spreadsheet.

As with any macro that you might copy and paste into your VBA Project Window, save frequently and make sure this fits your needs exactly to prevent any unintended results. For example, the sample code above searches for and replaces “(3f)”, but your inventories might not use parentheses.

Now that each line in this table represents a single folder in this collection, we should create a folder number to preserve the order inside each box.

Formulas have been used elsewhere on this blog. Like macros, formulas are extremely powerful for data cleanup and this one will hopefully give you some inspiration. As Cassie has written earlier, the IF formula can be very useful for testing a condition of your choice for a column of data. It can also be used for creating new data. In this case, I want to create a folder number.

This formula uses the same If/Then/Else logic as Cassie’s previous example. The =IF command requires three statements separated by commas. The first (“If”) is the true or false condition you are testing. The second (“Then”) is the result you want if the condition has been met; the third (“Else”) is the result that you want if the condition was not met.

In this formula, we want to see if a folder is in the same box as the previous folder. If it is, we want to continue a sequence of folder numbers (1 to 2, 2 to 3, etc.). If this folder is not in the same box, we want to reset the sequence of folder numbers and start over with the number one (“1”).

In the Excel IF formula, this logical expression looks like this:


The one problem here is that this formula will not work for your very first folder, since it has no previous line to refer back to. To resolve this, just put the number “1” as your first folder number and put the formula in the cell for the second folder number.


After you hit enter, fill down and don’t forget to copy the cells, and Paste Special your values!

As with macros, help with formulas is easy to find through Google, especially on sites such as or StackOverflow.

Converting Preliminary Inventories to Tables with Macros: Moving Box Numbers

At University of Maryland we have lots of folder- (or item-) level inventories for processed, partially processed, or unprocessed collections. Preliminary inventories were mostly created using Microsoft Word, which is meant to create written works on printed pages, not the tabular data that we need to operate between systems. Complicating matters further, these files were each structured quite differently. My challenge this summer was to bring some of these Word documents into a standardized table format. Getting this data into Excel is a first step before we can use it in an EAD finding aid/ArchivesSpace. This has required quite a bit of cleanup and conversion. By using macros and formulas, we can make the conversion process a little easier.

I started the conversion process using many of the same regular expressions that Bonnie has described here on the blog. I especially looked out for tabs and line breaks, which you can locate in Word by searching for ^p or ^t (as well as ^13 and ^9, if you are using wildcards). To convert the inventory into a table, I had to make sure that each folder title was separated by one line break, with no line breaks in between. This would allow me to copy and paste the contents of the Word document into Excel and create a clean table.

Cleaning up a Word document with regular expressions was only part of the conversion process. I was still left with a number of redundant tasks moving around standardized data in Excel. First among them was to eliminate lines reporting a box number and move that number into a separate column. This is where I started introducing macros into our conversion work.

A Microsoft Office macro is essentially a process that automatically executes a task or series of tasks. You can run one by clicking an option on a list or even pressing a custom-made hotkey. Microsoft Office macros use Visual Basic for Applications (VBA) programming language. VBA is intended to be a simple, forgiving language, but it still takes some time to learn. I will point out a few simple steps you can take to get started with macros and even how to customize each code.

The simplest way to create a macro is to record it. Microsoft Word and Excel can “record” your actions by translating your activity in the program into a written code. You can then save these actions and re-create them later by “playing” the macro.

To record a macro, go to the View tab, click on the Macros list, and click “Record Macro.” When you are finished recording, go back to the Macros list and click “Stop Recording.” To replay the macro (re-run the same operations you recorded) or edit the source code, go to the Macro list and click View Macro, then “Run” or “Edit”.

In some instances, you may already have the code for a macro but do not have it stored into your copy of Microsoft Excel. To insert the pre-written code into Excel, type Alt+F11 to open the VBA Project Window, then Insert -> Module. Paste your code into the new window and press the Play button.

The example that I will use is from one of our Athletic Media Relations accessions. If I were to move the words “Box 5” manually, I would perform a series of steps: search for “Box 5”, copy the cell contents, delete the row, find an adjacent empty cell to paste the value “5”, and fill “5” down. I wanted to turn this:

Excel Sample Before Macro Conversion

into this:


This particular inventory only had thirteen boxes, and could be converted in a minute or two manually, but that could take a very long time if you have a preliminary inventory with hundreds of boxes. So far I have applied the macro to about thirty inventories comprising several hundred boxes and saved hours of work. I used the following code to automate the process:

Sub Macro1()
' Sub and End Sub are commands that say the macro will start and end
'Create a new A column for the box number
Selection.Insert Shift:=xlToRight, CopyOrigin:=xlFormatFromLeftOrAbove
'Take the value Box 1 from Cell B1 to Cell A1
ActiveCell.Offset(0, 1).Range("A1").Select
ActiveCell.Offset(0, -1).Range("A1").Select
'Fill the box number down
Range(Selection, Selection.End(xlDown)).Select
'Delete the row that initially had Box 1 in it
'Do is a command that marks the beginning of a loop
'Selection.Find searches within the selection (the B column) for a cell with a box number
Selection.Find(What:="box *", After:=ActiveCell, LookIn:=xlFormulas, _
LookAt:=xlPart, SearchOrder:=xlByRows, SearchDirection:=xlNext, _
MatchCase:=False, SearchFormat:=False).Activate
ActiveCell.Offset(0, -1).Select
'Take out the word box in the A column so it is just a number
Columns(1).Cells.Replace What:="Box ", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
Range(Selection, Selection.End(xlDown)).Select
Application.CutCopyMode = False
Selection.Delete Shift:=xlUp
'This statement causes the commands to repeat until there are no more cells with the word Box in it
Loop Until Cells.FindNext Is Nothing
'These last five lines clean up the leftover box number from the last fill down
ActiveCell.Offset(1, -1).Select
Range(Selection, Selection.End(xlDown)).Select
'End Sub is a required command that ends the macro
End Sub

Before you press “play,” a few words of warning. First, make sure you save a backup copy. You cannot undo the changes that are made through a macro! Any existing VBA code may need minor changes and tweaks to work with a new document, so be careful to review your code and make sure it fits your needs. Also, check to see if the macro will produce any unintended results when it runs.

The above sample macro identifies the “Box x” field by searching for the word “box” followed by a space and more characters. So make sure that there are no folder titles with the word “box” in it! (For example, in the sample screenshots above, “Box Scores” would need to be changed to read “Box-Scores”.)

For additional macro and VBA tips and workarounds, search Google for using “Excel”, “macro” and “VBA” as search terms. and StackOverflow are especially useful sites.

In the next blog post, I will introduce a few other Excel macros and functions we have used at Maryland to continue this conversion process.