Let’s start with a more straightforward cleanup issue identified during our accession record mapping. In this example, we’ll use a set of accession records (only a portion of our total) we have exported from the Beast into an Excel spreadsheet and focus on two fields: <unitdateinclusive> and <unitdatebulk>. We’ll map these to date fields in ArchivesSpace, but before we get to that let’s examine the data.
This spreadsheet contains 3361 accession records. 2685 rows have an inclusive date and 1908 rows include a bulk date. By sorting the spreadsheet by date and spot checking the data, we’ve come up with a list of inconsistent date formatting issues. One of the most pervasive habits was to always fill out both the inclusive and bulk dates, even if the values for each were the exact same. (For now, ignore the other date formatting issues in these examples.)
Supplying this information twice isn’t necessary for our users and could be confusing to some (plus it is extra work for us!) DACS 2.4.10 suggests providing a bulk date when dates differ significantly from the inclusive dates, so we want to keep the bulk dates that are different than our inclusive dates while removing the duplicate values.
We could compare these by hand (done that before!) or use a formula in Excel to do the work for us:
=IF(A2=B2, “same”, B2)
This formula asks if the value in <unitdateinclusive> equals the value in <unitdatebulk>. If they are equal, return the value “same” and if they are different return the value of <unitdatebulk>.
After dragging down the formula for the entire sheet I then copy the results of this new column to another one, utilizing the “paste values” feature to carry over the content and not the formula for the cell value.
I could have put nothing instead of “same” in my new column if the values were equal, but I wanted to know how many times these dates were equal. Sorting by my newbulkdate column I know that:
- 777 rows only contained an inclusive date.
- The formula as I have it would have returned a “0” here, because I didn’t tell it how to handle a blank cell in <unitdatebulk>.
- Easy fix is to go back to my <unitdatebulk> column, find all the blank cells, and replace them with “empty.” Empty will carry forward with the formula. (I’m sure there is a way to handle this with the formula if anyone wants to jump in.)
- 567 rows were identified as having a different value in bulk date so these dates were retained.
- 1341 rows had an identical value in inclusive and bulk dates and were not kept.
I can now do a global replace on the newbulkdate column to replace “same” and “empty” with nothing. I then remove my original <unitdatebulk> column and my bulkformula column.
Of course, this method only worked on cells were the characters were the exact same. There will be bulk dates that are the same as inclusive dates this didn’t catch, such as values with different spacing or punctuation. (ex: c.2007 v. c. 2007)
In other posts we’ll look at more date clean up questions, issues, and cleanup.