Wednesday, October 3, 2012

Automating Preservation Processes

As I mentioned in an earlier post, I have found an archive called tDAR (http://www.tdar.org/), and the services they offer are compelling.  Like most of the newer archival operations, their approach is to automate as many processes as possible.  This means that they do checksums and fixity checks for the files under their care.  There is some version control, and there are metadata fields to track provenance.  Much of the information is entered by the depositor.  The rest is handled by their "workflow engine," through which they can track, process, and manage files when they are deposited and over time.
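
To make the fixity idea concrete, here is a minimal sketch of what such a check involves, written in Python purely as my own illustration (it is not tDAR's actual workflow engine): a checksum is recorded when a file is deposited and re-verified later to confirm the bits have not silently changed.

    import hashlib
    from pathlib import Path

    def sha256_of(path):
        """Compute a file's SHA-256 checksum, reading in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def record_fixity(paths):
        """At deposit time, record a checksum for each file."""
        return {str(p): sha256_of(p) for p in paths}

    def verify_fixity(recorded):
        """Later, re-check every file against its stored checksum."""
        problems = []
        for path, expected in recorded.items():
            if not Path(path).exists():
                problems.append((path, "missing"))
            elif sha256_of(path) != expected:
                problems.append((path, "checksum changed"))
        return problems

An automated repository runs something like this on a schedule and flags any file whose checksum no longer matches what was recorded at deposit.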

But here's the thing:  In my situation, we have data we acquired a long time ago, in some cases 35-40 years ago.  Initially, I worked with punch cards, and these files have been converted or migrated over time to new storage media and to ensure that data sets can still be used when operating systems change.  This has sometimes required us to write little programs or take other steps to, for example, convert files from EBCDIC to ASCII, or to enable use as we moved from mainframe operating systems to DOS to Windows.  We have also had to be sure that where we had system files produced by SPSS, SAS, Stata, etc., or data plus a setup file (now our preferred archival mode), these were still usable in newer versions of the statistical software.  I kept, and still keep, paper files (yes, slowly being converted to PDF) on everything we did with each file.
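
To give a flavor of the "little programs" I mean, here is a rough sketch of an EBCDIC-to-ASCII conversion in Python (not the language we actually used back then).  It assumes the common cp037 code page and fixed 80-byte card-image records; both assumptions have to be checked against the actual source file.

    # A sketch only: assumes EBCDIC code page cp037 and fixed 80-byte
    # card-image records, which will not be true of every old file.
    RECORD_LENGTH = 80

    def ebcdic_to_ascii(in_path, out_path, codepage="cp037"):
        with open(in_path, "rb") as src, \
             open(out_path, "w", encoding="ascii", errors="replace") as dst:
            while True:
                record = src.read(RECORD_LENGTH)
                if not record:
                    break
                dst.write(record.decode(codepage).rstrip() + "\n")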

We now use Data Documentation Initiative metadata fields in DDI Codebook (Section 1.0, Document Description, and Section 2.0, Study Description, as appropriate) to keep information about the software version(s).  (For example, when we first started monkeying around with DDI we used a very odd freeware editor that, we realized later, did not move easily into other, better editors, so we now record the editor info in Section 1.0 and the stat package versions in the Section 2.0 fields.)  It takes effort to find this out when a file is being evaluated.  Sometimes it is in the header info and sometimes not.  You have to dig for it.  And it is worth it, because we want to be sure we are documenting versions and clearly delineating provenance.  We think this will matter to future researchers and archivists.
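
As an example of the digging, here is a small Python sketch that peeks at the header of an SPSS system (.sav) file, which typically begins with a 4-byte record type ("$FL2") followed by a 60-byte product field naming the software that wrote it.  Header layouts vary across versions and formats, so treat this as an illustration rather than a reliable tool.

    def sav_product_string(path):
        """Read the product string from the header of a classic SPSS .sav file."""
        with open(path, "rb") as f:
            rec_type = f.read(4)
            product = f.read(60)
        if rec_type != b"$FL2":
            return None  # not a classic .sav file, or a variant this sketch doesn't handle
        return product.decode("ascii", errors="replace").strip()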


The info on versions and on what we have done to migrate is also hand recorded in our SQL database of holdings, and we can run a little report from time to time to see if anything is getting old.  We keep track of websites with little hacks, etc., for making these stat package conversions and also for other formats (for example, early versions of PDF did not easily convert to PDF/A).  The process is tedious and labor intensive.  We have had lots of help from our statistical consultants, who know a number of helpful tricks for making data usable again.  It takes a lot of time to convert older file formats, and we have a huge backlog.
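
The "little report" is nothing fancy.  Here is the shape of it as a Python sketch against SQLite, with an invented holdings table (file_name, format, software_version, last_migrated); our real database has a different layout, so this only shows the idea.

    import sqlite3

    def aging_report(db_path, years=5):
        """List holdings whose last migration is older than the given number of years."""
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            """
            SELECT file_name, format, software_version, last_migrated
            FROM holdings
            WHERE last_migrated <= date('now', ?)
            ORDER BY last_migrated
            """,
            ("-{} years".format(years),),
        ).fetchall()
        conn.close()
        return rows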

So, when I am looking at the newer repository operations, I am looking for where this kind of work gets done, how, and to what extent.  I know that different file formats require different amounts of loving care and attention.  For example, tDAR accepts open standard formats like CSV, XLSX, and TIFF, which are international standards that are not patent protected and can be implemented via publicly available specs.  They also accept industry standards like MDB or DOC files, which are not "open" but are widely used, and their system can convert those files when the need arises.  And they say: "We do some conversion of files, but this is to provide more accessible formats, but not to archival formats.  Why not?  All of the formats we accept are either open standards or so commonly used that we don't see the need."

My experience is that truly managing the data files we handle at UCLA requires a lot of hands-on care, and for that reason we don't take much new material unless it meets our requirements for documentation and format.  Also, we are too small a shop to handle a lot of data.  We have a list of what we look for in an initial assessment, and if the depositor doesn't provide us with enough, we don't ingest it.


In some automated repository systems there are options for individuals to upload materials, fill in a few details, and voilà, they can say it's archived.  Having had the experiences I have had with statistical software, operating systems, and storage media formats, this does not sound like enough, and therefore I can't say to what extent the materials managed by some of the better known repository systems are truly being preserved.  I am told that there is no repository system anywhere that addresses these issues.

Are these issues not being addressed because we have not figured out how to do so?  Or are my concerns no longer relevant in today's technology environment?  Or is it really that these checking processes are hard to automate, and that without automation they are too time consuming, labor intensive, and expensive?  If the answer to that last question is yes, then what about coming up with strategies to address it?  I have always felt that it is better to manage a smaller collection of well documented materials than to try to take everything.  I'd rather feel that at least a few things are being preserved for someone to use in 40-50 years than have a lot of stuff that nobody can use.

1 comment:

Ann Green said...

Here is a related blog post from "Digital Preservation for Beginners" that picks up on some of the points in your post.
http://easydigitalpreservation.wordpress.com/2010/10/05/file-formats-and-preservation/

It refers to the U of Minn Digital Conservancy's preservation support levels for commonly used file formats, and it indicates which formats will be USABLE over time, not just ACCESSIBLE.
http://conservancy.umn.edu/pol-preservation.jsp

But nothing here about a system that actually does the hard work.