
Wednesday, October 3, 2012

Automating Preservation Processes

As I mentioned in an earlier post, I have found an archive called tDAR (http://www.tdar.org/), and the services they offer are compelling.  Like most of the newer archival operations, their approach is to automate as many processes as possible.  This means that they run checksums and fixity checks on the files under their care.  There is some version control, and there are metadata fields to track provenance. Much of the information is entered by the depositor; the rest is automated by their "workflow engine," through which they can track, process, and manage files when they are deposited and over time.
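
Just to illustrate what I mean by automated fixity checking, here is a minimal sketch in Python. This is my own toy example, not tDAR's actual workflow engine, and the manifest file and paths are made up.

    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path):
        """Compute a SHA-256 checksum by reading the file in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def check_fixity(manifest_path):
        """Compare current checksums against a stored manifest of path -> checksum."""
        manifest = json.loads(Path(manifest_path).read_text())
        for rel_path, recorded in manifest.items():
            current = sha256_of(rel_path)
            status = "OK" if current == recorded else "CHANGED"
            print(f"{status}  {rel_path}")

    # Example: check_fixity("holdings_manifest.json")

A repository runs something like this on a schedule so that silent corruption is caught long before anyone asks for the file.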

But here's the thing: in my situation, we have data we acquired a long time ago, up to 35-40 years ago.  Initially I worked with punch cards, and these have been converted or migrated over time to new storage media, and again whenever changes in operating systems threatened the usability of the data sets.  This has sometimes required us to write little programs or take other actions, for example to move files from EBCDIC to ASCII, or to keep them usable as we moved from mainframe operating systems to DOS to Windows.  We have also had to make sure that where we had system files produced by SPSS, SAS, Stata, etc., or data plus a setup file (now our preferred archival mode), these were still usable in newer versions of the statistical software.  I kept, and still keep, paper files (yes, slowly being converted to PDF) on everything we did with each file.
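
For readers who have never had to do it, the EBCDIC-to-ASCII step looks roughly like the little sketch below. It assumes a plain character file in US EBCDIC (code page cp037); real mainframe exports with packed-decimal fields or fixed-length records need more careful handling than this, and the file names are just illustrative.

    def ebcdic_to_ascii(in_path, out_path):
        """Convert a plain EBCDIC text file to an ASCII text file."""
        with open(in_path, "rb") as src:
            text = src.read().decode("cp037")       # EBCDIC -> Unicode
        with open(out_path, "w", encoding="ascii", errors="replace") as dst:
            dst.write(text)                          # Unicode -> ASCII

    # Example: ebcdic_to_ascii("survey1978.ebc", "survey1978.txt")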

We now use Data Documentation Initiative metadata fields in DDI Codebook (Section 1.0, Document Description, and Section 2.0, Study Description, as appropriate) to keep information about the software version(s).  For example, when we first started monkeying around with DDI we used a very odd freeware editor that, we realized later, did not move easily into other, better editors, so we now record the editor information in Section 1.0 and the statistical package versions in Section 2.0.  It takes effort to find this out when a file is being evaluated: sometimes it is in the header information and sometimes not, and you have to dig for it.  But it is worth it, because we want to be sure we are documenting versions and clearly delineating provenance.  We think this will matter to future researchers and archivists.
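
To make that concrete, here is a rough Python sketch of how one might record the editor and statistical-package versions in a small DDI-Codebook-style XML fragment. The element names are simplified placeholders for the real Section 1.0 and 2.0 structures, so check them against the actual DDI Codebook schema before relying on this.

    import xml.etree.ElementTree as ET

    def version_record(editor, editor_version, stat_package, stat_version):
        """Build a simplified DDI-Codebook-style fragment noting software versions.

        Element names here are illustrative only; map them onto the real
        docDscr (Section 1.0) and stdyDscr (Section 2.0) elements of the schema.
        """
        code_book = ET.Element("codeBook")
        doc_dscr = ET.SubElement(code_book, "docDscr")    # 1.0: the DDI document itself
        ET.SubElement(doc_dscr, "software", version=editor_version).text = editor
        stdy_dscr = ET.SubElement(code_book, "stdyDscr")  # 2.0: the study and its data
        ET.SubElement(stdy_dscr, "software", version=stat_version).text = stat_package
        return ET.tostring(code_book, encoding="unicode")

    # Example: print(version_record("OddFreewareEditor", "0.9", "SPSS", "11.5"))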


The information on versions and what we have done to migrate files is also hand-recorded in our SQL database of holdings, and we can run a little report from time to time to see if anything is getting old.  We keep track of websites with little hacks and tricks for these stat package conversions and for other formats as well (for example, early versions of PDF did not convert easily to PDF/A).  The process is tedious and labor intensive.  We have had lots of help from our statistical consultants, who know a number of helpful tricks for making data usable again.  It takes a lot of time to convert older file formats, and we have a huge backlog.
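
The "little report" idea is nothing fancy. Something along the lines of the sketch below, run against a SQLite copy of the holdings database, would do it; the database, table, and column names here are invented for illustration and our real holdings database has its own schema.

    import sqlite3

    def aging_report(db_path, years=10):
        """List holdings whose last recorded migration is older than the cutoff."""
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            """
            SELECT study_id, file_name, software, software_version, last_migrated
            FROM holdings
            WHERE last_migrated <= date('now', ?)
            ORDER BY last_migrated
            """,
            (f"-{years} years",),
        ).fetchall()
        conn.close()
        for study_id, file_name, software, version, last in rows:
            print(f"{study_id}  {file_name}  {software} {version}  last migrated {last}")

    # Example: aging_report("holdings.db", years=10)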

So, when I am looking at the newer repository operations, I am looking for where this kind of work gets done, how, and to what extent. I know that different file formats require different amounts of loving care and attention.  tDAR, for example, accepts formats that are either open standards, like CSV, XLSX, and TIFF, which are international standards that are not patent protected and can be implemented from publicly available specifications, or industry standards, like MDB or DOC files, which are not "open" but are widely used, and their system can convert these files when the need arises.  And they say: "We do some conversion of files, but this is to provide more accessible formats, but not to archival formats.  Why not?  All of the formats we accept are either open standards or so commonly used that we don't see the need."

My experience is that truly managing the data files we handle at UCLA requires a lot of hands-on care, and for that reason we don't take much new material unless it meets our requirements for documentation and format.  We are also too small a shop to handle a lot of data.  We have a list of what we look for in an initial assessment, and if the depositor doesn't provide us with enough, we don't ingest it.


In some automated repository systems there are options for individuals to upload materials, enter a few details, and voila, they can say it's archived.   Having had the experiences I have had with statistical software, operating systems, and storage media, this does not sound like enough, and so I can't say to what extent the materials managed by some of the better-known repository systems are truly being preserved.  I am told that there is no repository system anywhere that addresses these issues.

Are these issues not being addressed because we have not figured out how to do so?  Or are my concerns no longer relevant in today's technology environment?  Or is it really that these checking processes are hard to automate, and without automation they are too time-consuming, labor-intensive, and expensive?  If the answer to this last question is yes, then what about coming up with strategies to address it?   I have always felt that it is better to manage a smaller collection of well-documented materials than to try to take everything.  I'd rather feel that at least a few things are being preserved for someone to use in 40-50 years than have a lot of stuff that nobody can use.

Wednesday, September 26, 2012

tDAR ~ The Digital Archaeological Record


I'll just say it: I really like what tDAR is doing.  From their website: "tDAR is an international digital archive and repository that houses data about archaeological investigations, research, resources, and scholarship."  Their work represents an international effort including the U.K.-based Archaeology Data Service and Digital Antiquity in the U.S. Other players are the University of Arkansas, Arizona State University, Pennsylvania State University, Washington State University, the SRI Foundation, and the University of York. tDAR has a sustainable model for ongoing operation AND for preserving deposited materials into the future.  This is the first repository I have encountered where there is substantial consideration of how changes in technology and in software can affect the future usability of data.  Their approach is to limit the data formats accepted and to use automated processes to check and migrate files when needed.

And they will work with a wide variety of materials, even within these limits.

I have also been pleased that the tDAR staff have been so helpful and responsive as I ask my questions.  As I proceed with the project for the Cotsen Institute, I think tDAR is one of the leading candidates for repository choice.