Automating Preservation Processes
Wednesday, October 3, 2012

As I mentioned in an earlier post, I have found an archive called tDAR (http://www.tdar.org/), and the services they offer are compelling. Like most of the newer archival operations, their approach is to automate as many processes as possible. This means that they run checksums and fixity checks on the files under their care. There is some version control, and there are metadata fields to track provenance. Much of the information is entered by the depositor; the rest is handled by their "workflow engine," through which they can track, process, and manage files when they are deposited and over time.
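Just to illustrate the idea (I don't know tDAR's internals), here is a minimal Python sketch of the kind of checksum and fixity check a workflow engine like theirs automates: compute a digest at deposit, store it with the file's metadata, and re-verify it on a later fixity run.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(path: Path, recorded_checksum: str) -> bool:
    """True if the file still matches the checksum recorded at deposit."""
    return sha256_of(path) == recorded_checksum

# At deposit: record sha256_of(file) alongside the file's metadata.
# On each later fixity run: flag any file for which verify_fixity() is False.
```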
But here's the thing: in my situation, we have data we acquired a long time ago, up to 35-40 years ago. Initially I worked with punch cards, and these have been converted and migrated over time to new storage media, and reworked to ensure that the data sets can still be used as operating systems change. This has sometimes required us to write little programs or take other steps to, for example, convert files from EBCDIC to ASCII, or to move them from a mainframe OS to DOS to Windows. We have also had to make sure that where we had system files produced by SPSS, SAS, Stata, etc., or data plus a setup file (now our preferred archival mode), these were still usable in newer versions of the statistical software. I kept, and still keep, paper files (yes, slowly being converted to PDF) on everything we did with each file.
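For anyone who has not had to do this, the EBCDIC-to-ASCII step for a plain text file can be as small as a character-set transcode. Here is a minimal Python sketch; it assumes the common IBM code page 037 and uses made-up file names, and real mainframe exports often need record-length or packed-decimal handling on top of this.

```python
# Minimal sketch: transcode an EBCDIC text file (IBM code page 037 assumed)
# to plain ASCII. Treat this as a starting point, not a general solution.
def ebcdic_to_ascii(in_path: str, out_path: str, codepage: str = "cp037") -> None:
    with open(in_path, "rb") as src:
        raw = src.read()
    text = raw.decode(codepage)
    # Characters with no ASCII equivalent are replaced rather than dropped.
    with open(out_path, "w", encoding="ascii", errors="replace") as dst:
        dst.write(text)

# Hypothetical file names, for illustration only.
ebcdic_to_ascii("survey1978.ebc", "survey1978.txt")
```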
We now use Data Documentation Initiative metadata fields in DDI Codebook (Section 1.0, Document Description, and Section 2.0, Study Description, as appropriate) to keep information about the software version(s). (For example, when we first started monkeying around with DDI we used a very odd freeware editor that, we realized later, did not move easily into other, better editors, so we now record the editor info in Section 1.0, and we record the stat package versions in Section 2.0.) It takes effort to find this information when a file is being evaluated: sometimes it is in the header info and sometimes not, and you have to dig for it. But it is worth it, because we want to be sure we are documenting versions and clearly delineating provenance. We think this will matter to future researchers and archivists.
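To make that bookkeeping concrete, here is a rough Python sketch of writing software-version notes into a DDI Codebook instance. The element names (docDscr, stdyDscr, citation, prodStmt, software) follow my reading of DDI Codebook 2.x, the version strings and file name are placeholders, and the fragment is simplified, so check it against the real schema before relying on it.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative DDI Codebook fragment -- element names follow
# DDI Codebook 2.x as I read it; validate against the actual schema.
codebook = ET.Element("codeBook")

# Section 1.0, Document Description: note the DDI editor and its version.
doc_dscr = ET.SubElement(codebook, "docDscr")
doc_prod = ET.SubElement(ET.SubElement(doc_dscr, "citation"), "prodStmt")
editor = ET.SubElement(doc_prod, "software", version="0.9")  # placeholder version
editor.text = "Freeware DDI editor used for initial markup"

# Section 2.0, Study Description: note the stat package version(s).
stdy_dscr = ET.SubElement(codebook, "stdyDscr")
stdy_prod = ET.SubElement(ET.SubElement(stdy_dscr, "citation"), "prodStmt")
stat = ET.SubElement(stdy_prod, "software", version="9.4")  # placeholder version
stat.text = "SAS system file produced and tested with this release"

ET.ElementTree(codebook).write("study-ddi.xml", encoding="utf-8", xml_declaration=True)
```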
The info on versions and on what we have done to migrate files is also hand-recorded in our SQL database of holdings, and we can run a little report from time to time to see if anything is getting kind of old. We keep track of websites with little hacks and the like for making these stat package conversions, and also for other formats (for example, early versions of PDF did not convert easily to PDF/A). The process is tedious and labor-intensive. We have had lots of help from our statistical consultants, who know a number of helpful tricks for making data usable again. It takes a lot of time to convert older file formats, and we have a huge backlog.
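Our holdings database and its schema are particular to us, so the sketch below uses made-up table and column names, but it shows the shape of that "is anything getting old" report: a date cutoff applied to the last migration recorded for each file.

```python
import sqlite3

# Hypothetical schema: a 'holdings' table recording each file's format,
# the software version it was last migrated to, and the migration date.
conn = sqlite3.connect("holdings.db")
rows = conn.execute(
    """
    SELECT file_id, file_format, software_version, last_migrated
    FROM holdings
    WHERE last_migrated < DATE('now', '-5 years')   -- "getting kind of old"
    ORDER BY last_migrated
    """
).fetchall()

for file_id, file_format, version, migrated in rows:
    print(f"{file_id}: {file_format} (v{version}) last migrated {migrated}")
conn.close()
```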
So, when I am looking at the newer repository operations, I am looking for where this kind of work gets done, how, and to what extent. I know that different file formats require different amounts of loving care and attention. For example, tDAR accepts open-standard formats like CSV, XLSX, and TIFF, which are international standards that are not patent-protected and can be implemented from publicly available specs. Or they accept industry standards like MDB or DOC files, which are not "open" but are widely used, and their system can convert the files when the need arises. And
they say "We do some conversion of files, but this is to provide more accessible
formats, but not to archival formats. Why not? All of the formats we
accept are either open standards or so commonly used that we don't see
the need."
My experience is that truly managing the data files we handle at UCLA requires a lot of hands-on care, and for that reason we don't take much new material unless it meets our requirements for documentation and format. Also, we are too small a shop to handle a lot of data. We have a list of what we look for in an initial assessment, and if the depositor doesn't provide us with enough, we don't ingest it.
In some automated repository systems there are options for individuals to upload materials, enter some details, and voilà, they can say it's archived. Having had the experiences I have had with statistical software, operating systems, and storage media formats, this does not sound like enough, and so I can't say to what extent the materials managed by some of the better-known repository systems are truly being preserved. I am told that there is no repository system anywhere that addresses these issues.
Are these issues not being addressed because we have not figured out how to do so? Or are my concerns no longer relevant in today's technology environment? Or is it really that these checking processes are hard to automate, and that without automation they are too time-consuming, labor-intensive, and expensive? If the answer to that last question is yes, then what about coming up with strategies to address it? I have always felt that it is better to manage a smaller collection of well-documented materials than to try to take everything. I'd rather feel that at least a few things are being preserved for someone to use in 40-50 years than have a lot of stuff that nobody can use.
1 comment:
Here is a related blog post from "Digital Preservation for Beginners" that picks up on some of your post.
http://easydigitalpreservation.wordpress.com/2010/10/05/file-formats-and-preservation/
and it refers to U of Minn Digital Conservancy's preservation support levels for commonly used file formats. Indicates which formats will be USABLE over time, not just ACCESSIBLE.
http://conservancy.umn.edu/pol-preservation.jsp
But nothing here about a system that actually does the hard work.