Monday, October 15, 2012

What is C-14 dating and how is it data?

No, Carbon-14 dating is not another way to meet your true love, unless you are looking to date a fossil. I have dated some fossils in my life, but that story is for a different blog.  Seriously, C-14 dating is a technique used to determine the age of an artifact, which could be a pot sherd, a bone, wood, cloth, or even plant material.  It is used widely in archaeological and anthropological research, and the resulting C-14 measurements can be recorded and kept as data.

OK, what is C-14?

What happens is very interesting.  See: http://science.howstuffworks.com/environmental/earth/geology/carbon-141.htm Actually, it sounds more like a demolition derby to me.  First, cosmic rays arrive from space, which sounds very science fiction to me, but it happens all the time; in fact I read that every person gets hit by about half a million cosmic rays every hour. No wonder I feel beat all the time!  These cosmic rays don't just hit people, they also slam into atoms in the upper atmosphere, knocking loose energetic neutrons. Those neutrons then bombard everything in their way, including nitrogen atoms, and when a neutron hits a nitrogen-14 atom it converts it into radioactive carbon-14. (Ordinary carbon-12 is simply the stable form of carbon that is already all around us.)  Personally, I think the cosmic rays, neutrons and atoms need a time out for all this colliding and hitting.

Anyway, the key thing is that all this anti-social behavior produces C-14 atoms that are radioactive, and as we all know, anything radioactive has a half-life.  In this case, carbon-14 has a half-life of about 5,700 years.  But we are not finished yet.  When carbon-14 combines with oxygen in the atmosphere, it forms carbon dioxide, which plants soak up and which we humans and our animal friends then eat.  Yes, even if you are a vegetarian you are taking in carbon! So every living thing carries a certain amount of radioactive C-14 along with ordinary C-12, and the ratio of the two in our systems stays pretty much constant. Until we die.

Once we die, no new carbon comes in, so the amount of C-14 in our bodies starts to lessen as it decays, while the stable C-12 stays the same. Scientists have measured the carbon in plants, animals and other organic things, so they know about how much C-14 is present in, for example, a living tree.  Using some fancy equations which I will not go into here, it is possible to compare how much C-14 is left in a biological or geological artifact with how much C-12 is present, and from that ratio work out an age.  Scientists can do this for things that are up to about 60,000 years old.  Isn't that cool! The technique was developed by a chemist named Willard Libby, who published it in 1949.
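
Since I promised not to go into the fancy equations, here is just a rough sketch of the arithmetic behind them. This is not what a radiocarbon lab actually does (real labs use conventional half-life values and calibration curves); it simply applies the textbook half-life decay formula to an assumed measured ratio.

```python
import math

# Commonly quoted modern half-life of carbon-14 (the post rounds this to ~5,700 years).
HALF_LIFE_YEARS = 5730

def estimated_age(c14_fraction_remaining):
    """Rough age estimate from the fraction of the original C-14 still present.

    c14_fraction_remaining: the C-14/C-12 ratio measured in the sample divided
    by the ratio in a living organism (a number between 0 and 1).
    """
    # N(t) = N0 * (1/2)**(t / half_life)  =>  t = half_life * log2(N0 / N)
    return HALF_LIFE_YEARS * math.log2(1.0 / c14_fraction_remaining)

# Example: a sample with half the living-organism ratio is one half-life old.
print(round(estimated_age(0.5)))    # ~5,730 years
print(round(estimated_age(0.25)))   # ~11,460 years
print(round(estimated_age(0.001)))  # ~57,000 years -- near the practical limit
```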

OK, so how does this turn into a data file?

So, the data gathered in C-14 dating is the age of an artifact in years.  It can be recorded as something like 10,000 years, 36,000 years, 700 years, and so on.  That is good, but most researchers also record other details, such as where the artifact came from, a description of it, who gathered it, and with what instruments.  This could include the date the artifact was found; geographic details such as latitude and longitude; depth; what was found next to or around the artifact; the name of the researcher or project; the equipment used; and maybe even a text description or abstract. All of these items can be recorded as fields in a spreadsheet, as sketched below. I am still researching this, but it seems that the kind and amount of metadata recorded varies depending on what is being dated.
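
To make that concrete, here is a tiny sketch of such a spreadsheet written out as a CSV file. The field names and the sample row are invented for illustration; they are not any lab's or repository's standard.

```python
import csv

# Hypothetical field names for illustration only.
FIELDS = [
    "sample_id", "age_years_bp", "age_error_years", "material",
    "site_name", "latitude", "longitude", "depth_cm",
    "date_collected", "researcher", "project", "instrument", "notes",
]

rows = [
    {
        "sample_id": "S-001", "age_years_bp": 10000, "age_error_years": 120,
        "material": "charcoal", "site_name": "Example Rockshelter",
        "latitude": 34.07, "longitude": -118.44, "depth_cm": 85,
        "date_collected": "2011-07-14", "researcher": "A. Example",
        "project": "Example Survey", "instrument": "AMS",
        "notes": "found beneath hearth feature",
    },
]

with open("c14_dates.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```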

At tDAR a search turns up some datasets based on C-14 dating.  Within tDAR the metadata is extensive and includes ways to identify and describe each column: the data type (text, numeric, etc.), the type of value, the category the measurement falls under, and, if one was created, the ontology the investigator used to organize the details. Depositors also record items such as site name, type of site, the anthropological or archaeological culture (e.g., Late Archaic), the material being measured (e.g., fauna), and the method of collection or investigation type (e.g., excavation). And there are some generic items such as a record number, a DOI and the resource language. A rough sketch of this kind of column-level description follows.
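
This is only my own sketch of the kinds of per-column and per-dataset descriptions a repository like tDAR captures; the field names and values are illustrative, not tDAR's actual schema.

```python
# Illustrative column-level descriptions (not tDAR's real data model).
column_metadata = [
    {
        "column": "age_years_bp",
        "data_type": "numeric",
        "value_type": "measurement",
        "category": "radiocarbon date",
        "ontology": None,  # an investigator-supplied ontology could be referenced here
    },
    {
        "column": "material",
        "data_type": "text",
        "value_type": "coded value",
        "category": "dated material",
        "ontology": "project fauna ontology (if one was created)",
    },
]

# Illustrative dataset-level descriptions.
dataset_metadata = {
    "site_name": "Example Rockshelter",
    "site_type": "rockshelter",
    "culture": "Late Archaic",
    "material": "fauna",
    "investigation_type": "excavation",
    "record_number": 12345,       # invented
    "doi": "10.0000/example",     # invented
    "resource_language": "English",
}
```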

So, what does this mean for long term data management?

Although a lot of effort is required to find artifacts and date them, the results can be organized into a spreadsheet and described in such a way that others can use and re-use the material.  tDAR has a useful metadata structure, but one could also use other XML-based metadata schemas.  I am not sure these data files will be around for the next 60,000 years, but with proper management they could be around for some time to come.

Wednesday, October 3, 2012

Automating Preservation Processes

As I mentioned in an earlier post, I have found an archive called tDAR (http://www.tdar.org/), and the services they offer are compelling.  Like most of the newer archival operations, their approach is to automate as many processes as possible.  This means that they run checksums and fixity checks on the files under their care.  There is some version control, and there are metadata fields to track provenance. Much of the information is entered by the depositor.  The rest is automated by their "workflow engine," through which they can track, process, and manage files when they are deposited and over time.
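
For readers who have not run into the term, a fixity check is just re-computing a checksum and comparing it with the one recorded when the file was deposited. Here is a bare-bones sketch of the idea; it is not tDAR's workflow engine, just the underlying technique.

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a checksum for a file, reading it in chunks to handle large files."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fixity_ok(path, recorded_checksum):
    """Fixity check: does the file still match the checksum recorded at deposit?"""
    return file_checksum(path) == recorded_checksum
```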

But here's the thing: in my situation, we have data we acquired a long time ago, up to 35-40 years ago.  Initially I worked with punch cards, and the data have been converted or migrated over time to new storage media, and again to ensure that data sets can still be used when operating systems change.  This has sometimes required us to write little programs or take other steps, for example to move files from EBCDIC to ASCII, or to keep them usable as we went from mainframe operating systems to DOS to Windows.  We have also had to make sure that where we had system files produced by SPSS, SAS, Stata, etc., or data plus a setup file (now our preferred archival mode), these were still usable in newer versions of the statistical software.  I kept, and still keep, paper files (yes, slowly being converted to PDF) on everything we did with each file.
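
As a flavor of what those "little programs" look like, here is a minimal sketch of an EBCDIC-to-ASCII text conversion. Real migrations are messier: EBCDIC has many code pages (cp037 below is just one common IBM one), and mainframe files often have fixed-length records and packed fields that a simple character translation does not handle.

```python
def ebcdic_to_ascii(in_path, out_path, codepage="cp037"):
    """Translate an EBCDIC text file to ASCII, replacing untranslatable characters."""
    with open(in_path, "rb") as f:
        raw = f.read()
    text = raw.decode(codepage)  # decode from the chosen EBCDIC code page
    with open(out_path, "w", encoding="ascii", errors="replace") as f:
        f.write(text)
```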

We now use Data Documentation Initiative (DDI) metadata fields in DDI Codebook (Section 1.0, Document Description, and Section 2.0, Study Description, as appropriate) to keep information about the software version(s). (For example, when we first started monkeying around with DDI we used a very odd freeware editor that, we realized later, did not easily move into other, better editors, so we now record the editor info in Section 1.0 and the stat package versions in Section 2.0.)  It takes effort to find this out when a file is being evaluated.  Sometimes it is in the header info and sometimes not; you have to dig to find it. And it is worth it, because we want to be sure we are documenting versions and clearly delineating provenance.  We think this will matter to future researchers and archivists.
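
Here is a rough sketch of what recording software/version information in a DDI-Codebook-style XML snippet can look like. The element names approximate DDI Codebook 2.x (codeBook/docDscr/citation/prodStmt/software); check them against the actual schema before relying on this, and the editor name and version are of course placeholders.

```python
import xml.etree.ElementTree as ET

codebook = ET.Element("codeBook")
doc_dscr = ET.SubElement(codebook, "docDscr")          # Section 1.0, Document Description
citation = ET.SubElement(doc_dscr, "citation")
prod_stmt = ET.SubElement(citation, "prodStmt")

# Record the software that produced the documentation, with its version.
software = ET.SubElement(prod_stmt, "software", version="1.2.3")
software.text = "Hypothetical DDI editor"

print(ET.tostring(codebook, encoding="unicode"))
```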


The info on versions, and on what we have done to migrate, is also hand recorded in our SQL database of holdings, and we can run a little report from time to time to see if anything is getting old.  We keep track of websites with little hacks and tricks for making these stat package conversions and also for other formats (for example, early versions of PDF did not easily convert to PDF/A).  The process is tedious and labor intensive.  We have had lots of help from our statistical consultants, and they know a number of helpful tricks for making data usable again.  It takes a lot of time to convert older file formats, and we have a huge backlog.
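
Here is a toy version of that "is anything getting old?" report. Our real holdings database is not SQLite, and the table, column names and sample rows below are invented, but the query shows the idea: list files whose last migration was more than ten years ago.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE holdings (file_id TEXT, current_format TEXT, last_migrated TEXT)")
conn.executemany(
    "INSERT INTO holdings VALUES (?, ?, ?)",
    [
        ("study_0042", "SPSS portable file", "1998-03-02"),   # invented sample rows
        ("study_0107", "ASCII data + setup file", "2011-06-20"),
    ],
)

rows = conn.execute(
    """
    SELECT file_id, current_format, last_migrated
    FROM holdings
    WHERE last_migrated < date('now', '-10 years')
    ORDER BY last_migrated
    """
).fetchall()

for file_id, fmt, last_migrated in rows:
    print(f"{file_id}: {fmt}, last migrated {last_migrated}")
```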

So, when I am looking at the newer repository operations, I am looking for where this kind of work gets done, how, and to what extent. I know that different file formats require different amounts of loving care and attention. For example, tDAR accepts formats that are either open standards, like CSV, XLSX and TIFF (international standards that are not patent protected and can be implemented from publicly available specs), or industry standards, like MDB or DOC files, which are not "open" but are widely used, and their system can convert those files when the need arises.  And they say: "We do some conversion of files, but this is to provide more accessible formats, but not to archival formats.  Why not?  All of the formats we accept are either open standards or so commonly used that we don't see the need."
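
A minimal sketch of the kind of format screening a repository might do at deposit time is below. The two lists are just my reading of tDAR's policy, not their code or their complete format list.

```python
import pathlib

OPEN_STANDARD_FORMATS = {".csv", ".xlsx", ".tif", ".tiff"}   # my reading of the policy
INDUSTRY_STANDARD_FORMATS = {".mdb", ".doc"}                  # likewise illustrative

def accepted(filename):
    """Classify a file by extension the way a deposit screen might."""
    ext = pathlib.Path(filename).suffix.lower()
    if ext in OPEN_STANDARD_FORMATS:
        return "accepted (open standard)"
    if ext in INDUSTRY_STANDARD_FORMATS:
        return "accepted (industry standard; may be converted for access)"
    return "not accepted"

print(accepted("dates.csv"))   # accepted (open standard)
print(accepted("notes.doc"))   # accepted (industry standard; may be converted for access)
print(accepted("report.wpd"))  # not accepted
```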

My experience is that truly managing the data files we handle at UCLA requires a lot of hands-on care, and for that reason we don't take much new material unless it meets our requirements for documentation and format.  Also, we are too small a shop to handle a lot of data.  We have a list of what we look for in an initial assessment, and if the depositor doesn't provide us with enough, we don't ingest it.
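
A sketch of that kind of pre-ingest assessment is below. Our real checklist is longer and partly a judgment call; the required items here are invented for illustration.

```python
# Invented documentation requirements, for illustration only.
REQUIRED_DOCUMENTATION = ["codebook", "methodology", "variable_labels", "contact"]

def ready_to_ingest(deposit):
    """deposit maps documentation item name -> True/False (provided or not)."""
    missing = [item for item in REQUIRED_DOCUMENTATION if not deposit.get(item)]
    return len(missing) == 0, missing

ok, missing = ready_to_ingest({"codebook": True, "methodology": False, "contact": True})
print(ok, missing)  # False ['methodology', 'variable_labels']
```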


In some automated repository systems there are options for individuals to upload materials, put in some details, and voila, they can say it's archived.   Having had the experiences I have had with statistical software, operating systems and storage media, this does not sound like enough, and therefore I can't say to what extent the materials managed by some of the better known repository systems are truly being preserved.  I am told that there is no repository system anywhere that addresses these issues.

Are these issues not being addressed because we have not figured out how to do so?  Or are my concerns no longer relevant in today's technology environment?  Or is it really that these checking processes are hard to automate, and without automation they are too time consuming, labor intensive and expensive?  If the answer to that last question is yes, then what about coming up with strategies to address it?   I have always felt that it is better to manage a smaller collection of well documented materials than to try to take everything.  I'd rather feel that at least a few things are being preserved for someone to use in 40-50 years than have a lot of stuff that nobody can use.

Wednesday, September 26, 2012

Archaeology 2.0

Archaeology 2.0 provides a great overview of issues and concerns in managing data created, gathered and collected in research.

tDAR ~ The Digital Archaeological Record


I'll just say it: I really like what tDAR is doing.  Their website says, "tDAR is an international digital archive and repository that houses data about archaeological investigations, research, resources, and scholarship."  Their work represents an international effort, including the U.K.-based Archaeology Data Service and Digital Antiquity in the U.S. Other players are the University of Arkansas, Arizona State University, Pennsylvania State University, Washington State University, the SRI Foundation and the University of York. tDAR has a sustainable model for ongoing operation AND for preserving deposited materials into the future.  This is the first repository I have encountered where there is substantial consideration of how changes in technology and in software can affect the future usability of data.  Their approach is to limit the data formats accepted and to use some automated processes to check and migrate files when needed.

And even with these limits they will work with a wide variety of materials.
I have also been pleased that the tDAR staff have been so helpful and responsive as I ask my questions.  As I proceed with the project for the Cotsen Institute, I think tDAR is one of the leading candidates for the repository choice.