During the summer this year CEDA welcomed four undergraduate students to take up summer studentships working alongside CEDA staff. During this time their work touched on data from three of the data centres operated by CEDA: the British Atmospheric Data Centre, the NERC Earth Observation Data Centre and the UK Solar System Data Centre.
Three of these students worked together as part of a team led by Esther Conway, feeding into the ESA-funded Long-Term Data Preservation (LTDP) project, while the fourth worked within the UKSSDC.
Further details of their work are given below.
CEDA will be looking to make further opportunities available to students in the future and will announce them through the CEDA website and social media.
Each of the four students has made a significant contribution to the work of the Centre for Environmental Data Archival (CEDA) in the last few months, for which CEDA would like to express their thanks.
The UK Solar System Data Centre has been collecting ionospheric data since 1957 and has built up an extensive archive of such data, from the oldest records dating back to 1931 to data only a few minutes old. As a result, these data are available at varying levels of detail and in a variety of formats. The online access to these data has also altered gradually over the years, with different tools providing access to different subsets of ionospheric data.
Physics undergraduate James Parkinson joined the UKSSDC for the summer to work on a project to improve the online access to ionospheric data for UKSSDC users. James quickly got to grips with the data and the web services and, with a fresh pair of eyes on the situation, began with a thorough assessment of the existing provision. During the course of his work James worked up the design of a new interface to bring together ionospheric data services in one place and even implemented some of the new code that will be needed to make it operational, a very impressive feat given the short duration of the project!
CEDA’s Esther Conway brought together the other three summer placement students to work on the ESA-funded “Long Term Data Preservation” project, with the aim of carrying out an archive-wide format review. This was no small task for the team given the scale (both in size and number of files) and breadth of the BADC and NEODC holdings built up over a number of decades, during which the archive had developed in a very heterogeneous and organic fashion. As a consequence, a manual inspection of the archive wasn’t practical and simple off-the-shelf solutions were not available to the team. Getting to grips with the task ahead, the students worked alongside CEDA staff to scope out and develop a way to sort through the huge mass of data that had little or no format documentation.
Building on earlier attempts by CEDA to get a handle on the plethora of formats in the CEDA archive, Charlie set to work developing what would eventually become the HEFTI (Helping Environmental Formats Through Identification) software. Building up from a script that initially followed conventional methods of just checking file extensions, Charlie went on to develop other diagnostic methods used by the HEFTI software. These were required as the file extensions were often found to have been utilised to store information about the parameters or instrument names (for example) as opposed to the format of the data. Instead, Charlie developed the system to make use of additional information found by other members of the team, either by determining patterns within filenames themselves or, more usefully, determining the format’s “magic number” within the first few bytes of a file that would help fingerprint the format used.
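As an illustration of the approach described above, the following is a minimal sketch of magic-number identification with a fallback to the file extension. It is not the HEFTI code: the function name `identify_format` and both lookup tables are assumptions made for this example, and the signatures listed are only a handful of widely documented ones (HDF5, classic NetCDF, GRIB and PNG). The filename-pattern step mentioned above is left out for brevity.

```python
from pathlib import Path

# A few widely documented file signatures ("magic numbers").
# This table is illustrative only; a real archive would need many more.
MAGIC_NUMBERS = {
    b"\x89HDF\r\n\x1a\n": "HDF5",
    b"CDF\x01": "NetCDF classic",
    b"CDF\x02": "NetCDF 64-bit offset",
    b"GRIB": "GRIB",
    b"\x89PNG\r\n\x1a\n": "PNG",
}

# Fallback lookup by extension (hypothetical, deliberately short).
EXTENSIONS = {".nc": "NetCDF", ".h5": "HDF5", ".png": "PNG", ".csv": "CSV"}


def identify_format(path):
    """Return (format_name, evidence) for the file at `path`.

    The first bytes of the file are checked against known signatures.
    Only if none match is the file extension consulted, because an
    extension may encode a parameter or instrument name rather than
    the format.
    """
    path = Path(path)
    with path.open("rb") as handle:
        header = handle.read(16)  # longer than any signature above

    for magic, name in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return name, "magic number"

    extension = path.suffix.lower()
    if extension in EXTENSIONS:
        return EXTENSIONS[extension], "extension"

    return "unknown", "none"
```

Checking the file contents first means that a classic NetCDF file named, say, `ozone.temperature` would still be reported as NetCDF, whereas a check based on the extension alone would find nothing useful.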
Harry played a key role in analysing data in the archive found to be in unknown, non-standard formats. Over the course of the project he was able to determine filename patterns, format magic numbers and other crucial diagnostic clues that enabled Charlie to continue developing the HEFTI framework to diagnose more and more formats, especially the more obscure, bespoke formats used by specific instruments or within given projects or institutions.
At the same time as the format diagnostic toolkit was being developed, Robin compiled information on the most important formats and their versions, detailing the relevant organisations or communities and assessing their long-term preservation risk.
Formats have a related software set that must be available to the user, whether through the archive itself or, as is simpler for the archive, from some outside organisation. The supporting community also needs to be checked to make sure the format will not fall into disuse and the data become unreadable. This task was a crucial step forward for CEDA as it began to ensure that each format found within the CEDA archive was risk assessed, and preservation plans began to be formed by assessing matters such as the level of documentation, community use, and the strength of the supporting community and tools. Given its importance, Harry also continued this work following the end of the summer project itself.
The outcome of this work is that CEDA now have a considerably better understanding of the formats contained within the CEDA archives. Of particular note is that the team was able to determine, by assessing the data formats, what is of value to keep, what is possible to keep, and what must be kept due to obligations.