Satellites and climate models are renowned for producing large amounts of data. As a result, the CEDA Archive handles the largest datasets within the NERC Environmental Data Service (EDS), NERC's collection of environmental data centres. Big data is not just a challenge for the CEDA Archive; the other NERC data centres are increasingly in discussions with researchers whose data is too large for their infrastructures. As CEDA is at the forefront of developing infrastructure and services for large data volumes, we have been sharing and integrating this knowledge across the NERC EDS. Here we highlight four different pieces of work that are ongoing behind the scenes to adapt to our big data challenges.
CEDA is acting as the storage component for some of the large datasets managed by the other NERC data centres, whose infrastructures are smaller than ours and cannot hold such large data volumes. To support this, we have adapted the way researchers deposit data. The system now allows any environmental researcher to deposit large data at CEDA; the data is then managed by the relevant data centre as normal.
We have also created a new system that allows the coordination of data management across the EDS. This improves efficiency and communication between the various data centres.
As the volume and variety of data in the CEDA Archive increase daily, the data ingestion process frequently needs internal improvements. We have recently engineered a new monitoring system that notifies relevant staff about each file delivered to the CEDA Archive. It shows us the overall deposit rate and helps us debug issues visually, so we can spot problems sooner and improve our processes, meaning that data is archived more efficiently.
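To give a flavour of what per-file monitoring makes possible, here is a minimal sketch of how a deposit rate could be derived from individual delivery events. It is an illustration only, not CEDA's monitoring system: the `Delivery` record, its fields and both function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Delivery:
    """One file arriving at the archive (hypothetical record shape)."""
    path: str
    size_bytes: int
    delivered_at: datetime
    ok: bool  # False if ingestion of this file failed


def hourly_deposit_rate(deliveries):
    """Count delivered files per hour, keyed by the start of each hour."""
    counts = Counter()
    for d in deliveries:
        hour = d.delivered_at.replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return dict(counts)


def failed_deliveries(deliveries):
    """Return the paths of files whose ingestion failed, in arrival order."""
    return [d.path for d in deliveries if not d.ok]
```

Plotting the hourly counts over time would give the kind of visual overview described above, while the list of failed paths points staff at the files that need attention.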
CEDA needs to monitor not only data ingestion but also around 130 services. To deal with this challenge, we have developed and implemented an inventory of services with a simple web interface. For each service, the inventory identifies who is responsible for it and who can restore it if it breaks. This simple tracking tool allows us to resolve service issues more efficiently, reducing service downtime for users.
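As a rough illustration of the idea, a service inventory can be as simple as a lookup from service name to the people who own it and can restore it. The sketch below is an assumption about how such a tool might be modelled, not a description of CEDA's inventory: the `Service` and `Inventory` classes and all field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """One tracked service (hypothetical record shape)."""
    name: str
    owner: str                                      # person responsible for the service
    restorers: list = field(default_factory=list)   # people able to restore it
    healthy: bool = True


class Inventory:
    """In-memory register of services, keyed by service name."""

    def __init__(self):
        self._services = {}

    def add(self, service):
        self._services[service.name] = service

    def contacts_for(self, name):
        """Return the owner first, then any other restorers, without duplicates."""
        service = self._services[name]
        contacts = [service.owner]
        for person in service.restorers:
            if person not in contacts:
                contacts.append(person)
        return contacts

    def broken(self):
        """Return the names of services currently marked unhealthy, sorted."""
        return sorted(n for n, s in self._services.items() if not s.healthy)
```

When a service fails, `broken()` lists what needs attention and `contacts_for()` says who to call, which is the step that shortens downtime.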
As big data continues to grow in the environmental sector, storage infrastructures must keep up. In response to data growth, we have moved away from a traditional storage architecture towards a more heterogeneous storage environment. This evolution means that data in the CEDA Archive now makes use of a variety of storage technologies. Each has its own properties and best methods of working, with different interfaces and lag times. This can present challenges to users, as it complicates workflows. The CEDA team have been working on new ways to help users work with the various storage types, including the development of the Near-line archive (NLA), Joint Data Migration Application (JDMA) and S3netCDF. We are continually developing new ways to use our storage media types, to improve efficiency and deliver the best user experience possible.
All of the work in this article is covered in more detail in our latest Annual Report (19-20).