Data Management

Data processing, archiving and access

»The Observatory is expected to generate approximately 100 petabytes (PB) of data by 2030 (1 PB = 1 million GB).«

The CTA Observatory (CTAO) will provide access to archival data, the software for data processing and analysis, and all services for observation proposals. The design requirements are based on three specific classes of users: guest observers, archive users and advanced users. The first two categories represent the majority of users, the “basic users” of the CTAO, who analyze CTA high-level scientific products. The advanced users are the managers, operators and consortium users who need access to deeper levels of data for scientific or software development purposes.

CTA’s data management covers all major components of both data-flow administration and scientific data production and analysis. The design scheme addresses the following three central needs, which have been used to define the detailed user and project requirements:

  1. The treatment and flow of data from remote telescopes
  2. “Big-data” archiving and processing
  3. Open data access

Big Data Archiving and Processing

The aim of the scientific analysis environment is to accomplish the following:

  • Conduct Monte Carlo simulations for studying and optimizing the instrumental design and for exploring physics cases
  • Organize cooperative design, development and administration of software
  • Explore and evaluate algorithm prototypes
  • Conduct data challenges and other data product validation schemes at any level

To achieve this, the goal is to build a shared scientific analysis system (SAS) comprising software and computing resources. Once the first telescopes are deployed and the first CTA data are acquired, this platform will be used for any potential early science and as a test bench for all services and pipelines to be delivered to the CTAO in the production phase.

CTA will handle a tremendous volume of data. With an annual (reduced) raw data volume of 3.7 PB and 4 PB of data products, CTA is undoubtedly a big-data project. One petabyte (PB) is equal to 10^15 bytes of data, i.e. 1,000 terabytes (TB) or 1,000,000 gigabytes (GB). The total volume to be managed by the CTAO archive is of the order of 25 PB per year when all data-set versions and backup replicas are considered. The Observatory is expected to generate approximately 100 PB of data by 2030.
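
The unit conversions and yearly figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the helper `pb_to` and the variable names are hypothetical, and the "overhead factor" is simply the ratio of the quoted archive volume to the sum of the quoted raw and data-product volumes, not an official CTAO quantity.

```python
# Decimal (SI) byte units, as used in the text: 1 PB = 10**15 bytes.
GB = 10**9
TB = 10**12
PB = 10**15


def pb_to(value_pb, unit):
    """Convert a volume given in petabytes to another decimal unit."""
    return value_pb * PB / unit


# Annual figures quoted in the text (petabytes per year).
raw_pb_per_year = 3.7        # (reduced) raw data
products_pb_per_year = 4.0   # data products
archive_pb_per_year = 25.0   # all data-set versions and backup replicas

# Raw data plus data products, before versions and replicas are counted.
primary_pb = raw_pb_per_year + products_pb_per_year

# Illustrative ratio only: how much larger the managed archive volume is
# than the raw data and data products taken together.
overhead_factor = archive_pb_per_year / primary_pb

print("1 PB in TB:", pb_to(1, TB))
print("1 PB in GB:", pb_to(1, GB))
print("Raw + products (PB/year):", round(primary_pb, 1))
print("Archive / (raw + products):", round(overhead_factor, 2))
```

Under these figures, the archive has to manage a bit more than three times the combined raw and data-product volume each year.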

Data Flow

The figure below depicts the main data paths and data rates within the CTAO. On each CTA site, the data rates are based on the event rates from Cherenkov and night-sky-background triggers registered by the telescope array (calculated from Monte Carlo simulations and site measurements), given a particular trigger scheme.

Science Data Management Centre

The CTA Science Data Management Centre (SDMC), to be located on the DESY campus in Zeuthen, Germany, will be responsible for off-site handling of data reduction, Monte Carlo simulations, data archiving and data dissemination. The remote (e.g. intercontinental) transmission of data from CTA sites to the CTA archive is one of the key services that the SDMC will administer. Data management is also responsible for developing and providing the software and middleware services for dissemination, including observation proposal handling, and for ensuring that they interface with the operation centre.

Open Data Access

One of CTAO’s main data management challenges is providing open access to CTA data. To operate as an open observatory, a minimum set of services and tools is needed by basic users (e.g. guest observers and archive users) to perform a successful scientific analysis of CTA data. These services are intended to be mostly web-oriented and consist of the following:

  • Electronic support services to help guest observers in writing and submitting a proposal to CTA in response to an Announcement of Opportunity
  • User interfaces to follow the status of an observation, including the scheduling, the data acquisition, the data processing, the data distribution and the ingestion of the data in the public archive after the end of the proprietary period
  • Services for downloading the processed data (DL3) as well as the software tools that are necessary for scientific analysis

Science analysis will be performed on the basic user’s own computing infrastructure. Web-based information about the data and the analysis software, including user manuals, cookbooks, etc., will also be available. Archive users will browse the archive to access and retrieve CTA data of interest. They will query the archive database to select events based on specific criteria (source location, observation time interval, observation conditions). The archive of high-level data provides archive users with scientific products (such as images, spectra, light-curves, catalogs) produced by the CTAO. The Observatory will also provide user support: training, tutorials, a help desk and a newsletter.
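
To make the three kinds of selection criteria concrete, here is a minimal Python sketch of such a query applied to an in-memory list of records. Everything in it is an assumption made for illustration: the `Observation` record layout, the `select` function, the use of zenith angle as the observation-condition criterion, and the sample records are hypothetical and do not describe the actual CTAO archive interface or schema.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    # Hypothetical record layout; the real archive schema is not described here.
    obs_id: int
    ra_deg: float       # right ascension of the pointing, degrees
    dec_deg: float      # declination of the pointing, degrees
    start: datetime     # start time of the observation
    zenith_deg: float   # stand-in for an "observation condition" criterion


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance between two sky positions, in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def select(observations, ra_deg, dec_deg, radius_deg, t_min, t_max, max_zenith_deg):
    """Keep records that satisfy the position, time and condition cuts."""
    return [
        o for o in observations
        if angular_separation_deg(o.ra_deg, o.dec_deg, ra_deg, dec_deg) <= radius_deg
        and t_min <= o.start <= t_max
        and o.zenith_deg <= max_zenith_deg
    ]


# Made-up sample records, for illustration only.
records = [
    Observation(1, 83.6, 22.0, datetime(2030, 1, 10), 20.0),
    Observation(2, 83.6, 22.0, datetime(2031, 3, 5), 20.0),    # outside time window
    Observation(3, 266.4, -29.0, datetime(2030, 1, 12), 25.0),  # different sky region
    Observation(4, 83.7, 22.1, datetime(2030, 2, 1), 55.0),     # zenith angle too large
]

selected = select(
    records,
    ra_deg=83.633, dec_deg=22.014, radius_deg=1.0,  # cone around the Crab Nebula
    t_min=datetime(2030, 1, 1), t_max=datetime(2030, 12, 31),
    max_zenith_deg=45.0,
)
print([o.obs_id for o in selected])
```

A real archive would evaluate such cuts server-side against its database; the list comprehension here only shows how the three criteria combine.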

Data Contacts:

Work Package Leader: Giovanni Lamanna, LAPP – IN2P3 – CNRS
Work Package Manager: Nadine Neyroud, LAPP – IN2P3 – CNRS
Work Package Software Architect: Karl Kosack, CEA – IRFU