CERN’s 300 TB – the biggest open data release yet?
Yesterday CERN announced in a press release that the CMS Collaboration had released more than 300 terabytes (TB) of high-quality open data. This includes over 100 TB, or 2.5 inverse femtobarns (fb−1), of data from proton collisions at 7 TeV, making up half the data collected at the Large Hadron Collider (LHC) by the Compact Muon Solenoid (CMS) detector in 2011. It follows an earlier release in November 2014, which made available around 27 TB of research data collected in 2010.
This open data is available on the CERN Open Data Portal, which is built in collaboration with members of CERN’s IT Department and Scientific Information Service. The collision data are released into the public domain under the CC0 waiver and come in two types. The so-called “primary datasets” are in the same format used by the CMS Collaboration to perform research. The “derived datasets”, on the other hand, require a lot less computing power and can be readily analysed by university or high-school students; CMS has provided a limited number of datasets in this format.
CMS is also providing the simulated data generated with the same software version that should be used to analyse the primary datasets. Simulations play a crucial role in particle-physics research and CMS is also making available the protocols for generating the simulations that are provided. The data release is accompanied by analysis tools and code examples tailored to the datasets. A virtual-machine image based on CernVM, which comes preloaded with the software environment needed to analyse the CMS data, can also be downloaded from the portal.
Kati Lassila-Perini, a Finnish physicist working on the CMS detector, stated: “Once we’ve exhausted our exploration of the data, we see no reason not to make them available publicly. The benefits are numerous, from inspiring high school students to the training of the particle physicists of tomorrow. And personally, as CMS’s data preservation coordinator, this is a crucial part of ensuring the long-term availability of our research data.”
In our own more modest lab, we’re wondering whether this is the largest open data release yet. If you can confirm or refute this, please feel free to comment below.