Day 1 Understanding the data and simulation workflows

Christophe Delaere edited this page Nov 8, 2024 · 27 revisions

Looking at the 2011 CMS data

Data comes in different formats, from bulky raw data to compact processed and skimmed data. In CMS, it goes like this (for that period):

RAW -> RECO -> AOD -> PAT -> Ntuples
  • RAW data is a binary format. It is compact but proprietary and requires a lot of knowledge about the detector electronics to be decoded.
  • RECO data contains higher level objects (tracks, calorimeter clusters, electrons, ...). It typically takes a lot of disk space (because low-level data is kept) but can be the starting point of data reprocessing.
  • AOD (for Analysis Object Data) is a thinner data format where only the high-level calibrated objects are retained. It is centrally produced. These are the AOD files that CMS made public.
  • PAT (for Physics Analysis Toolkit) is an even more simplified data format, typically produced by each analysis team. PAT files usually contain only the subset of objects needed for a given study.
  • Ntuples are simple data tables generally containing simple float quantities. Very compact, these are the files that we usually use in the interactive analysis.

In this lab, we will inspect AOD files with the event display tool of CMS, then produce a small ntuple. In the second lab, we will analyze larger ntuples produced by our IT team in advance of the lab.

Event visualization

In a fresh terminal, run the following commands. Choose between electrons (first cmsShow command) and muons (second cmsShow command), and run only one of the two.

cms-shell

and in that shell:

cd CMSSW_5_3_32/src/
. /cvmfs/cms.cern.ch/cmsset_default.sh
cmsenv
cmsShow root://eospublic.cern.ch//eos/opendata/cms/Run2011A/DoubleElectron/AOD/12Oct2013-v1/20000/0014CE62-9C3E-E311-8FCC-00261894389F.root
cmsShow root://eospublic.cern.ch//eos/opendata/cms/Run2011A/DoubleMu/AOD/12Oct2013-v1/20000/045CCED6-033F-E311-9E93-003048678F74.root
  1. What kind of events do you see?
  2. Do you see electrons or muons?
  3. You may try to filter the collection, for example by asking for one or more muons. Note that this takes time.

Processing data to create a ntuple

The data directory contains a sample of type "PAT". It contains some preprocessed events, similar to the AOD just visualized, but filtered and slimmed down (only the high-level info). We will process it and create a ntuple with the information relevant for this lab.

Take some time to look at the code in Labo/WeakBosonsAnalyzer/src/WeakBosonsAnalyzer.cc. Most of the job is done in the method WeakBosonsAnalyzer::analyze. This analysis code accesses the information in the PAT (using the CMS library) and creates a simple file that can be read by the ROOT standalone software.

To run the code on a few events:

cd Labo/WeakBosonsAnalyzer/test
cmsRun weakBosonsAnalyzer_cfg.py

By default, this runs on one file containing 21000 CMS events passing the double muon trigger. You can change that by editing the configuration file, weakBosonsAnalyzer_cfg.py, to either point to another file or run on a subset of events. When this is done, you can close the CMS shell.

We will now run Jupyter to look at the result using the ROOT package. In a new terminal:

conda activate cms-analysis
cd LPHYS2131/analysis
jupyter-lab Visualization.ipynb

Take some time to get familiar with the various quantities in the ntuple.

Simulating Monte Carlo (MC) events to compare to

To understand what the data contains, a standard approach consists in simulating events with Monte Carlo techniques that reflect our best understanding of the theory. By emulating the detector response and the analysis workflow, we obtain files that can be compared to data.

In this case, we will use MadGraph to simulate events of interest (pp->Z->l+l- and pp->W->lnu), pass them through a simplified detector simulation (Delphes) and produce a ntuple with the same format as for data.
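
As a pointer to what distinguishes the two processes, the sketch below computes the dilepton invariant mass (which peaks near the Z mass for pp->Z->l+l-) and the lepton-neutrino transverse mass (the usual handle for pp->W->lnu, where the neutrino only shows up as missing transverse energy). Leptons are treated as massless. The function and argument names are illustrative assumptions and are not taken from the lab code or from the ntuple.

```python
import math


def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless particles, in the units of pt."""
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding


def transverse_mass(pt_lep, phi_lep, met, phi_met):
    """Transverse mass of a lepton and the missing transverse energy."""
    m2 = 2.0 * pt_lep * met * (1.0 - math.cos(phi_lep - phi_met))
    return math.sqrt(max(m2, 0.0))


# Two central, back-to-back leptons of 45.6 GeV each (illustrative values).
print(invariant_mass(45.6, 0.0, 0.0, 45.6, 0.0, math.pi))
# A 40 GeV lepton recoiling against 40 GeV of missing transverse energy.
print(transverse_mass(40.0, 0.0, 40.0, math.pi))
```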

Running MadGraph

Start MadGraph in a new terminal:

conda activate cms-analysis
export FFLAGS+=" -fno-automatic"
cd MG5_aMC_v2_9_21
./bin/mg5_aMC

Generate Drell-Yan events:

MG5_aMC>generate p p > l+ l-
MG5_aMC>output ppNeutralCurrents
MG5_aMC>open index.html
MG5_aMC>launch ppNeutralCurrents

On the first prompt, enable "Pythia".

PYTHIA is a simulation program for particle collisions at very high energies. Among other things, it can simulate hard and soft interactions, parton distributions, initial- and final-state parton showers, multiple interactions, fragmentation and decay. Here, PYTHIA performs the hadronization of final-state quarks and simulates both initial-state and final-state radiation.

Various parameters can be modified in the run card, from the number of events (the default 10k is perfect in our case) to cuts (for example on the minimum pT of some particles) or the beam energy.
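
The run card is a plain text file, so such a change can also be scripted. The helper below is a rough sketch that rewrites one entry, assuming the usual `value = name ! comment` layout of the run card (found under Cards/run_card.dat in the process directory). The helper name set_run_card_value and the card excerpt are made up for illustration; check the entry names against your own run card. The example sets both beams to 3500 GeV, which corresponds to the 7 TeV collisions of 2011.

```python
import re

# Made-up excerpt in the usual run card layout: "value = name ! comment".
SAMPLE_CARD = """\
  10000 = nevents ! Number of unweighted events requested
  6500.0 = ebeam1 ! beam 1 total energy in GeV
  6500.0 = ebeam2 ! beam 2 total energy in GeV
  10.0 = ptl ! minimum pt for the charged leptons
"""


def set_run_card_value(card_text, name, value):
    """Return a copy of card_text where the entry `name` is set to `value`."""
    pattern = re.compile(
        r"^([ \t]*)\S+([ \t]*=[ \t]*" + re.escape(name) + r")(?=\s|$)",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + str(value) + m.group(2), card_text
    )
    if count == 0:
        raise KeyError("no entry named %r in the run card" % name)
    return new_text


card = SAMPLE_CARD
for beam in ("ebeam1", "ebeam2"):
    card = set_run_card_value(card, beam, 3500.0)
print(card)
```

To apply this to a real card, read the file into a string, call the helper, and write the result back before launching the run.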

Then, repeat the operation to generate leptonic W boson decays:

MG5_aMC>generate p p > l- vl~
MG5_aMC>add process p p > l+ vl
MG5_aMC>output ppChargedCurrents
MG5_aMC>open index.html
MG5_aMC>launch ppChargedCurrents

Again, on the first prompt, enable "Pythia".

By default, and as just described, the simulation is performed at leading order (LO). To perform the simulation at next-to-leading order (NLO) it is enough to add "[QCD]" at the end of the process string. In that case, Pythia is not used, as the hadronization is performed internally by another method adapted to NLO.

You can try to generate the process at NLO to see the diagrams produced by MadGraph in that case. We won't try to generate events, since using them requires extra care (events are weighted), which complicates the analysis later on.

Then, we quit:

MG5_aMC>quit

Finally, we will unzip the results:

gunzip ppChargedCurrents/Events/run_01/tag_1_pythia8_events.hepmc.gz
gunzip ppNeutralCurrents/Events/run_01/tag_1_pythia8_events.hepmc.gz
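
Before passing the files to Delphes, it can be reassuring to check that they contain the expected number of events. The sketch below relies on the fact that, in the HepMC ASCII format, each event record starts with a line beginning with `E `. The helper names are illustrative; the file reader accepts both plain and gzipped files.

```python
import gzip


def count_events(lines):
    """Count event records in an iterable of HepMC ASCII lines."""
    return sum(1 for line in lines if line.startswith("E "))


def count_events_in_file(path):
    """Count events in a .hepmc file, gzipped or not."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_events(handle)
```

For example, count_events_in_file("ppNeutralCurrents/Events/run_01/tag_1_pythia8_events.hepmc") should match the number of events requested in the run card if the generation completed normally.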

Running Delphes

You just generated MC events, with sets of particles as they would emerge from the collisions. The next step is to emulate the detector response. For that we use Delphes, a standalone public tool that emulates a simplified version of the CMS detector.

cd ../Delphes-3.5.0
cp ~/LPHYS2131/analysis/delphes_card_CMS_mod.tcl cards/
./DelphesHepMC2 cards/delphes_card_CMS_mod.tcl ppChargedCurrents.root ../MG5_aMC_v2_9_21/ppChargedCurrents/Events/run_01/tag_1_pythia8_events.hepmc
./DelphesHepMC2 cards/delphes_card_CMS_mod.tcl ppNeutralCurrents.root ../MG5_aMC_v2_9_21/ppNeutralCurrents/Events/run_01/tag_1_pythia8_events.hepmc
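
The two commands differ only by the sample name, so they could be generated in a loop. A minimal sketch, assuming it is run from the Delphes-3.5.0 directory with the paths used above; the helper names are illustrative, and by default the commands are only printed (dry run) rather than executed:

```python
import shlex
import subprocess

SAMPLES = ("ppChargedCurrents", "ppNeutralCurrents")


def delphes_command(sample,
                    card="cards/delphes_card_CMS_mod.tcl",
                    mg_dir="../MG5_aMC_v2_9_21"):
    """Build the DelphesHepMC2 command line for one MadGraph sample."""
    hepmc = "%s/%s/Events/run_01/tag_1_pythia8_events.hepmc" % (mg_dir, sample)
    return ["./DelphesHepMC2", card, sample + ".root", hepmc]


def run_delphes(samples=SAMPLES, dry_run=True):
    """Print (dry_run=True) or execute the Delphes command for each sample."""
    for sample in samples:
        command = delphes_command(sample)
        if dry_run:
            print(" ".join(shlex.quote(arg) for arg in command))
        else:
            subprocess.run(command, check=True)


run_delphes()  # prints the two commands; pass dry_run=False to run them
```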

Visualizing the events

We can now look at the events as we did for data:

root -l examples/EventDisplay.C'("cards/delphes_card_CMS.tcl","ppChargedCurrents.root")'
root [0] .q

What are the similarities and differences between these simulated events and the data events seen earlier?

Processing the events to generate a ROOT ntuple

Finally, we will produce a ntuple similar to the one we have for data, using the CreateTreeFromDelphes script. You can briefly look at it and then run the conversion:

cp ~/LPHYS2131/analysis/CreateTreeFromDelphes.C .
root -l
root [0] gSystem->Load("libDelphes.so")
root [1] .L CreateTreeFromDelphes.C++
root [2] CreateTreeFromDelphes("ppChargedCurrents.root","delpheAnalysisW.root")
root [3] CreateTreeFromDelphes("ppNeutralCurrents.root","delpheAnalysisZ.root")
root [4] .q

You can look at the resulting files: how do they compare to what we have for data? To do so, simply point the Visualization.ipynb notebook that we used earlier to either delpheAnalysisW.root or delpheAnalysisZ.root. For example:

data_file = "/home/student/Delphes-3.5.0/delpheAnalysisZ.root"
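
To switch quickly between the two simulated samples in the notebook, the path can be built from a short label. This is only a convenience sketch: the directory is the one from the example above, and the names MC_FILES and mc_file are illustrative, not part of the notebook.

```python
import posixpath

DELPHES_DIR = "/home/student/Delphes-3.5.0"  # location used in the example above
MC_FILES = {"W": "delpheAnalysisW.root", "Z": "delpheAnalysisZ.root"}


def mc_file(sample, base_dir=DELPHES_DIR):
    """Return the path of the simulated ntuple for sample "W" or "Z"."""
    return posixpath.join(base_dir, MC_FILES[sample])


data_file = mc_file("Z")  # use mc_file("W") for the charged-current sample
```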