Definition of LFNs for the custom datasets in columnflow #412

hephysicist · 2024-03-20T15:22:09Z

Dear users, maintainers, all!
I would like to understand how LFNs propagate through the framework. As far as I understand, we define the WLCG file systems in law.cfg like this:

[wlcg_fs_run3_2022_postEE_nano_tau_v12]
base: root://eoscms.cern.ch/eos/cms/store/group/phys_tau/irandreo/Run3_22_postEE_new
use_cache: $CF_WLCG_USE_CACHE
cache_root: $CF_WLCG_CACHE_ROOT
cache_cleanup: $CF_WLCG_CACHE_CLEANUP
cache_max_size: 15GB
cache_global_lock: True
cache_mtime_patience: -1

Then we obtain LFNs by running a function like this:

# note: cfg and config_name are taken from the enclosing config setup code
def get_dataset_lfns(dataset_inst: od.Dataset, shift_inst: od.Shift, dataset_key: str) -> list[str]:
    # destructure dataset_key into parts and create the lfn base directory
    dataset_id = dataset_key.split("/", 1)[1]
    print(f"Creating custom get_dataset_lfns for {config_name}")
    campaign_name = cfg.campaign.x("custom").get("name")
    lfn_base = law.wlcg.WLCGDirectoryTarget(
        dataset_id,
        fs=f"wlcg_fs_{campaign_name}",
    )

    # loop through files and interpret paths as lfns
    return [
        lfn_base.child(basename, type="f").path
        for basename in lfn_base.listdir(pattern="*.root")
    ]

# define the lfn retrieval function
cfg.x.get_dataset_lfns = get_dataset_lfns

Then, somewhere in the framework (probably in the GetDatasetLFNs task), this function is executed. My question is how to handle the case where my files are not stored in a single directory, so that I need to define different paths for different datasets.
Also, I saw that in the cmsdb dataset definition it is possible to add a location field to provide a dataset-specific path.
What is the best practice in your opinion? Could you give some insight into how LFNs are propagated through the framework, so that we can think of an optimal and sustainable solution?

Best,
Stepan.

Answered by pkausw

pkausw · 2024-03-21T13:45:00Z

Maintainer

Hi Stepan,

Indeed, what you describe is correct. The get_dataset_lfns function is called in the GetDatasetLFNs task, as you suspected, and is used to create JSON files that contain the final paths to the individual nanoAOD files, i.e. their logical file names (LFNs). After this point, only these JSON files are used in subsequent tasks, so once the list of files has been collected, the get_dataset_lfns function is not used anymore.
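To illustrate the "resolve once, then only read the JSON file" pattern described above, here is a minimal sketch in plain Python. It is not the columnflow implementation: the helper name load_or_create_lfns, the file name lfns.json and the fake resolver are hypothetical and only serve to show that the lookup function is called a single time.

```python
import json
import os
import tempfile
from typing import Callable


def load_or_create_lfns(json_path: str, resolve_lfns: Callable[[], list[str]]) -> list[str]:
    """Return the LFN list stored in json_path, resolving it only if the file is missing."""
    if os.path.exists(json_path):
        # the list was collected before: read it back, the resolver is not needed
        with open(json_path) as f:
            return json.load(f)

    # first call: run the (potentially slow) lookup and persist the result
    lfns = sorted(resolve_lfns())
    with open(json_path, "w") as f:
        json.dump(lfns, f, indent=2)
    return lfns


# hypothetical resolver standing in for a user-defined get_dataset_lfns
resolver_calls = []


def fake_resolver() -> list[str]:
    resolver_calls.append(1)
    return ["/store/b.root", "/store/a.root"]


cache_file = os.path.join(tempfile.mkdtemp(), "lfns.json")
first = load_or_create_lfns(cache_file, fake_resolver)
second = load_or_create_lfns(cache_file, fake_resolver)
```

The second call returns the same list without invoking the resolver again, which mirrors why changes to get_dataset_lfns have no effect on tasks that already read an existing JSON file.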

As a user, you have complete freedom in how you obtain the paths to these LFNs. In principle, you could write a function that selects the paths for each dataset based on the dataset name. This is a design choice you could make when writing your code. Imho, this is not the most efficient way to go about it though, since it doesn't scale well if you consider a lot of datasets in your analysis. I agree that it would be better to attach the information about the location of a given dataset to the dataset itself. You can do this in many ways, e.g. via the auxiliary dictionary that most of the order objects provide. In this dictionary, you can assign basically any key-value pair you want, and later access it, for example in the get_dataset_lfns function.
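A minimal sketch of this idea, assuming two hypothetical auxiliary keys, lfn_fs and lfn_dir, that a dataset may or may not define. To keep the example runnable without order or law, DatasetStub stands in for an order dataset (only its name and auxiliary dictionary are modeled), and the directory listing is injected as a function. In a real config, that function would wrap the law.wlcg.WLCGDirectoryTarget(...).listdir(pattern="*.root") call from the question, and the auxiliary values would be read from the dataset's own aux dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DatasetStub:
    """Stand-in for an order dataset: only the name and an aux dictionary are modeled."""
    name: str
    aux: dict = field(default_factory=dict)


def make_get_dataset_lfns(
    default_fs: str,
    list_root_files: Callable[[str, str], list[str]],
):
    """Build a get_dataset_lfns-like function that honors per-dataset aux entries."""

    def get_dataset_lfns(dataset_inst: DatasetStub, shift_inst, dataset_key: str) -> list[str]:
        # default directory: same splitting of the dataset key as in the question
        default_dir = dataset_key.split("/", 1)[1]

        # hypothetical aux keys; datasets without them fall back to the defaults
        fs = dataset_inst.aux.get("lfn_fs", default_fs)
        directory = dataset_inst.aux.get("lfn_dir", default_dir)

        return [f"{directory}/{basename}" for basename in list_root_files(fs, directory)]

    return get_dataset_lfns


# fake storage layout, keyed by (file system name, directory)
fake_storage = {
    ("wlcg_fs_default", "dy/nano"): ["f1.root"],
    ("wlcg_fs_other", "custom/tt"): ["f2.root", "f3.root"],
}


def fake_list(fs: str, directory: str) -> list[str]:
    return fake_storage.get((fs, directory), [])


get_lfns = make_get_dataset_lfns("wlcg_fs_default", fake_list)

dy = DatasetStub("dy")
tt = DatasetStub("tt", aux={"lfn_fs": "wlcg_fs_other", "lfn_dir": "custom/tt"})

dy_lfns = get_lfns(dy, None, "/dy/nano")
tt_lfns = get_lfns(tt, None, "/tt/nano")
```

Here the dy dataset uses the default file system and the directory derived from its key, while tt is redirected to its own location purely through its auxiliary entries, so adding further datasets does not require touching the lookup function.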

So the bottom line is: I think you got the structure right, and the best way to implement a dataset-dependent location when retrieving the paths to the LFNs is a matter of taste. Personally, I agree that it's most efficient and sustainable to attach this information to the datasets themselves, for example via the auxiliary dictionary, which you can access and fill either within cmsdb or in your analysis config at runtime.

Hope this helps!

Cheers,
Philip

hephysicist Mar 21, 2024
Author

Hi Philip!
Thanks for the clarification. It is indeed useful to know the exact place in the code where I can put my own LFNs if there is a need.
Thanks a lot!
Best,
Stepan.
