[Feature Request] Add support for loading datasets from local Minari cache #3067

@Ibinarriaga8

Motivation

The current implementation of MinariExperienceReplay requires datasets to be downloaded using the class itself, which creates an env_metadata.json file in the target directory. This workflow does not accommodate custom Minari datasets created by users or datasets that have been loaded into the local Minari cache by other means (e.g., through minari.load_dataset or custom dataset creation via DataCollector).

As a result, attempting to instantiate MinariExperienceReplay with download=False for locally available datasets leads to a FileNotFoundError due to missing metadata files, even though the dataset exists in the Minari cache. This limitation is frustrating for users who want to leverage their own datasets without redownloading or duplicating data, and it hinders workflows where datasets are managed independently of TorchRL.

This issue proposes loading datasets directly from the local Minari cache (typically ~/.minari/datasets) without requiring prior setup through MinariExperienceReplay's download workflow. That would make the class more flexible and compatible with custom and preloaded datasets.

Solution

Add a load_from_local_minari argument to the MinariExperienceReplay class and support it fully. When set to True, this argument instructs the class to:

  • Look for the dataset in the user's local Minari cache (e.g., ~/.minari/datasets/{dataset_id}/data/main_data.hdf5).
  • Bypass any download or remote fetching logic.
  • If the required files are present, load the dataset and construct any necessary metadata on-the-fly (e.g., from the Minari dataset spec, if possible).
  • Raise a clear and informative FileNotFoundError if the dataset is not found in the expected local cache location.
  • Ensure that custom datasets created by users (such as those with DataCollector(...).create_dataset(...)) or datasets first loaded with minari.load_dataset can be used seamlessly with MinariExperienceReplay.
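The lookup and error-reporting steps listed above could be sketched roughly as follows. This is only an illustration and not TorchRL's or Minari's actual implementation: the helper name local_minari_data_file and its cache_root parameter are hypothetical, and the fallback to the MINARI_DATASETS_PATH environment variable is an assumption about how Minari resolves its cache directory.

```python
import os
from pathlib import Path
from typing import Optional


def local_minari_data_file(dataset_id: str, cache_root: Optional[Path] = None) -> Path:
    """Return the path to main_data.hdf5 for dataset_id in the local cache.

    Hypothetical helper: raises FileNotFoundError if the file is missing.
    """
    if cache_root is None:
        # Assumption: Minari honours MINARI_DATASETS_PATH and otherwise
        # falls back to ~/.minari/datasets.
        cache_root = Path(
            os.environ.get("MINARI_DATASETS_PATH", Path.home() / ".minari" / "datasets")
        )
    # Layout taken from the issue text: {dataset_id}/data/main_data.hdf5
    data_file = Path(cache_root) / dataset_id / "data" / "main_data.hdf5"
    if not data_file.is_file():
        raise FileNotFoundError(
            f"Dataset {dataset_id!r} was not found in the local Minari cache: "
            f"expected {data_file}. Create or download it with Minari first, "
            "or set load_from_local_minari=False."
        )
    return data_file
```

Naming the expected path in the error message lets a user see at a glance whether the dataset ID or the cache location is the problem.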

This solution allows for greater flexibility, avoids unnecessary downloads and data duplication, and makes TorchRL compatible with the wider Minari ecosystem.

Alternatives

  • Manual copying of files: Users could manually copy datasets and metadata to the expected TorchRL directory, but this is error-prone and not user-friendly.
  • Automated metadata generation scripts: Provide standalone tools for generating env_metadata.json based on existing Minari datasets. This adds maintenance burden and complexity for users.

Additional context

  • The new load_from_local_minari argument should default to False to preserve backward compatibility.
  • If load_from_local_minari=True is set, the MinariExperienceReplay class will load the dataset directly from the local Minari cache (typically located at ~/.minari/datasets). If the dataset exists in the cache, the class will skip any fetching from the Minari server; no remote download or overwrite will occur. After the dataset is loaded from the local cache, all subsequent preprocessing and loading steps will proceed as usual.
  • This feature will facilitate workflows for research, benchmarking, and development using custom or proprietary datasets, and it is more in line with how Minari itself manages datasets locally.
  • Example usage:
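from torchrl.data.datasets import MinariExperienceReplay
from torchrl.data.replay_buffers import SamplerWithoutReplacement

# Hypothetical ID; use a dataset that already exists in the local Minari cache.
dataset_id = "my-namespace/my-dataset-v0"

data = MinariExperienceReplay(
    dataset_id=dataset_id,
    split_trajs=False,
    batch_size=128,
    sampler=SamplerWithoutReplacement(drop_last=True),
    prefetch=4,
    load_from_local_minari=True,  # <--- key addition (proposed argument)
)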
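How the proposed flag might interact with the existing download argument can be sketched as a small decision function. This is a guess at one reasonable behavior, not the actual TorchRL logic: the function name choose_data_source and the returned labels are hypothetical, and the only points taken from the text above are that the flag defaults to False, that a cache hit skips any remote fetching, and that a cache miss raises FileNotFoundError.

```python
def choose_data_source(
    load_from_local_minari: bool = False,
    download: bool = True,
    in_local_cache: bool = False,
) -> str:
    """Return a label for where the dataset would be read from (hypothetical)."""
    if load_from_local_minari:
        if not in_local_cache:
            # Per the proposal: fail clearly instead of silently downloading.
            raise FileNotFoundError("dataset not found in the local Minari cache")
        # Cache hit: no remote download or overwrite takes place.
        return "local_minari_cache"
    # Flag left at its default: current behavior is unchanged.
    return "remote_download" if download else "existing_torchrl_copy"
```

With the flag left at False the function never consults the local Minari cache, which is what keeps the change backward compatible.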

Checklist

  • I have checked that there is no similar issue in the repo (required)

Labels: enhancement
