Skip to content

Latest commit

 

History

History
144 lines (107 loc) · 6.02 KB

File metadata and controls

144 lines (107 loc) · 6.02 KB

Adding a Custom Dataset Loader

:::{note} We recommend reading the Evaluating NeMo Agent Toolkit Workflows guide before proceeding with this detailed documentation. :::

NeMo Agent Toolkit provides built-in dataset loaders for common file formats (json, jsonl, csv, xls, parquet, and custom). In addition, the toolkit provides a plugin system to add custom dataset loaders for new file formats or data sources.

Summary

This guide provides a step-by-step process to create and register a custom dataset loader with NeMo Agent Toolkit. A TSV (tab-separated values) dataset loader is used as an example to demonstrate the process.

Existing Dataset Loaders

You can view the list of existing dataset loaders by running the following command:

nat info components -t dataset_loader

Extending NeMo Agent Toolkit with Custom Dataset Loaders

To extend NeMo Agent Toolkit with custom dataset loaders, you need to create a dataset loader configuration class and a registration function, then register it with NeMo Agent Toolkit using the register_dataset_loader decorator.

Dataset Loader Configuration

The dataset loader configuration defines the dataset type name and any format-specific parameters. This configuration is paired with a registration function that yields a DatasetLoaderInfo object containing the load function.

The following example shows how to define and register a custom dataset loader for TSV files:

# my_plugin/dataset_loader_register.py
import pandas as pd
from pydantic import Field

from nat.builder.builder import EvalBuilder
from nat.builder.dataset_loader import DatasetLoaderInfo
from nat.cli.register_workflow import register_dataset_loader
from nat.data_models.dataset_handler import EvalDatasetBaseConfig


class EvalDatasetTsvConfig(EvalDatasetBaseConfig, name="tsv"):
    """Configuration for TSV dataset loader."""
    separator: str = Field(default="\t", description="Column separator character.")


@register_dataset_loader(config_type=EvalDatasetTsvConfig)
async def register_tsv_dataset_loader(config: EvalDatasetTsvConfig, builder: EvalBuilder):
    """Register TSV dataset loader."""

    def load_tsv(file_path, **kwargs):
        return pd.read_csv(file_path, sep=config.separator, **kwargs)

    yield DatasetLoaderInfo(config=config, load_fn=load_tsv, description="TSV file dataset loader")
  • The EvalDatasetTsvConfig class extends EvalDatasetBaseConfig with the name="tsv" parameter, which sets the _type value used in YAML configuration files.
  • The register_tsv_dataset_loader function uses the @register_dataset_loader decorator to register the dataset loader with NeMo Agent Toolkit.
  • The function yields a DatasetLoaderInfo object, which binds the config, load function, and a human-readable description.

Understanding DatasetLoaderInfo

The DatasetLoaderInfo class contains the following fields:

  • config: The dataset loader configuration object (an instance of EvalDatasetBaseConfig or a subclass).
  • load_fn: A callable that takes a file path and optional keyword arguments and returns a pandas.DataFrame. This function is used by the evaluation framework to load the dataset.
  • description: A human-readable description of the dataset loader.

Importing for Registration

To ensure the dataset loader is registered at runtime, import the registration function in your project's register.py file -- even if the function is not called directly.

# my_plugin/register.py
from .dataset_loader_register import register_tsv_dataset_loader

Entry Point

Add an entry point in your pyproject.toml so that NeMo Agent Toolkit discovers the plugin automatically:

[project.entry-points.'nat.plugins']
my_plugin = "my_plugin.register"

Display All Dataset Loaders

To display all registered dataset loaders, run the following command:

nat info components -t dataset_loader

This will now display the custom dataset loader tsv in the list of dataset loaders.

Using the Custom Dataset Loader

Once registered, you can use the custom dataset loader in your evaluation configuration:

eval:
  general:
    dataset:
      _type: tsv
      file_path: <path to your file>
      separator: "\t"

The _type field specifies the dataset loader name. All fields defined in the configuration class are available as YAML keys.

Running the Evaluation

Run the evaluation using the standard command:

nat eval --config_file <path to file>

Built-in Dataset Loaders

The following dataset loaders are included with NeMo Agent Toolkit:

Type Description Load Function
json JSON file dataset pandas.read_json
jsonl JSON Lines file dataset Custom JSONL reader
csv CSV file dataset pandas.read_csv
parquet Parquet file dataset pandas.read_parquet
xls Excel file dataset pandas.read_excel
custom Custom parser function User-provided function via function config key

For more details on the built-in dataset formats and their configuration options, see the Using Datasets section in the evaluation guide.