Inconsistent filter modules behavior #515

Open
zxnie opened this issue Feb 4, 2025 · 0 comments
Labels
bug Something isn't working

zxnie commented Feb 4, 2025

Describe the bug

The Score and Filter modules do not behave the same way as the ScoreFilter module. All three are defined here: https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/modules/filter.py

Steps/Code to reproduce bug
from nemo_curator import Score
from nemo_curator.filters import WordCountFilter

"""
Load your dataset as dataset
"""

filter = Score(
    WordCountFilter(min_words=80, max_words=200_000),
    score_field="word_count",
    text_field="text",
    score_type=int,
)

filtered_dataset = filter(dataset)

Error message:

Exception: 'TypeError("\'WordCountFilter\' object is not callable")'
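
For readers who want to see the failure mode in isolation, here is a minimal sketch that does not import NeMo-Curator at all. WordCountStub and apply_score are hypothetical stand-ins (not the library's classes), written on the assumption that the scoring module simply calls whatever it was given once per document. Under that assumption, passing a filter object that defines score_document but not __call__ produces the same kind of TypeError.

```python
class WordCountStub:
    """Hypothetical stand-in for a DocumentFilter subclass (not the real class)."""

    def __init__(self, min_words=80, max_words=200_000):
        self.min_words = min_words
        self.max_words = max_words

    def score_document(self, text):
        # The score is the number of whitespace-separated words.
        return len(text.split())

    def keep_document(self, score):
        return self.min_words <= score <= self.max_words


def apply_score(score_fn, texts):
    """Hypothetical stand-in for a module that expects a plain callable."""
    return [score_fn(text) for text in texts]


texts = ["one two three", "four five"]
stub = WordCountStub(min_words=1, max_words=10)

try:
    # Passing the object itself: it has no __call__, so calling it fails.
    apply_score(stub, texts)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the bound method instead gives the wrapper something callable.
scores = apply_score(stub.score_document, texts)

print(error_message)
print(scores)
```

If the real filter classes expose a score_document method in the same way, passing that bound method to Score may work as a stopgap, but this has not been verified against the installed version in this report.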

Expected behavior

Since all filters are now implemented as subclasses of DocumentFilter (except for the bitext filter), the Score and Filter modules should be consistent with the ScoreFilter module and take filter_obj: DocumentFilter as input instead of filter_fn (a Callable).
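
To make the request concrete, here is a rough sketch of what the expected interface could look like. Everything in it is hypothetical: ScoreSketch, FilterSketch and WordCountStub are illustrative names, and the records are plain dictionaries rather than the library's dataset type. It only assumes that a filter object offers score_document and keep_document methods.

```python
class WordCountStub:
    """Hypothetical stand-in for a DocumentFilter subclass (not the real class)."""

    def __init__(self, min_words, max_words):
        self.min_words = min_words
        self.max_words = max_words

    def score_document(self, text):
        return len(text.split())

    def keep_document(self, score):
        return self.min_words <= score <= self.max_words


class ScoreSketch:
    """Hypothetical Score-like module that accepts a filter object."""

    def __init__(self, filter_obj, score_field, text_field="text"):
        self.filter_obj = filter_obj
        self.score_field = score_field
        self.text_field = text_field

    def __call__(self, records):
        # Add the score to each record without dropping anything.
        return [
            {**r, self.score_field: self.filter_obj.score_document(r[self.text_field])}
            for r in records
        ]


class FilterSketch:
    """Hypothetical Filter-like module that accepts the same filter object."""

    def __init__(self, filter_obj, filter_field):
        self.filter_obj = filter_obj
        self.filter_field = filter_field

    def __call__(self, records):
        # Keep only the records whose stored score passes keep_document.
        return [
            r for r in records
            if self.filter_obj.keep_document(r[self.filter_field])
        ]


records = [{"text": "a b c"}, {"text": "a"}]
word_filter = WordCountStub(min_words=2, max_words=5)

scored = ScoreSketch(word_filter, score_field="word_count")(records)
kept = FilterSketch(word_filter, filter_field="word_count")(scored)

print(scored)
print(kept)
```

With this shape, running the Score-like step followed by the Filter-like step uses one filter object throughout, which is the consistency with ScoreFilter that this issue asks for.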

Environment overview (please complete the following information)

  • Environment location: local dev env
  • Method of NeMo-Curator install: pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x]
  • If method of install is [Docker], provide docker pull & docker run commands used: not applicable

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version: Ubuntu 24.04
  • Dask version: 2024.9.0
  • Python version: Python 3.10.16 (main, Jan 13 2025, 16:25:23) [GCC 13.3.0] on linux

Additional context

None

zxnie added the bug label Feb 4, 2025