Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding loader for Jazz Trio Database #652

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

HuwCheston
Copy link

@HuwCheston HuwCheston commented Dec 23, 2024

Description

Please include the following information at the top level docstring for the dataset's module mydataset.py:

  • Describe annotations included in the dataset
  • Indicate the size of the datasets (e.g. number files and duration, hours)
  • Mention the origin of the dataset (e.g. creator, institution)
  • Describe the type of music included in the dataset
  • Indicate any relevant papers related to the dataset
  • Include a description about how the data can be accessed and the license it uses (if applicable)

Dataset loaders checklist:

  • Create a script in scripts/, e.g. make_my_dataset_index.py, which generates an index file.
  • Run the script on the canonical version of the dataset and upload the index to Zenodo Audio Data Loaders community.
    • Currently awaiting approval.
  • Create a sample version of the index with the necessary information for testing.
  • Create a module in mirdata, e.g. mirdata/my_dataset.py
  • Create tests for your loader in tests/datasets/, e.g. test_my_dataset.py
  • Add your module to docs/source/mirdata.rst and docs/source/table.rst
  • Run black, flake8 and mypy (see Running your tests locally).
  • Run tests/test_full_dataset.py on your dataset.
  • Check that codecov coverage does not decrease.

If your dataset is not fully downloadable there are two extra steps you should follow:

  • Contacting the mirdata organizers by opening an issue or PR so we can discuss how to proceed with the closed dataset.
    • I am opening this PR, hopefully this is ok for discussion 😀
  • Show that the version used to create the checksum is the "canonical" one, either by getting the version from the dataset creator, or by verifying equivalence with several other copies of the dataset.
    • I am the creator of the database and have created the checksums using my own, canonical, version.
  • Make sure someone has run pytest -s tests/test_full_dataset.py --local --dataset my_dataset once on your dataset locally and confirmed it passes.

@HuwCheston HuwCheston changed the title [WIP] Adding loader for Jazz Trio Database Adding loader for Jazz Trio Database Dec 23, 2024

"""
return (
self._multitrack_metadata["mbz_id"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider simplifying these kinds of expressions to

self._multitrack_metadata.get("mbz_id", None)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'd agree that your suggestion is more pythonic @pmcharrison but I've seen this syntax used in other loaders within mirdata e.g. cipi.py.

Will wait for more opinions!

Copy link
Author

@HuwCheston HuwCheston Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking another look now, I think this syntax is potentially redundant. The structure of the JTD metadata means that these fields (e.g., mbz_id, track_name, etc.) will always be present for every track, with no missing values.

So, in reality, these functions will never return None, and will always return the expected type (str, int, whatever) from the field. With that in mind, in 87227c6 I'm just accessing the desired fields directly without any of this additional logic. e.g., we can just do:

@property
def musicbrainz_id(self) -> str:
    """The MusicBrainz ID for the recording

    Returns:
        * str - musicbrainz ID

    """
    return self._multitrack_metadata["mbz_id"]

Happy to revert back if the mirdata maintainers prefer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer the .get() method, but I see that this is not something consistent throughout all loaders, and that here is not technically needed. For now, no need to change it. I'll let the other maintainers to give their opinion if they wish so!
In any case... mybe it'd be cool to standardize it in all the package...!

mirdata/datasets/jtd.py Outdated Show resolved Hide resolved
mirdata/datasets/jtd.py Outdated Show resolved Hide resolved
mirdata/datasets/jtd.py Outdated Show resolved Hide resolved
mirdata/datasets/jtd.py Outdated Show resolved Hide resolved
@HuwCheston
Copy link
Author

Made some changes in response to @pmcharrison's comments (for maintainers, Peter is a co-author on the original paper introducing this dataset)

@pmcharrison
Copy link

These changes look good, thank you!

Copy link

codecov bot commented Jan 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.13%. Comparing base (581f4c4) to head (f6f6d43).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #652      +/-   ##
==========================================
+ Coverage   97.07%   97.13%   +0.06%     
==========================================
  Files          68       69       +1     
  Lines        7583     7747     +164     
==========================================
+ Hits         7361     7525     +164     
  Misses        222      222              

Copy link
Collaborator

@genisplaja genisplaja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @HuwCheston and also @pmcharrison, thanks for this great PR! That is neat :) I am leaving here minor suggested changes mostly to improve readability of the docs. About Issue #651... I am taking a look at it. But also as I am suggesting for the new decorator, I think this fixes and upgrade could be better included in separated PRs :) I'd like the rest of the colleagues maintaining mirdata to take a look though. Thanks again!

docs/source/mirdata.rst Outdated Show resolved Hide resolved
mirdata/datasets/jtd.py Outdated Show resolved Hide resolved
Comment on lines +608 to +623
def coerce_to_string_io_multiple_args(func) -> Callable:
"""Little hack of the decorator in mirdata.io that allows for multiple args to be passed to the `func`"""

@functools.wraps(func)
def wrapper(
file_path_or_obj: Optional[Union[str, TextIO]], *args
) -> Optional[io.T]:
if not file_path_or_obj:
return None
if isinstance(file_path_or_obj, str):
with open(file_path_or_obj, encoding="utf-8") as f:
return func(f, *args)
else:
return func(file_path_or_obj, *args)

return wrapper
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! We have been doing it differently when we have load_* functions with multiple args, however, this solution might be a bit better. I would like the other mirdata maintainers to take a look, but in any case, if we chose your solution, I'd suggest to use the same hack that is used in other loaders for the load_* functions with multiple args here, and then we can have a separate PR to include this decorator to the core files and update all loaders with load functions that have multiple arguments. Thanks for that!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to use the same hack that is used in other loaders for the load_* functions with multiple args here

Thanks -- is there any chance you can point me towards these other loaders so I can implement this?

@HuwCheston
Copy link
Author

Thanks @genisplaja, should've now made the requested changes to the docs and can see from the readthedocs build that they look better. If you can point me towards the same hack that is used in other loaders for the load_* functions with multiple args I'm happy to integrate this rather than creating my own hack of the decorator :)

@HuwCheston HuwCheston requested a review from genisplaja January 3, 2025 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants