Skip to content

[ENH] Add "study" DatasetType to organize a collection of source and derivative datasets #1972

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

yarikoptic
Copy link
Collaborator

@yarikoptic yarikoptic commented Oct 29, 2024

edit: formerly it was "project" but then renamed to "study" for better alignment

This PR was initially submitted as #1861 but I made a mistake to combine it with a discussion of transformations of existing projects' layouts into such BIDS project dataset. Please refer to that PR for examples but otherwise let's concentrate here on the discussion of this specific proposed change.

  • Rationale 1 (major): BIDS standard already provides reasonable structure to formalize organization of various components of a neuroscientific data project: where to place code, original (source) data, derivative data, README, CHANGES. Many projects (e.g. nipoppy, YODA, etc) propose similar and often might be even "inspired" templates . If we explicitly allow for BIDS standard to prescribe study level organization, IMHO it would help many people and projects decide on how to organize their studies/projects.
  • Rationale 2: IMHO BIDS standard should describe only what standard prescribe and not recommend some potential "non-standardized" layouts. That is why I "reworked" that example into a legitimate BIDS dataset merely by adding dataset_description.json.

TODOs:

@yarikoptic yarikoptic added the opinions wanted Please read and offer your opinion on this matter label Oct 29, 2024
@yarikoptic yarikoptic requested a review from effigies October 29, 2024 21:01
@yarikoptic yarikoptic requested a review from tsalo October 29, 2024 21:01
@yarikoptic yarikoptic changed the title Enh project datasettype Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset Oct 30, 2024
@yarikoptic yarikoptic force-pushed the enh-project-datasettype branch from d0d5c37 to fb4f5a4 Compare January 23, 2025 02:40
Copy link

codecov bot commented Jan 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.15%. Comparing base (8e175f4) to head (1ac57a0).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1972   +/-   ##
=======================================
  Coverage   82.15%   82.15%           
=======================================
  Files          17       17           
  Lines        1530     1530           
=======================================
  Hits         1257     1257           
  Misses        273      273           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@yarikoptic
Copy link
Collaborator Author

yarikoptic commented Mar 1, 2025

FWIW, we conversed with @effigies and he brought up an interesting argument, although IMHO not contradicting this one per se, is that ATM any BIDS dataset (raw or derivative) which already contains some subdatasets under derivatives/ could be considered to be a "project" BIDS dataset. Indeed it is true. But IMHO it is just a legacy side-effect of BIDS recommendation to stick derivatives under derivatives/ which somewhat invalidate the claim that the BIDS dataset as a whole is a "raw" dataset.

edit: related linked below is #2103 highlighting the same situation with "raw" dataset containing "derivatives/"

@yarikoptic
Copy link
Collaborator Author

@effigies I wonder if we should extend DatasetType to become a list or replace or compliment with DatasetTypes which could then indeed be just "project" when containing other derivative bids-datasets or sourcedata, or ["raw", "project"] when also a raw and project, or could be also ["derivative", "project"] when derivative but points to sourcedata/ original "raw" ... ?

@effigies
Copy link
Collaborator

I'm skeptical of that need. I would expect your ["raw", "project"] to look like:

project/
  dataset_description.json  # "DatasetType": "project"
  rawdata/
    dataset_description.json  # "DatasetType": "raw"
  derivatives/
    preproc/
      dataset_description.json  # "DatasetType": "derivative"

And ["derivative", "project"]:

project/
  dataset_description.json  # "DatasetType": "project"
  sourcedata/
    dataset_description.json  # "DatasetType": "raw"
  rawdata/
    dataset_description.json  # "DatasetType": "derivative"
  deriv/
    analysis/
      dataset_description.json  # "DatasetType": "derivative"

Subdatasets should be validatable BIDS datasets in their own right, avoiding the need for a top-level dataset_description.json to modify how they are intended to be validated.

@effigies
Copy link
Collaborator

I think this overall needs more specification. What are valid directories in a project-type dataset? We should add them to https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/directories.yaml.

I think a project dataset is barely worth specifying if we don't validate at least the raw data subdataset. Possibly we should have rules for indicating where validators should look for subdatasets.

In OpenNeuroDerivatives, we use sourcedata/* as a mixture of BIDS and non-BIDS (FreeSurfer) inputs. Nipoppy uses derivatives/<pipeline>/<version>/<output> as derivative datasets; presumably this could also be a mixture of BIDS and non-BIDS.

@yarikoptic
Copy link
Collaborator Author

yet to "process" but a quick side idea inspired by #1928 --- I wonder if there is a hierarchy here: project (everything common) -> raw (current default, requires having sub- folder(s)) -> derivative (more stuff could be added), as every next level adds capabilities but includes all of the prior one as derivative could include raw in it? or we have already something which invalidates that?

@yarikoptic
Copy link
Collaborator Author

I think this overall needs more specification. What are valid directories in a project-type dataset? We should add them to https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/directories.yaml.

They are already there, that's somewhat the point here -- that we are already defining the structure of all those folders, nothing new to add.

I think a project dataset is barely worth specifying if we don't validate at least the raw data subdataset. Possibly we should have rules for indicating where validators should look for subdatasets.

In OpenNeuroDerivatives, we use sourcedata/* as a mixture of BIDS and non-BIDS (FreeSurfer) inputs. Nipoppy uses derivatives/<pipeline>/<version>/<output> as derivative datasets; presumably this could also be a mixture of BIDS and non-BIDS.

I think presence of the subdatasets is not really the differentiation here, and formalization of rules for their validation is orthogonal to this issue. Having in mind my prior observation that "raw" is pretty much "a project with data in sub-* folders" we might be circling back to that issue of requiring sub- folders in "raw" BIDS datasets? ;-) and otherwise they are just "BIDS project" datasets, which could contain sourcedata/ and derivatives/ subdatasets, logs/, code/ etc, but not sub-* subfolders with actual data?

Could even kinda become nice that we would facilitate people to even start their "raw BIDS datasets" as "project BIDS datasets" where they plan (README, code/ etc) until they start populate with data and thus becoming raw?

@yarikoptic yarikoptic force-pushed the enh-project-datasettype branch 2 times, most recently from 9c68bf8 to 6f236ce Compare June 10, 2025 12:45
@yarikoptic
Copy link
Collaborator Author

@effigies Following your idea, I have now added a "warning" (to reflect level of the analogous SubjectFolders check in "raw" BIDS). I guess, in principle, we could take this as an opportunity to revert SubjectFolders check back to "error" and demand making it "project" type, and then make it also "error" in project type. WDYT?

@yarikoptic yarikoptic added the copenhagen For discussion in Copenhagen label Jun 11, 2025
@yarikoptic
Copy link
Collaborator Author

While discussing with @jbpoline we wondered, if may be study would be a better descriptor to use here in favor of project. One of the rationales, is that e.g. in BEP035 (attn @bids-standard/bep035) on Mega-analysis they introduce study- entity as a groupping element. It kinda then would match natively.

we also mention "study" in various places in BIDS which seems to align nicely here
❯ git grep study
src/CHANGES.md:-   \[FIX] update physio bids name in longitudinal study page examples [#863](https://github.com/bids-standard/bids-specification/pull/863) ([Remi-Gau](https://github.com/Remi-Gau))
src/appendices/coordinate-systems.md:The following template identifiers are RECOMMENDED for individual- and study-specific reference
src/appendices/coordinate-systems.md:In the case of multiple study templates, additional names may need to be defined.
src/appendices/coordinate-systems.md:| study                 | Custom space defined using a group/study-specific template. This coordinate system requires specifying an additional file to be fully defined.                                                                                                                         |
src/appendices/hed.md:numerical values that are similar across the recordings in the study.
src/appendices/hed.md:repository on GitHub should be used to validate the study event annotations.
src/common-principles.md:    unless when appropriate given the study goals, for example, when scanning babies.
src/introduction.md:> The data used in the study were organized using the
src/modality-specific-files/genetic-descriptor.md:     "Dataset": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001364.v1.p1",
src/modality-specific-files/intracranial-electroencephalography.md:Note that the date and time information SHOULD be stored in the study key file
src/modality-specific-files/magnetic-resonance-spectroscopy.md:acquisition parameters in filenames is helpful or necessary to distinguish datasets in a given study.
src/modality-specific-files/motion.md:Note that the onsets of the recordings SHOULD be stored in the study key file [(`scans.tsv`)](../modality-agnostic-files.md#scans-file).
src/modality-specific-files/positron-emission-tomography.md:This entity is OPTIONAL if only one tracer is used in the study,
src/modality-specific-files/task-events.md:Please mind that this does not imply that only so called "event related" study designs
src/schema/objects/common_principles.yaml:    A set of neuroimaging and behavioral data acquired for a purpose of a particular study.
src/schema/objects/common_principles.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/common_principles.yaml:    A person or animal participating in the study.
src/schema/objects/entities.yaml:    For example, this should be used when a study includes two T1w images -
src/schema/objects/entities.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/entities.yaml:    A person or animal participating in the study.
src/schema/objects/enums.yaml:study:
src/schema/objects/enums.yaml:  value: study
src/schema/objects/enums.yaml:  display_name: study
src/schema/objects/enums.yaml:    Custom space defined using a group/study-specific template.
src/schema/objects/metadata.yaml:    Reference to the study/studies on which the implementation is based.
src/schema/objects/metadata.yaml:    The version of the HED schema used to validate HED tags for study.
tools/schemacode/src/bidsschematools/tests/data/broken_dataset_description.json:"EthicsApprovals": ["The original study from which this BIDS example dataset was derived was approved by the Ethics committee of Ghent University Hospital with identifier EC 2017/1103."]

and "project" mentionings are not particularly aligned. So, I think, we should just make it a "study", hence renaming accordingly.

@yarikoptic yarikoptic changed the title Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset Add DatasetType="study" and rework existing "layout" example into a proper BIDS dataset Jun 12, 2025
This reverts commit a3c12f8 where I have tried
to introduce it in
bids-standard#1741 but it required a
little more of further detailing.
Idea from @effigies while discussing this PR at BIDS Maintainers meeting 2025
…ith SubjectFolders check

Also adjusted wording to be aligned too
While discussing with @jbpoline we wondered, if may be `study` would be a better descriptor to use here in favor of `project`.  One of the rationales, is that e.g. in [BEP035](https://bids.neuroimaging.io/extensions/beps/bep_035.html) (attn @bids-standard/bep035) on Mega-analysis they introduce `study-` entity as a groupping element.  It kinda then would match natively.

we also  mention "study" in various places in BIDS which seems to align nicely here

```shell
❯ git grep study
src/CHANGES.md:-   \[FIX] update physio bids name in longitudinal study page examples [bids-standard#863](bids-standard#863) ([Remi-Gau](https://github.com/Remi-Gau))
src/appendices/coordinate-systems.md:The following template identifiers are RECOMMENDED for individual- and study-specific reference
src/appendices/coordinate-systems.md:In the case of multiple study templates, additional names may need to be defined.
src/appendices/coordinate-systems.md:| study                 | Custom space defined using a group/study-specific template. This coordinate system requires specifying an additional file to be fully defined.                                                                                                                         |
src/appendices/hed.md:numerical values that are similar across the recordings in the study.
src/appendices/hed.md:repository on GitHub should be used to validate the study event annotations.
src/common-principles.md:    unless when appropriate given the study goals, for example, when scanning babies.
src/introduction.md:> The data used in the study were organized using the
src/modality-specific-files/genetic-descriptor.md:     "Dataset": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001364.v1.p1",
src/modality-specific-files/intracranial-electroencephalography.md:Note that the date and time information SHOULD be stored in the study key file
src/modality-specific-files/magnetic-resonance-spectroscopy.md:acquisition parameters in filenames is helpful or necessary to distinguish datasets in a given study.
src/modality-specific-files/motion.md:Note that the onsets of the recordings SHOULD be stored in the study key file [(`scans.tsv`)](../modality-agnostic-files.md#scans-file).
src/modality-specific-files/positron-emission-tomography.md:This entity is OPTIONAL if only one tracer is used in the study,
src/modality-specific-files/task-events.md:Please mind that this does not imply that only so called "event related" study designs
src/schema/objects/common_principles.yaml:    A set of neuroimaging and behavioral data acquired for a purpose of a particular study.
src/schema/objects/common_principles.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/common_principles.yaml:    A person or animal participating in the study.
src/schema/objects/entities.yaml:    For example, this should be used when a study includes two T1w images -
src/schema/objects/entities.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/entities.yaml:    A person or animal participating in the study.
src/schema/objects/enums.yaml:study:
src/schema/objects/enums.yaml:  value: study
src/schema/objects/enums.yaml:  display_name: study
src/schema/objects/enums.yaml:    Custom space defined using a group/study-specific template.
src/schema/objects/metadata.yaml:    Reference to the study/studies on which the implementation is based.
src/schema/objects/metadata.yaml:    The version of the HED schema used to validate HED tags for study.
tools/schemacode/src/bidsschematools/tests/data/broken_dataset_description.json:"EthicsApprovals": ["The original study from which this BIDS example dataset was derived was approved by the Ethics committee of Ghent University Hospital with identifier EC 2017/1103."]

```

and "project" mentionings are not particularly  aligned.  So, I think, we should just make it a "study", hence renaming accordingly.
@effigies effigies force-pushed the enh-project-datasettype branch from 154a8cc to 315c08f Compare June 12, 2025 09:20
@yarikoptic yarikoptic requested a review from effigies June 13, 2025 19:24
…dataset

Well, any BIDS dataset is a "study" dataset, but there the point is that ATM for both
"raw" and "derivative" types we expect to have sub- folders and that was the prior
behavior, which should not be affected by this PR.

This should address the review comment of @effigies
https://github.com/bids-standard/bids-specification/pull/1972/files#r2142639988
Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me.

@effigies effigies changed the title Add DatasetType="study" and rework existing "layout" example into a proper BIDS dataset Add "study" DatasetType to organize a collection of source and derivative datasets Jun 16, 2025
@effigies effigies changed the title Add "study" DatasetType to organize a collection of source and derivative datasets [ENH] Add "study" DatasetType to organize a collection of source and derivative datasets Jun 16, 2025
@effigies
Copy link
Collaborator

A functional question for the validator: Should we error if there are no valid BIDS datasets in sourcedata/?

cc @rwblair @nellh

@yarikoptic
Copy link
Collaborator Author

yarikoptic commented Jun 16, 2025

sorry, I do not see how this relates to this PR since having a "valid BIDS datasets in sourcedata/" seems to point to be a "derivative BIDS" dataset by its definition.

edit: the point is that "study" dataset could even be empty to start with, start collecting various other sourcedata/ and eventually do get sourcedata/raw BIDS dataset. I think it should remain valid through those life cycles.

@yarikoptic
Copy link
Collaborator Author

@effigies who, among maintainers, do you think might also be interested to review this PR?

@effigies
Copy link
Collaborator

But what are you validating if there are no valid subdatasets? Why are you running the validator?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
copenhagen For discussion in Copenhagen needs-review opinions wanted Please read and offer your opinion on this matter
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants