[ENH] Add "study" DatasetType to organize a collection of source and derivative datasets #1972

yarikoptic · 2024-10-29T21:01:22Z

edit: formerly it was "project" but then renamed to "study" for better alignment

This PR was initially submitted as #1861 but I made a mistake to combine it with a discussion of transformations of existing projects' layouts into such BIDS project dataset. Please refer to that PR for examples but otherwise let's concentrate here on the discussion of this specific proposed change.

Rationale 1 (major): BIDS standard already provides reasonable structure to formalize organization of various components of a neuroscientific data project: where to place code, original (source) data, derivative data, README, CHANGES. Many projects (e.g. nipoppy, YODA, etc) propose similar and often might be even "inspired" templates . If we explicitly allow for BIDS standard to prescribe study level organization, IMHO it would help many people and projects decide on how to organize their studies/projects.
- See Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset #1861 on some examples and initial discussions on possible "transformations". But this PR has merit regardless whether individual existing projects adopt it partially or fully.
Rationale 2: IMHO BIDS standard should describe only what standard prescribe and not recommend some potential "non-standardized" layouts. That is why I "reworked" that example into a legitimate BIDS dataset merely by adding dataset_description.json.

TODOs:

provided more relation to existing approaches (mostly by @snastase in Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset #1861)
crafted example for bids-examples : Add examples with DatasetType = "project" bids-examples#451
- ensure bids-validator with modified schema passes its validation

codecov · 2025-01-23T02:44:24Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.15%. Comparing base (8e175f4) to head (1ac57a0).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1972   +/-   ##
=======================================
  Coverage   82.15%   82.15%           
=======================================
  Files          17       17           
  Lines        1530     1530           
=======================================
  Hits         1257     1257           
  Misses        273      273

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

yarikoptic · 2025-03-01T01:41:16Z

FWIW, we conversed with @effigies and he brought up an interesting argument, although IMHO not contradicting this one per se, is that ATM any BIDS dataset (raw or derivative) which already contains some subdatasets under derivatives/ could be considered to be a "project" BIDS dataset. Indeed it is true. But IMHO it is just a legacy side-effect of BIDS recommendation to stick derivatives under derivatives/ which somewhat invalidate the claim that the BIDS dataset as a whole is a "raw" dataset.

edit: related linked below is #2103 highlighting the same situation with "raw" dataset containing "derivatives/"

yarikoptic · 2025-04-23T14:55:40Z

@effigies I wonder if we should extend DatasetType to become a list or replace or compliment with DatasetTypes which could then indeed be just "project" when containing other derivative bids-datasets or sourcedata, or ["raw", "project"] when also a raw and project, or could be also ["derivative", "project"] when derivative but points to sourcedata/ original "raw" ... ?

effigies · 2025-04-25T12:56:23Z

I'm skeptical of that need. I would expect your ["raw", "project"] to look like:

project/
  dataset_description.json  # "DatasetType": "project"
  rawdata/
    dataset_description.json  # "DatasetType": "raw"
  derivatives/
    preproc/
      dataset_description.json  # "DatasetType": "derivative"

And ["derivative", "project"]:

project/
  dataset_description.json  # "DatasetType": "project"
  sourcedata/
    dataset_description.json  # "DatasetType": "raw"
  rawdata/
    dataset_description.json  # "DatasetType": "derivative"
  deriv/
    analysis/
      dataset_description.json  # "DatasetType": "derivative"

Subdatasets should be validatable BIDS datasets in their own right, avoiding the need for a top-level dataset_description.json to modify how they are intended to be validated.

effigies · 2025-04-25T13:07:38Z

I think this overall needs more specification. What are valid directories in a project-type dataset? We should add them to https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/directories.yaml.

I think a project dataset is barely worth specifying if we don't validate at least the raw data subdataset. Possibly we should have rules for indicating where validators should look for subdatasets.

In OpenNeuroDerivatives, we use sourcedata/* as a mixture of BIDS and non-BIDS (FreeSurfer) inputs. Nipoppy uses derivatives/<pipeline>/<version>/<output> as derivative datasets; presumably this could also be a mixture of BIDS and non-BIDS.

yarikoptic · 2025-04-25T20:25:27Z

yet to "process" but a quick side idea inspired by #1928 --- I wonder if there is a hierarchy here: project (everything common) -> raw (current default, requires having sub- folder(s)) -> derivative (more stuff could be added), as every next level adds capabilities but includes all of the prior one as derivative could include raw in it? or we have already something which invalidates that?

yarikoptic · 2025-05-19T17:23:06Z

I think this overall needs more specification. What are valid directories in a project-type dataset? We should add them to https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/directories.yaml.

They are already there, that's somewhat the point here -- that we are already defining the structure of all those folders, nothing new to add.

I think a project dataset is barely worth specifying if we don't validate at least the raw data subdataset. Possibly we should have rules for indicating where validators should look for subdatasets.

In OpenNeuroDerivatives, we use sourcedata/* as a mixture of BIDS and non-BIDS (FreeSurfer) inputs. Nipoppy uses derivatives/<pipeline>/<version>/<output> as derivative datasets; presumably this could also be a mixture of BIDS and non-BIDS.

I think presence of the subdatasets is not really the differentiation here, and formalization of rules for their validation is orthogonal to this issue. Having in mind my prior observation that "raw" is pretty much "a project with data in sub-* folders" we might be circling back to that issue of requiring sub- folders in "raw" BIDS datasets? ;-) and otherwise they are just "BIDS project" datasets, which could contain sourcedata/ and derivatives/ subdatasets, logs/, code/ etc, but not sub-* subfolders with actual data?

Could even kinda become nice that we would facilitate people to even start their "raw BIDS datasets" as "project BIDS datasets" where they plan (README, code/ etc) until they start populate with data and thus becoming raw?

yarikoptic · 2025-06-10T12:48:20Z

@effigies Following your idea, I have now added a "warning" (to reflect level of the analogous SubjectFolders check in "raw" BIDS). I guess, in principle, we could take this as an opportunity to revert SubjectFolders check back to "error" and demand making it "project" type, and then make it also "error" in project type. WDYT?

yarikoptic · 2025-06-12T08:07:27Z

While discussing with @jbpoline we wondered, if may be study would be a better descriptor to use here in favor of project. One of the rationales, is that e.g. in BEP035 (attn @bids-standard/bep035) on Mega-analysis they introduce study- entity as a groupping element. It kinda then would match natively.

we also mention "study" in various places in BIDS which seems to align nicely here

❯ git grep study
src/CHANGES.md:-   \[FIX] update physio bids name in longitudinal study page examples [#863](https://github.com/bids-standard/bids-specification/pull/863) ([Remi-Gau](https://github.com/Remi-Gau))
src/appendices/coordinate-systems.md:The following template identifiers are RECOMMENDED for individual- and study-specific reference
src/appendices/coordinate-systems.md:In the case of multiple study templates, additional names may need to be defined.
src/appendices/coordinate-systems.md:| study                 | Custom space defined using a group/study-specific template. This coordinate system requires specifying an additional file to be fully defined.                                                                                                                         |
src/appendices/hed.md:numerical values that are similar across the recordings in the study.
src/appendices/hed.md:repository on GitHub should be used to validate the study event annotations.
src/common-principles.md:    unless when appropriate given the study goals, for example, when scanning babies.
src/introduction.md:> The data used in the study were organized using the
src/modality-specific-files/genetic-descriptor.md:     "Dataset": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001364.v1.p1",
src/modality-specific-files/intracranial-electroencephalography.md:Note that the date and time information SHOULD be stored in the study key file
src/modality-specific-files/magnetic-resonance-spectroscopy.md:acquisition parameters in filenames is helpful or necessary to distinguish datasets in a given study.
src/modality-specific-files/motion.md:Note that the onsets of the recordings SHOULD be stored in the study key file [(`scans.tsv`)](../modality-agnostic-files.md#scans-file).
src/modality-specific-files/positron-emission-tomography.md:This entity is OPTIONAL if only one tracer is used in the study,
src/modality-specific-files/task-events.md:Please mind that this does not imply that only so called "event related" study designs
src/schema/objects/common_principles.yaml:    A set of neuroimaging and behavioral data acquired for a purpose of a particular study.
src/schema/objects/common_principles.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/common_principles.yaml:    A person or animal participating in the study.
src/schema/objects/entities.yaml:    For example, this should be used when a study includes two T1w images -
src/schema/objects/entities.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/entities.yaml:    A person or animal participating in the study.
src/schema/objects/enums.yaml:study:
src/schema/objects/enums.yaml:  value: study
src/schema/objects/enums.yaml:  display_name: study
src/schema/objects/enums.yaml:    Custom space defined using a group/study-specific template.
src/schema/objects/metadata.yaml:    Reference to the study/studies on which the implementation is based.
src/schema/objects/metadata.yaml:    The version of the HED schema used to validate HED tags for study.
tools/schemacode/src/bidsschematools/tests/data/broken_dataset_description.json:"EthicsApprovals": ["The original study from which this BIDS example dataset was derived was approved by the Ethics committee of Ghent University Hospital with identifier EC 2017/1103."]

and "project" mentionings are not particularly aligned. So, I think, we should just make it a "study", hence renaming accordingly.

This reverts commit a3c12f8 where I have tried to introduce it in bids-standard#1741 but it required a little more of further detailing.

@effigies

Idea from @effigies while discussing this PR at BIDS Maintainers meeting 2025

…ith SubjectFolders check Also adjusted wording to be aligned too

@jbpoline

While discussing with @jbpoline we wondered, if may be `study` would be a better descriptor to use here in favor of `project`. One of the rationales, is that e.g. in [BEP035](https://bids.neuroimaging.io/extensions/beps/bep_035.html) (attn @bids-standard/bep035) on Mega-analysis they introduce `study-` entity as a groupping element. It kinda then would match natively. we also mention "study" in various places in BIDS which seems to align nicely here ```shell ❯ git grep study src/CHANGES.md:- \[FIX] update physio bids name in longitudinal study page examples [bids-standard#863](bids-standard#863) ([Remi-Gau](https://github.com/Remi-Gau)) src/appendices/coordinate-systems.md:The following template identifiers are RECOMMENDED for individual- and study-specific reference src/appendices/coordinate-systems.md:In the case of multiple study templates, additional names may need to be defined. src/appendices/coordinate-systems.md:| study | Custom space defined using a group/study-specific template. This coordinate system requires specifying an additional file to be fully defined. | src/appendices/hed.md:numerical values that are similar across the recordings in the study. src/appendices/hed.md:repository on GitHub should be used to validate the study event annotations. src/common-principles.md: unless when appropriate given the study goals, for example, when scanning babies. src/introduction.md:> The data used in the study were organized using the src/modality-specific-files/genetic-descriptor.md: "Dataset": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001364.v1.p1", src/modality-specific-files/intracranial-electroencephalography.md:Note that the date and time information SHOULD be stored in the study key file src/modality-specific-files/magnetic-resonance-spectroscopy.md:acquisition parameters in filenames is helpful or necessary to distinguish datasets in a given study. src/modality-specific-files/motion.md:Note that the onsets of the recordings SHOULD be stored in the study key file [(`scans.tsv`)](../modality-agnostic-files.md#scans-file). src/modality-specific-files/positron-emission-tomography.md:This entity is OPTIONAL if only one tracer is used in the study, src/modality-specific-files/task-events.md:Please mind that this does not imply that only so called "event related" study designs src/schema/objects/common_principles.yaml: A set of neuroimaging and behavioral data acquired for a purpose of a particular study. src/schema/objects/common_principles.yaml: Session can (but doesn't have to) be synonymous to a visit in a longitudinal study. src/schema/objects/common_principles.yaml: A person or animal participating in the study. src/schema/objects/entities.yaml: For example, this should be used when a study includes two T1w images - src/schema/objects/entities.yaml: Session can (but doesn't have to) be synonymous to a visit in a longitudinal study. src/schema/objects/entities.yaml: A person or animal participating in the study. src/schema/objects/enums.yaml:study: src/schema/objects/enums.yaml: value: study src/schema/objects/enums.yaml: display_name: study src/schema/objects/enums.yaml: Custom space defined using a group/study-specific template. src/schema/objects/metadata.yaml: Reference to the study/studies on which the implementation is based. src/schema/objects/metadata.yaml: The version of the HED schema used to validate HED tags for study. tools/schemacode/src/bidsschematools/tests/data/broken_dataset_description.json:"EthicsApprovals": ["The original study from which this BIDS example dataset was derived was approved by the Ethics committee of Ghent University Hospital with identifier EC 2017/1103."] ``` and "project" mentionings are not particularly aligned. So, I think, we should just make it a "study", hence renaming accordingly.

src/schema/objects/metadata.yaml

src/schema/rules/checks/dataset.yaml

@effigies

…dataset Well, any BIDS dataset is a "study" dataset, but there the point is that ATM for both "raw" and "derivative" types we expect to have sub- folders and that was the prior behavior, which should not be affected by this PR. This should address the review comment of @effigies https://github.com/bids-standard/bids-specification/pull/1972/files#r2142639988

effigies

Works for me.

effigies · 2025-06-16T15:35:46Z

A functional question for the validator: Should we error if there are no valid BIDS datasets in sourcedata/?

cc @rwblair @nellh

yarikoptic · 2025-06-16T15:55:57Z

sorry, I do not see how this relates to this PR since having a "valid BIDS datasets in sourcedata/" seems to point to be a "derivative BIDS" dataset by its definition.

edit: the point is that "study" dataset could even be empty to start with, start collecting various other sourcedata/ and eventually do get sourcedata/raw BIDS dataset. I think it should remain valid through those life cycles.

yarikoptic · 2025-06-16T15:59:16Z

@effigies who, among maintainers, do you think might also be interested to review this PR?

effigies · 2025-06-16T16:00:53Z

But what are you validating if there are no valid subdatasets? Why are you running the validator?

yarikoptic added the opinions wanted Please read and offer your opinion on this matter label Oct 29, 2024

yarikoptic requested a review from effigies October 29, 2024 21:01

yarikoptic requested review from erdalkaraca and DimitriPapadopoulos as code owners October 29, 2024 21:01

yarikoptic requested a review from tsalo October 29, 2024 21:01

yarikoptic changed the title ~~Enh project datasettype~~ Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset Oct 30, 2024

yarikoptic force-pushed the enh-project-datasettype branch from d0d5c37 to fb4f5a4 Compare January 23, 2025 02:40

This was referenced Apr 16, 2025

DOC: derivatives datasets vs derivatives data #2103

Open

[ENH] Formalize presence of optional docs/ folder #2104

Merged

yarikoptic force-pushed the enh-project-datasettype branch 2 times, most recently from 9c68bf8 to 6f236ce Compare June 10, 2025 12:45

yarikoptic added the copenhagen For discussion in Copenhagen label Jun 11, 2025

effigies assigned effigies and jbpoline Jun 11, 2025

yarikoptic changed the title ~~Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset~~ Add DatasetType="study" and rework existing "layout" example into a proper BIDS dataset Jun 12, 2025

yarikoptic added 5 commits June 12, 2025 05:20

Add the notion that example layout can in fact be a valid BIDS dataset

c529ca6

This reverts commit a3c12f8 where I have tried to introduce it in bids-standard#1741 but it required a little more of further detailing.

Move and extend description and definition of DatasetType "project"

8eb8769

Add rule to ensure that "project" DatasetType has no subject folders

1929135

Idea from @effigies while discussing this PR at BIDS Maintainers meeting 2025

Make NoSubjectFolders into a "warning" from "error" to have aligned w…

adfcc79

…ith SubjectFolders check Also adjusted wording to be aligned too

effigies force-pushed the enh-project-datasettype branch from 154a8cc to 315c08f Compare June 12, 2025 09:20

effigies reviewed Jun 12, 2025

View reviewed changes

src/schema/objects/metadata.yaml Outdated Show resolved Hide resolved

Update src/schema/objects/metadata.yaml

e8c5d84

effigies reviewed Jun 12, 2025

View reviewed changes

src/schema/rules/checks/dataset.yaml Outdated Show resolved Hide resolved

yarikoptic requested a review from effigies June 13, 2025 19:24

effigies approved these changes Jun 16, 2025

View reviewed changes

effigies added the needs-review label Jun 16, 2025

effigies changed the title ~~Add DatasetType="study" and rework existing "layout" example into a proper BIDS dataset~~ Add "study" DatasetType to organize a collection of source and derivative datasets Jun 16, 2025

effigies changed the title ~~Add "study" DatasetType to organize a collection of source and derivative datasets~~ [ENH] Add "study" DatasetType to organize a collection of source and derivative datasets Jun 16, 2025

ericearl mentioned this pull request Jun 27, 2025

output maintainers meeting openneuropet/outreach#52

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ENH] Add "study" DatasetType to organize a collection of source and derivative datasets #1972

[ENH] Add "study" DatasetType to organize a collection of source and derivative datasets #1972

Uh oh!

yarikoptic commented Oct 29, 2024 •

edited

Loading

Uh oh!

codecov bot commented Jan 23, 2025 •

edited

Loading

Uh oh!

yarikoptic commented Mar 1, 2025 •

edited

Loading

Uh oh!

yarikoptic commented Apr 23, 2025

Uh oh!

effigies commented Apr 25, 2025

Uh oh!

effigies commented Apr 25, 2025

Uh oh!

yarikoptic commented Apr 25, 2025

Uh oh!

yarikoptic commented May 19, 2025

Uh oh!

yarikoptic commented Jun 10, 2025

Uh oh!

yarikoptic commented Jun 12, 2025

Uh oh!

Uh oh!

Uh oh!

effigies left a comment

Uh oh!

effigies commented Jun 16, 2025

Uh oh!

yarikoptic commented Jun 16, 2025 •

edited

Loading

Uh oh!

yarikoptic commented Jun 16, 2025

Uh oh!

effigies commented Jun 16, 2025

Uh oh!

Uh oh!

[ENH] Add "study" DatasetType to organize a collection of source and derivative datasets #1972

Are you sure you want to change the base?

[ENH] Add "study" DatasetType to organize a collection of source and derivative datasets #1972

Uh oh!

Conversation

yarikoptic commented Oct 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jan 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

yarikoptic commented Mar 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yarikoptic commented Apr 23, 2025

Uh oh!

effigies commented Apr 25, 2025

Uh oh!

effigies commented Apr 25, 2025

Uh oh!

yarikoptic commented Apr 25, 2025

Uh oh!

yarikoptic commented May 19, 2025

Uh oh!

yarikoptic commented Jun 10, 2025

Uh oh!

yarikoptic commented Jun 12, 2025

Uh oh!

Uh oh!

Uh oh!

effigies left a comment

Choose a reason for hiding this comment

Uh oh!

effigies commented Jun 16, 2025

Uh oh!

yarikoptic commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yarikoptic commented Jun 16, 2025

Uh oh!

effigies commented Jun 16, 2025

Uh oh!

Uh oh!

yarikoptic commented Oct 29, 2024 •

edited

Loading

codecov bot commented Jan 23, 2025 •

edited

Loading

yarikoptic commented Mar 1, 2025 •

edited

Loading

yarikoptic commented Jun 16, 2025 •

edited

Loading