Feature Request: Generate a Croissant metadata file (or any export format) before a dataset is published #11305

Open
mrisdal opened this issue Mar 5, 2025 · 5 comments · May be fixed by #11398
Assignees: pdurbin
Labels: Croissant (Croissant and Kaggle related work); FY25 Sprint 20 (2025-03-26 - 2025-04-09); Size: 20 (A percentage of a sprint. 14 hours.); Type: Feature (a feature request)

Comments

@mrisdal

mrisdal commented Mar 5, 2025

Overview of the Feature Request

Allow a depositor to download (via UI or API) the Croissant metadata file (or any metadata format) before the dataset is published. This is particularly important for datasets shared via Preview URL.

What kind of user is the feature intended for?
(Example users roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)

API User, Depositor, Guest (someone accessing a Dataset via Preview URL)

What inspired the request?

I'm a resource co-chair for the NeurIPS Datasets & Benchmarks track in 2025, and we are evaluating whether to recommend Dataverse as a repository to authors. This is part of a new requirement that authors generate and make accessible Croissant metadata representations of their datasets, so that submissions can be validated automatically and the review process streamlined.

We expect that many authors of NeurIPS D&B track papers would choose to deposit their data via Harvard Dataverse because it offers a Preview URL feature, but it's a major limitation that a Croissant file is not generated. Authors will still need to manually generate them.

Additionally, Kaggle is another repository that offers Preview URLs, and it is adding the ability to download a Croissant file for such datasets.

Even if the feature isn't added for this year's CFPs, it will likely be useful for next year.

What existing behavior do you want changed?

Allow download of Croissant metadata and data files for unpublished Harvard Dataverse datasets.

Any brand new behavior do you want to add to Dataverse?

NA

Any open or closed issues related to this feature request?

After speaking with the team, I don't believe there are any.

Are you thinking about creating a pull request for this feature?
Help is always welcome, is this feature something you or your organization plan to implement?

No.

@mrisdal mrisdal added the Type: Feature a feature request label Mar 5, 2025
@cmbz cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project Mar 5, 2025
@cmbz cmbz added the Croissant Croissant and Kaggle related work label Mar 5, 2025
@jggautier
Contributor

I think #4132, "Allow to export metadata of unpublished datasets", is related, although some of the use cases might be different.

#4372 is about how the metadata exports always contain metadata of the latest published version, even when a user is looking at an older version. I think it's a little less related but might be helpful to be aware of.

@johannes-darms
Contributor

I would love to see support for generating citations and export files for all versions of a dataset. Ideally, these citations and exports should include version information. This enhancement would also help efforts like version DOIs (#4499).

@cmbz cmbz added the Size: 20 A percentage of a sprint. 14 hours. label Mar 27, 2025
@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Mar 27, 2025
@pdurbin pdurbin moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Mar 27, 2025
@pdurbin pdurbin self-assigned this Mar 27, 2025
@pdurbin pdurbin moved this from This Sprint 🏃‍♀️ 🏃 to In Progress 💻 in IQSS Dataverse Project Mar 27, 2025
@pdurbin pdurbin added the FY25 Sprint 20 FY25 Sprint 20 (2025-03-26 - 2025-04-09) label Mar 27, 2025
@pdurbin pdurbin changed the title Feature Request: Generate a Croissant metadata file before a dataset is published Feature Request: Generate a Croissant metadata file (or any export format) before a dataset is published Mar 31, 2025
pdurbin added a commit that referenced this issue Apr 1, 2025
@pdurbin
Member

pdurbin commented Apr 1, 2025

For now I'm looking into exporting just drafts rather than all versions. Here's some work in progress:

pdurbin added a commit that referenced this issue Apr 3, 2025
Drafts are exported on-the-fly rather than being cached.
@pdurbin pdurbin linked a pull request Apr 3, 2025 that will close this issue
@pdurbin
Member

pdurbin commented Apr 3, 2025

A draft PR for now:

pdurbin added a commit that referenced this issue Apr 8, 2025
Drafts are exported on-the-fly rather than being cached.
pdurbin added a commit that referenced this issue Apr 9, 2025
Drafts are exported on-the-fly rather than being cached.
pdurbin added a commit that referenced this issue Apr 9, 2025
@pdurbin
Member

pdurbin commented Apr 9, 2025

This PR is ready for review:

> Allow a depositor to download (via UI or API) the Croissant metadata file (or any metadata format) before the dataset is published.

@mrisdal you said "or API". 😄 Heads up that, at least as of this writing, the new "export drafts" functionality is API-only. (See the PR for why adding it to the UI is a bit complicated.)

> I think #4132, "Allow to export metadata of unpublished datasets", is related, although some of the use cases might be different.

@jggautier yes, highly related. I just left a comment linking to the new PR.

> #4372 is about how the metadata exports always contain metadata of the latest published version, even when a user is looking at an older version. I think it's a little less related but might be helpful to be aware of.

@jggautier right, I didn't touch the UI at all so I didn't try to address this. Maybe we can work on this with the new UI. This issue:

> I would love to see support for generating citations and export files for all versions of a dataset.

@johannes-darms I don't know about citations (@qqmyers did some recent work in #11163), but for exports, my new PR does not allow export for all versions of a dataset. Only the latest published version (as always) and the draft (new!) are supported. For other versions, you're welcome to open an issue.
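
For readers who want to try the API-only draft export described above, here is a minimal sketch in Python of what such a call might look like. It builds on the existing `/api/datasets/export` endpoint with its `exporter` and `persistentId` query parameters and the `X-Dataverse-key` token header. The `version` query parameter and its `:draft` value are assumptions that this thread does not confirm, so check the linked PR and the API guide for the actual syntax. The server URL, DOI, and helper names (`build_export_request`, `fetch_export`) are hypothetical and only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_export_request(server_url, persistent_id, exporter="croissant",
                         version=None, api_token=None):
    """Build (but do not send) a GET request for a dataset metadata export."""
    params = {"exporter": exporter, "persistentId": persistent_id}
    if version is not None:
        # ASSUMPTION: the parameter name "version" and the value ":draft"
        # are not stated in this thread; verify them against the PR.
        params["version"] = version
    url = f"{server_url.rstrip('/')}/api/datasets/export?{urlencode(params)}"

    headers = {}
    if api_token:
        # A draft is not public, so the caller presumably needs a token
        # with permission to view the unpublished dataset.
        headers["X-Dataverse-key"] = api_token
    return Request(url, headers=headers)


def fetch_export(request):
    """Send the request and return the export body as text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical server, DOI, and token placeholder.
    req = build_export_request(
        "https://demo.dataverse.org",
        "doi:10.5072/FK2/EXAMPLE",
        version=":draft",
        api_token="YOUR-API-TOKEN",
    )
    print(req.full_url)  # inspect the URL before calling fetch_export(req)
```

Omitting `version` yields the request for the latest published version, which matches the "as always" behavior mentioned above. Since drafts are exported on the fly rather than cached, repeated calls would presumably regenerate the file each time.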


5 participants