Metadata subclasses and conversion #451

Open
adamcantor22 opened this issue Feb 10, 2023 · 2 comments
Labels
Metadata Relating to the generation, checking, or format of metadata
Milestone

Comments

@adamcantor22
Member

Is your feature request related to a problem? Please describe.
The new Metadata class (#432) must be able to represent metadata in several different formats. We need to import any of these formats and convert between them freely.

Describe the solution you'd like
We can use subclasses to describe the formats:

  • Metadata (generic)
    • Qiime1
    • Qiime2
    • MMEDS Full
      • MMEDS Subject
      • MMEDS Specimen
    • LEfSe
    • SRA

At each level, there should be one function going 'up' (e.g. Qiime2 -> Metadata generic) and one going 'down' (e.g. Metadata generic -> MMEDS Full).
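A minimal sketch of how the up/down pattern could look, assuming every format routes through the generic class. All names here (`to_generic`, `from_generic`, `convert`, and the class names) are hypothetical and not taken from #432:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Generic metadata: column names plus one dict per row (assumed layout)."""

    columns: list[str] = field(default_factory=list)
    rows: list[dict[str, str]] = field(default_factory=list)

    def to_generic(self) -> Metadata:
        """Go 'up': return a plain Metadata copy without format-specific type."""
        return Metadata(list(self.columns), [dict(r) for r in self.rows])

    @classmethod
    def from_generic(cls, generic: Metadata) -> Metadata:
        """Go 'down': build this format from plain Metadata.

        Subclasses would override this to add format-specific structure.
        """
        return cls(list(generic.columns), [dict(r) for r in generic.rows])

    def convert(self, target: type[Metadata]) -> Metadata:
        """Convert between any two formats by going up, then down."""
        return target.from_generic(self.to_generic())


class Qiime1Metadata(Metadata):
    pass


class Qiime2Metadata(Metadata):
    pass


class MMEDSFullMetadata(Metadata):
    pass


class MMEDSSubjectMetadata(MMEDSFullMetadata):
    pass


class MMEDSSpecimenMetadata(MMEDSFullMetadata):
    pass


class LefseMetadata(Metadata):
    pass


class SRAMetadata(Metadata):
    pass


q2 = Qiime2Metadata(["sample", "mass"], [{"sample": "S1", "mass": "70"}])
mmeds = q2.convert(MMEDSFullMetadata)
```

With this shape, supporting a new format means writing one pair of functions rather than one converter per pair of formats.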

Converting to MMEDS
Converting to MMEDS from another format is by far the biggest challenge. The MMEDS format has five header rows: Table Name, Var Name, Opt/Req, Format, and Unit/Length Restriction. If we're trying to get, say, the MMEDS variable 'Weight' from a Qiime2 file, we need to be prepared for several situations, for example:

  • Var name "weight" (lowercase)
  • Var name "mass" (something related)
  • Var name "Subject_Information_Pounds" (something vaguely but not simply related)
  • Var name "k" (essentially no information at all)

Then, once we have determined that the variable in question is indeed Weight, we also need to infer its units. What if the data doesn't include any units at all and is purely numerical?

@circlespie and I discussed a two-tiered approach. The first pass would use some kind of AI assistance, such as a word-association cluster that could infer that a related word like 'mass' implies the variable 'Weight'. The fallback would resolve any remaining uncertainty with the user, asking something like "Does 'Subanalysis' match 'SpecimenType'? y/n".
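A rough sketch of the two tiers, under loud assumptions: a hand-written synonym table and `difflib` string similarity stand in for the word-association model, and `match_column`, `SYNONYMS`, and both cutoffs are hypothetical placeholders rather than anything decided here:

```python
import difflib

# Hypothetical synonym table; a real first pass might use word embeddings.
SYNONYMS = {"mass": "Weight", "pounds": "Weight", "lbs": "Weight"}


def match_column(label, mmeds_vars, confirm, auto_cutoff=0.85, ask_cutoff=0.4):
    """Return the MMEDS variable matching `label`, or None if nothing is accepted.

    `confirm` is a callable that takes a question string and returns a bool.
    """
    by_lower = {v.lower(): v for v in mmeds_vars}
    key = label.strip().lower()

    # Tier 1a: exact match, ignoring case.
    if key in by_lower:
        return by_lower[key]

    # Tier 1b: any underscore/hyphen-separated token found in the synonym table.
    for token in key.replace("-", "_").split("_"):
        target = SYNONYMS.get(token)
        if target in mmeds_vars:
            return target

    # Tier 1c: closest variable name by string similarity.
    score, best = max(
        ((difflib.SequenceMatcher(None, key, low).ratio(), var)
         for low, var in by_lower.items()),
        default=(0.0, None),
    )
    if best is not None and score >= auto_cutoff:
        return best

    # Tier 2: ask the user only when the guess is plausible but uncertain.
    if best is not None and score >= ask_cutoff:
        if confirm(f"Does '{label}' match '{best}'? y/n"):
            return best
    return None


def ask_on_terminal(question):
    """Interactive confirm callback for command-line use."""
    return input(question + " ").strip().lower().startswith("y")
```

Calling `match_column("Subject_Information_Pounds", ["Weight", "SpecimenType"], ask_on_terminal)` resolves through the synonym table without prompting, while a label like "k" scores too low to be worth asking about and returns `None`.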

Alternatively, a user could supply a dictionary that explicitly defines what each label maps to. However, this would require user preprocessing, the very thing we're trying to avoid. This needs further discussion.
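For comparison, the dictionary option could be as small as the sketch below. `apply_label_map` is a hypothetical helper; returning the unmapped labels is an assumption about how leftovers might be handed to an inference step or reported back to the user:

```python
def apply_label_map(columns, label_map):
    """Rename columns using a user-supplied mapping.

    Returns the renamed column list and the labels the mapping did not cover.
    """
    renamed = [label_map.get(col, col) for col in columns]
    unmapped = [col for col in columns if col not in label_map]
    return renamed, unmapped


renamed, unmapped = apply_label_map(
    ["mass", "k", "Subanalysis"],
    {"mass": "Weight", "Subanalysis": "SpecimenType"},
)
```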

@adamcantor22 added the Metadata label (Relating to the generation, checking, or format of metadata) Feb 10, 2023
@adamcantor22 added this to the 0.9.0 milestone Feb 10, 2023
@adamcantor22
Member Author

Note: conversion to LEfSe needs to replace spaces in values with `_` or remove them entirely.
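Something like this could cover both options. `lefse_safe` is a hypothetical name, and treating every run of whitespace (not only single spaces) as one separator is an assumption:

```python
def lefse_safe(value, replacement="_"):
    """Replace each run of whitespace in a value; use replacement="" to drop it."""
    return replacement.join(str(value).split())
```

For example, `lefse_safe("stool sample")` gives `"stool_sample"` and `lefse_safe("stool sample", "")` gives `"stoolsample"`.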

@adamcantor22
Member Author

Additional conversion type: REDCap format
