Produce specific similarity artifacts for Exomizer using custom counts #125

Open
caufieldjh opened this issue Apr 29, 2024 · 6 comments

@caufieldjh
Member

Based on #124

  • Establish a consistent way to count phenotypes for MP and ZP
  • Produce Exomizer similarity artifact for:
    • HP v HP
    • HP v MP
    • HP v ZP
  • Include some provenance and reproducibility data in the header of each, including semsimian version and the command(s) run to produce the file.

Each will need to incorporate the counts from HPOA (for HP) and Monarch's phenotype files (for MP and ZP) - or in the latter case from wherever Monarch is getting them.
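As a rough sketch of how counts could become IC values, the function below computes IC(t) = log2(total / count(t)) for each term. Everything here is an assumption for illustration: the function name, the choice of `total` (for example, the number of annotated genes or diseases), and the expectation that counts have already been propagated to ancestor terms. It is not how semsimian currently builds its IC map.

```rust
use std::collections::HashMap;

/// Hypothetical helper: convert per-term annotation counts into
/// information content, IC(t) = log2(total / count(t)).
/// Assumes `counts` already includes annotations inherited from descendants.
/// Terms with a zero count are skipped because their IC is undefined.
pub fn ic_from_counts(counts: &HashMap<String, u64>, total: u64) -> HashMap<String, f64> {
    let mut ic = HashMap::new();
    if total == 0 {
        return ic;
    }
    for (term, &count) in counts {
        if count == 0 {
            continue;
        }
        let value = (total as f64 / count as f64).log2();
        ic.insert(term.clone(), value);
    }
    ic
}
```

A term annotated to every entity gets IC 0, and rarer terms get higher values, so the same function could be applied to HP, MP, and ZP counts as long as `total` is chosen consistently for each.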

@caufieldjh
Member Author

caufieldjh commented Apr 29, 2024

See the gene to phenotype tables here: https://data.monarchinitiative.org/latest/tsv/gene_associations/index.html

These appear to be identical to the owlsim tables already in use by Exomizer (see https://archive.monarchinitiative.org/latest/owlsim/data/Danio_rerio/Dr_gene_phenotype.txt)
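One possible way to count from tables like these, assuming a two-column, tab-separated layout of gene ID then phenotype ID (the layout is an assumption and should be checked against the actual files), is to count distinct genes per phenotype term. The function name and the handling of `#` comment lines are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: count distinct genes annotated to each phenotype term.
/// Assumes each non-empty, non-comment line is `gene_id<TAB>phenotype_id`.
pub fn count_genes_per_phenotype(tsv: &str) -> HashMap<String, usize> {
    let mut genes_by_term: HashMap<String, HashSet<String>> = HashMap::new();
    for line in tsv.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut cols = line.split('\t');
        // Lines with fewer than two columns are silently skipped in this sketch.
        if let (Some(gene), Some(term)) = (cols.next(), cols.next()) {
            genes_by_term
                .entry(term.to_string())
                .or_default()
                .insert(gene.to_string());
        }
    }
    // Duplicate gene/term rows collapse because genes are stored in a set.
    genes_by_term
        .into_iter()
        .map(|(term, genes)| (term, genes.len()))
        .collect()
}
```

Counting distinct genes rather than rows keeps duplicated associations from inflating a term's count; whether that is the right unit for MP and ZP is exactly the open question in this issue.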

@justaddcoffee
Member

To flesh this out a bit, the plan we discussed was, I think:

  • start with our existing Jenkinsfile that runs semsimian
  • make each HP x HP, HP x MP, HP x ZP table separately using:
    • HP phenotype counts from HPOA
    • MP phenotype counts from IMPC and MGI
    • ZP phenotype counts from gene_phenotype.7955.tsv on the Monarch data website (here, I think?), to be put somewhere coherent on the KG-Hub S3 bucket
  • finish implementing custom IC maps in semsimian
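When each table is written, the provenance header requested in the issue description could be emitted first. The sketch below is only an illustration: the `#` comment prefix and the field names are assumptions, not a format defined by Exomizer or semsimian, and whether Exomizer tolerates comment lines would need to be confirmed.

```rust
/// Hypothetical helper: build a commented header recording the semsimian
/// version and the command(s) used to produce a similarity file.
/// The `# key: value` layout is an assumption for illustration only.
pub fn provenance_header(semsimian_version: &str, commands: &[&str]) -> String {
    let mut header = String::new();
    header.push_str(&format!("# semsimian_version: {}\n", semsimian_version));
    for command in commands {
        header.push_str(&format!("# command: {}\n", command));
    }
    header
}
```

The returned string would be written before the first data row of each HP x HP, HP x MP, and HP x ZP table.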

@caufieldjh
Member Author

IMPC phenotypes may not already be included in Monarch G2P tables, but Exomizer does use them.

@caufieldjh
Member Author

caufieldjh commented May 6, 2024

So to get this done, we will need to:

  • Decouple creation/updating of the IC map and closure map. They are currently stored separately but updated at the same time (see semsimian/src/lib.rs, lines 68 to 78 in d66e20c):

    pub struct RustSemsimian {
        spo: Vec<(TermID, Predicate, TermID)>,
        predicates: Option<Vec<Predicate>>,
        ic_map: HashMap<PredicateSetKey, HashMap<TermID, f64>>,
        // ic_map is something like {("is_a_+_part_of"), {"GO:1234": 1.234}}
        closure_map: HashMap<PredicateSetKey, HashMap<TermID, HashSet<TermID>>>,
        // closure_map is something like {("is_a_+_part_of"), {"GO:1234": {"GO:1234", "GO:5678"}}}
        embeddings: Embeddings,
        pairwise_similarity_attributes: Option<Vec<String>>,
        prefix_expansion_cache: HashMap<TermID, HashMap<TermID, HashSet<TermID>>>,
        max_ic_cache: HashMap<String, (HashSet<String>, f64)>,
        // (excerpt ends at line 78)
  • Check for whether we have been provided with a custom IC map (as a filepath). If so, don't create a new one, but parse the provided one and use that instead.
  • Raise an error if the custom IC map and the closures are not aligned.
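The "parse the provided one" step might look roughly like the following. This is a sketch under stated assumptions: the file is taken to be two tab-separated columns (term ID, IC value), `parse_ic_map` is a hypothetical name, and it works on already-loaded text rather than a filepath to stay self-contained. It produces only the inner `HashMap<TermID, f64>`; keying by `PredicateSetKey` as in the struct above is left out.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parse a custom IC map from text where each
/// non-empty, non-comment line is `term_id<TAB>ic_value` (assumed layout).
/// Returns an error naming the offending line instead of silently skipping it.
pub fn parse_ic_map(text: &str) -> Result<HashMap<String, f64>, String> {
    let mut ic_map = HashMap::new();
    for (idx, line) in text.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut cols = line.split('\t');
        let term = cols
            .next()
            .ok_or_else(|| format!("line {}: missing term ID", idx + 1))?;
        let value = cols
            .next()
            .ok_or_else(|| format!("line {}: missing IC value", idx + 1))?;
        let ic: f64 = value
            .trim()
            .parse()
            .map_err(|e| format!("line {}: bad IC value {:?}: {}", idx + 1, value, e))?;
        ic_map.insert(term.to_string(), ic);
    }
    Ok(ic_map)
}
```

Failing on a malformed line, rather than ignoring it, fits the stated goal of raising an error when inputs are not in alignment.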

@caufieldjh
Member Author

See also #47; the same concern applies here, but for terms missing from the IC map rather than the closure map.

@caufieldjh
Member Author

If the closure map and IC map do not contain each other's keys, raise an error.
We will work under the assumption that the user will provide inputs that are in alignment and contain all necessary IDs.
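A minimal sketch of that key check, assuming both maps have already been reduced to a single predicate set and use plain `String` term IDs (the function name and error format are hypothetical, not semsimian's API):

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: verify that the IC map and closure map have the
/// same set of term IDs as keys. Returns an error listing the terms that
/// are present in one map but missing from the other.
pub fn check_alignment(
    ic_map: &HashMap<String, f64>,
    closure_map: &HashMap<String, HashSet<String>>,
) -> Result<(), String> {
    let mut missing_from_ic: Vec<&String> = closure_map
        .keys()
        .filter(|term| !ic_map.contains_key(*term))
        .collect();
    let mut missing_from_closure: Vec<&String> = ic_map
        .keys()
        .filter(|term| !closure_map.contains_key(*term))
        .collect();
    // Sort so the error message is deterministic despite HashMap ordering.
    missing_from_ic.sort();
    missing_from_closure.sort();

    if missing_from_ic.is_empty() && missing_from_closure.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "IC map and closure map are not aligned; missing from IC map: {:?}; missing from closure map: {:?}",
            missing_from_ic, missing_from_closure
        ))
    }
}
```

Listing the missing IDs in the error should make it easier for the user to fix their inputs, given that no attempt is made to repair misaligned maps.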
