Semsimian calculation taking huge amount of time and resources #115

Open
souzadevinicius opened this issue Dec 19, 2023 · 10 comments

@souzadevinicius
Member

souzadevinicius commented Dec 19, 2023

I'm trying to calculate semantic similarity profiles with the PHENIO ontology, comparing the following term sets:

  1. HPxHP
  2. HPxMP
  3. HPxZP
  • HP term set: 17097 entries
  • MP term set: 13809 entries
  • ZP term set: 39373 entries
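
For a rough sense of scale, the number of term pairs each comparison has to score can be worked out from the set sizes above. This is a back-of-the-envelope sketch that assumes every term in set 1 is compared against every term in set 2 before any threshold is applied; whether semsimian actually visits every pair is not confirmed in this thread. The names `SET_SIZES`, `COMPARISONS`, and `pair_count` are illustrative only.

```python
# Term-set sizes as reported in the list above.
SET_SIZES = {"HP": 17097, "MP": 13809, "ZP": 39373}

# The three comparisons listed above.
COMPARISONS = [("HP", "HP"), ("HP", "MP"), ("HP", "ZP")]


def pair_count(set1: str, set2: str) -> int:
    """Number of (set1 term, set2 term) pairs in a full all-by-all comparison."""
    return SET_SIZES[set1] * SET_SIZES[set2]


if __name__ == "__main__":
    for set1, set2 in COMPARISONS:
        print(f"{set1}x{set2}: {pair_count(set1, set2):,} pairs")
```

Under that assumption, HPxZP alone comes to roughly 673 million pairs, more than twice the HPxHP count.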

Ontology used: PHENIO
Library versions:

semsimian                  0.2.11
oaklib                     0.5.24

Example command-line invocation:

runoak --stacktrace -vvv  -i semsimian:sqlite:phenio-monarch.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file hp_terms.txt \
--min-ancestor-information-content 4.0 \
--min-jaccard-similarity 0 \
--autolabel \
-O csv \
-o phenio-monarch-hp-hp.0.semsimian.tsv

I tried running these experiments locally (on machines with 32 GB and 64 GB of RAM) and on an HPC cluster, where the output-writing step ran for more than a week before the job was killed.

@caufieldjh
Member

One week, oof!
I have previously been able to complete a very similar PHENIO HP vs MP comparison on a GCloud instance with < 64 GB memory, though it did consume a lot of that memory.
What kind of resource usage do you see with higher thresholds, such as a minimum ancestor information content (AIC) of 10 and a minimum Jaccard similarity of 0.4?
The labeling can also consume a surprisingly large amount of resources and is highly redundant for this sort of comparison, so I'd suggest dropping that parameter and mapping CURIEs to labels after the comparison is complete.
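
One way to do that post-hoc mapping is sketched below with only the Python standard library. It rests on several assumptions that should be checked against a real output file: that the similarity table is tab-delimited, that it has `subject_id` and `object_id` columns, and that a two-column CURIE/label table is available from some other source. The function names `add_labels` and `label_file` are hypothetical helpers, not part of oaklib or semsimian.

```python
import csv
import io

# Assumed column names; check them against the header of your own output file.
COLUMN_PAIRS = {"subject_id": "subject_label", "object_id": "object_label"}


def add_labels(rows, labels, column_pairs):
    """Yield copies of rows with label columns filled from a CURIE -> label dict."""
    for row in rows:
        labelled = dict(row)
        for id_col, label_col in column_pairs.items():
            # Unknown CURIEs get an empty label rather than raising.
            labelled[label_col] = labels.get(row.get(id_col, ""), "")
        yield labelled


def label_file(similarity_in, labels_in, out, delimiter="\t"):
    """Stream a similarity table through add_labels, one row at a time."""
    labels = {}
    for row in csv.reader(labels_in, delimiter=delimiter):
        if len(row) >= 2:  # skip blank or malformed lines
            labels[row[0]] = row[1]
    reader = csv.DictReader(similarity_in, delimiter=delimiter)
    fieldnames = list(reader.fieldnames or [])
    for label_col in COLUMN_PAIRS.values():
        if label_col not in fieldnames:
            fieldnames.append(label_col)
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(add_labels(reader, labels, COLUMN_PAIRS))


if __name__ == "__main__":
    # Toy data with made-up CURIEs and labels, purely for illustration.
    sims = io.StringIO("subject_id\tobject_id\tjaccard_similarity\nEX:1\tEX:2\t0.5\n")
    labs = io.StringIO("EX:1\tfirst term\nEX:2\tsecond term\n")
    out = io.StringIO()
    label_file(sims, labs, out)
    print(out.getvalue())
```

Because each label is looked up once per row from an in-memory dict, only the label table has to fit in memory; the similarity rows are streamed.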

@matentzn
Member

@souzadevinicius let's try a 0.4 Jaccard threshold and removing the labelling options, and see if that makes it at least possible to run HP-ZP

@souzadevinicius
Member Author

Ok


@cmungall
Member

cmungall commented Feb 1, 2024

what's the status of this?

@justaddcoffee
Member

Discussing in the MWF hackathon now

We were thinking we would deploy semsimian/oak on our build server and run on a regular cadence. This way we have an objective measure of how much memory/time we are talking about here, and we can also emit a new artifact with a PURL so people can use this downstream.
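
As a minimal sketch of how such a measurement could be taken, the helper below runs a command and reports its wall-clock time and the peak resident memory of child processes, using only the Python standard library. It assumes a Unix-like system (the `resource` module is not available on Windows), and `measure` is a hypothetical helper name rather than anything provided by the build server or by oaklib.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run command to completion; return (return code, wall seconds, peak child RSS)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the largest peak among all waited-for children so far;
    # it is reported in kilobytes on Linux and in bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, elapsed, peak_rss


if __name__ == "__main__":
    # Stand-in command; replace with the argument list of the real job.
    code, seconds, peak = measure([sys.executable, "-c", "print('hello')"])
    print(f"exit={code} wall={seconds:.2f}s peak_rss={peak}")
```

To time an actual run, the `runoak` invocation can be passed as a list of arguments in place of the stand-in command. Since `RUSAGE_CHILDREN` accumulates over every child the process has waited for, it is simplest to measure one command per Python process.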

@caufieldjh perhaps we already have a repo to do this?

@justaddcoffee
Member

Ah okay, Harry has already made a repo for this here

@hrshdhgd
Collaborator

hrshdhgd commented Feb 2, 2024

Sorry I'm a little late to the party, but @souzadevinicius, did you run this without --autolabel, or did you specify --no-autolabel? Just to get an idea of how fast it'll be.

@justaddcoffee
Member

justaddcoffee commented Feb 2, 2024

The last build in Aug '23 took 1h and 18m.


Yep good question @hrshdhgd

Harry says a previous build with autolabel turned on took 15h, so this might be at least one thing that is slowing down Vinicius's run.

@caufieldjh
Member

Note that the Jenkins build performed by that repo takes a bit over 1 hr without autolabel and ~15 hrs w/ autolabel.

@caufieldjh
Member

For reasons not entirely clear to me, this build took 3 hours. Here's the command:

runoak -i semsimian:sqlite:obo:phenio similarity --no-autolabel -p i \
--set1-file HPO_terms.txt \
--set2-file MP_terms.txt \
-O csv \
-o HP_vs_MP_semsimian.tsv \
--min-ancestor-information-content 4.0

That's with:

semsimian-0.2.11
oaklib-0.5.25

The product:
http://kg-hub-public-data.s3.amazonaws.com/monarch/HP_vs_MP_semsimian.tsv.tar.gz
