Semsimian calculation taking huge amount of time and resources #115

souzadevinicius · 2023-12-19T14:15:26Z

I'm trying to calculate semantic similarity profiles using the Phenio ontology, comparing the following pairs of term sets:

HPxHP
HPxMP
HPxZP

HP term set: 17097 entries
MP term set: 13809 entries
ZP term set: 39373 entries

Ontology used: Phenio
Library versions

semsimian                  0.2.11
oaklib                     0.5.24

Command-line execution example:

runoak --stacktrace -vvv  -i semsimian:sqlite:phenio-monarch.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file hp_terms.txt \
--min-ancestor-information-content 4.0 \
--min-jaccard-similarity 0 \
--autolabel \
-O csv \
-o phenio-monarch-hp-hp.0.semsimian.tsv

I tried to run these experiments locally (on machines with 32 GB and 64 GB of RAM) and on an HPC cluster, where the process writing the output ran for more than one week and was then killed.
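
For a sense of scale, the size of each comparison can be estimated from the term set sizes listed above. The sketch below simply multiplies the set sizes. It assumes every term in set 1 is paired with every term in set 2 before any threshold is applied, which is an assumption about the search and not something confirmed in this thread; the function names are made up for illustration.

```python
# Term set sizes as reported in this issue.
TERM_SET_SIZES = {"HP": 17097, "MP": 13809, "ZP": 39373}


def pair_count(set1: str, set2: str) -> int:
    """Number of term pairs in an all-by-all comparison.

    Assumes no pairs are skipped before scoring (an assumption, see above).
    """
    return TERM_SET_SIZES[set1] * TERM_SET_SIZES[set2]


def pair_counts_for_hp() -> dict:
    """Pair counts for the three comparisons named in the issue."""
    return {f"HPx{other}": pair_count("HP", other) for other in ("HP", "MP", "ZP")}


for name, count in pair_counts_for_hp().items():
    print(f"{name}: {count:,} candidate pairs")
```

Even the smallest of the three comparisons involves hundreds of millions of candidate pairs, so anything done per output row (such as label lookup) is multiplied accordingly.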


caufieldjh · 2023-12-19T21:04:37Z

One week, oof!
I have previously been able to complete a very similar PHENIO HP vs MP on a GCloud instance with < 64 GB memory, though it did consume a lot of that resource.
What kind of resource usage do you see with higher thresholds, like a min of 10 for AIC and 0.4 for Jaccard?
The labeling can also consume a surprisingly large amount of resources and is very redundant for this sort of comparison, so I'd suggest dropping that parameter and mapping CURIEs to labels after the comparison is complete.
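
A minimal sketch of mapping CURIEs to labels after the comparison, using only the Python standard library. Several things here are assumptions rather than facts from this thread: the column names `subject_id`, `object_id`, `subject_label`, and `object_label`; the tab delimiter; and the idea that labels are available as a simple CURIE-to-label dictionary. Check the header of the actual output file before relying on any of them.

```python
import csv
import io

# Assumed (id column, label column) pairs; adjust to the real output header.
ID_TO_LABEL_COLUMNS = (
    ("subject_id", "subject_label"),
    ("object_id", "object_label"),
)


def label_similarity_table(src, dst, labels, delimiter="\t"):
    """Copy a similarity table from src to dst, filling in label columns.

    Rows are streamed one at a time, so the full table never sits in memory.
    """
    reader = csv.DictReader(src, delimiter=delimiter)
    fieldnames = list(reader.fieldnames or [])
    # Add a label column right after each id column if it is missing.
    for id_col, label_col in ID_TO_LABEL_COLUMNS:
        if id_col in fieldnames and label_col not in fieldnames:
            fieldnames.insert(fieldnames.index(id_col) + 1, label_col)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter=delimiter,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for id_col, label_col in ID_TO_LABEL_COLUMNS:
            if id_col in row:
                # Unknown CURIEs get an empty label instead of raising.
                row[label_col] = labels.get(row[id_col], "")
        writer.writerow(row)


# Tiny in-memory example with illustrative data.
example_labels = {
    "HP:0000118": "Phenotypic abnormality",
    "MP:0000001": "mammalian phenotype",
}
example_src = io.StringIO(
    "subject_id\tobject_id\tjaccard_similarity\n"
    "HP:0000118\tMP:0000001\t0.5\n"
)
example_dst = io.StringIO()
label_similarity_table(example_src, example_dst, example_labels)
print(example_dst.getvalue())
```

Because each distinct CURIE is looked up in a dictionary, the cost of labeling stays small even when the same term appears in millions of rows.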

matentzn · 2023-12-21T16:30:37Z

@souzadevinicius let's try a 0.4 Jaccard threshold and remove the labelling option, and see if that makes it at least possible to run HP-ZP.

souzadevinicius · 2024-01-02T10:09:19Z

Ok


cmungall · 2024-02-01T03:57:11Z

what's the status of this?

justaddcoffee · 2024-02-02T16:17:10Z

Discussing in the MWF hackathon now

We were thinking we would deploy semsimian/oak on our build server and run on a regular cadence. This way we have an objective measure of how much memory/time we are talking about here, and we can also emit a new artifact with a PURL so people can use this downstream.

@caufieldjh perhaps we already have a repo to do this?
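
One way to get such an objective measure is to wrap the run in a small script that records wall-clock time and peak memory. The sketch below is an illustration only, not the build server's actual setup. It reuses the `runoak` options that appear in the commands quoted in this thread; the helper names `build_similarity_command` and `run_and_measure` are hypothetical, and the `resource` module it relies on is only available on Unix-like systems.

```python
import resource
import subprocess
import sys
import time


def build_similarity_command(db, set1_file, set2_file, output,
                             min_aic=4.0, autolabel=False):
    """Assemble a runoak similarity command from options seen in this thread."""
    return [
        "runoak", "-i", f"semsimian:sqlite:{db}", "similarity", "-p", "i",
        "--set1-file", set1_file,
        "--set2-file", set2_file,
        "--min-ancestor-information-content", str(min_aic),
        "--autolabel" if autolabel else "--no-autolabel",
        "-O", "csv",
        "-o", output,
    ]


def run_and_measure(cmd):
    """Run cmd and report its exit code, wall time, and peak child memory.

    ru_maxrss is the peak resident set size over all waited-for children:
    kilobytes on Linux, bytes on macOS.
    """
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    wall_seconds = time.perf_counter() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall_seconds,
        "max_rss": usage.ru_maxrss,
    }


# Harmless demonstration: measure a trivial Python process instead of runoak.
print(run_and_measure([sys.executable, "-c", "pass"]))
```

To measure a real run, pass the result of `build_similarity_command("obo:phenio", "HPO_terms.txt", "MP_terms.txt", "HP_vs_MP_semsimian.tsv")` to `run_and_measure`, with the input files in place.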

justaddcoffee · 2024-02-02T16:18:51Z

Ah okay, Harry has already made a repo for this here

hrshdhgd · 2024-02-02T16:24:34Z

Sorry I'm a little late to the party, but @souzadevinicius, did you run this without --autolabel, or with --no-autolabel specified? Just to get an idea of how fast it'll be.

justaddcoffee · 2024-02-02T16:27:11Z

The last build in Aug '23 took 1h and 18m.


Yep good question @hrshdhgd

Harry says a previous build with auto-label turned on took 15h so this might be at least one thing that is slowing down Vinicius's run

caufieldjh · 2024-02-02T16:27:11Z

Note that the Jenkins build performed by that repo takes a bit over 1 hr without autolabel and ~15 hrs w/ autolabel.

caufieldjh · 2024-02-02T20:06:14Z

For reasons not entirely clear to me, this build took 3 hours. Here's the command:

runoak -i semsimian:sqlite:obo:phenio similarity --no-autolabel -p i --set1-file HPO_terms.txt --set2-file MP_terms.txt -O csv -o HP_vs_MP_semsimian.tsv --min-ancestor-information-content 4.0

That's with:

semsimian-0.2.11
oaklib-0.5.25

The product:
http://kg-hub-public-data.s3.amazonaws.com/monarch/HP_vs_MP_semsimian.tsv.tar.gz
