find_chunks in _load_seq.py does not end a chunk early at the end of a supercontig #43

Open
EricR86 opened this issue Nov 15, 2017 · 6 comments
Labels: bug, major

Comments

@EricR86
Member

EricR86 commented Nov 15, 2017

Original report (archived issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Currently, Genomedata does not index runs of missing data longer than MIN_GAP_LEN.

However, if a supercontig ends in a run of NaNs, that run is indexed regardless of its length. In the extreme case, a supercontig could start with a single data point followed only by NaNs, and the chunk start and end would still span the entire region, even if the region is far longer than MIN_GAP_LEN.

As a result, Genomedata reports large empty regions at the beginning and end of supercontigs when the chunk_starts and chunk_ends attributes are used.

@EricR86
Member Author

EricR86 commented Nov 15, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


  • Edited issue description

@EricR86
Member Author

EricR86 commented Nov 15, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


  • changed priority from "minor" to "major"

@EricR86
Member Author

EricR86 commented Nov 15, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


  • Edited issue description

@EricR86
Member Author

EricR86 commented Nov 15, 2017

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


What do you mean, it "does not index" this data?

@EricR86
Member Author

EricR86 commented Nov 15, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


The chunk_starts and chunk_ends Genomedata/HDF5 attributes are not updated. These attributes are only updated when a gap longer than MIN_GAP_LEN is found. No "gaps" are detected at the beginning or end of a supercontig, because Genomedata only looks for gaps between existing data points.
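
The behaviour described above can be illustrated with a simplified sketch. This is not the actual find_chunks code from _load_seq.py: `naive_find_chunks` is a hypothetical helper, the small `MIN_GAP_LEN` value is assumed purely for illustration, and chunks are expressed as half-open `[start, end)` index ranges over a plain list of floats.

```python
import math

MIN_GAP_LEN = 3  # assumed value, for illustration only


def naive_find_chunks(values, min_gap_len=MIN_GAP_LEN):
    """Hypothetical sketch: split only at NaN runs lying between two data points."""
    # Positions that hold real data (not NaN).
    present = [i for i, v in enumerate(values) if not math.isnan(v)]

    # The first chunk always starts at the start of the supercontig.
    starts, ends = [0], []
    for prev, cur in zip(present, present[1:]):
        gap_len = cur - prev - 1
        if gap_len > min_gap_len:
            ends.append(prev + 1)  # close the chunk just after the last data point
            starts.append(cur)     # open a new chunk at the next data point
    # The last chunk always ends at the end of the supercontig,
    # so leading and trailing NaN runs are never excluded.
    ends.append(len(values))
    return starts, ends


nan = float("nan")
# One data point followed by ten NaNs: the chunk still covers everything.
extreme_case = naive_find_chunks([1.0] + [nan] * 10)  # ([0], [11])
```

Because gaps are only measured between pairs of data points, the trailing run of ten NaNs in the example is never compared against `MIN_GAP_LEN`.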

@EricR86
Member Author

EricR86 commented Nov 16, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


After a discussion, the following solution was proposed:

  • At telomeric regions, the chunk boundaries should start/end at first/last occurrence of data.
  • Between supercontigs, chunks should have gaps if the length between supercontigs is greater than MIN_GAP_LEN.
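
A minimal sketch of the first point of this proposal, under stated assumptions: `trimmed_find_chunks` is a hypothetical name, `MIN_GAP_LEN` is set to a small assumed value, and the input is a plain list of floats treated as one contiguous coordinate range. It does not model supercontig boundaries, so the second point is only covered to the extent that the whole range is passed in as a single list. Chunks are half-open `[start, end)` index ranges.

```python
import math

MIN_GAP_LEN = 3  # assumed value, for illustration only


def trimmed_find_chunks(values, min_gap_len=MIN_GAP_LEN):
    """Hypothetical sketch: chunks begin at the first and end at the last data point."""
    # Positions that hold real data (not NaN).
    present = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not present:
        return [], []  # no data at all, so no chunks

    starts, ends = [present[0]], []  # start at the first occurrence of data
    for prev, cur in zip(present, present[1:]):
        if cur - prev - 1 > min_gap_len:
            ends.append(prev + 1)
            starts.append(cur)
    ends.append(present[-1] + 1)  # end just after the last occurrence of data
    return starts, ends


nan = float("nan")
# Leading and trailing NaN runs are no longer part of any chunk.
trimmed = trimmed_find_chunks([nan, nan, 1.0] + [nan] * 10)  # ([2], [3])
```

Interior runs of NaNs are handled as before; only the outer boundaries change, so a range containing no data yields no chunks at all.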

EricR86 added the major and bug labels on Apr 16, 2020