find_chunks in _load_seq.py does not end a chunk early at the end of a supercontig #43

Open

EricR86 opened this issue Nov 15, 2017 · 6 comments

Labels

bug major

Member

EricR86 commented Nov 15, 2017

Original report (archived issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

Currently, Genomedata does not index runs of missing data longer than MIN_GAP_LEN.

However, if the end of a supercontig consists entirely of NaNs, that data is indexed regardless of its length. In the extreme case, a supercontig could start with a single data point followed only by NaNs, and the chunk start and end would span the entire region even if the region were far longer than MIN_GAP_LEN.

As a result, Genomedata reports large empty regions at the beginning and end of supercontigs when the "chunk_starts/ends" attributes are used.

Member Author

EricR86 commented Nov 15, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

Edited issue description

Member Author

EricR86 commented Nov 15, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

changed priority from "minor" to "major"

Member Author

EricR86 commented Nov 15, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

Edited issue description

Member Author

EricR86 commented Nov 15, 2017

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).

What do you mean, it "does not index" this data?

Member Author

EricR86 commented Nov 15, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

The chunk_starts and chunk_ends genomedata/hdf5 attributes are not updated. These attributes are updated only when a gap longer than MIN_GAP_LEN is found. No "gaps" are detected at the beginning or end of a supercontig because Genomedata only looks for gaps between existing data points.
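
To make the behaviour described above concrete, here is a minimal sketch. It is not the actual find_chunks code from _load_seq.py: the function name naive_find_chunks, the tiny MIN_GAP_LEN value, and the way the outer boundaries are pinned to the supercontig edges are all assumptions chosen to reproduce the reported symptom.

```python
import numpy as np

MIN_GAP_LEN = 5  # hypothetical small value, for illustration only


def naive_find_chunks(data, min_gap_len=MIN_GAP_LEN):
    """Split only on NaN runs that lie between two data points.

    Illustrative stand-in, not Genomedata's actual find_chunks.
    """
    present = np.flatnonzero(~np.isnan(data))
    if present.size == 0:
        return [], []
    # A gap is measured only between consecutive positions that hold data
    gap_lens = np.diff(present) - 1
    split = np.flatnonzero(gap_lens > min_gap_len)
    # Assumption: outer boundaries are pinned to the supercontig edges
    starts = [0] + [int(present[i + 1]) for i in split]
    ends = [int(present[i]) + 1 for i in split] + [len(data)]
    return starts, ends


# Extreme case from the report: one data point followed only by NaNs
data = np.full(1000, np.nan)
data[0] = 1.0
starts, ends = naive_find_chunks(data)
print(starts, ends)
```

In this sketch the trailing run of 999 NaNs is never compared against MIN_GAP_LEN, so the single chunk covers the whole supercontig.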

Member Author

EricR86 commented Nov 16, 2017

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

After a discussion, the following solution was proposed:

At telomeric regions, chunk boundaries should start/end at the first/last occurrence of data.
Between supercontigs, chunks should be separated by a gap if the distance between the supercontigs is greater than MIN_GAP_LEN.
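
The first rule of the proposal can be sketched as follows. This is only an illustration under stated assumptions, not the fix that was eventually implemented: trimmed_find_chunks and the small MIN_GAP_LEN value are hypothetical, the input is a single one-dimensional array, and the second rule (gaps between supercontigs) is not covered.

```python
import numpy as np

MIN_GAP_LEN = 5  # hypothetical small value, for illustration only


def trimmed_find_chunks(data, min_gap_len=MIN_GAP_LEN):
    """Return chunk starts/ends trimmed to the first and last data point.

    Hypothetical helper; half-open [start, end) coordinates are assumed.
    """
    present = np.flatnonzero(~np.isnan(data))
    if present.size == 0:
        return [], []  # nothing to index
    # Interior gaps longer than min_gap_len still split chunks
    gap_lens = np.diff(present) - 1
    split = np.flatnonzero(gap_lens > min_gap_len)
    # Outer boundaries follow the data instead of the region edges
    starts = [int(present[0])] + [int(present[i + 1]) for i in split]
    ends = [int(present[i]) + 1 for i in split] + [int(present[-1]) + 1]
    return starts, ends


# One data point followed only by NaNs: the chunk shrinks to that point
lonely = np.full(1000, np.nan)
lonely[0] = 1.0
print(trimmed_find_chunks(lonely))

# Leading and trailing NaN runs are excluded; the long interior gap splits
two_blocks = np.full(1000, np.nan)
two_blocks[10:20] = 1.0
two_blocks[500:510] = 2.0
print(trimmed_find_chunks(two_blocks))
```

With this trimming, leading and trailing NaN runs never end up inside a chunk, whatever their length, while interior gaps are still compared against MIN_GAP_LEN.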

EricR86 added major bug labels
