Break up/recombine larger genomes #144

Open
joshfactorial opened this issue Mar 31, 2025 · 0 comments
Assignees: joshfactorial
Labels: enhancement, help wanted

Comments

@joshfactorial
Collaborator

NEAT parallelism should eventually make this unnecessary, but as a stopgap that is easier to implement, we can offer a utility that breaks up large genomes and reassembles the results at the end. These could be add-ons that run beside NEAT for now, then be integrated later or superseded by multi-threading, if possible.

We'd need two scripts (splitting and stitching), and perhaps an optional launcher utility in between:

  1. Splitting script: breaks the genome up by chromosome, or into large fixed-size chunks, writing each piece to its own FASTA file. The user would supply the run configuration file and choose how to split (by chromosome or by size; if by size, what size, with a default of 712 kb or something similar). The program would then produce a folder containing a set of input files and a matching set of configuration files. Each file would get a unique name, be a valid FASTA file, and carry a name from which the original chromosome name field can be reconstructed. This may require a guidance document or index of some kind. The FASTA files would also need overlapping segments between adjacent chunks, so that reads don't have hard boundaries.
  2. Launcher utility (optional): perhaps another utility that scans the run folder produced above and starts an instance of NEAT for each input/configuration pair. This would be tricky to manage without knowing what cluster it will run on, so it may be better left to the user.
  3. Stitching script: joins the FASTQ/VCF/BAM output from the split files back together. Note that a script like this existed somewhere in NEAT 2.0. It would use the guidance document written by the splitting script, or simply the order of the files, to stitch all the output files together. In the end we want one master FASTQ, one master golden BAM, and one master golden VCF containing all the output from the previous steps.
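
To make the splitting and stitching ideas above a bit more concrete, here is a rough Python sketch of the by-size case. Nothing here is existing NEAT code: the `Chunk` class, the `chrom__start_end` naming scheme, and the helpers `split_sequence`, `parse_chunk_name`, `to_fasta`, and `lift_position` are all hypothetical names made up for illustration. The idea is that if each chunk's name encodes its original chromosome and offset, the name itself can act as a minimal index, and the stitching step only has to add the offset back to each chunk-local position.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Chunk:
    chrom: str  # original chromosome name
    start: int  # 0-based start offset within the original chromosome
    end: int    # 0-based exclusive end offset
    seq: str

    @property
    def name(self) -> str:
        # Hypothetical naming scheme: "<chrom>__<start>_<end>".
        return f"{self.chrom}__{self.start}_{self.end}"


def split_sequence(chrom: str, seq: str, chunk_size: int, overlap: int) -> Iterator[Chunk]:
    """Yield chunks of at most chunk_size bases; neighbours share `overlap` bases."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not seq:
        return
    step = chunk_size - overlap
    start = 0
    while True:
        end = min(start + chunk_size, len(seq))
        yield Chunk(chrom, start, end, seq[start:end])
        if end == len(seq):
            break
        start += step


def parse_chunk_name(name: str) -> tuple[str, int, int]:
    """Recover (chrom, start, end) from a name produced by Chunk.name."""
    # rpartition splits on the LAST "__", so underscores in chrom are preserved.
    chrom, _, span = name.rpartition("__")
    start, end = span.split("_")
    return chrom, int(start), int(end)


def to_fasta(chunk: Chunk, width: int = 60) -> str:
    """Render one chunk as a FASTA record with wrapped sequence lines."""
    lines = [chunk.seq[i:i + width] for i in range(0, len(chunk.seq), width)]
    return f">{chunk.name}\n" + "\n".join(lines) + "\n"


def lift_position(chunk_name: str, pos: int) -> tuple[str, int]:
    """Map a chunk-local position back to original chromosome coordinates.

    Adding the 0-based chunk start works for both 0-based and 1-based
    positions, as long as input and output use the same convention.
    """
    chrom, start, _ = parse_chunk_name(chunk_name)
    return chrom, start + pos


if __name__ == "__main__":
    for chunk in split_sequence("chr1", "ACGTACGTAC", chunk_size=4, overlap=1):
        print(to_fasta(chunk), end="")
    print(lift_position("chr1__3_7", 2))
```

This deliberately leaves open the harder questions: how to avoid double-counting reads and variants that fall in the overlap regions when stitching, and how the per-chunk configuration files would be generated.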
joshfactorial self-assigned this Mar 31, 2025
joshfactorial added the enhancement and help wanted labels Mar 31, 2025