Break up/recombine larger genomes #144

joshfactorial · 2025-03-31T16:36:33Z

NEAT parallelism should eventually make this unnecessary, but as a stopgap that is easier to implement, we can offer a utility that breaks up large genomes and reassembles the results at the end. These could be add-ons that run alongside NEAT for now, and later be integrated or superseded by multi-threading, if possible.

We'd need two scripts:

Splitting script: breaks the genome up by chromosome, or into large chunks, and writes each piece to its own FASTA file. The user would supply the run configuration file and how to split it (by chromosome or by size, and if by size, what size; 712kb default or something). The program would then produce a folder with a set of input files and a matching set of configuration files. Each file would get a unique name, be a valid FASTA file, and have a name that can be mapped back to the original chromosome name field. This may require a guidance document/index of some kind. For the FASTA, it would need to create some overlap segments in each file, so that reads don't have hard boundaries.
Perhaps another utility that can scan the run folder above and start an instance of NEAT for each configuration? This would be tricky to manage without knowing what cluster it will run on. Maybe better left to the user.
Stitching script: joins the FASTQ/VCF/BAM output from the split files back together. Note that a script like this existed somewhere in NEAT 2.0. Basically, it would use the guidance document produced by the splitting script, or just the order of the files, to stitch together all the output files. In the end we would want one master FASTQ, one master golden BAM, and one master golden VCF containing all the output from the previous steps.
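
To make the naming and overlap idea concrete, here is a rough sketch of how chunk boundaries could be planned and how a chunk name could be mapped back to the original chromosome during stitching. Everything in it is an assumption for illustration, not existing NEAT code: the `Chunk` class, the `plan_chunks`, `parse_chunk_name`, and `to_original_position` helpers, and the `<chrom>__<start>_<end>` naming scheme are all hypothetical. It does not write FASTA files or decide what to do with reads that fall in an overlap; that would still need to be worked out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One piece of a contig (hypothetical structure, not part of NEAT)."""
    chrom: str
    start: int  # 0-based inclusive offset into the original contig
    end: int    # 0-based exclusive offset into the original contig

    @property
    def name(self) -> str:
        # Assumed naming scheme: encodes the original name and offsets,
        # so the stitching step can recover them without an index file.
        return f"{self.chrom}__{self.start}_{self.end}"


def plan_chunks(chrom: str, length: int, chunk_size: int, overlap: int) -> list[Chunk]:
    """Cover [0, length) with chunks of chunk_size that overlap by `overlap` bases."""
    if overlap < 0 or chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than a non-negative overlap")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, length)
        chunks.append(Chunk(chrom, start, end))
        if end >= length:
            break
        start += step
    return chunks


def parse_chunk_name(name: str) -> Chunk:
    """Invert Chunk.name; splits on the last '__' so chrom may contain '_'."""
    chrom, _, span = name.rpartition("__")
    start, end = span.split("_")
    return Chunk(chrom, int(start), int(end))


def to_original_position(chunk_name: str, local_pos: int) -> tuple[str, int]:
    """Shift a 1-based position within a chunk (e.g. a VCF POS) back to
    a 1-based position on the original chromosome."""
    chunk = parse_chunk_name(chunk_name)
    return chunk.chrom, chunk.start + local_pos


if __name__ == "__main__":
    for chunk in plan_chunks("chr1", 2500, chunk_size=1000, overlap=100):
        print(chunk.name)
    print(to_original_position("chr1__900_1900", 1))
```

If the chunk names carry the offsets like this, the guidance document becomes optional for coordinate translation, though an index would probably still be useful for preserving the original chromosome order.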

joshfactorial self-assigned this Mar 31, 2025

joshfactorial added the enhancement and help wanted labels Mar 31, 2025
