@@ -57,14 +57,14 @@ The below is an example of output figures of wheat (ABD, 1n=3x=21):
5757![wheat](example_data/wheat_figures.png)
5858**Figure. Phased subgenomes of allohexaploid bread wheat genome.** Colors are unified with each subgenome in subplots `B-F`, i.e. the same color means the same subgenome.
5959* (**A**) The histogram of differential k-mers among homoeologous chromosome sets.
60- * (**B**) Heatmap and clustering of differential k-mers. The x-axis, k-mers; y-axis, chromosomes.
60+ * (**B**) Heatmap and clustering of differential k-mers. The x-axis, differential k-mers; y-axis, chromosomes. The vertical color bar, each chromosome is assigned to which subgenome; the horizontal color bar, each k-mer is specific to which subgenome (blank for non-specific kmers).
6161* (**C**) Principal component analysis (PCA) of differential k-mers.
62- * (**D**) Chromosomal characteristics. Rings from outer to inner:
63- - (**1**) Karyotypes of subgenome assignments by a k-Means algorithm.
64- - (**2**) Significant enrichment of subgenome-specific k-mers.
62+ * (**D**) Chromosomal characteristics (window size: 1 Mb). Rings from outer to inner:
63+ - (**1**) Subgenome assignments by a k-Means algorithm.
64+ - (**2**) Significant enrichment of subgenome-specific k-mers (blank for non-enriched windows).
6565 - (**3**) Normalized proportion of subgenome-specific k-mers.
66- - (**4-6**) Density distribution of each subgenome-specific k-mer set.
67- - (**7**) Density distribution of subgenome-specific LTR-RTs and other LTR-RTs (the most outer, in grey color).
66+ - (**4-6**) Density distribution (count) of each subgenome-specific k-mer set.
67+ - (**7**) Density distribution (count) of subgenome-specific LTR-RTs and other LTR-RTs (the most outer, in grey color).
6868 - (**8**) Homoeologous blocks of each homoeologous chromosome set.
6969* (**E**) Insertion time of subgenome-specific LTR-RTs.
7070* (**F**) A phylogenetic tree of 1,000 randomly subsampled LTR/Gypsy elements.
@@ -137,6 +137,7 @@ phase-results/
137137├── k15_q200_f2.chrom-subgenome.tsv # subgenome assignments and bootstrap values
138138├── k15_q200_f2.sig.kmer-subgenome.tsv # subgenome-specific kmers
139139├── k15_q200_f2.bin.enrich # subgenome-specific enrichments by genome window/bin
140+ ├── k15_q200_f2.bin.group # grouped bins by potential exchanges based on enrichments
140141├── k15_q200_f2.ltr.enrich # subgenome-specific LTR-RTs
141142├── k15_q200_f2.ltr.insert.pdf # density plot of insertion age of subgenome-specific LTR-RTs
142143├── k15_q200_f2.ltr.insert.R # R script for the density plot
@@ -164,28 +165,32 @@ tmp/
164165```
165166usage: subphaser [-h] -i GENOME [GENOME ...] -c CFGFILE [CFGFILE ...]
166167 [-labels LABEL [LABEL ...]] [-no_label]
167- [-target FILE] [-sep STR]
168+ [-target FILE] [-sg_assigned FILE] [-sep STR]
168169 [-custom_features FASTA [FASTA ...]] [-pre STR]
169170 [-o DIR] [-tmpdir DIR] [-k INT] [-f FLOAT] [-q INT]
170171 [-baseline BASELINE] [-lower_count INT]
171172 [-min_prop FLOAT] [-max_freq INT] [-max_prop FLOAT]
172173 [-low_mem] [-by_count] [-re_filter] [-nsg INT]
173174 [-replicates INT] [-jackknife FLOAT]
174- [-max_pval FLOAT] [-figfmt {pdf,png}]
175+ [-max_pval FLOAT]
176+ [-test_method {ttest_ind,kruskal,wilcoxon,mannwhitneyu}]
177+ [-figfmt {pdf,png}]
175178 [-heatmap_colors COLOR [COLOR ...]]
176- [-heatmap_options STR] [-disable_ltr]
179+ [-heatmap_options STR] [-just_core] [-disable_ltr]
177180 [-ltr_detectors {ltr_finder,ltr_harvest} [{ltr_finder,ltr_harvest} ...]]
178181 [-ltr_finder_options STR] [-ltr_harvest_options STR]
179182 [-tesorter_options STR] [-all_ltr] [-intact_ltr]
180- [-shared_ltr] [-mu FLOAT] [-disable_ltrtree]
181- [-subsample INT]
183+ [-exclude_exchanges] [-shared_ltr] [-mu FLOAT]
184+ [-disable_ltrtree] [-subsample INT]
182185 [-ltr_domains {GAG,PROT,INT,RT,RH,AP,RNaseH} [{GAG,PROT,INT,RT,RH,AP,RNaseH} ...]]
183186 [-trimal_options STR]
184187 [-tree_method {iqtree,FastTree}] [-tree_options STR]
185188 [-ggtree_options STR] [-disable_circos]
186189 [-window_size INT] [-disable_blocks] [-aligner PROG]
187- [-aligner_options STR] [-min_block INT] [-p INT]
188- [-max_memory MEM] [-cleanup] [-overwrite] [-v]
190+ [-aligner_options STR] [-min_block INT]
191+ [-alt_cfgs CFGFILE [CFGFILE ...]] [-chr_ordered FILE]
192+ [-p INT] [-max_memory MEM] [-cleanup] [-overwrite]
193+ [-v]
189194
190195Phase and visualize subgenomes of an allopolyploid or hybrid based on the repetitive kmers.
191196
@@ -198,19 +203,22 @@ Input:
198203 -i GENOME [GENOME ...], -genomes GENOME [GENOME ...]
199204 Input genome sequences in fasta format [required]
200205 -c CFGFILE [CFGFILE ...], -sg_cfgs CFGFILE [CFGFILE ...]
201- Subgenomes config file (one homoeologous group per
206+ Subgenomes config file (one homologous group per
202207 line); this chromosome set is for identifying
203208 differential kmers [required]
204209 -labels LABEL [LABEL ...]
205210 For multiple genomes, provide prefix labels for each
206211 genome sequence to avoid conflicts among chromosome id
207212 [default: '1-, 2-, ..., n-']
208213 -no_label Do not use default prefix labels for genome sequences
209- as there is no conflict among chromosome id [default:
210- False]
214+ as there is no conflict among chromosome id
215+ [default= False]
211216 -target FILE Target chromosomes to output; id mapping is allowed;
212217 this chromosome set is for cluster and phase [default:
213218 the same chromosome set as `-sg_cfgs`]
219+ -sg_assigned FILE Provide subgenome assignments to skip k-means
220+ clustering and to identify subgenome-specific features
221+ [default=None]
214222 -sep STR Separator for chromosome ID [default="|"]
215223 -custom_features FASTA [FASTA ...]
216224 Custom features in fasta format to enrich subgenome-
@@ -243,7 +251,7 @@ Kmer:
243251 [default=None]
244252 -low_mem Low MEMory but slower [default: True if genome size >
245253 3G, else False]
246- -by_count Calculate fold by count instead of by propor
254+ -by_count Calculate fold by count instead of by proportion
247255 [default=False]
248256 -re_filter Re-filter with subset of chromosomes (subgenome
249257 assignments are expected to change) [default=False]
@@ -253,68 +261,73 @@ Cluster:
253261
254262 -nsg INT Number of subgenomes (>1) [default: auto]
255263 -replicates INT Number of replicates for bootstrap [default=1000]
256- -jackknife FLOAT Percent of kmers to resample for bootstrap
264+ -jackknife FLOAT Percent of kmers to resample for each bootstrap
257265 [default=50]
258266 -max_pval FLOAT Maximum P value for all hypothesis tests
259267 [default=0.05]
268+ -test_method {ttest_ind,kruskal,wilcoxon,mannwhitneyu}
269+ The test method to identify differential
270+ kmers [default=ttest_ind]
260271 -figfmt {pdf,png} Format of figures [default=pdf]
261272 -heatmap_colors COLOR [COLOR ...]
262- Color panel (2 or 3 colors) for heatmap plot
263- [default= ('green', 'black', 'red')]
273+ Color panel (2 or 3 colors) for heatmap plot [default:
274+ ('green', 'black', 'red')]
264275 -heatmap_options STR Options for heatmap plot (see more in R shell with
265276 `?heatmap.2` of `gplots` package) [default="Rowv=T,Col
266277 v=T,scale='col',dendrogram='row',labCol=F,trace='none'
267278 ,key=T,key.title=NA,density.info='density',main=NA,xla
268- b=NA,margins=c(4,8)"]
279+ b='Differential kmers',margins=c(2.5,12)"]
280+ -just_core Exit after the core phasing module
281+ [default=False]
269282
270283LTR:
271284 Options for LTR analyses
272285
273286 -disable_ltr Disable this step (this step is time-consuming for
274287 large genome) [default=False]
275288 -ltr_detectors {ltr_finder,ltr_harvest} [{ltr_finder,ltr_harvest} ...]
276- Programs to detect LTR-RTs [default=['ltr_harvest',
277- 'ltr_finder']]
289+ Programs to detect LTR-RTs [default=['ltr_harvest']]
278290 -ltr_finder_options STR
279291 Options for `ltr_finder` to identify LTR-RTs (see more
280- with `ltr_finder -h`) [default="-w 2 -D 20000 -d 1000
281- -L 7000 -l 100 -p 20 -C -M 0.6 "]
292+ with `ltr_finder -h`) [default="-w 2 -D 15000 -d 1000
293+ -L 7000 -l 100 -p 20 -C -M 0.8 "]
282294 -ltr_harvest_options STR
283295 Options for `gt ltrharvest` to identify LTR-RTs (see
284296 more with `gt ltrharvest -help`) [default="-seqids yes
285- -similar 60 -vic 10 -seed 20 -minlenltr 100 -maxlenltr
286- 7000 -maxdistltr 20000 -mindistltr 1000 -mintsd 4
287- -maxtsd 20"]
297+ -similar 80 -vic 10 -seed 20 -minlenltr 100 -maxlenltr
298+ 7000 -mintsd 4 -maxtsd 6"]
288299 -tesorter_options STR
289300 Options for `TEsorter` to classify LTR-RTs (see more
290- with `TEsorter -h`) [default="-db rexdb-plant -dp2"]
291- -all_ltr Use all LTR identified by `-ltr_detectors` (more LTRs
292- but slower) [default: only use LTR as classified by
293- `TEsorter`]
294- -intact_ltr Use completed LTR as classified by `TEsorter` (less
295- LTRs but faster) [default: the same as `-all_ltr`]
296- -shared_ltr Identify shared LTRs among subgenomes (experimental)
297- [default=False]
301+ with `TEsorter -h`) [default="-db rexdb -dp2"]
302+ -all_ltr Use all LTR-RTs identified by `-ltr_detectors` (more
303+ LTR-RTs but slower) [default: only use LTR as
304+ classified by `TEsorter`]
305+ -intact_ltr Use completed LTR-RTs classified by `TEsorter` (less
306+ LTR-RTs but faster) [default: the same as `-all_ltr`]
307+ -exclude_exchanges Exclude potential exchanged LTRs for insertion age
308+ estimation and phylogenetic trees [default=False]
309+ -shared_ltr Identify shared LTR-RTs among subgenomes
310+ (experimental) [default=False]
298311 -mu FLOAT Substitution rate per year in the intergenic region,
299312 for estimating age of LTR insertion [default=1.3e-08]
300313 -disable_ltrtree Disable subgenome-specific LTR tree (this step is
301- time-consuming when subgenome-specific LTRs are too
314+ time-consuming when subgenome-specific LTR-RTs are too
302315 many, so `-subsample` is enabled by default)
303316 [default=False]
304- -subsample INT Subsample LTRs to avoid too many to construct a tree
305- [default=1000] (0 to disable)
317+ -subsample INT Subsample LTR-RTs to avoid too many to construct a
318+ tree [default=1000] (0 to disable)
306319 -ltr_domains {GAG,PROT,INT,RT,RH,AP,RNaseH} [{GAG,PROT,INT,RT,RH,AP,RNaseH} ...]
307320 Domains for LTR tree (Note: for domains identified by
308321 `TEsorter`, PROT (rexdb) = AP (gydb), RH (rexdb) =
309- RNaseH (gydb)) [default= ['INT', 'RT', 'RH']]
322+ RNaseH (gydb)) [default: ['INT', 'RT', 'RH']]
310323 -trimal_options STR Options for `trimal` to trim alignment (see more with
311324 `trimal -h`) [default="-automated1"]
312325 -tree_method {iqtree,FastTree}
313326 Programs to construct phylogenetic trees
314- [default=iqtree]
327+ [default=FastTree]
315328 -tree_options STR Options for `-tree_method` to construct phylogenetic
316329 trees (see more with `iqtree -h` or `FastTree
317- -expert`) [default="-mset JTT "]
330+ -expert`) [default=""]
318331 -ggtree_options STR Options for `ggtree` to show phylogenetic trees (see
319332 more from `https://yulab-smu.top/treedata-book`)
320333 [default="branch.length='none', layout='circular'"]
@@ -324,17 +337,22 @@ Circos:
324337
325338 -disable_circos Disable this step [default=False]
326339 -window_size INT Window size (bp) for circos plot [default=1000000]
327- -disable_blocks Disable to plot homoeologous blocks [default=False]
328- -aligner PROG Programs to identify homoeologous blocks
340+ -disable_blocks Disable to plot homologous blocks [default=False]
341+ -aligner PROG Programs to identify homologous blocks
329342 [default=minimap2]
330343 -aligner_options STR Options for `-aligner` to align chromosome sequences
331344 [default="-x asm20 -n 10"]
332345 -min_block INT Minimum block size (bp) to show [default=100000]
346+ -alt_cfgs CFGFILE [CFGFILE ...]
347+ An alternative config file for identifying homologous
348+ blocks [default=None]
349+ -chr_ordered FILE Provide a chromosome order to plot circos
350+ [default=None]
333351
334352Other options:
335353 -p INT, -ncpu INT Maximum number of processors to use [default=32]
336354 -max_memory MEM Maximum memory to use where limiting can be enabled.
337- [default=65.1G]
355+ [default=65.2G]
338356 -cleanup Remove the temporary directory [default=False]
339357 -overwrite Overwrite even if check point files existed
340358 [default=False]
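
The `-mu` option above is described as the substitution rate per year used to estimate the age of LTR insertion. A common convention for this kind of estimate, assumed here and not confirmed as what `subphaser` computes internally, is T = K / (2 * mu), where K is the sequence divergence between the two LTRs of one element. A minimal Python sketch under that assumption; the function name `insertion_age` and the example divergence value are hypothetical:

```python
def insertion_age(divergence: float, mu: float = 1.3e-08) -> float:
    """Estimate LTR-RT insertion age in years.

    Assumption (not taken from the SubPhaser source): T = K / (2 * mu),
    where K is the divergence between the 5' and 3' LTRs of one element
    and mu is the substitution rate per site per year. The default for
    mu mirrors the documented default of the `-mu` option.
    """
    if divergence < 0:
        raise ValueError("divergence must be non-negative")
    if mu <= 0:
        raise ValueError("mu must be positive")
    # Both LTRs accumulate substitutions independently, hence the factor 2.
    return divergence / (2 * mu)


if __name__ == "__main__":
    # Hypothetical element whose two LTRs differ at 2.6% of sites.
    age = insertion_age(0.026)
    print(f"{age / 1e6:.2f} Myr")
```

Under this convention, a larger `-mu` value gives proportionally younger insertion ages for the same divergence, so the rate should be chosen to match the lineage being studied.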