Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastx_quality_stats bug in median calculation #15

Open
winni2k opened this issue May 10, 2017 · 1 comment
Open

fastx_quality_stats bug in median calculation #15

winni2k opened this issue May 10, 2017 · 1 comment

Comments

@winni2k
Copy link

winni2k commented May 10, 2017

Hi, when I run the command

fastx_quality_stats -i input.fastq -o output.stats

, where input.fastq consists of

@0 <unknown description>
A
+
]
@1 <unknown description>
A
+
]

, then output.stats consists of

column  count   min     max     sum     mean    Q1      med     Q3      IQR     lW      rW      A_Count C_Count G_Count T_Count N_Count Max_count
1       2       60      60      120     60.00   60      50      50      -10     75      35      2       0       0       0
       0       2

. Note the med column has a value of 50, whereas the mean column has a value of 60, and the two quality scores in input.fastq are identical (]).

I believe this is a bug?

PS fastx_quality_stats -h prints:

usage: fastx_quality_stats [-h] [-N] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon ([email protected])

   [-h] = This helpful help screen.
   [-i INFILE]  = FASTQ input file. default is STDIN.
   [-o OUTFILE] = TEXT output file. default is STDOUT.
   [-N]         = New output format (with more information per nucleotide/cycle).

The *OLD* output TEXT file will have the following fields (one row per column):
	column	= column number (1 to 36 for a 36-cycles read solexa file)
	count   = number of bases found in this column.
	min     = Lowest quality score value found in this column.
	max     = Highest quality score value found in this column.
	sum     = Sum of quality score values for this column.
	mean    = Mean quality score value for this column.
	Q1	= 1st quartile quality score.
	med	= Median quality score.
	Q3	= 3rd quartile quality score.
	IQR	= Inter-Quartile range (Q3-Q1).
	lW	= 'Left-Whisker' value (for boxplotting).
	rW	= 'Right-Whisker' value (for boxplotting).
	A_Count	= Count of 'A' nucleotides found in this column.
	C_Count	= Count of 'C' nucleotides found in this column.
	G_Count	= Count of 'G' nucleotides found in this column.
	T_Count	= Count of 'T' nucleotides found in this column.
	N_Count = Count of 'N' nucleotides found in this column.
	max-count = max. number of bases (in all cycles)


The *NEW* output format:
	cycle (previously called 'column') = cycle number
	max-count
	For each nucleotide in the cycle (ALL/A/C/G/T/N):
		count   = number of bases found in this column.
		min     = Lowest quality score value found in this column.
		max     = Highest quality score value found in this column.
		sum     = Sum of quality score values for this column.
		mean    = Mean quality score value for this column.
		Q1	= 1st quartile quality score.
		med	= Median quality score.
		Q3	= 3rd quartile quality score.
		IQR	= Inter-Quartile range (Q3-Q1).
		lW	= 'Left-Whisker' value (for boxplotting).
		rW	= 'Right-Whisker' value (for boxplotting).


@winni2k
Copy link
Author

winni2k commented May 10, 2017

Also, if I run fastx_quality_stats with the -N flag enabled, then I get the following output:

cycle   max_count       ALL_count       ALL_min ALL_max ALL_sum ALL_mean        ALL_Q1  ALL_med ALL_Q3  ALL_IQR ALL_lW  ALL_rW  A_count A_min   A_max   A_sum   A_mean  A_Q1    A_med   A_Q3    A_IQR   A_lW    A_rW    C_count C_min   C_max   C_sum   C_mean  C_Q1    C_med   C_Q3    C_IQR   C_lW    C_rW    G_count G_min   G_max   G_sum   G_mean  G_Q1    G_med   G_Q3    G_IQR   G_lW    G_rW    T_count T_min   T_max   T_sum   T_mean  T_Q1    T_med   T_Q3    T_IQR   T_lW    T_rW    N_count N_min   N_max   N_sum   N_mean  N_Q1    N_med   N_Q3    N_IQR   N_lW    N_rW
1       2       2       60      60      120     60.00   60      50      50      -10     75      35      2       60      60      120     60.00   60      -45     -45     -105    217     -202    0       100     -100    0       nan     100     100     100     0       100
     -100    0       100     -100    0       nan     100     100     100     0       100     -100    0       100     -100    0       nan     100     100     100     0       100     -100    0       100     -100    0       nan     100     100     100     0       100     -100

The median and quartile values still appear to be off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant