
Preprocessing reads

maelyg edited this page Oct 31, 2023 · 4 revisions

If multiple fastq files exist for a single sample, they first need to be merged using the --merge option. The read names in the resulting fastq file are then trimmed after the first whitespace, for compatibility with all downstream tools.
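To illustrate what trimming read names after the first whitespace means, here is a minimal Python sketch. It is not the pipeline's own code: the function name trim_read_names is hypothetical, and the sketch assumes a well-formed fastq file with exactly four lines per record.

```python
import io


def trim_read_names(fastq_in, fastq_out):
    """Copy a fastq stream, cutting each header line at the first whitespace.

    Hypothetical helper for illustration only; assumes 4 lines per record.
    """
    for line_number, line in enumerate(fastq_in):
        if line_number % 4 == 0:  # header line of each record
            tokens = line.split(None, 1)  # split on the first run of whitespace
            if tokens:
                line = tokens[0] + "\n"
        fastq_out.write(line)


# Two made-up records whose headers carry extra fields after the read name.
raw = (
    "@read1 runid=abc ch=12\nACGT\n+\n!!!!\n"
    "@read2\tbarcode=01\nTTGA\n+\n####\n"
)
trimmed = io.StringIO()
trim_read_names(io.StringIO(raw), trimmed)
print(trimmed.getvalue(), end="")
```

Only the header line of each record changes; the sequence, separator and quality lines are copied through untouched.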

Reads can also be optionally trimmed of adapters and/or quality filtered:

  • Search for the presence of adapters in sequence reads using Porechop ABI by specifying the --adapter_trimming parameter. Porechop ABI parameters can be specified using --porechop_options '{options} ', making sure you leave a space at the end before the closing quote. Please refer to the Porechop manual.
    To limit the search to known adapters listed in adapter.py, just specify the --adapter_trimming option.
    To search ab initio for adapters on top of known adapters, specify --adapter_trimming --porechop_options '-abi '.
    To limit the search to custom adapters, specify --adapter_trimming --porechop_custom_primers --porechop_options '-ddb ' and list the custom adapters in the text file bin/adapters.txt using the following format:

     line 1: Adapter name
     line 2: Start adapter sequence
     line 3: End adapter sequence
     --- repeat for each adapter pair ---
    
  • Perform a quality filtering step using Chopper by specifying the --qual_filt parameter. Chopper parameters can be specified using --chopper_options '{options}'. Please refer to the Chopper manual.
    For instance, to discard reads shorter than 1000 bp or longer than 20000 bp, and keep only reads with a minimum average Phred quality score of 10, you would specify: --qual_filt --chopper_options '-q 10 -l 1000 --maxlength 20000'.
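To make the effect of the example Chopper options '-q 10 -l 1000 --maxlength 20000' concrete, the following Python sketch applies the same three thresholds to a single read. It is an illustration, not Chopper itself: the function names are hypothetical, the thresholds are treated as inclusive, and the average quality is assumed to be derived from the mean error probability rather than from a plain mean of Phred scores. Check the Chopper manual for the exact behaviour.

```python
import math


def mean_phred(quality_string, offset=33):
    """Average quality of a read, computed from the mean error probability.

    Assumption: averaging is done on error probabilities (Phred+33 encoding),
    then converted back to a Phred score.
    """
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(seq, qual, min_q=10, min_len=1000, max_len=20000):
    """Return True if the read would be kept under the example thresholds.

    Hypothetical helper; bounds are assumed to be inclusive.
    """
    if not (min_len <= len(seq) <= max_len):
        return False
    return mean_phred(qual) >= min_q


# Made-up reads: '5' encodes Q20 and '&' encodes Q5 in Phred+33.
print(passes_filter("A" * 1500, "5" * 1500))    # length and quality both fine
print(passes_filter("A" * 500, "5" * 500))      # too short
print(passes_filter("A" * 25000, "5" * 25000))  # too long
print(passes_filter("A" * 1500, "&" * 1500))    # average quality below 10
```

Averaging error probabilities penalises a few very low-quality bases more than a plain mean of the Phred scores would, which is why the two approaches can disagree on borderline reads.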

A zipped copy of the resulting preprocessed and/or quality filtered fastq file will be saved in the preprocessing folder.

If you trim adapters from the raw reads and/or quality filter them, an additional quality control step will be performed and a QC report will be generated summarising the read counts recovered before and after preprocessing for all samples listed in the index.csv file.
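The before/after comparison can be reproduced by hand for a single sample. The sketch below is not the pipeline's QC step and does not reproduce its report format: the function names and file names are hypothetical, the zipped output is assumed to be gzip-compressed, and each fastq record is assumed to span exactly four lines.

```python
import gzip
import os
import tempfile


def count_reads(path):
    """Count records in a fastq file, gzip-compressed or plain (4 lines each)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def summarise(raw_path, processed_path):
    """Compare read counts before and after preprocessing (hypothetical helper)."""
    raw = count_reads(raw_path)
    kept = count_reads(processed_path)
    percent = round(100.0 * kept / raw, 1) if raw else 0.0
    return {"raw": raw, "kept": kept, "percent_kept": percent}


# Demo on two tiny made-up files written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    raw_path = os.path.join(folder, "sample_raw.fastq.gz")
    processed_path = os.path.join(folder, "sample_preprocessed.fastq.gz")
    with gzip.open(raw_path, "wt") as handle:
        handle.write("@read1\nACGT\n+\n!!!!\n@read2\nTTGA\n+\n####\n")
    with gzip.open(processed_path, "wt") as handle:
        handle.write("@read1\nACGT\n+\n!!!!\n")
    summary = summarise(raw_path, processed_path)

print(summary)
```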