Processing nanopore samples

The following are crucial steps as part of the Snitkin lab SOP to process nanopore data (long reads).

Getting started

First, clone the Data Flow repository from GitHub and follow the directions in this document to create your project folder on turbo. Your project folder is named after the project you are currently working on. For example, if you are working on the MDHHS project, there is a folder called Project_MDHHS_genomics on /nfs/turbo/umms-esnitkin/. If you are unable to find your project folder (rare), please slack Evan to check whether it exists or needs to be created. If the project folder does not exist, you will have the opportunity to create it further along in the SOP.

Please ensure you clone the GitHub repository into your scratch directory, i.e. /scratch/esnitkin_root/esnitkin1/your_uniqname/. Every step in this SOP depends on successful completion of the preceding steps. If you are unable to find the relevant scripts and/or the instructions are unclear, please slack Dhatri and do not move forward with the SOP.

Installation

First, navigate to your scratch directory. Your path should look something like this: /scratch/esnitkin_root/esnitkin1/uniqname/

cd /scratch/esnitkin_root/esnitkin1/dhatrib/

Once you are in your scratch directory, clone the GitHub repository.

git clone https://github.com/Snitkin-Lab-Umich/Data-Flow-SOP.git

If you have successfully cloned the GitHub repository into your scratch directory, you should see the following messages.

(base) [dhatrib@gl-login2 dhatrib]$ pwd
/scratch/esnitkin_root/esnitkin1/dhatrib
(base) [dhatrib@gl-login2 dhatrib]$ git clone https://github.com/Snitkin-Lab-Umich/Data-Flow-SOP.git
Cloning into 'Data-Flow-SOP'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 11 (delta 1), reused 8 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (11/11), 10.18 KiB | 10.18 MiB/s, done.
Resolving deltas: 100% (1/1), done.

Ensure you have successfully cloned Data-Flow-SOP. Type ls and you should see the newly created directory Data-Flow-SOP. Move to the newly created directory. This is what Data-Flow-SOP should look like.

(base) [dhatrib@gl-login2 dhatrib]$ ls
Data-Flow-SOP
(base) [dhatrib@gl-login2 dhatrib]$ pwd
/scratch/esnitkin_root/esnitkin1/dhatrib
(base) [dhatrib@gl-login2 dhatrib]$ cd Data-Flow-SOP/
(base) [dhatrib@gl-login2 Data-Flow-SOP]$ pwd
/scratch/esnitkin_root/esnitkin1/dhatrib/Data-Flow-SOP
(base) [dhatrib@gl-login2 Data-Flow-SOP]$ ls
create_directories.py        hybrid    nanopore  processing-hybrid-samples.md    processing-nanopore-samples.md
create_higher_level_dirs.py  illumina  pics      processing-illumina-samples.md  README.md
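
The same check can be scripted. The following is a minimal sketch, not part of the repository: the function name `missing_repo_entries` is hypothetical, and the list of expected entries is simply copied from the `ls` output above, so it would need updating if the repository layout changes.

```python
from pathlib import Path

# Entries copied from the `ls` listing above (an assumption about what
# this SOP needs; the repository may contain more).
EXPECTED_ENTRIES = [
    "create_directories.py",
    "create_higher_level_dirs.py",
    "nanopore",
    "processing-nanopore-samples.md",
]


def missing_repo_entries(repo_dir):
    """Return the expected entries that are absent from repo_dir."""
    repo = Path(repo_dir)
    return [name for name in EXPECTED_ENTRIES if not (repo / name).exists()]


# Example: missing_repo_entries("/scratch/esnitkin_root/esnitkin1/uniqname/Data-Flow-SOP")
# An empty list means every expected entry was found.
```

If the returned list is not empty, the clone is probably incomplete or you are pointing at the wrong directory.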

Create project folder

If your project folder already exists, skip to the Rename samples step.

If your project folder does not exist, follow the steps below.

The following script, create_higher_level_dirs.py, will create your Project folder for you.

  1. To get started, type this command python3 create_higher_level_dirs.py -h. This will give you an idea of all the flags present in the script and what you need to specify for each argument as seen below.
(base) [dhatrib@gl-login3 Data-Flow-SOP]$ python3 create_higher_level_dirs.py -h
usage: create_higher_level_dirs.py [-h] --dest_path DEST_PATH --project_name PROJECT_NAME --data_type {illumina,nanopore,both}

Create project folder and higher level directory structure for illumina and nanopore data.

options:
  -h, --help            show this help message and exit
  --dest_path DEST_PATH
                        Destination path where directories need to be created—do NOT include project name (e.g., /nfs/turbo/umms-esnitkin/)
  --project_name PROJECT_NAME
                        Name of your project (format: Project_Name-of-Project, e.g., Project_MDHHS)
  --data_type {illumina,nanopore,both}
                        Type of data (illumina/nanopore/both)
  1. Create project folder and its relevant sub-directories. This script is case-sensitive: if you want to create a Project folder called Project_MERLIN but type Project_MERLiN instead, the script will create /nfs/turbo/umms-esnitkin/Project_MERLiN rather than /nfs/turbo/umms-esnitkin/Project_MERLIN.

The --dest_path should be the path to turbo on Great Lakes i.e. /nfs/turbo/umms-esnitkin/, --project_name should be the name of your project and --data_type is nanopore.

python3 create_higher_level_dirs.py --dest_path /nfs/turbo/umms-esnitkin/ --project_name Project_Test_Nanopore_Org --data_type nanopore
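
Because the project name is case-sensitive, it can be worth checking for an existing folder that differs only in capitalization before you run the command. This is a minimal sketch under that assumption; `case_variants` is a hypothetical helper and is not part of create_higher_level_dirs.py.

```python
from pathlib import Path


def case_variants(dest_path, project_name):
    """Return existing folders in dest_path whose names equal project_name
    when case is ignored but are not an exact match."""
    wanted = project_name.lower()
    return sorted(
        entry.name
        for entry in Path(dest_path).iterdir()
        if entry.is_dir()
        and entry.name.lower() == wanted
        and entry.name != project_name
    )


# Example: case_variants("/nfs/turbo/umms-esnitkin/", "Project_MERLiN")
# A non-empty result suggests the name you typed may be a capitalization slip.
```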

Once you have successfully created the relevant subdirectories, you should see the below messages on your terminal.

(base) [dhatrib@gl-login3 Data-Flow-SOP]$ python3 create_higher_level_dirs.py --dest_path /nfs/turbo/umms-esnitkin/ --project_name Project_Test_Nanopore_Org --data_type nanopore

Metadata and subdirectories created successfully at /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/metadata.

Variant calling folder created successfully at /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/variant_calling.

Assembly directory structure created successfully at /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/assembly.

Created ONT directory structure under /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org

Success! All specified directories have been created.
  1. Visualize Project directory structure: tree /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/ to look at the Project folder and relevant directories created.
(base) [dhatrib@gl-login3 Data-Flow-SOP]$ tree /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/
/nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/
└── Sequence_data
    ├── assembly
    │   └── ONT
    ├── metadata
    │   ├── AGC_submission
    │   ├── plasmidsaurus
    │   └── sample_lookup
    ├── ONT
    └── variant_calling

9 directories, 0 files

STOP AND CHECK: If your Project folder i.e. /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org does not contain the directories seen above, please review the path(s) you provided to the flags specified in create_higher_level_dirs.py on your terminal.

If your Project directory looks like the above, you are ready to move to the next step.
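
If you prefer not to compare the tree output by eye, the check can be sketched in Python. The relative paths below are copied from the tree output above; `missing_project_dirs` is a hypothetical helper, not one of the SOP scripts.

```python
from pathlib import Path

# Relative paths copied from the `tree` output for a nanopore project.
EXPECTED_DIRS = [
    "Sequence_data/assembly/ONT",
    "Sequence_data/metadata/AGC_submission",
    "Sequence_data/metadata/plasmidsaurus",
    "Sequence_data/metadata/sample_lookup",
    "Sequence_data/ONT",
    "Sequence_data/variant_calling",
]


def missing_project_dirs(project_dir):
    """Return the expected sub-directories that do not exist under project_dir."""
    project = Path(project_dir)
    return [rel for rel in EXPECTED_DIRS if not (project / rel).is_dir()]


# Example: missing_project_dirs("/nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org")
```

An empty list corresponds to the directory layout shown above.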

Rename samples

  1. Once you have created the Project folder/confirmed the existence of an already created Project folder, navigate to this directory /nfs/turbo/umms-esnitkin/Raw_sequencing_data/ and find the folder corresponding to your dataset. Rename samples using demultiplex file *_sample_lookup.csv.

Move to Raw_sequencing_data folder and navigate to your corresponding dataset folder.

cd /nfs/turbo/umms-esnitkin/Raw_sequencing_data
ls
cd your/samples/folder

For the sake of this tutorial, we will be using the samples in test_nanopore_org.

(base) [dhatrib@gl-login3 dhatrib]$ cd /nfs/turbo/umms-esnitkin/Raw_sequencing_data/
(base) [dhatrib@gl-login3 Raw_sequencing_data]$ ls
10728-ES  27MSPS_fastq    9FSQ29_fastq    G6C                                 N3TG4Y_results  S6HSTY_results     Y9RJY8_fastq
10909-ES  27MSPS_results  9FSQ29_results  G6C_rerun                           R5G             test_illumina_org  Y9RJY8_results
11091-ES  98CML4_fastq    B5JNZD_fastq    move_long_reads_to_corr_folders.sh  R5G_rerun       VTVLPR_1           ZJ9XL5_fastq
11143-ES  98CML4_results  B5JNZD_results  N3TG4Y_fastq                        S6HSTY_1        VTVLPR_results     ZJ9XL5_results
(base) [dhatrib@gl-login3 Raw_sequencing_data]$ cd test_nanopore_org/
(base) [dhatrib@gl-login3 test_nanopore_org]$ ls
test_1.fastq.gz  test_2.fastq.gz  test_3.fastq.gz  test_sample_lookup.csv

To understand how to use the bash script, execute the following command

/scratch/esnitkin_root/esnitkin1/path/to/Data-Flow-SOP/nanopore/rename_samples_nanopore.sh 

You will need to supply the name of the folder that contains the fastqs and a lookup file called *_sample_lookup.csv where the * is the name of your dataset. If you are unable to find the fastq folder and/or the lookup file, please do not move forward with the following steps and slack Evan.

/scratch/esnitkin_root/esnitkin1/path/to/Data-Flow-SOP/nanopore/rename_samples_nanopore.sh fastq_folder *_sample_lookup.csv Batch_X

Find out the batch number i.e. Batch_Number from *_sample_lookup.csv. If you successfully renamed the samples, you should see the following message.

(base) [dhatrib@gl-login2 test_nanopore_org]$ less test_sample_lookup.csv
(base) [dhatrib@gl-login2 test_nanopore_org]$ /scratch/esnitkin_root/esnitkin1/dhatrib/Data-Flow-SOP/nanopore/rename_samples_nanopore.sh . test_sample_lookup.csv Batch_1
Renaming sample ./test_1.fastq.gz to ./cauris_1.fastq.gz
Renaming sample ./test_2.fastq.gz to ./cauris_2.fastq.gz
Renaming sample ./test_3.fastq.gz to ./cauris_3.fastq.gz
Renaming completed. Check renamed_file_commands.sh for details.
  1. As part of the renaming process, renamed_file_commands.sh is created. This file contains mv commands that show each original file name and its corresponding new file name.

Confirm the creation of renamed_file_commands.sh

(base) [dhatrib@gl-login2 test_nanopore_org]$ ls
cauris_1.fastq.gz  cauris_2.fastq.gz  cauris_3.fastq.gz  renamed_file_commands.sh  test_sample_lookup.csv

Inspect contents of the file. Hit q to exit.

less /nfs/turbo/umms-esnitkin/Raw_sequencing_data/your/samples/folder/renamed_file_commands.sh

Quick sanity check to ensure all samples have been renamed.


mv ./test_1.fastq.gz ./cauris_1.fastq.gz
mv ./test_2.fastq.gz ./cauris_2.fastq.gz
mv ./test_3.fastq.gz ./cauris_3.fastq.gz
....
(END)

Ensure all the files have been renamed. Navigate to the fastq folder. If samples have been renamed, we can move to the next step in the SOP.

(base) [dhatrib@gl-login2 test_nanopore_org]$ ls
cauris_1.fastq.gz  cauris_2.fastq.gz  cauris_3.fastq.gz  renamed_file_commands.sh  test_sample_lookup.csv

STOP AND CHECK: If there are some files that have not been renamed, please slack Dhatri immediately and do not move forward with the following steps.
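
For datasets with many samples, the renaming check can be sketched as a small script that reads renamed_file_commands.sh. It assumes every relevant line has the form `mv <old> <new>`, as in the excerpt above; `check_renames` is a hypothetical helper and is not part of the repository.

```python
import shlex
from pathlib import Path


def check_renames(commands_file, fastq_dir):
    """Return (still_present, not_found): old names that still exist and
    new names that are missing, according to the mv commands."""
    fastq_dir = Path(fastq_dir)
    still_present, not_found = [], []
    for line in Path(commands_file).read_text().splitlines():
        parts = shlex.split(line)
        # Only consider lines of the form: mv <old> <new>
        if len(parts) != 3 or parts[0] != "mv":
            continue
        old_name, new_name = parts[1], parts[2]
        if old_name != new_name and (fastq_dir / old_name).exists():
            still_present.append(old_name)
        if not (fastq_dir / new_name).exists():
            not_found.append(new_name)
    return still_present, not_found


# Example, run from inside the dataset folder:
# check_renames("renamed_file_commands.sh", ".")
```

Both lists being empty means every file named in the mv commands now exists under its new name only.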

Create batch directory to house the renamed samples

Once the samples have been renamed, you are now ready to move them from temporary storage ../Raw_sequencing_data/dataset_folder/fastq_folder to a more permanent location—your Project folder /nfs/turbo/umms-esnitkin/Project_Name.

  1. Run create_directories.py to create the relevant subdirectories in your Project folder. Type python3 create_directories.py -h on your terminal. This will give you an idea of all the flags present in the script and what you need to specify for each argument as seen below.
(base) [dhatrib@gl-login2 fastqs_test_illumina]$ python3 /scratch/esnitkin_root/esnitkin1/dhatrib/Data-Flow-SOP/create_directories.py -h
usage: create_directories.py [-h] --dest_path DEST_PATH --project_name PROJECT_NAME --data_type {illumina,nanopore,both} [--folder_names_illumina [FOLDER_NAMES_ILLUMINA]]
                             [--folder_names_nanopore [FOLDER_NAMES_NANOPORE]]

Create directory structure for illumina and nanopore data.

options:
  -h, --help            show this help message and exit
  --dest_path DEST_PATH
                        Destination path where directories need to be created—do NOT include project name (e.g., /nfs/turbo/umms-esnitkin/)
  --project_name PROJECT_NAME
                        Name of your project (format: Project_Name-of-Project, e.g., Project_MDHHS)
  --data_type {illumina,nanopore,both}
                        Type of data (illumina/nanopore/both)
  --folder_names_illumina [FOLDER_NAMES_ILLUMINA]
                        Comma separated folder names for Illumina (if multiple) (format: date_PlateInfo, e.g., 2024-12-24_Plate1-to-Plate3,2024-12-25_Plate4)
  --folder_names_nanopore [FOLDER_NAMES_NANOPORE]
                        Comma separated folder names for Nanopore (if multiple) (format: date_PlateInfo, e.g., 2024-09-14_Batch4-to-Batch6,2024-09-15_Batch7)

If you are unsure which batch your samples are from, click here. If you are unable to open the link, slack Evan for access to the excel.

Create the batch folders and respective subdirectories in your project folder.

python3 /scratch/esnitkin_root/esnitkin1/dhatrib/Data-Flow-SOP/create_directories.py --dest_path /nfs/turbo/umms-esnitkin/ --project_name Project_Test_Nanopore_Org --data_type nanopore --folder_names_nanopore 2024-11-07_Batch1

If the directories were created successfully, you should see the following messages.

(base) [dhatrib@gl-login2 test_nanopore_org]$ python3 /scratch/esnitkin_root/esnitkin1/dhatrib/Data-Flow-SOP/create_directories.py  --dest_path /nfs/turbo/umms-esnitkin/ --project_name Project_Test_Nanopore_Org --data_type nanopore --folder_names_nanopore 2024-11-07_Batch1

Validating folders at /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org for data type nanopore

Metadata and subdirectories created successfully at /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/metadata.

Variant calling folder created successfully at /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/variant_calling.

Creating assembly directory at /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/assembly

Assembly directory structure created successfully.

Creating ONT directory structure for 2024-11-07_Batch1 under /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org

ONT directory structure for 2024-11-07_Batch1 created successfully.

Success! All specified directories have been created.
  1. Visualize the updated directory structure of the batch folder and its subdirectories in the project folder: tree /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org
(base) [dhatrib@gl-login2 test_nanopore_org]$ tree /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org
/nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org
└── Sequence_data
    ├── assembly
    │   └── ONT
    ├── metadata
    │   ├── AGC_submission
    │   ├── plasmidsaurus
    │   └── sample_lookup
    ├── ONT
    │   ├── 2024-11-07_Batch1
    │   │   ├── failed_qc_samples
    │   │   ├── passed_qc_samples
    │   │   └── raw_fastq
    │   └── clean_fastq_qc_pass_samples
    └── variant_calling

14 directories, 0 files

STOP AND CHECK: If your batch directory aka ../Sequence_data/ONT/2024-11-07_Batch1/ does not have the subdirectories as seen above, do not move on and slack Dhatri.
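
One way to end up with an unexpected batch directory is a mistyped --folder_names_nanopore value. The sketch below checks a comma-separated value against the date_BatchInfo format given in the help text (e.g., 2024-09-14_Batch4-to-Batch6). The regular expression is an assumption about what the format allows, and `invalid_folder_names` is a hypothetical helper; create_directories.py may apply different rules.

```python
import re
from datetime import datetime

# Assumed pattern: YYYY-MM-DD, an underscore, then letters, digits or hyphens.
FOLDER_NAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_([A-Za-z0-9-]+)$")


def invalid_folder_names(folder_names):
    """Return the names in a comma-separated string that do not look like
    date_BatchInfo or whose date is not a real calendar date."""
    bad = []
    for name in folder_names.split(","):
        name = name.strip()
        match = FOLDER_NAME_RE.match(name)
        if not match:
            bad.append(name)
            continue
        try:
            datetime.strptime(match.group(1), "%Y-%m-%d")
        except ValueError:
            bad.append(name)
    return bad


# Example: invalid_folder_names("2024-11-07_Batch1") returns an empty list.
```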

Move renamed samples to Project folder

  1. Before you move nanoQC results, we need to move the renamed samples to your Project folder i.e. the path to your Project folder on turbo.

Log onto globus. Move only the fastq files from /nfs/turbo/umms-esnitkin/Raw_sequencing_data/test_nanopore_org/ to the raw_fastq folder in your Project folder on turbo i.e. /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/raw_fastq/ by clicking the Start button.


Move the rest of the files found in your dataset folder to the metadata/sample_lookup folder, e.g. mv README.md renamed_file_commands.sh /nfs/turbo/umms-esnitkin/Project_Name/Sequence_data/metadata/sample_lookup. Files you may find in the dataset folder include the demultiplex lookup file (*_sample_lookup.csv), a README (if applicable), and renamed_file_commands.sh.

(base) [dhatrib@gl-login3 test_nanopore_org]$ ls
renamed_file_commands.sh  test_sample_lookup.csv
(base) [dhatrib@gl-login3 test_nanopore_org]$ mv renamed_file_commands.sh test_sample_lookup.csv /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/metadata/sample_lookup
(base) [dhatrib@gl-login3 test_nanopore_org]$ cd /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/metadata/sample_lookup
(base) [dhatrib@gl-login3 sample_lookup]$ ls
test_sample_lookup.csv  renamed_file_commands.sh
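
Once the Globus transfer to raw_fastq has finished, an optional extra check (not part of this SOP, and only a sketch) is to confirm that each transferred .fastq.gz file can be decompressed to the end, which would catch a truncated file. `unreadable_fastqs` is a hypothetical helper; it reads every file in full, so it can take a while on large batches.

```python
import gzip
import zlib
from pathlib import Path


def unreadable_fastqs(raw_fastq_dir, chunk_size=1024 * 1024):
    """Return names of *.fastq.gz files that cannot be decompressed to the end."""
    bad = []
    for path in sorted(Path(raw_fastq_dir).glob("*.fastq.gz")):
        try:
            with gzip.open(path, "rb") as handle:
                # Read in chunks so large files are never held in memory at once.
                while handle.read(chunk_size):
                    pass
        except (OSError, EOFError, zlib.error):
            # Covers invalid gzip headers, truncated streams and corrupt data.
            bad.append(path.name)
    return bad


# Example:
# unreadable_fastqs("/nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/"
#                   "Sequence_data/ONT/2024-11-07_Batch1/raw_fastq")
```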

Run nanoQC

  1. You are now ready to start processing your long reads. We will be using one of our in house pipelines—nanoQC—to run on the renamed samples. Click on the link and follow the instructions as described on the Github page.

STOP AND CHECK: If you run into any issues running nanoQC, please slack Dhatri immediately and do not move forward with the following steps. Ensure you have generated the QC report before moving on.

Move nanoQC results to Project directory

Once you have finished running nanoQC on your scratch directory and globus has successfully transferred the raw fastq files, you are ready to move important nanoQC outputs to your Project folder.

  1. Move nanoQC outputs using globus since the output files are massive and take a while to transfer.

Log onto globus. Move only the nanoQC results directory, i.e. 2024-10-28_Project_Cauris_nanoQC, from /scratch/esnitkin_root/esnitkin1/dhatrib/nanoQC/results/2024-10-28_Project_Cauris_nanoQC/ to your batch folder under ONT in your Project directory on turbo, i.e. /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/


  1. Once you have confirmed that the results have been moved (you should get an email from globus), create a folder, nanoQC_snakemake_pipeline, in your newly moved results folder, i.e. /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/2024-10-28_Project_Cauris_nanoQC/.

cd into your nanoQC results folder and create a new folder nanoQC_snakemake_pipeline.

(base) [dhatrib@gl-login3 nanoQC]$ cd /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/2024-10-28_Project_Cauris_nanoQC
(base) [dhatrib@gl-login3 2024-10-28_Project_Cauris_nanoQC]$ mkdir nanoQC_snakemake_pipeline
(base) [dhatrib@gl-login3 2024-10-28_Project_Cauris_nanoQC]$ ls
2024-10-28_Project_Cauris_nanoQC_report nanoQC_snakemake_pipeline  filtlong  flye  medaka ....

From your nanoQC pipeline directory on scratch, move the Snakefiles, config, samples and cluster files to the nanoQC_snakemake_pipeline folder you just created.

(base) [dhatrib@gl3021 nanoQC]$ mv workflow/nanoQC.smk workflow/nanoQC_summary.smk config/config.yaml config/samples.csv config/cluster.json /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/2024-10-28_Project_Cauris_nanoQC/nanoQC_snakemake_pipeline/
(base) [dhatrib@gl3021 nanoQC]$ ls /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/2024-10-28_Project_Cauris_nanoQC/nanoQC_snakemake_pipeline/
nanoQC.smk nanoQC_summary.smk config.yaml samples.csv cluster.json
  1. Once you have confirmed that the results have been moved (you should get an email from globus), you are now ready to move the results from nanoQC to the respective folders in your batch directory using move_files_to_dirs_nanopore.py. To understand how to use the python script, try python3 /scratch/esnitkin_root/esnitkin1/uniqname/path/to/Data-Flow-SOP/nanopore/move_files_to_dirs_nanopore.py -h.
(base) [dhatrib@gl3021 results]$ python3 /scratch/esnitkin_root/esnitkin1/dhatrib/Data-Flow-SOP/nanopore/move_files_to_dirs_nanopore.py -h
usage: move_files_to_dirs_nanopore.py [-h] --batch_info_path BATCH_INFO_PATH --nanoQC_results_path NANOQC_RESULTS_PATH

Process and organize nanopore sequencing data.

options:
  -h, --help            show this help message and exit
  --batch_info_path BATCH_INFO_PATH
                        Path to the date_BatchInfo directory (e.g., /nfs/turbo/umms-esnitkin/Project_Marimba/Sequence_data/ONT/2024-12-24_Batch1)
  --nanoQC_results_path NANOQC_RESULTS_PATH
                        Path to the nanoQC results (e.g., /nfs/turbo/umms-esnitkin/Project_Marimba/Sequence_data/ONT/2024-12-24_Batch1/2024-12-24_Project_MDHHS_Nano_QC)

Ensure you have the paths to your batch directory i.e. /nfs/turbo/umms-esnitkin/Your_Project/Sequence_data/ONT/date_BatchNum and nanoQC outputs i.e. /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/2024-10-28_Project_Cauris_nanoQC/ handy.

python3 /scratch/esnitkin_root/esnitkin1/dhatrib/Data-Flow-SOP/nanopore/move_files_to_directories_nanopore.py --batch_info_path /nfs/turbo/umms-esnitkin/Your_Project/Sequence_data/ONT/2024-11-07_BatchNum  --nanoQC_results_path /nfs/turbo/umms-esnitkin/Your_Project/Sequence_data/ONT/2024-11-07_BatchNum/date_Your_Project_nanoQC/
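
Both arguments are long paths that share a common prefix, so it is easy to mistype one of them. The sketch below builds the two paths from their components and reports any that do not exist yet. The helper names `nanopore_paths` and `path_problems` are hypothetical, and the layout is taken from the example paths in this document.

```python
from pathlib import Path


def nanopore_paths(dest_path, project_name, batch_folder, nanoqc_folder):
    """Build the batch directory path and the nanoQC results path inside it."""
    batch_info_path = (
        Path(dest_path) / project_name / "Sequence_data" / "ONT" / batch_folder
    )
    nanoqc_results_path = batch_info_path / nanoqc_folder
    return batch_info_path, nanoqc_results_path


def path_problems(batch_info_path, nanoqc_results_path):
    """Return a message for each of the two paths that is not an existing directory."""
    problems = []
    for label, path in (
        ("batch_info_path", batch_info_path),
        ("nanoQC_results_path", nanoqc_results_path),
    ):
        if not Path(path).is_dir():
            problems.append(f"{label} is not an existing directory: {path}")
    return problems


# Example:
# batch, results = nanopore_paths(
#     "/nfs/turbo/umms-esnitkin/", "Project_Test_Nanopore_Org",
#     "2024-11-07_Batch1", "2024-10-28_Project_Cauris_nanoQC",
# )
# print(path_problems(batch, results))
```

An empty list from `path_problems` means both directories exist and the paths can be passed to --batch_info_path and --nanoQC_results_path.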

The terminal window you used to run the script will be busy until the files have been moved. This usually takes no more than 10 minutes, though it can take longer depending on the number of samples you have. Please open another tab/terminal window if you need to continue using the terminal for other things.

Once your files have been moved successfully, you will see the following messages.

(base) [dhatrib@gl3021 results]$ python3 /scratch/esnitkin_root/esnitkin1/dhatrib/Data-Flow-SOP/nanopore/move_files_to_directories_nanopore.py --batch_info_path /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1  --nanoQC_results_path /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/2024-10-28_Project_Cauris_nanoQC

Created master QC summary file at: /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/master_qc_summary.csv

Appended /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/2024-11-07_Batch1/2024-10-28_Project_Cauris_nanoQC/2024-10-28_Project_Cauris_nanoQC_report/2024-10-28_Project_Cauris_nanoQC_report.csv to /nfs/turbo/umms-esnitkin/Project_Test_Nanopore_Org/Sequence_data/ONT/master_qc_summary.csv
Yay, passed and failed sample files were transferred to their respective folders successfully!

Yay, samples from passed_samples.txt file have been moved to clean_fastq_qc_pass_samples directory and all specified samples have been moved to the assembly directory successfully!

If you see the above message, you have finished QC-ing your long reads and are ready to move forward with downstream analysis. Happy science-ing!