GEO_RNASeq_upload

This page simplifies uploading data to GEO (Gene Expression Omnibus), the NCBI portal for submitting raw reads and gene expression counts.

Uploading raw reads together with expression counts is the standard procedure for RNA-Seq experiments. If you do not wish to upload expression counts (that is, raw reads only), you have to upload the raw reads directly through SRA (Sequence Read Archive, also on NCBI) rather than GEO.

If you used a public genome and annotation, continue to the next steps. If not, you need to upload the genome assembly to NCBI first (a page explaining this is still needed).

Prerequisites:

Make sure you are connected to the VPN (Cisco AnyConnect) if you are outside UCL.
Make an account with NCBI: https://www.ncbi.nlm.nih.gov/geo/
Download the FileZilla client: https://filezilla-project.org/

Compile data

Make an expression counts table. The counts could come from featureCounts (e.g. as run by the Nextflow rnaseq pipeline). You can submit these per sample, or provide a single table (in the example I submit a single counts table).
Make sure all your raw reads and expression count files are in the same folder. In my case, I had to link all my files into a new directory:
    # link the raw data files into a single folder
    ln -s /Volumes/ritd-ag-project-rd016y-eefav92/Polybia_occidentalis/experiment2/data/*/raw/*.gz .
    # copy my counts table into the same folder
    cp /Volumes/ritd-ag-project-rd016y-eefav92/Polybia_occidentalis/experiment2/NCBI_counts_table.txt .
Copy the Excel table in this repo (Multispecies...), which provides a basic template for uploading your data.
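
When raw files are gathered from several sample folders with a wildcard, two files with the same name would clash in the single upload folder. The following is a minimal sketch of the linking step above with that check added. The function name `stage_files` and the paths in the usage comment are hypothetical, not part of this repo.

```shell
# Hypothetical helper: symlink each given file into one staging directory.
# Files that are missing, or whose name is already taken in the staging
# directory, are reported and skipped rather than overwritten.
stage_files() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    status=0
    for src in "$@"; do
        name=$(basename "$src")
        if [ ! -e "$src" ]; then
            echo "missing source file: $src" >&2
            status=1
        elif [ -e "$dest/$name" ] || [ -L "$dest/$name" ]; then
            echo "name clash, not linking: $src" >&2
            status=1
        else
            # Use an absolute target so the link works from any directory.
            src_dir=$(cd "$(dirname "$src")" && pwd)
            ln -s "$src_dir/$name" "$dest/$name" || status=1
        fi
    done
    return "$status"
}

# Example call (placeholder paths):
# stage_files ./geo_upload /path/to/experiment/data/*/raw/*.gz
```

A non-zero return value means at least one file was skipped, so it is worth reading the messages before starting the transfer.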

Fill in the Excel spreadsheet

Start by filling in the basic details of the project (see the document, and Examples 1 and 2, for more info), e.g. title, summary, overall design.

HINT: Hovering over the cell titles shows further information about the data type or expected input.

Work out the length of your reads (check manually, ask the sequencing facility, or use FastQC [within the Nextflow rnaseq run]).
Work out the Illumina machine name (ask the sequencing facility).
Add custom characteristics columns with useful features, such as caste (as in the example).
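
Checking the read length manually can be done from the terminal. The sketch below assumes standard four-line FASTQ records (sequence on every second line of four, as in typical Illumina output) and a gzip-compressed file. The function name `read_lengths` and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: tally the read lengths among the first N reads
# (default 10000) of a gzipped FASTQ file.
# Prints one line per distinct length: "<length> <number of reads>".
read_lengths() {
    fastq_gz=$1
    n_reads=${2:-10000}
    gzip -dc "$fastq_gz" \
        | head -n "$((n_reads * 4))" \
        | awk 'NR % 4 == 2 { n[length($0)]++ }
               END { for (len in n) print len, n[len] }' \
        | sort -n
}

# Example call (placeholder file name):
# read_lengths sample1_R1.fastq.gz
```

If the reads were trimmed before upload, several lengths will be listed. In that case the sequencing facility can confirm the original read length.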

Processed data

First, enter the NCBI annotation you used under genome (see the Example 1 tab). If you do not have a genome, remove this line.
Add the data processing steps, briefly explaining which programs were run on the data to produce the processed file(s).
Add the final processed file format and content info.

Raw files

Add the raw file names and data type for each sample.
Add the md5 sums. These are checksums computed from a file's contents, and they are used to confirm that the files were copied correctly to the NCBI database.

On a Mac you use md5, and on Linux (including the cluster) you use md5sum. These are standard terminal commands, and you need to specify the files you want hashes for:
    md5 *.gz

For example, the command above, run on a Mac, gives the hashes for all files in the current directory ending in .gz.

WARNING: This may take 30 minutes or more, depending on the size and number of files.

Copy this md5 table into the Excel spreadsheet, in the section with md5 as a header. Match the hashes up to your dataset by sorting the rows alphabetically.
Add the instrument model, read length, and whether the data are single or paired-end.
Put the paired-end data at the end of the Excel table, pairing R1 then R2. If your data are single-end, leave this section blank.
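
The md5 step above produces differently formatted output on Mac and Linux, which makes pasting into the spreadsheet fiddly. The following is a minimal sketch that writes the same two-column table (file name, then md5) on both systems, already sorted by file name. The function name `md5_table` and the output file name are hypothetical. It relies on `md5sum` where available and otherwise on `md5 -q`, the macOS option that prints only the hash.

```shell
# Hypothetical helper: print "<file name><TAB><md5>" for each given file,
# sorted by file name, using whichever md5 tool is installed.
md5_table() {
    if command -v md5sum >/dev/null 2>&1; then
        tool=md5sum
    elif command -v md5 >/dev/null 2>&1; then
        tool=md5
    else
        echo "neither md5sum nor md5 was found" >&2
        return 1
    fi
    for f in "$@"; do
        if [ "$tool" = md5sum ]; then
            # md5sum prints "<hash>  <file>"; keep the hash only.
            sum=$(md5sum "$f" | awk '{ print $1 }')
        else
            sum=$(md5 -q "$f")
        fi
        printf '%s\t%s\n' "$(basename "$f")" "$sum"
    done | sort
}

# Example call (placeholder output name):
# md5_table *.gz > md5_table.tsv
```

The tab-separated output can be opened or pasted directly into the spreadsheet. The sort order of the terminal and of Excel may differ for unusual file names, so it is still worth checking that names line up row by row.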

Submission to NCBI

Go to https://www.ncbi.nlm.nih.gov/geo/ and log in. Then go to https://www.ncbi.nlm.nih.gov/geo/info/seq.html and click "Transfer files".
Follow their instructions for FileZilla: https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html
Once you are connected to their server, find the folder containing your data on the left-hand side (normally). Then select the files you want to transfer, and drag and drop them over to the server side, into a folder with an appropriate name (e.g. Polybia_GEO_Submission). WARNING: This step could take more than 24 hours, especially if the data are on the RDS, which is very slow at moving files. If the copy breaks, you can always try again later and move the files that did not transfer correctly.
Once the files have all been transferred (Excel table, raw reads, and gene expression counts), go back to GEO and select "notify GEO". Follow their instructions, including choosing a publish date. This date is provisional: you can put three years from now, and you can always remove the data if your paper is not ready by then.
