When I was a student, my advisor John Storey made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.
- It got me caught up on the field of computational genomics
- It was expertly curated, so it filtered a lot of papers I didn't need to read
- It gave me my first set of ideas to try to pursue as I was reading the papers
I have often thought I should make a similar list for folks who may want to work with me (or who want to learn about statistical genomics), so this is my attempt at that list. I've tried to separate the papers into categories, and I've probably missed important papers. I'm happy to take suggestions for the list, but it is primarily designed for people in my group, so I might be a little bit parsimonious.
- Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid - the paper describing the structure of DNA, which marked the very beginning of the genomics revolution. The authors won a Nobel Prize, and they used data from Rosalind Franklin to do it.
- Central dogma of molecular biology - by one of the people who discovered the structure of DNA, outlines the main information flow from DNA to proteins (which then flow to phenotypes).
- Next-generation DNA sequencing - introduces the main technology used today to measure DNA, RNA, protein-DNA binding, epigenetic marks like DNA methylation, chromatin folding, etc.
- Ultrafast and memory-efficient alignment of short DNA sequences to the human genome - a paper describing a very fast way to align sequence reads to the genome. One of the first to do this.
- A gene expression barcode for microarray data - a paper describing the way that genes are expressed ("turned on") or not expressed ("turned off") in microarray data.
- RNA-Seq: a revolutionary tool for transcriptomics - introduces RNA-sequencing, the main type of data we look at in the Leek group.
- Mapping and quantifying mammalian transcriptomes by RNA-seq - introduces many of the key computational issues in RNA-seq analysis.
- From RNA-seq reads to differential expression results - probably the single best review of RNA-seq analysis written.
- IVT-seq reveals extreme bias in RNA sequencing - an important paper showing potential sources of bias in RNA-seq analysis using experimental data.
- Sequencing technology does not eliminate biological variability - a paper describing the sources of variability in genomic experiments and the importance of different types of replicates.
- Linear models and empirical Bayes methods for assessing differential expression in microarray experiments - introduces a general linear modeling framework, including the most successful use of variance shrinkage to date. This is the first paper behind the limma package.
- Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments - one of the first papers to describe basic statistical modeling for RNA-seq, covers many of the most important issues.
- voom: precision weights unlock linear model analysis tools for RNA-seq read counts - updates the limma framework to sequencing experiments.
- edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, and Differential expression analysis for sequence count data - two papers on how to do differential expression using read counts, based on shrinkage ideas similar to those in limma.
- Statistical significance for genome-wide studies - introduces the basic concepts behind high-dimensional multiple testing and the false discovery rate in an approachable way.
- Tackling the widespread and critical impact of batch effects in high-throughput data - talks about batch effects, one of the most common confounders in genomic studies, and how to address them; related software is the sva package. There are other confounders as well, and this paper talks about some of them: http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.0030161
- Statistical inferences for isoform expression in RNA-Seq - a very useful paper for defining a model for isoform expression in RNA-seq.
- Gene set enrichment analysis made simple - a paper describing a simple approach to identifying gene sets that are enriched for differential expression.
- Tidy data - a paper I really like, in which Hadley Wickham describes the proper organization of data sets.
- The Leek group guide to data sharing - how to organize data you are working on.
- Bioconductor: open software development for computational biology and bioinformatics - introduces the Bioconductor project, the most successful project in genomic software development to date.
- Scalable genomics with R and Bioconductor - how to analyze large genomic data sets in R using Bioconductor.
- The Leek group guide to writing R packages - how to make R packages.
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks - protocols for using Cufflinks and TopHat, two common pieces of software we use.
Obviously all of them
- Sequencing technology does not eliminate biological variability - a paper describing the sources of variability in genomic experiments and the importance of different types of replicates.
- Differential expression analysis of RNA-seq data at single-base resolution - proposes a new approach to finding differentially expressed regions in the human genome.
- Flexible analysis of transcriptome assemblies with Ballgown - describes a statistical backend for popular transcript assembly algorithms.
- Phylogenies from molecular sequences: inference and reliability - a slightly older review of phylogenetic inference by one of the real titans in the area.
- On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data - probably the first real description of empirical Bayes approaches for genomics data.
- The allelic architecture of human disease genes: common disease-common variant...or not? - a discussion of whether common genetic variants lead to common diseases.
- Genetic Dissection of Transcriptional Regulation in Budding Yeast - the paper that launched the whole area of eQTL analysis, which is really hot right now.
- Inference of population structure using multi-locus genotype data - one of the most foundational papers on how we infer population structure.