Gene Expression: RNA-seq

Gene expression is a key determinant of cellular phenotypes. While high-throughput microarrays have been the predominant technology for measuring gene expression for years, all hybridization-based technologies are subject to biases and limitations, such as reliance on known gene models and potential for cross-hybridization to probes with similar sequences. Microarrays offer a limited ability to fully catalog and quantify the diverse spectrum of RNA molecules expressed from cellular genomes, and are of little use for detecting previously unknown transcripts, novel exons, or exons with novel splice junctions.

Next-generation sequencing has revolutionized transcriptomics by enabling massively parallel sequencing of cDNA (RNA-seq). In addition to overcoming the limitations mentioned above for microarrays, RNA-seq provides a digital readout of gene expression, enabling a higher sensitivity and an essentially unlimited dynamic range.
Diagram of workflow uva


The core currently offers the following bioinformatics analysis workflow for RNA-seq experiments:

Pre-alignment quality assessment

  • Per-base sequence quality
  • Per-base sequence content
  • Per-base GC content
  • Search for overrepresented sequences (adapters, primers, etc)

Alignment to a reference genome using TopHat

  • Homo sapiens
  • Mus musculus
  • Rattus norvegicus
  • Bos taurus
  • Canis familiaris
  • Gallus gallus
  • Drosophila melanogaster
  • Arabidopsis thaliana
  • Caenorhabditis elegans
  • Saccharomyces cerevisiae

Post-alignment quality assessment

  • Flagging duplicate reads
  • Estimation of library complexity
  • Insert size distribution (for paired-end sequencing)
  • Analysis of coverage over transcript position

Transcript assembly

Differential expression testing

  • Isoforms: tests of differential expression for each isoform.
  • Genes: tests of differential expression summing expression of all transcripts at the gene level.
  • Primary transcripts: tests of differential expression summing expression of all transcripts sharing the same transcription start site, regardless of coding sequence.
  • Coding sequence: tests of differential expression summing expression of all transcripts sharing the same coding sequence, regardless of transcription start site.

Differential splicing tests

For each primary transcript, this tests the amount of overloading detected among isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript.

Differential coding output

For each gene, this tests the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples.

Differential promoter use

For each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples.


Assistance with visualization using IGV.


Q: What software packages do you use in your RNA-seq pipeline?

The core uses a combination of custom-built and open-source application-specific software, including the Tuxedo suite of tools (Bowtie, Tophat, Cufflinks, and cummeRbund) for RNA-seq alignment, assembly, transcript abundance estimation, and differential expression testing. The STAR aligner may also be used for alignment. When performing RNA-seq for examining expression at the gene level only, count-based negative binomial methods such as DESeq or edgeR will be used.

Q: How much sequence do I need?

RNA-seq has the unquestionable advantage over current hybridization-based microarrays. However, with the cost per sample of RNA-Seq still higher than microarray, it can be beneficial to multiplex samples in a single sequencing reaction. But how many reads are sufficient?

RNA-seq is a counting application – coverage is estimated in the number of reads (as opposed to genome sequencing, where coverage is measured in gigabases). Most published RNA-seq papers have sequenced from as little as 20-million reads per sample, up to as many as 100-million reads per sample, with an average of about 40-million reads per sample. The number of reads required depends on your research question. At the lower end (20M), you can really only focus on expression at the gene level (although this data is still arguably better than microarrays). At the upper end (100M) you can confidently examine novel and rarely expressed transcripts.

The ENCODE consortium’s Standards, Guidelines and Best Practices for RNA-Seq recommends

“Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M pair-end reads of length > 30NT, of which 20-25M are mappable to the genome or known transcriptome.”

and for

“Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms… a minimum depth of 100-200 M 2 x 76 bp or longer reads is currently recommended.”

Estimating the depth of sequencing coverage (i.e., the average number of times each nucleotide is sequenced) is somewhat different for RNA-seq than for genome sequencing, and involves several assumptions. Sequencing depth in RNA-seq is calculated as (NT sequenced / #mRNAs per cell) / (#nucleotides/mRNA). For example, if you have 50-million 100-bp paired end reads, you have 10-billion, or 10^10 NTs sequenced. Assuming you have 2e6 mRNAs per cell and an average transcript length of 1500 bases: (10^10 nt / 2e6 mRNAs-per-cell) / (1500NT/mRNA) = 3X coverage of an mRNA present at one copy per cell.

Q: Which sequencing platform should I use?

See Table 1, “Comparison of Sequencing Technologies” from this paper in the January 2012 issue of Nature Reviews Genetics for a comparison of read length, insert size, run time, reads per run, error rate, and relative cost per nucleotide for several common sequencing platforms, including Roche 454, Illumina GAIIx, Illumina HiSeq 2000, Illumina MiSeq, PacBio, and Ion Torrent. The DNA Sciences Core has Illumina MiSeq and NextSeq instruments. The MiSeq has a short run time, its primary applications are for targeted resequencing studies (e.g. following up GWAS hits in a small region), amplicon sequencing, microbial identification (where assembly is not an issue), or targeted sequencing of small regions in a clinical setting. The MiSeq’s throughput is too low to make RNA-seq studies in complex genomes cost-effective. If you would like to schedule a joint meeting with directors from both the Bioinformatics Core and the DNA Sciences  Core, please email us and indicate this on your consultation request form.

Q: How much will this cost me?

Bioinformatics costs are separate from sequencing costs, and are the same whether data is generated here at the DNA Sciences Core or elsewhere through an external vendor. Read the about page, and submit a consultation request form.