RNA-seq

In this section, we will cover one of the most common NGS analyses: RNA-seq. The biological question we want to answer is: how does the expression of the genes in a sample change after a treatment is applied? Here, a treatment can be anything: a gene knock-out, a drug, a change in diet, etc.

A basic RNA-seq experiment consists of the following steps:

1. Clean reads and read QC
Reads coming off the sequencing machine usually carry adapter sequences, and some reads may be too short to be mapped correctly. In this step we assess the quality of the reads and remove adapters if necessary.
2. Generate genome indices
Mapping reads to a genome is usually a computationally intensive process. To accelerate read mapping, the genome is first "indexed", so that the read mapping software can search for read matches more efficiently.
3. Map reads to genome
In this step, we use a read mapping software to map our reads to the genome. This step will report how the reads are distributed across the genome.
4. Feature counting
In this step, the idea is to count how many reads overlap each gene in the genome. The number of reads per gene is going to be our measure of gene expression.

Software

I tested this analysis on a Linux environment. Mac computers should work out of the box as well. If you are on Windows, you can look into how to install the Windows Subsystem for Linux (WSL).

For this analysis, the following software is needed: FastQC, SRA-tools, Trim Galore, Hisat2, samtools, and Subread (which provides featureCounts).

All of it can be installed with a single command from a Conda environment: conda install fastqc sra-tools hisat2 trim-galore samtools subread
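If you prefer a reproducible setup, the same tool set can be captured in an environment file. This is only a sketch: the environment name rnaseq is arbitrary, and no versions are pinned.

```yaml
# environment.yml -- hypothetical name; create with: conda env create -f environment.yml
name: rnaseq
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - fastqc
  - sra-tools
  - hisat2
  - trim-galore
  - samtools
  - subread
```

After creating it, activate the environment with conda activate rnaseq.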

If you need help installing Conda and activating an environment, you can follow the instructions in the next section.

Conda

Conda is a system that lets us install programs, packages, and modules in a project-specific manner. This is useful in our case because the installation lives in our user directory, so we don't need administrator privileges. Moreover, because environments are project-specific, we can keep several of them without worrying about packages in one project interfering with packages in another.

Here we are going to install Miniconda and set up Bioconda, a Conda channel specialized for bioinformatics software. The following steps are taken from the Bioconda website:

# Starting from scratch
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
# Follow the instructions

# Now add channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

That's it. If you let the installer initialize conda in your .bashrc (highly recommended), your base environment will be active every time you open a terminal. To check that it is, run which python in your terminal; you should see the path to the Python inside your Conda environment.

Data acquisition

We will use yeast mRNA-seq data from the SRA accession SRR453566.

The following code will download the reads, plus the genome and gene annotation needed for this analysis:


        # Download reads
        fastq-dump --split-files SRR453566
        # move reads to reads directory (create if necessary)
        mkdir -p reads
        mv SRR453566_1.fastq SRR453566_2.fastq reads
        # Create folder for genome and annotation
        mkdir -p yeast/{genome,genes}
        # Download yeast genome and save to genome directory
        curl http://ftp.ensembl.org/pub/release-104/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz | gunzip > yeast/genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
        # Download yeast genes and save to genes directory
        curl https://ftp.ensembl.org/pub/release-104/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.104.gtf.gz | gunzip > yeast/genes/Saccharomyces_cerevisiae.R64-1-1.104.gtf
    
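Once the downloads finish, a quick sanity check is to count the sequences in the genome FASTA, since each record begins with a ">" header line. The snippet below demonstrates the pattern on a tiny made-up FASTA; point the same grep at yeast/genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa for the real file (the yeast assembly should contain the 16 nuclear chromosomes plus the mitochondrial sequence).

```shell
# Tiny made-up FASTA, standing in for the real genome file
cat > demo_genome.fa <<'EOF'
>chrI
ACGTACGT
>chrII
TTGGCCAA
EOF

# Each sequence record begins with a '>' header, so counting headers counts sequences
n_seqs=$(grep -c '^>' demo_genome.fa)
echo "sequences: $n_seqs"   # prints "sequences: 2"
```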

Read cleaning and QC

Read quality check

It is always a good idea to monitor the quality of NGS data at every step. Here, we will check the quality of the reads before and after trimming adapters. I like to do this because I can then see that the reads come with adapters (in the pre-trimming check), and that those adapters were successfully removed (in the post-trimming check). Also, in some applications (e.g. environmental samples, extracellular samples), read quality (particularly read length) will appear good before trimming, but not after. If that is the case, this double-checking approach is useful because we can start to rule out sequencing errors (the adapters were correctly sequenced) and focus on optimizing the sampling process to reduce the chances of RNA degradation.

Adapter trimming

Removing adapters is a very important step of any NGS-based analysis. It is safe to assume that the reads we get from the sequencer include some sort of adapter (plus UMIs, barcodes, etc., depending on the application). To double-check, those adapters should show up as over-represented sequences in the first read QC.

Here we will use Trim Galore, an adapter trimmer that automatically detects the most commonly used adapters and removes them, returning the read file (or files) with adapters removed. For more customized experiments (especially those involving multiplexing or UMIs), more "manual" tools (e.g. Cutadapt) are recommended.

The following commands will run an initial read quality check, followed by adapter trimming and a final read quality check:


            # First QC. Results will be stored under qc/pre_trimming
            # Create output folder
            mkdir -p qc/pre_trimming
            # Run QC. Here, -t 16 uses 16 threads. Adjust this parameter to your system
            fastqc -t 16 -o qc/pre_trimming reads/SRR453566_1.fastq reads/SRR453566_2.fastq

            # Remove adapters
            trim_galore -o reads/trimmed -j 16 --paired reads/SRR453566_1.fastq reads/SRR453566_2.fastq


            # Run post-trimming QC
            mkdir -p qc/post_trimming
            fastqc -t 16 -o qc/post_trimming reads/trimmed/SRR453566_1_val_1.fq reads/trimmed/SRR453566_2_val_2.fq
        
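With --paired, Trim Galore keeps the two output files in sync, but it is cheap to confirm that read 1 and read 2 still contain the same number of reads before mapping. A FASTQ record is exactly four lines, so reads = lines / 4. Below is a sketch on two tiny mock files; substitute the trimmed files above for the real check.

```shell
# Two mock paired FASTQ files with two reads each (stand-ins for the trimmed reads)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' > demo_R1.fq
printf '@r1\nTGCA\n+\nIIII\n@r2\nAATT\n+\nIIII\n' > demo_R2.fq

# A FASTQ record is 4 lines: header, sequence, '+', qualities
r1=$(( $(wc -l < demo_R1.fq) / 4 ))
r2=$(( $(wc -l < demo_R2.fq) / 4 ))
echo "R1: $r1 reads, R2: $r2 reads"
if [ "$r1" -eq "$r2" ]; then echo "pair counts match"; else echo "WARNING: pair counts differ"; fi
```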

Genome indexing

In this analysis, we will use Hisat2. Hisat2 indices consist of a series of files with the .ht2 suffix. Since we are going to use these indices for RNA-seq, Hisat2's index builder can take exon and splice-site information into account. To that end, Hisat2 provides scripts to extract splice-site and exon information from gene annotation data in GTF format. The commands needed to generate those files and the genome indices are:


      # Generate splice sites file
      hisat2_extract_splice_sites.py yeast/genes/Saccharomyces_cerevisiae.R64-1-1.104.gtf > yeast/genes/yeast_splice_sites.txt
      # Generate exons file
      hisat2_extract_exons.py yeast/genes/Saccharomyces_cerevisiae.R64-1-1.104.gtf > yeast/genes/yeast_exons.txt
      # generate Indices
      hisat2-build --exon yeast/genes/yeast_exons.txt --ss yeast/genes/yeast_splice_sites.txt yeast/genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa yeast/genome/yeast
  

This command will generate the files yeast.1.ht2, yeast.2.ht2... yeast.8.ht2 under the yeast/genome/ folder.

Read mapping

This is the main section of the analysis. As mentioned in the genome indexing part, we are going to use Hisat2. IMPORTANT: for this example we use generic options that tend to work for basic analyses. However, it is critical that you look into the parameters of every app mentioned here, in case there are arguments or options specific to your use case. Now, the code:



            # create folders
            mkdir -p mapping

            # mapping
            hisat2 -p 16 --known-splicesite-infile yeast/genes/yeast_splice_sites.txt -x yeast/genome/yeast -1 reads/trimmed/SRR453566_1_val_1.fq -2 reads/trimmed/SRR453566_2_val_2.fq | samtools view --threads 16 -bS - | samtools sort -@ 16 -o mapping/SRR453566.bam -
        

Ok, there is a lot going on in that last command. Let's break it down:

hisat2
Run the Hisat2 read mapper
-p
Number of threads
--known-splicesite-infile
Use splice site information from this file
-x
Base name of the Hisat2 index
-1, -2
Paths to read 1 and read 2 (paired-end example)
samtools view -bS
Convert the SAM output to compressed BAM
--threads
Number of threads
samtools sort -
Sort the alignments, reading from standard input ("-"); -o sets the final output file
-@
Number of threads
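After mapping, it is worth checking the alignment rate: hisat2 prints a summary to stderr, and samtools flagstat mapping/SRR453566.bam reports it from the BAM. Since the real BAM is not available in this snippet, the awk line below is demonstrated on a mocked flagstat-style excerpt; pipe the real flagstat output into the same awk command.

```shell
# Mocked excerpt of "samtools flagstat" output (the real call would be:
#   samtools flagstat mapping/SRR453566.bam)
flagstat='10000 + 0 in total (QC-passed reads + QC-failed reads)
9700 + 0 mapped (97.00% : N/A)'

# Pull the percentage out of the "mapped" line
pct=$(printf '%s\n' "$flagstat" | awk '/ mapped \(/ {gsub(/[(%]/, "", $5); print $5; exit}')
echo "mapped: $pct%"   # prints "mapped: 97.00%"
```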

Feature counting

Since we are performing an RNA-seq analysis, we need to translate the mapped reads from the previous step into gene coordinates. That way, we can count how many reads mapped to each gene in the genome. That is the purpose of this step. The feature counter we will use is featureCounts, from the Subread package. Without further ado, the code:


            # create final directory
            mkdir -p features
            # -p tells featureCounts that the data is paired-end
            featureCounts -p -T 16 -t exon -g gene_id -a yeast/genes/Saccharomyces_cerevisiae.R64-1-1.104.gtf -o features/SRR453566.txt mapping/SRR453566.bam
        

To reiterate, all apps used here have been run with parameter values that tend to fit the most commonly run experiments. Take a look at the parameters -t exon and -g gene_id: reads are counted over exon features, and the output is summarized at the gene level. Depending on your specific analysis, this may or may not be appropriate. Note, for example, that if you need to count each isoform independently, you might need to change -g gene_id to -g transcript_id, assuming that your annotation file (GTF) follows Ensembl's naming convention.

And that's it. The file features/SRR453566.txt is a table where each row is a gene: the first column is the gene ID, the next five columns give its coordinates and length, and the last column contains the number of reads assigned to that gene.
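For quick inspection, or to feed tools that expect a plain two-column gene/count table, the relevant columns can be pulled out with awk: featureCounts output starts with a "#" comment line and then a header line, so we skip the first two lines. A sketch on a mocked-up table with the same layout (the real file is tab-separated, but awk's default field splitting handles both); run the same awk on features/SRR453566.txt for the real data.

```shell
# Mocked featureCounts output: a '#' comment line, a header line, then one row per gene
cat > demo_counts.txt <<'EOF'
# Program:featureCounts
Geneid Chr Start End Strand Length mapping/SRR453566.bam
YAL001C I 147594 151166 - 3573 120
YAL002W I 143707 147531 + 3825 80
EOF

# Skip the two header lines; keep gene ID (column 1) and read count (column 7)
awk 'NR > 2 {print $1 "\t" $7}' demo_counts.txt > demo_gene_counts.txt
cat demo_gene_counts.txt
```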

Final note: downstream of this step, you might run a differential gene expression analysis. Different differential expression apps need different input files. I personally like featureCounts because it has a built-in module to manage deduplication with UMIs. However, if another input format is needed, most feature counters take the same input, the .bam file we generated in the previous step, so feel free to mix and match feature counters.