Quality control of sequencing reads

Course material for the INF-BIOx121 ‘High Throughput Sequencing technologies and bioinformatics analysis ‘ course at the University of Oslo, Fall 2016

Conventions in this document

This is normal text

For text describing a unix command, e.g. grep - the command will then be look like this:

This is a command you need to enter on the command line
This command has one word HERE that you need to change

For example, HERE might be the name of the folder that will contain the output of the command

Where is what

All data for this part of the practical is in this folder:

/data/qc

You will find several fastq files in that folder. We will start the practical with these two files :

/data/qc/cod_read1.fastq  
/data/qc/cod_read2.fastq

They contain 1 million randomly sampled reads from a HiSeq 2x100 bp PE (paired end) run

Understanding reads

Learning points:

  • Recognizing the fastq file format
  • How to prepare and judge a QC report

A peak into the fastq files

Fastq files are very big. In order to be able to view them in a ‘page-by-page’ way, we will use the less command:

less /data/qc/cod_read1.fastq

This file contains the forward read (‘read 1’) dataset of the run for the sample. Use the space bar to browse through the file. Use q to go out of the less program. Make sure you recognize the fastq format, if needed use the slides from today’s presentation.

Question: which of the different Illumina Sequence identifiers are used for these reads? See http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers.

Repeat this for the read 2 file:

less /data/qc/cod_read2.fastq

Question: do you see whether the reads in the same order in both files?

Quality control of Illumina data

We will be using a program called FastQC. The program is available with a graphical user interface, or as a command-line only version. We will use the latter one. It takes a single fastq file (the file can be compressed) as input, and produces a web page (html file) with the results of a number of analyses.

Before we run the program, let’s create a new folder for the output. Do this in your home folder. First, go to your home directory. Remember you can simply type:

cd

Followed by the ‘enter’ key.

Now, we’ll make the new folder and move into it:

mkdir qc
cd qc
pwd

To run fastqc on the first file, run the command below. Note that the command should be written on a single line. Also note where you should (and should not) put spaces!

fastqc -o ./ /data/qc/cod_read1.fastq

We use ‘-o ./‘ here, which specifies the current folder ‘./‘ as location for the output.

The program will tell you how far it has come, and should finish in a minute or so. Check that it finished without error messages.

In the folder you specified after -o, you should now see a new zip file called cod_read1_fastqc.zip, and an html file called cod_read1_fastqc.html

Open a webbrowser, and, using the menu option ‘Open file’, locate the html file. Alternatively, you could browse the file system and double-click on the file.

Study the results.

The plot called “Per base sequence quality” shows an overview of the range of quality scores across all based at each position in the fastq file. The y-axis shows quality scores and the x-axis shows the read position. For each read position, a boxplot is used to show the distribution of quality scores for all reads. The yellow boxes represent quality scores within the inter-quartile range (25% - 75%). The upper and lower whiskers represent 10% and 90% point. The central red line shows the median of the quality values and the blue line shows the mean of the quality values.

A rule of thumb is that a quality score of 30 indicates a 1 in 1000 probability of error and a quality score of 20 indicates a 1 in 100 probability of error (see the wikipedia page on the fastq format at http://en.wikipedia.org/wiki/Fastq. The higher the score the better the base call. You will see from the plots that the quality of the base calling deteriorates along the read (as is always the case with Illumina sequencing).

The plot ‘Per tile sequence quality’ shows the deviation from the average quality for each tile, i.e. part of the flowcell. The graph allows you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell. The colours are on a cold to hot scale, with cold colours being positions where the quality was at or below the average for that base in the run, and hotter colours indicate that a tile had worse qualities than other tiles for that base. A good run should show a plot that is blue all over.

Now, answer these questions:

Questions

  • What quality encoding did fastqc determine the quality scores to be in? See also the wikipedia page on the fastq format again
  • How many reads were there in total in the cod_read1.fastq file?
  • How many bases were there in total in the file?
  • Which part(s) of the reads would you say are of low quality - if any?
  • Would you have accepted this data if you were given it by your sequencing provider?

Repeat the fastqc analysis for the file /data/qc/cod_read2.fastq, which contains the reverse read (‘read2’).

Open the cod_read2_fastqc.html in a webbrowser.

Questions

  • Are there part(s) of the reads that have a lower quality compared to the cod_read1.fastq file?
  • Would you have accepted this data if you were given it by your sequencing provider?

NB. You can get more information about the use of the fastqc program by writing

fastqc -h

Trimming reads

We want to do the following with the raw reads:

  • reads may contain adaptor sequences left from library preparation, these we want to remove
  • remove bad-quality parts of reads

For this, we will use a program called AdapterRemoval. One useful feature of this program is that it can help us identify the sequences of the adapters that may be present in some of the reads.

Run this as follows:

AdapterRemoval --identify-adapters \
--file1 /data/qc/cod_read1.fastq \
--file2 /data/qc/cod_read2.fastq

NOTE: here this long command is split over several lines to make it more readable. This can be done in unix by typing a space, the backslash symbol \ and pressing the return key. You can type this command as such into the terminal, or type it one line:

AdapterRemoval --identify-adapters --file1 /data/qc/cod_read1.fastq --file2 /data/qc/cod_read2.fastq

Study the and check the suggested adaptor sequences by simply Googling them. If there are strings of NNNN in the proposed adaptor, only take the part before these N‘s.

Now we can remove the adaptors, as well as some low quality bases, using the same program:

AdapterRemoval \
--file1 /data/qc/cod_read1.fastq \
--file2 /data/qc/cod_read2.fastq \
--adapter1 SEQUENCE_OF_ADAPTOR1 \
--adapter2 SEQUENCE_OF_ADAPTOR2 \
--basename cod-read_trimmed --trimns --trimqualities 

For details of what these parameters mean, run AdapterRemoval without any options.

Study the output files and try to find out which files now contain the trimmed paired-end reads. Run fastqc on these reads, and compare the results to the fastqc results of the raw reads.

More read files

Now run fastqc on the other files in the /data/qc folder and evaluate the results. We’ll discuss these together afterwards:

  • start with the files called more_cod_read*. How do these compare to the cod reads you looked at before?
  • then take the ChipSeq and microRNA example read files (they only have one fastq file each)

Question: which of the different Illumina Sequence identifiers are used for these reads?

Question: discuss the results with your neighbour, and try to explain the fastqc results for these files.

Other programs to try

You could try the online QC program PRINSEQ on these datasets: http://edwards.sdsu.edu/prinseq/