Quality control of sequencing reads =================================== Course material for the INF-BIOx121 'High Throughput Sequencing technologies and bioinformatics analysis ' course at the University of Oslo, Fall 2015 Conventions in this document ---------------------------- This is normal text For text describing a unix command, e.g. ``grep`` - the command will then be look like this: :: This is a command you need to enter on the command line :: This command has one word HERE that you need to change For example, HERE might be the name of the folder that will contain the output of the command Where is what ------------- All data for this part of the practical is in this folder: ``/data/qc`` You will find several fastq files in that folder. We will start the practical with these two files : :: /data/qc/cod_read1.fastq /data/qc/cod_read2.fastq They contain 1 million randomly sampled reads from a HiSeq 2x100 bp PE (paired end) run Part 1: Understanding reads ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Learning points: - Recognizing the fastq file format - How to prepare and judge a QC report A peak into the fastq files ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Fastq files are very big. In order to be able to view them in a ‘page-by-page’ way, we will use the ``less`` command: :: less /data/qc/cod_read1.fastq This file contains the forward read (‘read 1’) dataset of the run for the sample. Use the space bar to browse through the file. Use ``q`` to go out of the ``less`` program. Make sure you recognize the fastq format, if needed use the slides from today's presentation. **Question:** which of the different Illumina Sequence identifiers are used for these reads? See http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers. Repeat this for the read 2 file: :: less /data/qc/cod_read2.fastq **Question:** do you see whether the reads in the same order in both files? Part 2: quality control of Illumina data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We will be using a program called **FastQC**. The program is available with a graphical user interface, or as a command-line only version. We will use the latter one. It takes a single fastq file (the file can be compressed) as input, and produces a web page (html file) with the results of a number of analyses. +-----------+-----------------+-----------------------------------------------------------------------------------------------------------------------+ | Program | Options | Explanation | +===========+=================+=======================================================================================================================+ | fastqc | | Quality control of sequence data | +-----------+-----------------+-----------------------------------------------------------------------------------------------------------------------+ | . | -o foldername | tells the program to place the output in a folder called foldername instead of in the same folder as the input file | +-----------+-----------------+-----------------------------------------------------------------------------------------------------------------------+ | . | fastq file | file to be analysed by the program | +-----------+-----------------+-----------------------------------------------------------------------------------------------------------------------+ Before we run the program, let’s create a new folder for the output. Do this in your home folder. First, go to your home directory. Remember you can simply type: :: cd Followed by the 'enter' key. Now, we'll make the new folder and move into it: :: mkdir qc cd qc pwd We will be using the *module* system to 'activate' programs (technically, to add them to your environment). To be able to use fastqc, run this command: :: module load fastqc To check what modules we have loaded, type :: module list You should see :: Currently Loaded Modulefiles: 1) fastqc/0.11.2 (for more technical information on the module system, see http://modules.sourceforge.net/). To run fastqc on the first file, run the command below; YOUR\_USERNAME should be the name you used for your folder. Note that the command should be written on a *single line*. Also note where you should put spaces! :: fastqc -o ./ /data/qc/cod_read1.fastq Note that we use '``-o ./``' here, which specifies the current folder '``./``' as location for the output. The program will tell you how far it has come, and should finish in a minute or so. Check that it finished without error messages. In the folder you specified after ``-o``, you should now see a new zip file called ``cod_read1_fastqc.zip``, and an html file called ``cod_read1_fastqc.html`` Download the html file to the local hard disk of the PC/Mac you are using, see the instructions on the course wiki. Open a webbrowser, and, using the menu option ‘Open file’, locate the html file. Alternatively, you could browse the file system and double-click on the file. Study the results. The plot called "Per base sequence quality" shows an overview of the range of quality scores across all based at each position in the fastq file. The y-axis shows quality scores and the x-axis shows the read position. For each read position, a boxplot is used to show the distribution of quality scores for all reads. The yellow boxes represent quality scores within the inter-quartile range (25% - 75%). The upper and lower whiskers represent 10% and 90% point. The central red line shows the median of the quality values and the blue line shows the mean of the quality values. A rule of thumb is that a quality score of 30 indicates a 1 in 1000 probability of error and a quality score of 20 indicates a 1 in 100 probability of error (see the wikipedia page on the fastq format at http://en.wikipedia.org/wiki/Fastq. The higher the score the better the base call. You will see from the plots that the quality of the base calling deteriorates along the read (as is always the case with Illumina sequencing). The plot 'Per tile sequence quality' shows the deviation from the average quality for each tile, i.e. part of the flowcell. The graph allows you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell. The colours are on a cold to hot scale, with cold colours being positions where the quality was at or below the average for that base in the run, and hotter colours indicate that a tile had worse qualities than other tiles for that base. A good run should show a plot that is blue all over. Now, answer these questions: **Questions** - What quality encoding did fastqc determine the quality scores to be in? See also the wikipedia page on the fastq format again - How many reads were there in total in the ``cod_read1.fastq`` file? - How many bases were there in total in the file? - Which part(s) of the reads would you say are of low quality - if any? - Would you have accepted this data if you were given it by your sequencing provider? Repeat the fastqc analysis for the file ``/data/qc/cod_read2.fastq``, which contains the reverse read ('read2'). Open the ``cod_read2_fastqc.html`` in your webbrowser. **Questions** - Are there part(s) of the reads that have a lower quality compared to the ``cod_read1.fastq`` file? - Would you have accepted this data if you were given it by your sequencing provider? NB. You can get more information about the use of the fastqc program by writing :: fastqc -h More read files ^^^^^^^^^^^^^^^ Now run fastqc on the other files in the ``/data/qc`` folder and evaluate the results. We'll discuss these together afterwards: - start with the files called ``more_cod_read*``. How do these compare to the cod reads you looked at before? - then take the ChipSeq and microRNA example read files (they only have one fastq file each) **Question:** which of the different Illumina Sequence identifiers are used for these reads? **Question:** discuss the results with your neighbour, and try to explain the fastqc results for these files. Other programs to try ^^^^^^^^^^^^^^^^^^^^^ You could try the online QC program PRINSEQ on these datasets: `http://edwards.sdsu.edu/prinseq/ `__