Mapping reads to an assembly and visualising the results ======================================================== We will use ``bwa`` for mapping. this is the same program you used for the variant calling module. Set up the environment ^^^^^^^^^^^^^^^^^^^^^^ Load the following modules: :: module load bwa/0.7.8 module load samtools/1.1 module load reapr/1.0.18 Indexing the assembly ^^^^^^^^^^^^^^^^^^^^^ Your new assembly now becomes the 'reference' for ``bwa``. ``bwa`` needs an index of the sequences to make mapping go faster. For large genomes such as the human genome, this takes a long time. For the small bacterial genome we work with here this is very fast. Move (using ``cd``) to the folder with your final assembled sequences, i.e. the ``velvet_pe+mp.fa`` file when you first do this. Index the fasta file with: :: bwa index -a bwtsw ASSEMBLY.FASTA Replace ``ASSEMBLY.FASTA`` with the name of your fasta file. Run ``ls`` to check the results, you should see a couple of new files. Mapping paired end reads ^^^^^^^^^^^^^^^^^^^^^^^^ Mapping the reads using ``bwa mem`` yields SAM output. Instead of saving this output to disk, we will immediately convert it to a sorted (binary) BAM file by piping into the ``samtools``\ program. 'Sorted' here means that the alignments of the mapped reads are in the order of the reference sequences, rather than random. Finally, we will generate an index of the sorted BAM file for faster searching later on. First, create a new folder *in the same folder as the ``ASSEMBLY.FASTA`` file* and ``cd`` into it: :: mkdir bwa cd bwa Then do the mapping: :: bwa mem -t 2 ../ASSEMBLY.FASTA \ /data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \ /data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq \ | samtools view -buS - | samtools sort - map_pe.sorted Generate an index of the BAM file: :: samtools index map_pe.sorted.bam Explanation of some of the parameters: - ``../`` means 'look in the folder one level up', i.e. where the fasta file is - ``-t 2``\ tells ``bwa mem`` to use 2 threads (cpus) - ``-buS``\ tells ``samtools view`` that the input is in SAM format (``S``) and to output uncompressed (``u``) BAM format (``b``). - the ``-`` for both ``samtools`` commands indicate that instead of using a file as input, the input comes from a pipe (technically, from 'standard in', or 'STDIN'). - ``map_pe.sorted`` tells ``samtools view`` to call the outputfile ``map_pe.sorted.bam`` If you would like to have a look at the alignments in the BAM file (which is in binary format), use ``samtools view``\ again: :: samtools view map_pe.sorted.bam |less Mapping mate pairs ^^^^^^^^^^^^^^^^^^ Repeat the ``bwa mem`` and ``samtools`` commands above, but: - use the mate pair reads ``Nextera_MP_R1_50x.fastq`` and ``Nextera_MP_R2_50x.fastq`` - change the output name to ``map_mp.sorted`` Plotting the insert size distribution ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Since we know know where the pairs of reads map, we can obtain he distance between them. That information is stored in the SAM/BAM output in the 9th column, 'TLEN' (observed Template LENgth). We will use python, and the python module ``pysam`` to plot the distribution of insert sizes for a subset of the alignments. This we will do in another Jupyter notebook. - copy the ``bwa`` folder with the sorted ``.bam`` files and index files (``.bam.bai``) from the server to the assembly folder on your local Linux machine - copy the notebook file ``/data/assembly/Plot_insertsizes.ipynb`` to the same folder on your local Linux machine - in the terminal, ``cd`` to the same folder - open the Jupyter notebook :: ipython notebook Plot_insertsizes.ipynb - execute the cells as listed - for ``infile``, use the name of the sorted BAM file for the mapping of the paired end or mate pair reads - generate plots for both the paired end mapping *and* the mate pair mapping **Questions** - Which insert size distribution is the tightest around the mean? - Why isn't the mean of the distribution a useful metric for the mate pair library? Visualising the assembly in a genome browser ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For this part, we will use Integrative Genomics Viewer (IGV), a genome browser developed by the Broad Institute. Instead of using one of the built-in genomes, we will add the assembly as a new reference genome. **On the PC** (*NOT* on the server): - download the assembly ``fasta`` file to the folder where you also have the ``bwa`` result files - start the IGV program by typing ``igv`` - Choose ``Genomes --> Load Genome from File…`` (**NB** not File --> Load from File...) - Select the ``fasta`` file with your assembly (**NB** the same file as you used for mapping the reads against!) | **Adding the mapped reads** | Adding tracks to the browser is as simple as uploading a new file: - Choose ``File --> Load from File…`` - Choose the sorted ``bam`` file of the paired end mapping - Repeat this for the ``bam`` file of the mate pair mapping - You can choose different sequences (contigs/scaffolds) from the drop-down menu at the top. Start by selecting (one of) the longest scaffold(s) - Start browsing! - Zoom in to see the alignments **Question:** - Do you see differences between some of the reads relative to the reference? What are these? - Is coverage even? Are there gaps in the coverage, or peaks? Where? Adding the locations of gaps as another track ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ It would be convenient to be able to see the location of gaps in the browser. For this purpose use a script made by your teacher that creates a ``bed`` file with gap locations. We will use 10 bases as minimum gap length: ``-m 10``. The scuipt uses BioPython so the 'python2' module is needed for it to run. :: module load python2/2.7.9 scaffoldgap2bed.py -i ASSEMBLY.FASTA -m 10 >gaps.bed - Inspect the BED file - Add the BED file to the browser (download it first to the PC) - Drag the track to the top - Zoom in one gaps and look at the alignments **Question:** - Check for some gaps whether they are spanned by mate pairs? Tip: choose 'view as pairs' for the tracks Saving the IGV session ^^^^^^^^^^^^^^^^^^^^^^ We will get back to this assembly browser, so save your session: ``File --> Save Session…``