Assembly using velvet ===================== *De novo* assembly of Illumina reads using velvet ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Assembling short-reads with Velvet ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We will use Velvet to assemble Illumina reads on their own. Velvet uses the *de Bruijn graph* approach. We will assemble *E. coli K12* strain MG1655 which was sequenced on an Illumina MiSeq. The instrument read 150 bases from each direction. We wil first use paired end reads only: | ``/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq`` | ``/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq`` Building the Velvet Index File ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Velvet requires an index file to be built before the assembly takes place. We must choose a *k-* mer value for building the index. Longer *k-* mers result in a more stringent assembly, at the expense of coverage. There is no definitive value of *k* for any given project. However, there are several absolute rules: - *k* must be less than the read length - it should be an odd number Firstly we are going to run Velvet in single-end mode, *ignoring the pairing information*. Later on we will incorporate this information. First, we need to make sure we can use velvet: Set up the environment ^^^^^^^^^^^^^^^^^^^^^^ For this part of the course, *every time you log into the server* you need to execute the command below. IMPORTANT do not use spaces between ``PATH=`` and ``$PATH``! :: export PATH=/data/bin/:$PATH To be able to use velvet, load the following module: :: module load velvet Now, 'go home': :: cd ~ or simply type :: cd Create the assembly folder: :: mkdir assembly cd assembly mkdir velvet cd velvet A first assembly ^^^^^^^^^^^^^^^^ Find a value of *k* (between 21 and 113) to start with, and record your choice in this google spreadsheet: `bit.ly/INFBIO1 `__. Run ``velveth`` to build the hash index (see below). +-----------+----------------+-------------------------------------------------------+ | Program | Options | Explanation | +===========+================+=======================================================+ | velveth | | Build the Velvet index file | +-----------+----------------+-------------------------------------------------------+ | | foldername | use this name for the results folder | +-----------+----------------+-------------------------------------------------------+ | | value\_of\_k | use k-mers of this size | +-----------+----------------+-------------------------------------------------------+ | | -short | short reads (as opposed to long, Sanger-like reads) | +-----------+----------------+-------------------------------------------------------+ | | -separate | read1 and read2 are in separate files | +-----------+----------------+-------------------------------------------------------+ | | -fastq | read type is fastq | +-----------+----------------+-------------------------------------------------------+ Build the index as follows: :: velveth ASM_NAME VALUE_OF_K \ -short -separate -fastq \ /data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \ /data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq **NOTES** - Change ``ASM_NAME`` to something else of your choosing - Change ``VALUE_OF_K`` to the value you have picked - The command is split over several lines by adding a space, and a ``\`` (backslash) to each line. This trick makes long commands more readable. If you want, you can write the whole command on one line instead. After ``velveth`` is finished, look in the new folder that has the name you chose. You should see the following files: :: Log Roadmaps Sequences The '``Log``' file has a useful reminder of what commands you typed to get this assembly result, for reproducing results later on. '``Sequences``' contains the sequences we put in, and '``Roadmaps``' contains the index you just created. Now we will run the assembly with default parameters: :: velvetg ASM_NAME Velvet will end with a text like this: ``Final graph has ... nodes and n50 of ..., max ..., total ..., using .../... reads`` The number of nodes represents the number of nodes in the graph, which (more or less) is the number of contigs. Velvet reports its N50 (as well as everything else) in 'kmer' space. The conversion to 'basespace' is as simple as adding k-1 to the reported length. Look again at the folder ``ASM_NAME``, you should see the following extra files: | ``contigs.fa`` | ``Graph`` | ``LastGraph`` | ``PreGraph`` | ``stats.txt`` The important files are: | ``contigs.fa`` - the assembly itself | ``Graph`` - a textual representation of the contig graph | ``stats.txt`` - a file containing statistics on each contig **Questions** - What k-mer did you use? - What is the N50 of the assembly? - What is the size of the largest contig? - How many contigs are there in the ``contigs.fa`` file? Use ``grep -c NODE contigs.fa``. Is this the same number as velvet reported? Log your results in this google spreadsheet: ``bit.ly/INFBIO1`` **We will discuss the results together and determine *the optimal* k-mer for this dataset.** **Advanced tip:** You can also use VelvetOptimiser to automate this process of selecting appropriate *k*-mer values. VelvetOptimizer is included with the Velvet installation. Now run ``velveth`` and ``velvetg`` for the kmer size determined by the whole class. Use this kmer from now on! Estimating and setting ``exp_cov`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Much better assemblies are produced if Velvet understands the expected coverage for unique regions of your genome. This allows it to try and resolve repeats. The data to determine this is in the ``stats.txt`` file. The full description of this file is in the Velvet Manual, at http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf. A so-called Jupyter notebook has been provided to plot the distribution of the coverage of the nodes. In order to use it, you need to do the following on the local linux machine *Not on the server*: NOTE: if you are on ``vetur`` OR ``vor``, type: :: ssh nordur OR :: ssh austur and enter your password. - install the Jupyter notebook and some python packages (this may take a few minutes): :: pip install --user jupyter pandas numpy pysam - prepare a folder on your linux machine :: cd ~ mkdir assembly cd assembly mkdir velvet cd velvet mkdir ASM_NAME cd ASM_NAME - copy the ``stats.txt`` file from the server to this folder using the ``rsync`` command - copy the notebook file ``/data/assembly/node_coverage.ipynb`` from the server to this folder using ``rsync`` - start the notebook: :: python notebook node_coverage.ipynb OR :: ipython notebook node_coverage.ipynb - After a little while, your web browser will start with a new tab with the notebook in it - follow the instructions in the notebook **Question:** - What do you think is the approximate expected k-mer coverage for your assembly? When you are done with the Jupyter notebook: - save the notebook - close the browser windows - in the terminal where you started Jupyter notebook, click ctrl-c and confirm. Now run velvet again, supplying the value for ``exp_cov`` (k-mer coverage) corresponding to your answer: :: velvetg ASM_NAME -exp_cov PEAK_K_MER_COVERAGE **Question:** - What improvements do you see in the assembly by setting a value for ``exp_cov``? Setting ``cov_cutoff`` ^^^^^^^^^^^^^^^^^^^^^^ You can also clean up the graph by removing low-frequency nodes from the *de Bruijn* graph using the ``cov_cutoff`` parameter. Low-frequency nodes can result from sequencing errors, or from parts of the genome with very little sequencing coverage. Removing them will often result in better assemblies, but setting the cut-off too high will also result in losing useful parts of the assembly. Using the histogram from previously, estimate a good value for ``cov_cutoff``. :: velvetg ASM_NAME -exp_cov YOUR_VALUE -cov_cutoff YOUR_VALUE Try some different values for ``cov_cutoff``, keeping ``exp_cov`` the same and record your assembly results. Asking velvet to determine the parameters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can also ask Velvet to predict the values for you: :: velvetg ASM_NAME -exp_cov auto -cov_cutoff auto **Questions:** - What values of *exp\_cov* and *cov\_cutoff* did Velvet choose? - Check the output to the screen. Is this assembly better than your best one? Incorporating paired-end information ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Paired end information contributes additional information to the assembly, allowing contigs to be scaffolded. We will first re-index your reads telling Velvet to use paired-end information, by using ``-shortPaired`` instead of ``-short`` for ``velveth``. Then, re-run velvetg using the best value of ``k``, ``exp_cov`` and ``cov_cutoff`` from the previous step. **!!! IMPORTANT Pick a new name for your assembly !!!** :: velveth ASM_NAME2 VALUE_OF_K \ -shortPaired -fastq -separate \ /data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \ /data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq velvetg ASM_NAME2 -exp_cov auto \ -cov_cutoff auto **Questions:** - How does doing this affect the assembly? - what does velvet say about the insert size of the paired end library? Scaffold and contig metrics ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | The sequences in the ``contigs.fa`` file are actually scaffolds. | Use the ``assemblathon_stats.pl`` script to generate metrics for this, and all following assemblies. **The assemblathon stats script** The assemblathon `www.assemblathon.org `__ used a perl script to obtain standardized metrics for the assemblies that were submitted. Here we use (a slightly modified version of) this script. It takes the size of the genome, and one sequence fasta file as input. The script breaks the sequences into contigs when there are 20 or more N’s, and reports all sorts of metrics. +--------------------------+-------------+----------------------------------------------------------------+ | Program | Options | Explanation | +==========================+=============+================================================================+ | assemblathon\_stats.pl | | Provide basic assembly metrics | +--------------------------+-------------+----------------------------------------------------------------+ | | -size | size (in Mbp, million basepairs) of target genome (optional) | +--------------------------+-------------+----------------------------------------------------------------+ | | seq.fasta | fasta file of contigs or scaffolds to report on | +--------------------------+-------------+----------------------------------------------------------------+ Example, for a 3.2 Mbp genome: :: assemblathon_stats.pl -s 3.2 scaffolds.fasta OR, save the output to a file with :: assemblathon_stats.pl -s 3.2 scaffolds.fasta > metrics.txt Here, ``>`` (redirect) symbol used to ‘redirect’ what is written to the screen to a file. **For this exercise**, use the known length for this strain, 4.6 Mbp, for the genome size. NOTE make sure you have run this command to enable the use of the script: :: export PATH=/data/bin/:$PATH Some of the metrics the script reports are: - N50 is based on the total assembly size - NG50 is based on the estimated/known genome size - L50 (LG50) count: number of scaffolds/contigs at least N50 (NG50) bases **Questions** - How much of the estimated genome size is covered in the scaffolds - how many gap bases ('N') are left in the scaffolds Looking for repeats ^^^^^^^^^^^^^^^^^^^ Have a look for contigs which are long and have a much higher coverage than the average for your genome. One tedious way to do this is to look into the ``contigs.fa`` file (with ``less``). You will see the name of the contig ('NODE'), it's length and the kmer coverage. However, trying to find long contigs with high coverage this way is not very efficient. A faster was is to again use the ``stats.txt`` file. Relevant columns are: 1) ID --> sequence ID, same as 'NODE' number in the ``contigs.fa`` file 2) lgth --> sequence 'length' 3) short1\_cov --> kmer coverage (column 6) Knowing this, we can use the ``awk`` command to select lines for contigs at least 1kb, with k-mer coverage greater than 60: :: awk '($2>=1000 && $6>=60)' stats.txt ``awk`` is an amazing program for tabular data. In this case, we ask it to check that column 2 ($2, the length) is at least 1000 and column 6 ($6, coverage) at least 60. If this is the case, awk will print the entire line. See http://bit.ly/QjbWr7 for more information on awk. Find the contig with the highest coverage in the ``contigs.fa`` file. Perform a BLAST search using NCBI. **Question:** - What is it? - Is this surprising? Why, or why not? The effect of mate pair library reads ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Long-range "mate-pair" libraries can also dramatically improve an assembly by scaffolding contigs. Typical sizes for Illumina are 2kb and 6kb, although any size is theoretically possible. You can supply a second library to Velvet. However, it is important that files are reverse-complemented first as Velvet expects a specific orientation. We have supplied a 3kb mate-pair library in the correct orientation. **!!! IMPORTANT Pick a new name for your assembly !!!** We will use ``-shortPaired`` for the paired end library reads as before, and add ``-shortPaired2`` for the mate pairs. Also, to make sure we all end up having the same assembly, the kmer size is given: :: velveth ASM_NAME3 81 \ -shortPaired -separate -fastq \ /data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \ /data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq \ -shortPaired2 -separate -fastq \ /data/assembly/Nextera_MP_R1_50x.fastq \ /data/assembly/Nextera_MP_R2_50x.fastq We use auto values for velvetg because the addition of new reads will change the genome coverage. The assembly command then becomes: :: velvetg ASM_NAME3 -cov_cutoff auto -exp_cov auto **Questions:** - What is the N50 of this assembly? - How many scaffolds? - How many bases are in gaps? - What did velvet estimate for the insert length of the paired-end reads, and for the standard deviation? Use the last mention of this in the velvet output. - And for the mate-pair library? **TIP** Some mate pair libraries have a significant amount of paired end reads present as a by-effect of the library preparation. This may generate misassemblies. If this is the case for your data, add the ``-shortMatePaired2 yes`` to let Velvet know it. Make a copy of the contigs file and call it ``velvet_pe+mp.fa`` Optional: Skipping the paired end reads ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ As both the mate pairs and the paired end *reads* are of the same length, and provide the same coverage, it could be interesting to try an assembly of the mate pair reads only. The read sequences would still be used to build the contigs, and the mate pair information to build scaffolds. **!!! IMPORTANT Pick a new name for your assembly !!!** The assembly for this part then becomes: :: velveth ASM_NAME4 81 \ -shortPaired -separate -fastq \ /data/assembly/Nextera_MP_R1_50x.fastq \ /data/assembly/Nextera_MP_R2_50x.fastq velvetg ASM_NAME4 -cov_cutoff auto -exp_cov auto **Questions:** - What is the N50 of this assembly? - How many scaffolds? - How many bases are in gaps? - How does this assembly compare to the previous ones? Make a copy of the contigs file and call it ``velvet_mp_only`` Next steps ~~~~~~~~~~ Next, map the reads used for the assemblies back to the scaffolds. See the tutorial 'Mapping reads to an assembly' .. toctree:: :maxdepth: 1