Comparing assemblies to the reference
The Quast program can be used to generate metrics similar to those from the assemblathon_stat.pl script, plus some additional metrics and visualisations.
Quast options:
-o: name of the output folder
-R: reference genome
-G: file with positions of genes in the reference (see manual)
-T: number of threads (CPUs) to use
sequences.fasta: one or more files with assembled sequences
-l: comma-separated list of names for the assemblies, e.g. "assembly 1", "assembly 2" (in the same order as the sequence files)
--scaffolds: input sequences are scaffolds, not contigs. They will be split at 10 N’s or more to analyse contigs (‘broken’ assembly)
--est-ref-size: estimated reference genome size (when a reference is not provided)
--gene-finding: apply GeneMarkS for gene finding
See the manual for information on the output of Quast: http://quast.bioinf.spbau.ru/manual.html#sec3
NOTE: on the course server, you can’t run quast if anaconda is in your PATH. To temporarily remove anaconda, run:
$ cd
$ mv anaconda3 anaconda3_bak
Now log out and back in again.
Other programs/scripts need anaconda, so when you want to use them again, rename the folder back to anaconda3, then log out and back in again. Sorry for the confusion.
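As a quick check after logging back in, the sketch below lists any PATH entries that mention anaconda. The function name anaconda_path_entries is made up for this illustration. No output means anaconda is not on the PATH in the current session.

```shell
# Print every PATH entry that mentions anaconda (case-insensitive).
# Prints nothing if there is no such entry.
anaconda_path_entries() {
    printf '%s\n' "$PATH" | tr ':' '\n' | grep -i 'anaconda' || true
}

anaconda_path_entries
```

If this prints one or more directories, anaconda is still active in the session, and quast will probably not run yet.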
Running Quast
On the server, make a folder called quast and move into it. Then run:
quast.py -t 2 \
-o out_folder_name \
-R /data/assembly/NC_000913_K12_MG1655.fasta \
-G /data/assembly/e.coli_genes.gff \
../path/to/assembly1.fasta \
../path/to/assembly2.fasta \
-l "Assembly 1, Assembly 2"
Note that the --scaffolds option is not used here, for simplicity. Also, make sure you name the assemblies (-l) in the same order as you give them to quast!
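One way to keep the labels in step with the file order is to derive them from the file names. The sketch below is only an illustration: labels_from_files is a hypothetical helper, and it assumes the assembly files end in .fasta.

```shell
# Build a comma-separated label list from assembly file names,
# keeping the same order in which the files are given.
labels_from_files() {
    labels=""
    for f in "$@"; do
        name=$(basename "$f" .fasta)        # drop the directory and the .fasta suffix
        labels="${labels:+$labels,}$name"   # add a comma only between names
    done
    printf '%s\n' "$labels"
}

# Prints: assembly1,assembly2
labels_from_files ../path/to/assembly1.fasta ../path/to/assembly2.fasta
```

The printed string could then be given to -l, with the sequence files listed in the same order on the quast command line.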
Quast output
Quast will produce an HTML report file, report.html. Open this file in your browser. Hover over the row names to get a description. Also have a look at the ‘Extended report’.
Alternatively, have a look at the report.pdf file (it has a few more plots).
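Before trying to open the reports, it can help to confirm that they were written. This is a minimal sketch, assuming the output folder is out_folder_name as in the command above. The helper name check_reports is hypothetical.

```shell
# Report whether the two report files mentioned above exist in a Quast output folder.
check_reports() {
    dir=${1:-out_folder_name}   # folder given to -o; defaults to the name used above
    for f in "$dir/report.html" "$dir/report.pdf"; do
        if [ -f "$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
        fi
    done
}

check_reports out_folder_name
```

A "missing" line usually means that quast did not finish, or that a different folder name was given to -o.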