Assembly using SPADES¶
SPAdes was written as an assembler for bacterial genomes, from regular as well as from whole-genome amplified samples. It performed very well in the GAGE-B competition, see http://ccb.jhu.edu/gage_b/. SPAdes also works well, sometimes even best, when given high-coverage datasets.
Before assembly, SPAdes will error-correct the reads.
Using SPADES¶
SPAdes can be used with paired end and mate pair data:
- The --careful flag is used to reduce the number of mismatches and short indels.
- For each read file, a flag is used to indicate whether it is from a paired end (--pe) or mate pair (--mp) dataset, followed by a number for the dataset, and a number for read1 or read2. For example: --pe1-1 and --pe1-2 indicate paired end dataset 1, read1 and read2, respectively.
- Similarly, use --mp1-1 and --mp1-2 for the mate pair files.
- SPAdes assumes mate pairs are in the orientation as they are in the original files coming from the Illumina instrument: <– and –> (‘outie’ orientation, or ‘rf’ for reverse-forward). Our reads are in the –> and <– (‘innie’, ‘fr’ for forward-reverse) orientation, so we add the --mp1-fr flag to let SPAdes know about this.
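The naming scheme for the read file flags can be sketched as a small shell helper. This is only an illustration: lib_flags is a hypothetical function name (it is not part of SPAdes), and the file names are placeholders.

```shell
# lib_flags KIND NUMBER READ1 READ2
# Hypothetical helper, not part of SPAdes: prints the two flags for one
# paired library. KIND is "pe" (paired end) or "mp" (mate pair),
# NUMBER is the dataset number.
lib_flags() {
    kind=$1; num=$2; read1=$3; read2=$4
    printf '%s %s %s %s\n' "--${kind}${num}-1" "$read1" "--${kind}${num}-2" "$read2"
}

# Placeholder file names, for illustration only
lib_flags pe 1 reads_R1.fastq reads_R2.fastq
lib_flags mp 1 matepairs_R1.fastq matepairs_R2.fastq
```

The first call prints the flags for paired end dataset 1, the second the flags for mate pair dataset 1. A second paired end dataset would use the number 2, giving --pe2-1 and --pe2-2.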
Other parameters:
- -t: number of threads (CPUs) to use for calculations
- --memory: maximum memory usage in Gb
- -k: k-mers to use (this gives room for experimenting!)
- -o: name of the output folder
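When experimenting with -k, it can help to check a k-mer list before starting a run that takes hours. The sketch below assumes the rules given in the SPAdes manual (all values odd, below 128, listed in ascending order); check_kmers is a hypothetical helper name, not a SPAdes command.

```shell
# check_kmers LIST
# Hypothetical helper: returns 0 if every k in the comma-separated LIST
# is odd, below 128, and the values are in ascending order; 1 otherwise.
check_kmers() {
    prev=0
    for k in $(printf '%s' "$1" | tr ',' ' '); do
        case $k in
            ''|0*|*[!0-9]*) return 1 ;;    # not a plain positive integer
        esac
        [ $((k % 2)) -eq 1 ] || return 1   # must be odd
        [ "$k" -lt 128 ] || return 1       # must be below 128
        [ "$k" -gt "$prev" ] || return 1   # must be ascending
        prev=$k
    done
    [ "$prev" -gt 0 ]                      # at least one value was given
}

if check_kmers 21,33,55,77; then
    echo "k-mer list OK"
else
    echo "k-mer list not valid"
fi
```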
Setting up the assembly¶
To enable SPAdes, run:
module load spades/3.6.0
Create the folder /usit/abel/u1/YOUR_USERNAME/assembly/spades and cd into it.
We redirect the output of the command with >spades.out to a file to be able to follow progress. Adding 2>&1 makes sure any error messages are written to the same file. Run the assembly as follows:
NOTE: the assembly will take several hours, so use the screen command! See https://wiki.uio.no/projects/clsi/index.php/Tip:using_screen
NOTE: we use different files for the paired end reads, giving SPAdes more data to work with.
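To see what the redirection does without waiting for an assembly, here is a minimal sketch. The function noisy and the file demo.out are made up for this illustration; noisy simply stands in for a long-running program.

```shell
# noisy is a stand-in for a long-running program such as an assembler:
# it writes one line to standard output and one to standard error.
noisy() {
    echo "progress message"
    echo "error message" >&2
}

# >demo.out sends standard output to the file;
# 2>&1 then sends standard error to the same place.
noisy >demo.out 2>&1

# Both lines are now in demo.out and nothing was printed to the screen
cat demo.out
```

The order matters: 2>&1 has to come after >demo.out, otherwise the error messages still go to the screen.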
Choose an assembly:
Option 1: paired end Illumina with Illumina mate pairs:
For this assembly, we’ll tell SPAdes what range of k-mers to use.
spades.py -t 2 -k 21,33,55,77 --careful --memory 33 \
--pe1-1 /data/assembly/MiSeq_Ecoli_MG1655_110721_R1.fastq \
--pe1-2 /data/assembly/MiSeq_Ecoli_MG1655_110721_R2.fastq \
--mp1-1 /data/assembly/Nextera_MP_R1_50x.fastq \
--mp1-2 /data/assembly/Nextera_MP_R2_50x.fastq \
--mp1-fr -o ASM_NAME >spades.out 2>&1
Option 2: paired end Illumina with MinION data:
The Nanopore data consists of 22270 so-called ‘2D’ reads with average length 6 Kbp, giving around 30x coverage of the E. coli genome. We’ll let SPAdes find out by itself what range of k-mers to use.
spades.py -t 2 --careful --memory 33 \
--pe1-1 /data/assembly/MiSeq_Ecoli_MG1655_110721_R1.fastq \
--pe1-2 /data/assembly/MiSeq_Ecoli_MG1655_110721_R2.fastq \
--nanopore /data/assembly/ERA411499_2D_all.fastq \
-o ASM_NAME >spades.out 2>&1
Option 3: paired end Illumina with PacBio data:
The PacBio data consists of 26250 raw, uncorrected filtered subreads with average length 5.2 Kbp, giving around 30x coverage of the E. coli genome. We’ll let SPAdes find out by itself what range of k-mers to use.
spades.py -t 2 --careful --memory 33 \
--pe1-1 /data/assembly/MiSeq_Ecoli_MG1655_110721_R1.fastq \
--pe1-2 /data/assembly/MiSeq_Ecoli_MG1655_110721_R2.fastq \
--pacbio /data/assembly/m130404_014004_filtered_subreads_30x.fastq \
-o ASM_NAME >spades.out 2>&1
If the assembly is running in a ‘screen’, you can follow the output by checking the spades.out file.
TIP: use this command to track the output as it is added to the file. Use ctrl-c to cancel.
tail -f spades.out
SPADES output¶
- error-corrected reads
- contigs for each individual k-mer assembly
- final contigs.fasta and scaffolds.fasta, use the scaffolds file (!)
You can have a look at the lengths of the largest sequence(s) with
fasta_length contigs.fasta |sort -nr |less
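In case fasta_length is not available on your system, the same kind of overview can be produced with standard tools. The following is a sketch, not a replacement for the course script: seq_lengths is a hypothetical helper name, example.fasta is a made-up test file, and the output format (lengths only, no sequence names) may differ from what fasta_length prints.

```shell
# seq_lengths FASTA
# Hypothetical helper: prints the length of every sequence in FASTA,
# longest first. Sequences may be wrapped over several lines.
seq_lengths() {
    awk '
        /^>/ { if (seen) print len; seen = 1; len = 0; next }
             { len += length($0) }
        END  { if (seen) print len }
    ' "$1" | sort -nr
}

# Small made-up example file with two sequences (6 and 10 bases)
printf '>seq1\nACGT\nAC\n>seq2\nACGTACGTAC\n' > example.fasta
seq_lengths example.fasta
```

For a real assembly, replace example.fasta with contigs.fasta or scaffolds.fasta and pipe the result into less.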
Re-using error-corrected reads¶
Once you have run SPAdes, you will have files with the error-corrected reads in spades_folder/corrected/. There will be one file for each input file, and one additional file for unpaired reads (where, during correction, one read of a pair was removed from the dataset). Instead of running the full SPAdes pipeline for your next assembly, you could use the error-corrected reads from the previous assembly. This will save time by skipping the error-correction step. I suggest not including the files with unpaired reads.
Error-corrected read files are compressed, but SPAdes will accept them as such (no need to uncompress).
Changes to the command line when using error-corrected reads:
- point to the error-corrected read files instead of the raw read files
- add the --only-assembler flag to skip correction
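Put together, the two changes could look like the sketch below. The file names CORRECTED_R1.fastq.gz and CORRECTED_R2.fastq.gz are placeholders made up for this example; list the contents of spades_folder/corrected/ to find the actual names. The command is only printed here, so it can be inspected before running.

```shell
# Placeholders: replace with the actual file names found in
# spades_folder/corrected/ (the names below are made up for illustration).
R1=spades_folder/corrected/CORRECTED_R1.fastq.gz
R2=spades_folder/corrected/CORRECTED_R2.fastq.gz

# Build the command as text first so it can be checked before running.
# --only-assembler skips the error-correction step.
CMD="spades.py -t 2 --careful --memory 33 --only-assembler --pe1-1 $R1 --pe1-2 $R2 -o ASM_NAME"
echo "$CMD"
```

To actually run the assembly, type the printed command and add the redirection to a log file as before.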
Next steps¶
As for the previous assemblies, you could map reads back to the assembly, run reapr and visualise in the browser.