Assembly using SPADES ===================== Spades was written as an assembly program for bacterial genomes, from regular, as well as from whole-genome amplified samples. It performed very well in the GAGE-B competition, see http://ccb.jhu.edu/gage_b/. SPAdes also works well, sometimes even best, when given high-coverage datasets. Before assembly, SPADES will error-correct the reads. Using SPADES ~~~~~~~~~~~~ Spades can be used with paired end and mate pair data: - The ``--careful`` flag is used to reduce the number of mismatches and short indels. - For each read file, a flag is used to indicate whether it is from a paired end (``--pe``) or mate (``--mp``) pair dataset, followed by a number for the dataset, and a number for read1 or read2. For example: ``--pe1-1`` and ``--pe1-2`` indicate pared end data set 1, read1 and read2, respectively. - Similarly, use ``--mp-1-1`` and ``--mp1-2`` for the mate pair files. - Spades assumes mate pairs are in the orientation as they are in the original files coming from the Illumina instrument: <-- and --> ('outie' orientation, or 'rf' for reverse-forward). Our reads are in the --> and <-- ('innie', 'fr' for forward-reverse) orientation, so we add the ``--mp1-fr`` flag to let SPADES know about this Other parameters: - ``-t`` number of threads (CPUs) to use for calculations - ``--memory`` maximum memory usage in Gb - ``-k`` k-mers to use (this gives room for experimenting!) - ``-o`` name of the output folder Setting up the assembly ^^^^^^^^^^^^^^^^^^^^^^^ To enable SPAdes, run: :: module load spades/3.6.0 | First, create a new folder called ``/usit/abel/u1/YOUR_USERNAME/assembly/spades`` and ``cd`` into it. | We will save the output from the command using ``>spades.out`` in a file to be able to follow progress. ``2>&1`` makes sure any error-messages are written to the same file. Run the assembly as follows: **NOTE** the assembly will take several hours, so use the ``screen`` command! See https://wiki.uio.no/projects/clsi/index.php/Tip:using_screen **NOTE** we use different files for the paired end reads giving spades more data to work with. Choose an assembly: **Option 1: paired end Illumina with Illumina mate Pairs:** For this assembly, we'll tell SPADES what range of khmers to use. :: spades.py -t 2 -k 21,33,55,77 --careful --memory 33 \ --pe1-1 /data/assembly/MiSeq_Ecoli_MG1655_110721_R1.fastq \ --pe1-2 /data/assembly/MiSeq_Ecoli_MG1655_110721_R2.fastq \ --mp1-1 /data/assembly/Nextera_MP_R1_50x.fastq \ --mp1-2 /data/assembly/Nextera_MP_R2_50x.fastq \ --mp1-fr -o ASM_NAME >spades2.out 2>&1 **Option 2: paired end Illumina with MinION data:** The Nanopore data consists of 22270 so-called '2D' reads with average length 6 Kbp, giving around 30x coverage of the *E. coli* genome. We'll let SPADES found out itself what range of khmers to use. :: spades.py -t 2 --careful --memory 33 \ --pe1-1 /data/assembly/MiSeq_Ecoli_MG1655_110721_R1.fastq \ --pe1-2 /data/assembly/MiSeq_Ecoli_MG1655_110721_R2.fastq \ --nanopore /data/assembly/ERA411499_2D_all.fastq \ -o ASM_NAME >spades.out 2>&1 **Option 3: paired end Illumina with PacBio data:** The PacBio data consists of 26250 raw, uncorrected filtered subreads with average length 5.2 Kbp, giving around 30x coverage of the *E. coli* genome. We'll let SPADES found out itself what range of khmers to use. :: spades.py -t 2 --careful --memory 33 \ --pe1-1 /data/assembly/MiSeq_Ecoli_MG1655_110721_R1.fastq \ --pe1-2 /data/assembly/MiSeq_Ecoli_MG1655_110721_R2.fastq \ --pacbio /data/assembly/m130404_014004_filtered_subreads_30x.fastq \ -o ASM_NAME >spades.out 2>&1 If the assembly is running in a 'screen', you can follow the output by checking the ``out`` file. **TIP**: use this command to track the output as it is added to the file. Use ``ctrl-c`` to cancel. :: tail -f spades.out SPADES output ^^^^^^^^^^^^^ - error-corrected reads - contigs for each individual k-mer assembly - final ``contigs.fasta`` and ``scaffolds.fasta``, use the scaffolds file (!) You can have a look at the lengths of the largest sequence(s) with :: fasta_length contigs.fasta |sort -nr |less Re-using error-corrected reads ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once you have run SPADES, you will have files with the error-corrected reads in ``spades_folder/corrected/``. There will be one file for each input file, and one additional one for unpaired reads (where during correction, one of the pairs was removed from the dataset). Instead of running the full SPADES pipeline for your next assembly, you could add the error-corrected reads from the previous assembly. This will save time by skipping the error-correction step. I suggest to not include the files with unpaired reads. Error-corrected read files are compressed, but SPADES will accept them as such (no need to uncompress). Changes to the command line when using error-corrected reads: - point to the error-corrected read files instead of the raw read files - add the ``--only-assembler`` flag to skip correction Next steps ~~~~~~~~~~ As for the previous assemblies, you could map reads back to the assembly, run reapr and visualise in the browser.