Assembly using CANU

Canu is “a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION)”. Celera Assembler was developed during the time of Sanger sequencing by the company Celera Genomics. Celera Assembler was used to assemble the Drosophila genome, as well as the human genome.

Canu:

  • starts with the raw reads
  • maps all of them to the longest set of reads
  • corrects the longest set of reads using the information of the mapped reads
  • runs the corrected reads through an Overlap-Layout-Consensus (OLC) assembler

Given 60-100x coverage in raw PacBio or MinION reads, Canu very often yields complete, gapless assemblies, i.e., one contig per chromosomal element.

From the manual:

While Canu corrects sequences and has 99% identity or greater with PacBio or Nanopore sequences, for the best accuracy we recommend polishing with a sequence-specific tool. We recommend Quiver for PacBio and Nanopolish for Oxford Nanpore data. If you have Illumina sequences available, Pilon can also be used to polish either PacBio or Oxford Nanopore assemblies.

Assembling MinION data with canu

Run canu as:

canu -p canu_MAP006-1_2D -d canu_MAP006-1_2D \
genomeSize=4.6m \
-maxThreads=2 \
-maxMemory=33 \
-nanopore-raw /data/assembly/MAP006-1_2D_pass.fastq
  • -p and -d tell can what to call the output folder and files
  • -nanopore-raw speaks for itself
  • the assembly file will be called canu_MAP006-1_2D.contigs.fasta

If you want to run reapr on the can output, there is one catch. We’ll check whether reapr can actually work with the canu assembly file:

reapr facheck ASSEMBLY.FASTA

From the help text:

Checks that the names in the fasta file are ok. Things like trailing whitespace or characters |’:- could break the pipeline.

If this command warns about one or more names of sequences breaking the pipeline, you can have the program fix it for you:

reapr facheck ASSEMBLY.FASTA fixed_ASSEMBLY

This will create a new file fixed_ASSEMBLY.fasta with fixed sequence names, and a file called fixed_ASSEMBLY.info listing the old and new names. Now your assembly is ready for reapr.

NB: Make sure you do this before you start mapping with bwa!

Assembling PacBio data with canu

Use all available reads from the P6C4 run, i.e. 155 x coverage:

/data/assembly/pacbio/Analysis_Results/m141013_011508_sherri_c100709962550000001823135904221533_s1_p0.filtered_subreads.fastq

Ran as for the MinIOn data, but use the -pacbio-raw for the PacBio reads.

Correcting the canu PacBio assembly using Quiver

Quiver is a program that takes a set of aligned PacBio reads and recalls the bases based on the consensus of the alignment.

First, we need to map the raw (!) PacBio reds to the assembly. For this, we use the raw output from the instrument, which is in so-called bax.h5 files (these are in the HDF5 binary format). We make a fofn (a ‘file-of-filenames’) which lists all input files:

find /data/assembly/pacbio/ -name "*.h5" >input.fofn

We also need to set up the environment to be able to run the correct programs (pbalign and quiver), simple type:

smrtshell

There will some warnings but please ignore these.

Now we do the mapping using pbalign:

pbalign \
--tmpDir ./ \
--nproc 2 \
input.fofn \
assembly.fasta canu_quiver.cmp.h5 \
--forQuiver

For the next step, we need to index the assembly with samtools faidx:

samtools faidx assembly.fasta

Now we can run quiver:

quiver -j 2 canu_quiver.cmp.h5 \
-r assembly.fasta \
-o canu_quiver.variants.gff \
-o canu_quiver.consensus.fasta

Our ‘new’ assembly is in the consensus.fasta file, while the variants.gff file is a list of changes quiver made to the original assembly (in gff format).