Assembly using miniasm+racon

A recent paper described a fast approach for assembling and correcting PacBio and MinION data. The principle is:

  • using minimap for fast all-against-all overlap of raw reads
  • using miniasm to build the assembly; miniasm “simply concatenates pieces of read sequences to generate the final sequences. Thus the per-base error rate is similar to the raw input reads.”
  • mapping the raw reads back to the assembly using minimap again
  • using racon (‘rapid consensus’) for consensus calling

The authors also recommend running racon twice; here we will settle for one round.

Running miniasm and racon on MinION data

All-against-all overlap with minimap

First, ensure that you are in the assembly directory. Create a new directory called miniasm and, inside that one, a directory called minion. Go inside the minion directory.
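The directory setup described above can be done in two commands. This is a minimal sketch that assumes the current directory is already the assembly directory:

```shell
# Create miniasm/minion (mkdir -p also creates the parent miniasm directory)
# and move into it. Assumes you start in the assembly directory.
mkdir -p miniasm/minion
cd miniasm/minion
```

Running pwd afterwards should show a path ending in miniasm/minion.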

Note how the reads are used twice here, as we map the reads against themselves:

minimap -Sw5 -L100 -m0 \
-t 3 \
/share/inf-biox121/data/assembly/MAP006-1_2D_pass.fastq \
/share/inf-biox121/data/assembly/MAP006-1_2D_pass.fastq \
| gzip -1 >racon_MAP006-1_2D_1.paf.gz

The output is in the so-called [PAF (Pairwise mApping Format)](https://github.com/lh3/miniasm/blob/master/PAF.md), and is compressed ‘on the fly’.
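To get a feel for the overlap file, it can help to count the overlaps and look at their lengths. The sketch below relies on the PAF layout, in which column 11 holds the alignment block length; the function name paf_summary is only an illustrative helper and is not part of minimap or miniasm.

```shell
# Summarise a PAF stream read from standard input:
# number of overlap lines and mean alignment block length (PAF column 11).
# paf_summary is a hypothetical helper name used for illustration.
paf_summary() {
  awk -F'\t' '
    { n++; total += $11 }
    END {
      if (n > 0) printf "overlaps=%d mean_block_len=%.1f\n", n, total / n
      else print "overlaps=0"
    }'
}

# Apply it to the compressed overlap file, if it is present.
if [ -f racon_MAP006-1_2D_1.paf.gz ]; then
  gzip -dc racon_MAP006-1_2D_1.paf.gz | paf_summary
fi
```

The file is decompressed on the fly with gzip -dc, so no uncompressed copy is written to disk.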

Assembly with miniasm

miniasm takes the reads and the PAF file and produces an assembly in GFA (Graphical Fragment Assembly) format.

miniasm -f /share/inf-biox121/data/assembly/MAP006-1_2D_pass.fastq \
racon_MAP006-1_2D_1.paf.gz \
>racon_MAP006-1_2D_1.gfa

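Since we have only one sequence in the GFA file (at least for this assembly), its first line is the segment (S) line, with the sequence name in the second column and the sequence itself in the third. We can therefore use a simple set of Unix commands to turn it into a fasta file: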

head -n 1 racon_MAP006-1_2D_1.gfa | awk '{print ">"$2; print $3}' > racon_MAP006-1_2D_1.raw_assembly.fasta
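If an assembly were to contain more than one sequence, taking only the first line would miss the others. A possible alternative, sketched here under the assumption that every sequence sits on a line starting with S (as in miniasm's GFA output), is to select all such lines. The function name gfa_to_fasta is made up for this example.

```shell
# Convert every segment (S) line of a GFA file into a fasta record.
# Column 2 is the sequence name, column 3 the sequence.
# gfa_to_fasta is a hypothetical helper; it reads the named file(s),
# or standard input when no file is given.
gfa_to_fasta() {
  awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$@"
}

# Apply it to the miniasm output, if it is present.
if [ -f racon_MAP006-1_2D_1.gfa ]; then
  gfa_to_fasta racon_MAP006-1_2D_1.gfa > racon_MAP006-1_2D_1.raw_assembly.fasta
fi
```

For a GFA file with a single sequence, this gives the same fasta file as the head and awk command above.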

Correction with racon

We first use minimap again, this time with the original reads mapped against the ‘raw’ assembly:

minimap racon_MAP006-1_2D_1.raw_assembly.fasta \
/share/inf-biox121/data/assembly/MAP006-1_2D_pass.fastq \
>racon_MAP006-1_2D_1.raw_assembly.reads_mapped.paf

In general, racon is run as racon -t num_threads reads.fastq mapped_reads.paf assembly.fasta consensus.fasta, where the last argument is the output file. For our data:

racon -t 2 \
/share/inf-biox121/data/assembly/MAP006-1_2D_pass.fastq \
racon_MAP006-1_2D_1.raw_assembly.reads_mapped.paf \
racon_MAP006-1_2D_1.raw_assembly.fasta \
racon_MAP006-1_2D_1.racon1.fasta

This will take some time.
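Once racon has finished, a quick sanity check is to compare the number of sequences and bases in the raw assembly and in the consensus. The sketch below uses a small helper, fasta_stats, which is a name invented for this example.

```shell
# Report the number of sequences and the total number of bases in a fasta file.
# fasta_stats is a hypothetical helper; it reads the named file(s),
# or standard input when no file is given.
fasta_stats() {
  awk '
    /^>/ { n++; next }            # header line: count one sequence
    { bases += length($0) }       # sequence line: add its length
    END { printf "sequences=%d bases=%d\n", n, bases }' "$@"
}

# Compare the raw assembly with the racon consensus, if both are present.
for f in racon_MAP006-1_2D_1.raw_assembly.fasta racon_MAP006-1_2D_1.racon1.fasta; do
  if [ -f "$f" ]; then
    printf '%s: ' "$f"
    fasta_stats "$f"
  fi
done
```

The two files would be expected to hold the same number of sequences, with total lengths of the same order.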

Correction with racon, round 2

As mentioned in the paper, for the best results, we could run racon again. We will not do that here, but this is how it would be done.

Run the mapping with minimap and the correction with racon again, but now with the results of the first round of correction.

That is:

  • for minimap: map the reads to the racon1 assembly fasta file
  • for racon: use the racon1 fasta file as the assembly file, and the PAF file from mapping the reads to racon1 as the mapped reads.

Please be careful when naming files!
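One way to keep the file names straight is to derive them from the round number. The sketch below only prints the two commands for a second round so that they can be checked before being run. The names of the round 2 PAF and consensus files are assumptions that follow the pattern used in round 1.

```shell
# Reads and file prefix as used in round 1.
reads=/share/inf-biox121/data/assembly/MAP006-1_2D_pass.fastq
prefix=racon_MAP006-1_2D_1

round=2                 # the correction round to run
prev=$((round - 1))     # the round whose output we start from

# Output of the previous round, used as the assembly to correct.
assembly_in=${prefix}.racon${prev}.fasta
# Assumed name for the reads mapped to that assembly.
paf=${prefix}.racon${prev}.reads_mapped.paf
# Assumed name for the new consensus.
assembly_out=${prefix}.racon${round}.fasta

# Print the commands for inspection; nothing is run here.
echo "minimap $assembly_in $reads > $paf"
echo "racon -t 2 $reads $paf $assembly_in $assembly_out"
```

After checking that the printed input and output names differ, the two commands can be copied and run as they are.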

Running miniasm and racon on PacBio data

Use all available reads from the P6C4 run, i.e.:

/share/inf-biox121/data/assembly/m141013_011508_sherri_c100709962550000001823135904221533_s1_p0.filtered_subreads.fastq

Do this in a sister directory of the minion directory, called pacbio.

The commands are the same as for the MinION data. Again, please be careful when naming files!