Assembly using HGAP¶

From the PacBio website: “The Hierarchical Genome Assembly Process (HGAP) for long single pass reads generated by the PacBio Single Molecule Real Time (SMRT) sequencer was developed to allow the complete and accurate shotgun assembly of bacterial sized genomes.” See https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP

Using HGAP¶

The HGAP pipeline:

starts with the raw output from the instrument, so-called bax.h5 file (these are in the HDF5 binary format)
extracts the reads and maps all of them to the longest set of reads
corrects the longest set of reads
runs the corrected reads through the Celera assembler (Celera Assembler was developed during the time of Sanger sequencing by the company Celera Genomics. Celera Assembler was used to assemble the Drosophila genome, as well as the human genome)
maps all the raw reads back to the assembled contigs and recalls the consensus bases

Given 60-100x coverage in raw PacBio reads, HGAP very often yields complete, gapless, highly accurate assemblies, i.e., one contig per chromosomal element with maybe a few bases wrong.

Here we will use a dataset provided by Pacific Biosciences, comprising ~95x coverage in raw reads.

Setting up the assembly¶

HGAP is part of a software suite, developed by Pacific Biosciences, called smrtanalysis. The main command, smrtpipe, takes as input:

an xml file pointing to a set of raw PacBio files
an xml file with settings, in our case, specific for HGAP

NOTE the assembly will take several hours, so use the screen command! See https://wiki.uio.no/projects/clsi/index.php/Tip:using_screen

Make a new folder called, for example, called hgap
we don’t use a module this time, but need to add the software to our environment by using this command (remember to use screen first):

source /cluster/software/smrtanalysis/current/etc/setup.sh

(ignore the WARNINGS).

make a file that has a list of the input files like this:

ls /data/assembly/*.bax.h5 >input.fofn

we simply redirect the output of the ls command to a file
´fofn´ stands for ‘file of filenames’
NOTE an alternative, more common way of doing this is using the find command
convert this fofn file to an xml file the software understands:

fofnToSmrtpipeInput.py input.fofn >input.xml

add a copy of a settings xml file to your folder, this is needed to tell the software how to run the analysis

cp /data/assembly/HGAP3_settings.xml ./

start the analysis, see NOTES below

smrtpipe.py -D TMP=./ -D SHARED_DIR=./ \
-D NPROC=2 -D MAX_THREADS=2 \
--params=HGAP3_settings.xml xml:input.xml > smrtpipe.out 2>&1

NOTES

-D TMP=./ -D SHARED_DIR=./ indicate to use the local folder for temporary files
-D NPROC=2 -D MAX_THREADS=2 limits the use of CPUs to 2

HGAP output¶

The data folder contains the most important output:

the folder 0-mercounts up to 4-unitigger and files/folders with celera-assembler in the name are output from Celera
corrected.* contain the corrected PacBio reads
draft_assembly.fasta is the first assembly generated by Celera
polished_assembly.* are the final contigs after correcting the bases using the raw reads, these are compressed using gzip
lots of other files and folders are present, have a look around if you have time

In order to have use the final assembly, uncompress the ´polished_assembly.fasta.gz´ file like this:

gunzip polished_assembly.fasta.gz

You can have a look at the lengths of the largest sequence(s) with

fasta_length polished_assembly.fasta |sort -nr |less

Next steps¶

As for the previous assemblies, you could map reads back to the assembly, run reapr and visualise in the browser. Is the assembly error-less?

NOTE reapr will complain about the naming of the sequence(s) in the polished_assembly.fasta.gz file. A fix for this is to run this command BEFORE running bwa:

sed -i 's/|quiver/_quiver/' polished_assembly.fasta

This replaces the | symbol with an underscore.

Assembly using HGAP¶

Using HGAP¶

Setting up the assembly¶

HGAP output¶

Next steps¶

Table Of Contents

This Page