Assembly using HGAP

From the PacBio website: “The Hierarchical Genome Assembly Process (HGAP) for long single pass reads generated by the PacBio Single Molecule Real Time (SMRT) sequencer was developed to allow the complete and accurate shotgun assembly of bacterial sized genomes.” See https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP

Using HGAP

The HGAP pipeline:

  • starts with the raw output from the instrument, so-called bax.h5 file (these are in the HDF5 binary format)
  • extracts the reads and maps all of them to the longest set of reads
  • corrects the longest set of reads
  • runs the corrected reads through the Celera assembler (Celera Assembler was developed during the time of Sanger sequencing by the company Celera Genomics. Celera Assembler was used to assemble the Drosophila genome, as well as the human genome)
  • maps all the raw reads back to the assembled contigs and recalls the consensus bases

Given 60-100x coverage in raw PacBio reads, HGAP very often yields complete, gapless, highly accurate assemblies, i.e., one contig per chromosomal element with maybe a few bases wrong.

Here we will use a dataset provided by Pacific Biosciences, comprising ~95x coverage in raw reads.

Setting up the assembly

HGAP is part of a software suite, developed by Pacific Biosciences, called smrtanalysis. The main command, smrtpipe, takes as input:

  • an xml file pointing to a set of raw PacBio files
  • an xml file with settings, in our case, specific for HGAP

NOTE the assembly will take several hours, so use the screen command! See https://wiki.uio.no/projects/clsi/index.php/Tip:using_screen

  • Make a new folder called, for example, called hgap
  • we don’t use a module this time, but need to add the software to our environment by using this command (remember to use screen first):
source /cluster/software/smrtanalysis/current/etc/setup.sh

(ignore the WARNINGS).

  • make a file that has a list of the input files like this:
ls /data/assembly/*.bax.h5 >input.fofn
  • we simply redirect the output of the ls command to a file
  • ´fofn´ stands for ‘file of filenames’
  • NOTE an alternative, more common way of doing this is using the find command
  • convert this fofn file to an xml file the software understands:
fofnToSmrtpipeInput.py input.fofn >input.xml
  • add a copy of a settings xml file to your folder, this is needed to tell the software how to run the analysis
cp /data/assembly/HGAP3_settings.xml ./
  • start the analysis, see NOTES below
smrtpipe.py -D TMP=./ -D SHARED_DIR=./ \
-D NPROC=2 -D MAX_THREADS=2 \
--params=HGAP3_settings.xml xml:input.xml > smrtpipe.out 2>&1

NOTES

  • -D TMP=./ -D SHARED_DIR=./ indicate to use the local folder for temporary files
  • -D NPROC=2 -D MAX_THREADS=2 limits the use of CPUs to 2

HGAP output

The data folder contains the most important output:

  • the folder 0-mercounts up to 4-unitigger and files/folders with celera-assembler in the name are output from Celera
  • corrected.* contain the corrected PacBio reads
  • draft_assembly.fasta is the first assembly generated by Celera
  • polished_assembly.* are the final contigs after correcting the bases using the raw reads, these are compressed using gzip
  • lots of other files and folders are present, have a look around if you have time

In order to have use the final assembly, uncompress the ´polished_assembly.fasta.gz´ file like this:

gunzip polished_assembly.fasta.gz

You can have a look at the lengths of the largest sequence(s) with

fasta_length polished_assembly.fasta |sort -nr |less

Next steps

As for the previous assemblies, you could map reads back to the assembly, run reapr and visualise in the browser. Is the assembly error-less?

NOTE reapr will complain about the naming of the sequence(s) in the polished_assembly.fasta.gz file. A fix for this is to run this command BEFORE running bwa:

sed -i 's/|quiver/_quiver/' polished_assembly.fasta

This replaces the | symbol with an underscore.