Assembly using HGAP¶
From the PacBio website: “The Hierarchical Genome Assembly Process (HGAP) for long single pass reads generated by the PacBio Single Molecule Real Time (SMRT) sequencer was developed to allow the complete and accurate shotgun assembly of bacterial sized genomes.” See https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP
Using HGAP¶
The HGAP pipeline:
- starts with the raw output from the instrument, so-called
bax.h5
file (these are in the HDF5 binary format) - extracts the reads and maps all of them to the longest set of reads
- corrects the longest set of reads
- runs the corrected reads through the Celera assembler (Celera Assembler was developed during the time of Sanger sequencing by the company Celera Genomics. Celera Assembler was used to assemble the Drosophila genome, as well as the human genome)
- maps all the raw reads back to the assembled contigs and recalls the consensus bases
Given 60-100x coverage in raw PacBio reads, HGAP very often yields complete, gapless, highly accurate assemblies, i.e., one contig per chromosomal element with maybe a few bases wrong.
Here we will use a dataset provided by Pacific Biosciences, comprising ~95x coverage in raw reads.
Setting up the assembly¶
HGAP is part of a software suite, developed by Pacific Biosciences,
called smrtanalysis
. The main command, smrtpipe
, takes as input:
- an xml file pointing to a set of raw PacBio files
- an xml file with settings, in our case, specific for HGAP
NOTE the assembly will take several hours, so use the screen
command! See
https://wiki.uio.no/projects/clsi/index.php/Tip:using_screen
- Make a new folder called, for example, called
hgap
- we don’t use a module this time, but need to add the software to our
environment by using this command (remember to use
screen
first):
source /cluster/software/smrtanalysis/current/etc/setup.sh
(ignore the WARNINGS).
- make a file that has a list of the input files like this:
ls /data/assembly/*.bax.h5 >input.fofn
- we simply redirect the output of the
ls
command to a file - ´fofn´ stands for ‘file of filenames’
- NOTE an alternative, more common way of doing this is using the
find
command - convert this
fofn
file to anxml
file the software understands:
fofnToSmrtpipeInput.py input.fofn >input.xml
- add a copy of a settings xml file to your folder, this is needed to tell the software how to run the analysis
cp /data/assembly/HGAP3_settings.xml ./
- start the analysis, see NOTES below
smrtpipe.py -D TMP=./ -D SHARED_DIR=./ \
-D NPROC=2 -D MAX_THREADS=2 \
--params=HGAP3_settings.xml xml:input.xml > smrtpipe.out 2>&1
NOTES
-D TMP=./ -D SHARED_DIR=./
indicate to use the local folder for temporary files-D NPROC=2 -D MAX_THREADS=2
limits the use of CPUs to 2
HGAP output¶
The data
folder contains the most important output:
- the folder
0-mercounts
up to4-unitigger
and files/folders withcelera-assembler
in the name are output from Celera corrected.*
contain the corrected PacBio readsdraft_assembly.fasta
is the first assembly generated by Celerapolished_assembly.*
are the final contigs after correcting the bases using the raw reads, these are compressed usinggzip
- lots of other files and folders are present, have a look around if you have time
In order to have use the final assembly, uncompress the ´polished_assembly.fasta.gz´ file like this:
gunzip polished_assembly.fasta.gz
You can have a look at the lengths of the largest sequence(s) with
fasta_length polished_assembly.fasta |sort -nr |less
Next steps¶
As for the previous assemblies, you could map reads back to the assembly, run reapr and visualise in the browser. Is the assembly error-less?
NOTE reapr will complain about the naming of the sequence(s) in the
polished_assembly.fasta.gz
file. A fix for this is to run this
command BEFORE running bwa
:
sed -i 's/|quiver/_quiver/' polished_assembly.fasta
This replaces the |
symbol with an underscore.