Assembly using HGAP
===================

From the PacBio website: "The Hierarchical Genome Assembly Process
(HGAP) for long single pass reads generated by the PacBio Single
Molecule Real Time (SMRT) sequencer was developed to allow the complete
and accurate shotgun assembly of bacterial sized genomes." See
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP

Using HGAP
~~~~~~~~~~

The HGAP pipeline:

-  starts with the raw output from the instrument, so-called ``bax.h5``
   file (these are in the HDF5 binary format)
-  extracts the reads and maps all of them to the longest set of reads
-  corrects the longest set of reads
-  runs the corrected reads through the Celera assembler (Celera
   Assembler was developed during the time of Sanger sequencing by the
   company Celera Genomics. Celera Assembler was used to assemble the
   *Drosophila* genome, as well as the human genome)
-  maps all the raw reads back to the assembled contigs and recalls the
   consensus bases

Given 60-100x coverage in raw PacBio reads, HGAP very often yields
complete, gapless, highly accurate assemblies, i.e., one contig per
chromosomal element with maybe a few bases wrong.

Here we will use a dataset provided by Pacific Biosciences, comprising
~95x coverage in raw reads.

Setting up the assembly
^^^^^^^^^^^^^^^^^^^^^^^

HGAP is part of a software suite, developed by Pacific Biosciences,
called ``smrtanalysis``. The main command, ``smrtpipe``, takes as input:

-  an xml file pointing to a set of raw PacBio files
-  an xml file with settings, in our case, specific for HGAP

**NOTE** the assembly will take several hours, so use the ``screen``
command! See
https://wiki.uio.no/projects/clsi/index.php/Tip:using_screen

-  Make a new folder called, for example, called ``hgap``
-  we don't use a module this time, but need to add the software to our
   environment by using this command (remember to use ``screen`` first):

::

    source /cluster/software/smrtanalysis/current/etc/setup.sh

(ignore the WARNINGS).

-  make a file that has a list of the input files like this:

::

    ls /data/assembly/*.bax.h5 >input.fofn

-  we simply redirect the output of the ``ls`` command to a file
-  ´fofn´ stands for 'file of filenames'
-  NOTE an alternative, more common way of doing this is using the
   ``find`` command
-  convert this ``fofn`` file to an ``xml`` file the software
   understands:

::

    fofnToSmrtpipeInput.py input.fofn >input.xml

-  add a copy of a settings xml file to your folder, this is needed to
   tell the software how to run the analysis

::

    cp /data/assembly/HGAP3_settings.xml ./

-  start the analysis, see **NOTES** below

::

    smrtpipe.py -D TMP=./ -D SHARED_DIR=./ \  
    -D NPROC=2 -D MAX_THREADS=2 \  
    --params=HGAP3_settings.xml xml:input.xml > smrtpipe.out 2>&1

**NOTES**

-  ``-D TMP=./ -D SHARED_DIR=./`` indicate to use the local folder for
   temporary files
-  ``-D NPROC=2 -D MAX_THREADS=2`` limits the use of CPUs to 2

HGAP output
^^^^^^^^^^^

The ``data`` folder contains the most important output:

-  the folder ``0-mercounts`` up to ``4-unitigger`` and files/folders
   with ``celera-assembler`` in the name are output from Celera
-  ``corrected.*`` contain the corrected PacBio reads
-  ``draft_assembly.fasta`` is the first assembly generated by Celera
-  ``polished_assembly.*`` are the final contigs after correcting the
   bases using the raw reads, these are compressed using ``gzip``
-  lots of other files and folders are present, have a look around if
   you have time

In order to have use the final assembly, uncompress the
´polished\_assembly.fasta.gz´ file like this:

::

    gunzip polished_assembly.fasta.gz

You can have a look at the lengths of the largest sequence(s) with

::

    fasta_length polished_assembly.fasta |sort -nr |less

Next steps
~~~~~~~~~~

As for the previous assemblies, you could map reads back to the
assembly, run reapr and visualise in the browser. Is the assembly
error-less?

NOTE reapr will complain about the naming of the sequence(s) in the
``polished_assembly.fasta.gz`` file. A fix for this is to run this
command *BEFORE* running ``bwa``:

::

    sed -i 's/|quiver/_quiver/' polished_assembly.fasta

This replaces the ``|`` symbol with an underscore.