Assembly using HGAP =================== From the PacBio website: "The Hierarchical Genome Assembly Process (HGAP) for long single pass reads generated by the PacBio Single Molecule Real Time (SMRT) sequencer was developed to allow the complete and accurate shotgun assembly of bacterial sized genomes." See https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP Using HGAP ~~~~~~~~~~ The HGAP pipeline: - starts with the raw output from the instrument, so-called ``bax.h5`` file (these are in the HDF5 binary format) - extracts the reads and maps all of them to the longest set of reads - corrects the longest set of reads - runs the corrected reads through the Celera assembler (Celera Assembler was developed during the time of Sanger sequencing by the company Celera Genomics. Celera Assembler was used to assemble the *Drosophila* genome, as well as the human genome) - maps all the raw reads back to the assembled contigs and recalls the consensus bases Given 60-100x coverage in raw PacBio reads, HGAP very often yields complete, gapless, highly accurate assemblies, i.e., one contig per chromosomal element with maybe a few bases wrong. Here we will use a dataset provided by Pacific Biosciences, comprising ~95x coverage in raw reads. Setting up the assembly ^^^^^^^^^^^^^^^^^^^^^^^ HGAP is part of a software suite, developed by Pacific Biosciences, called ``smrtanalysis``. The main command, ``smrtpipe``, takes as input: - an xml file pointing to a set of raw PacBio files - an xml file with settings, in our case, specific for HGAP **NOTE** the assembly will take several hours, so use the ``screen`` command! See https://wiki.uio.no/projects/clsi/index.php/Tip:using_screen - Make a new folder called, for example, called ``hgap`` - we don't use a module this time, but need to add the software to our environment by using this command (remember to use ``screen`` first): :: source /cluster/software/smrtanalysis/current/etc/setup.sh (ignore the WARNINGS). - make a file that has a list of the input files like this: :: ls /data/assembly/*.bax.h5 >input.fofn - we simply redirect the output of the ``ls`` command to a file - ´fofn´ stands for 'file of filenames' - NOTE an alternative, more common way of doing this is using the ``find`` command - convert this ``fofn`` file to an ``xml`` file the software understands: :: fofnToSmrtpipeInput.py input.fofn >input.xml - add a copy of a settings xml file to your folder, this is needed to tell the software how to run the analysis :: cp /data/assembly/HGAP3_settings.xml ./ - start the analysis, see **NOTES** below :: smrtpipe.py -D TMP=./ -D SHARED_DIR=./ \ -D NPROC=2 -D MAX_THREADS=2 \ --params=HGAP3_settings.xml xml:input.xml > smrtpipe.out 2>&1 **NOTES** - ``-D TMP=./ -D SHARED_DIR=./`` indicate to use the local folder for temporary files - ``-D NPROC=2 -D MAX_THREADS=2`` limits the use of CPUs to 2 HGAP output ^^^^^^^^^^^ The ``data`` folder contains the most important output: - the folder ``0-mercounts`` up to ``4-unitigger`` and files/folders with ``celera-assembler`` in the name are output from Celera - ``corrected.*`` contain the corrected PacBio reads - ``draft_assembly.fasta`` is the first assembly generated by Celera - ``polished_assembly.*`` are the final contigs after correcting the bases using the raw reads, these are compressed using ``gzip`` - lots of other files and folders are present, have a look around if you have time In order to have use the final assembly, uncompress the ´polished\_assembly.fasta.gz´ file like this: :: gunzip polished_assembly.fasta.gz You can have a look at the lengths of the largest sequence(s) with :: fasta_length polished_assembly.fasta |sort -nr |less Next steps ~~~~~~~~~~ As for the previous assemblies, you could map reads back to the assembly, run reapr and visualise in the browser. Is the assembly error-less? NOTE reapr will complain about the naming of the sequence(s) in the ``polished_assembly.fasta.gz`` file. A fix for this is to run this command *BEFORE* running ``bwa``: :: sed -i 's/|quiver/_quiver/' polished_assembly.fasta This replaces the ``|`` symbol with an underscore.