.. _oa_index:

The :program:`index` command
============================

|Orgasm|'s :program:`index` command lexicographically indexes
a set of sequence reads to make them usable by the
assembler.

The constraints of |orgasm|
-------------------------------

|Orgasm| was developed to deal with Illumina
paired-end reads. Consequently, the algorithm of |orgasm|
requires that the indexed reads respect several constraints.

    - The reads must be paired.
    - They must all have the same length.

Moreover:

    - The read length must be odd.
    - Sequences cannot contain :ref:`IUPAC <iupac_code>` ambiguity codes.
    - They must be formatted in the :ref:`fastq <fastq>` format.

To be able to run on more diverse types of sequence datasets, the
:program:`index` command provides a set of options that overcome these
limits and handle various kinds of sequence data.

These options indicate:

    - The sequencing strategy (paired end by default)
        - Single direction sequencing
        - Mate-pairs library
    - The file format (fastq by default)
        - Sequence file following the :ref:`fasta <fasta>` format
        - File compressed by `gzip`_ (file name ending in `.gz`)
        - File compressed by `bzip2`_ (file name ending in `.bz2`)

They also allow the read length to be defined following one of three strategies.

Setting up the length of the indexed reads
------------------------------------------

The default strategy
++++++++++++++++++++

When nothing is specified, the |orgasm| :program:`index` command
assumes that all the reads from the dataset have the same length.
Under this assumption, the read length of the dataset is estimated
from the first read of the forward file. If that read length is even,
it is decreased by one.

During the indexing procedure, read pairs containing a read shorter than
the limit are discarded, and reads longer than the limit are trimmed at their
3' end to fit that length. After trimming, read pairs containing an
:ref:`IUPAC <iupac_code>` ambiguity code are also discarded.
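
The rules of the default strategy can be sketched in a few lines of Python.
This is only an illustration of the behaviour described above, not the actual
|orgasm| code: the names ``default_read_length`` and ``filter_pair`` are
hypothetical.

.. code-block:: python

  # IUPAC ambiguity codes (every nucleotide symbol except A, C, G and T)
  AMBIGUOUS = set("RYSWKMBDHVN")


  def default_read_length(first_forward_read):
      """Length of the first forward read, decreased by one if it is even."""
      length = len(first_forward_read)
      return length if length % 2 == 1 else length - 1


  def filter_pair(forward, reverse, length):
      """Return the trimmed pair, or None if the pair must be discarded."""
      if len(forward) < length or len(reverse) < length:
          return None  # one of the two reads is too short
      # Trim both reads at their 3' end
      forward, reverse = forward[:length], reverse[:length]
      if AMBIGUOUS & set((forward + reverse).upper()):
          return None  # an ambiguity code remains after trimming
      return forward, reverse

In this sketch an ambiguity code located in the trimmed 3' part of a read does
not cause the pair to be discarded, because the check is done after trimming.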

The user-defined read length
++++++++++++++++++++++++++++

Using the option :ref:`--length <index.length>`, users can specify the read
length to index. If the specified read length is even, it is decreased by one.

As for the default strategy, read pairs containing a read shorter than the
specified limit are discarded, and reads longer than the limit are trimmed at
their 3' end to fit that length. After trimming, read pairs containing an
:ref:`IUPAC <iupac_code>` ambiguity code are also discarded.

The estimated read length
+++++++++++++++++++++++++

Running the :program:`index` command with the option
:ref:`--estimate-length FRACTION <index.estimate-length>` asks it to estimate
the maximum read length that still allows at least the ``FRACTION`` part of the
dataset to be used. ``FRACTION`` is a floating-point number between *0.0* and *1.0*.

In this mode the dataset is read a first time and the longest sub-sequence of
each read containing no :ref:`IUPAC <iupac_code>` ambiguity code is extracted.
The length distribution of these sub-sequences is computed. From this
distribution, the maximal length that allows at least the ``FRACTION`` part
of the dataset to be used is estimated.

Only the sub-sequences without :ref:`IUPAC <iupac_code>` ambiguity code are
indexed. Read pairs containing a read shorter than the estimated length are
discarded.
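
The estimation described above can be illustrated by the following Python
sketch. It is not the |orgasm| implementation: the function names are
hypothetical, the fraction is assumed to be counted over individual reads, and
the sketch ignores both the odd-length constraint and the
``--minimum-length`` option.

.. code-block:: python

  import math
  import re


  def longest_unambiguous(read):
      """Length of the longest run of A, C, G or T found in the read."""
      runs = re.findall(r"[ACGT]+", read.upper())
      return max((len(run) for run in runs), default=0)


  def estimate_length(reads, fraction):
      """Largest length reached by at least `fraction` of the reads."""
      lengths = sorted((longest_unambiguous(r) for r in reads), reverse=True)
      if not lengths:
          raise ValueError("no reads to estimate the length from")
      # Number of reads that must remain usable at the returned length
      needed = max(1, math.ceil(fraction * len(lengths)))
      return lengths[needed - 1]

Sorting the lengths in decreasing order makes the answer a simple lookup: the
value at rank ``needed`` is the longest length that at least ``needed`` reads
can provide.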

The most common way to run the index command
--------------------------------------------

The basic Unix command line for running the :program:`index` command looks
like this:

.. code-block:: bash

  $ oa index --estimate-length 0.9 \
             seqindex \
             forward.fastq.gz reverse.fastq.gz

The :program:`index` command creates four files:

    - ``<index>.ogx`` : contains information concerning the index
    - ``<index>.ofx`` : contains the sequences themselves and the forward index
    - ``<index>.orx`` : contains the reverse index
    - ``<index>.opx`` : contains read pairing data

|Orgasm| needs all these files to run the assembly.
``<index>`` represents the name of the index that will be used later by the assembler.

A fifth file named ``<index>.log`` contains the traces generated by the indexing
process.

command prototype
-----------------

.. code-block:: none

    usage:     $ oa index [-h] [--reformat]
                          [--single | --mate-pairs]
                          [--check-ids] [--check-pairing]
                          [--max-read ###] [--skip ###]
                          [--check-phiX174 | --no-check-phiX174]
                          [--length ### | --estimate-length #.##]
                          [--minimum-length ###]
                          [--5-prime-trim ###]
                          [--3-prime-quality ###] [--bad-quality ###]
                          [--fasta | --forward-fasta | --reverse-fasta]
                          [--fastq-dump] [--bypass-filtering]
                          [--quality-encoding-offset ##]
                          [--no-pipe] [--low-memory]
                          <index> <forward_fastq_file> [reverse_fastq_file]

positional arguments
--------------------

.. option::        index

      Name of the produced index

.. option::      forward

    Filename of the forward reads

.. option::      reverse

    Filename of the reverse reads

optional arguments
------------------

.. program:: oa index

General options
+++++++++++++++

.. option::    -h, --help

      Shows the help message and exits.

.. option::    --reformat

      Reformats an old sequence index to the new
      format.


Sequencing strategy
+++++++++++++++++++

.. option::    --single

      Single read mode.

.. option::    --mate-pairs

      Indicates that the two read files were obtained using a mate pair
      sequencing strategy.

Sequence file checking
++++++++++++++++++++++

.. option::    --check-ids

      Checks that forward and reverse ids are identical.
      The two sequence ids `seqid/1` and `seqid/2` are considered as
      identical.

.. option::    --check-pairing

      Ensures that forward and reverse files are correctly paired.
      The pairing is checked based on the sequence identifier.
      Two sequences with the ids `seqid/1` and `seqid/2` are
      considered as paired.
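
      A minimal Python sketch of such a comparison, assuming that only a
      trailing ``/1`` or ``/2`` is ignored when comparing identifiers. The
      function names are hypothetical and the check actually performed by the
      program may differ.

      .. code-block:: python

        def base_id(seqid):
            """Strip a trailing /1 or /2 from a sequence identifier."""
            if seqid.endswith(("/1", "/2")):
                return seqid[:-2]
            return seqid


        def is_paired(forward_id, reverse_id):
            """True if both identifiers refer to the same read pair."""
            return base_id(forward_id) == base_id(reverse_id)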


Sequence quality checking
+++++++++++++++++++++++++

.. option::    --check-phiX174

      Checks for PhiX174 contamination

.. option::    --no-check-phiX174

      Does not check for PhiX174 contamination (default)

.. option::    --5-prime-trim ##

      Cuts the first N bases from the 5' end of each read (default: 0 bp)

.. option::    --3-prime-quality ##

      Hard clips the 3' end of each read after the first
      base with a quality score less than or equal to Q
      (default: 0, no clipping)

.. option::    --bad-quality ##

      Considers quality scores below Q as bad, and tries
      to clip reads to maximise the overall quality. Zero
      means no clipping (default: 10)

.. option::    --skip ##

      Skips the first N read pairs (default: 0)

.. option::    --bypass-filtering

      Sequence files are considered as pre-filtered fastq files


Limit for the indexation
++++++++++++++++++++++++

.. _index.max-read:

.. option::    --max-read ###

        `###` indicates the number of millions of reads to index. If not
        specified, all the reads are indexed, within the limit imposed by
        the program, which is printed at the beginning of the program trace.

        .. code-block:: bash

          $ oa index --max-read 4 seqindex forward.fastq reverse.fastq

        Builds the index with a maximum of four million reads.

.. _index.length:

.. option::   --length ###

        `###` represents the read length to consider. Only reads
        with a length greater than or equal to `###` will be indexed. Reads longer
        than the specified length are truncated to the specified length.

        .. code-block:: bash

          $ oa index --length 90 seqindex forward.fastq reverse.fastq

        Indexes the `forward.fastq` and `reverse.fastq` files using only
        reads at least 90 bp long.

        If the :option:`--length ###` option is not used, the length is estimated
        from the length of the first read of the forward file or through the
        :option:`--estimate-length #.##` option.

.. _index.estimate-length:

.. option::    --estimate-length #.##

        `#.##`, ranging between 0.0 and 1.0, indicates the fraction
        of the overall dataset to use. When this option is used,
        the sequence length to index is estimated so as to respect this constraint.

        .. code-block:: bash

          $ oa index --estimate-length 0.9 seqindex forward.fastq reverse.fastq

        Indexes the `forward.fastq` and `reverse.fastq` files using a length
        such that at least 90% of the reads will be indexed.

.. option::    --minimum-length ###

      The minimum length of the read to index if the
      *--estimate-length* option is activated (default 81)

.. option::    --fastq-dump

      Dumps the fastq file of the trimmed reads


Sequence file format
++++++++++++++++++++

.. _index.fasta:

.. option::    --fasta

        Indicates that the two sequence files to index are :ref:`fasta <fasta>` files.

        .. code-block:: bash

          $ oa index --fasta seqindex forward.fasta reverse.fasta

.. _index.forward-fasta:

.. option::    --forward-fasta

        Indicates that the forward file is a fasta file

        .. code-block:: bash

          $ oa index --forward-fasta seqindex forward.fasta reverse.fastq

.. _index.reverse-fasta:

.. option::    --reverse-fasta

        Indicates that the reverse file is a fasta file

        .. code-block:: bash

          $ oa index --reverse-fasta seqindex forward.fastq reverse.fasta

.. _index.quality-encoding-offset:

.. option::    --quality-encoding-offset ##

        The code offset added to each quality score to encode
        fastq quality (default 33 - Sanger format)

        .. code-block:: bash

          $ oa index --quality-encoding-offset 64 seqindex forward.fastq reverse.fastq

        Allows reading the old *Solexa* fastq format. See the Wikipedia page
        on the FASTQ format to find out which *quality encoding offset*
        corresponds to your files.
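
        The role of the offset can be illustrated with a short Python sketch.
        The function ``decode_quality`` is hypothetical and only shows the
        arithmetic: each quality character is converted to its character code,
        from which the offset is subtracted.

        .. code-block:: python

          def decode_quality(quality_line, offset=33):
              """Convert a fastq quality line into a list of integer scores."""
              return [ord(char) - offset for char in quality_line]

        The same score is therefore written with different characters
        depending on the encoding: a score of 40 is written ``I`` with an
        offset of 33 and ``h`` with an offset of 64.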


If the file names end in `.gz` or `.bz2`, the files are considered to be compressed
by the `gzip`_ or the `bzip2`_ program respectively, and are uncompressed on the
fly. The :ref:`fasta <fasta>` related options can be combined without restriction with this
feature.

          .. code-block:: bash

            $ oa index --reverse-fasta seqindex \
                       forward.fastq.gz reverse.fasta.bz2

In this example, the forward file follows the :ref:`fastq <fastq>` format and is
compressed with `gzip`_. The reverse file follows the :ref:`fasta <fasta>` format
and is compressed with `bzip2`_.
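
The selection of the decompression method from the file name can be sketched
as follows in Python. This is an illustration only: ``open_reads`` is a
hypothetical helper built on the standard ``gzip`` and ``bz2`` modules, whereas
the program itself may rely on the external tools to uncompress the files.

.. code-block:: python

  import bz2
  import gzip


  def open_reads(filename):
      """Open a sequence file in text mode, uncompressing it if needed."""
      if filename.endswith(".gz"):
          return gzip.open(filename, "rt")
      if filename.endswith(".bz2"):
          return bz2.open(filename, "rt")
      return open(filename, "rt")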

System options
++++++++++++++

.. _index.no-pipe:

.. option::    --no-pipe

        By default the :ref:`organelle assembler <oa>` uses named pipes to transfer
        data among programs. This option forces the use of
        temporary files instead.

.. _index.low-memory:

.. option::    --low-memory

        Reduce memory usage for optimal length computation

.. _`gzip`: http://www.gzip.org
.. _`bzip2`: http://www.bzip.org