.. _oa_index:

The :program:`index` command
============================

|Orgasm|'s :program:`index` command lexicographically indexes a set of
sequence reads to make them usable by the assembler.

The constraints of the |orgasm|
-------------------------------

|Orgasm| was developed to deal with Illumina paired-end reads.
Consequently, the algorithm of |orgasm| requires that the indexed reads
respect several constraints.

- The reads must be paired.
- They must all have the same length.

Moreover:

- The read length must be odd.
- Sequences cannot contain :ref:`IUPAC ` ambiguity codes.
- They must be formatted in :ref:`fastq `.

To be able to run on more diverse types of sequence datasets, the
:program:`index` command provides a set of options that relax these
limits and accept various kinds of sequence files. These options indicate:

- The sequencing strategy (paired end by default)

  - Single direction sequencing
  - Mate-pairs library

- The file format (fastq by default)

  - Sequence file following the :ref:`fasta ` format
  - File compressed by `gzip`_ (file name ending in `.gz`)
  - File compressed by `bzip2`_ (file name ending in `.bz2`)

They also allow the read length to be defined following three strategies.

Setting up the length of the indexed reads
------------------------------------------

The default strategy
++++++++++++++++++++

When nothing is specified, the |orgasm| :program:`index ` command assumes
that all the reads from the dataset have the same length. The read length
of the dataset is therefore estimated from the first read of the forward
file. If this read length is even, it is decreased by one.

During the indexing procedure, read pairs containing a read shorter than
the limit are discarded, and reads longer than the limit are trimmed at
their 3' end to fit that length. After trimming, read pairs containing an
:ref:`IUPAC ` ambiguity code are also discarded.

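
The default strategy described above can be illustrated with a minimal
Python sketch. It is not the |orgasm| implementation: the function names
``default_read_length`` and ``filter_pairs`` are hypothetical, reads are
plain strings, and any character other than A, C, G or T is assumed to be
an ambiguity code.

.. code-block:: python

    UNAMBIGUOUS = set("ACGT")  # assumption: anything else is an IUPAC ambiguity code


    def default_read_length(first_forward_read):
        """Length of the first forward read, decreased by one if it is even."""
        length = len(first_forward_read)
        return length if length % 2 == 1 else length - 1


    def filter_pairs(pairs, length):
        """Yield trimmed read pairs, dropping short or ambiguous ones."""
        for forward, reverse in pairs:
            # A read shorter than the limit discards the whole pair.
            if len(forward) < length or len(reverse) < length:
                continue
            # Longer reads are trimmed at their 3' end.
            forward, reverse = forward[:length], reverse[:length]
            # Ambiguity codes are checked after trimming.
            if set(forward.upper()) <= UNAMBIGUOUS and set(reverse.upper()) <= UNAMBIGUOUS:
                yield forward, reverse


    pairs = [
        ("ACGTACGTAC", "TTGACCATGA"),  # kept, trimmed from 10 to 9 bases
        ("ACGTACG", "TTGACCATGA"),     # forward read too short: discarded
        ("ACGTACGTNC", "TTGACCATGA"),  # N inside the first 9 bases: discarded
        ("ACGTACGTAN", "TTGACCATGA"),  # N removed by the trimming: kept
    ]
    length = default_read_length(pairs[0][0])  # 10 is even, so 9 is used
    kept = list(filter_pairs(pairs, length))

In this toy dataset two of the four pairs survive, which shows that a
trailing ambiguity code does not cost a pair when trimming removes it.
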

The user defined read length
++++++++++++++++++++++++++++

Using the option :ref:`--length `, users can specify the read length to
index. If the specified read length is even, it is decreased by one.

As for the default strategy, read pairs containing a read shorter than the
specified limit are discarded, and reads longer than the limit are trimmed
at their 3' end to fit that length. After trimming, read pairs containing
an :ref:`IUPAC ` ambiguity code are also discarded.

The estimated read length
+++++++++++++++++++++++++

Running the :program:`index` command with the option
:ref:`--estimate-length FRACTION ` asks it to estimate the maximum read
length that still allows at least a ``FRACTION`` part of the dataset to be
used. ``FRACTION`` is a floating-point number between *0.0* and *1.0*.

In this mode the dataset is read a first time, and the longest
sub-sequence of each read containing no :ref:`IUPAC ` ambiguity code is
extracted. The length distribution of these sub-sequences is computed.
From this distribution, the maximal length that allows at least a
``FRACTION`` part of the dataset to be used is estimated.

Only the sub-sequences without :ref:`IUPAC ` ambiguity codes are indexed.
Read pairs containing a read shorter than the estimated length are
discarded.

The most common way to run the index command
--------------------------------------------

The basic unix command for running the :program:`index` command looks like
this:

.. code-block:: bash

    $ oa index --estimate-length 0.9 \
               seqindex \
               forward.fastq.gz reverse.fastq.gz

The :program:`index` command creates four files:

- ``.ogx`` : contains information concerning the index
- ``.ofx`` : contains the sequences themselves and the forward index
- ``.orx`` : contains the reverse index
- ``.opx`` : contains the read pairing data

|Orgasm| needs all these files to run the assembly. The index name,
``seqindex`` in the example above, is the name that will be used later by
the assembler. A fifth file named ``.log`` contains the traces generated by
the indexing process.

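
The example above passes ``--estimate-length 0.9``. The following Python
sketch shows one way such an estimate could be computed from the
description given in this document. It is an illustration under stated
assumptions, not the |orgasm| code: the function names are hypothetical,
the usable length of a pair is assumed to be that of its shorter
unambiguous sub-sequence, and no adjustment to an odd length or to
``--minimum-length`` is applied.

.. code-block:: python

    import math
    import re


    def longest_unambiguous(read):
        """Length of the longest run of A, C, G or T in the read."""
        runs = re.findall(r"[ACGT]+", read.upper())
        return max((len(run) for run in runs), default=0)


    def estimate_length(pairs, fraction):
        """Largest length usable by at least ``fraction`` of the read pairs."""
        # Assumption: a pair is usable at a given length only if both of its
        # reads provide an unambiguous sub-sequence at least that long.
        usable = sorted(
            (min(longest_unambiguous(f), longest_unambiguous(r)) for f, r in pairs),
            reverse=True,
        )
        if not usable:
            return 0
        # Number of pairs that must remain usable (at least one).
        needed = max(1, math.ceil(fraction * len(usable)))
        # The needed-th longest value is the largest length they all reach.
        return usable[needed - 1]


    pairs = [
        ("ACGTACGTAC", "ACGTACGTAC"),  # usable length 10
        ("ACGTACGTNC", "ACGTACGTAC"),  # usable length 8
        ("ACGTACNTAC", "ACGTACGTAC"),  # usable length 6
        ("ACGNACGNAC", "ACGTACGTAC"),  # usable length 3
    ]
    estimated = estimate_length(pairs, 0.75)  # three pairs out of four: 6

Lowering ``FRACTION`` lets the estimate grow, because fewer pairs have to
reach the chosen length; raising it to 1.0 forces the length down to that
of the worst pair.
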

command prototype
-----------------

.. code-block:: none

    usage: $ oa index [-h] [--reformat] [--single | --mate-pairs]
                      [--check-ids] [--check-pairing] [--max-read ###]
                      [--skip ###] [--check-phiX174|--no-check-phiX174]
                      [--length ### | --estimate-length #.##]
                      [--minimum-length ###] [--5-prime-trim ###]
                      [--3-prime-quality ###] [--bad-quality ###]
                      [--fasta | --forward-fasta | --reverse-fasta]
                      [--fastq-dump] [--bypass-filtering]
                      [--quality-encoding-offset] [--no-pipe]
                      [--low-memory] [reverse_fastq_file]

positional arguments
--------------------

.. option:: index

    Name of the produced index

.. option:: forward

    Filename of the forward reads

.. option:: reverse

    Filename of the reverse reads

optional arguments
------------------

.. program:: oa index

General option
++++++++++++++

.. option:: -h, --help

    Shows the help message and exits

.. option:: --reformat

    Reformats an old sequence index to the new format

Sequencing strategy
+++++++++++++++++++

.. option:: --single

    Single read mode.

.. option:: --mate-pairs

    Indicates that the two read files were obtained using a mate pair
    sequencing strategy.

Sequence file checking
++++++++++++++++++++++

.. option:: --check-ids

    Checks that forward and reverse ids are identical. The two sequence
    ids `seqid/1` and `seqid/2` are considered identical.

.. option:: --check-pairing

    Ensures that forward and reverse files are correctly paired. The
    pairing is checked based on the sequence identifier. The two sequences
    with the ids `seqid/1` and `seqid/2` are considered paired.

Sequence quality checking
+++++++++++++++++++++++++

.. option:: --check-phiX174

    Checks for PhiX174 contamination

.. option:: --no-check-phiX174

    Does not check for PhiX174 contamination (default)

.. option:: --5-prime-trim ##

    Cuts the first N bases of each read (default 0 bp)

.. option:: --3-prime-quality ##

    Hard clips the 3' end of each read after the first base with a score
    less than or equal to Q (default 0, no clipping)

.. option:: --bad-quality ##

    Considers quality scores below Q as bad, and tries to clip reads to
    maximise the overall quality. Zero means no clipping (default 10)

.. option:: --skip ##

    Skips the first N read pairs (default 0)

.. option:: --bypass-filtering

    Sequence files are considered as pre-filtered fastq files

Limit for the indexation
++++++++++++++++++++++++

.. _index.max-read:

.. option:: --max-read ###

    `###` indicates the number of millions of reads to index. If not
    specified, all the reads are indexed, within the limit imposed by the
    program and printed at the beginning of the program trace.

    .. code-block:: bash

        $ oa index --max-read 4 seqindex forward.fastq reverse.fastq

    Builds the index with a maximum of four million reads.

.. _index.length:

.. option:: --length ###

    `###` represents the read length to consider. Only reads with a length
    greater than or equal to `###` will be indexed. Reads longer than the
    specified length are truncated to the specified length.

    .. code-block:: bash

        $ oa index --length 90 seqindex forward.fastq reverse.fastq

    Indexes the `forward.fastq` and `reverse.fastq` files using only reads
    of 90 bp or longer.

    If the :option:`--length ###` option is not used, the length is
    estimated from the length of the first read of the forward file or
    through the :option:`--estimate-length #.##` option.

.. _index.estimate-length:

.. option:: --estimate-length #.##

    `#.##`, ranging between 0.0 and 1.0, indicates which fraction of the
    overall dataset we want to use. When this option is used, the sequence
    length to index is estimated so as to respect this constraint.

    .. code-block:: bash

        $ oa index --estimate-length 0.9 seqindex forward.fastq reverse.fastq

    Indexes the `forward.fastq` and `reverse.fastq` files using a length
    such that at least 90% of the reads will be indexed.

.. option:: --minimum-length ###

    The minimum length of the reads to index if the *--estimate-length*
    option is activated (default 81)

.. option:: --fastq-dump

    Dump the fastq file or the trimmed reads

Sequence file format
++++++++++++++++++++

.. _index.fasta:

.. option:: --fasta

    Indicates that the two sequence files to index are :ref:`fasta ` files.

    .. code-block:: bash

        $ oa index --fasta seqindex forward.fasta reverse.fasta

.. _index.forward-fasta:

.. option:: --forward-fasta

    Indicates that the forward file is a fasta file

    .. code-block:: bash

        $ oa index --forward-fasta seqindex forward.fasta reverse.fastq

.. _index.reverse-fasta:

.. option:: --reverse-fasta

    Indicates that the reverse file is a fasta file

    .. code-block:: bash

        $ oa index --reverse-fasta seqindex forward.fastq reverse.fasta

.. _index.quality-encoding-offset:

.. option:: --quality-encoding-offset ##

    The code offset added to each quality score to encode fastq quality
    (default 33 - Sanger format)

    .. code-block:: bash

        $ oa index --quality-encoding-offset 64 seqindex forward.fastq reverse.fastq

    Allows for reading the old *Solexa* fastq format. See the Wikipedia
    page on the FastQ format to find out which *quality encoding offset*
    corresponds to your files.

If the file names end in `.gz` or `.bz2`, the files are considered as
compressed by the `gzip`_ or the `bzip2`_ program respectively, and are
uncompressed on the fly. The :ref:`fasta ` related options can be combined
without restriction with this feature.

.. code-block:: bash

    $ oa index --reverse-fasta seqindex \
               forward.fastq.gz reverse.fasta.bz2

The forward file follows the :ref:`fastq ` format and is compressed with
`gzip`_. The reverse file follows the :ref:`fasta ` format and is
compressed with `bzip2`_.

System option
+++++++++++++

.. _index.no-pipe:

.. option:: --no-pipe

    By default the :ref:`organelle assembler ` uses named pipes to transfer
    data among programs. This option forces the use of temporary files
    instead.

.. _index.low-memory:

.. option:: --low-memory

    Reduces memory usage for the optimal length computation

.. _`gzip`: http://www.gzip.org
.. _`bzip2`: http://www.bzip.org