The index command

The ORGanelle ASeMbler’s index command lexicographically indexes a set of sequence reads so that the assembler can use them.

The constraints of the ORGanelle ASeMbler

The ORGanelle ASeMbler was developed to deal with Illumina paired-end reads. Consequently, its algorithm requires that the indexed reads respect several constraints.

  • The reads must be paired.

  • They must all have the same length.

Moreover:

  • The read length must be odd.

  • Sequences cannot contain IUPAC ambiguity codes.

  • They must be formatted in fastq.
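
The constraints above can be summarised as a small check on a read pair. This is only an illustrative Python sketch, not the assembler's own code: the function name pair_is_indexable and its parameters are hypothetical.

```python
# Hypothetical helper illustrating the constraints listed above.
UNAMBIGUOUS = set("ACGT")


def pair_is_indexable(forward: str, reverse: str, read_length: int) -> bool:
    """Return True if both reads of a pair satisfy the indexing constraints."""
    # All reads must share one length, and that length must be odd.
    same_length = len(forward) == read_length and len(reverse) == read_length
    odd_length = read_length % 2 == 1
    # No IUPAC ambiguity code (N, R, Y, ...) may appear in either read.
    bases = set(forward.upper()) | set(reverse.upper())
    no_ambiguity = bases <= UNAMBIGUOUS
    return same_length and odd_length and no_ambiguity


print(pair_is_indexable("ACGTA", "TTGCA", 5))  # True
print(pair_is_indexable("ACNTA", "TTGCA", 5))  # False: N is an ambiguity code
```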

To run on more diverse types of sequence datasets, the index command provides a set of options that overcome these limits and handle various kinds of sequence files.

These options indicate:

  • The sequencing strategy (paired end by default)
    • Single direction sequencing

    • Mate-pairs library

  • The file format (fastq by default)
    • Sequence file following the fasta format

    • File compressed by gzip (file name ending in .gz)

    • File compressed by bzip2 (file name ending in .bz2)

They also let you set the read length using one of three strategies.

Setting up the length of the indexed reads

The default strategy

When nothing is specified, the ORGanelle ASeMbler index command assumes that all the reads in the dataset have the same length. The read length of the dataset is therefore taken from the first read of the forward file. If this length is even, it is decreased by one.
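
As a rough illustration of this default rule, the sketch below reads the first record of a forward file and returns its length, made odd. It assumes an uncompressed, four-line-per-record fastq stream; the function name default_read_length is hypothetical and does not come from the assembler.

```python
import io


def default_read_length(forward_fastq) -> int:
    """Length of the first read, decreased by one if it is even."""
    forward_fastq.readline()                      # header line (@identifier)
    sequence = forward_fastq.readline().strip()   # sequence line of the first read
    length = len(sequence)
    return length if length % 2 == 1 else length - 1


# The first read below has 10 bases, so the indexed length becomes 9.
example = io.StringIO("@read1/1\nACGTACGTAC\n+\nIIIIIIIIII\n")
print(default_read_length(example))  # 9
```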

During the indexing procedure, read pairs containing a read shorter than this limit are discarded, and reads longer than the limit are trimmed at their 3’ end to the required length. After trimming, read pairs containing an IUPAC ambiguity code are also discarded.

The user defined read length

Using the --length option, users can specify the read length to index. If the specified read length is even, it is decreased by one.

As with the default strategy, read pairs containing a read shorter than the specified limit are discarded, and reads longer than the limit are trimmed at their 3’ end to the required length. After trimming, read pairs containing an IUPAC ambiguity code are also discarded.
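
The order of operations described here (make the length odd, discard short pairs, trim the 3’ end, then discard pairs with ambiguity codes) can be sketched in Python as follows. The function names are hypothetical, and reads are assumed to be plain strings written from 5’ to 3’.

```python
from typing import Optional, Tuple


def make_odd(length: int) -> int:
    """An even read length is decreased by one."""
    return length if length % 2 == 1 else length - 1


def trim_pair(forward: str, reverse: str, length: int) -> Optional[Tuple[str, str]]:
    """Return the trimmed pair, or None if the pair must be discarded."""
    length = make_odd(length)
    # Pairs containing a read shorter than the limit are discarded.
    if len(forward) < length or len(reverse) < length:
        return None
    # Longer reads are trimmed at their 3' end, i.e. the end of the string.
    forward, reverse = forward[:length], reverse[:length]
    # After trimming, pairs still containing an ambiguity code are discarded.
    if set(forward.upper() + reverse.upper()) - set("ACGT"):
        return None
    return forward, reverse


print(trim_pair("ACGTACG", "TTTTTTT", 6))  # ('ACGTA', 'TTTTT')
```

Because the ambiguity check runs after trimming, a pair whose only N lies beyond the indexed length is kept in this sketch.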

The estimated read length

Running the index command with the option --estimate-length FRACTION asks it to estimate the maximum read length that still allows at least the FRACTION part of the dataset to be used. FRACTION is a floating-point number between 0.0 and 1.0.

In this mode, the dataset is read a first time and the longest sub-sequence of each read containing no IUPAC ambiguity code is extracted. The length distribution of these sub-sequences is computed. From this distribution, the maximal length that allows at least the FRACTION part of the dataset to be used is estimated.

Only the sub-sequences without IUPAC ambiguity codes are indexed. Read pairs containing a read shorter than the estimated length are discarded.
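
The estimation procedure can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text above: reads are treated individually rather than as pairs, the --minimum-length option is ignored, and the result is made odd to match the odd-length constraint. The function names are hypothetical.

```python
import math
import re


def longest_clean_run(read: str) -> int:
    """Length of the longest stretch of the read without ambiguity codes."""
    runs = re.findall(r"[ACGT]+", read.upper())
    return max((len(run) for run in runs), default=0)


def estimate_length(reads, fraction: float) -> int:
    """Largest length that at least `fraction` of the reads can provide."""
    lengths = sorted((longest_clean_run(r) for r in reads), reverse=True)
    if not lengths:
        raise ValueError("no reads to estimate from")
    # Number of reads that must remain usable at the chosen length.
    needed = max(1, math.ceil(fraction * len(lengths)))
    length = lengths[needed - 1]
    # Assumption: keep the length odd, as required for indexed reads.
    if length > 0 and length % 2 == 0:
        length -= 1
    return length


reads = ["ACGTACGTA", "ACGTANACG", "ACGTACG", "ACG"]
print(estimate_length(reads, 0.5))  # 7: two of the four reads reach 7 bases
```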

The most common way to run the index command

The basic Unix command for running the index command looks like this:

$ oa index --estimate-length 0.9 \
           seqindex \
           forward.fastq.gz reverse.fastq.gz

The index command creates four files:

  • <index>.ogx : contains information concerning the index

  • <index>.ofx : contains the sequences themselves and the forward index

  • <index>.orx : contains the reverse index

  • <index>.opx : contains read pairing data

The ORGanelle ASeMbler needs all of these files to run an assembly. <index> represents the name of the index that will be used later by the assembler.

A fifth file named <index>.log contains the traces generated by the indexing process.
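
Since the assembler needs all four index files, it can be convenient to check that they exist before launching an assembly. The Python sketch below does this with the file extensions listed above; the function name missing_index_files is hypothetical and is not part of the oa command.

```python
from pathlib import Path

# Extensions of the four files created by the index command.
INDEX_SUFFIXES = (".ogx", ".ofx", ".orx", ".opx")


def missing_index_files(index_name: str, directory: str = "."):
    """Return the names of the index files absent from `directory`."""
    base = Path(directory)
    return [index_name + suffix
            for suffix in INDEX_SUFFIXES
            if not (base / (index_name + suffix)).is_file()]


# An empty list means that the index named seqindex is complete.
print(missing_index_files("seqindex"))
```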

command prototype

usage:     $ oa index [-h] [--reformat]
                      [--single | --mate-pairs]
                      [--check-ids] [--check-pairing]
                      [--max-read ###] [--skip ###]
                      [--check-phiX174|--no-check-phiX174]
                      [--length ### | --estimate-length #.##]
                      [--minimum-length ###]
                      [--5-prime-trim ###]
                      [--3-prime-quality ###] [--bad-quality ###]
                      [--fasta | --forward-fasta | --reverse-fasta]
                      [--fastq-dump] [--bypass-filtering]
                      [--quality-encoding-offset ###]
                      [--no-pipe] [--low-memory]
                      <index> <forward_fastq_file> [reverse_fastq_file]

positional arguments

index

Name of the produced index

forward

Filename of the forward reads

reverse

Filename of the reverse reads

optional arguments

General option

-h, --help

show the help message and exit

--reformat

Asks for reformatting an old sequence index to the new format

Sequencing strategy

--single

Single read mode.

--mate-pairs

Indicates that the two read files were obtained using a mate pair sequencing strategy.

Sequence file checking

--check-ids

Checks that forward and reverse ids are identical. The two sequence ids seqid/1 and seqid/2 are considered as identical.

--check-pairing

Ensures that forward and reverse files are correctly paired. The pairing is checked based on the sequence identifiers. Two sequences with the ids seqid/1 and seqid/2 are considered as paired.
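
The identifier comparison used by these two checks can be illustrated with a short Python sketch. It assumes that the only difference tolerated between the two identifiers is a trailing /1 or /2; the function names are hypothetical and the assembler's actual rule may be stricter.

```python
def base_id(read_id: str) -> str:
    """Strip a trailing /1 or /2 from a sequence identifier."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def ids_match(forward_id: str, reverse_id: str) -> bool:
    """True if the forward and reverse identifiers designate the same pair."""
    return base_id(forward_id) == base_id(reverse_id)


print(ids_match("seqid/1", "seqid/2"))  # True
print(ids_match("seqA/1", "seqB/2"))    # False
```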

Sequence quality checking

--check-phiX174

Checks for PhiX174 contamination

--no-check-phiX174

Does not check for PhiX174 contamination (default)

--5-prime-trim ##

Cuts the first ## bases from the 5’ end of each read (default 0 bp)

--3-prime-quality ##

Hard clips the 3’ end of each read after the first base with a score less than or equal to ## (default 0, no clipping)

--bad-quality ##

Considers quality scores below ## as bad, and tries to clip reads to maximise the overall quality. Zero means no clipping (default 10)

--skip ##

Skips the first ## read pairs (default 0)

--bypass-filtering

Sequence files are considered as pre-filtered fastq files

Limits for the indexing

--max-read ###

### indicates the number of millions of reads to index. If not specified, all the reads are indexed, within the limit imposed by the program, which is printed at the beginning of the program trace.

$ oa index --max-read 4 seqindex forward.fastq reverse.fastq

Builds the index with a maximum of four million reads.

--length ###

### represents the read length to consider. Only reads with a length greater than or equal to ### will be indexed. Reads longer than the specified length are truncated to the specified length.

$ oa index --length 90 seqindex forward.fastq reverse.fastq

Indexes the forward.fastq and reverse.fastq files using only reads at least 90 bp long.

If the --length ### option is not used, the length is estimated from the length of the first read of the forward file or through the --estimate-length #.## option.

--estimate-length #.##

#.##, ranging between 0.0 and 1.0, indicates which fraction of the overall dataset should be used. When this option is used, the sequence length to index is estimated so as to respect this constraint.

$ oa index --estimate-length 0.9 seqindex forward.fastq reverse.fastq

Indexes the forward.fastq and reverse.fastq files using a length such that at least 90% of the reads will be indexed.

--minimum-length ###

The minimum length of the reads to index if the --estimate-length option is activated (default 81)

--fastq-dump

Dumps the fastq file of the trimmed reads

Sequence file format

--fasta

Indicates that the two sequence files to index are fasta files.

$ oa index --fasta seqindex forward.fasta reverse.fasta

--forward-fasta

Indicates that the forward file is a fasta file

$ oa index --forward-fasta seqindex forward.fasta reverse.fastq

--reverse-fasta

Indicates that the reverse file is a fasta file

$ oa index --reverse-fasta seqindex forward.fastq reverse.fasta

--quality-encoding-offset ##

The code offset added to each quality score to encode fastq quality (default 33 - Sanger format)

$ oa index --quality-encoding-offset 64 seqindex forward.fastq reverse.fastq

Allows for reading the old Solexa fastq format. See the FastQ format Wikipedia page to find out which quality encoding offset corresponds to your files.
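
To make the role of the offset concrete, the following Python sketch decodes a fastq quality line by subtracting the offset from the character code of each symbol. The function name decode_qualities is hypothetical; only the arithmetic reflects the fastq encoding.

```python
def decode_qualities(quality_line: str, offset: int = 33):
    """Convert a fastq quality string into a list of integer scores."""
    return [ord(symbol) - offset for symbol in quality_line]


# The same scores encoded with offset 33 and with offset 64.
print(decode_qualities("II5!"))             # [40, 40, 20, 0]
print(decode_qualities("hhT@", offset=64))  # [40, 40, 20, 0]
```

Decoding a file with the wrong offset shifts every score by 31, which is why the offset must match the encoding of the input files.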

If the file names end in .gz or .bz2, the files are considered to be compressed by the gzip or the bzip2 program, respectively, and are uncompressed on the fly. The fasta-related options can be combined with this feature without restriction.

$ oa index --reverse-fasta seqindex \
           forward.fastq.gz reverse.fasta.bz2

The forward file follows the fastq format and is compressed with gzip. The reverse file follows the fasta format and is compressed with bzip2.
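
Selecting the decompression method from the file name can be sketched in Python with the standard gzip and bz2 modules. This only illustrates the naming convention described above; the function name open_reads is hypothetical, and the assembler itself may handle compressed files differently.

```python
import bz2
import gzip


def open_reads(filename: str):
    """Open a sequence file as text, decompressing it on the fly if needed."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")   # compressed by gzip
    if filename.endswith(".bz2"):
        return bz2.open(filename, "rt")    # compressed by bzip2
    return open(filename, "r")             # plain, uncompressed file
```

The returned object can be read line by line in the same way whichever of the three cases applies.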

System option

--no-pipe

By default, the ORGanelle ASeMbler uses named pipes to transfer data among programs. With this option, you can force it to use temporary files instead.

--low-memory

Reduce memory usage for optimal length computation