The index command
The ORGanelle ASeMbler’s index command lexicographically indexes a set of sequence reads so that the assembler can use them.
The constraints of the ORGanelle ASeMbler
The ORGanelle ASeMbler was developed to deal with Illumina paired-end reads. Consequently, its algorithm requires that the indexed reads respect several constraints.
The reads must be paired.
They must all have the same length.
Moreover:
To run on more diverse types of sequence datasets, the index command provides a set of options that overcome these limits and format various kinds of sequences.
These options allow for indicating:
They also make it possible to define the read length following three strategies.
Setting up the length of the indexed reads
The default strategy
When nothing is specified, the ORGanelle ASeMbler index command assumes that all the reads in the dataset have the same length. The read length of the dataset is therefore taken from the first read of the forward file. If this length is even, it is decreased by one.
During the indexing procedure, read pairs containing a read shorter than this length are discarded, and reads longer than it are trimmed at their 3’ end to fit. After trimming, read pairs containing an IUPAC ambiguity code are also discarded.
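The default length rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the assembler's own code: the function name `default_index_length` is hypothetical and the read is represented as a plain string.

```python
def default_index_length(first_forward_read: str) -> int:
    """Return the read length used for indexing under the default strategy.

    Hypothetical helper: the length is taken from the first forward read
    and decreased by one when it is even, so the result is always odd.
    """
    length = len(first_forward_read)
    if length == 0:
        raise ValueError("the first forward read is empty")
    if length % 2 == 0:
        length -= 1
    return length


# A 100 bp first read gives an indexed length of 99,
# while a 101 bp first read is kept at 101.
print(default_index_length("A" * 100))
print(default_index_length("A" * 101))
```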
The user-defined read length
Using the option --length, users can specify the read length to index. If the specified read length is even, it is decreased by one.
As for the default strategy, read pairs containing a read shorter than the specified length are discarded, and reads longer than it are trimmed at their 3’ end to fit. After trimming, read pairs containing an IUPAC ambiguity code are also discarded.
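The discard and trim rules shared by the default and user-defined strategies can be sketched as follows. This is a minimal illustration, not the assembler's implementation: `prepare_pair` is a hypothetical name, reads are plain strings, and any character other than A, C, G or T is assumed to count as an ambiguity code.

```python
from typing import Optional, Tuple

# Assumption: every symbol outside this set is treated as an ambiguity code.
UNAMBIGUOUS = frozenset("ACGT")


def prepare_pair(forward: str, reverse: str,
                 length: int) -> Optional[Tuple[str, str]]:
    """Apply the discard/trim rules to one read pair (hypothetical helper).

    Returns the trimmed pair, or None when the pair must be discarded.
    """
    # Discard the pair if either read is shorter than the target length.
    if len(forward) < length or len(reverse) < length:
        return None
    # Trim the 3' end: keep only the first `length` bases of each read.
    forward, reverse = forward[:length], reverse[:length]
    # After trimming, discard the pair if a remaining base is ambiguous.
    for read in (forward, reverse):
        if not set(read.upper()) <= UNAMBIGUOUS:
            return None
    return forward, reverse


# The trailing N is removed by the trimming, so this pair is kept.
print(prepare_pair("ACGTA", "ACGTAN", 5))
# The forward read is too short, so this pair is discarded.
print(prepare_pair("ACGT", "ACGTA", 5))
```

Because the ambiguity check runs after the trimming, an ambiguous base located in the trimmed-off 3’ part does not cause the pair to be discarded.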
The estimated read length
Running the index command with the option --estimate-length FRACTION asks it to estimate the maximum read length that allows at least the FRACTION part of the dataset to be used. FRACTION is a floating-point number between 0.0 and 1.0.
In this mode, the dataset is read a first time and the longest sub-sequence of each read containing no IUPAC ambiguity code is extracted. The length distribution of these sub-sequences is computed. From this distribution, the maximal length that allows at least the FRACTION part of the dataset to be used is estimated.
Only the sub-sequences without IUPAC ambiguity codes are indexed. Read pairs containing a read shorter than the estimated length are discarded.
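The estimation procedure described above can be sketched in Python. This is an illustration under simplifying assumptions rather than the assembler's algorithm: the function names are hypothetical, reads are handled individually instead of as pairs, any character other than A, C, G or T counts as an ambiguity code, and neither the odd-length rule nor the --minimum-length option is applied.

```python
import math
import re
from typing import Iterable


def longest_unambiguous(read: str) -> str:
    """Return the longest run of A/C/G/T in `read` (the first one on ties)."""
    runs = re.findall(r"[ACGT]+", read.upper())
    return max(runs, key=len, default="")


def estimate_length(reads: Iterable[str], fraction: float) -> int:
    """Largest length usable by at least `fraction` of the reads."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0.0 and 1.0")
    # Length of the longest unambiguous sub-sequence of each read,
    # sorted from the longest to the shortest.
    lengths = sorted((len(longest_unambiguous(r)) for r in reads),
                     reverse=True)
    if not lengths:
        raise ValueError("no reads provided")
    # Number of reads that must remain usable.
    needed = max(1, math.ceil(fraction * len(lengths)))
    # The `needed`-th longest sub-sequence sets the maximal length that
    # still keeps at least `needed` reads.
    return lengths[needed - 1]


reads = ["ACGTACGTAC", "ACGTNACGTACG", "ACGTA", "NNACGTACGT"]
# The longest unambiguous sub-sequences have lengths 10, 7, 5 and 8.
print(estimate_length(reads, 0.75))
```

With these four reads, three of them must stay usable for a fraction of 0.75, so the estimate is the third longest sub-sequence length.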
The most common way to run the index command
The basic Unix command line for running the index command looks like this:
$ oa index --estimate-length 0.9 \
seqindex \
forward.fastq.gz reverse.fastq.gz
The index command creates four files:
<index>.ogx: contains information concerning the index
<index>.ofx: contains the sequences themselves and the forward index
<index>.orx: contains the reverse index
<index>.opx: contains the read pairing data
The ORGanelle ASeMbler needs all of these files to run an assembly. <index> represents the name of the index that will be used later by the assembler.
A fifth file named <index>.log contains the traces generated by the indexing process.
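Since the assembler needs all four index files, it can be convenient to verify that they exist before launching an assembly. The sketch below is one possible way to do so; `missing_index_files` is a hypothetical helper that is not part of the ORGanelle ASeMbler and relies only on the file extensions listed above.

```python
from pathlib import Path
from typing import List

# Extensions of the four files listed above.
INDEX_EXTENSIONS = (".ogx", ".ofx", ".orx", ".opx")


def missing_index_files(index_name: str) -> List[Path]:
    """Return the expected index files that do not exist (hypothetical helper)."""
    # The extension is appended to the full name so that index names
    # that already contain a dot are preserved.
    expected = [Path(index_name + ext) for ext in INDEX_EXTENSIONS]
    return [path for path in expected if not path.is_file()]


missing = missing_index_files("seqindex")
if missing:
    print("Incomplete index, missing:", ", ".join(str(p) for p in missing))
else:
    print("All four index files are present.")
```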
Command prototype
usage: $ oa index [-h] [--reformat]
                  [--single | --mate-pairs]
                  [--check-ids] [--check-pairing]
                  [--max-read ###] [--skip ###]
                  [--check-phiX174 | --no-check-phiX174]
                  [--length ### | --estimate-length #.##]
                  [--minimum-length ###]
                  [--5-prime-trim ###]
                  [--3-prime-quality ###] [--bad-quality ###]
                  [--fasta | --forward-fasta | --reverse-fasta]
                  [--fastq-dump] [--bypass-filtering]
                  [--quality-encoding-offset ##]
                  [--no-pipe] [--low-memory]
                  <index> <forward_fastq_file> [reverse_fastq_file]
Positional arguments
index
    Name of the produced index.
forward
    Filename of the forward reads.
reverse
    Filename of the reverse reads.
Optional arguments
General options
-h, --help
    Shows the help message and exits.
--reformat
    Asks for reformatting an old sequence index to the new format.
Sequencing strategy
--single
    Single read mode.
--mate-pairs
    Indicates that the two read files were obtained using a mate pair sequencing strategy.
Sequence file checking
--check-ids
    Checks that forward and reverse ids are identical. The two sequence ids seqid/1 and seqid/2 are considered as identical.
--check-pairing
    Ensures that forward and reverse files are correctly paired. The pairing is checked based on the sequence identifier. The two sequences with the ids seqid/1 and seqid/2 are considered as paired.
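The identifier comparison used by these two checks can be illustrated with a short Python sketch. The function names are hypothetical and the sketch only covers the /1 and /2 suffix convention mentioned above; how the assembler handles other identifier layouts is not described here.

```python
def base_id(seq_id: str) -> str:
    """Strip a trailing /1 or /2 from a sequence identifier."""
    if seq_id.endswith(("/1", "/2")):
        return seq_id[:-2]
    return seq_id


def ids_match(forward_id: str, reverse_id: str) -> bool:
    """True when both identifiers are equal once the mate suffix is removed."""
    return base_id(forward_id) == base_id(reverse_id)


print(ids_match("read42/1", "read42/2"))
print(ids_match("read42/1", "read43/2"))
```

As a simplification, the sketch does not verify that the forward identifier carries /1 and the reverse one /2; it only compares what remains after the suffix is removed.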
Sequence quality checking
--check-phiX174
    Checks for PhiX174 contamination.
--no-check-phiX174
    Does not check for PhiX174 contamination (default).
--5-prime-trim ##
    Cuts the first N bases of each read (default: 0 bp).
--3-prime-quality ##
    Hard clips the 3’ end of each read after the first base with a score less than or equal to Q (default: 0, no clipping).
--bad-quality ##
    Considers quality scores below Q as bad and tries to clip reads so as to maximise their overall quality. Zero means no clipping (default: 10).
--skip ##
    Skips the first N read pairs (default: 0).
--bypass-filtering
    Considers the sequence files as pre-filtered fastq files.
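The hard clipping requested by --3-prime-quality can be pictured with the following Python sketch. It is an interpretation of the option description, not the assembler's code: `clip_3_prime` is a hypothetical name, quality scores are given as already decoded integers, and the sketch assumes that the first low-quality base is removed together with everything after it.

```python
from typing import List


def clip_3_prime(sequence: str, scores: List[int], threshold: int) -> str:
    """Clip a read at its first base whose score is <= threshold."""
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    # A threshold of zero disables the clipping, as for the default value.
    if threshold == 0:
        return sequence
    for position, score in enumerate(scores):
        if score <= threshold:
            # Assumption: the low-quality base itself is discarded too.
            return sequence[:position]
    return sequence


# The fourth base has a score of 2, so only the first three bases are kept.
print(clip_3_prime("ACGTAC", [30, 30, 30, 2, 30, 30], 10))
```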
Limits for the indexing
--max-read ###
    ### indicates the number of millions of reads to index. If not specified, all the reads are indexed, within the limit imposed by the program, which is printed at the beginning of the program trace.
    $ oa index --max-read 4 seqindex forward.fastq reverse.fastq
    Builds the index with a maximum of four million reads.
--length ###
    ### represents the read length to consider. Only reads with a length greater than or equal to ### are indexed. Reads longer than the specified length are truncated to that length.
    $ oa index --length 90 seqindex forward.fastq reverse.fastq
    Indexes the forward.fastq and reverse.fastq files using only reads at least 90 bp long.
    If the --length ### option is not used, the length is estimated from the length of the first read of the forward file or through the --estimate-length #.## option.
--estimate-length #.##
    #.##, ranging between 0.0 and 1.0, indicates which fraction of the overall dataset should be used. When this option is used, the sequence length to index is estimated so as to respect this constraint.
    $ oa index --estimate-length 0.9 seqindex forward.fastq reverse.fastq
    Indexes the forward.fastq and reverse.fastq files using a length such that at least 90% of the reads are indexed.
--minimum-length ###
    The minimum length of the reads to index when the --estimate-length option is activated (default: 81).
--fastq-dump
    Dumps the fastq file of the trimmed reads.
Sequence file format
--fasta
    Indicates that the two sequence files to index are fasta files.
    $ oa index --fasta seqindex forward.fasta reverse.fasta
--forward-fasta
    Indicates that the forward file is a fasta file.
    $ oa index --forward-fasta seqindex forward.fasta reverse.fastq
--reverse-fasta
    Indicates that the reverse file is a fasta file.
    $ oa index --reverse-fasta seqindex forward.fastq reverse.fasta
--quality-encoding-offset ##
    The code offset added to each quality score to encode fastq quality (default: 33, Sanger format).
    $ oa index --quality-encoding-offset 64 seqindex forward.fastq reverse.fastq
    Allows for reading the old Solexa fastq format. Look at the FastQ format Wikipedia page to find out which quality encoding offset corresponds to your files.
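The role of the offset can be made concrete with a short Python sketch: each character of a fastq quality line encodes one score as its character code minus the offset. The function name `decode_qualities` is hypothetical and only illustrates the encoding; it is not part of the ORGanelle ASeMbler.

```python
from typing import List


def decode_qualities(quality_line: str, offset: int = 33) -> List[int]:
    """Convert a fastq quality line into a list of integer scores."""
    return [ord(symbol) - offset for symbol in quality_line]


# With the default offset of 33, 'I' encodes 40, '5' encodes 20 and '!' encodes 0.
print(decode_qualities("I5!"))
# With an offset of 64, 'h' encodes 40 and '@' encodes 0.
print(decode_qualities("h@", offset=64))
```

Decoding a file with the wrong offset shifts every score by 31, which is why the offset has to match the way the file was produced.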
If the file names end with .gz or .bz2, the files are considered to be compressed with the gzip or the bzip2 program, respectively, and are uncompressed on the fly. The fasta-related options can be combined with this feature without restriction.
$ oa index --reverse-fasta seqindex \
forward.fastq.gz reverse.fasta.bz2
The forward file follows the fastq format and is compressed with gzip. The reverse file follows the fasta format and is compressed with bzip2.
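Selecting a decompression method from the file extension can be sketched with the Python standard library. This only illustrates the principle; `open_reads` is a hypothetical helper, and the assembler itself may rely on a different mechanism, such as piping through external programs.

```python
import bz2
import gzip
from typing import TextIO


def open_reads(filename: str) -> TextIO:
    """Open a sequence file as text, decompressing it on the fly if needed."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")
    if filename.endswith(".bz2"):
        return bz2.open(filename, "rt")
    # Any other extension is read as an uncompressed text file.
    return open(filename, "r")
```

A file opened this way can then be read line by line, for example with `with open_reads("forward.fastq.gz") as handle:`, whatever its compression.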
System options
--no-pipe
    By default, the ORGanelle ASeMbler uses named pipes to transfer data among programs. This option forces it to use temporary files instead.
--low-memory
    Reduces memory usage for the optimal length computation.