mcast - a motif search tool

Usage: mcast [options] <motifs> <database>

Description:

MCAST searches a sequence database for statistically significant clusters of non-overlapping occurrences of a given set of motifs.

A motif "hit" is a sequence position that is sufficiently similar to a motif in the query, where the score for a motif at a particular sequence position is computed without gaps. To compute the p-value of a motif score, MCAST assumes that the sequences in the database were generated by a 0-order Markov process (see option --bgfile, below). To be considered a hit, the p-value of the motif alignment score must be less than the significance threshold, pthresh (see option --motif-pthresh, below). Note that MCAST searches for hits on both strands of the sequences.

A cluster of non-overlapping hits is called a "match". The user specifies the maximum allowed distance between the hits in a match using the --max-gap option. Two hits separated by more than the maximum allowed gap will be reported in separate matches.
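The gap rule above can be illustrated with a short Python sketch. It is not MCAST's implementation: it assumes hits on a single sequence are given as (start, end) pairs in 1-based inclusive coordinates, that they already do not overlap, and that the gap is the number of positions strictly between two consecutive hits. The function name group_hits is hypothetical.

```python
def group_hits(hits, max_gap=50):
    """Group (start, end) hits into clusters whose consecutive gaps are <= max_gap."""
    matches = []
    for start, end in sorted(hits):
        # Assumed gap definition: positions strictly between the previous
        # hit's end and this hit's start.
        if matches and start - matches[-1][-1][1] - 1 <= max_gap:
            matches[-1].append((start, end))
        else:
            # Gap too large (or first hit): start a new match.
            matches.append([(start, end)])
    return matches


# Made-up hit coordinates for illustration only.
hits = [(10, 19), (40, 49), (200, 209)]
clusters = group_hits(hits, max_gap=50)
print(clusters)
```

With the default gap of 50, the first two hits (20 positions apart) fall into one match, while the third hit (150 positions further on) starts a separate match.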

The p-value of a hit is converted to a "p-score" in order to compute the total score of the match it participates in. The p-score for a hit with p-value p is

S = -log₂(p/pthresh).

The total score of a match is the sum of the p-scores of the hits making up the match.
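As a small worked example of the formula above, the following Python sketch computes p-scores and sums them into a match score. The function names and the hit p-values are illustrative assumptions; pthresh defaults to 0.0005, the documented default of --motif-pthresh.

```python
import math


def p_score(p, pthresh=0.0005):
    """p-score of a single hit: S = -log2(p / pthresh)."""
    return -math.log2(p / pthresh)


def match_score(hit_pvalues, pthresh=0.0005):
    """Total score of a match: the sum of the p-scores of its hits."""
    return sum(p_score(p, pthresh) for p in hit_pvalues)


# Made-up p-values: one hit at pthresh/4 and one at pthresh/2.
hit_pvalues = [0.000125, 0.00025]
print(match_score(hit_pvalues))
```

A hit whose p-value is a quarter of pthresh contributes 2 to the score and a hit at half of pthresh contributes 1, so this match scores about 3. A hit exactly at pthresh would contribute 0.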

MCAST searches for all possible matches between the query motifs and the sequences in the database, and reports the matches with the largest scores in decreasing order. Three statistical confidence measures (p-value, E-value, and q-value) are estimated for each score, and the reported matches can be filtered by applying p-value or q-value thresholds (see the options --output-pthresh and --output-qthresh below).

In order for MCAST to compute statistical confidence estimates, at least 100 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth and --bgfile options. When those options are set, synthetic sequences will be generated from the provided background model and used to estimate significance statistics.

When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores, and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, then MCAST will drop the least significant half of the matches. This behavior may result in matches missing from the MCAST output, even though they would have satisfied the user-specified p-value or q-value threshold.
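The effect of dropping the least significant half can be seen in a toy Python sketch. This is only an illustration of the behavior described above, not MCAST's code: it assumes that a higher score means a more significant match, and the class name ScoreStore and its attributes are hypothetical.

```python
class ScoreStore:
    """Keep at most max_stored scores, halving the store when it fills up."""

    def __init__(self, max_stored=100_000):
        self.max_stored = max_stored
        self.scores = []
        self.dropped = 0

    def add(self, score):
        self.scores.append(score)
        if len(self.scores) >= self.max_stored:
            # Assumption: larger score = more significant, so keep the top half.
            self.scores.sort(reverse=True)
            keep = len(self.scores) // 2
            self.dropped += len(self.scores) - keep
            del self.scores[keep:]


# Tiny limit and made-up scores so the pruning is easy to follow.
store = ScoreStore(max_stored=8)
for s in [5, 1, 9, 3, 7, 2, 8, 4, 6]:
    store.add(s)
print(sorted(store.scores, reverse=True), store.dropped)
```

In this run the store fills after eight scores, the four lowest are discarded, and the ninth score is then added to the surviving four. Any discarded score is gone for good, which is why some matches can be absent from the final output.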

A full description of the algorithm may be found in:

Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics (Proceedings of the European Conference on Computational Biology). 19(Suppl. 2):ii16-ii25, 2003.

Input:

<motifs> is a list of motifs, in MEME or TRANSFAC format.
<database> is a collection of DNA sequences in FASTA format.
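For scripted runs, the command line can be assembled programmatically. The Python sketch below only builds and prints an argument list using options documented on this page; the helper name build_mcast_command and the file names motifs.meme and sequences.fasta are placeholders, and actually running the command assumes that mcast is installed and on the PATH.

```python
import shlex


def build_mcast_command(motifs, database, out_dir="mcast_out",
                        motif_pthresh=0.0005, max_gap=50):
    """Return an argument list following: mcast [options] <motifs> <database>."""
    return [
        "mcast",
        "--oc", out_dir,                       # overwrite the output directory
        "--motif-pthresh", str(motif_pthresh),
        "--max-gap", str(max_gap),
        motifs,
        database,
    ]


cmd = build_mcast_command("motifs.meme", "sequences.fasta")
print(shlex.join(cmd))
# To execute (requires mcast on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.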

Output:

MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options). The directory will contain:

A file named mcast.xml describing the inputs to MCAST in XML format
A file named cisml.xml reporting the matches in XML format using the CisML schema
A file named mcast.html reporting the matches in HTML format
A file named mcast.txt reporting the matches in tab-delimited format
A file named mcast.gff reporting the matches in GFF format
A file named mcast.wig reporting the matches in wiggle track format

The HTML output contains

A list of the motifs, and the best possible "hit" for each
A list of matches. Each match record contains
- The name of the sequence
- The starting and ending coordinates for the match. The coordinates are 1-based, and start at the first position in the sequence.
- The match score
- The match p-value
- The match E-value
- The match q-value
- A block diagram showing the relative positions of each motif "hit" in the match
- A detailed view of each match is available as a pop-up window. The detailed view shows the full sequence for the match, the alignment to the motifs, and the p-values for the motif hits
The inputs to MCAST
Text describing the MCAST results

The plain text output contains a line for each match. Each line contains the following fields:

An id string for the match
The name of the sequence containing the match
The starting and ending coordinates of the match (1-based)
The match score
The match p-value
The match E-value
The match q-value
The sequence of the matched region

The lines are sorted by score in descending order.
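A line of the plain text output could be read along the following lines in Python. This sketch rests on several assumptions that should be checked against a real mcast.txt file: that the fields appear in exactly the order listed above, that the start and end coordinates occupy two separate tab-delimited columns, and that the statistics are either numbers or the string "NaN". The Match type, the parse_match_line function, and the example line are all made up for illustration.

```python
from typing import NamedTuple


class Match(NamedTuple):
    match_id: str
    seq_name: str
    start: int      # 1-based
    end: int        # 1-based
    score: float
    p_value: float
    e_value: float
    q_value: float
    sequence: str


def parse_match_line(line):
    """Parse one tab-delimited match line (assumed 9-column layout)."""
    f = line.rstrip("\n").split("\t")
    return Match(f[0], f[1], int(f[2]), int(f[3]),
                 float(f[4]), float(f[5]), float(f[6]), float(f[7]), f[8])


# Fabricated example line; "NaN" mimics a run with fewer than 100 matches.
example = "match-1\tchr1\t101\t160\t12.5\tNaN\tNaN\tNaN\tACGT"
m = parse_match_line(example)
print(m.seq_name, m.start, m.end, m.score)
```

Python's float() accepts "NaN", so runs in which the statistics could not be computed parse without special handling; such values should be tested with math.isnan rather than compared directly.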

The wiggle track output contains the following entries:

A track line containing:

The track type
The source of the track (MCAST)

A step size line containing:

The sequence name
The width of the match

A data line containing:

The start position of the match (closed, 1-based coordinates)
A score based on the p-value of the match: -log(p-value)

The wiggle track output is sorted by sequence name and position.

Options:

--bgfile <bfile> - Read background frequencies from <bfile>. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file.
--bgweight <weight> - Add <weight> times the background frequency to the corresponding letter counts in each motif when converting them to position-specific scoring matrices. The default value is 4.0.
--max-gap <max-gap> - The value of <max-gap> specifies the longest distance allowed between two hits in a match. Hits separated by more than <max-gap> will be placed in different matches. The default value is 50. Note: Large values of <max-gap> combined with large values of pthresh may prevent MCAST from computing E-values.
--max-stored-scores <max> - Set the maximum number of scores that will be stored. Precise calculation of q-values depends on having a complete list of scores. However, keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped, and approximate q-values will be calculated. By default the maximum number of stored matches is 100,000.
--motif-pthresh <pv> - Set the p-value threshold for motif hits, which is also the scale used to calculate p-scores. The default value is 0.0005. The p-score for a hit with p-value p is
S = -log₂(p/pthresh).
--o <dir name> - Specifies the output directory. If the directory already exists, the contents will not be overwritten.
--oc <dir name> - Specifies the output directory. If the directory already exists, the contents will be overwritten.
--output-ethresh <float> - The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. The default E-value threshold is 10.0. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
--output-pthresh <float> - The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. By default, a p-value threshold is not used. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
--output-qthresh <float> - The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. By default, a q-value threshold is not used. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
--transfac - The input motif file is assumed to be in TRANSFAC format and is converted to MEME format before being used.
--verbosity 1|2|3|4 - Set the verbosity of status reports to standard error. The default level is 2.
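The rule stated above for --output-ethresh, --output-pthresh, and --output-qthresh (the last one on the command line wins) can be sketched in Python as follows. This illustrates the documented behavior only and is not MCAST's argument parser; the function name effective_output_filter is hypothetical, and the fallback of E-value 10.0 is taken from the documented default.

```python
def effective_output_filter(argv):
    """Return (statistic, threshold) for the last output threshold option given."""
    names = {
        "--output-ethresh": "E-value",
        "--output-pthresh": "p-value",
        "--output-qthresh": "q-value",
    }
    chosen = ("E-value", 10.0)  # documented default when no option is given
    for i, arg in enumerate(argv[:-1]):
        if arg in names:
            # A later occurrence replaces any earlier choice.
            chosen = (names[arg], float(argv[i + 1]))
    return chosen


args = ["--output-pthresh", "0.01", "--max-gap", "100", "--output-qthresh", "0.05"]
print(effective_output_filter(args))
```

Here the q-value threshold of 0.05 is the effective filter because --output-qthresh appears after --output-pthresh.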