
mcast [options] <motifs> <database>
Description:
MCAST searches a sequence database for statistically significant
clusters of non-overlapping occurrences of a given set of motifs.
A motif "hit" is a sequence position that is sufficiently similar to a
motif in the query, where the
score for a motif at a particular sequence position is
computed without gaps.
To compute the p-value of a motif score, MCAST assumes that the
sequences in the database were generated by a 0-order Markov process
(see option --bgfile, below). To be considered a hit, the p-value of
the motif alignment score must be less than the significance
threshold, pthresh (see option --motif-pthresh, below). Note that
MCAST searches for hits on both strands of the sequences.
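The idea of ungapped scoring on both strands can be sketched as follows. This is a minimal illustration, not MCAST's implementation: the function names, the toy scoring matrix, and the `(start, strand, score)` tuple layout are all assumptions made for the example, and the conversion of scores to p-values is omitted.

```python
# Illustrative sketch only: names and the toy matrix are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def ungapped_scores(seq, pssm):
    """Score every window of width len(pssm) without gaps.

    pssm is a list of dicts, one per motif column, mapping base -> score.
    """
    width = len(pssm)
    return [
        sum(pssm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]


def scan_both_strands(seq, pssm):
    """Return (start, strand, score) tuples; start is 1-based on the forward strand."""
    width, n = len(pssm), len(seq)
    hits = [(start + 1, "+", s) for start, s in enumerate(ungapped_scores(seq, pssm))]
    for rc_start, s in enumerate(ungapped_scores(reverse_complement(seq), pssm)):
        # Map the window on the reverse complement back to forward coordinates.
        hits.append((n - (rc_start + width) + 1, "-", s))
    return hits


# Toy two-column matrix that rewards "A" then "C" (hypothetical values).
toy_pssm = [
    {"A": 1, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 0},
]
hits = scan_both_strands("AAGT", toy_pssm)
```

In this toy case the best window lies on the reverse strand: positions 3-4 read "GT" on the forward strand, which is "AC" on the reverse complement.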
A cluster of non-overlapping hits is called a "match". The
user specifies the maximum allowed distance between the hits in
a match using the --max-gap option.
Two hits separated by more than the maximum allowed gap will be
reported in separate matches.
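The effect of the maximum gap can be sketched as a simple grouping of hits along a sequence. This is a simplified illustration under stated assumptions, not the algorithm MCAST actually uses: hits are given as hypothetical `(start, end)` pairs with 1-based inclusive coordinates, and the gap is assumed to be the number of positions between the end of one hit and the start of the next.

```python
def group_hits(hits, max_gap=50):
    """Group non-overlapping hits into matches.

    hits: list of (start, end) tuples, 1-based inclusive (assumed layout).
    A hit joins the current match when the gap to the previous hit
    (assumed here to be start - previous_end - 1) is at most max_gap.
    """
    matches = []
    for hit in sorted(hits):
        if matches and hit[0] - matches[-1][-1][1] - 1 <= max_gap:
            matches[-1].append(hit)
        else:
            matches.append([hit])
    return matches


example_hits = [(1, 10), (30, 40), (200, 210)]
matches_default = group_hits(example_hits)            # max_gap = 50
matches_strict = group_hits(example_hits, max_gap=10)
```

With the default gap of 50 the first two hits form one match and the third stands alone; with a gap of 10 each hit becomes its own match.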
The p-value of a hit is converted to a "p-score" in order
to compute the total score of the match it participates in.
The p-score for a hit with p-value p is
S = -log2(p/pthresh).
The total score of a match is the sum of the p-scores of
the hits making up the match.
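As a worked illustration of the formula above, the following sketch computes p-scores and sums them into a match score. The function names are hypothetical; the default pthresh of 0.0005 is the documented default of --motif-pthresh.

```python
import math


def p_score(p, pthresh=0.0005):
    """p-score of a single hit: S = -log2(p / pthresh)."""
    return -math.log2(p / pthresh)


def match_score(hit_pvalues, pthresh=0.0005):
    """Total score of a match: the sum of the p-scores of its hits."""
    return sum(p_score(p, pthresh) for p in hit_pvalues)


# A hit exactly at the threshold contributes 0; each halving of p adds 1.
total = match_score([0.000125, 0.00025])
```

Because every hit must have p below pthresh, every p-score is positive, so adding a hit always increases the score of a match.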
MCAST searches for all possible matches between the query motifs and
the sequences in the database, and reports the matches with the
largest scores in decreasing order. Three types of statistical
confidence estimates (p-value, E-value, and q-value) are computed for
each score, and the reported matches can be filtered by applying
p-value or q-value thresholds (see the options --output-pthresh and
--output-qthresh below).
In order for MCAST to compute statistical confidence estimates, at
least 100 matches must be found. If the database contains too few
sequences, or if certain other options are made too stringent, then
too few matches may exist for significance statistics to be computed.
In this case, the p-value, q-value, and E-value columns are set to
"NaN", and all matches are printed. This limitation can be overcome
by specifying the --synth and --bgfile options. When those options
are set, synthetic sequences will be generated from the provided
background model and used to estimate significance statistics.
When computing statistical confidence estimates, MCAST must retain
the matches in memory until the final distribution of scores can be
estimated. This means that scanning genome-sized datasets has the
potential to exhaust all available memory. To avoid this problem,
MCAST uses reservoir sampling of the match scores, and limits the
number of matches that are kept in memory. The default number of
matches kept in memory is 100,000, but this value can be adjusted via
the --max-stored-scores option. If the maximum number of stored
matches is reached, then MCAST will drop the least significant half
of the matches. This behavior may result in matches missing from the
MCAST output, even though they would have satisfied the
user-specified p-value or q-value threshold.
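The "drop the least significant half" behavior can be sketched with a small bounded container. This is an illustrative model only: the class name and its fields are hypothetical, it models only the pruning step (not the reservoir sampling), and it assumes that a larger score means a more significant match.

```python
class BoundedScoreStore:
    """Keep at most max_stored scores, pruning the lowest half when full."""

    def __init__(self, max_stored=100_000):
        self.max_stored = max_stored
        self.scores = []
        self.dropped = 0  # how many scores have been discarded so far

    def add(self, score):
        self.scores.append(score)
        if len(self.scores) >= self.max_stored:
            # Sort best-first and keep only the top half.
            self.scores.sort(reverse=True)
            keep = len(self.scores) // 2
            self.dropped += len(self.scores) - keep
            del self.scores[keep:]


store = BoundedScoreStore(max_stored=4)
for s in [1, 5, 3, 9, 2, 7]:
    store.add(s)
```

After the six additions above, the store has pruned twice and holds only the two highest scores. Any dropped score that would have passed the output threshold is lost, which is why the output may be incomplete.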
A full description of the algorithm may be found in:
Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics (Proceedings of the European Conference on Computational Biology). 19(Suppl. 2):ii16-ii25, 2003.
Input:
- <motifs> is a list of motifs, in MEME or TRANSFAC format.
- <database> is a collection of DNA sequences in FASTA format.
Output:
MCAST writes its output to a directory named mcast_out (the name of
this directory can be overridden via the --o or --oc options).
The directory will contain:
- A file named mcast.xml describing the inputs to MCAST in XML format
- A file named cisml.xml reporting the matches in XML format using the CisML schema
- A file named mcast.html reporting the matches in HTML format
- A file named mcast.txt reporting the matches in tab-delimited format
- A file named mcast.gff reporting the matches in GFF format
- A file named mcast.wig reporting the matches in wiggle track format
- A list of the motifs, and the best possible "hit" for each
- A list of matches.
Each match record contains
- The name of the sequence
- The starting and ending coordinates for the match. The coordinates are 1-based, and start at the first position in the sequence.
- The match score
- The match p-value
- The match E-value
- The match q-value
- A block diagram showing the relative positions of each motif "hit" in the match
- A detailed view of each match is available as a pop-up window. The detailed view shows the full sequence for the match, the alignment to the motifs, and the p-values for the motif hits
- The inputs to MCAST
- Text describing the MCAST results
- An id string for the match
- The name of the sequence containing the match
- The starting and ending coordinates of the match (1-based)
- The match score
- The match p-value
- The match E-value
- The match q-value
- The sequence of the matched region
- A track line containing:
- The track type
- The source of the track (MCAST)
- A step size line containing:
- The sequence name
- The width of the match
- A data line containing:
- The start position of the match (closed, 1-based coordinates)
- A score based on the p-value of the match: -log(p-value) where
Options:
--bgfile <bfile>
- Read background frequencies from <bfile>. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file.
--bgweight <weight>
- Add <weight> times the background frequency to the corresponding letter counts in each motif when converting them to position-specific scoring matrices. The default value is 4.0.
--max-gap <max-gap>
- The value of <max-gap> specifies the longest distance allowed between two hits in a match. Hits separated by more than <max-gap> will be placed in different matches. The default value is 50. Note: Large values of <max-gap> combined with large values of pthresh may prevent MCAST from computing E-values.
--max-stored-scores <max>
- Set the maximum number of scores that will be stored. Precise calculation of q-values depends on having a complete list of scores. However, keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped, and approximate q-values will be calculated. By default the maximum number of stored matches is 100,000.
--motif-pthresh <pv>
- Sets the p-value threshold for motif hits, which also sets the scale for calculating p-scores. The default value is 0.0005. The p-score for a hit with p-value p is S = -log2(p/pthresh).
--o <dir name>
- Specifies the output directory. If the directory already exists, the contents will not be overwritten.
--oc <dir name>
- Specifies the output directory. If the directory already exists, the contents will be overwritten.
--output-ethresh <float>
- The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. The default E-value threshold is 10.0. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
--output-pthresh <float>
- The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. By default, a p-value threshold is not used. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
--output-qthresh <float>
- The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. By default, a q-value threshold is not used. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
--transfac
- The input motif file is assumed to be in TRANSFAC format and is converted to MEME format before being used.
--verbosity 1|2|3|4
- Set the verbosity of status reports to standard error. The default level is 2.
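
The "last option wins" rule for --output-ethresh, --output-pthresh, and --output-qthresh can be sketched as a toy argument scan. This is not MCAST's actual argument parser: the function name and the returned `(statistic, threshold)` pair are assumptions made for illustration, and the file names in the example are hypothetical. The default of an E-value threshold of 10.0 is taken from the option description above.

```python
def effective_output_filter(argv):
    """Return (statistic, threshold) for the output filter that takes effect.

    Scans the arguments left to right; a later threshold option
    replaces any earlier one.
    """
    statistics = {
        "--output-ethresh": "E-value",
        "--output-pthresh": "p-value",
        "--output-qthresh": "q-value",
    }
    effective = ("E-value", 10.0)  # documented default when no option is given
    for flag, value in zip(argv, argv[1:]):
        if flag in statistics:
            effective = (statistics[flag], float(value))
    return effective


# Hypothetical command line: the q-value option appears last, so it wins.
chosen = effective_output_filter(
    ["--output-pthresh", "0.01", "--output-qthresh", "0.05", "motifs.meme", "db.fasta"]
)
```

In this example the p-value threshold is ignored because the q-value option follows it on the command line.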