
Usage: mcast [options] <motifs> <database>

Description:

MCAST searches a sequence database for statistically significant clusters of non-overlapping occurrences of a given set of motifs.

A motif "hit" is a sequence position that is sufficiently similar to a motif in the query, where the score for a motif at a particular sequence position is computed without gaps. To compute the p-value of a motif score, MCAST assumes that the sequences in the database were generated by a 0-order Markov process (see option --bgfile, below). To be considered a hit, the p-value of the motif alignment score must be less than the significance threshold, pthresh (see option --motif-pthresh, below). Note that MCAST searches for hits on both strands of the sequences.

A cluster of non-overlapping hits is called a "match". The user specifies the maximum allowed distance between the hits in a match using the --max-gap option. Two hits separated by more than the maximum allowed gap will be reported in separate matches.
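The effect of the maximum gap can be illustrated with a small Python sketch. This is not MCAST's implementation, only a simplified picture of how hits end up in separate matches. The function name group_hits, the (start, end) hit representation, and the definition of the gap as the number of positions strictly between two hits are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical hit representation: (start, end), inclusive coordinates
# on a single sequence. Hits are assumed to be non-overlapping.
Hit = Tuple[int, int]


def group_hits(hits: List[Hit], max_gap: int) -> List[List[Hit]]:
    """Split hits into clusters whose neighbouring hits are at most max_gap apart."""
    matches: List[List[Hit]] = []
    for hit in sorted(hits):
        if matches:
            prev_end = matches[-1][-1][1]
            # Assumption: the gap is the number of positions between the hits.
            gap = hit[0] - prev_end - 1
            if gap <= max_gap:
                matches[-1].append(hit)
                continue
        # First hit, or the gap is too large: start a new match.
        matches.append([hit])
    return matches


clusters = group_hits([(10, 19), (30, 39), (200, 209)], max_gap=50)
print(clusters)
```

With a maximum gap of 50, the first two hits fall into one match and the third hit, 160 positions further on, starts a match of its own.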

The p-value of a hit is converted to a "p-score" in order to compute the total score of the match it participates in. The p-score for a hit with p-value p is

S = -log2(p/pthresh).

The total score of a match is the sum of the p-scores of the hits making up the match.
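As a worked illustration of the formula, the following Python sketch computes p-scores and sums them into a match score. The function names and the numeric values are hypothetical and chosen only for the example. The sketch assumes every p-value passed in already belongs to a hit, that is, it is below pthresh.

```python
import math
from typing import Iterable


def p_score(p: float, pthresh: float) -> float:
    """p-score of a single hit: S = -log2(p / pthresh)."""
    return -math.log2(p / pthresh)


def match_score(hit_pvalues: Iterable[float], pthresh: float) -> float:
    """Total score of a match: the sum of the p-scores of its hits."""
    return sum(p_score(p, pthresh) for p in hit_pvalues)


pthresh = 0.0005                 # example threshold, chosen for illustration
hits = [0.000125, 0.0000625]     # pthresh / 4 and pthresh / 8
total = match_score(hits, pthresh)
print(total)                     # approximately 5 (2 + 3)
```

A hit whose p-value is exactly pthresh would contribute a p-score of 0, and each halving of the p-value adds 1 to the score.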

MCAST searches for all possible matches between the query motifs and the sequences in the database, and reports the matches with the largest scores in decreasing order. Three types of statistical confidence estimates (p-value, E-value, and q-value) are computed for each score, and the reported matches can be filtered by applying p-value or q-value thresholds (see the options --output-pthresh and --output-qthresh below).

In order for MCAST to compute statistical confidence estimates, at least 100 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth and --bgfile options. When those options are set, synthetic sequences will be generated from the provided background model and used to estimate significance statistics.

When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores, and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, then MCAST will drop the least significant half of the matches. This behavior may result in matches missing from the MCAST output, even though they would have satisfied the user-specified p-value or q-value threshold.
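The "drop the least significant half" behavior can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not MCAST's actual code: the class name BoundedMatchStore is hypothetical, a higher score is assumed to mean a more significant match, and the reservoir sampling of scores is not modeled.

```python
from typing import List, Tuple


class BoundedMatchStore:
    """Keeps at most max_stored matches, halving the store when it fills up."""

    def __init__(self, max_stored: int = 100_000) -> None:
        self.max_stored = max_stored
        self.matches: List[Tuple[float, str]] = []  # (score, label)
        self.dropped = 0

    def add(self, score: float, label: str) -> None:
        self.matches.append((score, label))
        if len(self.matches) >= self.max_stored:
            # Assumption: higher score = more significant. Keep the top half.
            self.matches.sort(key=lambda m: m[0], reverse=True)
            keep = len(self.matches) // 2
            self.dropped += len(self.matches) - keep
            del self.matches[keep:]


store = BoundedMatchStore(max_stored=4)
for s in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    store.add(s, f"match_{s}")
print([score for score, _ in store.matches], store.dropped)
```

In this toy run the store is halved twice, so four of the six matches are discarded. Any of them that would have passed the output threshold is lost, which is the effect described above.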

A full description of the algorithm may be found in:

Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics (Proceedings of the European Conference on Computational Biology). 19(Suppl. 2):ii16-ii25, 2003.

Input:

Output:

MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options). The directory will contain:

The HTML output contains

The plain text output contains one line for each match. The lines are sorted by score in descending order. Each line contains the following fields:

The wiggle track output contains the following entries:

The wiggle track output is sorted by sequence name and position.

Options: