meme <sequence file> [options]
MEME is a tool for discovering motifs in a group of related DNA or protein sequences.
A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.
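To make the matrix representation concrete, here is a minimal sketch that builds a position-dependent letter-probability matrix from a set of aligned, equal-width sites. The function name, the `pseudocount` parameter and the toy sites are assumptions made for illustration; this is not MEME's own code.

```python
from collections import Counter

def letter_probability_matrix(sites, alphabet="ACGT", pseudocount=0.0):
    """Return one {letter: probability} dict per motif position.

    Illustrative helper (not part of MEME): `sites` are equal-width strings.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width (motifs have no gaps)")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(width):
        # Count how often each letter occurs at this position across the sites.
        counts = Counter(site[pos] for site in sites)
        matrix.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return matrix

# Toy sites, invented for this example.
sites = ["TGTGA", "TGTGA", "TTTGA", "TGTCA"]
pspm = letter_probability_matrix(sites)
```

Each entry of `pspm` describes one motif column, and the probabilities in every column sum to one.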
MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.
The examples are a good place to start.
A file containing FASTA sequences or the word 'stdin' to indicate that the sequences should be read from standard input. Note that MEME does not attempt to detect the alphabet from the sequences so you should specify it with the -dna or -protein options. MEME also supports a modification to the FASTA format for weighting the sequences. Large datasets will need to be sub-sampled to allow MEME to run in reasonable times.
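One possible way to sub-sample a large dataset before running MEME is sketched below: parse the FASTA records, shuffle them, and keep whole sequences until a character budget is used up. The helper names and the sampling strategy are assumptions for illustration; MEME does not prescribe a particular sub-sampling method.

```python
import random

def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def subsample(records, max_chars, seed=0):
    """Randomly keep whole sequences while the total stays within max_chars."""
    rng = random.Random(seed)          # fixed seed makes the sample repeatable
    shuffled = records[:]
    rng.shuffle(shuffled)
    kept, total = [], 0
    for header, seq in shuffled:
        if total + len(seq) <= max_chars:
            kept.append((header, seq))
            total += len(seq)
    return kept

# Toy input, invented for this example.
records = read_fasta([">a", "ACGT", "AC", ">b", "GGGG"])
kept = subsample(records, max_chars=6)
```

In practice the budget would be chosen with the -maxsize limit in mind.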
MEME outputs its results primarily as an HTML file named
meme.html. MEME also outputs a machine-readable XML
file and a plain-text version of its output, named
These files are placed in a directory named
You can give the directory a different name using the
--o or --oc
|-text||Output in text format only to standard output.||The program behaves as if
|-dna||MEME will expect that the sequences use the DNA alphabet. This
means that the sequence may contain the letters
||MEME assumes that the sequences are protein.|
|-protein||MEME will expect that the sequences use the protein alphabet.
This means that the sequence may contain the letters
||MEME assumes that the sequences are protein.|
|Contributing Site Distribution|
|-mod||oops|zoops|anr||This option is used to describe the distribution of motif sites.
|MEME assumes the Zero or One Occurrence Per Sequence model.|
|Number of Motifs|
|-nmotifs||n||MEME will stop searching for motifs after finding n motifs.||MEME will find 1 motif.|
|-evt||ev||MEME will stop searching for motifs if the last motif found has an E-value > ev.||MEME will rely on other limits to decide when to stop searching for motifs.|
|-time||t||MEME will stop searching for motifs if it has found at least 1 motif and it estimates that finding any more will cause the total running time to exceed t CPU seconds.||MEME will rely on other limits to decide when to stop searching for motifs.|
|Number of Motif Occurrences|
|-nsites||n||When the motif site distribution model allows, MEME will only attempt to find motifs with n sites. Specifying this option is equivalent to setting the -minsites and -maxsites options to the same value. When the distribution model is OOPS this is ignored and the number of sites is set to the number of sequences.||See the -minsites and -maxsites options for information on the default behaviour.|
|-minsites||n||When the motif site distribution model allows, MEME will attempt to find motifs with at least n sites. When the distribution model is OOPS this is ignored and the number of sites is set to the number of sequences.||The minimum number of sites is set to 2 when it is not otherwise defined by the use of the OOPS model or the -nsites option.|
|-maxsites||n||When the motif site distribution model allows, MEME will attempt to find motifs with at most n sites. When the distribution model is OOPS this is ignored and the number of sites is set to the number of sequences.||When the site distribution model is ZOOPS this is set to the
number of sequences, however when it is ANR this is set to
|-wnsites||weight||This controls the strength of the bias towards motifs with exactly the expected number of sites as defined by the -nsites, -minsites and -maxsites options. It is a number in the range [0..1). The closer to one it is, the stronger the bias towards motifs with exactly the expected number of sites.||The weighting is set to 0.8.|
|-w||w||Search for motifs with a width of w.||Search for motifs with widths between the range set by -minw and -maxw.|
|-minw||min w||Search for motifs with a width ≥ min w.||Searches for motifs with a minimum width of 8.|
|-maxw||max w||Search for motifs with a width ≤ max w.||Searches for motifs with a maximum width of 50.|
|-nomatrim||Do not adjust motif width using multiple alignments.||The motif is trimmed to avoid insertions and deletions.|
|-wg||wg||The gap opening cost for creating the alignments used to trim the motif.||The opening cost for a gap is 11.|
|-ws||ws||The gap extension cost for creating the alignments used to trim the motif.||The extension cost for a gap is 1.|
|-noendgaps||Do not count end gaps in the alignments used to trim the motif.||End gaps are penalised like any other gap.|
|-bfile||bfile||The name of Markov background
The background model is used by MEME:
Note that MEME uses only the 0-order portion (single letter frequencies) of the background model for purposes 3 and 4, but uses the full-order model for purposes 1 and 2, above.
|The 0-order background frequencies are determined from the sequences.|
|-psp||PSP file||The name of a MEME position specific priors file. This can be used to bias the search for motifs in MEME. The PSP file supplies a position-specific prior distribution on the location of motif sites in sequence(s) in the input dataset.||All motif sites are considered equally likely.|
|DNA Palindromes & Strands|
|-revcomp||Consider both the given strand and the reverse complement strand when searching for DNA motifs.||Search for DNA motifs on the given strand only.|
|-pal||This causes MEME to only look for palindromes in DNA datasets. MEME averages the letter frequencies in corresponding columns of the motif (PSPM) together. For instance, if the width of the motif is 10, columns 1 and 10, 2 and 9, 3 and 8, etc., are averaged together. The averaging combines the frequency of A in one column with T in the other, and the frequency of C in one column with G in the other.||MEME won't specifically look for palindromes.|
|-maxiter||max iter||The number of iterations of EM to run from any starting point. EM is run for max iter iterations or until convergence (see -distance, below) from each starting point.||MEME will use a maximum of 50 EM iterations from any starting point.|
|-distance||distance||MEME stops iterating EM when the change in the motif frequency matrix is less than distance. Change is defined as the euclidean distance between two successive frequency matrices.||The distance used for measuring convergence is 0.001.|
|-prior||dirichlet|dmix|mega|megap|addone||The type of prior to use.
When the prior is not selected:
|-b||b||The strength of the prior on model parameters: b = 0 means use intrinsic strength of prior for prior = dmix.||
|-plib||plib||The name of the file containing the Dirichlet mixtures prior library.||
The default value of plib depends on
the alphabet of the sequences.
|Selecting Starts for EM|
|-spfuzz||fuzz||The fuzziness of sequence to theta mapping. The meaning of this parameter depends on the choice of mapping function as set by the -spmap option.||
The default value of fuzz depends on
the mapping function (see -spmap).
|-spmap||uni|pam||The mapping function to use for estimating theta.
The default mapping function depends on the alphabet of the sequences.
|-cons||consensus||Override the sampling of starting points and just use a starting point derived from consensus. This is useful when an actual occurrence of a motif is known and can be used as the starting point for finding the motif.||Refer to the -spmap option for the default behaviour.|
|Branching Search on EM Starts|
|-heapsize||hs||Size of heaps for widths where substring search occurs. See the branching search section for more details.|
|-x_branch||Experimental: perform x-branching. This does a beam search and tests the actual words in the input as well as words at Hamming distance 1 at each successive level of branching.||Normal branching search is performed. See the branching search section for more details.|
|-w_branch||Experimental: perform width branching. This is not optimised.||Normal branching search is performed. See the branching search section for more details.|
|-bfactor||bf||The number of iterations of branching search. See the branching search section for more details.||MEME uses 3 iterations of branching search.|
|-maxsize||max size||Maximum allowed dataset size (in characters).||The maximum allowed dataset size is 100000 characters. Note that the default maximum size exists to warn users that their dataset is possibly too large to process in a reasonable time, so please consider carefully before increasing this value.|
|-nostatus||Print no status messages to terminal.||Print minimal status messages to terminal.|
|-p||np||Use faster, parallel version of MEME with np processors. The parameter np may be a number or it may be a quoted string starting with a number and followed by arguments to the particular MPI run command for your installation (e.g.,||Use a single processor.|
|-sf||sf||Print sf as name of sequence file.||Print actual file name.|
|-V||Print extensive status messages to terminal.||Print minimal status messages to terminal.|
|-h||Display a usage message and exit.||Run as normal.|
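The stopping rule described for the -maxiter and -distance options can be sketched as follows. This is a minimal illustration, not MEME's implementation: `update` is a hypothetical stand-in for one EM iteration, and the toy update simply moves the matrix halfway towards a fixed target.

```python
import math

def matrix_distance(prev, curr):
    """Euclidean distance between two matrices given as lists of rows."""
    return math.sqrt(sum((p - c) ** 2
                         for prev_row, curr_row in zip(prev, curr)
                         for p, c in zip(prev_row, curr_row)))

def iterate_until_converged(update, theta, max_iter=50, distance=0.001):
    """Apply `update` until the matrix changes by less than `distance`,
    or until max_iter iterations have been run."""
    for iteration in range(1, max_iter + 1):
        new_theta = update(theta)
        if matrix_distance(theta, new_theta) < distance:
            return new_theta, iteration
        theta = new_theta
    return theta, max_iter

# Toy stand-in for an EM step (assumption): move halfway towards TARGET.
TARGET = [[0.7, 0.1, 0.1, 0.1]]

def toy_update(theta):
    return [[(t + g) / 2 for t, g in zip(row, goal)]
            for row, goal in zip(theta, TARGET)]

theta, iterations = iterate_until_converged(toy_update, [[0.25, 0.25, 0.25, 0.25]])
```

The defaults of 50 iterations and a distance of 0.001 mirror the default values listed in the table above.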
MEME uses an objective function on motifs to select the "best" motif. The objective function is based on the statistical significance of the log likelihood ratio (LLR) of the occurrences of the motif. The E-value of the motif is an estimate of the number of motifs (with the same width and number of occurrences) that would have equal or higher log likelihood ratio if the training set sequences had been generated randomly according to the (0-order portion of the) background model.
MEME searches for the motif with the smallest E-value. It searches over different motif widths, numbers of occurrences, and positions in the training set for the motif occurrences. The user may limit the range of motif widths and number of occurrences that MEME tries. In addition, MEME trims the motif (using a dynamic programming multiple alignment) to eliminate any positions where there is a gap in any of the occurrences.
The log likelihood ratio of a motif is
llr = log (Pr(sites | motif) / Pr(sites | back)) and is a
measure of how different the sites are from the background model.
Pr(sites | motif) is the probability of the occurrences
given a model consisting of the position-specific probability matrix
(PSPM) of the motif.
Pr(sites | back) is the probability
of the occurrences given the background model. The background model is
an n-order Markov model. By default,
it is a 0-order model consisting of the frequencies of the letters in
the training set. A different 0-order Markov model or higher order
Markov models can be specified to MEME using the
The E-value reported by MEME is actually an approximation of the E-value of the log likelihood ratio (an approximation is used because it is far more efficient to compute). The approximation is based on the fact that the log likelihood ratio of a motif is the sum of the log likelihood ratios of each column of the motif. Instead of computing the statistical significance of this sum (its p-value), MEME computes the p-value of each column and then computes the significance of their product. Although not identical to the significance of the log likelihood ratio, this easier-to-compute objective function works very similarly in practice.
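The additivity that the approximation relies on can be checked numerically. The sketch below assumes a 0-order background and natural logarithms, and uses invented sites and probabilities; it computes the log likelihood ratio once as a whole and once column by column. It does not reproduce MEME's p-value calculation.

```python
import math

def total_llr(sites, pspm, background):
    """log(Pr(sites | motif) / Pr(sites | back)) under a 0-order background."""
    log_motif = sum(math.log(pspm[pos][letter])
                    for site in sites for pos, letter in enumerate(site))
    log_back = sum(math.log(background[letter])
                   for site in sites for letter in site)
    return log_motif - log_back

def column_llrs(sites, pspm, background):
    """Log likelihood ratio contributed by each motif column separately."""
    return [sum(math.log(pspm[pos][site[pos]] / background[site[pos]])
                for site in sites)
            for pos in range(len(pspm))]

# Toy data (assumptions for illustration only).
sites = ["AC", "AC", "AG", "TC"]
pspm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

per_column = column_llrs(sites, pspm, background)
llr = total_llr(sites, pspm, background)
```

Summing `per_column` gives the same value as `llr`, up to floating-point rounding.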
The motif significance is reported as the E-value of the motif. The statistical significance of a motif is computed based on:
MEME searches for motifs by performing Expectation Maximization (EM) on a motif model of a fixed width and using an initial estimate of the number of sites. It then sorts the possible sites by the probability that EM assigns to them. MEME then calculates the E-values of the first n sites in the sorted list for different values of n. This procedure (first EM, followed by computing E-values for different numbers of sites) is repeated with different widths and different initial estimates of the number of sites. MEME outputs the motif with the lowest E-value.
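The "sort the sites, then score the first n" step can be sketched as below. The E-value calculation itself is not reproduced here: `evalue_of` is a hypothetical caller-supplied function, and the toy scoring function and site probabilities are invented purely so the example runs.

```python
def best_site_count(site_probs, evalue_of, min_sites=2, max_sites=None):
    """Pick the prefix of the probability-sorted site list with the lowest E-value.

    site_probs: list of (site, probability) pairs, e.g. as produced by EM.
    evalue_of:  hypothetical function mapping a list of sites to an E-value.
    """
    ranked = sorted(site_probs, key=lambda item: item[1], reverse=True)
    if max_sites is None:
        max_sites = len(ranked)
    best = None
    for n in range(min_sites, min(max_sites, len(ranked)) + 1):
        candidate = [site for site, _ in ranked[:n]]
        e = evalue_of(candidate)
        if best is None or e < best[0]:
            best = (e, candidate)
    return best

# Toy inputs (assumptions): the scoring function prefers exactly three sites.
site_probs = [("s1", 0.2), ("s2", 0.9), ("s3", 0.8), ("s4", 0.95), ("s5", 0.1)]
toy_evalue = lambda chosen: abs(len(chosen) - 3) + 0.5

best = best_site_count(site_probs, toy_evalue)
```

In MEME this selection is repeated for different widths and different initial estimates of the number of sites, as described above.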
Once a candidate motif has been found the multiple alignment method performs a separate pairwise alignment of the site with the highest probability and each other possible site (the alignment includes width/2 positions on either side of the sites).
The pairwise alignments are then combined and the method determines the widest section of the motif with no insertions or deletions. If this alignment is shorter than min w, it tries to find an alignment allowing up to one insertion/deletion per motif column. This continues (allowing up to 2, 3 ... insertions/deletions per motif column) until an alignment of width at least min w is found.
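The relaxation loop in this step can be sketched as follows, assuming the combined pairwise alignments have already been summarised as a count of insertions/deletions per motif column. That summary, and the function names, are simplifying assumptions for illustration rather than MEME's actual data structures.

```python
def widest_run(indels_per_column, allowed):
    """Return (start, length) of the widest contiguous run of columns
    whose indel count is at most `allowed`."""
    best_start, best_len, start = 0, 0, None
    # A sentinel larger than `allowed` closes any run that reaches the end.
    for i, count in enumerate(list(indels_per_column) + [allowed + 1]):
        if count <= allowed:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len

def trim_motif(indels_per_column, min_w):
    """Allow 0, 1, 2, ... indels per column until a section of width >= min_w exists."""
    if min_w > len(indels_per_column):
        raise ValueError("min_w exceeds the number of columns")
    allowed = 0
    while True:
        start, length = widest_run(indels_per_column, allowed)
        if length >= min_w:
            return start, length, allowed
        allowed += 1

# Toy indel counts per column, invented for this example.
indels = [2, 0, 0, 0, 1, 0, 0, 0, 0, 3]
strict = trim_motif(indels, min_w=3)    # satisfied with no indels allowed
relaxed = trim_motif(indels, min_w=6)   # needs one indel per column allowed
```

Each result is a `(start, width, allowed_indels)` triple for the trimmed section.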
The switches -nomatrim, -wg, -ws and -noendgaps control trimming of motifs using the multiple alignment method. Specifying the -nomatrim option causes MEME to skip trimming altogether. The -wg and -ws options set the costs of including gaps in the alignments and the -noendgaps option allows gaps at the beginning and end to be treated specially.
After trimming, the number of occurrences is adjusted to minimize the motif E-value, and then the motif width is further shortened to optimize the E-value.
The search for good EM starting points can be improved by using branching search.
Branching search begins with a fixed-sized heap of best EM starts identified during the search of subsequences from the dataset. These starts are also called "seeds". The fixed-sized heap of seeds is used as the "branch_heap" during the first iteration of branching search.
For each iteration of branching search, all seeds in the current branch_heap are considered. All seeds in the ball within Hamming distance 1 of a given seed are evaluated and added to a new heap. The ball of new seeds is generated by mutating each character of the initial seed to each alternative character in the alphabet.
After the ball for every branch_heap seed has been evaluated, the seeds in the resulting new heap are added to the heap of best EM starts. The new heap is then used as the branch_heap for the next iteration of branching search.
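One iteration of this procedure can be sketched as below. The `score` function is a hypothetical stand-in for however a seed is evaluated as an EM start, and a sorted list is used in place of a real heap for brevity; neither is taken from MEME's source.

```python
def hamming_ball(seed, alphabet="ACGT"):
    """All words that differ from `seed` at exactly one position."""
    neighbours = []
    for i, original in enumerate(seed):
        for letter in alphabet:
            if letter != original:
                neighbours.append(seed[:i] + letter + seed[i + 1:])
    return neighbours

def branching_iteration(branch_heap, score, heap_size, alphabet="ACGT"):
    """Evaluate the ball around every seed and keep the best heap_size seeds."""
    candidates = set()
    for seed in branch_heap:
        candidates.update(hamming_ball(seed, alphabet))
    # Sort by descending score; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sorted(candidates, key=lambda s: (-score(s), s))
    return ranked[:heap_size]

# Toy scoring function (assumption): prefer seeds containing more G's.
new_heap = branching_iteration(["AAA"], score=lambda s: s.count("G"), heap_size=2)
```

The returned list plays the role of the new heap, which would become the branch_heap for the next iteration.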
MEME's run time is cubic with respect to the number of input sequences. Datasets with more than 1000 input sequences in OOPS mode or 1000+ sites in ZOOPS and ANR mode are intractable in terms of run time. For example, on a 3 GHz CPU, initialization of the p-value table takes about 10 min for 1000 sites/sequences in OOPS, about a day for 5000 sites, and about 1 week for 10,000 sites. After initialization, the cost that is quadratic in the total number of characters comes into play.
MEME's run time is quadratic with respect to the number of characters. So when the input dataset size doubles, the run time quadruples. If a 30,000-character dataset takes 15 min on a single CPU, a 30,000,000-character dataset would take 1 million times longer (N squared), or roughly 250,000 CPU hours, or 28.5 CPU years.
The parallel version of MEME scales up to about 128 processors. Please see http://www.sdsc.edu/~tbailey/papers/cabios96.pdf for a discussion of the parallel program. Bear in mind, however, that doubling the total number of sequences input to MEME means that you will need 8 times more processors for the job to finish in the same amount of time.
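The scaling figures quoted in this section follow from simple power-law arithmetic, which the short sketch below reproduces. The helper name `scaled_time` is an assumption for illustration; the base timings are the ones given in the text.

```python
def scaled_time(base_time, base_size, new_size, exponent):
    """Extrapolate a run time that grows as size ** exponent."""
    return base_time * (new_size / base_size) ** exponent

# Cubic growth in the number of sites/sequences (p-value table initialization):
# 10 minutes at 1000 sites, extrapolated to 5000 and 10,000 sites.
minutes_5000 = scaled_time(10, 1000, 5000, 3)       # 1250 minutes, about 21 hours
minutes_10000 = scaled_time(10, 1000, 10000, 3)     # 10000 minutes, about 7 days

# Quadratic growth in the number of characters:
# 15 minutes at 30,000 characters, extrapolated to 30,000,000 characters.
minutes_large = scaled_time(15, 30_000, 30_000_000, 2)
cpu_hours = minutes_large / 60                      # 250,000 CPU hours
cpu_years = cpu_hours / 24 / 365                    # about 28.5 CPU years

# Doubling the number of sequences under cubic growth multiplies the work by 8.
processor_factor = scaled_time(1, 1, 2, 3)
```

The extrapolated values agree with the "about a day", "about 1 week" and "28.5 CPU years" figures above.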
The following examples use data files provided in this release of the MEME Suite (in the tests directory).
A simple DNA example:
MEME looks for a single motif in the file
contains DNA sequences in FASTA format. The OOPS model is used so MEME
assumes that every sequence contains exactly one occurrence of the
motif. The palindrome switch is given so the motif model (PSPM) is
converted into a palindrome by combining corresponding frequency
columns. MEME automatically chooses the best width for the motif in
this example since no width was specified.
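The column-combining step for palindromes can be sketched as follows, based on the description of the -pal option: each column is averaged with its mirror column, pairing A with T and C with G. The function name and the toy matrix are assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def palindromize(pspm):
    """Average each column with the complement of its mirror column.

    pspm: list of {letter: probability} dicts, one per motif position.
    """
    width = len(pspm)
    result = []
    for i in range(width):
        mirror = pspm[width - 1 - i]
        # The frequency of A here is combined with the frequency of T in the
        # mirror column, and likewise C with G.
        result.append({a: (pspm[i][a] + mirror[COMPLEMENT[a]]) / 2
                       for a in "ACGT"})
    return result

# Toy two-column matrix, invented for this example.
pspm = [
    {"A": 0.8, "C": 0.1, "G": 0.1, "T": 0.0},
    {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
]
pal = palindromize(pspm)
```

After this step the matrix reads the same on both strands: the probability of A in the first column equals the probability of T in the last, and so on.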
Searching for motifs on both DNA strands:
This is like the previous example except that the -revcomp switch tells MEME to consider both DNA strands, and the -pal switch is absent so the palindrome conversion is omitted. When MEME uses both DNA strands, motif occurrences on the two strands may not overlap. That is, any position in the sequence given in the training set may be contained in an occurrence of a motif on the positive strand or the negative strand, but not both.
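For readers unfamiliar with strand handling, the sketch below shows how the reverse complement of a DNA sequence is formed and how two equal-width occurrences can be tested for overlap. Both helpers are illustrative assumptions and say nothing about how MEME enforces the rule internally.

```python
def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]

def occurrences_overlap(start_a, start_b, width):
    """True if two occurrences of the same width share any sequence position.

    Start positions are given in the coordinates of the input sequence,
    regardless of which strand each occurrence lies on.
    """
    return abs(start_a - start_b) < width

rc = reverse_complement("AACG")
```

Under the rule above, a positive-strand occurrence starting at position 10 and a negative-strand occurrence covering positions 15 to 22 (width 8) could not both be reported, because they share positions 15 to 17.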
A fast DNA example:
This example differs from example 1) in that MEME is told to only
consider motifs of width 20. This causes MEME to execute about 10
times faster. The
-w switch can also be used with protein
datasets if the width of the motifs is known in advance.
Using a higher-order background model:
In this example we use
-mod anr and
yeast.nc.6.freq. This specifies that
Using a higher-order background model can often result in more sensitive detection of motifs. This is because the background model more accurately models non-motif sequence, allowing MEME to discriminate against it and find the true motifs.
A simple protein example:
The -dna switch is absent, so MEME assumes the file
lipocalin.s contains protein sequences. MEME searches for
two motifs each of width less than or equal to 20. (Specifying
-maxw 20 makes MEME run faster since it does not have to
consider motifs longer than 20.) Each motif is assumed to occur in
each of the sequences because the OOPS model is specified.
Another simple protein example:
MEME searches for a motif of width up to 40 with up to 50 occurrences in the entire training set. The ANR sequence model is specified, which allows each motif to have any number of occurrences in each sequence. This dataset contains motifs that are repeated multiple times in each sequence. This example is fairly time consuming because the time required to initialize the motif probability tables is proportional to max width × max sites.
A much faster protein example:
This time MEME is constrained to search for three motifs of width exactly ten. The effect is to break up the long motif found in the previous example. The -w switch forces motifs to be exactly ten letters wide. This example is much faster because only one width is considered, so the time to build the motif probability tables is proportional only to max sites.
Splitting the sites into three:
This forces each motif to have exactly 24 occurrences and to be at most 12 letters wide.
A larger protein example with E-value cutoff:
In this example, MEME looks for up to 20 motifs, but stops when a motif is found with E-value greater than 0.01. Motifs with large E-values are likely to be statistical artifacts rather than biologically significant.