Usage:

meme <sequence file> [options]

Description

MEME is a tool for discovering motifs in a group of related DNA or protein sequences.

A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.
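The matrix representation described above can be illustrated with a minimal Python sketch. It is not MEME's implementation: the function name, the example sites and the optional pseudocount are assumptions made only for illustration.

```python
from collections import Counter

def letter_probability_matrix(sites, alphabet="ACGT", pseudocount=0.0):
    """Return one {letter: probability} dict per motif position.

    `sites` are equal-width, ungapped occurrences of a motif (hypothetical data).
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width (motifs contain no gaps)")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for position in range(width):
        counts = Counter(site[position] for site in sites)
        matrix.append({letter: (counts[letter] + pseudocount) / total
                       for letter in alphabet})
    return matrix

# Four made-up aligned occurrences of a width-5 DNA motif.
sites = ["TGTGA", "TGTGA", "TTTGA", "TGAGA"]
pspm = letter_probability_matrix(sites)
for position, column in enumerate(pspm, start=1):
    print(position, column)
```

Each row of the result sums to one and gives the probability of every letter at that position of the pattern.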

MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.

The examples are a good place to start.

Input

Sequence File

A file containing FASTA sequences or the word 'stdin' to indicate that the sequences should be read from standard input. Note that MEME does not attempt to detect the alphabet from the sequences so you should specify it with the -dna or -protein options. MEME also supports a modification to the FASTA format for weighting the sequences. Large datasets will need to be sub-sampled to allow MEME to run in reasonable times.
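One possible way to sub-sample a large FASTA dataset before running MEME is sketched below in Python. This is an illustrative assumption, not a tool shipped with MEME: the helper names, the fixed random seed and the example records are all hypothetical, and the sketch ignores the sequence-weighting extension mentioned above.

```python
import random

def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def subsample(records, max_sequences, seed=0):
    """Return a reproducible random subset, keeping the original order."""
    if len(records) <= max_sequences:
        return list(records)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(records)), max_sequences))
    return [records[i] for i in keep]

# Made-up input; in practice the lines would come from the sequence file.
example = """>seq1
ACGTAC
GT
>seq2
TTGACA
>seq3
GGGCCC
""".splitlines()

records = read_fasta(example)
subset = subsample(records, max_sequences=2, seed=1)
for header, sequence in subset:
    print(f">{header}\n{sequence}")
```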

Output

MEME outputs its results primarily as an HTML file named meme.html. MEME also outputs a machine-readable XML file and a plain-text version of its output, named meme.xml and meme.txt, respectively.

These files are placed in a directory named meme_out. You can choose a different directory name using the --o or --oc options.

Options

For the -p option, the parameter np may be a number, or it may be a quoted string starting with a number and followed by arguments to the particular MPI run command for your installation (e.g., mpirun).
Each entry below gives the option and its parameter (if any), followed by a description and the default behaviour.
Output
-text
    Output in text format only to standard output.
    Default: The program behaves as if --oc meme_out had been specified.
Alphabet
-dna
    MEME will expect that the sequences use the DNA alphabet. This means that the sequences may contain the letters "ACGT" and the ambiguous letters "BDHKMNRSUVWY*-". All ambiguous letters will be treated as unknown.
    Default: MEME assumes that the sequences are protein.
-protein
    MEME will expect that the sequences use the protein alphabet. This means that the sequences may contain the letters "ACDEFGHIKLMNPQRSTVWY" and the ambiguous letters "BUXZ*-". All ambiguous letters will be treated as unknown.
    Default: MEME assumes that the sequences are protein.
Contributing Site Distribution
-mod oops|zoops|anr
    This option is used to describe the distribution of motif sites.

    oops (One Occurrence Per Sequence)
        MEME assumes that each sequence in the dataset contains exactly one occurrence of each motif. This option is the fastest and most sensitive, but the motifs returned by MEME may be "blurry" if any of the sequences is missing them.
    zoops (Zero or One Occurrence Per Sequence)
        MEME assumes that each sequence may contain at most one occurrence of each motif. This option is useful when you suspect that some motifs may be missing from some of the sequences. In that case, the motifs found will be more accurate than using the first option. This option takes more computer time than the first option (about twice as much) and is slightly less sensitive to weak motifs present in all of the sequences.
    anr (Any Number of Repetitions)
        MEME assumes each sequence may contain any number of non-overlapping occurrences of each motif. This option is useful when you suspect that motifs repeat multiple times within a single sequence. In that case, the motifs found will be much more accurate than using one of the other options. This option can also be used to discover repeats within a single sequence. This option takes much more computer time than the first option (about ten times as much) and is somewhat less sensitive than the other two options to weak motifs which do not repeat within a single sequence.

    Default: MEME assumes the Zero or One Occurrence Per Sequence model.
Number of Motifs
-nmotifs n
    MEME will stop searching for motifs after finding n motifs.
    Default: MEME will find 1 motif.
-evt ev
    MEME will stop searching for motifs if the last motif found has an E-value > ev.
    Default: MEME will rely on other limits to decide when to stop searching for motifs.
-time t
    MEME will stop searching for motifs if it has found at least 1 motif and it estimates that finding any more will cause the total running time to exceed t CPU seconds.
    Default: MEME will rely on other limits to decide when to stop searching for motifs.
Number of Motif Occurrences
-nsites n
    When the motif site distribution model allows, MEME will only attempt to find motifs with n sites. Specifying this option is equivalent to setting the -minsites and -maxsites options to the same value. When the distribution model is OOPS this is ignored and the number of sites is set to the number of sequences.
    Default: See the -minsites and -maxsites options for information on the default behaviour.
-minsites n
    When the motif site distribution model allows, MEME will attempt to find motifs with at least n sites. When the distribution model is OOPS this is ignored and the number of sites is set to the number of sequences.
    Default: The minimum number of sites is set to 2 when it is not otherwise defined by the use of the OOPS model or the -nsites option.
-maxsites n
    When the motif site distribution model allows, MEME will attempt to find motifs with at most n sites. When the distribution model is OOPS this is ignored and the number of sites is set to the number of sequences.
    Default: When the site distribution model is ZOOPS this is set to the number of sequences; when it is ANR this is set to min(5 × sequence count, 600).
-wnsites weight
    This controls the strength of the bias towards motifs with exactly the expected number of sites as defined by the -nsites, -minsites and -maxsites options. It is a number in the range [0..1). The closer to one it is, the stronger the bias towards motifs with exactly the expected number of sites.
    Default: The weighting is set to 0.8.
Motif Width
-w w
    Search for motifs with a width of w.
    Default: Search for motifs with widths in the range set by -minw and -maxw.
-minw min w
    Search for motifs with a width ≥ min w.
    Default: Searches for motifs with a minimum width of 8.
-maxw max w
    Search for motifs with a width ≤ max w.
    Default: Searches for motifs with a maximum width of 50.
-nomatrim
    Do not adjust motif width using multiple alignments.
    Default: The motif is trimmed to avoid insertions and deletions.
-wg wg
    The gap opening cost for creating the alignments used to trim the motif.
    Default: The opening cost for a gap is 11.
-ws ws
    The gap extension cost for creating the alignments used to trim the motif.
    Default: The extension cost for a gap is 1.
-noendgaps
    Do not count end gaps in the alignments used to trim the motif.
    Default: End gaps are penalised like any other gap.
Background Model
-bfile bfile
    The name of a Markov background model file.

    The background model is used by MEME:

      1. during EM as the "null model",
      2. for calculating the log likelihood ratio of a motif,
      3. for calculating the significance (E-value) of a motif, and
      4. for creating the position-specific scoring matrix (log-odds matrix).

    Note that MEME uses only the 0-order portion (single letter frequencies) of the background model for purposes 3 and 4, but uses the full-order model for purposes 1 and 2, above.

    Default: The 0-order background frequencies are determined from the sequences.
Position-Specific Priors
-psp PSP file
    The name of a MEME position-specific priors file. This can be used to bias the search for motifs in MEME. The PSP file supplies a position-specific prior distribution on the location of motif sites in the sequence(s) in the input dataset.
    Default: All motif sites are considered equally likely.
DNA Palindromes & Strands
-revcomp
    Consider both the given strand and the reverse complement strand when searching for DNA motifs.
    Default: Search for DNA motifs on the given strand only.
-pal
    This causes MEME to only look for palindromes in DNA datasets. MEME averages the letter frequencies in corresponding columns of the motif (PSPM) together. For instance, if the width of the motif is 10, columns 1 and 10, 2 and 9, 3 and 8, etc., are averaged together. The averaging combines the frequency of A in one column with T in the other, and the frequency of C in one column with G in the other.
    Default: MEME won't specifically look for palindromes.
EM Algorithm
-maxiter max iter
    The number of iterations of EM to run from any starting point. EM is run for max iter iterations or until convergence (see -distance, below) from each starting point.
    Default: MEME will use a maximum of 50 EM iterations from any starting point.
-distance distance
    MEME stops iterating EM when the change in the motif frequency matrix is less than distance. Change is defined as the Euclidean distance between two successive frequency matrices.
    Default: The distance used for measuring convergence is 0.001.
-prior dirichlet|dmix|mega|megap|addone
    The type of prior to use.

    dirichlet (Simple Dirichlet)
        This is the default for DNA sequences. It is a simple Dirichlet prior based on the background model as set by the -bfile option.
    dmix (Dirichlets Mix)
        This is the default for protein sequences when using the OOPS distribution model. It is a mixture of Dirichlets prior. The source of the Dirichlets is specified by the -plib option.
    mega (Mega-weight Dirichlets Mix)
        This is an extremely low variance version of dmix. The variance is scaled inversely with the size of the dataset.
    megap (Mega-weight Dirichlets Mix Plus)
        This is the default for protein sequences when using the ZOOPS or ANR models. This behaves like mega until the last iteration of EM when it reverts to dmix behaviour.
    addone (Add One)
        Add +1 to each observed count.

    Default: When the prior is not selected, it depends on the model and the alphabet:

    Model                                  DNA         Protein
    One Occurrence Per Sequence            dirichlet   dmix
    Zero or One Occurrence Per Sequence    dirichlet   megap
    Any Number of Repetitions              dirichlet   megap
-b b
    The strength of the prior on model parameters: b = 0 means use intrinsic strength of prior for prior = dmix.
    Default: The value of b depends on the prior:

    Prior       b
    dirichlet   0.01
    dmix        0
-plib plib
    The name of the file containing the Dirichlet mixtures prior library.
    Default: The default value of plib depends on the alphabet of the sequences:

    Alphabet    plib
    DNA         INSTALL_DIR/etc/dna.plib
    Protein     INSTALL_DIR/etc/prior30.plib
Selecting Starts for EM
-spfuzz fuzz
    The fuzziness of sequence to theta mapping. The meaning of this parameter depends on the choice of mapping function as set by the -spmap option.
    Default: The default value of fuzz depends on the mapping function (see -spmap):

    Mapping Function          fuzz
    Uniform                   0.5
    Point Accepted Mutation   120
-spmap uni|pam
    The mapping function to use for estimating theta.

    uni (Uniform)
        Add a uniform prior of fuzz when converting a substring to an estimate of theta.
    pam (Point Accepted Mutation)
        Use PAM matrices, with the number of mutation events set by fuzz, to estimate theta.

    Default: The default mapping function depends on the alphabet of the sequences:

    Alphabet    Mapping Function
    DNA         Uniform
    Protein     Point Accepted Mutation
-cons consensus
    Override the sampling of starting points and just use a starting point derived from consensus. This is useful when an actual occurrence of a motif is known and can be used as the starting point for finding the motif.
    Default: Refer to the -spmap option for the default behaviour.
Branching Search on EM Starts
-heapsize hs
    Size of heaps for widths where substring search occurs. See the branching search section for more details.
-x_branch
    (Experimental) Perform x-branching. This does a beam search and tests the actual words in the input as well as words at Hamming distance 1 at each successive level of branching. See the branching search section for more details.
    Default: Normal branching search is performed.
-w_branch
    (Experimental) Perform width branching. This is not optimised. See the branching search section for more details.
    Default: Normal branching search is performed.
-bfactor bf
    The number of iterations of branching search. See the branching search section for more details.
    Default: MEME uses 3 iterations of branching search.
Misc
-maxsize max size
    Maximum allowed dataset size (in characters).
    Default: The maximum allowed dataset size is 100000 characters. Note that the default maximum size exists to warn users that their dataset is possibly too large to process in a reasonable time, so please consider carefully before increasing this value.
-nostatus
    Print no status messages to terminal.
    Default: Print minimal status messages to terminal.
-p np
    Use the faster, parallel version of MEME with np processors.
    Default: Use a single processor.
-sf sf
    Print sf as name of sequence file.
    Default: Print actual file name.
-V
    Print extensive status messages to terminal.
    Default: Print minimal status messages to terminal.
-h
    Display a usage message and exit.
    Default: Run as normal.
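
The column averaging described for the -pal option can be sketched in Python as follows. This is a minimal illustration under assumptions, not MEME's code: the function name and the example matrix are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def palindromize(pspm):
    """Average each column with the complement of its mirror-image column.

    `pspm` is a list of {letter: probability} dicts, one per motif position.
    Column i is paired with column width - 1 - i (0-based), combining A with T
    and C with G, as in the description of -pal.
    """
    width = len(pspm)
    result = []
    for i in range(width):
        mirror = pspm[width - 1 - i]
        result.append({
            letter: (pspm[i][letter] + mirror[COMPLEMENT[letter]]) / 2.0
            for letter in "ACGT"
        })
    return result

# A made-up width-4 motif.
pspm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
    {"A": 0.2, "C": 0.2, "G": 0.5, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
pal = palindromize(pspm)
for column in pal:
    print(column)
```

After averaging, the probability of a letter in one column equals the probability of its complement in the mirror-image column, so the motif reads the same on both strands.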

Motif Objective Function

MEME uses an objective function on motifs to select the "best" motif. The objective function is based on the statistical significance of the log likelihood ratio (LLR) of the occurrences of the motif. The E-value of the motif is an estimate of the number of motifs (with the same width and number of occurrences) that would have equal or higher log likelihood ratio if the training set sequences had been generated randomly according to the (0-order portion of the) background model.

MEME searches for the motif with the smallest E-value. It searches over different motif widths, numbers of occurrences, and positions in the training set for the motif occurrences. The user may limit the range of motif widths and number of occurrences that MEME tries. In addition, MEME trims the motif (using a dynamic programming multiple alignment) to eliminate any positions where there is a gap in any of the occurrences.

The log likelihood ratio of a motif is llr = log (Pr(sites | motif) / Pr(sites | back)) and is a measure of how different the sites are from the background model. Pr(sites | motif) is the probability of the occurrences given a model consisting of the position-specific probability matrix (PSPM) of the motif. Pr(sites | back) is the probability of the occurrences given the background model. The background model is an n-order Markov model. By default, it is a 0-order model consisting of the frequencies of the letters in the training set. A different 0-order Markov model or higher order Markov models can be specified to MEME using the -bfile option.
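
As a worked illustration of the formula above, the following Python sketch computes llr for a handful of sites under a 0-order background. The function name, the example data, the uniform background and the use of the natural logarithm are assumptions for illustration; the text does not state which logarithm base MEME uses.

```python
import math

def log_likelihood_ratio(sites, pspm, background):
    """llr = log(Pr(sites | motif) / Pr(sites | back)).

    Sites and columns are treated as independent, so the log ratio is a sum
    over every letter of every site (0-order background only).
    """
    llr = 0.0
    for site in sites:
        for position, letter in enumerate(site):
            llr += math.log(pspm[position][letter] / background[letter])
    return llr

# Made-up width-3 motif and its four occurrences.
sites = ["TGA", "TGA", "TCA", "TGA"]
pspm = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.25, "G": 0.75, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

llr = log_likelihood_ratio(sites, pspm, background)
print(f"llr = {llr:.4f}")
```

Because the sum runs column by column, the same loop also shows why the llr of a motif is the sum of the llr values of its columns.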

The E-value reported by MEME is actually an approximation of the E-value of the log likelihood ratio (an approximation is used because it is far more efficient to compute). The approximation is based on the fact that the log likelihood ratio of a motif is the sum of the log likelihood ratios of each column of the motif. Instead of computing the statistical significance of this sum (its p-value), MEME computes the p-value of each column and then computes the significance of their product. Although not identical to the significance of the log likelihood ratio, this easier-to-compute objective function works very similarly in practice.
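
The idea of computing the significance of a product of column p-values can be sketched as follows. The formula used here is the standard result for the product of n independent, uniformly distributed p-values; the text above does not say which formula MEME uses, so treat this as an assumption, along with the function name and the example values.

```python
import math

def product_p_value(p_values):
    """P-value of observing a product of n independent uniform p-values
    at least as small as the one given.

    For a product x of n p-values: x * sum_{i=0}^{n-1} (-ln x)^i / i!
    """
    x = math.prod(p_values)
    if x <= 0.0:
        return 0.0
    neg_log = -math.log(x)
    term, total = 1.0, 1.0
    for i in range(1, len(p_values)):
        term *= neg_log / i      # builds (-ln x)^i / i! incrementally
        total += term
    return min(1.0, x * total)

# Hypothetical p-values for the columns of a width-4 motif.
column_p_values = [0.01, 0.2, 0.05, 0.3]
combined = product_p_value(column_p_values)
print(f"product = {math.prod(column_p_values):.2e}, combined p-value = {combined:.2e}")
```

With a single column the combined value equals that column's p-value, and it is always at least as large as the raw product.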

The motif significance is reported as the E-value of the motif. The statistical significance of a motif is computed based on:

  1. the log likelihood ratio,
  2. the width of the motif,
  3. the number of occurrences,
  4. the 0-order portion of the background model,
  5. the size of the training set, and
  6. the type of model (oops, zoops, or anr, which determines the number of possible different motifs of the given width and number of occurrences).

MEME searches for motifs by performing Expectation Maximization (EM) on a motif model of a fixed width and using an initial estimate of the number of sites. It then sorts the possible sites by the probability that EM assigns to them. MEME then calculates the E-values of the first n sites in the sorted list for different values of n. This procedure (first EM, followed by computing E-values for different numbers of sites) is repeated with different widths and different initial estimates of the number of sites. MEME outputs the motif with the lowest E-value.
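
The step of ranking sites and choosing the number of sites with the lowest E-value can be sketched as below. Everything here is hypothetical: the site labels, the probabilities and especially toy_e_value, which is only a stand-in so the example runs and is not MEME's E-value computation.

```python
import math

def best_site_count(scored_sites, e_value_of, min_sites=2, max_sites=None):
    """Rank sites by probability and pick the prefix length n with the lowest E-value.

    `scored_sites` is a list of (label, probability) pairs as produced by EM.
    `e_value_of` maps a list of top-ranked sites to an E-value.
    Returns (e_value, n, labels of the chosen sites).
    """
    ranked = sorted(scored_sites, key=lambda item: item[1], reverse=True)
    upper = len(ranked) if max_sites is None else min(max_sites, len(ranked))
    best = None
    for n in range(min_sites, upper + 1):
        e_value = e_value_of(ranked[:n])
        if best is None or e_value < best[0]:
            best = (e_value, n, [label for label, _ in ranked[:n]])
    return best

def toy_e_value(top_sites):
    # Placeholder score: sites with probability above 0.3 lower the value,
    # sites below 0.3 raise it. NOT the statistic MEME computes.
    return math.prod(0.3 / probability for _, probability in top_sites)

scored_sites = [("seq1:12", 0.9), ("seq2:40", 0.2), ("seq3:7", 0.8),
                ("seq4:3", 0.1), ("seq5:25", 0.7)]
best = best_site_count(scored_sites, toy_e_value)
print(best)
```

In the full procedure this selection would be repeated for each width and each initial estimate of the number of sites, keeping the overall lowest E-value.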

Multiple Alignment Trimming

Once a candidate motif has been found the multiple alignment method performs a separate pairwise alignment of the site with the highest probability and each other possible site (the alignment includes width/2 positions on either side of the sites).

The pairwise alignments are then combined and the method determines the widest section of the motif with no insertions or deletions. If this alignment is shorter than min w, it tries to find an alignment allowing up to one insertion/deletion per motif column. This continues (allowing up to 2, 3 ... insertions/deletions per motif column) until an alignment of width at least min w is found.
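
The relaxation loop described above can be sketched in Python. The sketch assumes the combined alignments have already been reduced to a count of insertions/deletions per motif column; that input representation, the function names and the example counts are assumptions for illustration.

```python
def widest_section(indels_per_column, max_indels):
    """Return (start, end) of the widest contiguous run of columns with at most
    max_indels insertions/deletions; end is exclusive, leftmost run wins ties."""
    best = (0, 0)
    start = None
    # A sentinel column that always fails closes the final run.
    for i, count in enumerate(list(indels_per_column) + [max_indels + 1]):
        if count <= max_indels:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

def trim_motif(indels_per_column, min_w):
    """Allow 0, 1, 2, ... indels per column until a section of width >= min_w exists."""
    if min_w > len(indels_per_column):
        raise ValueError("min_w exceeds the motif width")
    allowed = 0
    while True:
        start, end = widest_section(indels_per_column, allowed)
        if end - start >= min_w:
            return start, end, allowed
        allowed += 1

# Hypothetical indel counts for a width-10 candidate motif.
indels = [2, 0, 0, 0, 1, 0, 0, 0, 0, 3]
print(trim_motif(indels, min_w=8))
print(trim_motif(indels, min_w=4))
```

With min_w = 8 no gap-free section is wide enough, so one insertion/deletion per column has to be allowed; with min_w = 4 a gap-free section suffices.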

The switches -nomatrim, -wg, -ws and -noendgaps control trimming of motifs using the multiple alignment method. Specifying the -nomatrim option causes MEME to skip trimming altogether. The -wg and -ws options set the costs of including gaps in the alignments, and the -noendgaps option allows gaps at the beginning and end to be treated specially.

After trimming, the number of occurrences is adjusted to minimize the motif E-value, and then the motif width is further shortened to optimize the E-value.

Branching Search

The search for good EM starting points can be improved by using branching search.

Branching search begins with a fixed-sized heap of best EM starts identified during the search of subsequences from the dataset. These starts are also called "seeds". The fixed-sized heap of seeds is used as the "branch_heap" during the first iteration of branching search.

For each iteration of branching search, all seeds in the current branch_heap are considered. All seeds in the ball within Hamming distance 1 of a given seed are evaluated and added to a new heap. The ball of new seeds is generated by mutating each character of the initial seed to each alternative character in the alphabet.

After the ball for every branch_heap seed has been evaluated, the seeds in the resulting new heap are added to the heap of best EM starts. The new heap is then used as the branch_heap for the next iteration of branching search.
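
The iteration described above can be sketched in Python. This is a simplified illustration, not MEME's implementation: the score function, the hidden target word and the use of heapq.nlargest to model fixed-size heaps are all assumptions.

```python
import heapq

def hamming_ball(seed, alphabet="ACGT"):
    """All words at Hamming distance exactly 1 from seed."""
    neighbours = []
    for i, original in enumerate(seed):
        for letter in alphabet:
            if letter != original:
                neighbours.append(seed[:i] + letter + seed[i + 1:])
    return neighbours

def branching_search(initial_seeds, score, heap_size, iterations, alphabet="ACGT"):
    """Return the best EM starts after the given number of branching iterations."""
    best_starts = list(initial_seeds)
    branch_heap = list(initial_seeds)
    for _ in range(iterations):
        candidates = set()
        for seed in branch_heap:
            candidates.update(hamming_ball(seed, alphabet))
        # Sorting first makes tie-breaking deterministic.
        new_heap = heapq.nlargest(heap_size, sorted(candidates), key=score)
        merged = sorted(set(best_starts) | set(new_heap))
        best_starts = heapq.nlargest(heap_size, merged, key=score)
        branch_heap = new_heap
    return best_starts

# Hypothetical scoring: number of positions that agree with a hidden target word.
TARGET = "TGTGA"
def score(seed):
    return sum(a == b for a, b in zip(seed, TARGET))

starts = branching_search(["AAAAA"], score, heap_size=3, iterations=4)
print(starts)
```

In this toy setting each iteration can improve the best seed by one matching position, which is why four iterations are enough to reach the target from a seed that already matches it in one position.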

Running time on large inputs

MEME's run time is cubic with respect to the number of input sequences. Datasets with more than 1000 input sequences in OOPS mode, or 1000 or more sites in ZOOPS and ANR modes, are intractable in terms of run time. For example, on a 3 GHz CPU, initialization of the p-value table takes about 10 min for 1000 sites/sequences in OOPS mode, about a day for 5000 sites, and about 1 week for 10,000 sites. The run time that is quadratic in the total number of characters then comes on top of this initialization cost.
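
The figures quoted above follow from cubic scaling, as this short Python check shows. The function name is hypothetical; the reference point (10 minutes for 1000 sites) is taken from the paragraph above, and real timings will depend on hardware.

```python
def scaled_init_minutes(n_sites, reference_sites=1000, reference_minutes=10.0):
    """Extrapolate p-value table initialization time assuming cubic growth."""
    return reference_minutes * (n_sites / reference_sites) ** 3

for n in (1000, 5000, 10000):
    minutes = scaled_init_minutes(n)
    print(f"{n:>6} sites: {minutes:>8.0f} min = {minutes / 60 / 24:.2f} days")
```

The extrapolation gives 1250 minutes (a little under a day) for 5000 sites and 10,000 minutes (about a week) for 10,000 sites.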

MEME's run time is quadratic with respect to the number of characters. So when the input dataset size doubles, the run time quadruples. If a 30,000 character dataset takes 15 min on a single CPU, a 30,000,000 character dataset would take 1 million times longer (N squared), or roughly 250,000 CPU hours, or 28.5 CPU years.
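
The arithmetic in this paragraph can be reproduced with a few lines of Python. The function name is hypothetical and the reference point (15 minutes for 30,000 characters) comes from the text above.

```python
def scaled_run_minutes(n_chars, reference_chars=30_000, reference_minutes=15.0):
    """Extrapolate run time assuming quadratic growth in dataset size."""
    return reference_minutes * (n_chars / reference_chars) ** 2

minutes = scaled_run_minutes(30_000_000)
hours = minutes / 60
years = hours / 24 / 365
print(f"{minutes:,.0f} CPU minutes = {hours:,.0f} CPU hours = {years:.1f} CPU years")
```

A dataset 1000 times larger is 1000 squared, or one million, times slower, which is where the 250,000 CPU hours come from.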

The parallel version of MEME scales up to about 128 processors. Please see http://www.sdsc.edu/~tbailey/papers/cabios96.pdf for a discussion of the parallel program. Bear in mind, however, that doubling the total number of sequences input to MEME means that you will need 8 times as many processors for the job to finish in the same amount of time.

Examples

The following examples use data files provided in this release of the MEME Suite (in the tests directory).

  1. A simple DNA example:

    meme crp0.s -dna -mod oops -pal

    MEME looks for a single motif in the file crp0.s which contains DNA sequences in FASTA format. The OOPS model is used so MEME assumes that every sequence contains exactly one occurrence of the motif. The palindrome switch is given so the motif model (PSPM) is converted into a palindrome by combining corresponding frequency columns. MEME automatically chooses the best width for the motif in this example since no width was specified.

  2. Searching for motifs on both DNA strands:

    meme crp0.s -dna -mod oops -revcomp

    This is like the previous example except that the -revcomp switch tells MEME to consider both DNA strands, and the -pal switch is absent so the palindrome conversion is omitted. When MEME uses both DNA strands, motif occurrences on the two strands may not overlap. That is, any position in the sequence given in the training set may be contained in an occurrence of a motif on the positive strand or the negative strand, but not both.

  3. A fast DNA example:

    meme crp0.s -dna -mod oops -revcomp -w 20

    This example differs from example 2 in that MEME is told to only consider motifs of width 20. This causes MEME to execute about 10 times faster. The -w switch can also be used with protein datasets if the width of the motifs is known in advance.

  4. Using a higher-order background model:

    meme INO_up800.s -dna -mod anr -revcomp -bfile yeast.nc.6.freq

    In this example we use -mod anr and -bfile yeast.nc.6.freq. This specifies that

    1. the motif may have any number of occurrences in each sequence, and,
    2. the Markov model specified in yeast.nc.6.freq is used as the background model. This file contains a fifth-order Markov model for the non-coding regions in the yeast genome.

    Using a higher order background model can often result in more sensitive detection of motifs. This is because the background model more accurately models non-motif sequence, allowing MEME to discriminate against it and find the true motifs.

  5. A simple protein example:

    meme lipocalin.s -mod oops -maxw 20 -nmotifs 2

    The -dna switch is absent, so MEME assumes the file lipocalin.s contains protein sequences. MEME searches for two motifs each of width less than or equal to 20. (Specifying -maxw 20 makes MEME run faster since it does not have to consider motifs longer than 20.) Each motif is assumed to occur in each of the sequences because the OOPS model is specified.

  6. Another simple protein example:

    meme farntrans5.s -mod anr -maxw 40 -maxsites 50

    MEME searches for a motif of width up to 40 with up to 50 occurrences in the entire training set. The ANR sequence model is specified, which allows each motif to have any number of occurrences in each sequence. This dataset contains motifs that repeat multiple times in each sequence. This example is fairly time consuming because the time required to initialize the motif probability tables is proportional to max width × max sites.

  7. A much faster protein example:

    meme farntrans5.s -mod anr -w 10 -maxsites 30 -nmotifs 3

    This time MEME is constrained to search for three motifs of width exactly ten. The effect is to break up the long motif found in the previous example. The -w switch forces motifs to be exactly ten letters wide. This example is much faster because, since only one width is considered, the time to build the motif probability tables is only proportional to max sites.

  8. Splitting the sites into three:

    meme farntrans5.s -mod anr -maxw 12 -nsites 24 -nmotifs 3

    This forces each motif to have 24 occurrences, exactly, and be up to 12 letters wide.

  9. A larger protein example with E-value cutoff:

    meme adh.s -mod zoops -nmotifs 20 -evt 0.01

    In this example, MEME looks for up to 20 motifs, but stops when a motif is found with E-value greater than 0.01. Motifs with large E-values are likely to be statistical artifacts rather than biologically significant.

Citing

If you use MEME in your research, please cite the following paper:
Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.