Biomedical Computation

What Are Motifs?

A motif in molecular biology is a relatively short sequence of nucleotides or amino acids that changes little during evolution and, at least presumably, has a definite biological function. The term sometimes refers not to one specific sequence but to a family of sequences, described in some way, each of which can perform the motif's biological function.

Motifs are ubiquitous in living organisms and perform many vital functions. Nucleotide motifs take part in the regulation of transcription and translation; protein motifs direct post-translational modification and cellular localization and partially determine the functional properties of proteins (for example, the leucine zipper). Motifs are widely used in bioinformatics to predict the functions of genes and proteins and to construct regulatory maps, and they matter for many problems in genetic engineering and molecular biology in general.

Because of the practical importance of motifs, both bioinformatic methods for finding them (MEME, Gibbs Sampler) and in vivo methods (ChIP-seq, ChIP-exo) have been developed. The latter often give only approximate coordinates of motifs, and their results are then refined with bioinformatic methods. For storage in databases, motifs are represented in different ways, the most common being the consensus sequence and the position weight matrix.
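
The two representations mentioned above can be illustrated with a minimal Python sketch. The aligned sites are made up for illustration, and the pseudocount and the uniform background frequency of 0.25 are assumptions rather than values prescribed by any particular database; the function names are hypothetical as well.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def count_matrix(sites):
    """Per-position nucleotide counts for equal-length, aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]

def consensus(counts):
    """Most frequent nucleotide per position (ties go to ALPHABET order)."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in counts)

def position_weight_matrix(counts, pseudocount=1.0, background=0.25):
    """Log2-odds weights against a uniform background (an assumption)."""
    pwm = []
    for col in counts:
        total = sum(col[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        pwm.append({
            b: math.log2((col[b] + pseudocount) / total / background)
            for b in ALPHABET
        })
    return pwm

# Made-up aligned binding sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
counts = count_matrix(sites)
pwm = position_weight_matrix(counts)
print(consensus(counts))
```

The consensus keeps only the most frequent letter per position, whereas the matrix also records how strongly each position prefers it, which is why the matrix is the more informative of the two.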

A motif should be distinguished from conserved regions in closely related organisms that have no significant biological function and remain similar only because the mutational process has not yet changed them sufficiently.

Motifs in nucleic acids

In DNA, motifs are most often short sequences that serve as binding sites for proteins, such as nucleases and transcription factors, or that take part in important regulatory processes at the RNA level, such as ribosome binding, mRNA processing, and transcription termination.

A brief history of the study

The study of motifs in DNA became possible with the appearance in 1973 of a DNA sequencing procedure (determination of the nucleotide sequence of a DNA fragment). The first sequences to be determined were those of the lac operator and the lambda operator. However, until more efficient sequencing methods appeared, the number of known motif sequences remained rather small. By the end of the 1970s, many examples of mutant transcription factor binding sites and of sequences with altered specificity had accumulated.

As the number of sequences grew, methods for the theoretical prediction of motifs began to develop. In 1982, a position weight matrix (PWM) was constructed for the first time, for the translation initiation site motif, and was then used to predict other translation initiation sites. This approach turned out to be quite powerful and is still used in various forms to search for known motifs in genomes; specific methods differ only in the type of weighting function. However, building a PWM from already known sequences does not allow fundamentally new motifs to be found, which is a more difficult task. The first algorithm to address this problem was proposed by Galas and colleagues in 1985. It was based on searching for common words in a set of sequences and gave a large percentage of false-negative results, but it became the basis for a whole family of algorithms. Later, more accurate probabilistic methods were developed: the MEME algorithm, based on the expectation-maximization procedure, and the Gibbs Sampler algorithm, a stochastic sampling counterpart of expectation maximization. Both methods have proven to be very sensitive and are still used to predict motifs in sets of sequences.
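
Searching for a known motif with a weight matrix, as described above, amounts to sliding the matrix along a sequence and summing the weights of the letters in each window. The sketch below assumes a simple additive score and a fixed threshold; the three-column matrix, the threshold, and the function name `scan` are hypothetical and chosen only to keep the example small.

```python
def scan(sequence, pwm, threshold):
    """Return (offset, score) for every window scoring at least threshold.

    pwm is a list of dicts, one per motif position, mapping a nucleotide
    to its weight. Letters missing from a column raise KeyError.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[base] for col, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical matrix that favours the word ATG; weights are illustrative.
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = scan("CCATGGATGA", toy_pwm, threshold=3.0)
print(hits)
```

Real tools differ mainly in how the weights are derived and how the threshold is chosen, which corresponds to the "type of weighting function" mentioned in the text.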

Once powerful tools for predicting transcription factor binding motifs had been developed and a sufficient number of transcription factors had been matched to their motifs, it became possible to predict the functions of an operon lying close to a motif from the specificity of the transcription factor that binds it and, conversely, to predict the transcription factor from the genes in the operon lying next to a given motif.

Motif structure

Motifs that bind transcription factors often take the form of direct repeats of a certain sequence, inverted repeats, or palindromic sequences. This can be explained by the fact that transcription factors work as protein dimers in which each monomer binds the same sequence. Motifs with a larger number of repeats also occur. Such a structure provides a sharper response to changes in external conditions. For example, if binding depends on the concentration of a single substance in the cell, the strength of the cell's response follows the Michaelis-Menten equation. As the number of bound protein units increases (assuming that binding of the protein to the motif has an effect only when all repeats are bound), the dependence becomes more and more sigmoidal, tending in the limit to the Heaviside function. This describes one of the main principles by which living systems respond to many stimuli, the all-or-nothing law, as seen for example in the generation of an action potential.
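
The sharpening of the response can be illustrated numerically. The sketch below uses the Hill equation, which is an assumption: it is one common way to model fully cooperative binding to n sites and reduces to the Michaelis-Menten form for n = 1. The half-saturation constant K = 1 and the function name are illustrative.

```python
def hill_response(concentration, half_saturation, n_sites):
    """Fraction of maximal response under the Hill equation.

    With n_sites = 1 this is the Michaelis-Menten (hyperbolic) form;
    larger n_sites gives an increasingly step-like curve around
    concentration == half_saturation.
    """
    c_n = concentration ** n_sites
    return c_n / (half_saturation ** n_sites + c_n)

K = 1.0  # illustrative half-saturation constant
for n in (1, 2, 4, 16):
    below = hill_response(0.5 * K, K, n)
    above = hill_response(2.0 * K, K, n)
    print(f"n={n:2d}  response at 0.5K = {below:.4f}  at 2K = {above:.4f}")
```

As n grows, the response just below K approaches zero and the response just above K approaches one, which is the step-like behaviour described in the text.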

Protein motifs

GOMO (a program that searches a set of ranked genes for enriched GO terms associated with high-ranking genes) distinguishes two types:

  • Motif in the amino acid sequence;
  • Structural motif – the relative position in space of several closely spaced elements of the secondary structure. In the sequence, these elements can be far apart from each other.

Motifs in primary structure (protein sequence)

The motifs in the primary structure are similar to the motifs in nucleic acids. Typical examples of these are:

  1. signal peptides – short amino acid sequences within a protein, about 3–60 amino acids in length, which determine the cell compartment the protein will be sent to after synthesis. An example is a nuclear localization signal;
  2. sites of post-translational protein modification, which are conserved peptides about 5–12 amino acids long. An example is the acetylation sites in a protein.

Structural motifs

In proteins, structural motifs describe the connections between the elements of the secondary structure. Such motifs often have sections of variable length, which in some cases may be completely absent.

  1. Leucine zipper – characteristic of dimeric proteins that bind DNA. The leucine zipper provides contact between two protein monomers through hydrophobic interactions. It is characterized by the presence of a leucine residue in every seventh position;
  2. Zinc fingers – characteristic of DNA-binding transcription factors;
  3. Helix-turn-helix – a DNA-binding motif; the DNA-binding fragment of the Lac repressor is of this kind;
  4. Homeodomain – a motif that binds DNA and RNA. In eukaryotes, proteins with homeodomains induce cell differentiation, triggering cascades of genes necessary for the formation of tissues and organs. It is similar to the “helix-turn-helix” motif and is therefore often not singled out separately;
  5. Rossmann fold – a motif that binds nucleotides (for example, NAD). It occurs, in particular, in dehydrogenases, including glyceraldehyde-3-phosphate dehydrogenase, which is involved in glycolysis;
  6. EF-hand – a motif that binds Ca2+ ions; it is also similar to the “helix-turn-helix” motif;
  7. Nest – three consecutive amino acid residues form an anion binding site;
  8. Niche – three consecutive amino acid residues form the cation binding site;
  9. Beta-hairpin – two β-strands connected by a short turn of the protein chain.

In addition to the beta-hairpin, many other motifs are distinguished, the function of which is to form the structural framework of the protein.
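
The leucine zipper rule from the list above (a leucine in every seventh position) is simple enough to express as a sequence pattern. The sketch below looks for four leucines spaced seven residues apart, L-x(6)-L-x(6)-L-x(6)-L; requiring exactly four repeats is an assumption, the test sequence is made up, and a pattern this loose is expected to match many proteins that contain no real zipper.

```python
import re

# Four leucines spaced seven residues apart: L-x(6)-L-x(6)-L-x(6)-L.
# The lookahead makes overlapping matches visible to finditer.
HEPTAD_LEUCINES = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")

def find_leucine_heptads(protein):
    """Return 0-based start offsets of every (possibly overlapping) match."""
    return [m.start() for m in HEPTAD_LEUCINES.finditer(protein)]

# Made-up sequence: leucines at offsets 2, 9, 16 and 23.
toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"
print(find_leucine_heptads(toy))
```

A match only flags a candidate region; whether it forms a zipper depends on the helical structure and on the dimer partner, which a sequence pattern alone cannot establish.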

The term fold is close in meaning to the term protein structural motif: it also denotes a characteristic arrangement of the elements of the secondary structure. Because of this similarity, the two terms are often used interchangeably and the line between them is blurred.