
MHMM Format

The MHMM format describes a hidden Markov model of motif sites in a sequence.

A hidden Markov model (HMM) is a statistical model of a process that behaves like a Markov process with unknown parameters; the task is to infer the unknown parameters from the observable outputs. The inferred parameters can then be used in further analysis, for example for pattern recognition. An HMM can be viewed as the simplest dynamic Bayesian network.

The first papers on hidden Markov models were published by Baum and colleagues in the late 1960s, and HMMs were first applied to speech recognition in the 1970s. Since the mid-1980s, HMMs have been used in the analysis of biological sequences, in particular DNA.

HMMs find their main applications in speech recognition, handwriting recognition, motion recognition, and bioinformatics. They are also used in cryptanalysis and machine translation.

Example

Imagine two friends who talk on the phone every evening about what they did that day. Your friend can do only three things: walk in the park, go shopping, or clean the room. His choice depends only on the weather at the moment the decision is made. You know nothing about the weather in the region where your friend lives, but based on his decisions you can try to guess what the weather was like.

The weather is represented as a Markov chain with two states, sunny and rainy, but you cannot observe it directly, so it is hidden from you. Each day your friend makes one of three possible decisions: walking, shopping, or cleaning. You can learn your friend's decision, so this is the observable value. Together, these form an HMM.
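To make the example concrete, here is a minimal sketch of this weather HMM in Python. All probability values are illustrative assumptions, not quantities given in the text:

```python
# Hidden states: the weather where your friend lives (not directly observed).
states = ("Rainy", "Sunny")

# Observations: the activity your friend reports each evening.
observations = ("walk", "shop", "clean")

# All numbers below are illustrative assumptions for this example.
start_probability = {"Rainy": 0.6, "Sunny": 0.4}

# P(weather tomorrow | weather today): the hidden Markov chain.
transition_probability = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}

# P(activity | weather): each hidden state has its own output distribution.
emission_probability = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
```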

HMM structure

In a conventional Markov model the states are visible to the observer, so the transition probabilities are the only parameters. In a hidden Markov model we can only observe variables that are influenced by the current state. Each state has its own probability distribution over all possible outputs. Therefore, the sequence of symbols generated by an HMM carries information about the sequence of states.

In the usual graphical depiction of an HMM, ovals represent random variables. The random variable x(t) is the value of the hidden variable at time t; the random variable y(t) is the value of the observed variable at time t. Arrows represent conditional dependencies.

From this structure it is clear that the value of the hidden variable x(t) at time t depends only on the value of the hidden variable x(t-1) at time t-1; this is called the Markov property. Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t), both at time t.
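Written out, these two dependence assumptions factorize the joint probability of a hidden-state sequence x(1), ..., x(T) and an observation sequence y(1), ..., y(T) as

P(x(1), ..., x(T), y(1), ..., y(T)) = P(x(1)) P(y(1) | x(1)) ∏_{t=2..T} P(x(t) | x(t-1)) P(y(t) | x(t)),

and it is exactly this factorization that the algorithms below exploit.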

Basic algorithms

There are three main tasks associated with HMMs:

  • Forward-backward algorithm: given the model parameters and an observation sequence, compute the probability that the model generated this sequence (the evaluation problem; see the sketch after this list);
  • Viterbi algorithm: given the model parameters and an observation sequence, find the most likely sequence of hidden states that explains the observations (the decoding problem; also sketched below);
  • Baum-Welch algorithm: given one or more discrete-valued output sequences, train the HMM, i.e. estimate the parameters that make these outputs most probable (the learning problem).
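
As a sketch of the first two tasks, here is a compact Python implementation of the forward pass of the forward-backward algorithm and of the Viterbi algorithm, reusing the illustrative weather model defined earlier. This is a minimal example for small discrete models, not a numerically robust implementation (a production version would work in log space):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: probability of the observation sequence under the model."""
    # alpha[s] = P(observations seen so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[p] * trans_p[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi algorithm: most likely hidden-state path and its probability."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prob, prev = max((best[p][0] * trans_p[p][s], p) for p in states)
            new_best[s] = (prob * emit_p[s][o], best[prev][1] + [s])
        best = new_best
    return max(best.values())


activities = ("walk", "shop", "clean")
print(forward(activities, states, start_probability,
              transition_probability, emission_probability))
prob, path = viterbi(activities, states, start_probability,
                     transition_probability, emission_probability)
print(path, prob)
```

For the three-day observation sequence (walk, shop, clean), forward sums over all possible weather sequences to give the total probability of the observations, while viterbi recovers the single most likely weather sequence.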