There are many DNA components near a given gene in a cell which influence its transcription and resulting activation for protein production (expression). These DNA regions can bind activators, co-activators, and inhibitors. Often these act to reinforce or inhibit each other. In general they can act combinatorially, determining the gene's expression depending on circumstances. Primary among such binding proteins are transcription factors (TFs), the main activators of gene transcription.
The DNA binding components are largely determined by binding site motifs, characteristic signatures of length 5-15 DNA base pairs (bps) which bind to transcription factors (TFs). The determination of a given TF's binding motifs often follows from identification of common DNA patterns in promoter regions (regions known to contain most TF binding sites) adjacent to genes which are known experimentally to bind the TF, or shown to express their RNA or protein in the presence of the TF in experiments.
The leading transcription factor binding site (TFBS) motif finding algorithm can be divided into two groups. One is Local alignment based method, e.g., Gibbs sampling method (AlignACE, Gibbs Sampler, BioProspector), MEME, CONSENSUS. They predict the binding motif by finding the best local alignment among N co-regulated promoters under given statistical measure of the alignment profile (e.g. PWM). In theory, it has been proved that exhaustively evaluating all possible alignment is NP-hard problem [PAVESI]. Therefore, some approximation techniques are used to search such a profile.
Another kind of methods is K-mer enumeration based, e.g. MDscan, Weeder. They enumerate all possible binding K-mers and evaluate each of them by counting their frequencies of occurrence in each promoter region with either no mismatch or at most M mis-matches. Again, exhaustive enumeration of K-mers has 4^K possibilities with an exponential growth of the length K. It is necessary to apply a searching method which is capable of either searching signal in a high dimensional space or restricting search in a largely reduced space.
SVMotif is a mechine learning based motif finder. It can be classified into K-mer enumeration based methods. the evaluation of each possible K-mer is done by Recursive SVM feature selection [R-SVM].