Stochastic Context-Free L-Grammar DNA Modeling Machine Learning Data Mining Bioinformatics
Institute of Information Theories and Applications FOI ITHEA
Formal grammars can be used for describing complex repeatable structures such as DNA sequences. In
this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar.
L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant
development, and model the morphology of a variety of organisms. We believe that parallel grammars can also
be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory
DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for
successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species,
but there are many exceptions, which makes promoter recognition a complex problem. We replace the
problem of promoter recognition by induction of context-free stochastic L-grammar rules, which are later used for
the structural analysis of promoter sequences. L-grammar rules are derived automatically from the Drosophila and
vertebrate promoter datasets using a genetic programming technique, and their fitness is evaluated using a
Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived
L-grammar rules are analyzed and compared with natural promoter sequences.
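
To make the idea of a context-free stochastic L-grammar concrete, the following is a minimal sketch of parallel rewriting over the DNA alphabet. The rule set, the probabilities, and the names RULES, rewrite_step, and derive are hypothetical and chosen only for illustration; they are not the rules derived in this paper, and the genetic programming search and the SVM fitness evaluation are not shown.

```python
import random

# Hypothetical stochastic productions (illustration only, not the paper's rules).
# Each nucleotide maps to a list of (probability, replacement) alternatives.
RULES = {
    "A": [(0.6, "AT"), (0.4, "A")],
    "T": [(0.5, "TA"), (0.5, "G")],
    "G": [(1.0, "GC")],
    # "C" has no production, so it is copied unchanged.
}


def rewrite_step(sequence, rules, rng):
    """Apply one parallel derivation step: every symbol is rewritten at once."""
    out = []
    for symbol in sequence:
        alternatives = rules.get(symbol)
        if alternatives is None:
            # No production for this symbol: keep it as is.
            out.append(symbol)
            continue
        weights = [prob for prob, _ in alternatives]
        choices = [repl for _, repl in alternatives]
        # Context-free: the choice depends only on the symbol, not its neighbours.
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return "".join(out)


def derive(axiom, rules, steps, seed=0):
    """Rewrite the axiom for a fixed number of steps with a seeded generator."""
    rng = random.Random(seed)
    sequence = axiom
    for _ in range(steps):
        sequence = rewrite_step(sequence, rules, rng)
    return sequence


if __name__ == "__main__":
    print(derive("TATA", RULES, steps=3, seed=1))
```

Unlike a sequential grammar, which rewrites one symbol per step, every symbol of the string is rewritten simultaneously in each step; this is the property that makes L-grammars parallel grammars.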