This is my PhD thesis from Penn State (advised by Manfred Denker).
This dissertation defines, develops, and applies the theory of substitution Markov chains, thereby making rigorous the intuitive concept of a "random substitution"' (repeatedly replacing subletters/subwords with different, randomly chosen letters/words). A substitution Markov chain (abbreviated SMC) is special class of countable-state Markov chain. Several contributions are contained in this dissertation: first, an SMC is defined and then frequencies of subwords are shown to converge almost surely. I then investigate the boundary theory for this class of Markov chains (which is fundamentally different than previously considered classes of countable state Markov chains) and calculate two examples of the Martin boundary. I then present a technique for modifying an SMC that transforms it into a more classical, irreducible Markov chain, and verify that frequencies of subwords still converge in this case.
I then demonstrate how a particular class of SMC's provide a natural framework to accurately model molecular evolution. Not only can a diverse set of mutational events be modeled with this class of substitution Markov chain, but the theorems regarding convergence of subword frequencies can be utilized to develop an alignment-free method of parameter estimation for this model. It is difficult to overstate the advantage of an alignment-free parameter estimation technique.
The last two chapters of this dissertation represent the initiation of a novel line research concerned with the adaptation of tools from symbolic dynamics and thermodynamic formalism to the study of DNA sequences. Having given a firm foundation to the non-trivial connections between molecular evolution, countable state Markov chains, and symbolic dynamics, I take the perspective of treating a DNA sequence as a concatenation of dynamical systems and then apply the analytic techniques from symbolic dynamics to classify the corresponding biological phenomena. Considered here are topological entropy and pressure. This perspective allows for contributions to be made in diverse genomic analysis problems such as intron/exon classification, a-priori coding sequence density estimation, quantification of inter- and intra-species synonymous codon bias, and unsupervised classification of short reads of DNA as coding or non-coding.