I am a mathematical biologist who develops mathematically sound approaches to analyzing high-throughput DNA sequencing data. To do this, I use and extend techniques from probability, compressed sensing, and optimization, with a particular focus on methods for genomic and metagenomic data.
After introducing the notion of a random substitution Markov chain, we relate it to other notions of a "random substitution" and give a complete description of the Martin boundary for a few interesting examples.
We prove that nonnegative least squares (typically prone to over-fitting) can be slightly modified to return sparse results.
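As a rough illustration of how a small change to nonnegative least squares can encourage sparsity, here is one possible formulation. It is an assumption made for this sketch, not necessarily the exact modification analyzed in the paper, and the symbols $A$ (measurement matrix), $b$ (observations), and $\lambda$ (weight) are introduced only for illustration. Because $\|x\|_1 = \mathbf{1}^\top x$ whenever $x \ge 0$, a squared $\ell_1$ penalty can be folded into an ordinary nonnegative least squares problem by appending a single row:

```latex
\begin{aligned}
\min_{x \ge 0} \; \|Ax - b\|_2^2 + \lambda^2 \|x\|_1^2
  &= \min_{x \ge 0} \; \|Ax - b\|_2^2 + \bigl(\lambda \mathbf{1}^\top x - 0\bigr)^2 \\
  &= \min_{x \ge 0} \;
     \left\|
       \begin{pmatrix} A \\ \lambda \mathbf{1}^\top \end{pmatrix} x
       - \begin{pmatrix} b \\ 0 \end{pmatrix}
     \right\|_2^2 .
\end{aligned}
```

Under this assumed formulation, any standard nonnegative least squares solver can be applied unchanged to the augmented system, with larger $\lambda$ putting more weight on a small $\ell_1$ norm.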
Extending the results of Quikr to whole-genome shotgun metagenomic samples, we develop a method to automatically select a parameter that balances sparsity (how succinct the result is) with accuracy.
We review several entropy- and randomness-based techniques that are useful across a range of data mining applications.
We present the idea of using the "earth mover's distance" (also known as the first Wasserstein metric) to measure the distance between samples of DNA. This reduces to finding the most efficient way to transform one graph into another, where each graph is a de Bruijn graph built from a sample.
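The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published method: each sample is reduced to its k-mer frequencies, the ground distance between two k-mers is taken to be the shortest-path length in the undirected de Bruijn graph on all k-mers, and the transport problem is solved with a small successive-shortest-path routine. All function names here (`kmer_counts`, `de_bruijn_distances`, `min_cost_transport`, `emd_de_bruijn`) are hypothetical.

```python
from collections import Counter, deque
from itertools import product


def kmer_counts(seq, k):
    """Count every length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def de_bruijn_distances(k, alphabet="ACGT"):
    """All-pairs shortest paths in the undirected de Bruijn graph (assumed ground metric)."""
    dist = {}
    for start in ("".join(p) for p in product(alphabet, repeat=k)):
        seen, queue = {start: 0}, deque([start])
        while queue:
            u = queue.popleft()
            for c in alphabet:
                # Follow an edge forwards (shift left) or backwards (shift right).
                for v in (u[1:] + c, c + u[:-1]):
                    if v not in seen:
                        seen[v] = seen[u] + 1
                        queue.append(v)
        dist[start] = seen
    return dist


def min_cost_transport(supply, demand, cost):
    """Minimum total cost of moving integer supply onto integer demand (equal sums)."""
    n, m, total = len(supply), len(demand), sum(supply)
    source, sink, size = n + m, n + m + 1, n + m + 2
    graph = [[] for _ in range(size)]  # edge = [to, capacity, cost, index of reverse edge]

    def add_edge(u, v, cap, w):
        graph[u].append([v, cap, w, len(graph[v])])
        graph[v].append([u, 0, -w, len(graph[u]) - 1])

    for i, s in enumerate(supply):
        add_edge(source, i, s, 0)
    for j, d in enumerate(demand):
        add_edge(n + j, sink, d, 0)
    for i in range(n):
        for j in range(m):
            add_edge(i, n + j, total, cost[i][j])

    total_cost = 0
    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [float("inf")] * size
        best[source], prev, changed = 0, [None] * size, True
        while changed:
            changed = False
            for u in range(size):
                if best[u] == float("inf"):
                    continue
                for idx, (v, cap, w, _) in enumerate(graph[u]):
                    if cap > 0 and best[u] + w < best[v]:
                        best[v], prev[v], changed = best[u] + w, (u, idx), True
        if best[sink] == float("inf"):
            return total_cost
        # Find the bottleneck capacity along the path, then push that much flow.
        flow, v = float("inf"), sink
        while v != source:
            u, idx = prev[v]
            flow, v = min(flow, graph[u][idx][1]), u
        v = sink
        while v != source:
            u, idx = prev[v]
            graph[u][idx][1] -= flow
            graph[v][graph[u][idx][3]][1] += flow
            v = u
        total_cost += flow * best[sink]


def emd_de_bruijn(seq_a, seq_b, k=2):
    """Earth mover's distance between the k-mer frequency profiles of two sequences."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both sequences must contain at least one k-mer")
    kmers_a, kmers_b = sorted(counts_a), sorted(counts_b)
    # Scale counts so both sides sum to total_a * total_b and stay integers.
    supply = [counts_a[u] * total_b for u in kmers_a]
    demand = [counts_b[v] * total_a for v in kmers_b]
    dist = de_bruijn_distances(k)
    cost = [[dist[u][v] for v in kmers_b] for u in kmers_a]
    return min_cost_transport(supply, demand, cost) / (total_a * total_b)
```

For example, `emd_de_bruijn("AAA", "CCC", k=2)` moves all of the mass from the k-mer `AA` to `CC`, which are two steps apart in the graph. The exhaustive all-pairs search is only practical for small k; it is used here to keep the sketch short and exact.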