I am a mathematical biologist interested in developing mathematically sound approaches to the analysis of high-throughput DNA sequencing data. To do this, I use and develop techniques from probability, compressed sensing, and optimization. I am particularly interested in developing methods to analyze genomic and metagenomic data.
We introduce a fast, lightweight "big data" algorithm that answers the question "which bacteria are present?" for a given sample of DNA. The method is based on the theory of compressed sensing and aims to find the simplest explanation for the data in terms of known information.
We demonstrate that a concept of "weighted information content" (known as topological pressure in the ergodic theory literature) can be used to facilitate the analysis of genomic data, in particular to find regions of a genome that contain many genes. This is a conceptual extension of the topological entropy approach presented earlier.
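For orientation, the standard ergodic-theory definition of topological pressure for a subshift can be written as below. This is the textbook form, not necessarily the finite-sequence variant used in the paper. The symbols are assumptions introduced here: $X$ is a subshift with shift map $\sigma$, $\mathcal{L}_n(X)$ is its set of allowed words of length $n$, $[w]$ is the cylinder set of $w$, and $\varphi$ is a continuous weighting function (potential).

```latex
P(\varphi) \;=\; \lim_{n \to \infty} \frac{1}{n}
  \log \sum_{w \in \mathcal{L}_n(X)}
  \exp\!\Bigl( \sup_{x \in [w]} \sum_{i=0}^{n-1} \varphi(\sigma^{i} x) \Bigr)
```

Setting $\varphi \equiv 0$ reduces the sum to a count of allowed words, so $P(0)$ is the topological entropy. In that sense pressure is an entropy in which each word is weighted by $\varphi$.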
In this paper, we improve both the accuracy and speed of the Quikr approach to classifying a given set of metagenomic DNA sequences (16S rRNA). This is accomplished by increasing the number of "feature vectors" we use for each training genome, and by modifying the Lawson-Hanson algorithm for non-negative least squares.
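The general idea of explaining a sample as a non-negative mixture of training-genome feature vectors can be sketched as follows. This is a minimal illustration, not the Quikr implementation. It assumes that the feature vectors are normalized k-mer frequencies, and it swaps in plain projected gradient descent for the (modified) Lawson-Hanson solver. The function names, toy sequences, and parameters are all hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing non-ACGT characters
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def nnls_projected_gradient(columns, b, iterations=2000):
    """Approximately minimize ||A x - b||^2 subject to x >= 0.

    `columns` holds the columns of A (one feature vector per genome).
    Stand-in for Lawson-Hanson: projected gradient descent.
    """
    n, m = len(columns), len(b)
    # Gram matrix G = A^T A and right-hand side h = A^T b.
    G = [[sum(ci[r] * cj[r] for r in range(m)) for cj in columns]
         for ci in columns]
    h = [sum(c[r] * b[r] for r in range(m)) for c in columns]
    # trace(G) bounds the largest eigenvalue of G, so this step is safe.
    step = 1.0 / sum(G[i][i] for i in range(n))
    x = [0.0] * n
    for _ in range(iterations):
        grad = [sum(G[i][j] * x[j] for j in range(n)) - h[i]
                for i in range(n)]
        # Gradient step followed by projection onto x >= 0.
        x = [max(0.0, x[i] - step * grad[i]) for i in range(n)]
    return x


# Toy "training genomes" (hypothetical sequences, for illustration only).
genomes = ["AAAAAAAACCCCCCCC", "ACGTACGTACGTACGT", "GGGGTTTTGGGGTTTT"]
features = [kmer_frequencies(g) for g in genomes]

# Simulated sample: 70% genome 0 and 30% genome 1, mixed in feature space.
sample = [0.7 * a + 0.3 * c for a, c in zip(features[0], features[1])]
abundances = nnls_projected_gradient(features, sample)
print([round(a, 3) for a in abundances])
```

The non-negativity constraint is what makes the result readable as abundances: a genome that does not help explain the sample is driven to zero rather than to a negative weight.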
I created this course in 2014, and it continues to run, typically in the Fall and Spring. It introduces students to Mathematica, Matlab, and LaTeX. In the future, I will be incorporating modules on Python and/or Julia.
After introducing the notion of a random substitution Markov chain, we relate it to other notions of a "random substitution" and give a complete description of the Martin boundary for a few interesting examples.