I am a mathematical biologist who develops mathematically sound approaches to the analysis of high-throughput DNA sequencing data. To do this, I use and extend techniques from probability, compressed sensing, and optimization, with a particular focus on methods for analyzing genomic and metagenomic data.
I define a new notion of "randomness" (called topological entropy) suitable for use on sequences of symbols (words) of finite length. I show that it can be used to distinguish between biologically interesting sequences in the human genome.
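To give a flavor of the idea, the sketch below scores a finite DNA word by counting its distinct length-n subwords and taking a base-4 logarithm, which is the kind of entropy-style quantity described above. It is only a minimal illustration: the function name `subword_complexity_score`, the rule for choosing n, and the truncation of the word are assumptions made for this example and should not be read as the exact definition used in the paper.

```python
import math


def subword_complexity_score(seq, alphabet_size=4):
    """Entropy-style score in [0, 1] for a finite word (illustrative only).

    Hypothetical helper: picks the largest subword length n for which a word
    of this length could contain every possible n-letter subword once, then
    measures how many distinct n-letter subwords actually occur.
    """
    length = len(seq)

    # Largest n with alphabet_size**n + n - 1 <= length (assumed rule).
    n = 0
    while alphabet_size ** (n + 1) + (n + 1) - 1 <= length:
        n += 1
    if n == 0:
        raise ValueError("sequence is too short to score")

    # Truncate so that words of different lengths are scored comparably.
    window = seq[: alphabet_size ** n + n - 1]

    # Count the distinct subwords of length n in the truncated word.
    distinct = {window[i : i + n] for i in range(len(window) - n + 1)}

    # Normalize: 1.0 means every possible n-letter subword appears.
    return math.log(len(distinct), alphabet_size) / n


if __name__ == "__main__":
    repetitive = "A" * 19
    varied = "AACAGATCCGCTGGTTA"  # contains all 16 dinucleotides
    print(subword_complexity_score(repetitive))
    print(subword_complexity_score(varied))
```

A highly repetitive word scores at the bottom of the range and a word that exhausts all possible subwords scores at the top, which is the sense in which the score behaves like a measure of randomness.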
We introduce an extremely fast, lightweight "big data" algorithm that answers the question "which bacteria are present?" for a given sample of DNA. The method is based on the theory of compressed sensing and aims to find the simplest explanation for the data in terms of known information.
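As a rough sketch of what "explaining the data in terms of known information" can look like, the example below represents each known reference sequence by its k-mer frequency vector and then looks for non-negative weights whose mixture reproduces the sample's k-mer frequencies. Everything here is an assumption for illustration: the helper names, the toy sequences, the choice k = 2, and the plain projected-gradient solver, which stands in for the actual optimization and does not include the sparsity-promoting step of a compressed sensing method.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Normalized k-mer count vector of seq, in lexicographic k-mer order."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i : i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def nonnegative_least_squares(columns, target, steps=5000):
    """Approximately minimize ||A x - target||^2 subject to x >= 0.

    `columns` holds the columns of A. Uses projected gradient descent,
    a simple stand-in solver chosen for brevity (an assumption).
    """
    m, d = len(columns), len(target)
    # The sum of squared entries bounds the largest eigenvalue of A^T A,
    # so its reciprocal is a safe step size.
    step = 1.0 / sum(v * v for col in columns for v in col)
    x = [0.0] * m
    for _ in range(steps):
        residual = [
            sum(columns[j][i] * x[j] for j in range(m)) - target[i]
            for i in range(d)
        ]
        for j in range(m):
            grad = sum(columns[j][i] * residual[i] for i in range(d))
            x[j] = max(0.0, x[j] - step * grad)  # project onto x >= 0
    return x


if __name__ == "__main__":
    # Hypothetical "known" reference sequences (toy data, not real genomes).
    references = ["ACACACACAC", "GTGTGTGTGT", "AATTAATTAATT"]
    columns = [kmer_frequencies(r) for r in references]

    # Simulated sample: 70% of the first reference, 30% of the third.
    sample = [0.7 * a + 0.3 * c for a, c in zip(columns[0], columns[2])]

    print(nonnegative_least_squares(columns, sample))
```

The recovered weights play the role of estimated relative abundances: references that do not contribute to the sample receive a weight of zero.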
We demonstrate that a concept of "weighted information content" (known as topological pressure, from the ergodic theory literature) can be used to facilitate the analysis of genomic data, in particular to find regions of a genome that contain many genes. This is a conceptual extension of the topological entropy approach presented earlier.
In this paper, we improve both the accuracy and speed of the Quikr approach to classifying a given set of metagenomic DNA sequences (16S rRNA). This is accomplished by increasing the number of "feature vectors" we use for each training genome, and by modifying the Lawson-Hanson algorithm for non-negative least squares.
I created this course in 2014 to introduce students to Mathematica, Matlab, and LaTeX; it continues to run, typically in the Fall and Spring. In the future, I plan to incorporate modules on Python and/or Julia.