Coding sequence density estimation via topological pressure

We demonstrate that a notion of "weighted information content" (known in the ergodic theory literature as topological pressure) can be used to analyze genomic data, in particular to find regions of a genome that are dense in genes. This is a conceptual extension of the topological entropy approach presented earlier.
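
As a rough illustration of the "weighted information content" idea, the sketch below computes a simplified pressure-like score for a DNA string: the base-4 logarithm of a weighted count of the distinct length-n subwords, divided by n. The function name `pressure`, the per-letter `weights` dictionary, and the product form of the subword weight are assumptions made for illustration and are not taken from the paper. With all weights equal to 1, the score reduces to a plain count of distinct subwords, which is the entropy-like special case.

```python
import math


def pressure(seq, n, weights):
    """Simplified, hypothetical pressure-like score of `seq` at word length n.

    `weights` maps each letter to a positive number; a subword's weight is
    assumed (for illustration only) to be the product of its letter weights.
    """
    if len(seq) < n or n < 1:
        raise ValueError("sequence must contain at least one subword of length n")
    # Each distinct subword is counted once, however often it occurs.
    subwords = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    total = 0.0
    for word in subwords:
        w = 1.0
        for letter in word:
            w *= weights[letter]
        total += w
    return math.log(total, 4) / n


# With uniform weights, a string containing all 16 dinucleotides scores 1.
uniform = {"A": 1.0, "C": 1.0, "G": 1.0, "T": 1.0}
print(pressure("AACAGATCCGCTGGTTA", 2, uniform))
```

Raising the weight of letters that are over-represented in coding regions would raise the score of windows rich in those letters, which is the intuition behind using such a quantity to flag gene-dense regions.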

SEK: sparsity exploiting k-mer-based estimation of bacterial community composition

In this paper, we improve both the accuracy and speed of the Quikr approach to classifying a given set of metagenomic DNA sequences (16S rRNA). This is accomplished by increasing the number of "feature vectors" we use for each training genome, and by modifying the Lawson-Hanson algorithm for non-negative least squares.
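
To make the notion of several "feature vectors" per training genome concrete, the sketch below computes k-mer frequency vectors for overlapping fragments of one sequence. The helper names `kmer_frequencies` and `fragment_features`, and the fragment length and step parameters, are hypothetical choices for illustration; the paper's actual fragmenting scheme and parameter values are not given here.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return the k-mer frequency vector of `seq`, ordered lexicographically over ACGT."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing ambiguous bases such as N
            counts[index[kmer]] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(index)


def fragment_features(seq, k, frag_len, step):
    """One k-mer frequency vector per fragment of length `frag_len` (hypothetical scheme)."""
    return [
        kmer_frequencies(seq[start:start + frag_len], k)
        for start in range(0, len(seq) - frag_len + 1, step)
    ]


# Two non-overlapping fragments give two feature vectors for the same sequence.
features = fragment_features("ACGTACGT", k=1, frag_len=4, step=4)
print(len(features), features[0])
```

Stacking these vectors as columns, with each column labeled by its source genome, would give a wider training matrix than a single vector per genome.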

Sparse recovery by means of nonnegative least squares

We prove that nonnegative least squares (typically prone to over-fitting) can be slightly modified to return sparse results.
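
A minimal sketch of what such a modification can look like, assuming the objective is (sum of x)^2 + lam^2 * ||A x - y||^2 over x >= 0. This objective is an assumption for illustration, not something confirmed by the summary above. Under it, the problem is an ordinary nonnegative least squares problem on an augmented system: a row of ones is stacked on top of lam * A, and a zero on top of lam * y. The solver below is a plain projected gradient loop, used as a stand-in because it needs only the standard library; it is not the Lawson-Hanson algorithm.

```python
def nnls_projected_gradient(M, c, iters=2000):
    """Approximately minimize 0.5 * ||M x - c||^2 subject to x >= 0."""
    rows, cols = len(M), len(M[0])
    # The squared Frobenius norm bounds the largest eigenvalue of M^T M,
    # so 1 / lipschitz is a safe (if conservative) step size.
    lipschitz = sum(v * v for row in M for v in row)
    x = [0.0] * cols
    if lipschitz == 0.0:
        return x
    for _ in range(iters):
        residual = [sum(M[i][j] * x[j] for j in range(cols)) - c[i] for i in range(rows)]
        grad = [sum(M[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
        # Gradient step followed by projection onto the nonnegative orthant.
        x = [max(0.0, x[j] - grad[j] / lipschitz) for j in range(cols)]
    return x


def sparse_nnls(A, y, lam):
    """Solve the augmented system [ones; lam*A] x ~ [0; lam*y] with x >= 0 (assumed form)."""
    cols = len(A[0])
    M = [[1.0] * cols] + [[lam * v for v in row] for row in A]
    c = [0.0] + [lam * v for v in y]
    return nnls_projected_gradient(M, c)


x = sparse_nnls([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], lam=10.0)
print(x)
```

In this assumed form, a larger `lam` puts more weight on fitting y, and a smaller `lam` puts more weight on keeping the sum of the entries small.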

WGSQuikr: fast whole-genome shotgun metagenomic classification

Extending the results of Quikr to whole-genome shotgun metagenomic samples, we develop a method that automatically selects a parameter balancing sparsity (how succinct the result is) against accuracy.

On entropy-based data mining

We review several entropy- and randomness-based techniques that are useful across a range of data mining applications.