Here you will find a reverse-chronological list of all my publications. Abstracts, summaries, and other information can be found by clicking on the title of each publication.

  1. Fernando Meyer, Andreas Bremges, Peter Belmann, Stefan Janssen, Alice C. McHardy†... , Genome Biology , Monday, March 4, 2019

    We introduce a framework to compare tools used to determine which microbes are present in a sample, and at what relative abundance. This will help computational biologists design better tools to analyze communities of microorganisms (which affect nearly everything in existence!).

  2. The Biomedical Data Translator Consortium , Clinical and Translational Science , Friday, November 9, 2018

    The vision and high-level overview of the NIH National Center for Advancing Translational Science (NCATS) project entitled "Translator" (through which we have been funded). The goal of the project is essentially to build a biomedical "Siri": an automated platform for answering biomedical research questions that leverages repositories of publicly available information.

  3. The Biomedical Data Translator Consortium , Clinical and Translational Science , Friday, November 9, 2018

    A description of the NIH NCATS culture that emerged during the Translator project.

  4. Nathan LaPierre, Serghei Mangul, Mohammed Alswer, Igor Mandric, Nicholas C. Wu,... , BMC Bioinformatics , Thursday, November 1, 2018

    We show that read mapping, along with a probabilistic assignment of multi-mapped reads, outperforms other computational approaches to identify the presence and relative amount of viral and fungal organisms in a metagenomic sample of microorganismal DNA.

  5. Loes M. Olde Loohuis, Serghei Mangul, Anil P. S. Ori, Guillaume Jospin, David... , Nature, Translational Psychiatry , Friday, May 11, 2018

    Along with collaborators at UCLA, we were able to detect a small but significant amount of microbes in blood. This is surprising, since it's typically assumed that the immune system removes any microbial presence from human blood. I used a reference-free microbial community algorithm, called EMDeBruijn, to help corroborate the patterns we saw, which included an increase in microbial diversity in schizophrenia patients. EMDeBruijn is a metric based on the Wasserstein metric (aka the Earth Mover's Distance) and a de Bruijn graph induced by the k-mers in a metagenomic DNA sample.

  6. Hooman Zabeti and David Koslicki , Submitted to the Journal of Applied Probability , Tuesday, July 4, 2017

    This work improves upon the so-called "min hash" technique (a "probabilistic data analysis" method) to develop a very fast and efficient way to estimate the similarity of two sets of objects (in terms of how much they overlap). The approach we present is orders of magnitude faster (and uses orders of magnitude less space) when the two data sets under consideration are of very different sizes. The kinds of sets we consider are sets of sub-strings (called k-mers) of DNA sequences from communities of microorganisms.

  7. Mitra Ansariola, Molly Megraw, David Koslicki , Under review , Sunday, July 2, 2017

    A gene regulatory network is basically a representation of how genes interact with each other. In this work, we develop the only (to date) method to assess the accuracy of so-called "motif discovery algorithms" that seek to find important sub-networks of a given gene regulatory network. We develop a provably correct mathematical approach (based on a variety of metrics that say how close two matrices are to each other) and use it to assess the performance of a variety of motif discovery algorithms.

  8. Jason McClelland and David Koslicki , Journal of Mathematical Biology , Sunday, January 1, 2017

    We rapidly answer the question “why are these data sets different?” by leveraging hierarchical/relatedness information. In short, we develop an algorithm to quickly compute the UniFrac distance using the earth mover's distance, prove its correctness, and derive time and space complexity characterizations.

  9. Alexander Sczyrba, Peter Hofmann, Peter Belmann, David Koslicki, Stefan Janssen,... , Accepted to Nature Methods , Sunday, January 1, 2017

    In a very reproducible fashion, we assess a wide variety of computational techniques in metagenomics, including assembly (putting together pieces of genomes, called contigs, from short reads), binning (figuring out where the contigs came from), and taxonomic profiling (determining which organisms are present in a sample and at what relative amount).

  10. David Koslicki and Manfred Denker , Rocky Mountain Journal of Mathematics , Tuesday, November 1, 2016

    After introducing the notion of a random substitution Markov chain, we relate it to other notions of a "random substitution" and give a complete description of the Martin boundary for a few interesting examples.

  11. David Koslicki and Mark Novak , Accepted to The Journal of Mathematical Biology (DOI: 10.1007/s00285-017-1163-0) , Saturday, October 1, 2016

    In a network of interacting quantities (such as a food web), we examine how qualitative and quantitative predictions change when a quantity (such as the abundance of an organism or a set of organisms) is increased. This is quantified in terms of which model parameters cause the largest change in predictions.

  12. David Koslicki and Serghei Mangul , ACM-BCB , Tuesday, May 24, 2016

    We present the idea of using the "earth mover's distance" (aka the first Wasserstein metric) to measure the distance between samples of DNA. This reduces to finding the most efficient way to transform one graph (of a kind known as a de Bruijn graph) into another.

  13. David Koslicki and Daniel Falush , mSystems , Wednesday, February 17, 2016

    We present a computational technique that answers the question "Which organisms are present in a given sample of DNA from a microbial community, and at what relative amount?" while simultaneously predicting the relatedness of novel (never-before-seen) organisms to known organisms. The method uses a mathematical technique referred to as sparsity-promoting optimization, together with a technique similar to the Jaccard index.

  14. David Koslicki, Saikat Chatterjee, Damon Shahrivar, Alan W. Walker, Suzanna C.... , PLoS ONE , Friday, October 23, 2015

    We develop a pre-processing step that significantly improves k-mer based metagenomic profiling techniques.

  15. Saikat Chatterjee, David Koslicki, Siyuan Dong, Nicolas Innocenti, Lu Cheng, Yueheng... , Oxford Journal of Bioinformatics , Wednesday, May 7, 2014

    In this paper, we improve both the accuracy and speed of the Quikr approach to classifying a given set of metagenomic DNA sequences (16S rRNA). This is accomplished by increasing the number of "feature vectors" we use for each training genome, and by modifying the Lawson-Hanson algorithm for non-negative least squares.

  16. David Koslicki, Simon Foucart, and Gail Rosen , PLoS ONE , Thursday, March 13, 2014

    Extending the results of Quikr to whole genome shotgun metagenomic samples, we develop a method to automatically select a parameter that balances sparsity (how succinct the result is) with accuracy.

  17. Simon Foucart and David Koslicki , IEEE Signal Processing Letters , Thursday, February 27, 2014

    We prove that nonnegative least squares (typically prone to over-fitting) can be slightly modified to return sparse results.

  18. David Koslicki and Daniel J. Thompson , Journal of Mathematical Biology , Wednesday, January 22, 2014

    We demonstrate that a concept of "weighted information content" (known as topological pressure, from the ergodic theory literature) can be used to facilitate the analysis of genomic data (in particular, to find areas of a genome that have many genes in them). This is a conceptual extension of the topological entropy approach presented earlier.

  19. Andreas Holzinger, Matthias Hörtenhuber, Christopher Mayer, Martin Bachler,... , Chapter, Interactive Knowledge Discovery and Data Mining in Biomedical Informatics , Wednesday, January 1, 2014

    We review a variety of entropy/randomness-based techniques that are useful in many data mining applications.

  20. David Koslicki, Simon Foucart, and Gail Rosen , Oxford Journal of Bioinformatics , Thursday, June 20, 2013

    We introduce an extremely fast, light-weight, "big data" algorithm to quickly answer the question of "which bacteria are present?" in a given sample of DNA. The method is based on the theory of compressed sensing and aims to find the simplest explanation for the data in terms of known information.

  21. David Koslicki , Penn State University , Tuesday, May 1, 2012

    This is my PhD thesis from Penn State (advised by Manfred Denker).

  22. David Koslicki , Oxford Journal of Bioinformatics , Monday, February 21, 2011

    I define a new notion of "randomness" (called topological entropy) suitable for use on sequences of symbols (words) of finite length. I show that this can be used to distinguish between biologically interesting sequences in the human genome.
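
For readers curious about the "min hash" idea mentioned in item 6, here is a minimal sketch of the classic technique (not the improved method from that paper). It estimates how much two sets of k-mers overlap (their Jaccard index) by comparing the smallest hash value of each set under many different hash functions. All function names, the choice of BLAKE2b as the hash, and the parameter values are assumptions made for illustration only.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(item, seed):
    """Hash a string to a 64-bit integer; the seed selects the hash function."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each of num_hashes hash functions, keep the minimum hash over the set."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate the Jaccard index as the fraction of positions where minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B|, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two short, made-up DNA strings used purely as toy input.
    a = kmers("ACGTACGTTGCAACGTAGCTAGCTAGGATCCA", 4)
    b = kmers("ACGTACGTTGCAACGTAGCTAGCTTTTTTTTT", 4)
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The signatures have a fixed length no matter how large the sets are, which is what makes the comparison cheap. The estimate becomes less reliable when one set is much larger than the other, which is the situation the work in item 6 addresses.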