Fast approximation of frequent k-mers and applications to metagenomics. In this work, we develop, analyze, and test a sampling-based approach, called SAKEIMA, to approximate the frequent k-mers and their frequencies in a high-throughput sequencing dataset while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme, and we show how characterizing the VC dimension (a core concept from statistical learning theory) of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA rigorously approximates the frequent k-mers by processing only a fraction of a dataset, and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing datasets. Overall, SAKEIMA is an efficient and rigorous tool for estimating k-mer abundances, providing significant speed-ups in the analysis of large sequencing datasets.
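
To make the general idea of sampling-based frequency estimation concrete, the following is a minimal sketch, not SAKEIMA itself. It assumes plain uniform sampling with replacement over k-mer start positions, whereas the abstract describes a more advanced sampling scheme; it also takes the sample size as a free parameter rather than deriving it from a VC-dimension bound. The function names `exact_kmer_frequencies` and `sampled_kmer_frequencies` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def exact_kmer_frequencies(reads, k):
    """Exact relative frequency of every k-mer over all positions in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def sampled_kmer_frequencies(reads, k, sample_size, seed=0):
    """Estimate k-mer frequencies from uniformly sampled positions.

    Assumption: simple uniform sampling with replacement, used here only
    as an illustration; this is not the scheme used by SAKEIMA.
    """
    rng = random.Random(seed)
    # Enumerate every valid (read index, start offset) pair once.
    positions = [
        (r, i)
        for r, read in enumerate(reads)
        for i in range(len(read) - k + 1)
    ]
    counts = Counter()
    for _ in range(sample_size):
        r, i = rng.choice(positions)
        counts[reads[r][i:i + k]] += 1
    return {kmer: c / sample_size for kmer, c in counts.items()}


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTTTTT"]
    exact = exact_kmer_frequencies(reads, k=3)
    approx = sampled_kmer_frequencies(reads, k=3, sample_size=200)
    for kmer in sorted(exact):
        print(kmer, round(exact[kmer], 3), round(approx.get(kmer, 0.0), 3))
```

In this sketch the estimate for each k-mer is simply its share of the sampled positions, so a larger `sample_size` tightens the approximation at the cost of more work. Enumerating all positions up front is done only for simplicity and would defeat the purpose on a real dataset.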

References in zbMATH (referenced in 1 article, 1 standard article)

  1. Pellegrina, Leonardo; Pizzi, Cinzia; Vandin, Fabio: Fast approximation of frequent k-mers and applications to metagenomics (2019)