Counting regular expressions in degenerated sequences through lazy Markov chain embedding. Nowadays, Next- Generation Sequencing (NGS) produces huge number of reads which are combined using multiple alignment techniques to produce sequences. During this process, many sequencing errors are corrected, but the resulting sequences nevertheless contain a marginal level of uncertainty in the form of ∼0·1% or less of degenerated positions (like the letter “N” corresponding to any nucleotide). A previous work Nuel (Pattern Recognition in Bioinformatics. Springer, New York, 2009) showed that these degenerated letters might lead to erroneous counts when performing pattern matching on these sequences. An algorithm based on Deterministic Finite Automata (DFA) and Markov Chain Embedding (MCE) was suggested to deal with this problem. In this paper, we introduce a new version of this algorithm which uses Nondeterministic Finite Automata (NFA) rather than DFA to perform what we call “lazy MCE”. This new approach proves itself much faster than the previous one and we illustrate its usefulness on two NGS datasets and a selection of regular expressions. A software implementing this algorithm is available: countmotif, http://www.math-info.univ-paris5.fr/ delosvin/index.php?choix=4.

Keywords for this software

Anything in here will be replaced on browsers that support the canvas element