Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282–283; Bioinformatics, 18, 77–82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools, such as BLAST. Availability:

References in zbMATH (referenced in 15 articles)

  1. Dehzangi, Abdollah; López, Yosvany; Lal, Sunil Pranit; Taherzadeh, Ghazaleh; Michaelson, Jacob; Sattar, Abdul; Tsunoda, Tatsuhiko; Sharma, Alok: PSSM-Suc: accurately predicting succinylation using position specific scoring matrix into bigram for feature extraction (2017)
  2. Keith, Jonathan M. (ed.): Bioinformatics. Volume I. Data, sequence analysis, and evolution (2017)
  3. Pai, Priyadarshini P.; Dash, Tirtharaj; Mondal, Sukanta: Sequence-based discrimination of protein-RNA interacting residues using a probabilistic approach (2017)
  4. Carugo, Oliviero (ed.); Eisenhaber, Frank (ed.): Data mining techniques for the life sciences (2016)
  5. Ali, Farman; Hayat, Maqsood: Classification of membrane protein types using voting feature interval in combination with Chou’s pseudo amino acid composition (2015)
  6. Kumar, Ravindra; Srivastava, Abhishikha; Kumari, Bandana; Kumar, Manish: Prediction of $\beta$-lactamase and its class by Chou’s pseudo-amino acid composition and support vector machine (2015)
  7. Zhao, Xiaowei; Ning, Qiao; Chai, Haiting; Ma, Zhiqiang: Accurate in silico identification of protein succinylation sites using an iterative semi-supervised learning technique (2015)
  8. Feng, Peng-Mian; Ding, Hui; Chen, Wei; Lin, Hao: Naïve Bayes classifier with feature selection to identify phage virion proteins (2013)
  9. Zhao, Xiaowei; Zhang, Jian; Ning, Qiao; Sun, Pingping; Ma, Zhiqiang; Yin, Minghao: Identification of protein pupylation sites using bi-profile Bayes feature extraction and ensemble learning (2013)
  10. Lu, Jin-Long; Hu, Xue-Hai; Hu, Dong-Gang: A new hybrid fractal algorithm for predicting thermophilic nucleotide sequences (2012)
  11. Shi, Shao-Ping; Qiu, Jian-Ding; Sun, Xing-Yu; Suo, Sheng-Bao; Huang, Shu-Yun; Liang, Ru-Ping: A method to distinguish between lysine acetylation and lysine methylation from protein sequences (2012)
  12. Zhang, Yongqing; Zhang, Danling; Mi, Gang; Ma, Daichuan; Li, Gongbing; Guo, Yanzhi; Li, Menglong; Zhu, Min: Using ensemble methods to deal with imbalanced data in predicting protein-protein interactions (2012)
  13. Lin, Hao; Ding, Hui: Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition (2011)
  14. Huang, Ying; Niu, Beifang; Gao, Ying; Fu, Limin; Li, Weizhong: CD-HIT suite: a web server for clustering and comparing biological sequences (2010)
  15. Lu, Lingyi; Qian, Ziliang; Cai, Yu-Dong; Li, Yixue: ECS: an automatic enzyme classifier based on functional domain composition (2007)