• RCV1

  • Referenced in 109 articles [sw07279]
  • Benchmark Collection for Text Categorization Research. Reuters Corpus Volume I (RCV1) is an archive...
• Penn Treebank

  • Referenced in 100 articles [sw08023]
  • speech tags, and for the Switchboard corpus of telephone conversations, dysfluency annotation. We are located...
• word2vec

  • Referenced in 98 articles [sw14978]
  • research. The word2vec tool takes a text corpus as input and produces the word vectors...
• MML

  • Referenced in 47 articles [sw06970]
  • Mizar Mathematical Library (MML) is a large corpus of formalised mathematical knowledge. It has been...
• GloVe

  • Referenced in 46 articles [sw26211]
  • word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear...
• DARPA TIMIT

  • Referenced in 25 articles [sw36451]
  • DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus ... speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word...
• MPTP

  • Referenced in 23 articles [sw02489]
  • MPTP system, which makes the largest existing corpus of formalized mathematics available to theorem provers...
• TreeTagger

  • Referenced in 23 articles [sw07976]
  • lexicon and a manually tagged training corpus are available...
• GENIA corpus

  • Referenced in 11 articles [sw35527]
  • GENIA corpus - a semantically annotated corpus for bio-textmining. Motivation: Natural language processing (NLP) methods ... literature. The lack of an extensively annotated corpus of this literature, however, causes a major ... bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials ... techniques work for bio-textmining. Results: GENIA corpus version 3.0 consisting of 2000 MEDLINE abstracts...
• IAM

  • Referenced in 12 articles [sw08585]
  • based on the Lancaster-Oslo/Bergen (LOB) corpus. This corpus is a collection of texts ... automatically derived from the underlying corpus. The database also includes a few image-processing procedures...
• gensim

  • Referenced in 12 articles [sw04081]
  • word statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised ... input is necessary – you only need a corpus of plain text documents. Once these statistical...
• BliStr

  • Referenced in 15 articles [sw16818]
  • problems created from the Flyspeck corpus...
• Praat

  • Referenced in 10 articles [sw06142]
  • downloaded from praat.org. A speech corpus typically consists of a set of sound files, each ... multiple sound and annotation files across the corpus. Corpuswide acoustic analyses, leading to tables ready...
• Medlda

  • Referenced in 11 articles [sw11723]
  • more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy...
• Canterbury

  • Referenced in 11 articles [sw17938]
  • Canterbury Corpus is a benchmark to enable researchers to evaluate lossless compression methods. This site...
• MBT

  • Referenced in 6 articles [sw08004]
  • learning approaches are useful when a tagged corpus is available as an example ... tagger. Based on such a corpus, the tagger-generator automatically builds a tagger which ... approach include (i) the relatively small tagged corpus size sufficient for training, (ii) incremental learning...
• Senseval

  • Referenced in 10 articles [sw29525]
  • Senseval corpus: There are now many computer programs for automatically determining the sense...
• DeepMath

  • Referenced in 8 articles [sw27551]
  • premise selection task on the Mizar corpus while avoiding the hand-engineered features of existing...
• SemCor

  • Referenced in 4 articles [sw29524]
  • SemCor corpus is an English corpus with semantically annotated texts. The semantic analysis was done ... WordNet 3.0 (SemCor version 3.0). The SemCor corpus consists of 352 texts from the Brown corpus...
• GENETAG

  • Referenced in 5 articles [sw35526]
  • GENETAG: a tagged corpus for gene/protein named entity recognition. Results: To ensure heterogeneity ... corpus, MEDLINE sentences were first scored for term similarity to documents with known gene names...