GENIA corpus

GENIA corpus - a semantically annotated corpus for bio-textmining.

Motivation: Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this literature, however, causes a major bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials to let NLP techniques work for bio-textmining.

Results: GENIA corpus version 3.0 consisting of 2000 MEDLINE abstracts has been released with more than 400 000 words and almost 100 000 annotations for biological terms.

Availability: GENIA corpus is freely available at

References in zbMATH (referenced in 11 articles)

  1. Astrakhantsev, N. A.; Fedorenko, D. G.; Turdakov, D. Yu.: Methods for automatic term recognition in domain-specific text collections: A survey (2015)
  2. Xu, Kaiquan; Liao, Stephen Shaoyi; Lau, Raymond Y. K.; Leon Zhao, J.: Effective active learning strategies for the use of large-margin classifiers in semantic annotation: an optimal parameter discovery perspective (2014)
  3. Goulart, Rodrigo Rafael Villarreal: A systematic review of named entity recognition in biomedical texts (2011)
  4. Segura-Bedmar, Isabel; Crespo, Mario; De Pablo-Sánchez, César; Martínez, Paloma: Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents (2010)
  5. Zhang, Shao-Wu; Li, Yao-Jun; Xia, Li; Pan, Quan: Pplook: an automated data mining tool for protein-protein interaction (2010)
  6. Dimililer, Nazife; Varoğlu, Ekrem; Altınçay, Hakan: Classifier subset selection for biomedical named entity recognition (2009)
  7. Kabiljo, Renata; Clegg, Andrew B.; Shepherd, Adrian J.: A realistic assessment of methods for extracting gene/protein interactions from free text (2009)
  8. McIntosh, Tara; Curran, James R.: Challenges for automatically extracting molecular interactions from full-text articles (2009)
  9. Wang, Yue; Kim, Jin-Dong; Sætre, Rune; Pyysalo, Sampo; Tsujii, Jun-Ichi: Investigating heterogeneous protein annotations toward cross-corpora utilization (2009)
  10. Kim, Jin-Dong; Ohta, Tomoko; Tsujii, Jun’ichi: Corpus annotation for mining biomedical events from literature (2008)
  11. Yang, Zhihao; Lin, Hongfei; Li, Yanpeng: Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature (2008)