RCV1: A New Benchmark Collection for Text Categorization Research

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real-world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection’s properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as corrected versions of the category assignments and taxonomy structures, via online appendices.

This software is also peer reviewed by the journal TOMS.

References in zbMATH (referenced in 69 articles)

Showing results 1 to 20 of 69.
Sorted by year (citations)


  1. Fountoulakis, Kimon; Gondzio, Jacek: A second-order method for strongly convex $\ell _1$-regularization problems (2016)
  2. Huang, Yakui; Liu, Hongwei; Zhou, Sha: An efficient monotone projected Barzilai-Borwein method for nonnegative matrix factorization (2015)
  3. Kuang, Da; Yun, Sangwoon; Park, Haesun: SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering (2015)
  4. Le, Tam; Cuturi, Marco: Adaptive Euclidean maps for histograms: generalized Aitchison embeddings (2015)
  5. Lin, Qihang; Lu, Zhaosong; Xiao, Lin: An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization (2015)
  6. Liu, Ming; Wu, Chong; Chen, Lei: A vector reconstruction based clustering algorithm particularly for large-scale text collection (2015)
  7. Mareček, Jakub; Richtárik, Peter; Takáč, Martin: Distributed block coordinate descent for minimizing partially separable functions (2015)
  8. Chen, Jianhui; Ye, Jieping: Sparse trace norm regularization (2014)
  9. Tian, Yingjie; Ping, Yuan: Large-scale linear nonparallel support vector machine solver (2014)
  10. Crammer, Koby; Gentile, Claudio: Multiclass classification with bandit feedback using adaptive regularization (2013)
  11. Crammer, Koby; Kulesza, Alex; Dredze, Mark: Adaptive regularization of weight vectors (2013)
  12. Gullo, Francesco; Domeniconi, Carlotta; Tagarelli, Andrea: Projective clustering ensembles (2013)
  13. Hensinger, Elena; Flaounas, Ilias; Cristianini, Nello: Modelling and predicting news popularity (2013)
  14. Lansdall-Welfare, Thomas; Flaounas, Ilias; Cristianini, Nello: Automatic annotation of a dynamic corpus by label propagation (2013)
  15. Lee, Sangkyun; Wright, Stephen J.: Stochastic subgradient estimation training for support vector machines (2013)
  16. Ren, Fuji; Sohrab, Mohammad Golam: Class-indexing-based term weighting for automatic text classification (2013)
  17. Guyon, Isabelle; Dror, Gideon; Lemaire, Vincent; Silver, Daniel L.; Taylor, Graham; Aha, David W.: Analysis of the IJCNN 2011 UTL challenge (2012)
  18. Hariharan, Bharath; Vishwanathan, S.V.N.; Varma, Manik: Efficient max-margin multi-label classification with applications to zero-shot learning (2012)
  19. Park, Sang-Hyeun; Fürnkranz, Johannes: Efficient prediction algorithms for binary decomposition techniques (2012)
  20. Policicchio, Veronica L.; Pietramala, Adriana; Rullo, Pasquale: GAMoN: discovering $M$-of-$N^{\{\neg,\lor\}}$ hypotheses for text classification by a lattice-based genetic algorithm (2012)
