RCV1

RCV1: A New Benchmark Collection for Text Categorization Research.

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real-world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection’s properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as corrected versions of the category assignments and taxonomy structures, via online appendices.

This software is also peer reviewed by journal TOMS.


References in zbMATH (referenced in 104 articles)

Showing results 1 to 20 of 104.

  1. Yousefian, Farzad; Nedić, Angelia; Shanbhag, Uday V.: On stochastic and deterministic quasi-Newton methods for nonstrongly convex optimization: asymptotic convergence and rate analysis (2020)
  2. Duchi, John; Namkoong, Hongseok: Variance-based regularization with convex objectives (2019)
  3. Fercoq, Olivier; Bianchi, Pascal: A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions (2019)
  4. Karakus, Can; Sun, Yifan; Diggavi, Suhas; Yin, Wotao: Redundancy techniques for straggler mitigation in distributed optimization and learning (2019)
  5. Krishnamurthy, Akshay; Agarwal, Alekh; Huang, Tzu-Kuo; Daumé, Hal III; Langford, John: Active learning for cost-sensitive classification (2019)
  6. Milzarek, Andre; Xiao, Xiantao; Cen, Shicong; Wen, Zaiwen; Ulbrich, Michael: A stochastic semismooth Newton method for nonsmooth nonconvex optimization (2019)
  7. Song, Yangqiu; Upadhyay, Shyam; Peng, Haoruo; Mayhew, Stephen; Roth, Dan: Toward any-language zero-shot topic classification of textual documents (2019)
  8. Bashar, Md Abul; Li, Yuefeng: Interpretation of text patterns (2018)
  9. Bottou, Léon; Curtis, Frank E.; Nocedal, Jorge: Optimization methods for large-scale machine learning (2018)
  10. Burkhardt, Sophie; Kramer, Stefan: Online multi-label dependency topic models for text classification (2018)
  11. Elenberg, Ethan R.; Khanna, Rajiv; Dimakis, Alexandros G.; Negahban, Sahand: Restricted strong convexity implies weak submodularity (2018)
  12. Charte, Francisco; Rivera, Antonio J.; Charte, David; del Jesus, María J.; Herrera, Francisco: Tips, guidelines and tools for managing multi-label datasets: the mldr.datasets R package and the Cometa data repository (2018) arXiv
  13. Gudivada, Venkat N.; Arbabifard, Kamyar: Open-source libraries, application frameworks, and workflow systems for NLP (2018)
  14. Leblond, Rémi; Pedregosa, Fabian; Lacoste-Julien, Simon: Improved asynchronous parallel optimization analysis for stochastic incremental methods (2018)
  15. Lin, Qihang; Nadarajah, Selvaprabu; Soheili, Negar: A level-set method for convex optimization with a feasible solution path (2018)
  16. Sakai, Tomoya; Niu, Gang; Sugiyama, Masashi: Semi-supervised AUC optimization based on positive-unlabeled learning (2018)
  17. Wang, Chenguang; Song, Yangqiu; Li, Haoran; Zhang, Ming; Han, Jiawei: Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks (2018)
  18. Yuan, Xiao-Tong; Li, Ping; Zhang, Tong: Gradient hard thresholding pursuit (2018)
  19. Yue, Hangrui; Yang, Qingzhi; Wang, Xiangfeng; Yuan, Xiaoming: Implementing the alternating direction method of multipliers for big datasets: a case study of least absolute shrinkage and selection operator (2018)
  20. Esuli, Andrea; Fagni, Tiziano; Moreo Fernandez, Alejandro: JaTeCS an open-source JAva TExt Categorization System (2017) arXiv
