arXMLiv: Translating the arXiv to XML+MathML: The last few years have seen the emergence of various XML-based, content-oriented markup languages for mathematics and the natural sciences on the web, e.g. OpenMath, Content MathML, or our own OMDoc and PhysML. These representation languages focus on mathematics and make the structure of the mathematical knowledge in a document explicit enough that machines can operate on it. The promise of these content-oriented approaches is that various tasks involved in doing mathematics (e.g. search, navigation, cross-referencing, quality control, user-adaptive presentation, proving, simulation) can be machine-supported, so that the working mathematician is freed to do what humans can still do infinitely better than machines. In the arXMLiv project we try to translate the vast collection of scientific knowledge captured in the arXiv repository into content-based form, so that we can use it as a basis for added-value services. We are using Bruce Miller's LaTeXML system to transform LaTeX documents into XHTML/HTML5 with Presentation MathML. LaTeXML is a reimplementation of the TeX parser with a programmable XML emitter. The main advantage of the system is that we can control macro expansion by supplying customized "LaTeXML bindings" for the macros. These are instructions to the emitter to construct output XML directly instead of expanding the macro to TeX primitives. The main technical task of the arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collections. For this we have developed a distributed build system that continuously runs LaTeXML over the arXiv collection and collects statistics, for example on the most sorely missing LaTeXML bindings. We have processed more than half of the arXiv collection (one run is a processor-year-size undertaking) and already have a success rate of over 60% (i.e. over 60% of the documents ran through without LaTeXML noticing an error).
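
The two statistics mentioned above (the share of documents that convert without an error, and a ranking of the most frequently missing bindings) can be illustrated with a minimal Python sketch. It is not the project's build system: the record layout `(doc_id, had_error, missing_files)`, the function name `summarize`, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def summarize(results):
    """Aggregate hypothetical per-document conversion records.

    Each record is a tuple (doc_id, had_error, missing_files), where
    missing_files lists the class/package files for which no binding
    was found. This layout is an assumption, not LaTeXML's log format.
    """
    total = 0
    clean = 0
    missing = Counter()
    for _doc_id, had_error, missing_files in results:
        total += 1
        if not had_error:
            clean += 1
        # Count each missing file once per document.
        missing.update(set(missing_files))
    rate = clean / total if total else 0.0
    # most_common() sorts by descending document count.
    return rate, missing.most_common()


# Made-up sample records, for illustration only.
sample = [
    ("doc-a", False, []),
    ("doc-b", True, ["foo.sty"]),
    ("doc-c", True, ["foo.sty", "bar.cls"]),
    ("doc-d", False, []),
]

rate, ranked = summarize(sample)
print(f"success rate: {rate:.0%}")
for name, count in ranked:
    print(f"{name}: missing in {count} document(s)")
```

Counting a missing file once per document, rather than once per occurrence, ranks bindings by how many documents they would unblock, which is one plausible reading of "most sorely missing".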
