DIADEM: thousands of websites to a single database. The web is overflowing with implicitly structured data, spread over hundreds of thousands of sites, hidden deep behind search forms, or siloed in marketplaces, only accessible as HTML. Automatic extraction of structured data at the scale of thousands of websites has long proven elusive, despite its central role in the "web of data". Through an extensive evaluation spanning over 10,000 websites from multiple application domains, we show that automatic yet accurate full-site extraction is no longer a distant dream. DIADEM is the first automatic full-site extraction system that is able to extract structured data from different domains at very high accuracy. It combines automated exploration of websites, identification of relevant data, and induction of exhaustive wrappers. Automating these components is the first challenge. DIADEM overcomes this challenge by combining phenomenological and ontological knowledge. Integrating these components is the second challenge. DIADEM overcomes this challenge through a self-adaptive network of relational transducers that produces effective wrappers for a wide variety of websites. Our extensive and publicly available evaluation shows that, for more than 90% of sites from three domains, DIADEM obtains an effective wrapper that extracts all relevant data with 97% average precision. DIADEM also tolerates noisy entity recognisers, and its components individually outperform comparable approaches.
References in zbMATH (referenced in 3 articles)
- Interlandi, Matteo; Tanca, Letizia: A Datalog-based computational model for coordination-free, data-parallel systems (2018)
- Gottlob, Georg; Koch, Christoph; Pieris, Andreas: Logic, languages, and rules for web data extraction and reasoning over data (2017)
- Furche, Tim; Gottlob, Georg; Schallhart, Christian: DIADEM: Domains to databases (2012)