How to Read All of Wikipedia

Session track:
Session time: 2:00pm

Wikipedia pages are information-rich but not easily accessible to data mining, because their content is made uniform only by convention, not by strict input validation. We present an approach to data-mining large volumes of semi-structured text, such as Wikipedia dump files, using open-source tools. We employ compiler-writing and data-visualization tools in such a way that we always "know what we don't know". This turns a mining task into an incremental process we call "Exploratory Parsing", which has many applications in our increasingly open-data-rich world.
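
To make "knowing what we don't know" concrete, here is a minimal Python sketch of one way such an incremental loop might look. It is not the presenters' tooling: the rule names, the regular expressions, and the sample lines are all hypothetical, and plain regexes stand in for the compiler-writing tools mentioned above. The idea illustrated is only that every line the current rules cannot parse is counted and kept, so the size of the gap is always visible.

```python
import re
from collections import Counter

# Hypothetical rules for illustration only; real wikitext needs far more.
RULES = [
    ("heading", re.compile(r"^(={2,6})\s*(.+?)\s*\1$")),
    ("infobox_field", re.compile(r"^\|\s*([\w ]+?)\s*=\s*(.*)$")),
    ("list_item", re.compile(r"^[*#]+\s*(.+)$")),
]


def classify(line):
    """Return the name of the first rule that matches, or 'unknown'."""
    for name, pattern in RULES:
        if pattern.match(line):
            return name
    return "unknown"


def explore(lines):
    """Count lines per rule and collect the unmatched ones for inspection."""
    counts = Counter()
    unknown = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no structure
        kind = classify(line)
        counts[kind] += 1
        if kind == "unknown":
            unknown.append(line)
    return counts, unknown


# Made-up sample lines in the style of wikitext, not taken from a real dump.
SAMPLE = """\
== History ==
| population = 8336817
* First item
{{Infobox settlement
Some free-form prose.
""".splitlines()

if __name__ == "__main__":
    counts, unknown = explore(SAMPLE)
    total = sum(counts.values())
    print(f"coverage: {1 - counts['unknown'] / total:.0%}")
    for line in unknown:
        print("unparsed:", line)
```

Each pass, one would look at the most common unparsed lines, add or refine a rule, and rerun, so that coverage grows while the remaining unknowns stay explicit.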