How to Read All of Wikipedia

Session track:
Session time: 2:00pm
Speaker: 

Wikipedia pages are information-rich but hard to data-mine, because their content is made uniform only by convention, not by strict input validation. We present an approach to data mining large volumes of semi-structured text, such as Wikipedia dump files, using open-source tools. We combine compiler-writing and data-visualization tools so that we always "know what we don't know". This turns a mining task into an incremental process we call "Exploratory Parsing", which has many applications in an increasingly open-data-rich world.
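
To make the "know what we don't know" idea concrete, here is a minimal Python sketch. It is not the speakers' tooling: the abstract mentions compiler-writing tools, whereas this example substitutes plain regular expressions for brevity. The rule names, the patterns, and the helper functions `classify`, `shape`, and `explore` are all hypothetical. The point is only that every line the rules fail to recognize is counted and grouped by a coarse signature, so the most common unhandled pattern suggests which rule to write next.

```python
import re
from collections import Counter

# Hypothetical rules: each pairs a label with a pattern for one line shape.
# A real exploratory parser would grow this list one iteration at a time.
RULES = [
    ("heading", re.compile(r"^(={2,6})\s*(.+?)\s*\1$")),
    ("infobox_field", re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")),
    ("list_item", re.compile(r"^[*#]+\s*(.+)$")),
]


def classify(line):
    """Return the label of the first rule matching the line, or None."""
    for label, pattern in RULES:
        if pattern.match(line):
            return label
    return None


def shape(line):
    """Reduce a line to a coarse signature so similar unknowns group together.

    Runs of letters become 'a' and runs of digits become '9'; the result is
    truncated to 20 characters.
    """
    signature = re.sub(r"[A-Za-z]+", "a", line)
    signature = re.sub(r"\d+", "9", signature)
    return signature[:20]


def explore(lines):
    """Tally recognized lines by label and unrecognized lines by shape."""
    recognized = Counter()
    unknown = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label = classify(line)
        if label is None:
            unknown[shape(line)] += 1
        else:
            recognized[label] += 1
    return recognized, unknown
```

Calling `explore` on an iterable of wikitext lines and then inspecting `unknown.most_common()` gives a ranked list of what the current rules do not yet cover. Adding a rule for the top entry and rerunning is one step of the incremental loop the abstract describes.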