How to Read All of Wikipedia
Wikipedia pages are information-rich but not easily accessible to data mining, because their content is made uniform only by convention, not by strict input validation. We present an approach to data-mining large volumes of semi-structured text, such as Wikipedia dump files, using open-source tools. We employ compiler-writing and data-visualization tools in such a way that we always "know what we don't know". This converts a mining task into an incremental process we call "Exploratory Parsing", with many applications in our increasingly open-data-rich world.
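To make the idea of always "knowing what we don't know" concrete, here is a minimal sketch in Python. It is not the tooling described in the talk: it substitutes a few regular expressions for a real parser generator, and the rule names, the `shape` signature function, and the sample lines are all hypothetical. The point it illustrates is only the loop itself: recognize what the current rules cover, tally everything else by coarse shape, and let the largest unrecognized group suggest the next rule to write.

```python
import re
from collections import Counter

# Hypothetical rules for a few wikitext line shapes (assumed for illustration;
# a real grammar for Wikipedia dumps would be far larger).
RULES = [
    ("heading", re.compile(r"^(={2,6})\s*(.+?)\s*\1$")),
    ("infobox_field", re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")),
    ("list_item", re.compile(r"^[*#]+\s*(.+)$")),
]

def classify(line):
    """Return the name of the first rule that matches, or None."""
    for name, pattern in RULES:
        if pattern.match(line):
            return name
    return None

def shape(line):
    """Collapse a line to a coarse signature so similar unknowns group together."""
    sig = re.sub(r"[A-Za-z]+", "a", line)   # any run of letters becomes "a"
    sig = re.sub(r"[0-9]+", "9", sig)       # any run of digits becomes "9"
    return sig[:20]

def explore(lines):
    """Count recognized lines per rule and unrecognized lines per shape."""
    recognized, unknown = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name = classify(line)
        if name is None:
            unknown[shape(line)] += 1
        else:
            recognized[name] += 1
    return recognized, unknown

# Made-up sample input, standing in for lines from a dump file.
sample = [
    "== History ==",
    "| population = 12345",
    "* first item",
    "{{Infobox city",
    "{{Infobox person",
    "Plain prose sentence.",
]

recognized, unknown = explore(sample)
print(recognized)
print(unknown.most_common())  # largest unrecognized shapes first
```

In this sketch the unrecognized tally plays the role of the "what we don't know" report: each pass over the data shrinks it by adding a rule for the most common remaining shape, which is one way to read the incremental process the abstract describes.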
