A project from a Python programming language user group has downloaded the full content of Wikipedia and compressed it to fit on a CD or DVD, while still allowing users to read it seamlessly through on-the-fly decompression. The project's managers aim to distribute the full Wikipedia to remote schools that might lack broadband connectivity.
It is called “CDPedia” and it is the work of PyAr, the Python user group in Argentina. It contains the full Spanish-language Wikipedia in versions sized for either a 680 MB CD-R or a 4.5 GB DVD-R. The files (ISO images to burn to a CD-R disc, plus a bigger version that fits on a DVD-R) are distributed over BitTorrent. They use the Python interpreter already available on Linux and OS X systems, and the disc includes Python for trouble-free running for Windows users.
The project warns that content in the current version, 0.6, released last week, is frozen at whatever Wikipedia contained in mid-2008. TechEye sat down with Alejandro J. Cura of PyAr in Buenos Aires to talk about the project and the group's plans to get this CD Wikipedia distributed by the government to every public school.
FC: Hi Alejandro. Why the Wikipedia on CD?
FC: What’s the difference between the 680 MB version and the 4 GB one? Is it just lower-resolution graphics, fewer images, or fewer articles?
AC: The CD version contains fewer articles and some images are in lower resolution. Those are just the ISOs we are torrenting, though. Our code tries to select the best articles for the size you choose, so there’s also a dual-layer DVD version that we are sending to schools, and we are working on a version that a big company wants to distribute preloaded on its pen drives.
FC: Any idea of the total number of articles in each edition?
AC: The DVD9 version has all 448,038 articles of the Spanish Wikipedia that were dumped to HTML in June 2008, and 98% of the images. The CD version carries a large percentage of these articles, plus the images of the most relevant articles.
FC: So, the geek in me can’t help asking: what’s really the difference from doing a “wget -m -np -k -c http://es.wikipedia.org”? (Note for non-geeks: that’s using the free GNU Wget tool to download a recursive copy of Wikipedia to disk, converting all http:// links to local ones.)
AC: We tried that. If you do that, there’s no way the result will fit, even on a DVD. Also, Wikipedia is likely to ban you for spidering all the content like that!
FC: Is it just HTML pointing to local links, or is there some Python magic going on behind the scenes?
AC: It’s HTML being served from a Python web server running on localhost. The server fetches the articles from a compressed block format that’s stored on the disc.
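As a rough illustration of that idea, here is a minimal sketch of a localhost web server that decompresses an article on each request. It is not the project's code: the `STORE` dictionary, the `/wiki/...` paths, and the function names are assumptions made for the example, and the dictionary merely stands in for the compressed files on the disc.

```python
import bz2
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the compressed article store on the disc:
# request path -> bzip2-compressed HTML.
STORE = {
    "/wiki/Python": bz2.compress(
        "<html><body><h1>Python</h1></body></html>".encode("utf-8")
    ),
}


def load_article(path):
    """Return the decompressed HTML bytes for a path, or None if absent."""
    compressed = STORE.get(path)
    if compressed is None:
        return None
    return bz2.decompress(compressed)


class ArticleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = load_article(self.path)
        if body is None:
            self.send_error(404, "Article not on this disc")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def start_server(port=0):
    """Serve on localhost in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), ArticleHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_server(8000)` and pointing a browser at `http://127.0.0.1:8000/wiki/Python` would show the sample page; `server.shutdown()` stops it again.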
FC: So what’s the size of the Spanish language Wikipedia?
AC: Just the HTML content from Wikipedia is several gigabytes, without taking the images into account.
FC: And how do you solve that?
AC: One of the tools we created compresses all the chosen articles, balancing the space used against read speed from optical media.
FC: Since ASCII text compresses well, is the text on CDPedia compressed? I mean compressed as in gzip or 7zip compression, not just optimization and removal of tags.
AC: Yes, absolutely. We currently use bzip2 for the compression. We also had to make a trade-off between compressed size and decompression speed, so we designed a simple block structure that worked out pretty well.
FC: What’s the most obvious content to throw out?
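The interview does not describe the block format, so the following is only a sketch of one way such a trade-off can be made with Python's standard `bz2` module. Articles are packed into blocks that are compressed independently, and an index records where each article lives. The function names, the index layout, and the 64 KB default are all assumptions for the example.

```python
import bz2


def build_blocks(articles, block_size=64 * 1024):
    """Pack (name, html_bytes) pairs into independently compressed blocks.

    Returns (blocks, index), where index maps
    name -> (block number, offset inside the raw block, length).
    """
    blocks, index = [], {}
    current = bytearray()
    for name, data in articles:
        # Close the current block once adding this article would overflow it.
        if current and len(current) + len(data) > block_size:
            blocks.append(bz2.compress(bytes(current)))
            current = bytearray()
        index[name] = (len(blocks), len(current), len(data))
        current.extend(data)
    if current:
        blocks.append(bz2.compress(bytes(current)))
    return blocks, index


def read_article(blocks, index, name):
    """Decompress only the one block that holds the requested article."""
    block_no, offset, length = index[name]
    raw = bz2.decompress(blocks[block_no])
    return raw[offset:offset + length]
```

Larger blocks give bzip2 more context and usually compress better, but every read then has to decompress more data. Smaller blocks reverse that balance, which is the size-versus-speed trade-off mentioned in the answer.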
AC: The discussion pages linked from every article, the user pages, and the like. We would really like to include all that, but we have to draw the line somewhere.
FC: And how does the code choose which articles to throw out?
AC: One of the tools selects the most interesting articles following various criteria, for instance how many pages link to each one, a sort of PageRank that we call “Peishranc”.
FC: How long does it take you to generate a new CDPedia (not counting download time)? I suppose you have a Python script to do the optimization?
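To make the link-counting criterion concrete, here is a small sketch that ranks pages by how many other pages link to them. This is a plain inbound-link count, not PageRank proper and not the project's actual “Peishranc” code. The `href="/wiki/..."` link format and the function name are assumptions for the example.

```python
import re
from collections import Counter

# Assumed link format in the dumped HTML: href="/wiki/<article name>"
LINK_RE = re.compile(r'href="/wiki/([^"#]+)')


def rank_by_inbound_links(pages):
    """pages maps article name -> HTML. Returns names, most linked first."""
    inbound = Counter()
    for name, html in pages.items():
        # Count each linking page once per target and ignore self-links.
        targets = set(LINK_RE.findall(html)) - {name}
        inbound.update(t for t in targets if t in pages)
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(pages, key=lambda n: (-inbound[n], n))
```

A real ranking would presumably combine this with the other criteria the answer alludes to. The sketch covers only the one that is named.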
AC: It takes a few hours to generate the DVD. Since it’s something we rarely do (only when new versions of the Wikipedia dump are available) we haven’t gotten around to improving its speed, but there’s a lot of room for optimizations, and that’s something that is easily done with Python.
FC: You mention that some content has to be thrown out. Besides the fixed rules you already mentioned, is there a manual (human or alien) selection process, or is there just a set of rules (regex?) applied to remove undesired content?
AC: There’s a file with a list of articles that are always included (think Wikipedia’s about pages and legal disclaimers). We will also need to add a list of articles that are excluded depending on the audience (think “double dildo”). Both lists will be manually edited by the people doing the particular CDPedia build.
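The two hand-edited lists could be combined with a ranking and a size budget roughly as follows. This is a hypothetical sketch, not the project's selection tool: the function name, its parameters, and the greedy fill strategy are all assumptions.

```python
def select_articles(ranked, sizes, budget, always_include=(), never_include=()):
    """Pick articles for one build.

    ranked: article names, best first; sizes: name -> size in bytes;
    budget: total bytes available on the target medium.
    """
    banned = set(never_include)
    chosen, used = [], 0
    # Forced articles go in first so the budget can never squeeze them out.
    for name in always_include:
        if name not in banned and name not in chosen:
            chosen.append(name)
            used += sizes[name]
    # Then fill the remaining space greedily in ranking order.
    for name in ranked:
        if name in banned or name in chosen:
            continue
        if used + sizes[name] <= budget:
            chosen.append(name)
            used += sizes[name]
    return chosen
```

In this sketch the exclusion list wins if an article appears on both lists, which seems the safer default for a build aimed at schools.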
FC: I heard you’ve recently shown this to Argentina’s education authorities.
AC: Yes, version 0.6 of CDPedia, more precisely the DVD version, was given to Educ.ar, the country’s government-operated education web site for teachers and students. We at PyAr, Argentina’s Python user group, plan to distribute this free encyclopedia to all public schools in Argentina.
FC: How did you get in touch with the local Wikipedia chapter, and how did you get them interested?
FC: Do you anticipate the need for any sort of training to use CDPedia, or do you believe all teachers are familiar already with inserting a CD and using the CDPedia?
AC: If you can read and click on a link you are ready to use CDPedia.
FC: You mention in the docs that Python is included in the CD for use by Windows users. Does it auto-start? Or does the user have to manually launch it?
AC: Yes, on Windows it auto-starts and the Python run-time environment is included. The same CD or DVD can be used from a Linux or Mac OS X system.
FC: Is the set of Python tools used to build CDPedia available? If not, do you plan to make it available to the general public eventually?
FC: When do you plan to release a newer version with more up-to-date content?
AC: We are working on it right now, and we plan to release it by the end of March.
FC: Here’s your chance to say why Python is so cool and why people should join their nearest Python user group.
AC: We wouldn’t be able to build something like cdpedia in our very little free time as easily as with Python, so I recommend you all learn it and use it. Also, I’ve found that some of the smartest people in my area are hanging at the Python Users group, and they also happen to be fun people to hang with, so it was a no-brainer joining that group!
FC: What’s the download URL?
AC: The project is hosted at http://code.google.com/p/cdpedia/ and you will find the download links there.
FC: Thanks, Alejandro, for your time and for sharing your thoughts with TechEye. As soon as our torrent is done downloading the four gigabytes, we’ll take CDPedia for a spin.
AC: Thank you!