The status of this project is: DORMANT. I’ve written a new and very interesting matching engine to rewrite it around, but haven’t had the time to put the pieces together.


To get the code:

hg clone


To download in package form:


Roux is a tool for scraping data out of arbitrary HTML in a civilized manner. It uses BeautifulSoup, and is designed around the idea of regular expressions that are composed, interpreted, and executed as “tag soup” rather than character strings.
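To make the "regular expressions over tag soup" idea concrete, here is a minimal sketch of one way such matching could work. It is not Roux's actual API or engine: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without dependencies, and the names TagTokenizer, tokenize, and find_link_texts are hypothetical. The trick it illustrates is encoding each tag or text node as a single character, so that an ordinary regular expression matches over tokens rather than over raw characters.

```python
import re
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Flatten HTML into a list of (kind, value) tokens. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text nodes
            self.tokens.append(("text", text))


def tokenize(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


# One character per token, so regex match offsets are also token indices.
CODES = {
    ("open", "li"): "L", ("close", "li"): "l",
    ("open", "a"): "A", ("close", "a"): "a",
}

# The "recipe": <li>, <a>, a text node, </a>, anything up to the closing </li>.
RECIPE = re.compile(r"LA(T)a[^l]*l")


def find_link_texts(html):
    """Return the link text of every <li> that starts with a link."""
    tokens = tokenize(html)
    encoded = "".join(
        "T" if kind == "text" else CODES.get((kind, value), "_")
        for kind, value in tokens
    )
    return [tokens[m.start(1)][1] for m in RECIPE.finditer(encoded)]
```

Because the pattern is written against tokens, it is indifferent to attribute order, whitespace, and other markup noise that would break a character-level regular expression.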

It contains a framework for turning any such “recipe” into a soup-scraping cron job that outputs matches as an Atom feed. Feeds are updated so as to avoid unnecessary bandwidth use, both when fetching the source pages and when rewriting portions of the feed that haven’t actually changed.
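The two bandwidth savings described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Roux's own code: the function names and the cache dictionary are hypothetical, the fetching side is assumed to rely on standard HTTP conditional requests (If-None-Match and If-Modified-Since), and the writing side compares the whole feed file rather than individual portions of it, which is coarser than what the text describes.

```python
import os
import tempfile
import urllib.error
import urllib.request


def conditional_headers(cache):
    """Build request headers from validators saved on the previous fetch."""
    headers = {}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    return headers


def fetch_if_changed(url, cache):
    """Return the page body, or None when the server answers 304 Not Modified."""
    request = urllib.request.Request(url, headers=conditional_headers(cache))
    try:
        with urllib.request.urlopen(request) as response:
            # Remember the validators for the next cron run.
            cache["etag"] = response.headers.get("ETag")
            cache["last_modified"] = response.headers.get("Last-Modified")
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None
        raise


def write_if_changed(path, data):
    """Write data to path only if it differs; return True if a write happened."""
    try:
        with open(path, "rb") as f:
            if f.read() == data:
                return False
    except FileNotFoundError:
        pass
    # Write to a temporary file first so readers never see a half-written feed.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
    return True
```

Leaving an unchanged feed file untouched also keeps its modification time stable, so a web server serving the file can in turn answer feed readers' conditional requests with 304.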

This package is best known (which is to say, known at all) for producing the feed of Ted Goranson’s IMDB comments hosted here at Red Bean. For some reason, IMDB does not provide feeds of its own. Otherwise I mostly use it for reading comics.