I've been using this for some serious scraping recently... dozens of threads scraping for 5, 6 days at a time.
Very good application, but watch out for memory leaks. It leaks like a sieve if you miss a ->clear() call in your destructors and you aren't relying on the circular reference collector that arrived with the PHP 5.3 garbage collector.
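For anyone wondering why a missed clear() matters: DOM nodes point at their children and the children point back at their parent, so a pure reference-counting runtime can never free the tree on its own. Here is a rough Python analogue of that situation, not the PHP library's actual code. The Node class and its clear() method are hypothetical stand-ins, and the cyclic collector is switched off to imitate a refcount-only runtime. The behaviour shown relies on CPython's reference counting.

```python
import gc
import weakref


class Node:
    """Hypothetical DOM-like node with parent <-> child back-references."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def clear(self):
        # Break the reference cycles by hand, top-down.
        for child in self.children:
            child.clear()
        self.children = []
        self.parent = None


def build_tree():
    root = Node()
    Node(root)
    Node(root)
    return root


gc.disable()  # imitate a runtime with reference counting only
try:
    leaked = build_tree()
    leaked_ref = weakref.ref(leaked)
    del leaked  # the children still point at the root, so it survives
    leaked_alive = leaked_ref() is not None

    cleaned = build_tree()
    cleaned_ref = weakref.ref(cleaned)
    cleaned.clear()  # cycles broken before the last reference goes away
    del cleaned
    cleaned_alive = cleaned_ref() is not None
finally:
    gc.enable()
    gc.collect()  # a cycle collector reclaims the leaked tree after all

print(leaked_alive, cleaned_alive, leaked_ref() is None)
```

The first tree stays in memory after its last outside reference is dropped, the second is freed right away because clear() was called, and the final collect() plays the role of a cycle-detecting collector.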
Since folks here seem to be in the know, if I were after "parsing as close to the browser as possible", ideally in Java/Scala or Ruby or Python (sorry PHP!) -- any recommendations?
I've done scraping (e.g. w/ BeautifulSoup) but haven't looked to see how true the parses are to what IE/FF/WebKit would produce.
Run headless Firefox+Xvfb+Selenium. Selenium has Java, Ruby, and Python clients. This is working really well in production.
You might also have luck with AppleScript/RubyCocoa/MacRuby+Safari. I've tried a couple of other options like HtmlUnit (which Google uses with GWT) and Mozilla Java Html Parser, which Dapper.net uses. I couldn't get them running, but YMMV.
For Ruby there's Hpricot and Nokogiri. Now I must admit I don't understand what "parsing as close to the browser as possible" would mean. These parsers are not for displaying the HTML; they're not rendering engines like those in browsers, but they will help you navigate the DOM of an (X)(HT)ML document programmatically. Surely I'm missing your meaning.
What I'm aiming at would mean "given this lump of (malformed) HTML, what DOM would a browser give me?" Maybe Hpricot, BeautifulSoup, et al are already there, but I don't know. :)
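To make the question concrete, here is a small sketch using only Python's standard-library html.parser, which reports the tags it sees without repairing anything. The EventLogger class and log_events helper are names made up for this illustration. A browser would typically close the first paragraph implicitly when the second one opens and drop the stray end tag; this parser just passes the markup through as written, and that gap is what I'm asking about.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records raw parse events; it does no tree building or repair."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the log readable.
        if data.strip():
            self.events.append(("data", data.strip()))


def log_events(markup):
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Unclosed <p> elements plus an end tag that was never opened.
events = log_events("<p>one<p>two</b>")
for event in events:
    print(event)
```

The log contains two "start p" events with no matching end tags, and an "end b" that nothing opened, so any DOM built naively from it would differ from what a browser shows.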
Another way to get to the DOM, especially if you are using tidy to clean up your HTML, is tidy itself:
http://php.net/manual/en/book.tidy.php It is not very hard to get elements using tidy, but you need to write your own functions. However, it is almost bullet-proof, as it can correct malformed HTML, which some of the other libraries can't.
I'm using Tidy now, and it's been working pretty well. What I do is use Tidy to convert malformed HTML to XHTML and then use the SimpleXML methods for my parsing.
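The second half of that pipeline works in any language with an XML parser. Below is a rough Python sketch of it, not my actual PHP code. The tidied string stands in for whatever Tidy would emit (it is hand-written here, not real Tidy output), and extract_links and extract_paragraph_text are hypothetical helper names. It assumes the cleaned markup is well-formed XHTML in the standard XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping for XPath-style lookups on XHTML elements.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Stand-in for Tidy's output: well-formed XHTML (hand-written sample).
tidied = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<p class="lead">First paragraph</p>
<p>Second <a href="/next">link</a></p>
</body>
</html>"""


def extract_links(xhtml):
    """Return the href of every <a> element in the document."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.iterfind(".//x:a", XHTML_NS)]


def extract_paragraph_text(xhtml):
    """Return the flattened text of every <p> element."""
    root = ET.fromstring(xhtml)
    return ["".join(p.itertext()) for p in root.iterfind(".//x:p", XHTML_NS)]


links = extract_links(tidied)
paragraphs = extract_paragraph_text(tidied)
print(links)
print(paragraphs)
```

The point of the design is that the XML parser never has to cope with broken markup; all the repair work happens in the Tidy step before it.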
Anyone know if this new library is better than that approach?
I used QueryPath for a recent project and really liked it. Its selectors are almost identical to jQuery's, so the learning curve was shallow. Two things QueryPath does that this library doesn't seem to: it works with XML and it allows chainable method calls. See this IBM developerWorks article: http://www.ibm.com/developerworks/opensource/library/os-php-...