PHP Simple HTML DOM Parser

encoderer · on Dec 11, 2009

I've been using this for some serious scraping recently... dozens of threads scraping for 5, 6 days at a time.

Very good library, but watch out for memory leaks. It leaks like a sieve if you miss a ->clear() call in your destructors, unless you're relying on the circular reference detection in the PHP 5.3 garbage collector.
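
For anyone wondering why a missed clear() matters: a DOM tree where children point back at their parent forms reference cycles, and plain reference counting never frees those. Below is a rough Python analogue of the idea, not the PHP library's code. The Node class, its clear() method, and the helper names are all made up for illustration, and switching off Python's cycle collector only stands in for a runtime without cycle detection.

```python
import gc
import weakref


class Node:
    """Hypothetical DOM-like node: the parent/child links form cycles."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def clear(self):
        """Break the cycles by hand so reference counting can free the tree."""
        for child in self.children:
            child.clear()
        self.children = []
        self.parent = None


def build_tree():
    root = Node("html")
    body = Node("body", root)
    Node("p", body)
    return root


def freed_without_collector(call_clear):
    """Report whether the tree is freed by reference counting alone."""
    gc.disable()  # stand-in for a runtime with no cycle detection
    try:
        root = build_tree()
        probe = weakref.ref(root)  # lets us observe whether root was freed
        if call_clear:
            root.clear()
        del root
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # tidy up whatever the experiment leaked


if __name__ == "__main__":
    print("freed with clear():   ", freed_without_collector(True))
    print("freed without clear():", freed_without_collector(False))
```

In a scraper that runs for days, every parsed page that skips the clear step is a tree like the second case.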

crux_ · on Dec 11, 2009

Since folks here seem to be in the know, if I were after "parsing as close to the browser as possible", ideally in Java/Scala or Ruby or Python (sorry PHP!) -- any recommendations?

I've done scraping (e.g. w/ BeautifulSoup) but haven't looked to see how true the parses are to what IE/FF/WebKit would produce.

(On my list of things to look into: html5lib -- http://code.google.com/p/html5lib/ ... is it any good?)

fizx · on Dec 12, 2009

Run headless Firefox+xvfb+Selenium. Selenium has java, ruby, and python clients. This is working really well in production.

You might also have luck with AppleScript/RubyCocoa/MacRuby+Safari. I've tried a couple of other options, like HtmlUnit (which Google uses with GWT) and Mozilla Java Html Parser (which Dapper.net uses). I couldn't get them running, but YMMV.

tremendo · on Dec 12, 2009

For Ruby there's Hpricot and Nokogiri. Now I must admit to not understanding what "parsing as close to the browser as possible" would mean. These parsers are not for displaying the HTML; they're not rendering engines like those in browsers, but they will help you navigate the DOM of an (X)(HT)ML document programmatically. Surely I'm missing your meaning.

crux_ · on Dec 13, 2009

What I'm aiming at would mean "given this lump of (malformed) HTML, what DOM would a browser give me?" Maybe Hpricot, BeautifulSoup, et al are already there, but I don't know. :)
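
To make the question concrete, here is a small sketch using only Python's standard-library html.parser, which is a tolerant tokenizer rather than a browser-grade tree builder. For the input `<p>one<p>two` it reports two opening p tags and no closing tags, whereas the HTML5 parsing rules implicitly close the first paragraph when the second one starts. The TagRecorder class and tag_events helper are names invented for this example.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records tag events exactly as they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Return the (kind, tag) events the tolerant parser emits for markup."""
    recorder = TagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


if __name__ == "__main__":
    # Two "start" events and no "end" events: nothing has repaired the markup.
    print(tag_events("<p>one<p>two"))
```

So "doesn't choke on bad markup" and "gives the tree a browser would" are separate properties, and the second depends on whether the library implements browser-style repair rules.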

yannis · on Dec 11, 2009

Another way to get to the DOM, especially if you are using tidy to clean up your HTML, is tidy itself (http://php.net/manual/en/book.tidy.php). It is not very hard to get elements using tidy, but you need to write your own functions. However, it is almost bullet-proof, as it can correct malformed HTML, which some of the other libraries can't.

dshah · on Dec 13, 2009

I'm using Tidy now, and it's been working pretty well. What I do is use Tidy to convert malformed HTML to XHTML and then use the SimpleXML methods for my parsing.

Anyone know if this new library is better than that approach?

gorm · on Dec 11, 2009

http://code.google.com/p/phpquery/ This also seems interesting.

danw · on Dec 11, 2009

I've used this before. It's got a lovely API but uses far too much memory.

larrykubin · on Dec 11, 2009

For Python folks, check out pyquery (http://www.pyquery.org/). It's really handy.

mildweed · on Dec 11, 2009

This is a good one to be sure. Also, check out QueryPath (http://querypath.org/).

kylemathews · on Dec 12, 2009

I used QueryPath for a recent project and really liked it. Its selectors are almost identical to jQuery's, so the learning curve was shallow. Two things QueryPath does that this library doesn't seem to are working w/ XML and allowing chainable method calls. See this IBM developerWorks article: http://www.ibm.com/developerworks/opensource/library/os-php-...

grumpycanuck · on Dec 11, 2009

As someone who works with XML all day long, I am wondering why people don't use SimpleXML in PHP 5+ for this sort of thing too.

tetsuo13 · on Dec 11, 2009

SimpleXML will choke on invalid HTML/XML while this library claims to support it.
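
The same failure mode is easy to reproduce with any strict XML parser. Here is a minimal Python sketch using the standard-library xml.etree.ElementTree as a stand-in for SimpleXML (an analogy, not PHP's actual behavior or error messages). The parses_as_xml helper and the two sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET

# Sample inputs invented for this sketch.
WELL_FORMED = "<html><body><p>one</p><br/></body></html>"
TAG_SOUP = "<html><body><p>one<br></body></html>"  # unclosed <p> and <br>


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("well-formed:", parses_as_xml(WELL_FORMED))
    print("tag soup:   ", parses_as_xml(TAG_SOUP))
```

Typical scraped pages look like the second string, which is why people either run a cleanup pass such as Tidy first or reach for a parser built to tolerate tag soup.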

vinhboy · on Dec 12, 2009

You guys read my mind... I've been needing one of these.