PHP Simple HTML DOM Parser (sourceforge.net)
47 points by barredo on Dec 11, 2009 | 15 comments


I've been using this for some serious scraping recently... dozens of threads scraping for 5, 6 days at a time.

Very good library, but watch out for memory leaks. It leaks like a sieve if you miss a ->clear() call in your destructors, unless you're relying on the circular reference detection in the PHP 5.3 garbage collector.
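
To illustrate why a missed clear() call matters, here is a rough analogy in Python rather than PHP. The Node class and its clear() method are hypothetical stand-ins, not the library's API. The sketch assumes a tree whose children point back at their parent, and it switches off Python's cycle collector to mimic a runtime that only does reference counting, as PHP did before 5.3.

```python
import gc
import weakref


class Node:
    """Hypothetical DOM-like node: parent and children reference each other."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def clear(self):
        # Break the parent <-> child cycles so plain reference
        # counting can free the whole tree.
        for child in self.children:
            child.clear()
        self.children = []
        self.parent = None


def build_tree():
    root = Node("html")
    body = Node("body", root)
    Node("p", body)
    return root


def survives_without_gc(call_clear):
    """Return True if the tree is still alive after its last name is dropped."""
    gc.disable()  # assumption: simulate a runtime with no cycle detection
    try:
        root = build_tree()
        probe = weakref.ref(root)
        if call_clear:
            root.clear()
        del root
        alive = probe() is not None
    finally:
        gc.enable()
        gc.collect()  # clean up whatever the experiment left behind
    return alive


print("leaks without clear():", survives_without_gc(call_clear=False))
print("leaks with clear():   ", survives_without_gc(call_clear=True))
```

In a long-running scraper, every page parsed without the explicit clear step would leave one such unreachable tree behind, which is consistent with the leak described above.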


Since folks here seem to be in the know, if I were after "parsing as close to the browser as possible", ideally in Java/Scala or Ruby or Python (sorry PHP!) -- any recommendations?

I've done scraping (e.g. w/ BeautifulSoup) but haven't looked to see how true the parses are to what IE/FF/WebKit would produce.

(On my list of things to look into: html5lib -- http://code.google.com/p/html5lib/ ... is it any good?)


Run headless Firefox+Xvfb+Selenium. Selenium has Java, Ruby, and Python clients. This is working really well in production.

You might also have luck with AppleScript/RubyCocoa/MacRuby+Safari. I've tried a couple of other options like HtmlUnit (which Google uses with GWT) and Mozilla Java Html Parser, which Dapper.net uses. I couldn't get them running, but YMMV.


For Ruby there's Hpricot and Nokogiri. I must admit I don't understand what "parsing as close to the browser as possible" would mean. These parsers are not for displaying the HTML; they're not rendering engines like those in browsers, but they will help you navigate the DOM of an (X)(HT)ML document programmatically. Surely I'm missing your meaning.


What I'm aiming at would mean "given this lump of (malformed) HTML, what DOM would a browser give me?" Maybe Hpricot, BeautifulSoup, et al are already there, but I don't know. :)
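
A small sketch of the gap in question, using only Python's standard library (the EventRecorder class and record_events helper are names made up for this example). The stdlib html.parser tolerates malformed input but only reports the tags it literally sees. A browser, by contrast, closes an open <p> implicitly when the next <p> starts and ends up with two sibling paragraphs in its DOM.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records the raw events the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data.strip()))


def record_events(markup):
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()  # flush any buffered trailing text
    return recorder.events


# Two paragraphs with no closing tags: common tag soup.
events = record_events("<p>one<p>two")
for event in events:
    print(event)

# No ("end", "p") events are reported: the parser does not insert the
# implied end tags, so rebuilding a browser-like tree is left to the caller.
```

Libraries such as html5lib aim to follow the HTML5 parsing algorithm itself, so they are the more likely candidates for matching what a browser builds. How closely Hpricot or BeautifulSoup match is not something this sketch can show.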


Another way to get to the DOM, especially if you are already using Tidy to clean up your HTML, is Tidy itself (http://php.net/manual/en/book.tidy.php). It is not very hard to get elements using Tidy, but you need to write your own functions. However, it is almost bulletproof, as it can correct malformed HTML, which some of the other libraries can't.


I'm using Tidy now, and it's been working pretty well. What I do is use Tidy to convert malformed HTML to XHTML and then use the SimpleXML methods for my parsing.

Anyone know if this new library is better than that approach?


http://code.google.com/p/phpquery/ This also seems interesting.


I've used this before. It's got a lovely API but uses far too much memory.


For Python folks, check out pyquery (http://www.pyquery.org/). It's really handy.


This is a good one to be sure. Also, check out QueryPath (http://querypath.org/).


I used QueryPath for a recent project and really liked it. Its selectors are almost identical to jQuery's, so the learning curve was shallow. Two things QueryPath does that this library doesn't seem to do are working with XML and allowing chainable method calls. See this IBM developerWorks article: http://www.ibm.com/developerworks/opensource/library/os-php-...


As someone who works with XML all day long, I am wondering why people don't use SimpleXML in PHP 5+ for this sort of thing too.


SimpleXML will choke on invalid HTML/XML while this library claims to support it.
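
The same behaviour can be seen with any strict XML parser. Below is a minimal sketch using Python's xml.etree.ElementTree as a stand-in for SimpleXML. It is an analogy only: SimpleXML reports errors differently, and the parses_as_xml helper and sample strings are made up for this example.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Well-formed XHTML-style markup: every element is closed.
well_formed = "<div><p>one<br/>two</p></div>"

# Typical tag soup: <br> and <p> are never closed.
tag_soup = "<div><p>one<br>two</div>"

print("well-formed accepted:", parses_as_xml(well_formed))
print("tag soup accepted:   ", parses_as_xml(tag_soup))
```

This is why a cleanup pass such as Tidy, or a parser built to tolerate tag soup, usually has to come before a strict XML parser when the input is scraped HTML.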


You guys read my mind. I've been needing one of these.



