
This is the first time I have seen someone use GROBID. It seems like an incredibly cool solution.


I spent forever looking at various PDF parsing solutions like Unstructured, and eventually stumbled across GROBID, which was an absolutely perfect fit since it's made entirely for scientific papers and has header/section-level segmentation (splitting the paper into Abstract, Introduction, References, etc.). It's lightweight and fast too!
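For anyone curious what that section-level segmentation looks like downstream, here is a minimal sketch of pulling (heading, text) pairs out of GROBID's TEI XML output. It assumes the usual layout where the abstract sits under `profileDesc/abstract` and each body section is a `div` with a `head` and `p` children; the function name `split_sections` and the "Untitled" fallback are made up for illustration, and real output may need extra handling (nested divs, figures, formulas).

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _clean(element) -> str:
    """Join all text under an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def split_sections(tei_xml: str) -> list[tuple[str, str]]:
    """Return (heading, text) pairs from a TEI document (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    sections = []

    # Assumption: the abstract lives in the header, not in the body.
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    if abstract is not None and _clean(abstract):
        sections.append(("Abstract", _clean(abstract)))

    # Assumption: each top-level body <div> is one section with an optional <head>.
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = _clean(head) if head is not None else "Untitled"
        paragraphs = [_clean(p) for p in div.findall("tei:p", TEI_NS)]
        sections.append((title, "\n".join(paragraphs)))

    return sections
```

Returning a list of pairs rather than a dict keeps sections with repeated headings from overwriting each other, which matters if you chunk per section afterwards.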


I've spent the last couple of weeks diving into various PDF parsing solutions for scientific documents. GROBID is pretty cool, but it made some mistakes when parsing papers (arXiv ones, I think) and dropped some of the text.

Even though it gave a lot of great structured options, missing even a single sentence was unforgivable to me. I went with Nougat instead, for arxiv papers.

(Also check out Marker, mentioned on HN in the last month, for pretty high-fidelity paper conversion to markdown. It does a reasonable job with equations too.)


Google's Document AI does a good job, but I'll need to test the equation handling again to be sure.


Did you try Apache Tika?


I wonder if they knew that they could get HTML versions of the paper by just changing the link from ...arxiv.. to ar5iv..
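A rough sketch of that link swap, assuming the paper lives at `arxiv.org/abs/<id>` or `arxiv.org/pdf/<id>` and that the HTML view sits at the same `/abs/<id>` path on `ar5iv.org`. The helper name `to_ar5iv` is made up, and not every paper is guaranteed to have an HTML rendering there.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Rewrite an arxiv.org abs/pdf link to the matching ar5iv.org link (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {url}")

    path = parts.path
    # Assumption: the HTML view uses the /abs/<id> form, so map /pdf/<id>(.pdf) onto it.
    if path.startswith("/pdf/"):
        path = "/abs/" + path[len("/pdf/"):]
        if path.endswith(".pdf"):
            path = path[: -len(".pdf")]

    # Drop any query string or fragment; only the paper id matters here.
    return urlunsplit(("https", "ar5iv.org", path, "", ""))
```

For example, `to_ar5iv("https://arxiv.org/pdf/1910.06709.pdf")` gives an `/abs/1910.06709` link on the ar5iv host, which you can then fetch and parse as ordinary HTML.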


I did try that at first, but it was hard to parse the HTML, organize it into logical sections (authors, references, abstract), and then clean up the text so it was ready for chunking and embedding. Once I found GROBID I just went that route because it handled all of that for me.



