This is the first time I have seen someone use GROBID. It seems like an incredib...

evanhu_ · on Dec 21, 2023

I spent forever looking at various PDF parsing solutions like Unstructured and eventually stumbled across GROBID, which was an absolutely perfect fit: it's made entirely for scientific papers and has header- and section-level segmentation (splitting the paper into Abstract, Introduction, References, etc.). It's lightweight and fast too!

pugio · on Dec 21, 2023

I've spent the last couple of weeks diving into various PDF parsing solutions for scientific documents. GROBID is pretty cool, but it made some mistakes when parsing (I think arXiv) papers and dropped some of the text.

Even though it offered a lot of great structured output, missing even a single sentence was unforgivable to me, so I went with Nougat instead for arXiv papers.

(Also check out Marker, mentioned on HN in the last month, for pretty high-fidelity conversion of papers to Markdown. It does a reasonable job with equations too.)

kordlessagain · on Dec 21, 2023

Google's Document AI does a good job, but I'll need to test the equation handling again to be sure.

skeptrune · on Dec 21, 2023

Did you try Apache Tika?

arbitrandomuser · on Dec 21, 2023

I wonder if they knew that they could get HTML versions of the papers by just changing the link from ...arxiv.. to ar5iv..
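
For anyone who wants to script that link swap, here is a minimal sketch. It assumes the usual `arxiv.org/abs/<id>` and `arxiv.org/pdf/<id>` link shapes and the `ar5iv.org` host; the helper name `to_ar5iv` is made up for illustration, and not every paper necessarily has an HTML conversion available.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Rewrite an arxiv.org abs/pdf link to the matching ar5iv.org link.

    Hypothetical helper: assumes the common /abs/<id> and /pdf/<id>[.pdf]
    path shapes and does not check that the HTML version actually exists.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host != "arxiv.org":
        raise ValueError(f"not an arxiv.org URL: {url}")

    path = parts.path
    # PDF links point at /pdf/<id> or /pdf/<id>.pdf; map them to /abs/<id>.
    if path.startswith("/pdf/"):
        path = "/abs/" + path[len("/pdf/"):]
        if path.endswith(".pdf"):
            path = path[: -len(".pdf")]

    # Drop any query string or fragment and force https.
    return urlunsplit(("https", "ar5iv.org", path, "", ""))
```

Calling `to_ar5iv("https://arxiv.org/abs/1706.03762")` gives back the same path on `ar5iv.org`; anything that is not an `arxiv.org` link raises `ValueError` rather than being silently rewritten.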

evanhu_ · on Dec 21, 2023

I did try that at first, but it was hard to parse through the HTML, organize it into logical sections (authors, references, abstract), and then clean up the text so it was ready for chunking and embedding. Once I found GROBID I just went that route because it handled all of that for me.
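
To give a rough idea of what "handled all that" looks like downstream: GROBID returns TEI XML, and pulling titled sections out of it can be sketched with the standard library alone. The sample document below is hand-written for illustration, not real GROBID output, and the function name `tei_to_sections` is hypothetical. Real output is messier (nested divs, formulas, figures), so treat this as a starting point, not the pipeline described above.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written toy input (an assumption, not actual GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <abstract><div><p>We study PDF parsing.</p></div></abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head n="1">Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
      <div><head n="2">Method</head><p>We parse the TEI.</p></div>
    </body>
  </text>
</TEI>"""


def _text(elem: ET.Element) -> str:
    """Return all text under elem with whitespace runs collapsed."""
    return " ".join("".join(elem.itertext()).split())


def tei_to_sections(tei_xml: str) -> list[tuple[str, str]]:
    """Return (title, text) pairs: the abstract first, then each body div."""
    root = ET.fromstring(tei_xml)
    sections = []

    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    if abstract is not None:
        text = _text(abstract)
        if text:
            sections.append(("Abstract", text))

    # Only top-level body divs are visited; nested divs are ignored here.
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = _text(head) if head is not None else "Untitled"
        paragraphs = [_text(p) for p in div.findall("tei:p", TEI_NS)]
        body = "\n".join(p for p in paragraphs if p)
        if body:
            sections.append((title, body))

    return sections
```

Each `(title, text)` pair from `tei_to_sections(SAMPLE_TEI)` can then be fed to whatever chunking and embedding step comes next, with the section title kept as metadata.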