
I created a PDF table extractor tool last year with the same idea: it should be local only. Try it here: https://pdftableutil.possiblenull.com/app/ It is also available as a Google Docs add-on (still local only): https://workspace.google.com/marketplace/app/pdf_table_impor...

I had a bad case of scope creep, so the tool can also extract tables from scanned/image PDFs using OpenCV.js and a WebAssembly build of Tesseract OCR!



Wow, that looks awesome. What did you use to display the PDF in the browser? It all feels really responsive!


I used Mozilla's PDF.js: https://mozilla.github.io/pdf.js/ It is what Firefox uses on desktop to show PDFs!


Thanks, really great work!


This is interesting. How accurate would you say it is?


I haven't seen anything better. It started as a PoC, and I decided not to include automatic table detection on the page; instead, the user has to draw a box around the table.

I use Tabula under the hood for the cell/row detection, and it is really good as long as the correct mode is selected for the type of table. The modes are stream (find cells by the spacing between text) and lattice (find cells by ruling lines).
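
To give a rough feel for the stream idea, here is a toy Python sketch. It is not Tabula's actual algorithm, just an illustration of splitting one line of text into cells by horizontal gaps. The `split_columns` function, the `(x0, x1, text)` word tuples, and the `min_gap` threshold are all made up for this example.

```python
from typing import List, Tuple

# Hypothetical word record: left edge, right edge, text (all on one text line).
Word = Tuple[float, float, str]

def split_columns(words: List[Word], min_gap: float = 10.0) -> List[str]:
    """Group the words of one text line into cells.

    A new cell starts whenever the horizontal gap between a word and
    everything to its left is at least min_gap (an assumed threshold).
    """
    cells: List[List[str]] = []
    prev_right = None  # right-most edge seen so far
    for x0, x1, text in sorted(words):  # left-to-right order
        if prev_right is None or x0 - prev_right >= min_gap:
            cells.append([text])      # wide gap: start a new cell
        else:
            cells[-1].append(text)    # narrow gap: same cell
        prev_right = x1 if prev_right is None else max(prev_right, x1)
    return [" ".join(cell) for cell in cells]

row = [(0, 30, "Item"), (34, 60, "name"), (120, 150, "Qty"), (200, 240, "Price")]
print(split_columns(row))
```

Lattice mode would skip this kind of gap guessing and use the drawn ruling lines as cell borders instead, which is why picking the mode that matches the table matters.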

The OCR/OpenCV path seems to work fine too, as long as the text isn't too blurry. Here is a GIF of the OCR/OpenCV running on an example image PDF: https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...



