I created a PDF table extractor tool last year with the same idea that it should...

kickbeak · on March 3, 2021

Wow, that looks awesome! What did you use to display the PDF in the browser? It all feels really responsive!

mgm__ · on March 3, 2021

I used Mozilla's PDF.js: https://mozilla.github.io/pdf.js/ It is what Firefox uses on desktop to show PDFs!

kickbeak · on March 3, 2021

Thanks, really great work!

redman25 · on March 3, 2021

This is interesting. How accurate would you say it is?

mgm__ · on March 3, 2021

I haven't seen anything better. It started as a PoC, and I decided not to include automatic table detection on the page; instead, the user has to draw a box around the table.

I use Tabula under the hood for the cell/row detection, and it is really good as long as the correct mode is selected for the type of table. The modes are stream (find cells by spacing) and lattice (find cells by ruling lines).
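
To make the difference between the two modes concrete, here is a toy sketch in plain Python. It is not Tabula's actual implementation, only an illustration of the two ideas: the word tuples, the `min_gap` threshold, and the function names `stream_cells` and `lattice_cells` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical input types for illustration only.
Word = Tuple[float, float, str]           # (x_left, x_right, text) for one text row
Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1)


def stream_cells(words: List[Word], min_gap: float = 10.0) -> List[str]:
    """Stream-style idea: split one row into cells wherever the
    horizontal whitespace between neighbouring words exceeds min_gap."""
    cells: List[str] = []
    current: List[str] = []
    prev_right = None
    for x_left, x_right, text in sorted(words):
        # A wide gap since the previous word starts a new cell.
        if prev_right is not None and x_left - prev_right > min_gap:
            cells.append(" ".join(current))
            current = []
        current.append(text)
        prev_right = x_right if prev_right is None else max(prev_right, x_right)
    if current:
        cells.append(" ".join(current))
    return cells


def lattice_cells(x_lines: List[float], y_lines: List[float]) -> List[Box]:
    """Lattice-style idea: every pair of adjacent vertical rulings and
    adjacent horizontal rulings bounds one cell (listed row by row)."""
    xs = sorted(x_lines)
    ys = sorted(y_lines)
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]


if __name__ == "__main__":
    row = [(0, 30, "Item"), (34, 60, "name"), (120, 150, "Qty"), (200, 240, "Price")]
    print(stream_cells(row))
    print(lattice_cells([0, 100, 200], [0, 50]))
```

This also suggests why picking the wrong mode hurts: a table without ruling lines gives the lattice approach nothing to work with, while a ruled table with tightly packed text can fool a spacing threshold.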

The OCR/OpenCV path also seemed to work fine, as long as the text isn't too blurry. Here is a GIF of the OCR/OpenCV running on an example image PDF: https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...