
There are great CLI tools already in this thread, but for some of my side-gig work I'm searching large piles of PDFs, word-processor documents, and ePubs with a GUI word processor open, and I need to reference the source by page or paragraph number. For those I use DocFetcher[1], a quirky and intermittently updated Java app that indexes file contents and provides rudimentary relevance searching along with regex. I index my docs, put the database it generates into a read-only shared directory, and point systems across OSs at that db so I can search quickly regardless of which box I'm working from (or where). I can also toss the app, db, and docs onto a thumbdrive for portability.

There's a commercial version that gets priority on bugfixes, which makes the free and open version less attractive than it used to be. But it's still one of the better tools for the job when you want more than a grep-equivalent.

[1] http://docfetcher.sourceforge.net/en/index.html
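For anyone curious what "index the contents, then search by relevance or regex" boils down to, here is a minimal Python sketch. It is not DocFetcher's code or API; the function names are made up for illustration, the scoring is a bare term count, and it assumes the text has already been extracted from each file into a list of per-page strings.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index.

    docs maps a file name to a list of page texts (hypothetical input shape).
    Returns {term: Counter({(file name, page number): occurrences})}.
    """
    index = defaultdict(Counter)
    for name, pages in docs.items():
        for page_no, text in enumerate(pages, start=1):
            for term in tokenize(text):
                index[term][(name, page_no)] += 1
    return index


def search(index, query):
    """Rank (file, page) hits by how often the query terms occur there."""
    scores = Counter()
    for term in tokenize(query):
        for location, count in index.get(term, {}).items():
            scores[location] += count
    return [location for location, _ in scores.most_common()]


def regex_search(docs, pattern):
    """Scan the raw page text with a regex; this bypasses the index."""
    rx = re.compile(pattern)
    return [
        (name, page_no)
        for name, pages in docs.items()
        for page_no, text in enumerate(pages, start=1)
        if rx.search(text)
    ]
```

Keeping the page number in every hit is what lets you cite the source by page afterwards. A real indexer would add stemming, better ranking, and an on-disk format you can share between machines.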



Interesting.

At my previous job I created something similar for a recruitment sister company. They had a ton of CVs in all kinds of formats (Word, Excel, PDF, RTF, plain text, etc.). I used Lucene.NET to do the indexing.

Neither company exists anymore, and I've since needed to find some text in docs of my own. If I have a bit of time I could recreate the app pretty easily.



