Whoosh: a fast pure-Python search engine (whoosh.ca)
60 points by bdfh42 on Feb 12, 2009 | hide | past | favorite | 20 comments


... a typical single-term search takes 0.003 seconds in Xappy, while Whoosh takes 0.01 seconds. The Whoosh index was less than 4 MB, while the Xappy/Xapian index was 30 MB on disk.

I should hope it's fast with an index of 4MB.

That aside, it's more of a lookup engine than a search engine. Search means calculating multi-dimensional relevance in real time. Here's a quick, two-dimensional example:

A search for "fender" could be configured to have related keywords: "music", "Gibson", "automobile", and "Ford". The strength of each relation should also be configurable, as it's used to create a score for the weight of each keyword. Then expand each keyword: fender, fenders, music, musical, etc. These expansions are given slightly lower scores than the keywords. Then we consider which fields in the data set to search for these values. Each field has a score. We determine which records to include by making a two-dimensional calculation between the keywords and fields. Finally we sort the result set, based on the same or additional criteria, and return it.
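The keyword-by-field calculation described above could be sketched roughly like this (every weight, field name, and record here is invented for illustration):

```python
# Hypothetical sketch of two-dimensional keyword-by-field scoring.
# All weights, expansions, and records below are made up.

# Related keywords for "fender", each with a configurable relation strength.
KEYWORD_WEIGHTS = {"fender": 1.0, "music": 0.6, "automobile": 0.6}

# Expansions (word forms) score slightly lower than their base keyword.
EXPANSIONS = {"fender": ["fenders"], "music": ["musical"]}
EXPANSION_PENALTY = 0.9

# Each searchable field has its own weight.
FIELD_WEIGHTS = {"title": 1.0, "description": 0.4}

RECORDS = [
    {"title": "Fender Stratocaster", "description": "classic musical instrument"},
    {"title": "Chrome fenders", "description": "automobile body parts"},
]

def expand(keywords):
    """Yield (term, weight) pairs for keywords and their expansions."""
    for kw, w in keywords.items():
        yield kw, w
        for form in EXPANSIONS.get(kw, []):
            yield form, w * EXPANSION_PENALTY

def score(record):
    """Sum keyword weight x field weight over every (term, field) hit."""
    total = 0.0
    for term, kw_weight in expand(KEYWORD_WEIGHTS):
        for field, f_weight in FIELD_WEIGHTS.items():
            if term in record[field].lower():
                total += kw_weight * f_weight
    return total

results = sorted(RECORDS, key=score, reverse=True)
```

Every cell in the keyword-by-field grid contributes independently, which is what makes the relation strengths configurable without re-indexing.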

There are many possible dimensions, including misspellings, synonyms, wildcards, user history, and brands - basically anything. And the data set gets much larger. That's why search is so difficult.

I'm not criticizing Whoosh specifically because other open source "search engine" libraries have the same weaknesses. But it is important to understand the difference between search and lookup.


While Whoosh isn't there yet, I don't think all open source libraries have the same weaknesses.

The two dimensions you described are synonym expansion and stemming. Lucene (and Solr, a search server that uses Lucene) can handle these two dimensions by ordering the analysis of indexed text and query terms appropriately - perform synonym expansion first, then apply stemming. The default scoring algorithm works well in a lot of cases, and Lucene/Solr's architecture allows plugging in a custom scoring implementation to boost individual keywords. Misspellings and wildcard handling are also available out of the box, and you can override most of the functionality easily. Reddit, Netflix, CNET, Digg, etc. have all deployed Solr/Lucene on their public sites (http://wiki.apache.org/solr/PublicServers).
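The ordering point can be shown with a toy pipeline - the synonym table and suffix-stripping "stemmer" below are stand-ins, not Lucene's actual analyzers:

```python
# Why analysis order matters: synonym expansion must run before
# stemming, or the synonym table won't match stemmed forms.
# Toy synonym table and stemmer, purely illustrative.

SYNONYMS = {"fender": ["bumper"]}

def stem(word):
    # Toy stemmer: strips a few common English suffixes.
    for suffix in ("ings", "ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def analyze(tokens):
    # 1. Expand synonyms on the raw surface forms...
    expanded = []
    for tok in tokens:
        expanded.append(tok)
        expanded.extend(SYNONYMS.get(tok, []))
    # 2. ...then stem everything, so "bumpers" in a document and the
    # injected synonym "bumper" collapse to the same index term.
    return [stem(tok) for tok in expanded]
```

If the steps were reversed, "fender" would be stemmed to "fend" before the synonym lookup and the table entry would never fire.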


I'm looking at the source code right now, and it seems Whoosh does use the Porter stemmer to include word forms for English. Not sure yet if it weighs exact matches more heavily.

EDIT: actually it uses query expansion for word forms in English. Took me a while to understand what you meant, probably because English is not my first language and I didn't see the relation between "fender" and "automobile" or "music".

The sort of expansions you mean would be trivially accounted for if the search engine used latent semantic indexing.


Fender is a brand of musical equipment and also another word for the bumper of a car.

A straight relation between "fender" and "music" is a trivial expansion, but it can become complex very quickly. You can expand from "fender" as a keyword and also as a brand. If you are in the business of selling musical equipment, you may have sold certain result placements to a competing brand like Yamaha. In this case, a search for "fender" should produce results that match "music", "guitar", Fender products, etc., but Yamaha, as a paying advertiser, still expects its products to have a prominent ranking.
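That kind of paid-placement override amounts to blending a configurable boost into the organic relevance score. A toy illustration (brands and numbers invented):

```python
# Illustrative only: blending a configurable advertiser boost into
# an organic relevance score, as in the "fender" vs. "yamaha" example.

ADVERTISER_BOOST = {"yamaha": 1.5}  # paid placements, configurable

results = [
    {"brand": "fender", "relevance": 0.9},
    {"brand": "yamaha", "relevance": 0.5},
    {"brand": "gibson", "relevance": 0.6},
]

def final_score(r):
    # Organic relevance multiplied by any paid boost for the brand.
    return r["relevance"] * ADVERTISER_BOOST.get(r["brand"], 1.0)

ranked = sorted(results, key=final_score, reverse=True)
```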

I would agree that it's counter-intuitive. If I search for fender, I want Fender products. However, that is not the only reality, and it's the complexity associated with these edge cases that defines what modern search engines can do.


"I should hope it's fast with an index of 4MB."

But even in that case, it's saying it is slower at search than Xappy (0.01 s vs. Xappy's 0.003 s), and Xappy also had a 7.5x bigger index. It only claims to be "as fast or faster" at indexing (and has no data to support that claim).

I agree with you about search being hard, but it does mention BM25F so it doesn't look like "just" a lookup engine...
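For reference, the core of BM25 (which BM25F extends with per-field term-frequency weighting) fits in a few lines. The corpus and parameters below are illustrative:

```python
import math

# Toy single-field BM25. BM25F generalizes this by weighting term
# frequencies per field before combining. k1 and b are the usual
# defaults; the corpus is made up.

K1, B = 1.2, 0.75

docs = [
    "fender guitars and amplifiers".split(),
    "replacement fender for a ford truck".split(),
    "vintage guitars".split(),
]

N = len(docs)
avgdl = sum(len(d) for d in docs) / N

def idf(term):
    # Inverse document frequency with the standard BM25 smoothing.
    df = sum(term in d for d in docs)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for term in query:
        tf = doc.count(term)
        norm = K1 * (1 - B + B * len(doc) / avgdl)
        score += idf(term) * tf * (K1 + 1) / (tf + norm)
    return score

ranked = sorted(docs, key=lambda d: bm25(["fender"], d), reverse=True)
```

The length normalization means the shorter matching document ranks above the longer one here, which is the behavior a pure lookup wouldn't give you.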


The scoring in Whoosh is used only for sorting the result set. It's not used to determine which records are in that set, which is the primary reason I refer to it as a lookup engine.


OK, fair enough.

But it looks like there's some awareness of these issues in that project, I thought you were trying to pigeonhole this as a glorified hashtable or something. Quoting from the Whoosh docs:

"Like its ancestor Lucene, Whoosh is not really a search engine, it's a toolkit for creating a search engine."

Lucene has support for proximity queries, fuzzy "minimumSimilarity", etc., and it looks like Whoosh wants to be a tool like this.
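Lucene's fuzzy matching is edit-distance based; here's a minimal Levenshtein sketch of the idea (not Lucene's automaton-based implementation, which is far faster):

```python
# Minimal Levenshtein distance via the standard dynamic-programming
# recurrence, illustrating what a fuzzy query matches against.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(query, term, max_edits=2):
    return levenshtein(query, term) <= max_edits

# e.g. fuzzy_match("whoosh", "whosh") is True (one deletion away)
```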

Would "search engine library" be a more appropriate term for you? Something that is used to construct a useful search engine.


Search engine library is better, ya.

However, creating the search engine is the hard problem. It helps to have tools to make some of the outlying problems a bit easier, but the challenge is being able to calculate a configurable set of multi-dimensional scores.

There are three places where scoring takes place: indexing time, run time, and search time. Every solution (commercial and open source) has an indexing process. Most have some run-time processing, but only a few have multi-dimensional scoring at search time, and that is the coup de grâce.
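A hypothetical sketch of how those three stages might combine - stage names, signals, and weights here are all invented:

```python
# Illustrative three-stage scoring; none of this mirrors any
# particular engine's internals.

def index_time_score(doc):
    # Computed once when the document is indexed, e.g. a static
    # quality or popularity signal stored alongside the posting.
    return doc["popularity"]

def run_time_score(doc, query_terms):
    # Cheap per-match signal computed while scanning postings.
    return sum(doc["text"].split().count(t) for t in query_terms)

def search_time_score(doc, query_terms, weights):
    # Final multi-dimensional combination, with weights that can be
    # reconfigured per query without re-indexing.
    return (weights["static"] * index_time_score(doc)
            + weights["tf"] * run_time_score(doc, query_terms))

doc = {"text": "fender guitars fender amps", "popularity": 0.8}
score = search_time_score(doc, ["fender"], {"static": 2.0, "tf": 0.5})
```

The search-time stage is the expensive one, because the weights are free parameters at query time rather than baked into the index.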

When you make the relations configurable and increase the data set to GB or TB you have several incredibly difficult problems. There are commercial products that do it, big and small, but to my knowledge there are no open source projects that can.

In other words, the state of commercial products is so far ahead of open source ones that it's misleading to refer to both as "search engines".


A term like "lookup engine" is certainly not commonly used in the field of information retrieval:

http://www.google.bg/search?q="information+retrieval"+"looku...


I use the term lookup engine because it alludes to something more complex than, as timf put it, a glorified hashtable.


This is why I don't believe in writing libraries in the abstracted language they're intended for. I'd rather see them coded in C, or better, optimized C, or better, pure asm written by someone who knows what they're doing.

You don't need Python to be good at searching; you need to use it to leverage fast libraries.

This is notably a critical failing of other abstracted programming languages. (I won't name names, I don't want to be piked here Vlad Tepes style.)


Languages are not implementations.

Algorithm choice does more for speed than language choice.

It is highly likely that in five years, Python code will run competitively as fast as C, assuming that either language is still in use.


"Python code will run competitively as fast as C" <--- it's highly unlikely that will happen in 50 years, let alone five.

Have you been keeping track of programming history at all? Lisp isn't as fast as Fortran or C, and it's been around for 50 years.

You're delusional, I'm sorry.


Well... we'll have to agree to disagree - and you really should check out the state of the art in Lisp compilers these days.

PS - I've been programming since before Lisp was a teenager.


"I'd rather see them coded in C, or better, optimized C, or better, pure asm written by someone who knows what they're doing."

In most markets, the greatest source of economic value is high-level optimization of the whole system. If the Gods should bless you with overwhelming market demand, then you hire some bitheads to do micro-optimization (or just throw hardware at the problem).

This especially includes search. Plenty of systems have search engines that are lightning fast and come with all the bells and whistles. And are nearly useless because they (1) prioritize by criteria that are wrong for the users, and (2) are written in impenetrable optimized low-level languages that the system maintainer lacks the skills to customize.


Looks very cool...it says it's as fast or faster than PyLucene and Xappy, but it'd be nice to see some concrete benchmarks.


There's definitely some clever coding afoot when a pure-python library can outpace a C wrapper library.


It does not look like it is outpacing them at all.

( http://news.ycombinator.com/item?id=478654 )


Well-written Python basically is a C wrapper.


So is Ruby - that doesn't make it fast.



