10 tips for better search queries in Apache Solr
On September 16, 2017 by Ilene
Apache Solr is an open source search engine at heart, but it is much more than that. It is a NoSQL database with transactional support. It is a document database that offers SQL support and executes it in a distributed manner.
Previously, I’ve shown you how to create and load a collection into Solr; you can load that collection now if you haven’t done so already. (Full disclosure: I work for Lucidworks, which employs many of the key contributors to the Solr project.)
In this post, I’ll show you 10 more things you can do with that collection:
1. Filter queries
Consider this query:
On its face, this query looks similar to running q=Provider_State:NC. However, a filter query only narrows the result set to a list of matching document IDs, and it doesn’t affect the score. Filter queries are also cached. This is a good way to find the most relevant results for q=blue suede within department:footwear as opposed to across every department.
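To make the q versus fq difference concrete, here is a minimal sketch that builds both request URLs. The host, port, and the collection name "mycollection" are assumptions for illustration; only the Provider_State field comes from the example data.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance and a hypothetical collection name.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"


def select_url(params):
    """Build a /select URL from a list of (name, value) pairs."""
    return SOLR_SELECT + "?" + urlencode(params)


# Scored version: the state restriction is part of the main query.
scored = select_url([("q", "Provider_State:NC")])

# Filtered version: match everything, then restrict with a cached,
# non-scoring filter query.
filtered = select_url([("q", "*:*"), ("fq", "Provider_State:NC")])

print(scored)
print(filtered)
```

Both URLs should select the same documents; the difference lies in scoring and in whether the restriction can be reused from the filter cache.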
2. Faceting
Try this query:
The following is returned at the top:
Faceting gives you your category counts (among other things). If you’re implementing a retail site, this is how you provide categories and category counts for departments or other ways that you divide your inventory.
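As a rough sketch of what a field-facet request can look like, the snippet below asks for counts per Provider_State and includes a small helper for reading them back. The collection name is hypothetical, and the helper assumes Solr's JSON response lists facet values as a flat, alternating [value, count, ...] array, which I believe is the default layout.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance and a hypothetical collection name.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"

facet_url = SOLR_SELECT + "?" + urlencode([
    ("q", "*:*"),
    ("rows", "0"),                      # only the counts are needed, not the documents
    ("facet", "on"),
    ("facet.field", "Provider_State"),  # one count per distinct state
])


def pair_counts(flat):
    """Turn [value, count, value, count, ...] into {value: count}."""
    return dict(zip(flat[::2], flat[1::2]))


print(facet_url)
```

On a retail site, the dictionary returned by pair_counts is what you would render as the category list with its counts.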
3. Range faceting
Add this to a query string:
This range faceting can help divide up a numeric field into categories of ranges. If you’re helping someone find a laptop in the $2,000-$3,000 range, this is for you. You can do a similar query without hard-coding the ranges by doing this instead:
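The two approaches can be sketched as follows. This is an illustration rather than the exact query strings from this post: the collection name and the numeric bounds are assumptions, while facet.query and the facet.range parameters are standard Solr faceting parameters.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance and a hypothetical collection name.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"
FIELD = "Average_Total_Payments"

# Hard-coded ranges: one facet.query per bucket you care about.
ranges_url = SOLR_SELECT + "?" + urlencode([
    ("q", "*:*"),
    ("facet", "on"),
    ("facet.query", f"{FIELD}:[2000 TO 3000]"),
    ("facet.query", f"{FIELD}:[3000 TO 4000]"),
])

# Generated ranges: Solr slices start..end into buckets of size gap.
auto_url = SOLR_SELECT + "?" + urlencode([
    ("q", "*:*"),
    ("facet", "on"),
    ("facet.range", FIELD),
    ("facet.range.start", "0"),      # illustrative bounds, not from the data
    ("facet.range.end", "10000"),
    ("facet.range.gap", "1000"),
])

print(ranges_url)
print(auto_url)
```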
4. DocValues
In your schema, make sure the docValues attribute is enabled for fields that you are faceting on. This optimizes the field for these sorts of searches and saves on memory at query time, as shown in this schema.xml excerpt:
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
5. Pseudo-fields
You can perform operations on your data and return the computed value. Try this:
The example uses some of Solr’s built-in functions to categorize providers as expensive or inexpensive based on the average total payments. I put price_category:if(min(0,sub(Average_Total_Payments,5000)),"inexpensive","expensive") in the fl, or field list, along with two other fields.
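The nested function call is easier to follow when unrolled. The sketch below mirrors the expression step by step in plain Python; it does not call Solr, and it assumes Solr's if() treats a numeric result of 0 as false and any other number as true.

```python
def solr_if(test, value_if_true, value_if_false):
    """Mimic Solr's if(): a numeric test of 0 is false, anything else is true."""
    return value_if_true if test != 0 else value_if_false


def price_category(average_total_payments):
    # sub(Average_Total_Payments,5000)
    difference = average_total_payments - 5000
    # min(0, difference): negative below 5000, exactly 0 at 5000 and above
    test = min(0, difference)
    # if(test,"inexpensive","expensive")
    return solr_if(test, "inexpensive", "expensive")


for amount in (4000, 5000, 9000):
    print(amount, price_category(amount))
```

So anything under 5,000 is labeled inexpensive, and 5,000 or more is labeled expensive.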
6. Query parsers
defType lets you pick one of Solr’s query parsers. The default Standard Query Parser is really good for specific machine-generated queries. But Solr also has the Dismax and eDismax parsers, which are better for queries typed by ordinary users: You can click one of them at the bottom of the admin query screen or add defType=dismax to your query string. The Dismax parser generally produces better results for user-entered queries by finding the “disjunction maximum,” that is, the best-scoring field match for each term, and using that as the basis of the score.
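The “disjunction maximum” idea can be written down in a few lines. This is a sketch of the general formula, not Solr's code: the per-field scores are made-up inputs, and tie corresponds to the Dismax tie parameter, which controls how much the non-winning fields contribute.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score, plus a tie-weighted share of the other fields' scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores for one term matched in three different fields.
scores = [1.0, 3.0, 2.0]

print(dismax_score(scores))           # pure maximum: only the best field counts
print(dismax_score(scores, tie=0.1))  # other fields add a small bonus
```

With tie=0 a document is scored by its single best field, which keeps a term that appears weakly in many fields from outranking a strong match in one.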
7. Boosting
If you search Provider_State:AL^5 OR Provider_State:NC^10, results in North Carolina will be scored higher than results in Alabama. You can do this in your query (q=""). This is an important way to manipulate the results returned.
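If the boosts come from application code rather than being typed by hand, a small helper can assemble the clause. This is only a sketch: the helper name is mine, and it does no escaping of special characters in the values.

```python
def boosted_or(field, boosts):
    """Join field:value^boost clauses with OR, in the order given."""
    return " OR ".join(
        f"{field}:{value}^{boost}" for value, boost in boosts.items()
    )


q = boosted_or("Provider_State", {"AL": 5, "NC": 10})
print(q)  # Provider_State:AL^5 OR Provider_State:NC^10
```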
8. Date ranges
Although the example data doesn’t support any date-range searches, if it did, a range query would be formatted like timestamp_dt:[2016-12-31T17:51:44.000Z TO 2017-02-20T18:06:44.000Z]. Solr supports date type fields and date type searches and filtering.
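Getting the timestamp format right by hand is error-prone, so here is a minimal sketch that produces the range clause above from Python datetime objects. It assumes timezone-aware datetimes and converts them to UTC, since the trailing Z marks the time as UTC.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format an aware datetime as UTC with millisecond precision."""
    dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


def date_range(field, start, end):
    """Build an inclusive field:[start TO end] range clause."""
    return f"{field}:[{solr_date(start)} TO {solr_date(end)}]"


start = datetime(2016, 12, 31, 17, 51, 44, tzinfo=timezone.utc)
end = datetime(2017, 2, 20, 18, 6, 44, tzinfo=timezone.utc)
print(date_range("timestamp_dt", start, end))
```

The resulting string can be used in either q or fq; a filter query is usually the better fit for a date window, because it doesn't need to influence the score.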
9. TF-IDF and BM25
The original scoring mechanism that Solr used (to determine which documents were relevant to your search term) is called TF-IDF, short for “term frequency, inverse document frequency.” It weighs how frequently a term occurs in your field or document against how frequently that term occurs overall in your collection. The problem with this algorithm is that having “Game of Thrones” occur 100 times in a 10-page document versus 10 times in a 10-page document doesn’t make the document 10 times more relevant. It makes it more relevant, but not 10 times more relevant.
BM25 smooths this process, effectively letting documents reach a saturation point, after which the impact of additional occurrences is mitigated. Solr has used BM25 by default since version 6.
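The saturation effect is easy to see numerically. The sketch below implements only the term-frequency part of the textbook BM25 formula (no IDF), with the commonly used defaults k1=1.2 and b=0.75; the document lengths are arbitrary and equal, so length normalization plays no role here.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of BM25; it can never exceed k1 + 1."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)


# Same-length documents, one with 10 occurrences and one with 100.
ten = bm25_tf(10, doc_len=5000, avg_doc_len=5000)
hundred = bm25_tf(100, doc_len=5000, avg_doc_len=5000)

print(ten, hundred, hundred / ten)
```

Ten times as many occurrences raises this component by only about 10 percent, which is the saturation point at work.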
10. debugQuery
In the Admin Query console, you can check debugQuery to add debugQuery=on to the Solr query string. If you inspect the results, you’ll find this output:
Among other things, you can see that it is using the LuceneQParser (the name of the standard query parser) and, above that, how each result was scored. You see the BM25 algorithm itself and how boosts affected the scoring. If you’re trying to debug your search, this is a very valuable tool!
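If you would rather read the debug section from code than from the admin console, something like the following sketch could work. The collection name is hypothetical, and the helper assumes the JSON response carries a "debug" object with "QParser" and "explain" entries, which is how I understand the debugQuery output to be laid out.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance and a hypothetical collection name.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"

debug_url = SOLR_SELECT + "?" + urlencode([
    ("q", "Provider_State:AL^5 OR Provider_State:NC^10"),
    ("debugQuery", "on"),
    ("wt", "json"),
])


def debug_summary(response):
    """Return (parser name, per-document explanations) from a parsed response."""
    debug = response.get("debug", {})
    return debug.get("QParser"), debug.get("explain", {})


print(debug_url)
```

Pass debug_summary the dictionary you get from parsing the response body as JSON; a response without a debug section simply yields (None, {}).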
These ten aspects of Solr certainly help me when using Solr for search and tuning my results.