Understanding Elasticsearch III


In Part II we covered the basics of the Search API and Query DSL and how to combine them to build complex searches. In this part we will take a look at relevance and how to tweak it to get the best results.

What is Relevance?

By relevance we usually mean the algorithm used to calculate how similar the contents of a full-text field are to a full-text query string. The relevance score of each document is represented by a positive floating-point number called the _score: the higher the _score, the more relevant the document. A query clause generates a _score for each document, and how that score is calculated depends on the type of query clause. A fuzzy query might determine the _score by calculating how similar the spelling of the found word is to the original search term, while a terms query would incorporate the percentage of terms that were found.

The classic similarity algorithm used in Elasticsearch is known as Term Frequency/Inverse Document Frequency, or TF/IDF (Elasticsearch 5.0 and later default to BM25, which builds on the same ideas). It takes the following factors into account:

Term frequency: How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
Inverse document frequency: How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more uncommon terms.
Field length norm: How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.

Finally, when multiple query clauses are combined using a compound query like the bool query, the _score values from these query clauses are combined to calculate the overall _score for the document.

Note that these three factors (term frequency, inverse document frequency, and field length norm) are calculated and stored at index time.
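
To make the three factors concrete, here is a minimal Python sketch. It assumes the formulas of Lucene's classic TF/IDF similarity (square root of the term count, a logarithmic idf, and an inverse square root length norm). The function names are made up for illustration, and the real engine applies further steps (such as query normalization and a lossy encoding of the norm) that are left out here.

```python
import math

def term_frequency(freq: int) -> float:
    # tf: grows with the number of times the term appears in the field
    return math.sqrt(freq)

def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    # idf: shrinks as more documents in the index contain the term
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_length_norm(num_terms: int) -> float:
    # norm: shrinks as the field gets longer
    return 1.0 / math.sqrt(num_terms)

def field_weight(freq: int, num_terms: int, num_docs: int, doc_freq: int) -> float:
    # Weight of one term in one field: the product of the three factors
    return (
        term_frequency(freq)
        * inverse_document_frequency(num_docs, doc_freq)
        * field_length_norm(num_terms)
    )

# The same term, appearing once, in a 3-term title and in a 100-term content field
title_weight = field_weight(freq=1, num_terms=3, num_docs=1000, doc_freq=10)
content_weight = field_weight(freq=1, num_terms=100, num_docs=1000, doc_freq=10)
print(title_weight > content_weight)
```

Under these assumptions the short title field gets the larger weight, which matches the field length norm described above.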

Controlling Relevance

Field Boosting

We can give a certain field more weight than the others by applying the boost parameter at search time, like this:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "jumping at the start",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": "jumping at the start"
          }
        }
      ]
    }
  }
}

This makes the title field twice as important as the content field. This type of boosting is called query-time boosting and is the main tool that you can use to tune relevance. Any type of query accepts a boost parameter. Setting a boost of 2 doesn’t simply double the final _score; the actual boost value that is applied goes through normalization and some internal optimization. However, it does imply that a clause with a boost of 2 is twice as important as a clause with a boost of 1.
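
When several fields need different weights, the request body can be generated instead of written by hand. The following Python sketch builds the same kind of bool query as above from a mapping of field names to boosts. The helper name boosted_match_query is hypothetical, and the sketch only constructs the JSON body; sending it to a cluster is left out.

```python
import json

def boosted_match_query(text: str, field_boosts: dict) -> dict:
    """Build a bool query with one boosted match clause per field."""
    should = []
    for field, boost in field_boosts.items():
        # Each clause searches one field and carries its own query-time boost
        should.append({"match": {field: {"query": text, "boost": boost}}})
    return {"query": {"bool": {"should": should}}}

# title counts twice as much as content (a boost of 1 is the default)
body = boosted_match_query("jumping at the start", {"title": 2, "content": 1})
print(json.dumps(body, indent=2))
```

Keeping the boosts in one mapping makes it easier to experiment with different values while tuning relevance.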

Boosting Query Clause

The bool query isn’t restricted to combining simple one-word match queries. It can combine any other query, including other bool queries. It is commonly used to fine-tune the relevance _score for each document by combining the scores from several distinct queries.

For example, suppose we want to search for documents about "full text search" but give more weight to documents which also mention "Elasticsearch". We can write a bool query as follows:

GET /_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": {
            "query": "full text search",
            "operator": "and"
          }
        }
      },
      "should": [
        { "match": { "content": "Elasticsearch" } }
      ]
    }
  }
}
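
The effect of the should clause on ranking can be pictured with a small Python sketch. It is a simplification, assuming the clause scores are simply added together and ignoring any normalization the engine applies; the function name bool_score and the sample scores are invented for illustration.

```python
def bool_score(must_scores, should_scores):
    """Combine clause scores for one document.

    A score of None means the clause did not match the document.
    """
    # Every must clause has to match, otherwise the document is excluded
    if any(score is None for score in must_scores):
        return None
    # should clauses are optional: each one that matches adds to the total
    return sum(must_scores) + sum(s for s in should_scores if s is not None)

# Both documents match "full text search"; only the first mentions "Elasticsearch"
with_elasticsearch = bool_score(must_scores=[1.2], should_scores=[0.5])
without_elasticsearch = bool_score(must_scores=[1.2], should_scores=[None])
print(with_elasticsearch > without_elasticsearch)
```

Documents that fail the must clause are dropped entirely, while the should clause only changes the order of the documents that remain.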

Another way to adjust scores is the boosting query, which takes a positive and a negative query. It still includes results that match the negative query, but downgrades them so that they rank lower. For example, suppose we search for the word "apple" and want results associated with the Apple company to score higher than results associated with the fruit or with pastries. We could do it like this:

GET /_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "text": "apple"
        }
      },
      "negative": {
        "match": {
          "text": "pie tart fruit crumble tree"
        }
      },
      "negative_boost": 0.5
    }
  }
}

Results associated with the fruit would likely include words like pie, tart, fruit, crumble, and tree. We can steer the results toward the company by putting these words in the negative query: any document that also matches it is downgraded by multiplying its original _score by the negative_boost. For this to work, the negative_boost must be less than 1.0.
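
The downgrade step can be sketched in a few lines of Python. The document names and scores below are invented, and the function boosting_score is a hypothetical helper that only mirrors the multiplication described above.

```python
def boosting_score(positive_score: float, matches_negative: bool,
                   negative_boost: float = 0.5) -> float:
    # Documents that also match the negative query keep their place in the
    # results, but their score is multiplied by negative_boost
    if matches_negative:
        return positive_score * negative_boost
    return positive_score

# (name, score from the positive query, matches the negative query?)
docs = [
    ("apple pie recipe", 2.0, True),
    ("Apple product launch", 1.5, False),
]

ranked = sorted(
    docs,
    key=lambda doc: boosting_score(doc[1], doc[2]),
    reverse=True,
)
print([name for name, _, _ in ranked])
```

With a negative_boost of 0.5, the recipe drops from 2.0 to 1.0 and ranks below the product launch, yet it is still returned.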

Function Score

The function_score query is the ultimate tool for taking complete control over the scoring process. It allows you to apply a function to each document that matches the main query in order to alter or completely replace the original query _score. It supports a number of predefined functions out of the box:

weight: Apply a simple boost to each document without the boost being normalized: a weight of 2 results in 2 * _score.
field_value_factor: Use the value of a field in the document to alter the _score, such as factoring in a popularity count or number of votes.
random_score: Apply consistently random sorting to your results to ensure that documents that would otherwise have the same score are randomly shown to different users, while maintaining the same sort order for each user.
linear, exp, gauss: Incorporate sliding-scale values like publish_date, geo_location, or price into the _score to prefer recently published documents, documents near a lat-lon point, or documents near a specified price point.
script_score: Use a custom script to take complete control of the scoring logic. If your needs extend beyond those of the functions listed above, write a custom script to implement the logic that you need.
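
To give a feel for the linear, exp, and gauss functions, here is a Python sketch of the three decay curves for a numeric field such as price. It assumes the parameters Elasticsearch uses for decay functions (origin, scale, offset, and decay), which the list above does not spell out, and it follows the formulas as given in the reference documentation as far as I know them. Treat it as an illustration, not as the engine's implementation.

```python
import math

def _distance(value: float, origin: float, offset: float) -> float:
    # Values within `offset` of the origin are not penalized at all
    return max(0.0, abs(value - origin) - offset)

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    # Bell-shaped curve: falls slowly at first, then faster
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    d = _distance(value, origin, offset)
    return math.exp(-(d ** 2) / (2.0 * sigma_sq))

def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    # Exponential curve: falls fastest right next to the origin
    lam = math.log(decay) / scale
    return math.exp(lam * _distance(value, origin, offset))

def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    # Straight line that reaches zero at a finite distance
    s = scale / (1.0 - decay)
    return max((s - _distance(value, origin, offset)) / s, 0.0)

# Prefer prices near 100; at a distance of `scale` the factor equals `decay`
for price in (100, 105, 110, 130):
    print(price, gauss_decay(price, origin=100, scale=10))
```

All three curves return 1.0 at the origin and the chosen decay value at a distance of scale from it; they differ only in how quickly they fall off in between and beyond.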

For example, imagine that we have a website hosting blog posts where users can upvote the posts that they like. We would like more popular posts to appear higher in the results list, but still have the full-text score as the main relevance driver. We can do this easily by storing the number of votes with each blog post. At search time, we can use the function_score query with the field_value_factor function to combine the number of votes with the full-text relevance score:

GET /blogposts/post/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "popularity",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "votes"
      }
    }
  }
}

The function_score query wraps the main query and the function we would like to apply. The main query is executed first, then the field_value_factor function is applied to every document matching the main query. Every document must have a number in the votes field for the function_score to work. The new score for each document will be calculated as: new_score = old_score * number_of_votes. field_value_factor also has a modifier parameter that we can use to smooth out the votes value by setting it to one of the available options: none (default), log, log1p, log2p, ln, ln1p, ln2p, square, sqrt and reciprocal.
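
The effect of the modifier is easier to see with numbers. The Python sketch below mirrors the calculation, assuming that the log variants use the base-10 logarithm and the ln variants the natural logarithm, and that the result is multiplied with the original score. The factor argument corresponds to an optional multiplier on the field value that is not used in the request above; the helper itself is hypothetical.

```python
import math

# One function per modifier name listed above
MODIFIERS = {
    "none": lambda v: v,
    "log": lambda v: math.log10(v),
    "log1p": lambda v: math.log10(v + 1),
    "log2p": lambda v: math.log10(v + 2),
    "ln": lambda v: math.log(v),
    "ln1p": lambda v: math.log(v + 1),
    "ln2p": lambda v: math.log(v + 2),
    "square": lambda v: v ** 2,
    "sqrt": lambda v: math.sqrt(v),
    "reciprocal": lambda v: 1.0 / v,
}

def field_value_factor(old_score: float, votes: float,
                       modifier: str = "none", factor: float = 1.0) -> float:
    # new_score = old_score * modifier(factor * votes)
    return old_score * MODIFIERS[modifier](factor * votes)

# Compare the raw vote count with a smoothed one
for votes in (1, 10, 100, 1000):
    raw = field_value_factor(1.0, votes)
    smoothed = field_value_factor(1.0, votes, modifier="log1p")
    print(votes, raw, round(smoothed, 3))
```

Without a modifier, a post with 1000 votes scores 1000 times higher than a post with one vote and the full-text score hardly matters anymore. A modifier such as log1p compresses that range, so popularity nudges the ranking instead of dominating it.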

Conclusion

With this basic knowledge of how to control relevance in place, you can tweak your search queries to get the most relevant results to your heart's content. In my opinion, tweaking relevance is the hardest thing to learn and get right when working with Elasticsearch, and it can get really frustrating at some point, but whatever you do, remember to have fun in the process.
