The Tower of Babel by Pieter Brueghel the Elder, public domain.

The Wikimedia Foundation’s Search Platform team recently worked with Daniel Worley and Doug Turnbull from Open Source Connections on a Learning to Rank plugin for Elasticsearch, the software the powers search on Wikimedia sites, designed to apply machine learning to search relevance ranking. We recently chatted with Erik Bernhardson, a software engineer at the Wikimedia Foundation, on how the plugin will help improve search results across language wikis.

Q: Erik, I know you initially used your 10% time to think through this idea. Can you share a little on how that led to thinking about a plug-in?

Erik: Initially I started think about this due to a talk I saw at the Lucene/Solr Revolution 2015 conference. I had also been pondering it a bit because Justin Ormont, an advisor to the search team, had suggested that machine learning could be a reasonable way to solve a subset of our search optimization problems.

One of the problems with optimizing search rankings is that we have to manually choose weights that decide how important various features are. Adding new features means manually retuning the weights. As my colleague Trey Jones noted in a recent blog post on search relevance, those features might include:

“How common each individual word is overall. (As the most common word in the English language, theis probably less important to a query than somewhat rarer friggatriskaidekaphobia.)

How frequently each word appears in a given article. (Five matches is probably better than four matches, right?)

Whether there are any matching words in the title or a redirect. (Well, if you search for just the, an article on English’s definite article does look pretty good.)

How close the words are to each other in the article or title. (Why is there a band called “The The”?)

Whether words match exactly or as related forms. (The query term resume also matches both resuming and résumé.)

How many words are in the query.

How many words are in the article. (Okay, maybe five matches in a twenty thousand word article might be worse than four matches in a five hundred word article.)

How often an article gets read. (Popular articles are probably better.)

How many other articles link to an article. (Did I mention that popular articles are probably better?”

Applying machine learning to this problem puts the complicated question of how to weigh these bits of evidence in the hands of a machine.The machine is much better equipped to handle optimizing the tradeoffs of, for example, increasing the importance given to how well the search query matches the categories of an article, as it is able to take into account what effect that has on a large sampling of queries. Machine learning is additionally able to apply non-linear transformations. We do this to some extent ourselves with the traditional search ranking, but it’s much easier for a computer.

In my 10% time I built out an offline machine learning pipeline. This is something where i can feed clickthrough data in one end, and at the other end it will output example result result pages for search queries. This was useful for proving machine learning was a workable idea, but rather useless for answering a user’s search query in a fraction of a second. That’s why we had the idea to build a plugin for Elasticsearch that was able to store machine learned models and apply them in response to user search queries.

Tell me a little bit about the Elasticsearch Learning To Rank plugin—what does it do, and what problem does it solve?

Erik: The Elasticsearch Learning To Rank plugin primarily allows us to apply a machine learned ranking algorithm for ranking search results. This is not an end-to-end solution, because collecting data for the machine learning to evaluate, deciding what features to provide to the algorithm, and training the actual models is all handled separately.

But this plugin provides a couple of important pieces of the puzzle. It acts as a data store for features—that is, definitions of how to calculate the individual data points that the machine learning utilizes—and models. It provides support for collecting feature vectors needed for training the model. And it performs the final, critical, step of actually using that model to rank queries in response to a user’s search request.

This is certainly not the only way to rank queries, but we have some unique constraints that make it particularly enticing. We run many wikis in many different languages. Ideally the search algorithm would be tuned for each site individually. The appropriate importance of various features may vary between Arabic Wikipedia, English Wikipedia, and German Wikipedia, for example. There are even larger differences between projects, such as Wikipedia and Wiktionary, even in the same language. But we have limited engineering resources devoted to search—and it is next to impossible for us to hand tune the search algorithm for each site.

In the past, we have tuned the search algorithm for English Wikipedia, and every other project and language gets that same tuning regardless of how well it works for their particular site. With machine learning, however, we can learn a model for each wiki that has enough search traffic for us to look at logs of searches and clicks to use statistical modeling to determine what results are preferred.

You mentioned parts of the machine learning that are not handled by the plugin, how does that work?

Erik: We have built up an additional project, highly specialized to our use case, called MjoLniR. MjoLniR does three main things: It applies a User Browsing Model to transform logs of searches and clicks into a relevance score for query/article pairs, It communicates with the Elasticsearch LTR plugin to collect feature vectors for query/article pairs, and it performs training and hyperparameter optimization of ML models in our analytics cluster using XGBoost.

While all of those parts are important, and we couldn’t move forward without them, the most important part of any machine learning process is the training data — a query, an article, and a relevance score denoting how well the query matches the article—used to train the model. We are currently building our training data based on user clickthroughs on search result pages. Clickthroughs from search to an article page are logged and stored for 90 days, per our privacy policy. We then apply a bit of clustering to group together queries that are similar but not exactly the same and apply a user browsing model to transform those clickthroughs into a relevance score.

For example, love, Love and LOVE go together, and loving, the lovings, The lovings, the loving and loving] get grouped together (etc). These groups of clickthroughs are fed into an algorithm called a DBN, or more generally a User Browsing Model. This model takes into account expected user behavior on search pages, such as expecting that if a user clicks the third search result, then they probably also looked at the first and second and rejected them.

The DBN also considers that if the user clicked the third result, then comes back to the search page to click on the fifth result, then the third result is attractive and more suitable than the first or second results, but that the fifth result is probably better for some reason. The DBN aggregates together user behavior for the groups of queries to estimate the relevancy of article and query pairs. We currently require at least ten distinct search sessions in each group of queries, but will be testing variations to determine how many sessions are actually required for reliable relevance data. For more information, I suggest reading Olivier Chapelle and Ya Zhang’s paper “A dynamic bayesian network click model for web search ranking.”

I’m curious about how the search algorithm might differ based on language—can you go into more detail about the ways in which they might be different?

Erik: It’s hard to explain exactly, but the various scores we compose together to create the final score per article/query pair do not have the same distribution across different wikis. Because the distribution varies, the optimal way to combine those values into a final score also varies. Perhaps one way to think of this would be the popularity score we use as a ranking signal—which is generally about the second strongest signal, right after how well the query matches the article title. This popularity score is the percentage of page views over the course of a week that went to that particular article.

Wikis with more articles are going to have lower average values for this, so the algorithm needs to apply slightly different weights. Because this gets a bit complicated we never actually started using the popularity score on wikis other than English Wikipedia prior to applying machine learning. There are of course statistical methods to handle this without going all the way to machine learning; a bit of math could be applied to re-center the data and make it look relatively similar between the wikis, but it would still fundamentally differ between languages such that no exact weight would be correct for all wikis.

Can anyone use this plug-in and if so, how? How can they get more deeply involved in the project?

Erik: Anyone using Elasticsearch, a popular open source search engine, can use the plugin. As mentioned above though the plugin is only the final step. The best way to get involved is to visit the project page on GitHub and try it out. There is a demo included in the repository that demonstrates a full pipeline of building a training set, training a model, loading the model into Elasticsearch, and then ranking search results with the model.

 

How do you know the machine learning results are as good as the hand-tuned results? How much better are they?

Erik: There are a variety of evaluation metrics that can be applied both offline, with judgement lists, and online, with A/B tests. For offline testing we use the same judgement lists that are used to train the model with cross validation to estimate the performance of the model.

Currently we use judgement lists generated by applying statistical models of user behavior to clickthroughs on queries. We are looking to augment this with human judgements via the relevance surveys currently being run on article pages. The Offline evaluation of the machine learning models with our first generation feature set show an improvement of between 20% and 40% of the possible improvement (this varies per-wiki) on the NDCG@10 evaluation metric.

Our online evaluation of the machine learning models has only been completed on English Wikipedia thus far, but we are currently running A/B tests on 18 more language Wikipedias which represent almost all wikis with > 1% of full text search volume.

On English Wikipedia the A/B test showed a clickthrough rate of 106% of the baseline, and session abandonment (sessions that search but don’t click on any results) at 98% of the baseline. While these numbers are not particularly huge, they only represent our initial foray into machine learned ranking. The goal of the very first model is to match or slightly exceed the quality of the existing search results, as there are quite a few pieces to build and tie together to get to this point. Even matching the existing performance means the process of collecting click data, generating judgement lists, collecting feature vectors, training models, and running those models in production is all working correctly. With all those pieces working it is now significantly easier for us to evaluate new ranking signals and improve the scoring function going forward.

Does the machine learning model get out of date?

Erik: Yes. Over time the distribution of feature values changes due to changes in the content of the wiki and the model needs to be retrained with the new feature values. The judgement lists used to train the model also change over time, as we aggregate together the last 90 days of search results and clickthroughs to build the judgement lists. On the largest wikis the percentage of content that is changed is low, relative to the total amount of content, so the feature values don’t change too much. We haven’t been running the machine learned ranking for long enough to say how much the judgement lists change over time, but I expect there to be some variation for two reasons: One is that user behavior can change over time, and second is that the results we return to users change over time, so the data the algorithms have to work with changes.

Do different people get different results?

Erik: While personalization of results can be a feature of machine learned ranking, we don’t apply any form of personalization at the user level or at higher levels, such as by geographic region. With respect to single user personalization, we don’t associate the data used for training, such as web requests or searches, to uniquely identifiable information that would allow for aggregating months of data together for individual users. This is a conscious decision to intrude as little as possible on user privacy while building out a ranking system that learns from user behavior.

Melody Kramer, Senior Audience Development Manager, Communications
Wikimedia Foundation

Thank you to Mairead Whitford Jones for asking really good questions for this post.