Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

What are readers looking for? Wikipedia search data now available

(Update 9/20 17:40 PDT) It turned out that a small percentage of queries contained information that users had entered unintentionally. For example, some users may have pasted unintended clipboard content into the search box, which then appeared in the datasets. This prompted us to withdraw the files.

We are looking into the feasibility of publishing search logs at an aggregated level, but until further notice we do not plan to publish this data.

Diederik van Liere, Product Manager Analytics

I am very happy to announce the availability of anonymous search log files for Wikipedia and its sister projects, as of today. Collecting data about search queries is important for at least three reasons:

  1. it provides valuable feedback to our editor community, who can use it to detect topics of interest that are currently insufficiently covered.
  2. we can improve our search index by benchmarking improvements against real queries.
  3. we give outside researchers the opportunity to discover gems in the data.

Peter Youngmeister (Ops team) and Andrew Otto (Analytics team) have worked diligently over the past few weeks to start collecting search queries. Every day, starting today, we will publish the search queries for the previous day at: http://dumps.wikimedia.org/other/search/ (we expect to have a three-month rolling window of search data available).

Each line in the log files is tab-separated and contains the following fields:

  1. Server hostname
  2. Timestamp (UTC)
  3. Wikimedia project
  4. URL-encoded search query
  5. Total number of results
  6. Lucene score of best match
  7. Interwiki result
  8. Namespace (coded as integer)
  9. Namespace (human-readable)
  10. Title of best matching article
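
As an illustration of the layout above, here is a minimal Python sketch that splits one log line into its ten fields and decodes the search query. The names `SearchLogEntry` and `parse_line` and the sample line are made up for this example. The sketch assumes that spaces in queries may be encoded as either `%20` or `+`, and it keeps every other field as a plain string, since the exact formats of the timestamp, score and result count are not described here.

```python
from typing import NamedTuple
from urllib.parse import unquote_plus


class SearchLogEntry(NamedTuple):
    """One log line, with fields in the order listed above (hypothetical type)."""
    hostname: str
    timestamp: str        # UTC; kept as text because the exact format is not specified
    project: str
    query: str            # URL-decoded search query
    total_results: str
    best_score: str       # Lucene score of the best match
    interwiki: str
    namespace_id: str
    namespace_name: str
    best_title: str


def parse_line(line: str) -> SearchLogEntry:
    """Split a tab-separated log line and decode the query field."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 tab-separated fields, got {len(fields)}")
    # Assumption: '+' and '%20' both stand for a space in the encoded query.
    fields[3] = unquote_plus(fields[3])
    return SearchLogEntry(*fields)


# Invented sample line, for illustration only; real values may look different.
SAMPLE_LINE = (
    "search1\t2012-09-19 00:00:01\tenwiki\tNew%20York%20City\t"
    "1200\t7.5\t\t0\tMain\tNew York City\n"
)

if __name__ == "__main__":
    entry = parse_line(SAMPLE_LINE)
    print(entry.project, entry.query, entry.best_title)
```

Keeping the numeric fields as strings avoids guessing how empty or unusual values are written; convert them once you have inspected a real file.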

The log files contain queries for all Wikimedia projects and all languages and are unsampled and anonymous. You can download a sample file. We collect data both from the search box on a wiki page, after the visitor submits the query, and from queries submitted from Special:Search pages. The search log data does not contain queries from the autocomplete search functionality, because this generates too much data.

Anonymous means that there is nothing in the data that allows you to map a query to an individual user: there are no IP addresses, no editor names, and not even anonymous tokens in the dataset. We also discard queries that contain email addresses, credit card numbers and social security numbers.
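
To give a rough idea of what discarding such queries can involve, here is a minimal Python sketch of a pattern-based filter. It is not the filter used to produce the log files, which is not described here; the function names and regular expressions are assumptions chosen for illustration. Simple patterns like these miss many kinds of personal information.

```python
import re

# Illustrative patterns only; all three are assumptions, not the production rules.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")          # US format, e.g. 123-45-6789
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")      # 13 to 19 digits, optional separators


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_sensitive(query: str) -> bool:
    """Heuristically flag queries that contain an email address, SSN or card number."""
    if EMAIL_RE.search(query) or SSN_RE.search(query):
        return True
    for match in CARD_RE.finditer(query):
        digits = re.sub(r"\D", "", match.group())
        if luhn_ok(digits):
            return True
    return False


def keep_safe(queries):
    """Return only the queries that are not flagged by looks_sensitive."""
    return [q for q in queries if not looks_sensitive(q)]
```

The Luhn check reduces false positives on long digit runs that are not card numbers, but a filter like this cannot catch arbitrary text pasted into the search box by accident.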

It’s our hope that people will use this data to build innovative applications that highlight topics that Wikipedia is currently not covering, improve our Lucene parser, or uncover other hidden gems within the data. We know that most people use external search engines to search Wikipedia because our own search functionality does not always give the same accuracy, and the new data could help to give it a little bit of much-needed TLC. If you’ve got search chops, have a look at our Lucene external contractor position.

We are making this data available under a CC0 license: this means that you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. But we do appreciate it if you cite us when you use this data source for your research, experimentation or product development.

Finally, please consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.

Diederik van Liere, Product Manager Analytics

(Update 9/19 20:20 PDT) We’ve temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries.
