Wikimedia blog

News from inside the Wikimedia Foundation.org

Posts by Tomasz Finc

Update on Offline Wikimedia projects

Greetings,

With the annual fundraiser wrapping up, two sections of Wikimedia engineering are going to start moving more quickly: Mobile and Offline. The offline ecosystem has a lot of moving parts and it’s easy to get lost. The Wikimedia Foundation is currently focusing on three main areas of intervention: selection tools, file formats and offline apps.

Right now, “Offline” refers to supporting read access to Wikimedia content without an internet connection; increasing reach was identified during the Wikimedia strategic planning process as one of the movement priorities, and the first recommendation of the Offline task force was to “Simplify reuse of content from WMF projects”.

The first step in making Wikimedia content available offline is to select it. The Wikipedia Version 1.0 Editorial Team has been steadily releasing new versions of their beta Wikipedia collections, but technical limitations have hampered how quickly those can be finished. We’re going to evaluate the team’s tool set to see how to support them.

For example, we’re looking at extending the Wikipedia Release Version Tools to add features like sub-selection and comments (see an example of how the tool works for the Physics project).

Once the content has been selected, it needs to be packaged into a standard file format. The openZim format is an actively developed format for offline Wikipedia content, and we want to facilitate its integration into our general architecture.

Our first step is going to be the enhancement of the Collections extension to support openZim. This will be done by our partners from PediaPress, who have already started to work on it. They will need other community members to help test the new openZim files created by the extension.

After selection and packaging, the last remaining piece is the application that allows readers to access the content. Over the past several years, there have been many offline Wikipedia apps: BzReader, MzReader, WikiTaxi, WikiFilter, Kiwix, Okawix, etc. Some have come and gone, while others continue to thrive and are actively releasing new updates.

One thing we’ve learned looking at this ecosystem is that there is a strong need for a full-featured, easy-to-use and well-supported offline app.

During the strategic planning process, one app emerged as a good candidate for the WMF to actively support: Kiwix. Kiwix has been around since 2007 and, through the great work of its lead developer Kelson, has steadily improved its feature set, platform support and overall stability.

In order to support this work and to help make the application even easier to use, we’ll be conducting a usability study on Kiwix, focused on search and browse, during the first quarter of 2011. Later this year, we’ll be focusing on an easier update cycle using openZim as the underlying storage format.

We hope 2011 will be full of exciting news about offline Wikimedia content. If you’d like to get involved, please participate in the strategic product discussion about Offline, or contact me if you’d like to help with development.

Tomasz Finc
Engineering Program Manager – Offline, Mobile, & Fundraising

Open Web Analytics 1.4

Open Web Analytics 1.4.0rc3 is out!  You probably don’t care, do you?  You should!  At least we do!

Anyway, let’s start at the beginning:

As we strategized about future development of Wikimedia properties, it became abundantly clear that the measurement tools that we have are insufficient to make the decisions we need to make.  This was a key recommendation from the Strategy task force. We evaluated several possible analytics frameworks as a supplement or even replacement for our homegrown system(s).  After evaluating a couple of open source solutions (while keeping an open mind about the possible need to go with a proprietary solution), we decided to try out Open Web Analytics (OWA) for this year’s fundraiser, with the goal of evaluating it for broader use.

OWA is a PHP-based analytics tool which provides very sophisticated capabilities for real-time data analysis, providing many tools offered by proprietary counterparts. For us, OWA seems to hit the right balance of flexibility and scalability, with the added benefit that there was already an integration plugin for MediaWiki.  Over the past few months, we’ve been working with Peter Adams, the designer of OWA, to adapt OWA for our needs and to make sure that it would work at the scale that we operate at.

Many of the features in the 1.4 release were made initially for our use, but are general-purpose features that many OWA users should be able to benefit from.  We wanted to track how successful we were at getting people from banners, to letter, to donation, so Peter added a couple of features called “conversion goal tracking” and “goal funnels” which will help us figure out where people might be dropping off, but can also be used for general conversion analysis on any OWA-enabled site.  We also needed to keep track of all of this on a per-banner basis, as well as knowing whether the user clicked on the banner or on the “Donate” link in the sidebar, so the “campaign tracking” feature was added.
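To make the funnel idea concrete, here is a minimal sketch of how drop-off between steps can be computed from per-session event lists. It is not OWA's implementation or API; the step names (`banner_click`, `letter_view`, `donation`) and the shape of the session data are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical funnel steps, in the order a donor is expected to pass through them.
FUNNEL = ["banner_click", "letter_view", "donation"]


def funnel_counts(sessions):
    """Count how many sessions reached each funnel step, in order.

    `sessions` maps a session id to the ordered list of event names seen
    in that session. A step only counts if every earlier step was reached.
    """
    counts = Counter()
    for events in sessions.values():
        position = 0  # index of the next funnel step this session must hit
        for event in events:
            if position < len(FUNNEL) and event == FUNNEL[position]:
                counts[FUNNEL[position]] += 1
                position += 1
    return [counts[step] for step in FUNNEL]


def drop_off(counts):
    """Return the fraction of sessions lost between consecutive steps."""
    rates = []
    for previous, current in zip(counts, counts[1:]):
        rates.append(0.0 if previous == 0 else 1 - current / previous)
    return rates
```

Keeping one set of counts per banner (for example, a dictionary keyed by a banner name) would give the per-banner view described above; that grouping is left out here to keep the sketch short.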

Finally, we needed to deploy many instances of OWA, so clustered deployment was added in this release.  Peter worked with Nimish Gautam here at WMF to make OWA more scalable, with Nimish becoming a committer on OWA. Peter focused on the architecture, while Nimish focused on making sure that all of the work integrated seamlessly into Wikimedia’s environment.

We’ve just deployed OWA for purposes of observing traffic patterns for the fundraiser, and we’ll be reporting on how well it works for us.  We’re not using all of the features; for example, we’ve disabled features such as mouse movement recording/playback.  We’re being very careful to respect everyone’s privacy and stay true to the WMF donor privacy policy and the Wikimedia privacy policy.

We believe the work we’ve done is generally applicable to anyone who wants MediaWiki analytics, and we’re eager to see how it works for others.  We are also at a point where we would love help with testing this.

Wikimedia XML data sets released on Amazon Public Data Sets

For our community members who do analysis on Wikimedia project data, I’m happy to announce the release of our XML snapshots within Amazon Public Data Sets.

http://aws.typepad.com/aws/2009/09/new-public-data-set-wikipedia-xml-data.html

To those curious about why this happened: earlier this year I was approached multiple times by researchers and community members who wanted to parse our data but were frustrated by the cost and time of doing it on their own infrastructure.

Many of them were already familiar with academic computational clusters and were wondering if there were any similar solutions available for them to do large scale processing. The Toolserver was one option, but sometimes it didn’t provide the level of flexibility that they needed, or their projects were too computationally intensive to run alongside other tasks.

Instead, I mentioned that I had already been thinking of pushing our data sets into the Amazon cloud, since Amazon had a wealth of infrastructure in place and it was an environment that our users were already familiar with.

Fast forward to now: we have our first release ready to be worked on, and I’m sure that we will hear back about new and exciting discoveries that our communities make.
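For anyone wondering what working with a snapshot looks like, here is a minimal sketch of streaming through a MediaWiki XML export with Python's standard library, so the whole file never has to fit in memory. The element names (`page`, `title`, `text`) follow the usual export layout, but the exact schema version in the namespace is an assumption, which is why the code strips the namespace instead of hard-coding it. The function name `iter_pages` is made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text_length) for each <page> in an export stream.

    If a page carries several revisions, the text of the last one wins.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, len(text)
            title, text = None, ""
            elem.clear()  # drop the page's children to keep memory use low
```

A compressed snapshot can be passed in directly as a file object, for instance one opened with `bz2.open(path, "rb")`, where `path` points at whichever dump file you downloaded.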

This isn’t the first Wikipedia data set to exist in Public Data Sets but it will be the first that Wikimedia is committing to supporting on a regular release cycle. Amazon will be picking it up every month and retaining copies for at least three months.

I’m excited to see the stats of how many people use it.

–tomasz

We’re adding an off site archive for Commons and the XML snapshots

Thanks are due to eBart consulting and User:Milosh for providing a backup server and storage array at their colocation facility in Europe. This server will store archives of our publicly available data from Wikimedia Commons and the XML snapshots.

Everyone knows that this has been a long time coming, as having an off site location for our data is extremely important for disaster recovery. With this archive in place we’ll have another external archive space for Commons image data to complement the one living at MIT.

Given the 10 TB donated, we’re also likely to store yearly archives of the XML snapshots.

This won’t stop us from continuing to be rigorous about our internal backups of the same data, along with keeping all of our users’ private data within our own data centers. It will simply be another physical space for us to archive our publicly available content.

While this offline mirror will only be used internally, we have some leads on other sponsors who might be able to offer a publicly available mirror. Over the next few weeks we’ll be streamlining the offline archiving process and seeding the initial Commons upload, which currently comes in at just under 4 TB! Once we have a better sense of how best to manage the archiving process, we’ll see who else is able to host our data.

Wikimedia & FourKitchens support CiviCRM development

Here at Wikimedia we’ve been avidly using CiviCRM for over two years now. Over that period we’ve seen it grow and mature as a platform for fundraising, contact tracking and mailings, and we have wanted to help the platform evolve even more. Together with the Civi community, we’ve worked to organize the early release of the CiviReport architecture for the 2.2 branch. Thanks go to the core Civi team for doing the backport and to FourKitchens for contributing a wealth of new reports for us. You can read a full write-up of the release at the CiviCRM blog.

For those of our readers who are interested in CiviCRM and are in the Bay Area, we’ve also started to organize regular user meetups. The first one had a great turnout and we’d love for both developers and users of CiviCRM to attend the next one on August 4th at 6pm.

Tomasz Finc, Software Developer

Wikimedia Donates Servers to Local and Remote Causes

Wikimedia donates servers to SFCCP and northxsouth.

The Wikimedia projects have been running on the same commodity hardware for many years now and every now and then we decide to decommission some of our older machines. This not only allows us to free up space for new servers but also lets us use more energy efficient hardware.

While searching around for a new home for our old but still very useful servers we came across two linked organizations: northxsouth & San Francisco Community Collocation Project (SFCCP). Both of these groups help out their local and regional communities by using open source software to better spread information within various media spaces.

The SFCCP is active within the San Francisco community while northxsouth works with various Latin American countries.

Our donation of servers was happily received and I’m excited to report that they will soon be humming along and serving the public for a long time to come.

Tomasz Finc, Software Developer