Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts Tagged ‘parsoid’

Preparing for VisualEditor on all Wikipedias


After several years of development and testing, VisualEditor, the new visual interface for editing Wikipedia pages, will soon be available in “beta” form for all users. It lets Wikipedia editors create and modify articles visually: the article being edited looks the same as it does when read, and changes appear as they are entered, much like writing a document in a word processor.

VisualEditor removes the need to learn complex wiki markup, and so simplifies editing for both new and experienced editors. We hope that this will open up editing to more people, and along with other efforts will encourage more editors to start and continue to contribute.

We plan to enable it for all logged-in users of the English Wikipedia in early July, later that month extending it to logged-out users, and then the other Wikipedias. Ahead of rolling out VisualEditor in July, we will be carrying out a test of VisualEditor for some randomly-selected new accounts on the English Wikipedia beginning on 17 June. During this testing period, we will be monitoring the impact on users, listening to feedback, and solving problems.

The “alpha” prototype was previously available only to users with a registered account who opted in to test out VisualEditor. First made available on the English Wikipedia in December 2012, it was extended to 16 more language editions in April, and will be made available on all remaining Wikipedias later this week. A lot of valuable feedback has been provided by the early testers of this alpha, and we would like to thank them for their help.

Visual HTML editors are now common on the Web, but building one for Wikipedia (and its sister sites) has been a challenge in itself, due to our specialized requirements and the need to integrate with our existing software, MediaWiki. Behind the scenes, VisualEditor relies heavily on Parsoid, a new and complex software component for MediaWiki that translates between wiki markup and annotated HTML+RDFa.

We need your help!

What you can do to help: over the past few months, we have asked you to try out the alpha version of the VisualEditor, and many of you did. Since then, it has changed significantly, so we’re asking that you try it again. It’s very important that we fix as many critical issues as possible before deploying it for everyone in a few weeks’ time. Of course, we’d love to fix them all, but that may not be possible. So please enable the VisualEditor (it’s in your preferences, under the editing tab: check the box labeled “Enable VisualEditor”) and submit any bugs that you find. Your early testing means that we can ensure a better VisualEditor and a smoother deployment for everyone.

Philippe Beaudette, Director, Community Advocacy
James Forrester, Product Manager, VisualEditor and Parsoid

Parsoid: How Wikipedia catches up with the web

Wikitext, as a Wikipedia editor has to type it in (above), and the resulting rendered HTML that a reader sees in her browser (below)

When the first wiki saw the light of day in 1995, it simplified HTML syntax in a revolutionary way, and its inventor Ward Cunningham named it after the Hawaiian word for “fast.” When Wikipedia launched in 2001, it owed its rapid success to the easy collaboration that a wiki allows. Back then, the simplicity of wiki markup made it possible to start writing Wikipedia with Netscape 4.7, when WYSIWYG editing was technically impossible. A relatively simple PHP script converted the Wikitext to HTML. Since then, Wikitext has always provided both the edit interface and the storage format of MediaWiki, the software underlying Wikipedia.

About 12 years later, Wikipedia contains 25 million encyclopedia articles written in Wikitext, but the world around it has changed a bit. Wikitext makes it very difficult to implement visual editing, which is now supported in browsers for HTML documents, and expected by web users from many other sites they are familiar with. It has also become a speed issue: With a lot of new features, the conversion from Wikitext to HTML can be very slow. For large Wikipedia pages, it can take up to 40 seconds to render a new version after the edit has been saved.

The Wikimedia Foundation’s Parsoid project is working on these issues by complementing existing Wikitext with an equivalent HTML5 version of the content. In the short term, this HTML representation lets us use HTML technology for visual editing. In the longer term, using HTML as the storage format can eliminate conversion overhead when rendering pages, and can also enable more efficient updates after an edit that only affect part of the page. This might all sound pretty straightforward. So why has this not been done before?

Lossless conversion between Wikitext and HTML is really difficult

For the Wikitext and HTML5 representations to be considered equivalent, it should be possible to convert between Wikitext and HTML5 representations without introducing any semantic differences. It turns out that the ad-hoc structure of Wikitext makes such a lossless conversion to HTML and back extremely difficult.

In Wikitext, italic text is enclosed in double apostrophes (''…''), and bold text in triple apostrophes ('''…'''), but here these notations clash. The interpretation of a sequence of three or more apostrophes depends on other apostrophe-sequences seen on that line.
Center: Wikitext source. Below: As interpreted and rendered by MediaWiki. Above: Alternative interpretation.
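The dependence on other apostrophe sequences in the same line can be sketched in a few lines of Python. This is a simplified illustration, not MediaWiki's actual algorithm: the function names are hypothetical, and the first-pass rule used here (two apostrophes read as italic, three or four as bold, five or more as bold plus italic) is an assumption made for the example.

```python
import re


def first_pass_counts(line):
    """Count italic and bold markers using a naive, run-by-run reading.

    Assumed first-pass rule (illustrative only):
      2 apostrophes      -> italic marker
      3 or 4 apostrophes -> bold marker (a 4th apostrophe would be literal)
      5 or more          -> bold + italic marker (extras would be literal)
    """
    italics = 0
    bold = 0
    for match in re.finditer(r"'{2,}", line):
        run_length = len(match.group(0))
        if run_length == 2:
            italics += 1
        elif run_length in (3, 4):
            bold += 1
        else:
            italics += 1
            bold += 1
    return italics, bold


def needs_reinterpretation(line):
    """Return True when the naive reading leaves both styles unclosed.

    If the line has an odd number of italic markers AND an odd number of
    bold markers, no run-by-run reading balances, so at least one run has
    to be reread in light of the others.
    """
    italics, bold = first_pass_counts(line)
    return italics % 2 == 1 and bold % 2 == 1


print(needs_reinterpretation("''italic'' and '''bold'''"))
print(needs_reinterpretation("L'''amour''"))
```

For `L'''amour''`, the naive reading finds one bold and one italic marker, neither of which can be closed. A sensible rendering has to reread the three-apostrophe run as a literal apostrophe followed by an italic marker, a decision that cannot be made by looking at that run alone.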

  • Context-sensitive parsing: The only complete specification of Wikitext’s syntax and semantics is the MediaWiki PHP-based runtime implementation itself, which is still heavily based on regular-expression-driven text transformation. The multi-pass structure of this transformation, combined with complex heuristics for constructs like italic and bold formatting, makes it impossible to use standard parser techniques based on context-free grammars to parse Wikitext.
  • Text-based templating: MediaWiki’s PHP runtime supports an elaborate text-based preprocessor and template system. This works much like a macro processor in C or C++, and creates very similar issues. As an example, there is no guarantee that the expansion of a template will parse to a self-contained DOM structure. In fact, there are many templates that produce only a table start tag (<table>), a table row (<tr>...</tr>) or a table end tag (</table>). They can even produce only the first half of an HTML tag or Wikitext element (e.g. ...</tabl), which is practically impossible to represent in HTML. Despite all this, content generated by an expanded template (or multiple templates) needs to be clearly identified in the HTML DOM.
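The templating problem can be illustrated with a small Python sketch. Everything here is hypothetical: the template names, the `{{name}}` expansion function and the balance check are stand-ins for the example and do not reflect MediaWiki's preprocessor or Parsoid's implementation. The sketch shows how text-based expansion can yield fragments that are not self-contained, even though the page as a whole is well-formed.

```python
import re
from html.parser import HTMLParser

# Hypothetical templates, each expanding to a fragment of a table.
TEMPLATES = {
    "table-start": "<table>",
    "row": "<tr><td>cell</td></tr>",
    "table-end": "</table>",
}


def expand(text):
    """Replace each {{name}} with its template text, macro-style."""
    return re.sub(r"\{\{([^{}]+)\}\}", lambda m: TEMPLATES[m.group(1)], text)


class BalanceChecker(HTMLParser):
    """Track open tags to see whether a fragment closes what it opens."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # An end tag with no matching start tag inside this fragment.
            self.balanced = False


def is_self_contained(fragment):
    """Return True if every tag opened in the fragment is closed in it."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack


page = "{{table-start}}{{row}}{{table-end}}"
print(expand(page))
for name, body in TEMPLATES.items():
    print(name, is_self_contained(body))
print("whole page", is_self_contained(expand(page)))
```

Only the `row` template and the fully expanded page pass the check. The start and end templates each produce half of a structure, so their output cannot be wrapped in a DOM node of its own, which is why marking template-generated content in the HTML DOM is hard.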