Parsoid: How Wikipedia catches up with the web

Wikitext, as a Wikipedia editor has to type it in (above), and the resulting rendered HTML that a reader sees in her browser (below)

When the first wiki went live in 1995, it simplified HTML syntax in a revolutionary way, and its inventor Ward Cunningham named it after the Hawaiian word for “fast.” When Wikipedia launched in 2001, it owed its rapid success to the easy collaboration that a wiki makes possible. Back then, the simplicity of wiki markup made it possible to start writing Wikipedia with Netscape 4.7, at a time when WYSIWYG editing was technically impossible. A relatively simple PHP script converted the Wikitext to HTML. Since then, Wikitext has served as both the editing interface and the storage format of MediaWiki, the software underlying Wikipedia.

About 12 years later, Wikipedia contains 25 million encyclopedia articles written in Wikitext, but the world around it has changed a bit. Wikitext makes it very difficult to implement visual editing, which is now supported in browsers for HTML documents, and expected by web users from many other sites they are familiar with. It has also become a speed issue: With a lot of new features, the conversion from Wikitext to HTML can be very slow. For large Wikipedia pages, it can take up to 40 seconds to render a new version after the edit has been saved.

The Wikimedia Foundation’s Parsoid project is working on these issues by complementing existing Wikitext with an equivalent HTML5 version of the content. In the short term, this HTML representation lets us use HTML technology for visual editing. In the longer term, using HTML as the storage format can eliminate conversion overhead when rendering pages, and can also enable more efficient updates after an edit that only affect part of the page. This might all sound pretty straightforward. So why has this not been done before?

Lossless conversion between Wikitext and HTML is really difficult

For the Wikitext and HTML5 representations to be considered equivalent, it must be possible to convert between them in either direction without introducing any semantic differences. It turns out that the ad-hoc structure of Wikitext makes such a lossless conversion to HTML and back extremely difficult.

In Wikitext, italic text is enclosed in double apostrophes (''…''), and bold text in triple apostrophes ('''…'''), but here these notations clash. The interpretation of a sequence of three or more apostrophes depends on other apostrophe-sequences seen on that line.
Center: Wikitext source. Below: As interpreted and rendered by MediaWiki. Above: Alternative interpretation.
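
To make the clash concrete, here is a minimal Python sketch that counts the apostrophe runs on a line and flags lines where a run of three cannot be read in isolation. It is an illustration only, not MediaWiki's actual algorithm; the function names and the simplified rules (a run of 2 toggles italic, 3 or 4 toggles bold, 5 or more toggles both) are assumptions made for this example.

```python
import re


def quote_runs(line):
    """Return the lengths of all runs of two or more apostrophes in a line."""
    return [len(m.group()) for m in re.finditer(r"'{2,}", line)]


def is_ambiguous(line):
    """True if the line has an odd number of italic AND of bold toggles.

    Simplifying assumptions for this sketch: a run of 2 toggles italic,
    a run of 3 or 4 toggles bold (4 = literal apostrophe + bold), and a
    run of 5 or more toggles both.
    """
    italic = bold = 0
    for n in quote_runs(line):
        if n == 2:
            italic += 1
        elif n in (3, 4):
            bold += 1
        else:  # 5 or more apostrophes
            italic += 1
            bold += 1
    # Unbalanced on both counts: some run of three has to be reread,
    # for example as a literal apostrophe followed by an italic toggle.
    return italic % 2 == 1 and bold % 2 == 1


print(quote_runs("L'''arc'' de triomphe"))        # run lengths on the line
print(is_ambiguous("L'''arc'' de triomphe"))      # unbalanced line
print(is_ambiguous("''italic'' and '''bold'''"))  # balanced line
```

The point of the sketch is that the decision for one run needs the counts for the whole line, which is exactly what a context-free grammar cannot express directly.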

  • Context-sensitive parsing: The only complete specification of Wikitext’s syntax and semantics is the MediaWiki PHP-based runtime implementation itself, which is still heavily based on regular-expression-driven text transformation. The multi-pass structure of this transformation, combined with complex heuristics for constructs like italic and bold formatting, makes it impossible to use standard parser techniques based on context-free grammars to parse Wikitext.
  • Text-based templating: MediaWiki’s PHP runtime supports an elaborate text-based preprocessor and template system. This works much like a macro processor in C or C++, and creates very similar issues. As an example, there is no guarantee that the expansion of a template will parse to a self-contained DOM structure. In fact, there are many templates that only produce a table start tag (<table>), a table row (<tr>...</tr>) or a table end tag (</table>). Some even produce only the first half of an HTML tag or Wikitext element (e.g. ...</tabl), which is practically impossible to represent in HTML. Despite all this, content generated by an expanded template (or multiple templates) needs to be clearly identified in the HTML DOM.
  • No invalid Wikitext: Every possible Wikitext input has to be rendered as valid HTML – it is not possible to reject a user’s edit with a “syntax error” message. Many attempts to create an alternative parser for MediaWiki have tried to simplify the problem by declaring some inputs invalid, or modifying the syntax, but at Wikimedia we need to support the existing corpus created by our users over more than a decade. Wiki constructs and HTML tags can be freely mixed in a tag soup, which still needs to be converted to a DOM tree that ideally resembles the user’s intention. The behavior for rare edge cases is often more accident than design. Reproducing the behavior for all edge cases is neither feasible nor always desirable. We use automated round-trip testing on 100,000 Wikipedia articles, unit test cases and statistics on Wikipedia dumps to help us identify the common cases we need to support.
  • Character-based diffs: MediaWiki uses a character-based diff interface to show the changes between the Wikitext of two versions of a wiki page. Any character difference introduced by a round-trip from Wikitext to HTML and back would show up as a dirty diff, which would annoy editors and make it hard to find the actual changes. This means that the conversion needs to preserve not just the semantics of the content, but also the syntax of unmodified content character-by-character. Put differently, since Wikitext-to-HTML is a many-to-one mapping where different snippets of Wikitext all result in the same HTML rendering (Example: The excess space in “* list” versus “*list” is ignored), a reverse conversion would effectively normalize Wikitext syntax. However, character-based diffs force the Wikitext-to-HTML mapping to be treated as a one-to-one mapping. We use a combination of complementary techniques to achieve clean diffs:
    • we detect changes to the HTML5 DOM structure and use a corresponding substring of the source Wikitext when serializing an unmodified DOM part (selective serialization), see below.
    • we record variations from some normalized syntax in hidden round-trip data (example: excess spaces, variants of table-cell Wikitext).
    • we collect and record information about ill-formed HTML that is auto-corrected while building the DOM tree (example: auto-closed inline tags in block context).
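
The round-trip data idea from the list above can be sketched in a few lines of Python. This is a toy model, not Parsoid code: the function names, the `space_after_bullet` key and the single-bullet grammar are assumptions chosen to mirror the “* list” versus “*list” example.

```python
import re


def parse_list_item(wikitext):
    """Parse one bullet line into (html, round_trip_data)."""
    match = re.fullmatch(r"\*( *)(.*)", wikitext)
    if match is None:
        raise ValueError("not a bullet list item")
    spaces, text = match.groups()
    html = f"<li>{text}</li>"
    # Record only the deviation from the normalized form "*text".
    rt_data = {"space_after_bullet": spaces} if spaces else {}
    return html, rt_data


def serialize_list_item(html, rt_data=None):
    """Serialize back to Wikitext, restoring recorded syntax variations."""
    text = re.fullmatch(r"<li>(.*)</li>", html).group(1)
    spaces = (rt_data or {}).get("space_after_bullet", "")
    return f"*{spaces}{text}"


html_a, rt_a = parse_list_item("* list")
html_b, rt_b = parse_list_item("*list")
print(html_a == html_b)                   # many-to-one: same HTML
print(serialize_list_item(html_a))        # normalized, would be a dirty diff
print(serialize_list_item(html_a, rt_a))  # original syntax restored
```

Without the recorded data the serializer can only produce one canonical spelling; with it, the many-to-one mapping behaves like a one-to-one mapping for unmodified content.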

How we tackle these challenges with Parsoid

Artist’s impression of the Parsoid HTML5 + RDFa wiki runtime

Parsoid is implemented as a node.js-based web service. It has two distinct and somewhat independent pieces: the parser and runtime that converts Wikitext to HTML, and the serializer that converts HTML to Wikitext.

Converting Wikitext to HTML

The conversion from Wikitext to HTML DOM starts with a PEG-based tokenizer, which emits tokens to an asynchronous token stream transformation pipeline. The stages of the pipeline effectively do two things:

  • Asynchronous expansion of template and extension tags: We are using MediaWiki’s web API for these expansions, which distributes the execution of a single request across a cluster of machines. The asynchronous nature of Parsoid’s token stream transformation pipeline enables it to perform multiple expansions in parallel and stitch them back together in original document order with minimal buffering.
  • Parsing of Wikitext constructs on the expanded token stream: Quotes, lists, pre-blocks and paragraphs are handled via transformations on the expanded token stream. Each transformation is performed by a handler implementing a state machine. This lets us parse context-sensitive Wikitext constructs like quotes. By operating on the fully expanded token stream, we can also mimic the PHP runtime’s support for structures partly created by templates, or even multiple templates. An example for this are tables created with a sequence of table start / row / table end templates as in this football article.

A table created with multiple templates; in Wikitext (below) and rendered HTML (above)
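
The “expand in parallel, stitch back together in document order” idea can be illustrated with Python's asyncio. This is only a sketch of the concept: `expand_template` is a hypothetical stand-in that sleeps instead of calling a web API, the token format is invented, and unlike the pipeline described above it collects all results at the end rather than streaming them with minimal buffering.

```python
import asyncio


async def expand_template(name, delay):
    """Hypothetical stand-in for a remote template expansion (no network)."""
    await asyncio.sleep(delay)
    return f"<expanded {name}>"


async def expand_all(tokens):
    """Expand template tokens concurrently; pass text tokens through."""

    async def process(token):
        kind, value = token
        if kind == "template":
            name, delay = value
            return await expand_template(name, delay)
        return value

    # gather() runs the coroutines concurrently but returns the results
    # in argument order, which here is the original document order.
    return await asyncio.gather(*(process(t) for t in tokens))


tokens = [
    ("template", ("table-start", 0.03)),  # slowest expansion comes first
    ("text", "row text"),
    ("template", ("table-end", 0.01)),
]
result = asyncio.run(expand_all(tokens))
print(result)
```

Even though the last template finishes before the first one, the output keeps the order of the input tokens.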

Fully processed tokens are passed to an HTML5 tree builder. The resulting DOM is further post-processed before it is stored or delivered to a client (this could simply be the reader’s browser, but also the VisualEditor, or a bot processing the HTML further). The post-processing identifies template blocks, marks auto-corrected HTML tags, and maps DOM subtrees to the original source Wikitext range that generated the subtrees. These techniques enable the HTML-to-Wikitext reverse transformation to be performed while minimizing dirty diffs.

Converting HTML to Wikitext

The conversion from HTML DOM to Wikitext is performed in a serializer, which needs to make sure that the generated Wikitext parses back to the original DOM. For this, it needs a deep understanding of the various syntactical constructs and their constraints.

A full serialization of an HTML DOM to Wikitext often results in some normalization. For example, we don’t track if single quotes or double quotes are used in attributes (e.g. style='...' vs. style="..."). The serializer always uses double quotes for attributes, which will lead to a dirty diff if single quotes were used in the original Wikitext.

To avoid this, we have implemented a serialization mode which is more selective about what parts of the DOM it serializes. This selective serializer relies on access to both the original Wikitext and the original DOM that was generated from it. It compares the original and new DOM it receives and selectively serializes only the modified parts of the DOM. For unmodified parts of the DOM, it simply emits the original Wikitext that generated those subtrees. This avoids any dirty diffs in unmodified parts of a page.
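
A toy model of selective serialization, written in Python, may help here. Everything in it is an assumption made for illustration: the “DOM” is just a list of paragraphs with collapsed whitespace, the source ranges are plain character offsets, and modified nodes are detected by comparing positions one by one rather than by a real DOM diff.

```python
def parse(wikitext):
    """Toy parser: split paragraphs on blank lines and collapse whitespace.

    Returns a list of (text, start, end) tuples, where start/end is the
    range of the paragraph in the original Wikitext.
    """
    nodes, pos = [], 0
    for part in wikitext.split("\n\n"):
        nodes.append((" ".join(part.split()), pos, pos + len(part)))
        pos += len(part) + 2  # skip the blank-line separator
    return nodes


def serialize_full(texts):
    """Normalizing serializer: excess whitespace from the source is lost."""
    return "\n\n".join(texts)


def serialize_selective(new_texts, original_nodes, original_wikitext):
    """Reuse the original source for unmodified nodes, serialize the rest."""
    out = []
    for i, text in enumerate(new_texts):
        if i < len(original_nodes) and original_nodes[i][0] == text:
            _, start, end = original_nodes[i]
            out.append(original_wikitext[start:end])  # verbatim source
        else:
            out.append(text)  # modified or new node
    return "\n\n".join(out)


source = "First   paragraph\n\nSecond  paragraph"
nodes = parse(source)
edited = ["First paragraph", "Second paragraph, edited"]
print(repr(serialize_full(edited)))
print(repr(serialize_selective(edited, nodes, source)))
```

In the selective output the untouched first paragraph keeps its original spacing, so a character-based diff shows only the edit to the second paragraph.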

An additional problem that both serializers need to contend with is the presence of Wikitext-like constructs in text content. The serializers need to escape Wikitext-like text content (example: [[Foo]]) to ensure that it remains text content when the Wikitext is converted back to HTML. This Wikitext escaping is not trivial for a context-sensitive language. The current solution uses smart heuristics and the Wikitext tokenizer, and works quite well. It could however be further improved to eliminate spurious and unnecessary Wikitext escaping, in particular for context-sensitive syntax not fully handled in the tokenizer.
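
As a rough illustration of the escaping problem, the following Python sketch wraps text in <nowiki> whenever a naive pattern suggests it could be re-parsed as markup. The pattern and the function name are assumptions for this example; the approach described above relies on the Wikitext tokenizer rather than on a regular expression like this one.

```python
import re

# Naive pattern for a few Wikitext-like constructs (assumption: this is
# far from a complete list of what needs escaping).
WIKITEXT_LIKE = re.compile(r"\[\[|\]\]|\{\{|\}\}|''|^[*#:;=]", re.MULTILINE)


def escape_text(text):
    """Wrap text in <nowiki> if it might be re-parsed as Wikitext."""
    if WIKITEXT_LIKE.search(text):
        return f"<nowiki>{text}</nowiki>"
    return text


print(escape_text("[[Foo]]"))     # looks like a link, gets escaped
print(escape_text("plain text"))  # left alone
print(escape_text("a ]] b"))      # escaped although probably harmless
```

The last call shows the kind of spurious escaping a context-free heuristic produces: a lone "]]" matches the pattern even though it would most likely remain plain text anyway.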

Examples

Let us now have a look at some examples in more detail.

Consider the Wikitext:

[[Foo|bar]]

The HTML generated by Parsoid for this is:

<a rel="mw:WikiLink" href="./Foo">bar</a>

The <a>-tag itself should be obvious given that the Wikitext is a wiki-link. However, in addition to wiki links, external links, images, ISBN links and others also generate an <a>-tag. In order to properly convert the <a>-tag back to the correct Wikitext that generated it, Parsoid needs to be able to distinguish between them. Towards this end, Parsoid also marks the <a>-tag with the mw:WikiLink property (or mw:ExtLink, mw:Image, etc.). This kind of RDFa markup also provides clients (like the VisualEditor) additional semantic information about HTML DOM subtrees.
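
The role of the rel attribute can be sketched with Python's standard html.parser module. This is a simplified illustration, not Parsoid's serializer: the class and function names are made up, only mw:WikiLink and mw:ExtLink are handled, and nested markup inside the link is ignored.

```python
from html.parser import HTMLParser


class LinkSerializer(HTMLParser):
    """Turn <a> elements back into Wikitext, choosing syntax by rel."""

    def __init__(self):
        super().__init__()
        self.out = []     # serialized Wikitext fragments
        self.link = None  # (rel, href) of the currently open <a>, if any
        self.text = []    # text collected inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self.link = (attrs.get("rel"), attrs.get("href", ""))
            self.text = []

    def handle_data(self, data):
        (self.text if self.link else self.out).append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.link:
            rel, href = self.link
            content = "".join(self.text)
            if rel == "mw:WikiLink":
                target = href[2:] if href.startswith("./") else href
                self.out.append(f"[[{target}|{content}]]")
            elif rel == "mw:ExtLink":
                self.out.append(f"[{href} {content}]")
            else:
                raise ValueError(f"unhandled link type: {rel}")
            self.link = None


def html_to_wikitext(html):
    parser = LinkSerializer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(html_to_wikitext('<a rel="mw:WikiLink" href="./Foo">bar</a>'))
print(html_to_wikitext('<a rel="mw:ExtLink" href="http://example.org">site</a>'))
```

Both inputs are plain <a> elements; without the rel value the serializer would have to guess which Wikitext syntax to emit.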

Let us now change the Wikitext slightly where the link content is generated by a template:

[[Foo|{{echo|bar}}]]

The HTML generated by Parsoid for this is:

<a rel="mw:WikiLink" href="./Foo">
  <span about="#mwt1" data-parsoid="{...}" typeof="mw:Object/Template">bar</span>
</a>

First of all, note that in the browser this Wikitext will render identically to the first example, so semantically there is no difference between these two Wikitext snippets. However, Parsoid adds additional markup to the link content: The <span>-tag wrapping the content has an about attribute and an RDFa type. Once again, this is to let clients know that the content came from a template, and to let Parsoid serialize this back to the original Wikitext. Parsoid also maintains private information for roundtripping in the data-parsoid HTML attribute (in this example, the original template transclusion source). The about attribute on the <span> lets us mark template output expanding to several DOM subtrees as a group.
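
To show how a client might use the about attribute to treat several subtrees as one group, here is a small Python sketch based on the standard html.parser module. The class name, the function name and the sample markup (two table rows sharing one about value) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AboutGrouper(HTMLParser):
    """Collect the tag names of all elements that share an about value."""

    def __init__(self):
        super().__init__()
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        about = dict(attrs).get("about")
        if about is not None:
            self.groups[about].append(tag)


def group_by_about(html):
    parser = AboutGrouper()
    parser.feed(html)
    parser.close()
    return dict(parser.groups)


# Hypothetical markup: one template expanding to two sibling rows.
sample = (
    '<tr about="#mwt1"><td>a</td></tr>'
    '<tr about="#mwt1"><td>b</td></tr>'
    '<p about="#mwt2">c</p>'
)
print(group_by_about(sample))
```

An editor could use such a grouping to select, protect or replace all output of one transclusion at once, even when it spans several sibling elements.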

The future

Our roadmap describes our plans for the coming months and beyond. Apart from new features and refinement in support of the VisualEditor project, we plan to assimilate several Parsoid features into the core of MediaWiki. HTML storage in parallel with Wikitext is the first major step in this direction. This will enable several optimizations and might eventually lead to HTML becoming the primary storage format in MediaWiki. We are also working on a DOM-based templating solution with better support for visual editing, separation between logic and presentation, and the ability to cache fragments for better performance.

Join us!

If you like the technical challenges in Parsoid and want to get involved, then please join us in the #mediawiki-parsoid IRC channel on Freenode. You could even get paid to work on Parsoid: We are looking for a full-time software engineer and 1-2 contractors. Join the small Parsoid team and make the sum of all knowledge easier and more efficient to edit, render, and reuse!

Gabriel Wicke, Senior Software Engineer, Parsoid

Subramanya Sastry, Senior Software Engineer, Parsoid

Categories: Jobs, VisualEditor

4 Comments on Parsoid: How Wikipedia catches up with the web

Olivier 1 year

Thank you for taking the time to write this! I don’t understand it all but it’s nevertheless enlightening. I now have a much better understanding of why the oncoming of the visual editor is taking so much time. All the best meeting this challenge…

Subfader 1 year

Your own fault using the same characters for bold and italic syntax. Epic fail imo.

TiddlyWiki: http://tiddlywiki.com/#%5B%5BBasic%20Formatting%5D%5D

''bold''

//italics//

[[internal link|Article]]

[[google|http://www.google.com]]

--Strikethrough--

@@Highlight@@

{{{code}}}

{{{
pre
}}}

Bedhed 1 year

Thanks for taking the time to write this useful post. It’s a good start to dig into the new parsing/editing system. I am very keen to see it work (and pitch in by porting Extension:WidgetsFramework to parsoid).

Michael Jahn 1 year

I barely grasp half of this post, but that’s totally sufficient to give me a rough idea of the challenges behind visual editing. Thanks for a great read!
