Wikipedia Server Upgrade
penbat
Anyone know if it will happen any time soon? It would be nice to get back full statistical data on internal links, etc., and email alerts of article changes, for example. Is this a big secret? I don't see any info about this anywhere. It strikes me that this issue is a big reason why interest in Wikipedia is drying up (no email alerts and many tools not available). Wikipedia has been pushing hard to raise funds, so surely a server upgrade should happen at some point.
Kelly Martin
QUOTE(penbat @ Tue 29th September 2009, 5:17pm) *
Anyone know if it will happen any time soon? It would be nice to get back full statistical data on internal links, etc., and email alerts of article changes, for example. Is this a big secret? I don't see any info about this anywhere. It strikes me that this issue is a big reason why interest in Wikipedia is drying up (no email alerts and many tools not available). Wikipedia has been pushing hard to raise funds, so surely a server upgrade should happen at some point.
Wikimedia upgrades servers on a regular basis. Of course, they have a couple hundred of them, so you might not really notice. In any case, it sounds like what you're asking about is a software upgrade. The features you're looking for are unlikely to be implemented because nobody who matters is clamoring for them, and only features wanted by Important People ever get added (unless someone else writes them and adding them is easy).

Most of the statistical work on the contents of the database is done using the toolserver replicas or via offline dumps. They won't enable email notifications for the English Wikipedia (the only project for which they are not enabled) because turning them on would cream the daylights out of the poor little box that handles their email.
MZMcBride
QUOTE(penbat @ Tue 29th September 2009, 6:17pm) *

Anyone know if it will happen any time soon? It would be nice to get back full statistical data on internal links, etc., and email alerts of article changes, for example. Is this a big secret? I don't see any info about this anywhere. It strikes me that this issue is a big reason why interest in Wikipedia is drying up (no email alerts and many tools not available). Wikipedia has been pushing hard to raise funds, so surely a server upgrade should happen at some point.

I'm not sure how familiar you are with MediaWiki, but things like e-mail notifications are already implemented (smaller wikis like Meta have them in Special:Preferences). As you note, however, they aren't available on the English Wikipedia (or any of the larger wikis). It's a combination of hardware and software issues. The sheer volume of contributions to any large wiki makes it a monumental task to keep up with e-mails. This requires smart software and dedicated hardware, neither of which is in large supply currently.

As Kelly notes, they do regularly buy new servers and they do operate hundreds of them. In fact, they're now doing a donation drive for old servers as they replace them with newer hardware. But at this point, from my perspective at least, it's still a matter of keeping the site up and running more than anything else. Fun features like e-mail alerts or link counts simply aren't budgeted for currently while basic site operation is still such a focus.

I should note that the Wikimedia Foundation recently gave a $40,000 grant to the Wikimedia Toolserver (operated by Wikimedia Deutschland) for more servers. This will hopefully reduce load on the Toolserver and make it more reliable / stable, which means that querying things like link counts may be possible again at some point. Though, even with extra servers, the pagelinks table on the English Wikipedia is over 476,280,378 rows and the revision table is over 320,748,417 rows (rough estimates from MySQL using EXPLAIN). It's simply a lot of fucking data.
Milton Roe
QUOTE(MZMcBride @ Tue 29th September 2009, 11:22pm) *

I should note that the Wikimedia Foundation recently gave a $40,000 grant to the Wikimedia Toolserver (operated by Wikimedia Deutschland) for more servers. This will hopefully reduce load on the Toolserver and make it more reliable / stable, which means that querying things like link counts may be possible again at some point. Though, even with extra servers, the pagelinks table on the English Wikipedia is over 476,280,378 rows and the revision table is over 320,748,417 rows (rough estimates from MySQL using EXPLAIN). It's simply a lot of fucking data.


It certainly would be if the WMF servers actually store every old WP page version, instead of just the differences between versions. Tell me they don't store a whole article's worth of text to keep track of one revision of one spelling error? Or each vandalism and its reversions?

For sheer reasons of bandwidth, there is some point at which data-compression must win out over savings in computation to restore data from difference-sets. Which one limits WP now?
MZMcBride
QUOTE(Milton Roe @ Wed 30th September 2009, 2:33am) *

QUOTE(MZMcBride @ Tue 29th September 2009, 11:22pm) *

I should note that the Wikimedia Foundation recently gave a $40,000 grant to the Wikimedia Toolserver (operated by Wikimedia Deutschland) for more servers. This will hopefully reduce load on the Toolserver and make it more reliable / stable, which means that querying things like link counts may be possible again at some point. Though, even with extra servers, the pagelinks table on the English Wikipedia is over 476,280,378 rows and the revision table is over 320,748,417 rows (rough estimates from MySQL using EXPLAIN). It's simply a lot of fucking data.


It certainly would be if the WMF servers actually store every old WP page version, instead of just the differences between versions. Tell me they don't store a whole article's worth of text to keep track of one revision of one spelling error? Or each vandalism and its reversions?

For sheer reasons of bandwidth, there is some point at which data-compression must win out over savings in computation to restore data from difference-sets. Which one limits WP now?

The full text of each revision is stored. Old revisions are heavily compressed. There's a low overhead with compression and it's more robust (and less likely to corrupt) with all revisions, versus pointers. I can't speak too much on the technical specifics, but I did dig up some information from the wikitech-l archives.

If you can read complicated code (I certainly can't):
Note the PDF referenced in the 2005 thread is linked to a dead location. The PDF is available here: http://noc.wikimedia.org/~tstarling/Wikipedia_21C3.pdf
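For what it's worth, here is a minimal Python sketch of roughly that shape of scheme (keep the full text of every revision, but batch old revisions together and compress the batch, so near-identical revisions cost very little extra). It is not MediaWiki's actual HistoryBlob code, and the revision texts are invented:
CODE
# Sketch only: full text of every revision kept, old revisions batched and
# gzipped together. Not MediaWiki's HistoryBlob implementation.
import gzip
import json

base = "Anarchism is a political philosophy that questions authority. " * 50
revisions = [base + f"Revision-specific sentence number {i}." for i in range(20)]

# Concatenate the full texts into one blob, then compress the whole batch.
blob = gzip.compress(json.dumps(revisions).encode("utf-8"))

raw_size = sum(len(r.encode("utf-8")) for r in revisions)
print(f"full texts: {raw_size} bytes, compressed batch: {len(blob)} bytes")

# Reading an old revision means decompressing the batch and indexing into it,
# rather than following a chain of pointers/diffs.
texts = json.loads(gzip.decompress(blob))
print(texts[5][-40:])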
dogbiscuit
QUOTE(MZMcBride @ Wed 30th September 2009, 10:23am) *

The full text of each revision is stored. Old revisions are heavily compressed. There's a low overhead with compression and it's more robust (and less likely to corrupt) with all revisions, versus pointers. I can't speak too much on the technical specifics, but I did dig up some information from the wikitech-l archives. If you can read complicated code (I certainly can't): Note the PDF referenced in the 2005 thread is linked to a dead location. The PDF is available here: http://noc.wikimedia.org/~tstarling/Wikipedia_21C3.pdf

The trouble with diff based revisions is that they are at risk from both storage and processing error. Also, when you want to do things like removing versions, it is going to be a process of reconstruction then rediffing.

The hierarchy of compression seems like a good pragmatic move as the compression algorithms are very robust.

The reality is that at some point the long history becomes irrelevant and is rarely touched, so the processing time to extract very old revisions hardly matters. Perhaps there should come a point where the history is purged (aside from some summary contributions list for attribution purposes, or a yearly snapshot), like the six-year rule for tax. Either the article is stable, so there is nothing of interest, or the article is an edit war, so who cares what it was like. I suspect, though, that the lore of Wikipedia will deem that to lose the history of the article is unthinkable and they would rather shackle the effort of "writing an encyclopedia" with the game tool of keeping the history. If people really are interested in the state of an article a couple of years ago, then it suggests something other than constructive effort on moving forwards.
penbat
The features I am referring to were all available up to 2006, and I found them useful before then.
Kelly Martin
QUOTE(dogbiscuit @ Wed 30th September 2009, 5:48am) *
I suspect, though, that the lore of Wikipedia will deem that to lose the history of the article is unthinkable and they would rather shackle the effort of "writing an encyclopedia" with the game tool of keeping the history. If people really are interested in the state of an article a couple of years ago, then it suggests something other than constructive effort on moving forwards.
I advocated for a policy of purging "vandalism" revisions (and their reverts) from the database after six months, but this was a complete nonstarter (for one, because it would decimate the edit counts of most of the MMORPGers). Some people even insisted that it violated the GFDL. Most Wikipedians have internalized the rule that "the GFDL requires retaining the whole history" with no real awareness of what the GFDL actually required. (A moot point now that Wikipedia is under Creative Commons.)


QUOTE(penbat @ Wed 30th September 2009, 5:54am) *

The features i am referring to were all available up to 2006 and i found them useful before then.
Most of the statistical queries against the live database were disabled in 2005 or 2006 for performance reasons. The English Wikipedia database is too large to handle those queries with the database technology they're currently using, and they're not willing to make the migration to Oracle.
CharlotteWebb
QUOTE(Milton Roe @ Wed 30th September 2009, 6:33am) *

It certainly would be if the WMF servers actually store every old WP page version, instead of just the differences between versions. Tell me they don't store a whole article's worth of text to keep track of one revision of one spelling error? Or each vandalism and its reversions?

For sheer reasons of bandwidth, there is some point at which data-compression must win out over savings in computation to restore data from difference-sets. Which one limits WP now?

Theoretically they could do a little bit of both, by having most edits stored as a diff but also storing full-text of the current version plus periodic "key frames" to allow fairly efficient reconstruction of how a page might have looked at any given time.

For one, it would reduce the attribution fiascos caused by the "oversight" tool in cases where the offending material was added to a page, but not quickly removed (i.e. it remained on a page over the course of several constructive and unrelated edits following it).

QUOTE(Brion Vibber @ 2005-09-12 08:56:53 GMT (4 years, 2 weeks, 4 days, 9 hours and 38 minutes ago))

Tagging of revisions is likely to happen soonish...

Lololololol

QUOTE(Kelly Martin @ Wed 30th September 2009, 1:03pm) *

I advocated for a policy of purging "vandalism" revisions (and their reverts) from the database after six months...

That would only work well if everything categorized as vandalism was actually vandalism, and if all "vandalism" was devoid of serious literary/artistic/political/scientific value. Everybody's got their own little definition and judgment doesn't scale well for uncorrectable decisions on WP.
Milton Roe
QUOTE(CharlotteWebb @ Wed 30th September 2009, 11:04am) *

QUOTE(Milton Roe @ Wed 30th September 2009, 6:33am) *

It certainly would be if the WMF servers actually store every old WP page version, instead of just the differences between versions. Tell me they don't store a whole article's worth of text to keep track of one revision of one spelling error? Or each vandalism and its reversions?

For sheer reasons of bandwidth, there is some point at which data-compression must win out over savings in computation to restore data from difference-sets. Which one limits WP now?

Theoretically they could do a little bit of both, by having most edits stored as a diff but also storing full-text of the current version plus periodic "key frames" to allow fairly efficient reconstruction of how a page might have looked at any given time.

Exactly what I was thinking. And if they did, flagged revisions might actually result in big efficiency savings, as the key pages you'd be saving in toto would of course be the ones with high reps, not the IP changes that are shortly reverted by ClueBot. Those could stay stored as mere diffs, without great danger that we'd thereby somehow accidentally fail to save the great edits of some great master on the subject, who will never write on it again.

Wikipedia's social policies make the latter scenario more likely than anything they could do with their SOFTWARE. It's not like they really care.

That said, I don't think WP does ANY of this. I think they store a complete new text-image of every frigging article, every time a change is made to it by anybody (or a new image of the section, if it's a section-edit). They save compression-wise on the transcluded pages, if they aren't changed (including illustrations). But even saving and later pulling all that text must be killing their servers.

QUOTE(CharlotteWebb @ Wed 30th September 2009, 11:04am) *

QUOTE(Kelly Martin @ Wed 30th September 2009, 1:03pm) *

I advocated for a policy of purging "vandalism" revisions (and their reverts) from the database after six months...

That would only work well if everything categorized as vandalism was actually vandalism, and if all "vandalism" was devoid of serious literary/artistic/political/scientific value. Everybody's got their own little definition and judgment doesn't scale well for uncorrectable decisions on WP.

Sure, but it's hard to argue for keeping a full image for a change that didn't last longer than a few minutes, six months ago. That is where all you need is the diff, particularly if it was totally reverted and never came back.
dogbiscuit
QUOTE(Milton Roe @ Thu 1st October 2009, 1:05am) *

Sure, but it's hard to argue for keeping a full image for a change that didn't last longer than a few minutes, six months ago. That is where all you need is the diff, particularly if it was totally reverted and never came back.

You don't need anything.

However, the logic they are following is that when they store history in a compressed blob, typically these very similar versions get compressed down to not a lot.
Milton Roe
QUOTE(dogbiscuit @ Wed 30th September 2009, 5:36pm) *

QUOTE(Milton Roe @ Thu 1st October 2009, 1:05am) *

Sure, but it's hard to argue for keeping a full image for a change that didn't last longer than a few minutes, six months ago. That is where all you need is the diff, particularly if it was totally reverted and never came back.

You don't need anything.

However, the logic they are following is that when they store history in a compressed blob, typically these very similar versions get compressed down to not a lot.

Not unless the differences are subtracted (abstracted) BEFORE they are compressed. And that's the whole question.
anthony
QUOTE(Kelly Martin @ Tue 29th September 2009, 10:23pm) *

They won't enable email notifications for the English Wikipedia (the only project for which they are not enabled) because turning them on would cream the daylights out of the poor little box that handles their email.


They can't use more than one box for email?

QUOTE(dogbiscuit @ Wed 30th September 2009, 10:48am) *

The trouble with diff based revisions is that they are at risk from both storage and processing error.


Not any more than the current system. Did you look at HistoryBlob? Have you seen many reports of storage and processing errors in svn?

QUOTE(dogbiscuit @ Wed 30th September 2009, 10:48am) *

Also, when you want to do things like removing versions, it is going to be a process of reconstruction then rediffing.


Removing individual versions is relatively rare, and if you use skip-deltas (like Subversion), the process is no more complicated than reading a history version. Finding the record would be O(log n) and once you found the record it'd be O(1). In all likelihood it'd be *faster* than the current system.

QUOTE(Milton Roe @ Thu 1st October 2009, 1:05am) *

QUOTE(dogbiscuit @ Wed 30th September 2009, 5:36pm) *

QUOTE(Milton Roe @ Thu 1st October 2009, 1:05am) *

Sure, but it's hard to argue for keeping a full image for a change that didn't last longer than a few minutes, six months ago. That is where all you need is the diff, particularly if it was totally reverted and never came back.

You don't need anything.

However, the logic they are following is that when they store history in a compressed blob, typically these very similar versions get compressed down to not a lot.

Not unless the differences are subtracted (abstracted) BEFORE they are compressed. And that's the whole question.


No, the compression works well even if you don't do the diff. But it works a lot better if you do the diff.

Concatenating all historical versions of the anarchism article gets you a 902 meg file. Compressing the concatenation of all versions of anarchism gets you from 902 megs to 51 megs. Using reverse deltas gets you from 902 megs to 42 megs. Compressing the reverse delta file gets you from 902 megs to 42 megs to 3.3 megs. That alone doesn't give you O(log n) random access, though. To get random access you need to use skip-deltas, which is how Subversion works (I believe they use forward deltas rather than reverse deltas though). This is a problem which was solved years ago, but the Wikipedians insist on their own kludgy home grown solution. This is what happens when you have a great coder and terrible manager as CTO.

http://blog.p2pedia.org/2008/10/anarchism.html
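As a toy illustration of those numbers (invented revision texts, with difflib standing in for a real delta format, so the exact ratios won't match the anarchism figures), something like this shows the same progression: plain concatenation, compressed concatenation, reverse deltas, compressed deltas.
CODE
# Toy comparison: compress the plain concatenation of revisions vs. a file of
# reverse deltas (newest revision in full, plus a diff from each revision back
# to its predecessor). Illustrative only.
import difflib
import gzip

lines = [f"Paragraph {i}: anarchism is a political philosophy." for i in range(200)]
revisions = []
for i in range(50):
    lines = lines + [f"Edit {i} appended this line."]
    revisions.append("\n".join(lines) + "\n")

concatenation = "".join(revisions).encode("utf-8")

deltas = [revisions[-1]]  # newest revision stored in full
for newer, older in zip(revisions[::-1], revisions[::-1][1:]):
    diff = difflib.unified_diff(newer.splitlines(), older.splitlines(), lineterm="")
    deltas.append("\n".join(diff))
delta_file = "\n".join(deltas).encode("utf-8")

print("concatenation:      ", len(concatenation))
print("gzip(concatenation):", len(gzip.compress(concatenation)))
print("reverse deltas:     ", len(delta_file))
print("gzip(deltas):       ", len(gzip.compress(delta_file)))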
anthony
QUOTE(CharlotteWebb @ Wed 30th September 2009, 6:04pm) *

QUOTE(Milton Roe @ Wed 30th September 2009, 6:33am) *

For sheer reasons of bandwidth, there is some point at which data-compression must win out over savings in computation to restore data from difference-sets. Which one limits WP now?

Theoretically they could do a little bit of both, by having most edits stored as a diff but also storing full-text of the current version plus periodic "key frames" to allow fairly efficient reconstruction of how a page might have looked at any given time.


Again, problem that is already solved. Skip-deltas. Basically, you do deltas in a binary tree structure, so the most branches you have to traverse is log2(n).
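If it helps, the usual simplified way to explain that numbering: revision r's delta is taken against revision r with its lowest binary bit cleared, so rebuilding revision r from revision 0 touches at most about log2(r) deltas. A small Python sketch of the idea (not Subversion's or MediaWiki's actual code):
CODE
# Simplified skip-delta numbering: each revision's delta base is the revision
# number with its lowest set bit cleared, giving O(log n) delta chains.

def skip_delta_base(rev: int) -> int:
    """Revision that rev's delta is stored against."""
    return rev & (rev - 1)  # clear the lowest set bit

def delta_chain(rev: int) -> list[int]:
    """Revisions touched when reconstructing `rev` starting from revision 0."""
    chain = [rev]
    while rev > 0:
        rev = skip_delta_base(rev)
        chain.append(rev)
    return chain

for r in (7, 100, 1_000_000):
    chain = delta_chain(r)
    print(f"rev {r}: {len(chain) - 1} deltas -> {chain}")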
CharlotteWebb
QUOTE(Milton Roe @ Thu 1st October 2009, 12:05am) *

Exactly what I was thinking. And if they did, flagged revisions might actually result in big efficiency savings, as the key pages you'd be saving in toto would of course be the ones with high reps...

Well no, I didn't mean for this to be based on the estimated quality of the edit, but on the estimated efficiency trade-off. So the exact location of key frames would depend on several factors, be indiscernible to site-users, and not necessarily be stable.

Looking at the version from Wednesday might involve loading it instantly from a cached revision snapshot, or reconstructing it from Monday's version by "mentally" adding Tuesday's edits to it (or subtracting this morning's edits from the current version).

The goal being for site-users never to notice the difference, one would want to base this on how much extra execution time is acceptable to save a KB in the database.

I'm thinking the tradeoff would be most favorable for revisions which are viewed least often and have the most redundant text.

Prime example: I add a pithy comment to the bottom of some big stupid noticeboard, then make a second edit to correct my spelling error, etc. The revision texts will weigh about half a megabyte each, be at least 99% identical (unless I'm on a really long rant), and hardly be worth long-term storage as full text unless there is sufficient demand, that is, if my edit is viewed [FSVO] disproportionately more often than others.
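That trade-off could be written down as a (purely hypothetical) storage policy: keep a full-text key frame only when the reconstruction time saved per kilobyte of extra storage crosses some acceptable threshold. The function and every number below are invented for illustration, not anything MediaWiki actually does:
CODE
# Hypothetical key-frame policy: store full text only when repeatedly
# reconstructing the revision from deltas would cost more time per KB of
# storage saved than we are willing to pay. All numbers are made up.

def store_as_keyframe(full_size_bytes: int,
                      delta_size_bytes: int,
                      expected_views_per_month: float,
                      reconstruct_cost_ms_per_kb: float = 0.5,
                      acceptable_extra_ms_per_kb_saved: float = 2.0) -> bool:
    bytes_saved = full_size_bytes - delta_size_bytes
    if bytes_saved <= 0:
        return True  # the delta is no smaller, so keep the full text
    # Monthly reconstruction cost if only the delta is kept.
    extra_ms = (expected_views_per_month
                * (full_size_bytes / 1024)
                * reconstruct_cost_ms_per_kb)
    return extra_ms / (bytes_saved / 1024) > acceptable_extra_ms_per_kb_saved

# Rarely viewed noticeboard revision, 99% identical to its parent: keep the delta.
print(store_as_keyframe(500_000, 5_000, expected_views_per_month=0.1))    # False
# Heavily viewed revision of a popular article: keep a full-text key frame.
print(store_as_keyframe(80_000, 20_000, expected_views_per_month=5_000))  # True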