The Poor Man’s Flagged-Revision Fix
(Actually, a number of them)
If you’re scraping WP for non-vandalized articles to put into a chip for a book-reader for the dusty children of Africa, or for children anywhere where children can’t afford internet cafes and have no home computer/net access, how do you do it? You can’t just avoid all but semi-protected articles— WP’s open editor policy has seen to the fact that there aren’t enough of them.
What you CAN do is go back in the history of the article to the last version saved by a nameuser. Even further, you can pick a nameuser who is not redlinked, showing that they at least have edited their talk page. This will miss a few doppelganger accounts like JzG and others who enjoy having red usernames, but this is not enough to affect statistics. Who really cares about versions JzG has approved anyway?
You could use this technique and still luck into a new nameuser vandal who hasn’t been blocked yet, but just going by non-red nameuser versions does pretty well. You’ll miss out on a few IP edits, but in any longer article, the odds that the last IP user’s edits have added anything of lasting value are small. Even Cluebot knows this, and you can use Cluebot’s technique of which version to revert TO, to figure out which version to look AT.
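For the scraper-minded, here is a minimal sketch of that walk-back, assuming you already have an article's history as a newest-first list. The Revision record and its redlinked field are my own invention for illustration, not anything MediaWiki hands you in this shape, and treating "username parses as an IP address" as the test for an anonymous edit is also an assumption.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Revision:
    rev_id: int
    user: str
    redlinked: bool  # hypothetical flag: True if the username shows up red


def is_ip_user(name: str) -> bool:
    """Assume anonymous edits are attributed to an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def last_nameuser_revision(history, skip_redlinked=True):
    """Return the newest revision saved by a nameuser, or None.

    history must be ordered newest-first.
    """
    for rev in history:
        if is_ip_user(rev.user):
            continue  # skip IP edits
        if skip_redlinked and rev.redlinked:
            continue  # skip red usernames
        return rev
    return None


# Made-up history, newest first.
history = [
    Revision(105, "203.0.113.7", True),
    Revision(104, "BrandNewName", True),
    Revision(103, "JoeBlow", False),
    Revision(102, "198.51.100.2", True),
]

print(last_nameuser_revision(history).rev_id)
print(last_nameuser_revision(history, skip_redlinked=False).rev_id)
```

With skip_redlinked left on, the sketch steps past both the IP edit and the red-named account and lands on JoeBlow's version.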
Of course, it’s possible to do even better, by looking at the same two simple figures of merit that WP uses to “register” name users. A name user account requires 5 edits and 4 days (or something). But suppose we set the bar higher, and have it apply to only nameuser accounts which have 300 edits or more, and have been active more than a month? Now, we’ve screened out just about all the simple vandals. The only problem is that in order to easily generate this list of “time/edit” trusted users, we have to use some server-time-using tool, to get at these statistics, to check every nameuser we find on a last non-IP version of a Wiki. At this point, we want some kind of look up table of trusted (old and active) nameusers.
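The higher bar itself is a two-line test. A rough sketch, assuming you have somehow obtained an edit count and a registration date for the account (the function name and the reading of "a month" as 30 days are my assumptions):

```python
from datetime import date, timedelta

MIN_EDITS = 300                # "300 edits or more"
MIN_AGE = timedelta(days=30)   # "more than a month", approximated as 30 days


def is_trusted(edit_count: int, registered: date, today: date) -> bool:
    """True if the account clears both the edit bar and the age bar."""
    return edit_count >= MIN_EDITS and (today - registered) > MIN_AGE


today = date(2009, 3, 1)
print(is_trusted(300, date(2009, 1, 1), today))    # old enough, enough edits
print(is_trusted(5000, date(2009, 2, 20), today))  # plenty of edits, too new
```

The expensive part is not this test but fetching the two numbers for every nameuser you meet, which is why a precomputed table is attractive.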
Here we can observe that very high edit counts serve as proxy for minimal account age, since it’s more or less impossible to run up thousands of edits in less than the minimal time we’d like to make sure a vandal nameuser account has been “noticed” and blocked. So one thing we can do immediately to generate a list of “trusted” nameusers, is use the list of editors by edit-count, and take all of them. These are at:
http://en.wikipedia.org/wiki/Wikipedia:Lis...number_of_edits
This ends at 4000 editors, who all have at least 8933 edits. We can take them all. If the list includes some inactive editors, and editors who have since been blocked for political problems or fighting or socking or whatever, it doesn’t matter. Blocked editors won’t be ones we’re querying for our latest article version, and even if they are (recently blocked) do we really care that the last version we’re reading is by somebody with 10,000 edits but recently blocked? Is their banning likely to be due to anything having to do with clearly erroneous content? Whatever they did, by definition, is likely to be POV-pushing type subtle, and we can probably stand to look at that.
Another interesting list is the 5000 editors who’ve made the most edits in the last 30 days, which currently means anybody who has made more than 117 edits in the last month:
http://en.wikipedia.org/wiki/Wikipedia:Lis...of_recent_edits
These also are unlikely to include any nameuser vandal who has made 117 edits (whether in a day, week or month) but hasn’t yet been caught, so all of these are probably useful. Of course, there will be some overlap with the list of editors with the most edits, but probably a large divergence also. Since we’re interested in a pool of editors much larger than is in either of these lists, we want the union of them, not just the people who are on both lists (though such an intersection list would presumably generate very active and also super-contributive nameusers).
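In set terms, that is a union rather than an intersection. A toy sketch, assuming the two lists have already been scraped into plain lists of usernames (the names below are made up):

```python
def trusted_pool(most_edits, most_recent_edits):
    """Union: anyone who appears on either list."""
    return set(most_edits) | set(most_recent_edits)


def super_contributors(most_edits, most_recent_edits):
    """Intersection: only those who appear on both lists."""
    return set(most_edits) & set(most_recent_edits)


# Made-up stand-ins for the two scraped lists.
by_total_edits = ["JoeBlow", "OldHand", "Gnome42"]
by_recent_edits = ["JoeBlow", "BusyBee"]

pool = trusted_pool(by_total_edits, by_recent_edits)
print(sorted(pool))
print(sorted(super_contributors(by_total_edits, by_recent_edits)))
print("BusyBee" in pool)  # the look-up our scraper would actually do
```

A set gives the look-up table we wanted: checking whether the saver of a version is in the pool costs nothing on the servers.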
Note that the present proposals for flagged, patrolled and sighted versions/revisions (none of them quite the same thing, see WP:FLAG), all have the basic problem that nobody can agree on the criteria for an editor to be a reviewer/patroller or trusted editor or article-promoter. See for instance the debate at: http://en.wikipedia.org/wiki/Wikipedia:Reviewers
Worse still, the “flag” proposals I’ve seen make article promotion into a manual time-consuming process, instead of the automatic thing it should be, whenever a nameuser who carries the “trusted” flag, edits and saves a version. This should automatically make it “sighted” to whatever specifications we trust THAT editor with. Right? Duh.
Many of these problems could be bypassed, if WP simply kept weekly track (no more often than weekly is necessary) of two things for each editor: 1) total time in months since registering, and 2) total edits in blocks of 100, to the nearest 100. For editors who edit more than once a week, this could be done any time in the week that server time is available, and put somewhere in the file associated with the username (perhaps the same one that contains the password). As such, it would not be subject to manipulation. For editors who have not edited for more than a week, the update could be done on the spot, at their next edit.
It would remain for MediaWiki’s software to simply append these two numbers, in parentheses, after the username, whenever a user edits and saves any page. Thus, if you see in the history of a page that user:JoeBlow(24;200) has saved a page, you know that JoeBlow has been a user for 24 months and has 20,000 edits. And again, for our purposes, it DOES NOT MATTER if the figures aren’t absolutely up to date or accurate.
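To make the notation concrete, here is a small sketch of writing and reading such a tag. Nothing like this exists in MediaWiki; the function names are hypothetical, and rounding halves upward is my own choice for "to the nearest 100".

```python
import re

# Matches "Name(months;hundreds)", e.g. "JoeBlow(24;200)".
TAG_RE = re.compile(r"^(?P<name>.+)\((?P<months>\d+);(?P<hundreds>\d+)\)$")


def merit_tag(username: str, months: int, edits: int) -> str:
    """Append months registered and edits in blocks of 100 to a username."""
    hundreds = (edits + 50) // 100  # nearest 100, halves rounded up
    return f"{username}({months};{hundreds})"


def parse_merit_tag(tag: str):
    """Return (username, months, approximate edit count) from a tag."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"not a merit tag: {tag!r}")
    return m["name"], int(m["months"]), int(m["hundreds"]) * 100


tag = merit_tag("JoeBlow", 24, 20000)
print(tag)
print(parse_merit_tag(tag))
```

Because the edit figure is stored in hundreds, what comes back out is only approximate, which is all we need here.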
Once this is done, we can do something remarkable. Not only can the entire project automatically set a “sighted” floor for which editors can flag an article by the mere act of saving it, but this “floor” can be easily changed at any time, to get the best outcome.
Moreover, if the project as a whole cannot agree on the “sighting” limits for editors (which seems likely), that doesn’t matter, either! Once these merit-numbers are associated with all nameusers, the reader/user can set the numbers to any value in their user-preferences. That is, they can choose, if they prefer, to read the last sighted version rather than the last raw version. (Of course, even if you’re reading sighted versions, you can always bring up the last raw version in the history, if you want to see it; and it will automatically come up if you edit.)
Thus, if a reader wants to see just the last raw version, as now, he or she can set their preferences that way. If they want to see only versions of articles which have been saved by editors with (say) at least 6 months of wiki-experience and at least 1000 edits, they can do that, also. And change these personal thresholds at any time.
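A sketch of how that per-reader choice might work, assuming each saved version carries its saver's two merit numbers. The TaggedRevision record and the convention that IP edits carry zeros are my assumptions, not anything in the software.

```python
from dataclasses import dataclass


@dataclass
class TaggedRevision:
    rev_id: int
    user: str
    months: int  # months since registering (0 assumed for IP edits)
    edits: int   # approximate edit count (0 assumed for IP edits)


def version_to_show(history, min_months=0, min_edits=0):
    """Return the newest revision whose saver meets the reader's thresholds.

    history must be newest-first. Thresholds of 0 give the raw latest
    version. Returns None if no revision qualifies.
    """
    for rev in history:
        if rev.months >= min_months and rev.edits >= min_edits:
            return rev
    return None


# Made-up history, newest first.
history = [
    TaggedRevision(9, "203.0.113.7", 0, 0),
    TaggedRevision(8, "NewName", 1, 100),
    TaggedRevision(7, "JoeBlow", 24, 20000),
]

print(version_to_show(history).rev_id)                                # raw latest
print(version_to_show(history, min_months=6, min_edits=1000).rev_id)  # cautious reader
```

The same history serves every reader; only the two numbers in their preferences differ.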
Note: yes, I know that far more complicated Taj-Mahal proposals have been made by MediaWiki lab people. One is that each nameuser carry around a figure of merit which encodes their sum total of bytes changed on WP, multiplied by the time each byte-change has lasted (or lasted till removed). This will indeed be the gold standard of content contribution. And it has been envisioned that this number can be used to change the shade of orange that each nameuser’s changes to an article appear in! Thus, if you like, you can see that the darker a word, the more content the user who added it, has contributed.
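As I read that proposal, the figure of merit is a sum of bytes times survival time over all of a user's changes. A toy version, with the caveat that the units (byte-days) and the input format are my guesses and not the lab people's actual design:

```python
def content_merit(changes):
    """Sum of bytes_changed * days_survived over a user's changes.

    changes is an iterable of (bytes_changed, days_survived) pairs,
    where days_survived counts up to now or up to removal.
    """
    return sum(nbytes * days for nbytes, days in changes)


# Made-up contribution record for one user.
joe_changes = [(120, 30), (40, 365), (500, 1)]
print(content_merit(joe_changes))  # 120*30 + 40*365 + 500*1 = 18700
```

Even this toy shows where the cost lies: you need the survival time of every byte-change, which means diffing the whole history of every article.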
We should all live so long. Before the millennial vision of this arrives, perhaps we should try some of the simpler solutions outlined above. Most or all of them have the virtue of being easy to implement in software, and also that we don’t have to have community agreement on any of the “trust” thresholds for readers. Rather, each reader chooses for themselves. A horrid idea, no?
Perennially Proposing Milton