Wikipedia and privacy
anthony
QUOTE

On Thu, Apr 22, 2010 at 6:31 PM, Platonides <Platonides@gmail.com> wrote:

S. Nunes wrote:
> Hi all,
>
> I presume that Wikipedia keeps data about HTTP accesses to all articles.
> Can anybody inform me if this data is available for research purposes?

No. With the amount of traffic it has, space needs would be immense, and
Wikimedia is not interested in logging all accesses.


http://lists.wikimedia.org/pipermail/wiki-...ril/000987.html

Did most people here on Wikipedia Review know about this "sampled feed" (I've heard 1/100th all the way up to 1/10th for the sample rate)? Isn't this a huge breach of privacy? Why isn't anyone talking about this?

I'm especially surprised I've never heard Daniel Brandt bring it up. A log of what pages you're reading on Wikipedia is about as sensitive as a log of what searches you're doing on Google, and not only is the Wikimedia Foundation collecting it, but they're giving it out to third parties as well.

Yes, it's sampled, but the privacy policy doesn't say how often samples are taken; I've heard rates as high as 1/10, and even the lower figure of 1/100 still presents an unacceptable risk to regular users.
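
To put a rough number on that risk (assuming uniform, independent per-request sampling, which is my assumption since the policy doesn't describe the mechanism): the chance that a regular reader shows up in the log at least once grows quickly with the number of pages they view.

CODE
# Back-of-the-envelope sketch, not Wikimedia's actual tooling: if each
# request is logged independently with probability p, the chance that
# at least one of n requests gets captured is 1 - (1 - p)^n.
def exposure_probability(p, n):
    """Chance that at least one of n requests lands in the sample."""
    return 1 - (1 - p) ** n

for p in (1 / 100, 1 / 10):      # the two sample rates quoted in this thread
    for n in (10, 100, 1000):    # page views by one reader over time
        print(f"p={p:g}  n={n:<4}  P(logged at least once) = {exposure_probability(p, n):.3f}")

Even at the 1/100 rate, someone who views a hundred pages has roughly a 63% chance of appearing in the sample at least once.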

It's not even clear whether the procedure of giving out the data for "research purposes" is in compliance with the privacy policy. Yes, the policy mentions sampled data, but it claims "the raw log data is not made public". Now I'm not interested in getting into an argument with some Wikipedia apologist over whether the researchers fall under the rubric of "the public", or whether the modifications made to the data render it no longer "raw" (*). I admit it's ambiguous. The very fact that the privacy policy is so ambiguous is part of the problem.

(*) I would be interested in learning exactly what data *is* being released, in what form, and to whom.
Jon Awbrey
One Question. Where did you get the idea that the WikiMedia Foundation gives a rat's ass about anyone else's privacy?

Jon
Ottava
Wouldn't you be more worried that they have two points of information - an email and an IP?
Kelly Martin
QUOTE(anthony @ Fri 23rd April 2010, 9:12pm) *
It's not even clear whether the procedure of giving out the data for "research purposes" is in compliance with the privacy policy. Yes, the policy mentions sampled data, but it claims "the raw log data is not made public". Now I'm not interested in getting into an argument with some Wikipedia apologist over whether the researchers fall under the rubric of "the public", or whether the modifications made to the data render it no longer "raw" (*). I admit it's ambiguous. The very fact that the privacy policy is so ambiguous is part of the problem.
The purpose of the privacy policy is to protect the Wikimedia Foundation, not to protect individual readers or contributors. Once you understand this, the policy's "ambiguity" makes perfect sense.

There is, generally speaking, no "master log" for Wikipedia. What you can get is an HTTP log from any one Squid proxy server, or from any one Apache host, but such logs will necessarily not reflect all accesses due to load balancing and caching. They can produce a unified log (by collecting and combining all the relevant individual logs) if circumstances require it, but doing so is tedious and time-consuming and so they don't do it often.
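
Mechanically, combining per-server logs is just a timestamp-ordered merge; the tedium is operational (dozens of servers, log rotation, clock skew), not algorithmic. A minimal sketch, with hypothetical paths and a log format that starts each line with a sortable timestamp (neither of which I claim matches Wikimedia's setup):

CODE
# Hypothetical illustration of "collecting and combining" per-server
# access logs into one unified, time-ordered log.
import glob
import heapq

def read_log(path):
    """Yield (timestamp, line) pairs from one server's access log."""
    with open(path) as f:
        for line in f:
            # Assumes each line begins with a sortable timestamp field
            # and that each individual log is already in time order.
            yield line.split(" ", 1)[0], line

def unified_log(pattern="/var/log/squid-*/access.log"):
    """Merge every matching per-server log into one time-ordered stream."""
    streams = [read_log(p) for p in sorted(glob.glob(pattern))]
    for _ts, line in heapq.merge(*streams):
        yield line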
anthony
QUOTE(Jon Awbrey @ Sat 24th April 2010, 2:20am) *

One Question. Where did you get the idea that the WikiMedia Foundation gives a rat's ass about anyone else's privacy?


Mu. I didn't. But I figured maybe someone on here might.

QUOTE(Kelly Martin @ Sat 24th April 2010, 3:06am) *

The purpose of the privacy policy is to protect the Wikimedia Foundation, not to protect individual readers or contributors. Once you understand this, the policy's "ambiguity" makes perfect sense.


Well, I do understand this (at least in terms of what the purpose is, as opposed to what the purpose ought to be), and it does make sense.

I still think it's wrong.

QUOTE(Kelly Martin @ Sat 24th April 2010, 3:06am) *

There is, generally speaking, no "master log" for Wikipedia. What you can get is an HTTP log from any one Squid proxy server, or from any one Apache host, but such logs will necessarily not reflect all accesses due to load balancing and caching. They can produce a unified log (by collecting and combining all the relevant individual logs) if circumstances require it, but doing so is tedious and time-consuming and so they don't do it often.


How recent is your knowledge of how this works? Doesn't the Squid log the cache hits (a quick scan of the docs shows that this is certainly an option)?

In any case, what you're saying contradicts what I've been told, which is that only a sampled log is made.
Kelly Martin
QUOTE(anthony @ Fri 23rd April 2010, 10:19pm) *
How recent is your knowledge of how this works? Doesn't the Squid log the cache hits (a quick scan of the docs shows that this is certainly an option)?

In any case, what you're saying contradicts what I've been told, which is that only a sampled log is made.
The Squid logs will include cache hits. However, any one Squid's log will only have the events logged by that Squid. There are dozens of Squid servers, so unless they explicitly build a combined log, any one Squid's log will contain only a "sample" of all events for that time interval (the sample being the fraction of events that were load-balanced to that Squid). There are also dozens of Apache servers, so the log of any one Apache server will contain only those events that (a) made it past the Squids and (b) were load-balanced to that Apache server.
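
To make the "load-balanced sample" concrete: with N Squids behind a balancer, each server's log captures roughly 1/N of all traffic. A toy model (the server count and hashing scheme are invented for illustration, not Wikimedia's actual balancing):

CODE
# Toy model of why any one Squid's log is a "sample": each client is
# routed to one of N servers, so each log sees ~1/N of all requests.
import zlib

N_SQUIDS = 40  # hypothetical count

def assigned_squid(client_ip, n=N_SQUIDS):
    """Map a client to one of n Squids; only that Squid's log sees its events."""
    return zlib.crc32(client_ip.encode()) % n

Note that with sticky hashing like this, the "sample" is a fixed subset of clients rather than a random draw of requests: if your address maps to the Squid whose log gets pulled, all of your traffic through it is in there.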

A "unified log" has already been processed and is thus not raw; if they generate a unified log for any purpose they can just as easily generate a sample from that log and/or scrub it to conceal or remove IP addresses. My recollection is that the tools they have for creating unified logs can do both things.

For what it's worth, the systems people at Wikimedia have been, in my experience, both ethical and honorable. Their bosses may be scum, but I've never caught any of them doing anything beyond ordinary BOFHing.
anthony
QUOTE(Kelly Martin @ Sat 24th April 2010, 3:34am) *

A "unified log" has already been processed and is thus not raw; if they generate a unified log for any purpose they can just as easily generate a sample from that log and/or scrub it to conceal or remove IP addresses. My recollection is that the tools they have for creating unified logs can do both things.


I'm sure they can. But it seems to me that every time I search Wikipedia for [insert embarrassing or otherwise privacy-sensitive search here], there's a chance (maybe 1/10, maybe 1/100) that a bunch of grad students are going to get an unscrubbed log entry of that search.

If that's well known, and no one on here other than me has a problem with it, fine, nothing to see here, move along.
EricBarbour
QUOTE(Kelly Martin @ Fri 23rd April 2010, 8:34pm) *
For what it's worth, the systems people at Wikimedia have been, in my experience, both ethical and honorable. Their bosses may be scum, but I've never caught any of them doing anything beyond ordinary BOFHing.

The irony of this statement makes my head asplode.

I've rarely seen any talk about WP's operational query-response reliability. For a thing paid for by donations and run on assorted cheapie rent-a-servers using assorted open-source software packages (especially Squid), their uptime is remarkable. They should be writing technical papers and news releases bragging about their uptime statistics. And renting themselves out to other organizations, to set up similar serverfarms.

Nope... instead of bragging about uptime, the WMF brags about all its unpaid volunteers and the mountains of embarrassing dreck they routinely (inevitably?) produce. The important people, who keep the servers up... they toil in obscurity.
This is a "lo-fi" version of our main content. To view the full version with more information, formatting and images, please click here.