About the scrape (sunbeam.city/@puffinus_puffinu).

There is very little you can do about something like that happening to your toots. This is what 'public' posting means: your messages will get read by someone you don't agree with, in contexts that are different than originally intended, using means you dislike.

It is currently trivially easy to scrape the entire fediverse like the researchers did. So it will happen again.

If you are worried about your messages: first of all, don't put online what you don't want a random stranger to read. Your audience is always bigger than you imagine.

Second, make more use of the 'unlisted' and 'followers only' posting modes. The scrape made use of the public timelines (http://<instance>/api/v1/timelines/public?local=true), and posts made with these two visibility modes do not show up there. This is easily circumvented by a follower bot though, which is one of the reasons why some folks choose to manually approve their followers.
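
To make the effect of the visibility modes concrete, here is a rough sketch in Python. It is a simplified model, not Mastodon's actual server code: it only encodes the rule that, of the four visibility values ('public', 'unlisted', 'private' for followers only, and 'direct'), only 'public' posts land on the public timelines. The function names and the sample posts are made up for illustration.

```python
# Simplified model: which posts would a scraper of the public timeline see?
# Assumption: only the visibility value matters here; a real server applies
# further filtering that this sketch ignores.

VISIBILITIES = ("public", "unlisted", "private", "direct")


def appears_on_public_timeline(visibility: str) -> bool:
    """Return True if a post with this visibility shows up on public timelines."""
    if visibility not in VISIBILITIES:
        raise ValueError(f"unknown visibility: {visibility!r}")
    return visibility == "public"


def exposed_to_timeline_scrape(posts: list[dict]) -> list[dict]:
    """Keep only the posts a public-timeline scrape could collect."""
    return [p for p in posts if appears_on_public_timeline(p["visibility"])]


# Hypothetical sample posts, for illustration only.
sample_posts = [
    {"id": 1, "visibility": "public"},
    {"id": 2, "visibility": "unlisted"},
    {"id": 3, "visibility": "private"},  # 'followers only' in the web UI
]
scraped_ids = [p["id"] for p in exposed_to_timeline_scrape(sample_posts)]
```

In this model only the first post is collected. It says nothing about a bot that follows you, which would still receive the followers-only post.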

This stuff is pretty difficult to figure out and I do not at all wish to suggest it is your own fault if your data got scraped. It isn't. (Question is, did we do enough to inform new users about the meaning and utility of these settings?)

So how about the researchers?


Are they allowed to do this? According to the research paper they can do so because posts are public. Additionally, the privacy policies of the instances don't explicitly disallow it (tip for instance admins). They also respected robots.txt (tip for instance admins) and they note that many instances' privacy policies are copy pasta of MS (tip for..). They also say that because they do not store and release personally identifiable data, it complies with the EU GDPR.
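
For admins wondering what the robots.txt tip could look like: below is a small sketch using Python's standard urllib.robotparser to check how a well-behaved crawler would read a robots.txt that disallows the API. The rules, the example.social host and the "research-bot" user agent are assumptions for illustration, not what any instance actually serves. Keep in mind that robots.txt is only advisory; it stops crawlers that choose to respect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an admin might serve to keep crawlers off the API.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical instance and crawler name.
timeline_url = "https://example.social/api/v1/timelines/public?local=true"
about_url = "https://example.social/about"

timeline_allowed = is_allowed(ROBOTS_TXT, "research-bot", timeline_url)
about_allowed = is_allowed(ROBOTS_TXT, "research-bot", about_url)
```

With these rules a compliant crawler skips the public timeline endpoint but may still fetch ordinary pages such as /about.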

However, they seem to assume the research subjects know their data is open for scraping by being public. This is not the same as *informed consent* which is required under GDPR.

Considering the outcry, it is clear informed consent was not given.

Even more problematic though is the data set they released. It consists of 6M public posts + metadata. While they hashed the author of each public post (What hashing algo though.. did someone already check that?) they left in the link to the original post, which contains the author. Here is an example of such a link: post.lurk.org/@rra/10347508927

So this is clearly bad practice and a huge ethics and GDPR violation. In fact, the data set has already been taken down: dataverse.harvard.edu/dataset.

@er1n and others prepared a letter of complaint on these grounds, see social.mecanis.me/@er1n/103472

The letter:

