About the scrape (sunbeam.city/@puffinus_puffinu).

There is very little you can do about something like that happening to your toots. This is what 'public' posting means: your messages will get read by someone you don't agree with, in contexts that are different than originally intended, using means you dislike.

It is currently trivially easy to scrape the entire fediverse like the researchers did. So it will happen again.

If you are worried about your messages: first of all, don't put on-line what you don't want a random stranger to read. Your audience is always bigger than you imagine.

Second, make more use of the 'unlisted' and 'followers only' posting modes. The scrape used the public timelines (http://<instance>/api/v1/timelines/public?local=true), and posts made with these two visibility modes do not show up there. A follower bot easily circumvents this protection though, which is one of the reasons why some folks choose to manually approve their followers.
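To make the visibility point concrete, here is a minimal sketch in Python. It only builds the public timeline URL mentioned above and checks which visibility values would land there; it makes no network request. The function names and the `example.social` host are made up for illustration, and the use of https instead of http is an assumption. The four visibility values are the ones the Mastodon API uses for a status.

```python
from urllib.parse import urlencode

# Visibility values a Mastodon status can carry:
# "public", "unlisted", "private" (followers only), "direct".
# Only "public" posts are listed on the public timelines.
LISTED_ON_PUBLIC_TIMELINE = {"public"}


def public_timeline_url(instance: str, local: bool = True) -> str:
    """Build the public timeline URL for an instance (hypothetical helper)."""
    query = urlencode({"local": str(local).lower()})
    return f"https://{instance}/api/v1/timelines/public?{query}"


def appears_in_public_timeline(visibility: str) -> bool:
    """Return True if a post with this visibility shows up on the public timeline."""
    return visibility in LISTED_ON_PUBLIC_TIMELINE


if __name__ == "__main__":
    print(public_timeline_url("example.social"))
    for mode in ("public", "unlisted", "private", "direct"):
        print(mode, appears_in_public_timeline(mode))
```

Note that this says nothing about what a follower sees: an approved follower, bot or not, still receives 'followers only' posts.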

This stuff is pretty difficult to figure out and I do not at all wish to suggest it is your own fault if your data got scraped. It isn't. (Question is, did we do enough to inform new users about the meaning and utility of these settings?)

So how about the researchers?



Are they allowed to do this? According to the research paper they can do so because posts are public. Additionally, the privacy policies of the instances don't explicitly disallow it (tip for instance admins). They also respected robots.txt (tip for instance admins) and they note that many instances' privacy policies are copy-pasted from mastodon.social (tip for..). They also say that because they do not store and release personally identifiable data, it complies with the EU GDPR.
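For admins wondering what "respecting robots.txt" amounts to on the crawler side, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `research-bot` user agent and the `example.social` host are invented for the example; I don't know which rules the researchers' crawler actually checked or how.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt an admin might serve to keep
# well-behaved crawlers away from the API endpoints.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /api/
"""

if __name__ == "__main__":
    timeline = "https://example.social/api/v1/timelines/public"
    print(may_fetch(EXAMPLE_ROBOTS, "research-bot", timeline))
    print(may_fetch(EXAMPLE_ROBOTS, "research-bot", "https://example.social/about/more"))
```

Keep in mind robots.txt is only a request: it stops crawlers that choose to honour it, nothing more.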

However, they seem to assume the research subjects know their data is open for scraping because it is public. This is not the same as *informed consent* which is required under GDPR.

Considering the outcry, it is clear informed consent was not given.

Even more problematic though is the data set they released. It consists of 6M public posts + metadata. While they hashed the author of each public post (What hashing algo though.. did someone already check that?) they left in the link to the original post, which contains the author. Here is an example of such a link: post.lurk.org/@rra/10347508927
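A small sketch of why hashing the author does not help when the post link stays in the record. The record layout and the use of SHA-256 are assumptions for illustration (the actual algorithm and field names in the data set are unknown to me); the URL is the example from above.

```python
import hashlib
from typing import Optional
from urllib.parse import urlsplit


def pseudonymize(author: str) -> str:
    """Hash an author handle. SHA-256 is an assumption, not the paper's method."""
    return hashlib.sha256(author.encode("utf-8")).hexdigest()


def author_from_post_url(url: str) -> Optional[str]:
    """Recover 'user@instance' from a post link like instance/@user/<id>."""
    if "://" not in url:
        url = "https://" + url  # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    for segment in parts.path.split("/"):
        if segment.startswith("@"):
            return f"{segment[1:]}@{parts.netloc}"
    return None


# Hypothetical record: hashed author, but the original link left in place.
record = {
    "author": pseudonymize("rra@post.lurk.org"),
    "url": "post.lurk.org/@rra/10347508927",
}

recovered = author_from_post_url(record["url"])

if __name__ == "__main__":
    print(recovered)
    print(pseudonymize(recovered) == record["author"])
```

So whatever the hash is, anyone with the data set can read the author straight off the link.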

So this is clearly bad practice and a huge ethics and GDPR violation. In fact the data set has already been taken down. dataverse.harvard.edu/dataset.

@er1n and others prepared a letter of complaint on these grounds, see social.mecanis.me/@er1n/103472

The letter:



@rra this is factually incorrect. They ASSUMED that all instances TOS were similar to those of mastodon.social and didn't check. An instance was scraped that SPECIFICALLY disallowed this kind of thing in TOS.

@c24h29clo4 True. Very sloppy research. Maybe they also never read the "about/more" pages of the instances.

@rra @er1n a small correction, informed consent is *not* a hard requirement of the GDPR. It is one of the many justifications you can give for processing data, and it's up to the regulator or judge to decide if it's a valid one.

Public statements also get less protection wrt. sensitive data. Under normal circumstances, a company can never store whether you're a union member. But if you toot "I'm a union member", then a lot more is permitted.
