About the scrape (sunbeam.city/@puffinus_puffinu).

There is very little you can do about something like that happening to your toots. This is what 'public' posting means: your messages will get read by someone you don't agree with, in contexts different from those originally intended, using means you dislike.

It is currently trivially easy to scrape the entire fediverse like the researchers did. So it will happen again.

If you are worried about your messages: first of all, don't put online what you don't want a random stranger to read. Your audience is always bigger than you imagine.

Second, make more use of the 'unlisted' and 'followers only' posting modes. The scrape made use of public timelines (http://<instance>/api/v1/timelines/public?local=true), and using these two visibility modes guards against that. This is easily circumvented by a follower bot though, which is one of the reasons why some folks choose to manually approve their followers.
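To make concrete how low the barrier is, here is a minimal sketch of that kind of collection against the endpoint above, assuming the Python requests library; the instance name and page size are placeholders, not details from the researchers' setup.

```python
# Minimal sketch of pulling an instance's public timeline, assuming the
# 'requests' library. Instance name and page size are placeholders, not
# details taken from the researchers' setup.
import requests

INSTANCE = "example.social"  # hypothetical instance
url = f"https://{INSTANCE}/api/v1/timelines/public"

resp = requests.get(url, params={"local": "true", "limit": 40}, timeout=10)
resp.raise_for_status()

for status in resp.json():
    # Only posts with 'public' visibility appear here; 'unlisted' and
    # 'followers only' posts do not show up on this endpoint.
    print(status["account"]["acct"], status["url"])
```

Walking backwards through an instance's entire local history from there is just a matter of paginating with the max_id parameter.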

This stuff is pretty difficult to figure out and I do not at all wish to suggest it is your own fault if your data got scraped. It isn't. (Question is, did we as a community do enough to inform new users about the meaning and utility of these settings?)

So how about the researchers?

1/?

Are they allowed to do this? According to the research paper they can do so because posts are public. Additionally, the privacy policies of the instances don't explicitly disallow it (tip for instance admins). They also respected robots.txt (tip for instance admins) and they note that many instances' privacy policy is a copy-paste of mastodon.social's (tip for..). They also say that because they do not store and release personally identifiable data, it complies with the EU GDPR.
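As an aside on the robots.txt point: honouring it on the crawler side is a couple of lines with the Python standard library, which also shows why it is only a courtesy signal. A generic sketch, not the researchers' actual tooling; the instance name and user agent are made up.

```python
# Generic sketch of a robots.txt check on the crawler side; not the
# researchers' actual tooling. Instance name and user agent are made up.
# A polite crawler asks first; a hostile one simply skips this step.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.social/robots.txt")  # hypothetical instance
rp.read()

endpoint = "https://example.social/api/v1/timelines/public"
if rp.can_fetch("research-crawler", endpoint):  # hypothetical user agent
    print("robots.txt permits fetching", endpoint)
else:
    print("robots.txt disallows this path for this user agent")
```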

However, they seem to assume the research subjects know their data is open for scraping by virtue of being public. This is not the same as *informed consent*, which is required under the GDPR.

Considering the outcry, it is clear informed consent was not given.

Even more problematic though is the data set they released. It consists of 6M public posts + metadata. While they hashed the author of each public post (what hashing algo though.. did someone already check that?), they left in the link to the original post, which contains the author's handle. Here is an example of such a link: post.lurk.org/@rra/10347508927
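To spell out why that pseudonymization fails: the handle can be read straight back out of the released URL, no hash cracking required. A rough sketch, using the example link above and assuming (purely for illustration) that SHA-256 was the hash.

```python
# Rough sketch of why hashing the author field does not pseudonymize the
# record when the status URL is kept. SHA-256 is an assumption, purely for
# illustration; the post above notes the actual algorithm is unclear.
import hashlib
from urllib.parse import urlparse

record = {
    "author_hash": hashlib.sha256(b"rra@post.lurk.org").hexdigest(),
    "url": "https://post.lurk.org/@rra/10347508927",  # the link left in the data set
}

# Recovering the author takes one line of parsing.
parts = urlparse(record["url"])
handle = parts.path.split("/")[1].lstrip("@")
print(f"{handle}@{parts.netloc}")  # -> rra@post.lurk.org
```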

So this is clearly bad practice and a huge ethics and GDPR violation. In fact the data set has already been taken down: dataverse.harvard.edu/dataset.

@er1n and others prepared a letter of complaint on these grounds, see social.mecanis.me/@er1n/103472

The letter:
docs.google.com/document/d/15j

2/3

@rra this is factually incorrect. They ASSUMED that all instances' TOS were similar to those of mastodon.social and didn't check. An instance was scraped that SPECIFICALLY disallowed this kind of thing in its TOS.

@c24h29clo4 True. Very sloppy research. Maybe they also never read the "about/more" pages of the instances.

@rra There is a server setting called 'Allow unauthenticated access to public timeline'. This should work to stop scraping using the API (I enabled this last night). Brute force scraping would always be possible.

I agree with you that posting something public on the internet is always a hazard. A lot of people are still not aware of that, especially teenagers.

On our server I have always had a robots.txt file and a bad bot blocker. I have to investigate why the scraper was able to bypass them.

A long time ago I also added a copyright paragraph saying that the users themselves hold the copyright. This is, afaik, standard European copyright (or at least Dutch). So you indeed cannot consider everything that is public on the Internet to be in the public domain.
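A quick way to check from the outside whether the 'Allow unauthenticated access to public timeline' setting mentioned above is actually restricting the API is to request the timeline without credentials and look at the response. A rough sketch, assuming the requests library; the instance name is a placeholder and the exact error status for restricted access may differ between Mastodon versions.

```python
# Rough external check of whether an instance still serves its public
# timeline to unauthenticated clients. The instance name is a placeholder,
# and the exact error status for restricted access may vary by version.
import requests

INSTANCE = "example.social"  # hypothetical
resp = requests.get(f"https://{INSTANCE}/api/v1/timelines/public",
                    params={"local": "true"}, timeout=10)

if resp.ok:
    print("Public timeline is readable without authentication")
else:
    print(f"Unauthenticated access appears restricted (HTTP {resp.status_code})")
```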

@jeroenpraat @rra for the record, that setting was added in v3.0 as far as I can tell. And v3.0 was released in October 2019, whereas the scraping in this incident happened in 2018, well before that option was implemented.

And I still don't have it on my instance, because I haven't updated to v3.

@jeroenpraat @rra I was reading up on copyright recently, and it seems that in European and American law the author of something always holds the copyright, even if they don't want it, but they can license it for other people to use.

@rra Here I thought it was illegal to think about this without screaming like a banshee

@rra

Quoting so I can boost this crucial excerpt.

"Second make more use of 'unlisted' and 'followers only' posting modes. The scrape made use of public timelines (http://<instance>/api/v1/timelines/public?local=true) and using these two visibility modes guards against that. This is easily circumvented by a follower bot though. Which is one of the reasons why some folks choose to manually approve their followers. "
