About the scrape (sunbeam.city/@puffinus_puffinu).

There is very little you can do about something like that happening to your toots. This is what 'public' posting means: your messages will get read by someone you don't agree with, in contexts that are different than originally intended, using means you dislike.

It is currently trivially easy to scrape the entire fediverse like the researchers did. So it will happen again.

If you are worried about your messages: first of all, don't put on-line what you don't want a random stranger to read. Your audience is always bigger than you imagine.

Second, make more use of the 'unlisted' and 'followers only' posting modes. The scrape made use of public timelines (http://<instance>/api/v1/timelines/public?local=true), and using these two visibility modes guards against that. This is easily circumvented by a follower bot though, which is one of the reasons why some folks choose to manually approve their followers.
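To make that concrete, here is a rough sketch of why the two visibility modes keep a post out of this kind of scrape. It makes no network request. The helper names are made up for illustration, and it assumes the status JSON carries a `visibility` field with the values `public`, `unlisted`, `private` (followers only) and `direct`, as Mastodon's API does.

```python
from urllib.parse import urlencode


def public_timeline_url(instance: str, local: bool = True) -> str:
    """Build the public timeline URL for an instance (nothing is fetched)."""
    query = urlencode({"local": str(local).lower()})
    return f"https://{instance}/api/v1/timelines/public?{query}"


def listed_on_public_timeline(status: dict) -> bool:
    """Only 'public' posts are listed on the public timelines."""
    return status.get("visibility") == "public"


# Hypothetical statuses, reduced to the one field that matters here.
statuses = [
    {"id": "1", "visibility": "public"},
    {"id": "2", "visibility": "unlisted"},
    {"id": "3", "visibility": "private"},  # 'followers only' in the web UI
]

scrapable = [s["id"] for s in statuses if listed_on_public_timeline(s)]
print(public_timeline_url("example.social"))
print(scrapable)
```

Only the first status would show up for a scraper that walks the public timeline. A bot that follows you would still receive the third one, which is the circumvention mentioned above.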

This stuff is pretty difficult to figure out and I do not at all wish to suggest it is your own fault if your data got scraped. It isn't. (Question is, did we do enough to inform new users about the meaning and utility of these settings?)

So how about the researchers?


Are they allowed to do this? According to the research paper they can do so because posts are public. Additionally, the privacy policies of the instances don't explicitly disallow it (tip for instance admins). They also respected robots.txt (tip for instance admins) and they note that many instances' privacy policies are copy pasta of MS (tip for..). They also say that because they do not store and release personally identifiable data it complies with the EU GDPR.

However, they seem to assume the research subjects know their data is open for scraping by being public. This is not the same as *informed consent* which is required under GDPR.

Considering the outcry, it is clear informed consent was not given.

Even more problematic though is the data set they released. It consists of 6M public posts + metadata. While they hashed the author of each public post (What hashing algo though.. did someone already check that?) they left the link to the original post, which contains the author. Here is an example of such a link: post.lurk.org/@rra/10347508927
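A small sketch of why hashing the author does not help when the link stays in the record. SHA-256 is purely an assumption here (the hashing algorithm actually used is not known), and the record layout and function names are made up for illustration.

```python
import hashlib
from urllib.parse import urlparse


def pseudonymize(author: str) -> str:
    """Hash an author handle. SHA-256 is an assumption, not the paper's method."""
    return hashlib.sha256(author.encode("utf-8")).hexdigest()


def author_from_post_url(url: str) -> str:
    """Recover '@user@instance' from a post link like 'instance/@user/<id>'."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    handle = parsed.path.strip("/").split("/")[0]
    return f"{handle}@{parsed.netloc}"


# Hypothetical record: hashed author, but the original link is left in.
record = {
    "author": pseudonymize("@rra@post.lurk.org"),
    "url": "post.lurk.org/@rra/10347508927",
}

recovered = author_from_post_url(record["url"])
print(recovered)
print(pseudonymize(recovered) == record["author"])
```

No cracking of the hash is needed. The author is read straight off the link, whatever algorithm was used.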

So this is clearly bad practice and a huge ethics and GDPR violation. In fact the data set has already been taken down. dataverse.harvard.edu/dataset.

@er1n and others prepared a letter of complaint on these grounds, see social.mecanis.me/@er1n/103472

The letter:



cauliflower morgan queenchandrelle 

It is clear the researchers did not know the net culture they researched, which turned out to be a huge disadvantage to them.

Their research is based around constructing a data set of 'sensitive' and 'inappropriate' posts. Ostensibly to be able to automagically do content moderation (hmmmmmm...). To identify what is 'inappropriate' they filter out posts that contain a 'CW' or that in the toot's json are marked Sensitive:True (see for example: post.lurk.org/@rra/10285864059).

The big error they make is to take the 'Content Warning' and 'Sensitive' labels at face value, as if flagging were their only use. If you're on the fedi for just one day, however, you know the CW has a much richer culture around it than just flagging 'inappropriate' or 'sensitive'.

Instead of 'CW', the button may as well have been labeled 'Spoiler Alert', 'Click-And-Reveal Joke', 'Summary', 'I'm sorry for posting such a wall of text' etc. etc. You can see the effect of this in the word clouds they add to their article. A CW is not at all a measure of the 'appropriateness' of a post.
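A rough sketch of what labelling by CW and sensitive flag amounts to, and how it goes wrong. This is not the researchers' code. It assumes the CW text sits in the status JSON as `spoiler_text` next to the boolean `sensitive`, as in Mastodon's API, and the example posts are invented.

```python
def naive_label(status: dict) -> str:
    """Label a post 'inappropriate' if it has a CW or is marked sensitive."""
    flagged = bool(status.get("spoiler_text")) or bool(status.get("sensitive"))
    return "inappropriate" if flagged else "appropriate"


# Invented posts that use the CW field in everyday, harmless ways.
examples = [
    {"spoiler_text": "spoiler: last night's episode", "sensitive": False},
    {"spoiler_text": "joke, click for punchline", "sensitive": False},
    {"spoiler_text": "long post, sorry", "sensitive": False},
    {"spoiler_text": "", "sensitive": False},
]

for status in examples:
    print(repr(status["spoiler_text"]), "->", naive_label(status))
```

The first three posts come out as 'inappropriate' although nothing about them is. The label says something about how people use the button, not about the content.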

This is a shortcoming in their conclusions that could easily have been avoided by actually engaging the communities you study. But that means talking to people, which is: 1 hard, 2 not "AI"


cauliflower morgan queenchandrelle 

Computers can not and probably never will be able to account for jokes, irony, cultural contexts etc. so trying to automatically classify the use of language in this way is flawed and potentially extremely dangerous. That is why we don't want 'AI' to make these kinds of value judgements. Someone will at some point make a moderation bot based partly on the cauliflower corpus.

It also shows that irony, shitposting, noise, is a very good way to thwart this kind of reasoning.

Anyway the authors of the objection letter have much more thorough critiques of the research methods and conclusions and I recommend reading it:


cauliflower morgan queenchandrelle 

@rra good summary of appropriateness: "on octodon.social, sensitive contents fall into two categories: (i) offensive and explicit sexual words, and (ii) spoilers of TV series or movies."

cauliflower morgan queenchandrelle 

@rra thanks for the detailed posts.
I see they did not scrape scholar.social (admin there probably took measures). Use of CWs there is "like the subject line in an email" so they would have had a lot of false positives in whatever definition of appropriateness they are using.

cauliflower morgan queenchandrelle 

@air_pump they didn't explicitly scrape scholar.social but the data set does contain 10k+ scholar.social posts as per the objection letter (maybe in the form of boosts? mentions? threads?).

This is in clear violation of the scholar.social ToS.

cauliflower morgan queenchandrelle 

@rra yep sorry just noticing that now reading through the letter. what. a. fail.

re: cauliflower morgan queenchandrelle 

@air_pump @rra It looks like they unpublished the dataset, it says

"Deaccession Reason
Legal issue or Data Usage Agreement
Many entries in the datasets do not fulfill the law about personal data release since they allow identification of personal information"


re: cauliflower morgan queenchandrelle 

@fadelkon @rra yep, not sure when this happened though? it says the data has been online for nearly a year?

re: cauliflower morgan queenchandrelle 

@air_pump @rra It looks so. It was published in Jan 2019 and the latest it could have been unpublished is today 🤨