Salvaging the Internet Hate Machine: Using the discourse of extremist online subcultures to identify emergent extreme speech

Creators
Publication date 20-02-2020
Description
This dataset accompanies a paper submitted to the WebSci '20 conference. In the paper, we present a lexicon of 'extreme speech' that can be used to detect hate speech and extreme speech on online platforms. We outline a cross-disciplinary research protocol in which the lexicon is first extracted from a corpus of 3,335,265 posts from 4chan's /pol/ sub-forum. The extraction uses a hybrid method: word2vec modelling, followed by snowballing of the nearest neighbours of a small initial seed list of extreme language compiled by experts. The choice of corpus is significant, as 4chan is a space of rapid language innovation and obscure extreme vernacular, which complicates generalised approaches. On a corpus from a more mainstream platform (Reddit), our lexicon detects significantly more extreme posts than another popular lexicon, Hatebase, with similar accuracy. Our lexicon and the method of its creation thus contribute to the study of the toxicity of online subcultures similar to 4chan, as well as of more mainstream platforms. As we demonstrate, the lexicon allows for more effective detection of extreme speech in these spaces. The method and the lexicon have also been made available through 4CAT, an open-source web tool for the study of online social platforms. The computational methods and lexicon offered here can thus be used by a wide academic audience, fostering interdisciplinary approaches to the study of online hate and extreme speech.
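The snowballing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors stand in for word2vec embeddings trained on the corpus, and the function names and the parameters `k`, `min_similarity` and `rounds` are hypothetical choices made for illustration. The actual protocol may also involve expert review of candidate terms, which is not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k):
    """Return the k words most similar to `word`, with their scores."""
    scores = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(reverse=True)
    return [(other, score) for score, other in scores[:k]]


def snowball(seeds, vectors, k=2, min_similarity=0.9, rounds=2):
    """Expand a seed list by repeatedly adding close nearest neighbours.

    k, min_similarity and rounds are illustrative assumptions, not
    values taken from the paper.
    """
    lexicon = {w for w in seeds if w in vectors}
    frontier = set(lexicon)
    for _ in range(rounds):
        candidates = set()
        for word in frontier:
            for other, score in nearest_neighbours(word, vectors, k):
                if score >= min_similarity and other not in lexicon:
                    candidates.add(other)
        if not candidates:
            break  # nothing new was found, so the expansion has converged
        lexicon |= candidates
        frontier = candidates  # only newly added terms are expanded next
    return lexicon


# Hypothetical toy embeddings; real ones would come from a word2vec model.
TOY_VECTORS = {
    "seed_term": [1.0, 0.0],
    "variant_a": [0.95, 0.10],
    "variant_b": [0.90, 0.25],
    "weather": [0.0, 1.0],
    "sunny": [0.1, 0.95],
}

expanded = snowball(["seed_term"], TOY_VECTORS)
print(sorted(expanded))
```

In this toy example the two variants lie close to the seed term and are added to the lexicon, while the unrelated words stay out. With real embeddings, the similarity threshold and the number of rounds trade coverage against the risk of drifting into unrelated vocabulary.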
Publisher Zenodo
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
  • Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR) - Amsterdam School for Cultural Analysis (ASCA)
Document type Dataset
DOI https://doi.org/10.5281/zenodo.3676482
Other links https://zenodo.org/record/3676483