A bias towards neutrality? How LLM guardrail sensitivity affects classification

Open Access
Authors
Publication date 2025
Journal Communication and Change
Article number 13
Volume | Issue 1
Number of pages 18
Organisations
  • Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR) - Amsterdam School for Cultural Analysis (ASCA)
Abstract
The advent of generative AI platforms and large language models (LLMs) such as ChatGPT has prompted scholarly work in two seemingly disconnected directions: automated classification and bias detection. Here these two strands of work are brought together to address one of the larger challenges facing social scientific research with AI platforms: the effects of LLM safety guardrails on the quality of LLM data labelling. The piece briefly reviews the literature on classification and bias, particularly their conjunction, which has been termed the safety/helpfulness trade-off. We then turn to findings from research that explores the effects of guardrails on labelling. Overall, we find that the greater the bias mitigation, the more neutralising the sentiment exhibited by LLMs in their classification and labelling. By way of conclusion, we discuss the implications of this bias towards neutrality as an analytical flattening that accompanies the automation of knowledge making.
Document type Article
Language English
Published at https://doi.org/10.1007/s44382-025-00013-0