Assessing Reliability in AI-Powered Learning Systems with A/A Tests

Open Access
Authors
Publication date 2025
Book title L@S '25
Book subtitle Proceedings of the Twelfth ACM Conference on Learning @ Scale: July 21-23, 2025, Palermo, Italy
ISBN (electronic)
  • 9798400712913
Event 12th ACM Conference on Learning @ Scale, L@S 2025
Pages (from-to) 13-23
Number of pages 11
Publisher New York, NY: Association for Computing Machinery
Organisations
  • Faculty of Social and Behavioural Sciences (FMG) - Psychology Research Institute (PsyRes)
Abstract

The rapid evolution of Artificial Intelligence (AI) has expanded access to large-scale online adaptive learning systems. Such AI-powered systems strive to deliver personalized learning experiences, often by means of advanced algorithms that continuously model learner behavior. Ensuring the reliability of these systems is fundamental; otherwise, their ability to optimize individual learning paths and inform decision-making is undermined. For such systems to be trustworthy, learners with identical profiles should follow highly similar learning trajectories. But how can we evaluate the reliability of these dynamic learning environments, especially in systems that are continuously developed and updated? This paper demonstrates the effectiveness of A/A testing (large-scale double-blind experiments with identical conditions) in systematically evaluating the reliability of AI-powered learning environments. We illustrate this by assessing the reliability of student model parameters in a large-scale online arithmetic learning platform that is driven by a well-studied and powerful explainable AI algorithm. We duplicated the item bank of a newly developed game and randomly assigned each player to one of two identical versions (a 50/50 split), which were launched simultaneously in the live environment. We then analyzed the reliability of item difficulty convergence, the stability of student ability estimates in the new game, and their relationship to ability estimates from other arithmetic games, as well as patterns in student errors. Our results indicate that the student model parameters are stable across the two variants, highlighting A/A testing as a valuable tool for assessing the reliability of large-scale AI-powered learning systems. We discuss its advantages and suggest future directions for adapting the approach, while considering its relevance in dynamic learning environments.
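
The abstract does not name the platform's estimation algorithm, so the following is only a minimal sketch of the A/A design itself, assuming an Elo-style (Rasch-like) update rule of the kind commonly used in large-scale adaptive arithmetic platforms. All names, parameter values, and simulated data below are illustrative assumptions, not details taken from the paper.

```python
import math
import random
import statistics

random.seed(42)

N_LEARNERS, N_ITEMS, N_TRIALS = 2000, 40, 200
K = 0.1  # learning rate for the Elo-style update (illustrative value)

# Ground-truth parameters, used only to simulate responses.
true_theta = [random.gauss(0, 1) for _ in range(N_LEARNERS)]  # abilities
true_beta = [random.gauss(0, 1) for _ in range(N_ITEMS)]      # difficulties

def p_correct(theta, beta):
    """Rasch-style probability that a learner answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# A/A design: duplicate the item bank and randomly assign each learner
# to one of two identical variants running simultaneously.
beta_hat = {"A": [0.0] * N_ITEMS, "B": [0.0] * N_ITEMS}
assignment = {i: random.choice("AB") for i in range(N_LEARNERS)}

for learner, arm in assignment.items():
    theta_hat = 0.0
    betas = beta_hat[arm]
    for _ in range(N_TRIALS):
        item = random.randrange(N_ITEMS)
        correct = random.random() < p_correct(true_theta[learner], true_beta[item])
        surprise = correct - p_correct(theta_hat, betas[item])
        theta_hat += K * surprise    # learner estimate moves up after a correct answer
        betas[item] -= K * surprise  # item difficulty moves the opposite way

# Reliability check: the two identical arms should converge to nearly
# the same item-difficulty estimates.
r = statistics.correlation(beta_hat["A"], beta_hat["B"])
print(f"Between-arm item-difficulty correlation: r = {r:.3f}")
```

Because both arms are identical, any systematic divergence between their converged estimates reflects instability in the student model or the assignment machinery rather than a treatment effect; a between-arm correlation near 1 is the expected outcome for a reliable system.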

Document type Conference contribution
Note With supplementary video
Language English
Published at https://doi.org/10.1145/3698205.3729553
Other links https://www.scopus.com/pages/publications/105013071212