Assessing Reliability in AI-Powered Learning Systems with A/A Tests

Open Access
Authors
Publication date 2025
Book title L@S '25
Book subtitle Proceedings of the Twelfth ACM Conference on Learning @ Scale: July 21-23, 2025, Palermo, Italy
ISBN (electronic)
  • 9798400712913
Event 12th ACM Conference on Learning @ Scale, L@S 2025
Pages (from-to) 13-23
Number of pages 11
Publisher New York, NY: Association for Computing Machinery
Organisations
  • Faculty of Social and Behavioural Sciences (FMG) - Psychology Research Institute (PsyRes)
Abstract

The rapid evolution of Artificial Intelligence (AI) has expanded access to large-scale online adaptive learning systems. Such AI-powered systems strive to deliver personalized learning experiences, often by means of advanced algorithms that continuously model learner behavior. Ensuring the reliability of these systems is fundamental; otherwise, their ability to optimize individual learning paths and inform decision-making is undermined. For such systems to be trustworthy, learners with identical profiles should follow highly similar learning trajectories. But how can we evaluate the reliability of these dynamic learning environments, especially in systems that are continuously developed and updated? This paper demonstrates the effectiveness of A/A testing (large-scale double-blind experiments with identical conditions) in systematically evaluating the reliability of AI-powered learning environments. We illustrate this by assessing the reliability of student model parameters in a large-scale online arithmetic learning platform that is driven by a well-studied and powerful explainable AI algorithm. We duplicated the item bank of a newly developed game and randomly assigned each player to one of two identical versions (a 50/50 split), which were launched simultaneously in the live environment. We then analyzed the reliability of item difficulty convergence, the stability of student ability estimates in the new game, and their relationship to ability estimates from other arithmetic games, as well as patterns in student errors. Our results indicate that the student model parameters are stable across the two variants, highlighting A/A testing as a valuable tool for assessing the reliability of large-scale AI-powered learning systems. We discuss its advantages and suggest future directions for adapting the approach, while considering its relevance in dynamic learning environments.
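
The abstract does not name the platform's estimation algorithm, so the following is only a minimal sketch of the A/A design itself, assuming an Elo-style (Rasch-like) update rule of the kind commonly used in large-scale adaptive arithmetic platforms. All names, parameter values, and simulated data below are illustrative assumptions, not details taken from the paper.

```python
import math
import random
import statistics

random.seed(42)

N_LEARNERS, N_ITEMS, N_TRIALS = 2000, 40, 200
K = 0.1  # learning rate for the Elo-style update (illustrative value)

# Ground-truth parameters, used only to simulate responses.
true_theta = [random.gauss(0, 1) for _ in range(N_LEARNERS)]  # abilities
true_beta = [random.gauss(0, 1) for _ in range(N_ITEMS)]      # difficulties

def p_correct(theta, beta):
    """Rasch-style probability that a learner answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# A/A design: duplicate the item bank and randomly assign each learner
# to one of two identical variants running simultaneously.
beta_hat = {"A": [0.0] * N_ITEMS, "B": [0.0] * N_ITEMS}
assignment = {i: random.choice("AB") for i in range(N_LEARNERS)}

for learner, arm in assignment.items():
    theta_hat = 0.0
    betas = beta_hat[arm]
    for _ in range(N_TRIALS):
        item = random.randrange(N_ITEMS)
        correct = random.random() < p_correct(true_theta[learner], true_beta[item])
        surprise = correct - p_correct(theta_hat, betas[item])
        theta_hat += K * surprise    # learner estimate moves up after a correct answer
        betas[item] -= K * surprise  # item difficulty moves the opposite way

# Reliability check: the two identical arms should converge to nearly
# the same item-difficulty estimates.
r = statistics.correlation(beta_hat["A"], beta_hat["B"])
print(f"Between-arm item-difficulty correlation: r = {r:.3f}")
```

Because both arms are identical, any systematic divergence between their converged estimates reflects instability in the student model or the assignment machinery rather than a treatment effect; a between-arm correlation near 1 is the expected outcome for a reliable system.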

Document type Conference contribution
Note With supplementary video
Language English
Published at https://doi.org/10.1145/3698205.3729553
Other links https://www.scopus.com/pages/publications/105013071212