Aligning language models with human understanding and behavior
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 03-07-2025 |
| ISBN | |
| Number of pages | 114 |
| Organisations | |
| Abstract | Language models (LMs) have achieved impressive progress in natural language processing, yet they remain misaligned with human understanding and behavior, limiting their effectiveness in real-world applications. This thesis addresses these challenges by investigating LM alignment from two perspectives: aligning model understanding with humans, and aligning model behavior with humans. Specifically, we explore four key themes: (i) aligning understanding via debiased representation learning, (ii) aligning behavior via strong-to-weak learning, (iii) aligning behavior via weak-to-strong learning, and (iv) aligning behavior via test-time behavior reflection. We begin by addressing representational alignment during fine-tuning, proposing a framework that reduces biased latent features and captures their dynamic influence, thereby improving out-of-distribution generalization. Then, in the strong-to-weak learning setting, we develop behavior alignment methods to improve completeness, factuality, and logicality in knowledge-intensive tasks, leveraging both fine-grained and coarse-grained knowledge signals. Next, we study the weak-to-strong alignment scenario, where stronger LMs must learn from weaker human supervision. To this end, we introduce an iterative preference optimization strategy that facilitates mutual learning between weak teachers and strong students. Finally, we focus on aligning behavior at inference time by mitigating cognitive biases in LM decision-making. We propose a method that follows three sequential steps (bias determination, bias analysis, and cognitive debiasing) to iteratively reduce potential cognitive biases in prompts. |
| Document type | PhD thesis |
| Language | English |
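The abstract's weak-to-strong theme rests on an iterative preference optimization strategy in which a weak teacher supervises a stronger student. As a rough illustration only, the sketch below uses a DPO-style pairwise loss (a common preference-optimization objective, not necessarily the thesis's own method) on a toy problem: a noisy weak teacher supplies preference labels over two candidate answers, and a "student", reduced here to two scalar logits, is updated toward the teacher's majority preference. All names, the toy teacher, and the update rule are illustrative assumptions.

```python
import math
import random

random.seed(0)

def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where `margin` is the student's
    (chosen - rejected) log-probability gap relative to a frozen reference."""
    return math.log1p(math.exp(-beta * margin))

# Toy "student": one scalar logit per candidate answer.
student_logits = {"answer_a": 0.0, "answer_b": 0.0}

def weak_teacher_prefers(a: str, b: str) -> str:
    """Toy weak supervisor: prefers `a` 80% of the time, i.e. noisy labels."""
    return a if random.random() < 0.8 else b

beta, lr = 0.1, 0.5
for step in range(200):
    chosen = weak_teacher_prefers("answer_a", "answer_b")
    rejected = "answer_b" if chosen == "answer_a" else "answer_a"
    margin = student_logits[chosen] - student_logits[rejected]
    # d/d(margin) of -log sigmoid(beta * margin) is -beta * sigmoid(-beta * margin).
    grad = -beta / (1.0 + math.exp(beta * margin))
    student_logits[chosen] -= lr * grad    # raise the chosen logit...
    student_logits[rejected] += lr * grad  # ...and lower the rejected one
    if step % 50 == 0:
        print(f"step {step}: loss={dpo_loss(margin):.3f}")

print(student_logits)  # the student drifts toward the teacher's majority preference
```

In the mutual-learning setting the abstract describes, the teacher would also be updated between rounds; here it is held fixed for brevity.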

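The final theme is described concretely enough to sketch: an inference-time loop that repeatedly applies bias determination, bias analysis, and cognitive debiasing to a prompt. The sketch below is a minimal assumed realization of those three steps, not the thesis's implementation; `ask_llm` is a hypothetical stand-in for any chat-completion client, and the instruction strings are illustrative.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical model call; wire this to a real chat-completion client."""
    raise NotImplementedError

def cognitive_debias(prompt: str, max_iters: int = 3) -> str:
    """Iteratively rewrite `prompt` until no cognitive bias is flagged."""
    for _ in range(max_iters):
        # Step 1: bias determination -- is the prompt biased at all?
        verdict = ask_llm(
            f"Does the following prompt contain a cognitive bias? Answer yes or no.\n\n{prompt}"
        )
        if verdict.strip().lower().startswith("no"):
            break  # no bias detected: stop early
        # Step 2: bias analysis -- name the bias and locate its cause.
        analysis = ask_llm(
            f"Name the cognitive bias in the following prompt and quote the span that causes it.\n\n{prompt}"
        )
        # Step 3: cognitive debiasing -- rewrite the prompt to remove the bias
        # while preserving the underlying task.
        prompt = ask_llm(
            "Rewrite the prompt below to remove the described bias, keeping the task intact.\n\n"
            f"Bias analysis: {analysis}\n\nPrompt: {prompt}"
        )
    return prompt
```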