Aligning language models with human understanding and behavior
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 03-07-2025 |
| ISBN | |
| Number of pages | 114 |
| Organisations | |
| Abstract | Language models (LMs) have achieved impressive progress in natural language processing, yet they remain misaligned with human understanding and behavior, limiting their effectiveness in real-world applications. This thesis addresses these challenges by investigating LM alignment from two perspectives: aligning model understanding with humans, and aligning model behavior with humans. Specifically, we explore four key themes: (i) aligning understanding via debiased representation learning, (ii) aligning behavior via strong-to-weak learning, (iii) aligning behavior via weak-to-strong learning, and (iv) aligning behavior via test-time behavior reflection. We begin by addressing representational alignment during fine-tuning, proposing a framework that reduces biased latent features and captures their dynamic influence, thereby improving out-of-distribution generalization. Then, in the strong-to-weak learning setting, we develop behavior alignment methods to improve completeness, factuality, and logicality in knowledge-intensive tasks, leveraging both fine-grained and coarse-grained knowledge signals. Next, we study the weak-to-strong alignment scenario, where stronger LMs must learn from weaker human supervision. To this end, we introduce an iterative preference optimization strategy that facilitates mutual learning between weak teachers and strong students. Finally, we focus on aligning behavior at inference time by mitigating cognitive biases in LM decision-making. We propose a method that follows three sequential steps (bias determination, bias analysis, and cognitive debiasing) to iteratively reduce potential cognitive biases in prompts. |
| Document type | PhD thesis |
| Language | English |
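The abstract's weak-to-strong theme rests on an iterative preference optimization strategy in which a weak teacher supervises a stronger student. As a rough illustration only, the sketch below uses a DPO-style pairwise loss (a common preference-optimization objective, not necessarily the thesis's own method) on a toy problem: a noisy weak teacher supplies preference labels over two candidate answers, and a "student", reduced here to two scalar logits, is updated toward the teacher's majority preference. All names, the toy teacher, and the update rule are illustrative assumptions.

```python
import math
import random

random.seed(0)

def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where `margin` is the student's
    (chosen - rejected) log-probability gap relative to a frozen reference."""
    return math.log1p(math.exp(-beta * margin))

# Toy "student": one scalar logit per candidate answer.
student_logits = {"answer_a": 0.0, "answer_b": 0.0}

def weak_teacher_prefers(a: str, b: str) -> str:
    """Toy weak supervisor: prefers `a` 80% of the time, i.e. noisy labels."""
    return a if random.random() < 0.8 else b

beta, lr = 0.1, 0.5
for step in range(200):
    chosen = weak_teacher_prefers("answer_a", "answer_b")
    rejected = "answer_b" if chosen == "answer_a" else "answer_a"
    margin = student_logits[chosen] - student_logits[rejected]
    # d/d(margin) of -log sigmoid(beta * margin) is -beta * sigmoid(-beta * margin).
    grad = -beta / (1.0 + math.exp(beta * margin))
    student_logits[chosen] -= lr * grad    # raise the chosen logit...
    student_logits[rejected] += lr * grad  # ...and lower the rejected one
    if step % 50 == 0:
        print(f"step {step}: loss={dpo_loss(margin):.3f}")

print(student_logits)  # the student drifts toward the teacher's majority preference
```

In the mutual-learning setting the abstract describes, the teacher would also be updated between rounds; here it is held fixed for brevity.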

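The final theme is described concretely enough to sketch: an inference-time loop that repeatedly applies bias determination, bias analysis, and cognitive debiasing to a prompt. The sketch below is a minimal assumed realization of those three steps, not the thesis's implementation; `ask_llm` is a hypothetical stand-in for any chat-completion client, and the instruction strings are illustrative.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical model call; wire this to a real chat-completion client."""
    raise NotImplementedError

def cognitive_debias(prompt: str, max_iters: int = 3) -> str:
    """Iteratively rewrite `prompt` until no cognitive bias is flagged."""
    for _ in range(max_iters):
        # Step 1: bias determination -- is the prompt biased at all?
        verdict = ask_llm(
            f"Does the following prompt contain a cognitive bias? Answer yes or no.\n\n{prompt}"
        )
        if verdict.strip().lower().startswith("no"):
            break  # no bias detected: stop early
        # Step 2: bias analysis -- name the bias and locate its cause.
        analysis = ask_llm(
            f"Name the cognitive bias in the following prompt and quote the span that causes it.\n\n{prompt}"
        )
        # Step 3: cognitive debiasing -- rewrite the prompt to remove the bias
        # while preserving the underlying task.
        prompt = ask_llm(
            "Rewrite the prompt below to remove the described bias, keeping the task intact.\n\n"
            f"Bias analysis: {analysis}\n\nPrompt: {prompt}"
        )
    return prompt
```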