# Improving public access to government documents
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 11-12-2025 |
| ISBN | |
| Number of pages | 146 |
| Organisations | |
| Abstract | This thesis addresses the challenge of publishing Dutch FOIA documents at scale in a way that adheres to the common FAIR data principles for digital artifacts. Under current publication practices, many of the FOIA documents published online do not adhere to these principles, which limits their usefulness. Since addressing these problems at the source (i.e., before publication) is currently not viable, this thesis focuses on developing and evaluating methods for the post-hoc repair of such documents. Part 1 develops automatic methods for the page stream segmentation of complex dossiers into individual documents, and for the detection of redacted text to benefit downstream page segmentation and OCR output. Part 1 concludes with the creation of the Woogle dataset, a large collection of Dutch FOIA documents gathered through web scraping, processed using the techniques described in Chapters 2 and 3, and integrated into a search engine to demonstrate the feasibility and benefits of such a large-scale dataset of FOIA documents. Part 2 addresses the evaluation of the extreme clustering tasks covered in Part 1, and the adaptation of existing metrics for this purpose. Chapters 5 and 6 cover adaptations of the BCubed and PQ metrics, respectively, and Chapter 7 provides a detailed comparison of three groups of evaluation metrics for the task of text segmentation, discussing the advantages and disadvantages of each group through theoretical and empirical investigations. |
| Document type | PhD thesis |
| Language | English |
| Downloads | Thesis (complete) (embargo until 2026-12-11); Chapter 7: Text segmentation metrics: A survey (embargo until 2026-12-11) |
| Permalink to this page | |
