# Improving public access to government documents
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 11-12-2025 |
| ISBN | |
| Number of pages | 146 |
| Organisations | |
| Abstract | This thesis addresses the challenge of publishing Dutch FOIA documents at scale in a way that adheres to the common FAIR data principles for digital artifacts. Under current publication practices, many of the FOIA documents published online do not adhere to these principles, which limits their usefulness. Since addressing these problems at the source (i.e., before publication) is currently not viable, this thesis focuses on developing and evaluating methods for the post-hoc repair of such documents. Part 1 develops automatic methods for the page stream segmentation of complex dossiers into individual documents, and for the detection of redacted text to benefit downstream page segmentation and OCR output. Part 1 concludes with the creation of the Woogle dataset, a large collection of Dutch FOIA documents gathered through web scraping, processed using the techniques described in Chapters 2 and 3, and integrated into a search engine to demonstrate the feasibility and benefits of such a large-scale dataset of FOIA documents. Part 2 addresses the evaluation of the extreme clustering tasks covered in Part 1, and the adaptation of existing metrics for this purpose. Chapters 5 and 6 cover adaptations of the BCubed and PQ metrics, respectively, and Chapter 7 provides a detailed comparison of three groups of evaluation metrics for the task of text segmentation, discussing the advantages and disadvantages of each group through theoretical and empirical investigations. |
| Document type | PhD thesis |
| Language | English |
| Downloads | Thesis (complete) (embargo until 2026-12-11); Chapter 7: Text segmentation metrics: A survey (embargo until 2026-12-11) |
| Permalink to this page | |
