Detection of Redacted Text in Legal Documents
| Authors | |
|---|---|
| Publication date | 2023 |
| Host editors | |
| Book title | Linking Theory and Practice of Digital Libraries |
| Book subtitle | 27th International Conference on Theory and Practice of Digital Libraries, TPDL 2023, Zadar, Croatia, September 26–29, 2023 : proceedings |
| ISBN | |
| ISBN (electronic) | |
| Series | Lecture Notes in Computer Science |
| Event | 27th International Conference on Theory and Practice of Digital Libraries |
| Pages (from-to) | 310-316 |
| Number of pages | 6 |
| Publisher | Cham: Springer |
| Organisations | |
| Abstract | We present a technique for automatically detecting redacted text in legal documents. It combines Optical Character Recognition (OCR) with morphological operations from computer vision, which allows it to detect a wide variety of redaction blocks with little to no training data. As this is a segmentation task, we evaluate the technique using the Panoptic Quality methodology: the algorithm obtains F1 scores of 0.79, 0.86 and 0.76 on black, colored and outlined redaction blocks respectively, and an F1 score of 0.62 on gray blocks. The total running time of the algorithm is two seconds on average, measured on a thousand pages from a government supplier; part of this time is used by Tesseract and the conversion from PDF to PNG, and the remainder by the detection algorithm. Detecting text redaction at scale is thus feasible, allowing a more or less objective measurement of this practice. The redacted text detection code and the manually labelled dataset created for evaluation are released via GitHub. |
| Document type | Conference contribution |
| Language | English |
| Related dataset | Automatic Text Redaction Dataset |
| Published at | https://doi.org/10.1007/978-3-031-43849-3_28 |
| Other links | https://github.com/irlabamsterdam/TPDLTextRedaction https://lakdetector.wooverheid.nl |
| Downloads | 978-3-031-43849-3_28 (Final published version) |
| Permalink to this page | |
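
The abstract reports F1 scores obtained under the Panoptic Quality methodology. As a rough illustration of how such a score can be computed, the sketch below implements the standard Panoptic Quality decomposition (segmentation quality times recognition quality, where recognition quality is an F1 score over matched segments) for axis-aligned boxes. It is not the authors' evaluation code: the `Box` type, the function names, the use of rectangles instead of pixel masks, and the greedy matching are all assumptions made for this example. The 0.5 IoU matching threshold is the usual convention for this metric and is assumed here as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned redaction block (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def panoptic_quality(preds, golds, threshold=0.5):
    """Return SQ, RQ (an F1 score) and PQ = SQ * RQ.

    Assumes boxes within each list do not overlap one another, so a
    prediction can exceed the IoU threshold with at most one gold box
    and the simple greedy matching below is sufficient.
    """
    matched_ious = []
    used_gold = set()
    for p in preds:
        for gi, g in enumerate(golds):
            if gi in used_gold:
                continue
            score = iou(p, g)
            if score > threshold:
                matched_ious.append(score)
                used_gold.add(gi)
                break

    tp = len(matched_ious)       # matched pairs
    fp = len(preds) - tp         # predictions without a gold match
    fn = len(golds) - tp         # gold blocks that were missed
    denom = tp + 0.5 * fp + 0.5 * fn

    rq = tp / denom if denom else 0.0          # recognition quality (F1)
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU of matches
    return {"SQ": sq, "RQ": rq, "PQ": sq * rq}
```

Calling `panoptic_quality(predicted_boxes, gold_boxes)` on the boxes of one page, or on boxes pooled per redaction type, gives the three components; the `RQ` value corresponds to the F1 figure in this decomposition.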
