Privacy-Preserving Record Linkage with Spark
| Authors | |
|---|---|
| Publication date | 2019 |
| Book title | Proceedings 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing |
| Book subtitle | CCGrid 2019, Cyprus |
| ISBN | |
| ISBN (electronic) | |
| Event | 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019 |
| Pages (from-to) | 440-448 |
| Number of pages | 9 |
| Publisher | IEEE Computer Society |
| Organisations | |
| Abstract | Privacy considerations obligate careful and secure processing of personal data. This is especially true when personal data is linked against databases from other organizations. During such endeavors, privacy-preserving record linkage (PPRL) can be utilized to prevent needless exposure of sensitive information to other organizations. With the increase in personal data being gathered and analyzed, scalable PPRL capable of handling massive databases is much desired. In this work, we evaluate Apache Spark as an option to scale PPRL. Not only is it valuable to have a scalable PPRL implementation, but one based on Spark would also be commonly deployable and could take advantage of further development of the ecosystem. Our results show that a PPRL solution based on Spark outperforms alternatives when it comes to handling multiple millions of records, can scale to dozens of nodes, and is on par with regular record linkage implementations in terms of achieved results. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1109/CCGRID.2019.00058 |
| Other links | https://www.scopus.com/pages/publications/85069434168 |
