The ramifications of data handling for computational models
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 04-12-2024 |
| ISBN | |
| Series | SIKS Dissertation series, 2024-37 |
| Number of pages | 196 |
| Organisations | |
| Abstract |
Many computational models rely on real-world data, with the successful application of these models being dependent on access to accurate and representative datasets. With increasingly sophisticated models and data, the steps required in moving from data collection to model output are becoming more complex. The effects of data handling steps such as cleaning and integration on the modelling and simulation process have generally not been addressed in the literature. This thesis investigates these issues and introduces frameworks for how best to reason about such problems.
The first part of the thesis focuses on network diffusion models, which simulate spreading processes (such as disease or information) over networks. The outputs of such models are highly sensitive to the topology of the network on which they are run. From both theoretical and practical perspectives, we show how sensitive these models can be to data handling and suggest how results can be reported to support transparent, holistic conclusions. In the second part, we extend the analysis to other data handling problems and model types. First, we illustrate how data preprocessing decisions can change the structure of word co-occurrence networks. Such networks are frequently used in the social sciences, where the decisions behind network construction are often not justified. Second, we show how mismatched training and test data cleaning pipelines can affect the performance and selection of regression models. Such mismatches can have surprising consequences, with strong implications for practice. |
| Document type | PhD thesis |
| Language | English |
