Directions Towards Efficient and Automated Data Wrangling with Large Language Models
| Authors | |
|---|---|
| Publication date | 2024 |
| Book title | 2024 IEEE 40th International Conference on Data Engineering Workshops |
| Book subtitle | ICDEW 2024 : 13-17 May 2024, Utrecht, Netherlands : proceedings |
| ISBN | |
| ISBN (electronic) | |
| Event | 40th IEEE International Conference on Data Engineering Workshops, ICDEW 2024 |
| Pages (from-to) | 301-304 |
| Publisher | Los Alamitos, California: IEEE Computer Society |
| Organisations | |
| Abstract | Data integration and cleaning have long been a key focus of the data management community. Recent research indicates the potential of large language models (LLMs) for such tasks. However, scaling and automating data wrangling with LLMs for real-world use cases poses additional challenges. Manual prompt engineering, for example, is expensive and hard to operationalise, while full fine-tuning of LLMs incurs high compute and storage costs. Following up on previous work, we evaluate parameter-efficient fine-tuning (PEFT) methods for efficiently automating data wrangling with LLMs. We conduct a study of four popular PEFT methods on differently sized LLMs for ten benchmark tasks, where we find that PEFT methods achieve performance on par with full fine-tuning, and that we can leverage small LLMs with negligible performance loss. However, even though such PEFT methods are parameter-efficient, they still incur high compute costs at training time and require labeled training data. We explore a zero-shot setting to further reduce deployment costs, and propose our vision for ZEROMATCH, a novel approach to zero-shot entity matching. It is based on maintaining a large number of pretrained LLM variants from different domains and intelligently selecting an appropriate variant at inference time. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1109/ICDEW61823.2024.00044 |
| Other links | https://www.proceedings.com/75058.html |
| Downloads | Directions_Towards_Efficient_and_Automated_Data_Wrangling_with_Large_Language_Models (Final published version) |
