Master's thesis: Large language models for the automatic extraction of data from scientific literature

  • Position type:

    Master's thesis

  • Faculty/Department:

    Department of Computer Science

  • Institute:

    Institute of Theoretical Informatics

  • Start date:

    Any time

  • Contact person:

    pascal.friederich@kit.edu, tobias.schloeder@kit.edu

We are looking for a computer science student with experience in Python and machine learning, ideally with some initial experience in high-performance computing (HPC), for an exciting and ambitious Master’s thesis project.

The goal of this project is to use state-of-the-art large language models (LLMs) to extract tabular data from scientific literature. Multiple models will be tested, fine-tuned, and integrated into various extraction strategies to maximize extraction accuracy. A manually annotated dataset, developed in cooperation with materials scientists at KIT, will serve as a proof-of-principle test case. The downstream goal is to extend existing databases used to predict the properties and synthesis conditions of as yet unknown materials.

In your work, you will deploy and fine-tune pretrained LLMs (e.g. for text generation or question answering) on HPC infrastructure (bwUniCluster and HoreKa at KIT), systematically develop extraction strategies, and apply them to our database of synthesis paragraphs. Your main task will be to develop a workflow that automatically extracts experimental parameters, such as temperatures or solvents used in scientific experiments, from the literature. This will require testing and optimizing different prompts/questions and possibly also fine-tuning models on scientific literature or, more specifically, on descriptions of experimental setups. Developing strategies to generate and incorporate synthetic data may also benefit the project.
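
To give a rough idea of what such a workflow could look like, here is a minimal sketch of a prompt-based extraction step together with a simple per-field accuracy score against manual annotations. It is an illustration only, not the project's actual pipeline: the field names, the prompt wording, the JSON answer format, and all function names (`build_prompt`, `parse_answer`, `extract`, `field_accuracy`) are assumptions, and `generate` stands for any callable that sends a prompt to a deployed model and returns its text output.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical field names, chosen only for illustration.
FIELDS = ["temperature", "solvent"]

Record = Dict[str, Optional[str]]


def build_prompt(paragraph: str, fields: List[str]) -> str:
    """Ask the model to answer with a single JSON object (assumed format)."""
    template = json.dumps({name: None for name in fields})
    return (
        "Extract the following experimental parameters from the paragraph.\n"
        f"Answer with one JSON object of the form {template}; "
        "use null for anything that is not stated.\n\n"
        f"Paragraph: {paragraph}\nJSON:"
    )


def parse_answer(raw: str, fields: List[str]) -> Record:
    """Read the outermost {...} span of the model output, tolerating extra text."""
    empty: Record = {name: None for name in fields}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return empty
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {
        name: None if data.get(name) is None else str(data[name])
        for name in fields
    }


def extract(
    paragraph: str,
    generate: Callable[[str], str],
    fields: List[str] = FIELDS,
) -> Record:
    """Run one paragraph through the model and parse the structured answer."""
    return parse_answer(generate(build_prompt(paragraph, fields)), fields)


def field_accuracy(
    predicted: List[Record], annotated: List[Record], fields: List[str]
) -> float:
    """Fraction of (paragraph, field) pairs that match after trimming and lowercasing."""
    def norm(value: Optional[str]) -> Optional[str]:
        return None if value is None else value.strip().lower()

    total = len(annotated) * len(fields)
    hits = sum(
        norm(p.get(f)) == norm(a.get(f))
        for p, a in zip(predicted, annotated)
        for f in fields
    )
    return hits / total if total else 0.0


# Stand-in for a real model call, so the sketch runs without any LLM.
def fake_generate(prompt: str) -> str:
    return 'Sure. {"temperature": "120 °C", "solvent": "DMF"}'


record = extract("The mixture was heated to 120 °C in DMF for 24 h.", fake_generate)
```

Keeping the model behind a plain callable makes it easy to swap in different models or prompt variants and compare them on the same annotated paragraphs. Exact string matching is a deliberately crude score; a real evaluation would likely need unit and synonym normalization.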