Our manuscript, “VILLA: Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models,” has been accepted to the KDD 2026 Conference (AI for Sciences track). The conference will take place August 9–13, 2026, in Jeju, Korea.
KDD is a Data Science and AI conference, traditionally hosting a Research Track and an Applied Data Science Track, and more recently expanding to include the Datasets & Benchmarks Track. This year marks the introduction of the AI for Sciences track, which highlights the role of AI and data-driven methods in supporting interdisciplinary research and accelerating scientific discovery.

In this paper, we approach scientific information extraction (SIE) from the literature from a new perspective. Rather than seeking answers to multiple-choice or true/false questions, we designed a novel task: retrieving mutations in a given virus that modify its interaction with the host. This open-ended task goes beyond conventional choice-based tasks and reflects a more realistic and complex setting for SIE. To address the complexity of this problem, we developed a new, multi-step retrieval-augmented generation (RAG) framework called Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models (VILLA). We also curated a novel dataset of 629 mutations in influenza A virus proteins, obtained from 293 scientific publications, to serve as ground truth for our mutation extraction task. We demonstrated VILLA’s superior performance through a comprehensive quantitative and qualitative evaluation, comparing it with vanilla RAG and other state-of-the-art RAG- and agent-based tools for SIE that use both open and closed large language models (LLMs).
We are grateful to the team and collaborators who contributed to this work. We look forward to feedback on the full paper from the community. Congratulations to all!
