High-coverage information extraction from web and narrative texts

Singhania, Sneha

Please use this identifier to cite or link to this item: doi:10.22028/D291-47216

Title:	High-coverage information extraction from web and narrative texts
Author(s):	Singhania, Sneha
Language:	English
Year of Publication:	2025
DDC notations:	500 Science; 600 Technology
Publication type:	Dissertation
Abstract:	The web predominantly stores information in unstructured forms, supporting applications such as search, large language model (LLM) training, and decision-making tools. Information extraction (IE) aims to transform key textual content into structured representations, such as subject-predicate-object triples. Most IE systems have prioritized precision, often at the expense of recall. As LLMs and knowledge-intensive applications become more prominent, there is a growing need for frameworks that achieve high recall, that is, that capture all relevant facts across diverse sources, formats, and contexts. This dissertation advances methods for high-recall IE across three settings: web-scale documents, parametric LLM knowledge, and long-form narrative texts. To address web-scale extraction, we introduce a new task: predicting the information coverage of a document for relation extraction. We propose Herb, a lightweight classifier that identifies high-recall documents using a combination of explicit features and latent document-level signals. Rather than performing exhaustive extraction, Herb prioritizes documents by estimated coverage, maximizing information yield under strict budget constraints. We also investigate IE from temporally evolving web-scale documents, focusing on news articles. We present Neon, a framework that extracts OpenIE-style propositions to construct an entity-centric, timestamped representation. Integrating these propositions into a retrieval-augmented generation (RAG) framework improves question answering over dynamic information. To harness parametric knowledge in LLMs, we study high-recall knowledge extraction by probing LLMs and address the multi-valued slot-filling task, where a subject–relation pair can have multiple correct objects. While prior methods have focused on extracting a single object, we formulate the problem as a rank-then-select task. We develop predicate-specific prompting techniques that improve the extraction of valid objects for multi-valued relations. For long-form narratives, where evidence is sparse and dispersed, we address the challenge of extracting long lists of objects. Our proposed L3X framework combines recall-oriented generation (via RAG, iterative prompting, and pseudo-relevance feedback) with a precision-oriented scrutinization stage. This architecture yields substantially higher recall while maintaining good precision. By rethinking the interplay between retrieval and extraction, this dissertation advances the state of the art in high-recall IE. The core contributions include novel extraction methods, large-scale task-specific benchmarks, empirical results that push the boundary of extraction capabilities, and demonstration systems that reveal persistent challenges and opportunities for the next generation of high-recall IE.
Link to this record:	urn:nbn:de:bsz:291--ds-472165 hdl:20.500.11880/41944 http://dx.doi.org/10.22028/D291-47216
Advisor:	Razniewski, Simon; Weikum, Gerhard
Date of oral examination:	24-Feb-2026
Date of registration:	1-Jun-2026
Faculty:	MI - Fakultät für Mathematik und Informatik
Department:	MI - Informatik; MI - Mathematik
Professorship:	MI - Prof. Dr. Gerhard Weikum
Collections:	SciDok - Der Wissenschaftsserver der Universität des Saarlandes

Files for this record:

File	Description	Size	Format
PhD_Thesis.pdf		3 MB	Adobe PDF

This item is licensed under a Creative Commons License