Automated PROMISE V2 Scoring from PSMA PET/CT Reports Using Large Language Models: A Comparative Evaluation of Prompt Design and Model Performance

Speicher, Tilman; Demirkol, Isa Ethem; Blickle, Arne; Bastian, Moritz B.; Maus, Stephan; Schaefer-Schuler, Andrea; Bartholomä, Mark; Burgard, Caroline; Ezziddin, Samer; Rosar, Florian

Please use this identifier to cite or link to this resource: doi:10.22028/D291-48144

Title:	Automated PROMISE V2 Scoring from PSMA PET/CT Reports Using Large Language Models: A Comparative Evaluation of Prompt Design and Model Performance
Author(s):	Speicher, Tilman; Demirkol, Isa Ethem; Blickle, Arne; Bastian, Moritz B.; Maus, Stephan; Schaefer-Schuler, Andrea; Bartholomä, Mark; Burgard, Caroline; Ezziddin, Samer; Rosar, Florian
Language:	English
Journal:	Current Oncology
Volume:	33
Issue:	6
Publisher/Platform:	MDPI
Year of publication:	2026
Free keywords:	PROMISE; LLM; large language model; prostate cancer; PSMA PET/CT
DDC subject group:	610 Medicine and health
Document type:	Journal article
Abstract:	Large language models (LLMs) are increasingly explored for clinical use. However, the extent to which such models can reliably support physicians in reporting, staging, and the assessment of classification remains an active area of research. This study aimed to evaluate and compare multiple LLMs for automated PROMISE V2 classification for prostate cancer. A total of 126 unambiguous German-language PSMA PET/CT text reports were retrospectively analyzed, with reference standards established by expert consensus based on image interpretation and the original report text. Five LLMs (GPT-5.4, DeepSeek-V3.2, Claude Sonnet 4.6, Gemini 3 Flash and Grok 4) were assessed using two English-language prompting strategies of varying complexity. Agreement with the reference standard served as the primary endpoint. Performance varied in the short-prompt setting (36.5–79.4%) but improved consistently with the long prompt (74.6–86.5%), with Gemini 3 Flash achieving the highest agreement. Across PROMISE V2 subcategories, agreement rates were high (miT: 81.0–92.1%, miN: 92.9–96.0%, miM: 92.9–95.2%), despite inter-model differences. In conclusion, contemporary LLMs demonstrate promising performance in deriving PROMISE V2 scores from unambiguous original report texts, particularly when guided by detailed prompts.
DOI of the first publication:	10.3390/curroncol33060349
URL of the first publication:	https://doi.org/10.3390/curroncol33060349
Link to this record:	urn:nbn:de:bsz:291--ds-481446 hdl:20.500.11880/42106 http://dx.doi.org/10.22028/D291-48144
ISSN:	1718-7729
Date of entry:	29-Jun-2026
Description of the related object:	Supplementary Materials
Related object:	https://www.mdpi.com/article/10.3390/curroncol33060349/s1
Faculty:	M - Faculty of Medicine
Department:	M - Radiology
Professorship:	M - Prof. Dr. Samer Ezziddin
Collection:	SciDok - Der Wissenschaftsserver der Universität des Saarlandes

Files for this record:

File	Description	Size	Format
curroncol-33-00349-v3.pdf		882.18 kB	Adobe PDF


This resource was published under the following copyright terms: Creative Commons license