The stability of IRT parameters under several test equating conditions

Weber, Dominik; Becker, Nicolas; Spinath, Frank M.; Koch, Marco

Please use this identifier to cite or link to this item: doi:10.22028/D291-47880

Title:	The stability of IRT parameters under several test equating conditions
Author(s):	Weber, Dominik; Becker, Nicolas; Spinath, Frank M.; Koch, Marco
Language:	English
Journal title:	Frontiers in Psychology
Volume:	16
Publisher/Platform:	Frontiers
Year of Publication:	2026
Free key words:	test equating; item linking; test validity; anchor item; item response theory; simulation study
DDC notations:	150 Psychology
Publication type:	Journal Article
Abstract:	Introduction: It is crucial for researchers and test developers to compare results from different test sets (e.g., re-testing, parallel test forms). To ensure comparability, test sets are often linked using anchor items as a common denominator alongside distinct items. To date, most studies on test equating have been limited in scope, typically comparing only absolute numbers of anchor items or focusing on a single IRT model or equating method. Furthermore, previous research has primarily evaluated the absolute deviation of estimated parameters from true parameters. However, in diagnostic contexts, the correlation between these values is often more relevant for ensuring validity and test fairness. Therefore, the aim of this simulation study was to examine the impact of a broad range of key factors on test equating. Methods: We evaluated correlations and recovery indices between predefined true values and values estimated through test equating for three IRT parameters (discrimination, difficulty, and ability). To this end, we varied the equating method (MS, MM, MGM, IRF, TRF), the IRT model (2PL vs. 3PL), guessing probability (0.000–0.250), anchor item proportion (5–25%), test set size (20–80 items), and the discrimination parameters of the anchor items. In addition, we used samples of 25–100 individuals to assess equating quality under challenging conditions as well as samples of 500 and 1,000 individuals to reflect adequate modeling conditions. Results: Low guessing probabilities and high anchor item discrimination parameters strongly improved test equating quality for all three IRT parameters. Recovery of discrimination and ability parameters increased logarithmically with larger test set sizes and higher anchor item proportions, with each of these two factors partially compensating for reductions in the other. While sample sizes below 100 individuals produced inadequate parameter recovery, samples of 100 or 500 individuals were justifiable under certain conditions. However, samples of only 100 individuals carried a slight risk of non-convergence. The choice of the equating method had rather minor effects and the impact of the IRT model was ambivalent. Discussion: These findings highlight the importance of using distractor-free response formats without any guessing probability, anchor items with high discrimination parameters, and large samples to ensure valid test equating. For individual research and test application purposes, we provide a comprehensive data set covering multiple factor levels and a step-by-step simulation guide.
DOI of the first publication:	10.3389/fpsyg.2025.1652341
URL of the first publication:	https://doi.org/10.3389/fpsyg.2025.1652341
Link to this record:	urn:nbn:de:bsz:291--ds-478803; hdl:20.500.11880/41870; http://dx.doi.org/10.22028/D291-47880
ISSN:	1664-1078
Date of registration:	21-May-2026
Faculty:	HW - Fakultät für Empirische Humanwissenschaften und Wirtschaftswissenschaft
Department:	HW - Psychologie
Professorship:	HW - Prof. Dr. Frank Spinath
Collections:	SciDok - Der Wissenschaftsserver der Universität des Saarlandes

Files for this record:

File	Description	Size	Format
fpsyg-16-1652341.pdf		1.78 MB	Adobe PDF

This item is licensed under a Creative Commons License