Please use this identifier to cite or link to this item:
doi:10.22028/D291-47880
| Title: | The stability of IRT parameters under several test equating conditions |
| Author(s): | Weber, Dominik; Becker, Nicolas; Spinath, Frank M.; Koch, Marco |
| Language: | English |
| Journal title: | Frontiers in Psychology |
| Volume: | 16 |
| Publisher/Platform: | Frontiers |
| Year of Publication: | 2026 |
| Free key words: | test equating; item linking; test validity; anchor item; item response theory; simulation study |
| DDC notations: | 150 Psychology |
| Publication type: | Journal Article |
| Abstract: | Introduction: It is crucial for researchers and test developers to compare results from different test sets (e.g., re-testing, parallel test forms). To ensure comparability, test sets are often linked using anchor items as a common denominator alongside distinct items. To date, most studies on test equating have been limited in scope, typically comparing only absolute numbers of anchor items or focusing on a single IRT model or equating method. Furthermore, previous research has primarily evaluated the absolute deviation of estimated parameters from true parameters. However, in diagnostic contexts, the correlation between these values is often more relevant for ensuring validity and test fairness. Therefore, the aim of this simulation study was to examine the impact of a broad range of key factors on test equating. Methods: We evaluated correlations and recovery indices between predefined true values and values estimated through test equating for three IRT parameters (discrimination, difficulty, and ability). To this end, we varied the equating method (MS, MM, MGM, IRF, TRF), the IRT model (2PL vs. 3PL), guessing probability (0.000–0.250), anchor item proportion (5–25%), test set size (20–80 items), and the discrimination parameters of the anchor items. In addition, we used samples of 25–100 individuals to assess equating quality under challenging conditions as well as samples of 500 and 1,000 individuals to reflect adequate modeling conditions. Results: Low guessing probabilities and high anchor item discrimination parameters strongly improved test equating quality for all three IRT parameters. Recovery of discrimination and ability parameters increased logarithmically with larger test set sizes and higher anchor item proportions, with each of these two factors partially compensating for reductions in the other. While sample sizes below 100 individuals produced inadequate parameter recovery, samples of 100 or 500 individuals were justifiable under certain conditions. However, samples of only 100 individuals carried a slight risk of non-convergence. The choice of the equating method had rather minor effects and the impact of the IRT model was ambivalent. Discussion: These findings highlight the importance of using distractor-free response formats without any guessing probability, anchor items with high discrimination parameters, and large samples to ensure valid test equating. For individual research and test application purposes, we provide a comprehensive data set covering multiple factor levels and a step-by-step simulation guide. |
| DOI of the first publication: | 10.3389/fpsyg.2025.1652341 |
| URL of the first publication: | https://doi.org/10.3389/fpsyg.2025.1652341 |
| Link to this record: | urn:nbn:de:bsz:291--ds-478803; hdl:20.500.11880/41870; http://dx.doi.org/10.22028/D291-47880 |
| ISSN: | 1664-1078 |
| Date of registration: | 21-May-2026 |
| Faculty: | HW - Fakultät für Empirische Humanwissenschaften und Wirtschaftswissenschaft |
| Department: | HW - Psychologie |
| Professorship: | HW - Prof. Dr. Frank Spinath |
| Collections: | SciDok - Der Wissenschaftsserver der Universität des Saarlandes |
Files for this record:
| File | Description | Size | Format |
|---|---|---|---|
| fpsyg-16-1652341.pdf | | 1.78 MB | Adobe PDF |
This item is licensed under a Creative Commons License

