Comparing weighted RMSD, weighted MD, infit, and outfit item fit statistics under uniform differential item functioning
Journal article › Research › peer-reviewed
Publication data
| By | Alexander Robitzsch |
| Original language | English |
| Published in | Mathematics, 13(23), Article 3752 |
| Publisher | MDPI |
| ISSN | 2227-7390 |
| DOI/Link | https://doi.org/10.3390/math13233752 |
| Publication status | Published – November 2025 |
In educational large-scale assessment studies, uniform differential item functioning (DIF) across countries often challenges the application of a common item response model, such as the two-parameter logistic (2PL) model, to all participating countries. DIF occurs when certain items systematically advantage or disadvantage specific groups, potentially biasing ability estimates and secondary analyses. Identifying misfitting items caused by DIF is therefore essential, and several item fit statistics have been proposed in the literature for this purpose. This article investigates the performance of four commonly used item fit statistics under uniform DIF: the weighted root mean square deviation (RMSD), the weighted mean deviation (MD), the infit statistic, and the outfit statistic. Analytical approximations were derived to relate the uniform DIF effect size to these item fit statistics, and the theoretical findings were confirmed in a comprehensive simulation study. The results indicate that distribution-weighted RMSD and MD statistics are less sensitive to DIF in very easy or very difficult items, whereas difficulty-weighted RMSD and MD exhibit consistent detection performance across all item difficulty levels. However, the sampling variance of the difficulty-weighted statistics is notably higher for items with extreme difficulty. Infit and outfit statistics were largely ineffective in detecting DIF in items of moderate difficulty, with sensitivity limited to very easy or very difficult items. To illustrate the practical application of these statistics, they were computed for the PISA 2006 reading study, and their distribution across participating countries was descriptively examined. The findings offer guidance for selecting appropriate item fit statistics in large-scale assessments and highlight the strengths and limitations of the different approaches under uniform DIF conditions.
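To make the four statistics concrete, the sketch below simulates 2PL responses with a uniform DIF shift on one item and computes RMSD, MD, infit, and outfit against the common (DIF-free) model. This is a minimal illustration, not the paper's implementation: the theta grid, the standard-normal weight function (a distribution-weighted variant), and all parameter values are assumptions chosen for demonstration.

```python
# Minimal sketch of RMSD, MD, infit, and outfit for dichotomous 2PL items.
# The weight w(theta) below is a distribution weight; the paper also studies
# difficulty-weighted variants, which are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)

# --- Simulate responses from a 2PL model with uniform DIF on item 0 ---
N, I = 5000, 10
theta = rng.normal(size=N)
a = rng.uniform(0.8, 1.6, size=I)        # discriminations (illustrative)
b = rng.uniform(-2.0, 2.0, size=I)       # difficulties (illustrative)
dif = np.zeros(I); dif[0] = 0.5          # uniform DIF shift for the focal group
group = rng.integers(0, 2, size=N)       # 0 = reference, 1 = focal

def p2pl(theta, a, b):
    """Item response probability under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# True (group-specific) probabilities and observed responses
P_true = p2pl(theta[:, None], a[None, :], b[None, :] + np.outer(group, dif))
X = (rng.uniform(size=(N, I)) < P_true).astype(float)

# Model-implied probabilities from the common, DIF-free 2PL model
P_model = p2pl(theta[:, None], a[None, :], b[None, :])

# --- Weighted RMSD and MD on a theta grid ---
grid = np.linspace(-4, 4, 41)
w = np.exp(-0.5 * grid**2)
w /= w.sum()                             # standard-normal distribution weight

def rmsd_md(item):
    # Bin persons by theta and compare observed proportions with the model curve.
    idx = np.clip(np.digitize(theta, grid) - 1, 0, len(grid) - 1)
    obs = np.array([X[idx == g, item].mean() if (idx == g).any() else np.nan
                    for g in range(len(grid))])
    mod = p2pl(grid, a[item], b[item])
    ok = ~np.isnan(obs)
    ww = w[ok] / w[ok].sum()
    md = np.sum(ww * (obs[ok] - mod[ok]))                 # signed deviation
    rmsd = np.sqrt(np.sum(ww * (obs[ok] - mod[ok])**2))   # squared deviation
    return rmsd, md

# --- Infit (information-weighted) and outfit (unweighted) mean squares ---
def infit_outfit(item):
    e = P_model[:, item]
    v = e * (1 - e)                      # Bernoulli variance of the response
    r = X[:, item] - e                   # raw residual
    infit = np.sum(r**2) / np.sum(v)     # variance-weighted mean square
    outfit = np.mean(r**2 / v)           # mean of squared standardized residuals
    return infit, outfit

for i in (0, 1):                         # DIF item vs. a DIF-free item
    print(f"item {i}: RMSD/MD = {rmsd_md(i)}, infit/outfit = {infit_outfit(i)}")
```

Under this setup, the DIF item (item 0) should show a noticeably larger RMSD than the DIF-free comparison item, while the infit and outfit values stay close to their expected value of 1 unless the item is very easy or very difficult, consistent with the abstract's description.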