| Differential item functioning (DIF) can occur across age, gender, ethnic, and/or linguistic groups of examinees. Whenever a test is administered to more than one group of examinees, DIF is therefore possible, and items exhibiting DIF should be detected with accurate and powerful statistical methods. Most of the available DIF methods, however, have been applied mainly in large-scale testing contexts. Since the early 1990s, Ramsay has developed a nonparametric item response methodology and accompanying computer software, TestGraf (Ramsay, 2000). This nonparametric item response theory (IRT) method requires fewer examinees and items than other IRT methods and was also designed to detect DIF. However, its Type I error rate for DIF detection had not been investigated. The present study investigated the Type I error rate of the nonparametric IRT DIF detection method when applied to moderate-to-small-scale testing contexts, in which there were 500 or fewer examinees in a group. The Mantel-Haenszel (MH) DIF detection method was also included. A three-parameter logistic item response model was used to generate data for the two population groups. The simulated test for each population consisted of 40 items. Item statistics for the first 34 non-DIF items were randomly chosen from the grade-eight mathematics test of the 1999 TIMSS (Third International Mathematics and Science Study), whereas item statistics for the last six items were adopted from the DIF items used in the study of Muniz, Hambleton, and Xing (2001). These six items were the focus of this study. The MH test maintained its Type I error rate at the nominal level. The investigation of the nonparametric IRT methodology resulted in: (a) inflated error rates for both a formal and an informal test of DIF, and (b) the discovery of an error in the widely available nonparametric IRT software, TestGraf. 
As a result, new cut-off indices for the nonparametric IRT DIF test were determined for use in the moderate-to-small-scale testing context. |
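
The simulation design summarized above (data generated from a three-parameter logistic model with no DIF built in, followed by an MH test at the nominal level) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's own code: the item parameters are randomly drawn placeholders rather than the TIMSS values, and the scaling constant 1.7, the standard normal ability distribution, matching on total score, the continuity correction, the group size of 200, and the 20 replications are all choices made here for the example. It covers only the MH test and does not attempt the TestGraf procedure.

```python
import math
import random

CRITICAL_05 = 3.841  # chi-square critical value for df = 1 at alpha = 0.05


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (D = 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def make_items(rng, n_items=40):
    """Hypothetical (a, b, c) item parameters; placeholders, not TIMSS values."""
    return [(rng.uniform(0.5, 1.5), rng.gauss(0.0, 1.0), rng.uniform(0.1, 0.25))
            for _ in range(n_items)]


def simulate(n_examinees, items, rng):
    """One 0/1 response vector per examinee, with ability drawn from N(0, 1)."""
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        data.append([1 if rng.random() < p_3pl(theta, a, b, c) else 0
                     for a, b, c in items])
    return data


def mantel_haenszel(ref, foc, item):
    """Continuity-corrected MH chi-square and common odds ratio for one item.

    Examinees are matched on total test score (studied item included).
    """
    strata = {}
    for group, rows in ((0, ref), (1, foc)):
        for row in rows:
            cell = strata.setdefault(sum(row), [[0, 0], [0, 0]])
            cell[group][1 - row[item]] += 1  # column 0 = correct, 1 = incorrect
    sum_obs = sum_exp = sum_var = num = den = 0.0
    for (r1, r0), (f1, f0) in strata.values():
        n = r1 + r0 + f1 + f0
        n_ref, n_foc = r1 + r0, f1 + f0
        if n < 2 or n_ref == 0 or n_foc == 0:
            continue  # stratum carries no information about group differences
        m1, m0 = r1 + f1, r0 + f0
        sum_obs += r1
        sum_exp += n_ref * m1 / n
        sum_var += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))
        num += r1 * f0 / n
        den += r0 * f1 / n
    chi2 = (abs(sum_obs - sum_exp) - 0.5) ** 2 / sum_var if sum_var > 0 else 0.0
    odds_ratio = num / den if den > 0 else float("inf")
    return chi2, odds_ratio


def type_i_error_rate(n_per_group, items, item, n_reps, seed=0):
    """Share of replications in which a DIF-free item is flagged at alpha = 0.05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        # Both groups share the same item parameters, so any flag is a false alarm.
        ref = simulate(n_per_group, items, rng)
        foc = simulate(n_per_group, items, rng)
        chi2 = mantel_haenszel(ref, foc, item)[0]
        rejections += chi2 > CRITICAL_05
    return rejections / n_reps


rng = random.Random(1)
items = make_items(rng)
rate = type_i_error_rate(n_per_group=200, items=items, item=39, n_reps=20)
print(f"Empirical Type I error rate: {rate:.3f}")
```

With so few replications the estimate is very noisy; a serious check of the nominal level would need many more replications and the actual item statistics.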