Guidelines for screening and selecting common items in vertical scaling have been adopted from equating. However, differences between vertical scaling and equating suggest that these guidelines may not apply to vertical scaling in the same way that they apply to equating. For example, in equating, the examinee groups are assumed to be randomly equivalent, but in vertical scaling, the examinee groups are assumed to possess different levels of proficiency. Equating studies that examined the characteristics of the common-item set stress the importance of careful item selection, particularly when groups differ in ability level. Because cross-level ability differences are expected in vertical scaling, the psychometric characteristics of the common items become even more important for obtaining a correct interpretation of students' academic growth. This dissertation applied two screening criteria and two selection approaches to investigate how changes in the composition of the linking sets affected the interpretation of students' growth when creating vertical scales for two elementary mathematics tests. The purpose was to observe how well these equating guidelines apply in the context of vertical scaling. Two separate datasets were analyzed to observe the impact of manipulating the common items' content area and targeted curricular grade level. The same Rasch scaling method was applied to all variations of the linking set. Both the robust z procedure and a variant of the 0.3-logit difference procedure were used to screen unstable common items from the linking sets. (In vertical scaling, a directional item-difficulty difference must be computed for the 0.3-logit difference procedure.) Different combinations of stable common items were then selected to make up the linking sets. The mean/mean method was used to compute the equating constant and linearly transform the students' test scores onto the base scale. A total of 36 vertical scales were created.
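The screening and linking steps described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The function names and the item difficulties in the usage note are hypothetical. The 0.74 × IQR scaling and the 1.645 cutoff for robust z are commonly cited conventions that the abstract does not state, and centering the 0.3-logit check on the mean directional difference is likewise an assumption about how the variant might work.

```python
from statistics import mean, median, quantiles


def difficulty_differences(b_lower, b_upper):
    """Directional difference per common item: upper-level minus lower-level estimate."""
    return [u - l for l, u in zip(b_lower, b_upper)]


def robust_z(diffs):
    """Robust z per item: (d - median) / (0.74 * IQR). The 0.74 factor is an assumed convention."""
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    scale = 0.74 * (q3 - q1)
    if scale == 0:
        raise ValueError("IQR of the differences is zero; robust z is undefined")
    med = median(diffs)
    return [(d - med) / scale for d in diffs]


def flag_robust_z(diffs, cutoff=1.645):
    """True marks an item as unstable. The 1.645 cutoff is an assumption."""
    return [abs(z) > cutoff for z in robust_z(diffs)]


def flag_logit_difference(diffs, threshold=0.3):
    """True marks an item whose difference departs from the mean shift by more than 0.3 logits."""
    centre = mean(diffs)
    return [abs(d - centre) > threshold for d in diffs]


def mean_mean_constant(b_base, b_new):
    """Rasch mean/mean equating constant computed over the retained linking items."""
    return mean(b_base) - mean(b_new)


def to_base_scale(thetas, constant):
    """Shift ability estimates onto the base scale (the slope is fixed at 1 under Rasch)."""
    return [t + constant for t in thetas]
```

With invented lower-level difficulties `[0.50, 0.70, 1.00, 1.20, 1.40, 2.50]` and upper-level difficulties `[-0.52, -0.25, -0.05, 0.25, 0.38, 0.60]` for six common items, both screens flag only the last item. Passing the remaining five items to `mean_mean_constant` with the lower level as the base gives the constant that is added to upper-level ability estimates.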
The results indicated that, although the robust z procedure was a more conservative approach to flagging unstable items, the robust z and the 0.3-logit difference procedures produced similar interpretations of students' growth. The results also suggested that the choice of grade-level-targeted common items affected the estimates of students' grade-to-grade growth, whereas the results regarding the choice of content-area-specific common items were inconsistent. The findings from the Geometry and Measurement dataset indicated that the choice of content-area-specific common items had an impact on the interpretation of students' growth, while the findings from the Algebra and Data Analysis/Probability dataset indicated that this choice did not appear to significantly affect the interpretation of students' growth. A discussion of the limitations of the study and possible future research is presented.



David O. McKay School of Education; Instructional Psychology and Technology



vertical scaling, common-item design, equating, linking, content and construct representation, item stability, robust z, 0.3-logit difference, Item Response Theory, Rasch scaling