Abstract

A rating design is a pre-specified plan for collecting ratings. The best design for a rater-mediated assessment both psychometrically and from the perspective of fairness is a fully-crossed design in which all objects are rated by all raters. An incomplete rating design is one in which all objects are not rated by all raters, instead each object is rated by an assigned subset of raters usually to reduce the time and/or cost of the assessment. Human raters have varying propensities to rate severely or leniently. One method of compensating for rater severity is the many-facets Rasch model (MFRM). However, unless the incomplete rating design used to gather the ratings is appropriately linked, the results of the MFRM analysis may not be on the same scale and therefore may not be fairly compared. Given non-trivial numbers of raters and/or objects to rate, there are numerous possible incomplete designs with various levels of linkage. The literature provides little guidance on the extent to which differently linked rating designs might affect the results of a MFRM analysis. Eighty different subsets of data were extracted from a pre-existing fully-crossed rating data set originally gathered from 24 essays rated by eight raters. These subsets represented 20 different incomplete rating designs and four specific assignments of raters to essays. The subsets of rating data were analyzed in Facets software to investigate the effects of incomplete rating designs on the MFRM results. The design attributes related to linkage that were varied in the incomplete designs include (a) rater coverage: the number of raters-per-essay, (b) repetition-size: the number of essays rated in one repetition of the sub-design pattern, (c) design structure: the linking network structure of the incomplete design, and (d) rater order: the specific assignments of raters to essays. A number of plots and graphs were used to visualize the incomplete designs and the rating results. Several measures including the observed and fair averages for raters and essays from the 80 MFRM analyses were compared against the fair averages for the fully-crossed design. Results varied widely depending on different combinations of design attributes and rater orders. Rater coverage had the overall largest effect, with rater order producing larger ranges of values for sparser designs. Many of the observed averages for raters and essays more closely approximated the results from the fully-crossed design than did the adjusted fair-averages, particularly for the more sparsely linked designs. The stability of relative standing measures was unexpectedly low.

Degree

PhD

College and Department

David O. McKay School of Education; Instructional Psychology and Technology

Rights

http://lib.byu.edu/about/copyright/