Intercoder reliability is the most widely used quantitative indicator of measurement quality in content studies. Researchers in psychology, sociology, education, medicine, marketing, and other disciplines also use reliability to evaluate the quality of diagnoses, tests, and other assessments. Many indices of reliability have been recommended for general use. This article analyzes 22 of them, organized into 18 chance-adjusted and four non-adjusted indices. The chance-adjusted indices are further organized into three groups: nine category-based indices, eight distribution-based indices, and one double-based index that rests on both category and distribution. The main purpose of this work is to examine the assumptions behind each index. Most of these assumptions are unexamined in the literature, yet they shape assessments of reliability, need to be understood, and give rise to paradoxes and abnormalities. This article discusses 13 paradoxes and nine abnormalities to illustrate the 24 assumptions. To facilitate understanding, the analysis focuses on categorical scales with two coders, and further on binary scales where appropriate. The discussion is situated mostly in the analysis of communication content. The assumptions and patterns identified here also apply to studies, evaluations, and diagnoses in other disciplines that involve more coders, raters, diagnosticians, or judges using binary or multi-category scales. We argue that a new index is needed. Until such an index is established, researchers need guidelines for using the existing indices, and this article recommends such guidelines.
|Journal|Annals of the International Communication Association|
|Publication status|Published - 18 May 2016|