Abstract
Six major indices of intercoder reliability were evaluated against judgments of human coders in a reconstructed experiment. The estimates of the often-used indices, Cohen’s κ, Scott’s π, and Krippendorff’s α, more than quadrupled observed chance agreement and underestimated true reliability by more than half. Although these indices were designed to improve on percent agreement, their estimation errors were triple that of percent agreement. Further, the three indices’ estimated chance agreements are negatively correlated with the observed chance agreements, indicating that the indices tend to predict the opposite of what they were meant to predict.
A lesser-known index, Gwet’s AC1, performed better than the other five in most of the tests, and far better than κ, π, and α in all tests. AC1’s chance estimation, however, produced below 8% accuracy. When predicting true reliability (a_t), AC1 underperformed percent agreement by 14%. The other indices performed worse in both tests.
Investigating the cause of the inaccuracy, we discovered that category count and target distribution have minimal influence on observed chance agreement, while S, Ir, and AC1 rely mostly on category count, and κ, π, and α rely heavily on target distribution. We also discovered that task difficulty has a positive and fairly strong influence on observed chance agreement, but no detectable influence on the chance agreement estimators of S, Ir, and AC1, and even a negative influence on those of κ, π, and α. The indices have evidently relied on the wrong factors and failed to rely on the one right factor.
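The dependence described above follows from the standard closed-form chance-agreement terms of each index. The sketch below (not from the paper; it uses the textbook definitions of S, κ, π, and AC1 for two coders) illustrates how S’s chance term depends only on the number of categories, while κ and π depend on the coders’ marginal (target) distributions:

```python
# Chance-agreement terms a_e of several intercoder reliability indices
# for two coders, using the standard definitions: Bennett's S, Cohen's
# kappa, Scott's pi, and Gwet's AC1. Each index is then computed as the
# generic chance-corrected agreement (a_o - a_e) / (1 - a_e).
from collections import Counter

def chance_agreements(coder1, coder2, categories):
    """Return the expected chance agreement a_e used by each index."""
    n = len(coder1)
    K = len(categories)
    c1, c2 = Counter(coder1), Counter(coder2)
    p1 = {k: c1[k] / n for k in categories}              # coder 1 marginals
    p2 = {k: c2[k] / n for k in categories}              # coder 2 marginals
    pbar = {k: (p1[k] + p2[k]) / 2 for k in categories}  # average marginals
    return {
        "S":     1 / K,                                  # category count only
        "kappa": sum(p1[k] * p2[k] for k in categories),         # marginals
        "pi":    sum(pbar[k] ** 2 for k in categories),          # marginals
        "AC1":   sum(pbar[k] * (1 - pbar[k]) for k in categories) / (K - 1),
    }

def corrected(a_o, a_e):
    """Generic chance-corrected agreement."""
    return (a_o - a_e) / (1 - a_e)

# Two coders labelling 10 items with a skewed target distribution.
c1 = ["pos"] * 8 + ["neg"] * 2
c2 = ["pos"] * 9 + ["neg"] * 1
a_o = sum(x == y for x, y in zip(c1, c2)) / len(c1)  # percent agreement
ae = chance_agreements(c1, c2, ["pos", "neg"])
```

With this skewed sample, κ’s and π’s chance terms (0.74 and 0.745) are far above S’s fixed 1/K = 0.5, so κ and π shrink the same observed agreement much more than S does, mirroring the distribution-dependence discussed above.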
Original language | English
---|---
Publication status | Published - 21 Jun 2013
Event | 63rd Annual International Communication Association Conference, ICA 2013: Challenging Communication Research - London, United Kingdom. Duration: 17 Jun 2013 → 21 Jun 2013. https://convention2.allacademic.com/one/ica/ica13/ (link to online conference programme)

Conference

Conference | 63rd Annual International Communication Association Conference, ICA 2013
---|---
Country/Territory | United Kingdom
City | London
Period | 17/06/13 → 21/06/13
Internet address | https://convention2.allacademic.com/one/ica/ica13/