
12 Judgment Rule 4 for Content Analysis

Judgment Rule: Intercoder reliability should be .80 (or 80 percent) or above for each variable (some exceptions apply).

Key Takeaways

The judgment rule answers the question: Is the intercoder reliability high enough?

Intercoder reliability[1] is a measure of how often two or more coders code the same text (or portion of a text) the same way; in other words, it is a measure of consistency. For readers, intercoder reliability is the way to check that the coding scheme (see Chapter 11) is not idiosyncratic or limited to a single individual’s opinion, but that coders share a common definition. For example, as discussed earlier, one of the most common coding categories is gender, which for many research projects is coded as either male or female. Look at the picture below. How would you code it?

If you gave this picture to two researchers to code for gender and they both coded it as female, then their intercoder reliability score would be 100 percent, or perfect agreement. With 100 percent agreement, you—the reader—know that whatever is being coded is fairly clearly defined and understood, at least within the culture the coders are drawn from (a point to which we will return below). But if two coders do not decide the same way on the same thing, then the reader knows that the coding instructions are not precise enough to tell what the coders were actually coding.

Studio publicity picture of Katharine Hepburn as a young male.
Figure 12.1: Ambiguous cues for gender (adult). Publicity photograph of Katharine Hepburn in Sylvia Scarlett for RKO studios. Reproduced courtesy of Moviestore Collection Ltd and Alamy Limited.

A high intercoder reliability is one of the essential criteria for judging the acceptability of the research. As Neuendorf, one of the foremost authorities on content analysis, said, “given that a goal in content analysis is to identify and record relatively objective (or at least intersubjective) characteristics of messages, reliability is paramount. Without the establishment of reliability, content analysis measures are useless.”[2] In fact, “interjudge reliability is […] the standard measure of research quality. High levels of disagreement among judges suggest weaknesses in [the] research methods.”[3]

Intercoder reliability is measured either as percent agreement (from 0 percent, or “no agreement,” to 100 percent, or “total agreement”), or by one of several intercoder reliability indexes (which run from zero, “no agreement,” to one, “complete agreement”). Each index uses a slightly different method to calculate agreement, but in general, .80 is considered an acceptable level of intercoder reliability for most of them.
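
To make the two scales concrete, here is a minimal sketch in Python (not from this chapter; the coders’ decisions are invented for illustration) of how percent agreement and one chance-corrected index, Cohen’s kappa, could be computed for two coders coding gender on ten pictures.

```python
# Minimal sketch (invented data): percent agreement and Cohen's kappa
# for two coders who each coded the same ten pictures for gender.
from collections import Counter

coder_a = ["F", "F", "M", "F", "M", "M", "F", "F", "M", "F"]
coder_b = ["F", "F", "M", "M", "M", "M", "F", "F", "M", "F"]
n = len(coder_a)

# Percent agreement: the share of units both coders coded identically.
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Agreement expected by chance, based on each coder's marginal frequencies.
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))

# Cohen's kappa corrects the observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement: {observed:.0%}")  # 90%
print(f"Cohen's kappa: {kappa:.2f}")         # 0.80
```

On these invented codes the two coders agree on nine of ten pictures (90 percent); because coders choosing between two categories would agree roughly half the time by chance, the chance-corrected kappa works out to .80. Scott’s pi and Krippendorff’s alpha correct for chance in broadly similar ways, though each computes the expected agreement slightly differently.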

Intercoder reliability indexes: The most commonly used measures in communication research are listed below.

Table 12.1. Chart of Commonly Used Intercoder Reliability Indexes

Index | Acceptable level | Note
Percent agreement | 80% to 100% | Overestimates true agreement
Holsti’s method | .80 to 1.00 | Overestimates true agreement
Scott’s pi (π) | .80 to 1.00 | Cohen’s kappa is slightly more informative
Cohen’s kappa (κ) | .75 to 1.00 | Considered slightly stricter than other measures
Krippendorff’s alpha (α) | .80 to 1.00 | Well regarded
Cronbach’s alpha | — | Considered inappropriate

Researchers should report which intercoder reliability index they used—Scott’s pi, Krippendorff’s alpha, or any of the other indexes above—and what level of agreement the coders reached. As a reader, you should look for which index the researcher used and the level of agreement reported. If the researchers report an index with which you aren’t familiar, it is your job as a reader to look up that index and find its acceptable level.

In truth, there is no completely “set in stone” minimum level of reliability. Most people consider .80 (or 80 percent) acceptable, and .75 is considered good. But for some variables, such as gender, that are considered easy to code, you should probably raise the standard even higher (.93 or better), unless the researcher has given a satisfactory explanation of why a particular population (e.g., babies, the elderly) is particularly difficult to code.

For hard-to-code variables, a .75 to .80 intercoder reliability would be acceptable, but lower levels need to be explained. Values as low as .60 to .67 are sometimes reported, but most readers will not accept agreement that low. Remember that if two coders randomly guessed gender (without looking at the pictures), they would agree with each other about 50 percent of the time simply by chance. (A purely random toss of a coin has a .5 chance of coming up heads.)
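
As a rough worked illustration of that point (the numbers here are invented), correcting a raw agreement figure for chance shows why agreement in the .60 range is so unconvincing for a two-category variable:

```python
# Sketch with invented numbers: chance-correcting a raw 60% agreement on a
# two-category variable, where random guessing alone yields about 50% agreement.
observed = 0.60  # raw percent agreement between the two coders
chance = 0.50    # agreement expected if both coders simply guessed
corrected = (observed - chance) / (1 - chance)
print(f"{corrected:.2f}")  # 0.20 -- far below the .75 to .80 range readers expect
```

An agreement of .60 therefore represents only a modest improvement over guessing, which is why most readers will not accept it.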

Agreement by consensus. Some researchers use a technique in which they look at all of the disagreements in coding and then “mutually agree” on how to recode each disputed item.[4] If the researcher then recalculates the reliability index using these consensus-derived codes, that is essentially getting two bites at the apple, or research cheating. The intercoder reliability must be calculated from the coders’ original, independent decisions, including their disagreements.

Reporting intercoder reliability. Readers should also check whether the authors reported a single overall intercoder reliability or the reliability for each variable. Ideally, the researcher should report the reliability for each variable separately, because an overall figure can disguise the fact that some codes are far less reliable than others. An average of 80 percent across five variables could mean that each variable had 80 percent agreement, or it could mean that four variables had 100 percent agreement and one had 0 percent agreement.
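
As a short sketch of that arithmetic (the variable names and figures are invented), a single overall average can look acceptable while one variable is completely unreliable:

```python
# Sketch with invented numbers: a single overall figure can hide a bad variable.
per_variable_agreement = {
    "gender": 1.00,      # 100% agreement
    "age group": 1.00,
    "setting": 1.00,
    "tone": 1.00,
    "dominance": 0.00,   # complete disagreement on this variable
}

overall = sum(per_variable_agreement.values()) / len(per_variable_agreement)
print(f"Overall agreement: {overall:.0%}")  # 80% -- looks acceptable

# Reporting each variable separately exposes the problem.
for variable, agreement in per_variable_agreement.items():
    print(f"{variable}: {agreement:.0%}")
```

Seeing the per-variable figures, a reader would discount any findings that rest on the unreliable variable.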

Further Reading

Krippendorff, Klaus. Content Analysis: An Introduction to Its Methodology. Thousand Oaks, CA: Sage Publications, 2004.

Neuendorf, Kimberly A. The Content Analysis Guidebook. Thousand Oaks, CA: Sage Publications, 2002.


  1. Also called interrater reliability.
  2. Kimberly A. Neuendorf, The Content Analysis Guidebook (Thousand Oaks, CA: Sage Publications, 2002), 141.
  3. Richard H. Kolbe and Melissa S. Burnett, “Content Analysis Research: An Examination of Applications with Directives for Improving Research Reliability and Objectivity,” Journal of Consumer Research 18, no. 2 (September 1991): 248, https://doi.org/10.1086/209256.
  4. As Krippendorff points out, the “consensus” or “majority vote” process is deeply flawed: “Observers are known to negotiate and yield to each other in tit-for-tat exchanges, with prestigious group members dominating the outcome. […] Observing and coding come to reflect the social structure of the group.” Klaus Krippendorff, Content Analysis: An Introduction to Its Methodology (Thousand Oaks, CA: Sage Publications, 2004), 217.

License


Reading Social Science Methods Copyright © 2023 by Ann Reisner is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
