18 Judgment Rule 4 for Experimental Analysis
Judgment Rule: Accept the findings only if the dependent variable can reasonably be expected to detect changes in the experimental participants.
Key Takeaways
Obviously, the treatment phase of the experiment (if the treatment actually is effective) produces a change in the treated subjects. The next phase of the experiment involves measuring what (and how much) change was produced. Researchers call this the testing phase, and the measure used to detect that change is called the dependent variable.
Researchers generally have a variety of measures that they could use as the dependent variable (see Example 18.1). For example, researchers could:
Ask subjects to rank their preferences for food on a list that has some high-fat/high-density items and some low-fat/low-density items,
Ask subjects to list what food they intend to eat at lunch, or
Give subjects a buffet and see what they eat (revealed preference).
Example 18.1
Examples of Dependent Variables
- Researchers want to know if listening to music can lower blood pressure. In this experiment, blood pressure measurements are the dependent variable.
- Researchers want to know if Facebook activity reduces prosocial behavior. In this experiment, willingness to give money to strangers in a dictator game was the measure of prosocial behavior.
- Researchers want to know if reading, watching, or listening to news increases depression. In this case, the Beck Depression Inventory was the dependent variable.
All these measures can detect changes in people’s food preferences and so meet the basic criterion for an adequate dependent variable. Some are stronger measures than others. For example, watching what kinds of food people actually eat is a stronger measure of food preference than asking people what food they intend to eat at lunch (revealed preference is a stronger measure than stated preference). But all of these measures will give the researcher some indication of whether preteens exposed to high-calorie, high-density food commercials changed their food preferences.
The basic judgment rule for this phase of an experiment is: Can the dependent variable reasonably be expected to detect the changes that the treatment produced? In forming a judgment about the quality of the dependent variable, the reader needs to consider two basic questions: First, is the dependent variable reliable? Second, is the dependent variable valid?
Reliability: Reliability is defined as the degree to which the instrument used to quantify the dependent variable gives the same reading each (and every) time. If a researcher wanted to measure the relative importance given to race in a local newspaper, one simple way to measure “importance” is to physically measure the number of column inches of newsprint. A researcher who used a steel ruler, a relatively inflexible instrument, would probably get the same number of inches of newsprint, or something very close, each time he or she measured a column of print. If researchers used something more flexible (a rubber band marked in inches, or silly putty), they might reasonably expect to get different answers each time they measured the column. The silly putty scale would be considered unreliable.
Researchers have several checks for reliability. Some of the major checks are:
Inter-rater reliability: Two researchers (coders) get the same answer using the same measure.
Checking for intercoder reliability:
As a reader, you need to look for whether the researcher measured how consistently coders agreed with other coders’ ratings. Perfect intercoder reliability means that every coder rated every coded variable exactly the same way (a reliability score of 1). A reliability score of 0 means the coders showed no agreement beyond what chance alone would produce. (Notice that you will judge the reliability of coding for experiments exactly the same way as judging intercoder reliability for content analysis.) Table 18.1 lists the generally accepted guidelines for judging intercoder reliability, and a short sketch of one common agreement statistic follows the table:
Table 18.1 Guidelines for Acceptability of Intercoder Reliability
Reliability score | Interpretation |
.9 and greater | Excellent reliability |
.8 to .9 | Good reliability |
.7 to .8 | Acceptable reliability |
.6 to .7 | Questionable reliability |
.5 to .6 | Poor reliability |
Under .5 | Unacceptable reliability |
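The chapter does not prescribe a particular agreement statistic, but percent agreement and Cohen’s kappa are common ways to express how often coders agree. The Python sketch below is a minimal illustration under those assumptions, not a prescribed procedure: the coder ratings are invented, and the category labels (“H” and “L”) are hypothetical. The resulting coefficient is the kind of number you would compare against the cutoffs in Table 18.1.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of items on which the two coders assigned the same code."""
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement estimated from each coder's marginal frequencies.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(coder_a) | set(coder_b))
    return (observed - expected) / (1 - expected)

# Invented ratings: "H" = coded as a high-fat choice, "L" = low-fat choice.
coder_1 = ["H", "H", "L", "H", "L", "L", "H", "L", "H", "H"]
coder_2 = ["H", "H", "L", "H", "L", "H", "H", "L", "H", "H"]

print(f"percent agreement = {percent_agreement(coder_1, coder_2):.2f}")
print(f"Cohen's kappa     = {cohens_kappa(coder_1, coder_2):.2f}")
# Compare the result against the cutoffs in Table 18.1 (e.g., .7 or higher).
```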
Test-retest reliability: An assessment measure produces the same answer time and time again. (A thermometer that gives you a 105-degree fever three times in fifteen minutes has high reliability—and also suggests that you should go to the emergency room.) If you took your temperature three times in quick succession and got respective readings of 86 degrees, 106 degrees, and 93 degrees, then your thermometer isn’t reliable.
Checking for test-retest reliability: Test-retest reliability is determined by repeatedly measuring the same respondents (sometimes by asking the same question a few minutes apart, say at different points in the survey, or by testing subjects twice). Like intercoder reliability, readers need to look at whether the researcher reported test-retest reliability. Test-retest reliability is measured with a correlation coefficient, which can range from 0 (perfect unreliability) to 1 (perfect reliability). Test-retest reliability of over .7 is generally considered acceptable.
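The chapter does not name a specific coefficient, but Pearson’s r computed over two waves of scores from the same respondents is one common way to express test-retest reliability. The sketch below uses invented scores for six hypothetical respondents and simply reports the coefficient, which a reader would compare to the .7 rule of thumb.

```python
def pearson_r(first_wave, second_wave):
    """Correlation between the same respondents' scores at two points in time."""
    n = len(first_wave)
    mean_1 = sum(first_wave) / n
    mean_2 = sum(second_wave) / n
    cov = sum((x - mean_1) * (y - mean_2) for x, y in zip(first_wave, second_wave))
    var_1 = sum((x - mean_1) ** 2 for x in first_wave)
    var_2 = sum((y - mean_2) ** 2 for y in second_wave)
    return cov / (var_1 * var_2) ** 0.5

# Invented scores for six respondents, each measured twice.
time_1 = [12, 18, 7, 22, 15, 9]
time_2 = [13, 17, 8, 21, 14, 10]

print(f"test-retest r = {pearson_r(time_1, time_2):.2f}")
# A coefficient of .7 or higher is generally considered acceptable.
```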
Internal consistency reliability: Researchers ask multiple questions about the same construct and check whether the subjects answer the questions consistently. The Beck Depression Inventory (see Example 18.2) is a pen-and-paper test used to detect depression. The basic construct is “depression.” The inventory has twenty-one questions, each of which asks about a different aspect of unhappiness. The idea behind internal consistency is that if the questions all measure the same latent construct, then a subject’s answers should all generally agree with each other.
Example 18.2
Beck Depression Inventory
For all questions, please answer with regard to the last two weeks.
- How is your mood?
___ I do not feel sad
___ I feel blue or sad
___ I am blue or sad all the time and I can’t snap out of it
___ I am so sad or unhappy that I can’t stand it
- How pessimistic are you?
___ I am not particularly pessimistic or discouraged about the future
___ I feel discouraged about the future
___ I feel I have nothing to look forward to and I won’t ever get over my troubles
___ I feel that the future is hopeless and that things cannot improve
- Do you feel like a failure?
___ I do not feel like a failure
___ I feel I have failed more than the average person
___ As I look back on my life all I see is a lot of failures
___ I feel I am a complete failure as a person
- Are you satisfied?
___ I do not feel particularly dissatisfied
___ I feel bored most of the time and don’t enjoy things I used to
___ I don’t get satisfaction out of anything anymore
___ I am dissatisfied with everything
- Do you feel guilty?
___ I don’t feel particularly guilty
___ I feel bad or unworthy a lot of the time
___ I feel quite guilty and bad or unworthy practically all the time
___ I feel as though I am very bad or worthless
Checking for internal consistency: Looking at internal consistency is important when the researcher’s dependent variable is a scale developed from a list of questions about the same construct, as in a list of questions that together measure “body image,” or “acceptance of rape myths,” or “depression.” To determine whether the researcher has checked for internal consistency, you check whether the items on a scale vary together. For example, you would logically expect that a person who says, “I am so unhappy that I cannot stand it” (see Example 18.2) is more likely to say, “I feel that the future is hopeless and that things cannot improve” than a person who says either, “I do not feel sad” or, “I am not particularly pessimistic or discouraged about the future.” And, in fact, research studies over decades and from many countries have found that people are consistent in how they respond to this test.
Researchers have two primary ways to report internal consistency. First, the researchers can measure the consistency of the set of items themselves. Internal consistency is most commonly measured by Cronbach’s alpha, with under .5 considered unacceptable reliability and above .7 considered acceptable.
Second, the researcher can use an established scale—such as the Beck Depression Inventory—where other researchers have measured and established reliability. To check, look for whether the researchers in the article you read referred to a study that established a scale or a test’s reliability.
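To illustrate the first approach, Cronbach’s alpha can be computed directly from item-level responses using the standard formula: alpha = (k / (k - 1)) * (1 - (sum of item variances) / (variance of total scores)), where k is the number of items. The sketch below is a rough illustration only; the five-item scale and the responses are invented.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def cronbachs_alpha(responses):
    """Cronbach's alpha for a respondents-by-items matrix of scale answers."""
    k = len(responses[0])  # number of items on the scale
    items = [[row[i] for row in responses] for i in range(k)]
    sum_item_variances = sum(variance(item) for item in items)
    total_scores = [sum(row) for row in responses]
    return (k / (k - 1)) * (1 - sum_item_variances / variance(total_scores))

# Invented answers: six respondents, five items, each scored 0-3.
responses = [
    [0, 1, 0, 1, 0],
    [1, 1, 1, 2, 1],
    [2, 2, 3, 2, 2],
    [3, 3, 2, 3, 3],
    [1, 0, 1, 1, 0],
    [2, 3, 2, 2, 3],
]

print(f"alpha = {cronbachs_alpha(responses):.2f}")
# Above .7 is generally treated as acceptable internal consistency.
```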
Judgment Rules for Reliability
The reader’s check for whether the dependent variable is reliable depends on the exact method used to check reliability. For intercoder reliability, test-retest reliability, and internal consistency tests, look for whether the researchers’ measure of reliability is over .7. As usual, if the researcher does not report this measure, you should assume that the test is not reliable.
Validity: Validity refers to the degree to which the instrument measures what it is supposed to measure. Remember, reliability is defined as being able to measure the same thing in the same way. It is possible to construct a highly reliable measure that is also totally wrong. Let’s say, for example, that you wanted to identify babies’ sex by clothing color, and you developed a measure that coded babies wearing pink as male and babies wearing blue as female. It is highly likely that you could get a reliable measure: coders are likely to be able to reliably and consistently tell which babies are wearing pink and which babies are wearing blue. You would also most likely have a terribly invalid measure, particularly in this culture, where male babies are traditionally dressed in blue and girl babies in pink.
Once again, “valid” means that researchers are studying what they think they are studying. There are two basic ways readers should judge validity: face validity and construct validity. But before discussing these, let’s turn to one measure of validity that is not acceptable, faith validity.
Faith validity: Faith validity is simply blind faith that a measure works. Without empirical evidence, without testing, the researcher claims a test is valid because the researcher believes the test is valid. Faith validity is particularly problematic because the researcher’s faith in the measure can also draw the reader into accepting the researcher’s biases. Just because the researcher labels a scale as “Honesty” does not, in and of itself, mean that the scale can measure honesty. Consider Example 18.3. Do these items really measure how honest a person is? No, not really. All of the questions are really about what situations qualify as rape and what situations don’t. The questions do not test attitudes about honesty, nor can they distinguish when the person taking the test is lying.
Example 18.3
Honesty Subscale for Rape Myth Acceptance Scale | 1 | 2 | 3 | 4 | 5 |
Subscale 1: She asked for it | |||||
1. If a girl is raped while she is drunk, she is at least somewhat responsible for letting things get out of hand. | |||||
2. When girls go to parties wearing slutty clothes, they are asking for trouble. | |||||
Subscale 2: He did not mean to | |||||
1. Rape happens when a guy’s sex drive goes out of control. | |||||
2. It shouldn’t be considered rape if a guy is drunk and didn’t realize what he was doing. | |||||
Subscale 3: It wasn’t rape | |||||
1. If a girl doesn’t physically resist sex—even if protesting verbally—it cannot be considered rape. | |||||
2. If a girl doesn’t say “no,” she can’t claim rape. | |||||
Subscale 4: She lied | |||||
1. A lot of times, girls who say they were raped agreed to have sex and then regretted it. | |||||
2. A lot of times, girls who claim they were raped have emotional problems. | |||||
Scoring: Scores range from 1 (strongly agree) to 5 (strongly disagree). Scores may be totaled for a cumulative score. A higher score indicates a greater rejection of rape myths.
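As a small illustration of the scoring note above, the sketch below totals one hypothetical respondent’s ratings into a cumulative score. The item labels are shortened stand-ins for the eight items in Example 18.3, and the ratings are invented.

```python
# Invented ratings for the eight items in Example 18.3, each answered on the
# 1 (strongly agree) to 5 (strongly disagree) scale described in the scoring note.
responses = {
    "she_asked_for_it_1": 5, "she_asked_for_it_2": 4,
    "he_did_not_mean_to_1": 5, "he_did_not_mean_to_2": 5,
    "it_wasnt_rape_1": 4, "it_wasnt_rape_2": 5,
    "she_lied_1": 3, "she_lied_2": 4,
}

total = sum(responses.values())  # higher totals = greater rejection of rape myths
print(f"cumulative score: {total} "
      f"(possible range {len(responses)} to {5 * len(responses)})")
```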
Labeling the scale “Honesty” does not, on its face, make these questions a valid test of honesty. (The scale is actually a test of rape myth acceptance.)
Readers need, instead, to rely on their personal assessment of a dependent measure’s validity (face validity) or to look for how the researchers tested for construct or empirical validity.
Face validity: Face validity is a straightforward judgment of whether the questions are reasonably able to measure what they are supposed to measure. For example, going back to the Beck Depression Inventory, it seems reasonable that someone who says, “I am so unhappy that I cannot stand it,” and, “I feel that the future is hopeless and that things cannot improve,” is—at least for the moment—depressed, while people who say that they “do not feel sad at all” and that they “are not particularly pessimistic or hopeless about the future” are not likely to be depressed. So, a “best guess” assessment is that these questions are likely to show which subjects are depressed and which subjects are not. The most important word in the previous sentence is “likely.” Face validity means that the test looks like it will work, and will probably work, but there is no real testing to determine if the test actually works.
Construct validity: Researchers have a variety of methods they use to test for construct validity. One, researchers use a panel of “experts” to judge whether the test questions tap into the different aspects of the main construct. Two, the researcher (or reader) considers whether the test includes questions about all aspects of a construct that the theory suggests are important. Anxiety, for example, alters behavior. (Anxious people tend to startle more easily, have more trouble sleeping through the night, and are less able to sit still.) Anxiety also changes people’s judgment. Anxious people are more likely to predict disaster, to catastrophize, and to worry. And, of course, anxiety is an emotional state. A researcher who is testing whether horror movies increase anxiety should use a scale that tests for all of the different aspects of anxiety: judgment, behavior, and emotion.
Empirical validity/concurrent validity: Empirical validity tests how closely scores on a test correspond to some other measure that has already been established. For example, how well does the Beck Depression Inventory test for depression? Would the Beck Depression Inventory and a group of highly skilled psychologists identify the same patients as depressed? The test of empirical validity would be the degree to which the Beck Depression Inventory and the group of psychologists agreed with each other. A valid test would show that the patients the psychologists identified as depressed clustered at the high end of the Beck Depression Inventory scale, while the patients identified as not depressed clustered at the other end. (They do.)
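A rough sketch of that check, using invented data: if the inventory and the psychologists agree, the patients diagnosed as depressed should cluster at the high end of the score distribution. Comparing group means, as below, is only one simple way to express that agreement; the chapter does not prescribe a particular statistic, and the scores and diagnoses here are hypothetical.

```python
# Invented data: each tuple is (inventory score, psychologists' diagnosis of depression).
patients = [
    (31, True), (27, True), (24, True), (29, True),   # judged depressed
    (8, False), (5, False), (11, False), (7, False),  # judged not depressed
]

depressed = [score for score, diagnosed in patients if diagnosed]
not_depressed = [score for score, diagnosed in patients if not diagnosed]

# Concurrent validity is supported when the two groups separate cleanly on the test.
print(f"mean score, judged depressed:     {sum(depressed) / len(depressed):.1f}")
print(f"mean score, judged not depressed: {sum(not_depressed) / len(not_depressed):.1f}")
```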
Predictive validity: Another method of looking at validity is predictive validity: determining whether the test can predict future behavior. A college admissions test, for example, has predictive validity to the extent that its scores actually predict later academic performance.
Judgment Rules for Validity
In thinking about whether to accept the researcher’s assurances about validity, readers have two checks. First, do you personally think that the checks that the researcher used to detect treatment effects are reasonable? If so, why? For example, in one study of altruism, the researcher tracked whether students leaving the experimental room held the door open for a research assistant (who was, for the experiment, pretending to be on crutches). Do you agree that holding a door open for a person using crutches is a helpful act? What about the opposite? Is a person who didn’t hold the door open not helpful?
If you have two groups, one that saw an action film and another that saw a chick flick, and the group that saw the chick flick was far more likely to open doors than the group that saw the action film, would the experiment show an increase in helpfulness in the chick-flick group? Or did it show a decrease in helpfulness in the action-film group? Either way, is it safe to say that helpfulness was affected? If you, the reader, think that opening doors is a valid measure of helpfulness, then, yes, you would have to say that the experiment showed that the genre of film seen affected the subjects’ willingness to offer aid and comfort to the poor research assistant on crutches.
Second, you should look for information about how much the measure used has been tested for validity. Some measures, like the Beck Depression Inventory, have been used to test for depression for decades, and most researchers assume that readers will know and accept the validity of this test. But readers who are learning the field or readers who encounter an unfamiliar test will need to do some extra work to check out the measure’s validity. To illustrate, a search of the terms “Beck Depression Inventory,” “reliability,” and “validity” will turn up scores of articles that measured the reliability and validity of this test. For the less well-known tests, authors will commonly cite previous studies that have validated the measures that they use. As a reader, you should look for these citations and, if you have any concerns about the measure or if the researcher did not report the questions used to develop the measure, go to the original article, and look at how the measure was developed.