Evaluation design and threat to internal validity


Using Experimental Designs for Program Evaluation

Keith G. Diem, Ph.D., Program Leader in Educational Design

Bulletin E284

www.rce.rutgers.edu

Experimental research is a methodical way of comparing two or more groups to determine differences in the effect of different “treatments” received by each group. A treatment (or experimental variable) can be an educational program or activity, life experience, new drug, herbicide, or procedure that is being tested for its “effect” on the dependent variable. The goal is to establish cause-and-effect between an intended treatment and the result. In program evaluation, this is a powerful means to prove that an educational program in which people participated (the treatment) was indeed the basis (cause) for their improved knowledge, skills, attitudes, etc. (the effect).

When using an experimental design, the evaluator purposely manipulates a treatment (independent variable) to see if it causes a change in the dependent variable (effect). For example, a new reading program might be given to one group of students while the old way of teaching reading is used with a different group, to see if the new way yields higher reading scores. Extraneous variables are also controlled by the researcher so they can be ruled out as other possible “causes.” Experimental research is the only type of study where true “cause-and-effect” can be claimed.

When using an experimental design, the investigator starts with a research hypothesis, such as these examples:

• Children who participate in after-school programs are less likely to cause vandalism.
• Participants who complete the pesticide training program will have fewer violations of pollution laws.
• Attendance at nutrition education courses will result in a reduction of blood cholesterol levels of people who attended.
• Adults attending a parent education class are less likely to commit child abuse.

This publication provides an overview of some basic experimental research designs that are useful in program evaluation.

Basic Definitions

Variable

• Characteristics by which people or things can be described.
• Must have more than one level; in other words, a variable must be able to change over time for the same person or object, or to differ from person to person or from object to object.
• Some variables, called attributes, cannot be manipulated by the researcher (e.g., socio-economic status, IQ score, race, gender, etc.).


• Some variables can be manipulated but are not in a particular study. This occurs when subjects self-select the level of the independent variable, or the level is naturally occurring (as with ex post facto research).

Manipulation

• Random assignment of subjects to levels of the independent variable (treatment groups).
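Random assignment can be carried out with any unbiased chance procedure. The following is a minimal Python sketch, not a prescribed method: it assumes a hypothetical roster of participant IDs, two groups named "treatment" and "control", and equal group sizes. The function name `randomly_assign` and the seed are illustrative only.

```python
import random


def randomly_assign(subjects, groups=("treatment", "control"), seed=None):
    """Shuffle the subjects, then deal them round-robin into the groups."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    assignment = {group: [] for group in groups}
    for index, subject in enumerate(shuffled):
        assignment[groups[index % len(groups)]].append(subject)
    return assignment


# Hypothetical roster of 20 participant IDs (P01 ... P20).
participants = [f"P{n:02d}" for n in range(1, 21)]
assignment = randomly_assign(participants, seed=42)

for group, members in assignment.items():
    print(group, len(members), sorted(members))
```

Because chance alone decides who lands in each group, neither the evaluator nor the participants can influence group membership, which is what separates manipulation from self-selection.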

Independent variable

• The treatment, factor, or presumed cause that will produce a change in the dependent variable. This is what the experimenter tries to manipulate; it is the “experimental variable.”
• Denoted as “X” on the horizontal axis of a graph.

Dependent variable

• The presumed effect or consequence resulting from changes in the independent variable.
• This is the observation made and is denoted by “Y” on the vertical axis of a graph. The score of “Y” depends on the score of “X.”

Controlling threats to validity

Threats to the internal and external validity of a study must be controlled to yield findings that are credible. This is accomplished by using a strong design that minimizes the potential threats to achieving valid results.

Internal Validity (Are the data true?)

Selection

If participants were not randomly assigned to treatment groups (as is commonly the situation in educational settings with humans as subjects in “intact groups” like classrooms), then the results might be affected by the possibility that a certain type of audience chose their own treatment group by “self-selection.” For example, participants interested in environmental education may select a course on the topic. They are therefore already a motivated, possibly more knowledgeable group than a comparison group, and they may do better because of their interest rather than because of the course itself.

History

Did events that happened during the treatment result in effects not caused by the treatment itself? For example, did participants gain knowledge about a subject from outside sources, not the course being taught?

Maturation

Did program participants improve their performance, knowledge, etc. as a result of simply gaining life experience? This threat is more likely if the treatment happens over a long period of time or subjects are tested long after the program ended.

Testing

Sometimes participants learn how to “take the test.” Therefore, increases in scores may result less from what they learned during a course, and more from their improved ability to take the test. This is not an issue when there isn’t both a pretest and a posttest in a single group, or when random assignment of participants to more than one group equalizes any testing effects between the groups. Another way to reduce this threat is to use a random selection of test questions for the pretest and posttest instead of using identical tests.
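One way to draw different question sets for the pretest and posttest is sketched below in Python. It is only an illustration: the item bank, the form length, and the choice to make the two forms non-overlapping are assumptions, and the function name `draw_parallel_forms` is hypothetical. The sketch also does nothing to guarantee that the two forms are equally difficult.

```python
import random


def draw_parallel_forms(item_bank, items_per_form, seed=None):
    """Return two non-overlapping random question sets from one item bank."""
    if 2 * items_per_form > len(item_bank):
        raise ValueError("item bank too small for two non-overlapping forms")
    rng = random.Random(seed)
    # Sample without replacement, then split the draw into two forms.
    drawn = rng.sample(list(item_bank), 2 * items_per_form)
    return drawn[:items_per_form], drawn[items_per_form:]


# Hypothetical bank of 40 question IDs (Q1 ... Q40).
item_bank = [f"Q{n}" for n in range(1, 41)]
pretest_form, posttest_form = draw_parallel_forms(item_bank, 15, seed=7)

print("Pretest: ", sorted(pretest_form))
print("Posttest:", sorted(posttest_form))
```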

Instrumentation

A change in calibration, or in the observers or scorers used, may produce a change in measurements. This can be prevented by using a proven, printed test or by randomly assigning observers to groups so they don’t know which group is which (a “blind” or “double-blind” experiment).

Differential Attrition

It is not unusual to lose program participants (also called “experimental mortality”), especially during long-lasting programs or follow-up surveys that take place a long time after the program ended. If such attrition is similar between treatment groups, this is not much of an issue. However, if one group loses many more respondents than another, data are lost which may skew the results.
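A quick check for differential attrition is to compare the share of participants lost from each group. The Python sketch below uses hypothetical enrollment and follow-up counts, and the 10-percentage-point flag is an illustrative cutoff chosen for this example, not a published standard.

```python
def attrition_rate(enrolled, completed):
    """Share of enrolled participants who did not provide final data."""
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    return (enrolled - completed) / enrolled


# Hypothetical counts: (enrolled at start, responded at follow-up).
groups = {"treatment": (40, 34), "comparison": (40, 22)}

rates = {name: attrition_rate(*counts) for name, counts in groups.items()}
gap = abs(rates["treatment"] - rates["comparison"])

FLAG_THRESHOLD = 0.10  # illustrative cutoff only
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} lost")
if gap > FLAG_THRESHOLD:
    print(f"Attrition differs by {gap:.0%} between groups; interpret results with caution.")
```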

Statistical Regression

When a treatment group is selected from an “extreme” population (such as poor performers or gifted & talented students), the resulting performance is likely to end up closer to the mean. In other words, if you are a very low scorer, you only have one way to go (up, toward the average), and vice versa.
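Statistical regression can be demonstrated with a small simulation. The Python sketch below rests on an assumed model that is not part of this publication: each observed score is a stable ability plus independent test-day noise, with arbitrary means and spreads. No treatment is applied at any point, yet the group picked for its low first scores does better on retest.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is reproducible
N = 10_000

# Assumed model: observed score = stable ability + independent test-day noise.
ability = [rng.gauss(100, 10) for _ in range(N)]
first_test = [a + rng.gauss(0, 10) for a in ability]
second_test = [a + rng.gauss(0, 10) for a in ability]

# Select roughly the lowest-scoring 10% on the first test; give no treatment.
cutoff = sorted(first_test)[N // 10]
low_group = [i for i in range(N) if first_test[i] < cutoff]

low_first = mean([first_test[i] for i in low_group])
low_second = mean([second_test[i] for i in low_group])

print(f"Low scorers, first test:  {low_first:.1f}")
print(f"Low scorers, second test: {low_second:.1f}")
print(f"Everyone, second test:    {mean(second_test):.1f}")
```

In a one-group design, an apparent “gain” like this could easily be mistaken for a program effect, which is why a randomly assigned control group drawn from the same extreme population is so valuable.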

External Validity (For whom are the data true?)

In general, findings from a study can be generalized to the population from which subjects for the study are randomly selected. However, a few “reactions” unique to a study (not explained in this publication) can limit the ability to generalize back to the population:

• Testing-treatment interaction
• Selection-treatment interaction
• Reactivity to the experimental setting

Setting up an Experimental Design for Program Evaluation

Experimental designs are depicted with the following symbols:

• X = Treatment (program, activity, intervention, instruction)
• O = Observation (test, measurement, survey)
• R = Random Assignment (of subjects to treatment groups)

A dashed line between groups indicates that the groups are not equal (participants are not randomly assigned to the “treatment groups”).

There are three categories of experimental designs:

• True experiments
• Quasi-experiments
• Pre-experiments

True Experiment

A true experiment requires the random assignment of subjects (such as people, animals, or plants) to a treatment group. Random assignment is the only way that groups can be considered statistically equivalent. Here are two common examples of true experiments:

Pretest-Posttest Control Group Design

R O X O
R O   O

Subjects are randomly assigned to treatment groups, then given a pretest. One group gets a treatment and the other gets none (or a variation of the treatment). Then, each group is given a posttest. Your hypothesis would be that the “gain” score from pretest to posttest would be greater for the treatment group than the control group.

Posttest-Only Control Group Design

R X O
R   O

This design eliminates the need for a pretest. Because subjects are randomly assigned to treatment groups, they are considered statistically equivalent. Therefore, differences between the groups can be attributed to differences in the treatments. However, although a test (the pretest) is eliminated, a baseline measure of each group is sacrificed.
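The gain-score comparison used in the Pretest-Posttest Control Group Design can be sketched in a few lines of Python. All scores below are invented for illustration, and the function name `mean_gain` is hypothetical. The sketch only computes descriptive gains; deciding whether the difference is larger than chance would require a significance test, which is outside this example.

```python
from statistics import mean


def mean_gain(pretest, posttest):
    """Average posttest-minus-pretest gain for matched participants."""
    if len(pretest) != len(posttest):
        raise ValueError("each participant needs both a pretest and a posttest")
    return mean([post - pre for pre, post in zip(pretest, posttest)])


# Hypothetical scores, listed in the same participant order for pre and post.
treatment_pre = [52, 60, 48, 55, 63]
treatment_post = [70, 74, 66, 71, 80]
control_pre = [50, 61, 47, 58, 62]
control_post = [54, 63, 52, 60, 65]

treatment_gain = mean_gain(treatment_pre, treatment_post)
control_gain = mean_gain(control_pre, control_post)
difference = treatment_gain - control_gain

print(f"Treatment gain: {treatment_gain:.1f}")
print(f"Control gain:   {control_gain:.1f}")
print(f"Difference:     {difference:.1f}")
```

For the Posttest-Only Control Group Design, the same idea applies with the posttest means compared directly, since no pretest scores exist to subtract.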


Pre-Experimental Design

One-Group Posttest-Only Design

A pre-experiment is a commonly used evaluation method, typically represented by giving participants end-of-program evaluation surveys. But it is a weak design that potentially suffers from every threat to validity because it has little control over environmental factors that could affect the outcome of a study. Thus, it is sometimes referred to as a “one-shot case study.” The design is essentially “survey research,” which is useful for determining opinions or current status. But such a design can provide some evidence of program impact with the proper measurement instrument (with major limitations in the conclusions that can be drawn) and is appropriate to use when more elaborate designs are not possible. Additional observations/tests/surveys (depicted by O’s) can be added after the initial posttest. This doesn’t eliminate the weaknesses of the design but it does help determine the sustainability of any benefits of a program.

X O

One-Group Pretest-Posttest Design

This design is still weak but, by adding a pretest, it allows the evaluator to determine the status of the program participants before the program began. Therefore, the degree of improvement can be measured.

O X O

For example, a camp director might like to claim that youth attending summer camp gained skills because they attended the weeklong outdoor education program. But, it is possible that the youth already had the skills when they came to camp. As a matter of fact, it is likely their current interest and skills in outdoor education might have been the reason they chose to attend summer camp. Using a pretest provides a baseline with which posttest results can be compared.

Post-then-Pre

X O_pre O_post

A variation of the One-Group Pretest-Posttest Design involves giving both the pretest and the posttest after the program is completed. When is this appropriate and what are the benefits?

• When you are asking participants to rate (i.e., give their opinion of) their perceived knowledge or skill, instead of testing them.
• A rating collected before the program is a problem when participants don’t have a realistic understanding of their knowledge or skills until they learn through an educational program how much there really is to know.
• Teenagers are notorious for overestimating their knowledge or skill levels because they lack the frame of reference that experience and maturity provide.

Static-Group Comparison

Adding a comparison group is another way of improving a one-shot case study. But, the dashed line between the two groups indicates that the groups are not equal (participants were not randomly assigned to the “treatment groups”) and because no pretests were given, there is no way to know how similar the groups were to start (possible “selection” threat) or how many participants dropped out (differential attrition).

X O
- - - - -
  O

Ex Post Facto

Another pre-experimental design is an ex post facto (“after the fact”) study. This relational research method is also used when a true experimental design is not possible, such as when people have self-selected levels of an independent variable or when a treatment is naturally occurring and the researcher could not “control” the degree of its use (represented by the question mark next to the X that depicts the treatment in the model).

X? O
- - - - -
   O

With an ex post facto study, the researcher starts by specifying a dependent variable and then “works backward” to try to identify possible reasons for its occurrence as well as alternative (rival) explanations. The design is stronger than a typical one-group design because such confounding (intervening, contaminating, or extraneous) variables are “controlled” using statistics, such as a regression analysis.

• This type of study is very common and useful when using human subjects in real-world situations and the investigator comes in “after the fact.”
• It might be observed that people from one county eat healthier diets than people from nearby counties. Although an educator might like to claim that attendance at nutrition education programs offered in that county is the reason for the improved health status of local residents, other reasons for the differences would need to be explored. Other potential explanations might be differences in education levels, income, ethnicity, access to supermarkets that sell healthy foods, etc.

A note of caution about relational studies

• In a relational study, “cause-and-effect” cannot be claimed; only that there is a relationship between the variables.
• Variables that are completely unrelated could, in fact, vary together due to nothing more than coincidence.
• The researcher needs to establish a plausible reason (research hypothesis) why there might be a relationship between two variables before conducting a study.
• For example, it might be found that pet owners have higher IQs than people who do not have pets. Even if there is a relationship between pet ownership and IQ scores, it is not likely that buying a dog caused someone’s IQ to improve (unless, perhaps, it was a very intelligent dog that taught its owner some new tricks).

Quasi-Experiment – A Realistic Compromise

In a quasi-experiment, groups of subjects are constructed using a method other than random assignment. That’s because, when using human subjects, it is often impossible to do random assignment. They are often part of intact groups such as school classrooms, community organizations, neighborhoods, 4-H clubs, or nursing homes. Although groups might be reasonably similar in a practical sense, using data from intact groups limits the conclusions that can be drawn regarding program effects. Still, quasi-experiments are useful in providing valuable evidence of program impacts. A quasi-experiment involves giving a “treatment” (such as an educational program) to one group, but not the other. In these examples of quasi-experiments, a pretest is used to provide a baseline for a valid comparison of the effect of the treatment on each of the groups. Having a comparison group and using a pretest and a posttest avoids most validity threats.

Nonequivalent Control Group Design (Pretest-Posttest with Comparison Group)

O X O
- - - - -
O   O

In reality, it is difficult to get humans to agree to be part of a “program” when there is no benefit to participating. It is more likely that one group will get a “special” treatment (such as your new, improved educational program) and the comparison group will get a limited or substitute treatment (such as the existing program, depicted by a “small x” or X₂ in the model). But, to achieve maximum “variance” (a difference in results) that can be measured, it is important to make the “special” treatment as different as possible from the limited version.

O X O
- - - - -
O x O

Designing Tests & Questionnaires for Experimental Research/Evaluation

Tests and questionnaires are typically used to collect data (i.e., measure the “O” depicted in experimental designs). Tests usually measure knowledge or skills whereas questionnaires typically ask about attitudes or self-reports of knowledge, skills, or behaviors. Many commercially-developed and other proven tests are available to reliably measure many dependent variables (such as IQ, reading ability, self-esteem, etc.). When developing your own measurement instrument, it is just as important to design one that is valid and reliable. Problems with the testing instrument, such as confusing questions or arrangement or unclear instructions that result in respondents giving incorrect answers, result in measurement error. This can be avoided by field-testing the instrument with subjects similar to the program participants. For guidance on constructing questionnaires and developing procedures to administer them so they achieve valid and reliable results, consult the RCE fact sheet, “A Step-by-Step Guide to Developing Effective Questionnaires and Survey Procedures for Program Evaluation and Research (FS995).”

Summary and Recommendations

• Program evaluation, and therefore the selection of the appropriate research method and experimental design, needs to be determined during program planning. Deciding which to use after the program is in progress or completed limits the options. Refer to the RCE fact sheet, “Choosing Appropriate Research Methods to Evaluate Educational Programs (FS943)” to see which research method might best serve your program evaluation needs.
• Use the strongest design possible under the circumstances. This increases the validity of the findings and expands the use of the results needed to justify the benefits of the educational program being evaluated.
• Even when you must “settle” for a less powerful design than desired, at least be aware of the design’s shortcomings and consider rival explanations of the findings.

References

Campbell, D. & Stanley, J.C. (1963). Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally Co.

Diem, K. (January 2003). Program development in a political world – it’s all about impact! Journal of Extension [On-line], Volume 41(1). Available at: www.joe.org/joe/2003february/a6.shtml.

Diem, K. (December 2002). Using research methods to evaluate your Extension program. Journal of Extension [On-line], Volume 40(6). Available at: www.joe.org/joe/2002december/a1.shtml.

Diem, K. (1999). Choosing Appropriate Research Methods to Evaluate Educational Programs. Rutgers Cooperative Extension Fact Sheet #FS943. New Brunswick, NJ.

Diem, K. (1997). Measuring Impact of Educational Programs. Rutgers Cooperative Extension fact sheet #869. New Brunswick, NJ: Rutgers University.


Rockwell, S.K. & Kohn, H. (Summer 1989). Post-Then-Pre Evaluation. Journal of Extension [On-line], Volume 27(2). Available at: www.joe.org/joe/1989summer/a5.html.

Spector, P.E. (1981). Research Methods. Sage University Paper series on Quantitative Applications in the Social Sciences. Beverly Hills: Sage Publications.

Rutgers Cooperative Extension Program Evaluation Resources web page

This page provides information and links to other resources that will help you design and evaluate educational programs.


© 2003 by Rutgers Cooperative Extension, New Jersey Agricultural Experiment Station, Rutgers, The State University of New Jersey. This material may be copied for educational purposes only by not-for-profit accredited educational institutions.

Desktop publishing by Rutgers Cooperative Extension/Resource Center Services

Revised: October 2003

RUTGERS COOPERATIVE EXTENSION N.J. AGRICULTURAL EXPERIMENT STATION RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY