Impact evaluation and
measurement issues
Training for the University of Haiti-Tulane Health Monitoring and Evaluation Course, May 2013
Learning Objectives
To become familiar with specific
research designs used to answer impact
and effectiveness questions
To be aware of measurement issues
To explore the full range of evaluation designs
Definition
Impact Evaluation: A type of outcome evaluation
that focuses on the broad, longer-term impact or
results (often health results) of an intervention in a population.
For example, an impact evaluation could show that a decrease in vertical transmission of HIV was the direct result of a program designed to improve testing and referral services for pregnant women, provide high-quality delivery practices, pre- and post-natal treatment, and appropriate counseling and support for infant feeding.
Theory of Impact Analysis
Measurement of impact requires an
evaluation framework
Measurement of effect and attribution require
a theory of impact analysis
Requires a theory about how the program or
intervention works (treatment or program theory)
Impact Analysis draws on the work of
Campbell and Stanley, Experimental and
Quasi-Experimental Designs for Research. Chicago: Rand McNally, 1966
Evaluation framework
Logical model of treatment with
elements that can be observed and
measured (sound working knowledge)
A design to measure the model of
treatment (or effect)
A way to measure efficacy/effectiveness
and coverage of the intervention (and
to judge success) also called adequacy
Evaluation Framework
Combines 3 dimensions discussed in the
course:
The relationship of the intervention to the
problem
The “character” of the intervention and
therefore the strategy for the design
Judgement, to identify criteria for success
Impact analysis
The key to impact analysis is
measuring:
What did happen, attributing it to the
program, as compared to
What would have happened if the program had not existed (the counterfactual)
Impact analysis
Impact analysis is about “cause and effect”
X produces Y1, Y2…Yn
Measured as a regression coefficient, a
difference between two means, or a difference between two proportions, with tests of statistical significance (see the sketch below)
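A minimal Python sketch (hypothetical data and counts; all values are assumptions) of two of these common forms of the impact estimate, each with a significance test:

```python
# Impact as a difference between two means and between two proportions,
# each with a test of statistical significance (hypothetical data)
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
y_t = rng.normal(12, 3, 100)     # outcome in the treated group
y_c = rng.normal(10, 3, 100)     # outcome in the control group

# Difference between two means (two-sample t-test)
t_stat, p_means = stats.ttest_ind(y_t, y_c)

# Difference between two proportions (e.g., 60/100 vs. 45/100 accepting a test)
z_stat, p_props = proportions_ztest(count=[60, 45], nobs=[100, 100])
print(f"means test p = {p_means:.4f}; proportions test p = {p_props:.4f}")
```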
Measuring impact
Problem:
Measurement result:
R (received treatment) – C (control) = E
(Effect)
But E doesn’t tell you if the effect is big enough
to be a “success”
So, compare to planned impact or adequacy
Coverage:
Consider adequacy (the proportion of the problem covered by the program); a sketch follows
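A minimal sketch of the adequacy judgement, with assumed values for R, C, the planned impact, and coverage:

```python
# Judging adequacy: compare the measured effect E = R - C to the planned
# impact, then check coverage (all numbers here are assumptions)
r = 0.72            # outcome among those who received treatment (R)
c = 0.55            # outcome in the control group (C)
e = r - c           # measured effect E

planned_impact = 0.10
print(f"E = {e:.2f}; meets planned impact: {e >= planned_impact}")

# Coverage: proportion of the affected population actually reached
reached, in_need = 3200, 10000
print(f"Coverage: {reached / in_need:.0%}")
```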
Internal and External Validity
Internal validity
Related closely to design
Conclusions regarding what happened to
subjects at that time and in that context are sound conclusions
External validity
Are the results generalizable to other populations, places, and times?
Internal Validity
Goal of Design Strategy for impact
questions is Internal Validity
Internal Validity
Internal validity refers to the extent to
which the design enables you to
determine that the program, rather
than other factors, caused the changes
you have observed.
This is important when answering impact and effectiveness questions
Threats to Internal Validity
There are several threats:
Selection
Instrumentation
History
Contamination
Listing them does not mean they actually exist
You need to consider the plausibility of each threat
Threats to internal validity
Selection
Something other than T accounts for the
outcome: the 2 populations are different from the beginning
Two kinds of selection bias:
P = differences at pretest
Q = all other selection biases:
Example = early adopters, people easier to influence
Threats to Internal Validity
Instrumentation
Is the factor you are exploring
conceptualized correctly?
Does your questionnaire operationalize the concept correctly?
Threats to internal validity
History
Something else beside the T accounts for
outcome
External events
Maturation
Regression
Attrition
Threats to internal validity
Contamination
Intervention not delivered properly
Tips to reduce threats to internal
validity
Determine the comparability of program participants and
the “typical” population
Limit the time period between pretest and posttest and identify
other changes in the community
Carry out the pretest and posttest with the same methodology
Ensure that participants are not “extreme”
Ensure maximum control over validity and reliability of
measurement
Identify any natural changes in the population over time
Identify the effect of participants dropping out
Threats to external validity
Even if program worked in a given
population, how do you know if it would
work in another?
Threats to external validity
Random selection, sociodemographic
diversity, large study
How often do we have a chance to do
that?
Time: will the intervention be effective
over time?
Threats to external validity
Operationalization of:
The intervention (radio vs. face-to-face;
kind of condom)
The selection and measurement of
variables
Importance of Design
Designs attempt to eliminate or reduce
other possible explanations
Design is crucial in evaluations that
want to show that the program caused
the desired result or had an impact.
General Types of Designs for
Answering Impact Questions
Experimental
Quasi-Experimental
Types of Design
Experimental Design
Key elements
Central control of selection of
participants
Experimental Design 1
R: O1E  X  O2E
R: O1C     O2C
R indicates Random assignment
O is the Observation or measurement
E is the experimental group, C is the control group
O1 = pretest; O2 = posttest
X is the Program (or treatment)
(The diagram layout resembles a tic-tac-toe grid)
Experimental Design 2
R: X  O2T
R:    O2C
What’s the difference between Designs 1 and 2?
What don’t you get without the pretest?
What do you gain?
Experimental Design 1
R: X1E  T  Y2E
R: X1C     Y2C
R indicates Random assignment
E is the experimental group, C is the control group
X = pretest; Y = posttest on outcomes of interest
T is the treatment
Experimental Design 2
R: T  Y2T
R:    Y2C
Experimental Design
Ȳ2T - Ȳ2C (Ȳ = mean) is tested for significance, to see whether the two groups could have come from the same population or from different ones; you have to make assumptions about the randomization (happy or unhappy)
If a pretest X exists, it can be included as a control variable (see the sketch below)
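A minimal sketch (hypothetical data; variable names are assumptions) of testing Ȳ2T - Ȳ2C, and of adding a pretest X as a control variable:

```python
# Posttest-only comparison of means, then a regression that includes the
# pretest X as a covariate to sharpen the treatment-effect estimate
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)                  # pretest scores (X)
t = np.repeat([1.0, 0.0], 100)               # 1 = experimental, 0 = control
y = 0.8 * x + 5 * t + rng.normal(0, 8, 200)  # posttest (Y), true effect = 5

# Difference between the two posttest means
t_stat, p = stats.ttest_ind(y[t == 1], y[t == 0])

# With a pretest, the coefficient on t estimates the effect more precisely
fit = sm.OLS(y, sm.add_constant(np.column_stack([t, x]))).fit()
print(p, fit.params[1])
```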
Sensitivity
How small can the difference between Ȳ2T and Ȳ2C be and still demonstrate impact of the program?
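A minimal power sketch of sensitivity, using an assumed sample size, alpha, and power, to find the smallest standardized difference the design could detect:

```python
# Minimum detectable effect size for a two-group comparison
# (sample size, alpha, and power here are assumptions)
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the standardized effect size detectable with n = 100 per group
mde = analysis.solve_power(nobs1=100, alpha=0.05, power=0.80, ratio=1.0)
print(f"Minimum detectable effect (Cohen's d): {mde:.2f}")
```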
Types of Design
Quasi-Experiment
“Quasi” means no random assignment
Key elements:
Comparison (with and without the
Program)
Quasi-Experimental Design
A/C: X1E  T  Y2E   Program Group
A/C: X1C     Y2C   Control Group
Groups
Matched pairs
Non-equivalent comparison groups
A = autonomous; C = controlled
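One common way to analyze this pretest-posttest design with a non-equivalent comparison group is a difference-in-differences contrast; a minimal sketch with assumed group means:

```python
# Difference-in-differences for the quasi-experimental design above
# (all means are hypothetical, assumed values)

x_e, y_e = 42.0, 55.0   # program group: pretest (X1E) and posttest (Y2E) means
x_c, y_c = 40.0, 46.0   # control group: pretest (X1C) and posttest (Y2C) means

# Nets out pre-existing group differences and shared over-time trends
did = (y_e - x_e) - (y_c - x_c)
print(f"Estimated program effect: {did:.1f}")
```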
Types of Design
Quasi-Experimental Designs (cont’d)
Use when you cannot control the
process for deciding who gets the
treatment
Weak because there may be selection
bias and other biases
But this is often more practical in public health settings
Quasi-Experimental Design:
Interrupted Time Series
Key elements: many measures before
and after the “treatment”
A/C: Y1 Y2 Y3  T  Y4 Y5 Y6
(Some suggest that you should have at
least 10 measures before and after the
treatment (T))
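A minimal segmented-regression sketch for an interrupted time series (hypothetical monthly data; all variable names are assumptions):

```python
# Segmented regression for an interrupted time series: estimate the level
# change at the interruption and any change in trend afterward
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_pre, n_post = 12, 12
time = np.arange(n_pre + n_post)
post = (time >= n_pre).astype(float)            # 1 after the treatment T
time_since = np.where(post == 1, time - n_pre + 1, 0)

# Simulated outcome: baseline trend plus a level drop of 6 at the interruption
y = 50 + 0.3 * time - 6 * post + rng.normal(0, 2, n_pre + n_post)

X = sm.add_constant(np.column_stack([time, post, time_since]))
fit = sm.OLS(y, X).fit()
# Coefficient on `post` = immediate level change;
# coefficient on `time_since` = change in trend
print(fit.params)
```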
Quasi-Experimental Design:
Comparative Time Series
Key elements: many measures before
and after the “treatment”
A/C: Y1E Y2E Y3E  T  Y4E Y5E Y6E   Program Group
A/C: Y1C Y2C Y3C     Y4C Y5C Y6C   Control Group
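A minimal sketch extending the segmented regression above with a control series; the coefficient on the post*group interaction estimates the program effect net of shocks shared by both series (hypothetical data):

```python
# Comparative (controlled) time series: stack the program and control series
# and let the post*group interaction carry the program effect
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 12                                        # measures per segment
time = np.tile(np.arange(2 * n), 2)           # two stacked series
group = np.repeat([1.0, 0.0], 2 * n)          # 1 = program (E), 0 = control (C)
post = (time >= n).astype(float)

# Both series share a shock of +2 after T; the program series adds a
# true effect of 5 on top of it
y = 50 + 0.2 * time + 2 * post + 5 * post * group + rng.normal(0, 2, 4 * n)

X = sm.add_constant(np.column_stack([time, group, post, post * group]))
fit = sm.OLS(y, X).fit()
print(fit.params[-1])   # coefficient on post*group, roughly the effect of 5
```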
Types of Design
Ex post facto and Non-Experimental
Designs
Key elements:
No random assignment
Maybe no before-program measures
Ex post facto and
Non-experimental Design
Ex post facto: no true sampling plan
Before and After Design: Y1 T Y2
One Shot: T Y2
Ex post facto and
Non-experimental Design
Very common!
No evaluator control of selection or exposure
to treatment
Threat of “spuriousness”
Threat of self-selection and “volunteerism”
Can “control” by selecting a criterion population,
with some of the same characteristics as the volunteers.
Non Experimental Designs:
Pre and posttest
Provides a measure of change, with preliminary
evidence, when supported by strong process
evaluation data, but no strong conclusive results.
Uses:
To conduct a pilot test
To demonstrate impact of short term intervention
The period between O1 and O2 should be as short as possible
Maximum control over validity and reliability of measurement and data collection methods
This design is susceptible to almost all the threats to internal validity
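A minimal sketch of the usual paired analysis for this single-group design (hypothetical data); the significance test cannot, by itself, rule out the threats listed above:

```python
# Paired test of change for a one-group pretest-posttest (O1 T O2) design
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
o1 = rng.normal(60, 10, 50)            # pretest scores
o2 = o1 + 4 + rng.normal(0, 5, 50)     # posttest scores, mean change of 4

t_stat, p = stats.ttest_rel(o2, o1)
print(f"Mean change: {np.mean(o2 - o1):.1f}, p = {p:.3f}")
```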
Learning from
non-experiments
Analyses:
Regression analysis
Econometric techniques
Propensity scoring
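A minimal propensity-scoring sketch (hypothetical data and covariates): model the probability of receiving treatment from observed characteristics, then weight the comparison by the inverse of that probability:

```python
# Propensity scoring for a non-experiment: adjust for self-selection into
# treatment using inverse-probability-of-treatment weights
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 500
age = rng.normal(30, 8, n)
urban = rng.binomial(1, 0.5, n)
# Treatment uptake depends on covariates (self-selection)
p_treat = 1 / (1 + np.exp(-(0.05 * (age - 30) + 0.8 * urban - 0.4)))
t = rng.binomial(1, p_treat)
y = 2.0 * t + 0.1 * age + urban + rng.normal(0, 1, n)   # true effect = 2

X = np.column_stack([age, urban])
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

w = np.where(t == 1, 1 / ps, 1 / (1 - ps))
effect = (np.average(y[t == 1], weights=w[t == 1])
          - np.average(y[t == 0], weights=w[t == 0]))
print(f"Weighted treatment-effect estimate: {effect:.2f}")
```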
Types of Evaluation Design
Is randomized assignment used?
  YES: Randomized or True Experiment
  NO: Is there a control group or multiple measures?
    YES: Quasi Experiment
    NO: Non Experiment
Considerations in Choosing an
Evaluation design
What is the strength of evidence required to address
the purpose of the evaluation?
Are there any ethical or legal considerations?
What is the amount of resources available?
Has the intervention been introduced already?
What is the time frame required?
Discussion:
Applying Design to Case study
Given what you know about programs to
interrupt vertical transmission,
what type of design could be used to:
1) Determine the impact of counseling on mothers?
2) Determine the impact of the program on vertical transmission?
3) Determine whether Ministry staff are
satisfied with the performance of clinicians who participated in the training?
Measurement Strategy
What do you want to know?
How will you know it?
Developing Measurement
Strategy
Conceptual definition
Of Key terms/concepts:
Training, counseling, attitudes
Boundaries:
In 9 district hospitals from 2005-2006
Operational definition
How will each variable be measured?
Indicators/Monitoring
Monitoring program performance at
repeated intervals to track progress
requires the use of carefully identified
and defined indicators so that
meaningful comparisons can be made.
The definition and measurement issues
we discuss here are common to both
monitoring and evaluation.
Definitions
An indicator is a word or phrase which
“indicates” the level or extent of some phenomenon of interest
(Example: % HIV+ mothers receiving nevirapine)
A measure is the operational definition of how
data are collected to assign a value to an indicator
(Example: % of pregnant antenatal attendees who accept an HIV test, test positive, are counseled, and who receive nevirapine)
Defining Your Terms
It means translating vague words into
specific meanings.
Defining your terms means obtaining
agreement from the stakeholders about
the question, the definitions, and the measures
Defining Your Terms
Sometimes it is difficult to assign a
number or to actually measure what
you want to measure.
For example, you may not really be able
to measure the quality of a program.
Instead, you may have to be content
with measuring whether people think it
is a quality program.
Example: Training clinical staff
Clinician attitudes:
Measured by using a survey that asks clinicians
about their attitudes
Quality of care:
Measured by having observers rate specific
components of performance
Effectiveness of the training system:
Measured by the number of participants trained
Measured by meeting set targets for % HIV+ mothers receiving nevirapine
Case: Measures
1. Did clinician attitudes change after the
training?
a. Indicator: attitudes
b. Measure: responses to a series of
attitude questions about the kind of women who are HIV+ (0-4 scale)
2. Did patients counseled intend to test?
a. Indicator: % agreeing to test
b. Measure: # of eligible mothers tested / total # of eligible mothers
Some commonly used
measures
Frequencies, percents, proportions
Means, Medians, Modes
Cost, in currency
Percent change over time or between groups
Rates, Ratios
Treatment effect: the coefficient βT (the treatment-control difference)
Yi = α + βXi + ui is the same as Yi = α + βT Ti + ui, where Ti = treatment (0, 1)
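A minimal sketch (hypothetical data) showing that βT from this regression equals the difference between the two group means:

```python
# The coefficient on a 0/1 treatment indicator is exactly the
# treatment-control difference in means
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
t = np.repeat([0.0, 1.0], 150)
y = 10 + 3 * t + rng.normal(0, 2, 300)      # true treatment effect = 3

beta_t = sm.OLS(y, sm.add_constant(t)).fit().params[1]
diff_means = y[t == 1].mean() - y[t == 0].mean()
print(beta_t, diff_means)                   # identical up to floating point
```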
Key issues about measures
Are they relevant?
Are they valid?
Are they reliable?
Are they precise?
The problem of behavior
What is a behavior?
Are behaviors consistent?
Data Source Issue
What are the best sources of data?
Validity
Reliability
Do the data already exist?
Are they reliable?
Discussion: Where Can we
Find Data?
Monitoring (M): Number of training seminars
held
(M) Number of clinicians who completed
training
(E) Attitudes of clinicians
(E) Quality of care
(E) Quality of teaching material
(M) Participation of clinicians
Data Source Lessons
Which ones might be easier to obtain?
Which ones might be very difficult to
obtain?
How accurate and reliable are each of
the data sources?
How valid are existing data?
Case discussion
Goal: Capability of health educators is
upgraded
How do they define capability?
Evaluation Grid
One tool that some find useful is the
evaluation grid
This tool helps you see how you intend to
answer each question
For each question, you will need to identify
the information needed, sources of that
information, and how you will collect the data
Evaluation Grid
Grid rows (evaluation criteria): Relevance, Effectiveness, Efficiency, Impact, Sustainability, Others
Grid columns: Evaluation Questions | Basis for Judgement | Data Needed | Data Sources | Data Collection Methods
Exercise: Evaluation Grid
For PMTCT:
Identify two evaluation questions
What data/measures would best answer your
questions?
What are likely sources of information?
Complete the Data Needed and Data Source columns
Bibliography
Mohr, L.B. Impact Analysis for Program Evaluation. Thousand Oaks: Sage, 1995.
Habicht, J.P., C.G. Victora and J.P. Vaughan. Evaluation designs for adequacy, plausibility and probability of public health programme performance and impact. Int J Epidemiol 1999; 28(1): 10-18.