D. Lee Bawden
Freya L. Sonenstein
The Urban Institute
Quasi-experimental designs are sometimes necessary when social programs are evaluated. Under certain conditions they are preferable to experimental designs with random assignment to treatment and control groups. This article examines the situations in which experimental evaluation designs may be inappropriate. When faced with these limiting conditions, analysts can choose a number of approaches which will strengthen the rigor of their quasi-experimental designs.
It is often asserted that the classical evaluation design, using random assignment of individuals or families to treatment or control groups, is superior to quasi-experimental evaluation designs. This is certainly true for some demonstration programs or interventions, but in the human service arena, programs suited to a classic experiment may be the exception rather than the rule.
We will briefly review some of the conditions under which we think quasi-experimental designs using comparison groups are superior. We will then describe how these designs can be more rigorously executed. We conclude with some comments on the need both to prioritize opportunities for child welfare evaluations and to permit flexibility in the law regarding their evaluation designs.
Random Assignment: Not A Panacea
There are program demonstrations where the random assignment of clients is simply not appropriate. One example is a systemic intervention to improve the coordination of services to children, such as the Casey child welfare reform initiative, which simultaneously attempts to reform several linked services within a locality or state (Center for the Study of Social Policy, 1987). While the impacts being measured may be the effects on children, it is the system that is being tested, and it is unlikely that the system could operate in one way for treatment children and another way for control children. Some of the pilot programs proposed in the various bills pending before Congress to reform child welfare programs involve coordination of services. Quasi-experimental designs may be the only option for these types of interventions.
Random assignment is also unsuitable when it is barred by legislation or departmental policy. In developing design options for the evaluation of transition benefits in the Family Support Act, we found that random assignment using controls who did not receive these benefits was not acceptable because transition benefits were defined by law as entitlements. Moreover, we learned that the Health Care Financing Administration considered experimental designs that involved withholding health benefits from populations to be unethical, even when the control group would receive the existing array of services. Finally, in the fall of 1989, there was rancorous debate in a number of state legislatures, including those of Texas and Florida, about experimenting with people and effectively denying services to them. In Texas, the infamous syphilis experiments were cited. In this charged environment, the Department of Health and Human Services had little enthusiasm for the classic experimental design (Sonenstein, Ku, Adams, & Orloff, 1990).
Having cited these obvious examples of situations in which random assignment with no-treatment controls cannot be used, let us consider other situations in which the classic experimental design may not accomplish the objectives of a program evaluation. We are going to assume that an experiment is well-managed and that contamination of the control group will not be an issue. However, we note that long follow-up periods--two to three years or more--increase the risk of contamination as well as the risk of subtle, and not so subtle, changes in the intervention. When these occur, the analyst must fall back on techniques developed for quasi-experimental designs.
It is now commonly accepted that the best environment in which to test a demonstration is within the agency that would administer the program if it were adopted on an ongoing basis. We are going to argue that the basic policy question to be answered by an evaluation is whether a program achieves desired outcomes under conditions approximating real life. Unfortunately, while the use of random assignment, a critical element of an experimental design, may allow a determination of whether the desired outcomes occur, it may create "laboratory conditions" that make evaluation results non-generalizable. At the end of the experiment, we may know that the intervention worked for its participants, but not whether the same results can be expected when the program is implemented more widely. Some examples of how experimental evaluations may not lead to generalizable results follow.
One potential threat to generalizability is program cooperation. Since child welfare programs are primarily operated or supervised at the state or local level of government, the federal government must get states and localities to agree to random assignment for an experimental program evaluation. In the past, this has been a major obstacle because program operators do not like the idea of random assignment. They care about the people they serve and, since program resources are always limited, they believe that the neediest of their clientele should get these resources. If the control group has to remain "pure" over a long period, program operators can be particularly resistant to denying services to clients. In the national evaluation of Job Training Partnership Act (JTPA) programs, Hotz (1990) reports that the goal was to select 20 nationally-representative sites that would conduct random assignment of individuals. Sixteen sites agreed to participate in the study, while 228 refused. It seems obvious that the results of the evaluation in these 16 sites cannot be generalized nationwide.
Another possible threat to generalizability is that, by definition, the experimental program may operate on a smaller scale than if it were offered to the entire eligible clientele. This can be a misleading test if the scale of operations is an important factor. A critical question about family preservation programs, should they prove effective in small-scale experiments, is whether positive results could be attained if the services were universally available. If a program draws on the resources of other programs not under its control, a small-scale demonstration is likely to produce different results than a full-scale program that puts heavy demands on those other service systems. For example, the new Job Opportunities and Basic Skills (JOBS) program under the Family Support Act relies on the existing JTPA and Adult Basic Education programs for much of its remedial education and training. If the capacity of these existing programs is limited, the results of a small-scale demonstration may be quite different from those of a full-scale program. We can envision a comparable situation if child welfare demonstrations rely heavily on referrals to services beyond their control, such as scarce drug abuse treatment and prevention programs, or child care and parenting education programs administered and funded by other agencies.
When the scale of an experimental program is limited, another issue arises: the cost of an ongoing program may be difficult to estimate. This is because those among the eligible population who apply to a new program may be quite different from those who apply to a program that is widely known. Programs get reputations. If a program is widely known to be "good," that is, it helps those who participate in it, then a much higher proportion of the eligible population will want to participate than if the program has a bad reputation.
Finally, a small-scale demonstration will not help to predict the potential effects of a larger program on the behavior of others in the community, even those who do not participate in the program. For example, if a large-scale investment in drug treatment were made in a community where using drugs was the norm, and the program was successful, drug use might no longer be the norm in that community. Consequently, drug use among those who did not participate in the program might be significantly reduced. Such macro effects would be unlikely to be captured in a demonstration in which a sizeable fraction of the drug users were denied intensive treatment due to random assignment.
In summary, then, random assignment of individuals or families to treatment and control groups has limited applicability when the following conditions are in force:
• random assignment to a no-treatment group has been prohibited;
• sites or programs cannot be forced or enticed to participate readily in the experiment;
• large-scale implementation of the program is expected to have different effects than a small-scale demonstration;
• the evaluation includes generating an estimate of total program costs; or
• macro-effects of the program on the community are expected.
In addition to these conditions, it goes without saying that random assignment should not be attempted with weak management controls over the assignment process--an apparently serious problem in family preservation programs. (See Schuerman, Rzepnicki, Littell, & Budde, this volume.) If any of these situations apply, alternative designs should be considered. The nature of the intervention and its likely outcomes, as well as the objectives of the evaluation, should dictate the choice of evaluation design.
Strengthening Quasi-Experimental Designs
Quasi-experimental evaluation designs use comparison groups rather than randomly-assigned control groups as the baseline or counterfactual against which to measure net program impacts. The three most common sources for such comparison groups are: (1) eligible non-participants in the same community, (2) individuals similar to the participants from an existing database that contains the outcome measures of interest, and (3) individuals in a matched comparison site who would have been eligible for the program if it were in that site. Evaluations using these kinds of comparison groups can effectively test for the effects of program participation on outcomes under certain conditions. New analytical approaches can be used to control statistically for the effects of potential differences between treatment and comparison groups. These approaches require longitudinal data on the outcomes of interest and their potential determinants for a period prior to program implementation. If these data are available, models of pre-program differences between the two groups can be estimated and tested, and the results can be used to interpret post-program differences between the treatment and comparison group populations (Heckman, Hotz, & Dabos, 1987). We will illustrate this more fully with each type of comparison group.
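As a concrete illustration of this logic, the sketch below simulates pre-program data and checks whether a regression control "explains away" the raw gap between future participants and eligible non-participants. The data, variable names, and magnitudes are invented for illustration; they are not drawn from any of the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # hypothetical sample size per group

# Simulated PRE-PROGRAM data. Future participants (group = 1) differ from
# eligible non-participants only through an observed covariate, so a
# regression that controls for it should drive the group gap toward zero.
group = np.r_[np.ones(n), np.zeros(n)]
covariate = rng.normal(0.5 * group, 1.0)          # participants score higher
outcome_pre = 2.0 + 1.5 * covariate + rng.normal(0.0, 1.0, 2 * n)

# Raw pre-program difference between the two groups.
raw_gap = outcome_pre[group == 1].mean() - outcome_pre[group == 0].mean()

# Least-squares fit: outcome ~ intercept + group + covariate.
X = np.column_stack([np.ones(2 * n), group, covariate])
coef, *_ = np.linalg.lstsq(X, outcome_pre, rcond=None)
adjusted_gap = coef[1]

print(f"raw pre-program gap:  {raw_gap:.2f}")
print(f"gap after controls:   {adjusted_gap:.2f}")
```

If the adjusted gap is near zero in the pre-program period, the same controls lend credibility to post-program comparisons; if a sizeable gap remains, selection on unobserved factors is still a concern.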
Only rarely is it possible to use eligible non-participants in the same community as a comparison group. To do so, it must be possible not only to control for selection bias, but also to test whether the controls are adequate. The test requires data on the outcome measures of interest prior to the program, which are not usually available. One study where the use of eligible non-participants as a comparison group appears to have been successfully carried out--that is, the tests showed that selection bias was adequately controlled for--was the evaluation of the Massachusetts ET Choices program (Nightingale, Wissoker, Burbridge, Bawden, & Jeffries, 1991). This approach may only be successful where selection bias arises from self-selection rather than selection by program operators or some third party. But, since child welfare services are rarely self-selected, this type of comparison group design does not appear useful.
Some evaluations of the Comprehensive Employment and Training Act (CETA) program in the late 1970s used similar individuals drawn from the Current Population Survey (CPS) as a comparison group. In effect, a synthetic control group was developed from a secondary data source. The validity of this approach was later examined by drawing such a comparison group for evaluations that also had a randomly assigned control group. These studies cast considerable doubt on the approach (see LaLonde & Maynard, 1987). Unless the previously-described test for selection bias could be conducted with pre-program longitudinal data, and unless one also knew about any participation in similar programs by the comparison group drawn from the secondary data source, the use of synthetic comparison groups offers little hope. The only secondary data set which might provide a potential synthetic control group for a child welfare evaluation is the National Longitudinal Survey of Youth's Child Supplement. However, it currently does not collect information about participation in child welfare services.
The third source of comparison groups--those in a similar community who would be eligible if the program were offered there--appears to offer the most promise when it is either infeasible or unwise to conduct random assignment in the test site. We will, therefore, consider the strengths and weaknesses of this approach for evaluations of child welfare programs.
Comparison sites should be as similar as possible to treatment sites in characteristics thought to influence the outcome of the program being tested. This not only suggests that the eligible population should be similar, but that such factors as the socio-economic, cultural, and institutional environments should also be similar.
Typically, the demonstration site is selected first and then a search is made for its comparison site. This need not be the case, however. If it is a multi-site demonstration, sites can be matched first and then randomly selected as demonstration or comparison sites. This procedure was used by the state of Washington in an evaluation of its Family Independence Program (FIP). Welfare offices were stratified by rural and urban and, because the economies are so different, by eastern and western Washington. Within these four strata, welfare offices were matched according to seven area characteristics, including the size and composition of the caseload, the rate of out-of-wedlock births, local area employment, earnings of single parent welfare cases, average earnings of workers in retail and service industries, and placements of Aid to Families with Dependent Children (AFDC) recipients by local Job Service Centers. The offices were then purposely assigned to category A or category B so that the aggregate characteristics across all offices in each group were also similar. A random selection was then made to determine which category--A or B--would be the demonstration sites and which would be the comparison sites.
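In miniature, this matching-then-randomizing procedure might look like the sketch below. The office names, strata, and caseload figures are hypothetical, and a single matching characteristic stands in for the seven area characteristics actually used in the FIP evaluation.

```python
import random

# Hypothetical welfare offices: (name, stratum, caseload size). The strata
# mirror Washington's urban/rural by eastern/western split.
offices = [
    ("A1", "urban-west", 1200), ("A2", "urban-west", 1150),
    ("B1", "urban-east",  800), ("B2", "urban-east",  780),
    ("C1", "rural-west",  300), ("C2", "rural-west",  310),
    ("D1", "rural-east",  150), ("D2", "rural-east",  160),
]

# Group offices by stratum, then match within strata by caseload size.
by_stratum = {}
for name, stratum, caseload in offices:
    by_stratum.setdefault(stratum, []).append((caseload, name))

pairs = []
for members in by_stratum.values():
    members.sort()                      # adjacent offices are closest matches
    for i in range(0, len(members) - 1, 2):
        pairs.append((members[i][1], members[i + 1][1]))

# Split each matched pair across categories A and B so the categories have
# similar aggregate characteristics, then flip one coin to decide which
# whole category becomes the demonstration group.
category_a = [pair[0] for pair in pairs]
category_b = [pair[1] for pair in pairs]
if random.Random(0).random() < 0.5:
    demo, comparison = category_a, category_b
else:
    demo, comparison = category_b, category_a

print("demonstration sites:", sorted(demo))
print("comparison sites:   ", sorted(comparison))
```

Because the coin flip happens only after the two categories are balanced, neither sites nor evaluators can steer favorable offices into the demonstration group.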
Such a design can be further strengthened by gathering data on the outcomes relevant to the demonstration prior to its implementation--in both the experimental and comparison sites. For example, in the evaluation of Washington's Family Independence Program (FIP), the effects of the AFDC program will be estimated in treatment and comparison sites for the period before FIP was implemented, to see whether there were pre-program differences between the two sets of sites. Tests will then be conducted to determine whether any differences can be explained by controlling for caseload and site characteristics. If the differences can be reduced to zero by these explanatory variables, one would have considerable confidence that the same model could be used to estimate accurately the net impacts of FIP. On the other hand, if the pre-FIP differences cannot all be explained away, the portion that cannot be explained can be used as an adjustment factor in estimating the net impacts of FIP.
When the unexplained differences are large relative to the expected impacts of the demonstration, the validity of the estimated net impacts is questionable. If one does not know why there were pre-demonstration differences, then one does not know the effects of unobserved explanatory variables in the two sets of sites during the life of the demonstration. However, when the unexplained differences are relatively small, one can be fairly confident that the comparison sites serve as a valid counterfactual for estimating the net impacts of the demonstration.
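The adjustment logic reduces to a simple difference-in-differences calculation, sketched below with invented numbers; the "explained" portion of the pre-period gap would in practice come from the estimated model, not be assumed.

```python
# Hypothetical site-level averages of an outcome of interest (say, months
# of welfare receipt); all numbers are invented for illustration.
pre_treatment   = 10.0   # treatment sites, before the demonstration
pre_comparison  =  9.2   # comparison sites, before the demonstration
post_treatment  =  8.0   # treatment sites, during the demonstration
post_comparison =  9.0   # comparison sites, during the demonstration

# Suppose the pre-period model explains 0.5 of the 0.8 raw pre-period gap;
# the unexplained 0.3 becomes the adjustment factor.
raw_pre_gap       = pre_treatment - pre_comparison
explained_pre_gap = 0.5            # would come from the estimated model
adjustment        = raw_pre_gap - explained_pre_gap

# Net impact: the post-period gap, corrected for the unexplained
# pre-existing difference between the two sets of sites.
raw_post_gap = post_treatment - post_comparison
net_impact   = raw_post_gap - adjustment

print(f"unexplained pre-period gap (adjustment): {adjustment:.1f}")
print(f"estimated net impact: {net_impact:.1f}")
```

Note that the smaller the unexplained pre-period gap, the less the estimate depends on this correction, which is why small unexplained differences inspire more confidence.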
A similar comparison site approach might be used to evaluate child welfare demonstrations not suited to random assignment. A few cautions should be noted, however. Program cooperation may still be an issue in this design, even without random assignment. The likelihood of sites choosing not to participate should be fully assessed prior to selecting this approach, since generalizability could be threatened. Contamination of the comparison sites is also a real possibility. In the recent Wisconsin child support demonstration, interventions that were to be tested in treatment sites were adopted early by comparison sites (Klawitter & Garfinkel, 1990). Furthermore, although only aggregate statistics across time periods are required, the availability of uniform historical program data on outcomes of interest from child welfare programs in the sites may also be a problem.
In summary, a quasi-experimental design using comparison sites is sometimes necessary because random assignment of individuals is not possible, and it is sometimes superior to random assignment of individuals for the reasons pointed out earlier. Comparison sites can serve as valid controls for treatment sites in evaluating the net impacts of demonstrations. However, if a comparison-site design is used to evaluate the net impacts of a demonstration, the evaluation will have more validity if:
• pre-demonstration differences between sites in the outcomes of interest are estimated; and
• in both the pre-demonstration estimates of site differences and the estimates of the net impacts of the demonstration, all relevant data on the characteristics of the affected population, the socio-economic characteristics of the sites, and the characteristics of the programs in each site are fully utilized.
Legislating Evaluation Designs: A Caution
Pending child welfare reform bills include several demonstration initiatives, including coordination of services, family reunification, prevention, permanency planning, and substance abuse projects. Given the number of initiatives being proposed simultaneously, it will be important to think about priorities for evaluation. Policy makers should consider and articulate what kinds of information they need most. Presumably these priorities should guide the choice of evaluation initiatives. Three years after the Family Support Act, it is clear that not all the evaluations envisioned by the framers of this legislation will be pursued with equal vigor.
Our remarks have been intended to show that both experimental and comparison group designs have strengths and weaknesses. In developing the child welfare legislation, we recommend against a stipulation that evaluations should use experimental and control groups assigned randomly. The Family Support Act contained this kind of language. Ultimately, we think the best evaluation strategies have to be carefully tailored to the nature of the intervention, the scope of the program envisioned, the types of questions that the evaluation should answer, and the capacity to measure the relevant outcomes and predictors. While legislation can provide explicit guidance about interventions to test, their scope, and even the kinds of questions which should drive evaluation efforts, it is unlikely that the best research design can be chosen a priori.
References
Center for the Study of Social Policy. (1987). The framework for child welfare reform. Washington, DC: Author.
Heckman, J.J., Hotz, V.J., & Dabos, M. (1987). Do we need experimental data to evaluate the impact of manpower training on earnings? Evaluation Review, 11, 395-427.
Hotz, V.J. (1990). Recent experience in designing evaluations of social programs: The case of the national JTPA study. Madison: University of Wisconsin, Institute for Research on Poverty.
Klawitter, M.M., & Garfinkel, I. (1990). Child support, routine income withholding and post divorce income for divorced mothers and their children. Paper presented at the annual meetings of the Association for Public Policy Analysis and Management, San Francisco.
LaLonde, R., & Maynard, R. (1987). How precise are evaluations of employment and training programs: Evidence from a field experiment. Evaluation Review, 11, 428-451.
Nightingale, D.S., Wissoker, D.A., Burbridge, L.C., Bawden, D.L., & Jeffries, N. (1991). Evaluation of the Massachusetts Employment and Training (ET) Program. Washington, DC: The Urban Institute Press.
Schuerman, J.R., Rzepnicki, T.L., Littell, J.H., & Budde, S. (1992). Implementation issues. Children and Youth Services Review, 14, 191-203.
Sonenstein, F.L., Ku, L., Adams, K., & Orloff, T. (1990). Potential research strategies for evaluating the effects of transitional Medicaid and child care benefits. Final report to Assistant Secretary of Planning and Evaluation. Lexington, MA: SysteMetrics/McGraw-Hill.