Data and Design

data(lsoSynth)

The data for this example were randomly simulated using the synthpop package in R based on data originally collected by Lindo, Sanders, and Oreopoulos (2010; hereafter LSO). (The original real data can be found here; code to simulate the fake data can be found here.)

At a major public university, students with a cumulative grade point average (GPA) below a pre-specified cutoff are placed on academic probation (AP). Students on AP are given extra support, and threatened with suspension if their GPAs do not improve. See LSO for details. Following LSO, we will consider only students at the end of their first year at the university. The university consisted of three campuses, and the AP cutoff varied by campus. To simplify matters, we centered each student's first-year GPA at their campus's cutoff, creating a new variable $R_i \equiv GPA_i - c_{campus[i]}$, where $GPA_i$ is student $i$'s first-year GPA, and $c_{campus[i]}$ is the AP cutoff for student $i$'s campus. A student $i$ is placed on AP if $R_i < 0$ and avoids AP if $R_i \ge 0$.
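The centering step can be sketched as follows. The lsoSynth data already include R, so this is purely illustrative; the raw column names (GPA, campus) and the cutoff values are hypothetical:

```r
## Hypothetical raw data: one row per student, with first-year GPA and campus
raw <- data.frame(GPA = c(1.4, 1.7, 2.1), campus = c("A", "B", "C"))

## Hypothetical campus-specific AP cutoffs
cutoffs <- c(A = 1.5, B = 1.6, C = 1.5)

## Center each student's GPA at their campus's cutoff: R = GPA - c[campus]
raw$R <- raw$GPA - cutoffs[raw$campus]

## AP placement is then determined by R < 0
raw$onAP <- raw$R < 0
```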

We will attempt to estimate the average effect of AP placement on nextGPA, students' subsequent GPA (either over the summer or in the following academic semester). The figure below plots the average nextGPA for students at each distinct value of R (i.e., centered first-year GPA). The cutoff for AP, 0, is denoted with a dotted line. The sizes of the points are proportional to the natural log of the number of students with each unique value of R.


figDat <- aggregate(lsoSynth[,c('nextGPA','lhsgrade_pct')],by=list(R=lsoSynth$R),
                    FUN=mean,na.rm=TRUE)
figDat$n <- as.vector(table(lsoSynth$R))

with(figDat,
     plot(R,nextGPA,cex=log(n+1)/3,main='Average Subsequent GPA\n as a function of 1st-Year GPA,\n Centered at AP Cutoff'))
abline(v=0,lty=2)

Background

Regression Discontinuity Designs

What characterizes regression discontinuity (RD) designs, including the LSO study, is that treatment is assigned as a function of a numeric "running variable" $R$, along with a prespecified cutoff value $c$, so that treatment is assigned to subjects $i$ for whom $R_i>c$, or for whom $R_i<c$. If $Z_i$ denotes $i$'s treatment assignment, we write $Z_i=\mathbf{1}\{R_i>c\}$ or $Z_i=\mathbf{1}\{R_i<c\}$, where $\mathbf{1}\{x\}$ is the indicator function, equal to 1 when $x$ is true and 0 otherwise. In the LSO study, $i$ indexes students, centered first-year GPA $R_i$ is the running variable, and the treatment under study is AP placement; then $Z_i=\mathbf{1}\{R_i<0\}$. (In "fuzzy" RD designs, $R$'s value relative to $c$ doesn't completely determine treatment, but merely affects the probability of treatment, so subjects with, say, $R_i>c$ are more likely to be treated than those with $R_i<c$; in fuzzy RD designs, $Z_i=\mathbf{1}\{R_i>c\}$ is typically modeled as an instrument for treatment receipt.)

RD designs hold a privileged place in causal inference, because unlike most other observational study designs, the mechanism for treatment assignment is known. That is, RR is the only confounder. On the other hand, since ZZ is completely determined by RR, there are no subjects with the same values for RR but different values for ZZ, so common observational study techniques, such as matching on RR, are impossible. Instead, it is necessary to adjust for RR with modeling–typically regression.

Analyzing RD Designs with ANCOVA

Traditionally, data from RD designs are analyzed with ANCOVA, by fitting a regression model such as $Y_i=\beta_0+\beta_1 R_i+\beta_2 Z_i+\epsilon_i$, where $Y_i$ is the outcome of interest measured for subject $i$ (in our example, nextGPA), and $\epsilon_i$ is a random regression error. The regression coefficient for $Z$, $\beta_2$, is then taken as an estimate of the treatment effect. Common methodological refinements to the ANCOVA approach include an interaction term between $Z_i$ and $R_i$, and the substitution of semi-parametric regression, such as local linear or polynomial models, for ordinary least squares.
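For illustration only (this is plain lm(), not part of the propertee workflow), the ANCOVA model and its interacted variant might look like the following; the subset restricts attention to a window around the cutoff, anticipating the bandwidth discussion below:

```r
lsoSynth$Z <- lsoSynth$R < 0  # treatment indicator: below-cutoff students are on AP

## Basic ANCOVA: common slope in R, treatment effect beta_2 on Z
anc1 <- lm(nextGPA ~ R + Z, data = lsoSynth, subset = abs(R) <= 0.5)

## With a Z-by-R interaction, allowing different slopes on either side of the cutoff
anc2 <- lm(nextGPA ~ R * Z, data = lsoSynth, subset = abs(R) <= 0.5)

coef(anc1)["ZTRUE"]  # the ANCOVA treatment-effect estimate
```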

Under suitable conditions, the ANCOVA model is said to estimate a "Local Average Treatment Effect" (LATE). If $Y_1$ and $Y_0$ are the potential outcomes for $Y$, then the LATE is defined as $$\lim_{r\rightarrow c^+} E(Y_1|R=r)-\lim_{r\rightarrow c^-} E(Y_0|R=r)$$ or, equivalently, $$\lim_{\Delta\rightarrow 0^+} E(Y_1-Y_0\,|\,R\in (c-\Delta,c+\Delta)),$$ where $E$ denotes expectation. That is, the LATE is the limit of average treatment effects for subjects with $R$ in ever-shrinking regions around $c$.

Among other considerations, the LATE target suggests that data analysts restrict their attention to subjects with $R$ falling within a bandwidth $b>0$ of $c$, e.g. only fitting the model to subjects in a "window of analysis" $\mathcal{W}=\{i:R_i\in (c-b,c+b)\}$. A number of methods have been proposed to select $b$, including cross-validation, non-parametric modeling of the second derivative of $f(r)=E(Y|R=r)$, and specification tests.

The propertee Approach to RD Designs

The propertee approach to RD designs breaks the data analysis into three steps:

  1. Conduct specification tests and choose a bandwidth $b>0$, along with, possibly, other data exclusions, resulting in an analysis sample $\mathcal{W}$.
  2. Fit a covariance model $Y_{i0}=g(R_i,\mathbf{x}_i;\beta)$, modeling $Y_{i0}$ as a function of the running variable $R_i$ and (optionally) other covariates $\mathbf{x}_i$. Fitting a covariance model to only the control subjects, as in other designs, would entail extrapolating a model fit to subjects with $R_i\in (c-b,c)$ to subjects with $R_i\in (c,c+b)$, or vice-versa, which is undesirable. Instead, we fit the covariance model to the full analysis sample $\mathcal{W}$, including both treated and untreated subjects. However, since we are interested in modeling $Y_0$, and not $Y_1$ or $Y$, rather than fitting $g(\cdot)$ directly we fit an extended model $\tilde{g}(R_i,\mathbf{x}_i,Z_i;\beta,\gamma)=g(R_i,\mathbf{x}_i;\beta)+\gamma Z_i$ including a term for treatment assignment. The estimate for $\gamma$ is, in essence, a provisional estimate of the treatment effect.
  3. Let $\widehat{Y}_{i0}=g(R_i,\mathbf{x}_i;\hat{\beta})$, using only the model for $Y_0$ (not including the term for $Z$), with $\hat{\beta}$ estimated in step 2. Then estimate the average treatment effect for subjects in $\mathcal{W}$ using the difference-in-means estimator: $$d_{\hat{\beta}}=\frac{\sum_{i\in \mathcal{W}} Z_i(Y_i-\widehat{Y}_{i0})}{\sum_{i \in \mathcal{W}} Z_i}-\frac{\sum_{i\in \mathcal{W}} (1-Z_i)(Y_i-\widehat{Y}_{i0})}{\sum_{i \in \mathcal{W}} (1-Z_i)}$$
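The estimator in step 3 can be sketched in base R; est_d() below is a hypothetical helper for exposition, not a propertee function:

```r
## Difference-in-means estimator d_beta-hat: average residual Y - Yhat0 among
## treated units minus the average among controls, within the analysis sample
est_d <- function(Y, Yhat0, Z) {
  resid <- Y - Yhat0
  mean(resid[Z]) - mean(resid[!Z])
}

## Toy illustration with made-up numbers
Y     <- c(2.1, 2.4, 1.8, 2.0)
Yhat0 <- c(1.9, 2.0, 1.9, 2.1)
Z     <- c(TRUE, TRUE, FALSE, FALSE)
est_d(Y, Yhat0, Z)  # (0.2 + 0.4)/2 - (-0.1 - 0.1)/2 = 0.4
```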

The estimator $d_{\hat{\beta}}$ will be consistent if the model $g(R_i,\mathbf{x}_i;\beta_0)$, with $\beta_0$ the probability limit of $\hat{\beta}$, successfully removes all confounding due to $R$, i.e. $Y_0-g(R_i,\mathbf{x}_i;\beta_0) \perp\!\!\!\perp Z$ for $i\in \mathcal{W}$.

Analyzing an RD design in propertee

Determining the window of analysis

There is an extensive literature on determining the appropriate window of analysis for an RD design. See, for instance, Imbens and Kalyanaraman (2012) and Sales and Hansen (2020). This stage of the analysis is beyond the scope of this vignette, and does not require the propertee package. For the purpose of this example, we will focus our analysis on $\mathcal{W}=\{i:R_i\in [-0.5,0.5]\}$.

Initializing the RD Design Object

In general, propertee design objects require users to specify a unit of assignment variable, corresponding to the level at which treatment was assigned (in other words, which units were assigned between conditions). This is necessary both for combining results across different models or datasets and for estimating correct design-based standard errors. In lsoSynth there are no clusters: students are placed on academic probation on an individual basis. The dataset does not include an ID variable, but any variable that takes a unique value for each row will do; we will use row names.

lsoSynth$id <- rownames(lsoSynth)

Defining an RD design requires, at minimum, identifying the running variable(s) $R$, as well as how $R$ determines treatment assignment. In our example,

lsoSynth$Z <- lsoSynth$R < 0   # treatment indicator: AP placement

lsoSynthW <- subset(lsoSynth, abs(R) <= 0.5)  # the analysis sample, |R| <= 0.5

des <- rd_design(Z ~ forcing(R) + unitid(id), data=lsoSynth, subset=abs(lsoSynth$R) <= 0.5)

Modeling $Y_0$ as a function of $R$

We will consider two potential models $\tilde{g}(\cdot)$ of $Y_0$ as a function of $R$. First, the standard OLS model:

### this doesn't work:
# g1 <- lm(nextGPA ~ R + Z, data = lsoSynth, weights = ett(des))
## this works, but it's annoying to enter the subset expression a 2nd time:
g1 <- lm(nextGPA ~ R + Z, data = lsoSynth, subset = abs(R) <= 0.5)

The second is a bounded-influence polynomial model of the type recommended in Sales & Hansen (2020):

g2 <- if (requireNamespace("robustbase", quietly = TRUE)) {
  robustbase::lmrob(nextGPA ~ poly(R, 5) + Z, data = lsoSynthW)
} else g1  # fall back to the OLS fit if robustbase is unavailable

Below, we compare the two fits by plotting their predictions against one another before using each to estimate effects.
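One simple way to compare the two fits is to overlay their fitted curves over the window of analysis; this sketch assumes g1, g2, and lsoSynthW from above:

```r
rseq <- seq(-0.5, 0.5, by = 0.01)
nd <- data.frame(R = rseq, Z = rseq < 0)

## Predicted nextGPA under each model, including the provisional treatment term
plot(rseq, predict(g1, nd), type = "l", lty = 1,
     xlab = "R", ylab = "Predicted nextGPA")
lines(rseq, predict(g2, nd), lty = 2)
abline(v = 0, lty = 3)
legend("topleft", lty = 1:2, legend = c("g1: OLS", "g2: robust polynomial"))
```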

Estimating Effects


## Predict Y0 for each unit: setting Z=FALSE drops the treatment term,
## leaving g(R; beta-hat)
yhat1 <- predict(g1, data.frame(R = forcings(des)[[1]], Z = FALSE))
yhat2 <- predict(g2, data.frame(R = forcings(des)[[1]], Z = FALSE))

## Compare the two models' predictions
plot(yhat1, yhat2)

### Method 1: difference in means of the residuals Y - Yhat0, using g1

mean(lsoSynthW$nextGPA[lsoSynthW$Z]-yhat1[lsoSynthW$Z])-
  mean(lsoSynthW$nextGPA[!lsoSynthW$Z]-yhat1[!lsoSynthW$Z])
#> [1] 0.2553202

### Method 2: equivalently, regress nextGPA on Z with Yhat0 as an offset, using g1
coef(lm(nextGPA~Z, offset=yhat1,data=lsoSynthW))['ZTRUE']
#>     ZTRUE 
#> 0.2553202

### Method 1, using g2

mean(lsoSynthW$nextGPA[lsoSynthW$Z]-yhat2[lsoSynthW$Z])-
  mean(lsoSynthW$nextGPA[!lsoSynthW$Z]-yhat2[!lsoSynthW$Z])
#> [1] 0.2295554

### Method 2, using g2
coef(lm(nextGPA~Z, offset=yhat2,data=lsoSynthW))['ZTRUE']
#>     ZTRUE 
#> 0.2295554