Title: | Honest Inference in Regression Discontinuity Designs |
---|---|
Description: | Honest and nearly-optimal confidence intervals in fuzzy and sharp regression discontinuity designs and for inference at a point based on local linear regression. The implementation is based on Armstrong and Kolesár (2018) <doi:10.3982/ECTA14434>, and Kolesár and Rothe (2018) <doi:10.1257/aer.20160945>. Supports covariates, clustering, and weighting. |
Authors: | Michal Kolesár [aut, cre, cph] |
Maintainer: | Michal Kolesár |
License: | GPL-3 |
Version: | 1.0.1.9000 |
Built: | 2025-02-16 05:47:40 UTC |
Source: | https://github.com/kolesarm/rdhonest |
Oreopoulos (2006) UK general household survey dataset
cghs
A data frame with 73,954 rows and 2 variables:
earnings
Annual earnings in 1998 (UK pounds)
yearat14
Year individual turned 14
American Economic Review data archive, doi:10.1257/000282806776157641
Philip Oreopoulos. Estimating average and local average treatment effects when compulsory education schooling laws really matter. American Economic Review, 96(1):152–175, 2006. doi:10.1257/000282806776157641
Computes the critical value $cv_{1-\alpha}(B)$ such that the confidence interval $X \pm cv_{1-\alpha}(B)$ has coverage $1-\alpha$, where the estimator $X$ is normally distributed with variance equal to $1$ and maximum bias at most $B$.
CVb(B, alpha = 0.05)
B |
Maximum bias, vector of non-negative numbers. |
alpha |
Determines CI level, $1-\alpha$ |
Vector of critical values, one for each value of maximum bias supplied by B.
## 90% critical value:
CVb(B = 1, alpha = 0.1)
## Usual 95% critical value
CVb(0)
## Returns vector with 3 critical values
CVb(B = c(0, 0.5, 1), alpha = 0.05)
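The definition of the critical value can be checked numerically without the package. The following is a minimal base-R sketch, not the package's implementation: the function name cv_bias_aware is hypothetical. It relies on the fact that, for a unit-variance normal estimator whose bias equals the bound B, the squared deviation follows a noncentral chi-squared distribution with one degree of freedom and noncentrality B^2.

```r
# Critical value cv with P(|Z + b| <= cv) = 1 - alpha for Z ~ N(0, 1) at the
# worst-case bias b = B. Since (Z + B)^2 is noncentral chi-squared with
# df = 1 and ncp = B^2, cv is the square root of its (1 - alpha) quantile.
# cv_bias_aware is a hypothetical name used only for this illustration.
cv_bias_aware <- function(B, alpha = 0.05) {
  stopifnot(all(B >= 0), alpha > 0, alpha < 1)
  sqrt(qchisq(1 - alpha, df = 1, ncp = B^2))
}

B <- c(0, 0.5, 1)
cv <- cv_bias_aware(B, alpha = 0.05)

# Coverage of [X - cv, X + cv] when the bias equals B; each entry should be
# close to 0.95.
coverage <- pnorm(cv - B) - pnorm(-cv - B)
print(cbind(B = B, cv = cv, coverage = coverage))
```

With B = 0 the sketch reduces to the usual two-sided normal critical value, and the critical value grows as the bias bound grows.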
Subset of Ludwig-Miller (2007) data. Counties with missing poverty rate, or with both outcomes missing (hs and mortality) were removed. In the original dataset, Yellowstone County, MT (oldcode = 27056) was entered twice; here the duplicate is removed. Yellowstone National Park, MT (oldcode = 27057) is also removed because it is an outlier for both outcomes. Counties with oldcode equal to (3014, 32032, 47010, 47040, 47074, 47074, 47078, 47079, 47096) matched more than one FIPS entry, so the county labels may not be correct. Mortality data is missing for Alaska.
headst
A data frame with 3,127 rows and 18 variables:
State FIPS code
County FIPS code
ID in Ludwig-Miller dataset
Poverty rate in 1960 relative to 300th poorest county (which had poverty rate 59.1984)
Average Mortality rate per 100,000 for children aged 5-9 over 1973–83 due to causes addressed as part of Head Start's health services
Average Mortality rate per 100,000 for children aged 5-9 over 1973–83 due to injury
High school completion rate in 1990 census, ages 18-24
County population (1960 census)
Percent attending school, ages 14-17 (1960 census)
Percent attending school, ages 5-34 (1960 census)
High school completion rate in 1960 census, ages 25+
Population aged 14-17 (1960 census)
Population aged 5-34 (1960 census)
Population aged 25+ (1960 census)
Percent urban (1960 census)
Percent black (1960 census)
State postal code
County name
Douglas Miller's former website, http://web.archive.org/web/20190619165949/http://faculty.econ.ucdavis.edu:80/faculty/dlmiller/statafiles/
Jens Ludwig and Douglas L. Miller. Does head start improve children's life chances? Evidence from a regression discontinuity design. Quarterly Journal of Economics, 122(1):159–208, February 2007. doi:10.1162/qjec.122.1.159
Lee (2008) US House elections dataset
lee08
A data frame with 6,558 rows and 2 variables:
voteshare
Vote share in next election
margin
Democratic margin of victory
Mostly Harmless Econometrics data archive, https://economics.mit.edu/people/faculty/josh-angrist/mhe-data-archive
David S. Lee. Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142(2):675–697, 2008. doi:10.1016/j.jeconom.2007.05.004
Battistin, Brugiavini, Rettore, and Weber (2009) retirement consumption puzzle dataset
rcp
A data frame with 30,006 rows and 6 variables:
Survey year
Years to/from eligibility (males)
Retirement status (males)
Total household food expenditure
Total household consumption
Total household expenditure on non-durable goods
Educational attainment (males), one of: "none", "elementary school", "lower secondary", "vocational studies", "upper secondary", "college or higher"
Family size
American Economic Review data archive, doi:10.1257/aer.99.5.2209
Erich Battistin, Agar Brugiavini, Enrico Rettore, and Guglielmo Weber. The retirement consumption puzzle: Evidence from a regression discontinuity approach. American Economic Review, 99(5):2209–2226, 2009. doi:10.1257/aer.99.5.2209
Calculate estimators and bias-aware CIs for the sharp or fuzzy RD parameter, or for the value of the conditional mean at a point.
RDHonest(
  formula,
  data,
  subset,
  weights,
  cutoff = 0,
  M,
  kern = "triangular",
  na.action,
  opt.criterion = "MSE",
  h,
  se.method = "nn",
  alpha = 0.05,
  beta = 0.8,
  J = 3,
  sclass = "H",
  T0 = 0,
  point.inference = FALSE,
  sigmaY2,
  sigmaD2,
  sigmaYD,
  clusterid
)
formula |
an object of class |
data |
optional data frame, list or environment (or object coercible by
|
subset |
optional vector specifying a subset of observations to be used in the fitting process. |
weights |
Optional vector of weights to weight the observations (useful for aggregated data). The weights are interpreted as the number of observations that each aggregated data point averages over. Disregarded if optimal kernel is used. |
cutoff |
specifies the RD cutoff in the running variable. For inference
at a point, specifies the point |
M |
Bound on second derivative of the conditional mean function, a
numeric vector of length one. For fuzzy RD, |
kern |
specifies the kernel function used in the local regression. It
can either be a string equal to |
na.action |
function which indicates what should happen when the data
contain |
opt.criterion |
Optimality criterion that the bandwidth is designed to optimize. The options are:
The methods use conditional variance given by |
h |
bandwidth, a scalar parameter. If not supplied, optimal bandwidth is computed according to the criterion given by opt.criterion. |
se.method |
method for estimating standard error of the estimate, one of:
|
alpha |
determines confidence level, $1-\alpha$ |
beta |
Determines quantile of excess length to optimize, if bandwidth
optimizes given quantile of excess length of one-sided confidence
intervals ( |
J |
Number of nearest neighbors, if se.method = "nn" |
sclass |
Smoothness class, either |
T0 |
Initial estimate of the treatment effect for calculating the optimal bandwidth. Only relevant for fuzzy RD. |
point.inference |
Do inference at a point determined by cutoff |
sigmaY2 |
Supply variance of outcome. Ignored when kernel is optimal. |
sigmaD2 |
Supply variance of treatment (fuzzy RD only). |
sigmaYD |
Supply covariance of treatment and outcome (fuzzy RD only). |
clusterid |
Vector specifying cluster membership. If supplied,
|
The bandwidth is calculated to be optimal for a given performance criterion, as specified by opt.criterion. Alternatively, for local polynomial estimators, the bandwidth can be specified by h. For kern="optimal", the function calculates optimal estimators under the second-order Taylor smoothness class (sharp RD only).
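To make the local polynomial estimator behind these options concrete, the following is a minimal base-R sketch of a sharp RD local linear regression with a triangular kernel and a user-supplied bandwidth, run on simulated data. It is an illustration only, not the package's implementation: the helper names tri_kernel and local_linear_rd and the simulated data are hypothetical, and the sketch omits standard errors, the worst-case bias calculation, and bandwidth selection.

```r
# Simulated sharp RD data (hypothetical): true jump of 0.5 at the cutoff 0.
set.seed(42)
n <- 2000
x <- runif(n, -1, 1)
y <- 0.5 * (x >= 0) + x + rnorm(n, sd = 0.1)

# Triangular kernel: weight 1 at the cutoff, declining linearly to 0 at |u| = 1.
tri_kernel <- function(u) pmax(1 - abs(u), 0)

# Local linear RD estimate: kernel-weighted least squares of y on a treatment
# indicator, the centered running variable, and their interaction, using only
# observations within bandwidth h of the cutoff.
local_linear_rd <- function(y, x, cutoff = 0, h) {
  xc <- x - cutoff
  df <- data.frame(y = y, xc = xc, d = as.numeric(xc >= 0),
                   w = tri_kernel(xc / h))
  df <- df[df$w > 0, ]
  fit <- lm(y ~ d * xc, data = df, weights = w)
  coef(fit)[["d"]]  # jump in the fitted conditional mean at the cutoff
}

est <- local_linear_rd(y, x, cutoff = 0, h = 0.5)
print(est)
```

The coefficient on the indicator d is the difference between the two fitted lines at the cutoff, which is the sharp RD point estimate for the chosen kernel and bandwidth.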
Returns an object of class "RDResults". The function print can be used to obtain and print a summary of the results. An object of class "RDResults" is a list containing four components.
First, a data frame "coefficients" containing the following columns:
term
type of parameter being estimated
estimate
point estimate
std.error
standard error of estimate
maximum.bias
maximum bias of estimate
conf.low, conf.high
lower (upper) endpoint of a two-sided CI based on estimate
conf.low.onesided, conf.high.onesided
lower (upper) endpoint of a one-sided CI based on estimate
bandwidth
bandwidth used. If kern="optimal", the smoothing parameters bandwidth.m and bandwidth.p on either side of the cutoff are reported instead
eff.obs
number of effective observations
leverage
maximal leverage of estimate
cv
critical value used to compute two-sided CIs
alpha
coverage level, as specified by option alpha
method
smoothness class (sclass) that is used
M
curvature bound used for worst-case bias calculations. For fuzzy RD, equals (abs(estimate)*M.fs+M.rf)/first.stage
M.rf, M.fs
curvature bound for the outcome (i.e. reduced-form) and first-stage regressions. Fuzzy RD only.
first.stage
estimate of the first-stage coefficient. Fuzzy RD only.
kernel
kernel used
p.value
p-value for testing the null of no effect
Second, a list called "data" containing the data used for estimation. This is useful mostly for internal calculations. Third, an object of class "lm" containing the local linear regression estimates. Finally, a call object called "call" containing the matched call.
If kern="optimal", the "lm" object is empty, and the numeric vectors "delta" and "omega" are returned in addition. These correspond to the parameters in the modulus problem used to compute the optimal estimation weights.
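The relationship between the estimate, std.error, maximum.bias, and cv columns and the reported interval endpoints can be sketched as follows, assuming the bias-aware construction of Armstrong and Kolesár: the two-sided interval uses a critical value evaluated at the ratio of worst-case bias to standard error, and the one-sided intervals subtract or add the worst-case bias before applying the usual normal quantile. The numeric inputs below are hypothetical placeholders, not output from any dataset, and the code is a sketch rather than the package's own routine.

```r
# Hypothetical values standing in for one row of the "coefficients" data frame.
estimate <- 5.9
std.error <- 1.4
maximum.bias <- 0.9
alpha <- 0.05

# Critical value for a unit-variance normal estimator whose bias is at most
# maximum.bias / std.error (noncentral chi-squared quantile, df = 1).
cv <- sqrt(qchisq(1 - alpha, df = 1, ncp = (maximum.bias / std.error)^2))

# Two-sided bias-aware CI.
conf.low <- estimate - cv * std.error
conf.high <- estimate + cv * std.error

# One-sided CIs: shift by the worst-case bias, then use the one-sided quantile.
conf.low.onesided <- estimate - maximum.bias - qnorm(1 - alpha) * std.error
conf.high.onesided <- estimate + maximum.bias + qnorm(1 - alpha) * std.error

print(c(cv = cv, conf.low = conf.low, conf.high = conf.high))
```

Because the critical value exceeds the usual normal quantile whenever the bias bound is positive, the resulting interval is wider than a conventional one that ignores bias.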
subset is evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
Timothy B. Armstrong and Michal Kolesár. Optimal inference in a class of regression models. Econometrica, 86(2):655–683, March 2018. doi:10.3982/ECTA14434
Timothy B. Armstrong and Michal Kolesár. Simple and honest confidence intervals in nonparametric regression. Quantitative Economics, 11(1):1–39, January 2020. doi:10.3982/QE1199
Michal Kolesár and Christoph Rothe. Inference in regression discontinuity designs with a discrete running variable. American Economic Review, 108(8):2277–2304, August 2018. doi:10.1257/aer.20160945
RDHonest(voteshare ~ margin, data = lee08, kern = "uniform", M = 0.1, h = 10)
RDHonest(cn | retired ~ elig_year, data = rcp, cutoff = 0, M = c(4, 0.4),
         kern = "triangular", opt.criterion = "MSE", T0 = 0, h = 3)
RDHonest(voteshare ~ margin, data = lee08, subset = margin > 0,
         kern = "uniform", M = 0.1, h = 10, point.inference = TRUE)
Computes honest CIs for local polynomial regression with uniform kernel in sharp RD under the assumption that the conditional mean lies in the bounded misspecification error (BME) class of functions, as considered in Kolesár and Rothe (2018). This class formalizes the notion that the fit of the chosen model is no worse at the cutoff than elsewhere in the estimation window.
RDHonestBME(
  formula,
  data,
  subset,
  cutoff = 0,
  na.action,
  h = Inf,
  alpha = 0.05,
  order = 0,
  regformula
)
formula |
object of class |
data |
optional data frame, list or environment (or object coercible by
|
subset |
optional vector specifying a subset of observations to be used in the fitting process. |
cutoff |
specifies the RD cutoff in the running variable. |
na.action |
function which indicates what should happen when the data
contain |
h |
bandwidth, a scalar parameter. |
alpha |
determines confidence level, $1-\alpha$ |
order |
Order of local regression |
regformula |
Explicitly specify regression formula to use instead of
running a local polynomial regression, with |
An object of class "RDResults". This is a list with at least the following elements:
"coefficients"
Data frame containing estimation results, including point estimate, one- and two-sided confidence intervals, a bound on worst-case bias, bandwidth used, and the number of effective observations.
"call"
The matched call.
"lm"
An "lm" object containing the fitted regression.
"na.action"
(If relevant) information on the special handling of NAs.
subset is evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
Michal Kolesár and Christoph Rothe. Inference in regression discontinuity designs with a discrete running variable. American Economic Review, 108(8):2277–2304, August 2018. doi:10.1257/aer.20160945
RDHonestBME(log(earnings) ~ yearat14, data = cghs, h = 3, order = 1, cutoff = 1947)
## Equivalent to
RDHonestBME(log(earnings) ~ yearat14, data = cghs, h = 3, cutoff = 1947,
            order = 1, regformula = "y~x*I(x>=0)")
Scatterplot of raw observations in which each point corresponds to a binned average.
RDScatter(
  formula,
  data,
  subset,
  cutoff = 0,
  na.action,
  avg = 10,
  xlab = NULL,
  ylab = NULL,
  vert = TRUE,
  propdotsize = FALSE
)
formula |
object of class |
data |
optional data frame, list or environment (or object coercible by
|
subset |
optional vector specifying a subset of observations to be used in the fitting process. |
cutoff |
specifies the RD cutoff for the running variable. |
na.action |
function which indicates what should happen when the data
contain |
avg |
Number of observations to average over. If set to |
xlab, ylab |
x- and y-axis labels |
vert |
Draw a vertical line at cutoff? |
propdotsize |
If |
An object of class "ggplot", a scatterplot of the binned raw observations.
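The binning behind such a plot can be sketched in base R as follows, assuming that bins are formed separately on each side of the cutoff from consecutive observations sorted by the running variable. This is an illustration of the idea rather than the package's code: the function name bin_averages is hypothetical, it handles only a finite avg, and it assumes there is at least one observation on each side of the cutoff.

```r
# Average y and x over consecutive groups of `avg` observations, sorted by the
# running variable x, separately below and above the cutoff.
# bin_averages is a hypothetical helper used only for this illustration.
bin_averages <- function(y, x, cutoff = 0, avg = 10) {
  one_side <- function(idx) {
    o <- idx[order(x[idx])]              # indices on this side, sorted by x
    bins <- ceiling(seq_along(o) / avg)  # bin label for each sorted observation
    data.frame(
      x = as.vector(tapply(x[o], bins, mean)),
      y = as.vector(tapply(y[o], bins, mean)),
      n = as.vector(table(bins))         # number of observations in each bin
    )
  }
  rbind(one_side(which(x < cutoff)), one_side(which(x >= cutoff)))
}

# Small worked example with 3 observations below and 4 above the cutoff.
x <- c(-3, -2, -1, 1, 2, 3, 4)
y <- 2 * x
binned <- bin_averages(y, x, cutoff = 0, avg = 2)
print(binned)
```

Keeping the bin counts in the column n makes it possible to scale dot sizes by the number of observations each point represents.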
subset is evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
RDScatter(log(earnings) ~ yearat14, data = cghs, cutoff = 1947, avg = Inf, propdotsize = TRUE)
Estimate a lower bound on the smoothness constant M and provide a lower confidence interval for it, using the method described in the supplement to Kolesár and Rothe (2018).
RDSmoothnessBound(
  object,
  s,
  separate = FALSE,
  multiple = TRUE,
  alpha = 0.05,
  sclass = "H"
)
object |
An object of class |
s |
Number of support points that curvature estimates should average over. |
separate |
If |
multiple |
If |
alpha |
determines confidence level |
sclass |
Smoothness class, either |
Returns a data frame with the following columns:
estimate
Point estimate of the lower bound for M.
conf.low
Lower endpoint for a one-sided confidence interval for M.
The data frame has a single row if separate==FALSE; otherwise it has two rows, corresponding to smoothness bound estimates and confidence intervals below and above the cutoff, respectively.
Michal Kolesár and Christoph Rothe. Inference in regression discontinuity designs with a discrete running variable. American Economic Review, 108(8):2277–2304, August 2018. doi:10.1257/aer.20160945
## Subset data to increase speed
r <- RDHonest(log(earnings) ~ yearat14, data = cghs,
              subset = abs(yearat14 - 1947) < 10, cutoff = 1947, M = 0.04, h = 3)
RDSmoothnessBound(r, s = 2)
Compute efficiency of minimax one-sided CIs at constant functions, or efficiency of two-sided fixed-length CIs at constant functions under second-order Taylor smoothness class.
RDTEfficiencyBound(object, opt.criterion = "FLCI", beta = 0.5)
object |
An object of class |
opt.criterion |
Either |
beta |
Determines quantile of excess length for evaluating minimax
efficiency of one-sided CIs. Ignored if |
Efficiency bound, a numeric vector of length one.
Timothy B. Armstrong and Michal Kolesár. Optimal inference in a class of regression models. Econometrica, 86(2):655–683, March 2018. doi:10.3982/ECTA14434
r <- RDHonest(voteshare ~ margin, data = lee08, subset = abs(margin) < 10, M = 0.1, h = 2)
RDTEfficiencyBound(r, opt.criterion = "OCI")
Subset of Lalive (2008) data for individuals in the regions affected by the REBP program
rebp
A data frame with 29,371 rows and 4 variables:
Age in years, at monthly accuracy
Indicator for whether REBP is in place
Indicator for female
Unemployment duration in weeks
Rafael Lalive's website, https://sites.google.com/site/rafaellalive/
Rafael Lalive. How do extended benefits affect unemployment duration? A regression discontinuity approach. Journal of Econometrics, 142(2):785–806, February 2008. doi:10.1016/j.jeconom.2007.05.013