| Title: | Linear Regression with Missing Data |
|---|---|
| Description: | Provides methods for linear regression in the presence of missing data, including missingness in covariates and responses. The package implements two estimators: 'oss_estimator', a low-dimensional semi-supervised method, and 'dantzig_missing', a high-dimensional approach. The tuning parameter can be selected automatically via 'cv_dantzig_missing'. See the associated methodology paper for details. |
| Authors: | Benedict Risebrow [aut, cre], Thomas Berrett [aut] |
| Maintainer: | Benedict Risebrow <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.1 |
| Built: | 2026-05-22 07:47:20 UTC |
| Source: | https://github.com/benrisebrow/lrmiss |
Performs K-fold cross-validation for the Dantzig selector in linear regression models with missing covariates. The method optionally incorporates unlabelled covariate data to improve estimation of second-moment matrices.
    cv_dantzig_missing(
      X, y, X_unlabeled = NULL, lambdas = NULL, nlambda = 30,
      lambda_min_ratio = 1e-3, K = 5, standardise = TRUE, gurobi = FALSE,
      seed = 123, fold_ids = NULL, verbose = TRUE, plot_path = TRUE
    )
| Argument | Description |
|---|---|
| X | Labelled covariates. |
| y | Response variables for the labelled data. |
| X_unlabeled | Optional unlabelled covariates. |
| lambdas | Optional sequence of regularisation parameters. |
| nlambda | Number of lambdas to generate if lambdas is NULL. |
| lambda_min_ratio | Smallest lambda as a fraction of the largest. |
| K | Number of cross-validation folds. |
| standardise | Logical; if TRUE, covariates are standardised. |
| gurobi | Logical; if TRUE, uses Gurobi to solve the linear programs. |
| seed | Random seed for fold assignment. |
| fold_ids | Optional fold assignments for labelled or combined data. |
| verbose | Logical; if TRUE, prints progress messages. |
| plot_path | Logical; if TRUE, computes and plots the solution path. |
For each candidate value of the regularisation parameter, the Dantzig selector is fitted using moment estimates computed from the training folds. Prediction performance is assessed on held-out folds via the maximum absolute moment mismatch. The tuning parameter is selected using both the minimum mean cross-validation score and the one-standard-error (1-SE) rule.
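The two selection rules can be sketched in a few lines of R. This is a minimal illustration, not the package's code: the helper name select_lambda and the returned element name lambda_min are hypothetical, and the sketch assumes the common convention that the 1-SE rule picks the largest (most regularised) lambda whose mean score lies within one standard error of the minimum.

```r
# Hypothetical helper: pick lambda by minimum mean CV score and by the 1-SE rule.
# cv_scores is a folds x lambdas matrix of held-out scores (lower is better).
select_lambda <- function(lambdas, cv_scores) {
  cv_mean <- colMeans(cv_scores)
  # standard error of the mean across folds, per lambda
  cv_se <- apply(cv_scores, 2, sd) / sqrt(nrow(cv_scores))
  i_min <- which.min(cv_mean)
  threshold <- cv_mean[i_min] + cv_se[i_min]
  # assumed convention: largest lambda whose mean score is within 1 SE of the best
  lambda_1se <- max(lambdas[cv_mean <= threshold])
  list(lambda_min = lambdas[i_min], lambda_1se = lambda_1se)
}

lambdas <- c(1, 0.5, 0.1)
cv_scores <- rbind(c(1.0, 0.5, 0.6),
                   c(1.2, 0.5, 0.2))
res <- select_lambda(lambdas, cv_scores)
```

The 1-SE choice trades a small loss in cross-validated score for a sparser, more heavily regularised fit.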
A named list with the following components:
Numeric vector of tuning parameters used.
Numeric matrix of cross-validation scores (folds × lambdas).
Mean CV score for each lambda.
Standard error of CV scores for each lambda.
Lambda minimising mean CV score.
lambda_1se: Lambda chosen by the 1-SE rule.
beta_path: Optional coefficient path matrix (present if plot_path = TRUE).
Optional design column names (matching beta_path rows).
beta_est: Optional saved coefficient vector from the full-data path.
Optional saved intercept corresponding to beta_est.
    set.seed(1)
    n <- 50; p <- 5
    X <- matrix(rnorm(n * p), n, p)
    y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
    X_unlabeled <- matrix(rnorm(100 * p), 100, p)
    cv_fit <- cv_dantzig_missing(
      X = X, y = y, X_unlabeled = X_unlabeled,
      K = 5, nlambda = 20
    )
    cv_fit$lambda_1se
High-dimensional linear regression estimator based on the Dantzig selector
that accommodates missing covariates and optionally leverages unlabelled
covariate data. This function is a user-facing wrapper that dispatches to
either a standardised or unstandardised implementation depending on the
value of standardise.
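For orientation, the Dantzig selector in its generic moment form can be written as below. The symbols $\hat{\Sigma}$ (an estimate of the covariate second-moment matrix) and $\hat{\rho}$ (an estimate of the covariate-response cross-moment) are notation assumed here for illustration; how the package forms these estimates under missingness is covered by its methodology paper, not by this sketch.

$$\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{subject to} \quad \|\hat{\rho} - \hat{\Sigma}\beta\|_\infty \le \lambda.$$

Writing $\beta = u - v$ with $u, v \ge 0$ turns this into a linear program, which is why an LP solver is involved:

$$\min_{u, v \ge 0} \; \sum_{j=1}^{p} (u_j + v_j) \quad \text{subject to} \quad -\lambda \mathbf{1} \le \hat{\rho} - \hat{\Sigma}(u - v) \le \lambda \mathbf{1}.$$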
    dantzig_missing(
      X_labeled, y, X_unlabeled = NULL, lambda,
      gurobi = FALSE, standardise = TRUE
    )
| Argument | Description |
|---|---|
| X_labeled | Numeric matrix or data.frame of labelled covariates, with rows corresponding to observations and columns to covariates. Missing values are allowed. |
| y | Numeric response vector of length equal to the number of rows of X_labeled. |
| X_unlabeled | Optional numeric matrix or data.frame of unlabelled covariates. If supplied, these observations are used only for estimating second moments of the covariates and do not contribute to the response. |
| lambda | Positive numeric scalar giving the Dantzig regularisation parameter. |
| gurobi | Logical; if TRUE, the linear programs are solved using the gurobi optimizer (a valid Gurobi installation and license are required). If FALSE, the open-source solver from Rglpk is used instead. |
| standardise | Logical; if TRUE, covariates are standardised prior to estimation and the resulting coefficients are mapped back to the original scale with an intercept term returned. |
Categorical covariates are internally dummy-encoded, with missing values
preserved. When standardise = TRUE, covariates are centred and scaled
using empirical means and standard deviations computed from the combined
labelled and unlabelled samples.
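The pooled standardisation described above might look roughly like the following. This is a sketch under stated assumptions, not the package's internals: the helper names pooled_standardise and to_original_scale are hypothetical, inputs are assumed to be numeric matrices (already dummy-encoded), and the intercept formula assumes the fit was made against a response centred at its mean.

```r
# Hypothetical helper: centre and scale labelled covariates using means and
# standard deviations from the pooled (labelled + unlabelled) sample.
pooled_standardise <- function(X_labeled, X_unlabeled = NULL) {
  X_labeled <- as.matrix(X_labeled)
  X_all <- rbind(X_labeled, X_unlabeled)         # rbind() ignores a NULL argument
  centre <- colMeans(X_all, na.rm = TRUE)        # per-column means over observed entries
  spread <- apply(X_all, 2, sd, na.rm = TRUE)    # per-column SDs over observed entries
  X_std <- sweep(sweep(X_labeled, 2, centre, "-"), 2, spread, "/")
  list(X_std = X_std, centre = centre, spread = spread)  # NAs stay NA in X_std
}

# Hypothetical helper: map coefficients fitted on the standardised scale back
# to the original scale (assumes the response was centred at y_mean).
to_original_scale <- function(beta_std, centre, spread, y_mean) {
  beta <- beta_std / spread
  intercept <- y_mean - sum(centre * beta)
  list(beta = beta, intercept = intercept)
}

X_l <- matrix(c(1, NA, 3, 2, 4, 6), 3, 2)
X_u <- matrix(c(5, 7, 8, NA), 2, 2)
std <- pooled_standardise(X_l, X_u)
```

Pooling the unlabelled rows gives the centring and scaling constants more data to draw on than the labelled sample alone.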
A list with at least the following component:
beta_hat: Numeric vector of estimated regression coefficients, with names corresponding to the encoded design matrix columns.
If standardise = TRUE, the list also contains:
Numeric scalar giving the estimated intercept term.
    set.seed(1)
    n <- 50; p <- 5
    X_full <- matrix(rnorm(n * p), n, p)
    beta_true <- c(1, 0.5, rep(0, p - 2))
    y <- X_full[, 1] * beta_true[1] + X_full[, 2] * beta_true[2] + rnorm(n)
    # introduce missingness into covariates
    X_miss <- X_full
    X_miss[sample(length(X_miss), size = 0.1 * length(X_miss))] <- NA
    # fit Dantzig estimator (example lambda; tune in practice)
    fit <- dantzig_missing(
      X_labeled = X_miss, y = y, lambda = 0.1, standardise = TRUE
    )
    fit$beta_hat
Estimates the covariance matrix of a design matrix in the presence of missing values. Each covariance entry is computed using all observations for which the corresponding pair of covariates is jointly observed.
    estimate_cov_raw(X)
| Argument | Description |
|---|---|
| X | Numeric matrix (or object coercible to a matrix) containing covariates. Rows correspond to observations and columns to variables. Missing values (NA) are allowed. |
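Let $X_{ij}$ denote the $j$-th covariate for observation $i$. For each pair of variables $(j, k)$, the covariance estimate is
$$\hat{\Sigma}_{jk} = \frac{1}{n_{jk}} \sum_{i \,:\, X_{ij},\, X_{ik} \text{ both observed}} X_{ij} X_{ik},$$
where $n_{jk}$ is the number of observations for which both entries are observed. If no such observations exist, the corresponding covariance entry is set to NA.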
This estimator is symmetric by construction and reduces to the usual sample second-moment matrix when the data contain no missing values.
A numeric p x p matrix containing the estimated covariance matrix,
where p = ncol(X). Entries corresponding to variable pairs that are
never jointly observed are NA.
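A pairwise-complete estimator of this kind can be sketched with two cross-products. The function name pairwise_second_moment is hypothetical and this is not the package's implementation; it assumes the uncentred (second-moment) form, in line with the statement that the estimator reduces to the sample second-moment matrix when nothing is missing.

```r
# Hypothetical sketch of a pairwise-complete second-moment estimate.
pairwise_second_moment <- function(X) {
  X <- as.matrix(X)
  W <- (!is.na(X)) * 1          # 1 where observed, 0 where missing
  X0 <- X
  X0[is.na(X0)] <- 0            # zero-fill so missing entries drop out of the sums
  n_jk <- crossprod(W)          # n_jk[j, k]: rows where both j and k are observed
  S <- crossprod(X0) / n_jk     # mean of X_ij * X_ik over jointly observed rows
  S[n_jk == 0] <- NA            # never jointly observed: report NA
  S
}

set.seed(1)
X <- matrix(rnorm(20), 10, 2)
X[1:3, 1] <- NA
S <- pairwise_second_moment(X)
```

Because each entry uses a different subset of rows, the result is symmetric but need not be positive semi-definite.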
    set.seed(1)
    X <- matrix(rnorm(25 * 5), 25, 5)
    X[sample(length(X), 10)] <- NA
    Sigma_hat <- estimate_cov_raw(X)
    Sigma_hat
Fits a linear regression model in the presence of missing covariates and/or missing responses using the OSS (Ordinary Semi-Supervised) estimator. The method exploits partially observed covariates and optionally unlabelled observations to improve estimation efficiency.
    oss_estimator(formula, data, all_weights_one = FALSE, crossfitting = FALSE)
| Argument | Description |
|---|---|
| formula | A model formula specifying the linear regression, e.g. y ~ x1 + x2. |
| data | A data.frame containing the variables in the model. Rows with missing responses are treated as unlabelled observations. |
| all_weights_one | Logical; if TRUE, all missingness-pattern weights are set to one, yielding an unweighted OSS estimator. |
| crossfitting | Logical; if TRUE, a two-fold cross-fitted version of the OSS estimator is used. |
An invisible list with components:
Numeric vector of estimated regression coefficients.
Estimated noise variance, or NA if not computed.
Named vector of weights associated with each missingness pattern.
Data.frame mapping labelled observations to missingness patterns.
Complete-case coefficient estimates if used, otherwise NULL.
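To make the notions of labelled rows and missingness patterns concrete, the sketch below tags each row of a small data.frame. The helper name missingness_patterns and the "1 = observed, 0 = missing" string encoding are assumptions made for illustration; the package may represent patterns differently.

```r
# Hypothetical helper: flag labelled rows and encode each row's covariate
# missingness pattern as a string of 1s (observed) and 0s (missing).
missingness_patterns <- function(data, response, covariates) {
  labelled <- !is.na(data[[response]])             # rows with an observed response
  observed <- !is.na(data[, covariates, drop = FALSE])
  pattern <- apply(observed, 1, function(r) paste(as.integer(r), collapse = ""))
  data.frame(row = seq_len(nrow(data)), labelled = labelled,
             pattern = unname(pattern), stringsAsFactors = FALSE)
}

dat <- data.frame(
  y  = c(1.0, NA, 2.3, 0.5),
  x1 = c(0.2, 1.4, NA, -0.7),
  x2 = c(NA, 0.3, 0.9, 1.1)
)
pat <- missingness_patterns(dat, "y", c("x1", "x2"))
table(pat$pattern[pat$labelled])   # pattern counts among labelled rows
```

Here row 2 has no response and would count as unlabelled, while the three labelled rows each fall into a different pattern.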
    dat <- data.frame(
      y = c(1.0, NA, 2.3, 0.5),
      x1 = rnorm(4),
      x2 = rnorm(4)
    )
    ## Without cross-fitting
    res <- oss_estimator(y ~ x1 + x2, dat)
    ## With cross-fitting
    res_cf <- oss_estimator(y ~ x1 + x2, dat, crossfitting = TRUE)