A tradeoff between false discovery and true positive proportions for sparse high-dimensional logistic regression

Zhou, Jing ORCID: https://orcid.org/0000-0002-8894-9100 and Claeskens, Gerda (2024) A tradeoff between false discovery and true positive proportions for sparse high-dimensional logistic regression. Electronic Journal of Statistics, 18 (1). pp. 395-428. ISSN 1935-7524

PDF (23-EJS2204) - Published Version
Available under License Creative Commons Attribution.

Abstract

The logistic regression model is a simple and classic approach to binary classification. In sparse high-dimensional settings, one believes that only a small proportion of the predictive variables are relevant to the response variable, that is, have non-null regression coefficients. We focus on regularized logistic regression models; the analysis is valid for a large group of regularizers, including folded-concave regularizers such as MCP and SCAD. For finite samples, the discrepancy between the estimated and true non-null coefficients is evaluated by the false discovery and true positive rates. We show that, asymptotically, the false discovery rate can be described as a nonlinear tradeoff function of power, characterized by a system of equations with six parameters. The analysis is conducted in an “average-over-components” fashion for the unknown parameter and follows the conventional assumptions of the literature in the relevant field. More specifically, we assume a linear growth rate n/p → δ > 0, covering not only the typical high-dimensional settings where p ≥ n but also the case n > p. Further, we propose two applications of this tradeoff function that improve the reproducibility of variable selection: (1) a sample size calculation procedure to achieve a certain power under a prespecified level of false discovery rate; (2) calibration of the false discovery rate for variable selection, taking power into consideration. A similar asymptotic analysis for the model-X knockoff, which provides a selection with a controlled false discovery rate, is carried out to show how two selection methods can be compared through their tradeoff curves. We illustrate the tradeoff analysis and its corresponding applications using simulated and real data.
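
As an illustration of the two finite-sample quantities named in the abstract, the following is a minimal sketch of how a false discovery proportion and a true positive proportion could be computed for one fitted coefficient vector. It assumes the usual definitions (falsely selected variables over all selected variables, and correctly selected variables over all truly non-null variables); the function name `fdp_tpp`, the `tol` threshold, and the example vectors are hypothetical and are not taken from the paper.

```python
def fdp_tpp(beta_true, beta_hat, tol=0.0):
    """Return (FDP, TPP) for an estimated coefficient vector.

    Assumed definitions (not quoted from the paper):
      FDP = #{selected and truly null} / max(#{selected}, 1)
      TPP = #{selected and truly non-null} / max(#{truly non-null}, 1)
    A variable counts as selected when |beta_hat_j| > tol.
    """
    if len(beta_true) != len(beta_hat):
        raise ValueError("coefficient vectors must have the same length")

    selected = [abs(b) > tol for b in beta_hat]
    relevant = [b != 0 for b in beta_true]

    true_pos = sum(s and r for s, r in zip(selected, relevant))
    false_pos = sum(s and not r for s, r in zip(selected, relevant))

    n_selected = true_pos + false_pos
    n_relevant = sum(relevant)

    # max(..., 1) avoids division by zero when nothing is selected
    # or when no variable is truly non-null.
    fdp = false_pos / max(n_selected, 1)
    tpp = true_pos / max(n_relevant, 1)
    return fdp, tpp


# Hypothetical example: 3 truly non-null coefficients out of 6.
beta_true = [1.5, 0.0, -2.0, 0.0, 0.0, 0.7]
beta_hat = [1.2, 0.3, -1.8, 0.0, 0.0, 0.0]
fdp, tpp = fdp_tpp(beta_true, beta_hat)
```

In this example three variables are selected, one of them falsely, and two of the three relevant variables are recovered. Averaging such proportions over repeated samples would give empirical counterparts of the false discovery rate and power that the tradeoff function relates.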

Item Type: Article
Additional Information: Funding information: This work was supported by a Postdoc Fellowship of the Research Foundation Flanders and KU Leuven internal fund C16/20/002. The resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation-Flanders (FWO) and the Flemish Government.
Uncontrolled Keywords: fdr control; high-dimensional data; false discovery rate; knockoff; logistic regression; sparsity; statistics and probability; statistics, probability and uncertainty; /dk/atira/pure/subjectarea/asjc/2600/2613
Faculty \ School: Faculty of Science > School of Mathematics
UEA Research Groups: Faculty of Science > Research Groups > Statistics
Depositing User: LivePure Connector
Date Deposited: 04 Mar 2024 18:35
Last Modified: 25 Mar 2024 09:30
URI: https://ueaeprints.uea.ac.uk/id/eprint/94523
DOI: 10.1214/23-EJS2204
