In paired-design studies, multiple measurements are commonly taken on the same set of subjects under different conditions. In observational studies, it is often of interest to pair-match a treatment group and a control group on multiple covariates, and then to test a treatment effect represented by multiple response variables on the well-matched pairs. However, effective tests for multivariate paired data are lacking. The paired Hotelling's $T^2$ test can sometimes be used, but its power decreases rapidly as the dimension increases. Existing methods for assessing the balance of multiple covariates in matched observational studies usually ignore the paired structure and thus perform poorly in some settings. In this work, we propose a new non-parametric test for paired data that exhibits a substantial power improvement over existing methods in a wide range of situations. We also derive the asymptotic distribution of the new test; simulation studies show that the approximate $p$-value is reasonably accurate in finite samples, even when the dimension is larger than the sample size, making the new test an easy off-the-shelf tool for real applications. The proposed test is illustrated through an analysis of a real data set from Alzheimer's disease research.
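For reference, the classical paired Hotelling's $T^2$ baseline mentioned above can be sketched in a few lines. This is the standard textbook construction, not the new test proposed in the abstract, and it requires more pairs than dimensions ($n > p$) just to invert the covariance matrix, which hints at why it breaks down as the dimension grows:

```python
import numpy as np
from scipy import stats

def paired_hotelling_t2(x, y):
    """Paired Hotelling's T^2 test on n x p arrays x and y.

    Works on the within-pair differences d = x - y and uses the
    classical F approximation; requires n > p.
    """
    d = np.asarray(x) - np.asarray(y)            # within-pair differences
    n, p = d.shape
    dbar = d.mean(axis=0)                        # mean difference vector
    s = np.cov(d, rowvar=False)                  # covariance of differences
    t2 = n * dbar @ np.linalg.solve(s, dbar)     # T^2 = n * dbar' S^{-1} dbar
    f_stat = (n - p) / (p * (n - 1)) * t2        # F transformation
    pval = stats.f.sf(f_stat, p, n - p)
    return t2, pval

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
y = x + 0.5 + rng.normal(size=(50, 3))           # shift of 0.5 in each coordinate
t2, pval = paired_hotelling_t2(x, y)             # small p-value expected here
```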

### Related Content

Consider the task of matrix estimation, in which a dataset $X \in \mathbb{R}^{n\times m}$ is observed with sparsity $p$ and we would like to estimate $\mathbb{E}[X]$, where $\mathbb{E}[X_{ui}] = f(\alpha_u, \beta_i)$ for some Hölder-smooth function $f$. We consider the setting where the row covariates $\alpha$ are unobserved but the column covariates $\beta$ are observed. We provide an algorithm, with accompanying analysis, showing that it improves upon naively estimating each row separately when the number of rows is not too small. Furthermore, when the matrix is moderately proportioned, our algorithm achieves the minimax optimal nonparametric rate of an oracle algorithm that knows the row covariates. In simulated experiments, we show that our algorithm outperforms other baselines in low-data regimes.
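As a point of comparison, the "estimate each row separately" baseline that the abstract improves upon can be sketched as a per-row $k$-nearest-neighbour smoother over the observed column covariates. This is an illustrative sketch only (scalar $\beta$, and the product form $f(\alpha_u, \beta_i) = \alpha_u \beta_i$ in the demo is an assumption, not from the paper):

```python
import numpy as np

def naive_row_knn(X, mask, beta, k=5):
    """Naive baseline: estimate each row separately via k-nearest-neighbour
    smoothing over the observed column covariates beta (scalar here),
    sharing no information across rows.
    """
    n, m = X.shape
    est = np.zeros((n, m))
    dist = np.abs(beta[:, None] - beta[None, :])     # column-covariate distances
    for u in range(n):
        obs = np.flatnonzero(mask[u])                # observed entries in row u
        for i in range(m):
            nearest = obs[np.argsort(dist[i, obs])[:k]]
            est[u, i] = X[u, nearest].mean()
    return est

# Illustrative generative model: f(alpha_u, beta_i) = alpha_u * beta_i
rng = np.random.default_rng(5)
n, m, p = 20, 50, 0.5
alpha, beta = rng.uniform(0, 1, n), rng.uniform(0, 1, m)
truth = np.outer(alpha, beta)
mask = rng.random((n, m)) < p                        # entries observed with prob p
X = np.where(mask, truth + rng.normal(0, 0.1, (n, m)), 0.0)
est = naive_row_knn(X, mask, beta, k=5)
mae = np.abs(est - truth).mean()
```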

We consider the Bayesian analysis of models in which the unknown distribution of the outcomes is specified up to a set of conditional moment restrictions. The nonparametric exponentially tilted empirical likelihood function is constructed to satisfy a sequence of unconditional moments based on an increasing (in sample size) vector of approximating functions (such as tensor splines based on the splines of each conditioning variable). For any given sample size, results are robust to the number of expanded moments. We derive Bernstein-von Mises theorems for the behavior of the posterior distribution under both correct and incorrect specification of the conditional moments, subject to growth rate conditions (slower under misspecification) on the number of approximating functions. A large-sample theory for comparing different conditional moment models is also developed. The central result is that the marginal likelihood criterion selects the model that is less misspecified. We also introduce sparsity-based model search for high-dimensional conditioning variables, and provide efficient MCMC computations for high-dimensional parameters. Along with clarifying examples, the framework is illustrated with real-data applications to risk-factor determination in finance, and causal inference under conditional ignorability.

The first step in investigating the effectiveness of a treatment is to split the population into a control group and a treatment group, and then compare the average responses of the two groups to the treatment. To ensure that the difference between the two groups is caused only by the treatment, it is crucial that the control and treatment groups have similar statistics; the validity and reliability of a trial depend on this similarity. Covariate balancing methods increase the similarity between the distributions of the two groups' covariates. In practice, however, there are often not enough samples to accurately estimate the groups' covariate distributions. In this paper, we empirically show that covariate balancing with the standardized mean difference balance measure is susceptible to adversarial treatment assignments when the population is small. Adversarial treatment assignments are those admitted by the covariate balance measure but resulting in large ATE estimation errors. To support this argument, we provide an optimization-based algorithm, namely Adversarial Treatment ASsignment in TREatment Effect Trials (ATASTREET), to find adversarial treatment assignments for the IHDP-1000 dataset.
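The standardized mean difference (SMD) balance measure discussed above is simple to compute. A minimal sketch follows; the 0.1 rule-of-thumb threshold in the comment is a common convention, not something taken from the paper:

```python
import numpy as np

def standardized_mean_difference(x_t, x_c):
    """Per-covariate standardized mean difference (SMD) between the
    treatment and control groups, using the pooled standard deviation.
    """
    x_t, x_c = np.asarray(x_t, float), np.asarray(x_c, float)
    pooled_sd = np.sqrt((x_t.var(axis=0, ddof=1) + x_c.var(axis=0, ddof=1)) / 2)
    return (x_t.mean(axis=0) - x_c.mean(axis=0)) / pooled_sd

rng = np.random.default_rng(1)
treated = rng.normal(0.0, 1.0, size=(200, 4))
control = rng.normal(0.0, 1.0, size=(200, 4))
smd = standardized_mean_difference(treated, control)
# A common rule of thumb treats |SMD| < 0.1 as acceptably balanced -- exactly
# the kind of check an adversarial assignment can pass while still inducing
# a large ATE estimation error.
balanced = bool(np.all(np.abs(smd) < 0.1))
```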

Constraint-based causal structure learning for point processes requires empirical tests of local independence. Existing tests require strong model assumptions, e.g., that the true data-generating model is a Hawkes process with no latent confounders. Even when attention is restricted to Hawkes processes, latent confounders are a major technical difficulty because a marginalized process is generally not a Hawkes process itself. We introduce an expansion, similar to Volterra expansions, as a tool to represent marginalized intensities. Our main theoretical result is that such expansions can approximate the true marginalized intensity arbitrarily well. Based on this, we propose a test of local independence and investigate its properties on real and simulated data.

Missing data is a common issue in many biomedical studies. Under a paired design, some subjects may have missing values in one or both conditions due to loss to follow-up, insufficient biological samples, etc. Such partially paired data complicate statistical comparison of the distribution of the variable of interest between the two conditions. In this paper, we propose a general class of test statistics based on the difference in weighted sample means, without imposing any distributional or model assumptions. An optimal weight is derived for this class of tests. Simulation studies show that our proposed test with the optimal weight performs well and outperforms existing methods in practical situations. Two cancer biomarker studies are provided for illustration.
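A minimal sketch of a difference-in-weighted-means statistic for partially paired data follows. The fixed weight `w` here is purely illustrative; the paper derives an optimal, data-driven weight for the whole class:

```python
import numpy as np

def weighted_mean_diff(x_pair, y_pair, x_only, y_only, w):
    """Difference-in-weighted-means statistic for partially paired data:
    a convex combination of the complete-pair mean difference and the
    difference of the unpaired sample means.
    """
    paired_part = np.mean(np.asarray(x_pair) - np.asarray(y_pair))
    unpaired_part = np.mean(x_only) - np.mean(y_only)
    return w * paired_part + (1 - w) * unpaired_part

rng = np.random.default_rng(2)
x_pair, y_pair = rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 100)  # complete pairs
x_only, y_only = rng.normal(1.0, 1.0, 40), rng.normal(0.0, 1.0, 30)    # partial data
stat = weighted_mean_diff(x_pair, y_pair, x_only, y_only, w=0.7)       # true shift is 1
```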

In this paper, we address the computational feasibility of the class of decision-theoretic models referred to as adversarial risk analyses (ARA). These are models in which a decision must be made with consideration for how an intelligent adversary may behave, where the adversary's decision-making process is unknown and is elicited by analyzing the adversary's decision problem using priors on the adversary's utility function and beliefs. The motivation for this research was to develop a computational algorithm that can be applied across a broad range of ARA models; to the best of our knowledge, no such algorithm currently exists. Using a two-person sequential model, we incrementally increase the size of the model and develop a simulation-based approximation of the true optimum where an exact solution is computationally impractical. In particular, we begin with a relatively large decision space by considering a theoretically continuous space that must be discretized. We then incrementally increase the number of strategic objectives, which causes the decision space to grow exponentially. The problem is exacerbated by the presence of an intelligent adversary who must also solve an exponentially large decision problem according to some unknown decision-making process. Nevertheless, using a stylized example that can be solved analytically, we show that our algorithm not only solves large ARA models quickly but also accurately identifies the true optimal solution. Furthermore, the algorithm is sufficiently general that it can be applied to any ARA model with a large, yet finite, decision space.

Semi-Supervised Learning (SSL) approaches have been an influential framework for using unlabeled data when there is not a sufficient amount of labeled data available during training. SSL methods based on Convolutional Neural Networks (CNNs) have recently provided successful results on standard benchmark tasks such as image classification. In this work, we consider the general setting of the SSL problem in which the labeled and unlabeled data come from the same underlying probability distribution. We propose a new approach that adopts an Optimal Transport (OT) technique, serving as a metric of similarity between discrete empirical probability measures, to provide pseudo-labels for the unlabeled data, which can then be used in conjunction with the initial labeled data to train the CNN model in an SSL manner. We evaluate and compare our proposed method with state-of-the-art SSL algorithms on standard datasets to demonstrate the superiority and effectiveness of our algorithm.
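One way OT-based pseudo-labelling can work is to transport unlabeled points onto class representatives under uniform class marginals. The toy sketch below solves this as a balanced assignment problem (a special case of discrete OT with uniform marginals) on raw features, whereas the paper operates on CNN representations; all names and the divisibility assumption are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def ot_pseudo_labels(z_labeled, y_labeled, z_unlabeled, n_classes):
    """Toy OT-style pseudo-labelling: transport unlabeled points onto
    class centroids under uniform marginals, solved as a balanced
    assignment problem.  Assumes len(z_unlabeled) is divisible by
    n_classes.
    """
    centroids = np.stack([z_labeled[y_labeled == k].mean(axis=0)
                          for k in range(n_classes)])
    cost = cdist(z_unlabeled, centroids)        # point-to-centroid distances
    quota = len(z_unlabeled) // n_classes       # uniform class marginal
    slots = np.repeat(cost, quota, axis=1)      # `quota` slots per class
    rows, cols = linear_sum_assignment(slots)   # min-cost balanced matching
    pseudo = np.empty(len(z_unlabeled), dtype=int)
    pseudo[rows] = cols // quota                # slot index -> class label
    return pseudo

rng = np.random.default_rng(3)
z_l = np.concatenate([rng.normal(0, 0.3, (5, 2)), rng.normal(5, 0.3, (5, 2))])
y_l = np.array([0] * 5 + [1] * 5)
z_u = np.concatenate([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])
labels = ot_pseudo_labels(z_l, y_l, z_u, n_classes=2)
```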

Recommender systems are central to modern online platforms, but a popular concern is that they may be pulling society in dangerous directions (e.g., towards filter bubbles). However, a challenge with measuring the effects of recommender systems is how to compare user outcomes under these systems to outcomes under a credible counterfactual world without such systems. We take a model-based approach to this challenge, introducing a dichotomy of process models that we can compare: (1) a "recommender" model describing a generic item-matching process under a personalized recommender system and (2) an "organic" model describing a baseline counterfactual where users search for items without the mediation of any system. Our key finding is that the recommender and organic models result in dramatically different outcomes at both the individual and societal level, as supported by theorems and simulation experiments with real data. The two process models also induce different trade-offs during inference, where standard performance-improving techniques such as regularization/shrinkage have divergent effects. Shrinkage improves the mean squared error of matches in both settings, as expected, but at the cost of less diverse (less radical) items chosen in the recommender model but more diverse (more radical) items chosen in the organic model. These findings provide a formal language for how recommender systems may be fundamentally altering how we search for and interact with content, in a world increasingly mediated by such systems.

Performing causal inference in observational studies requires the assumption that confounding variables are correctly adjusted for. G-computation methods are often used in these scenarios, with several recent proposals using Bayesian versions of g-computation. In settings with few confounders, standard models can be employed; however, as the number of confounders increases, these models become less feasible because fewer observations are available for each unique combination of confounding variables. In this paper, we propose a new model for estimating treatment effects in observational studies that incorporates both parametric and nonparametric outcome models. By conceptually splitting the data, we can combine these models while maintaining a conjugate framework, allowing us to avoid the use of MCMC methods. Approximations using the central limit theorem and random sampling allow our method to scale to high-dimensional confounders while maintaining computational efficiency. We illustrate the model using carefully constructed simulation studies, and we compare its computational cost to that of other benchmark models.
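For context, plain (non-Bayesian) g-computation with a parametric outcome model can be sketched as: fit an outcome regression, then contrast predictions with treatment set to 1 versus 0 for everyone. This is the standard baseline, not the paper's split-data conjugate model:

```python
import numpy as np

def g_computation_ate(X, t, y):
    """Parametric g-computation with a linear outcome model: regress y on
    (1, t, X) by least squares, then average the difference between
    predictions with everyone treated and everyone untreated.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), t, X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    d1 = np.column_stack([np.ones(n), np.ones(n), X])   # set t = 1 for all
    d0 = np.column_stack([np.ones(n), np.zeros(n), X])  # set t = 0 for all
    return float(np.mean(d1 @ beta - d0 @ beta))

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
t = (X[:, 0] + rng.normal(size=500) > 0).astype(float)  # confounded assignment
y = 2.0 * t + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=500)
ate = g_computation_ate(X, t, y)                        # true ATE is 2
```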

This paper focuses on the expected difference in a borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence their estimation error can be substantial. We therefore propose an alternative approach to constructing estimators such that the error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of the classical and the proposed estimators in estimating the causal quantities. The comparison is conducted across a wide range of models, including linear regression models, tree-based models, and neural-network-based models, under simulated datasets that exhibit different levels of causality, degrees of nonlinearity, and distributional properties. Most importantly, we apply our approach to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.

- Christina Lee Yu
- Siddhartha Chib, Minchul Shin, Anna Simoni
- Nikolaj Thams, Niels Richard Hansen
- Yuntong Li, Brent J. Shelton, William St Clair, Heidi L. Weiss, John L. Villano, Arnold J. Stromberg, Chi Wang, Li Chen
- Serina Chang, Johan Ugander
- Yiyan Huang, Cheuk Hang Leung, Xing Yan, Qi Wu, Nanbo Peng, Dongdong Wang, Zhixiang Huang