Title: Two-sample comparison through additive tree models for density ratios.
Abstract: The ratio of two densities characterizes their differences. We consider the two-sample comparison problem through estimating the density ratio given i.i.d. observations from each distribution. We propose additive tree models for the density ratio, along with efficient training algorithms based on a new loss function called the balancing loss. With this loss, additive tree models for the density ratio can be trained using several algorithms originally designed for supervised learning. First, they can be trained from an optimization perspective through boosting algorithms, including forward-stagewise fitting and gradient boosting. Moreover, because the balancing loss resembles an exponential-family kernel, it can serve as a pseudo-likelihood for which conjugate priors exist, thereby enabling effective generalized Bayesian inference on the density ratio using backfitting samplers designed for Bayesian additive regression trees (BART). This generalized Bayesian strategy provides uncertainty quantification on the inferred density ratio, which is critical for applications involving high-dimensional and complex distributions, where the uncertainty given limited data is often substantial. We provide insight into the balancing loss through its close connection to the exponential loss in binary classification and to the variational form of the f-divergence, in particular that of the squared Hellinger distance. Our numerical experiments demonstrate the accuracy and computational efficiency of the proposed approach, along with its unique capabilities in uncertainty quantification. We illustrate our method in a case study assessing the quality of generative models for microbiome compositional data.
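The abstract notes a close connection between the balancing loss and losses used in binary classification. As background for that connection, the sketch below illustrates the standard "classifier trick" for density-ratio estimation with boosted trees; this is not the paper's balancing-loss method, and all names (scikit-learn's GradientBoostingClassifier, the toy Gaussian setup) are illustrative assumptions. Labeling samples from p as y=1 and samples from q as y=0, a probabilistic classifier yields p(x)/q(x) ≈ (n_q/n_p) · P(y=1|x) / P(y=0|x).

```python
# Illustrative sketch: density-ratio estimation via a boosted-tree classifier.
# This demonstrates the classifier trick referenced above, NOT the paper's
# balancing-loss method or its generalized Bayesian (BART-based) inference.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
x_p = rng.normal(loc=1.0, scale=1.0, size=(n, 1))   # i.i.d. draws from p = N(1, 1)
x_q = rng.normal(loc=-1.0, scale=1.0, size=(n, 1))  # i.i.d. draws from q = N(-1, 1)

# Pool the two samples and label them by origin.
X = np.vstack([x_p, x_q])
y = np.concatenate([np.ones(n), np.zeros(n)])

clf = GradientBoostingClassifier(n_estimators=100, max_depth=2).fit(X, y)

def density_ratio(x):
    """Estimate p(x)/q(x) via the odds of the fitted classifier."""
    prob = clf.predict_proba(np.atleast_2d(x))[:, 1]
    prob = np.clip(prob, 1e-6, 1 - 1e-6)  # guard against degenerate odds
    return prob / (1 - prob)              # n_q/n_p = 1 here, so no rescaling

# For these Gaussians the true ratio is p(x)/q(x) = exp(2x):
# large where p dominates (x = 2), small where q dominates (x = -2).
print(density_ratio([[2.0]]), density_ratio([[-2.0]]))
```

This classifier-based estimator gives point estimates only; the abstract's motivation for the generalized Bayesian treatment is precisely that such point estimates carry no uncertainty quantification.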