Consider the problem of minimizing functions that are Lipschitz and strongly convex, but not necessarily differentiable. We prove that after Tsteps of stochastic gradient descent, the error of the final iterate is O(log(T)/T) with high probability. We also construct a function from this class for which the error of the final iterate of deterministic gradient descent is (log(T)/T). This shows that the upper bound is tight and that, in this setting, the last iterate of stochastic gradient descent has the same general error rate (with high probability) as deterministic gradient descent. This resolves both open questions posed by Shamir (2012). A n intermediate step of our analysis proves that the suffix averaging method achieves error O(1/T) with high probability, which is optimal (for any first-order optimization method). This improves results of Rakhlin (2012) and Hazan and Kale (2014), both of which achieved error O(1/T), but only in expectation, and achieved a high probability error bound of O(loglog(T)/T), which is suboptimal. We prove analogous results for functions that are Lipschitz and convex, but not necessarily strongly convex or differentiable. After T steps of stochastic gradient descent, the error of the final iterate is O(log(T)/sqrt(T)) with high probability, and there exists a function for which the error of the final iterate of deterministic gradient descent is (log(T)/sqrt(T)).

Compressed sensing theory explains why Lasso programs recover structured high-dimensional signals with minimax order-optimal error. Yet, the optimal choice of the program’s governing parameter is often unknown in practice. It is still unclear how variation of the governing parameter impacts recovery error in compressed sensing, which is otherwise provably stable and robust. We establish a novel notion of instability in Lasso programs when the measurement matrix is identity. This is the proximal denoising setup. We prove asymptotic cusp-like behaviour of the risk as a function of the parameter choice, and illustrate the theory with numerical simulations. For example, a 0.1% underestimate of a Lasso parameter can increase the error significantly; and a 50% underestimate can cause the error to increase by a factor of 1e9. We hope that revealing parameter instability regimes of Lasso programs helps to inform a practitioner’s choice.

In this paper we generalize the 1-bit matrix completion problem to higher order tensors. We prove that when r = O(1), a bounded rank-r, order-d, N x N x…x N, tensor T can be estimated efficiently by only m = O(ND) binary measurements by regularizing its max-qnorm and M-norm as surrogates for its rank. We prove that similar to the matrix case, i.e., when the sample complexity of recovering a low-rank tensor from 1-bit measurements of a subset of its entries is the same as recovering it from unquantized measurements. Moreover, we show the advantage of using 1-bit tensor completion over matricization both theoretically and numerically. Specifically, we show how the 1-bit measurement model can be used for context-aware recommender systems.

We prove that (kd^2/eps^2) samples are necessary and sufficient for learning a mixture of k Gaussians in R^d, up to error eps in total variation distance. This improves both the known upper bound and lower bound for this problem. For mixtures of axis-aligned Gaussians, we show that O(kd/eps^2) samples suffice, matching a known lower bound. Moreover, these results hold in an agnostic learning setting as well.
The upper bound is based on a novel technique for distribution learning based on a notion of sample compression. Any class of distributions that allows such a sample compression scheme can also be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in R^d has an efficient sample compression.

We analyze low rank tensor completion (TC) using noisy measurements of a subset of the tensor. Assuming a rank-r, order-d, N x N x…x N tensor where r=O(1), the best sampling complexity that was achieved is O(N^(d/2)), which is obtained by solving a tensor nuclear-norm minimization problem. However, this bound is significantly larger than the number of free variables in a low rank tensor which is O(d N). In this paper, we show that by using an atomic-norm whose atoms are rank-1 sign tensors, one can obtain a sample complexity of O(dN). Moreover, we generalize the matrix max-norm definition to tensors, which results in a max-quasi-norm (max-qnorm) whose unit ball has small Rademacher complexity. We prove that solving a constrained least squares estimation using either the convex atomic-norm or the nonconvex max-qnorm results in optimal sample complexity for the problem of low-rank tensor completion. Furthermore, we show that these bounds are nearly minimax rate-optimal. We also provide promising numerical results for max-qnorm constrained tensor completion, showing improved recovery results compared to matricization and alternating least squares.

We study matrix completion with non-uniform, deterministic sampling patterns. We introduce a computable parameter, which is a function of the sampling pattern, and show that if this parameter is small, then we may recover missing entries of the matrix, with appropriate weights. We theoretically analyze a simple and well-known recovery method, which simply projects the (zero-padded) subsampled matrix onto the set of low-rank matrices. We show that under non-uniform deterministic sampling, this method yields a biased solution, and we propose an algorithm to de-bias it. Numerical simulations demonstrate that de-biasing significantly improves the estimate. However, when the observations are noisy, the error of this method can be sub-optimal when the sampling is highly non-uniform. To remedy this, we suggest an alternative which is based on projection onto the max-norm ball whose robustness to noise tolerates arbitrarily non-uniform sampling. Finally, we analyze convex optimization in this framework.

The restricted isometry property (RIP) for design matrices gives guarantees for optimal recovery in sparse linear models. It is of high interest in compressed sensing and statistical learning. This property is particularly important for computationally efficient recovery methods. As a consequence, even though it is in general NP-hard to check that RIP holds, there have been substantial efforts to find tractable proxies for it. These would allow the construction of RIP matrices and the polynomial-time verification of RIP given an arbitrary matrix. We consider the framework of average-case certifiers, that never wrongly declare that a matrix is RIP, while being often correct for random instances. While there are such functions which are tractable in a suboptimal parameter regime, we show that this is a computationally hard task in any better regime. Our results are based on a new, weaker assumption on the problem of detecting dense subgraphs.

This letter is focused on quantized Compressed Sensing, assuming that Lasso is used for signal estimation. Leveraging recent work, we provide a framework to optimize the quantization function and show that the recovered signal converges to the actual signal at a quadratic rate as a function of the quantization level. We show that when the number of observations is high, this method of quantization gives a significantly better recovery rate than standard Lloyd-Max quantization. We support our theoretical analysis with numerical simulations.

One-bit compressive sensing has extended the scope of sparse recovery by showing that sparse signals can be accurately reconstructed even when their linear measurements are subject to the extreme quantization scenario of binary samples—only the sign of each linear measurement is maintained. Existing results in one-bit compressive sensing rely on the assumption that the signals of interest are sparse in some fixed orthonormal basis. However, in most practical applications, signals are sparse with respect to an overcomplete dictionary, rather than a basis. There has already been a surge of activity to obtain recovery guarantees under such a generalized sparsity model in the classical compressive sensing setting. Here, we extend the one-bit framework to this important model, providing a unified theory of one-bit compressive sensing under dictionary sparsity. Specifically, we analyze several different algorithms—based on convex programming and on hard thresholding—and show that, under natural assumptions on the sensing matrix (satisfied by Gaussian matrices), these algorithms can efficiently recover analysis-dictionary-sparse signals in the one-bit model.

Let A be an isotropic, sub-gaussian m?n matrix. We prove that the process Z_x:=||Ax||_2 –sqrt{m}||x||_2 has sub-gaussian increments. Using this, we show that for any bounded subset T of R^n, the deviation of ||Ax||_2 around its mean is uniformly bounded by the Gaussian complexity of T. We also prove a local version of this theorem, which allows for unbounded sets. These theorems have various applications, some of which are reviewed in this paper. In particular, we give a new result regarding model selection in the constrained linear model.

We give a practical random mapping that takes any set of documents represented as vectors in Euclidean space and then maps them to a sparse subset of the Hamming cube while retaining ordering of inter-vector inner products. Once represented in the sparse space, it is natural to index documents using commercial text-based search engines which are specialized to take advantage of this sparse and discrete structure for large-scale document retrieval. We give a theoretical analysis of the mapping scheme, characterizing exact asymptotic behavior and also giving non-asymptotic bounds which we verify through numerical simulations. We balance the theoretical treatment with several practical considerations; these allow substantial speed up of the method. We further illustrate the use of this method on search over two real data sets: a corpus of images represented by their color histograms, and a corpus of daily stock market index values.

We study the problem of signal estimation from non-linear observations when the signal belongs to a low-dimensional set buried in a high-dimensional space. A rough heuristic often used in practice postulates that non-linear observations may be treated as noisy linear observations, and thus the signal may be estimated using the generalized Lasso. This is appealing because of the abundance of efficient, specialized solvers for this program. Just as noise may be diminished by projecting onto the lower dimensional space, the error from modeling non-linear observations with linear observations will be greatly reduced when using the signal structure in the reconstruction. We allow general signal structure, only assuming that the signal belongs to some set K in R^n. Our theory tolerates general non-linearity, which may be discontinuous, not one-to-one and even unknown. We assume a random Gaussian model for the measurement matrix, but allow the rows to have an unknown covariance matrix. As special cases of our results, we recover near-optimal theory for noisy linear observations, and also give the first theoretical accuracy guarantee for 1-bit compressed sensing with unknown covariance matrix of the measurement vectors.

Many applications have benefited remarkably from low-dimensional models in the recent decade. The fact that many signals, though high dimensional, are intrinsically low dimensional has given the possibility to recover them stably from a relatively small number of their measurements. For example, in compressed sensing with the standard (synthesis) sparsity prior and in matrix completion, the number of measurements needed is proportional (up to a logarithmic factor) to the signal’s manifold dimension.
Recently, a new natural low-dimensional signal model has been proposed: the cosparse analysis prior. In the noiseless case, it is possible to recover signals from this model, using a combinatorial search, from a number of measurements proportional to the signal’s manifold dimension. However, if we ask for stability to noise or an efficient (polynomial complexity) solver, all the existing results demand a number of measurements which is far removed from the manifold dimension, sometimes far greater. Thus, it is natural to ask whether this gap is a deficiency of the theory and the solvers, or if there exists a real barrier in recovering the cosparse signals by relying only on their manifold dimension. In this work we show that the latter is true both in theory and in simulations. This gives a practical counter-example to the growing literature on compressed acquisition of signals based on manifold dimension.

Binary measurements arise naturally in a variety of statistical and engineering applications. They may be inherent to the problem—e.g., in determining the relationship between genetics and the presence or absence of a disease—or they may be a result of extreme quantization. In one-bit compressed sensing it has recently been shown that the number of one-bit measurements required for signal estimation mirrors that of unquantized compressed sensing. Indeed, s-sparse signals in R^n can be estimated (up to normalization) from (s log(n/s)) one-bit measurements. Nevertheless, controlling the precise accuracy of the error estimate remains an open challenge. In this paper, we focus on optimizing the decay of the error as a function of the oversampling factor lambda:=m/(s log(n/s)), where m is the number of measurements. It is known that the error in reconstructing sparse signals from standard one-bit measurements is bounded below by O(1/lambda). Without adjusting the measurement procedure, reducing this polynomial error decay rate is impossible. However, we show that an adaptive choice of the thresholds used for quantization may lower the error rate to exp(-Omega(lambda)). This improves upon guarantees for other methods of adaptive thresholding as proposed in Sigma-Delta quantization. We develop a general recursive strategy to achieve this exponential decay and two specific polynomial-time algorithms which fall into this framework, one based on convex programming and one on hard thresholding. This work is inspired by the one-bit compressed sensing model, in which the engineer controls the measurement procedure. Nevertheless, the principle is extendable to signal reconstruction problems in a variety of binary statistical models as well as statistical estimation problems like logistic regression.

Consider measuring an n-dimensional vector x through the inner product with several measurement vectors, a_1, a_2, …, a_m. It is common in both signal processing and statistics to assume the linear response model y_i = <a_i, x> + e_i, where e_i is a noise term. However, in practice the precise relationship between the signal x and the observations y_i may not follow the linear model, and in some cases it may not even be known. To address this challenge, in this paper we propose a general model where it is only assumed that each observation y_i may depend on a_i only through <a_i, x>. We do not assume that the dependence is known. This is a form of the semiparametric-single index model, and it includes the linear model as well as many forms of the generalized linear model as special cases. We further assume that the signal x has some structure, and we formulate this as a general assumption that x belongs to some known (but arbitrary) feasible set K. We carefully detail the benefit of using the signal structure to improve estimation. The theory is based on the mean width of K, a geometric parameter which can be used to understand its effective dimension in estimation problems. We determine a simple, efficient two-step procedure for estimating the signal based on this model — a linear estimation followed by metric projection onto K. We give general conditions under which the estimator is minimax optimal up to a constant. This leads to the intriguing conclusion that in the high noise regime, an unknown non-linearity in the observations does not significantly reduce one’s ability to determine the signal, even when the non-linearity may be non-invertible. Our results may be specialized to understand the effect of non-linearities in compressed sensing.

1-bit matrix completion
with Mark Davenport, Ewout van den Berg, and Mary Wootters. Information and Inference, 2014. (Nominated for best paper award.) Code available on Mark’s webpage.

In this paper we develop a theory of matrix completion for the extreme case of noisy 1-bit observations. Instead of observing a subset of the real-valued entries of a matrix M, we obtain a small number of binary (1-bit) measurements generated according to a probability distribution determined by the real-valued entries of M. The central question we ask is whether or not it is possible to obtain an accurate estimate of M from this data. In general this would seem impossible, but we show that the maximum likelihood estimate under a suitable constraint returns an accurate estimate of M when ||M||_infinity <= alpha, and rank(M) <= r. If the log-likelihood is a concave function (e.g., the logistic or probit observation models), then we can obtain this maximum likelihood estimate by optimizing a convex program. In addition, we also show that if instead of recovering M we simply wish to obtain an estimate of the distribution generating the 1-bit measurements, then we can eliminate the requirement that ||M||_infinity <= \alpha. For both cases, we provide lower bounds showing that these estimates are near-optimal.

In one-bit compressed sensing, previous results state that sparse signals may be robustly recovered when the measurements are taken using Gaussian random vectors. In contrast to standard compressed sensing, these results are not extendable to natural non-Gaussian distributions without further assumptions, as can be demonstrated by simple counter-examples. We show that approximately sparse signals, which also satisfy a mild infinity-norm constraint, can be accurately reconstructed from single-bit measurements sampled according to a sub-gaussian distribution, and the reconstruction comes as the solution to a convex program.

This paper develops theoretical results regarding noisy 1-bit compressed sensing and sparse binomial regression. We show that a single convex program gives an accurate estimate of the signal, or coefficient vector, for both of these models. We demonstrate that an s-sparse signal in R^n can be accurately estimated from m = O(slog(n/s)) single-bit measurements using a simple convex program. This remains true even if almost half of the measurements are randomly flipped. Worst-case (adversarial) noise can also be accounted for, and uniform results that hold for all sparse inputs are derived as well. In the terminology of sparse logistic regression, we show that O(slog(n/s)) Bernoulli trials are sufficient to estimate a coefficient vector in R^n which is approximately s-sparse. Moreover, the same convex program works for virtually all generalized linear models, in which the link function may be unknown. To our knowledge, these are the first results that tie together the theory of sparse logistic regression to 1-bit compressed sensing. Our results apply to general signal structures aside from sparsity; one only needs to know the size of the set K where signals reside. The size is given by the mean width of K, a computable quantity whose square serves as a robust extension of the dimension.

Given a subset K of the unit Euclidean sphere, we estimate the minimal number m = m(K) of hyperplanes that generate a uniform tessellation of K, in the sense that the fraction of the hyperplanes separating any pair x, y in K is nearly proportional to the Euclidean distance between x and y. Random hyperplanes prove to be almost ideal for this problem; they achieve the almost optimal bound m = O(w(K)^2) where w(K) is the Gaussian mean width of K. Using the map that sends x in K to the sign vector with respect to the hyperplanes, we conclude that every bounded subset K of R^n embeds into the Hamming cube {-1, 1}^m with a small distortion in the Gromov-Haussdorf metric. Since for many sets K one has m = m(K) << n, this yields a new discrete mechanism of dimension reduction for sets in Euclidean spaces.

We give the first computationally tractable and almost optimal solution to the problem of one-bit compressed sensing, showing how to accurately recover an s-sparse vector x in R^n from the signs of O(s log^2 (n/s)) random linear measurements of x. The recovery is achieved by a simple linear program. This result extends to approximately sparse vectors x. Our result is universal in the sense that with high probability, one measurement scheme will successfully recover all sparse vectors simultaneously. The argument is based on solving an equivalent geometric problem on random hyperplane tessellations.

Low-rank matrix recovery addresses the problem of recovering an unknown low-rank matrix from few linear measurements. Nuclear-norm minimization is a tractible approach with a recent surge of strong theoretical backing. Analagous to the theory of compressed sensing, these results have required random measurements. For example, m >= Cnr Gaussian measurements are sufficient to recover any rank-r n x n matrix with high probability. In this paper we address the theoretical question of how many measurements are needed via any method whatsoever — tractible or not. We show that for a family of random measurement ensembles, m >= 4nr – 4r^2 measurements are sufficient to guarantee that no rank-2r matrix lies in the null space of the measurement operator with probability one. This is a necessary and sufficient condition to ensure uniform recovery of all rank-r matrices by rank minimization. Furthermore, this value of $m$ precisely matches the dimension of the manifold of all rank-2r matrices. We also prove that for a fixed rank-r matrix, m >= 2nr – r^2 + 1 random measurements are enough to guarantee recovery using rank minimization. These results give a benchmark to which we may compare the efficacy of nuclear-norm minimization.

This paper introduces a simple and very general theory of compressive sensing. In this theory, the sensing mechanism simply selects sensing vectors independently at random from a probability distribution F; it includes all models – e.g. Gaussian, frequency measurements – discussed in the literature, but also provides a framework for new measurement strategies as well. We prove that if the probability distribution F obeys a simple incoherence property and an isotropy property, one can faithfully recover approximately sparse signals from a minimal number of noisy measurements. The novelty is that our recovery results do not require the restricted isometry property (RIP) – they make use of a much weaker notion – or a random model for the signal. As an example, the paper shows that a signal with s nonzero entries can be faithfully recovered from about s log n Fourier coefficients that are contaminated with noise.

Testing for the significance of a subset of regression coefficients in a linear model, a staple of statistical analysis, goes back at least to the work of Fisher who introduced the analysis ofvariance (ANOVA). We study this problem under the assumption that the coefficient vector is sparse, a common situation in modern high-dimensional settings. Suppose we have p covariates and that under the alternative, the response only depends upon on the order of p^(1-alpha) of those, 0 < alpha < 1. Under moderate sparsity levels, i.e. 0 < alpha < 1/2, we show that ANOVA is essentially optimal under some conditions on the design. This is no longer the case under strong sparsity constraints, i.e. alpha > 1/2. In such settings, a multiple comparison procedure is often preferred and we establish its optimality when alpha > 3/4. However, these two very popular methods are suboptimal, and sometimes powerless, under moderately strong sparsity where 1/2 < alpha < 3/4. We suggest a method based on the Higher Criticism that is powerful in the whole range alpha > 1/2. This optimality property is true for a variety of designs, including the classical (balanced) multi-way designs and more modern “p > n” designs arising in genetics and signal processing. In addition to the standard fixed effects model, we establish similar results for a random effects model where the nonzero coefficients of the regression vector are normallydistributed.

This paper presents several novel theoretical results regarding the recovery of a low-rank matrix from just a few measurements consisting of linear combinations of the matrix entries. We show that properly constrained nuclear-norm minimization stably recovers a low-rank matrix from a constant number of noisy measurements per degree of freedom; this seems to be the first result of this nature. Further, the recovery error from noisy data is within a constant of three targets: 1) the minimax risk, 2) an oracle error that would be available if the column space of the matrix were known, and 3) a more adaptive oracle error which would be available with the knowledge of the column space corresponding to the part of the matrix that stands above the noise. Lastly, the error bounds regarding low-rank matrices are extended to provide an error bound when the matrix has full rank with decaying singular values. The analysis in this paper is based on the restricted isometry property (RIP) introduced in [6] for vectors, and in [22] for matrices.

On the heels of compressed sensing, a remarkable new field has very recently emerged. This field addresses a broad range of problems of significant practical interest, namely, the recovery of a data matrix from what appears to be incomplete, and perhaps even corrupted, information. In its simplest form, the problem is to recover a matrix from a small sample of its entries, and comes up in many areas of science and engineering including collaborative filtering, machine learning, control, remote sensing, and computer vision to name a few.

This paper surveys the novel literature on matrix completion, which shows that under some suitable conditions, one can recover an unknown low-rank matrix from a nearly minimal set of entries by solving a simple convex optimization problem, namely, nuclear-norm minimization subject to data constraints. Further, this paper introduces novel results showing that matrix completion is provably accurate even when the few observed entries are corrupted with a small amount of noise. A typical result is that one can recover an unknown nn matrix of low rank r from just about nr log^2 n noisy samples with an error which is proportional to the noise level. We present numerical results which complement our quantitative analysis and show that, in practice, nuclear norm minimization accurately fills in the many missing entries of large low-rank matrices from just a few noisy samples. Some analogies between matrix completion and compressed sensing are discussed throughout.

We consider the fundamental problem of estimating the mean of a vector y=X beta+z, where X is an n by p design matrix in which one can have far more variables than observations, and z is a stochastic error term, the so-called “p>n” setup. When beta is sparse, or, more generally, when there is a sparse subset of covariates providing a close approximation to the unknown mean vector, we ask whether or not it is possible to accurately estimate X beta using a computationally tractable algorithm.

We show that, in a surprisingly wide range of situations, the lasso happens to nearly select the best subset of variables. Quantitatively speaking, we prove that solving a simple quadratic program achieves a squared error within a logarithmic factor of the ideal mean squared error that one would achieve with an oracle supplying perfect information about which variables should and should not be included in the model. Interestingly, our results describe the average performance of the lasso; that is, the performance one can expect in an vast majority of cases where X beta is a sparse or nearly sparse superposition of variables, but not in all cases.

Our results are nonasymptotic and widely applicable, since they simply require that pairs of predictor variables are not too collinear.

The importance of sparse signal structures has been recognized in a plethora of applications ranging from medical imaging to group disease testing to radar technology. It has been shown in practice that various signals of interest may be (approximately) sparsely modeled, and that sparse modeling is often beneficial, or even indispensable to signal recovery. Alongside an increase in applications, a rich theory of sparse and compressible signal recovery has recently been developed under the names compressed sensing (CS) and sparse approximation (SA). This revolutionary research has demonstrated that many signals can be recovered from severely undersampled measurements by taking advantage of their inherent low-dimensional structure. More recently, an offshoot of CS and SA has been a focus of research on other low-dimensional signal structures such as matrices of low rank. Low-rank matrix recovery (LRMR) is demonstrating a rapidly growing array of important applications such as quantum state tomography, triangulation from incomplete distance measurements, recommender systems (e.g., the Netflix problem), and system identification and control. In this dissertation, we examine CS, SA, and LRMR from a theoretical perspective. We consider a variety of different measurement and signal models, both random and deterministic, and mainly ask two questions. How many measurements are necessary? How large is the recovery error? We give theoretical lower bounds for both of these questions, including oracle and minimax lower bounds for the error. However, the main emphasis of the thesis is to demonstrate the efficacy of convex optimization—in particular l1 and nuclear-norm minimization based programs—in CS, SA, and LRMR. We derive upper bounds for the number of measurements required and the error derived by convex optimization, which in many cases match the lower bounds up to constant or logarithmic factors. The majority of these results do not require the restricted isometry property (RIP), a ubiquitous condition in the literature.

Tight Analyses for Non-Smooth Stochastic Gradient Descent with Nicholas Harvey, Chris Liaw, and Sikander Randhawa.

Consider the problem of minimizing functions that are Lipschitz and strongly convex, but not necessarily differentiable. We prove that after Tsteps of stochastic gradient descent, the error of the final iterate is O(log(T)/T) with high probability. We also construct a function from this class for which the error of the final iterate of deterministic gradient descent is (log(T)/T). This shows that the upper bound is tight and that, in this setting, the last iterate of stochastic gradient descent has the same general error rate (with high probability) as deterministic gradient descent. This resolves both open questions posed by Shamir (2012). A n intermediate step of our analysis proves that the suffix averaging method achieves error O(1/T) with high probability, which is optimal (for any first-order optimization method). This improves results of Rakhlin (2012) and Hazan and Kale (2014), both of which achieved error O(1/T), but only in expectation, and achieved a high probability error bound of O(loglog(T)/T), which is suboptimal. We prove analogous results for functions that are Lipschitz and convex, but not necessarily strongly convex or differentiable. After T steps of stochastic gradient descent, the error of the final iterate is O(log(T)/sqrt(T)) with high probability, and there exists a function for which the error of the final iterate of deterministic gradient descent is (log(T)/sqrt(T)).

Parameter instability regimes for sparse proximal denoising problems with Aaron Berk and Ozgur Yilmaz.

Compressed sensing theory explains why Lasso programs recover structured high-dimensional signals with minimax order-optimal error. Yet, the optimal choice of the program’s governing parameter is often unknown in practice. It is still unclear how variation of the governing parameter impacts recovery error in compressed sensing, which is otherwise provably stable and robust. We establish a novel notion of instability in Lasso programs when the measurement matrix is identity. This is the proximal denoising setup. We prove asymptotic cusp-like behaviour of the risk as a function of the parameter choice, and illustrate the theory with numerical simulations. For example, a 0.1% underestimate of a Lasso parameter can increase the error significantly; and a 50% underestimate can cause the error to increase by a factor of 1e9. We hope that revealing parameter instability regimes of Lasso programs helps to inform a practitioner’s choice.

Learning tensors from partial binary measurements with Navid Ghadermarzy and Ozgur Yilmaz.

IEEE, Transactions on Signal Processing,2018.In this paper we generalize the 1-bit matrix completion problem to higher order tensors. We prove that when r = O(1), a bounded rank-r, order-d, N x N x…x N, tensor T can be estimated efficiently by only m = O(ND) binary measurements by regularizing its max-qnorm and M-norm as surrogates for its rank. We prove that similar to the matrix case, i.e., when the sample complexity of recovering a low-rank tensor from 1-bit measurements of a subset of its entries is the same as recovering it from unquantized measurements. Moreover, we show the advantage of using 1-bit tensor completion over matricization both theoretically and numerically. Specifically, we show how the 1-bit measurement model can be used for context-aware recommender systems.

Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes with Hassan Ashtiani, Shai Ben-David, Nick Harvey, Christopher Liaw, and Abbas Mehrabian.

Advances in Neural Information Processing Systems (NIPS),2018.Winner of NIPS 2018 Best Paper Award.We prove that (kd^2/eps^2) samples are necessary and sufficient for learning a mixture of k Gaussians in R^d, up to error eps in total variation distance. This improves both the known upper bound and lower bound for this problem. For mixtures of axis-aligned Gaussians, we show that O(kd/eps^2) samples suffice, matching a known lower bound. Moreover, these results hold in an agnostic learning setting as well.

The upper bound is based on a novel technique for distribution learning based on a notion of sample compression. Any class of distributions that allows such a sample compression scheme can also be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in R^d has an efficient sample compression.

Near-optimal sample complexity for convex tensor completion with Navid Ghadermarzy and Ozgur Yilmaz.

We analyze low rank tensor completion (TC) using noisy measurements of a subset of the tensor. Assuming a rank-r, order-d, N x N x…x N tensor where r=O(1), the best sampling complexity that was achieved is O(N^(d/2)), which is obtained by solving a tensor nuclear-norm minimization problem. However, this bound is significantly larger than the number of free variables in a low rank tensor which is O(d N). In this paper, we show that by using an atomic-norm whose atoms are rank-1 sign tensors, one can obtain a sample complexity of O(dN). Moreover, we generalize the matrix max-norm definition to tensors, which results in a max-quasi-norm (max-qnorm) whose unit ball has small Rademacher complexity. We prove that solving a constrained least squares estimation using either the convex atomic-norm or the nonconvex max-qnorm results in optimal sample complexity for the problem of low-rank tensor completion. Furthermore, we show that these bounds are nearly minimax rate-optimal. We also provide promising numerical results for max-qnorm constrained tensor completion, showing improved recovery results compared to matricization and alternating least squares.

De-biasing low-rank projection for matrix completion with Simon Foucart, Deanna Needell and Mary Wootters.

Wavelets and Sparsity XVII, 2017.We study matrix completion with non-uniform, deterministic sampling patterns. We introduce a computable parameter, which is a function of the sampling pattern, and show that if this parameter is small, then we may recover missing entries of the matrix, with appropriate weights. We theoretically analyze a simple and well-known recovery method, which simply projects the (zero-padded) subsampled matrix onto the set of low-rank matrices. We show that under non-uniform deterministic sampling, this method yields a biased solution, and we propose an algorithm to de-bias it. Numerical simulations demonstrate that de-biasing significantly improves the estimate. However, when the observations are noisy, the error of this method can be sub-optimal when the sampling is highly non-uniform. To remedy this, we suggest an alternative which is based on projection onto the max-norm ball whose robustness to noise tolerates arbitrarily non-uniform sampling. Finally, we analyze convex optimization in this framework.

Average-case hardness of RIP certification

with Tengyao Wang and Quentin Berthet.

Advances in Neural Information Processing Systems (NIPS), 2016.The restricted isometry property (RIP) for design matrices gives guarantees for optimal recovery in sparse linear models. It is of high interest in compressed sensing and statistical learning. This property is particularly important for computationally efficient recovery methods. As a consequence, even though it is in general NP-hard to check that RIP holds, there have been substantial efforts to find tractable proxies for it. These would allow the construction of RIP matrices and the polynomial-time verification of RIP given an arbitrary matrix. We consider the framework of average-case certifiers, that never wrongly declare that a matrix is RIP, while being often correct for random instances. While there are such functions which are tractable in a suboptimal parameter regime, we show that this is a computationally hard task in any better regime. Our results are based on a new, weaker assumption on the problem of detecting dense subgraphs.

Optimizing quantization for Lasso recovery?

with Xiaoyi Gu, Shenyinying Tu, Hao-Jun Michael Shi, Mindy Case, and Deanna Needell.

IEEE Signal Processing Letters.This letter is focused on quantized Compressed Sensing, assuming that Lasso is used for signal estimation. Leveraging recent work, we provide a framework to optimize the quantization function and show that the recovered signal converges to the actual signal at a quadratic rate as a function of the quantization level. We show that when the number of observations is high, this method of quantization gives a significantly better recovery rate than standard Lloyd-Max quantization. We support our theoretical analysis with numerical simulations.

One-Bit Compressive Sensing of Dictionary-Sparse Signals

with Rich Baraniuk, Simon Foucart, Deanna Needell, and Mary Wootters.

Information and Inference.One-bit compressive sensing has extended the scope of sparse recovery by showing that sparse signals can be accurately reconstructed even when their linear measurements are subject to the extreme quantization scenario of binary samples—only the sign of each linear measurement is maintained. Existing results in one-bit compressive sensing rely on the assumption that the signals of interest are sparse in some fixed orthonormal basis. However, in most practical applications, signals are sparse with respect to an overcomplete dictionary, rather than a basis. There has already been a surge of activity to obtain recovery guarantees under such a generalized sparsity model in the classical compressive sensing setting. Here, we extend the one-bit framework to this important model, providing a unified theory of one-bit compressive sensing under dictionary sparsity. Specifically, we analyze several different algorithms—based on convex programming and on hard thresholding—and show that, under natural assumptions on the sensing matrix (satisfied by Gaussian matrices), these algorithms can efficiently recover analysis-dictionary-sparse signals in the one-bit model.

A simple tool for bounding the deviation of random matrices on geometric sets

with Christopher Liaw, Abbas Mehrabian, and Roman Vershynin.

Geometric Aspects of Functional Analysis, Lecture Notes in Mathematics, Springer, Berlin.Let A be an isotropic, sub-gaussian m?n matrix. We prove that the process Z_x:=||Ax||_2 –sqrt{m}||x||_2 has sub-gaussian increments. Using this, we show that for any bounded subset T of R^n, the deviation of ||Ax||_2 around its mean is uniformly bounded by the Gaussian complexity of T. We also prove a local version of this theorem, which allows for unbounded sets. These theorems have various applications, some of which are reviewed in this paper. In particular, we give a new result regarding model selection in the constrained linear model.

Random mappings designed for commercial search engines

with Roger Donaldson, Arijit Gupta, and Thomas Reimer.

We give a practical random mapping that takes any set of documents represented as vectors in Euclidean space and then maps them to a sparse subset of the Hamming cube while retaining ordering of inter-vector inner products. Once represented in the sparse space, it is natural to index documents using commercial text-based search engines which are specialized to take advantage of this sparse and discrete structure for large-scale document retrieval. We give a theoretical analysis of the mapping scheme, characterizing exact asymptotic behavior and also giving non-asymptotic bounds which we verify through numerical simulations. We balance the theoretical treatment with several practical considerations; these allow substantial speed up of the method. We further illustrate the use of this method on search over two real data sets: a corpus of images represented by their color histograms, and a corpus of daily stock market index values.

The generalized Lasso with non-linear observations

with Roman Vershynin.

IEEE Transactions on Information Theory.We study the problem of signal estimation from non-linear observations when the signal belongs to a low-dimensional set buried in a high-dimensional space. A rough heuristic often used in practice postulates that non-linear observations may be treated as noisy linear observations, and thus the signal may be estimated using the generalized Lasso. This is appealing because of the abundance of efficient, specialized solvers for this program. Just as noise may be diminished by projecting onto the lower dimensional space, the error from modeling non-linear observations with linear observations will be greatly reduced when using the signal structure in the reconstruction. We allow general signal structure, only assuming that the signal belongs to some set K in R^n. Our theory tolerates general non-linearity, which may be discontinuous, not one-to-one and even unknown. We assume a random Gaussian model for the measurement matrix, but allow the rows to have an unknown covariance matrix. As special cases of our results, we recover near-optimal theory for noisy linear observations, and also give the first theoretical accuracy guarantee for 1-bit compressed sensing with unknown covariance matrix of the measurement vectors.

On the effective measure of dimension in the analysis cosparse model

with Raja Giryes and Roman Vershynin.

IEEE Transactions on Information Theory, 2015.Many applications have benefited remarkably from low-dimensional models in the recent decade. The fact that many signals, though high dimensional, are intrinsically low dimensional has given the possibility to recover them stably from a relatively small number of their measurements. For example, in compressed sensing with the standard (synthesis) sparsity prior and in matrix completion, the number of measurements needed is proportional (up to a logarithmic factor) to the signal’s manifold dimension.

Recently, a new natural low-dimensional signal model has been proposed: the cosparse analysis prior. In the noiseless case, it is possible to recover signals from this model, using a combinatorial search, from a number of measurements proportional to the signal’s manifold dimension. However, if we ask for stability to noise or an efficient (polynomial complexity) solver, all the existing results demand a number of measurements which is far removed from the manifold dimension, sometimes far greater. Thus, it is natural to ask whether this gap is a deficiency of the theory and the solvers, or if there exists a real barrier in recovering the cosparse signals by relying only on their manifold dimension. In this work we show that the latter is true both in theory and in simulations. This gives a practical counter-example to the growing literature on compressed acquisition of signals based on manifold dimension.

Exponential decay of reconstruction error from binary measurements of sparse signals?

with Richard Baraniuk, Simon Foucart, Deanna Needell, and Mary Wootters

Binary measurements arise naturally in a variety of statistical and engineering applications. They may be inherent to the problem—e.g., in determining the relationship between genetics and the presence or absence of a disease—or they may be a result of extreme quantization. In one-bit compressed sensing it has recently been shown that the number of one-bit measurements required for signal estimation mirrors that of unquantized compressed sensing. Indeed, s-sparse signals in R^n can be estimated (up to normalization) from (s log(n/s)) one-bit measurements. Nevertheless, controlling the precise accuracy of the error estimate remains an open challenge. In this paper, we focus on optimizing the decay of the error as a function of the oversampling factor lambda:=m/(s log(n/s)), where m is the number of measurements. It is known that the error in reconstructing sparse signals from standard one-bit measurements is bounded below by O(1/lambda). Without adjusting the measurement procedure, reducing this polynomial error decay rate is impossible. However, we show that an adaptive choice of the thresholds used for quantization may lower the error rate to exp(-Omega(lambda)). This improves upon guarantees for other methods of adaptive thresholding as proposed in Sigma-Delta quantization. We develop a general recursive strategy to achieve this exponential decay and two specific polynomial-time algorithms which fall into this framework, one based on convex programming and one on hard thresholding. This work is inspired by the one-bit compressed sensing model, in which the engineer controls the measurement procedure. Nevertheless, the principle is extendable to signal reconstruction problems in a variety of binary statistical models as well as statistical estimation problems like logistic regression.

High-dimensional estimation with geometric constraints

with Roman Vershynin and Elena Yudovina.

Information and Inference.Consider measuring an n-dimensional vector x through the inner product with several measurement vectors, a_1, a_2, …, a_m. It is common in both signal processing and statistics to assume the linear response model y_i = <a_i, x> + e_i, where e_i is a noise term. However, in practice the precise relationship between the signal x and the observations y_i may not follow the linear model, and in some cases it may not even be known. To address this challenge, in this paper we propose a general model where it is only assumed that each observation y_i may depend on a_i only through <a_i, x>. We do not assume that the dependence is known. This is a form of the semiparametric-single index model, and it includes the linear model as well as many forms of the generalized linear model as special cases. We further assume that the signal x has some structure, and we formulate this as a general assumption that x belongs to some known (but arbitrary) feasible set K. We carefully detail the benefit of using the signal structure to improve estimation. The theory is based on the mean width of K, a geometric parameter which can be used to understand its effective dimension in estimation problems. We determine a simple, efficient two-step procedure for estimating the signal based on this model — a linear estimation followed by metric projection onto K. We give general conditions under which the estimator is minimax optimal up to a constant. This leads to the intriguing conclusion that in the high noise regime, an unknown non-linearity in the observations does not significantly reduce one’s ability to determine the signal, even when the non-linearity may be non-invertible. Our results may be specialized to understand the effect of non-linearities in compressed sensing.

1-bit matrix completion

with Mark Davenport, Ewout van den Berg, and Mary Wootters.

Information and Inference,2014. (Nominated for best paper award.) Code available on Mark’s webpage.In this paper we develop a theory of matrix completion for the extreme case of noisy 1-bit observations. Instead of observing a subset of the real-valued entries of a matrix M, we obtain a small number of binary (1-bit) measurements generated according to a probability distribution determined by the real-valued entries of M. The central question we ask is whether or not it is possible to obtain an accurate estimate of M from this data. In general this would seem impossible, but we show that the maximum likelihood estimate under a suitable constraint returns an accurate estimate of M when ||M||_infinity <= alpha, and rank(M) <= r. If the log-likelihood is a concave function (e.g., the logistic or probit observation models), then we can obtain this maximum likelihood estimate by optimizing a convex program. In addition, we also show that if instead of recovering M we simply wish to obtain an estimate of the distribution generating the 1-bit measurements, then we can eliminate the requirement that ||M||_infinity <= \alpha. For both cases, we provide lower bounds showing that these estimates are near-optimal.

One-bit compressed sensing with non-Gaussian measurements

with Albert Ai, Alex Lapanowski, and Roman Vershynin.

Linear Algebra and its Applications, 2014.In one-bit compressed sensing, previous results state that sparse signals may be robustly recovered when the measurements are taken using Gaussian random vectors. In contrast to standard compressed sensing, these results are not extendable to natural non-Gaussian distributions without further assumptions, as can be demonstrated by simple counter-examples. We show that approximately sparse signals, which also satisfy a mild infinity-norm constraint, can be accurately reconstructed from single-bit measurements sampled according to a sub-gaussian distribution, and the reconstruction comes as the solution to a convex program.

Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach

with Roman Vershynin. ?

IEEE Transactions on Information Theory, 2013.This paper develops theoretical results regarding noisy 1-bit compressed sensing and sparse binomial regression. We show that a single convex program gives an accurate estimate of the signal, or coefficient vector, for both of these models. We demonstrate that an s-sparse signal in R^n can be accurately estimated from m = O(slog(n/s)) single-bit measurements using a simple convex program. This remains true even if almost half of the measurements are randomly flipped. Worst-case (adversarial) noise can also be accounted for, and uniform results that hold for all sparse inputs are derived as well. In the terminology of sparse logistic regression, we show that O(slog(n/s)) Bernoulli trials are sufficient to estimate a coefficient vector in R^n which is approximately s-sparse. Moreover, the same convex program works for virtually all generalized linear models, in which the link function may be unknown. To our knowledge, these are the first results that tie together the theory of sparse logistic regression to 1-bit compressed sensing. Our results apply to general signal structures aside from sparsity; one only needs to know the size of the set K where signals reside. The size is given by the mean width of K, a computable quantity whose square serves as a robust extension of the dimension.

Dimension reduction by random hyperplane tessellations

with Roman Vershynin.

Discrete and Computational Geometry, 2014.Given a subset K of the unit Euclidean sphere, we estimate the minimal number m = m(K) of hyperplanes that generate a uniform tessellation of K, in the sense that the fraction of the hyperplanes separating any pair x, y in K is nearly proportional to the Euclidean distance between x and y. Random hyperplanes prove to be almost ideal for this problem; they achieve the almost optimal bound m = O(w(K)^2) where w(K) is the Gaussian mean width of K. Using the map that sends x in K to the sign vector with respect to the hyperplanes, we conclude that every bounded subset K of R^n embeds into the Hamming cube {-1, 1}^m with a small distortion in the Gromov-Haussdorf metric. Since for many sets K one has m = m(K) << n, this yields a new discrete mechanism of dimension reduction for sets in Euclidean spaces.

One-bit compressed sensing by linear programming

with Roman Vershynin.

Communications on Pure and Applied Mathematics, 2013.We give the first computationally tractable and almost optimal solution to the problem of one-bit compressed sensing, showing how to accurately recover an s-sparse vector x in R^n from the signs of O(s log^2 (n/s)) random linear measurements of x. The recovery is achieved by a simple linear program. This result extends to approximately sparse vectors x. Our result is universal in the sense that with high probability, one measurement scheme will successfully recover all sparse vectors simultaneously. The argument is based on solving an equivalent geometric problem on random hyperplane tessellations.

Uniqueness Conditions for low-rank matrix recovery

with Deanna Needell and Yonina Eldar.

Applied and Computational Harmonic Analysis, 2011.Low-rank matrix recovery addresses the problem of recovering an unknown low-rank matrix from few linear measurements. Nuclear-norm minimization is a tractible approach with a recent surge of strong theoretical backing. Analagous to the theory of compressed sensing, these results have required random measurements. For example, m >= Cnr Gaussian measurements are sufficient to recover any rank-r n x n matrix with high probability. In this paper we address the theoretical question of how many measurements are needed via any method whatsoever — tractible or not. We show that for a family of random measurement ensembles, m >= 4nr – 4r^2 measurements are sufficient to guarantee that no rank-2r matrix lies in the null space of the measurement operator with probability one. This is a necessary and sufficient condition to ensure uniform recovery of all rank-r matrices by rank minimization. Furthermore, this value of $m$ precisely matches the dimension of the manifold of all rank-2r matrices. We also prove that for a fixed rank-r matrix, m >= 2nr – r^2 + 1 random measurements are enough to guarantee recovery using rank minimization. These results give a benchmark to which we may compare the efficacy of nuclear-norm minimization.

A probabilistic and RIPless theory of compressed sensing

with Emmanuel J. Candes.

IEEE Transactions on Information Theory, 2011.This paper introduces a simple and very general theory of compressive sensing. In this theory, the sensing mechanism simply selects sensing vectors independently at random from a probability distribution F; it includes all models – e.g. Gaussian, frequency measurements – discussed in the literature, but also provides a framework for new measurement strategies as well. We prove that if the probability distribution F obeys a simple incoherence property and an isotropy property, one can faithfully recover approximately sparse signals from a minimal number of noisy measurements. The novelty is that our recovery results do not require the restricted isometry property (RIP) – they make use of a much weaker notion – or a random model for the signal. As an example, the paper shows that a signal with s nonzero entries can be faithfully recovered from about s log n Fourier coefficients that are contaminated with noise.

Global testing under sparse alternatives: ANOVA, multiple comparisons and the Higher Criticism

with Emmanuel J. Candes and Ery Arias-Castro.

Annals of Statistics, 2011.Testing for the significance of a subset of regression coefficients in a linear model, a staple of statistical analysis, goes back at least to the work of Fisher who introduced the analysis ofvariance (ANOVA). We study this problem under the assumption that the coefficient vector is sparse, a common situation in modern high-dimensional settings. Suppose we have p covariates and that under the alternative, the response only depends upon on the order of p^(1-alpha) of those, 0 < alpha < 1. Under moderate sparsity levels, i.e. 0 < alpha < 1/2, we show that ANOVA is essentially optimal under some conditions on the design. This is no longer the case under strong sparsity constraints, i.e. alpha > 1/2. In such settings, a multiple comparison procedure is often preferred and we establish its optimality when alpha > 3/4. However, these two very popular methods are suboptimal, and sometimes powerless, under moderately strong sparsity where 1/2 < alpha < 3/4. We suggest a method based on the Higher Criticism that is powerful in the whole range alpha > 1/2. This optimality property is true for a variety of designs, including the classical (balanced) multi-way designs and more modern “p > n” designs arising in genetics and signal processing. In addition to the standard fixed effects model, we establish similar results for a random effects model where the nonzero coefficients of the regression vector are normallydistributed.

Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements

with Emmanuel J. Candes.

IEEE Transactions on Information Theory, 2010.This paper presents several novel theoretical results regarding the recovery of a low-rank matrix from just a few measurements consisting of linear combinations of the matrix entries. We show that properly constrained nuclear-norm minimization stably recovers a low-rank matrix from a constant number of noisy measurements per degree of freedom; this seems to be the first result of this nature. Further, the recovery error from noisy data is within a constant of three targets: 1) the minimax risk, 2) an oracle error that would be available if the column space of the matrix were known, and 3) a more adaptive oracle error which would be available with the knowledge of the column space corresponding to the part of the matrix that stands above the noise. Lastly, the error bounds regarding low-rank matrices are extended to provide an error bound when the matrix has full rank with decaying singular values. The analysis in this paper is based on the restricted isometry property (RIP) introduced in [6] for vectors, and in [22] for matrices.

Matrix completion with noise

with Emmanuel J. Candes.

Proceedings of the IEEE,2010.On the heels of compressed sensing, a remarkable new field has very recently emerged. This field addresses a broad range of problems of significant practical interest, namely, the recovery of a data matrix from what appears to be incomplete, and perhaps even corrupted, information. In its simplest form, the problem is to recover a matrix from a small sample of its entries, and comes up in many areas of science and engineering including collaborative filtering, machine learning, control, remote sensing, and computer vision to name a few.

This paper surveys the novel literature on matrix completion, which shows that under some suitable conditions, one can recover an unknown low-rank matrix from a nearly minimal set of entries by solving a simple convex optimization problem, namely, nuclear-norm minimization subject to data constraints. Further, this paper introduces novel results showing that matrix completion is provably accurate even when the few observed entries are corrupted with a small amount of noise. A typical result is that one can recover an unknown nn matrix of low rank r from just about nr log^2 n noisy samples with an error which is proportional to the noise level. We present numerical results which complement our quantitative analysis and show that, in practice, nuclear norm minimization accurately fills in the many missing entries of large low-rank matrices from just a few noisy samples. Some analogies between matrix completion and compressed sensing are discussed throughout.

Near-ideal model selection by l1 minimization

with Emmanuel J. Candes.

Annals of Statistics, 2009.We consider the fundamental problem of estimating the mean of a vector y=X beta+z, where X is an n by p design matrix in which one can have far more variables than observations, and z is a stochastic error term, the so-called “p>n” setup. When beta is sparse, or, more generally, when there is a sparse subset of covariates providing a close approximation to the unknown mean vector, we ask whether or not it is possible to accurately estimate X beta using a computationally tractable algorithm.

We show that, in a surprisingly wide range of situations, the lasso happens to nearly select the best subset of variables. Quantitatively speaking, we prove that solving a simple quadratic program achieves a squared error within a logarithmic factor of the ideal mean squared error that one would achieve with an oracle supplying perfect information about which variables should and should not be included in the model. Interestingly, our results describe the average performance of the lasso; that is, the performance one can expect in an vast majority of cases where X beta is a sparse or nearly sparse superposition of variables, but not in all cases.

Our results are nonasymptotic and widely applicable, since they simply require that pairs of predictor variables are not too collinear.

International Journal of Radiation Biology, 2005.Thesis:Compressed sensing, sparse approximation, and low-rank matrix estimationOlder conference Publications:Accepted in

International Conference on Sampling Theory and Applications, 2015.Accepted in

10th International Conference on Sampling Theory and Applications,2013.with Mark Davenport, Ewout van den Berg, and Mary Wootters.

Signal Processing with Adaptive Sparse Representations, 2013.IEEE International Symposium on Information Theory, 2013.47th annual Allerton conference on communication, control and computing, 2009.