Rank Estimation in Linear Mixture Models

or

Detecting the Number of Signals

or

Estimating the Number of Components

Written by Shira Kritchman & Boaz Nadler, 2008

The Problem

Proposed Solution

Matlab Code + Demo

We are given n independent identically distributed high dimensional observations, i=1, …, n, from the following linear mixture model

 (1)

where ui are random variables, vi are fixed unknown vectors in Rp,  is the (unknown) noise level and  is a multivariate  N(0, I) Gaussian random variable with identity covariance matrix. The basic assumption is that the covariance matrix of the random variables ui is full rank, and that the vectors are linearly independent, so that indeed there are K identifiable components.

In the signal array processing literature, the model (1) is typically written in matrix form as

where the  matrix contains the p vectors  and is typically called the steering matrix, whereas the vector contains the K random signals.

The problem is to estimate the number of factors/components/sources/signals K from the given the n observations , i=1…,n. This problem occurs in a variety of scientific fields, including signal processing, statistics, and chemometrics (see [3,4,5,6,7,8] and many references therein). Determining the number of sources is typically the first step prior to other computationally intensive parametric procedures.

Here we consider estimation of the number of components in the non-parametric setting where no prior information is assumed on the vectors, beyond the fact that they are linearly independent. Under this setting, to estimate K we use only the eigenvalues of the sample covariance matrix

.

Non-Parametric Detection Limit:

As , and assuming the random variables ui have finite fourth moments, the sample covariance matrix converges with probability one to the population covariance matrix. Hence,

where are the signal eigenvalues, and .Thus, as , any signal, however small, can eventually be detected. However, for finite n, the noise eigenvalues are not all equal to  but may rather have a significant spread. As shown by various authors, the largest eigenvalue of a pure noise matrix is of the order of .  This puts a threshold on the signal strengths identifiable (with high probability), using the sample eigenvalues .

Specifically, we have that [9,10,11]

Theorem: In the joint limit , with , a signal is identifiable by the largest sample eigenvalue  if and only if

.

Hence, any algorithm to detect the number of signals should hopefully have a detection performance close to this threshold.

Proposed Solution:

We propose an estimator for the number of sources based on a sequence of hypothesis tests. For k=1,…,min(p,n)-1 we test

To reject H0, hence accepting the alternative hypothesis H1, the k-th sample eigenvalue must be significantly larger than the quantity expected under H0, where the k-th population eigenvalue is equal to. According to [2], the largest eigenvalue of the covariance matrix of n pure noise observations of length p is asymptotically distributed according to a Tracy-Widom distribution

where  for real-valued / complex valued observations, respectively, and explicit expressions for  can be found in [1,12].

Hence, our estimator is of the form

where  is an estimator of the unknown noise level,  is a significance level chosen by the user, and  is the resulting threshold obtained by inversion of the Tracy-Widom distribution.

This test is closely related to Roy's largest root test, for testing the sphericity of a given covariance matrix. It can be shown that this algorithm asymptotically attains the non-parametric limit of detection.

One key point in using this approach is that for good detection performance at finite sample sizes, and specifically when n<<p, there is a need for an accurate estimate of the unknown noise level. This is discussed in detail in [1].

Matlab Code + Demo

The main function that estimates the rank and the noise level is

function [k_hat sigma2_hat ] = KN_rankEst(ell, n, beta, alpha, max_k)

where

ell         = sample eigenvalues, sorted in decreasing order !

n          = number of samples,

beta      = 1/2 for real/complex valued noise,

alpha    = confidence level in percentage points, 0.1 is a good value.

max_k = maximum number of signals

Additional required routines are:

function [mu_np sigma_np] = KN_mu_sigma(n,p,beta);    % computes

function s_Wishart = KN_s_Wishart(alpha,beta)               % computes approximate threshold

function sig_hat_kk = KN_noiseEst(ell,n,kk)                     % estimate of noise level, assuming kk signals

All these functions can be downloaded here: KN_rank_estimation.zip

Demo: function demo_n produces a performance curve as a function of number of samples n, similar to the one in the paper.

References

[1] S. Kritchman and B. Nadler, Determining the number of components in a factor model from limited noisy data, Chem. Int. Lab. Sys. 2008.

[2] I.M. Johnstone, On the distribution of the largest eigenvalues in principal components analysis, Annals of Statistics, vol. 29, pp. 295-327, 2001.

[3] M. Wax and T. Kailath, Detection of signals by information theoretic criteria, IEEE Trans. Sig. Proc., vol.33, pp. 387-392, 1985.

[4] L.A. Goodman and S.J. Haberman, The analysis of nonadditivity in two-way analysis of variance, J. Am. Stat. Assoc., vol. 85, pp. 139-145, 1990.

[5] J.R. Schott, A note on the critical values used in stepwise tests for multiplicative components of interaction, Comm. Stat. Th. Methods, vol. 15, pp. 1561-1570, 1986.

[6] E.R. Malinowski, Determination of the number of factors and the experimental error in a data matrix, Analytical Chemistry, vol. 49, pp. 612-617, 1977.

[7] K. Faber and B.R. Kowalski, Modification of Malinowski's F-test for abstract factor analysis, Journal of Chemometrics, vol. 11, pp. 53-72, 1997.

[8] M. Meloun, J. Capek, P. Miksik, and R.G. Brereton, Critical comparison of methods predicting the number of components in spectroscopic data. Analytica Chimica Acta, vol. 423, pp. 51-68, 2000.

[9] J. Baik and J.W. Silverstein, Eigenvalues of large sample covariance matrices of spiked population models. J. of Mult. Anal., 97(6):1382-1408, 2006.

[10] D. Paul, Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17:1617{1642, 2007.

[11] B. Nadler, Finite sample approximation results for principal component analysis: A matrix perturbation approach. to appear, Annals of Statistics, 2008.

[12] N. El-Karoui, A rate of convergence result for the largest eigenvalue of complex white Wishart matrices, Annals of Probability, 34 (6):2077–2117, 2006.