Unsupervised Ensemble Learning

Written by Ariel Jaffe and Boaz Nadler, 2015

  1. Introduction

  2. Problem Setup

  3. Spectral Methods for Binary Ensembles

  4. Code

  5. References

1. Introduction

Consider for simplicity a binary classification problem, for which we are given $m$ pre-constructed classifiers, $f_1,\ldots,f_m$. The main task in ensemble learning is to construct a more accurate meta learner from this group of $m$ possibly weak learners.
There are two main ensemble learning scenarios. In the supervised scenario, labeled data is available to train and combine the individual classifiers; classical methods such as bagging and boosting belong here, see [3] and [4]. In the unsupervised scenario considered in this work, we are given only the predictions of the $m$ classifiers on $n$ unlabeled instances, with no labeled data to assess their accuracies or to learn how to combine them.

2. Problem Setup

Consider an ensemble of $m$ binary classifiers, all operating on the same instance space $\mathcal X$ (typically $\mathbb{R}^d$), with output space $\mathcal Y = \{1,-1\}$. Since the classifiers are binary, each is fully characterized by its sensitivity and specificity, denoted $\psi_i$ and $\eta_i$ respectively:
$$ \psi_i = \Pr[f_i(X)=1\,|\,y=1], \quad \eta_i = \Pr[f_i(X)=-1\,|\,y=-1]. $$ Given only the $m\times n$ matrix $Z$ of classifier predictions, where $Z_{ij}=f_i(x_j)$, can one:

  1. Estimate the accuracies of the different classifiers?

  2. Combine the classifiers into an improved meta-classifier?
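
To make the setup concrete, here is a minimal MATLAB sketch (not part of the released code; the values of $m$, $n$, $b$, $\psi_i$, $\eta_i$ below are arbitrary choices for illustration) that samples $n$ labels and generates the corresponding $m\times n$ prediction matrix $Z$, with each classifier predicting correctly with probability $\psi_i$ when $y=1$ and $\eta_i$ when $y=-1$, independently across classifiers and instances:

    % Minimal illustration (not the released code): simulate the setup above.
    m = 10;  n = 1000;  b = 0.2;          % illustrative values, chosen arbitrarily
    psi = 0.6 + 0.3*rand(m,1);            % sensitivities  Pr[f_i(X)=1  | y= 1]
    eta = 0.6 + 0.3*rand(m,1);            % specificities  Pr[f_i(X)=-1 | y=-1]
    y = 2*(rand(1,n) < (1+b)/2) - 1;      % true labels, Pr(y=1) = (1+b)/2
    Z = zeros(m,n);                       % m x n matrix of predictions Z_ij = f_i(x_j)
    for i = 1:m
        p_correct = (y==1)*psi(i) + (y==-1)*eta(i);   % per-instance prob. of a correct prediction
        correct   = rand(1,n) < p_correct;
        Z(i,:)    = y .* (2*correct - 1);             % a correct prediction keeps the label, an error flips it
    end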

Dawid and Skene [5] were among the first to study this setup. They proposed an expectation-maximization (EM) algorithm to jointly estimate the unknown specificities and sensitivities of all $m$ classifiers and the labels $y_j$ of the $n$ instances. In recent years, with the popularity of crowdsourcing, several works have studied similar scenarios and also suggested EM-based approaches; see for example [6], [7], [8] and [9].

The key assumption made by Dawid and Skene, which we also adopt in our work, is that of conditionally independent classifiers. Namely, conditional on the class label, different classifiers make independent predictions. That is, for all $f_i,f_j$ and for all labels $a_i,a_j \in \{-1,1\}$, $$ \Pr[f_i(x)=a_i,f_j(x)=a_j\,|\,y] = \Pr[f_i(x)=a_i\,|\,y] \cdot \Pr[f_j(x)=a_j\,|\,y]. $$ In fact, assuming all classifiers are mutually conditionally independent (and not only pairwise, as in the above equation), our problem is equivalent to learning a mixture of discrete (binary) product distributions, see [10]. In recent years, several works have used the low-rank structure of the second and third moments of the data to estimate the parameters of such mixtures, see [11] and [12].
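
Spelled out from the definitions above, mutual conditional independence means that each column of $Z$ is drawn from a two-component mixture of product distributions:
$$ \Pr(Z_{1j},\ldots,Z_{mj}) = \Pr(y=1)\prod_{i=1}^m \Pr[f_i(x_j)=Z_{ij}\,|\,y_j=1] + \Pr(y=-1)\prod_{i=1}^m \Pr[f_i(x_j)=Z_{ij}\,|\,y_j=-1], $$
where each factor in the first product equals $\psi_i$ if $Z_{ij}=1$ and $1-\psi_i$ otherwise, and each factor in the second equals $\eta_i$ if $Z_{ij}=-1$ and $1-\eta_i$ otherwise. This is the form of the likelihood scanned over by the restricted likelihood method in the next section.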

3. Spectral Methods for Binary Ensembles

In our work, we build upon the spectral approach developed in [2], which is based on a study of the $m\times m$ covariance matrix of the $m$ classifiers.

  • An important lemma, presented in [2], is the rank-1 structure of the off-diagonal elements of this covariance matrix: $$ r_{ij} = (1-b^2)(\psi_i+\eta_i-1) (\psi_j+\eta_j-1),$$ where $b=\Pr(y=1)-\Pr(y=-1)$ denotes the class imbalance.
    Furthermore, as shown in [1], the population mean of the $i$-th classifier is given by $$ \mu_i = (\psi_i-\eta_i) + b (\psi_i+\eta_i-1).$$ When the class imbalance is a priori known, these two quantities make it easy to estimate the unknown parameters $\psi_i,\eta_i$ of the $m$ classifiers (see Section 3 in [1]; a code sketch of this step appears after this list).

  • When $b$ is unknown, we propose the following two methods to estimate it, and then estimate $\psi_i,\eta_i$ using $\hat b$.

    1. The first is based on the joint covariance tensor $T_{ijk} = \mathbb E[(f_i-\mu_i)(f_j-\mu_j)(f_k-\mu_k)]$, whose off-diagonal elements form a rank-1 tensor: $$ T_{ijk}= -b(1-b^2)(\psi_i+\eta_i-1) (\psi_j+\eta_j-1)(\psi_k+\eta_k-1).$$ Note that the vectors constructing $r_{ij}$ and $T_{ijk}$ are both proportional to $(\psi_i+\eta_i-1)$. In [1], we use this observation to construct an estimate of $b$ (a short derivation along these lines appears after this list). We refer to this method as the 'tensor method'.

    2. The second approach to estimate $b$ is based on the log-likelihood of our data: $$ \log\bigg( \prod_{j=1}^n \Pr(Z_{1j}, \ldots, Z_{mj})\bigg) = \sum_{j=1}^n \log\Pr(Z_{1j}, \ldots, Z_{mj}).$$ Note that the likelihood is a function of $2m+1$ variables: $\{\psi_i\}_{i=1}^m$, $\{\eta_i\}_{i=1}^m$ and $b$. However, every guess for $b$ yields corresponding guesses for the remaining variables. We can therefore scan over $b \in [-1+\delta, 1-\delta]$, where $\delta>0$ corresponds to an a priori bound on the class imbalance, and find the value that maximizes the log-likelihood (a code sketch of this scan appears after this list). We denote this as the 'restricted likelihood' method.
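
A minimal MATLAB sketch (not the authors' released code) of the first bullet above: given the class imbalance $b$, recover $\psi_i,\eta_i$ from the empirical means and the rank-1 off-diagonal structure of the sample covariance matrix. The diagonal completion and the sign convention below (assuming most classifiers are better than random, so that $\sum_i(\psi_i+\eta_i-1)>0$) are simple choices made for the sketch; the implementation in [1] may differ in these details.

    % Sketch: estimate psi_i, eta_i from the first two moments of Z, for a known b.
    % Model used: off the diagonal, Q_ij = (1-b^2) v_i v_j with v_i = psi_i + eta_i - 1,
    % and mu_i = (psi_i - eta_i) + b v_i.
    function [psi, eta, v] = recover_psi_eta(Z, b)
        m  = size(Z, 1);
        mu = mean(Z, 2);                     % empirical means of the m classifiers
        Q  = cov(Z');                        % m x m sample covariance matrix
        R  = Q;
        for i = 1:m                          % fill the diagonal so that R is close to rank 1,
            vals = [];                       % using R_ii ~ Q_ij * Q_ik / Q_jk for j,k ~= i
            for j = setdiff(1:m, i)
                for k = setdiff(1:m, [i j])
                    if abs(Q(j,k)) > 1e-12
                        vals(end+1) = Q(i,j) * Q(i,k) / Q(j,k); %#ok<AGROW>
                    end
                end
            end
            R(i,i) = median(vals);
        end
        [U, D]     = eig((R + R') / 2);      % leading eigenpair of the completed matrix
        [lam, idx] = max(diag(D));
        v = sqrt(max(lam, 0) / (1 - b^2)) * U(:, idx);
        if sum(v) < 0, v = -v; end           % sign: assume most classifiers are better than random
        psi = (1 + v + mu - b*v) / 2;        % invert the two moment equations above
        eta = (1 + v - mu + b*v) / 2;
    end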
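
To see how the tensor method can pin down $b$ (this is the idea behind the routine estimate_alpha described in the Code section below; the precise estimator in [1] may differ in details), write $v_i=\psi_i+\eta_i-1$ and $u=v/\|v\|$, where $\lambda$, $\mu$ and $u$ are introduced here only for the derivation. The off-diagonal entries above then read
$$ r_{ij} = \lambda\, u_i u_j \quad\text{with}\quad \lambda = (1-b^2)\|v\|^2, \qquad T_{ijk} = \mu\, u_i u_j u_k \quad\text{with}\quad \mu = -b(1-b^2)\|v\|^3, $$
so that
$$ \alpha(b) = \frac{\mu}{\lambda^{3/2}} = \frac{-b}{\sqrt{1-b^2}}, \qquad\text{and hence}\qquad b = \frac{-\alpha}{\sqrt{1+\alpha^2}}. $$
Since $\lambda$, $u$ and $\mu$ can all be estimated from the off-diagonal entries of the sample covariance matrix and the sample covariance tensor, this yields an estimate $\hat b$.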
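
And a minimal MATLAB sketch of the restricted likelihood scan (again not the authors' released code): it reuses recover_psi_eta from the sketch above, evaluates the mixture likelihood spelled out in Section 2 on a grid of candidate values of $b$, and returns the maximizer. The grid spacing and the clipping of the probabilities are arbitrary choices made for the sketch.

    % Sketch: the 'restricted likelihood' estimate of the class imbalance b.
    function b_hat = estimate_b_restricted_likelihood(Z, delta)
        grid    = -1 + delta : 0.01 : 1 - delta;  % candidate class imbalances
        best_ll = -inf;
        b_hat   = grid(1);
        for b = grid
            [psi, eta] = recover_psi_eta(Z, b);   % helper from the previous sketch
            psi = min(max(psi, 1e-6), 1 - 1e-6);  % keep the probabilities inside (0,1)
            eta = min(max(eta, 1e-6), 1 - 1e-6);
            ll  = 0;
            for j = 1:size(Z, 2)                  % log-likelihood of column j under the mixture
                z     = Z(:, j);
                p_pos = prod(psi.^(z == 1) .* (1 - psi).^(z == -1));
                p_neg = prod(eta.^(z == -1) .* (1 - eta).^(z == 1));
                ll    = ll + log((1 + b)/2 * p_pos + (1 - b)/2 * p_neg);
            end
            if ll > best_ll
                best_ll = ll;
                b_hat   = b;
            end
        end
    end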

4. Code

A MATLAB implementation of our unsupervised methods can be found here.
The file contains functions implementing the two estimation methods described above, together with the following additional routines (some required by those procedures):

function [y,Z] = generate_prediction_matrix(m,n,b,psi,eta) - generates a random vector of true labels $y$ and a random prediction matrix $Z$.

function R = estimate_rank_1_matrix(Q) - receives an $m \times m$ matrix and estimates its diagonal values, assuming the matrix is (or is close to) rank 1.

function T = compute_classifier_3D_tensor(Z) - receives the prediction matrix $Z$ and returns the sample covariance tensor.

function alpha = estimate_alpha(V,T) - estimates the value of $\alpha(b)$, used by the tensor method to estimate $b$.

We also provide demo code which generates synthetic random data and reproduces the following figure from [1]. The figure shows the mean squared error $\mathbb{E}[(\hat b-b)^2]$ in estimating the class imbalance, as a function of the number of instances, on a log-log scale, for the two methods described above. As seen from the figure, on simulated data the restricted likelihood approach is more accurate than the tensor-based one.

5. References

  1. A. Jaffe, B. Nadler and Y. Kluger, Estimating the accuracies of multiple classifiers without labeled data, AISTATS 2015.

  2. F. Parisi, F. Strino, B. Nadler and Y. Kluger, Ranking and combining multiple predictors without labeled data, Proceedings of the National Academy of Sciences, vol. 111(4), 1253-1258, 2014.

  3. L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, 2004.

  4. T. G. Dietterich, Ensemble methods in machine learning, Multiple Classifier Systems, Springer Berlin Heidelberg, 2000, 1-15.

  5. A. P. Dawid and A. M. Skene, Maximum likelihood estimation of observer error-rates using the EM algorithm, Journal of the Royal Statistical Society, 1979.

  6. V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni and L. Moy, Learning from crowds, Journal of Machine Learning Research, 2010.

  7. J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma and J. R. Movellan, Whose vote should count more: optimal integration of labels from labelers of unknown expertise, Advances in Neural Information Processing Systems 22, 2009.

  8. P. Welinder, S. Branson, S. Belongie and P. Perona, The multidimensional wisdom of crowds, Advances in Neural Information Processing Systems 23, 2010.

  9. P. Donmez, G. Lebanon and K. Balasubramanian, Unsupervised supervised learning I: estimating classification and regression errors without labels, Journal of Machine Learning Research 11, 2010.

  10. Y. Freund and Y. Mansour, Estimating a mixture of two product distributions, Proceedings of the 12th Annual Conference on Computational Learning Theory, 1999.

  11. A. Anandkumar, D. Hsu and S. M. Kakade, A method of moments for mixture models and hidden Markov models, Conference on Learning Theory, 2012.

  12. Y. Zhang, X. Chen, D. Zhou and M. I. Jordan, Spectral methods meet EM: a provably optimal algorithm for crowdsourcing, Advances in Neural Information Processing Systems 27, 2014.