CMSMITACS Joint Conference 2007May 31  June 3, 2007
Delta Hotel, Winnipeg, Manitoba

Classification methods based on statistical learning theory are mostly concerned with binary (twoclass) classifiers. It is sometimes argued that multiclass (more than two classes) problems can be reduced to binary ones through sequential dichotomization. The drawbacks of such approaches are obvious. We seek here simultaneous solutions (classifier determinations) to a multiple classification problem with real loss function on classification error cases, reflecting the possibly variable gravity of misdiagnoses and/or a decision not to classify an object. We demonstrate a general reduction principle and show in particular that asymmetry in the loss function is necessary and sufficient for the multiclass problem not to be reducible to a biclass one. We then propose two generalization bounds specifically designed for the multiclass setting. These bounds are numerical, and are tight by construction.
We argue that learning to be intelligent involves the learning of highly varying functions, in a mathematical sense. We present results suggesting strongly that most currently popular statistical learning approaches to learning flexible functions have fundamental limitations that render them inappropriate for learning highly varying functions. The first issue concerns the representation of such functions with what we call shallow model architectures. We discuss limitations of shallow architectures, such as socalled kernel machines, boosting algorithms, decision trees, and onehiddenlayer artificial neural networks. Mathematical results in circuits complexity theory helps us understand the issue. The second issue is more focused and concerns kernel machines with a local (e.g. Gaussian) kernel.
We show that they have a limitation similar to those already proved for older nonparametric methods, and connected to the socalled curse of dimensionality. Though it has long been believed that efficient learning in deep architectures is difficult, recently proposed computational principles for learning in deep architectures may offer a breakthrough. An idea that emerges from the experiments performed with these algorithms is that in order to optimize a highly nonconvex functions, humans and machines could be exploiting the pedagogical approach: learn simple concepts first, and when they are mastered use them to express and learn more abstract concepts. Selecting training examples and the order in which they are presented (just like teachers do with children) could be a way to guide this difficult optimization problem by solving a series of gradually more complex problems embedded in each other.
Density estimators that can adapt for asymmetric heavy tails are required in many applications such as finance and insurance. We put forward a nonparametric density estimator that brings together the strengths of nonparametric density estimation and of Extreme Value Theory. A hybrid Pareto distribution that can be used in a mixture model is proposed to extend the generalized Pareto (GP) to the whole real axis. Experiments on simulated data show the following. On one hand, the mixture of hybrid Paretos converges faster in terms of loglikelihood and provides good estimates of the tail of the distributions when compared with other density estimators including the GP distribution. On the other hand, the mixture of hybrid Paretos offers an alternate way to estimate the tail index which is comparable to the one estimated with the standard GP methodology. The mixture of hybrids is also evaluated on the Danish fire insurance data set.
Nonnegative matrix factorization (NMF) was introduced as a tool for data mining by Lee and Seung in 1999. NMF attempts to approximate a matrix with nonnegative entries by a product of two lowrank matrices, also with nonnegative entries. We propose an algorithm called R1D (rankone downdate) for computing a NMF that is motivated by singular value decomposition. This computes the dominant singular values and vectors of adaptively determined submatrices of a matrix.
Preliminary computational tests indicate that this method is able to successfully identify features in realistic datasets.
Joint work with Stephen Vavasis and Michael Biggs.
Deep Belief Networks (DBN) are generative neural network models with many layers of hidden causal variables, recently introduced by Hinton et al., along with a greedy layerwise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a Restricted Boltzmann Machine (RBM), used to represent one layer of the model. Restricted Boltzmann Machines are interesting because inference is easy in them, and because they have been successfully used as building blocks for training deeper models.
We show that RBMs are universal approximators of discrete distributions. A first theorem shows that adding hidden units yields improved modeling power, while a second theorem shows that an RBM can model any discrete distribution.
We then study the question of whether DBNs with more layers are strictly more powerful in terms of representational power. This suggests another criterion for DBNs, obtained by considering that the top layer can perfectly fit its input.
Scientists, working with large amounts of highdimensional data are constantly facing the problem of dimensionality reduction: how to discover lowdimensional structure from highdimensional observations. The compact representation can be used for exploratory data analysis, preprocessing, data visualization, and information retrieval.
One way to discover lowdimensional structure is to convert highdimensional data to lowdimensional codes by training a multilayer neural network with a small central layer to reconstruct highdimensional input vectors. Gradient descent can be used for finetuning the weights in such "autoencoder" networks, but this only works well if the initial weights are close to a good solution. In this talk we will describe an effective way of initializing the weights which allows deep autoencoder networks to learn lowdimensional codes that work much better than widely used Principal Components Analysis.
When trained on large document corpus, autoencoders are capable of extracting lowdimensional "semantic" codes that allow for much more accurate and faster retrieval than Latent Semantic Analysis, a wellknown document retrieval method based on Singular Value Decomposition.
In a variety of domains it is desirable to learn a representation of an environment defined by a stream of sensorimotor experience. In many cases such a representation is necessary as the observational data is too plentiful to be stored in a computationally feasible way. In other words, the primary feature of a learned representation is that it must be compact, summarizing information in a way that alleviates storage and retrieval demands.
This admits a new way of phrasing the problem: as a variation of dimensionality reduction. There are a variety of wellstudied algorithms for the dimensionality reduction problem. We argue that any of these can be useful for learning compact representations as long as additional constraints to the problem are respected, namely that the resulting representation is useful in the context of the actions which generated the observations.
Here, we formalize the problem of learning a subjective representation, clearly articulating solution features that are necessary for a learned representation to be "useful"; the actions must correspond to simple and consistent transformations in the learned representation. Further, we briefly present a possible solution to the newly defined problem and demonstrate its effectiveness for reasoning, planning and localization.