TLDR: Attention solves the underfitting issue in amortized probabilistic clustering.
Amortized approaches to clustering have recently received renewed attention thanks to novel objective functions that exploit the expressiveness of deep learning models. In this work we revisit a recent proposal for fast amortized probabilistic clustering, the Clusterwise Clustering Process (CCP), which yields samples from the posterior distribution of cluster labels for sets of arbitrary size using only O(K) forward network evaluations, where K is an arbitrary number of clusters. We show that, while adequate on simple datasets, the model can severely underfit complex datasets, and we hypothesize that this limitation can be traced back to the implicit assumption that the probability of a point joining a cluster is equally sensitive to all the points available to join the same cluster. We propose an improved model, the Attentive Clustering Process (ACP), which selectively pays more attention to relevant points while preserving the invariance properties of the generative model. We illustrate the advantages of the new model in applications to spike-sorting in multi-electrode arrays and community discovery in networks. The latter case combines the ACP model with graph convolutional networks, and to our knowledge is the first deep learning model that handles an arbitrary number of communities.
@article{pakman2020acp, title = {Attentive Clustering Process}, author = {Ari Pakman and Yueqi Wang and Yoonho Lee and Pallab Basu and Juho Lee and Yee Whye Teh and Liam Paninski}, year = {2020}, archiveprefix = {arXiv}, primaryclass = {stat.ML}, eprint = {2010.15727} }
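To make the core change concrete: CCP summarizes the points available to join a cluster with an unweighted aggregation, while ACP re-weights them by relevance. The sketch below illustrates that distinction with plain dot-product attention; it is not the paper's architecture, and the function names and toy data are purely illustrative.

```python
import torch
import torch.nn.functional as F

def mean_pool_context(query, candidates):
    """CCP-style aggregation: every candidate point contributes equally,
    regardless of its relevance to the query point."""
    return candidates.mean(dim=0)

def attention_context(query, candidates):
    """ACP-style aggregation (simplified): candidates are re-weighted by
    dot-product attention, so relevant points dominate the summary.
    Both aggregations are invariant to the ordering of the candidates."""
    scores = candidates @ query / query.shape[-1] ** 0.5   # (N,) relevance scores
    weights = F.softmax(scores, dim=0)                     # normalized to sum to 1
    return weights @ candidates                            # attention-weighted mean

# toy example: one query point and N=5 candidate points in d=4 dimensions
query = torch.randn(4)
candidates = torch.randn(5, 4)
print(mean_pool_context(query, candidates))
print(attention_context(query, candidates))
```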
TLDR: We estimate uncertainty in a Neural Process using bootstrapping.
Unlike traditional statistical modeling, in which a user typically hand-specifies a prior, Neural Processes (NPs) implicitly define a broad class of stochastic processes with neural networks. Given a data stream, an NP learns a stochastic process that best describes the data. While this "data-driven" way of learning stochastic processes has proven capable of handling various types of data, NPs still rely on the assumption that the uncertainty in a stochastic process is modeled by a single latent variable, which potentially limits flexibility. To address this, we propose the Bootstrapping Neural Process (BNP), a novel extension of the NP family using the bootstrap. The bootstrap is a classical data-driven technique for estimating uncertainty, which allows BNP to learn the stochasticity in NPs without assuming a particular form. We demonstrate the efficacy of BNP on various types of data and its robustness in the presence of model-data mismatch.
@inproceedings{lee2020bnp, title = {Bootstrapping Neural Processes}, author = {Juho Lee* and Yoonho Lee* and Jungtaek Kim and Eunho Yang and Sung Ju Hwang and Yee Whye Teh}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2020} }
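The generic ingredient BNP adds to the NP family is bootstrap resampling of the context set, so that each resample yields one functional sample and uncertainty no longer hinges on a single latent variable. The sketch below shows only that resampling step under simplified assumptions; the paper's actual bootstrapping procedure and encoder are not reproduced here, and all names are illustrative.

```python
import torch

def bootstrap_contexts(x_ctx, y_ctx, num_samples=8):
    """Draw bootstrap resamples (with replacement) of a context set.
    Each resample would be fed through an NP encoder to obtain one
    functional sample, so uncertainty comes from resampling rather than
    from a single latent variable."""
    n = x_ctx.shape[0]
    resamples = []
    for _ in range(num_samples):
        idx = torch.randint(0, n, (n,))          # sample indices with replacement
        resamples.append((x_ctx[idx], y_ctx[idx]))
    return resamples

# toy context set of 10 (x, y) pairs
x_ctx, y_ctx = torch.randn(10, 1), torch.randn(10, 1)
for xb, yb in bootstrap_contexts(x_ctx, y_ctx, num_samples=3):
    print(xb.shape, yb.shape)   # each resample has the same size as the original set
```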
TLDR: A meta-learning framework for predicting generalization.
While various complexity measures for deep neural networks exist, specifying an appropriate measure capable of predicting and explaining generalization in deep networks has proven challenging. We propose Neural Complexity (NC), a meta-learning framework for predicting generalization. Our model learns a scalar complexity measure through interactions with many heterogeneous tasks in a data-driven way. The trained NC model can be added to the standard training loss to regularize any task learner in a standard supervised learning scenario. We contrast NC’s approach against existing manually-designed complexity measures and other meta-learning models, and we validate NC’s performance on multiple regression and classification tasks.
@inproceedings{lee2020nc, title = {Neural Complexity Measures}, author = {Yoonho Lee and Juho Lee and Sung Ju Hwang and Eunho Yang and Seungjin Choi}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2020} }
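To illustrate how a meta-learned complexity measure plugs into standard training: the predicted generalization gap is simply added to the task loss as a regularizer. The snippet below is a hypothetical sketch; the architecture of the NC model and the features it consumes are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

# hypothetical meta-learned complexity model: maps some summary of the
# learner/task interaction to a scalar estimate of the generalization gap
nc_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def regularized_loss(task_loss, interaction_features, lam=0.1):
    """Standard supervised loss plus the predicted generalization gap.
    `interaction_features` stands in for whatever summary the meta-learned
    NC model consumes; its exact form is not specified in this sketch."""
    predicted_gap = nc_model(interaction_features).mean()
    return task_loss + lam * predicted_gap

features = torch.randn(32, 16)             # placeholder summary features
loss = regularized_loss(torch.tensor(0.7), features)
print(loss)                                # loss to backpropagate through the learner
```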
TLDR: We extract class information from final FC layer weights.
This paper studies the probability distributions of penultimate activations of classification networks. Specifically, we show that the discriminative criterion applied to the final layer forms a Generative-Discriminative pair with a specific generative model of penultimate activations. This model is parameterized by the weights of the final fully-connected layer and can synthesize classification networks’ activations without feeding input data. We empirically demonstrate that our generative model enables stable knowledge distillation in the presence of domain shift, and can also transfer knowledge from a classifier to variational autoencoders and generative adversarial networks for class-conditional image generation.
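One concrete instance of such a generative-discriminative pair: if the base distribution over activations is an isotropic Gaussian, the class-conditional activation distribution induced by a softmax classifier with weights w_c is a Gaussian centered at a scaled w_c, so activations can be synthesized from the final-layer weights alone. The sketch below uses that assumption; it is illustrative rather than the paper's exact model.

```python
import torch

def sample_activations(fc_weight, class_idx, num_samples=64, sigma=1.0):
    """Synthesize penultimate activations for one class without any input data.
    Assumes p(a | y=c) is proportional to exp(w_c . a) * N(a; 0, sigma^2 I),
    i.e. a Gaussian base distribution, which makes the class-conditional a
    Gaussian centered at sigma^2 * w_c (an illustrative assumption)."""
    w_c = fc_weight[class_idx]                                # (d,)
    mean = (sigma ** 2) * w_c
    return mean + sigma * torch.randn(num_samples, w_c.shape[0])

fc_weight = torch.randn(10, 128)       # final FC layer: 10 classes, 128-dim features
fake_acts = sample_activations(fc_weight, class_idx=3)
print(fake_acts.shape)                 # (64, 128) synthetic activations for class 3
```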
TLDR: We regularize few-shot classification using compact discrete codes.
Learning compact discrete representations of data is a key task on its own or for facilitating subsequent processing of data. In this paper we present a model that produces Discrete InfoMax COdes (DIMCO); we learn a probabilistic encoder that yields k-way d-dimensional codes associated with input data. Our model’s learning objective is to maximize the mutual information between codes and labels with a regularization term that encourages the entries of a codeword to be as independent as possible. We show that the infomax principle also justifies previous loss functions (e.g., cross-entropy) as its special cases. Our analysis also shows that using shorter codes, as DIMCO does, reduces overfitting in the context of few-shot classification. Through experiments in various domains, we observe this implicit meta-regularization effect of DIMCO. Furthermore, we show that the codes learned by DIMCO are efficient in terms of both memory and retrieval time compared to previous methods.
@article{lee2019dimco, title = {Discrete Infomax Codes for Supervised Representation Learning}, author = {Yoonho Lee and Wonjae Kim and Wonpyo Park and Seungjin Choi}, year = {2019}, archiveprefix = {arXiv}, primaryclass = {stat.ML}, eprint = {1905.11656} }
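As a rough illustration of the objective, a plug-in mutual-information estimate between one k-way code dimension and the labels can be computed from soft code assignments as below. The paper's full objective covers all d code dimensions plus the independence regularizer; this sketch, with illustrative names and shapes, shows only the MI term.

```python
import torch
import torch.nn.functional as F

def mutual_information(code_logits, labels, num_classes):
    """Plug-in estimate of I(code; label) for one k-way code dimension.
    code_logits: (N, k) unnormalized scores; labels: (N,) integer class labels."""
    p_code = F.softmax(code_logits, dim=-1)                 # soft code assignment, (N, k)
    onehot = F.one_hot(labels, num_classes).float()         # (N, C)
    joint = onehot.t() @ p_code / labels.shape[0]           # empirical p(label, code), (C, k)
    p_y = joint.sum(dim=1, keepdim=True)                    # label marginal, (C, 1)
    p_c = joint.sum(dim=0, keepdim=True)                    # code marginal, (1, k)
    ratio = joint / (p_y * p_c + 1e-12)
    return (joint * (ratio + 1e-12).log()).sum()

logits = torch.randn(256, 8)                # N=256 examples, k=8 possible code values
labels = torch.randint(0, 5, (256,))        # 5 classes
print(mutual_information(logits, labels, num_classes=5))    # the term DIMCO maximizes
```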
TLDR: Our method learns to cluster by identifying one cluster at a time.
We propose Deep Amortized Clustering (DAC), a framework in which a deep neural network learns to cluster datasets efficiently in a few forward passes. DAC implicitly learns what makes a cluster, how to group data points into clusters, and how to count the number of clusters in datasets. DAC is meta-learned in a data-driven way, using only clustered datasets and their partitions. This framework differs from traditional clustering algorithms, which usually require user-specified prior knowledge about the shape or structure of clusters. We empirically show on both synthetic and image data that DAC can efficiently and accurately cluster novel datasets.
@inproceedings{lee2019dac, title = {Deep Amortized Clustering}, author = {Juho Lee and Yoonho Lee and Yee Whye Teh}, booktitle = {Sets and Parts Workshop @ NeurIPS}, year = {2019} }
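The iterate-and-remove structure of identifying one cluster at a time can be sketched as follows; `identify_one_cluster` stands in for the trained DAC network, and the toy distance-based rule is purely for illustration.

```python
import torch

def amortized_clustering(points, identify_one_cluster):
    """Cluster by repeatedly identifying one cluster and removing its points.
    `identify_one_cluster` stands in for the trained network: given the
    remaining points, it returns a boolean membership mask for one cluster."""
    remaining = torch.arange(points.shape[0])
    labels = torch.full((points.shape[0],), -1)
    k = 0
    while remaining.numel() > 0:
        mask = identify_one_cluster(points[remaining])   # (len(remaining),) bool
        labels[remaining[mask]] = k                      # assign the found cluster
        remaining = remaining[~mask]                     # drop its points, repeat
        k += 1
    return labels

# toy stand-in for the network: group every point within distance 1 of the
# first remaining point into the current cluster
toy_net = lambda pts: (pts - pts[0]).norm(dim=-1) < 1.0
points = torch.cat([torch.randn(20, 2), torch.randn(20, 2) + 5.0])
print(amortized_clustering(points, toy_net))
```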
TLDR: Smooth attention is interpretable.
Without relevant human priors, neural networks may learn uninterpretable features. We propose Dynamics of Attention for Focus Transition (DAFT) as a human prior for machine reasoning. DAFT is a novel method that regularizes attention-based reasoning by modelling it as a continuous dynamical system using neural ordinary differential equations. As a proof of concept, we augment a state-of-the-art visual reasoning model with DAFT. Our experiments reveal that applying DAFT yields similar performance to the original model while using fewer reasoning steps, showing that it implicitly learns to skip unnecessary steps. We also propose a new metric, Total Length of Transition (TLT), which represents the effective reasoning step size by quantifying how much a given model’s focus drifts while reasoning about a question. We show that adding DAFT results in lower TLT, demonstrating that our method indeed obeys the human prior towards shorter reasoning paths in addition to producing more interpretable attention maps.
@inproceedings{kim2019daft, title = {Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning}, author = {Wonjae Kim and Yoonho Lee}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2019} }
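The TLT metric can be sketched as the accumulated distance between consecutive attention maps across reasoning steps. The snippet below assumes an L2 distance, which is an illustrative choice rather than the paper's exact definition.

```python
import torch

def total_length_of_transition(attention_maps):
    """Total Length of Transition (TLT): sum of distances between consecutive
    attention maps over the reasoning steps. The L2 (Frobenius) distance used
    here is an assumption made for this sketch."""
    steps = attention_maps.shape[0]
    return sum((attention_maps[t] - attention_maps[t - 1]).norm()
               for t in range(1, steps))

# toy sequence of 6 reasoning steps, each with a 14x14 spatial attention map
maps = torch.softmax(torch.randn(6, 14 * 14), dim=-1).reshape(6, 14, 14)
print(total_length_of_transition(maps))   # lower TLT = less focus drift between steps
```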
TLDR: Use Transformer blocks in your set networks.
Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from the sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating state-of-the-art performance compared to recent methods for set-structured data.
@inproceedings{lee2019st, title = {Set Transformer: A Framework for Attention-based Permutation-invariant Neural Networks}, author = {Juho Lee and Yoonho Lee and Jungtaek Kim and Adam Kosiorek and Seungjin Choi and Yee Whye Teh}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2019} }
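The inducing-point scheme can be sketched as two attention passes: a small set of learned inducing points first attends to the n input elements, and the elements then attend back to that summary, giving O(nm) rather than O(n^2) cost. The block below is a stripped-down illustration; the actual Set Transformer blocks also include residual connections, layer normalization, and feed-forward layers.

```python
import torch
import torch.nn as nn

class InducedSetAttention(nn.Module):
    """Simplified induced set attention: n set elements attend through m learned
    inducing points, so each attention pass costs O(n*m) instead of O(n^2)."""
    def __init__(self, dim, num_heads=4, num_inducing=16):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(1, num_inducing, dim))
        self.attn1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                        # x: (batch, n, dim)
        inducing = self.inducing.expand(x.shape[0], -1, -1)
        h, _ = self.attn1(inducing, x, x)        # inducing points summarize the set
        out, _ = self.attn2(x, h, h)             # set elements read the summary back
        return out                               # (batch, n, dim), permutation-equivariant

block = InducedSetAttention(dim=64)
x = torch.randn(8, 1000, 64)                     # a batch of sets with 1000 elements each
print(block(x).shape)                            # torch.Size([8, 1000, 64])
```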
TLDR: We architecturally separate task-general and task-specific learning in MAML.
Gradient-based meta-learning methods leverage gradient descent to learn the commonalities among various tasks. While previous such methods have been successful in meta-learning tasks, they resort to simple gradient descent during meta-testing. Our primary contribution is the MT-net, which enables the meta-learner to learn, within each layer’s activation space, a subspace in which the task-specific learner performs gradient descent. Additionally, a task-specific learner of an MT-net performs gradient descent with respect to a meta-learned distance metric, which warps the activation space to be more sensitive to task identity. We demonstrate that the dimension of this learned subspace reflects the complexity of the task-specific learner’s adaptation task, and also that our model is less sensitive to the choice of initial learning rates than previous gradient-based meta-learning methods. Our method achieves state-of-the-art or comparable performance on few-shot classification and regression tasks.
@inproceedings{lee2018mtnet, title = {Gradient-based Meta-learning with Learned Layerwise Metric and Subspace}, author = {Yoonho Lee and Seungjin Choi}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2018} }
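A single-layer sketch of the idea: a meta-learned transformation T warps the activation space in which the task-specific weights W are updated, and a meta-learned mask restricts the update to a subspace. The dimensions, mask placement, and loss below are illustrative simplifications, not the exact MT-net parameterization.

```python
import torch

# stand-ins for meta-learned quantities, held fixed during the inner loop:
# a transformation T that warps the layer's activation space, and a mask
# selecting the subspace in which task-specific adaptation happens
T = torch.eye(32) + 0.1 * torch.randn(32, 32)
mask = (torch.rand(16, 32) > 0.5).float()

def layer(x, W):
    # activations pass through the meta-learned T, so gradient descent on W
    # effectively happens in a warped (metric-sensitive) space
    return x @ W @ T.t()

def inner_step(W, x, y, lr=0.1):
    loss = ((layer(x, W) - y) ** 2).mean()
    grad, = torch.autograd.grad(loss, W)
    return W - lr * mask * grad                  # only the masked subspace adapts

W = torch.randn(16, 32, requires_grad=True)       # task-specific weights
x, y = torch.randn(4, 16), torch.randn(4, 32)
print(inner_step(W, x, y).shape)                  # torch.Size([16, 32])
```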