- [UIST] Clarify: Improving Model Robustness With Natural Language Corrections
Yoonho Lee, Michelle Lam, Helena Vasconcelos, Michael S. Bernstein, Chelsea Finn
UIST 2024
NeurIPS 2023 Workshops: XAIA, ICBINB [abstract] [paper] [code]
TLDR: A natural language interface for high-level feedback on model misconceptions; works at ImageNet scale.
The standard way to teach models is by feeding them lots of data. However, this approach often teaches models incorrect ideas because they pick up on misleading signals in the data. To prevent such misconceptions, we must provide additional information beyond the training data. Prior methods incorporate forms of additional instance-level supervision, such as labels for misleading features or additional labels for debiased data. However, such strategies require a large amount of labeler effort. We hypothesize that people are good at providing textual feedback at the level of concepts, a capability that existing frameworks for teaching do not leverage. We propose Clarify, a novel interface and method for interactively correcting model misconceptions. Through Clarify, users need only write a short text description of a model's consistent failure patterns. Then, in an entirely automated way, we use such descriptions to improve the training process. Our user studies show that non-expert users can successfully describe model misconceptions via Clarify, leading to increased worst-case performance on two datasets. We additionally conduct a case study on a large-scale image dataset, ImageNet, where we use Clarify to find and rectify 31 novel hard subpopulations.
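As a rough illustration of the automated step, the sketch below scores training images against a user-written failure description with CLIP and reweights the matching slice. The model names, similarity threshold, and upweighting rule are illustrative assumptions, not Clarify's exact pipeline.

```python
# Sketch: score training images against a user-written failure description with
# CLIP and upweight the matching slice. Model names, the similarity threshold,
# and the reweighting rule are illustrative, not Clarify's exact pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def reweight_from_feedback(images: list[Image.Image], description: str,
                           threshold: float = 0.25, upweight: float = 5.0) -> torch.Tensor:
    """E.g., description = "the model focuses on the water background"."""
    inputs = processor(text=[description], images=images,
                       return_tensors="pt", padding=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    sim = torch.cosine_similarity(img, txt)            # one score per image
    weights = torch.ones(len(images))
    weights[sim > threshold] = upweight                # emphasize the flagged slice
    return weights                                     # use as per-example loss weights
```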
- [TMLR] Conservative Prediction via Data-Driven Confidence Minimization
Caroline Choi*, Fahim Tajwar*, Yoonho Lee*, Huaxiu Yao, Ananya Kumar, Chelsea Finn
TMLR 2024
ICLR 2023 Workshops: TrustML, ME-FoMo [abstract] [paper] [code]
TLDR: Conservative prediction by reducing confidence on relevant hard examples.
Errors of machine learning models can be prohibitively costly, especially in safety-critical settings such as healthcare. However, machine learning may be applicable to such scenarios if the learned model can abstain and defer to a human on difficult examples instead of making errors. In safety-critical settings, we prefer conservative models that defer to humans at the cost of some overall accuracy. Unfortunately, selective classification and out-of-distribution detection are notably difficult as it is hard to anticipate all possible examples. To mitigate this challenge, we focus on the transductive setting, where unlabeled examples from the test distribution are available during training. We propose transductive confidence minimization (TCM), which minimizes prediction confidence on unlabeled test examples while simultaneously optimizing the training objective. We theoretically show that TCM learns a lower bound on the true confidence, and that this property can be leveraged to provably detect examples that are sufficiently different from training examples, regardless of what distribution they came from. In our experiments, TCM consistently shows high performance, achieving the highest OOD detection performance compared to 6 other methods on 9 out of 10 ID->OOD pairs and consistently outperforming methods for selective classification in settings where we test on data from a previously unseen distribution.
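A minimal sketch of the core training signal, assuming max-softmax probability as the confidence measure on the unlabeled test inputs; the paper's exact confidence term and weighting may differ.

```python
import torch
import torch.nn.functional as F

def tcm_style_loss(model, x_train, y_train, x_unlabeled_test, penalty_weight=1.0):
    """Cross-entropy on labeled training data plus a term that pushes predictions
    on unlabeled test inputs toward low confidence (a hedged sketch of the idea)."""
    ce = F.cross_entropy(model(x_train), y_train)
    probs = F.softmax(model(x_unlabeled_test), dim=-1)
    # Minimizing the mean max-probability is one simple way to minimize confidence.
    confidence = probs.max(dim=-1).values.mean()
    return ce + penalty_weight * confidence
```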
- [NeurIPS-W] Robust Fine-Tuning by Learning the Objective with Bi-Level Optimization
Caroline Choi*, Yoonho Lee*, Annie S. Chen, Allan Zhou, Aditi Raghunathan, Chelsea Finn
NeurIPS 2023 Workshop on Distribution Shifts [abstract] [paper] [code]
TLDR: Robust fine-tuning by optimizing hyperparameters on OOD data; state-of-the-art on WILDS.
Large pretrained models encode rich representations that should, in principle, facilitate adaptation to downstream tasks by fine-tuning. However, fine-tuning a model on one data distribution often degrades performance on related distributions. Recent works propose hand-crafted fine-tuning algorithms to mitigate this issue, but these may not fully capture the relevant information for the task at hand. Since the optimal fine-tuning procedure depends on the relationship between the pretrained model and the downstream task, a data-driven approach which automatically searches the space of fine-tuning procedures is desirable. We propose LOFT, which optimizes the fine-tuning procedure such that resulting models are more robust to distribution shifts. Concretely, LOFT leverages a small validation set from a distribution that is different from both the training set and the test set. LOFT performs bi-level optimization to search for a loss function for which fine-tuning results in high performance on the validation set. Our experiments on nine naturally occurring distribution shifts show that LOFT outperforms existing robust fine-tuning methods in generalization to unseen distributions. Notably, LOFT achieves a new state-of-the-art on the WILDS iWildCam and FMoW benchmarks, outperforming the previous best methods by 6.0% and 1.5%, respectively.
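The bi-level search is expensive to reproduce faithfully, so the sketch below is a deliberately simplified stand-in: it samples candidate loss hyperparameters (here just a label-smoothing coefficient), fine-tunes a toy model with each, and keeps whichever does best on an out-of-distribution validation split. The data, model, and hyperparameter space are illustrative assumptions.

```python
# Simplified stand-in for data-driven objective search: pick the loss
# hyperparameter that yields the best accuracy on an OOD validation split.
import torch
import torch.nn.functional as F

def fine_tune(smoothing, x_train, y_train, steps=200, lr=0.1):
    model = torch.nn.Linear(x_train.size(1), 2)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(model(x_train), y_train, label_smoothing=smoothing)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(-1) == y).float().mean().item()

# toy source data and an OOD validation split with a shifted input distribution
x_src = torch.randn(512, 10); y_src = (x_src[:, 0] > 0).long()
x_val = torch.randn(128, 10) + 1.0; y_val = (x_val[:, 0] > 0).long()

best = max(((s, accuracy(fine_tune(s, x_src, y_src), x_val, y_val))
            for s in [0.0, 0.1, 0.2, 0.4]), key=lambda t: t[1])
print("selected label smoothing and OOD validation accuracy:", best)
```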
- [ICLR] Project and Probe: Sample-Efficient Domain Adaptation by Interpolating Orthogonal Features
Annie S. Chen*, Yoonho Lee*, Amrith Setlur, Sergey Levine, Chelsea Finn
ICLR 2024 (Spotlight Presentation, top 5%)
ICLR 2023 Workshops: TrustML (Oral), ME-FoMo [abstract] [paper]
TLDR: A lightweight and sample-efficient approach that learns diverse linear features and adapts to a target distribution by interpolating among them with a small target dataset.
Conventional approaches to robustness try to learn a model based on causal features. However, identifying maximally robust or causal features may be difficult in some scenarios, and in others, non-causal "shortcut" features may actually be more predictive. We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features with a small target dataset. Our approach, Project and Probe (Pro2), first learns a linear projection that maps a pre-trained embedding onto orthogonal directions while being predictive of labels in the source dataset. The goal of this step is to learn a variety of predictive features, so that at least some of them remain useful after distribution shift. Pro2 then learns a linear classifier on top of these projected features using a small target dataset. We theoretically show that Pro2 learns a projection matrix that is optimal for classification in an information-theoretic sense, resulting in better generalization due to a favorable bias-variance tradeoff. Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro2 improves performance by 5-15% when given limited target data compared to prior methods such as standard linear probing.
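A minimal sketch of the two stages on pre-extracted embeddings, assuming an orthogonality penalty on the projection matrix; the paper's construction of the projection differs in detail.

```python
import torch
import torch.nn.functional as F

def learn_projection(z_src, y_src, num_dirs=8, steps=500, lam=1.0, lr=1e-2):
    """Stage 1: project pretrained embeddings onto directions that are predictive
    of source labels and approximately orthogonal to one another."""
    d = z_src.size(1)
    proj = torch.nn.Linear(d, num_dirs, bias=False)
    head = torch.nn.Linear(num_dirs, int(y_src.max()) + 1)
    opt = torch.optim.Adam(list(proj.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        W = proj.weight                                    # (num_dirs, d)
        ortho = ((W @ W.T - torch.eye(num_dirs)) ** 2).sum()
        loss = F.cross_entropy(head(proj(z_src)), y_src) + lam * ortho
        opt.zero_grad(); loss.backward(); opt.step()
    return proj

def probe(proj, z_tgt_small, y_tgt_small, steps=500, lr=1e-2):
    """Stage 2: fit a linear classifier on the projected features using a small
    labeled target set, interpolating among the learned directions."""
    head = torch.nn.Linear(proj.weight.size(0), int(y_tgt_small.max()) + 1)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    feats = proj(z_tgt_small).detach()
    for _ in range(steps):
        loss = F.cross_entropy(head(feats), y_tgt_small)
        opt.zero_grad(); loss.backward(); opt.step()
    return head
```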
- [ICLR] Self-Guided Masked Autoencoders for Domain-Agnostic Self-Supervised Learning
Johnathan Wenjia Xie, Yoonho Lee, Annie S. Chen, Chelsea Finn
ICLR 2024 [abstract] [paper]
TLDR: Self-supervised learning by predicting masked features, applicable to novel data modalities.
Self-supervised learning excels at learning representations from large amounts of unlabeled data and has demonstrated success across multiple data modalities. Yet, extending self-supervised learning to new modalities is non-trivial because existing methods are tailored to each domain, for instance through domain-specific augmentations that reflect the invariances of the target task. While masked modeling is promising as a domain-agnostic framework for self-supervised learning because it does not rely on input augmentations, its mask sampling procedure remains domain-specific. We present Self-guided Masked Autoencoders (SMA), a fully domain-agnostic masked modeling method. SMA trains an attention-based model with a masked modeling objective, learning which positions to mask without any domain-specific assumptions. We evaluate SMA on three self-supervised learning benchmarks in protein biology, chemical property prediction, and particle physics. We find that SMA learns representations without domain-specific knowledge and achieves state-of-the-art performance on these three benchmarks.
- [ICLR-W] Calibrating Language Models With Adaptive Temperature Scaling
Johnathan Wenjia Xie*, Annie S. Chen*, Yoonho Lee, Eric Mitchell, Chelsea Finn
ICLR 2024 Workshop on Secure and Trustworthy LLMs [abstract] [paper]
TLDR: Calibrating post-trained models via tokenwise temperature scaling.
The effectiveness of large language models (LLMs) is measured not only by their ability to generate accurate outputs but also by their calibration: how well their confidence scores reflect the probability that their outputs are correct. While unsupervised pre-training has been shown to yield LLMs with well-calibrated conditional probabilities, recent studies have shown that after fine-tuning with reinforcement learning from human feedback (RLHF), the calibration of these models degrades significantly. In this work, we introduce Adaptive Temperature Scaling (ATS), a post-hoc calibration method that predicts a temperature scaling parameter for each token prediction. The predicted temperature values adapt based on token-level features and are fit over a standard supervised fine-tuning (SFT) dataset. The adaptive nature of ATS addresses the varying degrees of calibration shift that can occur after RLHF fine-tuning. ATS improves calibration by 10-50% across three downstream natural language evaluation benchmarks compared to prior calibration methods and does not impede performance improvements from RLHF.
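A minimal sketch of per-token temperature scaling, assuming a small MLP over the frozen LM's hidden states as the temperature predictor; the head architecture and softplus parameterization are assumptions.

```python
import torch
import torch.nn.functional as F

class AdaptiveTemperatureHead(torch.nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim // 4), torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim // 4, 1))

    def forward(self, hidden_states):                  # (batch, seq, hidden)
        # softplus keeps each token's predicted temperature positive
        return F.softplus(self.mlp(hidden_states)) + 1e-3

def calibrated_logits(logits, hidden_states, temp_head):
    """Divide each token's logits by its predicted temperature. The head is fit
    post hoc with cross-entropy on an SFT dataset while the LM stays frozen."""
    temperature = temp_head(hidden_states)             # (batch, seq, 1)
    return logits / temperature
```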
- [NeurIPS-W] Confidence-Based Model Selection: When to Take Shortcuts for Subpopulation Shifts
Annie S. Chen, Yoonho Lee, Amrith Setlur, Sergey Levine, Chelsea Finn
NeurIPS 2023 Workshop on Distribution Shifts [abstract] [paper]
TLDR: Per-sample, confidence-based selection between robust and non-robust models.
Effective machine learning models learn both robust features that directly determine the outcome of interest (e.g., an object with wheels is more likely to be a car), and shortcut features (e.g., an object on a road is more likely to be a car). The latter can be a source of error under distributional shift, when the correlations change at test-time. The prevailing sentiment in the robustness literature is to avoid such correlative shortcut features and learn robust predictors. However, while robust predictors perform better on worst-case distributional shifts, they often sacrifice accuracy on majority subpopulations. In this paper, we argue that shortcut features should not be entirely discarded. Instead, if we can identify the subpopulation to which an input belongs, we can adaptively choose among models with different strengths to achieve high performance on both majority and minority subpopulations. We propose COnfidence-baSed MOdel Selection (CosMoS), where we observe that model confidence can effectively guide model selection. Notably, CosMoS does not require any target labels or group annotations, either of which may be difficult to obtain or unavailable. We evaluate CosMoS on four datasets with spurious correlations, each with multiple test sets with varying levels of data distribution shift. We find that CosMoS achieves 2-5% lower average regret across all subpopulations, compared to using only robust predictors or other model aggregation methods.
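A minimal sketch of per-sample selection, using max-softmax confidence as the selection signal; the paper's selection rule may be more involved.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cosmos_style_predict(robust_model, erm_model, x):
    """Per example, trust whichever model is more confident: the ERM model on
    majority subpopulations where shortcuts help, the robust model elsewhere."""
    p_robust = F.softmax(robust_model(x), dim=-1)
    p_erm = F.softmax(erm_model(x), dim=-1)
    use_erm = p_erm.max(dim=-1).values > p_robust.max(dim=-1).values
    chosen = torch.where(use_erm.unsqueeze(-1), p_erm, p_robust)
    return chosen.argmax(dim=-1)
```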
- [ICML] DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, Chelsea Finn
ICML 2023 (Oral Presentation, top 3%) [abstract] [paper] [website] [code] [demo]
TLDR: We develop a method that can detect whether a passage was generated by a particular language model, based on the hypothesis that model-generated text lies near a local maximum of the model's log-probability function.
The fluency and factual knowledge of large language models (LLMs) heighten the need for corresponding systems to detect whether a piece of text is machine-written. For example, students may use LLMs to complete written assignments, leaving instructors unable to accurately assess student learning. In this paper, we first demonstrate that text sampled from an LLM tends to occupy negative curvature regions of the model's log probability function. Leveraging this observation, we then define a new curvature-based criterion for judging if a passage is generated from a given LLM. This approach, which we call DetectGPT, does not require training a separate classifier, collecting a dataset of real or generated passages, or explicitly watermarking generated text. It uses only log probabilities computed by the model of interest and random perturbations of the passage from another generic pre-trained language model (e.g., T5). We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection, notably improving detection of fake news articles generated by 20B-parameter GPT-NeoX from 0.81 AUROC for the strongest zero-shot baseline to 0.95 AUROC for DetectGPT.
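A minimal sketch of the perturbation-discrepancy score, assuming GPT-2 as the scoring model and crude word dropping as a stand-in for the paper's T5 mask-filling perturbations.

```python
import random
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def log_likelihood(text: str) -> float:
    """Average token log-probability of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)          # loss is mean next-token NLL
    return -out.loss.item()

def perturb(text: str, drop_frac: float = 0.15) -> str:
    """Crude perturbation: randomly drop a fraction of words (a hypothetical
    stand-in for the T5 mask-filling perturbations used in the paper)."""
    words = text.split()
    kept = [w for w in words if random.random() > drop_frac]
    return " ".join(kept) if kept else text

def detectgpt_score(text: str, n_perturbations: int = 20) -> float:
    """Higher score suggests the text sits near a local maximum of the model's
    log-probability, i.e. it is more likely to be model-generated."""
    orig = log_likelihood(text)
    perturbed = [log_likelihood(perturb(text)) for _ in range(n_perturbations)]
    return orig - sum(perturbed) / len(perturbed)
```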
- [ICLR] Surgical Fine-Tuning Improves Adaptation to Distribution Shifts
Yoonho Lee*, Annie S. Chen*, Fahim Tajwar, Ananya Kumar, Huaxiu Yao, Percy Liang, Chelsea Finn
ICLR 2023
NeurIPS 2022 Workshops: DistShift, ICBINB [abstract] [paper] [code]
TLDR: The best layer to fine-tune reflects the nature of the distribution shift.
A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, preserving learned features while also adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and the relevant information depends on the type of shift.
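A minimal sketch of the idea for an input-level shift: freeze a ResNet-50 and unfreeze only its first block. Which block to unfreeze depends on the type of shift, and the exact layer grouping here is an illustrative choice.

```python
# Surgical fine-tuning sketch: tune only the earliest ResNet-50 block, which
# tends to help under input-level shifts such as image corruptions.
import torch
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V2")
for p in model.parameters():
    p.requires_grad = False                      # freeze everything...
for module in (model.conv1, model.bn1, model.layer1):
    for p in module.parameters():
        p.requires_grad = True                   # ...except the first block

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)
# ...then run a standard fine-tuning loop with this optimizer on the target data.
```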
- [ICLR] Diversify and Disambiguate: Out-of-Distribution Robustness via Disagreement
Yoonho Lee, Huaxiu Yao, Chelsea Finn
ICLR 2023
ICML Workshops: PODS, SCIS [abstract] [paper] [website] [code]
TLDR: Given underspecified data, (1) find a diverse set of solutions and (2) choose the best one.
Many datasets are underspecified: there exist multiple equally viable solutions to a given task. Underspecification can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus produce widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a diverse collection of hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualizations. We demonstrate the ability of DivDis to find hypotheses that use robust features in image classification and natural language processing problems with underspecification.
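A sketch of the diversification objective: multiple heads share a backbone, fit the labeled source data, and are pushed to disagree on unlabeled target data by penalizing the estimated mutual information between each pair of heads' predictions. Consider it an illustration of the objective rather than the full method.

```python
import torch
import torch.nn.functional as F

def pairwise_mutual_information(p1, p2, eps=1e-8):
    """MI between two heads' predicted class variables, estimated over a batch.
    p1, p2: (batch, num_classes) softmax outputs."""
    joint = (p1.unsqueeze(2) * p2.unsqueeze(1)).mean(0)     # (C, C)
    marg1, marg2 = p1.mean(0), p2.mean(0)
    return (joint * (joint.add(eps).log()
                     - marg1.unsqueeze(1).add(eps).log()
                     - marg2.unsqueeze(0).add(eps).log())).sum()

def divdis_style_loss(heads, feats_src, y_src, feats_tgt, lam=1.0):
    """Cross-entropy on source features plus a disagreement term on unlabeled
    target features; `heads` is a list of classification heads on shared features."""
    ce = sum(F.cross_entropy(h(feats_src), y_src) for h in heads)
    probs = [F.softmax(h(feats_tgt), dim=-1) for h in heads]
    mi = sum(pairwise_mutual_information(probs[i], probs[j])
             for i in range(len(probs)) for j in range(i + 1, len(probs)))
    return ce + lam * mi
```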
- [NeurIPS-W] Relaxing the Kolmogorov Structure Function for Realistic Computational Constraints
Yoonho Lee, Chelsea Finn, Stefano Ermon
NeurIPS 2022 Workshop on Information-Theoretic Principles in Cognitive Systems [abstract] [paper]
TLDR: An efficient relaxation of the Kolmogorov Structure Function that can leverage neural networks.
The degree to which a task is learnable given different computational constraints shows the amount of usable information at different scales. An instantiation of this idea is the Kolmogorov Structure Function (KSF), which shows how the fit of an optimal k-bit description of a given string improves for increasing values of k. While conceptually appealing, computing the KSF is infeasible in practice due to the exponentially large search space of all descriptions of a given length, in addition to the unbounded time complexity. This paper proposes the Constrained Structure Function (CSF), a generalization of the KSF that can be computed efficiently by taking into account realistic computational constraints. In addition to being feasible to compute, the CSF of a dataset can be expressed as the sum of datapoint-wise functions which reflect the degree to which each datapoint is typical in the context of the dataset. Empirically, we demonstrate that the CSF can be used for detecting individual datapoints with characteristics such as being easy, mislabeled, or belonging to a hidden subgroup.
- [NeurIPS] Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time
Huaxiu Yao*, Caroline Choi*, Bochuan Cao, Yoonho Lee, Pang Wei Koh, Chelsea Finn
NeurIPS 2022 Datasets & Benchmarks Track
ICML 2022 Shift Happens Workshop [abstract] [paper] [code]
TLDR: A benchmark of distribution shifts over time.
Distribution shifts occur when the test distribution differs from the training distribution, and can considerably degrade performance of machine learning models deployed in the real world. While recent works have studied robustness to distribution shifts, distribution shifts arising from the passage of time have the additional structure of timestamp metadata. Real-world examples of such shifts are underexplored, and it is unclear whether existing models can leverage trends in past distribution shifts to reliably extrapolate into the future. To address this gap, we curate Wild-Time, a benchmark of 7 datasets that reflect temporal distribution shifts arising in a variety of real-world applications. On these datasets, we systematically benchmark 9 approaches with various inductive biases. Our experiments demonstrate that existing methods are limited in tackling temporal distribution shift: across all settings, we observe an average performance drop of 21% from in-distribution to out-of-distribution data.
- [NeurIPS] On Divergence Measures for Bayesian Pseudocoresets
Balhae Kim, Jungwon Choi, Seanie Lee, Yoonho Lee, Jung-Woo Ha, Juho Lee
NeurIPS 2022 [abstract] [paper] [code]
TLDR: An exploration of the choice of divergence for learning a Bayesian pseudocoreset.
A Bayesian pseudocoreset is a small synthetic dataset for which the posterior over parameters approximates that of the original dataset. While promising, the scalability of Bayesian pseudocoresets is not yet validated in large-scale problems such as image classification with deep neural networks. On the other hand, dataset distillation methods similarly construct a small dataset such that the optimization with the synthetic dataset converges to a solution similar to optimization with full data. Although dataset distillation has been empirically verified in large-scale settings, the framework is restricted to point estimates, and their adaptation to Bayesian inference has not been explored. This paper casts two representative dataset distillation algorithms as approximations to methods for constructing pseudocoresets by minimizing specific divergence measures: reverse KL divergence and Wasserstein distance. Furthermore, we provide a unifying view of such divergence measures in Bayesian pseudocoreset construction. Finally, we propose a novel Bayesian pseudocoreset algorithm based on minimizing forward KL divergence. Our empirical results demonstrate that the pseudocoresets constructed from these methods reflect the true posterior even in large-scale Bayesian inference problems.
- [Entropy] Discrete Infomax Codes for Supervised Representation Learning
Yoonho Lee, Wonjae Kim, Wonpyo Park, Seungjin Choi
Entropy Special Issue "Theory and Applications of Information Processing Algorithms" [abstract] [paper]
TLDR: Regularizing few-shot classification using compact discrete codes.
Learning compact discrete representations of data is a key task on its own and a useful step for subsequent processing. In this paper we present a model that produces Discrete InfoMax COdes (DIMCO); we learn a probabilistic encoder that yields k-way d-dimensional codes associated with input data. Our model's learning objective is to maximize the mutual information between codes and labels, with a regularizer that encourages the entries of a codeword to be as independent as possible. We show that the infomax principle also justifies previous loss functions (e.g., cross-entropy) as its special cases. Our analysis also shows that using shorter codes, as DIMCO does, reduces overfitting in the context of few-shot classification. Through experiments in various domains, we observe this implicit meta-regularization effect of DIMCO. Furthermore, we show that the codes learned by DIMCO are efficient in terms of both memory and retrieval time compared to previous methods.
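A sketch of the infomax objective for one batch, assuming soft k-way codes per dimension and a plug-in empirical estimate of the code-label mutual information; the paper's estimator and independence regularizer differ in detail.

```python
import torch
import torch.nn.functional as F

def entropy(p, eps=1e-8):
    return -(p * p.add(eps).log()).sum(-1)

def infomax_loss(code_logits, labels):
    """code_logits: (batch, d, k), i.e. d code dimensions, each a k-way categorical.
    Returns the negative of an empirical estimate of sum_j I(Z_j; Y)."""
    probs = F.softmax(code_logits, dim=-1)                 # (B, d, k)
    h_z = entropy(probs.mean(0))                           # (d,) marginal code entropy
    h_z_given_y = sum((labels == y).float().mean() * entropy(probs[labels == y].mean(0))
                      for y in labels.unique())            # (d,) conditional entropy
    return -(h_z - h_z_given_y).sum()                      # minimize negative MI
```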
- [NeurIPS] Diversity Matters When Learning From Ensembles
Giung Nam*, Jongmin Yoon*, Yoonho Lee, Juho Lee
NeurIPS 2021 [abstract] [paper] [code]
TLDR: To distill from deep ensembles, use inputs that ensemble members disagree on.
Deep ensembles excel in large-scale image classification tasks in terms of both prediction accuracy and calibration. Despite being simple to train, the computation and memory cost of deep ensembles limits their practicality. While some recent works propose to distill an ensemble into a single model to reduce such costs, there is still a performance gap between the ensemble and the distilled model. We propose a simple approach for reducing this gap, that is, bringing the distilled model's performance close to that of the full ensemble. Our key assumption is that a distilled model should absorb as much function diversity inside the ensemble as possible. We first empirically show that the typical distillation procedure does not effectively transfer such diversity, especially for complex models that achieve near-zero training error. To fix this, we propose an augmentation-based distillation strategy that reveals diversity by seeking inputs for which ensemble member outputs disagree. We empirically show that a model distilled with such augmented samples indeed exhibits enhanced diversity, leading to improved performance.
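A sketch of disagreement-seeking distillation: each image gets several candidate augmentations, the view on which ensemble members disagree most (here, the variance of their softmax outputs) is kept, and the student is distilled toward the ensemble mean on that view. The augmentations, the disagreement measure, and the 32x32 crop size (CIFAR scale) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
])

@torch.no_grad()
def most_disagreeing_view(ensemble, x, num_views=4):
    """x: (batch, C, H, W) image tensor. Returns one augmented view per image."""
    best, best_score = None, x.new_full((x.size(0),), -1.0)
    for _ in range(num_views):
        v = torch.stack([augment(img) for img in x])
        probs = torch.stack([F.softmax(m(v), dim=-1) for m in ensemble])  # (M, B, C)
        score = probs.var(dim=0).sum(dim=-1)           # per-image disagreement
        if best is None:
            best, best_score = v, score
        else:
            improved = score > best_score
            best[improved] = v[improved]
            best_score = torch.where(improved, score, best_score)
    return best

def distill_step(student, ensemble, x):
    """One distillation loss term on the most-disagreeing views of a batch."""
    v = most_disagreeing_view(ensemble, x)
    with torch.no_grad():
        teacher = torch.stack([F.softmax(m(v), dim=-1) for m in ensemble]).mean(0)
    return F.kl_div(F.log_softmax(student(v), dim=-1), teacher, reduction="batchmean")
```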
- [ICML-W] Amortized Probabilistic Detection of Communities in Graphs
Yueqi Wang*, Yoonho Lee*, Pallab Basu, Juho Lee, Yee Whye Teh, Liam Paninski, Ari Pakman
ICML 2024 SPIGM Workshop [abstract] [paper] [code]
TLDR: An attention-based method for probabilistically detecting communities within graphs.
Learning community structures in graphs has broad applications across scientific domains. While graph neural networks (GNNs) have been successful in encoding graph structures, existing GNN-based methods for community detection are limited by requiring knowledge of the number of communities in advance, in addition to lacking a proper probabilistic formulation to handle uncertainty. We propose a simple framework for amortized community detection, which addresses both of these issues by combining the expressive power of GNNs with recent methods for amortized clustering. Our models consist of a graph representation backbone that extracts structural information and an amortized clustering network that naturally handles variable numbers of clusters. Both components combine into well-defined models of the posterior distribution of graph communities and are jointly optimized given labeled graphs. At inference time, the models yield parallel samples from the posterior of community labels, quantifying uncertainty in a principled way. We evaluate several models from our framework on synthetic and real datasets and demonstrate superior performance to previous methods. As a separate contribution, we extend recent amortized probabilistic clustering architectures by adding attention modules, which yield further improvements on community detection tasks.
- [UAI] On the Distribution of Penultimate Activations of Classification Networks
Minkyo Seo*, Yoonho Lee*, Suha Kwak
UAI 2021 [abstract] [paper]
TLDR: Final FC layer weights contain information about class relations.
This paper studies the probability distributions of penultimate activations of classification networks. Specifically, we show that, when a classification network is trained with the cross-entropy loss, its final classification layer forms a Generative-Discriminative pair with a generative classifier based on a specific distribution of penultimate activations. More importantly, the distribution is parameterized by the weights of the final fully-connected layer, and can be considered as a generative model that synthesizes the penultimate activations without feeding input data. We empirically demonstrate that this generative model enables stable knowledge distillation in the presence of domain shift, and can also transfer knowledge from a classifier to variational autoencoders and generative adversarial networks for class-conditional image generation.
- [NeurIPS] Bootstrapping Neural Processes
Juho Lee*, Yoonho Lee*, Jungtaek Kim, Eunho Yang, Sung Ju Hwang, Yee Whye Teh
NeurIPS 2020 [abstract] [paper] [video] [code]
TLDR: Improved uncertainty estimates in Neural Processes using bootstrapping.
Unlike traditional statistical modeling, in which a user typically hand-specifies a prior, Neural Processes (NPs) implicitly define a broad class of stochastic processes with neural networks. Given a data stream, an NP learns a stochastic process that best describes the data. While this "data-driven" way of learning stochastic processes has proven to handle various types of data, NPs still rely on the assumption that uncertainty in the stochastic process is modeled by a single latent variable, which potentially limits flexibility. To address this, we propose the Bootstrapping Neural Process (BNP), a novel extension of the NP family using the bootstrap. The bootstrap is a classical data-driven technique for estimating uncertainty, which allows BNP to learn the stochasticity in NPs without assuming a particular form. We demonstrate the efficacy of BNP on various types of data and its robustness in the presence of model-data mismatch.
- [NeurIPS] Neural Complexity Measures
Yoonho Lee, Juho Lee, Sung Ju Hwang, Eunho Yang, Seungjin Choi
NeurIPS 2020 [abstract] [paper] [video] [code]
TLDR: A meta-learning framework for predicting generalization.
While various complexity measures for deep neural networks exist, specifying an appropriate measure capable of predicting and explaining generalization in deep networks has proven challenging. We propose Neural Complexity (NC), a meta-learning framework for predicting generalization. Our model learns a scalar complexity measure through interactions with many heterogeneous tasks in a data-driven way. The trained NC model can be added to the standard training loss to regularize any task learner in a standard supervised learning scenario. We contrast NC’s approach against existing manually-designed complexity measures and other meta-learning models, and we validate NC’s performance on multiple regression and classification tasks.
- [NeurIPS-W] Deep Amortized Clustering
Juho Lee, Yoonho Lee, Yee Whye Teh
NeurIPS 2019 Sets and Parts Workshop (Oral) [abstract] [paper]
TLDR: Learning to cluster by identifying one cluster at a time.
We propose Deep Amortized Clustering (DAC), a framework in which a neural network learns to cluster datasets efficiently in a few forward passes. DAC implicitly learns what makes a cluster, how to group data points into clusters, and how to count the number of clusters in a dataset. DAC is meta-learned in a data-driven way, using only clustered datasets and their partitions. This framework differs from traditional clustering algorithms, which usually require user-specified prior knowledge about the shape or structure of clusters. We empirically show on both synthetic and image data that DAC can efficiently and accurately cluster novel datasets.
- [NeurIPS] Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning
Wonjae Kim, Yoonho Lee
NeurIPS 2019 [abstract] [paper] [code]
TLDR: Smooth and interpretable attention using Neural ODEs.
Without relevant human priors, neural networks may learn uninterpretable features. We propose Dynamics of Attention for Focus Transition (DAFT) as a human prior for machine reasoning. DAFT is a novel method that regularizes attention-based reasoning by modelling it as a continuous dynamical system using neural ordinary differential equations. As a proof of concept, we augment a state-of-the-art visual reasoning model with DAFT. Our experiments reveal that applying DAFT yields similar performance to the original model while using fewer reasoning steps, showing that it implicitly learns to skip unnecessary steps. We also propose a new metric, Total Length of Transition (TLT), which represents the effective reasoning step size by quantifying how much a given model’s focus drifts while reasoning about a question. We show that adding DAFT results in lower TLT, demonstrating that our method indeed obeys the human prior towards shorter reasoning paths in addition to producing more interpretable attention maps.
- [ICML] Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, Yee Whye Teh
ICML 2019 [abstract] [paper] [code]
TLDR: Self-attention for sets using inducing points. O(N) feedforward complexity.
Many machine learning tasks, such as multiple instance learning, 3D shape recognition, and few-shot image classification, are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. To reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from the sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive, and we evaluate it on a range of tasks, demonstrating state-of-the-art performance compared to recent methods for set-structured data.
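A minimal sketch of the inducing-point attention scheme (ISAB): m learned inducing points attend to the n set elements, and the set then attends back to the resulting summaries, so the cost is O(nm) rather than O(n^2). Layer details such as norm placement and feed-forward width are simplified relative to the paper.

```python
import torch
import torch.nn as nn

class MAB(nn.Module):
    """Multihead Attention Block: attend queries X to keys/values Y."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, X, Y):
        H = self.ln1(X + self.attn(X, Y, Y, need_weights=False)[0])
        return self.ln2(H + self.ff(H))

class ISAB(nn.Module):
    """Induced Set Attention Block: linear-in-n attention via learned inducing points."""
    def __init__(self, dim, num_inducing=16, num_heads=4):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(1, num_inducing, dim))
        self.mab1 = MAB(dim, num_heads)
        self.mab2 = MAB(dim, num_heads)

    def forward(self, X):                     # X: (batch, n, dim)
        I = self.inducing.expand(X.size(0), -1, -1)
        H = self.mab1(I, X)                   # inducing points summarize the set
        return self.mab2(X, H)                # the set attends back to the summaries

# usage: a permutation-equivariant encoding of sets with 100 elements each
x = torch.randn(8, 100, 64)
print(ISAB(64)(x).shape)                      # torch.Size([8, 100, 64])
```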
- [ICML] Gradient-based Meta-learning with Learned Layerwise Metric and Subspace
Yoonho Lee, Seungjin Choi
ICML 2018 [abstract] [paper] [video] [code]
TLDR: Improving MAML by fixing some weights during task adaptation.
Gradient-based meta-learning methods leverage gradient descent to learn the commonalities among various tasks. While such methods have been successful in meta-learning tasks, they resort to simple gradient descent during meta-testing. Our primary contribution is MT-net, which enables the meta-learner to learn, within each layer's activation space, a subspace in which the task-specific learner performs gradient descent. Additionally, a task-specific learner of an MT-net performs gradient descent with respect to a meta-learned distance metric, which warps the activation space to be more sensitive to task identity. We demonstrate that the dimension of this learned subspace reflects the complexity of the task-specific learner's adaptation task, and also that our model is less sensitive to the choice of initial learning rates than previous gradient-based meta-learning methods. Our method achieves state-of-the-art or comparable performance on few-shot classification and regression tasks.