Deep Learning on a Data Diet

Supervised learning trains a model on a labeled dataset: the algorithm is given input data together with the corresponding correct outputs, or “labels”. Unsupervised learning, by contrast, aims to learn meaningful and comprehensible representations from the inputs alone. It remains one of the most challenging problems in modern machine learning and deep learning, despite the recent success of self-supervised learning in particular, which is now widely used in applications including image and speech recognition, natural language processing, and recommendation systems.

Unsupervised learning has many moving pieces, which makes it complicated and hurts its reproducibility, scalability, and explainability. Recent literature has developed three main branches: 1) spectral embeddings, 2) self-supervised learning, and 3) reconstruction-based methods. Each of these schemes, however, has its pitfalls.

Spectral embedding produces embeddings from estimates of the geodesic distances between training samples, but those distances are hard to estimate reliably, which limits its use.

Self-supervised learning uses similar losses but generates positive pairs to avoid geodesic distance estimation. Yet it is hard to interpret, depends on numerous hyperparameters that do not transfer consistently across architectures and datasets, and lacks theoretical guarantees. Finally, reconstruction-based learning suffers from instability and requires careful tuning of the loss function to handle noisy data.

To overcome these challenges, recent research from Stanford and Meta AI proposes a strikingly simple unsupervised learning strategy designed to sidestep the limitations of current methods.

The approach is named DIET (Datum IndEx as Target) and implements a simple idea: use the index of each item in the dataset as its training label. The model therefore looks just like a supervised one, i.e., a backbone encoder plus a linear classifier, and any progress made in supervised learning can be ported as-is to DIET. To summarize, the three main benefits of DIET are: i) minimal code refactoring, ii) architecture independence, and iii) no additional hyperparameters. In particular, DIET does not require positive pairs or specific teacher-student architectures, and its training loss is informative of test-time performance without introducing any hyperparameters beyond those of a standard classification loss.
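
The idea of using the datum index as the label can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy linear encoder and plain per-sample SGD written with the Python standard library only, and the names `train_diet` and `toy_data`, the hidden size, and the learning rate are all hypothetical choices made for illustration. A real setup would use a deep backbone and data augmentation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(mat, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

def train_diet(data, hidden=8, epochs=300, lr=0.05, seed=0):
    """Train a toy linear encoder + linear head to predict each sample's index."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Encoder (hidden x d) and classifier head (n x hidden).
    # The head has one output per datum, because each index is its own class.
    enc = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
    head = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(n)]
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for idx, x in enumerate(data):  # the "label" is simply idx
            z = matvec(enc, x)
            probs = softmax(matvec(head, z))
            epoch_loss += -math.log(probs[idx])
            # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(idx)
            g = [p - (1.0 if k == idx else 0.0) for k, p in enumerate(probs)]
            # Backpropagate into the embedding before the head is updated
            gz = [sum(g[k] * head[k][j] for k in range(n)) for j in range(hidden)]
            for k in range(n):
                for j in range(hidden):
                    head[k][j] -= lr * g[k] * z[j]
            for j in range(hidden):
                for i in range(d):
                    enc[j][i] -= lr * gz[j] * x[i]
        losses.append(epoch_loss / n)
    return enc, head, losses

# Hypothetical toy dataset: 6 unlabeled samples with 8 features each.
data_rng = random.Random(1)
toy_data = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]
enc, head, losses = train_diet(toy_data)
```

Note that the classifier head grows with the number of samples, since every datum is its own class. After training, the head would typically be discarded and the encoder output used as the learned representation.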

Experimental results in the paper show that DIET can rival current state-of-the-art methods on the CIFAR100 and TinyImageNet benchmarks, which indicates non-trivial potential. Notably, the authors report empirical evidence that DIET is not influenced by the batch size and reaches good performance on small datasets, two settings in which current self-supervised methods are weak.

However, DIET still has some limitations to be addressed. Like self-supervised learning, it is highly sensitive to the strength of data augmentation. It also converges more slowly than self-supervised learning, although label smoothing helps.
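
To make the label-smoothing remark concrete, here is a small sketch of how a smoothed target over datum indices could be built and used in a cross-entropy loss. It assumes the common convention of placing `1 - eps` on the true class and spreading `eps` uniformly over all classes. The function names and the value `eps=0.2` are illustrative only and are not taken from the paper.

```python
import math

def smoothed_index_target(index, num_samples, eps):
    """Soft target: (1 - eps) on the datum's own index, eps spread uniformly."""
    uniform = eps / num_samples
    target = [uniform] * num_samples
    target[index] += 1.0 - eps
    return target

def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a (possibly soft) target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_z) for t, v in zip(target, logits))

# Four samples in the dataset; the current sample has index 2.
logits = [0.0, 0.0, 5.0, 0.0]
target = smoothed_index_target(index=2, num_samples=4, eps=0.2)
hard = cross_entropy(logits, [0.0, 0.0, 1.0, 0.0])  # one-hot target
soft = cross_entropy(logits, target)                # smoothed target
```

With a smoothed target, the loss stays larger than the one-hot loss even when the model is already confident about the correct index, so the model is not pushed toward arbitrarily large logits.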

Finally, the paper does not solve the problem of scaling to large datasets, and it shows that DIET cannot match state-of-the-art methods there without further consideration and design.


Lorenzo Brigato is a Postdoctoral Researcher at the ARTORG center, a research institution affiliated with the University of Bern, and is currently involved in the application of AI to health and nutrition. He holds a Ph.D. degree in Computer Science from the Sapienza University of Rome, Italy. His Ph.D. thesis focused on image classification problems with sample- and label-deficient data distributions.