A New AI Research Introduces CACTI: A Framework For Multi-Task Multi-Scene Robot Manipulation


Recent advances in learning-based control have brought us closer to the goal of building an embodied agent with generalizable, human-like abilities. Natural language processing (NLP) and computer vision (CV) have come a long way, thanks in large part to the availability of structured datasets on a massive scale: with web-scale datasets of high-quality images and text, the same fundamental methods have shown significant improvements. Gathering data on a comparable scale for robot learning, however, is impractical because of logistical difficulties. Compared with the abundance of textual and visual data online, collecting demonstrations via teleoperation is laborious and time-consuming. In robot manipulation, covering a wide range of objects and scenarios also requires enormous physical resources, so obtaining a diverse dataset is even harder.

In a recent study, researchers from Columbia University, Meta AI, and Carnegie Mellon University introduced CACTI, a framework for robot manipulation that can handle several tasks in different environments. It uses text-to-image generative models (such as Stable Diffusion) to add visually realistic variations to the data, and it scales well to a variety of tasks. The work centers on splitting the overall pipeline into more manageable stages according to their cost. To ease the burden of collecting a large amount of data, CACTI introduces a novel data augmentation scheme that enriches data diversity with semantic and visual variations.
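The augmentation idea can be sketched in a few lines, under stated assumptions. Here each demonstration frame comes with a mask marking the region to repaint, and `fake_inpaint` is a hypothetical stand-in for a text-to-image inpainting model such as Stable Diffusion (it is not a real library call and ignores the prompt). The sketch also assumes the recorded actions are reused unchanged, so that only the visuals vary; the names `Step` and `augment_demo` are illustrative and do not come from the paper.

```python
import random
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Step:
    image: Image
    action: List[float]


def fake_inpaint(image: Image, mask: Image, prompt: str, rng: random.Random) -> Image:
    """Hypothetical stand-in for a generative inpainting model.

    Pixels where mask == 1 are repainted with random values; pixels where
    mask == 0 are copied through untouched. The prompt is ignored here,
    whereas a real model would condition on it.
    """
    return [
        [rng.randrange(256) if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def augment_demo(demo: List[Step], mask: Image, prompts: List[str],
                 inpaint=fake_inpaint, seed: int = 0) -> List[List[Step]]:
    """Return one visually altered copy of the demo per prompt.

    Assumption: actions are copied as-is, so the expert labels stay valid.
    """
    rng = random.Random(seed)
    copies = []
    for prompt in prompts:
        copies.append([
            Step(image=inpaint(s.image, mask, prompt, rng), action=list(s.action))
            for s in demo
        ])
    return copies


demo = [Step(image=[[10, 20], [30, 40]], action=[0.1, -0.2])]
mask = [[0, 1], [0, 0]]  # repaint only the top-right pixel
augmented = augment_demo(demo, mask, prompts=["a red mug", "a wooden bowl"])
print(len(augmented), "augmented copies of the demo")
```

Swapping `fake_inpaint` for a real generative model would leave the rest of the sketch untouched, since the model is passed in as a plain function.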

The name CACTI reflects the four stages of the framework: (1) collect expert demonstrations, (2) augment the data to increase visual diversity, (3) compress the images into frozen pretrained representations, and (4) train imitation learning agents on the compressed data. Recent state-of-the-art text-to-image generation can produce, *zero-shot*, highly realistic objects and scenes on real robot data.
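The four stages can be read as a simple data pipeline. The following is a minimal sketch of that structure only, not the authors' implementation: every function name, type alias, and toy component (`toy_expert`, `toy_vary`, `toy_encoder`, `toy_fit`) is a hypothetical placeholder introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

Observation = List[float]  # toy stand-in for a camera image
Action = List[float]
Demo = List[Tuple[Observation, Action]]


def collect(expert: Callable[[str, int], Tuple[Observation, Action]],
            tasks: List[str], steps: int) -> Dict[str, Demo]:
    """Stage 1: record (observation, action) pairs from an expert, per task."""
    return {task: [expert(task, t) for t in range(steps)] for task in tasks}


def augment(data: Dict[str, Demo],
            vary: Callable[[Observation, int], Observation],
            copies: int) -> Dict[str, Demo]:
    """Stage 2: add visually varied copies of each demo; actions are reused."""
    out = {}
    for task, demo in data.items():
        pairs = list(demo)
        for k in range(copies):
            pairs.extend((vary(obs, k), act) for obs, act in demo)
        out[task] = pairs
    return out


def compress(data: Dict[str, Demo],
             encoder: Callable[[Observation], List[float]]) -> Dict[str, Demo]:
    """Stage 3: replace each observation with a frozen encoder's embedding."""
    return {task: [(encoder(obs), act) for obs, act in demo]
            for task, demo in data.items()}


def train(data: Dict[str, Demo], fit: Callable[[list], dict]) -> dict:
    """Stage 4: fit one policy on the pooled data from all tasks."""
    pooled = [(task, emb, act) for task, demo in data.items() for emb, act in demo]
    return fit(pooled)


# Toy components, assumed purely for illustration.
def toy_expert(task: str, t: int) -> Tuple[Observation, Action]:
    return [float(t), float(len(task))], [0.1 * t]


def toy_vary(obs: Observation, k: int) -> Observation:
    return [x + 0.01 * (k + 1) for x in obs]


def toy_encoder(obs: Observation) -> List[float]:
    return [sum(obs)]


def toy_fit(pooled: list) -> dict:
    return {"num_samples": len(pooled)}


raw = collect(toy_expert, ["open drawer", "pick mug"], steps=3)
aug = augment(raw, toy_vary, copies=2)
emb = compress(aug, toy_encoder)
policy = train(emb, toy_fit)
print(policy)
```

Keeping the stages as separate functions mirrors the idea that each one has a different cost: the expensive steps (collection, generation, encoding) run once, and only the final stage is repeated during training.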

In the Collect phase, demonstrations are gathered with little effort from a human expert or a task-specific learned expert. In the Augment phase, generative models trained outside the original domain increase visual diversity by adding new scenes and layouts to the dataset. In the Compress phase, images are encoded into frozen pretrained representations. In the final TraIn stage, a single policy head is trained on these frozen embeddings to imitate expert behavior across multiple tasks, taking advantage of the low cost of zero-shot visual representation models trained on out-of-domain data.
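To make the last stage concrete, here is a minimal behavior-cloning sketch in which only a small policy head is trained while the visual encoder stays fixed. Everything in it is an assumption made for illustration: the two task names, the hand-written `frozen_encoder`, the linear head, the one-hot task conditioning, and the toy expert are not taken from the paper, which uses pretrained visual representation models and its own policy design.

```python
import random
from typing import List, Sequence, Tuple

TASKS = ["open_drawer", "pick_mug"]  # hypothetical task names


def frozen_encoder(image: Sequence[float]) -> List[float]:
    """Stand-in for a pretrained visual encoder; it has no trainable parameters."""
    return [sum(image) / len(image), max(image) - min(image)]


def policy_input(image: Sequence[float], task: str) -> List[float]:
    """Concatenate the frozen embedding, a one-hot task id, and a bias term."""
    one_hot = [1.0 if task == name else 0.0 for name in TASKS]
    return frozen_encoder(image) + one_hot + [1.0]


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    return sum(w * x for w, x in zip(weights, features))


def train_policy_head(dataset: List[Tuple[List[float], str, float]],
                      epochs: int = 200, lr: float = 0.1, seed: int = 0) -> List[float]:
    """Fit a linear head by SGD on squared error between policy and expert action."""
    rng = random.Random(seed)
    data = list(dataset)  # copy so shuffling does not reorder the caller's list
    weights = [0.0] * len(policy_input(data[0][0], data[0][1]))
    for _ in range(epochs):
        rng.shuffle(data)
        for image, task, action in data:
            feats = policy_input(image, task)
            error = predict(weights, feats) - action
            # Only the head's weights change; the encoder is never updated.
            weights = [w - lr * error * x for w, x in zip(weights, feats)]
    return weights


def expert_action(image: Sequence[float], task: str) -> float:
    """Toy expert, assumed for illustration: the action depends on image and task."""
    offset = 0.5 if task == "open_drawer" else -0.5
    return sum(image) / len(image) + offset


data_rng = random.Random(1)
dataset = []
for _ in range(40):
    image = [data_rng.random() for _ in range(4)]
    task = data_rng.choice(TASKS)
    dataset.append((image, task, expert_action(image, task)))

weights = train_policy_head(dataset)
test_image = [0.2, 0.4, 0.6, 0.8]
for task in TASKS:
    print(task, round(predict(weights, policy_input(test_image, task)), 3))
```

Because the encoder is frozen, its outputs could be computed once and cached, which is what keeps this kind of training cheap.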

The researchers set up both simulated and real-world environments for the robots to operate in. In the real world, they used a Franka arm on a tabletop with ten different manipulation tasks. In simulation, they created a randomized kitchen environment with 18 tasks, 100+ scene layouts, and variations in visual attributes. Because frozen visual embeddings make training inexpensive, they train a single policy to accomplish the ten manipulation tasks, and the augmented data noticeably improves the policy's data efficiency and its robustness to novel distractors and layouts.

In simulation, the vision-based policy matches the performance of state-based oracles across the 18 tasks and hundreds of varied layouts and visual variations. Generalization to held-out layouts also improves as the number of training layouts increases, which is promising.

The findings strongly suggest that, where in-domain data collection poses fundamental difficulties, generalization in robot learning can be improved by leveraging large generative and representational models trained on heterogeneous, internet-scale, out-of-domain datasets. The team believes this work can be a good starting point for investigating deeper links between large out-of-domain models and robot learning, as well as for developing architectures capable of handling multi-modal data and scaling to multi-stage policies.



Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence in various fields. She is passionate about exploring new advancements in technology and their real-life applications.