CMU Researchers Propose TIDEE: An Embodied Agent That Can Tidy Up Never-Before-Seen Rooms Without Any Explicit Instruction


Effective robot operation requires more than blind obedience to predetermined commands. Robots should respond when there is an obvious deviation from the norm and should be able to deduce important context from incomplete instructions. Acting on partial or self-generated instructions demands reasoning grounded in a solid understanding of how things in the environment (objects, physics, other agents, etc.) normally behave. This type of thinking and acting is a crucial component of embodied commonsense reasoning, which is essential for robots to work and interact naturally in the real world.

The field of embodied commonsense reasoning has lagged behind that of embodied agents following specific step-by-step instructions, because commonsense agents must learn to observe and act without explicit guidance. Embodied common sense may be studied via tasks like tidying up, in which the agent must recognize items in the wrong places and take corrective action to return them to more appropriate locations. The agent must navigate and manipulate intelligently: searching likely locations for displaced objects, recognizing when things are out of their natural places in the current scene, and determining where to reposition the objects so they end up in proper locations. This challenge brings together commonsense reasoning about object placements and the skills we expect of intelligent beings.

TIDEE is an embodied agent, proposed by the CMU research team, that can tidy up spaces it has never seen before without guidance. TIDEE is the first of its kind: it can scan a scene for items that aren’t where they should be, figure out where in the scene to put them, and then move them there with precision.

TIDEE explores a home’s surroundings, finds misplaced things, infers probable object contexts for them, localizes such contexts in the present scene, and moves the objects back to their proper locations. The commonsense priors are encoded in i) a visual search network that guides the agent’s exploration so it can efficiently localize the receptacle of interest in the current scene and reposition the object; ii) visual-semantic detectors that detect out-of-place objects; and iii) an associative neural graph memory of objects and spatial relations that proposes plausible semantic receptacles and surfaces for object repositioning. Using the AI2THOR simulation environment, the researchers put TIDEE through its paces by having it clean up messy rooms. TIDEE completes the job straight from pixel and raw depth input without having seen the same room previously, using only priors learned from a separate collection of training homes. According to human assessments of the resulting room layout changes, TIDEE performs better than ablative variants of the model that exclude one or more of the commonsense priors.

TIDEE can tidy up spaces it has never seen before, without any guidance or prior exposure to the places or objects in question. It does this by exploring the area, detecting objects, and labeling them as in place or out of place. When an object is out of place, TIDEE employs graph inference over its scene graph and external graph memory to infer potential receptacle categories; it then uses the scene’s spatial semantic map to steer an image-based search network toward likely locations of those receptacle categories.

How does it work?

TIDEE tidies a room in three distinct phases. First, TIDEE scans the area, running an out-of-place detector at each time step until a misplaced object is found; it then navigates to the object and picks it up. Second, TIDEE infers a probable receptacle for the object from the scene graph and the joint external graph memory. Third, if TIDEE has not yet seen such a receptacle, a visual search network guides its exploration of the area by suggesting where the receptacle may be discovered. TIDEE retains the estimated 3D centroids of previously detected objects in memory and uses this information for navigation and object tracking.
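To make the flow concrete, here is a minimal Python sketch of that three-phase loop. Every name in it (`agent`, `graph_memory.propose_receptacle`, `search_net`, and so on) is a hypothetical stand-in for the modules described above, not the authors’ actual API:

```python
# Hypothetical sketch of TIDEE's three-phase tidying loop.
# All helper names are illustrative placeholders, not the paper's codebase.

def tidy_room(agent, max_steps=1000):
    for _ in range(max_steps):
        rgb, depth = agent.observe()                 # pixel + raw depth input
        detections = agent.detector(rgb, depth)      # visual-semantic detector
        misplaced = [d for d in detections if d.out_of_place_score > 0.5]
        if not misplaced:
            agent.explore_step()                     # keep scanning the room
            continue

        # Phase 1: navigate to the misplaced object and pick it up.
        obj = misplaced[0]
        agent.navigate_to(obj.centroid_3d)           # stored 3D centroid
        agent.pick_up(obj)

        # Phase 2: infer a plausible receptacle category from the scene
        # graph plus the external graph memory learned during training.
        receptacle_cat = agent.graph_memory.propose_receptacle(
            obj, agent.scene_graph)

        # Phase 3: if no such receptacle has been seen yet, let the
        # visual search network suggest where one is likely to be.
        target = agent.memory.lookup(receptacle_cat)
        if target is None:
            heatmap = agent.search_net(agent.obstacle_map, receptacle_cat)
            target = agent.explore_until_found(receptacle_cat, heatmap)

        agent.navigate_to(target.centroid_3d)
        agent.place(obj, target)
```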

Each object’s visual features are extracted using an off-the-shelf object detector, while its relational language features are produced by feeding phrases describing the 3D relationships between objects (such as “next to,” “supported by,” and “above”) through a pretrained language model.
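As a rough illustration of how such node features might be assembled, consider the sketch below; the `detector` and `lang_model` objects and the relation-phrase format are assumptions for illustration, not the paper’s exact implementation:

```python
import torch

def node_features(obj, neighbors, detector, lang_model):
    """Concatenate visual detector features with language-model embeddings
    of the object's 3D spatial relations (illustrative sketch)."""
    visual = detector.extract_features(obj.crop)      # e.g. ROI features

    # Describe each 3D relation as a short phrase and embed it,
    # e.g. "cup supported by table".
    phrases = [f"{obj.category} {rel} {other.category}"
               for rel, other in neighbors]
    relational = lang_model.embed(phrases).mean(dim=0)  # pool relation embeddings

    return torch.cat([visual, relational], dim=-1)
```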

TIDEE contains a neural graph module trained to propose plausible placements once an object has been picked up. The module operates on three interacting inputs: the object to be placed, a memory graph holding plausible contextual relations learned from training scenes, and a scene graph encoding the object-relation configuration of the present scene.
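A minimal sketch of that idea follows, using attention over the combined graph nodes to score candidate receptacle categories; the layer choices and scoring head are assumptions, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class ReceptacleProposer(nn.Module):
    """Illustrative stand-in for TIDEE's neural graph memory module:
    scores receptacle categories for a picked-up object by attending
    over nodes of the scene graph and the external memory graph."""
    def __init__(self, feat_dim, n_receptacle_cats):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                          batch_first=True)
        self.score = nn.Linear(feat_dim, n_receptacle_cats)

    def forward(self, obj_feat, scene_nodes, memory_nodes):
        # obj_feat: (1, feat_dim); scene/memory nodes: (N, feat_dim) each.
        context = torch.cat([scene_nodes, memory_nodes], dim=0).unsqueeze(0)
        query = obj_feat.unsqueeze(0)                 # (1, 1, feat_dim)
        fused, _ = self.attn(query, context, context)  # aggregate graph context
        return self.score(fused.squeeze(0))           # logits over categories
```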

TIDEE employs a visual search network that, given the semantic obstacle map and a search category, predicts the likelihood of an object of that category being present at each spatial location in the map. The agent then investigates the areas it deems most likely to contain the target.
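Conceptually, such a network can be sketched as a small convolutional model conditioned on the search category; the architecture below is an illustrative guess at the idea, not the authors’ exact network:

```python
import torch
import torch.nn as nn

class VisualSearchNet(nn.Module):
    """Illustrative: predicts a spatial likelihood map for a search
    category given a semantic obstacle map of shape (B, C, H, W)."""
    def __init__(self, map_channels, n_categories, emb_dim=16):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories, emb_dim)
        self.net = nn.Sequential(
            nn.Conv2d(map_channels + emb_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),                      # per-cell likelihood logit
        )

    def forward(self, obstacle_map, category):
        # Tile the category embedding over every map cell, then predict
        # where an instance of that category is likely to be found.
        b, _, h, w = obstacle_map.shape
        emb = self.cat_emb(category).view(b, -1, 1, 1).expand(b, -1, h, w)
        return self.net(torch.cat([obstacle_map, emb], dim=1))  # (B, 1, H, W)
```

At inference time, the agent would steer toward the highest-scoring cells of the predicted map, as described above.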

TIDEE has two shortcomings, both of which point to obvious directions for future research: it does not consider the open and closed states of objects, nor does it include their 3D pose as part of the mess-up and reorganization process.

It is also possible that the clutter produced by randomly scattering objects around a room is not representative of real-life mess.

In addition, a simplified version of the model greatly outperforms a top-performing solution on a related room rearrangement benchmark, in which the agent may observe the goal state before rearrangement.


Check out the Paper, Project, Github, and CMU Blog. Don’t forget to join our 19k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at [email protected]



Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies spanning the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world that make everyone’s life easier.