Break-A-Scene: AI-Powered Object Extraction and Scene Remixing


Humans naturally break down complicated scenes into their component elements and imagine those elements in new scenarios: given a snapshot of a ceramic artwork showing a creature reclining on a bowl, one can easily picture the same creature in other poses and locations, or the same bowl in a different environment. Today’s generative models, however, struggle with tasks of this nature. Recent work personalizes large-scale text-to-image models, given many pictures of a single concept, by optimizing newly added dedicated text embeddings or by fine-tuning the model weights, enabling synthesis of that concept in novel situations.

Researchers from the Hebrew University of Jerusalem, Google Research, Reichman University, and Tel Aviv University present a new scenario, textual scene decomposition, in this study. Their goal is to learn a distinct text token for each concept from a single image of a scene that may contain several concepts of different kinds. Creative images can then be generated from textual prompts that highlight individual concepts or combinations of several subjects. The task is inherently ambiguous, because the concepts to be learned or extracted are not explicitly specified. Previous works have sidestepped this ambiguity by concentrating on one subject at a time and using several photos that portray the concept in different contexts. When moving to a single-image setting, however, alternative methods are required to resolve the ambiguity.

They therefore augment the input image with a set of masks that indicate the concepts to be extracted. These masks may be free-form ones supplied by the user or ones produced by an automated segmentation method. Adapting the two main existing techniques, Textual Inversion (TI) and DreamBooth (DB), to this setting reveals a reconstruction-editability tradeoff: TI fails to properly reconstruct the concepts in new contexts, while DB loses context control due to overfitting. In this study, the authors propose a customization pipeline that strikes a balance between preserving the identity of the learned concepts and avoiding overfitting.
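To make the input specification concrete, the following minimal sketch pairs each concept mask with a newly introduced handle token. The `ConceptSpec` class, the `[v1]`/`[v2]` handle naming, and the mask shapes are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ConceptSpec:
    """One concept to extract: a handle token plus a binary mask marking
    where the concept appears in the single input image."""
    handle: str        # e.g. "[v1]", a newly added text token (assumed naming)
    mask: np.ndarray   # boolean array of shape (H, W)

# Masks can be drawn by the user or produced by an off-the-shelf segmenter.
H, W = 512, 512
concepts = [
    ConceptSpec("[v1]", np.zeros((H, W), dtype=bool)),  # e.g. the creature
    ConceptSpec("[v2]", np.zeros((H, W), dtype=bool)),  # e.g. the bowl
]
```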

Figure 1 provides an overview of their methodology, which has four main parts: (1) union sampling, in which a new subset of the concept tokens is sampled at every training step, so the model learns to handle various combinations of the extracted concepts; (2) a two-phase training regime that prevents overfitting, first optimizing only the newly inserted tokens with a high learning rate and then fine-tuning the model weights with a reduced learning rate; (3) a masked diffusion loss that reconstructs the desired concepts; and (4) a dedicated cross-attention loss that promotes disentanglement between the learned concepts.
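As a rough illustration of the two-phase regime, the PyTorch sketch below builds one optimizer per phase: the first updates only the handle embeddings with a high learning rate, the second also unfreezes the model with a much lower one. The function name, learning rates, and parameter grouping are assumptions for illustration, not the paper's exact values.

```python
import torch

def build_phase_optimizers(handle_embeddings, model, lr_tokens=5e-4, lr_model=2e-6):
    """Two-phase schedule: phase 1 tunes only the new handle embeddings;
    phase 2 additionally fine-tunes the model weights at a reduced rate.
    (Learning rates here are illustrative placeholders.)"""
    phase1 = torch.optim.AdamW([handle_embeddings], lr=lr_tokens)
    phase2 = torch.optim.AdamW([
        {"params": [handle_embeddings], "lr": lr_tokens * 0.1},
        {"params": model.parameters(), "lr": lr_model},
    ])
    return phase1, phase2

# Usage (illustrative): handle_embeddings is a trainable embedding tensor,
# model is the denoising network being personalized.
# opt_phase1, opt_phase2 = build_phase_optimizers(handle_embeddings, model)
```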

Their pipeline contains two phases, shown in Figure 1. In the first phase, they designate a set of dedicated text tokens (called handles), freeze the model weights, and optimize the handles to reconstruct the input image. In the second phase, they switch to fine-tuning the model weights while continuing to refine the handles. Their method places strong emphasis on disentangled concept extraction, i.e., ensuring that each handle is tied to exactly one target concept. They also observe that customizing each concept independently is not enough to generate images showing combinations of concepts. In response, they propose union sampling, a training scheme that addresses this need and improves the generation of concept combinations, as sketched below.
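The following sketch shows one plausible reading of union sampling: at each training step a random non-empty subset of the concept handles is drawn and composed into a prompt. The prompt template and helper names are assumptions, not the paper's exact formulation.

```python
import random

def union_sample(handles, min_size=1):
    """Sample a random non-empty subset of the concept handles for this
    training step, so combinations of concepts are seen during training."""
    k = random.randint(min_size, len(handles))
    return random.sample(handles, k)

def compose_prompt(subset):
    """Build a training prompt from the sampled handles,
    e.g. 'a photo of [v1] and [v3]' (template is an assumption)."""
    return "a photo of " + " and ".join(subset)

handles = ["[v1]", "[v2]", "[v3]"]
subset = union_sample(handles)
prompt = compose_prompt(subset)
```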

Concept reconstruction relies on a masked diffusion loss, a modified version of the standard diffusion loss. This loss guarantees that each custom handle can reproduce its target concept, but it does not penalize the model if a handle becomes associated with more than one concept. Their key insight is that such entanglement can be penalized by imposing an additional loss on the cross-attention maps, which are known to correlate with the scene layout. With this extra loss, each handle attends only to the regions covered by its target concept. They also propose several automatic metrics for the task and use them to compare their method against the baselines.
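The sketch below illustrates the idea behind the two losses: reconstruction is penalized only inside the union of the selected concept masks, and each handle's cross-attention map is pushed toward its own mask. Tensor shapes, normalization, and how attention maps are extracted are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(noise_pred, noise, mask_union):
    """Penalize the denoiser only inside the union of the selected concept
    masks (mask_union broadcastable to the noise shape, e.g. (B, 1, H, W)),
    so background pixels do not constrain the learned handles."""
    return F.mse_loss(noise_pred * mask_union, noise * mask_union)

def cross_attention_loss(attn_maps, masks):
    """Encourage each handle's cross-attention map to match its concept
    mask, discouraging entanglement between handles.
    attn_maps / masks: dicts mapping handle -> (H, W) tensor."""
    loss = 0.0
    for handle, attn in attn_maps.items():
        loss = loss + F.mse_loss(attn, masks[handle])
    return loss / len(attn_maps)
```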

Their contributions are as follows: (1) they introduce the novel task of textual scene decomposition; (2) they propose a method for this task that balances concept fidelity and scene editability by learning a set of disentangled concept handles; and (3) they propose several automatic evaluation metrics and use them, together with a user study, to demonstrate the effectiveness of their approach; the user study shows that human assessors also prefer their method. Finally, they suggest several applications of their technique.