This AI Research Shows How ILF Can Significantly Improve the Quality of a Code Generation Model with Human-Written Natural Language Feedback


Program synthesis, the automatic generation of computer programs from an input specification, is a central problem in software engineering. Effective program synthesis can boost software engineers’ productivity and lower the barrier to writing correct code. Pre-trained large language models (LLMs) have recently shown significant progress on program synthesis, yet despite extensive pre-training they still fail to consistently generate correct code.

For instance, unfiltered code scraped from the Internet and included in code pre-training datasets frequently contains security flaws. Researchers postulate that contemporary LLM pre-training setups are substantially to blame for these shortcomings. It has also been demonstrated that providing written feedback to LLMs at test time considerably increases the pass rates of code generation models.

Researchers propose Imitation learning from Language Feedback (ILF) to train LLMs with language feedback. The algorithm extends the work of Scheurer et al., who investigated the effects of learning from language feedback on text summarization models. Scheurer et al. improved a summarization model by retraining the base model on refined summaries produced from the model’s initial summaries and human-written feedback. The present work advances that of Scheurer et al. in several ways, including:

  • Formalizing the algorithm and stating it in a universally applicable form
  • Demonstrating how the reward function can be adapted for code generation
  • Presenting a proof of concept of ILF (Imitation learning from Language Feedback) for code generation.

ILF (Imitation learning from Language Feedback) trains a separate model, πRefine, to use language feedback to fix incorrectly generated programs, in order to increase the accuracy of programs produced by a baseline code generation model πθ. Researchers then improve πθ by fine-tuning it on the πRefine-generated refinements that pass unit tests, resulting in a final improved model πθ* (the repaired programs are referred to as refinements). This process can be viewed as minimizing the expected KL divergence from a target ground-truth distribution, and it can be repeated iteratively to keep improving the model.
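As a rough illustration, the Python sketch below walks through one ILF iteration. It is a minimal sketch of the procedure described above, not the authors’ implementation: generate, passes_unit_tests, collect_feedback, and finetune are hypothetical caller-supplied helpers.

# Minimal sketch of one ILF iteration (not the authors' implementation).
# `generate`, `passes_unit_tests`, `collect_feedback`, and `finetune` are
# caller-supplied callables; their names are illustrative assumptions.

def ilf_iteration(pi_theta, pi_refine, tasks,
                  generate, passes_unit_tests, collect_feedback, finetune):
    """One round of Imitation learning from Language Feedback."""
    refinements = []
    for task in tasks:
        program = generate(pi_theta, task["prompt"])
        if passes_unit_tests(program, task["tests"]):
            continue  # feedback is only collected for incorrect programs
        feedback = collect_feedback(task, program)  # human-written natural language
        fixed = generate(pi_refine, task["prompt"], program, feedback)
        if passes_unit_tests(fixed, task["tests"]):
            refinements.append((task["prompt"], fixed))  # keep only refinements that pass
    # Fine-tuning the base model on the passing refinements yields the improved model.
    return finetune(pi_theta, refinements)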

Research and Findings

Researchers use the Mostly Basic Python Problems (MBPP) dataset to train and evaluate the models. MBPP contains 974 Python programming tasks designed for beginning programmers.

Although MBPP ships with a designated prompt/training/validation/test split, researchers re-divided it into the following splits (a minimal sketch of this re-split appears after the list):

• MBPPRefine: Tasks with IDs in 111-310 for which CODEGEN-MONO 6.1B did not produce any correct completions. This split is used to train πRefine.

• MBPPTrain: Tasks with IDs in 311-974 for which CODEGEN-MONO 6.1B did not produce any correct completions. This split is first used to assess the accuracy of the refinements produced by πRefine; πθ* is then trained on the correct refinements from this split.

• MBPPTest: Tasks with IDs in 11-110, used to evaluate the final performance of πθ*. In contrast to the other two splits, all tasks in this split are used, not just those for which CODEGEN-MONO 6.1B did not initially produce correct programs. This makes it easier to compare the performance of πθ and πθ* against their baseline levels.
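For concreteness, the task-ID-based re-split can be expressed with the Hugging Face datasets library. The snippet below is a sketch under that assumption; the additional step of keeping only tasks the baseline model fails on depends on model outputs and is left as a placeholder comment.

# Sketch of the task-ID-based re-split of MBPP described above,
# using the Hugging Face "mbpp" dataset.
from datasets import load_dataset, concatenate_datasets

mbpp = concatenate_datasets(list(load_dataset("mbpp").values()))  # all 974 tasks

mbpp_refine = mbpp.filter(lambda ex: 111 <= ex["task_id"] <= 310)  # train pi_Refine
mbpp_train  = mbpp.filter(lambda ex: 311 <= ex["task_id"] <= 974)  # refinements for fine-tuning
mbpp_test   = mbpp.filter(lambda ex: 11  <= ex["task_id"] <= 110)  # final evaluation

# For MBPPRefine and MBPPTrain, only tasks that the baseline model fails on are
# kept; that filtering step requires model generations and is not shown here.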

To put the algorithm into practice, researchers independently fine-tune two separate instances of CODEGEN-MONO 6.1B to produce πRefine and the final model πθ*. πRefine is trained on pairs of incorrect programs and human-written feedback, with human-written refinements as targets.
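One way to picture πRefine’s training data is as a single text sequence that concatenates the task prompt, the incorrect program, and the feedback, with the refinement as the target. The formatting below is purely illustrative, not the paper’s template, and the helper name is an assumption.

# Illustrative formatting of a pi_Refine training example; the delimiters and
# field labels are assumptions, not the paper's exact template.

def format_refine_example(prompt, wrong_program, feedback, refinement):
    source = (
        f"# Task:\n{prompt}\n\n"
        f"# Incorrect program:\n{wrong_program}\n\n"
        f"# Feedback:\n{feedback}\n\n"
        f"# Refined program:\n"
    )
    target = refinement  # the human-written (or later, model-generated) fix
    return source, target  # a standard input/target pair for LM fine-tuning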

Although the ILF algorithm only requires gathering human-written feedback for the tasks in MBPPTrain (assuming access to a πRefine that is already fine-tuned or can generate refinements via few-shot prompting), researchers gather both human-written feedback and refinements for all splits of the data to conduct further analyses of the approach. This makes it possible, for example, to compare fine-tuning on refinements generated by πRefine with fine-tuning on refinements authored by humans. ILF requires additional feedback annotations when scaled to new model and task combinations. However, it is plausible that applying ILF on one dataset will improve the model’s performance on a different dataset for the same task. Scaling ILF across various tasks and models is left to future work.

Training on a small sample of MBPP gold programs did not significantly improve accuracy compared to zero-shot inference. To test the hypothesis that the gold programs in MBPP may be slightly out-of-distribution for CODEGEN-MONO 6.1B, researchers computed the perplexity of the MBPP gold programs, the πRefine-generated refinements, and the human-written refinements under the pretrained CODEGEN-MONO 6.1B model. Even though the three distributions look broadly similar, the MBPP gold programs contain more high-perplexity programs (i.e., programs with perplexity above 10²) than either the πRefine-generated or the human-written refinements. Since the latter two datasets are closer to CODEGEN-MONO 6.1B’s original distribution while remaining functionally correct, it is probably easier for CODEGEN-MONO 6.1B to learn from them.
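The perplexity comparison can be reproduced in spirit with a short script. The sketch below assumes the public Salesforce/codegen-6B-mono checkpoint on Hugging Face and computes per-program perplexity as the exponential of the mean token-level negative log-likelihood; it is not the authors’ evaluation code.

# Sketch: per-program perplexity under a pretrained CodeGen checkpoint.
# Assumes the public "Salesforce/codegen-6B-mono" model; any causal LM works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-6B-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(program: str) -> float:
    ids = tokenizer(program, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return float(torch.exp(loss))

# Programs with perplexity above roughly 10**2 were more common among the MBPP
# gold programs than among the model- or human-written refinements.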

Moreover, ILF is especially helpful when large quantities of gold code are not available. In this setting, ILF produces training data that explicitly fixes the original model’s defects while also lying closer to the model’s actual outputs in data-representation space. So, even though both training datasets contain the same number of functionally correct programs, fine-tuning the model on πRefine-generated refinements does not require changing the weights as much as fine-tuning on the MBPP gold programs would.

To summarize

Learning from human-written natural language feedback is both more sample-efficient to train on and more effective for code generation tasks. An exciting recent discovery is the ability of pre-trained large language models (LLMs) to use natural language feedback at inference time. Researchers build on this finding by formalizing an algorithm, Imitation learning from Language Feedback (ILF), for learning from natural language feedback at training time instead. ILF is user-friendly and sample-efficient because it needs only a limited quantity of human-written feedback during training and none at test time. Researchers also offer a proof of concept on a neural program synthesis task, showing that ILF can be viewed as a way to minimize the KL divergence from the ground-truth distribution. Using ILF, they increase a CODEGEN-MONO 6.1B model’s pass@1 rate on the Mostly Basic Python Problems (MBPP) benchmark by 38% relative (10% absolute), outperforming fine-tuning on MBPP gold programs as well as fine-tuning on human-written repaired programs. The findings indicate that training purely on demonstrations is inefficient for improving an LLM’s performance on code generation tasks and that learning from human-written natural language feedback is more effective and sample-efficient.
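For reference, pass@1 numbers like those quoted above are conventionally computed with the unbiased pass@k estimator introduced with Codex (Chen et al., 2021). The helper below is a generic sketch of that formula, not the authors’ evaluation harness.

# Unbiased pass@k estimator: given n samples per task, of which c pass the
# unit tests, pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 3 of them correct -> pass@1 estimate of 0.3
print(pass_at_k(n=10, c=3, k=1))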


Check out the Paper and GitHub.
