Building Semantic Alignment Across Languages


Have you ever tried asking ChatGPT a question in a language other than English? You might get a strange, unrelated answer, because these models are often biased toward English. Wouldn't it be easier if LLMs worked well in any language?

Researchers at the National Key Laboratory for Novel Software Technology propose a way to strengthen a pre-trained LLM in non-English languages. LLMs usually perform poorly in non-English languages because both the pre-training corpus and the instruction-tuning data are mostly in English. One way to improve this is continued pre-training on large-scale monolingual data.

The researchers instruction-tune LLMs on translation tasks to improve the correspondence between two languages, and use cross-lingual general tasks to improve the models' instruction-following ability. They use LLaMA-7B as their pre-trained LLM and consider six languages whose alphabets are similar to the English alphabet. LLaMA stands for Large Language Model Meta AI.

An x-LLaMA is obtained for each language using language-specific data and is then compared with other LLMs. Language modeling requires predicting the next token based on the prefix sequence. This requires training the LLM on a large-scale corpus and on translation data. Translation data is one of the most useful resources for learning semantic alignment, and an LLM's translation performance can be enhanced by using human expert-annotated translation data for instruction tuning.
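As a rough illustration of next-token prediction, the sketch below splits a token sequence into (prefix, next token) training pairs. The function name `next_token_pairs` and the whitespace-style tokens are assumptions made for this example; real LLMs use subword tokenizers and compute a loss over these positions rather than building explicit pairs.

```python
def next_token_pairs(tokens):
    """Return (prefix, next_token) pairs for a token sequence.

    Hypothetical helper for illustration only: each pair asks the model
    to predict tokens[i] given everything before it.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy, hand-tokenized sentence (an assumption, not real tokenizer output).
pairs = next_token_pairs(["Translate", "this", "sentence", "."])
for prefix, target in pairs:
    print(prefix, "->", target)
```

Each pair corresponds to one prediction the model has to get right during training.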

The researchers use publicly available sentence-level translation datasets to construct the translation task instruction data. This makes their method scalable, reproducible, and extendable to more languages. They find that placing the non-English text on the target side of the translation data boosts the LLM's performance on non-English tasks more than placing it on the source side.
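The construction described above might look something like the following minimal sketch. Everything here is an assumption for illustration: the `InstructionExample` fields, the prompt wording, the `build_translation_examples` helper, and the two sample sentence pairs are made up and are not taken from the paper or its datasets.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    # Hypothetical record layout, loosely modeled on instruction/input/output data.
    instruction: str
    input_text: str
    output_text: str


def build_translation_examples(pairs, foreign_language, foreign_on_target=True):
    """Turn (english, foreign) sentence pairs into translation instruction data.

    foreign_on_target=True puts the non-English sentence on the target side,
    the arrangement the article reports as more helpful.
    """
    examples = []
    for english, foreign in pairs:
        if foreign_on_target:
            src_lang, tgt_lang, src, tgt = "English", foreign_language, english, foreign
        else:
            src_lang, tgt_lang, src, tgt = foreign_language, "English", foreign, english
        examples.append(
            InstructionExample(
                instruction=f"Translate the following sentence from {src_lang} to {tgt_lang}.",
                input_text=src,
                output_text=tgt,
            )
        )
    return examples


# Made-up parallel sentences standing in for a public sentence-level dataset.
parallel = [("Good morning.", "Günaydın."), ("Thank you.", "Teşekkür ederim.")]
examples = build_translation_examples(parallel, "Turkish")
print(examples[0])
```

Flipping `foreign_on_target` to `False` produces the source-side arrangement, which makes it easy to compare the two setups on the same parallel data.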

The researchers used bilingual translation performance as a measure of semantic alignment. They found that the scale of the translation task instruction data also greatly affects the alignment. They derived an expression relating translation performance to data scale, in which the data scale enters logarithmically inside an exponential term. They also find that languages less similar to English require more translation data to build semantic alignment than languages more similar to English.

The researchers compared x-LLaMA against Alpaca-7B (a LLaMA model tuned with English instructions), Parrot-7B, which was tuned with human-annotated translation data, and Bayling-7B, which was tuned with human interactive translations. They find that x-LLaMA outperforms Alpaca-7B by 42.50% across the six non-English languages. The accuracy of x-LLaMA on non-English tasks was on par with that of Alpaca-7B on English tasks.

These results suggest that cross-lingual instruction tuning is an effective way to extend LLMs to non-English languages. The approach and findings point to the potential for developing more capable LLMs for non-English languages.