Paper Review: Retrieval-Augmented Machine Translation with Unstructured Knowledge

1. Introduction
2. RAGtrans
3. Multi-Task Training
4. Experiment Setup
- 4.1 Setup
- 4.2 Prompt & Response
5. Results
6. Limitations

1. Introduction

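I became interested in RAG-assisted machine translation, so I decided to collect and summarize related papers. The paper Retrieval-Augmented Machine Translation with Unstructured Knowledge (hereafter RAGMT) was submitted by the Pattern Recognition Center at WeChat AI. Existing RAG-based MT methods mostly do one of two things: they retrieve text similar to the source text and use it as context (also known as translation memory), or, when domain-specific or cultural background knowledge is needed, they retrieve it from a knowledge graph (also known as retrieving knowledge triplets). However, most world knowledge is stored as unstructured documents and, unlike Wikipedia, is often not translated into multiple languages, which makes one-to-one pairing across languages difficult.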

This paper covers RAGMT that uses unstructured documents (hereafter UD) as context. Since no benchmark existed for this setting, the authors built the RAGtrans dataset to measure the RAGMT ability of LLMs.

2. RAGtrans

2.1 Building the Benchmark Dataset

Wikipedia-based dataset
- Knowledge-intensive sentences: 79K sentences with knowledge-intensive semantics were collected from the lead paragraphs of English Wikipedia articles. Each sentence is an individual source sentence.
- Useful relevant documents: the parent document of each source sentence is included as well.
- Transferability to multilingual RAG: a single Wikipedia article often exists in several languages (parallel content), so those versions can also serve as retrieval targets.
Data splits (based on the 79K sampled sentences) and who produced the translations
- Training set: 74.5K
  - GPT-4o
- Validation set: 2.5K
  - GPT-4o
- Test set: 2K
  - Native Chinese-speaking annotators

2.2 Data Selection

2.2.1 Source Sentence Selection

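Source sentences were selected from Wikipedia according to three criteria:

- The sentence has knowledge-intensive semantics (a source sentence that is too easy to translate would make the benchmark meaningless).
- The parent document of the source sentence must be easy to access.
- Translated versions of that document must also be easy to access.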

2.2.2 Translation Annotation

Human translations were used for the test set, while the training and validation sets were translated with GPT-4o.
The source sentence is translated into Chinese using RAGMT.

Each sample is represented as a triplet: $\langle s, d^l, t \rangle$

$s$: the English source sentence
$t$: the Chinese translation
$d^l$: the document given for $s$, where $l \in \{en, zh, de, fr, cs\}$ is the language of the document
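
The triplet maps naturally onto a small record type. The sketch below is only an illustration of that structure: the class name, field names, and toy strings are my own placeholders and do not come from the paper or its released data format.

```python
from dataclasses import dataclass

# Language codes listed above for the given documents.
DOC_LANGS = {"en", "zh", "de", "fr", "cs"}


@dataclass(frozen=True)
class RagtransSample:
    """One sample <s, d^l, t>. Field names are illustrative assumptions."""

    source: str       # s: English source sentence
    document: str     # d^l: document given for s
    doc_lang: str     # l: language of the document
    translation: str  # t: Chinese translation

    def __post_init__(self) -> None:
        # Reject language codes outside the set used in the post.
        if self.doc_lang not in DOC_LANGS:
            raise ValueError(f"unsupported document language: {self.doc_lang}")


# Toy placeholder values, not real dataset content.
sample = RagtransSample(
    source="An English source sentence.",
    document="Un document en français.",
    doc_lang="fr",
    translation="一句中文译文。",
)
```

Validating the language code at construction time keeps a mislabeled document from silently entering a training or evaluation set.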

2.3 Benchmark Settings

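Golden Evaluation: evaluates model performance using the dataset produced through human annotation (professional translators and interpreters).
Robustness Evaluation: evaluates how robustly the model behaves under varied conditions, or under noise and perturbations in the input.
Full Wiki Evaluation: uses all or part of Wikipedia as the evaluation data, which shows how well the model works against a broad knowledge base.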

3. Multi-Task Training

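A new multi-task training method is used to improve the RAGMT performance of LLMs.
While translating a single source sentence, the retriever may return documents written in various languages, so the authors devised a way to refine and use them effectively: CSC, whose letters match the three tasks in 3.1 to 3.3.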

3.1 Cross-lingual information completion

Given a document written in a mix of languages, $\hat{d}_{\text{mix}}$, and a partial English summary of it, $\hat{y}$, the LLM is trained to expand $\hat{y}$ into the complete summary $y$. This can be written as:

\[\Theta(y \mid \hat{d}_{\text{mix}}, \hat{y})\]

Here $\Theta$ denotes the LLM, and the summary is the first paragraph of the Wikipedia article.
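
To make the input and target of this task concrete, here is a minimal sketch of building one training pair. How the paper derives the partial summary is not covered in this review, so the word-level prefix truncation, the `keep_ratio` parameter, the function name, and the prompt wording are all assumptions made for illustration.

```python
def make_completion_example(mixed_doc: str, full_summary: str, keep_ratio: float = 0.5) -> dict:
    """Build one (input, target) pair for the completion task.

    Assumption: the partial summary is a word-level prefix of the full summary.
    """
    if not 0.0 < keep_ratio < 1.0:
        raise ValueError("keep_ratio must be strictly between 0 and 1")
    words = full_summary.split()
    # Keep at least one word so the partial summary is never empty.
    cut = max(1, int(len(words) * keep_ratio))
    partial = " ".join(words[:cut])
    prompt = (
        "Document:\n" + mixed_doc + "\n\n"
        "Partial English summary:\n" + partial + "\n\n"
        "Complete the summary in English."
    )
    # The model input is the prompt; the training target is the full summary y.
    return {"input": prompt, "partial": partial, "target": full_summary}


example = make_completion_example(
    mixed_doc="First sentence in English. Deuxième phrase en français.",
    full_summary="one two three four",
)
```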

3.2 Self-knowledge-enhanced translation

As recent RAG research has shown, RAG models perform better when they can also draw on their own knowledge. Inspired by that work, the authors designed a self-knowledge-enhanced translation task.

Specifically, given a source sentence $s$, the LLM first generates a relevant document $\tilde{d}_l$ in a specific language $l \in \{en, zh, de, fr, cs\}$, and then uses that document to translate $s$ into the target sentence $t$. This can be written as:

\[\Theta(t \mid \tilde{d}_l, s)\]
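
The generate-then-translate flow can be sketched as two model calls. This is only an illustration: the `LLM` callable interface, the prompt wording, and the stub model are hypothetical, and the review does not say whether the paper uses one generation pass or two.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def self_knowledge_translate(source: str, doc_lang: str, llm: LLM) -> Dict[str, str]:
    """Generate a document from the model's own knowledge, then translate with it."""
    # Step 1: produce the generated document in language `doc_lang`.
    doc_prompt = (
        f"Write a short background document in language '{doc_lang}' "
        f"that is relevant to this sentence:\n{source}"
    )
    generated_doc = llm(doc_prompt)

    # Step 2: translate s into t, conditioned on the generated document.
    mt_prompt = (
        f"<document>\n{generated_doc}\n</document>\n"
        f"Translate this English sentence into Chinese:\n{source}"
    )
    translation = llm(mt_prompt)
    return {"document": generated_doc, "translation": translation}


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return f"[model output for a {len(prompt)}-character prompt]"


result = self_knowledge_translate("Louise Harel became interim leader.", "fr", stub_llm)
```

Passing the model in as a plain callable keeps the sketch independent of any particular inference library.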

3.3 Cross-lingual relevance discrimination

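Because documents retrieved by RAG may be written in several languages, this task trains the ability to judge the relevance between two texts written in different languages. Given a document pair $\langle d_{l_1}, d_{l_2} \rangle$ written in different languages $l_1$ and $l_2$, the LLM generates the relevance metric $r(d_{l_1}, d_{l_2})$ between the two documents. This can be written as: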

\[\Theta(r \mid d_{l_1}, d_{l_2})\]
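
Read together, the three tasks condition the same model $\Theta$ on different inputs. As a sketch only, and assuming each task is trained with an ordinary negative log-likelihood loss and the losses are simply summed (this review does not state the loss or any task weighting, and the subscripts CIC, SKT, and CRD are labels introduced here for the three tasks), the objective could be written as:

\[\mathcal{L} = \mathcal{L}_{\text{CIC}} + \mathcal{L}_{\text{SKT}} + \mathcal{L}_{\text{CRD}}\]

\[\mathcal{L}_{\text{CIC}} = -\log \Theta(y \mid \hat{d}_{\text{mix}}, \hat{y}), \quad \mathcal{L}_{\text{SKT}} = -\log \Theta(t \mid \tilde{d}_l, s), \quad \mathcal{L}_{\text{CRD}} = -\log \Theta(r \mid d_{l_1}, d_{l_2})\]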

4. Experiment Setup

4.1 Setup

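Metrics
- BLEU
  - Measures: n-gram overlap between the model translation and the reference translation
- COMET
  - Measures: semantic similarity, scored by a learned model that compares the model translation with the source and the reference translation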
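
The core ingredient of BLEU is clipped n-gram precision between a candidate translation and a reference. The sketch below computes only that component on a toy English pair; full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, and real evaluations would normally rely on an established toolkit rather than this hand-rolled version. The function names are my own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram's matches are clipped to its count in the reference, so
    repeating a correct word cannot inflate the score.
    """
    hyp = ngrams(hypothesis, n)
    ref = ngrams(reference, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
p1 = clipped_precision(hyp, ref, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(hyp, ref, 2)  # 3 of 5 bigrams match
```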
Backbones (LLMs used)
- Qwen2.5-7B and Qwen2.5-14B
- Llama-3-8B
- Mistral-7B
- Training: 8 × A100 GPUs, learning rate 1e-5, batch size 32
Retriever
- BM25
  - Traditional lexical search method that matches keywords efficiently with an inverted index
- BGE-M3
  - Multilingual sentence embedding model that supports dense retrieval across different languages
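
To show what lexical retrieval means in practice, here is a minimal Okapi BM25 scorer in plain Python. It is a sketch under stated assumptions: it scans every document instead of using an inverted index, uses the common `k1=1.5`, `b=0.75` defaults and the non-negative `log(1 + ...)` IDF variant, and the toy documents are made up. None of these settings are taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    # Fall back to 1.0 so that a corpus of empty documents does not divide by zero.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "parti quebecois leadership election 2005".split(),
    "bloc quebecois federal election".split(),
    "wikipedia is a free encyclopedia".split(),
]
query = "parti quebecois leadership".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Because BM25 only matches surface tokens, a query in English cannot score a document written in Czech or Chinese unless they share tokens, which is the gap a multilingual dense retriever such as BGE-M3 is meant to cover.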

4.2 Prompt & Response

PROMPT

The document is provided as follows:
<document>
【Course à la direction du Parti québécois de 2005】Le, Bernard Landry, chef du Parti québécois, annonça son intention de démissionner de son poste de chef après avoir reçu un vote de confiance de 76,2 % de la part des délégués présents au Conseil national du parti se déroulant alors. Le Parti québécois désigna Louise Harel comme chef par intérim. La forte candidature de Gilles Duceppe, chef du Bloc québécois, le parti souverainiste au niveau fédéral, était attendue. Toutefois, malgré une importante pression, il y renonça en date du pour plutôt disputer les prochaines élections fédérales à la tête du Bloc québécois.
</document>
The sentence that needs to be translated is provided as follows:
<input sentence>
The Parti Québécois leadership election of 2005 was held from November 13 to November 15, 2005 to elect the new leader of the Parti Québécois, the main sovereigntist and social democratic political party in Quebec, Canada.
</input sentence>
Please response in the following JSON format:
{
  "is_helpful": {
    "reason": str,
    "score": int
  },
  "translation": str
}
Please carefully read the explanation of these keys:
- "is_helpful": Dict type. Whether the given document is helpful for translating the sentence.
- "reason": Str type. The reason of whether the given document is helpful.
- "score": Int type. 1 - 5 scores. 1 indicates the given document is totally irrelevant to the sentence. 3 indicates that though the document is relevant to the sentence, but it not helpful during the translation, you can still reach the translation result without seeing the document. 5 indicates the given document is very helpful during translation, you may make mistakes during translation without seeing the document.
- "translation": Str type. The translated result.

RESPONSE

{
  "is_helpful": {
    "reason": "The document provides context about the Parti Québécois leadership election in 2005, including key figures and events leading up to the election. However, it does not provide specific details about the dates or the nature of the election itself.",
    "score": 3
  },
  "translation": "2005年魁北克党领导人选举于2005年11月13日至11月15日举行，旨在选举魁北克党这一加拿大魁北克省主要的主权主义和社会民主主义政党的新领导人。"
}
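
The prompt and response above form a simple round trip: fill a template, then parse and validate the JSON reply. The sketch below illustrates that under my own assumptions: the function names are hypothetical, the key explanations are left out of the template for brevity, and the validation rules are inferred from the key descriptions in the prompt rather than taken from the authors' code.

```python
import json

# Schema text as it appears in the prompt (key explanations omitted for brevity).
RESPONSE_SCHEMA = """{
  "is_helpful": {
    "reason": str,
    "score": int
  },
  "translation": str
}"""


def build_prompt(document: str, sentence: str) -> str:
    """Fill the quoted prompt layout with one document and one sentence."""
    return (
        "The document is provided as follows:\n"
        f"<document>\n{document}\n</document>\n"
        "The sentence that needs to be translated is provided as follows:\n"
        f"<input sentence>\n{sentence}\n</input sentence>\n"
        "Please response in the following JSON format:\n"
        f"{RESPONSE_SCHEMA}\n"
    )


def parse_response(raw: str) -> dict:
    """Parse a model reply and check it against the requested keys and types."""
    # strict=False tolerates raw newlines inside JSON strings, which models sometimes emit.
    data = json.loads(raw, strict=False)
    helpful = data["is_helpful"]
    reason, score, translation = helpful["reason"], helpful["score"], data["translation"]
    if not isinstance(reason, str) or not isinstance(translation, str):
        raise ValueError("'reason' and 'translation' must be strings")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("'score' must be an integer from 1 to 5")
    return {"reason": reason.strip(), "score": score, "translation": translation.strip()}


prompt = build_prompt("Some retrieved document.", "Some English sentence.")
# Toy reply with a raw newline before each closing quote.
reply = '{"is_helpful": {"reason": "Relevant but not needed.\n", "score": 3}, "translation": "示例译文。\n"}'
parsed = parse_response(reply)
```

Checking the score range at parse time makes it easy to discard or retry malformed replies before they reach the evaluation metrics.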

5. Results

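Zero-Shot Performance: in the zero-shot setting, RAGMT cannot be said to perform better than translation without retrieval.
Instruction-Tuning Performance: supervised fine-tuning (SFT) of the LLMs on RAGtrans yields good performance.
CSC Performance: SFT combined with CSC gives a slight further improvement.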

6. Limitations

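This study focuses only on the English-Chinese language pair.
Wikipedia (English, Chinese, German, French, and Czech) is the only multilingual knowledge base used.
Chain-of-thought (CoT) prompting was used for the GPT-4o translations during RAGtrans data collection, but it was not applied during supervised fine-tuning (SFT).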