Text Translation

1. Introduction
Text translation, and Machine Translation (MT) in particular, is a sub-field of computational linguistics that studies the use of computational methods to translate text from one language to another. It is one of the main research branches of Natural Language Processing, with broad applications both in science and in daily life. The field has produced many approaches, including Rule-based MT, Example-based MT, Dictionary-based MT, Statistical MT (SMT) and Neural MT (NMT). SMT, which grew out of the word-based IBM models of the early 1990s, was for many years the most popular and stable MT approach and is still widely used. Since about 2015, thanks to the rapid growth of neural techniques, NMT with the encoder-decoder architecture has produced better translations.
Although SMT and NMT are the two leading approaches today, natural language is inherently ambiguous, so individual language phenomena still require special treatment. Challenging issues in MT that have received much attention include:
– Name translation (or transliteration) is a serious challenge when the source and target languages use different writing systems. Most writing systems are phonetic, i.e., they transcribe the sounds of the language, be it syllables (Japanese katakana) or individual consonants and vowels (Latin, Arabic); Chinese characters and Japanese kanji, by contrast, are largely logographic.
– Rich morphology leads to large vocabulary sizes and a sparse data problem in translation.
– Syntactic differences lead to more work on reordering words and phrases within sentences. For language pairs in which both languages are subject-verb-object (SVO), the main reordering task is swapping adjectives and nouns (e.g. French-English, Vietnamese-English). For pairs in which the source language has a different basic word order from the target language, reordering is a major challenge (e.g. Arabic is VSO, Japanese is SOV).
– Lexical and structural ambiguity leads to bad translations. MT systems choose the word or phrase with the highest probability at each translation step. This works well in ordinary cases, but a sentence may also contain ambiguity in word/phrase choice or in grammatical structure. Such cases must be resolved by integrating more linguistic knowledge that is specific to the language pair.
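
The per-step choice described in the last item can be illustrated with a minimal Python sketch. The vocabulary, the probability table `STEP_PROBS` and both function names are invented for illustration and do not come from any real MT system; the point is only that picking the most probable word at each step can miss the most probable sequence as a whole.

```python
# Toy conditional probabilities P(next word | previous word).
# All words and numbers are invented for illustration.
STEP_PROBS = {
    "<s>": {"bank": 0.6, "shore": 0.4},
    "bank": {"account": 0.55, "river": 0.45},
    "shore": {"line": 0.9, "account": 0.1},
}
START = "<s>"


def greedy_decode(steps):
    """Pick the single most probable word at every step."""
    prev, words, prob = START, [], 1.0
    for _ in range(steps):
        options = STEP_PROBS[prev]  # the toy table only covers two steps
        best = max(options, key=options.get)
        prob *= options[best]
        words.append(best)
        prev = best
    return words, prob


def exhaustive_decode(steps):
    """Score every possible sequence and keep the most probable one."""
    best_words, best_prob = [], 0.0

    def expand(prev, words, prob):
        nonlocal best_words, best_prob
        if len(words) == steps:
            if prob > best_prob:
                best_words, best_prob = words, prob
            return
        for word, p in STEP_PROBS.get(prev, {}).items():
            expand(word, words + [word], prob * p)

    expand(START, [], 1.0)
    return best_words, best_prob


if __name__ == "__main__":
    print("greedy:    ", greedy_decode(2))
    print("exhaustive:", exhaustive_decode(2))
```

With these invented numbers the greedy path is "bank account" (probability about 0.33), while the sequence "shore line" scores about 0.36. Real systems reduce this gap with beam search; resolving genuine lexical or structural ambiguity still requires the additional linguistic knowledge discussed above.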

Currently, we are investigating which linguistic factors are useful, and how to integrate linguistic information into SMT and NMT approaches, for language pairs in which one side is Vietnamese.

2. Research
2.1. Statistical MT:
SMT is a machine translation approach in which translations are generated on the basis of statistical models. These models started as word-based models (the IBM models), but performance was later much improved by phrase-based models. In 2016, we proposed a method to construct a Named Entity (NE) annotated English-Vietnamese bilingual corpus, which is used to address the problem of translating person names between English and Vietnamese [1]. In the same year, we also introduced a mechanism to re-segment words based on NE information in Chinese-Vietnamese MT [2]. In 2017, we examined the linguistic relationships between Chinese and Vietnamese to improve word alignment, an essential part of SMT [3].
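
As background for the word-based IBM models mentioned above, the following is a minimal Python sketch of IBM Model 1 training with expectation-maximization on a three-sentence toy corpus. It is a simplified textbook version (no NULL word, no smoothing) and is not the alignment method of [3]; the function name `train_ibm_model1` and the toy English-Vietnamese pairs are assumptions made for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (English tokens, Vietnamese tokens), invented for illustration.
TOY_PAIRS = [
    (["red", "house"], ["nhà", "đỏ"]),
    (["red", "book"], ["sách", "đỏ"]),
    (["house"], ["nhà"]),
]


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start from a uniform table over every co-occurring word pair.
    t = {(f, e): 1.0 / len(tgt_vocab)
         for src, tgt in pairs for e in src for f in tgt}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-normalise the expected counts per source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


if __name__ == "__main__":
    table = train_ibm_model1(TOY_PAIRS)
    for (f, e), p in sorted(table.items(), key=lambda kv: -kv[1]):
        print(f"t({f} | {e}) = {p:.3f}")
```

On this toy corpus the single-word pair "house" / "nhà" anchors the estimates, so the probability mass for "red" shifts towards "đỏ" even though the two languages order adjective and noun differently.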

2.2. Neural MT:
NMT is a machine translation approach in which multiple neural network layers are used to predict the likelihood of a sequence of words. Currently, the state of the art in NMT is a sequence-to-sequence encoder-decoder model with an attention mechanism. Our work on integrating linguistic information into NMT is still ongoing and is our main focus at this time.
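
To make the attention mechanism mentioned above more concrete, here is a minimal sketch of one attention step in plain Python. It assumes dot-product scoring, which is only one of several scoring functions used in practice, and the vectors are invented numbers rather than the output of a trained model; `softmax` and `attend` are illustrative names.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [x / norm for x in exps]


def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted sum of encoder states."""
    # Score each source position by its dot product with the decoder state.
    scores = [sum(q * h_i for q, h_i in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted average of encoder states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


if __name__ == "__main__":
    encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source word
    decoder_state = [0.0, 2.0]  # current decoder hidden state
    weights, context = attend(decoder_state, encoder_states)
    print("weights:", weights)
    print("context:", context)
```

At each output step the decoder would combine the context vector with its own state to predict the next target word, so source positions with higher weights influence that prediction more.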

[1] Long H. B. Nguyen, Dien Dinh, and Phuoc Tran (2016). An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 16, 2, Article 9 (October 2016), 17 pages. DOI: https://doi.org/10.1145/2990191 (SCIE)
[2] Phuoc Tran, Dien Dinh, and Long H. B. Nguyen (2016). Word Re-Segmentation in Chinese-Vietnamese Machine Translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 16, 2, Article 12 (November 2016), 22 pages. DOI: https://doi.org/10.1145/2988237 (SCIE)
[3] Phuoc Tran, Dien Dinh, Tan Le, and Long H. B. Nguyen (2017). Linguistic-Relationships-Based Approach for Improving Word Alignment. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 17, 1, Article 5 (October 2017), 16 pages. DOI: https://doi.org/10.1145/3133323 (SCIE)