Applications of corpus-based Contrastive Linguistics for Vietnamese
Dien D., Triet NQM
Computational Linguistics Center, University of Science, Vietnam National University at HCM City
www.clc.hcmus.edu.vn
The 28th Annual Meeting of the SEALS, Kaohsiung, Taiwan, 17-19 May 2018
 
Content
o Overview
o Introduction to Vietnamese Corpora
o Processing Vietnamese Corpora for Contrastive Linguistics
o Applications of Corpus-based Contrastive Linguistics for Vietnamese
o Conclusion
 
Overview
o Vietnamese is one of the 20 most widely spoken languages in the world.
o It is an isolating language (like Chinese, Thai, Lao, …).
o It attracts a growing number of foreign researchers and learners.
o However, most traditional approaches to it are rationalist and theoretical.
o A new approach in research on Contrastive Linguistics for Vietnamese is empiricist and practical:
   o Corpus-based Contrastive Linguistics.
o Applications of corpus-based Contrastive Linguistics for Vietnamese:
   o finding the similarities and differences in morphology, syntax, semantics and pragmatics between Vietnamese and widely studied foreign languages (English, Chinese, …);
   o researching and teaching Vietnamese, teaching foreign languages to Vietnamese speakers, lexicography, translation studies, etc.
 
Introduction to Vietnamese Corpora
o VCor (Vietnamese Corpus): collected from many sources (online news, books) in 2000-2010; consists of 17M sentences, 346M words and 443M morpho-syllables in 42 topics / 18 domains. This corpus is automatically annotated with word segmentation (WS) and part-of-speech (POS) tags.
o VTB (Vietnamese TreeBank): consists of approx. 300K sentences, manually annotated with word segmentation, POS and named-entity (NE) tags.
o EVC (English-Vietnamese parallel Corpus): general texts, news, conversations, …; consists of 2M sentence pairs, 80M words.
o KVC (Korean-Vietnamese parallel Corpus): general texts, news, conversations, …; contains 500K sentence pairs, 14.5M words.
o CVC (Chinese-Vietnamese parallel Corpus): general texts, news, conversations, …; consists of 200K sentence pairs.
The corpora above are provided by CLC (Computational Linguistics Center, Vietnam National University at HCMC).
 
Introduction to Vietnamese Corpora

Parameter                                       VCor           VTB
Number of sentences                             17,095,994     302,491
Number of morpho-syllables                      443,301,776    9,154,582
Number of words                                 346,454,533    7,096,580
Avg. sentence length (words)                    20.27          23.46
Avg. word length (morpho-syllables)             1.28           1.29
Avg. morpho-syllable length (letters)           3.27           3.27
Number of unique morpho-syllables               6,835          6,714
Number of unique words                          34,588         32,645

Note: "I am a teacher" has 4 words; "Tôi là một giáo viên" has 5 morpho-syllables.
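The averages for VCor and VTB follow directly from the raw counts listed above. The short Python check below is only a sketch for readers who want to recompute them; the dictionary and function names are illustrative and do not come from any CLC tool.

```python
# Counts copied from the VCor and VTB columns on this slide.
CORPORA = {
    "VCor": {"sentences": 17_095_994, "words": 346_454_533,
             "morpho_syllables": 443_301_776},
    "VTB": {"sentences": 302_491, "words": 7_096_580,
            "morpho_syllables": 9_154_582},
}


def averages(counts):
    """Return (words per sentence, morpho-syllables per word), rounded to 2 places."""
    words_per_sentence = counts["words"] / counts["sentences"]
    syllables_per_word = counts["morpho_syllables"] / counts["words"]
    return round(words_per_sentence, 2), round(syllables_per_word, 2)


for name, counts in CORPORA.items():
    print(name, averages(counts))
```

Both corpora reproduce the reported averages (20.27 and 1.28 for VCor, 23.46 and 1.29 for VTB).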
 
Introduction to Vietnamese Corpora

VCor by topic (Len. = average sentence length in words):

No.  Topic           Ratio    Files    Sentences   Words        Morpho-syllables  Len.
1    Entertainment   7.19%    67,535   1,374,386   26,350,787   31,868,527        19.17
2    Sports          3.53%    32,945   668,776     12,609,716   15,660,217        18.85
3    Computer        3.70%    27,037   616,068     12,638,479   16,392,697        20.51
4    Education       6.57%    47,740   1,060,987   22,535,214   29,142,722        21.24
5    Health          7.23%    56,154   1,211,813   25,796,610   32,040,892        21.29
6    Economics       8.51%    55,360   1,284,164   27,840,867   37,715,850        21.68
7    Tourism&Food    5.65%    62,030   964,265     19,430,539   25,046,919        20.15
8    Life            7.45%    75,093   1,406,104   26,503,411   33,032,093        18.85
9    Society         13.39%   97,144   2,174,765   45,975,042   59,375,454        21.14
10   Religion        5.33%    39,320   942,721     18,984,779   23,618,434        20.14
11   Culture         9.56%    86,842   1,770,401   33,964,734   42,378,422        19.18
12   Law             5.90%    43,219   977,697     19,309,864   26,170,834        19.75
13   Military        4.58%    30,660   746,093     15,859,404   20,312,096        21.26
14   International   3.70%    27,073   595,851     12,506,045   16,418,458        20.99
15   Transportation  0.44%    2,811    66,352      1,563,769    1,958,420         23.57
16   Sciences        5.47%    44,035   954,725     18,496,247   24,234,407        19.37
17   Criminal        0.23%    2,328    41,419      736,881      1,019,988         17.79
18   Politics        1.56%    7,859    239,407     5,352,145    6,915,346         22.36
     Total           100%     805,185  17,095,994  346,454,533  443,301,776       20.27
 
Introduction to Vietnamese Corpora
<ANNOTATOR id="VTB0017">
<DOC docid="V010973" Language="Vietnamese" Domain="News">
<PARA id="1">
<SEG id="1">Nguyên_nhân/Nn/O là/Vc/O bão/Nn/O số/Nn/O 10/An/O đang/R/O chịu/Vv/O ảnh_hưởng/Nn/O bởi/Cp/O hệ_thống/Nn/O trục/Nn/O rãnh/Nn/O cao/Aa/O và/Cp/O sự/Nc/O lôi_kéo/Vv/O từ/Cm/O siêu_bão/Nn/TRM_B Melor/Nr/TRM_I ở/Cm/O ngoài/Cm/O khơi/Nn/O Philippines/Nr/LOC_B ./PU/O</SEG>
</PARA>
</DOC>
</ANNOTATOR>
A sample from VTB.
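Each token in the VTB sample has the shape word/POS/NE. As a rough illustration of how such a segment could be read programmatically, the sketch below splits a shortened excerpt of the sample into (word, POS, NE) triples. The Token type and parse_segment function are hypothetical names introduced here, not part of any CLC toolkit.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str  # morpho-syllables of one word, joined by "_"
    pos: str   # part-of-speech tag, e.g. "Nn", "Vc"
    ne: str    # named-entity tag, "O" when outside any entity


def parse_segment(text: str) -> list[Token]:
    """Split a 'word/POS/NE word/POS/NE ...' string into Token triples."""
    tokens = []
    for item in text.split():
        # rsplit from the right so a "/" inside the word itself is preserved
        word, pos, ne = item.rsplit("/", 2)
        tokens.append(Token(word, pos, ne))
    return tokens


# Shortened excerpt of the VTB sample on this slide.
sample = "Nguyên_nhân/Nn/O là/Vc/O bão/Nn/O số/Nn/O 10/An/O ./PU/O"
tokens = parse_segment(sample)
print(tokens[0])
```

Splitting from the right keeps the last two fields as tags even if a word contains a slash.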
 
Introduction to Parallel Corpora
Example sentence: "Life is clearly a miracle thing, completely differs from an inanimate world"
 
Introduction to Parallel Corpora
These corpora have been licensed to many organizations (e.g. I2R, Google, Samsung Elec., Systran, ELRA).
Example sentence: "Vietnam is the most important partner in Korea's ODA policy."
 
EVC INVENTORY

Table 1. Number of English-Vietnamese sentence pairs per domain.

Domain             Ratio     Pairs
1. News            50.22%    1,027,445
2. Conversation    16.81%    343,822
3. General         24.01%    491,223
4. Technical       4.45%     91,072
5. Entertainment   4.51%     92,182
Total              100%      2,045,744

Figure 1. Distribution of English-Vietnamese sentence pairs across domains.
 
CVC INVENTORY

Table 2. Number of Chinese-Vietnamese sentence pairs per domain.

Domain             Ratio     Pairs
1. News            71.28%    143,023
2. Conversation    15.17%    30,432
3. General         5.05%     10,126
4. Technical       4.39%     8,802
5. Entertainment   4.12%     8,265
Total              100%      200,648

Figure 2. Distribution of Chinese-Vietnamese sentence pairs across domains.
 
KVC INVENTORY

Table 3. Number of Korean-Vietnamese sentence pairs per domain.

Domain             Pairs      %
1. Blog            177,889    35.36
2. Conversation    80,126     15.93
3. Email           12,009     2.39
4. General         123,000    24.45
5. Technical       23,018     4.57
6. News            39,367     7.82
7. Entertainment   47,725     9.49
Total              503,134    100

Figure 3. Distribution of Korean-Vietnamese sentence pairs across domains.
 
KVC INVENTORY

Table 4. Number of words and average sentence length per domain.

Domain             nWord         avgLen
1. Blog            5,260,098     14.78
2. Conversation    1,600,756     9.99
3. Email           330,360       13.75
4. General         4,094,362     16.64
5. Technical       770,151       16.73
6. News            1,458,406     18.52
7. Entertainment   943,243       9.88
Total              14,457,376    14.37

Figure 4. Average sentence length per domain.

The word totals and the average sentence lengths count orthographic words. In Vietnamese, an orthographic word is a morpho-syllable; the majority of Vietnamese words (70%) have two syllables, and the average length of a word is about 2.12 syllables.
 
KVC INVENTORY

Table 5. Number of words and average sentence length per domain in each language (K = Korean, V = Vietnamese).

Domain             nKWord       nVWord       avgKLen   avgVLen
1. Blog            1,696,930    3,563,168    9.54      20.03
2. Conversation    534,040      1,066,716    6.67      13.31
3. Email           105,193      225,167      8.76      18.75
4. General         1,298,081    2,796,281    10.55     22.73
5. Technical       244,231      525,920      10.61     22.85
6. News            450,302      1,008,104    11.44     25.61
7. Entertainment   308,126      635,117      6.46      13.31
Total              4,636,903    9,820,473    9.22      19.52

For the same sentence pairs, the Vietnamese side has more orthographic words (morpho-syllables) than the Korean side (reported ratio: 2.08; the totals in Table 5 give 9,820,473 / 4,636,903 ≈ 2.12).
 
Processing Corpora for Contrastive Linguistics
Before exploiting our parallel corpora for Contrastive Linguistics, we need to process them as follows:
o Normalization: plain text, Unicode (UTF-8), XML.
o Alignment: matching the corresponding linguistic units between the source language and the target language, e.g. text alignment, paragraph alignment, sentence alignment, word alignment.
o Linguistic annotation: assigning linguistic tags to each linguistic unit in both the source and the target language, e.g. word segmentation, lemmatization, POS (parts of speech: Nc, Vc, Nq, …), NER (named-entity recognition), semantic tagging (semantic classes: HUM, ANI, ART, …).
All of the above processing is done with automatic tools from CLC and others, then manually post-edited.
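As an illustration of the normalization step, the sketch below decodes UTF-8 bytes, applies Unicode NFC normalization and collapses whitespace. The slide only specifies "text-only, Unicode (UTF-8)"; the choice of NFC is an assumption made here, motivated by the fact that Vietnamese letters with diacritics can be encoded either precomposed or as a base letter plus combining marks. The function name normalize_text is hypothetical.

```python
import unicodedata


def normalize_text(raw: bytes) -> str:
    """Decode UTF-8, compose diacritics (NFC, an assumption) and collapse whitespace."""
    text = raw.decode("utf-8")
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


# "e" followed by combining dot below and combining circumflex, plus extra spaces.
decomposed = "Vie\u0323\u0302t   Nam".encode("utf-8")
clean = normalize_text(decomposed)
print(clean, len(clean))
```

After normalization the two encodings of the same word compare equal, which matters for frequency counts and for alignment.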
 
Processing Corpora for Contrastive Linguistics
Paragraph alignment, then sentence alignment (* English, + Vietnamese):
* Helicopters can rise straight up into the air and can go straight down.
+ Máy bay trực thăng có thể lên thẳng trên không và đáp thẳng xuống đất.
* They can stand still in the air.
+ Chúng có thể đứng yên trên không.
* Helicopters do not have wings.
+ Máy bay trực thăng không có cánh.
 
Processing Corpora for Contrastive Linguistics
Jet planes fly about nine miles high .
Các máy bay phản lực bay cao khoảng chín dặm .

Word alignment with POS tags (E = English, V = Vietnamese):

POS (E)   NN         NNS             VBP   RB       CD     NNS     JJ
E         Jet        planes          fly   about    nine   miles   high
V         phản lực   các | máy bay   bay   khoảng   chín   dặm     cao
POS (V)   N          Q - N           V     R        Q      N       A

<DOC Domain='general'>
<SENT id='1'>
<TXT_E> Jet planes fly about nine miles high </TXT_E>
<TXT_V> Các máy bay phản lực bay cao khoảng chín dặm </TXT_V>
</SENT>
</DOC>
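A record in the DOC / SENT / TXT_E / TXT_V layout shown on this slide could be generated as in the sketch below. It uses Python's standard xml.etree.ElementTree module; the make_doc helper is a hypothetical name, and the real CLC format may carry further attributes not visible here.

```python
import xml.etree.ElementTree as ET


def make_doc(domain, pairs):
    """Build a <DOC> element holding one <SENT> per (English, Vietnamese) pair."""
    doc = ET.Element("DOC", {"Domain": domain})
    for i, (en, vi) in enumerate(pairs, start=1):
        sent = ET.SubElement(doc, "SENT", {"id": str(i)})
        ET.SubElement(sent, "TXT_E").text = en
        ET.SubElement(sent, "TXT_V").text = vi
    return doc


doc = make_doc("general", [
    ("Jet planes fly about nine miles high",
     "Các máy bay phản lực bay cao khoảng chín dặm"),
])
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Building the tree through the library rather than by string concatenation keeps the output well-formed even when a sentence contains characters such as "&" or "<".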
 
Processing Corpora for Contrastive Linguistics
Example from the KVC (Korean-Vietnamese Corpus):
Việt_Nam/Nr/ORG_B là/Vc/O đối_tác/Nn/O quan_trọng/Aa/O nhất/R/O trong/Cm/O chính_sách/Nn/O ODA/FW/ABB_B của/Cm/O Hàn_Quốc/Nr/ORG_B ./PU/O
(Vietnam is the most important partner in Korea's ODA policy.)
o The underscore "_" connects the morpho-syllables of a word, giving the words "Việt Nam", "là", "đối tác", "quan trọng", "nhất", "trong", "chính sách", "của", "Hàn Quốc".
o Tags follow the CLC POS and NE tagsets, e.g. "Nr": proper noun, "Vc": copula, "Nn": noun, "Aa": adjective, "R": adverb, "Cm": main-sub connector, "FW": foreign word, "PU": punctuation, …; "ORG": organization, "ABB": abbreviation, "O": outside any named entity, …
o Korean morphological analysis: the inflected forms (fml-pres, infml-pres, fml-pst, infml-pst, …) are reduced to the base form, e.g. the forms of the verb "go".
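The NE tags in the example carry a _B suffix on the first word of an entity, and the VTB sample also shows _I on a continuation word. Assuming that reading of the suffixes, the sketch below collects entity spans from an annotated sentence and restores the spaces inside words. The function name entity_spans is illustrative only.

```python
def entity_spans(annotated: str):
    """Return (surface text, label) for each named entity in a word/POS/NE string."""
    spans = []
    words, label = [], None
    for item in annotated.split():
        word, _pos, ne = item.rsplit("/", 2)
        surface = word.replace("_", " ")  # "_" joins morpho-syllables of one word
        if ne.endswith("_B"):
            if words:  # close an entity that is still open
                spans.append((" ".join(words), label))
            words, label = [surface], ne[:-2]
        elif ne.endswith("_I") and label == ne[:-2]:
            words.append(surface)  # continuation of the open entity
        else:
            if words:
                spans.append((" ".join(words), label))
            words, label = [], None
    if words:
        spans.append((" ".join(words), label))
    return spans


kvc = ("Việt_Nam/Nr/ORG_B là/Vc/O đối_tác/Nn/O quan_trọng/Aa/O nhất/R/O "
       "trong/Cm/O chính_sách/Nn/O ODA/FW/ABB_B của/Cm/O Hàn_Quốc/Nr/ORG_B ./PU/O")
print(entity_spans(kvc))
```

For the KVC sentence this yields the two organizations and the abbreviation "ODA".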
 
Applications of Corpus-based Contrastive Ling.

Table 6. Vietnamese word frequency.

No   Word    POS   (en)       f
1    của     Cm    (of)       1.820
2    và      Cp    (and)      1.822
3    các     Nq    (+PLR)     1.956
4    có      Ve    (have)     1.959
5    là      Vc    (to be)    1.968
6    trong   Cm    (in)       1.986
7    một     Nq    (one)      2.012
8    đã      R     (+PST)     2.031
9    những   Nq    (+PLR)     2.043
10   không   R     (no, not)  2.050

The frequency index f is consistent with $f = -\log_{10}(m/N)$, where m is the number of occurrences of the word and N is the size of the corpus used for measuring. For example, f = 2 means that the word occurs with relative frequency 1/100 (1%), and f = 1 means 1/10.

Rank   Word     Eng      POS   f
..
14     người    man      Nn    2.160
25     nhiều    many     Aa    2.210
27     năm      year     Nt    2.314
30     ngày     day      Nt    2.401
31     làm      do       Vv    2.423
32     phải     must     Vv    2.436
34     ông      you      Nn    2.464
36     theo     follow   Vv    2.530
43     việc     thing    Nn    2.611
53     có thể   able     Vv    2.660
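The two worked examples on this slide (f = 2 for a relative frequency of 1/100, f = 1 for 1/10) are consistent with a negative base-10 logarithm of the relative frequency m/N. The sketch below works under that assumption; the function names are illustrative, and the table value for "của" is read as the decimal 1.820.

```python
import math


def frequency_index(m: int, n: int) -> float:
    """Assumed index: f = -log10(m / N) for m occurrences in a corpus of N tokens."""
    return -math.log10(m / n)


def relative_frequency(f: float) -> float:
    """Invert the index back to a relative frequency m / N."""
    return 10 ** (-f)


print(frequency_index(1, 100))    # the slide's example: 1/100 gives f = 2
print(relative_frequency(1.820))  # share of tokens implied for the top-ranked word
```

Under this reading, a lower f means a more frequent word, and f = 1.820 for "của" corresponds to roughly 1.5% of all tokens.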
 
Applications of Corpus-based Contrastive Ling.
o The top 10% of word types account for 90% of the word tokens in Vietnamese text [3].
o Build vocabulary lists (e.g. top 1,000 words, top 2,000 words or top 3,000 words) matched to each learner's level.

Figure: cumulative frequency of Vietnamese words.

Less frequent senses of frequent word forms (words lost in extraction are marked [missing]):

Rank     Word        Eng          POS   f
3,775    của         wealth       Nn    4.6789
368      [missing]   then         M     3.4268
20,793   [missing]   shovel       Vv    6.1384
39,212   các         pay.extra    Vv    6.7405
3,224    [missing]   (particle)   M     4.5731
103      [missing]   exist        R     2.9803
19,385   [missing]   iron         Vv    6.0415
5,290    [missing]   being        Cs    4.9209
143      [missing]   as           Cp    3.0857
1,749    [missing]   (particle)   M     4.1842
186      tốt         good         Aa    3.1813
25,154   tốt         soldier      Nn    6.4394
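The coverage claim and the graded vocabulary lists can both be derived from a plain token count. The following sketch shows one way to compute the share of tokens covered by the k most frequent word types and to extract a top-k list; it runs on a tiny invented token sequence, not on VCor, and the function names are hypothetical.

```python
from collections import Counter


def coverage_of_top(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


def vocabulary_list(tokens, k):
    """The k most frequent word types, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(k)]


# Invented toy sequence: 10 tokens, 5 word types.
toy = "của và các của và của là của một và".split()
print(coverage_of_top(toy, 2), vocabulary_list(toy, 2))
```

On a real corpus the same two functions would be called with k = 1000, 2000 or 3000 to obtain the learner-level lists mentioned above.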
 
Applications of Corpus-based Contrastive Ling.
o Apply the frequency dictionary (for readability) to compile textbooks and dictionaries (definitions, usages) appropriate for each learner's level (A1, A2, … in CEFR).
o Example: the Oxford OALD8 uses the top 3,000 words in all its definitions, i.e. it avoids using a difficult word in the definition of a simple word.
o Counter-example in a Vietnamese dictionary: the definition of the word "đường" (sugar) is "một hợp chất kết tinh ..." ("hợp chất" = compound, "kết tinh" = crystallize).
 
Applications of EV Contrastive Linguistics
 
Applications of EV Contrastive Linguistics
o Contrasting word usage by context. For example, the Vietnamese "xảy ra" may be translated into English as "occur" (e.g. "an error occurs inside the computer"), "happen" (e.g. "an accident happens") or "take place" (e.g. "a meeting will take place").
o Similarly, English "to wear" corresponds to different Vietnamese verbs depending on the object (clothes = mặc; hat = đội; shoes = mang; glasses = đeo, …). In Vietnamese, "male" is "trống" for a cock (gà trống, not gà đực) but "đực" for a he-goat (dê đực, not dê trống). English has its own collocations: big/heavy rain, strong wind, powerful computer, …
o From EVC, learners can "self-learn" word usages.
o In some cases the usage given in dictionaries is not up to date: e.g. most dictionary definitions of "fondle" carry a positive meaning ("to caress lovingly / touch gently"), whereas its usage in real corpora is negative (sexual harassment).
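A learner-oriented lookup of this kind can be approximated by a simple bilingual concordance: find every sentence pair whose Vietnamese side contains the query and count which English candidates appear on the other side. The sketch below is a minimal version using plain substring matching. The English sentences reuse the examples on this slide and an earlier one, but the Vietnamese sides of the first three pairs are invented for illustration and are not taken from EVC; the function name is hypothetical.

```python
from collections import Counter

# Toy sentence pairs (English, Vietnamese). The Vietnamese sides of the first
# three pairs are hypothetical; they are NOT quoted from the EVC corpus.
PAIRS = [
    ("An error occurs inside the computer.", "Một lỗi xảy ra bên trong máy tính."),
    ("An accident happens.", "Một tai nạn xảy ra."),
    ("A meeting will take place.", "Một cuộc họp sẽ xảy ra."),
    ("Helicopters do not have wings.", "Máy bay trực thăng không có cánh."),
]


def translation_counts(pairs, vi_phrase, en_candidates):
    """Count English candidates in pairs whose Vietnamese side contains vi_phrase."""
    counts = Counter()
    for en, vi in pairs:
        if vi_phrase in vi.lower():
            for candidate in en_candidates:
                if candidate in en.lower():
                    counts[candidate] += 1
    return counts


usage = translation_counts(PAIRS, "xảy ra", ["occur", "happen", "take place"])
print(usage)
```

Substring matching is crude (it also finds "occur" inside "occurrence"); on word-aligned data the counts would instead be read off the alignment links.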
 
Applications of EV Contrastive Linguistics
Semantic tagging
Các máy_bay phản_lực bay cao khoảng chín dặm .
Jet planes fly about nine miles high.
Semantic classes:
o A110 (côn trùng / insects): ruồi, muỗi, gián, ong, kiến ...
o M19 (cách thức di chuyển / manner of movement): đi, chạy, bay, bơi ..
o H154 (dụng cụ đào, cắt / digging and cutting tools): cái bay, xẻng, ..
o M28: bay, lượn, vỗ cánh ...
o G280 (đại từ nhân xưng / personal pronouns): anh, bạn, bay, mày ..
o L47 (bay hơi, bay màu, …
 
Applications of KV Contrastive Linguistics
o Because of the typological differences between Korean (agglutinating, SOV, markers, head-final, …) and Vietnamese (isolating, SVO, no markers, head-initial, …), the two languages differ greatly in word boundaries, lexicalization, word order, …
o Word alignment is done automatically with the software tool GIZA++ [2] (using the co-occurrence probability of word/phrase pairs in Korean and Vietnamese), so the result is not 100% accurate.
Example sentence: "Vietnam is the most important partner in Korea's ODA policy."
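To give a feel for alignment by co-occurrence, the sketch below scores Korean-Vietnamese word pairs with the Dice coefficient over sentence pairs. This is a deliberately simpler stand-in and not what GIZA++ does (GIZA++ trains IBM and HMM alignment models); the three toy sentence pairs are invented, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_table(pairs):
    """Dice score 2*c(s,t) / (c(s) + c(t)) for every co-occurring word pair."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))
    return {(s, t): 2 * c / (src_count[s] + tgt_count[t])
            for (s, t), c in joint.items()}


def best_match(table, src_word):
    """Target word with the highest Dice score for src_word, or None."""
    candidates = {t: score for (s, t), score in table.items() if s == src_word}
    return max(candidates, key=candidates.get) if candidates else None


# Invented toy pairs: 한국 = Korea, 베트남 = Vietnam, 정책 = policy.
TOY = [
    ("한국 정책", "chính_sách Hàn_Quốc"),
    ("베트남 정책", "chính_sách Việt_Nam"),
    ("한국 베트남", "Hàn_Quốc Việt_Nam"),
]
table = dice_table(TOY)
print(best_match(table, "한국"))
```

Even in this toy setting, word pairs that co-occur only by accident get a score of 0.5, which illustrates why automatic alignment needs manual post-editing.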
 
Applications of KV Contrastive Linguistics
With the aligned corpus, learners can:
o observe how a word is used in context, e.g. the Korean word for "went" and its Vietnamese correspondences;
o observe the usage of the Korean base form for "go" and its inflections/variations;
o observe the usage of Vietnamese "đi" (go) and its corresponding translations in Korean;
o observe the word order in each language.
 
Applications of KV Contrastive Linguistics
* [Korean text not preserved] (22/33 content words)
+ Việt Nam là đối tác quan trọng nhất trong chính sách ODA của Hàn Quốc . Tổng thống Hàn Quốc đã đánh giá cao việc hợp tác lao động mang lại lợi ích to lớn cho cả Việt Nam và Hàn Quốc . Hai vị lãnh đạo đã nhất trí về lợi ích chung của các quốc gia trong và ngoài khu vực ASEAN như duy trì an ninh, hòa bình, ổn định biển Đông . (26/36 content words)
o As in Korean, more than 65% of the Vietnamese vocabulary is derived from Chinese (especially words used in formal writing and science); these words are called Sino-Vietnamese.
o Exploiting the shared Chinese origin of Sino-Korean and Sino-Vietnamese words will help Koreans acquire and enrich their Vietnamese vocabulary easily and effectively.
 
Applications of KV Contrastive Linguistics

Sino-Korean and Sino-Vietnamese correspondences (the Sino-Korean, Chinese (Traditional) and Chinese (Simplified) columns are not preserved):

Chinese (Pinyin)   Sino-Vietnamese    Vietnamese    English
hán guó            Hàn quốc           Hàn quốc      Korea
zhèng cè           chính sách         chính sách    policy
dì yī              đệ nhất            nhất          first
zhòng yào          trọng yếu          quan trọng    important
dà tǒng lǐng       đại thống lĩnh     tổng thống    president
láo dòng           lao động           lao động      labor
xié lì             hiệp lực           hợp tác       cooperate
lì yì              lợi ích            lợi ích       gain
píng jià           bình giá           đánh giá      estimate
zhǐ dǎo zhě        chỉ đạo giả        vị lãnh đạo   leader
dōng hǎi           Đông hải           Biển Đông     East sea
ān níng            an ninh            an ninh       security
píng hé            bình hòa           hòa bình      peace
ān dìng            an định            ổn định       stability
wéi chí            duy trì            duy trì       maintain
guó jiā            quốc gia           nước          country
yī zhì             nhất trí           nhất trí      correspond
gòng tōng          cộng thông         chung         common
dì yù              địa vực            khu vực       region
 
Conclusion
o Based on these corpora, we can automatically find the similarities and differences between Vietnamese and widely studied foreign languages (English, Korean, Chinese, …) in terms of:
   o morphology, syntax, semantics, pragmatics, e.g. equivalent translations, word usages, lexicalizations, POS, word order, head-initial vs. head-final, pre-position vs. post-position, markers (semantic role, discourse), cognates, ...
o The corpora help learners "self-test" word usages, the grammatical rules of Vietnamese, EV/CV/KV translations, …
o Learners can grasp language knowledge that the traditional approach can hardly convey in full.
o The corpora give linguists a useful tool to search for a given word, phrase or pattern, and to verify or prove a hypothesis.
 
Development
o Cooperation between your linguistics department/institute and our computational linguistics center to develop large annotated Chinese-Vietnamese parallel corpora.
o Chinese and Vietnamese are both isolating languages and share many features: cognates (65% Sino-Vietnamese), classifiers ("cái", "quyển", "tấm" and their Chinese counterparts), no inflection, tense, word order, particles and function words (e.g. "rồi/đã", "chưa", "sao rồi?", "anh đi bao giờ" (past) / "bao giờ anh đi" (future) and their Chinese counterparts, etc.).
o With these bilingual parallel corpora, linguists and learners (both Vietnamese and Chinese) can contrast the similarities and differences between Chinese and Vietnamese easily, quickly and exactly.
o If we cooperate and invest in enlarging and improving these corpora, their applications will grow greatly.
 
References
[1] https://www.clc.hcmus.edu.vn/resources/.
[2] Dien Dinh, "Building an Annotated English-Vietnamese parallel Corpus", MKS: A Journal of Southeast Asian Linguistics and Languages, Vol. 35, pp. 21-36, 2005.
[3] A.S. Hornby, "Từ điển Song ngữ Anh – Việt" (Oxford Advanced Learner's Dictionary, 8th ed., with Vietnamese translation) (compiler: Dinh Dien), Youth publisher, HCMC, VN, 2014.
[4] T. Phuoc, D. Dien, "A novel approach for handling unknown word problem in Chinese – Vietnamese machine translation", International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP), Vol. 19, No. 1, March 2014, pp. 1-10, ISSN: 1027-376X.
[5] Dien Dinh, Diep N., "Exploiting the Korean – Vietnamese Parallel Corpus in teaching Vietnamese for Koreans", Interdisciplinary Study on Language Communication in Multicultural Society, the Int'l Conference of ISEAS/BUFS, May 2017, pp. 11-23.
Thank you for your attention