Applications of Corpus-based Contrastive Linguistics for Vietnamese
Dien D., Triet NQM
Computational Linguistics Center, University of Science, Vietnam National University at HCM City
www.clc.hcmus.edu.vn
The 28th Annual Meeting of the SEALS, Kaohsiung, Taiwan, 17-19 May 2018
 
Content
- Overview
- Introduction to Vietnamese Corpora
- Processing Vietnamese Corpora for Contrastive Linguistics
- Applications of Corpus-based Contrastive Linguistics for Vietnamese
- Conclusion
 
Overview
- Vietnamese is one of the 20 most popular languages in the world.
- It is an isolating language (similar to Chinese, Thai, Lao, …).
- It is attracting more and more foreign researchers and learners.
- However, most traditional approaches are rationalist and theoretical.
- New approaches to Contrastive Linguistics research for Vietnamese are empirical and practical:
  o Corpus-based Contrastive Linguistics.
- Applications of corpus-based contrastive linguistics for Vietnamese:
  o Finding the similarities and differences in morphology, syntax, semantics, and pragmatics between Vietnamese and popular foreign languages (English, Chinese, …).
  o Researching and teaching Vietnamese, teaching foreign languages to Vietnamese speakers, lexicography, translation studies, etc.
 
 
Introduction to Vietnamese Corpora
- VCor (Vietnamese Corpus): collected from many sources (online news, books) in 2000-2010. It consists of 17M sentences, 346M words, and 443M morpho-syllables in 42 topics/18 domains. The corpus is automatically annotated with word segmentation (WS) and part-of-speech (POS) tags.
- VTB (Vietnamese TreeBank): approx. 300K sentences, manually annotated with word segmentation, POS, and named entities (NE).
- EVC (English-Vietnamese parallel Corpus): general texts, news, conversations, …; 2M sentence pairs, 80M words.
- KVC (Korean-Vietnamese parallel Corpus): general texts, news, conversations, …; 500K sentence pairs, 14.5M words.
- CVC (Chinese-Vietnamese parallel Corpus): general texts, news, conversations, …; 200K sentence pairs.
The above-mentioned corpora are provided by CLC (Computational Linguistics Center, Vietnam National University at HCMC).
 
Introduction to Vietnamese Corpora
Parameters                                    VCor          VTB
Number of sentences                     17,095,994      302,491
Number of morpho-syllables             443,301,776    9,154,582
Number of words                        346,454,533    7,096,580
Avg. length of sentence (words)              20.27        23.46
Avg. length of word (morpho-syllables)        1.28         1.29
Avg. length of morpho-syllable (letters)      3.27         3.27
Number of unique morpho-syllables            6,835        6,714
Number of unique words                      34,588       32,645
Note: "I am a teacher" (4 words); "Tôi là một giáo viên" (5 morpho-syllables).
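The distinction between words and morpho-syllables in the note can be illustrated with a minimal sketch. It assumes a word-segmented sentence in which words are separated by spaces and the syllables of a multi-syllable word are joined by underscores (the convention visible in the VTB sample); the function name `count_units` is hypothetical and not part of any CLC tool.

```python
from typing import Tuple


def count_units(segmented_sentence: str) -> Tuple[int, int]:
    """Return (number of words, number of morpho-syllables).

    Assumption: words are space-separated and the syllables inside a
    word are joined by "_" (e.g. "giáo_viên" is one word, two syllables).
    """
    words = segmented_sentence.split()
    # Ignore empty pieces so that a doubled underscore does not add a syllable.
    syllables = sum(len([s for s in w.split("_") if s]) for w in words)
    return len(words), syllables


# "Tôi là một giáo viên" with "giáo viên" segmented as a single word.
print(count_units("Tôi là một giáo_viên"))  # (4, 5)
```

With this segmentation the Vietnamese sentence has 4 words but 5 morpho-syllables, which is why the two counts are reported separately in the table.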
 
Introduction to Vietnamese Corpora
No.  Topic             Ratio     Files   Sentences        Words      Mor-Syl   Len.
1    Entertainment     7.19%    67,535   1,374,386   26,350,787   31,868,527  19.17
2    Sports            3.53%    32,945     668,776   12,609,716   15,660,217  18.85
3    Computer          3.70%    27,037     616,068   12,638,479   16,392,697  20.51
4    Education         6.57%    47,740   1,060,987   22,535,214   29,142,722  21.24
5    Health            7.23%    56,154   1,211,813   25,796,610   32,040,892  21.29
6    Economics         8.51%    55,360   1,284,164   27,840,867   37,715,850  21.68
7    Tourism&Food      5.65%    62,030     964,265   19,430,539   25,046,919  20.15
8    Life              7.45%    75,093   1,406,104   26,503,411   33,032,093  18.85
9    Society          13.39%    97,144   2,174,765   45,975,042   59,375,454  21.14
10   Religion          5.33%    39,320     942,721   18,984,779   23,618,434  20.14
11   Culture           9.56%    86,842   1,770,401   33,964,734   42,378,422  19.18
12   Law               5.90%    43,219     977,697   19,309,864   26,170,834  19.75
13   Military          4.58%    30,660     746,093   15,859,404   20,312,096  21.26
14   International     3.70%    27,073     595,851   12,506,045   16,418,458  20.99
15   Transportation    0.44%     2,811      66,352    1,563,769    1,958,420  23.57
16   Sciences          5.47%    44,035     954,725   18,496,247   24,234,407  19.37
17   Criminal          0.23%     2,328      41,419      736,881    1,019,988  17.79
18   Politics          1.56%     7,859     239,407    5,352,145    6,915,346  22.36
     Total              100%   805,185  17,095,994  346,454,533  443,301,776  20.27
 
 
Introduction to Vietnamese Corpora
A sample in VTB:
<ANNOTATOR id="VTB0017">
<DOC docid="V010973" Language="Vietnamese" Domain="News">
<PARA id="1">
<SEG id="1">Nguyên_nhân/Nn/O là/Vc/O bão/Nn/O số/Nn/O 10/An/O đang/R/O chịu/Vv/O ảnh_hưởng/Nn/O bởi/Cp/O hệ_thống/Nn/O trục/Nn/O rãnh/Nn/O cao/Aa/O và/Cp/O sự/Nc/O lôi_kéo/Vv/O từ/Cm/O siêu__bão/Nn/TRM_B Melor/Nr/TRM_I ở/Cm/O ngoài/Cm/O khơi/Nn/O Philippines/Nr/LOC_B ./PU/O</SEG>
</PARA>
</DOC>
</ANNOTATOR>
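The token format in the VTB sample can be read with a short script. The sketch below assumes each token has the shape word/POS/NE, and that NE tags ending in `_B` and `_I` mark the beginning and the continuation of an entity while `O` marks a token outside any entity. This reading of the tags is an assumption based on the sample, not a documented CLC specification, and the names `Token`, `parse_seg`, and `named_entities` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    pos: str
    ne: str


def parse_seg(seg_text: str) -> List[Token]:
    """Split the text of a <SEG> element into (word, POS, NE) tokens."""
    tokens = []
    for item in seg_text.split():
        # Split from the right so that a word containing "/" stays intact.
        word, pos, ne = item.rsplit("/", 2)
        tokens.append(Token(word, pos, ne))
    return tokens


def named_entities(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Group consecutive X_B / X_I tags into (label, text) spans (assumed scheme)."""
    spans = []
    current = None  # [label, [words]] of the entity being built
    for tok in tokens:
        if tok.ne.endswith("_B"):
            if current:
                spans.append(current)
            current = [tok.ne[:-2], [tok.word]]
        elif tok.ne.endswith("_I") and current and current[0] == tok.ne[:-2]:
            current[1].append(tok.word)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]


# Tail of the sample sentence, copied verbatim from the VTB sample.
seg = ("từ/Cm/O siêu__bão/Nn/TRM_B Melor/Nr/TRM_I ở/Cm/O ngoài/Cm/O "
       "khơi/Nn/O Philippines/Nr/LOC_B ./PU/O")
tokens = parse_seg(seg)
print(named_entities(tokens))
```

Splitting from the right keeps the final token `./PU/O` and any word that itself contains a slash intact.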
 
Introduction to Parallel Corpora
Life is clearly a miracle thing, completely differs from an inanimate world
 
Introduction to Parallel Corpora
These corpora have been licensed to many organizations (e.g., I2R, Google, Samsung Elec., Systran, ELRA, etc.).
Vietnam is the most important partner in the Korea’s ODA policy.
 
EVC INVENTORY
Table 1. Number of English-Vietnamese (EV) sentence pairs per domain.
Domains             Ratio         Qtt.
1. News            50.22%    1,027,445
2. Conversation    16.81%      343,822
3. General         24.01%      491,223
4. Technical        4.45%       91,072
5. Entertainment    4.51%       92,182
Total                100%    2,045,744
Figure 1. Distribution of EV sentence pairs across domains.
 
CVC INVENTORY
Table 2. Number of Chinese-Vietnamese (CV) sentence pairs per domain.
Domains             Ratio       Qtt.
1. News            71.28%    143,023
2. Conversation    15.17%     30,432
3. General          5.05%     10,126
4. Technical        4.39%      8,802
5. Entertainment    4.12%      8,265
Total                100%    200,648
Figure 2. Distribution of CV sentence pairs across domains.
 
KVC INVENTORY
Table 3. Number of Korean-Vietnamese (KV) sentence pairs per domain.
Domain                 Qtt.        %
1. Blog             177,889    35.36
2. Conversation      80,126    15.93
3. Email             12,009     2.39
4. General          123,000    24.45
5. Technical         23,018     4.57
6. News              39,367     7.82
7. Entertainment     47,725     9.49
Total               503,134      100
Figure 3. Distribution of KV sentence pairs across domains.
 
KVC INVENTORY
Table 4. Number of words and average sentence length per domain.
Domain                  nWord    avgLen
1. Blog             5,260,098     14.78
2. Conversation     1,600,756      9.99
3. Email              330,360     13.75
4. General          4,094,362     16.64
5. Technical          770,151     16.73
6. News             1,458,406     18.52
7. Entertainment      943,243      9.88
Total              14,457,376     14.37
Figure 4. Average sentence length per domain.
The statistics cover the total number of words and the average sentence length per domain (here "word" means orthographic word). In Vietnamese, the majority of words (70%) have two syllables, and the average word length is about 2.12 syllables (a morpho-syllable corresponds to one orthographic word).
 
KVC INVENTORY
Table 5. Number of words and average sentence length per domain in each language.
Domain                 nKWord      nVWord    avgKLen    avgVLen
1. Blog             1,696,930   3,563,168       9.54      20.03
2. Conversation       534,040   1,066,716       6.67      13.31
3. Email              105,193     225,167       8.76      18.75
4. General          1,298,081   2,796,281      10.55      22.73
5. Technical          244,231     525,920      10.61      22.85
6. News               450,302   1,008,104      11.44      25.61
7. Entertainment      308,126     635,117       6.46      13.31
Total               4,636,903   9,820,473       9.22      19.52
For the same sentence pairs, Vietnamese has more orthographic words (morpho-syllables) than Korean (ratio: 2.08).
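The average lengths in the KVC tables can be recomputed from the raw counts. The sketch below copies the sentence-pair counts from Table 3 and the per-language word counts from Table 5. It assumes that the per-language averages are words divided by sentence pairs, and that the combined average in Table 4 is taken over both sides of each pair (words divided by twice the number of pairs). That reading is inferred from the figures, not stated on the slides, and the name `average_lengths` is hypothetical.

```python
# (sentence pairs, Korean words, Vietnamese words), copied from Tables 3 and 5.
KVC = {
    "Blog": (177_889, 1_696_930, 3_563_168),
    "Conversation": (80_126, 534_040, 1_066_716),
    "Email": (12_009, 105_193, 225_167),
    "General": (123_000, 1_298_081, 2_796_281),
    "Technical": (23_018, 244_231, 525_920),
    "News": (39_367, 450_302, 1_008_104),
    "Entertainment": (47_725, 308_126, 635_117),
}


def average_lengths(pairs, k_words, v_words):
    """Return (avg Korean length, avg Vietnamese length, avg over both sides)."""
    avg_k = k_words / pairs
    avg_v = v_words / pairs
    # Assumption: the combined average counts each pair as two sentences.
    avg_both = (k_words + v_words) / (2 * pairs)
    return round(avg_k, 2), round(avg_v, 2), round(avg_both, 2)


for domain, row in KVC.items():
    print(domain, average_lengths(*row))

# Column sums give the "Total" rows of the tables.
total = tuple(sum(col) for col in zip(*KVC.values()))
print("Total", total, average_lengths(*total))
```

Under these assumptions the recomputed values match the avgKLen, avgVLen, and avgLen columns to two decimals.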
 
 
Processing Corpora for Contrastive Linguistics
Before exploiting our parallel corpora for Contrastive Linguistics, we need to process them as follows:
- Normalization: text-only, Unicode (UTF-8), XML.
- Alignment: matching the corresponding linguistic units between the source language and the target language, e.g., text alignment, paragraph alignment, sentence alignment, word alignment.
- Linguistic annotation: assigning linguistic tags to each linguistic unit in both the source language and the target language, e.g., word segmentation, lemmatization, POS (parts of speech: Nc, Vc, Nq, …), NER (named entity recognition), semantic tagging (semantic classes: HUM, ANI, ART, …).
All of the above steps are performed with automatic tools from CLC and others, and the output is then post-edited manually.
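The normalization step can be sketched as follows. Vietnamese text often mixes precomposed letters (one code point per accented letter) with decomposed sequences (base letter plus combining marks), and the two look identical on screen but do not compare equal. The sketch assumes NFC as the target form and collapses whitespace. Both choices are assumptions for illustration, since the slides state only "text-only, Unicode (UTF-8), XML", and the function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in Unicode NFC form with whitespace runs collapsed.

    Assumption: NFC (precomposed letters) is the target form; the slides
    do not say which normalization form the CLC tools use.
    """
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())


# "Tiếng Việt" typed with combining marks (decomposed form).
decomposed = "Tie\u0302\u0301ng   Vie\u0323\u0302t"
clean = normalize_text(decomposed)

print(len(decomposed), len(clean))   # the normalized string is shorter
utf8_bytes = clean.encode("utf-8")   # bytes ready to be written to a UTF-8 file
```

Normalizing both sides of a parallel corpus to the same form before alignment and annotation means identical words are counted as identical types.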