Applications of Corpus-based Contrastive Linguistics for Vietnamese
Dien D., Triet NQM
Computational Linguistics Center, University of Science, Vietnam National University at HCM City
www.clc.hcmus.edu.vn
The 28th Annual Meeting of the SEALS, 17-19 May 2018, Kaohsiung, Taiwan
 
Content
- Overview
- Introduction to Vietnamese Corpora
- Processing Vietnamese Corpora for Contrastive Linguistics
- Applications of Corpus-based Contrastive Linguistics for Vietnamese
- Conclusion
 
Overview
Vietnamese:
- is among the 20 most widely spoken languages in the world;
- is an isolating language (like Chinese, Thai, Lao, …);
- is attracting a growing number of foreign researchers and learners.
However, most traditional approaches to contrastive linguistics for Vietnamese are rationalist and theoretical.
The new approach is empiricist and practical:
o Corpus-based Contrastive Linguistics.
Applications of corpus-based contrastive linguistics for Vietnamese:
- Finding similarities and differences in morphology, syntax, semantics and pragmatics between Vietnamese and widely used foreign languages (English, Chinese, …).
- Researching and teaching Vietnamese; teaching foreign languages to Vietnamese speakers; lexicography, translation studies, etc.
 
 
Introduction to Vietnamese Corpora
- VCor (Vietnamese Corpus): collected from many sources (online news, books) in 2000-2010; consists of 17M sentences, 346M words and 443M morpho-syllables in 42 topics/18 domains. The corpus is automatically annotated with word segmentation (WS) and part-of-speech (POS) tags.
- VTB (Vietnamese TreeBank): approx. 300K sentences, manually annotated with word segmentation, POS tags and named entities (NE).
- EVC (English-Vietnamese parallel Corpus): general text, news, conversations, …; 2M sentence pairs, 80M words.
- KVC (Korean-Vietnamese parallel Corpus): general text, news, conversations, …; 500K sentence pairs, 14.5M words.
- CVC (Chinese-Vietnamese parallel Corpus): general text, news, conversations, …; 200K sentence pairs.
All of the above corpora are provided by CLC (Computational Linguistics Center, Vietnam National University at HCMC).
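Word segmentation (WS) matters for these counts because Vietnamese spelling puts spaces between morpho-syllables, not between words. The sketch below shows one way to count both units in a word-segmented sentence. It assumes the syllables of a multi-syllable word are joined with an underscore, a common convention in Vietnamese NLP tools. The slides do not state the actual storage format of VCor or VTB, and the function name `count_units` is hypothetical.

```python
from typing import Tuple


def count_units(segmented_sentence: str) -> Tuple[int, int]:
    """Return (number of words, number of morpho-syllables).

    Assumption: words are separated by whitespace, and the syllables of a
    multi-syllable word are joined with "_" (e.g. "giáo_viên").
    """
    words = segmented_sentence.split()
    # Split each word on "_" to recover its morpho-syllables.
    syllables = [s for w in words for s in w.split("_") if s]
    return len(words), len(syllables)


# "giáo viên" (teacher) is one word written as two morpho-syllables.
n_words, n_syllables = count_units("Tôi là một giáo_viên")
print(n_words, n_syllables)  # 4 5
```

On raw, unsegmented text a plain whitespace split yields only the morpho-syllable count, which is why the word counts rely on the WS annotation.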
 
Introduction to Vietnamese Corpora

Parameter                                         VCor          VTB
Number of sentences                         17,095,994      302,491
Number of morpho-syllables                 443,301,776    9,154,582
Number of words                            346,454,533    7,096,580
Avg. sentence length (words)                     20.27        23.46
Avg. word length (morpho-syllables)               1.28         1.29
Avg. morpho-syllable length (letters)             3.27         3.27
Number of unique morpho-syllables                6,835        6,714
Number of unique words                          34,588       32,645

Note: "I am a teacher" has 4 words; "Tôi là một giáo viên" has 5 morpho-syllables, because "giáo viên" (teacher) is a single word written as two morpho-syllables.
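The two averages for sentence and word length follow directly from the totals reported in the table. The sketch below recomputes them as a consistency check. The counts are copied from the slide; the dictionary layout and the name `averages` are illustrative only.

```python
# Totals as reported on the slide for each corpus.
CORPORA = {
    "VCor": {"sentences": 17_095_994, "syllables": 443_301_776, "words": 346_454_533},
    "VTB": {"sentences": 302_491, "syllables": 9_154_582, "words": 7_096_580},
}


def averages(counts):
    """Return (avg words per sentence, avg morpho-syllables per word)."""
    words_per_sentence = counts["words"] / counts["sentences"]
    syllables_per_word = counts["syllables"] / counts["words"]
    return round(words_per_sentence, 2), round(syllables_per_word, 2)


for name, counts in CORPORA.items():
    print(name, averages(counts))
# VCor (20.27, 1.28)
# VTB (23.46, 1.29)
```

The recomputed values match the reported averages. The average morpho-syllable length (3.27 letters) cannot be checked this way, since the slide gives no total letter count.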
 