Applications of Corpus-based Contrastive Linguistics for Vietnamese
Vietnam National University at HCM City
University of Science
Computational Linguistics Center
www.clc.hcmus.edu.vn
Dien D., Triet NQM
17-19 May 2018
The 28th Annual Meeting of the SEALS
Kaohsiung, Taiwan
Content
Overview
Introduction to Vietnamese Corpora
Processing Vietnamese Corpora for Contrastive Linguistics
Applications of Corpus-based Contrastive Linguistics for Vietnamese
Conclusion
Overview
Vietnamese is one of the 20 most widely spoken languages in the world.
It is an isolating language (like Chinese, Thai, Lao, …).
It is attracting more and more foreign researchers and learners. However, most traditional approaches to it are rationalist and theoretical.
New approaches to Contrastive Linguistics for Vietnamese are empirical and practical:
o Corpus-based Contrastive Linguistics.
Applications of corpus-based contrastive linguistics for Vietnamese: finding similarities and differences in morphology, syntax, semantics, and pragmatics between Vietnamese and widely studied foreign languages (English, Chinese, …).
These findings support researching and teaching Vietnamese, teaching foreign languages to Vietnamese speakers, lexicography, translation studies, etc.
Introduction to Vietnamese Corpora
VCor (Vietnamese Corpus): collected from many sources (online news, books, …) in 2000–2010; consists of 17M sentences, 346M words, and 443M morpho-syllables in 42 topics/18 domains. This corpus is automatically annotated with word segmentation (WS) and parts of speech (POS).
VTB (Vietnamese TreeBank): consists of approx. 300K sentences, manually annotated with word segmentation, POS, and named entities (NE).
EVC (English-Vietnamese parallel Corpus): general, news, conversations, …; consists of 2M sentence pairs, 80M words.
KVC (Korean-Vietnamese parallel Corpus): general, news, conversations, …; contains 500K sentence pairs, 14.5M words.
CVC (Chinese-Vietnamese parallel Corpus): general, news, conversations, …; consists of 200K sentence pairs.
The above-mentioned corpora are provided by CLC (Computational Linguistics Center, Vietnam National University at HCMC).
Parameters                               VCor           VTB
Number of sentences                      17,095,994     302,491
Number of morpho-syllables               443,301,776    9,154,582
Number of words                          346,454,533    7,096,580
Avg. sentence length (words)             20.27          23.46
Avg. word length (morpho-syllables)      1.28           1.29
Avg. morpho-syllable length (letters)    3.27           3.27
Number of unique morpho-syllables        6,835          6,714
Number of unique words                   34,588         32,645
Note: "I am a teacher" has 4 words. "Tôi là một giáo viên" has 5 morpho-syllables but also 4 words, because "giáo viên" (teacher) is a single two-syllable word.
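The word versus morpho-syllable distinction in the note can be sketched in a few lines of Python. This is only an illustration: it assumes the input is already word-segmented, with the syllables of a multi-syllable word joined by an underscore (the convention visible in the VTB sample, e.g. `Nguyên_nhân`). The function name `count_units` is hypothetical and not part of any CLC tool.

```python
def count_units(segmented: str) -> tuple[int, int]:
    """Return (number of words, number of morpho-syllables).

    Assumes words are separated by whitespace and the syllables of a
    multi-syllable word are joined by "_" (an assumed convention).
    """
    words = segmented.split()
    # Split each word on "_" to recover its morpho-syllables.
    syllables = [s for w in words for s in w.split("_") if s]
    return len(words), len(syllables)


n_words, n_syllables = count_units("Tôi là một giáo_viên")
print(n_words, n_syllables)  # 4 5
```

Dividing the syllable count by the word count over a whole corpus gives the "average word length (morpho-syllables)" row of the table above.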
No.    Topic            Ratio     Files      Sentences     Words          Mor-Syl        Len.
1      Entertainment    7.19%     67,535     1,374,386     26,350,787     31,868,527     19.17
2      Sports           3.53%     32,945     668,776       12,609,716     15,660,217     18.85
3      Computer         3.70%     27,037     616,068       12,638,479     16,392,697     20.51
4      Education        6.57%     47,740     1,060,987     22,535,214     29,142,722     21.24
5      Health           7.23%     56,154     1,211,813     25,796,610     32,040,892     21.29
6      Economics        8.51%     55,360     1,284,164     27,840,867     37,715,850     21.68
7      Tourism&Food     5.65%     62,030     964,265       19,430,539     25,046,919     20.15
8      Life             7.45%     75,093     1,406,104     26,503,411     33,032,093     18.85
9      Society          13.39%    97,144     2,174,765     45,975,042     59,375,454     21.14
10     Religion         5.33%     39,320     942,721       18,984,779     23,618,434     20.14
11     Culture          9.56%     86,842     1,770,401     33,964,734     42,378,422     19.18
12     Law              5.90%     43,219     977,697       19,309,864     26,170,834     19.75
13     Military         4.58%     30,660     746,093       15,859,404     20,312,096     21.26
14     International    3.70%     27,073     595,851       12,506,045     16,418,458     20.99
15     Transportation   0.44%     2,811      66,352        1,563,769      1,958,420      23.57
16     Sciences         5.47%     44,035     954,725       18,496,247     24,234,407     19.37
17     Criminal         0.23%     2,328      41,419        736,881        1,019,988      17.79
18     Politics         1.56%     7,859      239,407       5,352,145      6,915,346      22.36
Total                   100%      805,185    17,095,994    346,454,533    443,301,776    20.27
<ANNOTATOR id="VTB0017">
<DOC docid="V010973" Language="Vietnamese"
Domain="News">
<PARA id="1">
<SEG id="1">Nguyên_nhân/Nn/O là/Vc/O bão/Nn/O số/Nn/O
10/An/O đang/R/O chịu/Vv/O ảnh_hưởng/Nn/O bởi/Cp/O
hệ_thống/Nn/O trục/Nn/O rãnh/Nn/O cao/Aa/O và/Cp/O sự/Nc/O
lôi_kéo/Vv/O từ/Cm/O siêu__bão/Nn/TRM_B Melor/Nr/TRM_I
ở/Cm/O ngoài/Cm/O khơi/Nn/O Philippines/Nr/LOC_B
./PU/O</SEG>
</PARA>
</DOC>
</ANNOTATOR>
A sample in VTB.
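A minimal sketch of how a `<SEG>` line like the one above could be read, assuming each token has the shape `word/POS/NE` and tokens are separated by whitespace. The segment below is abridged from the sample for brevity, and `parse_seg` is a hypothetical helper, not a CLC tool.

```python
import xml.etree.ElementTree as ET

# Abridged from the VTB sample above (some tokens left out for brevity).
SAMPLE = (
    '<SEG id="1">Nguyên_nhân/Nn/O là/Vc/O bão/Nn/O '
    "ở/Cm/O ngoài/Cm/O khơi/Nn/O Philippines/Nr/LOC_B ./PU/O</SEG>"
)


def parse_seg(seg_xml: str) -> list[tuple[str, str, str]]:
    """Split a <SEG> element into (word, POS tag, NE tag) triples."""
    seg = ET.fromstring(seg_xml)
    tokens = []
    for item in (seg.text or "").split():
        # Split from the right so a "/" inside the word itself is preserved.
        word, pos, ne = item.rsplit("/", 2)
        tokens.append((word, pos, ne))
    return tokens


tokens = parse_seg(SAMPLE)
# Keep only tokens whose NE tag is not "O" (outside any entity).
entities = [(word, ne) for word, _, ne in tokens if ne != "O"]
print(entities)  # [('Philippines', 'LOC_B')]
```

Splitting from the right is a design choice: the two tags never contain a slash, while the word itself might.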
Introduction to Parallel Corpora
Life is clearly a miracle thing, completely differs from an inanimate world
These corpora have been licensed to many organizations (e.g., I2R, Google, Samsung Elec., Systran, ELRA).
Vietnam is the most important partner in the Korea’s ODA policy.
EVC INVENTORY
Figure 1. Number of EV sentence pairs by domain.
Domain              Ratio      Sentence pairs
1. News             50.22%     1,027,445
2. Conversation     16.81%     343,822
3. General          24.01%     491,223
4. Technical        4.45%      91,072
5. Entertainment    4.51%      92,182
Total:              100%       2,045,744
Table 1. Number of EV sentence pairs by domain.
CVC INVENTORY
Figure 2. Number of CV sentence pairs by domain.
Domain              Ratio      Sentence pairs
1. News             71.28%     143,023
2. Conversation     15.17%     30,432
3. General          5.05%      10,126
4. Technical        4.39%      8,802
5. Entertainment    4.12%      8,265
Total:              100%       200,648
Table 2. Number of CV sentence pairs by domain.
KVC INVENTORY
Domain              Sentence pairs    %
1. Blog             177,889           35.36
2. Conversation     80,126            15.93
3. Email            12,009            2.39
4. General          123,000           24.45
5. Technical        23,018            4.57
6. News             39,367            7.82
7. Entertainment    47,725            9.49
Total:              503,134           100
Table 3. Number of KV sentence pairs by domain.
Figure 3. Number of KV sentence pairs by domain.
Domain              nWord         avgLen
1. Blog             5,260,098     14.78
2. Conversation     1,600,756     9.99
3. Email            330,360       13.75
4. General          4,094,362     16.64
5. Technical        770,151       16.73
6. News             1,458,406     18.52
7. Entertainment    943,243       9.88
Total:              14,457,376    14.37
Table 4. Number of words by domain.
Figure 4. Average sentence length by domain.
These statistics give the total number of words and the average sentence length, where "word" means the orthographic word. In Vietnamese, the majority of words (70%) have two syllables, and the average length of a word is about 2.12 syllables (in Vietnamese, a morpho-syllable is one orthographic word).
Domain              nKWord       nVWord       avgKLen    avgVLen
1. Blog             1,696,930    3,563,168    9.54       20.03
2. Conversation     534,040      1,066,716    6.67       13.31
3. Email            105,193      225,167      8.76       18.75
4. General          1,298,081    2,796,281    10.55      22.73
5. Technical        244,231      525,920      10.61      22.85
6. News             450,302      1,008,104    11.44      25.61
7. Entertainment    308,126      635,117      6.46       13.31
Total:              4,636,903    9,820,473    9.22       19.52
Table 5. Number of words and average sentence length by domain in each language.
For the same sentence pairs, the total number of orthographic words in Vietnamese (morpho-syllables) is larger than the number of orthographic words in Korean (ratio: 2.08).
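As a rough check, the Vietnamese-to-Korean word ratio can be recomputed from the counts in Table 5. The sketch below copies those counts verbatim; the names `KVC_WORDS` and `vk_ratio` are hypothetical. On the Total row the ratio comes to about 2.12, so the quoted 2.08 may reflect a different averaging or an earlier count. That explanation is a guess, not something the slides state.

```python
# (Korean words, Vietnamese words) per domain, copied from Table 5.
KVC_WORDS = {
    "Blog": (1_696_930, 3_563_168),
    "Conversation": (534_040, 1_066_716),
    "Email": (105_193, 225_167),
    "General": (1_298_081, 2_796_281),
    "Technical": (244_231, 525_920),
    "News": (450_302, 1_008_104),
    "Entertainment": (308_126, 635_117),
}


def vk_ratio(k_words: int, v_words: int) -> float:
    """Vietnamese orthographic words per Korean orthographic word."""
    return v_words / k_words


total_k = sum(k for k, _ in KVC_WORDS.values())
total_v = sum(v for _, v in KVC_WORDS.values())
overall = vk_ratio(total_k, total_v)
per_domain = {d: round(vk_ratio(k, v), 2) for d, (k, v) in KVC_WORDS.items()}

print(round(overall, 2))  # 2.12
```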
Processing Corpora for Contrastive Linguistics
Before exploiting our parallel corpora for Contrastive Linguistics, we need to process them as follows:
Normalization: plain text, Unicode (UTF-8), XML.
Alignment: matching corresponding linguistic units between the source language and the target language.
Ex: text alignment, paragraph alignment, sentence alignment, word alignment.
Linguistic annotation: assigning linguistic tags to each linguistic unit in both the source language and the target language.
Ex: word segmentation, lemmatization, POS (Parts-of-Speech: Nc, Vc, Nq, …), NER (Named Entity Recognition), semantic tagging (semantic classes: HUM, ANI, ART, …).
All of the above processing is done with automatic tools from CLC and others, then manually post-edited.
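The normalization step alone could be sketched as follows. This is not the CLC pipeline: it assumes UTF-8 input, picks NFC as the Unicode normalization form (an assumption, chosen because Vietnamese diacritics can be encoded either precomposed or as combining marks), collapses whitespace, and wraps the result in a `<SEG>` element like the one in the VTB sample. The function names are hypothetical.

```python
import unicodedata
from xml.sax.saxutils import escape


def normalize_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, apply NFC normalization, collapse whitespace."""
    text = raw.decode("utf-8")
    # NFC merges base letters with combining marks (assumed target form).
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def to_seg(sent_id: int, text: str) -> str:
    """Wrap a normalized sentence in a <SEG> element, escaping &, <, >."""
    return f'<SEG id="{sent_id}">{escape(text)}</SEG>'


# "o" followed by a combining circumflex (U+0302) becomes precomposed "ô".
decomposed = "To\u0302i   la\u0300 gia\u0301o vie\u0302n".encode("utf-8")
clean = normalize_text(decomposed)
print(to_seg(1, clean))  # <SEG id="1">Tôi là giáo viên</SEG>
```

Normalizing to a single form before alignment and annotation keeps visually identical strings from being counted as different word types.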