Applications of corpus-based
Contrastive Linguistics for Vietnamese
Vietnam National University at HCM City
University of Science
Computational Linguistics Center
www.clc.hcmus.edu.vn
Dien D.,
Triet NQM
17-19 May 2018
The 28th Annual Meeting of the SEALS
Kaohsiung, Taiwan
Content
Overview
Introduction to Vietnamese Corpora
Processing VNese Corpora for Contrastive Linguistics
Applications of Corpus-based CL for VNese
Conclusion
Overview
• Vietnamese is among the 20 most widely spoken languages in the world.
• It is an isolating language (similar to Chinese, Thai, Lao, …).
• It is studied by more and more foreign researchers and learners. However, most traditional approaches are rationalist and theoretical.
• New approaches to Contrastive Linguistics for Vietnamese are empirical and practical:
o Corpus-based Contrastive Linguistics.
• Applications of corpus-based contrastive linguistics for Vietnamese: finding the similarities and differences in morphology, syntax, semantics, and pragmatics between Vietnamese and popular foreign languages (English, Chinese, …).
• Uses: researching and teaching Vietnamese, teaching foreign languages to Vietnamese speakers, lexicography, translation studies, etc.
Introduction to Vietnamese Corpora
• VCor (Vietnamese Corpus): collected from many sources (online news, books, …) in 2000-2010. It consists of 17M sentences, 346M words, and 443M morpho-syllables in 42 topics/18 domains. This corpus is automatically annotated with word segmentation (WS) and POS.
• VTB (Vietnamese TreeBank): consists of approx. 300K sentences, manually annotated with word segmentation, POS, and NE.
• EVC (English-Vietnamese parallel Corpus): general text, news, conversations, …; consists of 2M sentence pairs, 80M words.
• KVC (Korean-Vietnamese parallel Corpus): general text, news, conversations, …; contains 500K sentence pairs, 14.5M words.
• CVC (Chinese-Vietnamese parallel Corpus): general text, news, conversations, …; consists of 200K sentence pairs.
The above-mentioned corpora are provided by CLC (Computational Linguistics Center, Vietnam National University at HCMC).
Parameters                                       VCor          VTB
Number of sentences                        17,095,994      302,491
Number of morpho-syllables                443,301,776    9,154,582
Number of words                           346,454,533    7,096,580
Avg. length of sentence (words)                 20.27        23.46
Avg. length of word (morpho-syllables)           1.28         1.29
Avg. length of morpho-syllable (letters)         3.27         3.27
Number of unique morpho-syllables               6,835        6,714
Number of unique words                         34,588       32,645

Note: "I am a teacher" has 4 words; "Tôi là một giáo viên" has 5 morpho-syllables (4 words: Tôi, là, một, giáo viên).
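The average lengths in the parameter table follow directly from the raw counts. The sketch below recomputes them as a sanity check; the dictionary layout and the function name `derived_stats` are illustrative choices, not part of any CLC tool.

```python
# Raw counts copied from the parameter table (VCor and VTB).
CORPORA = {
    "VCor": {"sentences": 17_095_994, "words": 346_454_533,
             "morpho_syllables": 443_301_776},
    "VTB": {"sentences": 302_491, "words": 7_096_580,
            "morpho_syllables": 9_154_582},
}


def derived_stats(counts):
    """Return (avg words per sentence, avg morpho-syllables per word)."""
    words_per_sentence = counts["words"] / counts["sentences"]
    syllables_per_word = counts["morpho_syllables"] / counts["words"]
    return round(words_per_sentence, 2), round(syllables_per_word, 2)


for name, counts in CORPORA.items():
    print(name, derived_stats(counts))
```

Both corpora reproduce the tabulated averages (20.27 and 1.28 for VCor, 23.46 and 1.29 for VTB).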
No.  Topic            Ratio     Files   Sentences        Words      Mor-Syl   Len.
1    Entertainment    7.19%    67,535   1,374,386   26,350,787   31,868,527  19.17
2    Sports           3.53%    32,945     668,776   12,609,716   15,660,217  18.85
3    Computer         3.70%    27,037     616,068   12,638,479   16,392,697  20.51
4    Education        6.57%    47,740   1,060,987   22,535,214   29,142,722  21.24
5    Health           7.23%    56,154   1,211,813   25,796,610   32,040,892  21.29
6    Economics        8.51%    55,360   1,284,164   27,840,867   37,715,850  21.68
7    Tourism&Food     5.65%    62,030     964,265   19,430,539   25,046,919  20.15
8    Life             7.45%    75,093   1,406,104   26,503,411   33,032,093  18.85
9    Society         13.39%    97,144   2,174,765   45,975,042   59,375,454  21.14
10   Religion         5.33%    39,320     942,721   18,984,779   23,618,434  20.14
11   Culture          9.56%    86,842   1,770,401   33,964,734   42,378,422  19.18
12   Law              5.90%    43,219     977,697   19,309,864   26,170,834  19.75
13   Military         4.58%    30,660     746,093   15,859,404   20,312,096  21.26
14   International    3.70%    27,073     595,851   12,506,045   16,418,458  20.99
15   Transportation   0.44%     2,811      66,352    1,563,769    1,958,420  23.57
16   Sciences         5.47%    44,035     954,725   18,496,247   24,234,407  19.37
17   Criminal         0.23%     2,328      41,419      736,881    1,019,988  17.79
18   Politics         1.56%     7,859     239,407    5,352,145    6,915,346  22.36
     Total             100%   805,185  17,095,994  346,454,533  443,301,776  20.27
<ANNOTATOR id="VTB0017">
<DOC docid="V010973" Language="Vietnamese"
Domain="News">
<PARA id="1">
<SEG id="1">Nguyên_nhân/Nn/O là/Vc/O bão/Nn/O số/Nn/O
10/An/O đang/R/O chịu/Vv/O ảnh_hưởng/Nn/O bởi/Cp/O
hệ_thống/Nn/O trục/Nn/O rãnh/Nn/O cao/Aa/O và/Cp/O sự/Nc/O
lôi_kéo/Vv/O từ/Cm/O siêu__bão/Nn/TRM_B Melor/Nr/TRM_I
ở/Cm/O ngoài/Cm/O khơi/Nn/O Philippines/Nr/LOC_B
./PU/O</SEG>
</PARA>
</DOC>
</ANNOTATOR>
A sample in VTB.
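Each token in the VTB sample has the shape word/POS/NE, with "_" joining the morpho-syllables of one word and NE tags ending in "_B" (begin) or "_I" (inside). A minimal sketch of reading that format follows; the function names are hypothetical, and the segment is a lightly normalized fragment of the sample above.

```python
# Fragment of the VTB sample segment (word/POS/NE triples).
SEG = ("lôi_kéo/Vv/O từ/Cm/O siêu_bão/Nn/TRM_B Melor/Nr/TRM_I "
       "ở/Cm/O ngoài/Cm/O khơi/Nn/O Philippines/Nr/LOC_B ./PU/O")


def parse_segment(seg):
    """Split a segment into (word, pos, ne) triples, restoring spaces in words."""
    tokens = []
    for item in seg.split():
        # Split from the right so a word that itself contains "/" stays intact.
        word, pos, ne = item.rsplit("/", 2)
        tokens.append((word.replace("_", " "), pos, ne))
    return tokens


def entity_spans(tokens):
    """Group consecutive X_B / X_I tags into (label, text) spans."""
    spans, current = [], None
    for word, _pos, ne in tokens:
        if ne.endswith("_B"):
            current = [ne[:-2], [word]]
            spans.append(current)
        elif ne.endswith("_I") and current and current[0] == ne[:-2]:
            current[1].append(word)
        else:
            current = None  # an "O" tag closes any open span
    return [(label, " ".join(words)) for label, words in spans]


print(entity_spans(parse_segment(SEG)))
```

On this fragment the sketch yields one TRM span ("siêu bão Melor") and one LOC span ("Philippines").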
Introduction to Parallel Corpora
Ex: Life is clearly a miracle thing, completely differs from an inanimate world
Ex: Vietnam is the most important partner in the Korea’s ODA policy.
These corpora have been licensed to many organizations (e.g. I2R, Google, Samsung Electronics, Systran, ELRA, etc.).
EVC INVENTORY
Figure 1. Number of EV sentence pairs by domain.
Table 1. Number of EV sentence pairs by domain.
Domain               Ratio        Qty.
1. News             50.22%   1,027,445
2. Conversation     16.81%     343,822
3. General          24.01%     491,223
4. Technical         4.45%      91,072
5. Entertainment     4.51%      92,182
Total:                100%   2,045,744
CVC INVENTORY
Figure 2. Number of CV sentence pairs by domain.
Table 2. Number of CV sentence pairs by domain.
Domain               Ratio      Qty.
1. News             71.28%   143,023
2. Conversation     15.17%    30,432
3. General           5.05%    10,126
4. Technical         4.39%     8,802
5. Entertainment     4.12%     8,265
Total:                100%   200,648
KVC INVENTORY
Table 3. Number of KV sentence pairs by domain.
Domain                  Qty.       %
1. Blog              177,889   35.36
2. Conversation       80,126   15.93
3. Email              12,009    2.39
4. General           123,000   24.45
5. Technical          23,018    4.57
6. News               39,367    7.82
7. Entertainment      47,725    9.49
Total:               503,134     100
Figure 3. Number of KV sentence pairs by domain.
Table 4. Number of words by domain.
Domain                   nWord   avgLen
1. Blog              5,260,098    14.78
2. Conversation      1,600,756     9.99
3. Email               330,360    13.75
4. General           4,094,362    16.64
5. Technical           770,151    16.73
6. News              1,458,406    18.52
7. Entertainment       943,243     9.88
Total:              14,457,376    14.37
Figure 4. Average sentence length by domain.
These statistics give the total number of words and the average sentence length, where "word" means the orthographic word. In Vietnamese, the majority (70%) of words are 2-syllable words, and the average length of a word is about 2.12 syllables (a morpho-syllable is one orthographic word).
Table 5. Number of words and average sentence length by domain in each language.
Domain                 nKWord      nVWord   avgKLen   avgVLen
1. Blog             1,696,930   3,563,168      9.54     20.03
2. Conversation       534,040   1,066,716      6.67     13.31
3. Email              105,193     225,167      8.76     18.75
4. General          1,298,081   2,796,281     10.55     22.73
5. Technical          244,231     525,920     10.61     22.85
6. News               450,302   1,008,104     11.44     25.61
7. Entertainment      308,126     635,117      6.46     13.31
Total:              4,636,903   9,820,473      9.22     19.52
In the same sentence pairs, the total number of orthographic words in Vietnamese (morpho-syllables) is larger than the number of orthographic words in Korean (this ratio: 2.08).
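The Vietnamese-to-Korean word ratio can be recomputed from the Table 5 counts. The sketch below does so per domain and overall; the names `TABLE5` and `vk_ratio` are illustrative. From the totals as printed, the overall ratio comes out at about 2.12, so the 2.08 quoted above may rest on a different count (that reading is an assumption).

```python
# (Korean words, Vietnamese words) per domain, copied from Table 5.
TABLE5 = {
    "Blog": (1_696_930, 3_563_168),
    "Conversation": (534_040, 1_066_716),
    "Email": (105_193, 225_167),
    "General": (1_298_081, 2_796_281),
    "Technical": (244_231, 525_920),
    "News": (450_302, 1_008_104),
    "Entertainment": (308_126, 635_117),
}


def vk_ratio(k_words, v_words):
    """Vietnamese orthographic words per Korean orthographic word."""
    return v_words / k_words


total_k = sum(k for k, _ in TABLE5.values())
total_v = sum(v for _, v in TABLE5.values())
overall = vk_ratio(total_k, total_v)
per_domain = {d: round(vk_ratio(k, v), 2) for d, (k, v) in TABLE5.items()}

print("overall:", round(overall, 2))
for domain, ratio in per_domain.items():
    print(domain, ratio)
```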
Processing Corpora for Contrastive Linguistics
Before exploiting our parallel corpora for Contrastive Linguistics, we need to process them as follows:
• Normalization: text-only, Unicode (UTF-8), XML.
• Alignment: matching the corresponding linguistic units between the source language and the target language. Ex: text alignment, paragraph alignment, sentence alignment, word alignment.
• Linguistic annotation: assigning linguistic tags to each linguistic unit in both the source language and the target language. Ex: word segmentation, lemmatization, POS (Parts-of-Speech: Nc, Vc, Nq, …), NER (Named Entity Recognition), semantic tagging (semantic classes: HUM, ANI, ART, …).
All of the above processing is done by automatic tools from CLC and others, then post-edited manually.
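The normalization step can be illustrated with Python's standard library. This is a minimal sketch that assumes NFC is the target Unicode form (the slides only say "Unicode (UTF-8)"); it matters for Vietnamese because the same letter can be stored precomposed or as a base letter plus combining marks. The function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, apply NFC, and collapse runs of whitespace."""
    text = raw.decode("utf-8")
    text = unicodedata.normalize("NFC", text)  # assumption: NFC is the target form
    return " ".join(text.split())


# 'e' + combining circumflex + combining dot below, plus a double space.
decomposed = "Vie\u0302\u0323t  Nam"
clean = normalize_text(decomposed.encode("utf-8"))
print(clean, len(decomposed), len(clean))
```

After normalization the decomposed sequence becomes the single precomposed letter "ệ", so string comparison and frequency counting treat both spellings alike.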
Paragraph alignment:
Sentence alignment:
* Helicopters can rise straight up into the air and can go straight down.
+ Máy bay trực thăng có thể lên thẳng trên không và đáp thẳng xuống đất.
* They can stand still in the air.
+ Chúng có thể đứng yên trên không.
* Helicopters do not have wings.
+ Máy bay trực thăng không có cánh.
Word alignment:
E:    Jet  planes  fly  about  nine  miles  high .
POS:  NN   NNS     VBP  RB     CD    NNS    JJ
V:    Các máy bay phản lực bay cao khoảng chín dặm .

E     Jet        planes           fly   about    nine   miles   high
V     phản lực   các | máy bay    bay   khoảng   chín   dặm     cao
POS   N          Q - N            V     R        Q      N       A

<DOC Domain='general'><SENT id='1'>
<TXT_E>
Jet planes fly about nine miles high
</TXT_E>
<TXT_V>
Các máy bay phản lực bay cao khoảng chín dặm
</TXT_V>
</SENT>
</DOC>
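A file in the sentence-pair XML layout shown above can be read with Python's standard `xml.etree.ElementTree`. The sketch below assumes the element names `SENT`, `TXT_E`, and `TXT_V` exactly as in the sample; the function name `read_pairs` is illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<DOC Domain='general'><SENT id='1'>
<TXT_E>
Jet planes fly about nine miles high
</TXT_E>
<TXT_V>
Các máy bay phản lực bay cao khoảng chín dặm
</TXT_V>
</SENT>
</DOC>"""


def read_pairs(xml_text):
    """Return a list of (sentence id, English text, Vietnamese text)."""
    root = ET.fromstring(xml_text)
    pairs = []
    for sent in root.iter("SENT"):
        english = sent.findtext("TXT_E", default="").strip()
        vietnamese = sent.findtext("TXT_V", default="").strip()
        pairs.append((sent.get("id"), english, vietnamese))
    return pairs


for sent_id, english, vietnamese in read_pairs(SAMPLE):
    print(sent_id, "|", english, "|", vietnamese)
```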
Ex: in the KVC (Korean-Vietnamese Corpus):
Việt_Nam/Nr/ORG_B là/Vc/O đối_tác/Nn/O quan_trọng/Aa/O nhất/R/O trong/Cm/O chính_sách/Nn/O ODA/FW/ABB_B của/Cm/O Hàn_Quốc/Nr/ORG_B ./PU/O
(Vietnam is the most important partner in the Korea’s ODA policy)
• The underscore "_" connects the morpho-syllables of a word, so we have the words: “Việt Nam”, “là”, “đối tác”, “quan trọng”, “nhất”, “trong”, “chính sách”, “của”, “Hàn Quốc”.
• The CLC POS and NE tagsets are used. E.g. "Nr": proper noun, "Vc": copula, "Nn": noun, "Aa": adjective, "R": adverb, "Cm": main-sub connector, "FW": foreign word, "PU": punctuation, ...; "ORG": organization, "ABB": abbreviation, "O": outside any NE, ...
• Korean morphological analysis: 갑니다 (fml-pres), 가요 (infml-pres), 갔습니다 (fml-pst), 갔어요 (infml-pst), … => "가다" (go).
Applications of Corpus-based Contrastive Linguistics
No.  Word    POS (en)      freq
1    của     Cm (of)       1.820
2    và      Cp (and)      1.822
3    các     Nq (+PLR)     1.956
4    có      Ve (have)     1.959
5    là      Vc (to be)    1.968
6    trong   Cm (in)       1.986
7    một     Nq (one)      2.012
8    đã      R (+PST)      2.031
9    những   Nq (+PLR)     2.043
10   không   R (no, not)   2.050
Table 6. VN word frequency.
The freq value is a logarithmic score of the relative frequency m/N, where m is the number of occurrences of the word and N is the size of the corpus used for measuring. Ex: f = 2 means that the word occurs with a relative frequency of 1/100 (1%); f = 1 means 1/10.
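The two examples given (f = 2 for 1/100, f = 1 for 1/10) are consistent with a score of the form f = log10(N / m). That formula is inferred from the examples, not quoted from the slides, so the sketch below is an assumption-labelled illustration with hypothetical function names.

```python
import math


def freq_score(m: int, n: int) -> float:
    """Assumed score: log10(N / m). A lower score means a more frequent word."""
    if m <= 0 or n <= 0 or m > n:
        raise ValueError("need 0 < m <= N")
    return math.log10(n / m)


def relative_frequency(score: float) -> float:
    """Invert the assumed score back to a relative frequency m / N."""
    return 10 ** (-score)


print(freq_score(1, 100))          # one occurrence per 100 tokens
print(freq_score(1, 10))           # one occurrence per 10 tokens
print(relative_frequency(1.820))   # relative frequency for a score of 1.820
```

Under this reading, the score listed for "của" in Table 6 (1.820, taking the comma as a decimal mark) corresponds to roughly 1.5% of all tokens.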
Rank  Word     Eng      POS   freq
…     …        …        …     …
14    người    man      Nn    2.160
25    nhiều    many     Aa    2.210
27    năm      year     Nt    2.314
30    ngày     day      Nt    2.401
31    làm      do       Vv    2.423
32    phải     must     Vv    2.436
34    ông      you      Nn    2.464
36    theo     follow   Vv    2.530
43    việc     thing    Nn    2.611
53    có thể   able     Vv    2.660
• Only the top 10% of word types account for 90% of the word tokens in Vietnamese text [3].
• Build vocabulary lists (e.g. top-1000, top-2000, or top-3000 words) matched to each learner level.
Figure: cumulative frequency of VN words.
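Cumulative coverage and top-k vocabulary lists can both be derived from a plain frequency count. The sketch below uses a tiny invented token list for illustration only; the function names are hypothetical, and a real run would take the word-segmented corpus as input.

```python
from collections import Counter


def coverage_curve(tokens):
    """Return [(word, cumulative share of tokens)] in descending frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    running, curve = 0, []
    for word, n in counts.most_common():
        running += n
        curve.append((word, running / total))
    return curve


def top_k_list(tokens, k):
    """Return the k most frequent word types, e.g. for a learner word list."""
    return [word for word, _ in Counter(tokens).most_common(k)]


# Toy data (invented): 7 tokens, 4 word types.
tokens = "của và của là và của một".split()
for word, share in coverage_curve(tokens):
    print(f"{word}\t{share:.2f}")
print(top_k_list(tokens, 2))
```

Reading the curve at a chosen rank tells how much running text a learner can cover with a word list of that size.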
Rank     Word   Eng          POS   freq
3,775    của    wealth       Nn    4.6789
368      và     then         M     3.4268
20,793   và     shovel       Vv    6.1384
39,212   các    pay.extra    Vv    6.7405
3,224    có     (particle)   M     4.5731
103      có     exist        R     2.9803
19,385   là     iron         Vv    6.0415
5,290    là     being        Cs    4.9209
143      là     as           Cp    3.0857
1,749    là     (particle)   M     4.1842
186      tốt    good         Aa    3.1813
25,154   tốt    soldier      Nn    6.4394
• Apply the frequency dictionary (for readability) to compile textbooks and dictionaries (definitions, usages) appropriate to each learner’s level (A1, A2, … of the CEFR).
• Ex: the Oxford OALD8 uses the top 3,000 words in all its definitions.
• Avoid using a difficult word in the definition of a simple word. E.g. in the Vietnamese dictionary, the definition of the word “đường” (sugar) is “một hợp chất kết tinh ...” (“hợp chất” = compound, “kết tinh” = crystallize).
Applications
of EV Contrastive Linguistics
To contrast word usage depending on the context. For example, how to translate the word “xảy ra” into English: it may be “occur” (e.g. “an error occurs inside the computer”), “happen” (e.g. “an accident happens”), or “take place” (e.g. “a meeting will take place”).
Similarly: to wear (clothes = mặc; hat = đội; shoes = mang; glasses = đeo, ...); in Vietnamese, gà trống (cock), not gà đực, but dê đực (he-goat), not dê trống; in English, heavy (not big) rain, strong wind, powerful computer, ...
From the EVC, learners can “self-learn” word usage.
In some cases, the word usage in the dictionary is not up to date. E.g. “fondle” bears a positive meaning in most dictionary definitions (“to caress lovingly/touch gently”), whereas its usages in actual corpora bear a negative meaning (“sexual harassment”).
Semantic tagging
Các máy_bay phản_lực bay cao khoảng chín dặm.
Jet planes fly about nine miles high.
Semantic classes (code, gloss, sample members):
• A 110 (insects): ruồi, muỗi, gián, ong, kiến, ...
• M 19 (manner of movement): đi, chạy, bay, bơi, ...
• H 154 (digging and cutting tools): cái bay, xẻng, ...
• M 28: bay, lượn, vỗ cánh, ...
• G 280 (personal pronouns): anh, bạn, bay, mày, ...
• L 47: bay hơi, bay màu, ...
Applications of KV Contrastive Linguistics
• Because of the typological differences between Korean (agglutinating, SOV, markers, head-final, …) and Vietnamese (isolating, SVO, no markers, head-initial, …), the two languages differ greatly in word boundaries, lexicalization, word order, …
• Word alignment is done automatically by the software tool GIZA++ [2] (using the co-occurrence probability of word/phrase pairs in Korean and Vietnamese), so the result is not 100% accurate.
Ex: Vietnam is the most important partner in the Korea’s ODA policy
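To give a feel for alignment by co-occurrence, the sketch below links each source word to the target word with the highest Dice coefficient over sentence pairs. This is a deliberately simple stand-in, not GIZA++'s algorithm (GIZA++ trains statistical translation models), and the three sentence pairs are invented toy data in English and Vietnamese rather than Korean.

```python
from collections import Counter
from itertools import product


def dice_align(pairs):
    """Map each source word to (best target word, Dice score)."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)      # sentences containing each source word
        tgt_count.update(tgt_words)      # sentences containing each target word
        joint.update(product(src_words, tgt_words))  # co-occurring word pairs
    best = {}
    for (s, t), c in joint.items():
        score = 2 * c / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Toy parallel data (invented); "_" joins morpho-syllables of one word.
PAIRS = [
    ("they fly", "chúng bay"),
    ("they stand", "chúng đứng"),
    ("planes fly", "máy_bay bay"),
]
for word, (target, score) in sorted(dice_align(PAIRS).items()):
    print(word, "->", target, round(score, 2))
```

Even on clean toy data the links depend entirely on co-occurrence counts, which is one reason automatic alignments need manual post-editing.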
The aligned corpus lets learners observe how a word is used in its context:
• the word "갔다" (went) and its correspondences in Vietnamese;
• the base form "가다" (go) and its inflections/variations;
• the word “đi” (go) and its corresponding translations in Korean;
• the word order in each language.
* 베트남은 한국의 ODA 정책 중 제일 중요한 파트너이다. 한국의 대통령은 노동 협력이 베트남과 한국 모두에게 큰 이익을 가져다 줄 것이라고 높이 평가하였다. 두 지도자는 동해의 안녕, 평화, 안정 유지 등 아세안 지역 내외 국가들의 공통 이익을 일치하였다. (22/33 content words)
+ Việt Nam là đối tác quan trọng nhất trong chính sách ODA của Hàn Quốc. Tổng thống Hàn Quốc đã đánh giá cao việc hợp tác lao động mang lại lợi ích to lớn cho cả Việt Nam và Hàn Quốc. Hai vị lãnh đạo đã nhất trí về lợi ích chung của các quốc gia trong ngoài khu vực ASEAN như duy trì an ninh, hòa bình, ổn định biển Đông. (26/36 content words)
As in Korean, more than 65% of the Vietnamese vocabulary is derived from Chinese (especially the words used in formal writing and science); these words are called Sino-Vietnamese.
Exploiting the Sino-Korean and Sino-Vietnamese words, which share the same Chinese origin, will help Koreans grasp and enrich their Vietnamese vocabulary easily and effectively.
Sino-Korean  Chinese (traditional)  Chinese (simplified)  Chinese (Pinyin)  Sino-Vietnamese   Vietnamese    English
한국         韓國                   韩国                  hán guó           Hàn quốc          Hàn quốc      Korea
정책         政策                   政策                  zhèng cè          chính sách        chính sách    policy
제일         第一                   第一                  dì yī             đệ nhất           nhất          first
중요         重要                   重要                  zhòng yào         trọng yếu         quan trọng    important
대통령       大統領                 大统领                dà tǒng lǐng      đại thống lĩnh    tổng thống    president
노동         勞動                   劳动                  láo dòng          lao động          lao động      labor
협력         協力                   协力                  xié lì            hiệp lực          hợp tác       cooperate
이익         利益                   利益                  lì yì             lợi ích           lợi ích       gain
평가         評價                   评价                  píng jià          bình giá          đánh giá      estimate
지도자       指導者                 指导者                zhǐ dǎo zhě       chỉ đạo giả       vị lãnh đạo   leader
동해         東海                   东海                  dōng hǎi          Đông hải          Biển Đông     East sea
안녕         安寧                   安宁                  ān níng           an ninh           an ninh       security
평화         平和                   平和                  píng hé           bình hòa          hòa bình      peace
안정         安定                   安定                  ān dìng           an định           ổn định       stability
유지         維持                   维持                  wéi chí           duy trì           duy trì       maintain
국가         國家                   国家                  guó jiā           quốc gia          nước          country
일치         一致                   一致                  yī zhì            nhất trí          nhất trí      correspond
공통         共通                   共通                  gòng tōng         cộng thông        chung         common
지역         地域                   地域                  dì yù             địa vực           khu vực       region
Conclusion
• Based on these corpora, we can automatically find the similarities and differences between Vietnamese and popular foreign languages (English, Korean, Chinese, …) in terms of:
o morphology, syntax, semantics, pragmatics. Ex: equivalent translations, word usages, lexicalizations, POS, word orders, head-initial vs. head-final, preposition vs. postposition, markers (semantic role, discourse), cognates, ...
• The corpora help learners "self-test" word usages, the grammatical rules of Vietnamese, EV/CV/KV translations, …
• They help learners grasp language knowledge that the traditional approach can hardly convey in full.
• They give linguists a useful tool to search for a certain word, phrase, or pattern, and to verify/prove a hypothesis.
Development
• Cooperation between your linguistics department/institute and our computational linguistics center to develop large annotated Chinese-Vietnamese parallel corpora.
• Chinese and Vietnamese are both isolating languages and share many common features: cognates (65% Sino-Vietnamese), classifiers (“cái” vs. “个”, “quyển” vs. “本”, “tấm” vs. “张”), no inflection, tense, word order, particles, function words (e.g. “rồi/đã” vs. “了”, “chưa” vs. “还没有”, “sao rồi?” vs. “怎么了?”, “anh đi bao giờ” (past) / “bao giờ anh đi” (future) vs. “你去什么时候?” / “什么时候你去?”, etc.).
• With these bilingual parallel corpora, linguists and learners (both Vietnamese and Chinese) can contrast the similarities and differences between Chinese and Vietnamese easily, quickly, and accurately.
• If we cooperate and invest in enhancing the quantity and quality of the corpora, their applications will grow considerably.
References
[1] https://www.clc.hcmus.edu.vn/resources/
[2] Dien Dinh, "Building an Annotated English-Vietnamese parallel Corpus", MKS: A Journal of Southeast Asian Linguistics and Languages, Vol. 35, pp. 21-36, 2005.
[3] A. S. Hornby, “Từ điển Song ngữ Anh – Việt” (Oxford Advanced Learner’s Dictionary, 8th ed., with Vietnamese translation) (compiler: Dinh Dien), Youth publisher, HCMC, VN, 2014.
[4] T. Phuoc, D. Dien, “A novel approach for handling unknown word problem in Chinese – Vietnamese machine translation”, International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP), Vol. 19, No. 1, March 2014, pp. 1-10, ISSN: 1027-376X.
[5] Dien Dinh, 김위정, Diep N., “Exploiting the Korean – Vietnamese Parallel Corpus in teaching Vietnamese for Koreans”, Interdisciplinary Study on Language Communication in Multicultural Society, the Int’l Conference of ISEAS/BUFS, May 2017, pp. 11-23.
Thank you for your attention