A Hybrid Approach to English-Korean Name Transliteration PDF

12-May-2016 The Korean Alphabet ... a When ㅇ appears in initial position it represents no sound and is not transcribed

KOREAN LANGUAGE

1. This book is divided into fourteen units. The first three deal with the Korean alphabet (vowels consonants

Korean–English Dictionary ّ5 ئZ ¦‡>

In case of a disagreement between the translation and the original English version of this License the original Korean Alphabet. Consonants and their names ...

English-Korean Named Entity Transliteration Using Statistical

Named entity translation plays an important role in machine translation cross-language informa- from English phonemes to Korean letters. • Mixed: union of ...

2019 Korean Government Scholarship program [KGSP] for

20-Mar-2019 or English translation authenticated by the issuing institution or notarized by a notary public. O All application documents must be ...

Language List by Country and Place

Adapted from Improving the use of translation and interpreting services: A guide to Victorian Korean South: Korean Kuwait: Arabic

Korean-to-Japanese Neural Machine Translation System using

04-Dec-2020 By contrast Korean uses the. Korean alphabet called Hangul to write sentences ... and English translations. there are domains in which there are ...

Zero-shot North Korean to English Neural Machine Translation by

10-Jul-2020 We use hgtk (Hangul toolkit)4 for the decomposition into phonemes. Character (phoneme BPE) model. We perform the character level tokenization ...

F. No. 25016/52/2019-LC Government of India Ministry of Home

South Korea. Korean translation of judicial and supporting documents is required. 30. Spain. No specific requirement. Request has to be made in English. 31.

requested to go through the application guidelines carefully before

09-Mar-2022 Must I submlt the certificates of language proficiency (English or Korean) when t apply for. GKS? A. The certificate of English proficiency ( ...

KOREAN LANGUAGE

Unit 1 ?? 1 Korean alphabet 1 Consonants 2

Learn to Read Korean: An Introduction to the Hangul Alphabet

12 May 2016 The Korean Alphabet. G Correct Sounds for ... G Following its invention Hangul was not widely used ... words (names or English borrowings).

Korean–English Dictionary ?5 ?Z ¦‡>

Korean–English Dictionary Opaque formats include PostScript PDF

NCU IISR English-Korean and English-Chinese Named Entity

26 Jul 2015 Named entity translation is a key problem in many. NLP research fields such as machine ... like English letters and words each Hangul block.

Hangeul

Hangeul the Korean alphabet. Hangeul consonants and vowels. The composition of Korean syllables. Korean syllables are made in 4 different manners.

A Hybrid Approach to English-Korean Name Transliteration

7 Aug 2009 Often named entities such as person names or place names from foreign origin do not appear in the dictionary

SOAS-AKS Working Papers in Korean Studies 1

This paper tries to argue that the Korean alphabet is a sole invention of King Sejong 3 English translation is adapted from Lee and Ramsey (2000: 31).

BASIC KOREAN: A GRAMMAR AND WORKBOOK

All Korean entries are presented in Hangul (the Korean alphabet) with. English translations to facilitate understanding. Accordingly it requires that learners

Zero-shot North Korean to English Neural Machine Translation by

10 Jul 2020 We use hgtk (Hangul toolkit)4 for the decomposition into phonemes. Character (phoneme BPE) model. We perform the character level tokenization ...

Cross-Language IR at University of Tsukuba: Automatic

and English letters via the Roman representation. To produce a new dictionary we use the Unicode system to romanize Korean words.

Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 108-111,Suntec, Singapore, 7 August 2009.c?2009 ACL and AFNLPA Hybrid Approach to English-Korean Name Transliteration

Gumwon Hong

?, Min-Jeong Kim?, Do-Gil Lee+and Hae-Chang Rim? ?Department of Computer Science & Engineering, Korea University, Seoul 136-713, Korea {gwhong,mjkim,rim}@nlp.korea.ac.kr +Institute of Korean Culture, Korea University, Seoul 136-701, Korea motdg@korea.ac.kr

Abstract

This paper presents a hybrid approach to

English-Korean name transliteration. The

base system is built on MOSES with en- abled factored translation features. We expand the base system by combining with various transliteration methods in- cluding a Web-basedn-best re-ranking, a dictionary-based method, and a rule-based method. Our standard run and best non- standard run achieve 45.1 and 78.5, re- spectively, in top-1 accuracy. Experimen- tal results show that expanding training data size significantly contributes to the performance. Also we discover that the

Web-based re-ranking method can be suc-

cessfully applied to the English-Korean transliteration.

1 Introduction

Often, named entities such as person names or

place names from foreign origin do not appear in the dictionary, and such out of vocabulary words are a common source of errors in processing nat- ural languages. For example, in statistical ma- chine translation (SMT), if a new word occurs in the input source sentence, the decoder will at best drop the unknown word or directly copy the sourcewordto the targetsentence. Transliteration, a method of mapping phonemes or graphemes of source language into those of target language, can be used in this case in order to identify a possible translation of the word.

The approaches to automatic transliteration be-

tween English and Korean can be performed through the following ways: First, in learning how to write the names of foreign origin, we can re- fer to a transliteration standard which is estab- lished by the government or some official linguis- tic organizations. No matter where the standardcomes from, the basic principle of the standard is based on the correct pronunciation of foreign words. Second, since constructing such rules are very costly in terms of time and money, we can rely on a statistical method such as SMT. We be- lieve that the rule-based method can guarantee to increase accuracy for known cases, and the statis- tical method can be robust to handle various ex- ceptions.

In this paper, we present a variety of tech-

niques for English-Korean name transliteration.

First, we use a phrase-base SMT model with some

factored translation features for the transliteration task. Second, we expand the base system by ap- plying Web-basedn-best re-ranking of the results.

Third, we apply a pronouncing dictionary-based

method to the base system which utilizes the pro- nunciation symbols which is motivated by linguis- tic knowledge. Finally, we introduce a phonics- based method which is originally designed for teaching speakers of English to read and write that language.

2 Proposed Approach

In order to build our base system, we use MOSES

(Koehn et al., 2007), a well-known phrase-based system designed for SMT. MOSES offers a con- venient framework which can be directly applied to machine transliteration experiments. In this framework, the transliteration can be performed in a very similar process of SMT task except the following changes. First, the unit of translation is changed fromwordstocharacters. Second, a phrasein transliteration refers to any contiguous block of character sequence which can be directly matched from a source word to a target word. Also, we do not have to worry about any distortion parameters because decoding can be performed in a totally monotonic way.

The process of the general transliteration ap-

proach begins by matching the unit of a source108

Letter

AlignmentBilingual Corpus

Factored Phrase-

based Training

Trained Model

Eumjeol

Decomposition

MOSES

DecoderInput Word

Eumjeol

Re-composition

Target Word

Eumjeol

Decomposition

N-best

Re-ranking

Web

Dictionary

Phonics

Figure 1: System Architecture

word to the unit of a target word. The unit can be based on graphemes or phonemes, depending on language pairs or approaches. In English-Korean transliteration, both grapheme-to-grapheme and grapheme-to-phonemeapproachesarepossible. In our method, we select grapheme-to-grapheme ap- proach as a base system, and we apply grapheme- to-phoneme functions in pronouncing dictionary- based approach.

The transliteration between Korean and other

languages requires some special preprocessing techniques. First of all, Korean alphabet is or- ganized into syllabic blocks calledEumjeol. Ko- rean transliteration standard allows eachEumjeol to consist of either two or three of the 24 Korean letters, with (1) leading 14 consonants, (2) inter- mediate 10 vowels, and (3) optionally, trailing 7 consonants (out of the possible 14). Therefore, before performing training or decoding any input. Consequently, after the letter-unit transliteration is finished, all the letters should be re-composed to form a correct sequence ofEumjeols.

Figure 1 shows the overall architecture of our

system. The alignment between English letter and

Korean letter is performed using GIZA++ (Och

and Ney, 2003). We use MOSES decoder in or- der to search the best sequence of transliteration.

In this paper we focus on describing factored

phrase-based training andn-best re-ranking tech- niques including a Web-based method, a pro- nouncing dictionary-based method, and a phonics- based method.

Figure 2: Alignment example between 'Knight"

and 'sàÔ[naiteu]"

2.1 Factored Phrase-based Training

of different information for phrase-based SMT model. We report on experiments with three fac- tors: surface form, positional information, and the type of a letter. Surface form indicates a letter itself. For positional information, we add a BIO label to each input character in both the source words and the target words. The intuition is that certain character is differently pronounced de- pending on its position in a word. For example, 'k" in 'Knight" or 'h" in 'Sarah" are not pronounced. The type of a letter is used to classify whether a given letter is a vowel or a consonant. We assume that a consonant in source word would more likely be linked to a consonant in a target word. Figure 2 shows an example of alignment with factored fea- tures.

2.2 Web-based Re-ranking

We re-ranked the topnresults of the decoder by

referring to how many times both source word and target word co-occur on the Web. In news articles on the Web, a translation of a foreign name is of- ten provided near the foreign name to describe its pronunciation or description. To reflect this obser- vation, we use Google"s proximity search by re- stricting two terms should occur within four-word distance. The frequency is adjusted as relative fre- quency form by dividing each frequency by total frequency of alln-best results.

Also, we linearly interpolate then-best score

with the relative frequency of candidate output. To makefairinterpolation, weadjustbothscorestobe between 0 and 1. Also, in this method, we decide to remove all the candidates whose frequencies are zero.

2.3 Pronouncing Dictionary-based Method

According to "Oeraeeo pyogibeop1" (Korean or-

thography and writing method of borrowed for- 1 http://www.korean.go.kr/08 new/data/rule03.jsp 109

MethodsAcc.1Mean F1Mean FdecMRR MAPrefMAP10MAPsys

0.451 0.720 0.852 0.576 0.451 0.181 0.181

0.740 0.868 0.930 0.806 0.740 0.243 0.243

0.784 0.889 0.944 0.840 0.784 0.252 0.484

0.781 0.885 0.941 0.839 0.781 0.252 0.460

0.785 0.887 0.943 0.840 0.785 0.252 0.441

Table 1: Experimental Results (EnKo)

eign words), the primary principle of English-to- Korean transliteration is to spell according to the mapping table between the international phonetic alphabets and the Korean alphabets. Therefore, we can say that a pronouncing dictionary-based method is very suitable for this principle.

We use the following two resources for build-

ing a pronouncing dictionary: one is an English-

Korean dictionary that contains 130,000 words.

The other is the CMU pronouncing dictionary

2 created by Carnegie Mellon University that con- tains over 125,000 words and their transcriptions.

Phonetic symbols for English words in the

dictionaries are transformed to their pronuncia- tion information by using an internal code table.

The internal code table represents mappings from

each phonetic symbol to a single character within

ASCII code table. Our pronouncing dictionary in-

cludes a list of words and their pronunciation in- formation.

For a given English word, if the word exists

in the pronouncing dictionary, then its pronunci- ations are translated to Korean graphemes by a mapping table and transformation rules, which are defined by "Oeraeeo pyogibeop".

2.4 Phonics-based Method

Phonics is a pronunciation-based linguistic teach- ing method, especially for children (Strickland,

1998). Originally, it was designed to connect the

sounds of spoken English with group of English letters. In this research, we modify the phonics in order to connect English sounds to Korean let- ter because in Korean there is nearly a one-to-one correspondence between sounds and the letter pat- terns that represent them. For example, alpha- bet 'b" can be pronounced to '"(bieup) in Ko- rean. Consequently, we construct about 150 rules which map English alphabet into one or more sev- eral Korean graphemes, by referring to the phon- ics. Though phonics cannot reveal all of the pro- 2 http://www.speech.cs.cmu.edu/cgi-bin/cmudictnunciation of English words, the conversion from

English alphabet into Korean letter is performed

simply and efficiently. We apply the phonics in serial order from left to right of each input word. If multiple rules are applicable, the most specific rules are fist applied.

3 Experiments

3.1 Experimental Setup

We participate in both standard and non-standard

tracks for English-Korean name transliteration in

NEWS 2009 Machine Transliteration Shared Task

(Li et al., 2009). Experimenting on the develop- ment data, we determine the best performing pa- rameters for MOSES as follows. •Maximum Phrase Length: 3 •Language Model N-gram Order: 3 •Language Model Smoothing: Kneser-Ney •Phrase Alignment Heuristic: grow-diag-final •Reordering: Monotone •Maximum Distortion Length: 0

With above parameter setup, the results are pro-

duced from the following five different systems. •Baseline System (BS): For the standard task, we use only given official training data

3to con-

struct translation model and language model for our base system. •Expanded Resource (ER): For all four non- standard tasks, we use the examples of writing for- eign names as additional training data. The ex- amples are provided from the National Institute of the Korean Language

4. The data originally con-

sists of around 27,000 person names and around

7,000 place names including non-Ascii characters

for English side words as well as duplicate entries. We preprocess the data in order to use 13,194 dis- 3 Refer to Website http://www.cjk.org for more informa- tion

4The resource is open to public. See

http://www.korean.go.kr/eng for more information.110 tinct pairs of English names and Korean transliter- ation. •Web-based Re-ranking (WR): We re-rank the result ofERby applying the method described in section 2.2. •Pronouncing Dictionary-based Method (PD):

The re-ranking ofWRby combining with the

method described in section 2.3. •Phonics-based Method (PB): The re-ranking ofWRby combining with the method described in section 2.4.

The last two methods re-rank theWRmethod

and Phonics-based method. We restrict that the pronouncing dictionary-based method and

Phonics-based method can produce only one out-

put, and use the outputs of the two methods to re- rank (again) the result of Web-based re-ranking.

Whenre-rankingtheresults, weheuristicallycom-

bined the outputs ofPDorPBwith then-best re- sult ofWR. If the outputs of the two methods exist in the result ofWR, we add some positive scores to the original scores ofWR. Otherwise, we inserted the result into fixed position of the rank. The fixed position of rank is empirically decided using de- velopment set. We inserted the output ofPDand

PBat second rank and at sixth rank, respectively.

3.2 Experimental Results

Table 1 shows our experimental results of the five systems on the test data. We found that the use of additional training data (ER) and web-based re- ranking (WR) have a strong effect on translitera- tion performance. However, the integration of the

PDorPBwithWBproves not to significantly con-

tribute the performance. To find more elaborate integration of those results will be one of our fu- ture work.

TheMAPsysvalue of the three re-ranking

methodsWR,PD, andPBare relatively higher than other methods because we filter out some candidates inn-best by their Web frequencies. In addition to the standard evaluation measures, we include the Mean F decto measure the Levenshtein distance between reference and the output of the decoder (decomposed result).

4 Conclusions

In this paper, we proposed a hybrid approach to

English-Korean name transliteration. The system

is built on MOSES with factored translation fea-tures. When evaluating the proposed methods, we found that the use of additional training data can significantly outperforms the baseline system. Also, the experimental result of using threen-best re-ranking techniques shows that the Web-based re-ranking is proved to be a useful method. How- ever, our two integration methods with dictionary- based or rule-based method does not show the sig- nificant gain over the Web-based re-ranking.

For future work, we plan to devise more elab-

orate way to integrate statistical method and dic- tionary or rule-based method to further improve the transliteration performance. Also, we will ap- ply the proposed techniques to possible applica- tions such as SMT or Cross Lingual Information

Retrieval.

References

Philipp Koehn and Hieu Hoang. 2007. Factored trans- lation models. InProceedings of the 2007 Joint

Conference on Empirical Methods in Natural Lan-

guage Processing and Computational Natural Lan- guage Learning (EMNLP-CoNLL), pages 868-876,

Prague, CzechRepublic, June.AssociationforCom-

putational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris

Callison-Burch, Marcello Federico, Nicola Bertoldi,

Brooke Cowan, Wade Shen, Christine Moran,

Richard Zens, Chris Dyer, Ond

rej Bojar, Alexandra

Constantin, and Evan Herbst. 2007. Moses: Open

Source Toolkit for Statistical Machine Translation.

InACL 2007, Demo and Poster Sessions, June.

Haizhou Li, A Kumaran, Min Zhang, and Vladimir

Pervouchine. 2009. Whitepaper of news 2009

machine transliteration shared task. InProceed- ings of ACL-IJCNLP 2009 Named Entities Work- shop (NEWS 2009), Singapore.

Franz Josef Och and Hermann Ney. 2003. A sys-

tematic comparison of various statistical alignment models.Computational Linguistics, 29.

D.S. Strickland. 1998.Teaching phonics today: A

primer for educators. International Reading Asso- ciation.111quotesdbs_dbs18.pdfusesText_24

[PDF] korean children's books pdf

[PDF] korean from zero pdf

[PDF] korean grammar pdf

[PDF] korean movies 2015 lee min ho english teacher

[PDF] korean vocabulary pdf

[PDF] kotler principles of marketing pdf

[PDF] kounouz bac economie

[PDF] kounouz bac gestion

[PDF] kounouz bac physique

[PDF] kounta kinte films complets en francais 2014

[PDF] kpmg algerie 2016

[PDF] kpmg algerie 2016 pdf

[PDF] kpmg algerie 2017

[PDF] kpmg algerie pdf

[PDF] kpsc exam 2015 questions and answers

[PDF] A Hybrid Approach to English-Korean Name Transliteration

Gumwon Hong

Abstract

This paper presents a hybrid approach to

English-Korean name transliteration. The

Web-based re-ranking method can be suc-

1 Introduction

Often, named entities such as person names or

The approaches to automatic transliteration be-

In this paper, we present a variety of tech-

First, we use a phrase-base SMT model with some

Third, we apply a pronouncing dictionary-based

2 Proposed Approach

In order to build our base system, we use MOSES

The process of the general transliteration ap-

Letter

AlignmentBilingual Corpus

Factored Phrase-

Trained Model

Eumjeol

Decomposition

DecoderInput Word

Eumjeol

Re-composition

Target Word

Eumjeol

Decomposition

N-best

Re-ranking

Dictionary

Phonics

Figure 1: System Architecture

The transliteration between Korean and other

Figure 1 shows the overall architecture of our

Korean letter is performed using GIZA++ (Och

In this paper we focus on describing factored

Figure 2: Alignment example between 'Knight"

2.1 Factored Phrase-based Training

2.2 Web-based Re-ranking

We re-ranked the topnresults of the decoder by

Also, we linearly interpolate then-best score

2.3 Pronouncing Dictionary-based Method

According to "Oeraeeo pyogibeop1" (Korean or-

MethodsAcc.1Mean F1Mean FdecMRR MAPrefMAP10MAPsys

0.451 0.720 0.852 0.576 0.451 0.181 0.181

0.740 0.868 0.930 0.806 0.740 0.243 0.243

0.784 0.889 0.944 0.840 0.784 0.252 0.484

0.781 0.885 0.941 0.839 0.781 0.252 0.460

0.785 0.887 0.943 0.840 0.785 0.252 0.441

Table 1: Experimental Results (EnKo)

We use the following two resources for build-

Korean dictionary that contains 130,000 words.

The other is the CMU pronouncing dictionary

Phonetic symbols for English words in the

The internal code table represents mappings from

ASCII code table. Our pronouncing dictionary in-

For a given English word, if the word exists

2.4 Phonics-based Method

1998). Originally, it was designed to connect the

English alphabet into Korean letter is performed

3 Experiments

3.1 Experimental Setup

We participate in both standard and non-standard

NEWS 2009 Machine Transliteration Shared Task

With above parameter setup, the results are pro-

3to con-

4. The data originally con-

7,000 place names including non-Ascii characters

4The resource is open to public. See

The re-ranking ofWRby combining with the

The last two methods re-rank theWRmethod

Phonics-based method can produce only one out-

Whenre-rankingtheresults, weheuristicallycom-

PBat second rank and at sixth rank, respectively.

3.2 Experimental Results

PDorPBwithWBproves not to significantly con-

TheMAPsysvalue of the three re-ranking

4 Conclusions

In this paper, we proposed a hybrid approach to

English-Korean name transliteration. The system