Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 264-273, Copenhagen, Denmark, September 7-11, 2017. © 2017 Association for Computational Linguistics.

Learning Chinese Word Representations From Glyphs Of Characters

Tzu-Ray Su and Hung-Yi Lee

Dept. of Electrical Engineering, National Taiwan University

No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan

{b01901007,hungyilee}@ntu.edu.tw

Abstract

In this paper, we propose new methods to learn Chinese word representations. Chinese characters are composed of graphical components, which carry rich semantics. It is common for a Chinese learner to comprehend the meaning of a word from these graphical components. As a result, we propose models that enhance word representations with character glyphs. The character glyph features are learned directly from the bitmaps of characters by a convolutional auto-encoder (convAE), and the glyph features improve Chinese word representations that are already enhanced by character embeddings. Another contribution of this paper is that we created several evaluation datasets in traditional Chinese and made them public.

1 Introduction

No matter the target language, high-quality word representations (also known as word "embeddings") are key to many natural language processing tasks, for example, sentence classification (Kim, 2014), question answering (Zhou et al., 2015), and machine translation (Sutskever et al., 2014). Besides, word-level representations are also the basis of representations at higher levels, for example, (et al., 2014) and sentence-level (Kiros et al., 2015) representations.

In this paper, we focus on learning Chinese word representations. A Chinese word is composed of characters which contain rich semantics. The meaning of a Chinese word is often related to the meanings of its compositional characters. Therefore, Chinese word embeddings can be enhanced by compositional character embeddings (Chen et al., 2015; Xu et al., 2016). Furthermore, a Chinese character is composed of several graphical components. Characters with the same component share similar semantics or pronunciation. When a Chinese user encounters a previously unseen character, it is instinctive to guess the meaning (and pronunciation) from its graphical components, so understanding the graphical components and associating them with semantics helps people learn Chinese. Radicals¹ are the graphical components used to index Chinese characters in a dictionary. By identifying the radical of a character, one obtains a rough meaning of that character, so radicals have been used in learning Chinese word embeddings (Yin et al., 2016) and character embeddings (Sun et al., 2014; Li et al., 2015). However, components other than radicals may also contain information useful for word representation learning.

Our research begins with a question: Can machines learn Chinese word representations from glyphs of characters? By exploiting the glyphs of characters as images in word representation learning, all the graphical components in a character are considered, not only radicals. In our proposed methods, we render character glyphs as fixed-size grayscale images, referred to as "character bitmaps", as illustrated in Fig. 1. A similar idea was used in (Liu et al., 2017) to help classify Wikipedia article titles into 12 categories. We use a convAE to extract character features from the bitmaps to represent the glyphs. It would also be possible to represent the glyph of a character by the graphical components in it. We do not choose this way because there is no unique way to decompose a character, and learning representations directly from bitmaps is more straightforward. We then use models parallel to Skipgram (Mikolov et al., 2013a) or GloVe (Pennington et al., 2014) to learn word representations from the character glyph features. Although we only consider traditional Chinese characters in this paper, and the examples given below are based on traditional characters, the same ideas and methods can be applied to simplified characters.

¹https://en.wikipedia.org/wiki/Radical_(Chinese_characters)

Figure 1: A Chinese character is represented as a fixed-size (60 × 60 pixel) gray-scale image, which is referred to as a "character bitmap" in this paper.
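The rendering step behind Figure 1 can be sketched with Pillow. This is a minimal illustration, not the authors' exact pipeline; `font_path` is a hypothetical path to a CJK-capable TrueType font (e.g. a Noto Sans TC file), and without it the code falls back to Pillow's default font, which cannot render Chinese glyphs:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_bitmap(char, size=60, font_path=None):
    """Render one character onto a size x size gray-scale image in [0, 1].

    font_path is assumed to point at a CJK-capable TrueType font;
    the default Pillow font only covers basic Latin characters,
    so this fallback is for illustration only.
    """
    img = Image.new("L", (size, size), color=255)              # white canvas
    font = (ImageFont.truetype(font_path, size) if font_path
            else ImageFont.load_default())
    ImageDraw.Draw(img).text((0, 0), char, fill=0, font=font)  # black glyph
    return np.asarray(img, dtype=np.float32) / 255.0
```

A 60 × 60 array produced this way is the kind of input the convAE would consume.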

2 Background Knowledge and Related Works

To give a clear illustration of our own work, we briefly introduce representative methods of word representation learning in Section 2.1. In Section 2.2, we introduce some linguistic properties of Chinese, and then the methods that utilize these properties to improve word representations.

2.1 Word Representation Learning

Mainstream research on word representation is built upon the distributional hypothesis: words with similar contexts share similar meanings. Usually a large-scale corpus is used, and word representations are produced from the co-occurrence information of a word and its context. Existing methods of producing word representations could be separated into two families (Levy et al., 2015): the count-based family (Turney and Pantel, 2010; Bullinaria and Levy, 2007) and the prediction-based family, in which word representations are obtained by training neural-network-based models (Bengio et al., 2003; Collobert et al., 2011). The representative methods are briefly introduced below.

2.1.1 CBOW and Skipgram

Both the continuous bag-of-words (CBOW) model and the Skipgram model train with words and contexts in a sliding local context window (Mikolov et al., 2013a). Both of them assign each word $w_i$ an embedding $\vec{w}_i$. CBOW predicts the word given its context embeddings, while Skipgram predicts contexts given the word embedding. Predicting the occurrence of a word/context in CBOW and Skipgram could be viewed as learning a multi-class classification neural network (the number of classes is the size of the vocabulary). In (Mikolov et al., 2013b), the authors introduced several techniques to improve performance: negative sampling is introduced to speed up learning, and subsampling of frequent words randomly discards training examples containing frequent words (such as "the", "a", "of"), which has an effect similar to the removal of stop words.
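The Skipgram objective with negative sampling can be sketched in a few dozen lines of NumPy. This is a toy illustration under our own naming, not the original word2vec implementation (which additionally uses a frequency-based negative-sampling distribution, subsampling, and many efficiency tricks):

```python
import numpy as np

def train_sgns(corpus, dim=16, window=2, k=3, lr=0.05, epochs=30, seed=0):
    """Toy Skipgram with negative sampling over a list of token lists."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(0.0, 0.1, (V, dim))   # word embeddings
    W_out = rng.normal(0.0, 0.1, (V, dim))  # context embeddings

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for sent in corpus:
            for pos, word in enumerate(sent):
                wi = idx[word]
                lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
                for cpos in range(lo, hi):
                    if cpos == pos:
                        continue
                    # one positive context pair plus k uniform negative samples
                    pairs = [(idx[sent[cpos]], 1.0)]
                    pairs += [(int(rng.integers(V)), 0.0) for _ in range(k)]
                    for ti, label in pairs:
                        grad = sigmoid(W_in[wi] @ W_out[ti]) - label
                        d_in = grad * W_out[ti]           # cache before update
                        W_out[ti] -= lr * grad * W_in[wi]
                        W_in[wi] -= lr * d_in
    return {w: W_in[idx[w]] for w in vocab}
```

Each positive (word, context) pair is pushed toward a sigmoid score of 1, and each randomly sampled negative pair toward 0, which avoids computing the full vocabulary-sized softmax.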

2.1.2 GloVe

Instead of using local context windows, (Pennington et al., 2014) proposed the GloVe model. Training GloVe word representations begins with creating a co-occurrence matrix $X$ from a corpus, where each entry $X_{ij}$ is the count of word $w_j$ appearing in the context of word $w_i$. In (Pennington et al., 2014), the authors used a harmonic weighting for co-occurrence counts, that is, a word-context pair with distance $d$ contributes $1/d$ to the global co-occurrence count.

Let $\vec{w}_i$ be the word representation of word $w_i$, and $\tilde{\vec{w}}_j$ be the representation of word $w_j$ as context. The GloVe model minimizes the loss

$$\sum_{i,j \,\in\, \text{non-zero entries of } X} f(X_{ij}) \left( \vec{w}_i^{\,T} \tilde{\vec{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where $b_i$ is the bias for word $w_i$, and $\tilde{b}_j$ is the bias for context $w_j$. The weighting function $f(X_{ij})$ is introduced because the authors consider rare co-occurrence word-context pairs to carry less information than frequent ones, so their contributions to the total loss should be decreased. It depends on the co-occurrence count, with parameters set to $x_{max} = 100$ and $\alpha = 0.75$:

$$f(X_{ij}) = \begin{cases} (X_{ij}/x_{max})^{\alpha} & \text{if } X_{ij} < x_{max} \\ 1 & \text{otherwise.} \end{cases}$$

In the GloVe model, each word has two representations, $\vec{w}$ and $\tilde{\vec{w}}$. The authors suggest using $\vec{w} + \tilde{\vec{w}}$ as the word representation, and reported improvements over using $\vec{w}$ only.
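The weighting function and the loss translate directly into NumPy. This is a minimal sketch with array names of our own choosing, not a full GloVe training loop:

```python
import numpy as np

X_MAX, ALPHA = 100.0, 0.75  # parameter values used in (Pennington et al., 2014)

def glove_weight(x):
    """f(X_ij): down-weight rare co-occurrence pairs, capped at 1."""
    return np.where(x < X_MAX, (x / X_MAX) ** ALPHA, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """GloVe loss summed over the non-zero entries of co-occurrence matrix X.

    W, W_tilde : (V, d) word and context embedding matrices
    b, b_tilde : (V,) word and context biases
    X          : (V, V) co-occurrence counts
    """
    ii, jj = np.nonzero(X)
    dots = np.einsum("nd,nd->n", W[ii], W_tilde[jj])    # w_i . w~_j per pair
    diff = dots + b[ii] + b_tilde[jj] - np.log(X[ii, jj])
    return float(np.sum(glove_weight(X[ii, jj]) * diff ** 2))
```

Minimizing this loss with any gradient-based optimizer, and then taking `W + W_tilde` as the final word vectors, matches the authors' suggestion above.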

2.2 Improving Chinese Word Representation Learning

2.2.1 The Chinese Language

Figure 2: Examples of compositional Chinese words (xylophone = wooden + zither; battleship = war + ship; rooster = male + chicken). Still, the reader should keep in mind that NOT all Chinese words are compositional (i.e., related to the meanings of their compositional characters).

A Chinese word is composed of a sequence of characters. The meanings of some Chinese words are related to the composition of the meanings of their characters. For example, "戰艦" (battleship) is composed of two characters, "戰" (war) and "艦" (ship). More examples are given in Fig. 2.

To improve Chinese word representations with sub-word information, the character-enhanced word embedding (CWE) (Chen et al., 2015) described in Section 2.2.2 was proposed.

Figure 3: Examples of radicals and the characters containing them. Row (A) lists three radicals, row (B) their semantics (arthropods and reptiles; plants and wooden materials; water and liquid), and rows (C-1) to (C-5) characters containing them (butterfly, bee, snake, mosquito, crab; cotton, maple, plum, stick, fruit; sea, river, tear, soup, spring). In rows (C-1) to (C-4), the radicals are at the left-hand side of the character, while in row (C-5), the radicals are at the bottom and may have different shapes.

A Chinese character is composed of several graphical components. Characters with the same component share similar semantic or phonetic properties. In a Chinese dictionary, characters with similar coarse semantics are grouped into categories for ease of searching. The common graphical component related to the shared semantics is chosen to index the category, and is known as a radical. Examples are given in Fig. 3. There are three radicals in row (A), and their semantic meanings are in row (B). In each column, there are five characters containing the radical. It is easy to see that characters having the same radical have meanings related to the radical in some aspect. A radical can be placed in different positions in a character: in rows (C-1) to (C-4), the radicals are at the left-hand side of the character, but in row (C-5), the radicals are at the bottom. The shape of a radical can also differ across positions. For example, the third radical, which represents "water" or "liquid", has different forms when it is at the left-hand side or the bottom of a character. Because radicals serve as strong semantic indicators of characters, the multi-granularity embedding (MGE) (Yin et al., 2016) in Section 2.2.3 incorporates radical embeddings in learning word representations.

Figure 4: Both characters in the figure have the same radical "亻" (human) at the left-hand side, but their meanings (attack/strike/cut down; believe/promise/letter) are the composition of their radical and the graphical components at the right-hand side (weapon; speech).

Usually the components other than radicals determine the pronunciation of the characters, but in some cases they also influence the meaning of a character. Two examples are given in Fig. 4². Both characters in Fig. 4 have the same radical "亻" (human) at the left-hand side, but the graphical components at the right-hand side also have semantic meanings related to the characters. Consider the left character "伐" (attack): its right component "戈" means "weapon", and the meaning of the character "伐" is the composition of the meanings of its two components (a human with a weapon). To the best of our knowledge, none of the previous word embedding approaches considers all the components of Chinese characters.

²The two example characters here have the same glyphs in traditional and simplified Chinese.

2.2.2 Character-enhanced Word Embedding (CWE)

The main idea of CWE is that a word embedding is enhanced by its compositional character embeddings. CWE predicts the word from both the word and character embeddings of its contexts, as illustrated in Fig. 5 (a). For word $w_i$, the CWE word embedding $\vec{w}^{cwe}_i$ has the following form:

$$\vec{w}^{cwe}_i = \vec{w}_i + \frac{1}{|C(i)|} \sum_{c_j \in C(i)} \vec{c}_j,$$

where $\vec{w}_i$ is the word embedding, $\vec{c}_j$ is the embedding of the $j$-th character in $w_i$, and $C(i)$ is the set of compositional characters of word $w_i$. The mean of the CWE word embeddings of the contexts is then used to predict the word $w_i$.
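The composition is a one-liner in NumPy; a sketch with illustrative names of our own (not code from the CWE authors):

```python
import numpy as np

def cwe_embedding(word_vec, char_vecs):
    """CWE composition: word embedding plus the mean of its character embeddings.

    word_vec  : (d,) embedding of word w_i
    char_vecs : (|C(i)|, d) embeddings of the characters composing w_i
    """
    return word_vec + np.mean(char_vecs, axis=0)
```

During CWE training, this composed vector replaces the plain word vector wherever a context embedding is needed.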

Sometimes one character has several different meanings; this is known as the ambiguity problem. To deal with this, each character is assigned a bag of embeddings. During training, one of the embeddings is picked to form the modified word embedding. The authors proposed three methods to decide which embedding is picked: position-