Aligning words in English-Hindi parallel corpora PDF

LEARN HINDI Through English Medium

Lesson 1 The Hindi Alphabet. 1. Lesson 2 Common Hind& Consonants. 2. Lesson 3 Speaking Hindi Characters. 3. Lesson 4 Writing Hind& Consonants.

Using English Acoustic Models for Hindi Automatic Speech

By comparing English phonemes with Hindi alphabets we notice that both languages have nasal consonants and monophthongs and diphthongs vowels. Hence we are

TAMIL Through English / Hindi

Each letter of the alphabet can be named by adding (suffixing) கர krma` karam to short letters கார karma` kaaram to long letters and prefixing இ } i to the

Hindi Alphabet Page: 1 akhlesh.com

Hindi Alphabet. Page: 1 akhlesh.com. Page 2. Hindi Alphabet. Page: 2 akhlesh.com. Page 3. Hindi Alphabet. Page: 3 akhlesh.com. Page 4. Hindi Alphabet.

प्रबोध पाठ्यक्रम - prabodh course - किट : 1 - kit

We have however

NIPUN BHARAT GUIDELINES - Book - Artwork - NEW (11.07.2021

11-Jul-2021 ... (Hindi English

Write the letter – क (Ka)

Hindi Alphabets. Mobile app. Made Learning Easy. Write the letter – क (Ka). Lotus. Page 2. ©www.TuitMob.com Kids Educational. Hindi Alphabets. Mobile app.

REVISED GUIDELINES FOR EVALUATION OF TYPING TEST/DEST

➢ The candidates are expected to type the words/figures and numerical/years in the manner as given in the Question Paper (both in English & Hindi). Mistake.

Aligning words in English-Hindi parallel corpora

For the English-Hindi alphabets it is possible to come up with a table consisting of correspondences between the letters of the two alphabets. This table

LANGUAGE IDENTIFICATION OF KANNADA HINDI AND ENGLISH

21-Sept-2007 The objective of this paper is to propose visual clues based procedure to identify Kannada Hindi and English text portions of the Indian ...

LEARN HINDI Through English Medium

Lesson 1 The Hindi Alphabet. 1. Lesson 2 Common Hind& Consonants. 2. Lesson 3 Speaking Hindi Characters. 3. Lesson 4 Writing Hind& Consonants.

Using English Acoustic Models for Hindi Automatic Speech

By comparing English phonemes with Hindi alphabets we notice that both languages have nasal consonants and monophthongs and diphthongs vowels.

Hindi Alphabet Page: 1 akhlesh.com

Hindi Alphabet. Page: 1 akhlesh.com. Page 2. Hindi Alphabet. Page: 2 akhlesh.com. Page 3. Hindi Alphabet. Page: 3 akhlesh.com. Page 4. Hindi Alphabet.

Hindi Varnamala Learn To Write 36 Hindi Alphabets [PDF] - m

hindi-varnamala-learn-to-write-36-hindi-alphabets. 1/1. Downloaded from m.central.edu on June 16 2022 by guest. Hindi Varnamala Learn To Write 36 Hindi

pimsleur - hindi

official languages only Hindi and English are official government languages of communication. There are no capital letters

Convert JPG to PDF online - convert-jpg-to-pdf.net

Hindi Learning & Handwriting Improvement Course ? Just in 14 Hours ?. A comparison between Hindi & English Alphabet and their vowels & consonant.

CONVERSION OF BRAILLE TO TEXT IN ENGLISH HINDI AND

This paper mainly focuses on conversion of a Braille document into its corresponding alphabets of three main languages namely English Tamil and Hindi using

Aligning words in English-Hindi parallel corpora

Section 3.2 describes the TS approach. 3.2 Transliteration Similarity. For the English-Hindi alphabets it is possible to come up with a table consisting of.

Hindi Indic Input 3 - User Guide

Hindi Indic Input 3 provides a very convenient way of entering text in Hindi Language using the English QWERTY keyboard in any editing application (Office

The 44 Sounds (Phonemes) of English

The 44 phonemes represented below are in line with the International Phonetic Alphabet. Consonants. Sound. Common spelling. Spelling alternatives. /b/ b ball bb.

Hindi Alphabets and Hindi Letters Writing [PDF] Download

Hindi Alphabets: Download Full Chart with English Letters All Hindi Letters with the pictures with meaning and Pronunciation in English for kids free printable

[PDF] Hindi Alphabet Page: 1 akhleshcom

Hindi Alphabet Page: 1 akhlesh com Page 2 Hindi Alphabet Page: 2 akhlesh com Page 3 Hindi Alphabet Page: 3 akhlesh com Page 4 Hindi Alphabet

Learn Hindi Alphabet (Letter) Animation Pronunciation & PDF

Here easy learn to Hindi Alphabets [Vowels(13) Consonants(37) Combinations(475)] with writing animation example words in English meaning Free PDF

[PDF] LEARN HINDI Through English Medium - PDFCOFFEECOM

Lesson 1 The Hindi Alphabet 1 Lesson 2 Common Hind Consonants 2 Lesson 3 Speaking Hindi Characters 3 Lesson 4 Writing Hind Consonants

Hindi Alphabet Chart With English Pronunciation PDF - InstaPDF

1 oct 2022 · Download PDF of Hindi Alphabet Chart With English Pronunciation from the link available below in the article Hindi Hindi Alphabet Chart

Hindi Alphabets - Free Download PDF - KUPDF

Download Hindi Alphabets

Hindi 52 Alphabets: Hindi Aksharmala PDF - Pinterest

Hindi 52 Alphabets: Hindi 52 Akshar Download Hindi 52 letters with pictures [PDF] Separate PDF given for Hindi Consonants (39) + Hindi Vowels (13) =52

Hindi 4 Kids Alphabet PDF - Pinterest

Jul 21 2020 - Hindi 4 kids Alphabet - Free download as PDF File ( pdf ) Text File ( txt) or read online for free Hindi for kids Abc book

Hindi 4 Kids Alphabet PDF - Scribd

Avis 44

Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 115-118,Ann Arbor, June 2005.c?Association for Computational Linguistics, 2005Aligning words in English-Hindi parallel corpora

Niraj Aswani Robert Gaizauskas

Department of Computer Science Department of Computer Science

University of Sheffield University of Sheffield

Regent Court 211, Portobello Street Regent Court 211, Portobello Street

Sheffield S1 4DP, UK Sheffield S1 4DP, UK

N.Aswani@dcs.shef.ac.uk R.Gaizauskas@dcs.shef.ac.uk

Abstract

In this paper, we describe a word

alignment algorithm for English-Hindi parallel data. The system was developed to participate in the shared task on word alignment for languages with scarce resources at the ACL 2005 workshop, on "Building and using parallel texts: data driven machine translation and beyond".

Our word alignment algorithm is based on

a hybrid method which performs local word grouping on Hindi sentences and uses other methods such as dictionary lookup, transliteration similarity, expected

English words and nearest aligned

neighbours. We trained our system on the training data provided to obtain a list of named entities and cognates and to collect rules for local word grouping in Hindi sentences. The system scored 77.03% precision and 60.68% recall on the shared task unseen test data.

1 Introduction

This paper describes a word alignment system developed as a part of shared task on word alignment for languages with scarce resources at the ACL 2005 workshop on "building and using parallel texts: data driven machine translation and beyond". Participants in the shared task were provided with common sets of training data, consisting of English-Inuktitut, Romanian-English, and English-Hindi parallel texts and the participating teams could choose to evaluate their

system on one, two, or all three language pairs. Our system is for aligning English-Hindi parallel data at the word level. The word-alignment algorithm described here is based on a hybrid - multi-feature approach, which groups Hindi words locally within a Hindi sentence and uses dictionary lookup (DL) as the main method of aligning words along with other methods such as Transliteration Similarity (TS), Expected English Words (EEW) and Nearest Aligned Neighbors (NAN). We used the training data supplied to derive rules for local word grouping in Hindi sentences and to find Named Entities (NE) and cognates using our TS approach. In the following sections we briefly describe our approach.

2 Training Data

The training data set was composed of

approximately 3441 English-Hindi parallel sentence pairs drawn from the EMILLE (Enabling

Minority Language Engineering) corpus (Baker et

al., 2004). The data was pre-tokenized. For the English data, a token was a sequence of characters that matches any of the "Dr.", "Mr.", "Hon.", "Mrs.", "Ms.", "etc.", "i.e.", "e.g.", "[a-zA-Z0-

9]+", words ending with apostrophe and all special

characters except the currency symbols £ and $. Similarly for the Hindi, a token consisted of a sequence of characters with spaces on both ends and all special characters except the currency symbols £ and $.

3 Word Alignment

Given a pair of parallel sentences, the task of word alignment can be described as finding one-to-one, one-to-many, and many-to-many correspondences 115

between the words of source and target sentences. It becomes more complicated when aligning phrases of one language with the corresponding words or phrases in the target language. For some words, it is also possible not to find any translation in the target language. Such words are aligned to null.

The algorithm presented in this paper, is a blend of various methods. We categorize words of a Hindi sentence into one of four different categories and use different techniques to deal with each of them. These categories include: 1) NEs and cognates 2) Hindi words for which it is possible to predict their corresponding English words 3) Hindi words that match certain pre-specified regular expression patterns specified in a rule file (explained in section 3.3.) and finally 4) words which do not fit in any of the above categories. In the following sections we explain different methods to deal with words from each of these categories.

3.1 Named Entities and Cognates

According to WWW1, the Named Entity Task is the process of annotating expressions in the text that are "unique identifiers" of entities (e.g. Organization, Person, Location etc.). For example: "Mr. Niraj Aswani", "United Kingdom", and "Microsoft" are examples of NEs. In most text processing systems, this task is achieved by using local pattern-matching techniques e.g. a word that is in upper initial orthography or a Title followed by the two adjacent words that are in upper initial or in all upper case. We use a Hindi gazetteer list that contains a large set of NEs. This gazetteer list is distributed as a part of Hindi Gazetteer processing resource in GATE (Maynard et al., 2003). The Gazetteer list contains various NEs including person names, locations, organizations etc. It also contains other entities such as time units - months, dates, and number expressions. Cognates can be defined as two words having a common etymology and thus are similar or identical. In most cases they are pronounced in a similar way or with a minor change. For example "Bungalow" in English is derived from the word

"" in Hindi, which means a house in the

Bengali style (WWW2). We use our TS method to locate such words. Section 3.2 describes the TS approach.

3.2 Transliteration Similarity

For the English-Hindi alphabets, it is possible to come up with a table consisting of correspondences between the letters of the two alphabets. This table is generated based on the various sounds that each letter can produce. For example a letter "c" can be mapped to two letters in Hindi, " and ". This mapping is not

restricted to one-to-one but also includes many-to- many correspondences. It is also possible to map a sequence of two or more characters to a single character or to a sequence two or more characters.

For example "tio" and "sh" in English correspond

to the character "" in Hindi.

Prior to executing our word alignment algorithm,

we use the TS approach to build a table of NEs and cognates. We consider one pair of parallel sentences at a time and for each word in a Hindi sentence, we generate different English words using our TS table. We found that before comparing words of two languages, it is more accurate to eliminate vowels from the words except those that appear at the start of words. We use a dynamic programming algorithm called "edit-distance" to measure the similarity between these words (WWW3). We calculate the similarity measure for each word in a Hindi sentence by comparing it with each and every word of an English sentence. We come up with an m x n matrix, where m and n refer to the number of words in Hindi and English respectively. This matrix contains a similarity measure for each word in a Hindi sentence corresponding to each word in a parallel English sentence. From our experiments of comparing more than 100 NE and cognate pairs, we found that the word pairs should be considered valid matches only if the similarity is greater than

75%. Therefore, we consider only those pairs

which have the highest similarity among the other pairs with similarity greater than 75%. The following example shows how TS is used to compare a pair of English-Hindi words. For example consider a pair "aswani  " and the TS table entries as shown below: 116 A, S, SS, V, W and N We remove vowels from both words: "aswn  ", and then convert the Hindi word into possible English words. This gives four different combinations: "asvn", "assvn", "aswn" and "asswn". These words are then compared with the actual English word "aswn". Since we are able to locate at least one word with similarity greater than

75%, we consider "aswani  " as a NE.

Once a list of NEs and cognates is ready, we switch to our next step: local word grouping, where all words in Hindi sentences, either those available in the gazetteer list or in the list derived using TS approach, are aligned using TS approach.

3.3 Local Word Grouping

Hindi is a partially free order language (i.e. the order of the words in a Hindi sentence is not fixed but the order of words in a group/phrase is fixed).

Unlike English where the verbs are used in

different inflected forms to indicate different tenses, Hindi uses one or two extra words after the verb to indicate the tense. Therefore, if the English verb is not in its base form, it needs to be aligned with one or more words in a parallel Hindi sentence. Sometimes a phrase is aligned with another phrase. For example "customer benefits" aligns with " ". In this example the first word "customer" aligns with the first word "" and the second word "benefits" aligns with the third word "". Considering "customer satisfaction" and "" as phrases to be aligned with each other, "" is the word that indicates the relation between the two words "" and "", which means the "benefits of customer" in English. These words in a phrase need to be grouped together in order to align them correctly. In the case of certain prepositions, pronouns and auxiliaries, it is possible to predict the respective Hindi postpositions, pronouns and other words. We derived a set of more than 250 rules to group such patterns by consulting the provided training data and other grammar resources such as Bal Anand (2001). The rule file

contains the following information for each rule: 1) Hindi Regular Expression for a word or phrase. This must match one or more words in the Hindi sentence.

2) Group name or a part-of-speech category.

3) Expected English word(s) that this Hindi word group may align to.

4) In case a group of one or more English words aligns with a group of one or more Hindi words, information about the key words in both groups. Key words must match each other in order to align English-Hindi groups.

5) A rule to convert Hindi word into its base form.

We list some of the derived rules below:

1) Group a sequence of [X + Postposition], where X can be any category in the above list except postposition or verb. For example: "For X" = "X ", where "For" = " ".

2) Root Verb + (, or ) + (PH). Present

continuous tense. We use "PH" as an abbreviation to refer to the present/past tense conjunction of the verb "" - , , , , etc.

3) Group two words that are identical to each other. For example: " ", which means

"different" in English. Such bi-grams are common in Hindi and are used to stress the importance of a word/activity in a sentence.

Once the words are grouped in a Hindi sentence,

we identify those word groups which do not fit in any of the TS and EEW categories. Such words are then aligned using the DL approach.

3.3 Dictionary lookup

Since the most dictionaries contain verbs in their base forms, we use a morphological analyzer to convert verbs in their base forms. The English-

Hindi dictionary is obtained from (WWW4). The

dictionary returns, on average, two to four Hindi words referring to a particular English word. The formula for finding the lemma of any Hindi verb is: infinitive = root verb + "". Since in most cases, our dictionary contains Hindi verbs in their infinitive forms, prior to comparing the word with the unaligned words, we remove the word "" from the end of it. Due to minor spelling mistakes it is also possible that the word returned from dictionary does not match with any of the words in 117

a Hindi sentence. In this case, we use edit-distance algorithm to obtain similarity between the two words. If the similarity is greater than 75%, we consider them similar. We use EEW approach for the words which remain unaligned after the DL approach.

3.4 Expected English words

Candidates for the EEW approach are the Hindi word groups (HWG) that are created by our Hindi local word grouping algorithm (explained in section 3.3). The HWGs such as postpositions, number expressions, month-units, day-units etc. are aligned using the EEW approach. For example, for the Hind word "" in a Hindi

sentence, which means "fifty two" in English, the algorithm tries to locate "fifty two" in its parallel English sentence and aligns them if found. For the remaining unaligned Hindi words we use the NAN approach.

3.5 Nearest Aligned Neighbors

In certain cases, words in English-Hindi phrases follow a similar order. The NAN approach works on this principle and aligns one or more words with one of the English words. Considering one

HWG at a time, we find the nearest Hindi word

that is already aligned with one or more English word(s). Aligning a phrase "customer benefits" with "" (example explained in section

3.3) is an example of NAN approach. Similarly

consider a phrase "tougher controls", where for its equivalent Hindi phrase " ", the dictionary returns a correct pair "controls  but fails to locate "tougher  For aligning the word "tougher", NAN searches for the nearest aligned word, which, in this case, is "controls". Since the word "controls" is already aligned with the word "the NAN method aligns the word "tougher" with the nearest unaligned word "".

4 Test Data results

We executed our algorithm on the test data

consisting of 90 English-Hindi sentence pairs. We obtained the following results for non-null alignment pairs.

Word Alignment Evaluation

Evaluation of SURE alignments

Precision = 0.7703

Recall = 0.6068

F-measure = 0.6788

Evaluation of PROBABLE alignments

Precision = 0.7703

Recall = 0.6068

F-measure = 0.6788

AER = 0.3212

References

Bal Anand, 2001, Hindi Grammar Books for

standard 5 to standard 10, Navneet Press, India.

Baker P., Bontcheva K., Cunningham H.,

Gaizauskas R., Hamza O., Hardie A., Jayaram

B.D., Leisher M., McEnery A.M., Maynard D.,

Tablan V., Ursu C., Xiao Z., 2004, Corpus

linguistics and South Asian languages: Corpus creation and tool development, Literary and

Linguistic Computing, 19(4), pp. 509-524.

Maynard D., Tablan V., Bontcheva K.,

Cunningham H., 2003, Rapid customisation of

an Information Extraction system for surprise languages, ACM Transactions on Asian

Language Information Processing, Special issue

on Rapid Development of Language

Capabilities: The Surprise Languages.

WWW1, Named Entity Task Definition,

sk20.book_2.html#HEADING1 [15/04/2005]

WWW2, Britannica Online Encyclopaedia,

8081 [15/04/2005]

WWW3, Dynamic Programming Algorithm (DPA)

for Edit-Distance,

DS/Dynamic/Edit/ [22/03/05]

WWW4, English-Hindi dictionary source,

[22/03/05]. 118quotesdbs_dbs7.pdfusesText_13

[PDF] hindi basic course pdf

[PDF] hindi for beginners pdf

[PDF] hindi glossary

[PDF] hindi grammar tenses pdf

[PDF] hindi language course pdf

[PDF] hindi learning course pdf

[PDF] hindi letters in tamil pdf

[PDF] hindi speaking course pdf free download

[PDF] hindi typing course pdf

[PDF] hindi typing course pdf download

[PDF] hindi varnamala worksheets pdf

[PDF] hindi words with pictures pdf

[PDF] hinds county ms zoning map

[PDF] hindu calendar 2019 pdf

[PDF] hindu code bill book in hindi pdf

[PDF] Aligning words in English-Hindi parallel corpora

Niraj Aswani Robert Gaizauskas

University of Sheffield University of Sheffield

Sheffield S1 4DP, UK Sheffield S1 4DP, UK

Abstract

In this paper, we describe a word

Our word alignment algorithm is based on

English words and nearest aligned

1 Introduction

2 Training Data

The training data set was composed of

Minority Language Engineering) corpus (Baker et

9]+", words ending with apostrophe and all special

3 Word Alignment

3.1 Named Entities and Cognates

3.2 Transliteration Similarity

For example "tio" and "sh" in English correspond

Prior to executing our word alignment algorithm,

75%. Therefore, we consider only those pairs

75%, we consider "aswani  " as a NE.

3.3 Local Word Grouping

Unlike English where the verbs are used in

2) Group name or a part-of-speech category.

3) Expected English word(s) that this Hindi word group may align to.

4) In case a group of one or more English words aligns with a group of one or more Hindi words, information about the key words in both groups. Key words must match each other in order to align English-Hindi groups.

5) A rule to convert Hindi word into its base form.

We list some of the derived rules below:

1) Group a sequence of [X + Postposition], where X can be any category in the above list except postposition or verb. For example: "For X" = "X ", where "For" = " ".

2) Root Verb + (, or ) + (PH). Present

3) Group two words that are identical to each other. For example: " ", which means

Once the words are grouped in a Hindi sentence,

3.3 Dictionary lookup

Hindi dictionary is obtained from (WWW4). The

3.4 Expected English words

3.5 Nearest Aligned Neighbors

HWG at a time, we find the nearest Hindi word

3.3) is an example of NAN approach. Similarly

4 Test Data results

We executed our algorithm on the test data

Word Alignment Evaluation

Evaluation of SURE alignments

Precision = 0.7703

Recall = 0.6068

F-measure = 0.6788

Evaluation of PROBABLE alignments

Precision = 0.7703

Recall = 0.6068

F-measure = 0.6788

AER = 0.3212

References

Bal Anand, 2001, Hindi Grammar Books for

Baker P., Bontcheva K., Cunningham H.,

Gaizauskas R., Hamza O., Hardie A., Jayaram

B.D., Leisher M., McEnery A.M., Maynard D.,

Tablan V., Ursu C., Xiao Z., 2004, Corpus

Linguistic Computing, 19(4), pp. 509-524.

Maynard D., Tablan V., Bontcheva K.,

Cunningham H., 2003, Rapid customisation of

Language Information Processing, Special issue

Capabilities: The Surprise Languages.

WWW1, Named Entity Task Definition,

WWW2, Britannica Online Encyclopaedia,

8081 [15/04/2005]

WWW3, Dynamic Programming Algorithm (DPA)

DS/Dynamic/Edit/ [22/03/05]

WWW4, English-Hindi dictionary source,