
GRAMMAR RULE BASED

CROSS LANGUAGE INFORMATION RETRIEVAL

FOR TELUGU

A THESIS

Submitted by

DINESH MAVALURU

Under the guidance of

Dr. R. SHRIRAM

in partial fulfillment for the award of the degree of

DOCTOR OF PHILOSOPHY

in

COMPUTER SCIENCE

B.S.ABDUR RAHMAN UNIVERSITY

(B.S.ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY) (Estd. u/s 3 of the UGC Act, 1956) www.bsauniv.ac.in

APRIL 2014

CERTIFICATE

This is to certify that all corrections and suggestions pointed out by the

Rule Based Cross Language Information Retrieval f

Mr. Dinesh Mavaluru.

(Dr.R. Shriram)

SUPERVISOR

Place: Chennai

Date: 04 July 2014


BONAFIDE CERTIFICATE

Certified that this thesis GRAMMAR RULE BASED CROSS LANGUAGE INFORMATION RETRIEVAL FOR TELUGU is the bonafide work of DINESH MAVALURU (RRN: 1194207) who carried out the thesis work under my supervision. Certified further, that to the best of my knowledge the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

Dr. R. SHRIRAM

RESEARCH SUPERVISOR

Professor

Department of CSE

B.S. Abdur Rahman University

Vandalur, Chennai - 600 048

Dr. P. SHEIK ABDUL KHADER

HEAD OF THE DEPARTMENT

Professor & Head

Department of CA

B.S. Abdur Rahman University

Vandalur, Chennai - 600 048

ACKNOWLEDGEMENT

At the outset, I thank the Almighty, whose unbounded blessings and love have helped me in pursuing this research work.

I have always admired my adviser, Prof. R. Shriram, whose ideals had a big influence on me and changed the way I perceive the world. I am one of the fortunate students to have my name on his list of students. Without his support, I could not have imagined starting a research career. His generosity gave me the freedom to enjoy all the privileges. I remain indebted to him and his family members all my life; a mere thank you is not sufficient.

I am greatly obliged to the members of my doctoral committee, Dr. A. Kannan, Professor, Department of Information Science and Technology, Anna University, Chennai, Dr. T. R. Rangaswamy, Professor, Department of Electronics and Instrumentation Engineering, B S Abdur Rahman University, Chennai, and Dr. P. Sheik Abdul Khader, Professor and Head, Department of Computer Applications, B S Abdur Rahman University, Chennai, for their guidance, valuable suggestions, continuous encouragement and critical reviews during the tenure of this research work.

I would like to express my most sincere gratitude to the members of my review committee, Dr. V. Sankaranarayanan and Dr. K. M. Mehata, who have influenced me greatly and from whom I had the chance to learn throughout my research work through their valuable suggestions and guidance despite their tight schedules.

I owe my sincere thanks to Prof. V. Saravanan, Computer Sciences and Information Technology College, Majmaah University, Majmaah, Kingdom of Saudi Arabia, who made me realize the best in me and also taught me how to do research.

I am immensely grateful to the faculty members of the Department of Computer Applications and to the management and administration of B S Abdur Rahman University, Chennai, for providing all the facilities to complete my research work successfully.

I would like to thank all my dear colleagues, in particular Shakthi Priyan, A. Venkat Narayanan, P. Kumaran, T. Nadana Ravi Shankar, V. K. Mohan Raj, B. Manikandan, S. Sumitra, P. Thiripurasundari and D. M. Ahamed Kabeer Bhadhusha, for their constant support during my research work.

My wholehearted thanks go to my family, Mrs Gnanamani and my beloved G. Sonia, who motivated me to be strong and bold, helped me to bring out my best from the beginning to the end of this research work and to move on with my future goals, and helped me to realize the importance of many things in my life.

Finally, I would like to acknowledge my friends D. Shyam Kiran, Amaresh and many others who have been with me during my bad and good times. Without you all I am nowhere.

ABSTRACT

The rapid spread of the World Wide Web and improvements in information retrieval (IR) techniques have allowed people to access huge amounts of information. However, the majority of web content is in English. While content in languages like Telugu and Tamil is growing every day, a huge gap remains. This gap is what this research work addresses. General information retrieval systems retrieve relevant information for a user query only if that information is available in the query language. For example, a Telugu search engine will retrieve results only for content in Telugu; it does not consider relevant information that is available in other languages for the given user query. Cross Language Information Retrieval (CLIR) systems seek to overcome this gap. A CLIR system retrieves information in a language different from the query language. The goal of this research work is to develop a new framework for Telugu - English Cross Language Information Retrieval using Language Grammar Rules. The major challenges addressed are query ambiguity and the linguistic differences between the query and content language.

The steps in this research are as follows:

a) The user query is tokenized into keywords using a tokenizer. The language grammar rules are applied to the tokenized query terms to identify the subject, verb, object and inflection in the tokenized keywords.

b) The query processor searches the ontology for the English equivalents of the terms identified using the language grammar rules. Terms that are not available in the ontology are considered Out-Of-Vocabulary terms and are transliterated literally into English.

c) The parser finds the subject, verb and object in English to assemble the query in English. Query processing is then complete and the query has been converted into English. The converted query is given to the search engine to obtain relevant results.

d) The retrieved results are given to the post processor, which converts the results into Telugu. For this, the ontology that maps Telugu words to English words is used. Thus, all the previous stages are repeated until the results are converted into the target language representation.

The grammar rule based approach is a semantic way of approaching the IR problem: it first finds the meaning of the query, maps the user query to the target language, finds relevant information in the target language, maps this back to the source language, and displays it to the user. This research work also evaluates the user acceptance of CLIR for Telugu using various metrics.
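Steps (a) to (c) above can be pictured with a minimal sketch. It is only an illustration under stated assumptions, not the framework built in this work: the ontology entries, the single "first token is the subject, last token is the verb" rule, and all function names are hypothetical, and the query is written in romanized Telugu for readability, so the transliteration step is a placeholder.

```python
import string

# Toy bilingual ontology (hypothetical entries, romanized Telugu -> English).
ONTOLOGY = {
    "nenu": "I",
    "pustakam": "book",
    "chadivanu": "read",
}


def tokenize(query):
    """Step (a): split the query into keywords, dropping surrounding punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in query.split())
    return [tok for tok in tokens if tok]


def tag_roles(tokens):
    """Step (a), toy grammar rule: assume plain Subject-Object-Verb order.

    The first token is taken as the subject, the last token as the verb,
    and everything in between as the object.
    """
    if len(tokens) < 2:
        return {"subject": tokens, "object": [], "verb": []}
    return {"subject": tokens[:1], "object": tokens[1:-1], "verb": tokens[-1:]}


def transliterate(term):
    """Placeholder for OOV transliteration; the input is already romanized."""
    return term


def to_english(term):
    """Step (b): ontology lookup, falling back to transliteration for OOV terms."""
    return ONTOLOGY.get(term.lower(), transliterate(term))


def build_english_query(query):
    """Steps (a)-(c): tokenize, tag roles, translate, reassemble as Subject-Verb-Object."""
    roles = tag_roles(tokenize(query))
    ordered = roles["subject"] + roles["verb"] + roles["object"]
    return " ".join(to_english(term) for term in ordered)


print(build_english_query("nenu pustakam chadivanu"))  # all terms are in the ontology
print(build_english_query("nenu Godavari chadivanu"))  # "Godavari" is treated as OOV
```

In this sketch an Out-Of-Vocabulary term simply passes through unchanged; a system working on Telugu script would need a real script-to-Latin transliteration at that point.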

TABLE OF CONTENTS

CHAPTER NO.  TITLE

ABSTRACT
LIST OF TABLES
LIST OF FIGURES

1. INTRODUCTION
    1.1 General Introduction
    1.2 Objectives
    1.3 Contribution of The Work
    1.4 Thesis Outline

2. LITERATURE REVIEW
    2.1 Introduction
    2.2 Information Retrieval
        2.2.1 Retrieval Models
        2.1.2 Improving Information Retrieval
    2.3 Cross Language Information Retrieval
        2.3.1 Non-Translation Approaches
        2.3.2 Translation-Based Approaches
        2.3.3 Challenges in CLIR
        2.3.4 Current Approaches
    2.4 Information Retrieval In The Telugu Language
        2.4.1 Difficulties of Information Retrieval in Telugu
        2.4.2 Monolingual IR in Telugu
        2.4.3 CLIR and Telugu
    2.5 Conclusion

3. PROPOSED FRAMEWORK FOR TELUGU CROSS LANGUAGE INFORMATION RETRIEVAL
    3.1 Introduction
    3.2 Methodology of Proposed Framework
    3.3 Proposed Framework System
        3.3.1 Pre-Processing
        3.3.2 Post-Processing
    3.3 Conclusion

4. PREPROCESSING
    4.1 Introduction
    4.2 Methodology of Proposed Pre-Processing
        4.2.1 Tokenizer
        4.2.2 Language Grammar Rules
        4.2.3 Bilingual Ontology
        4.2.4 OOV Component
    4.3 Conclusion

5. POSTPROCESSING
    5.1 Introduction
    5.2 Methodology of Proposed Post-Processing
        5.2.1 Tokenizer
        5.2.2 Language Grammar Rules
        5.2.3 Re-ranking System
        5.2.4 Smoothening Approach
    5.3 Conclusion

6. FRAMEWORK IMPLEMENTATION AND RESULTS
    6.1 Introduction
    6.2 Approaches For Evaluating Information Retrieval
    6.3 Test Collection
    6.4 Evaluation of Results
        6.4.1 Mean Average Precision
    6.5 Experimental Framework And Toolkit
    6.6 Experimental Settings For Pre-Processing
    6.7 Experimental Settings For Post-Processing
    6.8 Testing and Results
    6.9 Conclusion

7. EVALUATING USER ACCEPTANCE OF CLIR USING LANGUAGE GRAMMAR RULES
    7.1 Introduction
    7.2 Technology Acceptance Model (TAM)
    7.3 Research Model And Hypotheses
        7.3.1 CLIR System ease of use
        7.3.2 CLIR System usefulness
        7.3.3 Attitude towards using a CLIR System
        7.3.4 Behavioral intentions for using a CLIR System
    7.4 Research Methodology
    7.5 Data Analysis and Results
    7.6 Conclusion

8. CONCLUSION

REFERENCES

LIST OF TABLES

TABLE NO.  TITLE

4.1 Sample Telugu Sentence Order
4.2 Post positions for Telugu sentence order
4.3 Finite Verb Rules
4.4 Non-Finite Verb Rules
6.1 Relative Retrieval Efficiency
6.2 Time taken for Query processing in the Existing and proposed systems
6.3 Precision Percentages For Retrieved Results In Existing And Proposed Systems
6.4 Precision for Results
6.5 Weighted Precision
7.1 Profile of the system users
7.2 Instrument Reliability And Validity
7.3 Model fit summary for the final measurement and structural model
7.4 The contribution of the study to existing knowledge

LIST OF FIGURES

FIGURE NO.  TITLE

1.1 Techniques used in CLIR
2.1 Workflow of Information Retrieval
3.1 Overall process of CLIR for Telugu
3.2 Components for the Proposed System
3.3 Framework for the Proposed System
3.4 Retrieved Results before display
3.5 Retrieved results after display
4.1 Overall process of query pre-processing
4.2 Tokenization component
4.3 Tokenizer process
4.4 Simple Telugu sentence tokenization
4.5 Tokenizer example
4.6 Tokenizer example for special expressions
4.7 Language Grammar rules component
4.8 Grammar rules component process
4.9 Ontology Component
4.10 Process flow of bilingual ontology component
4.11 Ontology Relationship Hierarchies
4.12 Sample ontology structure
4.13 Out of Vocabulary Component
4.14 Flow Chart for the Pre-Processing stage
5.1 Overall process of post-processing
5.2 Tokenizer process
5.3 Process Flow of system
5.4 Term frequency for the query terms relationship
5.5 Sample term frequency
5.6 Results retrieved related to the query
5.7 Final Results to the user for given query
5.8 Flow Chart for the Post-Processing stage
6.1 Step by Step Process of the System
existing system
proposed system
existing system
proposed system
7.1 Technology Acceptance Model (TAM)
7.2 Modified TAM for information retrieval


1. INTRODUCTION

1.1 GENERAL INTRODUCTION

With the growth of multilingual information available on the web and the growing number of non-native English speakers browsing the Internet, it has become increasingly valuable to have information retrieval systems that can retrieve relevant information irrespective of language restrictions. With information retrieval systems such as Web search engines, users express their information need as a query and then issue the query to the system. The system then provides users with a ranked set of results containing relevant information. Current Web search engines like Google, Bing and Yahoo are sophisticated enough to produce relevant information for most queries. But these information retrieval systems do not consider relevant content in other languages. A cross language information retrieval system can instead retrieve information relevant to the user query from other languages and present it in the language of the query, so that users can access it in a natural way and receive a precise answer.

While English is the most widely used language on the web, the use of Telugu as a query language has grown rapidly in recent years. Most Telugu web users have a limited English vocabulary, and thus it can be difficult for them to formulate effective English queries. They would like to retrieve relevant English information on the web using queries expressed in Telugu, especially where the information available in English is more adequate and detailed than that in Telugu.

A Cross Language Information Retrieval (CLIR) system retrieves information in a language that is different from the user query language [1]. An example in [2] showing the usefulness of such a system is a user who has some knowledge of the content language but has difficulty formulating effective queries in it. Such users may well be able to distinguish relevant information from irrelevant information based on their limited knowledge. Such information needs have given rise to greater interest in CLIR systems. CLIR between any two languages poses significant problems due to the great differences in the structural and written forms of the languages. Figure 1.1 shows the current techniques used in Cross Language Information Retrieval.

Figure 1.1 Techniques used in CLIR

In a linguistically diverse country like India, cross language information retrieval systems play an important role in localization. India has eighteen constitutional languages, which are written in ten different scripts. There is large scope for developing frameworks between English and the various Indian languages.

Cross Language Information Retrieval (CLIR) systems face many challenges, of which the biggest is the inherent ambiguity of natural language. In addition, the linguistic diversity between the source and target languages makes developing a CLIR framework an even bigger challenge. English is a highly positional language with simple morphology and a default sentence structure of Subject Verb Object (SVO). Indian languages are highly inflectional, with a rich morphology, relatively free word order, and a default sentence structure of Subject Object Verb (SOV) [3].

The goal of this research work is to develop a new framework to improve the performance of Telugu CLIR using language grammar rules. This research work uses a semantic language modeling approach in the pre- and post-processing stages. Additionally, it evaluates the Telugu CLIR framework with the technology acceptance model (TAM), using a series of experiments to identify user acceptance and to explore the sensitivity to parameter settings. Overall, this research work shows that grammar rules can be used to improve CLIR effectiveness.
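One consequence of the inflectional contrast described above is that a relation English expresses with a preposition before the noun is often carried in Telugu by a postposition attached after it. The sketch below illustrates that single idea with naive suffix matching on romanized tokens. The suffix table and the function names are hypothetical and are not the grammar rules identified in this work; plain suffix stripping like this would also mis-split any word that merely happens to end in one of the listed suffixes.

```python
# Hypothetical suffix rules: romanized Telugu postposition -> English preposition.
# Longer suffixes are listed first so they are tried before shorter ones.
POSTPOSITIONS = [
    ("nundi", "from"),
    ("lo", "in"),
    ("ki", "to"),
]


def split_postposition(token):
    """Return (stem, english_preposition); the preposition is None if no rule matches."""
    for suffix, preposition in POSTPOSITIONS:
        # Require a non-empty stem so that a bare suffix is not split.
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)], preposition
    return token, None


def to_english_phrase(token):
    """Place the preposition before the stem, following English word order."""
    stem, preposition = split_postposition(token)
    return f"{preposition} {stem}" if preposition else stem


for word in ["chennailo", "chennainundi", "pustakam"]:
    print(word, "->", to_english_phrase(word))
```

The stem returned here would still need an ontology lookup or transliteration before it could be used in an English query.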

1.2 OBJECTIVES

The broad objective of this research work is to develop a grammar rule based cross language information retrieval system for Telugu to facilitate information access from other languages. It is proposed that this information retrieval system use language grammar rules as a key component. The proposed language grammar rules are the distinguishing feature of this Telugu Cross Language Information Retrieval system: they are used to convert queries and to display the retrieved results in the user query language.

The specific objectives of this research work are to:

• Apply language processing techniques to arrive at the specific grammar rules for use in information retrieval.

• Design a cross language information retrieval model for Telugu, for query and retrieved results conversion with grammar rules as a key component, using
    o language grammar rules and
    o a customized bilingual ontology.

• Use the Technology Acceptance Model to provide experimental results and demonstrate the feasibility and effectiveness of the grammar rule based cross language information retrieval for Telugu.

1.3 CONTRIBUTION OF THE WORK

There are over eighty grammar rules that are relevant for Telugu. From these, the eighteen main rules that are relevant for cross lingual information retrieval have been identified. These rules are relevant for the conversion of the query into the language of the content (English) and of information snippets from the web into the language of the user (Telugu). Effort has been made to arrive at a set of rules that are relevant to information retrieval and at the same time generic enough that, if Telugu is replaced by some other language, the overall framework remains. Similarly, the overall model has been designed in such a way that, while the method works for the Telugu-English combination, it is generic enough to accommodate other language pairs as well. However, other language pairs have only been considered theoretically and not tested experimentally.

There are two types of testing that can be done for information retrieval research: the metrics approach and the acceptance approach. In this work, the metrics approach uses mean average precision and weighted precision, whereas the acceptance approach uses the technology acceptance model. The technology acceptance model is unique in that it tests the perceived ease of use and the intended behavior of users. Together, these give an overall perspective that covers both statistical and semantic measures. This work applies the technology acceptance model to information retrieval relevance research in Indian languages. In addition, the changes to the model needed for IR are highlighted.
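For reference, mean average precision can be computed as in the sketch below. It follows the standard textbook definition (precision is taken at the rank of each relevant result, averaged per query, then averaged over queries) and is not taken from the evaluation code used in this work. The function names and document identifiers are hypothetical, and each ranked list is assumed to contain no duplicates. Weighted precision is not sketched because its definition is not given in this chapter.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant result appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant results that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # relevant results at ranks 1 and 3
    (["d5", "d6"], {"d6"}),                    # relevant result at rank 2
]
print(round(mean_average_precision(runs), 4))
```

For the two toy queries, the per-query values are (1 + 2/3) / 2 and 1/2, so the mean is 2/3.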

1.4 THESIS OUTLINE

The rest of this work is organized as follows.

Chapter 2 explains the previous work done in Information retrieval and Cross Language Information Retrieval relevant to this research work. It discusses the state of the art of information retrieval and Telugu information retrieval systems.

Chapter 3 presents the overall proposed framework for Telugu CLIR.