Synset Based Multilingual Dictionary: Insights, Applications and Challenges
DSpace at IIT Bombay
Field | Value |
Title |
Synset Based Multilingual Dictionary: Insights, Applications and Challenges
|
|
Creator |
MOHANTY, RK
BHATTACHARYYA, P
KALELE, S
PANDEY, P
SHARMA, A
KOPRA, M |
|
Subject |
multilingual dictionary
dictionary standardization
concept based dictionary
light weight wsd and lexical choice
multilingual dictionary database |
|
Description |
In this paper, we report our effort at the standardization, design and partial implementation of a multilingual dictionary in the context of three large scale projects, viz., (i) Cross Lingual Information Retrieval, (ii) English to Indian Language Machine Translation, and (iii) Indian Language to Indian Language Machine Translation. These projects are large scale because each involves 8-10 partners spread across the length and breadth of India, with a great amount of language diversity. The dictionary is based not on words but on WordNet synsets, i.e., concepts. An identical dictionary architecture is used for all three projects, where source to target language transfer is initiated by concept to concept mapping. The whole dictionary can be looked upon as an M × N matrix, where M is the number of synsets (rows) and N is the number of languages (columns). This architecture maps the lexeme(s) of one language that stand for a concept to the lexeme(s) of other languages that stand for the same concept. In actual usage, a preliminary word sense disambiguation (WSD) step identifies the correct row for a word, and a lexical choice procedure then identifies the correct target word from the corresponding synset. Currently the multilingual dictionary is being developed for 11 languages: English, Hindi, Bengali, Marathi, Punjabi, Urdu, Tamil, Kannada, Telugu, Malayalam and Oriya. Our work with this framework makes us aware of many benefits of this multilingual concept based scheme over language pair-wise dictionaries. The pivot synsets, with which all other languages link, come from Hindi. Interesting insights emerge and challenges are faced in dealing with linguistic and cultural diversities. Economy of representation is achieved on many fronts and at many levels.
We have been eminently assisted by our long standing experience in building the WordNets of two major languages of India, viz., Hindi and Marathi, which rank 5th (approximately 500 million speakers) and 14th (approximately 70 million speakers) respectively in the world in terms of the number of people speaking these languages.
|
|
Publisher |
UNIV SZEGED, DEPT INFORMATICS
|
|
Date |
2011-09-01T08:17:40Z
2011-12-26T12:59:33Z
2011-12-27T05:51:54Z
2007 |
|
Type |
Article
|
|
Identifier |
GWC 2008: FOURTH GLOBAL WORDNET CONFERENCE, PROCEEDINGS, (), 321-333
http://dspace.library.iitb.ac.in/xmlui/handle/10054/12718
http://hdl.handle.net/10054/12718 |
|
Language |
en
|
|
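The lookup described in the Description field (a WSD step selects a synset row, then lexical choice selects a target word from that row) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the synset IDs, glosses, transliterated word lists, and function names are hypothetical, and the gloss-overlap scoring is a simple stand-in for the paper's WSD procedure, not a description of it.

```python
from typing import Dict, List, Optional

# Toy M x N matrix: each row is one synset (concept), each column holds the
# lexemes expressing that concept in one language.
# All IDs, glosses, and word lists are made-up illustrative data.
MATRIX: Dict[str, Dict[str, List[str]]] = {
    "S1": {"english": ["bank", "depository"], "hindi": ["baink"]},
    "S2": {"english": ["bank", "shore"], "hindi": ["kinaaraa", "tat"]},
}

# Hypothetical glosses, used only by the stand-in disambiguation step.
GLOSSES: Dict[str, str] = {
    "S1": "institution that accepts money deposits and gives loans",
    "S2": "sloping land beside a river or lake",
}


def candidate_rows(word: str, lang: str) -> List[str]:
    """Return every synset row whose column for `lang` lists `word`."""
    return [sid for sid, row in MATRIX.items() if word in row.get(lang, [])]


def pick_row(word: str, lang: str, context: str) -> Optional[str]:
    """Stand-in WSD: choose the candidate whose gloss shares most words with the context."""
    candidates = candidate_rows(word, lang)
    if not candidates:
        return None
    context_words = set(context.lower().split())
    return max(
        candidates,
        key=lambda sid: len(context_words & set(GLOSSES[sid].split())),
    )


def lexical_choice(synset_id: str, target_lang: str) -> Optional[str]:
    """Stand-in lexical choice: take the first lexeme listed for the target language."""
    lexemes = MATRIX[synset_id].get(target_lang, [])
    return lexemes[0] if lexemes else None


def translate(word: str, source_lang: str, target_lang: str, context: str) -> Optional[str]:
    """Concept-to-concept transfer: word -> synset row -> target-language lexeme."""
    row = pick_row(word, source_lang, context)
    if row is None:
        return None
    return lexical_choice(row, target_lang)


if __name__ == "__main__":
    print(translate("bank", "english", "hindi", "she sat on the bank of the river"))
    print(translate("bank", "english", "hindi", "he deposits money in the bank"))
```

Because every language links to the same row, adding a language in this sketch means adding one column per synset rather than one dictionary per language pair, which is the economy the abstract points to.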