CORPORA DEVELOPMENT FOR LEARNING TAMIL-MALAY BILINGUAL WORD REPRESENTATIONS FROM SENTENCE-ALIGNED PARALLEL CORPUS
Dr. Kingston pal Thamburaj, Dr. Kartheges Ponniah, Dr. Ilankumaran Sivanathan, Dr. Muniiswaran Kumar

Sultan Idris Education University, Malaysia


Abstract

In this study, we present sentence-aligned parallel corpora for two languages, Tamil and Malay. Both languages are low-resource and underexplored, and both have linguistic features that make machine translation difficult. The corpora are built from web sources whose content is translated into multiple languages. They substantially expand on existing resources, which are either too small or limited to a single domain (such as health). We also provide a separate test corpus, drawn from an unrelated web source, that can be used to rigorously validate accuracy in both languages. In addition, we discuss how to build such corpora using tools made possible by recent advances in machine translation and cross-lingual retrieval based on deep neural networks. Learned models commonly suffer from limited accuracy and low coverage (high out-of-vocabulary rates). To improve both, we use parallel corpora as an additional data resource. Starting from a small bitext and its associated phrase-based statistical machine translation (SMT) model, we increase coverage by applying bilingual lexicon induction techniques to learn new translations from comparable corpora. To increase accuracy, we enrich the model's feature space with translation scores computed over parallel corpora.
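
As an illustration of the coverage measure mentioned in the abstract, the following minimal Python sketch computes an out-of-vocabulary (OOV) rate for a test text against a training vocabulary. The function name `oov_rate`, the lowercase whitespace tokenization, and the toy Malay sentences are assumptions made for illustration only; they are not taken from the corpora or tools described in this study.

```python
def oov_rate(train_sentences, test_sentences):
    """Return the fraction of test tokens that never occur in the training sentences."""
    # Assumption: lowercase whitespace tokenization. A real pipeline would
    # use tokenizers suited to Tamil and Malay.
    vocab = {tok for s in train_sentences for tok in s.lower().split()}
    test_tokens = [tok for s in test_sentences for tok in s.lower().split()]
    if not test_tokens:
        return 0.0  # no test tokens, so nothing can be out of vocabulary
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy, made-up sentences (hypothetical data, not from the study's corpora).
train = ["saya suka makan nasi", "dia suka minum teh"]
test = ["saya suka minum kopi"]
print(oov_rate(train, test))  # one of four test tokens ("kopi") is unseen
```

Under these assumptions, a lower rate indicates better coverage. Learning additional translations from further corpora would be expected to shrink the set of unseen tokens.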

Keywords: parallel corpus, bilingual, Tamil, Malay, linguistics, machine translation, out-of-vocabulary rates, SMT model, learning

Topic: Computer Science Education

ICMScE 2022 Conference