المستخلص: |
This paper describes several approaches to language modeling for Arabic dialects using Modern Standard Arabic (MSA) data. We build a word-based baseline language model and experiment with data transformation techniques that account for differences between MSA and colloquial Arabic. Specifically, we describe three transformation methods: morphological simplification (stemming), lexical transductions, and syntactic transformations. We compare the performance of each method with that of the baseline language model. Although the best-performing model remains the one built using only dialectal data, these techniques yield an improvement over the baseline MSA model.
|