Mathematical modeling of the frequencies of words of different lengths in written Hindi language corpora and examination of the role of texts’ stylistic factor in model’s parameters
Issue: Vol 4 No. 1 (2017)
Journal: Journal of Research Design and Statistics in Linguistics and Communication Science
Subject Areas: Linguistics
DOI: 10.1558/jrds.33107
Abstract:
In quantitative research on language and linguistics, linguistic features are first specified and counted, and statistical models are then constructed to explain the observed facts. In the present paper, we attempt to represent the pattern of occurrence of words of different lengths in various corpora of the Hindi language as a mathematical model. We also examine whether the model's parameters for a particular text depend on the type of text, using texts selected from two categories, media/essay and creative writing; in other words, we test the applicability of the model's parameters to text classification.
Author: Hemlata Pande, Hoshiyar S. Dhami
References:
Abbe, S. (2000). Word length distribution in Arabic letters. Journal of Quantitative Linguistics, 7 (2), 121–127. https://doi.org/10.1076/0929-6174(200008)07:02;1-Z;FT121
Alekseev, P. M. (1998). Graphemic and syllabic length of words in text and vocabulary. Journal of Quantitative Linguistics, 5 (1–2), 5–12. https://doi.org/10.1080/09296179808590107
Antić, G., Kelih, E., and Grzybek, P. (2006). Zero syllable words in determining word length. In P. Grzybek (Ed.) Contributions to the Science of Text and Language: Word Length Studies and Related Issues, 117–156. Springer, Netherlands. https://doi.org/10.1007/1-4020-4068-7_4
Antić, G., Stadlober, E., Grzybek, P., and Kelih, E. (2006). Word Length and Frequency Distributions in Different Text Genres. From Data and Information Analysis to Knowledge Engineering, 310–317. Springer, Berlin Heidelberg. https://doi.org/10.1007/3-540-31314-1_37
Aoyama, H. and Constable, J. (1999). Word length frequency and distribution in English: Part I. Prose. Literary and Linguistic Computing 14 (3), 339–358. https://doi.org/10.1093/llc/14.3.339
Bharati, A., Rao K, P., Sangal, R., and Bendre, S. M. (2002). Basic statistical analysis of corpus and cross comparison among corpora. In Proceedings of 2002 International Conference on Natural Language Processing, Mumbai, India. Available at: http://ltrc.iiit.ac.in/MachineTrans/publications/technicalReports/tr022/camera-187.pdf
Barbaro, S. (2000). Word length distribution in Italian letters by Pier Paolo Pasolini, Journal of Quantitative Linguistics 7 (2), 115–120. https://doi.org/10.1076/0929-6174(200008)07:02;1-Z;FT115
Best, K.-H. (1996). Word length in Old Icelandic songs and prose texts, Journal of Quantitative Linguistics 3 (2), 97–105. https://doi.org/10.1080/09296179608599619
Dittrich, H. (1996). Word length frequency in the letters of G. E. Lessing, Journal of Quantitative Linguistics 3 (3), 260–264. https://doi.org/10.1080/09296179608599633
Frischen, J. (1996). Word length analysis of Jane Austen’s letters, Journal of Quantitative Linguistics 3 (1), 80–84. https://doi.org/10.1080/09296179608590066
Gómez, P. C. (2013). Statistical Methods in Language and Linguistic Research. Sheffield: Equinox Publishing Ltd.
Gries, S. T. (2009). Statistics for Linguistics with R. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110216042
Grzybek, P. (Ed.) (2006). Contributions to the Science of Text and Language: Word Length Studies and Related Issues, Rotterdam: Springer. https://doi.org/10.1007/1-4020-4068-7
Grzybek, P., Stadlober, E., Kelih, E., and Antić, G. (2005). Quantitative text typology: The impact of word length. In: C. Weihs and W. Gaul (Eds), Classification – The Ubiquitous Challenge, 53–64. Heidelberg, Springer. https://doi.org/10.1007/3-540-28084-7_5
Hatzigeorgiu, N., Mikros, G., and Carayannis, G. (2001). Word length, word frequencies and Zipf’s Law in the Greek language. Journal of Quantitative Linguistics 8 (3), 175–185. https://doi.org/10.1076/jqul.8.3.175.4096
Jayaram, B. D. and Vidya, M. N. (2006). Word length distribution in Indian languages, Glottometrics 12, 16–38.
Kelih, E., Antić, G., Grzybek, P., and Stadlober, E. (2005). Classification of author and/or genre? The impact of word length. In C. Weihs and W. Gaul (Eds), Classification – The Ubiquitous Challenge, 498–505. Heidelberg: Springer. https://doi.org/10.1007/3-540-28084-7_58
Kromer, V. (2001). Word length model based on one displaced Poisson uniform distribution. Glottometrics 1, 87–96.
Krott, A. (1996). Some remarks on the relation between word length and morpheme length. Journal of Quantitative Linguistics 3 (1), 29–37. https://doi.org/10.1080/09296179608590061
Krylov, J. K. (2002). Synergetic models and methods in quantitative linguistics. Journal of Quantitative Linguistics 9 (2), 125–185. https://doi.org/10.1076/jqul.9.2.125.8487
Leopold, E. (1998). Frequency spectra within word‐length classes. Journal of Quantitative Linguistics 5 (3), 224–231. https://doi.org/10.1080/09296179808590130
Lupsa, D. A. and Lupsa, R. (2005). The law of word length in a vocabulary. Studia Univ. Babeş-Bolyai, Informatica, Vol. L, No. 2.
Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
Meyer, P. (1999). Relating word length to morphemic structure: A morphologically motivated class of discrete probability distributions, Journal of Quantitative Linguistics 6 (1), 66–69. https://doi.org/10.1076/jqul.6.1.66.4143
Rottmann, O. A. (1999). Word and syllable lengths in East Slavonic, Journal of Quantitative Linguistics 6 (3), 235–238. https://doi.org/10.1076/jqul.6.3.235.6162
Pande, H. and Dhami, H. S. (2010). Mathematical modelling of occurrence of letters and word’s initials in texts of Hindi Language. SKASE Journal of Theoretical Linguistics 7 (2), 19–38.
Pande, H. and Dhami, H. S. (2012). Model generation for word length frequencies in texts with the application of Zipf’s order approach, Journal of Quantitative Linguistics 19 (4), 249–261. https://doi.org/10.1080/09296174.2012.714531
Pande, H. and Dhami, H. S. (2013a). Mathematical modelling of the pattern of occurrence of words in different corpora of the Hindi language, Journal of Quantitative Linguistics 20 (1), 1–12. https://doi.org/10.1080/09296174.2012.754596
Pande, H. and Dhami, H. S. (2013b). Analysis for the significance of statistical word-length features in genre discrimination of Hindi texts. IOSR Journal of Mathematics 8 (1), 5–10. https://doi.org/10.9790/5728-0810510
Popescu, I.-I., Naumann, S., Kelih, E., Rovenchak, A., Overbeck, A., Sanada, H., Smith, R., Čech, R., Mohanty, P., Wilson, A., and Altmann, G. (2013). Word length: Aspects and languages. In G. Altmann and R. Köhler (Eds), Issues in Quantitative Linguistics Vol. 3, 224–281. Studies in Quantitative Linguistics, vol. 13, Lüdenscheid: RAM-Verlag.
Renkui, H. and Minghu, J. (2012). Discrimination of Chinese Quantitative Style Features Based on Text Clustering. 11th International Conference on Signal Processing (ICSP), 2012 IEEE, 21–25 October 2012, Beijing.
Röttger, W. (1996). Distribution of word length in Ciceronian letters. Journal of Quantitative Linguistics 3 (1), 68–72. https://doi.org/10.1080/09296179608590064
Rottmann, O. (2003). Word length in the Baltic languages – are they of the same type as the word lengths in the Slavic languages? Glottometrics 6, 52–60.
Rottmann, O. A. (1997). Word‐length counting in Old Church Slavonic. Journal of Quantitative Linguistics, 4 (1–3), 252–256. https://doi.org/10.1080/09296179708590101
Sigurd, B., Eeg-Olofsson, M., and Weijer, J. van de (2004). Word length, sentence length and frequency – Zipf revisited. Studia Linguistica 58 (1), 37–52. https://doi.org/10.1111/j.0039-3193.2004.00109.x
Těšitelová, M. (1992). Quantitative Linguistics. Amsterdam/Philadelphia: John Benjamins Publishing Company. https://doi.org/10.1075/llsee.37
Uhlírová, L. (1995). On the generality of statistical laws and individuality of texts. A case of syllables, word forms, their length and frequencies, Journal of Quantitative Linguistics 2 (3), 238–247. https://doi.org/10.1080/09296179508590052
Uhlírová, L. (1999). Word length modelling: Intertextuality as a relevant factor? Journal of Quantitative Linguistics 6 (3), 252–256. https://doi.org/10.1076/jqul.6.3.252.6165
Wilson, A. (2003). Word length distribution in modern Welsh prose texts. Glottometrics 6, 35–39.
Wilson, A. (2006). Word-length distribution in present-day lower Sorbian newspaper texts. In P. Grzybek (Ed.), Contributions to the Science of Text and Language: Word Length Studies and Related Issues, 319–327. Rotterdam: Springer.
Ziegler, A. (1996). Word length distribution in Brazilian‐Portuguese texts, Journal of Quantitative Linguistics 3 (1), 73–79. https://doi.org/10.1080/09296179608590065
Ziegler, A. (2000). Word length in Romance languages. A complemental contribution, Journal of Quantitative Linguistics 7 (1), 65–68. https://doi.org/10.1076/0929-6174(200004)07:01;1-3;FT065