Item Details

Statistical properties of English text produced by Korean and Chinese authors

Issue: Vol 1 No. 1 (2014)

Journal: Journal of Research Design and Statistics in Linguistics and Communication Science

Subject Areas: Linguistics

DOI: 10.1558/jrds/7207888858


This article compares the Zipf’s law and Heaps’ law properties of English texts produced by second language (L2) writers of English with those of similar texts produced by native writers. Zipf’s law is a well-known statistical law describing the frequency distribution of words in texts, while Heaps’ law describes the rate at which new words are encountered as one reads a text. The analysis of the Zipf properties of texts by writers whose first language (L1) is Korean shows that the distribution of words and multiword sequences (n-grams) differs from that observed in native speakers’ texts. These differences imply that Korean writers do not fully exploit English pronominals. Analyses of the Heaps’ law properties also indicate that texts produced by L1 and L2 writers show different rates of lexical innovation. A number of studies have estimated the Zipf’s law and Heaps’ law properties of native language texts in various contexts (e.g., Li, 1992; Cancho, 2005), while others have considered the consequences of Zipf’s law for second language acquisition (Laufer and Nation, 1995; Meara, 2005; Ellis, 2012). This study is the first attempt to estimate the parameters of these laws for texts produced by different populations of second language writers.
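For readers unfamiliar with the two laws, the sketch below shows one simple way their parameters can be estimated from a tokenized text. It is an illustration only: the abstract does not describe the article's own estimation procedure, and the function names (`tokenize`, `fit_line`, `zipf_exponent`, `heaps_exponent`) are hypothetical. It assumes the usual power-law forms, word frequency proportional to rank^(-alpha) for Zipf's law and vocabulary size proportional to n^beta for Heaps' law, and fits each exponent by ordinary least squares on log-log values, a rough method that more careful studies often replace with maximum-likelihood fitting.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()[]")
        if word:
            tokens.append(word)
    return tokens


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def zipf_exponent(tokens):
    """Estimate alpha in f(r) ~ C * r**(-alpha) from rank-frequency data.

    Assumes at least two distinct word types.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    log_ranks = [math.log(r) for r in range(1, len(freqs) + 1)]
    log_freqs = [math.log(f) for f in freqs]
    slope, _ = fit_line(log_ranks, log_freqs)
    return -slope


def heaps_exponent(tokens):
    """Estimate beta in V(n) ~ K * n**beta from vocabulary growth.

    Assumes at least two tokens.
    """
    seen = set()
    log_n, log_v = [], []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)  # vocabulary grows only when a new type appears
        log_n.append(math.log(n))
        log_v.append(math.log(len(seen)))
    slope, _ = fit_line(log_n, log_v)
    return slope
```

Under these assumptions, comparing two groups of writers would amount to calling both functions on each group's texts and comparing the resulting exponents. A lower Heaps exponent indicates that new words appear more slowly as the text grows.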

Author: Robert Nelson

References:

Biber, D., & Conrad, S. (1999). Lexical bundles in conversation and academic prose. Language and Computers, 26, 181-190.
Cancho, R. F. i., Riordan, O., & Bollobás, B. (2005). The consequences of Zipf's law for syntax and symbolic reference. Proceedings of the Royal Society B: Biological Sciences, 272(1562), 561-565.
Cancho, R. F. i. (2005a). The variation of Zipf’s law in human language. The European Physical Journal B-Condensed Matter and Complex Systems, 44(2), 249-257.
Cancho, R. F. i. (2005b). Decoding least effort and scaling in signal frequency distributions. Physica A: Statistical Mechanics and its Applications, 345(1), 275-284.
Cattuto, C., Barrat, A., Baldassarri, A., Schehr, G., & Loreto, V. (2009). Collective dynamics of social annotation. Proceedings of the National Academy of Sciences, 106(26), 10511-10515.
Dewaele, J. M., & Pavlenko, A. (2003). Productivity and lexical diversity in native and non-native speech: A study of cross-cultural effects. Effects of the second language on the first, 3, 120.
Eliazar, I. (2011). The growth statistics of Zipfian ensembles: Beyond Heaps’ law. Physica A: Statistical Mechanics and its Applications, 390(20), 3189-3203.
Ellis, N. C. (2003). Constructions, chunking, and connectionism: The emergence of second language structure. The handbook of second language acquisition, 63-103.
Ellis, N. C. (2008). The dynamics of second language emergence: Cycles of language use, language change, and language acquisition. The Modern Language Journal, 92(2), 232-249.
Ellis, N. C. (2012). Formulaic Language and Second Language Acquisition: Zipf and the Phrasal Teddy Bear. Annual Review of Applied Linguistics, 32(1), 17-44.
Gelbukh, A., & Sidorov, G. (2001). Zipf and Heaps Laws’ coefficients depend on language. In Computational Linguistics and Intelligent Text Processing (pp. 332-335). Springer Berlin Heidelberg.
Heaps, H. S. (1978). Information Retrieval: Computational and Theoretical Aspects. Academic Press.
Herdan, G. (1960). Type-token mathematics. The Hague: Mouton.
Larsen-Freeman, D. (1997). Chaos/complexity science and second language acquisition. Applied linguistics, 18(2), 141-165.
Li, W. (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842-1845.
Lu, L., Zhang, Z. K., & Zhou, T. (2012). Scaling Laws in Human Language. arXiv preprint arXiv:1202.2903.
Mandelbrot, B. (1953). An informational theory of the statistical structure of language. Communication theory, 84.
Meara, P., Lightbown, P. M., & Halter, R. H. (1997). Classrooms as lexical environments. Language Teaching Research, 1(1), 28-46.
Milton, J. (2009). Measuring second language vocabulary acquisition. Multilingual Matters.
Zanette, D., & Montemurro, M. (2005). Dynamics of text generation with realistic Zipf's distribution. Journal of Quantitative Linguistics, 12(1), 29-40.