Leveraging a Large Learner Corpus for Automatic Suggestion of Collocations for Learners of Japanese as a Second Language
Issue: Vol 33 No. 3 (2016)
Journal: CALICO Journal
One of the challenges of learning Japanese as a Second Language (JSL) is finding the appropriate word for a particular usage. To address this challenge, we developed a collocational aid designed to suggest more appropriate collocations in Japanese. In particular, we address the problem of generating and ranking noun and verb candidates for correcting potential collocation errors in learners’ text. Given a noun-verb construction as input, our system generates possible noun or verb correction candidates based on noun and verb corrections extracted from a large Japanese learner corpus. We use this corpus to investigate learners’ tendency to commit collocation errors, and to produce a smaller and more realistic set of candidates. After combining nouns or verbs with the generated candidates to form noun-verb pairs, the system uses the Weighted Dice coefficient as the association measure to filter out inappropriate noun-verb pairs and rank the proper collocations. We report a detailed evaluation and results on learner data. In addition, we show that our system outperforms existing approaches to collocation error correction by a statistically significant margin. Finally, we report a preliminary user study with JSL learners.
Authors: Lis Pereira, Erlyn Manguilimotan, Yuji Matsumoto
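
The filtering and ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the commonly cited form of the Weighted Dice coefficient, WD(n, v) = log2 f(n, v) × 2 f(n, v) / (f(n) + f(v)), and all function names, the confusion set, the threshold, and the frequency counts are hypothetical toy values chosen for illustration.

```python
import math


def weighted_dice(pair_freq, noun_freq, verb_freq):
    """Weighted Dice score for a noun-verb pair (assumed formula, see lead-in)."""
    if pair_freq <= 0 or noun_freq + verb_freq == 0:
        return 0.0  # unseen pairs get no association score
    return math.log2(pair_freq) * 2.0 * pair_freq / (noun_freq + verb_freq)


def rank_verb_candidates(noun, candidates, pair_counts, noun_counts,
                         verb_counts, threshold=0.0):
    """Score each candidate verb with the noun, drop weak pairs, sort best-first."""
    scored = []
    for verb in candidates:
        score = weighted_dice(
            pair_counts.get((noun, verb), 0),
            noun_counts.get(noun, 0),
            verb_counts.get(verb, 0),
        )
        if score > threshold:  # filter out pairs at or below the threshold
            scored.append((verb, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical confusion set: verbs that replaced "suru" in learner corrections.
confusion_set = {"suru": ["miru", "motsu", "suru"]}

# Toy frequency counts standing in for a reference corpus (invented numbers).
pair_counts = {("yume", "miru"): 8, ("yume", "motsu"): 4, ("yume", "suru"): 1}
noun_counts = {"yume": 20}
verb_counts = {"miru": 44, "motsu": 44, "suru": 100}

# A learner wrote "yume wo suru"; rank the replacement verbs for "yume".
ranked = rank_verb_candidates(
    "yume", confusion_set["suru"], pair_counts, noun_counts, verb_counts
)
print(ranked)
```

In this toy setting the pair seen only once scores zero, because log2(1) = 0, and is filtered out, which shows how the logarithmic weight penalizes rare co-occurrences relative to the plain Dice coefficient.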