Item Details

Using Grammatical Features for Automatic Register Identification in an Unrestricted Corpus of Documents from the Open Web

Issue: Vol 2 No. 1 (2015)

Journal: Journal of Research Design and Statistics in Linguistics and Communication Science

Subject Areas: Linguistics

DOI: 10.1558/jrds.v2i1.27637


Most previous attempts at automatic genre identification have been based on corpus samples that are relatively small and artificially restricted. In this study we set out to automatically predict register/genre categories in a large, representative sample of documents from the open web using a linguistic approach focused on lexico-grammatical characteristics that have functional associations. Our findings demonstrate the possibility of automatically predicting register/genre on the unrestricted open web, and we anticipate that future extensions will allow this task to be accomplished with considerably higher degrees of accuracy.

Authors: Douglas Biber, Jesse Egbert
