Using Grammatical Features for Automatic Register Identification in an Unrestricted Corpus of Documents from the Open Web

Issue: Vol 2 No. 1 (2015)

Journal: Journal of Research Design and Statistics in Linguistics and Communication Science

Subject Areas: Linguistics

DOI: 10.1558/jrds.v2i1.27637


Most previous attempts at automatic genre identification have been based on corpus samples that are relatively small and artificially restricted. In this study we set out to automatically predict register/genre categories in a large, representative sample of documents from the open web using a linguistic approach focused on lexico-grammatical characteristics that have functional associations. Our findings demonstrate the possibility of automatically predicting register/genre on the unrestricted open web, and we anticipate that future extensions will allow this task to be accomplished with considerably higher degrees of accuracy.

Authors: Douglas Biber, Jesse Egbert

