Stylistic variation within genre conventions in the Enron email corpus: developing a text-sensitive methodology for authorship research
Issue: Vol 20 No. 1 (2013)
Journal: International Journal of Speech Language and the Law
Subject Areas: Linguistics
Abstract:
Over recent years there has been much theoretical discussion regarding idiolect and its usefulness in forensic authorship analysis. This article, drawing on email data from the former American energy company Enron, offers an empirical investigation into identifying individuals’ idiolects through analysing author-distinctive variation within two conventions of the email genre – greetings and farewells. The first part of a two-stage analysis identifies a number of forms which distinguish between authors in a four-author corpus. Using likelihood ratios, the second stage of analysis finds that some of the greeting and farewell forms identified, and combinations of forms, remain distinctive and individuating of their author when tested against the 126-author ‘Enron Sent Email Author Reference Corpus’, and highlights the diagnostic power of less frequent variants. The results from this article offer both theoretical and methodological contributions as well as a baseline of population results for forensic authorship casework involving emails.
Author: David Wright