The effect of speaker sampling in likelihood ratio based forensic voice comparison
Issue: Vol 26 No. 1 (2019)
Subject Areas: Linguistics
Within the field of forensic voice comparison (FVC), there is growing pressure for experts to demonstrate the validity and reliability of the conclusions they reach in casework. One benefit of a fully data-driven approach that utilises databases of speakers to compute numerical likelihood ratios (LRs) is that it is possible to estimate validity and reliability empirically. However, little is known about the stability of LR output as a function of the specific speakers sampled for use in the training, test and reference data sets. The present study addresses this issue using two large sets of formant data: Cantonese sentence-final particle /a/ and British English filled pauses UM. Experiments were replicated 100 times, varying (1) the training, test and reference speakers, (2) the training speakers only, (3) the test speakers only, and (4) the reference speakers only. The results show that varying the speakers in all three sets has the greatest effect on system stability for both the Cantonese and English variables, with the log-likelihood-ratio cost (Cllr) varying from 0.60 to 0.97 for /a/ and from 0.32 to 1.33 for UM. However, this variability is primarily due to the effects of uncertainty in the test set. Varying only the training speakers has the least effect on system stability for /a/ (Cllr range: 0.76 to 0.88), while varying only the reference speakers has the smallest effect for UM (Cllr range: 0.40 to 0.54). The results indicate that in LR-based FVC it is important to assess the stability of the system as a function of the samples of speakers used, reporting the Cllr range across samples rather than a single Cllr value based on one configuration of speakers in each set. The study contributes to the general debate on reporting uncertainty in LR computation.
Authors: Bruce Xiao Wang, Vincent Hughes, Paul Foulkes
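The replication design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `cllr` function follows the standard definition of the log-likelihood-ratio cost; everything else is an assumption made for illustration: the function names, the set sizes (the abstract does not state them), and `run_system`, a hypothetical callable standing in for whatever LR system is being tested. Only condition (1), in which all three speaker sets vary, is shown.

```python
import math
import random


def cllr(ss_lrs, ds_lrs):
    """Log-likelihood-ratio cost from same-speaker (ss) and
    different-speaker (ds) likelihood ratios."""
    ss_cost = sum(math.log2(1 + 1 / lr) for lr in ss_lrs) / len(ss_lrs)
    ds_cost = sum(math.log2(1 + lr) for lr in ds_lrs) / len(ds_lrs)
    return 0.5 * (ss_cost + ds_cost)


def sample_speaker_sets(speakers, n_train, n_test, n_ref, rng):
    """Draw disjoint training, test and reference sets without replacement."""
    drawn = rng.sample(speakers, n_train + n_test + n_ref)
    train = drawn[:n_train]
    test = drawn[n_train:n_train + n_test]
    ref = drawn[n_train + n_test:]
    return train, test, ref


def cllr_range(speakers, run_system, n_reps=100, sizes=(20, 20, 20), seed=0):
    """Re-sample all three speaker sets n_reps times and return the
    (min, max) Cllr across replications.

    run_system is a hypothetical callable (train, test, ref) ->
    (ss_lrs, ds_lrs); the set sizes are illustrative, not from the study.
    """
    n_train, n_test, n_ref = sizes
    rng = random.Random(seed)
    values = []
    for _ in range(n_reps):
        train, test, ref = sample_speaker_sets(
            speakers, n_train, n_test, n_ref, rng
        )
        ss_lrs, ds_lrs = run_system(train, test, ref)
        values.append(cllr(ss_lrs, ds_lrs))
    return min(values), max(values)
```

A system that always outputs an LR of 1 carries no information and has a Cllr of exactly 1; values below 1 indicate that the system is extracting useful speaker-discriminatory information. Holding two of the three sets fixed and re-sampling only the third would give a sketch of conditions (2) to (4).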