
ANU - CoEDL Linguistics Seminar: Discrete high-level features in likelihood ratio-based forensic voice comparison, Michael Carne, 26 Feb

Australian National University, Outreach

Date: 16 February 2021

Seminar: Discrete high-level features in likelihood ratio-based forensic voice comparison

Speaker: Michael Carne, Australian National University

When: 26 Feb 2021, 3pm (AEDT)

Where: via Zoom (please email CoEDL@anu.edu.au for the Zoom link)

Abstract:

There is an ongoing paradigm shift in the forensic comparison sciences (e.g. comparison of DNA samples, fingerprints, voice recordings, etc.). This shift is characterised by data-driven and probabilistic approaches to evidentiary evaluation (Saks & Koehler, 2005), and by the emergence of the likelihood ratio (LR) framework as central to drawing inferences about the origin of unknown forensic material (Aitken & Taroni, 2004; Evett, 1998; Morrison, 2009; Rose, 2005). Voice comparison is one area of the forensic sciences where this shift is underway (Gold & French, 2019; Morrison, Enzinger, & Zhang, 2017).
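For readers unfamiliar with the framework, the LR is conventionally written as below. The notation is an assumption introduced here for illustration and is not taken from the abstract: E stands for the observed evidence (e.g. the measured differences between two recordings), H_s for the same-speaker hypothesis and H_d for the different-speaker hypothesis.

```latex
\mathrm{LR} = \frac{p(E \mid H_s)}{p(E \mid H_d)},
\qquad
\underbrace{\frac{P(H_s \mid E)}{P(H_d \mid E)}}_{\text{posterior odds}}
= \mathrm{LR} \times
\underbrace{\frac{P(H_s)}{P(H_d)}}_{\text{prior odds}}
```

An LR greater than 1 supports the same-speaker hypothesis, an LR less than 1 supports the different-speaker hypothesis, and the prior odds remain a matter for the trier of fact rather than the forensic practitioner.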

High-level features are defined as those that rely on linguistic or long-range information (Shriberg, 2007). These are thought to capture characteristics that human listeners find salient when recognising an individual by their voice, such as word choice, intonation and pronunciation (Campbell et al., 2003). High-level features also possess several desirable forensic qualities. They are relatively robust to acoustic variability introduced by transmission effects and background noise (Campbell, Campbell, Reynolds, Jones, & Leek, 2004; Rose, 2002), are arguably more easily interpreted by non-experts (e.g. judges and juries) (Rose, 2006), and have the potential to add complementary information to traditional acoustic-based systems (Ferrer et al., 2006; Shriberg & Stolcke, 2008), which is advantageous where data is limited.

A majority of forensic voice comparison (FVC) studies operating within the new paradigm have investigated high-level features derived from acoustic-phonetic analyses of speech (vowel formant frequencies, voice ‘pitch’ a.k.a. long-term fundamental frequency, etc.) (e.g. Kinoshita, 2001; Kinoshita & Ishihara, 2010; Rose, 2017). Other approaches have examined the use of low-level spectral information derived from signal processing techniques, such as Mel-frequency cepstral coefficients (MFCC) (e.g. Zhang, Morrison, Enzinger, & Ochoa, 2013; Zhang, Morrison, & Thiruvaran, 2011). Both approaches rely on continuous acoustic representations of speech, and various methods exist for estimating forensic LRs using this type of data (e.g. multivariate kernel density estimation (Aitken & Lucy, 2004); Gaussian mixture models (Morrison, Enzinger, Ramos, González-Rodríguez, & Lozano-Díez, 2020)).

However, some patterns of language use that speakers exhibit are discrete. That is, they are properties of speech that are quantified by the frequency of their occurrence in a voice recording, or are defined by their presence or absence. These include a speaker’s habitual lexical or syntactic choices, pronunciation patterns, speech disfluencies and so forth. There are no existing FVC studies within the new paradigm investigating the forensic potential of this kind of linguistic information, which is described here as discrete high-level features. Furthermore, few statistical procedures are described in the literature for estimating LRs for discrete forensic data (Aitken & Gold, 2013; Bolck & Stamouli, 2017), and those that do exist have not been empirically validated for speech data. This project aims to investigate how discrete high-level information can be captured using speaker-dependent language models and implemented in LR-based FVC. Specific questions the research will address are as follows. (1) What performance characteristics (e.g. accuracy) do discrete high-level features exhibit? (2) How does this compare with traditional acoustic features? (3) What performance gains are achievable from fusing the outputs of multiple high-level FVC systems?
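To make the idea of an LR from discrete, count-based features concrete, the following is a minimal sketch and not the method presented in the seminar. It assumes a unigram language model with add-alpha smoothing, and it assumes that a pooled background sample can stand in for the relevant population. The function names, the smoothing constant and the toy token lists are all hypothetical.

```python
import math
from collections import Counter


def unigram_log_prob(tokens, counts, vocab_size, alpha=1.0):
    """Natural-log probability of `tokens` under a unigram model
    estimated from `counts` with add-alpha smoothing (an assumption)."""
    denom = sum(counts.values()) + alpha * vocab_size
    return sum(math.log((counts[t] + alpha) / denom) for t in tokens)


def log10_lr(questioned, suspect_sample, background_sample, alpha=1.0):
    """log10 likelihood ratio for the questioned tokens:
    numerator   = suspect (speaker-dependent) model,
    denominator = background model standing in for the population."""
    vocab = set(questioned) | set(suspect_sample) | set(background_sample)
    suspect = Counter(suspect_sample)
    background = Counter(background_sample)
    num = unigram_log_prob(questioned, suspect, len(vocab), alpha)
    den = unigram_log_prob(questioned, background, len(vocab), alpha)
    return (num - den) / math.log(10)


# Toy data (hypothetical): filler words used by a suspect and a background pool.
suspect_sample = ["um", "you", "know", "um", "like", "um"]
background_sample = ["uh", "like", "well", "uh", "so", "uh", "you", "know"]

# A questioned sample that resembles the suspect gives log10 LR > 0;
# one that resembles the background gives log10 LR < 0.
print(log10_lr(["um", "um", "like"], suspect_sample, background_sample))
print(log10_lr(["uh", "uh", "so"], suspect_sample, background_sample))
```

A realistic system would need far more data, a principled choice of features and smoothing, and calibration and validation of the resulting LRs; the sketch only shows how frequency-of-occurrence features can enter the numerator and denominator of an LR.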

References:

Aitken, C., & Gold, E. (2013). Evidence evaluation for discrete data. Forensic Science International, 230(1-3), 147-155.

Aitken, C. G., & Lucy, D. (2004). Evaluation of trace evidence in the form of multivariate data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 53, 109-122. doi:10.1046/j.0035-9254.2003.05271.x

Aitken, C. G. G., & Taroni, F. (2004). Statistics and the Evaluation of Evidence for Forensic Scientists. Chichester, U.K.: Wiley.

Bolck, A., & Stamouli, A. (2017). Likelihood ratios for categorical evidence; Comparison of LR models applied to gunshot residue data. Law, Probability and Risk, 17(2-3), 71-90. doi:10.1093/lpr/mgx005

Evett, I. W. (1998). Towards a uniform framework for reporting opinions in forensic science casework. Science & Justice, 38(3), 198-202. doi:10.1016/s1355-0306(98)72105-7

Ferrer, L., Shriberg, E., Kajarekar, S. S., Stolcke, A., Sonmez, K., Venkataraman, A., & Bratt, H. (2006). The contribution of cepstral and stylistic features to SRI’s 2005 NIST speaker recognition evaluation system. Paper presented at the International Conference on Acoustics, Speech, & Signal Processing (ICASSP), Toulouse.

Gold, E., & French, P. (2019). International practices in forensic speaker comparisons: second survey. The International Journal of Speech, Language and the Law, 26(1), 1-20. doi:10.1558/ijsll.38028

Kinoshita, Y. (2001). Testing realistic forensic speaker identification in Japanese: a likelihood ratio based approach using formants. Australian National University, Canberra.

Kinoshita, Y., & Ishihara, S. (2010, 14-16 December). F0 can tell us more: speaker classification using the long term distribution. Paper presented at the Thirteenth Australasian International Conference on Speech Science and Technology, Melbourne, Australia.

Morrison, G., Enzinger, E., & Zhang, C. (2017). Forensic speech science. In I. Freckelton & H. Selby (Eds.), Expert Evidence. Sydney, Australia: Thomson Reuters.

Morrison, G. S. (2009). Forensic voice comparison and the paradigm shift. Science & Justice, 49(2), 298-308. doi:10.1016/j.scijus.2009.09.002

Morrison, G. S., Enzinger, E., Ramos, D., González-Rodríguez, J., & Lozano-Díez, A. (2020). Statistical models in forensic voice comparison. In D. L. Banks, K. Kafadar, D. H. Kaye, & M. Tackett (Eds.), Handbook of Forensic Statistics. Boca Raton, FL: CRC.

Rose, P. (2005). Forensic Speaker Recognition at the Beginning of the Twenty-first Century – an Overview and a Demonstration. The Australian Journal of Forensic Sciences, 37(2), 49-71. doi:10.1080/00450610509410616

Rose, P. (2017). Likelihood ratio-based forensic voice comparison with higher level features: research and reality. Computer Speech & Language, 45, 475-502. doi:10.1016/j.csl.2017.03.003

Saks, M. J., & Koehler, J. J. (2005). The coming paradigm shift in forensic identification science. Science, 309(5736), 892-895. doi:10.1126/science.1111565

Shriberg, E. (2007). Higher-Level Features in Speaker Recognition. In C. Müller (Ed.), Speaker Classification I: Fundamentals, Features, and Methods (Vol. 4343, pp. 241-259): Springer.

Shriberg, E., & Stolcke, A. (2008). The case for automatic higher-level features in forensic speaker recognition. Paper presented at the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH 2008), Brisbane, Australia.

Zhang, C., Morrison, G., Enzinger, E., & Ochoa, F. (2013). Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison – Female voices. Speech Communication, 55, 796-813. doi:10.1016/j.specom.2013.01.011

Zhang, C., Morrison, G. S., & Thiruvaran, T. (2011). Forensic voice comparison using Chinese /iau/. Paper presented at the 17th International Congress of Phonetic Sciences (ICPhS 2011), Hong Kong.

