
ANU-CoEDL ZOOM Seminar: Estimating forensic likelihood ratios for authorship text evidence with Poisson-based models, 4 Dec



Speakers: Michael Carne and Shunichi Ishihara

When: 4 Dec 2020, 3pm (AEDT)

Where: via Zoom (please email for a Zoom link invitation)


In forensic comparison of objects of known and questioned origin (e.g. voices, documents), a likelihood ratio (LR) quantifying the strength of evidence can be estimated using either a score- or feature-based method (Aitken 2018; Bolck et al. 2015; Morrison and Enzinger 2018). The score-based method – in which a distance measure is typically used as a score-generating function – is more prevalent in forensic science for estimating LRs, mainly due to its ease of implementation and robustness. Distance measures (e.g. Burrows's Delta, Cosine distance) are a standard tool in authorship attribution studies, and various measures have been devised and shown to perform well (Argamon 2008; Eder and Rybicki 2012; Hoover 2004). Thus, implementing a score-based method using a distance measure is naturally the first step for estimating LRs for textual evidence. However, many of the authorship attribution features extracted from textual data are discrete (e.g. the counts of function words) and may violate the statistical assumptions underlying distance-based models. More importantly, such models only assess the similarity, not the typicality, of the objects (i.e. documents) under comparison (Aitken 2018; Bolck et al. 2015; Morrison and Enzinger 2018).
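A minimal sketch (not the authors' implementation) of the score-generating step described above: the Cosine distance between two word-count vectors. In a full score-based system, such scores from same-author and different-author pairs would then be converted to LRs by a further calibration step, omitted here.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity between two equal-length count vectors.

    Smaller values indicate more similar word-frequency profiles; this is
    the kind of score a score-based LR system would later calibrate.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Note that such a score measures only similarity between the two documents; as the abstract points out, it does not by itself account for how typical those feature values are in the wider population.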

This paper reports on findings from a set of forensic text comparison (FTC) experiments which sought to investigate the performance of score- vs feature-based LR estimation. Three feature-based LR models based on the Poisson distribution were tested: one-level Poisson, one-level zero-inflated Poisson and two-level Poisson-Gamma (Aitken and Gold 2013) LR models. The performance of each model was assessed against a baseline score-based model using the Cosine distance. Performance was assessed in terms of system validity (accuracy), discrimination and calibration loss. Metrics quantifying these were derived from the log-likelihood ratio cost function (Cllr; Brümmer and du Preez 2006; van Leeuwen and Brümmer 2007).
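To make the feature-based idea concrete, the following is an illustrative sketch (under simplifying naive-independence assumptions, not the authors' actual models) of a one-level Poisson LR over discrete word counts, together with the Cllr metric used for evaluation. The rate parameters here are assumed inputs; in practice they would be estimated from known-author and background data.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_llr(counts, author_rates, background_rates):
    """Sum of per-feature log10 LRs under a naive one-level Poisson model.

    Numerator: counts generated by the known author's rates (similarity);
    denominator: counts generated by background-population rates (typicality).
    """
    llr = 0.0
    for x, lam_a, lam_b in zip(counts, author_rates, background_rates):
        llr += math.log10(poisson_pmf(x, lam_a) / poisson_pmf(x, lam_b))
    return llr

def cllr(same_author_lrs, diff_author_lrs):
    """Log-likelihood-ratio cost (Brümmer and du Preez 2006).

    Lower is better; a system that always outputs LR = 1 scores Cllr = 1.
    """
    p_same = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs) / len(same_author_lrs)
    p_diff = sum(math.log2(1 + lr) for lr in diff_author_lrs) / len(diff_author_lrs)
    return 0.5 * (p_same + p_diff)
```

Unlike the distance score, the Poisson LR explicitly weighs similarity against typicality, which is the property the abstract identifies as missing from distance-based models.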

The experimental data for this study was based on documents collected from 2,157 authors from the Amazon Product Data Authorship Verification Corpus (Halvani et al. 2017). Each document was equalised to 700, 1400 or 2100 words, and modelled as a vector consisting of the counts of the 400 most frequently occurring words.
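The document representation described above can be sketched as follows (an assumed whitespace tokenisation, not the authors' preprocessing): each document becomes a vector of counts of the n most frequent words in the collection.

```python
from collections import Counter

def count_vectors(documents, n_features=400):
    """Represent each document as counts of the n_features most frequent
    words across the whole collection (a sketch of the abstract's
    400-most-frequent-words setup)."""
    tokenised = [doc.lower().split() for doc in documents]
    totals = Counter(tok for toks in tokenised for tok in toks)
    vocab = [word for word, _ in totals.most_common(n_features)]
    vectors = [[Counter(toks)[word] for word in vocab] for toks in tokenised]
    return vectors, vocab
```

The resulting count vectors are exactly the kind of discrete feature the abstract notes may violate the assumptions of distance-based models but suit Poisson-based ones.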

The results demonstrate that: (1) feature-based LR models outperform the score-based model in terms of system validity (accuracy); (2) the feature-based LR model constructed using a Poisson-Gamma distribution yields on average the best validity, and the least discrimination loss, relative to the Poisson and zero-inflated Poisson LR models; (3) the feature-based LR models are, however, poorly calibrated relative to the score-based model in some cases; and (4) performance can be further improved using a feature selection method based on discrimination loss. Some distinctive performance characteristics of the feature-based methods are described, as are their potential shortcomings when applied in real forensic casework.

Aitken, C. G. G. (2018) Bayesian hierarchical random effects models in forensic science. Frontiers in Genetics 9: Article 126.

Aitken, C. G. G. and Gold, E. (2013) Evidence evaluation for discrete data. Forensic Science International 230(1-3): 147-155.

Argamon, S. (2008) Interpreting Burrows’s Delta: Geometric and probabilistic foundations. Literary and Linguistic Computing 23(2): 131-147.

Bolck, A., Ni, H. F. and Lopatka, M. (2015) Evaluating score- and feature-based likelihood ratio models for multivariate continuous data: Applied to forensic MDMA comparison. Law, Probability and Risk 14(3): 243-266.

Brümmer, N. and du Preez, J. (2006) Application-independent evaluation of speaker detection. Computer Speech and Language 20(2-3): 230-275.

Eder, M. and Rybicki, J. (2012) Do birds of a feather really flock together, or how to choose training samples for authorship attribution. Literary and Linguistic Computing 28(2): 229-236.

Halvani, O., Winter, C. and Graner, L. (2017) Authorship verification based on compression-models. arXiv preprint arXiv:1706.00516.

Hoover, D. L. (2004) Testing Burrows’s Delta. Literary and Linguistic Computing 19(4): 453-475.

Morrison, G. S. and Enzinger, E. (2018) Score based procedures for the calculation of forensic likelihood ratios - Scores should take account of both similarity and typicality. Science & Justice 58(1): 47-58.

van Leeuwen, D. and Brümmer, N. (2007) An introduction to application-independent evaluation of speaker recognition systems. In C. Müller (ed.), Speaker Classification I: Fundamentals, Features, and Methods 330-353. Berlin; New York: Springer.
