Seminar: A preliminary report on a likelihood ratio-based forensic text comparison with samples from 2000+ authors, Shunichi Ishihara, 13 Dec
Seminar: A preliminary report on a likelihood ratio-based forensic text comparison with samples from 2000+ authors: A bag of words approach and a rotated quadratic distance measure with adapted within-author co-variance matrix
Speaker: Shunichi Ishihara
When: 13 Dec 2019 3.30pm-5pm
Where: Engma Room (3.165), HC Coombs Building, ANU
It has been demonstrated that the likelihood ratio (LR) framework works with linguistic text evidence (Ishihara 2011, 2013, 2014a, 2014b, 2016, 2017a, 2017b). However, forensic scientists have to make decisions at every level of the process of estimating LRs; for example, which distance measures, statistical models, and authorial features to use. It is important for forensic scientists to understand how these decisions affect the outcome. This study utilises a score-based method of estimating LRs with “bag of words” (BOW) features. Score-based methods are commonly used for estimating LRs across different fields of forensic science, and features based on BOW statistics are a standard feature set in general authorship analysis studies. However, this type of score-based method with BOW features has not yet been tested in LR-based forensic text comparison (FTC).
In any classification/identification task, an accurate measure of the similarity/difference between the objects under comparison (e.g. known and unknown text samples) is crucial. This study therefore also tests four different distance measures (Manhattan, Euclidean, Cosine, and Mahalanobis). Although the Mahalanobis distance is theoretically sound (Argamon 2008), previous studies (Evert et al. 2017) have reported that it substantially under-performs in authorship verification tasks in comparison to the other distance measures. In this study, the Mahalanobis distance is measured for each comparison with the averaged within-author covariance matrix obtained from the entire database. This way of adapting a covariance matrix is more appropriate for FTC tasks than the adaptation technique employed in previous studies (Evert et al. 2017).
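The four distance measures named above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function names, the toy feature vectors, and the use of a pseudo-inverse to cope with a singular covariance matrix are all assumptions made for the example. The "averaged within-author covariance matrix" is read here as the mean of the per-author sample covariance matrices.

```python
import numpy as np

def averaged_within_author_cov(samples_by_author):
    """Mean of the per-author sample covariance matrices.

    samples_by_author: list of arrays, one per author, each of shape
    (n_texts, n_features). Hypothetical helper for illustration only.
    """
    covs = [np.cov(np.asarray(s, dtype=float), rowvar=False)
            for s in samples_by_author]
    return np.mean(covs, axis=0)

def manhattan(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine_distance(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def mahalanobis(x, y, cov):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    d = x - y
    # Assumption: the pseudo-inverse is used so the sketch also runs when
    # the averaged covariance matrix happens to be singular.
    return float(np.sqrt(d @ np.linalg.pinv(cov) @ d))

if __name__ == "__main__":
    # Toy data: two authors, two texts each, two BOW features (made up).
    authors = [np.array([[1.0, 2.0], [3.0, 4.0]]),
               np.array([[0.0, 0.0], [2.0, 0.0]])]
    cov = averaged_within_author_cov(authors)
    known, unknown = authors[0][0], authors[1][0]
    for name, dist in [("Manhattan", manhattan(known, unknown)),
                       ("Euclidean", euclidean(known, unknown)),
                       ("Cosine", cosine_distance(authors[0][0], authors[0][1])),
                       ("Mahalanobis", mahalanobis(known, unknown, cov))]:
        print(name, dist)
```

In a score-based system, a distance of this kind would serve as the score for each comparison, which a separate model then converts to an LR; that conversion step is not sketched here because the abstract does not specify it.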
This study uses a part of the Amazon Customer Reviews Dataset (texts written by 2,160 authors). The texts were reformatted so that each author has a pair of messages at each of three sample sizes: 700, 1,400 and 2,100 words. Thus, for each sample size, 2,160 same-author and 4,663,440 different-author comparisons are possible, and their LRs were estimated using the entire database in a non-cross-validated manner. The outcomes of the experiments will be reported and discussed with reference to the log-likelihood-ratio cost (Cllr) values and Tippett plots.
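The comparison counts and the Cllr metric mentioned above can be checked with a short sketch. The helper names are hypothetical. The `per_pair=2` default is an assumption: the abstract does not say how different-author pairs were formed, but two comparisons per pair of distinct authors reproduces the stated figure of 4,663,440. The Cllr function follows the standard definition of the log-likelihood-ratio cost, in which an uninformative system (all LRs equal to 1) scores exactly 1 and lower values are better.

```python
import numpy as np

def comparison_counts(n_authors, per_pair=2):
    """Number of same-author and different-author comparisons.

    Assumes one same-author comparison per author and `per_pair`
    comparisons for every unordered pair of distinct authors.
    """
    same = n_authors
    different = per_pair * n_authors * (n_authors - 1) // 2
    return same, different

def cllr(same_author_lrs, different_author_lrs):
    """Log-likelihood-ratio cost from LRs (not log LRs)."""
    sa = np.asarray(same_author_lrs, dtype=float)
    da = np.asarray(different_author_lrs, dtype=float)
    # Same-author comparisons are penalised for small LRs,
    # different-author comparisons for large LRs.
    sa_cost = np.mean(np.log2(1.0 + 1.0 / sa))
    da_cost = np.mean(np.log2(1.0 + da))
    return float(0.5 * (sa_cost + da_cost))

if __name__ == "__main__":
    print(comparison_counts(2160))
    # Made-up LRs, purely to show the call shape.
    print(cllr([100.0, 20.0, 0.5], [0.01, 0.2, 3.0]))
```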
Argamon, S. (2008) Interpreting Burrows's Delta: Geometric and probabilistic foundations. Literary and Linguistic Computing 23(2): 131-147. https://dx.doi.org/10.1093/llc/fqn003
Evert, S., Proisl, T., Jannidis, F., Reger, I., Pielström, S., Schöch, C. and Vitt, T. (2017) Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities 32(suppl_2): ii4-ii16. https://doi.org/10.1093/llc/fqx023
Ishihara, S. (2011) A forensic authorship classification in SMS messages: A likelihood ratio based approach using N-gram. In D. Molla and D. Martinez (eds.), Proceedings of the Australasian Language Technology Workshop 2011: 47-56.
Ishihara, S. (2013) A Comparative Study of Two Procedures for Calculating Likelihood Ratio in Forensic Text Comparison: Multivariate Kernel Density vs. Gaussian Mixture Model-Universal Background Model. Proceedings of Australasian Language Technology Association Workshop 2013: 71-79.
Ishihara, S. (2014a) A likelihood ratio-based evaluation of strength of authorship attribution evidence in SMS messages using N-grams. International Journal of Speech, Language and the Law 21(1): 23-50. http://dx.doi.org/10.1558/ijsll.v21i1.23
Ishihara, S. (2014b) A likelihood ratio-based forensic text comparison in predatory chatlog messages. In L. Gawne and J. Vaughan (eds.), Proceedings of the 44th Conference of the Australian Linguistic Society: 39-57.
Ishihara, S. (2016) An effect of background population sample size on the performance of a Likelihood Ratio-based forensic text comparison system: A Monte Carlo simulation with Gaussian mixture model. Proceedings of the 15th Annual Workshop of Australasian Language Technology Association Workshop: 124-132.
Ishihara, S. (2017a) Strength of forensic text comparison evidence from stylometric features: A multivariate likelihood ratio-based analysis. International Journal of Speech, Language and the Law 24(1): 67-98. https://doi.org/10.1558/ijsll.30305
Ishihara, S. (2017b) Strength of linguistic text evidence: A fused forensic text comparison system. Forensic Science International 278: 184-197. https://doi.org/10.1016/j.forsciint.2017.06.040