
Seminar: Telling a new story with old data, Dan Villarreal, 21 January

Australian National University, Evolution

Date: 18 January 2019

Seminar: Telling a new story with old data: Random-forest classification of non-prevocalic (r) in Southland New Zealand English

Speaker: Dan Villarreal, University of Canterbury, New Zealand

When: 21 January, 3.30pm-5pm

Where: BPB E4.44, Baldessin Seminar Room Level 4, Baldessin Building, ANU


Variationists have increasingly turned to computational methods to automate time-consuming research tasks such as data extraction (Fromont and Hay 2012), phonetic alignment (McAuliffe et al. 2017), and transcription (Reddy and Stanford 2015). But the time-consuming nature of coding categorical sociolinguistic variables remains an impediment to addressing research questions that require large volumes of data. In this research, we apply machine-learning methods to automate the coding of non-prevocalic (r) in Southland New Zealand English, growing an existing dataset fivefold. With this enlarged dataset, we have been able to re-analyse this variable’s phonological and social conditioning and discover new patterns in the story of rhoticity in Southland.

We used a random-forest classifier to predict, from their acoustic signatures, whether tokens of non-prevocalic (r) were present (e.g., start [stɑɹt]) or absent (e.g., [stɑːt]). Random forests are an extension of classification and regression trees, which recursively partition data into successively smaller subsets, at each tree node finding the independent variable that minimises variation in the branches under that node. In random forests, trees select from among random subsets of independent variables at each split, and the ensemble of trees is aggregated to a consensus on which variables are most important (Tagliamonte and Baayen 2012). Random forests can also perform classification, predicting new observations’ response values from their independent variables; this application has been used in fields such as photogrammetry (Rodriguez-Galiano et al. 2012) and sleep medicine (Fraiwan et al. 2012), but not yet in sociolinguistics.
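The recursive-partitioning and ensemble-voting logic described above can be sketched in miniature. The following is an illustrative Python toy (the actual study used the ranger package in R, not this code); the Gini-impurity criterion, the two-feature example data, and the "present"/"absent" labels here are stand-ins chosen to mirror the (r) coding task, not the authors' implementation.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 = perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, feature_subset):
    """Among a random subset of features, find the (feature, threshold)
    pair that minimises the weighted impurity of the two branches."""
    best, best_score = None, float("inf")
    for f in feature_subset:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # degenerate split; skip
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_score, best = score, (f, t)
    return best

def build_tree(X, y, n_features, depth=0, max_depth=3):
    """Recursively partition the data as in a CART tree, but consider
    only a random subset of n_features variables at each node."""
    if depth >= max_depth or gini(y) == 0.0:
        return Counter(y).most_common(1)[0][0]  # leaf: majority label
    subset = random.sample(range(len(X[0])), n_features)
    split = best_split(X, y, subset)
    if split is None:
        return Counter(y).most_common(1)[0][0]
    f, t = split
    li = [i for i in range(len(X)) if X[i][f] <= t]
    ri = [i for i in range(len(X)) if X[i][f] > t]
    return (f, t,
            build_tree([X[i] for i in li], [y[i] for i in li],
                       n_features, depth + 1, max_depth),
            build_tree([X[i] for i in ri], [y[i] for i in ri],
                       n_features, depth + 1, max_depth))

def tree_predict(node, row):
    """Walk the tree until a leaf (a plain label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def forest_fit(X, y, n_trees=51, n_features=1):
    """Grow each tree on a bootstrap resample of the training data."""
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx],
                                 [y[i] for i in idx], n_features))
    return forest

def forest_predict(forest, row):
    """The ensemble's consensus: a majority vote across all trees."""
    return Counter(tree_predict(t, row) for t in forest).most_common(1)[0][0]

# Toy data: feature 0 cleanly separates the classes; feature 1 is noise.
random.seed(1)
X = [[0.1, 5], [0.2, 3], [0.3, 8], [0.9, 4], [0.8, 6], [0.7, 2]]
y = ["absent", "absent", "absent", "present", "present", "present"]
forest = forest_fit(X, y)
print(forest_predict(forest, [0.15, 7]))  # the ensemble votes "absent"
```

Real implementations such as ranger add refinements (variable-importance measures, probability forests, efficient split search), but the core mechanism is the one above: bootstrapped trees, random feature subsets per split, majority vote.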

We extracted 180 acoustic measures, including the first four formants, pitch, and intensity, from 4000 previously hand-coded tokens of non-prevocalic (r) in the variably rhotic Southland variety of New Zealand English (Bartlett 2002). These measures were entered into a classifier predicting (r) presence vs. absence using the ranger package (Wright and Ziegler 2017) for R (R Core Team 2017). To assess the accuracy of this classifier, we tested its predictions against previously hand-coded data via k-fold cross-validation (k = 5). On average, 82.6% of this classifier’s predictions aligned with hand-codes, which compares well with experienced coders’ roughly 85% inter-rater reliability for (r) (Lawson et al. 2014).
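The k-fold scheme works by holding out each fold once as a test set while training on the rest, then averaging the held-out accuracies. A minimal Python sketch follows (the study itself ran this in R with a ranger forest); the one-feature threshold classifier and the toy data below are hypothetical stand-ins for the random forest and the 180 acoustic measures.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k (near-)equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=5):
    """Average held-out accuracy over k folds: each fold serves once as
    the test set while the remaining folds train the classifier."""
    folds = k_fold_indices(len(X), k)
    accuracies = []
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k

# Hypothetical stand-in classifier: learn the midpoint between the
# highest "absent" value and the lowest "present" value of one measure.
def fit(X, y):
    present = [x[0] for x, yi in zip(X, y) if yi == "present"]
    absent = [x[0] for x, yi in zip(X, y) if yi == "absent"]
    return (min(present) + max(absent)) / 2

def predict(threshold, x):
    return "present" if x[0] > threshold else "absent"

# Well-separated toy data, so every fold classifies its test set perfectly.
X = [[v] for v in [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]]
y = ["absent"] * 5 + ["present"] * 5
print(cross_validate(X, y, fit, predict, k=5))  # 1.0 on this toy data
```

In the study, the same scheme scored the ranger classifier's predictions against the 4000 hand-coded tokens, yielding the 82.6% mean agreement reported above.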

We then applied this classifier to 16,000 previously uncoded (r) tokens. Mixed-effects logistic regression modelling of these 20,000 tokens (via lme4: Bates et al. 2015) indicates finer conditioning of Southland (r) variation than has previously been reported. Whereas Bartlett (2002) reported that women led a decrease in rhoticity in all preceding-vowel environments save for an increase in NURSE, we find that men led the increase in NURSE and women led the decrease elsewhere. The change toward non-rhoticity proceeded differently in different environments: gradually for NORTH but rapidly for START.

This application of machine learning enables us to code vast quantities of sociolinguistic data automatically, with accuracy comparable to that of a trained linguist, opening up research questions that would otherwise be time-prohibitive to pursue. We argue that it represents a potentially powerful new method in the sociolinguistic toolkit.


Bartlett, Christopher. 2002. The Southland Variety of New Zealand English: Postvocalic /r/ and the BATH vowel. Unpublished PhD thesis, University of Otago.

Bates, Douglas, Martin Mächler, Ben Bolker and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67(1). 1-48.

Fraiwan, Luay, Khaldon Lweesy, Natheer Khasawneh, Heinrich Wenz and Hartmut Dickhaus. 2012. Automated sleep stage identification system based on time–frequency analysis of a single EEG channel and random forest classifier. Computer Methods and Programs in Biomedicine 108(1). 10-19.

Fromont, Robert and Jennifer Hay. 2012. LaBB-CAT: An annotation store. Proceedings of Australasian Language Technology Association Workshop. 113-117.

Lawson, Eleanor, James Scobbie and Jane Stuart-Smith. 2014. A socio-articulatory study of Scottish rhoticity. In Robert Lawson (ed.) Sociolinguistics in Scotland, 53-78. London: Palgrave Macmillan.

McAuliffe, Michael, Michaela Socolof, Sarah Mihuc, Michael Wagner and Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Proceedings of the 18th Conference of the International Speech Communication Association.

R Core Team. 2017. R: A language and environment for statistical computing (version 3.4.2). Vienna: R Foundation for Statistical Computing.

Reddy, Sravana and James Stanford. 2015. A web application for automated dialect analysis. In Proceedings of NAACL-HLT 2015.

Rodriguez-Galiano, V. F., B. Ghimire, J. Rogan, M. Chica-Olmo and J. P. Rigol-Sanchez. 2012. An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS Journal of Photogrammetry and Remote Sensing 67. 93-104.

Tagliamonte, Sali A. and R. Harald Baayen. 2012. Models, forests, and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2). 135-178.

Wright, Marvin N. and Andreas Ziegler. 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77(1). 1-17.
