Danielle Barth: "Quantitative Corpus Linguistics and Fieldwork Data", 30 October 2015

Australian National University, Shape

Date: 30 October 2015

When: 30 October, 11am-12.30pm

Where: Engma Room, Coombs Building, ANU

This talk discusses assumptions and methods in quantitative corpus linguistics (Gries, 2009), including exploratory data-mining techniques for pattern detection such as random forests (Hothorn et al., 2006a; Hothorn et al., 2006b; Strobl et al., 2007; Strobl et al., 2008; Strobl et al., 2009). Although one million words is traditionally taken as the minimum size for a body of texts to count as a “corpus” (Fang, 1993; Leech, 1991), plenty of research questions can be investigated quantitatively using smaller “corpora” (Barth, 2015; Meyerhoff and Walker, 2012; 2013; Meyerhoff, 2015).

I will present two case studies using data from Matukar Panau (Oceanic, Papua New Guinea). The first case study will present a quantitative exploration of casual and clear lexical variants and situate the results in the sociolinguistic style and identity literature (Eckert, 2008; 2012; Pennebaker and Stone, 2003; Podesva, 2007; Zhang, 2008, inter alia). The second case study will present a quantitative exploration of directional construction type distribution in Matukar Panau and discuss language-internal motivations for choice of syntactic construction.

In this talk, I will advocate looking at variation in lesser-studied languages and show that this is possible below the one-million-word mark traditionally required for a “corpus.”


Barth, D. (2015). To have and to be: function word reduction in child speech, child directed speech and inter-adult speech (Doctoral dissertation, University of Oregon).

Eckert, P. (2008). Variation and the indexical field. Journal of Sociolinguistics, 12, 453-476.

Eckert, P. (2012). Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41, 87-100.

Fang, A. C. (1993). Building a corpus of the English of computer science. English Language Corpora: Design, Analysis and Exploitation. Amsterdam and Atlanta, GA: Rodopi, 73-8.

Gries, S. Th. (2009). Quantitative corpus linguistics with R: a practical introduction. Routledge, Taylor and Francis Group.

Hothorn, T., Buehlmann, P., Dudoit, S., Molinaro, A. & Van Der Laan, M. (2006a). Survival ensembles. Biostatistics, 7(3), 355-373.

Hothorn, T., Hornik, K. & Zeileis, A. (2006b). Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.

Leech, G. (1991). The state of the art in corpus linguistics. In Aijmer, K. and Altenberg, B. (eds.), English Corpus Linguistics: Studies in honour of Jan Svartvik. Longman, London, pp. 8-29.

Meyerhoff, M. (2015). Turning variation on its head: Analysing subject prefixes in Nkep (Vanuatu) for language documentation. Asia-Pacific Language Variation, 1(1), 78-108.

Meyerhoff, M., & Walker, J. A. (2012). Grammatical variation in Bequia (St Vincent and the Grenadines). Journal of Pidgin and Creole Languages, 27(2), 209-234.

Meyerhoff, M., & Walker, J. A. (2013). An existential problem: The sociolinguistic monitor and variation in existential constructions on Bequia (St. Vincent and the Grenadines). Language in Society, 42(4), 407-428.

Pennebaker, J. W. and Stone, L. D. (2003). Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology, 85(2), 291-301.

Podesva, R. J. (2007). Phonation type as a stylistic variable: the use of falsetto in constructing a persona. Journal of Sociolinguistics, 11, 478-504.
