Largest translated corpus enters CoEDL collection

Alan Rumsey, Learning

Date: 15 November 2021

The CoEDL Corpus Collection grew 25% in 2021 to include over 6.5 million words across 42 corpora. A particularly significant addition to the collection is the English-translated Ku Waru child language corpus, compiled by a team led by CI Alan Rumsey. Containing over 1.3 million transcribed words, it is the largest English-translated Indigenous language corpus in the CoEDL collection, and one of the largest for any Pacific language.

Ku Waru is a language of the Western Highlands Province of Papua New Guinea; it is actively spoken by about 10,000 people and still learned as a first language by children. Younger generations of Ku Waru speakers also speak Tok Pisin, a largely English-based creole and one of Papua New Guinea’s national languages.

Alan has been working with the Ku Waru community for many years, with some of his earliest audio recordings of children there dating from 2004. All of the CoEDL-archived corpus, however, was gathered for the Ku Waru Child Language Socialisation study (KWCLSS). This study ran from 2013 to 2016, supported by ARC Discovery Project funding. AI Francesca Merlan and several CoEDL PhD students — as well as field assistants John Onga and Andrew Noma, members of the Ku Waru community — worked with Alan on KWCLSS, which explored questions about how children learn language and whether and how they are socialised to particular behaviours and ways of life as they acquire language.

To answer these questions, the study took a longitudinal approach. It followed five children, ranging in age from 20 months to five years, as they learned Ku Waru and Tok Pisin. John and Andrew filmed each child for one hour per month as the child interacted with parents and other people. The field assistants then transcribed the recordings and translated the Ku Waru transcriptions into English, recording this work in hundreds of notebooks. Partner institution Appen scanned the notebooks and created typed transcripts of them before the material passed to the research team for analysis.

“The size and diversity of the data set provides a unique opportunity for in-depth investigation of linguistic and anthropological issues,” Alan explains on the project website. Among the publications arising from this immense effort are studies of intersubjectivity, the theme of deception, gesture, and the ways in which children learn complex syntactic constructions in Ku Waru, such as clause chaining.

The full corpus contains over 2.5 million words in Ku Waru and Tok Pisin, spanning 364 sessions. CoEDL Corpus Manager Wolfgang Barth spent several months of 2021 preparing the corpus before adding it to the Centre’s collection. Now that the corpus is available online, words and phrases can be searched, and word frequency can be compared for each speaker, age group or gender.

The selection of transcripts archived in the CoEDL Corpus Collection is available online. Further files are archived with PARADISEC and with the Language, Acquisition, Diversity Lab (ACQDIV) at the University of Zurich.



Header—Alan Rumsey, field assistant John Onga (with notebook), and Wapi Onga, ca. 2004 (image: KWCLSS).
