Corpus Development

We are beginning to build large corpora of little-known languages of our region. Despite over 50 years of high-quality linguistic research, there is still not a single publicly available corpus of an Australian Indigenous language.

We will build large, structured, annotated corpora of ten Indigenous Australian languages, three Papuan languages, and Bislama, the national creole language of Vanuatu. We are working with the Documentation and Archiving thread on principles for developing additional corpora and for presenting portions of each corpus in the EOPAS online system.

These corpora will act as vast storehouses of cultural knowledge. They are of great value to Indigenous communities, because corpus development unlocks the material for local use, and they will form the basic data sets for present and future linguistic investigation. In addition to extensive collection of new materials, legacy holdings will be upgraded by annotating them with as much demographic data as can be collected through detailed investigations in the community. We will seek to record identical information for all participants so that the corpora are as comparable as possible. All corpora will be publicly available, not only to relevant researchers but also to speaker communities (subject to community-driven access restrictions).

