Corpus Development

We are building a growing number of corpora of languages of our region, in particular Indigenous languages of Australia, and Papuan and Austronesian languages of Indonesia, PNG, and Island Melanesia.

You can see the current set of corpus materials here:

Corpora are collections of texts, either spoken or written, that are initially compiled within language documentation projects. Language documentation takes a very open research approach to a little-known language. It comprises the recording of different communicative events, their transcription and translation, and the documentation of vocabulary, cultural and encyclopaedic knowledge, and speakers’ judgments about language structures.

Whether drawn from older written material, audio recordings or newer video recordings, corpora are made amenable to linguistic research through mark-up and metadata. These allow linguists to retrieve relevant linguistic information and relate it to its structural context and to information about the speakers, the occasion of the communicative event, the audience, and so on.

The systematic compilation and use of corpora for research on lesser-studied languages is still a relatively recent development. Nonetheless, many of our corpora contain substantial amounts of text data, collected over decades of research in communities in and around Australia. We are also developing new techniques of data collection and data mark-up that will enable more systematic corpus-linguistic research across diverse languages. We believe that making more of this material accessible to a broader community of scholars will be a valuable contribution to the empirical study of language.

Our researchers are at the forefront of this emerging area of language research. Some of our corpora serve as the basis for grammars, dictionaries, and other studies (see Nafsan). In other projects, we make older data from a range of languages of a particular area accessible by digitizing older tape recordings and presenting the corresponding media files and other information on a dedicated web landing page (see Daly River languages). Yet other projects explicitly compare diverse languages with regard to specific linguistic structures, bringing newly developed systems of data mark-up to bear on long-standing questions about the universality of certain patterns of language use (see Multi-CAST and SCOPIC).
