Data Archives

All of the primary and derived material produced by the Centre is prepared so that it can be archived in an accessible form and re-used later.

Accordingly, we have established a repository for our data that is integrated with the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC).

We have begun adapting PARADISEC’s catalogue (NABU) to the needs of the Centre by adding fields to the catalogue, developing a better upload system for primary records, and building streaming access to media that can be delivered on mobile devices.

We have successfully applied for disk space from the Research Data Storage Infrastructure (RDSI), the federal government-funded data storage system.

Each program will produce varying kinds of material (recorded narratives, conversations, child language records, dictionaries, scholarly articles, statistical analyses and so on), and we aim to archive as much of it as possible. For records of small languages we will create a corpus that supports the usual corpus operations (search, concordance, collocational search, frequency counts, visualisations based on well-structured data, etc.).
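As a rough illustration of what these corpus operations involve, the following is a minimal sketch in Python. It is not the Centre's implementation: the function names, the whitespace-and-punctuation tokeniser, and the invented English sample sentence are all assumptions made for the example. A real corpus for a small language would need tokenisation that respects that language's orthography.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens.

    Assumption: words are runs of letters, digits or underscores. Many
    orthographies (e.g. those using apostrophes as letters) need more care.
    """
    return re.findall(r"\w+", text.lower())


def frequency_counts(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    keyword = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def collocates(tokens, keyword, window=2):
    """Count the tokens that occur within `window` positions of the keyword."""
    keyword = keyword.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Invented placeholder text, not data from any Centre collection.
sample = "The dog saw the cat and the cat ran"
tokens = tokenize(sample)

print(frequency_counts(tokens).most_common(2))
for left, key, right in concordance(tokens, "cat"):
    print(f"{left:>10} [{key}] {right}")
print(collocates(tokens, "cat").most_common(3))
```

The three functions share one token list, which is the main design point: once records are stored as well-structured data, search, concordance and collocation become different views of the same underlying sequence.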

We will also build playable online texts (as in the current EOPAS model) that will ultimately provide stories, with interlinear text and media, for as many languages as possible, together with a map interface to make them attractive to general users.

We have been identifying existing collections that will be made accessible via the Centre, for example:

  • the work done by the Aboriginal Child Language Acquisition project
  • a collection of recordings made in the Daly region of northern Australia since the 1980s
  • the late Darrell Tryon’s collection of several hundred cassettes spanning his research career.

These kinds of collections will need significant work, as they are in varying states of repair or are analogue tapes that will need digitising. As with much current research practice in the humanities, little attention was paid at the time to data structures, file naming, cataloguing and other aspects of data management.

The Centre is developing appropriate methods to ensure that the records produced are accessible and reusable. These records include primary recordings, photos, fieldnotes and their derivative annotations, texts, dictionaries and so on. We have committed to preparing accessible collections of material in at least 30 indigenous languages of Australia and its region.

To get an idea of how much is known about the world’s languages, we have partnered with the Open Language Archives Community (OLAC) and arranged for them to take a quarterly timeslice of their aggregated list of information held by all of their participating archives. From these timeslices we will be able to identify particular languages we have been working with and to observe the increase in materials available for those languages.
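The comparison of quarterly timeslices can be pictured with a small Python sketch. Everything here is hypothetical: it assumes each timeslice has already been reduced to a mapping from a language identifier to the number of catalogued items, and the identifiers and counts are invented placeholders. It does not use or describe any actual OLAC interface or data format.

```python
def growth_between(earlier, later, languages):
    """Return the change in item counts for the chosen languages.

    `earlier` and `later` map a language identifier to the number of items
    recorded in that timeslice; a language absent from a timeslice counts as 0.
    """
    return {
        code: later.get(code, 0) - earlier.get(code, 0)
        for code in languages
    }


# Invented placeholder timeslices, not real archive figures.
first_quarter = {"lang-a": 10, "lang-b": 3}
second_quarter = {"lang-a": 14, "lang-b": 3, "lang-c": 2}

# Languages a researcher has been working with (hypothetical identifiers).
centre_languages = ["lang-a", "lang-c"]

change = growth_between(first_quarter, second_quarter, centre_languages)
for code, delta in change.items():
    print(f"{code}: {delta:+d} items")
```

Treating a missing language as zero means that a language first appearing in a later timeslice shows up as growth rather than as an error.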

Centre researchers record various types of material, including media and transcripts based in different annotation traditions (language documentation, child language acquisition, language variation and so on), and we will develop metadata systems that allow each of these to be adequately described for later retrieval.

We plan to develop a metadata entry tool that will make it easier for researchers to write good descriptions, in standard formats, of the files they create. Currently we have ExSite9, a tool built by PARADISEC, and we will explore whether it can serve as a basis for future development.

Finally, we want to encourage the use of a simple guide to each collection that includes contextual information about how the collection came into being, what purpose it was created for, what other material there is (academic articles about it etc.), and metadata on the participants. See the example here:
