Towards an online museum of languages: Digitising records of the world’s 7,000 languages
By Associate Professor Nick Thieberger
First published in Museums Galleries Australia Magazine, Autumn-Winter 2018
What happens if you see a hole in the GLAM fabric that needs patching? We were in that position in 2002 as we looked at all the analog tape recordings made by our forebears — linguists, musicologists and ethnographers. No agency in Australia took responsibility for these analog tapes — hundreds of hours of recordings of stories, music, songs, ceremonies, from the Pacific, PNG, and Asia. These tapes were in basements, in filing cabinets, or within the deceased estate materials of former academic researchers who were mostly publicly funded and whose research outputs their universities had ostensibly undertaken to store in the long term.
We had great help from the NFSA and the NLA in determining what metadata and equipment standards were required, and we applied for Australian Research Council (ARC) funding for a year to help us build research infrastructure. With that, we started building an archive, the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC).
PARADISEC is a completely digital archive that is a collaboration between the University of Sydney, University of Melbourne, and the ANU. We are grateful to national storage and network programs like AARNet, GrangeNet, NCI, RDSI, ANDS, and Nectar that have supported our work over time. In 16 years the archive has grown to 46 terabytes, and contains material from 1,175 languages, now including more than 8,500 hours of audio recordings. While the initial focus was on the region around Australia, the collection now holds material from anywhere in the world. It is a significant collection that has been entered into the UNESCO Memory of the World register for preservation of intangible cultural heritage.
In the past, recordings were made on analog tape, whether reels or cassettes. These tapes are becoming unplayable as the media deteriorate. A further problem is that, as older technologies have fallen out of use, fewer and fewer playback machines remain available. We have been working with agencies in the Pacific (the Solomon Islands Museum, the Vanuatu Cultural Centre and the Divine Word University in Madang) to digitise their tapes. In each case, the tapes are an important symbol of local oral tradition as well as being a significant record in themselves, but they have often not been listened to for some time because there is no local playback equipment. Once we have digitised such tapes we send back a hard disk of the recordings, and also make them available via our online collection, subject to whatever access terms are specified.
Our initial motivation was to digitise analog tapes and to build a system for adding, cataloging and providing access to new recordings, but we soon began including digital records from current researchers. In fact, we are now advocating novel research practices in which recordings are archived as close to the time of their creation as possible, allowing them to have a persistent citation form that can be used in the research process.
Data citation and verification are becoming increasingly important to ensure research integrity. However, to be citable, research data needs to have persistent identification in an accessible repository. It also needs to have clear access conditions, so every collection incorporated in PARADISEC now has a deposit form to be filled out by the depositor or their executor. All users of the catalog sign in first to get access to any files, and so accept the access conditions we specify. We license all metadata using Creative Commons and assert the moral rights of the performers and/or rightful knowledge custodians.
Given that PARADISEC holds records made up to 70 years ago in one or more of over 1,175 languages, we cannot – as managers today – know the sensitive content of some of the recordings. Therefore, we also apply a ‘takedown’ principle, in case a member of a source community finds anything in the collection that should be closed. Our catalog home page has the following wording:
The catalog entry for an item is usually written by the depositor, and some are more detailed than others. In the case of collections that we have digitised from deceased researchers, we do the best we can to describe the records, but often there is little information available. By placing these items in the collection we hope that other researchers will enrich the descriptions as they use the material. We believe the rights in the material presented in this collection have been cleared by the depositors. Please let us know if you think that is not the case for any particular item.
We have three digitisation units: in Melbourne, Sydney and Canberra. Our online system (a Ruby on Rails application called Nabu) allows upload of the files to a directory where their filenames and formats are checked to ensure they conform to our requirements. Audio files are moved to be processed into Broadcast Wave Format (BWF) and mp3 files; text or image files are moved straight into the collection; and video is sent for transcoding into mp4, mxf, and JPEG2000. We are sent reports on success or failure of the transfers. A nightly backup of all files to QCIF in Brisbane keeps an offsite copy, and we periodically do a disaster recovery from that backup to ensure it is all working. The database is also backed up weekly.
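The checking and routing step described above can be sketched in a few lines of Python. This is an illustration only, not Nabu's actual code (Nabu is a Ruby on Rails application): the filename rule, the extension lists and the names `FILENAME_RULE` and `route_upload` are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical naming rule (an assumption, not the archive's documented
# convention): COLLECTION-ITEM-NAME.ext, e.g. "TD1-001-A.wav".
FILENAME_RULE = re.compile(r"^[A-Za-z0-9]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+\.[a-z0-9]+$")

# Assumed extension lists, chosen only to illustrate the three routes.
AUDIO = {"wav", "mp3"}
VIDEO = {"mov", "avi", "mp4"}
TEXT_OR_IMAGE = {"txt", "pdf", "xml", "jpg", "tif", "png"}


def route_upload(filename: str) -> str:
    """Return the queue an uploaded file would be sent to, or 'reject'."""
    name = Path(filename).name
    if not FILENAME_RULE.match(name):
        return "reject"  # filename does not conform to the naming rule
    extension = name.rsplit(".", 1)[1]
    if extension in AUDIO:
        return "audio_processing"  # to be converted to BWF and mp3
    if extension in VIDEO:
        return "video_transcoding"  # to be transcoded for preservation and access
    if extension in TEXT_OR_IMAGE:
        return "collection"  # moved straight into the collection
    return "reject"  # unknown format
```

With these assumed rules, `route_upload("TD1-001-A.wav")` returns `"audio_processing"`, while a file named `notes.txt` is rejected because its name does not identify a collection and item.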
There are three APIs (Application Programming Interfaces) from the catalog. One is the collection-level feed primarily for Research Data Australia and the National Library’s TROVE. The second is an OAI-PMH feed that is aimed at the Open Language Archives Community, a service that aggregates records from some 60 language archives worldwide. The third is a general API using GraphQL. All of these tools and operations, in combination, aim to maximise the discoverability of items in the collection.
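To give a sense of how an aggregator might read an OAI-PMH feed, here is a minimal Python sketch. The `ListRecords` verb, the `oai_dc` metadata prefix and the two XML namespaces come from the OAI-PMH and Dublin Core standards; the base URL, the sample response and the function names are placeholders invented for this example, not PARADISEC's real endpoint or records.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Placeholder endpoint (an assumption); substitute a real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Standard namespaces for OAI-PMH responses and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response, trimmed to the elements used below.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:TD1-001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample item</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL for a ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_titles(xml_text: str) -> list[tuple[str, str]]:
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs
```

An aggregator would fetch `list_records_url(BASE_URL)` over HTTP and pass the response body to `record_titles`; the sketch parses a canned response so that it runs without network access.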
In order to explore the contents of the PARADISEC collection, we recently worked on a virtual reality installation we called Glossopticon, which was displayed at the Canberra Museum and Gallery in late 2016. We extracted 20-second snippets of audio from within selected items, and then — using the catalog’s metadata, which includes geographic reference — we were able to plot the location of each snippet and present a topographic map of the Pacific with shards of light projecting from the earth to the sky, each of these representing a language.
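Cutting a 20-second snippet from a longer recording is straightforward for uncompressed audio. The sketch below uses Python's standard `wave` module and assumes the source is a plain PCM WAV file; the function name and parameters are illustrative, and this is not necessarily how the Glossopticon snippets were produced.

```python
import wave


def extract_snippet(src_path: str, dst_path: str,
                    start_seconds: float = 0.0,
                    length_seconds: float = 20.0) -> int:
    """Copy up to length_seconds of audio into a new WAV file.

    Returns the number of frames written, which is smaller than requested
    if the recording ends before the snippet does.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        channels = src.getnchannels()
        sample_width = src.getsampwidth()
        # Clamp the start so that we never seek past the end of the file.
        start_frame = min(int(start_seconds * rate), src.getnframes())
        src.setpos(start_frame)
        frames = src.readframes(int(length_seconds * rate))
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(sample_width)
        dst.setframerate(rate)
        dst.writeframes(frames)
    return len(frames) // (channels * sample_width)
```

Each snippet could then be paired with the coordinates recorded in its catalog entry to place it on the map.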
Flying through this universe of languages, the user hears these snippets grow louder on approach and fade away once passed. Some information about each language, such as its number of speakers and how much associated material has been recorded, is also presented. The Glossopticon has attracted a great deal of attention and is now in further development as a framework that can continue to absorb more snippets from researchers, custodians and archivists.
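One simple way to make a snippet louder as the listener approaches is to scale its volume by distance. The Python sketch below uses a linear falloff within an assumed audible radius; the function, its parameters and the falloff curve are illustrative assumptions, not a description of how the Glossopticon itself handles audio.

```python
import math


def snippet_gain(listener: tuple[float, float, float],
                 source: tuple[float, float, float],
                 audible_radius: float = 100.0) -> float:
    """Return a volume between 0.0 and 1.0 for a snippet at `source`.

    The gain is 1.0 at the source, falls off linearly with distance, and is
    0.0 at or beyond audible_radius (measured in scene units).
    """
    distance = math.dist(listener, source)
    return max(0.0, 1.0 - distance / audible_radius)
```

Under these assumptions a listener halfway to the edge of the audible radius hears the snippet at half volume, and a listener outside the radius hears nothing.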
The value of making the PARADISEC archive as discoverable as possible, and its potential for ongoing enrichment, was made clear when we had a request some time ago from Diana Looser, then a PhD candidate in Theatre at Cornell University in the USA, who was writing a dissertation on Oceanic theatre and drama. Looser was seeking access to a play that was listed in our catalog but existed nowhere else that she could find. In his collection TD1, the linguist Tom Dutton had included a tape of playwright Albert Toro’s Sugarcane Days, recorded from Australian Broadcasting Corporation (ABC) Radio Port Moresby in 1979. Looser subsequently transcribed this tape and prepared the only extant version of the original script, which she then redeposited in the PARADISEC collection. This re-use of research material in new ways can only be achieved if the material is stored in accessible locations, with licences for use in place and supported by a catalog that provides sufficient information to allow an item to be located.
PARADISEC is building a museum of languages, and we would like to explore how to build on existing models like Mundolingua (Paris, France) or the National Museum of Ethnology, Minpaku (Osaka, Japan). While our main tasks currently are the accession, description and curation of primary records, we know that these resources include wonderful performances that deserve to reach wider audiences.
Dr Nick Thieberger is a linguist who works with languages of Vanuatu and Australia and is an Australian Research Council Future Fellow at the University of Melbourne, Victoria. He is a Chief Investigator with the ARC Centre of Excellence for the Dynamics of Language.
Acknowledgement: The Glossopticon project is an ongoing collaboration between PARADISEC and Rachel Hendery and Andrew Burrell, whose work is gratefully acknowledged.
Text citation: Nick Thieberger, ‘Towards an online museum of languages: Digitising records of the world’s 7,000 languages’, Museums Galleries Australia Magazine, Vol. 26(2), Museums Galleries Australia, Canberra, Autumn-Winter 2018, pp. 52–55.