OSCAAR: The Online Speech/Corpora Archive and Analysis Resource

Welcome to the Online Speech/Corpora Archive and Analysis Resource

OSCAAR is a project of the Northwestern University Linguistics Department that was launched in Fall 2009. The goal of OSCAAR is to provide a secure, web-accessible, and extensible repository for the many speech recordings and experimental materials of the department's researchers, and for linguistic and speech science researchers elsewhere. OSCAAR is a work in progress and is not yet fully operational or fully available for use, but we would be happy to hear from you if you are interested in accessing OSCAAR's data or using OSCAAR to house your own data.

A recent, short paper in the Proceedings of the Chicago Colloquium on Digital Humanities and Computer Science discusses our motivation for OSCAAR and some examples of its use. In many ways, our vision of OSCAAR is influenced by the Sociolinguistic Archive and Analysis Project (SLAAP) at North Carolina State University. If you are interested in the OSCAAR project, you may find the information on the SLAAP website relevant.

OSCAAR is home to a growing number of data collections, including the following:

Wildcat Corpus
The Wildcat corpus of native- and foreign-accented English is a corpus of scripted and spontaneous speech recordings from 24 native speakers of American English and 52 non-native speakers of English. The core element of this corpus is a set of spontaneous speech recordings, for which a new method of eliciting dialogue-based, laboratory-quality speech recordings was developed (the Diapix task). In addition to scripted materials (e.g., ...) the corpus includes dialogues between two native speakers of English, between two non-native speakers of En...

Collection Type: Production Recordings
URL: http://sites.google.com/site/nuwildcatcorpus/
Language(s): Native and Non-Native English



LUCID
The LUCID corpus contains spontaneous and scripted speech recordings in casual and clear speaking styles for 40 native talkers of British English. The spontaneous speech recordings were collected using the Diapix task methodology (Van Engen et al., 2010), which is a collaborative task completed by two people who have to find a set of differences between two versions of the same picture without seeing each other's version. The task results in speech that is spontaneous and natural-sounding but also constrained by topic and words. A new set of tas...

Collection Type: not on file
URL: http://www.ucl.ac.uk/psychlangsci/research/speech/variability
Language(s): native British English, non-native English



NUFAESD
The Northwestern University Foreign-Accented English Speech Database (NUFAESD)

Overview:

  • 64 simple English sentences (BKB lists 7, 8, 9, 10)
  • 32 talkers from various native language backgrounds: Chinese (n=20), Korean (n=5), Bengali (n=1), Hindi (n=1), Japanese (n=1), Romanian (n=1), Slovakian (n=1), Spanish (n=1), and Thai (n=1)
  • A total of 2,048 recorded sentences

For each talker, the database has:

  • demographic information
  • a sentence production score (i.e. average s...

Collection Type: Speech Recordings
Language(s): Foreign-accented English


OSCAAR is a project of the Northwestern University Linguistics Department. We are grateful to the following funding sources for their support: NIDCD grants R01DC005794 and R56DC005794 (National Institutes of Health, National Institute on Deafness and Other Communication Disorders) and the Hugh Knowles Center for Clinical and Basic Science in Hearing and Its Disorders at Northwestern University.

Found a bug, have a suggestion, or need help?
Contact: Tyler Kendall
Last Mod: 12/7/2011