Speech Recognition Corpus

A speech corpus (or spoken corpus) is a database of speech audio files and their text transcriptions. In speech technology, speech corpora are used, among other things, to build acoustic models, which can then be used with a speech recognition engine.

CORPUS TEXT

1. Names of 3540 major trains of Indian Railways
Duration: 200 hours

2. Names of 490 major stations of Indian Railways
Duration: 100 hours

3. Names of 50 major crops and agriculture-related words in India
Duration: 100 hours

4. 1180 sentences of travel-related queries, each including a place name, mode of transport and date of travel
Duration: 400 hours

CORPUS SPEAKER DISTRIBUTION

Native Languages

Language     Distribution
Bangla       15%
Gujarati     10%
Haryanavi    5%
Hindi        35%
Marathi      5%
Punjabi      5%
Tamil        5%
Telugu       20%

Age

Age          Distribution
18-30        50%
30-40        25%
40-50        25%

Sex

Sex          Distribution
Male         63%
Female       37%

DIRECTORY AND FILE STRUCTURE

The speech and associated data are organized according to the following hierarchy:

/<CORPUS>/<REGION>/<SEX><SPEAKER_ID>/<UTTERANCE_ID>.<FILE_TYPE>

where,

CORPUS := SCASR
REGION := south | west | east | north
LANGUAGE := english | hindi
SEX := m | f
NATIVE := UP | MP | Bihar | Tamil Nadu | West Bengal | Maharashtra | Punjab | Haryana | Orissa | Bangalore | Andhra Pradesh | Gujarat
SPEAKER_ID := <INITIALS><DIGIT>
where,
INITIALS := speaker initials, 3 letters
DIGIT := number 0-9 to differentiate speakers with identical initials
UTTERANCE_ID := <TEXT_TYPE><SENTENCE_NUMBER>
where,
TEXT_TYPE := tr | st | da
SENTENCE_NUMBER := 1,2,3,…
FILE_TYPE := wav | txt

Examples:
/SCASR/south/mume0/tr1.wav
(SCASR corpus, region south, male speaker, speaker-id ume0, sentence text tr1, speech waveform file)

/SCASR/east/fpoo5/st1.txt
(SCASR corpus, region east, female speaker, speaker-id poo5, sentence text st1, transcription file)
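
The naming scheme can be checked mechanically. The following is a minimal Python sketch, assuming every path has exactly the shape of the two examples above: corpus, region, a sex letter fused with a four-character speaker ID, then text type, sentence number and file extension. The names PATH_RE, UtterancePath and parse_utterance_path are hypothetical and are not part of the corpus distribution.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the two example paths; it is an assumption that
# all files in the corpus follow this shape.
PATH_RE = re.compile(
    r"^/?(?P<corpus>SCASR)"
    r"/(?P<region>south|west|east|north)"
    r"/(?P<sex>[mf])(?P<initials>[A-Za-z]{3})(?P<digit>[0-9])"
    r"/(?P<text_type>tr|st|da)(?P<sentence_number>[0-9]+)"
    r"\.(?P<file_type>wav|txt)$"
)


class UtterancePath(NamedTuple):
    corpus: str
    region: str
    sex: str
    speaker_id: str        # three initials plus one disambiguating digit
    text_type: str
    sentence_number: int
    file_type: str


def parse_utterance_path(path: str) -> Optional[UtterancePath]:
    """Split a corpus path into its fields; return None if it does not match."""
    m = PATH_RE.match(path)
    if m is None:
        return None
    return UtterancePath(
        corpus=m["corpus"],
        region=m["region"],
        sex=m["sex"],
        speaker_id=m["initials"] + m["digit"],
        text_type=m["text_type"],
        sentence_number=int(m["sentence_number"]),
        file_type=m["file_type"],
    )
```

For instance, parse_utterance_path("/SCASR/south/mume0/tr1.wav") yields region "south", sex "m", speaker ID "ume0", text type "tr" and sentence number 1. Returning None for non-matching paths, rather than raising, makes it easy to filter stray files while walking the directory tree.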

FILE TYPES

The SCASR corpus includes two files for each utterance: a speech waveform file (wav) and a text file (txt). The text file contains the word-level transcription together with the speaker characteristics described above.
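
Because each utterance should come as a wav/txt pair, a quick consistency check is to group files by their path without the extension. The sketch below is illustrative only: the function name pair_utterance_files is hypothetical, and it works on a plain list of path strings instead of reading the file system or the contents of the text files, whose internal layout is not specified here.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Tuple


def pair_utterance_files(
    paths: Iterable[str],
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Group paths by utterance and report which utterances are incomplete.

    Returns (complete, incomplete): `complete` maps the extension-less path
    to a (wav_path, txt_path) tuple; `incomplete` is a sorted list of
    extension-less paths that lack either the wav or the txt file.
    """
    by_stem: Dict[str, Dict[str, str]] = {}
    for p in paths:
        pp = PurePosixPath(p)
        ext = pp.suffix.lstrip(".").lower()
        if ext not in ("wav", "txt"):
            continue  # ignore anything that is not one of the two file types
        key = str(pp.with_suffix(""))
        by_stem.setdefault(key, {})[ext] = p

    complete = {
        key: (files["wav"], files["txt"])
        for key, files in by_stem.items()
        if len(files) == 2
    }
    incomplete = sorted(key for key, files in by_stem.items() if len(files) != 2)
    return complete, incomplete
```

Given the waveform and transcription for /SCASR/south/mume0/tr1 plus only the transcription for /SCASR/east/fpoo5/st1, the first utterance is reported as complete and the second as incomplete.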