Record Details

KenSpeech: Swahili Speech Transcriptions

Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)


Field	Value

Title	KenSpeech: Swahili Speech Transcriptions

Identifier	https://doi.org/10.7910/DVN/YHXJSU

Creator	Awino, Dorcas; Muchemi, Lawrence; Wanzare, Lilian D.A.; Ombui, Edward; Wanjawa, Barack; McOnyango, Owen; Indede, Florence

Publisher	Harvard Dataverse

Description	This speech dataset contains read and spontaneous speech recorded in Kenya with native Swahili speakers. It totals 27 hours 31 minutes 50 seconds of speech from 26 speakers (19 female, 7 male). Recordings are in .wav format: 16-bit, 16 kHz, mono, little-endian. Read speech accounts for 26 hours 32 minutes 37 seconds and spontaneous speech for 59 minutes 13 seconds. Each audio file has a corresponding transcript; for example, the audio file tweet_5701.wav in the audios folder corresponds to the transcript file tweet_5701.txt in the transcripts folder. The dataset also includes a phone list file, kencorpus.phone, containing all the Swahili phones used by KenCorpus. The phone list was used to build the KenCorpus Swahili lexicon-phone dictionary, kencorpus.dic, which maps every word in the KenCorpus transcripts to its pronunciation in terms of those phones. The dictionary contains over 30,000 words. Acknowledgement of data curators: Dorcas Awino, Dr. Benard Okal, Khalid Kitito, Owiny Japheth Otieno.

Subject	Computer and Information Science; Social Sciences; speech; Speech Synthesis; Transcriptions

Contributor	WANZARE, LILIAN D.; LACUNA Fund; Maseno University
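
The description's file-naming rule and audio format can be checked with a short script. What follows is a minimal sketch, not part of the dataset: the function names are hypothetical, and it assumes the audios and transcripts folders sit side by side under a common root, which the record does not state explicitly. It uses only the Python standard library.

```python
import wave
from pathlib import Path


def transcript_for(audio_path, transcripts_dir="transcripts"):
    """Map audios/<name>.wav to transcripts/<name>.txt.

    Assumption: the transcripts folder is a sibling of the audio folder.
    """
    audio_path = Path(audio_path)
    root = audio_path.parent.parent
    return root / transcripts_dir / (audio_path.stem + ".txt")


def matches_stated_format(wav_path):
    """Return True if a file is 16-bit, 16 kHz, mono, as the record states."""
    with wave.open(str(wav_path), "rb") as w:
        return (
            w.getnchannels() == 1        # mono
            and w.getsampwidth() == 2    # 2 bytes per sample = 16-bit
            and w.getframerate() == 16000
        )


def hms_to_seconds(hours, minutes, seconds):
    """Convert a duration given as hours, minutes, seconds to seconds."""
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    # File pairing example taken from the description.
    print(transcript_for("audios/tweet_5701.wav").as_posix())

    # The read and spontaneous durations should add up to the stated total.
    read = hms_to_seconds(26, 32, 37)
    spontaneous = hms_to_seconds(0, 59, 13)
    total = hms_to_seconds(27, 31, 50)
    print(read + spontaneous == total)
```

Running `matches_stated_format` over every file in the audios folder after download would flag any recording that deviates from the stated format before it reaches a training pipeline.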