Kencorpus: Kenyan Languages Corpus
Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)
Field | Value | |
Title |
Kencorpus: Kenyan Languages Corpus
|
|
Identifier |
https://doi.org/10.7910/DVN/6N5V1K
|
|
Creator |
Wanjawa, Barack
Wanzare, Lilian D.A.
Indede, Florence
McOnyango, Owen
Ombui, Edward
Muchemi, Lawrence |
|
Publisher |
Harvard Dataverse
|
|
Description |
This project collected text and speech corpora for languages spoken in Kenya. For the KenCorpus project, three languages were strategically selected: Kiswahili, Luhya, and Dholuo. Luhya has several dialects, of which three were chosen as a start: Lumarachi, Logooli and Lubukusu. Primary data was collected from the respective language communities and included indigenous stories and other narratives from student compositions, native-language media stations, and publishers. The collection went beyond the conventional religious texts to include other genres, making the corpus more representative of everyday language use in the communities.
Text data: A total of 4,442 texts were collected, including 546 texts for Dholuo, 483 texts for Luhya-Lumarachi, 135 texts for Luhya-Lubukusu and 359 texts for Luhya-Logooli.
Spontaneous speech data: A total of 1,152 files were collected, amounting to 176hr 29min 46sec of spontaneous speech: 104 files (19hr 10min 57sec) for Swahili, 512 files (99hr 3min 8sec) for Dholuo, 138 files (15hr 37min 46sec) for Luhya-Lumarachi, 354 files (30hr 11min) for Luhya-Lubukusu and 44 files (12hr 26min 55sec) for Luhya-Logooli.
Acknowledgement of data collectors:
Kiswahili: Rose Felynix, Khalid Kitito, Dr. Benard Okal
Luo: Jotham Ondu Ajiki, Dr. Jackline Okello, Jonathan Muga, Mercy Lavinca Oduoll
Luhyia (Logooli): Salano Odari, Dr. Phillip Lumwamu
Luhyia (Bukusu): Mactilda Nekesa Makana, Mulwale Martin
Luhyia (Marachi): Yonah Weunda |
|
Subject |
Computer and Information Science
Social Sciences
Datasets
low resource languages
African languages
Dataset curation |
|
Language |
Swahili
|
|
Date |
2022-05-04
|
|
Contributor |
WANZARE, LILIAN D.
LACUNA Fund
Maseno University |
|
Type |
Text
Audio |
|
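
The speech totals quoted in the Description field can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the per-language figures are copied from the description above, while the variable names and the dictionary layout are assumptions made for this example and are not part of the dataset or of any Dataverse tooling.

```python
from datetime import timedelta

# Per-language speech figures copied from the Description field:
# (number of files, hours, minutes, seconds). The layout is illustrative only.
speech = {
    "Swahili": (104, 19, 10, 57),
    "Dholuo": (512, 99, 3, 8),
    "Luhya-Lumarachi": (138, 15, 37, 46),
    "Luhya-Lubukusu": (354, 30, 11, 0),
    "Luhya-Logooli": (44, 12, 26, 55),
}

# Sum the file counts across all languages.
total_files = sum(files for files, *_ in speech.values())

# Sum the durations; timedelta handles the minute and second carries.
total_time = sum(
    (timedelta(hours=h, minutes=m, seconds=s) for _, h, m, s in speech.values()),
    timedelta(),
)

# Convert back to hours, minutes and seconds for display.
hours, remainder = divmod(int(total_time.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)

print(f"{total_files} files, {hours}hr {minutes}min {seconds}sec")
# prints "1152 files, 176hr 29min 46sec"
```

The result agrees with the stated totals of 1,152 files and 176hr 29min 46sec.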