Digitisation of Oral Data for NLP of Low-Resource Languages

Practical Methods and Processes for Scalable and Sustainable Ecosystem Development

Introduction 

Africa is home to over 2,000 languages,¹ making it one of the most linguistically diverse regions in the world. Yet, despite being spoken by millions, most African languages remain underrepresented in AI and Natural Language Processing (NLP) systems.² Only a handful of languages, such as English, Mandarin, and French, have significant digital representation. This leaves low-resource languages (LRLs) in a state of digital silence, creating barriers to knowledge, connection, and inclusion.

The challenges facing African low-resource languages are multi-layered:

  • Data scarcity: Many languages lack large text or speech datasets, limiting the ability of AI systems to learn effectively. 
  • Oral tradition: Some languages are primarily spoken with little written material, making corpus creation difficult. 
  • Linguistic complexity: Features such as tonal shifts (e.g., in Igbo and Yorùbá) or critical diacritics (e.g., ṣ vs. s in Yorùbá) are often lost in preprocessing, which reduces model accuracy. 
  • Framework limitations: Most mainstream NLP tools assume structures and rules that do not apply to many African languages, which results in poor performance. 
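The point about diacritics being lost in preprocessing can be illustrated with a minimal Python sketch. It assumes only the standard-library `unicodedata` module; the function names and the sample strings are illustrative and are not taken from the playbook. The first function mimics a common "ASCII-folding" cleanup step that removes tone marks and underdots, and the second shows a gentler alternative that only unifies equivalent Unicode encodings.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Lossy step: decompose characters, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_keep_diacritics(text: str) -> str:
    """Lossless step: unify equivalent encodings (NFC) and keep all marks."""
    return unicodedata.normalize("NFC", text)


# Three illustrative forms that differ only in the tone mark on the final
# vowel, written with escapes so the code points are unambiguous.
variants = ["\u1ecdk\u1ecd", "\u1ecdk\u1ecd\u0301", "\u1ecdk\u1ecd\u0300"]

print(strip_diacritics("Yorùbá"))                             # Yoruba
print(len({strip_diacritics(v) for v in variants}))           # 1
print(len({normalize_keep_diacritics(v) for v in variants}))  # 3
```

After stripping, the three distinct forms collapse into a single string, so a model trained on the cleaned text can no longer tell them apart. NFC normalization keeps them distinct while still ensuring that the same character typed in two different ways compares as equal.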

These challenges are not just technical; they impact access and equity.³ For instance, many African languages are unsupported by tools like Google Translate, Siri, or Alexa, making digital resources inaccessible to people who speak neither English nor French.

Left to right: Chimaoge Esotu, Perry Hewitt, Danil Mikhailov, Aanu Oyeniran (first row), Olubayo Adekanmbi, Uyi Stewart, Chinazo Anebelundu, Oluwaseun Abdul at the Data Scientists Network in Lagos, Nigeria.

Explore this playbook 

The playbook opens by setting out a deliberately ambitious and human-centred vision for the digitisation of African low-resource languages.  

Chapter One positions this work not as a narrow technical exercise but as a collective undertaking that sits within a broader ecosystem of human collaboration, policy reform, and technological innovation. It reframes Africa’s immense linguistic diversity—so often treated as an obstacle—as a reservoir of cultural wealth, arguing that language itself is a form of digital equity and a cultural catalyst. The chapter insists that the disappearance of a language from digital spaces is never a trivial loss of vocabulary; it is the disappearance of a worldview. The playbook is presented as a direct response to this crisis of digital exclusion. 

Chapter Two shifts the focus from vision to the technical realities of digitising African languages for NLP tasks, especially the collection and processing of audio (oral) data. It argues for a community-driven, ethical workflow for data collection and cleaning. This workflow unfolds through five stages: Foundational Readiness and Ethical Grounding, Ontology Development and Prompt Design, Participant Recruitment and Distribution Logic, Data Collection and Technical Quality Assurance, and Data Processing and Validation. The chapter makes a pointed argument for large-scale, community-led data collection efforts and contends that only through collaboration among linguists, native speakers, researchers, and technologists can African languages move from digital silence to meaningful digital presence.

Chapter Three confronts the question that haunts most initiatives in African NLP: how can digitisation scale sustainably, rather than appearing in isolated bursts of activity? Drawing on a combined methodology of literature review and a continent-wide survey of ninety practitioners, the chapter maps current capacities, gaps, and opportunities across the ecosystem. It reveals persistent fragmentation: many projects concentrate on a single language, and regions receive uneven attention. Anchored in a Theory of Change perspective, the chapter argues that sustainable scaling requires more than additional data and better tools. It demands participatory models, ethical frameworks, and accessible infrastructure that communities can use autonomously, without reliance on external gatekeepers.

1  Ethnologue. Africa. https://www.ethnologue.com/region/Africa/

2  Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. “The state and fate of linguistic diversity and inclusion in the NLP world.” arXiv preprint arXiv:2004.09095 (2020). 

3  Adebara, Ife. AI and Language Data Flaring in Africa: Addressing the Low-Resource Challenge. Policy Brief No. 21. Centre for International Governance Innovation. https://www.cigionline.org/publications/ai-and-language-data-flaring-in-africa-addressing-the-low-resource-challenge/