AsoteleLingua
Open Nigerian-language financial intelligence: bringing the economic forecasting and analysis that institutional clients pay for to the 100+ million Nigerians who don't speak English fluently. All four target-language sentiment classifiers are now trained on AfriSenti SemEval-2023. Hausa, Igbo, and Yoruba land in a 0.715–0.782 macro-F1 band; Pidgin lands at 0.460, a gap that workstream-2 finance-domain fine-tuning is designed to narrow (LINGUA Africa application in flight, June 2026).
Why this exists
Roughly 47% of Nigerians do not speak English at all, and of those who do, only 20–30% read or speak it at a fluent level — Nigeria ranks 30th globally on the EF English Proficiency Index. Yet the overwhelming majority of formal financial communication in Nigeria — CBN policy releases, NGX market reports, financial news, monetary policy analysis — is published exclusively in English. Economic intelligence in Nigeria is structurally gated by language.
The institutional gap is just as wide. The African Development Bank's 2023 benchmark of African central-bank macroeconomic models documents that quarterly projection models operate at eight African central banks (Cameroon, Egypt, Kenya, Malawi, Mozambique, Niger, Senegal, Zimbabwe) — none of them publicly released. The World Bank's open MFMod-ModelFlow ships six country models (Bolivia, Croatia, Iraq, Nepal, Pakistan, Türkiye) and includes no African country. Open, transparent macroeconomic analytics for Africa do not exist at scale.
Asotele builds that public macro-analytical layer. AsoteleLingua makes its outputs accessible in the languages most Nigerians actually speak.
Languages in scope
- Hausa — ~70M speakers, Northern Nigeria dominant. Classifier shipped 2026-06-02, F1 0.779.
- Yoruba — ~45M speakers, South-West Nigeria dominant. Classifier shipped 2026-06-04, F1 0.715.
- Igbo — ~30M speakers, South-East Nigeria dominant. Classifier shipped 2026-06-04, F1 0.782; live-corpus availability is the substrate gap workstream 1 is designed to fill.
- Nigerian Pidgin / Naija — ~80M speakers, the most widely spoken lingua franca across all regions. Classifier shipped 2026-06-08, F1 0.460 — substantially below the other three (smallest AfriSenti split + hybrid English/Nigerian register); workstream-2 finance-domain fine-tune is expected to close part of this gap.
Each language ships as a sentiment classifier, terminology dictionary, and generative model — released openly under permissive licenses (Apache 2.0, ODC-By, CC-BY-SA).
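The four languages and their classifier results can be held in one small registry keyed by language code. This is a minimal sketch, not the project's actual code: `LanguageConfig`, `LANGUAGES`, and `parse_args` are hypothetical names, and the keys are assumed to be ISO 639-3 codes (`yor` and `pcm` appear elsewhere on this page; `hau` and `ibo` are assumed to follow the same convention). The `--language` flag mirrors the one the fine-tune pipeline is described as using.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    name: str
    shipped: str     # classifier release date, as listed on this page
    macro_f1: float  # AfriSenti test macro-F1, as listed on this page


# Hypothetical registry; keys are assumed ISO 639-3 language codes.
LANGUAGES = {
    "hau": LanguageConfig("Hausa", "2026-06-02", 0.779),
    "yor": LanguageConfig("Yoruba", "2026-06-04", 0.715),
    "ibo": LanguageConfig("Igbo", "2026-06-04", 0.782),
    "pcm": LanguageConfig("Nigerian Pidgin", "2026-06-08", 0.460),
}


def parse_args(argv=None):
    """Parse a --language flag restricted to the registered codes."""
    parser = argparse.ArgumentParser(description="Hypothetical per-language entry point")
    parser.add_argument("--language", choices=sorted(LANGUAGES), required=True)
    return parser.parse_args(argv)


# Example: select Yoruba the same way `--language yor` would on the command line.
args = parse_args(["--language", "yor"])
cfg = LANGUAGES[args.language]
print(f"{cfg.name}: shipped {cfg.shipped}, macro-F1 {cfg.macro_f1:.3f}")
```

Keeping the per-language values in one place means adding a language is a one-line change, which is the property the shared pipeline relies on.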
Four-of-four classifiers — what's already shipped
Before applying for funding, we built all four language pilots end-to-end on commodity homelab hardware. The numbers below are measured, not promised:
- AfriBERTa-Large fine-tuned on AfriSenti SemEval-2023 Hausa (14,173 labelled tweets, 3-class). Test macro-F1 = 0.779. SemEval-2023 winning systems scored 0.81–0.83, so we land 3 to 5 points below the top on a single CPU fine-tune pass.
- AfriBERTa-Large fine-tuned on AfriSenti SemEval-2023 Yoruba (8,522 labelled tweets, 3-class). Test macro-F1 = 0.715. Same script, same hyperparameters, single CPU pass — `--language yor` slots into the same pipeline as Hausa.
- AfriBERTa-Large fine-tuned on AfriSenti SemEval-2023 Igbo (10,193 labelled tweets, 3-class; the second-largest of the four splits, after Hausa). Test macro-F1 = 0.782, the strongest of the classifiers for the three orthographic Nigerian languages.
- AfriBERTa-Large fine-tuned on AfriSenti SemEval-2023 Naija Pidgin (5,121 labelled tweets, 3-class — the smallest split). Test macro-F1 = 0.460. Substantially below the other three, reflecting smaller training data, lower annotator agreement on `pcm` in the original SemEval-2023 task, and Pidgin's hybrid English/Nigerian register. The workstream-2 finance-domain fine-tune (BBC News Pidgin, Wazobia FM, Naija FM transcripts) is expected to close part of this gap once that source corpus is in place.
- Confidence lift on real-world BBC headlines, where measurable. Hausa: AfriBERTa 0.716 vs mDeBERTa baseline 0.463 = +0.253 lift, 75% label agreement. Yoruba: AfriBERTa 0.732 vs mDeBERTa baseline 0.516 = +0.216 lift, 67% label agreement. Igbo cannot yet be tested on a live BBC corpus — see the Igbo blog post for the corpus-availability finding. Pidgin live-corpus test pending workstream-2.
- Openly released: data scrapers, fine-tune script, applied-model script, terminology dictionary, READMEs, and three public blog posts: "The Hausa pilot", "Yoruba parity + the Igbo finding", and "Pidgin closes the loop + the honest F1 gap".
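The confidence-lift and label-agreement figures in the list above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: "confidence" is taken to mean the probability each model assigns to its own predicted label, averaged over headlines, and "label agreement" the share of headlines on which both models predict the same label. The page does not spell out either definition, and the `Prediction` type and the toy values are hypothetical, not project data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Prediction:
    label: str         # predicted class: "negative", "neutral", or "positive"
    confidence: float  # probability assigned to the predicted class


def confidence_lift(fine_tuned, baseline):
    """Mean top-class confidence of the fine-tuned model minus the baseline's."""
    return mean(p.confidence for p in fine_tuned) - mean(p.confidence for p in baseline)


def label_agreement(fine_tuned, baseline):
    """Share of headlines on which both models predict the same label."""
    matches = sum(a.label == b.label for a, b in zip(fine_tuned, baseline))
    return matches / len(fine_tuned)


# Made-up predictions for four headlines, aligned by position.
afriberta = [
    Prediction("negative", 0.80),
    Prediction("neutral", 0.70),
    Prediction("positive", 0.90),
    Prediction("neutral", 0.60),
]
mdeberta = [
    Prediction("negative", 0.50),
    Prediction("neutral", 0.40),
    Prediction("positive", 0.60),
    Prediction("positive", 0.50),
]

print(f"lift = {confidence_lift(afriberta, mdeberta):+.3f}")
print(f"agreement = {label_agreement(afriberta, mdeberta):.0%}")
```

A higher mean confidence does not by itself mean higher accuracy, which is why the lift is reported next to label agreement rather than alone.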
Three of four classifiers (Hausa, Igbo, Yoruba) land in a 0.715–0.782 macro-F1 band, which supports AfriBERTa pretraining transfer as the right substrate. Pidgin's lower number is the gap that workstream-2 funding is designed to address.
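For readers less familiar with the metric: macro-F1 is the unweighted mean of the per-class F1 scores, so the negative, neutral, and positive classes count equally regardless of how many tweets each contains. The sketch below computes it in plain Python on made-up labels. It illustrates the metric only; the project's evaluation code may well use a library routine instead.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 over the given labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy example with six tweets (not project data).
y_true = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "neutral", "positive", "negative"]

print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case the per-class F1 scores are 0.5, 0.8, and 2/3, so one weak class pulls the average down even when the others do well. That is the same mechanism by which a small, noisy split depresses a headline macro-F1.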
Planned architecture under LINGUA Africa funding
- Asotele pipeline (already operational, upstream context)
- Workstream 1: Multilingual corpus
- Workstream 2 (one per language): Sentiment classifiers
- Workstream 3: Multilingual generation
- Workstream 4: Distribution
Community engagement is a workstream, not a footer
- Pre-launch listening sessions — ~60 SME owners, market traders, journalists, and community organisers across the four target languages, surfacing what economic information they actually need.
- Native-speaker linguistic review boards — paid contractors, not volunteers. 3–5 reviewers per language reviewing every model release for accuracy, fluency, register, and cultural appropriateness.
- Open feedback channels — public issue tracker on the project repository plus partner-run community feedback groups.
- Annual community gathering — Lagos convening with all partners and language community representatives to review what worked and what to build next.
Open release commitment
Everything ships under open licenses, permissive wherever the base model's terms allow:
- Pipeline + fine-tune code — Apache 2.0
- Aligned datasets and benchmarks — ODC-By, on HuggingFace
- Terminology dictionaries and technical report — CC-BY-SA
- Sentiment classifiers — Apache 2.0, on HuggingFace
- Generative models — Apache 2.0 (Qwen base) and Gemma license (Gemma base)
The replication guide is itself a deliverable — the architecture is explicitly designed to extend to Swahili, Amharic, isiZulu, and other African language families with minimal rework.
Status
- Hausa classifier shipped (2026-06-02): sentiment classifier (F1 0.779), terminology dictionary, BBC Hausa scraper. Code + model on the homelab.
- Yoruba classifier shipped (2026-06-04): F1 0.715, BBC Yoruba scraper, mDeBERTa baseline + AfriBERTa comparison (+0.216 lift). Same pipeline as Hausa, parametrized by `--language`.
- Igbo classifier shipped (2026-06-04): F1 0.782, the strongest of the three orthographic Nigerian languages. A live BBC Igbo finance corpus is close to non-existent, which is itself the argument for workstream 1.
- Pidgin classifier shipped (2026-06-08): F1 0.460 — substantially below the other three; honest gap reflecting smaller `pcm` AfriSenti split and Pidgin's hybrid register. Workstream-2 finance-domain fine-tune is expected to close part of the gap once the Pidgin source corpus is in place.
- Funding application in flight: LINGUA Africa Open Call (Microsoft Research + Google.org + Gates Foundation, administered with Masakhane). Deadline 2026-06-15.
- Partner outreach in progress: Masakhane, Daily Trust Foundation, Wazobia FM, Lagos Business School SIDFI.
- Advisor outreach in progress: early-career researchers in African NLP (Saarland, Mila, U. Ibadan) plus mid-career economists and AI-policy advisors.
Last updated: 2026-06-14