Pidgin closes the loop — and the honest F1 gap that comes with it
The headline is clean: AsoteleLingua now has end-to-end sentiment classifiers for all four target Nigerian languages — Hausa, Yoruba, Igbo, Pidgin — trained, evaluated, and saved on the homelab. The full multilingual surface the LINGUA Africa proposal scopes is no longer aspirational. It exists, on disk, in models/nlp/afriberta-naija-pidgin-sentiment/ next to the other three.
The Pidgin number itself is much lower than the other three, and it is worth reporting plainly.
The 4-of-4 measurement table
finetune_afriberta_sentiment.py --language pcm ran the AfriSenti SemEval-2023 Naija-Pidgin split (5,121 train / 1,281 dev / 4,154 test labelled tweets) on the homelab CPU, 3 epochs at batch 16, learning rate 2e-5 — same hyperparameters as the other three. The full table now reads:
| Language | AfriSenti train set | Test macro-F1 | Live BBC lift (where measurable) |
|---|---|---|---|
| Hausa | 14,173 | 0.779 | +0.253 over mDeBERTa cross-lingual baseline |
| Igbo | 10,192 | 0.782 | n/a — BBC Igbo finance feed is structurally empty |
| Yoruba | 8,522 | 0.715 | +0.216 over mDeBERTa cross-lingual baseline |
| Pidgin | 5,121 | 0.460 | — |
Pidgin trails the rest of the table by a wide margin: 0.255 below Yoruba, the next-lowest score. The other three sit in a 0.715–0.782 band that says “AfriBERTa pretraining transferred well, fine-tuning closed the rest of the distance.” Pidgin sits in a different regime entirely.
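To make the size of that gap concrete, here is a small sketch that recomputes it from the table. The figures are copied from the table above; the `RESULTS` dictionary and the `gap_to_band` helper are illustrative names, not part of the AsoteleLingua codebase.

```python
# Figures copied from the measurement table; nothing here is newly measured.
RESULTS = {
    "Hausa":  {"train": 14_173, "macro_f1": 0.779},
    "Igbo":   {"train": 10_192, "macro_f1": 0.782},
    "Yoruba": {"train": 8_522,  "macro_f1": 0.715},
    "Pidgin": {"train": 5_121,  "macro_f1": 0.460},
}


def gap_to_band(results, target="Pidgin"):
    """Return (band_low, band_high, gap_to_low, gap_to_mean) for `target`.

    The "band" is the min/max macro-F1 over every language except `target`.
    """
    others = [v["macro_f1"] for k, v in results.items() if k != target]
    low, high = min(others), max(others)
    mean = sum(others) / len(others)
    score = results[target]["macro_f1"]
    return low, high, low - score, mean - score


low, high, gap_low, gap_mean = gap_to_band(RESULTS)
print(
    f"band {low:.3f}-{high:.3f}, "
    f"gap to band floor {gap_low:.3f}, gap to band mean {gap_mean:.3f}"
)
```

Measured against the floor of the band the gap is about 0.26; measured against the mean of the other three it is about 0.30. Either way it is larger than the entire spread among Hausa, Igbo, and Yoruba.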
Why Pidgin lags — what we think and what we don’t yet know
A few honest reads on what’s likely going on, in descending order of confidence:
1. Smallest train set. Pidgin’s AfriSenti split is 5,121 examples vs Hausa’s 14,173. Less data, less signal — and AfriSenti’s Pidgin annotations themselves had lower inter-annotator agreement than the other Nigerian languages in the original SemEval-2023 paper. The shared-task winning systems for pcm topped out lower than for hau for the same reason. Getting to 0.46 with this much data is not catastrophic; getting to 0.78 with this much data would have been the surprise.
2. Pidgin’s hybrid character. Naija Pidgin shares much of its vocabulary with English, then layers Yoruba, Igbo, and Hausa grammatical patterns on top. AfriBERTa-Large’s pretraining was on 11 African languages including Pidgin — but the proportion of Pidgin in that pretraining corpus, relative to its volume on the open web, is small. The model has seen Pidgin, but not at the depth it’s seen Hausa.
3. Sentiment in Pidgin is harder to label. Irony, sarcasm, and mid-sentence code-switching between Pidgin and English are common features of how Nigerians actually use Pidgin online, arguably more so than in the other three languages. The 3-class label set (positive / negative / neutral) compresses that into a coarse signal. A macro-F1 of 0.46, against roughly 0.33 for uniform random guessing on a class-balanced 3-way task (and lower still when classes are imbalanced), still says the model is learning something real; it’s just that the ceiling is lower than for languages with a more conventional written register.
4. We have not yet tested it on a live finance corpus. Hausa and Yoruba got out-of-distribution tests on real BBC headlines. BBC News Pidgin is a real, well-edited Pidgin source with a finance angle — but the scraper for it is one of the items the LINGUA Africa workstream-2 funds, not yet built. Until that test runs, the 0.46 is on AfriSenti’s general-domain tweet test set only. It is plausible — though by no means guaranteed — that the same finance-domain fine-tune that closed the cross-lingual gap on Hausa and Yoruba does some of that work for Pidgin too.
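For readers who want to check the random-baseline comparison in point 3, the sketch below computes macro-F1 from scratch and approximates what uniform random guessing would score for a given class mix. It is a minimal illustration, not the project's evaluation code. The helper names are hypothetical and the skewed priors in the usage lines are made up for demonstration. They are not the AfriSenti Pidgin class distribution.

```python
LABELS = ("positive", "negative", "neutral")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


def expected_uniform_guess_f1(class_priors):
    """Approximate macro-F1 of guessing each of k classes with probability 1/k.

    For a class with prior p, precision is about p and recall about 1/k,
    so per-class F1 is about 2 * p * (1/k) / (p + 1/k). This is a
    large-sample approximation, not an exact expectation.
    """
    k = len(class_priors)
    f1s = [2 * p * (1 / k) / (p + 1 / k) for p in class_priors]
    return sum(f1s) / k


balanced = expected_uniform_guess_f1([1 / 3, 1 / 3, 1 / 3])
skewed = expected_uniform_guess_f1([0.6, 0.3, 0.1])  # hypothetical priors
print(f"balanced: {balanced:.3f}, skewed (hypothetical): {skewed:.3f}")
```

With balanced classes the approximation gives one third, and any imbalance pushes it lower. So "0.46 versus a 0.33 random baseline" is, if anything, a slightly conservative comparison on an imbalanced test set.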
That last point is the one to watch. The same workstream-2 architecture that produced +0.253 / +0.216 lifts on Hausa and Yoruba could close part of the Pidgin gap once the finance-domain layer is in place. Or it might not. The honest answer is we don’t know yet, and the LINGUA proposal language reflects that: Pidgin is now in the “trained, ready for workstream-2 lift” bucket rather than the “demonstrated end-to-end lift” bucket that Hausa and Yoruba occupy. (Igbo is trained and evaluated too, but has no live BBC finance corpus to measure a lift against.)
What this means for the LINGUA Africa evidence
The proposal’s workstream-2 claim has been that pretraining-transfer plus finance-domain fine-tune is the replicable architecture. That claim now rests on solid footing for three of the four target languages: Hausa, Igbo, and Yoruba all sit in the same 0.715–0.782 macro-F1 band on AfriSenti, and the two with a live test corpus (Hausa and Yoruba) both show a lift of more than 0.2 over a cross-lingual baseline.
Pidgin extends the same template but lands the AfriSenti baseline lower. The honest framing in the proposal is now: “AfriBERTa Hausa 0.779 / Igbo 0.782 / Yoruba 0.715 / Pidgin 0.460 on AfriSenti SemEval-2023; the lower Pidgin number reflects a smaller AfriSenti split, lower annotator agreement on the original task, and Pidgin’s hybrid English/Nigerian register. The workstream-2 finance-domain fine-tune is expected to close part of this gap once the funded Pidgin source corpus (BBC News Pidgin, Wazobia FM, Naija FM transcripts) is in place — the same lift pattern observed on Hausa and Yoruba.”
That language is more honest than claiming Pidgin parity. It also remains a perfectly defensible position for a research-grant evaluation.
The other thing that landed: grounded-knowledge expansion
Quietly, alongside the multilingual closing, the chat surface Asotele runs against has been growing. The retrieval-augmented chat now grounds its answers against a substantially larger and more carefully curated knowledge base than the briefings-only version that ran a week ago.
Categories now indexed and citable include:
- Asotele’s own daily economic briefings and table-fact summaries (CBN policy rate, FX, inflation, NGX sectoral indices, food and fuel prices)
- A curated economic-and-business glossary covering universal terms (GDP, inflation, monetary policy, capital, recession), Nigerian-context terms (CBN, NDIC, MFB, NIBSS, BVN, parallel market), and crypto / digital-asset vocabulary
- Open-license economics textbook and report coverage: CORE Econ: The Economy, OpenStax Principles of Economics, OpenStax Principles of Finance, OpenStax Introductory Statistics, the World Bank Nigeria Development Update (four most recent editions), and the AfDB African Economic Outlook 2024
- The full Nigerian Exchange (NGX) listed-companies directory and the rolling NGX corporate-disclosure feed — listed-company cards by ticker, current-trading snapshots, recent corporate actions, AGMs, earnings forecasts
In practice, the chat can now answer “what is the current MPR?” with a citation back to today’s briefing; “what is GDP?” with a citation back to the World Bank glossary; “how does the labor market work?” with a citation back to an OpenStax chapter; “what is DANGCEM?” with a citation back to the live NGX directory; and “is there a pending Dangote IPO?” with a citation back to an NGX corporate-actions disclosure.
By deliberate design, the chat also refuses to answer questions that the indexed sources cannot support. That refusal is the load-bearing safety property the architecture has always been about. It still refuses, cleanly, when asked to forecast tomorrow’s exchange rate or recommend an individual stock. The expansion adds breadth without weakening the anti-hallucination floor.
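One simple way to picture the refusal property is a gate that only composes an answer when retrieval returns something strong enough to cite. The sketch below is an assumption-laden illustration and not Asotele's actual implementation. `Passage`, `grounded_answer`, the 0-to-1 score scale, and the 0.5 threshold are all hypothetical, and a production system would likely combine several signals rather than a single cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str   # hypothetical citation key, e.g. a briefing or disclosure id
    text: str
    score: float  # retriever similarity; the [0, 1] scale is an assumption


REFUSAL = "I can't answer that from the sources I have indexed."


def grounded_answer(question, passages, min_score=0.5, compose=None):
    """Answer only when at least one retrieved passage clears `min_score`.

    `compose` stands in for the generation step; by default it simply
    returns the text of the best-scoring supporting passage.
    """
    supported = sorted(
        (p for p in passages if p.score >= min_score),
        key=lambda p: p.score,
        reverse=True,
    )
    if not supported:
        # No indexed source supports an answer, so refuse rather than guess.
        return {"answer": REFUSAL, "citations": [], "refused": True}
    compose = compose or (lambda q, ps: ps[0].text)
    return {
        "answer": compose(question, supported),
        "citations": [p.source for p in supported],
        "refused": False,
    }
```

The point of the design is that the citation list and the answer come from the same filtered set of passages. Adding more indexed sources widens what can clear the gate, but a question with no supporting passage still falls through to the refusal branch.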
What’s next
The Pidgin live-corpus pipeline — BBC News Pidgin, Wazobia FM, Naija FM transcripts — is workstream-2 / 3 in the LINGUA proposal and the natural next thing for the multilingual track. The proposal carries the four-language replication argument; whether Pidgin closes part of its gap under workstream-2 is now a question the funded work answers.
Outside the multilingual track, the larger arc is the same as it has been: ship internal infrastructure first, prove the operational layer before publishing the architecture publicly, and keep retrieval-grounded refusals as the core trust property. The 4-of-4 multilingual coverage and the grounded-knowledge expansion are concrete steps along that arc. Pidgin’s F1 number is the honest measurement that comes with them.