2026-06-04

Yoruba lands, Igbo runs tonight — and the Igbo finding the proposal didn't expect

Two days ago AsoteleLingua had a Hausa pilot. Tonight it has Hausa and Yoruba, with Igbo training in the background. The numbers replicate, the code path generalized cleanly with a single --language flag, and the LINGUA Africa proposal that funds the rest of the work now stands on two of four target languages instead of one.

The bigger story turns out to be the third language, where the surprise was finding out how little Igbo-language finance content exists online at all.

What landed today

sources/asotele_lingua/ now ships a parallel Yoruba pipeline alongside the Hausa pilot from earlier this week. Three new artifacts, three measurements, one structural refactor:

fetch_bbc_yoruba.py is the RSS scraper for BBC News Yorùbá, filtered to finance/economy items by Yorùbá-language keyword matching. The keyword set is precision-tuned in a way the Hausa set initially wasn’t, because Yorùbá has many short, high-frequency words that overlap with non-finance content: bare owó (money) also names the town of Owo in Ondo State; bare èrè (profit) substring-matches òṣèré (actor) and ère (statue); and bare iṣẹ́ (work) matches generic work mentions in any news context. The list was rewritten to prefer distinctive, mostly multi-word phrases: owó orí (tax), iye owó (price), ètò ìnáwó (budget), gbèsè (debt), pẹ́tírólù (petrol). Recall is lower; precision is much higher. Today’s pull: 38 feed items, 3 kept as real finance (an IMF debt-to-Africa question, the ₦100,000 minimum-wage governors’ announcement, and Tinubu’s third-anniversary economic statement).
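A minimal sketch of the phrase-first matching idea, not the actual code in fetch_bbc_yoruba.py. It assumes matching on whole, consecutive tokens rather than substrings; the function names and the punctuation set are illustrative, and only the phrase list comes from this post.

```python
import string
import unicodedata

# Phrases and distinctive single words quoted in this post.
FINANCE_PHRASES = ["owó orí", "iye owó", "ètò ìnáwó", "gbèsè", "pẹ́tírólù"]

# ASCII punctuation plus a few typographic marks common in headlines.
_PUNCT = string.punctuation + "“”‘’…"


def tokenize(text):
    """NFC-normalize, lowercase, split on whitespace, trim edge punctuation."""
    normalized = unicodedata.normalize("NFC", text).lower()
    tokens = (tok.strip(_PUNCT) for tok in normalized.split())
    return [tok for tok in tokens if tok]


def matched_phrases(text, phrases=FINANCE_PHRASES):
    """Return the phrases that occur in text as whole, consecutive tokens."""
    tokens = tokenize(text)
    hits = []
    for phrase in phrases:
        needle = tokenize(phrase)
        n = len(needle)
        if any(tokens[i:i + n] == needle for i in range(len(tokens) - n + 1)):
            hits.append(phrase)
    return hits


# Illustrative strings built from words in this post, not real headlines.
print(matched_phrases("Iye owó pẹ́tírólù"))  # two phrases match
print(matched_phrases("òṣèré"))              # no match: whole tokens only
```

Whole-token comparison sidesteps regex word boundaries, which can misbehave around combining tone marks. The cost is the one already noted above: anything not spelled exactly like a listed phrase is missed.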

finetune_afriberta_sentiment.py --language yor — AfriBERTa-Large fine-tuned on the AfriSenti SemEval-2023 Yorùbá split (8,522 train / 2,090 dev / 4,515 test labelled tweets). Same script, same hyperparameters, single CPU pass on the homelab. Test macro-F1 = 0.715, weighted-F1 = 0.733. Lower than the Hausa pilot’s 0.779, which is what AfriBERTa’s per-language coverage and AfriSenti Yorùbá’s smaller training split would predict — but solidly in the “fine-tune transfer worked” range.
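For readers less familiar with the two F1 averages reported above, here is a small pure-Python sketch of how they differ. It is not taken from finetune_afriberta_sentiment.py, and the labels below are toy values, not AfriSenti data. scikit-learn's f1_score with average="macro" or average="weighted" computes the same quantities.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's count in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy labels: the majority class is predicted well, a minority class is missed.
y_true = ["pos", "pos", "pos", "pos", "neg", "neu"]
y_pred = ["pos", "pos", "pos", "pos", "neg", "neg"]
print(round(macro_f1(y_true, y_pred), 3), round(weighted_f1(y_true, y_pred), 3))
```

When the larger classes score better than the smaller ones, weighted-F1 comes out above macro-F1, which is the same ordering as the 0.733 and 0.715 reported here.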

sentiment_afriberta.py --language yor plus sentiment_baseline.py --language yor — both scripts now take a language flag and share a per-language config map. On the 3 BBC Yorùbá finance items:

mDeBERTa cross-lingual baseline: mean confidence 0.516
AfriBERTa Yorùbá fine-tune: mean confidence 0.732
Lift: +0.216, label agreement 67% (2/3 — the disagreement is the IMF debt question, where the cross-lingual baseline reads “debt to IMF” as negative and the Yorùbá-pretrained model reads “which African countries owe IMF the most” as neutral; both readings are defensible)
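The lift and agreement figures are simple aggregates: 0.732 − 0.516 = 0.216, and 2 of 3 matching labels is 67%. A hedged sketch of that comparison follows; it assumes each script emits one (label, confidence) pair per item, which may not match the real output format. The function name and the per-item values are placeholders, not the actual BBC Yorùbá scores.

```python
def compare_runs(baseline, finetuned):
    """Compare two runs over the same items, given in the same order.

    Each argument is a list of (label, confidence) pairs, one per item.
    """
    if len(baseline) != len(finetuned) or not baseline:
        raise ValueError("need two non-empty runs of equal length")
    n = len(baseline)
    base_mean = sum(conf for _, conf in baseline) / n
    tuned_mean = sum(conf for _, conf in finetuned) / n
    agree = sum(b[0] == f[0] for b, f in zip(baseline, finetuned))
    return {
        "baseline_mean": base_mean,
        "finetuned_mean": tuned_mean,
        "lift": tuned_mean - base_mean,
        "agreement": agree / n,
    }


# Placeholder values chosen for easy arithmetic, not measured scores.
baseline = [("negative", 0.5), ("neutral", 0.5), ("positive", 0.5)]
finetuned = [("neutral", 0.75), ("neutral", 0.75), ("positive", 0.75)]
result = compare_runs(baseline, finetuned)
print(f"lift={result['lift']:+.3f} agreement={result['agreement']:.0%}")
```

Mean confidence measures how sure each model is, not how often it is right, so the lift is best read next to the test-set F1 scores rather than in place of them.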

For comparison, the Hausa pilot two days ago measured +0.253 lift on the same kind of out-of-distribution test. Different language, different feed, very similar story: pretraining-language transfer matters, and the workstream-2 finance-domain fine-tune sits on top of an already-meaningful base.

A two-language pattern

The replication matters more than either number on its own. Hausa’s +0.253 by itself could have been “the AfriBERTa Hausa pretraining happens to be unusually strong.” Yorùbá’s +0.216 — measured under the same methodology, on a different language with smaller training data — rules that single-language-luck explanation out. The pretraining-transfer effect is genuine across at least two of the four target Nigerian languages, which is the load-bearing claim under LINGUA Africa workstream 2.

The proposal was updated tonight to carry the side-by-side. The relevant line in workstream 2: “AfriBERTa Hausa 0.716 vs 0.463 (+0.253); AfriBERTa Yorùbá 0.732 vs 0.516 (+0.216). This is the replicable pattern, now demonstrated on Hausa and Yoruba with quantified pretraining-transfer headroom; Igbo and Pidgin extend the same template under workstream 2.”

What scoping Igbo actually surfaced

Igbo was supposed to be the third one tonight — kicked off in parallel with the blog write-up, training overnight, parity by morning. That happened, for the training side. AfriBERTa-Large on the AfriSenti Igbo split (10,193 train / 1,842 dev / 3,683 test — actually the largest of the three AfriSenti Nigerian splits) is running in the background as this post lands.

The wrinkle is the inference corpus. The Hausa and Yorùbá pilots scored their models on real BBC headlines as an out-of-distribution test. Igbo was supposed to do the same. BBC Igbo’s RSS feed today returned 9 total items, with zero passing the finance keyword filter — even after iterating the Igbo keyword list to avoid the same false-positive trap Yorùbá fell into earlier in the day. (Bare ụtụ (tax) substring-matches ụtụtụ (morning) and ọtụtụ (many), so it got dropped in favour of the multi-word ụtụ isi only.) Today’s BBC Igbo feed was teacher-strike coverage and a Biafra War retrospective, nothing finance.
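The ụtụ collision is easy to reproduce. The sketch below is a hypothetical audit helper, not part of the pipeline: given a candidate keyword and a word list, it reports the words that would trigger a false positive under naive substring matching. The vocabulary is just the words mentioned in this post; a real audit would presumably run against a larger word list.

```python
import unicodedata


def nfc(text):
    """Normalize so precomposed and combining-mark spellings compare equal."""
    return unicodedata.normalize("NFC", text).lower()


def substring_collisions(keyword, vocabulary):
    """Return words that contain keyword as a substring but are not keyword."""
    key = nfc(keyword)
    return [w for w in vocabulary if key in nfc(w) and nfc(w) != key]


# Words taken from this post.
vocabulary = ["ụtụ", "ụtụtụ", "ọtụtụ", "isi"]
print(substring_collisions("ụtụ", vocabulary))      # morning, many
print(substring_collisions("ụtụ isi", vocabulary))  # the two-word phrase is clean
```

Running a check like this before adding a short keyword would flag the problem without waiting for a day of false positives in the feed.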

The scouting went further. Igbo Wikipedia has one category that looks finance-adjacent: Ego Afrika (Money of Africa). It contains three articles: the Lesotho currency, the Ghanaian pesewa, and a list of African currencies. The core articles I’d have expected to find there all return missing: Naira, Akụ na ụba (Economics), Banki (Bank), and Ụlọ akụ (the older Igbo term for bank, literally “house of wealth”). They don’t exist.
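One way to run that kind of existence check is the MediaWiki Action API, which marks nonexistent titles with a missing flag. This is a sketch under that assumption and not necessarily how the check here was done. The helper names and the User-Agent string are placeholders, and the example payload at the bottom is hand-built to show the response shape, not a recorded API reply.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://ig.wikipedia.org/w/api.php"  # Igbo Wikipedia endpoint


def build_query_url(titles):
    """Build a MediaWiki action=query URL for a batch of page titles."""
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def missing_titles(payload):
    """Return titles the API flagged as missing in a formatversion=2 reply."""
    pages = payload.get("query", {}).get("pages", [])
    return [page["title"] for page in pages if page.get("missing")]


def check_titles(titles):
    """Fetch and parse; performs a network request when called."""
    request = urllib.request.Request(
        build_query_url(titles),
        headers={"User-Agent": "corpus-scout-example/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return missing_titles(json.load(response))


# Hand-built payload in the formatversion=2 shape, for illustration only.
example = {"query": {"pages": [
    {"title": "Naira", "missing": True},
    {"pageid": 1, "title": "Some existing page"},
]}}
print(missing_titles(example))
```

Calling check_titles(["Naira", "Banki"]) would hit the live API; the parsing step is kept separate so it can be exercised offline.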

VOA doesn’t run an Igbo service, and neither do DW or RFI. The major Nigerian English-language papers (Premium Times, Punch, Vanguard, Sun) don’t publish regular Igbo editions. Igbo-language radio (Orient FM, Wazobia) is broadcast-only, without scrapable feeds.

Igbo-language online finance content is structurally near-absent. That isn’t a bug in the scraper. That’s the substrate the LINGUA Africa workstream would be building.

This is, ironically, the strongest possible grant pitch for Igbo. The Hausa and Yorùbá story is “the gap between cross-lingual baseline and Nigerian-pretrained classifier is real and measurable.” The Igbo story is “the corpus to even have that conversation doesn’t exist yet, and workstream 1 funds the building of it.” Both arguments are now in the proposal.

What the training will produce, and what it won’t

The Igbo AfriBERTa fine-tune that’s running right now will give us a measured macro-F1 on the AfriSenti Igbo test set — a third replication of the same fine-tune pattern, this time on the largest of the three Nigerian splits. That number lands in the proposal alongside the Hausa 0.779 and Yorùbá 0.715. It’s a real evidence point.

What it won’t give us tonight is a BBC-headline inference comparison the way Hausa and Yorùbá have. Doing that requires a live Igbo finance corpus we now know doesn’t exist on the open web. The honest framing in the proposal: Igbo has the trained classifier ready; the BBC inference will accumulate as the daily scraper runs (occasional finance items will appear) and as the funded workstream builds out additional Igbo sources.

The wider lesson — one we couldn’t have written into the proposal a week ago because we didn’t know it — is that the four target languages are not symmetric in their corpus availability. Hausa has multiple online outlets with daily finance content. Yorùbá has BBC. Igbo has BBC barely, Wikipedia thinly, and otherwise near-nothing. Pidgin has BBC News Pidgin and a strong oral/broadcast presence but limited written-text scraping targets. The workstream will need to weight effort accordingly.

What’s next, concretely

The Igbo AfriBERTa fine-tune completes overnight; the morning will bring the test macro-F1 number and the third replication of the fine-tune pattern. The proposal’s Igbo block will get the same --language ibo invocation footer as Hausa and Yoruba, plus an explicit note about the corpus-availability finding.

The LINGUA Africa proposal is at funding/03_lingua_africa_proposal.md, with a PDF rendered to the same path. Reviewer pass tomorrow; the target submission date is 2026-06-08, a week before the 2026-06-15 deadline.

Pidgin is the fourth target. AfriSenti SemEval-2023 covers Naija Pidgin too (pcm, 5,122 train / 1,282 dev / 4,155 test), so the same --language pcm invocation slots into the pipeline that already runs for Hausa, Yoruba, and Igbo. The Pidgin-specific work is on the source side — BBC News Pidgin RSS plus the Wazobia FM and Naija FM transcript pipelines the LINGUA proposal scopes — rather than the model-training side. That’s the shape of the fourth language’s workstream.
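A rough sketch of what a shared per-language config map behind the --language flag could look like. The real map in sources/asotele_lingua/ is not shown in this post, so the class, field names, and the "hau" code for Hausa are assumptions; the split sizes are the ones quoted above, and Hausa's is left as None because it isn't quoted here.

```python
import argparse
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LanguageConfig:
    name: str
    # (train, dev, test) AfriSenti sizes as quoted in this post; None if not quoted.
    afrisenti_split: Optional[Tuple[int, int, int]]


LANGUAGES = {
    "hau": LanguageConfig("Hausa", None),  # code assumed; split not quoted here
    "yor": LanguageConfig("Yorùbá", (8522, 2090, 4515)),
    "ibo": LanguageConfig("Igbo", (10193, 1842, 3683)),
    "pcm": LanguageConfig("Naija Pidgin", (5122, 1282, 4155)),
}


def parse_args(argv=None):
    """Parse the shared --language flag; unknown codes are rejected."""
    parser = argparse.ArgumentParser(description="Per-language pipeline entry point")
    parser.add_argument("--language", required=True, choices=sorted(LANGUAGES))
    return parser.parse_args(argv)


def main(argv=None):
    """Look up the config for the requested language and report it."""
    args = parse_args(argv)
    config = LANGUAGES[args.language]
    print(f"Running pipeline for {config.name} ({args.language})")
    return config


# Example: equivalent to passing "--language pcm" on the command line.
config = main(["--language", "pcm"])
```

Keeping every language-specific value in one map means adding the fourth language is a one-entry change, with the scripts themselves left alone.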

The pattern, where the data allows it, is now templated. Two of four target languages are demonstrated end-to-end, the third is training tonight, and the corpus reality for the third is itself a finding worth carrying into the proposal.
