2026-06-10

The Pidgin corpus is built. The number didn't move. Here's what we learned.

The 2026-06-08 post named the gap plainly: Pidgin lands at AfriSenti macro-F1 0.460, well below Hausa 0.779, Igbo 0.782, Yoruba 0.715. The proposal language at the time was honest about what would close part of it — “the workstream-2 finance-domain fine-tune is expected to close part of this gap once the funded Pidgin source corpus (BBC News Pidgin, Wazobia FM, Naija FM transcripts) is in place — the same lift pattern observed on Hausa and Yoruba.”

Two days later we tested that claim directly. The honest answer is: not on this measurement path, it doesn’t.

The corpus

The first piece of workstream-2 was building the BBC News Pidgin corpus. The end result, on disk:

776 articles crawled from bbc.com/pidgin
41,208 body paragraphs after dropping cross-article sidebar / most-read / related-link leakage
Four-year span (2022-06-17 through 2026-06-09), a deeper archive than we expected the polite-rate crawl to reach
A finance-keyword filter tags roughly 63% of articles as on-topic for the macro / FX / market / energy beat

The cleanup step is worth a sentence on its own: BBC’s React-based site bundles each article’s surrounding furniture (most-read list, related-article cards, image captions for top stories) into the same client-side payload as the article body. A naive pull conflates them. A cross-article frequency filter fixes this: any paragraph that appears in 5 or more distinct articles is treated as navigation, not body text. The filter strips out 277 unique contaminating paragraphs, halves the raw paragraph count, and leaves a clean 41,208-paragraph corpus from 776 articles.
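The filter described above can be sketched in a few lines. This is a minimal illustration, not the project's actual crawler code: the function name, the article-id-to-paragraphs mapping, and the exact-match comparison on stripped text are all assumptions made for the example.

```python
from collections import defaultdict


def drop_cross_article_paragraphs(articles, min_articles=5):
    """Remove paragraphs that occur in at least `min_articles` distinct articles.

    `articles` is a hypothetical mapping of article id -> list of paragraph strings.
    Returns (cleaned_articles, removed_paragraphs).
    """
    # Count distinct articles per paragraph, not raw occurrences,
    # so a paragraph repeated inside one article is counted once.
    seen_in = defaultdict(set)
    for article_id, paragraphs in articles.items():
        for text in paragraphs:
            seen_in[text.strip()].add(article_id)

    # Anything shared by that many articles is treated as page furniture.
    furniture = {text for text, ids in seen_in.items() if len(ids) >= min_articles}

    cleaned = {
        article_id: [p for p in paragraphs if p.strip() not in furniture]
        for article_id, paragraphs in articles.items()
    }
    return cleaned, furniture
```

Counting distinct articles instead of total occurrences is the design choice that matters here: a quote repeated in two follow-up stories survives, while a most-read entry injected into every page does not.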

That artifact alone — domain-tagged, multi-year, locally curated Pidgin economic-news text — is something the workstream-2 LINGUA scope can now point at as in-hand, not aspirational. The honest framing in the proposal can change from “expected to close part of this gap once the funded corpus is in place” to “corpus built; measurement methodology under revision.” Which is what the rest of this post is about.

The four runs

With the corpus in hand, we ran the workstream-2 hypothesis end-to-end. The structure is straightforward: continue AfriBERTa-Large’s masked-language-modeling pretraining on the Pidgin domain corpus, then re-fine-tune for sentiment on AfriSenti pcm with the same hyperparameters as the 0.460 baseline, then test on AfriSenti pcm. If domain MLM lifts the downstream sentiment task, we’d see a positive delta on macro-F1.

We extended the experiment with class-weighted cross-entropy as a second axis, because the 0.460 baseline never predicts the neutral class on test (neutral F1 = 0.000). Under sklearn’s balanced formula the neutral class gets a weight of about 23.7 against about 0.53 for negative, a roughly 45× ratio. Four configurations in total:

run	macro-F1	Δ vs baseline	neg F1	pos F1	neu F1
baseline (CE, 2026-06-07)	0.4596	—	0.778	0.601	0.000
MLM-adapted + SFT (CE)	0.4604	+0.0009	0.775	0.606	0.000
base + SFT (CW balanced)	0.4611	+0.0015	0.763	0.620	0.000
MLM-adapted + SFT (CW balanced)	0.4647	+0.0052	0.773	0.621	0.000

The best of the four is MLM + CW at +0.0052 over baseline. Real in direction, tiny in magnitude, and well within the noise of a single training seed. The other deltas are smaller still. Calling any of these a workstream-2 “lift” would not survive an honest reading.

The reason is the neutral class: every configuration still predicts zero neutrals on test, so neutral F1 stays at 0.000 across all four. With one of three classes pinned at zero, macro-F1 is just (neg F1 + pos F1) / 3. That cannot exceed 0.667 even with perfect negative and positive scores, and at the observed scores it sits near 0.46 regardless of what else changes.
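The arithmetic is easy to check from the table. The sketch below recomputes macro-F1 as the unweighted mean of the three per-class F1 scores, using the rounded negative and positive values reported above and a neutral F1 of zero. The function and variable names are illustrative only, and because the inputs are rounded the results match the reported macro-F1 to about three decimal places, not exactly.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# (neg F1, pos F1) as reported in the table; neutral F1 is 0.000 in every run.
runs = {
    "baseline (CE)": (0.778, 0.601),
    "MLM-adapted + SFT (CE)": (0.775, 0.606),
    "base + SFT (CW balanced)": (0.763, 0.620),
    "MLM-adapted + SFT (CW balanced)": (0.773, 0.621),
}

for name, (neg, pos) in runs.items():
    print(f"{name}: {macro_f1([neg, pos, 0.0]):.4f}")

# Upper bound while neutral stays at zero: perfect negative and positive scores.
ceiling = macro_f1([1.0, 1.0, 0.0])
print(f"ceiling with neutral F1 = 0: {ceiling:.3f}")
```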

Why the neutral class will not budge

The AfriSenti pcm training set has 5,121 examples. The class distribution is:

negative: 3,241 (63.3%)
positive: 1,808 (35.3%)
neutral: 72 (1.4%)

A class that is 1.4% of training data is functionally invisible to a 110M-parameter transformer. The model learns, correctly, that the prior on neutral is so low that betting on it is almost never the right call. Weighting the loss by 23.7× doesn’t fix it — there are only 72 examples to learn from, and the neutral class likely overlaps semantically with negative in Twitter Pidgin (which is the AfriSenti source register). Whatever discriminative signal the model could extract from those 72 examples is dwarfed by the noise of the overlap.
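For reference, the balanced weighting works out as follows. The sketch reimplements the formula behind sklearn’s class_weight="balanced" (n_samples / (n_classes × class count)) in plain Python so it runs without dependencies; the class counts are the ones listed above, and the function name is made up for the example.

```python
def balanced_class_weights(counts):
    """Per-class weight = n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# AfriSenti pcm training-set label counts from the post.
counts = {"negative": 3241, "positive": 1808, "neutral": 72}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")

# The neutral-to-negative ratio reduces to the ratio of the class counts.
ratio = weights["neutral"] / weights["negative"]
print(f"neutral / negative weight ratio: {ratio:.1f}")
```

The neutral weight comes out at about 23.7, and its ratio to the negative weight is simply 3,241 / 72. A large weight rescales the loss on each of the 72 neutral examples, but it does not add any new examples to learn from.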

This is not a Pidgin representation problem that domain MLM can solve. AfriBERTa already saw plenty of Pidgin in pretraining; the BBC corpus adds 41,208 more paragraphs of editorial Pidgin text. None of it teaches the model to distinguish a neutral Twitter post from a negative one, because the AfriSenti training set doesn’t contain enough labelled neutrals for that distinction to be learnable.

The macro-F1 wall on AfriSenti pcm is structural. The corpus didn’t move it. The loss reformulation didn’t move it. We are confident enough in this diagnosis that we are not going to spend more cycles trying to move it.

The pivot

Here is the more useful framing.

The AfriSenti pcm test set is the wrong measurement target for a workstream-2 claim about domain-adapted Pidgin sentiment. It is Twitter Pidgin, general-domain, with a structural neutral-class ceiling. The workstream-2 corpus we built is editorial BBC Pidgin, finance-domain-tagged. Testing one on the other introduces both a register mismatch and a label-distribution mismatch — and on top of that, the test ceiling is independent of anything the corpus can change.

The right workstream-2 measurement is a hand-labelled BBC Pidgin sentiment test set, drawn from the same domain as the corpus, labelled by a Pidgin speaker, with a label distribution that reflects what BBC Pidgin actually carries — likely more neutral / news-register items than Twitter, almost certainly fewer of the strongly-affective items that dominate AfriSenti.

That test set does not exist yet. So we built the sample for it: 200 BBC Pidgin headlines, stratified across the four-year corpus span and across finance-versus-general topic, drawn with a fixed random seed for reproducibility. Hand-labelling is the next concrete step.
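A seeded stratified draw of this kind might look like the sketch below. It is an illustration under stated assumptions, not the sampling script that was actually run: the record fields (year, finance), the equal round-robin allocation across strata, and the seed value are all hypothetical, since the post does not say how the 200 were apportioned.

```python
import random
from collections import defaultdict


def stratified_sample(items, strata_key, n_total, seed=0):
    """Draw up to n_total items, spread evenly across strata, repeatably."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    buckets = defaultdict(list)
    for item in items:
        buckets[strata_key(item)].append(item)

    # Fixed stratum order plus a seeded shuffle makes the draw repeatable
    # for the same input order and the same seed.
    ordered = [buckets[key] for key in sorted(buckets)]
    for bucket in ordered:
        rng.shuffle(bucket)

    # Round-robin: take one item per stratum per pass until the quota is met
    # or every stratum is exhausted.
    sample = []
    depth = 0
    while len(sample) < n_total and any(depth < len(b) for b in ordered):
        for bucket in ordered:
            if depth < len(bucket) and len(sample) < n_total:
                sample.append(bucket[depth])
        depth += 1
    return sample


# Hypothetical headline records: year of publication and a finance-topic flag.
headlines = [
    {"title": f"headline {i}", "year": 2022 + i % 5, "finance": i % 3 == 0}
    for i in range(1000)
]
picked = stratified_sample(
    headlines, lambda h: (h["year"], h["finance"]), n_total=200, seed=42
)
```

Round-robin allocation keeps small strata represented instead of letting the largest year or topic dominate. A proportional allocation would be an equally reasonable choice if the labelled set is meant to mirror the corpus mix.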

When the labelled set lands, the four trained classifiers above get re-evaluated against it. Whatever those numbers say will be the real workstream-2 result — domain-matched, label-distribution-honest, methodologically defensible. Whether MLM adaptation helps on that test, whether class weights help, whether the corpus we just built turns out to have been worth building — the answer is in the labelled-test numbers, not the AfriSenti delta.

What this means for the LINGUA proposal

The proposal language on workstream-2 needs an update — and the update is in the project’s favor, not against it.

Old framing: “AfriBERTa Hausa 0.779 / Igbo 0.782 / Yoruba 0.715 / Pidgin 0.460. The workstream-2 finance-domain fine-tune is expected to close part of the Pidgin gap once the funded Pidgin source corpus is in place.”

Honest replacement: “AfriBERTa baselines as above. The BBC News Pidgin domain corpus (776 articles, 41,208 paragraphs, four-year span) has been built and is on disk. Initial measurement on the AfriSenti held-out test set shows no meaningful lift, attributable to a structural ceiling: neutral is 1.4% of the AfriSenti pcm training labels, and no domain-adaptation pretraining can change that. Workstream-2 evaluation is therefore being shifted to a hand-labelled BBC Pidgin sentiment test set, domain-matched and register-matched, currently in labelling. That measurement is the workstream-2 result the proposal commits to.”

The second framing is stronger than the first. It names a real artifact in hand, names a real methodological choice, and commits to a falsifiable result on a defensible test. Whether the workstream-2 lift is real or not, the proposal stops promising what cannot be promised and starts promising what can.

A note on what’s not new

The pattern here — name the negative result, locate the structural cause, choose a better measurement — is the same pattern Asotele has been running through every workstream since the early posts. The 2026-05-26 Sahm-panel post named the 70%-informal-economy coverage gap. The 2026-06-04 Igbo post named the BBC RSS finance-feed emptiness. The 2026-06-08 Pidgin baseline post named the 0.460 honestly. This post names the AfriSenti-test ceiling and the pivot.

Negative results are findings. Saying so out loud, with the numbers, is how the project earns the right to publish positive results when they come.

The BBC labelled test is next.
