2026-06-16

R5 regressed. The corpus was upstream, not the trainer.

The numbers

The R5 QLoRA run completed 2026-06-14 against the same dev set the prior rounds used. The honest read:

Round	Combined	must_cite	rubric
R4a (2026-05-31)	20.9%	6/26	12/60
R5 (2026-06-14)	10.5%	2/26	7/60

Both axes moved backwards. must_cite dropped by two-thirds, from 6/26 to 2/26. The rubric F1 lost five binary criteria, from 12/60 to 7/60. Calling this anything other than a regression would not survive an honest reading.
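The Combined column is consistent with pooling both axes into one fraction (hits over total checks). That pooling is an inference from the table rather than a stated definition, and the `combined` helper below is a hypothetical name used only to check the arithmetic:

```python
from fractions import Fraction

def combined(must_cite_hits, rubric_hits, must_cite_total=26, rubric_total=60):
    """Pooled pass rate: all hits over all checks (assumed definition)."""
    return Fraction(must_cite_hits + rubric_hits, must_cite_total + rubric_total)

r4a = combined(6, 12)   # R4a row: 6/26 and 12/60
r5 = combined(2, 7)     # R5 row: 2/26 and 7/60

print(f"R4a {float(r4a):.1%}  R5 {float(r5):.1%}")
```

Under that assumption the two rows reproduce the reported 20.9% and 10.5%.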

Why we ran R5 in the first place

R4a was the best round to date — first round where both must_cite and rubric F1 moved together in the right direction. The plan for R5 was to take the same training base and add a thin layer of new pairs extracted from external long-form Nigerian-economy context: AfDB African Economic Outlook 2024, World Bank Nigeria Development Update, CORE Econ’s “The Economy,” and Asotele’s own identity document. The hope was that grounded textbook plus development-report reasoning would lift the rubric F1 without disturbing the must_cite anchor.

86 pairs were extracted into staging. A three-pass review screen (numeric-claim equivalence, full-chunk grounding, percent-vs-percentage regex normalisation) plus a manual FLAG/DROP pass left 59 KEEP ready for promotion. We backed up the R4a training corpus, merged the 59 in, trained.

The diagnosis

The two-thirds drop on must_cite was the loudest signal. must_cite is a case-insensitive substring scan on the model’s generated answer — it looks for any of the 19 named-signal tokens the eval set declares (“Sahm-Jobs”, “Asotele Inflation Index”, “NGX Banking”, “Brent crude trend”, “R auto.arima FX forecast”, “Sahm-FX”…). If the model used to mention those signals and now does not, the most likely explanation is the model learned a different citation style — and a different style would also explain the rubric drop, because the rubric scores cite-or-refuse criteria.
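A minimal sketch of what a case-insensitive substring scan of this kind looks like. The function name `must_cite_hit` is hypothetical, and the list holds only the six signals named above, not the full 19 the eval set declares:

```python
# Six of the 19 named signals, as quoted in this post; the rest are omitted.
SIGNALS = [
    "Sahm-Jobs",
    "Asotele Inflation Index",
    "NGX Banking",
    "Brent crude trend",
    "R auto.arima FX forecast",
    "Sahm-FX",
]

def must_cite_hit(answer, signals=SIGNALS):
    """True if any signal appears in the answer, ignoring case."""
    lowered = answer.casefold()
    return any(signal.casefold() in lowered for signal in signals)

# An inline named-signal mention passes; a chunk-path citation does not.
print(must_cite_hit("The SAHM-FX gauge has triggered."))
print(must_cite_hit("<sources>afdb-african-economic-outlook-2024/full-book#chunk_0041</sources>"))
```

A scan like this rewards vocabulary, not grounding: an answer can be well sourced and still score zero if it never uses one of the declared tokens.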

The first sanity check: what kind of citations do the answers in the existing 4,631-row training corpus actually carry, and what do the 59 R5 KEEPs carry?

Corpus	Examples	Mention any must_cite signal	Wrap citations in `<sources>`
`sft_pairs.jsonl` (R4a base)	469	12.2%	0 / 469 — none
`qlora_train.jsonl` (R4a corpus, 4,631 rows)	4,631	13.0%	0 / 4,631 — none
R5 KEEP pairs	59	0%	59 / 59 — every one

The gold style R4a learned was indicator tables with inline named-signal citations — “Brent Crude $103.40” inside a table row, not wrapped in any tag. The 59 R5 KEEPs taught the opposite style: textbook prose ending in <sources>afdb-african-economic-outlook-2024/full-book#chunk_0041</sources>, a tag format and a chunk-path convention the model had literally never seen in 4,631 training examples. Zero of the 59 KEEPs reference any of the named signals the eval rewards.
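The style audit in the table above can be approximated with a few lines. This is a sketch, not the script that produced those numbers: the `audit` function, the regex, and the two sample answers are illustrative assumptions.

```python
import re

# Matches a complete <sources>...</sources> block, across lines, any case.
SOURCES_TAG = re.compile(r"<sources>.*?</sources>", re.IGNORECASE | re.DOTALL)

def audit(answers, signals):
    """Count answers that mention a named signal and answers that carry a <sources> tag."""
    mentions = sum(
        any(s.casefold() in a.casefold() for s in signals) for a in answers
    )
    tagged = sum(bool(SOURCES_TAG.search(a)) for a in answers)
    total = len(answers)
    return {
        "examples": total,
        "mention_rate": mentions / total if total else 0.0,
        "tagged": tagged,
    }

# Two made-up answers, one in each style described above.
sample = [
    "| Brent crude trend | $103.40 |",
    "Growth slowed. <sources>afdb-african-economic-outlook-2024/full-book#chunk_0041</sources>",
]
report = audit(sample, ["Brent crude trend", "Sahm-FX"])
print(report)
```

Running the same two counts over a candidate batch before merging it is a quick way to see whether its citation style matches the base corpus.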

The model trained on the new data did what trained models do: it learned the new style. It cites RAG chunk paths now. Which is the wrong style for this eval.

What broke the upstream extractor

sources/r5_corpus/extract_qa_pairs.py is the script that produced the 86 staged pairs. The prompt templates explicitly tell the LLM to emit the offending tags. Direct quote from the template:

“End the answer with a <sources>{source_id}</sources> tag.”

Four prompt templates carry this instruction. {source_id} is the chunk identifier passed in by the runner — which is exactly the chunk path that ended up wrapped in tags in the output. The extractor was doing precisely what the prompt told it to do. The prompt was wrong.

The fix

Two layers, on opposite ends of the pipeline:

Extractor (upstream):

All four prompt templates rewritten to drop the <sources> tag instruction and replace it with plain-prose citation guidance (“Per the World Bank Nigeria Development Update,…”).
The ASOTELE_VOICE description tightened with explicit format rules: no XML/HTML tags of any kind, no chunk paths, no path anchors.
A defence-in-depth clean_answer() strips any <sources> tag the model still emits despite the instruction, and drops the pair entirely if dangling chunk-path references remain after the strip.
Records stamped with prompt_version: "v2_no_sources_tag_2026-06-15" so future debugging can distinguish v1 (broken) from v2 (fixed).
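The strip-then-drop behaviour described for clean_answer() might look roughly like this. Only the function name and the prompt_version string come from the post; the regexes, the `build_record` helper, and the record field names are assumptions.

```python
import re

PROMPT_VERSION = "v2_no_sources_tag_2026-06-15"  # stamp quoted in the post

SOURCES_BLOCK = re.compile(r"<sources>.*?</sources>", re.IGNORECASE | re.DOTALL)
STRAY_TAG = re.compile(r"</?sources>", re.IGNORECASE)   # unpaired leftovers
CHUNK_PATH = re.compile(r"#chunk_\d+")                  # assumed anchor shape

def clean_answer(answer):
    """Strip <sources> tags; return None (drop the pair) if a chunk path survives."""
    cleaned = SOURCES_BLOCK.sub("", answer)
    cleaned = STRAY_TAG.sub("", cleaned)
    if CHUNK_PATH.search(cleaned):
        return None
    cleaned = cleaned.strip()
    return cleaned or None

def build_record(question, raw_answer):
    """Hypothetical helper: clean the answer and stamp the prompt version."""
    answer = clean_answer(raw_answer)
    if answer is None:
        return None
    return {"question": question, "answer": answer, "prompt_version": PROMPT_VERSION}

# Made-up inputs: the first is repaired, the second is dropped.
print(clean_answer("Growth slowed in 2023. <sources>afdb-african-economic-outlook-2024/full-book#chunk_0041</sources>"))
print(clean_answer("See full-book#chunk_0041 for details."))
```

Dropping rather than repairing the second case is deliberate in this sketch: a chunk path left in running prose usually means the sentence only makes sense with the retrieval context attached.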

Promotion pipeline (downstream of review):

sources/build_r5_corpus.py gained a style filter on top of the existing v3 review verdict. The v3 verdict (KEEP/FLAG/DROP) is the floor; the style filter is the ceiling. A pair that v3 marked KEEP but that contains a <sources> tag or a #chunk_ path is now hard-rejected, with the reason logged to a promotion-audit JSONL for later inspection.

We dry-ran the new style filter against the existing r5_review_v3.json. 59 / 59 v3 KEEPs trigger sources_tag; 54 / 59 also trigger rag_chunk_path. The filter correctly rejects the entire current batch — which is the honest outcome, because the entire current batch was style-mismatched. No promotions slip through under the new pipeline; the next batch the extractor produces will be the first that’s eligible.
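A sketch of how a style filter layered on the review verdict could be structured. The reason names sources_tag and rag_chunk_path are the ones reported in the dry run; the `promote` function, the record fields (`id`, `verdict`, `answer`), and the sample pairs are hypothetical.

```python
import json
import re

# Rule names match the dry-run reasons; the patterns themselves are assumptions.
STYLE_RULES = {
    "sources_tag": re.compile(r"</?sources>", re.IGNORECASE),
    "rag_chunk_path": re.compile(r"#chunk_"),
}

def style_violations(answer):
    """Return the name of every style rule the answer trips."""
    return [name for name, pattern in STYLE_RULES.items() if pattern.search(answer)]

def promote(pairs):
    """Keep only KEEP pairs with no style violations; collect audit rows for rejects."""
    promoted, audit_rows = [], []
    for pair in pairs:
        if pair["verdict"] != "KEEP":
            continue  # the review verdict is the floor
        reasons = style_violations(pair["answer"])
        if reasons:
            audit_rows.append({"id": pair["id"], "reasons": reasons})
        else:
            promoted.append(pair)
    return promoted, audit_rows

pairs = [
    {"id": "a", "verdict": "KEEP", "answer": "Text <sources>doc/full-book#chunk_0041</sources>"},
    {"id": "b", "verdict": "KEEP", "answer": "Text <sources>identity-doc</sources>"},
    {"id": "c", "verdict": "KEEP", "answer": "Per the World Bank Nigeria Development Update, text."},
    {"id": "d", "verdict": "FLAG", "answer": "Plain text."},
]
promoted, audit_rows = promote(pairs)
for row in audit_rows:
    print(json.dumps(row))  # one JSON object per line, as in a JSONL audit log
```

Pair "b" shows one way a record can trip sources_tag without tripping rag_chunk_path, which would be consistent with the 59 versus 54 split in the dry run.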

The pivot

The 2026-06-10 Pidgin post argued that the right workstream-2 measurement target is a hand-labelled BBC Pidgin sentiment test set, not the AfriSenti held-out test. The lesson generalises. The right R5 training data is not generic textbook prose with chunk-path citations — it’s labelled-domain content matching the eval’s named-signal vocabulary and the existing corpus’s output convention.

External long-form sources (AfDB, World Bank NDU, CORE Econ) are still useful — but as RAG retrieval context, not as fine-tuning data. They teach factual background, not Asotele’s output style. That’s a clean separation, and one we should have written down earlier.

What landed instead: the labelled-sentiment gap

While diagnosing R5, we ran an open-data scan for sources that close the labelled-sentiment gaps the prior posts named openly. The 2026-06-08 post named the Pidgin F1=0.460 problem; the 2026-06-10 follow-up confirmed that pretraining on Pidgin domain text alone moves the macro-F1 by +0.0009. The structural barrier is the 1.4% neutral class in AfriSenti pcm — and the right fix is labelled data in the target language at scale.

The scan found Davlan/nollysenti — the Nollywood-review sentiment corpus from Shode et al. 2023 (ACL Findings), translated by native speakers across all four AsoteleLingua target languages plus English. Today the ingest landed:

Language	Train	Validation	Test	Label balance
en	1,302	100	500	~50/50
ha	410	100	500	~50/50
ig	410	100	500	~50/50
pcm	410	100	500	~50/50
yo	900	100	500	~50/50
Total (6,432 labelled rows)	3,432	500	2,500	~50/50

Sample of one Pidgin record, to anchor what the data actually looks like:

“Steamy soap opera wey unfold against di backdrop of caution histori lesson wey remind us say for Naija, di more wey tins dey change, di more dem dey craze.” — sentiment: positive

That’s exactly the register the F1=0.460 Pidgin classifier is failing on. The Igbo and Yoruba splits are equally clean. The Igbo corpus gap the 2026-06-04 post named — “Igbo finance text is near-absent on the open web” — is partly closed at the labelled-sentiment level here.

The cross-language parallel structure is unexpected upside. The validation and test splits are parallel-aligned across all five languages: the same Nollywood review translated five ways, indexable by row position. So zip(en_val, pcm_val) yields parallel sentence pairs that can serve as an MT eval set alongside FLORES+. Same review, five languages, same label. Cross-lingual sentiment-consistency evaluation drops out of this as a free benchmark.
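Row-position alignment makes both uses short to express. The sketch below assumes each split is a list of dicts with a `text` field and that predictions are plain label lists per language; those shapes, the function names, and the placeholder rows are illustrative, not the dataset's actual schema.

```python
def parallel_pairs(src_rows, tgt_rows):
    """Pair rows by position; refuse to pair splits of different length."""
    if len(src_rows) != len(tgt_rows):
        raise ValueError("splits are not row-aligned")
    return [(s["text"], t["text"]) for s, t in zip(src_rows, tgt_rows)]

def label_consistency(preds_by_lang):
    """Share of row positions where every language received the same predicted label."""
    columns = list(preds_by_lang.values())
    n = len(columns[0])
    agree = sum(len({col[i] for col in columns}) == 1 for i in range(n))
    return agree / n if n else 0.0

# Placeholder rows standing in for the aligned validation splits.
en_val = [{"text": "en review 0"}, {"text": "en review 1"}]
pcm_val = [{"text": "pcm review 0"}, {"text": "pcm review 1"}]
pairs = parallel_pairs(en_val, pcm_val)

# Placeholder classifier outputs: the languages agree on row 0 only.
preds = {
    "en": ["positive", "negative"],
    "pcm": ["positive", "positive"],
    "yo": ["positive", "negative"],
}
print(pairs[0], label_consistency(preds))
```

The length check matters because the alignment holds only by position: a filtered or shuffled split would silently pair unrelated reviews.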

What’s next

The next R5 run waits on labelled-corpus-grounded promotion candidates, not external textbook prose. The corrected extractor produces them; the style filter rejects anything that drifts back to the v1 pattern.
Workstream-2 of the multilingual scope now has labelled fine-tune data per target language, not just AfriSenti (which is general-domain) or BBC Pidgin (which has no labels). The next sentiment-classifier round is built on NollySenti train, with AfriSenti held-out as a cross-domain control.
The R5 deployment slot (asotele-econ-v1 in Ollama, reverted to base qwen3:14b since 2026-06-08) stays reverted until a training round lands that improves the scores instead of regressing them.

The R5 round was a backwards step. The diagnosis is in. The corpus that should have driven the round is now in hand. The post-mortem closes here; the next round starts whenever the data and the discipline both line up.
