R5 regressed. The corpus was upstream, not the trainer.

The numbers

The R5 QLoRA run completed 2026-06-14 against the same dev set the prior rounds used. The honest read:

| Round | Combined | must_cite | rubric |
|---|---|---|---|
| R4a (2026-05-31) | 20.9% | 6/26 | 12/60 |
| R5 (2026-06-14) | 10.5% | 2/26 | 7/60 |

Both axes moved backwards. must_cite dropped by two-thirds, from 6/26 to 2/26. The rubric F1 lost five binary criteria, from 12/60 to 7/60. Calling this anything other than a regression would not survive an honest reading.
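
The Combined column is consistent with pooling both axes into a single hit rate: (must_cite hits + rubric hits) over (26 + 60) items. That pooling is an inference from the table, not something the eval code is quoted as doing, and the helper name below is hypothetical. A minimal sketch of the arithmetic:

```python
def combined_rate(must_cite_hits, rubric_hits, must_cite_total=26, rubric_total=60):
    """Pooled hit rate across both axes (assumed definition, inferred from the table)."""
    return (must_cite_hits + rubric_hits) / (must_cite_total + rubric_total)

r4a = combined_rate(6, 12)   # 18 / 86
r5 = combined_rate(2, 7)     # 9 / 86

# Relative change on the must_cite axis alone: 6 hits down to 2 hits.
must_cite_drop = 1 - (2 / 6)

print(f"R4a {r4a:.1%}  R5 {r5:.1%}  must_cite drop {must_cite_drop:.0%}")
# prints: R4a 20.9%  R5 10.5%  must_cite drop 67%
```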

Why we ran R5 in the first place

R4a was the best round to date — first round where both must_cite and rubric F1 moved together in the right direction. The plan for R5 was to take the same training base and add a thin layer of new pairs extracted from external long-form Nigerian-economy context: AfDB African Economic Outlook 2024, World Bank Nigeria Development Update, CORE Econ’s “The Economy,” and Asotele’s own identity document. The hope was that grounded textbook plus development-report reasoning would lift the rubric F1 without disturbing the must_cite anchor.

Extraction put 86 pairs into staging. A three-pass review screen (numeric-claim equivalence, full-chunk grounding, percent-vs-percentage regex normalisation) plus a manual FLAG/DROP pass left 59 KEEPs ready for promotion. We backed up the R4a training corpus, merged the 59 in, and trained.

The diagnosis

The two-thirds drop on must_cite was the loudest signal. must_cite is a case-insensitive substring scan on the model’s generated answer — it looks for any of the 19 named-signal tokens the eval set declares (“Sahm-Jobs”, “Asotele Inflation Index”, “NGX Banking”, “Brent crude trend”, “R auto.arima FX forecast”, “Sahm-FX”…). If the model used to mention those signals and now does not, the most likely explanation is the model learned a different citation style — and a different style would also explain the rubric drop, because the rubric scores cite-or-refuse criteria.
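
To make the metric concrete, here is a minimal sketch of a case-insensitive substring scan of the kind described above. The function name is hypothetical, the list holds only the six signals quoted in this post (the eval set declares 19), and the real scorer may differ in details such as per-question signal lists.

```python
# Six of the 19 named signals, as quoted in the post; the full list lives in the eval set.
MUST_CITE_SIGNALS = [
    "Sahm-Jobs",
    "Asotele Inflation Index",
    "NGX Banking",
    "Brent crude trend",
    "R auto.arima FX forecast",
    "Sahm-FX",
]

def must_cite_hit(answer, signals=MUST_CITE_SIGNALS):
    """Return True if any named signal appears in the answer, ignoring case."""
    lowered = answer.casefold()
    return any(signal.casefold() in lowered for signal in signals)

# A named signal inline counts as a hit regardless of casing.
print(must_cite_hit("The SAHM-FX trigger fired in May."))  # True

# A chunk-path citation carries no named signal, so it scores nothing.
print(must_cite_hit(
    "Growth slowed. <sources>afdb-african-economic-outlook-2024/full-book#chunk_0041</sources>"
))  # False
```

The second call shows why a style shift alone is enough to sink the score: the answer can be well grounded and still contain none of the tokens the scan looks for.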

The first sanity check: what kind of citations do the answers in the existing 4,631-row training corpus actually carry, and what do the 59 R5 KEEPs carry?

| Corpus | Examples | Mention any must_cite signal | Wrap citations in <sources> |
|---|---|---|---|
| sft_pairs.jsonl (R4a base) | 469 | 12.2% | 0 / 469 (none) |
| qlora_train.jsonl (R4a corpus, 4,631 rows) | 4,631 | 13.0% | 0 / 4,631 (none) |
| R5 KEEP pairs | 59 | 0% | 59 / 59 (every one) |

The gold style R4a learned was indicator tables with inline named-signal citations — “Brent Crude $103.40” inside a table row, not wrapped in any tag. The 59 R5 KEEPs taught the opposite style: textbook prose ending in <sources>afdb-african-economic-outlook-2024/full-book#chunk_0041</sources>, a tag format and a chunk-path convention the model had literally never seen in 4,631 training examples. Zero of the 59 KEEPs reference any of the named signals the eval rewards.
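
A style audit like the one in the table can be run before any merge. The sketch below counts, per corpus, how many answers mention a named signal and how many carry a <sources> tag. It assumes one JSON object per line with the answer under an "answer" key; that schema, the function name, and the toy rows are all illustrative, not the real files.

```python
import json
import re

# Matches a complete <sources>...</sources> span, in any casing, across newlines.
SOURCES_TAG = re.compile(r"<sources>.*?</sources>", re.IGNORECASE | re.DOTALL)

def audit_corpus(jsonl_lines, signals, answer_key="answer"):
    """Count signal mentions and <sources>-tagged answers in a JSONL corpus."""
    total = mentions = tagged = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        answer = json.loads(line)[answer_key]
        lowered = answer.casefold()
        total += 1
        mentions += any(s.casefold() in lowered for s in signals)
        tagged += bool(SOURCES_TAG.search(answer))
    return {"examples": total, "signal_mentions": mentions, "sources_tagged": tagged}

# Toy rows for illustration only: one in each style.
toy_lines = [
    json.dumps({"answer": "| Sahm-FX | triggered |"}),
    json.dumps({"answer": "Growth slowed. <sources>some-doc/full-book#chunk_0001</sources>"}),
]
report = audit_corpus(toy_lines, signals=["Sahm-FX", "NGX Banking"])
print(report)
# prints: {'examples': 2, 'signal_mentions': 1, 'sources_tagged': 1}
```

Running the same function over the base corpus and the candidate batch side by side makes a style mismatch visible as two very different rows.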

The model trained on the new data did what trained models do: it learned the new style. It cites RAG chunk paths now, which is the wrong style for this eval.

What broke the upstream extractor

sources/r5_corpus/extract_qa_pairs.py is the script that produced the 86 staged pairs. The prompt templates explicitly tell the LLM to emit the offending tags. Direct quote from the template:

“End the answer with a <sources>{source_id}</sources> tag.”

Four prompt templates carry this instruction. {source_id} is the chunk identifier passed in by the runner — which is exactly the chunk path that ended up wrapped in tags in the output. The extractor was doing precisely what the prompt told it to do. The prompt was wrong.
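
The mechanism is easy to reproduce in isolation. The sketch below assumes the templates are filled with Python's str.format, which the {source_id} placeholder suggests but the post does not confirm; the template name, the first template line, and the helper are hypothetical. Only the quoted instruction and the chunk path come from the post.

```python
# One toy template carrying the instruction quoted above.
TEMPLATES = {
    "toy_template": (
        "Answer the question using only the passage.\n"
        "End the answer with a <sources>{source_id}</sources> tag."
    ),
}

def offending_templates(templates, marker="<sources>"):
    """Names of templates that instruct the LLM to emit the tag."""
    return sorted(name for name, text in templates.items() if marker in text)

# The runner passes the chunk identifier, so the chunk path lands inside the tag.
rendered = TEMPLATES["toy_template"].format(
    source_id="afdb-african-economic-outlook-2024/full-book#chunk_0041"
)

print(offending_templates(TEMPLATES))   # ['toy_template']
print(rendered.splitlines()[-1])
```

A check of this kind over the template set would have flagged the instruction before any pairs were extracted.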

The fix

Two layers, on opposite ends of the pipeline:

Extractor (upstream):

Promotion pipeline (downstream of review):

We dry-ran the new style filter against the existing r5_review_v3.json. 59 / 59 v3 KEEPs trigger sources_tag; 54 / 59 also trigger rag_chunk_path. The filter correctly rejects the entire current batch — which is the honest outcome, because the entire current batch was style-mismatched. No promotions slip through under the new pipeline; the next batch the extractor produces will be the first that’s eligible.
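
For illustration, a style filter with these two rules could look like the sketch below. The rule names sources_tag and rag_chunk_path come from the dry run above; the regular expressions, the "answer" key, and the function names are assumptions, and the chunk-path pattern is generalised from the single example path quoted earlier.

```python
import re

# Rule name -> pattern. Patterns are guesses at what each rule checks.
STYLE_RULES = {
    "sources_tag": re.compile(r"</?sources>", re.IGNORECASE),
    "rag_chunk_path": re.compile(r"[\w.-]+/[\w.-]+#chunk_\d+"),
}

def style_violations(answer):
    """Names of every style rule the answer triggers, in rule order."""
    return [name for name, pattern in STYLE_RULES.items() if pattern.search(answer)]

def split_for_promotion(pairs, answer_key="answer"):
    """Partition reviewed pairs into promotable ones and rejected ones with reasons."""
    kept, rejected = [], []
    for pair in pairs:
        hits = style_violations(pair[answer_key])
        if hits:
            rejected.append((pair, hits))
        else:
            kept.append(pair)
    return kept, rejected

# Toy pairs for illustration only.
pairs = [
    {"answer": "Prose. <sources>afdb-african-economic-outlook-2024/full-book#chunk_0041</sources>"},
    {"answer": "Prose. <sources>identity document</sources>"},
    {"answer": "| Sahm-FX | triggered |"},
]
kept, rejected = split_for_promotion(pairs)
print(len(kept), [hits for _, hits in rejected])
# prints: 1 [['sources_tag', 'rag_chunk_path'], ['sources_tag']]
```

The second toy pair shows how an answer can trigger sources_tag without rag_chunk_path, which is the shape of the 59 versus 54 split in the dry run.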

The pivot

The 2026-06-10 Pidgin post argued that the right workstream-2 measurement target is a hand-labelled BBC Pidgin sentiment test set, not the AfriSenti held-out test. The lesson generalises. The right R5 training data is not generic textbook prose with chunk-path citations — it’s labelled-domain content matching the eval’s named-signal vocabulary and the existing corpus’s output convention.

External long-form sources (AfDB, World Bank NDU, CORE Econ) are still useful — but as RAG retrieval context, not as fine-tuning data. They teach factual background, not Asotele’s output style. That’s a clean separation, and one we should have written down earlier.

What landed instead: the labelled-sentiment gap

While diagnosing R5, we ran an open-data scan for sources that close the labelled-sentiment gaps the prior posts named openly. The 2026-06-08 post named the Pidgin F1=0.460 problem; the 2026-06-10 follow-up confirmed that pretraining on Pidgin domain text alone moves the macro-F1 by +0.0009. The structural barrier is the 1.4% neutral class in AfriSenti pcm — and the right fix is labelled data in the target language at scale.

The scan found Davlan/nollysenti — the Nollywood-review sentiment corpus from Shode et al. 2023 (ACL Findings), translated by native speakers across all four AsoteleLingua target languages plus English. Today the ingest landed:

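| Language | Train | Validation | Test | Label balance |
|---|---|---|---|---|
| en | 1,302 | 100 | 500 | ~50/50 |
| ha | 410 | 100 | 500 | ~50/50 |
| ig | 410 | 100 | 500 | ~50/50 |
| pcm | 410 | 100 | 500 | ~50/50 |
| yo | 900 | 100 | 500 | ~50/50 |
| Total | 3,432 | 500 | 2,500 | 6,432 labelled rows |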

Sample of one Pidgin record, to anchor what the data actually looks like:

“Steamy soap opera wey unfold against di backdrop of caution histori lesson wey remind us say for Naija, di more wey tins dey change, di more dem dey craze.” — sentiment: positive

That’s exactly the register the F1=0.460 Pidgin classifier is failing on. The Igbo and Yoruba splits are equally clean. The Igbo corpus gap the 2026-06-04 post named — “Igbo finance text is near-absent on the open web” — is partly closed at the labelled-sentiment level here.

The cross-language parallel structure is unexpected upside. The validation and test splits are parallel-aligned across all five languages: the same Nollywood review translated five ways, indexable by row position. So zip(en_val, pcm_val) yields aligned sentence pairs that can serve as a parallel MT eval set alongside FLORES+. Same review, five languages, same label. Cross-lingual sentiment-consistency evaluation drops out of this as a free benchmark.
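
A minimal sketch of how row-position alignment could be used, assuming each split is already loaded as a list of dicts. The "text" and "label" field names, the function names, and the placeholder rows are hypothetical; the real dataset's column names should be checked before use.

```python
def parallel_pairs(src_rows, tgt_rows, text_key="text"):
    """Pair up reviews by row position to form (source, target) MT eval pairs."""
    if len(src_rows) != len(tgt_rows):
        raise ValueError("splits must have equal length to align by row position")
    return [(s[text_key], t[text_key]) for s, t in zip(src_rows, tgt_rows)]

def label_mismatches(src_rows, tgt_rows, label_key="label"):
    """Row indices where gold labels differ; non-empty means the alignment is off."""
    return [i for i, (s, t) in enumerate(zip(src_rows, tgt_rows))
            if s[label_key] != t[label_key]]

def prediction_consistency(preds_a, preds_b):
    """Fraction of aligned rows where a classifier predicts the same label in both languages."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("need two non-empty prediction lists of equal length")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Placeholder rows standing in for the real validation splits.
en_val = [{"text": "en review 0", "label": "positive"},
          {"text": "en review 1", "label": "negative"}]
pcm_val = [{"text": "pcm review 0", "label": "positive"},
           {"text": "pcm review 1", "label": "negative"}]

print(parallel_pairs(en_val, pcm_val)[0])   # ('en review 0', 'pcm review 0')
print(label_mismatches(en_val, pcm_val))    # []
print(prediction_consistency(["positive", "negative"], ["positive", "positive"]))  # 0.5
```

Checking label_mismatches first is a cheap guard: if gold labels disagree at any row position, the splits are not aligned the way this section assumes.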

What’s next

The R5 round was a backwards step. The diagnosis is in. The corpus that should have driven the round is now in hand. The post-mortem closes here; the next round starts whenever the data and the discipline both line up.

