Hausa first: the AsoteleLingua pilot, and why it shipped before the proposal
The LINGUA Africa open call closes on 15 June. The strongest version of that proposal is one where the work it funds is already running in a pilot form — not a slide deck, but a directory of code that produces the same kind of output the funded workstream will scale. So the Hausa pilot went in first.
sources/asotele_lingua/ is now the Nigerian-language financial NLP module of the project. Three pieces ship today: a Hausa-language financial scraper, a Hausa financial-terminology seed dictionary, and a zero-shot sentiment baseline. Together they’re the proof-point the proposal stands on. Each is deliberately a starting artifact, not a finished one; what they document is the gap that LINGUA Africa funding would close.
Why Hausa first
Of the four target languages (Hausa, Yoruba, Igbo, Nigerian Pidgin), Hausa ships first for three reasons.
It has the largest speaker base. Roughly 70 million Nigerian Hausa speakers, the biggest single bloc among the four, and the most underrepresented in formal financial NLP. Most existing African-language NLP work treats Hausa as a general news domain; financial Hausa specifically is open ground.
It has the cleanest open data source. BBC News Hausa runs a daily Kasuwanci (Business) section with a public RSS feed. No paywall, no scraping fight, no per-article HTML parsing — a stable XML feed produced and maintained by a major outlet whose editorial standards are documented. That’s the kind of source you want to anchor a pilot on.
It has a credible off-the-shelf model substrate. The AfroXLMR family from Masakhane and DICE Lab explicitly includes Hausa in its multilingual pretraining mix, so a meaningful zero-shot sentiment baseline is available today without any fine-tuning. The other three target languages have weaker substrate coverage and need more upstream work. Starting where the model is strongest lets the pilot measure the gap — the part the LINGUA workstream funds — against a baseline that already does something, not nothing.
What’s on disk
The scraper
fetch_bbc_hausa.py reads the BBC News Hausa RSS feed and filters for finance-domain items by Hausa-language keyword matching. The keyword set is drawn from three sources: the BBC Kasuwanci section’s own headline vocabulary, CBN Hausa-language press releases, and Bargery’s 1934 Hausa dictionary for older terms that survive in commerce. About thirty keywords, including:
tattalin arziƙi: economy (literally “management of wealth”)
kasafin kuɗi: budget
hauhawar farashi: inflation (literally “rising prices”)
Babban Bankin Najeriya: Central Bank of Nigeria
ɗanyen mai: crude oil (the longer form is intentional; bare mai also means “owner of” and over-matches)
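The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of fetch_bbc_hausa.py: the function names, the five-keyword subset, and the sample feed (with invented headlines and example.org links) are all assumptions. It parses RSS items and keeps those whose title or description contains a keyword phrase as whole words.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Illustrative subset only; the post says the real list has about thirty keywords.
FINANCE_KEYWORDS = [
    "tattalin arziƙi",
    "kasafin kuɗi",
    "hauhawar farashi",
    "Babban Bankin Najeriya",
    "ɗanyen mai",
]


def _norm(text: str) -> str:
    """NFC-normalise and casefold so hooked letters (ƙ, ɗ) compare reliably."""
    return unicodedata.normalize("NFC", text).casefold()


# One alternation over all phrases, guarded so a phrase only matches whole words.
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(_norm(k)) for k in FINANCE_KEYWORDS) + r")(?!\w)"
)


def is_finance_item(title: str, description: str = "") -> bool:
    """True if any keyword phrase occurs in the title or description."""
    return _PATTERN.search(_norm(f"{title} {description}")) is not None


def filter_feed(rss_xml: str) -> list[dict]:
    """Parse an RSS document and keep only the finance-relevant items."""
    root = ET.fromstring(rss_xml)
    kept = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        desc = item.findtext("description", default="")
        if is_finance_item(title, desc):
            kept.append({"title": title, "link": item.findtext("link", default="")})
    return kept


# Invented sample feed for demonstration; not real BBC Hausa content.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Hauhawar farashi ta ƙaru a Najeriya</title><link>https://example.org/1</link></item>
<item><title>Mai gida ya dawo daga tafiya</title><link>https://example.org/2</link></item>
<item><title>Farashin ɗanyen mai ya faɗi</title><link>https://example.org/3</link></item>
</channel></rss>"""

kept_items = filter_feed(SAMPLE_FEED)
```

The second sample item contains the bare word mai and is dropped, while the third matches on the full phrase ɗanyen mai, which is the over-matching case the longer keyword is meant to avoid.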
Today’s pull: 35 items in the feed, 4 kept as finance-relevant. Headlines included a piece on currency depreciation following the Iran war, a piece touching the Atiku ADC primary with macro implications, and a piece on the economic drivers pushing displaced communities toward Boko Haram-held islands. That last one is the kind of cross-domain story the keyword filter is meant to catch — politics and security framed in economic vocabulary, which a general Hausa news classifier would route to a different bucket.
Four finance items from one outlet on one day is small. Collected daily from BBC Hausa over a year, and extended to Voice of America Hausa, Aminiya, and a handful of CBN press releases (all targets in the funded workstream), it becomes a real corpus.
The terminology
terminology_hausa.json holds 42 Hausa financial terms with English glosses, etymology notes, and CBN/NBS usage context. Each entry carries review_required: true until LINGUA Africa workstream 1 funds the professional Hausa-speaker review. The current set anchors the dictionary at a credible starting state — not a finished one.
Two sample entries:
{"hau": "tattalin arziƙi", "en": "economy",
"notes": "lit. 'thrift/management of wealth'; the standard term used by CBN press"}
{"hau": "kuɗi", "en": "money",
"notes": "general term; plural kuɗaɗe"}
A separate todo_priority_terms list flags 16 additional terms that need translation work — Treasury Bill, MPR, CRR, repo rate, and the rest of the policy-rate vocabulary. Those are the terms a Hausa-speaking banking analyst would expect to find in a localised CBN brief, and they’re the highest-priority items for the funded review.
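One way to keep unreviewed entries from leaking downstream is to validate the file and expose only signed-off terms. The sketch below is an assumption about how that could look: the top-level "terms" key, the function names, and the shortened todo list are hypothetical, while the per-entry keys (hau, en, notes) and the review_required flag come from the description above.

```python
import json

# Hypothetical file layout: a top-level object holding "terms" and
# "todo_priority_terms". Only the per-entry keys are taken from the post.
SAMPLE = """
{
  "terms": [
    {"hau": "tattalin arziƙi", "en": "economy",
     "notes": "lit. 'thrift/management of wealth'; the standard term used by CBN press",
     "review_required": true},
    {"hau": "kuɗi", "en": "money",
     "notes": "general term; plural kuɗaɗe",
     "review_required": true}
  ],
  "todo_priority_terms": ["Treasury Bill", "MPR", "CRR", "repo rate"]
}
"""

REQUIRED_KEYS = {"hau", "en", "notes", "review_required"}


def validate_terms(doc: dict) -> list[str]:
    """Return human-readable problems; an empty list means the file is well-formed."""
    problems = []
    seen = set()
    for i, entry in enumerate(doc.get("terms", [])):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        hau = entry.get("hau")
        if hau in seen:
            problems.append(f"entry {i}: duplicate term {hau!r}")
        seen.add(hau)
    return problems


def reviewed_lookup(doc: dict) -> dict[str, str]:
    """English-to-Hausa map restricted to entries a reviewer has signed off.

    A missing flag is treated as 'still needs review', the conservative default.
    """
    return {e["en"]: e["hau"] for e in doc["terms"] if not e.get("review_required", True)}


doc = json.loads(SAMPLE)
problems = validate_terms(doc)
approved = reviewed_lookup(doc)
```

With every entry still flagged, approved is empty, which is the intended state before the professional review happens.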
The sentiment baseline
sentiment_baseline.py runs AfroXLMR-large as a zero-shot classifier over the BBC Hausa financial headlines, using three Hausa polarity phrases as the candidate label set:
mai kyau: good
mara kyau: bad
matsakaici: neutral
Outputs go to data/asotele_lingua/sentiment_baseline_hausa.csv plus a summary JSON marked BASELINE — zero-shot in the metadata, so anything downstream that consumes it can tell it apart from a fine-tuned classifier.
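The output step might look something like the sketch below. The CSV filename is the one named above; the summary filename, the column names, and the "status"/"method" metadata fields are assumptions standing in for whatever sentiment_baseline.py actually writes. The demo row uses a placeholder score, not a model output.

```python
import csv
import json
import tempfile
from collections import Counter
from pathlib import Path

LABELS = ("mai kyau", "mara kyau", "matsakaici")  # good, bad, neutral


def write_baseline(rows, out_dir):
    """Write predictions to CSV plus a summary JSON tagged as a baseline.

    rows: iterable of (headline, label, score) tuples from some classifier.
    """
    rows = list(rows)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    csv_path = out_dir / "sentiment_baseline_hausa.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["headline", "label", "score"])
        writer.writerows(rows)

    counts = Counter(label for _, label, _ in rows)
    summary = {
        # Hypothetical metadata fields that mark this run as a zero-shot baseline.
        "status": "BASELINE",
        "method": "zero-shot",
        "n_items": len(rows),
        "label_counts": {label: counts.get(label, 0) for label in LABELS},
    }
    json_path = out_dir / "sentiment_baseline_hausa_summary.json"  # assumed name
    json_path.write_text(
        json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return csv_path, json_path


def is_baseline(summary_path) -> bool:
    """What a downstream consumer could check before trusting the labels."""
    meta = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    return meta.get("status") == "BASELINE"


# Demo in a temporary directory; the headline and score are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = write_baseline(
        [("Farashin ɗanyen mai ya faɗi", "mara kyau", 0.61)], tmp
    )
    demo_summary = json.loads(json_path.read_text(encoding="utf-8"))
    demo_is_baseline = is_baseline(json_path)
```

Keeping the marker in a machine-readable field rather than only in a filename means a consumer can refuse baseline output with a single check.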
This is a baseline by design. AfroXLMR was trained on general Hausa news, not Hausa financial news, so it will mislabel domain headlines in ways the funded workstream is meant to characterise and fix. What ships today is the measurement instrument. The numbers it produces are how we’ll know whether the funded fine-tune did anything.
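For the baseline to serve as a measurement instrument, its predictions need a score against human labels that a later fine-tune can be compared to. The post does not say which metric will be used; macro-averaged F1 over the three labels is one plausible choice, sketched here with toy labels that are illustrative only and not results.

```python
LABELS = ("mai kyau", "mara kyau", "matsakaici")  # good, bad, neutral


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1; a label with no support and no
    predictions contributes 0.0."""
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy data: hypothetical human labels and hypothetical baseline predictions.
gold = ["mara kyau", "mara kyau", "matsakaici", "mai kyau"]
baseline_pred = ["matsakaici", "mara kyau", "matsakaici", "matsakaici"]

baseline_score = macro_f1(gold, baseline_pred)
```

Running the same function over the fine-tuned model's predictions on the same hand-labelled headlines would give the before/after comparison the paragraph above describes.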
How this fits Asotele’s two-tier strategy
The B2B bank tier needs English with strong citation discipline — Round 4’s combined 20.9% and the cite-or-refuse RAG layer from two days ago. The SME tier needs the chat surface to work in the languages an actual Nigerian small-business owner speaks. AsoteleLingua is the second of those two surfaces.
Both tiers run on the same Qwen3-14B fine-tune underneath. The difference is the input/output layer: the SME interface takes a question in Hausa (or Yoruba, Igbo, Pidgin), routes the financial-domain content to the same retrieval and reasoning stack the bank tier uses, and produces an answer in the user’s language using terminology validated against a Nigerian financial dictionary. The terminology dictionary and the sentiment classifier are the two pieces of that pipeline that this pilot supplies.
What’s next
Three things are queued.
Submit the LINGUA Africa proposal in the next week, before the 15 June deadline. The funding/03_lingua_africa_proposal.md draft already carries a “Pre-existing pilot” section pointing at this code; what’s left is final review and submission.
Run the sentiment baseline end-to-end on the four finance items collected today, plus the next few daily pulls, and publish the numbers. A baseline isn’t a baseline until it has a number attached.
Start the Yoruba scraper. Pulse FM Yoruba, BBC Yoruba, and Premium Times Yoruba are the candidate sources; the same review_required: true flag will carry forward to the Yoruba terminology file. The pattern is now templated — Hausa first, the other three at roughly the same shape.
The shape of the work is right. The size of it is what LINGUA funds.