The IrokoBench results, published at NAACL 2025 and now widely used to measure African language performance, reveal a gap the industry has been slow to address. GPT-4o scores 72.5% on English tasks but drops to 48.1% on African languages. LLaMA 3 (70B) fares even worse, averaging just 25.5%, a full 45 points below its English score. At that level, the model is not merely underperforming; it is essentially guessing. And because these benchmarks test real tasks, comprehension, reasoning, and question answering rather than simple text classification, the gap reflects the kind of failure users actually experience.
This failure starts with how the language is processed. Hausa uses hooked consonants such as ɗ, ƙ, and ɓ, but standard tokenizers often shatter these characters into byte fragments or strip the diacritics that give them meaning, as the short sketch below illustrates. The result is text that looks like Hausa but no longer carries the distinctions the language depends on. Words that differ only in a hooked consonant collapse into one another, so that even terms for physical and digital actions get mixed up. The model is not reasoning badly; it has been handed a flattened version of the language in which those errors are already baked in. Developers call this problem tokenization trauma: the language is broken before the model even starts. The technical flaw quickly becomes a human one, as misinterpretations and errors in language processing spill over into how people are understood and represented.
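To make the tokenization problem concrete, here is a minimal sketch. It assumes the Hugging Face transformers library and uses the GPT-2 byte-level BPE tokenizer as a stand-in for the "standard tokenizers" discussed above; the Hausa words and their ASCII-flattened spellings are illustrative, not drawn from IrokoBench.

```python
# Minimal sketch: how a byte-level BPE tokenizer handles Hausa hooked consonants.
# Assumes the Hugging Face `transformers` package; GPT-2's tokenizer stands in
# for a "standard" English-centric tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Each pair: a Hausa word with a hooked consonant, then its ASCII-flattened spelling.
pairs = [
    ("ƙasa", "kasa"),    # "land/country" vs. flattened form
    ("ɗaki", "daki"),    # "room" vs. flattened form
    ("ɓarna", "barna"),  # "damage" vs. flattened form
]

for hausa, flattened in pairs:
    hausa_tokens = tokenizer.tokenize(hausa)
    flat_tokens = tokenizer.tokenize(flattened)
    # Hooked consonants (ɗ, ƙ, ɓ) are rare in the training corpus, so they are
    # typically split into raw byte fragments, while the flattened spelling
    # maps to fewer, more coherent subword tokens.
    print(f"{hausa!r}: {len(hausa_tokens)} tokens -> {hausa_tokens}")
    print(f"{flattened!r}: {len(flat_tokens)} tokens -> {flat_tokens}\n")
```

If the tokenizer behaves as described above, the hooked forms come out noticeably more fragmented than their flattened counterparts, which is exactly the loss of distinction the paragraph describes: two different words end up looking nearly identical to the model.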
While accuracy matters, the most serious issue is bias. Research from Stanford's Institute for Human-Centered AI found that GPT-3, when given Muslim-related prompts, produced violent completions more than three times as often as it did for comparable Christian prompts. The problem persists in newer models trained on similar internet data. For Hausa users, most of whom are Muslim, this means the model can inject violence into responses to completely harmless questions. It is fair to call this a distortion of cultural reality: these models reflect the biases in their training data, and when those biases reach millions of Hausa users, they become a real social problem.
The tendency to link Islam with conflict has deep historical roots. Many of the datasets used to train today's language models draw on archives, reports, and colonial-era sources that routinely misrepresented African communities, religions, and cultures. As a result, colonial data collection still shapes how Muslim-majority societies are seen, often through a lens of suspicion and violence. These old biases resurface whenever AI repeats those sources uncritically. Fixing AI bias therefore takes more than technical solutions; it requires asking whose stories are told and how past power dynamics still shape even the newest technologies.