For millions of South Africans, using artificial intelligence in their home language has often meant dealing with inaccurate responses, limited support or tools that simply do not understand local context.
Now, a research team from the University of Cape Town is trying to change that with a homegrown AI language model designed specifically for South Africa’s linguistic landscape.
The project has drawn attention across the local tech and academic space and was recently spotlighted by both UCT researchers and 2oceansvibe News. The team is preparing to present its findings at the Language Resources and Evaluation Conference (LREC) in Mallorca, Spain.
The breakthrough is centred on MzansiLM, a multilingual AI language model trained across South Africa’s 11 official written languages, and MzansiText, a new dataset compiled to support those languages.
Researchers say the initiative could help close a longstanding gap in artificial intelligence, where African languages have historically been underrepresented in mainstream systems.
The project was led by Anri Lombard and Dr Jan Buys from UCT’s Department of Computer Science, together with Dr Francois Meyer and a broader network of collaborators.
While AI-powered tools have become embedded in everyday life globally, many South Africans still struggle to access reliable AI support in languages other than English.
Researchers involved in the project say the issue largely comes down to data availability.
‘In language modelling, languages are considered low resource, primarily because there are much fewer and smaller textual datasets available in these languages for training language models,’ said Dr Buys.
According to the team, nine of South Africa’s official languages remain classified as ‘low-resource’ in the AI space, meaning there is limited training material available for developers building language technologies.
Although languages such as isiZulu and isiXhosa have received growing international research attention in recent years, others, including isiNdebele and Sepedi, remain significantly underrepresented.
That imbalance is what MzansiLM hopes to address.
‘MzansiLM is believed to be the first publicly available decoder-only language model to explicitly target all 11 languages,’ the researchers noted.
Unlike commercial chatbots such as ChatGPT or Claude, MzansiLM is not designed for open-ended conversation. Instead, researchers describe it as a foundational AI model that developers can adapt for specialised tasks.
That could include tools for summarising information, annotating data or creating services that allow users to interact digitally in their home language.
‘With MzansiLM, we wanted to build a single model focused specifically on South Africa that covers all 11 official written languages, including those that are often left out,’ said Dr Meyer.
The model is relatively small compared to major commercial AI systems, containing 125 million parameters, but researchers say it has already shown promising results on local-language benchmarks.
Tests conducted by the team found that MzansiLM competed strongly against significantly larger open-source systems in several South African languages, particularly in isiXhosa text generation tasks.
For Lombard, whose master’s research helped shape the project, the work also reflects a wider shift happening across African AI development.
‘One thing that stood out to me is that publicly available models tended to cover only a subset of the South African languages we care about,’ he stated.
‘MzansiLM was meant to provide a small decoder-only baseline that future work can compare against and build on.’
South Africa currently recognises 12 official languages, including South African Sign Language, which gained official status in 2023.
Technology experts have increasingly warned that if African languages remain absent from AI development, millions of users risk being excluded from rapidly evolving digital services.
The UCT team says openness and collaboration will be key to preventing that.
‘Closing the gap between South African languages and the capabilities now available in English will require sustained, collective effort,’ Lombard said.
Researchers have made both MzansiText and MzansiLM publicly available in an effort to encourage further development and collaboration within the African natural language processing research community.