My Research & Publications

Throughout my career, I have been involved in a variety of research initiatives, primarily centered on Computer Vision and Natural Language Processing.

Authors

Jesujoba O. Alabi, Israel Abebe Azime, Miaoran Zhang, Cristina España-Bonet, Rachel Bawden, Dawei Zhu, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Idris Akinade, Iffat Maab, Davis David, Shamsuddeen Hassan Muhammad, Neo Putini, David O. Ademuyiwa, Andrew Caines, Dietrich Klakow

Abstract

This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.

Authors

Mahadia Tunga and Davis David

Abstract

Swahili is the most widely spoken language in Africa, with over 200 million speakers. Despite its popularity on the continent, there is insufficient NLP research on the language, which is attributed to the shortage of high-quality annotated datasets. In this paper, we introduce a Swahili dataset collected from Twitter and designed for sentiment analysis tasks. The dataset comprises 8.7K tweets on products and services offered by telecommunications companies based in Tanzania. The tweets were annotated manually by native Swahili speakers into three sentiment classes (Positive, Negative, and Neutral). We describe the steps involved in gathering and annotating the tweets, including the data collection method, annotation process, and dataset statistics. We tested the suitability of the dataset using five classical machine learning models, which produced F1-scores ranging from 0.6889 to 0.7522, and five pre-trained transformer models, which produced F1-scores ranging from 0.7001 to 0.7306. We also discuss the challenges encountered during data collection and annotation: bilingual tweets, the translation of emojis, the absence of Swahili language recognition on the Twitter platform, Swahili words or phrases with multiple contextual meanings, informal vocabulary and slang, and hashtag misclassification.

Authors

Kazimoto, D., Baadel, S., David, D., Mutahaba, R., Rugumyamheto, J.

Abstract

This paper explores Tausi’s pioneering credit risk scoring engine, which revolutionizes credit risk modeling and lending practices by harnessing alternative data sources. Traditional credit assessment methods often overlook segments of the population lacking established credit histories, particularly in regions like Africa. The aim of this paper is to showcase Tausi’s approach, which addresses this gap by integrating non-traditional datasets, including social media activity and utility bill statements, to provide lenders with a comprehensive view of borrowers’ creditworthiness. This approach is particularly beneficial for borrowers without traditional credit histories, offering them opportunities to access financial services that were previously inaccessible. Through the integration of Tausi’s credit risk scoring engine into loan management systems (LMS), borrowers gain real-time visibility into their credit scores and credit limits, fostering transparency and informed financial decision-making. Moreover, the system proactively monitors borrower behavior, enabling automatic adjustments to credit limits based on evolving risk profiles. Regular updates to the scoring model ensure adaptability to changes in borrower transactional and repayment behaviors over time, thereby reducing the risk of default or delinquency. The adoption of Tausi’s innovative approach to credit scoring has far-reaching implications for various stakeholders, including credit risk teams, lenders, government regulators, and investors. By expanding access to credit and promoting financial inclusion, Tausi’s platform not only enhances economic opportunities for individuals but also fosters sustainable economic growth and development. The approach can be replicated across the African continent.

Authors

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Sabah al-azzawi, Blessing K. Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Oluwaseyi Ajayi, Tatiana Moteu Ngoli, Brian Odhiambo, Abraham Toluwase Owodunni, Nnaemeka C. Obiefuna, Shamsuddeen Hassan Muhammad, Saheed Salahudeen Abdullahi, Mesay Gemeda Yigezu, Tajuddeen Gwadabe, Idris Abdulmumin, Mahlet Taye Bame, Oluwabusayo Olufunke Awoyomi, Iyanuoluwa Shode, Tolulope Anu Adelani, Habiba Abdulganiy Kailani, Abdul-Hakeem Omotayo, Adetola Adeeko, Afolabi Abeeb, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Raphael Ogbu, Chinedu E. Mbonu, Chiamaka I. Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola F. Awosan, Tadesse Kebede Guge, Sakayo Toadoum Sari, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Ussen Kimanuka, Kanda Patrick Tshinu, Thina Diko, Siyanda Nxakama, Abdulmejid Tuni Johar, Sinodos Gebre, Muhidin Mohamed, Shafie Abdi Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire, Pontus Stenetorp

Abstract

African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieved more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.

Authors

Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Felermino Dário Mário António Ali, Davis David, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, Steven Arthur

Abstract

Africa is home to over 2000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, annotation process, and related challenges when curating each of the datasets. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is publicly available and can also be loaded as a Hugging Face dataset.

Authors

Chris Chinenye Emezue, Hellina Hailu Nigatu, Cynthia Thinwa, Helper Zhou, Shamsuddeen Hassan Muhammad, Lerato Louis, Idris Abdulmumin, Davis David, Samuel Gbenga Oyerinde, and others

Abstract

Stopwords are fundamental in Natural Language Processing (NLP) techniques for information retrieval. One of the common tasks in preprocessing text data is the removal of stopwords. Currently, while high-resource languages like English benefit from the availability of several stopword lists, low-resource languages, such as those found on the African continent, have none that are standardized and available to use in NLP packages. Stopwords in the context of African languages are understudied and can reveal information about the crossover between languages. The African Stopwords project aims to study and curate stopwords for African languages. In this paper, we present our current progress on ten African languages as well as future plans for the project.
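
As a rough illustration of the stopword-removal step mentioned in the abstract above, here is a minimal Python sketch. The stopword set is a handful of common Swahili function words picked for this example only; it is an assumption and not the list curated by the African Stopwords project. The function name `remove_stopwords` is likewise hypothetical.

```python
import re

# Illustrative only: a few common Swahili function words chosen for this
# sketch. This is NOT the curated list from the African Stopwords project.
SWAHILI_STOPWORDS = {"na", "ya", "wa", "kwa", "ni", "za"}


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


if __name__ == "__main__":
    sentence = "Huduma ya mtandao ni nzuri kwa wateja"
    print(remove_stopwords(sentence, SWAHILI_STOPWORDS))
```

A plain set lookup keeps the filter fast and makes it easy to swap in a different language's list once a standardized one is available.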

Authors

Claire Babirye, Joyce Nakatumba-Nabende, Andrew Katumba, Ronald Ogwang, Jeremy Tusubira Francis, Jonathan Mukiibi, Medadi Ssentanda, Lilian D Wanzare, Davis David

Abstract

Africa has over 2000 languages; however, those languages are not well represented in the existing Natural Language Processing ecosystem. African languages lack the essential digital resources needed to engage effectively with advancing language technologies. This growing gap has motivated researchers to build resources for African languages so that the various Natural Language Processing methods can be transferred to them. This paper discusses the process we took to create, curate and annotate text and speech datasets for low-resourced languages in East Africa. This paper focuses on five languages. Four of them, Luganda, Runyankore-Rukiga, Acholi, and Lumasaaba, are mainly spoken in Uganda, while Kiswahili is widely spoken across East Africa. We ran baseline machine translation models on the English-Luganda dataset in the parallel text corpora and Automatic Speech Recognition (ASR) models on the Luganda speech dataset. We recorded a BiLingual Evaluation Understudy (BLEU) score of 37 for the English-Luganda model and a BLEU score of 36.8 for the Luganda-English model. For the ASR experiments, we obtained a Word Error Rate (WER) of 33%.

Keywords: Speech, Text, Luganda, Common Voice, ASR, Swahili
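
For readers unfamiliar with the Word Error Rate metric reported in the abstract above, the following Python sketch shows the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. It is a generic illustration, not the evaluation code used in the paper, and the function name `word_error_rate` and the example strings are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Return WER: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference word
                    current[j - 1] + 1,  # insertion of a hypothesis word
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 33% means roughly one word-level error for every three words of reference transcript.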

Authors

Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I. Adelani, Amelia Taylor, Jamiil Toure ALI, Kevin Degila, Momboladji Balogoun, Thierno Ibrahima DIOP, Davis David, Chayma Fourati, Hatem Haddad, Malek Naski

Abstract

Advances in speech and language technologies enable tools such as voice-search, text-to-speech, speech recognition and machine translation. These are however only available for high resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D - African Language Program, a 3-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets through hosting of competitive ML challenges.

Authors

D. Adelani, J. Abbott, G. Neubig, D. D. Mwanganda, J. Mukiibi, and others

Abstract

We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.

Authors

Sophia Sanga, Victor Mero, Dina Machuve and Davis Mwanganda

Abstract

Smallholder farmers in Tanzania face a lack of tools for early detection of banana diseases. This study aimed to develop a mobile application for early detection of Fusarium wilt race 1 and black Sigatoka banana diseases using deep learning. We used a dataset of 3000 banana leaf images. We pre-trained our model on the Resnet152 and Inceptionv3 Convolutional Neural Network architectures. Resnet152 achieved an accuracy of 99.2% and Inceptionv3 an accuracy of 95.41%. For deployment on Android mobile phones, we chose Inceptionv3 since it has lower memory requirements than Resnet152. In a real environment, the mobile application detected the two diseases with a confidence level of 99% of the captured leaf area. This result indicates the potential for smallholder farmers to improve banana yields by using a tool for early detection of diseases.