Davis David Portfolio

Authors

Mahadia Tunga, Davis David

Abstract

Despite sentiment analysis being one of the most popular applications in Natural Language Processing (NLP), most studies are skewed towards languages with a rich corpus (language database). Less emphasis has been placed on low-resource languages like Swahili. Swahili is the official language of the African Union and of 4 countries in East Africa, and is spoken by many people on the African continent. This study performed sentiment analysis using 3,000 tweets hosted on the Zindi Africa platform. Data was processed using a term frequency-inverse document frequency vectorization method, and five classical machine learning algorithms (RandomForest, XGBoost, CatBoost, HistogramGradientBoost, and LightGradientBoost) were trained and evaluated using the collected tweets. We found that CatBoost produced the highest performance in general compared to other classical models, with 0.610 accuracy, 0.470 F1 score, 0.522 Precision and 0.462 Recall. The F1-score of 0.47 indicates modest performance and reflects the challenges posed by the small dataset and the complexity of Swahili sentiment analysis. This study offers a comprehensive overview of the relative performance of various classical machine learning models applied to Swahili social media sentiment data. These insights can help researchers make informed choices when selecting appropriate classical machine learning algorithms for sentiment analysis in a similar context.
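
The abstract names term frequency-inverse document frequency (TF-IDF) vectorization as the preprocessing step. As a rough illustration only, the sketch below computes TF-IDF weights in plain Python. The `tfidf` helper, the smoothed IDF formula, and the three toy tweets are assumptions made for this example; they are not taken from the study, which may have used a library implementation with different settings.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Hypothetical helper: relative term frequency times a smoothed IDF,
    with whitespace tokenization and no vector normalization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Made-up toy tweets for illustration, not samples from the Zindi dataset.
toy_tweets = ["huduma nzuri sana", "huduma mbaya", "mtandao mbaya sana"]
vectors = tfidf(toy_tweets)
for tweet, weights in zip(toy_tweets, vectors):
    print(tweet, weights)
```

Terms that occur in fewer tweets receive a larger weight than terms shared across many tweets, which is the property a downstream classifier relies on.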

Authors

Jesujoba O. Alabi, Israel Abebe Azime, Miaoran Zhang, Cristina España-Bonet, Rachel Bawden, Dawei Zhu, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Idris Akinade, Iffat Maab, Davis David, Shamsuddeen Hassan Muhammad, Neo Putini, David O. Ademuyiwa, Andrew Caines, Dietrich Klakow

Abstract

This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.

Authors

Mahadia Tunga and Davis David

Abstract

Swahili is the most widely spoken language in Africa with over 200 million speakers. Despite its popularity on the continent, there is insufficient NLP research conducted on the language. This is attributed to the shortage of high-quality annotated datasets. In this paper, we introduce a Swahili dataset collected from Twitter, specifically designed to serve sentiment analysis tasks. The dataset comprises 8.7K tweets on products and services offered by telecommunications companies based in Tanzania. The tweets in the dataset are annotated manually by native Swahili speakers into three sentiments (Positive, Negative and Neutral). We provide a detailed description of the steps involved in gathering and annotating the tweets, including the data collection method, annotation process, and dataset statistics. We tested the suitability of the dataset using five classical machine learning models, producing F1-scores ranging from 0.6889 to 0.7522, and five pre-trained transformer models, producing F1-scores ranging from 0.7001 to 0.7306. We also discuss the challenges encountered during data collection and annotation: bilingual tweets, the translation of emojis, the absence of Swahili language recognition by the Twitter platform, Swahili words or phrases with multiple contextual meanings, informal vocabulary and slang, and hashtag misclassification.

Authors

Kazimoto, D., Baadel, S., David, D., Mutahaba, R., Rugumyamheto, J.

Abstract

This paper explores Tausi’s pioneering credit risk scoring engine, which revolutionizes credit risk modeling and lending practices by harnessing alternative data sources. Traditional credit assessment methods often overlook segments of the population lacking established credit histories, particularly in regions like Africa. The aim of this paper is to showcase Tausi’s approach that addresses this gap by integrating non-traditional datasets, including social media activity and utility bills statements, to provide lenders with a comprehensive view of borrowers’ creditworthiness. This approach is particularly beneficial for borrowers without traditional credit histories, offering them opportunities to access financial services that were previously inaccessible. Through the integration of Tausi’s credit risk scoring engine into loan management systems (LMS), borrowers gain real-time visibility into their credit scores and credit limits, fostering transparency and informed financial decision-making. Moreover, the system proactively monitors borrower behavior, enabling automatic adjustments to credit limits based on evolving risk profiles. Regular updates to the scoring model ensure adaptability to changes in borrower transactional and repayment behaviors over time, thereby reducing the risk of default or delinquency. The adoption of Tausi’s innovative approach to credit scoring has far-reaching implications for various stakeholders, including credit risk teams, lenders, government regulators, and investors. By expanding access to credit and promoting financial inclusion, Tausi’s platform not only enhances economic opportunities for individuals but also fosters sustainable economic growth and development and can be replicated across the African continent.

Authors

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Sabah al-azzawi, Blessing K. Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Oluwaseyi Ajayi, Tatiana Moteu Ngoli, Brian Odhiambo, Abraham Toluwase Owodunni, Nnaemeka C. Obiefuna, Shamsuddeen Hassan Muhammad, Saheed Salahudeen Abdullahi, Mesay Gemeda Yigezu, Tajuddeen Gwadabe, Idris Abdulmumin, Mahlet Taye Bame, Oluwabusayo Olufunke Awoyomi, Iyanuoluwa Shode, Tolulope Anu Adelani, Habiba Abdulganiy Kailani, Abdul-Hakeem Omotayo, Adetola Adeeko, Afolabi Abeeb, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Raphael Ogbu, Chinedu E. Mbonu, Chiamaka I. Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola F. Awosan, Tadesse Kebede Guge, Sakayo Toadoum Sari, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Ussen Kimanuka, Kanda Patrick Tshinu, Thina Diko, Siyanda Nxakama, Abdulmejid Tuni Johar, Sinodos Gebre, Muhidin Mohamed, Shafie Abdi Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire, Pontus Stenetorp

Abstract

African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieved more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.
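
For reference, the "more than 90%" figure in the few-shot result follows directly from the two F1 scores reported in the abstract:

```latex
\frac{86.0}{92.6} \approx 0.929
\quad \text{(about } 92.9\% \text{ of the fully supervised score)}
```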

Authors

Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Felermino Dário Mário António Ali, Davis David, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, Steven Arthur

Abstract

Africa is home to over 2000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, annotation process, and related challenges when curating each of the datasets. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available online and can also be loaded as a Hugging Face dataset.

Authors

Chris Chinenye Emezue, Hellina Hailu Nigatu, Cynthia Thinwa, Helper Zhou, Shamsuddeen Hassan Muhammad, Lerato Louis, Idris Abdulmumin, Davis David, Samuel Gbenga Oyerinde and others.

Abstract

Stopwords are fundamental in Natural Language Processing (NLP) techniques for information retrieval. One of the common tasks in preprocessing of text data is the removal of stopwords. Currently, while high-resource languages like English benefit from the availability of several stopwords, low-resource languages, such as those found in the African continent, have none that are standardized and available to use in NLP packages. Stopwords in the context of African languages are understudied and can reveal information about the crossover between languages. The African Stopwords project aims to study and curate stopwords for African languages. In this paper, we present our current progress on ten African languages as well as future plans for the project.

Authors

Claire Babirye, Joyce Nakatumba-Nabende, Andrew Katumba, Ronald Ogwang, Jeremy Tusubira Francis, Jonathan Mukiibi, Medadi Ssentanda, Lilian D Wanzare, Davis David

Abstract

Africa has over 2000 languages; however, those languages are not well represented in the existing Natural Language Processing ecosystem. African languages lack the essential digital resources needed to engage effectively with advancing language technologies. This growing gap has attracted researchers to build resources for African languages so that the various Natural Language Processing methods can be transferred to them. This paper discusses the process we took to create, curate and annotate language text and speech datasets for low-resourced languages in East Africa. This paper focuses on five languages. Four of the languages (Luganda, Runyankore-Rukiga, Acholi, and Lumasaaba) are mainly spoken in Uganda, while Kiswahili is widely spoken across East Africa. We have run baseline machine translation models on the English-Luganda dataset in the parallel text corpora and Automatic Speech Recognition (ASR) models on the Luganda speech dataset. We recorded a BiLingual Evaluation Understudy (BLEU) score of 37 for the English-Luganda model and a BLEU score of 36.8 for the Luganda-English model. For the ASR experiments, we obtained a Word Error Rate (WER) of 33%.

Keywords: Speech, Text, Luganda, Common Voice, ASR, Swahili
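
Word Error Rate is conventionally defined as the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below follows that standard definition; the `word_error_rate` function and the example strings are hypothetical and are not drawn from the paper's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens; one substituted word out of three.
print(word_error_rate("a b c", "a x c"))
```

Under this definition, a WER of 33% corresponds to roughly one word error for every three reference words.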

Authors

Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I. Adelani, Amelia Taylor, Jamiil Toure ALI, Kevin Degila, Momboladji Balogoun, Thierno Ibrahima DIOP, Davis David, Chayma Fourati, Hatem Haddad, Malek Naski

Abstract

Advances in speech and language technologies enable tools such as voice-search, text-to-speech, speech recognition and machine translation. These are however only available for high resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D - African Language Program, a 3-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets through hosting of competitive ML challenges.

Authors

D. Adelani, J. Abbott, G. Neubig, D. D. Mwanganda, J. Mukiibi & others

Abstract

We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.

Authors

Sophia Sanga, Victor Mero, Dina Machuve and Davis Mwanganda

Abstract

Smallholder farmers in Tanzania are challenged by the lack of tools for early detection of banana diseases. This study aimed at developing a mobile application for early detection of Fusarium wilt race 1 and black Sigatoka banana diseases using deep learning. We used a dataset of 3,000 banana leaf images. We pre-trained our model on Resnet152 and Inceptionv3 Convolutional Neural Network architectures. Resnet152 achieved an accuracy of 99.2% and Inceptionv3 an accuracy of 95.41%. For deployment on Android mobile phones, we chose Inceptionv3 since it has lower memory requirements than Resnet152. In a real environment, the mobile application detected the two diseases with a confidence level of 99% of the captured leaf area. This result indicates the potential for improving banana yields for smallholder farmers using a tool for early detection of diseases.

My Research & Publications

A comparative study for classical machine learning models for Swahili social media sentiment analysis • Oct 28, 2025

AFRIDOC-MT: Document-level MT Corpus for African Languages • Jan 10, 2025

Introducing a Swahili social media sentiment analysis dataset for the telecom industry • Jan 9, 2025

Tausi: A Holistic Artificial Intelligence Approach to Credit Scoring Using Informal Data for a Sustainable Micro-lending African Economy • Nov 5, 2024

MasakhaNEWS: News Topic Classification for African languages • Apr 19, 2023

AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages • Feb 17, 2023

The African Stopwords Project: Curating Stopwords for African Languages • May 2, 2022

Building Text and Speech Datasets for Low Resourced Languages: A Case of Languages in East Africa • Apr 21, 2022

AI4D -- African Language Program • Apr 6, 2021

MasakhaNER: Named Entity Recognition for African Languages • Mar 21, 2021

Mobile-Based Deep Learning Models for Banana Diseases Detection • Apr 7, 2020
