Introducing ABENA: BERT Natural Language Processing for Twi

Fig. 1: We named our main model ABENA — A BERT Now in Akan

Introduction

In our previous blog post we introduced a preliminary Twi embedding model based on fastText and visualized it using the TensorFlow Embedding Projector. As a reminder, text embeddings allow you to convert text into numbers, or vectors, on which a computer can perform arithmetic operations, enabling it to reason about human language, i.e., to carry out natural language processing (NLP). A screenshot of our fastText Twi embeddings from that exercise is shown in Fig. 2.

Fig. 2: Our fastText (subword word2vec) Twi embedding model screenshot from a previous article
In this article, we take things a step further with transformer-based models:
  • We first employ transfer learning to fine-tune a multilingual BERT (mBERT) model on the Twi subset of the JW300 dataset, which is the same data we used to develop our fastText model. This data is largely composed of the Akuapem dialect of Twi.
  • Subsequently, we fine-tune this model further on the Asante Twi Bible data to obtain an Asante Twi version of the model.
  • Additionally, we perform both experiments using the DistilBERT architecture instead of BERT — this yields smaller and more lightweight versions of the (i) Akuapem and (ii) Asante ABENA models.

Motivation

Transformer-based language models have been changing the modern NLP landscape for high-resource languages such as English, Chinese, Russian, etc. However, this technology does not yet exist for any Ghanaian languages. In this post, we introduce the first such model for Twi/Akan, arguably the most widely spoken Ghanaian language.

Unlike our fastText model, which assigns each word a single vector regardless of context, transformer-based models such as BERT produce contextual embeddings: the representation of a word depends on the sentence it appears in. Consider how differently the word "cell" is used in the two English sentences below, and note that the superficially similar Twi words ɔkraman (dog) and ɔkra (cat), one a substring of the other, name completely different animals:
  • He examined the cell under the microscope.
  • He was locked in a cell.
  • ɔkraman no so paa — the dog is very big
  • ɔkra no da mpa no so — the cat is sleeping on the bed
Fig. 3: Illustrating the key idea behind transfer learning: instead of learning things from scratch, prior knowledge and experience should be shared and used to make the current task easier. Here, we see that learning to play the drum is easier if one already plays the piano. Image from “Transfer Learning for NLP” [https://www.manning.com/books/transfer-learning-for-natural-language-processing]

ABENA Twi BERT Models

The first thing we do is initialize a BERT architecture and tokenizer to the multilingual BERT (mBERT) checkpoint. This model was trained on over 100 languages simultaneously. Although these did not include any Ghanaian languages, they did include another Niger-Congo language, Nigerian Yoruba. It is thus reasonable to expect the mBERT checkpoint to contain some knowledge useful for constructing a Twi embedding. We transfer this knowledge by fine-tuning the initialized mBERT weights and tokenizer on our monolingual Twi data. The convergence info is shown in Fig. 4. All models were trained on a single Tesla K80 GPU on an NC6 Azure VM instance.
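For readers who want to reproduce this step, the sketch below shows the general recipe using the Hugging Face transformers library, i.e., masked-language-model fine-tuning starting from the mBERT checkpoint. The file name twi_train.txt, the output directory name, and the hyperparameters are placeholders rather than the exact values we used; for the DistilABENA variants, one would start from the distilbert-base-multilingual-cased checkpoint and the corresponding DistilBert classes instead.

```python
# A minimal sketch of fine-tuning mBERT on monolingual Twi with the
# masked-language-modeling (MLM) objective. File names and hyperparameters
# are illustrative placeholders, not the exact settings used for ABENA.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Initialize the architecture and tokenizer from the mBERT checkpoint
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# One Twi sentence per line (the JW300 Twi subset in our case)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="twi_train.txt", block_size=128
)

# Randomly mask 15% of tokens so the model learns to predict them
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="abena-base-akuapem-twi",  # placeholder output directory
    num_train_epochs=3,                   # illustrative values
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("abena-base-akuapem-twi")
tokenizer.save_pretrained("abena-base-akuapem-twi")
```

Fine-tuning further on the Asante Twi Bible data to obtain the Asante variant follows the same pattern, starting from the Akuapem checkpoint instead of mBERT.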

Fig. 4: Convergence info for ABENA models.
Fig. 5: Convergence info for DistilABENA models.

BAKO Twi BERT Model

Finally, we investigated training the various forms of ABENA described in the previous section from scratch on the monolingual data. We named this set of models BAKO, short for “BERT with Akan Knowledge Only”. An apt visualization of what the Ghanaian character BAKO probably looks like is shown in Fig. 6.
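To make the contrast with the previous section concrete, the sketch below outlines one way to train such a model from scratch with the Hugging Face tokenizers and transformers libraries, using the RoBERTa architecture as in the RoBAKO variants: both the tokenizer vocabulary and the model weights are learned from the Twi corpus alone, with no mBERT initialization. All names, the vocabulary size, and the hyperparameters are assumptions for illustration.

```python
# A sketch of from-scratch training (as in BAKO/RoBAKO): no pretrained
# checkpoint is loaded; the tokenizer and weights come from Twi data only.
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# 1. Train a byte-level BPE tokenizer on the monolingual Twi corpus only
os.makedirs("robako-tokenizer", exist_ok=True)
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["twi_train.txt"],  # placeholder corpus file
    vocab_size=52_000,        # illustrative vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bpe.save_model("robako-tokenizer")

# 2. Build a RoBERTa model with randomly initialized weights
config = RobertaConfig(vocab_size=52_000, max_position_embeddings=514)
tokenizer = RobertaTokenizerFast.from_pretrained("robako-tokenizer", model_max_length=512)
model = RobertaForMaskedLM(config=config)

# 3. Train with the same masked-language-modeling objective as before
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="twi_train.txt", block_size=128
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="robako-base-akuapem-twi", num_train_epochs=3),
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
```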

Fig. 6: (BAKO) We also investigate training BERT models from scratch, yielding BAKO — BERT with Akan Knowledge Only. The Twi word “Bako” or “Baako” means “One”.
Fig. 7: Convergence info for RoBAKO models trained from scratch.

Simple Sentiment Analysis/Classification Example

By way of summary, in Fig. 8 we list and describe all the models we have presented. You can also find them listed on the Hugging Face Model Hub.

Fig. 8: Description of all the models we trained and shared in this work.
Fig. 9: Simple Sentiment Analysis Example Dataset
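The exact dataset in Fig. 9 is not reproduced here, but the sketch below shows one straightforward way to use a shared ABENA checkpoint for such a task: mean-pool its hidden states into sentence vectors and fit a simple scikit-learn classifier on top. The Hub identifier, example sentences, and labels are assumptions for illustration; see our page on the Hugging Face Model Hub for the released model names.

```python
# A sketch of simple sentiment classification with ABENA as a sentence
# encoder. The model identifier, example sentences, and labels are
# illustrative placeholders, not the exact data shown in Fig. 9.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Ghana-NLP/abena-base-asante-twi-uncased"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sentences):
    """Mean-pool the final hidden states into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # ignore padding
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

# Tiny placeholder dataset: 1 = positive sentiment, 0 = negative sentiment
train_texts = ["Me ho yɛ", "Me werɛ ahow"]   # "I am well", "I am sad"
train_labels = [1, 0]

classifier = LogisticRegression().fit(embed(train_texts), train_labels)
print(classifier.predict(embed(["Me ho yɛ paa"])))  # classify a new sentence
```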

Limitations and Ongoing/Future Work

The models show a varying degree of religious bias. For instance, when completing a sentence like “Eyi de ɔhaw kɛse baa ____ hɔ”, you may see completions such as “Eyi de ɔhaw kɛse baa Adam hɔ” and/or “Eyi de ɔhaw kɛse baa Satan hɔ” among the most likely candidates. While these are not technically wrong, and can be useful for understanding sentence structure, named entity recognition, etc., the fact that they appear among the top 5 completions indicates a strong religious bias in the model. This is a direct consequence of the religious texts (JW300 and the Bible) used to train them.
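This behaviour is easy to inspect directly with a fill-mask pipeline. The sketch below assumes one of the shared checkpoints (the identifier is illustrative) and prints the five most likely fillers for the blank in the sentence above; the exact completions will vary by model.

```python
# Inspecting the top completions for a masked Twi sentence (a sketch; the
# model identifier is an assumed Hub name and outputs vary by checkpoint).
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="Ghana-NLP/abena-base-asante-twi-uncased",  # assumed identifier
)

# Replace the blank in "Eyi de ɔhaw kɛse baa ____ hɔ" with the mask token
prompt = f"Eyi de ɔhaw kɛse baa {fill_mask.tokenizer.mask_token} hɔ"
for prediction in fill_mask(prompt, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```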

Join Us?

If what you have read interests you, and you would like to join and contribute to the Ghana NLP community — either as a volunteer, contributor, partner or sponsor — please find all contact details on our website. Be sure to follow us on any social media platform of your choice so as not to miss anything!

Paul Azunre

Paul Azunre holds a PhD in Computer Science from MIT and has served as a Principal Investigator on several DARPA programs. He founded Algorine & Ghana NLP.