Ghana NLP — Computational Mapping of Ghanaian Languages


Introduction

Natural Language Processing (NLP) is key for human interaction with computers [image source: thinkpalm.com]

Natural Language Processing can be loosely described as the set of tools and methods used to analyse or study, by computer, the languages humans use for everyday communication, whether in speech or in text.

“At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves ‘understanding’ complete human utterances, at least to the extent of being able to give useful responses to them.” [Natural Language Processing with Python, 2009]

Natural Language Processing (NLP) has been undergoing a revolution under our feet. Due to the vast increase in the availability of digitized text data, computers are increasingly needed to keep track of signals online, on social media, on the Internet of Things (IoT) and in conversations with customers, partners and other stakeholders. The amount of data available is far beyond what human analysts can reasonably digest manually. This touches every conceivable application area, from medical diagnosis and financial system interfaces to translation systems and cybersecurity.

Specifically, NLP has made dramatic strides in allowing engineers to reuse NLP knowledge acquired at major laboratories and institutions (such as Google, Facebook or Microsoft) and to adapt it to their own problem very quickly on a laptop or even a smartphone. This is loosely called Transfer Learning by the Machine Learning community. The figure below shows the recent dramatic growth in the size of such pretrained models available in popular languages such as English. While size isn’t everything, it is a good proxy for progress, and Ghanaian languages are nowhere to be found on this graph.

Pretrained models with as many as 17 billion parameters have recently become available in popular languages such as English (Microsoft Turing-NLG in this graph). Ghanaian languages are nowhere on this map. [source: microsoft.com]

Unfortunately, the state of NLP on Ghanaian languages is being left behind. To date, we do not have a reliable machine translation system for any of our Ghanaian Languages, not even a Google Translate. This makes it harder for the global Ghanaian diaspora to learn their own languages, something many want to do. It risks our language and culture not being preserved in an increasingly digitized future. It also means that service providers and health workers trying to reach remote areas hit by emergencies, disasters, etc. face needless additional obstacles to providing life-saving care.

Beyond translation, fundamental tools for computational analysis are lacking. Tools for summarization, classification, language detection and voice-to-text transcription are limited, and in most cases completely nonexistent. This gap is a major risk to Ghanaian national security: the availability of these tools is directly correlated with the sophistication and efficiency of the cyber-security solutions that can be deployed to defend critical social, cultural and cyber infrastructure from both internal and external threats.

Ghana NLP is an Open Source Movement of like-minded volunteers who believe that this is simply unacceptable, and who have dedicated their skills and time to building an ecosystem of (i) open-source datasets, (ii) open-source computational methods and, most importantly, (iii) an army of NLP researchers, scientists and practitioners ready to revolutionize and improve every aspect of Ghanaian life through this increasingly powerful and influential technology. As recent events have shown, we must be prepared to face tomorrow’s threats today. Please consider joining us by signing up on our website.

In this post, we present, describe and visualize the first computational map for a Ghanaian Language — Twi — built by the Ghana NLP team. You can explore the dynamic version of it at this link (see also the figure below). All of the code presented is part of the open API we are building — Kasa — to provide easy access to a broad range of NLP tools for Ghanaian languages to practitioners.

An image of the first computational map for a Ghanaian Language — Twi — built by the Ghana NLP team. You can explore the dynamic version at this link

Below is an outline of how this article will proceed:

  • A description of word embeddings and tools to visualise them
  • Presentation of an open source web application for visualising word embeddings
  • Description of learning word embeddings for Twi (Ghanaian Language)
  • Code for generating the embeddings and the accompanying files for visualization
  • Sample of the learned embeddings when visualized

Word Embeddings

Words are one of the main building blocks of text, and for most, if not all, natural language processing (NLP) tasks, such as machine translation, question answering and named entity recognition, it is essential to learn a high-quality representation of the words or sentences under consideration.

Word embeddings seek to capture the semantic and syntactic meanings of words, usually from a large corpus of unlabelled data, by representing them with vectors of real values. One major characteristic of good word embeddings is their ability to capture the linguistic relationships between words through their vector representations. The goal is that words with similar meanings should have similar vector representations.
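To make "similar vector representations" concrete, a common way to compare two word vectors is cosine similarity. The sketch below is only an illustration: the three-dimensional vectors are made-up toy values, not embeddings learned from the Twi corpus.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only (not learned embeddings).
toy_vectors = {
    "maame": [0.9, 0.1, 0.2],
    "paapa": [0.8, 0.2, 0.1],
    "oduro": [0.1, 0.9, 0.3],
}

related = cosine_similarity(toy_vectors["maame"], toy_vectors["paapa"])
unrelated = cosine_similarity(toy_vectors["maame"], toy_vectors["oduro"])
print(related > unrelated)
```

A value close to 1 means the two vectors point in nearly the same direction, which is what we hope to see for words with related meanings.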

Even though the vectors generated from learning word embeddings normally have fewer dimensions than other word representations such as Bag of Words (BOW), they are still multi-dimensional and can be hard to make sense of. Fortunately, there exist several methods, such as Principal Component Analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE), for reducing the dimensionality of higher-dimensional data into a form that can be easily visualised and that provides a better understanding of the underlying relationships in the data.
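As a rough illustration of what such a reduction does, the following pure-Python sketch projects a handful of toy four-dimensional points onto their first two principal components using power iteration. It is a teaching sketch under simplifying assumptions (tiny data, well-separated variances); the function names are ours, and in practice one would rely on a numerical library or on the Embedding Projector itself.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=200, seed=0):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigval


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    components = []
    for _ in range(2):
        v, lam = top_component(cov)
        components.append(v)
        # Deflation: remove the direction just found before searching again.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in x]


# Toy 4-dimensional points, invented for illustration.
toy_points = [[2, 0, 0, 0], [-2, 0, 0, 0], [0, 1, 0, 0],
              [0, -1, 0, 0], [0, 0, 0.1, 0], [0, 0, -0.1, 0]]
for point in pca_2d(toy_points):
    print([round(c, 3) for c in point])
```

Each point ends up with two coordinates that keep as much of the original spread as possible, which is the property that makes a two-dimensional plot of embeddings informative.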

TensorFlow provides the Embedding Projector, a highly interactive web application that can be used to visualize word embeddings by projecting them into much lower dimensions.


Practical Visualization of Twi Embeddings

The rest of the article describes the process of learning word embeddings from a large corpus of Twi text and visualizing them using TensorFlow’s Embedding Projector.

“Twi, also known as Akan Kasa, is a dialect of the Akan language spoken in southern and central Ghana by several million people, mainly of the Akan people, the largest of the seventeen major ethnic groups in Ghana. Twi language has about 17–18 million speakers in total, including second-language speakers; about 29% of the Ghanaian population speaks Twi as a first or second language.” (source: Wikipedia)

source : https://pixabay.com/en/flag-banner-nation-emblem-country-2526396

The embeddings shown in this article have been generated using data from the JW300 dataset, which contains 600K English sentences and their corresponding Twi translations. Since the embeddings were being created for Twi only, the dataset was parsed to keep just the Twi side of each sentence pair. This dataset can be downloaded from here.

After preprocessing the Twi sentences, mainly by removing punctuation marks such as [!.,?] and converting all the words to lower case, we are left with approximately 22K unique words. Due to the “noise” in the source dataset, some non-Twi words were still left in the final set of words. However, this does not greatly affect the idea being demonstrated.
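The preprocessing described above can be sketched in a few lines. This is an illustrative sketch, not the project's actual code: the function names are hypothetical, and the sample strings are arbitrary word lists rather than real JW300 sentences.

```python
import re
from collections import Counter

# The punctuation marks mentioned in the text: ! . , ?
PUNCTUATION = re.compile(r"[!.,?]")


def tokenize(sentence):
    """Strip punctuation, lower-case, and split on whitespace."""
    return PUNCTUATION.sub("", sentence).lower().split()


def count_vocabulary(sentences):
    """Count how often each unique word occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


# Arbitrary sample strings, used only to exercise the functions.
sample = ["Maame, Paapa!", "maame. Dɔkota?"]
vocabulary = count_vocabulary(sample)
print(len(vocabulary), "unique words")
```

The number of keys in the resulting counter is the vocabulary size, which for the full Twi corpus is the figure of roughly 22K quoted above.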

The embeddings presented in this article have been generated using the fastText implementation from the Gensim library. fastText is a method for learning word embeddings originally created by Facebook’s AI Research team. The original fastText papers describe the idea behind the method in detail.

It is also possible to learn these embeddings using other popular approaches such as Word2vec and GloVe. As will be shown later in the article, the code has been written to allow for the flexibility of generating the embeddings using the Word2Vec implementation of Gensim.

There are several hyperparameters of the fastText method that can be tuned to, perhaps, improve the quality of the generated embeddings. While we made an effort to optimize these parameters somewhat, it is quite possible that the results presented in this article, which are already pretty good, can be improved even further. We encourage the reader to experiment with the code, and report any improvements discovered to Ghana NLP as a contribution towards our mission to improve language technologies for our common good. Please refer to the documentation of Gensim for a detailed overview and usage of the possible hyperparameters.

Code Walk Through

In this section, we present the code that has been used to train the embeddings on the Twi corpus and generate the files required for visualising them with the embedding projector.

The following code imports the required modules (built-in and external). The paths to the source files are also defined here as global variables. An extra variable is used to vary the amount of input text read, which is especially useful when working with large datasets.

The function read_dataset is used to read the input file and perform all the necessary preprocessing operations based on the specific language. The accompanying helper functions unicode_to_ascii and normalize_line are shown below as well. The result of reading the data with read_dataset is a list of lists of words, one list for each sentence in the input text.
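For readers without access to the original listing, here is a minimal sketch of what these three functions might look like. Only the function names come from the article; the bodies, the max_lines parameter and the punctuation set are assumptions made for illustration. Note that, despite its name, this version of unicode_to_ascii only strips combining accent marks, so Twi letters such as ɛ and ɔ are preserved.

```python
import unicodedata

# Assumed punctuation set, taken from the preprocessing description: ! . , ?
PUNCTUATION_TABLE = str.maketrans("", "", "!.,?")


def unicode_to_ascii(text):
    """Drop combining marks (accents); letters like ɛ and ɔ are kept."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_line(line):
    """Clean one line and split it into lower-case word tokens."""
    cleaned = unicode_to_ascii(line).translate(PUNCTUATION_TABLE)
    return cleaned.lower().split()


def read_dataset(path, max_lines=None):
    """Return a list of token lists, reading at most max_lines lines."""
    sentences = []
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if max_lines is not None and index >= max_lines:
                break
            tokens = normalize_line(line)
            if tokens:  # skip lines that are empty after cleaning
                sentences.append(tokens)
    return sentences
```

Capping the number of lines read plays the role of the extra variable mentioned earlier: it makes quick experiments possible before committing to a run over the full corpus.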

The function get_embedding is where we generate the embeddings using the preprocessed text, now split into word tokens for each sentence. The argument typeFunc can be used to choose between fastText and Word2Vec for learning the embeddings. The argument size, which sets the dimension of the output embeddings, defaults to 100. As mentioned earlier, one can choose to vary this and other hyperparameters as required. Setting the argument save to True saves the resulting embeddings in the working folder from which the calling script or notebook is run.
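A rough sketch of such a function is given below. It is not the Ghana NLP implementation: the name train_embeddings, its argument names, the window and min_count values and the output file name are all assumptions made for illustration. It also assumes Gensim 4 or later, where the dimension argument is called vector_size (older releases called it size).

```python
def train_embeddings(sentences, type_func="fasttext", size=100, save=False):
    """Train fastText or Word2Vec embeddings on tokenized sentences.

    `sentences` is a list of token lists, one per sentence.
    """
    if type_func not in ("fasttext", "word2vec"):
        raise ValueError("type_func must be 'fasttext' or 'word2vec'")

    # Imported here so the argument check above works without Gensim installed.
    from gensim.models import FastText, Word2Vec

    model_class = FastText if type_func == "fasttext" else Word2Vec
    model = model_class(
        sentences=sentences,
        vector_size=size,  # dimension of the output embeddings
        window=5,          # assumed context window, worth tuning
        min_count=1,       # keep every word; raise this to drop rare words
    )
    if save:
        # Hypothetical file name, written to the current working folder.
        model.save(f"{type_func}_embeddings.model")
    return model
```

Once trained, the vector for a word can be looked up with model.wv[word]. Because fastText builds vectors from character n-grams, it can also produce a vector for a word that never appeared in the training text, which Word2Vec cannot do.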

The function prepare_for_visualization generates the TSV files for the vectors and the metadata that can be used in the Embedding Projector tool. The output of get_embeddings can be passed directly to this function, or the path to saved embeddings can be provided as an alternative. The TSV files are saved in the working folder as well.
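The two files have a simple layout: the vectors file holds one tab-separated row of numbers per word, and the metadata file holds the matching word on the same row number (with a single column, no header row is used). The sketch below writes both from a plain dictionary; the function name and default file names are hypothetical, and the real prepare_for_visualization may differ.

```python
def write_projector_files(embeddings, vectors_path="vectors.tsv",
                          metadata_path="metadata.tsv"):
    """Write word vectors and their labels as two row-aligned TSV files.

    `embeddings` maps each word to a sequence of floats.
    """
    with open(vectors_path, "w", encoding="utf-8", newline="") as vec_file, \
         open(metadata_path, "w", encoding="utf-8", newline="") as meta_file:
        for word, vector in embeddings.items():
            # Row i of the vectors file corresponds to row i of the metadata file.
            vec_file.write("\t".join(f"{value:.6f}" for value in vector) + "\n")
            meta_file.write(word + "\n")
```

With a trained Gensim 4 model, a suitable dictionary could be built as {word: model.wv[word] for word in model.wv.index_to_key}.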

Refer to the following article, which shows the process of uploading the TSV files for visualization using the Embedding Projector.

How To Use

Sample Results

Below we look at some of the results of our learned Twi embeddings when visualised with the Embedding projector.

100 closest words to Doctor (dɔkota)

A look at the representation of the Twi translation for doctor (dɔkota) shows highly related and close words such as drugs (oduro), maker of drugs (oduroyɛfo), mental illness (adwenemyareɛ), researcher (nhwehwɛmufoɔ), intelligent person (ɔbenfoɔ), etc. An interesting observation is that the embeddings tend to capture that doctor is an occupation, and therefore show its closeness to other professions such as teacher (tikyani). The distance scores shown on the right also affirm that the semantic relationship has been well captured by the learned embeddings, given the small distance between the word doctor and other closely related words.
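A nearest-neighbour list of this kind can be reproduced by ranking every other word by its distance to the query word. The sketch below uses cosine distance, one of the distance options offered by the Embedding Projector. The vectors are made-up toy values chosen for illustration, not the learned Twi embeddings, and the function names are ours.

```python
import heapq
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def nearest_neighbours(query, vectors, k=3):
    """Return the k words closest to `query` as (word, distance) pairs."""
    candidates = (
        (cosine_distance(vectors[query], vector), word)
        for word, vector in vectors.items()
        if word != query
    )
    return [(word, dist) for dist, word in heapq.nsmallest(k, candidates)]


# Toy vectors, invented for illustration only (not learned embeddings).
toy_vectors = {
    "dɔkota": [0.9, 0.8, 0.1],
    "oduro": [0.8, 0.9, 0.2],
    "tikyani": [0.7, 0.3, 0.6],
    "maame": [0.1, 0.2, 0.9],
}

for word, distance in nearest_neighbours("dɔkota", toy_vectors, k=2):
    print(word, round(distance, 3))
```

Smaller distances mean closer neighbours, so the first entries of the list play the same role as the top of the ranking shown in the projector's side panel.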

100 closest words to Mother (Maame)

Analysing the results for a word such as mother (maame) reveals equally interesting results. Represented very close to the word are several related words such as parents (m’awofo), father (paapa), family (m’abusuafo), siblings (onuabarima & onuabea), etc. An interesting point here is the relationship between mother and a word such as pregnancy (minyinsɛnee), which is shown to be among the closest 100 words.

Conclusion

In this article we have looked at word embeddings as one of the most important and useful tools in natural language processing. We have discussed several tools that are available for visualizing and making sense of the higher-dimensional representation of word embeddings by projecting them into lower dimensions. We have confirmed the effectiveness of the technique by applying it to the Twi language from Ghana and analysed the output of the learned embeddings.

Join Us?

If what you have read interests you, and you would like to join and contribute to the Ghana NLP community, please consider signing up on our website.

Sponsors and Partners

We are grateful to our sponsors and partners. Please contact us for information on how to support this work and become a partner/sponsor.

Originally published at https://medium.com on June 13, 2020.

Written by

Paul Azunre holds a PhD in Computer Science from MIT and has served as a Principal Investigator on several DARPA programs. He founded Algorine & Ghana NLP
