Notes from the AfricanNLP workshop co-located with EACL 2021
The AfricanNLP workshop took place on 19th April 2021. I was excited to see many new authors, most of them of African descent, and more African languages represented. Out of the ~41 accepted papers, ~11 focused on machine translation datasets and methodologies. A majority of these focused on NMT, with SMT and rule-based MT present in at most two papers.
Thank you to the organizers for such a powerful workshop.
Accepted papers in the Machine Translation category
1. [Best paper award] Congolese Swahili Machine Translation for Humanitarian Response [Paper, Video]
This is my favourite paper, probably because it has a current, real-world humanitarian application. I am a coastal Swahili speaker who has interacted with Congolese Swahili speakers during my stay in Kigali. We could understand each other about 90% of the time, except when using French-derived vocabulary (in coastal Swahili, we call this Utohozi). My greatest takeaway was the need for dialect-specific machine translation. And of course, they highlighted how transfer learning (cross-dialect in this case), back-translation and data augmentation techniques aid low-resource MT. They made their data, web app and WhatsApp chatbot available. Kudos, Translators Without Borders, for this fantastic work.
2. An Exploration of Data Augmentation Techniques for Improving English to Tigrinya Translation [Paper, Video]
The paper highlights effective back-translation methods for low-resource settings where there isn't enough parallel data to train the backward model: direct back-translation, indirect back-translation using a related higher-resource language (Amharic), and back-translation through a pivot language in both unsupervised and supervised settings. Back-translation through a pivot language performed best, with unsupervised Tigrinya-to-Amharic giving the highest gains. This is the first time I have seen a related higher-resource language used in a two-step process (creating two models) instead of just using the higher-resource language to initialise the model weights (traditional cross-lingual transfer learning). They conjecture the improvement was a result of the closeness between Amharic and Tigrinya and the availability of large monolingual corpora. I should explore using a pivot language in an unsupervised manner further.
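To make sure I've understood the pivot idea, here is a minimal sketch of pivot back-translation as I read it (not the authors' code); `translate_ti_am` and `translate_am_en` are hypothetical stand-ins for any Tigrinya-to-Amharic and Amharic-to-English models:

```python
# Sketch of pivot back-translation, assuming hypothetical translate_ti_am and
# translate_am_en callables (any Tigrinya->Amharic and Amharic->English models).

def pivot_back_translate(mono_tigrinya, translate_ti_am, translate_am_en):
    """Turn monolingual Tigrinya into synthetic English-Tigrinya pairs."""
    synthetic_pairs = []
    for ti_sentence in mono_tigrinya:
        am_sentence = translate_ti_am(ti_sentence)  # Tigrinya -> Amharic (pivot)
        en_sentence = translate_am_en(am_sentence)  # Amharic -> English
        # The synthetic English source is paired with the genuine Tigrinya target.
        synthetic_pairs.append((en_sentence, ti_sentence))
    return synthetic_pairs

# The synthetic pairs are then mixed with the real parallel data to train the
# forward English -> Tigrinya model.
```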
3. [Data] Domain-specific MT for Low-resource Languages: The case of Bambara - French [Paper, Video]
This work is the first attempt at domain-specific MT for French-Bambara. I liked how they discussed the details of their models and data preparation process, and the writing style of the paper in general. They used both BLEU and chrF as evaluation metrics, and also suggested human evaluation because the automatic scores were not 'explainable'. Automatic evaluation showed that combining both general and domain-specific data resulted in an improvement in scores.
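For reference, computing BLEU and chrF is a one-liner each with the sacrebleu library; this is a generic example with made-up sentences, not the authors' evaluation script:

```python
# Generic BLEU and chrF computation with sacrebleu (made-up sentences,
# not the authors' evaluation setup).
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the health centre reopened on monday"]
references = [["the health centre was reopened on monday"]]  # one reference stream

print(BLEU().corpus_score(hypotheses, references))
print(CHRF().corpus_score(hypotheses, references))
```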
4. Low-Resource Neural Machine Translation for Southern African Languages [Paper, Video]
This paper investigated NMT (Transformer architecture) for Southern African languages of the Bantu family. The focus was on English-isiZulu translation with 30,000 sentence pairs. They did a comparative analysis of three learning protocols: transfer learning, zero-shot learning and multilingual modelling. All three yielded better scores than the baseline. Of the three, multilingual modelling yielded the best improvement, followed by transfer learning with isiXhosa as the source (compared with Shona as the source; isiXhosa and isiZulu are part of the Nguni group whereas Shona isn't), then zero-shot learning.
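A common way to implement multilingual modelling (my illustration, not necessarily the exact setup in the paper) is to prepend a target-language token to every source sentence so that a single model learns several translation directions:

```python
# Sketch of the usual target-language-token trick for multilingual NMT
# (my illustration; the paper's exact setup may differ). The <2zu>-style
# tags and the placeholder sentences are made up.

def tag_for_target_language(pairs, target_lang):
    """Prepend a target-language token so one model can learn several directions."""
    token = f"<2{target_lang}>"
    return [(f"{token} {src}", tgt) for src, tgt in pairs]

en_zu = [("english source sentence", "isizulu target sentence")]
en_xh = [("english source sentence", "isixhosa target sentence")]

training_data = tag_for_target_language(en_zu, "zu") + tag_for_target_language(en_xh, "xh")
print(training_data[0][0])  # "<2zu> english source sentence"
```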
5. Design and Implementation of English To Yorùbá Verb Phrase Machine Translation System [Paper, Video]
They designed and built an English to Yorùbá verb phrase machine translation system. This is a case of rule-based machine translation: they used a context-free grammar and validated the rewrite rules with finite state automata. Interestingly, their system was better than Google Translate, according to the expert human evaluator who tested it. They also provide a link to the English-Yorùbá corpus.
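To illustrate the kind of rewrite rules such a system relies on, here is a toy context-free grammar for English verb phrases using NLTK (my own example, not the grammar from the paper):

```python
# A toy context-free grammar for English verb phrases with NLTK
# (my own example, not the grammar from the paper).
import nltk

grammar = nltk.CFG.fromstring("""
VP  -> V NP | V
NP  -> Det N | N
V   -> 'buy' | 'eat'
Det -> 'the' | 'a'
N   -> 'book' | 'rice'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("buy the book".split()):
    print(tree)  # (VP (V buy) (NP (Det the) (N book)))
```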
6. [Best paper award] Did they direct the violence or admonish it? A cautionary tale on contronomy, androcentrism and back-translation foibles [Tweet]
7. [Data] MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation [Paper, Video]
They introduce a multi-domain dataset for en-yo translation. An important takeaway is that they evaluated model perplexity for different segments of the data. This is the first time I have seen model perplexity as an evaluation metric for language models (it is negatively correlated with BLEU), and I love how they investigated and explained the results.
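For anyone else meeting perplexity for the first time: it is just the exponential of the average per-token negative log-likelihood, which is why lower is better. A quick illustration with made-up token probabilities:

```python
# Perplexity is the exponential of the average per-token negative log-likelihood,
# shown here with made-up token probabilities.
import math

token_probs = [0.25, 0.10, 0.50, 0.05]    # model probability of each reference token
nll = [-math.log(p) for p in token_probs]  # per-token negative log-likelihood
perplexity = math.exp(sum(nll) / len(nll))
print(round(perplexity, 2))                # ~6.32
```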
8. Translating the Unseen? Yorùbá-English Machine Translation (MT) in Low-Resource, Morphologically-Unmarked Settings [Paper, Video]
This work analysed how an SMT system compares with two NMT systems (BiLSTM and Transformer) when translating bare nouns in Yorùbá into English. The work is an excellent example of an investigation into how MT systems handle morphological differences. I learnt of sacreBLEU, which was introduced to make reporting and comparing BLEU scores across models and papers consistent.
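What sacreBLEU adds is a "signature" string recording the tokeniser, smoothing, number of references and version, which is what makes reported scores comparable; a tiny generic example (made-up sentences, nothing from this paper):

```python
# sacreBLEU attaches a signature to each score so others can reproduce the
# exact BLEU computation. Made-up sentences below.
from sacrebleu.metrics import BLEU

bleu = BLEU()
print(bleu.corpus_score(["the child is sleeping now"], [["the child is sleeping"]]))
print(bleu.get_signature())  # e.g. nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.x
```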
9. [Data] English-Twi Parallel Corpus for Machine Translation [Paper, Video]
An English-Twi corpus obtained from crowdsourcing and from post-editing sentences produced by an MT system. The corpus was used to train an NMT system, which scored better than the existing model from HuggingFace.
10. [Data] Extended Parallel Corpus for Amharic-English Machine Translation [Paper, Video]
The paper introduces the largest parallel Amharic-English MT corpus compiled from various sources.