Title: Fine-grained Error Analysis in Machine Translation
The introduction of neural networks has led to performance gains across a variety of tasks. Recently, systems such as the ChatGPT language model and the DALL-E image generation system have exhibited impressive performance that has attracted a great deal of attention. However, these systems have also been shown to have flaws: for example, ChatGPT struggles to generate factually correct text or to solve basic math problems, and DALL-E has difficulty rendering fine details such as human faces and hands. This demonstrates that although the performance of neural networks has greatly improved, flaws remain, and that the boundary between what neural networks can and cannot do is unclear. It is therefore important to develop evaluation methods that focus on the details of neural network performance. I argue that such methods are especially needed for neural machine translation systems.
Machine translation systems are traditionally evaluated in one of two ways: 1) with subjective numerical scores assigned by human evaluators, or 2) with programmatic metrics designed to approximate human ratings. Both methods yield one-dimensional quantifications of translation quality and are not designed to determine which linguistic phenomena a translation system can or cannot handle. Recently, an approach that I call “fine-grained error analysis” has been gaining popularity in machine translation research. Rather than measuring general translation quality, fine-grained tests measure a translation system’s abilities with respect to specific phenomena, such as the translation of ambiguous vocabulary, idiomatic or multi-word expressions, ambiguous pronouns, verb inflections, and gender bias, among many others. Such tests have proven far more useful for elucidating the abilities and limits of machine translation systems, and have shown that there is still considerable room for improvement.
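To make the contrast concrete, the following is a minimal sketch of a fine-grained test for one phenomenon, the translation of ambiguous vocabulary. The test items, the placeholder translate() function, and the expected keywords are all illustrative assumptions, not part of any real test suite: instead of assigning an overall quality score, the test checks whether the system resolves each ambiguous word correctly and reports an accuracy for that phenomenon alone.

```python
# Hypothetical fine-grained test for ambiguous vocabulary (German "Bank"
# meaning either "bench" or "bank"). All data here is invented for
# illustration; a real test set would be much larger and curated.

TEST_ITEMS = [
    # (source sentence, keywords a correct translation should contain)
    ("Er saß auf der Bank im Park.", {"bench"}),
    ("Er brachte das Geld zur Bank.", {"bank"}),
]

def translate(sentence: str) -> str:
    """Placeholder for a real MT system; returns canned outputs here."""
    canned = {
        "Er saß auf der Bank im Park.": "He sat on the bench in the park.",
        "Er brachte das Geld zur Bank.": "He took the money to the bank.",
    }
    return canned[sentence]

def fine_grained_accuracy(items) -> float:
    """Fraction of items whose translation contains an expected keyword."""
    correct = 0
    for source, keywords in items:
        output = translate(source).lower()
        # The item passes only if the ambiguity was resolved correctly.
        if any(kw in output for kw in keywords):
            correct += 1
    return correct / len(items)

print(fine_grained_accuracy(TEST_ITEMS))  # 1.0 with the canned outputs above
```

The keyword-matching check is deliberately simple; real test suites often use contrastive scoring or human verification instead, but the design principle is the same: one score per phenomenon rather than one score overall.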
In this seminar, I will elaborate on my argument that evaluation in machine translation should primarily rely on fine-grained tests based on linguistic phenomena rather than on the traditional one-dimensional approaches. I will also discuss how to design fine-grained tests, namely the different methods of collecting test sentences and applying the tests. For each of these design decisions, I will discuss its advantages, disadvantages, and suitability for different linguistic phenomena. Finally, I will discuss to what extent fine-grained tests can be automated to reduce their cost without sacrificing accuracy.