Talk title: Pitfalls and possibilities: What NLP systems are missing out on
Despite recent advances in language modeling and NLP more broadly, many areas of NLP still need substantial progress. With research disproportionately dedicated to English and a few other high-resource languages, the effect of morphology on NLP systems remains an under-studied area. Most high-resource languages, such as English and Chinese, use little morphology, encoding information syntactically (e.g., through word order) rather than morphologically (e.g., through case inflection). Morphologically rich languages like Turkish and St. Lawrence Island Yupik instead rely on much greater variation in word forms to encode meaning and allow flexible or free word order. How such morphological typology affects NLP systems remains largely unanswered, because it is hard to obtain natural language datasets that represent the morphological diversity of the world’s languages.

To address this issue, we conduct two studies that augment existing data to investigate how morphology interacts with NLP systems. First, we compile a parallel Bible corpus and a linguistic typology database to study the effect of morphology on LSTM language modeling difficulty. Our results show that morphological complexity, characterized by higher word type counts, makes a language harder to model. Subword segmentation methods such as BPE and Morfessor mitigate the effect of morphology for some languages but not for others, and even when they do, they still lag behind FST-based morpheme segmentation methods. Next, we develop the first dependency treebank for St. Lawrence Island Yupik and demonstrate how morphology interacts with syntax in this morphologically rich language. We argue that the Universal Dependencies (UD) guidelines, which focus on word-level annotations only, should be extended to morpheme-level annotations for morphologically rich languages.

Turning to another area that requires further research, we also present a study on long document classification in English using Transformers.
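The complexity measure from the first study can be illustrated with a toy sketch. The mini-corpora below are invented stand-ins (not the study's data): one mimics an analytic language that spreads meaning over function words, the other mimics a synthetic language that fuses the same meanings into inflected word forms, yielding more word types per token.

```python
def type_token_stats(corpus: str):
    """Return (number of word types, number of word tokens)."""
    tokens = corpus.lower().split()
    return len(set(tokens)), len(tokens)

# Analytic style: meaning carried by word order and function words.
analytic = "the hunter saw the whale and the hunter followed the whale"
# Synthetic style: invented fused forms standing in for real inflections.
synthetic = "hunterERG whaleABS sawPST3sg hunterERG whaleABS followPST3sg"

for name, corpus in [("analytic", analytic), ("synthetic", synthetic)]:
    types, tokens = type_token_stats(corpus)
    print(f"{name}: {types} types / {tokens} tokens = {types/tokens:.2f}")
```

The synthetic toy corpus needs fewer tokens but has a higher type/token ratio, which is the sense in which higher word type counts make a vocabulary larger and a language harder to model.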
While English data is abundant in general, there has been little care or discussion regarding the validity of evaluation methods for this newly developed task. As a result, a fair comparison among existing models is difficult even though several methods have been proposed. To address this issue, we provide a comprehensive evaluation of existing models’ relative efficacy across various datasets and baselines, both in terms of accuracy and of time and space overheads. Our results show that existing models often fail to outperform simple baseline models and yield inconsistent performance across datasets. These findings emphasize that future studies should consider comprehensive baselines and datasets that better represent the task of long document classification in order to develop robust models. In all, this presentation sheds light on areas in NLP that need further investigation and emphasizes the importance of careful consideration of the datasets involved.
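One example of the simple baselines the evaluation point above calls for is the majority-class predictor: a model that beats chance but not this baseline has learned little. The label distribution below is invented for illustration; it is not from the study's datasets.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Invented skewed labels standing in for a long-document dataset:
# 80% of documents belong to one class.
labels = ["news"] * 80 + ["fiction"] * 20
print(majority_baseline_accuracy(labels))  # 0.8
```

On such a skewed dataset, a classifier reporting 0.8 accuracy has not demonstrated anything beyond this trivial baseline, which is why comparisons against comprehensive baselines matter.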