Data Annotation Bottleneck & Active Learning for NLP in the Era of LLMs

The success of Natural Language Processing (NLP) often depends on the availability of (high-quality) data. In particular, the costly manual annotation of text data has posed a major challenge since the early days of NLP. To overcome the data annotation bottleneck, a number of methods have been proposed. One prominent method in this context is Active Learning, which aims to minimize the set of data that needs to be annotated.

However, the development of Large Language Models (LLMs) has changed the field of NLP considerably. For this reason, it is of great interest to those of us working in this field (both in research and in practical application) to understand whether, and how, a lack of annotated data still affects NLP today.

At the center of this survey is Active Learning, which was last surveyed in a web survey in 2009. Fifteen years later, we aim to reassess the current state of the method from the user's point of view. Besides inquiring where Active Learning is used, we also ask where it is not used and which methods are preferred instead. Moreover, we want to understand which computational methods the community considers most useful for overcoming a lack of annotated data.

The survey is conducted solely for non-commercial, academic purposes. It specifically targets participants who are or have been involved in supervised machine learning for NLP. Knowledge about Active Learning is not required. Filling out the survey will take you approximately 15 minutes.

Why should I invest my time in this survey? We need your collective expertise in the fields of NLP, supervised machine learning, and Active Learning to understand how recent advancements, such as LLMs, have changed the long-standing data annotation bottleneck. The results of this survey will help the community better understand the state and open issues of contemporary Active Learning, and incorporate these insights into the research and development of new methods and technologies. To this end, a study presenting and discussing the results of this survey will be published as an open access publication. If you wish to be notified upon publication, you can optionally enter your email address at the end of the survey.

What is Active Learning? Active Learning is a method for creating a small but meaningful annotated dataset with the goal of minimizing the annotation effort. It is an iterative process that alternates between a human annotator and a learning algorithm. In each iteration, an instance selection strategy (also referred to as a query or acquisition strategy) is used to select the data points considered most useful to annotate next. These can be, for example (among many other strategies), the instances about which a machine learning model is most uncertain.
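To make the uncertainty-based strategy mentioned above concrete, here is a minimal, illustrative sketch of a single selection step using least-confidence sampling. All names (`least_confidence`, `select_batch`, the class-probability matrix) are our own illustrative assumptions, not part of any specific Active Learning library.

```python
import numpy as np

def least_confidence(probs: np.ndarray) -> np.ndarray:
    """Uncertainty score per instance: 1 minus the probability
    of the most likely class. Higher means more uncertain."""
    return 1.0 - probs.max(axis=1)

def select_batch(probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Return the indices of the `batch_size` most uncertain
    instances, which would be handed to the annotator next."""
    scores = least_confidence(probs)
    return np.argsort(-scores)[:batch_size]

# Hypothetical model predictions over three unlabeled instances
# (rows: instances, columns: class probabilities).
probs = np.array([[0.9, 0.1],
                  [0.5, 0.5],
                  [0.8, 0.2]])
print(select_batch(probs, batch_size=1))  # the most uncertain instance
```

In a full Active Learning loop, the selected instances would be labeled by the annotator, added to the training set, and the model retrained before the next selection round.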

The survey is initiated by the researchers Christopher Schröder (Institut für Angewandte Informatik e. V.), Julia Romberg (GESIS - Leibniz Institute for the Social Sciences), and Julius Gonsior (TU Dresden).

If you have any questions, please contact us via activelearningsurvey2024@gmail.com.