ICL Capabilities of LLMs

In-Context Learning, LLM, Huggingface, Chain-of-Thought (CoT).

[Report] [Code]

Background: I did this project for the COMPSCI 685: Advanced Natural Language Processing – Spring '24 course at UMass, along with four other students. My contribution was exploring the in-context learning capabilities of different LLMs, which is described in Section 7 of the report.

Project Title: Exploring the New Horizon of Sequence Modeling: Unveiling the Potentials and Challenges of Mamba.

Project Overview: This evaluation explores the in-context learning (ICL) capabilities of pre-trained language models on arithmetic and sentiment-analysis tasks using synthetic datasets. The goal is to assess how these models perform under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). We conducted two types of arithmetic tasks:

  • Regular Arithmetic: Involves basic arithmetic operations like addition, subtraction, multiplication, and division (using integer division).
  • Jumbled Arithmetic: Involves the standard arithmetic operations but introduces a new symbol for each:
    • '$' represents addition, so a $ b equals a + b.
    • '#' represents subtraction, so a # b equals a - b.
    • '@' represents a new composite operation: a @ b equals (a + b) * (a - b).

Introducing jumbled arithmetic tests whether the model actually learns from the prompts or merely recalls answers from its pre-trained knowledge; a sketch of these operators is given below. For sentiment analysis, the task is to classify the sentiment of a given text as Positive, Negative, or Neutral.
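
For concreteness, here is a minimal Python sketch of how the ground truth for both task types could be computed when generating the synthetic data. The function and variable names are illustrative assumptions, not the project's actual code.

```python
import random

# Ground-truth semantics for the two arithmetic task types.
# (Illustrative sketch; not the project's actual data-generation code.)
REGULAR_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,  # division is integer division
}
JUMBLED_OPS = {
    "$": lambda a, b: a + b,              # '$' = addition
    "#": lambda a, b: a - b,              # '#' = subtraction
    "@": lambda a, b: (a + b) * (a - b),  # '@' = (a + b) * (a - b)
}

def make_example(ops, low=1, high=20):
    """Sample one synthetic question-answer pair from an operator table."""
    a, b = random.randint(low, high), random.randint(low, high)
    op = random.choice(list(ops))
    return f"{a} {op} {b} =", ops[op](a, b)

# e.g. ("7 @ 3 =", 40), since (7 + 3) * (7 - 3) = 10 * 4 = 40
print(make_example(JUMBLED_OPS))
```
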
This figure shows examples of the demonstrations given to the LLMs to explore their ICL capabilities.
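
Below is a minimal sketch of how the three prompting strategies could be assembled and sent to a model through the Hugging Face text-generation pipeline. The prompt templates, demonstrations, and checkpoint name are assumptions for illustration; the project's exact prompts may differ.

```python
from transformers import pipeline

# Demonstrations as (question, rationale, answer) triples; the rationale
# is only included when building chain-of-thought prompts.
DEMOS = [
    ("4 $ 9 =",  "'$' means addition, so 4 + 9 = 13.", "13"),
    ("12 # 5 =", "'#' means subtraction, so 12 - 5 = 7.", "7"),
    ("7 @ 3 =",  "'@' means (a + b) * (a - b), so 10 * 4 = 40.", "40"),
]

def build_prompt(query, n_shots=0, cot=False):
    """Build a zero-shot (n_shots=0), few-shot, or chain-of-thought prompt."""
    blocks = []
    for q, rationale, a in DEMOS[:n_shots]:
        if cot:
            blocks.append(f"Q: {q}\nA: {rationale} The answer is {a}.")
        else:
            blocks.append(f"Q: {q}\nA: {a}")
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

# e.g. querying a Mistral-7B checkpoint (an assumed model choice)
generator = pipeline("text-generation", model="mistralai/Mistral-7B-v0.1")
prompt = build_prompt("6 @ 2 =", n_shots=3, cot=True)
print(generator(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"])
```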

Findings: Here are the major findings of this ICL analysis:

  • Mistral-7b consistently outperformed other models across all tasks, demonstrating robust performance irrespective of the demonstration type.

  • Cerebras-btlm-3b showed limited improvement with increased demonstrations, suggesting potential constraints in its ability to utilize contextual information effectively.

  • In regular arithmetic, models generally improved with more demonstrations; Mamba-7b and Mamba-2.8b benefited particularly from true-label demonstrations.

  • Jumbled arithmetic revealed a stark contrast in performance under CoT prompting: Mistral-7b excelled, indicating a strong ability to leverage additional contextual cues.

  • Sentiment analysis tasks highlighted that all models benefited from demonstrations, especially with true labels. CoT prompting notably enhanced performance, with Cerebras-btlm-3b and Llama2-7b showing considerable gains.

  • Demonstrations with random labels generally improved model performance, but to a lesser extent than demonstrations with true labels.