ICL Capabilities of LLMs

In-Context Learning, LLM, Huggingface, Chain-of-Thought (CoT).

[Report] [Code]

Background: I did this project for the COMPSCI 685: Advanced Natural Language Processing – Spring '24 course at UMass, along with four other students. My contribution was exploring the in-context learning capabilities of different LLMs, which is described in Section 7 of the report.

Project Title: Exploring the New Horizon of Sequence Modeling: Unveiling the Potentials and Challenges of Mamba.

Project Overview: This evaluation explores the in-context learning (ICL) capabilities of pre-trained language models on arithmetic and sentiment-analysis tasks using synthetic datasets. The goal is to assess how these models perform under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). We conducted two types of arithmetic tasks:

  • Regular Arithmetic: Involves basic arithmetic operations like addition, subtraction, multiplication, and division (using integer division).
  • Jumbled Arithmetic: Involves the standard arithmetic operations but introduces a new symbol for each:
    • '$' represents addition, so a $ b equals a + b.
    • '#' represents subtraction, so a # b equals a - b.
    • '@' represents a new composite operation: a @ b equals (a + b) * (a - b).

Introducing jumbled arithmetic tests whether the model actually learns from the prompts or merely recalls answers from its pre-trained knowledge; a sketch of these operators is given below. For sentiment analysis, the task is to classify the sentiment of a given text as Positive, Negative, or Neutral.
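
For concreteness, here is a minimal Python sketch of how the ground truth for both task types could be computed when generating the synthetic data. The function and variable names are illustrative assumptions, not the project's actual code.

```python
import random

# Ground-truth semantics for the two arithmetic task types.
# (Illustrative sketch; not the project's actual data-generation code.)
REGULAR_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,  # division is integer division
}
JUMBLED_OPS = {
    "$": lambda a, b: a + b,              # '$' = addition
    "#": lambda a, b: a - b,              # '#' = subtraction
    "@": lambda a, b: (a + b) * (a - b),  # '@' = (a + b) * (a - b)
}

def make_example(ops, low=1, high=20):
    """Sample one synthetic question-answer pair from an operator table."""
    a, b = random.randint(low, high), random.randint(low, high)
    op = random.choice(list(ops))
    return f"{a} {op} {b} =", ops[op](a, b)

# e.g. ("7 @ 3 =", 40), since (7 + 3) * (7 - 3) = 10 * 4 = 40
print(make_example(JUMBLED_OPS))
```
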
This figure shows examples of the demonstrations given to the LLMs to explore their ICL capabilities.
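
Below is a minimal sketch of how the three prompting strategies could be assembled and sent to a model through the Hugging Face text-generation pipeline. The prompt templates, demonstrations, and checkpoint name are assumptions for illustration; the project's exact prompts may differ.

```python
from transformers import pipeline

# Demonstrations as (question, rationale, answer) triples; the rationale
# is only included when building chain-of-thought prompts.
DEMOS = [
    ("4 $ 9 =",  "'$' means addition, so 4 + 9 = 13.", "13"),
    ("12 # 5 =", "'#' means subtraction, so 12 - 5 = 7.", "7"),
    ("7 @ 3 =",  "'@' means (a + b) * (a - b), so 10 * 4 = 40.", "40"),
]

def build_prompt(query, n_shots=0, cot=False):
    """Build a zero-shot (n_shots=0), few-shot, or chain-of-thought prompt."""
    blocks = []
    for q, rationale, a in DEMOS[:n_shots]:
        if cot:
            blocks.append(f"Q: {q}\nA: {rationale} The answer is {a}.")
        else:
            blocks.append(f"Q: {q}\nA: {a}")
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

# e.g. querying a Mistral-7B checkpoint (an assumed model choice)
generator = pipeline("text-generation", model="mistralai/Mistral-7B-v0.1")
prompt = build_prompt("6 @ 2 =", n_shots=3, cot=True)
print(generator(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"])
```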

Findings: Here are the major findings of this ICL analysis:

  • Mistral-7b consistently outperformed other models across all tasks, demonstrating robust performance irrespective of the demonstration type.

  • Cerebras-btlm-3b showed limited improvement with increased demonstrations, suggesting potential constraints in its ability to utilize contextual information effectively.

  • In regular arithmetic, models generally improved with more demonstrations; Mamba-7b and Mamba-2.8b benefited particularly from true-label demonstrations.

  • Jumbled arithmetic revealed a stark contrast in performance under CoT prompting: Mistral-7b excelled, indicating a strong ability to leverage additional contextual cues.

  • Sentiment analysis tasks highlighted that all models benefited from demonstrations, especially with true labels. CoT prompting notably enhanced performance, with Cerebras-btlm-3b and Llama2-7b showing considerable gains.

  • Demonstrations with random labels generally improved model performance, but to a lesser extent than demonstrations with true labels.