DOI will be added upon paper acceptance.
Few-Shot Learning (FSL) in Large Language Models (LLMs) is highly sensitive to the selection of in-context examples. While most existing approaches rely on random sampling or retrieval-based heuristics, these methods evaluate examples independently and fail to capture interactions between them.
This repository introduces GAFS (Genetic Algorithm for Few-Shot Selection), a novel method that formulates example selection as a global combinatorial optimization problem. Instead of selecting examples individually, GAFS searches for subsets of examples that jointly maximize downstream performance.
The method is evaluated on multiple Spanish hate speech detection datasets and consistently outperforms standard baselines such as Zero-Shot Learning (ZSL) and random and retrieval-based FSL example selection.
- First use of genetic algorithms for FSL example selection
- GAFS outperforms random and retrieval-based FSL baselines
- Consistent gains across LLMs and Spanish hate speech datasets
- Captures complex interactions missed by heuristic selection methods
- Establishes FSL selection as a global optimization problem
- Tomás Bernal-Beltrán — University of Murcia
- Ronghao Pan — University of Murcia
- José Antonio García-Díaz — University of Murcia
- Rafael Valencia-García — University of Murcia
Affiliations:
* Departamento de Informática y Sistemas, Universidad de Murcia, Campus de Espinardo, 30100, Murcia, Spain
Publication information will be added upon paper acceptance.
This work is part of the research project LaTe4PoliticES (PID2022-138099OB-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF EU/FEDER UE), "A way of making Europe". Mr. Tomás Bernal-Beltrán is supported by the University of Murcia through its predoctoral programme.
Citation information will be added upon paper acceptance.
Few-shot learning (FSL) in large language models relies heavily on the selection of in-context examples, yet most existing approaches use simple heuristics such as random sampling or similarity-based retrieval. In this work, we propose GAFS, a genetic algorithm-based method that formulates example selection as a global combinatorial optimization problem. Unlike standard strategies, GAFS searches for subsets of examples that jointly maximize downstream task performance, explicitly capturing interactions between examples.
We evaluate the proposed approach on four Spanish hate speech detection datasets (HatEval, EXIST, DETESTS and HOMO-MEX) using a diverse set of LLM architectures and parameter scales. The results show that GAFS consistently outperforms random and retrieval-based FSL in the majority of models, achieving substantial improvements in macro-averaged F1-score across datasets. Statistical significance analysis using McNemar's test further confirms that these gains are robust, with GAFS correctly classifying nearly two thousand more instances than the strongest baseline.
GAFS formulates FSL example selection as a combinatorial optimization problem, where the goal is to identify a subset of training examples that maximizes downstream task performance when used in-context.
Formally:
argmax_S f(S), subject to |S| = K
Where:
- `S` is a subset of training examples.
- `K` is the number of examples included in the prompt.
- `f(S)` is the downstream task performance (macro-F1).
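Exhaustive search over all size-K subsets is intractable, which motivates a heuristic search such as a genetic algorithm. A quick sanity check of the search-space size (N = 1000 is an illustrative value, not a dataset statistic):

```python
import math

# Number of candidate subsets when picking K examples from N training
# instances. Even a modest training set makes brute force infeasible.
N, K = 1000, 10
n_subsets = math.comb(N, K)  # on the order of 10**23
```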
The overall optimization process of GAFS is illustrated below:
The figure shows how GAFS iteratively evolves candidate subsets of examples using a genetic algorithm guided by LLM-based evaluation, ultimately producing an optimized few-shot prompt.
Each candidate solution is encoded as a binary vector of size N (training set size), where:
- `1` → example is selected.
- `0` → example is not selected.
A constraint enforces that exactly K examples are selected.
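The encoding and the size constraint can be sketched as follows. This is an illustrative sketch, not the repository's `gafs.py` code; `random_individual` and `repair` are hypothetical helper names:

```python
import random

def random_individual(n: int, k: int) -> list[int]:
    """Binary vector of length n with exactly k ones (selected examples)."""
    genome = [0] * n
    for idx in random.sample(range(n), k):
        genome[idx] = 1
    return genome

def repair(genome: list[int], k: int) -> list[int]:
    """Enforce the |S| = K constraint after crossover/mutation by
    randomly dropping or adding selections until exactly k remain."""
    selected = [i for i, g in enumerate(genome) if g == 1]
    unselected = [i for i, g in enumerate(genome) if g == 0]
    while len(selected) > k:
        genome[selected.pop(random.randrange(len(selected)))] = 0
    while len(selected) < k:
        i = unselected.pop(random.randrange(len(unselected)))
        genome[i] = 1
        selected.append(i)
    return genome
```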
Each subset S is evaluated by:
- Constructing an FSL prompt using the selected examples.
- Running the LLM on a validation set.
- Computing macro-F1 score.
This allows GAFS to directly optimize task performance, instead of proxy metrics such as similarity or diversity.
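A minimal sketch of such a fitness function, assuming the training and validation data live in pandas DataFrames with `text` and `label` columns; `llm_predict` is a hypothetical stand-in for whatever wrapper calls the model, not part of this repository:

```python
from sklearn.metrics import f1_score

def fitness(subset_indices, train_df, val_df, llm_predict):
    """Score one candidate subset by macro-F1 on the validation split."""
    # 1. Build the few-shot block from the selected training examples.
    shots = "\n".join(
        f"Text: {train_df.loc[i, 'text']}\nLabel: {train_df.loc[i, 'label']}"
        for i in subset_indices
    )
    # 2. Run the LLM on every validation instance.
    preds = [
        llm_predict(f"{shots}\nText: {row['text']}\nLabel:")
        for _, row in val_df.iterrows()
    ]
    # 3. Macro-F1 is the fitness value the GA maximizes.
    return f1_score(val_df["label"].tolist(), preds, average="macro")
```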
GAFS evolves a population of candidate subsets using:
- **Selection (Tournament Selection)** — selects the best individual among randomly drawn pairs, balancing exploitation and exploration.
- **Crossover (Uniform Crossover)** — combines two subsets by randomly mixing their selected examples.
- **Mutation (Swap Mutation)** — replaces one selected example with a non-selected one, preserving the subset size `K`.
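The three operators can be sketched as follows. This is an illustrative sketch, not the repository's implementation; note that uniform crossover can break the |S| = K constraint, so a repair step is assumed downstream:

```python
import random

def tournament(population, fitnesses, size=2):
    """Pick the fittest among `size` randomly drawn individuals."""
    contenders = random.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def uniform_crossover(a, b):
    """Each gene is inherited from either parent with equal probability.
    May change the number of selected examples; repair afterwards."""
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def swap_mutation(genome):
    """Swap one selected example for a non-selected one, keeping |S| = K."""
    ones = [i for i, g in enumerate(genome) if g == 1]
    zeros = [i for i, g in enumerate(genome) if g == 0]
    child = genome[:]
    child[random.choice(ones)] = 0
    child[random.choice(zeros)] = 1
    return child
```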
- Initialize a population of random subsets.
- Iteratively evolve the population over `G` generations.
- Evaluate each subset using LLM-based fitness.
- Select the best-performing subset as the final solution.
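Putting the steps together, a minimal self-contained sketch of the loop. Individuals are represented as index sets rather than binary vectors for brevity; the defaults mirror the `POP_SIZE`, `N_GEN`, `MUT_RATE` and `CROSS_RATE` values used in `gafs.py`, while the elitism and parent-pool details are assumptions of this sketch:

```python
import random

def gafs_loop(n, k, fitness_fn, pop_size=10, n_gen=5, mut_rate=0.1, cross_rate=0.8):
    """Evolve subsets of k example indices drawn from range(n)."""
    def individual():
        return set(random.sample(range(n), k))

    def mutate(s):
        # Swap mutation: replace one selected index with an unselected one.
        out = set(s)
        out.remove(random.choice(list(out)))
        out.add(random.choice(list(set(range(n)) - s)))
        return out

    def crossover(a, b):
        # Mix the parents' picks, then sample back down to exactly k.
        return set(random.sample(list(a | b), k))

    pop = [individual() for _ in range(pop_size)]
    for _ in range(n_gen):
        scored = sorted(pop, key=fitness_fn, reverse=True)
        nxt = scored[:2]  # elitism: carry the two best forward
        while len(nxt) < pop_size:
            p1, p2 = random.sample(scored[: max(2, pop_size // 2)], 2)
            child = crossover(p1, p2) if random.random() < cross_rate else set(p1)
            if random.random() < mut_rate:
                child = mutate(child)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness_fn)
```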
Unlike retrieval-based methods, GAFS does not optimize individual examples, but the subset as a whole.
Unlike standard approaches:
| Method | Selection Strategy | Limitation |
|---|---|---|
| Random FSL | Random sampling | High variance, unstable |
| Retrieval-based FSL | Similarity / diversity | Ignores interactions |
| GAFS | Global optimization | — |
GAFS does not evaluate examples independently, but searches for subsets that:
- Capture complementary information.
- Align with the model’s behavior.
- Optimize joint contribution to performance.
The best set of examples is not the sum of the best individual examples.
GAFS explicitly models interactions between examples, enabling the discovery of subsets that:
- Are not necessarily the most similar or diverse.
- Are nevertheless optimal for the model and the task.
- Fitness evaluation requires LLM inference.
- Cost depends on:
  - Population size (`P`).
  - Number of generations (`G`).
  - Validation set size.
To mitigate this:
- A validation subset is used instead of the full training set.
- Fitness values can be cached to avoid recomputation.
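Caching can be as simple as memoizing on the (order-independent) set of selected indices; a sketch, not the repository's code:

```python
def cached_fitness(fitness_fn):
    """Memoize fitness so re-evaluating a subset skips the LLM calls.
    Subsets are keyed by frozenset, so index order does not matter."""
    cache = {}

    def wrapper(subset):
        key = frozenset(subset)
        if key not in cache:
            cache[key] = fitness_fn(subset)
        return cache[key]

    wrapper.cache = cache  # exposed for inspection
    return wrapper
```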
We evaluate GAFS under a controlled experimental framework designed to isolate the impact of example selection strategies on In-Context Learning (ICL) performance.
All methods are compared using the same prompt templates, decoding configuration and evaluation metrics, ensuring that performance differences are solely attributable to the selection strategy.
The evaluation focuses on Spanish hate speech detection, a challenging classification task characterized by:
- Implicit and context-dependent language.
- Domain variability across datasets.
- Sensitivity to prompt design and example selection.
We evaluate GAFS on four benchmark datasets covering different forms of harmful content:
- HatEval — hate speech targeting immigrants and women.
- EXIST — sexism detection in social media.
- DETESTS — racial stereotypes in news comments.
- HOMO-MEX — LGBT+phobic content in Mexican Spanish.
These datasets vary in:
- Domain (Twitter, news, social media, lyrics).
- Label space (binary vs multi-class).
- Linguistic phenomena (explicit vs implicit hate).
This diversity enables a robust and cross-domain evaluation.
These datasets are not distributed with this repository.
Users are responsible for obtaining access through the official sources and complying with their respective licenses and usage terms.
We evaluate a diverse set of open-weight decoder-only LLMs, covering different architectures and parameter scales:
- Gemma 2 — 2B, 9B
- LLaMA 3.x — 3.1 8B, 3.2 1B, 3.2 3B
- Qwen 2.5 — 1.5B, 3B, 7B
- Mistral — 7B
- Phi-3.5 Mini
This selection allows us to analyze:
- The effect of model size.
- Differences across architectural families.
- Model-dependent behavior in ICL.
We compare GAFS against three representative ICL strategies:
- **Zero-Shot Learning**
  - No examples provided.
  - Serves as a lower-bound baseline.
- **Random Few-Shot Learning**
  - Random selection of `K` examples.
  - Simple but high variance.
- **Retrieval-based Few-Shot Learning**
  - Selection based on semantic similarity, diversity (clustering) and uncertainty (entropy).
  - Strong state-of-the-art baseline.
- Prompts are written in English (control language).
- Inputs remain in Spanish.
- Same prompt structure across all methods.
- Deterministic decoding: `do_sample = False`, `num_beams = 1`.
- Fixed number of examples: `K = 10` for all FSL methods.
This ensures fair and reproducible comparisons.
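With Hugging Face `transformers`, this configuration corresponds to plain greedy decoding. A sketch of the shared settings (the `max_new_tokens` value is an assumption, not taken from the paper):

```python
# Decoding settings shared by every method in the comparison. Greedy
# decoding makes runs deterministic, so performance differences come
# only from the selected examples.
GENERATION_KWARGS = {
    "do_sample": False,    # no sampling noise
    "num_beams": 1,        # greedy search, no beam search
    "max_new_tokens": 10,  # labels are short; this value is an assumption
}

# Usage with transformers (sketch):
# outputs = model.generate(**tokenizer(prompt, return_tensors="pt"),
#                          **GENERATION_KWARGS)
```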
Performance is measured using:
- Macro-F1 (primary metric).
- Macro Precision.
- Macro Recall.
Macro-averaging ensures balanced evaluation across classes, especially under class imbalance.
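The effect of macro-averaging is easy to see on a toy imbalanced example, computed with scikit-learn; the numbers below are illustrative, not results from the paper:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1]  # imbalanced toy labels
y_pred = [0, 0, 0, 0, 0]  # majority-class predictor, 80% accuracy

# Macro averaging weights each class equally, so the completely missed
# minority class drags every score down despite the high accuracy.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
macro_p = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_r = recall_score(y_true, y_pred, average="macro", zero_division=0)
```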
GAFS consistently outperforms standard FSL selection strategies across datasets and model families.
- GAFS achieves the best performance in the majority of evaluated models.
- Improves over strong baselines by up to +6 F1 points.
- Delivers consistent gains across all datasets, including cross-domain settings.
- Produces +1970 additional correct predictions compared to the strongest baseline.
- All improvements are statistically significant (McNemar’s test, p < 0.05).
- Gains are observed across:
- Different architectures (Gemma, LLaMA, Qwen, Mistral, Phi).
- Different scales (1B → 9B parameters).
- GAFS shows more stable performance than retrieval-based methods, which are more model-dependent.
GAFS provides the largest improvements when:
- The task is complex or ambiguous (e.g., implicit hate speech).
- The model is sensitive to example selection.
- Heuristic methods fail to capture complementary information.
The best set of examples is not the sum of the best individual examples.
Selecting examples based on similarity or diversity alone is not sufficient.
GAFS shows that interactions between examples play a critical role in ICL performance.
- Retrieval-based methods optimize local properties (similarity, diversity).
- GAFS optimizes global performance.
- This allows GAFS to discover:
- More informative combinations.
- Better alignment with the model.
- More effective decision boundaries.
This repository provides all the necessary components to fully reproduce the experiments reported in the paper, including:
- Prompt templates for ZSL and FSL settings.
- Implementation of the GAFS optimization algorithm.
- Scripts for all experimental configurations (ZSL, random and retrieval-based FSL).
- Files for efficient and scalable execution.
All experiments are designed to be reproducible under a controlled setup, with fixed seeds and deterministic decoding.
```
├── prompts/
│   └── Prompt templates used in the experiments
│
├── code/
│   ├── experiment_scripts/
│   │   ├── experiments_zs.sh
│   │   ├── experiments_fs.sh
│   │   └── experiments_ga.sh
│   │
│   ├── zero_shot.py
│   ├── few_shot.py
│   ├── gafs.py
│   ├── functions.py
│   └── ga_functions.py
│
├── images/
│   └── Images included in this repository
│
└── README.md
```
- `zero_shot.py` — script for running ZSL experiments.
  Arguments:
  - `-m` (int) → Model to evaluate
  - `-d` (str) → Path to the dataset
- `few_shot.py` — script for running few-shot experiments with different example selection strategies.
  Arguments:
  - `-m` (int) → Model to evaluate
  - `-d` (str) → Path to the dataset
  - `-e` (int) → Number of examples used in the prompt (default: 5)
  - `-t` (int) → Strategy to select examples (default: 0)
    - `0` → Random selection
    - `1` → Retrieval-based selection
- `gafs.py` — main script for running the GAFS optimization process for few-shot example selection.
  Arguments:
  - `-m` (int) → Model to evaluate
  - `-d` (str) → Path to the dataset
  - `-t` (int) → Strategy to evaluate each candidate solution (default: 0)
    - `0` → Subset-based evaluation
    - `1` → Full dataset evaluation

  Genetic Algorithm Parameters (defined directly within the script):
  - `POP_SIZE` → Population size (default: 10)
  - `N_GEN` → Number of generations (default: 5)
  - `MUT_RATE` → Mutation rate (default: 0.1)
  - `CROSS_RATE` → Crossover rate (default: 0.8)

  Additional Parameters:
  - `K` → Number of examples selected for the few-shot prompt (default: 10)
  - `VALIDATION_PERCENTAGE` → Proportion of data used for fitness evaluation (default: 0.1)
- `functions.py` — utility functions used across the experimental pipeline.
- `ga_functions.py` — functions implementing the genetic algorithm operations used in GAFS.
The following scripts provide a simple way to reproduce the experiments:
- `experiments_zs.sh` — executes ZSL experiments.
- `experiments_fs.sh` — executes FSL experiments with baseline methods.
- `experiments_ga.sh` — executes experiments using the GAFS optimization process.
These scripts contain the commands and configurations required to run the experiments end-to-end.
The experiments were conducted using the following environment:
- Python — 3.10
- torch — 2.9.1
- transformers — 4.57.3
- scikit-learn — 1.7.2
- tqdm — 4.67.1
- pandas — 2.3.3
- numpy — 2.2.6
These libraries are required to run the scripts included in this repository.
This repository is released under the MIT License.
The code is provided for research and academic purposes, and can be freely used, modified and distributed under the terms of the license.
Any datasets used in the experiments remain subject to their respective licenses and terms of use.
