njkumarr

Literature Review on Sampling Techniques for Language Models

This post is my ongoing review of various sampling methods I've found through papers and online implementations. While samplers are relatively straightforward to create and combine, it is challenging to sample outputs that are both high quality and diverse. I'm particularly interested in finding samplers that are context-aware, resist hallucination, and can explore multiple reasoning paths in language models. For most of the samplers below I link their corresponding papers, implementations, and default parameter settings.

Basics

Logits and Logprobs

Language models generate text by predicting one token at a time. The process starts with tokenizing the input text and converting these tokens into embeddings. Then, we feed them through the model's layers and get logits - raw, unnormalized scores indicating how confident the model is that each token in the vocabulary is the next token in the sequence.

The softmax function then transforms these logits into probabilities, and now we can use various sampling methods to select the next token. Each selected token is appended to our input, and we repeat this process until we’ve generated the complete output.

Also, we can take the natural logarithm of these probabilities to give us log probabilities (logprobs). Logprobs are useful for scoring entire sequences since we can add them instead of multiplying probabilities and potentially running into underflow.

Input: "the dog jumped over the __"
    ↓
tokenization
    ↓ 
embeddings
    ↓
language_model
    ↓
[5.21, 2.1, 0.3]  # Logits for "fence", "wall", "house"
    ↓
[0.95, 0.04, 0.01] # Probabilities after softmax
log("the") = -0.69
log("dog") = -1.20
log("jumped") = -1.61

# logprob of a sequence -> sum the individual logprobs
log("the dog jumped") = -0.69 + (-1.20) + (-1.61) = -3.50

Greedy decoding simply picks the highest probability token at each step - very basic and naive. While straightforward, greedy decoding often misses better overall sequences because it can't look ahead to see high probability tokens hiding behind lower probability choices.

Input: "the dog jumped over the __"
    ↓
language_model
    ↓ 
[5.21, 2.1, 0.3] # Logits
    ↓
[0.95, 0.04, 0.01] # Probabilities
    ↓
"fence" # Select highest probability

Beam search tackles these limitations by tracking multiple candidate sequences (beams) simultaneously. At each step, it expands each beam with the most likely next tokens and keeps the top candidates by cumulative probability. This process stops either at a certain max length or when a stop token is chosen.


Source: d2l.ai

Greedy decoding and beam search both do well in closed-ended tasks (translation, text summarization, image captioning) where the output length can be inferred from the input text. However, they perform poorly in open-ended text generation (creative writing, roleplaying, conversational dialogue) where there is no clear stopping point. Both of these methods produce repetitive and incoherent text in open-ended tasks.

Sampling and Temperature

Sampling introduces diversity into our text generation by giving every token a chance to be picked based on its probability, rather than always picking the most likely token. This reduces repetition, but it also risks selecting too many low probability words. To control this, several techniques constrain sampling to a subset of the distribution.

One technique is temperature scaling, where we divide the model's original logits by a specific value. Lower temperatures T < 1 make the probability distribution more peaked, which favors high probability tokens. Higher temperatures T > 1 flatten the distribution, increasing the chance of selecting lower probability tokens. Higher temperatures are associated with more “creative” or “random” outputs.

Note that temperature divides the raw logits before the softmax, so the scaling effect on the probabilities is non-linear. This means doubling the temperature won't double the model's “randomness”.

Input: "the dog jumped over the __"

# logits for fence, wall, house
logits = [5.0, 3.0, 2.0]

# divide logits by 0.5, more peaked distribution
temperature = 0.5                  
scaled_logits = [10.0, 6.0, 4.0]
probs = [0.98, 0.02, 0.00]

# divide logits by 1.0
temperature = 1.0                  
scaled_logits = [5.0, 3.0, 2.0]   
probs = [0.84, 0.11, 0.04]

# divide logits by 2.0, flatter distribution
temperature = 2.0                 
scaled_logits = [2.5, 1.5, 1.0]   
probs = [0.63, 0.23, 0.14]
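
As a sanity check, here's a minimal numpy sketch of temperature scaling - the probabilities above come from exactly this calculation (divide the logits by T, then softmax):

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 3.0, 2.0])   # fence, wall, house

for temperature in (0.5, 1.0, 2.0):
    print(temperature, np.round(softmax(logits / temperature), 2))
# 0.5 -> [0.98 0.02 0.  ]   (more peaked)
# 1.0 -> [0.84 0.11 0.04]
# 2.0 -> [0.63 0.23 0.14]   (flatter)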

Beam Search Variations

Stochastic Beam Search

[paper, released in 2019]

Combining sampling with beam search should theoretically blend the creativity of sampling with the likelihood maximization of beam search. However, there is a key problem: low probability tokens have compounding effects in beam search. When a low probability token appears early in a candidate, beam search drastically reduces the chance of that beam being selected, even if the sentence leads to an interesting output. This pushes beam search toward sampling repetitive sequences, with minimal variations only appearing at the end of the sentence.

Stochastic beam search solves this by adding randomly generated noise to the logprobs before selecting beams. Before sampling token probabilities, we:

  1. Add in Gumbel-distributed noise1 to our sentence logprobs.
  2. Select the highest noise-adjusted logprobs.
  3. Carry this noise forward through the generation.

This approach preserves creative words by removing the compounding penalty. When we select the top-k noise-adjusted probabilities, we pick up diverse samples (when noise boosts a lower probability word into the top-k) while maintaining quality (by still choosing the highest probabilities at each step).
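
Here's a rough numpy sketch of the Gumbel-max trick that powers this (see footnote 1). Note that full stochastic beam search also conditions each child's noise on its parent's perturbed score so the noise carries forward consistently, which this sketch skips:

import numpy as np

rng = np.random.default_rng(0)

def gumbel_top_k(logprobs, k):
    # Adding Gumbel(0, 1) noise to the logprobs and taking the top-k is
    # equivalent to sampling k items without replacement.
    noise = rng.gumbel(size=len(logprobs))
    perturbed = logprobs + noise
    return np.argsort(perturbed)[::-1][:k]

beam_logprobs = np.log(np.array([0.50, 0.30, 0.15, 0.05]))  # candidate beams
print(gumbel_top_k(beam_logprobs, k=2))  # indices of the 2 beams we keep this step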

Diverse Beam Search

[paper, released in 2018] | Huggingface | fairseq

In practice, beam search typically produces candidates that differ only slightly from each other, since they all stem from a few high-scoring beams. This wastes compute on near-duplicate outputs. Diverse Beam Search fixes this by explicitly optimizing for diversity across candidates.

First we add a diversity penalty that discourages repetition between sequences. Measuring diversity across all B candidates at each timestep, however, would be too computationally expensive. So instead, we split our B candidates into G groups. The first group does regular beam search, but the later groups receive a diversity penalty2 if they repeat words from previous groups and beams. We then subtract this penalty from the sequence logprobs, and this gives us B beams that are both highly probable and distinct from each other.


Source: Diverse Beam Search paper
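
To make the penalty concrete, here's a small sketch of the Hamming-style group penalty mentioned in footnote 2. The function name and the diversity_strength parameter are just illustrative, not the Huggingface or fairseq API:

import numpy as np
from collections import Counter

def apply_group_diversity_penalty(logprobs, prev_group_tokens, diversity_strength=0.5):
    # Penalize tokens that earlier groups already emitted at this timestep,
    # pushing later groups toward different words.
    penalized = logprobs.copy()
    for token_id, count in Counter(prev_group_tokens).items():
        penalized[token_id] -= diversity_strength * count
    return penalized

vocab_logprobs = np.log(np.array([0.6, 0.3, 0.1]))  # "fence", "wall", "house"
prev_group_tokens = [0, 0]  # earlier groups already picked "fence" twice
print(apply_group_diversity_penalty(vocab_logprobs, prev_group_tokens))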

Truncation Sampling

Besides temperature scaling, a straightforward way to constrain sampling to a subset of highly probable tokens is truncation: we cut the distribution down to the high-probability tokens and sample from what remains.

Top-K Sampling

[paper, released in 2018] | GPT-2 | default: 40

Top-K restricts token selection to the k most likely tokens, where k is a predefined parameter. This method helps prevent the selection of highly improbable tokens. We first sort our tokens by probabilities from highest to lowest, select the top k tokens, renormalize these probabilities based on these k tokens, and then sample from this reduced set.
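
A minimal sketch of this procedure (sort, keep the top k, renormalize, sample):

import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k=40):
    # keep only the k most likely tokens, renormalize, then sample
    top_indices = np.argsort(probs)[::-1][:k]
    top_probs = probs[top_indices] / probs[top_indices].sum()
    return rng.choice(top_indices, p=top_probs)

probs = np.array([0.75, 0.15, 0.06, 0.03, 0.01])
print(top_k_sample(probs, k=3))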

GPT-2 popularized top-k sampling and was inspired by the paper above, which used top-k sampling to generate more descriptive stories with neural networks. Higher K values lead to more diverse text and lower K values lead to more predictable text.

One criticism of top-k is that it does not dynamically adapt which words get filtered from the distribution; instead we pick a fixed value that cuts off the distribution arbitrarily. Different contexts need different K values. For instance, a highly constrained sentence like “the dog jumped over the ____” requires only a few high probability words, whereas a sentence like “I sold the ____” could be completed by almost any noun, so we should sample from a larger set.

Nucleus Sampling (Top-p)

[paper, released in 2019] | OpenAI Docs | default: 0.9

Nucleus sampling instead adaptively limits the candidate set by selecting the smallest set of tokens whose cumulative probability exceeds a threshold p. Like top-k, we sort tokens by probability, but we add up probabilities until we hit the p threshold. We then renormalize the probabilities and sample from this subset.
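
Here's a minimal sketch of the nucleus step (sort, take the smallest prefix whose cumulative probability reaches p, renormalize, sample):

import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs, p=0.9):
    order = np.argsort(probs)[::-1]               # sort tokens by probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest prefix reaching p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_sample(probs, p=0.9))   # samples from the first 3 tokens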

The authors of this paper show that top-p adapts better than top-k while maintaining fluency. Nucleus sampling is easier to tune since it adaptively prevents sampling from too few or too many tokens - a common problem with top-k. Higher p values generate more diverse text, and lower values produce more probable completions.

Although top-p is more adaptive than top-k, it still fails to accurately find the cutoff point where low-probability tokens start to occur, and as a result can still include too many unlikely tokens, leading to incoherent text.

Tail Free Sampling

[article, posted in 2019] | llama.cpp | default: 0.9

Tail Free Sampling, created by Trenton Bricken, looks at the derivative of the sorted token probabilities in order to find the “tail” at which tokens are “no longer replaceable”. The author argues that this cutoff can be found where the token probabilities start to plateau.

The process works by first measuring how quickly token probabilities decrease from highest to lowest. Then, TFS calculates the second-order derivative of this distribution and sums up these derivatives until we reach a threshold Z. This Z parameter determines where TFS will cut off unlikely tokens. The author explains some theoretical advantages of this approach, but there aren't solid benchmarks for open-ended text generation, so I'm not sure if it dramatically improves on nucleus sampling.
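
Here's a loose numpy sketch of that recipe. The exact indexing and normalization details differ a bit across implementations (e.g. llama.cpp), so treat this as an illustration rather than a reference implementation:

import numpy as np

rng = np.random.default_rng(0)

def tail_free_sample(probs, z=0.9):
    order = np.argsort(probs)[::-1]
    sorted_probs = probs[order]
    d2 = np.abs(np.diff(sorted_probs, n=2))   # second derivative of the sorted curve
    d2 = d2 / d2.sum()                        # normalize so it sums to 1
    cutoff = int(np.searchsorted(np.cumsum(d2), z)) + 1
    keep = order[:cutoff]                     # drop the flat "tail"
    keep_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=keep_probs)

probs = np.array([0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01])
print(tail_free_sample(probs, z=0.9))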

Min-P Sampling

[r/LocalLLama thread, posted in 2023] | [paper, released in 2024] | Huggingface | default: 0.1

Min-p sampling was created by @kalomaze and @menhguin and is an alternative truncation-based sampler. The authors write that min-p solves the issue of top-k, where too many probable words get discarded, and the issue of top-p, where too many low probability words remain in the distribution.

Min-P does this by setting a minimum probability value min_p as a baseline. This value is multiplied by the probability of the top token in the distribution, and the result is the minimum threshold at which a token will be considered at all. For example, if min_p is set to 0.05 and the most likely token has a probability of 0.9, tokens with a probability below 0.045 are filtered out.
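
A minimal sketch of min-p, matching the example above:

import numpy as np

rng = np.random.default_rng(0)

def min_p_sample(probs, min_p=0.1):
    # the cutoff scales with the most likely token:
    # anything below min_p * max(probs) is filtered out
    threshold = min_p * probs.max()
    keep = np.where(probs >= threshold)[0]
    keep_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=keep_probs)

probs = np.array([0.90, 0.05, 0.03, 0.02])
print(min_p_sample(probs, min_p=0.05))   # tokens below 0.045 are dropped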

I see this as a similar approach to what Tail Free Sampling tries to accomplish - both of these methods try to find and cut off the tail of “irrelevant” tokens. Unlike TFS however, there are some emerging creative writing benchmarks that show min-p achieving significantly higher winrates compared to top-p. Min-p doesn't show many downsides in standard sampling settings, and there are clear benefits in higher temperature settings where the model is more diverse in its outputs.


Source: Min-P paper

Contrastive Sampling

Most language models are trained to predict tokens in a left to right manner, which forces us to follow a single reasoning path. If we hit a point where there are only several low-probability tokens, we risk going down paths that lead to hallucinations or lower quality text. We are then stuck choosing whatever is “good enough” in our probability distribution, even if it's not the best path.

Contrastive samplers try to generate multiple candidates and contrast their differences, and by comparing these distinct paths, we can distinguish which ones are worth exploring and which ones we should avoid.

Basically, the gist of contrastive sampling:

> “How do we make our model smarter?” 
> “See what a dumb person says and do the opposite of that”

Contrastive Decoding

[paper, released in 2023] | default: α: 0.1, β: 0.5

Contrastive decoding involves two models:

  1. A larger “expert” model (65B in the paper)
  2. A smaller “amateur” model (1B in the paper)

Essentially, we subtract the distribution of dumb completions from the expert model's distribution, and we use this subset to guide sampling for the expert model. This pushes the text generation toward tokens where there’s a big gap between expert and amateur distributions, and these gaps should signal “smarter” tokens that the smaller model doesn’t catch.

For instance, let’s say you are given the following input sentence and logits:

Input: “The dog jumped over the ____ “

Smart model logits:
wall 6.0
fence 5.8
cat 5.5
house 2.1

Amateur model logits:
wall 5.9
fence 3.2
cat 5.4
house 2.0

There are two parameters here. α controls the masking cutoff: any words below this cutoff get dropped. β controls the expert-amateur tradeoff.

Cutoff: log(α) + max(expert_logits)
α = 0.1
Cutoff = log(0.1) + max(expert_logits)
Cutoff = 3.7

Tradeoff: (1 + β) * expert - β * amateur
β = 0.5
wall: (1.5 * 6.0) - (0.5 * 5.9) = 9.0 - 2.95 = 6.05 
fence: (1.5 * 5.8) - (0.5 * 3.2) = 8.7 - 1.6 = 7.1 
cat: (1.5 * 5.5) - (0.5 * 5.4) = 8.25 - 2.7 = 5.55 
house: (1.5 * 2.1) - (0.5 * 2.0) = 3.15 - 1.0 = 2.15 —> masked out 2.15 < 3.7

Final logits:
fence: 7.1
wall: 6.05
cat: 5.55
house: -inf

Even though wall had higher logits in both models, fence ends up with the highest score in the final logits. Because the amateur model wasn't as confident in fence compared to the expert model, this gap signals that fence is the smarter choice, so contrastive decoding pushes this to the top of the logits.
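
Here's a small numpy sketch that reproduces the arithmetic above (masking in logit space with α, then the (1 + β)/β tradeoff):

import numpy as np

def contrastive_decoding_logits(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    # plausibility cutoff relative to the expert's most confident token
    cutoff = np.log(alpha) + expert_logits.max()
    # expert-amateur tradeoff
    scores = (1 + beta) * expert_logits - beta * amateur_logits
    scores[expert_logits < cutoff] = -np.inf   # mask implausible tokens
    return scores

expert  = np.array([6.0, 5.8, 5.5, 2.1])   # wall, fence, cat, house
amateur = np.array([5.9, 3.2, 5.4, 2.0])
print(contrastive_decoding_logits(expert, amateur))
# [6.05  7.1  5.55  -inf] -> "fence" comes out on top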

In the paper, the authors found that contrastive decoding outperformed larger models on the HellaSwag and GSM8K benchmarks, and that it did better in reasoning tasks and long-form text generation. However, it performs worse at arithmetic and factual recall, so there definitely needs to be more exploration of the hyperparameters, model size ratios, and the training saturation of each of the models.

DoLa Decoding

[paper, released in 2023] | Huggingface

Pretty much the same principle as the paper above, but instead of using a dumb model, we contrast the final layer of a language model with its earlier layers and do the same calculation to steer the model toward smarter tokens.

The key idea here is that earlier layers capture basic patterns, while the final layer captures more complex understanding. DoLa decoding has similar benefits to the approach above but is more efficient because we aren't storing an extra model to compute these differences.

Entropy Sampling

These methods use entropy, which is a measure of a model’s uncertainty or randomness, to guide the sampling strategy. Like contrastive sampling, we try to explore different reasoning paths, but through the lens of information theory instead.

Entropy measures uncertainty across all tokens in a distribution. High entropy means the model is uncertain across many tokens, whereas low entropy means the model is confident on a few tokens.

Entropy = -Σ p(x) * log₂(p(x))

This connects directly to a token's information content. Lower information content means the word is more expected and common, and higher information content means the word is more unexpected and surprising to the model.

Information Content for one token = -log₂(p(x))
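
In code, both quantities are one line each. For example, for a three-token distribution:

import numpy as np

probs = np.array([0.75, 0.20, 0.05])

info_content = -np.log2(probs)            # per-token surprisal: [0.42, 2.32, 4.32] bits
entropy = np.sum(probs * info_content)    # expected surprisal: ~0.99 bits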

Locally Typical Sampling

[paper, released in 2022] | Huggingface | default: τ: 0.2

Locally Typical Sampling finds words that match the expected information content of the surrounding context. The insight here is that human speakers tend to avoid words that are either too predictable (low information content) or too surprising (high information content), in order to minimize the risk of miscommunication. Based on a parameter τ, typical sampling builds a subset of words that are neither too predictable nor too surprising. This leads to text that is more natural and human-like.

In Locally Typical Sampling, we:

  1. Calculate average expected information content (entropy) for the next word
  2. Calculate the information content for each possible next word
  3. Select words that are close to the average (within τ)
  4. Sample from this “typical” subset
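
Here's a minimal sketch of these four steps. Note that the Huggingface implementation treats τ as a cumulative probability mass over the tokens closest to the entropy, rather than the hard distance cutoff used below:

import numpy as np

rng = np.random.default_rng(0)

def typical_sample(probs, tau=0.2):
    info = -np.log2(probs)                # 2. information content per token
    entropy = np.sum(probs * info)        # 1. expected information content
    keep = np.where(np.abs(info - entropy) <= tau)[0]   # 3. within tau of average
    if len(keep) == 0:                    # fall back to the closest token
        keep = np.array([np.argmin(np.abs(info - entropy))])
    keep_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=keep_probs) # 4. sample from the "typical" subset

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(typical_sample(probs, tau=1.0))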

Through this method, we get some cool benefits like a built-in repetition penalty. When words repeat, the model sees them as more predictable, which lowers their information content. This pushes them outside the typical subset, so we then avoid repeating words.

However, the core assumption that typical words lie within some fixed distance τ of the average entropy seems arbitrary. Moreover, I think the model already learns these human patterns through extensive pre-training, and an explicit, constant enforcement of this distance might be overkill.

Mirostat Sampling

[paper, released in 2020] | llama.cpp | default: τ: 3.0, η: 0.1

Mirostat sampling controls text generation by dynamically adjusting perplexity3 through feedback. The goal is to prevent both low perplexity (repetitive text) and high perplexity (incoherent text) by maintaining a target surprise value.

The algorithm uses two key parameters: τ (target perplexity) and η (learning rate). For each token, it calculates the actual perplexity and adjusts a maximum threshold (μ) using the error between target and actual perplexity, multiplied by the learning rate. Tokens whose surprise exceeds this threshold are filtered out, and the next token is sampled from the remaining subset. This feedback loop keeps perplexity close to the target throughout generation.
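
Here's a rough sketch of one step of this loop, in the style of Mirostat 2.0 (the simpler variant used in llama.cpp). Surprise is measured in bits, and μ is commonly initialized to 2τ:

import numpy as np

rng = np.random.default_rng(0)

def mirostat_v2_step(probs, mu, tau=3.0, eta=0.1):
    surprise = -np.log2(probs)
    keep = np.where(surprise <= mu)[0]        # filter tokens above the threshold
    if len(keep) == 0:
        keep = np.array([np.argmax(probs)])   # always keep at least the top token
    keep_probs = probs[keep] / probs[keep].sum()
    token = rng.choice(keep, p=keep_probs)
    observed = -np.log2(probs[token])
    mu = mu - eta * (observed - tau)          # feedback: nudge mu toward the target
    return token, mu

probs = np.array([0.5, 0.3, 0.15, 0.05])
token, mu = mirostat_v2_step(probs, mu=2 * 3.0)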

I like the idea of using dynamic values that adapt to context, and this is definitely better than a fixed threshold. However, just relying on perplexity seems insufficient for text control, and in my experience it's often an unreliable quality marker. Also, looking at just the next token limits us because we miss broader patterns that emerge across multiple words and ignore the long-range context for sentences.

Entropix Sampling (Shrek Sampler)

[twitter thread, posted in 2024] | Github repo

Entropix sampling, created by @_xjdr and @doomslide, is an adaptive sampler that uses entropy measured over the model's logits and attention to guide sampling decisions. These are the key metrics that go into this method:

  1. Calculate the entropy: This measures how spread out the probability distribution is over possible next tokens. Higher entropy means probabilities are dispersed across many tokens, whereas lower entropy means probabilities are concentrated on a few tokens.
  2. Calculate the varentropy: Varentropy measures the variance of information content across possible tokens at the same position. This shows whether the model is consistent (low varentropy) or uncertain (high varentropy) in its token choices.
  3. Calculate the attention entropy and attention varentropy in the attention heads: High values indicate attention is spread across many tokens, while low values show it's concentrated on a few tokens.
  4. Calculate the attention agreement: This measures consistency across attention heads by comparing each head’s attention distribution to the mean attention. Low agreement means the attention heads are focusing on different parts of the input, and high attention agreement means that the heads focus on a single part of the sequence.
  5. Calculate the interaction strength: The interaction strength is the mean of all attention scores across the model, and it shows how strongly the model connects parts of the input. Higher interaction strength means there are stronger relationships between tokens.

Then, based on these metric calculations there are 5 possible sampling strategies:

  1. Low entropy, low varentropy: the model is confident and consistent in its prediction. In this case, we simply do greedy sampling to pick the next best token.
  2. High entropy, low varentropy: the model is uncertain but consistent in its uncertainty; it is unsure across almost every token. We trigger a clarification token and adjust the temperature higher. The clarification token acts as a pause so the model can get more information from the user and return to the correct path.
  3. Low entropy, high varentropy: the model is confident but sees multiple distinct possibilities. We perform exploration sampling by increasing temperature and top-k if attention agreement is low, then resample with these adjusted parameters.
  4. High entropy, high varentropy: the model is highly uncertain and sees many possibilities, basically it has no idea what to do! We need to do an exhaustive exploratory approach, so we increase temperature significantly, decrease top-p, and resample.
  5. Medium entropy, medium varentropy: use an adaptive multi-sample approach. Adjust all the parameters (temperature, top-p, top-k, min-p) and generate multiple samples (default: 12). We then pick the highest scoring sample based on logprobs and confidence score, and the confidence score is just a weighted average of all calculated metrics from before.


Source: entropix repo

To summarize, here's the workflow:

  1. After the model generates its logits, we calculate the metrics (entropy, varentropy, attention entropy, attention varentropy, attention agreement, and interaction strength).
  2. Based on the metrics, the adaptive sampler selects the most appropriate sampling strategy.
  3. Using the strategy and metrics, the sampling parameters (temperature, top-k, top-p, min-p) are dynamically adjusted4.
  4. The strategy is applied to select the next token, and the process repeats.
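
To make the first two metrics concrete, here's a rough numpy sketch of the logits entropy/varentropy calculation plus a simplified strategy switch. The thresholds are made up for illustration, and the real sampler also folds the attention-based metrics into each branch:

import numpy as np

def logits_entropy_varentropy(logits):
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    info = -np.log2(probs + 1e-12)
    entropy = np.sum(probs * info)                      # average surprise
    varentropy = np.sum(probs * (info - entropy) ** 2)  # spread of the surprise
    return entropy, varentropy

def pick_strategy(entropy, varentropy, low=1.0, high=3.0):  # illustrative thresholds
    if entropy < low and varentropy < low:
        return "greedy"
    if entropy > high and varentropy < low:
        return "insert clarification token, raise temperature"
    if entropy < low and varentropy > high:
        return "explore: raise temperature and top-k, resample"
    if entropy > high and varentropy > high:
        return "explore harder: raise temperature, lower top-p, resample"
    return "adaptive: adjust all parameters, draw multiple samples, score them"

print(pick_strategy(*logits_entropy_varentropy(np.array([10.0, 1.0, 0.5]))))  # confident -> "greedy"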

This is a behemoth of a sampler, and from the text generations I've seen online, there seems to be much better reasoning and coherence from the model. Others have also reported initial improvements in both reasoning and math tasks across different model architectures. I'm interested in seeing more benchmarks get released, as well as reproducing these results once the repo starts to mature and there is more guidance on usage.

There are parts of entropix that are unclear, however. The connection between the attention calculations and the sampling strategies is hard for me to understand, and some of the parameter adjustments feel arbitrary (e.g. the confidence score calculations and parameter tweaks). Granted, this project is only 30 days old, and I think the authors acknowledge this and are still hacking things out, but I'm really interested in seeing entropix get fleshed out in the future.


Improved reasoning with entropix | Source: Twitter

Misc (aka idk where to put these)

Speculative Decoding (assisted decoding, speculative sampling)

[paper, released in 2023]5 | [paper, released in 2023]5

This isn't really a sampling method but more a technique to speed up inference if you have extra memory and compute. In speculative decoding we have two models:

  1. A big, slow "target" model
  2. A small, fast "draft" model

There are certain sequences (common phrases, pronouns, punctuation) that are easy to predict. We don't want to waste our big target model's compute on these tokens, so we let the smaller draft model handle them.

The draft model proposes K possible next tokens, and the target model then verifies them. We compare what both models predict, and if the drafted tokens pass the acceptance criteria, we keep them and move on to the next chunk. If any token gets rejected, we resample from a blend of the two distributions and restart the process.
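
Here's a small sketch of the acceptance test both papers describe, for a single drafted token. p_target and p_draft are the two models' probability distributions at the same position:

import numpy as np

rng = np.random.default_rng(0)

def accept_or_resample(draft_token, p_target, p_draft):
    # accept the drafted token with probability min(1, p_target / p_draft);
    # otherwise resample from the leftover distribution max(0, p_target - p_draft)
    if rng.random() < min(1.0, p_target[draft_token] / p_draft[draft_token]):
        return draft_token, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual = residual / residual.sum()
    return rng.choice(len(residual), p=residual), False

p_target = np.array([0.6, 0.3, 0.1])
p_draft  = np.array([0.3, 0.6, 0.1])
print(accept_or_resample(draft_token=1, p_target=p_target, p_draft=p_draft))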

We get about 2-2.5x faster inference with this approach, and we don't need to change our model's architecture or do any extra training. Most importantly, we can pair this with any sampling technique, as long as both models use the same sampling method and parameters.

There are some more efficient implementations from Apple and LMSYS that show speculative decoding can work with a single model, eliminating the need for a separate draft model.

Logit Masking (logit bias)

Microsoft Guidance | OpenAI structured outputs | Instructor

There isn’t really any literature to go over, but logit masking is crucial if you want to be able to control your LM's output, especially for structured data.

After you get your logits over the vocabulary, you simply mask out invalid tokens by setting their logits to a value like -inf before the softmax. This gives you hard constraints on what tokens can be sampled, and you can enforce whatever structure you need.

Example: Only allow blue, red, green for the next token

Before Masking:
"        : 0.15 
the      : 0.12
red      : 0.10
,        : 0.08
a        : 0.07
blue     : 0.06
}        : 0.05
green    : 0.04
is       : 0.03
purple   : 0.02

After Masking:
"red"    : 0.50  (renormalized from 0.10)
"blue"   : 0.30  (renormalized from 0.06)
"green"  : 0.20  (renormalized from 0.04)
the      : 0.00  (logit masked to -inf)
,        : 0.00  (logit masked to -inf)
a        : 0.00  (logit masked to -inf)
}        : 0.00  (logit masked to -inf)
is       : 0.00  (logit masked to -inf)
purple   : 0.00  (logit masked to -inf)
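
And a minimal sketch of the masking step itself, with a toy vocabulary (the token ids here are made up for illustration):

import numpy as np

def mask_logits(logits, allowed_token_ids):
    # hard constraint: everything outside the allowed set gets -inf,
    # so softmax assigns it exactly zero probability
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]
    return masked

logits = np.array([2.0, 1.5, 1.0, 0.5])   # "red", "the", "blue", "green"
allowed = [0, 2, 3]                       # only the color tokens
masked = mask_logits(logits, allowed)
probs = np.exp(masked - masked[allowed].max())
probs = probs / probs.sum()               # "the" ends up with probability 0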

  1. Uses something known as the Gumbel-max trick to add Gumbel-distributed noise to the token logprobs. The Gumbel distribution models the maximum of a set of samples, so it's ideal for selecting the top-k sequences.

  2. For the diversity penalty, it's usually n-gram matching or vector similarity, but honestly there are a bunch of ways to compute a diversity penalty. Fairseq implements a cumulative diversity penalty, whereas the original paper uses Hamming diversity, which is implemented in Huggingface. You can interpolate between the two approaches and get pretty good results.

  3. Perplexity is related to information content, in that it's just the exponential of information content. Entropy, information content, and perplexity are all just measures of uncertainty in a model.

  4. For parameter adjustments, there are four guidelines in the Entropix sampler: 1. Adjust temperature based on logits entropy, attention entropy, and agreement. 2. Adjust top-p based on attention varentropy. 3. Adjust top-k based on interaction strength and attention agreement. 4. Adjust min-p based on the logits uncertainty (entropy + varentropy).

  5. To my knowledge, Google and DeepMind independently came up with the same idea at roughly the same time; their papers describe similar techniques and report the same speedups.