LLaMa.CPP

Tokenize, Detokenize

Llama.cpp provides tokenization and detokenization APIs to convert between text and tokens. These operations are essential for:

  • Inspecting token counts (useful for cost estimation, token limits).
  • Debugging generation issues (e.g., understanding how input text is split into tokens).
  • Rebuilding text from tokens.

Tokenize

The tokenize method splits a given prompt into an array of tokens (integers), representing the internal model encoding.

Code

use Partitech\PhpMistral\Clients\LlamaCpp\LlamaCppClient;

$llamacppUrl = getenv('LLAMACPP_URL');
$llamacppApiKey = getenv('LLAMACPP_API_KEY');

$client = new LlamaCppClient(apiKey: $llamacppApiKey, url: $llamacppUrl);

try {
    $tokens = $client->tokenize(prompt: 'What are the ingredients that make up dijon mayonnaise?');
} catch (\Throwable $e) {
    echo $e->getMessage();
    exit(1);
}

print_r($tokens->getTokens());  // Array of token IDs
echo "count: " . $tokens->getTokens()->count() . PHP_EOL;  // Total number of tokens

Detokenize

The detokenize method reconstructs the original text (or close to it) from an array of tokens. This is useful for reversing tokenization or inspecting intermediate token sequences.

Code

use Partitech\PhpMistral\Clients\LlamaCpp\LlamaCppClient;

$llamacppUrl = getenv('LLAMACPP_URL');
$llamacppApiKey = getenv('LLAMACPP_API_KEY');

$client = new LlamaCppClient(apiKey: $llamacppApiKey, url: $llamacppUrl);

try {
    $result = $client->detokenize(tokens: $tokens->getTokens()->getArrayCopy());
} catch (\Throwable $e) {
    echo $e->getMessage();
    exit(1);
}

echo $result->getPrompt() . PHP_EOL;  // Reconstructed text

Example Flow

  1. Tokenize Input:
Prompt: What are the ingredients that make up dijon mayonnaise?
Tokens: [29871, 590, 278, 12058, 349, 366, 709, 8113, 14760, 837, 30]
Count: 11
  1. Detokenize Tokens:
Text: What are the ingredients that make up dijon mayonnaise?

Use Cases

  • Cost estimation: Check token counts before sending prompts to avoid exceeding token limits.
  • Model debugging: Understand how inputs are tokenized, which affects model behavior.
  • Text reconstruction: Rebuild readable text from generated or intermediate tokens.