Expected Behavior
The JTokkitTokenCountEstimator should be improved to more accurately handle binary data (byte[]) within multimodal prompts. Currently, it simply adds the byte array's length to the token count, but most large language models (LLMs) process binary data by tokenizing a Base64-encoded string representation of it. Therefore, the estimation logic should reflect this process.
Current Behavior
Currently, the JTokkitTokenCountEstimator processes binary data (byte[]) in multimodal prompts by adding binaryData.length directly to the token count. As the code's own comment (// This is likely incorrect.) acknowledges, this approach is inaccurate.
Most LLMs tokenize binary data after it has been encoded as a Base64 string. Base64 encoding increases the original data's size by approximately 33% (every 3 input bytes become 4 output characters), so the current logic significantly underestimates the number of tokens actually consumed by the model. This leads to errors when estimating API usage and costs.
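As a quick sanity check of the ~33% figure, this standalone snippet (an illustration only, not part of the estimator) shows the size expansion Base64 introduces:

```java
import java.util.Base64;

public class Base64OverheadDemo {
    public static void main(String[] args) {
        // 3,000 raw bytes, standing in for a small image payload
        byte[] binaryData = new byte[3000];

        String base64Data = Base64.getEncoder().encodeToString(binaryData);

        // Base64 maps every 3 input bytes to 4 output characters
        System.out.println("raw bytes:    " + binaryData.length);   // 3000
        System.out.println("base64 chars: " + base64Data.length()); // 4000
    }
}
```

With the proposed change, it is this longer Base64 string, not the raw byte count, that would be fed into the tokenizer.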
Proposed Code Example
```java
// After the proposed improvement (requires java.util.Base64)
else if (media.getData() instanceof byte[] binaryData) {
    // Encode the byte array as Base64 and estimate tokens on that string,
    // since this is the representation the model actually tokenizes
    String base64Data = Base64.getEncoder().encodeToString(binaryData);
    tokenCount += this.estimate(base64Data);
}
```
Context
Problem: The current JTokkitTokenCountEstimator underestimates the token count for binary data like image files, resulting in a large discrepancy between estimated and actual API costs.
Objective: The goal is to enhance the JTokkitTokenCountEstimator to provide a more accurate token estimate for binary data, which would help developers manage API costs more reliably.