Expected Behavior

The JTokkitTokenCountEstimator should estimate tokens for binary data (byte[]) in multimodal prompts more accurately. Instead of adding the byte array's length directly to the token count, the estimation logic should mirror how most large language models (LLMs) actually receive binary data: as a Base64-encoded string that is then tokenized.

Current Behavior

Currently, the JTokkitTokenCountEstimator handles binary data (byte[]) in multimodal prompts by adding binaryData.length directly to the token count. The code's own comment (// This is likely incorrect.) already acknowledges that this approach is inaccurate.

Most LLMs tokenize binary data after it has been encoded into a Base64 string. Base64 encoding increases the original data's size by approximately 33% (every 3 bytes become 4 characters), so the current logic significantly underestimates the number of tokens the model actually consumes. This leads to errors in estimating API usage and costs.
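The roughly 33% overhead follows directly from how Base64 works: every 3 input bytes map to 4 output characters. A minimal, self-contained sketch (not part of the estimator itself) illustrates the size difference:

```java
import java.util.Base64;

public class Base64OverheadDemo {
    public static void main(String[] args) {
        // 300 raw bytes encode to exactly 400 Base64 characters (4/3 ratio, no padding needed)
        byte[] binaryData = new byte[300];
        String base64Data = Base64.getEncoder().encodeToString(binaryData);
        System.out.println(binaryData.length);   // 300
        System.out.println(base64Data.length()); // 400
    }
}
```

Because the Base64 string is what the model tokenizes, any estimate based on the raw byte count ignores this expansion.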

Proposed Code Example

// After the proposed improvement (requires java.util.Base64)
else if (media.getData() instanceof byte[] binaryData) {
    // Encode the byte array as Base64 before estimating,
    // matching how LLMs actually receive binary data
    String base64Data = Base64.getEncoder().encodeToString(binaryData);
    tokenCount += this.estimate(base64Data);
}
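For context, the branch above would sit inside the estimator's media-handling loop. The following is an illustrative sketch only: the class name, method names, and the per-token heuristic are assumptions for demonstration, not the actual Spring AI source (the real implementation delegates to JTokkit's encoding):

```java
import java.util.Base64;
import java.util.List;

// Illustrative sketch: structure approximates the estimator's media handling.
class TokenEstimatorSketch {

    // Placeholder for JTokkit-backed estimation; real counts depend on the
    // configured encoding. A common rough heuristic is ~4 characters per token.
    int estimate(String text) {
        return Math.max(1, text.length() / 4);
    }

    int estimateMedia(List<?> mediaData) {
        int tokenCount = 0;
        for (Object data : mediaData) {
            if (data instanceof String textData) {
                tokenCount += this.estimate(textData);
            }
            else if (data instanceof byte[] binaryData) {
                // Base64-encode first, so the estimate reflects the string
                // representation that the model actually tokenizes
                String base64Data = Base64.getEncoder().encodeToString(binaryData);
                tokenCount += this.estimate(base64Data);
            }
        }
        return tokenCount;
    }
}
```

With this sketch, a 300-byte payload is estimated from its 400-character Base64 form rather than from its raw byte length.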

Context

Problem: The current JTokkitTokenCountEstimator underestimates the token count for binary data like image files, resulting in a large discrepancy between estimated and actual API costs.

Objective: The goal is to enhance the JTokkitTokenCountEstimator to provide a more accurate token estimate for binary data, which would help developers manage API costs more reliably.