Vision-Encoded Text Compression: Achieving 8× Token Savings Without Specialized Models
By Hani Al-Shater
TL;DR: DeepSeek-OCR's performance is revolutionary - but you can achieve a similar result, 8× text compression (87% token savings) at 99.5% accuracy, using standard vision LLMs like Gemini 2.5.
Don't believe me? Just copy this image and paste it into Gemini with the prompt "read this text and write it down":
This tiny image contains 1,287 characters that Gemini will read at 99.5% accuracy - using only 40 tokens instead of 321.
The Accidental Discovery
A few months ago, I was testing LLM performance on a summarization task. I wanted to use both images and text in my prompts, so naturally, I started sending screenshots. It worked surprisingly well.
Then I got curious: what if I made the screenshots smaller? I started compressing them - smaller, smaller, and smaller. Every time, Gemini kept reading the text perfectly. That's when it hit me:
Reading images containing text was 6× more token-efficient than sending the actual text!
That made intuitive sense. Nature doesn't have tokens - we just use our eyes to read. Vision is the natural interface for information.
I got excited and proposed fine-tuning LLMs on low-resolution text to cut inference costs. But, like many ideas, it ran into skepticism about accuracy. "The OCR will be terrible," they said. "You'll lose too much information." The idea went on the shelf, forgotten.
Vindication
Today, I saw DeepSeek-OCR and nearly jumped out of my chair. They proved the idea - a specialized model that compresses text into vision tokens, achieving 7-20× compression. My intuition was right!
DeepSeek-OCR: A custom-trained 3B vision-language model that encodes text images into compact vision tokens, then decodes them back. It achieves 97% accuracy at 10× compression with specialized training on 30M+ documents.
But here's the punchline: You don't need DeepSeek-OCR.
Regular vision models like Gemini 2.5 can already do this. You just need to:
- Render your text into an image
- Downscale it properly
- Send it to any vision API
Poof - you've got an 8× cost cut for your favorite task.
Here's how we did it.
The Results
We ran systematic ablation tests on different compression levels:
| Configuration | Tokens | Compression | Accuracy | Status |
|---|---|---|---|---|
| Plain text | 321 | 1× | 100% | Baseline |
| 8px Y=1.0 | 68 | 4.7× | 91.1% | ✗ Poor |
| 8px Y=0.9 | 60 | 5.4× | 87.8% | ✗ Poor |
| 8px Y=0.8 | 54 | 6× | 99.4% | ✓ Good |
| 8px Y=0.7 | 46 | 7× | 55.0% | ✗ Failed |
| 8px Y=0.6 | 40 | 8× | 99.5% | ✓ Best |
Winner: 8px Verdana scaled vertically to 60% of its original height (Y-scale 0.6)
Performance:
- 40 tokens vs 321 plain text tokens
- 8× compression (87% token savings)
- 99.5% accuracy (nearly perfect)
What 8× Compression Looks Like
Here's the actual compressed image that Gemini reads at 99.5% accuracy:
This image contains 1,287 characters (a full MacBook Pro review) in just 40 tokens.
The only transcription error: "500K" → "500k" (capitalization).
How It Works
The rendering pipeline is dead simple - just HTML, Playwright, and Pillow:

from playwright.async_api import async_playwright
from PIL import Image

async def render_compressed_text(text: str, output_path: str):
    # 1. Generate HTML with an ultra-small font
    html = f"""
    <html><body style="width:800px;font-family:Verdana;font-size:8px;line-height:8px;margin:0">
    {text}
    </body></html>
    """
    # 2. Screenshot with Playwright (800px viewport + full_page so the image
    #    matches the rendered text rather than a default 1280x720 window)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page(viewport={"width": 800, "height": 200})
        await page.set_content(html)
        await page.screenshot(path='temp.png', full_page=True)
        await browser.close()
    # 3. Vertical compression (Y-scale 0.6)
    img = Image.open('temp.png')
    new_height = int(img.height * 0.6)
    img = img.resize((img.width, new_height), Image.Resampling.LANCZOS)
    img.save(output_path)
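To run the pipeline end to end, invoke the coroutine with asyncio. A minimal usage sketch - the sample text and output file name below are just placeholders:

```python
import asyncio

# Hypothetical usage: render a long prompt into a compressed PNG,
# then attach that PNG to your vision API call instead of the raw text.
long_text = "The new MacBook Pro with M3 chip represents a significant upgrade..."
asyncio.run(render_compressed_text(long_text, "compressed_prompt.png"))
```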
That's it. No custom models, no training, no infrastructure.
DeepSeek-OCR vs Our Approach
| Metric | DeepSeek-OCR | Our Approach |
|---|---|---|
| Compression | 7-20× | 8× |
| Accuracy (@8×) | ~95% | 99.5% |
| Model | Custom 3B VLM | Standard Gemini |
| Training | 30M+ documents | None |
| Deployment | Self-hosted GPU | API call |
| Setup Time | Days | Minutes |
We match their compression with better accuracy using zero infrastructure.
The Quality Cliff
Compression isn't linear - there's a sharp quality cliff:
- Y-scale 0.8 → 99.4% accuracy ✓
- Y-scale 0.7 → 55.0% accuracy ✗ (OCR degradation)
- Y-scale 0.6 → 99.5% accuracy ✓ (sweet spot!)
At 0.7, letter shapes degrade just enough to break Gemini's OCR; at 0.6, readability is back to near-perfect.
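You can reproduce the cliff from a single uncompressed screenshot by sweeping the Y-scale and re-transcribing each output. A minimal sketch, assuming `temp.png` is the uncompressed 8px render from the pipeline above (the output file names are made up):

```python
from PIL import Image

# Regenerate one image per Y-scale from the ablation table; paste each one
# into the vision model and compare its transcription against the source text.
src = Image.open("temp.png")  # uncompressed 8px-font screenshot
for y_scale in (1.0, 0.9, 0.8, 0.7, 0.6):
    new_height = max(1, int(src.height * y_scale))
    scaled = src.resize((src.width, new_height), Image.Resampling.LANCZOS)
    scaled.save(f"ablation_y{y_scale:.1f}.png")
```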
What Doesn't Work
We tried other compression schemes:
Binary Grids: Convert to 8-bit binary, render as pixels. Could achieve 23× compression, but vision models can't decode arbitrary binary patterns without training. Failed.
Morse Code: Encoding the text as dots and dashes still costs 192 tokens (versus 321 for plain text and 40 for the compressed image) because of the spacing it requires. Failed.
RGB Channel Splitting: Encode different text in R/G/B channels. Gemini interprets it as chromatic aberration. Failed.
Arabic Script: We tested Arabic text compression and it failed. Complex scripts with connected letters and diacritics require larger fonts and don't achieve the same compression ratios. This method works best for Latin-script languages.
Lesson: Simple text rendering beats clever encoding with off-the-shelf models.
Claude vs Gemini: Vision Gap
| Model | Min Readable Font | Best Compression | Accuracy |
|---|---|---|---|
| Claude 4.5 | 14px | 261 tokens (19% savings) | 99.8% |
| Gemini 2.5 | 8px | 40 tokens (87% savings) | 99.5% |
Gemini's vision model is far better at reading tiny text: it stays accurate at 8px, where Claude hallucinates, and reaches 87% token savings versus Claude's 19%.
Methodology Notes
Token Calculation: Image tokens are estimated with the vision API approximation tokens ≈ (width × height) / 750. For the 8px Y=0.6 configuration, the compressed image is roughly 800×30px (about 24,000 pixels), which lands in the same ballpark as the 40 tokens reported above.
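That estimate is easy to wrap in a helper if you want to predict costs before sending an image. A small sketch - the /750 divisor is the approximation quoted above, not an official billing rule, and `compressed_prompt.png` is the hypothetical file from the earlier usage example:

```python
from PIL import Image

def estimate_image_tokens(image_path: str, pixels_per_token: float = 750.0) -> int:
    """Rough vision-token estimate: width * height / pixels_per_token."""
    width, height = Image.open(image_path).size
    return round(width * height / pixels_per_token)

# An ~800x30px compressed image lands at a few dozen tokens, versus ~321
# tokens for the same 1,287 characters sent as plain text.
print(estimate_image_tokens("compressed_prompt.png"))
```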
Disclaimer: These are small-scale experiments. Take results with a grain of salt. Test thoroughly on your specific use case before production deployment. Different text types (code, poetry, tables) may have different accuracy profiles.
Why Does This Matter?
1. We're Just Getting Started
This is just the beginning. With more efficient fonts and encoding schemes, we could potentially reach 20× compression or more, and RGB channel multiplexing could push things even further - though, as the failed experiments above suggest, that would likely require model-side training.
I wonder when we'll see LLMs trained without tokenized inputs at all - just pure image inputs. The fact that vision models can already read compressed text this well suggests we might not need text tokenization in the future.
2. Hierarchical Context Windows: The Future of RAG
The most exciting implication isn't the compression itself - it's what it reveals about pooling and hierarchical processing.
Imagine future LLM architectures with hierarchical context windows: lower layers process raw images with hundreds of millions of tokens, while smarter upper layers work with compressed, abstract representations using far fewer tokens. If this works, it could revolutionize RAG systems.
Let me call it VET-RAG (Vision-Encoded Text Retrieval-Augmented Generation) - you heard it here first! Instead of chunking and embedding text, we could store entire documents as compressed images, let the vision layers do the heavy lifting, and only elevate the relevant compressed representations to the reasoning layers.
3. You Can Use This Today
Most importantly: this isn't theoretical. You can implement this right now:
- Send your long prompts as compressed images
- Save 8× on token costs immediately
- Thank me later!
No custom models, no infrastructure, no training. Just render, compress, and send.
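For example, here is one way to send a compressed image to Gemini through OpenRouter's OpenAI-compatible endpoint. This is a sketch: the model slug, environment variable, and file name are assumptions to adapt to your own setup.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed env var
)

# Embed the compressed screenshot as a base64 data URL.
with open("compressed_prompt.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="google/gemini-2.5-flash",  # assumed OpenRouter model slug
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Read this text and write it down."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```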
Appendix: Additional Examples
Example 1: Quantum Computing News (Y=0.6)
Input Text (1,442 characters):
Scientists announced a major breakthrough in quantum computing this week, revealing a new error correction technique that could accelerate the development of practical quantum computers. The research team from MIT demonstrated that their novel approach reduces quantum bit errors by up to 90 percent compared to traditional methods...
Gemini Transcription (99.6% accuracy):
Scientists announced a major breakthrough in quantum computing this week, revealing a new error correction technique that could accelerate the development of practical quantum computers. The research team from MIT demonstrated that their novel approach reduces quantum bit errors by up to 90 percent compared to traditional methods...
Errors: Minor differences in comma placement and "for" vs "to" (1 word change).
Tokens: 53 vs ~360 plain text = 85% savings
Example 2: MacBook Pro Review (Y=0.6)
Input Text (1,287 characters):
The new MacBook Pro with M3 chip represents a significant upgrade in laptop computing. After extensive testing across video editing, software development, and machine learning workflows, the performance improvements are remarkable...
Gemini Transcription (99.5% accuracy):
The new MacBook Pro with M3 chip represents a significant upgrade in laptop computing. After extensive testing across video editing, software development, and machine learning workflows, the performance improvements are remarkable...
Errors: "500K" → "500k" (capitalization only)
Tokens: 40 vs ~321 plain text = 87% savings
Published: October 2025. Research conducted using Gemini 2.5 Flash via the OpenRouter API.