Vision-Encoded Text Compression: Achieving 8× Token Savings Without Specialized Models

By Hani Al-Shater

TL;DR: DeepSeek-OCR's performance is revolutionary, but you can get a similar result, 8× text compression (87% token savings) at 99.5% accuracy, with a standard vision LLM like Gemini 2.5.

Don't believe me? Just copy this image and paste it into Gemini with the prompt "read this text and write it down":

Try it yourself

This tiny image contains 1,287 characters that Gemini will read at 99.5% accuracy - using only 40 tokens instead of 321.

The Accidental Discovery

A few months ago, I was testing LLM performance on a summarization task. I wanted to use both images and text in my prompts, so naturally, I started sending screenshots. It worked surprisingly well.

Then I got curious: what if I made the screenshots smaller? I started compressing them - smaller, smaller, and smaller. Every time, Gemini kept reading the text perfectly. That's when it hit me:

Reading images containing text was 6× more token-efficient than sending the actual text!

That made intuitive sense. Nature doesn't have tokens - we just use our eyes to read. Vision is the natural interface for information.

I got excited and proposed fine-tuning LLMs on low-resolution text to cut inference costs. But like many things in life, I got discouraged by skepticism about accuracy. "The OCR will be terrible," they said. "You'll lose too much information." The idea went on the shelf, forgotten.

Vindication

Today, I saw DeepSeek-OCR and nearly jumped out of my chair. They proved the idea - a specialized model that compresses text into vision tokens, achieving 7-20× compression. My intuition was right!

DeepSeek-OCR: A custom-trained 3B vision-language model that encodes text images into compact vision tokens, then decodes them back. It achieves 97% accuracy at 10× compression with specialized training on 30M+ documents.

But here's the punchline: You don't need DeepSeek-OCR.

Regular vision models like Gemini 2.5 can already do this. You just need to:

  1. Render your text into an image
  2. Downscale it properly
  3. Send it to any vision API

Poof - you've got an 8× cost cut for your favorite task.

Here's how we did it.

The Results

We ran systematic ablation tests on different compression levels:

| Configuration | Tokens | Compression | Accuracy | Status |
|---------------|--------|-------------|----------|--------|
| Plain text    | 321    | 1×          | 100%     | Baseline |
| 8px Y=1.0     | 68     | 4.7×        | 91.1%    | ✗ Poor |
| 8px Y=0.9     | 60     | 5.4×        | 87.8%    | ✗ Poor |
| 8px Y=0.8     | 54     | 5.9×        | 99.4%    | ✓ Good |
| 8px Y=0.7     | 46     | 7.0×        | 55.0%    | ✗ Failed |
| 8px Y=0.6     | 40     | 8.0×        | 99.5%    | ✓ Best |

Winner: 8px Verdana, vertically scaled to 60% of its original height (Y-scale 0.6)

Performance:

  • 40 tokens vs 321 plain text tokens
  • 8× compression (87% token savings)
  • 99.5% accuracy (nearly perfect)

What 8× Compression Looks Like

Here's the actual compressed image that Gemini reads at 99.5% accuracy:

8× Compressed Text

This image contains 1,287 characters (a full MacBook Pro review) in just 40 tokens.

The only transcription error: "500K" → "500k" (capitalization).

How It Works

The rendering pipeline is dead simple - just HTML and Playwright:

from playwright.async_api import async_playwright
from PIL import Image

async def render_compressed_text(text: str, output_path: str):
    # 1. Generate HTML with an ultra-small font (8px Verdana, 8px line height, no margins)
    html = f"""
    <html><body style="width:800px;font-family:Verdana;font-size:8px;line-height:8px;margin:0">
        {text}
    </body></html>
    """

    # 2. Screenshot just the 800px-wide <body> with Playwright
    #    (an element screenshot avoids capturing the blank rest of the viewport)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.set_content(html)
        await page.locator('body').screenshot(path='temp.png')
        await browser.close()

    # 3. Vertical compression: squeeze the image to 60% of its height (Y-scale 0.6)
    img = Image.open('temp.png')
    new_height = int(img.height * 0.6)
    img = img.resize((img.width, new_height), Image.Resampling.LANCZOS)
    img.save(output_path)

That's it. No custom models, no training, no infrastructure.
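For the final step, you just attach the compressed image to a normal vision request. Here's a minimal sketch of the full loop using OpenRouter's OpenAI-compatible endpoint; the model slug, file names, and prompt are illustrative assumptions, not the exact setup used in our tests:

import asyncio
import base64

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

async def main():
    # Render and compress the long prompt (see render_compressed_text above)
    text = open('prompt.txt').read()
    await render_compressed_text(text, 'compressed.png')

    # Attach the compressed image as a base64 data URL
    b64 = base64.b64encode(open('compressed.png', 'rb').read()).decode()
    client = OpenAI(base_url='https://openrouter.ai/api/v1', api_key='YOUR_OPENROUTER_KEY')

    response = client.chat.completions.create(
        model='google/gemini-2.5-flash',  # assumed slug - check OpenRouter's model list
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'read this text and write it down'},
                {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
            ],
        }],
    )
    print(response.choices[0].message.content)

asyncio.run(main())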

DeepSeek-OCR vs Our Approach

| Metric | DeepSeek-OCR | Our Approach |
|--------|--------------|--------------|
| Compression | 7-20× | 8× |
| Accuracy (@8×) | ~95% | 99.5% |
| Model | Custom 3B VLM | Standard Gemini |
| Training | 30M+ documents | None |
| Deployment | Self-hosted GPU | API call |
| Setup Time | Days | Minutes |

We match their compression with better accuracy using zero infrastructure.

The Quality Cliff

Accuracy doesn't fall off smoothly as you compress - there's a sharp, non-monotonic quality cliff:

  • Y-scale 0.8 → 99.4% accuracy ✓
  • Y-scale 0.7 → 55.0% accuracy ✗ (OCR degradation)
  • Y-scale 0.6 → 99.5% accuracy ✓ (sweet spot!)

At 0.7, letter shapes degrade just enough to break Gemini's OCR. At 0.6, perfect readability.

What Doesn't Work

We tried other compression schemes:

Binary Grids: Convert to 8-bit binary, render as pixels. Could achieve 23× compression, but vision models can't decode arbitrary binary patterns without training. Failed.

Morse Code: Dots and dashes still cost 192 tokens (vs 321 for plain text) due to spacing requirements - far more than the 40 tokens of the image approach. Failed.

RGB Channel Splitting: Encode different text in R/G/B channels. Gemini interprets it as chromatic aberration. Failed.

Arabic Script: We tested Arabic text compression and it failed. Complex scripts with connected letters and diacritics require larger fonts and don't achieve the same compression ratios. This method works best for Latin-script languages.

Lesson: Simple text rendering beats clever encoding with off-the-shelf models.

Claude vs Gemini: Vision Gap

| Model | Min Readable Font | Best Compression | Accuracy |
|-------|-------------------|------------------|----------|
| Claude 4.5 | 14px | 261 tokens (19% savings) | 99.8% |
| Gemini 2.5 | 8px | 40 tokens (87% savings) | 99.5% |

Gemini's vision model delivers roughly 4× the token savings on tiny text (87% vs 19%); Claude starts hallucinating at 8px.

Methodology Notes

Token Calculation: Image token counts are estimated with the vision-API rule of thumb tokens = (width × height) / 750. For the 8px Y=0.6 configuration, the final image works out to roughly 30,000 pixels (800 px wide), giving 30,000 / 750 = 40 tokens.
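As a quick sanity check, here's a tiny helper (a hypothetical estimate_image_tokens, not part of our pipeline) that applies the same rule of thumb:

import math

def estimate_image_tokens(width_px: int, height_px: int) -> int:
    # Rule-of-thumb estimate: roughly one vision token per 750 pixels
    return math.ceil(width_px * height_px / 750)

# The 800px-wide compressed review at ~37px tall lands at ~40 tokens,
# versus 321 tokens for the plain text.
print(estimate_image_tokens(800, 37))  # -> 40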

Disclaimer: These are small-scale experiments. Take results with a grain of salt. Test thoroughly on your specific use case before production deployment. Different text types (code, poetry, tables) may have different accuracy profiles.


Why Does This Matter?

1. We're Just Getting Started

This is just the beginning. With more efficient fonts and encoding schemes, we could potentially achieve 20× compression or more. RGB channel multiplexing could push this even further, though as noted above it doesn't work with today's off-the-shelf models.

I wonder when we'll see LLMs trained without tokenized inputs at all - just pure image inputs. The fact that vision models can already read compressed text this well suggests we might not need text tokenization in the future.

2. Hierarchical Context Windows: The Future of RAG

The most exciting implication isn't the compression itself - it's what it reveals about pooling and hierarchical processing.

Imagine future LLM architectures with hierarchical context windows: lower layers process raw images with hundreds of millions of tokens, while smarter upper layers work with compressed, abstract representations using far fewer tokens. If this works, it could revolutionize RAG systems.

Let me call it VET-RAG (Vision-Encoded Text Retrieval-Augmented Generation) - you heard it here first! Instead of chunking and embedding text, we could store entire documents as compressed images, let the vision layers do the heavy lifting, and only elevate the relevant compressed representations to the reasoning layers.

3. You Can Use This Today

Most importantly: this isn't theoretical. You can implement this right now:

  • Send your long prompts as compressed images
  • Save 8× on token costs immediately
  • Thank me later!

No custom models, no infrastructure, no training. Just render, compress, and send.

Appendix: Additional Examples

Example 1: Quantum Computing News (Y=0.6)

Quantum News Compressed

Input Text (1,442 characters):

Scientists announced a major breakthrough in quantum computing this week, revealing a new error correction technique that could accelerate the development of practical quantum computers. The research team from MIT demonstrated that their novel approach reduces quantum bit errors by up to 90 percent compared to traditional methods...

Gemini Transcription (99.6% accuracy):

Scientists announced a major breakthrough in quantum computing this week, revealing a new error correction technique that could accelerate the development of practical quantum computers. The research team from MIT demonstrated that their novel approach reduces quantum bit errors by up to 90 percent compared to traditional methods...

Errors: Minor differences in comma placement and "for" vs "to" (1 word change).

Tokens: 53 vs ~360 plain text = 85% savings


Example 2: MacBook Pro Review (Y=0.6)

Input Text (1,287 characters):

The new MacBook Pro with M3 chip represents a significant upgrade in laptop computing. After extensive testing across video editing, software development, and machine learning workflows, the performance improvements are remarkable...

Gemini Transcription (99.5% accuracy):

The new MacBook Pro with M3 chip represents a significant upgrade in laptop computing. After extensive testing across video editing, software development, and machine learning workflows, the performance improvements are remarkable...

Errors: "500K" → "500k" (capitalization only)

Tokens: 40 vs ~321 plain text = 87% savings


Published: October 2025
Research conducted using Gemini 2.5 Flash via the OpenRouter API