DeepSeek, based in China, is innovating again.
In a fascinating paper posted on GitHub, DeepSeek-OCR: Contexts Optical Compression, DeepSeek introduces a new method of “compressing long contexts via optical 2D mapping.” Instead of text tokens, the model uses vision tokens derived from an image of the text. Basically, it's like taking a screenshot of a page of text.
As Rohan Paul tweeted: “Instead of feeding an LLM thousands of text tokens, it turns long text into an image, encodes that image into a small set of vision tokens, then lets a decoder reconstruct the text.”
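To see why this can save tokens, here is a minimal back-of-the-envelope sketch of the arithmetic. All numbers (4 chars per text token, a ViT-style 16x16 patch size, a 256x256 render) are illustrative assumptions for this sketch, not DeepSeek-OCR's actual configuration:

```python
# Hedged sketch: the token arithmetic behind "optical compression".
# Assumed numbers, not DeepSeek-OCR's real settings.

def text_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Rough text-token estimate (~4 chars/token for English BPE)."""
    return max(1, round(len(text) / chars_per_token))

def vision_token_count(img_w: int, img_h: int, patch: int = 16) -> int:
    """A patch-based vision encoder emits one token per patch of the rendered page."""
    return (img_w // patch) * (img_h // patch)

page = "word " * 600  # ~600 words, roughly one dense page of text
print(text_token_count(page))        # ~750 text tokens
print(vision_token_count(256, 256))  # 256 vision tokens for a 256x256 render
```

Under these toy assumptions, a page that costs ~750 text tokens fits in 256 vision tokens, roughly a 3x reduction; the paper's decoder then reconstructs the original text from those vision tokens.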
Excerpt
Comments from Rohan Paul: