Did you think ChatGPT remembers everything you say to it? For long contexts (over 100k words or 150 images), it does not. If you input more than that (uploading a file the length of one or two scientific journal articles is enough), ChatGPT silently truncates your input to stay under the limit. This breaks the expectations of millions of users, who assume the model always has access to the entirety of their input.
Say a user uploads a book and asks the model to summarize it. If the platform silently truncates parts of the book before passing it to the model because of the context length limit, the model cannot summarize the book correctly and may leave out information crucial to the plot. Worse, the user has no way of knowing that the model never saw the entire text.
Because this is a core technical limitation of today’s state-of-the-art Transformer-based language model architecture, the situation is the same for every other commercial language model API currently available, including OpenAI’s GPT-4, Google’s Gemini, and Microsoft’s Copilot.
This limitation severely restricts how much information the model can draw on during text generation.
You might ask, “Who needs that much context length anyway?” To which we say, “Everyone!” Cheap, practically unlimited context length for LLMs opens up endless possibilities that were previously impossible with short-context models. For example, retrieval-augmented generation (RAG), which is already used to make generative AI models hallucinate less and provide more useful information, benefits massively from longer contexts. Many-shot in-context learning (ICL), a technique that adapts a model to a specific task by giving it many examples, finally becomes possible for multi-modal inputs such as images, videos, and a mix of both.
We introduce our state-of-the-art, training-free, sub-quadratic-cost HiP attention. We challenge the slow quadratic complexity of the attention mechanism with Hierarchically Pruned Attention (HiP), which exploits attention score locality to accelerate pre-trained Transformer models. Simply put, HiP removes redundant computation by cheaply and dynamically finding which parts of the long context are important, relying on a characteristic of natural language: nearby sentences share more similarity than distant ones. Using our LLM serving framework, every user can accelerate end-to-end throughput by at least 2x and decoding throughput by up to 5x compared to FlashAttention, with minimal performance degradation on long contexts.
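To give a concrete feel for the idea, here is a minimal, simplified sketch of hierarchical key selection in PyTorch. It is not our actual implementation (which runs as fused GPU kernels and operates on blocks of queries and keys); the function names, the single-head setting, and the use of the first key of each segment as its representative are all illustrative assumptions.

```python
# Minimal, simplified sketch of hierarchical top-k key selection.
# Not the real HiP kernels; single query, single head, for illustration only.
import torch

def hierarchical_topk_indices(q, k, top_k=64):
    """Select ~top_k key indices for query q by iteratively halving segments.

    q: (d,) query vector; k: (T, d) key matrix.
    Returns a LongTensor of selected key indices, sorted.
    """
    T = k.shape[0]
    if T <= top_k:
        return torch.arange(T)

    # Start with top_k equally sized segments covering the whole context.
    bounds = torch.linspace(0, T, top_k + 1).long()
    starts, ends = bounds[:-1], bounds[1:]

    # Iteratively split each surviving segment in half and keep the top_k
    # halves whose representative key scores highest against the query.
    while (ends - starts).max() > 1:
        mids = (starts + ends) // 2
        cand_starts = torch.cat([starts, mids])
        cand_ends = torch.cat([mids, ends])
        valid = cand_ends > cand_starts
        cand_starts, cand_ends = cand_starts[valid], cand_ends[valid]

        scores = k[cand_starts] @ q                      # representative scores
        keep = scores.topk(min(top_k, len(scores))).indices
        starts, ends = cand_starts[keep], cand_ends[keep]

    return starts.sort().values

def sparse_attention(q, k, v, top_k=64):
    # Attend only over the selected keys instead of all T keys.
    idx = hierarchical_topk_indices(q, k, top_k)
    att = torch.softmax(k[idx] @ q / k.shape[-1] ** 0.5, dim=0)
    return att @ v[idx]
```

Because each refinement step only scores a constant number of representative keys per query, the cost of finding the important context grows far slower than the quadratic cost of scoring every key.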
Accelerating inference is only half of the story. Existing language models only produce good responses up to the context length of the data they were trained on, which is around 100k words for ChatGPT and around 1 million for Gemini Pro.
This is why we not only accelerate inference but also extend the usable context length by extending the positional embeddings with the rolling RoPE indexing first proposed in StreamingLLM. Unlike StreamingLLM, we apply rolling RoPE indexing after dynamically selecting the relevant context representations. The usable context length is therefore extended without discarding any part of the past context, unlike StreamingLLM or KV eviction policies.
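The sketch below shows how rolling RoPE indexing can be applied after key selection: the selected keys receive consecutive position indices and the query is placed right after them, so the relative distances seen by the model never exceed the range it was trained on. The function names and the interleaved RoPE formulation are illustrative assumptions, not our framework’s actual API.

```python
# Sketch of rolling RoPE indexing applied to the *selected* keys only.
import torch

def rope(x, positions, base=10000.0):
    """Standard rotary position embedding for a (N, d) tensor, d even."""
    d = x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2).float() / d)
    angles = positions.float()[:, None] * inv_freq[None, :]   # (N, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attend_with_rolling_rope(q, k_selected, v_selected):
    """q: (d,) current query; k/v_selected: (n, d) keys/values chosen by HiP.

    Instead of the keys' original (possibly huge) positions, assign them
    consecutive positions 0..n-1 and place the query at position n, so the
    relative distances stay within the trained RoPE range.
    """
    n = k_selected.shape[0]
    k_rot = rope(k_selected, torch.arange(n))
    q_rot = rope(q[None, :], torch.tensor([n]))[0]
    att = torch.softmax(k_rot @ q_rot / q.shape[-1] ** 0.5, dim=0)
    return att @ v_selected
```

Because the keys themselves are never thrown away, any token from the distant past can still be selected and attended to later, only its position index is remapped.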
Extending the context to millions of words is not without downsides: more memory is needed to store the intermediate results (the KV cache) for each inference. We get around this problem with our efficient KV cache offloading scheme.
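A back-of-the-envelope calculation shows why this matters. Assuming a hypothetical 8B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, fp16), the KV cache costs about 128 KiB per token, so a 512k-token context needs roughly 64 GiB of cache on top of the model weights. The sketch below illustrates the general idea of keeping the full cache in host memory and copying only the rows selected by the sparse attention mask to the GPU; the class and method names are illustrative, not our framework’s real API.

```python
# Sketch of block-wise KV cache offloading (illustrative layout, not the real API).
import torch

# K+V * layers * KV heads * head dim * 2 bytes (fp16), for the assumed model above.
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2
print(BYTES_PER_TOKEN * 512 * 1024 / 2**30)   # -> 64.0 GiB for a 512k-token cache

class OffloadedKVCache:
    """Keep the full KV cache in pinned host memory; copy only the rows
    selected by the sparse attention mask to the GPU on demand."""

    def __init__(self, max_tokens, dim, device="cuda"):
        self.device = device
        self.k = torch.empty(max_tokens, dim, dtype=torch.float16, pin_memory=True)
        self.v = torch.empty(max_tokens, dim, dtype=torch.float16, pin_memory=True)
        self.length = 0

    def append(self, k_new, v_new):
        # New tokens are written back to host memory as they are generated.
        n = k_new.shape[0]
        self.k[self.length:self.length + n] = k_new.cpu()
        self.v[self.length:self.length + n] = v_new.cpu()
        self.length += n

    def fetch(self, token_indices):
        # Gather and copy only the selected rows to the GPU.
        idx = token_indices.cpu()
        k = self.k[idx].to(self.device, non_blocking=True)
        v = self.v[idx].to(self.device, non_blocking=True)
        return k, v
```

Since HiP only ever attends to a small selected subset of the context per step, only that subset has to live on the GPU at any given moment.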
This allows us to extend the serving context length to an unprecedented 512k tokens on a single 80GB GPU. We are currently integrating it with our language model serving framework. Our HiP attention and the DeepAuto.ai Model Acceleration framework will soon make it possible to serve extremely long contexts on much cheaper GPUs such as the NVIDIA L40.
All members of our research team are natural-born frontier researchers from KAIST. We are excited not only to build novel end-to-end LLM infrastructure but also to contribute our research results to the community by open-sourcing our core algorithm and framework. As an AI tech venture, we believe that communicating with the research and developer community by sharing our codebase is extremely important, because it lets us test our algorithm and framework in more extreme and varied settings. We value any and all feedback from users and researchers, so feel free to open issues and pull requests on our GitHub!