New research introduces Latent Cache Flow (LCF), a method that lets LLM agents communicate efficiently by transferring KV caches directly. Compared with existing methods, LCF significantly reduces adapter size and communication latency by jointly compressing keys and values, and it handles agents that operate with differing contexts.
Text-based communication between LLM agents suffers from high latency and information loss, because model states must be decoded autoregressively into tokens. Existing cache-based methods such as Cache-to-Cache (C2C) instead exchange KV caches, but they rely on large, expensive adapters. These methods also struggle when the LLMs operate with different contexts, which is common in agent communication.
The paper's method, Latent Cache Flow (LCF), addresses these limitations. LCF achieves efficiency by jointly translating and compressing keys and values, resulting in adapters that are approximately 4% of the size of C2C adapters. Crucially, the LCF adapter is designed to transmit only a summary of the new information that the target model lacks, which allows communication even across differing contexts.
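To make the idea of jointly translating and compressing keys and values more concrete, here is a minimal NumPy sketch. It is not the paper's architecture: the low-rank bottleneck, the random weights, the dimensions, and the names `make_joint_adapter`, `translate_kv`, and `param_count` are all assumptions chosen for illustration. It only shows how a single shared bottleneck over concatenated keys and values can need far fewer parameters than two separate dense maps.

```python
import numpy as np


def make_joint_adapter(d_src, d_tgt, rank, seed=0):
    """Build a hypothetical joint low-rank adapter (random, untrained weights)."""
    rng = np.random.default_rng(seed)
    # Down-projection: concatenated source [K, V] -> shared latent of size `rank`.
    w_down = rng.standard_normal((2 * d_src, rank)) / np.sqrt(2 * d_src)
    # Up-projection: shared latent -> concatenated target [K, V].
    w_up = rng.standard_normal((rank, 2 * d_tgt)) / np.sqrt(rank)
    return w_down, w_up


def translate_kv(keys, values, w_down, w_up):
    """Map source keys/values of shape (T, d_src) to target shape (T, d_tgt)."""
    joint = np.concatenate([keys, values], axis=-1)  # (T, 2 * d_src)
    latent = joint @ w_down                          # (T, rank): compressed
    out = latent @ w_up                              # (T, 2 * d_tgt): translated
    k_tgt, v_tgt = np.split(out, 2, axis=-1)
    return k_tgt, v_tgt


def param_count(*mats):
    """Total number of scalar parameters in the given weight matrices."""
    return sum(m.size for m in mats)


# Illustrative sizes only (assumed, not taken from the paper).
d_src, d_tgt, rank, seq_len = 64, 96, 8, 5
w_down, w_up = make_joint_adapter(d_src, d_tgt, rank)

rng = np.random.default_rng(1)
keys = rng.standard_normal((seq_len, d_src))
values = rng.standard_normal((seq_len, d_src))
k_tgt, v_tgt = translate_kv(keys, values, w_down, w_up)

joint_params = param_count(w_down, w_up)  # 2*d_src*rank + rank*2*d_tgt
separate_params = 2 * d_src * d_tgt       # two dense maps, one each for K and V
print(k_tgt.shape, v_tgt.shape)
print(joint_params, separate_params)
```

In this toy setting the joint adapter has 2,560 parameters against 12,288 for two separate dense maps. The ratio here depends entirely on the assumed rank and dimensions and should not be read as reproducing the paper's reported figure.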
Experimental results support these claims. In shared-context settings, a 13 MB LCF adapter achieved higher accuracy than a 956 MB C2C adapter. In differing-context scenarios, LCF provided 8.5x faster communication and 23% greater accuracy than text-based methods.