Latent Cache Flow: Enabling Efficient Model-to-Model Communication Without Text

Text-based communication between LLM agents suffers from high latency and information loss, because model states must be autoregressively decoded into tokens. Existing methods such as Cache-to-Cache (C2C) instead exchange KV caches, but rely on large, expensive adapters. These methods also struggle when the communicating LLMs operate with different contexts, which is common in agent communication.

This paper introduces Latent Cache Flow (LCF), which addresses both limitations. LCF achieves efficiency by jointly translating and compressing keys and values, yielding adapters that are approximately 4% of the size of C2C adapters. Crucially, the LCF adapter is designed to transmit only a summary of the new information that the target model lacks, which allows communication even when the models' contexts differ.
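To give a rough feel for why handling keys and values jointly through a compressed representation can shrink an adapter, the sketch below counts parameters for two hypothetical designs. The abstract does not describe the actual LCF or C2C architectures, so everything here is an assumption made for illustration: the function names, the per-layer dimensions, the rank, and the choice of a plain low-rank bottleneck. The resulting ratio is not meant to reproduce the figures reported in the paper.

```python
# Illustrative parameter counting only; NOT the LCF or C2C architecture.
# All names, dimensions, and the rank are hypothetical.

def separate_dense_params(d_src: int, d_tgt: int) -> int:
    """Two independent dense maps (one for keys, one for values), no biases."""
    return 2 * d_src * d_tgt


def joint_bottleneck_params(d_src: int, d_tgt: int, rank: int) -> int:
    """One shared map: concatenated [K; V] -> `rank` dims -> target [K; V]."""
    down = (2 * d_src) * rank   # compress source keys and values together
    up = rank * (2 * d_tgt)     # expand into the target model's key/value space
    return down + up


def size_ratio(d_src: int, d_tgt: int, rank: int) -> float:
    """Joint bottleneck size as a fraction of the separate dense design."""
    return joint_bottleneck_params(d_src, d_tgt, rank) / separate_dense_params(d_src, d_tgt)


if __name__ == "__main__":
    # Hypothetical per-layer sizes chosen only to make the arithmetic easy.
    d_src, d_tgt, rank = 1024, 1024, 16
    print(separate_dense_params(d_src, d_tgt))          # 2097152
    print(joint_bottleneck_params(d_src, d_tgt, rank))  # 65536
    print(size_ratio(d_src, d_tgt, rank))               # 0.03125
```

Under these assumed sizes, the joint bottleneck needs about 3% of the parameters of the two dense maps. The general point is that the cost of a bottleneck scales with the rank rather than with the product of the two model dimensions.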

Experiments demonstrate the effectiveness of LCF. In shared-context settings, a 13 MB LCF adapter achieved higher accuracy than a 956 MB C2C adapter. In differing-context settings, LCF provided 8.5x faster communication and 23% greater accuracy than text-based methods.
