Parallel Compaction Improves Efficiency and Control for Long-Horizon LLM Agent Serving

Long-horizon LLM agents often accumulate conversation histories that exceed the model's context window, which makes context compaction necessary. Current summarization methods are lossy, stall agent inference, and offer poor control over the resulting context. This paper introduces parallel compaction, a novel approach for agentic flows that gives operators fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. The method is characterized across several model backbones (8B to 120B parameters, spanning dense and MoE architectures) on long-context benchmarks; the results show that parallel compaction reduces end-to-end wall time and improves compaction throughput relative to sequential baselines.
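To make the idea of per-block compaction concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes the history is partitioned into fixed-size blocks and that each block is summarized independently under its own word budget. The names `split_into_blocks`, `truncate_summarizer`, and `compact_parallel` are hypothetical, and the summarizer is a truncation placeholder standing in for a real model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_blocks(messages: List[str], block_size: int) -> List[List[str]]:
    """Partition the history into consecutive blocks of at most block_size messages."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [messages[i:i + block_size] for i in range(0, len(messages), block_size)]


def truncate_summarizer(block: List[str], budget_words: int) -> str:
    """Placeholder for an LLM call (assumption): keep the first budget_words words."""
    words = " ".join(block).split()
    return " ".join(words[:budget_words])


def compact_parallel(
    messages: List[str],
    block_size: int,
    budget_words: int,
    summarize: Callable[[List[str], int], str] = truncate_summarizer,
    max_workers: int = 4,
) -> List[str]:
    """Summarize every block concurrently, returning summaries in history order."""
    blocks = split_into_blocks(messages, block_size)
    if not blocks:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the compacted
        # context keeps the chronological order of the original history.
        return list(pool.map(lambda block: summarize(block, budget_words), blocks))


if __name__ == "__main__":
    history = [f"step {i} produced result {i * i}" for i in range(6)]
    for summary in compact_parallel(history, block_size=2, budget_words=3):
        print(summary)
```

In this sketch the compacted context is bounded by the number of blocks times the per-block budget, which is one way the predictable control over summary volume could be realized. Because `summarize` is a parameter, a different prompt or budget could be supplied per deployment; whether the paper structures its blocks or budgets this way is not stated in the abstract.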
