Entering The Long-Horizon Era
Models can handle the complexity. The infrastructure can't handle the duration.

We are entering an era defined by agents acting over long horizons.
For two years the default picture of an agent was a chat window with a loop in it. A goal goes in, tokens stream out, and the run ends when the context fills or the model loses the thread. That paradigm is ending. Agents are starting to carry entire workstreams: owning a migration, running a research sweep for a week, sitting on an inbox and acting when something arrives. They hold a goal across replanning cycles, spin up and right-size subtasks, move across multiple systems, and pull a human in only at genuine decision points, directing themselves from accumulated context rather than waiting for the next instruction.
This is the long-horizon shift, and it has relocated the hard problem. Model capability is no longer the bottleneck. Today's models have the reasoning and coherence to hold a goal across many steps. The scaffolding around them is another story. The harness is taking shape: orchestration, tool routing, context management, replanning. The primitives exist. But long-horizon work exposes what short runs forgive. A harness built for a single session struggles to hold a goal across cycles, manage subtasks at scale, or know when to pull a human in. The infrastructure is further behind: persistent state, credential lifecycle, cost controls, environment coherence. The intelligence is there. The scaffolding that lets it run reliably, unattended, over days is where the work remains.
What we mean by long-horizon
Picture the difference this way. A chatbot is a sprint: one question, one answer, and it forgets you the moment the tab closes. A long-horizon agent is the colleague you hand a project to on Monday and check in with on Friday, and the work has moved.
That difference isn't fundamentally about how long it runs. It's about the shape of the work: a goal made of many interdependent steps, where each builds on the last and an early wrong turn poisons everything downstream. That step structure is the defining feature. It tends to play out over hours, days, or weeks, but the clock is a consequence of the work, not the thing that defines it. Long-horizon is a property of the task, not the runtime.
And it carries one demand short agents never face: the work outlives any single sitting. The agent has to hold the goal, and the context behind it, across replanning, across failure, and across idle time. Idle time is the hard one. Nearly every real long-horizon task has to go dormant at some point, waiting on an approval, a dependency, a human, then wake on the right signal and pick up exactly where it left off without dropping the thread. That capability, persistent agency and memory across the gaps, is something long-horizon work requires and that almost nobody has truly solved.
Two examples of such workflows: First, an agentic project manager plans the work, runs it across half a dozen systems, then sits for hours or days waiting on an approval or external event before resuming. Second, a personal assistant lives alongside you and acts the moment something needs doing. Neither runs the whole time. Both have to stay coherent while nothing is happening.
Long-horizon vs. long-running
Right now "long-running" and "long-horizon" get used interchangeably. That hides a distinction worth making.
Our view: long-horizon is the umbrella
Long-running describes a single process: one engagement that stays alive, from start to finish, driven to completion in one continuous run.
Long-horizon is the umbrella above it. It's a goal pursued across many tasks and subtasks over time: some long-running processes doing the deep work, some short bursts knocking out a subtask, each spun up when it's needed against one persistent goal. What makes the whole thing long-horizon isn't that any single process runs long. It's that the engagement doesn't run straight through. It carries intermittency: idle stretches between the pieces, where the agent sits dormant through an approval, a dependency, an overnight gap, then wakes on a signal and resumes exactly where it left off. Intermittency, interdependent steps, long duration: carry those and you have a long-horizon agent, whether the pieces inside it run for two minutes or four hours.
That idle time is where things often break today. The model can reason across the steps. The infrastructure to keep a goal, a credential, and a context alive across days of dormancy still needs to be built. The single continuous run is a solved problem. The umbrella case, project-scale work and always-on assistants, is the unsolved one. That gap is where the next era of agent infrastructure gets built.
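To make "go dormant, then wake on the right signal" concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the TaskState fields, the checkpoint and resume helpers, and the use of a local JSON file as the store. It is not a description of any particular product, and a real system would also need to persist credentials and context, which this sketch leaves out.

```python
import json
from dataclasses import dataclass, asdict, field
from pathlib import Path
from typing import List, Optional


@dataclass
class TaskState:
    """Hypothetical minimal state a task needs to survive dormancy."""
    goal: str
    completed_steps: List[str] = field(default_factory=list)
    pending_steps: List[str] = field(default_factory=list)
    waiting_on: Optional[str] = None  # name of the signal that should wake the task


def checkpoint(state: TaskState, path: Path) -> None:
    """Persist state atomically: write a temp file, then rename it over the target."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(state)))
    tmp.replace(path)


def resume(path: Path, signal: str) -> Optional[TaskState]:
    """Reload state when a signal arrives.

    Returns None if the task is waiting on a different signal, so it stays
    dormant. Otherwise clears the wait and returns the restored state; the
    caller is expected to checkpoint again after making progress.
    """
    state = TaskState(**json.loads(path.read_text()))
    if state.waiting_on != signal:
        return None
    state.waiting_on = None
    return state
```

The point of the sketch is the shape, not the storage: nothing has to stay running while the task waits, because the goal, the progress so far, and the expected wake-up signal all live outside the process.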
Entering the Long-Horizon Era
Most of the failure modes are already on the table. Context that rots, state that vanishes, credentials that die mid-run, costs that spiral, fleets that collide, agents that declare themselves done at thirty percent. They are reality today, and most are still unsolved. Stretch the same agent across idle time and a project-shaped goal, and every one of them gets harder.
They are structural infrastructure problems, not model problems, and solving them is where we spend our time at Keska Labs. Six areas, each getting its own deep dive in this series:
Authentication. OAuth assumes a human present at the consent boundary. A long-horizon agent violates that assumption on every axis: no human at consent, concurrent access, indefinite runtime. Continuous agents need credentials kept alive mid-run; intermittent ones need a background service that keeps them alive while nothing is running. We've written this up separately, and a dedicated deep dive on the token-lifecycle gap is next.
User experience. Setting up an agent to run for a week, and staying in the loop without babysitting it, is still a developer's job. For long-horizon work to reach non-technical operators, the experience of defining the work, watching it, and stepping in at the right moment has to be a product surface, not a config file.
Security. A long-running agent with cloud access and shell execution is a much larger attack surface than a chat session. Credentials reachable from the sandbox where model-generated code runs, memory that can be poisoned across sessions, permissions that don't scope down per subtask: these get harder, not easier, as the agent persists.
Multiple environments. Real long-horizon agents don't live in one place. They move across sandboxes, machines, and cloud runtimes, and they exist across many authenticated services at once. At eight or more concurrent connections on independent schedules, something is always about to expire or drift. Coherence across environments becomes its own problem.
Cost and token consumption. A multi-day run with a frontier model can quietly burn a week's budget in an afternoon. Part of the answer is budgets and circuit breakers; part of it is architectural: routing the right work to smaller, cheaper models and reserving the frontier model for the steps that actually need it. Most of a long horizon doesn't need the biggest brain.
Observability. An agent runs for six hours, makes forty tool calls, and produces a wrong answer. Without a structured trace of what it did and why, you cannot tell whether it hallucinated, hit a bad API response, or was handed the wrong goal at the start. Auditing autonomous activity after the fact is a real human-time problem, and right now most teams are solving it by scrolling logs and hoping they catch what matters.
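As one illustration of what a "structured trace of what it did and why" could look like, here is a minimal Python sketch. The TraceEvent fields, the event kinds, and the Trace class are all hypothetical names chosen for this example, not an existing library or a standard schema. The idea is only that each step records its reason and outcome in a machine-readable form, so an audit can start from the first failure instead of from the top of a log.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class TraceEvent:
    run_id: str
    step: int
    kind: str                      # hypothetical kinds: "goal", "tool_call", "model_output"
    name: str                      # tool name or a short label for the step
    reason: str                    # why the agent took this step
    ok: bool = True
    detail: Optional[str] = None   # e.g. an error message from a tool
    ts: float = 0.0


class Trace:
    """Append-only, in-memory trace for one run (assumed design, for illustration)."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events: List[TraceEvent] = []

    def record(self, kind: str, name: str, reason: str,
               ok: bool = True, detail: Optional[str] = None) -> TraceEvent:
        event = TraceEvent(self.run_id, len(self.events), kind, name,
                           reason, ok, detail, time.time())
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        """Serialize as JSON Lines, one event per line, for storage or search."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)

    def first_failure(self) -> Optional[TraceEvent]:
        """Return the earliest event marked as failed, or None if all succeeded."""
        return next((e for e in self.events if not e.ok), None)
```

With a trace like this, the three cases in the paragraph above become distinguishable: a bad API response shows up as a failed tool_call event, a wrong starting goal is visible in the first goal event, and a hallucination is what remains when every recorded input looks correct.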
Where this goes
The gap between a chat window and an agent you can leave running for a week isn't in the model. It's in the state, sessions, credentials, and coordination wrapped around it, and every one of those gets harder the moment the agent has to survive idle time.
That gap is what we build at Keska Labs. This is the first post in a series, and each focus area above gets its own deep dive. We start with authentication, the problem we've thought hardest about and the one that quietly kills more unattended agents than any other.




