
From models that predict to systems that act

January 2, 2026


AI research in 2026 should confront a simple but transformative realization: Models that predict are not the same as systems that act. The latter is what we actually need.

Over the last decade, we have become extraordinarily good at passive prediction and generative modeling: producing bounding boxes and segmentation masks for objects in images, transcribing audio into text, or generating fluent paragraphs and images on command. These are impressive achievements, yet they remain proxy tasks, often assumed to stand in for real-world economic utility. That assumption is a fallacy. The world’s economically meaningful tasks do not end with a single prediction or generation from a single input. They require taking a sequence of actions (each of which may be a function of predictions or generations from one or more models) in complex, dynamic environments, where each action shapes the state of the environment and hence the actions that follow.
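To make the contrast concrete, here is a minimal Python sketch: a passive predictor is a single function call, while an acting system runs a closed loop in which each action changes the state the next decision depends on. Everything here (the toy environment, the policy, the names) is an illustrative assumption, not a real system.

```python
def predict(model, x):
    """Passive prediction: one input in, one output out, no effect on the world."""
    return model(x)


class ToyEnv:
    """A trivial stateful environment: an integer counter the agent must
    drive to a target value. Each action changes the state that the next
    decision is conditioned on."""

    def __init__(self, start: int, target: int):
        self.state, self.target = start, target

    def step(self, action: int):
        self.state += action
        return self.state, self.state == self.target


def run_agent(policy, env: ToyEnv, max_steps: int = 20):
    """Acting system: a closed loop over many steps, not a single forward pass."""
    state = env.state
    for t in range(max_steps):
        action = policy(state, env.target)  # each prediction is just one step
        state, done = env.step(action)      # the action reshapes the environment
        if done:
            return t + 1, state
    return max_steps, state


if __name__ == "__main__":
    env = ToyEnv(start=0, target=5)
    steps, final = run_agent(lambda s, g: 1 if s < g else -1, env)
    print(f"reached {final} after {steps} actions")  # reached 5 after 5 actions
```

The toy is deliberately trivial; the point is structural. No single call to predict can complete the task, because the task lives in the loop.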

In 2026, AI research must move decisively from solving these proxy tasks to solving the long-horizon, realistic tasks they loosely approximate. Consider how coding has evolved: Models once autocompleted lines, but modern coding agents increasingly take a high-level specification, search through a codebase, run tests, and return a working solution with minimal human intervention.
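A hedged sketch of that kind of loop, assuming hypothetical callables for the search, edit, test, and model steps (none of these are a real API): the test results close the loop, turning one-shot generation into iteration toward a working solution.

```python
def coding_agent(spec, search_codebase, apply_edit, run_tests, llm, max_iters=10):
    """Sketch of a coding-agent loop: propose a patch, apply it, run the
    tests, and feed any failures back into the next attempt. All five
    callables are hypothetical placeholders, not a real API."""
    context = search_codebase(spec)     # gather relevant code (assumed to return a list)
    for _ in range(max_iters):
        patch = llm(spec, context)      # each generation is one step...
        apply_edit(patch)
        passed, report = run_tests()    # ...and the environment answers back
        if passed:
            return patch                # a working solution, not just plausible text
        context.append(report)          # failures shape the next attempt
    return None                         # budget exhausted
```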

I hope we can bring this evolution, from generating proxies to accomplishing goals, to other domains. For example, vision models should be studied as parts of larger systems that use visual input streams to drive digital (web and computer-use) and physical (embodied) workflows, monitor processes, or extract insights. Speech systems should be studied as parts of intelligent conversational assistant architectures that understand objectives conveyed through conversation and interface with digital or physical tools to fulfill them. Image- and video-generation models should be studied as parts of systems that generate, say, long-form visual educational content from existing documents, or marketing material for products and research artifacts.

Shifting focus to these long-horizon tasks and goal-oriented AI systems has two major benefits. First, it exposes the limitations and pain points of current AI models when we use them to construct these larger systems and pipelines. Goal-oriented AI systems need more than predictive or generative capability. They require persistent memory, the ability to focus on a goal over a long time horizon, responsiveness to real-time human feedback, and the ability to cope with uncertainty in an evolving environment. They also require effective interfacing with a wide variety of multimodal information sources, tool calling, the ability to hypothesize and reason, continual learning, self-improvement, and more. Many of these capability gaps are invisible on short-horizon or single-step predictive tasks but reveal themselves in more complex and realistic long-horizon scenarios. We need better ways to evaluate these aspects of intelligence and methods to improve them.
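As a rough illustration of why this changes the shape of the software: a single-shot predictor is stateless, whereas a goal-oriented system has to carry state between steps, something like the sketch below. All field and method names here are hypothetical, chosen only to mirror the requirements above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class AgentState:
    """Illustrative state a goal-oriented system carries across steps;
    a pure predictor needs none of this."""
    goal: str                                          # held over the whole horizon
    memory: list[str] = field(default_factory=list)    # persistent, consolidated context
    tools: dict[str, Callable[..., Any]] = field(default_factory=dict)
    feedback: list[str] = field(default_factory=list)  # real-time human input

    def remember(self, observation: str) -> None:
        """Consolidate a new observation instead of discarding it."""
        self.memory.append(observation)

    def call(self, name: str, *args: Any) -> Any:
        """Interface with an external tool by name."""
        return self.tools[name](*args)
```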

Second, this goal-centric reframing aligns AI research with end-task utility. By directly trying to solve real end-tasks, researchers are less likely to be led astray by the siren song of seemingly useful proxy tasks that ultimately contribute little to solving the real ones. For instance, NLP researchers long assumed semantic parsing to be an essential component of natural language understanding systems. Today’s LLMs perform sophisticated language understanding and manipulation without ever explicitly performing semantic parsing. In hindsight, those research hours might have been better spent trying to solve the end-task rather than chasing the proxy metric of semantic parsing accuracy.

Real digital or physical tasks unfold over minutes, hours, months, and sometimes years. Humans have the extraordinary capability of consolidating diverse information collected over extended periods into a consistent world-view that drives the execution of complex goals in evolving environments. The advances in deep learning over the last decade, particularly in LLMs and VLMs, have set the stage for the AI research community to take a serious shot at replicating this ability in silicon over the next decade. In the last couple of years alone, we have seen the rise of LLM-powered agentic systems that automate well-defined workflows. Tackling the underspecified, ill-defined, undiscovered, and unimagined is the next frontier.