OpenHands Coding Agent: What It Is, How It Works, and Whether You Need It
Quick answer: OpenHands is an open-source, autonomous AI coding agent that resolves GitHub issues, writes tests, refactors code, and opens pull requests — without a human in the loop. It runs in a sandboxed Docker container, works with any LLM provider (Anthropic, OpenAI, Google, local models), and scores 68.4% on SWE-Bench Verified with Claude Opus 4.6 — significantly ahead of commercial alternatives like Devin (45.8%). MIT-licensed, 65,000+ GitHub stars. Nebula Deck provides managed OpenHands with per-second billing and gVisor isolation, integrated into a persistent AI workspace starting at $15/mo for the compute tier.
Table of Contents
- What is OpenHands?
- OpenHands vs Cursor vs Claude Code vs Devin: How They Compare
- How OpenHands Works: Architecture
- Setting Up OpenHands Locally
- Benchmarks: What the Numbers Actually Mean
- Real-World Use Cases (and Where It Fails)
- Managed OpenHands with Nebula Deck
- Getting Better Results from OpenHands
- FAQ
What is OpenHands?
OpenHands is a platform for autonomous coding agents. You describe a task — "fix the auth bug," "add test coverage for the billing module," "migrate from Express to Fastify" — and OpenHands plans, codes, tests, and delivers the result without human intervention.
It grew out of the OpenDevin research project (renamed OpenHands in 2024) and is maintained by All-Hands-AI, a venture-backed company ($18.8M Series A). The codebase is MIT-licensed at github.com/OpenHands/OpenHands with 65,000+ stars.
The key distinction from tools like Cursor or GitHub Copilot: those are assistants — they help while you code. OpenHands is an agent — it codes while you do something else. You review the output. This isn't a philosophical difference; it's a workflow difference. Assistants are for complex, nuanced work where your judgment matters on every line. Agents are for well-scoped, repeatable tasks you'd rather delegate.
Key characteristics:
- License: MIT (no feature gates, no premium tiers)
- GitHub stars: 65,000+
- LLM support: Any provider via LiteLLM — Anthropic, OpenAI, Google, DeepSeek, Groq, xAI, Ollama, any OpenAI-compatible endpoint
- Interfaces: Web GUI, terminal CLI, headless mode (for CI/scripts), Python SDK
- Execution: Sandboxed Docker containers — the agent has its own filesystem, terminal, and browser
- Deployment: Self-hosted (Docker), OpenHands Cloud (managed), or integrated into platforms like Nebula Deck
- Best SWE-Bench score: 77.6% Verified (Claude Sonnet 4.5 Thinking, V0 harness)
OpenHands comes in three forms: open-source (self-hosted, free), OpenHands Cloud (managed SaaS with free and paid tiers), and OpenHands Enterprise (self-hosted with RBAC, SSO, and compliance features).
OpenHands vs Cursor vs Claude Code vs Devin: How They Compare
AI coding tools in 2026 span a spectrum from interactive assistants to fully autonomous agents. They're not interchangeable — most serious teams use two or three for different task types.
| | OpenHands | Cursor | Claude Code | Devin | Aider |
|---|---|---|---|---|---|
| Type | Autonomous agent | IDE assistant | Terminal agent | Autonomous agent | Terminal pair programmer |
| How you use it | Describe task, walk away | Write code, AI assists inline | Terminal commands, AI reasons | Describe task, walk away | Chat in terminal, AI edits |
| Model support | Any (via LiteLLM) | GPT-4o, Claude, Gemini (proxied) | Claude only (Anthropic lock-in) | Proprietary | Any (via LiteLLM) |
| Self-hostable | Yes (MIT) | No | No | No | Yes (Apache 2.0) |
| Sandboxing | Docker containers | N/A (IDE) | Limited | Cloud sandbox | None (runs in your repo) |
| Best SWE-Bench Verified | 68.4% (Claude Opus 4.6), 77.6% (Claude Sonnet 4.5 Thinking) | N/A (not an agent benchmark) | High (not published separately) | 45.8% | ~30-40% (varies by model) |
| Pricing | Free (self-hosted) + API costs | $20/mo Pro | $20/mo Pro, $100-200/mo Max | $20/mo + $2.25/ACU | Free + API costs |
| Best for | Autonomous tasks, CI, batch work | Daily coding, visual editing | Complex refactoring, automation | Delegated tasks with PM tools | Focused file editing |
When to choose OpenHands over alternatives:
- You want autonomous completion of well-scoped tasks (bug fixes, test generation, migrations, PR reviews) without being in the loop
- You need model freedom — switching between Claude, GPT, Gemini, or local models based on the task
- You want to self-host on your own infrastructure for data sovereignty
- You want to integrate coding agents into CI/CD pipelines or automated workflows
When something else is better:
- Cursor if your workflow is interactive — you're writing code and want AI to accelerate you
- Claude Code if you want the strongest single-model reasoning and don't mind Anthropic lock-in
- Devin if you want a polished UX with native project management integration (Slack, Linear, Jira) and don't need self-hosting
- Aider if you want lightweight, no-sandbox pair programming in a terminal for focused file edits
Many developers run Cursor for interactive work during the day and OpenHands in headless mode for the backlog overnight. The two tools complement each other rather than compete.
How OpenHands Works: Architecture
OpenHands uses a two-container architecture:
┌─────────────────────────────┐
│  Application Server         │
│  (conversation mgmt, UI,    │
│   LLM interaction loop)     │
│                             │
│  ┌──────────────────────┐   │
│  │ Agent Loop           │   │
│  │ 1. Receive task      │   │
│  │ 2. LLM reasons       │   │
│  │ 3. Pick action       │──────► LLM API
│  │ 4. Execute in        │   │   (any provider)
│  │    sandbox           │   │
│  │ 5. Observe result    │   │
│  │ 6. Repeat or done    │   │
│  └──────────────────────┘   │
└──────────┬──────────────────┘
           │ Docker socket
           ▼
┌─────────────────────────────┐
│  Runtime Sandbox            │
│  (isolated container)       │
│                             │
│  - Own filesystem           │
│  - Terminal / shell         │
│  - Browser (optional)       │
│  - Can install packages     │
│  - Cannot access host       │
└─────────────────────────────┘
The core loop: the agent receives a task, uses the LLM to reason about it, selects an action (read file, edit code, run command, browse web, delegate to sub-agent), executes the action inside the sandbox, receives the observation (output, error, file contents), reasons again, and picks the next action. This continues until the task is resolved, a test passes, or the iteration limit is hit.
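The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenHands' actual code: `Action`, `Observation`, `run_agent`, and the toy `decide`/`execute` functions are all hypothetical names, and the LLM and sandbox are replaced by stand-ins so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str           # illustrative labels: "run", "edit", "finish"
    argument: str = ""


@dataclass
class Observation:
    output: str
    ok: bool


def run_agent(task: str,
              decide: Callable[[str, list], Action],
              execute: Callable[[Action], Observation],
              max_iterations: int = 10) -> tuple[str, list]:
    """Decide, execute, observe; stop on "finish" or at the iteration limit."""
    history: list = []
    for _ in range(max_iterations):
        action = decide(task, history)        # stands in for the LLM call
        if action.kind == "finish":
            return "resolved", history
        observation = execute(action)         # stands in for the sandbox
        history.append((action, observation))
    return "iteration_limit", history


# Toy stand-ins (assumptions) so the loop runs without an LLM or a container.
def toy_decide(task, history):
    if history and history[-1][1].ok:
        return Action("finish")
    return Action("run", "pytest" if not history else "pytest --after-fix")


def toy_execute(action):
    passed = action.argument.endswith("--after-fix")
    return Observation(output="1 passed" if passed else "1 failed", ok=passed)


status, history = run_agent("fix the auth bug", toy_decide, toy_execute)
print(status, len(history))  # resolved 2
```

The iteration limit is the part worth noticing: without it, an agent that never sees a passing test would keep spending tokens indefinitely.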
What makes this architecture matter in practice:
- Sandboxed execution — the agent can install packages, run tests, execute arbitrary commands, all without touching your host system. This is what makes unsupervised operation safe enough to actually use.
- Model agnostic — LiteLLM abstracts the provider, so the same agent works with Claude, GPT, Gemini, DeepSeek, Llama, or local models. Model choice affects quality dramatically (see benchmarks below), but you're never locked in.
- Event-driven tracing — every action and observation is a typed event, giving you a complete, replayable audit trail. In the web GUI you watch in real time. In headless mode, events stream to your terminal or CI logs.
- MCP support — connect external tools (search engines, databases, issue trackers) that the agent can use during execution. The May 2026 update added sub-agent delegation via TaskToolSet for multi-agent workflows.
- Headless mode — submit tasks programmatically via CLI or REST API. This powers the OpenHands Resolver (GitHub Action that auto-fixes labeled issues) and is what makes batch processing at scale viable.
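To make the event-driven tracing idea concrete, here is a rough sketch of a typed event log that can be serialized and replayed. The `Event` fields and the JSON Lines format are assumptions chosen for illustration; OpenHands' real event schema is richer and may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    seq: int
    kind: str      # "action" or "observation" (illustrative)
    name: str      # e.g. "run_command", "edit_file" (hypothetical names)
    payload: str


def to_jsonl(events):
    """Serialize events one JSON object per line, in order."""
    return "\n".join(json.dumps(asdict(event)) for event in events)


def from_jsonl(text):
    """Rebuild the typed events from a JSON Lines log."""
    return [Event(**json.loads(line)) for line in text.splitlines() if line]


events = [
    Event(0, "action", "run_command", "pytest tests/test_auth.py"),
    Event(1, "observation", "run_command", "1 failed"),
    Event(2, "action", "edit_file", "src/auth.py"),
]

log = to_jsonl(events)
replayed = from_jsonl(log)

for event in replayed:
    print(f"{event.seq:>3} {event.kind:<12} {event.name}: {event.payload}")
```

Because every step is an ordered, typed record, the same log can feed a live UI, a CI log, or an after-the-fact audit.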
Setting Up OpenHands Locally
OpenHands requires Docker and at least 4GB of available RAM. Setup takes about 10 minutes.
Option 1: Using uv (recommended)
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install OpenHands
uv tool install openhands --python 3.12
# Launch the web GUI
openhands serve
# Or mount your current project directory
openhands serve --mount-cwd
This pulls Docker images automatically and starts the GUI at http://localhost:3000.
Option 2: Docker directly
docker run -it --rm --pull=always \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ~/.openhands:/.openhands \
-p 3000:3000 \
--add-host host.docker.internal:host-gateway \
--name openhands-app \
docker.openhands.dev/openhands/openhands:1.7
Option 3: Headless / CLI
For scripting, CI, and automated workflows — no GUI, just task in, result out:
openhands run --task "Fix the authentication bug in src/auth.py — \
the token validation skips expiry check on refresh tokens"
This is the mode that powers the OpenHands Resolver: connect it to a GitHub repo via Actions, label an issue, and OpenHands creates a sandbox, analyzes the issue, writes the fix, runs tests, and opens a PR automatically.
After launching
Open http://localhost:3000, select an LLM provider (Anthropic, OpenAI, Google, or your preferred provider), paste your API key, and you're ready to go. OpenHands also offers its own LLM provider at cost with no markup if you don't have an existing key.
Benchmarks: What the Numbers Actually Mean
SWE-Bench Verified is the standard benchmark for coding agents — 500 real GitHub issues from popular Python repositories, each requiring the agent to understand the problem, locate the relevant code, write a fix, and pass the test suite.
| Configuration | SWE-Bench Verified | Notes |
|---|---|---|
| OpenHands + Claude Sonnet 4.5 Thinking | 77.6% | V0 harness, highest published score |
| OpenHands + Claude Opus 4.6 | 68.4% | V1 SDK, strong general-purpose |
| OpenHands + Devstral 24B | 46.8% | Open-weight model, competitive with Devin |
| OpenHands + Qwen3-235B | ~52% | Estimated, MoE open-weight |
| Devin 2.0 | 45.8% | Commercial, closed-source |
| SWE-Agent 1.0 | 57.6% | Research agent (Princeton/Stanford) |
What to read into these numbers:
The model matters more than the agent framework. OpenHands' architecture enables strong performance, but the difference between 77.6% (Claude Sonnet 4.5 Thinking, expensive) and ~30% (small local model, cheap) is enormous. Budget and model choice are the real decisions.
SWE-Bench itself has known limitations — Python-only repositories, potential data contamination (issues predate model training cutoffs), and weak test cases that inflate scores. The SWE-Bench+ paper found that scores drop 25-35 percentage points when evaluated against enhanced test suites. Real-world performance is lower than benchmark numbers suggest for all agents, not just OpenHands.
The practical takeaway: with a strong model (Claude or GPT-4), OpenHands reliably handles well-scoped bugs, test generation, and migrations. With cheaper models, expect more iteration and more failures. Plan your cost model accordingly.
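A quick way to ground these percentages is to convert them into issue counts out of the 500-issue benchmark, then apply the 25-35 point drop cited above. The sketch below does that arithmetic; applying the drop uniformly to every configuration is an assumption for illustration, not a published result.

```python
SWE_BENCH_VERIFIED_SIZE = 500

# Scores as listed in the table above.
scores = {
    "OpenHands + Claude Sonnet 4.5 Thinking": 77.6,
    "OpenHands + Claude Opus 4.6": 68.4,
    "SWE-Agent 1.0": 57.6,
    "OpenHands + Devstral 24B": 46.8,
    "Devin 2.0": 45.8,
}


def resolved_count(percent, total=SWE_BENCH_VERIFIED_SIZE):
    """Number of benchmark issues a given score corresponds to."""
    return round(percent / 100 * total)


def adjusted_range(percent, drop_low=25.0, drop_high=35.0):
    """Illustrative (low, high) score after a 25-35 point drop, floored at 0."""
    low = round(max(percent - drop_high, 0.0), 1)
    high = round(max(percent - drop_low, 0.0), 1)
    return low, high


for name, percent in scores.items():
    low, high = adjusted_range(percent)
    print(f"{name}: {resolved_count(percent)}/500 issues, "
          f"roughly {low}-{high}% under stricter tests")
```

Read this way, 77.6% is 388 of 500 issues, and the gap between the top configuration and Devin 2.0 is 159 issues on the benchmark.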
Real-World Use Cases (and Where It Fails)
Where OpenHands works well:
- Bug fixes with clear reproduction — give it a GitHub issue with steps to reproduce, and it traces the stack, reads the code, writes a fix, and verifies with tests. This is its strongest use case.
- Test coverage expansion — point it at a module with low coverage and ask for tests. The sandbox lets it run tests and iterate until they pass.
- Dependency upgrades and migrations — framework version bumps, library replacements, deprecated API removal. Tedious, well-defined, perfect for delegation.
- PR review automation — wire into GitHub webhooks for autonomous first-pass review: code quality, security issues, coverage gaps, before a human looks.
- Documentation generation — generate or update docs from code changes, commit history, and PR descriptions.
Where it struggles — and these are real limitations, not edge cases:
- Ambiguous tasks — "make the UX better" gives the agent nothing to verify against. The more specific your task description, the better the result.
- The backtrack loop — watch OpenHands work and you'll see it plan, execute, hit an error, reason, retry. This is normal but burns tokens. A 10-minute human task might take 30 minutes of API calls. Still cheaper than your time, but not free.
- Flaky test suites — if your tests pass and fail randomly, the agent's feedback loop breaks. It can't tell whether its fix worked or the test was flaky.
- Browsing — OpenHands has browser automation, but site changes, JavaScript-heavy pages, and bot detection make it unreliable. Prefer curl or library-level APIs when possible.
- V0 vs V1 confusion — the architecture split in November 2025 means older tutorials, the original arXiv paper, and pre-2026 blog posts describe a different codebase. Always check the date on OpenHands content you read.
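Of the limitations above, flaky tests are the one you can check for before delegating. One possible approach, sketched here, is to run the suite several times on unchanged code and flag any test whose result varies. `find_flaky_tests` and the fake suite are hypothetical; in practice `run_suite` would wrap your real test runner.

```python
from typing import Callable, Dict, List


def find_flaky_tests(run_suite: Callable[[], Dict[str, bool]],
                     runs: int = 5) -> List[str]:
    """Run the suite repeatedly; a test with mixed pass/fail results is flaky."""
    outcomes: Dict[str, set] = {}
    for _ in range(runs):
        for name, passed in run_suite().items():
            outcomes.setdefault(name, set()).add(passed)
    return sorted(name for name, seen in outcomes.items() if len(seen) > 1)


def make_fake_suite():
    """Deterministic stand-in: one test alternates between fail and pass."""
    calls = {"n": 0}

    def run():
        calls["n"] += 1
        return {
            "test_login": True,
            "test_refresh_expiry": False,               # consistently failing
            "test_cache_warmup": calls["n"] % 2 == 0,   # alternates
        }

    return run


print(find_flaky_tests(make_fake_suite()))  # ['test_cache_warmup']
```

A consistently failing test is a usable signal for the agent; only the alternating one is flagged, and it is the one to quarantine or fix first.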
Managed OpenHands with Nebula Deck
Self-hosting OpenHands means managing Docker, monitoring container health, handling image updates, and provisioning compute for each agent run. For solo developers and small teams, this overhead can eat into the productivity gains the agent is supposed to provide.
Nebula Deck takes a different approach to managed OpenHands. Instead of offering a standalone coding agent (which is what OpenHands Cloud does), Nebula Deck integrates OpenHands as the coding agent layer inside a persistent AI workspace powered by Moltis.
How it works
Your Moltis workspace runs 24/7 and handles everyday work — chat across Telegram, Discord, Slack, and WhatsApp, persistent memory, scheduling, web search, tool execution. When a task needs code, you spawn an OpenHands agent from inside the workspace with a slash command or button click. The agent works in a gVisor-isolated sandbox. Events stream back to your chat in real time so you can steer mid-flight. When the PR is up, the container is destroyed and billing stops.
What Nebula Deck handles
- Provisioning — one-click agent spawning from your workspace, no Docker management
- Isolation — gVisor runtime on all containers (workspace, agents, browser sessions), no Docker socket access for tenants
- Billing — per-second with 60-second minimum, the meter stops when the agent stops
- Credential injection — your API keys and GitHub tokens are injected server-side; ephemeral containers never hold credentials
- Updates — self-service with rollback, you choose when
Pricing
| Component | Cost |
|---|---|
| Deck (always-on Moltis workspace) | $7/mo |
| Developer compute (3 concurrent, $10 credit) | $15/mo |
| Studio compute (5 concurrent, $35 credit) | $39/mo |
| Observatory compute (10 concurrent, $100 credit) | $99/mo |
| Standard agent rate (headless) | $0.05/hr |
| Full agent rate (GUI + IDE) | $0.15/hr |
| LLM tokens | BYOK, no markup |
The distinction: OpenHands Cloud gives you a coding agent. Nebula Deck gives you a persistent AI workspace that can deploy coding agents when needed, then goes back to handling the rest of your work.
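To estimate what a single agent run costs under this pricing, the per-second billing with a 60-second minimum can be written as a small function. This is a sketch under assumptions: it applies the minimum per run, ignores the monthly subscription and included credit, and does no rounding to billing increments.

```python
from decimal import Decimal

# Hourly rates from the pricing table above.
RATES_PER_HOUR = {
    "standard": Decimal("0.05"),   # headless agent
    "full": Decimal("0.15"),       # GUI + IDE agent
}
MINIMUM_SECONDS = 60


def agent_run_cost(seconds: int, tier: str = "standard") -> Decimal:
    """Compute cost in dollars for one run, billed per second with a minimum."""
    billable = max(seconds, MINIMUM_SECONDS)
    return RATES_PER_HOUR[tier] * billable / Decimal(3600)


for label, seconds, tier in [
    ("10-second run (billed as 60 s)", 10, "standard"),
    ("30-minute headless run", 1800, "standard"),
    ("2-hour full run", 7200, "full"),
]:
    print(f"{label}: ${agent_run_cost(seconds, tier):.4f}")
```

At these rates a 30-minute headless run costs $0.025 in compute, so for most tasks the LLM tokens, not the container, will dominate the bill.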
Getting Better Results from OpenHands
We run OpenHands as the coding agent layer inside Nebula Deck. These patterns improve results based on actual usage:
Write the test first. If "done" is a test that passes, OpenHands performs dramatically better. The agent uses test results as its feedback signal — a failing test tells it exactly what's wrong, a passing test tells it to stop. Without tests, it has to guess when it's finished.
Be specific about scope. "Fix the auth bug" is worse than "In src/auth.py, the token_validation function skips expiry checking when the token type is refresh. Add an expiry check for refresh tokens and write a test that verifies expired refresh tokens are rejected." More context = fewer wasted iterations = lower cost.
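For the refresh-token example, "write the test first" could look something like the sketch below. The `Token` class and the signature of `token_validation` are assumptions made up for illustration (the real function in your src/auth.py will differ); the point is that the test pins down exactly what "done" means.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Token:
    token_type: str        # "access" or "refresh"
    expires_at: datetime


def token_validation(token: Token, now: datetime) -> bool:
    """Hypothetical fixed behaviour: every token type gets an expiry check."""
    return now < token.expires_at


def test_expired_refresh_token_is_rejected():
    now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    expired = Token("refresh", now - timedelta(seconds=1))
    assert token_validation(expired, now) is False


def test_unexpired_refresh_token_is_accepted():
    now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    valid = Token("refresh", now + timedelta(hours=1))
    assert token_validation(valid, now) is True


# The tests are plain functions, so they also run without a test runner.
test_expired_refresh_token_is_rejected()
test_unexpired_refresh_token_is_accepted()
print("both tests passed")
```

Handing the agent the first test while it still fails gives it an unambiguous stop condition; the second guards against a "fix" that rejects every refresh token.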
Start with the web GUI, move to headless. The GUI lets you interrupt when the agent goes down a wrong path. Once you understand how OpenHands behaves on your codebase, move repetitive tasks to headless mode.
Don't mount more than the working directory. In headless mode, all actions are auto-approved. Limit the blast radius by mounting only the repo directory, not your home folder.
Choose your model for the task. Use Claude Opus or Sonnet for complex multi-file refactoring. Use Devstral or Qwen for bulk, well-defined tasks where cost matters more than accuracy. The model is the biggest variable in both quality and cost.
FAQ
Is OpenHands free?
Yes. The open-source version is MIT-licensed with no feature gates. You pay only for LLM API tokens from your chosen provider. OpenHands Cloud has a free individual tier. OpenHands Enterprise is a paid self-hosted offering with RBAC, SSO, and compliance features.
What LLM should I use with OpenHands?
For the best results, Anthropic's Claude models are the most battle-tested. Claude Sonnet 4.5 with thinking achieves the highest benchmark score (77.6% on SWE-Bench Verified). For cost-sensitive work, Devstral 24B offers competitive performance (46.8%) at a fraction of the price. Local models via Ollama work but require capable hardware and agent-tuned models; expect significantly lower success rates.
How does OpenHands compare to Devin on benchmarks?
OpenHands with Claude Opus 4.6 scores 68.4% on SWE-Bench Verified. Devin 2.0 scores 45.8%. Even with the open-weight Devstral 24B (46.8%), OpenHands slightly exceeds Devin's score at self-hosted cost. The caveat: benchmark scores don't perfectly predict real-world performance, and Devin offers a more polished product experience with native project management integrations.
Can I run OpenHands without sending code to the cloud?
Yes. The local Docker-based setup keeps everything on your machine. Your code never leaves your infrastructure. The only external calls are to the LLM API — and even those can be replaced with a local model via Ollama, LM Studio, or any OpenAI-compatible endpoint if you have the hardware.
Is it safe to let an AI agent write code autonomously?
The sandboxed container architecture limits what the agent can access. It cannot touch your host filesystem, network, or other containers unless you explicitly mount them. That said, always review PRs before merging — the agent is autonomous but not infallible. The "AI coding agent deletes company database" headlines happened with unsandboxed agents, not with properly isolated setups like OpenHands.
What's the difference between OpenHands and OpenDevin?
Same project, new name. OpenDevin was renamed to OpenHands in 2024 under the All-Hands-AI organization. Older Docker images, repo URLs, and tutorials referencing OpenDevin are outdated. The canonical repository is github.com/OpenHands/OpenHands.
Can I integrate OpenHands into my CI/CD pipeline?
Yes. Headless mode and the OpenHands Resolver GitHub Action are designed for this. Label a GitHub issue, and OpenHands automatically creates a sandbox, analyzes the issue, writes a fix, runs tests, and opens a PR. The same pattern works with GitLab. For custom integrations, the Python SDK provides programmatic task submission.