OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments
AI agents often perform impressively in controlled research settings, yet struggle when deployed in real-world systems where they must reason across multiple steps, interact with real tools and APIs, operate under partial information, and recover from errors in stateful, permissioned environments—highlighting a persistent gap between research success and production reliability.
OpenEnv is an open-source framework from Meta and Hugging Face designed to address this challenge by standardizing how agents interact with real environments. As part of this collaboration, Turing contributed a production-grade calendar management environment to study tool-using agents under realistic constraints such as access control, temporal reasoning, and multi-agent coordination.
In this post, we explore how OpenEnv works in practice, why calendars serve as a powerful benchmark for real-world agent evaluation, and what our findings reveal about the current limitations of tool-using agents.
What Is OpenEnv?
OpenEnv is a framework for evaluating AI agents against real systems rather than simulations. It provides a standardized way to connect agents to real tools and workflows while preserving the structure needed for consistent and reliable evaluation.
OpenEnv exposes a gym-style API (reset, step, actions, observations), modeled on OpenAI Gym and its successor Gymnasium. It also uses the standard MCP (Model Context Protocol) tool-call interface to connect to environments, which provides a consistent interface across domains and from simulation to production.
The environments maintain state across multiple actions—enabling long-horizon reasoning—and can connect directly to real APIs and tools such as browsers, code repositories, or calendars. This shifts evaluation from "Can this work in a controlled demo?" to "Can this operate reliably in the real world?"
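To make that interaction pattern concrete, here is a minimal sketch of the reset/step loop, reusing the MCPEnvClient and MCPAction classes from the Calendar Gym example below; the policy function choose_next_action and the stopping condition are hypothetical placeholders for whatever drives the agent (typically an LLM).

from openenv_wrapper.client import MCPEnvClient
from openenv_wrapper.data_models import MCPAction

def run_episode(base_url, choose_next_action, max_steps=10):
    # Gym-style loop: reset once, then step repeatedly until the policy stops.
    with MCPEnvClient.from_hub(base_url=base_url) as client:
        result = client.reset()
        for _ in range(max_steps):
            # choose_next_action is a hypothetical policy (e.g. an LLM) that
            # maps the latest observation to MCPAction keyword arguments,
            # or None when it considers the task complete.
            action_kwargs = choose_next_action(result.observation)
            if action_kwargs is None:
                break
            result = client.step(MCPAction(**action_kwargs))
            if not result.observation.success:
                print("Step failed; the policy decides how to recover")
        return result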
The Calendar Gym: A Production-Grade Benchmark
Calendar systems are deceptively complex. While scheduling a meeting seems simple, real-world calendar management requires agents to reason over time, permissions, multiple users, and incomplete information—often across several dependent steps. These properties make calendars a powerful testbed for evaluating tool-using agents outside controlled simulations.
To ground OpenEnv in this kind of realistic, demanding use case, Turing built a production-grade calendar management environment referred to as the Calendar Gym. Rather than simulating scheduling in the abstract, it exposes agents to the same constraints they would face in real calendar systems: Access Control Lists across users and calendars, limited visibility into other users' state, and multi-step workflows where actions must be chained in the correct order. Agents interact with a rich set of calendar operations—from listing calendars to modifying events and permissions—and must handle failed actions, incorrect assumptions, and missing permissions. Each session runs in an isolated environment, enabling reliable comparisons across runs.
Below is a code example of how to use the Calendar Gym: we connect to and reset the environment, discover the available tools, list calendars, create an event, and print the results.
from openenv_wrapper.client import MCPEnvClient
from openenv_wrapper.data_models import MCPAction

with MCPEnvClient.from_hub(base_url="TuringEnterprises/calendar-gym") as client:
    # Connect and reset the environment
    result = client.reset()
    print("Reset successful:", result.observation.success)

    # Discover available tools
    result = client.step(MCPAction(action_type="ListToolsAction"))
    print("Available tools:", len(result.observation.tools_list))

    # List calendars
    result = client.step(MCPAction(
        action_type="ToolCallAction",
        tool_name="calendars_list",
        arguments={}
    ))
    calendars = result.observation.tool_result["items"]
    print("Calendars:", calendars)

    # Create an event
    result = client.step(MCPAction(
        action_type="ToolCallAction",
        tool_name="events_insert",
        arguments={
            "calendarId": "primary",
            "summary": "Team Sync",
            "start": {"dateTime": "2026-01-15T14:00:00Z"},
            "end": {"dateTime": "2026-01-15T15:00:00Z"}
        }
    ))
    print("Event created:", result.observation.success)
Below is an excerpt of what the Calendar Gym returns when you call ListToolsAction. Each entry includes the tool name plus an input schema (what arguments the tool accepts).
{
  "tools_list": [
    {
      "name": "calendars_list",
      "description": "List calendars visible to the current user.",
      "input_schema": {
        "type": "object",
        "properties": {},
        "additionalProperties": false
      }
    },
    {
      "name": "events_insert",
      "description": "Create an event in a calendar.",
      "input_schema": {
        "type": "object",
        "properties": {
          "calendarId": { "type": "string" },
          "summary": { "type": "string" },
          "start": {
            "type": "object",
            "properties": { "dateTime": { "type": "string" } },
            "required": ["dateTime"]
          },
          "end": {
            "type": "object",
            "properties": { "dateTime": { "type": "string" } },
            "required": ["dateTime"]
          }
        },
        "required": ["calendarId", "summary", "start", "end"]
      }
    }
  ]
}
What We Learned
Evaluating agents in the Calendar Gym revealed consistent patterns that recur across domains. While agents often perform well on individual, game-like actions, reliability breaks down as tasks become longer, more ambiguous, and more constrained.
Multi-step reasoning is the primary bottleneck. Agents struggle to correctly chain actions across longer workflows, suggesting that benchmarks need to test sustained reasoning over multiple dependent steps—not just single tool calls.
Ambiguity significantly degrades performance. Agents achieved close to 90% success on tasks with explicit calendar identifiers, but success dropped to roughly 40% when the same tasks were phrased using natural language descriptions. Building stronger lookup and validation into agent loops—rather than relying on the LLM to resolve references unaided—appears essential.
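To illustrate what stronger lookup and validation can look like inside the agent loop, here is a sketch that resolves a natural-language calendar reference to a concrete calendarId via calendars_list before any write is attempted; the matching heuristic and the per-item id and summary fields are assumptions beyond what the example output above shows.

def resolve_calendar_id(client, description):
    # Look up calendars the user can actually see, rather than letting the
    # model guess an identifier from the task description.
    result = client.step(MCPAction(
        action_type="ToolCallAction",
        tool_name="calendars_list",
        arguments={}
    ))
    items = result.observation.tool_result["items"]
    # Assumed: each calendar item carries "id" and "summary" fields.
    matches = [c for c in items if description.lower() in c.get("summary", "").lower()]
    if len(matches) == 1:
        return matches[0]["id"]
    # Zero or multiple matches: force an explicit clarification step instead
    # of silently picking one.
    raise ValueError(f"Ambiguous or unknown calendar: {description!r}")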
Correct tool choice isn't enough. Across failed interactions, more than half of errors stemmed from malformed tool arguments or incorrect ordering, even when the right tool was selected. Reliable agent behavior depends as much on execution quality and structured feedback as on tool selection—environment design matters.
These challenges are not unique to scheduling and calendars. They reflect broader limitations that emerge whenever agents operate in changing systems over long periods of time, and they point toward evaluation frameworks that test permissions, partial observability, and multi-step workflows together.
Looking Ahead
OpenEnv provides a foundation for testing agents under realistic conditions, and the Calendar Gym demonstrates how seemingly simple domains can surface deep challenges in reasoning, ambiguity resolution, and tool use. By evaluating agents where failure is measurable and constraints are real, we gain clearer insight into what it takes to build agents that operate reliably in production.
For a deeper dive into the Calendar Gym's design, benchmarking methodology, and quantitative results, explore the full technical article on Turing's site. To explore a clone of the Calendar Gym, visit the Calendar Gym space.
Appendix: Common error cases in tool use
In practice, tool integrations rarely fail in dramatic ways; they fail in small, predictable ones. When wiring up MCP tools to real APIs (like calendar operations), we encountered a handful of recurring issues.
Specific error cases found in the wild
Below are three common failure modes we’ve seen in production, along with representative error payloads and mitigation strategies. These examples illustrate not just what can go wrong, but how structured errors can help agents recover gracefully.
1. Schema validation errors (missing or malformed arguments)
The agent calls a valid tool (e.g. events_insert), but the arguments do not match the declared JSON schema.
- Missing required fields like calendarId
- Incorrect nesting of start/end
- Passing a string where an object is expected
Example error payload:
{
  "ok": false,
  "error_type": "validation_error",
  "tool_name": "events_insert",
  "message": "Invalid arguments for tool 'events_insert'.",
  "details": {
    "missing_required_fields": ["calendarId", "end"],
    "invalid_fields": [
      {
        "field": "start",
        "expected_type": "object",
        "received_type": "string"
      }
    ]
  }
}
We can mitigate this by providing one canonical example of a correct events_insert call in the prompt and by returning structured validation errors so the model can repair and retry instead of failing silently.
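As a sketch of that repair-and-retry loop, assuming the structured error above is surfaced to the agent through result.observation.tool_result, and with repair_arguments standing in for whatever fixes the call (typically another LLM pass):

def call_with_repair(client, tool_name, arguments, repair_arguments):
    result = client.step(MCPAction(
        action_type="ToolCallAction",
        tool_name=tool_name,
        arguments=arguments
    ))
    error = result.observation.tool_result  # assumed location of the error payload
    if isinstance(error, dict) and error.get("error_type") == "validation_error":
        # Feed the structured details back and retry exactly once.
        fixed = repair_arguments(arguments, error["details"])
        result = client.step(MCPAction(
            action_type="ToolCallAction",
            tool_name=tool_name,
            arguments=fixed
        ))
    return result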
2. Permission / authorization errors (401/403)
The tool call is syntactically correct, but the API rejects it due to insufficient permissions.
- Missing OAuth scopes
- Expired access token
- User lacks write access to the target calendar
Example error payload:
{
  "ok": false,
  "error_type": "permission_error",
  "tool_name": "events_insert",
  "http_status": 403,
  "message": "The authenticated user does not have write access to calendar 'primary'.",
  "remediation": [
    "Ensure the OAuth token includes calendar write scope.",
    "Verify the user has edit access to the target calendar.",
    "Reconnect the integration if the token has expired."
  ]
}
We can mitigate this by clearly documenting the required OAuth scopes and by returning structured, actionable remediation steps so the agent can guide the user instead of retrying the same failing call.
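A minimal sketch of that policy, assuming the payload shown above is what the agent receives: treat permission_error as non-retryable and hand the remediation list to the user rather than looping on the same call.

def handle_permission_error(error):
    # Non-retryable: repeating the identical call will keep returning 403.
    if error.get("error_type") == "permission_error":
        steps = "; ".join(error.get("remediation", []))
        return f"I don't have access to that calendar. To fix it: {steps}"
    return None  # other error types may be worth a repair-and-retry pass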
3. Datetime / format errors (RFC3339 & timezone issues)
The event is rejected by the API, or it is created at an unexpected time.
- Missing timezone offset
- Non-RFC3339 datetime format
- Incorrect nesting of start.dateTime or end.dateTime
- Mixing local time and UTC without specifying an offset
Example error payload:
{
  "ok": false,
  "error_type": "format_error",
  "tool_name": "events_insert",
  "message": "Invalid datetime format for field 'start.dateTime'.",
  "details": {
    "received": "02/11/2026 9:30 AM",
    "expected_format": "RFC3339 (e.g. 2026-02-11T09:30:00-05:00)"
  }
}
We can mitigate this by standardizing on RFC3339 with explicit timezone offsets (e.g. 2026-02-11T09:30:00-05:00). Include at least one correct datetime example in your documentation to anchor model behavior and reduce repair retries.
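A minimal sketch of that normalization using only the Python standard library (zoneinfo needs Python 3.9+); the input format and timezone here are illustrative assumptions, not something the Calendar Gym prescribes.

from datetime import datetime
from zoneinfo import ZoneInfo

def to_rfc3339(raw, tz="America/New_York", fmt="%m/%d/%Y %I:%M %p"):
    # Parse the human-entered local time, attach an explicit timezone, and
    # emit RFC3339 with the correct UTC offset.
    local = datetime.strptime(raw, fmt).replace(tzinfo=ZoneInfo(tz))
    return local.isoformat()

print(to_rfc3339("02/11/2026 9:30 AM"))  # -> 2026-02-11T09:30:00-05:00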
