Debugging RAG Chatbots and AI Agents with Sessions
Where in a multi-step process does your AI agent start hallucinating? Have you noticed consistent issues with a specific part of your agentic workflow?
These are common questions we faced when building our own RAG-powered chatbots and AI agents. Getting reliable responses and minimizing errors like hallucination was incredibly challenging without visibility into how our users interacted with our large language models.
In this guide, we explore common AI agent pitfalls, how to debug multi-step processes using Helicone's Sessions, and the best tools for building reliable, production-ready AI agents.
What you will learn:
- AI agents vs. traditional chatbots
- Components of an AI agent
- Challenges of debugging AI agents
- How to debug AI agents using Sessions
- 6 Popular agent debugging tools
AI Agents vs. Traditional Chatbots
Unlike traditional chatbots, which follow explicit instructions or rules, AI agents can autonomously perform specific tasks with advanced decision-making abilities. They interact with their environment by collecting data, processing it, and deciding on the best actions to achieve a predefined goal.
Examples of AI Agents
| Type | What it does | Example |
| --- | --- | --- |
| Copilots | Help users by providing suggestions and recommendations | Suggest code snippets, highlight potential bugs, and offer optimization tips. The developer decides whether to implement the suggestions. |
| Autonomous Agents | Perform tasks independently without human intervention | Handle customer inquiries by identifying issues, accessing account info, processing refunds, updating account details, and responding to customers. Can escalate to human agents when needed. |
| Multi-Agent Systems | Multiple autonomous agents interact and collaborate to achieve collective goals | Leverage advantages like dynamic reasoning, distributed task handling, and improved memory retention. |
Use of RAG to Improve Functionality
Retrieval-Augmented Generation (RAG) is an advanced method that allows the agent to incorporate information from external knowledge bases (e.g., databases, documents, articles) into the response.
RAG significantly improves response quality: the agent gains access to the most recent data, retrieved by keyword, semantic similarity, or other advanced search techniques, and uses it to generate more accurate, personalized, and context-specific responses.
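To make the retrieval step concrete, here is a minimal sketch in TypeScript. It assumes a small in-memory knowledge base of pre-embedded documents and uses OpenAI's embeddings and chat APIs; the model names and the `knowledgeBase` shape are illustrative, not a prescribed setup.

```ts
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function embed(text: string): Promise<number[]> {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// knowledgeBase: pre-embedded documents (hypothetical in-memory store;
// a production system would use a real vector database).
async function answerWithRag(
  question: string,
  knowledgeBase: { text: string; embedding: number[] }[]
) {
  const queryEmbedding = await embed(question);

  // Rank documents by semantic similarity and keep the top 3 as context.
  const context = knowledgeBase
    .map((doc) => ({ ...doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3)
    .map((doc) => doc.text)
    .join("\n---\n");

  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content;
}
```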
Core Components of AI Agents
Typically, AI agents consist of four core components:

- Planning: Given a goal, agents plan and sequence actions and formulate strategies.

- Tool / Vector DB Calls: Advanced AI agents can interact with external tools, APIs, and services through function calls to handle more complicated tasks (see the sketch after this list). For example:

  - Fetching real-time weather data or stock prices.
  - Using translation services.
  - Performing image recognition or editing using specialized libraries.
  - Running custom scripts to automate workflows.

- Perception: AI agents can also process information from their environment, making them more interactive and context-aware. This sensory information can include visual, auditory, or other types of data that help the agents respond appropriately to environmental cues.

- Memory: AI agents can remember past interactions, including tools previously used and their planning decisions. This context is stored to help agents self-reflect and inform future actions.
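As a rough illustration of the tool-calling component, here is a minimal function-calling loop in TypeScript using the OpenAI SDK. The `get_weather` tool and its stub implementation are hypothetical; a real agent would call an actual API.

```ts
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical local tool; a real agent would hit a weather API here.
async function getWeather(city: string): Promise<string> {
  return JSON.stringify({ city, tempC: 22, condition: "sunny" });
}

async function run(): Promise<string | null> {
  const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
    { role: "user", content: "What's the weather in Tokyo right now?" },
  ];

  // 1. Let the model decide whether it needs the tool.
  const first = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages,
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather",
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
  });

  const toolCall = first.choices[0].message.tool_calls?.[0];
  if (!toolCall || toolCall.type !== "function") {
    return first.choices[0].message.content;
  }

  // 2. Execute the function the model asked for.
  const args = JSON.parse(toolCall.function.arguments);
  const result = await getWeather(args.city);

  // 3. Return the result so the model can answer in natural language.
  messages.push(first.choices[0].message);
  messages.push({ role: "tool", tool_call_id: toolCall.id, content: result });

  const second = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages,
  });
  return second.choices[0].message.content;
}

run().then(console.log);
```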
Challenges of Debugging AI Agents
⚠️ Their decision making process is complicated.
An AI agent's adaptive behavior makes its decision paths non-deterministic and harder to trace. Agents base their decisions on many inputs from diverse data sources (user interactions, environmental data, and internal state), and they learn through patterns and correlations identified in that data.
⚠️ No visibility into their internal states.
AI agents function as "black boxes": understanding how they transform inputs into outputs is not straightforward. Whenever an agent interacts with external services, APIs, or other agents, its behavior can be unpredictable.
⚠️ Context builds up over time, and so do errors.
Agents often make multiple dependent vector database calls within a single session, which complicates tracing the data flow. They can also operate over long sessions, where an early error has cascading effects, making the original source difficult to identify without proper session tracking.
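Session tracking starts with tagging every LLM call with the session it belongs to. Here is a minimal sketch of how this looks with Helicone's proxy and session headers (header names per Helicone's docs at the time of writing; the model and prompt are placeholders):

```ts
import OpenAI from "openai";
import { randomUUID } from "node:crypto";

// Route requests through Helicone's proxy so they are logged.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// One ID per user session; every request tagged with it is grouped
// together in the Sessions view.
const sessionId = randomUUID();

const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Find me a flight to Tokyo." }],
  },
  {
    headers: {
      "Helicone-Session-Id": sessionId,
      "Helicone-Session-Name": "Travel Chatbot",
      "Helicone-Session-Path": "/search-flights", // which step this call belongs to
    },
  }
);
```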
Scenario 1: Debugging errors in a travel chatbot
A travel chatbot assists users in booking flights, hotels, and rental cars. Issues with data parsing or third-party integrations are common, and users are frustrated by incomplete bookings.
Issue
While building this chatbot, we noticed that the LLM occasionally provided incorrect parameters to the flight search API, leading to incorrect bookings.
How to debug using Helicone's Sessions
- Go to the `Sessions` tab and find related requests

  To examine what's causing the error, go to `Sessions` and check the requests related to the action where the error occurred. In this case, it's the `search-flights` API call.
- Take a closer look at the request

  Next, click on the request to inspect its details. For example, the user input says:

  > I've just arrived at JFK, where can I go to reLAX?

  In this case, the LLM misinterpreted capitalized letters as airport codes, mistakenly assuming the user wants to book a flight from `JFK` (New York) to `LAX` (Los Angeles).
- Fix the bug in the Playground

  To fix this, we go to the `Playground` and improve the prompt to ignore capitalized words. We discovered that converting the entire string to lowercase before passing it to the LLM allowed it to correctly parse airport codes while ignoring unrelated capitalized words. The chatbot now responds:

  > Unfortunately, there are no flights available from JFK at the moment that match your search for relaxation destinations. You might want to consider checking for flights at different times or days, or you could explore relaxing spots within the airport such as lounges or nearby hotels.

  The chatbot now accurately interprets user intent and returns the correct departure and arrival airports, leading to more reliable flight search results.
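For reference, the preprocessing fix described above can be as small as a one-line transform applied before the request is sent. The prompt wording below is illustrative:

```ts
// Lowercase the raw input so stray capitals (like the "LAX" in "reLAX")
// are no longer mistaken for airport codes by the LLM.
function normalizeUserInput(raw: string): string {
  return raw.trim().toLowerCase();
}

const userInput = "I've just arrived at JFK, where can I go to reLAX?";

const prompt = `Extract the departure and arrival airports the user explicitly
mentions as airports. Do not infer airport codes from substrings of ordinary words.

User message: ${normalizeUserInput(userInput)}`;
```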
Scenario 2: Personalize responses to match user intent
A health and fitness chatbot takes the user's goals and creates personalized workout and dietary plans.
Issue
Users report that the chatbot generates inconsistent dietary recommendations when prompted multiple times. For example, a user who previously selected a high-protein diet for muscle gain might later receive a low-calorie meal plan designed for weight loss.
How to debug using Helicone's Sessions
- Go to the `Sessions` tab

  Identify recent session logs for users who reported unexpected diet plan changes. As a first step, verify that users' medical records are being correctly analyzed.
- Analyze the "Medical History" retrieval step

  In the `medical-history` trace, check whether the chatbot is correctly retrieving users' medical history from the vector database. In this case, we found that the LLM failed to look up relevant user data in the vector DB, causing inconsistencies in the retrieved dietary preferences.
- Fix the bug in code

  Upon reviewing the code, we found a bug in the data embedding process that led to incorrect storage and retrieval of user data. After fixing the issue, tests in the `Playground` confirmed that the chatbot now consistently provides diet plans that align with user goals.
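The exact bug isn't shown here, but this class of embedding bug is common: vectors written with one embedding model or text normalization and queried with another, so lookups silently miss. A hedged sketch of the defensive fix, pinning a single shared embedding path for both storage and query (model name and normalizer are illustrative):

```ts
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Use the SAME model and the SAME text normalization for both
// writing vectors to the store and querying them later; mixing
// them causes retrieval to silently miss relevant records.
const EMBEDDING_MODEL = "text-embedding-3-small";

function normalize(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

// Single embedding entry point shared by the storage and query paths.
export async function embedForStorageOrQuery(text: string): Promise<number[]> {
  const res = await client.embeddings.create({
    model: EMBEDDING_MODEL,
    input: normalize(text),
  });
  return res.data[0].embedding;
}
```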
Scenario 3: Create custom learning materials
An AI tutoring service built an AI agent that creates customized learning materials. It's important that the agent generates lessons that are both accurate and comprehensive.
How to debug using Helicone's Sessions
The `Sessions` view outlines the structure of the generated course. Each trace in a session shows how the agent interpreted your request and the content it produced.

Skimming through, wherever the agent misunderstood a topic or failed to cover a key concept, you can fine-tune that specific prompt using the `Playground` or `Experiments`. The goal is to generate more thorough content while making sure it's appropriate for the student's learning level.
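To get that course outline in the Sessions view, each generation step can be tagged with a hierarchical `Helicone-Session-Path`, so traces render as a tree mirroring the course structure. A sketch, assuming the Helicone proxy setup shown earlier (paths and prompts are illustrative; header names per Helicone's docs):

```ts
import OpenAI from "openai";
import { randomUUID } from "node:crypto";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: { "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}` },
});

const sessionId = randomUUID();

// Each step is tagged with a path; nested paths appear as a tree
// in the Sessions view, e.g. /course -> /course/lesson-1.
async function generateStep(path: string, prompt: string): Promise<string> {
  const res = await client.chat.completions.create(
    { model: "gpt-4o-mini", messages: [{ role: "user", content: prompt }] },
    {
      headers: {
        "Helicone-Session-Id": sessionId,
        "Helicone-Session-Name": "Course Generator",
        "Helicone-Session-Path": path,
      },
    }
  );
  return res.choices[0].message.content ?? "";
}

const outline = await generateStep("/course", "Outline a beginner Python course.");
const lesson1 = await generateStep("/course/lesson-1", `Write lesson 1 based on: ${outline}`);
```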
6 Popular Agent Debugging Tools
One way we try to debug agents is by understanding the model's internal workings, but we found that traditional logging methods often lack the granular data needed to debug complex behaviors effectively.
Fortunately, there are tools that help streamline the debugging process:
1. Helicone open-source
Helicone's Sessions is ideal for teams looking to intuitively visualize agentic workflows. It caters to developers building both simple and advanced agents who need to group related LLM calls, trace nested agent workflows, quickly identify issues, and track requests, responses, and metadata sent to the vector database.
2. AgentOps
AgentOps can be a good choice for teams looking for a comprehensive solution to debug AI agents. Despite a less intuitive interface, it offers robust features for monitoring and managing AI agents.
3. Langfuse
Langfuse is ideal for developers who prefer self-hosting solutions and have simpler infrastructure needs. It offers features similar to Helicone's and is well-suited for projects with modest scalability requirements or those prioritizing local deployment over cloud-based solutions.
4. LangSmith
LangSmith is ideal for developers working extensively with the LangChain framework as its SDKs and documentation are designed to support developers within this ecosystem best.
5. Braintrust
Braintrust is a good choice for those focused on evaluating AI models. It's an effective solution for projects where model evaluation is a primary concern and agent tracing is a secondary need.
6. Portkey
Portkey is designed for developers looking for the latest tools to track and debug AI agents. It ships new features quickly, which is great for teams that want the newest capabilities and can tolerate occasional reliability and stability issues.
Building Production-Ready AI Agents
We're already seeing AI agents in action across various fields like customer service, travel, health and fitness, as well as education. However, for AI agents to be truly production-ready and widely adopted, we need to continue to improve their reliability and accuracy.
This requires us to actively monitor their decision-making processes and develop a deep understanding of how inputs influence outputs. The most effective way is to use monitoring tools that give you the insights to make sure your AI agents consistently deliver the results you want.
Additional Resources
- How to set up Sessions
- How to log Vector DB calls with Helicone's Javascript SDK
- How to Optimize AI Agents by Replaying LLM Sessions
- 6 open-source frameworks for building AI agents
Questions or feedback?
Is the information out of date? Please raise an issue or contact us; we'd love to hear from you!