Evaluate Amazon Bedrock Agents with Ragas and LLM-as-a-judge

by Maxim Makedonsky
May 11, 2025


AI agents are quickly becoming an integral part of customer workflows across industries by automating complex tasks, enhancing decision-making, and streamlining operations. However, adopting AI agents in production systems requires scalable evaluation pipelines. Robust agent evaluation enables you to gauge how well an agent performs its intended actions and to gain key insights into its behavior, enhancing AI agent safety, control, trust, transparency, and performance optimization.

Amazon Bedrock Agents uses the reasoning of foundation models (FMs) available on Amazon Bedrock, APIs, and data to break down user requests, gather relevant information, and efficiently complete tasks—freeing teams to focus on high-value work. You can enable generative AI applications to automate multistep tasks by seamlessly connecting with company systems, APIs, and data sources.

Ragas is an open source library for testing and evaluating large language model (LLM) applications across various LLM use cases, including Retrieval Augmented Generation (RAG). The framework enables quantitative measurement of the effectiveness of the RAG implementation. In this post, we use the Ragas library to evaluate the RAG capability of Amazon Bedrock Agents.
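The following is a minimal sketch of scoring a single RAG exchange with Ragas, using the classic evaluate() API and the four RAG metrics applied later in this post. The sample data is illustrative; depending on your Ragas version, you may also need to pass Bedrock-backed llm and embeddings wrappers, because Ragas otherwise defaults to OpenAI for its judge and embedding calls.

# Minimal Ragas sketch: score one RAG question/answer pair (illustrative data).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # factual consistency with retrieved context
    answer_relevancy,    # how directly the question is addressed
    context_recall,      # how much relevant information was retrieved
    answer_similarity,   # semantic similarity to the ground truth
)

sample = Dataset.from_dict({
    "question": ["Was Abraham Lincoln the sixteenth President of the United States?"],
    "answer": ["Yes, Abraham Lincoln was the sixteenth President of the United States."],
    "contexts": [["Abraham Lincoln was the sixteenth President of the United States."]],
    "ground_truth": ["yes"],
})

# Ragas calls a judge LLM and an embedding model for these metrics; pass
# llm= and embeddings= arguments here to use Bedrock-backed models instead
# of the OpenAI defaults.
result = evaluate(sample, metrics=[faithfulness, answer_relevancy, context_recall, answer_similarity])
print(result)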

LLM-as-a-judge is an evaluation approach that uses LLMs to assess the quality of AI-generated outputs. This method employs an LLM to act as an impartial evaluator that analyzes and scores outputs. In this post, we employ the LLM-as-a-judge technique to evaluate the text-to-SQL and chain-of-thought capabilities of Amazon Bedrock Agents.
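As an illustration, the following minimal sketch shows the general LLM-as-a-judge pattern using the Amazon Bedrock Converse API with boto3. The rubric, 0-1 scale, and model ID are illustrative choices, not the framework's exact evaluator prompts.

# Minimal LLM-as-a-judge sketch with the Amazon Bedrock Converse API.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def judge(question: str, agent_answer: str, ground_truth: str) -> dict:
    prompt = (
        "You are an impartial evaluator. Score the candidate answer against the "
        "ground truth on a 0.0-1.0 scale and explain briefly.\n"
        f"Question: {question}\nCandidate answer: {agent_answer}\n"
        f"Ground truth: {ground_truth}\n"
        'Respond with JSON only: {"score": <float>, "explanation": "<string>"}'
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any Bedrock judge model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 512},
    )
    # Assumes the judge model returns well-formed JSON in its text output.
    return json.loads(response["output"]["message"]["content"][0]["text"])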

Langfuse is an open source LLM engineering platform, which provides features such as traces, evals, prompt management, and metrics to debug and improve your LLM application.
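For example, the following minimal sketch logs one evaluation result to Langfuse with the v2-style Python SDK. The trace name, metadata, and scores are illustrative; credentials are read from the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables.

# Minimal Langfuse sketch: record a trace and attach evaluation scores to it.
from langfuse import Langfuse

langfuse = Langfuse()  # reads keys and host from environment variables

trace = langfuse.trace(
    name="bedrock-agent-eval",
    input={"question": "Was Abraham Lincoln the sixteenth President of the United States?"},
    output={"answer": "Yes."},
    metadata={"agent_type": "RAG"},
)
trace.score(name="faithfulness", value=0.87, comment="LLM-as-a-judge explanation here")
trace.score(name="answer_relevancy", value=0.91)

langfuse.flush()  # send buffered events before the process exits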

In the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents, we showcased research agents for cancer biomarker discovery for pharmaceutical companies. In this post, we extend the prior work and showcase Open Source Bedrock Agent Evaluation with the following capabilities:

  • Evaluating Amazon Bedrock Agents on their capabilities (RAG, text-to-SQL, custom tool use) and overall chain-of-thought reasoning
  • Comprehensive evaluation results and trace data sent to Langfuse with built-in visual dashboards
  • Trace parsing and evaluations for various Amazon Bedrock Agents configuration options

First, we conduct evaluations on a variety of different Amazon Bedrock Agents. These include a sample RAG agent, a sample text-to-SQL agent, and pharmaceutical research agents that use multi-agent collaboration for cancer biomarker discovery. Then, for each agent, we showcase navigating the Langfuse dashboard to view traces and evaluation results.

Technical challenges

Today, AI agent developers generally face the following technical challenges:

  • End-to-end agent evaluation – Although Amazon Bedrock provides built-in evaluation capabilities for LLMs and RAG retrieval, it lacks metrics specifically designed for Amazon Bedrock Agents. There is a need to evaluate the holistic agent goal, as well as individual agent trace steps for specific tasks and tool invocations. Support is also needed for both single-agent and multi-agent setups, and for both single-turn and multi-turn datasets.
  • Challenging experiment management – Amazon Bedrock Agents offers numerous configuration options, including LLM model selection, agent instructions, tool configurations, and multi-agent setups. However, conducting rapid experimentation with these parameters is technically challenging due to the lack of systematic ways to track, compare, and measure the impact of configuration changes across different agent versions. This makes it difficult to effectively optimize agent performance through iterative testing.

Solution overview

The following figure illustrates how Open Source Bedrock Agent Evaluation works at a high level. The framework runs an evaluation job that invokes your own agent in Amazon Bedrock and evaluates its responses.

Evaluation Workflow

The workflow consists of the following steps:

  1. The user specifies the agent ID, alias, evaluation model, and dataset containing question and ground truth pairs.
  2. The user executes the evaluation job, which will invoke the specified Amazon Bedrock agent.
  3. The retrieved agent invocation traces are run through custom parsing logic in the framework.
  4. The framework conducts an evaluation based on the agent invocation results and the question type:
    1. Chain-of-thought – LLM-as-a-judge with Amazon Bedrock LLM calls (conducted on every question in every evaluation run, regardless of question type)
    2. RAG – Ragas evaluation library
    3. Text-to-SQL – LLM-as-a-judge with Amazon Bedrock LLM calls
  5. Evaluation results and parsed traces are gathered and sent to Langfuse for evaluation insights.
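A minimal sketch of what such an evaluation loop might look like with boto3 is shown below. Here, parse_trace, evaluate_chain_of_thought, evaluate_rag, evaluate_text2sql, and send_to_langfuse are hypothetical placeholders standing in for the framework's own parsing, evaluation, and reporting logic.

# Sketch of the evaluation loop: invoke the agent with tracing enabled,
# parse the traces, route to the right evaluator, and report to Langfuse.
import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def run_evaluation(agent_id: str, agent_alias_id: str, trajectories: dict):
    for trajectory_name, questions in trajectories.items():
        session_id = str(uuid.uuid4())  # one session per trajectory
        for item in questions:
            response = agent_runtime.invoke_agent(
                agentId=agent_id,
                agentAliasId=agent_alias_id,
                sessionId=session_id,
                inputText=item["question"],
                enableTrace=True,  # required to capture step-by-step traces
            )
            answer, raw_traces = "", []
            for event in response["completion"]:
                if "chunk" in event:
                    answer += event["chunk"]["bytes"].decode("utf-8")
                if "trace" in event:
                    raw_traces.append(event["trace"])

            parsed = parse_trace(raw_traces)                      # step 3 (placeholder)
            scores = evaluate_chain_of_thought(item, answer, parsed)  # run on every question
            if item["question_type"] == "RAG":
                scores.update(evaluate_rag(item, answer, parsed))      # Ragas metrics
            elif item["question_type"] == "TEXT2SQL":
                scores.update(evaluate_text2sql(item, answer, parsed))  # LLM-as-a-judge

            send_to_langfuse(trajectory_name, item, answer, parsed, scores)  # step 5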

Prerequisites

To deploy the sample RAG and text-to-SQL agents and follow along with evaluating them using Open Source Bedrock Agent Evaluation, follow the instructions in Deploying Sample Agents for Evaluation.

To bring your own agent to evaluate with this framework, refer to the following README and follow the detailed instructions to deploy the Open Source Bedrock Agent Evaluation framework.

Overview of evaluation metrics and input data

First, we create sample Amazon Bedrock agents to demonstrate the capabilities of Open Source Bedrock Agent Evaluation. The text-to-SQL agent uses the BirdSQL Mini-Dev dataset, and the RAG agent uses the Hugging Face rag-mini-wikipedia dataset.

Evaluation metrics

The Open Source Bedrock Agent Evaluation framework conducts evaluations on two broad types of metrics:

  • Agent goal – Chain-of-thought (run on every question)
  • Task accuracy – RAG, text-to-SQL (run only when the specific tool is used to answer the question)

Agent goal metrics measure how well an agent identifies and achieves the goals of the user. There are two main types: reference-based evaluation and evaluation without reference. Examples can be found in Agent Goal accuracy as defined by Ragas:

  • Reference-based evaluation – The user provides a reference that will be used as the ideal outcome. The metric is computed by comparing the reference with the goal achieved by the end of the workflow.
  • Evaluation without reference – The metric evaluates the performance of the LLM in identifying and achieving the goals of the user without reference.

We will showcase evaluation without reference using chain-of-thought evaluation. We conduct evaluations by comparing the agent’s reasoning against the agent’s instructions. For this evaluation, we use some metrics from the evaluator prompts for Amazon Bedrock LLM-as-a-judge. In this framework, the chain-of-thought evaluations are run on every question that the agent is evaluated against.

Task accuracy metrics measure how well an agent calls the required tools to complete a given task. For the two task accuracy metrics, RAG and text-to-SQL, evaluations are conducted by comparing the actual agent answer against the ground truth that must be provided in the input dataset. The task accuracy metrics are only evaluated when the corresponding tool is used to answer the question.

The following is a breakdown of the key metrics used in each evaluation type included in the framework:

  • RAG:
    • Faithfulness – How factually consistent a response is with the retrieved context
    • Answer relevancy – How directly and appropriately the original question is addressed
    • Context recall – How many of the relevant pieces of information were successfully retrieved
    • Semantic similarity – The assessment of the semantic resemblance between the generated answer and the ground truth
  • Text-to-SQL:
    • Answer correctness – How closely the agent’s final answer matches the ground truth answer
    • SQL semantic equivalence – Whether the generated SQL query is semantically equivalent to the ground truth query
  • Chain-of-thought:
    • Helpfulness – How well the agent satisfies explicit and implicit expectations
    • Faithfulness – How well the agent sticks to available information and context
    • Instruction following – How well the agent respects all explicit directions

User-agent trajectories

The input dataset is in the form of trajectories, where each trajectory consists of one or more questions to be answered by the agent. The trajectories are meant to simulate how a user might interact with the agent. Each trajectory consists of a unique question_id, question_type, question, and ground_truth information. The following are examples of actual trajectories used to evaluate each type of agent in this post.

For simpler agent setups like the RAG and text-to-SQL sample agents, we created trajectories consisting of a single question, as shown in the following examples.

The following is an example of a RAG sample agent trajectory:

{
	"Trajectory0": [
		{
			"question_id": 0,
			"question_type": "RAG",
			"question": "Was Abraham Lincoln the sixteenth President of the United States?",
			"ground_truth": "yes"
		}
	]
}

The following is an example of a text-to-SQL sample agent trajectory:

{
	"Trajectory1": [
		{
			"question_id": 1,
			"question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",
			"question_type": "TEXT2SQL",
			"ground_truth": {
				"ground_truth_sql_query": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",
				"ground_truth_sql_context": "[{'table_name': 'frpm', 'columns': [('cdscode', 'varchar'), ('academic year', 'varchar'), ...",
				"ground_truth_query_result": "1.0",
				"ground_truth_answer": "The highest eligible free rate for K-12 students in schools in Alameda County is 1.0."
			}
		}
	]
}

Pharmaceutical research agent use case example

In this section, we demonstrate how you can use the Open Source Bedrock Agent Evaluation framework to evaluate the pharmaceutical research agents discussed in the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents. That post showcases a variety of specialized agents, including a biomarker database analyst, statistician, clinical evidence researcher, and medical imaging expert, working in collaboration with a supervisor agent.

The pharmaceutical research agent was built using the multi-agent collaboration feature of Amazon Bedrock. The following diagram shows the multi-agent setup that was evaluated using this framework.

HCLS Agents Architecture

As shown in the diagram, the RAG evaluations will be conducted on the clinical evidence researcher sub-agent. Similarly, text-to-SQL evaluations will be run on the biomarker database analyst sub-agent. The chain-of-thought evaluation evaluates the final answer of the supervisor agent to check if it properly orchestrated the sub-agents and answered the user’s question.

Research agent trajectories

For a more complex setup like the pharmaceutical research agents, we used a set of industry-relevant, pregenerated test questions. We grouped questions by topic, regardless of which sub-agents might be invoked to answer them, to create trajectories that include multiple questions spanning multiple types of tool use. With relevant questions already generated, integrating with the evaluation framework simply required formatting the ground truth data into trajectories.
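As a rough illustration, the following sketch shows one way such pregenerated questions could be grouped by topic and written out in the trajectory format. The input field names and output file name are assumptions for this example, not the framework's own schema.

# Sketch: group pregenerated questions by topic into trajectory-formatted JSON.
import json
from collections import defaultdict

# Flat list of pregenerated test questions (field names are illustrative).
pregenerated = [
    {"topic": "EGFR", "type": "RAG",
     "question": "How did the EGF pathway associate with CT imaging features?",
     "ground_truth": "..."},
    {"topic": "EGFR", "type": "TEXT2SQL",
     "question": "What percentage of patients have EGFR mutations?",
     "ground_truth": "..."},
]

# Group questions by topic so each trajectory can span multiple tool types.
by_topic = defaultdict(list)
for question_id, item in enumerate(pregenerated):
    by_topic[item["topic"]].append({
        "question_id": question_id,
        "question_type": item["type"],
        "question": item["question"],
        "ground_truth": item["ground_truth"],
    })

trajectories = {f"Trajectory{i}": items for i, items in enumerate(by_topic.values())}

with open("trajectories.json", "w") as f:
    json.dump(trajectories, f, indent=2)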

We walk through evaluating this agent against a trajectory containing a RAG question and a text-to-SQL question:

{
	"Trajectory1": [
		{
			"question_id": 3,
			"question_type": "RAG",
			"question": "According to the knowledge base, how did the EGF pathway associate with CT imaging features?",
			"ground_truth": "The EGF pathway was significantly correlated with the presence of ground-glass opacity and irregular nodules or nodules with poorly defined margins."
		},
		{
			"question_id": 4,
			"question_type": "TEXT2SQL",
			"question": "According to the database, What percentage of patients have EGFR mutations?",
			"ground_truth": {
				"ground_truth_sql_query": "SELECT (COUNT(CASE WHEN EGFR_mutation_status="Mutant" THEN 1 END) * 100.0 / COUNT(*)) AS percentage FROM clinical_genomic;",
				"ground_truth_sql_context": "Table clinical_genomic: - Case_ID: VARCHAR(50) - EGFR_mutation_status: VARCHAR(50)",
				"ground_truth_query_result": "14.285714",
				"ground_truth_answer": "According to the query results, approximately 14.29% of patients in the clinical_genomic table have EGFR mutations."
			}
		}
	]
}

Chain-of-thought evaluations are conducted for every question, regardless of tool use. This is illustrated in the following screenshots of agent traces and evaluations on the Langfuse dashboard.

After running the agent against the trajectory, the results are sent to Langfuse to view the metrics. The following screenshot shows the trace of the RAG question (question ID 3) evaluation on Langfuse.

Langfuse RAG Trace

The screenshot displays the following information:

  • Trace information (input and output of agent invocation)
  • Trace steps (agent generation and the corresponding sub-steps)
  • Trace metadata (input and output tokens, cost, model, agent type)
  • Evaluation metrics (RAG and chain-of-thought metrics)

The following screenshot shows the trace of the text-to-SQL question (question ID 4) evaluation on Langfuse, which evaluated the biomarker database analyst agent that generates SQL queries to run against an Amazon Redshift database containing biomarker information.

Langfuse text-to-SQL Trace

The screenshot shows the following information:

  • Trace information (input and output of agent invocation)
  • Trace steps (agent generation and the corresponding sub-steps)
  • Trace metadata (input and output tokens, cost, model, agent type)
  • Evaluation metrics (text-to-SQL and chain-of-thought metrics)

The chain-of-thought evaluation is included in both questions’ evaluation traces. For both traces, LLM-as-a-judge is used to generate scores and explanations of the Amazon Bedrock agent’s reasoning on a given question.

Overall, we ran 56 questions grouped into 21 trajectories against the agent. The traces, model costs, and scores are shown in the following screenshot.

Langfuse Dashboard

The following table contains the average evaluation scores across 56 evaluation traces.

Metric Category | Metric Type | Metric Name                      | Number of Traces | Metric Avg. Value
Agent Goal      | COT         | Helpfulness                      | 50               | 0.77
Agent Goal      | COT         | Faithfulness                     | 50               | 0.87
Agent Goal      | COT         | Instruction following            | 50               | 0.69
Agent Goal      | COT         | Overall (average of all metrics) | 50               | 0.77
Task Accuracy   | TEXT2SQL    | Answer correctness               | 26               | 0.83
Task Accuracy   | TEXT2SQL    | SQL semantic equivalence         | 26               | 0.81
Task Accuracy   | RAG         | Semantic similarity              | 20               | 0.66
Task Accuracy   | RAG         | Faithfulness                     | 20               | 0.50
Task Accuracy   | RAG         | Answer relevancy                 | 20               | 0.68
Task Accuracy   | RAG         | Context recall                   | 20               | 0.53

Security considerations

Consider the following security measures:

  • Enable Amazon Bedrock agent logging – As a security best practice when using Amazon Bedrock Agents, enable Amazon Bedrock model invocation logging to capture prompts and responses securely in your account (a minimal sketch follows this list).
  • Check for compliance requirements – Before implementing Amazon Bedrock Agents in your production environment, make sure that the Amazon Bedrock compliance certifications and standards align with your regulatory requirements. Refer to Compliance validation for Amazon Bedrock for more information and resources on meeting compliance requirements.
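The following is a minimal sketch, assuming placeholder resource names, of turning on model invocation logging with boto3; the log group, IAM role ARN, and S3 bucket are placeholders that must already exist in your account.

# Sketch: enable Amazon Bedrock model invocation logging for the account.
import boto3

bedrock = boto3.client("bedrock")

bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/model-invocations",                      # placeholder
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",    # placeholder
        },
        "s3Config": {
            "bucketName": "my-bedrock-invocation-logs",                        # placeholder
            "keyPrefix": "bedrock-logs/",
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": True,
        "embeddingDataDeliveryEnabled": True,
    }
)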

Clean up

If you deployed the sample agents, run the following notebooks to delete the resources created.

If you chose the self-hosted Langfuse option, follow these steps to clean up your AWS self-hosted Langfuse setup.

Conclusion

In this post, we introduced the Open Source Bedrock Agent Evaluation framework, a Langfuse-integrated solution that streamlines the agent development process. The framework comes equipped with built-in evaluation logic for RAG, text-to-SQL, and chain-of-thought reasoning, plus integration with Langfuse for viewing evaluation metrics. With Open Source Bedrock Agent Evaluation, developers can quickly evaluate their agents and rapidly experiment with different configurations, accelerating the development cycle and improving agent performance.

We demonstrated how this evaluation framework can be integrated with pharmaceutical research agents. We used it to evaluate agent performance against biomarker questions and sent traces to Langfuse to view evaluation metrics across question types.

The Open Source Bedrock Agent Evaluation framework enables you to accelerate your generative AI application building process using Amazon Bedrock Agents. To self-host Langfuse in your AWS account, see Hosting Langfuse on Amazon ECS with Fargate using CDK Python. To explore how you can streamline your Amazon Bedrock Agents evaluation process, get started with Open Source Bedrock Agent Evaluation.

Refer to Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications from the Amazon Bedrock team to learn more about multi-agent collaboration and end-to-end agent evaluation.


About the authors

Hasan Poonawala is a Senior AI/ML Solutions Architect at AWS, working with healthcare and life sciences customers. Hasan helps design, deploy, and scale generative AI and machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development, and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Blake Shin is an Associate Specialist Solutions Architect at AWS who enjoys learning about and working with new AI/ML technologies. In his free time, Blake enjoys exploring the city and playing music.

Rishiraj Chandra is an Associate Specialist Solutions Architect at AWS, passionate about building innovative artificial intelligence and machine learning solutions. He is committed to continuously learning and implementing emerging AI/ML technologies. Outside of work, Rishiraj enjoys running, reading, and playing tennis.


