An Embodied AR Navigation Agent

Integrating BIM with Retrieval-Augmented Generation for Language Guidance

ISMAR 2025
Woven by Toyota, Inc., Japan
AR Navigation System Overview

The proposed AR navigation system integrates a multi-agent RAG framework with BIM data to support flexible natural language queries for goal retrieval, interaction, and navigation. Communication is enabled through an embodied AR agent interface.

Abstract

Delivering intelligent and adaptive navigation assistance in augmented reality (AR) requires more than visual cues: it demands systems capable of interpreting flexible user intent and reasoning over both spatial and semantic context. Prior AR navigation systems often rely on rigid input schemes or predefined commands, which limit the utility of rich building data and hinder natural interaction. In this work, we propose an embodied AR navigation system that integrates Building Information Modeling (BIM) with a multi-agent retrieval-augmented generation (RAG) framework to support flexible, language-driven goal retrieval and route planning. The system orchestrates three language agents, Triage, Search, and Response, built on large language models (LLMs), enabling robust interpretation of open-ended queries and spatial reasoning over BIM data. Navigation guidance is delivered through an embodied AR agent, equipped with voice interaction and locomotion, to enhance user experience. A real-world user study yields a System Usability Scale (SUS) score of 80.5, indicating excellent usability, and comparative evaluations show that the embodied interface can significantly improve users' perception of system intelligence. These results underscore the importance and potential of language-grounded reasoning and embodiment in the design of user-centered AR navigation systems.

Multi-Agent RAG System for Navigation

System Architecture Overview

Our approach combines Building Information Modeling (BIM) with a multi-agent retrieval-augmented generation (RAG) framework. The system preprocesses BIM data into a vector database using sentence transformers, enabling semantic similarity search. Three specialized agents orchestrate the navigation process: the Triage Agent classifies user queries and extracts semantic targets, the Search Agent performs vector similarity search and candidate selection using LLM reasoning, and the Response Agent generates contextually appropriate navigation instructions. An embodied AR agent delivers guidance through voice interaction, natural locomotion, and adaptive synchronization with user movement.
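
The sketch below illustrates one plausible wiring of this pipeline; it is not the authors' implementation. The BIM entries, prompts, and agent function names (triage, search, respond) are illustrative assumptions, the all-MiniLM-L6-v2 encoder stands in for whichever sentence transformer the system uses, and call_llm is a placeholder for the underlying LLM API.

# A minimal sketch of the three-agent RAG pipeline described above.
# All entries, prompts, and model names are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

# Offline step: embed BIM entries into a small vector "database".
encoder = SentenceTransformer("all-MiniLM-L6-v2")
bim_entries = [
    {"id": "V2015", "text": "Meeting Room V2015, 12 seats, second floor"},
    {"id": "CAFE",  "text": "Coffee Shop, serves food and drinks, ground floor"},
    {"id": "RECEP", "text": "Counter Reception, information desk, main entrance"},
]
bim_embeddings = encoder.encode([e["text"] for e in bim_entries], convert_to_tensor=True)

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to the underlying LLM."""
    raise NotImplementedError  # replace with an actual LLM client

# Triage Agent: classify the query and extract the semantic target.
def triage(query: str) -> str:
    return call_llm(
        "Decide whether this is a navigation request and extract the place "
        f"the user wants to reach: {query}"
    )

# Search Agent: vector similarity search over BIM, then LLM candidate selection.
def search(target: str, top_k: int = 3) -> dict:
    query_emb = encoder.encode([target], convert_to_tensor=True)
    hits = util.semantic_search(query_emb, bim_embeddings, top_k=top_k)[0]
    candidates = [bim_entries[h["corpus_id"]] for h in hits]
    choice = call_llm(f"User wants: {target}. Pick the best match from: {candidates}")
    return next(c for c in candidates if c["id"] in choice)

# Response Agent: generate the spoken navigation instruction for the AR agent.
def respond(query: str, goal: dict) -> str:
    return call_llm(
        f"The user asked: '{query}'. The retrieved goal is {goal['text']}. "
        "Reply with a short, friendly navigation instruction."
    )

def navigate(query: str) -> str:
    return respond(query, search(triage(query)))

In the full system, the retrieved goal would additionally drive route planning over the BIM geometry before the embodied agent delivers guidance.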

System Demonstrations

User Query: "Hi, can you take me to somewhere that I can get some food?"
Retrieved Goal: Coffee Shop
User Query: "I want to go to the reception counter."
Retrieved Goal: Counter Reception
User Query: "Can you take me to meeting room 2015?"
Retrieved Goal: Meeting Room V2015

Comparative User Studies

Navigation Interface Design

Interface Comparison

Illustrations of the AR navigation schemes: (a) the arrow-only baseline scheme, and (b) the proposed embodied agent scheme, which complements arrows with a virtual agent.

Comparative Evaluation

Comparative User Ratings

Comparative evaluation of user preferences between arrow-only and embodied agent interfaces. Positive values indicate preference toward the embodied agent, showing significant improvements in perceived intelligence, engagement, and clarity.

Measure                  | Median Score | p-value    | Effect Size (r) | Preference Direction
Clarity & Usability      | 4.0          | .002 **    | 0.81            | Agent > Arrow
Engagement & Enjoyment   | 5.0          | < .001 *** | 1.00            | Agent > Arrow
Perceived Intelligence   | 5.0          | < .001 *** | 1.00            | Agent > Arrow
Trustworthiness          | 4.0          | .038 *     | 0.61            | Agent > Arrow
Cognitive Load           | 4.0          | .040 *     | 0.57            | Agent > Arrow

User Study Results (N = 20): Wilcoxon signed-rank test comparing the embodied agent against the arrow-only interface (neutral midpoint = 3). Significance levels: p < .05 (*), p < .01 (**), p < .001 (***)
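
The following sketch shows one way per-measure statistics like those above could be computed from per-participant 5-point ratings, using a one-sample Wilcoxon signed-rank test against the neutral midpoint of 3 and an effect size r recovered from the normal approximation; the exact procedure and the example ratings are assumptions, not the study's data.

# Hedged sketch of the statistical analysis; ratings below are hypothetical.
import numpy as np
from scipy.stats import wilcoxon, norm

def wilcoxon_vs_midpoint(ratings, midpoint=3.0):
    """One-sample Wilcoxon signed-rank test against a neutral midpoint."""
    diffs = np.asarray(ratings, dtype=float) - midpoint
    stat, p = wilcoxon(diffs)            # two-sided by default
    z = norm.isf(p / 2)                  # recover |Z| from the two-sided p-value
    r = z / np.sqrt(len(diffs))          # common effect-size estimate r = |Z| / sqrt(N)
    return p, r

# Hypothetical ratings for one measure from N = 20 participants.
perceived_intelligence = [5, 5, 4, 5, 5, 4, 5, 5, 5, 4,
                          5, 5, 4, 5, 5, 5, 4, 5, 5, 5]
p, r = wilcoxon_vs_midpoint(perceived_intelligence)
print(f"p = {p:.4f}, r = {r:.2f}")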

System Usability Scale (SUS) Results

SUS Results

The system achieved a mean SUS score of 80.5 (SD = 11.5), indicating excellent usability. Participants gave high ratings for ease of use (4.35), learnability (4.65), and system integration (4.25). Labels marked with an asterisk (*) indicate reverse-scored items.
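
For reference, standard SUS scoring (which the reverse-scored items above refer to) works as sketched below; the example responses are hypothetical and are not the study's data.

# Standard SUS scoring (Brooke, 1996): odd items contribute (rating - 1),
# even, reverse-scored items contribute (5 - rating); the sum is scaled by 2.5.
def sus_score(responses):
    """responses: ten 1-5 ratings in questionnaire order."""
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

# Hypothetical participant: 4 on positive items, 2 on reverse-scored items.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # -> 75.0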

Qualitative Results

Case Study Examples

Query Sample Cases

Examples of successful goal retrieval and navigation route generation, demonstrating the system's ability to interpret natural language queries in context, including indirect queries such as "I'm hungry, where can I find food?" and context-rich queries such as "meeting room with more than 10 seats".

User Trajectories

Visualizations of user trajectories and navigation paths traversed by the embodied agent.
Purple lines represent navigation routes; aqua blue lines indicate actual user walking trajectories.

User Query: I would like to get some information about this building can you tell me where should I go?
Retrieved Goal: Counter Reception
Response: Sure, I can help you with that. Please head over to the information desk.
User Query: I want to go to the meeting room of 2001 could you take me?
Retrieved Goal: Meeting Room V2001
Response: Sure, I can take you to the meeting room V2001.
User Query: OK please take me to the V2014.
Retrieved Goal: Meeting Room V2014
Response: Let's head to meeting room V2014, also known as Sakura.
User Query: Could you please take me to the men's toilet?
Retrieved Goal: Men's Toilet
Response: Sure, let me take you to the men's toilet.

Citation

@inproceedings{embodied_ar_navigation_2025,
  title={An Embodied AR Navigation Agent: Integrating BIM with Retrieval-Augmented Generation for Language Guidance},
  author={Yang, Hsuan-Kung and Hsiao, Tsu-Ching and Oka, Ryoichiro and Nishino, Ryuya and Tofukuji, Satoko and Kobori, Norimasa},
  booktitle={Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR)},
  year={2025},
}