Language-Enhanced Mobile Manipulation for Efficient Object Search in Indoor Environments

Technical University of Munich

*Indicates Equal Contribution

Motivation

Abstract

Enabling robots to efficiently search for and identify objects in complex, unstructured environments is critical for diverse applications ranging from household assistance to industrial automation. However, traditional scene representations typically capture only static semantics and lack interpretable contextual reasoning, limiting their ability to guide object search in completely unfamiliar settings. To address this challenge, we propose a language-enhanced hierarchical navigation framework that tightly integrates semantic perception and spatial reasoning. Our method, Goal-Oriented Dynamically Heuristic-Guided Hierarchical Search (GODHS), leverages large language models (LLMs) to infer scene semantics and guide the search process through a multi-level decision hierarchy. Reliability in reasoning is achieved through the use of structured prompts and logical constraints applied at each stage of the hierarchy. For the specific challenges of mobile manipulation, we introduce a heuristic-based motion planner that combines polar angle sorting with distance prioritization to efficiently generate exploration paths. Comprehensive evaluations in Isaac Sim demonstrate the feasibility of our framework, showing that GODHS can locate target objects with higher search efficiency compared to conventional, non-semantic search strategies.

Video

Approach

GODHS Framework

Complete Architecture of the GODHS System: The user first specifies the target object in natural language, and an LLM extracts the semantic intent. The robot captures geometric data with LiDAR and RGB images with its cameras, and the language model labels the resulting topological map with room names. Within the known map, the LLM ranks rooms by their likelihood of containing the target object, and the robot navigates to them in that order. In each room, the robot detects candidate objects through visual object detection, then uses the LLM to decide which of them are carriers (objects that could hold the target) and to rank all collected carriers by the probability of containing the target. For each carrier, the LLM plans a hierarchical search strategy: it first analyzes geometric features to identify subregions worth searching (top, interior, etc.), and the robot then searches these subregions in order of the probabilities assigned by the LLM. The loop terminates when the LLM verifies, through visual-language grounding of the sensor data, that the target has been recognized.
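The room, carrier, and subregion levels described above can be read as three nested ranked loops. The following is a minimal sketch of that control flow only, not the GODHS implementation: the `World` structure, the `PRIORS` table, and the `score` callable are hypothetical stand-ins for the map, the LLM-assigned probabilities, and the LLM query, and the membership check replaces the visual-language verification step.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical world model: room -> carrier -> subregion -> objects located there.
World = Dict[str, Dict[str, Dict[str, List[str]]]]
Location = Tuple[str, str, str]


def rank(candidates: Iterable[str], score: Callable[[str], float]) -> List[str]:
    """Order candidates from most to least likely (stand-in for LLM ranking)."""
    return sorted(candidates, key=score, reverse=True)


def hierarchical_search(
    world: World, target: str, score: Callable[[str, str], float]
) -> Tuple[Optional[Location], List[Location]]:
    """Visit rooms, then carriers, then subregions in ranked order.

    Returns the location where the target was found (or None) together with
    the list of subregions inspected along the way.
    """
    visited: List[Location] = []
    for room in rank(world, lambda r: score(target, r)):
        for carrier in rank(world[room], lambda c: score(target, c)):
            subregions = world[room][carrier]
            for sub in rank(subregions, lambda s: score(target, s)):
                visited.append((room, carrier, sub))
                # Stand-in for verifying the target from sensor data.
                if target in subregions[sub]:
                    return (room, carrier, sub), visited
    return None, visited


# Toy data; the numbers are made up and only illustrate the ordering.
PRIORS = {"kitchen": 0.9, "bedroom": 0.1, "cabinet": 0.8, "table": 0.5,
          "bed": 0.2, "interior": 0.7, "top": 0.3}
world: World = {
    "bedroom": {"bed": {"top": []}},
    "kitchen": {
        "table": {"top": ["plate"]},
        "cabinet": {"top": [], "interior": ["mug"]},
    },
}
found, visited = hierarchical_search(world, "mug", lambda t, name: PRIORS.get(name, 0.0))
```

With these toy priors the search goes straight to the kitchen cabinet interior, whereas an unranked sweep in dictionary order would inspect the bedroom first.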

Model Chain and Prompt Design

LLM Reasoning Process (Left): The Data Cleaning and Correcting stages ensure that the input and output are structurally valid and semantically coherent, while the Processing stage executes the core task, using Knowledge as augmentation and Feedback to safeguard the result.
Prompt Design (Right): The input embedding carries the input information, output constraints standardize the lexical and syntactic form of the results, task instructions describe the goal, and the feature description gives a constrained description of the relevant features. Additional information and examples complete the prompt.
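As a rough illustration of how the prompt sections and the correcting stage could fit together, the sketch below assembles a sectioned prompt and then validates a model reply against the output constraint. The section titles, the JSON-list output format, and the function names `build_prompt` and `parse_ranking` are assumptions made for this example; the actual GODHS prompts are not reproduced here, and no LLM is called.

```python
import json
from typing import List, Optional


def build_prompt(target: str, candidates: List[str], task: str,
                 features: str, examples: List[str], extra: str = "") -> str:
    """Assemble a prompt from named sections; empty sections are skipped."""
    sections = [
        ("Task instructions", task),
        ("Input", f"target: {target}; candidates: {', '.join(candidates)}"),
        ("Feature description", features),
        ("Additional information", extra),
        ("Examples", "\n".join(examples)),
        ("Output constraints",
         "Return only a JSON list containing every candidate exactly once, "
         "most likely first."),
    ]
    return "\n\n".join(f"### {name}\n{body}" for name, body in sections if body)


def parse_ranking(reply: str, candidates: List[str]) -> Optional[List[str]]:
    """Accept a reply only if it is a JSON list that permutes the candidates.

    Returns the normalized ranking, or None so the caller can re-prompt.
    """
    try:
        ranking = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(ranking, list) or not all(isinstance(x, str) for x in ranking):
        return None
    cleaned = [x.strip().lower() for x in ranking]
    if sorted(cleaned) != sorted(c.lower() for c in candidates):
        return None
    return cleaned


prompt = build_prompt(
    target="mug",
    candidates=["table", "cabinet"],
    task="Rank the carriers by how likely they are to contain the target.",
    features="table: flat open top; cabinet: closed interior with doors.",
    examples=['target: plate; candidates: sofa, shelf -> ["shelf", "sofa"]'],
)
```

Rejecting any reply that is not an exact permutation of the candidates is one simple logical constraint; a failed check would feed back into a retry rather than being passed on to navigation.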

Pose Dictionary Estimation

Estimation of Pose Dictionaries between Chassis and End-Effector: A single chassis (CH) pose should correspond to multiple end-effector (EE) configurations while the mobile robot advances sequentially around the carrier. We therefore generate the search trajectories for the chassis and the end-effector separately: the chassis moves forward around the carrier, and at each chassis position the distance traveled by the end-effector is minimized. The process involves five sequential stages: (1) extraction of geometric features \( \mathcal{M} \) from point cloud data, (2) generation and selection of EE poses \( \mathcal{P}_{EE} \), (3) fast evaluation and selection of chassis poses \( \mathcal{P}_{CH} \) via a geometric solution, (4) validation of the relation between CH and EE \( \mathcal{P}_{CH}^{EE} \) with subsequent inverse kinematics (IK) solving, and (5) sorting of the resulting pose mappings \( \mathcal{P}_{CH}^{EE} \) for sequential execution, using polar angle sorting and lexicographical sorting.
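Stage (5) can be pictured with a small sketch. It assumes, for illustration only, that chassis poses are 2D points, that they are ordered counter-clockwise by polar angle about the carrier center with distance as a tie-breaker, and that the EE poses attached to each chassis pose are ordered lexicographically by their coordinate tuples. The function names and these conventions are hypothetical and not taken from the GODHS code.

```python
import math
from typing import Dict, List, Tuple

Point2 = Tuple[float, float]          # chassis position (x, y)
Point3 = Tuple[float, float, float]   # end-effector position (x, y, z)


def polar_key(center: Point2, p: Point2) -> Tuple[float, float]:
    """Sort key: angle about the center in [0, 2*pi), then distance to it."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    return (angle, math.hypot(dx, dy))


def order_pose_dictionary(
    center: Point2, pose_dict: Dict[Point2, List[Point3]]
) -> List[Tuple[Point2, List[Point3]]]:
    """Order chassis poses by polar angle, and each EE list lexicographically."""
    ordered_ch = sorted(pose_dict, key=lambda ch: polar_key(center, ch))
    # Python compares tuples lexicographically, so sorted() suffices here.
    return [(ch, sorted(pose_dict[ch])) for ch in ordered_ch]


# Toy pose dictionary around a carrier centered at the origin.
pose_dict: Dict[Point2, List[Point3]] = {
    (0.0, 1.0): [(0.0, 0.5, 0.8), (0.0, 0.5, 0.4)],
    (1.0, 0.0): [(0.5, 0.0, 0.6)],
    (-1.0, 0.0): [(-0.5, 0.0, 0.6)],
    (0.0, -1.0): [(0.0, -0.5, 0.6)],
}
plan = order_pose_dictionary((0.0, 0.0), pose_dict)
```

Sorting chassis poses by angle keeps the base moving in one direction around the carrier, so it never has to backtrack between consecutive stops, while the per-stop ordering fixes the sequence in which the arm visits its poses.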

Experiment

Simulation


Results

BibTeX

Coming soon