Language-Enhanced Mobile Manipulation for Efficient Object Search in Indoor Environments

Motivation

Existing robotic systems mainly rely on spatial data for brute-force searches.

They may store semantic labels, but they cannot perform dynamic semantic reasoning.

In contrast, humans searching for a target focus on semantic spatial relations.

The search narrows the area using semantic guidance and common sense.

Abstract

Enabling robots to efficiently search for and identify objects in complex, unstructured environments is critical for diverse applications ranging from household assistance to industrial automation. However, traditional scene representations typically capture only static semantics and lack interpretable contextual reasoning, limiting their ability to guide object search in completely unfamiliar settings. To address this challenge, we propose a language-enhanced hierarchical navigation framework that tightly integrates semantic perception and spatial reasoning. Our method, Goal-Oriented Dynamically Heuristic-Guided Hierarchical Search (GODHS), leverages large language models (LLMs) to infer scene semantics and guide the search process through a multi-level decision hierarchy. Reliability in reasoning is achieved through the use of structured prompts and logical constraints applied at each stage of the hierarchy. For the specific challenges of mobile manipulation, we introduce a heuristic-based motion planner that combines polar angle sorting with distance prioritization to efficiently generate exploration paths. Comprehensive evaluations in Isaac Sim demonstrate the feasibility of our framework, showing that GODHS can locate target objects with higher search efficiency compared to conventional, non-semantic search strategies.

Video

Approach

GODHS Framework

Complete Architecture of the GODHS System: First, the user specifies the target object in natural language, and an LLM processes the input to extract the semantic intent. The robot captures geometric data with LiDAR and RGB images with cameras, and the language model assigns room names to build a topological map. Within the known map, the LLM ranks rooms by the likelihood that they contain the target object, and the robot navigates to them in that order. Within each room, the robot detects candidate objects through visual object detection, then uses the LLM to decide whether each one is a carrier (an object that could hold the target) and to rank all collected carriers by the probability of containing the target. For each carrier, the LLM plans a hierarchical search strategy: it first analyzes geometric features to identify subregions worth searching (top, interior, etc.), then the robot searches these subregions in order of the probabilities assigned by the LLM. The loop terminates when the LLM verifies target recognition through visual-language grounding of sensor data.
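The room, carrier, and subregion loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the GODHS implementation: `hierarchical_search`, `rank`, `toy_score`, `TOY_SCORES`, the `min_score` threshold, and all numeric scores are hypothetical. The `score` callable stands in for an LLM query and `inspect` stands in for the visual verification step.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical world model: room -> carrier -> list of subregions.
World = Dict[str, Dict[str, List[str]]]
Location = Tuple[str, str, str]


def rank(candidates: List[str], target: str,
         score: Callable[[str, str], float]) -> List[str]:
    """Order candidates by descending score; ties keep their input order."""
    return sorted(candidates, key=lambda c: score(target, c), reverse=True)


def hierarchical_search(world: World, target: str,
                        score: Callable[[str, str], float],
                        inspect: Callable[[str, str, str], bool],
                        min_score: float = 0.2
                        ) -> Tuple[Optional[Location], List[Location]]:
    """Search rooms, then carriers, then subregions, most likely first."""
    visited: List[Location] = []
    for room in rank(list(world), target, score):
        # Skip carriers that the scorer considers not worth searching.
        carriers = [c for c in world[room] if score(target, c) >= min_score]
        for carrier in rank(carriers, target, score):
            subregions = rank(world[room][carrier], target,
                              lambda t, s, c=carrier: score(t, f"{c}/{s}"))
            for sub in subregions:
                visited.append((room, carrier, sub))
                if inspect(room, carrier, sub):  # stand-in for visual check
                    return (room, carrier, sub), visited
    return None, visited


# Toy scores standing in for LLM output (invented values, illustration only).
TOY_SCORES = {
    "living room": 0.8, "kitchen": 0.7, "bedroom": 0.1,
    "coffee table": 0.6, "sofa": 0.1, "fridge": 0.9, "kitchen table": 0.5,
    "coffee table/top": 0.6, "fridge/interior": 0.9, "fridge/top": 0.2,
    "kitchen table/top": 0.5,
}


def toy_score(target: str, candidate: str) -> float:
    return TOY_SCORES.get(candidate, 0.0)


world: World = {
    "bedroom": {"wardrobe": ["interior"], "bed": ["top"]},
    "living room": {"coffee table": ["top"], "sofa": ["top"]},
    "kitchen": {"fridge": ["top", "interior"], "kitchen table": ["top"]},
}
found, visited = hierarchical_search(
    world, "orange", toy_score,
    inspect=lambda room, carrier, sub: (carrier, sub) == ("fridge", "interior"),
)
```

With these toy scores the sketch checks the top of the coffee table first and then the interior of the fridge, mirroring the order in which the levels of the hierarchy are traversed.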

Model Chain and Prompt Design

LLM Reasoning Process (Left): Data Cleaning and Correcting ensure that the input and output are structurally valid and semantically coherent, while Processing executes the core task, drawing on Knowledge as Augmentation and on Feedback to keep the result reliable.
Prompt Design (Right): The input embedding carries the input information, output constraints standardize the lexico-syntactic form of the results, task instructions describe the goal-oriented content, and the feature description provides constrained descriptions of features. The prompt also includes additional information and examples.
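As a rough illustration of how such a structured prompt and a correcting step might fit together, the sketch below assembles the named prompt sections into one string and then cleans a model reply. All of it is an assumption made for illustration: `build_prompt`, `parse_ranking`, the section titles, and the JSON-list output format are hypothetical and are not taken from the GODHS prompts.

```python
import json
from typing import Dict, List, Optional


def build_prompt(task: str, features: str, output_constraints: str,
                 inputs: Dict[str, object], extra: str = "",
                 examples: Optional[List[str]] = None) -> str:
    """Join the prompt sections in a fixed order (hypothetical layout)."""
    sections = [
        ("Task instructions", task),
        ("Feature description", features),
        ("Output constraints", output_constraints),
    ]
    if extra:
        sections.append(("Additional information", extra))
    for i, example in enumerate(examples or [], start=1):
        sections.append((f"Example {i}", example))
    sections.append(("Input", json.dumps(inputs)))
    return "\n\n".join(f"[{title}]\n{body}" for title, body in sections)


def parse_ranking(reply: str, allowed: List[str]) -> List[str]:
    """Correcting step: keep known labels only, normalized and deduplicated."""
    data = json.loads(reply)  # raises ValueError if the reply is not JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of labels")
    cleaned: List[str] = []
    for item in data:
        if not isinstance(item, str):
            continue
        label = item.strip().lower()
        if label in allowed and label not in cleaned:
            cleaned.append(label)
    return cleaned


rooms = ["living room", "kitchen", "bedroom"]
prompt = build_prompt(
    task="Rank the rooms by how likely they are to contain the target.",
    features="Each room is described by its name only.",
    output_constraints="Reply with a JSON list of room names and nothing else.",
    inputs={"target": "orange", "rooms": rooms},
)
# A reply with a label outside the map and a duplicate is cleaned up.
ranking = parse_ranking('["Kitchen", "garage", "living room", "kitchen"]', rooms)
```

Constraining the output to a fixed, machine-checkable form is what makes the correcting step simple: anything outside the allowed vocabulary can be dropped or sent back to the model.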

Pose Dictionary Estimation

Estimation of Pose Dictionaries between Chassis and End-Effector: The mobile robot should advance sequentially around the carrier, and a single chassis (CH) pose should correspond to multiple end-effector (EE) configurations. We therefore generate search trajectories for the chassis and the end-effector separately, so that the chassis moves forward around the carrier while the distance traveled by the end-effector at each chassis position is minimized. The process involves five sequential stages: (1) extraction of geometric features \( \mathcal{M} \) from point cloud data, (2) generation and selection of EE poses \( \mathcal{P}_{EE} \), (3) fast evaluation and selection of chassis poses \( \mathcal{P}_{CH} \) via a geometric solution, (4) validation of the relation between CH and EE \( \mathcal{P}_{CH}^{EE} \) with subsequent inverse kinematics (IK) solving, and (5) sorting of the resulting pose mappings \( \mathcal{P}_{CH}^{EE} \) for sequential execution using polar angle sorting and lexicographical sorting.
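Stage (5) can be illustrated with a minimal sketch, under the assumption that chassis positions are ordered by polar angle around the carrier center and that the EE positions attached to each chassis pose are ordered lexicographically by their coordinates. The names `polar_sort` and `order_pose_dictionary`, the use of positions instead of full poses, and the choice of sort keys are hypothetical and may differ from the method in the paper.

```python
import math
from typing import Dict, List, Tuple

Point2 = Tuple[float, float]          # chassis position (x, y)
Point3 = Tuple[float, float, float]   # end-effector position (x, y, z)


def polar_sort(ch_positions: List[Point2], center: Point2) -> List[Point2]:
    """Order chassis positions counter-clockwise around the carrier center."""
    cx, cy = center
    return sorted(
        ch_positions,
        # Map atan2's (-pi, pi] range to [0, 2*pi) so the sweep starts at +x.
        key=lambda p: math.atan2(p[1] - cy, p[0] - cx) % (2 * math.pi),
    )


def order_pose_dictionary(pose_dict: Dict[Point2, List[Point3]],
                          center: Point2
                          ) -> List[Tuple[Point2, List[Point3]]]:
    """Polar-sort the chassis keys; sort each EE list lexicographically."""
    # Python compares tuples element by element, i.e. lexicographically.
    return [(ch, sorted(pose_dict[ch]))
            for ch in polar_sort(list(pose_dict), center)]


pose_dict: Dict[Point2, List[Point3]] = {
    (0.0, -1.0): [(0.0, -0.5, 0.4)],
    (1.0, 0.0): [(0.5, 0.2, 0.9), (0.5, 0.0, 0.4), (0.4, 0.1, 0.4)],
    (-1.0, 0.0): [(-0.5, 0.0, 0.4)],
    (0.0, 1.0): [(0.0, 0.5, 0.9), (0.0, 0.5, 0.4)],
}
ordered = order_pose_dictionary(pose_dict, center=(0.0, 0.0))
```

Sorting the chassis keys by angle makes the base sweep around the carrier in one direction, while a fixed lexicographic order over the EE positions avoids jumping back and forth between distant viewpoints at a single chassis stop.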

Experiment

Simulation

We tested the DARKO robot in a flat environment, building a grid map of each room with LiDAR and capturing semantic information with a camera.

For example, we detected a bed and a wardrobe in this room. Using prompts, the language model identified the room as a bedroom and marked it on the map.

After the mobile manipulator scanned the entire scene, the semantic map was generated.

The target "orange" is given as input, and the rooms are ranked for this target. As shown in the prompt and the ranking results, the living room and the kitchen are prioritized.

The system first directs the robot to the living room. In the living room, the robot uses its manipulator for initial observation.

During this process, the system identifies several objects. The language model classifies the carriers, such as the TV shelf, billiard table, and sofa.

It then determines the order in which carriers should be searched.

Here, we only search the coffee table.

The language model predicts that the top of the coffee table is worth checking.

The robot inspects the top of the coffee table but finds the target is not there.

After failing to locate the target in the living room, it navigates to the kitchen.

It confirms carriers such as the oven, fridge, and microwave. According to the language model, the carriers worth searching are the fridge and the kitchen table.

Then, the language model decides to prioritize searching inside the fridge.

Ultimately, the target is found inside the fridge in the kitchen, successfully concluding the search process.

Results

This table compares four search strategies across success rate, search rates, and overall search rates. Our method can effectively perform the search tasks while significantly reducing search costs, thereby validating the feasibility of GODHS. However, the experimental results show that the search success rate progressively decreases following semantic analysis. Traditional methods fail in narrow spaces because grid-based maps cannot model 3D semantic relationships, while GODHS faces additional limitations from generative language models, particularly the symbol grounding problem.

For the motion planner, we evaluate three key performance metrics: the EE Path Length, which is the ratio of the total EE travel distance to the theoretical shortest EE path; the CH Path Length, defined in the same way for the CH poses; and the Execution Time Ratio, a ratio of execution times for which a lower value indicates better efficiency. The table presents the comparative results, showing that lexicographical sorting effectively reduces EE path length while polar angle sorting significantly improves CH path efficiency. When both strategies are applied, the method achieves the best results in both path length and execution time, confirming its effectiveness in improving motion planning.
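The path length ratios can be made concrete with a small sketch. It assumes, purely for illustration, that the "theoretical shortest path" is the shortest open path that starts at the first waypoint and visits every other waypoint once, found here by brute force (practical only for a handful of points). The function names and this definition of the reference path are hypothetical and may not match the evaluation in the paper.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def path_length(points: Sequence[Point]) -> float:
    """Total Euclidean length of the polyline through the points, in order."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def shortest_path_length(points: Sequence[Point]) -> float:
    """Brute-force shortest open path that starts at points[0] (assumption)."""
    start, rest = points[0], points[1:]
    return min(path_length([start, *perm])
               for perm in itertools.permutations(rest))


def path_length_ratio(points: Sequence[Point]) -> float:
    """Executed length divided by the shortest length; 1.0 is optimal."""
    best = shortest_path_length(points)
    return path_length(points) / best if best > 0 else 1.0


# Visiting collinear waypoints out of order costs 5 units instead of 3.
executed: List[Point] = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
ratio = path_length_ratio(executed)
```

A ratio of 1.0 means the executed order is already the shortest one under this definition, and larger values indicate wasted travel.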

BibTeX

Coming soon