Research Statement

This research statement is a temporary GPT-generated draft and will be updated soon. 🤗

Long-Term Research Agenda

My long-term goal is to build generalizable embodied intelligence that understands the physical world, reasons across modalities, and performs complex tasks efficiently in real 3D environments. I aim to develop the foundations of 3D multimodal large language models that connect natural language understanding with grounded perception, spatial reasoning, and long-horizon action planning. This line of research advances the creation of agents that can operate safely and autonomously in real scenes, and it also contributes new computational principles for integrating vision, language, and control.

Research Trajectory

Past and Ongoing Work

My early research focused on visual perception, particularly facial expression recognition, head pose estimation, and fine-grained image classification. I explored how transformer architectures can extract orientation cues, muscle-aware features, and part-level relations for robust recognition. Several of these works were published in TII, TIP, TMM, CVPR, and ROBIO. These projects gave me experience in representation learning, model design, and large-scale visual data processing. They also strengthened my understanding of geometric cues and spatial structures, which later shaped my interest in 3D perception.
During my undergraduate years and research assistantship, I became deeply interested in multi-view 3D vision. Later, in collaboration with the VLRLab at Huazhong University of Science and Technology, I worked on ViT-based multi-view 3D detection. Our ECCV 2024 work introduced a token compression strategy that accelerates ViT-based 3D detectors while preserving accuracy. This experience helped me understand the computational challenges of 3D perception systems and raised a broader question: how can we integrate efficient 3D perception with high-level reasoning for embodied agents?
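
To illustrate the general flavor of this idea, the sketch below scores the patch tokens of a ViT backbone, keeps the most informative ones, and merges the rest into a single summary token. The scoring rule, keep ratio, and merge step are assumptions made for illustration, not the exact mechanism of the ECCV 2024 method.

```python
import torch

def compress_tokens(tokens, scores, keep_ratio=0.25):
    """Toy token compression for a ViT backbone (illustrative, not the paper's code).

    tokens: (B, N, C) patch tokens from the image backbone
    scores: (B, N) per-token importance, e.g. from attention maps or a small
            learned scorer (the scoring rule is an assumption here)
    """
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))

    # Keep the k highest-scoring tokens.
    keep_idx = scores.topk(k, dim=1).indices                               # (B, k)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

    # Merge the discarded tokens into one summary token so coarse context
    # remains available to the downstream 3D detection head.
    drop_mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    drop_mask.scatter_(1, keep_idx, False)
    denom = drop_mask.sum(dim=1, keepdim=True).clamp(min=1).unsqueeze(-1)  # (B, 1, 1)
    summary = (tokens * drop_mask.unsqueeze(-1)).sum(dim=1, keepdim=True) / denom

    return torch.cat([kept, summary], dim=1)                               # (B, k + 1, C)
```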

Recent Work: Embodied AI and 3D Grounded Reasoning

Motivated by the limitations of existing task-planning datasets, I began exploring how embodied agents can perform complex household tasks described in natural language. I identified two critical gaps: the lack of operations research knowledge for efficient scheduling and the lack of realistic 3D grounding for action execution. This led to the creation of the OKS3D task and the OKS3D-60K dataset, which together introduce large-scale 3D grounded task scheduling. Our AAAI 2026 oral paper presents GRANT, a multimodal large language model that uses a simple scheduling token mechanism to generate efficient schedules and grounded actions. This line of work aims to bridge natural language reasoning, 3D scene understanding, and optimal planning. Its contributions include a principled integration of task scheduling concepts with embodied action planning and a new dataset that reflects realistic household tasks. The results demonstrate that 3D grounded scheduling can significantly improve agent efficiency and reliability.
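
As a hedged sketch of what a scheduling-token mechanism could look like in practice, the snippet below appends a special token to the model input and reads its final hidden state with a small head that scores candidate actions for ordering. The token name, head design, and decoding convention are assumptions for exposition; they are not GRANT's actual implementation.

```python
import torch
import torch.nn as nn

class SchedulingTokenHead(nn.Module):
    """Illustrative scheduling head driven by a special <SCHED> token.

    All names and design choices here are assumptions for exposition,
    not the actual GRANT architecture.
    """
    def __init__(self, hidden_dim, num_candidate_actions):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_candidate_actions)

    def forward(self, sched_hidden):
        # sched_hidden: (B, hidden_dim), the LLM hidden state at the position
        # of the appended scheduling token.
        logits = self.scorer(sched_hidden)                     # (B, num_candidate_actions)
        # One simple convention: higher score = execute earlier.
        order = torch.argsort(logits, dim=-1, descending=True)
        return logits, order

# Usage sketch: feed the hidden state taken at the scheduling-token position.
head = SchedulingTokenHead(hidden_dim=4096, num_candidate_actions=8)
logits, order = head(torch.randn(2, 4096))
```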

Future Research Vision

My future research focuses on building generalizable 3D multimodal large language models for embodied intelligence. I plan to pursue three directions.

1. Unified Language-Scene-Action Models

I aim to design models that jointly encode 3D environments, language instructions, and action spaces. A key goal is to enable agents to understand spatial relations, predict the consequences of actions, and plan long-horizon behaviors based on both commonsense knowledge and real geometric constraints.
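
A minimal sketch of what such a joint encoder might look like is below; the projection dimensions, token layout, and module choices are illustrative assumptions rather than a settled design.

```python
import torch
import torch.nn as nn

class SceneLanguageActionEncoder(nn.Module):
    """Toy joint encoder over scene, language, and action tokens.

    Dimensions and module choices are illustrative assumptions.
    """
    def __init__(self, dim=512, num_layers=4, num_actions=32):
        super().__init__()
        self.scene_proj = nn.Linear(256, dim)   # e.g. per-object 3D scene features
        self.text_proj = nn.Linear(768, dim)    # e.g. instruction token embeddings
        self.action_emb = nn.Embedding(num_actions, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, scene_feats, text_feats, action_ids):
        # scene_feats: (B, Ns, 256), text_feats: (B, Nt, 768), action_ids: (B, Na)
        tokens = torch.cat([
            self.scene_proj(scene_feats),
            self.text_proj(text_feats),
            self.action_emb(action_ids),
        ], dim=1)
        # Self-attention lets action tokens attend to both geometry and language,
        # which is the joint encoding this direction argues for.
        return self.encoder(tokens)
```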

2. Efficient 3D Grounded Reasoning

Building on my work in token compression and OR-based scheduling, I intend to explore computationally efficient architectures for 3D reasoning. This includes sparse attention mechanisms, structured memory, and dynamic scene representations that allow agents to operate in large, complex environments without prohibitive cost.
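
The snippet below sketches one such mechanism, top-k sparse attention over scene tokens, where each query attends only to its k highest-scoring keys. It is a generic illustration (it still builds the dense score matrix for clarity), not a specific published design.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=32):
    """Each query attends only to its top-k keys (illustrative sparse attention).

    q, k, v: (B, N, C) features over scene tokens. Note: this version still
    builds the dense N x N score matrix for clarity; a practical implementation
    would avoid that.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bqc,bkc->bqk", q, k) * scale            # (B, N, N)
    topk = min(topk, scores.shape[-1])
    vals, idx = scores.topk(topk, dim=-1)                          # (B, N, topk)
    attn = F.softmax(vals, dim=-1)

    # Gather the selected value vectors and take the attention-weighted sum.
    v_exp = v.unsqueeze(1).expand(-1, q.shape[1], -1, -1)          # (B, N, N, C)
    v_sel = torch.gather(v_exp, 2, idx.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1]))
    return torch.einsum("bqk,bqkc->bqc", attn, v_sel)              # (B, N, C)
```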

3. Toward Parallel and Collaborative Embodied Agents

The AAAI 2026 project on parallel task execution suggests a broader research question: how can multiple agents coordinate, share information, and perform joint tasks in 3D space? I plan to investigate distributed embodied intelligence, where agents negotiate roles, communicate efficiently, and adapt to dynamic environments; a toy scheduling baseline for parallel execution is sketched at the end of this section.

Across all three directions, I plan to pursue collaboration opportunities with the robotics, cognitive science, and operations research communities. I also aim to seek external funding from national research foundations and AI institutes.
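
As a toy illustration of the scheduling flavor behind parallel task execution (not the OKS3D or GRANT scheduler), the sketch below assigns independent household subtasks to whichever agent frees up first, longest tasks first; the task names and durations are made up.

```python
import heapq

def greedy_parallel_schedule(durations, num_agents=2):
    """Longest-processing-time list scheduling: each task goes to the agent
    that becomes free earliest. A classic OR baseline for independent tasks,
    used here purely for illustration.

    durations: dict mapping task name -> estimated duration
    Returns (assignment, makespan); assignment maps task -> (agent, start, end).
    """
    agents = [(0.0, i) for i in range(num_agents)]   # (time agent is free, agent id)
    heapq.heapify(agents)
    assignment = {}
    # Scheduling longer tasks first tends to reduce the overall finish time.
    for task, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
        free_at, aid = heapq.heappop(agents)
        assignment[task] = (aid, free_at, free_at + dur)
        heapq.heappush(agents, (free_at + dur, aid))
    makespan = max(end for _, _, end in assignment.values())
    return assignment, makespan

# Example: two agents splitting four made-up household subtasks (minutes).
tasks = {"boil water": 5, "chop vegetables": 4, "set table": 2, "wipe counter": 1}
plan, makespan = greedy_parallel_schedule(tasks, num_agents=2)   # makespan == 6
```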

Conclusion

My research seeks to advance the foundations of 3D grounded multimodal reasoning for embodied AI. Through contributions in 3D perception, transformer-based modeling, task scheduling, and embodied LLM design, my goal is to create agents that understand, predict, and act in the physical world with both intelligence and efficiency. I will continue developing models that integrate language, vision, and reasoning to push toward generalizable and reliable embodied systems.