CUHK
UIUC
Columbia University
University of Pittsburgh
We have released CUHK-S, a sample subset of CUHK-X. The full CUHK-X dataset will be released soon, once preparations for the CUHK-X competition are complete. The main differences are that CUHK-S includes only 18 users and excludes the RGB modality. We welcome the community to use it and share feedback.
CUHK-X is a comprehensive multimodal dataset containing 64,342 samples across seven modalities designed for human activity recognition, understanding, and reasoning. Unlike existing datasets that focus primarily on recognition tasks, CUHK-X addresses critical gaps by providing the first multimodal dataset specifically designed for Human Action Understanding (HAU) and Human Action Reasoning (HARn).
The dataset was collected from 30 participants across diverse environments using a prompt-based scene creation approach that leverages Large Language Models (LLMs) to generate logical and spatio-temporal activity descriptions. This ensures both consistency and ecological validity in the collected data.
CUHK-X provides three comprehensive benchmarks: HAR (Human Action Recognition), HAU (Human Action Understanding), and HARn (Human Action Reasoning), encompassing eight distinct evaluation tasks. Our extensive experiments demonstrate significant challenges in cross-subject and cross-domain scenarios, highlighting the dataset's value for advancing robust multimodal human activity analysis.
The dataset spans a multi-room home and supports three tasks: HAR, HAU (captioning), and HARn (question answering). It integrates diverse modalities (RGB, depth, thermal, infrared, IMU, skeleton, and mmWave) to enable robust perception and reasoning in complex indoor contexts.
CUHK-X was collected using a sophisticated multi-sensor setup ensuring synchronized data capture across all modalities:
RGB-D camera providing color and depth information
mmWave sensing for privacy-preserving motion detection
IMU-based motion and orientation tracking with high temporal resolution
Thermal imaging for heat-signature analysis and environmental robustness
Temporal alignment across all modalities for consistent analysis
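As an illustration of what temporal alignment across heterogeneous sampling rates involves, the sketch below matches each frame of a reference stream to the nearest timestamp of a faster stream. This is a minimal nearest-neighbour alignment in pure Python; the function name and rates are illustrative and not part of the CUHK-X toolchain.

```python
import bisect

def align_to_reference(ref_ts, mod_ts):
    """For each reference timestamp, return the index of the nearest
    timestamp in another modality's stream (both sorted, in seconds)."""
    idx = []
    for t in ref_ts:
        i = bisect.bisect_left(mod_ts, t)
        # compare the two neighbours around the insertion point
        cands = [j for j in (i - 1, i) if 0 <= j < len(mod_ts)]
        idx.append(min(cands, key=lambda j: abs(mod_ts[j] - t)))
    return idx

# 30 fps camera as reference, 100 Hz inertial stream
rgb_ts = [k / 30 for k in range(5)]
imu_ts = [k / 100 for k in range(20)]
print(align_to_reference(rgb_ts, imu_ts))  # → [0, 3, 7, 10, 13]
```

In practice, hardware-triggered capture or a shared clock reduces how much such post-hoc alignment is needed.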
Layout with room-wise visual annotations (Bedroom, Kitchen, Bathroom, and Living Room), showing example images and sensor placements for each room. The icon marks the location of the ambient sensor.
CUHK-X provides three comprehensive benchmarks that progressively increase in complexity, from basic recognition to advanced reasoning:
Objective: Traditional action classification across modalities
Objective: Comprehend actions through contextual integration
Objective: Infer intentions and causal relationships
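To make the progression from recognition to reasoning concrete, here is a hypothetical record showing how a single clip could serve all three benchmarks. The field names are illustrative only and do not reflect the released schema.

```python
# Hypothetical annotation record (illustrative field names, not the
# actual CUHK-X schema): one clip, three levels of supervision.
sample = {
    "clip_id": "bedroom_0001",
    "modalities": ["rgb", "depth", "ir", "thermal", "skeleton", "mmwave", "imu"],
    "har": {"label": "drinking water"},                         # recognition
    "hau": {"caption": "A person picks up a cup and drinks."},  # understanding
    "harn": {                                                   # reasoning
        "question": "Why did the person walk to the kitchen?",
        "answer": "To fetch a cup of water.",
    },
}
```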
Leveraging Large Language Models to generate consistent, logical activity descriptions that participants then perform. This approach ensures:
Activities follow natural progression and causality
Actions are contextually appropriate
Quality assurance for generated scenarios
Efficient generation of diverse scenarios
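A prompt-based scene-creation step of this kind can be sketched as a template that constrains the LLM to causal, time-stamped activity scripts. The template below is a hypothetical illustration; the actual prompts used for CUHK-X are not shown here.

```python
def build_scene_prompt(room, duration_min, activities):
    """Illustrative prompt template for LLM-driven scene creation.
    (Hypothetical wording, not the prompts used to build CUHK-X.)"""
    return (
        f"Generate a logically ordered, {duration_min}-minute activity "
        f"script set in a {room}. Use only these actions: "
        f"{', '.join(activities)}. Each step must causally follow the "
        "previous one and include an approximate start time."
    )

prompt = build_scene_prompt("kitchen", 5, ["open fridge", "pour water", "drink"])
```

Constraining the action vocabulary in the prompt is what keeps generated scripts performable by participants while still varying across scenes.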
CUHK-X represents a significant advancement in multimodal human activity datasets, featuring:
This example shows all seven modalities, RGB, IR, Thermal, Depth, Skeleton, Radar, and IMU, recorded simultaneously. The IMU is 9-axis, but we show only a subset of its channels for simplicity.
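For readers unfamiliar with the convention, a 9-axis IMU sample concatenates three 3-axis sensors. The toy snippet below (illustrative values, not dataset readings) shows the usual split.

```python
import numpy as np

# A 9-axis IMU reading = 3-axis accelerometer + 3-axis gyroscope +
# 3-axis magnetometer. Toy values for illustration only.
reading = np.arange(9, dtype=float)  # [ax, ay, az, gx, gy, gz, mx, my, mz]
accel, gyro, mag = reading[:3], reading[3:6], reading[6:9]
```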
Our comprehensive evaluation across the three benchmarks reveals several important insights:
| Modality | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| RGB | 90.89% | 92.24% | 91.02% | 91.28% |
| Depth | 90.46% | 91.76% | 90.75% | 90.93% |
| IR | 90.22% | 91.53% | 89.94% | 90.46% |
| Thermal | 92.57% | 93.54% | 93.50% | 93.36% |
| mmWave | 46.63% | 48.29% | 46.63% | 44.53% |
| IMU | 45.52% | 40.84% | 38.00% | 38.32% |
| Skeleton | 79.08% | 91.46% | 79.08% | 84.17% |
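For reference, the metrics in the table can be reproduced from predictions with a few lines of code. The sketch below is a minimal pure-Python version of accuracy and macro-averaged F1 (the table's exact averaging scheme is an assumption here); in practice `sklearn.metrics` is the standard choice.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Toy accuracy + macro-F1 (assumes per-class scores are averaged
    unweighted; use sklearn.metrics for real evaluation)."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == p == c for t, p in zip(y_true, y_pred))
        fp = sum(p == c and t != c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)
```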
If you use CUHK-X in your research, please cite our paper:
@article{jiang2025large,
title={A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning},
author={Jiang, Siyang and Yuan, Mu and Ji, Xiang and Yang, Bufang and Liu, Zeyu and Xu, Lilin and Li, Yang and He, Yuting and Dong, Liran and Lu, Wenrui and others},
journal={arXiv preprint arXiv:2512.07136},
year={2025}
}
For dataset access, questions, or collaborations:
We thank all participants who contributed to the CUHK-X dataset collection. Special acknowledgments to the CUHK research team and collaborators who made this comprehensive multimodal dataset possible. The hardware setup and synchronization infrastructure were crucial for achieving the quality and scale of CUHK-X. The CUHK-X dataset creators obtained approval from an Institutional Review Board (IRB) to conduct their study and collect data from human subjects.
We gratefully acknowledge Dr. Jamie Du, Zhijiang Chen, Runju Fan, Yajing Feng, Peipei Li, Yutang Wei, Jiamin Wu, Yixing Xu, and Danni Yuanyong from Huizhou University, as well as Dr. Yunqi Guo from The Chinese University of Hong Kong, for their assistance in collecting the dataset. We also thank Ruijun Xia, Bohan Liu, and Guangyu Chen from The Chinese University of Hong Kong for their help with the paper artifacts.
All datasets used in this study were accessed and used under explicit authorization from their respective owners.
CUHK-X aims to advance research in healthcare monitoring, smart environments, and privacy-preserving human activity understanding. We hope this dataset serves as a valuable resource for the research community to develop more robust and practical human activity recognition systems.