CUHK-X

A Large-Scale Multimodal Dataset and Benchmark for Human Action Recognition, Understanding and Reasoning

Siyang Jiang, Mu Yuan, Xiang Ji, Bufang Yang, Zeyu Liu, Lilin Xu, Yang Li, Yuting He, Liran Dong, Wenrui Lu, Janey Grace Xing, Zhenyu Yan, Xiaofan Jiang, Wei Gao, Hongkai Chen.

CUHK
UIUC
Columbia University
University of Pittsburgh
ESF Sha Tin College

Abstract

CUHK-X is a comprehensive multimodal dataset containing 58,445 samples across seven modalities designed for human activity recognition, understanding, and reasoning. Unlike existing datasets that focus primarily on recognition tasks, CUHK-X addresses critical gaps by providing the first multimodal dataset specifically designed for Human Action Understanding (HAU) and Human Action Reasoning (HARn).

The dataset was collected from 30 participants across diverse environments using our novel ActScene framework, a prompt-based scene-creation approach that leverages Large Language Models (LLMs) to generate logically consistent, spatio-temporally coherent activity descriptions. This ensures both consistency and ecological validity in the collected data.

CUHK-X provides three comprehensive benchmarks: HAR (Human Action Recognition), HAU (Human Action Understanding), and HARn (Human Action Reasoning), encompassing eight distinct evaluation tasks. Our extensive experiments demonstrate significant challenges in cross-subject and cross-domain scenarios, highlighting the dataset's value for advancing robust multimodal human activity analysis.

Hardware and Environment Setup

CUHK-X was collected using a sophisticated multi-sensor setup ensuring synchronized data capture across all modalities:

  • Vzense NYX 650: RGB-D camera providing color and depth information
  • Texas Instruments radar: mmWave sensing for privacy-preserving motion detection
  • IMU sensors: Motion and orientation tracking with high temporal resolution
  • Thermal cameras: Heat signature analysis for environmental robustness
  • Synchronized recording: Temporal alignment across all modalities for consistent analysis

Hardware setup showing the multi-sensor configuration
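
As an illustration of what temporal alignment across modalities can look like, here is a minimal sketch of nearest-timestamp matching on a shared clock; the function name align_streams, the sampling rates, and the 50 ms tolerance are hypothetical assumptions, not taken from the CUHK-X tooling.

import numpy as np

def align_streams(reference_ts, other_ts, tolerance_s=0.05):
    """Match each reference timestamp to the nearest timestamp of another
    stream; pairs farther apart than `tolerance_s` seconds are dropped.
    Assumes both streams share a common clock (float seconds)."""
    other_ts = np.asarray(other_ts)
    pairs = []
    for i, t in enumerate(reference_ts):
        j = int(np.argmin(np.abs(other_ts - t)))  # nearest neighbour
        if abs(other_ts[j] - t) <= tolerance_s:
            pairs.append((i, j))
    return pairs

# Hypothetical usage: pair 30 fps RGB frames with 20 Hz radar frames.
rgb_ts = np.arange(0.0, 10.0, 1 / 30)
radar_ts = np.arange(0.0, 10.0, 1 / 20)
print(len(align_streams(rgb_ts, radar_ts)), "matched RGB/radar frame pairs")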

The recordings span a multi-room home and support three tasks: HAR, HAU (a captioning task), and HARn (a question-answering task). The setup integrates diverse modalities, including RGB, depth, thermal, infrared, IMU, skeleton, and mmWave, to enable robust perception and reasoning in complex indoor contexts.


Layout with room-wise visual annotations (Bedroom, Kitchen, Bathroom, and Living Room) showing corresponding example images and sensor placements. The icon indicates the location of the ambient sensor.


Benchmarks & Tasks

CUHK-X provides three comprehensive benchmarks that progressively increase in complexity, from basic recognition to advanced reasoning:

🎯 HAR - Human Action Recognition

Objective: Traditional action classification across modalities

  • Cross-subject evaluation (LOSO protocol)
  • Cross-domain performance analysis
  • Long-tail distribution handling
  • Multimodal fusion strategies
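
For the cross-subject setting, the LOSO protocol holds out each participant in turn. A minimal sketch of building such a split with scikit-learn is shown below; the toy features, labels, and subject IDs are placeholders rather than actual CUHK-X loading code.

from sklearn.model_selection import LeaveOneGroupOut
import numpy as np

# Toy stand-ins for per-clip features, action labels, and participant IDs.
X = np.random.randn(12, 8)             # 12 clips, 8-dim features
y = np.random.randint(0, 5, size=12)   # 5 action classes
subjects = np.repeat([1, 2, 3], 4)     # 3 participants, 4 clips each

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(X, y, groups=subjects)):
    held_out = subjects[test_idx][0]
    # Train on the remaining participants, evaluate on the held-out one.
    print(f"fold {fold}: test subject {held_out}, "
          f"{len(train_idx)} train / {len(test_idx)} test clips")
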
🧠 HAU - Human Action Understanding

Objective: Comprehend actions through contextual integration

  • Action Captioning: Generate natural language descriptions
  • Emotion Analysis: Identify emotional states
  • Sequential Reordering: Organize actions chronologically
  • Action Selection: Choose relevant actions from candidates
🔮 HARn - Human Action Reasoning

Objective: Infer intentions and causal relationships

  • Next Action Prediction: Predict likely subsequent actions
  • Temporal Reasoning: Understand action progression logic
  • Contextual Inference: Consider environmental factors
  • Causal Understanding: Link actions to intentions

Novel ActScene Framework

Our innovative ActScene framework leverages Large Language Models to generate consistent, logical activity descriptions that participants then perform. This approach ensures:

  • Logical Consistency: Activities follow natural progression and causality
  • Spatio-temporal Coherence: Actions are contextually appropriate
  • Human-in-the-Loop Validation: Quality assurance for generated scenarios
  • Scalable Annotation: Efficient generation of diverse scenarios
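
A minimal sketch of what a prompt-based scene-generation loop with human validation can look like is given below. The prompt wording, the query_llm placeholder, and the validation step are illustrative assumptions, not the released ActScene implementation.

SCENE_PROMPT = (
    "You are designing a household activity script for data collection.\n"
    "Room: {room}. Duration: about {minutes} minutes.\n"
    "List 5 actions in a logically and temporally consistent order, "
    "one per line, using everyday language."
)

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call (any chat-completion API would do)."""
    raise NotImplementedError

def generate_scene(room: str, minutes: int = 5) -> list[str]:
    """Ask the LLM for a scene script and split it into individual actions."""
    raw = query_llm(SCENE_PROMPT.format(room=room, minutes=minutes))
    return [line.strip() for line in raw.splitlines() if line.strip()]

def human_validate(actions: list[str]) -> bool:
    """Human-in-the-loop check: an annotator confirms the script is
    logically ordered and feasible before participants perform it."""
    print("\n".join(f"{i + 1}. {a}" for i, a in enumerate(actions)))
    return input("Accept this scene? [y/n] ").lower().startswith("y")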

Dataset Overview

CUHK-X represents a significant advancement in multimodal human activity datasets, featuring:

  • Seven Synchronized Modalities: RGB, Infrared (IR), Depth, Thermal, IMU, mmWave Radar, and Skeleton data
  • Large-Scale: 58,445 annotated action samples from 30 diverse participants
  • Dual Data Structure: Both singular actions (30,000+ samples) and sequential activities for temporal reasoning
  • Rich Annotations: LLM-generated captions with human-in-the-loop validation
  • Environmental Diversity: Indoor and outdoor settings with varying conditions
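
To make the seven-modality structure concrete, here is a hypothetical container for one synchronized sample; the field names and array shapes are assumptions for illustration and do not describe the actual file format.

from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalSample:
    """Hypothetical container for one synchronized CUHK-X clip.
    Field names and shapes are illustrative, not the released format."""
    rgb: np.ndarray       # (T, H, W, 3) color frames
    ir: np.ndarray        # (T, H, W) infrared frames
    depth: np.ndarray     # (T, H, W) depth maps
    thermal: np.ndarray   # (T, H, W) thermal frames
    skeleton: np.ndarray  # (T, J, 3) 3D joint coordinates
    radar: np.ndarray     # (T, N, 4) mmWave point clouds (x, y, z, velocity)
    imu: np.ndarray       # (T, 6) accelerometer + gyroscope readings
    label: str            # action class name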

Modality Specifications

  • 🎥 RGB Video: Standard color video recordings
  • 📏 Depth: 3D spatial information from depth cameras
  • 🔥 Thermal: Heat signature analysis
  • 🌡️ Infrared (IR): Infrared imaging for lighting robustness
  • 📡 mmWave Radar: Privacy-preserving motion detection
  • 🦴 Skeleton: 3D pose estimation and joint tracking
  • 📱 IMU: Inertial Measurement Unit for motion tracking

Dataset overview showing modality examples

Action Categories and Distribution

Categories

  • Personal Care (6 actions): Washing face, Brushing teeth, Combing hair, Undressing, Wiping hands, Getting dressed
  • Eating and Drinking (6 actions): Drinking, Eating, Grabbing utensils, Pouring, Stirring, Peeling fruit
  • Household (5 actions): Sweeping, Mopping, Washing dishes, Wiping surface, Folding clothes
  • Working (6 actions): Typing on a keyboard, Writing, Calling, Checking the time, Reading, Turning a page
  • Socializing and Leisure (5 actions): Taking a selfie, Playing board games, Watching TV, Using a phone, Listening to music with headphones
  • Sports and Exercises (9 actions): Walking, Lunges, Sitting down, Lying down, Standing up, Stretching, Jumping jacks, Squats, Running
  • Caring and Helping (3 actions): Taking medicine, Checking body temperature, Massaging oneself
Action categories and distribution

Distribution

  • Distribution Feature: The dataset follows a long-tail distribution (a small number of actions account for a large proportion of occurrences, while most are infrequent), consistent with the common imbalance of real-world datasets.
  • Category Diversity: Covers basic daily activities, work-related tasks, household chores, and physical exercises, providing a rich foundation for human activity recognition.
  • Data Scale:
    • Each participant contributes over 30 minutes of footage with more than 100 samples.
    • Vision modality includes 4,029 clips (total duration: 19 hours and 29 minutes).
Action frequency
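
Because of the long-tail distribution noted above, training often benefits from reweighting rare classes. The snippet below is a generic sketch of inverse-frequency class weights; the toy label counts are made up, and this weighting scheme is a common convention rather than one mandated by the benchmark.

import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights inversely proportional to class frequency,
    a common way to counter long-tailed action distributions."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0  # guard against empty classes
    return counts.sum() / (num_classes * counts)

# Toy long-tailed label distribution (4 classes).
labels = np.array([0] * 50 + [1] * 20 + [2] * 5 + [3] * 2)
weights = inverse_frequency_weights(labels, num_classes=4)
print(np.round(weights, 2))  # rare classes receive larger weights
# e.g. pass `weights` to a class-weighted cross-entropy loss during training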

Data Visualization

This example shows all seven modalities (RGB, IR, Thermal, Depth, Skeleton, Radar, and IMU) recorded simultaneously.

Experimental Results

Key Findings

Our comprehensive evaluation across the three benchmarks reveals several important insights:

🎯 HAR Performance (Cross-Subject)

Modality Accuracy Precision Recall F1-Score
RGB 90.89% 92.24% 91.02% 91.28%
Depth 90.46% 91.76% 90.75% 90.93%
IR 90.22% 91.53% 89.94% 90.46%
Thermal 92.57% 93.54% 93.50% 93.36%
mmWave 46.63% 48.29% 46.63% 44.53%
IMU 45.52% 40.84% 38.00% 38.32%
Skeleton 79.08% 91.46% 79.08% 84.17%
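
For reference, metrics like those in the table are typically computed per modality from the model's predictions. The sketch below uses scikit-learn with macro averaging, which is an assumption since the exact averaging convention is not restated here.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def har_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# Toy example with three action classes.
print(har_metrics([0, 1, 2, 2, 1, 0], [0, 1, 2, 1, 1, 0]))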

🧠 HAU Performance Highlights

  • QwenVL-7B: Consistently best performer across tasks
  • VLLaVA-7B: Strong performance in depth and IR modalities
  • Emotion Analysis: Up to 77.77% accuracy with thermal imaging
  • Sequential Reordering: 68.5% accuracy for complex temporal reasoning

🔮 HARn Insights

  • Reasoning vs Captioning: Reasoning models significantly outperform captioning models
  • Modality Impact: Depth and IR often superior to RGB for reasoning tasks
  • Model Scale: Larger models (7B) consistently outperform smaller ones
  • Context Understanding: Critical for next action prediction accuracy
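
As an illustration of how next-action prediction can be posed to a (vision-)language model, the sketch below turns an observed action sequence into a multiple-choice question; the prompt template and candidate options are hypothetical, not the benchmark's official format.

def build_next_action_prompt(observed, candidates):
    """Format an observed action history and candidate answers as a
    multiple-choice next-action question."""
    history = "; ".join(observed)
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return (
        f"A person has just performed these actions in order: {history}.\n"
        f"Which action is most likely to happen next?\n{options}\n"
        f"Answer with a single letter."
    )

print(build_next_action_prompt(
    observed=["Grabbing utensils", "Pouring", "Stirring"],
    candidates=["Drinking", "Sweeping", "Typing on a keyboard", "Running"],
))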

Citation

If you use CUHK-X in your research, please cite our paper:

@inproceedings{jiang2025cuhkx,
  title={CUHK-X: A Large-Scale Multimodal Dataset and Benchmark for Human Action Recognition, Understanding and Reasoning},
  author={Jiang, Siyang and Yuan, Mu and Ji, Xiang and Yang, Bufang and Liu, Zeyu and Xu, Lilin and Li, Yang and He, Yuting and Dong, Liran and Yan, Zhenyu and Jiang, Xiaofan and Gao, Wei and Chen, Hongkai and Xing, Guoliang},
}

Contact Information

For dataset access, questions, or collaborations:

  • Primary Contact: syjiang@cuhk.edu.hk
  • Institution: The Chinese University of Hong Kong
  • Dataset Request: Please contact for access information

Acknowledgments

🙏 Our Gratitude

We thank all participants who contributed to the CUHK-X dataset collection. Special acknowledgments to the CUHK research team and collaborators who made this comprehensive multimodal dataset possible. The hardware setup and synchronization infrastructure were crucial for achieving the quality and scale of CUHK-X.

🌍 Broader Impact

CUHK-X aims to advance research in healthcare monitoring, smart environments, and privacy-preserving human activity understanding. We hope this dataset serves as a valuable resource for the research community to develop more robust and practical human activity recognition systems.