HARMONI: Using 3D Computer Vision and Audio Analysis to Quantify Caregiver–Child Behavior and Interaction from Videos
- System requirements and installation guide
- Download dependency data
- Demo on a video clip
- Demo on an audio clip
- Code structure
- Related resources
- Contact
Tested on a Linux machine with a single NVIDIA TITAN RTX GPU (Ubuntu 20.04.4 LTS, CUDA 11.3, Python 3.9).
- Install Miniconda, then create the environment and install the packages:
conda create -n harmoni_visual python==3.9
conda activate harmoni_visual
./install_visual.sh
To install the environment for the audio part, run:
cd audio
conda env create -f process_audio.yml
conda activate process_audio
Installing either the visual or the audio environment should take around 5 to 10 minutes.
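A quick way to verify the visual environment is to check that PyTorch can see the GPU. This is a minimal sanity check, assuming install_visual.sh installs PyTorch (required by the pose and tracking models); it is not part of the pipeline itself.

# Minimal sanity check for the visual environment (run inside `harmoni_visual`).
# Assumes PyTorch is installed by install_visual.sh; adjust if your setup differs.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # The demo timings below were measured on a single NVIDIA TITAN RTX.
    print("GPU:", torch.cuda.get_device_name(0))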
- Download the data folder that includes model checkpoints and other dependencies here.
- We provide example outputs from public video clips. You can download them here.
Visualization of the example clip.
Please see below for instructions on reproducing the visual results. Here we show how to run HARMONI on a public video clip. A basic command is:
python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --keep contains_only_both --save_gif
To reproduce the provided results, please use the below two commands instead.
- Using the detected ground plane as an additional constraint (--ground_constraint). As in Ugrinovic et al., the ground normal is estimated by fitting a plane to the floor depth points, and --ground_anchor ("child_bottom" | "adult_bottom") specifies whether the mean ankle positions of the children or the adults are used as the anchor point for the ground plane. We then run optimization on all humans and encourage their ankles to lie on the ground plane (a sketch of this geometry is shown after the commands below).
- Overwriting the classified tracks. It is hard for a model to accurately predict whether a detected human is an adult or a child, so we allow the user to overwrite the predicted body types. For example, we can first run the command with --dryrun to run the body type classification for each track; the results are written to ./results/giphy/sampled_tracks.
python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --dryrun
Then, we can run it again and overwrite the tracks we want, e.g. --track_overwrite "{2: 'infant', 11: 'infant'}".
python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --keep contains_only_both --ground_anchor child_bottom --save_gif --track_overwrite "{2: 'infant', 11: 'infant'}"
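For intuition, below is a minimal, hypothetical sketch of the ground-plane computation described above; it is not the code HARMONI runs. It fits a plane to the floor depth points by least squares and penalizes the distance of each person's ankles from the plane through the chosen anchor point. The variable names and the penalty form are assumptions.

# Hypothetical sketch of the ground-plane constraint (not the actual HARMONI code).
# floor_points: (N, 3) 3D points back-projected from floor pixels (e.g. depth plus
# a floor segmentation mask). ankles: (K, 3) estimated 3D ankle positions.
import numpy as np

def fit_ground_plane(floor_points):
    """Least-squares plane fit; returns a unit normal and a point on the plane."""
    centroid = floor_points.mean(axis=0)
    # The right singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(floor_points - centroid)
    normal = vt[-1] / np.linalg.norm(vt[-1])
    return normal, centroid

def ground_penalty(ankles, normal, anchor_point):
    """Penalize ankles that are off the plane through anchor_point with this normal.
    anchor_point would be the mean child (or adult) ankle position, depending on
    --ground_anchor."""
    signed_dist = (ankles - anchor_point) @ normal
    return np.mean(signed_dist ** 2)

In the pipeline, a term like this would be added to the optimization described above; the sketch only illustrates the geometry.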
If you turn on the --add_downstream flag, the downstream stats will be overlaid on the rendered results.
For this 60-frame video clip, the typical runtime on a single NVIDIA TITAN RTX GPU is 20 seconds for body pose estimation (excluding data preprocessing and rendering). Data preprocessing (i.e., running OpenPose, ground normal estimation, etc.) took 2 minutes, and rendering took about 1 second per frame.
Before you run this, make sure to follow the additional installation instructions in audio/README.md
and rebuild the x-vector extractor file.
Here, we show the result on a publicly available demo video. Please download it and place it at data/demo/seedlings.mp4.
cd audio
python run.py ../data/demo/seedlings.mp4 ../results/seedlings/
- preprocess # code for preprocessing: downsample, shot detection, ground plane estimation
- trackers # tracking
- detectors # e.g. openpose, midas, body type classifier
- hps # human pose and shape models. e.g. DAPA
- postprocess # code for refinement. e.g. SMPLify, One Euro Filter.
- visualization # renderers and helpers for visualization
- downstream # code for downstream analysis
- audio # audio code
- data
  - cfgs # configurations
  - demo # a short demo video
  - body_models
    - SMPL
    - SMIL
    - SMPLX
  - ckpts # model checkpoints
- _DATA # data for running PHALP. It should be downloaded automatically.
Output folder structure
- openpose
- sampled_tracks
- render
- results.pkl
- dataset.pkl
- result.mp4 # if --save_video is on
- result.gif # if --save_gif is on
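To inspect the outputs programmatically, something like the minimal sketch below can be used. It assumes only that results.pkl is a standard Python pickle; the exact keys and structure depend on your configuration, so print the structure rather than relying on the names shown here.

# Minimal sketch for inspecting an output folder (assumes results.pkl is a
# standard Python pickle; the exact contents depend on your configuration).
import pickle

with open("results/giphy/results.pkl", "rb") as f:
    results = pickle.load(f)

# Print the top-level structure before relying on any particular key names.
if isinstance(results, dict):
    for key, value in results.items():
        print(key, type(value))
else:
    print(type(results), len(results) if hasattr(results, "__len__") else "")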
We borrowed code from the amazing resources below:
- PARE for HMR-related helpers.
- PHALP for tracking.
- MiDaS for depth estimation.
- Panoptic DeepLab for segmentation.
- size_depth_disambiguation for ground normal estimation.
- OpenPose for 2D keypoint estimation.
Zhenzhen Weng (zzweng AT stanford DOT edu)
If you find this work useful, please consider citing:
@article{
weng2025harmoni,
author = {Zhenzhen Weng and Laura Bravo-S\'anchez and Zeyu Wang and Christopher Howard and Maria Xenochristou and Nicole Meister and Angjoo Kanazawa and Arnold Milstein and Elika Bergelson and Kathryn L. Humphreys and Lee M. Sanders and Serena Yeung-Levy },
title = {Artificial intelligence–powered 3D analysis of video-based caregiver-child interactions},
journal = {Science Advances},
volume = {11},
number = {8},
pages = {eadp4422},
year = {2025},
doi = {10.1126/sciadv.adp4422},
URL = {https://www.science.org/doi/abs/10.1126/sciadv.adp4422}}