HARMONI: Using 3D Computer Vision and Audio Analysis to Quantify Caregiver–Child Behavior and Interaction from Videos
- System requirements and installation guide
- Download dependency data
- Demo on a video clip
- Demo on an audio clip
- Code structure
- Related resources
- Contact
Tested on a Linux machine with a single NVIDIA TITAN RTX GPU (Ubuntu 20.04.4 LTS, CUDA 11.3, Python 3.9).
- Install Miniconda, then create the environment and install the packages:
conda create -n harmoni_visual python==3.9
conda activate harmoni_visual
./install_visual.sh
To install the environment for the audio part, run:
cd audio
conda env create -f process_audio.yml
conda activate process_audio
Installing either the visual or the audio environment should take around 5 to 10 minutes.
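A quick way to verify the visual environment is to check that PyTorch can see the GPU. This is a minimal sanity check, assuming install_visual.sh installs PyTorch (required by the pose and tracking models); it is not part of the pipeline itself.

# Minimal sanity check for the visual environment (run inside `harmoni_visual`).
# Assumes PyTorch is installed by install_visual.sh; adjust if your setup differs.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # The demo timings below were measured on a single NVIDIA TITAN RTX.
    print("GPU:", torch.cuda.get_device_name(0))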
- Download the data folder that includes model checkpoints and other dependencies here.
- We provide example outputs from public video clips. You can download them here.
Visualization of the example clip.
Please see below for instructions on reproducing the visual results. Here we show how to run HARMONI on a public video clip. A basic command is:
python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --keep contains_only_both --save_gif
To reproduce the provided results, please use the below two commands instead.
- Using the detected ground plane as an additional constraint (--ground_constraint). As in Ugrinovic et al., the ground normal is estimated by fitting a plane to the floor depth points, and --ground_anchor ("child_bottom" | "adult_bottom") specifies whether the mean ankle positions of the children or the adults are used as the anchor point for the ground plane. We then run optimization on all humans and encourage their ankles to lie on the ground plane (a sketch of this geometry is shown after the commands below).
- Overwriting the classified tracks. It is hard for a model to accurately predict whether a detected human is an adult or a child, so we allow the user to overwrite the predicted body types. For example, we can first run the command with --dryrun to run the body type classification for each track; the results are written to ./results/giphy/sampled_tracks.
python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --dryrun
Then, we can run it again and overwrite the tracks we want, e.g. --track_overwrite "{2: 'infant', 11: 'infant'}".
python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --keep contains_only_both --ground_anchor child_bottom --save_gif --track_overwrite "{2: 'infant', 11: 'infant'}"
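For intuition, below is a minimal, hypothetical sketch of the ground-plane computation described above; it is not the code HARMONI runs. It fits a plane to the floor depth points by least squares and penalizes the distance of each person's ankles from the plane through the chosen anchor point. The variable names and the penalty form are assumptions.

# Hypothetical sketch of the ground-plane constraint (not the actual HARMONI code).
# floor_points: (N, 3) 3D points back-projected from floor pixels (e.g. depth plus
# a floor segmentation mask). ankles: (K, 3) estimated 3D ankle positions.
import numpy as np

def fit_ground_plane(floor_points):
    """Least-squares plane fit; returns a unit normal and a point on the plane."""
    centroid = floor_points.mean(axis=0)
    # The right singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(floor_points - centroid)
    normal = vt[-1] / np.linalg.norm(vt[-1])
    return normal, centroid

def ground_penalty(ankles, normal, anchor_point):
    """Penalize ankles that are off the plane through anchor_point with this normal.
    anchor_point would be the mean child (or adult) ankle position, depending on
    --ground_anchor."""
    signed_dist = (ankles - anchor_point) @ normal
    return np.mean(signed_dist ** 2)

In the pipeline, a term like this would be added to the optimization described above; the sketch only illustrates the geometry.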
If you turn on the --add_downstream flag, the downstream stats will be overlaid on the rendered results.
For this 60-frame video clip, the typical runtime on a single NVIDIA TITAN RTX GPU is 20 seconds for body pose estimation (excluding data preprocessing and rendering). Data preprocessing (i.e., running OpenPose, ground normal estimation, etc.) took 2 minutes, and rendering took about 1 second per frame.
Before you run this, make sure to follow the additional installation instructions in audio/README.md
and rebuild the x-vector extractor file.
Here, we show the result on a publicly available demo video. Please download it and place it at data/demo/seedlings.mp4.
cd audio
python run.py ../data/demo/seedlings.mp4 ../results/seedlings/
- preprocess # code for preprocessing: downsample, shot detection, ground plane estimation
- trackers # tracking
- detectors # e.g. openpose, midas, body type classifier
- hps # human pose and shape models. e.g. DAPA
- postprocess # code for refinement. e.g. SMPLify, One Euro Filter.
- visualization # renderers and helpers for visualization
- downstream # code for downstream analysis
- audio # audio code
- data
  - cfgs # configurations
  - demo # a short demo video
  - body_models
    - SMPL
    - SMIL
    - SMPLX
  - ckpts # model checkpoints
- _DATA # data for running PHALP. It should be downloaded automatically.
Output folder structure
- openpose
- sampled_tracks
- render
- results.pkl
- dataset.pkl
- result.mp4 # if --save_video is on
- result.gif # if --save_gif is on
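To inspect the outputs programmatically, something like the minimal sketch below can be used. It assumes only that results.pkl is a standard Python pickle; the exact keys and structure depend on your configuration, so print the structure rather than relying on the names shown here.

# Minimal sketch for inspecting an output folder (assumes results.pkl is a
# standard Python pickle; the exact contents depend on your configuration).
import pickle

with open("results/giphy/results.pkl", "rb") as f:
    results = pickle.load(f)

# Print the top-level structure before relying on any particular key names.
if isinstance(results, dict):
    for key, value in results.items():
        print(key, type(value))
else:
    print(type(results), len(results) if hasattr(results, "__len__") else "")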
We borrowed code from the amazing resources below:
- PARE for HMR-related helpers.
- PHALP for tracking.
- MiDaS for depth estimation.
- Panoptic DeepLab for segmentation.
- size_depth_disambiguation for ground normal estimation.
- OpenPose for 2D keypoint estimation.
Zhenzhen Weng (zzweng AT stanford DOT edu)
If you find this work useful, please consider citing:
@article{
weng2025harmoni,
author = {Zhenzhen Weng and Laura Bravo-S\'anchez and Zeyu Wang and Christopher Howard and Maria Xenochristou and Nicole Meister and Angjoo Kanazawa and Arnold Milstein and Elika Bergelson and Kathryn L. Humphreys and Lee M. Sanders and Serena Yeung-Levy },
title = {Artificial intelligence–powered 3D analysis of video-based caregiver-child interactions},
journal = {Science Advances},
volume = {11},
number = {8},
pages = {eadp4422},
year = {2025},
doi = {10.1126/sciadv.adp4422},
URL = {https://www.science.org/doi/abs/10.1126/sciadv.adp4422}}