
HARMONI: Using 3D Computer Vision and Audio Analysis to Quantify Caregiver–Child Behavior and Interaction from Videos

Repository Overview

Installation

Tested on a Linux machine (Ubuntu 20.04.4 LTS) with a single NVIDIA TITAN RTX GPU, CUDA 11.3, and Python 3.9.

  1. Install Miniconda, then install the packages:
conda create -n harmoni_visual python==3.9
conda activate harmoni_visual
./install_visual.sh
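
A quick sanity check after installation (assuming install_visual.sh installs PyTorch with CUDA support; adjust if the script installs different packages):

import torch

# Expect CUDA to be available on the tested CUDA 11.3 setup.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())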

To install the environment for the audio part, run:

cd audio
conda env create -f process_audio.yml
conda activate process_audio

Installing either the visual or the audio environment should take around 5 to 10 minutes.

  2. Download the data folder that includes model checkpoints and other dependencies here.

  3. We provide example output from public video clips. You can download it here.

Visualization of the example clip.

Please see below for instructions on reproducing the visual results.

Running the HARMONI visual model on a demo video

Here we show how to run HARMONI on a public video clip. A basic command is:

python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --keep contains_only_both --save_gif

To reproduce the provided results, please use the two commands below instead.

Two configurations significantly improve the reconstruction quality:

  1. Using the detected ground plane as an additional constraint (--ground_constraint). As in Ugrinovic et al., the ground normal is estimated by fitting a plane to the floor depth points, and --ground_anchor ("child_bottom" | "adult_bottom") specifies whether we use the mean ankle positions of children or of adults as the anchor point for the ground plane. We then run optimization on all humans and encourage their ankles to lie on the ground plane (see the plane-fitting sketch after the commands below).

  2. Overwriting the classified tracks. It is hard for a model to accurately predict whether a detected human is an adult or a child, so we allow the user to overwrite the predicted body types. For example, we can first run the command with --dryrun to run body type classification for each track. The results are written to ./results/giphy/sampled_tracks.

python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --dryrun

Then, we can run it again with the tracks we want to overwrite, e.g. --track_overwrite "{2: 'infant', 11: 'infant'}" (see the parsing note after the command below).

python main.py --config data/cfgs/harmoni.yaml --video data/demo/giphy.gif --out_folder ./results/giphy --keep contains_only_both --ground_anchor child_bottom --save_gif --track_overwrite "{2: 'infant', 11: 'infant'}"
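
A note on the --track_overwrite value: it looks like a Python dict literal mapping track IDs to body types. One plausible way such a flag could be parsed (an assumption, not necessarily this repository's code) is ast.literal_eval:

import ast

# Illustrative parsing of a dict-valued CLI flag: keys are track IDs,
# values are the body types to force.
raw = "{2: 'infant', 11: 'infant'}"
overwrites = ast.literal_eval(raw)
print(overwrites)  # {2: 'infant', 11: 'infant'}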
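
For intuition, here is a minimal sketch of the plane-fitting idea behind --ground_constraint. It is not this repository's implementation; the function names and loss wiring are illustrative assumptions.

import numpy as np

def fit_ground_plane(floor_points):
    """Least-squares plane fit to an (N, 3) array of floor depth points.
    Returns a unit normal and a point on the plane (the centroid)."""
    centroid = floor_points.mean(axis=0)
    # The right singular vector with the smallest singular value of the
    # centered points is the direction of least variance: the normal.
    _, _, vh = np.linalg.svd(floor_points - centroid, full_matrices=False)
    normal = vh[-1]
    # The sign of the normal is ambiguous; orient it per your camera convention.
    return normal / np.linalg.norm(normal), centroid

def ankle_plane_residuals(ankles, normal, point):
    """Signed point-to-plane distances for (M, 3) ankle positions. An
    optimization loss could penalize the squared residuals to encourage
    ankles to lie on the ground plane."""
    return (ankles - point) @ normal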

If the --add_downstream flag is turned on, the downstream stats are overlaid on the results.

Run time

For this 60-frame video clip, the typical run time on a single NVIDIA TITAN RTX GPU is 20 seconds for the body pose estimation (excluding data preprocessing and rendering). Data preprocessing (i.e., running OpenPose, ground normal estimation, etc.) took 2 minutes. Rendering took 1 sec/frame.

Running HARMONI audio model on example data

Before you run this, make sure to follow the additional installation instructions in audio/README.md and rebuild the x-vector extractor file.

Here, we show the result on a publicly available demo video. Please download it and place it at data/demo/seedlings.mp4.

cd audio
python run.py ../data/demo/seedlings.mp4 ../results/seedlings/
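
To process several clips, a small driver can loop over them (a hypothetical convenience script; the positional arguments mirror the demo command above):

import subprocess
from pathlib import Path

# Hypothetical batch driver for the audio pipeline: one output folder
# per input video, using the same argument order as the demo command.
for video in sorted(Path("../data/demo").glob("*.mp4")):
    out_dir = Path("../results") / video.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["python", "run.py", str(video), f"{out_dir}/"], check=True)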

Code structure

- preprocess # code for preprocessing: downsample, shot detection, ground plane estimation
- trackers # tracking
- detectors  # e.g. openpose, midas, body type classifier
- hps # human pose and shape models. e.g. DAPA
- postprocess  # code for refinement. e.g. SMPLify, One Euro Filter.
- visualization  # renderers and helpers for visualization
- downstream # code for downstream analysis
- audio # audio code
- data
    - cfgs  # configurations
    - demo  # a short demo video
    - body_models
        - SMPL
        - SMIL
        - SMPLX
    - ckpts # model checkpoints
- _DATA # data for running PHALP. It should be downloaded automatically.
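
For reference, the One Euro Filter mentioned under postprocess is a standard adaptive low-pass filter for jittery pose signals (Casiez et al., 2012). A minimal, self-contained sketch, not this repository's implementation:

import math

class OneEuroFilter:
    """Adaptive low-pass filter: heavy smoothing at low speeds to kill
    jitter, lighter smoothing at high speeds to reduce lag."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.freq = freq              # sampling rate in Hz
        self.min_cutoff = min_cutoff  # baseline cutoff frequency
        self.beta = beta              # speed coefficient
        self.d_cutoff = d_cutoff      # cutoff for the derivative filter
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor for an exponential filter at this cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.x_prev is None:       # first sample: nothing to smooth
            self.x_prev = x
            return x
        dx = (x - self.x_prev) * self.freq               # raw derivative
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev   # smoothed derivative
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)  # speed-adaptive cutoff
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev          # exponential smoothing
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat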

Output folder structure

- openpose
- sampled_tracks
- render
- results.pkl
- dataset.pkl
- result.mp4  # if --save_video is on
- result.gif  # if --save_gif is on
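
The pickled outputs can be inspected directly. The schema of results.pkl is not documented here, so treat the snippet below as exploratory:

import pickle

# Exploratory look at the pipeline output; the path matches the demo
# commands above, and the exact schema is an assumption to verify.
with open("results/giphy/results.pkl", "rb") as f:
    results = pickle.load(f)

print(type(results))
if isinstance(results, dict):
    print(list(results.keys()))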

Related Resources

We borrowed code from the amazing resources below:

Contact

Zhenzhen Weng (zzweng AT stanford DOT edu)

Citation

If you find this work useful, please consider citing:

@article{weng2025harmoni,
  author  = {Zhenzhen Weng and Laura Bravo-S\'anchez and Zeyu Wang and Christopher Howard and Maria Xenochristou and Nicole Meister and Angjoo Kanazawa and Arnold Milstein and Elika Bergelson and Kathryn L. Humphreys and Lee M. Sanders and Serena Yeung-Levy},
  title   = {Artificial intelligence–powered 3D analysis of video-based caregiver-child interactions},
  journal = {Science Advances},
  volume  = {11},
  number  = {8},
  pages   = {eadp4422},
  year    = {2025},
  doi     = {10.1126/sciadv.adp4422},
  url     = {https://www.science.org/doi/abs/10.1126/sciadv.adp4422}
}
