awesome-avatar

This is a repository for organizing papers, code, and other resources related to the topic of Avatar (talking-face and talking-body).

🔆 This project is still ongoing; pull requests are welcome!

If you have any suggestions (missing papers, new papers, key researchers or typos), please feel free to edit and open a pull request.

News

2024.09.07: add ASR and TTS tools
2024.08.24: add background on image/video generation
2024.08.24: re-organize paper list with table formatting
2024.08.24: add works about full-body avatar synthesis

TO DO LIST

Main paper list
Researchers list
Toolbox for avatar
Add paper link
Add paper notes
Add code if available
Add project page if available
Datasets and metrics
Related links

Researchers and labs

NVIDIA Research
- Neural rendering models for human generation: vid2vid NeurIPS'18, fs-vid2vid NeurIPS'19, EG3D CVPR'22;
- Talking-face synthesis: face-vid2vid CVPR'21, Implicit NeurIPS'22, SPACE ICCV'23, One-shot Neural Head Avatar arXiv'23;
- Talking-body synthesis: DreamPose ICCV'23;
- Face enhancement (relighting, restoration, etc): Lumos SIGGRAPH Asia 2022, RANA ICCV'23;
- Authorized use of synthetic videos: Avatar Fingerprinting arXiv'23;
Aliaksandr Siarohin @ Snap Research
- Neural rendering models for human generation (focus on flow-based generative models): Unsupervised-Volumetric-Animation CVPR'23, 3DAvatarGAN CVPR'23, 3D-SGAN ECCV'22, Articulated-Animation CVPR'21, Monkey-Net CVPR'19, FOMM NeurIPS'19;
Ziwei Liu @ Nanyang Technological University
- Talking-face synthesis: StyleSync CVPR'23, AV-CAT SIGGRAPH Asia 2022, StyleGANEX ICCV'23, StyleSwap ECCV'22, PC-AVS CVPR'21, Speech2Talking-Face IJCAI'21, VToonify SIGGRAPH Asia 2022;
- Talking-body synthesis: MotionDiffuse arXiv'22;
- Face enhancement (relighting, restoration, etc): Relighting4D ECCV'22;
Xiaodong Cun @ Tencent AI Lab
- Talking-face synthesis: StyleHEAT ECCV'22, VideoReTalking SIGGRAPH Asia'22, ToonTalker ICCV'23, DPE CVPR'23, CodeTalker CVPR'23, SadTalker CVPR'23;
- Talking-body synthesis: LivelySpeaker ICCV'23;

Max Planck Institute for Intelligent Systems
- 3D face models (e.g., 3DMM): FLAME SIGGRAPH Asia 2017;

Papers

Image and video generation

Model	Paper	Blog	Codebase	Note
StyleGANv3	Alias-Free Generative Adversarial Networks, NVIDIA, NeurIPS 2021	The Evolution of StyleGAN: Introduction	Code	high-fidelity face generation
Stable Diffusion	High-Resolution Image Synthesis with Latent Diffusion Models, Heidelberg University, CVPR 2022	What are Diffusion Models?	Code	diverse, high-quality images
Stable Video Diffusion	Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets, Stability AI, arXiv 2023	Diffusion Models for Video Generation	Code
DiT	Scalable Diffusion Models with Transformers, Meta, ICCV 2023	Diffusion Transformed	Code	magic behind OpenAI Sora
VQ-VAE	Neural Discrete Representation Learning, DeepMind, NIPS 2017	OpenAI's DALL-E 2 and DALL-E 1 Explained		magic behind OpenAI DALL-E
NeRF	NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, UC Berkeley, ECCV 2020	NeRF Explosion 2020	Code	3D synthesis via volume rendering
3DGS	3D Gaussian Splatting for Real-Time Radiance Field Rendering, Inria, SIGGRAPH 2023	A Comprehensive Overview of Gaussian Splatting	Code	real-time 3D rendering

3D Avatar (face+body)

Conference	Paper	Affiliation	Codebase	Notes
CVPR 2021	Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors	Tsinghua University	Dataset
ECCV 2022	HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling	Shanghai Artificial Intelligence Laboratory	Dataset
SIGGRAPH 2023	AvatarReX: Real-time Expressive Full-body Avatars	Tsinghua University	Dataset
arXiv 2024	A Survey on 3D Human Avatar Modeling - From Reconstruction to Generation	The University of Hong Kong
arXiv 2024	From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations	Meta Reality Labs Research	Code	conversational avatar
CVPR 2024	Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling	Tsinghua University	Code
CVPR 2024	4K4D: Real-Time 4D View Synthesis at 4K Resolution	Zhejiang University	Code	real-time synthesis with 3DGS

2D talking-face synthesis

Conference	Paper	Affiliation	Codebase	Training Code	Notes
MM 2020	Wav2Lip: Accurately Lip-sync Videos to Any Speech	International Institute of Information Technology (IIIT), Hyderabad, India	Code	✅	most accurate lip-sync model, but low video quality (`96*96`), pre-trained on ~`180` hours of video data from LRS2
MM 2021	Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis	Tsinghua University	Code
CVPR 2021	Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation	The Chinese University of Hong Kong	Code		contrastive learning on audio-lip
ICCV 2021	PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering	Peking University	Code
ECCV 2022	StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN	Tsinghua University	Code		high-fidelity synthesis via StyleGAN
SIGGRAPH Asia 2022	VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild	Xidian University	Code
AAAI 2023	DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video	Virtual Human Group, Netease Fuxi AI Lab	Code	✅	accurate lip-sync and high-quality synthesis (`256*256`)
CVPR 2023	SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation	Xi'an Jiaotong University	Code, Note
arXiv 2023	DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models	Tsinghua University	Code		diffusion
	MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting	Tencent TMElyralab
arXiv 2024	LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control	Kuaishou Technology	Code		face reenactment with micro-expression
arXiv 2024	EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions	Ant Group	Code		accurate lip-sync on Chinese speakers, diffusion, pre-trained on `540 hours` of cleaned video data (collected from the internet)
arXiv 2024	Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation	Fudan University	Code	✅	accurate lip-sync, diffusion, pre-trained on `164 hours` of cleaned video data (155 hours from the internet and 9 hours from HDTF)
arXiv 2024	Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency	Zhejiang University and ByteDance			expressive animation driven by audio only, pre-trained on `160 hours` of cleaned video data (collected from the internet)

3D talking-face synthesis

Conference	Paper	Affiliation	Codebase
ICCV 2021	AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis	University of Science and Technology of China	Code
ECCV 2022	Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis	Tsinghua University	Code
ICLR 2023	GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis	Zhejiang University	Code
ICCV 2023	Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis	Beihang University	Code
arXiv 2023	GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation	Zhejiang University	Code
CVPR 2024	SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis	Renmin University of China	Code
ECCV 2024	TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting	Beihang University	Code

Talking-body synthesis

Pose2video

Conference	Paper	Affiliation	Codebase	Notes
NeurIPS 2018	Video-to-Video Synthesis	NVIDIA	Code
ICCV 2019	Everybody Dance Now	UC Berkeley	Code
arXiv 2023	Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation	Alibaba Group	Code
CVPR 2024	MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model	National University of Singapore	Code
arXiv 2024	Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance	Nanjing University	Code
Github repo	MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising	Tencent TMElyralab	Code
Github repo	MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation	Tencent	Code ⭐
arXiv 2024	ControlNeXt: Powerful and Efficient Control for Image and Video Generation	The Chinese University of Hong Kong	Code	stable video diffusion
arXiv 2024	CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention	Zhejiang University and ByteDance		pre-trained on `200 hours` of video data and more than `10k` unique identities

Datasets

Talking-face

Audio-Visual Datasets for English Speakers
Dataset name	Environment	Year	Resolution	Subject	Duration	Sentence
VoxCeleb1	Wild	2017	360p~720p	1251	352 hours	100k
VoxCeleb2	Wild	2018	360p~720p	6112	2442 hours	1128k
HDTF	Wild	2020	720p~1080p	300+	15.8 hours
LSP	Wild	2021	720p~1080p	4	18 minutes	100k
Audio-Visual Datasets for Chinese Speakers
Dataset name	Environment	Year	Resolution	Subject	Duration	Sentence
CMLR	Lab	2019		11		102k
MAVD	Lab	2023	1920x1080	64	24 hours	12k
CN-Celeb	Wild	2020		3000	1200 hours
CN-Celeb-AV	Wild	2023		1136	660 hours
CN-CVS	Wild	2023		2500+	300+ hours

Metrics

Talking-face

Lip-Sync
Metric name	Description	Code/Paper
LMD↓	Mouth landmark distance
MA↑	The Intersection-over-Union (IoU) between the predicted mouth area and the ground-truth mouth area
Sync↑	The confidence score from SyncNet (Sync)	wav2lip
LSE-C↑	Lip Sync Error - Confidence	wav2lip
LSE-D↓	Lip Sync Error - Distance	wav2lip
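
As a rough illustration of the first two lip-sync metrics above, here is a minimal pure-Python sketch. It assumes LMD is the mean Euclidean distance between corresponding mouth landmarks and MA is the IoU of two binary mouth masks; individual papers may normalize or average differently. The function names `landmark_distance` and `mask_iou` are hypothetical and not taken from any listed codebase.

```python
import math


def landmark_distance(pred, target):
    """Mean Euclidean distance between corresponding (x, y) mouth landmarks.

    Assumption: both inputs list the same landmarks in the same order.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("landmark lists must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def mask_iou(pred_mask, target_mask):
    """Intersection-over-Union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred_mask, target_mask):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


# Example: one landmark matches exactly, the other is off by a 3-4-5 triangle.
lmd = landmark_distance([(0, 0), (3, 4)], [(0, 0), (0, 0)])
iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

For a video, these per-frame values would typically be averaged over all frames.
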
Image Quality (identity preserving)
Metric name	Description	Code/Paper
MAE↓	Mean Absolute Error metric for image	mmagic
MSE↓	Mean Squared Error metric for image	mmagic
PSNR↑	Peak Signal-to-Noise Ratio	mmagic
SSIM↑	Structural similarity for image	mmagic
FID↓	Fréchet Inception Distance	mmagic
IS↑	Inception score	mmagic
NIQE↓	Natural Image Quality Evaluator metric	mmagic
CSIM↑	The cosine similarity of identity embedding	InsightFace
CPBD↑	The cumulative probability blur detection	python-cpbd
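
The simpler pixel-level metrics above (MAE, MSE, PSNR) and the cosine similarity behind CSIM can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not the mmagic or InsightFace API: images are assumed to be flat sequences of pixel values in [0, 255], and identity embeddings are assumed to be precomputed by a face recognition model. All function names here are hypothetical.

```python
import math


def _check(a, b):
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")


def mae(a, b):
    """Mean absolute error between two flat pixel sequences."""
    _check(a, b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mse(a, b):
    """Mean squared error between two flat pixel sequences."""
    _check(a, b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def cosine_similarity(u, v):
    """Cosine similarity of two embedding vectors (the quantity CSIM reports)."""
    _check(u, v)
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
```

SSIM, FID, IS, NIQE and CPBD need windowed statistics or pretrained networks, so the linked implementations are the better starting point for those.
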
Diversity
Metric name	Description	Code/Paper
Diversity of head motions↑	Standard deviation of the head-motion feature embeddings extracted from the generated frames using Hopenet (Ruiz et al., 2018)	SadTalker
Beat Align Score↑	Alignment between the audio and the generated head motions, computed as in Bailando (Siyao et al., 2022)	SadTalker
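
The head-motion diversity score can be sketched as follows, assuming one feature embedding per generated frame has already been extracted (for example with Hopenet). How the per-dimension deviations are reduced to a single number is an assumption here (a plain mean over dimensions); the SadTalker evaluation code may differ. The function name `head_motion_diversity` is hypothetical.

```python
from statistics import pstdev


def head_motion_diversity(embeddings):
    """Mean per-dimension population standard deviation across frames.

    `embeddings` is a list of equal-length feature vectors, one per frame.
    """
    if len(embeddings) < 2:
        raise ValueError("need embeddings from at least two frames")
    dims = list(zip(*embeddings))  # transpose: one tuple per feature dimension
    return sum(pstdev(d) for d in dims) / len(dims)


# Example: two frames with 3-D embeddings.
score = head_motion_diversity([(0.0, 0.0, 0.0), (2.0, 0.0, 4.0)])
```

A static head yields a score of zero, while larger values indicate more varied head motion.
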

Toolbox

A general toolbox for AIGC, including common metrics and models https://github.com/open-mmlab/mmagic
face3d: Python tools for processing 3D face https://github.com/yfeng95/face3d
3DMM model fitting using Pytorch https://github.com/ascust/3DMM-Fitting-Pytorch
OpenFace: a facial behavior analysis toolkit https://github.com/TadasBaltrusaitis/OpenFace
autocrop: Automatically detects and crops faces from batches of pictures https://github.com/leblancfg/autocrop
OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation https://github.com/CMU-Perceptual-Computing-Lab/openpose
GFPGAN: Practical Algorithm for Real-world Face Restoration https://github.com/TencentARC/GFPGAN
CodeFormer: Robust Blind Face Restoration https://github.com/sczhou/CodeFormer
metahuman-stream: Real time interactive streaming digital human https://github.com/lipku/metahuman-stream
EasyVolcap: a PyTorch library for accelerating neural volumetric video research https://github.com/zju3dv/EasyVolcap
3D Model in gradio https://www.gradio.app/guides/how-to-use-3D-model-component

Automatic Speech Recognition (ASR)

BELLE-2/Belle-whisper-large-v3-zh https://huggingface.co/BELLE-2/Belle-whisper-large-v3-zh
SenseVoice (multilingual) https://github.com/FunAudioLLM/SenseVoice 👍👍

Text to Speech (TTS)

CosyVoice, Alibaba Tongyi SpeechTeam https://github.com/FunAudioLLM/CosyVoice 👍👍
FireRedTTS, FireRedTeam https://github.com/FireRedTeam/FireRedTTS
GPT-SoVITS https://github.com/RVC-Boss/GPT-SoVITS

Speech to Speech (GPT-4o)

Mini-Omni, Tsinghua University https://github.com/gpt-omni/mini-omni
Speech To Speech, HuggingFace https://github.com/huggingface/speech-to-speech
