[Avatar] Talking Face Dataset and Solutions
Datasets
Ref: FaceFormer: Speech-Driven 3D Facial Animation with Transformers
Given the raw audio input and a neutral 3D face mesh, our proposed end-to-end Transformer-based architecture, dubbed FaceFormer, can autoregressively synthesize a sequence of realistic 3D facial motions with accurate lip movements.
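For orientation, here is a minimal sketch of that autoregressive loop: speech features condition a Transformer decoder that emits one frame of vertex offsets at a time, added on top of the neutral mesh. Every module name below is a hypothetical placeholder, not FaceFormer's actual code.

```python
import torch

# A minimal sketch of the autoregressive decoding idea described above. The
# modules (audio_encoder, motion_decoder, vertex_head) are hypothetical
# placeholders, not FaceFormer's actual API.
@torch.no_grad()
def synthesize(audio_encoder, motion_decoder, vertex_head, audio, template, num_frames):
    """audio: (1, samples) raw waveform; template: (1, V*3) flattened neutral mesh."""
    audio_feats = audio_encoder(audio)                 # (1, T, D) speech features
    motions = torch.zeros(1, 0, template.shape[-1])    # past vertex offsets, empty at t=0
    for _ in range(num_frames):
        hidden = motion_decoder(motions, audio_feats)  # attend over audio + past frames
        offset = vertex_head(hidden[:, -1:])           # next frame's (1, 1, V*3) offsets
        motions = torch.cat([motions, offset], dim=1)
    return motions + template.unsqueeze(1)             # offsets are relative to the neutral face
```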
Dependencies
- Check the required Python packages in requirements.txt.
- ffmpeg
- MPI-IS/mesh: this package contains core functions for manipulating meshes and visualizing them. It requires Python 3.5+ and is supported on Linux and macOS.
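A quick sanity check that these dependencies resolve (a sketch; `psbody.mesh` is the import name the MPI-IS/mesh package installs under, and the check assumes ffmpeg is on your PATH):

```python
import shutil

# ffmpeg must be on the PATH for rendering; MPI-IS/mesh installs psbody.mesh.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
from psbody.mesh import Mesh  # import-after-check is intentional
print("dependencies OK")
```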
Data
VOCASET
Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files data_verts.npy, raw_audio_fixed.pkl, templates.pkl, and subj_seq_to_idx.pkl in the folder VOCASET. Download "FLAME_sample.ply" from VOCA and put it in VOCASET/templates.
VOCASET is a 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio. The dataset has 12 subjects and 480 sequences of about 3-4 seconds each, with sentences chosen from an array of standard protocols that maximize phonetic diversity.
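Once the files are in place, loading them might look like the sketch below. The nested key layout (subject -> sentence -> frame index) follows how the VOCA code reads these pickles, but treat it as an assumption and inspect the dicts yourself; the Python 2 pickles typically need `encoding="latin1"`.

```python
import pickle
import numpy as np

# Memory-map the big vertex array; assumed shape (num_frames, 5023, 3)
# in FLAME topology.
data_verts = np.load("VOCASET/data_verts.npy", mmap_mode="r")

with open("VOCASET/templates.pkl", "rb") as f:
    templates = pickle.load(f, encoding="latin1")   # subject -> (5023, 3) neutral mesh

with open("VOCASET/subj_seq_to_idx.pkl", "rb") as f:
    seq_to_idx = pickle.load(f, encoding="latin1")  # subject -> sentence -> {frame: row index}

# Pull one sequence of vertex frames as a sanity check.
subject = next(iter(seq_to_idx))
sentence = next(iter(seq_to_idx[subject]))
rows = list(seq_to_idx[subject][sentence].values())
print(subject, sentence, data_verts[rows].shape)
```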
BIWI
Request the BIWI dataset from Biwi 3D Audiovisual Corpus of Affective Communication. The dataset contains the following subfolders:
- 'faces' contains the binary (.vl) files for the tracked facial geometries.
- 'rigid_scans' contains the templates stored as .obj files.
- 'audio' contains audio signals stored as .wav files.

Place the folders 'faces' and 'rigid_scans' in BIWI and place the wav files in BIWI/wav.
In total, the corpus contains 1109 sentences uttered by 14 native English speakers (6 males and 8 females).
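A small sketch to verify the expected layout after moving things around (folder names follow the instructions above; the root path and file extensions are assumptions):

```python
from pathlib import Path

# Count the files in each expected BIWI subfolder; adjust root to your setup.
root = Path("BIWI")
for sub, ext in (("faces", "*.vl"), ("rigid_scans", "*.obj"), ("wav", "*.wav")):
    n = len(list((root / sub).rglob(ext)))
    print(f"{sub}: {n} {ext} files")
```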
We gratefully acknowledge ETHZ-CVL for providing the B3D(AC)2 database and MPI-IS for releasing the VOCASET dataset. The implementation of wav2vec2 is built upon huggingface-transformers, and the temporal bias is modified from ALiBi. We use MPI-IS/mesh for mesh processing and VOCA/rendering for rendering. We thank the authors for their excellent works. Any third-party packages are owned by their respective authors and must be used under their respective licenses.
CodeTalker uses the same datasets.
Real-Time Talking Face Solutions
https://github.com/lipku/metahuman-stream/tree/main
Interesting Talking Face Solutions
Wav2Lip needs no introduction, but its results are honestly not great.
OpenTalker
Tencent's digital human work:
video-retalking: the mouth region looks rather large. The training code is deliberately withheld; only inference code is released.
MuseV
MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising
MuseTalk
MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
Link: https://www.youtube.com/watch?v=L5g5tuH_4a8. Can it also do mouth replacement?
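The title names the core idea: mask the mouth region of the face in VAE latent space and let an audio-conditioned network inpaint it. A schematic sketch under that reading, where every module name is a hypothetical placeholder rather than MuseTalk's actual API:

```python
import torch

# Schematic latent-space inpainting for lip sync. vae and unet are
# hypothetical placeholder modules, not MuseTalk's real components.
@torch.no_grad()
def lipsync_frame(vae, unet, face_img, audio_feat):
    latent = vae.encode(face_img)               # (1, C, h, w) image latent
    masked = latent.clone()
    masked[..., latent.shape[-2] // 2:, :] = 0  # zero out the lower (mouth) half
    filled = unet(masked, audio_feat)           # audio-conditioned inpainting
    return vae.decode(filled)                   # back to pixel space
```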
AniPortrait
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations
Author: Huawei Wei, Zejun Yang, Zhisheng Wang
Organization: Tencent Games Zhiji, Tencent
Link: https://github.com/Zejun-Yang/AniPortrait
EMOPortraits
This repo is reserved for the official implementation of EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars. The code is planned to be released in July 2024.