Real-time Interactive Talking Face.
Background
Ref: A summary of FLAME-based 3D face reconstruction techniques
Dataset
The 1999 paper "A Morphable Model For The Synthesis Of 3D Faces" proposed a method for building a face database, but did not release an open dataset.
In 2009, Pascal Paysan et al. used a laser scanner to precisely capture 200 face scans, producing the Basel Face Model, commonly called the BFM dataset. (The face basis it provides can represent only a very limited range of expressions.)
Another well-known dataset is FaceWarehouse, proposed in 2014, which likewise was not open-sourced.
The Max Planck Institute open-sourced FLAME in 2017; it is currently the most accurate open-source face model with the richest expressions.
Differences from BFM
The FLAME model looks quite similar to BFM, so what are the actual differences?
- FLAME is a head model, while BFM is a face model: FLAME outputs an entire 3D head, including the back of the head and the neck, which BFM does not cover;
- FLAME explicitly models neck rotation and eyeball rotation via LBS, which BFM cannot do;
- FLAME can represent much richer expressions than BFM: the original BFM was fit on data from only 200 subjects, whereas FLAME was fit on about 33,000 head scans (a noticeably more potent brew, so to speak).
That paper also mentions several good datasets that are available for use.
The GRID
[38] dataset was recorded in a laboratory setting with 34 volunteers, which is a relatively small number of speakers, but each volunteer spoke 1000 phrases, for a total of 34,000 utterance instances. The phrases follow a fixed structure: each contains six words, one chosen at random from each of six word categories, namely "command", "color", "preposition", "letter", "number", and "adverb". Dataset is available at https://spandh.dcs.shef.ac.uk//gridcorpus/, accessed on 30 December 2022.
LRW
[39], known for various speaking styles and head poses, is an English-speaking video dataset collected from the BBC program with over 1000 speakers. The vocabulary size is 500 words, and each video is 1.16 s long (29 frames), involving the target word and a context. Dataset is available at https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html, accessed on 30 December 2022.
LRW-1000
[40] is a Mandarin video dataset with a vocabulary of 1000 word classes, collected from more than 2000 speakers. The purpose of the dataset is to cover the natural variation of different speech modalities and imaging conditions, so as to reflect the challenges encountered in real-world applications. There are large variations in the number of samples per class, video resolution, lighting conditions, and speaker attributes such as pose, age, gender, and makeup. Note: the official URL (http://vipl.ict.ac.cn/en/view_database.php?id=13, accessed on 30 December 2022) is no longer available; see the paper page for details about the data, and download the agreement file here (https://github.com/VIPL-Audio-Visual-Speech-Understanding/AVSU-VIPL, accessed on 30 December 2022) if you plan to use this dataset for your research.
ObamaSet
[15] is a specific audio-visual dataset that focuses on analyzing the visual speech of former US President Barack Obama. All video samples are collected from his weekly address footage. Unlike previous datasets, it only focuses on Barack Obama and does not provide any human annotations. Dataset is available at https://github.com/supasorn/synthesizing_obama_network_training, accessed on 30 December 2022.
VoxCeleb2
[41] is extracted from YouTube videos, including the video URLs and utterance timestamps. It is currently the largest public audio-visual dataset. Although it was built for speaker recognition, it can also be used to train talking-head generation models. However, you must apply for download permission, which is required to prevent misuse of the dataset. The URL for the permission application is https://www.robots.ox.ac.uk/~vgg/data/voxceleb/, accessed on 30 December 2022. Because the dataset is huge, it requires 300 GB+ of storage space and dedicated download tools. The download method is available at https://github.com/walkoncross/voxceleb2-download, accessed on 30 December 2022.
VOCASET
[18] is a 4D-face dataset with approximately 29 min of 4D face scans recorded at 60 fps, together with synchronized audio from 12 speakers (six women and six men). As a representative high-quality 4D audio-visual face dataset, VOCASET greatly facilitates research on 3D VSG. Dataset is available at https://voca.is.tue.mpg.de/, accessed on 30 December 2022.
MEAD
[42], the Multi-View Emotional Audio-Visual Dataset, is a large-scale, high-quality emotional audio-video dataset. Unlike previous datasets, it focuses on facial generation for natural emotional speech and takes into account multiple emotional states (eight different emotions on three intensity levels). Dataset is available at https://wywu.github.io/projects/MEAD/MEAD.html, accessed on 30 December 2022.
HDTF
[43], the High-Definition Talking-Face Dataset, is a large in-the-wild high-resolution audio-visual dataset. It consists of approximately 362 different videos totaling 15.8 hours. The original videos are 720p or 1080p, and each cropped video is resized to 512 × 512. Dataset is available at https://github.com/MRzzm/HDTF, accessed on 30 December 2022.
The FLAME representation
FLAME uses standard vertex-based linear blend skinning (LBS) with corrective blendshapes, with N = 5023 vertices and K = 4 joints (neck, jaw, and eyeballs); the blendshapes are learned from data.
In terms of representation, FLAME borrows from the SMPL body model: it combines LBS (linear blend skinning) with blendshapes, using 5023 vertices and 4 joints.
Concretely, FLAME splits the head into four parts: left eyeball, right eyeball, jaw, and neck. Each part can rotate around its defined "joint" to produce a new 3D configuration. Let us now look at what LBS and blendshapes actually mean.
LBS describes how the vertices at the junction between two parts should move when the parts rotate relative to each other. The name is quite literal: linear blend skinning. As an example, in the video below the two cylinders represent two body parts; inside each cylinder is a rigid bone, the grey areas represent deformable skin, and the parts can rotate about a joint. When the two cylinders bend to an angle, the "skin" at the junction stretches, producing new "skin". In plain terms, LBS specifies how to compute the positions of that stretched "skin" when the bones rotate relative to one another.
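To make this concrete, below is a toy numpy sketch of LBS (not FLAME's actual code): each vertex is transformed by a weighted blend of the rigid transforms of the joints that influence it.

```python
# Toy linear blend skinning (LBS): blend per-joint rigid transforms per vertex.
import numpy as np

def lbs(vertices, joint_transforms, skinning_weights):
    """vertices: (N, 3); joint_transforms: (K, 4, 4); skinning_weights: (N, K)."""
    v_homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # (N, 4)
    blended = np.einsum('nk,kij->nij', skinning_weights, joint_transforms)    # (N, 4, 4)
    posed = np.einsum('nij,nj->ni', blended, v_homo)                          # (N, 4)
    return posed[:, :3]
```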
Blendshape: also called "morph target animation". A blendshape is best understood as a method rather than a piece of data (the name is somewhat confusing). As mentioned above, parametric models such as FLAME decouple the different attributes of a face; a blendshape takes the value of each attribute as input and outputs how the face mesh deforms, concretely the offsets of the 5023 vertices.
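And a correspondingly small sketch of the blendshape idea: the output mesh is the neutral template plus a weighted sum of per-attribute offset bases (the numbers and the "smile" label below are made up for illustration).

```python
# Toy blendshapes: deformed mesh = template + sum_i weight_i * offset_basis_i.
import numpy as np

N_VERTICES = 5023
template = np.zeros((N_VERTICES, 3))                         # neutral head mesh
basis = np.random.randn(10, N_VERTICES, 3) * 0.01            # 10 offset directions
weights = np.zeros(10)
weights[3] = 0.8                                             # e.g. a "smile" attribute at 80%

deformed = template + np.einsum('b,bvc->vc', weights, basis)  # (5023, 3)
```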
That is the overall framework of the FLAME representation. With these two concepts in hand, let us look at how different face/head models are generated from the parameters.
Driving the FLAME model with parameters
The FLAME model has three groups of parameters: shape, pose, and expression.
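A minimal sketch of driving FLAME from these three parameter groups, assuming the smplx Python package and a locally downloaded FLAME model file (the model_path below is a placeholder; parameter names follow smplx's FLAME wrapper, not necessarily the original FLAME code):

```python
import torch
import smplx  # pip install smplx; FLAME model files must be downloaded separately

flame = smplx.create(
    model_path='models',          # folder containing the FLAME .pkl model (placeholder path)
    model_type='flame',
    gender='generic',
    num_betas=10,                 # shape parameters
    num_expression_coeffs=10,     # expression parameters
)

shape = torch.zeros(1, 10)                    # identity (shape) coefficients
expression = torch.zeros(1, 10)               # expression coefficients
jaw_pose = torch.tensor([[0.3, 0.0, 0.0]])    # pose: open the jaw ~0.3 rad
neck_pose = torch.zeros(1, 3)                 # pose: no neck rotation

output = flame(betas=shape, expression=expression,
               jaw_pose=jaw_pose, neck_pose=neck_pose, return_verts=True)
vertices = output.vertices.detach().numpy()[0]  # (5023, 3) head mesh
```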
Facial rigging
Ref: [FACEGOOD Beginner Bootcamp] Day 3: Facial rigging tools (replay)
The conversion pipeline from Maya to Apple's 52 blendshapes, shown from left to right.
Conversion between blendshape controllers that follow different standards.
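As a sketch of what such a conversion amounts to in code, here is a tiny weight-remapping example; the Maya-side channel names are hypothetical and the ARKit names are the standard ones, but the actual table depends entirely on how the rig was built:

```python
# Hypothetical mapping from Maya-rig blendshape names to ARKit's 52-bs names.
MAYA_TO_ARKIT = {
    'jaw_open':  'jawOpen',
    'blink_L':   'eyeBlinkLeft',
    'blink_R':   'eyeBlinkRight',
    'smile_L':   'mouthSmileLeft',
    'smile_R':   'mouthSmileRight',
}

def remap_weights(maya_weights: dict) -> dict:
    """Map {maya_name: weight} to {arkit_name: weight}, dropping unmapped channels."""
    return {MAYA_TO_ARKIT[k]: v for k, v in maya_weights.items() if k in MAYA_TO_ARKIT}

print(remap_weights({'jaw_open': 0.5, 'blink_L': 1.0, 'brow_up': 0.2}))
```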
Ref: Online course: Creating MetaHuman facial animation with Faceware Analyzer and Retargeter
Convert Landmarks to BlendShape
Ref: How to convert the MediaPipe Face Mesh into blendshape weights [a very informative post]
Blendshape weights can be generated in two ways:
Direct geometric computation from mesh landmarks (see the sketch after this list):
- Kalidokit, https://github.com/yeemachine/kalidokit
- MeFaMo, https://github.com/JimWest/MeFaMo
AI-model-based approaches:
- mocap4face, https://github.com/facemoji/mocap4face
- AvatarWebKit, https://github.com/Hallway-Inc/AvatarWebKit
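As promised above, here is a rough sketch of the direct geometric route: deriving two ARKit-style weights (jawOpen, eyeBlinkLeft) from MediaPipe Face Mesh landmarks. The landmark indices are the ones commonly used for the inner lips and one eye, and the min/max ranges are hand-tuned guesses, so treat them as assumptions to verify against your own mesh:

```python
import numpy as np

def _dist(landmarks, i, j):
    return np.linalg.norm(np.asarray(landmarks[i]) - np.asarray(landmarks[j]))

def landmarks_to_blendshapes(landmarks):
    """landmarks: list of (x, y, z) normalized Face Mesh points."""
    # Mouth opening: inner lips (13, 14), normalized by mouth width (corners 61, 291).
    mouth_open = _dist(landmarks, 13, 14) / (_dist(landmarks, 61, 291) + 1e-6)
    # Eye opening: eyelids (159, 145), normalized by eye width (corners 33, 133).
    eye_open = _dist(landmarks, 159, 145) / (_dist(landmarks, 33, 133) + 1e-6)

    jaw_open = np.clip((mouth_open - 0.05) / 0.6, 0.0, 1.0)            # hand-tuned range
    eye_blink_left = np.clip(1.0 - (eye_open - 0.1) / 0.25, 0.0, 1.0)  # hand-tuned range
    return {'jawOpen': float(jaw_open), 'eyeBlinkLeft': float(eye_blink_left)}
```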
With the rapid progress of supervised learning, collecting a paired dataset of face images and 52-blendshape weights seems to be the best way to solve this problem.
NVIDIA AR approach
The 52 blendshapes here seem to differ slightly from ARKit's?
Ref: NVIDIA AR SDK Face Expression solver
MediaPipe approach
So, how do we drive an ARKit-style 3D avatar? (See the sketch after the links below.)
MediaPipe: Enhancing Virtual Humans to be more realistic [py]
Mediapipe-Facelandmarker-Demo [js]
https://codepen.io/mediapipe-preview/pen/OJBVQJm [very nice; considering switching to JS]
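For the AI-model route, MediaPipe's FaceLandmarker task can output ARKit-style blendshape scores directly; the sketch below assumes the MediaPipe Tasks Python API, a downloaded face_landmarker.task model file, and a placeholder input image:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path='face_landmarker.task'),
    output_face_blendshapes=True,                # ARKit-style blendshape scores
    output_facial_transformation_matrixes=True,  # head pose matrix
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file('face.jpg')    # placeholder input
result = landmarker.detect(image)

for category in result.face_blendshapes[0]:
    print(category.category_name, round(category.score, 3))  # e.g. jawOpen 0.41
```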
UneeQ
Official: https://www.digitalhumans.com/
When shoppers enter the flagship store of Noel Leeming in Westfield Newmarket, their attention may be caught by a large screen inviting them to query ‘Nola’, the retail group’s first digital employee.
“Nola is one of the first human-like interfaces backed by artificial intelligence to be used in the New Zealand retail space,” explains Tim Edwards, Noel Leeming CEO.
Digital human joins Noel Leeming sales team, 26 Sep 2019
UneeQ Webinar | An interview with Dylan Weymouth on Nola, Noel Leeming's digital human and chatbot [YouTube video below]
Making a digital human | Introducing Yuria
The slight upward head tilt looks a bit uncanny.
Face + mouth + gaze + posture
So, what are the drawbacks of such a parametric face (shown below)?
Unreal Engine - ChatGPT NPC
Unreal Engine 5 ~ A Story Written by ChatGPT Ai Narrated by MetaHuman ~ TEST
As you can see, it is a bit laggy.
Talking to Smart AI NPCs in Unreal Engine 5 (The Future of Gaming & Artificial Intelligence)
Clearly, in-game NPCs of this kind are still evolving.
https://metahuman.unrealengine.com/mhc
https://www.heygen.com/article/metahuman
MetaHuman Animator Tutorial | Unreal Engine 5
LIVE FACE - Facial Mocap for iClone 8
A "face mesh" is sufficient.
- Facial Animation & Mocap basics
UE5 Facial Animation & Mocap
Omniverse Audio2Face
NVIDIA Omniverse Audio2Face App [YouTube tutorial video series]
It launched about two years ago; its drawback is that the expressions are not very rich.