Table of Contents

1. Introduction
#

這篇 paper 是卡內基美隆大學（Carnegie Mellon University）在 CVPR 2017 所發表的，論文中首先指出了 Pose Estimation 中三個具挑戰性的關鍵：

一張圖片裡有多少人，而這些人擺什麼姿勢和人的大小？
有幾個人是相互疊在一起（overlap）的，他們彼此摭蓋面積？
無法即時（realtime）

另外論文中也提到了一些現有方法存在的瓶頸，現有方法主要是透過 top-down 的方式：

person detector
single-person pose estimation

來解決此類問題，而這很依賴效能，如果 person detector 失敗了，那方法就沒用了，另外時間也是一項考驗。

2. Method
#

Figure 2:

Fig. 2 給出了模型的整個處理過程：

讀進一張圖片大小為 $w \times h$ 的圖片 $\textbf I$。
送進 model VGG-19 的前 10 層 layer train 出大小一樣為 $w \times h$ 的 features $\textbf F$。

$\textbf S = (\textbf S_1, \textbf S_2, \dots, \textbf S_J)$，其中 $J$ 代表人體共有 $J$ 個部位（part）。

$$\textbf S_j \in \mathbb R^{w \times h}, j \in \{1 \dots J\}.$$

$\textbf L = (\textbf L_1, \textbf L_2, \dots, \textbf L_C)$ ，其中 $C$ 代

$$\textbf L_c \in \mathbb R^{w \times h \times 2}, c \in \{1, \dots, C\}.$$

再將 confidence maps $\textbf S$ 和 affinity fields $\textbf L$ 送到 greedy

$\textbf S$ 和 affinity fields $\textbf L$。

每個分支都是一個遞迴的預測結構，整個 model 包含了 $T$ 個 stage，每個 stage 中都

圖片首先經由微調過的 VGG19 前十層得到一組大小為 $w \times h$ 的 feature maps $\textbf F$，將其做為 input 輸入到兩個分支裡第一個 stage。

detection confidence maps $\textbf S^1 = \rho^1(\textbf F)$
part affinity fields $\textbf L^1 = \phi^1(\textbf F)$ 其中 $\rho^1$ 和 $\phi^1$ 表示第一個 stage 的 CNN 架構。

在往後每個 stage 中，模型會將每前個階段的輸出和 $\textbf F$（原本的 feature

$$\textbf S^t = \rho^t(\textbf F, \textbf S^{t - 1}, \textbf L^{t - 1}), \forall t \ge 2,$$

$$\textbf L^t = \phi^t(\textbf F, \textbf S^{t - 1}, \textbf L^{t - 1}), \forall t \ge 2,$$

其中 $\rho^t$ 和 $\phi^t$ 為第 $t$ 階段的 CNN。

Figure 4:

Fig. 4 秀出了每一個階段 confidence maps 和 affinity fields 改善的情況。

在預測的 predictions 和 groundtruth maps and fileds 使用了 loss function $L_2$，論文中特別提到 loss functions 是隨著空間而變的（spatially），因為有些 datasets 不見得會完整地標示所有人。

在 $t$ 階段中的 loss functions 如下：

$$f_{\textbf S}^t = \sum_{j = 1}^J \sum_{\textbf p} \textbf W(\textbf p) \cdot || \textbf S_j^t(\textbf p) - \textbf S_j^*(\textbf p)||_2^2,$$ $$f_{\textbf L}^t = \sum_{c = 1}^C \sum_{\textbf p} \textbf W(\textbf p) \cdot || \textbf L_c^t(\textbf p) - \textbf L_c^*(\textbf p)||_2^2,$$ $$f = \sum_{t = 1}^T (f_\textbf S^t + f_\textbf L^t).$$

$\textbf S_j^*$: groundtruth part confidence map
$\textbf L_c^*$: groundtruth part affinity vector field
$\textbf W$: binary mask 且 $\textbf W(\textbf p) = 0$ 當 annotation 在位置 $\textbf p$ 不存在，這是用來避免在 training 的過程中，即使正確預測了，仍有

2.2 Confidence Maps for Part Detection
#

下邊給出根據 annotation 計算 groundtruth confidence maps $\textbf S^*$ 的方法，每個 confidence map 都是一個 2D 的表示。理想情況下，

當圖片中只包含一個人時：如果一個 keypoint 是可見的話，對應的 confidence map 中只有一個峰值。
當圖片中有多個人時：對於每一個人 $k$ 的每一個可見 keypoint $j$，在對應的 confidence map 中都會有一個峰值。

詳細方法如下：

先找出每個人 $k$ 的某一部位 $j$
- 每一個人 $k$ 的單個 confidence maps $\textbf S_{j, k}^*$ 和
- $\textbf x_{j, k} \in \mathbb R^2$ 表示圖片中人 $k$ 的 part $j$ 對應的 groundtruth position
計算方式如式 (6) 所示，其中 $\sigma$ 用來控制峰值在 confidence map 中的傳播範圍。
（這裡可以理解成，$\forall \textbf p \in \mathbb R^2$，$\textbf p$ 點越接近 $\textbf x_{j, k} \in \mathbb R^2$，$||\textbf p - \textbf x_{j, k}||\_2^2$ 值趨近於 $0$，$\textbf S_{j, k}^*(\textbf p)$ 也就越靠近極大值 $1$。）
$$\textbf S_{j, k}^*(\textbf p) = \exp \Big(- \frac{||\textbf p - \textbf x_{j, k}||\_2^2}{\sigma^2}\Big),$$
再找出所有人的部位 $j$，這裡取最大值而不是平均值能夠更準確地將同一個 confidence map 中的峰值保存下來，即：對整張圖 $w \times h$ 每一個點，找該點在所有人之中的最大值！

$$\textbf S_j^*(\textbf p) = \max_k \textbf S_{j, k}^*(\textbf p).$$

2.3 Part Afﬁnity Fields for Part Association
#

給定一組 keypoints，如 Fig.5(a) 所示，我們如何把它們組裝成，未知數量人的整個身體的 pose 呢？

我們需要一個好方法來確定每對 keypoints 之間的連接，即：它們屬於同一個人。

一個可能的方法是找到一個位於每一對 keypoints 之間的一個中間點，後檢查中間點是真正的中間點的機率，如 Fig. 5(b) 所示。但是當人們擠在一起時，中間點可能是錯誤的連線，如 Fig. 5(b) 中綠線所示。出現這種情況的原因有兩個：

這種方式只編碼了位置資訊，沒有方向
身體的支撐區域已經縮小到一個點上。

為解決這些限制，我們提出了稱為 PAF(part affinity fields) 的特徵表示來保存身體的支撐區域的位置信息和方向信息，如 Fig. 5(c) 所示。對於每一條軀幹來說，the part affinity 是一個 2D 的向量區域。在屬於一個軀幹上的每一像素都對應一個 2D 的向量，這個向量表示軀幹上從一個 keypoint 到另一個 keypoint 的方向。

考慮下圖中給出的一個軀幹（手臂），令 $\textbf x_{j_1, k}$ 和 $\textbf x_{j_2, k}$ 表示圖中的某個人 $k$ 的兩個 keypoints 對應的真實像素點，如果一個像素點 $\textbf p$ 位於這個軀幹上，$\textbf L_{c, k}^*(\textbf p)$ 表示一個從 keypoint $j_1$ 到 keypoints $j_2$ 的單位向量，對於不在軀幹上的像素點，對應的向量則是 $\textbf 0$。

下面這個公式給出了 the groundtruth part affinity vector，對於圖片中的一個點 $\textbf p$ 其值 $\textbf L_{c, k}^*(\textbf p)$ 的值如下：

$$ \textbf L_{c, k}^*(\textbf p) = \begin{cases} \textbf v \text{ if $\textbf p$ on limb $c, k$}; \\ \textbf 0 \text{ otherwise.} \end{cases} $$

其中，

$\textbf v = (\textbf x_{j_2, k} - \textbf x_{j_1, k}) / ||\textbf x_{j_2, k} - \textbf x_{j_1, k}||\_2$: 軀幹對應的單位方向向量。屬於這個軀幹上的像素點滿足下面的不等式：

$$0 \le \textbf v \cdot (\textbf p - \textbf x_{j_1, k}) \le l_{c, k} \text{ and } |\textbf v_\bot \cdot (\textbf p - \textbf x_{j_1, k})| \le \sigma_l.$$

其中，

$\sigma_l$: limb 寬度（注意：不同於軀幹）
軀幹長度：$l_{c, k} = ||\textbf x_{j_2, k} - \textbf x_{j_1, k}||\_2$
$\textbf v_\bot$: 垂直於 $\textbf v$ 的向量整張圖片的 the groundtruth part affinity field 取圖片中所有人對應的 affinity field 的平均值，其中 $n_c(\textbf p)$ 是圖片中 $k$ 個人在像素點 $\textbf p$ 對應的非零向量的個數，即：我們只考慮 $\forall \textbf p \in \mathbb R^2$，$\forall k$ 個人中，有向量的平均。

$$\textbf L_c^*(\textbf p) = \frac{1}{n_c(\textbf p)} \sum_k \textbf L_{c, k}^*(\textbf p).$$

在預測的時候，我們用候選 keypoints 之間的 PAF 來衡量這對 keypoints 是不是屬於同一個人。詳細的說，對於兩個候選 keypoints 對應的像素點 $\textbf d_{j_1}$ 和 $\textbf d_{j_2}$，我們去計算 PAF，如下式所示：

$$E = \int_{u = 0}^{u = 1} \textbf L_c(\textbf p(u)) \cdot \frac{\textbf d_{j_2} - \textbf d_{j_1}}{||\textbf d_{j_2} - \textbf d_{j_1}||\_2}du,$$

其中 $\textbf p(u)$ 表示兩個像素點 $\textbf d_{j_1}$ 和 $\textbf d_{j_2}$ 之間的像素點：

$$\textbf p(u) = (1 - u)\textbf d_{j_1} + u \textbf d_{j_2}.$$

2.4 Multi-Person Parsing using PAFs
#

藉由 non-maximum suppression，我們從預測出的 confidence maps 得到一組離散的 keypoints 候選位置。因為圖片中可能有多個人或者存在 false positive，每個 keypoint 可能會有多個候選位置，因此也就組成了很大數量的 keypoints pair，如 Fig. 6(b) 所示。按照式 (10)，我們給每一個候選 keypoints pair 計算一個分數。

從這些 keypoint pair 中找到最佳結果，是一個 NP-Hard 問題。下面給出本文的方法：

假設模型得到的所有候選 keypoints 組成集合 $\mathcal D_\mathcal J = \{\textbf d_j^m: \text{ for } j \in \{1 \dots J\}, m \in \{1 \dots N_j\}\}$,

其中，

$N_j$: keypoint $j$ 的候選位置數量
$\textbf d_j^m \in \mathbb R^2$: keypoint $j$ 的第 $m$ 個候選位置的像素坐標。

我們需要做的是將屬於同一個人的 keypoints 連成軀幹（胳膊、腿等），為此我們定義變數 $z_{j_1j_2}^{mn} \in \{0, 1\}$ 表示候選 keypoints $\textbf d_{j_1}^m$ 和 $\textbf d_{j_2}^n$ 是否可以連起來。如此以來便得到了集合

$$\mathcal Z = \{z_{j_1j_2}^{mn} \in \{0, 1\}: \text{ for } j_1, j_2 \in \{1 \dots J\}, m \in \{1 \dots N_{j_1}\}, n \in \{1 \dots N_{j_2}\}\}.$$

現在單獨考慮第 $c$ 個軀幹（例如脖子），其對應的兩個 keypoints 應該是 $j_1$ 和 $j_2$，這兩個 keypoints 對應的候選集合分別是 $\mathcal D_{j_1}$ 和 $\mathcal D_{j_2}$，可透過線性方程式如下找出正確 keypoints：

$$\max_{\mathcal Z_c} E_c = \max_{\mathcal Z_c} \sum_{m \in \mathcal D_{j_1}}\sum_{n \in \mathcal D_{j_2}} E_{mn} \cdot z_{j_1j_2}^{mn},$$ $$\text{s.t. } \forall m \in \mathcal D_{j_1}, \sum_{n \in \mathcal D_{j_2}} z_{j_1j_2}^{mn} \le 1,$$ $$\forall n \in \mathcal D_{j_2}, \sum_{m \in \mathcal D_{j_1}} z_{j_1j_2}^{mn} \le 1$$

其中，

$E_c$: 軀幹 $c$ 對應的權值總和
$\mathcal Z_c$: 軀幹 $c$ 對應的 $\mathcal Z$ 的子集
$E_{mn}$: keypoint $d_{j_1}^m$ 和 keypoint $d_{j_2}^n$ 對應的 part affinity

式 (13) 和式 (14) 限制了任意兩個相同類型的軀幹（例如兩個脖子）不會共享關鍵點。問題擴展到所有 $C$ 個軀幹上，我們優化目標就變成了公式 (15)。

$$\max_\mathcal Z E = \sum_{c = 1}^C \max_{\mathcal Z_c} E_c.$$

論文筆記 Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

1. Introduction
#

2. Method
#

2.2 Confidence Maps for Part Detection
#

2.3 Part Afﬁnity Fields for Part Association
#

2.4 Multi-Person Parsing using PAFs
#

3. Results
#

3.1. Results on the MPII Multi-Person Dataset
#

3.2. Results on the COCO Keypoints Challenge
#

3.3. Runtime Analysis
#

4. Discussion
#

1. Introduction#

2. Method#

2.2 Confidence Maps for Part Detection#

2.3 Part Afﬁnity Fields for Part Association#

2.4 Multi-Person Parsing using PAFs#

3. Results#

3.1. Results on the MPII Multi-Person Dataset#

3.2. Results on the COCO Keypoints Challenge#

3.3. Runtime Analysis#

4. Discussion#

1. Introduction
#

2. Method
#

2.2 Confidence Maps for Part Detection
#

2.3 Part Afﬁnity Fields for Part Association
#

2.4 Multi-Person Parsing using PAFs
#

3. Results
#

3.1. Results on the MPII Multi-Person Dataset
#

3.2. Results on the COCO Keypoints Challenge
#

3.3. Runtime Analysis
#

4. Discussion
#