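Paper reading: a joint walkthrough of CLAY and Hunyuan3D 2.0

Paper: Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Paper: CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

The two papers follow essentially the same approach, so this walkthrough explains it through Hunyuan3D 2.0 only.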

Hunyuan3D 2.0

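The system consists of two foundation models, because decoupling geometry from appearance makes generation more stable:

1️⃣ Hunyuan3D-DiT generates the geometry (mesh)

2️⃣ Hunyuan3D-Paint generates the texture map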

Shape generation module: Hunyuan3D-DiT

Shape generation has two parts:

ShapeVAE + Diffusion Transformer

ShapeVAE (core)

What ShapeVAE does:

mesh → latent tokens
latent tokens → mesh

The input is:

mesh surface point cloud

Each point carries six values:

(x,y,z)+normal

ShapeVAE encoder

Encoder structure:

point cloud
↓
cross attention
↓
self attention
↓
latent tokens

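Key point: the queries are point queries obtained by FPS (farthest point sampling).

How the VAE encoder is trained

The usual approach is uniform sampling, but it captures too little information along edges.

Hunyuan3D adds importance sampling on top of uniform sampling, deliberately sampling edges and corners where the information density is high, so the input point cloud is:

P = P_uniform + P_importance

Then:

FPS sampling

yields the queries.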
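The sampling step above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `farthest_point_sampling` and `build_queries` are hypothetical names, the importance samples are a toy stand-in, and running FPS over the merged cloud (rather than over each subset separately) is an assumption taken from the description above.

```python
import math
import random


def farthest_point_sampling(points, num_queries, start_index=0):
    """Pick indices so that each new point is the farthest from those already chosen."""
    if not 0 < num_queries <= len(points):
        raise ValueError("num_queries must be in [1, len(points)]")
    selected = [start_index]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_queries:
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


def build_queries(p_uniform, p_importance, num_queries):
    """P = P_uniform + P_importance, then FPS over the merged cloud (assumption)."""
    merged = list(p_uniform) + list(p_importance)
    return [merged[i] for i in farthest_point_sampling(merged, num_queries)]


random.seed(0)
p_uniform = [(random.random(), random.random(), random.random()) for _ in range(200)]
# Toy stand-in for edge/corner samples: points clustered on one cube edge.
p_importance = [(random.random(), 0.0, 0.0) for _ in range(50)]
queries = build_queries(p_uniform, p_importance, num_queries=16)
```

FPS spreads the queries across the whole cloud, so a small number of queries still covers both the uniformly sampled surface and the densely sampled edges.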

VAE latent

The encoder outputs:

mean μ
variance σ²

Applying the:

reparameterization trick

gives:

latent tokens

VAE decoder

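The decoder's task:

latent tokens → SDF

Process:

latent tokens + 3D query points
↓
transformer
↓
grid queries
↓
predict SDF(x)

So the SDF is not generated directly as a whole. The decoder takes both the 3D query points and the latent tokens, and outputs the SDF value at each query point.

The complete SDF field is obtained by querying many times, over a dense grid of points.

Marching cubes then extracts the mesh from that field.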
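To make the "query many times" step concrete, here is a minimal sketch that fills a dense SDF grid chunk by chunk. The analytic `sphere_sdf` is only a stand-in for the real `decoder(latent_tokens, points)` call, and `query_sdf_grid`, the grid range [-1, 1]^3 and the chunk size are all assumptions made for illustration.

```python
import math


def sphere_sdf(points, radius=0.5):
    """Stand-in for the decoder: signed distance to a sphere centered at the origin."""
    return [math.sqrt(x * x + y * y + z * z) - radius for x, y, z in points]


def query_sdf_grid(sdf_fn, resolution, chunk_size=4096):
    """Evaluate sdf_fn on a resolution^3 grid over [-1, 1]^3, one chunk of points at a time."""
    axis = [-1.0 + 2.0 * i / (resolution - 1) for i in range(resolution)]
    grid_points = [(x, y, z) for x in axis for y in axis for z in axis]
    values = []
    for start in range(0, len(grid_points), chunk_size):
        # Each call corresponds to one decoder pass over a batch of query points.
        values.extend(sdf_fn(grid_points[start:start + chunk_size]))
    r = resolution
    # Reshape the flat list into volume[ix][iy][iz].
    return [[values[(ix * r + iy) * r:(ix * r + iy + 1) * r] for iy in range(r)]
            for ix in range(r)]


volume = query_sdf_grid(sphere_sdf, resolution=17, chunk_size=1000)
center = volume[8][8][8]  # grid point at the origin, inside the sphere (negative)
corner = volume[0][0][0]  # grid point at (-1, -1, -1), outside the sphere (positive)
```

The resulting volume is what a marching cubes routine (for example `skimage.measure.marching_cubes` with `level=0`) would take as input to extract the zero level set as a mesh.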

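ShapeVAE training

1 reconstruction loss

The loss is placed on the predicted SDF(x) rather than on the mesh: a mesh is a discrete structure whose vertex and face connectivity (topology) is not differentiable, whereas the SDF is a continuous function that can be supervised at randomly sampled points.

Also, the decoder is not a plain MLP but a Transformer: the 3D query points attend to the latent tokens, so the SDF values of a whole batch of query points are predicted in one parallel pass instead of one point at a time.

MSE(pred_sdf, gt_sdf)

Note: the SDF is not evaluated over all of space. Points are sampled randomly to reduce computation.

2 KL loss

It keeps the latent space continuous by pulling the latent distribution toward a standard Gaussian, so that diffusion can later generate in the latent space.

L = MSE + γ KL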
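The combined objective can be written out as a small sketch. It assumes a diagonal Gaussian posterior parameterized by `mu` and `logvar` (log σ²), for which the KL term to N(0, 1) has the closed form 0.5 · (μ² + σ² − 1 − log σ²). The function names, the averaging over dimensions and the value of `gamma` are illustrative choices, not values from the paper.

```python
import math


def mse(pred_sdf, gt_sdf):
    """Mean squared error over the sampled query points."""
    return sum((p - g) ** 2 for p, g in zip(pred_sdf, gt_sdf)) / len(pred_sdf)


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) averaged over latent dimensions, logvar = log sigma^2."""
    terms = [0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, logvar)]
    return sum(terms) / len(terms)


def vae_loss(pred_sdf, gt_sdf, mu, logvar, gamma=1e-3):
    # gamma is a hypothetical weight; the notes only state L = MSE + gamma * KL.
    return mse(pred_sdf, gt_sdf) + gamma * kl_to_standard_normal(mu, logvar)


loss = vae_loss(
    pred_sdf=[0.10, -0.25, 0.40],
    gt_sdf=[0.12, -0.20, 0.35],
    mu=[0.3, -0.1],
    logvar=[-0.2, 0.1],
)
```

The KL term is zero exactly when μ = 0 and σ = 1, which is what pulls the latent distribution toward the standard Gaussian.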

reparameterization trick

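A VAE has to sample from a distribution. New data can be generated with decoder(z) only if z can be drawn from N(0,1), but during training what the encoder gives directly is:

$z \sim N(\mu, \sigma^2)$

Sampling from this distribution directly blocks backpropagation, so the model cannot be trained. Instead, use:

$z = \mu + \sigma \cdot \epsilon$

where

$\epsilon \sim N(0,1)$

In code this is simply: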

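import torch

mu, logvar = encoder(x)        # encoder predicts the mean and log-variance
std = torch.exp(0.5 * logvar)  # sigma = exp(0.5 * log sigma^2)
eps = torch.randn_like(std)    # eps ~ N(0, I), carries no gradient
z = mu + eps * std             # differentiable w.r.t. mu and logvar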

This way:

gradient
→ μ
→ σ

the gradient can flow back to both μ and σ.

Shape diffusion (Hunyuan3D-DiT)

Input:

noisy shape tokens
+
image tokens

Output:

clean shape tokens

Conditioning input

Image features come from:

DINOv2 Giant

Before feature extraction, the image is also preprocessed:

background removal
center crop
white background

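Transformer structure

It uses: dual-stream + single-stream

The dual-stream blocks process shape tokens and image tokens separately.

In the dual-stream stage the two token streams are processed separately but interact through attention; in the single-stream stage they are simply concatenated.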

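Diffusion training

They do not use plain DDPM but Flow Matching. The advantages: 1. training is more stable; 2. sampling is faster; 3. sampling reduces to solving an ODE.

Training:

xt = (1−t)x0 + t x1

Prediction target:

velocity u_t = x1 − x0

Loss:

|| uθ(xt) − ut ||²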
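The training target and the ODE-based sampling can be sketched on toy vectors. This sketch assumes x0 is Gaussian noise and x1 is the clean latent, uses a fixed-step Euler solver, and replaces the trained network with a hypothetical `oracle` that returns the exact velocity; none of these names come from the paper.

```python
import random


def interpolate(x0, x1, t):
    """x_t = (1 - t) * x0 + t * x1, a straight path from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """u_t = d x_t / d t = x1 - x0, constant along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between the predicted and the target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    pred = predict_velocity(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = u_theta(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x, dt = list(x0), 1.0 / num_steps
    for step in range(num_steps):
        v = predict_velocity(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x1 = [0.3, -1.2, 0.8]                      # toy "clean shape tokens"
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise


def oracle(xt, t):
    """Stand-in for a perfectly trained u_theta on this single pair."""
    return target_velocity(x0, x1)


loss = flow_matching_loss(oracle, x0, x1, t=0.4)
sample = euler_sample(oracle, x0)
```

Because the target velocity is constant along a straight path, a well-trained model can be integrated with few solver steps, which is where the faster sampling comes from.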

Texture generation: Hunyuan3D-Paint

Input:

mesh+image

Output:

texture map

Pipeline:

mesh
 ↓
multi-view rendering
 ↓
multi-view diffusion
 ↓
texture baking

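Key steps of texture generation

Step1 Image Delight

Lighting is removed because:

texture ≠ lighting

For this an image-to-image network is trained. Its training data comes from renders under HDR lighting and renders under white lighting.

Step2 View Selection

Select: 8~12 views

The goal is maximal UV coverage, implemented as a greedy search that prioritizes regions not yet covered.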
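The greedy search can be sketched as a set-cover loop. This is an illustrative sketch only: `select_views` is a hypothetical name, and each view is represented by a toy set of UV texel ids, whereas a real pipeline would obtain that coverage by rasterizing the mesh from each camera.

```python
def select_views(view_coverage, max_views):
    """Greedily pick views, each time taking the one that covers the most uncovered texels."""
    covered, chosen = set(), []
    remaining = dict(view_coverage)
    while remaining and len(chosen) < max_views:
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining view adds new coverage
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Hypothetical toy data: each view maps to the set of UV texel ids it sees.
view_coverage = {
    "front": {0, 1, 2, 3},
    "back": {4, 5, 6},
    "front_left": {0, 1, 4},
    "top": {7},
}
chosen, covered = select_views(view_coverage, max_views=3)
```

In this toy case "front_left" is never chosen, because after "front" and "back" it contributes no uncovered texels, while "top" still adds one.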

Step3 Multi-view diffusion

Generates: multi-view images. It may use something along the lines of Zero123 or SyncDreamer, or it may use the geometry of the mesh directly.

Conditioning inputs:

normal map
position map
camera embedding

Attention structure

Three kinds of attention:

self attention
reference attention
multi-view attention

Formula:

Z = Zsa
  + λref * ref attention  # keeps consistency with the reference image
  + λmv * mv attention # keeps the views consistent with each other
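The additive combination can be sketched with plain scaled dot-product attention. This is a deliberately simplified illustration: it omits the learned query/key/value projections (raw tokens are used directly, which is an assumption), and `attention` and `combined_attention` are hypothetical names.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out


def combined_attention(z, z_ref, z_mv, lambda_ref=1.0, lambda_mv=1.0):
    """Z = Z_sa + lambda_ref * ref attention + lambda_mv * multi-view attention."""
    z_sa = attention(z, z, z)         # tokens of the current view attend to themselves
    ref = attention(z, z_ref, z_ref)  # attend to tokens of the reference image
    mv = attention(z, z_mv, z_mv)     # attend to tokens gathered from the other views
    return [[a + lambda_ref * b + lambda_mv * c for a, b, c in zip(ra, rb, rc)]
            for ra, rb, rc in zip(z_sa, ref, mv)]


z = [[1.0, 2.0]]      # one token of the current view
z_ref = [[3.0, 4.0]]  # one reference-image token
z_mv = [[5.0, 6.0]]   # one token from another view
out = combined_attention(z, z_ref, z_mv)
```

Setting `lambda_ref` or `lambda_mv` to zero switches off the corresponding branch, which makes it easy to see what each term contributes.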

Texture baking

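After the multi-view images are generated, a UV texture has to be produced by projecting the images into UV space. This process is called baking.

project images → UV space

The problem is that the mesh occludes itself:

self-occlusion

Solution:

dense view inference

Texture inpainting

The UV map may still have holes; they are filled by interpolation:

vertex texture → interpolate

Weights:

1 / distance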
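The inverse-distance weighting can be sketched on per-vertex colors. This is a minimal sketch under stated assumptions: `inpaint_vertex_colors` is a hypothetical name, holes are marked with `None`, and every known vertex contributes to every hole, whereas a real pipeline would more likely restrict the sum to nearby or connected vertices.

```python
import math


def inpaint_vertex_colors(positions, colors, eps=1e-8):
    """Fill vertices whose color is None with a 1/distance-weighted mean of the known colors."""
    known = [(p, c) for p, c in zip(positions, colors) if c is not None]
    if not known:
        raise ValueError("at least one vertex must have a color")
    filled = []
    for p, c in zip(positions, colors):
        if c is not None:
            filled.append(c)
            continue
        # eps avoids division by zero if a hole coincides with a known vertex
        weights = [1.0 / (math.dist(p, q) + eps) for q, _ in known]
        total = sum(weights)
        filled.append(tuple(
            sum(w * kc[ch] for w, (_, kc) in zip(weights, known)) / total
            for ch in range(3)
        ))
    return filled


positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), None]  # the middle vertex is a hole
filled = inpaint_vertex_colors(positions, colors)
```

The hole sits halfway between a red and a blue vertex, so both weights are equal and the filled color is an even mix of the two.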

Training data

Objaverse
Objaverse-XL

For each mesh, the following have to be rendered:

multi-view images
normal map
position map
white-light images

Differences from CLAY

CLAY is a 2024 work, about a year earlier than Hunyuan3D 2.0.

Aspect	CLAY	Hunyuan3D 2.0
latent diffusion	DDPM	Flow Matching
representation	occupancy field	SDF
encoder输入	point cloud	point + normal
sampling	surface points	uniform + importance
decoder输出	occupancy	SDF
texture	PBR material diffusion	multi-view texture diffusion