
L-Lawliet's Blog

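Notes on game development, bit by bit

URP Bloom Effect

URP Bloom Post-Processing

Lawliet

4 min read

Colourful

An analysis of URP's Bloom effect

Principle Analysis

Source Code Walkthrough

URP implements Bloom as a post-processing effect inside PostProcessPass. The core of the C# side is SetupBloom():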

// Start at half-res
int tw = m_Descriptor.width >> 1;
int th = m_Descriptor.height >> 1;

// Determine the iteration count
int maxSize = Mathf.Max(tw, th);
int iterations = Mathf.FloorToInt(Mathf.Log(maxSize, 2f) - 1);
iterations -= m_Bloom.skipIterations.value;
int mipCount = Mathf.Clamp(iterations, 1, k_MaxPyramidSize);

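As the principle analysis above explains, the image is downsampled once before blurring to reduce the number of samples the blur takes. The code therefore starts by computing the half-resolution width and height. It then uses the larger half-resolution dimension to compute the iteration count. Note that this is not the number of Blur passes. For a 2400×1080 screen, for example, the half-resolution size is 1200×540, and floor(log2(1200) − 1) = floor(9.23) = 9 iterations. Next, the iterations to skip are subtracted. When Skip Iterations is not overridden, skipIterations.value defaults to 1. The result is then clamped to [1, k_MaxPyramidSize], so the final count is 8.

P.S. The code ends up calling this value mipCount, which is the more accurate name. Read literally, "iteration count" suggests the number of Blur passes, where one horizontal plus one vertical pass counts as one Blur. Once the initial downsample is excluded, however, the number of Blur passes is mipCount − 1. That is why mipCount fits better. In the 2400 example, mip[0] is 1200 wide, mip[1] is 600, mip[2] is 300, and so on. The first Blur goes from mip[0] to mip[1]. Keeping this in mind makes the logic that follows easier to read.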
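To make the arithmetic concrete, here is a small Python sketch that replays the C# logic above. Python serves only as a calculator here; this is not URP code. The constant K_MAX_PYRAMID_SIZE = 16 and the function name bloom_mip_chain are assumptions for illustration. math.log2 stands in for Mathf.Log(maxSize, 2f).

```python
import math

# Assumed value standing in for URP's k_MaxPyramidSize.
K_MAX_PYRAMID_SIZE = 16


def bloom_mip_chain(width, height, skip_iterations=1, max_pyramid=K_MAX_PYRAMID_SIZE):
    """Return (mip_count, [(w, h), ...]) following the SetupBloom() logic above."""
    # Start at half resolution, like `m_Descriptor.width >> 1`.
    tw, th = width >> 1, height >> 1

    # Iteration count from the larger half-res dimension, minus the skipped ones.
    iterations = math.floor(math.log2(max(tw, th)) - 1)
    iterations -= skip_iterations
    mip_count = min(max(iterations, 1), max_pyramid)

    # mip[0] is the half-res prefilter target; each further mip halves both sides.
    sizes = [(tw, th)]
    for _ in range(1, mip_count):
        tw, th = max(1, tw >> 1), max(1, th >> 1)
        sizes.append((tw, th))
    return mip_count, sizes


if __name__ == "__main__":
    count, sizes = bloom_mip_chain(2400, 1080)
    print(count, sizes)
```

The sketch can be checked against the 2400×1080 example above. With the default skip of 1, it returns a mipCount of 8. The chain runs 1200×540, 600×270, 300×135, and so on, down to 9×4. With skip_iterations=6, it returns a mipCount of 3.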


// Set up the material properties
//...

// Prefilter
var desc = GetCompatibleDescriptor(tw, th, m_DefaultHDRFormat);
cmd.GetTemporaryRT(ShaderConstants._BloomMipDown[0], desc, FilterMode.Bilinear);
cmd.GetTemporaryRT(ShaderConstants._BloomMipUp[0], desc, FilterMode.Bilinear);
Blit(cmd, source, ShaderConstants._BloomMipDown[0], bloomMaterial, 0);

The first step is the prefilter that runs before the blur. It downsamples the image and extracts its bright areas.

#if _BLOOM_HQ
  float texelSize = _SourceTex_TexelSize.x;
  half4 A = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(-1.0, -1.0));
  half4 B = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(0.0, -1.0));
  half4 C = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(1.0, -1.0));
  half4 D = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(-0.5, -0.5));
  half4 E = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(0.5, -0.5));
  half4 F = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(-1.0, 0.0));
  half4 G = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv);
  half4 H = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(1.0, 0.0));
  half4 I = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(-0.5, 0.5));
  half4 J = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(0.5, 0.5));
  half4 K = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(-1.0, 1.0));
  half4 L = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(0.0, 1.0));
  half4 M = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + texelSize * float2(1.0, 1.0));

  half2 div = (1.0 / 4.0) * half2(0.5, 0.125);

  half4 o = (D + E + I + J) * div.x;
  o += (A + B + G + F) * div.y;
  o += (B + C + H + G) * div.y;
  o += (F + G + L + K) * div.y;
  o += (G + H + M + L) * div.y;

  half3 color = o.xyz;
#else
  half3 color = SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv).xyz;
#endif

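Downsampling reduces the sampling cost of the blur that follows. Projects have different quality requirements, so there is a High Quality Filtering option (HQ from here on). In the default path, the shader takes a single sample and relies on the texture's bilinear filtering to smooth the result. Because the target is half-resolution, that one bilinear fetch averages a 2×2 block of source texels. With HQ enabled, the shader instead applies a 13-tap blur to the full-resolution source. This looks better but costs more.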
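The five 4-tap boxes in the HQ branch overlap, so the shader does not make it obvious how much weight each of the 13 fetches finally receives. The following Python sketch adds up the div.x and div.y box weights per tap, using exact fractions. The names TAPS, BOXES and effective_weights are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Tap offsets (in source texels) copied from the _BLOOM_HQ branch above.
TAPS = {
    "A": (-1.0, -1.0), "B": (0.0, -1.0), "C": (1.0, -1.0),
    "D": (-0.5, -0.5), "E": (0.5, -0.5),
    "F": (-1.0, 0.0), "G": (0.0, 0.0), "H": (1.0, 0.0),
    "I": (-0.5, 0.5), "J": (0.5, 0.5),
    "K": (-1.0, 1.0), "L": (0.0, 1.0), "M": (1.0, 1.0),
}
INNER_WEIGHT = Fraction(1, 4) * Fraction(1, 2)  # div.x
OUTER_WEIGHT = Fraction(1, 4) * Fraction(1, 8)  # div.y

# The five boxes exactly as the shader sums them.
BOXES = [
    ("DEIJ", INNER_WEIGHT),
    ("ABGF", OUTER_WEIGHT),
    ("BCHG", OUTER_WEIGHT),
    ("FGLK", OUTER_WEIGHT),
    ("GHML", OUTER_WEIGHT),
]


def effective_weights():
    """Accumulate each tap's total weight across the overlapping boxes."""
    weights = defaultdict(Fraction)
    for names, w in BOXES:
        for name in names:
            weights[name] += w
    return dict(weights)


if __name__ == "__main__":
    w = effective_weights()
    for name in sorted(TAPS):
        print(name, TAPS[name], w[name])
    print("sum =", sum(w.values()))
```

The per-tap weights work out as follows:

- The centre tap G gets 1/8, the same as each inner tap D, E, I and J.
- The edge taps B, F, H and L get 1/16.
- The corner taps A, C, K and M get 1/32.

Everything sums to 1, so the HQ prefilter does not change overall brightness.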

Next comes the Gaussian pyramid, which combines repeated downsampling with a Gaussian blur.

int lastDown = ShaderConstants._BloomMipDown[0];
for (int i = 1; i < mipCount; i++)
{
  tw = Mathf.Max(1, tw >> 1);
  th = Mathf.Max(1, th >> 1);
  int mipDown = ShaderConstants._BloomMipDown[i];
  int mipUp = ShaderConstants._BloomMipUp[i];

  desc.width = tw;
  desc.height = th;

  cmd.GetTemporaryRT(mipDown, desc, FilterMode.Bilinear);
  cmd.GetTemporaryRT(mipUp, desc, FilterMode.Bilinear);

  Blit(cmd, lastDown, mipUp, bloomMaterial, 1);
  Blit(cmd, mipUp, mipDown, bloomMaterial, 2);

  lastDown = mipDown;
}

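Each loop iteration halves the width and the height, so the texture shrinks to 1/4 of its previous area. The benefit is that the sampler's bilinear filtering can stand in for part of the convolution. The blur itself is split into separate horizontal and vertical Gaussian passes. This is a common optimization that reduces the number of samples: for a 9-tap kernel, it takes 9 + 9 samples instead of 9 × 9 = 81.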

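// FragBlurH()
// The source is the previous, twice-as-large mip, so one step spans two source texels
float texelSize = _SourceTex_TexelSize.x * 2.0;
// 9-tap gaussian blur on the downsampled source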
half3 c0 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv - float2(texelSize * 4.0, 0.0)));
half3 c1 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv - float2(texelSize * 3.0, 0.0)));
half3 c2 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv - float2(texelSize * 2.0, 0.0)));
half3 c3 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv - float2(texelSize * 1.0, 0.0)));
half3 c4 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv                               ));
half3 c5 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + float2(texelSize * 1.0, 0.0)));
half3 c6 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + float2(texelSize * 2.0, 0.0)));
half3 c7 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + float2(texelSize * 3.0, 0.0)));
half3 c8 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + float2(texelSize * 4.0, 0.0)));

half3 color = c0 * 0.01621622 + c1 * 0.05405405 + c2 * 0.12162162 + c3 * 0.19459459
            + c4 * 0.22702703
            + c5 * 0.19459459 + c6 * 0.12162162 + c7 * 0.05405405 + c8 * 0.01621622;

// FragBlurV()
// The source has the same size as the render target
float texelSize = _SourceTex_TexelSize.y;
// Optimized bilinear 5-tap gaussian on the same-sized source (9-tap equivalent)
half3 c0 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv - float2(0.0, texelSize * 3.23076923)));
half3 c1 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv - float2(0.0, texelSize * 1.38461538)));
half3 c2 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv                                      ));
half3 c3 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + float2(0.0, texelSize * 1.38461538)));
half3 c4 = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv + float2(0.0, texelSize * 3.23076923)));

half3 color = c0 * 0.07027027 + c1 * 0.31621622
            + c2 * 0.22702703
            + c3 * 0.31621622 + c4 * 0.07027027;

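The horizontal pass takes 9 samples, while the vertical pass takes only 5.

The horizontal pass (pass 1) reads lastDown, which is twice as large as its render target mipUp. It therefore blurs and downsamples at the same time, and each of its fetches already spends its bilinear filtering on averaging a 2×2 block of source texels.

The vertical pass (pass 2) reads mipUp, which is the same size as its target mipDown. This lets it pair up adjacent Gaussian taps. A single bilinear fetch placed between two texels, at an offset weighted by their Gaussian weights (1.38461538 and 3.23076923 texels), returns their weighted sum. Five fetches therefore reproduce the same 9-tap kernel, as the shader comment "9-tap equivalent" says.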
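The 5-tap trick can be checked numerically with a hypothetical Python sketch:

1. Take the 9-tap weights from FragBlurH.
2. Merge taps 1+2 and 3+4 into single fetches at weighted offsets.
3. On a made-up 1D signal, compare a point-sampled 9-tap blur with a 5-tap blur that uses linear interpolation.

Here, lerp_fetch stands in for sampler_LinearClamp, and all function names are illustrative.

```python
import math

# 9-tap Gaussian weights copied from FragBlurH: centre first, then offsets 1..4.
WEIGHTS_9 = [0.22702703, 0.19459459, 0.12162162, 0.05405405, 0.01621622]


def merge_taps(weights):
    """Pair taps (1,2) and (3,4) into single bilinear fetches.

    Returns [(offset, weight), ...] with the centre tap first.
    """
    merged = [(0.0, weights[0])]
    for k in range(1, len(weights), 2):
        w_a, w_b = weights[k], weights[k + 1]
        w = w_a + w_b
        # The fetch sits between texels k and k+1, closer to the heavier one.
        merged.append(((k * w_a + (k + 1) * w_b) / w, w))
    return merged


def lerp_fetch(signal, x):
    """Linearly filtered read of a 1D signal at fractional texel position x."""
    i = math.floor(x)
    t = x - i
    return signal[i] * (1.0 - t) + signal[i + 1] * t


def blur_9tap(signal, c):
    """Reference: nine point samples, as in FragBlurH."""
    total = WEIGHTS_9[0] * signal[c]
    for k in range(1, 5):
        total += WEIGHTS_9[k] * (signal[c - k] + signal[c + k])
    return total


def blur_5tap(signal, c, merged):
    """Five linearly filtered fetches, as in FragBlurV."""
    total = merged[0][1] * signal[c]
    for offset, w in merged[1:]:
        total += w * (lerp_fetch(signal, c - offset) + lerp_fetch(signal, c + offset))
    return total


if __name__ == "__main__":
    taps = merge_taps(WEIGHTS_9)
    print(taps)
    signal = [float((i * 37) % 11) for i in range(20)]
    print(blur_9tap(signal, 10), blur_5tap(signal, 10, taps))
```

merge_taps returns offsets of about 1.38461538 and 3.23076923, with weights of about 0.31621622 and 0.07027027. These match the constants in FragBlurV, and the two blurs agree to floating-point precision.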

// Upsample (bilinear by default, HQ filtering does bicubic instead)
for (int i = mipCount - 2; i >= 0; i--)
{
  int lowMip = (i == mipCount - 2) ? ShaderConstants._BloomMipDown[i + 1] : ShaderConstants._BloomMipUp[i + 1];
  int highMip = ShaderConstants._BloomMipDown[i];
  int dst = ShaderConstants._BloomMipUp[i];

  cmd.SetGlobalTexture(ShaderConstants._SourceTexLowMip, lowMip);
  Blit(cmd, highMip, BlitDstDiscardContent(cmd, dst), bloomMaterial, 3);
}

// Shader side (pass 3 above): Upsample()
half3 Upsample(float2 uv)
{
  half3 highMip = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTex, sampler_LinearClamp, uv));

#if _BLOOM_HQ && !defined(SHADER_API_GLES)
  half3 lowMip = DecodeHDR(SampleTexture2DBicubic(TEXTURE2D_X_ARGS(_SourceTexLowMip, sampler_LinearClamp), uv, _SourceTexLowMip_TexelSize.zwxy, (1.0).xx, unity_StereoEyeIndex));
#else
  half3 lowMip = DecodeHDR(SAMPLE_TEXTURE2D_X(_SourceTexLowMip, sampler_LinearClamp, uv));
#endif

  return lerp(highMip, lowMip, Scatter);
}

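Without HQ, each upsample step uses lerp to blend highMip (mip n, the higher-resolution level) with lowMip (mip n+1). The blend factor is the Scatter value, which controls how far the bloom spreads. With HQ, the blend is the same. The only difference is that lowMip is fetched with bicubic rather than bilinear filtering. This smooths the colour further, at a higher performance cost.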
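To see how Scatter shapes the result, the upsample chain can be reduced to one value per mip. The Python sketch below tracks how much each _BloomMipDown level contributes to _BloomMipUp[0]. The function name is hypothetical. Treat scatter as the value the shader actually receives, which may differ from the Volume slider if URP remaps it.

```python
def mip_contributions(mip_count, scatter):
    """Weight of each _BloomMipDown[k] in the final _BloomMipUp[0].

    Treats every mip as a single value and replays the upsample loop:
    the first step blends MipDown[last-1] with MipDown[last], and every
    later step blends MipDown[i] with the previous result (MipUp[i + 1]).
    """
    if mip_count < 2:
        raise ValueError("the upsample loop needs at least two mips")
    weights = [0.0] * mip_count
    weights[-1] = 1.0  # lowMip of the first step is the smallest MipDown
    for i in range(mip_count - 2, -1, -1):
        # lerp(highMip, lowMip, scatter) = (1 - scatter) * highMip + scatter * lowMip
        weights = [w * scatter for w in weights]
        weights[i] += 1.0 - scatter
    return weights


if __name__ == "__main__":
    for s in (0.3, 0.7):
        print(s, [round(w, 3) for w in mip_contributions(8, s)])
```

Each level k above the smallest ends up weighted (1 − scatter)·scatter^k. The smallest level gets scatter^(mipCount−1). The weights therefore always sum to 1. Raising scatter shifts weight toward the small, blurry mips, which widens the glow.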

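Performance Optimization

Simple Optimizations

The source analysis above shows that enabling HQ lowers performance, for two reasons:

  • On the first downsample, HQ blurs the original full-screen image, taking 13 samples per output pixel.
  • During upsampling, it calls SampleTexture2DBicubic to perform bicubic interpolation.

These extra samples and computations are why HQ takes longer.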

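In addition, URP Bloom's default number of downsample and upsample steps is fairly high, which means many Blits. At 2400×1080, the iteration count is 9 and mipCount is 8 (8 = 9 − 1, since one iteration is skipped by default). That gives 22 Blits by default:

  • Prefilter: 1
  • Downsample: 7 × 2 = 14
  • Upsample: 7

Blits are not free on mobile. Once the textures become small enough, their effect on the blur also diminishes. Based on how much it affected the game's visuals, I set Skip Iterations to 6. This gives a mipCount of 3 (3 = 9 − 6), which reduces the number of Blur passes to 2 and the Blit count to 7 (1 + 2 × 2 + 2).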
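The Blit count can be read directly off the loops in SetupBloom(). This Python sketch lists every Blit along with the shader pass index used above. The names and the MipDown/MipUp labels are hypothetical shorthand for the ShaderConstants IDs. The later pass that composites bloom into the image is not counted, matching the count in the text.

```python
def bloom_blit_schedule(mip_count):
    """Return the Blit() calls of SetupBloom() as (pass, source, destination)."""
    # Prefilter: full-res source -> half-res MipDown[0] (shader pass 0).
    blits = [(0, "source", "MipDown[0]")]

    # Downsample chain: horizontal blur into MipUp[i] (pass 1),
    # then vertical blur into MipDown[i] (pass 2).
    last_down = "MipDown[0]"
    for i in range(1, mip_count):
        blits.append((1, last_down, f"MipUp[{i}]"))
        blits.append((2, f"MipUp[{i}]", f"MipDown[{i}]"))
        last_down = f"MipDown[{i}]"

    # Upsample chain (pass 3): the first step reads the smallest MipDown as
    # lowMip; later steps read the previous MipUp result.
    for i in range(mip_count - 2, -1, -1):
        low = f"MipDown[{i + 1}]" if i == mip_count - 2 else f"MipUp[{i + 1}]"
        blits.append((3, f"MipDown[{i}] + {low}", f"MipUp[{i}]"))
    return blits


if __name__ == "__main__":
    for call in bloom_blit_schedule(3):
        print(call)
```

len(bloom_blit_schedule(8)) is 22 and len(bloom_blit_schedule(3)) is 7. In general, the count is 3 × mipCount − 2 Blits.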

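Finally, I measured the impact before and after the optimization on a Xiaomi Mi 9. Because of thermal throttling, the numbers may differ slightly from real-world results.

| Configuration | Frame rate (fps) | Frame time (ms) |
|---|---|---|
| HQ on, iterations not limited | 23 | 43 |
| HQ off, iterations not limited | 25 | 40 |
| HQ off, limited to 2 iterations | 26 | 38 |

After tuning these parameters, the overall frame time drops by about 5 ms (from 43 ms to 38 ms), and the visual result is not noticeably different.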

A rough comparison of the sample counts before and after the optimization shows why the time drops so much. Here N is the number of screen pixels and T is the number of Blur passes (mipCount − 1), so each series below has T terms:

  • Prefilter:
    • HQ off: $P = N$
    • HQ on: $P = 13 \cdot N$
  • DownHorizontal:
    • $DH = 9 \cdot (\frac{1}{4}N + \frac{1}{16}N + \frac{1}{64}N + \cdots) \Rightarrow DH = 9 \cdot \frac{1 - (\frac{1}{4})^T}{3}N$
  • DownVertical:
    • $DV = 5 \cdot (\frac{1}{16}N + \frac{1}{64}N + \frac{1}{256}N + \cdots) \Rightarrow DV = 5 \cdot \frac{1}{4} \cdot \frac{1 - (\frac{1}{4})^T}{3}N$
  • Up:
    • HQ off: $UP = (\frac{1}{4}N + \frac{1}{16}N + \frac{1}{64}N + \cdots) \Rightarrow UP = \frac{1 - (\frac{1}{4})^T}{3}N$
    • HQ on: $UP = 4 \cdot (\frac{1}{4}N + \frac{1}{16}N + \frac{1}{64}N + \cdots) \Rightarrow UP = 4 \cdot \frac{1 - (\frac{1}{4})^T}{3}N$
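| Configuration | Blits | Samples |
|---|---|---|
| HQ on, iterations not limited | 22 | 17.74971 N |
| HQ off, iterations not limited | 22 | 4.74977 N |
| HQ off, limited to 2 iterations | 7 | 4.515625 N |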
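The sample column can be reproduced from the formulas above. This Python sketch evaluates P + DH + DV + UP per screen pixel. It uses T = 7 for the default 2400×1080 case and T = 2 for the limited case. The function name is hypothetical.

```python
def sample_counts(blur_passes, hq):
    """Samples per screen pixel (multiply by N) under the rough model above.

    blur_passes is T = mipCount - 1; every series has T terms.
    """
    geo = (1 - 0.25 ** blur_passes) / 3   # 1/4 + 1/16 + ... (T terms)
    prefilter = 13 if hq else 1           # P
    down_h = 9 * geo                      # DH
    down_v = 5 * 0.25 * geo               # DV
    up = (4 if hq else 1) * geo           # UP: bicubic counted as 4 fetches
    return prefilter + down_h + down_v + up


if __name__ == "__main__":
    for label, t, hq in (("HQ on, unlimited", 7, True),
                         ("HQ off, unlimited", 7, False),
                         ("HQ off, 2 iterations", 2, False)):
        print(f"{label}: {sample_counts(t, hq):.6f} * N")
```

It returns about 17.74971, 4.74977 and 4.515625, matching the table. The sketch follows the rough model as written; for example, the upsample term counts only the low-mip fetches. Treat the totals as a relative comparison rather than an exact fetch count. Of the roughly 13 extra samples per pixel that HQ costs, 12 come from the 13-tap prefilter term.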

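Dual Blur Algorithm

Both Dual (Kawase) Blur and URP's Bloom speed up blurring by downsampling and then upsampling. The only difference is that URP combines a Gaussian blur with downsampling, whereas Dual Blur combines a Kawase-style filter with downsampling.

Algorithmically, Dual Blur merges DownHorizontal and DownVertical into a single pass. When downsampling, it samples only the centre and four corner points. When upsampling, it samples eight points around the centre.

Compared this way, the two need roughly the same number of samples, but Dual Blur cuts the Blit count by about a third. Overall, there is a real performance difference between them, but it is smaller than one might expect.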
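A small Python sketch makes the comparison concrete. The tap layouts follow the description above, with five taps down and eight taps up, using the weights of the dual filter as it is commonly implemented. The offsets are in units of a hypothetical halfpixel step. The Blit layout for Dual Blur is an assumption: one prefilter, one combined downsample per new mip, and one upsample per step.

```python
from fractions import Fraction

# Dual filter kernels as commonly implemented; offsets are in units of a
# `halfpixel` step and the weights are already normalised.
DUAL_DOWN_TAPS = [((0, 0), Fraction(4, 8))] + [
    ((x, y), Fraction(1, 8)) for x, y in ((-1, -1), (1, 1), (1, -1), (-1, 1))
]
DUAL_UP_TAPS = [
    ((x, y), Fraction(1, 12)) for x, y in ((-2, 0), (2, 0), (0, 2), (0, -2))
] + [
    ((x, y), Fraction(2, 12)) for x, y in ((-1, 1), (1, 1), (1, -1), (-1, -1))
]


def urp_blits(mip_count):
    # Prefilter + (horizontal + vertical) per new mip + one upsample per step.
    return 1 + 2 * (mip_count - 1) + (mip_count - 1)


def dual_blits(mip_count):
    # Assumption: same prefilter, then one combined downsample pass per new
    # mip and one upsample pass per step.
    return 1 + (mip_count - 1) + (mip_count - 1)


if __name__ == "__main__":
    for m in (3, 8):
        saving = 1 - dual_blits(m) / urp_blits(m)
        print(m, urp_blits(m), dual_blits(m), f"{saving:.1%}")
```

With mipCount = 8, this gives 22 Blits for URP against 15 for Dual Blur. That is a reduction of 7/22 ≈ 32%, which matches the "about a third" estimate.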
