<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://filmicworlds.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://filmicworlds.com/" rel="alternate" type="text/html" /><updated>2026-05-03T15:53:55+00:00</updated><id>https://filmicworlds.com/feed.xml</id><title type="html">Filmic Worlds</title><subtitle>Next-next-gen graphics.</subtitle><entry><title type="html">Temporal Super Resolution via Multisampling</title><link href="https://filmicworlds.com/blog/temporal-super-resolution-via-multisampling/" rel="alternate" type="text/html" title="Temporal Super Resolution via Multisampling" /><published>2025-05-10T00:00:00+00:00</published><updated>2025-05-10T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/temporal-super-resolution-via-multisampling</id><content type="html" xml:base="https://filmicworlds.com/blog/temporal-super-resolution-via-multisampling/"><![CDATA[<p>Is anyone using temporal reprojection to improve MSAA?</p>

<p>Temporal anti-aliasing (TAA) has been around for over 10 years [8][1] and various approaches have been used to apply
super resolution as well [2][3][4][5][9][10][11]. MSAA has been around even longer, so it would make sense to use temporal information to improve the quality, right?</p>

<p>I spent quite a bit of time trying to find existing references, but temporal information with MSAA seems largely unexplored. SMAA S4 [6] uses a hybrid of 2x MSAA and
temporal information, and a few approaches incorporate reprojection with a checkerboard pattern [12]. Additionally, The Order: 1886 [13] uses a custom 4x MSAA resolve
with temporal information without applying a jitter pattern. But I have not been able to find references to using larger MSAA patterns
with longer jitter patterns.</p>

<p>For this test, we will stick
with 4x MSAA at 1080p, and then upsample into a 4k buffer (just as in the <a href="/blog/upsampling-via-multisampling/">previous post</a>). I highly recommend reading that post
before this one, as this post builds on that algorithm. The difference is that we will offset each frame by half a pixel in a 4 frame cycle with the following
jitter pattern.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="kt">int</span> <span class="n">jitterIndex</span> <span class="o">=</span> <span class="n">GetFrameIndex</span><span class="p">()</span> <span class="o">%</span> <span class="mi">4</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">jitterX</span> <span class="o">=</span> <span class="mf">0.5</span><span class="n">f</span> <span class="o">*</span> <span class="p">(</span><span class="kt">float</span><span class="p">)(</span><span class="n">jitterIndex</span><span class="o">%</span><span class="mi">2</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">jitterY</span> <span class="o">=</span> <span class="mf">0.5</span><span class="n">f</span> <span class="o">*</span> <span class="p">(</span><span class="kt">float</span><span class="p">)(</span><span class="n">jitterIndex</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span></code></pre></figure>
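<p>As a small CPU-side sketch of the same cycle, the offsets can also be expressed in output pixels. The function names here are hypothetical, and the 2x scale per axis assumes the 1080p to 4k setup used in this post.</p>

```cpp
#include <utility>

// Jitter offset for a frame, in source (1080p) pixels.
// Mirrors the snippet above; the function name is hypothetical.
std::pair<float, float> JitterInSourcePixels(int frameIndex)
{
    int jitterIndex = frameIndex % 4;
    float jitterX = 0.5f * (float)(jitterIndex % 2);
    float jitterY = 0.5f * (float)(jitterIndex / 2);
    return { jitterX, jitterY };
}

// The same offset in output (4k) pixels. With a 2x scale per axis,
// half a source pixel is exactly one output pixel.
std::pair<int, int> JitterInOutputPixels(int frameIndex)
{
    std::pair<float, float> j = JitterInSourcePixels(frameIndex);
    return { (int)(2.0f * j.first), (int)(2.0f * j.second) };
}
```

<p>Frames 0 through 3 give output-pixel offsets of (0,0), (1,0), (0,1) and (1,1), and then the cycle repeats.</p>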

<p>There are other ways to jitter the image, but the simple pattern above has some helpful synergies with 4x MSAA at a 4x area upsample. In the header image of this page, the first image shows a standard TAA implementation
and the second image shows the same scene with 4x MSAA using a standard box resolve. The third image shows a 4x upsample
using the algorithm from the previous post. Then the fourth image shows a temporal
super resolution image using the algorithm from this page.</p>

<p>Edges are important, but what about texture detail? The image below shows a comparison of the curtains in Sponza. The TAA, 4x MSAA, and spatial upsample from the previous post are all unable to resolve the pattern on the curtain.
However, by using a 4 frame jitter with temporal super resolution we can clearly resolve it.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/curtain-resolve_crop_v2.jpg" /></div>

<p><strong>Limitations of Single Frame Upsampling</strong></p>

<p>Looking back at the <a href="/blog/upsampling-via-multisampling/">previous post</a>, there are a few obvious limitations of using a single MSAA frame. To keep things simple, we will only focus on the 4x MSAA version
(and ignore the 2x and 8x MSAA variations).</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/sample-4x-compare.png" /></div>

<p><i>A comparison of a 4x MSAA standard resolve (left) and 4x MSAA Upsample Image from the previous post (right).</i></p>

<!--- 
<strong><i>Pixel Aliasing:</i></strong>

With 4x MSAA, we have 4 geometry samples for each source pixel. However, we will in general only have one shading sample. The same pixel color will be reused on all
4 samples if all samples are on the same triangle, which is the most common situation. Thus while there are some tricks we could do with filtering, it is going to be very
hard to increase the quality of detail inside the edges with filtering alone.

<strong><i>Wobbly Edges:</i></strong>

Another artifact is wobbly edges. Even though a triangle edge should be straight, that edge cuts through the MSAA pattern irregularly and we have these sharp jagged angles
in the silhouette.

<strong>Temporal Super Resolution:</strong>

Generally, if we have a single image and we try to increase the perceptual resolution of that image, it is called upsampling. Rather, merge several low resolution images
into a single high-res image is called super resolution. While there are many super resolution algorithms using a source 1x image per frame, can we instead use a 4x MSAA image
as the source?

--->

<p><strong>Jitter for 4x MSAA</strong></p>

<p>The key issue with 4x MSAA is that while we are storing color samples at 4 positions per pixel, we are reusing the same shaded color at all 4 samples if they are covered by the
same triangle.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/dx12-4x.png" /></div>

<p>In the previous post we were able to increase edge quality along the edges between different triangles. However, for the flat interior of a triangle we end up with blocky 2x2
groups of output pixels that all share the same color.</p>

<!---

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/sample-4x-blocks.png"></div>

<i>The left side is a 4x MSAA image with a standard resolve, and the right side uses a 4x area upsample. This results in cleaner edges between triangles, but 2x2 blocks of similar
color within triangle boundaries.</i>

--->

<p>Let us take a look at a single pixel. Our goal is to render a 4x MSAA image at 1080p and apply it to a 4k output (4x area scaling). Let us look at one of these output pixels in black.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/patern_samples_0.png" /></div>

<p>Assuming we are not on a triangle edge, a single sample point in blue will write to the 4 adjacent high-res pixels. In a single frame, our target black pixel takes its color from the nearby blue sample point.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/patern_samples_1.png" /></div>

<p>But what happens if we shift the sampling horizontally by half a pixel in the source grid? Of course, “half a pixel” in the original 1080p 4x MSAA image is the same as a “full pixel” in
the output 4k image. The black pixel will get a color from a different sampling position.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/patern_samples_2.png" /></div>

<p>Next, we can offset the sampling vertically.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/patern_samples_3.png" /></div>

<p>And one more time, now vertically and horizontally.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/patern_samples_4.png" /></div>

<p>Then we can repeat this cycle of 4 offsets and each output pixel will converge to the average of the four neighbors on its corners.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/patern_samples_5.png" /></div>
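<p>To see why cycling through four values settles near their average, here is a minimal sketch that blends a repeating cycle of four scalar values into a history value with a fixed blend factor. It ignores reprojection and color clamping, and the function name is hypothetical.</p>

```cpp
// Repeatedly blend a repeating cycle of four per-frame values into a
// history value, the way an accumulation buffer would with a fixed blend.
// No reprojection and no color clamp; this only illustrates the averaging.
float AccumulateCycle(const float samples[4], float blend, int numFrames)
{
    float history = samples[0];
    for (int frame = 1; frame < numFrames; ++frame)
    {
        float current = samples[frame % 4];
        // Same as lerp(history, current, blend).
        history = history + blend * (current - history);
    }
    return history;
}
```

<p>With a 5% blend and inputs of 0.2, 0.8, 0.4 and 0.6, the result after a few hundred frames sits very close to the average of 0.5. It is not exactly the average, because an exponential blend always weights the most recent frames slightly more than older ones.</p>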

<p>This is the key insight into temporal super resolution using MSAA. With this 4 frame cycle, the output image converges quickly. Additionally, this algorithm allows us to resolve smaller details which cannot
be resolved by filtering or sharpening a single lower-res image. But what about the other major problem of wobbly edges?</p>

<p><strong>Wobbly Edges</strong></p>

<p>Going back to the previous post, we end up with a little bit of wobbling on diagonal edges. Why is this? To start, let us take a closer look at the 4x MSAA pattern.
As a rotated grid, the MSAA pattern solves the “4 rooks” problem. If we assume that the pattern is a 4x4 chessboard, and all 4 sample points are rooks, they would
not be able to attack each other.</p>

<p>If we draw a horizontal line through each sample point, each line only touches one sample point per MSAA pixel.</p>

<div style="text-align:center;"><img width="400" src="/images/2025_05_12_temporal_msaa/intersect_grid_horiz.png" /></div>

<p>Similarly, for vertical lines, each line only touches a single sample point per MSAA pixel.</p>

<div style="text-align:center;"><img width="400" src="/images/2025_05_12_temporal_msaa/intersect_grid_vert.png" /></div>

<p>However, this is not true for diagonal lines. For the diagonal case, two of the sample points are on their own line (samples 0 and 3). But samples 1 and 2
share the same line (in red). And there are no samples on the line that goes through the origin (in green).</p>

<div style="text-align:center;"><img width="400" src="/images/2025_05_12_temporal_msaa/intersect_grid_diag.png" /></div>

<p>If a triangle edge goes through a pixel at a near 45 degree angle, the actual coverage of the triangle will depend heavily
on how that edge is aligned to the sample grid, which results in wobbly edges. But since we need to jitter anyways for super resolution, how does this jitter pattern
affect the sample grid over time? The first frame will use the standard 4x MSAA pattern.</p>

<div style="text-align:center;"><img width="400" src="/images/2025_05_12_temporal_msaa/grid_offset_0.png" /></div>

<p>The second offset in our pattern applies a half pixel offset in x but not y (0.5,0.0). Here is where the samples land on the grid.</p>

<div style="text-align:center;"><img width="400" src="/images/2025_05_12_temporal_msaa/grid_offset_1.png" /></div>

<p>The third offset applies a half pixel offset in y but not x (0.0,0.5).</p>

<div style="text-align:center;"><img width="400" src="/images/2025_05_12_temporal_msaa/grid_offset_2.png" /></div>

<p>And the final offset is a half pixel in both x and y.</p>

<div style="text-align:center;"><img width="400" src="/images/2025_05_12_temporal_msaa/grid_offset_3.png" /></div>

<p>Over the course of 4 frames, we end up accumulating a perfectly even pattern.</p>

<div style="text-align:center;"><img width="400" src="/images/2025_05_12_temporal_msaa/intersect_grid_diag_fixed.png" /></div>
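<p>The diagonal coverage can be checked with a few lines of code. This sketch assumes the standard D3D 4x MSAA sample positions in 1/16 pixel units and the diagonal direction x - y = constant, wrapped per pixel. The function and constant names are hypothetical.</p>

```cpp
#include <array>

// Assumed: standard D3D 4x MSAA sample positions, in 1/16 pixel units.
constexpr int kSamples[4][2] = { { -2, -6 }, { 6, -2 }, { -6, 2 }, { 2, 6 } };

// Counts how many samples land on each diagonal line (x - y = const,
// wrapped to one pixel) after accumulating numJitterFrames of the cycle.
std::array<int, 16> DiagonalHistogram(int numJitterFrames)
{
    std::array<int, 16> hist{};
    for (int frame = 0; frame < numJitterFrames; ++frame)
    {
        int jx = 8 * (frame % 2); // half a pixel is 8/16
        int jy = 8 * (frame / 2);
        for (const auto& s : kSamples)
        {
            int d = ((s[0] + jx) - (s[1] + jy)) % 16;
            if (d < 0) d += 16;
            ++hist[d];
        }
    }
    return hist;
}
```

<p>For a single frame the counts on lines 0, 4, 8 and 12 are 0, 1, 2 and 1, which matches the uneven diagonal coverage described above. After all 4 jitter frames each of those lines holds exactly 4 samples.</p>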

<p>In the image below, the original MSAA image has trouble resolving the diagonal lines.
In each of the 4 frames with upsampling, the edges are resolved but they wobble a bit causing sharp angles. However, the four frames evenly balance each other and the diagonal lines are clean
in the final temporal image.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/layer_sample_blend_crop.jpg" /></div>

<p>Originally, after the spatial upsample test I was wondering if it made sense to apply pattern matching like SMAA [6] to fix the wobbling areas.
But temporal anti-aliasing cleans it up. Conveniently, the pattern that gives us good data for 4x super sampling on interior
regions also gives us the perfect sample pattern for diagonal edges. I would like to claim that I had some grand plan here, but the truth 
is that sometimes things just work out.</p>

<p><strong>Applying the Jitter</strong></p>

<p>In the previous post, the 4 output pixel colors were generated from the marked ‘x’ positions below.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/ascii-4x-neighbor.png" /></div>

<p>With jitter it becomes slightly more complicated. We can think of the MSAA pattern as 9 different buckets, and depending on our jitter we want
a different 2x2 group of buckets.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/sample_pattern_blocks_buckets.png" /></div>

<p>Performing both a jitter and a separate offset keeps the actual frame stationary. Over the course of one cycle of the pattern, we end up
with 4 separate images.</p>

<p>The next step is to actually accumulate the pattern. Once we render into a 4x MSAA pattern, how should we apply it to our temporal accumulation buffer? To keep
it simple, we will store a 4x area accumulation buffer. That means each frame we will render into a 1080p (1920x1080) 4x MSAA buffer, and merge that data into a 4k (3840x2160)
accumulation buffer.</p>

<p>Motion vectors are fetched from a 1x, 1080p buffer that gets rendered earlier. Then for each of the 4 buckets, the shader gathers the previous reprojected color, applies a color
clamp, and lerps with the color for the current frame bucket. All 4 buckets use the same motion vector.</p>

<p>For the color clamp we can use all pixels involved in the calculation for that bucket. Each color inside the bucket is determined by evaluating a cross from 4 neighbors and weighting them
based on the gradient. Simply expanding the color box to include all 4 neighbors made the most sense. The example code for calculating the color and box bounds
for a bucket is below. Also note that instead of using the smallest absolute gradient, the algorithm now weights both gradients together.</p>

<p>First we have a helper function that, given the 4 points in the cross, calculates the color and expands the color box.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="nf">CalcDiamondAbsDiffColorClamp</span><span class="p">(</span>
             <span class="n">inout</span> <span class="n">float3</span> <span class="n">low</span><span class="p">,</span> <span class="n">inout</span> <span class="n">float3</span> <span class="n">high</span><span class="p">,</span>
             <span class="n">float3</span> <span class="n">left</span><span class="p">,</span> <span class="n">float3</span> <span class="n">right</span><span class="p">,</span> <span class="n">float3</span> <span class="n">up</span><span class="p">,</span> <span class="n">float3</span> <span class="n">down</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">float</span> <span class="n">lumL</span> <span class="o">=</span> <span class="n">CalcLuminance</span><span class="p">(</span><span class="n">left</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">lumR</span> <span class="o">=</span> <span class="n">CalcLuminance</span><span class="p">(</span><span class="n">right</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">lumU</span> <span class="o">=</span> <span class="n">CalcLuminance</span><span class="p">(</span><span class="n">up</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">lumD</span> <span class="o">=</span> <span class="n">CalcLuminance</span><span class="p">(</span><span class="n">down</span><span class="p">);</span>
	
	<span class="kt">float</span> <span class="n">eps</span> <span class="o">=</span> <span class="mf">1e-5</span><span class="n">f</span><span class="p">;</span>
	<span class="kt">float</span> <span class="n">diffH</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">eps</span><span class="p">,</span><span class="n">abs</span><span class="p">(</span><span class="n">lumL</span> <span class="o">-</span> <span class="n">lumR</span><span class="p">));</span>
	<span class="kt">float</span> <span class="n">diffV</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">eps</span><span class="p">,</span><span class="n">abs</span><span class="p">(</span><span class="n">lumU</span> <span class="o">-</span> <span class="n">lumD</span><span class="p">));</span>
	
	<span class="kt">float</span> <span class="n">wh</span> <span class="o">=</span> <span class="n">diffV</span><span class="o">/</span><span class="p">(</span><span class="n">diffH</span> <span class="o">+</span> <span class="n">diffV</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">wv</span> <span class="o">=</span> <span class="mf">1.0</span><span class="n">f</span> <span class="o">-</span> <span class="n">wh</span><span class="p">;</span>
	
	<span class="n">float3</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	
	<span class="n">float3</span> <span class="n">avgH</span> <span class="o">=</span> <span class="p">(</span><span class="n">left</span> <span class="o">+</span> <span class="n">right</span><span class="p">)</span> <span class="o">*</span> <span class="mf">.5</span><span class="n">f</span><span class="p">;</span>
	<span class="n">float3</span> <span class="n">avgV</span> <span class="o">=</span> <span class="p">(</span><span class="n">up</span> <span class="o">+</span> <span class="n">down</span><span class="p">)</span> <span class="o">*</span> <span class="mf">.5</span><span class="n">f</span><span class="p">;</span>
	
	<span class="n">ret</span> <span class="o">=</span> <span class="n">avgH</span><span class="o">*</span><span class="n">wh</span> <span class="o">+</span> <span class="n">avgV</span><span class="o">*</span><span class="n">wv</span><span class="p">;</span>
	
	<span class="n">low</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">low</span><span class="p">,</span><span class="n">min</span><span class="p">(</span><span class="n">left</span><span class="p">,</span><span class="n">right</span><span class="p">));</span>
	<span class="n">high</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">high</span><span class="p">,</span><span class="n">max</span><span class="p">(</span><span class="n">left</span><span class="p">,</span><span class="n">right</span><span class="p">));</span>
	
	<span class="n">low</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">low</span><span class="p">,</span><span class="n">min</span><span class="p">(</span><span class="n">up</span><span class="p">,</span><span class="n">down</span><span class="p">));</span>
	<span class="n">high</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">high</span><span class="p">,</span><span class="n">max</span><span class="p">(</span><span class="n">up</span><span class="p">,</span><span class="n">down</span><span class="p">));</span>
	
	<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
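<p>For experimenting outside of a shader, the helper can be ported to plain C++. This is only a sketch: <i>Float3</i> and its small helpers are stand-ins for the shader types, and <i>CalcLuminance</i> is assumed to use Rec. 709 weights since it is not shown here.</p>

```cpp
#include <algorithm>
#include <cmath>

struct Float3 { float x, y, z; };

static Float3 Min3(Float3 a, Float3 b) { return { std::min(a.x, b.x), std::min(a.y, b.y), std::min(a.z, b.z) }; }
static Float3 Max3(Float3 a, Float3 b) { return { std::max(a.x, b.x), std::max(a.y, b.y), std::max(a.z, b.z) }; }
static Float3 Blend(Float3 a, float wa, Float3 b, float wb)
{
    return { a.x * wa + b.x * wb, a.y * wa + b.y * wb, a.z * wa + b.z * wb };
}

// Assumption: Rec. 709 luma weights.
static float CalcLuminance(Float3 c) { return 0.2126f * c.x + 0.7152f * c.y + 0.0722f * c.z; }

Float3 CalcDiamondAbsDiffColorClamp(Float3& low, Float3& high,
                                    Float3 left, Float3 right, Float3 up, Float3 down)
{
    const float eps = 1e-5f;
    float diffH = std::max(eps, std::fabs(CalcLuminance(left) - CalcLuminance(right)));
    float diffV = std::max(eps, std::fabs(CalcLuminance(up) - CalcLuminance(down)));

    // The direction with the smaller luminance difference gets the larger weight.
    float wh = diffV / (diffH + diffV);
    float wv = 1.0f - wh;

    Float3 avgH = Blend(left, 0.5f, right, 0.5f);
    Float3 avgV = Blend(up, 0.5f, down, 0.5f);

    // Expand the color box to include all 4 points of the cross.
    low  = Min3(low,  Min3(Min3(left, right), Min3(up, down)));
    high = Max3(high, Max3(Max3(left, right), Max3(up, down)));

    return Blend(avgH, wh, avgV, wv);
}
```

<p>On a hard vertical edge, with black on the left, white on the right and matching gray values above and below, the result is almost entirely the vertical average, while the color box still expands to cover black through white.</p>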

<p>Next, the color box is initialized to the one known sample and expanded as the other points are calculated.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="c1">// initialize min and max</span>
<span class="n">float3</span> <span class="n">min00</span> <span class="o">=</span> <span class="n">color__4__5</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">max00</span> <span class="o">=</span> <span class="n">color__4__5</span><span class="p">;</span>

<span class="c1">// calculate 4 colors from grid, and expand color box</span>
<span class="n">float3</span> <span class="n">grid__4__5</span> <span class="o">=</span> <span class="n">color__4__5</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">grid__4__4</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiffColorClamp</span><span class="p">(</span>
           <span class="n">min00</span><span class="p">,</span><span class="n">max00</span><span class="p">,</span><span class="n">color__4__1</span><span class="p">,</span><span class="n">color__4__5</span><span class="p">,</span><span class="n">color__2__4</span><span class="p">,</span><span class="n">color__6__4</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__5__4</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiffColorClamp</span><span class="p">(</span>
           <span class="n">min00</span><span class="p">,</span><span class="n">max00</span><span class="p">,</span><span class="n">color__5__3</span><span class="p">,</span><span class="n">color__5__7</span><span class="p">,</span><span class="n">color__2__4</span><span class="p">,</span><span class="n">color__6__4</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__5__5</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiffColorClamp</span><span class="p">(</span>
           <span class="n">min00</span><span class="p">,</span><span class="n">max00</span><span class="p">,</span><span class="n">color__5__3</span><span class="p">,</span><span class="n">color__5__7</span><span class="p">,</span><span class="n">color__4__5</span><span class="p">,</span><span class="n">color__8__5</span><span class="p">);</span>

<span class="c1">// calculate final expected pixel for this frame</span>
<span class="n">float3</span> <span class="n">dst00</span> <span class="o">=</span> <span class="mf">0.25</span><span class="n">f</span><span class="o">*</span><span class="p">(</span><span class="n">grid__4__4</span> <span class="o">+</span> <span class="n">grid__4__5</span> <span class="o">+</span> <span class="n">grid__5__4</span> <span class="o">+</span> <span class="n">grid__5__5</span><span class="p">);</span></code></pre></figure>

<p>Then the four history samples are fetched using the same motion vector plus a small per-bucket offset.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="c1">// fetch the 4 previous colors to use for our 4 buckets</span>
<span class="n">float3</span> <span class="n">overlayTex00</span> <span class="o">=</span> <span class="n">prevTex</span><span class="p">.</span><span class="n">SampleLevel</span><span class="p">(</span><span class="n">s_samplerLinearClamp</span><span class="p">,</span>
           <span class="n">uv</span><span class="o">+</span><span class="n">motion</span> <span class="o">+</span> <span class="n">float2</span><span class="p">(</span><span class="o">-</span><span class="n">quarterPixelX</span><span class="p">,</span><span class="o">-</span><span class="n">quarterPixelY</span><span class="p">),</span><span class="mf">0.0</span><span class="n">f</span><span class="p">).</span><span class="n">rgb</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">overlayTex01</span> <span class="o">=</span> <span class="n">prevTex</span><span class="p">.</span><span class="n">SampleLevel</span><span class="p">(</span><span class="n">s_samplerLinearClamp</span><span class="p">,</span>
           <span class="n">uv</span><span class="o">+</span><span class="n">motion</span> <span class="o">+</span> <span class="n">float2</span><span class="p">(</span> <span class="n">quarterPixelX</span><span class="p">,</span><span class="o">-</span><span class="n">quarterPixelY</span><span class="p">),</span><span class="mf">0.0</span><span class="n">f</span><span class="p">).</span><span class="n">rgb</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">overlayTex10</span> <span class="o">=</span> <span class="n">prevTex</span><span class="p">.</span><span class="n">SampleLevel</span><span class="p">(</span><span class="n">s_samplerLinearClamp</span><span class="p">,</span>
           <span class="n">uv</span><span class="o">+</span><span class="n">motion</span> <span class="o">+</span> <span class="n">float2</span><span class="p">(</span><span class="o">-</span><span class="n">quarterPixelX</span><span class="p">,</span> <span class="n">quarterPixelY</span><span class="p">),</span><span class="mf">0.0</span><span class="n">f</span><span class="p">).</span><span class="n">rgb</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">overlayTex11</span> <span class="o">=</span> <span class="n">prevTex</span><span class="p">.</span><span class="n">SampleLevel</span><span class="p">(</span><span class="n">s_samplerLinearClamp</span><span class="p">,</span>
           <span class="n">uv</span><span class="o">+</span><span class="n">motion</span> <span class="o">+</span> <span class="n">float2</span><span class="p">(</span> <span class="n">quarterPixelX</span><span class="p">,</span> <span class="n">quarterPixelY</span><span class="p">),</span><span class="mf">0.0</span><span class="n">f</span><span class="p">).</span><span class="n">rgb</span><span class="p">;</span></code></pre></figure>

<p>We also want slightly different behavior depending on how much motion we have for the pixel. If the camera is stationary we generally want to apply
a low influence for the current frame so that the frames average together well. However, if we have significant motion in this pixel (more than half a pixel)
then we should rely more on the current pixel and less on the history. While the testing was not rigorous, a 5% blend for stationary motion
vectors versus a 25% blend for significant motion seemed like a good balance.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="c1">// 0.1 pixel movement is considered as no movement, 0.5 is full movement</span>
<span class="kt">float</span> <span class="n">motionT</span> <span class="o">=</span> <span class="n">saturate</span><span class="p">((</span><span class="n">motionLength</span> <span class="o">-</span> <span class="mf">0.1</span><span class="n">f</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="mf">0.5</span><span class="n">f</span><span class="o">-</span><span class="mf">0.1</span><span class="n">f</span><span class="p">));</span>
		
<span class="c1">// check for borders</span>
<span class="kt">float</span> <span class="n">weight</span> <span class="o">=</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">prevUv</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">invSizeX</span> <span class="o">||</span> <span class="n">prevUv</span><span class="p">.</span><span class="n">y</span> <span class="o">&lt;</span> <span class="n">invSizeY</span> <span class="o">||</span>
    <span class="n">prevUv</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">invSizeX</span> <span class="o">||</span> <span class="n">prevUv</span><span class="p">.</span><span class="n">y</span> <span class="o">&gt;</span> <span class="mf">1.0</span><span class="n">f</span> <span class="o">-</span> <span class="n">invSizeY</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">weight</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// determine final lerp value</span>
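<span class="c1">// if the reprojected position is off-screen (weight == 0), use only the current frame</span>
<span class="kt">float</span> <span class="n">t</span> <span class="o">=</span> <span class="n">lerp</span><span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span><span class="n">lerp</span><span class="p">(</span><span class="mf">0.05</span><span class="n">f</span><span class="p">,</span><span class="mf">0.25</span><span class="n">f</span><span class="p">,</span><span class="n">motionT</span><span class="p">),</span><span class="n">weight</span><span class="p">);</span>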

<span class="c1">// clamp history value with color box</span>
<span class="n">overlayTex00</span> <span class="o">=</span> <span class="n">clamp</span><span class="p">(</span><span class="n">overlayTex00</span><span class="p">,</span><span class="n">min00</span><span class="p">,</span><span class="n">max00</span><span class="p">);</span>

<span class="c1">// lerp final pixel with history</span>
<span class="n">ret00</span> <span class="o">=</span> <span class="n">lerp</span><span class="p">(</span><span class="n">overlayTex00</span><span class="p">,</span><span class="n">dst00</span><span class="p">,</span><span class="n">t</span><span class="p">);</span></code></pre></figure>
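<p>The motion-dependent part of the blend is easy to tabulate on the CPU. The sketch below covers only the mapping from motion length to blend factor and deliberately leaves out the border check. The function name is hypothetical.</p>

```cpp
#include <algorithm>

// Current-frame blend factor as a function of motion length in pixels.
// 0.1 pixels or less counts as no movement, 0.5 pixels or more as full movement.
float CurrentFrameBlend(float motionLengthPixels)
{
    float motionT = (motionLengthPixels - 0.1f) / (0.5f - 0.1f);
    motionT = std::min(1.0f, std::max(0.0f, motionT)); // saturate
    return 0.05f + (0.25f - 0.05f) * motionT;          // lerp(0.05, 0.25, motionT)
}
```

<p>A stationary pixel gets a 5% blend, a pixel moving 0.3 pixels gets about 15%, and anything moving half a pixel or more gets the full 25%.</p>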

<p>Choosing temporal accumulation parameters is an exercise in endless tweaking. At a glance the image looks pretty sharp in motion and converges quickly when still, but much
more testing would be required for a real production.</p>

<p><strong>TAA Comparison 1: Thin Features</strong></p>

<p>One of the common issues that TAA implementations run into is thin features which are less than a pixel wide. Why is this such an issue, and can MSAA help with it? The image below 
shows a comparison of one of the edges on a pillar in Sponza. The top uses vanilla TAA and the bottom uses the algorithm described in this post.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/thin_comparison_semi_zoom.jpg" /></div>

<p>TAA does a pretty good job here on the surfaces but fails to reconstruct the thin edge. Why?</p>

<p>Here are two consecutive images in the TAA sequence. In each frame, the history is clamped to the min and max of the 3x3 neighborhood in the current frame.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/thin_comparison_crop_non_matching.png" /></div>

<p>The first frame will put these bright colors into the history buffer. But the following frame will only
see black pixels and clamp those bright history pixels into the black neighborhood. Thus, the reconstruction fails.</p>

<p>However, we get much better results with temporal super resolution via MSAA. But why? Let’s take a look at the 4 candidate images that are used in the jitter pattern.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/thin_comparison_crop_4.png" /></div>

<p>At a glance, the results are not much better than TAA. Each frame has gaps in the pixels. But the color bounding box includes the 4 candidate pixels from the cross. Even if the current cross point gets evaluated poorly, the color clamp is adjusted by the 4 samples
that form the cross for that point, and at least one of those samples should have a color that matches the correct value. Those crossing pixels expand the color clamp which
allows this algorithm to know that the history pixels are still valid. In other words, the MSAA temporal super resolution algorithm <strong>IS NOT</strong> better than TAA at choosing the current pixel color, but it <strong>IS</strong> better at knowing when to trust the history.</p>

<p><strong>TAA Comparison 2: Motion</strong></p>

<p>The second major issue with TAA is motion. In most TAA implementations, the images look great in a still frame but tend to look blurry in motion. Most engines that I have seen use a temporal
weighting of about 4% to 5% for their TAA influence. For each pixel, the output uses an algorithm similar to the following:</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="kt">float</span> <span class="n">lerpT</span> <span class="o">=</span> <span class="mf">0.04</span><span class="n">f</span><span class="p">;</span> <span class="c1">// or some other number</span>

<span class="n">float3</span> <span class="n">colorMin</span> <span class="o">=</span> <span class="n">MinColorFromNeighborhood</span><span class="p">();</span>
<span class="n">float3</span> <span class="n">colorMax</span> <span class="o">=</span> <span class="n">MaxColorFromNeighborhood</span><span class="p">();</span>

<span class="n">float3</span> <span class="n">prevFrameColor</span> <span class="o">=</span> <span class="n">ReprojectedColorUsingMotionVectors</span><span class="p">();</span>
<span class="n">float3</span> <span class="n">currColor</span> <span class="o">=</span> <span class="n">ResolvedColor</span><span class="p">();</span>
<span class="n">float3</span> <span class="n">clampedColor</span> <span class="o">=</span> <span class="n">clamp</span><span class="p">(</span><span class="n">prevFrameColor</span><span class="p">,</span><span class="n">colorMin</span><span class="p">,</span><span class="n">colorMax</span><span class="p">);</span>

<span class="n">float3</span> <span class="n">finalColor</span> <span class="o">=</span> <span class="n">lerp</span><span class="p">(</span><span class="n">clampedColor</span><span class="p">,</span><span class="n">currColor</span><span class="p">,</span><span class="n">lerpT</span><span class="p">);</span></code></pre></figure>

<p>There are other tricks you can do, like changing <i>lerpT</i> based on the length of the motion vector, using a better color space like YCrCb, or using variance for the color clamp.</p>
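<p>As a rough sketch of the first trick, the blend weight can ramp from a small value when the pixel is still to a large value when it moves. The function name <i>CalcLerpT</i> and all three constants below are assumptions chosen for illustration, not values from this post or from any particular engine.</p>

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical motion-adaptive blend weight. The motion vector is in pixels.
// The returned value is the weight of the current frame, matching lerpT above.
inline float CalcLerpT(float motionX, float motionY)
{
    const float lerpStill = 0.04f;        // assumed weight when nothing moves
    const float lerpMoving = 0.5f;        // assumed weight at full motion
    const float fullMotionPixels = 2.0f;  // assumed motion length treated as "full"

    float len = std::sqrt(motionX * motionX + motionY * motionY);
    float t = std::min(len / fullMotionPixels, 1.0f);
    return lerpStill + (lerpMoving - lerpStill) * t;
}
```

<p>A still pixel keeps the 4% weight, while anything moving 2 or more pixels per frame takes half of its color from the current frame. Where to put those thresholds is exactly the kind of tuning that trades blur against ghosting.</p>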

<p>If you are using a <i>lerpT</i> value of 4%, then it takes about 60 frames for the image to converge if there is no motion. But if you have a scene where everything is always moving
a little bit, then TAA never really has time to converge. This can happen with grass blowing in the wind, walking forward slowly, or even in games that have a slight up and down camera translation
to simulate the character breathing. The type of color clamp and the amount of influence for each frame is a complicated trade-off of detail vs blurriness
in motion vs edge jaggies vs ghosting. Changing the parameters to improve one element tends to cause a regression somewhere else.</p>
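<p>To put a number on the convergence time: if the clamp never triggers, each frame keeps (1 - <i>lerpT</i>) of the previous history, so the stale contribution after n frames is (1 - <i>lerpT</i>)^n. The helpers below are a small sketch of that arithmetic. The function names are mine, and the 10% threshold in the usage note is an arbitrary choice of what counts as "converged".</p>

```cpp
#include <cmath>

// Fraction of the original history that is still present after `frames` blends,
// assuming a static image and a history clamp that never triggers.
inline float HistoryResidual(float lerpT, int frames)
{
    return std::pow(1.0f - lerpT, static_cast<float>(frames));
}

// Number of frames until the stale fraction drops to `targetResidual` or below.
inline int FramesToConverge(float lerpT, float targetResidual)
{
    return static_cast<int>(std::ceil(std::log(targetResidual) / std::log(1.0f - lerpT)));
}
```

<p>With <i>lerpT</i> at 0.04, a little under 9% of the stale history is still left after 60 frames, and it takes 57 frames to get below 10%. At 0.5, the same 10% threshold is reached in 4 frames.</p>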

<p>However, using MSAA provides several meaningful benefits. With MSAA, we always get a bit of edge anti-aliasing from the current frame (without temporal information), so
we can get acceptable edges even in motion. And since we are only using a 4 frame cycle, we can aggressively replace the history with the current frame when in motion.</p>

<p>Here is a test from my TAA implementation vs the MSAA super resolution algorithm discussed here. Note that most TAA implementations in game engines are much more robust and sophisticated than mine,
and the super resolution implementations optimized by IHVs even more so. But you will see similar artifacts in shipping titles, as documented in comparison videos [5][7].</p>

<p>For this test, I put together an automated camera script. The camera simply moves forward for 60 frames, takes a screenshot, then remains still for 60 frames to let the image converge and takes another screenshot.</p>

<div style="text-align:center;"><img src="/images/2025_05_12_temporal_msaa/blurry_compare_crop.jpg" /></div>

<p><i>The upper-left image shows the TAA image while in motion. The upper-right shows the TAA image after the camera stays still for 60 frames, allowing the image to converge. The lower-left
shows the MSAA super resolution image while undergoing the same motion, and the lower-right shows that image after the camera remains still. If you look closely, the MSAA super resolution
image in motion is slightly blurrier than the still image.</i></p>

<p>Looking at the first row for the TAA image, there is obvious blurring of the image. Additionally, the edges
are not converged either, which causes crawling jaggy artifacts in motion.</p>

<p>The MSAA super resolution image, while imperfect, looks much improved. The edges of the moving images are slightly softer. We can also see some light “jagged teeth” along
the edges too, although since they are quite small and fade quickly they are hard to see in motion. Finally, while the details are slightly blurrier than the converged image,
it looks much better than the moving TAA image.</p>

<p><strong>Performance</strong></p>

<p>Of course, adding temporal information is not free. Included are the costs from the previous post, as well as the new entry for 4x with temporal information. Timings are in microseconds
on my RTX 3070.</p>

<table border="1" cellspacing="0" cellpadding="10" align="center">
  <tr align="center">
     <th>MSAA Level</th><th>Single Frame (µs)</th><th>Temporal Super Resolution (µs)</th>
  </tr>
  <tr align="center">
     <td>2x</td><td>118</td><td>N/A</td>
  </tr>
  <tr align="center">
     <td>4x</td><td>283</td><td>365</td>
  </tr>
  <tr align="center">
     <td>8x</td><td>731</td><td>N/A</td>
  </tr>
</table>

<p>The cost is higher, but it seems very much worth it. All of these passes have significant room for improvement. So while it’s still too slow, 365 microseconds is a good starting point
before optimization.</p>

<p><strong>Evaluation</strong></p>

<p>Overall, the quality is better than I had expected. Adding temporal super resolution increases the edge quality around the near 45 degree lines. It “de-blocks”
the 2x2 pattern while also increasing resolvable resolution. If you have a 4x MSAA rendered image, temporal super resolution seems like a clear improvement at reasonable cost. But there are also several significant areas to improve.</p>

<ol>
  <li>
    <p>In some cases there are some flickering artifacts. With a repeating 4 pattern cycle, there are cases where one of the pixels causes an outlier which results in little “flickering teeth”. It is
subtle, but you can see it if you look closely. It is unclear if the better approach is to apply an explicit “unteething” pass (similar to Rainbow Six Siege [12]) or if the temporal accumulation
algorithm should be modified.</p>
  </li>
  <li>
    <p>The algorithm still is not equivalent to supersampling. Each pixel is the average of the 4 neighboring corner sample points. Ideally we would be sampling
from pixel centers, as opposed to the average of neighbors. While the super resolution
image is able to resolve details that the lower resolution images miss, the result does look a little soft.</p>
  </li>
  <li>
    <p>The biggest problem (and it’s a big one!) is that you need to have an MSAA render target. Games rely on rendering techniques like deferred lighting, SSAO, SSR, etc. They can
theoretically be done with MSAA, but in practice it’s a maintenance nightmare full of little performance regressions. As it stands, this technique is only viable if you are already rendering an MSAA target, which
is a non-starter for most titles.</p>
  </li>
</ol>

<p><strong>Example Code</strong></p>

<p>For a reference implementation, here are the functions in a standalone file. It will require some changes as this is a snippet from a larger
code base, but hopefully it makes the algorithm easier to understand.</p>

<p><a href="/downloads/2025_05_05_msaa_upsampling/UpsamplingViaMultisampling.hlsl">UpsamplingViaMultisampling.hlsl</a></p>

<p><strong>References</strong></p>

<p>[1] A Survey of Temporal Antialiasing Techniques. Lei Yang, Shiqiu Liu, and Marco Salvi. (<a href="http://behindthepixels.io/assets/files/TemporalAA.pdf">http://behindthepixels.io/assets/files/TemporalAA.pdf</a>)</p>

<p>[2] AMD FidelityFX, Super Resolution. AMD Inc. (<a href="https://www.amd.com/en/technologies/radeon-software-fidelityfx-super-resolution">https://www.amd.com/en/technologies/radeon-software-fidelityfx-super-resolution</a>)</p>

<p>[3] Anti-Aliasing and Upscaling, Epic (<a href="https://dev.epicgames.com/documentation/en-us/unreal-engine/anti-aliasing-and-upscaling-in-unreal-engine">https://dev.epicgames.com/documentation/en-us/unreal-engine/anti-aliasing-and-upscaling-in-unreal-engine</a>)</p>

<p>[4] Arm Accuracy Super Resolution, ARM (<a href="https://github.com/arm/accuracy-super-resolution">https://github.com/arm/accuracy-super-resolution</a>)</p>

<p>[5] DLSS 4.0 Super Resolution Stress Test, Digital Foundry (<a href="https://www.youtube.com/watch?v=iK4tT9AHIOE">https://www.youtube.com/watch?v=iK4tT9AHIOE</a>)</p>

<p>[6] SMAA: Enhanced Subpixel Morphological Antialiasing, Jorge Jimenez, Jose I. Echevarria, Tiago Sousa, and Diego Gutierrez (<a href="https://www.iryoku.com/smaa/downloads/SMAA-Enhanced-Subpixel-Morphological-Antialiasing.pdf">https://www.iryoku.com/smaa/downloads/SMAA-Enhanced-Subpixel-Morphological-Antialiasing.pdf</a>)</p>

<p>[7] FSR 4 is Even Better at 4K, Hardware Unboxed (<a href="https://www.youtube.com/watch?v=SWTot0wwaEU">https://www.youtube.com/watch?v=SWTot0wwaEU</a>)</p>

<p>[8] High-Quality Temporal SuperSampling, Brian Karis (<a href="https://advances.realtimerendering.com/s2014/#_HIGH-QUALITY_TEMPORAL_SUPERSAMPLING">https://advances.realtimerendering.com/s2014/#_HIGH-QUALITY_TEMPORAL_SUPERSAMPLING</a>)</p>

<p>[9] Intel Arc Gaming Technologies, Intel (<a href="https://www.amd.com/en/products/graphics/technologies/fidelityfx/super-resolution.html">https://www.amd.com/en/products/graphics/technologies/fidelityfx/super-resolution.html</a>)</p>

<p>[10] Introducing Snapdragon Game Super Resolution, Qualcomm (<a href="https://www.qualcomm.com/news/onq/2023/04/introducing-snapdragon-game-super-resolution">https://www.qualcomm.com/news/onq/2023/04/introducing-snapdragon-game-super-resolution</a>)</p>

<p>[11] NVIDIA DLSS. NVIDIA Inc. (<a href="https://www.nvidia.com/en-us/geforce/technologies/dlss/">https://www.nvidia.com/en-us/geforce/technologies/dlss/</a>)</p>

<p>[12] Rendering Tom Clancy’s Rainbow Six Siege, Jalal El Mansouri. (<a href="https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdc2016/Presentations/El_Mansouri_Jalal_Rendering_Rainbow_Six.pdf">https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdc2016/Presentations/El_Mansouri_Jalal_Rendering_Rainbow_Six.pdf</a>)</p>

<p>[13] Rendering The Alternate History of The Order: 1886, Matt Pettineo, (<a href="https://www.youtube.com/watch?v=nj4puag4hwc">https://www.youtube.com/watch?v=nj4puag4hwc</a>)</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Is anyone using temporal reprojection to improve MSAA?]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222025_05_12_temporal_msaa/header_v2_crop.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222025_05_12_temporal_msaa/header_v2_crop.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Upsampling via Multisampling</title><link href="https://filmicworlds.com/blog/upsampling-via-multisampling/" rel="alternate" type="text/html" title="Upsampling via Multisampling" /><published>2025-05-04T00:00:00+00:00</published><updated>2025-05-04T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/upsampling-via-multisampling</id><content type="html" xml:base="https://filmicworlds.com/blog/upsampling-via-multisampling/"><![CDATA[<p>Can we use multisampling effectively for upsampling? This has been a question in the back of my mind for give or take 10+ years.</p>

<p>In the image at the top,
the leftmost image shows a standard 4x MSAA scene resolved to 1x by averaging the 4 sample points. The second image uses the exact same source MSAA render target, but
upsamples to 4x area scaling (2x in each dimension). Similarly, the third image shows an 8x scene resolved to 1x by averaging the samples, while the fourth image
is upsampled to 4x area as well. It is a pretty simple idea, and it seems like something that someone has probably tried, but I cannot find any references to it, so here we are.</p>

<p>I specifically remember thinking
about this problem as I was reading Matt Pettineo’s article <a href="https://therealmjp.github.io/posts/msaa-resolve-filters/">Experimenting with Reconstruction Filters for MSAA Resolve</a> (while
Kenny Loggins played in the background) [9]. It always seemed like there must be a good way to use the jittered sample information to upsample the image. I have
looked around for references to doing this, but I have not found many hits on the web (of course, please ping
me if there are important references that I missed).</p>

<p>Then I saw this <a href="https://x.com/NOTimothyLottes/status/1884356243596181677">post</a> from Timothy Lottes and it sent me down a deep rabbit hole.</p>
<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/lottes-lines.png" /></div>

<p>When I saw this post, I nearly jumped out of my seat. I ran into the exact same problem several years ago in my own <a href="/blog/visibility-taa-and-upsampling-with-subsample-history/">upsampling adventures</a> and I never figured out a good solution.
But this is such a simple, elegant solution to the problem. And it got me thinking about how to use a trick like that for general purpose
multisample upsampling.</p>

<p>In the most typical case, we would render to 1080p with 2x/4x/8x MSAA, and we want to upsample the result to 4k (3840x2160). We do not have any other buffers
or temporal information. And we want to do it in one pass to keep the bandwidth down. Can we use the multisampled data in a meaningful way?</p>

<p><strong>2x MSAA Upsampling to 4x Area</strong></p>

<p>Let us start with the 2x case. The sample pattern for 2x is very simple [7]. If you split the pixel into 4 quadrants, you get one sample in the upper left and one sample in the lower right.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/dx12-2x.png" /></div>

<p>Here is an example image showing a comparison between a naive resolve which averages the colors together, versus the two diagonal samples that make up that pixel with black pixels in the missing areas.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/sample-2x-empty.png" /></div>

<p><i>The left shows the naive resolve. The right shows the two original samples.</i></p>

<p>Thus the question becomes: How should we fill those black pixels? We can start with a synthetic example where three of the neighbors are red and one is blue.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/diamond_interpolation_single.png" /></div>

<p>The obvious solution would be the average of all four pixels, but that actually causes problems, especially on edges. On edges, it causes a “zipper” pattern, where the pixels inside and outside
the “teeth” of the edge alternate colors. Fortunately, there is a better algorithm called “smallest absolute difference”, which means you average the pair of opposing neighbors (left/right or up/down) with the smaller difference.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/diamond_interpolation_full.png" /></div>

<p><i>The left shows the original pixel with empty pixels in white. The middle shows the zipper pattern from linear interpolation. The right side fills the missing pixels using the smallest absolute difference.</i></p>

<p>There are other more complex methods, but smallest absolute difference is cheap and effective. At a minimum, the smallest absolute difference result looks much cleaner than the average of all 4 neighbors.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/sample-2x-linear-sad.png" /></div>

<p><i>A linear resolve from averaging all four samples (left) vs using the edge from smallest absolute difference (right).</i></p>

<p>Also, note that this algorithm is very well known. The original use that I could find is from debayering with VNG [18] in DCRAW [8]. But I have also seen it with checkerboard rendering [17][3][1], as well as a component of SMAA 2x [10].
The example code is below. How you calculate luminance is up to you, but I generally prefer the 25% red, 50% green, 25% blue approach.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="nf">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">float3</span> <span class="n">left</span><span class="p">,</span> <span class="n">float3</span> <span class="n">right</span><span class="p">,</span> <span class="n">float3</span> <span class="n">up</span><span class="p">,</span> <span class="n">float3</span> <span class="n">down</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">float</span> <span class="n">lumL</span> <span class="o">=</span> <span class="n">CalcLuminance</span><span class="p">(</span><span class="n">left</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">lumR</span> <span class="o">=</span> <span class="n">CalcLuminance</span><span class="p">(</span><span class="n">right</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">lumU</span> <span class="o">=</span> <span class="n">CalcLuminance</span><span class="p">(</span><span class="n">up</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">lumD</span> <span class="o">=</span> <span class="n">CalcLuminance</span><span class="p">(</span><span class="n">down</span><span class="p">);</span>
	
	<span class="kt">float</span> <span class="n">diffH</span> <span class="o">=</span> <span class="n">abs</span><span class="p">(</span><span class="n">lumL</span> <span class="o">-</span> <span class="n">lumR</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">diffV</span> <span class="o">=</span> <span class="n">abs</span><span class="p">(</span><span class="n">lumU</span> <span class="o">-</span> <span class="n">lumD</span><span class="p">);</span>
	
	<span class="n">float3</span> <span class="n">avgH</span> <span class="o">=</span> <span class="p">(</span><span class="n">left</span> <span class="o">+</span> <span class="n">right</span><span class="p">)</span> <span class="o">*</span> <span class="mf">.5</span><span class="n">f</span><span class="p">;</span>
	<span class="n">float3</span> <span class="n">avgV</span> <span class="o">=</span> <span class="p">(</span><span class="n">up</span> <span class="o">+</span> <span class="n">down</span><span class="p">)</span> <span class="o">*</span> <span class="mf">.5</span><span class="n">f</span><span class="p">;</span>
	
	<span class="n">float3</span> <span class="n">ret</span> <span class="o">=</span> <span class="p">(</span><span class="n">diffH</span> <span class="o">&lt;</span> <span class="n">diffV</span><span class="p">)</span> <span class="o">?</span> <span class="n">avgH</span> <span class="o">:</span> <span class="n">avgV</span><span class="p">;</span>
	
	<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
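<p>The <i>CalcLuminance</i> helper is not shown above. Here is a minimal CPU-side sketch of the 25% red, 50% green, 25% blue weighting, together with a plain C++ port of <i>CalcDiamondAbsDiff</i> so the behavior can be checked outside of a shader. The <i>float3</i> struct and its operators are stand-ins for the HLSL built-in type.</p>

```cpp
#include <cmath>

// Stand-in for the HLSL float3 type.
struct float3 { float x, y, z; };

inline float3 operator+(float3 a, float3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
inline float3 operator*(float3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }

// Cheap luminance estimate: 25% red, 50% green, 25% blue.
inline float CalcLuminance(float3 c)
{
    return 0.25f * c.x + 0.5f * c.y + 0.25f * c.z;
}

// C++ port of the shader function above: average the opposing pair
// (left/right or up/down) whose luminance difference is smaller.
inline float3 CalcDiamondAbsDiff(float3 left, float3 right, float3 up, float3 down)
{
    float diffH = std::fabs(CalcLuminance(left) - CalcLuminance(right));
    float diffV = std::fabs(CalcLuminance(up) - CalcLuminance(down));

    float3 avgH = (left + right) * 0.5f;
    float3 avgV = (up + down) * 0.5f;

    // On a tie, the vertical average is used.
    return (diffH < diffV) ? avgH : avgV;
}
```

<p>With white on the left, right, and top and black on the bottom, the horizontal pair matches exactly, so the result is pure white rather than a 75% gray. One caveat of this particular weighting is that pure red and pure blue have the same luminance (0.25), so a red/blue edge looks like a tie and falls back to the vertical average.</p>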

<p>As an aside, there are other options instead of using the smallest absolute difference. For example, Rainbow Six Siege used linear interpolation with an explicit “unteething” filter [17].</p>

<p><strong>4x MSAA Upsampling to 4x Area</strong></p>

<p>Next up, how about 4x? The jittered grid MSAA pattern looks like so.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/dx12-4x.png" /></div>

<p>The green points have a known color, and the red points are unknown. Each upsampled pixel (at 4x area scaling) has one
green and three reds. Each red point can be estimated from the 4 neighboring green points using smallest absolute difference. Then we can add up all four points in the
upsampled pixel, and if we can calculate a good estimate of all three reds, then each of those upsampled pixels would theoretically have AA quality roughly equal to 2x MSAA.</p>

<div style="text-align:center;"><img width="800" src="/images/2025_05_05_super_msaa/4x-lines.png" /></div>

<p>For simplicity, I prefer to think about one of the original MSAA pixels as a 4x4 grid with 4 samples. In a single pixel, the 4 locations of the samples are listed below.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/ascii-4x-single.png" /></div>

<p>If we want to calculate all 16 points (4 known, 12 unknown), we need to look at the neighbors as well.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/ascii-4x-neighbor.png" /></div>

<p>We want to calculate the color of the <strong>x</strong>s in the grid. The first step is to
fetch the known values in the grid. This is obviously not the fastest way to do it, but it keeps things simple for now.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">int2</span> <span class="n">srcXy</span> <span class="o">=</span> <span class="n">dispatchThreadId</span><span class="p">.</span><span class="n">xy</span><span class="p">;</span>
	
<span class="kt">int</span> <span class="n">halfX0</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">srcXy</span><span class="p">.</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">halfX1</span> <span class="o">=</span> <span class="n">srcXy</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">halfX2</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">sizeX</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="n">srcXy</span><span class="p">.</span><span class="n">x</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
	
<span class="kt">int</span> <span class="n">halfY0</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">srcXy</span><span class="p">.</span><span class="n">y</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">halfY1</span> <span class="o">=</span> <span class="n">srcXy</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">halfY2</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">sizeY</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="n">srcXy</span><span class="p">.</span><span class="n">y</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>

<span class="n">float3</span> <span class="n">color__0__5</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__1__7</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__2__4</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__3__6</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>

<span class="n">float3</span> <span class="n">color__4__1</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__5__3</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__6__0</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__7__2</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__4__5</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__5__7</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__6__4</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__7__6</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__4__9</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__5_11</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__6__8</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__7_10</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>

<span class="n">float3</span> <span class="n">color__8__5</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__9__7</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_10__4</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_11__6</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span></code></pre></figure>
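<p>The variable names above appear to encode the row and column of each sample in the combined 12x12 grid (3x3 source pixels, 4x4 cells each). The sketch below reproduces that mapping under the assumption that the render target uses the standard D3D 4x pattern, with sample offsets of (-2,-6), (6,-2), (-6,2), and (2,6) in 1/16 pixel units from the pixel center. <i>SampleToGrid</i> and <i>GridCell</i> are hypothetical names, not part of the shader.</p>

```cpp
// Row and column in the 12x12 grid covering a 3x3 block of source pixels.
struct GridCell { int row; int col; };

// Assumed standard D3D 4x MSAA sample offsets, in 1/16 pixel units
// relative to the pixel center (x to the right, y down).
static const int kSampleX[4] = { -2, 6, -6, 2 };
static const int kSampleY[4] = { -6, -2, 2, 6 };

// pixelX/pixelY select the source pixel inside the 3x3 block (0, 1, or 2),
// where (1,1) is the pixel being upsampled.
inline GridCell SampleToGrid(int sampleIndex, int pixelX, int pixelY)
{
    // Shift the offset into [0,16) and divide by the cell size (4/16 of a pixel).
    int localCol = (kSampleX[sampleIndex] + 8) / 4;
    int localRow = (kSampleY[sampleIndex] + 8) / 4;

    GridCell cell = { pixelY * 4 + localRow, pixelX * 4 + localCol };
    return cell;
}
```

<p>For example, sample 0 of the center pixel lands at row 4, column 5, which matches <i>color__4__5</i>, and sample 3 of the pixel below lands at row 11, column 6, which matches <i>color_11__6</i>.</p>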

<p>Then we need to calculate all 16 grid points.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="n">grid__4__4</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__4__1</span><span class="p">,</span><span class="n">color__4__5</span><span class="p">,</span><span class="n">color__2__4</span><span class="p">,</span><span class="n">color__6__4</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__4__5</span> <span class="o">=</span> <span class="n">color__4__5</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">grid__4__6</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__4__5</span><span class="p">,</span><span class="n">color__4__9</span><span class="p">,</span><span class="n">color__3__6</span><span class="p">,</span><span class="n">color__7__6</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__4__7</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__4__5</span><span class="p">,</span><span class="n">color__4__9</span><span class="p">,</span><span class="n">color__1__7</span><span class="p">,</span><span class="n">color__5__7</span><span class="p">);</span>

<span class="n">float3</span> <span class="n">grid__5__4</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__5__3</span><span class="p">,</span><span class="n">color__5__7</span><span class="p">,</span><span class="n">color__2__4</span><span class="p">,</span><span class="n">color__6__4</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__5__5</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__5__3</span><span class="p">,</span><span class="n">color__5__7</span><span class="p">,</span><span class="n">color__4__5</span><span class="p">,</span><span class="n">color__8__5</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__5__6</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__5__3</span><span class="p">,</span><span class="n">color__5__7</span><span class="p">,</span><span class="n">color__3__6</span><span class="p">,</span><span class="n">color__7__6</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__5__7</span> <span class="o">=</span> <span class="n">color__5__7</span><span class="p">;</span>
		
<span class="n">float3</span> <span class="n">grid__6__4</span> <span class="o">=</span> <span class="n">color__6__4</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">grid__6__5</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__6__4</span><span class="p">,</span><span class="n">color__6__8</span><span class="p">,</span><span class="n">color__4__5</span><span class="p">,</span><span class="n">color__8__5</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__6__6</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__6__4</span><span class="p">,</span><span class="n">color__6__8</span><span class="p">,</span><span class="n">color__3__6</span><span class="p">,</span><span class="n">color__7__6</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__6__7</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__6__4</span><span class="p">,</span><span class="n">color__6__8</span><span class="p">,</span><span class="n">color__5__7</span><span class="p">,</span><span class="n">color__9__7</span><span class="p">);</span>

<span class="n">float3</span> <span class="n">grid__7__4</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__7__2</span><span class="p">,</span><span class="n">color__7__6</span><span class="p">,</span><span class="n">color__6__4</span><span class="p">,</span><span class="n">color_10__4</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__7__5</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__7__2</span><span class="p">,</span><span class="n">color__7__6</span><span class="p">,</span><span class="n">color__4__5</span><span class="p">,</span><span class="n">color__8__5</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__7__6</span> <span class="o">=</span> <span class="n">color__7__6</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">grid__7__7</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__7__6</span><span class="p">,</span><span class="n">color__7_10</span><span class="p">,</span><span class="n">color__5__7</span><span class="p">,</span><span class="n">color__9__7</span><span class="p">);</span></code></pre></figure>

<p>Then merge the grid points into the four output pixels of the 4x area, averaging each 2x2 block of grid points.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="n">dst00</span> <span class="o">=</span> <span class="mf">0.25</span><span class="n">f</span><span class="o">*</span><span class="p">(</span><span class="n">grid__4__4</span> <span class="o">+</span> <span class="n">grid__4__5</span> <span class="o">+</span> <span class="n">grid__5__4</span> <span class="o">+</span> <span class="n">grid__5__5</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">dst01</span> <span class="o">=</span> <span class="mf">0.25</span><span class="n">f</span><span class="o">*</span><span class="p">(</span><span class="n">grid__4__6</span> <span class="o">+</span> <span class="n">grid__4__7</span> <span class="o">+</span> <span class="n">grid__5__6</span> <span class="o">+</span> <span class="n">grid__5__7</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">dst10</span> <span class="o">=</span> <span class="mf">0.25</span><span class="n">f</span><span class="o">*</span><span class="p">(</span><span class="n">grid__6__4</span> <span class="o">+</span> <span class="n">grid__6__5</span> <span class="o">+</span> <span class="n">grid__7__4</span> <span class="o">+</span> <span class="n">grid__7__5</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">dst11</span> <span class="o">=</span> <span class="mf">0.25</span><span class="n">f</span><span class="o">*</span><span class="p">(</span><span class="n">grid__6__6</span> <span class="o">+</span> <span class="n">grid__6__7</span> <span class="o">+</span> <span class="n">grid__7__6</span> <span class="o">+</span> <span class="n">grid__7__7</span><span class="p">);</span></code></pre></figure>

<p>And finally we can just write the 4 pixels and we are done. So how does it look? Honestly,
not too shabby.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/sample-4x-compare.png" /></div>

<p><i>A naive resolve of the 4x MSAA image (left) and the upsampled resolve (right).</i></p>

<p><strong>8x MSAA Upsampling to 4x Area</strong></p>

<p>And finally we are back to the original problem: 8x. With 4x area upsampling, we could simply
use the two samples that fall inside each output pixel. That is actually what I did in a previous post,
and I had the expected “saw-tooth” artifacts as a result. But we can do better and reconstruct the edges.</p>
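
<p>As a rough sketch of what "using the two samples" amounts to, here is the naive per-quadrant average in plain C++ rather than shader code. It assumes the standard D3D 8x sample positions, uses scalar samples instead of float3 for brevity, and the names kPattern8x, QuadrantOf and NaiveQuadrantResolve are made up for illustration. It is not the code from the previous post.</p>

```cpp
#include <array>

// Standard D3D 8x MSAA sample offsets in 1/16 pixel units, indexed by sample
// number. This pattern is an assumption; check what your API actually uses.
struct Offset { int x, y; };
constexpr std::array<Offset, 8> kPattern8x = {{
    { 1, -3}, {-1,  3}, { 5,  1}, {-3, -5},
    {-5,  5}, {-7, -1}, { 3,  7}, { 7, -7}
}};

// Quadrant index: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
constexpr int QuadrantOf(Offset o) {
    return (o.y > 0 ? 2 : 0) + (o.x > 0 ? 1 : 0);
}

// Naive 4x-area resolve: each output pixel is the mean of the samples that
// land in its quadrant (two per quadrant with the pattern above).
std::array<float, 4> NaiveQuadrantResolve(const std::array<float, 8>& samples) {
    std::array<float, 4> sum = {0.0f, 0.0f, 0.0f, 0.0f};
    std::array<int, 4> count = {0, 0, 0, 0};
    for (int i = 0; i < 8; ++i) {
        const int q = QuadrantOf(kPattern8x[i]);
        sum[q] += samples[i];
        count[q] += 1;
    }
    for (int q = 0; q < 4; ++q) {
        sum[q] /= static_cast<float>(count[q]);
    }
    return sum;
}
```

<p>With only two samples per output pixel there is no information about where an edge crosses the quadrant, which is where the saw-tooth pattern comes from.</p>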

<p>The source pixel has 8 MSAA sample points, which sit on an 8x8 grid. When we split it into 4 output pixels, each output pixel
covers a 4x4 block of that grid, with 2 known and 14 unknown points. The pattern looks like this:</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/8x-pattern.png" /></div>

<p>And then the entire tile grid with 1 ring of neighbor source pixels:</p>

<div style="text-align:center;"><img width="800" src="/images/2025_05_05_super_msaa/8x-tiled.png" /></div>

<p>Now we have to manually write out all the intersections. I considered writing a script to do this, but
it was easier just to do it by hand.</p>
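
<p>For anyone who would rather script it, a generator is not much code. The sketch below is plain C++ (not part of the shader) and leans on two assumptions: the standard D3D 8x sample positions, and the fact that this pattern has exactly one sample per grid row and per grid column. The helpers Name and EmitGridLine and the tables kSampleX and kSampleY are hypothetical names. Known sample points are emitted as a plain assignment, the way the 4x listing does.</p>

```cpp
#include <cstdio>
#include <string>

// Standard D3D 8x pattern on an 8x8 grid (assumed): kSampleX[y] is the column
// of the single sample on row y, kSampleY[x] is the row of the single sample
// on column x.
constexpr int kSampleX[8] = {7, 2, 4, 0, 6, 3, 1, 5};
constexpr int kSampleY[8] = {3, 6, 1, 5, 2, 7, 4, 0};

// Builds names like "color__8__7" or "grid_10_12": each index is padded to
// two characters with '_' to match the hand-written listing.
std::string Name(const char* prefix, int y, int x) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%s_%2d_%2d", prefix, y, x);
    std::string s(buf);
    for (char& c : s) {
        if (c == ' ') c = '_';
    }
    return s;
}

// Emits the shader line for grid point (y, x), both in 8..15, i.e. inside the
// center source pixel of the 3x3 tile neighborhood.
std::string EmitGridLine(int y, int x) {
    const int sx = 8 + kSampleX[y - 8];  // column of the known sample on this row
    const int sy = 8 + kSampleY[x - 8];  // row of the known sample on this column
    const std::string lhs = "float3 " + Name("grid", y, x) + " = ";
    if (x == sx) {
        // The grid point coincides with a sample, so just copy it.
        return lhs + Name("color", y, x) + ";";
    }
    // Known samples that bracket the point on its row (neighbor pixels are 8 apart)...
    const int left  = (x < sx) ? sx - 8 : sx;
    const int right = (x < sx) ? sx : sx + 8;
    // ...and on its column.
    const int up    = (y < sy) ? sy - 8 : sy;
    const int down  = (y < sy) ? sy : sy + 8;
    return lhs + "CalcDiamondAbsDiff(" +
           Name("color", y, left) + "," + Name("color", y, right) + "," +
           Name("color", up, x) + "," + Name("color", down, x) + ");";
}

// Prints all 64 grid lines for the center pixel.
void PrintAllGridLines() {
    for (int y = 8; y < 16; ++y) {
        for (int x = 8; x < 16; ++x) {
            std::printf("%s\n", EmitGridLine(y, x).c_str());
        }
    }
}
```

<p>The same idea works for the 4x case by swapping in a 4-entry table and an offset of 4 instead of 8.</p>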

<p>First up are the source points.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="n">color__0_15</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">7</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__1_10</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__2_12</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__3__8</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">5</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__4_14</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__5_11</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__6__9</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">4</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__7_13</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY0</span><span class="p">),</span><span class="mi">6</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>

<span class="n">float3</span> <span class="n">color__8__7</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">7</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__9__2</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_10__4</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_11__0</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">5</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_12__6</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_13__3</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_14__1</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">4</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_15__5</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX0</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">6</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__8_15</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">7</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__9_10</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_10_12</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_11__8</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">5</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_12_14</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_13_11</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_14__9</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">4</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_15_13</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">6</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__8_23</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">7</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color__9_18</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_10_20</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_11_16</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">5</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_12_22</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_13_19</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_14_17</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">4</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_15_21</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX2</span><span class="p">,</span><span class="n">halfY1</span><span class="p">),</span><span class="mi">6</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>

<span class="n">float3</span> <span class="n">color_16_15</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">7</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_17_10</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">3</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_18_12</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">0</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_19__8</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">5</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_20_14</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">2</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_21_11</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">1</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_22__9</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">4</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">color_23_13</span> <span class="o">=</span> <span class="n">texData</span><span class="p">.</span><span class="n">Load</span><span class="p">(</span><span class="n">int2</span><span class="p">(</span><span class="n">halfX1</span><span class="p">,</span><span class="n">halfY2</span><span class="p">),</span><span class="mi">6</span><span class="p">).</span><span class="n">xyz</span><span class="p">;</span></code></pre></figure>

<p>And then the crosses.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="n">grid__8__8</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__8__7</span><span class="p">,</span><span class="n">color__8_15</span><span class="p">,</span><span class="n">color__3__8</span><span class="p">,</span><span class="n">color_11__8</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__8__9</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__8__7</span><span class="p">,</span><span class="n">color__8_15</span><span class="p">,</span><span class="n">color__6__9</span><span class="p">,</span><span class="n">color_14__9</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__8_10</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__8__7</span><span class="p">,</span><span class="n">color__8_15</span><span class="p">,</span><span class="n">color__1_10</span><span class="p">,</span><span class="n">color__9_10</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__8_11</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__8__7</span><span class="p">,</span><span class="n">color__8_15</span><span class="p">,</span><span class="n">color__5_11</span><span class="p">,</span><span class="n">color_13_11</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__8_12</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__8__7</span><span class="p">,</span><span class="n">color__8_15</span><span class="p">,</span><span class="n">color__2_12</span><span class="p">,</span><span class="n">color_10_12</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__8_13</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__8__7</span><span class="p">,</span><span class="n">color__8_15</span><span class="p">,</span><span class="n">color__7_13</span><span class="p">,</span><span class="n">color_15_13</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__8_14</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color__8__7</span><span class="p">,</span><span class="n">color__8_15</span><span class="p">,</span><span class="n">color__4_14</span><span class="p">,</span><span class="n">color_12_14</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid__8_15</span> <span class="o">=</span> <span class="n">color__8_15</span><span class="p">;</span>
		
<span class="c1">// ...</span>

<span class="n">float3</span> <span class="n">grid_15__8</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color_15__5</span><span class="p">,</span><span class="n">color_15_13</span><span class="p">,</span><span class="n">color_11__8</span><span class="p">,</span><span class="n">color_19__8</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid_15__9</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color_15__5</span><span class="p">,</span><span class="n">color_15_13</span><span class="p">,</span><span class="n">color_14__9</span><span class="p">,</span><span class="n">color_22__9</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid_15_10</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color_15__5</span><span class="p">,</span><span class="n">color_15_13</span><span class="p">,</span><span class="n">color__9_10</span><span class="p">,</span><span class="n">color_17_10</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid_15_11</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color_15__5</span><span class="p">,</span><span class="n">color_15_13</span><span class="p">,</span><span class="n">color_13_11</span><span class="p">,</span><span class="n">color_21_11</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid_15_12</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color_15__5</span><span class="p">,</span><span class="n">color_15_13</span><span class="p">,</span><span class="n">color_10_12</span><span class="p">,</span><span class="n">color_18_12</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid_15_13</span> <span class="o">=</span> <span class="n">color_15_13</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">grid_15_14</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color_15_13</span><span class="p">,</span><span class="n">color_15_21</span><span class="p">,</span><span class="n">color_12_14</span><span class="p">,</span><span class="n">color_20_14</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grid_15_15</span> <span class="o">=</span> <span class="n">CalcDiamondAbsDiff</span><span class="p">(</span><span class="n">color_15_13</span><span class="p">,</span><span class="n">color_15_21</span><span class="p">,</span><span class="n">color__8_15</span><span class="p">,</span><span class="n">color_16_15</span><span class="p">);</span></code></pre></figure>

<p>And then summing up each of our 4 output pixels.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="n">dst00</span> <span class="o">=</span> <span class="p">(</span><span class="n">grid__8__8</span> <span class="o">+</span> <span class="n">grid__8__9</span> <span class="o">+</span> <span class="n">grid__8_10</span> <span class="o">+</span> <span class="n">grid__8_11</span> <span class="o">+</span>
                <span class="n">grid__9__8</span> <span class="o">+</span> <span class="n">grid__9__9</span> <span class="o">+</span> <span class="n">grid__9_10</span> <span class="o">+</span> <span class="n">grid__9_11</span> <span class="o">+</span>
                <span class="n">grid_10__8</span> <span class="o">+</span> <span class="n">grid_10__9</span> <span class="o">+</span> <span class="n">grid_10_10</span> <span class="o">+</span> <span class="n">grid_10_11</span> <span class="o">+</span>
                <span class="n">grid_11__8</span> <span class="o">+</span> <span class="n">grid_11__9</span> <span class="o">+</span> <span class="n">grid_11_10</span> <span class="o">+</span> <span class="n">grid_11_11</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">/</span><span class="mf">16.0</span><span class="n">f</span><span class="p">);</span>							 

<span class="n">float3</span> <span class="n">dst01</span> <span class="o">=</span> <span class="p">(</span><span class="n">grid__8_12</span> <span class="o">+</span> <span class="n">grid__8_13</span> <span class="o">+</span> <span class="n">grid__8_14</span> <span class="o">+</span> <span class="n">grid__8_15</span> <span class="o">+</span>
                <span class="n">grid__9_12</span> <span class="o">+</span> <span class="n">grid__9_13</span> <span class="o">+</span> <span class="n">grid__9_14</span> <span class="o">+</span> <span class="n">grid__9_15</span> <span class="o">+</span>
                <span class="n">grid_10_12</span> <span class="o">+</span> <span class="n">grid_10_13</span> <span class="o">+</span> <span class="n">grid_10_14</span> <span class="o">+</span> <span class="n">grid_10_15</span> <span class="o">+</span>
                <span class="n">grid_11_12</span> <span class="o">+</span> <span class="n">grid_11_13</span> <span class="o">+</span> <span class="n">grid_11_14</span> <span class="o">+</span> <span class="n">grid_11_15</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">/</span><span class="mf">16.0</span><span class="n">f</span><span class="p">);</span>							 

<span class="n">float3</span> <span class="n">dst10</span> <span class="o">=</span> <span class="p">(</span><span class="n">grid_12__8</span> <span class="o">+</span> <span class="n">grid_12__9</span> <span class="o">+</span> <span class="n">grid_12_10</span> <span class="o">+</span> <span class="n">grid_12_11</span> <span class="o">+</span>
                <span class="n">grid_13__8</span> <span class="o">+</span> <span class="n">grid_13__9</span> <span class="o">+</span> <span class="n">grid_13_10</span> <span class="o">+</span> <span class="n">grid_13_11</span> <span class="o">+</span>
                <span class="n">grid_14__8</span> <span class="o">+</span> <span class="n">grid_14__9</span> <span class="o">+</span> <span class="n">grid_14_10</span> <span class="o">+</span> <span class="n">grid_14_11</span> <span class="o">+</span>
                <span class="n">grid_15__8</span> <span class="o">+</span> <span class="n">grid_15__9</span> <span class="o">+</span> <span class="n">grid_15_10</span> <span class="o">+</span> <span class="n">grid_15_11</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">/</span><span class="mf">16.0</span><span class="n">f</span><span class="p">);</span>							 

<span class="n">float3</span> <span class="n">dst11</span> <span class="o">=</span> <span class="p">(</span><span class="n">grid_12_12</span> <span class="o">+</span> <span class="n">grid_12_13</span> <span class="o">+</span> <span class="n">grid_12_14</span> <span class="o">+</span> <span class="n">grid_12_15</span> <span class="o">+</span>
                <span class="n">grid_13_12</span> <span class="o">+</span> <span class="n">grid_13_13</span> <span class="o">+</span> <span class="n">grid_13_14</span> <span class="o">+</span> <span class="n">grid_13_15</span> <span class="o">+</span>
                <span class="n">grid_14_12</span> <span class="o">+</span> <span class="n">grid_14_13</span> <span class="o">+</span> <span class="n">grid_14_14</span> <span class="o">+</span> <span class="n">grid_14_15</span> <span class="o">+</span>
                <span class="n">grid_15_12</span> <span class="o">+</span> <span class="n">grid_15_13</span> <span class="o">+</span> <span class="n">grid_15_14</span> <span class="o">+</span> <span class="n">grid_15_15</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">/</span><span class="mf">16.0</span><span class="n">f</span><span class="p">);</span>							 </code></pre></figure>

<p>And that ends up working pretty well.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/sample-8x-compare.png" /></div>

<p><i>An 8x MSAA image with naive resolve (left) and the upsampled resolve (right).</i></p>

<p>There is one other thing we can do. We do not necessarily need to use all 16 points for each output pixel. Rather, we
can get 4 gradations for each output pixel with only 4 positions as long as these positions solve the N rooks problem.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/8x-n-rooks-all.png" /></div>

<p>And we can do this by simply swapping out the final pixel evaluation code. Instead of calculating 16 points
per pixel, we can get away with only 4. We just keep the two black and two blue points while skipping the 12 red points.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="n">dst00</span> <span class="o">=</span> <span class="p">(</span><span class="n">grid__8__9</span> <span class="o">+</span> <span class="n">grid__9_10</span> <span class="o">+</span> <span class="n">grid_10_11</span> <span class="o">+</span> <span class="n">grid_11__8</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">/</span><span class="mf">4.0</span><span class="n">f</span><span class="p">);</span>							 
<span class="n">float3</span> <span class="n">dst01</span> <span class="o">=</span> <span class="p">(</span><span class="n">grid__8_15</span> <span class="o">+</span> <span class="n">grid__9_14</span> <span class="o">+</span> <span class="n">grid_10_12</span> <span class="o">+</span> <span class="n">grid_11_13</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">/</span><span class="mf">4.0</span><span class="n">f</span><span class="p">);</span>							 
<span class="n">float3</span> <span class="n">dst10</span> <span class="o">=</span> <span class="p">(</span><span class="n">grid_12_10</span> <span class="o">+</span> <span class="n">grid_13_11</span> <span class="o">+</span> <span class="n">grid_14__9</span> <span class="o">+</span> <span class="n">grid_15__8</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">/</span><span class="mf">4.0</span><span class="n">f</span><span class="p">);</span>							 
<span class="n">float3</span> <span class="n">dst11</span> <span class="o">=</span> <span class="p">(</span><span class="n">grid_12_14</span> <span class="o">+</span> <span class="n">grid_13_12</span> <span class="o">+</span> <span class="n">grid_14_15</span> <span class="o">+</span> <span class="n">grid_15_13</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">/</span><span class="mf">4.0</span><span class="n">f</span><span class="p">);</span>							 </code></pre></figure>

<p>Here is a comparison between the two approaches, and there is minimal difference in quality.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/sample-8x-compare-4-vs-16.png" /></div>

<p><i>8x upsampling resolve using all 16 samples per output pixel (left) vs using only 4 samples per output pixel (right).</i></p>

<p>If you look <i>really</i> closely you can see a minor difference, but in general picking “4 rooks” looks very close to brute forcing
all 16 samples.</p>
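<p>As a quick sanity check, the four-point selections in the listing above can be tested against the N rooks condition: within each 4x4 block, no two chosen points may share a row or a column. The sketch below is plain C++ rather than shader code, and the names <strong>GridPoint</strong>, <strong>IsFourRooks</strong>, and <strong>VerifyFourRookSelections</strong> are hypothetical helpers that are not part of the reference implementation.</p>

```cpp
#include <array>
#include <cassert>

// One grid position: the two indices from a name such as grid__8__9.
struct GridPoint { int a; int b; };

// Returns true if the four points occupy four distinct rows and four
// distinct columns of the 4x4 block that starts at index (a0, b0).
inline bool IsFourRooks(const std::array<GridPoint, 4>& pts, int a0, int b0)
{
    bool usedA[4] = { false, false, false, false };
    bool usedB[4] = { false, false, false, false };
    for (const GridPoint& p : pts)
    {
        const int da = p.a - a0;
        const int db = p.b - b0;
        if (da < 0 || da > 3 || db < 0 || db > 3)
            return false; // point lies outside this output pixel's block
        if (usedA[da] || usedB[db])
            return false; // two rooks share a row or a column
        usedA[da] = true;
        usedB[db] = true;
    }
    return true;
}

// The four selections, copied from the 4-point resolve listing.
const std::array<GridPoint, 4> kDst00 = {{ { 8,  9}, { 9, 10}, {10, 11}, {11,  8} }};
const std::array<GridPoint, 4> kDst01 = {{ { 8, 15}, { 9, 14}, {10, 12}, {11, 13} }};
const std::array<GridPoint, 4> kDst10 = {{ {12, 10}, {13, 11}, {14,  9}, {15,  8} }};
const std::array<GridPoint, 4> kDst11 = {{ {12, 14}, {13, 12}, {14, 15}, {15, 13} }};

// Asserts that every output pixel's selection satisfies the condition.
inline void VerifyFourRookSelections()
{
    assert(IsFourRooks(kDst00,  8,  8));
    assert(IsFourRooks(kDst01,  8, 12));
    assert(IsFourRooks(kDst10, 12,  8));
    assert(IsFourRooks(kDst11, 12, 12));
}
```

<p>All four selections pass the check, which is consistent with each output pixel keeping 4 gradations even though 12 of its 16 grid points are skipped.</p>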

<p><strong>Image Quality</strong></p>

<p>For reference, here is a comparison of all 3 multisample levels.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/full-comparison.png" /></div>

<p><i>A comparison of 2x (top row), 4x (middle row), and 8x (bottom row) MSAA. In each row, the left side shows the naive resolve and the right side shows the upsampled variation. The 8x upsample uses the 4 rooks approximation.</i></p>

<p>The first obvious thing to note is that upsampling with this method provides no benefit to non-edges. That is because, away from edges, all samples within the
same pixel will have the same color. It is possible to use a filter for a mild improvement but that was not done here. How do the long edges look?</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/long-edges.png" /></div>

<p><i>A comparison of 2x (top row), 4x (middle row), and 8x (bottom row) MSAA. In each row, the left side shows the naive resolve and the right side shows the upsampled variation.</i></p>

<p>In general, the edges look exactly as we would want them to.
We would expect a native 4x MSAA image after upsampling to have 2 gradations on the output pixels. Similarly, we would expect the native 8x MSAA image
to have 4 gradations in the output pixels. It turns out that both work exactly as we would hope. Let us take a look at another region.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/sample-wavy-edges.png" /></div>

<p><i>A comparison of 2x (top row), 4x (middle row), and 8x (bottom row) MSAA. In each row, the left column shows the naive resolve and the middle column shows the upsampled variation. The right column is the same as middle but with additional markup.</i></p>

<p>In the 2x image you can clearly see the “wavy” nature of near-45 degree edges. You can see a similar effect in the 4x upsampled image but it is mostly removed once you go to 8x (although if you look really closely you can see slight bending). There are some ways this could be fixed. In particular, it should be possible to write diagonal detection similar to SMAA [10] or MLAA [15], but that was outside the scope of this test.
As another option, the Decima Engine [3] actually used FXAA [11] on a rotated diagonal checkerboard image to fix a similar artifact. And if you have not seen that presentation before,
it is worth reading through the slides just for the checkerboard tangram trick.</p>

<p>Finally, let us look at some thin lines. While classic sponza does not have any thin lines in it, at the moment alpha testing is broken in this scene (please do not judge me!) so the chains
form a beautiful orange line which is perfect for testing.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/thin-lines-compare.png" /></div>

<p><i>A comparison of 2x (top row), 4x (middle row), and 8x (bottom row) MSAA. In each row, the left side shows the naive resolve and the right side shows the upsampled variation.</i></p>

<p>In this shot, the line on the left is slightly wider than one native pixel, whereas the line on the right is slightly thinner than one native pixel. The upsampling algorithms do
a reasonable job with both lines, although we do end up with some jagged teeth. There is a slight gradient in the thin line, but the background is pure black,
so the interpolation is choosing the horizontal black gradient which creates little gaps in the line. One way we could address this is using depth, so that if two gradients are very close,
we choose the one that is closer to the camera. But this is a problem for another day.</p>

<p>In comparison, this is a full failure case for TAA. Since there are frames where sections of the
right line are completely missing, TAA fails to reconstruct the line here. You have probably seen this problem before with thin power lines, fences, and tree branches in the
distance.</p>

<div style="text-align:center;"><img src="/images/2025_05_05_super_msaa/thin-lines-taa.png" /></div>

<p><i>Thin lines using TAA for reconstruction.</i></p>
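<p>The depth-based tie-break suggested above for the gaps in the thin line could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the resolve picks between a horizontal and a vertical candidate by preferring the smaller absolute color difference, and that a smaller depth value means closer to the camera. The names <strong>Candidate</strong>, <strong>Direction</strong>, and <strong>ChooseDirection</strong> are hypothetical and do not appear in the reference implementation.</p>

```cpp
#include <cmath>

// Hypothetical summary of one interpolation direction.
struct Candidate
{
    float absDiff; // absolute color difference along this direction
    float depth;   // representative depth; smaller is assumed to be closer
};

enum class Direction { Horizontal, Vertical };

// Prefers the direction with the smaller color difference. When the two
// differences are within epsilon of each other, falls back to depth and
// picks the direction whose samples are closer to the camera.
inline Direction ChooseDirection(const Candidate& h, const Candidate& v, float epsilon)
{
    if (std::fabs(h.absDiff - v.absDiff) < epsilon)
        return (h.depth <= v.depth) ? Direction::Horizontal : Direction::Vertical;

    return (h.absDiff < v.absDiff) ? Direction::Horizontal : Direction::Vertical;
}
```

<p>With this kind of rule, a nearly flat black background would no longer win over the line purely because its gradient is marginally smaller. The epsilon threshold would need tuning per scene.</p>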

<p>In terms of performance, all timings were measured on my RTX 3070, at 1080p (upsampling to 4k). The code is completely unoptimized, and I would expect
significant gains by putting some optimization effort into it. Additionally, the cost would change depending on triangle density, as increased triangle density would mean
reading from more image planes in the MSAA target on PC platforms.</p>

<table border="1" cellspacing="0" cellpadding="10" align="center">
  <tr align="center">
     <th>MSAA Level</th><th>Upsample Time (in microseconds)</th>
  </tr>
  <tr align="center">
     <td>2x</td><td>118</td>
  </tr>
  <tr align="center">
     <td>4x</td><td>283</td>
  </tr>
  <tr align="center">
     <td>8x</td><td>731</td>
  </tr>
</table>

<p>That being said, the numbers are not particularly meaningful and are just included as a rough starting point. There are many obvious optimizations to make, but the approach for optimization will
depend significantly on your platform and use case. Is your platform a mobile TBDR device or a desktop GPU? Are you tonemapping during MSAA resolve? What about depth of field and motion blur? And of course,
do you also want to apply a convolution (such as an approximate Lanczos filter)? These numbers are definitely slower than I would like, but
there is ample room for improvement depending on the specific use case. Also, there are many options for improving the quality. Temporal
reprojection, jittered sampling, and a better filter kernel come to mind just to name a few.</p>

<p>Now, the main problem with this algorithm is that it is only applicable if you are using MSAA rendering. And even then, if you have operations
that run in between the main color pass and the final output pass (such as depth of field, distortion, motion blur, bloom, etc) then there
are other non-trivial problems to solve. It really is <i><strong>vastly</strong></i> easier to just render everything at 1x with temporal jitter and use TAA [12], variants of TAA [2], or one of the
many upsampling algorithms (DLSS [16], FSR [4], XeSS [13], TAAU [5], GSR [14], ASR [6], etc). But if you do happen to have an MSAA buffer just sitting there in your frame, performing
a direct 4x area scale might be a compelling option for you.</p>

<p><strong>Source Code:</strong>
For a reference implementation, I took these functions and put them into a standalone file with an MIT license. You will have to make minor modifications to get it to
work in your codebase (as it was pseudo-ripped out of my larger codebase). But hopefully it can get you started.</p>

<p><a href="/downloads/2025_05_05_msaa_upsampling/UpsamplingViaMultisampling.hlsl">UpsamplingViaMultisampling.hlsl</a></p>

<p><strong>References:</strong></p>

<p>[1] 4K Checkerboard in Battlefield 1 and Mass Effect Andromeda, Graham Wihlidal (<a href="https://www.gdcvault.com/play/1024709/4K-Checkerboard-in-Battlefield-1">https://www.gdcvault.com/play/1024709/4K-Checkerboard-in-Battlefield-1</a>)</p>

<p>[2] A Survey of Temporal Antialiasing Techniques. Lei Yang, Shiqiu Liu, and Marco Salvi. (<a href="http://behindthepixels.io/assets/files/TemporalAA.pdf">http://behindthepixels.io/assets/files/TemporalAA.pdf</a>)</p>

<p>[3] Advances in Lighting And AA, Giliam de Carpentier and Kohei Ishiyama. (<a href="https://www.guerrilla-games.com/media/News/Files/DecimaSiggraph2017.pdf">https://www.guerrilla-games.com/media/News/Files/DecimaSiggraph2017.pdf</a>)</p>

<p>[4] AMD FidelityFX, Super Resolution. AMD Inc. (<a href="https://www.amd.com/en/technologies/radeon-software-fidelityfx-super-resolution">https://www.amd.com/en/technologies/radeon-software-fidelityfx-super-resolution</a>)</p>

<p>[5] Anti-Aliasing and Upscaling, Epic (<a href="https://dev.epicgames.com/documentation/en-us/unreal-engine/anti-aliasing-and-upscaling-in-unreal-engine">https://dev.epicgames.com/documentation/en-us/unreal-engine/anti-aliasing-and-upscaling-in-unreal-engine</a>)</p>

<p>[6] Arm Accuracy Super Resolution, ARM (<a href="https://github.com/arm/accuracy-super-resolution">https://github.com/arm/accuracy-super-resolution</a>)</p>

<p>[7] D3D11_STANDARD_MULTISAMPLE_QUALITY_LEVELS enumeration (d3d11.h). Microsoft, Inc. (<a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ne-d3d11-d3d11_standard_multisample_quality_levels">https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ne-d3d11-d3d11_standard_multisample_quality_levels</a>)</p>

<p>[8] dcraw.c, Dave Coffin (<a href="https://www.dechifro.org/dcraw/">https://www.dechifro.org/dcraw/</a>)</p>

<p>[9] Experimenting with Reconstruction Filters for MSAA Resolve. Matt Pettineo. (<a href="https://therealmjp.github.io/posts/msaa-resolve-filters/">https://therealmjp.github.io/posts/msaa-resolve-filters/</a>)</p>

<p>[10] Filmic SMAA: Sharp Morphological and Temporal Antialiasing, Jorge Jimenez (<a href="https://research.activision.com/publications/archives/filmic-smaasharp-morphological-and-temporal-antialiasing">https://research.activision.com/publications/archives/filmic-smaasharp-morphological-and-temporal-antialiasing</a>)</p>

<p>[11] FXAA, Timothy Lottes (<a href="https://developer.download.nvidia.com/assets/gamedev/files/sdk/11/FXAA_WhitePaper.pdf">https://developer.download.nvidia.com/assets/gamedev/files/sdk/11/FXAA_WhitePaper.pdf</a>)</p>

<p>[12] High-Quality Temporal SuperSampling, Brian Karis (<a href="https://advances.realtimerendering.com/s2014/#_HIGH-QUALITY_TEMPORAL_SUPERSAMPLING">https://advances.realtimerendering.com/s2014/#_HIGH-QUALITY_TEMPORAL_SUPERSAMPLING</a>)</p>

<p>[13] Intel Arc Gaming Technologies, Intel (<a href="https://www.amd.com/en/products/graphics/technologies/fidelityfx/super-resolution.html">https://www.amd.com/en/products/graphics/technologies/fidelityfx/super-resolution.html</a>)</p>

<p>[14] Introducing Snapdragon Game Super Resolution, Qualcomm (<a href="https://www.qualcomm.com/news/onq/2023/04/introducing-snapdragon-game-super-resolution">https://www.qualcomm.com/news/onq/2023/04/introducing-snapdragon-game-super-resolution</a>)</p>

<p>[15] Morphological Antialiasing, Alexander Reshetov (<a href="https://www.intel.com/content/dam/develop/external/us/en/documents/z-shape-arm-785403.pdf">https://www.intel.com/content/dam/develop/external/us/en/documents/z-shape-arm-785403.pdf</a>)</p>

<p>[16] NVIDIA DLSS. NVIDIA Inc. (<a href="https://www.nvidia.com/en-us/geforce/technologies/dlss/">https://www.nvidia.com/en-us/geforce/technologies/dlss/</a>)</p>

<p>[17] Rendering Tom Clancy’s Rainbow Six Siege, Jalal El Mansouri. (<a href="https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdc2016/Presentations/El_Mansouri_Jalal_Rendering_Rainbow_Six.pdf">https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdc2016/Presentations/El_Mansouri_Jalal_Rendering_Rainbow_Six.pdf</a>)</p>

<p>[18] Variable Number of Gradients (<a href="https://web.archive.org/web/20120422035609/http://scien.stanford.edu/pages/labsite/1999/psych221/projects/99/tingchen/main.htm">https://web.archive.org/web/20120422035609/http://scien.stanford.edu/pages/labsite/1999/psych221/projects/99/tingchen/main.htm</a>)</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Can we use multisampling effectively for upsampling? This has been a question in the back of my mind for give or take 10+ years.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222025_05_05_super_msaa/sample-comparison-header.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222025_05_05_super_msaa/sample-comparison-header.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Solving Blendshapes for ARKit</title><link href="https://filmicworlds.com/blog/solving-face-scans-for-arkit/" rel="alternate" type="text/html" title="Solving Blendshapes for ARKit" /><published>2021-11-14T00:00:00+00:00</published><updated>2021-11-14T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/solving-face-scans-for-arkit</id><content type="html" xml:base="https://filmicworlds.com/blog/solving-face-scans-for-arkit/"><![CDATA[<p>As you have probably seen, ARKit has become quite popular for facial animation. It has obvious appeal for mocap on a budget, but it’s also being used on
professional productions. So I decided to do a test, and here it is:</p>

<iframe src="https://player.vimeo.com/video/650749692" width="800" height="600" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe>

<p>The workflow is very compelling. Essentially, you capture, the data is solved to blendshape weights on the fly…and that’s it. The advantage is the ease of use.
You can simply press record and you get a stream of blendshape weights. It works live too, as popularized by Cory Strassburger at <a href="https://www.youtube.com/watch?v=lXZhgkNFGfM">Siggraph Realtime Live in 2018</a>.
The animation works well with standard tools (it’s just a set of blendshape weights), and there is an excellent headcam for it by Standard Deviation
(<a href="https://sdeviation.com/iphone-hmc/">sdeviation.com/iphone-hmc/</a>).</p>

<p>The hard part is creating a face rig to be driven by this animation data. In theory, the workflow is very simple. ARKit uses a set of 52 standard shapes, so all you
have to do is author those blendshapes, hook up the weights, and then you’re done. The common approach is to scan an actor, clean the shapes, and you’re all set.
For example, since ARKit has shapes for smile, frown, etc. you can scan those shapes, put them in a rig, and drive them with the blendshape animation. Apple documentation
has a list of the shapes here:
<a href="https://developer.apple.com/documentation/ARKit/arfaceanchor/blendshapelocation">ARKit Blendshapes</a></p>

<p>The problem is I’ve never managed to get good results out of it in the past. I tried FaceShift even back before they were purchased by Apple, but it never really worked for me. And even
more concerning, I didn’t really understand why. To investigate this, I put together a photogrammetry setup in my apartment. Covid has made us all do strange things, and I built this scanning rig. My
friends got a dog.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_11_26_arkit/camera_setup_med.jpg" /></div>

<p>For this test, Colin Zargarpour graciously agreed to be the test subject. While I was putting this setup together, I asked Habib Zargarpour if he knew any
actors/models who would be up for this, and he said “Well, my son Colin is in town for a few more days”. At the time, I was actually about a week away from being ready to shoot,
so I said “sure, why not”. When has lack of preparation ever caused problems?</p>

<div style="text-align:center;"><img width="800" src="/images/2021_11_26_arkit/shoot_image_00.jpg" /></div>

<p>So we did the shoot, and we ran into a few issues, but got it done. I definitely owe Colin a beer for his patience with my very out-of-practice eyeliner skills.</p>

<div style="text-align:center;"><img width="500" src="/images/2021_11_26_arkit/shoot_image_01.jpg" /></div>

<p>The actual scan setup isn’t particularly interesting. I’m using 20x Rebel SL3s controlled by Raspberry Pis. Since I don’t have an extra room, it was a hard
constraint that the entire setup must fit on a desk. In the past I have had photogrammetry setups scattered about my living room floor for months on end but…no, not doing that again.
So while you would normally use 50mm fixed lenses, these cameras have 35mm fixed lenses to keep the setup compact.</p>

<p>I’ve scanned quite a few shapes over the years, and tried to use ARKit (or FaceShift before they were purchased) several times, with different scans. Each time I try it…it looks terrible. The animation is unstable, jittery, and just looks
awful. So, what am I missing? Turns out, a lot. After looking at the data, it started to make sense. Take a look at this shot, with Colin making a lip upper up expression. He is neutral, relaxed, but only moving up the upper lip.</p>

<div style="text-align:center;"><img width="500" src="/images/2021_11_26_arkit/proc_0069_0009.jpg" /></div>

<p>ARKit has shapes specifically for this motion, <strong>mouthUpperUp_L</strong> and <strong>mouthUpperUp_R</strong>. So in theory, we should take this finished scan and split it into
a left and right blendshape. Then we can do this for all the shapes in the blendshape list, and we should have a rig that can be effectively driven by ARKit. However, take a closer look at the scanning rig: You might have missed something.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_11_26_arkit/camera_setup_med_circle.jpg" /></div>

<p>Yep, that’s an iPhone. During the shoot, the iPhone was streaming the live ARKit shapes (using <a href="https://www.bannaflak.com/face-cap/">FaceCap</a>). Then right before each capture I recorded the ARKit weights so that I have a set of
ARKit blendshape weights for every scan shot.</p>

<p>Looking at the reference, Colin is making a lip upper up. However, ARKit sees something different. Skipping ahead for a moment, the final ARKit shape looks like this:</p>

<div style="text-align:center;"><img width="500" src="/images/2021_11_26_arkit/lip_shape_00_full.jpg" /></div>

<p>ARKit does not see a single lip upper up shape. Rather, it only sees a 47.1% <strong>mouthUpperUp_L</strong> and a 49.7% <strong>mouthUpperUp_R</strong>. Those shapes look like
this after processing.</p>

<div style="text-align:center;"><img width="500" src="/images/2021_11_26_arkit/lip_shape_02_upper_up.jpg" /></div>

<p>ARKit also finds a bunch of other small blendshapes. For example, it sees a 15.0% <strong>mouthFunnel</strong>, a 22.1% <strong>mouthSmile_R</strong>, and a 25.2% <strong>mouthLowerDown_L</strong>.
All the extra shapes add up to this:</p>

<div style="text-align:center;"><img width="500" src="/images/2021_11_26_arkit/lip_shape_03_other.jpg" /></div>

<p>And for reference, here is the neutral:</p>

<div style="text-align:center;"><img width="500" src="/images/2021_11_26_arkit/lip_shape_01_neutral.jpg" /></div>

<p>Under the hood, ARKit has a very specific meaning for each of the shapes. The solver on the iPhone expects the combinations of shapes to work a certain way. And there
are many circumstances where two shapes are activated such that they counteract each other. ARKit uses combinations of these shapes to make meaningful movements to the
rig, but if the small movements in your rig don’t match the small movements in the internal solver then the resulting animation falls apart. And that’s why my previous tests
of ARKit failed…my blendshape rigs weren’t respecting the combinations of the underlying animation solver. If you take the entire
scanned lip upper up and use that as your <strong>mouthUpperUp_L</strong> and <strong>mouthUpperUp_R</strong> blendshapes, then these shapes will contain a bunch of movement that ARKit doesn’t
expect to be in those shapes. And that explains why I could never get good results.</p>

<p>To me, that is the most difficult part of working with ARKit shapes. If you create a custom shape set, and those shapes don’t counteract each other properly,
you will end up with weird motions. So where is the list of all the dependencies between the shapes to ensure that an ARKit blendshape rig moves smoothly?
Where is the guide that says “shape A should have as much upward movement
as shape B, but it should move to the left 50% as much as shape C”. There isn’t one.</p>

<p>However, we do have indirect knowledge. We know what the base shapes should look like. And given an expression, we know what the weights are. So we can try to
solve the underlying shapes indirectly. Rather than cleaning up the final blendshapes by hand, we can choose some constraints and find the best fit shapes
that fit our constraints, and hopefully we end up with good results.</p>

<p><strong>Data Processing</strong></p>

<p>The actual scan capture is pretty standard. The images were processed to scans using a custom pipeline I threw together. I’ve found that if you have poorly
captured data, then some programs work and others don’t. In particular, Photoscan is very good at salvaging poorly shot data, not that I would know from
experience with my own shoots (cough). But if you have clean images with good sync and coverage, Photoscan, RealityCapture, and AliceVision
will all do a great job. For this one, I actually used Colmap because of how well the command-line workflow integrates with custom C++ coding, but any
program is fine.</p>

<p>For processing the scans, I use WrapX because WrapX is awesome. I put together some flows and contracted <a href="https://www.centaurdigital.com/">Centaur Digital</a> to perform the actual alignment. I have a few
custom steps involving C++ coding, like using the ArUco markers for the initial rigid alignment. But I’ve found the cheapest, easiest way to process scans in bulk is to
essentially brute-force it with WrapX. After processing this data, I had a set of clean scans.</p>

<p><strong>The ARKit Shape Solver</strong></p>

<p>Since capturing the ARKit shapes directly isn’t practical, let’s instead write a solver. In this case, as input we have the 102 expressions. In the data set
you can download, the file <strong>colin_proc_rig.fbx</strong> has the final cleaned up shapes from Centaur Digital. And we know how each
expression maps to the 52 blendshape weights (which is in the json folder). Thus, we will find the 52 blendshapes that are the best fit for the 102 expressions.</p>

<p>To explain it better, let’s work in rounder numbers. Suppose we have 100 scanned expressions and 50 blendshapes to solve (instead of 102 and 52). Additionally,
our topology has about 6,000 verts. How would we set up a solver to do it?</p>

<p>Since X, Y, and Z are independent, we can solve each channel individually. We have 6,000 verts in the base mesh, and a full ARKit rig needs 50 blendshapes. So we 
are essentially solving for a vector of 6,000 * 50 = 300,000 variables.</p>
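<p>To make the size of the problem concrete, here is a minimal sketch of how the unknown vector for one channel could be laid out and evaluated. The layout (all verts of shape 0, then all verts of shape 1, and so on), the choice to treat the unknowns as deltas from the neutral, and the names <strong>UnknownIndex</strong>, <strong>EvaluateVertex</strong>, and <strong>VerifyUnknownLayout</strong> are assumptions for illustration. The actual solver may order its variables differently.</p>

```cpp
#include <cassert>
#include <vector>

// Assumed layout of the unknown vector for one channel (X, Y, or Z):
// all verts of shape 0, then all verts of shape 1, and so on.
inline int UnknownIndex(int shape, int vert, int numVerts)
{
    return shape * numVerts + vert;
}

// Linear blendshape model for one vertex and one channel:
// neutral + sum over shapes of weights[shape] * delta(shape, vert).
// Treating the unknowns as deltas from the neutral is an assumption.
inline float EvaluateVertex(const std::vector<float>& unknowns,
                            const std::vector<float>& weights,
                            float neutral, int vert, int numVerts)
{
    float result = neutral;
    const int numShapes = static_cast<int>(weights.size());
    for (int shape = 0; shape < numShapes; ++shape)
    {
        const int index = UnknownIndex(shape, vert, numVerts);
        result += weights[shape] * unknowns[index];
    }
    return result;
}

// Checks the variable count from the rounded numbers in the text.
inline void VerifyUnknownLayout()
{
    const int numVerts = 6000;
    const int numShapes = 50;
    assert(numVerts * numShapes == 300000);
    assert(UnknownIndex(0, 0, numVerts) == 0);
    assert(UnknownIndex(numShapes - 1, numVerts - 1, numVerts) == 299999);
}
```

<p>Under this model, every scanned expression contributes one linear equation per vertex per channel. With 100 expressions that is 100 * 6,000 = 600,000 equations for 300,000 unknowns, so there are more equations than unknowns.</p>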

<p>We need to describe our constraints as a single, large, sparse matrix. There are many approaches to do this, but I prefer to work with flat std::vectors.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">std</span><span class="o">::</span><span class="n">vector</span> <span class="o">&lt;</span> <span class="kt">int</span> <span class="o">&gt;</span> <span class="n">sparseR</span><span class="p">,</span> <span class="n">sparseC</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span> <span class="o">&lt;</span> <span class="kt">float</span> <span class="o">&gt;</span> <span class="n">sparseV</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span> <span class="o">&lt;</span> <span class="kt">float</span> <span class="o">&gt;</span> <span class="n">sparseB</span><span class="p">;</span></code></pre></figure>

<p>We are trying to find the best fit solution for the over-constrained system of equations: Ax=b. The matrix A is sparse. For each nonzero element in the matrix A, <strong>sparseR</strong> is the row, <strong>sparseC</strong> is the column, and <strong>sparseV</strong> is the value.
On the right side, <strong>sparseB</strong> is just b. To construct the matrix, we can add matrix elements one at a time by using the SetSparse() helper function.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="k">static</span> <span class="kt">void</span> <span class="nf">SetSparse</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span> <span class="o">&lt;</span> <span class="kt">int</span> <span class="o">&gt;</span> <span class="o">&amp;</span> <span class="n">sparseR</span><span class="p">,</span>
	<span class="n">std</span><span class="o">::</span><span class="n">vector</span> <span class="o">&lt;</span> <span class="kt">int</span> <span class="o">&gt;</span> <span class="o">&amp;</span> <span class="n">sparseC</span><span class="p">,</span>
	<span class="n">std</span><span class="o">::</span><span class="n">vector</span> <span class="o">&lt;</span> <span class="kt">float</span> <span class="o">&gt;</span> <span class="o">&amp;</span> <span class="n">sparseV</span><span class="p">,</span>
	<span class="kt">int</span> <span class="n">r</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">float</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">ASSERT_ALWAYS</span><span class="p">(</span><span class="n">sparseR</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="n">sparseC</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
	<span class="n">ASSERT_ALWAYS</span><span class="p">(</span><span class="n">sparseR</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="n">sparseV</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>

	<span class="n">sparseR</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">r</span><span class="p">);</span>
	<span class="n">sparseC</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">c</span><span class="p">);</span>
	<span class="n">sparseV</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">v</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
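
<p>To make the triplet layout concrete, here is a small sketch of how a matrix stored this way can be multiplied by a vector. <strong>MulSparse</strong> is a hypothetical helper written for illustration, not a function from the original code base.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = A*x where A is stored as (row, col, value) triplets.
// Entries that share the same (row, col) simply accumulate.
static std::vector<float> MulSparse(const std::vector<int>& sparseR,
	const std::vector<int>& sparseC,
	const std::vector<float>& sparseV,
	const std::vector<float>& x, int numRows)
{
	assert(sparseR.size() == sparseC.size());
	assert(sparseR.size() == sparseV.size());

	std::vector<float> y(numRows, 0.0f);
	for (std::size_t i = 0; i < sparseV.size(); i++)
	{
		y[sparseR[i]] += sparseV[i] * x[sparseC[i]];
	}
	return y;
}
```

<p>The order of the triplets does not matter for this product, which is why the constraints can be appended one row at a time without any sorting.</p>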

<p><strong>The Blendshape Constraint</strong></p>

<p>Our first constraint is that for each of our 100 shapes, we
know the blendshape weights. For every vertex in that shape, the sum of the known weights times the unknown shapes should equal the observed, scanned shape. We have 100 scanned shapes,
and 6,000 verts, so we have a total of 600,000 constraints. I.e. we have 600,000 rows in our matrix for this constraint.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">poseIter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">poseIter</span> <span class="o">&lt;</span> <span class="n">M</span><span class="p">;</span> <span class="n">poseIter</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">vertIter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">vertIter</span> <span class="o">&lt;</span> <span class="n">V</span><span class="p">;</span> <span class="n">vertIter</span><span class="o">++</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">faceCapIter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">faceCapIter</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">faceCapIter</span><span class="o">++</span><span class="p">)</span>
    <span class="p">{</span>
      <span class="c1">// at this pose, what is the expected weight for this shape</span>
      <span class="kt">float</span> <span class="n">faceCapW</span> <span class="o">=</span> <span class="n">faceCapWeights</span><span class="p">[</span><span class="n">poseIter</span><span class="p">][</span><span class="n">faceCapIter</span><span class="p">];</span>
      <span class="kt">float</span> <span class="n">faceCapVertW</span> <span class="o">=</span> <span class="n">capRigShapeInfl</span><span class="p">[</span><span class="n">faceCapIter</span><span class="p">][</span><span class="n">vertIter</span><span class="p">];</span>
      
      <span class="k">if</span> <span class="p">(</span><span class="n">faceCapW</span> <span class="o">&gt;=</span> <span class="mf">1e-5</span><span class="n">f</span> <span class="o">&amp;&amp;</span> <span class="n">faceCapVertW</span> <span class="o">&gt;=</span> <span class="mf">1e-4</span><span class="n">f</span><span class="p">)</span>
      <span class="p">{</span>
        <span class="kt">int</span> <span class="n">col</span> <span class="o">=</span> <span class="n">faceCapIter</span> <span class="o">*</span> <span class="n">V</span> <span class="o">+</span> <span class="n">vertIter</span><span class="p">;</span> <span class="c1">// shape iter, then vert</span>
        <span class="n">SetSparse</span><span class="p">(</span><span class="n">sparseR</span><span class="p">,</span> <span class="n">sparseC</span><span class="p">,</span> <span class="n">sparseV</span><span class="p">,</span> <span class="n">currRow</span><span class="p">,</span> <span class="n">col</span><span class="p">,</span> <span class="n">faceCapW</span><span class="p">);</span>
      <span class="p">}</span>
    <span class="p">}</span>
    
    <span class="n">Vec3</span> <span class="n">expectedP</span> <span class="o">=</span> <span class="n">fullShapeMesh</span><span class="p">.</span><span class="n">m_blendShapeData</span><span class="p">[</span><span class="n">poseIter</span><span class="p">][</span><span class="n">vertIter</span><span class="p">];</span>
    <span class="n">Vec3</span> <span class="n">baseP</span> <span class="o">=</span> <span class="n">fullShapeMesh</span><span class="p">.</span><span class="n">m_posData</span><span class="p">[</span><span class="n">vertIter</span><span class="p">];</span>
    <span class="n">Vec3</span> <span class="n">offsetP</span> <span class="o">=</span> <span class="n">expectedP</span> <span class="o">-</span> <span class="n">baseP</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">val</span> <span class="o">=</span> <span class="n">ExtractVec3</span><span class="p">(</span><span class="n">offsetP</span><span class="p">,</span> <span class="n">dim</span><span class="p">);</span>
    <span class="n">sparseB</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">val</span><span class="p">);</span>
    <span class="n">currRow</span><span class="o">++</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>One thing you will also notice in the code is faceCapVertW. The final blendshape should only affect the region touched by the initial blendshape. As a preprocess, a mask is created
for each shape describing the region of influence. This way our eye shapes don’t have movement in the lips and vice-versa.</p>

<p><strong>The Source Constraint</strong></p>

<p>Additionally, we want to make sure that our blendshapes match the original shapes. We want our smile shape to roughly match the shape of the original ARKit smile shape, etc.
So for this, we add a <i>Source</i> constraint. For each vertex in our original shape, we simply create an equation where the found shape matches the original shape. As we have 50 original
shapes, and 6,000 verts, this constraint adds 300,000 rows to our matrix.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">faceCapIter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">faceCapIter</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">faceCapIter</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">vertIter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">vertIter</span> <span class="o">&lt;</span> <span class="n">V</span><span class="p">;</span> <span class="n">vertIter</span><span class="o">++</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">Vec3</span> <span class="n">expectedP</span> <span class="o">=</span> <span class="n">capRigSrc</span><span class="p">.</span><span class="n">m_blendShapeData</span><span class="p">[</span><span class="n">faceCapIter</span><span class="p">][</span><span class="n">vertIter</span><span class="p">];</span>
    <span class="n">Vec3</span> <span class="n">baseP</span> <span class="o">=</span> <span class="n">capRigSrc</span><span class="p">.</span><span class="n">m_posData</span><span class="p">[</span><span class="n">vertIter</span><span class="p">];</span>
    <span class="n">Vec3</span> <span class="n">offsetP</span> <span class="o">=</span> <span class="n">expectedP</span> <span class="o">-</span> <span class="n">baseP</span><span class="p">;</span>
    
    <span class="kt">float</span> <span class="n">val</span> <span class="o">=</span> <span class="n">ExtractVec3</span><span class="p">(</span><span class="n">offsetP</span><span class="p">,</span> <span class="n">dim</span><span class="p">);</span>
    				
    <span class="kt">int</span> <span class="n">col</span> <span class="o">=</span> <span class="n">faceCapIter</span> <span class="o">*</span> <span class="n">V</span> <span class="o">+</span> <span class="n">vertIter</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">mask</span> <span class="o">=</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">;</span>
    
    <span class="kt">float</span> <span class="n">faceCapVertW</span> <span class="o">=</span> <span class="n">capRigShapeInfl</span><span class="p">[</span><span class="n">faceCapIter</span><span class="p">][</span><span class="n">vertIter</span><span class="p">];</span>
    
    <span class="kt">float</span> <span class="n">w</span> <span class="o">=</span> <span class="n">faceCapVertW</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">LerpFloat</span><span class="p">(</span><span class="mf">10.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">w</span><span class="p">));</span>
    <span class="n">mask</span> <span class="o">*=</span> <span class="n">scale</span><span class="p">;</span>
    
    <span class="n">SetSparse</span><span class="p">(</span><span class="n">sparseR</span><span class="p">,</span> <span class="n">sparseC</span><span class="p">,</span> <span class="n">sparseV</span><span class="p">,</span> <span class="n">currRow</span><span class="p">,</span> <span class="n">col</span><span class="p">,</span> <span class="mf">1.0</span><span class="n">f</span> <span class="o">*</span> <span class="n">mask</span><span class="p">);</span>
    
    <span class="n">sparseB</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">val</span><span class="o">*</span><span class="n">mask</span><span class="p">);</span>
    <span class="n">currRow</span><span class="o">++</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>As a tweak, this constraint has a higher influence in areas that are outside the masked area of this shape. In theory we could make the constraint infinitely strong,
but in practice large numbers start to make the solver unstable.</p>
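
<p>The weighting can be isolated into a tiny function to see what it does. This is a sketch under the assumption that <strong>LerpFloat(a, b, t)</strong> interpolates linearly from a at t=0 to b at t=1; <strong>SourceRowWeight</strong> is a hypothetical name used only here.</p>

```cpp
#include <cassert>
#include <cmath>

// Assumed semantics: linear interpolation from a (t = 0) to b (t = 1).
static float LerpFloat(float a, float b, float t)
{
	return a + (b - a) * t;
}

// Row weight for the Source constraint, given the shape's influence in [0, 1] at a vertex.
// Influence 0 (outside the mask) gives 10, influence 1 (fully inside) gives 1.
static float SourceRowWeight(float influence)
{
	assert(influence >= 0.0f && influence <= 1.0f);
	return LerpFloat(10.0f, 1.0f, std::sqrt(influence));
}
```

<p>Because both the matrix entry and the right-hand side are multiplied by this weight, the residual of that row is scaled by it, so in a least squares solve a weight of 10 penalizes the error roughly 100 times more than a weight of 1.</p>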

<p><strong>The Laplacian Constraint</strong></p>

<p>In order to get good results, we actually need one more constraint. It’s important to ensure that the curvature of our source shapes roughly matches the curvature
of our found shapes. And we can do this by constraining the laplacian.</p>

<p>In geometry, the laplacian is the offset of a vertex from the average of its neighbors.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_11_26_arkit/geo_laplacian.png" /></div>

<p>The laplacian contains quite a lot of information about a vertex. For example, if it is a regular pattern, the vertex will be near the average and the laplacian will be zero. However, most objects
do not have evenly spaced edge loops, and the laplacian can encode this information. In another case, the neighbors form an approximate plane. If this vertex is in front of the plane, then the geometry is convex.
If instead this vertex is behind the plane (on the side opposite the normal), then this vertex is in a concavity. By ensuring that our resulting shapes have the same laplacian as our original shape, we can 
roughly match the original intention of the shape.</p>
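
<p>As a rough illustration of that definition, here is a sketch of a uniform (unweighted) laplacian for a single vertex. The <strong>AdjData</strong> struct below is a hypothetical stand-in that only mirrors the member names used in this post, and <strong>CalcLaplacian</strong> is an assumed helper name.</p>

```cpp
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical adjacency layout: the neighbors of vertex v are stored in
// m_vertData starting at m_vertStart[v], with m_vertSize[v] entries.
struct AdjData
{
	std::vector<int> m_vertStart;
	std::vector<int> m_vertSize;
	std::vector<int> m_vertData;
};

// Laplacian = vertex position minus the average position of its neighbors.
static Vec3 CalcLaplacian(const std::vector<Vec3>& pos, const AdjData& adj, int vert)
{
	int numAdj = adj.m_vertSize[vert];
	assert(numAdj >= 1);
	int start = adj.m_vertStart[vert];

	Vec3 avg = { 0.0f, 0.0f, 0.0f };
	for (int i = 0; i < numAdj; i++)
	{
		const Vec3& p = pos[adj.m_vertData[start + i]];
		avg.x += p.x;
		avg.y += p.y;
		avg.z += p.z;
	}

	float inv = 1.0f / float(numAdj);
	const Vec3& c = pos[vert];
	return { c.x - avg.x * inv, c.y - avg.y * inv, c.z - avg.z * inv };
}
```

<p>For a vertex sitting one unit above four neighbors arranged evenly around it in a plane, the result is a vector of length one pointing straight out of that plane.</p>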

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">faceCapIter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">faceCapIter</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">faceCapIter</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
  <span class="kt">float</span> <span class="n">laplWeight</span> <span class="o">=</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">vertIter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">vertIter</span> <span class="o">&lt;</span> <span class="n">V</span><span class="p">;</span> <span class="n">vertIter</span><span class="o">++</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">Vec3</span> <span class="n">poseLapl</span> <span class="o">=</span> <span class="n">capRigLapl</span><span class="p">[</span><span class="n">faceCapIter</span><span class="p">][</span><span class="n">vertIter</span><span class="p">];</span>
    <span class="n">Vec3</span> <span class="n">baseLapl</span> <span class="o">=</span> <span class="n">neutralLapl</span><span class="p">[</span><span class="n">vertIter</span><span class="p">];</span>
    
    <span class="n">Vec3</span> <span class="n">offsetLapl</span> <span class="o">=</span> <span class="n">poseLapl</span> <span class="o">-</span> <span class="n">baseLapl</span><span class="p">;</span>
    
    <span class="kt">int</span> <span class="n">numAdj</span> <span class="o">=</span> <span class="n">adjData</span><span class="p">.</span><span class="n">m_vertSize</span><span class="p">[</span><span class="n">vertIter</span><span class="p">];</span>
    <span class="n">ASSERT_ALWAYS</span><span class="p">(</span><span class="n">numAdj</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">);</span>
    
    <span class="kt">float</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">SafeInv</span><span class="p">(</span><span class="kt">float</span><span class="p">(</span><span class="n">numAdj</span><span class="p">));</span>
    <span class="kt">float</span> <span class="n">expected</span> <span class="o">=</span> <span class="n">ExtractVec3</span><span class="p">(</span><span class="n">offsetLapl</span><span class="p">,</span> <span class="n">dim</span><span class="p">);</span>
    
    <span class="c1">// the actual is the current point minus the average of the neighbors</span>
    <span class="kt">int</span> <span class="n">start</span> <span class="o">=</span> <span class="n">adjData</span><span class="p">.</span><span class="n">m_vertStart</span><span class="p">[</span><span class="n">vertIter</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">col</span> <span class="o">=</span> <span class="n">faceCapIter</span> <span class="o">*</span> <span class="n">V</span> <span class="o">+</span> <span class="n">vertIter</span><span class="p">;</span>
    
    <span class="n">SetSparse</span><span class="p">(</span><span class="n">sparseR</span><span class="p">,</span> <span class="n">sparseC</span><span class="p">,</span> <span class="n">sparseV</span><span class="p">,</span> <span class="n">currRow</span><span class="p">,</span> <span class="n">col</span><span class="p">,</span> <span class="n">laplWeight</span><span class="p">);</span>
    
    <span class="kt">float</span> <span class="n">sumAvg</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">numAdj</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
    <span class="p">{</span>
      <span class="kt">int</span> <span class="n">adjVert</span> <span class="o">=</span> <span class="n">adjData</span><span class="p">.</span><span class="n">m_vertData</span><span class="p">[</span><span class="n">start</span> <span class="o">+</span> <span class="n">i</span><span class="p">];</span>
      <span class="kt">int</span> <span class="n">adjCol</span> <span class="o">=</span> <span class="n">faceCapIter</span> <span class="o">*</span> <span class="n">V</span> <span class="o">+</span> <span class="n">adjVert</span><span class="p">;</span>
      
      <span class="n">SetSparse</span><span class="p">(</span><span class="n">sparseR</span><span class="p">,</span> <span class="n">sparseC</span><span class="p">,</span> <span class="n">sparseV</span><span class="p">,</span> <span class="n">currRow</span><span class="p">,</span> <span class="n">adjCol</span><span class="p">,</span> <span class="p">(</span><span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">laplWeight</span> <span class="o">*</span> <span class="n">scale</span><span class="p">);</span>
    <span class="p">}</span>
    
    <span class="n">sparseB</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">expected</span> <span class="o">*</span> <span class="n">laplWeight</span><span class="p">);</span>
    <span class="n">currRow</span><span class="o">++</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>Finally, once this matrix is built we simply solve it and use the result. <strong>startX</strong> is the initial guess. The sparse
parameters are described above. <strong>numRows</strong> and <strong>numCols</strong> are the dimension of matrix A. And 1000 is the maximum
iteration count, although it tends to converge in about 50. The solver is quite fast (only a few seconds) so I didn’t bother with optimizing it.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">LinearAlgebraUtil</span><span class="o">::</span><span class="n">SolveConjuateGradientSparseLeastSquares</span><span class="p">(</span><span class="n">startX</span><span class="p">,</span> <span class="n">sparseR</span><span class="p">,</span> <span class="n">sparseC</span><span class="p">,</span> <span class="n">sparseV</span><span class="p">,</span> <span class="n">sparseB</span><span class="p">,</span> <span class="n">numRows</span><span class="p">,</span> <span class="n">numCols</span><span class="p">,</span> <span class="mi">1000</span><span class="p">);</span></code></pre></figure>
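
<p>The solver itself is not shown in this post. For readers who want something to experiment with, here is a minimal sketch of conjugate gradient applied to the least squares problem (often called CGLS) on the same triplet layout. It is a hypothetical stand-in, not the LinearAlgebraUtil implementation, and the names <strong>SolveCgls</strong>, <strong>Mul</strong>, and <strong>Dot</strong> are assumptions.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<float> VecF;

static float Dot(const VecF& a, const VecF& b)
{
	float sum = 0.0f;
	for (std::size_t i = 0; i < a.size(); i++) sum += a[i] * b[i];
	return sum;
}

// y = A*x (transpose == false) or y = A^T*x (transpose == true), A stored as triplets.
static VecF Mul(const std::vector<int>& R, const std::vector<int>& C, const VecF& V,
	const VecF& x, int outSize, bool transpose)
{
	VecF y(outSize, 0.0f);
	for (std::size_t i = 0; i < V.size(); i++)
	{
		if (transpose) y[C[i]] += V[i] * x[R[i]];
		else y[R[i]] += V[i] * x[C[i]];
	}
	return y;
}

// Minimizes |A*x - b|^2 starting from the initial guess x.
static VecF SolveCgls(VecF x, const std::vector<int>& R, const std::vector<int>& C,
	const VecF& V, const VecF& b, int numRows, int numCols, int maxIter)
{
	assert(int(x.size()) == numCols && int(b.size()) == numRows);

	VecF r = Mul(R, C, V, x, numRows, false);
	for (int i = 0; i < numRows; i++) r[i] = b[i] - r[i]; // residual b - A*x
	VecF s = Mul(R, C, V, r, numCols, true);               // gradient A^T * r
	VecF p = s;                                            // search direction
	float gamma = Dot(s, s);

	for (int iter = 0; iter < maxIter && gamma > 1e-12f; iter++)
	{
		VecF q = Mul(R, C, V, p, numRows, false);
		float qq = Dot(q, q);
		if (qq <= 0.0f) break;

		float alpha = gamma / qq;
		for (int i = 0; i < numCols; i++) x[i] += alpha * p[i];
		for (int i = 0; i < numRows; i++) r[i] -= alpha * q[i];

		s = Mul(R, C, V, r, numCols, true);
		float gammaNew = Dot(s, s);
		float beta = gammaNew / gamma;
		for (int i = 0; i < numCols; i++) p[i] = s[i] + beta * p[i];
		gamma = gammaNew;
	}
	return x;
}
```

<p>This version never forms the product of A transposed with A explicitly; each iteration only needs one multiply by A and one by its transpose, which keeps memory proportional to the number of nonzero entries.</p>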

<p><strong>Results</strong></p>

<p>That’s really it. What does it look like? Overall, I was pleasantly surprised.</p>

<p>Here is the original lip funnel expression, made by simply adding the offsets from the original ARKit shape to the Colin neutral scan.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_11_26_arkit/compare_00_funnel_original.jpg" /></div>

<p>It looks pretty rough. The lower lip is very thin. But applying the solver results in this shape for the lip funnel.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_11_26_arkit/compare_01_funnel_solved.jpg" /></div>

<p>The solved expressions are pretty clean. The expression matches the reference, and the lips seem to keep their volume. What I really like about this approach is
that the combinations “just work”. By running a solver on a dense data set, we end up with shapes that work well together. In particular, I’ve always had trouble making
lip shapes (like funnel, etc) work well with a jaw down expression.</p>

<p>If I had more time, the next step would be to tweak the solver for the lips and eyelids to close. In a few cases, shapes that should be closed ended up slightly open.
For example, the smile opens the lips when it really should not.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_11_26_arkit/compare_02_smile.jpg" /></div>

<p>There are several ways to do this while still using a linear solver. For each point on the lower lip, we could find a nearby point on the upper lip and add a constraint for certain
poses that the distance between those points shouldn’t change for the Y component. I would expect this to work quite well, but have not had time to test.</p>
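
<p>Here is one way such a constraint might look in the same triplet format. To be clear, this is an untested sketch of the idea described above, not something that was run for this post. The <strong>LipPair</strong> struct, the <strong>AddLipClosureRows</strong> function, and the <strong>rowWeight</strong> parameter are all hypothetical, and the rows would only be added when solving the Y channel.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pairing of a lower lip vertex with a nearby upper lip vertex.
struct LipPair { int lowerVert; int upperVert; };

// Appends one row per lip pair for a single pose and returns the next free row.
// Each row asks that the posed offset of the lower vert equals that of the upper vert.
static int AddLipClosureRows(std::vector<int>& sparseR, std::vector<int>& sparseC,
	std::vector<float>& sparseV, std::vector<float>& sparseB,
	const std::vector<float>& poseWeights, // one weight per blendshape for this pose
	const std::vector<LipPair>& pairs,
	int numVerts, float rowWeight, int currRow)
{
	for (std::size_t pairIter = 0; pairIter < pairs.size(); pairIter++)
	{
		const LipPair& pair = pairs[pairIter];
		assert(pair.lowerVert < numVerts && pair.upperVert < numVerts);

		for (std::size_t shapeIter = 0; shapeIter < poseWeights.size(); shapeIter++)
		{
			float w = poseWeights[shapeIter];
			if (w < 1e-5f) continue; // shape is not active in this pose

			int base = int(shapeIter) * numVerts;
			sparseR.push_back(currRow);
			sparseC.push_back(base + pair.lowerVert);
			sparseV.push_back(w * rowWeight);

			sparseR.push_back(currRow);
			sparseC.push_back(base + pair.upperVert);
			sparseV.push_back(-w * rowWeight);
		}

		sparseB.push_back(0.0f); // relative offset between the pair should stay zero
		currRow++;
	}
	return currRow;
}
```

<p>Since the row is still linear in the unknown blendshape offsets, it can be appended to the same matrix without changing the solver.</p>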

<p>I’m going to skip a long discussion of all the shapes, since the best way to evaluate the results is to actually look at the results. Since the data is included in the link below,
feel free to look at it yourself. But here are my main conclusions:</p>

<p><strong>What worked well:</strong></p>
<ul>
<li>The shapes "just worked". I was expecting to need manual cleanup for the eyes/eyelids, but the results were clean with minimal tweaking.</li>
<li>The combinations add together well. In the past I've ended up in the quagmire of trying to get all the jaw and lip combinations to add well together. It turns into a delicate pile
of sand where tweaking one shape breaks another. It's much easier to just throw it into a solver.</li>
<li>The workflow is non-destructive. If one of the source shapes needs a revision, you can change it, press a button, and the whole rig rebuilds.</li>
<li>Shooting and rigging is easier. Since we don't care about capturing an exact shape, the actual shoot goes very quickly. We don't need to make sure the talent is activating the perfect
combination of shapes. As long as we are capturing enough range of motion, then we have enough data for the solver. Then for processing, we just have to clean up the shapes
to match the scan. We don't need 10 sculpting revisions of the lower lip because something in the animation looks off.</li>
<li>The solver acts as a soft denoise for the shapes. If you look closely, the processed shapes could be improved with a few more revisions. The neck has extra movement, some of the lip contours are a bit off...that kind of thing.
However, the solver removes those uncorrelated movements so we don't actually need to fix them.</li>
</ul>

<p><strong>What did not go so well:</strong></p>
<ul>
<li>The main disadvantage is that we lose quite a bit of detail in the shapes. The hand-wrapped shapes have details in the lips which are lost in the conversion to ARKit shapes.</li>
<li>The lips and eyelids opening/closing would definitely need to be fixed before using a technique like this in a real production.</li>
<li>When you apply a solver, "you get what you get". If your animation director doesn't like the shape of the lip, there really isn't much you can do, which is both a blessing and a curse.
If you want manual changes, there might be ways to do it. For example, if you need to change the lower lip in the dimpler, you could sculpt it by hand and add a constraint to the solver. But if you apply
too many manual changes it defeats the purpose of having a solver in the first place.</li>
<li>We capture more shapes than we actually use. In this case, 100 shapes were processed to create only 50 final blendshapes. And it's really only about 30 blendshapes if you merge the 
left and right variations. If I were to optimize this capture process, I think I could reduce the set to about 60 and keep the same quality, but that wasn't tested. In particular, I find it
essential to capture jaw-down and jaw-neutral variations of the lip movements, but some of those could be removed if cost is a factor. And cost is always a factor. </li>
</ul>

<p><strong>Data:</strong></p>

<p>You can download the data set here. It’s available on a CC0 license, so feel free to look at it and use it however you like.
<a href="/downloads/2021_11_26_arkit/colin_shape_data.zip">colin_shape_data.zip</a></p>

<p>Anyways, that’s what I got. Feel free to look at the data, and I hope it’s useful to you.</p>

<p><strong>Acknowledgments:</strong></p>

<p><i>Scan Talent:</i> Colin Zargarpour</p>

<p><i>Executive Producer:</i> Habib Zargarpour</p>

<p><i>Scan Manual Cleanup:</i> Centaur Digital (<a href="https://www.centaurdigital.com">www.centaurdigital.com</a>)</p>

<p><i>Source Animation:</i> Bannanaflak, the makers of FaceCap. Source recording from their site: <a href="https://www.bannaflak.com/face-cap/documentation.html#1.5">FaceCap_ExampleRecording.rar</a></p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[As you have probably seen, ARKit has become quite popular for facial animation. It has obvious appeal for mocap on a budget, but it’s also being used on professional productions as well. So I decided to do a test, and here it is:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222021_11_26_arkit/shape_header_med.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222021_11_26_arkit/shape_header_med.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Visibility TAA and Upsampling with Subsample History</title><link href="https://filmicworlds.com/blog/visibility-taa-and-upsampling-with-subsample-history/" rel="alternate" type="text/html" title="Visibility TAA and Upsampling with Subsample History" /><published>2021-07-28T00:00:00+00:00</published><updated>2021-07-28T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/visibility-taa-and-upsampling-with-subsample-history</id><content type="html" xml:base="https://filmicworlds.com/blog/visibility-taa-and-upsampling-with-subsample-history/"><![CDATA[<p><strong>Adventures in Visibility Rendering</strong></p>
<ul>
<li>Part 1: <a href="/blog/visibility-buffer-rendering-with-material-graphs/">Visibility Buffer Rendering with Material Graphs</a></li>
<li>Part 2: <a href="/blog/decoupled-visibility-multisampling/">Decoupled Visibility Multisampling</a></li>
<li>Part 3: <a href="/blog/software-vrs-with-visibility-buffer-rendering/">Software VRS with Visibility Buffer Rendering</a></li>
<li>Part 4: Visibility TAA and Upsampling with Subsample History</li>
</ul>

<p>This is the 4th post in the series, and you should definitely read <i>Part 2: Decoupled Visibility Multisampling</i> before going further as this technique is
an extension of DVM.</p>

<p><strong>Introduction</strong></p>

<p>By decoupling our geometry sampling rate from our shading rate, we have several ways of merging the different types of information. The Decoupled Visibility Multisampling post
demonstrated a method for rendering Visibility with 8xMSAA but rendering a GBuffer at 1x. Then it can reconstruct the edges at 8xMSAA quality. To recap, regular TAA starts with a standard, aliased 1x frame:</p>

<p><i>Aliased 1x Frame:</i></p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/closeup_crop_01_raw.png" /></div>

<p>Over time, the standard TAA algorithms can effectively reconstruct edges in some cases by accumulating multiple aliased frames together. TAA works very well for solid objects when the camera has little
movement, but fails when there is significant movement or when a thin object is less than a pixel wide. For a more thorough discussion of TAA algorithms, I’d recommend the recent overview by Lei Yang, Shiqiu Liu, and Marco Salvi [6].</p>

<p><i>TAA:</i></p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/closeup_crop_02_taa.png" /></div>

<p>As discussed previously, we can use the multisampled visibility buffer to reconstruct the lighting at subsample rate. This process allows better reconstruction of
thin edges. It is also robust against movement, since it can reconstruct edges based on a single frame.</p>

<p><i>Decoupled Visibility Multisampling:</i></p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/closeup_crop_03_dvm.png" /></div>

<p>With an 8xMSAA buffer, there is more that we can do. A single 8x MSAA pixel is roughly equivalent to a 2x2 group of 2xMSAA pixels. We can demonstrate this by doing a trivial reconstruction
at higher resolution.</p>

<p><i>Decoupled Visibility Multisampling with Naive 2x Upsample:</i></p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/closeup_crop_04_upsample.png" /></div>

<p>If we look closely, the one disadvantage is the jagged edges that appear. We can minimize this effect by applying a light, half-pixel wide filter on the edges. While the
zipper effect is not entirely gone, it is no longer visually noticeable at typical viewing distances. I.e. you will not be able to see this zipper pattern on a 4k television
that is 8 feet away. But there are options to reduce this artifact if it is a priority.</p>

<p><i>Decoupled Visibility Multisampling, with 2x Upsample and Custom MSAA Resolve:</i></p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/closeup_crop_05_resolve.png" /></div>

<p><strong>Multisampled Visibility Rejection</strong></p>

<p>One of the key elements of Decoupled Visibility Multisampling is that we can encode the coverage for each subsample. In a given 2x2 block of pixels, we have 4 pixel colors,
and 4 masks (32 bits each) so we know which subsample is covered by which material. In this particular case, we have two different materials covering this 2x2 block of pixels.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_31_upsampling/reproj_offset_01.png" /></div>

<p>Suppose that in the next frame the object shifted by a fraction of a pixel. Now the material edge is slightly to the right:</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_31_upsampling/reproj_offset_02.png" /></div>

<p>In typical TAA, we would reproject from the previous pixel. Since the previous pixel is merged into a single color in standard TAA, we would have no choice but to
sample from the previous accumulated value, which has both materials blended together.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/reproj_group_01.png" /></div>

<p>However, we have the subpixel history encoded in the coverage mask. So we can discard history subsamples that are not part of this material, and accumulate only the ones that are.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/reproj_group_02.png" /></div>

<p>There are many types of ghosting in TAA, but by far the most common is when the accumulated history is from a different material than the current pixel. By only gathering
subsamples from the same material, this kind of ghosting is completely eliminated. There are other types of ghosting of course, such as depth discontinuities or lighting/shadow
changes. But ghosting of one opaque object onto another becomes a solved problem.</p>

<p>The visibility information for a 2x2 quad is encoded into 4 32-bit masks, so we can reject invalid samples by doing a careful bitwise dot product of the current mask with
the previous mask. The actual implementation could be much improved though. For each pixel, the implementation recovers the previous frame’s 2x2 group of 2x2 quads (for a full block of 4x4 pixels).
This pass has very poor occupancy because that 4x4 block of pixels is stored for each thread, and each pixel is 3 floats. Reducing that is a prime target for optimization if more time were available.</p>
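
<p>As a rough illustration of the mask test described above, here is a minimal sketch. It assumes a hypothetical layout where bit <i>i</i> of a 32-bit mask is set when subsample <i>i</i> of the 2x2 quad (4 pixels with 8 samples each) is covered by the material, and it assumes the history mask has already been reprojected onto the current quad. The function names are made up for this example and are not from the actual implementation.</p>

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical layout: bit i is set when subsample i of the 2x2 quad
// (4 pixels x 8 samples = 32 bits) is covered by the material in question.
inline int CountMatchingSubsamples(std::uint32_t currentMask, std::uint32_t historyMask)
{
    // AND keeps only the subsamples covered by the material in both frames.
    // Counting the set bits acts as a dot product of the two bit vectors.
    return static_cast<int>(std::bitset<32>(currentMask & historyMask).count());
}

// Fraction of the current material's subsamples that have usable history.
inline float HistoryConfidence(std::uint32_t currentMask, std::uint32_t historyMask)
{
    const int current = static_cast<int>(std::bitset<32>(currentMask).count());
    if (current == 0)
    {
        return 0.0f; // The material is not visible in this quad at all.
    }
    const int matching = CountMatchingSubsamples(currentMask, historyMask);
    return static_cast<float>(matching) / static_cast<float>(current);
}
```

<p>A history subsample whose bit does not survive the AND belongs to a different material and would simply be skipped during accumulation.</p>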

<p><strong>Efficient Upsampling</strong></p>

<p>Since we have 8x MSAA visibility information, we can experiment with reconstruction algorithms. Note how the 8x MSAA pattern compares to the 2x MSAA pattern. Here is the image of the standard MSAA patterns from the D3D11 documentation [3].</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/d3d11_msaapatterns_2_8.png" /></div>

<p>Each quadrant of the 8x MSAA pattern is very similar to a rotated 2x MSAA pixel. We can resolve at 1x resolution with a box filter by averaging all 8 samples together.
To upsample and resolve the image at 2x resolution with a box filter, all we have to do is blend the two subsamples in each quadrant together. 8x MSAA at 1080p is roughly equivalent to 2x MSAA at 4k. Here is a comparison between
sample positions of a 1080p image with 8x MSAA versus a 4k image with 2x MSAA.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_31_upsampling/msaa_grouped.png" /></div>

<p><i>Comparison of 8x MSAA sample points overlaid with 2x MSAA sample points from double the resolution. Note that the 2x MSAA positions have a 90 degree rotation from their original positions.</i></p>

<p>And here is the image after a naive upsample resolved with a box filter. Each pixel is calculated from averaging the two subsamples in a quadrant.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/closeup_crop_04_upsample.png" /></div>
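
<p>The naive upsample can be sketched in a few lines. The sample offsets below are my recollection of the D3D11 standard 8x pattern in 1/16-pixel units, so treat them as an assumption and check them against [3]. Using a single float per sample and the name <code>NaiveUpsample2x</code> are simplifications for this example.</p>

```cpp
#include <array>
#include <cstddef>

struct SampleOffset { int x; int y; };

// Assumed D3D11 standard 8x MSAA offsets, in 1/16-pixel units with y pointing down.
constexpr std::array<SampleOffset, 8> kMsaa8x = {{
    {1, -3}, {-1, 3}, {5, 1}, {-3, -5}, {-5, 5}, {-7, -1}, {3, 7}, {7, -7}
}};

// Quadrant index: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
constexpr std::size_t QuadrantOf(SampleOffset s)
{
    return (s.x > 0 ? 1u : 0u) + (s.y > 0 ? 2u : 0u);
}

// Box-filter resolve at 2x resolution: each output pixel is the average
// of the subsamples that fall inside its quadrant of the source pixel.
inline std::array<float, 4> NaiveUpsample2x(const std::array<float, 8>& samples)
{
    std::array<float, 4> sum = {{0.0f, 0.0f, 0.0f, 0.0f}};
    std::array<int, 4> count = {{0, 0, 0, 0}};
    for (std::size_t i = 0; i < samples.size(); ++i)
    {
        const std::size_t q = QuadrantOf(kMsaa8x[i]);
        sum[q] += samples[i];
        ++count[q];
    }
    for (std::size_t q = 0; q < sum.size(); ++q)
    {
        if (count[q] > 0)
        {
            sum[q] /= static_cast<float>(count[q]);
        }
    }
    return sum;
}
```

<p>With these offsets, every quadrant receives exactly two subsamples, which is why the result behaves like 2x MSAA at double the resolution.</p>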

<p>Note that this algorithm introduces a zipper pattern along edges. This effect
happens because an edge can cut through the samples in such a way that even pixels intersect with one material and odd pixels intersect with the other. For the two pixels on top, there are 4 different ways that a nearly horizontal edge
can cut through.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/group_cut.png" /></div>

<p>The two scenarios on the left will show a zipper pattern whereas the two on the right will not.
We can fix this to a degree by using a more advanced MSAA resolve. For a thorough explanation, you should play some Kenny Loggins and read Matt Pettineo’s post
where he explores different custom MSAA resolve filters [5]. The image below uses a triangle filter pattern with a half-pixel radius.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_31_upsampling/closeup_crop_05_resolve.png" /></div>

<p>Note that the stairstepping is still visible, but is much less pronounced. There are several ways that we could properly fix this effect:</p>

<ol>
<li><strong>Custom Sample Positions:</strong> We could use programmable sample positions to exactly emulate 2x MSAA.</li>
<li><strong>Wider Blur:</strong> We could increase the resolve radius to blur it out.</li>
<li><strong>Smarter Upsample:</strong> Since we have visibility, we could actually detect this case by checking the material IDs and apply a special fix.</li>
</ol>

<p>Of those options, the one that makes the most sense to me is #3. It shouldn’t be too difficult to detect the zippers. The difficult part is optimization, and it would need to be
optimized in tandem with the custom resolve. Note that with solution #1, we could change the sample positions but long jaggies would only have two levels of gradients. By
upsampling from 8x MSAA we can, in theory, achieve cleaner edges than a double-resolution image with 2x MSAA. A 1080p, 8xMSAA image upsampled to 4k has 4 gradations of color
in long jaggies whereas a native 4k, 2xMSAA image only has 2 gradations of color in long jaggies.</p>

<p>Also, the interiors of the triangles look just as blocky in the 1x standard and 2x upsampled versions. There has been great research in temporal upsampling, including
DLSS from NVIDIA [4], FidelityFX Super Resolution from AMD [1], and Super Resolution in Unreal Engine 5 [2]. In short, this problem is well-studied with several excellent solutions. The contribution of this post is in
upsampling with clean edges along borders, and the problem of how to render the interiors is mostly orthogonal.</p>

<p>In total, the render time is quite consistent. At 1080p, here are the timings for four different variations.</p>

<ul>
<li><i>Regular 1x:</i> Standard TAA algorithm</li>
<li><i>DVM 1x:</i> This pass applies visibility-aware accumulation, and runs a second pass to resolve to 1x.</li>
<li><i>DVM 2x Upsample:</i> This version applies visibility-aware accumulation, and applies a 2x upsample.</li>
<li><i>DVM 2x Custom Resolve:</i> This approach is the same as DVM 2x, except that it applies a custom MSAA resolve.</li>
</ul>

<p><i>TAA/Resolve Cost:</i></p>
<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Accumulation (ms)</th><th>Resolve (ms)</th>
  </tr>
  <tr align="center">
     <td>Regular 1x</td><td colspan="2">0.161</td>
  </tr>
  <tr align="center">
     <td>DVM 1x</td><td>0.869</td><td>0.082</td>
  </tr>
  <tr align="center">
     <td>DVM 2x Upsample</td><td>0.869</td><td>0.103</td>
  </tr>
  <tr align="center">
     <td>DVM 2x Custom Resolve</td><td>0.868</td><td>0.733</td>
  </tr>
</table>

<p>The regular 1x version of TAA requires about 0.161ms to render for each frame. The accumulation step is quite a bit more expensive
than before, taking 0.869ms. The box filter resolves are quite fast, at 0.082ms and 0.103ms, but the custom resolve is vastly more expensive at 0.733ms.</p>

<p>Since this is a toy engine, it didn’t make sense to spend the time doing proper optimization. I did some preliminary optimizations (the original was 5ms or so), but
honestly 1.6ms for both combined passes is still too expensive. The path to optimize the compute shaders is clear so I left that out due to time constraints.</p>

<p><strong>Putting it all Together</strong></p>

<p>The different aspects of visibility rendering really synergize, and we can run VRS as well. With a 1080p “native” image, 25% VRS and a 2x upsampling, we can render a pretty respectable 4k image even though the number of pixels we are shading
is the equivalent of 540p.</p>
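
<p>The 540p figure is just pixel counting, and it can be checked at compile time. This sketch assumes that “25% VRS” means one shaded pixel per 2x2 quad. The constant names are made up for the example.</p>

```cpp
constexpr long long kNativePixels = 1920LL * 1080LL;   // 1080p "native" image
constexpr long long kShadedPixels = kNativePixels / 4; // 25% VRS: one shaded pixel per 2x2 quad
constexpr long long kOutputPixels = 3840LL * 2160LL;   // 4k image after the 2x upsample

static_assert(kShadedPixels == 960LL * 540LL, "25% of 1080p is the pixel count of 540p");
static_assert(kOutputPixels == 16 * kShadedPixels, "each shaded pixel backs 16 output pixels");
```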

<p><strong>References:</strong></p>

<p>[1] AMD FidelityFX, Super Resolution. AMD Inc. (<a href="https://www.amd.com/en/technologies/radeon-software-fidelityfx-super-resolution">https://www.amd.com/en/technologies/radeon-software-fidelityfx-super-resolution</a>)</p>

<p>[2] Unreal Engine 5 Early Access Release Notes. Epic Games, Inc. (<a href="https://docs.unrealengine.com/5.0/en-US/ReleaseNotes/">https://docs.unrealengine.com/5.0/en-US/ReleaseNotes/</a>)</p>

<p>[3] D3D11_STANDARD_MULTISAMPLE_QUALITY_LEVELS enumeration (d3d11.h). Microsoft, Inc. (<a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ne-d3d11-d3d11_standard_multisample_quality_levels">https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ne-d3d11-d3d11_standard_multisample_quality_levels</a>)</p>

<p>[4] NVIDIA DLSS. NVIDIA Inc. (<a href="https://www.nvidia.com/en-us/geforce/technologies/dlss/">https://www.nvidia.com/en-us/geforce/technologies/dlss/</a>)</p>

<p>[5] Experimenting with Reconstruction Filters for MSAA Resolve. Matt Pettineo. (<a href="https://therealmjp.github.io/posts/msaa-resolve-filters/">https://therealmjp.github.io/posts/msaa-resolve-filters/</a>)</p>

<p>[6] A Survey of Temporal Antialiasing Techniques. Lei Yang, Shiqiu Liu, and Marco Salvi. (<a href="http://behindthepixels.io/assets/files/TemporalAA.pdf">http://behindthepixels.io/assets/files/TemporalAA.pdf</a>)</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Adventures in Visibility Rendering Part 1: Visibility Buffer Rendering with Material Graphs Part 2: Decoupled Visibility Multisampling Part 3: Software VRS with Visibility Buffer Rendering Part 4: Visibility TAA and Upsampling with Subsample History]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222021_07_31_upsampling/merged_upsample_header.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222021_07_31_upsampling/merged_upsample_header.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Software VRS with Visibility Buffer Rendering</title><link href="https://filmicworlds.com/blog/software-vrs-with-visibility-buffer-rendering/" rel="alternate" type="text/html" title="Software VRS with Visibility Buffer Rendering" /><published>2021-07-19T00:00:00+00:00</published><updated>2021-07-19T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/software-vrs-with-visibility-buffer-rendering</id><content type="html" xml:base="https://filmicworlds.com/blog/software-vrs-with-visibility-buffer-rendering/"><![CDATA[<p><strong>Introduction</strong></p>

<p>There are some very interesting tradeoffs involved in visibility rendering, and many techniques to explore. The first post was an overview
of visibility rendering and how it could be optimized with material graphs. The second is a new technique for decoupling geometry sampling rate
from shader sampling rate for better anti-aliasing. This third one discusses another way of separating the geometry sampling rate from the shading sampling rate: Variable Rate
Shading.</p>

<p><strong>Adventures in Visibility Rendering</strong></p>
<ul>
<li>Part 1: <a href="/blog/visibility-buffer-rendering-with-material-graphs/">Visibility Buffer Rendering with Material Graphs</a></li>
<li>Part 2: <a href="/blog/decoupled-visibility-multisampling/">Decoupled Visibility Multisampling</a></li>
<li>Part 3: Software VRS with Visibility Buffer Rendering</li>
<li>Part 4: <a href="/blog/visibility-taa-and-upsampling-with-subsample-history/">Visibility TAA and Upsampling with Subsample History</a></li>
</ul>

<p>Hardware Variable Rate Shading (or VRS for short) allows GPUs to perform shading and geometry sampling at different rates. It is a relatively new
feature, but is now supported on AMD [1], Intel [4], and NVIDIA [7] GPUs. Here is a single zoomed-in example frame with Forward Rendering and TAA:</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_19_vrs/sphere_crop/crop_03_ref.jpg" /></div>

<p>Next, here is the same shot with 2x2 Hardware VRS enabled. It uses different sampling points than the 1x image, with each 2x2 shading point being the average of the four
original sample points.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_19_vrs/sphere_crop/crop_04_2x2.jpg" /></div>

<p>However, with Visibility rendering, we can take a different approach. Instead of changing the samples like hardware VRS does, we can choose a subset of the sample
points that we would normally render.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_19_vrs/sphere_crop/crop_00_chosen.jpg" /></div>

<p>We can then perform a custom interpolation of those points which gives a blurry result.</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_19_vrs/sphere_crop/crop_01_first_frame.jpg" /></div>

<p>We can then randomize the sample points each frame. Since the samples are in the same positions as the 1x frame, it will converge to the original reference frame over time if nothing is moving. This feature
is in contrast to Hardware VRS which would converge to a blurrier image than the original. Here is the final converged frame:</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_19_vrs/sphere_crop/crop_02_vis_final.jpg" /></div>

<p>In addition to higher converged quality than Hardware VRS, this approach has performance benefits as well, especially as the triangle density increases.</p>

<p><strong>Prior Art</strong></p>

<p>There are several options to vary the shading rate over the image with hardware VRS [8]. Additionally, since you can specify a screen-space mask for VRS level, you can adaptively apply conservative or aggressive
sampling rates based on content, as was done in the latest Gears 5 [9]. There are also software versions, such as Call of Duty: Modern Warfare which emulated 2x2 pixel
quads using 4xMSAA [2]. Additionally, Marissa du Bois from Intel [3] has a good overview video discussing the technique, practical considerations, and the integration with UE4.
In the second half of that video, John Gibson from Tripwire discusses performance numbers from their game, Chivalry 2. Tomasz Stachowiak also implemented
an impressive system via GCN hacking to group shaders by occupancy, with a VRS algorithm when data is shared along the same pixel quad [10].</p>

<p>It’s a very nice win. The API is very easy to enable, and in many cases you can get better performance without any other work. If it looks the same, but is faster,
then of course you should do it. But there are ways we can improve the technique by doing it ourselves, in software.</p>

<p>As mentioned previously, we can
use the same sample positions as the 1x reference image, so that our image converges to the non-VRS result. Also, Visibility VRS is able to
solve some of the performance inefficiencies with Hardware VRS. What inefficiencies does Hardware VRS have? Well, if you thought
we were done talking about quad utilization then I have some bad news for you.</p>

<p><strong>Quad Utilization</strong></p>

<p>In the previous post, there is a discussion about how rendering is broken up into quads. With regular 1x rendering, the rasterizer will split this triangle
into 2x2 quads, and the grey pixels will be helper lanes that need to run.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/quads/small_quads.png" /></div>

<p>We end up with three 2x2 quads: one each for the upper-left, bottom-left, and bottom-right.
The grey pixels do not contribute to the final image, and exist only to provide partial
derivatives to the UVs of pixels that will actually appear on the screen.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/quads/grid-colors.png" /></div>

<p>So, what happens if we render this same triangle with 2x2 VRS? The rasterizer will think of each 4x4 block of pixels as a single 2x2 quad, like below.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/quads/small_quads_2.png" /></div>

<p>Then it will spawn a single pixel shader quad, where the upper-right is a helper lane.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/quads/grid-colors-vrs.png" /></div>

<p>After the pixel shader is run, it will apply each of those colors to each 2x2 block in the original image.
Suppose we have a large triangle in purple, like below. In the regular non-VRS case, if the triangle touches a single pixel in a 2x2 square, then the other points need to be rendered as helper lanes. These lanes are
shown in grey.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/quad_big_01_small.png" /></div>

<p>This process “rounds up” when VRS is enabled. With 2x2 VRS, if this triangle touches a single pixel in a 4x4 grid, the rest must be filled in with helper lanes. Said another way, if a triangle touches at least one pixel of a 4x4 block, then the 4x4 block must be rendered.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/quad_big_01_vrs.png" /></div>

<p>When rendering this triangle with 2x2 VRS, the rasterizer needs to find all the 4x4 blocks that are touched by this triangle. Then those 4x4 blocks become quads which get sent
to the pixel shader. Each 2x2 group of blue pixels becomes a helper lane in the VRS pixel shader quad, although each 4x4 block of pixels only spawns a single 2x2 quad to process.</p>
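
<p>The “rounding up” can be expressed as counting the blocks that a triangle’s covered pixels fall into. This is only a sketch of the bookkeeping, not of how a hardware rasterizer works. The <code>Pixel</code> struct, the <code>SpawnedLanes</code> name, and the restriction to non-negative pixel coordinates are assumptions for the example.</p>

```cpp
#include <set>
#include <utility>
#include <vector>

struct Pixel { int x; int y; };

// Number of pixel shader lanes (active + helper) spawned for a set of covered pixels.
// coarseSize is 1 for regular rendering and 2 for 2x2 VRS, so one quad spans
// 2*coarseSize pixels per axis. Coordinates are assumed to be non-negative.
inline int SpawnedLanes(const std::vector<Pixel>& covered, int coarseSize)
{
    const int blockSize = 2 * coarseSize;
    std::set<std::pair<int, int>> touchedBlocks;
    for (const Pixel& p : covered)
    {
        touchedBlocks.insert(std::make_pair(p.x / blockSize, p.y / blockSize));
    }
    // Every touched block becomes one 2x2 quad, which is 4 lanes.
    return static_cast<int>(touchedBlocks.size()) * 4;
}
```

<p>A single covered pixel spawns 4 lanes with or without VRS, and two covered pixels that straddle a 4x4 block corner spawn 8 lanes at 2x2 VRS.</p>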

<p>So, how many times does a pixel shader need to run for a large triangle? As discussed previously, in the standard non-VRS case, 
a Forward renderer would call the pixel shader once per pixel. It would be the same for the Material and Lighting pass of a Deferred renderer, as well as a Visibility
renderer. However, what about the 2x2 VRS case?</p>

<p>In a Forward renderer, we would expect each pixel shader to run once per 2x2 quad, which would mean 0.25 pixel shader invocations per pixel. For Deferred,
the Material pass would be rasterized at 0.25 pixel shader invocations per pixel, but the Lighting pass would run at 1x since it isn’t rasterized.
Yes, there are ways to run Lighting at lower rate as was performed in Gears 5 [9], but we’ll skip those details for now.</p>

<p>What about visibility though? Can we render at 0.25 invocations per pixel? Sure, why not? We can choose a subset of pixels, light those, and interpolate the in-between
pixels.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/large-quad_vrs.png" /></div>

<p>We will have extra work to do, but we can absolutely do it. Also, note that we do not need helper lanes since we are able to calculate uv derivatives analytically.</p>

<p>Here is the table showing how many times the Lighting and Material evaluation would run for Forward/Deferred/Visibility in the non-VRS and 2x2 VRS versions.</p>

<p><i>Approximate shader function invocations per pixel for large triangles:</i></p>
<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Material (non-VRS)</th><th>Lighting (non-VRS)</th><th>Material (2x2 VRS)</th><th>Lighting (2x2 VRS)</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td colspan="2">1x</td><td colspan="2">0.25x</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>1x</td><td>1x</td><td>0.25x</td><td>1x</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1x</td><td>1x</td><td>0.25x</td><td>0.25x</td>
  </tr>
</table>

<p>Simple enough. Forward and Deferred Material can run at 0.25x rate thanks to Hardware VRS. Visibility can run at 0.25x rate with our own Software VRS
solution. And Deferred Lighting runs at 1x since it is in a full-screen pass, although it could be optimized as it was done in Gears 5.</p>

<p>Let’s move on to small triangles. In the non-VRS case, any 1-pixel triangle becomes a 2x2 quad with 1 active lane and 3 helper lanes.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/tiny_tri_2.png" /></div>

<p>What happens to that same tiny triangle in the 2x2 VRS case? Unsurprisingly, the exact same thing happens. The pixel shader converts it to a 2x2 quad with 1 active lane and 3 helper lanes,
just like the non-VRS case.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/small_tri_4_vrs.png" /></div>

<p>For tiny, 1 pixel triangles, the VRS and non-VRS cases are the same. A tiny triangle will always need 1 active lane, which will require a quad with 3 helper lanes.
But what about for Visibility rendering?</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/small_tri_4_vrs_multi.png" /></div>

<p>With visibility, we can still render at 0.25 shader invocations per pixel. Since the choice of pixels to run is arbitrary, we can use the same algorithm as the
large triangle case. We only need to render a subset of the pixels and we can interpolate the rest. The size of the triangle is irrelevant. Here is the table of shader invocations per pixel.</p>

<p><i>Approximate shader function invocations per pixel for 1-pixel triangles:</i></p>
<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Material (non-VRS)</th><th>Lighting (non-VRS)</th><th>Material (2x2 VRS)</th><th>Lighting (2x2 VRS)</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td colspan="2">4x</td><td colspan="2">4x</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>4x</td><td>1x</td><td>4x</td><td>1x</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1x</td><td>1x</td><td>0.25x</td><td>0.25x</td>
  </tr>
</table>

<p>And that is really the key idea of Visibility rendering with VRS. With tiny triangles, the Forward and Deferred Material pass have to render 4 pixel shader
invocations per pixel due to quad utilization, regardless of VRS. But with visibility rendering we can maintain 0.25 shader invocations per pixel in both cases.</p>

<p>Finally, what about more typical, 10 pixel triangles? Let’s start with this one again:</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/pix_10_first_right_0_small_pure.png" /></div>

<p>In that case, the triangle fits perfectly into a 4x4 block of pixels, and will create exactly one quad. However, there are 16 ways for that triangle to align with the
4x4 grid.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/util-10-10-groups.png" /></div>

<p>There is exactly 1 way for the triangle to fit perfectly inside a single 4x4 block, but 6 ways to touch 2 blocks, 6 ways to touch 3 blocks, and 3 ways to touch 4 blocks.
That means this shape of triangle will spawn 10.75 pixel shader lanes on average (active + helper).</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th>Touched 4x4 Blocks</th><th>Variations</th>
  </tr>
  <tr align="center">
     <td>1</td><td>1</td>
  </tr>
  <tr align="center">
     <td>2</td><td>6</td>
  </tr>
  <tr align="center">
     <td>3</td><td>6</td>
  </tr>
  <tr align="center">
     <td>4</td><td>3</td>
  </tr>
</table>

<p>Let’s take a look at a longer, thinner shape.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/quads/util-10-10-thin.png" /></div>

<p>This one is a little worse. There is no way for it to hit exactly one 4x4 block, and on average this shape will spawn 12.25 pixel shader lanes.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th>Touched 4x4 Blocks</th><th>Variations</th>
  </tr>
  <tr align="center">
     <td>1</td><td>0</td>
  </tr>
  <tr align="center">
     <td>2</td><td>3</td>
  </tr>
  <tr align="center">
     <td>3</td><td>10</td>
  </tr>
  <tr align="center">
     <td>4</td><td>2</td>
  </tr>
  <tr align="center">
     <td>5</td><td>1</td>
  </tr>
</table>

<p>For our two shapes, they will on average require 10.75 and 12.25 pixel shader invocations each. Don’t forget: The original triangle is only 10 pixels. So even though we are trying
to shade at 1/4 of the geometry rate, we still actually have more shader invocations than we have pixels.</p>
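
<p>Both averages follow directly from the two tables: weight the number of touched 4x4 blocks by the number of alignments, divide by the 16 alignments, and multiply by 4 lanes per quad. Here is a small sketch of that arithmetic, where <code>AverageLanes</code> is a name made up for the example.</p>

```cpp
#include <cstddef>
#include <vector>

// variations[i] is the number of alignments that touch (i + 1) 4x4 blocks.
// Returns the average number of pixel shader lanes (active + helper).
inline double AverageLanes(const std::vector<int>& variations)
{
    int alignments = 0;
    int blocks = 0;
    for (std::size_t i = 0; i < variations.size(); ++i)
    {
        alignments += variations[i];
        blocks += static_cast<int>(i + 1) * variations[i];
    }
    // Each touched block spawns one 2x2 quad, which is 4 lanes.
    return alignments > 0 ? 4.0 * blocks / alignments : 0.0;
}
```

<p>For the first shape, <code>AverageLanes({1, 6, 6, 3})</code> gives 4 * 43 / 16 = 10.75, and for the thin shape, <code>AverageLanes({0, 3, 10, 2, 1})</code> gives 4 * 49 / 16 = 12.25.</p>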

<p>If you remember from the previous post,
with non-VRS we have about 2x shader invocations per pixel. For simplicity, let’s say that our 10-pixel triangles at 2x2 VRS require 1x shader invocation per pixel. And for Visibility rendering, we can easily get 0.25x shader invocations per pixel.</p>

<p><i>Approximate shader function invocations per pixel for 10-pixel triangles:</i></p>
<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Material (non-VRS)</th><th>Lighting (non-VRS)</th><th>Material (2x2 VRS)</th><th>Lighting (2x2 VRS)</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td colspan="2">2x</td><td colspan="2">1x</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>2x</td><td>1x</td><td>1x</td><td>1x</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1x</td><td>1x</td><td>0.25x</td><td>0.25x</td>
  </tr>
</table>

<p>Based on this analysis, we would expect performance for Forward, Deferred, and Visibility to be similar for very large triangles. However,
as triangles get closer to 10 pixels we would expect Visibility to be faster. And then as triangles shrink all the way to a single pixel, Visibility should
be the winner due to quad utilization. Of course, before we test that, we should probably discuss how to actually do VRS in software with Visibility buffers.</p>

<p><strong>Software VRS with Visibility</strong></p>

<p>At a high level, we are going to mark pixels as either on or off. Pixels marked as on will be accumulated into the Visibility Material pass which will generate a
sparse GBuffer. The sparse GBuffer will be lit by the Lighting pass. And then we will unpack the sparse lighting and fill in the holes. Then we can pass that image to
TAA which can magically fix everything. Joking, not joking.</p>

<p>As an example, the selected pixels will look like this:</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_19_vrs/sphere_crop/crop_00_chosen.jpg" /></div>

<p>The single reconstructed frame will look like this:</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_19_vrs/sphere_crop/crop_01_first_frame.jpg" /></div>

<p>And the final frame after TAA will look like this:</p>

<div style="text-align:center;"><img width="800" src="/images/2021_07_19_vrs/sphere_crop/crop_02_vis_final.jpg" /></div>
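
<p>To make the on/off marking concrete, here is a minimal sketch of one possible selection rule at 25% rate. It shades one pixel per 2x2 quad and rotates the chosen corner over four frames. This is a deterministic stand-in for the per-frame randomization, not the pattern actually used, and <code>IsPixelShaded</code> is a hypothetical name.</p>

```cpp
// Returns true when pixel (x, y) is marked "on" for this frame.
// Assumed pattern: one pixel per 2x2 quad, with the chosen corner
// cycling through all four positions every four frames.
inline bool IsPixelShaded(int x, int y, unsigned frameIndex)
{
    // Corner index inside the 2x2 quad: 0, 1, 2, or 3.
    const unsigned corner = (static_cast<unsigned>(x) & 1u)
                          | ((static_cast<unsigned>(y) & 1u) << 1);
    return corner == (frameIndex & 3u);
}
```

<p>Because the marked pixels sit at the same positions as the 1x image, every pixel gets a real shading sample once every four frames under this rule, which is what lets TAA converge toward the non-VRS result when nothing is moving.</p>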

<p>The first question is: How should we reconstruct the image from sparse points? I tried several options, and many variations of bilateral upsampling, but the approach
which worked the best was from the Deferred Adaptive Compute Shading papers, from Ian Mallett, Cem Yuksel, and Larry Seiler [5,6]. Their key idea was to first render a full GBuffer, but perform lighting at variable rate using iterative passes. Their algorithm starts by calculating lighting at every
4th sample in X and Y, like so:</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_img_0.png" /></div>

<p>The next step is to fill in the missing pixels. For a pixel like the one below, we can interpolate the value from the 4 neighbors.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_img_1_interp.png" /></div>

<p>They had one very, very interesting innovation. Since the data is there, they could choose to either calculate or interpolate the pixel on-the-fly. They would compare
the values of all 4 neighbors, and if they are close enough, interpolate. But if the GBuffer data was different, they would perform the more expensive full lighting
calculation. It’s a very cool approach, and I’d recommend reading both papers. For each value in between,
they choose new colors either by interpolating or calculating it directly.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_img_1.png" /></div>
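
<p>The interpolate-or-calculate decision can be sketched as follows. This is a simplified illustration rather than the algorithm from the papers: it compares a single scalar per neighbor, uses a made-up <code>tolerance</code> parameter, and takes the full lighting calculation as a callback.</p>

```cpp
#include <algorithm>
#include <functional>

// Decide per pixel whether to interpolate from 4 lit neighbors or to run
// the expensive lighting calculation. The scalar comparison and the
// tolerance value are assumptions made for this sketch.
inline float InterpolateOrShade(float n0, float n1, float n2, float n3,
                                float tolerance,
                                const std::function<float()>& shade)
{
    const float lo = std::min(std::min(n0, n1), std::min(n2, n3));
    const float hi = std::max(std::max(n0, n1), std::max(n2, n3));
    if (hi - lo <= tolerance)
    {
        // Neighbors agree closely enough, so the cheap average is used.
        return 0.25f * (n0 + n1 + n2 + n3);
    }
    // Neighbors disagree, so fall back to the full calculation.
    return shade();
}
```

<p>The callback only runs when the neighbors disagree, which is where the savings come from in smooth regions.</p>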

<p>This continues for a 2nd step…</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_img_2.png" /></div>

<p>…and a 3rd step…</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_img_3.png" /></div>

<p>…and a 4th step…</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_img_4.png" /></div>

<p>…until the image is complete. Waiting to decide between interpolating and lighting a pixel until the neighbors were lit is not practical in this Visibility buffer
implementation. So for our variation, we will start with an image after the sparse lighting pass.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_merged_1.png" /></div>

<p>The green pixels are the locations where we calculated a lighting value. Then for each pass, we fill in the blanks. What algorithm should we use
for this new pixel?</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_merged_1_interp.png" /></div>

<p>The common approach would be to interpolate along the direction with the smaller absolute difference, as has been used in image debayering. It’s probably easiest to explain with code:</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="nf">InterpolatePrimaryCrossColorMerged</span><span class="p">(</span>
	<span class="n">uint</span> <span class="n">matC</span><span class="p">,</span> <span class="n">uint</span> <span class="n">mat0</span><span class="p">,</span> <span class="n">uint</span> <span class="n">mat1</span><span class="p">,</span> <span class="n">uint</span> <span class="n">mat2</span><span class="p">,</span> <span class="n">uint</span> <span class="n">mat3</span><span class="p">,</span>
	<span class="kt">bool</span> <span class="n">validC</span><span class="p">,</span>
	<span class="n">float3</span> <span class="n">colorC</span><span class="p">,</span> <span class="n">float3</span> <span class="n">color0</span><span class="p">,</span> <span class="n">float3</span> <span class="n">color1</span><span class="p">,</span> <span class="n">float3</span> <span class="n">color2</span><span class="p">,</span> <span class="n">float3</span> <span class="n">color3</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">float4</span> <span class="n">color</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
  
  <span class="n">float4</span> <span class="n">temp0</span> <span class="o">=</span> <span class="p">(</span><span class="n">matC</span> <span class="o">==</span> <span class="n">mat0</span><span class="p">)</span> <span class="o">?</span> <span class="n">float4</span><span class="p">(</span><span class="n">color0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">:</span> <span class="n">float4</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
  <span class="n">float4</span> <span class="n">temp1</span> <span class="o">=</span> <span class="p">(</span><span class="n">matC</span> <span class="o">==</span> <span class="n">mat1</span><span class="p">)</span> <span class="o">?</span> <span class="n">float4</span><span class="p">(</span><span class="n">color1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">:</span> <span class="n">float4</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
  <span class="n">float4</span> <span class="n">temp2</span> <span class="o">=</span> <span class="p">(</span><span class="n">matC</span> <span class="o">==</span> <span class="n">mat2</span><span class="p">)</span> <span class="o">?</span> <span class="n">float4</span><span class="p">(</span><span class="n">color2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">:</span> <span class="n">float4</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
  <span class="n">float4</span> <span class="n">temp3</span> <span class="o">=</span> <span class="p">(</span><span class="n">matC</span> <span class="o">==</span> <span class="n">mat3</span><span class="p">)</span> <span class="o">?</span> <span class="n">float4</span><span class="p">(</span><span class="n">color3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">:</span> <span class="n">float4</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
  
  <span class="n">float4</span> <span class="n">avg0</span> <span class="o">=</span> <span class="mf">.5</span><span class="n">f</span><span class="o">*</span><span class="p">(</span><span class="n">temp0</span> <span class="o">+</span> <span class="n">temp1</span><span class="p">);</span>
  <span class="n">float4</span> <span class="n">avg1</span> <span class="o">=</span> <span class="mf">.5</span><span class="n">f</span><span class="o">*</span><span class="p">(</span><span class="n">temp2</span> <span class="o">+</span> <span class="n">temp3</span><span class="p">);</span>
  
  <span class="kt">bool</span> <span class="n">bothGood0</span> <span class="o">=</span> <span class="n">avg0</span><span class="p">.</span><span class="n">w</span> <span class="o">&gt;=</span> <span class="mf">.75</span><span class="n">f</span><span class="p">;</span>
  <span class="kt">bool</span> <span class="n">bothGood1</span> <span class="o">=</span> <span class="n">avg1</span><span class="p">.</span><span class="n">w</span> <span class="o">&gt;=</span> <span class="mf">.75</span><span class="n">f</span><span class="p">;</span>
  
  <span class="k">if</span> <span class="p">(</span><span class="n">bothGood0</span> <span class="o">&amp;&amp;</span> <span class="n">bothGood1</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="kt">float</span> <span class="n">diff0</span> <span class="o">=</span> <span class="n">abs</span><span class="p">(</span><span class="n">dot</span><span class="p">(</span><span class="n">float3</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">temp0</span><span class="p">.</span><span class="n">xyz</span> <span class="o">-</span> <span class="n">temp1</span><span class="p">.</span><span class="n">xyz</span><span class="p">));</span>
    <span class="kt">float</span> <span class="n">diff1</span> <span class="o">=</span> <span class="n">abs</span><span class="p">(</span><span class="n">dot</span><span class="p">(</span><span class="n">float3</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">temp2</span><span class="p">.</span><span class="n">xyz</span> <span class="o">-</span> <span class="n">temp3</span><span class="p">.</span><span class="n">xyz</span><span class="p">));</span>
    <span class="n">color</span> <span class="o">=</span> <span class="n">diff0</span> <span class="o">&lt;</span> <span class="n">diff1</span> <span class="o">?</span> <span class="n">avg0</span> <span class="o">:</span> <span class="n">avg1</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">bothGood0</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">bothGood1</span><span class="p">)</span>
    <span class="n">color</span> <span class="o">=</span> <span class="n">avg0</span><span class="p">;</span>
  <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">bothGood0</span> <span class="o">&amp;&amp;</span> <span class="n">bothGood1</span><span class="p">)</span>
    <span class="n">color</span> <span class="o">=</span> <span class="n">avg1</span><span class="p">;</span>
  <span class="k">else</span>
    <span class="n">color</span> <span class="o">=</span> <span class="n">avg0</span> <span class="o">+</span> <span class="n">avg1</span><span class="p">;</span>
  
  <span class="k">if</span> <span class="p">(</span><span class="n">color</span><span class="p">.</span><span class="n">w</span> <span class="o">&lt;</span> <span class="mf">.25</span><span class="n">f</span><span class="p">)</span>
    <span class="n">color</span> <span class="o">=</span> <span class="n">float4</span><span class="p">(</span><span class="n">color0</span> <span class="o">+</span> <span class="n">color1</span> <span class="o">+</span> <span class="n">color2</span> <span class="o">+</span> <span class="n">color3</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>

  <span class="n">color</span><span class="p">.</span><span class="n">xyz</span> <span class="o">*=</span> <span class="n">rcp</span><span class="p">(</span><span class="n">color</span><span class="p">.</span><span class="n">w</span><span class="p">);</span>
  
  <span class="k">if</span> <span class="p">(</span><span class="n">validC</span><span class="p">)</span>
    <span class="n">color</span><span class="p">.</span><span class="n">xyz</span> <span class="o">=</span> <span class="n">colorC</span><span class="p">;</span>
  
  <span class="k">return</span> <span class="n">color</span><span class="p">.</span><span class="n">rgb</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>Note that we are passing in both the draw call IDs of the center pixel and the 4 neighbors, and we are only going to interpolate using neighbors
that are from the same draw call ID. There are two sets of pairs that we can use to interpolate this pixel. A pair is considered “good” if both pixels
are on the same draw call ID as the pixel we want to calculate. If both pairs are good, then we interpolate using the “better” pair, where the “better” pair
is the one with a smaller absolute difference. If only one pair is good, then we use that pair. And if neither pair is good then we just add up the pixels and hope
for the best.</p>

<p>In retrospect, selecting pairs for interpolation using smaller absolute differences likely wasn’t the best choice. In the raw images there are a number of single pixels that look a bit like noise. Those are pixels
that were significantly brighter or darker than the average. Thus when interpolating the nearby pixels, those pixels were not used since the other pair was always
chosen. So I might switch to a straight average of pixels, or a hybrid approach. It needs more investigation.</p>

<p>The advantage of this approach is that interpolation is fast, as it only requires the nearby 4 pixels. It is also edge-aware, so colors will not bleed.
What happens if all 4 neighbors are on a different draw call ID? In the worst case, we can take the average and accept the bleeding. But we can actually
stop it from happening in the first place. When selecting which pixels to calculate/interpolate earlier in the frame, we can prioritize pixels that do not
have any good neighbors, which stops this case from happening.</p>

<p>Using this algorithm, we can proceed with each pass. For each pixel in pass 1, we interpolate the missing pixels.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_merged_2.png" /></div>

<p>Then we can continue with pass 2…</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_merged_3.png" /></div>

<p>…and pass 3…</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_merged_4.png" /></div>

<p>…and pass 4.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/dacs_overview_merged_5.png" /></div>

<p>The key advantage of this approach is flexibility. We have complete control of whichever pixels we want to render or interpolate. The only hard rule is that
we need to render all the pixels in pass 0 (which is 1/16th of the total pixels). And as a soft rule, we want to render pixels when all 4 neighbors are invalid.
Otherwise we can enable/disable pixels by any metric we want.</p>

<p><strong>Convergence</strong></p>

<p>As you might have noticed, we are sampling from different locations than Hardware VRS. In a plain, 1x native render, we would shade all of our sample points in the
center of each pixel.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/crop_sample_points_std.png" /></div>

<p>If we enable 2x2 VRS, the sample points move to the center of each 2x2 block. Our final image will be blurrier because we are rendering at lower resolution.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/crop_sample_points_2x2.png" /></div>

<p>But with this VRS variation, we are rendering at the same sample points as the standard pattern. We are rendering a partial subset at the same
locations, as opposed to Hardware VRS which changes the positions.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/crop_sample_points_vrs_1.png" /></div>

<p>Since we have flexibility, we can choose different sample points in the next frame…</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/crop_sample_points_vrs_2.png" /></div>

<p>…and the next one…</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/crop_sample_points_vrs_3.png" /></div>

<p>…and the next one.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_19_vrs/dacs/crop_sample_points_vrs_4.png" /></div>

<p>That means if we are careful, we can actually make the image converge to the reference non-VRS image. We can do this by jittering the 4x4 grid every frame, and also
giving each pixel a noisy offset to its priority as well.</p>
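
<p>As a rough CPU-side illustration of the grid jitter, here is a minimal C++ sketch. The pass layout inside each 4x4 block is an assumption inferred from the pass diagrams, and the names <code>PassIndex</code> and <code>JitteredPassIndex</code> as well as the raster-order offset sequence are hypothetical. A real implementation would likely use a better-distributed sequence.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pass layout for one 4x4 block (an assumption, not from the post):
// pass 0 = block corner, pass 1 = block center, pass 2 = the two other
// even/even pixels, pass 3 = odd/odd pixels, pass 4 = everything else.
inline int PassIndex(uint32_t lx, uint32_t ly)
{
    assert(lx < 4u && ly < 4u);
    if (lx == 0 && ly == 0) return 0;
    if (lx == 2 && ly == 2) return 1;
    if ((lx % 2 == 0) && (ly % 2 == 0)) return 2;  // (2,0) and (0,2)
    if ((lx % 2 == 1) && (ly % 2 == 1)) return 3;  // four odd/odd pixels
    return 4;                                      // remaining eight pixels
}

// Shift the 4x4 grid by a per-frame offset. This simple sequence walks
// through all 16 offsets, so every pixel lands in pass 0 once per 16 frames.
inline int JitteredPassIndex(uint32_t x, uint32_t y, uint32_t frame)
{
    uint32_t jx = frame & 3u;
    uint32_t jy = (frame >> 2) & 3u;
    return PassIndex((x + jx) & 3u, (y + jy) & 3u);
}
```

<p>With this sketch, any pixel is guaranteed a real shaded sample at least once every 16 frames, which is the property that lets the accumulated image approach the non-VRS reference.</p>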

<p>In the TAA accumulation buffer, we can store a confidence
value in the alpha channel. As we get new samples, we increase confidence at that pixel. When we have an interpolated value, if our existing value has low confidence,
then we will treat the interpolated value as important. However if we have high confidence, we can disregard interpolated values. As long as we make sure
to jitter different pixels in different frames, it will converge properly.</p>
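
<p>To make the confidence idea concrete, here is a minimal C++ sketch of one accumulation step. It is not the actual shader. The struct <code>Accum</code>, the function <code>Accumulate</code>, and both tuning constants are hypothetical values chosen for illustration, and reprojection and history rejection are left out entirely.</p>

```cpp
#include <algorithm>
#include <cassert>

// Accumulated color with a confidence value stored in the alpha channel.
struct Accum { float r, g, b, confidence; };

// Hypothetical tuning constants; the post does not give specific values.
constexpr float kConfidenceGain = 0.25f;  // confidence added per shaded sample
constexpr float kMinShadedBlend = 0.10f;  // shaded samples always keep this weight

inline Accum Accumulate(const Accum& history, float r, float g, float b, bool shaded)
{
    assert(history.confidence >= 0.0f && history.confidence <= 1.0f);

    // Low confidence: trust whatever arrives, even an interpolated value.
    // High confidence: interpolated values are ignored, shaded ones still blend in.
    float weight = 1.0f - history.confidence;
    if (shaded)
        weight = std::max(weight, kMinShadedBlend);

    Accum out;
    out.r = history.r + (r - history.r) * weight;
    out.g = history.g + (g - history.g) * weight;
    out.b = history.b + (b - history.b) * weight;

    // Only real shaded samples raise confidence.
    out.confidence = shaded ? std::min(1.0f, history.confidence + kConfidenceGain)
                            : history.confidence;
    return out;
}
```

<p>In this sketch an interpolated value fully replaces an empty history but has no effect once confidence reaches 1, while shaded samples keep refining the result.</p>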

<p>Here is an image from a face scan that I had lying around. The original scan is from <a href="https://triplegangers.com/">Triplegangers</a> which I
wrapped to my standard topology with <a href="https://www.russian3dscanner.com/wrapx-tutorials/">WrapX</a>. This is a closeup of the pores and specular highlight on the model’s cheek. On the left side is the first frame, and the right side is the converged result. The first row is a
regular 1x frame, and the second row is Forward with 2x2 VRS. The remaining 4 rows are Visibility with VRS at different ratios.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/diagrams/vrs_converged.jpg" /></div>

<p>As you can see, the 2x2 VRS version is blurrier both before and after TAA, but the Visibility VRS version converges to the same result as the reference. The rates at 25% and
higher converge quite quickly. I wouldn’t recommend actually rendering at 10% rate, as the convergence rate is pretty rough, but it does still get to where
it needs to go.</p>

<p>One of the interesting tricks presented in several VRS talks deals with motion. When the object is in motion, it is more difficult for your eye to track it,
and that surface will have motion blur applied anyways. So the advice is to have a low sampling rate in areas of high motion, and a high sampling rate
in areas of low motion.</p>

<p>However, with Visibility VRS, we can change the tradeoff. We can still have a low sampling rate in high motion. But we can have a low sampling rate
in areas of low motion as well, because it will converge in a few frames anyways. We can just have a low shading rate everywhere!</p>

<p>Visibility VRS fundamentally
changes the tradeoff. With standard Hardware VRS, you are trading GPU performance for image quality. In contrast, Visibility VRS is trading GPU performance for
convergence time, with converged image quality staying constant regardless of shading rate.</p>

<p>The interesting thing for me about Visibility VRS is not that it provides any one particular improvement to render times. Rather, the benefit of Visibility is that all these improvements
fit perfectly together. At high triangle counts, Visibility has significant improvements over Deferred/Forward because:</p>

<ol>
<li>Visibility at 1x is faster than pure Deferred/Forward due to quad utilization.</li>
<li>Visibility with VRS gives you a better multiplier than rasterized VRS because the rasterizer needs to round up to larger blocks of pixels.</li>
<li>Visibility with VRS allows more aggressive variable shading rates because it converges to the non-VRS image.</li>
</ol>

<p>Each of these benefits is great on its own, but the real strength is how they multiply together, which will be discussed in the numbers below.</p>

<p><strong>Choosing Pixels</strong></p>

<p>The last detail to discuss is choosing pixels. In the most recent work, the trend is to analyze the image [7,9], detect how much detail is in that
area, and then set the appropriate VRS shading rate. However, we have more flexibility.</p>

<p>We could perform a similar test, determining a cutoff value of quality, but it would cause the performance cost to swing. If you are looking at the sky, you
would have GPU cycles idle, whereas when you are looking at a detailed area you would potentially go over your frame time limit.</p>

<p>Instead, we will choose a <strong>priority</strong> for each pixel, as a 4 bit value. Then we can bin them all up into a histogram and set an explicit threshold value
for how many pixels we want shaded in the final image. We can then find the cutoff value, where the cutoff ends up being a float value between 0 and 16. Each pixel tests its priority plus a dither value
against the threshold. A value of 3.62 would include all pixels of priority 0, 1, 2, and about 62% of the pixels in priority 3.</p>
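
<p>The cutoff search can be sketched on the CPU as follows. This is a minimal C++ illustration under stated assumptions, not the actual implementation: the names <code>FindCutoff</code> and <code>ShouldShade</code> are hypothetical, and the dither value is assumed to be uniformly distributed in [0, 1).</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Returns a cutoff in [0, 16] such that roughly targetCount pixels pass the test.
// histogram[p] is the number of pixels with priority p (lower = more important).
inline float FindCutoff(const std::array<uint32_t, 16>& histogram, uint32_t targetCount)
{
    uint32_t accumulated = 0;
    for (int p = 0; p < 16; ++p)
    {
        uint32_t bin = histogram[p];
        if (accumulated + bin > targetCount)
        {
            // Only part of this bin fits in the budget; keep that fraction.
            // bin is non-zero here because accumulated <= targetCount.
            float fraction = float(targetCount - accumulated) / float(bin);
            return float(p) + fraction;
        }
        accumulated += bin;
    }
    return 16.0f;  // the budget covers every pixel
}

// Per-pixel test: priority plus a dither value in [0, 1) against the cutoff.
inline bool ShouldShade(uint32_t priority, float dither, float cutoff)
{
    return float(priority) + dither < cutoff;
}
```

<p>For example, with 100 pixels in each of priorities 0 through 3 and a budget of 362 pixels, this sketch returns a cutoff of about 3.62, matching the case described above: priorities 0, 1, and 2 pass entirely, and a priority 3 pixel passes only when its dither value is below roughly 0.62.</p>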

<p>Since the interpolation algorithm runs in passes where each pass affects subsequent pixels, pixels in the earlier passes should of course have higher priority. Also,
we have pixels that do not share the same draw call with the 4 neighbors, so those need high priority as well. Thus, the base priority looks like this:</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th>Priority</th><th>Pixels</th>
  </tr>
  <tr align="center">
     <td>0</td><td>Pass 0</td>
  </tr>
  <tr align="center">
     <td>1</td><td>Zero Neighbors</td>
  </tr>
  <tr align="center">
     <td>2</td><td>Pass 1</td>
  </tr>
  <tr align="center">
     <td>3</td><td>Pass 2</td>
  </tr>
  <tr align="center">
     <td>4</td><td>Pass 3</td>
  </tr>
  <tr align="center">
     <td>5</td><td>Pass 4</td>
  </tr>
</table>
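
<p>The table maps to a very small function. Here is a minimal C++ sketch, where <code>BasePriority</code> and its parameter names are hypothetical, and "Zero Neighbors" is taken to mean that none of the 4 interpolation neighbors share the pixel's draw call ID.</p>

```cpp
#include <cassert>
#include <cstdint>

// passIndex: 0..4, the interpolation pass the pixel belongs to.
// hasMatchingNeighbor: true if at least one of the 4 interpolation neighbors
// shares this pixel's draw call ID.
inline uint32_t BasePriority(uint32_t passIndex, bool hasMatchingNeighbor)
{
    assert(passIndex <= 4u);
    if (passIndex == 0) return 0;        // pass 0 is always rendered
    if (!hasMatchingNeighbor) return 1;  // cannot be interpolated without bleeding
    return passIndex + 1;                // pass 1..4 -> priority 2..5
}
```

<p>The result fits comfortably in the 4 bit priority value, leaving the higher values free for whatever other biases get added on top.</p>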

<p>There are several other things we should do. In general, thin objects look quite poor when they do not have enough valid pixels inside them.
So we can do a quick search to see if the nearby 4 neighbors are the same draw call in each direction, and then bias the priority so that edges get a priority increase. Here
is an example image.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/thin/thin_merged_01.jpg" /></div>

<p>Here are the selected pixels with thin object priority bias off:</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/thin/thin_merged_02.jpg" /></div>

<p>And the selected pixels with thin object priority bias on:</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/thin/thin_merged_03.jpg" /></div>

<p>Here is a diff of the image, showing just the pixels that are added by the priority bias.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/thin/thin_merged_04.jpg" /></div>

<p>However, since the total number of pixels is constant, adding pixels somewhere means that we have to take away pixels from somewhere
else. Here is the reverse of the diff, showing the pixels that were removed by giving priority to edges.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/thin/thin_merged_05.jpg" /></div>

<p>The advantage of this approach is that we can simply choose any combination
of metrics we want, and we will always have a consistent number of pixels rendered each frame.</p>

<p><strong>Results</strong></p>

<p>So what do the numbers look like? For the numbers, we will simplify them from previous batches. We will keep track of the length of the PrePass, Material pass, and Lighting pass.
But everything else will be in Other. As always, the Material and Lighting passes in Forward are merged together.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>VRS</th><th>PrePass</th><th>Material</th><th>Lighting</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>Off</td><td>0.019</td><td colspan="2">1.601</td><td>0.565</td><td>2.185</td>
  </tr>
  <tr align="center">
     <td>Forward</td><td>2x2</td><td>0.019</td><td colspan="2">0.442</td><td>0.564</td><td>1.025</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>Off</td><td>0.019</td><td>1.065</td><td>0.748</td><td>0.575</td><td>2.407</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>2x2</td><td>0.019</td><td>0.402</td><td>0.733</td><td>0.579</td><td>1.733</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>Off</td><td>0.042</td><td>1.117</td><td>0.792</td><td>0.996</td><td>2.947</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>25%</td><td>0.042</td><td>0.610</td><td>0.256</td><td>1.556</td><td>2.464</td>
  </tr>
</table>

<p>The results should not be surprising. If we have a small number of very big triangles, VRS works perfectly. I like to compare the ratio between the pass times
with VRS to the pass time without. This tells us how much we are actually saving by turning VRS on for each pass.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>VRS</th><th>Material</th><th>Lighting</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>2x2</td><td colspan="2">27.6%</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>2x2</td><td>37.7%</td><td>97.9%</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>25%</td><td>54.6%</td><td>32.3%</td>
  </tr>
</table>

<p>The results are pretty great for Hardware VRS. In the Forward pass, we would expect 25% of the pixels to take 25% of the time, and in reality it gets 27.6%. The Deferred
Material pass takes 37.7% of the time. In absolute numbers, the Deferred Material pass actually takes less than the Forward pass (0.402ms vs 0.442ms), so it is
plausible that it is hitting another bottleneck, such as primitive setup costs. Bandwidth cost is also a plausible bottleneck. Regardless, it’s still a good improvement. The Deferred
Lighting pass is largely unchanged.</p>

<p>Interestingly, the Visibility Material pass sees the smallest relative gain of the three. It runs at 54.6% of the original time, which is inferior to the other
two methods. The lighting pass fares better, requiring 32.3% of the time of the original. My suspicion is that the memory access pattern of the Visibility VRS
algorithm is less efficient as it becomes sparser.</p>

<p>Next up, here are numbers for the medium-density triangles case, where each triangle is around 5-10 pixels.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>VRS</th><th>PrePass</th><th>Material</th><th>Lighting</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>Off</td><td>0.132</td><td colspan="2">3.886</td><td>0.572</td><td>4.590</td>
  </tr>
  <tr align="center">
     <td>Forward</td><td>2x2</td><td>0.131</td><td colspan="2">3.076</td><td>0.572</td><td>3.779</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>Off</td><td>0.132</td><td>2.930</td><td>0.769</td><td>0.592</td><td>4.423</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>2x2</td><td>0.132</td><td>2.236</td><td>0.748</td><td>0.584</td><td>3.700</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>Off</td><td>0.159</td><td>1.617</td><td>0.828</td><td>1.008</td><td>3.612</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>25%</td><td>0.158</td><td>0.697</td><td>0.268</td><td>1.565</td><td>2.688</td>
  </tr>
</table>

<p>As the triangle count goes up, the cost of the rasterization passes (Forward and Deferred Materials) jumps
significantly, but Visibility remains relatively stable. The more interesting finding is the relative VRS savings.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>VRS</th><th>Material</th><th>Lighting</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>2x2</td><td colspan="2">79.2%</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>2x2</td><td>76.3%</td><td>97.3%</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>25%</td><td>43.1%</td><td>32.4%</td>
  </tr>
</table>

<p>The Forward and Deferred Material passes take 79.2% and 76.3% of their non-VRS timings respectively. As triangles get small,
the savings from Hardware VRS drop significantly. However the Visibility pass
is running at 43.1% of its original time. Ideally, it would run at 25%, but 43.1% is still pretty good.</p>

<p>Finally, here are the numbers for the high-density triangles case.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>VRS</th><th>PrePass</th><th>Material</th><th>Lighting</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>Off</td><td>1.004</td><td colspan="2">8.958</td><td>0.591</td><td>10.553</td>
  </tr>
  <tr align="center">
     <td>Forward</td><td>2x2</td><td>1.005</td><td colspan="2">8.975</td><td>0.574</td><td>10.554</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>Off</td><td>1.005</td><td>4.654</td><td>0.771</td><td>0.583</td><td>7.013</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>2x2</td><td>1.005</td><td>5.080</td><td>0.762</td><td>0.582</td><td>7.429</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>Off</td><td>1.154</td><td>1.703</td><td>0.835</td><td>1.025</td><td>4.717</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>25%</td><td>1.153</td><td>0.718</td><td>0.267</td><td>1.582</td><td>3.720</td>
  </tr>
</table>

<p>Once again, we can compare the relative cost of the VRS passes to the non-VRS passes.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>VRS</th><th>Material</th><th>Lighting</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>2x2</td><td colspan="2">100.2%</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>2x2</td><td>109.2%</td><td>98.8%</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>25%</td><td>42.2%</td><td>32.0%</td>
  </tr>
</table>

<p>The Deferred Material case is a bit surprising, as I definitely did not expect it to be 9.2% higher. I looked around the PIX capture to see
if that pass was overlapping with something unexpected, but nothing jumps out. The slowdown looks to be real, although it is trivial to remove
as you could simply turn VRS off in that case. As the triangles get tiny, Hardware VRS gains disappear.</p>

<p>Let’s think about this another way. Here is a comparison of the cost of just the Material pass for Deferred vs Visibility
at Low-Density and Medium-Density.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Low-Density 1x</th><th>Medium-Density 1x</th><th>Medium-Density VRS</th>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>1.065</td><td>2.930</td><td>2.236</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1.117</td><td>1.617</td><td>0.697</td>
  </tr>
  <tr align="center">
     <td>Ratio</td><td>0.953x</td><td>1.812x</td><td>3.208x</td>
  </tr>
</table>

<p>We are getting cascading efficiency gains from Visibility rendering. The Deferred Material pass
is slightly faster than the Visibility Material pass with low-density triangles at 1x, where Visibility runs at 95.3% of the speed of Deferred.
But as the density goes to medium, the Visibility Material shaders run ~1.8x faster than the Deferred pass. And then with a 25% rate,
Visibility has a better reduction of work multiplied on top of that, and the Visibility Material VRS pass is now ~3.2x faster than the Deferred
pass.</p>
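
<p>To show how the two gains multiply, here is a small C++ sketch that rebuilds the Medium-Density ratios from the Material pass timings in the tables above. The struct and function names are hypothetical; only the timings come from the measurements.</p>

```cpp
// Material pass timings in milliseconds, copied from the Medium-Density table.
struct PassTimes { float off; float vrs; };

constexpr PassTimes kDeferredMedium   = {2.930f, 2.236f};
constexpr PassTimes kVisibilityMedium = {1.617f, 0.697f};

// How much faster Visibility is than Deferred at 1x (no VRS).
inline float SpeedupAt1x()
{
    return kDeferredMedium.off / kVisibilityMedium.off;
}

// Fraction of the 1x time that remains once VRS is turned on.
inline float VrsFraction(PassTimes t)
{
    return t.vrs / t.off;
}

// Combined speedup = 1x speedup * (Deferred VRS fraction / Visibility VRS fraction).
inline float SpeedupWithVrs()
{
    return SpeedupAt1x() * (VrsFraction(kDeferredMedium) / VrsFraction(kVisibilityMedium));
}
```

<p>With these inputs, the 1x speedup is about 1.81x, and the difference in VRS efficiency (76.3% versus 43.1%) contributes a further factor of roughly 1.77x, which together give the ~3.2x in the table.</p>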

<p>The numbers get even more extreme as we aim for 1 pixel triangles. Here is the same comparison for the High-density triangle case. Note that since the Deferred VRS pass was actually higher,
I switched it for the non-VRS number.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Low-Density 1x</th><th>High-Density 1x</th><th>High-Density VRS</th>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>1.065</td><td>4.654</td><td>4.654</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1.117</td><td>1.703</td><td>0.718</td>
  </tr>
  <tr align="center">
     <td>Ratio</td><td>0.953x</td><td>2.733x</td><td>6.482x</td>
  </tr>
</table>

<p>When doing benchmarks, the common question to ask is: How much faster is this workload? Visibility with VRS is faster, but that’s not the point. Rather, the
question in my mind is: What kind of workload could I run?</p>

<p>The larger gain is that Visibility VRS fundamentally changes the scaling of the Material
and Lighting passes. In the Medium-density case, the material pass is 3.2x faster. That means we could, in theory, have ~3.2x as many nodes in our material graphs.
In the High-density case, the material pass is ~6.5x faster. Separately, the lighting pass is ~3.1x faster. The benefit is not that we
can reduce frame time. Rather, the benefit is that we can significantly increase our material and lighting complexity while fitting in the same budget.</p>

<p>The actual, real gains will be smaller of course. We have to pay for the fixed cost of the extra passes, which eats away at the gains. And the speedup does
not apply the same to all types of shading. For example, if we are accumulating denoised shadows via stochastic ray tracing, then we aren’t necessarily
increasing the convergence rate by shading at a variable rate.</p>

<p>We won’t actually get a 6.5x/3.2x gain in material complexity or a 3.1x gain in lighting complexity. But even a 1.2x gain in either category is a big win.
The results are compelling even though real-world gains will be smaller than the numbers from this synthetic test case.</p>

<p><strong>Decoupled Visibility Multisampling</strong></p>

<p>Does this work with DVM from the previous post? In short, yes. Visibility VRS does not change the fundamental structure of the GBuffer. So the single frame
case “just works”. The one area that still requires more work is TAA. I adjusted the DVM accumulation formula, and it works, but it is not quite as clean in
movement for small objects as the regular 1x TAA case. In order to properly tweak TAA you need to spend several months obsessing over every corner case in your content,
fiddling with passes and numbers to optimize every detail. Unfortunately, I don’t have real content to test with, and it doesn’t make sense to spend several months
fine-tuning the parameters of a toy engine. TAA with DVM converges when stationary and subjectively looks acceptable in motion. But it is definitely
not as good as it could be, and optimizing it will have to wait for another day.</p>

<div style="text-align:center;"><img src="/images/2021_07_19_vrs/dvm_vrs.jpg" /></div>

<p><strong>Sparse GBuffers</strong></p>

<p>One side note is that the GBuffer is sparse. However, most screen-space passes require sampling nearby values in the GBuffer. So what are the options?</p>

<ol>
<li><b>Extract Full-Res:</b> The simplest option would be to run the same pass on the GBuffer as we do on lighting. I.e. just expand it to full-res. That
would have a significant bandwidth cost, but it's the easiest solution.</li>
<li><b>Extract Half-Res:</b> As an alternative, we could skip the full-res GBuffer, and just go to half-res. Would it be acceptable for any pass which needs
neighbors to get a half-res version? Would SSAO really degrade that much if it was forced to use a half-res normal?</li>
<li><b>Embrace Sparsity:</b> Then again, do we actually need a full-res GBuffer? Perhaps, instead of storing a full GBuffer, we could get away with only storing a list of several nearby samples?
For example, in Subsurface Scattering we generally want to randomly sample a nearby point. We don't actually care about that particular point. Rather we just want to randomly
sample from a reasonable point a certain distance away without bias. So if each pixel gave us a list of 4 pixels to choose from, we could randomly choose one of those 4. It should be possible to tweak
the math to do it without bias. But this would require a thorough examination of all the GBuffer passes, and most engines have a lot of them.</li>
</ol>

<p>It seems plausible that we can do most of our screen-space passes without paying for a full-res GBuffer. Maybe a hybrid approach would be best,
such as extracting world normals to full-res but leaving everything else sparse? I don’t have the answer to that question, but it seems like an interesting problem
to solve.</p>

<p><strong>Sample Choosing</strong></p>

<p>Also worth noting is that various presentations on VRS have discussed different metrics for reducing the shading rate. In this implementation,
the only inputs to determining pixel priority are the pass index, the distance to an edge, and if a pixel has no neighbors. But there are many other options. In no particular order:</p>

<ul>
<li>Pixels that lack detail can of course use fewer samples. Classic detection method is a Sobel filter.</li>
<li>Objects in motion tend to be blurry, so we can reduce samples on pixels with large motion vectors.</li>
<li>Areas under heavy transparency can reduce sample count.</li>
<li>Pixels under the scene GUI certainly do not need to be rendered at full rate.</li>
<li>Areas that are out-of-focus from DOF can be sampled at a lower rate as well.</li>
<li>We can disable shading rate in the skybox pixels.</li>
<li>Foveated Rendering in VR can significantly drop the shading rate.</li>
</ul>

<p>I’m sure there are others. Which of those would make sense to use? Honestly, all of them, and I still see the possibility of major wins from using these techniques.
But in order to do that, it would need to be tested on a wider variety of content which was impractical for this implementation.</p>

<p><strong>References</strong></p>

<p>[1] Next-Generation Gaming with AMD RDNA 2 and DirectX 12 Ultimate. AMD. (<a href="https://community.amd.com/t5/blogs/next-generation-gaming-with-amd-rdna-2-and-directx-12-ultimate/ba-p/427032">https://community.amd.com/t5/blogs/next-generation-gaming-with-amd-rdna-2-and-directx-12-ultimate/ba-p/427032</a>)</p>

<p>[2] Variable Rate Shading in Call of Duty: Modern Warfare. Michal Drobot. (<a href="https://research.activision.com/publications/2020-09/software-based-variable-rate-shading-in-call-of-duty--modern-war">https://research.activision.com/publications/2020-09/software-based-variable-rate-shading-in-call-of-duty--modern-war</a>)</p>

<p>[3] Variable Rate Shading Tier 1 with Microsoft DirectX 12 From Theory to Practice. Marissa du Bois and John Gibson. (<a href="https://www.youtube.com/watch?v=d-qEvmVcg8I">https://www.youtube.com/watch?v=d-qEvmVcg8I</a>)</p>

<p>[4] Get Started with Variable Rate Shading on Intel Processor Graphics. Intel. (<a href="https://software.intel.com/content/www/us/en/develop/articles/getting-started-with-variable-rate-shading-on-intel-processor-graphics.html">https://software.intel.com/content/www/us/en/develop/articles/getting-started-with-variable-rate-shading-on-intel-processor-graphics.html</a>)</p>

<p>[5] Deferred Adaptive Compute Shading. Ian Mallet and Cem Yuksel. (<a href="https://geometrian.com/data/research/dacs/HPG2018_DeferredAdaptiveComputeShading.pdf">https://geometrian.com/data/research/dacs/HPG2018_DeferredAdaptiveComputeShading.pdf</a>)</p>

<p>[6] Efficient Adaptive Deferred Shading with Hardware Scatter Tiles. Ian Mallet, Cem Yuksel, and Larry Seiler. (<a href="https://dl.acm.org/doi/abs/10.1145/3406184">https://dl.acm.org/doi/abs/10.1145/3406184</a>)</p>

<p>[7] VRWorks - Variable Rate Shading (VRS). NVIDIA. (<a href="https://developer.nvidia.com/vrworks/graphics/variablerateshading">https://developer.nvidia.com/vrworks/graphics/variablerateshading</a>)</p>

<p>[8] Variable Rate Shading: A scalpel in a world of sledgehammers. Jacques van Rhyn. (<a href="https://devblogs.microsoft.com/directx/variable-rate-shading-a-scalpel-in-a-world-of-sledgehammers/">https://devblogs.microsoft.com/directx/variable-rate-shading-a-scalpel-in-a-world-of-sledgehammers/</a>)</p>

<p>[9] Moving Gears to Tier 2 Variable Rate Shading. Jacques van Rhyn. (<a href="https://devblogs.microsoft.com/directx/gears-vrs-tier2/">https://devblogs.microsoft.com/directx/gears-vrs-tier2/</a>)</p>

<p>[10] A Deferred Material Rendering System. Tomasz Stachowiak. (<a href="https://onedrive.live.com/view.aspx?resid=EBE7DEDA70D06DA0!115&amp;app=PowerPoint&amp;authkey=!AP-pDh4IMUug6vs">https://onedrive.live.com/view.aspx?resid=EBE7DEDA70D06DA0!115&amp;app=PowerPoint&amp;authkey=!AP-pDh4IMUug6vs</a>)</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Introduction]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222021_07_19_vrs/vrs_header_01.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222021_07_19_vrs/vrs_header_01.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Decoupled Visibility Multisampling</title><link href="https://filmicworlds.com/blog/decoupled-visibility-multisampling/" rel="alternate" type="text/html" title="Decoupled Visibility Multisampling" /><published>2021-07-13T00:00:00+00:00</published><updated>2021-07-13T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/decoupled-visibility-multisampling</id><content type="html" xml:base="https://filmicworlds.com/blog/decoupled-visibility-multisampling/"><![CDATA[<p><strong>Adventures in Visibility Rendering</strong></p>
<ul>
<li>Part 1: <a href="/blog/visibility-buffer-rendering-with-material-graphs/">Visibility Buffer Rendering with Material Graphs</a></li>
<li>Part 2: Decoupled Visibility Multisampling</li>
<li>Part 3: <a href="/blog/software-vrs-with-visibility-buffer-rendering/">Software VRS with Visibility Buffer Rendering</a></li>
<li>Part 4: <a href="/blog/visibility-taa-and-upsampling-with-subsample-history/">Visibility TAA and Upsampling with Subsample History</a></li>
</ul>

<p><strong>Introduction</strong></p>

<p>Decoupled Visibility Multisampling (DVM) is an antialiasing technique that works by decoupling the geometry and shading sample rates. Triangle visibility is rendered with
8x MSAA, but a 1x GBuffer is rendered. Most prior art that splits the geometry and shading rate starts with the assumption that shading rate should always be
at least 1x, and then tries to add extra samples with as little overhead as possible. In contrast, DVM renders with a fixed rate of 4 samples per 4 pixels in a single 2x2 quad. When
a given pixel needs more than one sample, instead of <b>adding</b> a sample, DVM <b>switches</b> the sample within the 2x2 quad. This approach keeps the GBuffer at a fixed size,
and avoids the branches and divergent GBuffer texture fetches of other approaches. In cases where a 2x2 quad needs more than 4 samples, the approach falls back to TAA.</p>

<p>Standard TAA algorithms will generally begin the first frame with an aliased image, like the one below:</p>
<div style="text-align:center;"><img src="/images/2021_07_13_dvm/hairball/hairball_0_aliased.jpg" /></div>

<p>However, we’re not going to do that. Rather, we will use our 8x MSAA Visibility buffer to perform an edge-aware dilation of our samples allowing
the first frame in the sequence to have anti-aliased edges.</p>
<div style="text-align:center;"><img src="/images/2021_07_13_dvm/hairball/hairball_1_dvm.jpg" /></div>

<p>Finally, each additional frame will refine the result using temporal information for shading, much like TAA.</p>
<div style="text-align:center;"><img src="/images/2021_07_13_dvm/hairball/hairball_2_dvm.jpg" /></div>

<p>The net result is that we can achieve MSAA style edges with temporal accumulation similar to TAA.</p>

<p>As mentioned in the previous post, there are very interesting things you can do with Visibility rendering. And if you have not read that post, I’d highly recommend that
you do before reading this one. The main point was that with Visibility rendering, the shading pass is completely decoupled from the geometry rasterization pass.
And the previous post showed the benefits of that decoupling for performance, mainly in terms of quad overdraw.</p>

<p>Previous Post: <a href="/blog/visibility-buffer-rendering-with-material-graphs/">Visibility Buffer Rendering with Material Graphs</a></p>

<p>But the much more interesting aspect of decoupling the shading pass from rasterization is that your shading rate is no longer tied to your rasterization rate. In this
particular variation, we are going to render the visibility buffer at 8x MSAA. However, the shading rate will be fixed at 1x. Since the shading rate is 1x,
we use a standard, regular 1x GBuffer (with a little bit of sample switching), for standard, regular shading. Then we can use a visibility buffer to dilate those samples onto
geometry at a higher spatial sampling rate. Since we are exploiting the fact that visibility decouples geometry samples from shading samples, the natural name for this
technique is Decoupled Visibility Multisampling (DVM).</p>

<p><strong>Visual Acuity and Vernier Acuity</strong></p>

<p>Before diving in, there is one other question we should address: Why do this? Why is it desirable to decouple the geometry sampling rate from the shading sampling rate?</p>

<p>Mathematically, aliasing happens when a signal is undersampled. There are many different types of aliasing that can happen in realtime graphics. Aliasing can happen
both inside and outside of triangles. So why are triangle edges so important?</p>

<p>We tend to imagine that our eye is like a camera. It’s assumed that the rods and cones in our retina sense the amount of light hitting them, that image gets sent to our brains,
and we see that image in our consciousness. But that’s not how it actually works. We don’t have enough bandwidth in our retinal nerve to pass all that information
from our eyes to our brain. For a more complete explanation of the cells in the visual system, <a href="https://www.wikiwand.com/en/">wikiwand.com</a> has detailed articles on <a href="https://www.wikiwand.com/en/Photoreceptor_cell">Photoreceptor Cells</a>, the <a href="https://www.wikiwand.com/en/Receptive_field">Receptive Field</a>,
and <a href="https://www.wikiwand.com/en/Retinal_ganglion_cell">Retinal Ganglion Cells</a>.</p>

<p>The short version is that our retina has the well-known “Rods and Cones” that detect color and luminance. However, this information does not go straight to your brain. Rather
it goes into your retinal ganglion cells, which have different receptive fields. The retinal ganglion cells accumulate information from different rods and cones and
pass a compressed signal to the brain. What we “see” in our heads isn’t what the real world actually looks like. Rather, it’s an approximation pieced together from
the various contrast detection algorithms in our eyes.</p>

<div style="text-align:center;"><a href="https://www.wikiwand.com/en/Receptive_field"><img src="/images/2021_07_13_dvm/receptive_field.png" /></a></div>

<p><i>Receptive Field image from <a href="https://www.wikiwand.com/en/Receptive_field">https://www.wikiwand.com/en/Receptive_field</a></i></p>

<p>The human eye is literally a neural network. The photoreceptors are split between cells that get excited and inhibited by light, which allows the network of retinal
ganglion cells behind them to act as a contrast detection filter. Additionally, these ganglion cells fire at different rates based on the signal from the cones. If nearby
retinal ganglion cells are sending a signal at a different rate, the cells behind them in the network can infer subtle details about the gradients.</p>

<p>As it turns out, our ability as humans to detect slight misalignment in edges is actually much higher than our ability to resolve details. The resolution at which
we can detect misalignment is called <strong>Vernier Acuity</strong>, as opposed to resolving details which is called <strong>Visual Acuity</strong>. This aspect of vision has been exploited
for over a century using vernier calipers. Vernier calipers rely on our ability to detect misalignment to add an extra decimal point to precise measurements.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/vernier_caliper.jpg" /></div>

<p>Want to see this effect on your own eyes? Take the test yourself at <a href="https://michaelbach.de/ot/lum-hyperacuity/">https://michaelbach.de/ot/lum-hyperacuity/</a>.</p>

<div style="text-align:center;"><a href="https://michaelbach.de/ot/lum-hyperacuity/"><img src="/images/2021_07_13_dvm/vernier_test.png" /></a></div>

<p>Go to this website, and I would recommend setting up your monitor 5 or 10 feet away. The empty space in the “Landolt C”
is the same size as the gap between the misaligned lines. Try to get the C as small as possible. At that point, you should still be able to see the misalignment since
your vernier acuity greatly exceeds your visual acuity. While that is a somewhat cursory test, the medical profession has done proper, thorough studies of
vernier acuity.</p>

<p>According to Wikipedia, in a typical person, visual acuity has resolution of about 0.6 arc minutes, whereas vernier acuity is 0.13 arc minutes [19]. By those numbers, vernier acuity resolution is 4.6x more detailed.</p>
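<p>The arithmetic behind that ratio is short enough to write out. This sketch uses only the two acuity figures quoted above. The helper names are made up for the example, and the per-degree conversion is just 60 arc minutes per degree.</p>

```python
ARCMIN_PER_DEGREE = 60.0
VISUAL_ACUITY_ARCMIN = 0.6    # figure quoted in the text
VERNIER_ACUITY_ARCMIN = 0.13  # figure quoted in the text

def resolvable_per_degree(acuity_arcmin):
    """Number of resolvable steps per degree of visual angle."""
    return ARCMIN_PER_DEGREE / acuity_arcmin

def acuity_ratio(visual=VISUAL_ACUITY_ARCMIN, vernier=VERNIER_ACUITY_ARCMIN):
    """How many times finer vernier acuity is than visual acuity (per axis)."""
    return visual / vernier

ratio = acuity_ratio()
```

<p>Note that this is a per-axis ratio: it describes how finely an edge position can be judged along one direction, not an area.</p>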

<p>So what does this mean for us in terms of computer graphics? Suppose we are rendering a triangle. We have detailed textures and shading inside the triangle. And of course,
the edges of the triangle can cause aliasing as well. Our ability to resolve texture details inside will be limited by visual acuity, but our ability to
see aliasing is determined by vernier acuity, which is roughly 4.6x stronger.</p>

<p>If we render the triangle at the resolution of visual acuity, the interior shading will look fine but the edges will be clearly aliased. However, if we render everywhere at a resolution
that is high enough to exceed vernier acuity, we will be wasting cycles rendering detail that we can not see. For the ideal tradeoff between quality and performance
we want to decouple the geometry and shading resolution. We want our shading resolution high enough to exceed visual acuity and the geometry resolution high enough to 
exceed vernier acuity.</p>

<p><strong>Multisampling Anti-Aliasing (MSAA)</strong></p>

<p>Multisampling performs shading and geometry at separate resolutions. As you probably know, multisampling renders subsamples instead of pixels.
In this case, we have a 2x2 group of pixels, with each pixel having 8 subsamples. They are split into two triangles, with orange on the left and green on the right.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa-intro-b-1.png" /></div>

<p>For each pixel/triangle pair, the GPU will run the pixel shader once, and then apply the same color to the other samples on the same pixel and triangle.
On the upper left pixel, the GPU will calculate the orange triangle at one point, and the green triangle at another point, requiring 2 separate shader invocations.
However, on the right the pixel shader will only run once.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa-intro-b-2.png" /></div>
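<p>To make the counting rule concrete, here is a small sketch that counts pixel shader invocations for one pixel from its per-subsample triangle IDs. The data layout and the 5/3 split of the edge pixel are assumptions for illustration. Real hardware works on 2x2 quads and coverage masks, which this ignores.</p>

```python
def msaa_shader_invocations(subsample_triangles):
    """Pixel shader runs for one pixel: one per distinct triangle that covers
    at least one of its subsamples. None marks an uncovered subsample."""
    return len({t for t in subsample_triangles if t is not None})

ORANGE, GREEN = 0, 1

# Hypothetical coverage: the triangle edge crosses the upper left pixel,
# while the upper right pixel is entirely inside the green triangle.
upper_left = [ORANGE] * 5 + [GREEN] * 3
upper_right = [GREEN] * 8

invocations_left = msaa_shader_invocations(upper_left)
invocations_right = msaa_shader_invocations(upper_right)
```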

<p>In terms of perception, this approach is ideal. In the interior of the triangles, we have no extra work to do. And then on a triangle edge, we would run an extra
sample just where we need it. We can shade our pixels at the resolution of visual acuity, and then use 4x or 8x MSAA to render our edges at the
limit of vernier acuity. In the image below the left side has no AA applied, and the right side uses 8x MSAA. If you are interested, there are much more in-depth discussions of MSAA, such as Matt Pettineo’s [13].</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/box_msaa_split.jpg" /></div>

<p><i>On the left side MSAA is disabled, and 8x MSAA enabled on the right.</i></p>

<p><i>Forward MSAA</i></p>

<p>The most obvious method to render with MSAA is Forward. But Forward generally performs worse than Deferred and Visibility.
You can read my last post discussing the impact of quad utilization, but there are other issues as well. Everything must be rendered in one giant forward pass,
even though it is (usually) more efficient to split rendering into several smaller passes. There is no GBuffer for screen-space effects. And MSAA has performance
issues with small triangles above and beyond the 1x sampling case.</p>

<p>There are several other “forward-ish” approaches for AA as well. In particular there is Jorge Jimenez’s Filmic SMAA, which ships in the Call of Duty franchise [6].
Additionally, the GPU vendors have implemented custom formats that split color and depth into different sample rates. NVIDIA developed CSAA [12]
and AMD has developed EQAA [1]. Splitting the shading and geometry sample rate is definitely not a new concept!</p>

<p><i>Deferred MSAA</i></p>

<p>Another option is to render deferred, and write an MSAA GBuffer. But the trick is to only perform lighting
calculations where you need it. The first approach was from Andrew Lauritzen at Intel who proposed classifying tiles to split the GBuffer lighting passes
into 1x and 2x versions [8]. This approach was implemented in CryEngine3 [18] as well as a Dx11 sample from NVIDIA [11]. Matt Pettineo
has a writeup of his tests as well with code you can download [14].</p>

<p>While those approaches can work, they usually have a fixed cost overhead in terms of memory and bandwidth to render the larger GBuffer. Anecdotally, I’ve had
several backchannel discussions with teams that implemented a similar approach (like rasterizing a 2x MSAA GBuffer and only shading the edges twice). All
of them removed it shortly thereafter because:</p>
<ol>
<li>The fixed cost of even a 2x GBuffer is quite high in terms of memory and performance.</li>
<li>Since many passes read from the GBuffer, the code complexity becomes a significant burden on engineering time.</li>
</ol>

<p>In a real engine, you end up with lots and lots of corner cases where you need to read the GBuffer. Splitting that computation quickly becomes
unwieldy, especially if that calculation requires branches or divergent texture/memory fetches. I’ve had many variations of the same conversation with
developers who implemented 2x MSAA GBuffers and then removed it in favor of TAA.</p>

<p><i>Other Approaches</i></p>

<p>There are many other creative approaches to splitting the shading rate from geometry rate. One of the earlier approaches was Surface Based Anti-Aliasing (SBAA)
from Marco Salvi and Kiril Vidimče at Intel [16]. They would choose a value of N, and store at most N samples for an individual pixel using a clever multi-pass
rendering approach. The DAIS paper [17] compresses visibility into a linked list of visibility samples per pixel to only render as many as are needed. The
Subpixel Reconstruction Anti-Aliasing (SRAA) paper [2] renders 16x depths, 1x GBuffer, and
uses the depth to assist in interpolation between GBuffer samples. HRAA [5] uses the GCN rasterization modes to split coverage and shading.
The Aggregate G-Buffer Anti-Aliasing paper [3] and [4] renders MSAA depth and uses Target Independent Rasterization
to perform a unique depth test on different render targets to render a special Aggregate G-Buffer. Finally, the Adaptive Temporal Anti Aliasing paper [10] actually shoots
rays to gather samples in-between the pixels of a 1x GBuffer.</p>

<p><i>Triangle Culling, Quad Utilization and MSAA</i></p>

<p>Another major issue when shading with MSAA is the pair of quad utilization and triangle culling, especially as triangles become very small. While the long-term goal in graphics is to have 1 pixel per triangle,
that technically implies one <b>visible</b> triangle per pixel. If we want triangles to be one pixel wide and tall, we want the area to actually be 0.5 pixels per triangle such
that half the triangles are rejected as empty. To achieve film style rendering we want triangles that are 1x1 pixels in size, not 1.4x1.4 pixels. We actually want 1 vertex per pixel,
which means 2 triangles per pixel.</p>

<p>As an example, suppose we are rendering at 1080p (2 million pixels). For each pixel to hit a unique triangle, we actually need about 4 million triangles in our frame. In a standard 1x renderer,
half of those triangles will be discarded. Here is a quick example, and note that the triangles which do not cover a pixel center are discarded as zero-pixel
triangles. Fortunately, those triangles will not run any pixel shader invocations.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/quad-utilization-msaa-1.png" /></div>
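<p>The 1080p numbers are easy to check. This sketch assumes exactly 2 triangles per pixel and exactly one visible triangle per pixel, which is the idealized case from the text rather than anything measured.</p>

```python
def triangle_budget(width, height, triangles_per_pixel=2.0):
    """Return (pixels, total_triangles, culled_triangles) for the idealized case
    where exactly one triangle per pixel covers the pixel center."""
    pixels = width * height
    total = pixels * triangles_per_pixel
    visible = pixels            # one visible triangle per pixel
    culled = total - visible    # the rest miss every pixel center at 1x
    return pixels, total, culled

pixels, total, culled = triangle_budget(1920, 1080)
```

<p>So “2 million pixels” is really 2,073,600, and “about 4 million triangles” is 4,147,200, with half of them culled at 1x.</p>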

<p>In the image above, the green triangles would have to run a shader invocation, but the greyed out triangles would not because they do not touch a pixel center. Since
each triangle must run its own 2x2 pixel shader job, the GPU would render the pixel shader 4 times per pixel. But at least 4 shader invocations per pixel is as bad as
it gets. However, with MSAA, it can actually get worse. In the image below, with MSAA, we now need to render the purple triangles as well.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/quad-utilization-msaa-2.png" /></div>

<p>All of those triangles which did not cover any pixel centers suddenly start covering subsamples (except really tiny triangles that do not cover a single subsample). If
we are rendering on average 2 triangles per pixel, we suddenly have doubled the number of pixel shader invocations we need. Instead of 4 per pixel, now we are at 8.</p>

<p>It still gets worse. In the grid below, the thick lines are the 2x2 quad boundaries. And if you notice, these triangles in red have subsamples on both sides
of the quad boundary. While these triangles would be culled in the 1x case, they each require 4 pixel shader invocations for each side, or 8 total.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/quad-utilization-msaa-3.png" /></div>

<p>And in the even worse case? This cyan triangle up here covers 4 quad boundaries, and would require 16 total pixel shader invocations just for that one tiny little
triangle.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/quad-utilization-msaa-4.png" /></div>

<p>This might seem extreme, but it’s actually not. Since quad boundaries are 2 pixels apart, a triangle that is 1 pixel wide has a roughly 50% chance of touching a quad boundary.
The actual chance of spawning multiple quads is a bit lower, since it would need to cover a sample on both sides of the boundary, not just touch the border. With triangles less than a pixel in size, MSAA gets absolutely wrecked by quad utilization.</p>
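<p>Here is a minimal sketch of the quad accounting in this section. It assumes the GPU launches a full 2x2 quad (4 invocations) for every quad in which a triangle covers at least one subsample, and it models the “50% chance” as a 1D span dropped uniformly at random on a line with boundaries every 2 pixels. The function names and the model itself are assumptions for illustration.</p>

```python
def quad_of(px, py):
    """2x2 quad index that contains integer pixel (px, py)."""
    return (px // 2, py // 2)

def quad_invocations(covered_pixels):
    """covered_pixels: integer (x, y) pixels in which the triangle covers at
    least one subsample. Each touched quad costs 4 pixel shader invocations."""
    quads = {quad_of(x, y) for x, y in covered_pixels}
    return 4 * len(quads)

def boundary_touch_probability(width_px, spacing_px=2.0):
    """Chance that a span of width_px, placed uniformly at random along one
    axis, touches a boundary when boundaries are spacing_px apart."""
    return min(1.0, width_px / spacing_px)

inside_one_quad = quad_invocations([(0, 0)])
across_one_boundary = quad_invocations([(1, 0), (2, 0)])
across_a_corner = quad_invocations([(1, 1), (2, 1), (1, 2), (2, 2)])
```

<p>The three cases match the 4, 8, and 16 invocations described above. As the text says, the real chance of spawning extra quads is lower than the touch probability, because the triangle has to cover a subsample on both sides.</p>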

<p>If we want to get triangles down to 1 pixel in longest dimension and we want MSAA, then Forward shading is going to be rough.
But before we try Visibility, we should also look at the most common anti-aliasing solution: TAA.</p>

<p><strong>Temporal Antialiasing</strong></p>

<p>To start off, here is a simple scene of a few long, thin cubes (like thin rods). The original image with no anti-aliasing looks like this.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/crop_plain_1x.jpg" /></div>

<p><i>Long, thin cubes with no Anti-Aliasing</i></p>

<p>With MSAA, we can fix this issue quite effectively.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/wide_msaa8_crop.jpg" /></div>

<p><i>Long, thin cubes with 8x MSAA</i></p>

<p>Due to the cost of MSAA, the most common approach for achieving cleaner edges in realtime is Temporal Anti-Aliasing (TAA).
For a thorough discussion of prior art regarding Temporal Antialiasing, I’d recommend the paper from Lei Yang, Shiqiu Liu, and Marco Salvi [20]. While there are many variations,
most implementations follow the outline of Timothy Lottes’s paper [9], Brian Karis’s presentation [7], and Marco Salvi’s presentation [15]. For this post,
I’m using a standard TAA approach with 8 samples and the standard 8x multisampling locations (N Rooks instead of Halton). Also,
for this implementation I’m performing color clamping in RGB space.</p>

<p>While details vary among implementations, the key ideas are:</p>

<ol>
  <li>Jittered Projection Matrix</li>
  <li>Reprojection of the previous frame</li>
  <li>Color clamp/variance check to reduce ghosting.</li>
</ol>
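<p>The third idea is the one that matters most for the rest of this post, so here is a minimal per-pixel sketch of it. It assumes the history color has already been reprojected, uses a plain RGB min/max box over the current frame’s neighborhood, and blends with a fixed weight. The function name and signature are made up for the example, and real implementations vary quite a bit (variance clipping, other color spaces, and so on).</p>

```python
def taa_resolve(current, history, neighborhood, t=0.1):
    """One TAA resolve step for a single pixel.

    current:      RGB of this frame's jittered sample.
    history:      RGB fetched from the reprojected previous result.
    neighborhood: RGB values of the current frame around the pixel (itself included).
    t:            blend weight given to the new frame.
    """
    lo = [min(c[i] for c in neighborhood) for i in range(3)]
    hi = [max(c[i] for c in neighborhood) for i in range(3)]
    # Clamp the history into the neighborhood's color box to limit ghosting.
    clamped = [min(max(history[i], lo[i]), hi[i]) for i in range(3)]
    # Exponential blend toward the current frame.
    return tuple(clamped[i] + t * (current[i] - clamped[i]) for i in range(3))

# History inside the box is kept and blended normally.
kept = taa_resolve((1.0, 1.0, 1.0), (0.0, 0.0, 0.0),
                   [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)])
# History outside the box (a red ghost over a grey patch) is clamped away.
rejected = taa_resolve((0.5, 0.5, 0.5), (1.0, 0.0, 0.0), [(0.5, 0.5, 0.5)])
```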

<p>Rather than rendering 8x samples in a single frame, the idea is to render 1x sample each frame, but jitter the projection matrix so that after 8 frames, you
have a sample from all 8 positions. Here is a simple shot from my toy engine showing two of the raw original frames. Note that the jaggies shift from
one frame to the next due to the different subpixel offset. Obviously, they are quite aliased.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/wide_taa_off_split.png" /></div>
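<p>For reference, here is a sketch of how the per-frame jitter could be applied. The offset table is the standard D3D 8x pattern in 1/16-pixel units as I remember it, so please check it against the API documentation before relying on it. The matrix layout (row-major, column vectors, clip = P * v) and the function names are assumptions, and the sign of the Y offset depends on your conventions.</p>

```python
# 8x sample offsets from the pixel center, in 1/16 pixel units (assumed values).
MSAA8_OFFSETS_16THS = [
    (1, -3), (-1, 3), (5, 1), (-3, -5),
    (-5, 5), (-7, -1), (3, 7), (7, -7),
]

def jitter_ndc(frame_index, width, height):
    """Subpixel offset for this frame in NDC units (a pixel is 2/width wide)."""
    ox, oy = MSAA8_OFFSETS_16THS[frame_index % len(MSAA8_OFFSETS_16THS)]
    return (2.0 * (ox / 16.0) / width, 2.0 * (oy / 16.0) / height)

def apply_jitter(proj, jitter):
    """Return a copy of a 4x4 projection matrix with the NDC offset baked in.
    Adding dx * (w row) to the x row shifts clip.x by dx * clip.w, which is
    a shift of dx after the perspective divide."""
    dx, dy = jitter
    out = [row[:] for row in proj]
    out[0] = [a + dx * w for a, w in zip(proj[0], proj[3])]
    out[1] = [a + dy * w for a, w in zip(proj[1], proj[3])]
    return out

def project(proj, v):
    """Transform a homogeneous point and return its NDC (x, y)."""
    clip = [sum(p * c for p, c in zip(row, v)) for row in proj]
    return (clip[0] / clip[3], clip[1] / clip[3])
```

<p>The “N Rooks” property is visible in the table: no two samples share an X offset or a Y offset, so 8 frames give 8 distinct positions along each axis.</p>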

<p>Then we can accumulate and over time we converge to the following image:</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/wide_taa_on_final.png" /></div>

<p>If nothing moves, then in theory the results are equivalent to supersampling. The edges are clean and look similar to MSAA/supersampling. The interiors of triangles
tend to have less aliasing since TAA can help fix issues like speckles in hot specular highlights. It can also look softer than MSAA since it will
clamp the peaks and valleys in the signal due to the color clamp.</p>

<p>While the edges look great if nothing moves, the obvious problem is that in games, things tend to move. You can see it in the typical situations like camera moves, but
the most problematic areas tend to be deformable objects like grass. And even though TAA is built from 8 sequential frames, it doesn’t actually converge in 8 frames.
In TAA, if a pixel moves too far, it has to reject the pixel history and start with a jagged, aliased image. Then it gets refined by using a fraction of the weight
of each new frame. Most implementations seem to have a value in the range of 5% to 10%. We’ll call this value <b>T</b>.</p>

<p>As an example, if your <b>T</b> value is 10%, after the first frame, you will retain 90% of your original 1x frame. This process continues, so after <b>N</b> 
frames that original frame still counts for pow(1-<b>T</b>,<b>N</b>) of your final image. Below is a quick table for how much influence the original frame has after <b>N</b> frames have passed.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th><b>N</b></th><th>8</th><th>15</th><th>30</th><th>45</th><th>60</th>
  </tr>
  <tr align="center">
     <td><b>T</b>=0.10</td><td>43.0%</td><td>20.6%</td><td>4.2%</td><td>0.87%</td><td>0.18%</td>
  </tr>
  <tr align="center">
     <td><b>T</b>=0.05</td><td>66.3%</td><td>46.3%</td><td>21.5%</td><td>9.9%</td><td>4.6%</td>
  </tr>
</table>
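<p>The table is just pow(1-<b>T</b>,<b>N</b>) evaluated at a few points. Here is a short sketch that reproduces it, plus a helper for how many frames it takes for the original frame to drop below a given weight. The helper names and the 1% threshold in the example are arbitrary choices of mine.</p>

```python
import math

def original_frame_weight(t, n):
    """Fraction of the final image still owed to the original frame after n blends."""
    return (1.0 - t) ** n

def frames_until_below(t, threshold):
    """Smallest n for which the original frame's weight is at or below threshold."""
    return math.ceil(math.log(threshold) / math.log(1.0 - t))

FRAME_COUNTS = (8, 15, 30, 45, 60)
table = {
    t: [100.0 * original_frame_weight(t, n) for n in FRAME_COUNTS]
    for t in (0.10, 0.05)
}

frames_10 = frames_until_below(0.10, 0.01)
frames_05 = frames_until_below(0.05, 0.01)
```

<p>By this measure, getting the initial aliased frame under 1% takes 44 frames at <b>T</b>=0.10 and 90 frames at <b>T</b>=0.05, which is a long time for anything that moves.</p>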

<p>Obviously, higher values of <b>T</b> will converge faster than low values, but the tradeoff is increased flickering. The short answer is that there is
no perfect solution, and tradeoffs have to be made. If you are at <b>T</b>=0.1, you’ll see quite a bit of flicker. If the object is stationary or moving slowly then TAA works very well. However, strong movements will
invalidate the history at the edge, and TAA will fail if the history is being invalidated faster than TAA can converge. And as we all know, objects can move quite
a distance in 30 to 60 frames.</p>

<p>We also have another major issue with TAA: Thin objects. Now let’s look at the same image, but move the camera farther away. Here is what two of our jittered images look
like before any TAA is applied.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/thin_taa_off_split.png" /></div>

<p>And here is the final result:</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/thin_taa_on_low.png" /></div>

<p>So, why does this happen? Think about the sequence of three frames.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/diagrams/taa-thin-diagram.png" /></div>

<p>In the first frame, the patch of all pixels is grey, and naturally the min/max color box is grey as well. In the next frame, a thin object goes through
the patch and will lerp with the original image using <b>T</b> as the weight. The problem happens in the third frame. We have another patch of all grey pixels, the
min and max are grey, and the previous frame is clamped out.</p>
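<p>The three-frame sequence can be simulated for a single pixel with scalar luminance. The grey and wire values, the blend weight, and the idea that the min/max box spans exactly grey to wire in the middle frame are assumptions for the sketch.</p>

```python
def clamp_and_blend(current, history, box_min, box_max, t):
    """Clamp history into the current neighborhood box, then blend in the new sample."""
    clamped = min(max(history, box_min), box_max)
    return clamped + t * (current - clamped)

GREY, WIRE, T = 0.5, 1.0, 0.1

# Frame 1: the jittered sample misses the wire, so the whole patch is grey.
frame1 = GREY
# Frame 2: the sample hits the wire. The box spans grey..wire, so history survives.
frame2 = clamp_and_blend(WIRE, frame1, GREY, WIRE, T)
# Frame 3: the sample misses again. The box collapses to grey and the
# small amount of wire color we accumulated is clamped away.
frame3 = clamp_and_blend(GREY, frame2, GREY, GREY, T)

# Same third frame without the clamp, for comparison.
frame3_unclamped = frame2 + T * (GREY - frame2)
```

<p>With the clamp, the pixel returns to pure grey on the third frame. Without it, a little of the wire would persist, but so would ghosting when the wire really does move.</p>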

<p>At a glance it might seem like the solution is to incorporate more information, such as storing the variance of the pixel. The problem is that sometimes thin
objects like this are stationary, and sometimes they move. For example, if you have a thin wire swinging in the wind you would want the clamp to keep it from ghosting.
But if it is stationary, you would want to avoid the clamp and keep it in the accumulated history. Unfortunately, you don’t know if the wire has actually moved until 8 frames later, so 
you have to make an imperfect guess. It’s a tradeoff between flickering and ghosting, and there is no perfect answer.</p>

<p>As the object becomes less than one pixel in width, it seems to vanish. TAA handles this case by color clamping, and as a result the image “disappears”. In addition
to thin objects like wires, you can also see this effect in real games on faceted edges, like the trim on buildings. For reference, here is the MSAA8 image
which is effectively the reference solution:</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/thin_msaa8.png" /></div>

<p>The primary advantage of TAA is the simplicity for the rest of the pipeline. You just render everything at 1x and TAA fixes the rest. It works very well if you
have objects more than a pixel wide that have minimal movement. But it breaks down when you have too much movement or objects thinner than a pixel. Otherwise
the common solution is MSAA.</p>

<p><strong>Decoupled Visibility Multisampling</strong></p>

<p>To fix these problems, let’s try something a little different. Suppose we render triangle visibility for a scene at 8x MSAA. Here is a little 2x2 quad, and the grey path is the material edge from
several triangles. The subsamples would each have information on which material they point to. In this case, two different
materials touch this 2x2 quad.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_0_b_vis.png" /></div>

<p>However, instead of jittering a projection matrix and rendering at 1x, let’s just extract the 1x shading samples from our 8x visibility buffer. We could actually do the same algorithm as TAA. Instead of rendering with a jittered projection matrix, we can just choose the samples from the visibility
buffer that correspond to those same positions. And from here, we could render everything normally just like a 1x buffer.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_1_a.png" /></div>

<p>However, we have information that a TAA algorithm does not. In particular, TAA does not know anything about the samples off the jitter pattern. They could
have moved from the previous frame or they could be in the same spot. However, with a full visibility buffer we know exactly which material they belong to, so
we can expand the influence of our samples in the same pixel.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_1_b.png" /></div>

<p>We have several subsamples that are not filled in though, as they do not have a sample on the same pixel to use data from. In the case of MSAA, these empty subsamples would be on separate
triangles, and the pixel shader would evaluate an additional time. But we can get a good approximation by choosing
another sample on the quad which has the same material.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_1_c.png" /></div>
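<p>The two dilation steps so far can be sketched for a single 2x2 quad. This is a simplified model under my own assumptions: materials are plain integer IDs, every pixel shades the same subsample index in a given frame, and when several pixels in the quad could donate a color the first one wins. The names are hypothetical and this is not the shader from the actual implementation.</p>

```python
def dilate_quad(materials, shaded_index):
    """Assign a shading source to every subsample in a 2x2 quad.

    materials[p][s]: material ID at subsample s of pixel p (p in 0..3).
    shaded_index:    the subsample each pixel shades this frame.
    Returns source[p][s]: the pixel whose shaded color subsample s reuses,
    or None if no shaded sample in the quad has the same material.
    """
    shaded_material = [materials[p][shaded_index] for p in range(4)]
    source = []
    for p in range(4):
        row = []
        for m in materials[p]:
            if m == shaded_material[p]:
                row.append(p)  # step 1: reuse the sample on the same pixel
            else:
                # step 2: borrow from another pixel in the quad with that material
                donors = [q for q in range(4) if shaded_material[q] == m]
                row.append(donors[0] if donors else None)
        source.append(row)
    return source

A, B = 0, 1
quad = [
    [A, A, A, A, A, B, B, B],  # edge crosses this pixel
    [B] * 8,
    [A] * 8,
    [B, B, B, B, A, A, B, B],  # edge crosses this pixel too
]
resolved = dilate_quad(quad, shaded_index=0)

# A quad where all four shaded samples land on material A.
starved = dilate_quad([[A] * 5 + [B] * 3, [A] * 8, [A] * 8, [A] * 8], shaded_index=0)
```

<p>In the second quad, the three B subsamples come back as None because no shaded sample in the quad has material B.</p>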

<p>The benefits are pretty obvious. Using visibility information, we can achieve the same edge quality as MSAA8 as long as we are willing to relax our restrictions
on where the colors come from. MSAA guarantees that every sample will come from a centroid that is on the same triangle and the same pixel. Whereas in this
case we are relaxing our restriction to be the same material (not triangle) on the same 2x2 quad (not pixel).</p>

<p>But not all situations are handled so easily. A few frames later, we will end up at this case:</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_2_a.png" /></div>

<p>If we dilate the samples, we still have unknown subsamples. Since all 4 chosen samples were from the same material, we do not have any similar samples to choose
from in the 2x2 quad.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_3.png" /></div>

<p>The most obvious and accurate solution would be to add a new sample.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_6.png" /></div>

<p>There would be a cost for adding the sample, but it would not drastically change the computation needed. It would only increase the total number of samples
by a small percentage. But it adds a huge amount of subtle complexity to the renderer. Any time we need to read the GBuffer, we would need to do some kind of branch,
and potentially fetch data from a non-cache-friendly location. Various screen-space passes need the GBuffer. Various materials need the GBuffer. Adding a divergent
cost to every single pass is a pretty big deal, even if the actual number of samples added is small.</p>

<p>This approach is very similar in spirit to several previous approaches. The closest direct comparison is Surface Based Anti-Aliasing from Marco Salvi and Kiril Vidimče [16].
However, the SRAA [2], ATAA [10], and DAIS [17] papers include very similar concepts as well. Still, I would much prefer to keep the GBuffer exactly
at 1x, even if it means a reasonable degradation in quality compared to a more accurate solution that adds samples.</p>

<p>So, instead of <b>adding</b> samples, why don’t we try <b>switching</b> the samples?</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_5.png" /></div>

<p>We can disable the bottom left sample, and enable a new one on the upper left in its place. Now we have at least one sample for each of the two materials in this quad.
Using that sample might seem like a strange choice, though. Wouldn’t this sample make more sense, since it is closer and on the same pixel?</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_4.png" /></div>

<p>The reason is starvation. Ideally, we want every sample to have the same chance of being used to avoid biasing the final image. However, if we were to prioritize
samples from the same pixel, that pixel would always be switched out and never contribute. So when we need to switch a sample, we randomly pick from the available
samples instead of prioritizing samples from the same pixel. This approach will bias the image as certain samples have more weight than others, but it’s the best we can do.</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_switch_7.png" /></div>

<p>You might notice that the samples in the bottom left are chosen randomly, as opposed to using the nearest pixel. Once again, the goal is to minimize bias. There will
be plenty of bias of course, but we can at least try to reduce it as much as possible.</p>
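<p>The switching rule can be sketched in a few lines of C++. This is only one reading of the description above, not code from the renderer: it assumes a sample is eligible to be switched out when another sample in the quad already has the same material, and it picks uniformly among the eligible ones. The function name and the material ID array are hypothetical.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Returns the index (0..3) of the sample to switch out, or -1 if no sample is
// redundant. A sample is treated as redundant when another sample in the quad
// already has the same material, so disabling it loses no material.
// This selection rule is an interpretation of the text, not the post's code.
inline int pickSampleToSwitch(const std::array<uint32_t, 4>& materialOfSample,
                              std::mt19937& rng) {
    std::vector<int> candidates;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            if (i != j && materialOfSample[i] == materialOfSample[j]) {
                candidates.push_back(i);
                break;
            }
        }
    }
    assert(candidates.size() <= 4);
    if (candidates.empty()) return -1;
    // Uniform choice: no preference for the pixel that contains the uncovered
    // subsamples, so no single pixel is permanently starved of its own sample.
    std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);
    return candidates[pick(rng)];
}
```

<p>With four samples on the same material, every index is a candidate and each is returned with equal probability. With four distinct materials there is nothing to switch and the function returns -1.</p>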

<p>Finally, what do we do if we have too many materials?</p>

<div style="text-align:center;"><img width="400" src="/images/2021_07_13_dvm/diagrams/msaa_sample_overflow.png" /></div>

<p>In this case, we try to fall back to TAA as gracefully as possible. We choose 4 of the materials, and the subsamples that belong to the remaining materials are left uncovered.
In practice, the number of 2x2 quads that touch 5 or more unique materials is quite low, so it’s acceptable for the algorithm to become slightly blurry in that case as
long as it does not flicker or cause a strong visual artifact.</p>

<p>The data structure is quite simple. For each sample, we need a triangle visibility ID just like the regular visibility algorithm. Since a 2x2 quad at 8x MSAA
has exactly 32 subsamples, each sample has a 32-bit mask to store the coverage. Also, since samples can be switched, we need to store a 5-bit index to keep track of
which of the 32 possible locations to use for lighting calculations. So the total additional storage per pixel is a 32-bit uint for the visibility ID (triangle and draw call ID),
a 32-bit uint for the coverage mask, and a 5-bit index for this sample’s subpixel location.</p>
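<p>As a rough illustration of that layout, here is a minimal C++ sketch of the per-sample record. The struct name, the field names, and the particular numbering of the 32 locations (pixel in quad times 8, plus subsample in pixel) are assumptions made for illustration. The only things taken from the description are the field sizes.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-sample record for one pixel of a 2x2 quad at 8x MSAA.
// Field names and the slot numbering are assumptions for illustration.
struct DvmSample {
    uint32_t visibilityId;  // triangle and draw call ID, as in a 1x visibility buffer
    uint32_t coverageMask;  // one bit per subsample of the 2x2 quad (4 pixels * 8 samples)
    uint8_t  slot;          // 5-bit index (0..31): the subsample this sample is shaded at
};

constexpr uint32_t kSamplesPerPixel = 8;

// Assumed numbering: slot = pixelInQuad * 8 + subsampleInPixel,
// with pixelInQuad = x + 2 * y inside the quad.
inline uint32_t slotPixel(uint8_t slot) {
    assert(slot < 32);
    return slot / kSamplesPerPixel;
}

inline uint32_t slotSubsample(uint8_t slot) {
    assert(slot < 32);
    return slot % kSamplesPerPixel;
}

// Bits of the quad-wide coverage mask that belong to one pixel of the quad.
inline uint32_t pixelMask(uint32_t pixelInQuad) {
    assert(pixelInQuad < 4);
    return 0xFFu << (pixelInQuad * kSamplesPerPixel);
}

// A sample has been switched when it is shaded at a subsample of another pixel.
inline bool isSwitched(const DvmSample& s, uint32_t owningPixelInQuad) {
    return slotPixel(s.slot) != owningPixelInQuad;
}
```

<p>Under this numbering, slot 13 decodes to pixel 1, subsample 5. A sample stored for pixel 0 with slot 13 would therefore count as switched, which is the case where the screen-space (x,y) position needs extra care.</p>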

<p>The visibility uint32 is stored in the exact same way as the previous post. We can treat the GBuffer as a standard, 1x GBuffer. There are no extra branches
or data reads when performing GBuffer lighting calculations. We just need to be a little more careful when determining the screen-space
(x,y) position.</p>

<p>Finally, how do we perform the resolve? Each pixel needs the 4 visibility IDs and the 4 masks of its 2x2 quad. Since every pixel in the quad needs the same four values,
each pixel only has to read its own visibility ID and mask. The 4 samples can then be shared using SM 6.0 quad intrinsics without actually using shared memory. Since
the data is per 2x2 quad, it should also work if we need to read the GBuffer data in either a compute or a rasterization pass. The easiest way to explain it
is with a source code snippet.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="n">myColor</span>  <span class="o">=</span> <span class="n">TexFetch</span><span class="p">();</span>
<span class="n">uint</span>   <span class="n">myMask</span>   <span class="o">=</span> <span class="n">MaskFetch</span><span class="p">();</span>
<span class="n">float3</span> <span class="n">sumColor</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="n">uint</span> <span class="n">numMarked0</span> <span class="o">=</span> <span class="n">countbits</span><span class="p">(</span><span class="n">myMask</span> <span class="o">&amp;</span> <span class="n">baseMask</span><span class="p">);</span>
<span class="n">sumColor</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked0</span><span class="p">)</span> <span class="o">*</span> <span class="n">myColor</span><span class="p">;</span>

<span class="n">uint</span> <span class="n">numMarked1</span> <span class="o">=</span> <span class="n">countbits</span><span class="p">(</span><span class="n">QuadReadAcrossX</span><span class="p">(</span><span class="n">myMask</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">baseMask</span><span class="p">);</span>
<span class="n">sumColor</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked1</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossX</span><span class="p">(</span><span class="n">myColor</span><span class="p">);</span>

<span class="n">uint</span> <span class="n">numMarked2</span> <span class="o">=</span> <span class="n">countbits</span><span class="p">(</span><span class="n">QuadReadAcrossY</span><span class="p">(</span><span class="n">myMask</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">baseMask</span><span class="p">);</span>
<span class="n">sumColor</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked2</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossY</span><span class="p">(</span><span class="n">myColor</span><span class="p">);</span>

<span class="n">uint</span> <span class="n">numMarked3</span> <span class="o">=</span> <span class="n">countbits</span><span class="p">(</span><span class="n">QuadReadAcrossDiagonal</span><span class="p">(</span><span class="n">myMask</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">baseMask</span><span class="p">);</span>
<span class="n">sumColor</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked3</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossDiagonal</span><span class="p">(</span><span class="n">myColor</span><span class="p">);</span>

<span class="n">uint</span> <span class="n">totalMarked</span> <span class="o">=</span> <span class="n">numMarked0</span> <span class="o">+</span> <span class="n">numMarked1</span> <span class="o">+</span> <span class="n">numMarked2</span> <span class="o">+</span> <span class="n">numMarked3</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">avgColor</span> <span class="o">=</span> <span class="n">sumColor</span> <span class="o">*</span> <span class="nf">rcp</span><span class="p">(</span><span class="kt">float</span><span class="p">(</span><span class="n">totalMarked</span><span class="p">));</span></code></pre></figure>
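<p>For reference, the same weighting can be written as a scalar loop on the CPU. This is a sketch under stated assumptions rather than the shader itself: baseMask is not defined in the snippet above, so the sketch assumes it selects the 8 bits of the quad-wide mask that belong to the pixel being resolved. The early-out for zero coverage is also an addition, to avoid dividing by zero when none of the pixel’s subsamples are covered.</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Float3 { float r, g, b; };

struct QuadSample {
    Float3   color;  // shaded color of this pixel's single sample
    uint32_t mask;   // which of the quad's 32 subsamples this sample covers
};

// Portable population count (stands in for HLSL countbits).
inline uint32_t countBits(uint32_t v) {
    uint32_t n = 0;
    for (; v != 0; v &= v - 1) ++n;  // clears the lowest set bit each iteration
    return n;
}

// Resolve one pixel of a 2x2 quad by weighting each of the quad's 4 samples by
// how many of this pixel's subsamples it covers. Assumption: bits
// [8 * pixelInQuad, 8 * pixelInQuad + 7] of a mask belong to that pixel.
inline Float3 resolvePixel(const std::array<QuadSample, 4>& quad, uint32_t pixelInQuad) {
    assert(pixelInQuad < 4);
    const uint32_t baseMask = 0xFFu << (8u * pixelInQuad);
    Float3 sum{0.0f, 0.0f, 0.0f};
    uint32_t totalMarked = 0;
    for (const QuadSample& s : quad) {
        const uint32_t numMarked = countBits(s.mask & baseMask);
        const float w = static_cast<float>(numMarked);
        sum.r += w * s.color.r;
        sum.g += w * s.color.g;
        sum.b += w * s.color.b;
        totalMarked += numMarked;
    }
    // No covered subsamples: return the zero sum. A real resolve would
    // presumably lean on the TAA history here instead.
    if (totalMarked == 0) return sum;
    const float inv = 1.0f / static_cast<float>(totalMarked);
    return Float3{sum.r * inv, sum.g * inv, sum.b * inv};
}
```

<p>If a red sample covers half of a pixel’s subsamples and a blue sample covers the other half, the pixel resolves to an even mix of the two, regardless of which pixel of the quad each sample was shaded on.</p>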

<p>This shader also required changing the standard TAA algorithm in a few ways. Typically, a TAA algorithm would calculate the color clamp
based on the 3x3 neighborhood. However, I was able to get better results for determining the clamp for each pixel in the 2x2 grid by doing a 3x3
 search and only using other pixels that share the same draw call ID. Then the min and max are interpolated in the same way as the regular resolve.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="n">myMin</span>    <span class="o">=</span> <span class="n">MinOfNeighborsWithSameDrawCallId</span><span class="p">();</span>
<span class="n">float3</span> <span class="n">myMax</span>    <span class="o">=</span> <span class="n">MaxOfNeighborsWithSameDrawCallId</span><span class="p">();</span>
<span class="n">float3</span> <span class="n">sumMin</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">sumMax</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="n">sumMin</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked0</span><span class="p">)</span> <span class="o">*</span> <span class="n">myMin</span><span class="p">;</span>
<span class="n">sumMax</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked0</span><span class="p">)</span> <span class="o">*</span> <span class="n">myMax</span><span class="p">;</span>

<span class="n">sumMin</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked1</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossX</span><span class="p">(</span><span class="n">myMin</span><span class="p">);</span>
<span class="n">sumMax</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked1</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossX</span><span class="p">(</span><span class="n">myMax</span><span class="p">);</span>

<span class="n">sumMin</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked2</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossY</span><span class="p">(</span><span class="n">myMin</span><span class="p">);</span>
<span class="n">sumMax</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked2</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossY</span><span class="p">(</span><span class="n">myMax</span><span class="p">);</span>

<span class="n">sumMin</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked3</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossDiagonal</span><span class="p">(</span><span class="n">myMin</span><span class="p">);</span>
<span class="n">sumMax</span> <span class="o">+=</span> <span class="kt">float</span><span class="p">(</span><span class="n">numMarked3</span><span class="p">)</span> <span class="o">*</span> <span class="nf">QuadReadAcrossDiagonal</span><span class="p">(</span><span class="n">myMax</span><span class="p">);</span>

<span class="n">float3</span> <span class="n">avgMin</span> <span class="o">=</span> <span class="n">sumMin</span> <span class="o">*</span> <span class="nf">rcp</span><span class="p">(</span><span class="kt">float</span><span class="p">(</span><span class="n">totalMarked</span><span class="p">));</span>
<span class="n">float3</span> <span class="n">avgMax</span> <span class="o">=</span> <span class="n">sumMax</span> <span class="o">*</span> <span class="nf">rcp</span><span class="p">(</span><span class="kt">float</span><span class="p">(</span><span class="n">totalMarked</span><span class="p">));</span></code></pre></figure>
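<p>The two helpers that gather the per-pixel bounds are not shown above. One plausible reading, sketched here on the CPU, is a 3x3 search in which only neighbors with the same draw call ID as the center pixel contribute. The image layout, the type names, and the choice to skip neighbors outside the image are assumptions.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Float3 { float r, g, b; };
struct ClampBox { Float3 lo, hi; };

// Hypothetical 1x image: one color and one draw call ID per pixel, row-major.
struct Image {
    int width, height;
    std::vector<Float3>   color;
    std::vector<uint32_t> drawCallId;
};

// Min/max over the 3x3 neighborhood of (x, y), using only pixels whose draw
// call ID matches the center pixel. The center always matches itself, so the
// box is never empty. Neighbors outside the image are skipped (an assumption).
inline ClampBox clampBoxSameDrawCall(const Image& img, int x, int y) {
    assert(x >= 0 && x < img.width && y >= 0 && y < img.height);
    const uint32_t centerId = img.drawCallId[y * img.width + x];
    const Float3 c = img.color[y * img.width + x];
    ClampBox box{c, c};
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const int nx = x + dx;
            const int ny = y + dy;
            if (nx < 0 || nx >= img.width || ny < 0 || ny >= img.height) continue;
            const int i = ny * img.width + nx;
            if (img.drawCallId[i] != centerId) continue;  // different surface: ignore
            const Float3 n = img.color[i];
            box.lo = Float3{std::min(box.lo.r, n.r), std::min(box.lo.g, n.g), std::min(box.lo.b, n.b)};
            box.hi = Float3{std::max(box.hi.r, n.r), std::max(box.hi.g, n.g), std::max(box.hi.b, n.b)};
        }
    }
    return box;
}
```

<p>Filtering by draw call ID keeps a bright neighbor on a different object from widening the box, which is what makes the resulting clamp tighter than a plain 3x3 min/max.</p>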

<p>This approach gives you a tighter box that is weighted by coverage. One lingering issue was that in some cases, thin objects would flicker. This problem occurs
on a surface that has a sharp change in lighting within a single pixel, such as faceted edges. In some cases, the jittered position is
in the bright pixels, other times in the dark pixels. As a somewhat hacky solution, I faded out the color clamp in cases where camera movement is less than
0.05 pixels, but a better solution would be to clamp based on per-pixel variance.</p>

<p>Here are some results with the thin cubes from before. The left side is an aliased frame (what the first frame of TAA would look like), and the right
side is the first frame with DVM.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/wide_dvm_on_off_split.png" /></div>

<p><i>Regular 1x image on left, and the DVM image on the right.</i></p>

<p>And here is the comparison between the temporally accumulated DVM image and the MSAA image.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/wide_split_dvm_vs_msaa.png" /></div>

<p><i>DVM on left, 8xMSAA on right.</i></p>

<p>Here is the same thing for the most difficult case, after zooming out where the thin object is less than a pixel wide. This is the case where TAA fails to converge to the correct result due to the color clamp.</p>

<div style="text-align:center;"><img src="/images/2021_07_13_dvm/zoom/thin_split_dvm_vs_msaa.png" /></div>

<p><i>DVM on left, 8xMSAA on right.</i></p>

<p>Of course, this should not be a surprise. If there are 4 or fewer unique materials in a 2x2 quad, DVM is using the exact same coverage mask as MSAA for each material.
The only difference is that MSAA is ensuring a unique color sample for each unique triangle per pixel, whereas DVM is only averaging one color sample per pixel and using jitter
to temporally accumulate the shading results (like TAA).</p>

<p><strong>Performance</strong></p>

<p>For testing performance, I’ll use the same scenes as in the previous blog post, although I made a few changes to the methodology. In the previous post,
the shadows were very inefficient because the dense meshes were rendered naively. So for this test, the shadows were tweaked so that if nothing changes
from the previous frame, the depth maps are reused. Since these tests all use a static camera, no shadow depth rendering
happens in them.</p>

<p>The results below show timing numbers of several different rendering paths.</p>

<ul>
<li><b>Forward:</b> Regular forward rendering with TAA.</li>

<li><b>Msaa 2x/4x/8x:</b> Regular forward with MSAA, and no TAA since they perform an MSAA resolve.</li>

<li><b>Deferred:</b> Deferred shading with TAA.</li>

<li><b>Visibility:</b> Visibility rendering (1x) with TAA.</li>

<li><b>DVM:</b> Decoupled Visibility MSAA, with 8x samples.</li>

</ul>

<p>Here is a short discussion of the passes:</p>

<ul>
<li><b>PrePass:</b> The Prepass performs a depth-only prepass for the Forward, Deferred, and MSAA renders. For Visibility, it
renders 1x Depth and 1x VisibilityID, whereas for DVM it renders 8x Depth and 8x VisibilityID.</li>

<li><b>Material:</b> Forward and MSAA have this pass combined with lighting, since a single shader performs material
evaluation and lighting. Deferred performs a regular GBuffer rasterization pass. Visibility and DVM take an identical
code path which evaluates the material samples in an indirect compute shader.</li>


<li><b>Lighting:</b> Forward and MSAA skip this pass (as it is combined with Material). Deferred performs a full screen lighting
pass in a compute shader. Visibility and DVM run lighting as an indirect compute shader, skipping sky pixels.</li>

<li><b>Motion:</b> The rendering paths have very different algorithms for calculating motion vectors. Forward and Deferred calculate static motion vectors
from reprojecting the depth to the previous view, and then render moving geometry a second time. The MSAA path calculates a 1x depth buffer, and then proceeds along
the same path as Forward/Deferred. Visibility and DVM perform motion vector calculations in a compute shader. The MSAA versions are slower due to the extra depth resolve.</li>

<li><b>Resolve/TAA:</b> This category accounts for the Resolve and TAA time. Forward, Deferred, and Visibility perform regular TAA. MSAA applies a simple resolve in Reinhard
space, and DVM applies the more complicated resolve algorithm.</li>

<li><b>VisUtil:</b> This number includes all the major Visibility and DVM passes. It includes analyzing the Visibility ID buffer and radix sorting
the materials. It also includes the DVM analysis pass which chooses and switches the samples.</li>

<li><b>Other:</b> This category includes everything else. There are several miscellaneous barriers, a debug GUI pass, HDR tonemapping, and several extra copies. Note
that the MSAA paths are about 0.1ms longer than the other render paths because of redundant copies: some of the passes use render target textures, and others
use buffers. If this were a real production renderer, those copies would be optimized out. But to keep the code simple, there are a few extra copies which are included in this number.</li>

<li><b>Shadows:</b> Note that there is no shadow depth pass. When nothing is moving, shadows are reused from the previous frame. So shadow depth maps are not being
rasterized in these tests.</li>

<li><b>Total:</b> Total is the full render time. Note that Other is determined from Total minus all the other passes.</li>

</ul>

<p>Low-Density Triangles:</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>PrePass</th><th>Material</th><th>Lighting</th><th>VisUtil</th><th>Motion</th><th>Resolve/TAA</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>0.020</td><td colspan="2">1.600</td><td></td><td>0.093</td><td>0.177</td><td>0.353</td><td>2.243</td>
  </tr>
  <tr align="center">
     <td>Msaa 2x</td><td>0.038</td><td colspan="2">1.615</td><td></td><td>0.092</td><td>0.075</td><td>0.443</td><td>2.263</td>
  </tr>
  <tr align="center">
     <td>Msaa 4x</td><td>0.072</td><td colspan="2">1.648</td><td></td><td>0.092</td><td>0.079</td><td>0.439</td><td>2.330</td>
  </tr>
  <tr align="center">
     <td>Msaa 8x</td><td>0.124</td><td colspan="2">1.686</td><td></td><td>0.109</td><td>0.251</td><td>0.439</td><td>2.609</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>0.019</td><td>1.06</td><td>0.753</td><td></td><td>0.093</td><td>0.178</td><td>0.363</td><td>2.466</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>0.043</td><td>1.239</td><td>0.797</td><td>0.347</td><td>0.103</td><td>0.176</td><td>0.332</td><td>3.109</td>
  </tr>
  <tr align="center">
     <td>DVM</td><td>0.224</td><td>1.250</td><td>0.795</td><td>0.719</td><td>0.118</td><td>0.418</td><td>0.373</td><td>3.989</td>
  </tr>
</table>

<p>Medium-Density Triangles (about 10 pixels per triangle):</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>PrePass</th><th>Material</th><th>Lighting</th><th>VisUtil</th><th>Motion</th><th>Resolve/TAA</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>0.132</td><td colspan="2">3.881</td><td></td><td>0.093</td><td>0.176</td><td>0.354</td><td>4.636</td>
  </tr>
  <tr align="center">
     <td>Msaa 2x</td><td>0.211</td><td colspan="2">4.705</td><td></td><td>0.110</td><td>0.083</td><td>0.463</td><td>5.572</td>
  </tr>
  <tr align="center">
     <td>Msaa 4x</td><td>0.361</td><td colspan="2">5.649</td><td></td><td>0.146</td><td>0.121</td><td>0.464</td><td>6.741</td>
  </tr>
  <tr align="center">
     <td>Msaa 8x</td><td>0.560</td><td colspan="2">6.382</td><td></td><td>0.205</td><td>0.252</td><td>0.460</td><td>7.859</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>0.132</td><td>2.942</td><td>0.764</td><td></td><td>0.094</td><td>0.178</td><td>0.363</td><td>4.473</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>0.158</td><td>1.614</td><td>0.826</td><td>0.466</td><td>0.104</td><td>0.175</td><td>0.228</td><td>3.645</td>
  </tr>
  <tr align="center">
     <td>DVM</td><td>0.626</td><td>1.618</td><td>0.827</td><td>0.921</td><td>0.119</td><td>0.420</td><td>0.261</td><td>4.882</td>
  </tr>
</table>

<p>High-Density Triangles (about 1 pixel per triangle):</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>PrePass</th><th>Material</th><th>Lighting</th><th>VisUtil</th><th>Motion</th><th>Resolve/TAA</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>1.004</td><td colspan="2">8.987</td><td></td><td>0.094</td><td>0.176</td><td>0.391</td><td>10.652</td>
  </tr>
  <tr align="center">
     <td>Msaa 2x</td><td>0.874</td><td colspan="2">15.262</td><td></td><td>0.112</td><td>0.099</td><td>0.457</td><td>16.804</td>
  </tr>
  <tr align="center">
     <td>Msaa 4x</td><td>0.949</td><td colspan="2">23.261</td><td></td><td>0.148</td><td>0.157</td><td>0.468</td><td>24.983</td>
  </tr>
  <tr align="center">
     <td>Msaa 8x</td><td>1.487</td><td colspan="2">27.211</td><td></td><td>0.235</td><td>0.257</td><td>0.463</td><td>29.653</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>1.006</td><td>4.640</td><td>0.772</td><td></td><td>0.093</td><td>0.221</td><td>0.371</td><td>7.103</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1.156</td><td>1.696</td><td>0.836</td><td>0.294</td><td>0.101</td><td>0.175</td><td>0.404</td><td>4.742</td>
  </tr>
  <tr align="center">
     <td>DVM</td><td>1.530</td><td>1.745</td><td>0.836</td><td>0.965</td><td>0.118</td><td>0.421</td><td>0.259</td><td>5.968</td>
  </tr>
</table>

<p>All timing numbers are in milliseconds. The numbers are pretty similar to the previous post, which makes sense since it is using the same rendering algorithm. The biggest change
from the previous numbers is that the shadow pass has been removed. It does look like the shadow pass was overlapping with the
other passes more than I had previously thought, which was changing the numbers. For example, in my original tests, with low density triangles, the
Deferred Material pass and the Visibility Material pass were the exact same length. That was likely caused by overlap from the shadow depth pass,
and now the Visibility Material pass is 17% longer, which makes sense for the extra vertex interpolation and partial derivative calculations.</p>

<p>The DVM path takes about 1.2ms more than Visibility, although that number is closer to 0.9ms in the low triangle density case. GPUs optimize the memory
layout of MSAA to minimize the fetching cost when triangles are not dense, which explains the discrepancy. Spending 1.2ms to ensure that your first frame
has clean edges instead of waiting for TAA to converge seems like a pretty good tradeoff. Of course, it would cost more than that on older hardware, making the
technique less compelling. Also, those passes are unoptimized, so I’m sure there is room for improvement.</p>

<p>And finally, the cost for MSAA with Forward rendering is absolutely brutal. Here is a table of just the Forward shader pass from the Forward and MSAA render paths:</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Low</th><th>Medium</th><th>High</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>1.600</td><td>3.881</td><td>8.987</td>
  </tr>
  <tr align="center">
     <td>Msaa 2x</td><td>1.615</td><td>4.705</td><td>15.262</td>
  </tr>
  <tr align="center">
     <td>Msaa 4x</td><td>1.648</td><td>5.649</td><td>23.261</td>
  </tr>
  <tr align="center">
     <td>Msaa 8x</td><td>1.686</td><td>6.382</td><td>27.211</td>
  </tr>
</table>

<p>Just…ouch. When the triangles are not dense, MSAA 8x is great. The cost only increases by a tiny 5% compared to 1x. But switching to dense triangles takes us from 1.686ms to
27.211ms. When triangles get small, performance with MSAA falls off a cliff.</p>

<p>And here is the same table, divided by the Forward/Low pass (i.e. dividing all numbers by 1.600). This gives us the cost of the pass relative to the
baseline, best-case Forward cost.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Low</th><th>Medium</th><th>High</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>1.00</td><td>2.43</td><td>5.62</td>
  </tr>
  <tr align="center">
     <td>Msaa 2x</td><td>1.01</td><td>2.94</td><td>9.54</td>
  </tr>
  <tr align="center">
     <td>Msaa 4x</td><td>1.03</td><td>3.53</td><td>14.54</td>
  </tr>
  <tr align="center">
     <td>Msaa 8x</td><td>1.05</td><td>3.99</td><td>17.01</td>
  </tr>
</table>

<p>Due to quad utilization, we would expect the Forward pass with 1-pixel triangles
to be 4x longer than the low density triangle case. But the actual number is 5.62, which implies that we are paying a 1.41x cost per shader invocation (possibly
due to increased primitive rasterization cost) multiplied in with the 4x shader invocations from poor quad utilization. At high density, the 2x, 4x, and 8x
MSAA passes scale by an additional 1.70x, 2.59x, and 3.03x on top of 1x for the same triangle count.</p>
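<p>The factors above can be reproduced directly from the timing table. The following C++ sketch does the arithmetic; the struct and function names are made up for illustration, and the 4x quad utilization factor is the worst-case assumption from the text, not a measurement.</p>

```cpp
#include <cassert>

struct CostFactors {
    double perInvocation;    // residual cost per shader invocation
    double quadUtilization;  // assumed worst case: 4 invocations per shaded pixel
    double extraMsaa;        // MSAA pass time relative to 1x at the same density
    double total;            // product of the three factors
};

// Timings are in milliseconds, taken from the Forward shader pass table.
inline CostFactors decomposeCost(double forwardLowMs, double forwardHighMs, double msaaHighMs) {
    assert(forwardLowMs > 0.0 && forwardHighMs > 0.0);
    CostFactors f{};
    f.quadUtilization = 4.0;
    f.perInvocation = (forwardHighMs / forwardLowMs) / f.quadUtilization;
    f.extraMsaa = msaaHighMs / forwardHighMs;
    f.total = f.perInvocation * f.quadUtilization * f.extraMsaa;
    return f;
}
```

<p>Calling decomposeCost(1.600, 8.987, 27.211) gives a per-invocation factor of roughly 1.40 and an extra MSAA factor of roughly 3.03. The 1.41 quoted in the text comes from dividing the rounded 5.62 by 4. The product is about 17.0 either way, since the factors telescope back to 27.211 / 1.600.</p>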

<p>But when you run the numbers, it makes sense. For 8x at high density compared to 1x at low density, we are paying:</p>
<ul>
<li>1.41x in higher GPU cost per shader invocation, which is due to bandwidth, interpolators, etc.</li>
<li>4x for worst-case quad utilization.</li>
<li>3.03x for triangles that touch a sample but not a pixel center. Those triangles are shaded in the MSAA case but culled as zero-pixel triangles in the 1x case.</li>
</ul>

<p>Multiply those together and you get 17x. And while 8x is extreme, the 2x and 4x MSAA cases are still quite rough. Finally, here is one more table. Given the Visibility
numbers, we can estimate how much it would cost if we switched DVM to brute force supersampling:</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>High Triangle Density Forward</th><th>Estimated Brute Force Supersampling</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>8.987</td><td>2.581</td>
  </tr>
  <tr align="center">
     <td>Msaa 2x</td><td>15.262</td><td>5.162</td>
  </tr>
  <tr align="center">
     <td>Msaa 4x</td><td>23.261</td><td>10.324</td>
  </tr>
  <tr align="center">
     <td>Msaa 8x</td><td>27.211</td><td>20.648</td>
  </tr>
</table>

<p>Note that the DVM path computes Material and Lighting in 2.581ms at high triangle density (1.745ms + 0.836ms). In theory, the quad utilization of forward MSAA is bad enough that we really
could brute force render supersampled visibility at less cost. Those estimates assume that the time scales linearly with the number of samples.</p>
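<p>The estimate column is simple enough to write down as code. This sketch assumes, as the table does, that Material plus Lighting time scales linearly with the number of samples per pixel; the function names are hypothetical.</p>

```cpp
#include <cassert>

// Linear-scaling estimate of brute force supersampling through the visibility
// path: (Material + Lighting) at 1x, multiplied by the samples per pixel.
inline double estimateSupersampleMs(double materialMs, double lightingMs, int samplesPerPixel) {
    assert(samplesPerPixel >= 1);
    return (materialMs + lightingMs) * static_cast<double>(samplesPerPixel);
}

// True when the linear estimate is cheaper than a measured Forward/MSAA pass.
inline bool supersamplingWins(double estimateMs, double measuredForwardMs) {
    return estimateMs < measuredForwardMs;
}
```

<p>With the high density DVM numbers (1.745ms and 0.836ms), the 8x estimate is 20.648ms against a measured 27.211ms for the 8x MSAA Forward pass. Real supersampling would likely not scale perfectly linearly, so treat the comparison as a rough bound.</p>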

<p>Looking at the numbers, the use case that jumps out at me is VR. In particular, the usual game tricks to fake detail in materials (like
normal maps) don’t work in VR because your eye can see through the illusion. The best practice is to really push the triangle count and reduce your material complexity.
Unfortunately pushing the triangle count is the worst thing you can do for performance with Forward and MSAA. Still, proper parallax is compelling
enough to make it worth the very high cost. With such poor quad utilization, trading Forward/MSAA for Visibility with DVM or even Visibility supersampling starts to look compelling.</p>

<p><strong>Future Work</strong></p>

<p>There are a few pretty obvious next steps. The major performance win would be optimizing the passes. These shaders have not been optimized properly.</p>

<p>The big quality improvement would be to add samples instead of switching them. In general, I have come to really like having a list of samples and 32-bit masks
per 2x2 quad. The initial implementation allowed an arbitrary number of samples per quad, but the performance just wasn’t there, which is why I redesigned
the algorithm to use exactly 4 samples per quad. However, I’m quite optimistic about a hybrid approach that uses the first 4 samples as is, and then has an option to add
a few extra samples where they have the most visual impact.</p>

<p>With an MSAA visibility buffer, rendering extra samples is trivial. The material compute shader runs on a flat list of visibility samples, so it is very easy
to add several extra samples to the list at sharp edges. Previous papers have done great work on determining which samples to add,
so incorporating that work is an obvious direction to go. The hard part is efficiently incorporating these samples into all
the GBuffer passes in a more complex engine.</p>

<p><strong>References</strong></p>

<p>[1] EQAA Modes for AMD 6900 Series Graphics Cards. AMD. (<a href="https://developer.amd.com/wordpress/media/2012/10/EQAA%2520Modes%2520for%2520AMD%2520HD%25206900%2520Series%2520Cards.pdf">https://developer.amd.com/wordpress/media/2012/10/EQAA%2520Modes%2520for%2520AMD%2520HD%25206900%2520Series%2520Cards.pdf</a>)</p>

<p>[2] Subpixel Reconstruction Antialiasing for Deferred Shading. Matthäus G. Chajdas, Morgan McGuire, and David Luebke. (<a href="https://research.nvidia.com/sites/default/files/pubs/2011-02_Subpixel-Reconstruction-Antialiasing/I3D11.pdf">https://research.nvidia.com/sites/default/files/pubs/2011-02_Subpixel-Reconstruction-Antialiasing/I3D11.pdf</a>)</p>

<p>[3] Aggregate G-Buffer Anti-Aliasing (Slides). Cyril Crassin, Morgan McGuire, Kayvon Fatahalian, and Aaron Lefohn. (<a href="https://casual-effects.com/research/Crassin2015Aggregate/Crassin2015Aggregate-presentation.pdf">https://casual-effects.com/research/Crassin2015Aggregate/Crassin2015Aggregate-presentation.pdf</a>)</p>

<p>[4] Aggregate G-Buffer Anti-Aliasing -Extended Version-. Cyril Crassin, Morgan McGuire, Kayvon Fatahalian, and Aaron Lefohn. (<a href="https://graphics.stanford.edu/~kayvonf/papers/agaa_tvcg2016.pdf">https://graphics.stanford.edu/~kayvonf/papers/agaa_tvcg2016.pdf</a>)</p>

<p>[5] Hybrid Reconstruction Anti Aliasing. Michal Drobot. (<a href="http://advances.realtimerendering.com/s2014/drobot/HRAA_notes_final.pdf">http://advances.realtimerendering.com/s2014/drobot/HRAA_notes_final.pdf</a>)</p>

<p>[6] Dynamic Temporal Antialiasing and Upsampling in Call of Duty. Jorge Jimenez. (<a href="https://www.activision.com/cdn/research/Dynamic_Temporal_Antialiasing_and_Upsampling_in_Call_of_Duty_v4.pdf">https://www.activision.com/cdn/research/Dynamic_Temporal_Antialiasing_and_Upsampling_in_Call_of_Duty_v4.pdf</a>)</p>

<p>[7] High Quality Temporal Supersampling. Brian Karis. (<a href="http://advances.realtimerendering.com/s2014/#_HIGH-QUALITY_TEMPORAL_SUPERSAMPLING">http://advances.realtimerendering.com/s2014/#_HIGH-QUALITY_TEMPORAL_SUPERSAMPLING</a>)</p>

<p>[8] Deferred Rendering for Current and Future Rendering Pipelines. Andrew Lauritzen. (<a href="https://software.intel.com/content/dam/develop/external/us/en/documents/lauritzen-deferred-shading-siggraph-2010-181241.pdf">https://software.intel.com/content/dam/develop/external/us/en/documents/lauritzen-deferred-shading-siggraph-2010-181241.pdf</a>)</p>

<p>[9] TSSAA: Temporal supersampling AA. Timothy Lottes. (<a href="http://timothylottes.blogspot.com/2011/04/tssaatemporal-super-sampling-aa.html">http://timothylottes.blogspot.com/2011/04/tssaatemporal-super-sampling-aa.html</a>)</p>

<p>[10] Adaptive Temporal Antialiasing. Adam Marrs, Josef Spjut, Holger Gruen, Rahul Sathe, and Morgan McGuire. (<a href="https://research.nvidia.com/sites/default/files/pubs/2018-08_Adaptive-Temporal-Antialiasing/adaptive-temporal-antialiasing-preprint.pdf">https://research.nvidia.com/sites/default/files/pubs/2018-08_Adaptive-Temporal-Antialiasing/adaptive-temporal-antialiasing-preprint.pdf</a>)</p>

<p>[11] Antialiased Deferred Rendering. NVIDIA. (<a href="https://docs.nvidia.com/gameworks/content/gameworkslibrary/graphicssamples/d3d_samples/antialiaseddeferredrendering.htm">https://docs.nvidia.com/gameworks/content/gameworkslibrary/graphicssamples/d3d_samples/antialiaseddeferredrendering.htm</a>)</p>

<p>[12] Coverage Sampled Antialiasing. NVIDIA. (<a href="https://developer.download.nvidia.com/SDK/9.5/Samples/DEMOS/Direct3D9/src/CSAATutorial/docs/CSAATutorial.pdf">https://developer.download.nvidia.com/SDK/9.5/Samples/DEMOS/Direct3D9/src/CSAATutorial/docs/CSAATutorial.pdf</a>)</p>

<p>[13] A Quick Overview of MSAA. Matt Pettineo. (<a href="https://therealmjp.github.io/posts/msaa-overview/">https://therealmjp.github.io/posts/msaa-overview/</a>)</p>

<p>[14] Bindless Texturing for Deferred Rendering and Decals. Matt Pettineo. (<a href="https://therealmjp.github.io/posts/bindless-texturing-for-deferred-rendering-and-decals/">https://therealmjp.github.io/posts/bindless-texturing-for-deferred-rendering-and-decals/</a>)</p>

<p>[15] An excursion in temporal supersampling. Marco Salvi. (<a href="https://developer.download.nvidia.com/gameworks/events/GDC2016/msalvi_temporal_supersampling.pdf">https://developer.download.nvidia.com/gameworks/events/GDC2016/msalvi_temporal_supersampling.pdf</a>)</p>

<p>[16] Surface Based Anti-Aliasing. Marco Salvi and Kiril Vidimče. (<a href="https://software.intel.com/content/dam/develop/external/us/en/documents/43579-SBAA.pdf">https://software.intel.com/content/dam/develop/external/us/en/documents/43579-SBAA.pdf</a>)</p>

<p>[17] Deferred Attribute Interpolation for Memory-Efficient Deferred Shading. Christoph Schied and Carsten Dachsbacher. (<a href="http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf">http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf</a>)</p>

<p>[18] CryEngine3 Graphics Gems. Tiago Sousa. (<a href="https://advances.realtimerendering.com/s2013/Sousa_Graphics_Gems_CryENGINE3.pptx">https://advances.realtimerendering.com/s2013/Sousa_Graphics_Gems_CryENGINE3.pptx</a>)</p>

<p>[19] Visual Acuity. Wikipedia. (<a href="https://en.wikipedia.org/wiki/Visual_acuity">https://en.wikipedia.org/wiki/Visual_acuity</a>)</p>

<p>[20] A Survey of Temporal Antialiasing Techniques. Lei Yang, Shiqiu Liu, and Marco Salvi. (<a href="http://behindthepixels.io/assets/files/TemporalAA.pdf">http://behindthepixels.io/assets/files/TemporalAA.pdf</a>)</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Adventures in Visibility Rendering Part 1: Visibility Buffer Rendering with Material Graphs Part 2: Decoupled Visibility Multisampling Part 3: Software VRS with Visibility Buffer Rendering Part 4: Visibility TAA and Upsampling with Subsample History]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222021_07_13_dvm/hairball_group_split_header_crop.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222021_07_13_dvm/hairball_group_split_header_crop.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Visibility Buffer Rendering with Material Graphs</title><link href="https://filmicworlds.com/blog/visibility-buffer-rendering-with-material-graphs/" rel="alternate" type="text/html" title="Visibility Buffer Rendering with Material Graphs" /><published>2021-07-05T00:00:00+00:00</published><updated>2021-07-05T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/visibility-buffer-rendering-with-material-graphs</id><content type="html" xml:base="https://filmicworlds.com/blog/visibility-buffer-rendering-with-material-graphs/"><![CDATA[<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/main_tri_ids.jpg"><img src="/images/2019_07_03_visibility/screens/small/main_tri_ids.jpg" /></a></div>

<p><strong>Adventures in Visibility Rendering</strong></p>
<ul>
<li>Part 1: Visibility Buffer Rendering with Material Graphs</li>
<li>Part 2: <a href="/blog/decoupled-visibility-multisampling/">Decoupled Visibility Multisampling</a></li>
<li>Part 3: <a href="/blog/software-vrs-with-visibility-buffer-rendering/">Software VRS with Visibility Buffer Rendering</a></li>
<li>Part 4: <a href="/blog/visibility-taa-and-upsampling-with-subsample-history/">Visibility TAA and Upsampling with Subsample History</a></li>
</ul>

<p><strong>Introduction</strong></p>

<p>The last year or so has been strange for everyone. Each of us has been dealing with covid quarantine in our own way, and in my case I’ve been coding.
As you likely know, small triangles are inefficient on GPUs for a variety
of reasons, including low quad utilization. Since a Visibility Buffer is, in theory, more resilient than Deferred for cases with poor quad utilization,
I’ve had this hunch that a Visibility Buffer renderer would be able to get
better performance than a classic Deferred renderer. With a bit of free time on my hands,
I put together a Dx12 toy engine for testing this theory.</p>

<p><strong>Overview and Prior Art</strong></p>

<p>The idea of a visibility buffer is pretty simple. During a first pass you render a “Visibility Buffer” which stores the object and triangle IDs in a single value,
usually a packed U32. Then that triangle/object ID is all you need to fetch any parameters that you need.</p>

<p>The initial major paper was from Christopher Burns and Warren Hunt at Intel [3]. In addition to coining the term “Visibility Buffer”, they were
the first reference I could find to storing a triangle ID and reconstructing the interpolated vertex data. To handle multiple materials, they split the screen into tiles
and classified the pixels inside them. Then they would render a single draw call per material only covering the tiles that touch the needed
pixels. A Visibility Buffer has also been used to optimize rendering in other ways, such as by Christoph Schied and Carsten Dachsbacher [12], who
approached the problem as a multisampling compression algorithm. Wolfgang Engel from ConfettiFX has demonstrated a visibility buffer [5], forgoing arbitrary material graphs
but using bindless textures. Their approach treats materials as one ubershader with arbitrary texture access. They also provide source code under
a permissive license so I’d highly recommend taking a look if you are interested [4].</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/diagram/visibility-00.jpg" /></div>

<p>The most recent work of course is Nanite in Unreal 5, from Epic Games. They approach the problem differently from prior work.
While I don’t have any secrets to tell you, the high-level approach is publicly known. Instead of using a Visibility Buffer as a replacement for a GBuffer, they are using a Visibility Buffer as an optimization to create a GBuffer more
efficiently. In particular, GPU rasterizers have performance inefficiencies with small triangles, so Nanite uses
a custom rasterizer to bypass these bottlenecks as you can see in their overview videos [7]. Note that you can jump to 1:00:45 for the quick discussion
on triangle sizes. The UE5 visibility rasterizer is a Nanite-only rasterizer, so other objects go through the standard Deferred path.</p>

<p>In theory, with Visibility rendering there is no need for a GBuffer. If you need the normal, you can always fetch it directly from the
vertex parameters. And if you need it again, you can fetch it again. But in practice, material graphs are already quite long, and they are getting longer every year. If all we needed was
direct lighting, then we could do without a GBuffer. But since we need values like the normal multiple times per frame (direct lighting, 
screen-space reflections, ambient probes, etc), storing material outputs in a GBuffer seems like the way to go.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/diagram/visibility-01.jpg" /></div>

<p>In the variation proposed here, the Visibility Buffer will be used to generate the GBuffer, and we will use it for all triangles. We’re also going to do this with arbitrary material graphs, which means calculating our own analytic partial derivatives.
In doing these tests, there are several questions I am trying to find the answer to:</p>

<ol>
<li>Can we efficiently calculate analytic partial derivatives with material graphs?
    <ul><li>In other words, is this approach viable, at all? If we can't calculate analytic partial derivatives, this approach is a non-starter.</li></ul></li>
<li>For very high triangle counts (1 pixel per triangle), is the Visibility approach faster?
    <ul><li>Is this approach faster for future workloads? If we want to hit film quality, eventually we need all triangles down to 1 pixel in size. Backgrounds, characters, grass, props, everything.</li></ul></li>
<li>What about more typical triangle sizes (5-10 pixels per triangle)? Is the Visibility approach faster there too?
    <ul><li>Is this approach faster for current AAA game workloads?</li></ul></li>
</ol>

<p>Turns out, in the tests below the answer to all three questions is: Yes! But with the caveat that this is a toy engine and not a real AAA engine.</p>

<p><strong>Forward/Deferred/Visibility Overview</strong></p>

<p>First, we should do a quick overview of Forward, Deferred, and Visibility rendering. In forward rendering, you are calculating everything in a single pixel shader, which will
look something like this.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="k">struct</span> <span class="nc">Interpolators</span><span class="p">;</span> <span class="c1">// the position, normal, uvs, etc.</span>
<span class="k">struct</span> <span class="nc">BrdfData</span><span class="p">;</span> <span class="c1">// normal mapped normal, albedo color, roughness, metalness, etc.</span>
<span class="k">struct</span> <span class="nc">LightData</span><span class="p">;</span> <span class="c1">// the output lighting data, usually just a float3</span>

<span class="c1">// Pass 0: Render all meshes, output final light color</span>
<span class="n">LightData</span> <span class="nf">MainPS</span><span class="p">(</span><span class="n">Interpolators</span> <span class="n">interp</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">BrdfData</span>  <span class="n">brdfData</span>  <span class="o">=</span> <span class="n">MaterialEval</span><span class="p">(</span><span class="n">interp</span><span class="p">);</span>
  <span class="n">LightData</span> <span class="n">lightData</span> <span class="o">=</span> <span class="n">LightingEval</span><span class="p">(</span><span class="n">brdfData</span><span class="p">);</span>
  <span class="k">return</span> <span class="n">lightData</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>Fundamentally, every physically based Forward shader starts with the hidden step that isn’t in the code: Interpolating vertices. The hardware interpolates the vertex data,
and it magically passes the interpolated vertex data in for you. Of course, hardware isn’t magical, but that step does happen before the pixel shader code begins executing. In the next step,
the MaterialEval() function will take the interpolated values (like UVs, normals, and tangents) to perform math and texture lookups to calculate surface material params.
These typically include the normal-mapped normal, the base color, etc. In the final step, it evaluates every light for those parameters and outputs the resulting color.</p>

<p>However, the more common approach for rendering in games these days is Deferred, using a GBuffer, which renders in two passes.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="c1">// Pass 0: Render all meshes, output material data.</span>
<span class="n">BrdfData</span> <span class="nf">MaterialPS</span><span class="p">(</span><span class="n">Interpolators</span> <span class="n">interp</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">BrdfData</span>  <span class="n">brdfData</span>  <span class="o">=</span> <span class="n">MaterialEval</span><span class="p">(</span><span class="n">interp</span><span class="p">);</span>
  <span class="k">return</span> <span class="n">brdfData</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// Pass 1: Compute shader (or large quad) to calculate lighting.</span>
<span class="n">LightData</span> <span class="n">LightingCS</span><span class="p">(</span><span class="n">float2</span> <span class="n">screenPos</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">BrdfData</span>  <span class="n">brdfData</span>  <span class="o">=</span> <span class="n">FetchMaterial</span><span class="p">(</span><span class="n">screenPos</span><span class="p">);</span>
  <span class="n">LightData</span> <span class="n">lightData</span> <span class="o">=</span> <span class="n">LightingEval</span><span class="p">(</span><span class="n">brdfData</span><span class="p">);</span>
  <span class="k">return</span> <span class="n">lightData</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>The Deferred approach uses the same basic steps as Forward. However, the lighting is evaluated in a separate pass, either a full-screen quad,
or a compute shader. The benefits are:</p>
<ol>
  <li>The big advantage is that the lighting function is guaranteed to run exactly once per pixel. When rasterizing geometry MaterialPS() may run more than once per pixel, but LightingCS() is guaranteed to only run once.</li>
  <li>When the MaterialEval() and LightingEval() functions are in the same shader, they get compiled with the worst case register allocation for both, whereas
when they are split one pass can use fewer registers than the other (and achieve better occupancy).</li>
  <li>With the deferred approach, we have a GBuffer which can be used for other effects, such as Screen-space Reflections, SSGI, SSAO, and Subsurface Scattering.</li>
</ol>

<p>The obvious drawback is that Deferred increases the amount of bandwidth used. In general the screen-space effects you can do and the improved shader performance
greatly outweigh the bandwidth and memory cost. Of course, all that goes out the window if you absolutely need MSAA, but that is a whole other discussion.</p>

<p>Visibility rendering takes a very different approach. Instead of rasterizing the lighting color (like Forward) or the GBuffer data (like Deferred), visibility
rasterizes only the ID for the triangle and draw call. We could also store a barycentric co-ordinate or derivatives, but we are going to use a single triangle ID.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="c1">// Pass 0: Rasterize all meshes, just output thin visibility</span>
<span class="n">U32</span> <span class="nf">VisibilityPS</span><span class="p">(</span><span class="n">U32</span> <span class="n">drawCallId</span><span class="p">,</span> <span class="n">U32</span> <span class="n">triangleId</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">return</span> <span class="p">(</span><span class="n">drawCallId</span> <span class="o">&lt;&lt;</span> <span class="n">NUM_TRIANGLE_BITS</span><span class="p">)</span> <span class="o">|</span> <span class="n">triangleId</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// Pass 1: In a CS convert from triangle ID to BRDF data</span>
<span class="n">BrdfData</span> <span class="n">MaterialCS</span><span class="p">(</span><span class="n">float2</span> <span class="n">screenPos</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">U32</span> <span class="n">drawCallId</span> <span class="o">=</span> <span class="n">FetchVisibility</span><span class="p">(</span><span class="n">screenPos</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="n">NUM_TRIANGLE_BITS</span><span class="p">;</span>
  <span class="n">U32</span> <span class="n">triangleId</span> <span class="o">=</span> <span class="n">FetchVisibility</span><span class="p">(</span><span class="n">screenPos</span><span class="p">)</span> <span class="o">&amp;</span>      <span class="n">TRIANGLE_MASK</span><span class="p">;</span>

  <span class="n">Interpolators</span> <span class="n">interp</span>    <span class="o">=</span> <span class="n">FetchInterpolators</span><span class="p">(</span><span class="n">drawCallId</span><span class="p">,</span> <span class="n">triangleId</span><span class="p">);</span>
  <span class="n">BrdfData</span>      <span class="n">brdfData</span>  <span class="o">=</span> <span class="n">MaterialEval</span><span class="p">(</span><span class="n">interp</span><span class="p">);</span>
  <span class="k">return</span> <span class="n">brdfData</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// Pass 2: In a CS, fetch BRDF data and calculate lighting</span>
<span class="n">LightData</span> <span class="n">LightingCS</span><span class="p">(</span><span class="n">float2</span> <span class="n">screenPos</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">BrdfData</span>      <span class="n">brdfData</span>  <span class="o">=</span> <span class="n">FetchMaterial</span><span class="p">(</span><span class="n">screenPos</span><span class="p">);</span>
  <span class="n">LightData</span>     <span class="n">lightData</span> <span class="o">=</span> <span class="n">LightingEval</span><span class="p">(</span><span class="n">brdfData</span><span class="p">);</span>
  <span class="k">return</span> <span class="n">lightData</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>Note that we have a separate pass for the Material and Lighting steps, which is different from most prior art [3,13]. Most previous papers have performed both steps together in order
to reduce GBuffer bandwidth. But given the complexity of material shaders, my perspective is that we need a GBuffer. Material shaders can be very long, but they are long
for legitimate tech-art reasons. Material shaders made by artists can be inefficient (maybe a slight understatement). But even when they are a complete mess,
there is usually a good, valid reason for the effect the material is trying to achieve, even if that effect is not achieved in the most performant way. My personal view is that long material graphs are here to stay,
they are going to be expensive, and we need to figure out the most efficient way to handle them.</p>

<p>In this case though, how do we handle multiple materials, and in particular material graphs? We’ll use the flow diagrammed below.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/diagram/color-graph.png" /></div>

<ol>
<li>As a first step, we render the full screen visibility buffer.</li>
<li>Go through each pixel, and calculate the number of pixels used for each material. Store the result in the Material Count buffer.</li>
<li>Perform a prefix sum to figure out the Material Start.</li>
<li>Run another pass through the visibility buffer, storing the XY position of each pixel in the appropriate position in the Pixel XY buffer. Note that
the Pixel XY buffer has the same number of elements as the Visibility Buffer.</li>
<li>For each material, run an indirect compute shader to calculate the GBuffer data.</li>
</ol>
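<p>Steps 2 through 4 amount to a counting sort of pixels by material. As a rough illustration, here is a minimal CPU-side C++ sketch of that binning. It is not the engine code: the real passes run as compute shaders and would need atomics, and names such as MaterialBins, BinPixelsByMaterial, and materialIdPerPixel (a per-pixel material ID already resolved from the visibility buffer) are hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PixelXY { uint16_t x, y; };

struct MaterialBins {
  std::vector<uint32_t> materialCount;  // step 2: number of pixels per material
  std::vector<uint32_t> materialStart;  // step 3: exclusive prefix sum of the counts
  std::vector<PixelXY>  pixelXY;        // step 4: pixel positions grouped by material
};

// materialIdPerPixel is a hypothetical width*height array holding the material
// ID of each pixel, assumed to be looked up from the visibility buffer.
MaterialBins BinPixelsByMaterial(const std::vector<uint32_t>& materialIdPerPixel,
                                 uint32_t width, uint32_t height,
                                 uint32_t numMaterials) {
  assert(materialIdPerPixel.size() == size_t(width) * height);

  MaterialBins bins;
  bins.materialCount.assign(numMaterials, 0);
  bins.materialStart.assign(numMaterials, 0);
  bins.pixelXY.resize(materialIdPerPixel.size());  // same size as the visibility buffer

  // Step 2: count the pixels that use each material.
  for (uint32_t id : materialIdPerPixel) {
    assert(id < numMaterials);
    ++bins.materialCount[id];
  }

  // Step 3: exclusive prefix sum gives the first slot of each material.
  uint32_t running = 0;
  for (uint32_t m = 0; m < numMaterials; ++m) {
    bins.materialStart[m] = running;
    running += bins.materialCount[m];
  }

  // Step 4: scatter each pixel position into its material's range.
  std::vector<uint32_t> cursor = bins.materialStart;
  for (uint32_t y = 0; y < height; ++y) {
    for (uint32_t x = 0; x < width; ++x) {
      const uint32_t id = materialIdPerPixel[size_t(y) * width + x];
      const uint32_t slot = cursor[id]++;
      bins.pixelXY[slot] = { uint16_t(x), uint16_t(y) };
    }
  }
  return bins;
}
```

<p>Under these assumptions, the dispatch for material m in step 5 would process the materialCount[m] entries of pixelXY starting at materialStart[m].</p>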

<p>That is how the passes are ordered, and it allows us to render multiple material graphs with different generated HLSL code. However there is
one more issue to address when calculating GBuffer data: Partial Derivatives.</p>

<p><strong>Hardware Partial Derivatives</strong></p>

<p>If you have written a pixel shader, at some point you have surely written a line to read from a texture. Something like this:</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">Sampler2D</span> <span class="n">sampler</span><span class="p">;</span>
<span class="n">Texture2D</span> <span class="n">texture</span><span class="p">;</span>
<span class="p">...</span>
<span class="n">float2</span> <span class="n">uv</span> <span class="o">=</span> <span class="n">SomeUv</span><span class="p">();</span>
<span class="n">float4</span> <span class="n">value</span> <span class="o">=</span> <span class="n">texture</span><span class="p">.</span><span class="n">Sample</span><span class="p">(</span><span class="n">sampler</span><span class="p">,</span><span class="n">uv</span><span class="p">);</span></code></pre></figure>

<p>When you run this code, the GPU will figure out the optimal mipmap level to read from and filter the data for you. But how does it figure out the correct mipmap level?</p>

<p>The key is that pixel shaders don’t run on single pixels. Rather, they run on 2x2 groups of pixels, called a quad. In the purple example below, the triangle covers all 4 pixels, all 4 pixels run the same shader
in lockstep, and during the texture read the GPU will compare the 4 uv values to determine the mipmap level. The GPU can estimate the partial derivative w.r.t. x by subtracting the left pixels
from the right pixels and the partial derivative w.r.t. y by subtracting the top from the bottom. Then it can determine the proper level from the log2 of that difference. This approach
is called finite differences.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/diagram/deriv-0.png" /></div>
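<p>To make the finite-difference idea concrete, here is a small C++ sketch that estimates a mip level from the UVs of one 2x2 quad. It uses the common simplified isotropic formula (log2 of the longer derivative, measured in texels). Actual hardware behavior is implementation-specific, and the names QuadUv and EstimateMipLevel are made up for this example.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };

// UVs of the four lanes in one 2x2 quad.
struct QuadUv {
  Float2 topLeft, topRight, bottomLeft, bottomRight;
};

// Estimates the mip level with finite differences across the quad.
// Simplified isotropic model; real GPUs may differ (anisotropy, LOD bias, etc).
float EstimateMipLevel(const QuadUv& q, float texWidth, float texHeight) {
  assert(texWidth > 0.0f && texHeight > 0.0f);

  // Partial derivative w.r.t. x: right minus left, scaled to texels.
  const float dudx = (q.topRight.x - q.topLeft.x) * texWidth;
  const float dvdx = (q.topRight.y - q.topLeft.y) * texHeight;

  // Partial derivative w.r.t. y: bottom minus top, scaled to texels.
  const float dudy = (q.bottomLeft.x - q.topLeft.x) * texWidth;
  const float dvdy = (q.bottomLeft.y - q.topLeft.y) * texHeight;

  // bottomRight is unused here: this coarse variant takes one difference
  // per axis for the whole quad.
  const float lenX = std::sqrt(dudx * dudx + dvdx * dvdx);
  const float lenY = std::sqrt(dudy * dudy + dvdy * dvdy);
  const float rho  = std::max(lenX, lenY);  // texels stepped per pixel

  // log2 of the footprint picks the level; clamp to mip 0 when magnifying.
  return std::max(0.0f, std::log2(rho));
}
```

<p>For example, if the UVs advance by 2 texels per pixel horizontally and 1 texel per pixel vertically, this sketch selects mip level 1.</p>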

<p>However, what happens if a triangle does not cover all 4 pixels, such as this triangle below which only covers 3? In that case, the GPU will extrapolate the triangle onto that missing pixel,
and run it like normal. The 3 pixels that are actually running are called “Active Lanes” and the 1 pixel which is only running to provide derivatives to the other three
is a “Helper Lane.”</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/diagram/deriv-1.png" /></div>

<p>For more information, see the HLSL Shader Model 6.0 wave intrinsics doc [9]. In fact, there are intrinsics to pass data between
pixels in the same 2x2 quad. However, what happens if multiple triangles overlap the same 2x2 quad?</p>

<div style="text-align:center;width: 600px;"><img src="/images/2019_07_03_visibility/quad-util-tris/quad_01.png" /></div>

<p>In this sample, 3 different triangles cover sample centers in the 2x2 grid. First up, the green triangle covers the upper left corner. To render this, the GPU would render the upper
left pixel as an active lane and the other three would be shaded as helper lanes, providing texture derivatives to the one active lane.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/quad-util-tris/quad_01_green_labeled.png" /></div>

<p>Next up, the blue triangle would have 2 active lanes and 2 helper lanes.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/quad-util-tris/quad_01_blue_labeled.png" /></div>

<p>And finally, the red triangle would have 1 active lane and 3 helper lanes.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/quad-util-tris/quad_01_red_labeled.png" /></div>

<p>When we have 1 triangle that covers all 4 pixels in a quad, the pixel shader workload looks like this:</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/diagram/deriv-2.png" /></div>

<p>But when we have 3 triangles that cover the 4 pixels in a quad, the pixel shader workload looks like this:</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/quad-util-tris/grid-colors.png" /></div>

<p>If we have 3 triangles covering the same 2x2 quad, then we actually have 3 times as much pixel shader work to do relative to a single triangle covering
all 4 pixels. The ratio of active lanes to total lanes is <strong>Quad Utilization</strong>. The purple quad has 100% quad utilization (4 of 4 lanes active), but the workload of these
three triangles has 33% quad utilization (4 active lanes out of 12). And what is the main factor that affects quad utilization? Triangle size.</p>
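<p>The bookkeeping for a single quad is simple enough to write down. The C++ sketch below assumes each triangle touching a 2x2 quad is described by a 4-bit coverage mask (one bit per lane). The QuadUtilization function and the mask encoding are only for illustration and are not taken from any API.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each entry is a 4-bit coverage mask for one triangle within one 2x2 quad:
// bit i set means lane i is an active lane for that triangle.
// (Hypothetical encoding, chosen only for this example.)
float QuadUtilization(const std::vector<uint8_t>& coverageMasks) {
  uint32_t activeLanes = 0;
  uint32_t totalLanes  = 0;
  for (uint8_t mask : coverageMasks) {
    assert(mask != 0 && mask <= 0xF);
    for (uint32_t lane = 0; lane < 4; ++lane) {
      activeLanes += (mask >> lane) & 1u;
    }
    // All 4 lanes run for every triangle that touches the quad;
    // the uncovered ones run as helper lanes.
    totalLanes += 4;
  }
  return totalLanes ? float(activeLanes) / float(totalLanes) : 0.0f;
}
```

<p>For the three-triangle quad, masks such as 0x1, 0x6, and 0x8 (which lanes the blue triangle covers is a guess here) give 4 active lanes out of 12, while a single fully covering triangle (0xF) gives 4 out of 4.</p>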

<p><strong>Quad Utilization Efficiency</strong></p>

<p>Given this, suppose that we are rendering only 1 pixel triangles. Even with no overdraw, each 2x2 quad would have 1 active lane and 3 helper lanes.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/quad-util-tris/tiny_tri_2.png" /></div>

<p>If we were to render the entire scene with 1 pixel triangles, we would have to execute each pixel shader 4 times per pixel, instead of just once. The Forward and Deferred Material
shaders would run 4 times for every pixel, whereas the Deferred Lighting and Visibility Material and Lighting passes would only run once per pixel.</p>

<p><i>Shader function invocations per pixel for 1-pixel sized triangles:</i></p>
<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Material</th><th>Lighting</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td colspan="2">4x</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>4x</td><td>1x</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1x</td><td>1x</td>
  </tr>
</table>

<p>Going to the other extreme, what happens if we have large triangles? In that case it’s much simpler. The number of helper lanes will be a small percentage of the overall pixels rendered,
and for the purposes of this post we can call it incidental. The pixel shaders are running about once per pixel.</p>

<div style="text-align:center;width: 600px;"><img src="/images/2019_07_03_visibility/quad-util-tris/quad_big_01_small.png" /></div>

<p><i>Approximate shader function invocations per pixel for large triangles:</i></p>
<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Material</th><th>Lighting</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td colspan="2">1x</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>1x</td><td>1x</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1x</td><td>1x</td>
  </tr>
</table>

<p>Those are the extremes, but what happens in the middle? The middle is more complicated. The conventional wisdom is to aim for about 10 pixels per triangle. What is the quad utilization of
a 10 pixel triangle? It will vary by the shape, but let’s try a few and find out. We’ll start with the simplest 10 pixel triangle.</p>

<div style="text-align:center;width: 400px;"><img src="/images/2019_07_03_visibility/quad-util-tris/pix_10_first_right_0_small.png" /></div>

<p>At a glance, it looks really good, as there are 10 active lanes and only 2 helper lanes. However, there are 4 possible ways that this triangle can align to the 2x2 grid.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/quad-util-tris/pix_10_first_right_2.png" /></div>

<p>If you do the counting, on average you end up with 9 helper lanes to the 10 active lanes. Next, let’s try one that is a little longer and thinner.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/quad-util-tris/pix_10_first_grid_longer.png" /></div>

<p>With this shape, we have on average 11 helper lanes to the 10 active lanes. Here is the worst-case shape:</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/quad-util-tris/pix_10_first_worst_merge.png" /></div>

<p>If I counted that one correctly, it’s 21 helper lanes to 10 active lanes (ouch). Now, that’s an extreme case as a triangle would have to be perfectly aligned to cause a shape
like that. As a reasonable estimate, if we say that the first (cyan) and second (orange) triangles happen equally, and the third (purple) never happens, we average (9 + 11) / 2 = 10 helper lanes per 10 active lanes, which is 50% quad utilization.
In other words, the Forward and Deferred Material passes will run about 2x per pixel. Once again, Deferred Lighting and Visibility Material and Lighting will run
once per pixel.</p>
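<p>The helper-lane counts above can be reproduced with a short brute-force sketch. The C++ below places a set of covered pixels at each of the 4 possible alignments to the 2x2 grid and counts how many lanes run only as helpers. The stair-step shape (rows of 4, 3, 2, and 1 pixels) is an assumption about the first triangle, inferred from its 10 active and 2 helper lanes, and all names here are hypothetical.</p>

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

struct Pixel { int x, y; };

// Helper lanes for one placement: every 2x2 quad touched by the shape runs
// 4 lanes, and whatever the shape does not cover is a helper lane.
int CountHelperLanes(const std::vector<Pixel>& shape, int offsetX, int offsetY) {
  std::set<std::pair<int, int>> quads;
  for (const Pixel& p : shape) {
    assert(p.x >= 0 && p.y >= 0);
    quads.emplace((p.x + offsetX) / 2, (p.y + offsetY) / 2);
  }
  return int(quads.size()) * 4 - int(shape.size());
}

// Average over the 4 ways the shape can align to the 2x2 quad grid.
float AverageHelperLanes(const std::vector<Pixel>& shape) {
  int total = 0;
  for (int oy = 0; oy < 2; ++oy) {
    for (int ox = 0; ox < 2; ++ox) {
      total += CountHelperLanes(shape, ox, oy);
    }
  }
  return float(total) / 4.0f;
}

// Assumed shape: a right triangle rasterized as rows of rows, rows-1, ..., 1
// pixels. MakeStairTriangle(4) covers 10 pixels.
std::vector<Pixel> MakeStairTriangle(int rows) {
  std::vector<Pixel> shape;
  for (int y = 0; y < rows; ++y) {
    for (int x = 0; x < rows - y; ++x) {
      shape.push_back({ x, y });
    }
  }
  return shape;
}
```

<p>With MakeStairTriangle(4), this gives 2 helper lanes in the best alignment and 9 on average, which lines up with the first triangle above. A single-pixel shape gives 3 helper lanes in every alignment, matching the 4x row in the 1-pixel table.</p>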

<p><i>Approximate shader function invocations per pixel for 10 pixel triangles:</i></p>
<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>Material</th><th>Lighting</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td colspan="2">2x</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>2x</td><td>1x</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1x</td><td>1x</td>
  </tr>
</table>

<p>At a glance, Visibility rendering suddenly looks very compelling compared to Deferred. With 10 pixel sized triangles the Deferred Material pass has to run 2x as many
times as the Visibility Material pass, and it turns into 4x if triangles are one pixel. However, the Visibility pass has extra work to do.</p>

<p><strong>Interpolation and Analytic Partial Derivatives</strong></p>

<p>Whereas the Deferred approach relies on the hardware to pass the interpolators to the pixel shader, we have to fetch and interpolate this data ourselves.
The first step is fetching data, which is relatively straightforward. Note that you can gain significant wins by packing the data aggressively, but
for this test the data is stored as 32-bit floats for simplicity. g_dcElemData is the draw call element data, which is a StructuredBuffer that contains important
per-instance data, such as where the vertex buffer starts.</p>

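<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">TriangleIndecis</span> <span class="nf">FetchTriangleIndices</span><span class="p">(</span><span class="n">uint</span> <span class="n">dcElemIndex</span><span class="p">,</span> <span class="n">uint</span> <span class="n">primId</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">TriangleIndecis</span> <span class="n">ret</span> <span class="o">=</span> <span class="p">(</span><span class="n">TriangleIndecis</span><span class="p">)</span><span class="mi">0</span><span class="p">;</span>
  <span class="n">uint</span> <span class="n">startIndex</span> <span class="o">=</span> <span class="n">g_dcElemData</span><span class="p">[</span><span class="n">dcElemIndex</span><span class="p">].</span><span class="n">m_visStart_index_pos_geo_materialId</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
  <span class="c1">// Byte-address load: 3 indices per triangle, 4 bytes per index.</span>
  <span class="n">uint3</span> <span class="n">indices</span> <span class="o">=</span> <span class="n">g_visIndexBuffer</span><span class="p">.</span><span class="n">Load3</span><span class="p">(</span><span class="n">startIndex</span> <span class="o">+</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">primId</span><span class="p">);</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_idx0</span> <span class="o">=</span> <span class="n">indices</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_idx1</span> <span class="o">=</span> <span class="n">indices</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_idx2</span> <span class="o">=</span> <span class="n">indices</span><span class="p">.</span><span class="n">z</span><span class="p">;</span>
  <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>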
<span class="p">}</span>

<span class="n">TrianglePos</span> <span class="n">FetchTrianglePos</span><span class="p">(</span><span class="n">uint</span> <span class="n">dcElemIndex</span><span class="p">,</span> <span class="n">TriangleIndecis</span> <span class="n">triIndices</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">uint</span> <span class="n">startPos</span> <span class="o">=</span> <span class="n">g_dcElemData</span><span class="p">[</span><span class="n">dcElemIndex</span><span class="p">].</span><span class="n">m_visStart_index_pos_geo_materialId</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
  
  <span class="n">TrianglePos</span> <span class="n">triPos</span> <span class="o">=</span> <span class="p">(</span><span class="n">TrianglePos</span><span class="p">)</span><span class="mi">0</span><span class="p">;</span>
  <span class="n">triPos</span><span class="p">.</span><span class="n">m_pos0</span><span class="p">.</span><span class="n">xyz</span> <span class="o">=</span> <span class="n">asfloat</span><span class="p">(</span><span class="n">g_visPosBuffer</span><span class="p">.</span><span class="n">Load3</span><span class="p">(</span><span class="n">startPos</span> <span class="o">+</span> <span class="mi">12</span> <span class="o">*</span> <span class="n">triIndices</span><span class="p">.</span><span class="n">m_idx0</span><span class="p">));</span>
  <span class="n">triPos</span><span class="p">.</span><span class="n">m_pos1</span><span class="p">.</span><span class="n">xyz</span> <span class="o">=</span> <span class="n">asfloat</span><span class="p">(</span><span class="n">g_visPosBuffer</span><span class="p">.</span><span class="n">Load3</span><span class="p">(</span><span class="n">startPos</span> <span class="o">+</span> <span class="mi">12</span> <span class="o">*</span> <span class="n">triIndices</span><span class="p">.</span><span class="n">m_idx1</span><span class="p">));</span>
  <span class="n">triPos</span><span class="p">.</span><span class="n">m_pos2</span><span class="p">.</span><span class="n">xyz</span> <span class="o">=</span> <span class="n">asfloat</span><span class="p">(</span><span class="n">g_visPosBuffer</span><span class="p">.</span><span class="n">Load3</span><span class="p">(</span><span class="n">startPos</span> <span class="o">+</span> <span class="mi">12</span> <span class="o">*</span> <span class="n">triIndices</span><span class="p">.</span><span class="n">m_idx2</span><span class="p">));</span>
  <span class="k">return</span> <span class="n">triPos</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>It’s not a lot of instructions, but what hurts performance is the stalls waiting for the data. Fetching UVs and normal data is much the same, so we’ll skip listing it here. The next step
is to calculate the barycentric co-ordinates. The DAIS paper [12] has a very handy formula in Appendix A, and the ConfettiFX code is a very useful reference
as well [4].</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="k">struct</span> <span class="nc">BarycentricDeriv</span>
<span class="p">{</span>
  <span class="n">float3</span> <span class="n">m_lambda</span><span class="p">;</span>
  <span class="n">float3</span> <span class="n">m_ddx</span><span class="p">;</span>
  <span class="n">float3</span> <span class="n">m_ddy</span><span class="p">;</span>
<span class="p">};</span>

<span class="n">BarycentricDeriv</span> <span class="n">CalcFullBary</span><span class="p">(</span><span class="n">float4</span> <span class="n">pt0</span><span class="p">,</span> <span class="n">float4</span> <span class="n">pt1</span><span class="p">,</span> <span class="n">float4</span> <span class="n">pt2</span><span class="p">,</span> <span class="n">float2</span> <span class="n">pixelNdc</span><span class="p">,</span> <span class="n">float2</span> <span class="n">winSize</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">BarycentricDeriv</span> <span class="n">ret</span> <span class="o">=</span> <span class="p">(</span><span class="n">BarycentricDeriv</span><span class="p">)</span><span class="mi">0</span><span class="p">;</span>

  <span class="n">float3</span> <span class="n">invW</span> <span class="o">=</span> <span class="n">rcp</span><span class="p">(</span><span class="n">float3</span><span class="p">(</span><span class="n">pt0</span><span class="p">.</span><span class="n">w</span><span class="p">,</span> <span class="n">pt1</span><span class="p">.</span><span class="n">w</span><span class="p">,</span> <span class="n">pt2</span><span class="p">.</span><span class="n">w</span><span class="p">));</span>

  <span class="c1">// Perspective divide: project each clip-space vertex to NDC</span>
  <span class="n">float2</span> <span class="n">ndc0</span> <span class="o">=</span> <span class="n">pt0</span><span class="p">.</span><span class="n">xy</span> <span class="o">*</span> <span class="n">invW</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
  <span class="n">float2</span> <span class="n">ndc1</span> <span class="o">=</span> <span class="n">pt1</span><span class="p">.</span><span class="n">xy</span> <span class="o">*</span> <span class="n">invW</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
  <span class="n">float2</span> <span class="n">ndc2</span> <span class="o">=</span> <span class="n">pt2</span><span class="p">.</span><span class="n">xy</span> <span class="o">*</span> <span class="n">invW</span><span class="p">.</span><span class="n">z</span><span class="p">;</span>

  <span class="kt">float</span> <span class="n">invDet</span> <span class="o">=</span> <span class="n">rcp</span><span class="p">(</span><span class="n">determinant</span><span class="p">(</span><span class="n">float2x2</span><span class="p">(</span><span class="n">ndc2</span> <span class="o">-</span> <span class="n">ndc1</span><span class="p">,</span> <span class="n">ndc0</span> <span class="o">-</span> <span class="n">ndc1</span><span class="p">)));</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_ddx</span> <span class="o">=</span> <span class="n">float3</span><span class="p">(</span><span class="n">ndc1</span><span class="p">.</span><span class="n">y</span> <span class="o">-</span> <span class="n">ndc2</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">ndc2</span><span class="p">.</span><span class="n">y</span> <span class="o">-</span> <span class="n">ndc0</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">ndc0</span><span class="p">.</span><span class="n">y</span> <span class="o">-</span> <span class="n">ndc1</span><span class="p">.</span><span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">invDet</span> <span class="o">*</span> <span class="n">invW</span><span class="p">;</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span> <span class="o">=</span> <span class="n">float3</span><span class="p">(</span><span class="n">ndc2</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="n">ndc1</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">ndc0</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="n">ndc2</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">ndc1</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="n">ndc0</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="n">invDet</span> <span class="o">*</span> <span class="n">invW</span><span class="p">;</span>
  <span class="kt">float</span> <span class="n">ddxSum</span> <span class="o">=</span> <span class="n">dot</span><span class="p">(</span><span class="n">ret</span><span class="p">.</span><span class="n">m_ddx</span><span class="p">,</span> <span class="n">float3</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">));</span>
  <span class="kt">float</span> <span class="n">ddySum</span> <span class="o">=</span> <span class="n">dot</span><span class="p">(</span><span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span><span class="p">,</span> <span class="n">float3</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">));</span>

  <span class="n">float2</span> <span class="n">deltaVec</span> <span class="o">=</span> <span class="n">pixelNdc</span> <span class="o">-</span> <span class="n">ndc0</span><span class="p">;</span>
  <span class="kt">float</span> <span class="n">interpInvW</span> <span class="o">=</span> <span class="n">invW</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">deltaVec</span><span class="p">.</span><span class="n">x</span><span class="o">*</span><span class="n">ddxSum</span> <span class="o">+</span> <span class="n">deltaVec</span><span class="p">.</span><span class="n">y</span><span class="o">*</span><span class="n">ddySum</span><span class="p">;</span>
  <span class="kt">float</span> <span class="n">interpW</span> <span class="o">=</span> <span class="n">rcp</span><span class="p">(</span><span class="n">interpInvW</span><span class="p">);</span>

  <span class="n">ret</span><span class="p">.</span><span class="n">m_lambda</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">interpW</span> <span class="o">*</span> <span class="p">(</span><span class="n">invW</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">deltaVec</span><span class="p">.</span><span class="n">x</span><span class="o">*</span><span class="n">ret</span><span class="p">.</span><span class="n">m_ddx</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">deltaVec</span><span class="p">.</span><span class="n">y</span><span class="o">*</span><span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_lambda</span><span class="p">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">interpW</span> <span class="o">*</span> <span class="p">(</span><span class="mf">0.0</span><span class="n">f</span>    <span class="o">+</span> <span class="n">deltaVec</span><span class="p">.</span><span class="n">x</span><span class="o">*</span><span class="n">ret</span><span class="p">.</span><span class="n">m_ddx</span><span class="p">.</span><span class="n">y</span> <span class="o">+</span> <span class="n">deltaVec</span><span class="p">.</span><span class="n">y</span><span class="o">*</span><span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_lambda</span><span class="p">.</span><span class="n">z</span> <span class="o">=</span> <span class="n">interpW</span> <span class="o">*</span> <span class="p">(</span><span class="mf">0.0</span><span class="n">f</span>    <span class="o">+</span> <span class="n">deltaVec</span><span class="p">.</span><span class="n">x</span><span class="o">*</span><span class="n">ret</span><span class="p">.</span><span class="n">m_ddx</span><span class="p">.</span><span class="n">z</span> <span class="o">+</span> <span class="n">deltaVec</span><span class="p">.</span><span class="n">y</span><span class="o">*</span><span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span><span class="p">.</span><span class="n">z</span><span class="p">);</span>

  <span class="c1">// Rescale the derivatives from NDC units (-1 to 1) to pixel units</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_ddx</span> <span class="o">*=</span> <span class="p">(</span><span class="mf">2.0</span><span class="n">f</span><span class="o">/</span><span class="n">winSize</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span> <span class="o">*=</span> <span class="p">(</span><span class="mf">2.0</span><span class="n">f</span><span class="o">/</span><span class="n">winSize</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
  <span class="n">ddxSum</span>    <span class="o">*=</span> <span class="p">(</span><span class="mf">2.0</span><span class="n">f</span><span class="o">/</span><span class="n">winSize</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
  <span class="n">ddySum</span>    <span class="o">*=</span> <span class="p">(</span><span class="mf">2.0</span><span class="n">f</span><span class="o">/</span><span class="n">winSize</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>

  <span class="c1">// Flip y: NDC is bottom to top, window co-ordinates are top to bottom</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span> <span class="o">*=</span> <span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">;</span>
  <span class="n">ddySum</span>    <span class="o">*=</span> <span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">;</span>

  <span class="kt">float</span> <span class="n">interpW_ddx</span> <span class="o">=</span> <span class="mf">1.0</span><span class="n">f</span> <span class="o">/</span> <span class="p">(</span><span class="n">interpInvW</span> <span class="o">+</span> <span class="n">ddxSum</span><span class="p">);</span>
  <span class="kt">float</span> <span class="n">interpW_ddy</span> <span class="o">=</span> <span class="mf">1.0</span><span class="n">f</span> <span class="o">/</span> <span class="p">(</span><span class="n">interpInvW</span> <span class="o">+</span> <span class="n">ddySum</span><span class="p">);</span>

  <span class="n">ret</span><span class="p">.</span><span class="n">m_ddx</span> <span class="o">=</span> <span class="n">interpW_ddx</span><span class="o">*</span><span class="p">(</span><span class="n">ret</span><span class="p">.</span><span class="n">m_lambda</span><span class="o">*</span><span class="n">interpInvW</span> <span class="o">+</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_ddx</span><span class="p">)</span> <span class="o">-</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_lambda</span><span class="p">;</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span> <span class="o">=</span> <span class="n">interpW_ddy</span><span class="o">*</span><span class="p">(</span><span class="n">ret</span><span class="p">.</span><span class="n">m_lambda</span><span class="o">*</span><span class="n">interpInvW</span> <span class="o">+</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_ddy</span><span class="p">)</span> <span class="o">-</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_lambda</span><span class="p">;</span>  

  <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p><i> Edit (5/7/2022): James McLaren and Stephen Hill discovered that the original version of <strong>CalcFullBary</strong> had incorrect gradients.
This updated version of <strong>CalcFullBary</strong> and <strong>InterpolateWithDeriv</strong> from James McLaren and Stephen Hill is more accurate and more closely
matches GPU rasterization behavior.
</i></p>

<p>The input points are in homogeneous clip space (just after the MVP transformation). Notice that we calculate the derivatives of the barycentric
w.r.t. x and y. The barycentric co-ordinate (m_lambda) is determined by perspective-correct interpolation. The derivatives of the barycentric
are then scaled by 2/winSize to change the scale from NDC units (-1 to 1) to pixel units. Finally, m_ddy is flipped because NDC is bottom to top whereas
window co-ordinates are top to bottom.</p>

<p>Once the barycentric and partial derivatives of the barycentric are found, interpolating any attribute from the vertices is easy. Given the values of a scalar attribute at the three vertices,
this function returns a triplet of the interpolated value, the derivative w.r.t. x, and the derivative w.r.t. y.</p>

<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">float3</span> <span class="nf">InterpolateWithDeriv</span><span class="p">(</span><span class="n">BarycentricDeriv</span> <span class="n">deriv</span><span class="p">,</span> <span class="kt">float</span> <span class="n">v0</span><span class="p">,</span> <span class="kt">float</span> <span class="n">v1</span><span class="p">,</span> <span class="kt">float</span> <span class="n">v2</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">float3</span> <span class="n">mergedV</span> <span class="o">=</span> <span class="n">float3</span><span class="p">(</span><span class="n">v0</span><span class="p">,</span> <span class="n">v1</span><span class="p">,</span> <span class="n">v2</span><span class="p">);</span>
  <span class="n">float3</span> <span class="n">ret</span><span class="p">;</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">dot</span><span class="p">(</span><span class="n">mergedV</span><span class="p">,</span> <span class="n">deriv</span><span class="p">.</span><span class="n">m_lambda</span><span class="p">);</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">dot</span><span class="p">(</span><span class="n">mergedV</span><span class="p">,</span> <span class="n">deriv</span><span class="p">.</span><span class="n">m_ddx</span><span class="p">);</span>
  <span class="n">ret</span><span class="p">.</span><span class="n">z</span> <span class="o">=</span> <span class="n">dot</span><span class="p">(</span><span class="n">mergedV</span><span class="p">,</span> <span class="n">deriv</span><span class="p">.</span><span class="n">m_ddy</span><span class="p">);</span>
  <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>Finally, for any values on the path from the interpolator to the texture sample in the material graph, we apply the chain rule. The texture is sampled
using SampleGrad(), explicitly passing in the uv derivatives.</p>
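
<p>As a rough illustration of that chain rule, here is a standalone C++ sketch (not the HLSL this toy engine generates). The <strong>Dual2</strong> type and the <strong>ScaleRotateUv</strong> node are hypothetical names made up for this example. Because a scale and rotation is linear, the x and y derivatives go through the same transform as the value, and a constant offset drops out of them.</p>

```cpp
#include <cmath>

// Hypothetical value-plus-derivatives type: a 2D value together with its
// screen-space partial derivatives w.r.t. x and y.
struct Float2 { float x, y; };

struct Dual2
{
  Float2 m_val;
  Float2 m_ddx;
  Float2 m_ddy;
};

// Rotate by 'radians' and then scale uniformly. This is a linear map.
static Float2 RotateScale(Float2 v, float scale, float radians)
{
  float c = std::cos(radians);
  float s = std::sin(radians);
  return Float2{ scale * (c * v.x - s * v.y), scale * (s * v.x + c * v.y) };
}

// Hypothetical material graph node: uvOut = scale * R(radians) * uvIn + offset.
Dual2 ScaleRotateUv(Dual2 uv, float scale, float radians, Float2 offset)
{
  Dual2 ret;
  ret.m_val = RotateScale(uv.m_val, scale, radians);
  ret.m_val.x += offset.x;
  ret.m_val.y += offset.y;

  // Chain rule: the offset is constant across the screen, so only the
  // linear part of the node is applied to the derivatives.
  ret.m_ddx = RotateScale(uv.m_ddx, scale, radians);
  ret.m_ddy = RotateScale(uv.m_ddy, scale, radians);
  return ret;
}
```

<p>In this sketch, the m_ddx and m_ddy of the final node are the values that would be handed to SampleGrad().</p>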

<p>Also, note that this isn’t a new concept. There are implementations
using C++ templates to generate derivatives in this way [11], and that is the approach that Arnold uses [8]. But instead of
templates to generate the derivative code, this toy engine generates the derivative calculations in HLSL from the material graph. Arnold uses the term “derivative sink” for the nodes that actually need the derivatives,
and any nodes on the path to the sink need to calculate derivatives along the way. However, the Arnold authors estimate that only 5%-10% of nodes are along this path. The rest of the nodes can ignore derivatives.</p>

<p>In practice, the vast majority of shaders in the real world use interpolated UVs with
only trivial adjustments (like scale and rotation). Most of the complexity in materials comes from
complicated math and blending textures together after the UV lookup. So in most cases we only need to perform extra derivative calculations on a few nodes.</p>

<p>Still, it’s a healthy chunk of extra work. Are these additional instructions so heavy that the Visibility Material function is slower than the GBuffer Material function, despite the GBuffer Material
requiring 2x or 4x more invocations as the triangle density goes up? Or are these extra calculations light enough that the Visibility Material function is faster? Let’s find out.</p>

<p><strong>Performance Tests</strong></p>

<p>For testing, I put together a single model with a heightmap, and duplicated it into a 5x3 grid. There are also several meshes below the camera casting shadows. The shadow depth pass is
very inefficient because it brute force renders the depth of a very high number of triangles, but the cost is the same for all three types of rendering so the numbers are still valid.
Also, all commands (including copies) are running on the Graphics Queue to minimize overlap and get consistent numbers. For all these shots, timing captures are from PIX
on my machine with an NVIDIA RTX 3070 at 1080p (well, technically 1088 because the framebuffer size is rounded up to a multiple of 16 for…reasons).</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/screenshot_high_view.jpg"><img src="/images/2019_07_03_visibility/screens/small/screenshot_high_view.jpg" /></a></div>

<p>The main ground is a 5x3 grid of heightmap meshes. They are not tessellated heightmaps. Rather, there is a preprocessing step that generates the
mesh points, and then it gets processed like a regular mesh. The idea is that I wanted to control the approximate density of the triangles, but I also
wanted a little bit of overdraw to give some resemblance to real use-cases. The draw calls from that angle look like this:</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/screenshot_high_dc.jpg"><img src="/images/2019_07_03_visibility/screens/small/screenshot_high_dc.jpg" /></a></div>

<p>With this setup we can keep the camera fixed, change the resolution of those meshes, and it gives us a rough idea of the tradeoff between
Forward, Deferred, and Visibility as the triangle count scales up.</p>

<p>For the material shader, I wanted something roughly similar to what games actually use. In test models, it’s very common to use simple PBR textures
that do a quick albedo, normal, specular lookup and output them directly. But material graphs in the real world tend to look like a bowl of spaghetti. I made this shader from two sets of textures from <a href="https://www.ambientcg.com">AmbientCg.com</a> [2],
which, by the way, is highly recommended if you need free, high-quality textures under a permissive license.</p>

<p><a href="https://ambientcg.com/view?id=PavingStones054">PavingStones054</a></p>

<p><a href="https://ambientcg.com/view?id=Ground037">Ground037</a></p>

<p>For blending, I used 3 octaves of Perlin noise. I also made a third layer, which I had originally planned to be a wetness layer. But my first
test was flat red, which has a nice dry powder look, so that’s what I went with. It’s blended in with an octave of Perlin noise combined with the heightmap
so the red layer is biased into the cracks between the stones.</p>

<p>Here is a screenshot of the material graph. It’s a bit messy, as this material editor lacks most (all?) of the UI features of a proper,
commercial engine. There are a lot of extra nodes because I never got around to adding constants to nodes (like adding 0.5 requires both an add node
and a 0.5 constant node). But it was good enough for this test.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/material-graph.png" /></div>

<p><strong>Low Triangle Count</strong></p>

<p>For the first test, we will look at big triangles, so each mesh is just a quad made from two triangles. Here is a low view for you, so you can see how flat it is.</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/screenshot_low_rocks.jpg"><img src="/images/2019_07_03_visibility/screens/small/screenshot_low_rocks.jpg" /></a></div>

<p>Here is the triangle ID view. As you can see, each draw call is just two triangles.</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/low_triangles_raw.jpg"><img src="/images/2019_07_03_visibility/screens/small/low_triangles_raw.jpg" /></a></div>

<p>And the final image.</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/screenshot_low_view.jpg"><img src="/images/2019_07_03_visibility/screens/small/screenshot_low_view.jpg" /></a></div>

<p>Let’s capture some numbers. Here is an explanation of the passes.</p>

<p><i>PrePass:</i> For the Forward and Deferred pass, PrePass writes only depth. However, for the Visibility pass, it also writes the visibility U32
including drawCallId and triangleId.</p>

<p><i>Material:</i> For the Deferred pass, this pass refers to the material rasterization pass. For Visibility, it refers to the time of the compute pass.
And of course, for the Forward pass this is merged with the Lighting pass for one number.</p>

<p><i>Lighting:</i> In the Deferred case, this pass is a compute shader that reads the textures and writes the lighting. Visibility does a similar
operation with buffers instead of textures.</p>

<p><i>VisUtil:</i> This category refers to the other passes in the Visibility renderer. This includes the compute shader which counts the number of pixels
for each material, reorders the visibility buffer, and then reorders it back to a linear buffer when the pixels are shaded.</p>
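
<p>The counting and reordering steps amount to a counting sort keyed on material. Here is a CPU-side C++ sketch of the idea. The function name, the struct, and the flat per-pixel material array are assumptions for illustration only; the actual passes run as compute shaders and may be organized differently.</p>

```cpp
#include <cstdint>
#include <vector>

// Hypothetical CPU sketch: group pixel indices by material id so that each
// material can be shaded over a contiguous range of pixels.
struct MaterialBins
{
  std::vector<uint32_t> m_start;   // material m owns slots [m_start[m], m_start[m+1])
  std::vector<uint32_t> m_pixels;  // pixel indices, grouped by material
};

MaterialBins BinPixelsByMaterial(const std::vector<uint32_t>& pixelMaterial, uint32_t numMaterials)
{
  MaterialBins bins;

  // Pass 1: count the number of pixels for each material.
  std::vector<uint32_t> count(numMaterials, 0);
  for (uint32_t mat : pixelMaterial)
    count[mat]++;

  // Pass 2: an exclusive prefix sum turns the counts into start offsets.
  bins.m_start.assign(numMaterials + 1, 0);
  for (uint32_t m = 0; m < numMaterials; m++)
    bins.m_start[m + 1] = bins.m_start[m] + count[m];

  // Pass 3: scatter each pixel index into the next free slot of its material.
  std::vector<uint32_t> cursor(bins.m_start.begin(), bins.m_start.end() - 1);
  bins.m_pixels.resize(pixelMaterial.size());
  for (uint32_t pixel = 0; pixel < pixelMaterial.size(); pixel++)
  {
    uint32_t mat = pixelMaterial[pixel];
    bins.m_pixels[cursor[mat]++] = pixel;
  }

  return bins;
}
```

<p>Since m_pixels stores the original pixel index for every slot, writing shaded results back through it restores the linear layout.</p>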

<p><i>Other:</i> This category refers to everything else. The main passes here are the shadow pass, TAA, motion vectors, tonemapping, GUI (which doesn’t appear in these
screenshots), and miscellaneous barriers. The way I actually counted this was by taking the total GPU time and subtracting all the other categories.</p>

<p>The Other category is a bit tricky, as Raster/Compute overlap is one of the key design decisions in organizing your rendering passes. But for this test the goal
is to determine the relative cost among different algorithms, not minimize the final render time. The cost is roughly similar for all three rendering types, so it made sense to group them separately. The choice
of Forward/Deferred/Visibility has minimal effect on the cost of items like TAA and shadows.</p>

<p><i>Low density triangle view performance.</i></p>
<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>PrePass</th><th>Material</th><th>Lighting</th><th>VisUtil</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>0.020</td><td colspan="2">1.61</td><td></td><td>0.749</td><td>2.379</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>0.020</td><td>1.06</td><td>0.730</td><td></td><td>0.759</td><td>2.569</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>0.043</td><td>1.06</td><td>0.762</td><td>0.322</td><td>0.832</td><td>3.01</td>
  </tr>
</table>

<p>Well, that result is interesting. In the deferred case, the Material shader cost is 1.06ms, and the visibility shader cost is, somehow,
exactly the same at 1.06ms. Additionally, the lighting shader cost is slightly higher by 0.032ms, and it has an extra 0.322ms of
overhead in managing the visibility passes. Finally, the forward pass can calculate Material and Lighting slightly faster, likely because it
is saving on bandwidth.</p>

<p>First, as a disclaimer, the 5x3 quads are nearly flat, and there is a little z-fighting, so it’s plausible that some pixels in the GBuffer pass are
not getting correct early-z rejection, causing a small amount of overdraw. But the more likely explanation is that the pass is primarily bandwidth limited,
so the extra ALU cost of interpolating vertices and calculating derivatives is hidden by the bandwidth cost.</p>

<p>But, looking at the numbers, the extra cost of fetching the vertex attributes and calculating the partial derivatives is…nothing? The Visibility lighting pass
is slightly higher, and the extra management passes add up, but overall that’s a very encouraging result for the next test. Also, the VisUtil pass could likely come down
a bit. The current implementation renders the Lighting using buffers instead of textures, and sorts the data later. But clearly it would be faster to
store the output of the Visibility Data directly into the GBuffer as UAVs.</p>

<p><strong>Medium Triangle Count</strong></p>

<p>Next up, let’s try a medium-resolution view. The high-res meshes will be 500x500. To make the triangle density 10x lighter, we divide the resolution along each axis by sqrt(10), which gives
500/sqrt(10)=158. Thus, the meshes for this medium-resolution setup are 158x158. They have detail but are definitely on the lumpy side.</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/screenshot_medium_rocks.jpg"><img src="/images/2019_07_03_visibility/screens/small/screenshot_medium_rocks.jpg" /></a></div>

<p>Here is the final view from our camera angle:</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/screenshot_medium_view.jpg"><img src="/images/2019_07_03_visibility/screens/small/screenshot_medium_view.jpg" /></a></div>

<p>And of course, the view of all the triangles. When angling the camera, I was trying for about 10 pixels per triangle, but at a glance it seems closer to 8. The goal is to capture the trend, not any specific size, so it’s good enough.</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/medium_triangles_raw.jpg"><img src="/images/2019_07_03_visibility/screens/small/medium_triangles_raw.jpg" /></a></div>

<p>Given that triangles are about 8-10 pixels, we would expect the cost to be about 2x for any rasterized passes. What are the numbers then?</p>
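
<p>The intuition behind that kind of estimate is that rasterized pixel shaders run in 2x2 quads, so a triangle pays for all four lanes of every quad it touches. Here is a small C++ sketch of that accounting. The helper name and the pixel-list input are hypothetical, and real hardware has other costs that this ignores.</p>

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

struct Pixel { int32_t x, y; };

// Count the pixel shader lanes launched for one triangle, assuming the GPU
// shades in aligned 2x2 quads and runs all 4 lanes of any quad the triangle
// touches. Co-ordinates are assumed to be non-negative.
uint32_t CountQuadInvocations(const std::vector<Pixel>& coveredPixels)
{
  std::set<std::pair<int32_t, int32_t>> quads;
  for (const Pixel& p : coveredPixels)
    quads.insert({ p.x / 2, p.y / 2 });  // the aligned quad containing this pixel
  return 4u * static_cast<uint32_t>(quads.size());
}
```

<p>Under these assumptions, a triangle covering a single pixel still launches 4 lanes, and four pixels that straddle four different quads launch 16.</p>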

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>PrePass</th><th>Material</th><th>Lighting</th><th>VisUtil</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>0.132</td><td colspan="2">3.92</td><td></td><td>1.099</td><td>5.151</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>0.132</td><td>2.95</td><td>0.764</td><td></td><td>1.122</td><td>4.968</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>0.158</td><td>1.65</td><td>0.818</td><td>0.336</td><td>1.188</td><td>4.15</td>
  </tr>
</table>

<p>As a disclaimer, the change in Other is not meaningful for this comparison, as it is driven by the shadow depth passes. The shadow pass is quite naive, simply rendering all
the geometry into the cascades and point light shadow passes. Since the geometry jumps in complexity, so does the shadow pass. However, this change in cost is effectively the
same for all three rendering types. Let’s examine just the relevant passes for the three different algorithms: PrePass, Material, Lighting, and VisUtil.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>PrePass + Material + Lighting + VisUtil</th>
  </tr>
  <tr align="center">
     <td>Forward (M)</td><td>4.05</td>
  </tr>
  <tr align="center">
     <td>Deferred (M)</td><td>3.85</td>
  </tr>
  <tr align="center">
     <td>Visibility (M)</td><td>2.96</td>
  </tr>
</table>

<p>Looking at the numbers, it’s clear that as triangles get smaller, Visibility rendering pulls ahead. And the numbers are not as close as I thought they would be. The first thing I actually noticed was
the PrePass. I had expected more of a performance impact from rendering both Visibility ID and Depth (as opposed to only depth), but the cost difference
is pretty low (0.026ms). The biggest jump is in the Forward and Deferred Material passes, which take 2.43x and 2.78x as long respectively (compared to the
first test). The Visibility Material pass takes 1.56x as long as it did in the first test.</p>

<p>But why does the Visibility Material pass take longer than the big triangle case? After all, it is the same shader running on the same number of pixels. The issue is cache coherency. Let’s look at two scenarios. On the left, we have an 8x8
block of pixels that are split between two triangles, and on the right each pixel in the 8x8 block points to a different triangle.</p>

<div style="text-align:center;"><img src="/images/2019_07_03_visibility/diagram/cache-coherency.png" /></div>

<p>In the compute shader, all 64 threads fetch the data for the first vertex. But in the case on the left, the GPU will only need to fetch 2 unique vertex
locations for the entire 8x8 block. However, in the case on the right, the GPU will need to fetch from 64 unique locations in memory. In addition to worse coherency,
it will also have more raw bandwidth to fetch, as the total number of bytes it needs to fetch from memory is higher. So while these extra fetches had a trivial
cost in the first test case with large triangles, they have a relevant cost in this scene. However, that cost is much less than the penalty that the
Deferred path pays due to poor quad utilization. Thus, the Visibility approach is faster overall.</p>
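
<p>To make the two scenarios concrete, here is a C++ sketch that counts how many distinct triangles one tile of visibility IDs refers to. The function name and the flat tile array are assumptions for illustration, and the number of distinct triangles is only a proxy for how many separate memory locations the thread group has to touch.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Count the distinct triangle ids referenced by one tile of the visibility
// buffer (for example the 64 pixels of an 8x8 thread group). Fewer distinct
// ids means more threads share the same index and vertex fetches.
std::size_t CountUniqueTriangles(const std::vector<uint32_t>& tileTriangleIds)
{
  std::unordered_set<uint32_t> unique(tileTriangleIds.begin(), tileTriangleIds.end());
  return unique.size();
}
```

<p>For a tile split between two triangles this returns 2, and for a tile where every pixel points to a different triangle it returns 64.</p>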

<p><strong>High Triangle Count</strong></p>

<p>Finally, let’s do a third screenshot at a high triangle count. Each of the models is 500x500, and we are well into 1 pixel per triangle. Here is a closeup of the rocks.</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/screenshot_high_rocks.jpg"><img src="/images/2019_07_03_visibility/screens/small/screenshot_high_rocks.jpg" /></a></div>

<p>The final image:</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/screenshot_high_view.jpg"><img src="/images/2019_07_03_visibility/screens/small/screenshot_high_view.jpg" /></a></div>

<p>And the triangle ids:</p>

<div style="text-align:center;"><a href="/images/2019_07_03_visibility/screens/high_triangles_raw.jpg"><img src="/images/2019_07_03_visibility/screens/small/high_triangles_raw.jpg" /></a></div>

<p>So what do the numbers look like?</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>PrePass</th><th>Material</th><th>Lighting</th><th>VisUtil</th><th>Other</th><th>Total</th>
  </tr>
  <tr align="center">
     <td>Forward</td><td>1.00</td><td colspan="2">9.27</td><td></td><td>4.726</td><td>14.996</td>
  </tr>
  <tr align="center">
     <td>Deferred</td><td>1.00</td><td>4.64</td><td>0.792</td><td></td><td>4.729</td><td>11.161</td>
  </tr>
  <tr align="center">
     <td>Visibility</td><td>1.15</td><td>2.01</td><td>0.836</td><td>0.341</td><td>4.516</td><td>8.853</td>
  </tr>
</table>

<p>Starting with the timeline, the PrePass cost goes up but stays reasonable, and writing the Visibility U32 only adds 15%. The Other pass also
goes up significantly but that is primarily driven by the shadow depth pass. The Visibility path is 0.21ms faster in the Other category, which is a little strange.
Looking at the PIX run, the shadow depth pass does have some overlap with the PrePass, so it’s possible that the extra 0.15ms cost of the PrePass is hiding
0.15ms of the shadow pass and the other 0.06ms is hidden by other overlap with VisUtil.</p>

<p>But the major difference is the Material and Lighting costs. The numbers pretty much speak for themselves. The Forward cost scales by 5.76x compared to the first frame,
and the Deferred Material cost scales to 4.38x of the first frame. However, the Visibility Material cost only scales to 1.90x of the first frame.</p>

<p>Once again, let’s isolate the passes relevant to the differences between the rendering algorithms.</p>

<table border="1" cellspacing="0" cellpadding="10" width="200" align="center">
  <tr align="center">
     <th></th><th>PrePass + Material + Lighting + VisUtil</th>
  </tr>
  <tr align="center">
     <td>Forward (H)</td><td>10.27</td>
  </tr>
  <tr align="center">
     <td>Deferred (H)</td><td>6.43</td>
  </tr>
  <tr align="center">
     <td>Visibility (H)</td><td>4.34</td>
  </tr>
</table>
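
<p>The isolated numbers are simply the per-pass sums with the Other column left out. Here is a small Python sketch of that arithmetic, using the timings from the high triangle count table. The dictionary layout and function name are my own choices for illustration.</p>

```python
# Per-pass timings in ms, copied from the high triangle count table.
# "Other" is left out because it is not specific to the rendering algorithm.
timings = {
    "Forward":    [1.00, 9.27],                # PrePass, Material + Lighting
    "Deferred":   [1.00, 4.64, 0.792],         # PrePass, Material, Lighting
    "Visibility": [1.15, 2.01, 0.836, 0.341],  # PrePass, Material, Lighting, VisUtil
}

relevant = {name: round(sum(passes), 2) for name, passes in timings.items()}

def reduction_percent(before, after):
    """Relative saving of `after` compared to `before`, in percent."""
    return 100.0 * (before - after) / before

for name in ("Forward", "Deferred", "Visibility"):
    print(name, relevant[name])

saving = reduction_percent(relevant["Deferred"], relevant["Visibility"])
print("Visibility vs Deferred: %.1f%% less" % saving)
```

<p>The rounded sums match the table (10.27, 6.43, and 4.34), and the last line reproduces the relative saving of Visibility over Deferred from those rounded values.</p>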

<p>The numbers are clear. In this test case, once triangles shrink to a single pixel, the better quad utilization of Visibility rendering greatly outweighs
the additional cost of interpolating vertex attributes and analytically calculating partial derivatives.</p>

<p><strong>Conclusions:</strong></p>

<p>Getting back to the original questions:</p>

<p><i>Can we efficiently calculate analytic partial derivatives with material graphs?</i></p>

<p>In this test case, the answer is “yes”. But in the general case, the better answer is “maybe”. The extra calculations needed to generate partial derivatives are trivial in the simple cases of UV scales and offsets. I did some other cursory tests (without doing a full
set of PIX runs) and at a glance there are no critical use cases that would cause enough of a performance penalty to fundamentally change the numbers.</p>
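
<p>As a rough illustration of why UV scales and offsets are cheap, here is a minimal Python sketch that carries the screen-space derivatives along with the UV value, in the spirit of forward-mode automatic differentiation [11]. It is not the shader code used in these tests. The class and method names are hypothetical, and it is written in Python rather than HLSL only to keep the sketch easy to run.</p>

```python
class UV(object):
    """A texture coordinate together with its screen-space partial derivatives."""

    def __init__(self, value, ddx, ddy):
        self.value = value  # (u, v)
        self.ddx = ddx      # d(u, v) / dx
        self.ddy = ddy      # d(u, v) / dy

    def scale_offset(self, scale, offset):
        """Apply uv * scale + offset and propagate derivatives by the chain rule."""
        value = tuple(c * s + o for c, s, o in zip(self.value, scale, offset))
        # The offset is constant, so it drops out; the scale multiplies through.
        ddx = tuple(d * s for d, s in zip(self.ddx, scale))
        ddy = tuple(d * s for d, s in zip(self.ddy, scale))
        return UV(value, ddx, ddy)

base = UV(value=(0.25, 0.5), ddx=(0.001, 0.0), ddy=(0.0, 0.002))
tiled = base.scale_offset(scale=(4.0, 4.0), offset=(0.5, 0.0))
print(tiled.value, tiled.ddx, tiled.ddy)
```

<p>For this kind of node the derivative work is two extra multiplies per axis, which is why it barely registers next to the texture fetches.</p>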

<p>The most common case would be a texture output that becomes a UV for another texture. If we have a standard material with 4 texture reads, and then we add
a UV offset texture read, the Forward/Deferred Material pass would add 1 texture read whereas the Visibility Material pass would add 3. But the difference between 5 and 7 texture samples
will not cause a performance cliff strong enough to drastically change the numbers. And I would expect those two extra samples to have minimal cost since they will be very cache-coherent with the first one.</p>

<p>A more problematic case is Parallax Occlusion Mapping. In theory we would need 3 texture reads instead of 1 for each step. But would the derivatives
actually change that much from each step? Would it be acceptable to use the same derivatives/mip-map level for all of them? At a glance it seems reasonable,
but I haven’t verified.</p>

<p>And of course, we have the really bad cases. A refraction eye shader would need to pass along the partial derivative of the view vector as it
refracts with the cornea geometry normal. I could imagine that shader being 3x slower than the finite differences version, since we need to account for the derivatives
of the view vector w.r.t. x and y and also the partial derivatives of the cornea height and normal w.r.t. x and y. But I can also think of approximations that would reduce the cost. For example,
I suspect we could assume the curvature of the cornea is too small to be relevant and we could make that derivative zero in the calculations.</p>

<p>Finally, standard derivatives using finite differences are not perfect either. We have problematic cases like branches and discard which are elegantly solved by
switching to analytic derivatives. This is especially true when using helper lanes that go off the edge of the triangle.</p>

<p>So yes, it works in this case. But for the general problem of partial derivatives in AAA games, my answer is “maybe, with a lean towards yes”.
My conclusion is that analytic partial derivatives are likely viable, but the approach would need more testing to be sure for the more complicated use cases.</p>

<p><i>For very high triangle counts (1 pixel per triangle), is the Visibility approach faster?</i></p>

<p>For very high triangle counts, where each pixel is run 4 times, Visibility rendering is a clear winner. The Deferred cost is 6.43ms compared to 
4.34ms for Visibility. A 32.5% reduction in overall GPU cost for the relevant passes is non-trivial.</p>

<p><i>What about more typical triangle sizes (5-10 pixels per triangle)? Is the Visibility approach faster there too?</i></p>

<p>For medium counts, in these tests, yes, the Visibility approach is faster as well. The margin is closer though, at 3.85ms vs 2.96ms. Still, a 23.1% reduction is non-trivial. Additionally, I would expect the Visibility approach to be less spiky in the bad view angles with lots of geometry and overdraw,
but that’s conjecture.</p>

<p><strong>Other Considerations:</strong></p>

<p>Given that Visibility rendering scales better than Deferred for high triangle counts, and triangle counts are going up every year, should
every game engine drop everything and switch? Of course not. There are several other factors in any major architectural rendering decision.</p>

<p><i>Code Complexity:</i></p>

<p>Probably the best argument against Visibility rendering is the complexity involved. Visibility rendering requires engineering time to manage
vertex buffers and partial derivatives. Engineering time is not infinite.</p>

<p><i>Memory:</i></p>

<p>In order to use Visibility rendering, all of your dynamic geometry needs to be in a giant buffer (or buffers) that can be accessed from a shader.
If your screen is covered in blades of grass, every single one of those post-deformation vertices needs to be in a buffer somewhere. That being said, the memory is not
necessarily as bad as it sounds. You can probably pack position XYZ into 16 bits each, so each vertex is 6 bytes. If you need a post-deformation tangent space you
can store that in a 4 byte quaternion, for 10 bytes per vertex total. Suppose you are rendering at 1080p (2 million pixels), and you
have one vertex per pixel. That puts you at 20MB of RAM, or 40MB if you need to store the previous frame too. 40MB of RAM is not trivial, but it is not ridiculous either. And if you want to include the mesh in
raytracing you need the post-deformation vertices anyways.</p>
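
<p>The estimate above is easy to reproduce. The following Python sketch assumes the packing described in the text (16 bits per position component plus a 4 byte quaternion). The function and constant names are made up for illustration.</p>

```python
BYTES_POSITION = 3 * 2   # XYZ packed into 16 bits each
BYTES_TANGENT = 4        # tangent space stored as a 4 byte quaternion
BYTES_PER_VERTEX = BYTES_POSITION + BYTES_TANGENT

def vertex_buffer_bytes(width, height, vertices_per_pixel=1.0, frames=1):
    """Bytes needed to keep post-deformation vertices for `frames` frames."""
    return int(width * height * vertices_per_pixel * BYTES_PER_VERTEX * frames)

one_frame = vertex_buffer_bytes(1920, 1080)
two_frames = vertex_buffer_bytes(1920, 1080, frames=2)
print(one_frame / 1e6, two_frames / 1e6)  # sizes in MB
```

<p>With the exact 1920x1080 pixel count the results land slightly above the rounded 20MB and 40MB figures, since 1080p is a little over 2 million pixels.</p>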

<p><i>PSO Switches:</i></p>

<p>One of the subtle advantages of Visibility rendering is fewer PSO switches during rasterization. In a Forward or Deferred rasterization pass, bubbles can form
when the Pixel Shader is starved for work from the earlier stages, especially if the PSO is constantly switching. But opaque geometry can share the
same Visibility pixel shader despite having a completely different material. While there are a few exceptions (backface culling, alpha testing, etc), we
can group all visible geometry into only a few PSOs for the triangle ID writing pass. We can also more aggressively group PSOs in the material pass. For example, a material that renders opaque and a variation
with alpha-testing can evaluate the material data in the same Visibility indirect CS dispatch, whereas they would require separate PSOs in a Deferred material pass. The Visibility pipeline should have
significantly fewer bubbles, but testing that kind of workload is beyond the scope of a small toy engine.</p>

<p><i>Material Cost:</i></p>

<p>If you are rendering with giant, complex materials, then visibility is more compelling. That’s because the cost of fetching and
interpolating your vertex data is hidden by the cost of material evaluation. If you are rendering with short material shaders,
then the interpolation cost will be more exposed, as well as the fixed cost overhead.</p>

<p><i>Min-Spec:</i></p>

<p>These tests were performed on an NVIDIA RTX 3070, which is higher than the min-spec of any AAA game coming out in the near future. In particular,
this test case has about 0.34ms of fixed cost. However, that same cost on a low-end laptop GPU from 5 years ago is going to be quite steep. Visibility scales 
well for future GPUs which will be even more powerful, but by the same token it scales poorly for past GPUs that will be in the min-spec for a long time.</p>

<p><i>Resolution Upscaling:</i></p>

<p>A secondary consideration is the role of upscaling algorithms, such as NVIDIA’s DLSS [10], AMD’s FidelityFX Super Resolution [1], and Unreal’s Temporal Super
Resolution [6]. Opinions differ of course, but if I could choose between a 4k native resolution with cut-down shading versus a really good 1080p image with high quality shading upscaled to 4k,
I’d take the upscaled 1080p. But suppose that you are targeting a 4k image with 10 pixels per triangle, and then you decide to switch to 1080p. Well, suddenly
your 10-pixel triangles become 2.5-pixel triangles. In other words, if we push the triangle count from the PS4/XB1 but keep our resolutions targeting 1080p framebuffers,
then we are going to have a whole lot of very small triangles on PS5/XSX.</p>
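
<p>The 10-pixel to 2.5-pixel step is just the ratio of framebuffer pixel counts. A tiny Python sketch, assuming the same scene and camera at both resolutions (the function name is illustrative):</p>

```python
def pixels_per_triangle(pixels_at_source, source_res, target_res):
    """Scale average triangle coverage by the ratio of framebuffer pixel counts."""
    source_pixels = source_res[0] * source_res[1]
    target_pixels = target_res[0] * target_res[1]
    return pixels_at_source * float(target_pixels) / source_pixels

print(pixels_per_triangle(10.0, (3840, 2160), (1920, 1080)))  # 2.5
```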

<p><i>Quad Utilization Matters:</i></p>

<p>Really, the big conclusion here is that Quad Utilization matters. It is effectively the same cost as overdraw! We all know that 1 pixel triangles are bad,
but 10 pixel triangles are not ideal either. 4x is worse than 2x, but 2x is also worse than 1x. Quad utilization is not a distant problem for the future. It is a real issue today in the workloads that games are actually shipping with. But if we
do address quad utilization, we actually have a lot of room to optimize our renderers and use those GPU cycles for interesting effects instead of helper lanes.</p>
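
<p>Here is a simplified Python model of the effect. It assumes that every 2x2 quad touched by a triangle costs four shader lanes no matter how many of its pixels are covered, and it ignores any hardware-specific behavior. The function name is made up for this sketch.</p>

```python
def quad_cost(covered_pixels):
    """Return (lanes_shaded, utilization) for a set of covered (x, y) pixels.

    Pixel shaders run in 2x2 quads, so every quad touched by the triangle
    costs 4 lanes even if only one of its pixels is actually covered.
    """
    quads = {(x // 2, y // 2) for x, y in covered_pixels}
    lanes = 4 * len(quads)
    return lanes, len(covered_pixels) / float(lanes)

# A 1 pixel triangle still shades a full quad: 4 lanes for 1 useful pixel.
print(quad_cost({(5, 5)}))

# A 2x2 pixel footprint that straddles quad boundaries touches 4 quads.
print(quad_cost({(1, 1), (2, 1), (1, 2), (2, 2)}))

# The same footprint aligned to a quad is fully utilized.
print(quad_cost({(2, 2), (3, 2), (2, 3), (3, 3)}))
```

<p>The first two cases both run at 25% utilization, which is the 4x cost mentioned above. Only the aligned case reaches 100%.</p>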

<p><strong>REFERENCES:</strong></p>

<p>[1] AMD FidelityFX, Super Resolution. AMD Inc. (<a href="https://www.amd.com/en/technologies/radeon-software-fidelityfx-super-resolution">https://www.amd.com/en/technologies/radeon-software-fidelityfx-super-resolution</a>)</p>

<p>[2] AmbientCG. (<a href="https://www.ambientcg.com">https://www.ambientcg.com</a>)</p>

<p>[3] The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading. Christopher Burns and Warren Hunt. (<a href="http://jcgt.org/published/0002/02/04/">http://jcgt.org/published/0002/02/04/</a>)</p>

<p>[4] ConfettiFX/The-Forge. ConfettiFX. (<a href="https://github.com/ConfettiFX/The-Forge">https://github.com/ConfettiFX/The-Forge</a>)</p>

<p>[5] 4K Rendering Breakthrough: The Filtered and Culled Visibility Buffer. Wolfgang Engel. (<a href="https://www.gdcvault.com/play/1023792/4K-Rendering-Breakthrough-The-Filtered">https://www.gdcvault.com/play/1023792/4K-Rendering-Breakthrough-The-Filtered</a>)</p>

<p>[6] Unreal Engine 5 Early Access Release Notes. Epic Games, Inc. (<a href="https://docs.unrealengine.com/5.0/en-US/ReleaseNotes/">https://docs.unrealengine.com/5.0/en-US/ReleaseNotes/</a>)</p>

<p>[7] Nanite, Inside Unreal. Brian Karis, Chance Ivey, Galen Davis, and Victor Brodin. (<a href="https://www.youtube.com/watch?v=TMorJX3Nj6U">https://www.youtube.com/watch?v=TMorJX3Nj6U</a>)</p>

<p>[8] Sony Pictures Imageworks Arnold. Christopher Kulla, Alejandro Conty, Clifford Stein, and Larry Gritz. (<a href="https://dl.acm.org/doi/10.1145/3180495">https://dl.acm.org/doi/10.1145/3180495</a>)</p>

<p>[9] HLSL Shader Model 6.0, Microsoft Inc. (<a href="https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/hlsl-shader-model-6-0-features-for-direct3d-12">https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/hlsl-shader-model-6-0-features-for-direct3d-12</a>)</p>

<p>[10] NVIDIA DLSS. NVIDIA Inc. (<a href="https://www.nvidia.com/en-us/geforce/technologies/dlss/">https://www.nvidia.com/en-us/geforce/technologies/dlss/</a>)</p>

<p>[11] Automatic Differentiation, C++ Templates and Photogrammetry. Dan Piponi. (<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.7749&amp;rep=rep1&amp;type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.7749&amp;rep=rep1&amp;type=pdf</a>)</p>

<p>[12] Deferred Attribute Interpolation for Memory-Efficient Deferred Shading. Christoph Schied and Carsten Dachsbacher. (<a href="http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf">http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf</a>)</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222019_07_03_visibility/header_split.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222019_07_03_visibility/header_split.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Face Scanning Services with Standard Deviation</title><link href="https://filmicworlds.com/blog/face-scanning-services/" rel="alternate" type="text/html" title="Face Scanning Services with Standard Deviation" /><published>2019-03-16T00:00:00+00:00</published><updated>2019-03-16T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/face-scanning-services</id><content type="html" xml:base="https://filmicworlds.com/blog/face-scanning-services/"><![CDATA[<p>In partnership with Standard Deviation (<a href="http://www.sdeviation.com">sdeviation.com</a>), we are very excited to announce new face scanning services in Los Angeles! These services are:</p>
<ol>
<li>3D Photometric Stereo Face Scanning</li>
<li>4D Face Capture</li>
<li>Headcam Face Capture</li>
</ol>

<p><strong>1. 3D Photometric Stereo Face Scanning</strong></p>

<p>The first service that we are offering is a continuation of the face scanning projects that I’ve been doing for the last few years. We’ve put together a portable rig with computer vision cameras so that we can do face capture on-location. We can scan your talent, wrap the head to your topology, and deliver all necessary textures. In other words, we are a scanning/modeling vendor.</p>

<p>The scanner is extremely solid, but actually portable. There are 3 camera columns (with 3 cameras each) and 4 light columns (with 4 lights each). So altogether that is 9 cameras and 16 lights.</p>

<div style="text-align:center;"><img src="/images/2019_03_16_scanning/scanner_med.jpg" /></div>

<p>Each column of 3 cameras has a server inside that controls the cameras and lights, and encodes the images onto a removable SSD in the back. Since the cameras are all pre-wired and pre-aimed, the setup for shooting is minimal. And when shooting is complete, data transfer is as simple as removing the SSD.</p>

<p>The really exciting feature is that we can fire each light individually. So instead of standard Passive Stereo scanning, we can do <strong>3D Photometric Stereo</strong> scans. That means that we can sync lighting to capture the talent from many different lighting directions.</p>

<div style="text-align:center;"><img src="/images/2019_03_16_scanning/neutral_small_tile.jpg" /></div>

<p>Why do we shoot with different lighting conditions? So that we can calculate raw normals that look like this:</p>

<div style="text-align:center;"><img src="/images/2019_03_16_scanning/neutral_front_normal.jpg" /></div>

<p>For each pixel that a camera sees, it knows which direction the light is coming from. By analyzing the difference in light intensities, each camera is able to calculate a world-space normal map.</p>

<p>Note that this map is the raw normal map for a single camera, so it has artifacts around areas like eyelashes. We can then merge multiple views together and fix up the problem areas to get a single high-res normal map. This approach gives us much more accurate normals and pore details than extracting them from the final color map.</p>
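
<p>For readers who have not seen photometric stereo before, here is a minimal Python sketch of the textbook version of the idea. It assumes a perfectly diffuse (Lambertian) surface and exactly three known light directions, so the normal falls out of a 3x3 linear solve. This is only an illustration of the principle and not our production solver, which has to deal with many more lights, shadows, and specular highlights. All names in the sketch are made up.</p>

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve_normal(light_dirs, intensities):
    """Recover (unit normal, albedo) from three lights via Cramer's rule.

    Assumes a Lambertian surface: intensity_k = albedo * dot(light_k, normal).
    """
    det = det3(light_dirs)
    scaled = []
    for col in range(3):
        # Replace one column of the light matrix with the measured intensities.
        m = [list(row) for row in light_dirs]
        for row in range(3):
            m[row][col] = intensities[row]
        scaled.append(det3(m) / det)
    albedo = math.sqrt(sum(c * c for c in scaled))
    return [c / albedo for c in scaled], albedo

# Synthetic check: light a known normal from three directions, then recover it.
lights = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]
true_normal, true_albedo = [0.6, 0.0, 0.8], 0.5
measured = [true_albedo * sum(a * b for a, b in zip(light, true_normal)) for light in lights]
normal, albedo = solve_normal(lights, measured)
print(normal, albedo)
```

<p>Running this per pixel gives a normal map for one camera. With more than three lights the same model becomes an overdetermined system that is usually solved by least squares.</p>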

<p>Finally, we have an automated solution for wrapping each pose to the base topology. We can handle the remaining fixes like the interior of the mouth and making sure the edge loops around the eyes line up with the eyeball geometry. Then we deliver a cleaned mesh, diffuse map, and normal map for every pose.</p>

<div style="text-align:center;"><img src="/images/2019_03_16_scanning/wire_three_small.jpg" /></div>

<p><strong>2. 4D Face Capture</strong></p>

<p>The second service we provide is 4D Face Capture, which is currently in beta. We are still sorting through a few issues but we should be ready to go live relatively soon.</p>

<p>Here is a video of some test footage that we can scrub inside Maya.</p>
<iframe src="https://player.vimeo.com/video/324773256" width="640" height="340" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe>

<p>4D Face Capture means we are capturing a mesh per frame, instead of stills. Originally we had planned to use the same cameras for both the static scans and 4D scans. The problem we ran into is actually lenses.</p>

<p>For static scans, the talent is trying to keep their head still, so we focus high-res cameras as tightly as possible. Different people have different opinions, but our view is that actors need to be able to move their neck while performing.</p>

<p>As an example, here is a clip from Jeff Berg’s monologue during our test shoot. Even without audio, we can clearly see how much expression comes from overall fluidity of the motion. If we had to direct Jeff to keep his head rigid, it would not be the same performance.</p>

<iframe src="https://player.vimeo.com/video/324781499" width="640" height="764" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe>

<p>Allowing neck motion has obvious downsides. It means that we have less effective resolution, and the data is much more challenging to track. But ultimately making sure the actor can give their best performance is more important than maximizing the accuracy of the data.</p>

<p>Thus, we came to the conclusion that we needed separate cameras for the 4D Capture and static 3D Scanning. However, we could use the same lights with a different strobing pattern.</p>

<p>That led us to this design where the upper camera is for 4D Capture and the lower one is for high-res 3D Scanning. So instead of needing to book two separate shoots, we can do high-res 3D Scanning and 4D Capture with the same talent, on the same day, in the same chair, with the same lighting.</p>

<div style="text-align:center;"><img src="/images/2019_03_16_scanning/sb_cam_med.jpg" /></div>

<p>Additionally, the high power strobing LEDs have benefits for 4D Captures as well. During the 4D Captures, we are firing the LEDs at 60fps with a 100 microsecond exposure time. That gives us evenly lit images with virtually no motion blur or ambient lighting, all without causing eyestrain.</p>

<p>This setup gives us flexibility on pricing. In particular, if you are already doing a FACS shoot for high-res scans, we can add on 4D Capture for a significantly lower price than doing a completely separate shoot.</p>

<p><strong>3. Headcam Face Capture</strong></p>

<p>For motion capture performers, we have designed and built a custom Stereo Headcam (HMC).</p>

<iframe src="https://player.vimeo.com/video/324774149" width="640" height="427" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe>

<p>Standard Deviation has been making headcams for over a decade, and they have a highly refined headcam design. It is strong, rigid, and extremely light. In fact, you can actually do a handstand wearing it.</p>

<div style="text-align:center;"><img src="/images/2019_03_16_scanning/headcam_base.jpg" /></div>

<p>The specs for the HMC are:</p>
<ul>
<li>2 Cameras, 2MP each</li>
<li>Global Shutter</li>
<li>60fps</li>
<li>150-200 Microsecond Exposure</li>
<li>Strobing Light</li>
<li>IR and Visible Light Options Available</li>
<li>Encode directly to JPEG</li>
<li>Passive Cooling (No fans!!)</li>
<li>10W Power Consumption</li>
</ul>

<p>In particular, we are most proud of the small size and low power. Writing to JPEG keeps the data size manageable if you are capturing large volumes of data. The challenge was encoding that much data in a small enough package so that we could use passive cooling. Eventually we figured it out so we can capture clean audio without fans.</p>

<p>We were also able to get the cameras to work with very short exposure times because the lighting strobes in sync with the cameras. To the talent, the lights appear quite dim, but since the entire pulse is condensed into under 200 microseconds we get a well exposed image. Additionally, this short exposure time makes the camera virtually immune to ambient light and motion blur.</p>
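
<p>A quick back-of-the-envelope calculation shows why the lights look dim to the talent. The Python sketch below assumes the light is on only during the exposure and fully off otherwise, and it treats the pulse length as equal to the exposure time. Both are simplifications, and the function name is made up.</p>

```python
def strobe_duty_cycle(fps, pulse_microseconds):
    """Fraction of each second during which the strobe is actually lit."""
    return fps * pulse_microseconds * 1e-6

headcam = strobe_duty_cycle(60, 200)     # upper end of the headcam exposure range
capture_4d = strobe_duty_cycle(60, 100)  # 4D Capture settings quoted earlier
print("headcam: %.1f%%, 4D: %.1f%%" % (100 * headcam, 100 * capture_4d))
```

<p>Under those assumptions the LEDs are lit for roughly 1% of the time, so the average output is a small fraction of the peak output that the camera sees during each exposure.</p>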

<p><strong>Next Steps</strong></p>

<p>At GDC, Standard Deviation is going to have a booth where you can see the hardware in person. So if you are at the expo make sure to stop by booth <strong>P1768</strong>. It was a great experience working with Standard Deviation to put this hardware/software pipeline together and I can’t wait to see what content creators do with it. And if you have any questions feel free to ping us directly.</p>

<p><strong>PS:</strong> Big thanks to Jeff Berg (<a href="http://www.theofficialjeffberg.com/about-jeff/">theofficialjeffberg.com</a>, <a href="https://twitter.com/jeff_berg">@jeff_berg</a>) for being our first test case and giving a great performance. If you are looking for a talented actor in LA he is highly recommended.</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[In partnership with Standard Deviation (sdeviation.com), we are very excited to announce new face scanning services in Los Angeles! These services are: 3D Photometric Stereo Face Scanning 4D Face Capture Headcam Face Capture]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222019_03_16_scanning/header_split.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222019_03_16_scanning/header_split.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Command-Line Photogrammetry with AliceVision (Tutorial/Guide)</title><link href="https://filmicworlds.com/blog/command-line-photogrammetry-with-alicevision/" rel="alternate" type="text/html" title="Command-Line Photogrammetry with AliceVision (Tutorial/Guide)" /><published>2018-08-11T00:00:00+00:00</published><updated>2018-08-11T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/command-line-photogrammetry-with-alicevision</id><content type="html" xml:base="https://filmicworlds.com/blog/command-line-photogrammetry-with-alicevision/"><![CDATA[<p>Do you need to automate a huge number of photogrammetry scans? Then I have some good news for you.</p>

<div><iframe width="560" height="315" src="https://www.youtube.com/embed/v_O6tYKQEBA" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen=""></iframe></div>

<p>Above is a video for <i>Meshroom</i>, an open source photogrammetry program. It is a project that has been around in some form for quite a while, but the big news
this week is that they released binaries, so you can just download and run it. The video shows how to use the GUI to load images, process them, change parameters,
etc. You should definitely try it out.</p>

<p>But my interest is in full automation. If you have a scanning rig where you are doing 100+ scans per day, then you need a completely automated solution for batch processing these files. This post is a Guide and/or Tutorial on how to do that.</p>

<p>The important thing to understand is that <i>Meshroom</i> is not a giant, monolithic project. Rather, all of the actual processing is done in standalone command-line C++ programs, and
<i>Meshroom</i> is a thin Python program that makes the appropriate calls. So instead of using <i>Meshroom</i>, we will just call these programs directly. Note that full source is available so you could also link to the libraries directly.</p>

<p><i>Meshroom</i> has another convenient feature: Every time you run an operation it displays the command in the terminal. So to figure out the steps for this process, I simply ran
<i>Meshroom</i> while looking at the commands. Then I also looked at the code to change some parameters as necessary. Also, FWIW you can tell <i>Meshroom</i> to build an image set from the command line, but I prefer to keep the steps separate.</p>

<h2>Prep and Install</h2>

<p><strong>0: Requirements</strong></p>

<p><i>Meshroom</i>/<i>AliceVision</i> does not run everywhere. Some of the steps do use CUDA, so you will need an NVIDIA GPU to build the depth maps. Unfortunately, there is no
CPU fallback. Otherwise it should work fine on both Windows and Linux. The instructions on this page are for Windows but it should be possible
to tweak them for Linux with minimal changes.</p>

<p><strong>1: Download Meshroom Release</strong></p>

<p><a href="https://github.com/alicevision/meshroom/releases/tag/v2018.1.0">Meshroom 2018.1.0</a></p>

<p>The first thing you will have to do is install <i>Meshroom</i>. Choose a directory where you would like to work out of, and then download the latest version. The zip file includes binaries of all dependencies.</p>

<p>If you are feeling a thirst for adventure, you can try to build it yourself. The dynamically linked Release libraries seem to work fine (/MD), but I have had to hack CMake files to build Debug and/or statically linked builds. If you are going to build on Windows, I <strong>HIGHLY</strong> suggest using vcpkg.</p>

<p><strong>2: Download Data</strong></p>

<p><a href="https://github.com/alicevision/dataset_monstree">alicevision/dataset_monstree</a></p>

<p>Obviously, the entire point of photogrammetry software is to process your own images, but as a starting point I would suggest using images that are known to work, which reduces the number of variables to isolate if something goes wrong. Thankfully, they have released the set of images for their test tree.</p>

<p><strong>3: Download the <i>run_alicevision.py</i> script</strong></p>

<p><a href="/downloads/2018_08_10_alicevision/run_alicevision.zip">run_alicevision.zip</a></p>

<p>Here is the script that we will be using. Just download the zip, and unzip it to the working directory.</p>

<p><strong>4: Install Python</strong></p>

<p><a href="https://www.python.org/download/releases/2.7/">https://www.python.org/download/releases/2.7/</a></p>

<p>If you do not already have it, install python. Yes, I still write code for python 2.7.0. The easiest method is to install the <i>Windows X86-64 MSI Installer</i> from the releases.</p>

<p><strong>5: Install Meshlab (Optional)</strong></p>

<p><a href="http://www.meshlab.net/">MeshLab</a></p>

<p>As an optional step, you should also install <i>MeshLab</i>. You will not actually need it for processing, but several steps along the way
output PLY point files. These do not load in <i>Maya</i>, so I use <i>MeshLab</i> to view them.</p>

<p>When all the files are unzipped, your folder should look like this (except for <i>build_files</i>, which is generated by the scripts):</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/file_structure.png" /></div>

<p>Those files are:</p>

<ul>
<li><strong>build_files:</strong> These are the files we will build.</li>
<li><strong>dataset_monstree-master:</strong> The source images</li>
<li><strong>Meshroom-2018.1.0:</strong> <i>Meshroom</i>/<i>AliceVision</i> binaries.</li>
<li><strong>Everything else:</strong> The scripts to run it, which come from <a href="/downloads/2018_08_10_alicevision/run_alicevision.zip">run_alicevision.zip</a>.</li>
</ul>

<h2>Running AliceVision</h2>
<p>Now would be a good time to take a look at <i>run_alicevision.py</i></p>

<p>The python file takes 5 arguments:
python run_alicevision.py &lt;baseDir&gt; &lt;imgDir&gt; &lt;binDir&gt; &lt;numImages&gt; &lt;runStep&gt;</p>

<ol>
<li><strong>baseDir</strong>: The directory where you want to put intermediary files.</li>
<li><strong>imgDir</strong>: The directory containing your source images. In our case, <i>IMG_1024.JPG</i> (among others).</li>
<li><strong>binDir</strong>: The directory containing the <i>AliceVision</i> executable files, such as <i>aliceVision_cameraInit.exe</i>.</li>
<li><strong>numImages</strong>: The number of images in <strong>imgDir</strong>, in this case 6. Note that we could detect this automatically, but the goal was to keep the python script as simple as possible so you have to specify it manually.</li>
<li><strong>runStep</strong>: The operation to run.</li>
</ol>
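
<p>If you would rather not pass the image count by hand, a few lines of Python can count the images for you. The sketch below is not the contents of <i>run_alicevision.py</i>. It is a hypothetical illustration of how the five positional arguments could be unpacked and how <strong>numImages</strong> could be detected, and the list of extensions is an assumption.</p>

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed set of source formats

def count_images(img_dir):
    """Count files in img_dir whose extension looks like an image."""
    return sum(
        1 for name in os.listdir(img_dir)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )

def parse_args(argv):
    """Split the five positional arguments into a dictionary."""
    if len(argv) != 6:
        raise SystemExit("usage: run_alicevision.py baseDir imgDir binDir numImages runStep")
    base_dir, img_dir, bin_dir, num_images, run_step = argv[1:]
    return {
        "baseDir": base_dir,
        "imgDir": img_dir,
        "binDir": bin_dir,
        "numImages": int(num_images),
        "runStep": run_step,
    }
```

<p>A wrapper could call <code>count_images</code> on <strong>imgDir</strong> and pass the result along as <strong>numImages</strong>, which keeps the original script untouched.</p>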

<p>In summary, we will start with 6 images that look like this:</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/build_src_images.png" /></div>

<p>With the <i>run_alicevision.py</i> python script, we are going to create this directory structure:</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/build_directory.png" /></div>

<p>And the <i>11_Texturing</i> directory will have the final model that opens up in <i>Meshlab</i>:</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/scan_final_meshlab.jpg" /></div>

<p>Each of those directories is one of the steps. We can either run those one at a time using the <i>run_monstree_runXX.bat</i> files, or we can use <i>run_monstree_all.bat</i> to build them all.</p>

<p>That is it. You can now either run the <i>run_monstree_all.bat</i> file, or do it one step at a time. You should be able to look at the script and figure it out. For those of you who want to customize the pipeline, here is an introduction on the individual steps.</p>

<p><strong>00_CameraInit</strong></p>

<p>The first step will generate an SFM file. SFM files are JSON files that store camera size, sensor information, found 3d points (observations),
distortion coefficients, and other information. The initial SFM file in this directory will just contain the sensor information, and it will
choose defaults from a local sensor database. Later steps will create SFM files that contain full camera extrinsic matrices, bundle points, etc.</p>

<p>This is a step you might want to customize. If you have a rig with 4 cameras, but you take 10 shots as an object rotates on a turntable, you will want an SFM
file with 40 images, but only 4 different sensor calibrations. This is a major reason why I like <i>AliceVision</i>’s structure. It is easy to customize the batch operations
(such as generating a custom SFM file) without having to dig into the other software pieces that you would rather not touch.</p>
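
<p>To sketch what that customization could look like, here is a small Python example that builds 40 views which share 4 sensor calibrations. The key names (<code>views</code>, <code>intrinsics</code>, <code>viewId</code>, <code>intrinsicId</code>, <code>path</code>) and the file naming scheme are illustrative assumptions. A real SFM file carries more fields, so check the file that <i>00_CameraInit</i> writes before relying on any of them.</p>

```python
import json

def build_views(num_cameras, num_shots):
    """Return a dict with one view per image and one intrinsic per physical camera."""
    intrinsics = [{"intrinsicId": cam} for cam in range(num_cameras)]
    views = []
    for shot in range(num_shots):
        for cam in range(num_cameras):
            views.append({
                "viewId": shot * num_cameras + cam,
                "intrinsicId": cam,  # every shot from this camera shares one calibration
                "path": "shot%02d_cam%d.jpg" % (shot, cam),  # hypothetical naming scheme
            })
    return {"views": views, "intrinsics": intrinsics}

sfm = build_views(num_cameras=4, num_shots=10)
print(len(sfm["views"]), len(sfm["intrinsics"]))  # 40 4
text = json.dumps(sfm, indent=2)
```

<p>The point is only the many-to-one mapping: each image gets its own view, but the calibration is solved once per physical camera.</p>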

<p><strong>01_FeatureExtraction</strong></p>

<p>The next step extracts features from the images, as well as descriptors for those features. It will change the file extension based on what type
of feature you are extracting.</p>

<p><strong>02_ImageMatching</strong></p>

<p><i>02_ImageMatching</i> is a preprocessing step which figures out which images make sense to match to each other. If you have a set of 1000 images, a brute
force check of all 1000 images against all 1000 images would take 1 million pairs (actually about half that many, since each pair only needs to be checked once, but you get the idea). That might take a while. The <i>02_ImageMatching</i> step culls those pairs down to the ones worth matching.</p>
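
<p>For reference, the exact brute force count is the number of unordered pairs, n(n-1)/2. A two-line Python check (the function name is just for illustration):</p>

```python
def brute_force_pairs(num_images):
    """Number of unordered image pairs, excluding an image paired with itself."""
    return num_images * (num_images - 1) // 2

print(brute_force_pairs(1000))  # 499500
print(brute_force_pairs(6))     # 15
```

<p>For the 6 image test set the culling hardly matters, but the quadratic growth is what makes this step important for large captures.</p>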

<p><strong>03_FeatureMatching</strong></p>

<p><i>03_FeatureMatching</i> finds the correspondences between the images, using feature descriptors. The generated txt files are self-explanatory.</p>

<p><strong>04_StructureFromMotion</strong></p>

<p>Ok, here is the first big step. Based on the correspondences, <i>04_StructureFromMotion</i> solves the camera positions as well as camera intrinsics. Note that
“Structure From Motion” is used as the generic term for solving camera positions. If you have a 10 camera synced
photogrammetry setup, “Structure From Motion” is used to align them, even if nothing is actually moving.</p>

<p>By default <i>Meshroom</i> stores the solved data as an <i>Alembic</i> file, but I prefer to keep it as an SFM file. This step generates intermediary data so that you
can verify that the cameras aligned properly. The script outputs PLY files which you can look at in <i>Meshlab</i>. The important files are:</p>

<ul>
<li><strong>bundle.sfm:</strong> SFM file with all observations.</li>
<li><strong>cameras.sfm:</strong> SFM file with only the aligned cameras.</li>
<li><strong>cloud_and_poses.ply:</strong> Found points and cameras.</li>
</ul>

<p>Here is <i>cloud_and_poses.ply</i>. The green dots are the cameras. I find this view is the easiest way to verify that nothing went horribly wrong with the 
camera alignment. If something does go wrong, you can go back and change the features, matches, or SFM parameters.</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/sfm_debug.jpg" /></div>

<p><strong>05_PrepareDenseScene</strong></p>

<p><i>05_PrepareDenseScene</i>’s primary function is to undistort the images. It generates undistorted EXR images so that the following depth calculation and
projection steps do not have to convert back and forth from the distortion function. The images look like this:</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/undistort.jpg" /></div>

<p>Note that you will see black areas. The later <i>AliceVision</i> steps will not use the camera’s actual matrix. Rather, we will pretend that the camera has
a new matrix without distortion, and <i>05_PrepareDenseScene</i> warps the original image to this fictional matrix. Since this new virtual sensor is larger than the actual sensor, some areas will be missing (black).</p>

<p><strong>06_CameraConnection</strong></p>

<p>Technically, this step breaks our workflow. These steps were designed such that each folder was a completely unique standalone step. However, <i>06_CameraConnection</i>
creates the <i>camsPairsMatrixFromSeeds.bin</i> file in <i>05_PrepareDenseScene</i> because that file needs to be in the same directory as the undistorted images.</p>

<p><strong>07_DepthMap</strong></p>

<p>Here is the longest step of <i>AliceVision</i>: Generate depth maps. It creates a depth map for each image as an EXR file. I tweaked it to be easier to see.
You can see the little “tongue” of the tree.</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/depth_map_original.jpg" /></div>

<p>Since this step can take a long time, there is a parameter to allow you to run groups of different cameras as different standalone commands. So if you have 1000 cameras, you
could depth process groups of cameras with different machines on a farm. Alternatively, running in smaller groups can be useful so that if one machine crashes, you do not have to 
rerun the whole process.</p>
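<p>If you want to script that split yourself, the grouping is just a matter of chopping the camera list into contiguous ranges. Here is a minimal sketch. <i>CameraRange</i> and <i>SplitIntoGroups</i> are hypothetical names, and how a range gets passed to the depth map command depends on your <i>AliceVision</i> version, so check the script for the actual parameters.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One contiguous group of cameras: [start, start + count).
struct CameraRange
{
	int start;
	int count;
};

// Split numCameras into groups of at most groupSize cameras each.
// Hypothetical helper for illustration; how a range is passed to the
// depth map command depends on your AliceVision version.
std::vector<CameraRange> SplitIntoGroups(int numCameras, int groupSize)
{
	assert(numCameras >= 0 && groupSize > 0);
	std::vector<CameraRange> groups;
	for (int start = 0; start < numCameras; start += groupSize)
	{
		// The last group may be smaller than groupSize.
		groups.push_back({start, std::min(groupSize, numCameras - start)});
	}
	return groups;
}
```

<p>Each range can then be run as its own standalone command, on one machine or many.</p>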

<p><strong>08_DepthMapFilter</strong></p>

<p>The original depth maps will not be entirely consistent. Certain depth maps will claim to see areas that are occluded by other depth maps. The <i>08_DepthMapFilter</i>
step isolates these areas and forces depth consistency.</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/depth_map_refined.jpg" /></div>

<p><strong>09_Meshing</strong></p>

<p>This is the first step which actually generates the mesh. There might be some problems with the mesh which can be solved with…</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/mesh_src.jpg" /></div>

<p><strong>10_MeshFiltering</strong></p>

<p>The <i>10_MeshFiltering</i> step takes the <i>09_Meshing</i> mesh and applies some refinements. It performs actions such as:</p>

<ul>
<li>Smoothing the mesh.</li>
<li>Removing large triangles.</li>
<li>Keeping the largest mesh but removing all the others.</li>
</ul>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/mesh_refinement.jpg" /></div>

<p>Some of these operations are not necessarily desirable for certain applications, so you can tweak those parameters as necessary.</p>

<p><strong>11_Texturing</strong></p>

<p>And the final step. <i>11_Texturing</i> creates UVs and projects the textures. And with that step we are done!</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/scan_final_meshlab.jpg" /></div>

<p>One final trick with <i>Meshlab</i> is that you can drag and drop different OBJ and PLY files as layers.</p>

<div style="text-align:center;"><img src="/images/2018_08_10_alicevision/meshlab_overlay.jpg" /></div>

<p>So in this case I have a layer for both the
final mesh and the SFM points/cameras. Sometimes the mesh smoothing step can be a little too aggressive so I find it useful to compare between the original
mesh and the smoothed mesh. If the mesh looks broken, the PLY SFM data and the OBJ meshes are great for tracing through the pipeline.</p>

<p><strong>Acknowledgements</strong></p>

<p>This post would not be complete without a big thanks to the <i>AliceVision</i> and <i>OpenMVG</i> teams. The original inspiration was actually the 
<i>libmv</i> project. That project was a precursor for <i>OpenMVG</i>, which is a repository for computer vision engineers/researchers to develop
new algorithms. <i>AliceVision</i> is a fork of <i>OpenMVG</i> with the explicit goal of turning those algorithms into a standalone, production-ready
solution.</p>

<p><i>AliceVision/Meshroom</i> is a large,
ambitious open-source project. It is a major accomplishment to get a project this big over the finish line and we owe them a debt of thanks. We also owe thanks to the <i>OpenMVG</i> team (and <i>libmv</i>)
which performed the foundational work that allowed <i>AliceVision</i> to exist.</p>

<p>Finally, I have to give a special thanks to Microsoft for <i>VCPKG</i>. <i>VCPKG</i> is a package manager that has made it vastly easier to build large open source projects on Windows. 
Several years ago I tried to build <i>OpenMVG</i> on Windows. It did not go well. So when I heard about <i>AliceVision</i> a few months ago I tried to compile it, but was failing miserably with
even simple things. See: Boost. Then I tried <i>VCPKG</i>, and it all worked the first time. It is hard to quantify the benefit of something like <i>VCPKG</i>, but it is a great help to the
open-source ecosystem on Windows.</p>

<p><a href="https://github.com/alicevision">github.com/alicevision</a></p>

<p><a href="https://github.com/openMVG/openMVG">github.com/openMVG/openMVG</a></p>

<p><a href="https://github.com/libmv/libmv">github.com/libmv/libmv</a></p>

<p><a href="https://github.com/Microsoft/vcpkg">github.com/Microsoft/vcpkg</a></p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Do you need to automate a huge number of photogrammetry scans? Then I have some good news for you.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222018_08_10_alicevision/alicevision_header.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222018_08_10_alicevision/alicevision_header.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Minimal Color Grading Tools</title><link href="https://filmicworlds.com/blog/minimal-color-grading-tools/" rel="alternate" type="text/html" title="Minimal Color Grading Tools" /><published>2017-03-28T00:00:00+00:00</published><updated>2017-03-28T00:00:00+00:00</updated><id>https://filmicworlds.com/blog/minimal-color-grading-tools</id><content type="html" xml:base="https://filmicworlds.com/blog/minimal-color-grading-tools/"><![CDATA[<p>As you have probably noticed, there are a lot of color correction algorithms out there. It can be a bit daunting to figure out the best combination of tools to use on a given project. I’ve personally gone through many iterations of trying different things in different orders and over time I’ve settled on a few key features.</p>

<p>Color correction is subjective, so there is no correct way to do it. But the operations listed here are (IMO) a reasonable starting point. You can also bake the curve into a LUT which makes the cost of the individual operations mostly irrelevant.</p>

<p>Btw, the image used in these examples is the “Wooden Door” from Christian Bloch’s <a href="http://www.hdrlabs.com/sibl/archive.html">sIBL Archive</a>. If you need some HDR environment maps, they are highly recommended. If you would prefer to make your own HDR environment maps, you should buy his book: <a href="http://www.rockynook.com/shop/photography/the-hdri-handbook-2-0/">The HDRI Handbook 2.0</a>.</p>

<p>In order, the tools we’ll go over are:</p>
<ol>
<li>Exposure</li>
<li>Color Filter</li>
<li>Saturation</li>
<li>Log-Space Contrast</li>
<li>Filmic Tone Curve</li>
<li>Display Gamma</li>
<li>Lift/Gamma/Gain</li>
</ol>

<p>Then we’ll go over implementing and optimizing these steps, as well as baking them into a LUT.</p>

<h3>Exposure and Color Filter</h3>
<p>Exposure is the simplest curve, as it just affects the overall brightness of the scene. The most common way to represent exposure is with F stops, where each stop represents a factor of 2. So an exposure value of 0 means you multiply your scene intensity by 2^0=1.0. An exposure of 3 would multiply the scene intensity by 2^3=8. An exposure of -2 would multiply the scene by 2^(-2)=.25. You get the idea. Here are several different exposure adjustments applied prior to the filmic curve.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/filmic_exposure.jpg" /><br /><strong>Exposure</strong></div>

<p>The second operation we need to perform is a color filter. It’s a fancy way of saying that you multiply the scene by a color. For implementation, we would of course combine the color filter and exposure operations into a single multiplier.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/filmic_color_filter.jpg" /><br /><strong>Color Filter</strong></div>

<p>The final implementation looks something like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">float3</span> <span class="n">exposureColorFilter</span> <span class="o">=</span> <span class="n">exp2f</span><span class="p">(</span><span class="n">exposure</span><span class="p">)</span><span class="o">*</span><span class="n">colorMult</span><span class="p">;</span>
</code></pre></div></div>

<h3>Saturation</h3>
<p>Saturation is another simple operation, which is just a lerp between the original image and the grey scale version. There are differing opinions on how to convert to grey scale since the eye is more sensitive to green than it is red or blue. There has been a ton of research on the best way to weight the RGB channels to match the same perceptual intensity, and typical numbers are around R=.30,G=.59,B=.11. Feel free to read the <a href="https://en.wikipedia.org/wiki/Luma_(video)">Wikipedia page on Luma</a> for more info.</p>

<p>I’ve found that in video games we do not really care about matching the exact perceptual luminance, and true perceptual weights can shift certain colors around more than desirable. So I generally use the numbers R=.25,G=.50,B=.25.</p>

<p>The formula is as follows:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">float3</span> <span class="n">lumaWeights</span> <span class="o">=</span> <span class="n">float3</span><span class="p">(</span><span class="mf">.25</span><span class="p">,</span><span class="mf">.50</span><span class="p">,</span><span class="mf">.25</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">grey</span> <span class="o">=</span> <span class="n">dot</span><span class="p">(</span><span class="n">lumaWeights</span><span class="p">,</span><span class="n">rgbVal</span><span class="p">);</span>
<span class="n">float3</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">grey</span> <span class="o">+</span> <span class="n">saturation</span><span class="o">*</span><span class="p">(</span><span class="n">rgbVal</span><span class="o">-</span><span class="n">grey</span><span class="p">);</span>
</code></pre></div></div>

<p>Here is an example of pushing and pulling saturation.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/filmic_saturation.jpg" /><br /><strong>Saturation</strong></div>

<h3>Contrast and Filmic Curve</h3>
<p>With contrast things start to get interesting. The standard contrast operation simply pushes values away from grey, like so:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">float3</span> <span class="n">grey</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">;</span>
<span class="n">float3</span> <span class="n">result</span> <span class="o">=</span> <span class="n">grey</span> <span class="o">+</span> <span class="p">(</span><span class="n">color</span><span class="o">-</span><span class="n">grey</span><span class="p">)</span><span class="o">*</span><span class="n">contrast</span><span class="p">;</span>
</code></pre></div></div>

<p>The major problem with this approach is clamping. As you increase the contrast, dark values are pushed below zero and clamp to black. We have the same problem in the whites, as bright values push past 1.0 and clamp to white.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/filmic_contrast_lin.jpg" /><br /><strong>Linear Contrast</strong></div>

<p>There are two fixes we can make here. The first tweak is to apply the contrast operation before the filmic curve which has a shoulder. This fixes most of our white clamping issues. As we push overexposed values the shoulder brings them back into range.</p>

<p>The second tweak we can make is to apply the contrast in log space, which fixes the clamping in the blacks. The function involves converting your linear RGB to log, applying a contrast, and converting back to linear. Your log values can go negative, but no matter how far you push them the linear values you convert back to will never go below zero, which preserves detail in the shadows/blacks.</p>

<p>Here is a comparison of the main image with log-space contrast applied before the filmic curve. Note that detail in the shadows is preserved.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/filmic_contrast_log.jpg" /><br /><strong>Log Contrast before Filmic Curve</strong></div>

<p>And here is the code. We need the epsilon so that our linear value of 0 is well behaved. <strong>logMidpoint</strong> is the log2 of our linear midpoint, and I usually hardcode that linear midpoint to 0.18.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">float</span> <span class="nf">EvalLogContrastFunc</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">eps</span><span class="p">,</span> <span class="kt">float</span> <span class="n">logMidpoint</span><span class="p">,</span> <span class="kt">float</span> <span class="n">contrast</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">float</span> <span class="n">logX</span> <span class="o">=</span> <span class="n">log2f</span><span class="p">(</span><span class="n">x</span><span class="o">+</span><span class="n">eps</span><span class="p">);</span>
	<span class="kt">float</span> <span class="n">adjX</span> <span class="o">=</span> <span class="n">logMidpoint</span> <span class="o">+</span> <span class="p">(</span><span class="n">logX</span> <span class="o">-</span> <span class="n">logMidpoint</span><span class="p">)</span> <span class="o">*</span> <span class="n">contrast</span><span class="p">;</span>
	<span class="kt">float</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">MaxFloat</span><span class="p">(</span><span class="mf">0.0</span><span class="n">f</span><span class="p">,</span><span class="n">exp2f</span><span class="p">(</span><span class="n">adjX</span><span class="p">)</span> <span class="o">-</span> <span class="n">eps</span><span class="p">);</span>
	<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
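<p>To compare the two contrast operations side by side, here is a small self-contained C++ sketch. The function names, the eps value of 1e-5, and adding eps to the midpoint before taking the log (so the midpoint stays a fixed point) are assumptions of this sketch, not necessarily what the full source does.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Linear contrast around a grey pivot, as in the standard operation.
// Can return negative values for dark inputs when contrast > 1.
float EvalLinearContrast(float x, float grey, float contrast)
{
	return grey + (x - grey) * contrast;
}

// Log-space contrast around a linear midpoint (0.18 by default).
// The eps default is an assumption chosen for illustration.
float EvalLogContrast(float x, float contrast, float midpoint = 0.18f, float eps = 1e-5f)
{
	assert(x >= 0.0f && midpoint > 0.0f && eps > 0.0f);
	// Adding eps here keeps the midpoint a fixed point of the function.
	const float logMidpoint = std::log2(midpoint + eps);
	const float logX = std::log2(x + eps);
	const float adjX = logMidpoint + (logX - logMidpoint) * contrast;
	return std::max(0.0f, std::exp2(adjX) - eps);
}
```

<p>With a contrast of 2, the linear version sends an input of 0.1 to -0.3 (which clamps to black), while the log version sends a dark input such as 0.01 to a smaller but still positive value.</p>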

<h3>Filmic Curve</h3>

<p>Of course, we should also apply a filmic curve. Details in the previous post: <a href="/blog/filmic-tonemapping-with-piecewise-power-curves/">Filmic Tonemapping with Piecewise Power Curves</a>.</p>

<h3>Display Gamma</h3>

<p>After applying the filmic curve, we need to convert to display gamma, which is usually 2.2, unless that display gamma is convolved into the filmic curve. It’s as simple as:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">float3</span> <span class="n">outputColor</span> <span class="o">=</span> <span class="n">pow</span><span class="p">(</span><span class="n">filmicColor</span><span class="p">,</span><span class="mf">1.0</span><span class="o">/</span><span class="mf">2.2</span><span class="p">);</span>
</code></pre></div></div>

<p>We can convolve the gamma curve into the filmic curve at a slight loss of accuracy. If you need to apply the curve as a function in a shader it can save you a few instructions. However, if you are baking everything into a LUT, the cost is irrelevant and you should stick with a separate gamma function.</p>

<h3>Lift/Gamma/Gain</h3>
<p>Lift/Gamma/Gain goes by many different names and is the most well-known color correction algorithm from the film world. In fact, one of the biggest online forums for color correction is <a href="http://liftgammagain.com">LiftGammaGain.com</a>. For a more thorough explanation of the tools in Nuke, you can go to <a href="http://www.qvolabs.com/nuke_color_correction_basic.html">http://www.qvolabs.com/nuke_color_correction_basic.html</a>.</p>

<p>The concept is pretty simple. All parameters affect the entire curve, but lift primarily affects shadows, gamma primarily affects the midtones, and gain primarily affects highlights. So you can move those three colors around and tweak shadows, midtones, and highlights separately.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/nuke_lift_gamma_gain.jpg" /></div>

<p>There are several variations, but the one I’ll use does gamma before lift and gain. I prefer to apply gamma first, and then use it as a lerp between the shadow and highlight color. It’s just Lift/Gamma/Gain with a slightly different interface.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dstColor</span> <span class="o">=</span> <span class="n">lerp</span><span class="p">(</span><span class="n">shadows</span><span class="p">,</span><span class="n">highlights</span><span class="p">,</span><span class="n">pow</span><span class="p">(</span><span class="n">srcColor</span><span class="p">,</span><span class="n">gamma</span><span class="p">))</span>
</code></pre></div></div>

<p>The most common interface for using Lift/Gamma/Gain is color wheels. I grabbed this image from <a href="https://www.taoofcolor.com/573/davinci-resolve-3way-interface/">Tao of Color</a>.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/davinci_color_wheel.jpg" /></div>

<p>What you might not know is that the levels tool in Photoshop/After Effects/Premiere Pro/Lightroom/etc. is actually the same operation with a different interface. You just have to tweak the RGB channels separately.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/photoshop_levels.png" /></div>

<p>In Photoshop, you can select the shadow, midtone, and highlight color with the eyedropper, then it calculates Lift/Gamma/Gain values under the hood and applies the curve.</p>

<p>How does that math work? We have a function of the form:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(x) = lift + gain*(x^(1/gamma));
f(0.0) = S; // shadows
f(0.5) = M; // midtones
f(1.0) = H; // highlights
</code></pre></div></div>

<p>Then we can trivially find our lift, gamma, and gain values, and that is the formula for Lift/Gamma/Gain. We have another problem though: Finding the right interface.</p>
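<p>For reference, here is that algebra written out for a single channel, using the form of f(x) given above. Setting f(0)=S gives the lift, f(1)=H gives the gain, and f(0.5)=M pins down the gamma. The struct and function names are hypothetical, and this is a sketch of the algebra rather than the exact arrangement used in the full source.</p>

```cpp
#include <cassert>
#include <cmath>

struct LiftGammaGain
{
	float lift;
	float gamma;
	float gain;
};

// Solve f(x) = lift + gain * x^(1/gamma) for one channel, given
// f(0) = S, f(0.5) = M, f(1) = H. Requires S < M < H.
LiftGammaGain SolveLiftGammaGain(float S, float M, float H)
{
	assert(S < M && M < H);
	LiftGammaGain p;
	p.lift = S;     // f(0) = lift
	p.gain = H - S; // f(1) = lift + gain
	// f(0.5) = M  =>  0.5^(1/gamma) = (M - S) / (H - S)
	p.gamma = std::log(0.5f) / std::log((M - S) / (H - S));
	return p;
}

// Evaluate the curve for one channel.
float EvalLiftGammaGain(const LiftGammaGain& p, float x)
{
	return p.lift + p.gain * std::pow(x, 1.0f / p.gamma);
}
```

<p>With S=0, M=0.5, and H=1 this reduces to lift=0, gamma=1, gain=1, which is the identity curve.</p>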

<p>The typical interface is three color wheels. You can also use the levels tool, although it is a pain to use for delicate correction because you have to constantly shift back and forth between the RGB panels. You can also buy one of these add-ons:
<a href="https://www.bhphotovideo.com/c/product/1322803-REG/blackmagic_design_davinci_resolve_micro_panel.html">Blackmagic Design DaVinci Resolve Micro Panel</a>. Be warned: If you click that link the ads will stalk you all over the internet.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/color_wheel.jpg" /></div>

<p>Historically, professional colorists use color wheels. But you should never do something just because film does it. What are the real UI benefits of a physical color wheel? I can see several:</p>
<ol>
<li>A large trackball, which gives you precise, subpixel control of your colors.</li>
<li>In addition to the two axes of left/right and forward/backward, you can also rotate around the vertical axis. You get three degrees of freedom.</li>
<li>A colorist can make all controls by feel without moving his or her eyes away from the screen.</li>
</ol>

<p>Those are sensible reasons to use a physical color wheel, but they don’t really translate to using a color wheel interface on screen using a mouse for input. A color wheel is a fancy way of just choosing a color, so IMO any reasonable color picker will do.</p>

<p>The one extra trick is it helps to have luminance as a separate control. I.e. moving the color should not change the luminance of the chosen color. So in my example, I have a separate control for Shadow/Midtone/Highlight <strong>Color</strong>, and a Shadow/Midtone/Highlight <strong>Offset</strong> which only affects luminance. Here is the full code to convert between the user inputs and the actual values used in the formula.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Vec3</span> <span class="n">liftC</span> <span class="o">=</span> <span class="p">(</span><span class="n">userParams</span><span class="p">.</span><span class="n">m_shadowColor</span><span class="p">);</span>
<span class="n">Vec3</span> <span class="n">gammaC</span> <span class="o">=</span> <span class="p">(</span><span class="n">userParams</span><span class="p">.</span><span class="n">m_midtoneColor</span><span class="p">);</span>
<span class="n">Vec3</span> <span class="n">gainC</span> <span class="o">=</span> <span class="p">(</span><span class="n">userParams</span><span class="p">.</span><span class="n">m_highlightColor</span><span class="p">);</span>

<span class="kt">float</span> <span class="n">avgLift</span> <span class="o">=</span> <span class="p">(</span><span class="n">liftC</span><span class="p">.</span><span class="n">x</span><span class="o">+</span><span class="n">liftC</span><span class="p">.</span><span class="n">y</span><span class="o">+</span><span class="n">liftC</span><span class="p">.</span><span class="n">z</span><span class="p">)</span><span class="o">/</span><span class="mf">3.0</span><span class="n">f</span><span class="p">;</span>
<span class="n">liftC</span> <span class="o">=</span> <span class="n">liftC</span> <span class="o">-</span> <span class="n">avgLift</span><span class="p">;</span>

<span class="kt">float</span> <span class="n">avgGamma</span> <span class="o">=</span> <span class="p">(</span><span class="n">gammaC</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">gammaC</span><span class="p">.</span><span class="n">y</span> <span class="o">+</span> <span class="n">gammaC</span><span class="p">.</span><span class="n">z</span><span class="p">)</span><span class="o">/</span><span class="mf">3.0</span><span class="n">f</span><span class="p">;</span>
<span class="n">gammaC</span> <span class="o">=</span> <span class="p">(</span><span class="n">gammaC</span> <span class="o">-</span> <span class="n">avgGamma</span><span class="p">);</span>

<span class="kt">float</span> <span class="n">avgGain</span> <span class="o">=</span> <span class="p">(</span><span class="n">gainC</span><span class="p">.</span><span class="n">x</span><span class="o">+</span><span class="n">gainC</span><span class="p">.</span><span class="n">y</span><span class="o">+</span><span class="n">gainC</span><span class="p">.</span><span class="n">z</span><span class="p">)</span><span class="o">/</span><span class="mf">3.0</span><span class="n">f</span><span class="p">;</span>
<span class="n">gainC</span> <span class="o">=</span> <span class="p">(</span><span class="n">gainC</span> <span class="o">-</span> <span class="n">avgGain</span><span class="p">);</span>

<span class="n">rawParams</span><span class="p">.</span><span class="n">m_liftAdjust</span>  <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span> <span class="o">+</span> <span class="p">(</span><span class="n">liftC</span>  <span class="o">+</span> <span class="n">userParams</span><span class="p">.</span><span class="n">m_shadowOffset</span>   <span class="p">);</span>
<span class="n">rawParams</span><span class="p">.</span><span class="n">m_gainAdjust</span>  <span class="o">=</span> <span class="mf">1.0</span><span class="n">f</span> <span class="o">+</span> <span class="p">(</span><span class="n">gainC</span>  <span class="o">+</span> <span class="n">userParams</span><span class="p">.</span><span class="n">m_highlightOffset</span><span class="p">);</span>

<span class="n">Vec3</span> <span class="n">midGrey</span> <span class="o">=</span> <span class="mf">0.5</span><span class="n">f</span> <span class="o">+</span> <span class="p">(</span><span class="n">gammaC</span> <span class="o">+</span> <span class="n">userParams</span><span class="p">.</span><span class="n">m_midtoneOffset</span>  <span class="p">);</span>
<span class="n">Vec3</span> <span class="n">H</span> <span class="o">=</span> <span class="n">rawParams</span><span class="p">.</span><span class="n">m_gainAdjust</span><span class="p">;</span>
<span class="n">Vec3</span> <span class="n">S</span> <span class="o">=</span> <span class="n">rawParams</span><span class="p">.</span><span class="n">m_liftAdjust</span><span class="p">;</span>
	
<span class="n">rawParams</span><span class="p">.</span><span class="n">m_gammaAdjust</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">logf</span><span class="p">((</span><span class="mf">0.5</span><span class="n">f</span><span class="o">-</span><span class="n">S</span><span class="p">.</span><span class="n">x</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">H</span><span class="p">.</span><span class="n">x</span><span class="o">-</span><span class="n">S</span><span class="p">.</span><span class="n">x</span><span class="p">))</span><span class="o">/</span><span class="n">logf</span><span class="p">(</span><span class="n">midGrey</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
<span class="n">rawParams</span><span class="p">.</span><span class="n">m_gammaAdjust</span><span class="p">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">logf</span><span class="p">((</span><span class="mf">0.5</span><span class="n">f</span><span class="o">-</span><span class="n">S</span><span class="p">.</span><span class="n">y</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">H</span><span class="p">.</span><span class="n">y</span><span class="o">-</span><span class="n">S</span><span class="p">.</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="n">logf</span><span class="p">(</span><span class="n">midGrey</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
<span class="n">rawParams</span><span class="p">.</span><span class="n">m_gammaAdjust</span><span class="p">.</span><span class="n">z</span> <span class="o">=</span> <span class="n">logf</span><span class="p">((</span><span class="mf">0.5</span><span class="n">f</span><span class="o">-</span><span class="n">S</span><span class="p">.</span><span class="n">z</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">H</span><span class="p">.</span><span class="n">z</span><span class="o">-</span><span class="n">S</span><span class="p">.</span><span class="n">z</span><span class="p">))</span><span class="o">/</span><span class="n">logf</span><span class="p">(</span><span class="n">midGrey</span><span class="p">.</span><span class="n">z</span><span class="p">);</span>
</code></pre></div></div>

<p>The code to actually apply that correction is as follows:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">FilmicColorGrading</span><span class="o">::</span><span class="n">ApplyLiftInvGammaGain</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="n">lift</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">invGamma</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">gain</span><span class="p">,</span> <span class="kt">float</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
	<span class="c1">// lerp gain</span>
	<span class="kt">float</span> <span class="n">lerpV</span> <span class="o">=</span> <span class="n">Saturate</span><span class="p">(</span><span class="n">powf</span><span class="p">(</span><span class="n">v</span><span class="p">,</span><span class="n">invGamma</span><span class="p">));</span>
	<span class="kt">float</span> <span class="n">dst</span> <span class="o">=</span> <span class="n">gain</span><span class="o">*</span><span class="n">lerpV</span> <span class="o">+</span> <span class="n">lift</span><span class="o">*</span><span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="o">-</span><span class="n">lerpV</span><span class="p">);</span>
	<span class="k">return</span> <span class="n">dst</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">Vec3</span> <span class="n">FilmicColorGrading</span><span class="o">::</span><span class="n">EvalParams</span><span class="o">::</span><span class="n">EvalLiftGammaGain</span><span class="p">(</span><span class="n">Vec3</span> <span class="n">v</span><span class="p">)</span> <span class="k">const</span>
<span class="p">{</span>
	<span class="n">Vec3</span> <span class="n">ret</span><span class="p">;</span>
	<span class="n">ret</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">ApplyLiftInvGammaGain</span><span class="p">(</span><span class="n">m_liftAdjust</span><span class="p">.</span><span class="n">x</span><span class="p">,</span><span class="n">m_invGammaAdjust</span><span class="p">.</span><span class="n">x</span><span class="p">,</span><span class="n">m_gainAdjust</span><span class="p">.</span><span class="n">x</span><span class="p">,</span><span class="n">v</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
	<span class="n">ret</span><span class="p">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">ApplyLiftInvGammaGain</span><span class="p">(</span><span class="n">m_liftAdjust</span><span class="p">.</span><span class="n">y</span><span class="p">,</span><span class="n">m_invGammaAdjust</span><span class="p">.</span><span class="n">y</span><span class="p">,</span><span class="n">m_gainAdjust</span><span class="p">.</span><span class="n">y</span><span class="p">,</span><span class="n">v</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
	<span class="n">ret</span><span class="p">.</span><span class="n">z</span> <span class="o">=</span> <span class="n">ApplyLiftInvGammaGain</span><span class="p">(</span><span class="n">m_liftAdjust</span><span class="p">.</span><span class="n">z</span><span class="p">,</span><span class="n">m_invGammaAdjust</span><span class="p">.</span><span class="n">z</span><span class="p">,</span><span class="n">m_gainAdjust</span><span class="p">.</span><span class="n">z</span><span class="p">,</span><span class="n">v</span><span class="p">.</span><span class="n">z</span><span class="p">);</span>
	<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>

</code></pre></div></div>

<p>You can check out the source code for more details, of course. Here is an example of pushing the values farther than you should probably go.</p>

<div style="text-align:center;"><img src="/images/2017_03_30_filmic_curve/filmic_lift_gamma_gain.jpg" /><br /><strong>Lift/Gamma/Gain</strong></div>

<h3>LUT Baking</h3>

<p>To speed up the process, we can convert most of these operations into a LUT. These operations are cheap enough to apply directly:</p>

<ol>
<li>Exposure</li>
<li>Color Filter</li>
<li>Saturation</li>
</ol>

<p>The remaining operations are then baked into a LUT:</p>
<ol>
<li>Log-Space Contrast</li>
<li>Filmic Tone Curve</li>
<li>Display Gamma</li>
<li>Lift/Gamma/Gain</li>
</ol>

<p>One obvious issue is going to be dynamic range. If we take these combined operations and bake them into a curve with evenly spaced entries, we will have way too much precision in the whites and not enough in the blacks. In the original filmic tonemapping curve, HP used the Cineon node, which has linear to log conversions. But if all we want to do is compress the range, a simple <strong>sqrt(x)</strong> or <strong>sqrt(sqrt(x))</strong> is usually enough.</p>
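
<p>To make the precision argument concrete, here is a minimal sketch of baking a 1D table with a spacing function. The <strong>Spacing</strong> enum, <strong>ApplySpacing()</strong>, and <strong>BakeCurve()</strong> are hypothetical names for illustration and are not taken from the published source, and the sketch assumes the input has already been normalized to [0,1].</p>

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical spacing modes: how table coordinates map to linear input values.
enum class Spacing { Linear, Sqrt, SqrtSqrt };

// Table coordinate t in [0,1] -> normalized linear input in [0,1].
float ApplySpacing(float t, Spacing spacing)
{
	switch (spacing)
	{
	case Spacing::Sqrt:     return t * t;
	case Spacing::SqrtSqrt: return (t * t) * (t * t);
	default:                return t;
	}
}

// Inverse mapping: normalized linear input in [0,1] -> table coordinate in [0,1].
float ApplySpacingInv(float x, Spacing spacing)
{
	switch (spacing)
	{
	case Spacing::Sqrt:     return std::sqrt(x);
	case Spacing::SqrtSqrt: return std::sqrt(std::sqrt(x));
	default:                return x;
	}
}

// Bake an arbitrary per-channel curve into a table.
// Entry i stores the curve evaluated at the linear value that maps to table coordinate i/(N-1).
std::vector<float> BakeCurve(const std::function<float(float)>& curve, int numEntries, Spacing spacing)
{
	assert(numEntries >= 2);
	std::vector<float> table(numEntries);
	for (int i = 0; i < numEntries; i++)
	{
		const float t = float(i) / float(numEntries - 1);
		table[i] = curve(ApplySpacing(t, spacing));
	}
	return table;
}
```

<p>With the sqrt spacing in this sketch, half of the table entries cover the bottom quarter of the input range. With sqrt(sqrt(x)), half of the entries cover the bottom sixteenth.</p>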

<p>Here is the C++ version. In a shader, the ApplySpacingInv() function would just be <strong>sqrt(x)</strong> or <strong>sqrt(sqrt(x))</strong>. SampleTable() is effectively a tex2D, although you have to add a half-pixel pad so that your lookup starts and ends at the right place.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Vec3</span> <span class="n">FilmicColorGrading</span><span class="o">::</span><span class="n">BakedParams</span><span class="o">::</span><span class="n">EvalColor</span><span class="p">(</span><span class="k">const</span> <span class="n">Vec3</span> <span class="n">srcColor</span><span class="p">)</span> <span class="k">const</span>
<span class="p">{</span>
	<span class="n">Vec3</span> <span class="n">rgb</span> <span class="o">=</span> <span class="n">srcColor</span><span class="p">;</span>

	<span class="c1">// exposure and color filter</span>
	<span class="n">rgb</span> <span class="o">=</span> <span class="n">rgb</span> <span class="o">*</span> <span class="n">m_linColorFilterExposure</span><span class="p">;</span>

	<span class="c1">// saturation</span>
	<span class="kt">float</span> <span class="n">grey</span> <span class="o">=</span> <span class="n">Vec3</span><span class="o">::</span><span class="n">Dot</span><span class="p">(</span><span class="n">rgb</span><span class="p">,</span><span class="n">m_luminanceWeights</span><span class="p">);</span>
	<span class="n">rgb</span> <span class="o">=</span> <span class="n">Vec3</span><span class="p">(</span><span class="n">grey</span><span class="p">)</span> <span class="o">+</span> <span class="n">m_saturation</span><span class="o">*</span><span class="p">(</span><span class="n">rgb</span> <span class="o">-</span> <span class="n">Vec3</span><span class="p">(</span><span class="n">grey</span><span class="p">));</span>

	<span class="n">rgb</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">ApplySpacingInv</span><span class="p">(</span><span class="n">rgb</span><span class="p">.</span><span class="n">x</span><span class="p">,</span><span class="n">m_spacing</span><span class="p">);</span>
	<span class="n">rgb</span><span class="p">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">ApplySpacingInv</span><span class="p">(</span><span class="n">rgb</span><span class="p">.</span><span class="n">y</span><span class="p">,</span><span class="n">m_spacing</span><span class="p">);</span>
	<span class="n">rgb</span><span class="p">.</span><span class="n">z</span> <span class="o">=</span> <span class="n">ApplySpacingInv</span><span class="p">(</span><span class="n">rgb</span><span class="p">.</span><span class="n">z</span><span class="p">,</span><span class="n">m_spacing</span><span class="p">);</span>

	<span class="c1">// contrast, filmic curve, gamma</span>
	<span class="n">rgb</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">SampleTable</span><span class="p">(</span><span class="n">m_curveR</span><span class="p">,</span><span class="n">rgb</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
	<span class="n">rgb</span><span class="p">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">SampleTable</span><span class="p">(</span><span class="n">m_curveG</span><span class="p">,</span><span class="n">rgb</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
	<span class="n">rgb</span><span class="p">.</span><span class="n">z</span> <span class="o">=</span> <span class="n">SampleTable</span><span class="p">(</span><span class="n">m_curveB</span><span class="p">,</span><span class="n">rgb</span><span class="p">.</span><span class="n">z</span><span class="p">);</span>

	<span class="k">return</span> <span class="n">rgb</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
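
<p>SampleTable() is not shown above. As a rough sketch of what such a lookup could look like, assuming the table is a plain array of floats and the coordinate has already been mapped to [0,1], the version below linearly interpolates between neighboring entries. <strong>TableCoordToTexU()</strong> is a hypothetical helper that shows the half-pixel pad you would need if the same table were sampled as a texture with bilinear filtering. Neither function is taken from the published source.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Linearly interpolated lookup: t = 0 returns the first entry, t = 1 returns the last.
float SampleTable(const std::vector<float>& table, float t)
{
	assert(table.size() >= 2);
	const int last = int(table.size()) - 1;
	// Clamp so that out-of-range inputs hold the end values.
	const float pos = std::min(std::max(t, 0.0f), 1.0f) * float(last);
	// Keep i0 + 1 inside the table when pos lands exactly on the last entry.
	const int i0 = std::min(int(pos), last - 1);
	const float frac = pos - float(i0);
	return table[i0] + frac * (table[i0 + 1] - table[i0]);
}

// Texture coordinate for the same lookup on a GPU.
// Texel centers sit at (i + 0.5) / numEntries, so t = 0 and t = 1 are remapped
// to the centers of the first and last texels rather than the texture edges.
float TableCoordToTexU(float t, int numEntries)
{
	assert(numEntries >= 2);
	return (t * float(numEntries - 1) + 0.5f) / float(numEntries);
}
```

<p>For a 4-entry table, this remaps the range [0,1] to [0.125,0.875], which is the half-pixel pad on each side.</p>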

<h3>Additional Features</h3>
<p>The list above is by no means exhaustive. One obvious missing feature is any selective color editing or hue shifting. Those features can of course be added.</p>

<p>If you add any features that have crosstalk between channels, then you will need to switch from 3x 1D LUTs to a single 3D LUT. If you do, just make sure to be careful with your precision.</p>
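
<p>For reference, a 3D LUT lookup could be sketched as below. This is only an illustration: the <strong>Color</strong> and <strong>Lut3D</strong> types, the memory layout (red varying fastest), and <strong>SampleLut3D()</strong> are all assumptions, and the inputs are assumed to have already been mapped to [0,1]. It uses trilinear interpolation, which is what a GPU would give you with a filtered 3D texture fetch.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Color { float r, g, b; };

// Hypothetical size*size*size LUT, stored with red as the fastest-varying index.
struct Lut3D
{
	int size;
	std::vector<Color> data;

	const Color& At(int r, int g, int b) const { return data[(b * size + g) * size + r]; }
};

static Color Lerp(const Color& a, const Color& b, float t)
{
	return { a.r + t * (b.r - a.r), a.g + t * (b.g - a.g), a.b + t * (b.b - a.b) };
}

// Trilinear lookup. Each input channel is clamped to [0,1] before indexing.
Color SampleLut3D(const Lut3D& lut, const Color& c)
{
	assert(lut.size >= 2);
	assert(int(lut.data.size()) == lut.size * lut.size * lut.size);

	const int last = lut.size - 1;
	const float in[3] = { c.r, c.g, c.b };
	int i0[3];
	float f[3];
	for (int k = 0; k < 3; k++)
	{
		const float pos = std::min(std::max(in[k], 0.0f), 1.0f) * float(last);
		i0[k] = std::min(int(pos), last - 1); // lower corner of the cell
		f[k] = pos - float(i0[k]);            // fractional position inside the cell
	}

	const int r = i0[0], g = i0[1], b = i0[2];

	// Interpolate along red on the four edges of the cell, then green, then blue.
	const Color c00 = Lerp(lut.At(r, g, b), lut.At(r + 1, g, b), f[0]);
	const Color c10 = Lerp(lut.At(r, g + 1, b), lut.At(r + 1, g + 1, b), f[0]);
	const Color c01 = Lerp(lut.At(r, g, b + 1), lut.At(r + 1, g, b + 1), f[0]);
	const Color c11 = Lerp(lut.At(r, g + 1, b + 1), lut.At(r + 1, g + 1, b + 1), f[0]);
	const Color c0 = Lerp(c00, c10, f[1]);
	const Color c1 = Lerp(c01, c11, f[1]);
	return Lerp(c0, c1, f[2]);
}
```

<p>Note that the entry count grows with the cube of the size, so a 3D LUT typically has far fewer entries per axis than a 1D table, which is one reason to watch your precision.</p>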

<p>Another common feature to add is an additional 3D LUT for convolution. You could use whatever tools you like inside Nuke/Fusion/Photoshop/etc and bake the result into a 3D LUT. The workflow gets more complicated and you have to be very careful with your color conversions, but it is an approach that has served many games well.</p>

<p>And it is worth repeating that color correction is subjective. There is no right or wrong way to do it. That being said, the operations listed here should be a good starting point for most realtime applications.</p>

<h3>Source Code</h3>
<p>The source code for these operations is available on GitHub under a permissive CC0 license. I often want to share small chunks of code but end up keeping them private because of the time required to package everything up cleanly. So that’s what the GitHub account is for. Honestly, the code is not as pretty as I would like (and has a bunch of warnings that I’m too busy to fix) but it should give you a solid reference point for testing these ideas out.
<a href="https://github.com/johnhable/fw-public">github.com/johnhable/fw-public</a></p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[As you have probably noticed, there are a lot of color correction algorithms out there. It can be a bit daunting to figure out the best combination of tools to use on a given project. I’ve personally gone through many iterations of trying different things in different orders and over time I’ve settled on a few key features.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://filmicworlds.com/%7B%22feature%22=%3E%222017_03_30_filmic_curve/grading_header.jpg%22%7D" /><media:content medium="image" url="https://filmicworlds.com/%7B%22feature%22=%3E%222017_03_30_filmic_curve/grading_header.jpg%22%7D" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>