When is MSAA Really Needed?

GaemzDood

Well-Known Member
#1
I've been thinking about this since the launch of Deus Ex: Mankind Divided. I personally will never give up on MSAA, even with the rise of deferred rendering and PBR, but one has to ask: when is it actually needed? In Deus Ex it was arguably broken, given how badly it crippled performance and how little it did to reduce crawling on alpha-tested textures. It was clearly a last-minute addition, even though in theory it should have been implemented better, since MSAA is easier to work with under DX 11.0 compute tile-based deferred shading/lighting. Battlefield 3 shows this: I could run 4x MSAA plus FXAA on my 960 and still get 60 FPS.

While Ubisoft's games fared better performance-wise, they still did nothing to reduce alpha crawling. In Far Cry 4, ATOC (alpha to coverage) was only accessible via a config file and did almost nothing compared to Far Cry 3. Every deferred Assassin's Creed game just slapped FXAA on top of MSAA to slightly anti-alias alpha textures, which still led to shimmer everywhere. Watch Dogs' MSAA did nothing for anything that wasn't geometry; Temporal SMAA had better coverage. Rainbow Six: Siege's MSAA has the same limitation, but since you can combine it with the game's built-in TAA, it's not much of an issue.

So how is deferred MSAA done right? DX 10.1 introduced the SV_SampleIndex and SV_Coverage system value semantics, which let you solve it with multiple passes at pixel and sample frequency. While it sounds simple, even Forward+ rendered games like Dirt Rally show all sorts of artifacts, like white/black aliased outlines, or the AA simply breaking down.

The guys at Crytek, who pretty much mastered deferred MSAA in my opinion, have an entire guide.
ANTIALIASING\DEFERRED MSAA REVIEW
The problem: Multiple passes + r/w from Multisampled RTs
- DX 10.1 introduced SV_SampleIndex / SV_Coverage system value semantics.
- Allows solving via multipass for pixel/sample frequency passes [Thibieroz11]
SV_SampleIndex
- Forces pixel shader execution for each sub-sample and provides the index of the sub-sample currently executed
- Index can be used to fetch a sub-sample from a Multisampled RT. E.g. FooMS.Load(UnnormScreenCoord, nSampleIndex)
SV_Coverage
- Indicates to the pixel shader which sub-samples were covered during the raster stage.
- Can also modify sub-sample coverage to apply a custom coverage mask
DX 11.0 compute tile-based deferred shading/lighting with MSAA is simpler
- Loop through MSAA tagged sub-samples

DEFERRED MSAA\HEADS UP !
Simple theory, troublesome practice
- At least with complex deferred renderers
Non-MSAA friendly code accumulates fast.
- Breaks regularly, as new techniques get added without MSAA consideration
- Even if it still works, you'll very often need to pinpoint and fix non-MSAA friendly techniques, as these introduce visual artifacts.
- E.g. white/dark outlines, or no AA at all
Do it upfront. Retrofitting a renderer to support Deferred MSAA is some work
- And it is very finicky

DEFERRED MSAA\CUSTOM RESOLVE & PER-SAMPLE MASK
Post G-Buffer, perform a custom MSAA resolve
- Pre-resolves sample 0, for pixel frequency passes such as lighting and other MSAA dependent passes
- In the same pass, create a sub-sample mask (compare the samples for similarity, mark if mismatching)
- Avoid default SV_COVERAGE, since it results in redundant processing on regions not requiring MSAA
[Images: SV_Coverage | Custom Per-Sample Mask]
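
To make the custom resolve a bit more concrete, here's a rough CPU-side sketch in Python (not shader code). The G-buffer contents (depth plus normal), the `needs_per_sample` and `custom_resolve` names, and both thresholds are my own assumptions for illustration; the slides don't say which attributes Crytek compares.

```python
# CPU-side illustration only; a real implementation would be a full-screen pixel shader.
from typing import List, Tuple

Sample = Tuple[float, Tuple[float, float, float]]  # (depth, normal): assumed G-buffer contents


def needs_per_sample(samples: List[Sample],
                     depth_eps: float = 0.01,       # hypothetical threshold
                     normal_eps: float = 0.95) -> bool:  # hypothetical threshold
    """Return True if any sub-sample differs noticeably from sample 0."""
    d0, n0 = samples[0]
    for d, n in samples[1:]:
        dot = n0[0] * n[0] + n0[1] * n[1] + n0[2] * n[2]
        if abs(d - d0) > depth_eps or dot < normal_eps:
            return True
    return False


def custom_resolve(pixel_samples: List[List[Sample]]):
    """One pass: pre-resolve sample 0 and build the per-sample mask."""
    resolved = [s[0] for s in pixel_samples]             # sample 0 feeds pixel-frequency passes
    mask = [needs_per_sample(s) for s in pixel_samples]  # True = shade at sample frequency
    return resolved, mask


flat = [(0.5, (0.0, 0.0, 1.0))] * 4                            # interior pixel, all samples equal
edge = [(0.5, (0.0, 0.0, 1.0)), (0.9, (0.0, 1.0, 0.0))] * 2    # pixel straddling a geometry edge
resolved, mask = custom_resolve([flat, edge])
print(mask)  # [False, True]
```

Only the edge pixel gets flagged, which is the whole point: interior pixels stay on the cheap per-pixel path.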

DEFERRED MSAA\STENCIL BATCHING [SOUSA13]
Batching per-sample stencil mask with regular stencil buffer usage
- Reserve 1 bit from the stencil buffer
- Update it with the sub-sample mask
- Tag the entire pixel-quad instead of just a single pixel -> improves stencil culling efficiency
- Use the stencil read/write bitmask to avoid overwriting the per-sample bit
- StencilWriteMask = 0x7F
- Restore whenever a stencil clear occurs
Not possible due to extreme stencil usage?
- Use clip/discard
- Extra overhead also from the additional texture read for the per-sample mask
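
Here's a small Python sketch of the two ideas on this slide, the masked stencil write and the quad tagging. It emulates the hardware behavior on the CPU; the function names and the row-major mask layout are assumptions of mine, and only the 0x7F write mask comes from the slide (0x80 as the reserved bit follows from it).

```python
PER_SAMPLE_BIT = 0x80      # the one reserved stencil bit
STENCIL_WRITE_MASK = 0x7F  # regular stencil writes may only touch the low 7 bits


def stencil_write(current: int, value: int, write_mask: int = STENCIL_WRITE_MASK) -> int:
    """Emulate a masked stencil write: only bits set in write_mask are replaced."""
    return (current & ~write_mask & 0xFF) | (value & write_mask)


def tag_quads(mask, width, height):
    """Expand a per-pixel mask (row-major list of bools) so whole 2x2 quads are tagged."""
    out = [False] * (width * height)
    for y in range(height):
        for x in range(width):
            if mask[y * width + x]:
                qx, qy = x & ~1, y & ~1  # top-left pixel of the containing quad
                for dy in (0, 1):
                    for dx in (0, 1):
                        if qx + dx < width and qy + dy < height:
                            out[(qy + dy) * width + qx + dx] = True
    return out


# A regular stencil write leaves the reserved bit alone.
tagged = stencil_write(PER_SAMPLE_BIT | 0x05, 0x12)
untagged = stencil_write(0x05, 0xFF)

# A single flagged pixel at (1, 0) in a 4x2 image tags its whole quad.
quads = tag_quads([False, True, False, False,
                   False, False, False, False], 4, 2)
```

With the write mask in place, other stencil users can keep writing their own values without wiping the per-sample tag.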

DEFERRED MSAA\PIXEL AND SAMPLE FREQUENCY PASSES
Pixel Frequency Passes
- Set stencil read mask to reserved bits for per-pixel regions (~0x80)
- Bind pre-resolved (non-multisampled) targets SRVs
- Render pass as usual
Sample Frequency Passes
- Set stencil read mask to reserved bit for per-sample regions (0x80)
- Bind multisampled targets SRVs
- Index current sub-sample via SV_SAMPLEINDEX
- Render pass as usual
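
For anyone unsure how the read masks select regions, this is a minimal Python emulation of a D3D-style EQUAL stencil test, where both the reference and the stored value are ANDed with the read mask before comparing. The function names and the reference values are my assumptions; the slide only gives the two read masks.

```python
PER_SAMPLE_BIT = 0x80


def stencil_equal(ref: int, stored: int, read_mask: int) -> bool:
    """EQUAL stencil test: compare ref and stored value after applying the read mask."""
    return (ref & read_mask) == (stored & read_mask)


def runs_sample_pass(stored: int) -> bool:
    """Sample-frequency pass: read mask 0x80, so only the reserved bit is tested."""
    return stencil_equal(PER_SAMPLE_BIT, stored, 0x80)


def runs_pixel_pass(stored: int, ref: int) -> bool:
    """Pixel-frequency pass: read mask ~0x80 (0x7F), so the reserved bit is ignored."""
    return stencil_equal(ref, stored, 0x7F)


print(runs_sample_pass(0x83), runs_sample_pass(0x03))            # True False
print(runs_pixel_pass(0x83, 0x03), runs_pixel_pass(0x03, 0x03))  # True True
```

So with ~0x80 the regular stencil logic behaves exactly as it would without MSAA, and the 0x80 mask restricts the expensive per-sample work to tagged pixels.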

DEFERRED MSAA\ALPHA TEST SSAA
Alpha testing requires ad hoc solution
- Default SV_Coverage only applies to triangle edges
Create your own sub-sample coverage mask
- E.g. check if current sub-sample uses AT or not and set bit
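// nSampleCount must not exceed the number of entries in vMSAAOffsets (2 here).
// uCoverageMask is assumed to be the SV_Coverage output, initialized to 0 before the loop.
static const float2 vMSAAOffsets[2] = {float2(0.25, 0.25), float2(-0.25, -0.25)};
const float2 vDDX = ddx(vTexCoord.xy);
const float2 vDDY = ddy(vTexCoord.xy);
[unroll] for(int s = 0; s < nSampleCount; ++s)
{
    // Offset the UV to this sub-sample's position inside the pixel footprint
    float2 vTexOffset = vMSAAOffsets[s].x * vDDX + vMSAAOffsets[s].y * vDDY;
    float fAlpha = tex2D(DiffuseSmp, vTexCoord + vTexOffset).w;
    // Set this sub-sample's coverage bit if it passes the alpha test
    uCoverageMask |= ((fAlpha - fAlphaRef) >= 0) ? (uint(0x1) << s) : 0;
}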
[Images: Alpha Test SSAA Disabled | Alpha Test SSAA Enabled]
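
If you want to see what the shader loop produces, here's the same logic emulated in Python on the CPU. The `alpha_coverage_mask` helper, the `hard_edge` texture, and the derivative values are all made up for the example; only the two sample offsets come from the slide.

```python
def alpha_coverage_mask(sample_alpha, uv, ddx, ddy, offsets, alpha_ref):
    """Build a coverage bitmask by alpha-testing the texture once per sub-sample."""
    mask = 0
    for s, (ox, oy) in enumerate(offsets):
        # Shift the UV by the sub-sample offset, scaled by the screen-space derivatives
        u = uv[0] + ox * ddx[0] + oy * ddy[0]
        v = uv[1] + ox * ddx[1] + oy * ddy[1]
        if sample_alpha(u, v) >= alpha_ref:
            mask |= 1 << s
    return mask


# Hypothetical alpha channel: opaque on the left half, transparent on the right.
def hard_edge(u, v):
    return 1.0 if u < 0.5 else 0.0


offsets = [(0.25, 0.25), (-0.25, -0.25)]
mask = alpha_coverage_mask(hard_edge, (0.5, 0.5), (0.1, 0.0), (0.0, 0.1), offsets, 0.5)
print(bin(mask))  # 0b10
```

The pixel sits right on the alpha edge, so only one of its two sub-samples passes the test. That partial coverage is what gives alpha-tested foliage and fences an anti-aliased edge.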

DEFERRED MSAA\PERFORMANCE SHORTCUTS
Deferred cascaded sun shadow maps
- Render shadows as usual at pixel frequency
- Bilateral upscale during deferred shading composite pass

DEFERRED MSAA\PERFORMANCE SHORTCUTS (2)
Non-opaque techniques accessing depth (e.g. Soft-Particles)
- The usual recommendation to handle these at per-sample frequency is fairly slow in real-world scenarios
- Using Max Depth works OK for most cases and is N times faster
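
A quick Python sketch of the trade-off, assuming depth grows with distance (no reversed Z) and a simple linear soft-particle fade. Both helper names, the fade formula, and the numbers are my own assumptions, not taken from the slides.

```python
def soft_particle_fade(scene_depth: float, particle_depth: float, scale: float = 1.0) -> float:
    """Fade a particle out as it approaches the opaque scene behind it (clamped to 0..1)."""
    return min(max((scene_depth - particle_depth) * scale, 0.0), 1.0)


def fade_per_sample(sample_depths, particle_depth):
    """Reference path: evaluate the fade for every sub-sample and average (N evaluations)."""
    fades = [soft_particle_fade(d, particle_depth) for d in sample_depths]
    return sum(fades) / len(fades)


def fade_max_depth(sample_depths, particle_depth):
    """Shortcut: resolve depth to the farthest sub-sample, then evaluate the fade once."""
    return soft_particle_fade(max(sample_depths), particle_depth)


edge_pixel = [10.0, 10.0, 2.0, 2.0]  # hypothetical 4x MSAA depths across a silhouette
print(fade_per_sample(edge_pixel, 1.5))  # 0.75
print(fade_max_depth(edge_pixel, 1.5))   # 1.0
```

The results only differ on edge pixels like this one, and for soft particles the error is usually hard to spot, which is why the shortcut is good enough in most cases.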

DEFERRED MSAA\PERFORMANCE SHORTCUTS (3)
Many games also do the following:
- Skip Alpha Test Super Sampling (which I do not recommend you do, as it really sticks out like a sore thumb)
- Use alpha to coverage instead, or even no alpha test AA (let morphological AA tackle that)
- Render only opaque geometry with MSAA
- Then render transparents without MSAA
- Assuming HDR rendering: note that tone mapping is implicitly done post-resolve, resulting in loss of detail in high-contrast regions
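
The tone mapping point is easy to show with numbers. Below is a small Python sketch using the simple Reinhard operator x / (1 + x) as a stand-in; the operator choice and the sample values are my assumptions, since the slides don't name a tone mapper.

```python
def reinhard(x: float) -> float:
    """Simple Reinhard tone mapping operator (assumed here purely for illustration)."""
    return x / (1.0 + x)


def resolve(values):
    """Box-filter MSAA resolve: plain average of the sub-samples."""
    return sum(values) / len(values)


# Hypothetical edge pixel: two sub-samples on a dark surface, two on a very bright one.
samples = [0.05, 0.05, 50.0, 50.0]

resolve_then_map = reinhard(resolve(samples))               # what post-resolve tone mapping gives
map_then_resolve = resolve([reinhard(s) for s in samples])  # tone map each sub-sample first

print(round(resolve_then_map, 3))
print(round(map_then_resolve, 3))
```

Resolving first leaves the edge pixel almost as bright as the fully lit surface, so the edge gradient is lost and the silhouette looks aliased again. Tone mapping per sub-sample keeps it roughly halfway between the two surfaces.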

DEFERRED MSAA\MSAA FRIENDLINESS
Look out for these:
- MSAA not visibly working, or noticeable bright/dark silhouettes.

DEFERRED MSAA\RECAP
Accessing and/or rendering to Multisampled RTs?
- Then you need to care about accessing and outputting the correct sub-sample
In general always strive to minimize bandwidth (BW)
- Avoid vanilla deferred lighting
- Prefer fully deferred, hybrids, or just skip deferred altogether.
- If deferred, prefer thin g-buffers
- Each additional target on the g-buffer incurs export rate overhead [Thibieroz11]
- NV/AMD (GCN): Export Cost = Cost(RT0) + Cost(RT1) + ...
- AMD (older hw): Export Cost = (Num RTs) * (Slowest RT)
- Fat formats sample at half rate for bilinear filtering modes on GCN [Thibieroz13]
- For lighting/some HDR post processes: the 32-bit R11G11B10F format suffices for most cases
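
The two export cost formulas are simple enough to check by hand, but here they are as a Python sketch. The per-target costs are made-up relative units just to show the difference; they are not measurements of any real format.

```python
def export_cost_sum(rt_costs):
    """NV / AMD GCN model from the slide: the cost of each render target adds up."""
    return sum(rt_costs)


def export_cost_older_amd(rt_costs):
    """Older AMD hardware model from the slide: every target costs as much as the slowest one."""
    return len(rt_costs) * max(rt_costs)


# Hypothetical g-buffer: two thin targets and one fat target (relative cost units).
gbuffer = [1, 1, 2]
print(export_cost_sum(gbuffer))        # 4
print(export_cost_older_amd(gbuffer))  # 6
```

Either way, dropping or thinning a target lowers the total, but on the older AMD model a single fat target drags every other target down with it.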
 