When is MSAA Really Needed?


Well-Known Member
After the launch of Deus Ex: Mankind Divided, I've been thinking about this. I personally will never give up on MSAA, even with the rise of deferred rendering and PBR, but one must ask: when is it actually needed? In the case of Deus Ex, it was arguably broken, given how badly it crippled performance and how little it did to reduce crawling on alpha textures. It was clearly a last-minute thing they forced in, even though in theory it should have been implemented better: MSAA with DX 11.0 compute-based tiled deferred shading/lighting is easier to work with, as seen in Battlefield 3, where I could run 4x MSAA plus FXAA on my 960 and still get 60 FPS.

While Ubisoft's games fared better performance-wise, they still did nothing to reduce alpha crawling. In Far Cry 4, ATOC (alpha to coverage) was only accessible via a config file and did almost nothing compared to Far Cry 3. Every deferred Assassin's Creed game just slapped FXAA on top of MSAA to slightly anti-alias alpha textures (which still led to shimmer everywhere), and Watch Dogs' MSAA did basically nothing for anything that wasn't geometry; Temporal SMAA had better coverage. Rainbow Six: Siege's MSAA has the same problem, but since you can combine it with the game's built-in TAA, it's not much of an issue.

So how is deferred MSAA done right? DX 10.1 introduced the SV_SampleIndex / SV_Coverage system value semantics, which let you solve this with multiple passes, split into pixel-frequency and sample-frequency passes. While it sounds simple, even Forward+ games like Dirt Rally show all sorts of artifacts, like white/black aliased outlines or the AA simply breaking down.

The guys at Crytek, who pretty much mastered deferred MSAA in my opinion, have an entire guide:
The problem: Multiple passes + r/w from Multisampled RTs
• DX 10.1 introduced SV_SampleIndex / SV_Coverage system value semantics.
• Allows solving via multipass for pixel/sample frequency passes [Thibieroz11]
• SV_SampleIndex: forces pixel shader execution for each sub-sample and provides the index of the sub-sample currently executed
• Index can be used to fetch a sub-sample from a Multisampled RT. E.g. FooMS.Load( UnnormScreenCoord, nSampleIndex)
• SV_Coverage: indicates to the pixel shader which sub-samples were covered during the raster stage.
• Can also modify sub-sample coverage for a custom coverage mask
DX 11.0 Compute Tiled based deferred shading/lighting MSAA is simpler
• Loop through MSAA tagged sub-samples

Simple theory, troublesome practice
• At least with complex deferred renderers
Non-MSAA friendly code accumulates fast.
• Breaks regularly, as new techniques get added without MSAA consideration
• Even if it still works, you’ll very often need to pinpoint and fix non-MSAA-friendly techniques, as these introduce visual artifacts
• E.g. white/dark outlines, or no AA at all
Do it upfront. Retrofitting a renderer to support Deferred MSAA is some work
• And it is very finicky

• Post G-Buffer, perform a custom MSAA resolve
• Pre-resolves sample 0, for pixel frequency passes such as lighting/other MSAA dependent passes
• In the same pass, create a sub-sample mask (compare sample similarity, mark if mismatching)
• Avoid default SV_COVERAGE, since it results in redundant processing on regions not requiring MSAA
[Comparison images: SV_Coverage vs. Custom Per-Sample Mask]
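To make the custom resolve idea a bit more concrete, here is a minimal CPU-side sketch in C++ (not shader code, and not Crytek's actual implementation). It keeps sample 0 as the pre-resolved value and flags a pixel for per-sample shading when its samples disagree. The `GBufferSample` struct, the depth-only comparison and the `depthThreshold` parameter are all my own assumptions for illustration; a real renderer would likely compare more than depth.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical G-buffer sample: only depth is compared here, for brevity.
struct GBufferSample {
    float depth;
};

struct ResolveResult {
    std::vector<GBufferSample> resolved; // one value per pixel (sample 0)
    std::vector<bool> perSampleMask;     // true where the samples of a pixel disagree
};

// 'samples' holds pixelCount * sampleCount entries; the samples of one pixel are contiguous.
ResolveResult customResolve(const std::vector<GBufferSample>& samples,
                            std::size_t sampleCount, float depthThreshold) {
    assert(sampleCount > 0 && samples.size() % sampleCount == 0);
    const std::size_t pixelCount = samples.size() / sampleCount;
    ResolveResult out;
    out.resolved.reserve(pixelCount);
    out.perSampleMask.reserve(pixelCount);
    for (std::size_t p = 0; p < pixelCount; ++p) {
        const GBufferSample& s0 = samples[p * sampleCount];
        out.resolved.push_back(s0); // pre-resolve: keep sample 0 for pixel frequency passes
        bool mismatch = false;
        for (std::size_t s = 1; s < sampleCount; ++s) {
            const float diff = std::fabs(samples[p * sampleCount + s].depth - s0.depth);
            if (diff > depthThreshold) {
                mismatch = true; // samples differ: this pixel needs per-sample shading
            }
        }
        out.perSampleMask.push_back(mismatch);
    }
    return out;
}
```

Interior pixels, where all samples match, end up unmarked, which is where the savings over shading every sample come from.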

• Batching per-sample stencil mask with regular stencil buffer usage
• Reserve 1 bit from stencil buffer
• Update with sub-sample mask
• Tag entire pixel-quad instead of just single pixel -> improves stencil culling efficiency
• Make use of stencil read/write bitmask to avoid per-sample bit override
• StencilWriteMask = 0x7F
• Restore whenever a stencil clear occurs
Not possible due to extreme stencil usage?
• Use clip/discard
• Extra overhead also from additional texture read for per-sample mask
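Here is a rough CPU-side model in C++ of the reserved stencil bit idea, just to show the bit arithmetic. It assumes the reserved bit is 0x80 (the complement of the 0x7F write mask from the slide) and a 2x2 pixel quad; the function names `stencilWrite` and `tagQuads` are made up for this sketch and are not a D3D API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kPerSampleBit = 0x80;     // assumed reserved stencil bit
constexpr std::uint8_t kRegularWriteMask = 0x7F; // regular passes may touch everything else

// Masked write, mirroring what a stencil write mask does: only bits set in writeMask change.
std::uint8_t stencilWrite(std::uint8_t current, std::uint8_t value, std::uint8_t writeMask) {
    return static_cast<std::uint8_t>((current & ~writeMask) | (value & writeMask));
}

// Tag whole 2x2 quads: if any pixel of a quad needs per-sample shading,
// the reserved bit is set on all four pixels of that quad.
void tagQuads(std::vector<std::uint8_t>& stencil, const std::vector<bool>& perSampleMask,
              std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(stencil.size() == width * height && perSampleMask.size() == width * height);
    for (std::size_t y = 0; y < height; y += 2) {
        for (std::size_t x = 0; x < width; x += 2) {
            const std::size_t i00 = y * width + x;
            const std::size_t quad[4] = {i00, i00 + 1, i00 + width, i00 + width + 1};
            bool any = false;
            for (std::size_t i : quad) any = any || perSampleMask[i];
            const std::uint8_t value = any ? kPerSampleBit : 0;
            // Only the reserved bit is written; the lower 7 bits keep their contents.
            for (std::size_t i : quad) stencil[i] = stencilWrite(stencil[i], value, kPerSampleBit);
        }
    }
}
```

With the 0x7F write mask, regular stencil passes can write whatever they like into the lower seven bits without wiping the per-sample tag.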

Pixel Frequency Passes
• Set stencil read mask to reserved bits for per-pixel regions (~0x80)
• Bind pre-resolved (non-multisampled) targets SRVs
• Render pass as usual
Sample Frequency Passes
• Set stencil read mask to reserved bit for per-sample regions (0x80)
• Bind multisampled targets SRVs
• Index current sub-sample via SV_SAMPLEINDEX
• Render pass as usual
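To show why splitting into pixel and sample frequency passes pays off, here is a small C++ sketch that counts shading invocations. It is only one possible reading of the slide: I am assuming an EQUAL stencil comparison against the reserved bit 0x80, with reference 0x00 for the per-pixel pass and 0x80 for the per-sample pass. The exact mask and reference values in the actual engine may well differ, and `countShaderInvocations` is a made-up helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kPerSampleBit = 0x80; // assumed reserved stencil bit

// Stencil test with an EQUAL comparison: both sides are masked by the read mask first.
bool stencilTestEqual(std::uint8_t stencilValue, std::uint8_t ref, std::uint8_t readMask) {
    return (ref & readMask) == (stencilValue & readMask);
}

// Counts shading invocations when untagged pixels are shaded once
// and tagged pixels are shaded once per sub-sample.
std::size_t countShaderInvocations(const std::vector<std::uint8_t>& stencil,
                                   std::size_t sampleCount) {
    std::size_t invocations = 0;
    for (std::uint8_t s : stencil) {
        // Pixel frequency pass: passes where the reserved bit is clear.
        if (stencilTestEqual(s, 0x00, kPerSampleBit)) invocations += 1;
        // Sample frequency pass: passes where the reserved bit is set, runs per sub-sample.
        if (stencilTestEqual(s, kPerSampleBit, kPerSampleBit)) invocations += sampleCount;
    }
    return invocations;
}
```

For four pixels at 4x MSAA with two of them tagged, this gives 10 invocations instead of the 16 you would pay for shading every sample everywhere.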

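• Alpha testing requires ad hoc solution
• Default SV_Coverage only applies to triangle edges
Create your own sub-sample coverage mask
• E.g. check if current sub-sample uses AT or not and set bit
// Sub-sample offsets in pixel units (two entries, so nSampleCount must not exceed 2 here)
static const float2 vMSAAOffsets[2] = { float2(0.25, 0.25), float2(-0.25, -0.25) };
const float2 vDDX = ddx(vTexCoord.xy);
const float2 vDDY = ddy(vTexCoord.xy);
// uCoverageMask is assumed to be declared and zero-initialized before the loop
[unroll] for (int s = 0; s < nSampleCount; ++s)
{
    // Shift the UV to the sub-sample position using the screen-space derivatives
    float2 vTexOffset = vMSAAOffsets[s].x * vDDX + (vMSAAOffsets[s].y * vDDY);
    float fAlpha = tex2D(DiffuseSmp, vTexCoord + vTexOffset).w;
    // Set this sub-sample's bit if it passes the alpha test
    uCoverageMask |= ((fAlpha - fAlphaRef) >= 0) ? (uint(0x1) << s) : 0;
}
[Comparison images: Alpha Test SSAA Disabled vs. Alpha Test SSAA Enabled]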

• Deferred cascades sun shadow maps
• Render shadows as usual at pixel frequency
• Bilateral upscale during deferred shading composite pass

• Non-opaque techniques accessing depth (e.g. Soft-Particles)
• The usual recommendation, tackling these at per-sample frequency, is fairly slow in real-world scenarios
• Using Max Depth works OK for most cases and is N times faster
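The max depth trick can be sketched on the CPU like this (C++, purely illustrative). It collapses the per-sample depths of each pixel to their maximum once, so soft particles can read a single depth value per pixel. I am assuming a conventional depth range where larger means farther; with reversed Z you would take the minimum instead. The `softParticleFade` helper and its `fadeScale` parameter are hypothetical and not taken from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Collapses per-sample depths to one max depth per pixel.
// 'depthSamples' holds the samples of each pixel contiguously.
std::vector<float> resolveMaxDepth(const std::vector<float>& depthSamples,
                                   std::size_t sampleCount) {
    assert(sampleCount > 0 && depthSamples.size() % sampleCount == 0);
    std::vector<float> out;
    out.reserve(depthSamples.size() / sampleCount);
    for (std::size_t p = 0; p < depthSamples.size(); p += sampleCount) {
        float maxDepth = depthSamples[p];
        for (std::size_t s = 1; s < sampleCount; ++s) {
            maxDepth = std::max(maxDepth, depthSamples[p + s]);
        }
        out.push_back(maxDepth);
    }
    return out;
}

// Hypothetical soft particle fade: 0 where the particle touches the scene, 1 when far in front.
float softParticleFade(float sceneDepth, float particleDepth, float fadeScale) {
    const float t = (sceneDepth - particleDepth) * fadeScale;
    return std::min(1.0f, std::max(0.0f, t));
}
```

The particle pass then does one depth read per pixel instead of running at sample frequency, which is where the "N times faster" comes from.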

• Many games are also doing the following:
• Skipping Alpha Test Super Sampling (which I do not recommend, as it really sticks out like a sore thumb)
• Use alpha to coverage instead, or even no alpha test AA (let morphological AA tackle that)
• Render only opaque with MSAA
• Then render transparents without MSAA
• Assuming HDR rendering: note that tone mapping is implicitly done post-resolve, resulting in loss of detail in high-contrast regions
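A quick numeric example of the tone mapping point, in C++. It compares resolving HDR samples first and tone mapping afterwards against tone mapping each sample before averaging. The simple Reinhard operator x / (1 + x) is my own choice for illustration; the slides do not say which operator is used.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Simple Reinhard tone mapping operator (assumed here for illustration).
float tonemapReinhard(float x) { return x / (1.0f + x); }

// Average the HDR samples first, then tone map the result (tone mapping post-resolve).
float resolveThenTonemap(const std::vector<float>& samples) {
    assert(!samples.empty());
    float sum = 0.0f;
    for (float s : samples) sum += s;
    return tonemapReinhard(sum / static_cast<float>(samples.size()));
}

// Tone map every sample, then average the already tone-mapped values.
float tonemapThenResolve(const std::vector<float>& samples) {
    assert(!samples.empty());
    float sum = 0.0f;
    for (float s : samples) sum += tonemapReinhard(s);
    return sum / static_cast<float>(samples.size());
}
```

For an edge pixel with samples {0, 0, 0, 100}, resolving first gives about 0.96, so the pixel is nearly as bright as a fully covered one and the edge looks aliased again. Tone mapping per sample gives about 0.25, which matches the 1-in-4 coverage.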

• Look out for these:
• No MSAA noticeably working, or noticeable bright/dark silhouettes.

• Accessing and/or rendering to Multisampled RTs?
• Then you need to care about accessing and outputting the correct sub-sample
In general always strive to minimize bandwidth (BW)
• Avoid vanilla deferred lighting
• Prefer fully deferred, hybrids, or just skip deferred altogether.
• If deferred, prefer thin g-buffers
• Each additional target on the g-buffer incurs export rate overhead [Thibieroz11]
• NV/AMD (GCN): Export Cost = Cost(RT0) + Cost(RT1) + ...
• AMD (older hw): Export Cost = (Num RTs) * (Slowest RT)
• Fat formats are half rate sampling cost for bilinear filtering modes on GCN [Thibieroz13]
• For lighting/some HDR post processes: 32 bit R11G11B10F fmt suffices for most cases
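The two export cost rules quoted above are easy to compare with a few lines of C++. The per-target cost numbers are arbitrary units I made up for the example, not measured values, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// NV / AMD (GCN) rule from the slide: the export cost is the sum of the per-target costs.
double exportCostSummed(const std::vector<double>& rtCosts) {
    return std::accumulate(rtCosts.begin(), rtCosts.end(), 0.0);
}

// Older AMD hardware rule from the slide: number of targets times the slowest target.
double exportCostSlowestTimesCount(const std::vector<double>& rtCosts) {
    if (rtCosts.empty()) return 0.0;
    const double slowest = *std::max_element(rtCosts.begin(), rtCosts.end());
    return static_cast<double>(rtCosts.size()) * slowest;
}
```

With three targets costing {1, 1, 2}, the summed rule gives 4 while the older rule gives 6, so on older AMD hardware a single fat target makes every target in the g-buffer as expensive as the fat one.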


Active Member
When it comes to making games, 4x MSAA in the Unity engine is used by some indie devs who target mobile devices with Mali GPUs. In most cases it is Mali-friendly, but some indie devs have reported that on certain mobile devices, both old and new, 4x MSAA seemed to give the Mali GPU problems. I have always found this interesting.