This post presents best practices for implementing ray tracing in games and other real-time graphics applications. We present these as briefly as possible to help you quickly find key ideas. This is based on a presentation made at the 2019 GDC by NVIDIA engineers.
Optimize your acceleration structure (BLAS/TLAS) build/update to take at most 2ms through pruning and selective updates
Denoising RT effects is essential. We’ve packaged up best-in-class denoisers in the NVIDIA RTX Denoiser SDK
Overlap the acceleration structure (BLAS/TLAS) build/update and denoising with other workloads (G-Buffer, shadow buffer, physics simulation) using asynchronous compute queues
Leverage HW acceleration for traversal whenever possible
The minimum number of rays cast should be billed as “RT On” and should deliver noticeably better image quality than rasterization. Increasing quality levels should improve image quality and perf at a fair rate. See the table below:
Do (as quality level increases):
Increase the number of rays
Increase ray length
Maintain constant shading complexity, ideally with simpler shaders than rasterization to minimize divergence
Increase denoiser quality
Include smaller (foliage?) alpha/transparent (a.k.a. Any Hit Shader) objects
Don’t (as quality level increases):
Increase shading complexity
Maintain constant denoiser quality
Use ray tracing as an aggressive geometry LOD tool; the BVH handles complexity rather well by construction
Performance Best Practices
1.0 Acceleration Structure Management
1.1 General Practices
Move AS management (build/update) to an async compute queue. Using an async compute queue pairs well with graphics workloads and in many cases hides the cost almost entirely. Similarly, any AS dependency (such as skinning) can also be moved to async compute and hidden well.
Build the Top-Level Acceleration Structure (TLAS) rather than Update. It’s just easier to manage in most circumstances, and the cost savings from refit likely aren’t worth sacrificing the quality of the TLAS.
Ensure the descriptors passed to GetRaytracingAccelerationStructurePrebuildInfo and BuildRaytracingAccelerationStructure match. Otherwise the allocated buffers may be too small to hold the AS or scratch memory, potentially producing a subtle bug!
Don’t include the skybox/skysphere in your TLAS. Having sky geometry in your scene only serves to increase ray tracing times. Implement your sky shading in the Miss Shader instead.
Implement a single barrier between BLAS and TLAS build. Generally speaking, no more should be required for correctness. Overlap between BLAS builds can occur naturally on the hardware, but adding unnecessary barriers can serialize execution of that work.
1.2 Bottom-Level Acceleration Structures (BLAS)
Use triangles over AABBs. RTX GPUs excel at accelerating traversal of AS created from triangle geometry.
Mark geometry as OPAQUE whenever possible. If the geometry doesn’t require any-hit shader code to execute (e.g. for alpha testing), then always make sure it’s marked as OPAQUE in order to use the HW as effectively as possible. It doesn’t matter whether the OPAQUE flag comes from the geometry descriptor (D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE / VK_GEOMETRY_OPAQUE_BIT), the instance descriptor (D3D12_RAYTRACING_INSTANCE_FLAG_FORCE_OPAQUE / VK_GEOMETRY_INSTANCE_FORCE_OPAQUE_BIT), or a ray flag (RAY_FLAG_FORCE_OPAQUE / gl_RayFlagsOpaqueNV).
Batching/merging of build/update calls and geometries is important. Ultimately, the GPU will be under-occupied in situations where AS manipulation is performed on small batches of primitives. Take advantage of the fact that a build can accept more than one geometry descriptor and transform the geometry while building. This often results in the most efficient data structures, especially when objects’ AABBs overlap one another. Grouping geometries into BLASes/instances should follow spatial locality. Do not “throw everything with the same material into the same BLAS no matter where it ends up in space”.
Know when to update versus (re)build. Continually updating a BLAS degrades its effectiveness as a spatial data structure, making traversal/intersection queries much slower relative to one freshly built. As a general rule, only dynamic objects should be considered for update. If parts of the mesh change position wildly relative to their local neighborhood, then traversal quality will decrease quickly on update. If things are just “bending but not breaking”, then update will work quite well. Example: a tree waving in the wind: update=good; a mesh exploding: update=bad. Deciding whether to update or rebuild a skinned character: it depends. Say the original build was done in t-pose; then every update will assume that the feet are close together. During a walking/running animation, this could impact trace efficiency. One solution here is to build acceleration structures for a few key poses upfront, then use the closest match as a source for refit. An experiment-guided flow/process is recommended.
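The “bending but not breaking” guidance can be captured as a simple heuristic. The Python sketch below is illustrative only: `choose_blas_op`, `neighbor_scale`, and the 0.5 threshold are hypothetical tuning choices, not part of any API; it compares worst-case vertex drift from the pose the BLAS was last built against to the mesh’s local feature size.

```python
import math

def choose_blas_op(built_positions, current_positions, neighbor_scale,
                   deform_threshold=0.5):
    """Return 'update' (refit) or 'rebuild'.

    built_positions / current_positions: lists of (x, y, z) tuples, where
    built_positions is the pose the BLAS was last (re)built against.
    neighbor_scale: characteristic size of local neighborhoods, e.g. the
    average triangle edge length.
    deform_threshold: hypothetical knob; drift beyond this many
    neighborhood lengths suggests refit quality has degraded.
    """
    worst = 0.0
    for built, current in zip(built_positions, current_positions):
        worst = max(worst, math.dist(built, current))
    # "Bending but not breaking": drift is small relative to local geometry.
    return "update" if worst <= deform_threshold * neighbor_scale else "rebuild"
```

A tree swaying slightly would land on "update"; vertices flung far from their built positions (an explosion) would land on "rebuild". In practice the decision should be validated by profiling trace cost, not geometry alone.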
Use compaction for all static geometry. Compaction is fast and can often reclaim significant amounts of memory. There is no performance downside when tracing rays against compacted acceleration structures. See the item on build flags below for more detail.
Use the right build flags.
Start by choosing a combination from the table below…
1. PREFER_FAST_BUILD: Fastest possible build. Slower trace than #3 and #4. Use for fully dynamic geometry like particles, destruction, changing prim counts, or geometry moving wildly (explosions etc.), where a per-frame rebuild is required.
2. PREFER_FAST_BUILD + ALLOW_UPDATE: Slightly slower build than #1, but allows very fast update. Use for lower-LOD dynamic objects, unlikely to be hit by many rays but still needing a per-frame refit to be correct.
3. PREFER_FAST_TRACE: Fastest possible trace. Slower build than #1 and #2. The default choice for static level geometry.
4. PREFER_FAST_TRACE + ALLOW_UPDATE: Fastest trace against an updateable AS. Updates slightly slower than #2; trace a bit slower than #3. Use for hero characters and other high-LOD dynamic objects expected to be hit by a significant number of rays.
Then consider adding these flags:
ALLOW_COMPACTION. It’s generally a good idea to use this on all static geometry to reclaim (potentially significant) amounts of memory.
For updateable geometry, it makes sense to compact those BLASes that have a long lifetime, so the extra step is worth it (compaction and update are not mutually exclusive!).
For fully dynamic geometry that’s rebuilt every frame (as opposed to updated), compaction generally brings no benefit.
One potential reason NOT to use compaction is to exploit the guarantee that BLAS storage requirements grow monotonically with primitive count; this does not hold true in the context of compaction.
MINIMIZE_MEMORY (DXR) / LOW_MEMORY_BIT (VK). Use only when the application is under so much memory pressure that a ray tracing path isn’t feasible without optimizing for memory consumption as much as possible. This flag usually sacrifices build and trace performance. That isn’t the case in all situations, but beware that future driver versions might behave differently, so don’t rely on experimental data “confirming” that the flag doesn’t cost perf.
2.0 – Ray-Tracing
2.1 – Pipeline Management
Avoid State Object creation on the critical path. Collections and pipelines can take tens to hundreds of milliseconds to compile. An application should therefore either create all PSOs upfront (e.g. at level load), or asynchronously create state objects on background threads and hot-swap them when ready.
Consider using more than one ray tracing pipeline (State Object). This especially applies when you trace several types of rays, such as shadows and reflections, where one type (shadows) has just a few simple shaders, small payloads, and/or low register pressure, while the other type (reflections) involves many complex shaders and/or larger payloads. Separating these cases into different pipelines helps the driver schedule shader execution more efficiently and run workloads at higher occupancy.
Set your payload and attribute sizes to the minimum possible. The values configured for MaxPayloadSizeInBytes and MaxAttributeSizeInBytes have a direct effect on register pressure, so don’t set them higher than what your application/pipeline absolutely needs.
Set your max trace recursion depth to the minimum possible. Trace recursion depth affects how much stack memory is allocated for a DispatchRays launch. This can have a significant effect on memory consumption and overall performance.
2.2 – Shaders
2.2.1 – General
Keep the ray payload small. Payload size translates to register count, so it directly affects occupancy. It’s often worth a bit of math to pack the payload similar to how you’d pack a G-buffer. Large payloads will spill to memory.
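As an illustration of that kind of packing, the Python sketch below compresses a hit normal plus hit distance from 16 bytes (float3 + float) down to 8 using octahedral normal encoding; the function names are hypothetical, and in a real engine this math would live in HLSL/GLSL operating on the payload struct.

```python
import struct

def oct_encode(nx, ny, nz):
    """Octahedral-map a unit normal to two values in [-1, 1]."""
    s = abs(nx) + abs(ny) + abs(nz)
    x, y = nx / s, ny / s
    if nz < 0.0:
        # Fold the lower hemisphere onto the outer triangles of the octahedron.
        x, y = (1.0 - abs(y)) * (1.0 if x >= 0.0 else -1.0), \
               (1.0 - abs(x)) * (1.0 if y >= 0.0 else -1.0)
    return x, y

def pack_hit(nx, ny, nz, t):
    """Pack a normal (2 x snorm16) + hit distance (float32) into 8 bytes."""
    ox, oy = oct_encode(nx, ny, nz)
    to_snorm16 = lambda v: int(round(max(-1.0, min(1.0, v)) * 32767.0))
    return struct.pack("<hhf", to_snorm16(ox), to_snorm16(oy), t)
```

Halving the payload this way costs a few ALU instructions per ray but can meaningfully reduce register pressure and spills.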
Keep attribute count low. Like payload data, attributes for custom intersection shaders translate to register count and should therefore be kept to a minimum. Fixed-function triangle intersection uses two attributes, which serves as a good guideline.
Use RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH / gl_RayFlagsTerminateOnFirstHitNV whenever possible. This usually applies to shadow and ambient occlusion rays, for example. Note that using this flag is more efficient than using an any-hit shader that calls AcceptHitAndEndSearch() / terminateRayNV() explicitly.
Avoid live state across trace calls. Generally, variables computed before a TraceRay call and used after it must be spilled to the stack. The compiler can avoid this in certain cases, such as by using rematerialization, but more often than not the spill is necessary. So the more a shader can avoid live state in the first place, the better. In some cases, when shading complexity is very low and there are no recursive TraceRay calls, it makes sense to put some of the live state into the payload in order to avoid the spill. However, this conflicts with the desire to keep the payload small, so use this trick very judiciously.
Avoid too many trace calls in a shader. Many TraceRay calls in a shader may result in suboptimal performance and long shader compile times. Try structuring your code so that several trace calls collapse into one.
Use loop unrolling judiciously. This is especially true if said loop contains a trace call (a corollary to the previous point). Complex shaders may suffer from unrolled loops more than they benefit. Experiment with the [loop] attribute in HLSL or explicit unrolling in GLSL.
Try to execute TraceRay calls unconditionally. Keeping TraceRay calls out of ‘if’ statements can help the compiler streamline the generated code and improve performance. Instead of using a conditional, experiment with setting a ray’s tmin and tmax values to 0 in order to trigger a miss, and (if required for correct behavior) use a no-op miss shader to avoid unintended side effects.
Use RAY_FLAG_CULL_BACK_FACING_TRIANGLES / gl_RayFlagsCullBackFacingTrianglesNV judiciously. Unlike rasterization, backface culling in ray tracing is usually not a performance optimization and can result in more work being done rather than less.
2.2.2 – Ray-Generation Shaders
Ensure every ray-gen thread produces a ray. Dispatching/allocating threads in ray-generation shaders that ultimately don’t produce any rays can hurt scheduling. Manual compaction may be necessary here.
2.2.3 – Any Hit Shaders
Keep any-hit shaders minimalistic. Any-hit shaders execute many times per TraceRay (compared to closest-hit or miss shaders, for example, which execute once), making them expensive. In addition, any-hit executes at the point in the call graph where register pressure is highest. So keep them as trivial as you possibly can for best performance.
2.2.4 – Shading Execution Divergence
Start with a straightforward shading implementation. Chances are, when implementing a technique that requires lots of material shading (e.g. reflections or GI), performance will be limited by shading divergence. There are many causes for this in general, including but not limited to instruction cache thrashing and divergent memory accesses. Employ the following strategies to combat these problems:
Optimize for instruction divergence by using simplified shaders:
Use lower quality, or simplified, shaders (relative to rasterization) for ray tracing.
In some extreme cases (for example, diffuse GI or rough specular) it can be visually acceptable to fall all the way back to vertex-level shading (which also has the added benefit of reducing noise).
Optimize for divergent memory access by:
Reducing the resolution of texture accesses, or biasing mip-map levels
Deferring lighting calculations in ray-tracing shaders until a later point in the frame
Manual scheduling (sorting/binning) of shaders may be necessary in extreme cases. When the above optimization strategies aren’t enough, shading can be scheduled manually by the application. However, this prevents driver/HW-based scheduling from being effective. Improvements to our scheduling are constantly being made by NVIDIA.
2.3 – Resources
Use the global root signature (DXR) or global resource bindings (VK) for scene-global resources. This avoids replication in the local per-geometry root tables and may result in better caching behavior.
Avoid resource temporaries. These often result in unintuitive code duplication. For example, holding a texture in a temporary and assigning it based on some condition will result in the duplication of all sample operations for each possible texture assignment. Possible workaround: use a resource array and index into it dynamically.
Accessing 64 or 128 bits of aligned local root table data together enables vectorized loads.
Prefer StructuredBuffer over ByteAddressBuffer for aligned raw data.
3.0 – Denoisers
Use the RTX Denoiser SDK for high-quality, fast denoising of ray-traced effects. You can find more details on the GameWorks Ray Tracing page.
4.0 – Memory Management
For DXR, consider the budget reported by the QueryVideoMemoryInfo API as a soft hint. The actual segment size is about 20% larger.
Isolate command allocators to command lists of different types. Don’t mix and match non-DXR CAs with DXR CAs if you can avoid it.
Command allocator Reset will not free associated memory. Those allocations can be freed with destroy/create, but this must be done off the critical path to avoid long stalls.
Watch the pipeline’s stack size. Stack size increases with the amount of live state kept across TraceRay calls and with control flow complexity around TraceRay calls. The maximum trace depth is essentially a direct multiplier on the stack size, so keep it as low as possible.
Manually manage the stack if applicable. Use the API’s query functions to determine the stack size required per shader, and apply app-side knowledge about the call graph to reduce memory consumption and improve performance. A good example is expensive reflection shaders at trace depth 1 shooting shadow rays (trace depth 2) that are known by the app to only hit trivial hit shaders with low stack requirements. The driver can’t know this call graph upfront, so the default conservative stack size computation will over-allocate memory.
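To sketch the potential savings, assume some hypothetical per-shader stack sizes of the kind returned by DXR’s ID3D12StateObjectProperties::GetShaderStackSize; the helper names and byte values below are illustrative only, and the app-informed computation encodes the shadow-ray knowledge described above.

```python
def conservative_stack(raygen, hit_miss_sizes, max_depth):
    """Driver-style worst case: every recursion level is assumed to
    invoke the most stack-hungry hit or miss shader."""
    return raygen + max_depth * max(hit_miss_sizes)

def app_informed_stack(raygen, depth1_size, depth2_size):
    """App knows depth 1 is an expensive reflection shader and depth 2
    only ever runs trivial shadow hit/miss shaders."""
    return raygen + depth1_size + depth2_size

# Hypothetical per-shader stack sizes in bytes.
raygen, reflection_ch, shadow_shaders = 64, 4096, 128
conservative = conservative_stack(raygen, [reflection_ch, shadow_shaders], 2)
informed = app_informed_stack(raygen, reflection_ch, shadow_shaders)
# informed is far smaller than conservative (4288 vs 8256 here).
```

In DXR, the app-informed value would then be applied via SetPipelineStackSize before the first DispatchRays using the pipeline.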
Reuse transient resources. For example, reuse scratch memory resources for BVH builds for other (potentially non-ray-tracing) purposes. On DXR, make use of placed resources and resource heaps tier 2.
5.0 – Profiling and Debugging
Be aware of the following tools, which include support for DirectX Raytracing and NVIDIA’s VKRay. They are evolving quickly, so make sure you use the latest versions.
NVIDIA Nsight Graphics. Provides excellent debugging and profiling tools for ray tracing developers (Shader Table & Resource Inspector, Acceleration Structure Viewer, Range profiling, Warp Occupancy & GPU Metrics, Crash Debugging via Nsight Aftermath, C++ frame capture).
NVIDIA Nsight Systems. Provides system-wide profiling capabilities and stutter analysis functionality.
6.0 – Q&A
Q. What’s the relationship between the number of primitives and the cost (time) of acceleration structure builds/updates?
A. It’s largely a linear relationship. More precisely, it becomes linear beyond a certain primitive count; below that it’s bound by constant overhead. The exact numbers here are in flux and wouldn’t be reliable.
Q. Assuming maximum occupancy, what’s the GPU throughput SOL for acceleration structure builds/updates?
A. An order-of-magnitude guideline is O(100 million) primitives/sec for full builds and O(1 billion) primitives/sec for updates.
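Those guideline throughputs make back-of-envelope budgeting straightforward. The sketch below is a rough estimate only; actual cost varies with GPU, build flags, and geometry:

```python
def as_build_time_ms(prim_count, prims_per_sec):
    """Back-of-envelope AS cost from the order-of-magnitude
    throughput guidelines above."""
    return prim_count / prims_per_sec * 1000.0

# For 1 million primitives:
full_build_ms = as_build_time_ms(1_000_000, 100e6)  # ~10 ms
refit_ms = as_build_time_ms(1_000_000, 1e9)         # ~1 ms
```

This is why a 2ms per-frame AS budget (see the summary at the top) typically implies refitting most dynamic geometry and rebuilding only a few hundred thousand primitives per frame.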
Q. What’s the relationship between the number of unique shaders and the compilation cost (time) for RT PSOs?
A. It is roughly linear.
Q. What’s the typical cost of RT PSO compilation in games today?
A. Anywhere from 20ms to 300ms, per pipeline.
Q. Is there guidance for how much alpha/transparency should be used? What’s the cost of any-hit vs closest-hit?
A. Any-hit is expensive and should be used minimally. Preferably mark geometry (or instances) as OPAQUE, which allows ray traversal to happen in fixed-function hardware. When AH is required (e.g. to evaluate transparency, etc.), keep it as simple as possible. Don’t evaluate huge shading networks just to execute what amounts to an alpha tex lookup and an if-statement.
Q. How should the developer manage shading divergence?
A. Start by shading in closest-hit shaders, in a straightforward implementation. Then analyze perf and determine how much of a problem divergence is and how it can be addressed. The solution may or may not include “manual scheduling”.
Q. How can the developer query the stack memory allocation?
A. The API has functionality to query per-thread stack requirements on pipelines/shaders. This is useful for monitoring and analysis purposes, and an app should always strive to use as little shader stack as possible (one recommendation is to dump stack size histograms and flag outliers during development). Stack requirements are most directly influenced by live state across trace calls, which should be minimized (see Best Practices).
Q. How much extra VRAM does a typical ray-tracing implementation consume?
A. Today, games implementing ray tracing typically use around 1 to 2 GB of extra memory. The main contributing factors are acceleration structure resources, ray-tracing-specific screen-sized buffers (extended G-buffer data), and driver-internal allocations (mainly the shader stack).
The following people made significant contributions to this post: Patrick Neill, Pawel Kozlowski, Marc Blackstein, Nuno Subtil, Martin Stich, Ignacio Llamas, Zhen Yang, Eric Werness, Evan Hart, and Seth Schneider.