Transform Feedback / Stream Output with the HLSL to SPIR-V compiler

I’ll follow up with more on denoising approaches shortly, but here’s a quick tip that may help someone in the future via a lucky Google search. It’s about a black-sheep rendering API feature called “transform feedback” in OpenGL lingo, or “stream output” in DirectX lingo. This feature causes the geometry shader to write primitive data to an arbitrary memory buffer instead of passing it on to the rasterizer.

Vulkan requires 2 things to use transform feedback / stream output:

  1. use of the “VK_EXT_transform_feedback” extension on the CPU side (https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_transform_feedback.html)
  2. some extra instructions in the SPIR-V shader for stride and offsets of the attributes of the written primitives

If you’re using the Khronos GLSL compiler (glslang), it can write out the necessary SPIR-V instructions for you, driven by the xfb_buffer, xfb_stride and xfb_offset layout qualifiers in the GLSL source. However, if you’re using the HLSL to SPIR-V compiler (https://github.com/microsoft/DirectXShaderCompiler/blob/master/docs/SPIR-V.rst), it cannot. The HLSL compiler will never write out shader code that can be used directly with VK_EXT_transform_feedback. In other words, the GLSL compiler is stream-output-aware and the HLSL compiler is not.

This wasn’t obvious to me at first. There’s no documentation about how, or if, this should work. If you try it, the shader seems to run OK (and the stream output counter variables actually increase), but nothing is ever written to your output buffer, no matter what you do.

However, there is still a way to use HLSL geometry shaders with transform feedback on Vulkan, and it’s not too bad.

SPIR-V lines for transform feedback

Unlike Microsoft’s bytecode, SPIR-V is openly defined, which means we can actually patch up the bytecode ourselves. The format is very simple: a module is a stream of 32-bit words, and each instruction starts with a word that packs the opcode together with an explicit word count (all relevant constants are in spirv.hpp). Plus, it turns out that something will actually validate bad SPIR-V before it causes a GPU crash (this might be a Vulkan layer or the NVIDIA driver, I’m not sure). This isn’t the PS3; this kind of shader patching should be quite practical and effective using SPIR-V.
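
To make the word layout concrete, here is a minimal Python sketch of walking the instructions in a module. It is not taken from any existing tool; the function names are hypothetical, and the numeric values are copied from the SPIR-V enumerants (in a real build you would take them from spirv.hpp instead).

```python
import struct

SPIRV_MAGIC = 0x07230203   # first word of every SPIR-V module
HEADER_WORDS = 5           # magic, version, generator, id bound, schema


def words_from_bytes(blob):
    """Unpack a little-endian SPIR-V binary into a list of 32-bit words."""
    return list(struct.unpack("<%dI" % (len(blob) // 4), blob))


def iter_instructions(words):
    """Yield (word offset, opcode, operand words) for every instruction."""
    if words[0] != SPIRV_MAGIC:
        raise ValueError("not a SPIR-V module (or wrong endianness)")
    pos = HEADER_WORDS
    while pos < len(words):
        word_count = words[pos] >> 16    # high half; counts this first word too
        opcode = words[pos] & 0xFFFF     # low half
        if word_count == 0:
            raise ValueError("malformed instruction at word %d" % pos)
        yield pos, opcode, words[pos + 1 : pos + word_count]
        pos += word_count


# A tiny hand-built module: header, OpCapability Geometry,
# OpMemoryModel Logical GLSL450.
OP_CAPABILITY, OP_MEMORY_MODEL = 17, 14
module = [SPIRV_MAGIC, 0x00010000, 0, 1, 0,
          (2 << 16) | OP_CAPABILITY, 2,
          (3 << 16) | OP_MEMORY_MODEL, 0, 1]
decoded = list(iter_instructions(module))
```

Every patch described in this post comes down to finding the right word offset with a loop like this and splicing new words in at that point.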

Consider the following (trivial) HLSL shader code.

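    // Struct definitions inferred from the disassembly further down
    // (SV_Position input, POINT0 output).
    struct VSOUT    { float4 vsOut : SV_Position; };
    struct GSOutput { float4 gsOut : POINT0; };

    [maxvertexcount(1)]
    void main(triangle VSOUT input[3], inout PointStream<GSOutput> outputStream)
    {
        // Emit a single point holding the per-component maximum of the three input vertices.
        GSOutput result;
        result.gsOut.x = max(max(input[0].vsOut.x, input[1].vsOut.x), input[2].vsOut.x);
        result.gsOut.y = max(max(input[0].vsOut.y, input[1].vsOut.y), input[2].vsOut.y);
        result.gsOut.z = max(max(input[0].vsOut.z, input[1].vsOut.z), input[2].vsOut.z);
        result.gsOut.w = max(max(input[0].vsOut.w, input[1].vsOut.w), input[2].vsOut.w);
        outputStream.Append(result);
    }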

By comparing the GLSL compiler output to the HLSL compiler output, I tracked down the minimum set of things we need to add to the SPIR-V to have it work. Just below you’ll see the compiled version of the above shader, expressed in spirv-tools disassembly format. You’ll notice it’s divided into 3 parts:

  1. some initial declarations (OpExecutionMode, OpName, OpDecorate, etc)
  2. types, variables and constants (OpTypeFloat, OpVariable, etc)
  3. a function implementation (actual instructions like OpLoad/OpStore, etc)

Focus on part (1); everything we need right now is in that part.

               OpCapability Geometry
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint Geometry %main "main" %gl_Position %out_var_POINT0
               OpExecutionMode %main OutputVertices 1
               OpExecutionMode %main Invocations 1
               OpExecutionMode %main Triangles
               OpExecutionMode %main OutputPoints
               OpSource HLSL 500
               OpName %out_var_POINT0 "out.var.POINT0"
               OpName %main "main"
               OpDecorate %gl_Position BuiltIn Position
               OpDecorate %out_var_POINT0 Location 0

       %uint = OpTypeInt 32 0
     %uint_3 = OpConstant %uint 3
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
%_arr_v4float_uint_3 = OpTypeArray %v4float %uint_3
%_ptr_Input__arr_v4float_uint_3 = OpTypePointer Input %_arr_v4float_uint_3
%_ptr_Output_v4float = OpTypePointer Output %v4float
       %void = OpTypeVoid
         %13 = OpTypeFunction %void
%gl_Position = OpVariable %_ptr_Input__arr_v4float_uint_3 Input
%out_var_POINT0 = OpVariable %_ptr_Output_v4float Output
         %14 = OpUndef %v4float

       %main = OpFunction %void None %13
         %15 = OpLabel
         %16 = OpLoad %_arr_v4float_uint_3 %gl_Position
         %17 = OpCompositeExtract %v4float %16 0
         %18 = OpCompositeExtract %v4float %16 1
         %19 = OpCompositeExtract %v4float %16 2
         %20 = OpCompositeExtract %float %17 0
         %21 = OpCompositeExtract %float %18 0
         %22 = OpExtInst %float %1 FMax %20 %21
         %23 = OpCompositeExtract %float %19 0
         %24 = OpExtInst %float %1 FMax %22 %23
         %25 = OpCompositeInsert %v4float %24 %14 0
         %26 = OpCompositeExtract %float %17 1
         %27 = OpCompositeExtract %float %18 1
         %28 = OpExtInst %float %1 FMax %26 %27
         %29 = OpCompositeExtract %float %19 1
         %30 = OpExtInst %float %1 FMax %28 %29
         %31 = OpCompositeInsert %v4float %30 %25 1
         %32 = OpCompositeExtract %float %17 2
         %33 = OpCompositeExtract %float %18 2
         %34 = OpExtInst %float %1 FMax %32 %33
         %35 = OpCompositeExtract %float %19 2
         %36 = OpExtInst %float %1 FMax %34 %35
         %37 = OpCompositeInsert %v4float %36 %31 2
         %38 = OpCompositeExtract %float %17 3
         %39 = OpCompositeExtract %float %18 3
         %40 = OpExtInst %float %1 FMax %38 %39
         %41 = OpCompositeExtract %float %19 3
         %42 = OpExtInst %float %1 FMax %40 %41
         %43 = OpCompositeInsert %v4float %42 %37 3
               OpStore %out_var_POINT0 %43
               OpEmitVertex
               OpReturn
               OpFunctionEnd

Now, here is part (1) of the above shader again, but this time with the declarations for transform feedback added:

               OpCapability Geometry
               OpCapability TransformFeedback
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint Geometry %main "main" %gl_Position %out_var_POINT0
               OpExecutionMode %main Xfb
               OpExecutionMode %main OutputVertices 1
               OpExecutionMode %main Invocations 1
               OpExecutionMode %main Triangles
               OpExecutionMode %main OutputPoints
               OpSource HLSL 500
               OpName %out_var_POINT0 "out.var.POINT0"
               OpName %main "main"
               OpDecorate %gl_Position BuiltIn Position
               OpDecorate %out_var_POINT0 Location 0
               OpDecorate %out_var_POINT0 XfbBuffer 0
               OpDecorate %out_var_POINT0 XfbStride 16
               OpDecorate %out_var_POINT0 Offset 0

Breaking down the specific additions, we get:

               OpCapability TransformFeedback

This must appear with the other OpCapability instructions, which are at the top of the module.

               OpExecutionMode %main Xfb

This should appear with the other OpExecutionMode instructions, which come after the “OpEntryPoint” part.

               OpDecorate %out_var_POINT0 XfbBuffer 0
               OpDecorate %out_var_POINT0 XfbStride 16
               OpDecorate %out_var_POINT0 Offset 0

This should appear with the other “OpDecorate” parts, before the type and constant declarations. Note that here we’re specifying the buffer, stride and offset for each output attribute, which is the extra bit of markup we need. The SPIR-V module layout places debug instructions such as “OpName” before the OpDecorates, so you can use that OpName to identify the correct output variables to mark up.
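
Looking up an output variable by its OpName could be sketched like this in Python. The helper names are hypothetical, and it assumes the module still carries its debug names (OpName is debug information, so a stripped module won’t have it). Strings in SPIR-V are UTF-8, NUL-terminated and padded out to whole words.

```python
import struct

OP_NAME = 5          # opcode value from the SPIR-V enumerants (see spirv.hpp)
HEADER_WORDS = 5     # magic, version, generator, id bound, schema


def instr(opcode, *operands):
    """Encode one instruction: (word count << 16) | opcode, then the operands."""
    return [((1 + len(operands)) << 16) | opcode, *operands]


def encode_string(text):
    """Pack a string the way SPIR-V stores it: UTF-8, NUL-terminated, word padded."""
    raw = text.encode("utf-8") + b"\0"
    raw += b"\0" * (-len(raw) % 4)
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))


def decode_string(words):
    """Inverse of encode_string: read up to the first NUL byte."""
    raw = struct.pack("<%dI" % len(words), *words)
    return raw.split(b"\0", 1)[0].decode("utf-8")


def find_id_by_name(words, name):
    """Return the id targeted by the first OpName equal to name, or None."""
    pos = HEADER_WORDS
    while pos < len(words):
        word_count, opcode = words[pos] >> 16, words[pos] & 0xFFFF
        if word_count == 0:
            raise ValueError("malformed instruction at word %d" % pos)
        # OpName operands: target id, then the literal string.
        if opcode == OP_NAME and decode_string(words[pos + 2 : pos + word_count]) == name:
            return words[pos + 1]
        pos += word_count
    return None


# Hand-built example: OpName %3 "out.var.POINT0" and OpName %1 "main".
module = ([0x07230203, 0x00010000, 0, 4, 0]
          + instr(OP_NAME, 3, *encode_string("out.var.POINT0"))
          + instr(OP_NAME, 1, *encode_string("main")))
output_id = find_id_by_name(module, "out.var.POINT0")
```

The names to search for follow the pattern visible in the disassembly above, where the HLSL semantic POINT0 shows up as “out.var.POINT0”.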

Remember that the constants for all of this stuff are in spirv.hpp (this disassembly is just a text version of the shader bytecode, so it’s the bytecode itself that needs to be patched). The instructions must appear in a specific order; in SPIR-V you can’t put things in arbitrary order (which is unfortunate, because otherwise we’d just append everything).
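
Putting the three additions together, a patching pass might look like the following Python sketch. It is a hypothetical helper rather than code from any existing tool: it assumes a single entry point whose id the caller already knows, and a module that already contains at least one OpDecorate (as the compiled shader above does). The numeric values are copied from the SPIR-V enumerants that spirv.hpp exposes.

```python
OP_CAPABILITY, OP_ENTRY_POINT, OP_EXECUTION_MODE, OP_DECORATE = 17, 15, 16, 71
CAP_TRANSFORM_FEEDBACK = 53
EXEC_MODE_XFB = 11
DEC_OFFSET, DEC_XFB_BUFFER, DEC_XFB_STRIDE = 35, 36, 37
HEADER_WORDS = 5


def instr(opcode, *operands):
    """Encode one instruction: (word count << 16) | opcode, then the operands."""
    return [((1 + len(operands)) << 16) | opcode, *operands]


def patch_for_xfb(words, entry_id, outputs):
    """outputs is a list of (variable id, xfb buffer, stride, offset) tuples."""
    cap_end = entry_end = decorate_end = None
    pos = HEADER_WORDS
    while pos < len(words):
        word_count, opcode = words[pos] >> 16, words[pos] & 0xFFFF
        if word_count == 0:
            raise ValueError("malformed instruction at word %d" % pos)
        end = pos + word_count
        if opcode == OP_CAPABILITY:
            cap_end = end            # just past the last OpCapability
        elif opcode == OP_ENTRY_POINT:
            entry_end = end          # just past the last OpEntryPoint
        elif opcode == OP_DECORATE:
            decorate_end = end       # just past the last OpDecorate
        pos = end
    if None in (cap_end, entry_end, decorate_end):
        raise ValueError("expected OpCapability, OpEntryPoint and OpDecorate")

    decorations = []
    for var_id, buffer, stride, offset in outputs:
        decorations += instr(OP_DECORATE, var_id, DEC_XFB_BUFFER, buffer)
        decorations += instr(OP_DECORATE, var_id, DEC_XFB_STRIDE, stride)
        decorations += instr(OP_DECORATE, var_id, DEC_OFFSET, offset)

    patched = list(words)
    # Splice back to front so the earlier insertion offsets stay valid.
    patched[decorate_end:decorate_end] = decorations
    patched[entry_end:entry_end] = instr(OP_EXECUTION_MODE, entry_id, EXEC_MODE_XFB)
    patched[cap_end:cap_end] = instr(OP_CAPABILITY, CAP_TRANSFORM_FEEDBACK)
    return patched


# Hand-built stand-in for part (1) of the shader: %1 = main, %3 = out_var_POINT0.
header = [0x07230203, 0x00010000, 0, 5, 0]
cap = instr(OP_CAPABILITY, 2)                         # Geometry
mem = instr(14, 0, 1)                                 # OpMemoryModel Logical GLSL450
entry = instr(OP_ENTRY_POINT, 3, 1, 0x6E69616D, 0, 2, 3)  # Geometry %1 "main" %2 %3
mode = instr(OP_EXECUTION_MODE, 1, 26, 1)             # OutputVertices 1
loc = instr(OP_DECORATE, 3, 30, 0)                    # Location 0
void = instr(19, 4)                                   # %4 = OpTypeVoid
module = header + cap + mem + entry + mode + loc + void
patched = patch_for_xfb(module, entry_id=1, outputs=[(3, 0, 16, 0)])
```

The id bound in the header stays as it is, because none of the added instructions define a new id.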

And that’s about it. If you want to use the geometry streams feature, there’s a capability for that as well (GeometryStreams) and a decoration to specify the stream (Stream), and you should change OpEmitVertex to OpEmitStreamVertex (which takes the stream as a parameter).
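
The OpEmitVertex change is slightly different from the other patches because the instruction grows by one word. A rough Python sketch, assuming the module already contains an integer constant holding the stream index and that the caller knows its id (the stream_const_id parameter is hypothetical, and adding such a constant is not covered here):

```python
OP_EMIT_VERTEX, OP_END_PRIMITIVE = 218, 219
OP_EMIT_STREAM_VERTEX, OP_END_STREAM_PRIMITIVE = 220, 221
OP_RETURN, OP_FUNCTION_END = 253, 56
HEADER_WORDS = 5


def instr(opcode, *operands):
    """Encode one instruction: (word count << 16) | opcode, then the operands."""
    return [((1 + len(operands)) << 16) | opcode, *operands]


def emit_to_stream(words, stream_const_id):
    """Rewrite OpEmitVertex/OpEndPrimitive into their per-stream variants."""
    out = list(words[:HEADER_WORDS])
    pos = HEADER_WORDS
    while pos < len(words):
        word_count, opcode = words[pos] >> 16, words[pos] & 0xFFFF
        if word_count == 0:
            raise ValueError("malformed instruction at word %d" % pos)
        if opcode == OP_EMIT_VERTEX:
            out += instr(OP_EMIT_STREAM_VERTEX, stream_const_id)
        elif opcode == OP_END_PRIMITIVE:
            out += instr(OP_END_STREAM_PRIMITIVE, stream_const_id)
        else:
            out += words[pos : pos + word_count]   # copy everything else unchanged
        pos += word_count
    return out


# Hand-built tail of a function body: OpEmitVertex, OpReturn, OpFunctionEnd.
header = [0x07230203, 0x00010000, 0, 10, 0]
module = header + instr(OP_EMIT_VERTEX) + instr(OP_RETURN) + instr(OP_FUNCTION_END)
streamed = emit_to_stream(module, stream_const_id=9)
```

The GeometryStreams capability and the Stream decoration are ordinary OpCapability and OpDecorate instructions, so they go in alongside the existing ones at the top of the module.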

Patching bytecode post-compile

This patching needs to happen after shader compilation, obviously, so it looks something like this:

  1. <HLSL input> -> compiler -> <spirv>
  2. <spirv> + <offset and stride values> -> patching -> <patched-spirv>
  3. <patched-spirv> + <other states> -> create graphics pipeline

The good thing about this is that we don’t need to know the offset and stride values at shader compile time. This is more in line with what happens in DirectX, where those values are passed to the geometry shader creation method (CreateGeometryShaderWithStreamOutput in D3D11), alongside the compiled bytecode. With GLSL, on the other hand, those values are hard coded into the GLSL shader (and therefore need to be known earlier). This turns out to be convenient for reusing existing HLSL geometry shaders.
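
As an illustration of deriving those values at pipeline creation time, here is a small Python sketch that turns a DirectX-style declaration list into the buffer, stride and offset values for each output. The function name and tuple layout are hypothetical, and it assumes 4-byte components packed tightly in declaration order, which may not match every engine’s layout.

```python
from collections import defaultdict


def xfb_layout(entries, component_bytes=4):
    """entries: list of (semantic name, component count, buffer slot).

    Returns a list of (semantic name, buffer slot, stride, offset).
    """
    cursor = defaultdict(int)          # running byte offset per buffer slot
    placed = []
    for name, components, slot in entries:
        placed.append((name, slot, cursor[slot]))
        cursor[slot] += components * component_bytes
    # Each buffer's stride is the total size of everything packed into it.
    return [(name, slot, cursor[slot], offset) for name, slot, offset in placed]


# The single float4 output from the shader in this post.
layout = xfb_layout([("POINT0", 4, 0)])
```

For the shader in this post, this yields buffer 0, stride 16 and offset 0, which are the values used in the decorations shown earlier.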

Why isn’t there a clearer way to do this?

As far as I can see, there’s no explicit documentation for how this was intended to work. It seems that, in general, transform feedback is a very low priority for the Vulkan team (as can be seen by the fact that it’s an extension, and that extension wasn’t prioritized). I think many graphics engineers don’t particularly like this feature, and perhaps the Vulkan team has just decided its time is up.

It can be useful for particular cases, though, because it’s the only way to reuse the input assembly infrastructure for non-rasterizing shaders. That’s actually pretty nice, because otherwise anything you wanted to do with a vertex buffer would require writing custom input assembly code. I find it pretty convenient just for that – as a way to leverage the driver’s input assembly infrastructure for a shader that doesn’t draw (such as ray vs model detection).

Also it’s useful when you have existing transform feedback shaders that you just want to support without having to rewrite them.

However, the API is quite clunky, and doesn’t fit in well with other API features. And ultimately everything you can do with it can be done with a compute shader (perhaps with some kind of input assembly library?). Input assembly may go the way of the dodo too, with greater reliance on mesh shaders. Mesh shaders replace the front of the geometry pipeline with a compute-shader-like program that generates primitives (perhaps from index and vertex buffers, or elsewhere). That effectively moves the entire concept of input assembly out of the driver domain and into the engine domain. Under that scheme the entire concept of transform feedback seems redundant: you would instead just repurpose your mesh shader code as a compute shader.

It’s interesting to think about, because input assembly has been one of the last legacies of the old fixed function pipeline model. It’s not the very last (in the end, we still have triangle rasterization, right?), but it’s getting there. It’s curious to see how graphics APIs have slowly evolved away from concepts like vertices, attributes and triangles, while more and more of the API surface area has come to revolve around memory usage patterns, SIMD wave groups and parallelization.