
GLM vs VS2019 for SIMD

Started by
8 comments, last by Programmer71 3 years, 3 months ago

From https://docs.microsoft.com/en-us/cpp/parallel/openmp/openmp-simd?view=msvc-160

Visual C++ currently supports the OpenMP 2.0 standard; however, Visual Studio 2019 now also offers SIMD functionality.

One of the biggest reasons for me to use GLM is its automagic (I think) SIMD support.

So can I use my own vec/mat/math classes and get decent SIMD with VS2019 as described in the link above?

It would take me a while to understand that information well enough to use it (or know not to use it), and I don't want to learn it if it isn't comparable to GLM's SIMD.
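The feature the linked MSVC docs describe is the OpenMP `#pragma omp simd` directive, enabled with the `-openmp:experimental` compiler switch mentioned on that page. A minimal sketch of what using it looks like (the flag and pragma are from the linked docs; the function itself is my own example, not from the thread):

```cpp
#include <cstddef>

// Build with: cl /O2 -openmp:experimental ...  (MSVC)
// The pragma below asks the compiler to vectorize this loop.
// Compilers without OpenMP SIMD support simply ignore it.
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Note this is only a hint: correctness does not change whether the loop is vectorized or not, which is why it is safe to sprinkle on hot loops and measure.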

🙂🙂🙂🙂🙂 ← The tone posse, ready for action.


From what I see in the MS post, it isn't real user-mode SIMD but a compiler switch that optimizes certain math into SIMD instructions, while GLM uses real platform-dependent commands that are guaranteed to turn into SIMD instructions.

So MS finally discovered what LLVM/clang has already had for years: optimizing code into SIMD instructions?

fleabay said:
One of the biggest reasons for me to use GLM is its automagic (I think) SIMD support.

Repeating my findings from implementing a fluid simulator (which I have already posted once or twice here)…

I used GLM initially because it has similar syntax to the open source code I was using as reference.
Later I replaced GLM with two things:

Sony's old SSE2 SIMD vector math lib that came with Bullet Physics (not sure if that's still the case).
This gave me a speedup of more than three times!
But not because of SIMD: Sony's lib also has a non-SIMD version, which was only 10% slower. It seems MSVC has become better at auto-vectorization.

Then I also replaced glm::ivec3 with int[3]. This was used mainly to index grids here and there, and I did not expect any change in performance at all, but I got another overall speedup of 3, which is unbelievable.

So in the end I got a speedup of 10 just by replacing GLM. I don't know the reason; the profiler showed many samples going into GLM constructors, but I did not look any further. It would be interesting to see how it behaves with Clang, but I have never set that up with VC.

We know GLM is not meant to be fast, but from my experience I cannot recommend using it at all. At least not as the core math lib for games.

Shaarigan said:
while GLM uses real platform-dependent commands that are guaranteed to turn into SIMD instructions.

um… do we need to enable this with some define?
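For reference: GLM's intrinsics path does have to be opted into at compile time. A minimal sketch, assuming GLM 0.9.9+ and the switches documented in the GLM manual (this is a config fragment, not from the thread):

```cpp
// Both defines must come before the first GLM include.
#define GLM_FORCE_INTRINSICS               // compile GLM against SSE/AVX intrinsics
#define GLM_FORCE_DEFAULT_ALIGNED_GENTYPES // align vec4/mat4 so aligned SIMD loads are legal
#include <glm/glm.hpp>
```

Without these, GLM compiles its plain scalar implementations and any SIMD comes only from the compiler's auto-vectorizer.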

I could eventually repeat my tests. It does not feel fair to criticize GLM so hard, and I assume the bad performance is more of an MSVC issue anyway.

Performance IS an MSVC issue in my experience. I don't know GLM well because I never used it for my engine code; instead I wrote my own stuff and tried to optimize that. From the source, there are A LOT of defines and preprocessor magic to include the right headers, like immintrin.h, which is not just for SIMD but also for platform atomics.

I set up my vector structs as unions so the compiler is free to choose how it accesses the values:

typedef union Vector4
{
    public:
        struct _Fields
        {
            public:
                float X;
                float Y;
                float Z;
                float W;
        };
        
    float Value[4];
    _Fields Fields;

    force_inline float X()
    {
        return Fields.X;
    }
    force_inline float X() const
    {
        return Fields.X;
    }
    force_inline void X(float value)
    {
        Fields.X = value;
    }
    ...
} vec4;

And then I wrote my code in a way that helps the compiler identify instruction sequences it can optimize into SIMD. For example, my matrix multiplication code:

inline void Mult(Matrix4 const& m1, Matrix4 const& m2, Matrix4& result)
{
    const Vector4& row1 = m2.Row(0);
    const Vector4& row2 = m2.Row(1);
    const Vector4& row3 = m2.Row(2);
    const Vector4& row4 = m2.Row(3);

    for(uint32 i = 0; i < 4; i++)
    {
        result.Rows.Row[i] = ((row1 * m1.Value[4*i + 0]) + (row2 * m1.Value[4*i + 1])) + ((row3 * m1.Value[4*i + 2]) + (row4 * m1.Value[4*i + 3]));
    }
}
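For anyone who wants to try this without the poster's Matrix4/Vector4 types, here is a self-contained, plain-array version of the same row-major 4x4 multiply (the function and parameter names are mine, not from the thread):

```cpp
#include <cstddef>

// result = m1 * m2, all matrices row-major float[16].
// Same math as above: row i of the result is the rows of m2
// scaled by the elements of row i of m1 and summed.
void mult4x4(const float m1[16], const float m2[16], float result[16])
{
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
        {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k)
                sum += m1[4 * i + k] * m2[4 * k + j];
            result[4 * i + j] = sum;
        }
}
```

The fixed trip counts are the point: a compiler that can see all three bounds are 4 can fully unroll and vectorize this, which is what the MSVC-vs-clang comparison below measures.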

It may look quite ugly, and every programmer might come along and "optimize" the instructions placed here, but it turned out there is a huge difference between MSVC and LLVM/clang compilation. With full speed optimizations turned on, LLVM/clang turned that code into SIMD instructions that performed 1 million multiplications of randomly generated matrices in less than 10 ms, while MSVC with the same compiler settings averaged only 200 ms, so it was 20 times slower on my test setup.

I don't think using 'inline' is useful for getting SIMD out of the compiler. You just told the compiler to create 1 million function bodies (minus the call, I would assume), and depending on the compiler it can totally disregard the hint, giving two totally different results that have nothing to do with SIMD.


Inline isn't the issue here; it depends on the compiler being clever enough to unroll the loop into SIMD. LLVM/clang is that clever for this loop: it is a constant loop of 4 iterations that accesses 4 items at once, so it is turned into the corresponding float128 instructions. You can test that yourself; changing the loop to be "less predictable" also causes a significant decrease in performance within an otherwise unchanged setup.

I'm just going to use my own math lib and pick up some SIMD along the way. Seems interesting enough. Maybe even toy around with some asm in the process.

Thanks


In my experience, SIMD is useful when you need to compute a large matrix many, many times. Switching back and forth in the same loop, computing a single 4x4 matrix each time with SIMD, might lead to performance issues.

