Prepare for a wall of text and some images.
DISCLAIMER:
For Paradox: Please don't sue me. This is not reverse engineering, just a slightly more advanced way of looking at your CPU usage than Windows Task Manager. No disassembly was performed.
For others: Please don't go "Hurr paradox dumb cannot multithread". It is not easy, and many things can get in the way. It requires a lot of specialized expertise, and even then profiling and instrumentation are key.
This contains a lot of speculation on my part, as I do not know their codebase, so I may easily speculate about the wrong thing. This is just what an analysis based on educated guesses looks like.
The performance of Stellaris tanked again in this patch, so I started to wonder what its performance actually looks like up close. The usual "only 20% usage on the other cores" is nowhere near informative enough to deduce anything. So I took the good old Intel vTune profiler and did a small run on Stellaris.
vTune is a performance profiler by Intel that only works on Intel CPUs, but the lessons it teaches apply to other platforms too. Most CPU vendors have their own similar tools. Because my trusty gaming laptop happens to have an Intel CPU, it is the only option for me, as it works by sampling hardware performance counters that are platform specific. It used to cost a ton, and still does in most cases, but there is a free community license with only forum-based support, so I can use it at home easily.
If I had the program's debug database, we could even see which specific functions take the time. The PDB (Program Database) basically just contains information like "the method at address 0xdeadbeef is called Wiz::humiliateBlondie(int amount)".
That was obviously just an example, because if there were a method to humiliate Blondie it would not need an amount parameter; that would always be set to max when called. So we are going in blind and can only estimate things. We will also only be looking at where the work happens, not whether the work itself is done badly.
I loaded a save from year 2310, where performance is starting to tank, on the 2.2.1 patch. I get around 2 days per second with some FPS drops. At the start the game is paused; around midway I unpause it for a while and then pause again.
This just shows how much CPU time is spent per thread (not per core). Immediately we can see that only 2 threads are active while paused, and when we unpause a bunch of others start running.
Let's figure out what the threads do. First let's zoom in around the paused area.
The time between blocks of CPU activity in thread 16436 is ~16 ms, corresponding to the 60 FPS the game is running at. This thread basically runs the main loop of the game: it handles Windows events, issues DX commands, etc.
These have to live in a single thread because anything older than DX12 is effectively the same as OpenGL: almost purely single-threaded and tied to the main event loop. This can be seen from the Windows functions the thread calls. I try to keep the amount of material here as low as possible, so I won't post screenshots of the actual call stacks; if I did, we would be here for a long time.
TID 8780 only contains functions from inside Nvidia's WDDM driver stack, so it's a driver thread that just participates in rendering. Not made by Paradox.
Do note how little free time the rendering thread has left. If any more work were added to this thread, it would quite quickly slow the whole thing down.
Let's look at what happens when the simulation is running. I cropped this to contain a single tick. The period between those big blocks is around 0.5 s, matching the roughly two ticks per second, so it's easy to see that each tick produces one of them.
I can't remember where, but Paradox has stated that Stellaris uses OpenMP or something similar. The area marked <3 looks like a block of OpenMP parallel-for loops or something along those lines. Regardless of what produces it, this is good multithreading. Not quite perfect, but something you can be happy about, so hats off. It corresponds to around 0.4 s of single-core CPU time in total; just imagine that running on a single thread.
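For anyone who hasn't seen OpenMP before, here is a minimal sketch of the kind of loop that produces a block like that in a profiler. I obviously don't know what Paradox's actual loop looks like; every name here is invented for illustration.

```cpp
// Hypothetical sketch of an OpenMP parallel-for block (compile with -fopenmp).
// None of these names come from the game; they are purely illustrative.
#include <vector>

struct Pop { float growth = 0.0f; };

// Assumed to be independent per-pop work with no writes to shared state.
static void update_pop(Pop& p) { p.growth += 0.1f; }

void update_all_pops(std::vector<Pop>& pops)
{
    // OpenMP splits this iteration range across the worker threads, which is
    // exactly the kind of dense, all-threads-busy block the <3 area shows.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(pops.size()); ++i)
        update_pop(pops[i]);
}
```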
However, that is not the only thing we can see. The sadness block shows most threads idling and only intermittently doing anything. In cases like that, the overhead of synchronization is generally larger than any gains from parallelization. But it's likely just a worker pool where the main thread is issuing commands, because in games, as in life, plenty of things simply are serial: one human can give birth to a kid in 9 months, but two humans cannot give birth to a kid in 4.5 months.
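To put a rough number on that intuition, this is just Amdahl's law (the numbers below are purely illustrative, not measured from this trace):

speedup = 1 / ((1 - p) + p / N)

If only half of a tick (p = 0.5) can run in parallel, then even with N = 8 cores the best possible speedup is 1 / (0.5 + 0.5/8) ≈ 1.8x. The serial part caps everything, which is why a pile of mostly idle worker threads is not automatically a scandal.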
But that is not the worst part. The worst is what happens in the thread that handles the Windows main loop and drawing: it gets a ton of extra work added to it. I bet the serialized logic of the game runs in the same thread as the rendering, and that is a massive source of contention.
We can use the Nvidia driver thread as a proxy for FPS, since it wakes up once per frame. You can see it just under the top edge of the sadness square. And it is not evenly spaced: where the heavy multithreaded code is running, the frame time is ~100 ms for that single frame, and afterwards it runs fast again. Basically the game stutters quite heavily in step with the tick rate, making it feel slower than the average FPS alone would indicate.
Even so, the main thread is not running flat out. It seems to stall quite often: there are a bunch of ~10 ms gaps where the CPU isn't doing anything. None of the cores are. This generally looks like waiting on a mutex or similar, where it just happens to take 10 ms for Windows to realize there are threads ready to be scheduled. This is a big reason why one should be sparing with synchronization primitives. Zooming into one of the gaps shows that it only contains cycles from ExReleasePushLockExclusiveEx, which is a Windows synchronization primitive.
Improvement suggestions and conclusions:
The decision to put the single-threaded tick calculation in the same thread that handles drawing and input may not be the best one, as that thread is already overworked. Instead of even trying to multithread that part of the code, one could "just" rip it out into a separate thread. It would still be serial, but it would not contend for the resources of the main rendering thread. It may not be easy to separate, but it's far easier than trying to parallelize it, since the actual work doesn't change, only where it is issued. This alone would give a massive performance improvement, as so much time is already spent on rendering due to DirectX. In addition, reducing synchronization to one point per frame would remove a bunch of the weird stalls.
One possibility of how this could look (see the sketch below): there is a single tick state that the rendering thread reads; it just consumes it. The tick-generating code reads from the previous tick state (the one the rendering thread is also accessing) and generates the next state. Once the tick is calculated, you just flip so that the now-fresh tick is the one being read by both. One sync point per tick instead of the usual mutex-lock hell.
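Here is a minimal sketch of that double-buffered handoff, assuming the renderer only ever reads the published state and the simulation only ever writes the other one. Again, all names and the exact layout are made up; this is not Paradox's code.

```cpp
// Double-buffered tick state: one sync point (the flip) per tick.
// Invented names, purely a sketch of the idea described above.
#include <atomic>

struct TickState { /* whatever snapshot of the galaxy the renderer needs */ };

class TickBuffer {
public:
    // Render thread: read the most recently published tick.
    const TickState& front() const {
        return states[front_index.load(std::memory_order_acquire)];
    }

    // Simulation thread: build the next tick in the other slot.
    TickState& back() {
        return states[front_index.load(std::memory_order_relaxed) ^ 1];
    }

    // Simulation thread: flip once the tick is done. In a real engine this
    // would need to be coordinated with the renderer's frame boundary (or a
    // third buffer added) so the simulation never starts overwriting a state
    // the renderer is still reading mid-frame.
    void publish() {
        front_index.fetch_xor(1, std::memory_order_acq_rel);
    }

private:
    TickState states[2];
    std::atomic<int> front_index{0};
};
```

The point is not this exact structure but the shape of it: the simulation and rendering threads each own their side, and they only meet at one well-defined flip per tick instead of locking shared state all over the place.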
The profiler also says that the code is pretty unfriendly to a modern CPU: lots of stalls, branch mispredictions, etc. But improving that would just be gravy, and it may not even be possible, since the game must stay scriptable, and that is fine. The biggest single bottleneck right now is how the work is issued, not what work is issued.
So to reiterate:
Non-parallelizable tick code should be moved off the main thread that handles rendering.
Get a programmer in-house who understands this well enough to change it, or, if you don't have one (or that person prefers other work), hire an expert, either permanently or through a specialized consulting company. The latter may be a good choice, since once the work is done you don't really need to touch it again for a long time.
Thank you for reading. <3 Stellaris. After Utopia it has personally been the best 4X I've ever played.