Multithreading in Stellaris - A quick performance analysis.

sharpneli

Prepare for a wall of text and some images.

DISCLAIMER:
For Paradox: Please don't sue me. This is not reverse engineering, just a somewhat more advanced way of looking at your CPU usage than Windows Task Manager. No disassembly was performed.

For others: Please don't go "Hurr paradox dumb cannot multithread". It is not easy, and all sorts of things can go wrong. It requires a lot of specialized expertise, and even then profiling and instrumentation are key.

This contains a lot of speculation on my part, as I do not know their codebase, so I may easily speculate on the wrong thing. This is just what an analysis based on educated guesses looks like.

The performance of Stellaris tanked again in this patch, so I started to wonder what it actually looks like on closer inspection. The general "only 20% usage on the other cores" is nowhere near informative enough to deduce anything. So I took the good old Intel VTune profiler and did a small run on Stellaris.

VTune is a performance profiler by Intel that only works on Intel CPUs. However, the lessons it teaches are applicable to other platforms too, and most CPU vendors have their own similar tools. Because my trusty gaming laptop happens to have an Intel CPU it is the only option, as it works by sampling hardware performance counters that are platform specific. It used to cost a ton, and still does in most cases, but there is a free community license with forum-based support only, so I can do this at home easily.

If I had the debug database of the program we could even see which specific functions take the time. The PDB (Program Database) basically just contains information like "The method at address 0xdeadbeef is called Wiz::humiliateBlondie(int amount)".

That was obviously just an example, because if there were a method to humiliate Blondie it would not need an amount, as that would always be set to max when called. So we are going in blind and can only estimate things. Also, we will only be looking at where the work happens, not at whether the work itself is well implemented.

I loaded a game in year 2310, with performance starting to tank, using the 2.2.1 patch. I get around 2 days per second with some FPS drops. At the start the game is paused; around midway I unpause it for a while and then pause again.

[Image: GNK5Qsb.png]

This just shows how much CPU time is spent in each thread. It's per thread, not per core. Immediately we can see that only 2 threads are active while we are paused, and when we unpause a bunch of others start running.

Let's figure out what the threads do. First let's zoom in around the paused area.

[Image: lRtU5lN.png]

The time between blocks of CPU activity in TID 16436 is ~16ms, corresponding to the 60fps frame rate the game is running at (1000ms / 60 ≈ 16.7ms). This thread basically runs the main loop of the game: it handles Windows events, issues DX commands, etc.

These have to live in a single thread, as DX versions older than 12 are effectively the same as OpenGL in this respect: almost purely single-threaded and tied to the main event loop. This can be seen because the thread calls the Windows functions that are relevant for this. I try to keep the amount of material as low as possible, so I won't post things like screenshots of the function calls. If I did we would be here for a long time.

TID 8780 only has functions called from inside NVIDIA's WDDM driver stack. So it's a driver thread that just participates in the rendering. Not made by Paradox.

Do note how little free time the rendering thread has left. If any more work is added to this thread it will quite quickly slow the whole thing down.

Let's look a bit at what happens when the simulation is running. I cropped this to contain a single tick. The period between those big blocks is around 0.5s, matching the ~2 ticks per second, so it's easy to see that each tick has one of those blocks.

[Image: 3ynWDAW.png]

I can't remember where, but Paradox did state that Stellaris uses OpenMP or something similar. The area marked <3 looks like an OpenMP parallel for block or something similar. Regardless of what produces it, this is good multithreading. Not quite perfect, but it's something you can be happy about. So hats off. It corresponds to around 0.4s of single-core CPU time in total. Just imagine that running in a single thread.

However, it is not the only thing we can see. The sadness block shows most threads idling and only intermittently doing anything. In this kind of case the overhead of synchronization is generally more than any gain from parallelization. But it's likely just a worker pool where the main thread is issuing commands, because in games, and in life, plenty of things are simply serialized. 1 human can give birth to a kid in 9 months, but 2 humans cannot give birth to a kid in 4.5 months.
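
The "plenty of things are serialized" point can be put in numbers with Amdahl's law (not named in the analysis above, but it is the standard formula for this). A minimal sketch: the 50% parallel fraction is an assumed, illustrative value, not something measured from Stellaris.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed, illustrative value: half of a tick is parallelizable.
for workers in (1, 2, 4, 8, 1000):
    print(workers, round(amdahl_speedup(0.5, workers), 2))
```

With half of the work serialized, no number of cores gets past a 2x speedup, which is why where the serialized part runs matters so much.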

But that is not the worst. The worst is what happens in the thread that handles the Windows main loop and drawing. It has a ton more work added to it. I bet the serialized logic of the game runs in the same thread as rendering does, and that is massive contention for performance.

We can look at the NVIDIA driver thread as a proxy for FPS, as it wakes up once per frame. You can see it under the top edge of the sadness square. And the frames are not evenly spaced: where the heavy multithreaded code runs, the frame time is 100ms for that single frame, and after that it runs fast again. Basically it stutters quite heavily at the tick rate, making the game feel slower than the average FPS would indicate.

Even so, the main thread is not running at max. It seems to stall quite often. There are a bunch of 10ms gaps where the CPU doesn't do anything; none of the cores are doing anything. In general this looks like they're waiting for some sort of mutex or whatnot, and it just happens to take 10ms for Windows to realize that there are threads to be scheduled. This is a big reason why one should be sparing with synchronization primitives. By zooming in on one gap one can see that it only contains cycles from ExReleasePushLockExclusiveEx, which is a Windows sync primitive.

Improvement suggestions and conclusions:
Putting the single-threaded work of tick calculation in the same thread that handles drawing and input may not be the best decision, as that thread is already overworked. Instead of even trying to multithread that part of the code one could "just" rip it out into a separate thread. It would still be serialized, but it would not contend for the resources of the main rendering thread. It may not be easy to separate, but it's way easier than trying to parallelize it, as the actual work doesn't change, just where it is issued. This alone would give a massive perf improvement, as so much time is spent on rendering due to DirectX. In addition, minimizing sync points to once per frame would get rid of a bunch of weird stalls.

One possibility of how something like this could look: there is a single tick state that the rendering thread reads. It only consumes. The tick-generating code reads from the previous tick state (the one the rendering thread also accesses) and generates the next state. Once the tick is calculated you just flip, so that the now fresh tick is read by both. Only one sync point per tick instead of the usual mutex lock hell.
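
A minimal sketch of that flip, in Python for brevity. Everything here is hypothetical: `TickStates`, `compute_next` and the `day` field are made up for illustration and say nothing about how the real engine is structured. The key assumption is that a published state is never modified afterwards, so the only synchronization needed is around the swap itself.

```python
import threading


class TickStates:
    """Holds the published tick state; the next one is built off to the side."""

    def __init__(self, initial):
        self._front = initial            # read by the renderer and the tick code
        self._flip_lock = threading.Lock()

    def read(self):
        # Readers grab a reference to the currently published state.
        with self._flip_lock:
            return self._front

    def publish(self, next_state):
        # The single sync point per tick: swap in the finished state.
        with self._flip_lock:
            self._front = next_state


def compute_next(state):
    # Hypothetical stand-in for the serialized tick logic.
    return {"day": state["day"] + 1}


def tick_loop(states, ticks):
    for _ in range(ticks):
        previous = states.read()         # previous tick, never mutated
        states.publish(compute_next(previous))


states = TickStates({"day": 0})
worker = threading.Thread(target=tick_loop, args=(states, 5))
worker.start()
frames_seen = [states.read()["day"] for _ in range(3)]  # the "render" side
worker.join()
print(states.read()["day"])  # 5
```

The render side always sees a complete state, either the old one or the new one, never a half-written one, and it never waits for longer than the swap takes.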

The profiler also says that the code is pretty terrible for a modern CPU. Lots of stalls, branches etc. But improving that would be just gravy. It may not even be possible, as the game must be scriptable, and that is fine. The biggest single bottleneck now is how the work is issued, not what work is issued.

So to reiterate:
Non-parallelizable tick code should be moved out of the main thread that handles rendering.

Get an in-house programmer who understands these things well enough to change it. If you don't have one, or if that person prefers other work, hire another expert, either permanently or through a specialized consulting company. The latter may be a good choice, as after the work is done you don't really need to look at it for a long time.

Thank you for reading. <3 Stellaris. After Utopia it has been personally the best 4x I've ever played.
 

Alli Baba

Nice analysis. What annoys me most right now are the lag spikes, where all the code you marked with <3 is executed in a single frame, basically lagging the game once every day. In my game at year 2400 this lag lasts a full second and then the game runs (relatively) smoothly till the next day. Fixing this might be as easy as spreading some of the code execution over multiple frames, or it might require an entire overhaul of the engine. I hope it's not the latter.
 

Danarcis

Can anyone ELI5 (explain to me like I am 5 years old), why multi-threading isn't just partitioned out by the processor itself, to the thread which is available? Some sort of automatic sorting rather than designation of particular tasks to particular threads?
 

Alluton

Danarcis said: "Can anyone ELI5 (explain to me like I am 5 years old), why multi-threading isn't just partitioned out by the processor itself, to the thread which is available? Some sort of automatic sorting rather than designation of particular tasks to particular threads?"

Because not everything can be partitioned. For many things you need to calculate X first and only then calculate Y. The processor can't really know when this is possible or not (in some cases using the old value of X for calculating Y may be completely fine; in other cases it can lead to serious issues).
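
A small sketch of the difference, using made-up numbers. The squares are independent of each other, so they can be handed to whichever worker is free; the running total, as written, needs the previous result before it can do the next step. The thread pool here only illustrates how the work is structured, it is not a speed benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

values = [1, 2, 3, 4]

# Independent work: each result depends only on its own input,
# so the items can be handed to any available worker in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(lambda v: v * v, values))

# Dependent work: each step needs the result of the previous step,
# so the steps have to run one after another.
running_total = []
total = 0
for v in values:
    total += v
    running_total.append(total)

print(squares)        # [1, 4, 9, 16]
print(running_total)  # [1, 3, 6, 10]
```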
 

Scratx

Danarcis said: "Can anyone ELI5 (explain to me like I am 5 years old), why multi-threading isn't just partitioned out by the processor itself, to the thread which is available? Some sort of automatic sorting rather than designation of particular tasks to particular threads?"

Because parallelization is hard and full of subtle mind-boggling traps for the unwary. And the experienced. And even the wizard. I'm not joking when I say entire reams of books are written about it and there are entire courses dedicated to teaching how to get parallelism going. And it's hard to get it right.

I literally don't have the time to expand on this, though, and I'm not sure if by the forum rules I can link to another game's equivalent to developer diaries to highlight some of their own talk about multithreading their game. (much interesting stuff, including a funny moment where trying to parallelize something actually made it slower due to cache invalidation)
 

Xeorm

Danarcis said: "Can anyone ELI5 (explain to me like I am 5 years old), why multi-threading isn't just partitioned out by the processor itself, to the thread which is available? Some sort of automatic sorting rather than designation of particular tasks to particular threads?"

Some things have to be done in series and attempting to learn which items could be done in parallel is likely harder than doing the actual calculations. The other hard bit is memory usage. Two threads can't access the same items in memory.

There's also the separate issue that if the game is limited by memory access times more than calculation speed (which is quite possible and happens fairly often with games) then multithreading doesn't change much.
 

bobucles

Danarcis said: "multi-threading isn't just partitioned out by the processor itself, to the thread which is available?"

Let's try illustrating with an example. I'm making stew but I'm out of tomatoes. I need you to go to the store, get some tomatoes, and bring them back. What we have are two threads of execution. I can continue dicing vegetables and doing general prep work, but after that I'm stuck. I can't finish the stew until you come back with the tomatoes.

Let's say I also want some milk because the fridge is out. You grab a friend, go to the store together and split up the shopping list. He grabs the milk while you get the tomatoes, but he runs into a friend and starts chatting it up. You're waiting for him to show up and I'm sitting on my butt because I want tomatoes to finish the stew.

Multi threading is pretty much the same way. You can throw a bunch of people at a task, but the actual hard part is figuring out how to divide it up so everyone has something to do. Not only that but managing those extra people/threads takes extra effort because you don't want people stepping on each other's toes. At some point it's more trouble dealing with them than to do it all yourself.
 

Alblaka

I never realized you could get a detailed glance into the game just by monitoring its CPU usage on threads via basic Windows tools.

Thanks for enlightening me, and another thanks for doing the analytic work here.
And, given this is more than just me whining about balance or bugs, I would dare to use a @Jamor here, hoping this might be worth a dev's time.
 

Scratx

Xeorm said: "The other hard bit is memory usage. Two threads can't access the same items in memory."

This is not actually true. Read-only access in parallel is 1000% fine. The problem is if you have one thread trying to write on one set of data and another one trying to read. You have to prevent the reader from reading while the other is writing and vice versa.

Semaphores are one way to handle this, albeit hardly the only one or the best. Ideally you'd structure it so you don't have readers and writers competing for access to the same data.

As you can imagine, this can very quickly go down the rabbit hole into Alice's Wonderland.
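
As a minimal sketch of the "keep writers from stepping on each other" part, here is a shared counter protected by a plain mutex (`threading.Lock`) rather than a semaphore. The `population` data and `add_pops` function are hypothetical, made up purely for illustration.

```python
import threading

population = {"pops": 0}          # hypothetical shared game data
population_lock = threading.Lock()


def add_pops(count):
    for _ in range(count):
        # Only one thread at a time may read-modify-write the shared value.
        with population_lock:
            population["pops"] += 1


writers = [threading.Thread(target=add_pops, args=(10_000,)) for _ in range(4)]
for t in writers:
    t.start()
for t in writers:
    t.join()

print(population["pops"])  # 40000
```

The lock makes the result correct, but every acquisition is a point where a thread can end up waiting, which is why structuring the data so that threads don't share it is usually the better option.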
 

Sifer2

That's interesting about some threads being paused while the game is paused. I noticed that while I'm paused doing stuff, the AI sometimes randomly asks for some kind of trade or diplomatic deal. I guess whatever logic they run to look for trades/deals is still running in those threads that don't pause lol.
 

exi123

This is a very interesting technical insight into how the game uses the CPU. Reminds me a bit of the Factorio Friday Facts (their dev diary). These are often extremely technical, but they show what they are doing with the game. They had a big performance upgrade last year, and they optimized their game mostly by looking at how the CPU gets fed with data and making that as efficient as possible.
 

sharpneli

Alli Baba said: "Nice analysis. What annoys me most right now are the lag spikes, where all the code you marked with <3 is executed in a single frame, basically lagging the game once every day. In my game at year 2400 this lag lasts a full second and then the game runs (relatively) smoothly till the next day. Fixing this might be as easy as spreading some of the code execution over multiple frames, or it might require an entire overhaul of the engine. I hope it's not the latter."

This is also my pet peeve. In addition even when the game runs fast it still has a small stutter as shown here. This is from a practically fresh start using the beta patch.

[Image: O48578b.png]

That's a 56ms pause, effectively the same as the game running at roughly 18fps for that frame (1000 / 56 ≈ 17.9).

I did a few checks and it appears that the game runs 12 frames per tick on the fastest speed, regardless of how long the tick takes to run. I loaded a few saves and counted the number of spikes in the driver thread between those blocks of work, and it always came out as 12. The driver thread basically wakes up only when you present an image using IDXGISwapChain::Present or wglSwapBuffers (unless you do GPU readbacks, but they likely don't do those).

They seem to have the work spread out across the frames, but it is definitely not well spread out. And worst of all, it's spread out in a fixed fashion. So if one part is heavier than the others it will always stutter, and it just gets worse.

This is precisely the reason why it would be good to detach the tick calculation completely from the main rendering thread. No need to even think about how to spread the work across frames. The game would always maintain a smooth 60fps even if it took a day to calculate one tick. To cap the tick speed, just make the rendering thread increment a counter, and have the tick thread peek at it and not start a new tick unless a certain number of frames or amount of time has passed.
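
A rough sketch of that counter, assuming a cap of one tick per 12 presented frames. `TickGate` and its method names are hypothetical, and the two threads are simulated by a single loop here so that the behaviour is easy to follow.

```python
import threading


class TickGate:
    """Frame counter shared by the render thread and the tick thread."""

    def __init__(self, frames_per_tick):
        self._frames_per_tick = frames_per_tick
        self._frames = 0
        self._last_tick_frame = 0
        self._lock = threading.Lock()

    def on_frame_presented(self):
        # Called by the render thread once per frame.
        with self._lock:
            self._frames += 1

    def try_start_tick(self):
        # Called by the tick thread; True only if enough frames have passed.
        with self._lock:
            if self._frames - self._last_tick_frame < self._frames_per_tick:
                return False
            self._last_tick_frame = self._frames
            return True


gate = TickGate(frames_per_tick=12)  # 12 is the frames-per-tick count observed
started = []
for frame in range(1, 25):
    gate.on_frame_presented()
    if gate.try_start_tick():
        started.append(frame)
print(started)  # [12, 24]
```

If a tick takes longer than 12 frames, the tick thread simply starts the next one late; the render thread never waits on it.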

Doing that may require an overhaul of the engine, but I don't think it's likely. I have done similar things to existing codebases and have yet to encounter a case where it couldn't be done at least to some degree.
 

sharpneli

I originally started the investigation because I had seen so many "Why don't they multithread" posts and wanted to get some timings to show that they in fact do, as claimed. Quite quickly it turned into something a bit more interesting. And I just had to do one more quick round.

This is a continuation and super rough analysis on what kind of performance benefits it _may_ bring to separate the serialized part of tick calculation in it's own thread.

All numbers here are super rough estimates. And a lot would in the end depend on things like how much the animation interpolation between ticks actually takes time. We are now assuming that even that is part of the tick code, which it in reality cannot be. This shifts our estimate into more conservative direction so it is fine. There are other possibilities that shifts the estimate to be too loose. So very strong YMMV applies here. But this kind of back of the envelope calculation gives us some idea what could happen.

In addition we are interested in how many ticks per second the game could handle. So in case where we assume more work to the tick calculation than there is the real result would be even better.

Again I loaded the save where perf is tanking. To render 12 frames and handle the input for all of those frames takes 76ms of CPU time on this laptop. This is with game paused so the animation system is paused too and will likely idle happily. But it is work that also needs to happen per frame at minimum.

Running the 12 frames takes 400ms of CPU time for the mainthread taking 487ms of real time. I took the limit of 12 frames roughly from the vsync points showed by the driver thread.

a0St58s.png


So let's remove the 76ms of rendering work, which brings us to ~324ms of purely serialized work. As the tick thread wouldn't have to deal with things like waiting for vsync, this total CPU time gives a better estimate than real time: the gaps that increase the real time consumed pretty much always happen after vsync, so the rendering thread is just stalling inside the present function and/or the Windows message loop handling function.

This would then increase our tick rate from 2.05 ticks per second to a bit over 3 ticks per second, a 50% improvement at minimum. Also let's not forget that it would allow the rendering to run at a smooth 60fps constantly, without the roughly 0.1 second stall per actual tick. In reality the result would likely be higher than that, as things like the animation code that runs while the game is unpaused would stop interfering with the serialized parts of the tick calculation itself.
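
For anyone who wants to check the back-of-the-envelope numbers, the arithmetic can be written out in a few lines. This is only a sketch: it assumes, as the estimate above does, that the 12 measured frames correspond to one tick, and the variable names are made up for illustration.

```python
# Rough tick-rate estimate from the measurements quoted in this post.
# Assumption (from the post): the 12 captured frames correspond to one tick.

RENDER_CPU_MS = 76   # CPU time to render 12 frames and handle input (game paused)
MAIN_CPU_MS = 400    # main-thread CPU time for the 12 frames
MAIN_WALL_MS = 487   # real (wall-clock) time for the same 12 frames

# Serialized work left over if rendering lived on a different thread.
tick_only_ms = MAIN_CPU_MS - RENDER_CPU_MS

current_rate = 1000 / MAIN_WALL_MS     # ticks per second as measured
estimated_rate = 1000 / tick_only_ms   # ticks per second with a separate tick thread
improvement = estimated_rate / current_rate - 1

print(f"tick-only work: {tick_only_ms} ms")
print(f"current:        {current_rate:.2f} ticks/s")
print(f"estimated:      {estimated_rate:.2f} ticks/s")
print(f"improvement:    {improvement:.0%}")
```

Note that the current rate is computed from real time while the estimate uses CPU time, for the reason given above: a dedicated tick thread would not sit waiting on vsync.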

The rendering/UI thread does have to communicate a bit with the tick thread though, because the player can make decisions that affect the next tick. One way to solve this could be to just push the changed orders into a queue. Then at the end of the tick, check the queue and rerun the parts that need to change. Repeat as necessary. It is probably safe to assume that the player cannot be so fast as to prevent the next tick from ever occurring.
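
A minimal sketch of that queue idea, in Python for brevity. Everything here is hypothetical: the `state` dictionary, the `(target, value)` order format and the function names are invented for illustration and say nothing about how the actual engine is structured. For simplicity it reruns the whole (trivial) tick rather than only the affected parts.

```python
import queue
import threading

# Thread-safe queue shared by the UI thread (producer) and tick thread (consumer).
orders = queue.Queue()


def drain(q):
    """Return every order currently in the queue without blocking."""
    pending = []
    while True:
        try:
            pending.append(q.get_nowait())
        except queue.Empty:
            return pending


def run_tick(state, q):
    """Compute one tick, rerunning it while new orders keep arriving."""
    passes = 0
    while True:
        # Stand-in for the real simulation step.
        state["total"] = sum(state["values"].values())
        passes += 1
        changed = drain(q)
        if not changed:
            break  # nothing arrived during this pass, so the tick is final
        for target, value in changed:
            state["values"][target] = value
    state["tick"] += 1
    return passes


state = {"tick": 0, "values": {"fleet_a": 1, "fleet_b": 2}, "total": 0}

# The UI thread only ever pushes orders; it never touches `state` directly.
ui = threading.Thread(target=lambda: orders.put(("fleet_a", 10)))
ui.start()
ui.join()  # joined up front only to keep this demo deterministic

passes = run_tick(state, orders)
print(passes, state)
```

The point of the design is that the only shared object is the queue, so the tick thread never needs to take a lock around the game state itself.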

It is also possible that the actual tick only happens during that 0.1s stall and all of the other stuff in the main thread is just the animation system etc. In that case the separation would only help with the stuttering. This is pretty much as far as we can go without going into reverse-engineering territory. But indications of how the AI works would imply that at least some of it runs in the main thread.

I really would like to see a Stellaris dev comment, at least so far as to tell how wrong these assumptions are.
 

Volapyk

This is very fascinating to me at least, and to you as well by the looks of it. I don't have the knowledge or the tools to do such an in-depth analysis, but I am curious about a few things.

While you state that you won't reverse engineer it, which sounds like a good plan, would it be possible to see the effect of different in-game events on the CPU, to try to narrow down other places where some optimization would be greatly beneficial?

For example, there have been a few threads claiming gateways and wormholes result in massive spikes, and others claiming it is the trade route calculations or job management. What about the number of empires, settled planets, galaxy population, or number of wars?

Okay, guess I'm really curious now. I might have to try to figure out how this CPU profiler works.
 

Alastor

I doubt anyone who has even a basic understanding of how things work really believes they don't multithread at all. I believe the idea was and is they don't do it well. And can do a lot better. Anyway this has been a fairly interesting read and it makes a good deal of sense.

Let's hope the devs are taking notice.
 

SectorsAreOkay

For those wondering why the CPU doesn't automatically split things up, the answer is that it does, but within a single core. It reorders instructions and executes them, or parts of them, in parallel so that all functional units of the CPU are active, if possible. It detects data dependencies and program flow dependencies in order to do this safely (modulo Spectre and Meltdown-type attacks). Hyperthreading is an extension of this where a single core is presented as if it were two cores and two threads run at the same time, with their work interleaved automatically.
 

SectorsAreOkay

Alastor said:
"I doubt anyone who has even a basic understanding of how things work really believes they don't multithread at all. I believe the idea was and is they don't do it well. And can do a lot better. Anyway this has been a fairly interesting read and it makes a good deal of sense.

Let's hope the devs are taking notice."

A lot of people actually believed they just didn't multithread at all, and also that all they needed to do was add more threads and it would magically work.

The analysis in this thread may be spot on, but it may also not work. I remember the Factorio devs talking about when they actually multithreaded game logic in a way that really seemed like it would help, but it actually made performance worse. That may be the case for Stellaris. They may need to just do some plain old single-threaded optimization to reduce memory contention, unnecessary calculations, etc.
 

sharpneli

Volapyk said:
"This is very fascinating to me at least, and to you as well by the looks of it. I don't have the knowledge or the tools to do such an in-depth analysis, but I am curious about a few things.

While you state that you won't reverse engineer it, which sounds like a good plan, would it be possible to see the effect of different in-game events on the CPU, to try to narrow down other places where some optimization would be greatly beneficial?

For example, there have been a few threads claiming gateways and wormholes result in massive spikes, and others claiming it is the trade route calculations or job management. What about the number of empires, settled planets, galaxy population, or number of wars?

Okay, guess I'm really curious now. I might have to try to figure out how this CPU profiler works."

A tool such as this is simply divine for a developer. Let's look at the bottom-up view of a Stellaris capture that I took with stack traces on.

GWlxoY7.png


If we were Stellaris devs with access to the source code, then instead of listing things like func@0xdeadbeef it would directly name the function and tell which source file contains it. It is also capable of reporting things like "Line 200 in wizIsAwesome.cpp has X amount of L2 cache misses". On the right, under "Viewing 1 of 5 selected stacks", it tells us that for 42.9% of the CPU time this function took, it was called from those functions (read from the bottom up). Without the debug database VTune can only tell that there was a function. Even so, it allows us to jump directly to the disassembly and see which instructions were executed and how much time each part took. So one has to be careful not to double-click a line in this view :rolleyes:
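
To make the "bottom-up" idea concrete, here is a small sketch of how sampled call stacks can be folded into that kind of view. It is not how VTune is implemented, and the samples and function names (`find_job`, `update_pops` and so on) are invented; it only illustrates where a figure like "X% of this function's time came from this caller chain" comes from.

```python
from collections import Counter, defaultdict

# Made-up samples: each one is a call stack, outermost caller first.
# A real sampling profiler collects thousands of these at a fixed interval.
samples = [
    ("main", "tick", "update_pops", "find_job"),
    ("main", "tick", "update_pops", "find_job"),
    ("main", "tick", "update_pops", "find_job"),
    ("main", "tick", "ai_plan", "find_job"),
    ("main", "render", "draw_ui"),
    ("main", "render", "draw_ui"),
    ("main", "tick", "ai_plan"),
]

self_time = Counter()           # samples in which the function was on top of the stack
callers = defaultdict(Counter)  # function -> caller chain -> number of samples

for stack in samples:
    leaf = stack[-1]
    self_time[leaf] += 1
    # Store the chain leaf-first, which is how a bottom-up view is read.
    callers[leaf][tuple(reversed(stack[:-1]))] += 1


def caller_share(func):
    """Fraction of func's self time attributed to each caller chain."""
    total = self_time[func]
    return {chain: n / total for chain, n in callers[func].items()}


hottest, hits = self_time.most_common(1)[0]
print(f"{hottest}: {hits / len(samples):.1%} of all samples")
for chain, share in caller_share(hottest).items():
    print(f"  {share:.1%} via {' <- '.join(chain)}")
```

Without symbols the same aggregation still works; the keys are just raw addresses like func@0xdeadbeef instead of readable names.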

So a developer just needs a savegame with the perf issues, and the part of the code causing the slowdown will jump out, clearly pointed out. Tools such as this are pretty much mandatory for optimizing code.

But as we are not Stellaris devs, figuring out which individual parts take CPU time is time consuming, whereas the devs would just see the culprit directly.

Tools like these have plenty of limitations though. They can show the call stacks of C/C++ code and in some cases .NET and Java. For any internal scripting engine they will just show the C++ functions implementing the scripting engine.

EDIT:
SectorsAreOkay said:
"The analysis in this thread may be spot on, but it may also not work. I remember the Factorio devs talking about when they actually multithreaded game logic in a way that really seemed like it would help, but it actually made performance worse. That may be the case for Stellaris. They may need to just do some plain old single-threaded optimization to reduce memory contention, unnecessary calculations, etc."

You are correct. The only thing I'm pretty sure about is that mixing the DirectX rendering code with the other things is not a good idea. The end result should look like two threads being close to maxed out, instead of just one. The pre-DX12/Vulkan graphics APIs are massive CPU hogs, and in addition one is at the mercy of vsync.
 

Alastor

SectorsAreOkay said:
"A lot of people actually believed they just didn't multithread at all, and also that all they needed to do was add more threads and it would magically work.

The analysis in this thread may be spot on, but it may also not work. I remember the Factorio devs talking about when they actually multithreaded game logic in a way that really seemed like it would help, but it actually made performance worse. That may be the case for Stellaris. They may need to just do some plain old single-threaded optimization to reduce memory contention, unnecessary calculations, etc."

A lot of people don't have a basic understanding.

Indeed, parallelization is tricky. But even if this particular suggestion doesn't work, something else will. The fact remains: the game is not well optimized and it can definitely do a lot better; the how is up to the developers. The reason we are even discussing this is that they have consistently failed to do their part.
 