Stellaris Dev Diary #181: Threading and Loading Times

Hello everyone, this is The French Paradox speaking!

On behalf of the whole Stellaris team, we hope you've had a good summer vacation, with current circumstances and all!

We're all back to work, although not at the office yet. It is going to be a very exciting autumn and winter with a lot of interesting news! We are incredibly excited to be able to share the news with you over the coming weeks and months!

Today I open the first look at the upcoming 2.8 release with some of the technical stuff that we programmers have been working on over summer. The rest of the team will reveal more about the upcoming content and features in the following diaries.

Without further ado, let's talk about threads!

Threads? What threads?

There is a running joke that says fans are always wondering which one will come first: Victoria III or a PDS game using more than one thread.

image (26).png

Don't lie, I know that's how some of you think our big decision meetings go

I’m afraid I’ll have to dispel the myth (again): all PDS games in production today use threads, from EU4 to CK3. Even Stellaris! To better explain the meme and where it comes from, we have to go through a little history. I’m told you guys like history.

For a long time, the software industry relied on “Moore’s Law”, popularly summarized as: a CPU built two years from now will be roughly twice as powerful as one built today. (Strictly speaking, the law is about the number of transistors on a chip doubling roughly every two years.)
This was especially true in the 90s, when CPUs went from 50 MHz to 1 GHz in the span of a decade. The trend continued until 2005, when we reached up to 3.8 GHz. And then clock speeds stopped growing. In the 15 years since, the frequency of CPUs has stayed roughly the same.
As it turns out, the laws of physics make it quite inefficient to increase speeds beyond 3-4 GHz. So instead manufacturers went in another direction and started “splitting” their CPUs into several cores and hardware threads. This is why today you’ll look at how many cores your CPU has and won’t spend much time checking the frequency. Moore’s Law is still valid, but, to put it in strategy terms, the CPU industry reached a soft cap while trying to play tall, so they changed the meta and started playing wide.

This shift profoundly changed the software industry. Writing code that runs faster on a CPU with a higher clock speed is trivial: most code will naturally do just that. But making use of threads and cores is another story. Programs do not magically “split” their work in 2, 4 or 8 to be able to run on several cores simultaneously; it’s up to us programmers to design around that.

Threading nowhere faster

Which brings us back to our games and a concern we keep reading on the forums: “is the game using threads?”. The answer is yes, of course! In fact, we use them so much that we had a critical issue a few releases back where the game would not start on machines with 2 cores or fewer.

But I suspect the real question is: “are you making efficient use of threads?”. Then the answer is “it depends”. As I mentioned previously, making efficient use of more cores is a much more complex issue than making use of more clock cycles. In our case, there are two main challenges to overcome when distributing work among threads: sequencing and ordering.

Sequencing issues occur when 2 computations running simultaneously need to access the same data. For example let’s say we are computing the production of 2 pops: a Prikki-Ti and a Blorg. They both access the current energy stockpile, add their energy production to it and write the value back. Depending on the sequence, they could both read the initial value (say 100), add their production (say 12 and 3, the Blorg was having a bad day) and write back. Ideally we want to end up with 115 (100 + 12 + 3). But potentially both would read 100, then compute and overwrite each other ending up with 112 or 103.
The simple way around it is to introduce locks: the Prikki-Ti would “lock” the energy value until it’s done with its computation and has written the new value back, then the Blorg would take its turn and add its own. While this solves the problem, it introduces a greater one: the actions are now sequential again, and the benefit of doing them on concurrent threads has been lost. Worse, due to the cost of locking, unlocking and synchronizing, the whole thing will likely take longer than if we simply computed both on the same thread in the first place.
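
To make the lost update concrete, here is a minimal Python sketch. It is not Stellaris code: the `Stockpile` class and both functions are invented for illustration. The first function replays the unlucky interleaving by hand so that the result is reproducible, and the second uses a real lock around the whole read-modify-write.

```python
import threading


class Stockpile:
    """Hypothetical shared energy stockpile (illustration only)."""

    def __init__(self, energy):
        self.energy = energy
        self.lock = threading.Lock()


def lost_update(initial, first, second):
    """Replay the bad interleaving by hand: both pops read before either writes."""
    stock = Stockpile(initial)
    read_a = stock.energy            # first pop reads the stockpile
    read_b = stock.energy            # second pop reads the same, stale value
    stock.energy = read_a + first    # first pop writes its result
    stock.energy = read_b + second   # second pop overwrites it
    return stock.energy


def locked_update(initial, productions):
    """Each pop holds the lock for its whole read-modify-write."""
    stock = Stockpile(initial)

    def produce(amount):
        with stock.lock:
            stock.energy = stock.energy + amount

    threads = [threading.Thread(target=produce, args=(p,)) for p in productions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stock.energy


print(lost_update(100, 12, 3))      # 103: the Prikki-Ti's 12 energy is lost
print(locked_update(100, [12, 3]))  # 115, but the two updates ran one after the other
```

The locked version is correct, yet the lock forces the two additions to run one at a time, which is exactly the loss of parallelism described above.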

The second issue is ordering, or “order dependency”: in some cases, changing the order of operations changes the outcome. For example, let’s say our previous Prikki-Ti and Blorg decide to resolve a dispute in a friendly manner. We know the combat system will process both combatants, but since there are potentially hundreds of combat actions happening, we don’t know which one will happen first. Worse, the order can differ between 2 machines: on the server the Prikki-Ti action might happen first, while on the client the Blorg acts first.

OOS.png

#BlorgShotFirst

On the server the Prikki-Ti action is resolved first, killing the Blorg. The Blorg action that comes after (possibly on another thread) is discarded as dead Blorgs can’t shoot (it’s a scientific fact). The client however distributed the computation in another way (maybe it has more cores than the server) and in his world the Blorg dispatched the Prikki-Ti first, which in turn couldn’t fight back. Then both players get the dreaded “Player is Out of Sync” popup as their realities have diverged.
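
The divergence can be reproduced in a few lines of Python. This is only a sketch under made-up assumptions: the `Combatant` class, the hit points and the damage values are invented, and the processing order is passed in explicitly to stand in for whatever order the threads happen to produce on each machine.

```python
from dataclasses import dataclass


@dataclass
class Combatant:
    name: str
    hp: int
    damage: int


def resolve_immediately(order):
    """Apply each shot as soon as it is processed; dead combatants cannot shoot."""
    # Fresh state on every call so both "machines" start from the same world.
    units = {
        "Prikki-Ti": Combatant("Prikki-Ti", hp=10, damage=10),
        "Blorg": Combatant("Blorg", hp=10, damage=10),
    }
    target_of = {"Prikki-Ti": "Blorg", "Blorg": "Prikki-Ti"}
    for name in order:
        shooter = units[name]
        if shooter.hp <= 0:
            continue  # this action is discarded: the shooter is already dead
        units[target_of[name]].hp -= shooter.damage
    return sorted(u.name for u in units.values() if u.hp > 0)


server = resolve_immediately(["Prikki-Ti", "Blorg"])
client = resolve_immediately(["Blorg", "Prikki-Ti"])
print(server)            # ['Prikki-Ti']
print(client)            # ['Blorg']
print(server == client)  # False: the two game states have diverged
```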

There are, of course, ways to solve the problem, but they usually require redoing the design in a way that satisfies both constraints. For example in our first case each thread could store the production output of each pop to add to each empire, and then those could be consolidated at the end. In the same fashion our 2 duelists problem could be solved by recording damage immediately, but applying the effects in another phase to eliminate the need for a deterministic order.
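
As a rough illustration of the second fix (record first, apply later), here is a Python sketch. The function name, hit points and damage values are assumptions made up for the example, not how the actual combat system is written.

```python
def resolve_two_phase(order):
    """Phase 1 records damage without touching hit points; phase 2 applies it."""
    hp = {"Prikki-Ti": 10, "Blorg": 10}
    damage = {"Prikki-Ti": 10, "Blorg": 10}
    target_of = {"Prikki-Ti": "Blorg", "Blorg": "Prikki-Ti"}

    # Phase 1: everyone alive at the start of the tick fires.
    # Hit points are only read here, so the processing order cannot matter.
    pending = {name: 0 for name in hp}
    for name in order:
        if hp[name] > 0:
            pending[target_of[name]] += damage[name]

    # Phase 2: apply the recorded damage in one place.
    for name in pending:
        hp[name] -= pending[name]
    return hp


print(resolve_two_phase(["Prikki-Ti", "Blorg"]))  # {'Prikki-Ti': 0, 'Blorg': 0}
print(resolve_two_phase(["Blorg", "Prikki-Ti"]))  # {'Prikki-Ti': 0, 'Blorg': 0}
```

Both orders now give the same result, so server and client agree. Note that the fix changed the game rules along the way (both duelists go down), which is one reason such redesigns are more than a purely technical change.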

As you can imagine, it is much easier to design something with threading in mind rather than retrofitting an existing system for it. If you don’t believe me just look at how much time is spent retrofitting your fleets, I’ll wait.

The good news

This is all nice and good, but what’s in it for you in the next patch, concretely? Well you will be happy to hear that I used some time to apply this to one of the oldest bits of our engine: the files and assets loading system.

For the longest time we have used third-party software to handle this. While it saved us a lot of trouble, it has also turned out to be quite bad at threading, to the point that it was sometimes slower with more cores than with fewer, most notably due to the locking issues I mentioned before.
In conjunction with a few other optimizations, this work has enabled us to drastically reduce the startup time of the game.
I could spend another thousand words explaining why, but I think this video will speak better:


This comparison was done on my home PC, which uses a venerable i7 2600K and an SSD drive. Both were “hot” startups (the game had been launched recently), but in my experiments I found that even on a “cold” start it makes a serious difference.

To achieve the best speedup, you will need to use the new beta DirectX 11 rendering engine. Yes, you read correctly: the next patch will also offer an open beta which replaces the old DX9 renderer with a more recent DX11 version that was initially made by our friends at Tantalus for the console edition of Stellaris. While visually identical, using DX11 to render graphics enables a whole range of multi-threading optimizations that are hard or impossible to achieve with DX9. Playing with the old renderer will still net you a nice speedup on startup, and the splash screen step should still be much faster, but you’re unlikely to see the progress bar “jump” as it does with DX11 when the game loads the models and textures.

Some of those optimizations have also been applied to newer versions of Clausewitz, and will be part of CK3 on release. Imperator should also benefit from them. It might be possible to apply them to EU4 and HoI4 as well, but so far my experiments with EU4 haven’t shown a huge speedup like they did for Stellaris and CK3.

If you want to read more technical details about the optimizations that were applied to speed up Stellaris, you can check out the article I recently published on my blog.

And with that I will leave you for now. This will likely be my last dev diary on Stellaris, as next month I will be moving teams to lead the HoI4 programmers. You can consider those optimizations my farewell gift.
This may have been a short time for me on Stellaris but don’t worry: even if I go, Jeff will still be there for you!
 
No. But I CAN blame people for being jackasses about it instead of being decent human beings. And I'm speaking as a Day One buyer myself.

If we're gonna bring decency into this, I think a lot of people would agree that there is a deficit on both sides.

Stellaris is the most expensive game I own. Stellaris is the buggiest game I own, and considering I own a few early access titles that's really something.

I don't want to imply that the Devs themselves are behaving indecently, but at the company level, yeah this reeks of an exploitative cash grab rather than a fair exchange of currency for goods.
 

Perhaps you're right, and Stellaris was just meant to be a cash-grab by the higher ups since space 4x games were popular (And one could argue still are popular) at the time Stellaris was released.

Bringing it back on topic with the DD, I will admit that the Loading time issue wasn't much of an issue (Though something at least a few people had, given some of the posts here) but, as mentioned in the DD, it helps the developers most of all.

That includes getting to the bottom of the performance issues and seeing where they can be streamlined.

For example, if the number of pops is the issue, make it so each pop is roughly equal to a number of people (Something like 1 pop equals 1 billion people). If that's the case, just make the district system run on that assumption, 1 job is roughly equal to 1 billion jobs of that type.

That would lower the amount of pop calculation lag.
 
Perhaps you're right, and Stellaris was just meant to be a cash-grab by the higher ups since space 4x games were popular (And one could argue still are popular) at the time Stellaris was released.
I will admit that the Loading time issue wasn't much of an issue but, as mentioned in the DD, it helps the developers ( Paradox ) most of all.
Fine, you've spotted how Paradox sets its priorities, but ...
I CAN blame people (customers) for being jackasses about it instead of being decent human beings.
You don't blame Paradox for its egoism, but you call the customers "jackasses" and indecent if they express any form of critique about it? And before you begin to explain how normal this is since Paradox is a company: first, egoism is also normal for customers, and second, you were the one who started with subjective moral indignation that is not aimed at Paradox at all but, strangely (in this context: exclusively), at customers.
 
64 bit, AVX-512, DX 12, Vicky III confirmed
Actually we thought about enabling AVX on Imperator release, turned out the gain wasn't worth making a whole class of CPUs unable to boot the game.
For the rest I cannot confirm or deny ;)
 
All your suggestion would do is add an extra step.
Where is it adding extra steps? The opposite is the case. Not every single pop has to "ask" for the value of the stock and then rewrite it after "the work is done". This way all pops can be parallelized (the way it is now, they can't), and only at the very end is the stock value rewritten. I'm not sure you really see the change here.
 

To start off, unless you have a few thousand cores, parallelizing EVERY pop is overkill. I don't know exactly how much processing pops take up, and it might be that you can keep jobs all on the same thread and split off other planetary stuff into a different thread or something. Also, the actual number of threads available differs based on the computer, so whatever system you have needs to be able to handle that.

The way I understand code, you HAVE to ask for the value of something in order to affect it relative to its current value (i.e. adding or subtracting). If you add one, you don't send it an "increase by one" command; you pull the initial value, increase it by one, then push back the updated value. The reason this can't be parallelized is that if pop A pulls the value in the time between pop B asking for the value and giving back the updated value, then when pop A updates the value it overwrites whatever production pop B did.

It doesn't matter if the number that the pops are updating is the base stockpile or some other value. If multiple pops are accessing it, they have to do it in sequence (or write in safeties, which would take even longer). The only way to properly parallelize the jobs would be to split the pops into multiple groups, calculate the changes in each group sequentially, then combine the results of all the groups sequentially.

Something I just realized is that the easiest way to divide up pops is by empire, since each empire calculates resources individually.
 

Personally, I feel that calculating individual pops is the cause of a lot of the problem there. Every single last pop needs to calculate a lot of things, and as a result, every time a new pop is generated, a lot more calculations need to happen. I feel turning pops into a more abstract thing would work better than calculating them all as individual objects.

Have a planetary population variable that counts how many pops are on a planet. Use the local ethics attraction to divide their ethics; since there are no individual pops, it uses this to set different variables for how many pops in the main count belong to a specific ethic. (Say the planet has a population of 30 and a 33% spiritualist ethics attraction. It then decides it has 10 spiritualist pops, though there is still no individual pop.) Everything would simply be integers that go up slowly, rather than new pops that individually calculate things being generated: a total population per planet, the populations of specific ethics and species, etc. When it comes to empire-wide demography, you just add up these numbers and do some division. (Which I think it already does.)

Basically, make pops be abstract numbers instead of individual entities with their own things to calculate. That's how I think pops should be calculated.
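
For what it's worth, the arithmetic in that example can be sketched in a few lines of Python. Everything here is hypothetical: the function name, the use of whole-number percentages, the second ethic and the largest-remainder rounding rule are assumptions chosen for illustration, not anything the game is known to do.

```python
def split_pops(total_pops, attraction_percent):
    """Split an integer pop count by ethics attraction (percentages summing to 100)."""
    counts, remainders = {}, {}
    for ethic, pct in attraction_percent.items():
        # Integer math only, so every machine gets exactly the same result.
        counts[ethic], remainders[ethic] = divmod(total_pops * pct, 100)

    # Hand out the pops lost to rounding down, largest remainder first
    # (ties broken by name so the result never depends on dict order).
    leftover = total_pops - sum(counts.values())
    by_remainder = sorted(remainders, key=lambda e: (-remainders[e], e))
    for ethic in by_remainder[:leftover]:
        counts[ethic] += 1
    return counts


planet = split_pops(30, {"spiritualist": 33, "materialist": 67})
print(planet)  # {'spiritualist': 10, 'materialist': 20}
```

The planet then stores a handful of integers instead of 30 separate pop objects, and the counts always add back up to the planet total.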
 
I've made a thread about this idea; it didn't get the positive reaction I was hoping for.
 
Once again proving the pop system of vic 2 is just plain superior.
 
The way I understand code, you HAVE to ask for the value of something in order to affect it relative to its current value (i.e. adding or subtracting). If you add one, you don't send it an "increase by one" command; you pull the initial value, increase it by one, then push back the updated value. The reason this can't be parallelized is that if pop A pulls the value in the time between pop B asking for the value and giving back the updated value, then when pop A updates the value it overwrites whatever production pop B did.

That is the point: don't pull the values. You don't need to for, say, energy or mineral production, because no resource's production depends on the stockpile value. So you can calculate what ALL the pops produce and consume, AFTER that sum every resource, and only THEN access the stockpile.

So far you have not said why this wouldn't work, just assumptions, weird assumptions. And no one said anything about a thousand cores, please be serious.

So, simply: no pop accesses the stockpile value. All resources that are produced or consumed are summed and THEN added to or subtracted from the stockpile. Until you give me a very good reason why this should not work, I don't see where the problem is. And I mean a really good reason.
 
I think you might not understand how computers do math. THAT is the reason. If my explanation here isn't making sense, I recommend googling some resources that might explain it better.

In order to sum, you have to pull both values from where they are stored, then push the new value to somewhere else. If you are adding X and Y, you need to know what both X and Y are. In order to combine the production of all your jobs, after you've added X (current resources) and Y (job production) you push that updated value back to X. If two different cores try to add (or subtract) some amount to/from X at the same time, whichever one happens second will effectively "ignore" whatever the first one did.

Now let's take your idea. Instead of pulling from X and pushing to X, we create a brand new variable/location Z, which serves as a buffer so you don't need to access the stockpile. Only now instead every pop needs to add resources to Z, so you go from [X+Y->X] to [Z+Y->Z]. The pops still need to know how many resources are in the buffer in order to increase that amount. Summing the production of those jobs still needs every job's production to be calculated in sequence. Instead, if you were to split the group in half, now you could sum each half in sequence (they are pulling from separate sources, so no errors), then combine the two halves and then add it to the base stockpile. The trouble then becomes how do you split up the jobs and ensure the proper distribution of cores (remembering that cores are also processing sound, graphics, AI, etc.).
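
The split-and-combine approach described here can be sketched in Python as follows. This is only an illustration: the function names and the worker count are made up, and in standard CPython the threads will not actually make a plain sum faster because of the global interpreter lock, so the sketch shows the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def total_production(stockpile, job_outputs, workers=2):
    """Sum each group on its own worker, then combine on a single thread."""
    chunks = chunked(job_outputs, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker only reads its own chunk, so no lock is needed.
        partial_sums = list(pool.map(sum, chunks))
    # pool.map returns results in chunk order, so the combine step is
    # the same on every run; the stockpile is touched exactly once.
    return stockpile + sum(partial_sums)


print(total_production(100, [12, 3]))  # 115
```

The open question raised in the post remains: how many groups to create, and how to balance them against everything else the cores are doing.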
 
The thing about pops is how they behave. Each pop can have different jobs with different resource costs and resource production values. How would you be able to calculate say, 10 Clerks, 5 Researchers, ... etc if pops themselves aren't entities?

 
I love that you're working on making the game faster, but this post touched on an issue that has plagued the vast majority of the (multi-session) multiplayer Stellaris games I've played:
On the server the Prikki-Ti action is resolved first, killing the Blorg. The Blorg action that comes after (possibly on another thread) is discarded as dead Blorgs can’t shoot (it’s a scientific fact). The client however distributed the computation in another way (maybe it has more cores than the server) and in his world the Blorg dispatched the Prikki-Ti first, which in turn couldn’t fight back. Then both players get the dreaded “Player is Out of Sync” popup as their realities have diverged.
That is, the dreaded "desync" issue. Are these "out of sync" issues simply an unavoidable consequence of how the game is designed? Why can't a single player (i.e. the game host) perform these order-dependent calculations so that the "out of sync" issues can be avoided?

Disappointingly, it seems to happen for us most often when we host a save file, immediately after attempting to continue games we have already invested hours into playing.

Is Paradox investigating (or otherwise working on) this problem?
 
Er, I wonder about the elephant in the room.

Has anyone asked why PDX is using DirectX rendering instead of Vulkan?
 
Because the game is bound by the simulation thread, not the graphics thread. Switching to Vulkan is meaningless.

Okay,
Then I've obviously missed something somewhere.
DX12 and Vulkan can both do low-level (low-overhead) resource and task management.
DX12 and Vulkan can do graphics.

What do you mean by simulation thread, and how does that translate to DX11 doing something that OpenGL and Vulkan cannot?