HOI4 Dev Diary - Tech bugaloo II - Dragonslaying


podcat

Game Director <unannounced>
Paradox Staff
Hi guys! Today's diary is going to be a bit of a short one as I am away at a conference (it has free breakfast! The two most magical words!)

Last week we celebrated HOI4’s 3 year anniversary and released 1.7 ‘Hydra’ along with Radio Pack and Axis Armor. I hope you enjoyed them :)

After the weekend we looked at our telemetry data and noticed that multiplayer out-of-syncs had become more common after the 1.7 release. This indicates that we introduced a new OOS problem. While HOI4 has resync and hotjoin, it's still pretty annoying when you go out of sync, so we are currently investigating this for a small hotfix patch (1.7.1). Out-of-syncs can be really tricky to find and nail down, so there's no definite ETA yet on when the patch will be ready, as we are still hunting. But we acquired a solid lead on the problem just yesterday, and we're currently working out a good solution.

Technical section (warning!): What is an Out of Sync?
For those interested in what an out of sync actually is, I figured I'd dig into it a bit. Feel free to skip this if you are a... normal human being, I guess ;D

An Out of Sync (OOS) happens when the host and clients in a multiplayer game start acting differently. This could, for example, be a battle ending in favor of Germany on one of the computers and in favor of the Soviets on the other. Usually, though, it's nothing that big: the state difference is typically spotted earlier, say when one of those units has 1% higher organization on one machine than on the other. Once it happens, everyone's experience will very quickly start diverging, so we stop and alert the players. At this point the host can click a 'Resync' button to bring the game back into sync. Resyncing resets the state of the game, sends the current host state over as a savegame, has everyone load that, and then things can resume.
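The detection side of this can be sketched roughly as follows. This is a toy Python sketch, not the engine's actual C++: the state layout, the hash choice, and the function names are all illustrative. The key idea is that each machine regularly computes a checksum over a deterministic serialization of its game state, and a mismatch between machines is what triggers the OOS alert.

```python
import hashlib

def state_checksum(state: dict) -> str:
    """Hash a deterministic serialization of the simulation state."""
    # Sort keys so every machine serializes the state in the same order.
    blob = repr(sorted(state.items())).encode()
    return hashlib.sha256(blob).hexdigest()

def check_sync(host_state: dict, client_state: dict) -> bool:
    """Compare per-tick checksums; a mismatch means an out of sync."""
    return state_checksum(host_state) == state_checksum(client_state)

# Even a 1% organization difference on a single unit is enough to diverge.
host = {"unit_42_org": 0.57, "unit_42_strength": 0.90}
client = {"unit_42_org": 0.58, "unit_42_strength": 0.90}
```

With `check_sync(host, client)` returning `False`, the machines know they have diverged and can fall back to the resync-via-savegame flow described above.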

So what can cause an OOS? This is where it gets tricky, and it's pretty much always a new reason when a problem appears. Good candidates are multiplayer between different platforms, because underlying code libraries can behave differently in some cases. Another common cause is multithreading: we thread a lot of our code, yet to stay in sync we must ensure that events which affect the world still happen in the same order on all machines in the end. There can also be issues like touching illegal memory that can alter the game state in unplanned ways (or crash… but those are easy to spot and fix!).
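The ordering requirement can be illustrated with a small sketch (hypothetical Python, not engine code; the country tags and function names are made up). Threaded work may finish in any order from run to run and from machine to machine, so results that affect the world are applied in a canonical order rather than in completion order:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def plan_for_country(tag: str) -> tuple[str, str]:
    # Each country does its planning on its own thread; which thread
    # finishes first varies between runs and between machines.
    return (tag, f"orders for {tag}")

def gather_plans(tags: list[str]) -> list[tuple[str, str]]:
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(plan_for_country, t) for t in tags]
        # Collected in completion order: nondeterministic.
        results = [f.result() for f in as_completed(futures)]
    # Applied in a canonical order (here: sorted by country tag),
    # so every machine mutates the world identically.
    return sorted(results)

plans = gather_plans(["SOV", "GER", "ENG"])
```

Sorting by a stable key before applying the results is one simple way to make the outcome deterministic regardless of thread scheduling.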

Finding and fixing an OOS can be a long process, simply because they are often quite rare occurrences and it usually takes many steps and iterations to home in on exactly what the cause is. To find them we run multiplayer tests with QA using special settings that spit out giant log files (which generally makes everything horribly slow), and once an OOS happens we compare log files and savegames to see what differs. This will usually give us an area to start looking at. Let's say, for example, that a unit's org differs. This could be due to many reasons: battle damage, weather, bad supply, etc. So we add more logging to the relevant code areas and do another test. Hopefully this tells us which of our guesses was right, and we repeat again with more logs and more detail for that area. Of course, the most fun-to-find OOS errors disappear when you add logging and the framerate slows ;P

Some of this can be done automatically over night as well if the problem is unrelated to players, but this is often not the case.

Once the problem is found and fixed, we usually make an open beta patch to verify that it is indeed fixed, or, if we are sure we found the problem, we go straight to a regular patch. Speaking of beta patches, thanks for the help testing 1.7! We got something like 30k game sessions of testing on the weekend before the final build, which was a great help!

Hopefully this little look into some of the technicalities behind working on HOI4 was interesting. If you have questions, feel free to ask away!

The part of the team that isn’t trying to solve this OOS has now moved fully on to 1.8 ‘Husky’ and work on the next expansion, but it’s early days and it’s going to be a while until we have things to show off. So this will be the last dev diary for a while as we go into radio silence (and soon glorious socialist Swedish summer vacation!) until we have things that are ready to show off.

See you on the other side! And keep an eye on the forum for an announcement about the 1.7.1 hotfix when it is ready.
 
I'm the programmer who's been tackling this particular OOS, so I can add some further technical details for those nerdy enough in just the right way to actually enjoy it. :)

During this round of OOS hunting, we've improved our sync debugging tools in multiple ways:
  1. We've reworked the verbose logging system that we turn on when trying to catch an OOS in the act and figure out what the game was doing at the exact moment that the machines begin to diverge, but before any machine is aware of the divergence. Since we only care about the most recent data written to the log, we now just discard all the older data on a regular basis, and thereby avoid multiple gigabytes of logs. This allows us to be far more verbose with our logging, without fear of filling up our team's hard drives.
  2. We've streamlined the process for collecting logs and save files, so that when the team or QA get an OOS during a multiplayer session, each machine automatically zips up all the relevant files, so that it's easy for everyone to add their data to a bug report or send it directly to whichever programmer is currently leading the charge on the OOS attack.
  3. We've increased the miscellaneous scattering of data from a variety of systems that we track every game hour, so that there is a higher likelihood of noticing an OOS as quickly as possible. Many out of syncs will first touch some tiny little piece of data that isn't frequently checked, and this slowly avalanches into more and more data getting out of sync. By the time the game is finally aware that something is off between machines, so much data differs that it is almost impossible to identify which piece of data went out of sync first. So the faster we spot an OOS, the better chance we have of understanding and fixing it.
  4. We're now able to include checks in our internal developer build of the game to validate that we avoid reading or writing data in one thread that is potentially being modified in another thread at the same time. This will run even in single player games, and will alert the programmer exactly at the moment that a piece of code did something naughty that could cause an OOS. It still requires that we scatter the checks around in relevant places, but as we get more coverage, we'll get more protection against sneaky OOS risks.
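Point 4 can be sketched like this. This is an illustrative Python toy assuming a simple owner-thread model; the real dev-build checks in the C++ engine will look different, and `ThreadGuard` and its method names are invented for this sketch. The idea is that data owned by one thread during the threaded phase must not be touched from any other thread, and a violation alerts the programmer immediately:

```python
import threading

class ThreadGuard:
    """Dev-build check: while threaded planning runs, only the owning
    thread may read or write the guarded data."""

    def __init__(self):
        self._owner = None  # None = single-threaded phase, anyone may access

    def acquire_ownership(self):
        self._owner = threading.get_ident()

    def release_ownership(self):
        self._owner = None

    def assert_safe_access(self):
        # Outside the threaded phase any access is fine; during it,
        # a foreign thread touching the data is a potential OOS source.
        if self._owner is not None and threading.get_ident() != self._owner:
            raise RuntimeError("cross-thread access: potential OOS source")

guard = ThreadGuard()
guard.assert_safe_access()      # single-threaded phase: allowed
guard.acquire_ownership()
guard.assert_safe_access()      # owning thread: allowed

def naughty() -> bool:
    try:
        guard.assert_safe_access()  # foreign thread: flagged on the spot
        return False
    except RuntimeError:
        return True

caught = []
t = threading.Thread(target=lambda: caught.append(naughty()))
t.start()
t.join()
guard.release_ownership()
```

As the post notes, the checks only protect code paths where they have been scattered, so coverage grows over time.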
Our lead on the current OOS problem is related to #4. Each country has its own thread to figure out what it wants to do with its units, and then we have a single thread actually execute those plans, one country at a time. Normally this is fine, but volunteers are a little special, being controlled by one country but operating within the context of another country.

There was a particular piece of code for moving a division over water, where it would check that there are enough convoys available to do the transport. (For volunteers, this is checked within the sender country's thread, but it uses the recipient country's convoys). If there are enough convoys, it would add the movement action to a pending queue to execute later, when everything is executed in a single thread and there is no risk of order-dependent differences. Otherwise, it simply would not add the action to the queue at all.

Unfortunately, in very rare cases the recipient country barely has enough convoys, and due to different thread timings the machines can disagree. One machine processes the volunteer's controller country earlier, and queues the movement action before the recipient country's thread consumes the remaining convoys for some other purpose. Meanwhile, a different machine gets around to processing the recipient country earlier, consuming the convoys before the sender country even considers moving the volunteer over water. So the first machine tries to start the naval transport in the belief that there are enough convoys, and the second machine lets the unit remain where it is because there wouldn't be enough convoys for transport anyway. Bam! Out of sync.
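The race can be replayed as a toy model (illustrative Python, not engine code; the step names are made up). The same logic, given the two different thread interleavings described above, queues the transport on one machine but not on the other:

```python
def simulate(order: list[str]) -> list[str]:
    """Replay the race under an explicit interleaving.

    'volunteer_checks' is the sender country's thread reading the
    recipient's convoy count; 'recipient_consumes' is the recipient's
    own thread spending its last free convoy on something else."""
    convoys = {"free": 1}   # recipient barely has enough convoys
    queued = []
    for step in order:
        if step == "volunteer_checks":
            if convoys["free"] >= 1:        # read of shared data
                queued.append("naval_transport")
        elif step == "recipient_consumes":
            convoys["free"] -= 1            # concurrent write
    return queued

# Machine A happens to process the sender first, machine B the recipient.
machine_a = simulate(["volunteer_checks", "recipient_consumes"])
machine_b = simulate(["recipient_consumes", "volunteer_checks"])
```

Machine A ends up with the transport queued and machine B with nothing queued, which is exactly the divergence the post describes.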

To fix this, we are going to try to add a few layers of protection. First, when executing a queued action, we will validate the convoy count a second time and just skip the action if there are now insufficient convoys, a fact that all machines will agree on at this phase of execution. And to keep the queued action in sync to begin with, during the more volatile period where the available convoy count was unreliable, volunteers will now check the last known stable convoy count, instead of the actual current count that might be getting modified by a different thread.
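Both layers of the fix can be sketched together. Again this is an illustrative Python model, not the actual implementation; `stable_count` stands in for the last known stable convoy count and `count_at_execution` for the count all machines agree on once execution is single-threaded:

```python
def plan_and_execute(stable_count: int, count_at_execution: int) -> list[str]:
    """Two-layer fix (sketch): plan against the last stable convoy
    count, then revalidate before executing."""
    queued = []
    # Layer 2: planning reads the last *stable* count, never a live
    # value that another thread may be modifying mid-read. Every
    # machine sees the same stable count, so the queue stays in sync.
    if stable_count >= 1:
        queued.append("naval_transport")
    # Layer 1: at single-threaded execution time all machines agree on
    # the real count, so skip actions whose convoys ran out meanwhile.
    return [action for action in queued if count_at_execution >= 1]

# The recipient spent its last convoy during planning: every machine
# queues the move (same stable count) and every machine then skips it.
no_transport = plan_and_execute(stable_count=1, count_at_execution=0)
transport = plan_and_execute(stable_count=1, count_at_execution=1)
```

The point of the two layers is that each step only ever reads values that are identical across machines, so even when an action is dropped, it is dropped everywhere.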

And as point #4 noted, we've added a check on convoys so that any code that wants to read a country's convoy status will verify that it is executing within a context where it is safe to do so. If not, it will immediately break and alert the programmer to the inappropriate code.

So not only will we hopefully resolve the increase of out of syncs that came with 1.7, but with all of the other tool improvements we should also be in a better position to catch any other newly introduced out of sync problems during normal development.
 
Is there a way we can help with the OoS problem?

In my multiplayer group of around 10 players, we have a quite regular OoS problem. It only happens after WW2 has started in 1939/1940, but once it happens, it won't go away. It doesn't matter if we resync or rehost; only a few minutes after the game restarts/resumes there is a new OoS with the same error code ("ressource excivation, ressource transportation" was the description of the error), which happens to all the clients.
For an OoS that reproduces so reliably like that, a bug report with a save game (preferably from the host) is usually enough to reproduce it internally and quickly identify the source of the problem. Including any info about mods in use, expansions that are not in use, and the OS of host and each client can help, in case it is a problem that reproduces for a specific set of players under specific conditions, but doesn't easily reproduce for us internally.
 
I know...
I was critiquing the ambiguity of the given statement, hoping to get "we'll fix the essential stuff before that" or "yes, mid August". Even a simple "yes" or "no" would work.
It was ambiguous on purpose. They will resume after we all return from summer break. When podcat sets a date for the next dev diary, he will definitely let you know!