OK, so at the same time I understand you a little bit more, but I'm also losing you. Memory-bound issues are above my pay grade. I have no idea what could cause them in a game like this, especially when you see maxed-out CPU usage. Do you have a theory or an example of what is happening in the game?
Plus, do you believe it is fixable or not?
EDIT:
So, I found a video that helped me understand the memory-bound problems a little bit:
And now I at least have an idea of why the CS2 performance issue could be bound by memory I/O, and it leads me back to the agent system of this game, where each cim is at least one object the system (hardware) has to take care of individually. In other words, the more cims you have, the more memory I/O will happen, and you are bound by how fast you can read from and write to memory.
What I do not know is whether there is a solution to a memory-bound problem. It seems to me that CO would have to change the whole ideology of how the game works.
This is going to be long, but remember you asked for it...
An archetypical (single-cycle) CPU runs one instruction per clock cycle (CPI = 1). Within the CPU, an instruction is actually executed through multiple steps on different parts of the clock signal (2-4 phases). A more reliable way of doing this is separating an instruction cycle into stages and going through a single stage at every clock cycle (multi-cycle), which opens up more architectural options, like an arbitrary number of stages. So an n-stage processor completes a typical instruction in n cycles (CPI = n); but you can also clock it around n times faster, so it comes down to the same speed as a single-cycle design.
In multi-cycle processors, like the single-cycle ones, most of the CPU sits idle because only one part (or stage) of the processor is active at a time. In order to exploit this, we have pipelined processors. A pipelined processor is basically a multi-stage processor where, under optimal conditions, each stage of the CPU works on a different instruction at every clock cycle. So an n-stage pipelined processor can work on n instructions at the same time, each at a different stage, meaning it completes n/n = 1 instruction per clock cycle (CPI = 1). Some modern CPUs are pipelined with a high number (20+) of stages.
Finally, we have the modern multi-core CPUs. They basically multiply the amount of work that can be done by the number of cores, so this one is simple. A pipelined CPU with n cores, again under optimal conditions, can carry out n instructions at every clock cycle (CPI = 1/n). Note that these optimizations don't always work. For example, some instructions are atomic, so they reserve the entire pipeline while they are being processed, or the software may simply not be designed for parallelism. Note also that not all instructions run in a single instruction cycle; for example, floating-point operations often take multiple iterations through the processor to complete.
Now let's talk about the elephant in the room...
By far the worst thing that can happen to a CPU instruction-wise is a memory operation, which can take hundreds of clock cycles to complete. I/O can also be costly, but that is mostly managed asynchronously (through interrupts), so the CPU is not held up while an I/O operation is carried out. That is not the case with memory operations, however, as there will be further instructions that depend on their result. Therefore, waiting for a memory operation puts the CPU in a stall (not idle), where it cannot do anything but wait. This is the reason why CPUs have various levels of caches, as they allow keeping copies of data close to the processor, where they can be accessed faster than main memory (though still rather slowly).
There are some ways to mitigate CPU stalls; for instance, well-optimized software schedules instructions in such a way that memory operations begin long before their results are needed (prefetching). Thankfully, modern compilers help developers in that regard. Modern CPUs also have their own prefetching mechanisms, and they might do some housekeeping while waiting. But despite all these measures, on average a lot of CPU time is spent stalled. Anyone who wants to create high-performance software has to make sure proper optimization is done here: reducing memory operations, prefetching, utilizing the cache with reference locality, making use of arrays where applicable, encoding data into quad-word chunks, etc.
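Reference locality is easy to demonstrate. C stores 2D arrays row-major, so the order you traverse a matrix in decides whether you stream through memory or jump around it. The two functions below compute the same sum, but one is cache-friendly and the other takes a stride of N floats per access (a sketch; the actual slowdown depends on matrix and cache sizes):

```c
#include <stddef.h>

#define N 512

static float m[N][N];

/* Fill the matrix so the sums below have something to add up. */
void fill(float v) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            m[i][j] = v;
}

/* Row-by-row: m[i][j] and m[i][j+1] are adjacent in memory, so this
   streams sequentially and the hardware prefetcher keeps up. */
float sum_rows(void) {
    float s = 0.0f;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-by-column: each access jumps N * sizeof(float) bytes, so
   nearly every load can miss the cache once the matrix outgrows it. */
float sum_cols(void) {
    float s = 0.0f;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

Same result, very different memory behavior; that difference is exactly what "utilizing the cache" means in practice.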
Memory operations in poorly optimized software can greatly increase the CPI, as a lot of clock cycles will be wasted in a stall just to complete one memory instruction. Normally, CPU-bound software runs with a CPI of around 1 on a pipelined processor, less on a multi-core. This number increases slightly if floating-point operations are used heavily. But if you see CPI >> 1, then you know those nasty memory operations are putting your CPU in a stall. In my tests, C:S2 showed CPI > 20! This might be a Unity problem, but someone failed at something miserably. Ultimately, it means the game is heavily memory-bound.
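If you want to check a game yourself, hardware performance counters give you this number directly. An illustrative Linux example (C:S2 runs on Windows, where tools like VTune or Windows Performance Analyzer report the same ratio):

```shell
# Count cycles and retired instructions for running process <pid>
# for 10 seconds, then print a summary.
perf stat -e cycles,instructions -p <pid> -- sleep 10
# perf reports "insn per cycle" (IPC); CPI is its reciprocal,
# so an IPC of 0.05 corresponds to a CPI of 20.
```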
Can it be fixed? Most likely not...
I don't know how Unity is implemented or how other Unity games perform, so the root cause might be there. The language they use (C#) is not ideal for optimizing for memory access, either; this kind of thing is best done in C. It ultimately comes down to poor software design due to not understanding the limitations of technologies being used. I don't think this game can be fixed, and certainly not by CO with their current structure.
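To illustrate what that kind of redesign would mean: the classic fix for a memory-bound simulation is moving from one object per agent (pointer-chasing) to packed per-field arrays, i.e. data-oriented design, which is what Unity's DOTS/ECS is meant to enable. A sketch in C, with made-up cim fields (CS2's actual data obviously looks nothing like this simple):

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: a made-up "cim" with position and velocity. */
typedef struct { float x, y, dx, dy; } Cim;

/* Object-per-agent layout: each cim is its own object reached through a
   pointer, so a full update chases one pointer per agent and can take a
   cache miss on every single access. */
static void step_objects(Cim **cims, size_t n) {
    for (size_t i = 0; i < n; i++) {
        cims[i]->x += cims[i]->dx;
        cims[i]->y += cims[i]->dy;
    }
}

/* Data-oriented layout: one packed array per field; the same update
   streams sequentially through memory, which the prefetcher handles well. */
static void step_arrays(float *x, float *y,
                        const float *dx, const float *dy, size_t n) {
    for (size_t i = 0; i < n; i++) {
        x[i] += dx[i];
        y[i] += dy[i];
    }
}

/* Sanity check that both layouts compute the same result. */
int layouts_agree(void) {
    Cim a = {1, 2, 0.5f, -0.5f}, b = {3, 4, 1, 1};
    Cim *objs[2] = {&a, &b};
    float x[2] = {1, 3}, y[2] = {2, 4};
    float dx[2] = {0.5f, 1}, dy[2] = {-0.5f, 1};
    step_objects(objs, 2);
    step_arrays(x, y, dx, dy, 2);
    return a.x == x[0] && a.y == y[0] && b.x == x[1] && b.y == y[1];
}
```

The logic is identical; only the memory layout changes. That is also why I say it is a design-level problem: retrofitting this onto an existing object-per-agent codebase touches nearly everything.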