For all those who say "just use more cores, and everything is solved", here are some things to consider. It's not as simple as it sounds.
For each task you want to run in parallel with other tasks, you need to create a separate thread of execution. And that comes with overhead. And that overhead is quite significant.
Say, for example, you have a very simple task: object->myCounter = object->myCounter + 1. And you need this to run for 10,000 objects. Creating 10,000 threads for it most surely isn't the best solution. Yes, creating 10,000 threads (one for each individual object) creates the best opportunity to spread the work across all of your existing cores. However (and I'm using sample numbers here), the basic code that needs to run is perhaps 10 machine instructions long. Creating each thread, and setting it up to run that code on the correct object, would be in the ballpark of 500 machine instructions. Probably even more (I can't peek inside Microsoft's internal code for creating a thread, but this is based on my own multiprocessing kernel I made 30 years ago to run on top of MS-DOS). So, in order to take maximum advantage of all available cores, you just increased the grand total of 100,000 machine instructions to 5,100,000 machine instructions. And that's completely ignoring the additional overhead you place on the process scheduler in the kernel.
Granted, this is an extreme example (minimal work to be done in parallel vs. the overhead parallelism creates), but it demonstrates the basic issue of the additional overhead. If this example were the workload, then the best solution is to run it serially on a single core, as that still takes the least amount of real (wall-clock) time.
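To make that concrete, here's a minimal C++ sketch of the two approaches described above. The names (MyObject, NUM_OBJECTS) are made up for illustration; the point is simply that the parallel version spawns and joins 10,000 threads just to do one tiny increment each.

#include <thread>
#include <vector>

struct MyObject {
    int myCounter = 0;
};

constexpr int NUM_OBJECTS = 10'000;

// Serial version: a handful of instructions per object, no thread overhead.
void incrementSerial(std::vector<MyObject>& objects) {
    for (auto& object : objects) {
        object.myCounter = object.myCounter + 1;
    }
}

// Naive parallel version: one thread per object. The cost of creating and
// joining 10,000 threads dwarfs the single increment each thread performs.
void incrementOneThreadPerObject(std::vector<MyObject>& objects) {
    std::vector<std::thread> threads;
    threads.reserve(objects.size());
    for (auto& object : objects) {
        threads.emplace_back([&object] {
            object.myCounter = object.myCounter + 1;
        });
    }
    for (auto& t : threads) {
        t.join();
    }
}

int main() {
    std::vector<MyObject> objects(NUM_OBJECTS);
    incrementSerial(objects);             // cheap: just the increments
    incrementOneThreadPerObject(objects); // pays thread-creation overhead per object
}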
Staying with this example, what if you used fewer threads than one thread per object? Now you get into the realm of thread pools, and you must do (in part) your own scheduling, because you now have to decide which objects are processed by which existing thread in the thread pool. Again (you guessed it), in comes additional overhead.
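A rough sketch of that idea, again with hypothetical names: split the 10,000 objects into contiguous chunks, one chunk per worker thread, sizing the pool with std::thread::hardware_concurrency(). The chunking logic is exactly the "do your own scheduling" work mentioned above.

#include <algorithm>
#include <thread>
#include <vector>

struct MyObject {
    int myCounter = 0;
};

void incrementChunked(std::vector<MyObject>& objects) {
    const unsigned workerCount =
        std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunkSize =
        (objects.size() + workerCount - 1) / workerCount;

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = w * chunkSize;
        const std::size_t end = std::min(begin + chunkSize, objects.size());
        if (begin >= end) break;
        // Deciding which slice goes to which worker is the extra
        // scheduling overhead described above.
        workers.emplace_back([&objects, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                objects[i].myCounter = objects[i].myCounter + 1;
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }
}

int main() {
    std::vector<MyObject> objects(10'000);
    incrementChunked(objects);
}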
For each of these cases, there will be a break-even point in the number of cores where spreading the work across more cores brings down the overall real time to get the job done. But, for the sake of argument, say this break-even point happens to be 9 cores. Then anyone with an AMD Threadripper is a happy camper (yay, my time goes down, no more lag), but anyone with a Core 2 Duo is a sad camper, because, due to the additional overhead, he just got saddled with more lag than the old solution with everything calculated on a single core.
And this is assuming that the calculations can be done fully in parallel, with none of the objects needing input (or calculated results) from other objects. If that's not the case (and it rarely is), then you need mechanisms where objects signal that their calculations are complete, and other objects halt their work until those results become available. This, again, adds extra overhead, eating up additional CPU cycles which are not used for the actual calculations, but merely for housekeeping/coordination, all of which you can do away with completely when everything is calculated on a single core.
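Here's a tiny illustration (hypothetical, not taken from any particular engine) of that coordination cost: object B cannot start until object A has published its result, so B's thread has to block and wait on a signal.

#include <future>
#include <thread>

int computeA() {
    return 42;  // stand-in for A's expensive calculation
}

int computeB(int resultFromA) {
    return resultFromA * 2;  // B needs A's result before it can run
}

int main() {
    std::promise<int> aDone;
    std::future<int> aResult = aDone.get_future();

    // Thread for object A: does its work, then signals completion.
    std::thread threadA([&aDone] {
        aDone.set_value(computeA());
    });

    // Thread for object B: stalls here until A's result is available.
    // This waiting and signalling is pure housekeeping, not useful work,
    // and it disappears entirely when everything runs serially on one core.
    std::thread threadB([&aResult] {
        int b = computeB(aResult.get());
        (void)b;
    });

    threadA.join();
    threadB.join();
}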
And this isn't even taking hardware constraints into consideration.
A lot of CPUs have a feature called Hyper-Threading. This fakes additional cores: for each real core, a virtual extra core is created. Yes, each real core can have two threads scheduled on it because of this, but they don't truly run at the same time. Each time one thread gets stalled while waiting for access to memory (for reading or writing), the other thread is allowed to run for a bit. This makes both threads appear to be a little faster. But not twice as fast, not even close. At best, you gain some 10%. And if your additional overhead needs 30% extra CPU time, congratulations, you just made everything roughly 20% slower by utilizing the fake cores as if they were extra real cores.
And then there is the tricky business of the thermal envelope. Each CPU is designed for a maximum amount of power it can draw, to prevent it from overheating. If you run work on more cores, the CPU gets hotter, and, to prevent it from going outside its thermal envelope, the CPU scales back the clock speed. Or, in reverse, if you only run calculations on one of the cores, the clock speed can be increased. This is what Intel calls Turbo Boost. Which throws yet another moving part into the question of whether doing things serially on a single core is faster than spreading the work out over multiple cores.