Will Victoria II's checksum be more than 4 characters? Because, currently, a 4-character alpha-only checksum is theoretically very brute-forceable for collisions (~456k possible combinations). Now that I think about it, there may be other vectors of attack (co-opting the code that sends the hash over the wire), but I'm going to stick to my original reason for posting this. Is there something I'm mis-understanding here, that makes 4 character not-very-bruteforceable? because I could totally see adding random comments and the like to one of the checksummed files until I find a collision.
For those who don't understand what I'm saying: a Checksum is a small string of letters and/or numbers that is generated from a piece of data (in this case, the executable and possibly other core files). In PI games, you see a set of four letters, like GFKW. This is the checksum. The checksum is generated using a "hash function" (that is to say, they take the total data of the file and use this number to create a new, far smaller number). This hash is generally considered to be "secure", assuming it's a large enough number; the chance for "collisions" is minimal. The bone I have to pick is that I feel it's not a large enough number.
The following is an attempt to explain to a layman why this is "bad".
How does a hash function work?
A hash function is a way of deriving a small number from a big number. Let's consider a very simple hash function. In this case, H is our resulting hash, N is our starting number, and % stands for modulo (just in case you thought I meant division)
H = N % 32
It's somewhat important to note that most hash functions are far more complex than this, but the end result is the same: They split a number-space in to different "groups".
What is a hash used for?
Okay, let's say we're Paradox Interactive. We release a new piece of software. The entire contents of this software is the number 5872.
That's all well and good, but, being as open to modding as we are, some of our users are modding their games to be 6231! We can't have them playing with our vanilla 5872 players; they'd have an unfair advantage, and there are potential desync issues! How do we solve it? we implement a hash function.
Using the hash function above:
5872 resolves to 16
6231 resolves to 23
So, 5872 and 6231, when trying to start a game, send each other their "hash" (in this case 16 for vanilla and 23 for the modded game). They can't go around sending the entire number; that'd be no good, everyone plays on 1 bit/second modems. You'd have to wait a long time. So, the hashes are sent, and compared, and they realize that they don't share their hash. So they don't play with each other.
What is a collision?
This above situation works well and good, until it doesn't. The problem, as you can probably already see, is that there are other numbers that come out to 16 when hashed through our function. For instance, 5616 also comes out to 16. That's why most hashes are generally longer the 5 bits (the number of 1s and 0s it takes to express a number between 1 and 32). For example, cutting-edge RSA encryption now uses 1024 bits, or more. It's important to note that the difficulty of finding a collision increases exponentially with the number of bits. For instance, 2^10 = 1,024 , but 2^20= 1,048,576.
Why are collisions bad news?
Let's say I'm a big meanie, who wants to cheat at 5872. I go and mod my game to be 6224. I can now play with the 5872 guys, but with a huge advantage! This is totally unacceptable!
How does this example translate to the actual case, here in Paradox games?
4 letters, all upper case, produce a maximum of 26^4 combinations. That's 456,976 potential combinations. In pure mathematical difficulty, this means that, on average, I will find a version of Victoria 2 that generates an identical hash but is a different game once every 228,488 attempts. With a computer, assuming I was able to find what is hashed and what the hash algorithm is (otherwise I'd have to boot up the game every time to check the hash), I'd probably be able to find a collision, that also includes some malicious code (think if(ship.isEnemy()){ship.sink();} bad. Really bad. game-breaker bad), in as little as a few days of running my automated search program. Maybe a few weeks if the program needed to boot up Vicky.
The point is, from my perspective, the current hash of 4 characters is VERY brute-forceable.
What's the solution?
Well, the easiest solution (at least of the solutions that would let me sleep well at night) would be to make the hash something like 8-16 characters, or maybe even longer. This has disadvantages; it'd be harder to go "what version do you have?" "oh, GWBH!", but there's a silver lining. On the client end, you could always just provide the first or last 4 characters of it. This would retain even this value of the current hash system.
It is, of course, always possible that PI are ultra-on-the-ball and already do what I described in the previous paragraph, in which case I congratulate them for out-thinking me. However, I get the feeling that the hash is wysiwig at the moment, and 26^4 combinations just doesn't cut it. For reference, that's a small amount under 19 bits; it's not much data, from a cryptography standpoint.
Before someone goes "blah blah not worth the time to implement" from a position of ignorance, let me remind you that such things are, if they were designed well, relatively easily scalable. Perhaps not with this "version" (I'm willing to bet that this version is feature-complete and going through the final phases of testing, if not already RTM'd), but almost certainly possible for the next version. that said, if a dev wants to tell me "yeah, no, not easily scalable", I'm open to it.
I realize this may seem like a far-off and far-fetched problem, but an ounce of prevention is worth a pound of cure. Imagine the headache if someone brute-forced the vanilla checksum, and started smurfing around online. it could be the end of remotely public games.
For those who don't understand what I'm saying: a Checksum is a small string of letters and/or numbers that is generated from a piece of data (in this case, the executable and possibly other core files). In PI games, you see a set of four letters, like GFKW. This is the checksum. The checksum is generated using a "hash function" (that is to say, they take the total data of the file and use this number to create a new, far smaller number). This hash is generally considered to be "secure", assuming it's a large enough number; the chance for "collisions" is minimal. The bone I have to pick is that I feel it's not a large enough number.
The following is an attempt to explain to a layman why this is "bad".
How does a hash function work?
A hash function is a way of deriving a small number from a big number. Let's consider a very simple hash function. In this case, H is our resulting hash, N is our starting number, and % stands for modulo (just in case you thought I meant division)
H = N % 32
It's somewhat important to note that most hash functions are far more complex than this, but the end result is the same: They split a number-space in to different "groups".
What is a hash used for?
Okay, let's say we're Paradox Interactive. We release a new piece of software. The entire contents of this software is the number 5872.
That's all well and good, but, being as open to modding as we are, some of our users are modding their games to be 6231! We can't have them playing with our vanilla 5872 players; they'd have an unfair advantage, and there are potential desync issues! How do we solve it? we implement a hash function.
Using the hash function above:
5872 resolves to 16
6231 resolves to 23
So, 5872 and 6231, when trying to start a game, send each other their "hash" (in this case 16 for vanilla and 23 for the modded game). They can't go around sending the entire number; that'd be no good, everyone plays on 1 bit/second modems. You'd have to wait a long time. So, the hashes are sent, and compared, and they realize that they don't share their hash. So they don't play with each other.
What is a collision?
This above situation works well and good, until it doesn't. The problem, as you can probably already see, is that there are other numbers that come out to 16 when hashed through our function. For instance, 5616 also comes out to 16. That's why most hashes are generally longer the 5 bits (the number of 1s and 0s it takes to express a number between 1 and 32). For example, cutting-edge RSA encryption now uses 1024 bits, or more. It's important to note that the difficulty of finding a collision increases exponentially with the number of bits. For instance, 2^10 = 1,024 , but 2^20= 1,048,576.
Why are collisions bad news?
Let's say I'm a big meanie, who wants to cheat at 5872. I go and mod my game to be 6224. I can now play with the 5872 guys, but with a huge advantage! This is totally unacceptable!
How does this example translate to the actual case, here in Paradox games?
4 letters, all upper case, produce a maximum of 26^4 combinations. That's 456,976 potential combinations. In pure mathematical difficulty, this means that, on average, I will find a version of Victoria 2 that generates an identical hash but is a different game once every 228,488 attempts. With a computer, assuming I was able to find what is hashed and what the hash algorithm is (otherwise I'd have to boot up the game every time to check the hash), I'd probably be able to find a collision, that also includes some malicious code (think if(ship.isEnemy()){ship.sink();} bad. Really bad. game-breaker bad), in as little as a few days of running my automated search program. Maybe a few weeks if the program needed to boot up Vicky.
The point is, from my perspective, the current hash of 4 characters is VERY brute-forceable.
What's the solution?
Well, the easiest solution (at least of the solutions that would let me sleep well at night) would be to make the hash something like 8-16 characters, or maybe even longer. This has disadvantages; it'd be harder to go "what version do you have?" "oh, GWBH!", but there's a silver lining. On the client end, you could always just provide the first or last 4 characters of it. This would retain even this value of the current hash system.
It is, of course, always possible that PI are ultra-on-the-ball and already do what I described in the previous paragraph, in which case I congratulate them for out-thinking me. However, I get the feeling that the hash is wysiwig at the moment, and 26^4 combinations just doesn't cut it. For reference, that's a small amount under 19 bits; it's not much data, from a cryptography standpoint.
Before someone goes "blah blah not worth the time to implement" from a position of ignorance, let me remind you that such things are, if they were designed well, relatively easily scalable. Perhaps not with this "version" (I'm willing to bet that this version is feature-complete and going through the final phases of testing, if not already RTM'd), but almost certainly possible for the next version. that said, if a dev wants to tell me "yeah, no, not easily scalable", I'm open to it.
I realize this may seem like a far-off and far-fetched problem, but an ounce of prevention is worth a pound of cure. Imagine the headache if someone brute-forced the vanilla checksum, and started smurfing around online. it could be the end of remotely public games.
Last edited: