Anyway, he decided to fire up a profiler and get cracking on optimising the hell out of the SHA1 class in mono, and i decided that i'd skip some study and do the same. We had two main aims in this:
- To make the fastest possible implementation using no unsafe code
- To make the fastest possible managed SHA1 implementation
In order to do this, we couldn't use unsafe code, everything had to be done without pointers. We also had a constraint that we couldn't make the size of the class significantly longer, in fact, *reducing* the lines of code was also an aim because the more lines of code, the slower it is for the JIT to process.
We did a number of iterations on the code, testing different things out. We fully unrolled the three main loops, we partially unrolled some of em and benchmarked everything in between. The things we learned in the end were:
1) In this particular algorithm, there was a great benefit to using local variables rather than fields. That gave a biggish boost. EDIT: This is because mono doesn't perform array out of bounds check removal (abcrem) on fields. It only performs it on local vars. As array access is *very* frequent, removing these checks is a huge benefit.
2)
buff[i] = ((buff[i-3] ^ buff[i-8] ^ buff[i-14] ^ buff[i-16]) << 1)
| ((buff[i-3] ^ buff[i-8] ^ buff[i-14] ^ buff[i-16]) << 31));
The above code actually performed significantly slower than:
uint temp = buff[i-3] ^ buff[i-8] ^ buff[i-14] ^ buff[i-16];
buff[i] = (temp << 1) | (temp >> 31);
)3)Re-rolling some of the loops did not affect performance significantly. The performance delta between when all three of the loops were fully unrolled and when two of them were only partially unrolled was less than 6% on my system, but the IL was reduced by a fairly massive proportion.
4) Massive methods perform slower. We found that by splitting the 'ProcessBlock' method up, performance increase noticeably. Bear in mind that when we originally tested this, we had all three big loops fully unrolled and they were all in the same method. Still, it's worth bearing in mind.
So, what was the final performance delta with all these changes?
Core 2 Duo T7400 @ 2.16GHz
2.54x faster
Athlon 64 3200+
3.10x faster
Pentium 4 @ 2.80GHz
1.36x faster
Xeon X5355 @ 2.66GHz
1.38x faster
64bit PPC:
1.60x faster
EDIT: It seems I've mixed up my numbers when i was recording them yesterday. Some of the numbers compare against SVN head and some compare against the version in Mono 1.9. I think the first 2 results are against the version in 1.9 and the last three are against SVN head.
Not too shabby.
Aim 2:
More to come on this later, we're still in the optimisation process, but i think it's fair to say that the unsafe version is quite a fair bit faster thant the safe version so far.