Wednesday, April 30, 2008

I haven't done a mono-specific post in a while, so i'm glad i can change that with some good news! A few days ago, Scott Peterson, (who took part in the Summer of Code last year to port banshee to windows) was complaining that monsoon took an hour to hash a 60GB torrent. I was a bit surprised at this, because it shouldn't be that slow! I did know that SHA1 hashing in mono was a tad slow, but no way should it be *that* slow.

Anyway, he decided to fire up a profiler and get cracking on optimising the hell out of the SHA1 class in mono, and i decided that i'd skip some study and do the same. We had two main aims in this:
  1. To make the fastest possible implementation using no unsafe code
  2. To make the fastest possible managed SHA1 implementation
Aim 1:
In order to do this, we couldn't use unsafe code, everything had to be done without pointers. We also had a constraint that we couldn't make the size of the class significantly longer, in fact, *reducing* the lines of code was also an aim because the more lines of code, the slower it is for the JIT to process.

We did a number of iterations on the code, testing different things out. We fully unrolled the three main loops, we partially unrolled some of em and benchmarked everything in between. The things we learned in the end were:

1) In this particular algorithm, there was a great benefit to using local variables rather than fields. That gave a biggish boost. EDIT: This is because mono doesn't perform array out of bounds check removal (abcrem) on fields. It only performs it on local vars. As array access is *very* frequent, removing these checks is a huge benefit.

buff[i] = ((buff[i-3] ^ buff[i-8] ^ buff[i-14] ^ buff[i-16]) << 1)
| ((buff[i-3] ^ buff[i-8] ^ buff[i-14] ^ buff[i-16]) << 31));

The above code actually performed significantly slower than:
uint temp = buff[i-3] ^ buff[i-8] ^ buff[i-14] ^ buff[i-16];
buff[i] = (temp << 1) | (temp >> 31);

3)Re-rolling some of the loops did not affect performance significantly. The performance delta between when all three of the loops were fully unrolled and when two of them were only partially unrolled was less than 6% on my system, but the IL was reduced by a fairly massive proportion.

4) Massive methods perform slower. We found that by splitting the 'ProcessBlock' method up, performance increase noticeably. Bear in mind that when we originally tested this, we had all three big loops fully unrolled and they were all in the same method. Still, it's worth bearing in mind.

So, what was the final performance delta with all these changes?

Core 2 Duo T7400 @ 2.16GHz
2.54x faster

Athlon 64 3200+
3.10x faster

Pentium 4 @ 2.80GHz
1.36x faster

Xeon X5355 @ 2.66GHz
1.38x faster

64bit PPC:
1.60x faster

EDIT: It seems I've mixed up my numbers when i was recording them yesterday. Some of the numbers compare against SVN head and some compare against the version in Mono 1.9. I think the first 2 results are against the version in 1.9 and the last three are against SVN head.

Not too shabby.

Aim 2:
More to come on this later, we're still in the optimisation process, but i think it's fair to say that the unsafe version is quite a fair bit faster thant the safe version so far.


Sebastien Pouliot said...

(1) but you don't tell why ;-)

(2) the two code segments are not identical, I suspect some problems with < > and the HTML output

(4) the timings are inconsistent, between them and with my results. The biggest ones are probably being compared to 1.9 while the lowest one compared to what was in HEAD at the time. The actual target (1.9 or HEAD) is not clear either.

Anonymous said...

Have you compare MS.NET vs Mono performance on this.
It can be interesting!

Hit Counter