Wednesday, July 11, 2007

Since my last blog post, several things have been happening in MonoTorrent. Firstly, I finally managed to track down what was probably the most elusive bug to date.

Here's an abridged version of the events.

I had been getting reports on and off of strange crashes in the BigInteger class. These crashes appeared to be memory corruption problems of some sort. Needless to say, I immediately suspected that the reporter had faulty RAM, a highly overclocked system or even a corrupt .NET framework install. However, after getting him to run Memtest and a few other tests, I had to rule those out as causes. Since neither of us could reproduce it, there wasn't much I could do.

Time passed, several weeks I think, and I was getting him to log every access to the class for me. Two other people had reported the same bug, so I knew something was definitely up. The only thing was, I couldn't reproduce it, no matter what I tried! I was hammering the code with dozens of new connections a second and getting no crash.

I had been in touch with Sebastien Pouliot, who was doing his best to help me track down the bug. Eventually I got fairly pissed off about the whole thing and decided it was time to solve this once and for all. I had already wasted hours trying to reproduce it at this stage, so if I didn't get it fixed this time, I was just going to completely disable the encryption code and thus "fix" the issue.

I logged into the Windows machine I had in work and coded up a quick test case which hammered the BigInteger class with random calculations. This didn't break after a fairly lengthy time running. Then I added 10 threads performing the calculation, as this was a 4-processor machine, so this way I could check more numbers at a time and so (hopefully) get a crash sooner.
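For anyone wanting to try something similar, here is a rough sketch of that kind of test case. It is not the original code: it is written in Python rather than against the BigInteger class, every name in it (`reference_modpow`, `hammer`, `run_stress_test`) is made up for illustration, and the 256-bit operand size is an arbitrary choice. It only shows the shape of the approach, which is several threads each doing random modular exponentiations and cross-checking every result against an independent implementation. Note that CPython threads do not run bytecode truly in parallel, so this would not provoke the same kind of multi-processor failure.

```python
import random
import threading


def reference_modpow(base, exponent, modulus):
    """Plain square-and-multiply, used as an independent cross-check."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            result = (result * base) % modulus
        base = (base * base) % modulus
        exponent >>= 1
    return result


def hammer(seed, iterations, failures):
    """Run random modular exponentiations and record any mismatch."""
    rng = random.Random(seed)  # one generator per thread, seeded for repeatability
    for _ in range(iterations):
        base = rng.getrandbits(256)
        exponent = rng.getrandbits(256)
        modulus = rng.getrandbits(256) | 1  # force odd so it is never zero
        expected = reference_modpow(base, exponent, modulus)
        if pow(base, exponent, modulus) != expected:
            failures.append((base, exponent, modulus))


def run_stress_test(thread_count=10, iterations=200):
    """Start thread_count workers, wait for them, return the failing inputs."""
    failures = []
    threads = [
        threading.Thread(target=hammer, args=(seed, iterations, failures))
        for seed in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return failures


if __name__ == "__main__":
    bad = run_stress_test()
    print("mismatches:", len(bad))
```

Recording the failing inputs rather than just crashing means any mismatch can be replayed on a single thread afterwards, which helps tell a bad calculation apart from a threading problem.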

BANG! I had reproduced it.

I then spent the next hour or so (wasting some of both Miguel's and Sebastien's time) finally tracking it down to a compiler bug in gmcs. The conditions for reproducing the bug were fairly strict, which is why I never managed to reproduce it myself.

1) You must be running under MS.NET 2.0 (meaning you have to be on Windows)
2) You must be running a multi-processor machine
3) There must be more than 1 thread running a BigInteger.ModPow calculation simultaneously.
4) It has to be Wednesday ;)

The quick fix was to just compile the BigInteger code using the Microsoft compiler. The only thing I'll say is I pity the person who has to track down the bug in the compiler; it's unlikely to be easy.

1 comment:

Emmanuel said...

one question, why did the bug only happen on a Wednesday?
