Monday, November 02, 2009

Don't BinaryReader.PeekChar () at me!

Update: As pointed out in the comments, there are errors in the implementation of SeekableStream below. That's what happens when you write a class at 2am ;) The errors are that Writing to the stream will be off by 1 byte if you've peeked, and Position will be off by 1 byte if you've peeked. I'll re-read and update the class later.

Update 2: I updated the implementation of PeekableStream to make it a read-only stream wrapper. I'm not pushed about supporting writing on it as it's not something I require. I'll leave it as an exercise to the reader to support writing if they need it ;)

How many times has that urge hit you to peek inside a stream and see the secrets hidden inside? Have there been times when you wished that Stream exposed a 'public byte Peek ()" method so you could do this easily? Were you delighted when you discovered that BinaryReader exposed PeekChar () , which did nearly the same thing [0]? If that describes you, you're in for a horrible horrible surprise.

I received a bug report during the week saying that MonoTorrent reads all data twice from disk when it is hashing. I responded with "No it doesn't! That'd be crazy!", to which I was handed a screenshot of a ~310MB torrent which in the windows task manager reported ~750MB of I/O Read Bytes. I was told this happened after loading the torrent and calling HashCheck () on it. This was irrefutable evidence that something was up, but I was still unconvinced .

So over the weekend I fired up windows and double checked. I could replicate the bizarre statistic. But strangely enough, it wasn't hashing that was causing it! The I/O Read Bytes was up at around 350MB before hashing even started. But that was strange, because the only thing that happened before that was:

Torrent torrent = Torrent.Load (path);

There's no way a simple forward-only parser reading a 100kB file could possibly result in 350MB of IO reads, could it? Actually, it could! Changing the declaration to the following completely vanished the extra 350MB:

Torrent torrent = Torrent.Load (File.ReadAllBytes (path))

So what was going wrong? Internally the parser used BinaryReader.PeekChar () to figure out the type of the next element so that correct decoder could be called. I thought this would be a simple array access, or something similar. However what actually happens is that one byte is read from the underlying stream, then the stream seeks 1 byte backwards . In the case of FileStream, this meant that the entire read buffer was refilled from 'disk' [1] every time I peeked. A 100kB file really was really being turned into a 350MB monstrosity! And yes, the Mono implementation unfortunately has to do the same. So how could I fix this?


Simples! I could write a PeekableStream, one that's smart enough to not need to do horrible buffer killing seeks. What was the end result? Well, that particular .torrent file loaded nearly 5x faster, ~100ms for everything instead of ~500ms. An average file would experience a much smaller speedup. This one is a bit different in that it contains over 2000 files and the speed up is proportional to the number of BEncoded elements in the .torrent file.

public class PeekableStream : Stream
{
bool hasPeek;
Stream input;
byte[] peeked;

public PeekableStream (Stream input)
{
this.input = input;
this.peeked = new byte[1];
}

public override bool CanRead
{
get { return input.CanRead; }
}

public override bool CanSeek
{
get { return input.CanSeek; }
}

public override bool CanWrite
{
get { return false; }
}

public override void Flush()
{
throw new NotSupportedException();
}

public override long Length
{
get { return input.Length; }
}

public int PeekByte()
{
if (!hasPeek)
hasPeek = Read(peeked, 0, 1) == 1;
return hasPeek ? peeked[0] : -1;
}

public override int ReadByte()
{
if (hasPeek)
{
hasPeek = false;
return peeked[0];
}
return base.ReadByte();
}

public override long Position
{
get
{
if (hasPeek)
return input.Position - 1;
return input.Position;
}
set
{
if (value != Position)
{
hasPeek = false;
input.Position = value;
}
}
}

public override int Read(byte[] buffer, int offset, int count)
{
int read = 0;
if (hasPeek && count > 0)
{
hasPeek = false;
buffer[offset] = peeked[0];
offset++;
count--;
read++;
}
read += input.Read(buffer, offset, count);
return read;
}

public override long Seek(long offset, SeekOrigin origin)
{
long val;
if (hasPeek && origin == SeekOrigin.Current)
val = input.Seek(offset - 1, origin);
else
val = input.Seek(offset, origin);
hasPeek = false;
return val;
}

public override void SetLength(long value)
{
throw new NotSupportedException();
}

public override void Write(byte[] buffer, int offset, int count)
{
throw new NotSupportedException();
}
}


This code is under the MIT/X11 license, so everywhere you use PeekChar () and actually just want to Peek at a byte, use this class instead. Your harddrives will love you for it. If you actually want to peek at a char, extend this class to be able to read a (multi-byte) char from the underlying stream and cache it locally just like the current PeekByte method . A side benefit is that you can now peek at unseekable streams. Not too bad, eh?

[0] PeekChar does exactly what it says on the tin. It reads one (multi-byte) character from the stream. So if you're using PeerChar on a binary stream which does not contain valid data as defined by the current Encoding, you're going to corrupt some data or get exceptions. I mention this here in case anyone is using PeekChar () as a way of reading bytes from the stream.

[1] I say that the FileStream buffer was being filled from disk, but that's not quite accurate. It was actually being refilled from either the windows cache or the harddrives cache. It's physically impossible for a mere 7200 RPM harddrive to supply data that fast. However I still was physically copying 350MB of data around in memory so that was a huge penalty right there.

13 comments:

mjcoder said...

The functions Seek and Write seem to be incomplete. I'd assume the following:

stream.Write(new byte[] { 0, 1, 2 }, 0, 3);
stream.Seek(0, SeekOrigin.Begin);
stream.PeekByte();
var p = stream.Seek(0, SeekOrigin.Current);
System.Diagnostics.Debug.Assert(p == 0);

and for Write:

stream.Write(new byte[] { 0, 1, 2 }, 0, 3);
stream.Seek(0, SeekOrigin.Begin);
stream.PeekByte();
stream.Write(new byte[] { 1, 2, 3 }, 0, 3);
stream.Seek(0, SeekOrigin.Begin);
var v = stream.PeekByte();
System.Diagnostics.Debug.Assert(v == 1);

I guess that you get the idea - even when there are errors in the code ;)

Fowl said...

Ouch!

What is the / is there a reason for this braindead behaviour?

Alan said...

@mjcoder: You're absolutely right. I only used the Read code so thats why i never spotted this mistake. Position is incorrect in this implementation and Write can end up writing data to the wrong index. I'll add proper tests and check it later.

@fowl:
Well, Stream doesn't support peeking so BinaryReader has to fake it by using Seeking.

The real question is why did they allow such a braindead method in the framework. My opinion is that no-one realised. It's been more than 3 years since I implemented the parsing code and I only noticed this issue now.

For proper 'Peek' support you'd have to have Peek on Stream itself, which would greatly complicate subclassing Stream as everyone would have to implement Peek correctly.

Fowl said...

But surely there is a saner way of implementing it?

Or is buffering 1 byte against 'the rules' of a stream?

Alan said...

If BinaryReader wanted to expose a PeekChar method it should really buffer internally to remove the requirement that Peek does a Seek call. That would've been the sane alternative. However, probably for ease of implementation, they didn't. Now we're stuck with a method which has hideous (hidden!) performance implications.

Jeffrey Stedfast said...

It should be possible to implement buffering in Mono and we could offer an 'internal' method for peeking at the next byte.

Porges said...
This comment has been removed by the author.
Ryan said...

And why would you return an int when you are trying to look at the next byte?

Alan said...

Porges:
every seekable stream which is used with binary reader will suffer the exact same issue. However with some streams it just won't matter, for example a seek in a memorystream is really just changing the value of a single int. If you wrap a stream in a buffered stream and peek, you'll end up having to refill the buffer ontje buffered stream too.

Ryan:
how else would I be able to return an error condition? If you check the sample, you'll see that ReadByte is an overridden method, I.e. Part of the base class libraries, and it does exactly the sane thing for exactly the same reason.

Anonymous said...

酒店兼職 酒店打工 打工兼差 台北酒店 酒店兼差 酒店經紀 禮服酒店 酒店工作 酒店上班 兼差 酒店應徵 酒店 打工兼職 打工

gaohui said...

Have you noticed ed hardy Clothing that she is spending time with ed hardy sale one person in particular ed hardy and they seemed to come from ed hardy UK nowhere. When you ask how she ed hardy cheap knows them she becomes aloof and ed hardy Clothes disinterested. Is there someone's house ed hardy store she seems to be always going to? This edhardy.com could spell something is wrong with the christian audigier sale relationship. Is she taking trips, possibly day ed hardy dresses trips or small vacations without you? If ed hardy Polos she was doing this before you even ed hardy sandals got married or dated, then it may be okay, but if it is a recent ed hardy Jackets development then you may have problems.

sports handicapping services said...

Hi, great article. The way you explained it is really awesome and makes every one to read till the end. keep posting..

Dartme said...

Dude, that's the wrong way to go. I had a problem like this a while back. I needed to be able to peek, so I would read a bit and then seek backward. When I used a FileStream, it would be excellent, but a NetworkStream, it was bad. So, I created a MemoryStream by reading all the bytes I was expecting (something like System.IO.File.ReadAllBytes), and putting THAT into my BinaryReader. Then, I can seek all I want, I only read once, and I didn't have write a new class. I actually did write a new class because the underlying protocol can be either endianness, but that's a separate point.

Hit Counter