In reality, the transport you use for your data doesn't make much of a difference. First off, you (hopefully) have a layer of abstraction between your code and XMLHttpRequest that takes whatever data structure you throw at it and serializes it into XML or JSON. If all goes well, you'll never even see the data in its serialized form, so whether it's particularly easy to read or complete gibberish shouldn't matter. Secondly, the real problem with web services is not the amount of data you transfer back and forth but the fact that, in a worst-case scenario, you'll have to re-establish a TCP connection for every request you make - something not even JSON can prevent (to some degree, the keep-alive mechanism can).
Motivation
So when I decided to try and implement a binary web service protocol in JavaScript, I didn't do so to solve a particular problem, but to see if it was actually possible. The only advantage a binary protocol would give you is that for someone sniffing your packets, it would be a little bit more difficult to make sense of them than it would be with a plain-text format. But then again, that wouldn't stop anyone with a little ambition. So this is really just a proof of concept that shows once more that JavaScript can do a lot more than some people will give it credit for. It's not intended at all to be better, faster or more space efficient than XML or JSON.
The gritty details
I come from a background of traditional (read non-web) programming and so for me the problem of serializing a variable into a stream of bytes seemed trivial. If you've programmed in a language like C or C++ before, you'll probably know that these languages support the concept of pointers. Now pointers aren't particularly popular anymore, but they allow you to do something that modern programming languages aren't so good at: treat a variable as something that it's not. Let's say you have a 64 bit floating point variable and you want to save it to a file or send it over a socket connection. In order to turn the variable into a stream of bytes, all you have to do is create a byte-pointer and have it point at the floating point variable. Then you can access the individual bytes of the floating point variable and do whatever you want with them.
Unfortunately, JavaScript is one of those modern programming languages that doesn't do pointers, so all of a sudden serializing a variable becomes a little bit more challenging. Challenging but not impossible. I'm not going to get into too much detail about how the serialization works, but I basically ended up serializing everything on the bit level, which isn't particularly difficult but requires a lot more code to be written. If you care about the details of the implementation, just check out the source code (see the end of this page).
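For readers on a current JavaScript engine, typed arrays come fairly close to the byte-pointer trick. The sketch below is not how the BISON source does it (that works on the bit level, as described above). It assumes a runtime that provides `ArrayBuffer` and `DataView`, and the function names `doubleToBytes` and `bytesToDouble` are made up for illustration.

```javascript
// Hypothetical helpers, not part of the BISON source.
// Assumes a runtime that provides ArrayBuffer and DataView.

// Return the 8 bytes of a 64-bit float, least significant byte first.
function doubleToBytes(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value, true); // true = little-endian byte order
  const bytes = [];
  for (let i = 0; i < 8; i++) {
    bytes.push(view.getUint8(i));
  }
  return bytes;
}

// Rebuild the 64-bit float from its 8 bytes (same byte order as above).
function bytesToDouble(bytes) {
  const view = new DataView(new ArrayBuffer(8));
  bytes.forEach((b, i) => view.setUint8(i, b));
  return view.getFloat64(0, true);
}
```

The buffer plays the role of the variable and the `DataView` plays the role of the byte-pointer: the same eight bytes are written as a float and read back one byte at a time.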
Once I had the serialization and deserialization part figured out, it was time to put it to the test. So I wrote a PHP class to do the serialization and deserialization on the server and developed a little test application that would send a JavaScript object to the PHP script and have the PHP script echo it back to the JavaScript client. That's when I ran into the first problem: the XMLHttpRequest object which I used on the client side didn't seem to like null-bytes. In many programming languages, the null-byte is used to mark the end of a string. So when I sent my binary messages over XHR, it would ignore anything past the first null-byte. I wasn't going to give up so quickly, so I looked for a solution and found yEnc. yEnc is a mechanism for encoding binary data as text that is often found on Usenet. Unlike base64 encoding, which can sometimes be twice the size of the original unencoded message, yEnc has very little overhead and will get rid of any null-bytes. Once I had added yEnc to my serializer, my little test application finally worked.
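To give an idea of why yEnc is so cheap, here is a minimal sketch of its core byte transform as I understand it from the yEnc format: every byte is shifted by 42, and the few results that would cause trouble (NUL, LF, CR and the escape character `=`) are escaped. Line wrapping and the `=ybegin`/`=yend` headers of real yEnc posts are left out, and this is not necessarily the exact variant used in the BISON source.

```javascript
// Core yEnc byte transform only; line wrapping and the =ybegin/=yend
// headers of real yEnc posts are deliberately left out of this sketch.
const ESCAPE = 0x3d; // '='
const CRITICAL = new Set([0x00, 0x0a, 0x0d, ESCAPE]); // NUL, LF, CR, '='

// Encode an array of byte values (0-255) into an array of byte values
// that never contains a null-byte.
function yencEncode(bytes) {
  const out = [];
  for (const b of bytes) {
    const shifted = (b + 42) & 0xff;
    if (CRITICAL.has(shifted)) {
      // Escape: emit '=' followed by the shifted value plus 64.
      out.push(ESCAPE, (shifted + 64) & 0xff);
    } else {
      out.push(shifted);
    }
  }
  return out;
}

// Reverse the transform above.
function yencDecode(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; i++) {
    let b = bytes[i];
    if (b === ESCAPE) {
      // The byte after '=' carries the escaped value.
      b = (bytes[++i] - 64) & 0xff;
    }
    out.push((b - 42) & 0xff);
  }
  return out;
}
```

Only 4 of the 256 possible byte values need a second byte, which is where the low overhead comes from for evenly distributed data.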
However, when I looked at the size of the messages I was sending, I quickly noticed that I had yet another problem. The XMLHttpRequest object's send method automatically applies utf-8 encoding to anything it sends. This may be fine for text messages, but what I was sending wasn't exactly text. Now utf-8 encoding will encode characters with a numeric representation of 128 or above with anything between two and four bytes. Meaning that when I was thinking I had sent a single byte, I might have actually sent four. This problem I have not yet been able to circumvent, and while sending and receiving works just fine, the messages are a lot bigger than they have to be. Typically they're about the same size as a JSON message, but sometimes they're also bigger.
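The size increase is easy to reproduce. The sketch below treats each byte as a character code (as one does when handing binary data to a string-based API) and counts the bytes that utf-8 produces for the resulting string. It assumes a runtime with the standard `TextEncoder`; the function name `utf8SizeOfBinaryString` is made up for illustration.

```javascript
// Assumes a runtime with the standard TextEncoder (browsers, Node.js).
// Treat each byte value as a character code and count how many bytes the
// resulting string occupies once it has been encoded as utf-8.
// Note: spreading a very large array into fromCharCode can exceed the
// engine's argument limit; that is fine for a sketch with short inputs.
function utf8SizeOfBinaryString(bytes) {
  const text = String.fromCharCode(...bytes);
  return new TextEncoder().encode(text).length;
}
```

With this approach, character codes 0 to 127 stay at one byte each, while codes 128 to 255 take two bytes each, so a message made up of evenly distributed byte values grows by roughly half.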
The bottom line
So now I have a binary web service protocol and an implementation that sort of works but that suffers from a problem that I cannot fix and that makes the whole thing a lot less useful. I guess the most sensible thing to do now is to come up with a catchy name for it and get it out there. How about BISON (binary interchange standard and object notation)? It sounds remotely like JSON and is very Web 2.0.
As for the "get it out there" part: I'm releasing the JavaScript and PHP source code as well as the documentation under the LGPL so you can play around with it. If anyone comes up with a solution for the utf-8 problem, please let me know!

43 comments
First of all: Thanks for sharing code, that's the spirit
And: Cool, I always wondered what the benefits from yenc over base64 were but was always too lazy to read. Now I've got my answer :)
btw: I love to call all the modern programming languages "nerf languages", because they hide the pointy thingies from you ;)
p.s. anyone who doesn't know nerf:
You should be ashamed of yourself ;)
http://images.google.com/images?q=nerf
By the way: The links for the demos seem to be messed up...
I just sat down to produce a Perl implementation of BISON and I notice
that the maximum array size is 64k elements. Would it not be better to
encode the length using some variable length binary encoding?
For example:
0x0000 <= length <= 0x7FFF length is 0 to 32767
0x8000 <= length <= 0xFFFF extended length follows
So 32767 would be encoded as 0x7FFF and 32768 would be 0x8000
0x0000 etc. Likewise for objects.
That would be backwards compatible with the current scheme but would
allow unlimited array sizes.
I also note that there's no schema version in the binary data. If you
upgrade the encoding format it's going to be hard for decoders to
automatically sense the version used for encoding.
Since the stream starts encoding objects immediately after the magic
number can I suggest that you use object type 0xFF to denote the
encoding format version:
66 6D 62 0D 04 00 .... Version 0.0.1 (current)
66 6D 62 FF 00 00 02 0D 04 00 .... Version 0.0.2
Again that's backwards compatible with the current encoding scheme.
There's no way of referring back to a previously serialized object - so
you can't encode data structures which contain multiple references to
the same object - or self referential data structures.
Can I suggest that object type 0x11 be renamed 'hash' and a new object
type 0x13 be introduced as 'object' with the classname encoded inline
after the number of elements.
To handle the problem with multiple references to the same object there
should be object type 0x14 'backref'. 0x14 is followed by a number
encoded in the same way as my proposed array length encoding which
refers to object N in the preceding stream. So the first thing encoded
can be referred to as 0x14 0x0000, the second thing encoded as 0x14
0x0001 etc.
The spec says that null bytes in strings are backslash escaped. To make this work, backslashes must also be backslash escaped; otherwise you can't tell whether the sequence backslash, null indicates an embedded null byte or a string that ends with a backslash.
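A minimal sketch of such an escaping scheme, assuming a null-terminated string in which both the null byte and the backslash are prefixed with a backslash (the function names are made up and this is not taken from the BISON source):

```javascript
const BACKSLASH = 0x5c;
const NUL = 0x00;

// Escape an array of byte values and append the terminating null byte.
function escapeString(bytes) {
  const out = [];
  for (const b of bytes) {
    // Both the null byte and the backslash itself get a backslash prefix.
    if (b === BACKSLASH || b === NUL) out.push(BACKSLASH);
    out.push(b);
  }
  out.push(NUL); // unescaped null byte marks the end of the string
  return out;
}

// Read one escaped, null-terminated string and return its raw bytes.
function unescapeString(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === BACKSLASH) {
      out.push(bytes[++i]); // take the next byte literally
    } else if (bytes[i] === NUL) {
      return out; // terminator reached
    } else {
      out.push(bytes[i]);
    }
  }
  throw new Error("missing string terminator");
}
```

With both bytes escaped, a string ending in a backslash becomes backslash, backslash, null, which can no longer be confused with an embedded null byte.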
What you say makes a lot of sense.
1. I thought that 64k array elements might be enough, but you're right. There's really no reason to force this limitation upon whoever is using the format.
2. Including a version number makes a lot of sense. I actually thought about that at one point but must have forgotten about it.
3. The missing class name option is mostly due to me focussing too much on the stuff I thought I was going to use it for. You're right though, a class name would be a good addition and your hash vs. object idea is really good.
4. The backref thing sounds good. I'm not sure if I fully understand what you're saying, but to me it seems that making the distinction between a nested hash and an object with references to other objects would be pretty difficult to do automatically, at least on the JavaScript side of things. But again, I probably just didn't quite get it.
5. The null-byte/backslash thing is in the spec, it's just sort of obscure (From the spec: "The byte sequence 5Ch 5Ch 00h would decode to “\0”"). I guess I need to make this more obvious.
Thanks again for all your comments. I'll try and get all your ideas into the spec.
> what you're saying, but to me it seems that making the distinction
> between a nested hash and an object with references to other
> objects would be pretty difficult to do automatically, at least on
> the JavaScript side of things. But again, I probably just didn't
> quite get it.
When you're encoding you need to keep track of objects you've already
seen in a hash. When you get to an object you've already seen you output
0x14 + the ordinal position of the original object instead of encoding
it over again.
> 5. The null-byte/backslash thing is in the spec, it's just sort of
> obscure (From the spec: "The byte sequence 5Ch 5Ch 00h would decode
> to “\0”"). I guess I need to make this more obvious.
Ah - missed it. Sorry :)
I'm getting on pretty well with a Perl version. The encoder is done
assuming it passes the tests I'm about to write. I hope to get it
uploaded to CPAN tonight - I'll let you know.
Drop me mail at andy AT hexten DOT net if you'd like to discuss any of this.
Alright, I figured it out. Since the data is decoded as utf-8 before it ends up in a JavaScript string (which itself is UTF-16 internally), certain multi-byte patterns will be returned as one character by the charCodeAt method. Weird that I hadn't noticed this before, as this is pretty obvious. Either way, it's fixed now. Thanks for pointing it out to me.
@Andy:
I think I get it now. What confused me was that for some reason I thought you meant keeping track of object references across several requests. Obviously what you really meant makes a lot of sense. So that'll definitely go into the spec as well as the implementations.
Also, I can't thank you enough for pointing these things out to me. I'm also looking forward to checking out your Perl implementation.
I'm sure I'll have questions when updating the spec, so I'll get back to you some time soon (if you don't mind).
[quote]
The only advantage a binary protocol would give you is that for someone sniffing your packets, it would be a little bit more difficult to make sense of them than it would be with a plain-text format.
[/quote]
I don't think this is the right way to secure the exchange between client and server, even if you could encrypt BISON after serializing.
The HTTPS protocol (http://tools.ietf.org/html/rfc2818) implements security with SSL encryption and that works just fine ;).
But the experiment is interesting. I think (perhaps) the advantage is that BISON is more compressible if you enable gzip on your Apache server.
You're right that a binary format alone is not really more secure than a plain-text format. That's why right after the passage you quoted there's this sentence: "But then again that wouldn't stop anyone with a little ambition."
Binary formats have the advantage of not being human readable, but that's really all there is to it.
As for gzip compression, I don't think BISON compresses better than a plain-text format. I haven't tested it yet, but considering that gzip uses entropy encoding, I'd say BISON and let's say JSON should compress to about the same size.
I'm just testing my Perl version and I think there's a problem with your
backslash escaping in the PHP version.
If I generate the BISON data like this:
include_once('andy/source/bison.php');
$bison = new Bison;
$ar = array(
    'numbers' => array( 1, 2, 3.1415, 127, 128, -128 ),
    'strings' => array( 'Hello', 'World' ),
    'null'    => null,
    'hash'    => array( 'this' => 1, 'that' => 2 ),
    'unicode' => 'π',
    'nested'  => array(
        // '\\\\\\' in single quotes is three literal backslashes
        'hash'  => array( 'slashed' => '\\\\\\' ),
        'array' => array(array(array())),
    )
);
$data = $bison->serialize($ar);
// Write the serialized message to a file for inspection
$fh = fopen('t.bison', 'w');
fwrite($fh, $data);
fclose($fh);
The resulting output (after un-yEncoding) looks like this:
0x0000 : 46 4D 42 11 06 00 6E 75 6D 62 65 72 73 00 10 06 : FMB...numbers...
0x0010 : 00 05 01 05 02 0D 56 0E 49 40 05 7F 06 80 00 05 : ......V.I@......
0x0020 : 80 73 74 72 69 6E 67 73 00 10 02 00 0F 48 65 6C : .strings.....Hel
0x0030 : 6C 6F 00 0F 57 6F 72 6C 64 00 6E 75 6C 6C 00 01 : lo..World.null..
0x0040 : 68 61 73 68 00 11 02 00 74 68 69 73 00 05 01 74 : hash....this...t
0x0050 : 68 61 74 00 05 02 75 6E 69 63 6F 64 65 00 0F CF : hat...unicode...
0x0060 : 80 00 6E 65 73 74 65 64 00 11 02 00 68 61 73 68 : ..nested....hash
0x0070 : 00 11 01 00 73 6C 61 73 68 65 64 00 0F 5C 5C 5C : ....slashed..\\\
0x0080 : 00 61 72 72 61 79 00 10 01 00 10 01 00 10 00 00 : .array..........
Notice that there are only three backslashes. I think the correct
encoding would be six backslashes. The original string contains three
and each of them must be escaped.
Is that right?
http://search.cpan.org/~andya/Data-BISON-v0.0.1/
I can't wait to test it.
Thank you!
Size Limits
===========
Remove array size limit. Size encoding should be
0x0000 - 0x7FFF => size is 0 - 32767
0x8000 - 0xFFFF => this is the low 15 bits of the size with another
16 bits to follow
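One possible reading of this size encoding, sketched in JavaScript. The little-endian byte order is inferred from the hex dump above (where a count of 6 appears as 06 00); placing the remaining high bits in the following 16-bit word is an assumption, as are the function names. Under this reading 32768 comes out as 0x8000 0x0001, which differs from the 0x8000 0x0000 in the earlier example, so the exact layout should be treated as unconfirmed. Two words also cap the size at 2^31 - 1 rather than making it truly unlimited.

```javascript
// One possible reading of the proposal; byte order and the layout of the
// extension word are assumptions, not taken from a published spec.

// Encode a size as one or two little-endian 16-bit words.
function encodeSize(size) {
  if (!Number.isInteger(size) || size < 0 || size > 0x7fffffff) {
    throw new RangeError("size must be an integer from 0 to 2^31 - 1");
  }
  const u16le = (w) => [w & 0xff, (w >>> 8) & 0xff];
  if (size <= 0x7fff) {
    return u16le(size); // fits in 15 bits, top bit stays clear
  }
  const low = 0x8000 | (size & 0x7fff); // top bit set: extension follows
  const high = size >>> 15;             // remaining bits, at most 16
  return [...u16le(low), ...u16le(high)];
}

// Decode a size starting at offset; also report how many bytes were used.
function decodeSize(bytes, offset = 0) {
  const first = bytes[offset] | (bytes[offset + 1] << 8);
  if ((first & 0x8000) === 0) {
    return { size: first, length: 2 };
  }
  const high = bytes[offset + 2] | (bytes[offset + 3] << 8);
  return { size: high * 0x8000 + (first & 0x7fff), length: 4 };
}
```

Sizes up to 32767 produce exactly the same two bytes as the current fixed 16-bit field, which is what keeps the scheme backwards compatible for existing data.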
Version
=======
Add version encoding. Version object has tag 0xFF and follows the FMB
header. The following u16 gives the schema number. If no version
present schema version 1 is assumed. Bit 16 of the schema number is set
if this stream might contain backrefs. This is a hint to the decoder
that it needs to remember objects that it has created so that it can
refer back to them. See "Back References" below.
Objects
=======
Add a new tag (0x13) for real objects and rename 0x11 as HASH. Objects
are serialised in the same way as hashes except that the object class
name is encoded before the element count using normal string encoding.
The encoder and decoder should provide hooks to map the class names
stored in the file to some portable variant. So for example a Perl
encoder might want to translate classnames like
MyApp::UserData
MyApp::SessionState
into
UserData
SessionState
and then a JS or PHP decoder would remap those names to whatever
classes it uses.
Back References
===============
Add a new tag (0x14) for back references to previously encoded items.
When an item that has already been encoded is encountered again a
reference to it will be written as 0x14 followed by the ordinal position
of the original object in the stream encoded using the extended array
size encoding described above.
It is possible for an object to refer to itself:
my $hash = { };
$hash->{abc} = $hash;
$enc->encode( $hash );
0x11 0x01 0x00 0x61 0x62 0x63 0x00 0x14 0x00 0x00
\____________/ \_________________/ \____________/
  hash, 1 el          'abc'          back ref #0
At the discretion of the encoder this technique may be used for repeated
scalars (numbers and strings) as well as hashes and arrays if this would
make the encoded data more compact.
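On the JavaScript side, the bookkeeping could look roughly like the sketch below. It only handles hashes (plain objects) and small integers, writes the back reference ordinal as a plain little-endian 16-bit word, and counts every encoded value (not only hashes) when assigning ordinals; the last point is an assumption, since the proposal does not spell it out. The tag 0x05 for an 8-bit integer is inferred from the hex dump above, and all names are made up for illustration.

```javascript
// Sketch only: hashes (plain objects) and 8-bit integers are supported.
const TAG_INT8 = 0x05;    // inferred from the hex dump, e.g. "05 7F" for 127
const TAG_HASH = 0x11;
const TAG_BACKREF = 0x14; // proposed tag

function u16le(n) {
  return [n & 0xff, (n >> 8) & 0xff];
}

function encodeWithBackrefs(root) {
  const out = [];
  const seen = new Map(); // object -> ordinal position in the stream
  let nextOrdinal = 0;

  function encodeValue(value) {
    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      if (seen.has(value)) {
        // Already encoded: write a back reference instead of the object.
        out.push(TAG_BACKREF, ...u16le(seen.get(value)));
        return;
      }
      // Register the object before descending so it can refer to itself.
      seen.set(value, nextOrdinal++);
      const keys = Object.keys(value);
      out.push(TAG_HASH, ...u16le(keys.length));
      for (const key of keys) {
        for (const ch of key) out.push(ch.charCodeAt(0)); // ASCII keys only
        out.push(0x00); // key terminator
        encodeValue(value[key]);
      }
      return;
    }
    if (Number.isInteger(value) && value >= -128 && value <= 127) {
      nextOrdinal++;
      out.push(TAG_INT8, value & 0xff);
      return;
    }
    throw new TypeError("value type not supported in this sketch");
  }

  encodeValue(root);
  return out;
}
```

For the self-referential hash from the Perl example (`const hash = {}; hash.abc = hash;`), this produces the same ten bytes as shown above. Registering the object before encoding its members is what keeps the encoder from recursing forever.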
very nice idea!
By the way: XMLHttpRequest defaults to UTF-8, but you should be perfectly able to do myXMLHttpRequest.setRequestHeader("Content-Type", "text/xml; charset=ASCII");
Give it a try.
Great stuff. I'm excited to see what you do with it.
@Andy Armstrong:
Excellent! You're right about backslash escaping. I remember not implementing it at all when I noticed that PHP didn't seem to support null-bytes in strings properly. But obviously backslashes need to be escaped regardless, so I'll fix that. As you said, three backslashes would become six backslashes in the escaped string.
Also, thanks for the effort of doing a Perl translation and thanks for uploading it to CPAN! That's really awesome.
Finally ;-) thanks for the summary of your proposals. They will go into the spec very soon. I'm thinking about enforcing identifier-style naming conventions for member names with the new "object" type. What do you think?
@Michal Kuklis:
Thanks, man!
@Paul Bakaus:
Thanks a lot! I'm pretty sure I tried that with no luck. I think in order to get XHR to use something other than utf-8, you need to have the <?xml ... encoding="ascii" ?> but I'll give your idea a try and let you know if it worked.
Fatal error: Uncaught exception 'Exception' with message 'Not a valid BISON message' in /var/www/members/jaeger/downloads/bison/source/bison.php:617
Stack trace:
#0 /var/www/members/jaeger/downloads/bison/examples/echo/bisonserver.php(7): Bison->deserialize('')
#1 {main}
thrown in /var/www/members/jaeger/downloads/bison/source/bison.php on line 617
1) As for BISON vs Bison: I wasn't totally serious about the name BISON and I was also aware that the name was already "in use". I don't think this is a real issue though because as you said, BISON is quite useless ;-).
2) You mention that Gmail uses AJAX to upload attachments. I don't know where you got that information, but that's not actually how Gmail does it. The "asynchronous upload trick" actually uses a hidden IFRAME as target for the upload form. AFAIK, that's also the only "asynchronous" way to upload a file from an HTML page. No chance you could grab the file and pass it through XHR.
While it's totally possible to set a Content-Type and even a charset with the XHR object, this does not change the way it treats null-bytes. The null-byte issue arises, because the XHR "send" method treats whatever you pass to it as a null-terminated string.
As for setting a more appropriate charset: when sending something other than XML, the charset will be utf-8 regardless of which charset you specify in the header. Again, that's understandable because XHR was never intended to be used for sending binary data (Microsoft's implementation actually supports it, but not from within JavaScript).
3) About base64-encoding: during the tests I performed with base64-encoding, messages were anything from 10 to 100% larger than the original message. The average was around 30%, but since I was trying to get a point across, of course I picked the extreme. Come on, that's totally legitimate. ;-)
4) As for repeat/reverse: sorry, but I usually go for the most obvious solution and not necessarily the shortest, especially with code I'm going to share with others. The real issue here is that I shouldn't be extending the string prototype in the first place. I don't really have anything to say in my defense though, except that I will remove this with the next release and put it somewhere inside the BISON constructor.
Thanks again for your input and your criticism. Much appreciated.
These two solutions also suffer from the same charset/null-byte issues that I’m struggling with so encoding needs to be applied here, too.
http://ebml.sourceforge.net/
It's useless. The reason for a binary protocol like this is primarily to save on a needless waste of bandwidth. Sending data in binary form will particularly help reduce the size of numeric data.
However, then you go and put a 64k item limit on arrays. Now why'd you do that? Obviously if I have a need for this, 64k is a relevant and annoying limitation.
Similarly, the string backslash escape thing is needless; there's already a binary data stream type, so just state that 00 may never occur in strings. If people absolutely need it, they'll implement their own escaping scheme on top.
Finally, a separate 'type' for every 8-bit granularity for integers, yet no series of unicode varieties for strings, nor any support for maps (a.k.a. dictionaries a.k.a. JS objects).
All in all commendable effort but please make this go away, as it'll cause confusion when real attempts at this kind of thing are being made.
Then there's strings in general: Why not encode on length? Makes parsing a hell of a lot easier on both sides. Again, if this is to be used for large chunks of data, it's extremely useful to know if there's a tiny string coming down the pipes, or a huge monster that may need to be saved to disk intermittently.
So, to summarize what needs to be done:
- Add unicode stuff. I suggest at least one type for UTF-8, and another for ISO-8859-(pick one).
- nix the backslash escaping stuff.
- make strings length-coded (and make it a 31-bit integer for maximum support, though if you want to support a 'smallstring' with a smaller int type for encoding length, go ahead)
- change (or add) the array type to support 31-bit integer lengths
- add support for maps/dictionaries/objects, whichever name you use for such things. Keys should be either strings or numbers. It's okay (and size-wise efficient) if maps may only have either all numbers or all strings for keys. Values must be anything you can represent with bison.
Other things aren't going to change (UTF-8 requirement for strings, string escaping) because of BISON's strong ties with JavaScript and in order not to break compatibility with existing implementations.
Maps/Dictionaries/Objects (whatever you want to call them) are already supported. Just check out the "Object" data type in the spec. Also, pretty much all examples on this page actually send objects. (I guess you must have overlooked it)
Thanks again for your feedback! I will not make this go away though, because like the article says "[...] this is really just a proof of concept [...]. It's not intended at all to be better, faster or more space efficient than XML or JSON."
* arbitrary precision -- all implementations should not obligatorily support that, obviously, but the original JSON spec lets you input as many digits as you want for numbers, so in order to express as much it would be good to be able to express arbitrary-precision numbers in your encoding as well. One common way to do it for integers is to use the "continuation bit" trick: write the integer as a series of 7-bit digits, and then add 128 to all these digits except for the last one, you get a string of bytes that can represent an arbitrary-length integer.
* Word for encoding Length???? Either an arbitrary precision number would do or Integer32 at least...
* There is an error in your grammar: ByteStreams may contain NullByte, so defining Strings as ByteStream NullByte doesn't make sense. Since Unicode Strings may legally contain the null byte, you should either (a) escape null bytes in your "ByteStream" data type (and escape the escape character itself), or (b) use a tag-length-value method to encode Strings
* define a canonical encoding -- that is important for comparing message hashes, which itself is important for guaranteeing data integrity and/or security in certain use-cases. For that you need to do just two things:
- make sure that the most efficient encoding type is used to represent integers (e.g. can't use the 64-bit integer type to store just the number 42...)
- order "members" in some ordering (lexicographic ordering would be straightforward for example)
Thanks.
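The "continuation bit" trick from the first point above can be sketched as follows. `BigInt` is used so that integers of any length work; it needs a reasonably recent JavaScript engine. Writing the most significant 7-bit group first is an assumption (the description does not fix the order), and the function names are made up for illustration.

```javascript
// Most significant 7-bit group first; the description above does not fix
// the order, so that choice is an assumption of this sketch.

// Encode a non-negative BigInt as bytes with a continuation bit.
function encodeVarint(value) {
  if (typeof value !== "bigint" || value < 0n) {
    throw new RangeError("value must be a non-negative BigInt");
  }
  // The last byte carries the lowest 7 bits and no continuation bit.
  const bytes = [Number(value & 0x7fn)];
  value >>= 7n;
  while (value > 0n) {
    // Every earlier byte gets 128 added to signal "more bytes follow".
    bytes.unshift(Number(value & 0x7fn) + 128);
    value >>= 7n;
  }
  return bytes;
}

// Decode bytes produced by encodeVarint back into a BigInt.
function decodeVarint(bytes) {
  let value = 0n;
  for (const b of bytes) {
    value = (value << 7n) | BigInt(b & 0x7f);
    if (b < 128) return value; // no continuation bit: this was the last byte
  }
  throw new Error("truncated varint");
}
```

For example, 300 splits into the 7-bit groups 2 and 44, so it is written as the two bytes 0x82 0x2C.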
It can easily compete with Google's Protocol Buffers ( http://google-opensource.blogspot.com/2008/07/protocol-buffers-googl... ), but in an easier and JSON-compatible way.
Are they the same or competing binary JSON implementations? If they are not the same, what are the differences and which one is more popular?