Yesterday, Microsoft published their CU-RTC-Web WebRTC API proposal as an
alternative to the existing W3C WebRTC API being implemented in Chrome
and Firefox. Microsoft’s proposal is a “low-level API” proposal which
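basically exposes a bunch of media- and transport-level primitives to
the JavaScript Web application, which is expected to stitch them
together into a complete calling system. By contrast to the current
“mid-level” API, the Microsoft API moves a lot of complexity from the
browser to the JavaScript, but the authors argue that this makes it more
powerful and flexible. I don’t find these arguments that convincing,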
however: a lot of them seem fairly abstract and rhetorical and when
we get down to concrete use cases, the examples Microsoft gives seem
like things that could easily be done within the existing framework.
So, while it’s clear that the Microsoft proposal is a lot more work
for the application developer, it’s a lot less clear that it’s
sufficiently more powerful to justify that additional complexity.
Microsoft’s arguments for the superiority of this API fall into
three major categories:
- JSEP doesn’t match with “key Web tenets”; i.e., it doesn’t match
the Web/HTML5 style.
- It allows the development of applications that would otherwise
be difficult to develop with the existing W3C API.
- It will be easier to make it interoperate with existing VoIP
endpoints.
Like any all-new design, this API has the significant advantage (which
the authors don’t mention) of architectural cleanliness. The existing
API is a compromise between a number of different architectural
notions and, like any hybrid proposal, has points of ugliness where
those proposals come into contact with each other (especially in the
area of SDP). However, when we actually look at functionality rather
than elegance, the advantages of an all-new design—not only one
which is largely not based on preexisting technologies but one which
involves discarding most of the existing work on WebRTC itself—start
to look fairly thin.
Looking at the three claims listed above: the first seems more
rhetorical than factual. It’s certainly true that in the early
years of the Web designers strove to keep state out of the Web
browser, but that hasn’t been the case with rich Web applications
for quite some time. To the contrary, many modern HTML5 technologies
(localstore, WebSockets, HSTS, WebGL) are about pushing state onto the
browser from the server.
The interoperability argument is similarly weakly supported.
Given that JSEP is based on existing VoIP technologies, it
seems likely that it is easier to make it interoperate with
existing endpoints since it’s not first necessary to implement
those technologies in JavaScript before
you can even try to interoperate. The idea here seems to be that
it will be easier to accommodate existing noncompliant endpoints
if you can adapt your Web application on the fly, but given the
significant entry barrier to interoperating at all, this
seems like an argument that needs rather more support than
MS has currently offered.
As for the functionality/complexity tradeoff, it’s somewhat distressing
that the specific
applications that Microsoft cites (baby monitoring, security cameras,
etc.) are so pedestrian and easily handled by JSEP. This isn’t of
course to say that there aren’t applications, not yet envisioned,
which JSEP would handle badly, but it rather undercuts this
argument if the only examples you cite in support of a new design are
those which are easily handled by the old one.
None of this is to say that CU-RTC-Web wouldn’t be better in some
respects than JSEP. Obviously, any design has tradeoffs and as I
said above, it’s always appealing to throw all that annoying legacy
stuff away and start fresh. However, that also comes with a lot of
costs and before we consider that we really need to have a far
better picture of what benefits other than elegance starting
over would bring to the table.
More or less everyone agrees about the basic objectives of the WebRTC
effort: to bring real-time communications (i.e., audio, video, and
direct data) to browsers. Specifically, the idea is that Web
applications should be able to use these capabilities directly. This
sort of functionality was of course already available either via
generic plugins such as Flash or via specific plugins such as Google
Talk, but the idea here was to have a standardized API that was built
into the browser.
In spite of this agreement about objectives, from the beginning there
was debate about the style of API that was appropriate, and in particular
how much of the complexity should be in the browser and how much in
the JavaScript. The initial proposals broke down into two main flavors:
- High-level APIs — essentially a softphone in the browser. The Web
application would request the creation of a call (perhaps with some
settings as to what kinds of media it wanted) and then each browser
would emit standardized signaling messages which the Web application
would arrange to transit to the other browser. The original WHATWG
HTML5/PeerConnection spec was of this type.
- Low-level APIs — an API which exposed a bunch of primitive
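media and transport capabilities to the JavaScript. A browser which
implemented this sort of API couldn’t really do much by itself.
Instead, you would need to write something like a softphone in
JavaScript, including implementing the media negotiation,
signaling state machinery, etc. Matthew Kaufman from Microsoft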
was one of the primary proponents of this design.
After a lot of debate, the WG ultimately rejected both of these and
settled on the JavaScript Session Establishment Protocol
(JSEP), which is probably best described as a mid-level API. That
design, embodied in the current specifications, keeps the transport
establishment and media negotiation in the browser but moves a fair
amount of the session establishment state machine into the JavaScript.
While it doesn’t standardize signaling, it also has a natural mapping
to a simple signaling protocol as well as to SIP and Jingle, the two
dominant standardized calling protocols. The idea is supposed to be
that it’s simple to write a basic application (indeed, a large number
of such simple demonstration apps have been written) but that
it’s also possible to exercise advanced features by manipulating
the various data structures emitted by the browser. This is obviously
something of a compromise between the first two classes of proposals.
The decision to follow this trajectory was made somewhere around six
months ago and at this point Google has a fairly mature JSEP
implementation available in Chrome Canary while Mozilla has a less
mature implementation which you can compile yourself but which hasn’t been
released in any public build.
Below is an initial, high-level analysis of this proposal.
Disclaimer: I have been heavily involved with both the IETF and
W3C working groups in this area and have contributed significant
chunks of code to both the Chrome and Firefox implementations. I am
also currently consulting for Mozilla on their implementation. However,
the comments here are my own and don’t necessarily represent those of any other
organization.
WHAT IS MICROSOFT PROPOSING?
What Microsoft is proposing is effectively a straight low-level API.
There are a lot of different API points, and I don’t plan to discuss
the API in much detail, but it’s helpful to talk about the API
some to get a flavor of what’s required to use it. The main objects are:
- RealTimeMediaStream — each RealTimeMediaStream represents a single
flow of media (i.e., audio or video).
- RealTimeMediaDescription — a set of parameters for the
RealTimeMediaStream.
- RealTimeTransport — a transport channel which a RealTimeMediaStream
can run over.
- RealTimePort — a transport endpoint which can be paired with a
RealTimePort on the other side to form a RealTimeTransport.
In order to set up an audio, video, or audio-video session, then, the JS
has to do something like the following:
- Acquire local media streams on each browser via the getUserMedia()
API, thus getting some set of MediaStreamTracks.
- Create RealTimePorts on each browser for all the local network
addresses as well as for whatever media relays are available.
- Communicate the coordinates for the RealTimePorts from each
browser to the other.
- On each browser, run ICE connectivity checks for all combinations
of remote and local RealTimePorts.
- Select a subset of the working remote/local RealTimePort pairs
and establish RealTimeTransports based on those pairs.
(This might be one or might be more than one depending on
the number of media flows, level of multiplexing, and the
level of redundancy required).
- Determine a common set of media capabilities and codecs between
each browser, select a specific set of media parameters, and
create matching RealTimeMediaDescriptions on each browser
based on those parameters.
- Create RealTimeMediaStreams by combining RealTimeTransports,
RealTimeMediaDescriptions, and MediaStreamTracks.
- Attach the remote RealTimeMediaStreams to some local display
method (such as an audio or video tag).
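To give a feel for how much of this lands on the JS, here is a minimal sketch of just two of the steps above: pairing up ports and picking working pairs. The types and function names (PortInfo, PortPair, allPortPairs, selectWorkingPairs) are invented for illustration and are not the CU-RTC-Web API. The connectivity check is passed in as a plain function, and the "prefer direct paths over relayed ones" ordering is my own assumption.

```typescript
// Hypothetical shapes for illustration only; these are NOT the CU-RTC-Web types.
interface PortInfo {
  address: string;
  port: number;
  relayed: boolean; // true if the port was allocated on a media relay
}

interface PortPair {
  local: PortInfo;
  remote: PortInfo;
}

// Pair every local port with every remote port, as in the
// "all combinations" step of the list above.
function allPortPairs(locals: PortInfo[], remotes: PortInfo[]): PortPair[] {
  const pairs: PortPair[] = [];
  for (const local of locals) {
    for (const remote of remotes) {
      pairs.push({ local: local, remote: remote });
    }
  }
  return pairs;
}

// Keep the pairs whose connectivity check succeeded, list direct
// (non-relayed) paths first, and take at most `wanted` of them.
function selectWorkingPairs(
  pairs: PortPair[],
  checkSucceeded: (pair: PortPair) => boolean,
  wanted: number
): PortPair[] {
  const working = pairs.filter(checkSucceeded);
  const direct = working.filter(p => !p.local.relayed && !p.remote.relayed);
  const viaRelay = working.filter(p => p.local.relayed || p.remote.relayed);
  return direct.concat(viaRelay).slice(0, wanted);
}

const localPorts: PortInfo[] = [
  { address: "192.0.2.10", port: 5000, relayed: false },
  { address: "198.51.100.7", port: 6000, relayed: true },
];
const remotePorts: PortInfo[] = [
  { address: "203.0.113.5", port: 7000, relayed: false },
  { address: "198.51.100.9", port: 8000, relayed: true },
];

const candidatePairs = allPortPairs(localPorts, remotePorts);
// Pretend that only paths touching a relay got through.
const chosenPairs = selectWorkingPairs(
  candidatePairs,
  p => p.local.relayed || p.remote.relayed,
  1
);
```

Even this toy version leaves out the actual checks, timeouts, and retransmissions, which is where most of the difficulty in a real ICE implementation lives.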
For comparison, in JSEP you would do something like:
- Acquire local media streams on each browser via the getUserMedia()
API, thus getting some set of MediaStreamTracks.
- Create a PeerConnection() and call AddStream() for each of the
local streams.
- Create an offer on one browser, send it to the other side,
create an answer on the other side and send it back to the
offering browser. In the simplest case, this just involves
making some API calls with no arguments and passing the
results to the other side.
- The PeerConnection fires callbacks announcing remote media
streams which you attach to some local display method.
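The offer/answer step in the list above can be sketched roughly as follows. This is a simplified, synchronous stand-in rather than the real PeerConnection API: the OfferAnswerPeer interface, the ToyPeer class, and the string "descriptions" are all assumptions made so that the exchange can be run outside a browser. The method names simply mirror the calls discussed in this post.

```typescript
// Simplified stand-in for the offer/answer part of a PeerConnection.
// Signatures and the string "descriptions" are assumptions for illustration.
interface OfferAnswerPeer {
  createOffer(): string;
  createAnswer(): string;
  setLocalDescription(desc: string): void;
  setRemoteDescription(desc: string): void;
}

// Toy in-memory peer (hypothetical) so the flow can run without a browser.
class ToyPeer implements OfferAnswerPeer {
  label: string;
  localDesc: string | null = null;
  remoteDesc: string | null = null;

  constructor(label: string) {
    this.label = label;
  }
  createOffer(): string {
    return "offer-from-" + this.label;
  }
  createAnswer(): string {
    if (this.remoteDesc === null) {
      throw new Error("cannot answer before an offer has been installed");
    }
    return "answer-from-" + this.label + "-to-" + this.remoteDesc;
  }
  setLocalDescription(desc: string): void {
    this.localDesc = desc;
  }
  setRemoteDescription(desc: string): void {
    this.remoteDesc = desc;
  }
}

// One round of offer/answer. In a real application the two strings would
// travel over whatever signaling channel the Web application provides.
function negotiate(offerer: OfferAnswerPeer, answerer: OfferAnswerPeer): void {
  const offer = offerer.createOffer();
  offerer.setLocalDescription(offer);
  answerer.setRemoteDescription(offer); // "send it to the other side"
  const answer = answerer.createAnswer();
  answerer.setLocalDescription(answer);
  offerer.setRemoteDescription(answer); // "send it back"
}

const alice = new ToyPeer("alice");
const bob = new ToyPeer("bob");
negotiate(alice, bob);
```

The point is that the application only shuttles opaque blobs back and forth; it never has to look inside them unless it wants to.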
As should be clear, the CU-RTC-Web proposal requires significantly
more work from the JS, which also needs to
be a lot smarter about what it’s doing. In a JSEP-style API, the Web
programmer can be pretty ignorant about things like codecs and
transport protocols, unless he wants to do something fancy, but with
CU-RTC-Web, he needs to understand a lot of stuff to make things work
at all. In some ways, the JSEP style is a much better fit for the traditional
Web approach of having simple default behaviors which fit a lot of
cases but which can then be customized, albeit in ways that
are sometimes a bit clunky.
Note that it’s not like this complexity doesn’t exist in JSEP,
it’s just been pushed into the browser so that the user doesn’t have
to see it. As discussed below, Microsoft’s argument is that this
simplicity comes at a price in terms of flexibility
and robustness, and that libraries will be developed (think jQuery)
to give the average Web programmer a simple experience, so that
they won’t have to accept a lot of complexity themselves. However,
since those libraries don’t exist, it seems kind of unclear how
well that’s going to work.
ARGUMENTS FOR MICROSOFT’S PROPOSAL
Microsoft’s proposal and the associated blog post make a number of
major arguments for why it is a superior choice (the proposal just came
out today so there haven’t really been any public arguments for why
it’s worse). Combining the blog posts, you would get something like:
- That the current specification violates “fit with key web tenets”,
specifically that it’s not stateless and that you can only make
changes when in specific states. Also, that it depends on
the SDP offer/answer model.
- That it doesn’t allow a “customizable response to changing network
quality”.
- That it doesn’t support “real-world interoperability” with
existing equipment.
- That it’s too tied to specific media formats and codecs.
- That JSEP requires a Web application to do some frankly inconvenient
stuff if it wants to do something that the API doesn’t have explicit
support for.
- That it’s inflexible and/or brittle with respect to new applications
and in particular that it’s difficult to implement some specific
“innovative” applications with JSEP.
Below we examine each of these arguments in turn.
FITTING WITH “WEB TENETS”
Microsoft writes:
Honoring key Web tenets - The Web favors stateless interactions which
do not saddle either party of a data exchange with the
responsibility to remember what the other did or expects. Doing
otherwise is a recipe for extreme brittleness in implementations;
it also raises considerably the development cost which reduces the
reach of the standard itself.
This sounds rhetorically good, but I’m not sure how accurate it is.
First, the idea that the Web is “stateless” feels fairly anachronistic
in an era where more and more state is migrating from the server. To
pick two examples, WebSockets involves forming a fairly long-term stateful
two-way channel between the browser and the server, and localstore/localdb
allow the server to persist data semi-permanently on the browser.
Indeed, CU-RTC-Web requires forming a nontrivial amount of state on
the browser in the form of the RealTimePorts, which represent actual
resource reservations that cannot be reliably reconstructed if
(for instance) the page reloads. I think the idea here is supposed
to be that this is “soft state”, in that it can be kept on the
server and just reimposed on the browser at refresh time, but as
the RealTimePorts example shows, it’s not clear that this is the case.
Similar comments apply to the state of the audio and video devices
which are inherently controlled by the browser.
Moreover, it’s never been true that neither party in the data exchange
was “saddled” with remembering what the other did; rather, it used
to be the case that most state sat on the server, and indeed, that’s
where the CU-RTC-Web proposal keeps it. This is the first time we have
really built a Web-based peer-to-peer app. Pretty much all previous
applications have been client-server applications, so it’s hard to
know what idioms are appropriate in a peer-to-peer case.
I’m a little puzzled by the argument about “development cost”; there
are two kinds of development cost here: that to browser implementors
and that to Web application programmers. The MS proposal puts
more of that cost on Web programmers whereas JSEP puts more of
the cost on browser implementors. One would ordinarily think that
as long as the standard wasn’t too difficult for browser implementors
to develop at all, then pushing complexity away from Web programmers
would tend to increase the reach of the standard. One could of course
argue that this standard is too complicated for browser implementors
to implement at all, but the existing state of Google and Mozilla’s
implementations would seem to belie that claim.
Finally, given that the original WHATWG draft had even more state in
the browser (as noted above, it was basically a high-level API), it’s
a little odd to hear that Ian Hickson is out of touch with the “key
Web tenets”.
CUSTOMIZABLE RESPONSE TO CHANGING NETWORK QUALITY
The CU-RTC-Web proposal writes:
Real time media applications have to run on networks with a wide
range of capabilities varying in terms of bandwidth, latency, and
noise. Likewise these characteristics can change while an
application is running. Developers should be able to control how the
user experience adapts to fluctuations in communication quality. For
example, when communication quality degrades, the developer may
prefer to favor the video channel, favor the audio channel, or
suspend the app until acceptable quality is restored. An effective
protocol and API will have to arm developers with the tools to
tailor such answers to the exact needs of the moment, while
minimizing the complexity of the resulting API surface.
It’s certainly true that it’s desirable to be able to respond to
changing network conditions, but it’s a lot less clear that the
CU-RTC-Web API actually offers a useful response to such changes. In
general, the browser is going to know a lot more about what the
bandwidth/quality tradeoff of a given codec is going to be than most
JS applications will, so it seems at least plausible that
you’re going to do better with a small number of policies (audio is
more important than video, video is more important than audio, etc.)
than you would by having the JS try to make fine-grained decisions
about what it wants to do. It’s worth noting that the actual
“customizable” policies that are proposed here seem pretty simple.
The idea seems to be not that you would impose policy on the browser
but rather that since you need to implement all the negotiation
logic anyway, you get to implement whatever policy you want.
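To illustrate what a "small number of policies" might look like, here is a rough sketch of a two-policy knob. Everything in it is hypothetical: the policy names, the splitBandwidth function, and especially the kbps thresholds, which are made-up numbers rather than measured codec behavior. It is not taken from JSEP or CU-RTC-Web.

```typescript
// Hypothetical policy names and thresholds, for illustration only.
type DegradePolicy = "prefer-audio" | "prefer-video";

interface RateSplit {
  audioKbps: number;
  videoKbps: number;
}

const AUDIO_FLOOR_KBPS = 16;  // made-up minimum usable audio rate
const AUDIO_CEIL_KBPS = 64;   // made-up rate beyond which audio stops improving
const VIDEO_FLOOR_KBPS = 100; // made-up minimum usable video rate

function splitBandwidth(policy: DegradePolicy, availableKbps: number): RateSplit {
  if (policy === "prefer-audio") {
    // Audio takes what it can use; video runs only if enough is left over.
    const audioRate = Math.min(AUDIO_CEIL_KBPS, availableKbps);
    const leftover = availableKbps - audioRate;
    return {
      audioKbps: audioRate,
      videoKbps: leftover >= VIDEO_FLOOR_KBPS ? leftover : 0,
    };
  }
  // prefer-video: audio is pinned at its floor and video gets the rest.
  const audioFloor = Math.min(AUDIO_FLOOR_KBPS, availableKbps);
  return { audioKbps: audioFloor, videoKbps: availableKbps - audioFloor };
}

const squeezedAudioFirst = splitBandwidth("prefer-audio", 120);
const squeezedVideoFirst = splitBandwidth("prefer-video", 120);
```

With these invented thresholds, a 120 kbps link drops video entirely under the first policy and keeps 104 kbps of video under the second. The real thresholds depend on the codecs in use, which is exactly the knowledge the browser is more likely to have than the application.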
Moreover, there’s a real concern that this sort of adaptation will
have to happen in two places: as MS points out, this kind of network
variability is really common and so applications have to handle it.
Unless you want to force every JS calling application in the universe
to include adaptation logic, the browser will need some (potentially
configurable and/or disableable) logic. It’s worth asking whether
whatever logic you would write in JS is really going to be enough
better to justify this design.
REAL-WORLD INTEROPERABILITY
In their blog post today, MS writes about JSEP:
it shows no signs of offering real world interoperability with
existing VoIP phones, and mobile phones, from behind firewalls and
across routers and instead focuses on video communication between
web browsers under ideal conditions. It does not allow an
application to control how media is transmitted on the network.
I wish this argument had been elaborated more, since it seems like
CU-RTC-Web is less focused on interoperability, not more. In
particular, since JSEP is based on existing technologies such as SDP
and ICE, it’s relatively easy to build Web applications which gateway
JSEP to SIP or Jingle signaling (indeed, relatively simple prototypes
of these already exist). By contrast, gatewaying CU-RTC-Web signaling
to either of these protocols would require developing an entire
SDP stack, which is precisely the piece that the MS guys are implicitly
arguing is expensive.
Based on Matthew Kaufman’s mailing list postings, his concern seems to
be that there are existing endpoints which don’t implement some of the
specifications required by WebRTC (principally ICE, which is used to
set up the network transport channels) correctly, and that it will be
easier to interoperate with them if your ICE implementation is written
in JavaScript rather than
baked into the browser. This isn’t a crazy theory, but I think there are
serious open questions about whether it is correct. The basic problem
is that it’s actually quite hard to write a good ICE stack (though
easy to write a bad one). The browser vendors have the resources to
do a good job here, but it’s less clear that random JS toolkits that
people download will actually do that good a job (especially if they
are simultaneously trying to compensate for broken legacy equipment).
The result of having everyone write their own ICE stack might be good
but it might also lead to a landscape where cross-Web application interop
is basically impossible (or where there are islands of noninteroperable
de facto standards based on popular toolkits or even popular toolkit
versions).
A lot of people’s instincts here seem to be based on an environment
where updating the software on people’s machines was hard but
updating one’s Web site was easy. But about half of the population
of browsers (Chrome and Firefox) do rapid auto-updates, so they
actually are generally fairly modern. By contrast, Web applications
often use downrev versions of their JS libraries (I wish I had survey
data here but it’s easy to see just by opening up a JS debugger
on your favorite sites). It’s not at all clear that the
“JS is easy to upgrade/native is hard” dynamic holds up any more.
TOO TIED TO SPECIFIC MEDIA FORMATS AND CODECS
The proposal says:
A successful standard cannot be tied to individual codecs, data
formats or scenarios. They may soon be supplanted by newer versions,
which would make such a tightly coupled standard obsolete just as
quickly. The right approach is instead to support multiple media
formats and to bring the bulk of the logic to the application layer,
enabling developers to innovate.
I can’t make much sense of this at all. JSEP, like the standards that
it is based on, is agnostic about the media formats and codecs that
are used. There’s certainly nothing in JSEP that requires you to use
VP8 for your video codec, Opus for your audio codec, or anything
else. Rather, two conformant JSEP implementations will converge on a
common subset of interoperable formats. This should happen
automatically without Web application intervention.
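As a rough illustration of "converging on a common subset", the core of the negotiation amounts to an ordered intersection. The function name commonCodecs and the codec lists are mine, not anything defined by JSEP, and a real offer/answer exchange also carries clock rates, parameters, and payload type numbers that this sketch ignores.

```typescript
// Return the codecs both sides support, in the offerer's preference order.
// Names are compared case-insensitively.
function commonCodecs(offered: string[], supported: string[]): string[] {
  const supportedLower = supported.map(c => c.toLowerCase());
  return offered.filter(c => supportedLower.indexOf(c.toLowerCase()) !== -1);
}

const browserA = ["opus", "G722", "PCMU"]; // A's preference order
const browserB = ["PCMU", "opus"];
const agreed = commonCodecs(browserA, browserB);

// If both browsers later ship a new codec (hypothetical name), it is picked
// up without the Web application changing at all.
const agreedLater = commonCodecs(
  ["newcodec"].concat(browserA),
  browserB.concat(["NewCodec"])
);
```

Here agreed comes out as opus followed by PCMU, and agreedLater puts the new codec first simply because the offering browser listed it first.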
Arguably, in fact, CU-RTC-Web is *more* tied to a given codec because
the codec negotiation logic is implemented either on the server or in
the JavaScript. If a browser adds support for a new codec, the Web
application needs to detect that and somehow know how to prioritize it
against existing known codecs. By contrast, when the browser
manufacturer adds a new codec, he knows how it performs compared to
existing codecs and can adjust his negotiation algorithms accordingly.
Moreover, as discussed below, JSEP provides (somewhat clumsy)
mechanisms for the user to override the browser’s default choices.
These mechanisms could probably be made better within the JSEP
framework.
Based on Matthew Kaufman’s interview with Janko Rogers, it seems like
this may actually be about the proposal to have a mandatory-to-implement
video codec (the leading candidates seem to be H.264 or
VP8). Obviously, there have been a lot of arguments about whether
such a mandatory codec is required (the standard argument in favor
of it is that then you know that any two implementations have
at least one codec in common), but this isn’t really a matter
of “tightly coupling” the codec to the standard. To the contrary,
if we mandated VP8 today and then next week decided to mandate
H.264 it would be a one-line change in the specification.
In any case, this doesn’t seem like a structural argument about
JSEP versus CU-RTC-Web. Indeed, if IETF and W3C decided to ditch
JSEP and go with CU-RTC-Web, it seems likely that this wouldn’t
affect the question of mandatory codecs at all.
THE INCONVENIENCE OF SDP EDITING
Probably the strongest point that the MS authors make is that if the
API doesn’t explicitly support doing something, the situation is kind
of ugly:
In particular, the negotiation model of the API relies on the SDP
offer/answer model, which forces applications to parse and generate
SDP in order to effect a change in browser behavior. An application
is forced to only perform certain changes when the browser is in
specific states, which further constrains options and increases
complexity. Furthermore, the set of permitted transformations to SDP
are constrained in non-obvious and undiscoverable ways, forcing
applications to resort to trial-and-error and/or browser-specific
code. All of this added complexity is an unnecessary burden on
applications with little or no benefit in return.
What this is about is that in JSEP you call CreateOffer() on a
PeerConnection in order to get an SDP offer. This doesn’t actually
change the PeerConnection state to accommodate the new offer; instead,
you call SetLocalDescription() to install the offer. This gives
the Web application the opportunity to apply its own preferences
by editing the offer. For instance, it might delete a line containing
a codec that it didn’t want to use. Obviously, this requires a lot
of knowledge of SDP in the application, which is irritating to say
the least, for the reasons in the quote above.
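To make the editing step concrete, here is a minimal sketch of the kind of string surgery an application might do between CreateOffer() and SetLocalDescription() to drop one codec. The function name removeCodec and the sample SDP fragment are invented for illustration. The sketch also assumes payload type numbers mean the same thing in every media section, which real SDP does not guarantee, so treat it as a demonstration of why this is irritating rather than as production code.

```typescript
// Remove one codec from an SDP blob by deleting its rtpmap/fmtp/rtcp-fb
// attribute lines and its payload type from each m= line.
// Simplification: payload types are treated as global across m= sections.
function removeCodec(sdp: string, codecName: string): string {
  const lines = sdp.split(/\r?\n/).filter(l => l.length > 0);

  // Collect payload types whose rtpmap names the unwanted codec,
  // e.g. "a=rtpmap:0 PCMU/8000" yields "0".
  const unwanted: string[] = [];
  for (const line of lines) {
    const rtpmap = /^a=rtpmap:(\d+) ([^\/]+)\//.exec(line);
    if (rtpmap !== null && rtpmap[2].toLowerCase() === codecName.toLowerCase()) {
      unwanted.push(rtpmap[1]);
    }
  }

  const kept: string[] = [];
  for (const line of lines) {
    const attr = /^a=(?:rtpmap|fmtp|rtcp-fb):(\d+) /.exec(line);
    if (attr !== null && unwanted.indexOf(attr[1]) !== -1) {
      continue; // drop attributes belonging to the removed payload type
    }
    if (line.indexOf("m=") === 0) {
      // m=<media> <port> <proto> <fmt> ... : strip removed payload types.
      const parts = line.split(" ");
      const head = parts.slice(0, 3);
      const fmts = parts.slice(3).filter(pt => unwanted.indexOf(pt) === -1);
      kept.push(head.concat(fmts).join(" "));
      continue;
    }
    kept.push(line);
  }
  return kept.join("\r\n") + "\r\n";
}

// Made-up, abbreviated offer; a real one has many more lines.
const offerSdp = [
  "v=0",
  "o=- 0 0 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=audio 9 RTP/SAVPF 111 0 8",
  "a=rtpmap:111 opus/48000/2",
  "a=fmtp:111 minptime=10",
  "a=rtpmap:0 PCMU/8000",
  "a=rtpmap:8 PCMA/8000",
].join("\r\n") + "\r\n";

const editedSdp = removeCodec(offerSdp, "PCMU");
```

Even for this simple case the application has to know about m= lines, payload types, and which attribute lines hang off them.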
The major mitigating factor is that the W3C/IETF WG members intend to
allow most common manipulations to be made through explicit settings
parameters, so that only really advanced applications need to know
anything about SDP at all. Obviously opinions vary about how good a
job they have done, and of course it’s possible to write libraries
that would make this sort of manipulation easier. It’s worth noting
that there has been some discussion of extending the W3C APIs to have
an explicit API for manipulating SDP objects rather than just editing
the string versions (perhaps by borrowing some of the primitives in
CU-RTC-Web). Such a change would make some things easier while not
really representing a fundamental change to the JSEP model. However,
it’s not clear if there are enough SDP-editing tasks to make this
worthwhile.
With that said, in order to have CU-RTC-Web interoperate with
existing SIP endpoints at all you would need to know far more about
SDP than would be required to do most anticipated transformations in a
JSEP environment, so it’s not like CU-RTC-Web frees you from SDP if
you care about interoperability with existing equipment.
SUPPORT FOR NEW/INNOVATIVE APPLICATIONS
Finally, the MSFT authors argue that CU-RTC-Web is more flexible
and/or less brittle than JSEP:
On the other hand, implementing innovative, real-world applications
like security consoles, audio streaming services or baby monitoring
through this API would be unwieldy, assuming it could be made to
work at all. A Web RTC standard must equip developers with the
ability to implement all scenarios, even those we haven’t thought
of.
Obviously the last sentence is true, but the first sentence provides
scant support for the claim that CU-RTC-Web fulfills this requirement
better than JSEP. The particular applications cited here, namely audio
streaming, security consoles, and baby monitoring, seem not only
doable with JSEP, but straightforward. In particular, security
consoles and baby monitoring just look like one way audio and/or video
calls from some camera somewhere. This seems like a trivial subset of
the most basic JSEP functionality. Audio streaming is, if anything,
even easier. Audio streaming from servers already exists without any
WebRTC functionality at all, in the form of the audio tag, and audio
streaming from client to server can be achieved with the combination
of getUserMedia and WebSockets. Even if you decided that you wanted to
use UDP rather than WebSockets, audio streaming is just a one-way
audio call, so it’s hard to see that this is a problem.
In a message to the W3C WebRTC mailing list, Matthew Kaufman mentions the
use case of handling page reload:
An example would be recovery from call setup in the face of a
browser page reload… a case where the state of the browser must be
reinitialized, leading to edge cases where it becomes impossible with
cases (because without an offer one cannot generate an answer, and
once an offer has been generated one must not generate another offer
until the first offer has been answered, but in either case there is
no longer sufficient information as to how to proceed).
This use case, often called “rehydration”, has been studied a fair bit
and it’s not entirely clear that there is a convenient solution with
JSEP. However, the problem isn’t the offer/answer state, which is actually
easily handled, but rather the ICE and cryptographic state, which
are just as troublesome with CU-RTC-Web as they are with JSEP
[for a variety of technical reasons, you can’t just reuse the
previous settings here.] So, while rehydration is an issue, it’s
not clear that CU-RTC-Web makes matters any easier.
This argument, which should be the strongest of MS’s arguments, feels
rather like the weakest. Given how much effort has already gone into
JSEP, both in terms of standards and implementation, if we’re going to
replace it with something else that something else should do something
that JSEP can’t, not just have a more attractive API. If MS can’t
come up with any use cases that JSEP can’t accomplish, and if in fact
the use cases they list are arguably more convenient with JSEP than
with CU-RTC-Web, then that seems like a fairly strong argument that we
should stick with JSEP, not one that we should replace it.
What I’d like to see Microsoft do here is describe some applications
that are really a lot easier with CU-RTC-Web than they are with
JSEP. Depending on the details, this might be a more or less convincing
argument, but without some examples, it’s pretty hard to see
what considerations other than aesthetic would drive us towards
CU-RTC-Web.
Thanks to Cullen Jennings, Randell Jesup, Maire Reavy, and
Tim Terriberry for early comments on this draft.