Call your friends and family from your computer—a look at the future or the present? With Linux, the future is now.
The OpenPhone Project (http://www.openphone.org/) has a simple goal—to phone-enable every computer on the planet. If a computer can browse the Web and play audio from Internet radio stations, it should be able to place and receive phone calls, too. The basic technology is available today. The OpenPhone Project aims at fostering the development of the software that can make this a reality.
Internet telephony, or voice-over IP (VoIP) technology, has come a long way in the last few years, and the path is cluttered with a confusing array of ever-changing standards. I will aim to provide an introduction to the technology and standards on Internet telephony with the hope that more people will participate in the OpenPhone Project. Internet Telephony is one of the most exciting and fastest-growing areas in today's telecommunications world—and it's perfect for Linux!
Phone-enable every computer: what does that mean? It means every computer should be able to act as a phone, ideally a very smart and programmable phone. There are several practical ways to accomplish this. The simple way is to mimic some of the capabilities of a phone with a computer's normal resources: for example, use the sound card with a microphone and speakers to communicate, and use either a pop-up dialog or a sound file to indicate ringing. Another, much more sophisticated technique is to use a telephony interface board that allows you to plug a normal phone into your computer. Given today's inexpensive and powerful phones (especially cordless phones) and the availability of low-cost telephony interface cards, this is the approach favored by the OpenPhone Project.
Telephony interfaces come in a wide and confusing array of types and capabilities. They seem to fall into two basic categories: high-density multiline digital interface cards (T-1 or better) and low-density analog cards. Since most folks don't have a T-1 circuit in their home, the OpenPhone Project focuses on the low-density analog cards. These cards are no more expensive than a decent video card, and provide a whole host of critical features that make Internet telephony work. The simplest thing they do is let you plug a normal, inexpensive analog phone into your computer and provide full control over the ringing and audio. However, they also provide the hardware-based audio compression so critical to voice quality. This hardware technology will be discussed in more detail below.
The OpenPhone Project is not Linux-specific, although it certainly has a strong Linux leaning. Like fax machines, the usefulness of Internet phones depends upon how many other devices are out there that they can interoperate with. If a Linux-based OpenPhone can only call other Linux computers, it's of limited value. However, if an OpenPhone can call any other computer regardless of operating system, or any other phone anywhere in the world—that's powerful! This is the goal of the OpenPhone Project.
At the time of this writing, the OpenPhone Project is using the telephony interface boards made by Quicknet Technologies, Inc. The Internet PhoneJACK is available in PCI (Peripheral Component Interconnect) and ISA (Industry Standard Architecture) bus versions and provides an RJ-11 interface into which any normal analog phone can be plugged. It also has headset/microphone and handset jacks. The Internet LineJACK has the RJ-11 POTS (plain old telephone service) port and an additional PSTN (public switched telephone network) port for use as a gateway to the normal phone system. The Internet LineJACK is presently available with only an ISA interface. These low-cost interface boards have Linux and Win32 drivers and let you use a single standard telephone with your computer. More information is available at Quicknet's web site (http://www.quicknet.net/) or in my article in last September's issue of Linux Journal (“Voice-Over IP for Linux”).
Talks are underway with several other hardware vendors to participate in the OpenPhone project. We encourage such vendors to make drivers available for as many operating systems as possible, and join us in making OpenPhone work across all platforms.
Telephony can be thought of as having two major parts: the audio channel used to communicate and the signaling channel(s) used to control the audio channel. In the traditional public switched telephone network (PSTN), the signaling happens on a separate private network owned and operated by the telephone companies. This separate signaling network uses a protocol called Signaling System 7 (SS7); it is used to control the setup and ending (teardown) of calls, using the switched circuits in the system.
The audio-channel portion of the PSTN is mostly composed of two parts: the local loop, and the central office (CO) equipment that links all the local loops together. The local loop is the pair of copper wires that comes into your house or business—the analog line. The CO equipment is made up of high-speed digital links; it is beyond the scope of this article. The local loop uses analog signals to carry your voice to the CO, where it is digitized and sent to the CO on the other end of the call. The other CO takes the digital signal, converts it back into an analog signal and sends it down the analog line to the called party.
Internet telephony works the same way, except that the digitization process happens at your computer, and the high-speed digital link between end stations is the Internet. Your telephony interface (or sound card) converts the analog signal to digital and sends it in IP packets to the destination. The destination computer converts it back to analog sound signals and plays it out your phone. Simple, right? Ah, but like most things that involve computers, the details are the tricky part.
An Internet phone must carry the audio signal between the parties talking—this is the fundamental requirement. We all know that the Internet can carry high-fidelity audio, since many of us have listened to streaming audio music or radio programs using our Internet connection and PC. Internet telephony is a bit more complicated, though, since it has to do full duplex (two-way) audio in real time, and the human ear is quite sensitive to latency. Too much delay, and the call sounds like it's traveling across a satellite instead of across the Internet!
Bandwidth is also an issue, especially over dial-up lines to the Internet. In the normal PSTN world, the audio on the local loop is digitized into a 64Kbps digital data stream and presented to the phone company CO equipment, which then compresses the data for transport across the phone system backbone. Simple Internet telephony packages use a similar digitization, yielding a bandwidth requirement of 64Kbps in each direction for full-quality voice. That requires 128Kbps of bandwidth, plus extra for signaling and overhead, so this does not work well across a normal dial-up Internet connection that is perhaps 56Kbps.
Luckily, voice compression technology has come a long way and works quite well. It is commonplace to obtain 8:1 (or better) compression ratios using today's voice coders (codecs). Modern Internet telephony interface boards provide these codecs as standard features.
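The arithmetic behind the last two paragraphs can be checked with a few lines of Python. This is only a rough sketch: it counts the audio payload alone, ignores packet headers and signaling, and treats the 56Kbps dial-up figure and the 8:1 ratio from the text as nominal values. The function and constant names are made up for illustration.

```python
# Payload-only bandwidth estimate for a two-way call.
# The figures come from the text; the names are illustrative.

UNCOMPRESSED_KBPS = 64.0   # one direction of PSTN-style digitized voice
DIALUP_KBPS = 56.0         # nominal dial-up link speed


def two_way_kbps(one_way_kbps, compression_ratio=1.0):
    """Audio bandwidth for both directions after compression."""
    return 2 * one_way_kbps / compression_ratio


def fits(link_kbps, needed_kbps):
    """True if the audio payload alone fits within the link."""
    return needed_kbps <= link_kbps


if __name__ == "__main__":
    for ratio in (1.0, 8.0):
        needed = two_way_kbps(UNCOMPRESSED_KBPS, ratio)
        print(f"{ratio:g}:1 compression needs {needed:g}Kbps, "
              f"fits dial-up: {fits(DIALUP_KBPS, needed)}")
```

With no compression the two directions need 128Kbps and do not fit; at 8:1 they need 16Kbps, which leaves room for packet overhead and signaling.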
The normal phone system sets up virtual (or sometimes real) dedicated circuits for the voice data to flow across. The Internet is much more chaotic. On the Internet, packets can take different routes from second to second, and may arrive at their destination out of order, late, not at all or staggered in time. Late packets might as well be lost, since if a packet is not there on time and ready to play, there will be a gap in the audio. Out-of-order packets are probably not useful, either: if a packet is out of order, it's probably late. Even the packets that do arrive are unlikely to do so in a neat and orderly way. Some take a bit longer than average and some a bit less, so the stream is not uniform. This staggering is called jitter. Two techniques are used to deal with these problems: the Real-time Transport Protocol (RTP) and jitter buffers.
Audio packets need to arrive on time and in the correct order. RTP is a user-level protocol that provides a way to encapsulate data into packets time-stamped with enough information to allow the proper playback of audio. The protocol has a companion control protocol (RTCP) that provides a means for the end points to stay informed about the quality of services they are receiving. The complete protocol is described in RFC1889, which can be found at www.ietf.org/rfc/rfc1889.txt. Several implementations of RTP are available under various open-source licenses. See Resources for more information and pointers to those libraries.
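To make the time-stamping concrete, here is a minimal sketch of the 12-byte fixed RTP header laid out in RFC1889: a version/flags byte, a marker/payload-type byte, a 16-bit sequence number, a 32-bit timestamp and a 32-bit source identifier. It leaves out padding, header extensions and contributing-source lists, and the function names are my own rather than those of any particular RTP library.

```python
import struct

RTP_VERSION = 2
HEADER_FORMAT = "!BBHII"   # network byte order, 12 bytes in total


def pack_rtp_header(payload_type, sequence, timestamp, ssrc, marker=False):
    """Build the fixed header (no padding, extension or CSRC entries)."""
    byte0 = RTP_VERSION << 6                       # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack(HEADER_FORMAT, byte0, byte1,
                       sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)


def unpack_rtp_header(packet):
    """Decode the first 12 bytes of a packet into a dictionary."""
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack(
        HEADER_FORMAT, packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


if __name__ == "__main__":
    # Payload type 0 is mu-law audio in the companion profile (RFC1890).
    header = pack_rtp_header(0, sequence=1, timestamp=160, ssrc=0x1234)
    print(unpack_rtp_header(header))
```

The receiver uses the sequence number to spot missing or reordered packets and the timestamp to decide when each frame should be played.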
The Internet is not predictable, and packets sent at a nice, steady rate do not always arrive at the rate they were sent. They can arrive slightly sooner or later than the average latency. If a packet arrives slightly late, the audio device, which is ready to play the next frame of audio, has nothing to play. This causes a discontinuity that degrades the audio quality. In simple applications, the result is a short silent period that makes the voice sound choppy. In more advanced applications, comfort noise or some form of audio blending is used to mask the gap. This can make the voice sound warbled or as if it's under water. These effects can be heard on cell phones as well as Internet phones; packet loss and latency are general problems of digital audio applications. RTP provides a means for the application software to know if packets are out of order, missing or running early or late. The application can then make the appropriate corrections for missing or misordered packets.
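As a sketch of how an application might use RTP sequence numbers for this, the code below classifies each arriving packet as in order, arriving after a gap, or late. It relies only on the fact that the sequence number is 16 bits wide and wraps around; the function names and the three-way classification are illustrative assumptions, not part of the protocol.

```python
def seq_delta(expected, received):
    """Signed distance from expected to received, modulo 2**16."""
    diff = (received - expected) & 0xFFFF
    return diff - 0x10000 if diff >= 0x8000 else diff


def classify(expected, received):
    """Return (status, packets_missing, next_expected)."""
    delta = seq_delta(expected, received)
    if delta == 0:
        return "in order", 0, (received + 1) & 0xFFFF
    if delta > 0:
        # Packets between expected and received are missing (so far).
        return "gap", delta, (received + 1) & 0xFFFF
    # Behind the play point: too late to be useful, or a duplicate.
    return "late or duplicate", 0, expected


if __name__ == "__main__":
    expected = 65534
    for seq in (65534, 65535, 2, 1, 3):
        status, missing, expected = classify(expected, seq)
        print(seq, status, missing)
```

A real receiver would combine this with the jitter buffer so that a packet counted as missing can still be used if it turns up before its play time.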
The best solution for dealing with jitter (short of a perfect transport path, which would eliminate it entirely) is to make sure the audio device never “runs dry”. This requires jitter buffers. These buffers store a small amount of audio at the beginning and then stay a bit ahead of the flow so that there is always an audio frame to play. Several one-way real-time audio/video programs will buffer up many seconds of data before playing any sound, thus ensuring they will always have plenty of data on hand to play. However, every frame buffered adds latency, which is especially relevant to voice calls. If you buffer 90 milliseconds (ms) of data, you add 90ms to the delay between the time the words are spoken and when they are heard. When added to the latency of the Internet itself, this can rapidly become unacceptable. Some people believe 200ms of latency is a good upper limit for what the human ear can tolerate. Given that many Internet locations are 100ms or more (one way) apart, adding 90ms of jitter buffer latency accounts for a significant fraction of the acceptable delay. A delicate balance lies between the need to jitter buffer and the need to reduce latency.
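The trade-off can be seen in a deliberately simple fixed-depth jitter buffer. This is a sketch under stated assumptions: the class and method names are hypothetical, sequence-number wraparound is ignored, and playback starts only once the chosen number of frames has been collected. With three 30ms frames it adds the 90ms used in the example above.

```python
class JitterBuffer:
    """Fixed-depth jitter buffer keyed by packet sequence number."""

    def __init__(self, depth_frames, frame_ms):
        self.depth = depth_frames
        self.frame_ms = frame_ms
        self.frames = {}        # sequence number -> audio frame
        self.next_seq = None    # next sequence number to play
        self.primed = False     # True once the initial fill is complete

    @property
    def added_latency_ms(self):
        """Delay introduced by holding depth frames before playing."""
        return self.depth * self.frame_ms

    def push(self, seq, frame):
        """Store an arriving frame; return False if it arrived too late."""
        if self.next_seq is None:
            self.next_seq = seq
        if seq < self.next_seq:
            return False
        self.frames[seq] = frame
        if len(self.frames) >= self.depth:
            self.primed = True
        return True

    def pop(self):
        """Called once per frame period; None means a gap to conceal."""
        if not self.primed:
            return None
        frame = self.frames.pop(self.next_seq, None)
        self.next_seq += 1
        return frame


if __name__ == "__main__":
    jb = JitterBuffer(depth_frames=3, frame_ms=30)
    print("added latency:", jb.added_latency_ms, "ms")
```

An adaptive buffer would grow or shrink the depth as the measured jitter changes, which is where the balance between smooth playback and low delay is actually struck.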
Jitter buffer techniques are one of the areas where I think the Open Source community can contribute significantly. There is much thought and experimentation going on now to find algorithms and techniques to use adaptive jitter buffers in two-way real-time audio streams. I suspect the cumulative efforts of the Open Source community will find some excellent solutions to this in the next year. I hope the OpenPhone Project can be a catalyst for making it happen sooner.
The standard PSTN digitization technique is described by the ITU G.711 specification. This document describes an 8kHz sampling rate using 8 bits per sample, for an effective 64Kbps data rate. There are two variants of this technique: A-law and mu-law. These are companding (logarithmic scaling) schemes that take into account the sensitivity of the human ear and allow more efficient use of the 8-bit encoding space for recording the human voice. Both are in wide use and are considered standard codecs.
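The following sketch shows the data-rate arithmetic and the idea behind mu-law scaling. It uses the textbook continuous mu-law formula with mu = 255, not the bit-exact segmented encoding that G.711 specifies, so treat it as an illustration of the principle only. The function names are my own.

```python
import math

MU = 255.0
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8


def bit_rate_bps():
    """8000 samples per second at 8 bits each."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE


def compress(x):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


if __name__ == "__main__":
    print("bit rate:", bit_rate_bps(), "bps")
    for x in (0.01, 0.1, 0.5, 1.0):
        print(f"linear {x:<5} -> companded {compress(x):.3f}")
```

A sample at 1 percent of full scale maps to more than 20 percent of the companded range, which is how quiet sounds get a larger share of the 256 available codes.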
However, 64Kbps is excessive for Internet telephony. Two-way audio encoded in this manner yields 128Kbps of audio alone (not counting signaling or overhead), making it impractical for use across a dial-up link, and wasteful of bandwidth even on a LAN or WAN link. Voice compression is the answer, and there are several options from which to choose.
Audio codecs can either be implemented in software or provided by the telephony interface card using an on-board DSP. On a lightly loaded PC, the extra processing load incurred by the encoding is not particularly burdensome, although it can be on more heavily utilized machines. Software-only solutions can add some latency, since the encoding usually happens in a user-space program and is thus at the whim of system load and the scheduler. Internet telephony cards use on-board DSPs to perform the encoding, virtually eliminating the load on the host CPU and the associated latencies.
Perhaps the most widely used codec in Internet telephony is G.723.1. This codec provides either a 6.3 or 5.3Kbps data stream of packets that hold 30ms of audio each. The encoding requires 7.5 milliseconds of lookahead, which, when combined with the 30-millisecond frame, totals 37.5ms of coding delay. This is the minimum theoretical latency using this codec. Of course, real-world latencies will include the time on the Internet and application-level processing at both ends (and the jitter buffer—mustn't forget that).
G.723.1 is patented technology. The patent holders charge royalties for its use, making it impossible to use in open-source software. However, most telephony boards (including the hardware used by the OpenPhone Project) include this codec as part of the board price. Aside from other technical advantages of telephony interface boards over sound cards, the inclusion of licensed compression codecs on the card is perhaps the single best reason we need these cards. With the codec in hardware, we can write open-source software and not violate any licenses. I'd like to point out that the G.723.1 codec is very different from the G.723 codec, a much higher bandwidth codec for which source code is widely available, but which is not often used because of its high bandwidth requirements.
Rapidly gaining in popularity are the G.729 and G.729A codecs. These codecs use an 8kHz sampling rate, a 10ms audio frame and a 5ms lookahead buffer, for a total of 15ms of coding delay. Because the frames are only 10ms long, these codecs suffer less from packet loss: the software has less to recover when any one packet is lost. G.729A is a version of the codec that uses fewer DSP resources, making it easier to implement on lower-performance (cheaper) DSP chips. The audio quality these codecs can deliver is astounding. I recently experienced a LAN-based call using the G.729A codec while standing next to the person calling me. I could detect no perceptible delay between the direct path (his mouth to my ear) and the link across the network. If I had not known otherwise, I would have sworn it was a normal PSTN call. This codec is the future of Internet telephony.
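The delay and frame-size figures for these codecs are easy to tabulate. In the sketch below, the frame and lookahead values are the ones quoted in the text; the 8Kbps rate for G.729 comes from the ITU specification rather than from this article. The 100ms network delay and 90ms jitter buffer are simply the example numbers used earlier, and all names are illustrative.

```python
# name: (bit rate in Kbps, frame length in ms, lookahead in ms)
CODECS = {
    "G.723.1 at 6.3Kbps": (6.3, 30.0, 7.5),
    "G.723.1 at 5.3Kbps": (5.3, 30.0, 7.5),
    "G.729 at 8Kbps": (8.0, 10.0, 5.0),   # rate per the ITU spec
}


def coding_delay_ms(frame_ms, lookahead_ms):
    """Minimum delay added by the encoder itself."""
    return frame_ms + lookahead_ms


def bits_per_frame(kbps, frame_ms):
    """Kbps times ms gives bits (the factors of 1000 cancel)."""
    return kbps * frame_ms


def one_way_latency_ms(codec, network_ms, jitter_buffer_ms):
    """Coding delay plus network transit plus jitter buffering."""
    _, frame_ms, lookahead_ms = CODECS[codec]
    return coding_delay_ms(frame_ms, lookahead_ms) + network_ms + jitter_buffer_ms


if __name__ == "__main__":
    for name, (kbps, frame_ms, lookahead_ms) in CODECS.items():
        total = one_way_latency_ms(name, network_ms=100.0, jitter_buffer_ms=90.0)
        print(f"{name}: {round(bits_per_frame(kbps, frame_ms))} bits/frame, "
              f"{coding_delay_ms(frame_ms, lookahead_ms)}ms coding delay, "
              f"{total}ms one way")
```

With those example numbers both codecs land just above the 200ms figure mentioned earlier, which shows how little room is left for the jitter buffer on a long Internet path.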
In-band signaling includes all the tones you normally hear in the course of a call, including ringing, busy signal, fast busy and the all-important dual-tone multi-frequency (DTMF) digits. These tones are a normal, expected part of using a telephone, but unfortunately, many do not compress well with the algorithms discussed above. One alternative, the one chosen by the OpenPhone Project, is to pass these signals as a separate data type within the real-time audio stream. This has the advantage of ensuring that the signals are reproduced accurately at the other end of the connection regardless of the audio codec in use, and it reduces the bandwidth needed to convey the data.
This technique is described in an Internet Draft (see Resources). It basically specifies a new RTP data payload type; the application layer simply sends the data in the stream along with the normal audio. One very tricky aspect needing more work is the code required to remove the actual audio signals from the audio data stream prior to compression. Should these tones and signals get compressed, the codecs will distort these signals at the far end, reducing the chances of properly detecting the signal or tone. It's far better to filter these signals out prior to compression and to pass the signal across the network directly, allowing the other end to reproduce the signal with perfect accuracy.
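As an illustration of what such a payload can look like, the sketch below packs a DTMF digit into the 4-byte event format that this line of work was later published as in RFC 2833: an event code, an end flag plus volume, and a duration in timestamp units. The draft available at the time of writing may have differed in detail, so take the layout as an assumption; the function names are hypothetical, and only the sixteen DTMF events are handled.

```python
import struct

DTMF_DIGITS = "0123456789*#ABCD"          # event codes 0 through 15
DTMF_EVENTS = {digit: code for code, digit in enumerate(DTMF_DIGITS)}


def pack_dtmf_event(digit, duration, volume=10, end=False):
    """Build the 4-byte event payload for one DTMF digit.

    duration is in RTP timestamp units (800 units is 100ms at 8kHz);
    volume is the tone level in -dBm0, 0 to 63.
    """
    flags_volume = (int(end) << 7) | (volume & 0x3F)   # E bit, R=0, volume
    return struct.pack("!BBH", DTMF_EVENTS[digit], flags_volume,
                       duration & 0xFFFF)


def unpack_dtmf_event(payload):
    """Decode a DTMF event payload into (digit, end, volume, duration)."""
    event, flags_volume, duration = struct.unpack("!BBH", payload[:4])
    return (DTMF_DIGITS[event], bool(flags_volume & 0x80),
            flags_volume & 0x3F, duration)


if __name__ == "__main__":
    payload = pack_dtmf_event("5", duration=800, end=True)
    print(payload.hex(), unpack_dtmf_event(payload))
```

Four bytes per update is far less than the compressed audio it replaces, and the far end can regenerate a clean tone from the event code alone.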
In the traditional PSTN world, the phone companies have a private dedicated network to do all the signaling. Every phone system in the world that wants to interoperate with the rest of the world must use the SS7 network and play by its rules. On the Internet, though, there is no separate dedicated network—control and data signals use the same network. How does signaling work, then? It's not very simple. There are several different ways to get the job done, and several different “standards” in place to provide guidance. This is a simplification of the issues and standards. I refer you to the Resources section for places to obtain more detailed information.
Several signaling standards for Internet telephony are currently in use; H.323, SIP and MGCP are the predominant ones. I cannot hope to provide more than a quick introduction to these protocols here, but I can give an overview and pointers to where to learn more.
H.323 is a family of protocols established by the International Telecommunication Union (ITU). It is highly complex, but covers most of what one would want to do with audio-visual conferencing or calling over networks. H.323 arose out of the telecommunications world, not from the Internet world. It is a binary protocol that uses ASN.1 notation/encoding for message passing. It is presently the most widely used of the three protocols, and many commercial packages use it. H.323 is so widely deployed that many feel new VoIP applications must support the protocol. However, H.323 is so complicated that interoperability between different implementations is not good. For better or worse, Microsoft's NetMeeting product seems to be a common benchmark for measuring interoperability. Since Microsoft makes NetMeeting available for free, it's easily available, and if a product can interoperate with it, you are assured a wide user base. There are several excellent open-source projects working to provide this protocol to the community, most notably the OpenH323 Project. These folks have a working protocol stack now capable of making a call to a NetMeeting client on a Win32 machine. This is a tremendous success, and I hope to see even more improvements as this code is used in more and more projects. The OpenPhone Project will use the OpenH323 libraries and the slimmer “Simple Endpoint Terminal” code that Vovida Networks derived from it. See Resources for more information and places where you can obtain that code.
The Session Initiation Protocol (SIP) is described in RFC-2543. This IETF protocol arose from the Internet community; it has a feel not too different from HTTP and similar protocols. It is a text-based protocol that uses fairly simple commands to get the job done. It is much more oriented towards telephony services, whereas H.323 provides details on full multimedia audio-visual services. SIP is growing in popularity because of its relative simplicity and ease of implementation. A few links to more information on SIP are in the Resources section. At the time of this writing, I am unaware of an open-source implementation of SIP, although there has been much talk about starting such a project. The OpenPhone Project would very much like to see this project, and perhaps we can facilitate getting it started. We will certainly provide a home for it if anyone is interested.
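To give a feel for how HTTP-like SIP is, here is a minimal sketch that builds and inspects a bare INVITE request as plain text. It includes only the request line and the basic headers (Via, From, To, Call-ID, CSeq); a real call setup would also carry a session description in the body and follow the full rules of RFC-2543. The addresses, the Call-ID and the function names are invented for the example.

```python
def build_invite(caller, callee, call_id, via_host, cseq=1):
    """Return a bare SIP INVITE request with an empty body."""
    lines = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {via_host}",
        f"From: <sip:{caller}>",
        f"To: <sip:{callee}>",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} INVITE",
        "Content-Length: 0",
    ]
    # Header lines end with CRLF; an empty line ends the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_start_line(message):
    """Split the first line into (method, request URI, version)."""
    method, uri, version = message.split("\r\n", 1)[0].split(" ")
    return method, uri, version


if __name__ == "__main__":
    request = build_invite("alice@example.com", "bob@example.org",
                           "12345@pc1.example.com", "pc1.example.com")
    print(request)
    print(parse_start_line(request))
```

Because the whole message is readable text, it can be built with string formatting and debugged with nothing more than a packet dump, which is a large part of SIP's appeal compared with ASN.1-encoded H.323.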
The Media Gateway Control Protocol (MGCP) is a protocol and API for controlling voice gateways from a centralized server or “Call Agent”. The protocol is text-based and relatively straightforward. The complete protocol is described in an IETF draft (see Resources for link information).
This protocol is finding wider acceptance and is thought by many to be superior to H.323. It can nevertheless interoperate with H.323, since the Call Agent can act as an H.323 Gatekeeper. This is probably where the future of VoIP call control is going, and the OpenPhone Project intends to use this protocol as the default wherever possible. An open-source implementation of MGCP is available from Vovida Networks (see Resources), and several commercial versions are available. However, since the specification is not ratified and is changing slightly (but frequently), there is no assurance at this time that two different MGCP implementations will work together. This is an area where open source can make a huge contribution to Internet telephony. Vovida's MGCP implementation is released under the LGPL, which allows for commercial use without releasing the associated source code for the non-open portions of the application. With polish, such an open-source MGCP could be used in a wide variety of applications and systems, assuring interoperability by virtue of using the same base code for the protocol. The OpenPhone Project will be using the Vovida code for its MGCP control with the hope of extending and enhancing it to stay current with the ratified MGCP specification.
We envision a basic application that provides a framework for Internet telephony using plug-in modules to accomplish the various tasks to be performed. For example, one plug-in module would provide the signaling tasks, and another would provide the audio tasks (including use of RTP and jitter buffers). We envision an H.323 module, a SIP module and an MGCP module—the user could select which one to use based on the interoperability requirements. New modules could be plugged in as needed to evaluate different RTP/jitter buffer techniques. As improvements are made in signaling or audio-transport modules, all the user has to do is drop in the new module.
All that is needed to make this approach viable is a common API for applications to use to perform basic high-level functions. The modules would all provide those API functions; the appropriate module would be used to provide the actual functionality. Since many people refer to the signaling and audio code as a “stack”, we call this the Stack Adaption Layer (SAL). The SAL is a commonly defined and adopted API that will allow the application developer to focus on the functionality of the program, and the stack developers to focus on the detailed lower-level implementation.
To extend the concept down a layer and provide platform and hardware independence, we envision a Hardware Adaption Layer (HAL) that provides a set of common functions for controlling and using the hardware. This allows the signaling/audio stacks and the SAL to work seamlessly on top of the HAL code, regardless of which Internet telephony card is in use at the hardware level. This approach will allow us to realize the goal of true cross-platform multi-vendor interoperability.
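One way to picture the two layers is as a pair of abstract interfaces with interchangeable implementations behind them. The sketch below is purely hypothetical: the SAL and HAL specifications were still being written, so every class, method and registry name here is an assumption made for illustration, and the "card" and "stack" are stand-ins that touch no real hardware or network.

```python
from abc import ABC, abstractmethod


class TelephonyHardware(ABC):
    """Hypothetical HAL interface offered by any telephony card driver."""

    @abstractmethod
    def set_ringing(self, on): ...

    @abstractmethod
    def read_frame(self): ...

    @abstractmethod
    def write_frame(self, frame): ...


class SignalingStack(ABC):
    """Hypothetical SAL interface offered by any signaling module."""

    @abstractmethod
    def place_call(self, address): ...

    @abstractmethod
    def hang_up(self): ...


class LoopbackCard(TelephonyHardware):
    """Stand-in hardware that plays back whatever was written to it."""

    def __init__(self):
        self.ringing = False
        self._frames = []

    def set_ringing(self, on):
        self.ringing = on

    def read_frame(self):
        return self._frames.pop(0) if self._frames else None

    def write_frame(self, frame):
        self._frames.append(frame)


class DummyStack(SignalingStack):
    """Stand-in signaling module that only records the call state."""

    protocol = "dummy"

    def __init__(self):
        self.active_call = None

    def place_call(self, address):
        self.active_call = address
        return f"{self.protocol}: calling {address}"

    def hang_up(self):
        self.active_call = None


STACKS = {}   # plug-in registry: module name -> stack class


def register_stack(name, stack_class):
    STACKS[name] = stack_class


class Phone:
    """Application code that sees only the two abstract interfaces."""

    def __init__(self, hardware, stack_name):
        self.hardware = hardware
        self.stack = STACKS[stack_name]()

    def call(self, address):
        return self.stack.place_call(address)


register_stack("dummy", DummyStack)
```

Swapping in an H.323, SIP or MGCP module would then mean registering a different class under a different name, with no change to the application or to the hardware driver.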
Design specifications for the SAL and HAL layers are in active development as of this writing, and by the time of publication there should be several white papers and some reference code available on our web site. We encourage active participation and have started a mailing list devoted to the project. You can join this majordomo-hosted list by sending a “subscribe” message to developers@openphone.org.
Internet telephony has many intricate pieces that work together to make it function well. There are programs available now that work, but not at a level that is truly useful. Most of these implementations fall short because they don't use modern compression codecs, or because they don't use something like RTP and good jitter-buffer techniques to control the audio stream. Most also lack a standardized signaling protocol that provides interoperability with other programs. The OpenPhone Project aims to provide a new, highly flexible framework that uses plug-in modules to provide the components discussed above. It will be based on inexpensive, easily available hardware that can be used in normal, commonly available computers. OpenPhone will use modern techniques to provide near-toll-quality calls between any two phone-enabled computers, and hopefully will foster growth and acceptance of Internet telephony to the point that all computers are phone-capable.