2.1. A Brief Introduction to the History and Architecture of SIP

The Session Initiation Protocol (SIP, RFC 3261) is a backbone of today’s VoIP networks. Telephony devices use its “language” to find each other, signal who is calling whom, negotiate which audio and video codecs to use, and more. The telephony devices are typically SIP desk phones, but they may also be smartphones, softphones, or massive PSTN gateways with PSTN infrastructure and users behind them. In between, there are intermediary SIP network devices that help locate the end-user devices, perform Call Admission Control, often compensate for various imperfections of the end devices, and perform other useful functions. The ABC Session Border Controller is one such device; other kinds of SIP proxy servers, Back-to-Back User Agents, specialized application servers, and more are also common.

VoIP began to reach the market in the mid-1990s. By then the Internet had established itself as a consumer product. The number of users buying PCs and subscribing to an Internet Service Provider (ISP) for dial-up access was increasing exponentially. While the Internet was mostly used for exchanging email, text chatting, and distributing information, VoIP services based on proprietary solutions as well as on H.323 started to gain some popularity. The Internet Engineering Task Force (IETF), a standardization organization, began to devise its own protocol suite. Some protocols already existed by then. The Real-Time Transport Protocol (RTP, RFC 1889) enabled the exchange of audio and video data. The Session Description Protocol (SDP, RFC 2327) enabled the negotiation and description of multimedia data to be used in a communication session. The first applications, often open source, for sending and receiving real-time audio and video data emerged. A signaling protocol was missing, however.

In those days, the procedure for establishing a VoIP call between two users based on the IETF standards would look as follows: The caller starts his audio and video applications at a certain IP address and port number. The caller then either calls the callee over the phone or sends him an email to inform him about the IP address and port number as well as the audio and video compression types. The callee then starts his own audio and video applications and informs the caller about his IP address and port number. While this approach was acceptable for a couple of researchers wanting to talk over a long distance or for demonstrating some research on QoS, it was clearly not acceptable for the average Internet user.

The Session Initiation Protocol (SIP, RFC 3261) was the IETF community’s attempt to provide a signaling protocol that not only enables phone calls but can also be used to initiate any kind of communication session. SIP has been considered for audio and video calls as well as for setting up a gaming session or controlling a coffee machine.

The SIP specifications describe three types of components: user agents (UA), proxies, and registrar servers. The UA can be the VoIP application used by the user, e.g., a VoIP phone or a software application. A VoIP gateway, which enables VoIP users to communicate with users in the public switched telephone network (PSTN), and an application server, e.g., a multi-party conferencing server or a voicemail server, are also implemented as user agents. The registrar server maintains a location database that binds the users’ VoIP addresses to their current IP addresses. The proxy provides the routing logic of the VoIP service. When a proxy receives a SIP request from a user agent or another proxy, it also applies service-specific logic, such as checking the user’s profile and whether the user is allowed to use the requested services. The proxy then either forwards the request to another proxy or user agent, or rejects the request by sending a negative response.

While the server roles prescribed by the SIP specification are functional, actual implementations found in networks tend to integrate multiple roles in one server product. A registrar is often co-located with a proxy server so that they can share the user-location database. A server can also present itself as a user agent to both sides of a signaling session so that it can manipulate SIP messages more extensively than the proxy specification would permit.

Every SIP signaling transaction consists of a request and one or more responses. The three most commonly used request types are REGISTER, INVITE and BYE. The REGISTER request makes a SIP phone’s address known to a SIP server so that the server knows where to forward incoming SIP requests. The INVITE request initiates a dialog between two users. A BYE request terminates this dialog. Responses can be either final or provisional. A final response can indicate that a request was successfully received and processed by the destination. Alternatively, a final response can indicate that the request could not be processed by the destination or by some proxy in between, or that the session could not be established for some reason. Provisional responses indicate that the session establishment is in progress, e.g., that the destination phone is ringing.
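The distinction between provisional and final responses follows from the first digit of the status code. The following is a minimal sketch, assuming the response classes defined in RFC 3261 (1xx provisional, 2xx to 6xx final); the function name classify_status is purely illustrative and not part of any product API.

```python
# Illustrative helper (hypothetical name): map a SIP status code to
# (is_final, response_class) using the classes from RFC 3261.
RESPONSE_CLASSES = {
    1: "provisional",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
    6: "global failure",
}


def classify_status(code):
    """Return (is_final, class_name) for a 3-digit SIP status code."""
    if not 100 <= code <= 699:
        raise ValueError(f"not a valid SIP status code: {code}")
    # Only 1xx responses are provisional; everything else ends the transaction.
    is_final = code >= 200
    return is_final, RESPONSE_CLASSES[code // 100]


if __name__ == "__main__":
    for code in (100, 180, 200, 486, 603):
        print(code, classify_status(code))
```

For example, 180 (Ringing) is classified as provisional, whereas 200 (OK) and 486 (Busy Here) are both final.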

In general, one can distinguish between three types of SIP message exchanges: registrations, dialogs, and out-of-dialog transactions.


Figure 1: SIP Call flow

A SIP registration enables a user agent to register its current address, for example its IP address, with the registrar. This enables the registrar to establish a correlation between the user agent’s permanent address, e.g., sip:user@frafos.com, and the user agent’s current address, e.g., the IP address used by the user’s user agent. To keep this correlation up to date, the user agent has to refresh the registration repeatedly. The registrar deletes a registration that is not refreshed before it expires.
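The binding between a permanent address and a current contact address, together with its expiry, can be sketched as a small in-memory table. This is a simplified illustration and not how any particular registrar is implemented: the class and method names are hypothetical, only one contact per address is stored, and the current time is passed in explicitly to keep the example deterministic.

```python
# Simplified, hypothetical model of a registrar's location database.
# A real registrar may store several contacts per address and persists them.
class Registrar:
    def __init__(self):
        # permanent address -> (current contact address, expiry timestamp)
        self._bindings = {}

    def register(self, address, contact, expires, now):
        """Create or refresh a binding that is valid for `expires` seconds."""
        self._bindings[address] = (contact, now + expires)

    def lookup(self, address, now):
        """Return the current contact, or None if unknown or expired."""
        entry = self._bindings.get(address)
        if entry is None:
            return None
        contact, expires_at = entry
        if now >= expires_at:
            # The registration was not refreshed in time: delete it.
            del self._bindings[address]
            return None
        return contact


if __name__ == "__main__":
    registrar = Registrar()
    registrar.register("sip:user@frafos.com", "sip:user@192.0.2.10:5060", 3600, now=0)
    print(registrar.lookup("sip:user@frafos.com", now=1800))
    print(registrar.lookup("sip:user@frafos.com", now=3600))
```

In this sketch a lookup halfway through the interval returns the contact, while a lookup at or after the expiry time returns None unless the user agent has refreshed the registration in between.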

A SIP dialog, a call for example, usually consists of a session initiation phase in which the caller generates an INVITE that is answered with provisional and final responses. The session initiation phase is completed with an ACK, see Figure 1 (SIP Call flow). A dialog is terminated with a BYE transaction. Depending on the call scenario, the caller and callee might exchange a number of in-dialog requests such as re-INVITEs or REFER.
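The phases of a successful call can be illustrated with a small state table. The state names below ("idle", "initiating", "answered", "established", "terminated") are invented for this sketch and are not the transaction state machines of RFC 3261; failure responses, CANCEL, and retransmissions are deliberately left out.

```python
# Hypothetical, highly simplified view of a successful call's life cycle.
TRANSITIONS = {
    ("idle", "INVITE"): "initiating",
    ("initiating", "1xx"): "initiating",       # e.g. 100 Trying, 180 Ringing
    ("initiating", "2xx"): "answered",         # final response to the INVITE
    ("answered", "ACK"): "established",        # ends the initiation phase
    ("established", "INVITE"): "established",  # re-INVITE, details omitted
    ("established", "REFER"): "established",   # in-dialog request
    ("established", "BYE"): "terminated",
}


def next_state(state, event):
    """Return the state reached from `state` when `event` is seen."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"unexpected {event!r} in state {state!r}") from None


def run(events):
    """Feed a sequence of events through the table, starting from 'idle'."""
    state = "idle"
    for event in events:
        state = next_state(state, event)
    return state


if __name__ == "__main__":
    print(run(["INVITE", "1xx", "1xx", "2xx", "ACK", "BYE"]))
```

A BYE seen before the ACK raises an error here, which reflects the ordering described above rather than the full behaviour permitted by the specification.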

The last type of SIP interaction consists of transactions that are not generated as part of a dialog. Examples of out-of-dialog SIP requests include OPTIONS and INFO, which are often used for exchanging information between SIP nodes or as an application-level heartbeat.

Every SIP message consists of three parts: the first line, the message header, and the message body, see Figure 2 (Content of SIP messages). The first line states the purpose of the message. For requests, it identifies the request type and the destination address. For responses, the first line states the result as a numerical three-digit status code together with a human-readable text. The second part of the message, the header, includes a variety of useful information such as the identification of the User Agent Client and the SIP path taken by the request. The third part is the message body, which contains application-specific information. This can be, for example, a session description (SDP) indicating the supported codecs.
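The three-part layout can be made concrete by splitting a raw message at the empty line that separates the header from the body. The sketch below assumes CRLF line endings and one header per line (no folded or compact headers); the function name split_sip_message and the sample addresses are made up for illustration.

```python
# Hypothetical helper: split a raw SIP message into its three parts.
# Assumes CRLF line endings and no folded (multi-line) headers.
def split_sip_message(raw):
    """Return (first_line, headers, body); headers is a list of (name, value)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    first_line = lines[0]
    headers = []
    for line in lines[1:]:
        # Split at the first colon only: header values may contain colons.
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return first_line, headers, body


# Made-up example request using documentation addresses.
SAMPLE = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.1:5060;branch=z9hG4bK776asdhds\r\n"
    "From: <sip:alice@example.com>;tag=1928301774\r\n"
    "To: <sip:bob@example.com>\r\n"
    "Call-ID: a84b4c76e66710@192.0.2.1\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Type: application/sdp\r\n"
    "Content-Length: 5\r\n"
    "\r\n"
    "v=0\r\n"
)

if __name__ == "__main__":
    first_line, headers, body = split_sip_message(SAMPLE)
    print(first_line)
    for name, value in headers:
        print(f"{name} = {value}")
    print(repr(body))
```

Here the first line carries the request type (INVITE) and the destination address, the header lines carry addressing and identification data, and the body would carry the SDP.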


Figure 2: Content of SIP messages

The information contained in these three parts can be roughly divided into three categories, see Figure 2 (Content of SIP messages):

  • Addressing and routing information: This includes information about who sent the message and where it is destined, the next hop it is to be sent to, and the hops it has already traversed. This information is included in the first line as well as in different headers such as From, To, Contact, P-Asserted-Identity, Via, Route, Path and other headers. The message body can contain information about where the media traffic should be sent to or is expected to come from.
  • Dialog and transaction identification: This part of a SIP message is used to uniquely identify a SIP dialog or transaction. This information is included in SIP headers such as CSeq and Call-ID, in the tags of the From and To headers, and in the branch parameter of the Via header.
  • Dialog content: Dialog content is data in a SIP message that either describes certain features of a dialog or indicates how a node receiving the message should process it. This can include parts of the SIP message body carrying SDP, which describes which audio or video codecs to use. Certain headers, such as Privacy, indicate the user’s wishes with regard to how private information such as the user’s address should be handled.
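As an illustration of the dialog identification category, a dialog is commonly identified by the Call-ID together with the tags of the From and To headers. The sketch below extracts these three values from already parsed headers; the helper names and the dictionary input format are assumptions made for this example, and it ignores which tag is the local one and which is the remote one from a given node’s point of view.

```python
import re

# Matches a ";tag=value" header parameter.
_TAG = re.compile(r";\s*tag=([^;>\s]+)")


def tag_of(header_value):
    """Return the tag parameter of a From/To header value, or None."""
    # Only look at header parameters that follow the <...> address part,
    # so that parameters inside the URI itself are not picked up.
    params = header_value.rsplit(">", 1)[-1]
    match = _TAG.search(params)
    return match.group(1) if match else None


def dialog_id(headers):
    """Return (Call-ID, From tag, To tag) from a dict of header values."""
    return (headers["Call-ID"], tag_of(headers["From"]), tag_of(headers["To"]))


if __name__ == "__main__":
    # Made-up headers of an initial INVITE: the To header has no tag yet.
    invite = {
        "Call-ID": "a84b4c76e66710@192.0.2.1",
        "From": "<sip:alice@example.com>;tag=1928301774",
        "To": "<sip:bob@example.com>",
    }
    print(dialog_id(invite))
    # The callee adds its own tag to the To header in its responses.
    answer = dict(invite, To="<sip:bob@example.com>;tag=a6c85cf")
    print(dialog_id(answer))
```

The To tag is missing in the initial INVITE and only appears once the callee has responded, which is why the first result contains None in the third position.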
