Security surveillance by intercepting Skype and other VoIP communications


VoIP telephony is steadily gaining ground on traditional copper-wire telephone systems because it offers higher bandwidth at lower deployment cost. In 2013 there were already more than 150 million VoIP subscribers, a large number in itself, and by 2017 almost a billion. But what about the privacy of VoIP conversations? Can the end-to-end encryption used in VoIP software really provide the privacy it promises?




This question became especially topical after Snowden revealed to the world the mass wiretapping carried out by government intelligence agencies such as the NSA (National Security Agency) and GCHQ (Government Communications Headquarters) with the help of the PRISM and BULLRUN spy programs, which, as it turns out, can even listen in on encrypted conversations.

How do PRISM, BULLRUN, and similar software extract information from a voice stream transmitted over encrypted channels?



To answer this question, you first need to understand how VoIP voice traffic is transmitted. The data channel in VoIP systems is, as a rule, implemented on top of UDP and most often uses SRTP (Secure Real-time Transport Protocol), which supports compression (via audio codecs) and encryption of the audio stream. The encrypted stream that comes out has the same size as the encoded audio stream that goes in.

As shown below, such seemingly minor leaks can be used to eavesdrop on “encrypted” VoIP conversations.
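To see why encryption alone does not hide packet sizes, here is a minimal sketch of a length-preserving stream cipher. It is a toy built from SHA-256 in counter mode, not SRTP itself (SRTP's default cipher, as specified in RFC 3711, is AES in counter mode, which likewise preserves length). The function names, the key, the nonces, and the frame sizes are all assumptions made for illustration.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes by hashing key, nonce and a counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, nonce: bytes, payload: bytes) -> bytes:
    """XOR the payload with the keystream; decryption is the same operation."""
    stream = keystream(key, nonce, len(payload))
    return bytes(p ^ s for p, s in zip(payload, stream))


# Hypothetical codec frames of different sizes (in bytes), as a VBR codec might emit.
frames = [bytes(10), bytes(38), bytes(62)]
key = b"demo key, not a real secret"
for index, frame in enumerate(frames):
    nonce = index.to_bytes(4, "big")
    ciphertext = encrypt(key, nonce, frame)
    print(len(frame), "->", len(ciphertext))
```

Every ciphertext is exactly as long as the frame that went in, so an observer who sees only packet sizes still sees the codec's bitrate decisions.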

What can be extracted from the encrypted audio stream

Most of the audio codecs used in VoIP systems are based on the CELP (Code-Excited Linear Prediction) algorithm, whose function blocks are shown in the figure below. To achieve better sound quality without increasing the load on the data channel, VoIP software usually runs audio codecs in VBR (variable bitrate) mode. The Speex audio codec, for example, works this way.

[Figure 2: function blocks of the CELP algorithm]

What does this mean in terms of privacy?

A simple example: Speex, working in VBR mode, encodes sibilant consonants at a lower bitrate than vowels; moreover, even individual vowels and consonants are encoded at bitrates specific to them. The graph in the figure below shows the distribution of packet lengths for a phrase rich in sibilants: “Speed skaters sprint to the finish.” The deep troughs in the chart correspond to the sibilant parts of this phrase.

The figure (Source) overlays the audio input, the bitrate, and the size of the output (encrypted) packets on a common timeline; the striking similarity of the second and third charts is visible to the naked eye.

[Figure 3: audio input, bitrate, and encrypted packet size on a common timeline]

Moreover, if you look at the picture through the prism of the mathematical apparatus of digital signal processing (the kind used in speech recognition tasks), such as the PHMM automaton (a Profile Hidden Markov Model, an extended version of the hidden Markov model), you will see much more than just the difference between vowels and consonants. You can, among other things, identify the speaker’s gender, age, language, and emotions.

Side-channel attack on VoIP

The PHMM automaton is very good at processing numerical chains, comparing them with each other, and finding patterns between them. This is why it is widely used in speech recognition tasks.
The PHMM automaton is also useful for eavesdropping on an encrypted audio stream, not directly but through side channels. In other words, it cannot directly answer the question “What phrase is in this chain of encrypted audio packets?”, but it can answer with high accuracy the question “Is this particular phrase present at this particular place in this encrypted audio stream?”

Thus, the PHMM automaton can only recognize the phrases it was originally trained on. However, modern deep learning technologies are so powerful that they can train a PHMM automaton to the point where the line between the two questions above effectively blurs. To appreciate the full power of this approach, you need to dive a little into the underlying theory.

A little about the DTW algorithm

Until recently, the DTW (Dynamic Time Warping) algorithm was widely used for speaker identification and speech recognition. It can find similarities between two numerical chains generated by the same law, even when these chains are generated at different speeds and sit at different points on the time scale. This is exactly what happens when an audio stream is digitized.

For example, a speaker can say the same phrase with the same accent, but faster or slower, or with different background noise. This will not prevent the DTW algorithm from finding the similarity between the first and second variants. To illustrate, let’s look at two integer chains:

0 0 0 4 7 14 26 23 8 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 6 13 25 24 9 4 2 0 0 0 0 0

If you compare these two chains head-on, element by element, they are obviously very different. However, if we compare their characteristics, we see that the chains definitely have something in common: each contains a run of eight nonzero integers, and both have a similar peak value (25-26). A head-on comparison that starts from the beginning of each chain ignores these important characteristics, whereas the DTW algorithm takes them, and others, into account when comparing the two chains. We won’t dwell on the DTW algorithm, however, because there is a more effective alternative today: PHMM automata.
It has been determined experimentally that PHMM automata “recognize” phrases from an encrypted audio stream with 90% accuracy, while the DTW algorithm achieves only 80%. The DTW algorithm (which in its heyday was a popular tool for speech recognition problems) is therefore mentioned only to show how much better PHMM automata are by comparison, in particular when recognizing an encrypted audio stream.

Admittedly, the DTW algorithm learns much faster than PHMM automata. This advantage is undeniable, but with modern computing power it is not decisive.
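The comparison of the two chains from the DTW example can be sketched in a few lines. This is a textbook DTW with the absolute difference as the local cost, not necessarily the variant used in the experiments mentioned here; the zero padding of both chains is shortened for brevity, and the function names are illustrative.

```python
def naive_distance(a, b):
    """Head-on comparison: sum of element-wise differences (equal lengths assumed)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as the local cost."""
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays (b's element is stretched)
                cost[i][j - 1],      # b advances, a stays (a's element is stretched)
                cost[i - 1][j - 1],  # both advance
            )
    return cost[len(a)][len(b)]


# The same nonzero runs as in the two chains above, with shortened zero padding.
first = [0] * 3 + [4, 7, 14, 26, 23, 8, 3, 2] + [0] * 19
second = [0] * 17 + [5, 6, 13, 25, 24, 9, 4, 2] + [0] * 5

print("head-on:", naive_distance(first, second))
print("DTW:    ", dtw_distance(first, second))
```

Because the warping path is free to stretch the runs of zeros, the DTW distance comes out far smaller than the head-on one, which is exactly the "similar shape, different position" effect described in the text.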

How the HMM automaton works

An HMM (plain HMM, not PHMM) is a statistical modeling tool that generates numerical chains according to a system defined by a finite-state machine in which each transition function is a so-called Markov process.
This automaton always starts in state B (begin) and ends in state E (end). The next state is chosen according to the transition function of the current state.

As it moves between states, the HMM automaton emits one number at each step, and these numbers form the output chain. When the automaton reaches state E, the output chain ends.

With the help of an HMM automaton, it is possible to find patterns in chains that look random. Here, for example, this property is used to find the relationship between a chain of packet lengths and the target phrase whose presence we are checking for in an encrypted VoIP stream.
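A minimal sketch of such a generator may help. The intermediate states S1 and S2, all probabilities, and the emitted "packet lengths" are hypothetical values chosen only to show the mechanics: start in B, follow the transition function of the current state, emit one number per step, and stop at E.

```python
import random

# Hypothetical toy model: B -> S1 -> S2 -> E, with self-loops on S1 and S2.
TRANSITIONS = {
    "B": [("S1", 1.0)],
    "S1": [("S1", 0.5), ("S2", 0.5)],
    "S2": [("S2", 0.4), ("E", 0.6)],
}
# Each emitting state produces one of two packet lengths (made-up values).
EMISSIONS = {
    "S1": [(20, 0.7), (25, 0.3)],
    "S2": [(40, 0.6), (45, 0.4)],
}


def pick(weighted, rng):
    """Draw one value from a list of (value, probability) pairs."""
    values = [value for value, _ in weighted]
    weights = [weight for _, weight in weighted]
    return rng.choices(values, weights=weights, k=1)[0]


def generate_chain(rng):
    """Walk from B to E, emitting one number in every intermediate state."""
    state = "B"
    chain = []
    while True:
        state = pick(TRANSITIONS[state], rng)
        if state == "E":
            return chain
        chain.append(pick(EMISSIONS[state], rng))


print(generate_chain(random.Random(0)))
```

Training works in the opposite direction: given many observed chains, the probabilities in the two tables are adjusted until the model is likely to produce chains like them.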


Although an HMM automaton can get from point B to point E in many possible ways (in our case, when packing an individual audio fragment), for each particular example there is a single best path, a single best chain (even for a random Markov process).
This chain is also the most likely candidate, the one the audio codec is most likely to choose when packing the corresponding audio fragment (its uniqueness lies precisely in the fact that it packs better than the others). Such “best chains” can be found with the Viterbi algorithm (as is done, for example, here).
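As an illustration of how the Viterbi algorithm finds the single best path, here is a sketch on a hypothetical two-state model. The states, the probabilities, and the coarse "small"/"large" packet-size classes are assumptions made for the example and are not taken from the cited work.

```python
STATES = ("speech", "silence")
START = {"speech": 0.5, "silence": 0.5}
TRANS = {
    "speech": {"speech": 0.8, "silence": 0.2},
    "silence": {"speech": 0.3, "silence": 0.7},
}
# Probability of seeing a small or a large packet in each hidden state.
EMIT = {
    "speech": {"small": 0.2, "large": 0.8},
    "silence": {"small": 0.9, "large": 0.1},
}


def viterbi(observations):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Choose the predecessor that maximizes the probability of reaching s.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMIT[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


probability, path = viterbi(["large", "large", "small", "small"])
print(path, probability)
```

For two large packets followed by two small ones, the best path under this toy model is speech, speech, silence, silence.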

In addition, in speech recognition tasks (including recognition from an encrypted data stream, as in our case) it is also useful to be able to calculate how likely it is that a chosen chain was generated by the HMM automaton. A concise solution to this problem is given here; it relies on the forward-backward algorithm and the Baum-Welch algorithm.
There, an HMM-based method has been developed that identifies the language of a conversation with 66% accuracy. Such low accuracy is not very impressive, so there is a more advanced modification of the HMM automaton, the PHMM, which pulls far more patterns out of the encrypted audio stream. For example, here it is described in detail how to identify words and phrases in encrypted traffic with a PHMM with 90% accuracy (a harder task than simply identifying the language of the conversation).
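The likelihood question ("how probable is it that this model generated this chain?") is answered by the forward pass of the forward-backward algorithm. The sketch below shows only that pass, on a hypothetical two-state model with made-up probabilities and coarse packet-size classes; Baum-Welch training, which re-estimates the probabilities, is not shown.

```python
STATES = ("speech", "silence")
START = {"speech": 0.5, "silence": 0.5}
TRANS = {
    "speech": {"speech": 0.8, "silence": 0.2},
    "silence": {"speech": 0.3, "silence": 0.7},
}
EMIT = {
    "speech": {"small": 0.2, "large": 0.8},
    "silence": {"small": 0.9, "large": 0.1},
}


def sequence_likelihood(observations):
    """Forward algorithm: total probability of the observations under the model."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMIT[s][obs] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


print(sequence_likelihood(["large", "large", "small", "small"]))
```

With one trained model per phrase (or per language), the chain of packet lengths is scored against each model and the highest likelihood wins.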


How the PHMM automaton works

The PHMM is an advanced modification of the HMM automaton in which, in addition to the “match” states (squares with the letter M), there are also “insert” states (rhombuses with the letter I) and “delete” states (circles with the letter D). Thanks to these two new kinds of state, PHMM automata, unlike HMM automata, can recognize a hypothetical chain A-B-C-D even if it is not fully present (e.g. A-B-D) or has something inserted into it (e.g. A-B-X-C-D).

These two innovations are particularly useful for recognizing encrypted audio streams, because the audio codec output rarely matches exactly, even when the audio inputs are very similar (for example, when the same person pronounces the same phrase). Thus, the simplest PHMM model consists of three interconnected chains of states (“match”, “insert”, and “delete”) that describe the expected length of the network packets at each position in the chain (the encrypted VoIP packets for the selected phrase).
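The effect of the insert and delete states can be illustrated with a deliberately simplified stand-in: instead of trained probabilities, the sketch below charges a fixed penalty of 1 for every insertion or deletion and finds the cheapest way to explain an observed chain with the profile. It is not a real PHMM, and the function name and penalty scheme are assumptions made for the example.

```python
def align_to_profile(profile, observed):
    """Return (cost, ops): ops is a string of M (match), I (insert), D (delete)."""
    n, m = len(profile), len(observed)
    inf = float("inf")
    # cost[i][j] = cheapest way to explain observed[:j] with profile[:i]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            # Match: the profile symbol and the observed symbol agree, no penalty.
            if i and j and profile[i - 1] == observed[j - 1] and cost[i - 1][j - 1] < cost[i][j]:
                cost[i][j], back[i][j] = cost[i - 1][j - 1], "M"
            # Delete: a profile symbol is missing from the observed chain.
            if i and cost[i - 1][j] + 1 < cost[i][j]:
                cost[i][j], back[i][j] = cost[i - 1][j] + 1, "D"
            # Insert: the observed chain has an extra symbol.
            if j and cost[i][j - 1] + 1 < cost[i][j]:
                cost[i][j], back[i][j] = cost[i][j - 1] + 1, "I"
    # Walk the back-pointers from the end to recover the sequence of operations.
    ops = []
    i, j = n, m
    while i or j:
        op = back[i][j]
        ops.append(op)
        if op == "M":
            i, j = i - 1, j - 1
        elif op == "D":
            i -= 1
        else:
            j -= 1
    return cost[n][m], "".join(reversed(ops))


for chain in ("ABCD", "ABD", "ABXCD"):
    print(chain, align_to_profile("ABCD", chain))
```

A-B-D is explained with a single delete and A-B-X-C-D with a single insert, so both still count as near-perfect occurrences of the profile A-B-C-D. A real PHMM replaces the fixed penalties with probabilities learned from training data.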


However, since in an encrypted audio stream the network packets that carry the target phrase are usually surrounded by other network packets (the rest of the conversation), we need an even more advanced PHMM automaton, one that can isolate the target phrase from the sounds surrounding it.

To this end, five new states are added to the original PHMM automaton. The most important of them is the “random” state (a rhombus with the letter R). After training, the PHMM automaton enters this state when it receives packets that are not part of the phrase we are interested in. The states PS (Profile Start) and PE (Profile End) provide the transition between the random state and the profile part of the model. This improved modification of the PHMM automaton can recognize even phrases that it “did not hear” during the training phase.


Recognizing the language you speak

There is a PHMM-based experimental setup that was used to analyze the encrypted audio streams of 2,000 native speakers from 20 different language groups. After training, the PHMM automaton identified the spoken language with an accuracy of 60 to 90%: for 14 of the 20 languages the accuracy exceeded 90%, and for the rest it was 60%.

The experimental setup shown in the figure below consists of two Linux PCs running open source VoIP software. One of the machines works as a server and listens for SIP calls on the network. When a call arrives, the server answers automatically and initializes the voice channel in Speex-over-RTP mode. It is worth mentioning here that the control channel in VoIP systems is usually implemented over TCP and either uses one of the public protocols with an open architecture (SIP, XMPP, H.323) or has a closed, application-specific architecture (as in Skype, for example).

[Figure 8: the experimental setup]

Once the voice channel is initialized, the server plays a file to the caller and then ends the SIP connection. The caller, which is another machine on the same local network, places a SIP call to the server and then uses a sniffer to “listen” to the file the server plays: it captures the chain of network packets with encrypted audio traffic coming from the server.

Next, the caller either trains the PHMM automaton to identify the language of the conversation (using the mathematical apparatus described in the previous sections) or “asks” the PHMM automaton which language the conversation is in. As already mentioned, this experimental setup provides up to 90% accuracy of language identification.

Listening to Skype’s encrypted audio stream

Here it is demonstrated how to solve an even harder problem with a PHMM automaton: recognizing the encrypted audio stream generated by Skype (which uses the Opus/NGC audio codec in VBR mode and 256-bit AES encryption). This work uses an experimental setup like the one shown in the figure above, but with Skype’s Opus codec.

To train their PHMM automaton, the researchers used the following sequence of steps:

    1. They assembled a set of soundtracks containing all the phrases of interest;

    2. They then installed a network packet sniffer and started a voice conversation between two Skype accounts (this generated encrypted UDP traffic between the two machines, in P2P mode);

    3. Each of the collected soundtracks was then played into the Skype session using a media player, with five-second intervals of silence between tracks;

    4. Meanwhile, the packet sniffer was configured to log all traffic arriving at the second machine of the experimental setup.

After all the training data had been collected, chains of UDP payload lengths were extracted using an automatic parser for PCAP files. The resulting chains of payload lengths were then used to train the PHMM model with the Baum-Welch algorithm.
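The extraction step can be sketched with nothing but the standard library. The parser below is an illustration, not the tool the researchers used; it assumes a classic pcap file (not pcapng) captured on Ethernet, with plain IPv4 frames and no VLAN tags, and the function name is hypothetical.

```python
import struct


def udp_payload_lengths(pcap_bytes):
    """Return the UDP payload length of every IPv4/UDP packet in a classic pcap capture."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # capture written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # capture written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file")
    (linktype,) = struct.unpack_from(endian + "I", pcap_bytes, 20)
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled in this sketch")

    lengths = []
    offset = 24  # size of the pcap global header
    while offset + 16 <= len(pcap_bytes):
        # Per-packet record header: ts_sec, ts_usec, captured length, original length.
        _, _, incl_len, _ = struct.unpack_from(endian + "IIII", pcap_bytes, offset)
        offset += 16
        frame = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        # Ethernet II header is 14 bytes; EtherType 0x0800 means IPv4.
        if len(frame) < 14 + 20 + 8 or frame[12:14] != b"\x08\x00":
            continue
        ip_header_len = (frame[14] & 0x0F) * 4
        if frame[14 + 9] != 17:  # IPv4 protocol number 17 is UDP
            continue
        udp_start = 14 + ip_header_len
        if len(frame) < udp_start + 8:
            continue
        # The UDP length field covers the 8-byte UDP header plus the payload.
        (udp_len,) = struct.unpack_from("!H", frame, udp_start + 4)
        lengths.append(udp_len - 8)
    return lengths
```

Calling it on the bytes of a capture file (for example, one read with `open(path, "rb").read()`) yields one chain of payload lengths per recording; each such chain would then be one training example for the model.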

What if you turn off VBR mode?

It would seem that such leaks can be eliminated by switching the audio codecs to constant bitrate mode (although this is hardly a solution, since effective bandwidth drops drastically), but even then the security of the encrypted audio stream leaves much to be desired. After all, exploiting the packet lengths of VBR traffic is just one example of a side-channel attack. There are other attacks as well, for example tracking the pauses between words.


The task is, of course, non-trivial, but quite solvable. Why non-trivial? Because in Skype, for example, in order to reconcile UDP with NAT (network address translation) and to improve the quality of the transmitted voice, network packets keep flowing even during pauses in the conversation. This makes pauses in speech harder to detect.
However, here an adaptive threshold algorithm has been developed that distinguishes silence from speech with an accuracy of more than 80%; the method is based on the fact that speech activity is strongly correlated with the size of the encrypted packets: more information is encoded in a voice packet when the user speaks than when the user is silent.
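The idea of an adaptive threshold can be sketched as follows. This is not the algorithm from the work mentioned here, only a simple illustration of the principle: it tracks a running estimate of the "silence" packet size and labels a packet as speech when it is noticeably larger. The function name, the parameters `alpha` and `margin`, and the packet sizes are all assumptions.

```python
def detect_speech(packet_sizes, alpha=0.1, margin=1.5):
    """Label each packet True (speech) or False (silence) from its size alone."""
    labels = []
    floor = packet_sizes[0]  # running estimate of the silence-level packet size
    for size in packet_sizes:
        is_speech = size > floor * margin
        labels.append(is_speech)
        if size < floor:
            # A smaller packet than ever expected: lower the floor immediately.
            floor = size
        elif not is_speech:
            # Otherwise let the floor drift slowly toward recent silent packets.
            floor = (1 - alpha) * floor + alpha * size
    return labels


# Hypothetical encrypted packet sizes in bytes: silence, a spoken word, silence again.
sizes = [20, 21, 19, 60, 62, 58, 20, 22]
print(detect_speech(sizes))
```

Because the threshold follows the observed traffic instead of being fixed in advance, the same code adapts to codecs and calls with different baseline packet sizes.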

And here (Lella and Bettati, with an emphasis on Google Talk) the speaker is identified even when nothing leaks through packet size (even when VBR mode is disabled). The researchers rely instead on measuring the time intervals between packet arrivals. The method relies on silence phases, which are encoded into smaller packets sent at longer intervals, to separate words from each other.

Conclusions

As practice shows, even the most modern cryptography is unable to protect encrypted VoIP communications from eavesdropping, even when that cryptography is implemented properly, which in itself is unlikely.

It should also be noted that this article examines in detail only one mathematical model from digital signal processing (PHMM automata) that is useful for recognizing an encrypted audio stream (in government spyware such as PRISM and BULLRUN). But there are dozens, even hundreds, of such mathematical models. So if you want to keep up with the times, look at the world through the prism of higher mathematics.
