Abstract
The interest in Quality of Service (QoS) measurements in the Internet has moved beyond research laboratories and into the public domain. The Internet is treated as a public utility, and users are given the right to access Internet content of their choice at an adequate quality. This coincides with the sustained interest in net neutrality (NN), for which basic safeguards are being put in place in countries all over the world. Today, several tools are available that empower end-users to measure the performance of their Internet connection and detect NN violations. However, the value that end-users obtain from these measurements is still small, and the results are not being exploited satisfactorily by regulatory authorities, policy makers, and consumers. In this article, we perform a detailed review of the tools that are currently available for public QoS and NN measurements and explore three challenges that must be met in order to extract more value from the results: (a) harmonization of measurement methodologies for basic performance parameters (throughput, delay, jitter, and packet loss); (b) the creation of a toolbox for detecting and monitoring NN violations; and (c) the use of a proper sampling plan for producing estimates over population groups.
Introduction
Over the past decade, several web tools that enable end-users to test the quality of their Internet connection have been made publicly available. Prominent examples are the Network Diagnostic Tool (NDT), which has existed since 2004, and speedtest.net by Ookla, which launched in 2006. In the last few years, such tools have proliferated further, owing in part to their adoption by National Regulatory Authorities (NRAs). In addition, some NRAs have developed their own tools, either by themselves or in cooperation with third parties (see Table 2).
The interest of NRAs in Quality of Service (QoS) measurements is in turn due to the importance of the Internet and the World Wide Web as a public utility, as well as to the established role of these agencies in safeguarding the open Internet.1 QoS measurements are a central part of the ongoing debate on net neutrality (NN), since the detection of NN violations is based on such measurements. In fact, several specialized tools have been developed to perform this detection automatically, and have also been made available to the public (e.g., Glasnost, Shaperprobe, NANO).
However, despite their proliferation, the value offered by these tools to people outside the research realm, such as regulators, policy makers, and consumers, is still small. There are no examples of NRAs making systematic use of QoS measurement results to examine user complaints about poor performance or about QoS differentiation practices applied by Internet service providers (ISPs), of consumer organizations using them to evaluate and compare ISPs, or of policy makers using them to improve existing telecom policies. On the contrary, looking at the websites of NRAs offering such tools, users are cautioned that the results are subject to measurement errors, and that several factors outside the responsibility of the ISP can influence the result.2 Such "disclaimers" are appropriate, but they highlight an important weakness and indicate an inability to fully exploit these results.
In this article, we perform a detailed review of these tools, focusing on their functional characteristics so that a comparison can be made even by readers who are not technical experts. The first section reviews the methodology used for network performance measurements. The focus is on broadband speeds, but some evidence is also provided for delay, jitter, and packet loss. In the second, we review specific tools that are designed to detect, or assist in detecting, NN violations. These tools have seen much smaller public adoption than generic performance tools, but we note that the current low level of adoption does not seem to stem from their inefficacy, since some tools have been extensively tested on real networks with very low false positive and false negative rates (e.g., Glasnost and Shaperprobe). Rather, it seems to stem from the inability to fully exploit the results, determine their extent and scope, and pinpoint the exact differentiation practice.
In the following section, we explore tools that serve in the detection of congestion in the Internet. Such tools are expected to gain more importance in the future; as Internet access connection speeds improve, congestion will appear more and more often in the core of the network. Interconnection disputes may occur between content providers and ISPs, or between ISPs themselves, regarding preferential treatment or intentional degradation of specific flows (also a central part of the NN debate).3
The last section sheds some light on how a statistical survey of broadband performance measurements should be performed scientifically. Broadband statistics are often collected ad hoc, without considering the accuracy of the results, thus undermining the validity of the overall survey. In a relevant study in the Netherlands,4 it was found that results from a crowdsourcing tool differed significantly from those of a household survey. Properly sampling the population is very important, not only to accurately inform consumers and other interested parties, but also to avoid unjustly harming the reputations of ISPs.
Our review of the existing status of these measurements leads us to identify three major challenges that must be overcome:
Harmonization of measurement methodologies of basic performance parameters (throughput, delay, jitter, and packet loss): A variety of tools are currently used for measuring basic performance parameters, and no tool can be proclaimed better than the others in all aspects. These tools differ in subtle but important details and can show significant variations in their results. In order to obtain valuable measurements, tools and measurement methodologies need to be harmonized and, ideally, standardized. We describe the best practices that could be followed for such a harmonization.
The creation of a toolbox for detecting and monitoring NN violations: In contrast to basic performance parameters, harmonization of NN-specific tools is much more difficult, because of the many different differentiation practices that can be applied by an ISP, which require different detection mechanisms. Therefore, the proposal is to have a toolbox of applications for detecting such violations, while at the same time monitoring their evolution over time, since differentiation may be applied only during specific periods.
The use of a proper sampling plan for producing estimates over population groups: So far, no scientific approach has been taken to producing estimates from aggregate measurements, so as to evaluate the overall performance of an ISP or the broadband quality of a group of users (residents of an area, a city, or even a whole country). As a result, even a good measurement method can produce estimates that are far off. We describe the basic elements of a scientific approach to sampling for broadband measurements, which can substantially improve the accuracy of population or area statistics.
Measurement Methodology of Basic Performance Parameters
Bandwidth Measurements
There are three different notions of bandwidth estimates of a connection in a data network6: the capacity, the available bandwidth, and the throughput.
Capacity: This is the maximum possible bandwidth a connection can deliver, measured between the physical and data link layers (excluding the physical layer overhead). It is also known as the net bit rate. For example, the capacity of Fast Ethernet is 100 Mbps, that of 802.11g is 54 Mbps (maximum), and that of ADSL2+ is 24,576/1,024 Kbps (downlink/uplink).
Available bandwidth: This is the unused capacity of the connection during a certain time period. It changes over time and depends on the load of the set of links that the connection spans.
Throughput: This is usually defined as the maximum achievable throughput, that is, the throughput that is expected to be achieved using standard transport protocols by attempting to saturate the path.
The maximum achievable throughput should approach the available bandwidth of the link. However, there is no guarantee that the transport protocol will utilize all bandwidth during the measurement period, and the throughput can be significantly lower, especially with tools using a single TCP connection during the measurement.8 Moreover, several factors affect the throughput measurement. These factors can be grouped into the following categories:
Factors related to the TCP protocol itself: The expected throughput can be influenced by the version of TCP used (e.g., NewReno, Reno, Tahoe), and parameters such as the receive windows at the sender and receiver sides (defined based on the send and receive buffers), the size of the initial congestion window, and the use of delayed ACKs.
Factors related to the measurement setup and process: These include the interface of the measurement at the client (fixed or wireless), the transmission rate of the interface, the location of the measurement server, and whether a software or hardware agent is used at the client. Factors related to the measurement process include the size of the injected packets, the number of simultaneous TCP connections, the length of the measurement, calculation methods (e.g., the estimation of the exact transfer time9), and queuing methods at intermediate routers.
Cross-traffic on the path: This is the traffic carried over part of the end-to-end path simultaneously with the injected traffic for the measurements. Cross-traffic may originate either from the user's connection or from other users sharing links in the path toward the server.
Factors related to sampling: These include the definition of the sampling frame, the sampling method (e.g., random/systematic sampling), the stratification technique, the sample size, the estimator formula, as well as methods for reducing measurement errors (e.g., rejection of extreme values, double sampling). It is well known that different sampling methods can produce different results for the same estimator, and different estimators can produce different values for the same method. Sampling can be performed at two levels: (a) for estimating the performance of a single user's connection; and (b) for estimating performance over a wide area, an ISP, a whole country, or any other population group. Techniques applied in practice for the first level of sampling are also presented in this section (although not all tools perform first-level sampling), while sampling in general is discussed in more detail later in the article.
Two widely used tools, also employed by NRAs for throughput measurements, are the NDT, available from the M-Lab platform, and speedtest.net, available from Ookla. We have studied the commonalities and differences between these tools, using information from a previous paper10 as well as the tools' wikis.11,12 Besides NDT and Ookla, another well-known methodology for bandwidth measurements is the one used by the SamKnows company. SamKnows was commissioned by the European Commission to run measurement surveys across European countries over a three-year period,13 and as a result some of the measurement details have been disclosed.14 A comparison of the NDT, speedtest.net, and SamKnows tools based on different criteria is shown in Table 1.
Comparison of NDT, Speedtest.net and SamKnows Tools Based on Different Criteria
Criterion | NDT | Speedtest.net (Ookla) | SamKnows |
---|---|---|---|
Application type | Java | Flash | Hardware probes (No web interface) |
Measurement layer | Above TCP | HTTP | HTTP |
Number of parallel connections | A single TCP connection | Up to 8 parallel HTTP threads in each direction serving multiple TCP connections | 3 parallel TCP connections (typically) |
Automatic selection of test server | Based on distance and load | Based only on distance (smallest latency) | Based only on distance (smallest latency) |
Total length of each test | 10 seconds | 10 seconds | 25–30 seconds (may vary) |
Number of throughput values for each test | A single throughput value | Multiple values taken up to 30 times per second | A single throughput value, or multiple throughput values at 5-second intervals |
Filtering of throughput values for each test | None (single throughput value) | Removal of fastest 10% and slowest 30% of values | Aborting of low-quality tests, removal of failed or irrelevant tests |
Warm-up phase | No | No | Yes |
Calculation side | Server | Client | Client |
Central storage of results | Yes | Yes | Yes |
As can be seen from Table 1, the Ookla tool opens several HTTP threads, which serve multiple TCP connections. In the detailed examination of the tool,10 the number of TCP connections was at least two, four being quite common. In contrast to the NDT where a single TCP connection is used, multiple TCP connections can better saturate the path and are more representative of common usage patterns (modern browsers open many simultaneous TCP connections when viewing a site). By default, the Ookla test tool chooses the closest test server, that is, the server that has the smallest latency, while the server selection decision for NDT is based on both distance and load. Thus, a faraway server could be selected by NDT, whereas an already loaded server could be chosen by Ookla (nevertheless, in the Ookla case, it is likely that a highly loaded server would also give high latency, and thus would not be chosen). Ookla also operates a much larger number of servers than M-Lab, which increases the chances of finding a closer server (although capacity features of Ookla servers are not known).
Both the NDT and Ookla tests try to estimate the maximum achievable throughput by continuously sending traffic for the test duration. The total length of each test is the same for both NDT and Ookla (10 seconds); however, NDT makes a single transfer and measures the ratio of the sent traffic to the test duration, whereas Ookla makes several throughput calculations during the test and outputs a single estimate after processing and aggregating the results. Because the Ookla tool measures at the HTTP layer, it discards the slowest 10% of values to alleviate overhead effects; to further reduce the effect of TCP slow-start periods, it discards another 20% of the slowest values. The removal of the 10% fastest values is done to alleviate throughput bursting due primarily to central processing unit (CPU) usage. In both tools, measurement results are sent to a central repository and used to derive several statistics.
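For illustration, the following minimal sketch in Python drops the slowest 30% and the fastest 10% of per-interval throughput samples before averaging; the percentages come from the description above, but the function name, the sample values, and the simple mean used for aggregation are our assumptions, since the exact formula used by Ookla is not publicly documented.

```python
def trimmed_mean_throughput(samples_mbps, slow_frac=0.30, fast_frac=0.10):
    """Approximate Ookla-style filtering: drop the slowest 30% and the
    fastest 10% of throughput samples, then average the remainder."""
    ordered = sorted(samples_mbps)
    n = len(ordered)
    lo = int(n * slow_frac)          # samples dominated by slow start / protocol overhead
    hi = n - int(n * fast_frac)      # samples inflated by bursts (e.g., CPU effects)
    kept = ordered[lo:hi]
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical per-interval samples (Mbps) collected up to 30 times per second:
print(trimmed_mean_throughput([2.1, 7.8, 9.5, 9.9, 10.2, 10.3, 10.4, 10.6, 11.0, 24.0]))
```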
The selection of the closest server and the use of multiple simultaneous TCP connections are probably the reasons why Ookla consistently reports higher throughput results than the NDT. On the other hand, the measurement performed by NDT with a single connection is closer to the original definition of Bulk Transfer Capacity (BTC) in RFC 3148.
SamKnows performs measurements using a hardware probe connected to the user gateway. The probe can detect user traffic and defer tests, so as to alleviate cross-traffic influence. To determine the measurement server, a brief latency test is performed against all servers, and the nearest server in terms of latency is chosen. SamKnows utilizes both its own measurement servers and M-Lab servers. Furthermore, the SamKnows tool can examine network conditions and abort tests when there is high latency or loss, or when there are client configuration issues (such as insufficient TCP settings, firewall products, random access memory [RAM], or CPU). To measure throughput in the upload and download directions, SamKnows typically uses three concurrent TCP connections. The download and upload tests are conducted by performing GET and POST HTTP requests to the target test node. The software first executes a so-called "warm-up" phase, waiting until TCP reaches steady state in all concurrent connections, and then the real testing begins. The warm-up period in each connection ends when three consecutive data chunks (256 KB) are downloaded at speeds within 10% of each other. To output the result, either a single throughput measurement is performed over a longer interval (usually 25–30 seconds), or several measurements are made at smaller intervals (usually 5 seconds). The client attempts to download/upload as much of the payload as it can during the test duration. In addition to aborting problematic tests as mentioned above, SamKnows also discards hourly measurements when an excessive number of failures were observed at either the client or the server during the whole hour. Throughput calculations are performed at the client side, and the results are sent to the backend in the UK for processing and presentation.
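As an illustration of the warm-up rule, the sketch below checks whether three consecutive 256 KB chunks were downloaded at speeds within 10% of each other. The chunk count and the 10% criterion come from the description above; the function name, the relative-difference formula, and the sample values are assumptions.

```python
def warmed_up(chunk_speeds_mbps, window=3, tolerance=0.10):
    """Return the index at which warm-up ends: the first position where
    `window` consecutive chunk speeds differ by at most `tolerance`
    (relative to the slowest chunk in the window), or None if never."""
    for i in range(len(chunk_speeds_mbps) - window + 1):
        w = chunk_speeds_mbps[i:i + window]
        if (max(w) - min(w)) / min(w) <= tolerance:
            return i + window  # the real measurement starts after this chunk
    return None

# Hypothetical speeds (Mbps) of successive 256 KB chunks during TCP slow start:
print(warmed_up([1.2, 3.5, 6.8, 9.1, 9.6, 9.8, 9.9]))
```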
Another well-known platform for performance measurements (including throughput) is BISmark.15 Similarly to SamKnows, BISmark employs hardware probes connected to the user gateway. Measurement details are cited here based on related publications16: BISmark measures throughput by performing an HTTP download and upload for 15 seconds using a single-threaded TCP connection once every 30 minutes, regardless of cross-traffic. Multithreaded TCP measurements are also made, as shown in the active dataset description.17
Another possibility is to conduct crowdsourcing measurements by embedding measurement code in existing applications. A well-known example is Dasu, an extension provided as a plug-in to the popular Vuze BitTorrent client for passively monitoring performance parameters.18 The plug-in collects per-torrent and application-wide statistics (number of TCP resets, upload and download rates, number of active torrents) as well as system-wide statistics (number of active, failed, and closed TCP connections). Apart from application-specific measurements, such extensions can also be used to monitor general client performance. An analysis of a dataset from the ONO extension (another measurement plug-in) to the Vuze BitTorrent client showed that it is also possible to characterize and compare ISP performance from such data.19 The greatest advantage of such methods is their large coverage and the possibility of monitoring performance over long periods. However, the results reflect passive measurements of application performance, which are narrow in scope and hard to control and reproduce. Thus far they have not attracted regulatory attention.
Table 2 briefly presents the throughput measurement methods used by NRAs that offer online broadband measurement tools. All of these NRAs use tools that perform active measurements, which are easier to control and reproduce. The data in the table are based on a recent BEREC report,20 while additional information has been gathered from the websites of the tools.
Comparison of Known NRA Tools for Throughput Measurement
Country (NRA) | Link (URL) | Throughput Measurement Details (Measurement Layer, No. of Parallel Sessions, Packet Size, Test Duration, Estimate) |
---|---|---|
Austria (RTR) | https://www.netztest.at/en/ | TCP, 3 parallel sessions, 4 KB data chunks, 2 seconds warm-up period, 7 seconds real test |
Croatia (HAKOM) | http://www.hakom.hr/default.aspx?id=1144 | HTTP and/or FTP, large number of parallel connections to attain maximum speed, 5 seconds duration, average over all measured values |
Cyprus (OCECPR) | http://2b2t.ocecpr.org.cy | Uses NDT (M-Lab) |
Czech Republic (CTU)a | https://www.netmetr.cz | TCP, one or more parallel TCP connections, determination of MTU (Based on Austrian RTR Net test) |
France (Arcep) | http://qualiserv.directique.com | Tool developed by independent third party; details not available |
Greece (EETT) | http://hyperiontest.gr/?l=1 | Uses NDT (M-Lab) |
Hungary (NMHH) | Not yet deployed | Uses Ookla |
Italy (AGCOM) | https://www.misurainternet.it/nemesys_intro.php | FTP (single TCP connection), file size equal to 10 times the headline speed. HTTP tests also conducted for mobile |
Lithuania (RRT) | http://www.matuok.lt | Details not available |
Montenegro (EKIP) | http://izmjeribrzinu.ekip.me/ | Uses Ookla |
Norway (NPT) | www.nettfart.no | Uses Ookla |
Poland (UKE) | Not yet deployed | TCP, several parallel data streams |
Portugal (Anacom) | http://www.netmede.pt | Details not available |
Romania (Ancom) | http://www.netograf.ro | HTTP, multiple threads, warm-up period, exclusion of extreme throughput values |
Slovenia (AKOS) | Not yet deployed | Based on Austrian RTR Net test |
Sweden (PTS) | http://www.bredbandskollen.se | Tool developed by independent third-party, details not available |
a. The NetMetr tool is currently offered for mobile.
Out of the sixteen NRAs on which information is available, three use the Ookla tool, two use the NDT, two point to tools that are not affiliated with the NRA but to independent third parties, and the remaining nine have proprietary solutions or solutions provided by third parties. From the table, it is clear that most of the NRAs measure throughput at TCP or HTTP level with multiple parallel connections and attempt to saturate the connection by sending data continuously throughout the duration of the test.
Measurements of Other Parameters
Less is known about the measurement of other parameters by the tools employed by NRAs, since only scarce information is available. Of course, parameters such as latency, jitter, and packet loss are usually monotonically correlated with throughput, so knowing throughput variations could roughly characterize variations in the other parameters.21 However, the accurate measurement of these parameters is also important, since it can provide insight into the cause of a hypothetical degradation, and can be of interest in its own right for specific applications (e.g., online gaming, where latency and jitter are very important). Again, the information provided here is based on the recent BEREC report,20 the websites of the tools, and the referenced sources for NDT, Ookla, and SamKnows.
Most NRAs use Internet Control Message Protocol (ICMP) ping to measure latency, although some countries employ HTTP ping (e.g., Romania). Latency is strongly affected by the location of the measurement server, and care must be taken to measure only the access link if the result is meant to reflect the performance of the ISP rather than that of the whole end-to-end path. The packet size is clearly also a crucial factor affecting the result. Some tools (e.g., NDT) measure latency on the same set of packets that are used to estimate the maximum achievable throughput; according to M-Lab,12 NDT's calculation of the RTT is conservative, that is, the actual RTT should be no worse than the RTT observed while NDT is running the throughput test. In Ookla, the latency test is performed by transmitting HTTP packets, and is also used to select the measurement server. SamKnows measures UDP latency as the average round-trip time of a series of randomly transmitted UDP packets. Another factor that affects latency is buffering, especially under high load (near saturation). As has been demonstrated, packets can experience substantial delays if the buffer is large.16 On the other hand, large buffers decrease packet loss, so a trade-off must be made between the two.22
Regarding the accuracy of RTT measurements, recent work has shown that HTTP-based methods introduce significant overhead, and that additional overhead is incurred by Flash GET and POST methods compared to socket-based methods.23 Thus, NDT, which uses a Java applet to establish a TCP socket, is more accurate than Ookla's measurement, which uses HTTP-based Flash technology.
Jitter is measured by only a few of the tools, despite its importance for real-time applications. There are also variations in the measurement methods, following the different definitions of different standardization organizations. Jitter is not measured in Ookla. NDT measures jitter as the difference between the maximum and minimum RTT values over the set of measurements. In the Romanian tool, jitter is estimated as the latency difference between consecutive packets, which is closer to the IETF definition. In the measurements made by the Polish NRA, jitter is measured as the standard deviation of the measured delay values, following the ETSI definition. SamKnows measures upstream and downstream jitter separately in its VoIP test. Jitter is calculated using the Packet Delay Variation (PDV) approach described in Section 4.2 of RFC 5481, which calculates the delay differences with respect to the minimum delay. The 99th percentile is recorded and used in all calculations when deriving the PDV.
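The PDV computation can be illustrated as follows. The delay samples are hypothetical and the percentile selection is simplified (nearest rank, no interpolation), but the logic, differences from the minimum delay followed by the 99th percentile, follows the description above.

```python
def pdv_99(delays_ms):
    """Packet Delay Variation in the spirit of RFC 5481, Section 4.2:
    differences of each delay from the minimum delay; report the 99th percentile."""
    d_min = min(delays_ms)
    diffs = sorted(d - d_min for d in delays_ms)
    idx = min(len(diffs) - 1, int(round(0.99 * (len(diffs) - 1))))
    return diffs[idx]

# Hypothetical delay samples in milliseconds:
print(pdv_99([20.1, 20.4, 21.0, 20.2, 25.7, 20.3, 20.5]))
```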
Finally, packet loss also admits different implementations. It is again not measured by the Ookla tool. NDT measures packet loss as the ratio of packets with congestion signals (each congestion signal roughly corresponding to a lost packet) to the total number of packets sent. In the Romanian tool, packet loss is estimated as the ratio of packets sent but not received (or received incomplete) to the total number of packets sent between the server and the client device, when sending ICMP ping messages (100 packets each) in five parallel sessions. In SamKnows, packet loss is measured by sending a sequence of numbered UDP packets between the hardware probe and the test server. A packet is treated as lost if it is not received back within three seconds of sending.
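A minimal sketch of a UDP-based loss and latency probe in the spirit of the SamKnows approach is given below; the echo server address is hypothetical, a real test needs a cooperating UDP echo service, and matching of reply sequence numbers is omitted for brevity.

```python
import socket
import time

def udp_loss_and_rtt(server, port, count=100, timeout_s=3.0):
    """Send `count` numbered UDP probes to an echo server; a probe counts as
    lost if no reply arrives within `timeout_s` (cf. the 3-second rule above).
    Returns (loss_ratio, mean_rtt_ms over the received replies)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout_s)
    rtts, lost = [], 0
    for seq in range(count):
        t0 = time.time()
        sock.sendto(str(seq).encode(), (server, port))
        try:
            sock.recvfrom(2048)
            rtts.append((time.time() - t0) * 1000.0)
        except socket.timeout:
            lost += 1
    sock.close()
    mean_rtt = sum(rtts) / len(rtts) if rtts else float("nan")
    return lost / count, mean_rtt

# Usage (hypothetical echo server):
# loss, rtt = udp_loss_and_rtt("echo.example.net", 7)
```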
NN Measurements
Here we review available tools for detecting NN violations and their usefulness from a regulatory perspective. The first tools we examine are Glasnost, Shaperprobe, and Neubot, which are available on M-Lab servers. These tools detect differentiation based on path properties, namely the throughput or bit rate at which packets are sent (for all three tools) or the RTT (for Neubot). Next we present NANO, a tool that uses a statistical method to establish causal relationships between an ISP and observed service performance. It distinguishes itself from the other tools in that it uses passive measurements from many Internet users to detect NN violations by a particular ISP. We also review Netalyzr, a tool that is not NN-specific, but permits users to obtain a detailed analysis of Internet connectivity and provides information on blocking of specific services or discrimination on specific content. For all these tools, we try to describe the factors that could lead to false estimates, or inhibit their wide adoption by end-users. Finally, we briefly describe methods that are not available as widely deployed tools, but can also be used to detect differentiation, either at path- or link-level.
Glasnost
Glasnost is able to detect traffic differentiation that is triggered by transport protocol headers (usually ports) or packet payload. For each tested application, the test compares the throughput of a pair of flows: the first flow is from the application under examination, and the second is a slightly modified version of this flow, with changes to the parameters that are suspected to be differentiation criteria (e.g., port numbers, payload). For example, to detect content-based differentiation of BitTorrent, BitTorrent packet payloads are replaced with random bytes in the second flow, while keeping the other frame bits identical.
The maximum throughput achieved by the two flows is compared, and differentiation is signaled if the difference exceeds a certain threshold. The value of the threshold is chosen so as to minimize the percentage of false positives, which could harm the reputation of an ISP and create economic damage. According to the tool developers,24 this threshold is set to 50% for short flows, which yields a false positive rate of 0.7% and a false negative rate of 1.7% (based on validation tests). The main source of inaccurate results is cross-traffic, which the tool tries to mitigate by discarding tests with more than 20% difference between the median and maximum throughput of the flows (described as high-noise tests).
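A simplified sketch of this decision rule follows. The 50% threshold and the 20% noise criterion are taken from the text, while the exact normalization of the noise check, the names, and the sample values are our assumptions.

```python
def glasnost_verdict(app_flow_mbps, control_flow_mbps,
                     delta=0.50, noise_limit=0.20):
    """Compare a pair of flows in the spirit of Glasnost: discard noisy tests,
    then flag differentiation if the maximum throughputs differ by more than
    `delta` (50% for short flows, per the tool developers)."""
    def noisy(samples):
        ordered = sorted(samples)
        median = ordered[len(ordered) // 2]
        return (max(ordered) - median) / max(ordered) > noise_limit

    if noisy(app_flow_mbps) or noisy(control_flow_mbps):
        return "inconclusive (high noise; rerun without other applications)"
    a, c = max(app_flow_mbps), max(control_flow_mbps)
    if abs(a - c) / max(a, c) > delta:
        return "differentiation suspected"
    return "no differentiation detected"

# Hypothetical throughput samples (Mbps) of the application flow and its
# randomized-payload counterpart:
print(glasnost_verdict([1.9, 2.0, 2.1], [7.8, 8.0, 8.2]))
```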
Glasnost was publicly deployed in March 2008 and has been active ever since. It is the only tool that has been adopted by NRAs and made available to users for detecting NN violations. It is currently operational in Greece (EETT), Cyprus (OCECPR), and Portugal (Anacom); it is planned for deployment by NMHH (Hungary); and has previously been used in studies commissioned for the Bundesnetzagentur (Germany) in 2012–2013.
However, the tool has not yet found wide adoption by end-users, and its results have not yet been used by regulators for detecting NN violation cases. At the time of writing, the number of Glasnost measurements on EETT's Hyperion platform (hyperiontest.gr) was around 10,000, approximately one tenth of the measurements conducted with the NDT (used for measuring speed). Moreover, these measurements span a period of more than three years, which means that the data for a specific ISP in a specific time period may be insufficient to support decision making on potential NN violations in that period. The much lower number of measurements is likely due to the long time needed to run a test, currently about eight minutes, which can discourage many users from completing the measurement. Moreover, because noisy measurements are discarded, it is often impossible to derive a conclusion, and users are advised to rerun the test without other applications running.
Several analyses of Glasnost data have also been done by independent research teams. Two analyses shown on the M-Lab page25 are a network neutrality map that color-codes areas of the world according to the fraction of shaped tests, and graphs showing both the percentage of shaped tests and the percentage of ISPs that perform traffic shaping based on deep packet inspection (DPI).
Shaperprobe
Shaperprobe is a tool specifically designed to detect traffic shaping that is implemented using a token bucket.26 This form of shaping is more common among cable providers (the most prominent example being the PowerBoost technology advertised by Comcast27) than among DSL providers.28
The tool first estimates the link capacity using packet-train dispersion and then probes the path at a constant bit rate equal to that capacity. The goal is to detect level shifts in the received bit rate, which are a sign of traffic shaping. After a level shift is detected, the token bucket parameters are estimated: the shaping rate (the sustained rate produced under rate-limiting) and the burst size (the size of the bucket, which determines how many bits can be transmitted at the available capacity, a capacity that under traffic shaping is larger than the shaping rate).
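To make the idea concrete, the following sketch (an illustration only, not Shaperprobe's actual algorithm) looks for a sustained downward level shift in the received-rate time series and, if one is found, reports the post-shift rate as the shaping rate and the extra data received before the shift as a rough burst-size estimate. All thresholds and sample values are assumptions.

```python
def detect_token_bucket(rates_mbps, interval_s=1.0, drop_frac=0.20):
    """Illustrative token-bucket detection: find the first point after which
    the received rate stays at least `drop_frac` below the initial rate.
    Returns (shaping_rate_mbps, burst_size_mbit) or None if no shift is found."""
    initial = sum(rates_mbps[:3]) / 3.0            # rate while the bucket still has tokens
    for i, r in enumerate(rates_mbps):
        tail = rates_mbps[i:]
        if r < initial * (1 - drop_frac) and max(tail) < initial * (1 - drop_frac):
            shaping_rate = sum(tail) / len(tail)
            # burst size ~ extra traffic delivered above the shaping rate before the shift
            burst = sum((x - shaping_rate) * interval_s for x in rates_mbps[:i])
            return shaping_rate, burst
    return None

# Hypothetical per-second received rates (Mbps): a 20 Mbps burst, then 10 Mbps sustained
print(detect_token_bucket([20, 20, 20, 20, 10, 10, 10, 10, 10, 10]))
```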
Shaperprobe was released as a beta version in November 2009; the current version was released in January 2012. Practical implementation difficulties include maintaining a constant probing rate, path losses, and cross-traffic. All of these can cause fluctuations in the received rate, which disrupt the measurement. In test runs,26 Shaperprobe was able to overcome these limitations and discover known traffic-shaping cases with false-positive and false-negative detection rates below 5%.
A practical difficulty in running the test is that it requires that only one client be connected to the server at a time. This often makes it impossible to conduct a measurement (in fact, all test trials conducted by the author at the time of writing returned an error message for busy servers). This could be a serious impediment to wide-scale adoption of the tool, and probably requires coordinating measurements through some form of scheduling.
Neubot
Neubot implements active measurement tests for estimating the performance of HTTP, BitTorrent, TCP, and DASH (Dynamic Adaptive Streaming over HTTP). For each test, the estimated performance parameters are throughput (download and upload) and RTT. The Neubot tool has been hosted on M-Lab since February 2012.
The most important feature of this tool is that tests can be done either on demand, or repeatedly in the background as a system service. The user can see time charts of all measurement results in a dedicated webpage, and thus gain more insight about the performance of his/her Internet connection over time.
The Neubot tool does not explicitly signal differentiation; further statistical analysis is needed to establish whether the observed performance is the result of a differentiation policy or simply a typical traffic pattern of the network. As emphasized by the tool developers,29 aggregate analysis of user data should be performed to discover patterns, which also requires clustering and categorizing connections according to technology, service tier, location, and so on. Moreover, this analysis should be done promptly, in order to discover transient phenomena that may occur as a result of temporarily applied traffic differentiation.
NANO
NANO is a system that can detect ISP discrimination by passively collecting performance data from clients. It consists of an agent installed on the client computer that collects data for selected services and reports the results to centralized servers for processing. This processing consists of establishing a causal relationship between an observed degradation and discrimination by an ISP, by comparing the performance of the ISP with the baseline performance of all other ISPs. In order to compare performances on an equal basis and avoid the influence of confounding factors (such as location, time of measurement, web browsers and OS, etc.), the system performs a stratification whereby measurements in each stratum have similar values of the confounding variables. If all possible confounding variables are taken care of in this way, any difference in performance could be attributed to a differentiation practice by the ISP itself. The final result, called the causal effect, is simply the mean of the performance differences for this ISP over all strata.30
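A simplified sketch of this stratified comparison is given below. The data layout, the names, and the requirement that a usable stratum contain both target-ISP and baseline measurements are our assumptions; NANO's actual estimator includes additional machinery.

```python
from collections import defaultdict

def causal_effect(measurements, target_isp):
    """measurements: list of (isp, stratum, metric) tuples, where `stratum`
    encodes the confounders (location, time bin, browser/OS, ...).
    The causal effect is the mean, over strata, of the difference between
    the target ISP's mean metric and the baseline (all other ISPs)."""
    by_stratum = defaultdict(lambda: {"target": [], "baseline": []})
    for isp, stratum, value in measurements:
        key = "target" if isp == target_isp else "baseline"
        by_stratum[stratum][key].append(value)
    diffs = []
    for groups in by_stratum.values():
        if groups["target"] and groups["baseline"]:   # stratum usable only with both sides
            diffs.append(sum(groups["target"]) / len(groups["target"])
                         - sum(groups["baseline"]) / len(groups["baseline"]))
    return sum(diffs) / len(diffs) if diffs else None

# Hypothetical data: (ISP, stratum, observed throughput in Mbps)
data = [("ispA", "cityX-evening", 4.0), ("ispB", "cityX-evening", 6.0),
        ("ispA", "cityY-morning", 5.0), ("ispC", "cityY-morning", 5.5)]
print(causal_effect(data, "ispA"))   # negative value: ispA performs worse than baseline
```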
The system includes a web interface where users can view their own performance metrics, as well as compare them with statistics over all users. It has existed since 2009, although the public release date is not known exactly.
NANO is currently the only deployed system that relies on passive monitoring. Potential advantages of passive monitoring are, first, that it measures the actual applications run by a user and, second, that it evades possible detection and subsequent preferential treatment of active probes by ISPs.31 On the other hand, if the number of users is small, passive monitoring may lead to a shortage of results for a specific stratum, which can negatively impact the result (the NANO agent is currently only available for Linux users, which impedes its wider adoption). Additionally, it compromises privacy, although the tool developers take measures to protect it (such as not monitoring the actual traffic being sent, allowing users to set websites that will not be tracked, or completely disabling tracking). Further, if traffic differentiation becomes the norm for the majority of ISPs, the system simply cannot detect it.
Netalyzr
As noted in the beginning of this section, Netalyzr is not an NN-specific tool, in the sense that it is not designed to signal blocking or differentiation of specific services or applications. Rather, it probes for a diverse set of network properties, including IP address use and translation, IPv6 support, DNS resolver fidelity and security, TCP/UDP service reachability, proxying and firewalling, antivirus intervention, content-based download restrictions, content manipulation, HTTP caching prevalence and correctness, latencies, and access-link buffering.32 Thus, it is a tool primarily designed for network implementers and operators for debugging and troubleshooting their networks.
From an NN viewpoint, the most interesting tests are the ones about reachability of services and content filtering:
The applet attempts to connect to the ports of 25 well-known services, among which are FTP, SSH, SMTP, DNS, HTTP, HTTPS, POP3, RPC, NetBios, IMAP, SNMP, and SMB. Although port blocking is usually claimed to be part of an ISP's security policy, aimed at limiting the exposure of its network to services with well-known vulnerabilities or at avoiding spam, it is an essential part of the NN debate, since blocking of applications can easily be performed through port blocking. Although treating services and applications differently for security purposes falls within the set of exceptions for which traffic management is permitted, such practices should be watched carefully because they can easily lead to overblocking. A minimal sketch of such a reachability probe is given after this list.
The applet tests for content filtering by attempting to download different file types: a Windows executable (.exe), an .mp3 file, a .torrent file, and a benign file that most antivirus vendors recognize as a virus.
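The sketch below illustrates a port-reachability probe of the kind the applet performs. The service-to-port mapping covers only a subset of the 25 services, the test host is hypothetical, and a real deployment would need cooperating servers that actually listen on these ports; a connection failure therefore suggests, but does not prove, blocking along the path.

```python
import socket

SERVICES = {"FTP": 21, "SSH": 22, "SMTP": 25, "DNS": 53, "HTTP": 80,
            "POP3": 110, "IMAP": 143, "HTTPS": 443, "SMB": 445}

def check_reachability(test_host, timeout_s=3.0):
    """Try a TCP connect to each well-known port on a cooperating test host."""
    results = {}
    for name, port in SERVICES.items():
        try:
            with socket.create_connection((test_host, port), timeout=timeout_s):
                results[name] = "reachable"
        except OSError:
            results[name] = "blocked or unreachable"
    return results

# Usage (hypothetical measurement server):
# print(check_reachability("testserver.example.net"))
```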
Altogether, although not specifically designed for detecting blocking and discrimination of specific services and applications, Netalyzr's detailed analysis can provide useful information on such practices. However, it is currently limited to technically savvy users. In order for it to become widely adopted, simplified explanations should be provided, describing the results in simpler terms and linking the detected problems to specific applications (e.g., applications that are affected by the blocking of specific ports), or verifying whether a specific application becomes nonfunctional and cannot work around the blocking.33
A summary of the main characteristics of the above tools is shown in Table 3.
Main Characteristics of Public Tools for Detecting Net Neutrality Violations
Tool Name (URL) | Measured Performance Parameters | Application or Service-Specific | Automatic Violation Detection | Detection Type | Known and Potential Problems |
---|---|---|---|---|---|
Glasnost (http://broadband.mpi-sws.org/transparency/bttest.php) | Throughput | YES (P2P, email, HTTP, SSH, NNTP, Flash video) | YES | Comparison of max throughput achieved by non-discriminated and discriminated packet flow | Long running time, cross-traffic |
Shaperprobe (http://netinfer.com/diffprobe/shaperprobe.html) | Throughput | NO | YES | Traffic shaping based on token bucket | Probing rate fluctuations, path losses, cross-traffic, potential scaling issues |
Neubot (http://www.neubot.org/) | Throughput, RTT | YES (HTTP, BitTorrent, DASH) | NO | Monitoring of performance parameters over time | Further statistical analysis needed to signal differentiation |
NANO (http://noise-lab.net/projects/old-projects/nano/) | Throughput | NO | YES | Comparison of user ISP with baseline performance over all ISPs | Potential shortage of measurements and privacy issues because of passive measurement, difficulty to detect colluding practices |
Netalyzr (http://netalyzr.icsi.berkeley.edu/) | Service reachability, content filtering | YES (FTP, SSH, SMTP, DNS, HTTP, HTTPS, POP3, RPC, NetBios, IMAP, SNMP, SMB, .exe, mp3, .torrent file types) | NO | Port blocking, blocking of specific file types | Oriented to technically-savvy users, further analysis needed to detect blocking of specific applications |
Other Non-Widely Adopted Methods
Several other proposals have been made for detecting NN violations, which have so far remained within laboratory environments and have not found widespread adoption (despite some of the tools having been validated on real test cases). NetPolice34 can detect whether an ISP treats traffic differently by comparing the aggregate loss rates of different flows based on routing information collected by traceroute probes. Diffprobe35 detects whether an ISP is classifying certain kinds of traffic as “low priority,” providing different levels of service for them. Differentiation is detected by comparing the delays and packet losses experienced by two flows: an Application flow A and a Probing flow P. Diffprobe can also detect whether differentiation is being done through strict prioritization, or variations of weighted fair queuing and random early detection. NetDiff36 is a tool that uses active probing to estimate performances of different ISPs and thus has the potential to discover ISPs that perform traffic differentiation (but is not designed for this purpose).
Finally, a novel approach was recently proposed to detect specific links or sequences of links that treat traffic differently.37 This approach could be used to discover whether an ISP throttles traffic coming from or directed to a specific content provider, or from a specific P2P network. The algorithm assumes the topology of the network is known and detects differentiation by monitoring how often links in neighboring paths experience packet loss, and whether losses on different paths are consistent with each other or are the result of preferential treatment of some traffic (if a link treats all traffic flows equally, then path losses should be consistent).
Detection of Congested Links
There exist various definitions of congestion, depending on the time scale at which traffic is observed and the network resources or parameters on which the observation is based. The classical TCP definition is that congestion occurs when packets are dropped at a network queue, which triggers the sender to enter the congestion avoidance state. However, many congestion detection methods are based on RTT variations, or on observing the utilization of the link. The scale at which congestion occurrences are observed can vary from milliseconds to hours, and the distribution of these occurrences may differ at each scale. For regulatory purposes it is more appropriate to observe congestion at a large scale (hours or more) and to examine persistent congestion episodes: first, because congestion events at regular intervals and on small time scales are normal in a network; and second, because large-time-scale examination of congestion is more meaningful to consumers and non-technical audiences and is easier to connect to economic and legal factors. A possible definition at the macroscopic level is that congestion occurs whenever, in a series of measurements over time, the throughput of a connection falls below a specified threshold for more than a specified period of time. Clearly, there could be other definitions, each requiring a different measurement method. For example, one could monitor RTT or packet loss values instead of throughput, or use different algorithmic details for deciding when a persistent congestion event occurs. The important point is that, from a regulatory viewpoint, a useful definition of congestion would not involve small-time-scale congestion, which is normal in a network, but persistent level shifts. Thus, detecting congestion requires a mechanism that monitors several performance parameters of the network over time (throughput, delay, or packet loss) and discovers performance variations.
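The macroscopic definition above can be made operational with a few lines of code; the threshold and the minimum episode length below are placeholders that a regulator would have to choose, and the time series is hypothetical.

```python
def persistent_congestion(throughput_series, threshold_mbps, min_samples):
    """Return (start, end) index pairs of episodes in which the measured
    throughput stays below `threshold_mbps` for at least `min_samples`
    consecutive measurements (e.g., hourly averages)."""
    episodes, start = [], None
    for i, value in enumerate(throughput_series):
        if value < threshold_mbps:
            start = i if start is None else start
        else:
            if start is not None and i - start >= min_samples:
                episodes.append((start, i - 1))
            start = None
    if start is not None and len(throughput_series) - start >= min_samples:
        episodes.append((start, len(throughput_series) - 1))
    return episodes

# Hypothetical hourly throughput averages (Mbps), threshold 8 Mbps, at least 3 hours:
print(persistent_congestion([10, 9, 7, 6, 7, 11, 10, 5, 6, 6, 6], 8, 3))  # -> [(2, 4), (7, 10)]
```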
In current tools, congestion detection is done by monitoring TCP parameters that signal congestion. For example, M-Lab uses results from the NDT to estimate several congestion-related parameters, such as the percentage of tests in the congestion-avoidance or congestion-limited state,38 and defines that congestion occurs if the congestion-limited state holds for more than 2% of the time and there are no other limiting factors.
However, the main challenge is not only to detect congestion itself, but to locate the congested link or links. With subscriber line capacities surpassing 100 Mbps (VDSL2, DOCSIS 3.0, and FTTx technologies), the bottleneck of the network is moving from the last mile deeper into the ISPs' networks and into the interconnection links between ISPs. In an end-to-end measurement, congestion may occur anywhere in the path (in either the forward or the reverse direction).
A severe obstacle in locating congestion is the scarcity of relevant data. Traffic data of telecommunication companies are almost always under non-disclosure agreements and very frequently capacity and topology data are also not revealed. Therefore, localization of congestion has to be done from the edge of the network, which is much more difficult.
An initial effort at locating congestion is again made by the NDT. The tool calculates the percentage of time that the transmission rate was limited by congestion, by the sender, or by the receive window, and thus gives a first indication of where the problem lies. It also attempts to detect the bottleneck, that is, the link with the smallest capacity in the end-to-end path. It does so by calculating the throughput of each packet (called "inter-packet throughput") and assigning each measurement result to a number of predefined bins (ranges); a minimal sketch of this binning logic is given after the list of example bins:39
0.064 < inter-packet throughput (mbits/s) ≤ 1.5—Cable/DSL modem;
1.5 < inter-packet throughput (mbits/s) ≤ 10—10 Mbps Ethernet or WiFi 11b subnet;
10 < inter-packet throughput (mbits/s) ≤ 40—45 Mbps T3/DS3 or WiFi 11 a/g subnet;
40 < inter-packet throughput (mbits/s) ≤ 100—100 Mbps Fast Ethernet subnet; and
100 < inter-packet throughput (mbits/s) ≤ 622—a 622 Mbps OC-12 subnet.
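The binning logic itself is straightforward; in the sketch below the bin boundaries and labels are taken from the list above, while the function name is ours.

```python
# Bin boundaries (Mbps) and labels as reported for NDT's bottleneck heuristic.
BINS = [(0.064, 1.5, "Cable/DSL modem"),
        (1.5, 10, "10 Mbps Ethernet or WiFi 11b subnet"),
        (10, 40, "45 Mbps T3/DS3 or WiFi 11 a/g subnet"),
        (40, 100, "100 Mbps Fast Ethernet subnet"),
        (100, 622, "622 Mbps OC-12 subnet")]

def bottleneck_label(inter_packet_throughput_mbps):
    """Map an inter-packet throughput sample to a likely bottleneck type."""
    for low, high, label in BINS:
        if low < inter_packet_throughput_mbps <= high:
            return label
    return "outside the predefined ranges"

print(bottleneck_label(35.0))   # -> "45 Mbps T3/DS3 or WiFi 11 a/g subnet"
```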
Another available tool is NPAD (Network Path and Application Diagnostics), also known as Pathdiag.40 It is a diagnostics tool based on a Web100 measurement engine that performs several tests, including client configuration, path measurements and server consistency tests. It also captures a number of key metrics of a TCP connection and uses existing performance models to determine if the connection achieves certain goals. Relevant to the detection of congestion, Pathdiag estimates whether the connection can sustain a specified target data rate, for a given target RTT. The estimation is based on the number of packets delivered successfully between congestion events, which must exceed a threshold derived from a TCP performance model.
Drawbacks of the method are the significant amount of injected traffic required to estimate the loss rate accurately, and the fact that the test may give inconsistent results under highly variable cross-traffic. Another built-in deficiency is that the user must input the target RTT, whereas in reality the RTT varies with TCP dynamics. However, Pathdiag is unique in that it tries to determine whether a certain performance goal can be fulfilled, which can be of tremendous value to both regulators and consumers. Work on such tests is also actively pursued by the IETF IPPM group.41
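As an illustration of such a model-based feasibility check (not Pathdiag's exact formula), the well-known Mathis TCP throughput model, rate ≈ C * MSS / (RTT * sqrt(p)), can be inverted to give the largest tolerable loss rate, and hence the minimum number of packets that must be delivered between congestion events, for a given target rate and RTT. The MSS and the model constant below are assumptions.

```python
import math

def max_loss_rate(target_rate_bps, target_rtt_s, mss_bytes=1460, c=0.87):
    """Mathis model: rate <= c * MSS / (RTT * sqrt(p)). Solve for the largest
    loss probability p that still allows the target rate, and the implied
    minimum number of packets delivered between loss (congestion) events.
    c is a model constant (depends on ACK strategy and loss pattern)."""
    p = (c * mss_bytes * 8 / (target_rate_bps * target_rtt_s)) ** 2
    packets_between_losses = math.inf if p == 0 else 1.0 / p
    return p, packets_between_losses

# Example: can the path sustain 10 Mbps at a 40 ms target RTT?
p, n = max_loss_rate(10e6, 0.040)
print(f"loss rate must stay below {p:.2e} (about 1 loss per {n:,.0f} packets)")
```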
Promising methods for more accurate detection of congestion are emerging from the research community. Representative research approaches can roughly be divided into two broad categories: (a) correlation analyses across multiple links42,43 and (b) active probing of the two edges of a single link.44
In the first category, a network of dispersed network servers is used to identify the root of congestion, by examining correlations of congestion patterns. To understand the basic underlying idea of this approach, suppose that a number of measurements are performed to a distant server from different locations that follow distinct paths. If all measurements show congestion, then there is a high probability that the network segment near the distant server is congested; otherwise if a small number of measurements from specific clients show congestion, a problem likely exists at the access networks of these specific clients. Ghita, Argyraki, and Thiran showed43 that it is indeed possible to calculate the probability that a link is congested from e2e measurements; they constructed an algorithm that determines the congestion probability of each link with high accuracy, based on the probabilities that paths in the network are congested (which are themselves derived from e2e measurements). The algorithm, however, is based on several assumptions, and has not been applied in a real network. Genin and Splett42 manage to identify congestion in a realistic network consisting of M-Lab nodes, but with much less accuracy. All methods require knowledge of the topology of the network, which is hard to derive from traceroute measurements when no data from providers are disclosed.
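As a toy illustration of the correlation idea (far simpler than the cited algorithms, and assuming the path-to-link mapping is known), one can score each link by the fraction of congested paths that traverse it; links that appear predominantly on congested paths become suspects. All names and data are hypothetical.

```python
from collections import defaultdict

def link_suspicion(path_links, path_congested):
    """path_links: {path_id: [link, ...]}; path_congested: {path_id: bool}.
    Returns, for each link, the fraction of paths traversing it that were
    observed congested. A heuristic score, not a proper tomography method."""
    seen, congested = defaultdict(int), defaultdict(int)
    for pid, links in path_links.items():
        for link in links:
            seen[link] += 1
            congested[link] += int(path_congested[pid])
    return {link: congested[link] / seen[link] for link in seen}

# Hypothetical paths from three vantage points toward the same server:
paths = {"p1": ["accessA", "core1", "server_edge"],
         "p2": ["accessB", "core2", "server_edge"],
         "p3": ["accessC", "core1", "server_edge"]}
# Only p1 is congested: accessA scores 1.0 and stands out as the suspect.
print(link_suspicion(paths, {"p1": True, "p2": False, "p3": False}))
```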
In the second approach of Luckie et al.,44 repeated RTT measurements are performed from a single test point to the two edges of an examined interdomain link. If that link is congested, the test result will show sustained increased latency to the far-end, but not to the near-end of the link. Here also the method requires knowledge of the topology of the network, as well as the delay history to detect level shifts. The authors tested the method in five realistic case studies and showed its potential for providing empirical data about link congestion.
The above results show that the major problem of associating congestion reliably with a particular link or set of links has not yet been resolved. However, conducting and comparing measurements from different vantage points in different ASes is a promising approach for detecting and localizing congestion problems. Vantage points should be deployed in as many ASes as possible, in an effort to generate a global “congestion heat map.” Finally, unless measurements are performed from a single test point, placing measurement servers at large capacity peering points (e.g., IXPs) will not suffice, as it will be necessary to perform measurements at the edge of an interconnection link near a specific ISP or at links connecting an ISP to a transit provider.
Measurement Plan
Sampling Units and Frame
The determination of the measurement plan includes the selection of the sampling frame, the sampling technique, and the estimator (the formula that will provide the estimate of the desired quantity). In broadband surveys, the required estimates are usually means per subscriber, for example, the mean download throughput or the mean delay per subscriber. Additionally, one could estimate proportions (e.g., the proportion of subscribers for whom the mean throughput is within a certain percentage of the advertised speed), ratios, and percentiles. Thus, as the smallest sampling unit we consider here a subscriber, although coarser or finer levels of detail (e.g., a Local Exchange or a packet flow, respectively) could be examined. A subscriber here actually refers to a subscriber's connection, and not the physical person using it. We shall also use the term "user" to refer to a subscriber, although it has a broader sense.
There exist many alternatives for sampling users: one can sample users directly, for example through crowdsourcing platforms, or perform two-stage sampling: sample a larger entity (e.g., a building, or a traffic concentration point such as a Local Exchange) at the first stage, and the users in it at the second stage. For independent measurements, such as those conducted by an NRA or on its behalf, direct user sampling is usually preferred, since it minimizes the involvement of the Internet access provider (IAP). Therefore, we consider here that users are sampled directly, and that the sampling frame theoretically consists of all Internet subscribers or users. In reality, however, a subset of Internet access users is actually employed as the sampling frame. As we will see later, this introduces bias from the very beginning and is closely related to the sampling method itself.
The estimates of broadband parameters are heavily influenced by: (a) the context, which includes factors such as the Internet access technology, the access capacity, the number of users sharing the access, and the overall traffic on the corresponding IAP's network; (b) the location, which has a significant impact on performance for most access technologies; and (c) the time of measurement. Sometimes the effect of the above is so important that we may want separate estimates per affecting factor, for example, the mean throughput per subscriber per access technology. Alternatively, we may want to derive an overall estimate, but account for the affecting factors through an appropriate stratification.
Crowdsourcing versus Preselected Panel
We consider two basic methods for constructing a sampling frame: either via crowdsourcing or a preselected panel. A preselected panel refers to the a priori selection of a set of users according to their type and connection characteristics, and to the selection of the minimum number of tests each user is required to perform. The selection should be representative of the population and thus be guided by market data or a previous measurement study (pilot study). A careful panel preselection could already be done according to sampling principles, so that no further sampling is necessary (i.e., all users in the panel are included).
On the other hand, crowdsourcing refers to end-users measuring the quality of their Internet access service (IAS) at their own initiative, without any a priori selection of the type or number of users, their connection characteristics, or the number of tests each user performs. In this case, a posteriori sampling is absolutely necessary and should also be guided by market data or a previous pilot study.
Usually in crowdsourcing users perform measurements using software-based tools, either via a web service or a software client that they download and run on their computers. In a preselected panel, users can be selected from a subscriber directory or through an advertising campaign. In the case of a panel, the number of participating users will most likely be smaller, so that one can afford to install measurement probes at user premises instead of using software applications. However, other combinations are also possible: crowdsourcing may be performed using measurement probes without preselection, or with “partial preselection,”45 or a preselected panel may be employed in which users are instructed to perform measurements using software-based tools.
With regard to sampling via software-based crowdsourcing or a preselected panel with measurement probes, the following observations are in order: First, because of the higher cost of setting up measurement probes, it is necessary to estimate the minimum required number of such devices, their locations, and the minimum number of measurements in each time interval. Second, although crowdsourcing at first seems much more accurate (because of the very high number of measurements, i.e., the sample size), it is actually very sensitive to bias. A disproportionate share of the measurements may be concentrated in specific areas, on specific access links and transmission technologies, or at specific hours of the day. Without some form of stratification and an estimate of the size of each stratum, highly variable groups of users may be mixed together, increasing the bias. For example, in fixed IASs we usually have access technologies varying widely in transfer capacity: ADSL up to 24 Mbps, VDSL about 100 Mbps, Cable (hundreds of Mbps depending on the number of channels), and FTTx up to 1 Gbps. If, as is most likely to happen in an uncontrolled environment, the numbers of crowdsourcing participants from each category are not representative of the subscriber population in an area, the overall result may be significantly biased.
Bias Due to the Sampling Frame
Regardless of the way the sampling frame is constructed, there are inherent deficiencies with respect to the ideal sampling frame, that is, the set of all Internet access users. These deficiencies introduce bias in the measurement, in addition to the bias which may exist due to the sampling or the estimator used. In crowdsourcing, the set of users who have performed measurements may not be representative of the whole population. Likewise, a nonrepresentative set of users may agree to participate in a panel (e.g., users with higher bandwidth demands, who usually choose higher bit rate offerings, or users who are dissatisfied with their connection, could be more interested in conducting measurements). A similar observation was made in the survey comparison study in the Netherlands,4 where the authors found that in the self-selected panels users of higher bit rate offerings were overrepresented.
If market and ISP subscriber data are available, this bias can be mitigated either by correcting the sampling frame to be more representative of the population, or by sampling from the existing frame and using the more accurate market and subscriber data to adjust the estimates. For example, if the sampling frame is split equally between ADSL and VDSL users, but market or ISP data show that in reality 70% are ADSL and 30% are VDSL users, then either the sampling frame should be corrected by removing some VDSL users, or the sample sizes must be based on the population shares derived from the market or ISP data rather than on the composition of the sampling frame.
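As a concrete illustration of the second option, the following minimal Python sketch reweights per-technology group means with market shares to obtain a corrected overall estimate. All figures are hypothetical and chosen only to mirror the 70%/30% ADSL/VDSL example above.

# Post-stratification sketch: correct for a biased sampling frame using market shares.
# All numbers are illustrative, not taken from any real survey.

measured_mean = {"ADSL": 12.0, "VDSL": 45.0}   # mean throughput (Mbps) per group in the sample
frame_share   = {"ADSL": 0.5,  "VDSL": 0.5}    # group shares in the (biased) sampling frame
market_share  = {"ADSL": 0.7,  "VDSL": 0.3}    # group shares according to market/ISP data

def weighted_mean(means, shares):
    # Overall estimate as a share-weighted average of the group means.
    return sum(means[g] * shares[g] for g in means)

naive = weighted_mean(measured_mean, frame_share)       # 28.5 Mbps (frame-weighted)
corrected = weighted_mean(measured_mean, market_share)  # 21.9 Mbps (market-weighted)
print(naive, corrected)

The same weighting logic underlies the stratified estimators discussed below.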
The extent to which this bias in the selection of the sample influences the informational value of the study depends on the type of statements to be made. It is likely that the bias applies equally to all providers, products, regions, and technologies. In particular, it is not to be expected that only highly dissatisfied customers of provider A and only especially satisfied customers of provider B would take part in the study. Assuming that the motivation to participate in the study is not dependent on the factors that are the actual object of the study (provider, product, region, technology), valid statements regarding differences such as between providers, regions, or technologies can be made despite bias in the sample.
Stratification
Stratification can help to improve the precision of the estimates when the population under investigation presents high variance and may be divided into relatively homogeneous subsets according to different characteristics (e.g., distance from the local exchange, nominal connection speed, peak or nonpeak hours). Furthermore, stratification is useful when we want to examine different population groups separately and derive estimates for each population group, as well as for the whole population (e.g., examine ADSL, VDSL, and Cable users, or subscribers of different IAPs separately). It is noted that stratification can be applied regardless of whether sampling within each stratum is performed or not; in the case where sampling is not performed (all the population in each stratum is enumerated), it is equivalent to merely reweighting the results from the different strata.
The number and types of strata are the major issues; they must be decided based on the purposes of the study and the characteristics of the population. Generally, the goal of stratification is to divide a heterogeneous population into subpopulations which are internally homogeneous. To date, no research work on Internet QoS measurements over a large area provides guidance as to the type of stratification that should take place. Therefore, significant time must be devoted, prior to the measurements, to studying the characteristics of the population in order to decide on an efficient stratification.
Sampling Method
The sampling method refers to how the users that will be included in the study are selected (in the case of stratification, it refers to the selection of users within each stratum). Two basic methods exist: random and systematic sampling. In simple random sampling, each user in the frame is selected with the same probability. Systematic sampling presupposes that a list of users is available and is conducted by selecting a user at random from the first k users in each stratum and then taking every kth user thereafter. Systematic sampling is easier than simple random sampling and can be more precise when the list is ordered so that neighboring users are similar, because the acquired sample then spreads over heterogeneous users. For example, if the list contains users in order of geographical proximity to a Central Office, taking every kth user can circumvent correlation effects and produce a more precise estimate. However, since no previous results are available for the populations at hand, an initial comparison of both methods is recommended.
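As a simple sketch (in Python, with a hypothetical user list), systematic sampling can be implemented as follows; the random start among the first k entries makes the procedure probabilistic, while the fixed step spreads the sample along the ordered list.

import random

def systematic_sample(user_list, k, seed=None):
    # Pick a random start among the first k users, then take every kth user thereafter.
    start = random.Random(seed).randrange(k)
    return user_list[start::k]

# Hypothetical frame ordered by distance from a Central Office, so that a
# fixed step spreads the sample across distances.
frame = ["user_%04d" % i for i in range(1, 1001)]
sample = systematic_sample(frame, k=50)
print(len(sample))   # 20 users, regardless of the random start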
Required Sample Sizes
The required sample sizes refer to (a) the required number of users that should be included in the results (or the required number of users in each stratum, in case of stratification) and (b) the required number of tests that each user should perform.
Generally, there is no safe rule for determining the required sample size of individual measurements for each parameter. The rule of thumb is that the larger the deviation of the random variates from the normal distribution, the larger the required sample size; an estimate of the sample variance is therefore helpful in this direction. In practice, we can also strengthen the validity of the normal distribution assumption by removing outliers, thereby reducing the skewness. Outliers should be rigorously checked for measurement errors and, if errors are found, discarded. Otherwise they should be treated separately (e.g., by completely enumerating them), or be modified to improve the precision of the estimate.46 In any case, the persons conducting the survey must be extremely cautious, in order not to discard valuable information.
For estimating mean values and percentages, there are widely known methods for determining the required sample size, assuming the underlying population is very large and the estimates are normally distributed.47 Under this normal approximation (a large underlying population compared to the sample size can be taken for granted, since usually only a small fraction of the subscribers are in the crowdsourcing sample), the minimum required number of measurements for estimating a mean value Ȳ is n₀ = s²/V, where V is the desired variance of the estimate and s² is an estimate of the population variance. For estimating percentages (i.e., the percentage of the population that has a certain characteristic), the minimum required size is n₀ = pq/V, where p denotes the estimated percentage of the population that has this characteristic and q = 100 − p. If, in contrast, the normal approximation assumption does not hold, these formulas are likely to underestimate the required number, since performance parameters usually have heavier tails (see also the next section) and are therefore skewed to the right. They could still be used, however, by applying a safety margin to the calculated value. Additionally, the experimenter has the option to resort to other methods for estimating confidence intervals, such as the Jackknife and Bootstrap methods.48
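These formulas translate directly into code. In the sketch below (illustrative figures only), the desired variance V is derived from a target confidence-interval half-width at the 95% level, which is a common way of specifying the required precision.

import math

def n_for_mean(s2, margin, z=1.96):
    # n0 = s^2 / V, with V the desired variance of the estimate,
    # here derived from a target half-width ("margin") of the 95% CI.
    V = (margin / z) ** 2
    return math.ceil(s2 / V)

def n_for_percentage(p, margin, z=1.96):
    # n0 = p*q / V, with p the estimated percentage and q = 100 - p.
    V = (margin / z) ** 2
    return math.ceil(p * (100.0 - p) / V)

# Illustrative: throughput standard deviation 8 Mbps, target +/- 1 Mbps.
print(n_for_mean(s2=8.0 ** 2, margin=1.0))        # about 246 measurements
# Illustrative: roughly 60% of users within 80% of the advertised speed,
# to be estimated within +/- 5 percentage points.
print(n_for_percentage(p=60.0, margin=5.0))       # about 369 users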
Validity of the Normal Approximation Assumption
When the sample size is large, it is common practice in many surveys to rely upon the normal distribution assumption to construct confidence intervals and derived quantities such as the minimum required sample size mentioned above. The normal distribution assumption holds exactly when the sampled random variates are themselves normally distributed, or in binomial trials, where it can be justified by the normal approximation to the binomial distribution (e.g., when calculating the proportion of users with a speed above a certain threshold). However, random variates for the metrics under examination are usually not normally distributed. The normality of sample averages and proportions49 may instead be justified by the central limit theorem, which states that the arithmetic mean of a large number of independent random variates, each drawn from a distribution with well-defined expectation and variance, is approximately normally distributed.50
Conditions under which the central limit theorem applies for simple random sampling in finite populations involve the limiting behavior of functions of the population elements, and are usually very hard to demonstrate.51 In practice, we can be more certain that the normal approximation applies by examining the two primary conditions: independence and finite variance of single user measurement results (which are random variables themselves). Independence can be satisfied by considering that, in normal network conditions, the burden of injected traffic of a single user measurement is infinitesimal compared to the total network traffic in the access network, so the measurement of one user does not affect the result of another.
The assumption of finite variance should be justified for each measurement metric separately: Throughput generally follows a lognormal distribution,52,53,54 while for specific flow sizes and specific time periods it approaches a normal distribution.55 The RTT distribution was shown to be bimodal and have a long tail,54 while in an older analysis it was modeled by a gamma distribution.56 For file transfers, the one-way delay follows a lognormal distribution;57 however, for transfers of data flows the distribution has different behavior for different value ranges and can be better modeled by combining Pareto and Weibull distributions.58 Finally, packet loss rate follows an exponential distribution, and there is also a significant probability of zero rate value.59 Therefore we see that, except for the one-way delay, which may approximate a Pareto distribution with infinite variance, all other parameters have been modeled with lognormal, Gamma, or exponential distributions, which are known to have finite variance.
Therefore, we can have more confidence that the central limit theorem applies here, although this has not been verified in practice.60 Provided these underlying conditions are fulfilled, the approximation should be safe, given the very large number of user measurements available in crowdsourcing tools.
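A quick simulation illustrates the argument: if single-user throughput is drawn from a heavy-tailed but finite-variance distribution such as the lognormal (the parameters below are purely illustrative), the distribution of sample means is already much closer to normal for moderate sample sizes.

import numpy as np

rng = np.random.default_rng(0)

def draw_throughput(n):
    # Illustrative lognormal throughput model (median around 20 Mbps).
    return rng.lognormal(mean=3.0, sigma=0.6, size=n)

def skewness(x):
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

# Skewness of raw single-user results versus skewness of means of 200 users.
raw = draw_throughput(100_000)
means = np.array([draw_throughput(200).mean() for _ in range(5_000)])
print(round(skewness(raw), 2), round(skewness(means), 2))
# The means are far less skewed, i.e., much closer to a normal distribution.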
Required Number of Users under Stratification
In the case of stratification, the required number of users also depends on whether proportional or optimal allocation is performed. In proportional allocation, the sample size of each stratum is proportional to the population size of the stratum, whereas in optimal allocation the sample sizes are selected so as to minimize the variance of the overall estimate. Optimal allocation is more costly, since it additionally requires estimates of the stratum variances, but it is usually much more accurate when there are significant differences in the variances between strata. Such differences are very likely to occur in Internet access measurements, due to the large differences in capacity of the various access technologies (e.g., DSL, Cable, Fiber, Mobile 3G and 4G). Therefore, the extra cost of variance estimation may be worth the gain achieved.
When estimating mean values and percentages, there are also simple formulas for the calculation of the minimum required sample size for a desired variance which are similar to the nonstratification case.61
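The two allocation rules can be sketched as follows (Python, with purely illustrative population sizes and standard deviations); proportional allocation uses only the stratum sizes, whereas Neyman allocation also weights by the estimated stratum standard deviations.

import math

def allocate(n_total, strata, optimal=True):
    # strata: name -> (population size N_h, estimated std dev S_h).
    # optimal=True: Neyman allocation, n_h proportional to N_h * S_h;
    # optimal=False: proportional allocation, n_h proportional to N_h.
    weights = {h: (N * S if optimal else N) for h, (N, S) in strata.items()}
    total = sum(weights.values())
    return {h: math.ceil(n_total * w / total) for h, w in weights.items()}

# Illustrative figures only (subscribers, and throughput std dev in Mbps).
strata = {"ADSL": (700_000, 5.0), "VDSL": (250_000, 20.0), "FTTH": (50_000, 80.0)}
print(allocate(2000, strata, optimal=False))  # {'ADSL': 1400, 'VDSL': 500, 'FTTH': 100}
print(allocate(2000, strata, optimal=True))   # {'ADSL': 560, 'VDSL': 800, 'FTTH': 640}

As the example shows, optimal allocation shifts sample users toward the high-variance (here, high-capacity) strata.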
Accuracy of Single User Measurements
In crowdsourcing surveys of Internet access performance, the accuracy of each user's measurement result is usually not examined. However, since the user test environment is outside the control of the experimenter and many factors may introduce errors (other applications running simultaneously, use of a wireless interface, TCP and OS versions, user equipment), errors inevitably exist, and it is necessary to apply statistical methods to minimize them or mitigate their effect. Clearly, the accuracy of the overall per-user estimate is affected by the accuracy of each single user's result. It is therefore reasonable to check the stability of single user measurements and to ask for the minimum required number of measurements each user should perform. In the second section of this article (“Measurement Methodology of Basic Performance Parameters”), we saw that some tests (the Ookla and SamKnows tests, as well as some test tools by NRAs) perform multiple measurements to output a single throughput statistic, but without providing evidence of accuracy.
To safeguard the accuracy of the overall result, the measurement result of each user should also abide by strict requirements (e.g., that a maximum variance is not exceeded, or that the result of each user is a mean lying within a 95% confidence interval). Since the technical parameters of interest (throughput, delay, jitter, packet loss) are usually not normally distributed, a large number of measurements may be required in order to obtain a small variance.
We may try to calculate an estimate of the sample variance of single user measurement results for a “typical” user; then, following the methods described previously (see the section “Required Sample Sizes”), we can calculate the minimum required number of single user measurements and include only those users who have performed at least this number of measurements.
It is also worth noting that in the case of QoS measurements, the same sample will probably be used for the estimation of all parameters (throughput, delay, jitter, packet loss, and any other parameters). These parameters usually exhibit different characteristics and therefore require different minimum sample sizes. One should therefore examine the minimum required sample size for each parameter separately and select the maximum of these values.
To mitigate errors from the user environment, one should process the sample and try to filter out users suspected of measurement errors. A possible approach would be to calculate the variance of each user's estimate and exclude users who exhibit significantly larger variance in repeated measurement results compared to other users with the same characteristics (access technology, nominal speed, distance from local exchange, etc.). This approach is based on the reasoning that each user's results from repeated measurements should be relatively “stable.” This does not mean that the results of a single user are not allowed to vary; on the contrary, they are expected to vary depending on the time of day (peak/nonpeak hours) and the degree of congestion in the network. However, users that have significantly higher variance than other “similar” users are more likely to be influenced by confounding factors internal to the user environment, and thus their results may contain significant bias (e.g., a user who conducts some measurements while simultaneously running other applications in the background and some without will exhibit larger variance in the results). Nevertheless, it is not known whether other network-related factors could lead to such higher variance, or how large the variance difference should be to detect users with “suspicious” results. Therefore, similarly to the previous discussion of the required sample size, the handling of users with significantly higher variance in their results should be done with caution and be subject to strict recommendations about identifying and handling outliers.
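The following Python sketch illustrates this kind of filtering; the grouping key (access technology, nominal speed, etc.) and the factor-of-three threshold are placeholders, since, as noted above, the appropriate cut-off is an open question and flagged users should be investigated rather than silently dropped.

import statistics

def flag_high_variance_users(results, factor=3.0):
    # results: user_id -> (group_key, list of repeated throughput measurements),
    # where group_key encodes access technology, nominal speed, and so on.
    per_user = {u: (g, statistics.variance(xs))
                for u, (g, xs) in results.items() if len(xs) >= 2}
    flagged = []
    for group in {g for g, _ in per_user.values()}:
        group_vars = [v for g, v in per_user.values() if g == group]
        threshold = factor * statistics.median(group_vars)  # placeholder rule
        flagged += [u for u, (g, v) in per_user.items()
                    if g == group and v > threshold]
    return flagged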
Discussion and Conclusions
In this final section, we summarize the findings of this review and outline an approach to address each of the challenges we have set.
Harmonization of Measurement Methodologies
From the analysis of the current measurement methods for estimating throughput and other QoS parameters of a broadband connection, it appears that none of the currently available public tools can be declared superior in all respects. Furthermore, not all tools measure all four basic parameters of interest (throughput, latency, jitter, packet loss). Another notable conclusion from several of the works we reviewed is the large variation in throughput results obtained with different measurement tools.
There is as yet no established knowledge on how to optimally set measurement parameters (number of parallel connections, packet size, probing rate, etc.). In addition, there is no consensus on the processing and aggregation of measurements for deriving the single user value, or enough evidence of the impact that this has. However, despite the absence of an optimal tool, the overwhelming need to conduct comparable measurements, especially by regulators to inform the public, necessitates the harmonization, and preferably the standardization, of the methodology for each measurement metric. We can already identify a number of best practices toward harmonized methodologies.
The number of simultaneous TCP connections used during the measurement is largely responsible for throughput variations. Previous research10,16 has shown that multiple parallel TCP sessions consistently achieve higher throughput and thus come closer to measuring the available bandwidth; single TCP connection measurements are also more adversely affected by packet losses.16 The number of parallel connections generally increases as capacity increases; however, previous research62 has shown that this number need not be very high (theoretically, six TCP connections are sufficient to reach 95% of the available capacity).
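As an illustration of how a multi-connection test aggregates throughput, the Python sketch below opens several parallel TCP connections toward a test server and sums the bytes received over a fixed duration. The server name and resource path are hypothetical, and a real tool would additionally discard the TCP slow-start ramp-up period and repeat the test several times; this is a sketch, not a description of any particular tool's implementation.

import socket, threading, time

def _worker(host, port, request, byte_counts, stop_time):
    # One TCP connection: send an HTTP request and count the bytes received.
    try:
        with socket.create_connection((host, port), timeout=10) as s:
            s.sendall(request)
            while time.monotonic() < stop_time:
                chunk = s.recv(65536)
                if not chunk:
                    break
                byte_counts.append(len(chunk))
    except OSError:
        pass  # a stalled or refused connection simply contributes no bytes

def parallel_download_rate(host, port, path, connections=6, duration=10.0):
    # Aggregate download rate (Mbit/s) over all parallel connections.
    request = ("GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n"
               % (path, host)).encode()
    counters = [[] for _ in range(connections)]
    stop_time = time.monotonic() + duration
    threads = [threading.Thread(target=_worker,
                                args=(host, port, request, c, stop_time))
               for c in counters]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.monotonic() - start
    total_bytes = sum(sum(c) for c in counters)
    return total_bytes * 8 / elapsed / 1e6

# Hypothetical test server and file:
# print(parallel_download_rate("testserver.example.net", 80, "/100MB.bin"))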
Besides the number of parallel connections, the other factors that seem to play the most important role in the results of a single test are the measurement interface (wired/wireless), the selection of the test server, the existence of cross-traffic, and the methods used for processing and aggregating the measurements. There is general consensus that wired interfaces should be used for the measurement, that cross-traffic should be absent or minimal, and that, when measuring IAS quality, the test server should be located as close as possible to the client but in a different Autonomous System (AS) from that of the client's ISP (so that it is outside the ISP's control). Additionally, the test server should employ redundancy so that it is not affected by load problems. These constitute good measurement practices that any tool should try to follow as closely as possible. (The TCP parameters also play a role, but it is assumed that the majority of home users have the same configuration, so that the effect is the same.)
Harmonizing methodologies should not be an insurmountable task. Even currently employed tools that use multithreaded connections could be used, provided their overall accuracy is further investigated (e.g., by comparing the throughput results with estimates of the available bandwidth) and care is taken to follow the good measurement practices mentioned above.
The primary parameter of interest is throughput, but common practices also need to be established for the measurement of the other parameters, primarily latency, and secondarily jitter and packet loss. Latency is heavily affected by the server location, packet sizes, and buffering. The server location unavoidably varies, but the packet sizes and buffer specifications could be fixed. Finally, a common definition of jitter and packet loss should be applied, as definitional differences currently seem to be the greatest cause of discrepancies.
It should be noted that harmonization of measurement methodologies does not mean that different methodologies should not exist, especially for researchers and developers to advance our knowledge in the field. However, results that inform public dialogue and opinion should be as harmonized as possible.
A Toolbox for Monitoring NN Violations
Unlike performance measurement tools, which follow similar practices, currently deployed tools for detecting NN violations differ considerably. In fact, each tool has unique features:
Neubot can monitor performance parameters (throughput, latency, packet loss, etc.) over time and provides a graphical overview so that users can see the evolution of their performance, and can thus perceive a sudden differentiation of a certain service or application. It does not incorporate logic for automatically deducing whether an NN violation has occurred, but provides evidence to facilitate the user in that detection.
Tools such as Glasnost and Shaperprobe incorporate logic in automatically deducing NN violations, and are either application specific (Glasnost) or designed to detect a specific traffic management practice (traffic shaping, in the case of Shaperprobe). These tools rely again on performance measurements, namely throughput for detecting a violation.
NANO estimates user performance using passively collected data for specific services or applications, and compares the performance of a certain ISP with that of the baseline of all ISPs.
Netalyzr can identify blocking of ports and filtering of specific content, as well as more subtle strategies such as HTTP proxying, which could be seen as an NN violation.
Given the variety of traffic management practices that ISPs can employ, the combined use of several tools can provide stronger evidence for discovering that such a practice exists and for understanding how it operates.
Furthermore, detailed monitoring is necessary to support evidence of NN violations: differentiation may concern only some users (most likely those who consume the most data) or only specific content, and it may be applied only during certain periods, such as peak hours. Deriving aggregate statistics over a large number of users and types of content can help to detect NN violations with more accuracy, thus enabling a regulatory authority to take countermeasures. Therefore, similarly to simple performance results, a procedure for the statistical processing of results on NN violations is necessary to provide evidence not only to regulators, but also to the broader public. The tools themselves can be improved by adding more user-friendly interfaces, more detailed analysis of the cause of the differentiation (e.g., whether throttling is based on port number or content), and statistical processing.
Additionally, such tools could see wider adoption if more NRAs adopt or support them. NRAs could also cooperate more closely with end-users in detecting and substantiating a violation. For example, an NRA could add a web interface where a user can report his/her results from running the tool directly to the NRA and file a complaint. User cooperation is very important in this area, since it is most often the subscribers themselves who first discover a violation.63 Depending on the accuracy with which automated results can be produced, NRAs may also need to cooperate more closely with universities, research institutes, or other third-party experts to establish beyond doubt that a violation has occurred.
A Proper Sampling Plan
The presentation of various aspects of sampling methodologies in this article has shown the complexity of the problem and the difficulty of performing optimal sampling to maximize the accuracy of the estimates.
However, there are basic tasks that can be performed to substantially improve accuracy:
Determine what estimates are required and the services or access technologies that must be evaluated (e.g., mean throughput per user for each different access technology, results per ISP and overall country indices, results for peak and nonpeak hours, results for different geographical regions). This means that one should first precisely define what type of result is demanded.64
Study distinct population subgroups in the whole population based on market and subscriber data from ISPs, or a pilot study. The goal here should be to determine the size and variance of these populations, and whether stratification is necessary. If large variances exist in different categories, split the population in appropriate strata, so as to produce homogeneous groups of users and increase the accuracy of the estimate (see the section “Stratification”).
Determine the minimum required number of sample users of each stratum for a desired accuracy, if possible by using estimates of the population variances (see the section “Required Sample Sizes”). Ideally, the numbers of users from each category should not only be proportional to the initial population sizes, but also to the population variance.65
Make the sampling frame representative of the population, for example by considering only a subset of the crowdsourced data, or preselecting a representative panel (see the section “Bias Due to the Sampling Frame”).
Check the accuracy of single user measurements and determine a minimum required number of measurements that each user should perform in order to be included in the sample. Users showing extreme variance in repeated measurements are more likely to have measurement errors or be influenced by confounding factors internal to the user environment, so they should be investigated further and potentially removed from the sample in order to improve accuracy (see the section “Accuracy of Single User Measurements”). It is remarked that collecting accurate single user measurements is an easier task because the results taken by common software tools (e.g., NDT in M-Lab, or speedtest.net) are already averages over several individual measurements; hence they are representative means to some extent. Therefore, it could even be safe to use only very few individual measurements. In any case, a previous study of the variance of the measurements is necessary, and should also include different time periods (at least peak and nonpeak hours). If the variance turns out to be large, then a more thorough investigation of the number of measurements for each time period should be conducted.
In addition to tackling these challenges, measurement tools that are used to inform the public should be fully and publicly documented, including the details of the algorithm, its software implementation, and the sampling methodology used to derive single-measurement statistics. Validation experiments and comparisons with other tools should also be described in sufficient detail to permit replication of the experiments by third parties, and it should be possible to log test runs at the packet level. These transparency requirements are further strengthened if both the software code and the measurement results are made open.
Finally, the measurement tools should respect user privacy. Public statistics should be anonymous and not aim at characterizing specific persons, such as heavy users. Although raw data will necessarily include the IP address of the user and possibly other user-related data (e.g., computer name), these should not be used in a way to profile an individual, but only for understanding and further investigating the measurement results.
A concerted effort among all stakeholders should be made in order to tackle the aforementioned challenges: Regulators and ISPs, as well as researchers and developers, should cooperate more closely to improve current measurement tools and correctly inform the public of Internet quality.
Footnotes
1. This role has been strengthened by the recent adoption of Open Internet Rules by the Federal Communications Commission and the enactment of NN rules at European level through the Telecom Single Market Regulation (Council of the European Union).
2. See for example the explanatory sections in the tools provided by RTR (Austria): Rundfunk & Telekom Regulierungs-GmbH. Similar “disclaimers” can also be found in other tools.
3. Clark et al.
4. Rood et al.
5. Goel et al.
6. Prasad et al.
7. Such as the NDT (M-Lab), or speedtest.net (Ookla), or other specially made tools (e.g., RTR NetTest).
8. For example, in an NDT dataset examined in Bauer, Clark, and Lehr, 38% of the tests never made use of all the available capacity.
9. As noted in Bauer, Clark, and Lehr, this is not trivial because the sender side does not know exactly when the receiving side acknowledges the data. More precise measurements use the TCP timestamp option; however, current NDT and Ookla systems rely only on one side to do the measurement.
10. Bauer, Clark, and Lehr.
11. Ookla.
12. Google Code, “NDT Text Methodology.”
13. The last survey was conducted in 2014.
14. SamKnows.
15. Project BISmark—Broadband Internet Service Benchmark.
16. Sundaresan et al.
17. Project BISmark—Broadband Internet Service Benchmark, “BISmark Active Dataset Description.”
18. Sánchez et al.
19. Bischof et al.
20. BEREC. BEREC stands for Body of European Regulators for Electronic Communications and is a formal organization of European NRAs that was established by the EU in order to contribute to the development and better functioning of the internal market for electronic communications networks and services.
21. The throughput is monotonically decreasing with latency and packet loss. This is evident for protocols without flow-control, such as UDP, but also holds for TCP. A well-known result is that an upper limit to the throughput of a TCP connection is equal to the value of the receive window divided by the round-trip time (RTT); also, the steady-state TCP throughput is approximately inversely proportional to the RTT and the square root of packet loss rate (see Padhye et al.) These dependencies were also shown in realistic networks from the analysis of NDT results (see García García). It is not always true for jitter; however, the NDT measurement results in García García also showed that most of the times, a greater average RTT also has a higher variance. A possible explanation is that higher speed networks are also dimensioned with larger buffers, which would cause packets to wait longer during congestion periods. It is, however, emphasized that knowledge of how throughput behaves only roughly characterizes other parameters; one would not know, for example, if a low throughput value were due to high latency or increased packet loss.
22. TCP does not let packet loss increase much since it adjusts the congestion window. But also if buffers are too small, it could cause a link to be underutilized. This is the reason for which modern routers are designed with large buffers. Furthermore, routers have to accommodate other traffic as well apart from TCP, especially real-time traffic using UDP. Overall, buffer sizing is a very complex problem, of which only the surface is scratched here. For interested readers, a nice reference is Vishwanath, Sivaraman, and Thottan.
23. Li et al.
24. Dischinger et al.
25. M-Lab.
26. Kanuparthy and Dovrolis, “ShaperProbe.”
27. Xfinity.
28. A DSL provider could dynamically adjust the synchronization speed without having to resort to shaping, although there are no indications of such practice.
29. Masala et al.
30. Tariq et al.
31. This is an advantage of passive over active methods. Note, however, that when active probe methods are made by thousands of users on a large number of servers, evading detection is an arduous task.
32. Kreibich et al.
33. In many cases it may not be obvious to users that a specific application is failing because of port blocking, or there may be workarounds to the blocking of a specific port. For more details, see the technical report of BITAG (Broadband Internet Technical Advisory Group).
34. Zhang, Mao, and Zhang.
35. Kanuparthy and Dovrolis, “Diffprobe.”
36. Mahajan et al.
37. Zhang, Mara, and Argyraki.
38. Google Public Data.
39. Google Code, “NDT Test Methodology: Bottleneck Link Detection.”
40. Mathis et al.
41. See Mathis and Morton.
42. Genin and Splett.
43. Ghita, Argyraki, and Thiran.
44. Luckie et al.
45. “Partial preselection” is what was performed in the last SamKnows study for the EC in 2014: A part of the panel was composed of existing panelists from previous studies, and the remaining from volunteers. Volunteers were first directed to a website, where they first completed an identification form and then performed a speed test. On the basis of the consumer information and the results of the speed test, it was then decided if the user would receive a hardware probe (SamKnows Whitebox) to participate in the panel.
46. Aguinis, Gottfredson, and Joo.
47. Cochran.
48. Efron and Tibshirani.
49. Since proportions can also be seen as averages.
50. Feller.
51. Bellhouse.
52. Balakrishnan et al.
53. Zhang et al.
54. Maier et al.
55. Lu et al.
56. Mukherjee.
57. Downey.
58. Basher et al.
59. Paxson.
60. Practical verification is extremely difficult, since it demands random samples from the vast population of Internet subscribers.
61. See again the classical book of Cochran, Sections 5.9 and 5.12. It is noted that these simple formulas hold provided that the population is very large compared to the sample size, which is generally a safe assumption to make.
62. Altman et al.
63. The famous BitTorrent blocking by Comcast in 2007 as well as other NN cases were first brought to the surface as a result of user complaints. For a list of cases, see Sankin.
64. This is a usual guiding principle in designing sampling schemes; see for example Gruijter, Section 13.1.2.
65. This is called “optimal” or “Neyman” allocation in statistics. See Cochran.