Location, location, location: it's about as important in distributed systems as it is in real estate. Where a server is located in the network determines the kind of performance you're going to get when downloading data from that server. So the efficiency of a content distribution system is determined in no small part by its ability to direct you to nearby servers.
To be slightly more precise, network location affects two separate metrics: latency and bandwidth. In theory, these two are completely independent of each other. In reality, they tend to be intertwined. There has been much academic research on the linkage between latency and bandwidth (so much so that it is infeasible to make a comprehensive list -- interested parties should start with recent work by Steenkiste at CMU and by Jia Wang and Spatschek at AT&T, and follow their references). But the bottom line is that you, as a consumer, want high bandwidth and low latency. And the easiest way to achieve both goals is to place servers as close as possible to the clients you want to serve.
Our bigger competitors, whose businesses are built on 1970s technology known as the client-server architecture, are at a disadvantage in the location game. To a first approximation, they provide data to their users by hooking each of them up to a single server on a rack in a datacenter they control. All of the data is unicast from this node, which has some terrible ramifications.
First, your entire connection state resides on that node, so, should that node ever die or hiccup for any reason, your downloads get interrupted. This used to be very common about a decade ago. As anyone who visited the NYTimes website back then can testify, there was a time when one out of a hundred page loads would get a connection reset error. Even now, a decade later, it is still not uncommon to see one of those fuzzy/blocky little videos from their online servers just stop dead in its tracks. This indicates that the unicast server is having a bad day. Now, some of you might say, "heh, no big deal, I just refresh my page and the problem goes away; in fact, I taught my grandma to do the same, and she doesn't mind 'cause she's got nothing but time on her hands." As you can guess, this logic doesn't really win anyone over. We pack a bunch of engineering pride (and so should every computer scientist -- we're responsible for most of the world's productivity increase over the last half-century), and we see our life struggle as one of building reliable infrastructure on which others can build even more interesting services, but more pragmatically, we just don't want to end up looking like this guy. Cutting-edge protocols for stateless TCP (which some of us invented) are realistically not going to be deployed any time soon. So a simple failure analysis shows that the client-server architecture is vulnerable to single points of failure, which is as bad as things can technically get on the failure-resilience front.
Second, to assure good service, the big boys need to make sure that they map you to nearby, uncongested servers. Given a client-server architecture, they can only do this by operating servers scattered all across the globe. The Internet globe is big, very big, and getting bigger as more countries go online. It's an expensive proposition to operate a datacenter, and the costs increase with each additional location -- Akamai tries to do this and has the highest CDN costs around. When a small geographic region is segmented because of political boundaries, as it is in Europe, the costs are even higher: one would need to either place a datacenter in every market/jurisdiction, or take the hit of routing the request to a datacenter in a nearby but separate ISP. (Now, some might correctly claim that countries like Luxembourg, which, get this, is ruled by a "Grand Duke," are relics that are not worth the complication they cause for computer networks. These countries that weren't effective even as road bumps during WWII not only pointlessly complicate border crossings, require an extra seat at the UN, and occupy top-level letter combinations in the domain name system, but more importantly, their political boundaries create separate segmented markets for ISPs. And there are so many entrenched interests in keeping these markets segmented that they are not going to go away any time soon. If you are a business type, there is a nice McKinsey article about how the European Internet evolved differently from what we know as the Internet in the US that you might want to check out). The situation is different but no better in the US or Asia, where sheer geographic distances between dense population centers necessitate operating multiple datacenters, or risk having to route to far-away nodes.
Even for a client-server service with infinitely deep pockets, the datacenter positioning game is difficult to play. Certain locations are just not penetrable. For example, the federal government employs thousands of people to whom content needs to be routed, and yet there are many good reasons why third parties cannot plunk down their own hosts in the middle of the federal government network.
And it's not even clear that the folks with the deep pockets have infinitely deep pockets. The precise number and location of Google datacenters remains private, but their datacenters must be an expensive enough resource that even they have to limit the amount of bandwidth they spend on streaming your videos. At the time of this writing, YouTube provides a short burst of data when you first start streaming a video, then throttles it down to about the bitrate of the video stream. This leads to slow downloads by design, or rather, by design compensating for an inherent architectural limitation.
Yet there is plenty of bandwidth. The aggregate edge bandwidth of the Internet is staggeringly huge. All it takes is a protocol that is capable of finding nearby hosts that have your content, and routing your requests to them. This is one of the things FlixQ does.
First, our service allows you to fetch your data from multiple servers/peers at the same time. So the failure of a server is not going to be catastrophic for your download. It may even be completely masked as your requests transparently get shifted to other replicas under the covers.
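To make the failover idea concrete, here is a minimal sketch of a chunked download that shifts a request to another replica when a peer is down. It is not FlixQ's actual protocol: the `Peer` class, `PeerDown` exception, and `download` function are hypothetical names invented for illustration, the peers are simulated in memory, and chunks are fetched one after another for simplicity, whereas a real client would issue the requests concurrently.

```python
class PeerDown(Exception):
    """Raised when a simulated peer cannot serve a request."""


class Peer:
    """Hypothetical in-memory replica holding a full copy of the data."""

    def __init__(self, name, data, chunk_size, alive=True):
        self.name = name
        self._data = data
        self._chunk_size = chunk_size
        self.alive = alive

    def get_chunk(self, index):
        if not self.alive:
            raise PeerDown(self.name)
        start = index * self._chunk_size
        return self._data[start:start + self._chunk_size]


def download(peers, num_chunks):
    """Fetch every chunk, rotating across peers and skipping dead ones."""
    chunks = []
    served_by = []
    for index in range(num_chunks):
        # Start each chunk at a different peer so the load is spread out.
        offset = index % len(peers)
        order = peers[offset:] + peers[:offset]
        for peer in order:
            try:
                chunks.append(peer.get_chunk(index))
                served_by.append(peer.name)
                break
            except PeerDown:
                continue  # shift this request to the next replica
        else:
            # Only reached when every replica failed for this chunk.
            raise RuntimeError(f"no live replica for chunk {index}")
    return b"".join(chunks), served_by
```

With three replicas of which one is dead, `download` still returns the complete data, and the dead peer simply never shows up in `served_by`. The download only fails when no replica at all can serve a chunk.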
Second, our service directs clients to nearby sources of data, without routing the data through far-away datacenters. Chances are very high that there are indeed nearby copies of the data that you seek. We recently had a Solstice Parade in Seattle where a few hundred people in body-paint (or not) biked through the streets. The interest in this video is concentrated in the Seattle geographical area. It would be criminally inefficient to take this video and replicate it on a server in Las Vegas, or, for that matter, in Oregon, and stream it from there. It's already on my disk, and everyone who is interested in it lives within 10 miles of me.
FlixQ will direct our clients to these nearby sources.
Finally, our service will direct clients to the nearest replicas, even when there are thousands of replicas scattered throughout the globe. Take a typical swarm for a highly popular item. There may be tens of thousands of nodes in the swarm. The standard BitTorrent protocol, which relies on randomized tit-for-tat exchanges, will go through, on average, half the nodes in the swarm before it discovers that your neighbor next door has a cached copy. This is because randomized exchanges, while simple to implement, are an incredibly inefficient way to do a directed search. In contrast, FlixQ's coordinator directs our clients to the nearest nodes at the first point of contact. Our clients do not need to waste time probing half the nodes in order to locate the nearby peers.
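The "half the nodes" figure is easy to check with a toy model. The sketch below does not implement BitTorrent or the FlixQ coordinator. It only assumes that a client probes peers in a uniformly random order, and that a coordinator already knows each peer's distance to the client. The function names and the `distances` list are hypothetical. Under the random-order assumption, the expected number of probes before hitting the single nearest peer among n is (n + 1) / 2.

```python
import random


def probes_until_nearest(distances, rng):
    """Probe peers in random order; count probes until the nearest is hit."""
    nearest = min(range(len(distances)), key=distances.__getitem__)
    order = list(range(len(distances)))
    rng.shuffle(order)
    return order.index(nearest) + 1


def nearest_peers(distances, k):
    """Coordinator-style lookup: indices of the k closest peers, closest first."""
    return sorted(range(len(distances)), key=distances.__getitem__)[:k]


if __name__ == "__main__":
    rng = random.Random(42)
    n = 10_000
    # Hypothetical distances from the client to each peer, all distinct.
    distances = rng.sample(range(10**7), n)
    trials = [probes_until_nearest(distances, rng) for _ in range(200)]
    print("average probes with random search:", sum(trials) / len(trials))
    print("peers handed out by a directed lookup:", nearest_peers(distances, 3))
```

The random search averages roughly n / 2 probes, while the directed lookup hands back the closest peers in a single step, at the cost of someone having to maintain the distance information.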
To be clear, location is not the only factor we take into account when selecting nodes to download data from. The SideCar will explore the network and continually try to find nodes from which it can download quickly, even if such nodes are far away. But our client starts from a much better-informed position.
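This post does not say how the SideCar weighs location against measured speed, so the following is only one hypothetical way to combine the two. Each peer starts with a guess derived from its distance, and that guess is replaced by a moving average of observed throughput once transfers have actually happened. The `PeerStats` class, the `rank` function, the distance-based prior, and the smoothing factor are all assumptions made for illustration.

```python
class PeerStats:
    """Hypothetical per-peer record kept by a client."""

    def __init__(self, name, distance_km):
        self.name = name
        self.distance_km = distance_km
        self.throughput = None  # bytes/sec, unknown until the first transfer

    def record(self, bytes_per_sec, alpha=0.3):
        """Fold a new measurement into an exponentially weighted average."""
        if self.throughput is None:
            self.throughput = bytes_per_sec
        else:
            self.throughput = (alpha * bytes_per_sec
                               + (1 - alpha) * self.throughput)

    def estimate(self):
        """Measured speed if available, else an assumed distance-based prior."""
        if self.throughput is not None:
            return self.throughput
        # Made-up prior: closer peers are guessed to be faster.
        return 1_000_000 / (1 + self.distance_km)


def rank(peers):
    """Order peers from most to least promising by current estimate."""
    return sorted(peers, key=lambda p: p.estimate(), reverse=True)
```

Before any transfer, `rank` simply prefers the closest peer. Once a distant peer has been measured as faster than a nearby one, it moves to the front, which matches the behavior described above: start from location, then follow the measurements.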
Overall, a peer-to-peer architecture enables our protocol to take advantage of replicas that are naturally scattered around the network, while the big boys like Hulu and YouTube have to pay to own and operate replicas all over the globe. Datacenters cost money, not only for the equipment, but also for power and cooling. We discussed earlier how this leads to these companies effectively killing baby fish. In contrast, a peer-to-peer approach is cheaper, leaner, faster, and greener.