We, the Nokta technology team, are responsible for the infrastructure of Nokta media projects, the most prominent being izlesene.com, the largest Turkish video-sharing website. We believe we are in a unique position: stationed in a developing country with a reputation for internet censorship, trying to make a living through an internet business. And, dare I say, one of the hardest kinds: video as the product, with a revenue stream based on advertising.
If this post somehow gets attention from tech folks living in the Bay Area, where funding can be found lying on the street and every business owns a datacenter with 10k machines, let me start with a few friendly reminders: in this part of the world, no datacenter is that big, any seven-figure funding makes headlines, and advertising budgets are abysmal. While it may be the norm for businesses out there to outsource some operations to CDNs and cloud providers to cut costs and focus on their business, it's generally the other way around here. This means the list of things our team is responsible for includes networking and cabling, physical servers and their components, virtual machines, storage systems, databases, data warehouses, video transcoding, video streaming, static-content delivery, load balancers, firewalls, real-time analytics and so on. We spend our average day monitoring, troubleshooting, improving and designing the new versions of these things. March 27th was not an average day.
On March 27th, around 15:00 UTC, YouTube was banned by the Turkish government. ISPs enforced the ban through DNS spoofing, so users trying to watch a cat diving into a box or listen to Turkish pop music were reading a legal notice instead. While there is a lot to say about censorship, this post is about something else. Such a ban does not stop the average user from reaching her goal. She simply goes back to the search results and clicks the second link. That would be us.
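For the curious, this kind of spoofing is easy to observe: ask the ISP's default resolver and a public resolver for the same name and compare the answers. A quick sketch (the addresses shown are illustrative, not actual captures):

```
# The ISP's default resolver answers with the notice page,
# a public resolver answers with the real thing.
$ dig +short youtube.com
195.175.254.2     # (illustrative) address of the legal-notice server
$ dig +short youtube.com @8.8.8.8
173.194.70.93     # (illustrative) an actual YouTube frontend
```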
We are used to a similar situation when YouTube has some kind of outage: the same user behavior applies and our traffic doubles, but those episodes don't last beyond an hour. This was a total ban and it was here to stay. We had no idea how high the traffic would reach, or when the ban would be lifted for that matter. Now, after two months, the ban is still in effect and we know where we are: we are riding traffic six times bigger than what we had just a few months ago.
It wouldn't be much of a story if we had been serving 10 requests a day before and 60 now. So here are some numbers comparing the first weeks of March and May:
And this is a direct screenshot of one of our Zabbix graphs, covering February 16th to May 18th.
We know none of these numbers are enter-your-popular-webgiant-here grade, but they are not enter-your-average-website-here grade either. The interesting part for us is actually the change in the numbers. We think this kind of leap is not possible in a healthy internet ecosystem; it's the result of the unique position mentioned earlier. The same position that leaves a team of five with just over a hundred machines to keep things running. And it's sad.
It wouldn't be honest to say that we were totally unprepared, though. A good part of the last year was spent deploying and running a private OpenStack cloud, a storage cluster running Ceph, a data-processing cluster running Hadoop and Storm, several HAProxy, nginx and Varnish installations, Zabbix to monitor it all and Puppet to keep us sane, all on Ubuntu Linux. When the day came, our endeavors paid off: every one of these systems kept its promises. Such a setup would simply be impossible with any proprietary system, where standard support doesn't come through in less than a month and the advanced support we require is non-existent, not here anyway.
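To give a taste of how simple these building blocks are, here is a minimal sketch of the kind of HAProxy frontend/backend split that sits in front of web servers like ours; all names and addresses below are made up for illustration:

```
# /etc/haproxy/haproxy.cfg (excerpt, hypothetical names and addresses)
frontend www
    bind *:80
    default_backend webfarm

backend webfarm
    balance roundrobin               # spread requests evenly
    server web01 10.0.0.11:80 check  # 'check' drops dead backends
    server web02 10.0.0.12:80 check
    # scaling out is one more "server" line and a reload
```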
While each of them deserves a post of its own, here is a short overview of what we did after March 27th and how these systems helped us.
Just when we knew we needed more machines, our previously purchased hardware got delivered, the day after March 27th. By the way, a usual hardware purchase-to-delivery cycle takes about a month for us, so that was pure luck.
We deployed the new machines for video streaming. With Cobbler and Puppet in place, it takes about an hour to get a machine from bare metal to a ready state.
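The bare-metal flow boils down to a couple of commands. A rough sketch, with hypothetical machine names and addresses:

```
# Register the machine with Cobbler so it PXE-boots into an
# unattended Ubuntu install (profile, MAC and IP are made up).
cobbler system add --name=stream07 --profile=ubuntu-precise-x86_64 \
    --mac=00:25:90:aa:bb:cc --ip-address=10.0.1.57
cobbler sync

# Once the installer reboots the box, a single Puppet run
# applies the streaming role and the machine is ready.
puppet agent --test
```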
We launched additional web servers. OpenStack creates an instance in seconds, and the required environment is set up with Puppet in a few minutes.
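Booting one is a single call against the Nova API; the flavor, image and key names below are illustrative:

```
# Boot a new web instance from a prebuilt Ubuntu image;
# Puppet takes over on first boot and configures the web stack.
nova boot web09 \
    --flavor m1.medium \
    --image ubuntu-12.04-server \
    --key-name ops-key
```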
We exhausted the compute resources available to OpenStack and started adding new compute nodes. No problem.
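Roughly speaking, a new compute node is a package install plus the configuration Puppet already carries, followed by a sanity check from a controller. A sketch, assuming Ubuntu packages:

```
# On the new node (nova.conf is laid down by Puppet):
apt-get install -y nova-compute

# On a controller: the new hypervisor should appear, enabled and up.
nova service-list
nova hypervisor-list
```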
Adding new storage nodes to Ceph is business as usual for us. It didn't require further attention.
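With ceph-deploy (the standard tool at the time), growing a cluster looks roughly like this; host and disk names are placeholders:

```
# Push Ceph packages and keys to the new node, then turn its data
# disk into an OSD. CRUSH starts rebalancing data automatically.
ceph-deploy install ceph-node5
ceph-deploy osd create ceph-node5:sdb

# Watch placement groups settle back to active+clean.
ceph -w
```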
Our search index, Solr, was a single instance, and under roughly 12x the load it failed. We had to redo it as a SolrCloud. That didn't take more than a few hours; then we kept adding nodes.
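For the Solr 4.x line (the first to ship SolrCloud), the conversion amounts to pointing every node at a ZooKeeper ensemble and recreating the index as a sharded collection. A sketch with illustrative hosts, names and counts:

```
# Start each Solr node against the ZooKeeper ensemble.
java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar

# Recreate the index as a sharded, replicated collection.
curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=videos&numShards=2&replicationFactor=2'
```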
Some user-facing servers required kernel tuning, done with sysctl.
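The tunables were the usual suspects for boxes holding lots of concurrent connections. A representative set, not our exact values:

```
# /etc/sysctl.d/60-highload.conf (representative values, not ours)
net.core.somaxconn = 4096                  # longer accept queue
net.ipv4.tcp_max_syn_backlog = 8192        # absorb connection bursts
net.ipv4.ip_local_port_range = 1024 65535  # more ephemeral ports
net.ipv4.tcp_fin_timeout = 15              # recycle sockets faster
fs.file-max = 1000000                      # plenty of file descriptors

# apply with: sysctl -p /etc/sysctl.d/60-highload.conf
```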
We'd like to close by thanking the open source communities of all kinds, for this kind of story would not have been possible without them. We'll be posting more technical details soon.
Note: Since this post ended up on my personal blog, I have to write about the team:
- Hakan Kocakulak (@hakankocakulak): Team Leader
- Caglar Bilir (@caglarbilir): Systems
- Ahmet Kandemir (@ahbikan): Systems
- Selcuk Tunc (@ttselcuk): Software
- Erdem Agaoglu (@agaoglu): Software
Note: At the time of writing, there are rumors of the YouTube ban being lifted. Our metrics still say otherwise.