
Jonathan Boutelle is cofounder and CTO of SlideShare, a site for the online social sharing of slideshows. A software engineer by training, his interests lie at the intersection of technology, business, and customer experience. He studied computer science at Brown University, and previously worked as a software engineer for Advanced Visual Systems (a data visualization company), CommerceOne (a B2B enterprise software company), and Uzanto (a user experience company). He writes an occasional article on his blog.

In this interview, we discuss:

  • The scale of SlideShare (45 million visitors per month)
  • Two kinds of expensive missteps with the cloud
  • Aligning SaaS revenue with cloud costs
  • Concrete examples of the benefits of hybrid architectures
  • The higher administrative burden of infrastructure as a service
  • The opportunity to build cloud-hosted, special-purpose application services
  • Front-end complexity and back-end cost risks

Robert: Jonathan, could you take a moment and introduce yourself?

Jonathan Boutelle: Sure. I’m the CTO and cofounder of SlideShare. I came up with the idea for SlideShare four years ago when I was organizing an unconference, a BarCamp, in Delhi, India. People were coming up and asking me how they could put a PowerPoint into a wiki. That was when I got the idea for SlideShare.

Prior to SlideShare, I had a small online consulting company, and before that, I was a software engineer at a B2B startup called Commerce One.

Robert: Could you talk a little bit about the scale you’re working with at SlideShare, in terms of your traffic handling, the amount of data you’re storing, the number of documents, and those kinds of things?

Jonathan: We handle 45 million unique visitors a month and we’re growing at about 10 percent a month right now. We handle tens of thousands of new documents every day. Once they get uploaded to the system, just the process of converting all of those documents and preparing them for viewing on the web is a scaling challenge in its own right.

We have tens of millions of documents in our repository, and we’re getting a lot more every day. It’s a really big site that has a lot of simultaneous load on it at all times, because we’re very global.

Robert: Can you talk a little bit about the stack that you run?

Jonathan: We have a strong preference for open source software, because it’s easier to tinker with and troubleshoot if it doesn’t work. We use MySQL on the back end.

Robert: Are you using MySQL in a relational way or more in a non-SQL fashion?

Jonathan: We’re using MySQL in a classic relational way. It’s not like the kind of stuff you’ve heard about from Facebook, where MySQL is essentially used as a key-value store. We’re doing traditional grouping and sorting over business objects.

One of the saving graces of SlideShare is that, from a scaling perspective, it’s a lot of read traffic. There’s not a fantastic amount of write traffic, because the overwhelming amount of activity that comes to the site is people browsing and reading content. That’s quite a bit different from a site like Twitter or Facebook where there’s tremendous amount of content being written into the system by the people who are using it.

There’s really no way to use Facebook without writing a lot of data into their database, but at SlideShare, we get a lot of people browsing and looking at content. It’s analogous to YouTube, in that sense. The nice thing about that use case is that you can put many layers of caches in between the user and the database, which can help you scale up to a very high level while using a fairly traditional database architecture.

Our first tier of caching is a reverse-proxy cache. We use Varnish, and we keep HTML pages around once they’re rendered. If you’re viewing a slideshow and you’re not logged in, we keep them around for four hours or so. We’ll happily serve that up to you if you come along and request that page.

The next layer back is a tier of memcached servers where we save data: the data that we use to build up web pages. If we retrieve, say, a user name for a particular user, we’ll save that in memcached. If we need that information again within a certain amount of time, we’ll pull it from memcached rather than bothering the database with it.

The database is the last layer. The stuff that doesn’t get caught by those two caching layers is what comes back to the database.
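The read-through pattern Jonathan describes (check the cache, fall back to the database on a miss, then populate the cache) can be sketched in a few lines of Python. This is a minimal in-memory stand-in, not SlideShare’s code; the `TTLCache` class and `fetch_user_name` helper are illustrative names.

```python
import time

class TTLCache:
    """Minimal in-memory stand-in for a memcached tier: entries expire after ttl seconds."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self.store[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

def fetch_user_name(user_id, cache, db):
    """Read-through lookup: try the cache first; only a miss touches the database."""
    key = f"user_name:{user_id}"
    name = cache.get(key)
    if name is None:
        name = db[user_id]      # expensive backing store (here, just a dict)
        cache.set(key, name)
    return name
```

Within the TTL window, repeated requests for the same user never reach the database, which is exactly how layered caches let a read-heavy site scale on a traditional database architecture.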

Robert: You’ve got a great article titled, “Lessons from SlideShare: Cloud Computing Fiascos and How to Avoid Them” where you talk about how to lose $5,000 without even trying. People think of the cloud as a huge cost saver, but what can you share about expensive missteps?

Jonathan: I think there are two categories of expensive missteps that most people are susceptible to when they’re starting to deploy cloud computing in an enterprise environment. The first is big blunders. What happened to us was that we were doing some very heavy Hadoop-based log analysis, and the software was not able to crunch through the data fast enough. We decided just to throw more hardware at the problem.

It’s very seductive to do that, because if you throw 100 servers at the problem, you’re still only paying several dollars an hour, which feels very affordable. The problem can come when you don’t make sure to shut it down as soon as the work is done or as soon as you’ve determined that there’s actually a problem with your software rather than a problem with the availability of hardware.

We ended up leaving the servers running for several days, and we got a very high bill that month from Amazon because we had been so sloppy. This just doesn’t happen with conventional hardware, because you wouldn’t buy 100 servers to see whether throwing hardware at a temporary problem will fix it. Because the cloud gives you that power to scale up so fast, you need to make sure you remember to scale back down when you don’t need it anymore. You need to have more discipline, not less.

The second kind of problem that can really bite you is just the drip, drip, drip of occasional servers that have been spun up and haven’t been shut down. The first category I described was like the big screw up. This category is more like just being a little bit sloppy, having several servers sitting around. This happens even with conventional hardware.

There’s that box in the corner that nobody really knows what it does, but everybody is afraid to unplug it, because they think maybe it does something critical. You can get many more metaphorical boxes in the corner with cloud computing, because you’ve empowered more people in the organization to do procurement. Procurement is just spinning up a node.
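A periodic sweep for those metaphorical boxes in the corner can be as simple as flagging instances that have run past a cutoff or carry no owner tag. The sketch below assumes instance metadata has already been fetched from the provider’s API; the field names and thresholds are illustrative assumptions, not any vendor’s schema.

```python
from datetime import datetime, timedelta

def find_forgotten_instances(instances, max_age_hours=72, now=None):
    """Flag instances that have run past a cutoff or have no 'owner' tag.

    `instances` is a list of dicts with 'id', 'launched_at' (datetime), and
    'tags' -- a hypothetical shape; a real sweep would read it from the
    cloud provider's API.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(hours=max_age_hours)
    suspects = []
    for inst in instances:
        too_old = inst["launched_at"] < cutoff
        unowned = "owner" not in inst.get("tags", {})
        if too_old or unowned:
            suspects.append(inst["id"])
    return suspects
```

Run from cron and emailed to the team, even a crude report like this turns the "drip, drip, drip" into a visible list someone has to justify.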

Robert: When I’m traveling in Europe with data roaming turned on, I’ll get a message from AT&T telling me that I have very heavy data usage and high costs associated with it. That really doesn’t happen with the cloud, but what kinds of alerts like that would you like to see that would help detect those kinds of $5,000 mistakes before they happen?

Jonathan: I think that alerting on the basis of spikes in costs, like you described with the AT&T scenario, would be extremely helpful. I also think that daily or weekly reporting of costs would be extremely valuable. When you drill down into your spend on cloud computing, it can be challenging to figure out exactly where the money is going, when the costs originated, who authorized them, and things like that.

Being able to get a weekly or a daily report of what your spend was and a chart that shows the difference between today and yesterday would go a long way toward helping organizations cut out these kinds of extra costs.
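The spike alert Jonathan asks for can be approximated with a trailing-average check over daily spend figures. This is a hedged sketch, assuming only that you can export one spend number per day from your billing data; the window and threshold are arbitrary starting points.

```python
def spend_alert(daily_spend, window=7, threshold=1.5):
    """Return True if the most recent day's spend exceeds `threshold` times
    the average of the preceding `window` days. Not enough history -> False."""
    if len(daily_spend) < window + 1:
        return False
    *history, today = daily_spend[-(window + 1):]
    baseline = sum(history) / len(history)
    return today > threshold * baseline
```

Paired with a daily report, a check like this catches the AT&T-roaming scenario: the bill jumps the day the mistake happens, not at the end of the month.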

Robert: I hear a lot of this stuff first-hand from our Windows Azure customers, and there are definitely a lot of parallels to the mobile phone industry. If you look back to what mobile plans looked like five or 10 years ago, compared to how they have evolved today, they are definitely much more attuned to how users consume. At one time, you had to do a lot of math to figure out what your costs were going to be.

You’ve also talked about the freemium model, and you posted a great slide about this. When a cloud provider charges by the drink and a SaaS (software-as-a-service) provider wants to charge per user, how do you determine where you’re going to draw that free/premium line?

Jonathan: I find pricing interesting. The link between SaaS and freemium and cloud computing is that, in all cases, you’re paying for cloud computing resources as you use them. Presumably, if you’re running a SaaS or a freemium business, you’re collecting money as your users use it. The challenge with freemium is that there’s a large percentage of your users that are not paying you. So maybe you’re relying on them as a distribution strategy.

You’re hoping that your free users will convert to paid, and what that means is that you’re starting to pay for computing resources at the beginning, but you’re only collecting money once a given user converts to being a paying customer, which might be two or three months out and is only going to happen a certain percentage of the time.

That makes business modeling a little bit more complicated, but it’s still much better to use a cloud computing solution where you can spin up more compute resources as you have more users than to have to front load that cost and pay for the users that you hypothetically hope that you’ll get.
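That timing gap (infrastructure costs from day one, revenue only after conversion) can be made concrete with a toy cohort model. The `months_to_breakeven` function and all the numbers below are purely illustrative assumptions, not SlideShare’s actual economics.

```python
def months_to_breakeven(users, infra_cost_per_user, price, conversion_rate,
                        conversion_lag_months, horizon=36):
    """Toy freemium model: every user costs infra_cost_per_user per month from
    month 1; a conversion_rate fraction starts paying `price` per month after
    conversion_lag_months. Returns the first month in which cumulative revenue
    covers cumulative cost, or None if that never happens within the horizon."""
    paying = users * conversion_rate
    cost = revenue = 0.0
    for month in range(1, horizon + 1):
        cost += users * infra_cost_per_user
        if month > conversion_lag_months:
            revenue += paying * price
        if revenue >= cost:
            return month
    return None
```

The model makes the lever visible: shrink per-user infrastructure cost (which pay-as-you-go cloud pricing helps with) or shorten the conversion lag, and breakeven moves earlier; let free users get too expensive and it never arrives.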

Robert: You’ve also talked about cloud advantage of “success-based scaling.” Can you elaborate a little bit on that?

Jonathan: What’s really powerful about cloud computing is that the cost of failure is dramatically reduced. When the cost of failure is low enough, innovation can happen much more freely. You can do experiments assuming that the vast majority of them are going to fail and that in your portfolio of experiments, one will work. It’s only when it works that you start to incur real infrastructure costs.

This is really powerful, because it means that you can try to build a lot of different types of solutions. That means you’ll probably get more innovative, creative solutions to the problems coming faster.

Robert: You’ve also talked about the dangers of storage sprawl with the cloud. Can you talk a little bit about knowing what to store? After all, with big data and distributed-cost processing, you can ask a lot of “I wonder” questions if you’ve bothered to archive the data.

Jonathan: I think storage fits in the same category as compute, really, in the sense that because there’s no hard limit on how much storage you have, it’s easy to go overboard and just store everything. It’s especially easy to be sloppy and then not know exactly what you’re storing and where you stored it.

If you had a conventional disk array, your system administrator would come back to you much earlier and say, “Look, we’re running out of space. We need to prune this data and only save the things that are necessary.” The constraint of physical hardware forces you to be more disciplined, so in the case of cloud computing, you need to have more sophisticated processes.

You need to address what is saved where, what the policies are for what data should be saved, and automating the process of removing data from storage when it’s no longer needed. That helps you contain your costs and make sure that you’re only saving the valuable data.

Both storage and compute resources are becoming cheaper over time, so data is becoming more valuable because the cost of working on it is lower and the insights that come out of it are still worth the same. Therefore, you probably want to save a lot of information, but you still don’t want to save everything. You need to make sure that your team is on the same page and is only saving the data that’s required.

For example, we save our load balancing logs for a couple of months on the off chance that we’ll want to parse through them and understand our traffic patterns. But the log files themselves are just too bulky to save them forever on the hypothetical basis that they’ll be useful for something someday.
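A retention policy like the one Jonathan describes (keep load-balancer logs for a couple of months, then prune) can be automated with a simple rule table over object keys. The prefixes and retention periods below are illustrative assumptions, not SlideShare’s actual layout.

```python
from datetime import datetime, timedelta

# Hypothetical key prefixes mapped to how long their data should be kept.
RETENTION = {
    "lb-logs/": timedelta(days=60),    # load-balancer logs: a couple of months
    "app-logs/": timedelta(days=14),
}

def expired_keys(objects, now):
    """Given (key, last_modified) pairs, return keys whose prefix has a
    retention rule and whose age exceeds it. Keys with no rule are kept."""
    doomed = []
    for key, last_modified in objects:
        for prefix, keep_for in RETENTION.items():
            if key.startswith(prefix) and last_modified < now - keep_for:
                doomed.append(key)
                break
    return doomed
```

A nightly job would feed this a listing from the object store and delete whatever comes back, turning the retention policy from a team agreement into an enforced rule.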

Robert: You’re definitely pretty bullish on the cloud, but you’ve also written about hybrid advantages. Can you talk us through some of those?

Jonathan: The cloud has one huge Achilles heel, which is I/O performance. It is usually very slow to access the disk in a cloud-based solution because, with virtualization, there’s another layer of software between you and the disk. So, for example, if you’re trying to build a conventional web application, you might need a very high-performance database. At SlideShare, our database server has eight 15K-RPM spindles and 32 gigs of memory. It’s basically just an I/O monster.

You can’t get something like that in the cloud, which means that if you’re going to build a really big website that’s 100% in the cloud, you have to have a much more complicated back end data model. You have to do all of your sharding from the very beginning. That can be complicated and expensive. I think hybrid architectures are really exciting: you have a back end database that’s a very high-performance physical machine, surrounded by proximate cloud computing nodes handling the web application tier, the web server tier, and everything else except the data layer, where you need very high I/O throughput.
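Sharding from the very beginning means every read and write must first be routed to the correct database, which is the extra complexity Jonathan is pointing at. Here is a minimal sketch of that routing layer, with plain dicts standing in for MySQL shards; the class and method names are illustrative.

```python
class ShardedStore:
    """Minimal sketch of the sharded data layer a 100%-cloud design would need:
    each shard is a separate database, and every query must first be routed."""
    def __init__(self, num_shards):
        # Each dict stands in for one MySQL instance.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, user_id):
        # Naive modulo routing; production systems prefer consistent hashing
        # so that adding shards doesn't reshuffle every existing key.
        return self.shards[user_id % len(self.shards)]

    def put(self, user_id, record):
        self._shard(user_id)[user_id] = record

    def get(self, user_id):
        return self._shard(user_id).get(user_id)
```

Even this toy version shows the cost: cross-shard queries, rebalancing, and backups all become application problems, which is why a single I/O-monster database behind a hybrid architecture can be the simpler design.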

It’s interesting to consider who’s going to arrive at a really good solution like that first. You could imagine cloud computing vendors like Azure and Amazon renting out access to dedicated hardware on an hourly basis. I’m not sure whether there are plans to offer something like that, but it would certainly meet a very compelling need. On the flip side, aggressive hosting providers are moving rapidly into the cloud computing space. You have companies like Rackspace and SoftLayer who are offering cloud computing more and more in addition to their very mature dedicated hosting offerings.

So the question is: who’s going to arrive first at a hybrid computing nirvana where you can get everything from one vendor and it’s really good? Nobody’s there right now, and that’s why at SlideShare, we actually have a hybrid architecture that uses different vendors. We have our dedicated hosting at SoftLayer and our cloud computing at Amazon.

Robert: Maybe you can comment on some of the trends you’re seeing in the industry. The infrastructure-as-a-service players are starting to move toward becoming more like platform-as-a-service. Then you have platform-as-a-service vendors, us in particular, moving a little bit toward infrastructure-as-a-service.

It seems like we’re going to meet somewhere in the middle. I think the distinction between infrastructure and platform as a service is going to go away. We’re doing it primarily because we need to make it easier for customers to on-ramp into a platform as a service. Any thoughts on how you see the market moving there?

Jonathan: Well, I would agree that there’s convergence and that everything is basically becoming platform-as-a-service, because that delivers so much more value to a customer than raw infrastructure-as-a-service. The sysadmin requirements of working with infrastructure-as-a-service are, if anything, higher than using dedicated hosting solutions, because you need to figure out not just how to administer all these servers, but also how to handle the case that they’re likely to disappear at any moment because they’re virtual computers, rather than physical ones.

I think that platform-as-a-service is going to be a really huge trend in the coming years, as evidenced by Salesforce acquiring Heroku and Amazon Web Services launching Beanstalk and talking about using Engine Yard as a potential platform as a service for their Ruby community. I think that Beanstalk is particularly interesting, because I’ve been surprised at how long it’s taken for there to be really credible platform-as-a-service offerings for Java.

This space has huge market share, and it’s been completely underserved relative to Ruby on Rails, for example, which has two excellent platform-as-a-service offerings competing for developer mind share.

Robert: From the enterprise perspective, I’ve talked to a bunch of architects and senior execs around this issue of cloud adoption. At least within the enterprise, a lot of these organizations just aren’t ready to move some of their data to a public cloud. It’s been one of the biggest barriers of adoption.

Jonathan: You know, enterprise IT people will be waiting 10 years to use this. As a startup, I can adopt the good new stuff immediately because I don’t have hang-ups, and that’s a competitive advantage. I don’t spend a lot of time worrying about that. I do think there’s another trend, though, that is just as big and probably doesn’t get a lot of attention. I don’t even know what the word for it is, really, but it’s offering point solutions to particular application problems.

For example, SendGrid is the vendor we use to send email at SlideShare. It completely outsources the entire technological problem of delivering emails to a bunch of inboxes, doing rate limiting, making sure that there aren’t too many spam complaints, all that kind of stuff.

Similarly, Recurly is the provider that we use for handling our recurring billing. That means that we don’t have to build our own billing system. We’re looking at vendors for other things as well. Video transcoding is a really good example of something that you can just outsource and use on the basis of a REST API that you talk to from a provider.

I don’t know what the word is for that, but I think it’s a huge trend that has definitely made it easier to build creative, new solutions to problems. Because you don’t have to build the entire solution yourself.

Robert: Forrester analysts talk about being a pure cloud provider, a pure cloud application. I know exactly what you’re talking about here. We have a number of companies who have basically architected applications that primarily interface with RESTful APIs, but they perform a particular function and a whole new set of functionality that they can offer in a way that can leverage the scale out that a cloud application has to provide.

I’ll give you an example. There’s a company called RiskMetrics. I think they’re now called MSCI; they were acquired by that company. They do sophisticated simulations, called Monte Carlo simulations, to analyze the portfolios of hedge funds and look at very complicated instruments like collateralized debt obligations. They’ll spin up anywhere from 10 to 20 thousand servers at a time: go in, run their analysis, go back out.

It’s just amazing. I’ve also seen another company called MarginPro in the US doing the same thing. They’re evaluating the profitability of a bank’s loans in the market. Every night, they pull down the rates and run the analysis. Again, they didn’t have to make any capex investments to build out all those servers.

Think of how much lower the bar is going to be for startups that don’t need that initial round of angel investment to pay for capex.

Jonathan: Absolutely, and that’s exactly what Amazon has talked about from the beginning. Removing the stuff that everybody always has to do and centralizing it in one place. Startups and general businesses using IT can focus on the higher level, and can focus on the incremental added value, rather than the core infrastructure that always has to be done.

Platform as a service is the next jump in those terms. Infrastructures that are accessed via APIs are another big jump in that direction. What it means is that you can rapidly prototype a new idea with very low capital requirements, bring it to market, see what the response is, and then invest in it only if it starts to get traction.

Robert: There’s a site called Data Center Map that lets you search for data centers in specific geographic locations. For a global site like SlideShare, how do you balance centralization for efficiency with being close to users for performance?

Jonathan: We centralize our infrastructure, our dynamic page generation, for efficiency. We have one big cluster of physical servers and one big cluster of cloud servers, but the thing to realize about a site like SlideShare is that nine tenths of the traffic is actually downloading content, rather than HTML.

So it’s the slides that are the overwhelming majority of the bandwidth load, and all that goes through our CDN infrastructure. If it misses on the CDN, it goes back to our cloud storage infrastructure on Amazon.

So nine tenths of the traffic on SlideShare never, ever hits a server that we’re personally responsible for keeping up. It’s all handled by Akamai and Amazon. That’s tremendously liberating, and it lets you focus while leaving tasks like administering huge, constantly growing storage arrays to a third party.

Robert: With less complexity associated with in-house IT, where should startups know that the new complexity will show up?

Jonathan: Interestingly enough, the new complexity for us is on the front end. As we start to explore HTML5 features and build solutions on top of WebSockets, we have to take into account the fact that 95 percent of the browsers out there don’t have WebSockets yet, and that number is changing fast. So we’re having to be a lot smarter in terms of our front-end coding, and we’re having to be a lot more clever in terms of our JavaScripting.

The fact that your infrastructure is dynamic exposes you to the risk of essentially unlimited potential costs. That means you have to be much more careful, and you have to build monitoring systems for that yourself, especially since vendors don’t seem to do a very good job of providing those systems right now.

We’ve written scripts that try to keep track of that cost on a daily basis, and we look at the data pretty carefully. Operations pays a lot of attention to what our current spending is.

That’s a definite area of increased complexity. Another megatrend, I think, is server automation and making sure that operations people don’t end up doing the same job more than once. That’s a best practice in a traditional hosting environment, but in the cloud, it’s even more necessary, just because you have computers appearing and disappearing on a continual basis.

You need to be able to have a fully scripted way of creating a computer with a particular role. We use Puppet for that, which is a really great infrastructure for automating the configuration of your servers.

Robert: Is there anything else you’d like to talk about?

Jonathan: One thing I’m personally really excited about is a new SlideShare feature that we are launching next week, called ZipCasting. ZipCasts are very easy to start online meetings that are completely browser-based.

You can start a ZipCast with one click, and you can invite someone to join a ZipCast with one click, and then you’re sharing slides with them, and you’re broadcasting video to them. This is a much faster way of doing online collaboration than has traditionally been available, and it’s also at a much lower price point. The majority of the features are free. What you pay for is password protection and ad removal.

I’m really excited about the potential for ZipCasts to create a new type of social media experience. We think of it as being Ustream for nerds: a real-time, social-learning, one-to-many experience that is driven by a social media website.

Robert: One thing that I’m still waiting for is more robust collaborative whiteboarding.

Jonathan: That is definitely a pain point for our organization when we’re having remote meetings. There is one company that I’ve heard of that has been working on a whiteboarding app for the iPad, which is pretty cool.

Robert: I feel like what you’re doing here would be the perfect service to acquire through the iPad, right? Or through any other tablet technology for that matter.

Jonathan: What we’re really waiting for is for the front facing web camera on the iPad 2. Once that comes out, online meetings on the iPad will really pop, because you’ll be able to broadcast video and you’ll be able to advance slides. That will be really cool.

Robert: Are you guys keeping a close look at Honeycomb and some of the other Android-based tablets like Xoom as well?

Jonathan: It’s funny that you should ask that. A lot of the developers in our Delhi office have taken to carrying around these seven inch Android tablets. They’re really enjoying them as a lighter way of having a computing device with them for taking notes during meetings and things like that.

We do a lot of testing of our mobile web site on these Samsung Galaxy Tabs, as well as on iPhones, iPads, and everything else. I don’t see the tablet market as going 100 percent to Apple, but the Android devices are only starting to come out now. We’ll just have to wait some time before we really see what they can do.

Robert: Thanks a lot, and good luck with your February 16th launch.

Jonathan: Thank you.
