Todd Papaioannou is currently at Yahoo!, in the role of vice president of cloud architecture for the cloud platform group. Before taking on that role, Todd was responsible for new product architecture and strategy at Teradata, including driving the entire cloud computing program. Before that, he was the CTO of Teradata’s client software group. Prior to joining Teradata, he was chief architect at Greenplum/Metapa. Dr. Papaioannou holds a PhD in artificial intelligence and distributed systems.
In this interview, we discuss:
- Virtualization as the megatrend of this decade
- The world’s largest Hadoop clusters
- Cloud benefits to businesses large and small
- “If it’s not your business to be running data centers, don’t do it.”
- Analyzing 100 billion events per day
- A future dominated by hybrid clouds
Robert Duffner: Todd, could you introduce yourself and describe your background and your current role at Yahoo?
Todd Papaioannou: I’m the chief architect for cloud computing here at Yahoo. My official title is VP of Cloud Architecture, and I’m responsible for technology, architecture and strategy across the whole of the cloud computing initiatives here at Yahoo!.
The Yahoo! cloud is the underlying engine on which we run the business, and we like to think of it as one of the world’s largest private clouds. So my responsibilities span edge, caching, content distribution, multiple structured and unstructured storage mechanisms, serving containers, and the underlying cloud fabric we’re focused on rolling out that makes it all possible.
I’m also responsible for Hadoop and the cloud-serving container architecture, as well as all of the data capture and data collection across the whole of the Yahoo! network. We dedicate a lot of energy to pulling together a very wide range of technologies as an internal platform as a service.
Robert: I imagine that making the move from Teradata to Yahoo! was significant for you personally. Given that your career has been focused on cloud computing for some time now, has the move to Yahoo made a difference in terms of what kinds of projects, initiatives, and solutions you can personally lead or develop in the cloud computing space?
Todd: Absolutely. At Teradata, I was responsible for driving the cloud computing program from a blank piece of paper through launch and delivery of multiple products, so I have been focused on the cloud for a number of years as well as all of the other big data initiatives and future-facing stuff.
I played the role of looking to the future and helping to drive product strategy and product architecture across the Teradata portfolio. But I became very involved with all of the cloud stuff as I drove that program and saw that this was a very compelling and very interesting part of the market space.
So coming to Yahoo! was an opportunity to help drive one of the largest private clouds in the world. There are probably only two or three other companies in the world that deal with these issues at the scale that Yahoo! does, so it’s a fantastic opportunity. It also allows me to work closer to the consumer.
Robert: You said in a presentation that you developed while at Teradata that “virtualization is the megatrend of the next decade.” Do you still feel that’s the case? And what do you think has the potential to supplant it, either in this decade or the next?
Todd: I still think it’s turning out that way. Virtualization is a megatrend that’s going on in data centers around the world right now, and virtualization is actually just one component of cloud computing. There’s a lot more that goes into cloud computing as a layer above virtualization, and I think the self-service and elasticity aspects are particularly interesting.
As I look to what is going to change stuff in the future, ubiquity of devices is clearly another megatrend, as is the explosion of data. When you take massive data, massive sets of devices, and cloud computing together, you start to see a slightly different vision of how software needs to be built, abstracted, and developed to support both the enterprise and the consumer, going forward.
Robert: Let’s talk a little bit about Hadoop. In a quick set of back and forth tweets with Barton George, you clarified who has the largest Hadoop clusters, with Yahoo, Facebook, and eBay being the largest, in that order. Who else is in the top 10, in your estimation, and are there any particularly interesting implementations that fall outside of the top 10 that you think bear watching?
Todd: The top 10 is probably made up of West Coast, Bay Area web companies that are generating huge amounts of data, particularly social graph data, and are finding that traditional tools are not that great for analyzing it. Outside of the ones you mentioned, it’s Twitter, LinkedIn, Netflix, and those types of folks.
We’re also starting to see penetration into the financial industry, where they have huge amounts of data to process as well. US government agencies like the CIA and NSA are using Hadoop now, and they use the Yahoo! distribution of Hadoop to do their processing. They won’t tell me what they’re doing, but I’d love to know. [laughs]
Robert: You mentioned when you were part of a panel discussion at Structure 2010 that cloud computing enables business users to be separated from infrastructure cycles, so each can move at a different pace. Can you unpack that statement a little bit and tell me what benefits you think cloud computing provides to both smaller and larger businesses?
Todd: To take a simple view of the IT business, there’s infrastructure you need to purchase, put in place, and manage. Infrastructure buying cycles tend to be fairly long, because it’s a big investment and you want to make sure that you’re doing the right thing.
That can be a challenge for a small business, a business unit, or someone that’s close to the customer, because they need to move at a much faster pace in today’s business climate. Cloud computing, in my mind, allows you to decouple the business logic from the underlying infrastructure and allow those two things to move at separate paces.
As an analogy, consider the fact that building a road, which is infrastructure, takes quite a long time, but small businesses can spring up or shut down along that road, and people can build houses much more quickly. In the same way, cloud computing enables the business to iterate much more quickly, because they don’t have to worry about purchasing infrastructure.
Robert: You’ve tweeted that you see an enormous amount of innovation ongoing today with Hadoop. What excites you the most about the future of that project as a whole?
Todd: I think we’re at an inflection point for Hadoop. Obviously, we at Yahoo! are extremely proud that we have created and open sourced Hadoop. Over the last four or five years, we’ve continued to invest in that environment, and right now we have around 40,000 machines running Hadoop at Yahoo!, which is clearly a huge number, and growing all the time.
There’s another set of folks now who are starting to use Hadoop at a smaller scale, and the exciting thing, I think, is that there is now an ecosystem springing up. There are vendors coming into the ecosystem with new tools and new products, and people starting to innovate around the Hadoop core that we built.
Robert: Could you comment on how important the innovation around the core software for the cloud is, in terms of everything that has to happen around running the operations at data centers?
Todd: If you think about the entire business, the data center is the lowest level of infrastructure, and then you have the cloud running above that, and then in our case, the business of web properties running above that. There’s a huge amount of innovation that has to happen on a vertical basis.
We’ve been driving a lot of innovation in how we design our data centers. We recently opened a new data center that got some awards for its design. It’s designed [laughs] like a chicken coop, basically, so it’s self cooling in some respects. That was a great, novel approach to some of those problems. When you roll out a cloud, you’re basically trying to build infrastructure.
I want to be able to shunt workloads around from data center to data center depending on changing conditions. For example, we need to respond if a data center is getting too hot, or if we are getting a surge of traffic because the U.S. is waking up, and that sort of thing.
That can’t really be done by humans in front of a keyboard. What you really need to be thinking about is trying to automate everything. One of the big initiatives I’ve been pushing is to automate everything in the cloud so we can have more of a high-level thought process around control, rather than a low-level, tactical one that focuses on shunting around specific workloads on an as-needed basis.
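The policy-driven control Todd describes can be illustrated with a small sketch. Everything here is hypothetical: the thresholds, data center names, and functions are inventions for illustration, not Yahoo!’s actual tooling.

```python
# Hypothetical sketch of automated workload placement: given current
# telemetry for each data center, decide which workloads should move,
# instead of having a human shunt them around by hand.
from dataclasses import dataclass

@dataclass
class DataCenter:
    name: str
    temp_c: float    # inlet temperature in Celsius
    load_pct: float  # current utilization, 0-100

# Illustrative policy thresholds, not real operating limits.
MAX_TEMP_C = 27.0
MAX_LOAD_PCT = 80.0

def needs_relief(dc: DataCenter) -> bool:
    """A data center needs relief if it is too hot or saturated by surge traffic."""
    return dc.temp_c > MAX_TEMP_C or dc.load_pct > MAX_LOAD_PCT

def plan_moves(centers: list[DataCenter]) -> list[tuple[str, str]]:
    """Pair each stressed data center with the least-loaded healthy one."""
    stressed = [dc for dc in centers if needs_relief(dc)]
    healthy = sorted((dc for dc in centers if not needs_relief(dc)),
                     key=lambda dc: dc.load_pct)
    return [(src.name, dst.name) for src, dst in zip(stressed, healthy)]

centers = [
    DataCenter("west", temp_c=31.0, load_pct=85.0),   # hot and busy
    DataCenter("east", temp_c=22.0, load_pct=40.0),
    DataCenter("central", temp_c=23.0, load_pct=55.0),
]
print(plan_moves(centers))  # [('west', 'east')]
```

A real system would of course act on richer signals and move workloads gradually, but the shape is the same: encode the policy once, then let the control loop make the low-level, tactical decisions.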
Robert: How can organizations that want to move to the private cloud benefit from the lessons learned by big companies like Microsoft and Yahoo! that have gone before them?
Todd: If it’s not your business to be running data centers, don’t do it. You need to make it Someone Else’s Problem. Yahoo!, Microsoft, and a few others out there are in the business of running data centers, and smaller companies should take advantage of that availability.
Companies that do need to run their own data centers, for whatever reason, can benefit from the fact that we have open sourced our infrastructure code. One of our stated goals is to open source all the underlying software that sits in our cloud, and we’ve done that today so far with Hadoop, which is our big data processing and analytic environment, and also more recently with Traffic Server, which is our caching and content distribution network software.
And we do that for a particular reason, which is that if you’re building software internally, the minute you deploy it and no one else externally is using it, it’s already on a path to legacy. You can continue to invest in that software, but you’re continuing to invest in a one-off solution.
We like the open source world, because if we can build a community around a piece of our software and drive it to be a de facto standard, we can build a measure of future-proofing into our software. If people are already working with it outside of the company, we can also hire people who have previous experience with the software.
For the first time, we recently acquired a company that built its product on top of Hadoop, which helps validate our belief that open sourcing our infrastructure software benefits not only us, but the rest of the world as well.
Robert: Considering the role of Linux to the enterprise on servers, do you see an analogous software package developing for the cloud?
Todd: I don’t think that has quite resolved itself yet. There’s a lot of competition among the big players like Microsoft, Amazon, and Rackspace. Amazon clearly has a lead, but it’s not insurmountable. And then there’s obviously the open source world, which includes Eucalyptus, OpenStack, Deltacloud, and others.
It’s an exciting time to be working in this landscape, and that’s one of the reasons I came to Yahoo!. There’s a huge amount of innovation going on at every level of the stack, from way down at the hardware level, all the way up to the cloud service level.
Virtualization, a massive expansion in server computing power, and low prices have really acted as catalysts. I really see the cloud as an abstraction layer above a set of underlying compute, storage, bandwidth, and memory resources. That abstraction allows you to get access to those resources on demand.
Because of that, one of the big initiatives I’m driving here at Yahoo! is to think of cloud computing resources as a utility, just like electricity or cell phone minutes. You should be paying for the utility when you need it, as you need it.
Robert: During a panel discussion on big data, you mentioned that Yahoo is analyzing more than 45 billion events per day from various sources to help direct users to the right content and resources on the web. From the user perspective, how does an emphasis on cloud computing technologies enhance their experience with Yahoo as a portal?
Todd: First, just to correct the number there, either I said the wrong number, or I was just talking about audience data. We actually deal with 100 billion events a day. That covers audience data, advertising data, and a bunch of other events that happen across the Yahoo! network.
Our goal at Yahoo! is basically to offer the most compelling and personally relevant experience to our end users. To do that, we need to understand stuff about you, such as whether you’re into sports, travel, finance, or other topics. And we need to do that as you span across our multiple properties.
At Yahoo!, we have hundreds of different web properties, each with a different focus and context. So even if you were interested in sports, it may not be so relevant for us to show you a piece of sports content when you’re on Yahoo! Finance.
Because of that, we use all of the events that we collect, and we use Hadoop to do all the processing, to drive better user understanding; that way we’re able to do better content targeting and, ultimately, better behavioral targeting from an advertising standpoint.
Our ultimate goal is to understand you across all of our properties, and depending upon what context you’re in, to understand the content you’ll be interested in. Based on that, we want to be able to put a contextually relevant advert close to that content to better drive engagement for our advertising customers.
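As a toy illustration of the kind of per-user aggregation a Hadoop job performs over event logs: the event fields below are hypothetical, and plain Python stands in for an actual MapReduce job.

```python
# Toy sketch of deriving user interest profiles from raw events.
# In a real Hadoop job the "map" phase would emit (user, property) pairs
# across a cluster and the "reduce" phase would count them per user; here
# both phases run in-process for clarity.
from collections import Counter, defaultdict

# Each event: (user_id, property, page) -- hypothetical example data.
events = [
    ("u1", "finance", "stocks"),
    ("u1", "sports", "football"),
    ("u1", "sports", "baseball"),
    ("u2", "travel", "flights"),
]

# Map + reduce: count how often each user visits each property.
profiles: dict[str, Counter] = defaultdict(Counter)
for user, prop, page in events:
    profiles[user][prop] += 1

def top_interest(user: str) -> str:
    """The user's dominant interest, used to pick contextually relevant content."""
    return profiles[user].most_common(1)[0][0]

print(top_interest("u1"))  # sports appears most often -> 'sports'
```

At 100 billion events a day the same aggregation is sharded across tens of thousands of machines, but the logical shape of the computation is this simple.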
Robert: During that same panel, focused on big data, there was a portion of the discussion about the data problem that the Fortune 1000 are having. To quote you for a moment, you said, “They all have the same problem, but they haven’t figured out how much they’re going to pay to solve it.” Can you expand on that a little bit, and how you think cloud computing technologies can help the Fortune 1000, both in the short and long term?
Todd: For any business, there’s a spectrum of data that is vitally important right now, whether it’s investment management, supply chain management, or user registration. Businesses are willing to pay a certain dollar value to keep that data available and active so they can access it immediately, whether or not it’s online.
There is a set of data that you don’t know the dollar value of yet, because you haven’t discovered what it may teach you. But you know that somewhere within that data, there’s value to be found, whether it’s better user understanding or better insight into how to run your business.
I think the question was, “How do we know if you have big data?” And my response was, “Everybody has big data. They just don’t know how much they want to pay for that big data.” And by that I mean, whichever business you go to, you can point to a whole bunch of data around the business that they could really gain insight from, which they are currently just dropping on the floor.
On the other hand, you probably believe you should pay a lot less for that data as a business than you would for a more traditional enterprise data warehouse or data mart like you might get from Teradata or Oracle.
Still, the insights you can get from that data are huge. So what you want to do is find the platform that matches your dollar cost profile and that allows you to work on that data, discover insight, and then start to promote it up into a more fully featured platform that ultimately ends up costing you more.
You can stick a bunch of data in a public cloud, and it’s going to cost you a lot less to store than if you’re buying a whole bunch of filers or disks locally, for most people. There’s also a set of technologies like Hadoop that allow you to discover value in that data at a much lower cost than you would pay a traditional vendor.
Because of that, the cloud is a great place for people to process big data or unstructured data that they don’t know the value of and are looking for insights into their business.
Robert: That concludes the prepared questions I had for you. Is there anything else you would like to address for our Windows Azure community?
Todd: We touched a little bit on the public/private cloud question, but we didn’t really get into that too much. In fact, I think the future is going to be dominated by the hybrid cloud. Companies are going to have a menu of options presented to them.
Say you’re the CIO of some Global 5000 company, or even a small 10-person business. You’ve got to look at this menu and say, “Given the business service that I want to run, what are my criteria?” For example, sensitive data or high security demands are likely to push me toward a private cloud.
On the other hand, if I have huge amounts of data that I don’t need to be highly available, that’s the sort of thing that I would put into the public cloud. That would prevent my having to make a large infrastructure investment, and as we talked about earlier, it also lets me move quickly.
I really think the future for all businesses is to look at this hybrid model. So, what’s my service, what’s my data, where do I want to put it, how much do I want to pay and why do I want to pay that? And rather than one menu, you’ll have a set of rate cards from vendors that you can go and choose from.
Robert: Microsoft’s Windows Azure platform appliance announcement concerned the ability to take the services we offer in the public cloud and offer them on premises, while still keeping it fundamentally a service.
Todd: I think that makes a lot of sense when you look at what I’m going to be worrying about as a CIO. In the life cycle of an application, I may even move it up and down between the layers of the cloud. I may start off in a public cloud and then bring it back in.
I think one of the areas for innovation and investment that the industry needs to make is in enabling that. I do not want to be locked into a single place where I can’t move my application and I’m stuck with a single source vendor.
Being able to move my workload from vendor to vendor, private to public, to me is an important element of what will make a successful ecosystem.
Robert: Clearly, you guys are a quintessential example of the public cloud. What are you looking at with regard to public customers?
Todd: We actually don’t offer a public cloud like Amazon or Google App Engine. In many ways, though, we are the cloud. People don’t think of Yahoo! that way, but we’re the personal cloud. In terms of where people’s emails, photos, fantasy sports teams, and financial portfolios, among other things, are to be found, Yahoo! is a personal cloud service for hundreds of millions of people. It’s just that they don’t think of us that way.
Considering whether we would move our workloads out into the public cloud, we have come to the conclusion that we probably would not. At the scale we deal with, it doesn’t make sense.
There’s a certain scale, I think, where it makes sense for you to make it someone else’s problem until it becomes a critical part of your business. For us at Yahoo!, running technology and trying to scale technology with 600 million registered users around the world, that is our problem, and it has to be. It’s the only way that we can successfully execute on that.
You see this with other folks as well. At the Structure Conference, Jonathan from Facebook was saying they have actually come to that same conclusion and that now they’re actually creating their own data center, pouring their own concrete and building up.
And they did that because they realized that, you know what, it was their problem. And they needed to have the level of control and the level of efficiency that you can derive by owning your own infrastructure.
So for folks of our size, it’s unlikely we’re going to move our workload to Amazon or Azure. It just wouldn’t make sense for us.
Robert: Todd, thanks for your time.
Todd: Thank you. It was great to talk to you.