Microsoft Research India Podcast
Accelerating AI Innovation by Optimizing Infrastructure. With Dr. Muthian Sivathanu
Episode 010 | September 28, 2021
Artificial intelligence, Machine Learning, Deep Learning, and Deep Neural Networks are today critical to the success of many industries. But they are also extremely compute intensive and expensive to run in terms of both time and cost, and resource constraints can even slow down the pace of innovation. Join us as we speak to Muthian Sivathanu, Partner Research Manager at Microsoft Research India, about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI.
Muthian's interests lie broadly in the space of large-scale distributed systems, storage, and systems for deep learning, blockchains, and information retrieval.
Prior to joining Microsoft Research, he worked at Google for about 10 years, with a large part of the work focused on building key infrastructure powering Google web search — in particular, the query engine for web search. Muthian obtained his Ph.D. from the University of Wisconsin-Madison in 2005 in the area of file and storage systems, and a B.E. from CEG, Anna University, in 2000.
For more information about Microsoft Research India, click here.
Related
- Microsoft Research India Podcast: More podcasts from MSR India
- iTunes: Subscribe and listen to new podcasts on iTunes
- Android
- RSS Feed
- Spotify
- Google Podcasts
Transcript
Muthian Sivathanu: Continued innovation in systems, in efficiency and cost, is going to be crucial to drive the next generation of AI advances, right. And the last 10 years have been huge for deep learning and AI, and the primary reason for that has been the significant advance in both hardware, in terms of the emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, run large distributed jobs efficiently and so on. And if you think about the theory of deep learning, people knew about backpropagation and neural networks 25 years ago, and we largely use very similar techniques today. But why have they really taken off in the last 10 years? The main catalyst has been sort of the advancement in systems. And if you look at the trajectory of current deep learning models, the rate at which they are growing larger and larger, systems innovation will continue to be the bottleneck in sort of determining the next generation of advancement in AI.
[Music]
Sridhar Vedantham: Welcome to the Microsoft Research India podcast, where we explore cutting-edge research that’s impacting technology and society. I’m your host, Sridhar Vedantham.
[Music]
Sridhar Vedantham: Artificial intelligence, Machine Learning, Deep Learning, and Deep Neural Networks are today critical to the success of many industries. But they are also extremely compute intensive and expensive to run in terms of both time and cost, and resource constraints can even slow down the pace of innovation. Join us as we speak to Muthian Sivathanu, Partner Research Manager at Microsoft Research India, about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI.
[Music]
Sridhar Vedantham: So Muthian, welcome to the podcast and thanks for making the time for this.
Muthian Sivathanu: Thanks Sridhar, pleasure to be here.
Sridhar Vedantham: And what I'm really looking forward to, given that we seem to be in some kind of final stages of the pandemic, is to actually be able to meet you face to face again after a long time. Unfortunately, we've had to again do a remote podcast which isn't all that much fun.
Muthian Sivathanu: Right, right. Yeah, I'm looking forward to the time when we can actually do this again in office.
Sridhar Vedantham: Yeah. Ok, so let me jump right into this. You know we keep hearing about things like AI and deep learning and deep neural networks and so on and so forth. What's very interesting in all of this is that we kind of tend to hear about the end product of all this, which is kind of, you know, what actually impacts businesses, what impacts consumers, what impacts the health care industry, for example, right, in terms of AI. It's a little bit of a mystery, I think, to a lot of people as to how all this works, because what goes on behind the scenes to actually make AI work is generally not talked about.
Muthian Sivathanu: Yeah.
Sridhar Vedantham: So, before we get into the meat of the podcast, do you just want to speak a little bit about what goes on in the background?
Muthian Sivathanu: Sure. So, machine learning, Sridhar, as you know, and deep learning in particular, is essentially about learning patterns from data, right. A deep learning system is fed a lot of training examples, examples of input and output, and then it automatically learns a model that fits that data, right. And this is typically called the training phase. So, the training phase is where it takes data and builds a model to fit it. Now what is interesting is, once this model is built, which was really meant to fit the training data, the model is really good at answering queries on data that it had never seen before, and this is where it becomes useful. These models are built in various domains. It could be for recognizing an image, for converting speech to text, and so on, right. And what has in particular happened over the last 10 or so years is that there has been significant advancement both on the theory side of machine learning, which is, new algorithms, new model structures that do a better job at fitting the input data to a generalizable model, as well as rapid innovation in systems infrastructure which actually enables the model to do its work- which is very compute intensive- in a way that's actually scalable, that's actually feasible economically, cost effective and so on.
Sridhar Vedantham: OK, Muthian, so it sounds like there's a lot of compute actually required to make things like AI and ML happen. Can you give me a sense of what kind of resources or how intensive the resource requirement is?
Muthian Sivathanu: Yeah. So the resource usage of a machine learning model is a direct function of how many parameters it has- the more complex the dataset, the larger the model gets, and correspondingly it requires more compute resources, right. To give you an idea, the early machine learning models, which perform simple tasks like recognizing digits and so on, could run on a single server machine in a few hours. But just over the last two years, for example, the size of the largest model that achieves state-of-the-art accuracy has grown by nearly three orders of magnitude, right. And what that means is that today, to train these models, you need thousands and thousands of servers, and that's infeasible. Also, accelerators, or GPUs, have really taken over in the last 6-7 years. A single V100 GPU today- a Volta GPU from NVIDIA- can run about 140 trillion operations per second, and you need several hundreds of them to actually train a model like this, running for months together. To train a 175-billion-parameter model, like the recent GPT-3, you need on the order of thousands of such GPUs, and it still takes a month.
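To get a rough sense of the scale Muthian is describing, here is a back-of-the-envelope estimate in Python. The token count, utilization figure, and exact GPU count are illustrative assumptions, not numbers from the episode; only the roughly 140 trillion operations per second per V100 and the 175-billion-parameter model size come from the conversation.

```python
# Rough, illustrative estimate of how long a GPT-3-scale training run might take.
params = 175e9            # model parameters (GPT-3 scale, from the episode)
tokens = 300e9            # training tokens (commonly cited ballpark; an assumption here)
flops_total = 6 * params * tokens   # ~6 FLOPs per parameter per token (standard rule of thumb)

gpu_peak = 140e12         # ~140 trillion ops/sec per V100, as mentioned in the episode
utilization = 0.3         # assumed sustained fraction of peak in practice
num_gpus = 3000           # "on the order of thousands" of GPUs (assumed specific value)

seconds = flops_total / (num_gpus * gpu_peak * utilization)
print(f"~{seconds / 86400:.0f} days")   # comes out to roughly a month
```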
Sridhar Vedantham: A month? That sounds like a humongous amount of time.
Muthian Sivathanu: Exactly, right? So that's why, just as I told you how advances in the theory of machine learning- in terms of new algorithms, new model structures, and so on- have been crucial to the recent advance in the relevance and practical utility of deep learning, equally important has been this advancement in systems, right. Because given this huge explosion of compute demands that these workloads place, we need fundamental innovation in systems to actually keep pace- to make sure that you can train these models in reasonable time, and that you can actually do that at reasonable cost.
Sridhar Vedantham: Right. Ok, so you know for a long time, I was generally under the impression that if you wanted to run bigger and bigger models and bigger jobs, essentially you had to throw more hardware at it because at one point hardware was cheap. But I guess that kind of applies only to the CPU kind of scenario, whereas the GPU scenario tends to become really expensive, right?
Muthian Sivathanu: Yep, yeah.
Sridhar Vedantham: Ok, so in which case, when there is basically some kind of a limit being imposed because of the cost of GPUs, how does one actually go about tackling this problem of scale?
Muthian Sivathanu: Yeah, so the high-level problem ends up being that you have limited resources, and you can view this from two perspectives, right. One is from the perspective of a machine learning developer or a machine learning researcher who wants to build a model to accomplish a particular task, right. So, from the perspective of the user, there are two things you need. First, you want to iterate really fast, right, because deep learning, incidentally, is this special category of machine learning where the exploration is largely by trial and error. So, if you want to know which model actually works, which parameters or which hyperparameter set actually gives you the best accuracy, the only way to really know for sure is to train the model to completion, measure accuracy, and then you would know which model is better, right. So, as you can see, the iteration time- the time to train a model and run inference on it- directly impacts the rate of progress you can achieve. The second aspect that the machine learning researcher cares about is cost. You want to do it without spending a lot of dollars.
Sridhar Vedantham: Right.
Muthian Sivathanu: Now from the perspective of, let's say, a cloud provider who runs this huge farm of GPUs and then offers it as a service for researchers, for users to run machine learning models, their objective function is cost, right. So, to support a given workload, you want to support it with as few GPUs as possible. Or in other words, if you have a certain amount of GPU capacity, you want to maximize the utilization, the throughput you can get out of those GPUs, and that's where a lot of the work we've been doing at MSR has focused. How do you multiplex lots and lots of jobs onto a finite set of GPUs, while maximizing the throughput that you can get from them?
Sridhar Vedantham: Right, so I know you and your team have been working on this problem for a while now. Do you want to share with us some of the key insights and some of the results that you've achieved so far? Because it is interesting, right? Schedulers have been around for a while. It's not that there aren't schedulers, but essentially what you're saying is that the schedulers that exist don't really cut it, given the intensity of the compute requirements as well as the size of the jobs and models that are being run today for deep learning, or even machine learning more broadly, right?
Muthian Sivathanu: That's right.
Sridhar Vedantham: So, what are your key insights and what are some of the results that you guys have achieved?
Muthian Sivathanu: So, you raise a good point. I mean, schedulers for distributed systems have been around for decades, right. But what makes deep learning somewhat special is this: traditional schedulers have to view a job as a black box, because they're meant to run arbitrary jobs, so there is a limit to how efficient they can be. Whereas deep learning, first of all, is such a high-impact area- I mean, from an economic perspective, there are billions of dollars spent on these GPUs and so on- so there is enough economic incentive to extract the last bit of performance out of these expensive GPUs, right. And that lends itself into this realm of- what if we co-design? What if we custom design a scheduler for the specific case of deep learning, right? And that's what we did in the Gandiva project, which we published at OSDI in 2018. What we said was, instead of viewing a deep learning job as just another distributed job which is opaque to us, let's actually exploit some key characteristics that are unique to deep learning jobs, right? And one of those characteristics is that although, as I said, a single deep learning training job can run for days or even months, deep within, it is actually composed of millions and millions of these what are called mini-batches. So, what is a mini-batch? A mini-batch is an iteration in the training where it reads one set of input training examples, runs it through the model, back-propagates the loss, and essentially changes the parameters to fit that input. And this sequence, this mini-batch, repeats over and over again across millions and millions of mini-batches. And what makes it particularly interesting and relevant from a systems optimization viewpoint is that from a resource usage perspective and from a performance perspective, mini-batches are identical. They may be operating on different data in each mini-batch, but the computation they do is pretty much identical. And what that means is we can look at the job for a few mini-batches and we know exactly what it is going to do for the rest of its lifetime, right. And that allows us to, for example, automatically decide which hardware generation is the best fit for this job, because you can just measure it on a whole bunch of hardware configurations. Or when you're distributing the job, you can compare it across a whole bunch of parallelism configurations, and you can automatically figure out the right configuration, the right hardware assignment, for this particular job- which you couldn't do for an arbitrary job with a distributed scheduler, because the job could be doing different things at different times. A MapReduce job, for example, would keep fluctuating in how it uses CPU, network, storage, and so on, right. Whereas with deep learning there is this remarkable repeatability and predictability, right. What it also allows us to do is, we can then look within a mini-batch at what happens, and it turns out, one of the things that happens is, if you look at the memory usage- how much GPU memory the training loop itself is consuming- somewhere in the middle of a mini-batch, the memory peaks to almost fill the entire GPU memory, right. And then by the time the mini-batch ends, the memory usage drops down by a factor of anywhere between 10 to 50x.
Right, and so there is this sawtooth pattern in the memory usage. So one of the things we did in Gandiva was to propose this mechanism of transparently migrating a job: the scheduler should be able to checkpoint a job on demand and just move it to a different GPU, a different machine, and so on, right. And this is very powerful for load balancing- lots of scheduling things become easy if you do this. Now, when you are actually moving a job from one machine to another, it helps if the amount of state you need to move is small, right. And so that's where this awareness of mini-batch boundaries and so on helps us, because now you can choose when exactly to move it, so that you move a 10 to 50x smaller amount of state.
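As an illustration of the idea Muthian describes- that GPU memory peaks mid-iteration and collapses at the mini-batch boundary, making the boundary the cheap point to checkpoint and migrate- here is a minimal PyTorch-flavored sketch. It is not Gandiva's actual implementation; the `should_migrate` hook stands in for whatever signal the real scheduler would use.

```python
# Minimal sketch of checkpointing a training job at a mini-batch boundary,
# where activations have been freed and only model/optimizer state remains.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def should_migrate(step):
    # Placeholder for the scheduler's suspend/migrate signal (an assumption here).
    return False

for step in range(1_000_000):
    x = torch.randn(64, 512, device=device)        # stand-in for one mini-batch of data
    y = torch.randint(0, 10, (64,), device=device)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()                                # GPU memory peaks around here
    opt.step()
    opt.zero_grad()
    # --- mini-batch boundary: activations freed, resident state is small again ---
    if should_migrate(step):
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "opt": opt.state_dict()}, "ckpt.pt")
        break   # the scheduler can restart this job elsewhere from ckpt.pt
```

The point of the sketch is that between `opt.step()` and the next forward pass, only the model and optimizer state need to move, which is the 10-50x smaller footprint mentioned above.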
Sridhar Vedantham: Right. Very interesting, and another part of this whole thing about resources and compute and all that is, I think, the demands on storage itself, right?
Muthian Sivathanu: Yeah.
Sridhar Vedantham: Because if the models are that big, that you need some really high-powered GPUs to compute, how do you manage the storage requirements?
Muthian Sivathanu: Right, right. So, it turns out the biggest requirement that deep learning poses on storage is the throughput that you need from it, right. So, as I mentioned, because GPUs are the most expensive resource in this whole infrastructure stack, the single most important objective is to keep GPUs busy all the time, right. You don't want them idling at all. What that means is the input training data that the model needs in order to run its mini-batches has to be fed to it at a rate that is sufficient to keep the GPUs busy. And the amount of data that a GPU can process from a compute perspective has been growing at a very rapid pace, right. So, what that means is, you know, between the Volta series and the Ampere series of GPUs, for example, there is like a 3x improvement in compute speed, right. Now that means the storage bandwidth has to keep up with that pace, otherwise a faster GPU doesn't help- it will be stalling on IO. So, in that context, one of the systems we built was a system called Quiver. The standard model for running this training is that the datasets are large- I mean, the datasets can be in terabytes- so you place them on some remote cloud storage system, like Azure Blob or something like that, and you read them remotely from whichever machine does the training, right. And that bandwidth simply doesn't cut it, because it goes through network backbone switches and so on, and it becomes insanely expensive to sustain that level of bandwidth from a traditional cloud storage system, right. So what we need to achieve here is hyper-locality. Ideally, the data should reside on the exact machine that runs the training- then it's a local read- and it has to reside on SSD and so on, right. So, you need several gigabytes per second of read bandwidth.
Sridhar Vedantham: And this is to reduce network latency?
Muthian Sivathanu: Yes, this is to reduce network latency and congestion. When it goes through lots of back-end switches, like T1 switches, T2 switches, etc., the end-to-end throughput that you get across the network is not as much as what you can get locally, right?
Sridhar Vedantham: Right.
Muthian Sivathanu: So, ideally you want to keep the data local, on the same machine, but as I said, for some of these models the dataset can be in tens of terabytes. So, what we really need is a distributed cache, so to speak, but a cache that is locality-aware. So, what we have is a mechanism by which, within each locality domain- like a rack, for example- we have a copy of the entire training data. A rack could comprise maybe 20 or 30 machines, so across them you can still fit the training data, and then you do peer-to-peer access to the cache across machines in the rack. And within a rack, network bandwidth is not a limitation- you can get nearly the same performance as you could from local SSD. So that's what we did in Quiver, and there are a bunch of challenges here, because if every model wants the entire training data to be local, to be within the rack, then there is just no cache space for keeping all of that.
Sridhar Vedantham: Right.
Muthian Sivathanu: Right. So we have this mechanism by which we can transparently share the cache across multiple jobs, or even multiple users without compromising security, right. And we do that by sort of intelligent content addressing of the cache entries so that even though two users may be accessing different copies of the same data internally in the cache, they will refer to the same instance.
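Here is a toy sketch of that content-addressing idea- my illustration, not Quiver's actual scheme: cache entries are keyed by a hash of the data itself, so two users reading their own copies of the same dataset land on the same entry, while a user who doesn't already possess the content cannot construct its key.

```python
# Content-addressed cache keys: identical data hashes to one shared entry.
import hashlib

cache = {}  # stands in for the rack-local distributed cache

def cache_key(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def read_chunk(chunk_from_user_storage: bytes) -> bytes:
    key = cache_key(chunk_from_user_storage)
    if key not in cache:
        cache[key] = chunk_from_user_storage   # first reader populates the cache
    return cache[key]                          # later readers hit the shared entry

# Two users with separate copies of the same dataset share one cache entry.
user_a = read_chunk(b"training-example-batch-0001")
user_b = read_chunk(b"training-example-batch-0001")
assert len(cache) == 1
```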
Sridhar Vedantham: Right, I was actually just going to ask you that question about how do you maintain security of data, given that you're talking about distributed caching, right? Because it's very possible that multiuser jobs will be running simultaneously, but that's good, you answered it yourself. So, you know, I've heard you speak a lot about things like micro co-design and so on. How do you bring those principles to bear in these kinds of projects?
Muthian Sivathanu: Right, right. So, I alluded to this a little bit in one of my earlier points, which is the interface. I mean, a traditional scheduler, which views the job as a black box, is an example of the traditional philosophy of system design, where you build each layer independent of the layer above or below it, right. There are good reasons to do that, because, you know, multiple use cases can use the same underlying infrastructure- like if you look at an operating system, it's built to run any process, whether it is Office or a browser or whatever, right.
Sridhar Vedantham: Right.
Muthian Sivathanu: But in workloads like deep learning, which place particularly high demands on compute and which are super expensive and so on, there is benefit to relaxing this tight layering to some extent, right. So that's the philosophy we take in Gandiva, for example, where we say the scheduler no longer needs to treat the job as a black box- it can make use of internal knowledge. It can know what mini-batch boundaries are. It can know that mini-batch times are repeatable, and stuff like that, right. So, co-design is a philosophy that has been gaining traction over the last several years, and people typically refer to hardware-software co-design, for example. What we do in micro co-design is take a more pragmatic view of co-design, where we say, look, it's not always possible to rebuild entire software layers from scratch to make them more tightly coupled. The reality is that in existing large systems we have these software stacks, infrastructure stacks, and the question is what we can do without rocking the ship, without essentially throwing away everything and building everything from a clean slate. So, what we do is make very surgical, carefully thought through interface changes that allow us to expose more information from one layer to another, and then we also introduce some control points which allow one layer to control another. For example, the scheduler can have a control point to ask a job to suspend. And it turns out, by opening up those carefully thought through interface points, you leave the bulk of the infrastructure unchanged, but yet achieve these efficiencies that result from richer information and richer control, right. So, micro co-design is something we have been adopting not only in Gandiva and Quiver, but in several other projects in MSR. And MICRO stands for Minimally Invasive Cheap and Retrofittable co-design. So, it's a more pragmatic view of co-design in the context of large cloud infrastructures.
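To make the "control point" idea concrete, here is a hypothetical sketch of what such a narrow interface might look like. The class and method names are invented for illustration; this is not the actual Gandiva or MSR API.

```python
# A sketch of a "micro co-design" control point: the training framework exposes a
# narrow hook the scheduler can drive, without either layer being rewritten.
from abc import ABC, abstractmethod

class TrainingControlPoint(ABC):
    """Hypothetical interface a deep learning job exposes to the cluster scheduler."""

    @abstractmethod
    def minibatch_time(self) -> float:
        """Report the (repeatable) per-mini-batch time, for placement decisions."""

    @abstractmethod
    def suspend(self, checkpoint_path: str) -> None:
        """Checkpoint at the next mini-batch boundary and release the GPU."""

    @abstractmethod
    def resume(self, checkpoint_path: str) -> None:
        """Reload state on a (possibly different) GPU and continue training."""

# The scheduler sees only this interface; the bulk of the framework and the
# scheduler stay unchanged, which is the "minimally invasive" part.
```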
Sridhar Vedantham: Right, where you can do the co-design with the minimum disruption to the existing systems.
Muthian Sivathanu: That's right.
Sridhar Vedantham: Excellent.
[Music]
Sridhar Vedantham: We have spoken a lot about the work that you've been doing and it's quite impressive. Do you have some numbers- in terms of, you know, how much faster jobs will run, or savings of any nature- that you can share with us?
Muthian Sivathanu: Yeah, sure. So the numbers, as always, depend on the workload and several aspects, but I can give you some examples. In the Gandiva work that we did, we introduced this ability to time-slice jobs, right. So, the idea is, today when you launch a job on a GPU machine, that job essentially holds on to that machine until it completes, and until that time it has exclusive possession of that GPU- no other job can use it, right. And this is not ideal in several scenarios. You know, one classic example is hyperparameter tuning, where you have a model and you need to decide which exact hyperparameter values- like learning rate, etc.- actually are the best fit and give the best accuracy for this model. So, people typically do what is called a hyperparameter search, where you run maybe 100 instances of the model, see how they're doing, maybe kill some instances, spawn new instances, and so on, right. And hyperparameter exploration really benefits from parallelism. You want to run all these instances at the same time so that you have an apples-to-apples comparison of how they are doing. And if you want to run like 100 configurations and you have only 10 GPUs, that significantly slows down hyperparameter exploration- it serializes it, right. What Gandiva has is an ability to perform fine-grained time slicing of the same GPU across multiple jobs. Just like how an operating system time-slices multiple processes, multiple programs, on the same CPU, we do the same in a GPU context, right. And because we make use of mini-batch boundaries and so on, we can do this very efficiently. And with that we showed that for typical hyperparameter tuning, we can speed up the end-to-end time to accuracy by nearly 5-6x, right. And so this is one example of how time slicing can help. We also saw that from a cluster-wide utilization perspective, some of the techniques that Gandiva adopted can improve overall cluster utilization by 20-30%. And this directly translates to cost for the cloud provider running those GPUs, because it means with the same GPU capacity, I can serve 30% more workload- or vice versa, for a given workload I need 30% fewer GPUs.
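Here is a toy illustration- not Gandiva's code- of the time-slicing idea: several hyperparameter trials share one GPU by each running a couple of mini-batches per turn, switching at mini-batch boundaries so every trial makes progress and can be compared apples-to-apples.

```python
# Round-robin time slicing of one GPU across hyperparameter trials (toy sketch).
import itertools

def make_trial(learning_rate):
    """A stand-in training job: each yield represents one completed mini-batch."""
    def run():
        for step in itertools.count():
            # ... run one mini-batch with this learning rate on the shared GPU ...
            yield f"lr={learning_rate} step={step}"
    return run()

trials = [make_trial(lr) for lr in (0.1, 0.01, 0.001)]  # 3 trials, 1 GPU

for _ in range(4):                      # a few scheduling rounds
    for trial in trials:
        for _ in range(2):              # time slice = 2 mini-batches per turn
            print(next(trial))
```

Because switching happens at mini-batch boundaries, where GPU state is minimal, the context-switch cost stays low; bad trials can be killed early and their slices given to new configurations.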
Sridhar Vedantham: Yeah, I mean, those savings sound huge, and I think you're also therefore talking about reducing the cost of AI, making the process of AI itself more efficient.
Muthian Sivathanu: That's correct, that's correct. So, the more we are able to extract performance out of the same infrastructure, the cost per model or the cost per user goes down, and so the cost of AI reduces. And for large companies like Microsoft or Google, which have first-party products that require deep learning- like search and Office and so on- it reduces the capital expenditure of running such clusters to support those workloads.
Sridhar Vedantham: Right.
Muthian Sivathanu: And we've also been thinking about areas such as- today there is this limitation that large models need to run in really tightly coupled hyperclusters, which are connected via InfiniBand and so on. And that brings another dimension of cost escalation to the equation, because these are scarce, the networking itself is expensive, there is fragmentation across hyperclusters, and so on. What we showed in some recent work is how you can actually run training of large models on just commodity GPU VMs, without any requirement that they be part of the same InfiniBand cluster or hypercluster- they can be scattered anywhere in the data center. And more interestingly, we can actually run these off of spot VMs. So Azure, AWS, all cloud providers provide these bursty VMs or low-priority VMs, which are essentially a way for them to sell spare capacity, right. So, you get them at a significant discount- maybe a 5-10x cheaper price. And the downside of that is they can go away at any time; they can be preempted when real demand shows up. So, what we showed is that it's possible to train such massive models at the same performance, despite these being spot VMs spread over a commodity network without custom InfiniBand and so on. So that's another example of how you can bring down the cost of AI by reducing constraints on what hardware you need.
Sridhar Vedantham: Muthian, we're kind of reaching the end of the podcast, and is there anything that you want to leave the listeners with, based on your insights and learning from the work that you've been doing?
Muthian Sivathanu: Yeah, so taking a step back, right? I think continued innovation in systems, in efficiency and cost, is going to be crucial to drive the next generation of AI advances, right. And the last 10 years have been huge for deep learning and AI, and the primary reason for that has been the significant advance in both hardware, in terms of the emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, run large distributed jobs efficiently and so on. And if you think about the theory of deep learning, people knew about backpropagation and neural networks 25 years ago, and we largely use very similar techniques today. But why have they really taken off in the last 10 years? The main catalyst has been sort of the advancement in systems. And if you look at the trajectory of current deep learning models, the rate at which they are growing larger and larger, systems innovation will continue to be the bottleneck in sort of determining the next generation of advancement in AI.
Sridhar Vedantham: Ok Muthian, I know that we're kind of running out of time now but thank you so much. This has been a fascinating conversation.
Muthian Sivathanu: Thanks Sridhar, it was a pleasure.
Sridhar Vedantham: Thank you.