Microsoft Research India Podcast



Can we make better software by using ML and AI techniques? With Chandra Maddila and Chetan Bansal

August 03, 2020

 


Episode 004 | August 04, 2020




The process of software development is dramatically different today compared to even a few years ago. The shift to cloud computing has meant that companies need to develop and deploy software in ever shrinking timeframes while maintaining high quality of code. At the same time, developers can now get access to large amounts of data and telemetry from users. Is it possible for companies to use Machine Learning and Artificial Intelligence techniques to shorten the Software Development Life Cycle while ensuring production of robust, cloud-scale software? We talk about this and more with Chandra Maddila and Chetan Bansal, who are Research Software Development Engineers at Microsoft Research India.







Transcript


Chandra Maddila: One of the biggest disconnects we used to have in the boxed product world, where we used to ship software as a standalone product and give it to customers, is that once the customer takes the product, it is in their environment, and we don’t have any idea about how it is being used and what kind of issues people are facing unless they come back to Microsoft support and say, “Hey, we are using this product, we get into these issues, can you please help us?”. But with the advent of services, one of the beautiful things that happened is, now we have the ability to collect telemetry about various issues that are happening in the service. So, this helps us proactively fix issues and help customers mitigate outages and also join the telemetry data from the deployment side of the world all the way into the coding phase, which is the first phase of the software development life cycle.



[Music]



Sridhar: Welcome to the Microsoft Research India podcast, where we explore cutting-edge research that’s impacting technology and society. I’m your host, Sridhar Vedantham.



[Music]



Sridhar: Chandra and Chetan, welcome to the podcast. And thank you for making the time for this.



Chetan: Thanks, Sridhar, for hosting this.



Chandra: Thanks Sridhar, thanks for having us.



Sridhar: Great! Now, there is something that has interested me ever since I decided to host this podcast with you guys. You are both research software development engineers, and Microsoft Research is known for being this hardcore computer science research lab. So, what does it mean to be a software developer in a research org like MSR? And how is it different from being a software developer in, say, a product organization, if there is a difference?



Chetan: Yeah, that’s a great question, Sridhar, about the difference between the RSDE role, which is Research Software Development Engineer at MSR, vs. the product groups at Microsoft. In my experience the RSDE role is sort of open ended, because oftentimes research teams work on open-ended research problems. So, RSDEs often work on things like prototypes and building products from the ground up which are deployed internally and which are the precursors for products which are shipped to our customers. So there’s a lot of flexibility and openness in terms of what the RSDEs work on, and it can range from open-ended research to actually building products which are shipped to our customers. So, there’s a wide spectrum of things and roles which an RSDE plays.



Sridhar: Chandra, what’s your take on that?



Chandra: I think Chetan summarized it pretty well. The RSDE role in general is much more flexible compared to a typical software engineer role in product groups. You can switch from area to area and product to product. I, for example, was working on NLP for some time, then web applications and learning platforms for some time. Then, I switched to software engineering. So, we have this flexibility to move across different areas. And also, one thing I think we do as RSDEs is work on long-term problems, problems from the ground up which take some time to incubate and productize, whereas software engineers in product groups have a well-defined scope and well-defined problems which are aligned to their product’s vision. So, that way they are slightly more constrained in terms of what kind of problems they work on. But, at the same time, one of the greatest advantages people in product groups have is the accessibility to customers. They are very close to customers and they really work on customer problems and ship things much faster, whereas RSDEs in MSR don’t have direct access to customers.



Sridhar: Interesting, so it sounds like it’s kind of a play between customer access and freedom as far as RSDEs are concerned.



Chandra: Yeah, as RSDEs in Microsoft Research, we have a lot more flexibility and provision to explore more interesting areas in research, new and upcoming areas like, probably, quantum computing or blockchain or advances in AI/ML, etc., and do more exploratory things.


Chetan: Just wanted to add another thing here. A lot of times, people have misconceptions that in Microsoft Research or in other research organizations, a doctorate or Ph.D is required to get a job or to work for these organizations. But there are roles such as RSDEs, and product managers, program managers or even designers which people can take on without the need to have a Ph.D or a doctorate and they can still contribute to the research happening in companies like Microsoft.



Sridhar: Great. Now, we keep hearing nowadays that the process of software development has changed tremendously over the last few years. So, what’s actually caused these changes?



Chetan: I think, to start with, there are two things which in my opinion have caused this sort of revolution in the software development industry. One of them is the move to the services-oriented world, so we are no longer shipping boxed products on a CD or a DVD. But we are actually shipping services, we are actually selling services which are used by our customers, unlike before, where you shipped software that was used by our customers for a couple of years and then they updated it. So, I think that’s one key change which has happened in the last decade, and the other major paradigm shift which has happened is the move to cloud. So, even in terms of software deployment, today it’s being done on cloud instead of on-prem, which is within the premises of a customer or a company. So, that has brought in a whole range of changes in terms of how software is developed, deployed, and maintained within small and big companies, like even Microsoft. And today startups and new companies don’t have to actually spend a lot of money on capex, capital expenditure, on buying servers or hiring people to maintain the servers, but they can basically ship and operate out of the cloud, which saves a lot of money and time. So, in my opinion, these are the two major paradigm shifts which have happened and which have positively impacted the software industry.



Chandra: Compared to the ’90s, when we used to, for instance, ship boxed products, now everything is becoming a service, and that is also primarily driven by customer expectation. So, these days customers are expecting companies to actually ship services faster and make the new features available at a much faster pace, which is also accelerated by the development and growth in cloud computing technologies, which lets software companies or software developers scale the services really fast and serve more people and ship things much faster.



Sridhar: So, I know for a fact that earlier there used to be these long ship cycles where somebody would develop some software, and there would be a bunch of people testing it and after which it would reach the customer, whether it would be the retail customer or the enterprise customer, right. I think, a lot of these processes have either disappeared or been extremely compressed. So, what kind of challenges and opportunities do these changes provide you guys as software developers?




Chandra: So, these rapid development models, where people are expected to ship really fast, brought the duration of the ship cycles down to even, like, days, or in a single day you experience the entire software development life cycle, all the steps of the development life cycle, starting from coding, to testing, to deployment in a single day. This definitely poses a lot of challenges because you have to make sure you are shipping fast, but at the same time you are making sure your service is stable and customers are not experiencing any interruptions. So, you need to build tools and services that aid developers to achieve this. So, the tools and services have to be pretty robust and make sure they catch all the catastrophic bugs early on and help developers achieve this feat of shipping their services much faster. So, the duration between someone writing the code and the code hitting the customer has come down significantly, which is what we all need to make sure we support.



Chetan: I just want to add two more things, two more changes which have helped evolve the software development life cycle and processes. First is the possibility of collecting telemetry and data from our users. So, basically, we are able to observe how our features or our code is behaving or being used in near real time, which allows us to see if there is any regression or if there are any changes or if there are any bugs which need to be fixed. This wasn’t possible in the past within the boxed software world because we didn’t have access to the telemetry. The second aspect is having a set of users which are helping you test your features and services at the same time. So, now, we can sort of do software development in parallel as we roll out our current set of features.



Sridhar: Cool. So, it sounds like you guys are now able to get a large amount of data as well as telemetry from the users, right. How does this actually help in making the software development life cycle more efficient or faster?



Chetan: So, I think there are two aspects. Like, one of them, which I just highlighted, was that now we are getting real-time or near real-time telemetry in terms of how different aspects of our software or services are being used. And the second is, if there are any regressions or any anomalies which are happening, we are able to detect them and then resolve them very quickly, which wasn’t possible before. So, I think these are the two aspects.



Chandra: One of the biggest disconnects we used to have in the boxed product world, where we used to ship software as a standalone product and give it to customers, is that once the customer takes the product, it is in their environment. We don’t have any idea about how it is being used and what kind of issues people are facing unless they come back to Microsoft support and say, “Hey, we are using this product, we get into these issues, can you please help us?”. But with the advent of services, one of the beautiful things that happened is, now we have the ability to collect telemetry about various issues that are happening in the service. So, this helps us proactively fix issues and help customers mitigate outages and also join the telemetry data from the deployment side of the world all the way into the coding phase, which is the first phase of the software development life cycle, and give valuable insight to developers so that in the code itself, they have an understanding of how this code is going to behave out there in the wild and be more cautious and cause fewer bugs or issues.



[Music]



Sridhar: There have been a couple of terms which have become, I think very predominant, very prominent over the last few years. There are two terms that come to mind immediately to me, one is DevOps and the other is AIOps. What exactly are these?



Chetan: So, DevOps is basically a commonly used term across the software development industry which refers to the set of practices and tools for developing software, deploying software and shipping software. So basically, how different parts of our industry, different companies are actually building software, what are the set of practices, for example, how do you do code reviews, how do you check in code, how do you deploy the code, so, different sets of practices and also the tools and infrastructure which are involved. So, in my opinion, that’s sort of the definition of DevOps. It’s a very abstract term which refers to different sets of practices and tools for software development. Lastly, AIOps, that’s basically a recently introduced term, probably in the last few years, where because of the access to telemetry and data from our software and users, we are able to leverage data science and machine learning for optimizing a lot of key aspects of the DevOps life cycle. For instance, while doing code reviews, can we use machine learning and data science for catching bugs? That’s a very simple example that gives an idea of how AIOps, or Artificial Intelligence, can be used to help different aspects of DevOps. And that’s branded as AIOps.



Chandra: So, DevOps is actually a combination of two words, right, Development plus Operations. In the boxed product world, when companies were shipping software through CDs or DVDs as Chetan mentioned, we used to develop software and sell it to customers. And all the operational aspects of the software, that is, deploying the software in their organizations and maintaining it and making sure the software is running properly, etc., were in the hands of the customer who takes the software from vendors like Microsoft. But, with the advent of services, Microsoft is also becoming a services provider. Like Satya famously says, Microsoft is now a services company and we provide solutions to customers. So, we definitely got into this innate need of doing operations also inside Microsoft itself, which makes us do both the development and operations together, DevOps, inside Microsoft itself. So, this basically combines different aspects of the software development life cycle, starting from coding, testing and also deployment and customer support, and feeding the feedback loop back into development and iterating over all these phases again and again. AIOps is a term that has been coined in the last couple of years. AIOps specifically means using technologies like Artificial Intelligence and Machine Learning and leveraging them to solve problems and operational challenges in software development. For instance, you take a fancy AI algorithm and use it to solve a root-causing problem in software services. That is a classic example of using AI for solving a real problem in operations. And we have a variety of different problems that occur on the operations side of software development now because of the scale at which software development is happening, and using and applying AI/ML techniques to solve those problems, put together, can be called AIOps.



Sridhar: Ok. Now, I know you guys have been working for a few years on this very interesting research project called Sankie and I think this has elements of using AI and machine learning in making the SDLC more effective. Talk a bit about that.



Chandra: Sankie is a project which we started at the end of 2016. One of the primary goals of Sankie is to provide an ability to join various data that is being collected at different phases of the software development life cycle and leverage techniques like AI/ML, do analysis on top of the data and provide valuable insights which can aid various stakeholders in each phase of the software development life cycle.



Chetan: I think Chandra put it in a great way. The whole motivation behind Sankie was to infuse AIOps into the software development processes across Microsoft. And it has been a huge collaborative effort with several collaborators such as [B Ashok, Rahul Kumar, Ranjita Bhagwan, Sonu Mehta, Jim Kleewein], and even several research fellows across MSR who have worked with us and collaborated with us over the last several years, and our counterparts from different parts of Microsoft.



Sridhar: Ok. Now, I get the feeling that both of you have kind of oversimplified what Sankie actually is. I’ve sat through various talks in which there seems to be a huge amount of work that goes in at different components that feed into Sankie, which seems to be kind of like a platform. Why don’t you guys talk a little more about what Sankie actually is and what the different constituent parts are, so to speak?



Chandra: So, Sankie is actually a platform that we have been building. Sankie basically has loaders that ingest data from various phases of the software development life cycle. For instance, from the development phase, it ingests data about pull requests, commits, various builds; from the testing phase, it ingests data about test cases, test executions, what is the status of the tests; and from the deployment phase, it ingests data about alerts, exceptions, and various other telemetry that is collected at the deployment phase, and we basically put all this data together in a single queryable data source. That is very important because this data exists in various disparate data sources which are exposed at various levels, and Sankie basically gets all this data into a single relational data store where the datasets can be easily queried and joined against each other. Then, we use this data, we feed it into various AI and ML tools to provide insights and recommendations in various phases of the software development life cycle. For example, we mine all the commit data, that is, which files are changed together, which files go into a pull request, etc., to basically discover rules that explain the files that are always changed together, and we use that knowledge to provide recommendations when developers are creating pull requests, if they are missing any files to include in their pull requests. We call it related files analysis. Similarly, we developed tools like ORCA, the Online Root Cause Analysis tool, which is intended towards root causing service incidents and service disruptions as quickly as possible. So, in the case of ORCA, it is pretty interesting that it uses data from both the left side of the software development life cycle and the right side of the software development life cycle, that is, data from commits, and code that is written and the differences of code, and the telemetry that is collected at the deployment side, that is, the exceptions and errors that are occurring in the service.
So, ORCA basically takes all these exceptions and errors that are happening and has an ability to point them towards the actual code change that introduced these problems in the first place, which is pretty fascinating because this greatly reduces the amount of time developers spend in root causing issues, which typically takes probably a couple of days or sometimes even weeks depending on the complexity of the issue. And Sankie has close to 8 such recommenders which combine data from various different phases of the SDLC and leverage the AI/ML techniques, the AIOps processes, and make the entire development life cycle more optimal and efficient.
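
The related files analysis Chandra describes can be illustrated with a small co-change miner. This is only a minimal sketch under stated assumptions, not Sankie's actual implementation: the function names (mine_cochange_rules, suggest_missing_files), the support and confidence thresholds, and the toy commit history are all hypothetical and chosen for illustration. It counts how often pairs of files appear in the same commit, keeps the pairs that co-occur often enough as rules, and uses those rules to flag files a new pull request may have left out.

```python
from collections import Counter
from itertools import combinations


def mine_cochange_rules(commits, min_support=2, min_confidence=0.8):
    """Discover 'when A changes, B usually changes too' rules.

    commits is an iterable of collections of file paths, one per commit.
    Returns a dict mapping (source_file, related_file) to a confidence,
    i.e. the fraction of commits touching source_file that also touched
    related_file. Thresholds are illustrative assumptions.
    """
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        files = set(files)
        file_counts.update(files)
        # Count every unordered pair of files changed in the same commit.
        for a, b in combinations(sorted(files), 2):
            pair_counts[(a, b)] += 1

    rules = {}
    for (a, b), together in pair_counts.items():
        if together < min_support:
            continue
        # A pair can yield a rule in either direction.
        for src, dst in ((a, b), (b, a)):
            confidence = together / file_counts[src]
            if confidence >= min_confidence:
                rules[(src, dst)] = confidence
    return rules


def suggest_missing_files(pull_request_files, rules):
    """Return files that usually change with the PR's files but are absent."""
    changed = set(pull_request_files)
    suggestions = {}
    for (src, dst), confidence in rules.items():
        if src in changed and dst not in changed:
            suggestions[dst] = max(confidence, suggestions.get(dst, 0.0))
    # Highest-confidence suggestions first, ties broken alphabetically.
    return sorted(suggestions, key=lambda f: (-suggestions[f], f))


# Hypothetical commit history, made up for this example.
history = [
    {"api.py", "api_test.py"},
    {"api.py", "api_test.py", "README.md"},
    {"api.py", "api_test.py"},
    {"README.md"},
]
rules = mine_cochange_rules(history)
print(suggest_missing_files(["api.py"], rules))  # prints ['api_test.py']
```

A production system would likely need more than pairwise counts, for example rules over larger file sets and some handling of very large commits, but the sketch shows the basic idea of turning commit history into pull request recommendations.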



Chetan: So, to add to what Chandra just said about Sankie, I just want to mention that in the beginning of the podcast we briefly discussed how the move to cloud and service-oriented software development has posed some new and interesting challenges for software development. But in this case, we are actually able to use that to our advantage, since in Sankie we are basically building services which we can deploy and iterate on very fast, based on the feedback from our users, and also based on the telemetry we are getting from the services. And lastly, because of this cloud-oriented architecture, we are able to leverage our big data technologies and the service-oriented architectures which allow us to leverage terabytes of data or telemetry which are being produced by different user-facing services, and then combine that with machine learning algorithms and provide insights which are very valuable to the end users of the Sankie platform.



Sridhar: Now, is Sankie available to the world outside of Microsoft?



Chetan: As part of Sankie, one of the key focuses has been on making sure that all of our techniques and algorithms are published in major software and systems conferences, so we have published research papers and articles about the Sankie platform, architecture and even the 8 different recommenders which Chandra talked about.



Sridhar: Ok. So, if it’s all available in the public domain, I think we will make them available along with the transcript of this podcast. Ok, let’s do a little bit of crystal ball gazing now. Where do you guys see software development, engineering and DevOps evolving in the future?



Chandra: I think that’s a great question. As Marc Andreessen famously said, “Software is eating the world”. So, a lot of traditional companies are becoming more and more tech companies. You can see that in every industry: automobile, pharmaceutical, retail, everywhere tech is penetrating a lot. This actually makes software development more complex, and we need to react to customer requests in faster ways, which basically makes AIOps much more relevant. Using all the AI/ML technologies to make the entire software development life cycle more efficient, and to deliver value to the customers and users who are subscribing to our services, is going to become way more important.



Chetan: To add to what Chandra just said, I think there are two things that make me excited about how we can evolve Sankie and other similar projects to prepare for the next shift in the software development industry. So, I think, first is the more and more usage of software and machine learning in cyber-physical systems, for example in self-driving cars, in agriculture, and these are systems which are safety critical, time critical, and impact humans in a big way. So evolving Sankie and other similar tools and techniques for those verticals of software and services, I think, will be a key challenge and opportunity. And the last one is that the software industry has seen software 1.0, 2.0, and now this move to the edge, right, where a lot of times the cloud or the computers are available on the edge of the network, so that they are accessible or located close to the user. So how we can leverage Sankie and other similar techniques for the edge-focused cloud is another interesting aspect which we are excited about.



Sridhar: Ok. So, Chandra and Chetan, this has been a fantastic conversation and fascinating. And thank you so much once again for your time.



Chandra: Thanks Sridhar.



Chetan: Thanks, Sridhar, for this insightful conversation. Thank you.