Microsoft Research India Podcast


Evaluating LLMs using novel approaches. With Dr. Sunayana Sitaram

May 20, 2024

[Music]

Sunayana Sitaram: Our ultimate goal is to build evaluation systems and also other kinds of systems in general where humans and LLMs can work together. We're really trying to get humans to do the evaluation, get LLMs to do the evaluation, use the human data in order to improve the LLM. And then just this continues in a cycle. And the ultimate goal is, send the things to the LLM that it's good at doing and send the rest of the things that the LLM can't do to humans who are like the ultimate authority on the evaluation.

Sridhar Vedantham: Welcome to the Microsoft Research India podcast, where we explore cutting-edge research that’s impacting technology and society. I’m your host, Sridhar Vedantham.

[Music]

Sridhar Vedantham: LLMs are perhaps the hottest topic of discussion in the tech world today. And they're being deployed across domains, geographies, industries and applications. I have an extremely interesting conversation with Sunayana Sitaram, principal researcher at Microsoft Research, about LLMs, where they work really well, and also the challenges that arise when trying to build models with languages that may be under-resourced. We also talk about the critical work she and her team are doing in creating state-of-the-art methods to evaluate the performance of LLMs, including those LLMs that are based on Indic languages.

[Music]

Sridhar Vedantham: Sunayana, welcome to the podcast.

Sunayana Sitaram: Thank you.

Sridhar Vedantham: And I'm very excited to have you here because we get to talk about a subject that seems to be top of mind for everybody right now. Which is obviously LLMs. And what excites me even more is, I think we're going to be talking about LLMs in a way that's slightly different from what the common discourse is today, right?

Sunayana Sitaram: That's right.

Sridhar Vedantham: OK. So before we jump into it, why don't you give us a little bit of background about yourself and how you came to be at MSR?

Sunayana Sitaram: Sure. So it's been eight years now since I came to MSR. I came here as a postdoc after finishing my PhD at Carnegie Mellon. And so yeah, it's been around 15 years now for me in the field, and it's been super exciting, especially the last few years.

Sridhar Vedantham: So, I'm guessing that these eight years have been interesting, otherwise we wouldn't be having this conversation. What areas of research, I mean, have you changed course over the years and how has that progressed?

Sunayana Sitaram: Yeah, actually, I've been working pretty much on the same thing for the last 15 years or so. So I'll describe how I got started. When I was an undergrad, I actually met the principal of a blind children's school who himself was visually impaired. And he was talking about some of the technologies that he uses in order to be independent. And one of those was using optical character recognition and text to speech in order to take documents or letters that people sent him and have them read out without having to depend on somebody. And he was in Ahmedabad, which is where I grew up. And his native language was Gujarati. And he was not able to do this for that language. Whereas for English, the tools that he required to be independent were available. And so, he told me like it would be really great if somebody could actually build this kind of system in Gujarati. And that is when it was sort of like a, you know, aha moment for me. And I decided to take that up as my undergrad project. And ever since then, I've been trying to work on technologies trying to bridge that gap between English and other languages, under-resourced languages. And so, since then, I've worked on very related areas. So, my PhD thesis was on text to speech systems for low resource languages. And after I came to MSR I started working on what is called code switching, which is a very common thing that multilinguals all over the world do. So they use multiple languages in the same conversation or sometimes even in the same sentence. And so you know, this was a project called Project Melange that was started here and that really pioneered the code switching work in the research community in NLP. And after that it's been about LLMs and evaluation but again from a multilingual, under-resourced languages standpoint.

Sridhar Vedantham: Right. So I have been here for quite a while at MSR myself and one thing that I always heard is that there is this in general, a wide gulf in terms of the resources available for a certain set of languages to do say NLP type work. And the other languages is just the tail, it's a long tail, but the tail just falls off dramatically. So, I wanted you to answer me in a couple of ways. One is, what is the impact that this generally has in the field of NLP itself and in the field of research into language technologies, and what's the resultant impact on LLMs?

Sunayana Sitaram: Yeah, that's a great question. So, you know the paradigm has shifted a little bit after LLMs have come into existence. Before this, so this was around say a few years ago, the paradigm would be that you would need what is called unlabeled data. So, that is raw text that you can find on the web, say Wikipedia or something like that, as well as labeled data. So, this is something that a human being has actually sat and labeled for some characteristic of that text, right? So these are the two different kinds of texts that you need if you want to build a text based language model for a particular language. And so there were languages where, you know, you would find quite a lot of data on the web because it was available in the form of documents or social media, etc. for certain languages. But nobody had actually created the labeled resources for those languages, right? So that was the situation a few years ago. And you know the paradigm at that time was to use both these kinds of data in order to build these models, and our lab actually wrote quite a well-regarded paper called, ‘The State and Fate of Linguistic Diversity and Inclusion’, where they grouped different languages into different classes based on how much data they had, labeled as well as unlabeled.

Sridhar Vedantham: Right.

Sunayana Sitaram: And it was very clear from that work that, you know only around 7 or 8 languages of the world actually can be considered to be high resource languages which have this kind of data. And most of the languages of the world spoken by millions and millions of speakers don't have these resources. Now with LLMs, the paradigm changed slightly, so there was much less reliance on this labeled data and much more on the vast amount of unlabeled data that exists, say, on the web. And so, you know, we were wondering what would happen with the advent of LLMs now to all of the languages of the world, which ones would be well represented, which ones wouldn't etc. And so that led us to do, you know, the work that we've been doing over the last couple of years. But the story is similar, that even on the web some of these languages dominate and so many of these models have, you know, quite a lot of data from only a small number of languages, while the other languages don't have much representation.

Sridhar Vedantham: OK. So, in real terms, in this world of LLMs that we live in today, what kind of impact are we looking at? I mean, when you're talking about inequities and LLMs and in this particular field, what's the kind of impact that we're seeing across society?

Sunayana Sitaram: Sure. So when it comes to LLMs and language coverage, what we found from our research is that there are a few languages that LLMs perform really well on. Those languages tend to be high resource languages for which there is a lot of data on the web and they also tend to be languages that are written in the Latin script because of the way the LLMs are designed currently with the tokenization. And the other languages, unfortunately there is a large gap between the performance in English and other languages, and we also see that a lot of capabilities that we see in LLMs in English don't always hold in other languages. So a lot of capabilities, like really good reasoning skills, etc, may only be present in English and a few other languages, and they may not be seen in other languages. And this is also true when you go to smaller models that you see that their language capabilities fall off quite drastically compared to the really large models that we have, like the GPT 4 kind of models. So when it comes to real world impact of this, you know, if you're trying to actually integrate one of these language models into an application and you're trying to use it in a particular language, chances are that you may not get as good performance in many languages compared to English. And this is especially true if you're already used to using these systems in English and you want to use them in a second language. You expect them to have certain capabilities which you've seen in English, and then when you use them in another language, you may not find the same capabilities. So in that sense, I think there's a lot of catching up to do for many languages. And the other issue also is that we don't even know how well these systems perform for most languages of the world because we've only been able to evaluate them on around 50 to 60 or maybe 100 languages. 
So for the rest of the 6000ish languages of the world, many of which don't even have a written form, most of which are not there on the web. We don't even know whether these language models are, you know, able to do anything in them at all. So I think that is another, you know, big problem that is there currently.
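One rough way to see why script can matter for tokenization: many tokenizers start from UTF-8 bytes, and non-Latin scripts need more bytes per character than English does. The sketch below only measures that byte cost. It is an illustration under that assumption, not a measurement of any particular model's tokenizer, and the function name `utf8_bytes_per_char` is made up for this example.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Two short greetings, chosen only to contrast the scripts.
samples = {
    "English (Latin script)": "hello",
    "Hindi (Devanagari script)": "नमस्ते",
}

for name, sample in samples.items():
    print(f"{name}: {utf8_bytes_per_char(sample):.1f} bytes per character")
```

A tokenizer that learns merges mostly from English-heavy data would start from this longer byte sequence for Devanagari text and may compress it less, which is one plausible reason the same sentence can cost more tokens in some languages.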

Sridhar Vedantham: So, if you want to change the situation where we say that you know even if you're a speaker of a language that might be small, maybe say only two million speakers as opposed to a larger language that might have 100 million or 200 million speakers. How do we even go about addressing inequities like that because at some level it just seems unfair, that through no fault of their own, you know, large sections of the population could be excluded from the benefits of LLMs, right? Because there could be any number of languages in which the number of speakers might be, say, 1,000,000 or 100,000.

Sunayana Sitaram: Right. I think that's a very hard question. How to actually involve language communities into our efforts, but do that at scale, so that we can actually involve all language communities, all cultures, etc. into the whole building process. So we've had some success with doing this with some language communities. So there is a project called ELLORA in MSR India that you know Kalika leads where you know they work with specific language communities, try to understand what the language communities actually need and then try to co-build those resources with them. And so you know in that sense, you know, working directly with these language communities, especially those that have a desire to build this technology and can contribute to some of the data aspects, etc. That's definitely one way of doing things. We've also done some work recently where we've engaged many people in India in trying to contribute resources in terms of cultural artifacts and also evaluation. And so you know, we're trying to do that with the community itself, with the language community that is underrepresented in these LLMs, but doing that at scale is the challenge to try and really bring everyone together. Another way of course, is just raising awareness about the fact that this issue exists, and so I think our work over the last couple of years has really, you know, moved the needle on that. So we've done the most comprehensive multilingual evaluation effort that exists both within the large models as well as across different sized models which we call Mega and Megaverse respectively.

Sridhar Vedantham: So if I can just interrupt here, what I'd like is if you could, you know, spend a couple of minutes maybe talking about what evaluating an LLM actually means and how do you go about that?

Sunayana Sitaram: Sure. So when we talk about evaluating LLMs, right, there are multiple capabilities that we expect LLMs to possess. And so, our evaluation should ideally try to test for all of those different kinds of capabilities. So, this could be the ability to reason, this could be the ability to produce output that actually sounds natural to a native speaker of the language. It could be completing some particular task, it could be not hallucinating or not making up things. And also of course, responsible AI, you know, metrics. So things like, you know, being safe and fair, no bias, etc. Only if all of those things work in a particular language, can you say that, that LLM actually works for that language. And so there are several dimensions that we need to consider when we are evaluating these LLMs.

[Music]

Sridhar Vedantham: Before I interrupted you, you were talking about certain projects that you were working on, which are to do with evaluating LLMs, right? I think there's something called Mega, and there's something called Megaverse. Could you tell us a little bit about those and what exactly they do?

Sunayana Sitaram: Sure. So Mega project we started when ChatGPT came out basically. And the question that we were trying to answer was how well these kinds of LLMs perform on languages of the world. So with Mega what we did was, we took already existing open source benchmarks that tested for different kinds of capabilities. So some of them were question-answering benchmarks. Some of them were testing for whether it can summarize text properly or not. Some of them were testing for other capabilities like reasoning etc. And we tested a bunch of models across all of these benchmarks and we covered something like 80 different languages across all these benchmarks. And our aim with Mega was to figure out what the gap was between English and other languages for all of these different tasks, but also what the gap was between the older models, so the models pre LLM and LLMs. Whether we've become better or worse in terms of linguistic diversity and performance on different languages in the new era of LLMs or not. And that was the aim with Mega.
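The "gap between English and other languages" described here can be sketched as a simple per-task difference in benchmark scores. The code below is a minimal illustration, not the Mega methodology: the function names, the task names, the language codes and every number are placeholders invented for this example, and it assumes all tasks are scored on a comparable scale.

```python
from statistics import mean


def gap_to_english(scores):
    """For each task, map language -> (English score - that language's score)."""
    gaps = {}
    for task, by_lang in scores.items():
        english = by_lang["en"]
        gaps[task] = {
            lang: english - score
            for lang, score in by_lang.items()
            if lang != "en"
        }
    return gaps


def mean_gap_per_language(gaps):
    """Average each language's gap across all tasks it appears in."""
    per_lang = {}
    for by_lang in gaps.values():
        for lang, gap in by_lang.items():
            per_lang.setdefault(lang, []).append(gap)
    return {lang: mean(values) for lang, values in per_lang.items()}


# Placeholder scores (made up, NOT results from Mega or any real benchmark).
scores = {
    "question_answering": {"en": 80, "hi": 60, "sw": 50},
    "summarization": {"en": 70, "hi": 60, "sw": 40},
}

gaps = gap_to_english(scores)
print(mean_gap_per_language(gaps))
```

Running the same computation for an older model and for an LLM, and comparing the two sets of gaps, is one way to ask the question raised above: whether the gap has narrowed or widened.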

Sridhar Vedantham: Sorry, but what was the result of it? Have we become better or not?

Sunayana Sitaram: Yeah. So we have mixed results. So for certain languages, we are doing quite well, but for some languages, unfortunately the larger models don't do as well as some of the older models used to do. And the older models used to be specialized and trained on labeled data as I said in the beginning, right? And that would help them also be better at all the languages under consideration, whereas with the LLMs we were not really using labeled data in a particular language to train them. And so, we found that, you know, in some cases, the performance of English had shot up drastically and so the gap between English and other languages had also increased in the case of LLMs.

Sridhar Vedantham: Ok. So, the performance of English, I mean the LLMs, that's much better than what there was earlier, but the other languages didn't manage to show the same performance increase.

Sunayana Sitaram: That's right. They didn't always show. Some of them did, some of the higher resource languages written in the Latin script, for example, did perform quite well, but some of the others did not.

Sridhar Vedantham: OK. And after Mega, then what happened?

Sunayana Sitaram: Yes. So with Mega, we were primarily trying to compare the GPT family of models with the older generation of models as I mentioned. But then we realized by the time we finished the work on Mega, there was a plethora of models that came out. So there's Llama and, you know other models by competitors as well as smaller models, the SLMs, you know, like the Llama sized models, Mistral etc, right. So, there were all of these different models. And then we wanted to see across different models, especially when you're trying to compare larger models with smaller models, how do these trends look? And that is what we call Megaverse, where we do all of these different evaluations, but not just for the GPT family, but across different models. And what we found in Megaverse was that the trends were similar, that there were some languages that were doing well, some of the other lower resource languages, especially the ones written in other scripts, were not doing so well. So, for example, the Indian languages were not doing very well across the board. But we also found that the larger frontier models, like the GPT models, were doing much better than the smaller models for multilingual. And this is again something that you know was shown for the first time in our work that there is this additional gap when you have this large model and small model and there are important practical implications of this. So, say you're trying to integrate the small model into your workflow as a startup or something like that in a particular language, because it is cheaper, it is much more cost efficient, etc, then you may not get the same performance in non-English languages as you would get with the larger model, right? So that has an impact in the real world.

Sridhar Vedantham: Interesting. And how do you draw this line between what constitutes a large language model and what constitutes a small language model? And I'm also increasingly hearing of this thing called a tiny language model.

Sunayana Sitaram: That's right. Yeah. So the large language models are the GPTs, the Geminis, you know, those kinds of models. Everything else, we just club as a smaller language model. We don't really draw a line there. I haven't actually seen any tiny models that do very well on multilingual. They're tiny, because they are, you know, trained on a smaller set of data, they have fewer parameters and typically we haven't seen too many multilingual tiny models so we haven't really evaluated those. Although there is a new class of models that have started coming up, which are language specific models. So, for example a lot of the Indic model developers have started building specialized models for one language or a small family of languages.

Sridhar Vedantham: OK, so going back to something you said earlier, how do these you know kind of models that people are building for specific Indian languages actually work or perform, given that, I think we established quite early in this podcast that, these are languages that are highly under resourced in terms of data to build models.

Sunayana Sitaram: That's right. So I think it's, it's not just a problem of them being under resourced, it's also that the proportion of data in the model for a particular language that is not English, say Hindi or Malayalam or Kannada, is very tiny compared to English. And so there are ways to actually change this by doing things with the model after it has been trained. So this is called fine tuning. So what you could do is you could take, say, an open source model which is like a medium sized or a small model and then you could fine tune it or specialize it with data in a particular language, and that actually makes it work much better for that particular language because the distributions shift towards the language that you're targeting. And so, it's not just, you know, about the amount of data, but also the proportion of data and how the model has been trained in these giant models that cover hundreds of languages in a single model versus, you know, having a model that is specialized to just one language which makes it do much better. So these Indic models we have found actually do better than the open source models that they were built on top of, because now they have been specialized to a particular language.

Sridhar Vedantham: OK. I know that your work focuses primarily on the evaluation of LLMs, right? There must be a lot of other people who are also doing similar work in terms of evaluating the performance of LLMs on different parameters. How do you differentiate your work from what others are doing?

Sunayana Sitaram: Yeah, that's a great question. So we've been doing evaluation work pre LLM actually. We started this a few years ago. And so we've actually done several evaluation projects. The previous one was called LITMUS where we were trying to evaluate without even having a benchmark in a particular language, right? And so we've built up a lot of expertise in how to do evaluation, and this has actually become a very hard problem in the LLM world because it's becoming increasingly difficult to figure out what the strengths and weaknesses are of these LLMs because of how they're built and how they behave, right. And so I think we bring in so much rich evaluation expertise that we've been able to do these kinds of, you know, Mega evaluations in a very systematic way where we've taken care of all of the loose threads that others don't take care of. And that is why we managed to do these comprehensive giant exercises of Mega and Megaverse and also got these clear trends from them. So in that sense I would say that our evaluation research is very mature and we've been spending a lot of time thinking about how to evaluate, which is unique in our group.

Sridhar Vedantham: OK, so one thing I've been curious about for a while is there seems to be a lot of cultural and social bias that creeps into these models, right? How does one even try to address these issues?

Sunayana Sitaram: That's right. So, I think over the last few months, building culture-specific language models or even evaluating whether language models are appropriate for a particular culture, etc, that has become a really hot topic. Because people have started seeing that, you know, most of these language models are a little tilted towards Western, Protestant, rich, industrialized kind of worldviews and the values that they encode may not be appropriate for all cultures. And so there have been some techniques that we've been working on in order to again shift the balance back into other target cultures that we want to fine tune the model for, so again, you know, you could take data that has characteristics of a particular culture, values of a particular culture, and then do some sort of fine tuning on a model in order to shift its distribution more towards a target culture. There are techniques that are coming to be for these kinds of culture-specific language models. However, I still think that we are far away from a comprehensive solution, because even defining what culture is and what constitutes, you know, say an Indic culture LLM, I think that's a really hard problem. Because culture is complex and there are so many factors that go into determining what culture is, and also it's deeply personal. So, each individual has their own mix of factors that determine their own culture, right? So, generalizing that to an entire population is also quite hard, I think, to do. So, I think we're still in the very initial stages in terms of actually figuring out how well aligned these models are to different cultures and also trying to sort of align them to any specific target cultures. But it is a hot topic that a lot of people are currently working on.

Sridhar Vedantham: Yeah. You know, while you're talking and giving me this answer, I was thinking that if you're going to go culture by culture, first of all, you know, what is the culture, what are you doing about subcultures and how many cultures are there in the world, so I was just wondering how it's going to even work in the long term? But I guess you answered the question by saying it's just starting. Now let's see how it goes.

Sunayana Sitaram: Absolutely.

Sridhar Vedantham: It's a very, very open canvas right now, I guess.

Sunayana Sitaram: Yeah.

Sridhar Vedantham: Sunayana, you know you've been speaking a lot about evaluation and so on and so forth and especially in the context of local languages and smaller languages and Indic languages and so on. Are these methods of evaluation that you talk about, are they applicable to different language groups and languages spoken in different geographies too?

Sunayana Sitaram: Absolutely. So in the Mega and Megaverse work, we covered 80 languages and many of them were not Indic languages. In fact, in the Megaverse work, we included a whole bunch of African languages as well. So the techniques, you know, would be applicable to all languages for which we have data, for which data exists on the web. Where it is challenging is the languages that are only spoken, that are not written, and languages for which there is absolutely no data or representation available on the web, for example. So, unfortunately, there aren't benchmarks available for those languages, and so we would need to look at other techniques. But other than that, our evaluation techniques are for, you know, all languages, all non-English languages.

[Music]

Sridhar Vedantham: There is something that I heard recently from you which again I found extremely interesting. It's a project called Pariksha, which I know, in Hindi and derived from Sanskrit, basically means test or exam. And I remember this project because I'm very scared of tests and exams, and I've always been from school. But what is this?

Sunayana Sitaram: Yes, Pariksha is actually quite a new project. It's under the project VeLLM that is on universal empowerment with Large Language Models and Pariksha is something that we are super excited about because it's a collaboration with Karya, which is an ethical data company that was spun off from MSR India. So what we realized a few months ago is that you know there is just so much happening in the Indic LLM space and there are so many people building specialized models either for a single language or for a group of languages like Dravidian languages, for example. And of course, there are also the GPTs of the world which do support Indian languages as well, right. So now at last count, there are something like 30 different Indic LLMs available today. And if you're a model builder, how do you know whether your Indic LLM is good or better than all of the other LLMs? If you're somebody who wants to use these models, how do you know which ones to pick for your application? And if you're a researcher, you know, how do you know what the big challenges are that still remain, right?

Sridhar Vedantham: Right.

Sunayana Sitaram: And so to address this, of course, you know, one way to do this is to do evaluation, right. And try to figure out, you know, compare all these models on some standard benchmarks and then try to figure out which ones are the best. However, what we found from our work with Mega and Megaverse is that the Indian language benchmarks unfortunately are usually translations of already existing English benchmarks and also many of them are already present in the training data of these large models, which means that we can't use the already existing benchmarks to get a very good idea about whether these Indic LLMs are culturally appropriate, whether they capture linguistic nuances in the Indian languages or not, right. So we decided to sort of reinvent evaluation for these Indic LLMs and that's where Pariksha came in. But then how do we scale if we want to actually, you know, get this kind of evaluation done and we were looking at human evaluation to do this, right. And so, we thought of partnering with Karya on this. Because Karya has reach in all the states in India and they have, you know, all of these workers who can actually do this kind of evaluation for different Indian languages. And so, what Pariksha is, it's a combination of human evaluation as well as automated evaluation. And with this combination we can scale and we can do thousands and thousands of evaluations, which we have already done actually on all of these different models. And so this is the first time actually that all of the different Indic LLMs that are available are being compared to each other in a fair way. And we are able to come up with a leaderboard now of all of the Indic models for each Indic language that we are looking at. So that's what Pariksha is. It's quite a new project and we've already done thousands of evaluations and we are continuing to scale this up even further.

Sridhar Vedantham: So how does someone, you know, if I have an LLM of my own in any given Indic language, how do I sign up for Pariksha, or how do I get myself to be evaluated against the others?

Sunayana Sitaram: Yeah. So you can contact any of us for that, the Pariksha team. And we will basically include this model, the new model into the next round of evaluation. So what we do with Pariksha is we do several rounds. So we've already finished a pilot round and we're currently doing the first round of evaluations. So we would include the new model in the next round of evaluations. And you know, as long as it's an open source model or there is an API access available for that model, we can evaluate the model for you. We are also planning to release all the artifacts from Pariksha, including all the evaluation prompts. So even if it is a closed model, you can use these to do your own evaluation as well later to figure out how you compare with the other models on the leaderboard.

Sridhar Vedantham: Right. Quick question. When you say that you're working with Karya, and you also say that you're looking at human evaluation along with the regular methods of evaluation. Why do you need human evaluation at all in these situations? Isn't it just simpler to throw everything into a machine and let it do the work?

Sunayana Sitaram: Yeah, that's a great question. So we did some work on, you know, making machines evaluators. So basically asking GPT itself to be the evaluator and it does a very good job at that. However, it has some blind spots. So we found that GPT is not a very good evaluator in languages other than English. Basically, it's not a good evaluator in the languages that it doesn't do well in otherwise and so using only automated techniques to do evaluation may actually give you the wrong picture. It may give you the wrong sort of trends, right? And so we need to be very careful. And so our ultimate goal is to build evaluation systems and also other kinds of systems in general where humans and LLMs can work together. And so the human evaluation part is to have checks and balances on the LLM evaluation part. Initially, what we are doing is we're getting the same things evaluated by the human, and the LLM is doing the exact same evaluation. So we have a point by point comparison of what the humans are saying and what the LLM is saying so that we can really see where the LLM goes wrong, right. Where it doesn't agree with humans. And then we use all of this information to improve the LLM evaluator itself. So we're really trying to get humans to, you know, do the evaluation, get LLMs to do the evaluation, use the human data in order to improve the LLM. And then just this continues in a cycle. And the ultimate goal is, send the things to the LLM that it's good at doing and send the rest of the things that the LLM can't do to humans who are like the ultimate authority on the evaluation. So it's like this hybrid system that we are designing with Pariksha.
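The point-by-point comparison and routing idea described above could be sketched roughly as follows. This is a simplified illustration and not the Pariksha pipeline: it assumes each item carries one human label and one LLM label, routes whole languages rather than individual items, and uses a hypothetical agreement threshold of 0.8. The records are placeholders, not real evaluation data.

```python
from collections import defaultdict


def agreement_by_language(records):
    """records: iterable of (language, human_label, llm_label) tuples.

    Returns the fraction of items per language where the LLM evaluator
    gave the same label as the human evaluator.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for lang, human_label, llm_label in records:
        total[lang] += 1
        agree[lang] += int(human_label == llm_label)
    return {lang: agree[lang] / total[lang] for lang in total}


def route(agreement, threshold=0.8):
    """Send a language to the LLM evaluator only if it agrees with humans
    often enough; otherwise keep humans as the evaluators."""
    return {
        lang: ("llm" if rate >= threshold else "human")
        for lang, rate in agreement.items()
    }


# Placeholder records (made up for illustration).
records = [
    ("en", "good", "good"), ("en", "bad", "bad"),
    ("en", "good", "good"), ("en", "bad", "bad"),
    ("kn", "good", "bad"), ("kn", "bad", "bad"),
    ("kn", "good", "bad"), ("kn", "bad", "good"),
]

rates = agreement_by_language(records)
print(rates)
print(route(rates))
```

The disagreeing items are the interesting output in practice: they show where the LLM evaluator goes wrong and are the data that would feed back into improving it, which is the cycle described in the conversation.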

Sridhar Vedantham: Interesting. OK, so I know we are kind of running out of time. My last question to you would be, where do you see evaluation of LLMs and where do you see your work going or progressing in the near future?

Sunayana Sitaram: So evaluation for me is a path to understanding what these systems can do and cannot do and then improving them, right. So our evaluations are always actionable. So we try to figure out why something is not working well. So even in the Mega paper, we had lots of analysis about what factors may lead to, you know, lower performance in certain languages, etc. So I see all of this as providing a lot of rich information to model developers in order to figure out what the next steps should be, how they should be designing the next generation of models and I think that has already happened. It's, you know, systems have already improved from the time we started working on Mega and a lot of the issues that we pointed out in Mega, like tokenization etcetera, now they are well known in the field and people are actually taking steps in order to make those better in these language specific models etc. So I see the work as being, you know, first of all raising awareness about the problems that exist, but also providing actionable insights on how we could improve things. And with Pariksha also the idea is to release all the artifacts from our evaluation so that Indic model developers can use those in order to improve their systems. And so I see that you know better evaluation will lead to better quality models. That's the aim of the work.

Sridhar Vedantham: Sunayana, thank you so much for your time. I really had a lot of fun during this conversation.

Sunayana Sitaram: Same here. Thank you so much.

Sridhar Vedantham: Thank you.