Podcast: What 'bhasha' do you want to talk in? With Kalika Bal…

Microsoft Research India Podcast

Podcast: What 'bhasha' do you want to talk in? With Kalika Bali and Dr. Monojit Choudhury

June 01, 2020

Episode 003 | June 02, 2020

Many of us who speak multiple languages switch seamlessly between them in conversations and even mix multiple languages in one sentence. For us humans, this is something we do naturally, but it’s a nightmare for computing systems to understand mixed languages. On this podcast with Kalika Bali and Dr. Monojit Choudhury, we discuss codemixing and the challenges it poses, what makes codemixing so natural to people, some insights into the future of human-computer interaction and more.

Kalika Bali is a Principal Researcher at Microsoft Research India working broadly in the area of Speech and Language Technology especially in the use of linguistic models for building technology that offers a more natural Human-Computer as well as Computer-Mediated interactions, and technology for Low Resource Languages. She has studied linguistics and acoustic phonetics at JNU, New Delhi and the University of York, UK and believes that local language technology especially with speech interfaces, can help millions of people gain entry into a world that is till now almost inaccessible to them.

Dr. Monojit Choudhury is a Principal Researcher in Microsoft Research Lab India since 2007. His research spans many areas of Artificial Intelligence, cognitive science and linguistics. In particular, Dr. Choudhury has been working on technologies for low resource languages, code-switching (mixing of multiple languages in a single conversation), computational sociolinguistics and conversational AI. He has more than 100 publications in international conferences and refereed journals. Dr. Choudhury is an adjunct faculty at International Institute of Technology Hyderabad and Ashoka University. He also organizes the Panini Linguistics Olympiad for high school children in India and is the founding chair of the Asia-Pacific Linguistics Olympiad. Dr. Choudhury holds a B.Tech and PhD degree in Computer Science and Engineering from IIT Kharagpur.

Microsoft Research India Podcast: More podcasts from MSR India
iTunes: Subscribe and listen to new podcasts on iTunes
Android
RSS Feed
Spotify
Google Podcasts
Email

Transcript

Monojit Choudhury: It is quite fascinating that when people become really familiar with a technology, and search engine is an excellent example of such a technology, people really don’t think of it as technology, people think of it as a fellow human and they try to interact with the technology as they would have done in natural circumstances with a fellow human.

[Music plays]

Host: Welcome to the Microsoft Research India podcast, where we explore cutting-edge research that’s impacting technology and society. I’m your host, Sridhar Vedantham.

[Music plays]

Host: Many of us who speak multiple languages switch seamlessly between them in conversations and even mix multiple languages in one sentence. For us humans, this is something we do naturally, but it’s a nightmare for computing systems to understand mixed languages. On this podcast with Kalika Bali and Monojit Choudhury, we discuss codemixing and the challenges it poses, what makes codemixing so natural to people, some insights into the future of human-computer interaction and more.

[Music plays]

Host: Kalika and Monojit, welcome to the podcast. And thank you so much. I know we’ve had trouble getting this thing together given the COVID-19 situation, we’re all in different spots. So, thank you so much for the effort and the time.

Monojit: Thank you, Sridhar.

Kalika: Thank you.

Host: Ok, so, to kick this off, let me ask this question. How did the two of you get into linguistics? It’s a subject that interests me a lot because I just naturally like languages and I find the evolution of languages and anything to do with linguistics quite fascinating. How was it that both of you got into this field?

Monojit: So, meri kahani mein twist hai (In Hindi- “there is a twist in my story”). I was in school, quite a geeky kind of a kid and my interests were the usual Mathematics, Science, Physics and I wanted to be a scientist or an engineer and so on. And, I did study language, so I know English and Hindi which I studied in school. Bangla is my mother tongue, so, of course I know. And I also studied Sanskrit in great detail, and I was interested in the grammar of these languages. Literature was not something which would pull me, but language was still in the backbench right, what I really loved was Science and Mathematics. And naturally I ended up in IIT, I studied in IIT Kharagpur for 4 years doing Computer Science, and everything was lovely. And then one day there was a project when we were in final year where my supervisor was working on what is called a text to speech system. So, in this system, it takes a Hindi text and the system would automatically speak it out and there was a slight problem that he was facing. And he asked me if I could solve that problem. I was in my final year- undergrad year at that time. And the problem was how to pronounce Hindi words correctly. At that time, it sounded like a very simple problem, because in Hindi the way we write is the way we pronounce unlike English, where you know, you have to really learn the pronunciations. And turns out, it isn’t. If you think of the words, ‘Dhadkane’ and ‘Dhadakne’, you pretty much write them in exactly the same way, but one you pronounce as ‘Dhadkane’ and the other one is pronounced as ‘Dhadakne’. So, this was the issue. So, my friend, of course, who was also working with me was all for machine learning. And I was saying, there must be a pattern here and I went through lots and lots of examples myself and turned out that there is this very short, simple, elegant rule which can explain most of Hindi words- the pronunciation of those words perfectly. So, I was excited. I went to my professor, showed him the thing, he was saying, “Oh! This is fantastic!”, let’s write a paper and we got a paper and all this was great. But then, somebody, when I was presenting the paper said, “Hey, you know what the problem you solved!” It’s called ‘schwa deletion’ in Hindi. Of course, I wasn’t in linguistics, neither my professor was, so he had no clue what was ‘schwa’ and what was ‘schwa deletion’. I dug a little deeper and found out that people had written entire books on ‘schwa deletion’. And, actually what I really found out was in line with what people had done their research on. And this got me really excited about linguistics. And more interestingly, you know, what I saw is, like you said, language evolution, if you think of why this is there. So, Hindi uses exactly the same style of writing that we use for Sanskrit. But in Sanskrit, there is no ‘schwa deletion’. But if you look at all the modern Indian languages which came from Sanskrit, like Hindi, Bengali or Oriya, they have different degree of pronunciation different from Sanskrit. I am not getting into the detail of what exactly is ‘schwa deletion’, that’s besides the point. But the pronunciations evolve from the original language. The question I then eventually got interested in is, how this happens and why this happens. And then I ended up doing a Ph.D. with the same professor on, language evolution and how sound change happens across languages. And of course, being a computer scientist, I tried modelling all these things computationally. And then there was no looking back, I went, more and more deeper into language, linguistics and natural language processing.

Host: That's fascinating. And I know for sure that Kalika has got an equally interesting story, right? Kalika, you have a undergrad degree in chemistry?

Kalika: I do.

Host: Linguistics doesn’t seem very much like a natural career progression from there.

Kalika: Yes, it doesn’t. But before I start my story, I have one more interesting thing to say. When Monojit was presenting his ‘schwa deletion’ paper, I was in the audience. I was working somewhere else and I looked at my colleague at that time and said, “We should get this guy to come and work with us.” So, I actually was there when he was presenting that particular ‘schwa deletion’ paper. So, yes, I was a Science student, I was studying Chemistry, and after Chemistry, the thing in my family was everybody goes for higher studies, I rebelled. I was one of those difficult children that we now are very unhappy about. But I said that I didn’t want to study anymore. I definitely didn’t want to do Chemistry and I was going to be a journalist, like my dad. I had already got a job to work in a newspaper. And I went to the Jawaharlal Nehru University to pick up a form for my younger sister. And I looked at the university and said, “This is a nice place, I want to study here.” And then I looked at the prospectus, kind of flicked through it and said, “what’s interesting?”. And I looked at this thing called Linguistics, and it seemed very fascinating. I had no idea what linguistics was about. And then, there was also ancient history which I did know what it was about and it seemed interesting. So, I filled in forms and sat for the entrance exam, after having read like a thin, layman’s guide to linguistics I borrowed from the British Council Library. And I got through. And the interesting thing is that the linguistic entrance exam was in the morning, the ancient history exam was in the afternoon. This was peak summer in Delhi. There were no fans in the place where the exam was being held. So, after taking the linguistic exam, I thought I can’t sit for another exam in this heat and I left. So, I only took the linguistic exam. I got through, no one was more surprised than I was. And I saw it as a sign that I should be going. So, I started a course without having any idea what linguistics was and completely fell in love with the subject within the first month. And coming from a science background, I was very naturally attracted towards phonetics, which I think is, to really understand phonetics and speech science part of linguistics, you do need to have a lot of understanding of how waves worked- the physics of sound. So, that part came a little naturally to me and I was attracted towards speech and the rest as they say is history. So, I went from there, basically.

Host: Nice. So, chemistry’s loss is linguistics gain.

Kalika: Yeah, my gain as well.

Host: Ok, so, I’ve heard you and Monojit talk at length and many times about this thing called codemixing. What exactly is codemixing?

Kalika: So, codemixing is when people in a multi-lingual community switch back and forth between two or more languages. And you know, as we all, all of us here come from multi-lingual communities where at a community level, not at an individual level, all of us speak more than one language, two, three, four. It’s very natural for us to keep switching between these languages in a normal conversation. So, right now of course, we are sticking to English, but if this was, say, in a different setting, we would probably be switching between Hindi, Bengali and English because these are three languages, all three of us understand, right.

Host: That’s true.

Kalika: That’s what code switching is, when we mix languages that we know, when we talk to each other, interact with each other.

Host: And how prevalent it is?

Kalika:“Abhi bhi kar sakte hain” (in Hindi- “we can even do it now”). We can still switch between languages.

Monojit: Yeah.

Host: “Korte pari” (In Bangla- “we can do that”). Yeah, Monojit, were you saying something when I interrupted you?

Monojit: You asked how prevalent it is. So, actually, linguists have observed that in all multi-lingual societies where people know multiple languages at a societal level, they codemix. But there is no quantitative data for how much mixing is there and one of the first things we tried to do when we started this project was to do some measurement and see how much mixing does really happen. We looked at social media where people usually talk the way they talk in their real life. I mean they type it, but it’s almost like speech. So we studied English-Hindi mixing in India and some of the interesting things we found is, if you look at public forums on Facebook in India and if you look at sufficiently long threads, let’s say 50 or more comments, then all of them are multi-lingual. You will find at least two comments in two different languages. And sometimes there will be many many languages, right, not only two languages. And interestingly, if you look at each comment, and try to measure how many of them are mixed within itself, like a single comment has multiple languages, it’s as high as 17%. Then, we extended this study to Twitter and now for seven European languages including English, French, Italian, Spanish, Portuguese, German, Turkish. And we studied how much codemixing was happening there. Again, interestingly, 3.5% of the tweets from, I would say the western hemisphere is codemixed. I would guess from South Asia, the number would be very high, we already said 17% for India itself. But then, what’s interesting is, if you look at specific cities, the amount of codemixing also varies a lot. So, in our study we found Istanbul has the largest amount of codemixed tweets, as high as 13%. Whereas some of the cities in the US, let’s say Houston, or cities in southern United States where we know that there is a huge number of English-Spanish bilinguals, even then we see around 1% of codemixing. So, yes, it’s all over the world and it’s very prevalent.

Kalika: Yeah, and I would like to add that there is this mistaken belief that people codemix because they are not proficient in one language, you know, people switch to their so called native language or mother tongue when they are talking in English because they don’t know English well enough or they can’t think of the English word when they are talking in English and therefore they switch to, say Hindi or Spanish or some other language. But that actually is not true. For people to be able to fluently switch between the two languages and fluently codemix and code switch, they actually have to know both the languages really well. Otherwise, it’s not mixing or code switching, it is just borrowing… borrowing from one language to another.

Host: Right. So, familiarity with multiple languages basically gets you codemixing, whereas if you are forced to do it, that’s not codemixing. Codemixing is more intentful and purposeful is what you are saying.

Kalika: Exactly.

Host: Ok. Do you see any particular situations or environments in which codemixing seems to be more prevalent than not?

Kalika: Yeah, absolutely. So, in the more formal scenarios, we definitely tend to stick to one language and if you think about it, even if you are a mono-lingual, when you are talking in a formal setting, you kind of have a very structured and have a very different kind of language used than when you are speaking in an informal scenario. But as far as codemixing is concerned, over the years when linguists actually started looking into this, you know some of the first papers that are published on code switching are from 1940’s. And at that time, it was definitely viewed as an informal use of language, but as our language use over the decades has become… you know informal has become much more acceptable in various scenarios. We’ve kind of also started codemixing in a lot of scenarios. So earlier if you’ve thought about it, if you looked at television, people stuck to just one language at a time. So, of it was a Hindi broadcast, it was just Hindi, if it was an English broadcast, it was just English. But now, television, radio, they all switch between English and multiple Indian languages when they are broadcasting. So, though it is like a much more informal scenario… use-case, now it’s much more prevalent in various scenarios.

Monojit: And to add to that, there is a recent study which says that there is all the signs that Hinglish- mixing of Hindi and English- is altogether a new language rather than mixing. Because there are children who grow up with that as their mother tongue. So, they hear Hinglish being spoken or in other words codemixing between these two languages happening all the time in their family, by their parents and other in their family and they take that as the language or the native language they learn. So, it’s quite interesting like on one extreme like Kalika earlier mentioned, there are words which are borrowed, so you just borrow them to fill a gap which is not there in your language, or you can’t remember, whatever the reason might be. On the other extreme, you have two languages that are fused to give a new language. So, these are called fused-lects like Hinglish. I would leave it to you to decide whether you consider it as a language or not. But definitely there are movies which are entirely in Hinglish or ads which are in Hinglish, you can’t say it’s either Hindi or English. And in between, of course there is a spectrum of different levels and level of integration of mixing between the languages

Host: This is fascinating. You are saying something like Hinglish, kind of becomes a language that’s natural rather than being synthetic.

Kalika: Yes.

Monojit: Yes.

Host: Wow! Ok.

Kalika: I mean, if you think of a mother tongue as the language that you dream in and then ask yourself what is the language that you dream in- I dream in Hinglish, so that’s my mother tongue.

[Music plays]

Host: How does codemixing come into play or how does it impact the interaction that people have with computers or computing systems and so on?

Monojit: So, you know, there is again another misconception which is, in the beginning we said that when people codemix, they know both the languages equally well. So, the misconception is if I know both Hindi and English and my system, let’s say a search engine or a speech recognition or a chat bot system, understands only one of the languages, let’s say English, then I will not use the other language or I will not mix the two languages. But we have seen that this is not true. In fact, long time ago, when I say long time I mean, let’s say ten years ago, when there was no research in computational processing of codemixing and there were no systems which could handle codemixing, even at that time, we saw that people issued a lot of queries to Bing which were codemixed. My favorite example is this one – “2017 mein, scorpio rashi ka career ka phal” in Hindi. So, this is the actual query. And everything is typed in the Roman script. Now, it has mixed languages, it has mixed scripts and everything. So It is quite fascinating that when people become really familiar with a technology, and search engine is an excellent example of such a technology, people really don’t think of it as technology, people think of it as a fellow human and they try to interact with the technology as they would have done in natural circumstances with a fellow human.. And that’s why even though we designed chat bots or ASR (automatic speech recognition) systems, thinking of one particular language in mind, but when we deploy them, we see everybody is mixing languages actually, even without realizing that they are mixing languages. So in that sense all technologies that we build which are user facing or any technology that is actually analyzing data which is user generated ideally should have the capability to process codemixed input.

Host: So, you used the word ideally which obviously means that it’s not necessarily happening may be too often or as much as it should be. So, what are the challenges out here?

Kalika: Initially, the challenge was to accept that this happens. But now we have crossed that barrier and people do accept that large percentage of this world lives in multi-lingual communities and this is a problem. And if they are to interact naturally with the so-called natural language systems, then they have to use and process codemixing. But I think the biggest challenge is data because most of the technologies… language technologies these days are data hungry. They all are based on machine learning and deep neural network systems and we require a huge amount of data to train these systems. And it’s not possible to get data in the same sense for codemixing as we can for mono-lingual language use, because if you think about it, the variation in code mixing where you can switch from one language to another is very high. So, to be able to get enough examples in your data of all the possible ways in which people can mix two languages is a very, very difficult task. And this has implications for almost all the systems that we might want to look at like machine translations, speech recognition, because all of these ultimately rest on language models and to train these language models we need this data.

Host: So, are there any ways to address this challenge of data?

Monojit: So, there are several solutions that we actually thought of. One thing is, asking a fundamental question that “Do we really need a new data set for training codemix systems?”. For instance, imagine a human being who knows two languages, let’s say Hindi and English which the three of us know. And imagine that we have never heard anybody mix these two languages in our life before. A better example might be English and Sanskrit. I really haven’t heard anybody mixing English and Sanskrit. But if somebody does mix these two languages, would I be able to understand? Would I be able to point out- this sounds grammatical and this doesn’t? It turns out that intuitively at least, for human beings, that’s not a problem. We have an intuitive notion of what is mixing and which patterns of mixing are acceptable. And we really don’t need to learn codemix language as a separate language once we know the two languages involved equally well. So, this was the starting point for most of our research. So then, we thought, how best- instead of creating data in codemixed language- can we start with mono-lingual data sets or mono-lingual models and from there somehow combine them to build codemixed models? Now there are several approaches that we took and they worked to various degrees. But the most interesting one which I would like to share is based on some linguistic theories. Now, these linguistic theories, says that certain, I mean given the grammar of the two languages, so if you have the grammar of English and let’s say Hindi and depending on how these grammars are, there are only certain ways in which mixing is acceptable. And to give an example, let’s say, I can say, “I do research on codemixing”. Now, for this, I can codemix and say… let’s say, “Main codemixing pe research karta hoon”. It sounds perfectly normal. “I do shodh karya on codemixing”- we don’t use it that often. Probably we wouldn’t have heard, but you still might find it quite grammatical. But if I say, “Main do codemixing par shodh karya”, does it sound natural to you? Now, there is something which doesn’t sound right, and linguists have theories on why this doesn’t sound right. And, starting from those theories we build models which can take data in two languages… parallel data or if you have a translator, then you can actually translate a sentence, let’s say, “I do research in codemixing.” And you use a English-Hindi translator and translate it into Hindi: “Main codemixing (I don’t know what the Hindi for codemixing is) par shodh karya karta hoon”. And then given these two sentences… this pair of parallel sentences, there is a systematic process in which you can generate all the possible ways in which these two sentences can be mixed in a grammatically valid way, when you are saying Hinglish. Now, we built those models, the linguistic theories were more theories, so we had to build… we had to flesh them out and build real systems which could generate this. Now, once we have that, now you can imagine that there is no dearth of data. You can take any data in a mono-lingual… in a single language… any English sentence and convert it into codemixed Hindi versions. And then you have lot of data. And then whatever you could do for English, you can now train the same system on this artificially created data and you can solve those tasks. So that was the basic idea using which we could solve a lot of different problems starting from translation to part of speech tagging, to sentiment analysis to parsing.

Host: So, what you are saying is that given that you need a huge amount of data to build… build out models, but the data is not available, you just create the data yourself.

Monojit: Right.

Host: Wow.

Kalika: Yes, based on certain linguistic theoretical models which we have made into computational linguistic theoretical models.

Host: Ok, so, we’ve then talking about codemixing as far as textual data is concerned for the large part. Now, are you doing something as far as speech is concerned?

Kalika: Yes, speech is slightly more difficult than pure text, primarily because there you have to kind of look at both the acoustic models as well as the language models. But our colleague Sunayana Sitaram, she’s been working now for almost three years on codemixed automatic speech recognition system and she had… she had actually come up with this really interesting Hindi-English ASR system which mixed between Hindi and English and… was able to recognize a person speaking in mixed Hindi-English speech.

Host: Interesting. And where do you see the application of all the work that you guys have done? I mean, I know you have been working on this stuff for a while now, right?

Kalika: If you think about opinion mining as one of the things and you are looking at a lot of user generated data. The user generated data is a mix between say, English and Spanish and your system can only process and understand English. It can’t understand either the Spanish part or the mixed part, like both English and Spanish together, then, the chances are that you will only get a very skewed and most probably incorrect view of what the user is saying or what the user’s opinion is. And therefore, any analysis you do on top of that data is going to be incorrect. I think Monojit has a very good example of that in the work that you know we did on sentiment and codemixing on Twitter and he looked at how negative sentiment was expressed on Twitter.

Monojit: Yeah. That’s actually pretty interesting. So this brings us to the question of why people codemix? We said in the beginning that first it’s not random and second it has… it seems to have a purpose. So what is that purpose? Of course, there are lots of theories or observations from linguists starting from humor, sarcasm or even when you are reporting a speech. All these have various degrees of codemixing and there are reasons for this. So, we thought- there is a lot of codemixing on social media, so, we could do a systematic and quantitative study of the different features which make people switch people from Hindi to English or vice-versa. We formulated a whole bunch of hypotheses to test based on the current linguistic theories. So our first hypothesis was that people might be switching from English to Hindi when they are moving from facts to opinions. Because it’s a well-known thing that when you are talking of facts, you can speak it in any language and more likely to be in English in Indian context. Whereas when you are expressing something emotional or an opinion, you are more likely to move… switch to your native language. So people might be more likely switching to Hindi. So, we tried to test all these hypotheses and nothing actually was statistically significant. So, we didn’t see strong signals for that in the data. But then what we saw a really strong signal is when people are expressing negative sentiment they are more likely- actually nine times more likely- to use Hindi than when they are expressing positive sentiment. It seems like English is the preferred language for expressing positive sentiment whereas Hindi is the preferred language for expressing negative sentiment. And we wrote a paper based on these findings that we might praise you in English but gaali to Hindi mein hi denge (In Hindi- we will swear only in Hindi). So, if you did only sentiment analysis in one language, let’s say English and try to do trend analysis of some Indian political issue based on that. It is very likely that you will get a much rosier picture because if you do only English, people would have said more positive things. And the Hindi, I mean, all the gaalis (cuss words) or negative things will actually be in Hindi which you will be missing out. So ideally you should do a processing of all the languages when you are looking at a multi-lingual society and analyzing content from there.

Kalika: Yeah. And this actually touches a lot on why people codemix and that’s a very vast area of research. Because people codemix for a lot of reasons. People might codemix because they want to be sarcastic, people might codemix because they want to express in group… the three of us will… can move to Bengali to kind of bond and show that we are part of this group that knows Bengali. Or, you meet somebody and they want to keep you at a distance, and not talk to you in that language or mix. So people do it for humor, people do it for reinforcement, there’s a lot of reasons why people codemix and if we miss out on all that it’s very hard for us to make any claims… any firm claims on why people are saying what they are saying.

Host: It seems like this is an extremely complex area of research which spans not just the computer science or linguistics but also affects sentiment, opinion, etc., a whole lot of stuff going here.

Monojit: Yeah, and in fact most of the computational linguistics work that you’d see mostly draws from linguistics starting from, you know, how grammar works, syntax and may be how meaning works, semantics. But codemixing goes much beyond that. So, we are talking now of what is called pragmatics and sociolinguistics-. So, pragmatics would be, given a particular context or situation, how language is used there. And modelling pragmatics is insanely difficult. Because you not only need to know the language but you need to know the speakers, the relationship between the speakers, what is the context in which the speakers are situated and speaking and all this information. So, for instance I mean, typically example is if I tell you, “Could you please pass the water bottle?”. Now actually it is a question and you could say, “Yes, I can.”. But that’s not what will satisfy me, right, it’s actually a request. So, that’s how we use language and what we say is not necessarily what we mean. And this intent- understanding this hidden intent is very situational. And in different situation, the same sentence might mean very different things. And codemixing is actually at the boundaries of syntax, semantics and pragmatics. And sociolinguistics is the study of how language is used in society, especially how social variables corelate with linguistic variables. So social variables could be somebody’s level of education, somebody’s age, somebody’s gender, where somebody is from etc. And linguistic variables are whether it’s codemixed or not, at what degree of codemixing, just to give some examples. And we do see some very strong social factors which determine codemixing behavior. In fact, that’s used a lot in our Hindi movies, Bollywood. So, we did a study on Bollywood scripts, so we studied some 37 or 40 Hindi movie scripts which are freely available for research online to see where does codemixing happen in Bollywood. And what we found is codemixing is employed in a very sophisticated way by the script writers in two particular situations. One is, if they want to show a sophisticated urban crowd, as opposed to a rural crowd. So if you look at movies like “Dum Lagake Haisha” which are set either in a small town or in a rural scenario or in the past. Usually those movies will have lot less codemixing. Then, let’s say “Kapoor & Sons” or “Pink” which are set in typically in a city and people are all educated, urban people, so, just to show that codemixing is used heavily in these kinds of movies. And another case where in Bollywood they use a lot of codemixing, in fact accented codemixing, is when you want to show that somebody has been to “foreign” as we would say- abroad- and would come back to India and interact with poor country cousins. So, it’s used a lot in different ways in the movies. And that’s the sociolinguistics bit which is kicking in.

Kalika: And you know to add to that, what we had touched upon earlier how this usage has kind of changed over time. In the earlier Bollywood movies, this mixing was much less. Not only that, the use of English was mostly used to denote who is the villain in the movie. The evil guys were usually the ones who spoke… if you look at 1970’s or 60’s movies, it’s always the smugglers, the kingpins of the mafia who spoke a lot of English and mixed English into Hindi. So obviously that kind of change has happened over years even in Bollywood movies.

Host: I would never have thought about all these things. Villains speaking English, ok, in Bollywood!

[Music plays]

Host: Where do you see this area of research going in the future? Do you guys have anything in particular or you are just exploring to see ?

Kalika: I think one of the things we have been looking at a lot is that how when AI interacts with users, with humans, this human-AI interaction scenario, where does codemixing fit in because there is one aspect that the user is mixing and you understand but does the bot or the AI agent also have to mix or not. And if the AI agent has to mix, then where and when should it mix? So, that’s something that we have been looking at and that is something that we think that is going to play an important role in human-AI interaction going forward. We’ve studied this in some detail and it’s actually very interesting- people have a whole variety of attitudes towards not only codemixing but also towards AI bots interacting with them. And this kind of reflects on what they feel about a bot that will codemix and talk to them in a mixture of language irrespective of whether they themselves codemix or not. And our study has shown that some people would look at a bot which codemixes as ‘cool’… and in a very positive way but some people would look at it very negatively. And the reason for that is some people might think that codemixing is not the right thing to do, it’s not a pure language. Other people would think that it’s a bot, it should talk in a very “proper” way, so it should only talk in English or only talk in Hindi and it shouldn’t be mixing language. And a certain set of people are kind of freaked out by the fact that the AI is trying to sound more human like when it mixes. So, there is a wide range of attitudes that people have towards a codemixing AI agent. And how can we kind of tackle that? How do we make a bot then, that codemixes or doesn’t codemix and it please the entire crowd, right?

Host: Is there such a thing like pleasing the entire crowd?

Kalika: So, we have ideas about that. How to go about, trying to at least please the crowd.

Monojit: Yeah. Basically, you have to adapt to the speaker. Essentially the way we please the crowd is through accommodation. So, when we talk to somebody who is primarily speaking in English, I will try to reciprocate in English. Whereas if somebody is speaking in Hindi, I will try to speak in Hindi if I want to please that person. Of course, if I don’t, then I will use the other language to show the social distance. And this is one of the ways which we call the ‘Linguistic Accommodation Theory’. There are many other ways or in general there are various style components that we incorporate in our day to day conversation, mostly unknowingly, based on whether we want to please the other person or not. So, call it sycophancy or whatever, but we want to build bots which kind of model that kind of an attitude. And if we are successful, then the bot will be a world pleaser.

Kalika: I don’t think it has so much to do with sycophancy- human beings actually have to cooperate and that’s in a sense hardwired to a certain extent into our spine now. For evolutionary reasons, we do need to cooperate and to be able to have a successful interaction, we have to cooperate, and one of the ways we do this is by trying to be more like the person we are talking to and both parties kind of converge to a middle ground and that’s what accommodation is all about.

Host: So, Kalika and Monojit, this has been a very interesting conversation. Are there any final thoughts you’d like to leave with the listeners?

Kalika: I hope people get an idea through our work on codemixing that human communication is quite intricate. There are many factors that come into play when human beings communicate with each other. There can be social contexts, there can be pragmatic contexts and of course, the structure of the language and the meaning that you are trying to convey, all of it plays a big role in how we communicate. And by studying codemixing in this context, we are able to hopefully grapple with a lot of these factors which in a very general human-human communication become too big to handle all at once.

Monojit: Yeah. Language is an extremely complicated and multi-dimensional thing, so, codemixing is just one of the dimensions where we are talking of switching between languages, but then even within languages there are words, there are structural differences between languages, sometimes you can use features of another language in your own language. It won’t be called codemixing, but essentially you are mixing. For instance, accents, when you talk your own native language in, let’s say another kind of an accent borrowed from another language. In Indian English we use things like “little-little”… “those little-little things that we say”. Now “little-little” is not really an English construct, this is a Hindi or Indian language construct which we are borrowing into English. So, all this studying at once would be extremely difficult. But on the other hand, codemixing does provide us with a handle into this problem of computational modeling of pragmatics and sociolinguistics and all those concepts and how we can then not only model these things for the sake of modeling, but they are concrete use-cases… not only use-cases, they are needs. Users are already codemixing through technology. So technology should respond back by understanding codemixing and if possible even generating codemixing. So, through this entire research we are trying to close this loop of how linguistic theories can be used to build computational models and these computational models can then be taken to users and in all its complications and complexities and then we understand and learn from the user technology interaction and feed back to our model. So, this entire cycle of theory to application to deployment is what we would like to do or get deeper insight into in the context of natural language processing.

Host: And I am looking forward to doing another podcast once you guys have gone down the road with your research on that. Kalika and Monojit, this was a very interesting conversation. Shukriya (In Hindi/ Urdu- thank you).

Kalika: Aapka bhi bahut bahut thank you (In Hindi- many thanks to you too). It was great fun.

Monojit: Thank you, Sridhar. Khoob enjoy korlam ei conversationta tomar shaathe. Aar ami ekta kotha (In Bangla- “I very much enjoyed this conversation with you, Sridhar. There’s one thing) I want to tell to the audience: Never feel apologetic anytime when you codemix. This is all very natural and don’t think you are talking an impure language. Thank you.

Host: Perfect.

[Music plays]

Download Episode

Bengaluru, India

A technology and research podcast from Microsoft Research India