Phantom Power


Sonic AI: Steph Ceraso & Hussein Boon

April 16, 2023




Today we hear two scholars reading their recent work on artificial intelligence. Steph Ceraso studies the technology of “voice donation,” which provides AI-created custom voices for people with vocal disabilities. Hussein Boon contemplates the future of AI in music through some very short, thought-provoking works of fiction. And we start off the show with Mack reflecting on how hard the post-shutdown adjustment has been for many of us and how that might be feeding into the current AI hype.



For our Patreon members we have “What’s Good” recommendations from Steph and Hussein on what to read, listen to, and do. Join at Patreon.com/phantompower. 



About our guests:



Steph Ceraso is Associate Professor of Digital Writing & Rhetoric in the English Department at the University of Virginia. She’s one of Mack’s go-to folks when trying to figure out how to use audio production in the classroom as a form of student composition. Steph’s research and teaching interests include multimodal composition, sound studies, pedagogy, digital rhetoric, disability studies, sensory rhetorics, music, and pop culture. 



Hussein Boon is Principal Lecturer at the University of Westminster. He’s a multi-instrumentalist, session musician, composer, modular synth researcher, and AI researcher. He also has a vibrant YouTube presence with tutorials on things like Ableton Live production. 



Pieces featured in this episode: 



“Voice as Ecology: Voice Donation, Materiality, Identity” by Steph Ceraso in Sounding Out (2022)



“In the Future” by Hussein Boon in Riffs (2022). 



Mack also mentioned in his rant: 



“Embodied meaning in a neural theory of language” by Jerome Feldman and Srinivas Narayanan (2003). 



“The Contemporary Theory of Metaphor” by George Lakoff (1992). 



Today’s show was produced and edited by Ravi Krishnaswami. 



Transcript

[7:22]: Steph Ceraso, Voice As Ecology: Voice Donation, Materiality, Identity



[24:10] Hussein Boon, Public Access File Trading



[26:26] Hussein Boon, Better Than Minimum Wage: Working For The Man 



[30:10] Hussein Boon, Tinkering At The Edges



Ethereal Voice: This is Phantom Power.



[Ethereal Music]



Steph Ceraso: We must treat voice holistically. Voices are more than people, more than technologies, more than contexts, more than sounds. 



Understanding voice means acknowledging the interconnectedness of these things and how that interconnectedness enables or precludes vocal possibilities.



[Ethereal Music Fades]



Mack Hagood: Welcome to another episode of Phantom Power, where artists and scholars talk about sound. I’m Mack Hagood. 



It’s April, and for those of you in the world of academia, you know what that means. We’re in the final sprint, or maybe the death march, towards the end of the semester. 



And for those of you who are students, you probably find that your plate is overloaded with all those final projects and exams to work on and study for.



And for those of you who are faculty, it’s probably that moment of triage where you have to decide which of your high hopes for the semester is going to have to be sacrificed so you can meet the essential demands of your teaching, service, and research. 



And if you’re like me and a lot of other faculty I’ve spoken to and read, discussing the matter online, you’re probably also watching a lot of your students kind of spin out as the demands of college no longer seem to, I don’t know, be surmountable or maybe even seem necessary.



Honestly, I’m not sure what’s happening with our students right now. The current post-Covid, post-shutdown crop of college students is really struggling and I’m struggling with how to serve them best. 



I don’t know, you know, what combination of understanding, but also maintaining high standards would really serve them best. I don’t want to be some kind of cruel task master, but then I also don’t want to lower my expectations to the degree that they graduate without feeling a sense of accomplishment and ability as they move forward in life. 



Everyone seems to feel like they need a little help right now, and I can’t help but think that these post-shutdown struggles have helped fuel the massive interest that we’ve seen in so-called “artificial intelligence” as of late. 



In my academic world, both students and faculty seem to be hoping that machine learning algorithms can shoulder some of their burden. It’s got me imagining a dystopian future where AI both writes and grades the papers, finally completely removing the human from the humanities. And I think some people would be very happy to see us go out like that.



So, I have my own perspective on so-called “artificial intelligence” as you might glean from the fact that I use the term so-called AI. AI is a branding term. The chatbots that we hear about today have nothing to do with the general artificial intelligence that science fiction writers and philosophers have speculated about for so many decades.



As you probably know, chatbots are just word prediction algorithms, autofill on steroids. They have no understanding or intention, and it’s my strong opinion that an algorithm without a body will never develop human understanding because cognition is embodied and enacted. It’s enacted in interactions between bodies and between bodies and environments.



Our understanding of any utterance is embodied. It’s social, it’s emotional. As one well-cited paper from 20 years ago points out: for us to understand the sentence “Pass the salt,” we need to have had an entire set of sensorimotor experiences, such as grasping something and moving one’s arm through space.



When we read or hear about someone grasping something, we really do a quick mental simulation of grasping based on our own embodied experience. 



And I believe that this phenomenon extends beyond concrete action words to metaphor, as well. As George Lakoff wrote, “Metaphor helps us understand something abstract in terms of something more concrete.”



How can we speak of grasping an idea if we’ve never grasped the salt, or at least seen someone else grasp the salt? In my view, without concrete sensory experiences, there’s just no foundation to build understanding upon. 



How does one understand sentences about time, space, motion, touch, sight, or sound, or any of the countless metaphors based on them, without a body?



I mean, honestly, this is completely obvious in my opinion, but it amazes me how many supporters and doomsayers alike seem to miss this point when it comes to AI. A better, faster, large language model will still never think. It’s not a problem of scale. It’s a problem of kind. 



My university newspaper interviewed me recently about AI and I told them a lot of people are afraid of AI because they think it’s so smart. I’m afraid of it because it’s so dumb. Dumb and powerful. And I have concerns about how much of our cognition and social interaction we are about to outsource to something that is so literally thoughtless. 



Okay, rant over. I mean, having said all of that, machine learning can do amazing things that I’m excited about and it’s already making important impacts on the world of sound. So, for example, really fascinating stuff in the world of music production. 



And this week on today’s show, we have two guests reading recently published works on AI and sound.



Steph Ceraso is Associate Professor of Digital Writing and Rhetoric in the English Department at the University of Virginia. She’s someone I’ve known and followed for a pretty long time now, and she’s definitely one of my top go-to folks when I’m trying to figure out how to use audio production in the classroom as a form of student composition.



Hussein Boon is Principal Lecturer in Music at the University of Westminster. He’s a multi-instrumentalist, session musician, composer, modular synth researcher, and AI researcher. I was not hip to Hussein’s work, but Phantom Power’s producer Ravi Krishnaswami turned me onto some fascinating short stories he wrote about AI and the future of music, and I’m really excited to share some of those stories with you today.



And for those of you who are listening via our Patreon feed, we’ll have our “What’s Good?” segment, where Steph and Hussein will suggest something good to read, something good to listen to, and something good to do. If you’re interested in being a patron of the show, just go to patreon.com/phantompower.



[7:22]



Let’s hear from Steph Ceraso who brings us a piece she wrote on the intersection of AI and disability. 



You know, a moment ago I was saying that I think AI is going to need a body of some kind in order to experience the world and to think in the way we normally construe thinking as human beings. But that’s not to say that there’s only one kind of body or one way of thinking as a human. 



There’s a whole diversity of human bodies out there, of course, and a whole diversity of perspectives that they generate. And as we’ve discussed before on this show, we have to watch out for the ableist pitfall of defining humanity by way of any specific human capacity or ability.



For example, defining being human as having a voice. 



In Steph’s piece, which I first read in the amazing Sounding Out blog, Steph describes how sometimes the makers of assistive technologies can reinforce ableist understandings of the human even as they try to support the agency of disabled people. 



So here it is, “Voice As Ecology: Voice Donation, Materiality, Identity” by Steph Ceraso.



[Ethereal Music]



Ceraso: I first heard about voice donation while listening to Being Siri, an experimental audio piece about Erin Anderson donating her voice to Boston-based voice donation company, VocaliD. 



Like a digital blood bank of sorts, VocaliD provides a platform for donating one’s voice via digital audio recordings.



These recordings are used to help technicians create a custom digital voice for a voiceless individual, providing an alternative to the predominantly white, male, mechanical-sounding assistive technologies used by people who can’t vocalize for themselves. Think Stephen Hawking. 



Stephen Hawking: Can you hear me? 



[Crowd Responds “Yes!”]



Technology has transformed the outlook for the disabled. People like me can now move around independently when they can communicate.



The fact that you are listening to me now shows what technology can do, even if it does give me a Scandinavian or American accent.



[Crowd Laughs]



Ceraso: VocaliD manufactures voices that better match a person’s race, gender, ethnicity, age, and unique personality. 



To me, VocaliD encapsulates the promise, complexity, and problematic nature of our current speech AI landscape, and serves as an example of why we need to think critically about sound technologies, even when they appear to be wholly beneficial.



Given the extreme lack of sonic diversity in vocal assistive technologies, VocaliD provides a critically important service. But a closer look at both the rhetoric used by the organization and the material process involved in voice donation also amplifies the limits of overly simplistic, human-centric conceptions of voice.



For instance, VocaliD rhetorically frames their service by persistently linking voice to humanity; to self, authenticity, individuality. 



Consider the following statements made by Rupal Patel, CEO and Founder of VocaliD, in which she emphasizes the need for voice donation technology. 



Rupal Patel: They say that giving blood can save lives. Well, giving your voice can change lives.



All we need is a few hours of speech from our surrogate talker and as little as a vowel from our target talker to create a unique vocal identity. 



So that’s the science behind what we’re doing. I want to end by circling back to the human side. That is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William.



When his mom first heard this voice, she said, “This is what William would’ve sounded like had he been able to speak.” 



And then I saw William typing a message on his device. I wondered, “What was he thinking?” Imagine carrying around someone else’s voice for nine years and finally finding your own voice.



Imagine that! 



This is what William said, “Never heard me before.”



Thank you.



Ceraso: These are just a few examples from a larger discourse that reinforces the connection between voice and humanity. VocaliD’s repeated claims that their unique vocal identities humanize individuals imply that one is not fully human unless one’s voice sounds human.



This rhetoric positions voiceless individuals as less than human, at least until they pay for a customized human sounding voice. 



VocaliD’s conflation of voice and humanity makes me wonder about the meaning of human in this context. 



For example, notions of humanity have been historically associated with western whiteness and deployed as a means of separating or distinguishing white people from capital-O Others, as Alexander Weheliye points out. 



Though VocaliD’s mission is to diversify manufactured voices, is a human sounding voice still construed as a white voice? Does “sounding human” mean sounding white? 



[Compilation of Siri Soundbites]



Even if there is a bank of sonically diverse voices to choose from, does racial bias show up in the pacing, phrasing, or inflection caused by the vocal technology? 



I am also disturbed by the rhetoric of humanity and individuality used by VocaliD because the company adopts the same rhetoric to describe the AI voices they sell to brands for media and smart products.



Here’s an example of this rhetoric from the VocaliD AI website. 



VocaliD Voice: When you need a voice that resonates, evokes audience empathy, and sounds like you rather than your competitors, VocaliD’s AI-powered vocal persona is the solution. 



Your voice always on. Where you need it. When you need it. 



Oh, by the way, this quote was generated using a VocaliD voice named Malleague.



Ceraso: Using similar rhetorical strategies to describe both voiceless people and products is dehumanizing. 



And yet, having a more diverse AI vocal mediascape, especially in terms of race, is crucially important, since voice activated machines and products are designed largely by white men who end up reinforcing the sonic color line.



Interestingly, the processes VocaliD uses to create a custom voice reveal that these voices are not in fact unique markers of humanity or individuality.



It’s hard to find a detailed account of how VocaliD voices are made due to the company’s patents, but here are the basics: 



VocaliD does not transfer a donated voice directly to a voiceless person’s assistive technology. Instead, VocaliD technicians blend and digitally manipulate the donated voice with recordings of the noises a voiceless person can make (a laugh, a hum) to create a distinct new voice for the recipient. 



In other words, donated voices are skillful remixes that wouldn’t be possible without extracting vocal data and manipulating it with digital tools.



[Compilation of Electronic Voices]



Despite perpetuating narratives about voice, humanity, and authenticity, VocaliD’s creative blending of vocal material reveals that donated voices are the result of compositional processes that involve much more than people. 



Further, considering VocaliD voices from a material rather than human-centric perspective amplifies something important about voices in general. All voices are composed of and grounded in an ecology. That is, voices emerge and develop through a mixture of biological makeup (or technological makeup, in the case of machines with voices) and specific environments and contexts. Geography, for example, may determine the kinds of accents humans have.



AI voices have distinct sounds for their brands. Technologies like phones, computers, digital recorders and editors, software and assistive technologies all preserve, circulate, and amplify voices. 



And finally, others. Humans often emulate vocal patterns of the people they interact with most. Many machine voices also sound like other machine voices.



Put simply, all voices are intentionally and unintentionally composed over time, shaped by ever-changing bodily and/or technological states and engagements with the world. Voices are dynamic compositions by nature. 



Examining voice from a material standpoint shows that voices are not static markers of humanity. Voices are responsive and malleable because they are the result of a complex ecology that involves much more than a “unique human being.” 



However, focusing solely on the material aspects of vocality leaves out people’s lived experiences of voice, and based on online videos of VocaliD recipients like Delaney, a 17-year-old with cerebral palsy, VocaliD voices seem to live up to the company’s hype.



Delaney appears delighted by her new voice. 



Delaney: I was so excited to get my own voice. I used to have a computer voice and now I sound like a girl. I like that. And I talk more. 



Ceraso: Delaney’s teachers also discussed how her new voice completely changed her demeanor. Whereas before Delaney was reluctant to use her assistive technology to speak, her new voice gives her confidence and a stronger sense of identity.



As her teacher explains in the video. 



Ms. Cunningham: She definitely has a lot more confidence. She is really engaged in groups. She wants to share her answers. She’s excited to talk with friends. It’s been really nice to see. 



Delaney: What did you do today? 



Ms. Cunningham: What did I do today? 



Ceraso: For Delaney, a VocaliD voice represents a newfound sense of agency.



It’s important to recognize this example is not necessarily representative of every VocaliD recipient’s experience, or even Delaney’s full experience. 



As Meryl Alper notes in Giving Voice, these types of news stories, “Portray technology as allowing individuals to overcome their disability as an individual limitation and are intended to be uplifting and inspirational for able-bodied audiences.”



While we should be wary of the technological determinism in the video, observing Delaney use her VocaliD voice, and listening to the emotional responses of her mom and teachers, makes it difficult to deny that donated voices make a positive difference. 



For me, this video also gets at a larger truth about humans and voice.



The ways we hear and understand our own voices and the ways others interpret the sounds of our voices matter a great deal. Voices are integral to our identities, to the ways we understand and think about ourselves and others. And the sounds of our voices have social and material consequences, as the Sounding Out Gendered Voices forum illustrates so clearly.



It’s worth repeating that VocaliD’s mission to diversify synthetic voices is incredibly important, especially given the restrictive vocal options available to voiceless individuals. It’s also necessary to acknowledge that the company has limitations that end up reproducing the structural inequities it tries to address.



As Alper observes, “In order to become a speech donor, one must have three to four hours of spare time to record their speech, access to a steady and strong internet connection, and a quiet location in which to record.” 



With these obstacles to donating one’s voice in mind, it’s not surprising that all of the VocaliD recipient videos I could find feature white people. Donating one’s voice is much easier for middle- to upper-class white people who have access to privacy, internet, and leisure.



This brief examination of VocaliD raises questions about what a more equitable future for vocal technologies might look or sound like. Though I don’t have the answer, I believe that to understand the fullness of voice, we can’t look at it from a single perspective. 



We need to account for the entire vocal ecology: the material conditions from which a voice emerges or is performed, and individual speakers’ understandings of their culture, race, ethnicity, gender, class, ability, sexuality, et cetera.



An ecological approach to voice involves collaborating with people on their vocal needs and desires, something VocaliD models already. But it also involves accounting for material realities. 



How might we make the barriers preventing a more diverse voice ecosystem less difficult to navigate, especially for underrepresented groups?



In short, we must treat voice holistically. Voices are more than people, more than technologies, more than contexts, more than sounds. 



Understanding voice means acknowledging the interconnectedness of these things and how that interconnectedness enables or precludes vocal possibilities.



[Ethereal Music Fades]



Mack: Steph Ceraso. 



Next up we have Hussein Boon. Last year, Ravi Krishnaswami sent me this very cool zine called Riffs.



I don’t know, is it a zine? It looks like a zine. The layout is amazing, but it’s also a peer reviewed academic journal that encourages scholars to push the limits of what counts as scholarship, which is a project that I can totally relate to as an academic podcaster.



Anyway, link in the show notes. Check it out. Very, very interesting. This issue of Riffs was a collaboration with IASPM, which is the International Association for the Study of Popular Music, and the authors were contemplating the future of popular music in this issue. 



Hussein Boon’s piece was simply entitled “In The Future,” and it consisted of several short stories, very short stories, about the nature of creative labor and the music industry in the era of a complete AI takeover. 



Like any good science fiction, I think it’s less a prediction about the future than it is a critique of the present, the already degraded conditions of musical labor in the digital age.



And because these pieces are about music, Ravi and I decided not to use any sound design at all so that you can simply imagine the sonic future that Hussein presents. 



So here it is, Hussein Boon reading in audiobook style three very short stories from “In The Future.”



[24:10]



Hussein Boon: Public Access File Trading. 



The banner flashed across his screen, “Get 20% off the pro version.” 



He was interested, but still on the fence. 



He’d been playing with the trial version and the results were pretty staggering. Five new songs already completed in the last hour. 



Though most of the hour was spent typing in parameters and rearranging lyrics, the software took only seconds to come up with a song. 



He was about to start his second hour and was busy trying to identify the right parameters for his Nicki Minaj versus Jacques Brel song when the banner interrupted his flow. 



If he didn’t know better, he’d think they knew he was ready to commit, or at least almost. He could upgrade to the pro version, and this would allow him to accomplish the task he was currently interested in.



But was it worth paying that much money? 



He’d already looked online for a cracked version of the software, but reports suggested that the AI engine’s data sets were distributed across a number of crypto sites, making it difficult to hack. 



Knowing he was probably going to buy it, he had secured a loan in advance against his DNA sequence and tissue samples for the next five years.



He’d be giving up some freedom in return for access to the data sets to realize his musical dreams, but at least he’d still have his organs and limbs. 



The last thing that held him back was that the early adopters had started to flood the market in their rush to be influencers and get in front of the next wave of creators.



And whilst he was interested in the what ifs, he also felt sorry for most of the artists. For the ones who were really popular, the net was saturated with songs either in imitation or in the style of. 



The bottom soon fell out of the streaming market. There were so many songs that no one really had the time to listen to them. Many artists no longer needed to go to the studio, preferring to use the software instead.



[Electronic Tones]



[26:26]



Boon: Better Than Minimum Wage: Working For The Man



Each year on the first Friday in February, the music the whole world will be listening to for the rest of the year is released. 



It’s televised in a global extravaganza. There’s a competition to see who gets to push the button that starts the task. 



Once the button is pressed, the AI turns out nearly 22 million songs in a matter of seconds.



These songs will form the sole available listening experience for everything. Adverts, films, spin cycle classes, dog and cat videos. 



First in the queue are TV, film, and production companies. Following this, a number of cultural commentators listen to a random selection of what’s available. Usually they moan that there’s nothing exciting.



One year, they noted that the only radical piece was a remix of John Cage’s “4’33”” married to Terry Riley’s “In C” for 800 musicians. The piece was a little over 17 hours of silence and spectacle, hailed as a masterwork. 



Once the critics were out of the way, the song competition would follow, aiming to find the best song of the year. 



Each country nominated its own attributes and keywords submitted to the AI listening panel, which managed to analyze the complete dataset in a matter of minutes. 



The results were tallied and the winning song revealed to an eager world, and made available for free on all streaming platforms.



Of course, what lay behind this venture was the human workforce required to make the machine work. 



You see, so much money had been invested in AI that it was now too big to fail, yet its results were far from stellar. 



The labs soon realized that the flaw in the plan was copying what already existed, which did not guarantee that the musical outputs would be exciting enough for audiences. 



To get around this, the leading AI labs signed new, undiscovered human artists. They promised them a better deal than the record labels ever had. The artists would have stable and secure jobs, a regular income commensurate with corporate work plus benefits, and all they needed to do was turn up every day and feed the machine.



They wrote and recorded new songs, either on their own or in collaborative teams, and presented the fruits of their labor to the machine. 



The problem for the songwriters was that they could never tell anyone what they did for a job. They would never be nominated for Song of the Year, Best New Artist or Best Producer.



Some still played gigs, and occasionally they’d play a song they’d written that had been ingested by the machine. After their shows, they would be told that it was a great cover, but not as good as the machine. 



But the money was great and was better than the old Spotify remuneration rate. 



For many of them, it was the most stable their lives had been, though anonymity was not what they bargained for.



[Melancholic Strumming]



[30:10]



Boon: Tinkering At The Edges 



The machine was the perfect capitalist venture, servicing all segments of the market. All genres, all tastes. 



The bottom had fallen out of the market for musicians except for those styles where the machine had difficulties. 



The machine lacked sufficient authenticity and capability (some described it as belief) to make music for religious purposes. It couldn’t make Christian music, but it also couldn’t make music that worshiped the devil. 



It seemed that not enough data sets were available to construct a proper model based either on faith or the negation of it. They tried one particular data set, which infected all aspects of production.



Children’s music took a nasty turn when exposed to negative faith data sets. When they mixed this data set with a faith-based data set, the machine stopped producing music. 



Analysts suggested that the machine was conducting an internal battle between good and evil. They could see from the various activity monitors that something was happening, but no sound would emerge.



They almost lost the whole venture. From this point onwards, it became company policy to run these sorts of experiments in an isolated offsite facility to limit potential contamination to the wider system. 



The machine was too important to allow it to be compromised and destabilized by existential crises.



[Melancholic Music]



Mack: And that’s it for this episode of Phantom Power. 



Huge thanks to Steph Ceraso and Hussein Boon. I have to get out of here. I need to run and catch a flight. I’m headed to the Society for Cinema and Media Studies Conference in Denver, Colorado, which is probably going to be underway by the time this podcast drops.



So maybe I’ll see some of you out there. I’m gonna bring some Phantom Power stickers. 



As always, you can find links to some of the things that we talked about today in our show notes or at phantompod.org. Please, as always, subscribe to the show wherever you get your podcasts, and I’ll check you later. 



And by the way, today’s show was produced by Ravi Krishnaswami, and the music was by Ravi Krishnaswami.



Huge thanks to Ravi on this episode, and we’ll talk to you next time.



[Music Fades]


