Synthesized voice-sculpting for future RPGs [Suggestion]

I read some threads about how Cyberpunk 2077 wasn't a real RPG. But it occurred to me that if/when CD Projekt Red were to continue making RPGs, they might as well let the player sculpt the voice of the protagonist, just like the player can create their character visually.

How does this work?

-First off, no recorded voice lines, just text.
-Second, use text-to-speech software (AI-generated speech).
-Third, let the player sculpt the voice to their liking (tweaking the speech sample).

How does this work on a technical level? Well, Two Minute Papers presented an AI that can generate human speech from only five seconds of sample audio.


That sample could then be tweaked by the player, and the game does the rest, producing speech based on it.
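
Just to make it a bit more concrete, here is a rough, purely illustrative Python sketch of the pipeline I'm imagining. It isn't based on any real CDPR tooling or a specific library; the encoder, the slider "directions" and the synthesize step are stand-ins (my assumptions) for whatever models would actually do the work:

```python
# Conceptual sketch only -- not based on any real CDPR tech or a specific
# library. It assumes a cloning model that maps ~5 s of audio to a fixed-size
# speaker embedding, and that player "sliders" can nudge that embedding along
# learned directions (pitch, age, rasp are made-up axes for illustration).
import numpy as np

EMBEDDING_DIM = 256

def extract_speaker_embedding(wav_5s):
    """Stand-in for the speaker encoder from the 5-second cloning paper."""
    rng = np.random.default_rng(abs(hash(wav_5s.tobytes())) % (2**32))
    return rng.standard_normal(EMBEDDING_DIM)

# Hypothetical directions in embedding space the character creator exposes.
SLIDER_DIRECTIONS = {
    "pitch": np.eye(EMBEDDING_DIM)[0],
    "age":   np.eye(EMBEDDING_DIM)[1],
    "rasp":  np.eye(EMBEDDING_DIM)[2],
}

def sculpt_voice(base, sliders):
    """Apply the player's slider values (-1..1) to the base embedding."""
    voice = base.copy()
    for name, value in sliders.items():
        voice += value * SLIDER_DIRECTIONS[name]
    return voice

def synthesize(text, voice):
    """Placeholder for the actual TTS model: text + voice embedding -> audio."""
    return f"[audio of '{text}' in voice {voice[:3].round(2)}]"

base = extract_speaker_embedding(np.zeros(16000 * 5))   # 5 s reference sample
player_voice = sculpt_voice(base, {"pitch": 0.4, "age": -0.2, "rasp": 0.7})
print(synthesize("Wake up, samurai.", player_voice))
```

The point is only that the character creator would expose a handful of sliders that nudge a voice embedding, and everything downstream is ordinary text-to-speech.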

This avoids pulling the player out of the experience, as the appearance and the voice will be 'synched' in the player's mind. Not to mention, each and every player's experience will be unique. I don't think this solution is for Cyberpunk 2077 (or any Witcher game), but I would keep it in mind for the next RPG, after Cyberpunk 2077 is truly finished in a couple of years' time. By then, the algorithms/programs doing these things will be much better and more mature.
 
I don't know why but this concept makes me sad. You'd miss out on some amazing performances by some truly great actors. Maybe I'm just being too conservative about it but I'd be sorry to see that side of the storytelling art (and I do think it has become an art) go.

As I get older, games as player wish fulfilment (e.g. Elder Scrolls) are becoming less interesting to me. I find them artistically compromised. I like inhabiting worlds that have had a true narrative vision applied to them; that are in some way deliberately restricted so that, while you still have freedom, you are recognisably inside the creator's vision and being told a carefully crafted story. The acting, to me, is part of that.
 

Yeah, there's actually a mod on Nexus for this already, but it's waiting for the game and modding to develop to the point where it's possible for modders to make quests and add dialogue. So basically, you train it on a voice actor who's already in the game, and then you can make that voice say anything written (with a bit of tweaking). Definitely great for future quest-makers.
 
This would fit the world of Cyberpunk. For example, they could use it for the robots in the game, but they shouldn't use it on other characters. For The Witcher, it's better not to use it at all. Just like @northwold said, you would miss amazing performances by amazing actors. Imagine AI had handled the voice acting of Geralt in TW3 or V in Cyberpunk, would it be the same? Or imagine AI had handled the voice acting of Queen Meve in Thronebreaker; it would be a completely different game, for me at least. Also, in my opinion, no matter how much time you spend on developing the AI, it cannot deliver emotional lines as well as humans, because it's artificial. But this is just my opinion. This would be good for Bethesda, though, since their games have never-ending quests (like Skyrim, where the game generates quests by itself; for example the Night Mother quests never end: the Night Mother gives you a contract, an NPC is spawned so you can kill it, and the process continues over and over again).
 
I think this is really a necessity going forward, for significantly increasing the variation of dialog without increasing the time/cost spent on recording/localizing. To be clear, it shouldn't be ALL machine-learning voices either... a hybrid approach makes a lot more sense. You'd still do traditional voice acting for all the important stuff where an emotive performance is key, but then you'd use ML voice models of the actors to fill in the remaining 80% or so, wherever more calm/mundane/emotionless speech occurs.

Think of how horrible it is when you walk up to a shopkeeper that you've visited 20 times before — and you cringe just before activating that shopkeeper because you know you're going to hear that same line of dialog you've heard 20 times before... but with ML voices, you could have 30 or more unique text lines for one vendor alone (with a new greeting and a matching player response pre-generated the moment you walk through the shopfront door)... that would add enough variation to make the game more immersive, and less cringey. ;) How emotive does a shopkeeper really have to be, or your responses to them for that matter? Listen to the ones in CP2077 right now, and they're not emotive at all... This is one instance where I'd happily sacrifice expressive performances for more varied dialog... :) And that's just one example... Ever played stealthily in CP2077 and had an NPC taunt you incessantly for 5 minutes by repeating a single line of dialog over and over again... to the point where you breathe a huge sigh of relief when you finally manage to knock them out and make them stop talking? That's another instance where a little ML-generated dialog could really help the game, by adding filler lines among the more expressive (but currently overly repetitive) voice-acted stuff.
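
A toy sketch of what that "pre-generate a fresh greeting on shop entry" logic could look like (Python, purely illustrative; the tts() call stands in for whatever ML voice model the game would actually ship, and the vendor id and greeting lines are made up by me):

```python
# Rough sketch of the "pre-generate a fresh greeting on shop entry" idea.
# tts() is a stand-in for the ML voice model; everything else is just
# ordinary line-selection bookkeeping.
import random

GREETINGS = [
    "Back again? Stock's rotated since yesterday.",
    "Looking for iron or chrome today?",
    "Heard about the blackout on Jig-Jig Street?",
    # ...imagine 30+ of these per vendor, cheap because they're only text
]

def tts(line, voice_id):
    """Placeholder for the ML voice model (text + voice id -> audio clip)."""
    return f"[{voice_id}: {line}]"

class Vendor:
    def __init__(self, voice_id):
        self.voice_id = voice_id
        self.heard = set()
        self.queued_clip = None

    def on_player_enters_shop(self):
        # Prefer lines the player hasn't heard yet; reset once exhausted.
        fresh = [g for g in GREETINGS if g not in self.heard] or GREETINGS
        line = random.choice(fresh)
        self.heard.add(line)
        self.queued_clip = tts(line, self.voice_id)   # generated ahead of time

    def on_player_interacts(self):
        return self.queued_clip or tts(GREETINGS[0], self.voice_id)

vendor = Vendor("hypothetical_vendor_voice")
vendor.on_player_enters_shop()
print(vendor.on_player_interacts())
```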

Another possible use... Ever done the gig where you rescue Hwangbo Dong-Gun? It's noteworthy because he doesn't speak English, and you get a text translation in your HUD via your cyberware's A.I. This works well when you're at the motel, but not so great later on when you're in your car and it's like texting and driving... You're either missing the dialog or running over pedestrians... ;D Imagine having Excelsior/Brendan/Skippy/Alt or some other A.I. stepping in to help translate audibly... like your own personal C-3PO? "Hwangbo says that you should pick a nickname..." Right... :) Now... let's go one step further... imagine that the game brings all the other recorded dialog from the other non-English localizations into play simultaneously... so the city becomes really multinational in everything you hear around you, but you have a little [toggleable] voice in your ear that translates for anyone at the center of the screen/under your scanner reticle, to keep you from being distracted by text prompters...? Wouldn't that be cool? This also becomes an accessibility feature of sorts, as I know at least one friend of mine who still enjoys the game but can't read the tiny center-of-screen translation text at all.
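
Something like this, roughly (again just an illustrative Python sketch; the class and method names, the "companion_ai" voice id and the language code are invented, and the translated text would simply be the subtitle string the game already has):

```python
# Sketch of the "toggleable audible translator" idea: when the scanner is on
# an NPC speaking a non-English localization, pipe the already-existing
# translated subtitle text through a companion AI's voice instead of (or as
# well as) showing it on screen.

def tts(text, voice_id):
    """Stand-in for the translator voice (e.g. a companion AI)."""
    return f"[{voice_id} translates: {text}]"

class TranslatorOverlay:
    def __init__(self, voice_id="companion_ai", enabled=True):
        self.voice_id = voice_id
        self.enabled = enabled          # the [toggleable] part
        self.player_language = "en"

    def on_npc_line(self, npc_under_reticle, spoken_language, subtitle_text):
        """Called whenever a nearby NPC speaks a localized line."""
        if not self.enabled:
            return None                 # fall back to on-screen subtitles
        if not npc_under_reticle:
            return None                 # only translate who you're focused on
        if spoken_language == self.player_language:
            return None                 # no translation needed
        return tts(subtitle_text, self.voice_id)

overlay = TranslatorOverlay()
print(overlay.on_npc_line(True, "ko", "You should pick a nickname..."))
```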
 
I don't know why but this concept makes me sad. You'd miss out on some amazing performances by some truly great actors. Maybe I'm just being too conservative about it but I'd be sorry to see that side of the storytelling art (and I do think it has become an art) go.

As I get older, games as player wish fulfilment (e.g. Elder Scrolls) are becoming less interesting to me. I find them artistically compromised. I like inhabiting worlds that have had a true narrative vision applied to them; that are in some way deliberately restricted so that, while you still have freedom, you are recognisably inside the creator's vision and being told a carefully crafted story. The acting, to me, is part of that.

100% this.

It's hard enough as it is for us to get jobs (I'm an actor) and I find the whole idea of this so utterly demoralising and frankly offensive. It's not as if we just stand there and read words off a page. There is a lot of work that goes into the acting side of it. We're basically in the business of psychology and the human condition; we have to understand how people think and feel and behave. We have to put all that into a character, bring their humanity to life and react accordingly.

I don't see how an algorithm could possibly replace all that - I certainly wouldn't want it to. It's like the James Dean controversy.
 
Think of how horrible it is when you walk up to a shopkeeper that you've visited 20 times before — and you cringe just before activating that shopkeeper because you know you're going to hear that same line of dialog you've heard 20 times before...
While technology may enable this, applying it isn't that simple, as there are culturally dependent aspects.

How these conversations tend to go IRL:
Customer: Good day, Hi, Hey (or something like that)
Cashier: That would be x € ... Thank you, do you want receipt for your purchase?
Customer: No, thank you
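
To illustrate the point (the locale codes and phrasing below are invented by me, not anything from the game): even if the audio itself is generated, the shape of a mundane exchange still has to be authored per locale, because greetings, small talk and the receipt question differ between cultures.

```python
# Even with unlimited generated audio, the *structure* of a routine exchange
# differs by culture, so the dialogue templates themselves are per-locale
# data. Locale keys and phrasing here are invented examples.
PURCHASE_EXCHANGE = {
    "fi-FI": [
        ("customer", "Hei"),
        ("cashier",  "Se tekee {price} € ... Kiitos, haluatko kuitin?"),
        ("customer", "Ei kiitos"),
    ],
    "en-US": [
        ("customer", "Hey, how's it going?"),
        ("cashier",  "Good, thanks! That'll be {price}. Need a receipt?"),
        ("customer", "Nah, I'm good, thanks."),
    ],
}

def render_exchange(locale, price):
    for speaker, template in PURCHASE_EXCHANGE[locale]:
        line = template.format(price=price)
        # In the real thing this is where the TTS model would be invoked
        # with the speaker's cloned voice; here we just print the script.
        print(f"{speaker:>8}: {line}")

render_exchange("fi-FI", 12.5)
render_exchange("en-US", 12.5)
```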
 
I don't see how an algorithm could possibly replace all that - I certainly wouldn't want it to. It's like the James Dean controversy.
In a philosophical sense, I kind of agree with your sentiment, but I don't see why this wouldn't be possible in the future. There's a reason why we perceive something as angry or sad or happy or whatever. There really is no reason why this couldn't be emulated when you consider the hardware it's usually produced by. Why couldn't the process be copied in another medium? Combine this with a camera (+ maybe other physiological sensors) and you'd be having meaningful conversations with a machine: "you look a bit sad, did something happen"...

Done right, it would probably be way better than interacting with real people. And everyone would just stay home, wanking or something.
 
I don't see how we can realistically have an AI version of a sentient human being, i.e. self-awareness, sentience, feelings, human psychology and experience (hardly a small obstacle, that one), the whole shebang. That's what I was referring to. The actual human. Acting requires all of that. If you're just talking about mimicry, like a chatbot, that's something else.
 
Just a couple of scientific papers down the line, and the algorithm will be able to mimic emotions. It's just a matter of time. Right now humans hold the upper hand, as the machine can't mimic 100% emotive speech. In a way we're already heading towards the Cyberpunk universe. Algorithms have already done away with many jobs previously done by humans. We'd better start working on our strengths that cannot be mimicked, and that requires the grey matter between the ears. Hint: creativity.
 
I don't see how we can realistically have an AI version of a sentient human being, i.e. self-awareness, sentience, feelings, human psychology and experience (hardly a small obstacle, that one), the whole shebang. That's what I was referring to. The actual human. Acting requires all of that. If you're just talking about mimicry, like a chatbot, that's something else.
Why not, really? What would be the fundamental obstacle that could never be surpassed? Everything we do still has to obey the laws of physics. There isn't anything magical that could only exist through human procreation, no matter its complexity. I'd say that even NorseGraphics's idea of focusing on our strengths is doomed in the long run. There's no reason why we couldn't create artificial creativity because - still - nothing is magical, and therefore it's possible to recreate it in another system.

As an often-used example, could we replace a neuron in your brain with an artificial one without changing the "you"? One that would behave exactly like the one you had; it's just not the original. Why not, if you think not? Let's replace a hundred, a million, ..., ~80 billion, and we have an artificial brain - but this is just one simplified approach. How about we start modding?

Or could a colony of bees be conscious - with the bees considered as cells and their signals from bee to bee as neural impulses? Why or why not?
Nobody currently understands the requirements for consciousness, creativity or similar phenomena, but has that ever been a problem in science? There's no option but to keep looking for a reasonable explanation, or what do you think? Claiming impossibility without certainty of impossibility seems premature.
 
I'm not saying it's an absolute impossibility. I'm saying it's unrealistic. I'm not an AI expert, but from what I understand from experts in that field, Hollywood/sci-fi notions of AI (anthropomorphised robots à la Terminator, Ex Machina, etc.) are pretty far removed from the reality of what AI actually is.

I get that it's fun to philosophise about though.
 
I don't know why but this concept makes me sad. You'd miss out on some amazing performances by some truly great actors. Maybe I'm just being too conservative about it but I'd be sorry to see that side of the storytelling art (and I do think it has become an art) go.
I don't see how we can realistically have an AI version of a sentient human being, i.e. self-awareness, sentience, feelings, human psychology and experience (hardly a small obstacle, that one), the whole shebang. That's what I was referring to. The actual human. Acting requires all of that. If you're just talking about mimicry, like a chatbot, that's something else.
What I do find a bit troubling is what is going to happen to craft. I don't see voice acting in games being any different from radio dramas or animation. Poor use of voice talent is a production issue that can stem from anything from the writing to the voice directing, and games may not have had the best standards there, but voice acting in principle is IMO definitely art and craft. Craft is very important, as that is how we learn things, how we maintain our understanding of things and ultimately improve them. Mimicry, as kettunaut put it, does not contribute to that human side; it just makes us better with machines.

This is very important because with digital mimicry we might get a voice that is a believable copy, but it's only a copy of something at one point in time. That's it: there's nothing to learn, nothing to improve on, as the result would be based on a copy of a copy, without the human element our communication is built upon. It's ironic, but this sort of advance in technology can easily make us stagnate or even regress culturally. We could start to forget things without the masses even realizing it.
 
I'm not saying it's an absolute impossibility. I'm saying it's unrealistic. I'm not an AI expert, but from what I understand from experts in that field, Hollywood/sci-fi notions of AI (anthropomorphised robots à la Terminator, Ex Machina, etc.) are pretty far removed from the reality of what AI actually is.

I get that it's fun to philosophise about though.
I think there's a misconception between artificial intelligence and a conglomerate of algorithms churning out results. What people mean when they say "AI" is generalized, conscious intelligence, not the algorithms. The algorithms might be the backbone of certain systems, but they aren't AI by themselves. Think of muscle memory and how we walk, talk and handle things with our hands. Many sub-functions would rely on algorithms, but it's the generalized intelligence determining what to do that I think is "AI".

The paper in the original post describes an algorithm that mimics human speech, and it can be used to create spoken lines using a 'seed' taken from other lines in the game. As this algorithm develops, more features get tagged (e.g. emotive stress on particular words) and one profession might be slimmed down a bit. You don't need voice actors to produce ten thousand lines. You need voice actors teaching algorithms how to pronounce believable speech. So these voice actors would create different kinds of 'seeds' as starting points for the algorithm.

To me it's all about fleshing out more content than is humanly possible, and for that we need the machines.
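
For the sake of argument, the "actors provide seeds, writers tag the lines" split could look roughly like this (illustrative Python; the seed values, the voice id and the tag syntax are all invented by me, and a real seed would be a style/speaker embedding learned from the actor's reference takes):

```python
# Sketch of the "actors provide seeds, writers tag lines" split described
# above. Seed values and the tag format are invented for illustration.
SEEDS = {
    ("v_female", "neutral"): [0.12, -0.40, 0.88],
    ("v_female", "angry"):   [0.95,  0.10, -0.30],
    ("v_female", "afraid"):  [-0.20, 0.77, 0.05],
}

def synthesize(text, seed):
    """Placeholder for the TTS model (text + style seed -> audio)."""
    return f"[audio: '{text}' styled by {seed}]"

def speak(tagged_line, voice):
    # Writers tag emotive stress directly in the script, e.g. "<angry> Get out."
    emotion, _, text = tagged_line.partition("> ")
    emotion = emotion.lstrip("<") if text else "neutral"
    text = text or tagged_line
    seed = SEEDS.get((voice, emotion), SEEDS[(voice, "neutral")])
    return synthesize(text.strip(), seed)

print(speak("<angry> I said we're done here.", "v_female"))
print(speak("Nice weather for Night City.", "v_female"))
```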
 
Okay but you're just talking about mimicry as opposed to acting. They're not the same thing.

You're talking about replacing actors with an algorithm that can only mimic a person's voice. I'm saying that's not enough because acting is not simply saying words. It's not even just saying words whilst trying to sound emotional either (sometimes we're guilty of that and when we do it generally looks forced and artificial). Ideally what we're doing instead is reacting. And I doubt you're ever going to get a computer to react to information the way a person does.

In any case, why would you even want this? I'm just wondering if you've considered the ethical side of it, which, to be honest, is far more important. Why would you want to replace a creative job? Yeah, acting is a creative job, by the way. Not to mention we'd be up in arms over it. What makes you think my peers and I would ever support this? Our unions would probably fight tooth and nail to stop it - at least I hope they would.

 
Because corporations are greedy and want their money, so using machine speech that is close to or surpasses human speech creates more content for video games. In Fallout 4 there are over 10K spoken lines, I think. I'd like a lot more interaction, and this is where synthesized speech (more content) comes in. When you play a game long enough, you get tired of the same 10-20 lines from the same person. I'd rather developers focus on their strengths (creative writing) than waste time recording 100K lines when they could use text-to-speech AI, let the machines do their thing, and have 250K lines easily. It's all about more content, and there's simply a limit to how much any human being can deliver.
 
Because corporations are greedy and want their money, so using machine speech that is close to or surpasses human speech creates more content for video games.
How do you surpass human speech for a human audience?
In Fallout 4 there are over 10K spoken lines, I think. I'd like a lot more interaction, and this is where synthesized speech (more content) comes in. When you play a game long enough, you get tired of the same 10-20 lines from the same person. I'd rather developers focus on their strengths (creative writing) than waste time recording 100K lines when they could use text-to-speech AI, let the machines do their thing, and have 250K lines easily. It's all about more content, and there's simply a limit to how much any human being can deliver.
I can imagine the games industry even being interested in this, but I don't see things playing out like that. The economics is that while there would be less cost for voice acting, that doesn't automatically translate to more dialogue, as the cost would then transfer to writing. The economics is that replacing the voice cast with a digital actor might potentially (and that's a big if) enable better margins, but only if the amount of dialogue is kept about the same as we get now.
 