Why CD Projekt used AI to localise Cyberpunk 2077
Canadian facial animation firm Jali Research's Pif Edwards tells us how its tech contributed to the blockbuster RPG
There are a lot of big names associated with the upcoming Cyberpunk 2077.
The eagerly anticipated sci-fi RPG is being made by Polish studio CD Projekt of The Witcher 3: Wild Hunt fame. The Matrix and John Wick star Keanu Reeves performs the role of Johnny Silverhand, while punk legends Refused and pop sensation Grimes are among the musical acts on the soundtrack.
One name you'd be forgiven for not being familiar with is Jali Research, a facial animation company based in Toronto, Canada, that has helped CD Projekt with the localisation of Cyberpunk 2077 - in a manner of speaking.
The outfit emerged out of the University of Toronto, founded by PhD student Pif Edwards, along with Academy Award-winning animator and director Chris Landreth, as well as professors Eugene Fiume and Karan Singh.
Edwards was doing a PhD in Computer Science, initially wanting to focus on facial animation, but ended up looking at speech because it "turns out when people are expressing, they're almost always talking." Unhappy with the tools for handling speech and animation that were available at the time, he decided to build his own.
"The shape your mouth makes for specific letters or sounds isn't a direct one-to-one thing. You can't say: 'Oh, it's an 'en' sound, so it looks like this'"
CD Projekt turned to Jali after reading a paper on procedural speech animation that the Canadian outfit had presented at the annual computer graphics conference SIGGRAPH in 2016.
For 2015's The Witcher 3, CD Projekt used algorithms to handle facial animation for eight different language voiceovers. This was successful up to a point, but for Cyberpunk 2077, the Polish firm had loftier goals; it wanted to do lip-syncing for ten languages: English, German, Spanish, French, Italian, Polish, Brazilian Portuguese, Russian, Mandarin and Japanese.
For Cyberpunk 2077, CD Projekt and Jali used a combination of machine learning and rule-based artificial intelligence. The former is used for what Jali calls the 'alignment' phase, which figures out which sounds are actually being made when someone speaks, and when.
"Let's say we have an audio file of someone saying 'Hello'," Jali co-founder and CTO Pif Edwards explains.
"Where does the 'H' start and stop? Then where are the 'e', 'l' and 'o' sounds? We mark that information up for a specific language, then train a machine-learning process using this data to recognise what sounds are being made.
"After a while, you can give it a brand new line of dialogue that it has never seen before and it will predict where the boundaries between sounds are and how long each of these phonemes is."
After that, Jali moves to its second phase: animation. Here the company uses old-fashioned rule-based AI to determine which facial movements correspond to the sounds being made. This is a simpler 'if this, then that' system, which does exactly what it is told in response to specific inputs.
"The rule-based methodology is what we use to figure out what mouth shape needs to be generated given what sounds are being made," Edwards says. "For example, 'dude' looks just like 'you', but they're completely different words. The core articulation - the shape your mouth is making - is actually anticipating what's coming up or remembering where it was.
"We have to train each machine learning process specifically for each language, but the animation component is identical"
"The shape your mouth makes for specific letters or sounds isn't a direct one-to-one thing. You can't say: 'Oh, it's an 'en' sound, so it looks like this.' If there's an 'e' afterwards, then it could be a 'ni' or 'noo' sound. The shape of the 'n' noise is what letters are around it, not necessarily the shape of the sound that has been made. Then there's something like 's' where you have your teeth close enough for there to be friction.
"There are all these different things that we know about speech. There are rules that, no matter what is being articulated, whatever the different aspects of linguistics are, we know what facial expression is required."
The beauty of this combination of techniques is that, because humans make the same expressions for the same sounds regardless of language, once a machine learning model has been trained on each language's audio, the same rule-based AI can drive the animation for all of them.
"We have to train each machine learning process specifically for each language, but the animation component is identical," Edwards says. "We don't have a specific animation model for Japanese, we only have the language model. The general principles of what someone's mouth does when they speak are not language-specific.
"To my surprise, the general principles of linguistics hold across all languages. It's hard though. The reason that people don't want to do rule-based work like this is that you have to know the rules. That takes a long time."
This process can save a huge amount of time, too. On average, it's estimated to take an animator seven hours to make a character say just a single minute of in-game speech. You can do the maths for yourself: hand-animating an RPG that not only boasts huge amounts of dialogue but also supports lip-syncing in ten different languages would require a ridiculous number of person-hours.
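For a rough sense of scale, here's that maths worked through in Python, using the seven-hours-per-minute figure above. The 50 hours of dialogue is a made-up illustration, not a Cyberpunk 2077 number.

```python
# Back-of-the-envelope maths using the seven-hours-per-minute estimate.
# The 50 hours of dialogue is a made-up figure for illustration only.

hours_per_minute_of_speech = 7
dialogue_minutes = 50 * 60   # assume 50 hours of recorded dialogue
languages = 10

total_hours = hours_per_minute_of_speech * dialogue_minutes * languages
print(f"{total_hours:,} animator-hours")            # 210,000 hours
print(f"~{total_hours / 2000:,.0f} person-years")   # at ~2,000 working hours a year
```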
The net result is an approach to localisation that treats more languages around the world as "first-class citizens." A lot of the time, a game will ship with lip-syncing designed for one language - generally English, let's be honest - and this version of the game will then be localised for other languages, usually in the form of new audio dubs.
"When you play, someone speaking Mandarin actually looks like they're speaking Mandarin. It's not just the mouth; it's the forehead, the eyes, when blinks happen"
A lot of hard work goes into this process, but the result can still be pretty clunky, as the translation has to work around mouth movements made for a different language, or be crammed into the same space as the original audio.
Language isn't just about words, either. Facial expressions - how someone actually looks while saying something - are a huge part of communication.
"Let's say there's a line of dialogue that you want to translate from English into French," Edwards explains. "It might end up being something much longer than the original line. But what a lot of games end up doing is just scaling that animation. They can look pretty dumb, but that's what the studio has had to do because they can't re-do the lip-syncing.
"It's also the facial animation, too. With Jali, everything matches up. Now when you play the game, someone speaking Mandarin actually looks like they're speaking Mandarin. It's not just the mouth; it's the forehead, the eyes, when the blinks happen, when their neck moves, how their face moves. It's all going to, like be the same engine that does it in English."
CD Projekt might be best-known today for its RPGs, but the firm actually started out by localising games for its native Poland.
In the post-communist country, most people were happy to pirate games, in part because the companies that made or published them weren't putting any effort into translating their releases into Polish. CD Projekt found that by putting more effort into translation and localisation - by creating something that people actually felt was worth buying - gamers in the country were happy to spend their hard-earned cash on these products.
That philosophy seems to have carried through to the modern day. In the summer, CD Projekt's director of PR and marketing for China, Darren Ding, said in a since-removed LinkedIn post that China was the most popular region for Cyberpunk 2077 pre-orders. That could partly be down to the sheer size of the market, but it's also no surprise to see customers come out in support of a game that has been subtitled in Simplified Chinese and dubbed in Mandarin, with Jali's lip-syncing magic applied to the Mandarin voiceover.
All of which is to say that when it comes to localisation, if you put in the effort, your customers will reward you.
"I was talking to a Russian colleague of mine about what we're doing," Edwards says. "He's a huge fan of The Witcher 3, but he only ever played it in English. They natively speak Russian, but he plays in English because that's the version of the game that received the most attention. He was so so excited because when he plays Cyberpunk 2077, he'll have the same experience that someone speaking English would have."
He concludes: "This is a way for people to be further involved in the game. It helps suspension of disbelief so much that they can really get into the story."