Changing the face of animation
Image Metrics' Nick Perrett on the challenges of creating engaging synthetic characters
As technology has allowed videogames to become more and more realistic, animation techniques have also evolved to bring characters to life. In just the past 15 years, developers have moved from keyframing to motion capture and now to performance-driven facial animation.
One of the companies providing such animation is Image Metrics. You may not know the name, but you should know some of the titles they've worked on, such as Grand Theft Auto IV, Bully, Metal Gear Solid 3: Snake Eater, The Getaway: Black Monday, Devil May Cry 4 and SOCOM: US Navy SEALs Combined Assault.
Nick Perrett, the company's VP of business development, recently spoke to GamesIndustry.biz about how the facial animation business has evolved, what challenges they currently face, and what the future may hold.
I actually joined Image Metrics from the company that was providing it with its seed financing. So this was back in 2001. The company was actually formed a year before that by Gareth Edwards and Kevin Walker, and subsequently Alan Brett.
When we first started out, the company got funded because of some content security products it had. It was mainly security industry stuff. It is a very, very different marketplace, but we were doing things like providing technology to a company that was scanning e-mails for pornographic content.
For the first three years of the company's life, we didn't actually do anything in animation at all. We had a number of different application areas, as wide-ranging as a simulation system for training golfers - that was one of the first applications that got built - and we also had a number of things in CCTV analysis and facial recognition.
Around 2003 was when the company started getting more traction in two specific areas: medical image analysis and animation.
If you think about it, at the highest level we've got a technology that can understand image content. On one side of the world we are using it to understand internal anatomy for medical studies, and on the other side of the business we are using it to understand external anatomy - facial features - so that we can create animation data rather than data about whether someone's got a disease or not.
On the animation side, I suppose the person I would credit with Image Metrics entering the videogames business was a chap called Peter Edward who was at the time the producer of The Getaway franchise for Sony London and is now in charge of Sony Home.
When I joined the company, I was actually the marketing director. At that point in time, we had a technology that was capable of doing a million and one tasks to do with analyzing images.
What I was essentially doing was trying to segment out the universe of places where images and video get captured today - from industrial inspections through security applications through medical imaging through animation and entertainment. And what we did was essentially go out and talk to the key players in each industry about what Image Metrics may or may not be able to do for them.
I used to have this massive suitcase - the biggest possible suitcase you can imagine, with a great big Intel dual Xeon server in it. And I used to carry another bag with a big laptop in it with a screen. And I used to have this bicycle helmet with a huge steel rod and a camera at the end of it. So I would turn up at Sony's office and plug all this hardware in and then stick this camera thing on my head!
We basically were able to demonstrate live right there in front of them that I could have various points on my face tracked and that could actually drive... At the time, it was using Kaydara's MotionBuilder - it was actually called something else then, but it was their suite where you could pump all the real-time data in and drive a fairly crude bone-based rig in real time.
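To make that idea concrete, here is a minimal sketch of what "tracked points driving a bone-based rig" can mean in practice. Everything in it is illustrative - the landmark names, the scaling constants and the rig channels are assumptions for the sake of the example, not Image Metrics' pipeline or the MotionBuilder API.

```python
# Illustrative only: map a few tracked 2D facial landmarks to crude rig
# channels (jaw open, brow raise), the way an early bone-based face rig
# might be driven in real time. All names and scale factors are hypothetical.

# Neutral-pose landmark positions (pixel coordinates from the head-mounted camera).
NEUTRAL = {
    "jaw":       (320, 420),
    "brow_left": (280, 180),
}

def solve_channels(tracked: dict) -> dict:
    """Convert per-frame landmark offsets into normalised rig channel values (0..1)."""
    channels = {}

    # Jaw: vertical displacement of the chin landmark, scaled and clamped to 0..1.
    jaw_dy = tracked["jaw"][1] - NEUTRAL["jaw"][1]
    channels["jaw_open"] = max(0.0, min(1.0, jaw_dy / 60.0))

    # Brow: upward displacement of the left brow landmark, scaled and clamped to 0..1.
    brow_dy = NEUTRAL["brow_left"][1] - tracked["brow_left"][1]
    channels["brow_raise"] = max(0.0, min(1.0, brow_dy / 25.0))

    return channels

# One frame of tracked data: mouth opened, brow lifted slightly.
frame = {"jaw": (322, 455), "brow_left": (281, 172)}
print(solve_channels(frame))   # -> jaw_open ~= 0.58, brow_raise = 0.32
```

In a real-time demo of this sort, those channel values would be streamed into the animation package every frame to pose the face rig live.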
So, it was actually a pro-active thing going and finding companies like Sony who may have an interest in the technology. Once we had done that one demo for them, they were like "Well, that's all very well and good, but how do we get any useful data out of what you do to create a game with it?"
They should be given a lot of credit for helping the company to figure out..."Okay, so you've got this raw capability, but how do I get your data into Maya so that I can actually drive a character in a useful way?"
That was how we started out in games. Gareth Edwards, who is the main technical founder of Image Metrics, and I went down to Sony Europe and he spent a lot of time with a chap called Dave Smith - who's now a lead character guy at LucasArts in San Francisco - trying to figure out how we could put Image Metrics' data into a product like Maya.
We actually did that and ended up producing about 90 minutes of animation for The Getaway. A lot of people don't know this, because at the time we had no PR and we weren't pushing Sony to do any PR, but that was actually the first time that anyone had done a simultaneous full performance capture.
A lot of the press about The Polar Express was like this was the first time that they had captured a full facial performance and a full body performance at the same time. Timeline-wise, The Getaway: Black Monday was actually done before The Polar Express started because we worked on both of those products in different ways. It was a very pioneering piece of work that they did and it had all sorts of technical hurdles to get over and I think it was quite an achievement. That's kind of what keyed us up.
That was the first of probably half a dozen or more products that we've done for Sony that went from 2003 through now. We are working on some at the moment which I can't talk about. The last one we did for them was Lair, which was obviously on the PS3 platform so there were some new challenges working on that platform.
A lot of the stuff we've done for Sony since has actually been pre-rendered, which isn't true of virtually any of our other clients. But we have started to do some things with them on animation that is streamed through the engine. That relationship has continued to develop for us.
We went on to do lots more work with many different publishers over that period on probably around 15 or so PS2 and Xbox products before the transition happened. We did work with some of the Japanese studios as well, such as Konami on Metal Gear.
This generation, we've started to do things differently because there have been huge changes in what the actual consoles can deal with in terms of facial animation in the last couple of years.
By way of example, we were working with characters on most typical Xbox or PS2 titles that had somewhere between 400-600 polygons in the face itself. And the texture resolution was something like 100K.
In that realm, you couldn't do very much with the faces, really. You would get these triangular-looking shapes as a result of the way the meshes were laid out and the fact that there weren't enough polygons to give you any sort of decent deformation. Things like subtle grimaces and skin wrinkling that you see in some of the games now certainly wouldn't have been remotely possible on those platforms.
The characters that we are working with at the moment are anything from five to ten times the resolution of those PS2 characters. And that's resolution in many different ways: in the size of the texture, which details a lot of the look and feel of the skin; in the polygon count, which influences the smoothness of a lot of the shapes you can create; and in the variety of expressions you are actually able to generate in a convincing way.
Also, the underlying structure of these heads has become a lot more complicated. Whereas on a PS2 game we might be using a character with 10-20 bones in the face, we've worked with characters that have over 100 bones in the face. We've also worked with characters that have a similar number of bones to a PS2 character but use 50-70 blend shapes, which determine a lot of very subtle expressions that you were never able to do on a PS2 title.
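The bones-versus-blend-shapes distinction can be made concrete: a blend shape is essentially a per-vertex offset sculpted from the neutral mesh, and the posed face is the neutral mesh plus a weighted sum of those offsets. Below is a minimal numpy sketch of that idea; the vertex count and shape names are hypothetical, and a production rig along the lines described above would carry far more shapes.

```python
import numpy as np

# A blend-shape face: posed vertices = neutral mesh + weighted sum of per-vertex deltas.
# Sizes here are illustrative; the interview cites roughly 50-70 shapes on a
# current-gen character rather than the two shown.

num_vertices = 3000
neutral = np.zeros((num_vertices, 3))            # neutral-pose vertex positions

# Each blend shape stores a per-vertex offset authored by an artist
# (random data here, purely as a stand-in).
blend_shapes = {
    "smile_left":  np.random.randn(num_vertices, 3) * 0.001,
    "brow_furrow": np.random.randn(num_vertices, 3) * 0.001,
}

def evaluate_face(weights: dict) -> np.ndarray:
    """Apply blend-shape weights (0..1) on top of the neutral mesh."""
    result = neutral.copy()
    for name, w in weights.items():
        result += w * blend_shapes[name]
    return result

# A frame of facial animation then reduces to one weight value per shape per frame.
frame_weights = {"smile_left": 0.7, "brow_furrow": 0.2}
posed = evaluate_face(frame_weights)
```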
There is a lot of change in both the actual asset itself and the rigging technology that goes into it - and consequently the level of detail of performance capture that you need to track. For example, if you were just tracking a PS2 character that couldn't do much more than open its mouth and blink, then you really didn't need very complicated performance data.
Whereas now, if you've got a character where the way the tongue appears and the micro-movement of the eyeballs are visible to the player, and that's what's conveying the emotion, then obviously you've got to track a whole other layer of data complexity.
Obviously, the two constraints for us are what you can hold in memory and what you can stream in terms of vertex movement. Each vertex in the mesh, as it moves around, adds "weight" for streaming purposes.
Like with many things, if you give people ten times the power, they tend to fill it with ten times as much stuff. You actually have the same problem you had on PlayStation 2, but just with a much higher resolution asset.
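As a rough, purely illustrative calculation of how that "weight" adds up: streaming raw per-vertex positions is dramatically heavier than streaming bone and blend-shape channels, which is one reason facial animation tends to be stored as rig curves rather than vertex caches. The numbers below are assumptions for the sake of the arithmetic (with channel counts loosely echoing the figures quoted above), not figures from any shipped title.

```python
# Back-of-envelope comparison of streamed facial animation data.
# All numbers are illustrative assumptions.

fps = 30                      # animation frames per second
seconds = 60                  # one minute of facial performance
frames = fps * seconds

# Option A: stream raw vertex positions (a vertex cache).
face_vertices = 4000          # a plausible current-gen face mesh
bytes_per_vertex = 3 * 4      # x, y, z as 32-bit floats
vertex_stream = frames * face_vertices * bytes_per_vertex

# Option B: stream rig channels (bone channels plus blend-shape weights).
channels = 100 + 60           # ~100 bone channels plus ~60 shape weights
bytes_per_channel = 4         # one float per channel per frame
channel_stream = frames * channels * bytes_per_channel

print(f"vertex cache:  {vertex_stream / 1e6:6.1f} MB per minute")   # ~86.4 MB
print(f"rig channels:  {channel_stream / 1e6:6.1f} MB per minute")  # ~ 1.2 MB
```

Even the lighter option multiplies quickly across hours of dialogue, which is the storage pressure described next.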
When we were doing GTA IV, for example - this is public domain stuff that they've written about in terms of storage problems on the Xbox 360 - what you need to store in terms of facial animation is orders of magnitude heavier, both as an asset and as streamed data, than it was on PS2.
I would say that what facial animation takes up on the disc has grown more than linearly compared to what it took up on PS2. Not everybody is trying to push it as far as it can go, that's for sure.
Facial animation isn't the thing alone that is going to cause that, obviously. The problems that we're having are going to be replicated for bodies, buildings, cars...everything. It's a limitation that I can't see would ever go away.
The ultimate end point - if you watch how these transitions come about - is doing what the visual effects industry does offline by spending hours rendering a single frame. At some point in the future, the PS9 or whatever is going to be able to do that in real time.
To give you an example, we've got some characters we are using on the visual effects side that have 20 times more polygons and at least 5 times greater texture resolution and 100 times more vertices.
Something like the Samburu warrior of ours, for example - at some point in the future, you are going to be able to run that on a games console. But you can't do that today.
We will see two things. We will see characters of ever-greater realism in terms of offline content. The goal for that - to put it in the words of what Rockstar has said publicly about what they are trying to do with their characters in GTA - is all about creating a deeper engagement and resonance with the characters, and therefore with the story.
Any game that has a pre-told story is going to benefit from characters that can give you that greater emotional interaction. It's not just Rockstar that thinks that. I know that the chief visual officer of EA has talked about how stories are going to become more compelling when games can make people cry - when they can elicit emotions from the player.
Most humans elicit emotions from other human beings primarily through facial expressions... and body language, obviously, but facial expression is a hugely important part of that.
So, I think there is definitely going to be interest across the board from story-driven games like GTA. The future also is going to be about characters that you encounter and interact with in-game. So, I suppose the ultimate "nirvana" for interactive entertainment is kind of like a holodeck experience - where you can walk up to a synthetic character and interact with it as if it were human, so that the story is also interactive.
In-game facial animation doesn't really exist today, even on a lot of the products on the new consoles. They do have in-game facial animation, but it is very, very crude. Ninety per cent of the developers we work with will be using something like FaceFX in Unreal to just do automated lip flap for their in-game animation.
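To illustrate what "automated lip flap" amounts to - and this is a generic sketch of the idea, not FaceFX's or any other middleware's actual algorithm - the audio's loudness envelope is mapped more or less directly onto a jaw-open value, with no phonemes or visemes, which is exactly why it reads as flapping rather than speech.

```python
import wave
import numpy as np

def lip_flap_curve(wav_path: str, fps: int = 30) -> np.ndarray:
    """Crude 'lip flap': map the audio loudness envelope of a mono 16-bit WAV
    to a 0..1 jaw-open value per animation frame."""
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    samples = samples.astype(np.float64) / 32768.0   # normalise to -1..1
    samples_per_frame = rate // fps

    # RMS loudness per animation frame, normalised to the loudest frame.
    n_frames = len(samples) // samples_per_frame
    chunks = samples[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
    rms = np.sqrt((chunks ** 2).mean(axis=1))
    peak = rms.max()
    return rms / peak if peak > 0 else rms
```

The resulting curve drives a single jaw channel, which is a world away from the tracked, full-face performance data discussed above.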
So, that's a key challenge for the industry. If you believe the supposition that a player who walks up to a character in a game and gets a more visually authentic experience with that character is going to find the product more attractive - which I believe, anyway - then getting the facial animation pieces in place and pushing them forward in-game is going to be really important.
And that is a huge challenge to the industry, because you are talking about true AI - not just AI for character interactions along the NaturalMotion lines, but synthesising audio and synthesising facial animation on the fly.
I think what you are really looking at there is segmentation of the consumer market.
In the same way that you would go to a cinema and you would be able to watch a period Pride and Prejudice drama, or you would be able to watch a heavy visual effects film like Iron Man, or you would be able to watch Bee Movie - I think the videogames market is just going to replicate that. "Different horses for courses," being an English expression.
You've got families who are purchasing Wii-style games. In fact, the same people purchasing GTA are also purchasing Wii-style games and having great fun with them. And that's a completely valid and appropriate segment of the industry that will undoubtedly grow even more, because what Nintendo has done is make it accessible to a wider audience in a way that no one else has been able to do in the past.
You are going to see huge growth in that area, probably in families, and probably bringing older people into videogaming as well with brain games and such. I don't think any of that is going to go away. If anything, it is going to be as fast growing as any other segment of the entertainment industry.
But I think that there is always going to be a place for the GTA, the Heavenly Sword types of experiences. I think the largest entertainment launch in history - GTA's 500 million dollars or whatever - is testament to the fact that that part of the industry is also going to continue to grow.
But they're not going to cannibalise each other. There's room for interactive entertainment as a whole to keep taking an ever-growing share of entertainment.
I think you're right in the sense that Image Metrics is never going to do loads and loads of Nintendo games. We are definitely more appropriate for the 360, PS3 experience, I think.
The answer to that is severalfold.
We have deployed our technology on the EyeToy platform in the past. We did a product for Sony called Operation Spy in the US which had a facial recognition component to it so that you could recognise one player as being distinct from another player.
If you think about it, some games are going to have characters where there isn't the need for a synthetic character that responds to you through AI - you're just going to have one character that is a player talking to another player, much like you do with the Xbox headset now.
If we can capture the facial expressions via webcam of a character that's in Sony Home, for example, and allow avatars to chat with each other, that's a fantastic piece of technology.
At the moment, there is nothing we are announcing publicly about our intentions in that realm. But if the question is, is it technically something that Image Metrics can do? Absolutely.
That's something people are very conscious of, certainly in the games world in terms of product consistency.
We can't radically increase the quality of facial animation on a GTA character, for example, if the body animation sucks, because it breaks the spell of realism. I think everything needs to be in context.
I see moments in a lot of what Image Metrics does where I think "I can't believe it's not human." A lot of it comes from the eyes, actually. I've really noticed this in videogame characters because, generally speaking, the eye animation is either non-existent or terrible.
The minute that the eyes aren't engaging, you start looking at the lip sync. Everybody, especially in the games industry, is obsessed with lip sync. But there is an element of me that says, actually, if you get the eye animation right, and the eyes are engaging...If we were standing in front of each other now, I wouldn't be staring at your lips.
I think there is a lot of potential to get around the issue of the "uncanny valley" imminently, if not today. The reason that you keep wanting to push it is because of the story-telling opportunities it allows you to create.
Angelina Jolie in Beowulf is probably not a good example, because she looked like Angelina Jolie, but the principle that you can actually harness a performance and deliver it through a complete fantasy environment - I think people are going to keep on trying to do it as far into the future as we know.
The cost of doing it is definitely a key factor. Something like King Kong - the eye animation on the feature film was actually quite good. And they spent ages keyframing it to get it just right - right down to all the little microdots.
Part of the problem is that gross eye movement is not what makes the eyes. It is actually all the subtle little flicks.
We've done this at Image Metrics. We've recorded high definition video of a lot of these acting performances. If you actually zoom in on the eye area, there's just so much going on. To even attempt to keyframe it is prohibitively expensive.
If you then move on to - okay, if we can't keyframe it, what are we going to do? Are we going to use motion capture? We can't really use motion capture to get convincing eye data. There are all sorts of people doing experiments...I know WETA are doing experiments like, sort of gluing sensors to the eyelids. You are just not going to get accurate data with that.
You are stuck with video, because video is the one thing where you can actually track pupil and iris movement, and even pupil dilation if you wanted to - because humans actually pick up on that as well. Certainly, emotions like love or fear trigger quite significant pupil change.
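For a sense of why video makes this tractable, here is a minimal sketch of locating the iris in a cropped eye image with OpenCV's Hough circle transform. It is an off-the-shelf illustration of the general idea only - not Image Metrics' tracker - and the parameter values are assumptions that would need tuning per footage.

```python
import cv2

def find_iris(eye_bgr):
    """Locate the iris in a cropped eye image as a circle (x, y, radius).
    A generic illustration; production trackers are far more robust than this."""
    gray = cv2.cvtColor(eye_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)              # suppress lash/skin texture noise

    circles = cv2.HoughCircles(
        gray,
        cv2.HOUGH_GRADIENT,
        dp=1,                       # accumulator at full image resolution
        minDist=gray.shape[0],      # expect a single dominant circle
        param1=100,                 # Canny edge threshold
        param2=20,                  # accumulator threshold (lower = more permissive)
        minRadius=gray.shape[0] // 8,
        maxRadius=gray.shape[0] // 2,
    )
    if circles is None:
        return None
    x, y, r = circles[0][0]
    return float(x), float(y), float(r)

# Frame-to-frame changes in (x, y) capture the subtle flicks of gaze; changes in r
# hint at apparent iris size, though measuring true pupil dilation needs a tighter crop.
```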
Image Metrics probably has a better opportunity than pretty much anyone else to do eye animation well. And then the question becomes: how do game developers take advantage of it in a pipeline in a way that makes sense?
If you are not doing simultaneous capture, then it becomes fiendishly difficult to use the eye data. Because if we are filming voice over sessions, of course the actor is not necessarily going to be looking where you want them to look - there is a direction issue there.
A lot of it comes down to, not only is it fiendishly expensive, but there are technical hurdles and performance-related issues that prevent you from using the data properly, even if you could get it.
A lot of the time the priority just goes into getting the eye line right - having characters properly look at each other. That's a big enough problem in and of itself, before you even get onto: once the character is looking in vaguely the right direction, are their eyes animated properly?
Nick Perrett is VP of business development for Image Metrics. Interview by Mark Androvich. Special thanks to Shannon McPhee.