In my years of using RealSense with the Unity game engine, I found that whilst the camera could produce mouth open and close expressions when the user opens and closes their mouth slowly, it was impractical for normal live speech lip-sync: during ordinary talking, the real-life lips move too fast for the camera to track.
Note: I develop with the original F200 camera. I haven't tried lip syncing with its next-generation replacement, the SR300, so my experience is based on the F200.
What I did instead to create lip sync in my PC game was to analyze the volume of microphone input and convert that into a value between 0 (mouth closed) and 1 (mouth fully open). The louder the speech input into the microphone, the wider the animated mouth opened, just like in real life. This approach works excellently.
I have a video of the principle on my YouTube channel.
This message was posted on behalf of Intel Corporation
Marty, thank you very much for your contribution. The information you've provided is very good.
Daniel, have you checked Marty’s post and link?
The principle can also be applied to analyzing the volume of a pre-recorded mp3 file to automatically generate lip movements in sync with the speech on the track. This allows a facial animation to be instantly localized to any country just by playing an mp3 of speech in that language, saving the labor of re-recording the animation for each language. Other facial animations can also be tied to the '0 to 1' volume value to auto-generate movement in the lip shape expression, eyebrows, eyelids, etc.
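For the pre-recorded case, the same volume analysis can be run offline over the whole track, producing one mouth-open value per animation frame. A minimal sketch, assuming the mp3 has already been decoded to floating-point samples (decoding itself is out of scope here) and using made-up `fps` and `max_rms` tuning values:

```python
import math

def lip_sync_envelope(samples, sample_rate, fps=30, max_rms=0.3):
    """Pre-compute a list of 0..1 mouth-open values, one per animation
    frame, from decoded audio samples (floats in -1..1).

    fps is the animation frame rate; max_rms is an assumed calibration
    level at which the mouth is fully open.
    """
    frame_len = max(1, sample_rate // fps)
    values = []
    # Walk the track in frame-sized chunks and take the RMS of each.
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        values.append(min(rms / max_rms, 1.0))
    return values
```

The resulting list can then be played back alongside the audio, indexing by the current frame. Because this measures overall loudness, background music and sound effects on the track will also open the mouth, which matches the rough edges Marty describes in the TaleSpin test below.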
I have an unlisted tech test video on my YouTube account that shows the squirrel character from the live lip sync video moving her lips to the tune of the 1990 Disney cartoon TaleSpin. It's a bit rougher than the live lip sync because the volume analysis code is also picking up the theme's background music and sound effects (i.e., it's not a "clean" speech recording), but it demonstrates the technique well.