A data-driven model generates natural human movements for virtual avatars

WANDR starts from any body pose and generates accurate, realistic human motion that reaches a specified 3D target (shown as a red ball). WANDR is a conditional variational autoencoder trained with a purely data-driven approach, guided by intention features (arrows) that steer the person’s orientation (yellow), posture (cyan), and wrist (pink) toward the target. WANDR can achieve a wide range of goals, even ones that deviate significantly from the training data. Credit: Diomataris et al.

Humans can innately perform a wide range of movements, which lets them tackle the many tasks of daily life. Automatically reproducing these movements in virtual avatars and 3D animated human-like characters could benefit many applications, from metaverse spaces to digital entertainment, AI interfaces and robotics.

Researchers at the Max Planck Institute for Intelligent Systems and ETH Zurich recently developed WANDR, a new model that can generate natural human movements for avatars. The model, which will be presented at the Conference on Computer Vision and Pattern Recognition (CVPR 2024) in June, unifies multiple data sources under a single model to achieve more realistic movements in 3D humanoid characters. The paper is also available on the arXiv preprint server.

“At a high level, our research focuses on finding out what it takes to create virtual humans capable of behaving like us,” Markos Diomataris, first author of the paper, told Tech Xplore. “This basically means learning to think about the world, how to navigate it, set goals and try to achieve them.

“But why pursue this research problem? Basically, we want to understand people better, just as a neuroscientist would, and we try to do that by following the ‘try to build what you want to understand’ philosophy.”

The primary goal of the recent study by Diomataris and his colleagues was to create a model that generates realistic movements for 3D avatars. These generated movements would eventually allow the avatars to interact with their virtual environment, for example by reaching for and grasping objects.

“Consider reaching for a cup of coffee—it can be as simple as extending an arm, or it can involve the coordinated action of our entire body,” Diomataris said. “Actions like bending, reaching, and walking must come together to reach a goal. At a detailed level, we’re constantly making fine adjustments to maintain balance and stay on track toward our goal.”

Credit: arXiv (2024). DOI: 10.48550/arxiv.2404.15383

By making these subtle adjustments, people produce smooth movements that integrate numerous smaller motions in service of a simple goal (e.g., placing a hand on a cup). Diomataris and his colleagues set out to teach virtual avatars the same skill.

One approach to teaching virtual agents new skills is reinforcement learning (RL), while another is to build a dataset containing human demonstrations and then use it to train a machine learning model. The two approaches have different strengths and limitations.

“RL, very simply put, is learning skills through experience gained through trial and error,” explained Diomataris. “For our task, the agent would have to try all kinds of random movements at the beginning of its training until it can first stand properly, then walk, orient itself to a target, navigate to it, and finally reach it with its hands.

“This approach does not necessarily need a dataset, but it may require a large amount of computation, as well as tedious reward design to keep the agent from unnatural-looking behavior (e.g., preferring to crawl rather than walk).”

Unlike RL, training models on datasets provides the virtual agent with richer information about the skill, rather than leaving it to discover this information on its own. Although various large datasets of human motion now exist, very few of them include reaching movements, which the team also wished to replicate in avatars.

“We prioritized motion realism and decided to learn this skill from data,” Diomataris said. “We present a method that can use both large datasets with a variety of general movements and smaller datasets that specialize in people reaching goals.”

Credit: arXiv (2024). DOI: 10.48550/arxiv.2404.15383

Diomataris and colleagues first designed a training objective that is agnostic to whether goal labels exist. This key step allows WANDR to learn general navigation skills from larger unlabeled datasets while still exploiting the goal labels available in smaller ones.

“WANDR is the first human motion generation model that is driven by an active feedback loop learned purely from data, without additional reinforcement learning (RL) steps,” Diomataris said. “What is active feedback? WANDR generates movement autoregressively (frame by frame). At each step, it predicts the action that will move the person to the next state.”

WANDR’s action predictions are conditioned on time- and goal-dependent features, which the researchers call “intention.” These features are recomputed at every frame and act as a feedback loop that guides the avatar’s wrist to a given target.

“This means that, similar to a human, our method constantly adjusts its actions in an attempt to orient the avatar toward a goal and reach it,” Diomataris said. “As a result, our avatar can approach and engage moving or sequential targets, even though it was never trained to do such a thing.”
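The feedback loop Diomataris describes can be pictured as a simple rollout: generate one frame, recompute the intention features from the new state and the goal, then generate the next frame. The sketch below illustrates this pattern in Python; the function names (compute_intention, wandr_decoder, apply_action) and the feature layout are hypothetical placeholders, not the paper’s actual implementation.

```python
import numpy as np

def compute_intention(state, goal, t_remaining):
    """Recompute the time- and goal-dependent 'intention' features for the
    current frame: offsets from the body and wrist to the goal, plus the
    time left to reach it (an assumed feature layout)."""
    to_goal_body = goal - state["pelvis_pos"]    # 3D vector
    to_goal_wrist = goal - state["wrist_pos"]    # 3D vector
    return np.concatenate([to_goal_body, to_goal_wrist, [t_remaining]])

def rollout(state, goal_fn, n_frames, wandr_decoder, apply_action, fps=30):
    """Generate motion autoregressively; goal_fn(i) returns the goal at
    frame i, so moving targets are handled by the same loop."""
    frames = [state]
    for i in range(n_frames):
        t_remaining = (n_frames - i) / fps
        intention = compute_intention(state, goal_fn(i), t_remaining)
        z = np.random.randn(64)                       # CVAE latent sample
        action = wandr_decoder(state, intention, z)   # predicted pose change
        state = apply_action(state, action)           # integrate one frame
        frames.append(state)
    return frames
```

Because the goal enters only through intention features that are recomputed every frame, swapping in a new goal mid-rollout is enough to make the avatar chase moving or sequential targets, which matches the behavior described above.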

Existing datasets of goal-directed human reaching movements, such as CIRCLE, are sparse and do not contain enough data for models to generalize across different tasks. This is why RL has so far been the most common approach for training models to reproduce human movements in avatars.

Credit: arXiv (2024). DOI: 10.48550/arxiv.2404.15383

“Inspired by the behavioral cloning paradigm in robotics, we propose a purely data-driven approach where a randomly chosen future hand position of the avatar is treated as the target during training,” said Diomataris.

“By hallucinating targets in this way, we are able to combine both smaller datasets with target annotations, such as CIRCLE, and large datasets, such as AMASS, which have no target labels but are essential for learning general navigational skills such as walking and turning.”
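A minimal sketch of this target “hallucination” idea follows, assuming a clip is a list of per-frame states with a recorded wrist position; the helper names and the uniform sampling scheme are illustrative assumptions, not the paper’s exact recipe.

```python
import random

def hallucinate_goal(clip, t):
    """For an unlabeled clip (e.g., from AMASS), pick a random future frame
    and treat its wrist position as the goal for frame t (assumed scheme)."""
    t_goal = random.randint(t + 1, len(clip) - 1)  # requires t < len(clip) - 1
    return clip[t_goal]["wrist_pos"], t_goal

def make_training_example(clip, t):
    """Build one goal-conditioned example from a clip with no goal labels,
    so labeled (CIRCLE) and unlabeled (AMASS) data share one objective."""
    goal, t_goal = hallucinate_goal(clip, t)
    return {"state": clip[t], "goal": goal, "frames_left": t_goal - t}
```

With every frame of every clip now carrying a (possibly hallucinated) goal, the same training objective applies to both dataset types, which is what lets the model learn general navigation from AMASS and precise reaching from CIRCLE.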

WANDR was trained on data drawn from a variety of datasets and sources. By blending these sources appropriately, the model produces more natural movements, allowing the avatar to reach arbitrary targets in its environment.

“So far, works that study motion generation either use RL or completely lack the element of online motion adaptation,” Diomataris said. “WANDR demonstrates a way to learn the adaptive behavior of avatars from data. The ‘online adaptation’ part is essential for any real-time application where avatars interact with humans and the real world, such as in a virtual reality video game or human-avatar interaction.”

In the future, the new model presented by this team of researchers could help generate new content for video games, VR applications, animated films and entertainment, allowing human characters to perform more realistic body movements. As WANDR relies on a variety of data sources and human movement datasets are likely to grow in the coming decades, its performance could soon improve further.

“There are two main pieces missing right now that we plan to explore in the future,” Diomataris added. “First, avatars need to be able to use large and uncurated video datasets to learn how to move and interact with their virtual world, in addition to having the ability to explore their virtual world and learn from their own experiences.

“These two directions represent the basic means by which people also gain experience: by taking actions and learning from their consequences, but also by observing others and learning from their experiences.”

More information:
Markos Diomataris et al, WANDR: Intention-Driven Generation of Human Motion, arXiv (2024). DOI: 10.48550/arxiv.2404.15383

Journal information:
arXiv

© 2024 Science X Network

Citation: Data-driven model generates natural human motions for virtual avatars (2024, May 30) Retrieved June 1, 2024, from https://techxplore.com/news/2024-05-driven-generates-natural-human-motions.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
