Humanoid Robots: Reinforcement Learning Reward Function Design for Standing and Walking

Related:

https://docs.zeroth.bot/ml/rl







Reward Shaping
General Configuration for Standing
A general configuration for standing starts from the original URDF (Unified Robot Description Format) model posed in the desired standing position. The goal during training is to minimize deviation from this original pose.

If necessary, an orientation reward can be included to encourage the robot to maintain an upright posture. This can be achieved by adding a term to the reward function that penalizes deviations from the desired orientation.
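The position and orientation terms above can be combined into a single standing reward. Below is a minimal sketch in Python; the function name `standing_reward`, the (w, x, y, z) quaternion convention, and all weights are illustrative assumptions, not values from the zeroth.bot docs.

```python
import numpy as np

def standing_reward(base_pos, init_pos, base_quat, w_pos=1.0, w_orient=0.5):
    """Sketch of a standing reward: penalize horizontal drift from the
    initial URDF standing position and deviation from an upright posture.
    All names and weights here are assumptions for illustration."""
    # Horizontal (x, y) distance from the initial standing position.
    pos_error = np.linalg.norm(base_pos[:2] - init_pos[:2])

    # Upright error: angle between the body z-axis and the world z-axis,
    # recovered from the base quaternion (w, x, y, z convention assumed).
    w, x, y, z = base_quat
    # Third column of the rotation matrix = body z-axis in the world frame.
    body_z = np.array([2*(x*z + w*y), 2*(y*z - w*x), 1 - 2*(x*x + y*y)])
    tilt = np.arccos(np.clip(body_z[2], -1.0, 1.0))

    # Exponential kernels keep each term bounded in (0, 1].
    return w_pos * np.exp(-pos_error**2) + w_orient * np.exp(-tilt**2)
```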



Walking Rewards
For training the robot to walk, we have an additional set of rewards that are added to the standing rewards. Crucially, maintaining the original standing position accounts for 80% of the total reward during initial training, which ensures the policy first learns a stable standing position. This is essential since standing represents the base distribution from which other behaviors must develop.
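As a concrete illustration of this weighting, the snippet below composes a total reward with an 80/20 split between standing and walking terms during initial training; the dictionary keys and helper name are assumptions, not from the source.

```python
# Illustrative composition of the total reward during initial training.
# The 80/20 split follows the text above; the term names are assumed.
reward_weights = {
    "standing": 0.8,   # keep the original standing pose (dominant at first)
    "walking": 0.2,    # forward velocity and gait terms, grown later
}

def total_reward(standing_term, walking_term, weights=reward_weights):
    return weights["standing"] * standing_term + weights["walking"] * walking_term
```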

Forward Velocity Reward: This reward encourages the robot to move forward. It can be defined as a function of the robot’s forward velocity, but is weighted to be less significant initially to prevent premature optimization for walking before stability is achieved.
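A common way to express such a velocity reward in legged-locomotion setups is an exponential kernel around a commanded forward velocity, scaled by a small weight so it cannot dominate the standing terms early on. This sketch is illustrative; the target velocity, kernel width, and weight are assumed values.

```python
import numpy as np

def forward_velocity_reward(lin_vel_x, target_vel=0.5, sigma=0.25, weight=0.2):
    """Sketch of a forward-velocity reward: an exponential kernel around a
    commanded velocity, down-weighted to keep standing dominant at first.
    target_vel, sigma, and weight are assumed values."""
    return weight * np.exp(-((lin_vel_x - target_vel) ** 2) / sigma)
```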

Additional rewards such as feet clearance and contact forces are crucial for achieving sim2real transfer and handling various real-world properties like friction coefficients. These rewards ensure the policy learns realistic locomotion patterns that can translate to physical robots. The action smoothness reward particularly helps generate commands that are feasible for real-world actuators to execute under typical PID control schemes.
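One hedged sketch of an action smoothness term penalizes the first and second differences of consecutive actions, so that commanded joint targets stay trackable by PID-controlled actuators; the penalty weights and buffer handling below are assumptions.

```python
import numpy as np

def action_smoothness_penalty(actions, prev_actions, prev_prev_actions,
                              w_rate=0.01, w_accel=0.005):
    """Sketch of a smoothness penalty: penalize the first and second
    differences of the action sequence so commanded joint targets remain
    feasible for PID-controlled actuators. Weights are assumptions."""
    rate = np.sum((actions - prev_actions) ** 2)                      # action rate
    accel = np.sum((actions - 2 * prev_actions + prev_prev_actions) ** 2)  # action acceleration
    return -(w_rate * rate + w_accel * accel)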



For the standing reward function: the standing position should not drift far from the initial position, so a penalty is applied based on the horizontal distance between the post-standing position and the initial position. The posture after standing should be upright, so a penalty is applied based on the deviation between the robot's posture and the upright posture.


The rewards for training the robot to walk need to be set up in multiple stages, e.g., an initial stage and a normal stage. The initial stage can be regarded as mainly standing training: the penalty from the distance between the robot's post-standing position and the initial position makes up 80% of the total reward/penalty, so training focuses on standing first. Once standing performance is reasonably mature, the walking rewards are gradually increased and the standing rewards' share of the total is reduced. The walking rewards include a forward-velocity reward, but to keep the gait stable this term carries only a small weight. The smoothness of the actuator (motor) outputs also needs reward and penalty terms: smoothness can be judged by how far the commanded outputs deviate from what the actuators can track under PID control, and rewarded or penalized accordingly.
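One way to realize the staged weighting described above is a simple linear schedule that shifts reward weight from standing to walking as training progresses. The step counts and final weights below are illustrative assumptions, not values from the source.

```python
def staged_weights(step, warmup_steps=2_000_000,
                   start=(0.8, 0.2), final=(0.4, 0.6)):
    """Sketch of a two-stage schedule: begin with standing dominating the
    total reward (80%) and linearly shift weight toward the walking terms
    as training matures. Step counts and final weights are assumed."""
    frac = min(step / warmup_steps, 1.0)
    standing = start[0] + frac * (final[0] - start[0])
    walking = start[1] + frac * (final[1] - start[1])
    return {"standing": standing, "walking": walking}
```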



posted on 2024-12-06 23:17 by Angry_Panda