DQN vs. DDQN
The main differences between the standard DQN (Deep Q-Network) and DDQN (Double Deep Q-Network) are:
1. Different target for the Q-value estimate:
- DQN: the target Q-value is obtained by letting the same (target) network both pick the maximizing action and evaluate it. This can lead to overestimation of Q-values.
  \( Y^{DQN} = r + \gamma \max_a Q(s_{t+1}, a; \theta^-) \)
  Here \(\theta^-\) denotes the target-network parameters.
- DDQN: action selection and Q-value evaluation for the target are split across two networks. The current (online) network selects the action, and the target network evaluates that action's Q-value, which reduces the tendency to overestimate.
  \( Y^{DoubleQ} = r + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta^-) \)
  Here the action is chosen with the online parameters \(\theta\) and evaluated with the target parameters \(\theta^-\).
2. Decoupled action selection and Q-value update:
- DQN's update relies directly on the maximum Q-value output by the target network, whereas DDQN decouples action selection from target computation, reducing instability and bias in the value estimates.
- This decoupling lets DDQN converge more stably and avoid policy errors caused by overestimation.
3. Stability and performance improvements:
- DQN can become unstable because of overestimation and tends to "oscillate" during training.
- By separating action selection from Q-value evaluation, DDQN uses a more accurate target, improving training stability and final performance.
Summary
- DQN uses a single network for both action selection and target computation, which can cause Q-value overestimation.
- DDQN separates action selection from target computation, mitigating overestimation and improving stability and accuracy.
Hopefully this makes the main differences between DQN and DDQN clear.
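For a concrete view of the difference, the two targets can be computed side by side in PyTorch. This is a minimal sketch only: online_net, target_net, and the batch tensors are assumed to exist, actions are discrete indices, and terminal-state masking is omitted.

import torch

with torch.no_grad():
    next_q_target = target_net(next_states)                              # [batch, num_actions]
    # DQN target: the target network both selects and evaluates the action
    y_dqn = rewards + gamma * next_q_target.max(dim=1).values
    # Double DQN target: the online network selects, the target network evaluates
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # [batch, 1]
    y_ddqn = rewards + gamma * next_q_target.gather(1, best_actions).squeeze(1)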
a prompt for change
Let me analyze the RL components in this code and provide guidance on implementing DDQN.
Key Points from Current Implementation:
- The code uses a ComponentSelector for selecting which neural network components to train in each round using DQN
- Components that can be selected include: transformer, node embedding, confidence encoders, VAE parts, and output projection
- The current state representation includes client loss history
- Reward calculation considers both global loss improvements and individual client improvements
- Uses an epsilon-greedy exploration strategy with decay
To modify this to use DDQN, here are the key changes needed:
- Network Architecture:
- Keep two networks: main network and target network
- Both networks should have identical architecture but separate parameters
- Target network parameters should be updated periodically (soft or hard updates; a sketch of both follows this list)
- Action Selection:
- Use main network for selecting actions (argmax Q-values)
- Use target network for evaluating those actions (computing target Q-values)
- Loss Calculation:
- Current Q-value: Q(s,a) from main network
- Next action selection: argmax Q(s',a') from main network
- Target Q-value: Q(s',argmax Q(s',a')) from target network
- TD error: r + γQ_target(s',argmax Q(s',a')) - Q(s,a)
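As a companion to the loss calculation above, here is a minimal sketch of the two target-update options mentioned earlier (hard copy vs. soft/Polyak averaging). main_net, target_net, and tau are illustrative names, not identifiers from the existing code.

def hard_update(target_net, main_net):
    # Copy the online parameters into the target network every N steps
    target_net.load_state_dict(main_net.state_dict())

def soft_update(target_net, main_net, tau=0.005):
    # Polyak averaging applied every step: target <- (1 - tau) * target + tau * main
    for t_param, m_param in zip(target_net.parameters(), main_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * m_param.data)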
Here's a suggested prompt for implementing these changes:
I need help modifying the RL component selection in a federated learning system. The system should:
1. Use Double DQN for component selection with:
- Main network for action selection
- Target network for value estimation
- Periodic target network updates (every N steps)
- Experience replay buffer of size K
2. State representation should include:
- Historical loss values for each client
- Current component activation states
- Client performance metrics
- Global model performance metrics
3. Action space:
- Binary vector indicating which components to activate/freeze
- Length equal to number of neural network components
4. Reward function considering:
- Improvement in global model performance
- Individual client improvements
- Training efficiency (speed vs performance tradeoff)
- Component sparsity bonus
5. Training process:
- Epsilon-greedy exploration with decay
- Minibatch updates from replay buffer
- Separate target network updates
- Early stopping based on performance plateau
Please provide the core logic for implementing this DDQN-based component selection system.
This prompt would help get an implementation that:
- Better stabilizes learning through the target network
- Reduces overestimation bias common in standard DQN
- Maintains exploration-exploitation balance
- Considers both global and local performance metrics
Would you like me to elaborate on any of these aspects or provide more specific implementation details?
Modifications to RL + aggregation (which modules to select)
Looking at the Model class structure, let me clarify how the RL component selection and aggregation work:
- Current Selection Scope:
- The RL agent only selects components within the VAE (encoder, decoder, sd_decoder)
- Other components like confidence module, road_embedding, projection_head, and SPGNN are not part of the selection process
- Aggregation Scope:
  Currently, these components participate in aggregation:
  - VAE components (if selected by RL): encoder, decoder, sd_decoder
  - projection_head
  - sd_projection_head
  - gnn
  Do NOT aggregate:
  - road_embedding (client-specific)
  - confidence module
To expand the RL selection to the full model, we could modify it like this:
class Model(nn.Module):
    def __init__(self, input_size, hidden_size, device, layer_rnn, label_num, edge_num) -> None:
        super().__init__()
        # Group all trainable components
        self.components = nn.ModuleDict({
            'confidence': Confidence(label_num, input_size),
            'vae': VAE(input_size, hidden_size, layer_rnn, hidden_size, 0, label_num),
            'projection_head': self.create_projection_head(hidden_size, label_num),  # helper assumed to exist
            'gnn': SPGNN(edge_num, hidden_size, label_num)
        })
        # Non-aggregatable, client-specific component
        self.road_embedding = nn.Embedding(label_num, input_size)
        # Track component activation
        self.active_components = {name: True for name in self.components.keys()}

    def set_active_components(self, component_mask):
        """Extended component selection: freeze/unfreeze components by index."""
        component_map = {
            0: 'confidence',
            1: 'vae.encoder',
            2: 'vae.decoder',
            3: 'vae.sd_decoder',
            4: 'projection_head',
            5: 'gnn'
        }
        for idx, component_path in component_map.items():
            if '.' in component_path:  # Handle nested components (VAE)
                module_name, component_name = component_path.split('.')
                component = getattr(self.components[module_name], component_name)
            else:
                component = self.components[component_path]
            for param in component.parameters():
                param.requires_grad = bool(component_mask[idx])
                if not param.requires_grad and param.grad is not None:
                    param.grad.zero_()
And modify the aggregation function:
def aggregate_models(clients, weights):
    """Aggregate model parameters using a weighted average."""
    global_state = {}
    reference_model = clients[0].model.state_dict()
    # Define non-aggregatable components
    skip_components = [
        'road_embedding',
        'confidence.pos_encoder'
    ]
    # Initialize global state with zeros
    for key in reference_model:
        if not any(x in key for x in skip_components):
            global_state[key] = torch.zeros_like(reference_model[key])
    # Weighted sum of parameters
    for client, weight in zip(clients, weights):
        client_state = client.model.state_dict()
        for key in global_state:
            if key in client_state:
                global_state[key] += client_state[key] * weight
    return global_state
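Since the skipped keys never enter global_state, each client would then load the aggregated parameters non-strictly so that its local road_embedding and confidence.pos_encoder remain untouched. A minimal usage sketch under that assumption:

for client in clients:
    # strict=False keeps the client-specific parameters that were skipped above
    client.model.load_state_dict(global_state, strict=False)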
This way:
- RL agent can select ALL major components of the model
- Aggregation properly handles both selected and non-selected components
- Client-specific components are preserved
- Component selection affects both training and aggregation
Would you like me to explain any part in more detail or help modify other aspects of the implementation?
Current RL + curriculum learning setting
Let me break down these components:
- RL State Space:
def get_state(self, client_losses):
    # State dimension = num_clients * 4
    # Each client contributes its last 4 loss values
    state = []
    for client in range(len(client_losses)):
        if client_losses[client]['total']:
            recent_losses = client_losses[client]['total'][-4:]  # Last 4 losses
            state.extend(recent_losses)
        else:
            state.extend([0.0] * 4)
    return torch.tensor(state, device=self.device).float()
For example, with 5 clients, state dimension = 5 * 4 = 20
- Reward Calculation:
def calculate_reward(client_improvements):
    # Normalize improvements to prevent extreme values
    improvements = torch.tensor(client_improvements)
    normalized_improvements = torch.tanh(improvements)  # Squash to [-1, 1]
    # Reward components:
    # - mean_improvement: average performance gain across clients
    # - min_improvement: worst-case performance to ensure no client is left behind
    mean_improvement = normalized_improvements.mean()
    min_improvement = normalized_improvements.min()
    # Combined reward: 70% average improvement + 30% worst case
    reward = 0.7 * mean_improvement + 0.3 * min_improvement
    return reward.item()
- CurriculumAggregator:
class CurriculumAggregator:
    def __init__(self, num_clients):
        # Initialize equal weights for all clients
        self.num_clients = num_clients
        self.difficulty_scores = torch.ones(num_clients) / num_clients

    def update_difficulty(self, client_metrics):
        # Latest total loss per client (inf if no history yet, which softmax maps to ~zero weight)
        performance_scores = torch.tensor([
            metrics['total'][-1] if metrics['total'] else float('inf')
            for metrics in client_metrics
        ])
        # Convert losses to weights via softmax over the negated losses
        # (lower loss = higher weight)
        normalized_scores = F.softmax(-performance_scores, dim=0)
        # Smooth update: 90% old scores + 10% new scores
        self.difficulty_scores = 0.9 * self.difficulty_scores + 0.1 * normalized_scores

    def get_aggregation_weights(self, clients_perf):
        # Combine difficulty scores with recent improvements
        weights = self.difficulty_scores * torch.tensor(clients_perf)
        return F.softmax(weights, dim=0)  # Normalize to sum to 1
Key Points:
- State:
- Takes last 4 losses from each client
- Dimension = num_clients * 4
- Represents recent training history
- Reward:
- Based on improvement in losses
- Balances average and worst-case performance
- Normalized to [-1, 1] range
- Curriculum Aggregation:
- Tracks difficulty scores per client
- Updates smoothly (90/10 split)
- Combines difficulty with performance for aggregation weights
The interaction works like this:
- RL agent observes state (loss history)
- Selects components to activate
- Clients train with selected components
- Reward based on improvements
- Curriculum updates based on performance
- Aggregation weights combine curriculum and performance
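Putting these pieces together, one federated round might look like the following sketch. This is glue code only: the constructors, train_one_round, prev_losses, and the client_losses bookkeeping are assumptions, while the method names come from the snippets above.

# One round of RL-driven component selection + curriculum aggregation (hypothetical glue code)
state = selector.get_state(client_losses)
action = selector.select_action(state)                       # binary mask over components
for client in clients:
    client.model.set_active_components(action)
    client.train_one_round()                                  # assumed local training step
improvements = [prev_losses[i] - client_losses[i]['total'][-1] for i in range(len(clients))]
reward = calculate_reward(improvements)
aggregator.update_difficulty(client_losses)
weights = aggregator.get_aggregation_weights(improvements)
global_state = aggregate_models(clients, weights)
next_state = selector.get_state(client_losses)
selector.update_memory(state, action, reward, next_state, done=False)
selector.optimize()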
Would you like me to elaborate on any of these aspects?
Modifications to curriculum learning
The revised code improves the model aggregation step in federated learning by adding a curriculum learning mechanism: aggregation weights are adjusted dynamically from each client's performance score and heterogeneity score, making the aggregation smarter and more adaptive.
A detailed walkthrough and summary follow.
Core ideas
- Dynamic aggregation weights:
  - Combine the clients' performance scores and heterogeneity scores to generate aggregation weights dynamically.
  - Use a smoothing mechanism (balancing historical and current weights) to keep the weights stable and avoid sharp fluctuations.
  - Use the aggregation weights in a weighted average of the model parameters to improve the global model.
- Key improvements:
  - A heterogeneity score is introduced to measure how different a client's data distribution is from the others.
  - The relative weight of the performance score versus the heterogeneity score is tunable via the parameter α.
  - The curriculum difficulty scores are updated smoothly to keep the training process stable.
What the code does
1. Curriculum weight calculation
The calculate_weights method computes dynamic aggregation weights from the performance and heterogeneity scores. The core logic:
- Normalize the scores:
  - Performance scores: lower loss maps to higher weight (implemented as a softmax over -perf_tensor).
  - Heterogeneity scores: higher heterogeneity maps to higher weight.
  - The normalized performance weights and heterogeneity weights are computed separately.
- Combine the current scores, balancing heterogeneity against performance with the parameter α:
  \[ \text{current\_weights} = \alpha \cdot \text{norm\_het} + (1 - \alpha) \cdot \text{norm\_perf} \]
- Smoothly update the historical scores: combine the current weights with the history using the parameter β to update the curriculum difficulty scores, which avoids weight jitter when a client's performance changes abruptly:
  \[ \text{difficulty\_scores} = \beta \cdot \text{difficulty\_scores} + (1 - \beta) \cdot \text{current\_weights} \]
- Normalize the output weights: finally, apply a softmax to the updated scores so the weights sum to 1:
  \[ \text{final\_weights} = \text{softmax}(\text{difficulty\_scores}) \]
2. Weighted model aggregation
The average_models method aggregates the client models' parameters using the computed weights. The core flow:
- Compute each client's heterogeneity score:
  - The calculate_heterogeneity method derives a score from the client's node embeddings; a higher score means the client's data distribution differs more from the other clients (a possible implementation is sketched after this list).
- Compute each client's performance score:
  - The most recent loss value is used as the performance score; lower loss means better performance.
- Get the aggregation weights:
  - Call calculate_weights to combine the performance and heterogeneity scores into aggregation weights.
- Weighted parameter aggregation:
  - Iterate over the model parameters (weights, biases, etc.) and take a weighted sum over the client models.
  - Skip parameters that should not be aggregated (such as the BatchNorm statistics running_ and num_batches_tracked).
- Produce the global model:
  - The aggregated parameters form the global model used in the next training round.
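The calculate_heterogeneity implementation is not shown in this note; one plausible sketch consistent with the description above (a client is more heterogeneous the farther its mean node embedding lies from the average of the other clients' means) is:

import torch

def calculate_heterogeneity(client_embeddings):
    """client_embeddings: list of [num_nodes, dim] tensors, one per client (assumed input format)."""
    means = torch.stack([emb.mean(dim=0) for emb in client_embeddings])  # [N, dim]
    scores = []
    for i in range(means.size(0)):
        others = torch.cat([means[:i], means[i + 1:]]).mean(dim=0)
        # Larger distance from the other clients -> more heterogeneous
        scores.append(torch.norm(means[i] - others, p=2))
    return torch.stack(scores)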
Formula summary
- Normalized heterogeneity and performance scores:
  - Performance score (lower loss maps to higher weight): \[ \text{norm\_perf}_i = \text{softmax}(-\text{performance\_scores})_i \]
  - Heterogeneity score (higher heterogeneity maps to higher weight): \[ \text{norm\_het}_i = \text{softmax}(\text{heterogeneity\_scores})_i \]
- Combined aggregation weights:
  \[ \text{current\_weights}_i = \alpha \cdot \text{norm\_het}_i + (1 - \alpha) \cdot \text{norm\_perf}_i \]
- Smoothed curriculum difficulty scores:
  \[ \text{difficulty\_scores}_i = \beta \cdot \text{difficulty\_scores}_i + (1 - \beta) \cdot \text{current\_weights}_i \]
- Final aggregation weights:
  \[ \text{final\_weights}_i = \text{softmax}(\text{difficulty\_scores})_i \]
- Model parameter aggregation:
  \[ \text{Aggregated\_Param}_k = \sum_{i=1}^{N} \text{final\_weights}_i \cdot \text{Model}_i[k] \]
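A minimal sketch of calculate_weights implementing the formulas above (the signature, tensor shapes, and the α/β defaults are assumptions; self.difficulty_scores is the persistent per-client state):

import torch
import torch.nn.functional as F

def calculate_weights(self, performance_scores, heterogeneity_scores, alpha=0.5, beta=0.9):
    perf_tensor = torch.tensor(performance_scores, dtype=torch.float32)
    het_tensor = torch.tensor(heterogeneity_scores, dtype=torch.float32)
    norm_perf = F.softmax(-perf_tensor, dim=0)       # lower loss -> higher weight
    norm_het = F.softmax(het_tensor, dim=0)          # higher heterogeneity -> higher weight
    current_weights = alpha * norm_het + (1 - alpha) * norm_perf
    # Smooth update of the curriculum difficulty scores
    self.difficulty_scores = beta * self.difficulty_scores + (1 - beta) * current_weights
    return F.softmax(self.difficulty_scores, dim=0)  # final weights sum to 1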
Advantages of the improved approach
- Dynamic adaptivity:
  - The heterogeneity score accounts for the diversity of client data distributions.
  - The balance between performance and heterogeneity can be adjusted to suit different stages of training.
- Smoothness and stability:
  - The smoothing mechanism balances historical and current scores, avoiding abrupt weight changes and keeping training stable.
- Improved fairness:
  - The heterogeneity score gives more attention to low-resource or highly heterogeneous clients, mitigating fairness issues in federated learning.
- Tunability:
  - The parameters α and β let the user balance heterogeneity vs. performance and history vs. current weights for the scenario at hand.
Summary
By introducing a curriculum learning mechanism into federated learning, with dynamic weight computation and a heterogeneity score, the revised method improves the fairness, adaptivity, and stability of the aggregation step. This design better balances global model performance against individual client contributions under non-IID data distributions.
The RL setup
This reward function is designed to balance the global performance gain (average improvement) against the worst individual performance (minimum improvement), so that training improves overall performance without leaving any client behind. The computation works as follows.
Reward function input
The input is a list of per-client performance improvements (client_improvements).
- Performance improvement: usually the loss improvement of a client after the current round, e.g. previous-round loss - current loss. The larger the improvement, the more effective this round of training was.
Reward computation steps
1. Normalize the performance improvements
normalized_improvements = torch.tanh(improvements)  # Squash to [-1, 1]
- torch.tanh maps the improvement values into the range \([-1, 1]\):
  - A positive improvement (loss decreased) maps to a positive value that approaches 1 as the improvement grows.
  - A negative improvement (loss increased) maps to a negative value that approaches -1 as it worsens.
- This prevents very large improvement values (e.g. numerical blow-ups) from dominating the reward and keeps the values on a common scale.
2. Compute the reward components
The reward has two parts: the average improvement (mean_improvement) and the minimum improvement (min_improvement).
(1) Average improvement
mean_improvement = normalized_improvements.mean()
- The average performance improvement across all clients measures the global gain from this round.
- This term tracks how much the overall federated model improved.
(2) Minimum improvement
min_improvement = normalized_improvements.min()
- The worst per-client improvement measures the weakest result of this round.
- This term watches for clients that are falling behind, so the model is not optimized only for some clients while neglecting the worse-performing ones; it promotes fairness.
3. Combined reward
reward = 0.7 * mean_improvement + 0.3 * min_improvement
- The reward is a weighted combination of the two parts:
  - 70% average improvement (mean_improvement): drives overall optimization.
  - 30% minimum improvement (min_improvement): ensures the worst client is not ignored.
- The weighting favors overall performance gains while moderately balancing individual fairness.
Reward function output
return reward.item()
- The final scalar reward is returned and used by the RL algorithm to update its policy (e.g. a policy gradient or Q-value update).
Core logic summary
- Input:
  - A list of per-client performance improvement values.
- Computation:
  - Normalize the improvements to \([-1, 1]\).
  - Compute the average improvement (global improvement) and the minimum improvement (worst-case improvement).
  - Combine them with a weighted sum to obtain the final reward.
- Output:
  - The combined reward, used as feedback for the RL agent that guides the federated training strategy.
Why the reward is designed this way
- Global performance:
  - The average improvement reflects the overall training effect across all clients, ensuring the federated model keeps improving.
- Individual fairness:
  - The minimum improvement focuses on the worst-performing client, preventing training from favoring only some clients.
- Dynamic balance:
  - The 70%/30% weighting balances global performance against individual fairness.
Example
Suppose there are 3 clients with performance improvements:
- Client A: +0.5
- Client B: +0.2
- Client C: -0.1
Step 1: Normalize the improvements
- tanh(0.5) ≈ 0.462, tanh(0.2) ≈ 0.197, tanh(-0.1) ≈ -0.100
Step 2: Compute the average and minimum improvements
- Average improvement: (0.462 + 0.197 - 0.100) / 3 ≈ 0.187
- Minimum improvement: -0.100
Step 3: Compute the final reward
- reward = 0.7 × 0.187 + 0.3 × (-0.100) ≈ 0.101
The final reward of about 0.101 is the signal that guides the RL policy.
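A few lines of PyTorch reproduce this example with the reward definition above:

import torch

improvements = torch.tensor([0.5, 0.2, -0.1])
normalized = torch.tanh(improvements)           # tensor([ 0.4621,  0.1974, -0.0997])
reward = 0.7 * normalized.mean() + 0.3 * normalized.min()
print(round(reward.item(), 3))                  # ~0.101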
Summary
By combining the average and minimum improvements, this reward lets the RL policy in federated learning improve global performance while also safeguarding fairness across clients, so training does not tilt too heavily toward particular clients.
Improving the RL computation
Yes, you're right. Let's modify the ComponentSelector to include more historical information and component selection history. Here's the enhanced version:
import copy
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

class ComponentSelector(nn.Module):
    def __init__(self, num_clients, hidden_size=128, device='cuda'):
        super().__init__()
        self.device = device
        self.components = [
            'encoder',     # BiGRU encoder
            'decoder',     # Cross-attention decoder
            'sd_decoder'   # Side task decoder
        ]
        # Track component selection history
        self.component_history = deque(maxlen=4)  # Last 4 selections
        self.selection_stats = {comp: 0 for comp in self.components}  # Track selection frequency
        # Enhanced state representation:
        # client loss history + selection history + selection frequencies
        state_size = (num_clients * 4) + (len(self.components) * 4) + len(self.components)
        # Main and target networks with larger state input
        self.main_net = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Linear(hidden_size // 2, len(self.components))
        ).to(device)
        self.target_net = copy.deepcopy(self.main_net)
        # RL parameters
        self.memory = deque(maxlen=10000)
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.batch_size = 32
        self.optimizer = Adam(self.main_net.parameters())
        # For target network updates
        self.target_update_freq = 10
        self.steps = 0
    def get_state(self, client_losses):
        state = []
        # 1. Client loss history (num_clients * 4 values)
        for client in range(len(client_losses)):
            if client_losses[client]['total']:
                recent_losses = client_losses[client]['total'][-4:]
                # Left-pad with zeros if fewer than 4 losses have been recorded
                state.extend([0.0] * (4 - len(recent_losses)) + recent_losses)
            else:
                state.extend([0.0] * 4)
        # 2. Component selection history (num_components * 4 values)
        component_states = [[0.0] * len(self.components) for _ in range(4)]
        for i, hist in enumerate(self.component_history):
            if i < len(component_states):
                component_states[i] = hist
        for comp_state in component_states:
            state.extend(comp_state)
        # 3. Component selection statistics (num_components values)
        total_selections = sum(self.selection_stats.values()) + 1e-10
        selection_freqs = [self.selection_stats[comp] / total_selections
                           for comp in self.components]
        state.extend(selection_freqs)
        return torch.tensor(state, device=self.device).float()
    def select_action(self, state):
        if random.random() < self.epsilon:
            # Explore: activate each component with probability 0.5
            action = torch.bernoulli(torch.ones(len(self.components)) * 0.5).to(self.device)
        else:
            # Exploit: activate components whose Q-value is positive
            with torch.no_grad():
                q_values = self.main_net(state)
                action = (q_values > 0).float()
        # Update selection statistics
        for i, comp in enumerate(self.components):
            if action[i] > 0:
                self.selection_stats[comp] += 1
        # Update history
        self.component_history.append(action.cpu().tolist())
        return action

    def update_memory(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    def optimize(self):
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(self.memory, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        states = torch.stack(states)
        actions = torch.stack(actions)
        rewards = torch.tensor(rewards, device=self.device, dtype=torch.float32)
        next_states = torch.stack(next_states)
        dones = torch.tensor(dones, device=self.device, dtype=torch.float32)
        # Each network output is the Q-value of activating one component; the
        # state-action value is taken as the sum over the activated components.
        # DDQN: select the next activation mask with the main network...
        next_actions = (self.main_net(next_states) > 0).float()
        # ...and evaluate that mask with the target network
        next_q_values = (self.target_net(next_states) * next_actions).sum(dim=1)
        # Calculate target Q-values (no bootstrapping on terminal transitions)
        target_q_values = rewards + (1 - dones) * self.gamma * next_q_values
        # Calculate current Q-values for the actions actually taken
        current_q_values = (self.main_net(states) * actions).sum(dim=1)
        # Calculate loss and update the main network
        loss = F.mse_loss(current_q_values, target_q_values.detach())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # Update target network periodically (hard update)
        self.steps += 1
        if self.steps % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.main_net.state_dict())
        # Update epsilon
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
Key improvements:
- Enhanced State Representation:
  - Client loss history (original)
  - Component selection history (last 4 selections)
  - Component selection statistics (how often each component has been selected)
- Deeper Network Architecture:
  - Added an intermediate layer
  - Added dropout for better generalization
  - Increased state input size to accommodate the history
- Component History Tracking:
  - Added a deque to track recent selections
  - Added statistics tracking for component usage
  - Normalizes selection frequencies
- Better Target Network Update:
  - Periodic target network updates
  - Step counter for tracking updates
This enhanced version should:
- Make better decisions based on historical context
- Learn long-term patterns in component effectiveness
- Balance exploration and exploitation better
- Provide more stable learning
Would you like me to explain any part in more detail?