Q-Learning Implementation in Java (with Code)
Step-By-Step Tutorial
This tutorial introduces the concept of Q-learning through a simple but comprehensive numerical example. The example describes an agent which uses unsupervised training to learn about an unknown environment. You might also find it helpful to compare this example with the accompanying source code examples.
Suppose we have 5 rooms in a building connected by doors as shown in the figure below. We'll number each room 0 through 4. The outside of the building can be thought of as one big room (5). Notice that the doors of rooms 1 and 4 lead into the building from room 5 (outside).
We can represent the rooms on a graph, each room as a node, and each door as a link.
For this example, we'd like to put an agent in any room and have it go from that room to outside the building (this will be our target room). In other words, the goal room is number 5. To set this room as a goal, we'll associate a reward value with each door (i.e. link between nodes). The doors that lead immediately to the goal have an instant reward of 100. Other doors not directly connected to the target room have zero reward. Because doors are two-way (0 leads to 4, and 4 leads back to 0), two arrows are assigned to each door, one per direction. Each arrow carries an instant reward value, as shown below:
Of course, Room 5 loops back to itself with a reward of 100, and all other direct connections to the goal room carry a reward of 100. In Q-learning, the goal is to reach the state with the highest reward, so that if the agent arrives at the goal, it will remain there forever. This type of goal is called an "absorbing goal".
Imagine our agent as a dumb virtual robot that can learn through experience. The agent can pass from one room to another but has no knowledge of the environment, and doesn't know which sequence of doors leads to the outside.
Suppose we want to model some kind of simple evacuation of an agent from any room in the building. Now suppose we have an agent in Room 2 and we want the agent to learn to reach outside the house (5).
The terminology in Q-Learning includes the terms "state" and "action".
We'll call each room, including outside, a "state", and the agent's movement from one room to another will be an "action". In our diagram, a "state" is depicted as a node, while "action" is represented by the arrows.
Suppose the agent is in state 2. From state 2, it can go to state 3 because state 2 is connected to 3. From state 2, however, the agent cannot directly go to state 1 because there is no direct door connecting room 1 and 2 (thus, no arrows). From state 3, it can go either to state 1 or 4 or back to 2 (look at all the arrows about state 3). If the agent is in state 4, then the three possible actions are to go to state 0, 5 or 3. If the agent is in state 1, it can go either to state 5 or 3. From state 0, it can only go back to state 4.
We can put the state diagram and the instant reward values into the following reward table, "matrix R".
The -1's in the table represent null values (i.e., where there isn't a link between nodes). For example, State 0 cannot go to State 1.
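As a minimal sketch, the reward table can be written as a Java array; the values match the R array in the source code at the end of this article. The class name RewardMatrix and the helper validActions are illustrative names chosen here, not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class RewardMatrix {
    // R[state][action]: -1 = no door, 0 = door without reward, 100 = door to the goal.
    static final int[][] R = {
        {-1, -1, -1, -1,  0,  -1},
        {-1, -1, -1,  0, -1, 100},
        {-1, -1, -1,  0, -1,  -1},
        {-1,  0,  0, -1,  0,  -1},
        { 0, -1, -1,  0, -1, 100},
        {-1,  0, -1, -1,  0, 100}
    };

    // Returns every action (next room) that is reachable from the given state.
    static List<Integer> validActions(int state) {
        List<Integer> actions = new ArrayList<>();
        for (int action = 0; action < R[state].length; action++) {
            if (R[state][action] >= 0) {
                actions.add(action);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        for (int state = 0; state < R.length; state++) {
            System.out.println("State " + state + " -> " + validActions(state));
        }
    }
}
```

Reading a row of R this way reproduces the arrows of the state diagram: a state can only move to the columns whose entry is not -1.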
Now we'll add a similar matrix, "Q", to the brain of our agent, representing the memory of what the agent has learned through experience. The rows of matrix Q represent the current state of the agent, and the columns represent the possible actions leading to the next state (the links between the nodes).
The agent starts out knowing nothing, so the matrix Q is initialized to zero. In this example, for simplicity of explanation, we assume the number of states is known (to be six). If we didn't know how many states were involved, the matrix Q could start out with only one element. It is a simple task to add more columns and rows to matrix Q if a new state is found.
The transition rule of Q learning is a very simple formula:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
According to this formula, the value assigned to a specific element of matrix Q equals the corresponding value in matrix R plus the learning parameter Gamma multiplied by the maximum value of Q over all possible actions in the next state.
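The transition rule can be sketched as a single Java method. This is an illustrative sketch, not the article's own code: the class QUpdate, the method updatedQ and the sample numbers in main are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class QUpdate {
    // Q(state, action) = R(state, action) + gamma * Max[Q(next state, all actions)]
    // reward     : R(state, action)
    // nextStateQ : the row of Q that belongs to the next state
    static double updatedQ(double reward, double[] nextStateQ, double gamma) {
        double maxNext = Arrays.stream(nextStateQ).max().orElse(0.0);
        return reward + gamma * maxNext;
    }

    public static void main(String[] args) {
        // Made-up numbers: a zero-reward door into a room whose best known Q value is 100.
        double[] nextStateQ = {0.0, 64.0, 100.0};
        System.out.println(updatedQ(0.0, nextStateQ, 0.8));
    }
}
```

With these sample numbers the new value is 0 + 0.8 * 100, so a door that earns nothing by itself still becomes attractive when it leads to a room with a high Q value.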
Our virtual agent will learn through experience, without a teacher (this is called unsupervised learning). The agent will explore from state to state until it reaches the goal. We'll call each exploration an episode. Each episode consists of the agent moving from the initial state to the goal state. Each time the agent arrives at the goal state, the program goes to the next episode.
The Q-Learning algorithm goes as follows:
1. Set the gamma parameter, and environment rewards in matrix R.
2. Initialize matrix Q to zero.
3. For each episode:
   Select a random initial state.
   Do While the goal state hasn't been reached.
      - Select one among all possible actions for the current state.
      - Using this possible action, consider going to the next state.
      - Get maximum Q value for this next state based on all possible actions.
      - Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
      - Set the next state as the current state.
   End Do
End For
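The steps above can be sketched compactly in Java. The sketch follows the pseudocode literally, so every episode stops as soon as the goal is reached and row 5 of Q is never updated; the full program at the end of this article instead keeps exploring for a few extra steps after reaching the goal. The class QTrainingSketch, the use of double values, the episode count and the random seed are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of the training loop, for illustration only.
final class QTrainingSketch {
    static final int GOAL = 5;
    // Step 1: environment rewards (same values as matrix R; -1 = no door).
    static final int[][] R = {
        {-1, -1, -1, -1,  0,  -1},
        {-1, -1, -1,  0, -1, 100},
        {-1, -1, -1,  0, -1,  -1},
        {-1,  0,  0, -1,  0,  -1},
        { 0, -1, -1,  0, -1, 100},
        {-1,  0, -1, -1,  0, 100}
    };

    static double[][] train(double gamma, int episodes, long seed) {
        Random random = new Random(seed);
        int n = R.length;
        double[][] q = new double[n][n];        // Step 2: Q starts at zero.
        for (int e = 0; e < episodes; e++) {    // Step 3: one pass per episode.
            int state = random.nextInt(n);      // Random initial state.
            while (state != GOAL) {
                // Select one among all possible actions (existing doors).
                int action;
                do {
                    action = random.nextInt(n);
                } while (R[state][action] < 0);
                // Maximum Q value of the next state. Starting from 0 is safe
                // here because no Q value in this example is negative.
                double maxNext = 0.0;
                for (double value : q[action]) {
                    maxNext = Math.max(maxNext, value);
                }
                q[state][action] = R[state][action] + gamma * maxNext;
                state = action;                 // Next state becomes current state.
            }
        }
        return q;
    }

    public static void main(String[] args) {
        for (double[] row : train(0.8, 1000, 42L)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```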
The algorithm above is used by the agent to learn from experience. Each episode is equivalent to one training session. In each training session, the agent explores the environment (represented by matrix R) and receives rewards (if any) until it reaches the goal state. The purpose of the training is to enhance the 'brain' of our agent, represented by matrix Q. More training results in a more optimized matrix Q. Once matrix Q has been enhanced, the agent no longer wanders back and forth through the same rooms; instead, it follows the fastest route to the goal state.
The Gamma parameter has a range of 0 to 1 (0 <= Gamma < 1). If Gamma is closer to zero, the agent will tend to consider only immediate rewards. If Gamma is closer to one, the agent will weight future rewards more heavily and be willing to delay the reward.
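To make the effect of Gamma concrete: with the transition rule above, zero reward on the intermediate doors, and Q values at the goal itself left at zero, a reward that lies n further steps away contributes reward * Gamma^n to the current Q value. The sketch below tabulates this for a few values; the class DiscountDemo and the chosen Gamma values are assumptions made for illustration.

```java
// Hypothetical helper class, for illustration only.
final class DiscountDemo {
    // Value today of a reward that is still `stepsAway` further moves in the future.
    static double discountedReward(double reward, double gamma, int stepsAway) {
        return reward * Math.pow(gamma, stepsAway);
    }

    public static void main(String[] args) {
        double[] gammas = {0.1, 0.5, 0.8};
        for (double gamma : gammas) {
            StringBuilder line = new StringBuilder("Gamma = " + gamma + ":");
            for (int steps = 0; steps <= 3; steps++) {
                line.append(String.format(" %.2f", discountedReward(100.0, gamma, steps)));
            }
            System.out.println(line);
        }
    }
}
```

A small Gamma makes the reward fade almost immediately with distance, while a Gamma near one keeps distant rewards visible to the agent.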
To use the matrix Q, the agent simply traces the sequence of states from the initial state to the goal state. At each step, the algorithm picks the action with the highest value recorded in matrix Q for the current state:
Algorithm to utilize the Q matrix:
1. Set current state = initial state.
2. From current state, find the action with the highest Q value.
3. Set current state = next state.
4. Repeat Steps 2 and 3 until current state = goal state.
The algorithm above will return the sequence of states from the initial state to the goal state.
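A minimal Java sketch of this lookup follows. The class GreedyPath, the step cap and the matrix EXAMPLE_Q are assumptions made for illustration; EXAMPLE_Q holds the values the transition rule yields with Gamma = 0.8 when every episode stops at the goal, and is not output copied from the article's program.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class GreedyPath {
    // Illustrative Q values (Gamma = 0.8, episodes ending at the goal).
    static final double[][] EXAMPLE_Q = {
        { 0.0,  0.0,  0.0,  0.0, 80.0,   0.0},
        { 0.0,  0.0,  0.0, 64.0,  0.0, 100.0},
        { 0.0,  0.0,  0.0, 64.0,  0.0,   0.0},
        { 0.0, 80.0, 51.2,  0.0, 80.0,   0.0},
        {64.0,  0.0,  0.0, 64.0,  0.0, 100.0},
        { 0.0,  0.0,  0.0,  0.0,  0.0,   0.0}
    };

    // Follows the highest Q value from state to state until the goal is reached.
    static List<Integer> trace(double[][] q, int initialState, int goalState) {
        List<Integer> path = new ArrayList<>();
        int state = initialState;                 // Step 1
        path.add(state);
        // Cap the number of moves so an untrained Q matrix cannot loop forever.
        int maxSteps = q.length * q.length;
        for (int step = 0; step < maxSteps && state != goalState; step++) {
            int best = 0;                         // Step 2: action with the highest Q value
            for (int action = 1; action < q[state].length; action++) {
                if (q[state][action] > q[state][best]) {
                    best = action;                // Ties keep the lowest-numbered action.
                }
            }
            state = best;                         // Step 3
            path.add(state);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(trace(EXAMPLE_Q, 2, 5));
    }
}
```

From state 3 the values for going to state 1 and to state 4 are equal, so either route is equally short; this sketch simply takes the first one it finds.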
Q-Learning Example By Hand
To understand how the Q-learning algorithm works, we'll go through a few episodes step by step. The rest of the steps are illustrated in the source code examples.
We'll start by setting the value of the learning parameter Gamma = 0.8, and the initial state as Room 1.
Initialize matrix Q as a zero matrix:
Look at the second row (state 1) of matrix R. There are two possible actions for the current state 1: go to state 3, or go to state 5. By random selection, we select to go to 5 as our action.
Now let's imagine what would happen if our agent were in state 5. Look at the sixth row of the reward matrix R (i.e. state 5). It has 3 possible actions: go to state 1, 4 or 5.
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
Since matrix Q is still initialized to zero, Q(5, 1), Q(5, 4) and Q(5, 5) are all zero. The result of this computation for Q(1, 5) is 100 because of the instant reward from R(1, 5).
The next state, 5, now becomes the current state. Because 5 is the goal state, we've finished one episode. Our agent's brain now contains an up
import java.util.Random;

public class QLearning1 {

    private static final int Q_SIZE = 6;
    private static final double GAMMA = 0.8;
    private static final int ITERATIONS = 10;
    private static final int[] INITIAL_STATES = {1, 3, 5, 2, 4, 0};

    // Reward matrix R[state][action]; -1 marks a missing door.
    private static final int[][] R = {
        {-1, -1, -1, -1,  0,  -1},
        {-1, -1, -1,  0, -1, 100},
        {-1, -1, -1,  0, -1,  -1},
        {-1,  0,  0, -1,  0,  -1},
        { 0, -1, -1,  0, -1, 100},
        {-1,  0, -1, -1,  0, 100}
    };

    // One shared generator instead of a new Random on every draw.
    private static final Random RANDOM = new Random();

    private static int[][] q = new int[Q_SIZE][Q_SIZE];
    private static int currentState = 0;

    private static void train() {
        initialize();

        // Perform training, starting at all initial states.
        for (int j = 0; j < ITERATIONS; j++) {
            for (int i = 0; i < Q_SIZE; i++) {
                episode(INITIAL_STATES[i]);
            }
        }

        System.out.println("Q Matrix values:");
        for (int i = 0; i < Q_SIZE; i++) {
            for (int j = 0; j < Q_SIZE; j++) {
                System.out.print(q[i][j] + ",\t");
            }
            System.out.print("\n");
        }
        System.out.print("\n");
    }

    private static void test() {
        // Perform tests, starting at all initial states.
        System.out.println("Shortest routes from initial states:");
        for (int i = 0; i < Q_SIZE; i++) {
            currentState = INITIAL_STATES[i];
            int newState = 0;
            do {
                newState = maximum(currentState, true);
                System.out.print(currentState + ", ");
                currentState = newState;
            } while (currentState < 5);
            System.out.print("5\n");
        }
    }

    private static void episode(final int initialState) {
        currentState = initialState;

        // Travel from state to state until goal state is reached.
        // (The loop must continue while the goal has NOT been reached.)
        do {
            chooseAnAction();
        } while (currentState != 5);

        // When currentState = 5, run through the set once more for convergence.
        for (int i = 0; i < Q_SIZE; i++) {
            chooseAnAction();
        }
    }

    private static void chooseAnAction() {
        // Randomly choose a possible action connected to the current state.
        int possibleAction = getRandomAction(Q_SIZE);

        if (R[currentState][possibleAction] >= 0) {
            q[currentState][possibleAction] = reward(possibleAction);
            currentState = possibleAction;
        }
    }

    private static int getRandomAction(final int upperBound) {
        int action = 0;
        boolean choiceIsValid = false;

        // Keep drawing until the action corresponds to an existing door.
        while (!choiceIsValid) {
            // Get a random value between 0 (inclusive) and upperBound (exclusive).
            action = RANDOM.nextInt(upperBound);
            if (R[currentState][action] > -1) {
                choiceIsValid = true;
            }
        }
        return action;
    }

    private static void initialize() {
        for (int i = 0; i < Q_SIZE; i++) {
            for (int j = 0; j < Q_SIZE; j++) {
                q[i][j] = 0;
            }
        }
    }

    private static int maximum(final int state, final boolean returnIndexOnly) {
        // If returnIndexOnly is true, the Q matrix index is returned.
        // If returnIndexOnly is false, the Q matrix value is returned.
        int winner = 0;
        // A single pass finds the first index holding the largest value.
        for (int i = 1; i < Q_SIZE; i++) {
            if (q[state][i] > q[state][winner]) {
                winner = i;
            }
        }
        return returnIndexOnly ? winner : q[state][winner];
    }

    private static int reward(final int action) {
        // Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)],
        // truncated to an int because the Q matrix stores integers.
        return (int) (R[currentState][action] + (GAMMA * maximum(action, false)));
    }

    public static void main(String[] args) {
        train();
        test();
    }
}