
CS | An Introduction to Reinforcement Learning



  • What is reinforcement learning
  • Characteristics of reinforcement learning
  • Components
  • The learning process
  • Q-learning


1.1 Supervised Learning

Supervised learning: the labels of the training data are known and represent the correct results.

Task: based on the corresponding labels, learn the input-to-output mapping on the training set, so as to predict results as correctly as possible on samples whose labels are unknown

Applications: classification and regression problems

1.2 Unsupervised Learning

Unsupervised learning: the labels of the training data are unknown.

Task: discover hidden structures from unlabeled datasets

Applications: clustering and similar tasks, which group similar data together

1.3 Reinforcement Learning

Reinforcement learning: the data carry no labels, but feedback from the environment is needed at each step.

The agent acts based on feedback from the environment: through continuous trial-and-error interaction with the environment, it learns to act so that its overall behavior maximizes the benefit. The feedback at each step, whether reward or punishment, can be quantified, and the agent adjusts its behavior based on that feedback.
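The interaction loop described above can be sketched in a few lines. The toy environment below and its reset/step interface are illustrative assumptions (modeled on common RL toolkits), not something from the original text:

```python
import random

class CoinEnv:
    """Toy environment: guess a coin flip; feedback is +1 if correct, -1 otherwise."""

    def reset(self):
        self.secret = random.choice([0, 1])
        return 0  # a single dummy state

    def step(self, action):
        # The environment quantifies its feedback as a reward/punishment.
        reward = 1 if action == self.secret else -1
        next_state = self.reset()          # a fresh flip each step
        return next_state, reward, False   # (state, reward, done)

env = CoinEnv()
state = env.reset()
total_reward = 0
for _ in range(100):                   # continuous trial-and-error interaction
    action = random.choice([0, 1])     # act (here: purely at random)
    state, reward, done = env.step(action)
    total_reward += reward             # accumulate the quantified feedback
print("total reward over 100 steps:", total_reward)
```

A learning agent would replace the random action choice with a policy that is adjusted based on the rewards received.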




2.1 Trial-and-Error Learning

Reinforcement learning requires the training object to interact constantly with the environment and to work out the best behavioral decision at each step through trial and error. There is no guidance whatsoever: all learning is driven by environmental feedback, on the basis of which the training object adjusts its behavioral decisions.

2.2 Delayed Feedback

During reinforcement learning, the training object's trial-and-error behavior produces feedback from the environment, but in general that feedback only arrives after a complete training episode. Improved training procedures therefore usually decompose the process and try to attribute the feedback to each individual step.

2.3 Time Is an Important Factor

The sequence of environment state changes and environmental feedback in reinforcement learning is strongly tied to time: the whole training process evolves over time, and the state and feedback keep changing, so time is an important factor in reinforcement learning.

2.4 Current Actions Affect the Data Received Later

In supervised and semi-supervised learning, the training samples are independent of each other, with no correlation between them. In reinforcement learning, however, the current state and the action taken affect the next state received, so there is correlation between consecutive data points.


  • Agent: the main subject of reinforcement learning training; in Pacman, it is the little yellow man.

  • Environment: all the elements of the game make up the environment; in Pacman, the Agent, the Ghosts, the beans, and each dividing wall together constitute the whole environment.

  • State: the current state of the Environment and the Agent. Because the Ghosts move, the number of beans changes, and the Agent's position changes, the State is constantly changing; note in particular that the State covers the state of both the Agent and the Environment.

  • Action: the actions the Agent can take given the current state, such as the direction of movement in this example. Action is strongly tied to State: for example, many positions in the figure above are walled off, so in such a state the Agent clearly cannot move up or down, only left or right.

  • Reward: the feedback the Agent receives from the environment after taking a specific action in the current state. Although the word "reward" suggests something positive, in reinforcement learning Reward is a general term for the environment's feedback, which may be a reward or a punishment. In Pacman, for example, the environment gives a punishment when the Agent meets a Ghost, and a reward when the Agent eats a bean.


The entire training process is based on the premise that we consider the whole process to conform to a Markov Decision Process (MDP).

The core idea of an MDP: the next State depends only on the current State and the Action taken in the current State; it looks back only one step. For example, State3 above depends only on State2 and Action2. If we know the current State and the Action about to be taken, we can derive the next State without tracing back through earlier States and Actions. For example, when AlphaGo plays Go, knowing the current board position and where the next stone is about to be placed tells us exactly what the next position will be.
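The Markov property can be illustrated with a tiny deterministic transition function. The one-dimensional track and its movement rule below are made-up assumptions for illustration: however the current state was reached, the same (state, action) pair always yields the same next state.

```python
def next_state(state, action):
    """Deterministic transition on a 1-D track with positions 0..4."""
    if action == "right":
        return min(state + 1, 4)
    if action == "left":
        return max(state - 1, 0)
    return state

# Two different histories that both end at position 2:
path_a = ["right", "right"]
path_b = ["right", "left", "right", "right"]

s_a = 0
for a in path_a:
    s_a = next_state(s_a, a)

s_b = 0
for a in path_b:
    s_b = next_state(s_b, a)

# The next state depends only on (current state, action), not on the path taken.
print(next_state(s_a, "right") == next_state(s_b, "right"))  # True
```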


Q-value (State, Action): a Q-value is determined by a State and an Action together. In a real project we store a table, called the Q-table: the key is (state, action) and the value is the corresponding Q-value. Whenever the Agent enters a state, it queries this table, selects the Action with the largest value in the current state, and executes that action; it then moves to the next state and queries the table again to choose the next action. The purpose of Q-values is to guide the Agent in choosing which action to take in each state.
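The table lookup just described can be sketched directly. The states, actions, and Q-values below are made-up illustrative numbers:

```python
# Q-table keyed by (state, action), as described above.
q_table = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.7,
    ("s1", "left"): 0.4, ("s1", "right"): 0.2,
}
actions = ["left", "right"]

def best_action(state):
    """On entering a state, pick the action with the largest Q-value there."""
    return max(actions, key=lambda a: q_table[(state, a)])

print(best_action("s0"))  # right
print(best_action("s1"))  # left
```

In practice, pure greedy selection like this is usually mixed with some exploration (e.g. occasionally taking a random action) so the Agent can discover better entries for the table.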

💡 How do we know which States the Agent will encounter during training, and which Actions are available in each State? Most importantly of all, how are the Q-values for each (State, Action) pair learned through training?

The core idea of the Bellman equation: when making a decision at a particular time and in a particular state, we should consider not only the immediate Reward produced by the current decision, but also the sustained future Reward that the current decision gives rise to.
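A minimal sketch of how this idea becomes the standard Q-learning update: the new Q-value mixes the immediate Reward with the discounted best Q-value of the next State. The learning rate and discount factor values here are illustrative assumptions:

```python
ALPHA = 0.5   # learning rate: how much each update moves the old estimate
GAMMA = 0.9   # discount factor: weight given to future Reward

def q_update(q_table, state, action, reward, next_state, actions):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)

q = {}  # Q-table starts empty; missing entries are treated as 0.0
q_update(q, "s0", "right", 1.0, "s1", ["left", "right"])
print(q[("s0", "right")])  # 0.5, i.e. 0.0 + 0.5 * (1.0 + 0.9 * 0.0 - 0.0)
```

The `reward` term captures the immediate feedback, while the `GAMMA * best_next` term carries the sustained future Reward back into the current (State, Action) entry, which is exactly the Bellman idea stated above.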