- similar to the previous paper ([1312.5602] Playing Atari with Deep Reinforcement Learning)
- in [1312.5602] Playing Atari with Deep Reinforcement Learning, the target Q was generated from the same network being trained (its parameters only one gradient step behind); here they prepare two networks
- the online network estimates the current Q and is updated by a gradient descent step on the squared error, while the target network's parameters are copied from the online network only every C steps (C is a hyperparameter)
- in standard Q-learning, an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a, and hence increases the target y; this feedback loop can cause oscillations or divergence of the policy. Freezing the target network for C steps adds a delay between an update to Q and its effect on the targets, preventing this.
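The target-network idea above can be sketched as follows. This is a minimal illustration with a hypothetical linear Q-function and randomly generated transitions, not the paper's convolutional network or replay memory; the point is only that the TD target is computed from frozen weights that are refreshed every C steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
C = 100            # target-network update period (hyperparameter, "C" in the notes)
gamma, lr = 0.99, 0.01

W = rng.normal(scale=0.1, size=(n_actions, n_features))  # online network
W_target = W.copy()                                      # frozen target network

def q_values(weights, s):
    # linear Q-function: one row of weights per action
    return weights @ s

for step in range(1, 501):
    # fabricated transition (s, a, r, s2) purely for illustration
    s = rng.normal(size=n_features)
    a = int(rng.integers(n_actions))
    r = float(rng.normal())
    s2 = rng.normal(size=n_features)

    # the TD target uses the FROZEN weights, so it does not move with W
    y = r + gamma * q_values(W_target, s2).max()
    td_error = y - q_values(W, s)[a]

    # gradient descent step on the squared error (y - Q(s, a))^2
    W[a] += lr * td_error * s

    # every C steps, copy the online weights into the target network
    if step % C == 0:
        W_target = W.copy()
```

Because the last iteration (step 500) is a multiple of C, the two networks coincide when the loop ends; in between syncs, the targets stay fixed even as W changes.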