Overview
State, reward and action are the core elements of reinforcement learning. So when considering playing Street Fighter with DQN, the first question is how to receive the game state and how to control the player. If we could access the game’s internal variables, such as the players’ blood, their actions, and whether they are dead or alive, that would be clean and friendly to use. But to implement the idea as soon as possible, I chose a raw method (screen grabbing plus keyboard/mouse events) to save time.
Here is the game.
How to get game state?
The following code aims to get a quick grab of the game screen.
Here g_game_box is the meaningful game region. When called without arguments, ImageGrab.grab returns the whole screen. Basically, every time you open a new game it appears at the same coordinates, so I fixed the box to (142,124,911,487). You must modify it for your computer, since it will very likely differ.
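A minimal sketch of such a grab, assuming Pillow's ImageGrab module and a (left, top, right, bottom) box. The helper names box_size and grab_game_screen are made up for illustration; only the box value comes from the text above.

```python
# Game region as (left, top, right, bottom) in screen pixels.
# The value is the one quoted above and is machine-specific.
G_GAME_BOX = (142, 124, 911, 487)


def box_size(box):
    """Return (width, height) of a box, rejecting empty or inverted boxes."""
    left, top, right, bottom = box
    if right <= left or bottom <= top:
        raise ValueError("box must satisfy left < right and top < bottom")
    return right - left, bottom - top


def grab_game_screen(box=G_GAME_BOX):
    """Grab only the game region of the screen as a PIL image."""
    # Imported lazily: ImageGrab needs a desktop session to work.
    from PIL import ImageGrab

    box_size(box)  # fail early on a malformed box
    return ImageGrab.grab(bbox=box)
```

With the box above, box_size(G_GAME_BOX) gives a 769x363 region, which is a quick way to check that a box adjusted for another machine still has a sensible size.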
The im we get is the state at step t, or more specifically the observation at step t. Besides the screen image, we also need to know the players’ blood.
This time the box is a player’s blood bar region (i.e., the blue box in the following image):
check_player_blood receives the game image and the blood bar box of one player (you or the enemy) and returns that player’s remaining blood, between 0 and 100.
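One way such a function could work is to scan one pixel row of the bar and count how many pixels still show the "full" colour. This is only a sketch: the colour test in is_blood_pixel and the helper blood_percent are assumptions made for illustration and must be adapted to the actual bar colours.

```python
def is_blood_pixel(rgb):
    """Hypothetical colour test: bright yellow-ish pixels count as remaining blood."""
    r, g, b = rgb
    return r > 200 and g > 200 and b < 100


def blood_percent(row, is_filled=is_blood_pixel):
    """Share of pixels in one horizontal scan line that still show blood, as 0..100."""
    row = list(row)
    if not row:
        raise ValueError("empty blood bar row")
    filled = sum(1 for rgb in row if is_filled(rgb))
    return int(round(100.0 * filled / len(row)))


def check_player_blood(im, box):
    """Crop the bar region from a PIL image and measure its middle row."""
    bar = im.crop(box).convert("RGB")
    width, height = bar.size
    y = height // 2
    return blood_percent(bar.getpixel((x, y)) for x in range(width))
```

Measuring a single row keeps the check cheap, which matters because it runs on every step.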
Dependency
Lasagne and Theano
CUDA
To install Lasagne on Windows, you can refer to my other blog post “Lasagne On Win7 With GPU Environment & VizDoom Compilation”, written in Chinese.
How to control the player?
For the host player, W/D/A/S control the directions and J/K/U/I trigger punches and kicks. Python can simulate keyboard and mouse operations with win32api and win32con.
Keyboard simulation
Below are the virtual-key codes of the keys used:
We can give the player commands with the following code:
Then when we call make_action((1,0,0,0,0,0,0,0), 4), the player receives the “keep back” command, while make_action((0,1,0,0,0,0,0,0), 4) means “jump” and make_action((1,1,0,0,0,0,0,0), 4) means “jump back”.
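A minimal sketch of how make_action could be built on win32api.keybd_event. Several things here are assumptions: the key order in ACTION_KEYS (only "back" first and "jump" second follow from the calls above), the reading of the second argument as a number of frames to hold the keys, and the 30 frames per second used for the hold time. The virtual-key code of a letter key equals the ASCII code of its uppercase letter.

```python
import time

# Assumed mapping from action-vector position to key (hypothetical order).
ACTION_KEYS = "AWDSJKUI"


def keys_for_action(action):
    """Translate a 0/1 action vector into a list of virtual-key codes."""
    if len(action) != len(ACTION_KEYS):
        raise ValueError("action must have %d entries" % len(ACTION_KEYS))
    return [ord(key) for key, on in zip(ACTION_KEYS, action) if on]


def _win32_send(vk, key_up):
    """Send one key-down or key-up event through the Win32 API."""
    import win32api
    import win32con

    flags = win32con.KEYEVENTF_KEYUP if key_up else 0
    win32api.keybd_event(vk, 0, flags, 0)


def make_action(action, repeat, send=_win32_send, frame_seconds=1.0 / 30):
    """Press all selected keys, hold them for `repeat` frames, then release."""
    keys = keys_for_action(action)
    for vk in keys:
        send(vk, False)
    time.sleep(repeat * frame_seconds)
    for vk in keys:
        send(vk, True)
```

The send parameter exists only so the logic can be tried without touching the real keyboard, for example by passing a function that records the events.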
Mouse simulation
We do not use the mouse to control the player. However, sometimes when a player dies and the game restarts, the players’ blood bars are wrong: the bar region gets longer or shorter. If you keep observing the game for some time, this will appear. I believe it is some kind of bug, so when it occurs you had better close the game and reopen it. That is why mouse simulation is needed: to reopen the game in Chrome.
Here (1151,9) and (212,205) are screen coordinates used by the script; (212,205) is the position of index.html on the screen. The leftClick on (30,60) gives Chrome the focus.
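A left click can be sketched with win32api.SetCursorPos and win32api.mouse_event. Only the name leftClick and the coordinates come from the text; the api parameter and the pause length are assumptions added so that the function can be exercised without a real mouse.

```python
import time

# Flag values of win32con.MOUSEEVENTF_LEFTDOWN and MOUSEEVENTF_LEFTUP.
MOUSEEVENTF_LEFTDOWN = 0x0002
MOUSEEVENTF_LEFTUP = 0x0004


def leftClick(pos, api=None, pause=0.1):
    """Move the cursor to pos=(x, y) and perform one left click there."""
    if api is None:
        # Imported lazily so the module also loads on non-Windows machines.
        import win32api as api
    api.SetCursorPos(pos)
    api.mouse_event(MOUSEEVENTF_LEFTDOWN, 0, 0, 0, 0)
    time.sleep(pause)  # short pause so the page registers the press
    api.mouse_event(MOUSEEVENTF_LEFTUP, 0, 0, 0, 0)
```

For example, leftClick((30, 60)) would give Chrome the focus before any further clicks are sent.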
How to set reward?
I give the player living_reward = -1, plus -self_blood_lose and 3*enemy_blood_lose, with -200 for the player’s own death and 200 for the enemy’s death. The factor 3 on enemy_blood_lose is meant to encourage attacking.
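The reward terms above can be sketched as one function. That the terms are simply summed in a single step is an assumption, and the function name compute_reward and the constant names are made up for illustration.

```python
LIVING_REWARD = -1    # cost of each step, so the agent does not stall
ATTACK_WEIGHT = 3     # enemy blood loss counts three times
DEATH_PENALTY = -200  # the player died
KILL_BONUS = 200      # the enemy died


def compute_reward(self_blood_lose, enemy_blood_lose,
                   self_dead=False, enemy_dead=False):
    """Sum the reward terms for one step (assumed to be additive)."""
    reward = LIVING_REWARD
    reward -= self_blood_lose
    reward += ATTACK_WEIGHT * enemy_blood_lose
    if self_dead:
        reward += DEATH_PENALTY
    if enemy_dead:
        reward += KILL_BONUS
    return reward
```

For instance, losing 10 blood while taking 5 from the enemy gives -1 - 10 + 15 = 4, so trading blows is slightly positive only because of the attack weight.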
The reward mechanism is REALLY IMPORTANT, and more experiments need to be done on this topic.
Training setting
Network architecture
Input: 228x122x3
Conv: 12x12x8, stride 3, sigmoid
MaxPool: 2x2, stride 1
Conv: 7x7x16, stride 2, sigmoid
MaxPool: 2x2, stride 1
(33x13x16 = 6864 elements by estimation)
Dense: 128, sigmoid
Dense: 256 (output)
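The size of the flattened feature map can be checked with a little arithmetic. The sketch below assumes no padding and rounding down at every layer, which may not match the actual layer settings; under those assumptions it gives 32x14x16 = 7168 elements, in the same range as the estimate above.

```python
def conv_out(size, kernel, stride):
    """Output length of a conv or pooling window with no padding (floor rounding)."""
    return (size - kernel) // stride + 1


# (kernel, stride, filters); filters is None for pooling layers.
LAYERS = [
    (12, 3, 8),    # Conv 12x12x8, stride 3
    (2, 1, None),  # MaxPool 2x2, stride 1
    (7, 2, 16),    # Conv 7x7x16, stride 2
    (2, 1, None),  # MaxPool 2x2, stride 1
]


def feature_map_shape(width=228, height=122, channels=3):
    """Propagate the input size through the conv and pooling layers."""
    for kernel, stride, filters in LAYERS:
        width = conv_out(width, kernel, stride)
        height = conv_out(height, kernel, stride)
        if filters is not None:
            channels = filters
    return width, height, channels


w, h, c = feature_map_shape()
print(w, h, c, w * h * c)
```

Either way, the first dense layer receives roughly seven thousand inputs.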
Parameters
Others
To accelerate the training process, I commented out the test in every epoch and only focus on the training scores.
Also, to make training simpler, the game background is changed to white. If necessary you can set it yourself by modifying StreetFighter/images/g/behind.gif and front.gif.
Current Result
At the beginning I set epochs = 20 and learning_steps_per_epoch = 2000, and made the following attempts:
- change gray input to color input: no convergence
- change frame_repeat to 2, 4, 12: no convergence
- change reward:
  - add death penalty: no convergence
  - add living reward: no convergence
  - give more reward on attack: no convergence
- change relu to sigmoid: no convergence
- add maxpooling layer: no convergence
- change the input size: no convergence
- change the keys to available actions
- try dueling dqn
After sweeping these parameters, I think the problem is that the total number of training steps (minibatch weight updates) is not enough. As the paper “Playing Atari with Deep Reinforcement Learning” points out, training takes 100 epochs x 50000 minibatch updates. Compared with that, my setting of 20 epochs x 2000 learning steps (40,000 steps against 5,000,000) is too naive. So I am now running experiments with a new setting.
REALLY HOPING FOR GOOD PERFORMANCE, OR AT LEAST CONVERGENCE!!!!
The Rainbow comes after the storm.
Reference
- Project
- Playing Atari with Deep Reinforcement Learning
- How to build a python bot that can play web games
- Street Fighter
- VizDoom
- virtual keys value