Mountain Car is a classic control environment. The player's goal is to drive the car up the hill and reach the flag. The only available actions are to push left, push right, or apply no force. The game ends when the car reaches the flag (position >= 0.5) or when the episode step limit is exceeded.
Video results:
The solution is based on the deep Q-learning (DQN) algorithm. The model has the following structure:
from tensorflow.keras.models import Model
import tensorflow.keras.layers as layers


def build_model(states, actions):
    # Input shape is (1, states) because the DQN agent feeds observations
    # with a window length of 1; Flatten removes that extra dimension.
    inputs = layers.Input(shape=(1, states))
    x = layers.Dense(64, activation="relu")(inputs)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Flatten()(x)
    # One linear output per action: the estimated Q-value of that action.
    outputs = layers.Dense(actions, activation="linear")(x)
    return Model(inputs, outputs, name="mountain_car_player")
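As a quick sanity check, the model can be instantiated directly from the environment's spaces. In MountainCar-v0 the observation has two components (position and velocity) and there are three discrete actions (push left, no push, push right); the snippet below is only a usage example:

import gym

env = gym.make("MountainCar-v0")
states = env.observation_space.shape[0]  # 2: car position and velocity
actions = env.action_space.n             # 3: push left, no push, push right
model = build_model(states, actions)
model.summary()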
The model was trained for 100,000 steps with the Adam optimizer (learning rate 1e-3) and a BoltzmannQPolicy, using a checkpoint callback to save the model that achieves the best reward (a minimal sketch of the training setup is given at the end of this section). The main problem of the game is the reward function: it gives -1 for every step until the player reaches the goal, but it provides no useful signal about how to get there. Using theoretical material from a paper and a Medium post, I performed reward shaping as follows:
from gym.envs.classic_control.mountain_car import MountainCarEnv


class MountainCarModifiedReward(MountainCarEnv):
    def step(self, action: int):
        previous_state = self.state
        new_state, reward, done, info = super().step(action)
        # Potential-based shaping: reward an increase in absolute velocity,
        # using potential Phi(s) = 300 * |velocity| and discount 0.95.
        modified_reward = reward + 300 * (0.95 * abs(new_state[1]) - abs(previous_state[1]))
        # Bonus for actually reaching the flag (position >= 0.5).
        if new_state[0] >= 0.5:
            modified_reward += 100
        return new_state, modified_reward, done, info
The main idea is that to climb the hill the car needs high velocity, so I used potential-based reward shaping to encourage building up speed: the potential function is Phi(s) = 300 * |velocity|, and the shaping term added to the reward is 0.95 * Phi(s') - Phi(s). I also added a +100 reward for actually winning the game, so the agent is pushed to reach the flag rather than to accumulate score from the velocity bonus alone.
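The training code itself is not reproduced above; below is a minimal sketch of the setup described, assuming the keras-rl2 library (DQNAgent, BoltzmannQPolicy, SequentialMemory). The replay-memory size, warmup steps, target-model update rate, and weights file name are illustrative assumptions, and the best-reward checkpoint callback is omitted for brevity.

from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy

env = MountainCarModifiedReward()
states = env.observation_space.shape[0]
actions = env.action_space.n

# DQN agent with the Boltzmann exploration policy described above.
agent = DQNAgent(
    model=build_model(states, actions),
    nb_actions=actions,
    policy=BoltzmannQPolicy(),
    memory=SequentialMemory(limit=50000, window_length=1),  # assumed buffer size
    nb_steps_warmup=1000,        # assumed warmup before learning starts
    target_model_update=1e-2,    # assumed soft target-network update rate
)
agent.compile(Adam(learning_rate=1e-3), metrics=["mae"])

# Train for 100,000 steps on the reward-shaped environment.
agent.fit(env, nb_steps=100000, visualize=False, verbose=1)
agent.save_weights("dqn_mountain_car_weights.h5f", overwrite=True)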