CS440/ECE448 Spring 2019
Assignment 4: Reinforcement Learning and Deep Learning
Deadline: Monday, April 22, 11:59:59PM
Part 1: Q-learning (Snake)
Created by Ningkai Wu, Yiran Kang

Snake is a famous video game that originated in the 1976 arcade game Blockade. The player uses the up, down, left, and right keys to control a snake that grows in length whenever it eats food, with the snake's own body and the walls around the environment being the primary obstacles. In this assignment, you will train an AI agent using reinforcement learning to play a simple version of Snake. You will implement a TD version of the Q-learning algorithm.

[Image from Wikipedia]

Provided Snake Environment

The entire game board is 560x560 pixels. The green rectangle is the snake agent and the red rectangle is the food. The snake's head is drawn with a thicker border for easier recognition. A new piece of food is generated at a random position on the board once the current food is eaten. Each side of the wall (filled with blue) is 40 pixels thick, and the snake head, each body segment, and the food are all 40x40. The snake moves 40 pixels per frame. The area the snake can actually move in is therefore 480x480 and can be treated as a 12x12 grid. Every time the snake eats a piece of food, the score increases by 1 and its body grows by one segment.

Before implementing the Q-learning algorithm, we must first define Snake as a Markov Decision Process (MDP). Note that in Q-learning, the state variables do not need to describe the whole board; they only need to carry enough information for the agent to make decisions (so once you get the environment state, you need to convert it to the state space defined below). Also, the smaller the state space, the more quickly the agent will be able to explore all of it.
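Since the playable area is 480x480 with 40x40 cells, it can help to reason about positions as cells of a 12x12 grid. Below is a minimal sketch of that conversion; the helper name to_grid and the WALL/CELL constants are illustrative assumptions, not part of the provided code.

```python
# Hypothetical helper: map a pixel position to a cell of the 12x12 grid,
# assuming the 40-pixel wall on every side and 40x40 cells described above.
WALL = 40   # wall thickness on every side
CELL = 40   # size of a snake segment / food square

def to_grid(x, y):
    """Map a pixel coordinate in the 480x480 playable area to (col, row) in 0..11."""
    return (x - WALL) // CELL, (y - WALL) // CELL

print(to_grid(40, 40))    # top-left playable cell   -> (0, 0)
print(to_grid(480, 480))  # bottom-right playable cell -> (11, 11)
```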
Q-learning Agent

In this part of the assignment, you will create a snake agent that learns how to get as much food as possible without dying. In order to do this, you must use Q-learning. Implement the TD Q-learning algorithm and train it on the MDP outlined above:

$$ Q(s,a) \leftarrow Q(s,a) + \alpha\left(R(s)+\gamma \max_{a'}Q(s',a')-Q(s,a)\right) $$

Also, use the exploration policy mentioned in class, with R+ = 1. During training, your agent needs to update the Q-table first (this step is skipped when the initial state and action are None), get the next action using the exploration policy, and then update the N-table with that action. If the game is over, i.e., when the dead variable becomes true, you only need to update your Q-table and reset the game. During testing, your agent only needs to return the best action according to the Q-table.

Train the agent for as long as you deem necessary, tracking the average number of points it earns. Your average over 1000 test games should be at least 20. For grading purposes, please submit code that uses the above exploration policy, state configurations, and reward model. We will initialize your agent class with different parameters (Ne, C, gamma), initialize the environment with different initial snake and food positions, and compare the Q-table produced at the point when the first food is eaten during training (see snake_main.py for how the Q-table is generated).

Once you have this working, you will need to adjust the learning rate α (how about a fixed learning rate or another C value?), the discount factor γ, and the settings you use to trade off exploration vs. exploitation. In your report, please include the values of α, γ, and any parameters for your exploration settings that you used, and discuss how you obtained these values. What changes happen in the game when you adjust any of these variables? How many games does your agent need to simulate before it learns an optimal policy? After your Q-learning seems to have converged to a good policy, run your algorithm on a large number of test games (at least 1000) and report the average number of points.

In addition to discussing these things, try adjusting the state configurations that were defined above. If you think it would be beneficial, you may also change the reward model to provide more informative feedback to the agent. Try to find modifications that allow the agent to learn a better policy than the one you found before. In your report, describe the changes you made and the new number of points the agent was able to get. What effect did this have on the time it takes to train your agent? Include any other interesting observations.

Tips
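As a concrete illustration, here is a minimal sketch of the exploration function and the TD update. The (state, action)-indexed Q/N tables, the function names, and the decaying learning rate alpha = C / (C + N(s, a)) are assumptions made for illustration, not the required interface of the provided agent class.

```python
# Hypothetical sketch of the exploration function (R+ = 1) and the TD update.
def exploration(q, n, Ne, R_plus=1.0):
    """f(u, n): act optimistically while an action has been tried fewer than Ne times."""
    return R_plus if n < Ne else q

def td_update(Q, N, s, a, reward, s_next, C, gamma):
    """One TD Q-learning step for the transition (s, a) -> s_next with the given reward.
    Q and N are assumed to be dict-of-dict tables indexed as Q[state][action]."""
    alpha = C / (C + N[s][a])            # assumed decaying learning rate
    best_next = max(Q[s_next].values())  # max_{a'} Q(s', a')
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
```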
Debug Convenience

For debugging convenience, we provide three example Q-tables for you. Each Q-table is generated exactly after the snake eats its first food during training, that is, the first time the snake reaches exactly 1 point; see how the Q-table is generated and saved during training in snake_main.py. For example, you can run diff checkpoint.npy checkpoint1.npy to see whether there is a difference. The only difference between these three debug examples is the setting of the parameters (the initial positions of the snake head and food, Ne, C, and gamma). Notice that to pass the autograder, when the scores of actions from the exploration function are equal, the tie-breaking priority should be right > left > down > up.
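One way to implement that tie-breaking rule is to scan the actions in priority order and replace the current best only on a strict improvement. This is a minimal sketch that reuses the hypothetical exploration() helper and (state, action)-indexed tables from the previous sketch; it is not the required interface.

```python
def choose_action(Q, N, s, Ne):
    """Return the action maximizing the exploration function, breaking ties by priority."""
    best_action, best_value = None, float("-inf")
    for a in ["right", "left", "down", "up"]:   # scan in tie-breaking priority order
        value = exploration(Q[s][a], N[s][a], Ne)
        if value > best_value:                  # strict '>' keeps the earlier, higher-priority action on ties
            best_action, best_value = a, value
    return best_action
```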
Note that for one part of the autograder, we will run your training process with different parameter settings and compare the Q-table generated exactly when the snake first reaches 1 point in training with ours. Making sure you can pass these debug examples will therefore help you a lot in passing this part of the autograder. For the other part of the autograder, we will test your performance using your q_agent.npy and agent.py. An average of over 20 points on 1000 test games should obtain full credit for this part.
Part 2: Deep Learning (MNIST Fashion)
Created by Austin Bae (Modified from Ryley Higa)

Background

By now you should be familiar with the MNIST dataset from MP3. In MP3, we trained models on this dataset using linear classifiers such as Naive Bayes and Perceptron. This time around, we will implement a fully connected neural network from scratch on the same dataset we used in MP3. Your task is to build a 4-layer neural network with 256 hidden nodes per layer, except for the last layer, which should have 10 nodes (the number of classes). You are going to use a method called minibatch gradient descent, which runs for a given number of iterations (epochs) and, in each epoch, shuffles the training data, partitions it into minibatches, and performs a forward pass, a backward pass, and a weight update for each minibatch (see the sketch below).
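Here is a minimal sketch of that per-epoch loop, assuming a batch size of 200 (the graded data size is stated below to be divisible by 200). The helpers forward_backward() and apply_gradients() are hypothetical placeholders for the functions you actually implement in neural_network.py.

```python
import numpy as np

def minibatch_gd(X, y, params, epochs, batch_size=200, lr=0.1):
    """Hypothetical sketch of minibatch gradient descent over several epochs."""
    n = X.shape[0]
    losses = []
    for epoch in range(epochs):
        order = np.random.permutation(n)          # reshuffle the data each epoch
        X, y = X[order], y[order]
        epoch_loss = 0.0
        for start in range(0, n, batch_size):     # iterate over consecutive minibatches
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            loss, grads = forward_backward(xb, yb, params)  # hypothetical helper
            params = apply_gradients(params, grads, lr)     # hypothetical helper
            epoch_loss += loss
        losses.append(epoch_loss)
    return params, losses
```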
The Neural Network

Neural Network Architecture

For this assignment, you will build a 4-layer, fully connected neural network. You can think of a fully connected neural network as a series of interconnected layers, each of which is basically a group of nodes (see the diagram above). You can also think of each layer as the perceptron model you implemented in MP3, except that instead of one perceptron you have several, whose outputs feed in as the inputs to the next layer.

Neural Network Layer

This is what each layer in a neural network looks like. Note that there are two directions: forward and backward propagation. Forward propagation uses both the inputs (Ain) and the weights (W, b) specific to that layer to compute the output (Aout) passed to the next layer. Backward propagation computes the gradient (derivative) of the loss function with respect to the weights at each layer. Each layer is composed of two functions, and each has both a forward and a backward version. In general, the affine transformation computes the affine output (Z) from the input and weights (e.g., a matrix multiply plus a bias). The nonlinear activation function then transforms that affine output. There are numerous activation functions out there, but for this assignment we will use ReLU (Rectified Linear Units), which is simply R(x) = max(0, x). The backward functions compute the gradients of the inputs with respect to the loss. At the last layer, the output is the classification computed by your neural network. You then have to calculate the loss, which represents how well your network classified the data. The model's job is to minimize this loss: the better it does, the lower the loss. Again, there are many different loss functions, but we will use cross-entropy for this assignment.

Implementation Details

You are to implement these 8 different functions inside of neural_network.py.
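For reference, here is a minimal sketch of the building blocks described above (affine transform, ReLU activation, and a softmax cross-entropy loss). The function names and the exact argument/return conventions of neural_network.py may differ; treat this only as an illustration of the math, not the required interface.

```python
import numpy as np

def affine_forward(A, W, b):
    """Z = A W + b for a batch of inputs A (n x d_in), weights W (d_in x d_out), bias b (d_out,)."""
    return A @ W + b

def affine_backward(dZ, A, W):
    """Gradients of the loss with respect to the affine inputs and parameters."""
    dA = dZ @ W.T          # gradient flowing back to the previous layer
    dW = A.T @ dZ          # gradient for the weight matrix
    db = dZ.sum(axis=0)    # gradient for the bias vector
    return dA, dW, db

def relu_forward(Z):
    """R(x) = max(0, x), applied elementwise."""
    return np.maximum(0, Z)

def relu_backward(dA, Z):
    """Pass gradients through only where the pre-activation was positive."""
    return dA * (Z > 0)

def cross_entropy(F, y):
    """Softmax cross-entropy loss and its gradient w.r.t. the final scores F (n x num_classes)."""
    n = F.shape[0]
    shifted = F - F.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(n), y].mean()
    dF = np.exp(log_probs)
    dF[np.arange(n), y] -= 1
    return loss, dF / n
```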
Testing

We have provided you with a unit testing script.

What to Report

Run your minibatch gradient descent for 10, 30, and 50 epochs. You should start with freshly generated weights and biases for each run. For each run, report:
Also report any interesting observations, and possible explanations for those observations.
Extra Credit Opportunity

A Convolutional Neural Network (CNN) is a different type of neural network that is very popular, especially in image processing. Implement a CNN on the same MNIST Fashion dataset using any deep learning framework of your choice (TensorFlow, PyTorch, Keras, etc.). You may reference online materials, but make sure to cite them as necessary! Write a paragraph on how CNNs work, what you implemented, and the results of your CNN, and compare it to the performance of your fully connected 4-layer network. Report the confusion matrix, average classification rate, and average classification rate per class on the test data once your model converges. The extra credit will be capped at 10% of the entire MP grade. Submit your file as nn_extracredit.ext, where ext is the extension of the file. You do not have to submit any runtime information (modules/metadata), but make sure to describe your algorithm in the report.

Warning/Hints

One disadvantage of neural networks is that they can take a long time to compute. Given that fact, correctness of your algorithms may not be sufficient to successfully complete this assignment. Use numpy functions (matrix multiplication, broadcasting) as much as possible instead of native Python loops, as this will significantly decrease your runtime. With decently optimized code, we were able to get under 10 seconds per epoch on a 2015 MacBook Pro, and roughly under 80 seconds on EWS. We highly recommend running the code on a machine faster than EWS, but EWS should give you an upper bound on computation time. We will cut off computation for autograding at around 120 seconds per epoch on EWS. Also, running your model for 50 epochs will take somewhere around 10 minutes to 1 hour; if you do this at the last minute you may not be able to finish your computations on time. There will be no excuses for missing the deadline.

We will also grade your code on a different dataset that has a different number of rows (but divisible by 200), features, and target classes. It is therefore very important that you do not hardcode any of these numbers into your algorithms; your code should support data of any size. The provided hyperparameters (number of layers, number of nodes per layer, learning rate, batch size) should remain the same.

Provided Code Skeleton

We have provided skeleton.zip with the descriptions below. For Part 1, do not import any non-standard libraries except pygame (pygame version 1.9.4) and numpy. For Part 2, do not import any non-standard library except numpy. Failure to do so will result in a 0 from the autograder.

Part 1
Part 2
You can modify the test/main functions as long as the inputs/outputs of all functions inside neural_network.py are consistent with the instructions. Note that only neural_network.py will be submitted and graded.

Deliverables

This MP will be submitted via compass. Please upload only the following files to compass.
Report Checklist

Part 1
Part 2