Policy Gradient Descent on CartPole

Method	inputs	Description
shape	None (property)	Returns the dimensions of the array as a tuple
random.rand()	$d_{0}, d_{1}, \ldots, d_{n}$	Creates a new array of shape $(d_{0}, d_{1}, \ldots, d_{n})$ filled with random values drawn uniformly from [0, 1).
sqrt	A numpy array	Returns a new numpy array with the square root of every element.
vstack	A list of numpy arrays	Returns the vertical stack of all the arrays (concatenates along the first axis; 1-D arrays of length N are first treated as rows of shape (1, N))
hstack	A list of numpy arrays	Returns the horizontal stack of all the arrays (concatenates along the second axis, or along the first axis for 1-D arrays)
mean	A numpy array	Returns the mean value (sum of the elements / number of elements)
std	A numpy array	Returns the standard deviation: sqrt(mean(abs(x - x.mean()) ** 2))
dot	Two numpy arrays	Return the dot product of the two arrays (Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them.)
cross	Two numpy arrays	Return the cross product of the two arrays (The cross product a × b is defined as a vector c that is perpendicular (orthogonal) to both a and b, with a direction given by the right-hand rule and a magnitude equal to the area of the parallelogram that the vectors span.)
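outer	Two numpy arrays	Returns the outer product of the two arrays: every element of the first array is multiplied by every element of the second, so inputs of shape (M,) and (N,) give a result of shape (M, N)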
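
The methods in the table can be tried on two small vectors. The following is a minimal sketch; the arrays `a` and `b` and all variable names are made up for illustration and do not come from the CartPole code.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# dot: sum of the elementwise products, a single number for two 1-D arrays
dot_ab = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32.0

# outer: every element of a times every element of b, shape (3, 3)
outer_ab = np.outer(a, b)      # outer_ab[i, j] == a[i] * b[j]

# cross: a vector perpendicular to both a and b
cross_ab = np.cross(a, b)      # [-3.0, 6.0, -3.0]

# vstack stacks the arrays as rows, hstack joins them end to end
rows = np.vstack([a, b])       # shape (2, 3)
flat = np.hstack([a, b])       # shape (6,)

# std agrees with the formula given in the table
std_a = np.std(a)
manual_std = np.sqrt(np.mean(np.abs(a - a.mean()) ** 2))

# random.rand takes the dimensions as separate arguments
noise = np.random.rand(2, 3)   # shape (2, 3), values in [0, 1)
```

Note that `dot` collapses the two vectors into one number, while `outer` expands them into a matrix; this difference matters for the shape tables that follow.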

Line	variable	dimension	Remark
3	`self.model['W1']` `obs` `hidden`	(200, 4) (4,) (200,)
4	`hidden`	(200,)	These are the activation values of W1
5	`self.model['W2']` `log_probability`	(200,) (1,)
6	`probability`	(1,)	This is the probability of taking action 1 (or 0). These are the activation values of W2
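
The shapes in the table above can be checked with a small forward pass. This is only a sketch under stated assumptions: the ReLU on the hidden layer, the sigmoid on the output, the weight initialisation, and the function names `forward` and `sigmoid` are assumptions for illustration, since the table lists only the variables and their dimensions. With `W2` of shape (200,), `np.dot` returns a single NumPy scalar here rather than an array of shape (1,).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: 4 observation values, 200 hidden units.
model = {
    "W1": rng.standard_normal((200, 4)) / np.sqrt(4),   # shape (200, 4)
    "W2": rng.standard_normal(200) / np.sqrt(200),      # shape (200,)
}

def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(model, obs):
    """Return the action probability and the hidden activations."""
    hidden = np.dot(model["W1"], obs)               # (200, 4) . (4,) -> (200,)
    hidden = np.maximum(hidden, 0.0)                # assumed ReLU, still (200,)
    log_probability = np.dot(model["W2"], hidden)   # (200,) . (200,) -> one value
    probability = sigmoid(log_probability)          # assumed sigmoid output
    return probability, hidden

obs = rng.standard_normal(4)                        # stand-in for a CartPole state
probability, hidden = forward(model, obs)
```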

Line	variable	dimension	Remark
3	`episode_hidden` `episode_hidden.T` `episode_probability` `dW2`	(`N`, 200) (200, `N`) (`N`,) (200,)	The update values for W2.
4	`episode_probability` `self.model['W2']` `dh`	(`N`,) (200,) (`N`, 200)	The outer product multiplies every input with every output
5	`dh`	(`N`, 200)
6	`dh.T` `episode_states` `dW1`	(200, `N`) (`N`, 4) (200, 4)	The update values for W1.
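
The backward-pass shapes can be verified in the same way. The sketch below uses random stand-in data for an episode of `N` steps; the value of `N`, the random inputs, and the ReLU mask applied to `dh` are assumptions for illustration, since the table gives only the dimensions of each variable.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                               # hypothetical episode length

W1 = rng.standard_normal((200, 4)) / np.sqrt(4)     # (200, 4)
W2 = rng.standard_normal(200) / np.sqrt(200)        # (200,)

# Stand-in episode data: one row per time step.
episode_states = rng.standard_normal((N, 4))                        # (N, 4)
episode_hidden = np.maximum(np.dot(episode_states, W1.T), 0.0)      # (N, 200)
episode_probability = rng.standard_normal(N)                        # (N,) gradient signal

# (200, N) . (N,) -> (200,): the update values for W2
dW2 = np.dot(episode_hidden.T, episode_probability)

# (N,) outer (200,) -> (N, 200): dh[i, j] == episode_probability[i] * W2[j]
dh = np.outer(episode_probability, W2)

# Assumed ReLU backward step: no gradient flows through inactive hidden units
dh[episode_hidden <= 0] = 0.0

# (200, N) . (N, 4) -> (200, 4): the update values for W1
dW1 = np.dot(dh.T, episode_states)
```

Each gradient ends up with the same shape as the weight it updates, which is a quick sanity check when writing the backward pass by hand.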
