CS231n 2017 Assignment 3: RNN and LSTM parts
rnn_layers.py
rnn_step_forward:
Forward pass for a single timestep of the RNN, using the vanilla RNN update rule: next_h = tanh(x.dot(Wx) + prev_h.dot(Wh) + b).
def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h = np.tanh(prev_h.dot(Wh) + x.dot(Wx) + b)
    cache = (x, Wx, Wh, prev_h, next_h)
    return next_h, cache
rnn_step_backward:
Backward pass for a single timestep.
For next_h = tanh(a) with a = x.dot(Wx) + prev_h.dot(Wh) + b, the derivative of tanh is 1 - tanh(a)^2 = 1 - next_h^2, so the upstream gradient dnext_h is first multiplied by (1 - next_h^2) and then routed to x, prev_h, Wx, Wh and b through the usual matrix-multiplication backprop rules.
def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    x, Wx, Wh, prev_h, next_h = cache
    dnext_h = dnext_h * (1 - next_h**2)   # backprop through tanh: 1 - tanh(a)^2 = 1 - next_h^2
    db = dnext_h.sum(axis=0)              # accumulate the gradient over the minibatch
    dprev_h = dnext_h.dot(Wh.T)
    dWh = prev_h.T.dot(dnext_h)
    dx = dnext_h.dot(Wx.T)
    dWx = x.T.dot(dnext_h)
    return dx, dprev_h, dWx, dWh, db
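The analytic gradients of the step backward can be sanity-checked against numerical ones before wiring the steps into a full sequence. A minimal sketch, assuming the two functions above are in scope; num_grad is a hypothetical central-difference helper written here just for illustration (the assignment notebook uses its own gradient-check utilities):

import numpy as np

def num_grad(f, x, df, h=1e-5):
    """Central-difference numerical gradient of tensor-valued f at x,
    contracted with the upstream gradient df (same shape as f(x))."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()
        x[ix] = old - h
        neg = f(x).copy()
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

N, D, H = 4, 5, 6
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx, Wh = np.random.randn(D, H), np.random.randn(H, H)
b = np.random.randn(H)
dnext_h = np.random.randn(N, H)

_, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)

# Compare one analytic gradient against its numerical estimate
dx_num = num_grad(lambda v: rnn_step_forward(v, prev_h, Wx, Wh, b)[0], x, dnext_h)
print(np.max(np.abs(dx - dx_num)))   # should be tiny, on the order of 1e-8 or smaller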
rnn_forward:
Full forward pass of the RNN over the whole input sequence.
def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    prev_h = h0
    cache = []
    # Iterate over the T timesteps
    for i in range(T):
        prev_h, cache_ = rnn_step_forward(x[:, i, :], prev_h, Wx, Wh, b)
        h[:, i, :] = prev_h
        cache.append(cache_)   # record the cache of every timestep
    return h, cache
rnn_backward:
Input dh: the gradient that the loss contributes directly to the hidden state of each timestep. This is not the complete gradient of a given h_t, because h_t also influences the loss indirectly through all later hidden states; that carried gradient (the dprev_h passed back from the next timestep) has to be computed and added to the direct one.
The weight gradients (dWx, dWh, db), on the other hand, are accumulated across all timesteps.
def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H).

    NOTE: 'dh' contains the upstream gradients produced by the
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    N, T, H = dh.shape
    # Start from the last timestep; its hidden state receives no carried gradient
    dx_, dprev_h, dWx, dWh, db = rnn_step_backward(dh[:, T-1, :], cache[T-1])
    _, D = dx_.shape
    dx = np.zeros((N, T, D))
    dx[:, T-1, :] = dx_
    for i in range(T-2, -1, -1):
        # Total gradient of h_i = direct gradient dh[:, i, :] + carried gradient dprev_h
        dx_, dprev_h, dWx_, dWh_, db_ = rnn_step_backward(dh[:, i, :] + dprev_h, cache[i])
        dx[:, i, :] = dx_
        dWx += dWx_
        dWh += dWh_
        db += db_
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db
word_embedding_forward:
Map every word index of every sentence to the corresponding word vector in the vocabulary embedding matrix.
def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    word to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    # W[idx] picks out the row of W with index idx (idx a scalar);
    # W[x] applies this to every element of x, so the result is a 3-D tensor of shape (N, T, D)
    out = W[x]
    cache = (x, W)
    return out, cache
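The one-line fancy indexing is equivalent to looking rows up in an explicit loop, and its backward pass is a scatter-add of the upstream gradients back into the rows of W. A minimal sketch, assuming only NumPy; word_embedding_backward_sketch is a hypothetical reference version written here for illustration, not necessarily identical to the assignment's word_embedding_backward:

import numpy as np

V, D, N, T = 10, 3, 2, 4
W = np.random.randn(V, D)
x = np.random.randint(V, size=(N, T))

# Fancy indexing ...
out = W[x]                                   # shape (N, T, D)

# ... is the same as an explicit loop over (n, t)
out_loop = np.zeros((N, T, D))
for n in range(N):
    for t in range(T):
        out_loop[n, t] = W[x[n, t]]
assert np.allclose(out, out_loop)

# Backward: every row W[idx] that was read accumulates the gradient of every
# position where idx appeared; np.add.at handles repeated indices correctly.
def word_embedding_backward_sketch(dout, x, W):
    dW = np.zeros_like(W)
    np.add.at(dW, x, dout)                   # scatter-add dout[n, t] into dW[x[n, t]]
    return dW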
sigmoid:
This is the implementation provided in the official starter code.
For x >= 0 it uses sigmoid(x) = 1 / (1 + e^(-x));
for x < 0 it uses the equivalent form sigmoid(x) = e^x / (1 + e^x).
Either way the exponent is never a large positive number, which avoids overflow and the resulting loss of precision.
def sigmoid(x):
    """
    A numerically stable version of the logistic sigmoid function.
    """
    pos_mask = (x >= 0)
    neg_mask = (x < 0)
    z = np.zeros_like(x)
    z[pos_mask] = np.exp(-x[pos_mask])
    z[neg_mask] = np.exp(x[neg_mask])
    top = np.ones_like(x)
    top[neg_mask] = z[neg_mask]
    return top / (1 + z)
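A quick way to see the benefit: the textbook formula 1 / (1 + exp(-x)) overflows inside exp for large negative inputs, while the branch-by-sign version never exponentiates a large positive number. A minimal sketch, assuming the sigmoid above is in scope:

import numpy as np

x = np.array([-1000.0, -30.0, 0.0, 30.0, 1000.0])

with np.errstate(over='warn'):
    naive = 1.0 / (1.0 + np.exp(-x))   # exp(1000) overflows to inf -> RuntimeWarning

stable = sigmoid(x)                     # no overflow; the exponent is always <= 0
print(naive)    # values are still 0/1 at the extremes, but with an overflow warning
print(stable)   # same values, computed without overflow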
lstm_step_forward:
Forward pass for a single LSTM timestep: compute the LSTM equations directly (i = sigmoid(a_i), f = sigmoid(a_f), o = sigmoid(a_o), g = tanh(a_g); next_c = f * prev_c + i * g; next_h = o * tanh(next_c)). Note that, for efficiency, the weight matrices of the four gates are concatenated into one big matrix along the 4H axis, and instead of concatenating x with prev_h we multiply each by its own weight matrix and add the results, which is easily shown to be equivalent (see the quick check below).
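A minimal numerical check of that equivalence, assuming only NumPy and arbitrarily chosen shapes:

import numpy as np

N, D, H = 3, 5, 4
x  = np.random.randn(N, D)
h  = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)

# Concatenate the inputs and stack the weights ...
concat = np.hstack([x, h]).dot(np.vstack([Wx, Wh]))
# ... versus two separate products that are then summed
split = x.dot(Wx) + h.dot(Wh)

assert np.allclose(concat, split)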
def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Note that a sigmoid() function has already been provided for you in this file.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for backward pass.
    """
    N, H = prev_h.shape
    # Gate pre-activations, shape (N, 4H); summing the two matrix products is
    # equivalent to concatenating [x, prev_h] and multiplying by the stacked weights
    act = x.dot(Wx) + prev_h.dot(Wh) + b
    # The 4H columns of Wx and Wh split into four blocks: input gate, forget gate,
    # output gate and candidate cell value
    ai, af, ao, ag = act[:, :H], act[:, H:2*H], act[:, 2*H:3*H], act[:, 3*H:4*H]
    i, f, o, g = sigmoid(ai), sigmoid(af), sigmoid(ao), np.tanh(ag)
    next_c = f * prev_c + i * g      # next cell state
    next_h = o * np.tanh(next_c)     # next hidden state
    cache = (x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_c, next_h)
    return next_h, next_c, cache
lstm_step_backward:
Backward pass for a single LSTM timestep. The total gradient on next_c is the direct upstream dnext_c plus the indirect contribution through next_h = o * tanh(next_c); their sum is then backpropagated through the gates.
def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_c, next_h = cache
    # Direct gradient + indirect gradient through next_h = o * tanh(next_c)
    dnext_c = dnext_c + dnext_h * o * (1 - np.tanh(next_c)**2)
    dprev_c = f * dnext_c
    df = prev_c * dnext_c
    di = g * dnext_c
    dg = i * dnext_c
    do = np.tanh(next_c) * dnext_h
    # Backprop through the gate nonlinearities (sigmoid for i, f, o; tanh for g)
    dai = i * (1 - i) * di
    daf = f * (1 - f) * df
    dao = o * (1 - o) * do
    dag = (1 - g**2) * dg
    dact = np.concatenate((dai, daf, dao, dag), axis=1)
    dx = dact.dot(Wx.T)
    dprev_h = dact.dot(Wh.T)
    dWx = x.T.dot(dact)
    dWh = prev_h.T.dot(dact)
    db = np.sum(dact, axis=0)
    return dx, dprev_h, dprev_c, dWx, dWh, db
lstm_forward:
Full forward pass of the LSTM over the sequence, analogous to the RNN version; the initial cell state is zero and the cell states stay internal to the loop.
def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The LSTM uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the LSTM forward, we return the hidden states for all timesteps.

    Note that the initial cell state is passed as input, but the initial cell
    state is set to zero. Also note that the cell state is not returned; it is
    an internal variable to the LSTM and is not accessed from outside.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    N, T, D = x.shape
    _, H = h0.shape
    prev_h = h0
    prev_c = np.zeros_like(h0)
    h = np.zeros((N, T, H))
    cache = []
    for i in range(T):
        next_h, next_c, cache_ = lstm_step_forward(x[:, i, :], prev_h, prev_c, Wx, Wh, b)
        h[:, i, :] = next_h
        prev_h, prev_c = next_h, next_c
        cache.append(cache_)
    return h, cache
lstm_backward:
Full backward pass of the LSTM, analogous to the RNN version; besides dprev_h, the cell-state gradient dprev_c is also carried backward from timestep to timestep.
def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    N, T, H = dh.shape
    # Last timestep: the cell state receives no upstream gradient of its own
    dx_, dprev_h, dprev_c, dWx, dWh, db = lstm_step_backward(
        dh[:, T-1, :], np.zeros_like(dh[:, T-1, :]), cache[T-1])
    _, D = dx_.shape
    dx = np.zeros((N, T, D))
    dx[:, T-1, :] = dx_
    for i in range(T-2, -1, -1):
        dx[:, i, :], dprev_h, dprev_c, dWx_, dWh_, db_ = lstm_step_backward(
            dh[:, i, :] + dprev_h, dprev_c, cache[i])
        dWx += dWx_
        dWh += dWh_
        db += db_
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db
CaptioningRNN:
loss:
First, apply an affine transformation to the input image features to compute the initial hidden state h0.
Then map captions_in through the word embedding; at this point the captions are already encoded as word indices.
Because different images have captions of different lengths, they are padded to a common length with <NULL> tokens, and a mask keeps those padded positions out of the loss.
Then run the forward pass through the RNN (or LSTM).
After that, apply a (temporal) affine transformation to the hidden states of all timesteps to obtain the predicted sentence scores.
Each position of the predicted sentence holds scores over the whole vocabulary, so a (temporal) softmax loss can be computed against captions_out.
Finally, run the backward pass to collect the gradients of all parameters.
def loss(self, features, captions):
    """
    Compute training-time loss for the RNN. We input image features and
    ground-truth captions for those images, and use an RNN (or LSTM) to compute
    loss and gradients on all parameters.

    Inputs:
    - features: Input image features, of shape (N, D)
    - captions: Ground-truth captions; an integer array of shape (N, T) where
      each element is in the range 0 <= y[i, t] < V

    Returns a tuple of:
    - loss: Scalar loss
    - grads: Dictionary of gradients parallel to self.params
    """
    # Cut captions into two pieces: captions_in has everything but the last word
    # and will be input to the RNN; captions_out has everything but the first
    # word and this is what we will expect the RNN to generate. These are offset
    # by one relative to each other because the RNN should produce word (t+1)
    # after receiving word t. The first element of captions_in will be the START
    # token, and the first element of captions_out will be the first word.
    captions_in = captions[:, :-1]    # drop <END>
    captions_out = captions[:, 1:]    # drop <START>

    # You'll need this
    mask = (captions_out != self._null)   # mask that ignores <NULL> positions

    # Weight and bias for the affine transform from image features to initial
    # hidden state
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

    # Word embedding matrix
    W_embed = self.params['W_embed']

    # Input-to-hidden, hidden-to-hidden, and biases for the RNN
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

    # Weight and bias for the hidden-to-vocab transformation.
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

    loss, grads = 0.0, {}

    # Forward pass
    h0 = features.dot(W_proj) + b_proj                               # (1) image features -> initial hidden state
    x, cache_embed = word_embedding_forward(captions_in, W_embed)    # (2) word indices -> word vectors
    if self.cell_type == 'rnn':                                      # (3) run the recurrent network
        h, cache = rnn_forward(x, h0, Wx, Wh, b)
    else:
        h, cache = lstm_forward(x, h0, Wx, Wh, b)
    out, cache_voc = temporal_affine_forward(h, W_vocab, b_vocab)    # (4) hidden states -> vocabulary scores
    loss, dout = temporal_softmax_loss(out, captions_out, mask, verbose=False)   # (5) softmax loss, <NULL> masked out

    # Backward pass
    dh, dW_vocab, db_vocab = temporal_affine_backward(dout, cache_voc)
    if self.cell_type == 'rnn':
        dx, dh0, dWx, dWh, db = rnn_backward(dh, cache)
    else:
        dx, dh0, dWx, dWh, db = lstm_backward(dh, cache)
    dW_embed = word_embedding_backward(dx, cache_embed)
    dW_proj = features.T.dot(dh0)
    db_proj = dh0.sum(axis=0)

    grads = {'W_vocab': dW_vocab, 'b_vocab': db_vocab,
             'Wx': dWx, 'Wh': dWh, 'b': db,
             'W_embed': dW_embed, 'W_proj': dW_proj, 'b_proj': db_proj}

    return loss, grads
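In the assignment the actual training is driven by its CaptioningSolver class; purely as an illustration of how the returned loss and grads pair up with self.params, a hand-rolled update step might look like the sketch below (model, features, captions and learning_rate are assumed names, and plain SGD stands in for whatever update rule the solver uses):

learning_rate = 5e-3

for step in range(100):
    loss, grads = model.loss(features, captions)       # forward + backward as implemented above
    for name, grad in grads.items():
        model.params[name] -= learning_rate * grad     # plain SGD update on every parameter
    if step % 10 == 0:
        print(step, loss)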
After deliberately overfitting on a small training set, the model's captions for the training samples are very accurate, but its captions for the validation set are, to put it mildly, completely unrelated to the images.
LSTM_Captioning.ipynb:
Question:
Why not replace the sigmoid with ReLU in the LSTM gates?
Because the gates act through element-wise multiplication: sigmoid outputs values in the range 0~1, which maps exactly onto "how much information to keep" (from 0% to 100%), whereas ReLU is unbounded above and therefore cannot provide that proportional control.
Within its 0~1 output range the sigmoid also gives a smooth gradient, which helps keep the flow of information balanced.
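A toy illustration of the scaling argument (a hypothetical one-value "cell" repeatedly multiplied by the same gate; not part of the assignment): a sigmoid gate can only keep or attenuate the cell, while a ReLU "gate" larger than 1 makes it blow up.

import numpy as np

a = 2.0                           # some pre-activation value feeding the gate
sig_gate = 1 / (1 + np.exp(-a))   # ~0.88, always in (0, 1)
relu_gate = max(0.0, a)           # 2.0, unbounded above

c_sig, c_relu = 1.0, 1.0
for _ in range(30):               # apply the same gate at every "timestep"
    c_sig *= sig_gate             # shrinks gently, never grows
    c_relu *= relu_gate           # 2**30 ~ 1e9: the cell state explodes

print(c_sig, c_relu)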