LSTM Programming
Posted: 2016-03-19 , Modified: 2016-03-19
Tags: programming
Here are the equations for LSTM.
\[\begin{align}
f_t &= \sigma\left(W_f \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} + b_f\right)\\
i_t &= \sigma\left(W_i \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} + b_i\right)\\
\tilde{C}_t &= \tanh\left(W_C \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} + b_C\right)\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\\
o_t &= \sigma\left(W_o \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} + b_o\right)\\
h_t &= o_t \odot \tanh(C_t)\\
\hat{y} &= \text{softmax}(W h_t + b).
\end{align}\]
We define functions

- step_lstm :: \(\R^n\times \R^m\times \R^m \to \R^m\times \R^m\), sending \[(x_t, C_{t-1}, h_{t-1}) \mapsto (C_t, h_t).\]
- sequence_lstm :: \((\R^n)^s \times \R^m\times \R^m \to (\R^m)^s\), sending \[((x_t)_{t=1}^s, C_0, h_0)\mapsto (h_t)_{t=1}^s.\] (This is essentially “scanl” of step_lstm.)
- step_multiple_lstm :: \((\R^n)^k\times (\R^m)^k \times (\R^m)^k \to (\R^m)^k \times (\R^m)^k\). The mapped version of step_lstm. This we can implement efficiently as a matrix multiplication.
- sequence_multiple_lstm :: \(((\R^n)^s)^k\times (\R^m)^k \times (\R^m)^k \to ((\R^m)^s)^k\). There are two ways to write this:
  1. Map sequence_lstm over the \(k\) sequences (i.e., scan, then map).
  2. Scan step_multiple_lstm (i.e., map, then scan). This is more efficient since we can implement the “map” as a matrix multiplication; see the sketch after this list.

(Actually these functions will involve the parameters as well, which we omit here.)
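To see why the “map” becomes a matrix multiplication, here is a minimal symbolic sketch for a single gate; the names H, X, Wf, bf are illustrative, with shapes noted in the comments.

import theano.tensor as T

# Forget gate for k stacked examples computed in one matrix multiplication.
H = T.matrix("H")    # (k, m): k hidden states stacked as rows
X = T.matrix("X")    # (k, n): k inputs stacked as rows
Wf = T.matrix("Wf")  # (m+n, m)
bf = T.vector("bf")  # (m,)

HX = T.concatenate([H, X], axis=1)        # (k, m+n): stack along the feature axis
F = T.nnet.sigmoid(T.dot(HX, Wf) + bf)    # (k, m): the gate for all k examples at once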
Define step_lstm1 by
def step_lstm1(x, C, h, Wf, bf, Wi, bi, WC, bC, Wo, bo):
    hx = T.concatenate([h, x]) #dimension m+n
    f = T.nnet.sigmoid(T.dot(hx, Wf) + bf) #dimension m
    i = T.nnet.sigmoid(T.dot(hx, Wi) + bi) #dimension m
    C_add = T.tanh(T.dot(hx, WC) + bC) #dimension m
    C1 = f * C + i * C_add #dimension m
    o = T.nnet.sigmoid(T.dot(hx, Wo) + bo) #dimension m
    h1 = o * T.tanh(C1) #dimension m
    return [C1, h1] #two vectors of dimension m

Now define step_lstm as the version with parameters grouped together.
def step_lstm(x, C, h, tparams):
Wf, bf, Wi, bi, WC, bC, Wo, bo = unpack_params(tparams, ["Wf", "bf", "Wi", "bi", "WC", "bC", "Wo", "bo"])
    return step_lstm1(x, C, h, Wf, bf, Wi, bi, WC, bC, Wo, bo)

To define sequence_lstm we use Theano’s scan function. The arguments are:
- fn is the function to iterate.
- sequences are the lists being iterated over.
- outputs_info are the initial values in the recursion.
- non_sequences are fixed values that are not involved in the recursion.

Thus, to create a scanned function like

scan' :: ((a,b,c) -> b) -> [a] -> b -> c -> [b]
scan' f a's init fixed = ...

we call

theano.scan(fn=f, sequences=a's, outputs_info=init, non_sequences=fixed)

Note that here a, b, c can each encompass multiple arguments, in which case you pass a list to sequences, outputs_info, and non_sequences. However, a, b, c must appear in that order.
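As a toy illustration of this correspondence (not part of the LSTM itself), here is scan' instantiated with the step function f(a, acc, c) = a*acc + c:

import theano
import theano.tensor as T

a_seq = T.vector("a_seq")   # the sequence [a]
init = T.scalar("init")     # the initial accumulator b
fixed = T.scalar("fixed")   # the fixed value c

def step(a, acc, c):        # argument order: sequences, outputs_info, non_sequences
    return a * acc + c

results, updates = theano.scan(fn=step,
                               sequences=a_seq,
                               outputs_info=init,
                               non_sequences=fixed)
f = theano.function([a_seq, init, fixed], results, allow_input_downcast=True)
print(f([1., 2., 3.], 0., 1.))   # [ 1.  3.  10.]

With that in mind, sequence_lstm is: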
def sequence_lstm(C0, h0, xs, tparams):
Wf, bf, Wi, bi, WC, bC, Wo, bo = unpack_params(tparams, ["Wf", "bf", "Wi", "bi", "WC", "bC", "Wo", "bo"])
#the function fn should have arguments in the following order:
#sequences, outputs_info (accumulators), non_sequences
#(x, C, h, Wf, bf, Wi, bi, WC, bC, Wo, bo)
([C_vals, h_vals], updates) = theano.scan(fn=step_lstm1,
sequences = xs,
outputs_info=[C0, h0], #initial values of the memory/accumulator
non_sequences=[Wf, bf, Wi, bi, WC, bC, Wo, bo], #fixed parameters
strict=True)
    return [C_vals, h_vals]

Note that this will map automatically; to define sequence_multiple_lstm, all we have to do is swap two axes (see the sketch below).
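A possible sketch of sequence_multiple_lstm along those lines; it assumes the inputs come as a tensor of shape (k, s, n) and that step_lstm1 concatenates batched h and x along their last axis, neither of which is spelled out in the code above.

def sequence_multiple_lstm(C0s, h0s, xs, tparams):
    # xs has shape (k, s, n); scan iterates over the leading axis,
    # so move the time axis to the front before scanning
    xs_t = xs.dimshuffle(1, 0, 2)            # shape (s, k, n)
    C_vals, h_vals = sequence_lstm(C0s, h0s, xs_t, tparams)
    return [C_vals, h_vals]                  # each of shape (s, k, m)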
(Note on Theano list in scan.)
A vanilla neural net layer is
def nn_layer1(x, W, b):
    return T.dot(x, W) + b

def nn_layer(x, tparams):
    W, b = unpack_params(tparams, ["W", "b"])
    return nn_layer1(x, W, b)

We define functions
- nn_layer :: \(\R^n \to \R^m\).
- logloss :: \(\R^n\times \R^n \to \R\), given by \[\text{logloss}(x,y) = -\sum_i x_i \ln' (y_i),\] where we use corrected_log, \(\ln'(y) = \ln(\max(10^{-6}, y))\), to avoid blowup at small probabilities (see the sketch below).

Now we can combine these with our LSTM to make the evaluation, prediction, and loss functions. Evaluation will give the probabilities of each output, prediction will give the output with maximum probability, and loss is the logloss on the expected and actual outcomes. We also include an accuracy function that outputs 1 if the prediction is correct and 0 otherwise.
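A minimal sketch of corrected_log and logloss, assuming x and y are Theano tensors representing the target and predicted probability vectors:

def corrected_log(y):
    # clip small values before taking the log to avoid -inf on zero probabilities
    return T.log(T.maximum(1e-6, y))

def logloss(x, y):
    # cross-entropy between the target distribution x and the prediction y
    return -T.sum(x * corrected_log(y))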
Note fns_lstm returns a list of Theano variables (depending on the input lists/parameters) representing the activations, predictions, losses and accuracy. We haven’t compiled these variables into a function yet.
(Add code here)
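A possible sketch of fns_lstm in the meantime; the argument list, the use of the last hidden state, and the way oneHot is applied here are assumptions rather than the original code.

def fns_lstm(xs, y, C0, h0, tparams):
    # run the LSTM over the input sequence and take the last hidden state
    C_vals, h_vals = sequence_lstm(C0, h0, xs, tparams)
    z = nn_layer(h_vals[-1], tparams)
    # softmax to get probabilities (shifted by the max for numerical stability)
    activations = T.exp(z - z.max()) / T.sum(T.exp(z - z.max()))
    prediction = T.argmax(activations)
    loss = logloss(oneHot(y, z.shape[0]), activations)   # assumed oneHot usage
    accuracy = T.eq(prediction, y)
    return [activations, prediction, loss, accuracy]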
Some other functions:
- init_params_with_f_lstm(n,m,f,g)
- train_lstm
- weight_decay :: \(\R\) -> Dict String TheanoVars -> [String] -> \(\R\). For the parameters in the list, sum the squares of their norms and multiply by the decay constant (see the sketch below).

(A further speedup is to concatenate the matrices.)
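A possible sketch of weight_decay, assuming tparams is a dictionary of Theano shared variables, li is a list of parameter names, and using the unpack_params helper defined below:

def weight_decay(decay_c, tparams, li):
    # sum of squared norms of the listed parameters, scaled by decay_c
    Ws = unpack_params(tparams, li)
    return decay_c * sum((W ** 2).sum() for W in Ws)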
We’ll keep parameters in a dictionary, and unpack them as needed.
def unpack_params(tparams, li):
    return [tparams[name] for name in li]

- wrap_theano_dict and unwrap_theano_dict.
- get_minibatches_idx (:: Int -> Int -> Bool -> [(Int, [Int])]) will give an enumerated list of minibatch indices, given n, the size of the list, and minibatch_size. It will make a minibatch out of the remainder; a sketch is given after this list.
- oneHot(choices, n) gives a way to encode one-hot vectors within Theano.

These are taken from…
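A possible sketch of get_minibatches_idx following that description (the shuffle flag corresponds to the Bool argument):

import numpy as np

def get_minibatches_idx(n, minibatch_size, shuffle=False):
    idx_list = np.arange(n, dtype="int64")
    if shuffle:
        np.random.shuffle(idx_list)
    # slicing makes the last, smaller minibatch out of the remainder automatically
    minibatches = [idx_list[i:i + minibatch_size]
                   for i in range(0, n, minibatch_size)]
    return list(enumerate(minibatches))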
The arguments of each are

Returns:

- f_grad_shared
- f_update

What does the train function need?

- To stop when patience epochs have passed without progress, or after max_epochs.

Pseudocode for train:

- Get the f_grad_shared and f_update functions from the optimizer.
- Keep track of the best parameters seen so far, best_p.
- If it has been patience iterations since the validation error improved, stop.
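Putting those notes together, a rough sketch of the training loop might look as follows. The helpers prepare_batch, get_params, and valid_error are hypothetical placeholders, and the batch size and learning-rate handling are only illustrative.

import numpy as np

def train_loop(f_grad_shared, f_update, lr, data, max_epochs, patience,
               valid_error):
    best_p, best_err, bad_counter = None, np.inf, 0
    for epoch in range(max_epochs):
        for _, idx in get_minibatches_idx(len(data), 16, shuffle=True):
            x, y = prepare_batch(data, idx)   # hypothetical helper
            cost = f_grad_shared(x, y)        # compute the cost and cache gradients
            f_update(lr)                      # apply the optimizer's update rule
        err = valid_error()                   # validation error after this epoch
        if err < best_err:
            best_err, best_p, bad_counter = err, get_params(), 0  # hypothetical get_params
        else:
            bad_counter += 1
            if bad_counter > patience:        # no progress for `patience` epochs
                break
    return best_p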