LSTM Programming
Posted: 2016-03-19 , Modified: 2016-03-19
Tags: programming
Here are the equations for LSTM.
\[\begin{align} f_t&=\si(W_f \coltwo{h_{t-1}}{x_t} + b_f)\\ i_t&=\si(W_i \coltwo{h_{t-1}}{x_t} + b_i)\\ \wt{C}_t &= \tanh (W_C\coltwo{h_{t-1}}{x_t}+b_C)\\ C_t &= f_t \odot C_{t-1} + i_t \odot \wt{C}_t\\ o_t &= \si(W_o\coltwo{h_{t-1}}{x_t} + b_o)\\ h_t &= o_t\odot \tanh(C_t)\\ \wh y &= \text{softmax}(Wh_t + b). \end{align}\]
We define functions

- step_lstm :: \(\R^n\times \R^m\times \R^m \to \R^m\times \R^m\), sending \[(i_t, C_{t-1}, h_{t-1}) \mapsto (C_t, h_t).\]
- sequence_lstm :: \((\R^n)^s \times \R^m\times \R^m \to (\R^m)^s\), sending \[((i_t)_{t=1}^T, C_0, h_0)\mapsto (h_t)_{t=1}^T.\] (This is essentially “scanl” of step_lstm.)
- step_multiple_lstm :: \((\R^n)^k\times (\R^m)^k \times (\R^m)^k \to (\R^m)^k \times (\R^m)^k\). The mapped version of step_lstm. This we can implement efficiently as a matrix multiplication.
- sequence_multiple_lstm :: \(((\R^n)^s)^k\times (\R^m)^k \times (\R^m)^k \to ((\R^m)^s)^k\). There are two ways to write this:
  - map sequence_lstm over the \(k\) sequences (i.e., scan, then map), or
  - scan step_multiple_lstm over time (i.e., map, then scan). This is more efficient since we can implement the “map” as a matrix multiplication.

(Actually these functions will involve the parameters as well, which we omit here.)
Define step_lstm1 by
def step_lstm1(x, C, h, Wf, bf, Wi, bi, WC, bC, Wo, bo):
    hx = T.concatenate([h, x]) #dimension m+n
    f = T.nnet.sigmoid(T.dot(hx, Wf) + bf) #dimension m
    i = T.nnet.sigmoid(T.dot(hx, Wi) + bi) #dimension m
    C_add = T.tanh(T.dot(hx, WC) + bC) #dimension m
    C1 = f * C + i * C_add #dimension m
    o = T.nnet.sigmoid(T.dot(hx, Wo) + bo) #dimension m
    h1 = o * T.tanh(C1) #dimension m
    return [C1, h1] #two vectors of dimension m
Now define step_lstm as the version with the parameters grouped together.
def step_lstm(x, C, h, tparams):
    Wf, bf, Wi, bi, WC, bC, Wo, bo = unpack_params(tparams, ["Wf", "bf", "Wi", "bi", "WC", "bC", "Wo", "bo"])
    return step_lstm1(x, C, h, Wf, bf, Wi, bi, WC, bC, Wo, bo)
To define sequence_lstm we use Theano’s scan function. The arguments are:

- fn is the step function,
- sequences are the inputs to iterate over,
- outputs_info are the initial values in the recursion,
- non_sequences are fixed values that are not involved in the recursion.

Thus to create a scanned function like

scan' :: ((a,b,c) -> b) -> [a] -> b -> c -> [b]
scan' f a's init fixed =

we call

theano.scan(fn=f, sequences=a's, outputs_info=init, non_sequences=fixed)

Note here a, b, c can encompass multiple arguments, in which case you pass a list to sequences, outputs_info, and non_sequences. However, a, b, c must appear in that order in the argument list of fn.
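For concreteness, here is a minimal example of this correspondence (not from the original post): a running sum of a scaled sequence, where the scale factor is passed as a non_sequence.

import numpy as np
import theano
import theano.tensor as T

xs = T.vector("xs")      # the sequence, [a]
acc0 = T.scalar("acc0")  # the initial accumulator, b
c = T.scalar("c")        # a fixed value, c

# fn receives (sequence element, accumulator, non-sequence), in that order,
# and returns the new accumulator.
def step(x, acc, c):
    return acc + c * x

accs, updates = theano.scan(fn=step,
                            sequences=xs,
                            outputs_info=acc0,
                            non_sequences=c)

running_sum = theano.function([xs, acc0, c], accs)
# e.g. with xs=[1., 2., 3.], acc0=0., c=2. this computes [2., 6., 12.]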
def sequence_lstm(C0, h0, xs, tparams):
    Wf, bf, Wi, bi, WC, bC, Wo, bo = unpack_params(tparams, ["Wf", "bf", "Wi", "bi", "WC", "bC", "Wo", "bo"])
    #the function fn should have arguments in the following order:
    #sequences, outputs_info (accumulators), non_sequences
    #(x, C, h, Wf, bf, Wi, bi, WC, bC, Wo, bo)
    ([C_vals, h_vals], updates) = theano.scan(fn=step_lstm1,
                                              sequences=xs,
                                              outputs_info=[C0, h0], #initial values of the memory/accumulator
                                              non_sequences=[Wf, bf, Wi, bi, WC, bC, Wo, bo], #fixed parameters
                                              strict=True)
    return [C_vals, h_vals]
Note this will map automatically; to define sequence_multiple_lstm, all we have to do is swap two axes.
(Note on Theano list in scan.)
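For example, here is a rough sketch of the batched version (my code, not the original post's): xs now has shape (k, s, n), and we swap the batch and time axes so that scan iterates over time while each step works on a whole batch at once. Note that in the batched case the concatenation needs axis=1.

# Sketch (assumption, not from the original post): a batched step where each of
# x, C, h carries a leading batch axis of size k.
def step_multiple_lstm1(x, C, h, Wf, bf, Wi, bi, WC, bC, Wo, bo):
    hx = T.concatenate([h, x], axis=1)        # shape (k, m+n)
    f = T.nnet.sigmoid(T.dot(hx, Wf) + bf)    # shape (k, m)
    i = T.nnet.sigmoid(T.dot(hx, Wi) + bi)
    C_add = T.tanh(T.dot(hx, WC) + bC)
    C1 = f * C + i * C_add
    o = T.nnet.sigmoid(T.dot(hx, Wo) + bo)
    h1 = o * T.tanh(C1)
    return [C1, h1]

def sequence_multiple_lstm(C0, h0, xs, tparams):
    # xs has shape (k, s, n); swap the batch and time axes so that scan
    # iterates over time and each step sees a (k, n) matrix.
    xs_t = xs.dimshuffle(1, 0, 2)
    Wf, bf, Wi, bi, WC, bC, Wo, bo = unpack_params(tparams,
        ["Wf", "bf", "Wi", "bi", "WC", "bC", "Wo", "bo"])
    ([C_vals, h_vals], updates) = theano.scan(fn=step_multiple_lstm1,
                                              sequences=xs_t,
                                              outputs_info=[C0, h0],
                                              non_sequences=[Wf, bf, Wi, bi, WC, bC, Wo, bo],
                                              strict=True)
    # C_vals, h_vals have shape (s, k, m); swap back to (k, s, m)
    return [C_vals.dimshuffle(1, 0, 2), h_vals.dimshuffle(1, 0, 2)]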
A vanilla neural net layer is
def nn_layer1(x, W, b):
    return T.dot(x, W) + b

def nn_layer(x, tparams):
    W, b = unpack_params(tparams, ["W", "b"])
    return nn_layer1(x, W, b)
We define functions

- nn_layer :: \(\R^m\to \R^n\) (as above), and
- logloss :: \(\R^n\times \R^n\to \R\), given by \[\text{logloss}(x,y) = -\sum_i x_i \ln' (y_i)\] where we use corrected_log, \(\ln'(y) = \ln(\max(10^{-6}, y))\), to avoid blowup at small probabilities.

Now we can combine these with our LSTM to make the evaluation, prediction, and loss functions. Evaluation will give the probabilities of each output, prediction will give the output with max probability, and loss is the logloss on the expected and actual outcomes. We also include an accuracy function that outputs 1 if the prediction is correct and 0 otherwise.
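A sketch of corrected_log and logloss in Theano (the original code isn't shown here, so treat the exact form as an assumption):

def corrected_log(y):
    # ln'(y) = ln(max(1e-6, y)): keep probabilities away from 0 before taking the log
    return T.log(T.maximum(1e-6, y))

def logloss(x, y):
    # x: target (e.g. one-hot) distribution, y: predicted probabilities
    return -T.sum(x * corrected_log(y))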
Note fns_lstm returns a list of Theano variables (depending on the input lists/parameters) representing the activations, predictions, losses, and accuracy. We haven’t compiled these variables into a function yet.
(Add code here)
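Since the code is left out here, the following is only a rough sketch of what fns_lstm might look like for a single input sequence xs with one-hot targets ys, built from the pieces above (the names and shapes are my assumptions, not the original code):

def fns_lstm(xs, ys, C0, h0, tparams):
    # run the LSTM over the input sequence
    C_vals, h_vals = sequence_lstm(C0, h0, xs, tparams)
    # evaluation: probabilities of each output at each time step
    activations = T.nnet.softmax(nn_layer(h_vals, tparams))
    # prediction: the output with maximal probability
    predictions = T.argmax(activations, axis=1)
    # loss: logloss between the expected and predicted distributions
    loss = logloss(ys, activations)
    # accuracy: 1 if the prediction at the last step is correct, 0 otherwise
    accuracy = T.eq(predictions[-1], T.argmax(ys[-1]))
    return [activations, predictions, loss, accuracy]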
Some other functions:

- init_params_with_f_lstm(n,m,f,g)
- train_lstm
- weight_decay :: \(\R\) -> Dict String TheanoVars -> [String] -> \(\R\). For the parameters in the list, sum the squares of their norms and multiply by the decay constant. (A sketch follows below.)

(A further speedup is to concatenate the matrices.)
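A minimal sketch of weight_decay under that description (my code, not the original):

def weight_decay(decay_c, tparams, names):
    # sum of squared norms of the listed parameters, times the decay constant
    return decay_c * sum((tparams[name] ** 2).sum() for name in names)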
We’ll keep parameters in a dictionary, and unpack them as needed.
def unpack_params(tparams, li):
    return [tparams[name] for name in li]
- wrap_theano_dict and unwrap_theano_dict.
- get_minibatches_idx (:: Int -> Int -> Bool -> [(Int, [Int])]) will give an enumerated list of minibatch indices, given n, the size of the list, and minibatch_size. It will make a minibatch out of the remainder. (A sketch follows below.)
- oneHot(choices, n) gives a way to encode one-hot vectors within Theano.

These are taken from…
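For reference, a sketch of get_minibatches_idx matching that description (assuming the Bool argument means “shuffle”):

import numpy as np

def get_minibatches_idx(n, minibatch_size, shuffle=False):
    # enumerated list of minibatches of indices; the last minibatch holds the remainder
    idx_list = np.arange(n)
    if shuffle:
        np.random.shuffle(idx_list)
    minibatches = [idx_list[i:i + minibatch_size].tolist()
                   for i in range(0, n, minibatch_size)]
    return list(enumerate(minibatches))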
The arguments of each are … Each returns f_grad_shared and f_update.
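As an illustration, here is a sketch of a plain SGD optimizer in this style (my code; the post's actual optimizers may differ): f_grad_shared computes the cost and stores the gradients in shared variables, and f_update applies one descent step.

def sgd(lr, tparams, grads, inputs, cost):
    # shared variables that hold the current gradients
    gshared = [theano.shared(p.get_value() * 0., name="%s_grad" % name)
               for name, p in tparams.items()]
    gsup = list(zip(gshared, grads))

    # f_grad_shared: compute the cost and stash the gradients
    f_grad_shared = theano.function(inputs, cost, updates=gsup)

    # f_update: take one gradient step with learning rate lr
    pup = [(p, p - lr * g) for p, g in zip(tparams.values(), gshared)]
    f_update = theano.function([lr], [], updates=pup)

    return f_grad_shared, f_update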
What does the train function need? It should stop after patience number of epochs have passed without progress, or after max_epochs.

Pseudocode for train:

- Get the f_grad_shared and f_update functions from the optimizer.
- Keep track of the best parameters seen so far, best_p.
- If it has been patience iterations since the validation error improved, stop.
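Here is a rough sketch of that loop (my own code, with the problem-specific pieces passed in as arguments, so the names are assumptions):

import numpy as np

def train(f_grad_shared, f_update, lrate, minibatches, compute_valid_err,
          get_params, max_epochs, patience):
    # generic training loop with early stopping on the validation error
    best_p = None
    best_valid_err = np.inf
    bad_counter = 0

    for epoch in range(max_epochs):
        for x, y in minibatches():
            f_grad_shared(x, y)   # compute the cost and stash the gradients
            f_update(lrate)       # apply one update step

        valid_err = compute_valid_err()
        if valid_err < best_valid_err:
            best_valid_err = valid_err
            best_p = get_params()   # remember the best parameters so far
            bad_counter = 0
        else:
            bad_counter += 1
            if bad_counter > patience:   # no improvement for `patience` epochs
                break

    return best_p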