

The quality of the images reconstructed by the complex-valued network is better than that of the images reconstructed by the real-valued network.

Figure 3.1: MNIST images reconstructed by the real-valued (left) and complex-valued (right) stacked denoising autoencoders, along with the original images [169]

Table 3.10: Experimental results for MNIST

Architecture                  Complex Loss   Real Loss   Complex Error   Real Error
784-128-64-32 (GN)            10.88e−4       11.33e−4    1.44%           1.49%
784-128-64-32 (MN)             9.24e−4        9.67e−4    1.71%           1.91%
784-256-128-64 (GN)           10.70e−4       10.88e−4    1.18%           1.26%
784-256-128-64 (MN)            8.96e−4        9.19e−4    1.35%           1.63%
784-1024-512-256-128 (GN)     10.64e−4       10.83e−4    1.19%           1.21%
784-1024-512-256-128 (MN)      8.69e−4        9.15e−4    1.18%           1.24%

3.4.2.2 FashionMNIST

A dataset that appeared recently is the FashionMNIST dataset [230]. Its characteristics are very similar to those of the MNIST dataset, by which it was inspired, but it constitutes a more difficult task. The grayscale 28×28 pixel images belong to the following 10 classes: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. There are 60,000 training samples and 10,000 test samples.

The experimental results are given in Table 3.11, in the same way as for the above experiment. It can be seen that the task is more difficult, both in terms of reconstruction error and in terms of classification error. Nonetheless, the conclusion of this set of experiments is the same: the complex-valued networks attain better reconstruction error and better classification error than the real-valued networks.

The images reconstructed by the two types of networks are given in Figure 3.2, along with the original images. In this case, too, the images reconstructed by the complex-valued network have better quality.

Figure 3.2: FashionMNIST images reconstructed by the real-valued (left) and complex-valued (right) stacked denoising autoencoders, along with the original images [169]

Table 3.11: Experimental results for FashionMNIST

Architecture                  Complex Loss   Real Loss   Complex Error   Real Error
784-128-64-32 (GN)            24.32e−4       24.54e−4    10.37%          10.72%
784-128-64-32 (MN)            23.07e−4       23.47e−4    10.15%          10.82%
784-256-128-64 (GN)           24.21e−4       24.30e−4    10.30%          10.37%
784-256-128-64 (MN)           22.82e−4       23.07e−4    10.21%          10.45%
784-1024-512-256-128 (GN)     24.23e−4       24.29e−4    10.01%          10.09%
784-1024-512-256-128 (MN)     22.84e−4       22.89e−4     9.97%          10.05%

3.5 Complex-valued deep belief networks

learning. As it turns out, this unsupervised pretraining of deep neural networks allows them to achieve better results than random weight initialization.

Following in the footsteps of these papers, and also taking into account the success of complex-valued convolutional neural networks for real-valued image classification demonstrated in Section 3.1, we introduce complex-valued deep belief networks. The presentation in this section follows that in the author’s paper [170].

3.5.1 Model formulation

Restricted Boltzmann Machines (RBMs) are part of the larger family of energy-based models, which associate a scalar energy to each configuration of the variables of interest. Learning corresponds to modifying the energy function so that it has the desired properties; usually, we want the energy of observed configurations to be as low as possible. The following deduction of the properties of complex-valued RBMs follows that of [15] for the real-valued case.

For Boltzmann Machines (BMs), the energy function is linear in its free parameters. To increase the representational power of Boltzmann Machines, so that they can represent more complicated input distributions, some variables are considered to never be observed, and are therefore called hidden variables. Restricted Boltzmann Machines restrict the BM model by not allowing visible-visible and hidden-hidden connections.

If we denote the visible variables by $v$ and the hidden variables by $h$, then, in the case of complex-valued RBMs, the energy function is defined by
\begin{align}
E(v, h) ={}& -(b^H v)^R - (c^H h)^R - (h^H W v)^R \nonumber \\
={}& -(b^R)^T v^R - (b^I)^T v^I - (c^R)^T h^R - (c^I)^T h^I \nonumber \\
& - (h^R)^T (W v)^R - (h^I)^T (W v)^I \nonumber \\
={}& -(b^R)^T v^R - (b^I)^T v^I - (c^R)^T h^R - (c^I)^T h^I \nonumber \\
& - (v^R)^T (W^H h)^R - (v^I)^T (W^H h)^I, \tag{3.5.1}
\end{align}
where $z^R$ and $z^I$ denote the real and imaginary parts, respectively, of a complex-valued matrix $z$, $z^H$ is the Hermitian (complex-conjugate) transpose of the matrix $z$, $a^T$ is the transpose of the real-valued matrix $a$, and we used the property that $(x^H y)^R = (y^H x)^R$. Also, $W$ represents the weights connecting the visible and hidden layers, $b$ represents the bias of the visible layer, and $c$ represents the bias of the hidden layer.
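As a quick sanity check of this decomposition, the following NumPy sketch (illustrative only, not taken from [170]; all names and sizes are arbitrary) compares the direct complex-valued form of the energy with the real-valued form from (3.5.1):

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4                                         # arbitrary visible/hidden sizes

# random complex-valued parameters and states
W = rng.normal(size=(n_h, n_v)) + 1j * rng.normal(size=(n_h, n_v))
b = rng.normal(size=n_v) + 1j * rng.normal(size=n_v)    # visible bias
c = rng.normal(size=n_h) + 1j * rng.normal(size=n_h)    # hidden bias
v = rng.normal(size=n_v) + 1j * rng.normal(size=n_v)
h = rng.normal(size=n_h) + 1j * rng.normal(size=n_h)

# direct complex form: E = -(b^H v)^R - (c^H h)^R - (h^H W v)^R
E_complex = -(np.vdot(b, v).real + np.vdot(c, h).real + np.vdot(h, W @ v).real)

# real-valued decomposition from (3.5.1)
Wv_R = W.real @ v.real - W.imag @ v.imag                # (W v)^R
Wv_I = W.real @ v.imag + W.imag @ v.real                # (W v)^I
E_real = -(b.real @ v.real + b.imag @ v.imag
           + c.real @ h.real + c.imag @ h.imag
           + h.real @ Wv_R + h.imag @ Wv_I)

assert np.isclose(E_complex, E_real)
```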

With the above notations, the probability distribution of complex-valued RBMs can be defined as
\begin{align*}
P(v) = \sum_h P(v, h) = \sum_h \frac{e^{-E(v, h)}}{Z},
\end{align*}
where $Z$ is called the partition function, by analogy with physical systems.

If we denote by
\begin{align*}
F(v) = -\log \sum_h e^{-E(v, h)},
\end{align*}
which is called the free energy (a notion also inspired from physics), we have that
\begin{align*}
P(v) = \frac{e^{-F(v)}}{Z},
\end{align*}
and $Z$ is given by
\begin{align*}
Z = \sum_v e^{-F(v)}.
\end{align*}
Now, from (3.5.1), we have that

\begin{align}
F(v) ={}& -(b^R)^T v^R - (b^I)^T v^I \nonumber \\
& - \sum_i \log \sum_{h_i^R} e^{h_i^R \left(c_i^R + (W_i v)^R\right)} - \sum_i \log \sum_{h_i^I} e^{h_i^I \left(c_i^I + (W_i v)^I\right)} \nonumber \\
={}& -(b^R)^T v^R - (b^I)^T v^I \nonumber \\
& - \sum_i \log \sum_{h_i^R} e^{h_i^R \left(c_i^R + W_i^R v^R - W_i^I v^I\right)} - \sum_i \log \sum_{h_i^I} e^{h_i^I \left(c_i^I + W_i^R v^I + W_i^I v^R\right)}, \tag{3.5.2}
\end{align}

where $h_i$ is the $i$th element of the vector $h$ and $W_i$ is the $i$th row of the matrix $W$.

We can also obtain expressions for the conditional probabilities $P(h^R|v)$ and $P(h^I|v)$:
\begin{align*}
P(h^R|v) &= \frac{e^{(b^R)^T v^R + (b^I)^T v^I + (c^R)^T h^R + (c^I)^T h^I + (h^R)^T (W v)^R + (h^I)^T (W v)^I}}{\sum_{\widetilde{h}^R} e^{(b^R)^T v^R + (b^I)^T v^I + (c^R)^T \widetilde{h}^R + (c^I)^T h^I + (\widetilde{h}^R)^T (W v)^R + (h^I)^T (W v)^I}} \\
&= \frac{\prod_i e^{c_i^R h_i^R + h_i^R (W_i v)^R}}{\prod_i \sum_{\widetilde{h}_i^R} e^{c_i^R \widetilde{h}_i^R + \widetilde{h}_i^R (W_i v)^R}} \\
&= \prod_i \frac{e^{h_i^R \left(c_i^R + (W_i v)^R\right)}}{\sum_{\widetilde{h}_i^R} e^{\widetilde{h}_i^R \left(c_i^R + (W_i v)^R\right)}} \\
&= \prod_i P(h_i^R | v), \\
P(h^I|v) &= \frac{e^{(b^R)^T v^R + (b^I)^T v^I + (c^R)^T h^R + (c^I)^T h^I + (h^R)^T (W v)^R + (h^I)^T (W v)^I}}{\sum_{\widetilde{h}^I} e^{(b^R)^T v^R + (b^I)^T v^I + (c^R)^T h^R + (c^I)^T \widetilde{h}^I + (h^R)^T (W v)^R + (\widetilde{h}^I)^T (W v)^I}} \\
&= \frac{\prod_i e^{c_i^I h_i^I + h_i^I (W_i v)^I}}{\prod_i \sum_{\widetilde{h}_i^I} e^{c_i^I \widetilde{h}_i^I + \widetilde{h}_i^I (W_i v)^I}} \\
&= \prod_i \frac{e^{h_i^I \left(c_i^I + (W_i v)^I\right)}}{\sum_{\widetilde{h}_i^I} e^{\widetilde{h}_i^I \left(c_i^I + (W_i v)^I\right)}} \\
&= \prod_i P(h_i^I | v).
\end{align*}

This means that the hidden neurons are conditionally independent of one another given the visible neurons; the analogous statement holds for the visible neurons given the hidden ones, as shown below.

If we assume that $h_i^R, h_i^I \in \{0, 1\}$, we obtain
\begin{align*}
P(h_i^R = 1 | v) &= \frac{e^{c_i^R + (W_i v)^R}}{1 + e^{c_i^R + (W_i v)^R}} = \sigma\!\left(c_i^R + (W_i v)^R\right) = \sigma\!\left(c_i^R + W_i^R v^R - W_i^I v^I\right), \\
P(h_i^I = 1 | v) &= \frac{e^{c_i^I + (W_i v)^I}}{1 + e^{c_i^I + (W_i v)^I}} = \sigma\!\left(c_i^I + (W_i v)^I\right) = \sigma\!\left(c_i^I + W_i^R v^I + W_i^I v^R\right),
\end{align*}
where $\sigma$ is the real-valued sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. The above expressions, together with (3.5.2), prove that a complex-valued RBM can be implemented using only real-valued operations. This is important because the computational frameworks used in the deep learning domain mainly deal with real-valued operations.
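To illustrate this point, a minimal NumPy sketch of the hidden-unit conditionals written with real-valued operations only is given below; the function and variable names are hypothetical, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical helper: P(h^R = 1 | v) and P(h^I = 1 | v) for a binary
# complex-valued RBM, using only real-valued arrays.
# W_R, W_I have shape (n_hidden, n_visible); c_R, c_I are the hidden biases.
def hidden_probs(v_R, v_I, W_R, W_I, c_R, c_I):
    Wv_R = W_R @ v_R - W_I @ v_I      # real part of W v
    Wv_I = W_R @ v_I + W_I @ v_R      # imaginary part of W v
    return sigmoid(c_R + Wv_R), sigmoid(c_I + Wv_I)
```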

Because of the symmetry in the expression of the energy function between the visible and hidden neurons, and assuming that $v_j^R, v_j^I \in \{0, 1\}$, the following relations can also be deduced:
\begin{align*}
P(v^R | h) &= \prod_j P(v_j^R | h), \\
P(v^I | h) &= \prod_j P(v_j^I | h), \\
P(v_j^R = 1 | h) &= \sigma\!\left(b_j^R + (W_j^H h)^R\right) = \sigma\!\left(b_j^R + (W_j^R)^T h^R + (W_j^I)^T h^I\right), \\
P(v_j^I = 1 | h) &= \sigma\!\left(b_j^I + (W_j^H h)^I\right) = \sigma\!\left(b_j^I + (W_j^R)^T h^I - (W_j^I)^T h^R\right),
\end{align*}
where $W_j$ denotes the $j$th column of the matrix $W$.
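The visible-unit conditionals translate to real-valued operations in the same way, with transposed weight matrices and a sign flip coming from the Hermitian transpose $W^H$; again, an illustrative sketch with hypothetical names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical helper: P(v^R = 1 | h) and P(v^I = 1 | h) for a binary
# complex-valued RBM; W_R, W_I have shape (n_hidden, n_visible),
# b_R, b_I are the visible biases.
def visible_probs(h_R, h_I, W_R, W_I, b_R, b_I):
    WHh_R = W_R.T @ h_R + W_I.T @ h_I     # real part of W^H h
    WHh_I = W_R.T @ h_I - W_I.T @ h_R     # imaginary part of W^H h
    return sigmoid(b_R + WHh_R), sigmoid(b_I + WHh_I)
```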

The free energy of an RBM with binary neurons can be further simplified to
\begin{align*}
F(v) = -(b^R)^T v^R - (b^I)^T v^I - \sum_i \log\!\left(1 + e^{c_i^R + W_i^R v^R - W_i^I v^I}\right) - \sum_i \log\!\left(1 + e^{c_i^I + W_i^R v^I + W_i^I v^R}\right).
\end{align*}
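The simplified free energy can likewise be evaluated with real-valued operations; a minimal sketch with hypothetical names, where $\log(1 + e^x)$ is computed via `logaddexp` for numerical stability:

```python
import numpy as np

# Hypothetical helper: free energy of a binary complex-valued RBM,
# computed with real-valued operations only (cf. the formula above).
def free_energy(v_R, v_I, W_R, W_I, b_R, b_I, c_R, c_I):
    Wv_R = W_R @ v_R - W_I @ v_I          # (W v)^R
    Wv_I = W_R @ v_I + W_I @ v_R          # (W v)^I
    return (-(b_R @ v_R) - (b_I @ v_I)
            - np.sum(np.logaddexp(0.0, c_R + Wv_R))    # log(1 + e^x)
            - np.sum(np.logaddexp(0.0, c_I + Wv_I)))
```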

Samples from the distribution $P(x)$ can be obtained by running a Markov chain to convergence, using the Gibbs sampling procedure as the transition operator. Gibbs sampling for $N$ joint random variables $S = (S_1, \ldots, S_N)$ is done in a sequence of $N$ sampling steps of the form $S_i \sim P(S_i | S_{-i})$, where $S_{-i}$ denotes the other $N - 1$ variables, excluding $S_i$. In the case of an RBM, this means that we first sample $h^R, h^I$ from $P(h^R | v), P(h^I | v)$, and then sample $v^R, v^I$ from $P(v^R | h), P(v^I | h)$. By repeating this procedure a sufficient number of times, it is guaranteed that $(v, h)$ is an accurate sample of $P(v, h)$. This would, however, be very computationally expensive, and so different algorithms have been devised to sample from $P(v, h)$ efficiently during learning.

One such algorithm is contrastive divergence, which we use in our experiments. It is based on two ideas to speed up the sampling process. First, because we want to have $P(v) \approx P_{\text{train}}(v)$, i.e., the true distribution of the training data, we initialize the Markov chain described above with a training sample, which speeds up convergence. The second idea is that contrastive divergence does not wait for the Markov chain to converge, but only performs $k$ steps of Gibbs sampling. Surprisingly, $k = 1$ gives good results in practice.
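To make the procedure concrete, here is a sketch of what a single CD-1 update could look like for the binary complex-valued RBM, written entirely with real-valued operations. The weight updates follow from differentiating (3.5.1) with respect to $W^R$ and $W^I$; the use of probabilities instead of samples in some of the statistics is a common heuristic rather than something prescribed by the text, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sketch: one contrastive divergence (k = 1) update for a binary
# complex-valued RBM, expressed with real-valued operations only.
# W_R, W_I: (n_hidden, n_visible); b_*: visible biases; c_*: hidden biases.
def cd1_step(v0_R, v0_I, W_R, W_I, b_R, b_I, c_R, c_I, lr=0.01):
    # positive phase: hidden probabilities and samples given the training sample
    ph0_R = sigmoid(c_R + W_R @ v0_R - W_I @ v0_I)
    ph0_I = sigmoid(c_I + W_R @ v0_I + W_I @ v0_R)
    h0_R = (rng.random(ph0_R.shape) < ph0_R).astype(float)   # binary samples
    h0_I = (rng.random(ph0_I.shape) < ph0_I).astype(float)

    # negative phase: one Gibbs step -- reconstruct v, then recompute hidden probs
    pv1_R = sigmoid(b_R + W_R.T @ h0_R + W_I.T @ h0_I)
    pv1_I = sigmoid(b_I + W_R.T @ h0_I - W_I.T @ h0_R)
    ph1_R = sigmoid(c_R + W_R @ pv1_R - W_I @ pv1_I)
    ph1_I = sigmoid(c_I + W_R @ pv1_I + W_I @ pv1_R)

    # updates: positive minus negative statistics, from -dE/dtheta with E as in (3.5.1)
    W_R += lr * ((np.outer(ph0_R, v0_R) + np.outer(ph0_I, v0_I))
                 - (np.outer(ph1_R, pv1_R) + np.outer(ph1_I, pv1_I)))
    W_I += lr * ((np.outer(ph0_I, v0_R) - np.outer(ph0_R, v0_I))
                 - (np.outer(ph1_I, pv1_R) - np.outer(ph1_R, pv1_I)))
    b_R += lr * (v0_R - pv1_R)
    b_I += lr * (v0_I - pv1_I)
    c_R += lr * (ph0_R - ph1_R)
    c_I += lr * (ph0_I - ph1_I)
    return W_R, W_I, b_R, b_I, c_R, c_I
```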

Now that we have all the ingredients for constructing and training a complex-valued RBM, we can stack several complex-valued RBMs to form a complex-valued deep belief network.

This type of network is trained one layer at a time, using the greedy layer-wise procedure [16].

After training the first layer to model the input, each subsequent layer is trained as a complex-valued RBM to model the outputs of the previous layer. After learning the weights of all the RBMs in the deep belief network, a logistic regression layer is added on top of the last RBM, thus forming a complex-valued deep neural network. This network can then be fine-tuned in a supervised manner using gradient-based methods. The deep belief network is thus used to initialize the parameters of the deep neural network.

3.5.2 Experimental results

In the experiments, we use real-valued and complex-valued deep belief networks for the unsupervised pretraining of deep neural networks. The 28×28 images of the MNIST dataset were linearized into 784-dimensional vectors, which constitute the inputs of the networks. The number of neurons in every hidden layer of the complex-valued networks is given in the first column of Table 3.12. The real-valued networks had 1.41 times more neurons in their hidden layers, to ensure the same number of real parameters between the two types of networks, and thus a fair comparison. Each layer was trained in an unsupervised manner, and then the learned weights were used to initialize the weights of the deep networks. These networks were then fine-tuned using stochastic gradient descent.
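The factor of $1.41 \approx \sqrt{2}$ follows from a simple parameter count (a rough check, assuming fully connected layers and ignoring biases): each complex weight carries two real parameters, so widening every hidden layer of the real-valued network by $\sqrt{2}$ matches the totals for the hidden-to-hidden connections,
\begin{align*}
\underbrace{2 \cdot n_{\text{in}}\, n_{\text{out}}}_{\text{real parameters of a complex-valued layer}}
= \underbrace{\left(\sqrt{2}\, n_{\text{in}}\right) \cdot \left(\sqrt{2}\, n_{\text{out}}\right)}_{\text{real parameters of the widened real-valued layer}},
\end{align*}
while the fixed 784-dimensional input layer makes the match only approximate for the first layer.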

The number of pretraining epochs was 100, and the number of fine-tuning epochs was 50. The learning rate was 0.01 for the unsupervised learning and 0.1 for the supervised learning.

The experimental results on the MNIST dataset are given in Table 3.12. It can be seen from the table that we tested different architectures, and the results were consistent: complex-valued neural networks pretrained using complex-valued deep belief networks attained better classification results than real-valued networks pretrained using their real-valued counterparts.

Table 3.12: Experimental results for MNIST

Hidden layer sizes            Real-valued error   Complex-valued error
1000                          1.42%               1.40%
1000,500                      1.36%               1.32%
1000,1000                     1.38%               1.23%
1000,1000,1000                1.31%               1.20%
1000,1000,1000,1000           1.38%               1.37%
2000,1500,1000,500            1.64%               1.34%
2500,2000,1500,1000,500       1.45%               1.31%