# Examples
## Introduction
### Biological Neuron and Artificial Neuron
Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns [1].
The perceptron is a mathematical model of a biological neuron.
Fig.1: Biological neuron versus artificial neuron (perceptron).
### Learning scheme in Brain and Neural Network
Animals, including humans, change their behavior through experience. It is said that the brain has three types of learning systems: supervised learning, reinforcement learning, and unsupervised learning.
Fig.2: Learning scheme in the brain.
## Supervised Learning
[view on github](https://github.com/Kashu7100/Qualia2.0/tree/master/examples/supervised_learning)
*Supervised Learning* is a machine learning technique that expects a model to learn the input-to-label mapping of data, where an input $x$ and the label $t$ associated with that input are given.
The objective of supervised learning is to estimate the data generation probability $P_{data}$ from the experimental probability $\hat{P}_{data}$ given by the dataset.
This is done by minimizing the error between $P_{data}$ and the output from the model $f(x; \theta)$ with parameter $\theta$. In practice, the experimental probability $\hat{P}_{data}$ is used to train the model.
### Decision Boundary
Neural networks can be viewed as universal function approximators. Let's use a simple dataset called Spiral to see how a neural network can learn a non-linear decision boundary.
```python
from qualia2.data.basic import Spiral
from qualia2.nn.modules import Module, Linear
from qualia2.functions import tanh, mse_loss
from qualia2.nn.optim import Adadelta
import matplotlib.pyplot as plt
data = Spiral()
data.batch = 100
class MLP(Module):
    def __init__(self):
        super().__init__()
        self.l1 = Linear(2, 15)
        self.l2 = Linear(15, 3)

    def forward(self, x):
        x = tanh(self.l1(x))
        x = tanh(self.l2(x))
        return x

mlp = MLP()
optim = Adadelta(mlp.params)

# train the model
losses = []
for _ in range(3000):
    for feat, label in data:
        out = mlp(feat)
        loss = mse_loss(out, label)
        optim.zero_grad()
        loss.backward()
        optim.step()
        losses.append(loss.asnumpy())

# plot the training loss
plt.plot(range(len(losses)), losses)
plt.show()

# show the decision boundary
data.plot_decision_boundary(mlp)
```
We can see that the training loss gradually decreases.
Here is the obtained decision boundary:
### FashionMNIST with GRU
RNNs are often utilized for language modeling or time-series prediction; however, they can also be used for image recognition tasks.
The GRU model reads the image one row at a time, assuming that the hidden state of the GRU will accumulate a context of the whole image.
Below is the visualization for the FashionMNIST dataset.
```python
import qualia2
from qualia2.core import *
from qualia2.functions import tanh, softmax_cross_entropy, transpose
from qualia2.nn import Module, GRU, Linear, Adadelta
from qualia2.data import FashionMNIST
from qualia2.util import Trainer
import matplotlib.pyplot as plt
import os
path = os.path.dirname(os.path.abspath(__file__))
batch = 100

def data_trans(data):
    # feed the 28x28 image row by row: (batch, 28, 28) -> (seq_len, batch, 28)
    data = data.reshape(-1, 28, 28)
    return transpose(data, (1, 0, 2)).detach()

class GRU_classifier(Module):
    def __init__(self):
        super().__init__()
        self.gru = GRU(28, 128, 1)
        self.linear = Linear(128, 10)

    def forward(self, x, h0=qualia2.zeros((1, batch, 128))):
        _, hx = self.gru(x, h0)
        out = self.linear(hx[-1])
        return out

model = GRU_classifier()
optim = Adadelta(model.params)
mnist = FashionMNIST()
trainer = Trainer(batch, path)
trainer.set_data_transformer(data_trans)
trainer.train(model, mnist, optim, softmax_cross_entropy, 100)
```
Training loss:
```bash
[*] test acc: 89.33%
```
> **Note:** the same model can achieve about 99% test accuracy on the MNIST dataset.
### Image Classification with AlexNet
Qualia2 provides pretrained computer vision models.
```python
from qualia2.vision import AlexNet, imagenet_labels
import qualia2.vision.transforms as transforms
import PIL
import numpy
img = PIL.Image.open("./dog.jpg")

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize()
])

input = preprocess(img)

model = AlexNet(pretrained=True)
model.eval()

output = model(input).asnumpy()
# indices of the top 5 predictions in descending order
top5 = output.argsort()[:, -5:][:, ::-1]
for i, candidates in enumerate(top5):
    for idx in candidates:
        print('{}: {:.2f}%'.format(imagenet_labels[idx], output[i, idx]*100))
```
Here are the top 5 predictions of the pretrained AlexNet on the given image.
```bash
Labrador retriever: 53.45%
Saluki, gazelle hound: 18.93%
golden retriever: 15.33%
borzoi, Russian wolfhound: 2.79%
kuvasz: 1.94%
```
## Unsupervised Learning
[view on github](https://github.com/Kashu7100/Qualia2.0/tree/master/examples/unsupervised_learning)
*Unsupervised learning* is a machine learning technique that expects a model to learn patterns in the input data $x$.
Unsupervised learning, such as Hebbian learning or self-organization, has been heavily utilized by living creatures. In general, an unsupervised system is better than a supervised system at finding new patterns or features in the inputs.
In 1949, Donald O. Hebb argued that:
> "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." - Organization of Behavior (1949).
This rule is called Hebbian learning, and this synaptic plasticity is thought to be a basic mechanism underlying learning and memory.
### Hebbian learning
Hebb's rule is often generalized as:

$$\Delta w_i = \eta\, x_i\, y$$

where $\eta$ is the learning rate, $x_i$ is the $i$-th input, and $y$ is the postsynaptic response. This version of the rule is clearly unstable: in any network with a dominant signal, the synaptic weights will increase or decrease exponentially.
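To make this instability concrete, here is a minimal NumPy sketch (illustrative code, not part of Qualia2): with the plain Hebbian update on zero-mean inputs, the weight norm grows without bound.

```python
# Plain Hebbian update: dw = eta * y * x with y = w.x (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.standard_normal((1000, 2))   # zero-mean inputs
w = rng.standard_normal(2) * 0.01         # small initial weights
eta = 0.01

for x in x_data:
    y = w @ x                  # linear neuron output
    w += eta * y * x           # strengthen co-active input-output pairs

print(np.linalg.norm(w))       # the weight norm keeps growing without bound
```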
### Oja's learning rule
Oja's rule solves the stability problem of Hebb's rule and yields an algorithm for principal component analysis. This is a computational form of an effect which is believed to occur in biological neurons:

$$\Delta w_i = \eta\, y\,(x_i - y\, w_i)$$
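A minimal NumPy sketch (again illustrative, not part of Qualia2) of Oja's rule on correlated 2-D data; the weight vector stays bounded and aligns with the first principal component.

```python
# Oja's rule: dw = eta * y * (x - y * w) (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((5000, 1))
x_data = np.hstack([base, base]) + 0.1*rng.standard_normal((5000, 2))  # PC ~ [1, 1]/sqrt(2)

w = rng.standard_normal(2) * 0.01
eta = 0.01
for x in x_data:
    y = w @ x
    w += eta * y * (x - y*w)   # the decay term keeps ||w|| bounded (-> 1)

print(w / np.linalg.norm(w))   # approx. +/-[0.707, 0.707], the first principal component
```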
### Generalized Hebbian Algorithm (Sanger's rule)
This is similar to Oja's rule in its formulation and stability, except it can be applied to networks with multiple outputs:

$$\Delta w_{ij} = \eta\, y_i \Bigl(x_j - \sum_{k \le i} y_k\, w_{kj}\Bigr)$$
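A corresponding NumPy sketch (illustrative, not part of Qualia2) with two output units; each additional unit learns the next principal direction.

```python
# Sanger's rule in matrix form: dW = eta * (y x^T - lower_tri(y y^T) W)
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.standard_normal((5000, 3)) * np.array([3.0, 2.0, 0.5])  # axis-aligned PCs

W = rng.standard_normal((2, 3)) * 0.01   # two output neurons, three inputs
eta = 0.001
for _ in range(10):                      # a few passes over the data
    for x in x_data:
        y = W @ x
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

print(np.round(W, 2))  # rows approach +/- the two leading principal directions
```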
### Autoencoders
Autoencoders learn a given distribution by comparing their input to their output, which is useful for learning hidden representations of the data.
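As a minimal sketch that assumes only the Qualia2 APIs already used on this page (`Module`, `Linear`, `tanh`, `mse_loss`, `Adadelta`), an autoencoder simply uses its own input as the training target; the 784 → 32 → 784 shapes (e.g. flattened MNIST images) are illustrative.

```python
from qualia2.nn.modules import Module, Linear
from qualia2.functions import tanh, mse_loss
from qualia2.nn.optim import Adadelta

class AutoEncoder(Module):
    def __init__(self):
        super().__init__()
        self.enc = Linear(784, 32)   # encoder: compress the input to a 32-dim code
        self.dec = Linear(32, 784)   # decoder: reconstruct the input from the code

    def forward(self, x):
        code = tanh(self.enc(x))
        return tanh(self.dec(code))

model = AutoEncoder()
optim = Adadelta(model.params)

def train_step(x):
    # the input itself is the target: minimize the reconstruction error
    out = model(x)
    loss = mse_loss(out, x)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss
```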
To explore the identification of chaotic dynamics evolving on a finite dimensional attractor, let's consider the nonlinear Lorenz system:

$$\dot{x} = \sigma (y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z$$

with $\sigma = 10$, $\rho = 28$, and $\beta = 8/3$.
Here is the code for Lorenz system simulation:
```python
from scipy.integrate import odeint
import numpy as np
def lorenz(u, t):
    x, y, z = u
    sigma = 10.0
    beta = 8.0/3.0
    rho = 28.0
    dxdt = sigma*(y-x)
    dydt = x*(rho-z)-y
    dzdt = x*y-beta*z
    return np.array([dxdt, dydt, dzdt])

dt = 0.01
t = np.arange(0, 25, dt)
u0 = np.array([-8.0, 7.0, 27.0])

# integrate the Lorenz system to generate the training trajectory
u = odeint(lorenz, u0, t)
```
The trapezoidal rule is a numerical method for solving ordinary differential equations that approximates solutions to initial value problems of the form:

$$\dot{u} = f(u, t), \qquad u(t_0) = u_0$$

The trapezoidal rule states that:

$$u_{n+1} = u_n + \frac{1}{2}\Delta t\,\bigl(f(u_{n+1}, t_{n+1}) + f(u_n, t_n)\bigr)$$

The model will be trained so that the trapezoidal rule is satisfied: the LHS, rearranged as $2(u_{n+1} - u_n)$, will be the target, and the RHS will be the sum of the outputs from the model multiplied by the time step.
```python
import qualia2
from qualia2 import Tensor
from qualia2.nn import Linear, Module
from qualia2.functions import tanh, mse_loss
from qualia2.nn.optim import Adadelta
from qualia2.core import *
class Model(Module):
    def __init__(self):
        super().__init__()
        self.linear1 = Linear(3, 256)
        self.linear2 = Linear(256, 3)

    def forward(self, s):
        s = tanh(self.linear1(s))
        return self.linear2(s)

# train the net with the trapezoidal rule:
# u_t = u_t1 + 1/2*dt*(f(u_t)+f(u_t1))
def train(model, optim, criteria, u, dt=0.01, epochs=2000):
    u_t1 = u[:-1]
    u_t = u[1:]
    for e in range(epochs):
        for b in range(len(u)//100):
            target = Tensor(2*(u_t[b*100:(b+1)*100] - u_t1[b*100:(b+1)*100]))
            output = dt*(model(Tensor(u_t[b*100:(b+1)*100])) + model(Tensor(u_t1[b*100:(b+1)*100])))
            loss = criteria(output, target)
            optim.zero_grad()
            loss.backward()
            optim.step()

model = Model()
optim = Adadelta(model.params)
train(model, optim, mse_loss, u, dt)

# integrate the learned dynamics from the same initial condition
def f(u, t):
    return model(qualia2.array(u)).asnumpy()

learned_u = odeint(f, u0, t)
```
The following is the obtained result:
### Generative Adversarial Networks (GANs)
GANs utilize two networks called the Generator and the Discriminator. The Discriminator measures the distance between the generated data and the real data. The Generator tries to generate data that the Discriminator cannot distinguish from the real data.
Here is the example with MNIST:
```python
import qualia2
from qualia2.core import *
from qualia2.functions import binary_cross_entropy
from qualia2.nn import Sequential, Linear, Sigmoid, Tanh, Adam
from qualia2.data import MNIST
import matplotlib.pyplot as plt
import os

path = os.path.dirname(os.path.abspath(__file__))
g = Sequential(
    Linear(50, 128),
    Tanh(),
    Linear(128, 256),
    Tanh(),
    Linear(256, 512),
    Tanh(),
    Linear(512, 784),
    Sigmoid()
)

d = Sequential(
    Linear(784, 512),
    Tanh(),
    Linear(512, 256),
    Tanh(),
    Linear(256, 128),
    Tanh(),
    Linear(128, 1),
    Sigmoid()
)

def savefig(g, noise, filename):
    # generate a 10x10 grid of samples from fixed noise
    g.eval()
    fake_img = g(noise)
    for c in range(10):
        for r in range(10):
            plt.subplot(10, 10, r+c*10+1)
            plt.xticks([])
            plt.yticks([])
            plt.grid(False)
            img = fake_img.data[r+c*10].reshape(28, 28)
            plt.imshow(to_cpu(img) if gpu else img, cmap='gray', interpolation='nearest')
    plt.savefig(filename)

# train
batch = 100
epochs = 200
z_dim = 50
smooth = 0.1

optim_g = Adam(g.params, 0.0004, (0.5, 0.999))
optim_d = Adam(d.params, 0.0002, (0.5, 0.999))
criteria = binary_cross_entropy

mnist = MNIST(flatten=True)
mnist.batch = batch

target_real = qualia2.ones((batch, 1))
target_fake = qualia2.zeros((batch, 1))
check_noise = qualia2.randn(batch, z_dim)

for epoch in range(epochs):
    for i, (data, _) in enumerate(mnist):
        d.train()
        g.train()
        noise = qualia2.randn(batch, z_dim)
        fake_img = g(noise)

        # update Discriminator
        # feed fake images
        output_fake = d(fake_img.detach())
        loss_d_fake = criteria(output_fake, target_fake)
        # feed real images (with one-sided label smoothing)
        output_real = d(data)
        loss_d_real = criteria(output_real, target_real*(1-smooth))
        loss_d = loss_d_fake + loss_d_real
        optim_d.zero_grad()
        loss_d.backward()
        optim_d.step()

        # update Generator
        d.eval()
        output = d(fake_img)
        loss_g = criteria(output, target_real)
        optim_g.zero_grad()
        loss_g.backward()
        optim_g.step()

    savefig(g, check_noise, path+'/gan_epoch{}.png'.format(epoch))
```
The obtained result:
## Reinforcement Learning
[view on github](https://github.com/Kashu7100/Qualia2.0/tree/master/examples/reinforcement_learning)
*Reinforcement Learning* is a machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences, assuming a Markov Decision Process (MDP). Reinforcement learning is named after operant conditioning, a method of learning that occurs through rewards and punishments for behavior, described by B. F. Skinner.
Fig.2: Learning scheme for reinforcement learning assuming MDP.
### Markov property
A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state. That is, the state $s_{t+1}$ and reward $r_{t+1}$ at time $t+1$ depend only on the present state $s_t$ and the action $a_t$:

$$P(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1}, r_{t+1} \mid s_t, a_t)$$
### Value function
The **state value function** $V^{\pi}(s)$ under a policy $\pi$ is the expectation value of the total discounted reward, or gain, $G_t$ at a given state $s$:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\bigl[G_t \mid s_t = s\bigr], \qquad G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}$$

Similarly, the expectation value of the total discounted reward at a given state $s$ and an action $a$ is represented by the **action value function** $Q^{\pi}(s, a)$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\bigl[G_t \mid s_t = s, a_t = a\bigr]$$

Among all possible value functions, there exists an optimal value function $V^{*}(s)$ that has a higher value than the other functions for all states:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s)$$

The optimal policy $\pi^{*}$ that corresponds to the optimal value function is:

$$\pi^{*} = \arg\max_{\pi} V^{\pi}(s)$$

In a similar manner, the optimal action value function and the corresponding optimal policy are:

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad \pi^{*} = \arg\max_{a} Q^{*}(s, a)$$
### Bellman equation
From the linearity of the expectation $\mathbb{E}$, the value function can be expressed as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\bigl[r_{t+1} + \gamma\, G_{t+1} \mid s_t = s\bigr]$$

If we express the expected reward that we receive when starting in state $s$, taking action $a$, and moving into state $s'$ as:

$$R(s, a, s') = \mathbb{E}\bigl[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\bigr]$$

the value function can therefore be expressed as follows. This is the Bellman equation for the state value function under a policy $\pi$:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V^{\pi}(s')\bigr]$$
The Bellman equation for the action value function can be derived in a similar way.
### TD error
The Bellman equation requires knowledge of the transition probability $P$, which is unknown for most tasks, in order to find the value. This can be resolved by utilizing the experience from trial and error:

$$V(s_t) \leftarrow V(s_t) + \alpha\,\bigl[r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)\bigr]$$

The term $\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)$ is called the Temporal Difference (TD) error. When the training converges, the TD error is expected to approach zero.
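As a minimal illustration (plain NumPy, not Qualia2), tabular TD(0) nudges a value table by the TD error after every observed transition; the helper name `td_update` is just for this sketch.

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)          # tabular state values
gamma, alpha = 0.99, 0.1        # discount factor and learning rate

def td_update(s, r, s_next, done):
    # TD error: delta = r + gamma*V(s') - V(s), with no bootstrap at terminal states
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]
    V[s] += alpha * delta       # move V(s) toward the bootstrapped target
    return delta                # approaches zero as learning converges
```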
### Dopamine neurons and TD error signal
Fig.3: Firing of dopamine neurons and its correspondence with the TD error [1,2].
In the first case, an unpredicted reward (R) occurs, and a burst of dopamine firing follows. In the second case, a predicted reward occurs, and a burst follows the onset of the predictor (CS or conditioned stimulus), but there is no firing after the predicted reward. In the bottom case, a predicted reward is omitted, with a corresponding trough in dopamine responding.
The behavior of the TD error matches the response of the dopamine neurons
shown in the figure. Therefore, the response of dopamine neurons is thought to represent the TD error signal.
### CartPole with DQN
Q-learning updates the action value according to the following equation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\bigl[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\bigr]$$

When the learning converges, the second term of the equation above approaches zero.
Note that if the policy never takes some of the state-action pairs, the action value function for those pairs will never be learned, and learning will not converge properly.
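Before moving on to DQN, here is a minimal tabular Q-learning sketch (plain NumPy, not Qualia2) that implements the update above with epsilon-greedy exploration so that state-action pairs keep being visited; the helper names are only for this sketch.

```python
import numpy as np

n_states, n_actions = 10, 2
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.99, 0.1, 0.1
rng = np.random.default_rng(0)

def select_action(s):
    # epsilon-greedy: occasional random actions keep all state-action pairs visited
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next, done):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```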
DQN is Q-learning with a deep neural network as a Q function approximator. DQN learns to minimize the TD error with some evaluation function $L(\theta)$.
Generally, the following error is used as the evaluation function:

$$L(\theta) = \mathbb{E}\Bigl[\bigl(r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta)\bigr)^{2}\Bigr]$$
Qualia2 provides the `DQN` (`DQNTrainer`) and `Env` classes for handy testing of DQN. As an example, let's use the [CartPole](https://gym.openai.com/envs/CartPole-v1/) task from Gym. One can visualize the environment with the `Env.show()` method.
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.
```python
from qualia2.rl.envs import CartPole
from qualia2.rl import ReplayMemory
from qualia2.rl.agents import DQNTrainer, DQN
from qualia2.nn.modules import Module, Linear
from qualia2.functions import tanh
from qualia2.nn.optim import Adadelta
class Model(Module):
    def __init__(self):
        super().__init__()
        self.linear1 = Linear(4, 32)
        self.linear2 = Linear(32, 32)
        self.linear3 = Linear(32, 2)

    def forward(self, x):
        x = tanh(self.linear1(x))
        x = tanh(self.linear2(x))
        x = tanh(self.linear3(x))
        return x

env = CartPole()
agent = DQN.init(env, Model())
agent.set_optim(Adadelta)

trainer = DQNTrainer(ReplayMemory, 80, 10000)
agent = trainer.train(env, agent, episodes=200)

# plot rewards
trainer.plot()
# show the learned agent
agent.play(env)
```
Reward Plot:
The obtained result:
---
[1] A Beginner's Guide to Neural Networks and Deep Learning. Online: https://skymind.ai/wiki/neural-network
[2] Schultz, W., et al. (1997). Predictive Reward Signal of Dopamine Neurons. Science 275: 1593-1599
[3] Doya, K. (2007). Reinforcement learning: Computational theory and biological mechanisms. HFSP Journal, 1(1), 30-40. doi:10.2976/1.2732246