An Introduction to Artificial Neural Networks

James (and Jack) McKeown
Machine Learning
  1. Supervised Learning
    • Learning from (input,output) pairs.

  2. Unsupervised Learning
    • Learning about data. (clustering)
    • No "right answer"

  3. Reinforcement Learning
    • Learning behavior from (possibly delayed) rewards and punishments.
What is an Artificial Neural Network (ANN)?
  • A directed acyclic graph of "neurons"
  • Inspired by biological neurons
  • A continuous and differentiable function
What is an artificial neuron?
  • $f:\mathbb{R}^n\rightarrow\mathbb{R}$
  • $f(x) = \sigma(\displaystyle(\sum_{i=1}^n w_ix_i)+b) = \sigma(\vec{w}\cdot\vec{x}+b)$
What is $\sigma$?
  • An "activation" function.
Boolean Operators with a Single Neuron
Training Neural Networks from Data!
Let's say we have a data set of $n$, $(\vec{x},\vec{y})$ pairs which are inputs and desired outputs, respectively.

(Meaning we want $ f(\vec{x}) = \vec{y} $ for each $(\vec{x},\vec{y})$ pair after training)

$ z^l = w^la^{l-1} + b^l $
$ a^l = \sigma(z^l) $
$ (a^0 = x, a^L \approx y)$
There are a few sensible ways to measure how bad our network is. We call this the cost,loss, or error.

For now we will use mean squared error: $ C = \displaystyle \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2 $

Idea: Let's start with a network of random weights and biases and then move them in the direction which reduces the error

Gradient - direction of steepest ascent.
Want to find: $ \dfrac{\partial C}{\partial w^l_{jk}},\dfrac{\partial C}{\partial b^l_j} $

First let's find $\delta^l_j = \dfrac{\partial C}{\partial z^l_j}$ instead
Error in the Last Layer
$ \delta^L_j = \frac{\partial C}{\partial a^L_j}\frac{\partial a^L_j}{\partial z_j^L} = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) $

$ \delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$

$ \delta^L = (a^L - y) \odot \sigma'(z^L)$

Error in the Last Layer
$ \delta^L = \underbrace{(a^L - y)}_{\nabla_{a^L}C} \odot \sigma'(z^L)$

Error in an Arbitrary Layer
$ \delta^l = \underbrace{((w^{l+1})^T \delta^{l+1})}_{\nabla_{a^l}C = \nabla_{z^{l+1}}C \odot \nabla_{a^l}z^{l+1}} \odot \sigma'(z^l)$
Error with respect to a Bias
$\frac{\partial C}{\partial b^l_j} = \delta^l_j$

$\frac{\partial C}{\partial b} = \delta$
Error with respect to a Weight
$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$
Code - Source
# numerical python library makes vector math efficient and easy
import numpy as np
s = lambda z: 1/(1+np.exp(-z)) #sigmoid activation function

X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T

# first weight matrix (3,4) means it's a 3x4 matrix
syn0 = 2*np.random.random((3,4)) - 1
# second weight matrix (4,1) means it's a 4x1 matrix
syn1 = 2*np.random.random((4,1)) - 1

for j in xrange(60000):
    l1 = s(,syn0))
    l2 = s(,syn1))
    l2_delta = (y - l2)*(l2*(1-l2))
    l1_delta = * (l1 * (1-l1))
    syn1 +=
    syn0 +=
What next?

Teach computers to see!

Andrej Karpathy is the man...