Universal Function Approximation using TensorFlow

A neural network with even a single hidden layer can approximate any continuous function to arbitrary accuracy, given enough hidden units. This universal function approximation property of multilayer perceptrons was established by Cybenko (1989) and Hornik (1991). In this post, I will use TensorFlow to implement a multilayer neural network (also known as a multilayer perceptron) to learn arbitrary Python lambda expressions.
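Concretely, Cybenko showed that finite sums of the form

G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(w_j^\top x + b_j)

where \sigma is a sigmoidal activation, can approximate any continuous function on a compact domain to arbitrary accuracy, provided N is large enough. A network with one hidden layer and a linear output unit computes exactly this kind of sum. The (pretend-unknown) function we will ask our network to learn is a plain sine: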

import numpy as np
function_to_learn = lambda x: np.sin(x)

In reality, we rarely see data that isn’t corrupted by noise. Let’s add a little Gaussian noise to our observed values to make things a little harder:

function_to_learn = lambda x: np.sin(x) + 0.1*np.random.randn(*x.shape)

The first thing we do is generate some data to train on:

import math

NUM_EXAMPLES = 1000  # example value: total number of samples to generate
TRAIN_SPLIT = 0.8    # example value: fraction of the samples used for training

np.random.seed(1000)  # for reproducibility
all_x = np.float32(
    np.random.uniform(-2*math.pi, 2*math.pi, (1, NUM_EXAMPLES))).T
np.random.shuffle(all_x)
train_size = int(NUM_EXAMPLES*TRAIN_SPLIT)
trainx = all_x[:train_size]
validx = all_x[train_size:]
trainy = function_to_learn(trainx)
validy = function_to_learn(validx)

This gives us training and validation sets.
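Before touching TensorFlow, it doesn’t hurt to eyeball the data. This throwaway sketch assumes matplotlib is installed and is not part of the model:

import matplotlib.pyplot as plt

plt.scatter(trainx, trainy, s=5, label="train")       # noisy sine samples used for training
plt.scatter(validx, validy, s=5, label="validation")  # held-out samples
plt.legend()
plt.show()

You should see a noisy sine wave spread over [-2\pi, 2\pi].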

First, we create placeholders. TensorFlow placeholders are values that need to be fed when a computation is run through the graph. There are two placeholders in our example:

import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 1], name="X")
Y = tf.placeholder(tf.float32, [None, 1], name="Y")

Naming placeholders and other objects is highly recommended as you define complex networks. Debugging such networks with TensorBoard becomes much easier when the placeholders, constants, variables, operators, etc. are named. From my initial use of the API, it appears you can name pretty much every TF object, which is nice for observability.
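As a small illustration of why names matter: once a Session exists (we create one further down), you can write the graph out for TensorBoard with a single call. The exact location of this API has moved between TensorFlow releases (older versions had tf.train.SummaryWriter), so treat this as a sketch:

writer = tf.summary.FileWriter("./logs", sess.graph)  # named placeholders/ops show up under these names in the Graph tab
# then, from a shell: tensorboard --logdir=./logs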

Now, let’s define our network. The network has a single hidden layer. The hidden layer nodes are connected to a single compute node that represents the output layer.

This network approximates the unknown function f in y = f(x). We will see that, as training progresses, the value predicted by the network, \hat{y}, gets progressively closer to the true value y. To create a layer, we need to create TensorFlow Variables and initialize them appropriately. This initialization is crucial for training to succeed. Without getting into details, we will use a method called ‘Xavier’ or ‘Glorot’ initialization for the weights and zeros for the biases in the network.
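For what it’s worth, for a layer with n_{in} inputs and n_{out} outputs, the Glorot/Xavier scheme draws weights uniformly from \left[-\sqrt{6/(n_{in}+n_{out})},\, +\sqrt{6/(n_{in}+n_{out})}\right], and the bound is conventionally multiplied by 4 when the activation is the logistic sigmoid (as it is here) rather than tanh. That is all the xavier branch of the helper below does.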

def init_weights(shape, init_method='xavier', xavier_params=(None, None)):
    if init_method == 'zeros':
        return tf.Variable(tf.zeros(shape, dtype=tf.float32))
    elif init_method == 'uniform':
        # small random values around zero (not used in this post)
        return tf.Variable(tf.random_uniform(shape, minval=-0.01, maxval=0.01, dtype=tf.float32))
    else:  # xavier
        (fan_in, fan_out) = xavier_params
        low = -4*np.sqrt(6.0/(fan_in + fan_out))  # scale factor: 4 for sigmoid, 1 for tanh
        high = 4*np.sqrt(6.0/(fan_in + fan_out))
        return tf.Variable(tf.random_uniform(shape, minval=low, maxval=high, dtype=tf.float32))

Now that we can create a layer, let’s write down the entire model.

def model(X, num_hidden=10):
    # hidden layer: 1 input feature -> num_hidden sigmoid units
    w_h = init_weights([1, num_hidden], 'xavier', xavier_params=(1, num_hidden))
    b_h = init_weights([1, num_hidden], 'zeros')
    h = tf.nn.sigmoid(tf.matmul(X, w_h) + b_h)

    # output layer: num_hidden units -> a single linear output
    w_o = init_weights([num_hidden, 1], 'xavier', xavier_params=(num_hidden, 1))
    b_o = init_weights([1, 1], 'zeros')
    return tf.matmul(h, w_o) + b_o

That’s it. We can now create a compute node corresponding to the model.

NUM_HIDDEN_NODES = 10  # hidden layer width; 10 (the default in model) is plenty for this toy problem
yhat = model(X, NUM_HIDDEN_NODES)

For the training to happen, we construct a training node:

train_op = tf.train.AdamOptimizer().minimize(tf.nn.l2_loss(yhat - Y))

With just one line of code, we added a fancy optimizer and specified a squared loss function. This is not exactly the best loss you could use, but for this post I’m leaving all tuning and optimization aside. The actual training begins by creating a Session object, initializing the variables, and calling the run function.
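Before we do that, a quick note on the loss: tf.nn.l2_loss(t) computes sum(t ** 2) / 2, so what gets minimized above is a (scaled) sum of squared errors rather than a mean. If you would rather monitor a true mean squared error, a drop-in alternative with the same minimizer would be something like:

loss = tf.reduce_mean(tf.square(yhat - Y))  # mean squared error over the batch
train_op = tf.train.AdamOptimizer().minimize(loss)

Either way, on to the session.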

sess = tf.Session()
sess.run(tf.initialize_all_variables())
# ... then, for each training step:
sess.run(train_op, feed_dict={X: trainx, Y: trainy})

In reality, you would run this training over multiple epochs. When I run this network for, say, 1000 epochs, I see the mean squared error (MSE) drop steadily and eventually converge.
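A minimal version of that loop might look like the sketch below; the full-batch updates and the every-100-epochs validation printout are my choices here, not something dictated by the snippets above.

# build the monitoring op once, outside the loop, so the graph doesn't grow every iteration
mse_op = tf.reduce_mean(tf.square(yhat - Y))

for epoch in range(1000):
    # one full-batch gradient step on the training split
    sess.run(train_op, feed_dict={X: trainx, Y: trainy})
    if epoch % 100 == 0:
        # check how well we generalize to the held-out split
        val_mse = sess.run(mse_op, feed_dict={X: validx, Y: validy})
        print(epoch, val_mse)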

The code for this post can be found here. Feel free to play with it, especially by changing the function to be learned and the various network parameters. This is obviously not optimized in any way, but it should get your tensors flowing.

References:

George Cybenko (1989), “Approximation by Superpositions of a Sigmoidal Function”, Mathematics of Control, Signals, and Systems
Kurt Hornik (1991), “Approximation Capabilities of Multilayer Feedforward Networks”, Neural Networks

Hat-tip: This post evolved from hacking another script with Craig Pfeifer.

