Building a simple fully connected neural network with TensorFlow 2 without Keras

In TensorFlow, the high-level Keras API makes it easy to build and train neural network models and to do most other things you would want to do with them. As a neural network beginner, though, I had been using it without really understanding what was happening underneath. So I implemented a simple feed-forward neural network without Keras, using only TensorFlow's lower-level APIs.

The implementation in this article follows Deep Learning from Scratch, and the article doubles as a personal note for my own understanding. If you want to understand the material properly, you are better off reading Deep Learning from Scratch and the TensorFlow guide. The original Jupyter notebook is here.


By building everything by hand once, I came to understand TensorFlow and neural network basics much better: which function affects what, what manual training feels like, how automatic differentiation works and how to use it, and why Keras is useful. Guide pages that I had not understood before became mostly readable.

Some very capable people can learn the theory and turn it into a program without much trouble, and many capable people can read a book in an unfamiliar field, understand it, and implement it. For me that is often not enough: I tend to understand only after writing the code myself and observing how it behaves. This exercise reminded me of that again.


The simple neural network implemented here looks like this:

  • Create a layer that holds two trainable parameters:
    • weights of shape (input count, unit count)
    • biases of shape (unit count,)
    • During forward propagation, multiply the input by the weights, add the bias, and apply an activation function to the result.
  • Create a network that manages the layers.
    • During inference, apply the layers in order (forward propagation) and output the result.
    • Use a loss function as a measure of how far the inference is from the correct answer.
    • During training, compute the gradients of the loss with respect to the weight parameters and update each layer's parameters little by little, scaled by a learning rate, so that the loss becomes smaller. The gradients come from backpropagation, which is delegated to TensorFlow autodiff.
  • Give training data to this network and train it.

First, implement a simple layer.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Disable the GPU
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

# Fix the random seeds
np.random.seed(42)
tf.random.set_seed(42)

class SimpleLayer():
    def __init__(self, input_dim, units, activation):
        # Initialize the weights from a normal distribution; tf.Variable makes them updatable.
        self.w = tf.Variable(tf.random.normal([input_dim, units]) * 0.01, name='weight')
        # Initialize the bias to 0
        self.b = tf.Variable(tf.zeros([units]), name="bias")
        # Activation function
        self.activation = activation
    
    @property
    def weights(self):
        return [self.w, self.b]

    def forward(self, x):
        y = x @ self.w + self.b  # equivalent to y = tf.matmul(x, self.w) + self.b
        return self.activation(y)

    def __call__(self, x):
        return self.forward(x)

identify_function = lambda x: x
zero_function = lambda x: x * 0

l1 = SimpleLayer(2, 2, identify_function)
l2 = SimpleLayer(2, 1, zero_function)

print(f'l1 weights: {l1.weights}')
a1 = l1([[10, 20]])  # calls SimpleLayer.__call__
print(f'a1: {a1}')
a2 = l2(a1)
print(f'a2: {a2}')

print(l2(l1([[10, 20]])))
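As a shape check for the layer above, here is a minimal NumPy sketch of the same forward computation. The batch of 3 inputs with 2 features, the 4 units, and the names `x`, `w`, `b` are chosen only for illustration. The bias of shape (units,) is broadcast across the batch dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, input_dim, units = 3, 2, 4               # illustrative sizes, not from the article
x = rng.normal(size=(batch, input_dim))         # inputs:  (batch, input_dim)
w = rng.normal(size=(input_dim, units)) * 0.01  # weights: (input_dim, units)
b = np.zeros(units)                             # bias:    (units,)

# Affine transform; b is broadcast over the batch dimension.
y = x @ w + b
print(y.shape)  # (3, 4)

# The matrix product is the same as summing over the input dimension by hand.
manual = np.array([[sum(x[n, i] * w[i, j] for i in range(input_dim)) + b[j]
                    for j in range(units)] for n in range(batch)])
print(np.allclose(y, manual))  # True
```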

Next, implement several simple activation functions.

def step_function(x: tf.Tensor):
    return tf.cast(x > 0, tf.uint8)

step_function(tf.constant([1, 0, 3, -3]))

def sigmoid(x: tf.Tensor):
    return 1 / (1 + tf.exp(-x))

sigmoid(tf.constant([0, 1.0, -2.0]))

def relu(x: tf.Tensor):
    return tf.maximum(0.0, x)

relu(tf.constant([-2.0, -1.0, 1.0, 2.0]))

def tanh(x: tf.Tensor):
    return (tf.exp(x) - tf.exp(-x)) / (tf.exp(x) + tf.exp(-x))

tanh(tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0]))

For output layer activation functions, implement the identity function, which does nothing, and softmax, which is used for classification problems.

def identity(x: tf.Tensor):
    return x

identity(tf.constant([1.0, 0.0, -1.0, -3.0]))

def softmax(x: tf.Tensor):
    e = tf.exp(x - tf.reduce_max(x))
    s = tf.reduce_sum(e)
    return e / s

print(softmax(tf.constant([0.3, 2.9, 4.0])))
print(softmax(tf.constant([1010.0, 1000, 990])))
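Note that the softmax above takes the maximum and the sum over every element of its input. For a 1-D vector that is exactly what is needed, but for a batch of shape (batch, classes) it makes the whole batch sum to 1 rather than each row. Under that normalization, the mean cross entropy over a batch of N one-hot samples cannot fall much below ln N (about 3.47 for N = 32), which may be why the training loss reported later in this article levels off around 3.5. A per-row variant can be sketched as follows, in NumPy so the arithmetic is easy to check. `softmax_rows` is a hypothetical name, and `tf.reduce_max` and `tf.reduce_sum` accept the same `axis` and `keepdims` arguments.

```python
import numpy as np

def softmax_rows(x):
    """Softmax over the last axis, so each row of a batch sums to 1."""
    x = np.asarray(x, dtype=np.float64)
    # Subtract the per-row maximum for numerical stability.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

batch = np.array([[0.3, 2.9, 4.0],
                  [1010.0, 1000.0, 990.0]])
probs = softmax_rows(batch)
print(probs.sum(axis=-1))  # each row sums to 1
```

Since the argmax of each row is unaffected by a batch-wide normalization, the accuracy figures are not distorted in the same way as the loss values.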

Next, implement loss functions: sum of squared error and cross entropy error for classification models, and root mean squared error for regression models.

def sum_squared_error(x:tf.Tensor, y: tf.Tensor):
    return tf.reduce_mean(0.5 * tf.reduce_sum((x-y) ** 2, axis=tf.rank(x)-1))

y1 = [0.0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y2 = [0.0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
x1 = [0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0]

print(sum_squared_error(tf.constant(x1), tf.constant(y1)))
print(sum_squared_error(tf.constant(x1), tf.constant(y2)))
print(sum_squared_error(tf.constant([x1, x1]), tf.constant([y1, y2])))

def cross_entropy_error(x: tf.Tensor, y: tf.Tensor):
    # Small constant that keeps the argument of log() away from 0
    delta = tf.constant(1e-7)
    # Treat a single sample as a batch of one
    if tf.rank(x) == 1:
        x = tf.reshape(x, (1, tf.size(x)))
        y = tf.reshape(y, (1, tf.size(y)))
    batch_size = x.shape[0]
    return -tf.reduce_sum(y * tf.math.log(x + delta)) / batch_size

print(cross_entropy_error(tf.constant(x1), tf.constant(y1)))
print(cross_entropy_error(tf.constant(x1), tf.constant(y2)))
print(cross_entropy_error(tf.constant([x1, x1]), tf.constant([y1, y2])))

def root_mean_squared_error(x: tf.Tensor, y: tf.Tensor):
    diff = y - x
    return tf.sqrt(tf.reduce_mean(diff ** 2))

y = [[100.0], [160], [60]]
x = [[80.0], [100], [100]]

print(root_mean_squared_error(tf.constant(x), tf.constant(y)))
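With a one-hot target, the cross entropy of a single sample reduces to the negative log of the probability assigned to the correct class. Here is a minimal pure-Python check using the same `x1`, `y1`, and `y2` values as above (the helper name `cross_entropy_single` is hypothetical):

```python
import math

def cross_entropy_single(probs, one_hot, delta=1e-7):
    """Cross entropy for one sample; delta avoids log(0)."""
    return -sum(t * math.log(p + delta) for p, t in zip(probs, one_hot))

x1 = [0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0]
y1 = [0.0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # correct class is index 2
y2 = [0.0, 0, 0, 0, 0, 0, 0, 1, 0, 0]   # correct class is index 7

# Only the term for the correct class survives the sum.
print(cross_entropy_single(x1, y1), -math.log(0.6 + 1e-7))
print(cross_entropy_single(x1, y2), -math.log(0.1 + 1e-7))
```

The prediction that puts 0.6 on the correct class gets a smaller loss than the one that puts only 0.1 on it.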

Next, check TensorFlow behavior for gradient calculation with automatic differentiation.

For the function f1 below, the derivative at x=3 is 40.

f1 = lambda x: x**3 + 2*x**2 + x

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    z = f1(x)
tape.gradient(z, [x])
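The value 40 can be cross-checked without autodiff by numerical differentiation. Here is a minimal pure-Python sketch using a central difference; the helper name `numerical_derivative` and the step size `h` are assumptions for illustration.

```python
def numerical_derivative(f, x, h=1e-4):
    """Central difference approximation of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

f1 = lambda x: x**3 + 2*x**2 + x

# Analytic derivative: 3x^2 + 4x + 1, which is 40 at x = 3.
approx = numerical_derivative(f1, 3.0)
print(approx)  # close to 40, up to a small approximation error
```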

Next, partial differentiation. For the function f2 below, when w1 and w2 are 5 and 3, the partial derivatives with respect to w1 and w2 are 36 and 10. This code is from Chapter 12 - Custom Models and Training with TensorFlow.

def f2(w1, w2):
    return 3 * w1**2 + 2*w1 * w2

w1, w2 = tf.Variable(5.0), tf.Variable(3.0)

with tf.GradientTape() as tape:
    z = f2(w1, w2)
print(tape.gradient(z, [w1, w2]))

try:
    print(tape.gradient(z, [w1, w2]))
except RuntimeError:
    print("The second call fails because the tape's resources have already been released")

with tf.GradientTape(persistent=True) as tape:
    z = f2(w1, w2)

print(tape.gradient(z, [w1]))
print(tape.gradient(z, [w2]))
del tape  # release the tape's resources
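The values 36 and 10 can be checked by hand from the definition of f2:

```latex
\begin{aligned}
f_2(w_1, w_2) &= 3 w_1^2 + 2 w_1 w_2 \\
\frac{\partial f_2}{\partial w_1} &= 6 w_1 + 2 w_2 = 6 \cdot 5 + 2 \cdot 3 = 36 \\
\frac{\partial f_2}{\partial w_2} &= 2 w_1 = 2 \cdot 5 = 10
\end{aligned}
```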

Now check whether the functions and layers created so far behave as intended by trying a simple linear-function prediction. Create data from y = 2x + 10 and add uniform random noise of roughly ±5.

x = np.arange(-50, 50, 2)
line_2x_1 = 2 * x  + 10
noise = -10 * np.random.rand(len(x)) + 5
dots_2x_1 = line_2x_1 + noise
plt.plot(x, line_2x_1)
plt.plot(x, dots_2x_1, 'o')
plt.show()

y = tf.expand_dims(tf.constant(dots_2x_1, dtype=tf.float32), axis=1)
X = tf.expand_dims(tf.constant(x, dtype=tf.float32), axis=1)

First, as a baseline without a neural network, check that linear regression in sklearn fits the data well.

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)
print(reg.score(X, y))
reg.predict([[-50], [0], [50], [100]])

0.9975753493086111

array([[-89.75652361],
       [ 10.54682827],
       [110.85018015],
       [211.15353203]])

Next, train a two-layer neural network as a regression model. If its predictions are close to those of sklearn's linear regression, that is good enough.

layer1 = SimpleLayer(1, 32, relu)
layer2 = SimpleLayer(32, 1, identify_function)

loss_function = root_mean_squared_error
lr = 0.003

predict = lambda x: layer2(layer1(x))
for i in range(10000):
    # Compute the gradients
    with tf.GradientTape() as tape:
        y_pred = predict(X)
        z = loss_function(y_pred, y)
    (l1_w_grads, l1_b_grads), (l2_w_grads, l2_b_grads) = tape.gradient(z, [layer1.weights, layer2.weights])
    # Update the parameters with SGD
    layer1.w.assign_sub(lr * l1_w_grads)
    layer1.b.assign_sub(lr * l1_b_grads)
    layer2.w.assign_sub(lr * l2_w_grads)
    layer2.b.assign_sub(lr * l2_b_grads)
    if (i % 1000 == 0):
        print('iter {} / train loss: {:.3}'.format(i, z.numpy()))

print('train loss: {:.3}'.format(loss_function(predict(X), y)))
print(predict(tf.constant([[-50], [0], [50], [100]], dtype=tf.float32)))

iter 0 / train loss: 58.6
iter 1000 / train loss: 5.08
iter 2000 / train loss: 3.71
iter 3000 / train loss: 3.79
iter 4000 / train loss: 3.5
iter 5000 / train loss: 3.3
iter 6000 / train loss: 3.18
iter 7000 / train loss: 3.09
iter 8000 / train loss: 3.03
iter 9000 / train loss: 2.98
train loss: 2.94
tf.Tensor(
[[-92.20102  ]
 [ 10.1016445]
 [109.65451  ]
 [209.20737  ]], shape=(4, 1), dtype=float32)
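To see what `assign_sub(lr * grad)` amounts to without autodiff, here is a minimal NumPy sketch that fits a single linear unit y = w x + b to noise-free data from 2x + 10 using hand-derived gradients. It is not the network trained above: the mean squared error loss, the input scaling, the learning rate, and the iteration count are all assumptions chosen so that the toy problem converges.

```python
import numpy as np

# Noise-free samples of y = 2x + 10; inputs are rescaled to roughly [-1, 1]
# (an assumption made here to keep plain gradient descent well conditioned).
x_raw = np.arange(-50, 50, 2, dtype=np.float64)
t = 2 * x_raw + 10
x = x_raw / 50.0

w, b = 0.0, 0.0   # parameters of the single linear unit
lr = 0.1          # learning rate (illustrative value)

for _ in range(2000):
    pred = w * x + b
    err = pred - t
    # Gradients of the mean squared error L = mean(err**2)
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    # Same update rule as assign_sub(lr * grad)
    w -= lr * grad_w
    b -= lr * grad_b

# Undo the input scaling to recover the slope in the original units.
print(w / 50.0, b)
```

The recovered slope and intercept should be very close to 2 and 10. With autodiff, the two `grad_` lines are what `tape.gradient` computes for you.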

It seems to work. Next, implement a network class that wraps the training steps above more conveniently.

class SimpleSequenceNetwork:
    def __init__(self, layers, loss_function, lr=0.01):
        self.layers = layers
        self.loss_function = loss_function
        self.lr = lr
    
    def predict(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def loss(self, x, target):
        y = self.predict(x)
        return self.loss_function(y, target)

    def accuracy(self, x, target):
        y = self.predict(x)
        y = tf.argmax(y, axis=1)
        target = tf.argmax(target, axis=1)

        accuracy = tf.math.count_nonzero(y == target) / x.shape[0]
        return accuracy
    
    @property
    def all_weights(self):
        return tf.nest.flatten([layer.weights for layer in self.layers])

    # Compute the gradients of the loss function with respect to the weight parameters
    def gradient(self, x, target):
        with tf.GradientTape() as tape:
            tape.watch(x)
            z = self.loss(x, target)
        return tape.gradient(z, self.all_weights)

    # Update the layers' weight parameters with plain gradient descent (SGD)
    def update_variables_by_sdg(self, grads):
        for (grad, val) in zip(grads, self.all_weights):
            val.assign_sub(self.lr * grad)

    # Compute the gradients and update the parameters
    def training(self, x, target):
        grads = self.gradient(x, target)
        self.update_variables_by_sdg(grads)
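The `accuracy` method compares the index of the largest output with the index of the 1 in the one-hot target. Here is a small NumPy illustration with made-up predictions (all values are for illustration only):

```python
import numpy as np

# Made-up network outputs for 4 samples over 3 classes.
y_pred = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.2, 0.5, 0.3]])
# One-hot targets: the correct classes are 1, 0, 2, 2.
target = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 1]])

pred_class = np.argmax(y_pred, axis=1)   # [1, 0, 2, 1]
true_class = np.argmax(target, axis=1)   # [1, 0, 2, 2]
accuracy = np.count_nonzero(pred_class == true_class) / len(y_pred)
print(accuracy)  # 0.75
```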

Load the dataset for training. Use the familiar MNIST dataset of handwritten digits from 0 to 9.

import tensorflow_datasets as tfds

ds = tfds.load("mnist", as_supervised=True)
test_ds = ds['test']
train_ds = ds['train']

for (i, (image, label)) in enumerate(train_ds.take(12)):
    plt.subplot(3, 4, i+1)
    plt.imshow(image, cmap='gray')
    plt.subplots_adjust(wspace=0, hspace=1)
    plt.title(label.numpy())
    plt.axis('off')
plt.show()

def preprocess(image, label):
    # Flatten the image from (28, 28, 1) to (784,) and scale it to the range 0.0 to 1.0
    image = tf.cast(tf.reshape(image, (-1,)), tf.float32) / 255.0
    # Convert the label to a one-hot vector
    label = tf.one_hot(label, 10, dtype=tf.float32)
    return image, label

train_ds = train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).cache()
test_ds = test_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).cache()

print((len(train_ds), len(test_ds)))

(60000, 10000)
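The effect of `preprocess` on one sample can be illustrated in NumPy. The image below is random stand-in data with the MNIST shape, not an actual MNIST digit, and `np.eye(10)[label]` plays the role of `tf.one_hot`.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28, 1), dtype=np.uint8)  # stand-in image
label = 3                                                        # stand-in label

# Flatten (28, 28, 1) to (784,) and scale pixel values to [0, 1].
flat = image.reshape(-1).astype(np.float32) / 255.0
# One-hot encode the label over 10 classes.
one_hot = np.eye(10, dtype=np.float32)[label]

print(flat.shape)  # (784,)
print(one_hot)
```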

Create a function that trains the network based on the dataset.

def fit(network, train_ds: tf.data.Dataset, test_ds: tf.data.Dataset, epochs=20, batch_size=32):
    history_train_loss, history_train_accuracy, history_test_accuracy = [], [], []
    for epoch in range(1, epochs + 1):
        train_loss, train_accuracy, test_accuracy = [], [], []
        for (X_batch, y_batch) in train_ds.shuffle(1000).batch(batch_size).prefetch(1):
            network.training(X_batch, y_batch)
            train_loss.append(network.loss(X_batch, y_batch))
            train_accuracy.append(network.accuracy(X_batch, y_batch))
        for (X_batch, y_batch) in test_ds.shuffle(1000).batch(batch_size).prefetch(1):
            test_accuracy.append(network.accuracy(X_batch, y_batch))
        
        print("train acc, test acc, train loss | {:.4}, {:.4}, {:.4}".format(
            tf.reduce_mean(train_accuracy).numpy(),
            tf.reduce_mean(test_accuracy).numpy(),
            tf.reduce_mean(train_loss).numpy()
        ))
        history_train_loss.extend(train_loss)
        history_train_accuracy.extend(train_accuracy)
        history_test_accuracy.extend(test_accuracy)
    return {
        'train_loss': np.array(history_train_loss),
        'train_accuracy': np.array(history_train_accuracy),
        'test_accuracy': np.array(history_test_accuracy)
    }
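For a sense of scale, the number of parameter updates per epoch follows directly from the dataset sizes and the batch size used in this article. Here is a small pure-Python calculation (the helper name `steps_per_epoch` is hypothetical):

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

train_steps = steps_per_epoch(60000, 32)   # training set size printed above
test_steps = steps_per_epoch(10000, 32)    # test set size printed above
print(train_steps, test_steps)  # 1875 313
```

Each entry appended to `train_loss` corresponds to one such step, so a 5-epoch run with batch size 32 records 5 × 1875 = 9375 loss values.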

Now create the layers and the neural network, then train it.

%%time
input_layer = SimpleLayer(784, 100, relu)
hidden_layer = SimpleLayer(100, 50, relu)
output_layer = SimpleLayer(50, 10, softmax) 
network = SimpleSequenceNetwork([input_layer, hidden_layer, output_layer], cross_entropy_error, lr=0.1)

history = fit(network, train_ds, test_ds, epochs=5, batch_size=32)
plt.plot(history['train_loss'])
plt.show()

train acc, test acc, train loss | 0.7391, 0.9364, 4.336
train acc, test acc, train loss | 0.9798, 0.9572, 3.604
train acc, test acc, train loss | 0.9897, 0.9674, 3.547
train acc, test acc, train loss | 0.9936, 0.9695, 3.524
train acc, test acc, train loss | 0.9954, 0.9706, 3.511

Try replacing the activation function.

%%time
input_layer = SimpleLayer(784, 100, tanh)
hidden_layer = SimpleLayer(100, 50, tanh)
output_layer = SimpleLayer(50, 10, softmax) 
network = SimpleSequenceNetwork([input_layer, hidden_layer, output_layer], cross_entropy_error, lr=0.1)

history = fit(network, train_ds, test_ds, epochs=5, batch_size=32)
plt.plot(history['train_loss'])
plt.show()

train acc, test acc, train loss | 0.774, 0.9139, 4.245
train acc, test acc, train loss | 0.953, 0.9455, 3.687
train acc, test acc, train loss | 0.973, 0.9567, 3.607
train acc, test acc, train loss | 0.9815, 0.9625, 3.57
train acc, test acc, train loss | 0.9869, 0.9663, 3.548

With this simple neural network, MNIST digit classification also worked reasonably well, reaching about 97% test accuracy after 5 epochs with either activation function. The hardest part, computing the gradients for the weight updates by backpropagation, is handled by TensorFlow autodiff, so I did not need to write that process myself.
