How does Distributed Training work in TensorFlow?

Let’s understand how we can use TensorFlow’s distribution strategies to train a large-scale model. We will use the MNIST dataset in this example for simplicity and easy understanding.

Step 1: Import TensorFlow and define the Model

First, we import the TensorFlow library, specifically the layers and models modules from the Keras API. Then, we define a simple neural network model. Since we are using the MNIST dataset, we create a simple convolutional neural network (CNN) with the Sequential API. This model consists of a convolutional layer, a max-pooling layer, a flatten layer, and two dense layers.

Python
import tensorflow as tf
from tensorflow.keras import layers, models


def create_model():
    model = models.Sequential([
        layers.Conv2D(32, kernel_size=(3, 3), activation='relu',
                      input_shape=(28, 28, 1)),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])
    return model
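
If you want to sanity-check the architecture before distributing it, you can build the model once and print its summary. This quick check is optional and not part of the original walkthrough.

Python
# Optional: inspect the layer output shapes and parameter counts of the CNN defined above.
create_model().summary()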

Step 2: Load and Preprocess the Dataset

The MNIST dataset consists of 60,000 training and 10,000 testing images of handwritten digits from 0 to 9. In the following code, we reshape the images to have a single channel (since they are grayscale) and normalize the pixel values to the range [0, 1] by dividing by 255.

Python
(train_images, train_labels), _ = tf.keras.datasets.mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)) / 255.0

Step 3: Initialize MirroredStrategy

Now, we initialize MirroredStrategy for distributed training. This strategy implements data parallelism: it replicates the model across multiple GPUs, if available, and keeps the copies synchronized during training.

Python
strategy = tf.distribute.MirroredStrategy()
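
You can optionally confirm how many replicas the strategy will use. The num_replicas_in_sync attribute reports the number of devices the model is mirrored across; on a machine without GPUs it is 1 and training simply runs on the CPU.

Python
# Optional check: how many devices will MirroredStrategy synchronize across?
print('Number of devices:', strategy.num_replicas_in_sync)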

Step 4: Wrap Model Creation and Training

We use a with statement to create the model within the scope of the MirroredStrategy. Anything that creates model variables under this with block (building and compiling the model) is placed under the strategy, so TensorFlow can mirror those variables and distribute the training computation across the available devices.

We compile the model by specifying the desired optimizer, loss function, and metrics. In this example, we use the Adam optimizer, the sparse categorical crossentropy loss (since the labels are integers), and accuracy as the evaluation metric.

Python
with strategy.scope():
    model = create_model()
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

Step 5: Create dataset object

Now, we create a TensorFlow Dataset object from the training images and labels. This Dataset object lets us iterate efficiently over the training data during training. Here, we shuffle the dataset and batch it with a batch size of 32.

Python
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(60000).batch(32)
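
With MirroredStrategy, the batch you pass to fit() is the global batch, which is split across the replicas. A common optional variant (not required by this example) is therefore to scale the batch size by the number of replicas and to add prefetching so the input pipeline keeps the devices busy.

Python
# Optional variant: scale the global batch size with the replica count and prefetch.
BATCH_SIZE_PER_REPLICA = 32
global_batch_size = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

train_dataset = (tf.data.Dataset.from_tensor_slices((train_images, train_labels))
                 .shuffle(60000)
                 .batch(global_batch_size)
                 .prefetch(tf.data.AUTOTUNE))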

Step 6: Train the Model

We use the fit() method to train the model for 5 epochs, passing the dataset we created. During training, TensorFlow distributes the computation across the available devices using MirroredStrategy, and the gradient updates are synchronized across devices.

Python
model.fit(train_dataset, epochs=5)

Output:

Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 19s 7ms/step - accuracy: 0.7512 - loss: 0.8273
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9477 - loss: 0.1705
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9628 - loss: 0.1198
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9712 - loss: 0.0912
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9787 - loss: 0.0701
<keras.src.callbacks.history.History at 0x7a5cfd8a68c0>
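
For longer or interruption-prone distributed runs, you can add basic fault tolerance with the Keras BackupAndRestore callback, which periodically saves the training state and resumes from it if fit() is restarted. This is a minimal optional sketch; the backup directory path is just an illustrative choice.

Python
# Optional: resume training automatically if an interrupted fit() call is rerun.
backup_callback = tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/mnist_backup')
model.fit(train_dataset, epochs=5, callbacks=[backup_callback])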


Step 7: Test the model

We load the test split of the dataset and preprocess it the same way we did for training. Finally, we use the evaluate() method to evaluate the model on the test images and labels.

Python
(_, _), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
test_images = test_images.reshape((10000, 28, 28, 1)) / 255.0


test_loss, test_accuracy = model.evaluate(test_images, test_labels, verbose=2)
print(f'Test accuracy: {test_accuracy}')

Output:

313/313 - 2s - 5ms/step - accuracy: 0.9846 - loss: 0.0455
Test accuracy: 0.9846000075340271

Distributed Training with TensorFlow

As dataset sizes and model complexity keep growing, traditional single-device training often cannot keep up with the demands of contemporary tasks, which has given rise to distributed training. In simple words, distributed training splits the computational workload across many devices or machines so that machine learning models can be trained more quickly and efficiently.

In this article, we will discuss distributed training with TensorFlow and understand how you can incorporate it into your AI workflows. We will also cover best practices and tips for getting the most performance out of TensorFlow’s distribution capabilities when addressing today’s AI challenges.

Table of Content

  • What is Distributed Training?
  • Distributed Training with TensorFlow
  • How does Distributed Training work in TensorFlow?
  • Optimizing Distributed Training: Best Practices & Fault Tolerance
    • Optimizing Performance in Distributed Training
    • Monitoring, Debugging, and Fault Tolerance
  • Conclusion


Conclusion

Therefore, we have studied how distributed training works using TensorFlow. Follow the example given above and try to replicate it on a dataset of your own choice. Distributed training drastically speeds up model training and makes it possible to train models that wouldn’t be feasible on a single computer. Make sure you follow the best practices and tips to optimize performance and get the best results.