EfficientNet-B0 Detailed Architecture
EfficientNet uses a technique called the compound coefficient to scale up models in a simple but effective manner. Instead of arbitrarily scaling up width, depth, or resolution, compound scaling uniformly scales all three dimensions with a fixed set of scaling coefficients. Using this scaling method together with neural architecture search (AutoML), the authors of EfficientNet developed a family of models (B0 through B7) of various sizes, which surpassed the state-of-the-art accuracy of most convolutional neural networks with much better efficiency.
The architecture of EfficientNet-B0 can be summarized in the following table:
Stage | Operator | Resolution | #Channels | #Layers |
---|---|---|---|---|
1 | Conv3x3 | 224 × 224 | 32 | 1 |
2 | MBConv1, k3x3 | 112 × 112 | 16 | 1 |
3 | MBConv6, k3x3 | 112 × 112 | 24 | 2 |
4 | MBConv6, k5x5 | 56 × 56 | 40 | 2 |
5 | MBConv6, k3x3 | 28 × 28 | 80 | 3 |
6 | MBConv6, k5x5 | 14 × 14 | 112 | 3 |
7 | MBConv6, k5x5 | 14 × 14 | 192 | 4 |
8 | MBConv6, k3x3 | 7 × 7 | 320 | 1 |
9 | Conv1x1 & Pooling & FC | 7 × 7 | 1280 | 1 |
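The stage layout above can be transcribed into a short, framework-free sketch; the numbers below are copied directly from the table, and the helper simply tallies the stage-level layer counts.

```python
# EfficientNet-B0 stage configuration, transcribed from the table above:
# (operator, input resolution, output channels, number of layers)
B0_STAGES = [
    ("Conv3x3",                224, 32,   1),
    ("MBConv1, k3x3",          112, 16,   1),
    ("MBConv6, k3x3",          112, 24,   2),
    ("MBConv6, k5x5",           56, 40,   2),
    ("MBConv6, k3x3",           28, 80,   3),
    ("MBConv6, k5x5",           14, 112,  3),
    ("MBConv6, k5x5",           14, 192,  4),
    ("MBConv6, k3x3",            7, 320,  1),
    ("Conv1x1 & Pooling & FC",   7, 1280, 1),
]

def total_layers(stages):
    """Total number of stage-level layers in the backbone."""
    return sum(n for *_, n in stages)
```

Summing the #Layers column this way gives 18 stage-level layers for B0.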
Compound Scaling Method
At the heart of EfficientNet lies the compound scaling method, which adjusts network width, depth, and resolution simultaneously using a fixed set of scaling coefficients. This lets the model be adapted to different computational budgets while keeping the three dimensions in balance across scales and tasks.
Compound Scaling:
Before creating the compound scaling method, the authors investigated the effect of each individual scaling strategy on model performance. They concluded that, although scaling a single dimension can improve performance, the best overall results come from balancing all three dimensions (width, depth, and image resolution) against the available computational resources.
The different methods of scaling are:
- Baseline: The original network without scaling.
- Width Scaling: Increasing the number of channels in each layer.
- Depth Scaling: Increasing the number of layers.
- Resolution Scaling: Increasing the input image resolution.
- Compound Scaling: Simultaneously increasing width, depth, and resolution according to the compound scaling formula.
This is achieved by uniformly scaling each dimension with a compound coefficient φ. Given per-dimension bases α (depth), β (width), and γ (resolution) found by a small grid search, the network is scaled as:
depth: d = α^φ, width: w = β^φ, resolution: r = γ^φ
subject to the constraint α · β² · γ² ≈ 2, so that the total FLOPS roughly double each time φ increases by one.
The principle behind the compound scaling approach is to scale with a constant ratio in order to balance the width, depth, and resolution parameters.
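As a concrete sketch: for the B0 baseline, the paper's grid search found α ≈ 1.2, β ≈ 1.1, γ ≈ 1.15, and larger variants are obtained by raising these bases to the power φ. A minimal illustration:

```python
# Compound-scaling bases reported in the EfficientNet paper for B0:
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The constraint alpha * beta^2 * gamma^2 ~= 2 means FLOPS roughly
# double each time phi increases by 1.
flops_growth = ALPHA * BETA ** 2 * GAMMA ** 2
```

With φ = 1 this gives the depth/width/resolution multipliers used to derive EfficientNet-B1 from B0 (actual variants also round channel and layer counts to valid integers).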
Depth-wise Separable Convolution
EfficientNet uses depth-wise separable convolutions to lower computational complexity without sacrificing representational capability. This is achieved by splitting the normal convolution into two parts:
- Depth-wise Convolution: Applies a single filter to each input channel.
- Point-wise Convolution: Aggregates features from different channels.
This makes the network more efficient by requiring fewer computations and parameters.
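The saving is easy to quantify: for a k×k kernel mapping C_in input channels to C_out output channels, a standard convolution needs k·k·C_in·C_out multiplications per output position, while the depth-wise + point-wise pair needs only k·k·C_in + C_in·C_out. A small sketch (channel counts chosen purely for illustration):

```python
def standard_conv_mults(k, c_in, c_out):
    # one k x k filter spanning all input channels, per output channel
    return k * k * c_in * c_out

def separable_conv_mults(k, c_in, c_out):
    # depth-wise: one k x k filter per input channel
    # point-wise: a 1 x 1 convolution that mixes channels
    return k * k * c_in + c_in * c_out

# Example: 3x3 kernel, 32 -> 64 channels
ratio = separable_conv_mults(3, 32, 64) / standard_conv_mults(3, 32, 64)
```

For this example the separable version uses roughly an eighth of the multiplications of the standard convolution, which is where the efficiency gain comes from.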
Inverted Residual Blocks
Inspired by MobileNetV2, EfficientNet employs inverted residual blocks to further optimize resource usage. These blocks begin with a 1×1 point-wise convolution that expands the channels, apply a lightweight depth-wise convolution in the expanded space, and end with a 1×1 point-wise projection back to a narrow representation. Additionally, squeeze-and-excitation (SE) operations are incorporated to enhance feature representation by recalibrating channel-wise responses.
Inverted Residual Block Structure
An inverted residual block follows a narrow -> wide -> narrow structure:
- Expansion Phase: Increase the number of feature maps with a 1×1 convolutional layer.
- Depth-wise Convolution: Apply a 3×3 depth-wise convolution to the expanded feature maps.
- Projection Phase: Shrink the number of feature maps back to the original input number with a 1×1 convolutional layer.
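The narrow → wide → narrow channel flow can be traced with a tiny helper; the 16 → 24 example below mirrors the transition from stage 2 to stage 3 of the table and is illustrative only.

```python
def mbconv_channel_trace(c_in, c_out, expansion=6):
    """Trace channel counts through an inverted residual (MBConv) block."""
    expanded = c_in * expansion  # 1x1 expansion: narrow -> wide
    # the depth-wise convolution keeps the channel count unchanged (wide)
    projected = c_out            # 1x1 projection: wide -> narrow
    return [c_in, expanded, expanded, projected]

def has_skip(c_in, c_out, stride=1):
    """The residual shortcut only applies when input and output shapes match."""
    return stride == 1 and c_in == c_out
```

So an MBConv6 block taking 16 channels to 24 passes through 96 channels internally, and omits the skip connection because the channel counts differ.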
Efficient Scaling:
EfficientNet achieves efficient scaling by progressively increasing model depth, width, and resolution based on the compound scaling coefficient φ. This allows for the creation of larger and more powerful models without significantly increasing computational overhead. By carefully balancing these dimensions, EfficientNet achieves state-of-the-art performance while remaining computationally efficient.
Efficient Attention Mechanism:
EfficientNet incorporates efficient attention mechanisms, such as squeeze-and-excitation (SE) blocks, to improve feature representation. SE blocks selectively amplify informative features by learning channel-wise attention weights. This enhances the discriminative power of the network while minimizing computational overhead.
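A minimal, framework-free sketch of an SE block follows: global average pooling squeezes each channel to a scalar, a small FC → ReLU → FC → sigmoid bottleneck produces per-channel attention weights, and the input channels are rescaled by those weights. The weight matrices here are toy values for illustration, not learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(feature_maps, w1, w2):
    """Squeeze-and-excitation over a list of 2-D feature maps (one per channel).

    w1: rows of the squeeze FC (reduced x channels), followed by ReLU
    w2: rows of the excitation FC (channels x reduced), followed by sigmoid
    """
    # Squeeze: global average pooling per channel
    z = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
         for fm in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives channel attention weights
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    scale = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Recalibrate: multiply every channel by its attention weight
    return [[[v * s for v in row] for row in fm]
            for fm, s in zip(feature_maps, scale)]
```

Because the attention weights are computed from pooled scalars rather than full feature maps, the extra cost of an SE block is negligible compared with the convolutions it augments.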
EfficientNet Architecture
In deep learning, the search for more efficient neural network architectures is ongoing, and EfficientNet stands out as an approach that balances model accuracy with computational efficiency. This article walks through the EfficientNet architecture in detail: its design philosophy, building blocks, scaling strategy, and performance.