Accepted · ICML 2026

Activation-Free Backbones for Image Recognition

Modern vision models treat pointwise activations and the softmax exponential as essential. We show they are not required, and that polynomial alternatives match or exceed them.

Jeffrey Wang  ·  Jonathan Gregory  ·  Grigorios G. Chrysos

University of Wisconsin–Madison  ·  jjwang8@wisc.edu


Abstract

Activations are not a representational necessity.

We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules drop into existing architectures: instantiated within MetaFormer, our PolyNeXt models match or exceed activation-based counterparts across scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness, while substantially outperforming prior polynomial networks at reduced cost.

Polynomial Networks · Hadamard Product · Activation-Free · MetaFormer · Vision Backbones · FHE-amenable

85.2%: ImageNet top-1 (APolyNeXt-L), matching CAFormer with zero activations
+5.0 pt: over the best prior fully polynomial model (a step toward FHE-compatible inference)
30–45%: less peak GPU memory than MetaFormer at matched scale
~200: layers trained stably with our lightweight recipe

01 · Key Idea

Multiplication in place of activation

The Hadamard product of two learned linear projections, (W₁x) ⊙ (W₂x), is already a second-degree polynomial in the input, with no ReLU or GELU required. Compose these across L layers and the polynomial degree grows as 2^L while parameters grow only linearly in L.

 

The gap between exponentially growing degree and linearly growing parameter count is why PolyNeXt favors depth over width: stacking narrow layers yields exponential expressiveness at linear parameter cost.
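The degree claim can be checked numerically. The sketch below is an illustration only, not the authors' code: the helper names, the square bias-free weight matrices, and the random initialization are all assumptions. It relies on the fact that a bias-free polynomial map of degree d satisfies f(t·x) = t^d·f(x), so stacking three Hadamard layers should scale the output by t^8.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def hadamard_layer(W1, W2, x):
    """Bias-free layer: elementwise product of two linear projections of x."""
    return [a * b for a, b in zip(matvec(W1, x), matvec(W2, x))]

def run_stack(layers, x):
    """Apply the layers in sequence; each one doubles the polynomial degree."""
    for W1, W2 in layers:
        x = hadamard_layer(W1, W2, x)
    return x

def random_matrix(n, rng):
    """Hypothetical initialization: uniform weights in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

rng = random.Random(0)
dim, depth, t = 4, 3, 2.0
layers = [(random_matrix(dim, rng), random_matrix(dim, rng)) for _ in range(depth)]
x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

base = run_stack(layers, x)
scaled = run_stack(layers, [t * v for v in x])

degree = 2 ** depth                  # doubles with every layer
num_params = depth * 2 * dim * dim   # grows linearly with depth

# Without biases the stack is homogeneous: f(t * x) == t**degree * f(x).
homogeneous = all(
    math.isclose(s, (t ** degree) * b, rel_tol=1e-9) for s, b in zip(scaled, base)
)
print(degree, num_params, homogeneous)
```

Doubling `depth` here doubles `num_params` but squares the scaling factor, which is the depth-over-width trade-off in miniature.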

02 · The Modules

Three polynomial primitives

Each module swaps a nonlinearity for a Hadamard product while preserving the input-output interface, so it stays compatible with orthogonal improvements like windowed or sparse attention.

PolyMLP

Channel mixing

Replaces the activation between linear projections with the elementwise product of two parallel projections, followed by LayerNorm.

PolyConv

Convolutional spatial mixing

Fuses a dilated coarse branch and a fine branch with different receptive fields, with a channel-flip to decorrelate them.

PolyAttn

Attention spatial mixing


Replaces the softmax exponential with a polynomial kernel followed by normalization, using shared Q/K projections and depthwise convolutions.
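As a rough illustration of the attention variant, the sketch below swaps exp(q·k) for a polynomial kernel and normalizes the result. Several details are assumptions rather than the paper's specification: the kernel degree is set to 2, the normalization is taken to be a LayerNorm on each output token, scores are divided by the feature dimension, and the depthwise convolutions are omitted. All function names are hypothetical.

```python
import math

def linear(W, x):
    """Bias-free linear projection."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    """Normalize one token to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def poly_attention(tokens, W_qk, W_v, degree=2):
    """Polynomial-kernel attention with one shared projection for Q and K."""
    qk = [linear(W_qk, t) for t in tokens]        # shared Q/K features
    values = [linear(W_v, t) for t in tokens]
    dim = len(qk[0])
    outputs = []
    for q in qk:
        # Polynomial kernel in place of exp(q . k): additions and multiplications only.
        scores = [(sum(a * b for a, b in zip(q, k)) / dim) ** degree for k in qk]
        mixed = [
            sum(s * v[c] for s, v in zip(scores, values))
            for c in range(len(values[0]))
        ]
        outputs.append(layer_norm(mixed))         # assumed normalization step
    return outputs

tokens = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 1.5, 1.0]]
W_qk = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]]
W_v = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.5, 0.5, 0.0]]
out = poly_attention(tokens, W_qk, W_v)
print(out)
```

Because the function keeps the tokens-in, tokens-out interface of ordinary attention, it could in principle be restricted to windows or sparse patterns in the same way.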

Figure: Polynomial module primitives. Activation-free replacements for the three core vision operators: PolyMLP (channel mixing), PolyConv (convolutional spatial mixing), and PolyAttn (attention-based spatial mixing).

Set beside their standard counterparts, each polynomial module replaces the activation or softmax (shown in red) with a Hadamard-product construction. This also entails supporting changes, such as parallel convolutional branches with different receptive fields and depthwise convolutions for local context.

Figure: Standard vs. polynomial, layer by layer. The activation (GELU) and softmax are shown in red. The polynomial variants replace them with a Hadamard product and add supporting structure, such as parallel branches with different receptive fields and depthwise convolutions on Q, K, V.
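The channel-mixing case is the simplest to put side by side. The following is a minimal sketch under stated assumptions, not the released implementation: the function names, the bias-free projections, the weight values, and the exact placement of LayerNorm before the output projection are all illustrative choices.

```python
import math

def linear(W, x):
    """Bias-free linear projection."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """Exact GELU via the error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def standard_mlp(W_in, W_out, x):
    """Baseline block: linear -> GELU -> linear."""
    return linear(W_out, [gelu(h) for h in linear(W_in, x)])

def poly_mlp(W_a, W_b, W_out, x):
    """Activation-free block: (W_a x) * (W_b x) -> LayerNorm -> linear."""
    product = [a * b for a, b in zip(linear(W_a, x), linear(W_b, x))]
    return linear(W_out, layer_norm(product))

x = [0.5, -1.0, 2.0]
W_a = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6], [0.7, 0.8, -0.9], [0.2, -0.1, 0.3]]
W_b = [[-0.2, 0.3, 0.1], [0.5, 0.4, -0.6], [0.9, -0.7, 0.8], [0.3, 0.2, -0.4]]
W_out = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8], [-0.9, 0.1, 0.2, -0.3]]

y_std = standard_mlp(W_a, W_out, x)
y_poly = poly_mlp(W_a, W_b, W_out, x)
# With no biases the product is an even function of x, so flipping the sign
# of the input leaves the polynomial block's output unchanged.
y_flip = poly_mlp(W_a, W_b, W_out, [-v for v in x])
print(y_std, y_poly, y_flip)
```

Both blocks map a vector of the same size to a vector of the same size, which is what lets the polynomial version drop into an existing architecture.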

03 · Making It Trainable

The barrier was stability, not capacity

Hadamard products amplify large values, and that amplification compounds across depth. Three lightweight mechanisms tame it and let polynomial networks train stably at depths of just under 200 layers, far deeper than prior shallow-and-wide polynomial models.

i.

Sigmoid-Scale

Each residual branch is scaled by a learnable factor σ(λ) ∈ (0,1), initialized small and decreasing with depth, so contributions grow gradually during training instead of compounding. Because σ(λ) is a constant once training ends, it folds into the preceding weights at inference and adds no activation.

ii.

Multi-input skips

Following NASNet, each cell takes the outputs of the two preceding cells as input, improving gradient flow through deep multiplicative stacks.

iii.

Depth over width

degree 2^L, params O(L)

Narrow-and-deep beats wide-and-shallow at matched parameters: 3 stacks/cell beats 1 stack/cell by +1.5 points.
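The folding property of Sigmoid-Scale follows from linearity: σ(λ)·(W h) equals (σ(λ)·W) h. The sketch below illustrates this under assumptions; the depth-dependent initialization schedule (`init_logit` with endpoints −2 and −4) is hypothetical and chosen only so that σ(λ) starts small and shrinks with depth, and the branch is reduced to a single linear map.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def init_logit(layer_index, num_layers, first=-2.0, last=-4.0):
    """Hypothetical init: sigmoid(logit) is small and shrinks with depth."""
    frac = layer_index / max(num_layers - 1, 1)
    return first + frac * (last - first)

def residual_train(x, W, h, logit):
    """Training-time form: x + sigmoid(logit) * (W h)."""
    gate = sigmoid(logit)
    return [xi + gate * bi for xi, bi in zip(x, linear(W, h))]

def fold_gate(W, logit):
    """After training the gate is a constant, so scale the weights once."""
    gate = sigmoid(logit)
    return [[gate * w for w in row] for row in W]

def residual_infer(x, W_folded, h):
    """Inference-time form: no sigmoid is evaluated."""
    return [xi + bi for xi, bi in zip(x, linear(W_folded, h))]

num_layers = 6
gates = [sigmoid(init_logit(i, num_layers)) for i in range(num_layers)]

x, h = [1.0, -2.0], [0.5, 0.25, -1.0]
W = [[0.3, -0.6, 0.9], [1.2, 0.4, -0.8]]
logit = init_logit(3, num_layers)

y_train = residual_train(x, W, h, logit)
y_infer = residual_infer(x, fold_gate(W, logit), h)
print(gates, y_train, y_infer)
```

The two outputs agree up to floating-point rounding, so the deployed network contains only additions and multiplications on this path.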

Figure: PolyNeXt cell. Two inputs from the previous two cells are reweighted, normalized, and processed by repeated stacks (spatial mixer → PolyMLP), each gated by Sigmoid-Scale.
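The data flow of a cell can be sketched in a few lines. This is a schematic under several assumptions, not the authors' architecture code: the mixing weights are plain scalars, the spatial mixer and PolyMLP are both replaced by a stand-in elementwise square (`poly_stub`), the first cell is fed the stem output twice, and every name is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def poly_stub(x):
    """Stand-in for a spatial mixer or PolyMLP: any polynomial map would do."""
    return [v * v for v in x]

def cell(prev, prev_prev, mix, gate_logits, num_stacks=3):
    """Reweight two earlier cell outputs, normalize, then run gated stacks."""
    w_prev, w_prev_prev = mix
    x = layer_norm([w_prev * a + w_prev_prev * b for a, b in zip(prev, prev_prev)])
    for s in range(num_stacks):
        for branch in range(2):  # spatial mixer first, then PolyMLP
            gate = sigmoid(gate_logits[2 * s + branch])
            x = [xi + gate * bi for xi, bi in zip(x, poly_stub(x))]
    return x

def run_cells(stem_out, num_cells, mix=(0.5, 0.5), logit=-3.0):
    """Chain cells so that each one sees the outputs of the two before it."""
    outputs = [stem_out, stem_out]  # assumption: the first cell sees the stem twice
    for _ in range(num_cells):
        outputs.append(cell(outputs[-1], outputs[-2], mix, [logit] * 6))
    return outputs[2:]

stem = [1.0, -0.5, 2.0, 0.0]
outs = run_cells(stem, num_cells=4)
print(outs[-1])
```

With very negative gate logits every branch is effectively switched off and a cell reduces to the normalized mix of its two inputs, which is the "start small" behavior the stabilization recipe describes.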

04 · Architecture

PolyNeXt, end to end

A four-stage hierarchical backbone. CPolyNeXt uses PolyConv throughout; APolyNeXt switches to PolyAttn in the last two (low-resolution) stages, where global context is cheap and beneficial.

Figure: Overall framework. Stem → four stages of cells with downsampling → PolyMLP head. Spatial resolution halves and channel width grows at each stage.

05 · Results

Matching activation-based models across scales

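PolyNeXt matches or exceeds activation-based ConvFormer / CAFormer at every scale, and outperforms prior polynomial networks (MONet, DTTN) by 2–3 points at lower cost. The activation-free models in the tables below are the CPolyNeXt and APolyNeXt variants and the prior polynomial networks MONet and DTTN.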

Figure: ImageNet-1K scaling. Accuracy vs. parameters (left) and FLOPs (right). PolyNeXt sits on the Pareto frontier across scales.

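(a) ImageNet-1K: convolutional / MLP models. Top-1 at 224×224, trained from scratch.

| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| Tiny (<12M) | | | |
| MogaNet-T | 5.2 | 1.1 | 79.0 |
| DTTN-T | 7.1 | 2.4 | 77.9 |
| MONet-T | 10 | 2.8 | 77.0 |
| CPolyNeXt-T | 6.4 | 1.2 | 80.2 |
| Small (~12–30M) | | | |
| ConvFormer-S18 | 27 | 3.9 | 83.0 |
| DTTN-S | 12 | 4.1 | 79.4 |
| CPolyNeXt-S | 26 | 4.8 | 83.9 |
| Medium (~30–50M) | | | |
| ConvFormer-S36 | 40 | 7.6 | 84.1 |
| UniConvNet-S | 50 | 8.5 | 84.5 |
| DTTN-B | 36 | 12.3 | 82.4 |
| CPolyNeXt-B | 40 | 8.5 | 84.7 |
| Large (~50–100M) | | | |
| ConvNeXt-B | 89 | 15.4 | 83.8 |
| ConvFormer-M36 | 57 | 12.8 | 84.5 |
| MogaNet-L | 83 | 15.9 | 84.7 |
| CPolyNeXt-L | 57 | 12.6 | 84.9 |
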
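(b) ImageNet-1K: attention / hybrid models.

| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| Tiny (<12M) | | | |
| FAN-T-Hybrid | 7.0 | 3.5 | 80.1 |
| APolyNeXt-T | 6.5 | 1.3 | 80.9 |
| Small (~12–30M) | | | |
| CAFormer-S18 | 26 | 4.1 | 83.6 |
| RMT-S | 27 | 4.5 | 84.1 |
| TransNeXt-Tiny | 28 | 5.7 | 84.0 |
| APolyNeXt-S | 26 | 5.3 | 84.3 |
| Medium (~30–50M) | | | |
| CAFormer-S36 | 39 | 8.0 | 84.5 |
| TransNeXt-Small | 50 | 10.3 | 84.7 |
| APolyNeXt-B | 41 | 9.3 | 84.9 |
| Large (~50–100M) | | | |
| RMT-B | 54 | 9.7 | 85.0 |
| CAFormer-M36 | 56 | 13.2 | 85.2 |
| APolyNeXt-L | 57 | 13.3 | 85.2 |
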
Out-of-distribution robustness (no fine-tuning), compared against activation-based models at matched scale. IN-C is mean corruption error, where lower is better; the rest are top-1 accuracy (%).

| Model | Clean | IN-C ↓ | IN-A | IN-R | IN-Sketch |
|---|---|---|---|---|---|
| Small (~25–30M) | | | | | |
| Swin-T | 81.3 | 62.0 | 21.6 | 41.3 | 29.1 |
| ConvNeXt-T | 82.1 | 53.2 | 24.2 | 47.2 | 33.8 |
| ConvFormer-S18 | 83.0 | 51.7 | 25.3 | 48.7 | 35.2 |
| CAFormer-S18 | 83.6 | 47.4 | 33.5 | 48.7 | 36.6 |
| CPolyNeXt-S | 83.9 | 47.9 | 35.1 | 49.4 | 37.8 |
| APolyNeXt-S | 84.3 | 45.0 | 39.6 | 49.7 | 37.5 |
| Medium (~35–50M) | | | | | |
| Swin-S | 83.0 | 52.7 | 32.3 | 45.1 | 32.4 |
| ConvNeXt-S | 83.1 | 51.2 | 31.2 | 49.5 | 37.1 |
| MONet-S | 81.3 | 49.7 | n/a | n/a | n/a |
| ConvFormer-S36 | 84.1 | 47.1 | 33.2 | 50.8 | 38.4 |
| CAFormer-S36 | 84.5 | 44.7 | 40.9 | 51.7 | 39.5 |
| CPolyNeXt-B | 84.7 | 44.5 | 42.8 | 52.0 | 40.0 |
| APolyNeXt-B | 84.9 | 42.7 | 46.8 | 52.8 | 41.1 |
| Large (~50–90M) | | | | | |
| Swin-B | 83.5 | 54.4 | 35.8 | 46.6 | 32.4 |
| ConvNeXt-B | 83.8 | 46.8 | 36.7 | 51.3 | 38.2 |
| ConvFormer-M36 | 84.5 | 46.5 | 37.6 | 51.0 | 39.2 |
| CAFormer-M36 | 85.2 | 42.6 | 45.6 | 51.7 | 39.6 |
| CPolyNeXt-L | 84.9 | 42.5 | 48.3 | 54.5 | 41.8 |
| APolyNeXt-L | 85.2 | 42.9 | 49.2 | 54.0 | 41.8 |

ADE20K segmentation. UperNet, 160K iters. Gains exceed the classification margin.

| Model | Params | mIoU |
|---|---|---|
| ConvFormer-S18 | 54M | 48.6 |
| CAFormer-S18 | 54M | 48.9 |
| CPolyNeXt-S | 54M | 50.6 |
| APolyNeXt-S | 55M | 49.9 |

Fully polynomial variants. LayerNorm is replaced by a polynomial-compatible BatchNorm, so inference uses only additions and multiplications (a step toward FHE-compatible inference).

| Model | Top-1 with LN (%) | Top-1 with Poly BN (%) |
|---|---|---|
| MONet-T | 77.0 | 77.7 |
| DTTN-S | 79.4 | 77.2 |
| CPolyNeXt-T | 80.2 | 78.3 |
| CPolyNeXt-S | 83.9 | 82.7 |
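One reason a BatchNorm-style layer suits addition-and-multiplication-only inference is that, with frozen running statistics, it is just a per-channel affine map. The sketch below shows this for standard BatchNorm in evaluation mode; whether the paper's polynomial-compatible BatchNorm matches this form exactly is an assumption, and the function names and numbers are illustrative.

```python
import math

def batchnorm_eval(x, gamma, beta, mean, var, eps=1e-5):
    """Reference: standard BatchNorm at inference, using stored statistics."""
    return [
        g * (v - m) / math.sqrt(s + eps) + b
        for v, g, b, m, s in zip(x, gamma, beta, mean, var)
    ]

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Precompute per-channel scale and shift once, offline."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    shift = [b - m * k for b, m, k in zip(beta, mean, scale)]
    return scale, shift

def apply_affine(x, scale, shift):
    """Inference path: one multiplication and one addition per channel."""
    return [k * v + c for v, k, c in zip(x, scale, shift)]

x = [0.3, -1.2, 2.5]
gamma, beta = [1.0, 0.5, 2.0], [0.0, 0.1, -0.3]
running_mean, running_var = [0.1, -0.2, 1.0], [0.9, 1.5, 0.25]

scale, shift = fold_batchnorm(gamma, beta, running_mean, running_var)
y_ref = batchnorm_eval(x, gamma, beta, running_mean, running_var)
y_fold = apply_affine(x, scale, shift)
print(y_ref, y_fold)
```

LayerNorm, by contrast, computes a mean, a variance, and an inverse square root from each input at inference time, which is why it cannot be folded away in the same manner.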

Why do activations hurt? The Hadamard block has a mutual gradient coupling: each branch's gradient is scaled by its sibling's output, so the two branches learn through each other. Inserting a GELU breaks that coupling, which is why every way of adding an activation back lowers accuracy. Replacing the product with addition removes the coupling altogether, and accuracy collapses (−22.3).
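The coupling can be written out for a single bias-free block. The derivation below is a sketch in notation introduced here for illustration (the symbols W_1, W_2, g, and φ are not taken from the paper): a and b are the two branch outputs, g is the upstream gradient, and φ stands for GELU.

```latex
\begin{aligned}
&\text{Product:} & z &= a \odot b,\quad a = W_1 x,\; b = W_2 x,\quad g = \frac{\partial \mathcal{L}}{\partial z} \\
& & \frac{\partial \mathcal{L}}{\partial W_1} &= (g \odot b)\,x^{\top}, \qquad
    \frac{\partial \mathcal{L}}{\partial W_2} = (g \odot a)\,x^{\top} \\[4pt]
&\text{GELU on one branch:} & z &= \phi(a) \odot b \\
& & \frac{\partial \mathcal{L}}{\partial W_1} &= \bigl(g \odot b \odot \phi'(a)\bigr)\,x^{\top}, \qquad
    \frac{\partial \mathcal{L}}{\partial W_2} = \bigl(g \odot \phi(a)\bigr)\,x^{\top} \\[4pt]
&\text{Addition:} & z &= a + b \\
& & \frac{\partial \mathcal{L}}{\partial W_1} &= g\,x^{\top}, \qquad
    \frac{\partial \mathcal{L}}{\partial W_2} = g\,x^{\top}
\end{aligned}
```

In the pure product each branch's gradient carries the other branch's output unmodified. With a GELU the sibling terms pass through φ and φ′, and with addition neither gradient depends on the sibling at all, leaving the block linear in x.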

Citation

BibTeX

If you find this work useful, please cite our paper.

citation.bib
@inproceedings{wang2026polynext,
  title     = {Activation-Free Backbones for Image Recognition: Polynomial
               Alternatives within MetaFormer-Style Vision Models},
  author    = {Wang, Jeffrey and Gregory, Jonathan and Chrysos, Grigorios G.},
  booktitle = {Proceedings of the 43rd International Conference on
               Machine Learning (ICML)},
  year      = {2026}
}

The BibTeX entry will be updated with the official PMLR proceedings reference once published.