Modern vision models treat pointwise activations and the softmax exponential as essential. We show they are not required, and that polynomial alternatives match or exceed them.
University of Wisconsin–Madison · jjwang8@wisc.edu
Abstract
We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules drop into existing architectures: instantiated within MetaFormer, our PolyNeXt models match or exceed activation-based counterparts across scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness, while substantially outperforming prior polynomial networks at reduced cost.
01 · Key Idea
The Hadamard product of two learned linear projections of the input is already a second-degree polynomial in that input, with no ReLU or GELU required. Compose these products across layers and the polynomial degree grows exponentially with depth, while parameters grow only linearly.
This gap is why PolyNeXt favors depth over width: stacking narrow layers yields exponential expressiveness at linear parameter cost.
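The degree-versus-parameters gap can be checked on a toy scalar case. The sketch below is an illustration under stated assumptions, not the paper's code: each "layer" is the product of two affine maps of the previous output, polynomials are stored as coefficient lists, and the weights are hypothetical. Under that construction the degree doubles per layer (2, 4, 8) while the parameter count rises by four per layer.

```python
# Polynomials in one variable as coefficient lists:
# [c0, c1, c2, ...] means c0 + c1*x + c2*x^2 + ...

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_affine(p, w, b):
    """Return the polynomial w * p(x) + b."""
    out = [w * c for c in p]
    out[0] += b
    return out

def degree(p, tol=1e-12):
    """Index of the highest coefficient that is not numerically zero."""
    for k in range(len(p) - 1, -1, -1):
        if abs(p[k]) > tol:
            return k
    return 0

def hadamard_layer(p, params):
    """One scalar layer: the product of two affine maps of the incoming polynomial."""
    w1, b1, w2, b2 = params
    return poly_mul(poly_affine(p, w1, b1), poly_affine(p, w2, b2))

# Hypothetical weights (assumption); any nonzero w1, w2 give the same degrees.
layers = [(0.5, 1.0, -1.5, 0.3), (2.0, -0.2, 0.7, 0.1), (1.1, 0.4, -0.9, 0.6)]

p = [0.0, 1.0]  # start from the identity polynomial p(x) = x
degrees, param_counts = [], []
for depth, params in enumerate(layers, start=1):
    p = hadamard_layer(p, params)
    degrees.append(degree(p))
    param_counts.append(4 * depth)  # four scalars per layer in this toy setup

print(degrees)       # [2, 4, 8]
print(param_counts)  # [4, 8, 12]
```

The real modules operate on vectors and include normalization, so this only illustrates the scaling argument, not the architecture.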
02 · The Modules
Each module swaps a nonlinearity for a Hadamard product while preserving the input-output interface, so it stays compatible with orthogonal improvements like windowed or sparse attention.
The MLP module replaces the activation between linear projections with the elementwise product of two parallel projections, followed by LayerNorm.
The convolution module (PolyConv) fuses a dilated coarse branch and a fine branch, which have different receptive fields, and applies a channel-flip to decorrelate them.
The attention module (PolyAttn) replaces the softmax exponential with a polynomial kernel and normalization, and uses shared Q/K projections and depthwise convolutions.
Set beside their standard counterparts, each polynomial module replaces the activation or softmax (shown in red) with a Hadamard-product construction. This also entails supporting changes, such as parallel convolutional branches with different receptive fields and depthwise convolutions for local context.
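To make the attention swap concrete, here is a minimal sketch of attention with a polynomial kernel in place of the softmax exponential. Several details are assumptions made for illustration: the even kernel degree (so weights are non-negative), the plain row-sum normalization, and the function name `poly_attention`. The source does not specify the degree or the normalization PolyAttn uses, and the depthwise convolutions are omitted here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def poly_attention(queries, keys, values, degree=2, eps=1e-6):
    """Attention with weights (q . k)^degree, normalized by their row sum.

    Assumption: an even degree, so every weight is non-negative.
    No exponential is evaluated anywhere.
    """
    if degree % 2 != 0:
        raise ValueError("this sketch assumes an even degree")
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = [dot(q, k) ** degree for k in keys]
        total = sum(weights) + eps  # eps guards against an all-zero row
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs

# Shared Q/K: the same token projections act as both queries and keys.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
out = poly_attention(tokens, tokens, values, degree=2)
for row in out:
    print([round(v, 3) for v in row])
```

Note that dividing by the row sum makes this particular variant a rational function of the input rather than a pure polynomial; the normalization in the actual module may differ.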
03 · Making It Trainable
Hadamard products amplify large values, and that amplification compounds across depth. Three lightweight mechanisms tame it and let polynomial networks train at just under 200 layers, far deeper than prior shallow-and-wide polynomial models.
Each residual branch is scaled by a learnable factor σ(λ) ∈ (0,1), initialized small and decreasing with depth, so contributions grow gradually during training instead of compounding. It folds into the preceding weights at inference, adding no activation.
Following NASNet, each cell sees the two preceding cells, improving gradient flow through deep multiplicative stacks.
Narrow-and-deep beats wide-and-shallow at matched parameters: 3 stacks per cell beat 1 stack per cell by 1.5 points.
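The learnable residual scale described above, and the claim that it folds into the preceding weights at inference, can be sketched as follows. This is a hedged illustration: it assumes the branch ends in a linear projection `W` applied to a branch feature `h`, and all names and values (`scaled_residual`, `fold_scale`, `lam = -3.0`) are hypothetical. The depth-dependent initialization schedule is not shown.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def scaled_residual(x, h, W, lam):
    """Training-time form: x + sigmoid(lam) * (W @ h)."""
    s = sigmoid(lam)
    return [xi + s * bi for xi, bi in zip(x, matvec(W, h))]

def fold_scale(W, lam):
    """Inference-time form: absorb sigmoid(lam) into the weights once."""
    s = sigmoid(lam)
    return [[s * w for w in row] for row in W]

def folded_residual(x, h, W_folded):
    """Plain residual add; the scale already lives inside W_folded."""
    return [xi + bi for xi, bi in zip(x, matvec(W_folded, h))]

# Hypothetical values (assumption). lam = -3.0 gives a small initial scale.
x = [0.2, -1.0]
h = [1.5, 0.5, -2.0]
W = [[0.3, -0.7, 0.1], [1.2, 0.4, -0.5]]
lam = -3.0

train_out = scaled_residual(x, h, W, lam)
infer_out = folded_residual(x, h, fold_scale(W, lam))
print(round(sigmoid(lam), 4))
print(all(abs(a - b) < 1e-9 for a, b in zip(train_out, infer_out)))
```

Because the sigmoid acts on a scalar parameter rather than on features, the folded network contains no pointwise activation at inference time.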
04 · Architecture
A four-stage hierarchical backbone. CPolyNeXt uses PolyConv throughout; APolyNeXt switches to PolyAttn in the last two (low-resolution) stages, where global context is cheap and beneficial.
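The stage layout can be written down as a small configuration helper. This is a sketch only: the function name `stage_mixers` and the string labels are hypothetical, and it assumes APolyNeXt keeps PolyConv in its first two stages, which is what "switches to PolyAttn in the last two" implies.

```python
NUM_STAGES = 4  # four-stage hierarchical backbone

def stage_mixers(variant):
    """Return the token mixer used in each of the four stages."""
    if variant == "CPolyNeXt":
        # PolyConv throughout
        return ["PolyConv"] * NUM_STAGES
    if variant == "APolyNeXt":
        # PolyAttn only in the last two (low-resolution) stages
        return ["PolyConv"] * (NUM_STAGES - 2) + ["PolyAttn"] * 2
    raise ValueError(f"unknown variant: {variant}")

for name in ("CPolyNeXt", "APolyNeXt"):
    print(name, stage_mixers(name))
```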
05 · Results
PolyNeXt matches or exceeds activation-based ConvFormer / CAFormer across every scale, and outperforms prior polynomial networks (MONet, DTTN) by 2–3 points at lower cost. ◆ marks activation-free models.

ImageNet classification: CPolyNeXt against comparable backbones.

| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| Tiny (<12M) | | | |
| MogaNet-T | 5.2 | 1.1 | 79.0 |
| DTTN-T ◆ | 7.1 | 2.4 | 77.9 |
| MONet-T ◆ | 10 | 2.8 | 77.0 |
| CPolyNeXt-T ◆ | 6.4 | 1.2 | 80.2 |
| Small (~12–30M) | | | |
| ConvFormer-S18 | 27 | 3.9 | 83.0 |
| DTTN-S ◆ | 12 | 4.1 | 79.4 |
| CPolyNeXt-S ◆ | 26 | 4.8 | 83.9 |
| Medium (~30–50M) | | | |
| ConvFormer-S36 | 40 | 7.6 | 84.1 |
| UniConvNet-S | 50 | 8.5 | 84.5 |
| DTTN-B ◆ | 36 | 12.3 | 82.4 |
| CPolyNeXt-B ◆ | 40 | 8.5 | 84.7 |
| Large (~50–100M) | | | |
| ConvNeXt-B | 89 | 15.4 | 83.8 |
| ConvFormer-M36 | 57 | 12.8 | 84.5 |
| MogaNet-L | 83 | 15.9 | 84.7 |
| CPolyNeXt-L ◆ | 57 | 12.6 | 84.9 |

ImageNet classification: APolyNeXt against comparable backbones.

| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| Tiny (<12M) | | | |
| FAN-T-Hybrid | 7.0 | 3.5 | 80.1 |
| APolyNeXt-T ◆ | 6.5 | 1.3 | 80.9 |
| Small (~12–30M) | | | |
| CAFormer-S18 | 26 | 4.1 | 83.6 |
| RMT-S | 27 | 4.5 | 84.1 |
| TransNeXt-Tiny | 28 | 5.7 | 84.0 |
| APolyNeXt-S ◆ | 26 | 5.3 | 84.3 |
| Medium (~30–50M) | | | |
| CAFormer-S36 | 39 | 8.0 | 84.5 |
| TransNeXt-Small | 50 | 10.3 | 84.7 |
| APolyNeXt-B ◆ | 41 | 9.3 | 84.9 |
| Large (~50–100M) | | | |
| RMT-B | 54 | 9.7 | 85.0 |
| CAFormer-M36 | 56 | 13.2 | 85.2 |
| APolyNeXt-L ◆ | 57 | 13.3 | 85.2 |

Out-of-distribution robustness. Clean is ImageNet top-1 accuracy; for IN-C, lower is better.

| Model | Clean | IN-C ↓ | IN-A | IN-R | IN-Sketch |
|---|---|---|---|---|---|
| Small (~25–30M) | | | | | |
| Swin-T | 81.3 | 62.0 | 21.6 | 41.3 | 29.1 |
| ConvNeXt-T | 82.1 | 53.2 | 24.2 | 47.2 | 33.8 |
| ConvFormer-S18 | 83.0 | 51.7 | 25.3 | 48.7 | 35.2 |
| CAFormer-S18 | 83.6 | 47.4 | 33.5 | 48.7 | 36.6 |
| CPolyNeXt-S ◆ | 83.9 | 47.9 | 35.1 | 49.4 | 37.8 |
| APolyNeXt-S ◆ | 84.3 | 45.0 | 39.6 | 49.7 | 37.5 |
| Medium (~35–50M) | | | | | |
| Swin-S | 83.0 | 52.7 | 32.3 | 45.1 | 32.4 |
| ConvNeXt-S | 83.1 | 51.2 | 31.2 | 49.5 | 37.1 |
| MONet-S ◆ | 81.3 | 49.7 | n/a | n/a | n/a |
| ConvFormer-S36 | 84.1 | 47.1 | 33.2 | 50.8 | 38.4 |
| CAFormer-S36 | 84.5 | 44.7 | 40.9 | 51.7 | 39.5 |
| CPolyNeXt-B ◆ | 84.7 | 44.5 | 42.8 | 52.0 | 40.0 |
| APolyNeXt-B ◆ | 84.9 | 42.7 | 46.8 | 52.8 | 41.1 |
| Large (~50–90M) | | | | | |
| Swin-B | 83.5 | 54.4 | 35.8 | 46.6 | 32.4 |
| ConvNeXt-B | 83.8 | 46.8 | 36.7 | 51.3 | 38.2 |
| ConvFormer-M36 | 84.5 | 46.5 | 37.6 | 51.0 | 39.2 |
| CAFormer-M36 | 85.2 | 42.6 | 45.6 | 51.7 | 39.6 |
| CPolyNeXt-L ◆ | 84.9 | 42.5 | 48.3 | 54.5 | 41.8 |
| APolyNeXt-L ◆ | 85.2 | 42.9 | 49.2 | 54.0 | 41.8 |

ADE20K semantic segmentation.

| Model | Params | mIoU |
|---|---|---|
| ConvFormer-S18 | 54M | 48.6 |
| CAFormer-S18 | 54M | 48.9 |
| CPolyNeXt-S ◆ | 54M | 50.6 |
| APolyNeXt-S ◆ | 55M | 49.9 |

| Model | LN ver. | Poly BN |
|---|---|---|
| MONet-T ◆ | 77.0 | 77.7 |
| DTTN-S ◆ | 79.4 | 77.2 |
| CPolyNeXt-T ◆ | 80.2 | 78.3 |
| CPolyNeXt-S ◆ | 83.9 | 82.7 |
Why do activations hurt? The Hadamard block has a mutual gradient coupling: each branch's gradient is scaled by its sibling's output, so each branch learns through the other. Inserting a GELU breaks that coupling, which is why every way of adding an activation back lowers accuracy, and replacing the product with addition collapses accuracy (−22.3 points).
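The coupling can be made explicit with a short derivation. The notation here is an assumption introduced for illustration: $u = W_1 x$ and $v = W_2 x$ are the two branches, $z$ is the block output, and $g = \partial \mathcal{L} / \partial z$ is the upstream gradient. Normalization and residual paths are ignored.

$$
\begin{aligned}
\text{Product: } z = u \odot v
&\;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial W_1} = (g \odot v)\, x^{\top},
\qquad
\frac{\partial \mathcal{L}}{\partial W_2} = (g \odot u)\, x^{\top} \\
\text{Sum: } z = u + v
&\;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial W_1} = g\, x^{\top},
\qquad
\frac{\partial \mathcal{L}}{\partial W_2} = g\, x^{\top}
\end{aligned}
$$

With the product, the update to each projection is gated elementwise by the other branch's output. With the sum, both projections receive the same gradient and the interaction between the branches disappears.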
Citation
If you find this work useful, please cite our paper.
@inproceedings{wang2026polynext,
  title     = {Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models},
  author    = {Wang, Jeffrey and Gregory, Jonathan and Chrysos, Grigorios G.},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}
The BibTeX entry will be updated with the official PMLR proceedings reference once published.