PolyNeXt: Activation-Free Backbones for Image Recognition

Abstract

Activations are not a representational necessity.

We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules drop into existing architectures: instantiated within MetaFormer, our PolyNeXt models match or exceed activation-based counterparts across scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness, while substantially outperforming prior polynomial networks at reduced cost.

Polynomial Networks · Hadamard Product · Activation-Free · MetaFormer · Vision Backbones · FHE-amenable

**85.2%** ImageNet top-1 (APolyNeXt-L), matching CAFormer with zero activations

**+5.0 pt** over the best prior fully polynomial model (a step toward FHE-compatible inference)

**30–45%** less peak GPU memory than MetaFormer at matched scale

**~200** layers trained stably with our lightweight recipe

01 · Key Idea

Multiplication in place of activation

The Hadamard product of two learned linear projections, $(W_{a} x) * (W_{b} x)$, is already a second-degree polynomial in the input, with no ReLU or GELU required. Compose these across $L$ layers and the polynomial degree grows as $2^{L}$ while the parameter count grows only linearly in $L$. Even in one dimension, the product of two lines is a curve.

One-dimensional example: $a(x) = w_{a} x + b_{a}$ and $b(x) = w_{b} x + b_{b}$, with $w_{a} = 1.2$, $b_{a} = 0.4$, $w_{b} = -1.6$, $b_{b} = 0.8$. The product $a(x) * b(x)$ is a degree-2 polynomial, built with no activation.

Stack depth $L = 6$ → representable degree $2^{L} = 64$. Degree grows as $2^{L}$ while parameters grow $\propto L$ (6×).

This gap is why PolyNeXt favors depth over width: stacking narrow layers yields exponential expressiveness at linear parameter cost.
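The degree doubling can be checked on a scalar toy model. The sketch below is an illustration, not the paper's code: it stores a polynomial as a coefficient list, applies one Hadamard layer $(w_{a} p + b_{a}) * (w_{b} p + b_{b})$ per step, and reuses the same four example weights at every layer. The helper names (`poly_mul`, `affine`, `hadamard_layer`) and the choice to share weights across layers are assumptions made for the example.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def affine(p, w, b):
    """Return the coefficients of w * p(x) + b."""
    out = [w * c for c in p]
    out[0] += b
    return out


def hadamard_layer(p, wa, ba, wb, bb):
    """One scalar 'layer': the product of two affine maps of the same input."""
    return poly_mul(affine(p, wa, ba), affine(p, wb, bb))


identity = [0.0, 1.0]                   # the polynomial p(x) = x
wa, ba, wb, bb = 1.2, 0.4, -1.6, 0.8    # example weights (assumed, shared by all layers)

# One layer: (1.2x + 0.4) * (-1.6x + 0.8) is a quadratic.
one_layer = hadamard_layer(identity, wa, ba, wb, bb)

# Six stacked layers: the degree doubles each time, with four scalars per layer.
p = identity
degrees = []
for _ in range(6):
    p = hadamard_layer(p, wa, ba, wb, bb)
    degrees.append(len(p) - 1)

print(one_layer)
print(degrees)
```

After one layer the coefficient list has three entries (a quadratic). After six layers it has 65 entries, a degree-64 polynomial, while each layer contributed only four scalars.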

02 · The Modules

Three polynomial primitives

Each module swaps a nonlinearity for a Hadamard product while preserving the input-output interface, so it stays compatible with orthogonal improvements like windowed or sparse attention.

PolyMLP

Channel mixing

W_{o} ((W_{a} x) * (W_{b} x))

Replaces the activation between linear projections with the elementwise product of two parallel projections, followed by LayerNorm.
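A minimal sketch of the PolyMLP formula in plain Python, under stated assumptions: no biases, LayerNorm omitted, and small hypothetical dimensions. The function names (`matvec`, `poly_mlp`, `random_matrix`) are ours for illustration, not an API from the paper.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def poly_mlp(x, Wa, Wb, Wo):
    """Compute W_o((W_a x) * (W_b x)) with no activation function."""
    a = matvec(Wa, x)
    b = matvec(Wb, x)
    hidden = [ai * bi for ai, bi in zip(a, b)]  # Hadamard product replaces GELU
    return matvec(Wo, hidden)


def random_matrix(rows, cols, rng):
    """Random weights in [-1, 1]; a placeholder for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden_dim = 4, 8                      # hypothetical sizes
Wa = random_matrix(hidden_dim, dim, rng)
Wb = random_matrix(hidden_dim, dim, rng)
Wo = random_matrix(dim, hidden_dim, rng)

x = [0.5, -1.0, 2.0, 0.25]
y = poly_mlp(x, Wa, Wb, Wo)

# Degree-2 check: without biases, scaling the input by 3 scales the output by 9.
y_scaled = poly_mlp([3.0 * xi for xi in x], Wa, Wb, Wo)
```

The scaling check is a quick way to confirm that the block is a homogeneous quadratic in its input when biases are absent. With biases, lower-degree terms appear as well.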

PolyConv

Convolutional spatial mixing

K_{c} (h) * flip (K_{f} (h))

Fuses a dilated coarse branch and a fine branch with different receptive fields, with a channel-flip to decorrelate them.
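The fusion can be sketched in one dimension. Everything beyond the formula $K_{c}(h) * \text{flip}(K_{f}(h))$ is an assumption here: depthwise 1D convolutions with zero padding stand in for the real 2D operators, the coarse branch uses dilation 2, and `flip` is read as reversing the channel order. Kernel sizes and values are placeholders.

```python
def depthwise_conv1d(h, kernels, dilation=1):
    """Per-channel 1D convolution with zero 'same' padding.

    h: list of channels, each a list of length T.
    kernels: one odd-length kernel per channel.
    """
    out = []
    for chan, kernel in zip(h, kernels):
        half = len(kernel) // 2
        row = []
        for t in range(len(chan)):
            acc = 0.0
            for j, kj in enumerate(kernel):
                src = t + (j - half) * dilation
                if 0 <= src < len(chan):    # positions outside the signal count as zero
                    acc += kj * chan[src]
            row.append(acc)
        out.append(row)
    return out


def poly_conv(h, coarse_kernels, fine_kernels, coarse_dilation=2):
    """Hadamard product of a dilated coarse branch and a channel-flipped fine branch."""
    coarse = depthwise_conv1d(h, coarse_kernels, dilation=coarse_dilation)
    fine = depthwise_conv1d(h, fine_kernels, dilation=1)
    flipped = fine[::-1]                    # assumed meaning of 'flip': reverse channel order
    return [[c * f for c, f in zip(c_row, f_row)]
            for c_row, f_row in zip(coarse, flipped)]


h = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [0.0, 1.0, 0.0, -1.0, 0.0],
     [2.0, 2.0, 2.0, 2.0, 2.0]]
coarse_kernels = [[0.5, 1.0, 0.5]] * 3
fine_kernels = [[1.0, -1.0, 1.0], [0.25, 0.5, 0.25], [1.0, 0.0, -1.0]]

out = poly_conv(h, coarse_kernels, fine_kernels)
```

Because channel `i` of the coarse branch is multiplied with channel `C - 1 - i` of the fine branch, each output channel mixes two different input channels at two different receptive fields.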

PolyAttn

Attention spatial mixing

$(s Q K^{⊤} + 1)^{p}$, then $ℓ_{1}$ norm

Replaces the softmax exponential with a polynomial kernel (degree $p = 4$) and $ℓ_{1}$ normalization, with shared Q/K and depthwise convolutions on Q, K, V.
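A minimal sketch of the polynomial attention weights, assuming single-head attention on plain Python lists. The depthwise convolutions are omitted, "shared Q/K" is modeled simply by passing the same matrix for both, and the scale `s` and the small `eps` in the denominator are hypothetical choices, not values from the paper.

```python
import math


def poly_attention(Q, K, V, s, p=4, eps=1e-6):
    """Attention with weights (s * q.k + 1)^p normalized by their l1 norm per query."""
    outputs, all_weights = [], []
    for q in Q:
        scores = [(s * sum(qi * ki for qi, ki in zip(q, k)) + 1.0) ** p for k in K]
        denom = sum(abs(sc) for sc in scores) + eps   # l1 normalization replaces softmax
        weights = [sc / denom for sc in scores]
        mixed = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
s = 1.0 / math.sqrt(len(tokens[0]))   # assumed scale, analogous to 1/sqrt(d)

# Shared Q/K: the same token matrix is used as queries and keys.
out, weights = poly_attention(tokens, tokens, V, s)
```

With an even degree such as $p = 4$, every score is non-negative, so the $ℓ_{1}$-normalized weights of each query are non-negative and sum to (almost exactly) one, much like softmax weights.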

Figure: **Polynomial module primitives.** Activation-free replacements for the three core vision operators: PolyMLP (channel mixing), PolyConv (convolutional spatial mixing), and PolyAttn (attention-based spatial mixing).

Set beside their standard counterparts, each polynomial module replaces the activation or softmax (shown in red) with a Hadamard-product construction. This also entails supporting changes, such as parallel convolutional branches with different receptive fields and depthwise convolutions for local context.

Figure: **Standard vs. polynomial, layer by layer.** The activation (GELU) and softmax are shown in red. The polynomial variants replace them with a Hadamard product and add supporting structure, such as parallel branches with different receptive fields and depthwise convolutions on Q, K, V.

03 · Making It Trainable

The barrier was stability, not capacity

Hadamard products amplify large values, and the effect compounds across depth. Three lightweight mechanisms tame it and let polynomial networks train stably at depths just under 200 layers, far deeper than prior shallow-and-wide polynomial models.

i.

Sigmoid-Scale

y = x + σ(λ) f(x)

Each residual branch is scaled by a learnable factor σ(λ) ∈ (0,1), initialized small and decreasing with depth, so contributions grow gradually during training instead of compounding. It folds into the preceding weights at inference, adding no activation.
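The folding step can be illustrated with a small sketch. It assumes that the residual branch $f$ ends in a linear projection `Wo` applied to some hidden vector `h`, so the constant $σ(λ)$ can be multiplied into `Wo` once training is done. The value of `lam` and all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def residual_train(x, h, Wo, lam):
    """Training-time form: y = x + sigmoid(lam) * (Wo h)."""
    gate = sigmoid(lam)
    return [xi + gate * fi for xi, fi in zip(x, matvec(Wo, h))]


def fold_scale(Wo, lam):
    """Absorb the constant sigmoid(lam) into the projection weights."""
    gate = sigmoid(lam)
    return [[gate * w for w in row] for row in Wo]


def residual_infer(x, h, Wo_folded):
    """Inference-time form: y = x + Wo_folded h, with no sigmoid evaluated."""
    return [xi + fi for xi, fi in zip(x, matvec(Wo_folded, h))]


x = [1.0, -2.0]
h = [0.5, 1.5, -1.0]
Wo = [[0.2, -0.4, 0.1], [0.3, 0.0, -0.5]]
lam = -3.0   # hypothetical small initial value; sigmoid(-3) is about 0.047

y_train = residual_train(x, h, Wo, lam)
y_infer = residual_infer(x, h, fold_scale(Wo, lam))
```

Both forms give the same output, which is why the gate adds no nonlinearity at inference: once $λ$ is frozen, $σ(λ)$ is just a number.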

ii.

Multi-input skips

\tilde{x} = s_{0} * x_{t - 2} + s_{1} * x_{t - 1}

Following NASNet, each cell sees the two preceding cells, improving gradient flow through deep multiplicative stacks.
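A minimal sketch of the two-input wiring, assuming per-channel scale vectors `s0` and `s1` and assuming that the first cell receives the stem output on both inputs. The cells themselves are stand-in callables, and the normalization applied after mixing is left out.

```python
def mix_inputs(x_prev2, x_prev1, s0, s1):
    """Elementwise reweighting: s0 * x_{t-2} + s1 * x_{t-1}."""
    return [a * u + b * v for a, u, b, v in zip(s0, x_prev2, s1, x_prev1)]


def run_cells(x0, cells, scales):
    """Feed each cell a learned mix of the outputs of the two preceding cells."""
    prev2, prev1 = x0, x0                   # assumption: the stem output fills both slots
    for cell, (s0, s1) in zip(cells, scales):
        out = cell(mix_inputs(prev2, prev1, s0, s1))
        prev2, prev1 = prev1, out
    return prev1


def double(v):
    """Stand-in for a real cell."""
    return [2.0 * e for e in v]


x0 = [1.0, 2.0]
ones = [1.0, 1.0]
result = run_cells(x0, [double, double], [(ones, ones), (ones, ones)])
```

The second cell here sees both the stem output and the first cell's output, so gradients reach the stem through a path that skips the first cell entirely.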

iii.

Depth over width

degree $2^{L}$, params $\propto L$

Narrow-and-deep beats wide-and-shallow at matched parameters: three stacks per cell beat one stack per cell by 1.5 points.

Figure: **PolyNeXt cell.** Two inputs from the previous two cells are reweighted, normalized, and processed by repeated stacks (spatial mixer → PolyMLP), each gated by Sigmoid-Scale.

05 · Results

Matching activation-based models across scales

PolyNeXt matches or exceeds the activation-based ConvFormer and CAFormer at every scale, and outperforms the prior polynomial networks MONet and DTTN by 2.3 to 4.5 points in the comparisons below, with fewer FLOPs at the Tiny and Medium scales. ◆ marks activation-free models.

Figure: **ImageNet-1K scaling.** Accuracy vs. parameters (left) and FLOPs (right). PolyNeXt sits on the Pareto frontier across scales.

**(a) ImageNet-1K: convolutional / MLP models.** Top-1 at 224×224, trained from scratch.

| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| *Tiny (<12M)* | | | |
| MogaNet-T | 5.2 | 1.1 | 79.0 |
| DTTN-T ◆ | 7.1 | 2.4 | 77.9 |
| MONet-T ◆ | 10 | 2.8 | 77.0 |
| CPolyNeXt-T ◆ | 6.4 | 1.2 | 80.2 |
| *Small (~12–30M)* | | | |
| ConvFormer-S18 | 27 | 3.9 | 83.0 |
| DTTN-S ◆ | 12 | 4.1 | 79.4 |
| CPolyNeXt-S ◆ | 26 | 4.8 | 83.9 |
| *Medium (~30–50M)* | | | |
| ConvFormer-S36 | 40 | 7.6 | 84.1 |
| UniConvNet-S | 50 | 8.5 | 84.5 |
| DTTN-B ◆ | 36 | 12.3 | 82.4 |
| CPolyNeXt-B ◆ | 40 | 8.5 | 84.7 |
| *Large (~50–100M)* | | | |
| ConvNeXt-B | 89 | 15.4 | 83.8 |
| ConvFormer-M36 | 57 | 12.8 | 84.5 |
| MogaNet-L | 83 | 15.9 | 84.7 |
| CPolyNeXt-L ◆ | 57 | 12.6 | 84.9 |

**(b) ImageNet-1K: attention / hybrid models.**

| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| *Tiny (<12M)* | | | |
| FAN-T-Hybrid | 7.0 | 3.5 | 80.1 |
| APolyNeXt-T ◆ | 6.5 | 1.3 | 80.9 |
| *Small (~12–30M)* | | | |
| CAFormer-S18 | 26 | 4.1 | 83.6 |
| RMT-S | 27 | 4.5 | 84.1 |
| TransNeXt-Tiny | 28 | 5.7 | 84.0 |
| APolyNeXt-S ◆ | 26 | 5.3 | 84.3 |
| *Medium (~30–50M)* | | | |
| CAFormer-S36 | 39 | 8.0 | 84.5 |
| TransNeXt-Small | 50 | 10.3 | 84.7 |
| APolyNeXt-B ◆ | 41 | 9.3 | 84.9 |
| *Large (~50–100M)* | | | |
| RMT-B | 54 | 9.7 | 85.0 |
| CAFormer-M36 | 56 | 13.2 | 85.2 |
| APolyNeXt-L ◆ | 57 | 13.3 | 85.2 |

**Out-of-distribution robustness** (no fine-tuning), compared against activation-based models at matched scale. IN-C is mean corruption error, where lower is better; the rest are top-1 accuracy (%). Best per column within each scale group is shown in bold.

| Model | Clean | IN-C ↓ | IN-A | IN-R | IN-Sketch |
|---|---|---|---|---|---|
| *Small (~25–30M)* | | | | | |
| Swin-T | 81.3 | 62.0 | 21.6 | 41.3 | 29.1 |
| ConvNeXt-T | 82.1 | 53.2 | 24.2 | 47.2 | 33.8 |
| ConvFormer-S18 | 83.0 | 51.7 | 25.3 | 48.7 | 35.2 |
| CAFormer-S18 | 83.6 | 47.4 | 33.5 | 48.7 | 36.6 |
| CPolyNeXt-S ◆ | 83.9 | 47.9 | 35.1 | 49.4 | **37.8** |
| APolyNeXt-S ◆ | **84.3** | **45.0** | **39.6** | **49.7** | 37.5 |
| *Medium (~35–50M)* | | | | | |
| Swin-S | 83.0 | 52.7 | 32.3 | 45.1 | 32.4 |
| ConvNeXt-S | 83.1 | 51.2 | 31.2 | 49.5 | 37.1 |
| MONet-S ◆ | 81.3 | 49.7 | n/a | n/a | n/a |
| ConvFormer-S36 | 84.1 | 47.1 | 33.2 | 50.8 | 38.4 |
| CAFormer-S36 | 84.5 | 44.7 | 40.9 | 51.7 | 39.5 |
| CPolyNeXt-B ◆ | 84.7 | 44.5 | 42.8 | 52.0 | 40.0 |
| APolyNeXt-B ◆ | **84.9** | **42.7** | **46.8** | **52.8** | **41.1** |
| *Large (~50–90M)* | | | | | |
| Swin-B | 83.5 | 54.4 | 35.8 | 46.6 | 32.4 |
| ConvNeXt-B | 83.8 | 46.8 | 36.7 | 51.3 | 38.2 |
| ConvFormer-M36 | 84.5 | 46.5 | 37.6 | 51.0 | 39.2 |
| CAFormer-M36 | **85.2** | 42.6 | 45.6 | 51.7 | 39.6 |
| CPolyNeXt-L ◆ | 84.9 | **42.5** | 48.3 | **54.5** | **41.8** |
| APolyNeXt-L ◆ | **85.2** | 42.9 | **49.2** | 54.0 | **41.8** |

**ADE20K segmentation.** UperNet, 160K iters. Gains exceed the classification margin.

| Model | Params | mIoU (%) |
|---|---|---|
| ConvFormer-S18 | 54M | 48.6 |
| CAFormer-S18 | 54M | 48.9 |
| CPolyNeXt-S ◆ | 54M | 50.6 |
| APolyNeXt-S ◆ | 55M | 49.9 |

**Fully polynomial variants.** LayerNorm is replaced by a polynomial-compatible BatchNorm, so inference uses only additions and multiplications (a step toward FHE-compatible inference).

| Model | Top-1 with LayerNorm (%) | Top-1 with polynomial BatchNorm (%) |
|---|---|---|
| MONet-T ◆ | 77.0 | 77.7 |
| DTTN-S ◆ | 79.4 | 77.2 |
| CPolyNeXt-T ◆ | 80.2 | 78.3 |
| CPolyNeXt-S ◆ | 83.9 | 82.7 |
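One reason a BatchNorm-style layer fits this setting: with frozen running statistics, standard BatchNorm reduces to a per-channel scale and shift at inference, and the square root can be computed once offline. The sketch below shows that reduction for ordinary BatchNorm. Whether the paper's polynomial-compatible variant matches it exactly is an assumption, and the function names are ours.

```python
import math


def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Standard per-channel BatchNorm at inference, using frozen statistics."""
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for xi, g, b, m, v in zip(x, gamma, beta, running_mean, running_var)]


def batchnorm_to_affine(gamma, beta, running_mean, running_var, eps=1e-5):
    """Precompute scale and shift so that BN(x) = scale * x + shift."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, running_var)]
    shift = [b - m * s for b, m, s in zip(beta, running_mean, scale)]
    return scale, shift


def apply_affine(x, scale, shift):
    """One multiplication and one addition per channel."""
    return [s * xi + t for xi, s, t in zip(x, scale, shift)]


x = [0.3, -1.2, 4.0]
gamma, beta = [1.0, 0.5, 2.0], [0.0, 0.1, -0.3]
running_mean, running_var = [0.2, -1.0, 3.5], [1.0, 0.25, 4.0]

scale, shift = batchnorm_to_affine(gamma, beta, running_mean, running_var)
y_bn = batchnorm_inference(x, gamma, beta, running_mean, running_var)
y_affine = apply_affine(x, scale, shift)
```

LayerNorm, by contrast, divides by a statistic of the current input, which is not a fixed polynomial of that input.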

Why do activations hurt? The block $(W_{a} x) * (W_{b} x)$ has a mutual gradient coupling, where each branch learns through its sibling's output. Inserting a GELU breaks that coupling, which is why every way of adding an activation back lowers accuracy, and replacing the product with addition collapses accuracy (−22.3).
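The coupling can be written out for a single block. The symbols $a = W_{a} x$, $b = W_{b} x$, $y = a * b$, the loss $\mathcal{L}$, and the upstream gradient $g = \partial \mathcal{L} / \partial y$ are notation introduced here for illustration:

$$
\frac{\partial \mathcal{L}}{\partial a} = g * b, \qquad
\frac{\partial \mathcal{L}}{\partial b} = g * a, \qquad
\frac{\partial \mathcal{L}}{\partial W_{a}} = (g * b)\, x^{\top}, \qquad
\frac{\partial \mathcal{L}}{\partial W_{b}} = (g * a)\, x^{\top}.
$$

Each branch's update is modulated elementwise by the other branch's output. If the product is replaced by a sum, $y = a + b$, both gradients reduce to $g$: the branches no longer see each other, and the block becomes linear in $x$.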

Citation

BibTeX

If you find this work useful, please cite our paper.

citation.bib

@inproceedings{wang2026polynext,
  title     = {Activation-Free Backbones for Image Recognition: Polynomial
               Alternatives within MetaFormer-Style Vision Models},
  author    = {Wang, Jeffrey and Gregory, Jonathan and Chrysos, Grigorios G.},
  booktitle = {Proceedings of the 43rd International Conference on
               Machine Learning (ICML)},
  year      = {2026}
}

The BibTeX entry will be updated with the official PMLR proceedings reference once published.