A. Alcalde Zafra, G. Fantuzzi, E. Zuazua (2025). Exact Sequence Classification with Hardmax Transformers
Abstract. We prove that hardmax attention transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d \geq 2$. Specifically, given $N$ sequences with an arbitrary but finite length in $\mathbb{R}^d$, we construct a transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters perfectly classifying this dataset. Our construction achieves the best complexity estimate to date, independent of the length of the sequences, by innovatively alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices within the attention mechanism, a common practice in real-life transformer implementations. Consequently, our analysis holds twofold significance: it substantially advances the mathematical theory of transformers and it rigorously justifies their exceptional real-world performance in sequence classification tasks.
arXiv: 2502.02270
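The abstract's key ingredient is hardmax self-attention with low-rank parameter matrices: each token attends only to the tokens maximizing a bilinear score, rather than to a softmax-weighted average. Below is a minimal NumPy sketch of one such hardmax attention update, assuming the averaging-over-maximizers convention used in the authors' earlier hardmax work; the names `hardmax_attention`, `U`, `V`, and `alpha` are illustrative, not the paper's notation.

```python
import numpy as np

def hardmax_attention(X, U, V, alpha=1.0):
    """One hardmax self-attention update on a sequence X of shape (n, d).

    U and V have shape (r, d) with r < d, so the attention matrix
    A = U.T @ V has rank at most r, mirroring the low-rank
    parameterization highlighted in the abstract.
    """
    A = U.T @ V                 # low-rank attention matrix, shape (d, d)
    scores = X @ A @ X.T        # scores[i, j] = x_i^T A x_j
    out = np.empty_like(X)
    for i in range(X.shape[0]):
        row = scores[i]
        # Hardmax: attend only to tokens achieving the maximal score,
        # averaging uniformly over ties.
        leaders = np.isclose(row, row.max())
        out[i] = X[i] + alpha * X[leaders].mean(axis=0)
    return out

# Usage: six tokens in R^4 with a rank-2 attention matrix.
rng = np.random.default_rng(0)
n, d, r = 6, 4, 2
X = rng.normal(size=(n, d))
U, V = rng.normal(size=(r, d)), rng.normal(size=(r, d))
Y = hardmax_attention(X, U, V, alpha=0.5)
```

Because each token moves toward its score-maximizing tokens, repeated application of such layers tends to collapse tokens into clusters, which is the "clustering effect" the construction exploits between feed-forward layers.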