>Look inside a SAGAN or something and you'll see the conv2d calls.
...Yes, because SAGANs operate on images, so the foundational operation is a convolution.
>You're reading that in an overly narrow way and imputing to me something I didn't mean.
You characterized the Transformer as "convolutions with attention". You then attributed the success of Transformer-based models to "the non-locality & easy optimization of convolutions". The "SOTA for most (all?) sequence-related tasks" applies to the regular Transformer variants, not the Evolved Transformer, which was published about 5 days ago.
No one is denying that convolutions are useful across many domains. But no one seriously working in the domain of NLP would consider convolutions to be anywhere near the most novel or notable parts of the Transformer.
(In case you do want to look it up, OpenAI's GPT also uses character-level convolutions for its word embeddings. However, BERT does not.)
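(For anyone curious what character-level convolutions for word embeddings actually look like, here's a rough sketch of the general technique in PyTorch. The vocabulary size, dimensions, and kernel widths are made up for illustration; this is not taken from GPT's or any other model's code.)

```python
import torch
import torch.nn as nn

class CharConvEmbedding(nn.Module):
    """Word embedding built by convolving over a word's characters and max-pooling."""
    def __init__(self, n_chars=262, char_dim=16, per_kernel=64, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # One 1-D convolution per kernel width, each scanning the character sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, per_kernel, k) for k in kernel_sizes
        )

    def forward(self, char_ids):
        # char_ids: (batch_of_words, max_word_len) integer character ids
        x = self.char_emb(char_ids).transpose(1, 2)   # (batch, char_dim, word_len)
        # Max-pool each feature map over the character axis, then concatenate:
        # the word vector is the strongest response of each character n-gram filter.
        feats = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)                # (batch, 3 * per_kernel)

emb = CharConvEmbedding()
words = torch.randint(0, 262, (4, 12))   # 4 words, each padded to 12 characters
print(emb(words).shape)                  # torch.Size([4, 192])
```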
Interesting conversation. I would add that papers by LeCun and others have been using character-based convolutions on pure text since 2015 with great success. VDCNN is still a very good way to go for classification, and is much faster to train than RNNs thanks to effective parallelization. (A rough sketch of that kind of character-level conv classifier is below.)
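(Just to make the parallelization point concrete, here's a toy character-level conv classifier in PyTorch. The layer sizes are invented and it's nowhere near the full VDCNN stack of conv blocks, but the structure is the same idea: every position in the document is convolved at once, with no step-by-step recurrence to unroll.)

```python
import torch
import torch.nn as nn

def conv_block(channels):
    # Two stacked kernel-size-3 1-D convolutions, VDCNN-block style.
    return nn.Sequential(
        nn.Conv1d(channels, channels, 3, padding=1), nn.BatchNorm1d(channels), nn.ReLU(),
        nn.Conv1d(channels, channels, 3, padding=1), nn.BatchNorm1d(channels), nn.ReLU(),
    )

class TinyCharCNN(nn.Module):
    def __init__(self, n_chars=70, n_classes=4, dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, 16)
        self.stem = nn.Conv1d(16, dim, 3, padding=1)
        # The whole character sequence is processed in one parallel pass:
        # convolve, downsample, convolve, then pool to a fixed-size vector.
        self.blocks = nn.Sequential(conv_block(dim), nn.MaxPool1d(2),
                                    conv_block(dim), nn.AdaptiveMaxPool1d(1))
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, char_ids):                   # char_ids: (batch, seq_len)
        x = self.emb(char_ids).transpose(1, 2)     # (batch, 16, seq_len)
        x = self.blocks(self.stem(x)).squeeze(2)   # (batch, dim)
        return self.fc(x)

model = TinyCharCNN()
logits = model(torch.randint(0, 70, (8, 256)))  # 8 documents of 256 characters each
print(logits.shape)                             # torch.Size([8, 4])
```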
On a side note, it's sad to see these conversations about SOTA deep learning be so adversarial... a "you're wrong / you're right" kind of thing. It's mostly an empirical science at the moment: surf the gradient, be right and wrong at the same time!
And convolution-based models still find use in all sorts of cool applications in language, such as: https://arxiv.org/abs/1805.04833
With regard to adversarial discussions: it's one thing to argue about whether method A or method B gives better results in a largely empirical and experimental field. But giving a very misleading characterization of a model is actively detrimental, especially when it would give casual readers the impression that the Transformer is a "convolution-based" model, a description no one in the field would use.