# Constructing the Scattering Transform

Spain 10.08.2022

### Constructing the Scattering Transform

Author: Javier Minguillón Sánchez

#### 1 Introduction

As Mallat presents it in [8], the scattering transform appears from the pursuit of a representation Sf of an L^2(\R^d) , d\in \N , function f such that the representations of two similar functions are close in the representation space, say \mathcal H .

What do we mean by similar? Well, one relevant sense is comparing images, such as images of handwritten digits. We want the structure of \mathcal H to bring together the handwritten representations every single digit 0 through 9, while simultaneously keeping different digits far from each other.
Choosing \mathcal H = L^2(\R^2) with the usual L^2 norm cannot possibly serve this purpose.

Take a look at Figure 1. The first and second columns represent the digit 1 as a positive piecewise constant function on a compact subset of \R^2 . In both rows, the third column represents the absolute value of the difference of the first two functions in that row. If the L^2 norm of the top-left function is C , the L^2 norms of the top-right and bottom-right functions are 2C and 0 respectively.

Figure 1. Digit 1 from MNIST [6]. These are grayscale images: the value 0 corresponds to black pixels and the value 255 , to white pixels. There are shades of gray in between.

Here we focus on functions f :\R^2\to \R because they are a way of modelling (grayscale) pictures. All the below discussion has a much deeper general side to it that can be seen in [8]. There the scattering transform operator is developed for functions on \R^d and an infinite number of parametrers in the scattering transform. This is far from our scope.

The scattering transform here is going to be an operator S : L^2(\R^2) \to \mathcal H . Here \mathcal H is the space of representations of images. Regarding classification purposes, the idea is that two images f_1,f_2:\R^2\to \R that belong to the same category have to be sent to Sf_1, Sf_2 : two representations that are closer together in the metric space (\mathcal H, \|\cdot \|_{\mathcal H}) whenever the categories of f_1, f_2 coincide and comparatively further apart whenever the categories of f_1,f_2 differ.

Ideally, the scattering transform would be translation invariant, rotation invariant,
continuous to deformations and would keep high frequency information in order to be able to discriminate L^2 functions associated to signals. Examples of such signals are soundwaves in the case of L^2(\R) or images in the case of L^2(\R^2).

We do not discuss any theoretical aspects of the rotation invariant representation that Mallat finds in [8], as they are rather sophisticated and beyond the purpose of our exposition. In practice the rotation invariant operator is not used
(see [1]). If necessary, the dataset is augmentated with rotated copies of the original elements.

#### 2 A Few Notations and Definitions

We need to introduce a few notations first, followed by two definitions. From the perspective of image classification Definition 1 is a way to model deformations of an image, using diffeomorphisms. Definition 2 says when a criterion of classification (an operator) is stable in spite of deformations within images.

Let d\in \N and f: \R^d\to \mathbb C . The space of p-integrable complex valued functions is L^p(\R^d), 1\leq p \leq \infty . The norm
\|f\|_p denotes the usual p -norm of L^p(\R^d), 1\leq p \leq \infty .
Given x\in \R^d , we denote its euclidean norm by |\sharp1|_2{x}^2 = x_1^2+\cdots +x_d^2 . The usual scalar product of x,y\in \R^d is x\cdot y .
The norm of a linear map T:V \to W , where V,W are normed vector spaces is
\|T\|_{V\to W} = \sup_{x\neq 0} \frac{\|Tx\|_W}{\|x\|_V} . The space of linear maps from V to W is denoted by L(V,W).

The \bf Fourier transform of f\in L^1(\R^d)\cap L^2(\R^d) is

\widehat f(\xi) = \int_{\R^d} f(x) e^{2\pi i x\cdot \xi} dx

and the definition extends in the usual way to all L^2(\R^d) by densirty. The \bf spectral energy of f \in L^2(\R^d) on \Omega \subset \R^2 is given by

\int_{\Omega} |\hat f(\xi)|^2d\xi.

Let \tau: \R^2 \to \R^2 be a \mathcal C^2 function. We denote the supremum of the norms of the differential linear maps D\tau(x), x\in\R^2 by

\| D \tau \|_\infty = \sup_{x\in \R^d} \| D \tau (x)\|_{\R^2\to\R^2}.  (1)

Likewise, the supremum of the norms of the Hessian tensors is denoted by

\| H\tau \|_\infty = \sup_{x\in \R^d}\| H\tau(x) \|_{\R^2\to L(\R^2,\R^2)},  (2)

and the supremum of the variation of the function is

\| \Delta \tau \|_\infty = \sup _{(x,y)\in \R^d\times \R^d} |\tau(x)-\tau(y)|_2 .  (3)

We now proceed to define the action of a diffeomorphism on a function and the Lipschitz continuity to \mathcal C^2 diffeomorphisms.

###### Definition 1

Let \tau:\R^2\to \R^2 . The action of \tau on a function f:\R^2 \to \mathbb C is given by

L_{\tau} f(x) = f (x- \tau(x)) .  (4)

Whenever \tau \equiv c is a constant function, the action is equivalent to the composition with a mere translation. In this case, we denote L_\tau as L_c.

We also call L_\tau f(x) the deformation of f by the field \tau .
This is our way of modelling the tiny variations on an image. If we think of handwritten digits, for example, the digit ‘1’ takes many shapes that are continuous transformations of one another. See for example Figure 2.

Figure 2. A sample of a ‘1’ from MNIST and a deformation of it by a continuous field.

###### Definition 2

Let d\in \N and let (\mathcal H, \|\cdot\|_\mathcal{H}) be a normed space. Let \Phi : L^2 (\R^2) \to \mathcal H be an operator
and \tau: \R^2 \to \R^2 be a \mathcal C^2 -diffeomorphism. We say the operator is Lipschitz continuous to (the action of) \tau , if

\| \Phi f - \Phi (L_\tau f ) \|_\mathcal{H} \leq C \|f\|_2 \left(\| D \tau(x) \|_\infty + \| H \tau (x) \|_\infty \right)  (5)

for some C>0 not depending on \tau of f.

Take an operator \Phi between function spaces. Given c\in \R^2 , we say the operator is translation covariant if

\Phi(L_cf) = L_c(\Phi f),  (6)

and we say it is translation invariant if

\Phi(L_cf) = \Phi f.  (7)

Let us make another comment on Definition 2. One possible choice of \mathcal H is a hilbert space of functions from \R^2 to \R^d , d\in \N, where the inner product of f,g:\R^2 \to \R^d is given by

\left\langle f,g \right\rangle_\mathcal{H} = \int_{\R^2} f(x) \cdot \overline{g(x)} dx = \int_{\R^2} \left(f_1(x) \overline{g_1(x)} +\cdots + f_d(x)\overline{g_d(x)}\right) dx,  (8)

while the associated norm is

\|f\|_{\mathcal{H}} = \left(\int_{\R^2} |f(x)|_2^2 dx\right)^{\frac12} = \||\sharp 1|_2 f\|_2.  (9)

#### 3 Wavelets and Low-pass Filters

First of all, we need to define the filters that are used in the transformation. These are the basic components that we convolve with an f\in L^2(\R^2). We define these filters first and then we give some relevant examples and explain what their convolution with f produces.

###### Definition 3

A function \psi \in L^2(\R^2) \cap L^1(\R^2) is a wavelet if

\widehat\psi(0) = \int_{\R^2} \psi(x) dx =0 .

The above is also known a 2D-wavelet in the literature (see [2]).

###### Definition 4

Let \phi : \R^2 \to \mathbb C we say it is a low-pass filter if

\int_{\R^2} | \phi(x) | dx = 1.  (10)

Let \psi, \phi: \R^2 \to \R be a wavelet and a low-pass filter respectively. Fix a positive integer K and, for each k = 0 , .... , K-1 , denote by \gamma_k: \R^2\to \R^2 , the counterclockwise rotation of the plane by an angle of \frac k{2\pi K} .

Fix J\in \Z. For j\in \Z , j\leq J and \gamma \in \Gamma_K , we define the K -directional child wavelets of \psi as

\psi_{j,\gamma} (x) = 2^{-2j} \psi(2^{-j}\gamma x).  (11)

and the low-pass filter for frequencies below 2^{-J} is

\phi_J(x) = 2^{-2J} \phi(2^{-J}x).  (12)

Observe that the respective Fourier transforms are

\widehat{\psi_{j,\gamma}}(\xi) = \widehat \psi (2^j\gamma \xi),  (13)

and

\widehat{\phi_J }(\xi) = \widehat \phi (2^J \xi).  (14)

Suppose for a moment that supp \widehat\phi_J = B(0,1) , the unit ball of \R^2. Since \widehat{f*\phi_J} = \widehat f \; \widehat\phi_J , we see from the above that \phi_J removes
high frequencies: larger than 2^{-J}. This justifies the name of the low-pass filter.

Common choices for \psi are rescalings of the following functions:
• A complex Morlet wavelet (See Figures 3, 4) with parameter \xi_0\in \R^2 ,

C_2 ( e^{2\pi i\xi_0\cdot x} -C_1 )e^{-|x|^2},  (15)

with C_1 such that the integral of the above is equal to 0 and C_2 such that the squared modulus of the above has integral equal to 1. Both constants depend on \xi_0. • A partial derivative wavelet, which is a partial derivative of the Gaussian function

\frac{\partial}{\partial x_1} e^{-|x|^2}.  (16)

A common choice for \phi is a rescaling of the Gaussian function (see Figure 5)

C_3 e^{-|x|^2},  (17)

with
C_3 = \frac{1}{2\pi} chosen for \phi to have integral equal to 1. It is a well known fact that the function e^{\pi |x|^2} is its own Fourier transform. Therefore, the above choice of \phi gives us \widehat\phi(\xi) = e^{-\pi^2 |\xi|^2} . Thus, the filter \phi_J reduces to almost 0 all frequencies above some multiple of 2^{-J}.

The purpose of \psi_{j,\gamma} , j\leq J , \gamma \in \Gamma_K ; is to be convolved with a function f:\R^2 \to \R that represents an image (in black and white). These filters extract certain features from the image. For example, the position and orientation of edges between light and dark parts of an image.

Figure 3. Real part of the Morlet wavelet from (15) with \xi_0 = (1,0) .

Figure 4. Imaginary part of the Morlet wavelet from (15) with \xi_0 = (1,0) .

Figure 5. The Gaussian function from (17)

#### 4 Wavelet transform

Let us define the wavelet transform, which is a key component of the scattering trasnform.

###### Definition 5

Let f\in L^2(\R^2) and consider a wavelet \psi\in L^1(\R^d)\cap L^2(\R^d) . Let j\in \Z and K\in \N. Fix a rotation \gamma \in \Gamma_K. The continuous wavelet transform or simply wavelet trasnform of scale 2^{-j} and rotation \gamma \in \Gamma_K is given by

W_{j,\gamma} f(x) = f * \psi_{j,\gamma} (x), \quad x \in \R^2.  (18)

The Wavelet transform W_J is defined as a sequence of the above

W_J f = \{ f*\phi_J\} \cup \{ f * \psi_{j,\gamma} \}_{j\leq J, \gamma \in \Gamma_K}.  (19)

There are many variants of the wavelet transform of 2D-wavelets.
The version of wavelet transform that we use here is known as Littlewood Payley wavelet transform (see [8]). In general (see [2]), the wavelet transform can be defined depending on a scale parameter a>0 ,

a displacement parameter b \in \R^d

and an angle of rotation \theta\in[0,2\pi) .

In the above definition we consider several scales a = 2^{-j}, j\leq J; we take b= 0 ; and we pick \theta in the set of angles corresponding to the rotations in \Gamma_K .

The wavelet transform operator also has the following \mathbf{boundedness} property for a certain choice of the wavelets we introduced above (see [8] or [5]),

\left\| W_Jf \right\|_\mathcal{H} \leq \| f \|_2,  (20)

for all f\in L^2(\R^d), where the space of sequences where W_J takes values is \mathcal H , and the norm of this space is

\|W_Jf\|_\mathcal{H} = \|f*\phi_J\|_2 + \sum_{j\leq J, \gamma\in \Gamma_K} \|f*\psi_{j,\gamma}\|_2.  (21)

Since W_J is linear, the above also means that it is a weakly contractive operator from L^2(\R^2) to
\mathcal H

#### 5 The Fourier Domain Perspective

There is a simplified interpretation of how W_J acts on a function f. Notice that the Fourier transform of the convolutions in the wavelet transform gives

\widehat{f*\phi_J}(\xi) = \widehat f (\xi) \widehat{\phi}(2^J\xi),  (22)

and

\widehat{f * \psi_{j,\gamma}} (x) = \widehat f (\xi) \widehat {\psi} ( 2^j \gamma \xi).  (23)

The wavelet trasform extracts certain information out of the Fourier transform of f. Look at the above equation. If we think of \widehat{ \psi_{j,\gamma} } as compactly supported on a ball not centered at the origin, the scale and the angle determine which part of the Fourier domain of f is preserved by the transform with those parameters. See Figure 6.

The reader may be wondering why it makes any sense to think of the filters as compactly supported. The low-pass filter is simpler. If we choose \phi as a Gaussian, then it is certainly not compactly supported. Nonetheless, the vast majority of its mass is concentrated on a ball centered at the origin. And, outside of this ball, all frequencies are reduced to almost zero.

To understand the rest of the filters, consider a simpler version of Morlet wavelets, called Gabor functions

\psi (x) = C_2 e^{2\pi i\xi_0\cdot x}e^{-|x|^2} = C_2 e^{2\pi i\xi_0\cdot x}\phi(x),  (24)

then,

\widehat \psi (x) = \widehat \phi(x-\xi_0),  (25)

a translated Gaussian. This explains the interpretation on Figure 6.

In practice, if we consider Morlet wavelets from (15), we have that the larger \xi_0 is (or the smaller j is), the closer to 0 the constant C_1 is. The above interpretation using equation (24) is an approximate one.

Figure 6. The filters on the frequency domain. In blue, the low-pass filter \phi_J with J=2 . In green the filters \psi_{0,\gamma}, \gamma \in \Gamma_8 In red, the filters \psi_{1,\gamma} , \gamma \in \Gamma_8. }

#### 6 Construction of the Scattering Transform

Let f \in L^2(\R^2) . This function represents an image in black and white. Eventually, it needs to be changed to a function with compact (and discrete) support. In this section we see how the Scattering Transform operator is constructed and the aspect of a transformed function has.

We need some auxiliar transformations first. The first transformation S_0f is given by a simple averaging of each value of f with the surrounding values: the convolution

S_0 f (x) = f * \phi_J (x) .  (26)

Here \phi_J preserves the frequencies |\xi|\leq 2^{-J} and removes the rest. See Figure 7 for an example.

Figure 7. On the left, an example of f _h representing a grayscale 320\times 240 image of a human hand. Extracted from [9]. On the right, the S_0f_h with J=2 and a Gaussian filter.

After that, we analyze higher freqencies by computing wavelet coefficients. We fix
0 \leq j \lt J and find

f* \psi_{j,\gamma} (x) , \quad \gamma \in \Gamma_K.  (27)

Let us make a comment on the bounds of the parameter j. The bound j \lt J is part of the definition, but the choice of 0\leq j is related to the image processing perspective. Any scale lower than 2^0 =1 on a discrite function (an image) would not yield new information (it would be a scale smaller than a pixel).

The wavelet transforms in (27) commute with translations, in the sense that, given c \in \R^2 ,

L_c(W_{j,\gamma} f) = L_c (f * \psi_{j,\gamma}) = f*\psi_{j,\gamma}(x-c) = (L_cf) * \psi_{j,\gamma} = W_{j,\gamma} (L_cf).  (28)

This property of translation covariance comes solely from convolution. But this property also implies that the elements in (27) are not translation invariant. In other words, the relation f * \psi_{j,\gamma} = (L_cf) * \psi_{j,\gamma}, does not hold in general for arbitrary f and c .

We want a translation invariant representation of f, but W_{j,\gamma}f is not, as (28) makes clear. We can obtain a translation invariant operator

\overline {W_{j,\gamma}} f = \int_{\R^2} W_{j,\gamma} f(x) dx,  (29)

but it is equal to 0. Indeed, applying Fubini, it is

\overline {W_{j,\gamma}} f = \int_{\R^2} \left(\left(\int_{\R^2}f(y) \psi_{j,\gamma}(y-x)dy\right)\right)dx  (30)

= \int_{\R^2}\left(f(y) \int_{\R^2}\psi_{j,\gamma}(y-x)dx\right)dy,  (31)

and the inner integral is 0 for all y\in \R^2 because of our definition of wavelet.

We can obtain a translation invariant operator from the integral by composing a non-linear \rho:\R\to\R function with the integrand as

\overline {U_{j,\gamma}} f = \int_{\R^2} \rho (W_{j,\gamma} f(x)) dx.  (32)

There are several possible non-linearities. But the chosen one is \rho = |\cdot|, the complex modulus.
There are two reasons for this.
Firstly, we want \|\rho\circ f\|_2 = \| L_\tau (\rho\circ f)\|_2 . Secondly, we want contractiveness, that is, \|\rho\circ f_1-\rho\circ f_2\|_2 \leq \| f_1-f_2 \|_2. These properties are key to preserve the properties of boundedness and contractiveness that the Wavelet Transform already has. One need only look at (20) and (21) to see that a composition with the modulus preserve the aforementioned properties.

Setting \rho (\cdot) = |\cdot|, we have a quantity that is invariant to translations of f given by the integral

\overline {U_{j,\gamma}} f = \int_{\R^2} |f*\psi_{j,\gamma} (x)|dx,  (33)

for each j\leq J and \gamma \in \Gamma_K. This is not satisfactory yet. Yes, the above positive quantities in (33) are translation invariant with respect to f ; and they are also stable to deformations because L_\tau|f| = |L_\tau f| and the integrand is stable to deformations [8]. However, the averaging over all \R^2 loses too much information [8]. We do not settle for \overline {U_{j,\gamma}} .

Suppose that, instead of taking an average over \R^d , we take averages over local sets (for the sake of fixing ideas, think about balls of radius proportional to 2^J ). We would end up with functions

|f*\psi_{j,\gamma}|*\phi_J (x) = \int_{\R^2} |f*\psi_{j,\gamma} (y)| \phi_J (x-y) dy.  (34)

The above functions happen to be Lipschitz continuous to deformations (see Theorem 6).
They also turn out to be locally translation invariant (also thanks to Theorem 6) for a translation of a scale below 2^J \| D \tau\|_\infty.

The above integrals constitute the second auxiliary transform S_1 : \R^2\to \R^{KJ} given by

S_1f(x) = \left(|f*\psi_{j_1,\gamma_1}|*\phi_J (x) \right)_{\substack{0\leq j_1 \lt J\\ \gamma_1\in\Gamma_K}}.  (35)

See Figure 8 for an example.

Figure 8. Components of S_1f_h for J=2 and scale j_1 = 0 , with rotations \gamma_1 of angles 0, \frac\pi4,\frac\pi2,\frac{3\pi}{2} respectively. The low-pass filter is a gaussian and the rest are Morlet wavelets. This output is normalized: white pixels respresent the largest value on any single image and black pixels represent the value 0. There are 255 shades of gray in total.

Figure 9. The third image in Figure 8 The component |f_h*\psi_{j_1,\gamma_1}|*\phi_J (x) , with J=2, j_1 = 0 , \gamma_1 of angle \pi/2. Here 2^{j_1}=1 is small compared to the support of f_h . This makes the Morlet wavelet detect oriented edges according to how the rotation affects \xi_0 in (15). For this particular \gamma_1 , the value of horizontal edges is visibly the largest.

Now, \phi_J is a low-pass filter and we are removing high frequencies of euclidean norm larger than 2^{-J} (and smaller than 2^{-j_1} ) from | f* \psi_{j_1,\gamma_1} | . In order to take into account some of those frequencies, we convolve with another wavelet: we perform the wavelet transform of | f* \psi_{j_1,\gamma_1} |.

We obtain, for some j_2 such that j_1 \lt j_2 \lt J ,

|f*\psi_{j_1, \gamma_1}|*\psi_{j_2,\gamma_2} (x), \quad \gamma_1,\gamma_2 \in \Gamma_K .  (36)

There is a reason for the choice of the parameter j_2 as j_2 < j_1. [/katex] According to [4] and [8], the spectral energy of [katex] |f*\psi_{j_1, \gamma_1}|*\psi_{j_2,\gamma_2} [/katex] over [katex] \R^2 [/katex] is negligible for [katex] j_1\leq j_2 [/katex]. As before, we apply the non-linearity and filter the high frequencies. The <strong>third auxiliary transform</strong> is, for some [katex] 0 \leq j_2 \lt j_1 \lt J, an S_2:\R^2 \to \R^{K^2} given by

S_2f(x) = \left( \left| |f*\psi_{j_1, \gamma_1}| * \psi_{j_2,\gamma_2} \right|* \phi_J(x) \right)_{\substack{0\leq j_2 \lt j_1 \lt J \\ \gamma_1,\gamma_2\in \Gamma_K}}.  (37)

See Figure 10 for an example.

Figure 10. Output of S_2 f for J=2 , j_1 =0 , j_2 =1 , \gamma_1 of angle \frac\pi2 and, from left to right, \gamma_2 of angles 0, \frac\pi4,\frac\pi2,\frac{3\pi}{2} . The output is normalized: white is the maximum value and black is 0.

Let m\in \N . The general \mathbf{(m+1} -th auxiliar transform is an iteration of the above given by

S_mf(x) = \left( \left| \cdots \left| |f*\psi_{j_1, \gamma_1}| * \psi_{j_2,\gamma_2} \right| \cdots *\psi_{j_m,\gamma_m}\right|* \phi_J(x) \right)_{\substack{0 \leq j_m \lt ...\lt j_1 \lt J\\ \gamma_1,...\gamma_m\in \Gamma_K}}  (38)

Each choice of parameters (j_i,\gamma_i) , i\in{1,...,m} , determines a path, that is, an ordered sequence of pairs p = \left( (j_1,\gamma_1),..., (j_m,\gamma_m ) \right) , where 0\leq j_m \lt j_{m-1} \lt \cdots \lt j_1 . The number of pairs is the <strong>length</strong> of the path p.

We finally arrive at the scattering transform of f with path lengths lesser or equal than m . It is Sf:\R^2\to\R^M given by

Sf(x) = (S_0f(x), S_1f(x),...,S_mf(x)),  (39)

where the dimension of the space M= \sum_{q = 0 }^{m}K^q\binom{J}{q} comes from the choice of parameters at each of the auxiliar transforms. Those are all the possible choices of scales in m steps 0\leq j_1 \lt j_2 \lt \cdots j_m \lt J with K rotations at each step.

For most practical classification purposes the parameter m from (39) can be set to m=2 and it is enough. In fact, the package Kymatio [1] only implements scattering of images up to m = 2 for this reason, which is explained in [4]. Parameters J and K are less straightforward to choose. Theorem 6 is helpful for guessing a good choice of J . Regarding the parameter K , it only needs to be somewhat high. A common choice would be K = 16. Of course, instead of guessing, one can perform cross validation for some values of the parameters to decide as it is done in [5].

#### 7 Lipschitz Continuity to Deformations

The following result tells us that the scattering transform as we have defined it is both Lipschitz continuous to \mathcal C^2 diffeomorphisms and translation invariant whenever the maximum displacement \|\tau\|_\infty is small with respect to the scale 2^J and the deformation \| D \tau\|_\infty . The following is a consequence of Corollary 1 in [8], a simpler version of it that is enough for our purposes.

###### Theorem 6

There exists C>0 such that, given f\in L^2(\R^2) with compact support, for all \mathcal C^2 -diffeomorphism \tau with \| D \tau\|_\infty \leq \frac12 and 2^J \geq \frac{\|\tau\|_\infty}{\| D \tau\|_\infty} , it holds that

\left\| |S (L_\tau f) - Sf|_2 \right\|_2 \leq C m \| f\|_2 \left( \| D \tau \|_\infty +\| H\tau \|_\infty \right).  (40)

#### 8 Quick Comparison with a Convolutional Neural Network

For those readers familiar with the basic structure of a Convolutional Neural Network (abbreviated CNN), let us take another look at (39). We can think of f as an image, and \psi_{j_1, \gamma_1}, \cdots, \psi_{j_m,\gamma_m} , for j_1 \lt \cdots \lt j_m, \gamma_1,...,\gamma_n \in \Gamma_K ; as the filters. The filters remain unchanged from beggining to end here: in contrast to the filters in CNNs, that do change with training.

The complex modulus \rho = |\cdot | performs a pooling. It puts the real and imaginary parts together. This gives us |f*\psi_{j_1, \gamma_1}|, j_1 \lt J, \gamma_1\in \Gamma_K , which is the output of the first layer, and the input of the second layer. In the second layer, we do another filtering with the \psi_{j_2,\gamma_2} , j_2 \lt J, \gamma \in \Gamma_K and yet another pooling with the complex modulus. We repeat this process until the final m ^{\text{th}} layer.

The low-pass filter \phi_J produces another pooling. It is an average pooling over regions of size (proportional to) 2^J. We apply \phi_J to the output of the network. But there is a caveat. Unlike in CNNs, whose output is the output of the final layer, here we apply the \phi_J pooling to the output of each layer and the input f itself. This gives us the S_0f, S_1f, ..., S_m f , which constitute the final output Sf .

#### 9 Plotting the Scattering transform

In order to have a broader intuition on how the output of the scattering transform looks, we include a folder (.zip) with a Python script (my\_scattering.py). The script generates two files (scattering\_slow.gif, scattering\_fast.gif) that contain the whole output that the above Figures 8, 9 and 10 represent only in part. It also generates an image (scattering.png)

There are six relevant variables with assigned values that we encourage the reader to vary. The following lines of code are among the first 40 lines of the script (.py) and they contain the definitions of the relevant variables. We encourage the reader to tinker with the values and compare the resulting outputs.
scale = 3
rotations = 8
resize = True
enhance_each = False
enhance_global = True

In fact, the reader can use a grayscale image of their choice istead of the example image of a digit 1 from MNIST [6]]. The only requirement is to put a grayscale image file on the same folder as the script (.py) and change the variable
filename = 'one.png'
to match the name of the file and the extension that correspond.

#### References

[1] Mathieu Andreux, Tomás Angles, Georgios Exarchakis, Roberto Leonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka, Stéphane Mallat, Joakim Andén, Eugene Belilovsky, Joan Bruna, Vincent Lostanlen, Matthew J. Hirn, Edouard Oyallon, Sixin Zhang, Carmine Cella, and Michael Eickenberg. Kymatio: Scattering Transforms in Python. 2019. http://arxiv.org/pdf/1812.11214v2
[2] Jean P. Antoine and Romain Murenzi. Two-dimensional directional wavelets and the scale-angle representation. Signal Processing, 52(3):259281, 1996. https://www.sciencedirect.com/science/article/pii/0165168496000655
[3] Alexandre Bernardino and José Santos-Victor. A real-time gabor primal sketch for visual attention. In David Hutchison et al. , Pattern Recognition and Image Analysis, volume 3522 of Lecture Notes in Computer Science, 335342. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005. https://www.researchgate.net/publication/221258995_A_Real-Time_Gabor_Primal_Sketch_for_Visual_Attention
[4] Joan Bruna and Stéphane Mallat. Classication with scattering operators. 2013. https://arxiv.org/pdf/1011.3023
[5] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. 2012. https://arxiv.org/pdf/1203.1513
[6] Li Deng. The MNIST Database of Handwritten Digit Images for Machine Learning Research IEEE Signal Processing Magazine, 29(6):141142, 2012. https://ieeexplore.ieee.org/document/6296535
[7] David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steen, Madhu Sudan, Demetri Terzopoulos, Dough Tygar, Moshe Y. Vardi, Gerhard Weikum, Jorge S. Marques, Nicolás La Pérez de Blanca, and Pedro Pina, editors. Pattern Recognition and Image Analysis, volume 3522 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005. https://link.springer.com/book/10.1007/b136825
[8] Stéphane Mallat. Group Invariant Scattering. 2012. https://arxiv.org/pdf/1101.2286
[9] Shyambhu Mukherjee. Dataset: Hands and Palm Images Dataset, Version 2, March 2021. https://www.kaggle.com/shyambhu/hands-and-palm-images-dataset/metadata