Constructing the Scattering Transform

Spain 10.08.2022

Author: Javier Minguillón Sánchez

 

1 Introduction

As Mallat presents it in [8], the scattering transform appears from the pursuit of a representation Sf Sf of an L2(Rd) L^2(\R^d) , dN d\in \N , function f f such that the representations of two similar functions are close in the representation space, say H \mathcal H .

What do we mean by similar? Well, one relevant sense is comparing images, such as images of handwritten digits. We want the structure of \mathcal H to bring together the representations of all handwritten versions of each digit 0 through 9, while simultaneously keeping different digits far from each other.
Choosing H=L2(R2) \mathcal H = L^2(\R^2) with the usual L2 L^2 norm cannot possibly serve this purpose.

Take a look at Figure 1. The first and second columns represent the digit 1 as a positive piecewise constant function on a compact subset of \R^2. In both rows, the third column represents the absolute value of the difference of the first two functions in that row. If the L^2 norm of the top-left function is C, the L^2 norms of the top-right and bottom-right functions are approximately \sqrt{2}\,C (assuming the two shifted digits do not overlap) and 0 respectively.


Figure 1. Digit 1 from MNIST [6]. These are grayscale images: the value 0 0 corresponds to black pixels and the value 255 255 , to white pixels. There are shades of gray in between.
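To make the comparison concrete, here is a minimal numpy sketch of the same effect; the 28x28 array standing in for a handwritten '1' and the 6-pixel shift are hypothetical stand-ins, not the actual images of Figure 1.

# A small shift of a digit produces an L2 difference comparable to the digit
# itself, while an identical copy gives distance 0 (the situation of Figure 1).
import numpy as np

f = np.zeros((28, 28))
f[4:24, 13:15] = 255.0            # a crude vertical stroke standing in for a '1'

g_shift = np.roll(f, 6, axis=1)   # the same stroke shifted 6 pixels to the right
g_copy  = f.copy()                # an identical copy

def l2(h):
    return np.sqrt(np.sum(h.astype(float) ** 2))

C = l2(f)
print(l2(np.abs(f - g_shift)) / C)   # ~ sqrt(2): the two strokes do not overlap
print(l2(np.abs(f - g_copy)))        # exactly 0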

Here we focus on functions f:\R^2\to \R because they are a way of modelling (grayscale) pictures. The discussion below has a much deeper general side to it that can be seen in [8], where the scattering transform operator is developed for functions on \R^d and for an infinite number of parameters in the scattering transform. This is far from our scope.

The scattering transform here is going to be an operator S : L^2(\R^2) \to \mathcal H, where \mathcal H is the space of representations of images. Regarding classification, the idea is that two images f_1,f_2:\R^2\to \R are sent to representations Sf_1, Sf_2 that are closer together in the metric space (\mathcal H, \|\cdot \|_{\mathcal H}) whenever the categories of f_1, f_2 coincide, and comparatively further apart whenever their categories differ.

Ideally, the scattering transform would be translation invariant, rotation invariant,
continuous to deformations and would keep high frequency information in order to be able to discriminate L2 L^2 functions associated to signals. Examples of such signals are soundwaves in the case of L2(R) L^2(\R) or images in the case of L2(R2). L^2(\R^2).

We do not discuss any theoretical aspects of the rotation invariant representation that Mallat finds in [8], as they are rather sophisticated and beyond the purpose of our exposition. In practice the rotation invariant operator is not used
(see [1]). If necessary, the dataset is augmented with rotated copies of the original elements.

 

2 A Few Notations and Definitions

We need to introduce a few notations first, followed by two definitions. From the perspective of image classification Definition 1 is a way to model deformations of an image, using diffeomorphisms. Definition 2 says when a criterion of classification (an operator) is stable in spite of deformations within images.

Let dN d\in \N and f:RdC f: \R^d\to \mathbb C . The space of p p-integrable complex valued functions is Lp(Rd),1p L^p(\R^d), 1\leq p \leq \infty . The norm
fp \|f\|_p denotes the usual p p -norm of Lp(Rd),1p L^p(\R^d), 1\leq p \leq \infty .
Given x\in \R^d, we denote its euclidean norm by |x|_2, so that |x|_2^2 = x_1^2+\cdots +x_d^2. The usual scalar product of x,y\in \R^d is x\cdot y.
The norm of a linear map T:VW T:V \to W , where V,W V,W are normed vector spaces is
TVW=supx0TxWxV \|T\|_{V\to W} = \sup_{x\neq 0} \frac{\|Tx\|_W}{\|x\|_V} . The space of linear maps from V V to W W is denoted by L(V,W). L(V,W).

The Fourier transform of f\in L^1(\R^d)\cap L^2(\R^d) is

\widehat f(\xi) = \int_{\R^d} f(x) e^{-2\pi i x\cdot \xi} dx

and the definition extends in the usual way to all of L^2(\R^d) by density. The spectral energy of f \in L^2(\R^2) on a set \Omega \subset \R^2 is given by

Ωf^(ξ)2dξ. \int_{\Omega} |\hat f(\xi)|^2d\xi.

Let τ:R2R2 \tau: \R^2 \to \R^2 be a C2 \mathcal C^2 function. We denote the supremum of the norms of the differential linear maps Dτ(x),xR2 D\tau(x), x\in\R^2 by
 

\| D \tau \|_\infty = \sup_{x\in \R^2} \| D \tau (x)\|_{\R^2\to\R^2}.  (1)

Likewise, the supremum of the norms of the Hessian tensors is denoted by

\| H\tau \|_\infty = \sup_{x\in \R^2}\| H\tau(x) \|_{\R^2\to L(\R^2,\R^2)},  (2)

and the supremum of the variation of the function is

\| \Delta \tau \|_\infty = \sup _{(x,y)\in \R^2\times \R^2} |\tau(x)-\tau(y)|_2 .  (3)

We now proceed to define the action of a diffeomorphism on a function and the Lipschitz continuity to C2 \mathcal C^2 diffeomorphisms.

 

Definition 1

Let τ:R2R2 \tau:\R^2\to \R^2 . The action of τ \tau on a function f:R2C f:\R^2 \to \mathbb C is given by
 

Lτf(x)=f(xτ(x)).(4) L_{\tau} f(x) = f (x- \tau(x)) .  (4)

Whenever τc \tau \equiv c is a constant function, the action is equivalent to the composition with a mere translation. In this case, we denote Lτ L_\tau as Lc. L_c.

We also call Lτf(x) L_\tau f(x) the deformation of f f by the field τ \tau .
This is our way of modelling the tiny variations on an image. If we think of handwritten digits, for example, the digit ‘1’ takes many shapes that are continuous transformations of one another. See for example Figure 2.


Figure 2. A sample of a ‘1’ from MNIST and a deformation of it by a continuous field.
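As an illustration, the following is a minimal Python sketch of Definition 1 on a discrete image; the field tau, the image size and the use of scipy.ndimage.map_coordinates are illustrative choices, not prescribed by the text.

# L_tau f(x) = f(x - tau(x)) on a discrete image: sample f at the displaced positions.
import numpy as np
from scipy.ndimage import map_coordinates

def deform(f, tau):
    """f: 2D array; tau: function mapping (rows, cols) grids to (tau1, tau2)."""
    rows, cols = np.indices(f.shape).astype(float)
    t1, t2 = tau(rows, cols)
    # sample f at x - tau(x); outside the image we extend by 0 (black)
    return map_coordinates(f, [rows - t1, cols - t2], order=1, mode='constant')

# example field: a gentle horizontal wave, with small ||D tau|| and ||H tau||
tau = lambda r, c: (np.zeros_like(r), 2.0 * np.sin(2 * np.pi * r / r.shape[0]))

f = np.zeros((64, 64)); f[10:54, 30:33] = 255.0   # a stand-in for a digit '1'
f_deformed = deform(f, tau)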

 

Definition 2

Let dN d\in \N and let (H,H) (\mathcal H, \|\cdot\|_\mathcal{H}) be a normed space. Let Φ:L2(R2)H \Phi : L^2 (\R^2) \to \mathcal H be an operator
and τ:R2R2 \tau: \R^2 \to \R^2 be a C2 \mathcal C^2 -diffeomorphism. We say the operator is Lipschitz continuous to (the action of) τ \tau , if
 

\| \Phi f - \Phi (L_\tau f ) \|_\mathcal{H} \leq C \|f\|_2 \left(\| D \tau \|_\infty + \| H \tau \|_\infty \right)  (5)

for some C>0 not depending on \tau or f.

Take an operator Φ \Phi between function spaces. Given cR2 c\in \R^2 , we say the operator is translation covariant if
 

Φ(Lcf)=Lc(Φf),(6) \Phi(L_cf) = L_c(\Phi f),  (6)

and we say it is translation invariant if
 

Φ(Lcf)=Φf.(7) \Phi(L_cf) = \Phi f.  (7)

Let us make another comment on Definition 2. One possible choice of \mathcal H is a Hilbert space of functions from \R^2 to \R^d, d\in \N, where the inner product of f,g:\R^2 \to \R^d is given by
 

f,gH=R2f(x)g(x)dx=R2(f1(x)g1(x)++fd(x)gd(x))dx,(8) \left\langle f,g \right\rangle_\mathcal{H} = \int_{\R^2} f(x) \cdot \overline{g(x)} dx = \int_{\R^2} \left(f_1(x) \overline{g_1(x)} +\cdots + f_d(x)\overline{g_d(x)}\right) dx,  (8)

while the associated norm is

\|f\|_{\mathcal{H}} = \left(\int_{\R^2} |f(x)|_2^2 dx\right)^{\frac12} = \| |f|_2 \|_2.  (9)

 

3 Wavelets and Low-pass Filters

First of all, we need to define the filters that are used in the transformation. These are the basic components that we convolve with an fL2(R2). f\in L^2(\R^2). We define these filters first and then we give some relevant examples and explain what their convolution with f f produces.

 

Definition 3

A function ψL2(R2)L1(R2) \psi \in L^2(\R^2) \cap L^1(\R^2) is a wavelet if
 

ψ^(0)=R2ψ(x)dx=0. \widehat\psi(0) = \int_{\R^2} \psi(x) dx =0 .

The above is also known as a 2D-wavelet in the literature (see [2]).

 

Definition 4

We say that \phi : \R^2 \to \mathbb C is a low-pass filter if
 

R2ϕ(x)dx=1.(10) \int_{\R^2} | \phi(x) | dx = 1.  (10)

Let \psi, \phi: \R^2 \to \R be a wavelet and a low-pass filter respectively. Fix a positive integer K and, for each k = 0,\dots,K-1, denote by \gamma_k: \R^2\to \R^2 the counterclockwise rotation of the plane by an angle of \frac{2\pi k}{K}. We write \Gamma_K = \{\gamma_0,\dots,\gamma_{K-1}\} for the set of these K rotations.

Fix JZ. J\in \Z. For jZ j\in \Z , jJ j\leq J and γΓK \gamma \in \Gamma_K , we define the K K -directional child wavelets of ψ \psi as
 

ψj,γ(x)=22jψ(2jγx).(11) \psi_{j,\gamma} (x) = 2^{-2j} \psi(2^{-j}\gamma x).  (11)

and the low-pass filter for frequencies below 2J 2^{-J} is
 

ϕJ(x)=22Jϕ(2Jx).(12) \phi_J(x) = 2^{-2J} \phi(2^{-J}x).  (12)

Observe that the respective Fourier transforms are
 

ψj,γ^(ξ)=ψ^(2jγξ),(13) \widehat{\psi_{j,\gamma}}(\xi) = \widehat \psi (2^j\gamma \xi),  (13)

and
 

ϕJ^(ξ)=ϕ^(2Jξ).(14) \widehat{\phi_J }(\xi) = \widehat \phi (2^J \xi).  (14)

Suppose for a moment that supp\,\widehat\phi = B(0,1), the unit ball of \R^2. Then supp\,\widehat\phi_J = B(0,2^{-J}). Since \widehat{f*\phi_J} = \widehat f \; \widehat\phi_J, we see from the above that \phi_J removes
the high frequencies: those larger than 2^{-J}. This justifies the name of the low-pass filter.

Common choices for ψ \psi are rescalings of the following functions:
• A complex Morlet wavelet (See Figures 3, 4) with parameter ξ0R2 \xi_0\in \R^2 ,
 

C2(e2πiξ0xC1)ex2,(15) C_2 ( e^{2\pi i\xi_0\cdot x} -C_1 )e^{-|x|^2},  (15)

with C_1 such that the integral of the above is equal to 0 and C_2 such that the squared modulus of the above has integral equal to 1. Both constants depend on \xi_0.
• A partial derivative wavelet, which is a partial derivative of the Gaussian function
 

x1ex2.(16) \frac{\partial}{\partial x_1} e^{-|x|^2}.  (16)

A common choice for ϕ \phi is a rescaling of the Gaussian function (see Figure 5)
 

C3ex2,(17) C_3 e^{-|x|^2},  (17)

with
C_3 = \frac{1}{\pi} chosen for \phi to have integral equal to 1. It is a well known fact that the function e^{-\pi |x|^2} is its own Fourier transform. Therefore, the above choice of \phi gives us \widehat\phi(\xi) = e^{-\pi^2 |\xi|^2}. Thus, the filter \phi_J reduces to almost 0 all frequencies above some multiple of 2^{-J}.

The purpose of ψj,γ \psi_{j,\gamma} , jJ j\leq J , γΓK \gamma \in \Gamma_K ; is to be convolved with a function f:R2R f:\R^2 \to \R that represents an image (in black and white). These filters extract certain features from the image. For example, the position and orientation of edges between light and dark parts of an image.
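The following is a minimal numpy sketch of how (15) and (17) can be sampled on a finite grid; the grid size, its half-width and \xi_0 are illustrative choices, and C_1, C_2 are computed numerically so that discrete sums mimic the integral conditions.

# Sampled Morlet wavelet (15) and Gaussian low-pass (17) on a grid.
import numpy as np

def grid(n=64, half_width=4.0):
    t = np.linspace(-half_width, half_width, n)
    return np.meshgrid(t, t, indexing='ij')

def morlet(n=64, xi0=(1.0, 0.0)):
    x1, x2 = grid(n)
    env = np.exp(-(x1**2 + x2**2))
    car = np.exp(2j * np.pi * (xi0[0] * x1 + xi0[1] * x2))
    C1 = (car * env).sum() / env.sum()            # forces zero mean, as required in (15)
    psi = (car - C1) * env
    C2 = 1.0 / np.sqrt((np.abs(psi)**2).sum())    # unit energy in the discrete sense
    return C2 * psi

def gaussian_lowpass(n=64):
    x1, x2 = grid(n)
    phi = np.exp(-(x1**2 + x2**2))
    return phi / phi.sum()                        # discrete analogue of (10): sums to 1

psi, phi = morlet(), gaussian_lowpass()
print(abs(psi.sum()), phi.sum())                  # ~0 (a wavelet) and 1 (a low-pass filter)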


Figure 3. Real part of the Morlet wavelet from (15) with ξ0=(1,0) \xi_0 = (1,0) .

 


Figure 4. Imaginary part of the Morlet wavelet from (15) with ξ0=(1,0) \xi_0 = (1,0) .

 


Figure 5. The Gaussian function from (17)

 

4 Wavelet transform

Let us define the wavelet transform, which is a key component of the scattering transform.
 

Definition 5

Let f\in L^2(\R^2) and consider a wavelet \psi\in L^1(\R^2)\cap L^2(\R^2). Let j\in \Z and K\in \N. Fix a rotation \gamma \in \Gamma_K. The continuous wavelet transform, or simply wavelet transform, of scale 2^{-j} and rotation \gamma \in \Gamma_K is given by
 

Wj,γf(x)=fψj,γ(x),xR2.(18) W_{j,\gamma} f(x) = f * \psi_{j,\gamma} (x), \quad x \in \R^2.  (18)

The Wavelet transform WJ W_J is defined as a sequence of the above
 

WJf={fϕJ}{fψj,γ}jJ,γΓK.(19) W_J f = \{ f*\phi_J\} \cup \{ f * \psi_{j,\gamma} \}_{j\leq J, \gamma \in \Gamma_K}.  (19)

There are many variants of the wavelet transform of 2D-wavelets.
The version of the wavelet transform that we use here is known as the Littlewood-Paley wavelet transform (see [8]). In general (see [2]), the wavelet transform can be defined depending on a scale parameter a>0,

a displacement parameter bRd b \in \R^d

and an angle of rotation θ[0,2π) \theta\in[0,2\pi) .

In the above definition we consider several scales a=2j,jJ; a = 2^{-j}, j\leq J; we take b=0 b= 0 ; and we pick θ \theta in the set of angles corresponding to the rotations in ΓK \Gamma_K .

The wavelet transform operator also has the following boundedness property for a certain choice of the wavelets we introduced above (see [8] or [5]),
 

WJfHf2,(20) \left\| W_Jf \right\|_\mathcal{H} \leq \| f \|_2,  (20)

for all f\in L^2(\R^2), where \mathcal H denotes the space of sequences in which W_J takes its values, with norm
 

\|W_Jf\|_\mathcal{H} = \left( \|f*\phi_J\|_2^2 + \sum_{j\leq J,\, \gamma\in \Gamma_K} \|f*\psi_{j,\gamma}\|_2^2 \right)^{1/2}.  (21)

Since W_J is linear, the above also means that it is a weakly contractive operator from L^2(\R^2) to \mathcal H.
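Putting the pieces together, here is a self-contained, rough FFT sketch of W_J and of the norm (21); N, the base scale s (in pixels), \xi_0 and the random test image are illustrative choices, and the constants C_1, C_2, C_3 of Section 3 are handled only loosely.

# Sampled Morlet child wavelets (11), Gaussian low-pass (12), and the transform (19).
import numpy as np
from numpy.fft import fft2, ifft2, ifftshift

N, J, K, s = 128, 2, 8, 3.0
t = (np.arange(N) - N // 2) / s                    # grid in units of s pixels
X1, X2 = np.meshgrid(t, t, indexing='ij')

def morlet(x1, x2, xi0=(1.0, 0.0)):
    env = np.exp(-(x1**2 + x2**2))
    car = np.exp(2j * np.pi * (xi0[0] * x1 + xi0[1] * x2))
    C1 = (car * env).sum() / env.sum()             # zero-mean correction, as in (15)
    return (car - C1) * env

def child_wavelet(j, k):
    # psi_{j,gamma}(x) = 2^{-2j} psi(2^{-j} gamma x), gamma = rotation by 2*pi*k/K
    a = 2.0 * np.pi * k / K
    u1 = np.cos(a) * X1 - np.sin(a) * X2
    u2 = np.sin(a) * X1 + np.cos(a) * X2
    return 2.0**(-2 * j) * morlet(2.0**(-j) * u1, 2.0**(-j) * u2)

def lowpass(J):
    phi = np.exp(-(X1**2 + X2**2) / 2.0**(2 * J))  # phi_J from (12), up to normalization
    return phi / phi.sum()

def conv(f, h):                                    # circular convolution via FFT
    return ifft2(fft2(f) * fft2(ifftshift(h)))

f = np.random.rand(N, N)                           # stand-in for a grayscale image
WJf = [conv(f, lowpass(J))] + [conv(f, child_wavelet(j, k))      # scales 0 <= j < J
                               for j in range(J) for k in range(K)]
norm_WJf = np.sqrt(sum((np.abs(g)**2).sum() for g in WJf))       # discrete analogue of (21)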

 

5 The Fourier Domain Perspective

There is a simplified interpretation of how WJ W_J acts on a function f. f. Notice that the Fourier transform of the convolutions in the wavelet transform gives
 

fϕJ^(ξ)=f^(ξ)ϕ^(2Jξ),(22) \widehat{f*\phi_J}(\xi) = \widehat f (\xi) \widehat{\phi}(2^J\xi),  (22)

and
 

\widehat{f * \psi_{j,\gamma}} (\xi) = \widehat f (\xi) \widehat {\psi} ( 2^j \gamma \xi).  (23)

The wavelet transform extracts certain information out of the Fourier transform of f. Look at the above equation. If we think of \widehat{ \psi_{j,\gamma} } as compactly supported on a ball not centered at the origin, the scale and the angle determine which part of the Fourier domain of f is preserved by the transform with those parameters. See Figure 6.

The reader may be wondering why it makes any sense to think of the filters as compactly supported. The low-pass filter is simpler. If we choose ϕ \phi as a Gaussian, then it is certainly not compactly supported. Nonetheless, the vast majority of its mass is concentrated on a ball centered at the origin. And, outside of this ball, all frequencies are reduced to almost zero.

To understand the rest of the filters, consider a simpler version of Morlet wavelets, called Gabor functions
 

ψ(x)=C2e2πiξ0xex2=C2e2πiξ0xϕ(x),(24) \psi (x) = C_2 e^{2\pi i\xi_0\cdot x}e^{-|x|^2} = C_2 e^{2\pi i\xi_0\cdot x}\phi(x),  (24)

then,

\widehat \psi (\xi) = C_2\, \widehat \phi(\xi-\xi_0),  (25)

a translated Gaussian. This explains the interpretation in Figure 6.
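Indeed, (25) is nothing more than the modulation property of the Fourier transform, which follows directly from the definition:

\widehat \psi(\xi) = C_2\int_{\R^2} e^{2\pi i \xi_0\cdot x}\,\phi(x)\, e^{-2\pi i x\cdot \xi}\, dx = C_2\int_{\R^2} \phi(x)\, e^{-2\pi i x\cdot(\xi-\xi_0)}\, dx = C_2\, \widehat\phi(\xi - \xi_0).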

In practice, if we consider the Morlet wavelets from (15), the larger |\xi_0| is (or the smaller j is), the closer to 0 the constant C_1 is. The above interpretation using equation (24) is therefore an approximate one.


Figure 6. The filters in the frequency domain. In blue, the low-pass filter \phi_J with J=2. In green, the filters \psi_{0,\gamma}, \gamma \in \Gamma_8. In red, the filters \psi_{1,\gamma}, \gamma \in \Gamma_8.

 

6 Construction of the Scattering Transform

Let f \in L^2(\R^2). This function represents an image in black and white. Eventually, it needs to be replaced by a function with compact (and discrete) support. In this section we see how the scattering transform operator is constructed and what a transformed function looks like.

We need some auxiliary transformations first. The first transformation S_0f is given by a simple averaging of each value of f with the surrounding values: the convolution
 

S0f(x)=fϕJ(x).(26) S_0 f (x) = f * \phi_J (x) .  (26)

Here ϕJ \phi_J preserves the frequencies ξ2J |\xi|\leq 2^{-J} and removes the rest. See Figure 7 for an example.


Figure 7. On the left, an example of f_h representing a grayscale 320\times 240 image of a human hand, extracted from [9]. On the right, S_0f_h with J=2 and a Gaussian filter.
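As an illustration, the local averaging in (26) can be approximated with an off-the-shelf Gaussian blur; the image size and the width of the Gaussian below are illustrative choices that only roughly correspond to a given J.

# S_0 f is just a local average of the image at scale ~ 2^J.
import numpy as np
from scipy.ndimage import gaussian_filter

J = 2
f = np.random.rand(240, 320)            # stand-in for the grayscale image f_h
S0_f = gaussian_filter(f, sigma=2.0**J) # Gaussian blur playing the role of phi_J in (26)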

After that, we analyze higher frequencies by computing wavelet coefficients. We fix
0 \leq j \lt J and find
 

fψj,γ(x),γΓK.(27) f* \psi_{j,\gamma} (x) , \quad \gamma \in \Gamma_K.  (27)

Let us make a comment on the bounds of the parameter j. The bound j \lt J is part of the definition, but the choice of 0\leq j is related to the image processing perspective. Any scale lower than 2^0 =1 on a discrete function (an image) would not yield new information (it would be a scale smaller than a pixel).

The wavelet transforms in (27) commute with translations, in the sense that, given cR2 c \in \R^2 ,
 

Lc(Wj,γf)=Lc(fψj,γ)=fψj,γ(xc)=(Lcf)ψj,γ=Wj,γ(Lcf).(28) L_c(W_{j,\gamma} f) = L_c (f * \psi_{j,\gamma}) = f*\psi_{j,\gamma}(x-c) = (L_cf) * \psi_{j,\gamma} = W_{j,\gamma} (L_cf).  (28)

This property of translation covariance comes solely from the convolution. But covariance is not invariance: the relation f * \psi_{j,\gamma} = (L_cf) * \psi_{j,\gamma} does not hold in general for arbitrary f and c.

We want a translation invariant representation of f, f, but Wj,γf W_{j,\gamma}f is not, as (28) makes clear. We can obtain a translation invariant operator
 

Wj,γf=R2Wj,γf(x)dx,(29) \overline {W_{j,\gamma}} f = \int_{\R^2} W_{j,\gamma} f(x) dx,  (29)

but it is equal to 0. Indeed, applying Fubini's theorem, it is
 

\overline {W_{j,\gamma}} f = \int_{\R^2} \left(\int_{\R^2}f(y) \psi_{j,\gamma}(x-y)dy\right)dx  (30)

= \int_{\R^2}\left(f(y) \int_{\R^2}\psi_{j,\gamma}(x-y)dx\right)dy,  (31)

and the inner integral is 0 0 for all yR2 y\in \R^2 because of our definition of wavelet.

We can obtain a translation invariant operator from the integral by composing the integrand with a non-linear function \rho:\mathbb C\to\R, as

Uj,γf=R2ρ(Wj,γf(x))dx.(32) \overline {U_{j,\gamma}} f = \int_{\R^2} \rho (W_{j,\gamma} f(x)) dx.  (32)

There are several possible non-linearities, but the chosen one is \rho = |\cdot|, the complex modulus.
There are two reasons for this.
Firstly, \rho commutes with the action of diffeomorphisms, \rho\circ(L_\tau f) = L_\tau(\rho\circ f), and it preserves the norm, \|\rho\circ f\|_2 = \|f\|_2. Secondly, we want contractiveness, that is, \|\rho\circ f_1-\rho\circ f_2\|_2 \leq \| f_1-f_2 \|_2. These properties are key to preserving the boundedness and contractiveness that the wavelet transform already has. One need only look at (20) and (21) to see that composition with the modulus preserves the aforementioned properties.
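The contractiveness of the modulus is just the pointwise reverse triangle inequality integrated over \R^2:

\| \,|f_1|-|f_2|\, \|_2^2 = \int_{\R^2} \big|\, |f_1(x)|-|f_2(x)|\, \big|^2 dx \leq \int_{\R^2} |f_1(x)-f_2(x)|^2 dx = \|f_1-f_2\|_2^2.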

Setting ρ()=, \rho (\cdot) = |\cdot|, we have a quantity that is invariant to translations of f f given by the integral

Uj,γf=R2fψj,γ(x)dx,(33) \overline {U_{j,\gamma}} f = \int_{\R^2} |f*\psi_{j,\gamma} (x)|dx,  (33)

for each jJ j\leq J and γΓK. \gamma \in \Gamma_K. This is not satisfactory yet. Yes, the above positive quantities in (33) are translation invariant with respect to f f ; and they are also stable to deformations because Lτf=Lτf L_\tau|f| = |L_\tau f| and the integrand is stable to deformations [8]. However, the averaging over all R2 \R^2 loses too much information [8]. We do not settle for Uj,γ \overline {U_{j,\gamma}} .

Suppose that, instead of taking an average over all of \R^2, we take averages over local sets (for the sake of fixing ideas, think about balls of radius proportional to 2^J). We would end up with the functions

fψj,γϕJ(x)=R2fψj,γ(y)ϕJ(xy)dy.(34) |f*\psi_{j,\gamma}|*\phi_J (x) = \int_{\R^2} |f*\psi_{j,\gamma} (y)| \phi_J (x-y) dy.  (34)

The above functions happen to be Lipschitz continuous to deformations (see Theorem 6).
They also turn out to be locally translation invariant (again by Theorem 6) for translations of scale below 2^J \| D \tau\|_\infty.

The above integrals constitute the second auxiliary transform S1:R2RKJ S_1 : \R^2\to \R^{KJ} given by

S1f(x)=(fψj1,γ1ϕJ(x))0j1<Jγ1ΓK.(35) S_1f(x) = \left(|f*\psi_{j_1,\gamma_1}|*\phi_J (x) \right)_{\substack{0\leq j_1 \lt J\\ \gamma_1\in\Gamma_K}}.  (35)

See Figure 8 for an example.


Figure 8. Components of S_1f_h for J=2 and scale j_1 = 0, with rotations \gamma_1 of angles 0, \frac\pi4, \frac\pi2, \frac{3\pi}{2} respectively. The low-pass filter is a Gaussian and the rest are Morlet wavelets. This output is normalized: white pixels represent the largest value in any single image and black pixels represent the value 0. There are 255 shades of gray in total.


Figure 9. The third image in Figure 8: the component |f_h*\psi_{j_1,\gamma_1}|*\phi_J (x), with J=2, j_1 = 0, \gamma_1 of angle \pi/2. Here 2^{j_1}=1 is small compared to the support of f_h. This makes the Morlet wavelet detect oriented edges according to how the rotation affects \xi_0 in (15). For this particular \gamma_1, the response at horizontal edges is visibly the largest.
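For concreteness, and reusing the conv, child_wavelet and lowpass helpers together with the test image f from the sketch at the end of Section 4 (so this is not the exact filter bank behind the figures), one such component can be computed as follows.

# One component of S_1 f from (35): j_1 = 0 and gamma_1 of angle pi/2 (k = 2 when K = 8).
# conv, child_wavelet, lowpass, f and J are the ones defined in the Section 4 sketch.
S1_component = np.real(conv(np.abs(conv(f, child_wavelet(0, 2))), lowpass(J)))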

Now, ϕJ \phi_J is a low-pass filter and we are removing high frequencies of euclidean norm larger than 2J 2^{-J} (and smaller than 2j1 2^{-j_1} ) from fψj1,γ1 | f* \psi_{j_1,\gamma_1} | . In order to take into account some of those frequencies, we convolve with another wavelet: we perform the wavelet transform of fψj1,γ1. | f* \psi_{j_1,\gamma_1} |.

We obtain, for some j2 j_2 such that j1<j2<J j_1 \lt j_2 \lt J ,
 

fψj1,γ1ψj2,γ2(x),γ1,γ2ΓK.(36) |f*\psi_{j_1, \gamma_1}|*\psi_{j_2,\gamma_2} (x), \quad \gamma_1,\gamma_2 \in \Gamma_K .  (36)

There is a reason for restricting the parameter j_2 to j_2 > j_1. According to [4] and [8], the spectral energy of |f*\psi_{j_1, \gamma_1}|*\psi_{j_2,\gamma_2} over \R^2 is negligible when j_2 \leq j_1. As before, we apply the non-linearity and filter the high frequencies. The third auxiliary transform is S_2:\R^2 \to \R^{K^2\binom{J}{2}} given by
 

S_2f(x) = \left( \left| |f*\psi_{j_1, \gamma_1}| * \psi_{j_2,\gamma_2} \right|* \phi_J(x) \right)_{\substack{0\leq j_1 \lt j_2 \lt J \\ \gamma_1,\gamma_2\in \Gamma_K}}.  (37)

See Figure 10 for an example.


Figure 10. Output of S2f S_2 f for J=2 J=2 , j1=0 j_1 =0 , j2=1 j_2 =1 , γ1 \gamma_1 of angle π2 \frac\pi2 and, from left to right, γ2 \gamma_2 of angles 0,π4,π2,3π2 0, \frac\pi4,\frac\pi2,\frac{3\pi}{2} . The output is normalized: white is the maximum value and black is 0.
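Similarly, still reusing the helpers from the Section 4 sketch, a single second order component with j_1 = 0 and j_2 = 1 reads as follows.

# One component of S_2 f from (37), with j_1 = 0 < j_2 = 1 and, say, gamma_1 of
# angle pi/2 (k = 2) and gamma_2 of angle 0 (k = 0); helpers as in the Section 4 sketch.
U1 = np.abs(conv(f, child_wavelet(0, 2)))      # |f * psi_{j1,gamma1}|
U2 = np.abs(conv(U1, child_wavelet(1, 0)))     # | |f * psi_{j1,gamma1}| * psi_{j2,gamma2} |
S2_component = np.real(conv(U2, lowpass(J)))   # final low-pass, as in (37)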

Let m\in \N. The general (m+1)-th auxiliary transform is an iteration of the above, given by
 

S_mf(x) = \left( \left| \cdots \left| |f*\psi_{j_1, \gamma_1}| * \psi_{j_2,\gamma_2} \right| \cdots *\psi_{j_m,\gamma_m}\right|* \phi_J(x) \right)_{\substack{0 \leq j_1 \lt \cdots \lt j_m \lt J\\ \gamma_1,\dots,\gamma_m\in \Gamma_K}}  (38)

Each choice of parameters (j_i,\gamma_i), i\in\{1,\dots,m\}, determines a path, that is, an ordered sequence of pairs p = \left( (j_1,\gamma_1),\dots, (j_m,\gamma_m ) \right), where 0\leq j_1 \lt j_2 \lt \cdots \lt j_m \lt J. The number of pairs is the length of the path p.

We finally arrive at the scattering transform of f with path lengths less than or equal to m. It is Sf:\R^2\to\R^M given by
 

Sf(x)=(S0f(x),S1f(x),...,Smf(x)),(39) Sf(x) = (S_0f(x), S_1f(x),...,S_mf(x)),  (39)

where the dimension M= \sum_{q = 0 }^{m}K^q\binom{J}{q} of the space comes from the choice of parameters at each of the auxiliary transforms. Those are all the possible choices of scales in at most m steps, 0\leq j_1 \lt j_2 \lt \cdots \lt j_m \lt J, with K rotations at each step.

For most practical classification purposes, setting the parameter m from (39) to m=2 is enough. In fact, the package Kymatio [1] only implements scattering of images up to m = 2 for this reason, which is explained in [4]. The parameters J and K are less straightforward to choose. Theorem 6 is helpful for guessing a good choice of J. The parameter K only needs to be moderately large; a common choice is K = 16. Of course, instead of guessing, one can perform cross validation over some values of the parameters, as is done in [5].
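As a usage sketch, this is roughly how the same coefficients are obtained with Kymatio [1]; the import path and the output layout may differ slightly between versions, and the input size is an arbitrary choice. The number of output channels should match M from (39).

# Scattering2D with J = 2 and K = 8 directions (Kymatio calls this parameter L).
import numpy as np
from math import comb
from kymatio.numpy import Scattering2D

J, K, m = 2, 8, 2
scattering = Scattering2D(J=J, shape=(32, 32), L=K)
f = np.random.rand(32, 32).astype(np.float32)
Sf = scattering(f)

M = sum(K**q * comb(J, q) for q in range(m + 1))   # M from (39): 1 + K*J + K^2*binom(J,2)
print(Sf.shape, M)                                 # the channel dimension of Sf should equal M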

 

7 Lipschitz Continuity to Deformations

The following result tells us that the scattering transform, as we have defined it, is both Lipschitz continuous to \mathcal C^2 diffeomorphisms and translation invariant whenever the maximum displacement \|\tau\|_\infty is small with respect to the scale 2^J and the deformation size \| D \tau\|_\infty. It is a consequence of Corollary 1 in [8]; we state a simpler version that is enough for our purposes.

 

Theorem 6

There exists C>0 such that, given f\in L^2(\R^2) with compact support, for every \mathcal C^2-diffeomorphism \tau with \| D \tau\|_\infty \leq \frac12 and 2^J \geq \frac{\|\tau\|_\infty}{\| D \tau\|_\infty}, it holds that
 

S(Lτf)Sf22Cmf2(Dτ+Hτ).(40) \left\| |S (L_\tau f) - Sf|_2 \right\|_2 \leq C m \| f\|_2 \left( \| D \tau \|_\infty +\| H\tau \|_\infty \right).  (40)

 

8 Quick Comparison with a Convolutional Neural Network

For those readers familiar with the basic structure of a Convolutional Neural Network (abbreviated CNN), let us take another look at (39). We can think of f as an image, and of \psi_{j_1, \gamma_1}, \dots, \psi_{j_m,\gamma_m}, for j_1 \lt \cdots \lt j_m, \gamma_1,\dots,\gamma_m \in \Gamma_K, as the filters. The filters remain unchanged from beginning to end here, in contrast to the filters in CNNs, which do change with training.

The complex modulus \rho = |\cdot | performs a pooling: it puts the real and imaginary parts together. This gives us |f*\psi_{j_1, \gamma_1}|, j_1 \lt J, \gamma_1\in \Gamma_K, which is the output of the first layer and the input of the second layer. In the second layer, we do another filtering with the \psi_{j_2,\gamma_2}, j_1 \lt j_2 \lt J, \gamma_2 \in \Gamma_K, and yet another pooling with the complex modulus. We repeat this process until the final m^{\text{th}} layer.

The low-pass filter \phi_J produces another pooling. It is an average pooling over regions of size (proportional to) 2^J. We apply \phi_J to the output of the network. But there is a caveat. Unlike in CNNs, whose output is the output of the final layer, here we apply the \phi_J pooling to the output of each layer and to the input f itself. This gives us S_0f, S_1f, \dots, S_m f, which constitute the final output Sf.

 

9 Plotting the Scattering transform

In order to give a broader intuition of how the output of the scattering transform looks, we include a folder (.zip) with a Python script (my_scattering.py). The script generates two files (scattering_slow.gif, scattering_fast.gif) that contain the whole output that Figures 8, 9 and 10 above represent only in part. It also generates an image (scattering.png).

There are six relevant variables with assigned values. The following lines of code are among the first 40 lines of the script (.py) and contain the definitions of five of them; the sixth, filename, appears further below. We encourage the reader to tinker with the values and compare the resulting outputs.
scale = 3
rotations = 8
resize = True
enhance_each = False
enhance_global = True

In fact, the reader can use a grayscale image of their choice instead of the example image of a digit 1 from MNIST [6]. The only requirement is to put a grayscale image file in the same folder as the script (.py) and change the variable
filename = 'one.png'
to match the name and extension of that file.

 

References

[1] Mathieu Andreux, Tomás Angles, Georgios Exarchakis, Roberto Leonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka, Stéphane Mallat, Joakim Andén, Eugene Belilovsky, Joan Bruna, Vincent Lostanlen, Matthew J. Hirn, Edouard Oyallon, Sixin Zhang, Carmine Cella, and Michael Eickenberg. Kymatio: Scattering Transforms in Python. 2019. http://arxiv.org/pdf/1812.11214v2
[2] Jean P. Antoine and Romain Murenzi. Two-dimensional directional wavelets and the scale-angle representation. Signal Processing, 52(3):259–281, 1996. https://www.sciencedirect.com/science/article/pii/0165168496000655
[3] Alexandre Bernardino and José Santos-Victor. A real-time Gabor primal sketch for visual attention. In David Hutchison et al., editors, Pattern Recognition and Image Analysis, volume 3522 of Lecture Notes in Computer Science, pages 335–342. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005. https://www.researchgate.net/publication/221258995_A_Real-Time_Gabor_Primal_Sketch_for_Visual_Attention
[4] Joan Bruna and Stéphane Mallat. Classification with scattering operators. 2013. https://arxiv.org/pdf/1011.3023
[5] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. 2012. https://arxiv.org/pdf/1203.1513
[6] Li Deng. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Processing Magazine, 29(6):141–142, 2012. https://ieeexplore.ieee.org/document/6296535
[7] David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum, Jorge S. Marques, Nicolás Pérez de la Blanca, and Pedro Pina, editors. Pattern Recognition and Image Analysis, volume 3522 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005. https://link.springer.com/book/10.1007/b136825
[8] Stéphane Mallat. Group Invariant Scattering. 2012. https://arxiv.org/pdf/1101.2286
[9] Shyambhu Mukherjee. Dataset: Hands and Palm Images Dataset, Version 2, March 2021. https://www.kaggle.com/shyambhu/hands-and-palm-images-dataset/metadata