The final model is very thin (in particular, it uses channel-wise separable convolutions), and the width of each block is very small.
Abstract
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that a good accuracy-to-complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8× and 5.5× fewer multiply-adds and parameters for similar accuracy as previous work.
Expand the model from the 2D space into the 3D spacetime domain.
Find relevant features in a greedy fashion by including (forward selection) a single feature in each step; or start with a full set of features and find irrelevant ones to exclude by repeatedly deleting the feature that reduces performance the least (backward elimination).
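The forward-selection analogy above can be sketched as a greedy loop: at each step, try expanding every axis alone and keep only the single expansion that scores best. A minimal sketch follows; the axis names, initial values, and `score` function are hypothetical placeholders for "train a tiny model and measure the accuracy/complexity trade-off", not X3D's actual training procedure.

```python
import math

def score(config):
    # Hypothetical surrogate objective: per-axis accuracy saturates
    # (diminishing returns), complexity is the product of all axes
    # (a crude FLOPs proxy). Not the real training/evaluation loop.
    accuracy = sum(1.0 - 1.0 / v for v in config.values())
    complexity = math.prod(config.values())
    return accuracy / math.log(complexity + math.e)

def expand_one_axis(config, factor=2.0):
    """Greedy forward step: expand the single axis that scores best."""
    trials = {axis: {**config, axis: config[axis] * factor} for axis in config}
    best_axis = max(trials, key=lambda a: score(trials[a]))
    return best_axis, trials[best_axis]

# Illustrative starting point: a "tiny" model (1 frame, thin and shallow).
config = {"frames": 1.0, "resolution": 112.0, "width": 24.0, "depth": 10.0}
for step in range(5):
    axis, config = expand_one_axis(config)
    print(f"step {step}: expanded {axis} -> {config[axis]}")
```

With this surrogate, the most under-expanded axis (frames) is grown first, and the greedy loop eventually switches to another axis once the marginal gain drops, mirroring the one-axis-per-step schedule described in the notes.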
X2D baseline
Uses channel-wise separable convolutions.
Further, the temporal convolution in the first stage is channel-wise.
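The efficiency gain from channel-wise (depthwise) separable convolution can be seen from a multiply-add count: a dense 3×3×3 conv mixes all input channels into every output channel, while the separable version applies one 3×3×3 filter per channel plus a 1×1×1 pointwise conv. The channel and spatiotemporal sizes below are illustrative, not X3D's exact stage dimensions.

```python
# Multiply-add (MAC) comparison: dense 3D conv vs channel-wise separable.
C_in, C_out = 48, 48          # channels (illustrative)
T, H, W = 13, 28, 28          # output spatiotemporal size (illustrative)
k = 3                         # 3x3x3 kernel

dense = C_in * C_out * k**3 * T * H * W   # standard conv: full channel mixing
depthwise = C_in * k**3 * T * H * W       # one 3x3x3 filter per channel
pointwise = C_in * C_out * T * H * W      # 1x1x1 conv to mix channels
separable = depthwise + pointwise

print(f"dense MACs:     {dense:,}")
print(f"separable MACs: {separable:,}")
print(f"reduction:      {dense / separable:.2f}x")
```

The reduction factor is `C_out * k**3 / (k**3 + C_out)`, so the thinner-but-deeper design pays far less per layer than a dense 3D conv would.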
Similar to the SlowFast pathways, the model preserves the temporal input resolution for all features throughout the network hierarchy. There is no temporal downsampling layer (neither temporal pooling nor time-strided convolutions) anywhere in the network, up to the global pooling layer before classification. Thus, the activation tensors contain all frames along the temporal dimension, maintaining full temporal frequency in all features.
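The "no temporal downsampling" property amounts to every stage using a temporal stride of 1 while spatial strides still halve H and W. A tiny sketch with illustrative stage strides (not X3D's exact stage list):

```python
# Per-stage strides (t, h, w); temporal stride is always 1, so T survives
# intact to the global pooling layer, while H and W are downsampled.
T, H, W = 16, 224, 224                      # illustrative input size
stage_strides = [(1, 2, 2)] * 5             # illustrative: 5 stages, spatial stride 2

for st, sh, sw in stage_strides:
    T, H, W = T // st, H // sh, W // sw

print(T, H, W)  # temporal dimension unchanged; spatial reduced 32x
```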
Expansion is simple and cheap: e.g. our low-compute model is completed after training only 30 tiny models that cumulatively require over 25× fewer multiply-add operations for training than one large state-of-the-art network.
An expansion of the depth after increasing the input resolution is intuitive, as it allows the filter receptive fields to grow within each residual stage to match the larger input.
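The intuition that depth should follow resolution can be made concrete: for a stack of stride-1 k×k convolutions, the receptive field grows linearly with depth, `rf = 1 + depth * (k - 1)`, so more layers are needed before each output location can see a proportionally large region of a higher-resolution input. A quick check:

```python
# Receptive field of a stack of stride-1 k x k convolutions.
def receptive_field(depth, k=3):
    return 1 + depth * (k - 1)

for d in (5, 10, 20):
    print(f"depth {d}: receptive field {receptive_field(d)}")
```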
A surprising finding of our progressive expansion is that networks with a thin channel dimension and high spatiotemporal resolution can be effective for video recognition.
X3D-S is a lower spatiotemporal-resolution version of X3D-M and therefore has the same number of parameters.
The models in (a) and (b) only differ in the spatiotemporal resolution of the input and activations, and (c) differs from (b) in spatial resolution, width, and depth; see Table 1. Surprisingly, X3D-XL has a maximum width of 630 feature channels.