2020-CVPR-DMCP Differentiable Markov Channel Pruning for Neural Networks


The authors propose a novel differentiable channel pruning method, Differentiable Markov Channel Pruning (DMCP), to search efficiently for the optimal sub-structure.


At the same FLOPs, our method outperforms all the other pruning methods both on MobileNetV2 and ResNet, as shown in Figure 1.

With our method, MobileNetV2 has a 0.1% accuracy drop with a 30% FLOPs reduction, and the FLOPs of ResNet-50 are reduced by 44% with only a 0.4% drop.


Recent works imply that channel pruning can be regarded as searching for the optimal sub-structure of an unpruned network.


However, existing works based on this observation require training and evaluating a large number of structures, which limits their application.


Conventional channel pruning methods mainly rely on the human-designed paradigm.


The structure of the pruned model, rather than the inherited “important” weights, is the key to determining its performance.


The optimization of these pruning processes needs to train and evaluate a large number of structures sampled from the unpruned network, so the scalability of these methods is limited.


A similar problem in neural architecture search (NAS) has been tackled by the differentiable method DARTS.


PS: differences from DARTS

First, the definition of the search space is different. The search space of DARTS is a set of pre-defined operations (convolution, max-pooling, etc.), while in channel pruning the search space is the number of channels in each layer.


Second, the operations in DARTS are independent of each other. But in channel pruning, a layer with k + 1 channels must first have at least k channels, which is a logical implication relationship.



Our method makes the channel pruning differentiable by modeling it as a Markov process.



Our method is differentiable and can be directly optimized by gradient descent with respect to standard task loss and budget regularization (e.g. FLOPs constraint).


In the Markov process for each layer, the state \(S_k\) represents that the \(k^{th}\) channel is retained, and the transition from \(S_k\) to \(S_{k+1}\) represents the probability of retaining the \((k+1)^{th}\) channel given that the \(k^{th}\) channel is retained.

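Each layer is one Markov process, and state \(S_k\) means the \(k^{th}\) channel is retained. The transition from \(S_k\) to \(S_{k+1}\) is the probability of retaining the \((k+1)^{th}\) channel.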

Note that the start state is always \(S_1\) in our method.

\(S_1\) is the start state, i.e. every layer keeps at least one channel.

Then the marginal probability for state \(S_k\), i.e. the probability of retaining the \(k^{th}\) channel, can be computed as the product of transition probabilities and can also be viewed as a scaling coefficient.
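The product described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `sigmoid`, `marginal_probabilities`, and `alphas` are hypothetical, and mapping a learnable parameter to a transition probability through a sigmoid is an assumption made here, since these notes only say the transition probabilities are "parameterized by learnable parameters".

```python
import math


def sigmoid(x):
    """Map a real-valued parameter to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_probabilities(alphas):
    """Return the marginal probabilities of S_1, ..., S_K for one layer.

    alphas[i] is a hypothetical learnable parameter for the transition
    from S_{i+1} to S_{i+2}; the sigmoid mapping is an assumption.
    """
    marginals = [1.0]  # the start state S_1 is always retained
    for alpha in alphas:
        # Marginal of the next state = marginal of this state * transition prob.
        marginals.append(marginals[-1] * sigmoid(alpha))
    return marginals


marginals = marginal_probabilities([2.0, 0.0, -1.0])
```

Because every factor is at most 1, the marginals never increase with the channel index, which matches the rule that channel k+1 can only be kept if channel k is kept.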


Each scaling coefficient is multiplied with its corresponding channel’s feature map during the network's forward pass.

In the forward pass, each channel's feature map is multiplied by that channel's marginal probability (the scaling coefficient).
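A minimal sketch of this per-channel scaling, using plain nested lists instead of a tensor library. The function name `scale_feature_map` and the channels-first layout (a list of C channels, each an H x W grid) are assumptions for illustration only.

```python
def scale_feature_map(feature_map, marginals):
    """Multiply every value of channel k by the k-th scaling coefficient.

    feature_map: list of C channels, each a list of rows (H x W).
    marginals: list of C marginal probabilities, one per channel.
    """
    if len(feature_map) != len(marginals):
        raise ValueError("one scaling coefficient is needed per channel")
    return [
        [[value * coeff for value in row] for row in channel]
        for channel, coeff in zip(feature_map, marginals)
    ]


# Three identical 2x2 channels, scaled by example marginals.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 4.0]],
]
marginals = [1.0, 0.5, 0.25]
scaled = scale_feature_map(feature_map, marginals)
```

Since the scaling is an ordinary multiplication, the task loss depends smoothly on each marginal probability, which is what lets gradients reach the transition parameters.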

So the transition probabilities parameterized by learnable parameters can be optimized in an end-to-end manner by gradient descent with respect to task loss together with budget regularization (e.g. FLOPs constraint).

So the transition probabilities of every layer and every channel can be optimized end to end by gradient descent on the task loss and the budget loss (FLOPs loss).
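The combined objective can be sketched as follows. Everything specific here is an assumption rather than something these notes state: the expected channel count is taken as the sum of the marginals, the cost of a convolution is approximated by a multiply-accumulate count that ignores bias and grouping, and the budget term is a simple relative gap. Only the forward computation is shown; in practice an autodiff framework would supply the gradients.

```python
def expected_channels(marginals):
    """Expected number of retained channels: the sum of the marginals."""
    return sum(marginals)


def expected_conv_flops(in_marginals, out_marginals, kernel_size, out_h, out_w):
    """Approximate expected cost of one convolution (assumed cost model)."""
    e_in = expected_channels(in_marginals)
    e_out = expected_channels(out_marginals)
    return e_in * e_out * kernel_size * kernel_size * out_h * out_w


def budget_loss(expected_flops, target_flops):
    """Hypothetical budget regularizer: relative gap to the target."""
    return abs(expected_flops - target_flops) / target_flops


def total_loss(task_loss, expected_flops, target_flops, weight=1.0):
    """Task loss plus weighted budget regularization."""
    return task_loss + weight * budget_loss(expected_flops, target_flops)


in_marginals = [1.0, 1.0, 1.0]
out_marginals = [1.0, 0.5, 0.5, 0.0]
flops = expected_conv_flops(in_marginals, out_marginals, kernel_size=3, out_h=8, out_w=8)
loss = total_loss(task_loss=0.7, expected_flops=flops, target_flops=1728.0, weight=0.1)
```

In this toy setting the expected cost is twice the target, so the budget term pushes the marginals of the later channels down until the expectation meets the budget.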

After the optimization, the model within desired budgets can be sampled by the Markov process with learned transition probabilities and will be trained from scratch to achieve high performance.

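After optimization (i.e. once each layer's transition/marginal probabilities can yield networks that meet the FLOPs constraint), sub-networks are sampled and trained from scratch.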
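Sampling a structure from the learned chain can be sketched like this. The walk starts at \(S_1\) and stops at the first failed transition, so the sampled width is the index of the last state reached. The helper names, the retry loop, and the toy cost function (product of the two widths standing in for FLOPs) are illustrative assumptions, not details taken from the paper.

```python
import random


def sample_channel_count(transition_probs, rng):
    """Walk one layer's chain from S_1 and return the number of kept channels.

    transition_probs[i] is the probability of moving from S_{i+1} to S_{i+2}.
    """
    count = 1  # S_1 is always retained
    for prob in transition_probs:
        if rng.random() >= prob:
            break  # transition failed: all later channels are pruned
        count += 1
    return count


def sample_within_budget(layer_probs, cost_fn, budget, rng, max_tries=1000):
    """Resample all layers until the structure's cost fits the budget."""
    for _ in range(max_tries):
        widths = [sample_channel_count(probs, rng) for probs in layer_probs]
        if cost_fn(widths) <= budget:
            return widths
    return None  # no structure within budget was found


rng = random.Random(0)
layer_probs = [[0.9, 0.8, 0.3], [0.95, 0.5, 0.5, 0.1]]
widths = sample_within_budget(
    layer_probs, cost_fn=lambda w: w[0] * w[1], budget=8, rng=rng
)
```

The returned `widths` list gives the channel count per layer; a network with those widths would then be built and trained from scratch.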








(\(p_k\) is the transition probability, \(p_{w1}\) is the marginal probability)









The proposed method is differentiable because it models channel pruning as a Markov process, and thus can be optimized with respect to the task loss by gradient descent.



[CVPR 2020 Oral | DMCP: an explanation of the differentiable deep model pruning algorithm]

[Soft Filter Pruning (SFP) algorithm notes]

