Despite the popularity of Model Compression and Multitask Learning, how to effectively compress a multitask model has been less thoroughly analyzed due to the challenging entanglement of tasks in the parameter space. In this paper, we propose DiSparse, a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme. We consider each task independently by disentangling the importance measurement, and take a unanimous decision among all tasks when performing parameter pruning and selection. Our experimental results demonstrate superior performance on various configurations and settings compared to popular sparse training and pruning methods. Beyond its effectiveness for compression, DiSparse also provides a powerful analysis tool to the multitask learning community. Surprisingly, we even observed better performance than some dedicated multitask learning methods in several cases, despite the high model sparsity enforced by DiSparse. We analyzed the pruning masks generated with DiSparse and observed strikingly similar sparse network architectures identified by each task, even before training starts. We also observe the existence of a "watershed" layer where the task relatedness sharply drops, implying no benefit from continued parameter sharing. Our code and models will be available at: https://github.com/SHI-Labs/DiSparse-Multitask-Model-Compression.
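To make the unanimous-decision idea concrete, below is a minimal sketch, not the authors' released implementation: it assumes hypothetical per-task saliency scores over the shared parameters and reads "unanimous" as pruning a shared parameter only when every task independently marks it as unimportant.

```python
import torch

def unanimous_prune_mask(per_task_scores, sparsity):
    """Sketch of a unanimous multitask pruning decision (illustrative only).

    per_task_scores: list of tensors, one importance score per shared
        parameter for each task (e.g., a gradient- or magnitude-based saliency).
    sparsity: fraction of shared parameters targeted for removal by each task.
    Returns a binary mask in which a parameter is pruned only if all tasks
    agree it is unimportant, i.e., it is kept if any task keeps it.
    """
    keep_masks = []
    for scores in per_task_scores:
        k = max(1, int((1.0 - sparsity) * scores.numel()))
        # Each task independently keeps its top-k most important parameters.
        threshold = torch.topk(scores.flatten(), k, largest=True).values.min()
        keep_masks.append(scores >= threshold)
    # Unanimous pruning vote: drop a parameter only when no task keeps it.
    mask = keep_masks[0]
    for m in keep_masks[1:]:
        mask = mask | m
    return mask.float()
```

A usage example would pass one saliency tensor per task for a shared layer and multiply the returned mask into that layer's weights; the disentanglement lies in computing each task's scores separately before any joint decision is made.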