Unifying Single-host and Distributed Machine Learning with Maggy



This blog covers the oblivious training function and the internals of Maggy presented at Spark+AI Summit 2020, on June 26th.


TLDR; Maggy is an open-source framework for distributed machine learning. In this post, we introduce a new unified framework for writing core ML training logic as “oblivious training functions”. Maggy enables you to reuse the same training code whether training small models on your laptop or reusing the same code to scale out hyperparameter tuning or distributed deep learning on a cluster. Maggy enables the replacement of the current waterfall development process for distributed ML applications, where code is rewritten at every stage, with an iterative development process.
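To make the idea concrete, here is a minimal, framework-free sketch of what an "oblivious training function" looks like: the core training logic takes hyperparameters as plain arguments and reports metrics through an optional callback, so the exact same function can be called directly on a laptop or handed to a distributed driver for hyperparameter tuning. All names here (`train`, `reporter`) are illustrative only and are not Maggy's actual API.

```python
# Illustrative sketch: the training function is "oblivious" to whether it
# runs single-host or under a tuning framework. Hyperparameters arrive as
# arguments; metrics leave through an injected callback.

def train(lr, batch_size, reporter=None):
    """Toy training function: 'loss' stands in for fitting a real model."""
    # Pretend the best configuration is lr=0.01, batch_size=32.
    loss = abs(lr - 0.01) * 100 + abs(batch_size - 32) / 32
    if reporter is not None:
        reporter(loss)  # framework hook, e.g. for early stopping
    return loss

# Single-host use: just call it.
baseline = train(lr=0.01, batch_size=32)

# "Scaled-out" use: a driver (here a plain loop standing in for a
# distributed executor pool) sweeps hyperparameters over the same function.
trials = [(lr, bs) for lr in (0.001, 0.01, 0.1) for bs in (16, 32, 64)]
results = {cfg: train(*cfg) for cfg in trials}
best_cfg = min(results, key=results.get)
```

The point is that nothing in `train` refers to the execution environment, so no rewrite is needed when moving between the laptop and the cluster.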


Most of the publicly available ML source code for training models is not built to scale out across many servers or GPUs. Getting started with deep learning is relatively easy these days, thanks to fast.ai, GitHub, and the blogosphere. The hard part for practitioners starts when the code examples found online need to be applied to more challenging domains with larger, custom datasets, which in turn require a larger, customized version of the model to fit that dataset. Using publicly available code as a starting point for model development on clusters, you will end up in a process similar to the one depicted in Figure 1.


[Figure 1]


The software development process for ML models is rarely the perfect waterfall development model, as shown in Figure 1 without the green arrows. In the (discredited) waterfall development process, you would start out with requirements, then move on to design, implementation and test. The (current!) equivalent process in ML model development is the following, as shown in Figure 1 with the green arrows. You start out on your local machine with a subset of the data in order to explore and design the model architecture. Then you move to use a cluster of resources (such as GPUs) to more quickly find hyperparameters, run lots of parallel ablation studies (many skip this stage!), and finally scale out the training of the model on the large dataset using lots of resources. Then, you’re done, right? Wrong! You typically iterate through the stages, finding better hyperparameters, adding new features, rewriting for distribution, going from your laptop to the cluster and back again.


We rewrite our model training code for distribution because it offers many benefits: faster training of models using more GPUs, parallelizing hyperparameter tuning over many GPUs, and parallelizing ablation studies to help understand the behaviour and performance of deep neural networks. However, not only will the boilerplate model training code need to be modified; as you move along the process, distribution will introduce additional obtrusive code artifacts and modifications, depending on the frameworks used. This leads to a mix of infrastructure code and model code, with duplicated training logic, hyperparameters hard-coded into the training loop, extra tracking code to record your changes, and configuration files for each experiment.
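The kind of entangled script described above can be sketched as follows. Every name and value here is hypothetical: the hard-coded constants, the fake "cluster" host list, and the ad-hoc log stand in for the real infrastructure and tracking code that accumulates around the model logic.

```python
# Hypothetical anti-pattern sketch: hyperparameters, training logic,
# cluster setup, and experiment tracking all tangled in one script.

LEARNING_RATE = 0.01   # hard-coded: changing it means editing source code
BATCH_SIZE = 64        # duplicated across every copy of this script

def train_step(weights, grad):
    # The actual model logic is buried between infrastructure concerns.
    return [w - LEARNING_RATE * g for w, g in zip(weights, grad)]

# Infrastructure code mixed into the same file as the model:
cluster_hosts = ["gpu-node-1", "gpu-node-2"]  # per-experiment config
experiment_log = []                            # ad-hoc tracking

weights = [1.0, -2.0]
for step in range(3):
    grad = [2 * w for w in weights]   # toy gradient of sum(w^2)
    weights = train_step(weights, grad)
    experiment_log.append((step, list(weights)))
```

None of this training logic can be reused for tuning or distribution without copy-pasting and rewriting, which is exactly the duplication the oblivious training function is meant to remove.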

