The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design

Jeffrey Dean, Google Research

Abstract

The past decade has seen a remarkable series of advances in machine learning, and in particular deep learning approaches based on artificial neural networks, that have improved our abilities to build more accurate systems across a broad range of areas, including computer vision, speech recognition, language translation, and natural language understanding tasks. This paper is a companion to a keynote talk at the 2020 International Solid-State Circuits Conference (ISSCC) discussing some of these advances in machine learning and their implications for the kinds of computational devices we need to build, especially in the post-Moore’s Law era. It also discusses some of the ways that machine learning may be able to help with aspects of the circuit design process. Finally, it provides a sketch of at least one interesting direction towards much larger-scale multi-task models that are sparsely activated and employ much more dynamic, example- and task-based routing than the machine learning models of today.


1 Introduction

The past decade has seen a remarkable series of advances in machine learning (ML), and in particular deep learning approaches based on artificial neural networks, to improve our abilities to build more accurate systems across a broad range of areas [LeCun ​et al.​ 2015]. Major areas of significant advances include computer vision [Krizhevsky ​et al.​ 2012, Szegedy ​et al.​ 2015, He et al. 2016, Real ​et al.​ 2017, Tan and Le 2019], speech recognition [Hinton ​et al.​ 2012, Chan ​et al.​ 2016], language translation [Wu ​et al. 2016] and other natural language tasks [Collobert ​et al.​ 2011, Mikolov et al. 2013, Sutskever ​et al.​ 2014, Shazeer ​et al.​ 2017, Vaswani et al. 2017, Devlin ​et al.​ 2018]. The machine learning research community has also been able to train systems to accomplish some challenging tasks by learning from interacting with environments, often using reinforcement learning, showing success and promising advances in areas such as playing the game of Go [Silver ​et al.​ 2017], playing video games such as Atari games [Mnih ​et al. 2013, Mnih ​et al.​ 2015] and Starcraft [Vinyals ​et al.​ 2019], accomplishing robotics tasks such as substantially improved grasping for unseen objects [Levine ​et al.​ 2016, Kalashnikov ​et al.​ 2018], emulating observed human behavior [Sermanet ​et al.​ 2018], and navigating complex urban environments using autonomous vehicles [Angelova ​et al.​ 2015, Bansal ​et al.​ 2018].


As an illustration of the dramatic progress in the field of computer vision, Figure 1 shows a graph of the improvement over time for the Imagenet challenge, an annual contest run by Stanford University [Deng et al. 2009] in which contestants are given a training set of one million color images across 1000 categories and then use this data to train a model to generalize to an evaluation set of images across the same categories. In 2010 and 2011, prior to the use of deep learning approaches in this contest, the winning entrants used hand-engineered computer vision features and the top-5 error rate was above 25%. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used a deep neural network, commonly referred to as “AlexNet”, to take first place in the contest with a major reduction in the top-5 error rate to 16% [Krizhevsky et al. 2012]. Their team was the only team that used a neural network in 2012. The next year, the deep learning computer vision revolution was in full force, with the vast majority of entries coming from teams using deep neural networks, and the winning error rate again dropped substantially, to 11.7%. We know from a careful study performed by Andrej Karpathy that human error on this task is just above 5% if the human practices for ~20 hours, or 12% if a different person practices for just a few hours [Karpathy 2014]. Over the course of the years 2011 to 2017, the winning Imagenet error rate dropped sharply, from 26% in 2011 to 2.3% in 2017.


These advances in fundamental areas like computer vision, speech recognition, language understanding, and large-scale reinforcement learning have dramatic implications for many fields. We have seen a steady series of results in many different fields of science and medicine by applying the basic research results that have been generated over the past decade to these problem areas. Examples include promising areas of medical imaging diagnostic tasks including for diabetic retinopathy [Gulshan ​et al. 2016, Krause ​et al.​ 2018], breast cancer pathology [Liu ​et al.​ 2017], lung cancer CT scan interpretation [Ardila ​et al.​ 2019], and dermatology [Esteva ​et al.​ 2017]. Sequential prediction methods that are useful for language translation also turn out to be useful for making accurate predictions for a variety of different medically-relevant tasks from electronic medical records [Rajkomar ​et al.​ 2018]. These early signs point the way for machine learning to have a significant impact across many areas of health and medical care [Rajkomar ​et al.​ 2019, Esteva ​et al. ​2019].


Other fields that have been improved by the use of deep learning-based approaches include quantum chemistry [Gilmer ​et al.​ 2017], earthquake prediction [DeVries ​et al. ​2018], flood forecasting [Nevo 2019], genomics [Poplin ​et al.​ 2018], protein folding [Evans ​et al.​ 2018], high energy physics [Baldi ​et al.​ 2014], and agriculture [Ramcharan ​et al.​ 2017].


With these significant advances, it is clear that the potential for ML to change many different fields of endeavor is substantial.

2 Moore’s Law, Post Moore’s Law, and the Computational Demands of Machine Learning

Many of the key ideas and algorithms underlying deep learning and artificial neural networks have been around since the 1960s, 1970s, 1980s, and 1990s [Minsky and Papert 1969, Rumelhart ​et al.​ 1988, Tesauro 1994]. In the late 1980s and early 1990s there was a surge of excitement in the ML and AI community as people realized that neural networks could solve some problems in interesting ways, with substantial advantages stemming from their ability to accept very raw forms of (sometimes heterogeneous) input data and to have the model automatically build up hierarchical representations in the course of training the model to perform some predictive task. At that time, though, computers were not powerful enough to allow this approach to work on anything but small, almost toy-sized problems. Some work at the time attempted to extend the amount of computation available for training neural networks by using parallel algorithms [Shaw 1981, Dean 1990], but for the most part, the focus of most people in the AI and ML community shifted away from neural network-based approaches. It was not until the later parts of the decade of the 2000s, after two more decades of computational performance improvements driven by Moore’s Law that computers finally started to become powerful enough to train large neural networks on realistic, real-world problems like Imagenet [​Deng et al. 2009​], rather than smaller-scale, toy problems like MNIST [LeCun ​et al.​ 2000] and CIFAR [Krizhevsky ​et al.​ 2009]. In particular, the paradigm of general-purpose computing on GPU cards (GPGPU) [Luebke ​et al.​ 2006], because of GPU cards’ high floating point performance relative to CPUs, started to allow neural networks to show interesting results on difficult problems of real consequence.


It is perhaps unfortunate that just as we started to have enough computational performance to start to tackle interesting real-world problems and the increased scale and applicability of machine learning has led to a dramatic thirst for additional computational resources to tackle larger problems, the computing industry as a whole has experienced a dramatic slowdown in the year-over-year improvement of general purpose CPU performance. Figure 2 shows this dramatic slowdown, where we have gone from doubling general-purpose CPU performance every 1.5 years (1985 through 2003) or 2 years (2003 to 2010) to now being in an era where general purpose CPU performance is expected to double only every 20 years [Hennessy and Patterson 2017]. Figure 3 shows the dramatic surge in computational demands for some important recent machine learning advances (note the logarithmic Y-axis, with the best-fit line showing a doubling time in computational demand of 3.43 months for this select set of important ML research results) [OpenAI 2018]. Figure 4 shows the dramatic surge in research output in the field of machine learning and its applications, measured via the number of papers posted to the machine-learning-related categories of Arxiv, a popular paper preprint hosting service, with more than 32 times as many papers posted in 2018 as in 2009 (a growth rate of more than doubling every 2 years). There are now more than 100 research papers per day posted to Arxiv in the machine-learning-related subtopic areas, and this growth shows no signs of slowing down.

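To make the quoted doubling time concrete, the implied growth factors can be worked out directly from the 3.43-month figure (a back-of-the-envelope sketch using only the number quoted above):

```latex
% Growth in compute demand implied by a 3.43-month doubling time
2^{12/3.43} \approx 2^{3.5} \approx 11\times \text{ per year},
\qquad
2^{60/3.43} \approx 2^{17.5} \approx 1.8\times 10^{5}\times \text{ over five years}
```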

3 Machine-Learning-Specialized Hardware

In 2011 and 2012, a small team of researchers and system engineers at Google built an early distributed system called DistBelief to enable parallel, distributed training of very large scale neural networks, using a combination of model and data parallel training and asynchronous updates to the parameters of the model by many different computational replicas [Dean ​et al.​ 2012]. This enabled us to train much larger neural networks on substantially larger data sets and, by mid-2012, using DistBelief as an underlying framework, we were seeing dramatically better accuracy for speech recognition [Hinton ​et al.​ 2012] and image classification models [Le ​et al.​ 2012]. The serving of these models in demanding settings of systems with hundreds of millions of users, though, was another matter, as the computational demands were very large. One back of the envelope calculation showed that in order to deploy the deep neural network system that was showing significant word error rate improvements for our main speech recognition system using CPU-based computational devices would require doubling the number of computers in Google datacenters (with some bold-but-still-plausible assumptions about significantly increased usage due to more accuracy). Even if this was economically reasonable, it would still take significant time, as it would involve pouring concrete, striking arrangements for windmill farm contracts, ordering and installing lots of computers, etc., and the speech system was just the tip of the iceberg in terms of what we saw as the potential set of the application of neural networks to many of our core problems and products. This thought exercise started to get us thinking about building specialized hardware for neural networks, first for inference, and then later systems for both training and inference.

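As a rough illustration of the kind of asynchronous, data-parallel training DistBelief performed, the toy sketch below has several worker threads repeatedly fetch a shared parameter vector, compute a gradient on their own shard of data, and apply an update without coordinating with one another, so updates may be computed from stale parameters. The least-squares problem, shard count, and step sizes are all illustrative choices, not anything from the original system.

```python
import threading
import numpy as np

# Toy parameter-server-style asynchronous SGD: 4 "replicas" share one
# parameter vector and update it independently.
rng = np.random.default_rng(0)
true_w = rng.standard_normal(8)
X = rng.standard_normal((1024, 8))
y = X @ true_w

params = {"w": np.zeros(8)}   # stands in for the parameter server
lock = threading.Lock()       # keeps each numpy update atomic

def worker(shard, steps=200, lr=0.01):
    Xs, ys = X[shard], y[shard]
    for _ in range(steps):
        w = params["w"].copy()                        # fetch (possibly stale) parameters
        grad = 2 * Xs.T @ (Xs @ w - ys) / len(ys)     # gradient on this replica's shard
        with lock:
            params["w"] -= lr * grad                  # asynchronous update

shards = np.array_split(np.arange(len(X)), 4)         # data parallelism across 4 replicas
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.linalg.norm(params["w"] - true_w))           # close to 0 despite stale gradients
```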

3.1 Why Does Specialized Hardware Make Sense for Deep Learning Models?

Deep learning models have three properties that make them different than many other kinds of more general purpose computations. First, they are very tolerant of reduced-precision computations. Second, the computations performed by most models are simply different compositions of a relatively small handful of operations like matrix multiplies, vector operations, application of convolutional kernels, and other dense linear algebra calculations [Vanhoucke ​et al.​ 2011]. Third, many of the mechanisms developed over the past 40 years to enable general-purpose programs to run with high performance on modern CPUs, such as branch predictors, speculative execution, hyperthreaded-execution processing cores, and deep cache memory hierarchies and TLB subsystems are unnecessary for machine learning computations. So, the opportunity exists to build computational hardware that is specialized for dense, low-precision linear algebra, and not much else, but is still programmable at the level of specifying programs as different compositions of mostly linear algebra-style operations. This confluence of characteristics is not dissimilar from the observations that led to the development of specialized digital signal processors (DSPs) for telecom applications starting in the 1980s [​en.wikipedia.org/wiki/Digital_signal_processor​]. A key difference though, is because of the broad applicability of deep learning to huge swaths of computational problems across many domains and fields of endeavor, this hardware, despite its narrow set of supported operations, can be used for a wide variety of important computations, rather than the more narrowly tailored uses of DSPs. Based on our thought experiment about the dramatically increased computational demands of deep neural networks for some of our high volume inference applications like speech recognition and image classification, we decided to start an effort to design a series of accelerators called Tensor Processing Units for accelerating deep learning inference and training. The first such system, called TPUv1, was a single chip design designed to target inference acceleration [Jouppi ​et al.​ 2017].


For inference (after a model has been trained, and we want to apply the already-trained model to new inputs in order to make predictions), 8-bit integer-only calculations have been shown to be sufficient for many important models [Jouppi ​et al. 2​017], with further widespread work going on in the research community to push this boundary further using things like even lower precision weights, and techniques to encourage sparsity of weights and/or activations.

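The following minimal sketch shows the basic idea of 8-bit integer inference: weights and activations are mapped to int8 with a per-tensor scale, the matrix multiply accumulates in int32, and the result is rescaled to floating point at the end. The symmetric quantization scheme and the shapes are illustrative choices, not a description of any particular accelerator.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map floats to [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a_q, a_scale, b_q, b_scale):
    """Multiply int8 operands, accumulate in int32, rescale to float."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return acc.astype(np.float32) * (a_scale * b_scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal((1, 64)).astype(np.float32)
w_q, w_s = quantize_int8(w)
x_q, x_s = quantize_int8(x)
print(np.max(np.abs(int8_matmul(x_q, x_s, w_q, w_s) - x @ w)))  # small quantization error
```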

The heart of the TPUv1 is a 65,536 8-bit multiply-accumulate matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS). TPUv1 is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher, and was able to run production neural net applications representing about 95% of Google datacenters' neural network inference demand at the time with significant cost and power advantages [Jouppi ​et al.​ 2017].

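The peak throughput follows directly from the size of the multiply-accumulate array: counting each multiply-accumulate as two operations, and using the roughly 700 MHz clock reported for TPUv1 in [Jouppi et al. 2017],

```latex
65{,}536 \ \text{MACs} \times 2 \ \tfrac{\text{ops}}{\text{MAC}} \times 0.7\ \text{GHz}
  \approx 9.2\times 10^{13}\ \tfrac{\text{ops}}{\text{s}} \approx 92\ \text{TOPS}
```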

Inference on low-power mobile devices is also incredibly important for many uses of machine learning. Being able to run machine learning models on-device, where the devices themselves are often the source of the raw data inputs used for models in areas like speech or vision, can have substantial latency benefits as well as privacy benefits. It is possible to take the same design principles used for TPUv1 (a simple design targeting low precision linear algebra computations at high performance/Watt) and apply these principles to much lower power environments, such as mobile phones. Google’s Edge TPU is one example of such a system, offering 4 TOps in a 2W power envelope [cloud.google.com/edge-tpu/, coral.withgoogle.com/products/]. On-device computation is already critical to many interesting use cases of deep learning, where we want computer vision, speech and other kinds of models that can run directly on sensory inputs without requiring connectivity. One such example is on-device agriculture applications, like identification of diseases in plants such as cassava, in the middle of cassava fields which may not have reliable network connectivity [Ramcharan et al. 2017].


With the widespread adoption of machine learning and its growing importance as a key type of computation in the world, a Cambrian-style explosion of new and interesting accelerators for machine learning computations is underway. There are more than XX venture-backed startup companies, as well as a variety of large, established companies, that are each producing various new chips and systems for machine learning. Some, such as Cerebras [​www.cerebras.net/​], Graphcore [​www.graphcore.ai/​], and Nervana (acquired by Intel) [​www.intel.ai/nervana-nnp/​] are focused on a variety of designs for ML training. Others, such as Alibaba [​www.alibabacloud.com/blog/alibaba-unveils-ai-chip-to-enhance-cloud-computing-power_595409​] are designing chips focused on inference. Some of the designs eschew larger memory-capacity DRAM or HBM to focus on very high performance designs for models that are small enough that their entire set of parameters and intermediate values fit in SRAM. Others focus on designs that include DRAM or HBM that make them suitable for larger-scale models. Some, like Cerebras, are exploring full wafer-scale integration. Others, such as Google’s Edge TPUs [​cloud.google.com/edge-tpu/​] are building very low power chips for inference in environments such as mobile phones and distributed sensing devices.


Designing customized machine learning hardware for training (rather than just inference) is a more complex endeavor than building single-chip inference accelerators. The reason is that single-chip systems for training are unable to solve many problems that we want to solve in reasonable periods of time (e.g. hours or days, rather than weeks or months), because a single-chip system cannot deliver sufficient computational power. Furthermore, the desire to train larger models on larger data sets is such that, even if a single chip could deliver enough computation to solve a given problem in a reasonable amount of time, that would just mean that we would often want to solve even larger problems (necessitating the use of multiple chips in a parallel or distributed system anyway). Therefore, designing training systems is really about designing larger-scale, holistic computer systems, and requires thinking about individual accelerator chip design, as well as high performance interconnects to form tightly coupled machine learning supercomputers. Google’s second- and third-generation TPUs, TPUv2 and TPUv3 [cloud.google.com/tpu/], are designed to support both training and inference, and the basic individual devices, each consisting of four chips, were designed to be connected together into larger configurations called pods. Figure 5 shows the block diagram of a single Google TPUv2 chip, with two cores, with the main computational capacity in each core provided by a large matrix multiply unit that can yield the results of multiplying a pair of 128x128 matrices each cycle. Each chip has 16 GB (TPUv2) or 32 GB (TPUv3) of attached high-bandwidth memory (HBM). Figure 6 shows the deployment form of a Google TPUv3 pod of 1024 accelerator chips, consisting of eight racks of chips and accompanying servers, with the chips connected together in a 32x32 toroidal mesh, providing a peak system performance of more than 100 petaflop/s.

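A rough consistency check on these pod numbers (a sketch only; per-chip clock rates and unit counts are not given in this text): the stated peak of more than 100 petaflop/s over 1024 chips implies on the order of 100 teraflop/s per chip, and if each 128x128 unit sustains one multiply-add per cell per cycle at a clock near 1 GHz, a small number of such units per chip accounts for that total.

```latex
\frac{100\ \text{Pflop/s}}{1024\ \text{chips}} \approx 100\ \tfrac{\text{Tflop/s}}{\text{chip}},
\qquad
128 \times 128 \times 2\ \tfrac{\text{flops}}{\text{cycle}} \times 1\ \text{GHz}
  \approx 33\ \text{Tflop/s per } 128\times128 \text{ unit}
```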

3.2 Low Precision Numeric Formats for Machine Learning

TPUv2 and TPUv3 use a custom-designed floating point format called bfloat16 [Wang and Kanwar 2019], which departs from the IEEE half-precision 16-bit format to provide a format that is more useful for machine learning and also enables much cheaper multiplier circuits. bfloat16 was originally developed as a lossy compression technique to help reduce bandwidth requirements during network communications of machine learning weights and activations in the DistBelief system, and was described briefly in section 5.5 of the TensorFlow white paper [Abadi et al. 2016, sec. 5.5]. It has been the workhorse floating point format in TPUv2 and TPUv3 since 2015. In December 2018, Intel announced plans to add bfloat16 support to future generations of Intel processors [Morgan 2018].


Figure 7 below shows the split between sign, exponent, and mantissa bits for the IEEE fp32 single-precision floating point format, the IEEE fp16 half-precision floating point format, and the bfloat16 format.

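Since bfloat16 keeps fp32's 8-bit exponent and simply shortens the mantissa (1 sign + 8 exponent + 7 mantissa bits, versus 1 + 8 + 23 for fp32 and 1 + 5 + 10 for fp16), a conversion can be sketched as keeping the top 16 bits of the fp32 encoding. The snippet below uses plain truncation to make the bit layout obvious; real hardware typically rounds rather than truncates.

```python
import numpy as np

def fp32_to_bfloat16_bits(x):
    """Keep the top 16 bits of the fp32 encoding: sign, 8 exponent, 7 mantissa bits."""
    return np.uint16(np.float32(x).view(np.uint32) >> np.uint32(16))

def bfloat16_bits_to_fp32(b):
    """Re-expand by padding the dropped 16 mantissa bits with zeros."""
    return (np.uint32(b) << np.uint32(16)).view(np.float32)

x = np.float32(3.14159265)
b = fp32_to_bfloat16_bits(x)
print(hex(int(b)), float(bfloat16_bits_to_fp32(b)))  # 0x4049, 3.140625: same range, less precision
```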

As it turns out, machine learning computations used in deep learning models care more about dynamic range than they do about precision. Furthermore, one major area and power cost of multiplier circuits for a floating point format with M mantissa bits is the (M+1) × (M+1) array of full adders that is needed for multiplying together the mantissa portions of the two input numbers. The IEEE fp32, IEEE fp16 and bfloat16 formats need 576 full adders, 121 full adders, and 64 full adders, respectively. Because multipliers for the bfloat16 format require so much less circuitry, it is possible to put more multipliers in the same chip area and power budget, thereby meaning that ML accelerators employing this format can have higher flops/sec and flops/Watt, all other things being equal. Reduced precision representations also reduce the bandwidth and energy required to move data to and from memory or to send it across interconnect fabrics, giving further efficiency gains.

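The full-adder counts quoted above are just (M+1)² evaluated for each format's mantissa width:

```latex
\text{fp32 } (M=23):\ (23+1)^2 = 576 \qquad
\text{fp16 } (M=10):\ (10+1)^2 = 121 \qquad
\text{bfloat16 } (M=7):\ (7+1)^2 = 64
```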

3.3 The Challenge of Uncertainty in a Fast Moving Field

One challenge for building machine learning accelerator hardware is that the ML research field is moving extremely fast (as witnessed by the growth and absolute number of research papers published per year shown in Figure 4). Chip design projects that are started today often take 18 months to 24 months to finish the design, fabricate the semiconductor parts and get them back and install them into a production datacenter environment. For these parts to be economically viable, they typically must have lifetimes of at least three years. So, the challenge for computer architects building ML hardware is to predict where the fast moving field of machine learning will be in the 2 to 5 year time frame. Our experience is that bringing together computer architects, higher-level software system builders and machine learning researchers to discuss co-design-related topics like “what might be possible in the hardware in that time frame?” and “what interesting research trends are starting to appear and what would be their implications for ML hardware?” is a useful way to try to ensure that we design and build useful hardware to accelerate ML research and production uses of ML.


4 Machine Learning for Chip Design

One area that has significant potential is the use of machine learning to learn to automatically generate high quality solutions for a number of different NP-hard optimization problems that exist in the overall workflow for designing custom ASICs. For example, currently placement and routing for complex ASIC designs takes large teams of human placement experts to iteratively refine from high-level placement to detailed placement as the overall design of an ASIC is fleshed out. Because there is considerable human involvement in the placement process, it is inconceivable to consider radically different layouts without dramatically affecting the schedule of a chip project once the initial high level design is done. However, placement and routing is a problem that is amenable to the sorts of reinforcement learning approaches that were successful in solving games, like AlphaGo. In placement and routing, a sequence of placement and routing decisions all combine to affect a set of overall metrics like chip area, timing, and wire length. By having a reinforcement learning algorithm learn to “play” the game of placement and routing, either in general across many different ASIC designs, or for a particular ASIC design, with a reward function that combines the various attributes into a single numerical reward function, and by applying significant amounts of machine-learning computation (in the form of ML accelerators), it may be possible to have a system that can do placement and routing more rapidly and more effectively than a team of human experts working with existing electronic design tools for placement and routing. We have been exploring these approaches internally at Google and have early preliminary-but-promising looking results. The automated ML based system also enables rapid design space exploration, as the reward function can be easily adjusted to optimize for different trade-offs in target optimization metrics.

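The single numerical reward mentioned above might, for example, collapse several physical-design metrics into one scalar that the reinforcement learning agent maximizes. The sketch below is purely illustrative: the metric names, units, and weights are hypothetical, and a real system would compute these quantities from the evolving placement.

```python
def placement_reward(wirelength, congestion, timing_slack_ns, area_mm2,
                     w_wl=1.0, w_cong=0.5, w_timing=2.0, w_area=0.1):
    """Combine placement/routing metrics into a single scalar reward.

    Lower wirelength, congestion, and area are better (negative terms);
    only negative timing slack (a timing violation) is penalized.
    Changing the weights is one way to explore different design trade-offs.
    """
    return (-w_wl * wirelength
            - w_cong * congestion
            - w_area * area_mm2
            + w_timing * min(timing_slack_ns, 0.0))

# Two candidate placements of the same (hypothetical) netlist:
print(placement_reward(wirelength=120.0, congestion=0.8, timing_slack_ns=-0.10, area_mm2=4.2))
print(placement_reward(wirelength=135.0, congestion=0.5, timing_slack_ns=0.05, area_mm2=4.0))
```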

Furthermore, it may even be possible to train a machine learning system to make a whole series of decisions from high-level synthesis down to actual low-level logic representations and then perform placement and routing of these low-level circuits into a physical realization of the actual high level design in a much more automated and end-to-end fashion. If this could happen, then it’s possible that the time for a complex ASIC design could be reduced substantially, from many months down to weeks. This would significantly alter the tradeoffs involved in deciding when it made sense to design custom chips, because the current high level of non-recurring engineering expenses often mean that custom chips or circuits are designed only for the highest volume and highest value applications.


5 Machine Learning for Semiconductor Manufacturing Problems

With the dramatic improvements in computer vision over the past decade, there are a number of problems in the domain of visual inspection of wafers during the semiconductor manufacturing process that may be amenable to more automation, or to improved accuracy over the existing approaches in this area. By detecting defects earlier or more accurately, we may be able to achieve higher yields or reduced costs. A survey of these approaches provides a general sense of the area [Huang and Pan 2015].


6 Machine Learning for Learned Heuristics in Computer Systems

Another opportunity for machine learning is in the use of learned heuristics in computer systems such as compilers, operating systems, file systems, networking stacks, etc. Computer systems are filled with hand-written heuristics that have to work in the general case. For example, compilers must make decisions about which routines to inline, which instruction sequences to choose, which of many possible loop nesting structures to use, and how to lay out data structures in memory [Aho et al. 1986]. Low-level networking software stacks must make decisions about when to increase or decrease the TCP window size, when to retransmit packets that might have been dropped, and whether and how to compress data across network links with different characteristics. Operating systems must choose which blocks to evict from their buffer cache, which processes and threads to schedule next, and which data to prefetch from disk [Tanenbaum and Woodhull 1997]. Database systems choose execution plans for high-level queries, make decisions about how to lay out high level data on disks, and which compression methods to use for which pieces of data [Silberschatz et al. 1997].

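To make the contrast concrete, the sketch below takes one of the examples from this paragraph, buffer-cache eviction, first as a hand-written rule and then as a learned scorer over a few per-block features. The features and weights are made up for illustration; a real learned heuristic would be trained on traces of actual system behavior.

```python
import time

def handwritten_victim(cache):
    """Classic hand-coded heuristic: evict the least recently used block."""
    return min(cache, key=lambda blk: cache[blk]["last_access"])

def learned_victim(cache, weights=(1.0, 0.5, 0.2)):
    """Learned heuristic: score each block's usefulness from several features
    (recency, access frequency, cost to refetch) and evict the lowest scorer."""
    w_recency, w_freq, w_cost = weights
    now = time.time()
    def keep_score(blk):
        f = cache[blk]
        return (-w_recency * (now - f["last_access"])
                + w_freq * f["access_count"]
                + w_cost * f["refetch_cost"])
    return min(cache, key=keep_score)

cache = {
    "block_a": {"last_access": time.time() - 5, "access_count": 40, "refetch_cost": 1.0},
    "block_b": {"last_access": time.time() - 60, "access_count": 500, "refetch_cost": 9.0},
}
print(handwritten_victim(cache), learned_victim(cache))  # they can disagree: 'block_b' vs 'block_a'
```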

The potential exists to use machine-learned heuristics to replace hand-coded heuristics, with the ability for these ML heuristics to take into account much more contextual information than is possible in hand-written heuristics, allowing them to adapt more readily to the actual usage patterns of a system, rather than being constructed for the average case. Other uses of ML can replace traditional data structures like B-trees, hash tables, and Bloom filters with learned index structures that can take advantage of the actual distribution of data being processed by a system to produce indices that are higher performance while being 20X to 100X smaller [Kraska et al. 2018].

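A toy version of the learned-index idea from [Kraska et al. 2018] is sketched below: fit a model from key to position in a sorted array, record the worst-case prediction error while building the index, and answer lookups by searching only within that error window. A single linear fit stands in here for the hierarchies of small models used in the actual work.

```python
import bisect
import numpy as np

keys = np.sort(np.random.default_rng(0).integers(0, 10**9, size=100_000))
positions = np.arange(len(keys))

# "Model": one linear fit from key to position, plus its worst-case error.
slope, intercept = np.polyfit(keys, positions, deg=1)
pred = np.clip(slope * keys + intercept, 0, len(keys) - 1)
max_err = int(np.ceil(np.max(np.abs(pred - positions))))

def lookup(key):
    guess = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)   # search only the error window
    return i if i < len(keys) and keys[i] == key else None

pos = lookup(int(keys[1234]))
print(pos is not None and keys[pos] == keys[1234])           # True: found within the window
```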

7 Future Machine Learning Directions

A few interesting threads of research are occurring in the ML research community at the moment that will likely be even more interesting if combined together.


First, work on sparsely-activated models, such as the sparsely-gated mixture of experts model [Shazeer et al. 2017], shows how to build very large capacity models where just a portion of the model is “activated” for any given example (say, just 2 or 3 experts out of 2048 experts). The routing function in such models is trained simultaneously and jointly with the different experts, so that the routing function learns which experts are good at which sorts of examples, and the experts simultaneously learn to specialize for the characteristics of the stream of examples they are given. This is in contrast with most ML models today where the whole model is activated for every example. Table 4 in Shazeer et al. 2017 showed that such an approach can be simultaneously ~9X more efficient for training and ~2.5X more efficient for inference, while also achieving higher accuracy (+1 BLEU point for a language translation task).

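A stripped-down forward pass in the spirit of a sparsely-gated mixture-of-experts layer is sketched below: a gating network scores all experts for an input, only the top-k experts are evaluated, and their outputs are combined with renormalized gate weights. The shapes, initialization, and absence of any load-balancing term are simplifications for illustration, not the exact formulation in [Shazeer et al. 2017].

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 32, 8, 2
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1      # gating network
experts = [rng.standard_normal((d_model, d_model)) * 0.1       # expert parameters
           for _ in range(n_experts)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    logits = x @ gate_w                    # score every expert...
    chosen = np.argsort(logits)[-top_k:]   # ...but only run the top-k of them
    gates = softmax(logits[chosen])        # renormalize over the chosen experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)  # (32,): produced by just 2 of the 8 experts
```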

Second, work on automated machine learning (AutoML) has shown that techniques such as neural architecture search [Zoph and Le 2016, Pham et al. 2018] or evolutionary architectural search [Real et al. 2017, Gaier and Ha 2019] can automatically learn effective structures and other aspects of machine learning models or components in order to optimize accuracy for a given task. These approaches often involve running many automated experiments, each of which may involve significant amounts of computation.

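The loop below is a deliberately tiny stand-in for this kind of automated search: sample candidate configurations from a search space, evaluate each one, and keep the best. The search space, the random-sampling strategy, and the stub scoring function are all placeholders; real neural architecture search or evolutionary search would train each candidate (or a weight-sharing proxy) and spend far more computation.

```python
import random

search_space = {
    "num_layers": [2, 4, 8, 16],
    "width": [64, 128, 256],
    "kernel_size": [3, 5, 7],
}

def sample_architecture(rng):
    return {name: rng.choice(options) for name, options in search_space.items()}

def evaluate(arch):
    # Stub standing in for "train this candidate and return its validation accuracy".
    return 1.0 / (1 + abs(arch["num_layers"] - 8)) + arch["width"] / 1000.0

rng = random.Random(0)
candidates = [sample_architecture(rng) for _ in range(50)]
best = max(candidates, key=evaluate)
print(best, evaluate(best))
```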

Third, multi-task training at modest scales of a few to a few dozen related tasks, or transfer learning from a model trained on a large amount of data for a related task and then fine-tuned on a small amount of data for a new task, has been shown to be very effective in a wide variety of problems [Devlin et al. 2018]. So far, most use of multi-task machine learning has been in the context of a single modality (e.g. all visual tasks, or all textual tasks) [Doersch and Zisserman 2017], although a few authors have considered multi-modality settings as well [Ruder 2017].

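Transfer learning in miniature: in the sketch below a "pretrained" feature extractor is frozen and only a small task-specific head is fit on a handful of labelled examples for the new task. A fixed random projection stands in for a network trained on a large source dataset, and the data, sizes, and learning rate are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
pretrained_w = rng.standard_normal((100, 16)) * 0.1     # frozen "backbone"

def features(x):
    return np.tanh(x @ pretrained_w)                    # never updated during fine-tuning

X_new = rng.standard_normal((40, 100))                  # small labelled set for the new task
y_new = (X_new[:, 0] > 0).astype(float)

head = np.zeros(16)                                     # the only trained parameters
for _ in range(500):                                    # fine-tune just the head
    f = features(X_new)
    p = 1.0 / (1.0 + np.exp(-f @ head))                 # logistic head
    head -= 0.5 * f.T @ (p - y_new) / len(y_new)

preds = (1.0 / (1.0 + np.exp(-features(X_new) @ head))) > 0.5
print((preds == (y_new > 0.5)).mean())                  # fit of the head on the small new-task set
```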

A particularly interesting research direction puts these three trends together, with a system running on large-scale ML accelerator hardware, with a goal of being able to train a model that can perform thousands or millions of tasks in a single model. Such a model might be made up of many different components of different structures, with the flow of data between examples being relatively dynamic on an example-by-example basis. The model might use techniques like the sparsely-gated mixture of experts and learned routing in order to have a very large capacity model [Shazeer ​et al.​ 2017], but where a given task or example only sparsely activates a small fraction of the total components in the system (and therefore keeps computational cost and power usage per training example or inference much lower). An interesting direction to explore would be to use dynamic and adaptive amounts of computation for different examples, so that “easy” examples use much less computation than “hard” examples (a relatively unusual property in the machine learning models of today). Figure 8 depicts such a system.

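One way to picture example-dependent, adaptive computation is an early-exit scheme: components are applied one at a time and an example stops as soon as an intermediate prediction looks confident enough, so "easy" examples use fewer components than "hard" ones. The components, readout, and confidence threshold below are all illustrative stand-ins, not a description of the system sketched in Figure 8.

```python
import numpy as np

rng = np.random.default_rng(0)
components = [rng.standard_normal((16, 16)) * 0.3 for _ in range(6)]  # stack of components
readout = rng.standard_normal(16) * 0.3

def predict(x, confidence_threshold=0.6):
    used = 0
    for layer in components:
        x = np.tanh(x @ layer)
        used += 1
        score = x @ readout                     # cheap intermediate prediction
        if abs(score) > confidence_threshold:   # confident enough: stop computing
            break
    return float(score), used

for example in rng.standard_normal((5, 16)):
    score, used = predict(example)
    print(f"prediction {score:+.2f} using {used} of {len(components)} components")
```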

Each component might itself be running some AutoML-like architecture search [Pham et al. 2017], in order to adapt the structure of the component to the kinds of data that are being routed to that component. New tasks can leverage components trained on other tasks when that is useful. The hope is that through very large scale multi-task learning, shared components, and learned routing, the model can very quickly learn to accomplish new tasks to a high level of accuracy, with relatively few examples for each new task (because the model is able to leverage the expertise and internal representations it has already developed in accomplishing other, related tasks).


Building a single machine learning system that can handle millions of tasks, and that can learn to successfully accomplish new tasks automatically, is a true grand challenge in the field of artificial intelligence and computer systems engineering: it will require expertise and advances in many areas, spanning solid-state circuit design, computer networking, ML-focused compilers, distributed systems, and machine learning algorithms in order to push the field of artificial intelligence forward by building a system that can generalize to solve new tasks independently across the full range of application areas of machine learning.


8 Conclusion

The advances in machine learning over the past decade are already affecting a huge number of fields of science, engineering, and other forms of human endeavor, and this influence is only going to increase. The specialized computational needs of machine learning combined with the slowdown of general-purpose CPU performance improvements in the post-Moore’s Law-era represent an exciting time for the computing hardware industry [Hennessy and Patterson 2019​]:​ we now have a set of techniques that seem to be applicable to a vast array of problems across a huge number of domains, where we want to dramatically increase the scale of the models and datasets on which we can train these models, and where the impact of this work will touch a vast fraction of humanity. As we push the boundaries of what is possible with large-scale, massively multi-task learning systems that can generalize to new tasks, we will create tools to enable us to collectively accomplish more as societies and to advance humanity. We truly live in exciting times.
