So welcome to the
Paul G. Allen School’s first distinguished lecture
of this academic year. It’s our great pleasure
to welcome Jeff Dean. Jeff is Google’s
Senior Fellow and runs Google’s Research Organization
of about 4,000 people. It’s been an interesting path of twists and turns. He received his
PhD here in 1996, spent three years at Digital
Equipment Corporation’s western research lab, and
then moved to Google in 1999 as probably one of the first
few dozen people there. All right. And until a few years ago,
had worked as an individual contributor, which didn’t
mean working by himself, would just mean he didn’t have
a 4,000-person management role. When I asked Jeff about sort of
why on earth he took this on, it was sort of progressive. He started with, like, just
2,000 or something like that, right? [LAUGHTER] OK. And he said it would stretch him
a little, and I’m sure it has. But as an individual
contributor, Jeff really designed
and implemented a large part of Google’s scalable infrastructure that made it such a success. Five generations of the crawling, and indexing, and serving systems, MapReduce, Bigtable, a lot of that with his close
friend Sanjay Ghemawat, who is an MIT PhD. So those of you
who are wondering about the role of PhDs in
industry, the vast majority of the scalable infrastructure at Google was built by a UW PhD and an MIT PhD, both of whom spent a few years
at DEC, before– which is where all
great engineers were in the 1970s, and
’80s, and early ’90s. Jeff then turned his
attention to machine learning, where he’s made just
incredible marks, contributing to projects including Google Translate, DistBelief, and TensorFlow. So Jeff, welcome back to the
University of Washington, and we look forward
to hearing your talk. Thanks for being here. Thank you very much. [APPLAUSE] I’m delighted to be back here. There’s a lot of changes. A brand new building. The other building was brand new
a few years after I graduated. So it’s great to be here. And what I’m going
to be doing today is talking to you
about some of the work we’ve been doing
in Google research on sort of basic advances in
the field of machine learning, and also how some
of those advances are going to influence different
kinds of problems in our world. And this is joint work with
many, many people at Google. So this is not all my work. Some of it– I’m a collaborator
with many people, some of it is just independent work
done by our researchers that I think you
should hear about. So with that, let’s
talk about where we are. So one interesting trend that
has happened in the last 10 years is there’s been a lot more
interest in machine learning and machine learning
research in particular. So this is a chart that shows
the number of machine learning research papers posted on
Arxiv every year, which– Arxiv is a paper
preprint hosting service. And the red line there is kind
of an exponential curve showing the Moore’s law growth
rate of roughly doubling every two years that we got
really nice and accustomed to in the fat and happy days of
computational growth for about 40 or 50 years from, like,
1950 through 2008 or 7. And unfortunately, Moore’s law
and computational performance has now slowed down,
but we’ve replaced it with growth in machine
learning research ideas, which is kind of nice. And this is an exciting space
with both broad applications across many areas
and also fundamental interesting basic
research happening. It’s hard to keep up. I don’t envy graduate students
these days trying to keep up. It’s hard, so don’t feel bad. And one of the most
successful areas of machine learning in
the last, say, 10 years is a modern reincarnation of
some ideas that have actually been around for 30
or 40 years that’s now been re-branded
as deep learning, but is really the sort of
ideas behind artificial neural networks– basically,
these collections of simple mathematical units
that can be trained, organized in layers that
progressively build on each other to build
higher and higher-level representations of things
with the goal of accomplishing some end-to-end task. So a lot of the ideas that underpin this work today were developed 30 or 40 years ago. My colleague Geoff Hinton, along with
Yoshua Bengio and Yann LeCun actually won the
Turing Award this year for their contributions to these
basic ideas 30 or 40 years ago, as well as a
progression of ideas that they developed
over that period. And the key benefit of these
deep learning approaches is that you can feed in
very raw forms of data and learn from this kind of
raw, heterogeneous, noisy data, rather than explicitly
hand-engineering things and then having a
machine learning model that might be shallower that
learns combinations of these hand-engineered
features. Instead, the features
kind of develop themselves through the learning process. And the really nice thing
is these basic techniques are applicable across a bunch
of different kinds of modalities of data. So for example, we
can put in pixels into one of these models,
and by training it on lots of examples of things
and categories of things, we can then have it
be able to predict, given a new image,
what category of thing is represented in that image. So pixels of the image, we can
then predict that’s a leopard.
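As a rough illustration of that pixels-to-category idea, here is a minimal image-classifier sketch in TensorFlow; the layer sizes and the 1,000-way output are illustrative, not the actual production model:

```python
import tensorflow as tf

# Minimal sketch: raw pixels in, one of 1,000 category scores out.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),   # higher-level features
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1000, activation="softmax"),  # one score per category
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(images, labels)  # train on many (pixels, category) example pairs
```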
We can put in audio waveforms and then learn to transcribe
them– basically, build a complete end-to-end
learned speech recognition system directly from example
pairs of audio waveform and what was being said
in that audio waveform. Similarly, we can take in
language one word or one piece of a word at a time– Hello, how are you?– and with enough
training examples of aligned sentence pairs– sentences that mean the
same thing in, say, English and French– learn to train an
end-to-end machine translation system
on that parallel data. With very little algorithmic
code in that system– hundreds of lines of machine learning
code and lots and lots of examples of
English-French sentences, we can actually learn
a very high-quality English-French
translation system. Hello, how are you– Bonjour, comment allez-vous? You can actually combine
these modalities. And so instead of
classifying an image, you can train a model from a
few hundred thousand examples of images with small,
human-written captions to actually emit captions
that are meaningful based on the pixel
content of that image. So for example, here’s one
of my vacation photos, which was cool, and the
model can learn to emit something like a cheetah
lying on top of a car, right? And that’s actually
showing a fair amount of interesting understanding
of the relationship of different objects
in that scene, and knows that
there’s a cheetah, that it’s sitting on top of a
car, and that there’s a car. So that’s pretty
interesting, and just to give you a sense of the
progress just in the last eight or nine years, you
know, I was actually pretty excited as an undergrad
about neural networks. In the first wave of
excitement in the late ’80s and early ’90s, there was a
lot of interest in this space because those techniques seemed
to be able to solve really interesting, toy-sized
problems in really interesting and creative
ways that we didn’t know how to solve in other ways. And the unfortunate
thing is we couldn’t make them scale to large-scale,
real-world problems at that time, and so a lot of
interest kind of faded away. As an undergrad in 1990,
I said, oh, well, we just need more compute. So maybe I’ll try as an
undergrad thesis project training– getting neural net
training parallelized on the 64-processor
machine in the department. And that’ll be
great because we’ll have 64 times as much compute. It turned out we needed like a
million times as much compute, not 64 times. But then, starting
about 2008 or 9, we started to have that much
compute, and these devices, these approaches
really started to sing. So just to give you an
example, in the field of computer vision,
Stanford every year runs an ImageNet
challenge contest where you get a
million images in 1,000 categories, 1,000
images per category. And your goal is to take, then,
a bunch of test images you’ve never seen before in
those same categories and predict what
category is represented. And in 2011, no one
used a neural network, and the winning entrant in
that contest of about 30 teams got 26% error in this task. And we know it’s a
reasonably difficult task, so humans get about 5% error. A Stanford grad student at
the time, Andrej Karpathy, wrote up a really nice blog
post about his experience training himself to do the
ImageNet classification challenge. He did 100 hours
of training, you know, sitting there with a list
studying photos and saying, oh, I think that’s
an Irish Setter, and that one’s a Doberman. And he got about 5% error. He convinced one of his
lab mates to do this, but they got tired, and they
only did 10 hours of training. They got 12% error. So this is reasonably
hard, because there’s a fair number of fine-grained
distinctions you need to make. And in 2016, the winner of that
contest got 3% error, right? So think about that progression
in a relatively short period of time. We’ve suddenly
gone from computers not really being able to see to
computers can now see, right? And that’s– if you think
back in evolutionary terms, when animals evolved eyes,
that was probably a big deal? That’s probably the same point
we’re at in computing now. We’ve got eyes; what
can we make them do? So the rest of
the talk I’m going to frame in terms of what
are these capabilities going to mean in terms of
what kinds of things can we do in important
problems in the world? So in 2008, the US National
Academy of Engineering convened a bunch of
experts across a bunch of different
disciplines to say, what are the problems we should be
thinking about from a science and engineering perspective
for the next 100 years– well, 92 years. They waited eight years
in the 21st century, so we lost eight
years off the top. But they came up with this
list of 14 different topics, and I think this is
a pretty nice list. We’d all agree, I think,
that if we made progress on most of these,
or some of these, it would be a good thing. The world would be less warm. We’d have much better,
healthier people. We’d be able to have
education spread more widely. So we have work
going on in Google research for all the
things listed in red, and I’m going to focus
on the ones in boldface in the rest of the talk. So that’s the roadmap. So one of them is restore and
improve urban infrastructure. And we’re on the cusp of this
kind of exciting sort of change in how we think about
transportation, which is basically that we are close
to having autonomous vehicles that can operate in messy
environments like our roads and streets. So this is largely, now, made
more possible by the fact that computer
vision works, right? It’s very hard to build
an autonomous vehicle if you can’t really
see very well, but now we have models that
can take the kind of data that these autonomous vehicles
collect from their sensors– they have spinning
LIDAR that gives them 3D depth information. They have a bunch of
cameras all around the car. They have radar, infrared
radar in some cases. And they need to fuse all
that information together into a model of what’s going
on in the world around them– which other things
exist in the world? Is that a car next to me? There’s something up
there; is that a light post or a pedestrian? If it’s a light post, it’s
probably not going to move. And then, be able to
understand all the traffic signs, and the street lights,
and the regulations involved in those traffic
flow laws, and be able to make safe decisions
about what they’re going to do in order
to get from A to B, obeying all the traffic
laws and being safe. So Alphabet, which is
Google’s parent company, has a subsidiary called
Waymo that we collaborate with quite a lot on various
perceptual problems and control problems. And Waymo, for the last
year and a little bit, has been running trials
in Phoenix, Arizona, where we actually have cars with
real passengers in the backseat and no safety drivers
in the front seat. So this is not a distant,
far-off kind of thing. This is something that,
in the streets of Phoenix, is actually happening,
where we’re actually moving passengers around in Phoenix. Now, admittedly,
Phoenix, Arizona, for those of you
who think about it, is a somewhat easier, training
wheel-like environment. The other drivers
are pretty slow. [LAUGHTER] It doesn’t really rain. Rain is kind of an annoyance. It’s kind of hot, so there
aren’t that many pedestrians. [LAUGHTER] The streets are really
wide and rectangular. But still, there are cars
driving around Phoenix, Arizona autonomously, and that’s
something that’s coming. And it will actually
change how we think about building city and
urban infrastructure, right? Where we wouldn’t
need parking lots; you would summon the
right kind of car. If it’s just me, it’d
be a small one. If it’s me at Home
Depot with my 2 by 4s, it might be an autonomous
pickup or something– who knows? So I think this is a
pretty exciting advance, and in general, the field
of robotics, I think, is also going to undergo a lot
of change because of the fact that we can now see and that we
now have reinforcement learning methods that actually enable
us to learn to accomplish new skills in a more learned way
than the traditional hand-coded robotic control algorithms
of the past decades. So one area we’ve
been looking at is, how do you teach
robots to pick things up? If you want to do
things in the world, that seems like a very
good, basic building block to do anything. And in 2015, on that
particular task, the success rate for
picking up an unseen kind of object, something
you’ve never seen before, was about 65%. And in 2016, we
did some work where we got a collection of
physical robots in a room– about 10 of them. We called it the arm farm. And the arm farm robots would
practice picking things up. The nice thing about that is
it’s a supervised problem. So if I succeed in
picking something up, my gripper doesn’t
close all the way. If I try and fail, my
gripper closes all the way, and I can tell that I’ve failed. And so they can just practice.
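A tiny sketch of the self-labeling trick being described; the closure threshold here is made up for illustration:

```python
def grasp_succeeded(gripper_closure, closed_threshold=0.95):
    """If the gripper closed all the way, it grabbed nothing, so the attempt
    labels itself a failure -- free supervision, no human annotator needed."""
    return gripper_closure < closed_threshold

# Each practice attempt yields a training example, roughly:
# example = (camera_image, arm_motion, grasp_succeeded(measured_closure))
```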
Like you’ve seen, maybe, your little nephew or niece sitting on the carpet
trying to pick up things. That’s what these
robotic arms can do. And with that, and pooling
their experience– so unlike your niece or
nephew, the 10 arms can all save their
data to hard drives and collectively train
a model tonight that improves their ability
to grasp tomorrow based on their collective experience. And with that, we
got to about 78% grasping success rate
on unseen objects. And then, with a bit more
work and some more focus on more reinforcement learning
algorithms rather than just purely supervised
grasping techniques, we actually improved
things quite a bit more. We’re now up to about
96% grasping success rate for unseen objects. So again, going from a third
of the time you fail to now 96% of the time
you succeed, that’s a pretty significant
improvement, and it’s a good basic
building block for lots of robotic kinds of things. Since we’re in the
Amazon auditorium, I’ll mention that the
way we did this is we purchased lots of variety
packs of tools and toys on Amazon so that
they could practice. [LAUGHTER] Another thing we want
robots to be able to do is to acquire new
skills quickly. And one way they can
do that is by observing people doing things that we want
robots to be able to emulate. And so for example, here
is one of our AI residents doing important work. No, I’m just kidding. They do great machine
learning research, but they also create
training videos for our robotics simulators. And you can see, here we’ve
got an algorithm that’s trying to emulate
what it’s observing in the pixel
content of the video with a simulated robotic arm. And on the right,
what you see is a system where it gets to
see about 10 short video clips of people pouring things
into different shaped vessels and from different
angles and orientations, and then it gets to try its
hand, so to speak, at pouring, and it gets to try
about 10 times. And so in about 15 minutes, it’s
gone from never having poured before, getting these 10 videos,
and then doing 10 or 15 trials of this, and it’s managed
to teach itself to pour at the, maybe,
four-year-old human level– rather than eight-year-old
human level, which is maybe what you want. But that’s pretty good. If you think about what would
be required to teach a system to pour, if you were
hand-specifying a control algorithm, it would
be very brittle. It would take you weeks or
months of handwriting software and basically wouldn’t
really generalize or work in any other environment. Like if the background turned
gray, or black, or something, it probably wouldn’t work. So we’re excited about this
general approach of learning from demonstrations. We think there’s a
lot of, as I say, YouTube videos of people
flipping pancakes, and maybe we can have– use those. OK, another area we’re
pretty excited about is advanced health informatics. And this is a really
broad area, so I’m going to do a bit of
a dive on one area and then talk generally
about a few other things that we’re doing in less detail. So one of the potentials of
machine learning is that we can take human physicians’ and clinicians’ decisions, apply machine learning to those decisions, and then generalize and bring those
decision-making capabilities to a machine learning model
that can be used to assist clinicians or, in some cases,
be used in areas where there aren’t very many physicians– to basically bring this
kind of specialist expertise to many more people. So I’m going to walk through one
of the areas we’ve been working on the most in the medical
imaging sets of problems, which is diagnosing
diabetic retinopathy, which is a degenerative eye
disease, as one example. But we’re also doing a bunch of
work in the areas of radiology, and pathology, and dermatology,
other kinds of modalities. Basically, now that
computer vision now works, there’s a lot of medical
imaging specialties that you can apply
machine learning to. So why this problem? It’s the fastest-growing
cause of preventable blindness in the world. And it’s a side
effect of diabetes, and so that means that there’s
400 million people at risk for this all around the world. And the screening
is specialized. A doctor can’t do it. You need to be an
ophthalmologist in order to really interpret
the retinal images and understand what
you’re looking at. And so for example,
in India, there’s a shortage of 120-something
thousand eye doctors to do this screening,
which is basically you’re supposed to be screened
annually if you’re at risk. And as a result, 45% of patients
lose full or partial vision before they’re diagnosed. And this is a complete
tragedy because this is a very treatable disease
if you catch it in time. There’s a well-known treatment
that’s nearly 100% effective. And so if we can get screening
to be more automated, or assist with the
screening process, we can actually help with this
process, with this problem. And it turns out that
general computer vision models for determining,
is that a car, or a truck, or a doberman also, with
the right training data, can be used to assess one
of these retinal images and make a classification of
one, two, three, four, or five, which is how these are graded. And so you can basically take
an off-the-shelf computer vision model, train it on
retinal images– so we got about 130,000 retinal
images and found a Mechanical Turk-like network of ophthalmologists who wanted to label them– we actually assembled this network of ophthalmologists ourselves and got them to label these images. One thing that you
might find if you do this is if you ask two
ophthalmologists to label the same image, they
agree with each other on the 1-to-5 grade about 60% of the time, which is a little sad. If you ask the same
ophthalmologists to grade the same image
a few hours apart, they agree with themselves
65% of the time. Yeah, that’s perhaps more scary. But what you can do to
reduce that variance is to get each image labeled
by a bunch of ophthalmologists. So we got each image labeled
by about seven ophthalmologists on average. And if five of them say it’s a
2, and 2 of them say it’s a 3, it’s probably more
like a 2 than a 3.
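One simple way to combine several graders’ opinions per image is a majority vote; the study’s actual adjudication protocol may have differed, so treat this as an illustration:

```python
from collections import Counter

def aggregate_grade(grades):
    """Combine several ophthalmologists' 1-5 grades for one image by majority
    vote, breaking ties toward the more severe grade."""
    counts = Counter(grades)
    top_count = max(counts.values())
    return max(grade for grade, c in counts.items() if c == top_count)

aggregate_grade([2, 2, 2, 2, 2, 3, 3])  # -> 2
```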
So at that point, you now have a labeled data set of retinal images, and you can just train the model on that so that it can then generalize to new
retinal images and make this diagnostic assessment. And so this was work published
in JAMA at the end of 2016, basically showing that the
model was on par, or perhaps slightly better, than the
average board-certified ophthalmologist. In this graph, you want to
be at the upper-left corner. And because it’s a
machine learning model, you can choose the
operating point of false positive
and false negatives to be anywhere on
that black curve, and individual ophthalmologists
are shown as colored dots there. And one thing you might
want– oh, I should also say, since then, we’ve done
follow-on work where we’ve gotten the same images labeled
by retinal specialists who have more training
in retinal disease, and we now have a
model that’s on par with retinal specialists,
which is kind of the gold standard of
care, rather than on par with board-certified
ophthalmologists. OK, explainability
is one big factor. Especially when you’re
making predictions for things that are
very consequential, like medical diagnoses, you
want to partner with a clinician and actually be able to
show the clinician why the model thinks this is a 2 instead of a 3. And there’s actually been a bit
of a bad rap about neural nets as being completely
black box methods. I think that’s not quite true. Over the past four
or five years, the community has
developed some tools that enable us to get a
bit more insight into, why is the model making a
particular kind of decision for this example? And I’m going to show you one of
those here, which is basically, this is assessed as moderate
diabetic retinopathy, and we can actually use a
technique called saliency maps using integrated gradients
that can show a clinician where in the retinal image are
the points of most concern? Why is it looking here? Why is it– you know,
which parts of the eye are most contributing
to that assessment?
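A rough sketch of the integrated-gradients idea; the `model`, `image`, and `baseline` here are hypothetical stand-ins, and real implementations batch the interpolation steps:

```python
import tensorflow as tf

def integrated_gradients(model, image, baseline, steps=50):
    """Accumulate gradients along a straight path from a baseline image
    (e.g. all black) to the real input, then scale by the input difference.
    The result is a per-pixel attribution map a clinician can look at."""
    grads = []
    for alpha in tf.linspace(0.0, 1.0, steps):
        interpolated = baseline + alpha * (image - baseline)
        with tf.GradientTape() as tape:
            tape.watch(interpolated)
            score = tf.reduce_max(model(interpolated[tf.newaxis, ...])[0])
        grads.append(tape.gradient(score, interpolated))
    return (image - baseline) * tf.reduce_mean(tf.stack(grads), axis=0)
```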
We’re also doing a bunch of other work in the medical space. So for example, a similar
kind of work, machine learning interpreting CT scans. So CT scans are
super complicated. You get a
three-dimensional volume of X-ray data, about 400
two-dimensional X-rays that you then need to look
through as a stack in volume, and particularly for, like,
early lung cancer detection, you’re, as a radiologist,
looking for the tiniest little sign of an early tumor. And we have a model
now that can really, basically, outperform
radiologists at that particular task. Can you use all the data in
de-identified medical records to make all kinds of
different predictions about how patients
might progress or what diagnoses
might make sense? Turns out, yes, and that
actually helps quite a bit. There’s actually 200,000
pieces of information in a typical medical
record, whereas a lot of the clinical models and
practice of trying to predict, will someone come back to the
hospital in the next week, use about four kinds of
hand-designed features, and then plug it into a
sort of simple formula, and you can do way
better than that. Can we help with the
documentation burden? Clinicians spend an
awful lot of time interacting with the
medical record system, entering notes,
and billing codes, and all these kinds of things. Turns out we can actually use
machine learning to really help with that by sort of
suggesting early drafts of the medical note
based on the conversation the doctor had with the patient. Help with radiotherapy
planning– how does this actually
interact in a real working setting with pathologists? So can a pathologist, paired
with a machine learning system, be more effective? Turns out they actually can
make faster decisions that are more accurate,
but they also feel more confident
about the decisions that they make than if they
are not paired with that, which is kind of nice. OK, so that was a whirlwind
tour of the health care space. We’re doing a lot of
other stuff in that area. We think it’s a pretty
exciting area for, basically, improving collectively our
health and decision-making for health decisions. OK, so many of the
advances in this space depend on being able
to understand text. Computer vision
has improved a lot. What is happening
in the text space? And there’s been a
bunch of improvements in the last couple
of– few years. So prior to 2017,
there were a lot of uses of what are called
recurrent neural networks, where you have some
internal state in a model, and then you look at the
next token, or the next thing in a sequence– like a sentence– and then
you apply some transformation to that internal– that input
data and the internal state to update the internal state. You now have this sort of
distributed representation of what is all–
what has happened in all the tokens
you’ve consumed so far. And one of the problems
with that is it has a very sequential
bottleneck. So every token, you
have the sequential step that needs to happen before
the next one can happen. So in 2017, a bunch of Google
researchers and interns developed a technique called
the transformer model, which basically enabled us to look
in parallel at a whole bunch of tokens and then use an
attention mechanism where you can look across all the past
tokens and make predictions, but it doesn’t have this
sort of sequential, recurrent bottleneck on every step. And that turned out to be quite significant in terms of accuracy improvements in translation quality. The first two columns
there are English to German and English to French. Translation quality measured
with the BLEU score. Higher is better, and you see
pretty significant improvements over other kinds of approaches. But you also see
that this approach used a lot less compute,
because you can actually now get a lot of
efficiency gains and also reduce the
amount of compute by doing this kind of parallel,
multi-headed attention mechanism.
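A minimal numpy sketch of the scaled dot-product attention at the heart of the transformer — single head, no masking — just to show that every token attends to the other tokens in one parallel step:

```python
import numpy as np

def attention(Q, K, V):
    """Q, K, V: [num_tokens, dim] arrays derived from the same sequence.
    Every token looks at every other token in parallel -- no recurrent state."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over tokens
    return weights @ V                                 # weighted mix of values
```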
Then, building on that, a different team of Google researchers took the transformer idea and created something called BERT, which stands for Bidirectional
Encoder Representations from Transformers. So each one of
those blue circles there is a transformer module,
and the key thing about BERT is that it basically
is bidirectional. So it looks at the
context surrounding a piece of language,
a word, in order to make a prediction about
what word really makes sense in that overall context. So the way you train this model
is really nice because it’s what’s called self-supervised. So you could essentially
create your own supervised task you’re trying to accomplish
from raw data that has not been annotated. So if we take a piece of text
like this, the way this works is you remove 15% of the words
and replace them with blanks, and now your goal is to
fill in the blanks, right? So if you’ve played Mad Libs as
a kid, it’s kind of like that. And it’s actually
pretty hard, right? So pick one of those blanks. I’m going to go back,
tell you what the word is. Now forget the word, and
now look at the blank. And some of them,
it’s pretty clear it must be one of
these few words. Some of them, you have
no idea what it is. This turns out to be a pretty
hard task, both for humans and for computers, but you can
train on lots and lots of text. And the more you do
this, the better you get. So the training
objective for BERT is you essentially
remove 15% of the words and try to predict them
for the rest of it. And so, then, that is the
machine learning objective.
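A toy sketch of that masking objective — illustrative only, not the real BERT preprocessing, which works on word pieces and uses a few extra tricks:

```python
import random

MASK_TOKEN, MASK_RATE = "[MASK]", 0.15

def mask_for_training(tokens):
    """Hide roughly 15% of the tokens; the model's job is to predict them back."""
    inputs, targets = [], []
    for token in tokens:
        if random.random() < MASK_RATE:
            inputs.append(MASK_TOKEN)   # what the model sees
            targets.append(token)       # what it must fill in
        else:
            inputs.append(token)
            targets.append(None)        # no loss at unmasked positions
    return inputs, targets

inputs, targets = mask_for_training("language models can fill in the blanks".split())
```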
And that doesn’t seem that useful in and of itself, but a key thing you can do is pre-train the model using this approach, and then, for many language tasks, you actually don’t
have very much data, but you might have a very
focused language task you want to do– like predict from the
review text someone wrote about a restaurant, did they
like the restaurant or not? Or is this a five-star
review or two-star review? And maybe you don’t
have very much data for that particular
thing but you can fine-tune the model
on those individual tasks without very many examples using
this giant pre-trained thing on lots and lots of text.
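A sketch of that pre-train-then-fine-tune pattern in Keras; the `pretrained_encoder` here is a tiny stand-in (in practice you would load a real BERT-style checkpoint), and the two-class head is for a made-up review-sentiment task:

```python
import tensorflow as tf

# Stand-in for a large pre-trained text encoder loaded from a checkpoint.
pretrained_encoder = tf.keras.Sequential([
    tf.keras.layers.Embedding(30000, 256),
    tf.keras.layers.GlobalAveragePooling1D(),
])

# Small task-specific head: did the reviewer like the restaurant or not?
model = tf.keras.Sequential([
    pretrained_encoder,
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy")
# Fine-tune on a small labeled set of (token ids, 0/1 label) examples:
# model.fit(review_token_ids, review_labels, epochs=3)
```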
And this turns out to work surprisingly well. So at the time
this was released, it showed pretty
significant advances in a suite of benchmarks in the
language understanding space called the GLUE
benchmark and basically exceeded the state of
the art by a fair amount in all those different
areas, which is kind of cool. And since then,
there’s been a lot of other kinds of
improvements on top of BERT and other kinds of
similar approaches, all showing dramatic
progress in a lot of NLP and machine translation
kinds of tasks. OK, one of the things
was engineer the tools of scientific discovery. This seemed like a
bit of a cop-out, sort of like a vague thing,
whereas the other ones were much more focused. But it’s pretty clear that
if machine learning is going to be a key thing that we apply
to lots of these problems, we want good tools that enable
us to express machine learning ideas, machine learning
research techniques, and also take machine
learning and apply it as a scientific or engineering
tool in products and problems that we want to tackle. And so in 2015, we
decided we would create a second
generation system for our training of
machine learning models for our own research purposes,
and we would open source it when it was ready. And so in the end of 2015,
we released TensorFlow, which is a way of expressing
these kinds of machine learning ideas and also designed
to be easy to run them on lots of different
kinds of environments– so on large data
centers, on GPU cards, on customized machine
learning hardware, on phones, all kinds of things. And TensorFlow has had
a rather good adoption curve in terms of
people who– compared to a bunch of other machine
learning toolkits on GitHub, and it’s been used for a lot
of different things, which is kind of the point. That’s why we decided
to open source it, is we’ve hoped people
would do two things. So one is we hoped they
would help contribute to it, so we’re proud that we have
almost 2,000 contributors to the actual source code,
about a third of whom are from Google,
and the rest are from lots of other
companies, and universities, and organizations. But we also hoped that
people would use it for interesting things that they
can then tackle with machine learning in the world. And there’s been lots
and lots of uses. I’ll just highlight a
couple that are kind of fun. So one is there’s a fitness
centers for cows company based in the Netherlands. If you have 100 dairy
cows, it’s hard to keep track of is Bessie number
14 feeling good today? And it turns out you
can analyze the sensor data using a machine learning
model expressed in TensorFlow, and then they will tell you
that yes, Bessie number 14 is actually feeling great today– it’s all good. There was a nice collaboration– what? AUDIENCE: What are the sensors? What sorts of sensors? JEFF DEAN: I think it’s
accelerometers, basically. It’s like you tell if they’re
limping or laying down instead of standing up,
or that kind of thing. So there was a
nice collaboration between Penn State and the
International Institute of Tropical
Agriculture in Tanzania on, basically, building
a machine learning model for detecting cassava
plant diseases. And this is an
example of a couple of people using this
application, where it actually runs the model on-device. Because you’re in the middle
of a cassava field in Tanzania, you may not have
network connectivity, but you want to be able to
take a picture of a plant leaf and assess, is this
plant diseased? If so, how should you treat it– and then go about
taking care of it. And this is an example,
I think, of where you want more and
more machine learning to run in lots of different
kinds of environments, right? It’s important to be able to
run it in large data centers, but also, in the middle of
a cassava field in Tanzania, you want it to run
in a way that enables you to have a highly
accurate model that can run without network connectivity. OK, another thing we’ve
been focused a lot on is, how do we automate the
process of machine learning? Because if there are lots and
lots of problems in the world that we want machine
learning to be applied to, but there’s a relatively
small number of people in the world who’ve
actually, say, taken a master’s-level
class in machine learning, that’s a bit of a mismatch. So can we make this an
easier and more accessible thing for people who have
machine learning problems? So the current way you solve a
machine learning problem is you have some data, you have a
machine learning expert come in and look at the data,
make a bunch of decisions, try some things out,
run some experiments, look at the results
of those experiments, figure out what are the next
set of experiments to run, run the next set of
experiments, look at those, stir it all together,
and you get a solution– which is great. But what if we could turn this
more into data and computation and have the computation
be used to automate a lot of the
experimental process that a machine learning expert
would do if they were doing it in a hands-on manner? So that’s the general
thrust of this work. We’ve done a lot of different
kinds of work in this space, but I’ll talk about some of
the earliest work that we did, which was, essentially,
a technique called neural architecture search. So one of the decisions
you make when you’re using a deep learning
model is there’s a lot of decisions about
exactly what structure model you should use. Should it be 13 layers, or
nine layers, or, you know, 1,000 neurons in
this layer, or 1,500? Should it have 3 by 3 filters
at layer 14, or 5 by 5? And the idea here
is, maybe we can have a model-generating
model, and we’re going to train it
with machine learning. Great, OK, so the
model-generating model, the approach here, we’re
going to generate 10 models from this
model-generating model. We’re going to train them on the
problem we actually care about, and then we’re going to
use the loss, the accuracy, of the generated models as a
reinforcement learning signal for the model-generating
model to kind of steer it away from models that
didn’t work very well and towards models that
seem to work better. And then we’re going to
repeat that a fair amount. And over time, you get
more and more accurate models because the
model-generating model is now sampling from the part of
the space of models that seem to be more accurate
for this problem we actually care about.
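In pseudocode, the loop being described looks roughly like this; `controller`, `train_and_evaluate`, and `reinforce_update` are hypothetical helpers, not real API names:

```python
def architecture_search(controller, num_rounds=100, samples_per_round=10):
    """Toy sketch of the search loop: sample candidate architectures, use their
    accuracy as a reinforcement-learning reward, and steer the model-generating
    model toward the parts of the search space that work well."""
    best_arch, best_reward = None, float("-inf")
    for _ in range(num_rounds):
        candidates = [controller.sample_architecture() for _ in range(samples_per_round)]
        rewards = [train_and_evaluate(arch) for arch in candidates]  # hypothetical helper
        controller.reinforce_update(candidates, rewards)  # move toward better models
        for arch, reward in zip(candidates, rewards):
            if reward > best_reward:
                best_arch, best_reward = arch, reward
    return best_arch
```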
So that’s the general approach, and it comes up with models that look kind
of strange, but that’s OK. What you actually care
about is, is the model accurate for the problem you
ultimately set out to solve? And the answer is,
this works pretty well. So I’m going to take a little
time explaining this graph. So on the x-axis is
computational cost of the model to make an
inference on a new image. This is for the
ImageNet challenge. The y-axis is accuracy,
and the different dots here are lots of different
kinds of models. So the ones on the black line
and below the dotted black line are ones that are basically work by the top machine learning and computer vision teams in the world, each publishing work that advanced the state of the art at
the time it was published on this challenge. And in general, what you
see is a trend of, if you expend more computation, you
get more accurate results. Less computation, you get
less accurate results. And then the blue line there
is a circa-2017 AutoML approach to develop models. And what you see
is the blue line basically outperforms the Pareto
frontier of the top computer vision and machine learning
research teams in the world. And that’s true at the high end,
and also true at the low end where you might want a
very lightweight model that runs on a phone. And then, in 2019, we
continued work in the space and came up with an
approach that is actually significantly better
in this space, and now we have models that
dramatically outperform in accuracy the state
of the art there and also give you much more
accurate models for very low computational cost. And this is true in
image recognition; it’s true in object detection. So the labeled points there are from ML experts, and then this is an AutoML approach to doing object detection where you just kind of design the right architecture
in a computational way. It’s true in language
translation, where here, the accuracy is measured
in translation quality score and trillions of floating
point operations on the x-axis. And we’ve turned
these into tools that people can get
on our cloud services so that if people have
particular customized machine learning problems
they want to solve but don’t have in-house
machine learning expertise, they can actually
use this approach. They just upload a data
set they care about. It might be images of things
on their factory floor that they want to
distinguish, is this– which kind of 50 parts is this? And they can get a
high-quality model without as much machine
learning expertise. And there’s a whole suite
of these for different kinds of modalities. OK, so one of the
things that’s true is more computational
power generally helps because it enables
us to train larger, more powerful models on more– on larger data sets, or to
run more automated experiments in the AutoML space
for less cost. So one of the interesting
things that’s happened is we now have this tool that
is broadly applicable across so many different areas that we can
actually think about designing computing devices
that are more tailored to that kind of computation. So deep learning models– in particular, all the stuff
that I’ve described so far– has two really nice properties. So one is reduced precision for
all the arithmetic is generally fine with these kinds
of models, compared to other numerical techniques
where you generally need a lot more precision. One decimal digit of precision,
generally, is perfectly fine. And the second thing is that
all the models are made up of different compositions of
a handful of different kinds of operations that are all
in the dense linear algebra family– matrix multiplies,
vector dot products, other things like that. So essentially, you can take
these two properties together and build specialized
accelerators that do dense linear algebra
at reduced precision, and nothing else. They don’t do all the
complicated kinds of things that CPUs are trying to do with
branch predictors, and caches, and multiple levels of
cache hierarchies, and TLBs. You just do reduced-precision
dense linear algebra.
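For example, the core workload boils down to something like the following, just done in hardware at enormous scale and in a reduced-precision format such as bfloat16:

```python
import tensorflow as tf

# Dense linear algebra at reduced precision: a 128x128 matrix multiply,
# the kind of operation these accelerators are built around.
a = tf.random.normal([128, 128])
b = tf.random.normal([128, 128])
c = tf.matmul(tf.cast(a, tf.bfloat16), tf.cast(b, tf.bfloat16))
```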
So we’ve been doing this for a while because we sort of saw a bit of a worrying sign that we wanted to apply these models to lots and lots of problems at Google, and particularly ones in a lot of our products with user-facing requests, and we just didn’t see ways to do that with traditional
computational devices. We needed something that
was much more efficient. So the first-generation
TPU was tackling the problem of inference. You already have
a trained model; you just want to apply
it at high volume and high throughput. And this is a picture of it. It’s basically a card
that fits in a computer. It’s been in production
use for, actually, almost five years now. Used on search queries; it’s
used for all the neural machine translations; it’s used for
speech recognition, image recognition. When our DeepMind
colleagues played a match against Lee Sedol to compete
in the board game of Go, they used two racks
of these TPUs. So this is the
commemorative Go board we’ve stapled to the
side of the rack. And they, of course,
won that match. This is the second
generation chip, where now we wanted to tackle
the problem of training rather than just inference. Training is interesting
because it actually requires you to
think holistically about a whole computer system
rather than just a single chip because you’ll never get
enough compute in a single chip to solve all the machine
learning problems you want. If you did, you’d just
want to solve bigger ones. And so you have to
think about how are you going to connect things
together and build an effective computing system
rather than just a chip. So if we take a peek into–
this is the TPU V2 device, which has four
separate chips on it, and it’s designed to
be connected together into larger configurations. But let’s peek inside
the chip itself. And one of the really nice
properties it has is it’s actually a relatively
simple design. It’s got a giant matrix
multiply unit, so every cycle, we can multiply two 128 by 128 matrices together. And then it has some
scalar and vector units for things that are
not matrix multiplies. And then it has
high-bandwidth memory, which is sort of
super-high-speed DRAM. And that’s about it. And it has two cores that
are roughly equivalent, and that’s it. And it has reduced-precision
multiplication. And this is the progression
of TPU devices we’ve built. The first one,
which does teraops because it’s integer-only
for inference, and then the second and
third generation are flops because they’re 16-bit
floating point precision. And the third generation
adds water-cooling to the second generation and
some refinement to the chip. It’s always exciting
when you have hoses of water bringing water
to the surface of your chip. And these are designed
to be connected together into these larger
configurations we call pods. So the third
generation pod there is basically more than
a 100-petaflop computer. It’s 1,024 of those chips
connected together in a 32 by 32 2D toroidal mesh. And so, for example, these
are pretty high-performance devices, so when you’re training
an image model on one of these, you process the entire
ImageNet data set every second, and in about two minutes,
your model is trained– which is kind of nice. And the reason this
is important is it enables a very different kind
of science and research, right? If you’re able to do experiments
in a matter of a few minutes versus days, or
weeks, or months, you just do a fundamentally
different kind of experimental flow,
and that’s partly why we’re seeing
a lot of progress as people can try
lots of ideas quickly, and we want to make machine
learning on data sets of images at scale feel more interactive
than it has in the past. ED LAZOWSKA: Interactive
is different than faster. JEFF DEAN: Right, exactly. So I’ll also point out that
it’s important to apply some of these accelerator ideas
to very teeny accelerators that you want to be able to run
on edge devices like phones, or small-scale Internet of
Things kind of sensor devices. So this is an edge TPU, which follows a lot of the same
design principles. It’s a different chip
design, but it’s basically designed to give you
the ability to have a fair amount of inference
capacity in a low power envelope that can fit in a
USB stick or inside a phone. OK, so what’s wrong with
how we do machine learning? This is more of a
soapbox-y section. So the current way we solve
a task with machine learning, assuming we’re not using AutoML,
is we take our ML expert, we start with random
floating point numbers for all the parameter values. We add some data and have some
optimization algorithm that uses data to adjust the
floating point parameter values, and we get our solution. Now, if we’re using AutoML,
we have data, we have compute, we have a random
number generator, we start with random
floating point numbers, we train the model
using the computation, and we get our solution. But we’re still
starting the problem solving process with almost no
knowledge of anything, right? We’re trying to
rely on the entire– the data itself
teaching us everything we need to know about
solving that problem, and that’s pretty
unrealistic, right? Imagine every
problem you solved, you forgot all your schooling,
and you start from scratch, and now you’re going
to solve a new problem. That’s pretty lame, right? So what can we do better? And as a result of this, new
problems need both significant data– you need hundreds of
thousands of examples to learn everything about the world
and solving that problem from nothing– and you also need
a lot of compute because now you have a lot of
data to deal with, and so on. So sometimes transfer learning
and multi-task learning help, but they’re usually
done pretty modestly. We do transfer learning
from a very related thing into one other thing, or we
do multi-task learning of five pretty related things together. So how can we do better? So I think we want
really big models, and I think it’s important that
they have a lot of capacity so they can remember
lots of stuff, but they should be
sparsely activated, right? When you call upon your
expertise in Shakespeare, there’s lots of
other expertise you have that’s not being called
upon and is mostly dormant, right? So how do we do that? So we did some earlier work in
the context of a single task where we created a technique
called sparsely gated mixtures of experts. So the pink things here are
normal neural net-style layers, and the mixture of expert layer
has this interesting property where we’re going
to have a bunch of little miniature
neural network components. Each one has a whole bunch of
capacity to remember things, and we’re going to
have a lot of them. So we’re going to have a
couple thousand of these. So we have 8 billion parameters
in capacity in this model just in the mixture
of expert layer there, and we’re going to have
a little thing there, the green part called
the gating network, which is going to learn to
route different examples to different experts,
and it’s going to learn which experts are
good at which kinds of things. And it turns out this
actually is pretty effective, and we’re actually able to
get experts that specialize in different kinds of things. So if you look at what
So if you look at what the experts specialize in, one of them is really good at talking about scientific terms and biology, and another one’s really good at talking about dates
and times, that kind of thing. And what you see is the
baseline here is the bottom row, actually? I guess that’s why it’s
called the baseline. And what we’re able to do
with this mixture of experts model is get a
system that is more accurate at machine translation
for a particular task. So one BLEU point higher is a pretty big deal. And it uses less
computation per word. So we’re actually able to shrink
the sizes of the pink layer, which is the part that’s
activated on every example, because we have this
dramatic improvement in capacity in the model in
the mixture of experts part. And so it’s half
the inference cost, half the cost to do
the translations, and the training time
is about 1/10 the cost. So it’s about 64 GPU days
instead of 700 GPU days. OK, so hold that thought. Think about that idea. What do we want? I think we want this large model
that’s sparsely activated, so kind of expert-style decisions
about which parts of it should be used for a
particular example. I think we actually
want a single model that does many things, right? Why aren’t we training
one model that does all of the
machine learning tasks we’re training separate
models for today. I think we need to dynamically
learn and have paths through this model that change
the structure of the model as we progress in
learning new things, and so that the model
architecture, then, can adapt to what it’s
being called upon to do. And adding a new task, then,
will leverage the existing things we already know how
to do and representations that we’ve learned
in doing those things in order to enable us
to do new things more quickly and with less data. OK, so I have a cartoon diagram. I have no results yet, but this
is roughly how it might look. So imagine we’ve trained the
thing on a bunch of tasks. Different tasks here
are different colors, and the components
in the bubbly thing are different pieces of
machine learning computation. They have some state. They have operations
you can apply to them, like forward and backward, say,
for a machine learning-style component. But it might do other things. It might do a Monte
Carlo tree search, or it might do all
kinds of other things. And then a new task comes along. So if you squint at the neural
architecture search work and instead think of it
as neural pathway search, you can imagine trying out
a lot of different ways to progress through the
components of this model to see how you get to a good
state for the new problem you care about, right? So maybe we do
that search, and we find this particular
pathway works out pretty well for this task. Now imagine we care
deeply about that task and we want it to be better. So we can add a new component,
new capacity to the model, and start using
it for that task, and start updating
that component, maybe specifically
for that task, or maybe for a few
other related tasks. Now we have a model that
is, hopefully, more accurate for that purple task. And the nice thing is that
component, that representation that’s been learned, can
now be used for other tasks or for the existing tasks
we already know how to do. And furthermore, each
one of those components you can imagine running their
own little architecture search or other sort of genetic
or evolutionary search process to adapt the structure
and other aspects of the model to the kind of data that’s
being routed to it, right? So in essence, I think
that’s an approach that will enable us to build
on the things that are common across
many different tasks and enable us to
more quickly solve new tasks by leveraging our
expertise in the other tasks. OK, I want to close on thinking
about thoughtful use of AI in society. So there’s lots
and lots of things we can do with machine learning
and could do with machine learning, but I think
it’s really important to think carefully about
how we want machine learning to be applied in the things that
affect our day-to-day existence in our society. So about a year and
a half ago, we’d been having a lot of
discussions internally for lots of different potential
uses of machine learning in lots
of our products. And in order to help
crystallize our thinking, we came up with a
concrete set of principles by which we started to evaluate
all uses of machine learning in Google. And I think it’s
a pretty good set of decision-making criteria that you want to apply to each possible use of machine learning. And I think it was great for
our own internal discussion, but we also decided we’d
make these public so that other people and other
organizations who are now thinking about these
sorts of same issues can see what we came up with. Now, it’s not
necessarily the case that they will
adopt exactly these, but it’s at least something that
is worth public discourse on. And I’ll point out that
a lot of these things are not just solved problems. They’re areas where we have some techniques we can apply that are the sort of state of the art in this space. For example, avoid creating or
reinforcing unfair bias. A lot of machine learning models
are trained on real-world data, and the real world is often the
world as it is, not the world as we’d like it to be, and
it’s biased in many, many ways. So we have some
algorithmic techniques we can apply to eliminate
certain kinds of bias, even if the data
itself is biased. But this is by no
means a solved problem, and so we are concurrently
applying these approaches but also doing
research to improve our understanding of how to
eliminate certain kinds of bias or to have safer machine
learning systems. And so this is just a sample. I went and looked, and we’ve
published about 75 research papers in the last 18 months on topics related to machine learning
fairness, bias, privacy, or safety. So this is definitely a
pretty significant effort across a lot of our
research organization and the rest of the
company as well. OK, and with that,
I want to conclude. And basically, I hope
I’ve convinced you, if you weren’t
already convinced, that deep neural nets are
helping make headway on some of these pretty
challenging problems that, actually, if we are
able to make progress on, we’ll make the world
a much better place. We’ll have autonomous vehicles. We’ll have health care decision
making informed by more data. We’ll have cool
robots running around. It’ll be great. So with that, I’ll
take questions. [APPLAUSE] I assume we’re taking questions. ED LAZOWSKA: Yeah, time
for a couple questions. JEFF DEAN: Great. ED LAZOWSKA: [INAUDIBLE]
Go ahead, shout it out. AUDIENCE: Hi. Thank you so much for your talk. It was really
exciting, and I’m more excited about the future of
technology than I was before. JEFF DEAN: Woo! [LAUGHTER] AUDIENCE: Nevertheless,
my question is more of a social nature
rather than technological. And I find myself
often wondering about what our future
is going to look like. And I’m seeing this, these cool
models that can solve basically everything, every problem that
we have better than humans, and I’m seeing that
more and more people are going to lose their jobs and
not going to see that stop. And I’m wondering, is that
at all a topic that Google is talking about, and is it
something that [INAUDIBLE] about? JEFF DEAN: Yeah,
so it’s definitely a topic that we talk about. I think society
has been undergoing these kinds of
technologically-driven shifts for the last 200 years. So 200 years ago, 99% of
people worked on farms, and then the agricultural
revolution happened, and we were able to produce
the same amount of food with a lot less labor input. And all of a sudden,
that enabled people to do different things. So I think similar
things are happening and are going to happen as a
result of machine learning. Suddenly, things that we
used to need people to do, all of a sudden,
there’s going to be less need for those particular
tasks to be done by people, or maybe people can do
them more efficiently combined with machine
learning models, and so we won’t need as many
people to do those things. This is definitely something
that we think about. I think it’s something
that Google thinks about, but it’s also a
societal-level question of what is the
most effective way to make sure that people land
safely, get new skills that are now things that are
much more needed in society. Given what we can
do, there’s going to be a bunch of things that
are suddenly things that we now can do that we didn’t used to
be able to do because of machine learning and because we now have
the ability for people to focus on other things. But it is a question
of, how do we make sure that that transition happens
safely and effectively for all the people affected? So, definitely,
complicated issue. It’s a societal,
governmental-level question, but we also want to
help and provide advice. We are running training programs
to help people acquire skills in programming and other
kinds of technology, things that we believe
are going to be useful for a long, long time. So– AUDIENCE: I’m
actually a physician. I’m a gastroenterologist. So one of the things– this is the first time I am in
a room, probably, mostly filled with computer scientist. So you see the diabetic
retinopathy image there. We are– I mean, as a
physician community, we agree that, you
know, machine learning will improve and
progress a lot of things, and you got to do
right on point when you said [INAUDIBLE] over
readability [INAUDIBLE] to ophthalmologists for
diagnosing diabetic retinopathy was 0.6, right? JEFF DEAN: Mm-hmm. AUDIENCE: So it’s almost
like a flip of the coin. So a lot of the
work we plan to do is in improving quality of
what we are already doing. We don’t expect robots
to be here already doing your colonoscopies,
or endoscopies, or something like that. My question to you is, one of
the things which most of us, it’s just not
gastroenterologists or medical professionals, [INAUDIBLE]
these sorts of specialities [INAUDIBLE] cardiologists
or different– all of us face is we
need diverse data, right? And then, it then
boils down to how do we create this diverse
data, and how can it be seamlessly transmitted
from the equipment that we use as our source
to a common platform which is kind of secure? And can we use– I was just taking some
notes, like the TPUs, can you have them
on your processors, or whatever imaging devices
you use that can be, maybe, used somewhere in
Europe and the traffic can be managed to
a cloud platform where you can test
different algorithms, because they might need five
algorithms, some group of five algorithms from Amazon. Which ones do we use
for our patients? JEFF DEAN: So I think you’re
right that a lot of this is sort of complexity,
and how do we navigate the complex regulatory
and health care ecosystem with many different
players and the very real privacy issues in
the health care space? But you’re right that bringing
together health care data so that we can
learn collectively on different health care systems
data or different individuals’ data in a privacy-preserving
manner is super important and a key part of this. We do make our TPUs available
on our cloud platform, so you can get medical data,
put it in the secure environment that your hospital or
health care system controls, and try different machine
learning algorithms on it. The edge TPUs are meant
to be deployed widely, so you can put those in
medical imaging devices or whatever would make sense. So there’s a lot of
complexity in getting to the vision of
actually having lots and lots of different diverse
health care system and health data inform future
care of all kinds of different medical
patients and conditions. But it’s a good
aspirational goal to say, how can
we get as close as possible to having the
world’s health care data inform the world’s health? AUDIENCE: So do you think
we are almost there yet, or with what we have
right now with technology? JEFF DEAN: I think a lot
of the problems are– some of the problems
are technical. There is a progression of
different machine learning and interesting technical
problems, and user interface problems. But there’s a lot of
collective problems that are not, at their
heart, technical. But it’s worth tackling all of
them because they’re important. ED LAZOWSKA: The first word
out of any physician’s mouth is, in my experience, OK? The notion that the
collective experience might offer data that’s valuable
to an individual physician is sort of elusive. Let’s take one more question. It’s up to you. JEFF DEAN: Well, I– in the back, please. AUDIENCE: Hi. OK, so I saw you had the picture
of the cloud things, right? And you had all of those rules. JEFF DEAN: Here? AUDIENCE: Yes. One back, I think? Well, you had a list of goals. They were in orange. JEFF DEAN: OK. AUDIENCE: Yeah, this, OK. So when I was looking
at this, I was like, that’s deeply
evocative of the function of the brain as a whole system. And each of those things,
each single thing there, the implementation is a
huge technical challenge in and of itself. I don’t think there’s any one
person in this room right now that could make that, right? That’s going to take
huge cooperation. JEFF DEAN: Yeah. AUDIENCE: Do you think the
tools that we have now are up to the challenge? Or essentially,
what I’m saying is, I think that we’ll be
able to get there, maybe, in five or 10
years, and not 100 years? Are we able to really pool
our collective knowledge effectively so that we
can achieve these really technical, challenging goals? Are our tools optimal? JEFF DEAN: No, I mean,
I think you’re right. There are a lot of interesting
technical problems in here. Some of them are computer
system problems– how do we build a system
at the right scale with the right characteristics? Our current software tools
for machine learning probably are not as dynamic in terms
of the kinds of computations they allow to be expressed. So that’s part of the problem. AUDIENCE: And I was also– I guess I was really
thinking, especially in the focus of
collaboration, do our tools allow us to collaborate to
achieve these kinds of goals? JEFF DEAN: Yeah,
so good question. I do think this is a five
year-ish kind of time scale problem. I don’t want to put
an exact number on it, but it feels more
like a five-year one than a 10-year one. And partly I’m
excited about working on this with a bunch
of other people who have different kinds
of expertise than I have because that’s how you actually
make progress on hard problems, is often you need a team of
people, a bunch of people to come together with
different expertise, and collectively, you all do
something that none of you could do individually. I think that’s also the most
fun kinds of projects to work on because while you’re doing
that together, you’re learning a lot about other
people’s expertise, and then you go
your separate ways, and some of their expertise
has rubbed off on you, and you can now
go your merry way and at least know the
right kinds of questions in their domain to ask. ED LAZOWSKA: The track
record of technologists is wildly overestimating
what we may be able to do in five years. [LAUGHTER] And underestimating
what we’re going to be able to do in 50 years. And the phrase
“artificial intelligence” was coined in a
proposal for a workshop at Dartmouth College
in 1956, I believe. It was a summer workshop, and
Minsky and the other pioneers in the field laid out a
set of challenges and said, we think if the right 10 people
get together for the summer, significant progress
can be made on it. [LAUGHTER] And these are things
that we’re just now cracking in the
past couple of years, thanks to what Jeff
has talked about. So before we thank
Jeff, let me remind you, there’s a reception across
the street in the atrium of the Allen Center. Hopefully, there’s
some food left if you join us with our
annual Diversity in Computing reception. So there’ll be a bit of a
program related to that. You’ve heard Jeff talk about
three generations of tensor processing units, and the
author of one of those papers was a guy named Dave Patterson. Dave will be our next
distinguished lecture speaker on Tuesday,
September 29, and his topic will be
domain-specific architectures for deep neural networks,
three generations of tensor processing units. So I think we are ok. [INTERPOSING VOICES] JEFF DEAN: Good job. ED LAZOWSKA: Jeff
Dean, thank you. JEFF DEAN: Thank you. [APPLAUSE]