Cloud Data Fusion: Data Integration at Google Cloud (Cloud Next ’19)

[MUSIC PLAYING]

NITIN MOTGI: Hello, everyone. My name is Nitin Motgi. I’m one of the group product
managers in the data analytics space. You guys have seen me in
the keynote demo there. But if you have not,
I’m going to show you what Data Fusion is all about. So before I get
started, I wanted to give you guys enough context
of where it’s being used. I wanted to introduce
my colleague Robert here who’s from Telus Digital. So he’s going to
basically give you guys the context in terms of how they
are looking to use Data Fusion. So with that, I will
hand it over to Robert. See you guys in a few minutes.

ROBERT MEDEIROS:
Thank you, everyone. All right. So I’m Robert. I’m a software architect
with Telus Digital. If you’re not
familiar with Telus, it’s a Canadian
telecommunications firm that operates in the
traditional telecom verticals. We offer phone, internet, mobile
phone, television services. It won’t surprise you to
learn that in the course of conducting our business we
collect a very large amount and we generate a very
large amount of data that we need to carefully
and responsibly manage. That data also
needs to be sifted for whatever insights
we can glean in order to better serve our customers. So we’re a growing business. And we’re moving into new
lines of business all the time. Some of those areas come
replete with their own data specific challenges. In some instances,
we’re confronted with a particular
security strictness that we have to contend with. In other areas we
have compliance and regulatory regimes
that we have to deal with. In still other areas we
have particular data volume and velocity challenges that
we need to contend with. So all of this is to say that
over the course of our 100 year history and a number of
mergers and acquisitions where every new
member of the family has come with new data
standards, new data processes, systems,
our landscape has grown extremely complex. This complexity has been
very hard to contend with. It’s hard just to grapple
with it, to understand it, let alone to try
and merge it all, fuse it into a unified whole. So to try and
understand where some of the places we were
falling short are, we recently conducted a
survey of all the participants in our data ecosystem
from which we learned that the average
score people were willing to give to
our data and our data systems was a rather middling
2.74 on a five point scale. Not as good as it could
be, unfortunately. The data that we
received was rich enough to reveal an interesting
pattern, which was that the closer you
are to generating data, particularly if you’re
a human that only deals with a single database or a
single data generating process, the happier you tended to
be with your lot in life. If you’re a person that
was responsible for taking data and transforming
it, somehow relocating it or handing it off, you were
generally a little less content with your lot. And if you were a
downstream consumer of data, you were generally
somewhat unhappy. This is particularly
if you needed to rely on a synthesis of
data, cross-functional data from a lot of different
upstream data sources. So generally
speaking, the picture was that the higher you went in
the organization and the more data you had to touch,
the more unhappy you were. And since that describes
senior level decision makers, we realized that we had a bit of
a problem that we had to solve. The implication for
us was that we needed to walk before we could run. We’re very interested
in exploring some sophisticated capabilities
like machine learning. We wanted to build
out an AI program. But we realized
that before we could do any of that we needed
to fix our data problems. So the answer for
us was to explore the principles of
supply chain management and to undertake a data
supply chain program. So building a data
supply chain is replete with many challenges. Those challenges include
the need to integrate data from a great number of
sources, integrating meaning clean, validate, refine,
reconcile data through an often complex chain of
lengthy transformations. We also have to handle
the entire data lifecycle. Our data is born, it lives
and is in active use, and it retires and
eventually becomes obsolete. And during the entire
course of these events, we need to avoid breakage. We need to provide a consistent
and canonical view of data to our users. And in particular, we want to
compute key business metrics just once and in just
one way if possible. We needed to better
understand our data. We need an ontology to ascribe
shared meaning to data. And in particular we needed
comprehensive metadata and lineage information
about all of our data down to the field
level for every field. We wanted to build a
unitary system, something that supported many roles
from less to more technical without causing an explosion
in the number of tools that we had to build or support. And we wanted something that
was ubiquitous, something that lived in all
the places where our data lived and made data
location transparency pivotal. Ultimately, we needed safe and
secure delivery of our data to our downstream consumers. We wanted to break out of
all the data silos that existed in the enterprise. All of this has to happen
in the context of a world where cloud is an important
part of what we do. And we’ve learned that building
hybrid cloud architectures is challenging. Some of those challenges include
the need for portability. We need to build things
locally, develop artifacts that we can run locally but also
run on prem in our data lake, for example, as
well as move into the cloud for
scaling and other benefits that cloud offers. We need to be able to distribute
our data pipelines so they seamlessly bridge
from on prem to cloud so that our data can flow back
and forth without any friction. We need extensibility. So we need standard
hooks, well-defined places to add our own business logic,
to add data transformations, to add connectors to
various data sources, ideally without
bespoke integration. Every bit of
bespoke code that we have to write to
support these things means an additional burden
in terms of technical debt that we’d rather avoid. We have the problem
of affinity. Some applications just
resist hybridization. Some data sets have to be
pinned to a certain location for various reasons. Conversely, some data
has to be duplicated. And that comes with its
attendant challenges of having a unified
view of your data, maintaining data
lineage, and so on. And the problem of
testability is key. Inasmuch as things vary from
environment to environment, we want to isolate those
things, test them once. We want to test once, not
test once per environment. Right. We want sufficient
abstraction that we can build our pipelines, build
our data transformations, build our logic and
test it once and know that it’s going to run in all
the places where we need it to. So taking GCP as our
example of a single cloud– we’re here, why not– we realize that we suffer from
an embarrassment of riches. The great number of
services on offer means that we are faced with
service integration challenges. We’re faced with a fairly
steep learning curve. Building even a relatively
simple pipeline– imagine landing data on GCS,
modifying it with Dataproc, pushing it into
BigQuery or Bigtable, and ultimately surfacing
it in Data Studio– recall what it was like to
be a newbie to the platform and how daunting
all of those things seemed, and then imagine having
to stitch them all together. This is something that
we needed to come up with a way to address. All of this impacts our
speed to productivity. The more things
that we’re touching and trying to stitch together,
the greater our challenge. We wanted to find a way
to make easy things easy. In spite of the fact that
digging into specific tools yields benefits, greater
performance, lower cost, we still need a way
to tie things together without having to
be deep experts in every single part of the
data pipelines that we build. So multi-cloud. We live in a multi-cloud world
and there is no escaping it. This comes with the
challenge of finding people that are sufficiently skilled. Finding folks that are
knowledgeable about one cloud platform is hard. Finding unicorns
that know and have deep expertise across multiple
clouds is correspondingly more difficult. When things
inevitably go wrong, finding the root
cause of the problems is made much more
complicated when your applications are parceled
out across multiple clouds. And with a greater mix
of tools and services available across clouds,
finding the right mix of things to tailor for precisely your
application is more difficult. And of course, there
may come a time when you want to move on to
or off of a particular cloud. And to the extent that you’ve
tailored your applications to the interfaces
of those clouds, you’ll find that you have a
higher switching cost to pay. So the question before us is,
can we find or build a tool to tie a broad array of services
together near seamlessly across our on prem and
cloud infrastructure, multi-cloud
infrastructure in fact, in a way that minimizes the
cognitive load of taking advantage of all this power? We want a platform
that supports the end to end data lifecycle, something
that helps us manage data from the moment of its capture
till its eventual retirement. We want something that helps
us better understand our data and processes, something
that explains the origin and the lineage of all the data
that flows through the system. We want something that
supports a broad mix of less technical and more
technical users, giving all of these folks a
home, something within which they can easily access the
data that they need and build flows to shape it in
the way that they want. And to add on a few
extra bullet points, if we could find something
that was open source and that has a community,
so much the better. If we can find something that
we know scales from micro jobs all the way up to massive,
and if we could find something that’s sufficiently flexible
that we can mold and shape it to our needs, we’d
be that much happier. So about a year ago
I was introduced to something called CDAP. CDAP was an open source tool
for big data processing. And having dug into it,
I became very excited. It seemed to address
a lot of these points that I’ve mentioned. And it wasn’t so long after
first encountering the tool and getting excited I learned
that it had joined the Google Cloud family of
services and had become what you’re going to hear
about today, Data Fusion. Since then, Data Fusion has
become an important part of our data integration efforts
as we move our data supply chain into a hybrid
and multi-cloud world. And Data Fusion solves
a number of the pain points that we suffer. And I’d like to introduce Nitin
to tell you a little bit more about it.

NITIN MOTGI: Thank you. Thank you. Thank you, Robert.

ROBERT MEDEIROS: Thank you.

NITIN MOTGI: So let’s
talk about Data Fusion. So Data Fusion. What is Data Fusion? So Data Fusion is a fully
managed cloud native data integration service. It’s basically helping you to
efficiently build and manage data pipelines. With a graphical interface,
and a broad collection of open source
transformations, and hundreds of out of box connectors,
it helps organizations shift the focus from
code and integration to insights and actions. It’s actually built on an open
source technology called CDAP. CDAP has been in existence
for quite some time now. This is a managed version of it. So we can talk more
about CDAP in general. But CDAP as a platform
allows you to build data analytical applications. And as part of data
analytical applications, there are a few analytical
applications that are also included as part of CDAP. And one of them is ability
to build data pipelines. The second one is
more about how do you transform the data without
having to write any code. We call them
Wrangler, data prep. The name has been
constantly changing. But it’s an ability where if you
are specifying transformations, you’re applying
data quality checks, you don’t need to
be writing any code. And that is another accelerative
[INAUDIBLE] application that is built on top of CDAP. So we basically– there
are other things like that, such as a rules
engine, a bunch of things like that, which
we will look at later. But what we have done now
is taken those two parts, the ability to build
pipelines and the ability to do transformations
without having to write code. And along with
the CDAP platform, we have turned that into
a managed offering on GCP. So when you look at
the kinds of use cases that we can use to solve
this, and the problems that it is trying to address, first
thing that it is trying to address, like
Robert mentioned, it’s making it extremely
easy to move data around. Right. So right now, if you’re
doing it all by yourself, it’s very error prone. It’s very time consuming. So if I have to move data from,
let’s say GCS to BigQuery, I would have to
spend a lot of time writing code and ensuring
that that code actually works every single time. Not to mention that if I have
to do another such point A to point B transition
with an added transformation, I start all over again. So essentially, at the end
of it, it increases your TCO. Right. So just for moving, you’re
spending way too much time and then you’re actually
getting into situations where you might not be able
to address all of the business requirements that you have. Right. And also skills gap
is another thing. So it’s like you need to
have the right set of skills and have good expertise
in the systems that you’re stitching
together, because there are performance requirements. There are a bunch of things
related to how-tos, best practices you need
to be adhering to. So all that is very
difficult. Right. And the last one is hybrid. And hybrid is something that
is very interesting actually. During our process through
EAP, which is the early access program on GCP, we learned that
actual journey to cloud starts on prem. It’s not like lift and shift
and everything is done. It’s kind of hard. You need to be able to
start your journey slowly from an on prem environment. So just by the combination
of CDAP and Data Fusion, it allows you to do so. And you don’t need
to build it twice. You can build it once
and be able to run them in two different environments. So who are the users that we
are targeting with Data Fusion? So we are leveling– we are moving the
level up in terms of being able to
adopt GCP and being able to apply transformations. When you look at– when there is a need for
a developer, data scientist, or business analyst to cleanse,
transform, blend, transfer data, or standardize,
canonicalize, automate your data operations,
data ops, we want you to be
using Data Fusion. Data Fusion provides
a graphical interface. With that, it also
provides ability for you to test, debug,
as well as deploy. You can deploy it
at scale on GCP. And there we can essentially
scale to your data levels. Right. So it essentially can scale
to petabytes of data there. Now in addition to that, we
also collect a lot of metadata, whether it is business,
technical, or operational. All of this metadata is
aggregated within Data Fusion. And we are exposing that
in terms of lineage. So you will be able to
do things like root cause analysis, impact analysis,
be able to find provenance, be able to associate a lot of
metadata for the data pipelines that you’re building
as well as data sets that you’re creating with it. And that’s a very important
part of having data operations anywhere. So when you look at
the kinds of use cases that Data Fusion is trying
to solve, Data Fusion– these are different
business imperatives that translate
into IT initiatives and then maps to how Data
Fusion can help solve. So when you are looking
to build warehouses, you’re actually
trying to get data from many different sources,
let’s say into BigQuery as an example. You should make it
extremely easy to do that. It should not take you years,
six months, not that long. You should be able to
operationalize these things pretty quickly. When you are actually looking
to retire legacy systems, Data Fusion can help you
migrate that data over to GCP. It doesn’t need to
be just BigQuery. You can put it in Spanner. You can put it in Cloud
SQL, Datastore, Bigtable. You can pretty much
be able to connect to anything that is available
on GCP to bring that data in. And it’s not just limited
from a source’s perspective to on prem. You can also read from
the same set of sources that you have written to in GCP. Data consolidation
is another aspect. So you’re trying to
either migrate some data or essentially retire a bunch
of data that exists there. Master data management
is something that has a bunch of
capabilities but we are going to be adding
a lot more in this year to essentially help
create a much more consistent, high quality
environment for your data. And the last one is
extend or migrate in cases on a hybrid
environment that you want to load shed into cloud. If you are still
on prem, you will be able to do those
kind of things. The thing is the
complexity of it being able to run on cloud,
on prem, and in other clouds. That’s the biggest
part of all this. You should be able to
run that across anywhere. So just to put it all
together, so the way I like to think about
this is Data Fusion is providing a
fabric which allows you to fuse a lot of different
technologies and products that are available on GCP in
a much more easy, accessible, secure, performant and
an intelligent manner. It just makes the entire
process extremely easy. So the things that would
take me six weeks to build, I can literally build
them in two minutes, deploy them on the third minute,
and be able to operationalize. Operationalize should
not be that easy, but like it takes
let’s say a day or so. Right. So basically, this is bringing
a lot of different things together, making it very simple. It’s moving up the bar,
lowering the barrier to entry essentially for GCP when
you talk in terms of data. So with that brief
introduction, I’m going to show you guys
a demo of Data Fusion. So let’s get started. So Data Fusion, as I said,
is a managed offering. So it will be available
through Cloud console. So you will be,
starting today, you will be able to see Data Fusion
show up on the Cloud console. So now what we have done
is we have basically taken Data Fusion and given
two different editions– we have made two different
editions available to you all. One is the basic edition. The second one is an
enterprise edition. You should be able to pick
depending on your needs. And we recommend
basic edition to be more for dev I would say, not
so highly available to your QA environment. But enterprise edition is more
for mission critical stuff. So you should be able to
provision off of those two editions to start with. So here I have already
created a few of them. I mean, these are
all you can see. These are Next demo stuff, so
a bunch of different things that have been created
and all in enterprise. But I can go into
Create Instance and be able to specify. And here are the
zones and locations that we are starting with. So we have Asia East,
Europe West, Central, East, and West, all of these. So as we go along, we’ll
be adding more locations in this year. So once I clear the instance,
I can jump into Data Fusion entirely. So this is Data Fusion. It looks a lot like a
cockpit where essentially you have a lot of things that
you have interacted with. This is called a control center. Control center is the place
where you are actually able to monitor and be able to
manage all of your data sets and data pipelines. This is a central
place for doing that. So you also have an ability
to work with the pipelines. This is the list of
explicit pipelines that you are interested in here. We have put a few pipelines
that have been scheduled to run and have been running. Additionally, you
also have a studio where you can build
different pipelines. It’s a visual way of building
all of these pipelines, being able to deploy them. And also, we have a place where
we are collecting metadata. I skipped on one, Wrangler. I’ll come back to that. This is where we are collecting
all of the pipeline metadata as well as all of the
different data sets. The data gets collected here. So the interesting part. Now let’s imagine you have data
sitting in an Oracle system. So like it’s an on
prem Oracle system and you want to bring data in. And it’s not just limited
to Oracle actually. We have hundreds of
out of box connectors that allow you to connect
to various systems starting from mainframe
to being able to connect to a bunch of different
cloud sources. Right. So in order to
start with, we have started providing a very
simple way of bringing data in. Right. So first thing
you’re seeing here is I have pulled in the
data from an Oracle table. Now let’s see how
we got here. Right. So Wrangler allows
you to connect to various different
data sources. These are the popular ones. We are going to be adding a
lot more in the coming months, actually. So I have already preconfigured
an Oracle table here. I can actually add
more databases. Databases that can support
JDBC can be easily added here. We also support Kafka, S3,
GCS, BigQuery, and stuff. So as soon as your
instance gets created, it automatically
attaches to your projects in GCS, BigQuery, and Spanner. So now let’s say I have
data sitting in Oracle here. So this is now live listing
all of the Oracle tables. And I’m interested in
bringing in one table only, though
you also have the ability to bring the entire database over. So you don’t have to
do one table at a time. You can bring the
entire database over. So let me click on
the employees table. As soon as I click,
the data gets sampled. There are many different
ways the data can be sampled. But the data gets sampled. All of the data types of
the source data system get translated into
an intermediate representation, which is essentially
Avro, where they get translated into Avro types. So it’s basically the Avro spec,
but it is an extended Avro spec to specify various other types. So now once I’m here, I
can apply transformations. So now the first thing I see
is commission percentages being highlighted for like
there’s some data that’s actually missing here. Now in order to process,
I want to make sure that I can specify
some default values. So I’m going to fill 0.0 here. So there’s 0% commission. I can now see it’s
100% complete. Right. I can also see that the job ID
is a combination of two things, department and title. So I want to be able to
specify a way to split those. So I can go in here. I say extract fields. Like delimiters. So now we generated two columns. Right. So now I don’t need this
specific column from here. And I’m good. Right. But you want to get a
little bit of insight of how your data is organized. So we basically make it easy
to get a little bit of insights on your data. So there is a way where you
can essentially slice and dice this data in and have
a visual way of looking at how the different
dimensions are organized, how many different
records of a certain type. So you can do
drill down on these and look at and compare that
with other dimensions, right? So this is just to give you a
feel of how your data that you have pulled in is– how the sample that
you have pulled– is organized. So once you are
comfortable with this, I will say go create a pipeline
because my source right now is Oracle. It’s not a real
time source, right? So it’s a batch source right
now, because we are pulling it in batch, but you can
also do the real time data pull from Oracle, so
if you are doing things like CDC and such. So I want to create
a batch pipeline. So what that does is, it
transfers all the work that I did in Wrangler into my
studio, where I can build out the rest of the pipeline. My goal was to get
this transformation. Now, I started with
Wrangler to create this, but you can go the
other way, too. You can actually go from being
in this canvas sort of Studio to being able to wrangle that. So it’s just by dropping
one of the transforms, like Wrangler transforms
here, you should be able to do that seamlessly. For simplicity I’m just
using a very simple pipeline. So if you then you
want to be able to, let’s say, now this is
not operable– this is not going to be operating
on just a sample, it’s going to be operating
on the entire table that you’re trying to pull. So then I can go
ahead and figure out, like I would need to put this
into BigQuery for analytics. I also want to make sure that
I can actually store this data or archive this data into GCS. Right, so it’s as easy as that. Now I configure these. Configurations are
also pretty easy here. So you just define
the configuration, which is your data set,
your table name, that’s all you need to specify. Those are the mandatory
fields, you’re done. That’s how you essentially
bring data into BigQuery. The same holds true
from a GCS perspective. You specify which
bucket it needs to be in, what
your path suffix is, and you can also define
the format in which you want to write that data. It’s not limited to it. So now the Wrangler
that I specified, that I showed you guys,
has more than 1,000 functions, which include a lot
of data quality checks which I’m going to show you guys. But the most important thing
with that is it is extendable, you can write your own
data transformation– we call them directives. You can write your own data
transformation directives. And you can just drop them in
and use them in any pipeline. That gives you share-ability. So to show that, let me see,
after doing this I feel like, oh man, I missed something. So I want to go back and
do more operations here. Right, so I want to be able
to say, like, for example, for this Employee ID, change the
type to, let’s say, long. So I should be able to
do those type of things. So I go back and I make
those transformations and I am done with it. So, let’s see, I go back. OK so all of the transformations
are here, so now whatever I have done can be configured. So right now the pipeline– we provide two
different ways of being able to execute the pipeline. You can run them as a Spark job,
you can run them as a MapReduce job, but we are also going
to be in the future adding an ability for you to run
that as a Dataflow job. So you build a
pipeline, you’ll be able to run in any
of these three things without having to make any
change to your business logic. Let that sink in for a bit. You then have the ability
to specify resources. So, like, this is actually
for MapReduce, OK, that’s not that interesting,
it’s the old one. So let’s go look at Spark. So you can change
how Spark behaves. So you can specify
configurations, you can specify
a very specific– We don’t recommend,
actually, for you to do this, I’m just showing
you that these are available if you guys are
interested in fine tuning your particular pipeline. We also do support alerts. So again, these alerts– we do provide a
developer’s toolkit that allows you to extend
these and add your own thing. For example, one of
our users actually wrote a Slack connector,
so every time a pipeline finishes it sends
a Slack message. I think there was one–
someone wrote a HipChat one and there was a Twilio one. Like a bunch of
things like that. So you can make
the notifications pretty easy with these. So, in addition
to this, you also have the ability to schedule
your pipelines from here. You can define them here
or you can define them once they’re deployed. Now the interesting thing
here is you can actually debug the pipeline. As soon as you go
into the preview mode it provides you an ability
to test your pipeline. And it actually is going to
go to the original source and bring that data in. So right now it
supports N number. N is defaulted to
100 rows, but if you want to run a larger
sample, you should be able to do that, it
just takes a little bit more time when you are looking
to preview terabytes of data. But we are looking to
improve that and essentially be able to do this
constantly in real time so that you can look
at the data as you are building the pipeline. So all of this
actually translates– one question that
I always get asked is, is this generating
Spark or MapReduce code? Is it a code generator? And my answer to that is no,
it’s not a code generator. I don’t know how many
of you guys know, but there’s a term that was
coined a long time ago called code weaving. So it weaves all of the
different components that are being built
for one execution engine in an optimized way. So there is actually
a planner that picks up all the bits and
pieces and translates that into an execution plan. So it weaves all of the stuff
that you’ll build into a pack and it figures out what is the
optimal way of transferring the data from one
node to the other. In most cases it’s amazing
because you’re now going from one machine to the other. In most cases, you’re actually
doing in memory transfers and that’s how it gets composed
into the execution paradigm. And that’s the same thing we
can do for running it in Spark, we can do the same thing
for running it in MapReduce. And with that,
the ability for us to be able to push down some
of those optimizations– basically leverage
the optimizations that are available in
the underlying system– is extremely critical. And we’re able to do
that without having to make major architectural
changes to this. Now once a pipeline is built,
you can actually see it generates a
configuration, right? It’s a simple JSON
configuration, you can build it all
by yourself, by hand. It’s not very complicated. It has just a few sections. And for everything
there, in addition to being able to
define the graphs, it actually tracks,
for every individual node, the artifact version. So you can essentially
track every component that is being used, is
being versioned, actually. So you’re able to
version every part of it. So if you come up with a new version
of a connector, or it’s updated, if there was a bug
and you fixed it, you should be able to
use that seamlessly. And the system allows you
to migrate from version A to version B, making
sure all of those checks have been handled carefully. So with that, this is
essentially an ability to build a batch
pipeline, but you can also do the same thing with real
time pipelines, actually. So you can build a bunch
of real-time pipelines. Looks like there are a few– there are not that many
plugins available here, but that’s the reason
we have added Hub. So Hub is a place
where you can– we will be adding
more connectors, more transformations, more
reusable components here, that we’ll be making
it available here. And the best part
is you can have your own organizational
internal hub. So you can share some of
the connectors or plugins that you’re doing with the other
users within your organization. It makes it extremely
easy for you to do that. It’s a spec that we ask you– it’s actually a documented spec. So if you can follow
that spec, you can get in exactly the same
market for your organization. In addition to
what we do provide. And here, where we put stuff,
there’s a lot of things that we have built
over the years, over the period of seven years. But we’re also getting
a lot of contributions from within Google, our
open source community, as well as, we are working
with a lot of partners to be able to build this. And the one
important distinction that we’re making, right? When I talk about Data Fusion,
Data Fusion is actually not– it’s unlimited use. At every instance
there is no limit on the number of users
who can access it. That’s number one. Between basic and
enterprise we are not distinguishing connectors. Every single connector will
be available in both of them. Unlike how
connectors are generally handled, we didn’t want
to make that a choice. Essentially, we wanted to
keep it plain and simple, so every connector
that we built would be available. So from here, if you want the DB2
plugin, you can just deploy it, it gets added to your instance. That’s it. With that you are now able
to use the DB2 plugin to talk to DB2 instances. So the same thing holds true. So there are a lot of
different real time connectors that are available. Transforms get shared. Analytics of different kinds,
where you’re actually doing joins or you’re actually
aggregating data, you’re profiling data. All that stuff is
readily available. And the real-time
pipelines actually, right now when they
run in GCP, they run as Spark
streaming pipelines. OK, with that, let me
show you a little bit of– a few things that
you can do with this. So this is a very simple
example of a pipeline that was deployed here, where
you’re essentially taking data from various different
sources, joining them together, and writing it into BigQuery. This pipeline, I think,
has been scheduled to run. You get operational
metrics on this pipeline. You’re able to see
how this pipeline is behaving over a period of time. All of this data
that I’m showing you, all of the stuff that I just
did from, sorry, from the UI, you are able to do
those using REST APIs. Every single aspect of this
is available through publicly available REST APIs. So if you’re looking for
something on this dashboard, it’s actually available
as a REST API for you to pull so that
you can integrate with your internal systems
if you would like to. So, now, if you see
here, what’s happening is, when you ran this
pipeline, actually, this pipeline is using
a Dataproc profile that is automatically
created for you. So there is this notion of
provisioners and profiles and profiles are based
off of provisioners. So here is a profile that
the system automatically creates for you. And you can see,
using this profile, all the different pipelines
that are being executed and how much time they take
from an execution perspective. So you can define
different profiles for different
kinds of workloads. You know, if you’re
doing large migration, you want to be able to say
that you’re doing heavy ETL transformations. You should be able to
specify a profile for that. So that’s something that can be
controlled by administrators, but also, if needed,
users themselves can change profiles if they
have the right permissions. So you also have
the ability to look at every run of the pipelines. So this is just one
execution of the pipeline. So you can go back in time. So we do keep, for
long periods of time, all of this data in Data Fusion. And one important
thing, so this pipeline could be scheduled, right? This pipeline also
can be triggered by the execution actions
that are generated from the previous pipeline. Right, so I can say trigger
this run of the pipeline only when the previous pipeline
successfully completes, or it has just completed, or it
has failed, then trigger it. So you have an
ability to do that. This matters when you talk about
use cases where you’re looking
at many different teams involved in building a
much larger pipeline which is organization-wide.
Say there’s a team responsible for bringing data
in from external sources, standardizing it, and landing
it in a staging area. And then from there,
an analytics team picks up that data, runs their analytics,
essentially does some aggregations,
and writes them into BQ. Another team could be
taking it from there. Now the thing is,
information needs to be passed from execution
of one pipeline to the other. So for example, where you are
landing the data in the staging area. So with tokens that are being
passed between these pipelines, you’re able to do that. And you can pick and
choose which tokens you want to pass. And in this example you
have plugins you can pass, runtime arguments
you can pass, you can even pass compute
configs in there. So this is just a simple
example of where you’re doing all of the different joins. There can be far more complex
pipelines you can build. So I want to show you
guys one other pipeline. So this is a real-time
pipeline which is not running, it should be running. So this is a very simple
real-time pipeline which actually is watching for
data that is arriving on GCS. As soon as the data arrives,
it parses and cleanses the data, and sends it to data profiling. It profiles that data
for whatever window you have specified, right? Like, if you specify a 15-minute
or a one-hour window of data that’s coming in, it profiles that,
and that profile information is actually written to Pub/Sub. Now all the good data
gets written to BigQuery, and any of the errors
that are generated in the records because
of any bad data actually get separated out
and get written into GCS. So this is a real-time
program and, of course, it’s a microbatch. But when we do integrate
this with Dataflow, you will have true
streaming capability and you’ll still be able
to do this kind of stuff. So very simple things. This is how we are taking
it to the next level. So this is an
example of a use case where you want to
move the data, but you have no idea of the risk
profile of that data. And you want to be
able to make sure that if there is
any sensitive data, you exclude that
before you send it to, let’s say, your final
data store, right? So this is something
that we are going to be making
available soon in Q2, which is integration with DLP. So if I want to integrate with
DLP, I just drag the DLP plugin in and configure it, and
your DLP work is done, right? Now, you might
say, oh, how do we ensure that every data set
that I have is always passed through DLP? So think of those two nodes, and
you are essentially creating different kinds of data sets. Like, for example,
this is a GCS one that is being sent
to, let’s say, DLP. So you will have the
ability to create templates. You will take GCS
and you will take DLP and you create a
template out of that. And what that will allow you to
do is essentially– anytime anyone uses that template,
it will always be applying DLP transformations,
or DLP filtering or tokenization, on that data. So that is a much
easier way for you to [INAUDIBLE] things. This is a very simple filter example. The filter example is
actually allowing you to only filter data, but we’ll
be adding more capabilities in terms of being able to
tokenize fields and such. So you will have the ability to– right now you say, with
some confidence level, like if the
filter confidence is low, anytime there
are credit card numbers that get identified as
sensitive data, filter them out. So I can pick and choose a
bunch of different things. I will soon be able to also
tokenize them and encrypt them, a bunch of things like that. So this is all great. So this is all about
building the pipelines. I showed you guys how you
can operationalize them, what are the metrics
that are available. Logs are also similarly
available in the same fashion. But the most important thing
from an operational perspective is, can we actually monitor all
of the pipelines in one place? So you have an
ability to monitor, and there are more things
that would be enabled soon. This is actually looking
at all of the pipelines that are running
on an hour-by-hour basis, and you will be able to look
at each individual pipeline and how it was
executed, what was there, like how much time did it take? In fact, we also include
things like start delays. So what ends up
happening is, we have a lot of customers who
schedule their pipelines, actually, to be running at, like,
let’s say, 2 in the morning. And since they are all scheduled
independently of each other, the capacity within that project
gets hit because everything starts around the same time. So how do you avoid that? So this has a little bit
of a futuristic view, one. Two, it also gives you
an idea of how much time it is taking to
start something up because the resources
are not available. So you can actually
go into the future and look at what are the
different jobs that are being scheduled at every hour. And what does that mean
from a capacity perspective to be able to run
that in that hour. So you get visibility
into those kinds of things, which
makes it extremely easy from an
operations perspective. So one last thing before
we go into Q&A session is the metadata that
we talked about. So I can search– this is all in the context
of integration, right?
And I know you guys will
have lots of questions around how does this
integrate with Data Catalog? So let me give you an
answer before you ask. But the whole idea here is
we are both beta products right now, but at
some point, we will be making a read-only
version of all of this metadata available
in Data Catalog, number one. Number two, we’ll still continue
enhancing the experiences of solving the problems
from a data integration perspective in Data Fusion. So that will still remain here. So things like deep lineage,
field level lineage, a bunch of things that are very,
very relevant to integration would still be
available in here. So this is the data
set that I was looking at from the join pipeline,
and I can see the schema here, and we can attach business tags. So this is like, I
don’t know, I’m just going to tag it “doubleclick” here. So I should be able
to immediately search based on that. You can look at lineage
at a data set level, but the interesting thing
is, you can look at lineage at the individual field level. So you are able to
actually figure out what happens to
individual fields when the data is being ingested. So, for example, if I’m
looking at a landing page URL, this actually comes from
a landing page data set. If I’m looking for
a referrer URL, it’s coming from
impression data set. If I’m looking at
advertiser field, it’s coming from an
advertiser data set. But the interesting
thing is, what are the operations
that are being applied as the data was being moved? It went through different joins,
it was read from the source. So our goal is to be able
to write this in normal, plain English text,
so that you can just read this to be able to see
how your data is actually going to be moving. That is the goal
and we basically are starting to do that,
but there’s a long way to go in terms of making
it full English text so that you can just read it. But you’ll get an idea of
where we are going with this. So just being able to look at
where the data has come from and where it is going to go
gives you a lot of flexibility to do that. [MUSIC PLAYING]
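
The pipeline triggers described in the talk (start a downstream pipeline when the previous one completes or fails, and pass selected values along) can be illustrated with a small Python sketch. This is not Data Fusion's API. The `Trigger` class, the `RunStatus` values, the pipeline names, and the `staging_path` argument are all hypothetical names chosen for illustration, and the model covers only runtime arguments, not plugins or compute configs.

```python
from dataclasses import dataclass
from enum import Enum


class RunStatus(Enum):
    """Hypothetical end states of an upstream pipeline run."""
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Trigger:
    """Start a downstream pipeline when an upstream run ends in a chosen status."""
    upstream: str
    downstream: str
    on_statuses: frozenset
    # Names of upstream runtime arguments to hand to the downstream run.
    pass_arguments: tuple = ()

    def fire(self, upstream_name, status, upstream_args):
        """Return a downstream run request, or None if the trigger does not match."""
        if upstream_name != self.upstream or status not in self.on_statuses:
            return None
        # Only the explicitly selected arguments are passed along.
        passed = {
            name: upstream_args[name]
            for name in self.pass_arguments
            if name in upstream_args
        }
        return {"pipeline": self.downstream, "runtime_args": passed}


# The ingestion team lands data in a staging area; the analytics pipeline
# should run only after a successful ingest and needs the staging location.
trigger = Trigger(
    upstream="ingest_to_staging",
    downstream="analytics_aggregation",
    on_statuses=frozenset({RunStatus.COMPLETED}),
    pass_arguments=("staging_path",),
)

request = trigger.fire(
    "ingest_to_staging",
    RunStatus.COMPLETED,
    {"staging_path": "gs://example-bucket/staging/run-1", "internal_flag": "x"},
)
print(request)

# A failed upstream run does not match this trigger, so nothing is started.
print(trigger.fire("ingest_to_staging", RunStatus.FAILED, {}))
```

Selecting the arguments to pass by name mirrors the "pick and choose which tokens you want to pass" idea: the downstream team receives the staging location without seeing unrelated upstream settings.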

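The real-time pipeline shown in the talk parses and cleanses incoming records, profiles each window of data, and separates bad records from good ones. A minimal Python sketch of that flow for a single window might look like the following. It runs entirely in memory and does not call GCS, Pub/Sub, or BigQuery. The `user_id,amount` record layout and the function names `parse_record` and `process_window` are assumptions made for illustration only.

```python
def parse_record(line):
    """Parse a 'user_id,amount' line into a dict; raise ValueError on bad data."""
    parts = line.strip().split(",")
    if len(parts) != 2:
        raise ValueError("expected 2 fields, got %d" % len(parts))
    user_id, amount = parts[0].strip(), parts[1].strip()
    if not user_id:
        raise ValueError("empty user_id")
    # float() raises ValueError for non-numeric text such as 'abc'.
    return {"user_id": user_id, "amount": float(amount)}


def process_window(lines):
    """Split one window of raw lines into good records, error records, and a profile."""
    good, errors = [], []
    for line in lines:
        try:
            good.append(parse_record(line))
        except ValueError as exc:
            # Bad records are kept with the reason, to be written elsewhere.
            errors.append({"raw": line, "error": str(exc)})
    profile = {
        "records": len(lines),
        "good": len(good),
        "errors": len(errors),
        "distinct_users": len({record["user_id"] for record in good}),
    }
    return good, errors, profile


# One window of incoming data, for example everything that arrived in 15 minutes.
window = ["u1,10.5", "u2,abc", "u1,3", "broken"]
good, errors, profile = process_window(window)

print(good)     # would go to the main sink (BigQuery in the talk)
print(errors)   # would go to the error sink (GCS in the talk)
print(profile)  # would be published as profile information (Pub/Sub in the talk)
```

Processing one window at a time matches the microbatch behavior mentioned in the talk, where each batch is profiled as a unit.
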
Glenn Chapman


