
How to Make a Data Science Project with Kaggle (AI Adventures)


YUFENG GUO: On this episode of “Cloud AI Adventures,” I’ve invited Megan Risdal to join me on the show. Together, we’ll cook up our own data science project on Kaggle. How are you doing today, Megan?

MEGAN RISDAL: I’m doing great. Thanks so much for having me on your show.

YUFENG GUO: Awesome. And before we get going, I wanted to let you have a chance to talk a little bit about what you do at Kaggle and your role.

MEGAN RISDAL: Sure. So I’m the product lead for datasets at Kaggle. And what that means is that I work with our engineers, our designers, as well as our community to build tools that help data scientists find, share, and analyze data. And today, what we want is for Kaggle to be the best place for our 1.7 million data scientists to share and collaborate on data science projects.

YUFENG GUO: Awesome. And so today, we’ll be working together to use the freshest ingredients–

MEGAN RISDAL: Data.

YUFENG GUO: –and prepare them using different tools and work together to come up with our delicious outcome, this public dataset and notebook that we can share with the world that has cool analysis to go with it.

MEGAN RISDAL: Yeah. That’s exactly right. And I’m excited today because we’re going to really make this a collaborative project. So that’s how we’re going to get things done is together– teamwork.

YUFENG GUO: Teamwork. All right. Let’s go. So Megan, on a previous
episode of “AI Adventures,” I had a video that showed how to get started with Kaggle kernels. And it was pretty rudimentary in terms of just get started, go, it’s awesome, it’s a free resource. But since then, there’s been a couple of new features released that really enhance the functionality of Kaggle, both kernels and datasets, to be used as a great tool for individuals and teams.

MEGAN RISDAL: Yeah, that’s exactly right. So today, Kaggle’s a really great place for people who use R and Python to work with data. They’re looking, really, to build data science portfolios, do data analysis work, or even share research. It takes a lot of tools to do data science. And Kaggle really acts as this one-stop shop that provides all of these tools that make this possible, from working with data privately, to sharing it with the world.

YUFENG GUO: And that really– it’s really fantastic. Let’s explore a little bit more about the fact that Kaggle datasets and kernels can support this kind of collaborative model, this private mode, if you will.

MEGAN RISDAL: So some more recent features are the ability to publish and work with private datasets and kernels. And speaking of kernels, this is basically like a laptop in the cloud. It’s more powerful than the laptop that I’m working with here today. You’ve got 16 gigs of RAM, four CPUs, six hours of compute. And one of the really exciting things is that it is all in a Docker container that has all of these packages that data scientists love pre-installed. So you’ve got this environment, this one-click environment. And then finally, we’re starting to add on more customization, so if there’s any packages missing, I can install those or even do things like add a GPU.

YUFENG GUO: Ooh.

MEGAN RISDAL: Yeah.

YUFENG GUO: Very nice. We’ve picked out a particular dataset today to play with around data from the city of Los Angeles if I understand correctly.

MEGAN RISDAL:
Yeah, that’s right. So a lot of governments and organizations from around the world and in the United States are making open data available as part of their open-data initiatives to make their work more transparent. So I’m from Los Angeles. I live in Los Angeles. And I was kind of interested in taking a look at some of the open data that the city of Los Angeles makes available. So I was poking around on their open-data portal. And this one caught my eye because I’m a little bit of a foodie. It’s a little interesting. But it’s actually environmental health code violations from restaurants and markets in Los Angeles.

YUFENG GUO: OK. All right, let’s get into it. Yeah.

MEGAN RISDAL: Yeah, so what I’ve done is I downloaded the dataset. So it’s on my local machine right now.

YUFENG GUO: Great.

MEGAN RISDAL: And what we’re going to do is upload it to Kaggle. This is going to be the foundation of our project.

YUFENG GUO: Awesome. And one of the things that, a lot of times, I hear about is– and some people are concerned around distributed computing and massive datasets. And you just mentioned, you downloaded this dataset to your local machine. And some folks say, oh, I need lots of compute and resources. Is Kaggle going to be powerful enough to support my use case? And I guess, looking at Kaggle, and just the vibrant community you mentioned– was it 1.7 million?

MEGAN RISDAL: Yeah. That’s where we’re at today.

YUFENG GUO: That’s amazing. It clearly shows that there are so many use cases beyond the massive, massive datasets out there. There are situations where you can get away with just one powerful machine that can take you quite far.

MEGAN RISDAL:
Yeah, that’s right. Yeah, and we– people are uploading thousands of datasets per month.

YUFENG GUO: Yeah, wow. All right, so let’s go over to your laptop and see how we go about doing that. How do we make a new dataset on Kaggle?

MEGAN RISDAL: Sure. So we’re going to start from the datasets page on Kaggle’s website. So this is what it looks like. And basically, this is where you have access to all of the datasets that have been publicly published on Kaggle. And we’re going to add on our own today. So what I’m going to do is I’m going to click New Dataset. And then, from here, it’s just a matter of dragging and dropping the files that I’ve chosen to upload. And these are inspections of restaurants and markets in Los Angeles and then violations. And then, we’re going to add a little bit of metadata to get the dataset started. So I’m just going to grab all of the information I need here. So we’re going to keep it private because, like we talked about, we want to prepare the dataset so that it’s well documented. And then we’re also going to play around with the data a little bit and create some kernels before we share it publicly.

YUFENG GUO: Yeah, awesome. And that’s definitely something that doesn’t get talked about as much is documentation for datasets.

MEGAN RISDAL:
Yeah, that’s right.

YUFENG GUO: The documentation for code is very well understood, and people hammer that home. But documentation about datasets is kind of a new concept.

MEGAN RISDAL: Right. Yeah, it’s really about making data accessible. It’s not just making the data files themselves machine readable– so having well formatted CSVs– but also helping anybody who’s interested in working with this data really understand it. So I’m going to go ahead and just click on Create Dataset.

YUFENG GUO: Fantastic. All right. And your private dataset was successfully created.

MEGAN RISDAL: Yay.

YUFENG GUO: Whoo.

MEGAN RISDAL: Cool. So now the private dataset was uploaded. And like it tells us here, now we can do anything from starting to analyze the dataset already to adding collaborators, and we’re going to do both of those things.

YUFENG GUO: Fantastic.

MEGAN RISDAL: So we’ll click confirm, and it’s going to take us to our dataset.

YUFENG GUO: Looking good.

MEGAN RISDAL: Yeah.

YUFENG GUO: That’s, like, a real thing.

MEGAN RISDAL:
Yeah, that’s right. So what we want to do when people create a private dataset is make it easy for them to, then, make that dataset public eventually and share it with the community. So we provide this quality checklist that helps people basically document their dataset and help them be successful when they share it. So we’re just going to quickly go through this quality checklist. So the first is providing a description. And this is just a markdown file, so I have it saved here.

YUFENG GUO: Great. Yeah. I mean, that’s really nice that there’s some guidance on what sorts of things to add in to make a dataset nice.

MEGAN RISDAL: Yeah, yeah, that’s right.

YUFENG GUO: Make for a good experience.

MEGAN RISDAL: Yeah. So I think that things like understanding the context of the data and why it’s interesting and why you’re sharing it is important, as well as providing more details about the contents of that dataset, so that’s what we’ve done here. And then also inspiration– so some questions that you can use the data to answer.

YUFENG GUO: Yeah. I’ve seen that in some of the other datasets out there. Now I know why [INAUDIBLE], there’s some guidance there.

MEGAN RISDAL: That’s right. Yeah. So then the next thing on this page is we’re going to add just a couple of tags. And this helps make the dataset more discoverable once we’re ready to share it publicly. So we’ll do Public Health and Food and Drink.

YUFENG GUO: Seems reasonable.

MEGAN RISDAL: Seems reasonable. So then we’re going to add a subtitle and a banner image. And this is just to add that final coat of paint to make it look good and, again, help people understand what the dataset is about.

YUFENG GUO: Yeah– a little flair.

MEGAN RISDAL:
Yeah, that’s right.

YUFENG GUO: OK.

MEGAN RISDAL: So we’ll save that.

YUFENG GUO: And we want them to replace this image?

MEGAN RISDAL: Yeah. So this is what people will see in the dataset listing. And you’re not supposed to judge a dataset by its cover. But if it has a flashy image– that can only help.

YUFENG GUO: Yes. I always pick datasets that have an image of a sliced onion over ones that don’t.

MEGAN RISDAL: That’s right. It looks delicious. And then finally, the most important part is I’m going to add you as my collaborator on this dataset.

YUFENG GUO: So now I get to see it?

MEGAN RISDAL: Yeah.

YUFENG GUO: OK. So eventually–

MEGAN RISDAL: There you are. And I will grant you edit access.

YUFENG GUO: Well, thank you. “Megan Risdal invited you to edit the dataset.” Great. And so I can click View on Kaggle?

MEGAN RISDAL: Yeah.

YUFENG GUO: And let’s see what that looks like. Awesome. So this looks basically the same as it looks on your side.

MEGAN RISDAL:
Yeah, that’s right. Cool. So we have uploaded our data, we’ve documented it, and I’ve shared it with you. One of the things that we like to encourage people to do is to also document their datasets through code. So what I mean by that is publishing a kernel on a dataset is one way to demonstrate to users, and other people in the community, what they can do with your data. So we might want to show somebody in a kernel how they can read in the data, some of the things that we can visualize using the data, questions that can be answered using it.

YUFENG GUO: Yeah. I mean, when I see datasets on Kaggle these days, they all have these exploration notebooks with fancy visualizations, and it’s really nice.

MEGAN RISDAL: Yeah. Yeah, exactly. And when you start working with a new dataset, usually, when you’re working locally, you’re starting from a blinking cursor. You don’t have any code that shows you how to read in the data and how to work with it. So that’s what we’re going to do is, we’re going to additionally document our dataset by publishing a kernel on it.

YUFENG GUO: Fantastic.

MEGAN RISDAL: Let’s get started. I’m just going to click on this Big Blue Button, as we call it– New Kernel.

YUFENG GUO: Yes.

MEGAN RISDAL: So here, we have a choice between a script and a notebook. I’m going to go with notebooks because I like interleaving markdown and code. And then while this starts up, I can see that I have the data accessible right here at my fingertips in my environment.

YUFENG GUO: Great.

MEGAN RISDAL: And I’m going to change the language to R. I am an R Stats person. That’s right.

YUFENG GUO: All right.

MEGAN RISDAL: Cool. So what I’ve done is I cheated, and I already prepared the code that I’m going to use. So I’m just going to quickly upload it here. And then I’m going to walk you through what I’ve done to analyze the dataset.

YUFENG GUO: Great.

MEGAN RISDAL: So
in the first cell, we have the inspections CSV file and the violations CSV file. So I’m going to go ahead and read those in, join them together by serial number, and then take a glimpse at the resulting data frame.

YUFENG GUO: OK.

MEGAN RISDAL: So once that’s done, you can see that we have almost 900,000 records that we’re looking at. So these are all health code violations for about two years of data.

YUFENG GUO: OK. That’s a lot for two years.

MEGAN RISDAL: Yeah, it– yeah, it seems like it. So we’re going to dig into what that looks like. So what I want to do, now that I’ve got the dataset prepared, in the shape I want it, is look at the number of violations reported over time by month.

YUFENG GUO: Right. This is the big one.

MEGAN RISDAL: Yeah, exactly. All right.

YUFENG GUO: All right.

MEGAN RISDAL: So you can see how quick and snappy that is. And we’ve got this visualization that’s– cool– right in front of us. So that’s a lot of health code violations.

YUFENG GUO: Yeah. It’s all over the place. What does that look like– what is that bar– 30,000?

MEGAN RISDAL: Yeah, that’s right.

YUFENG GUO: In a month?

MEGAN RISDAL: Yes. Yep.

YUFENG GUO: That’s a doozy.

MEGAN RISDAL: Yep. So let’s take a look and see if there are any seasonal trends. And we also have information about what the violations were for each serial number. So we’ll take a look at that. And we’re going to look at just the top 10 violations, so that’s what this code is going to be doing here.

YUFENG GUO: We run that, and then we’re going to get– wow, very nice color coding here, yeah. Is that– the darker one is more, or the lighter ones are more?

MEGAN RISDAL: The lighter ones are more.

YUFENG GUO: OK.

MEGAN RISDAL: Yeah. Yeah, so you can see this one here is a violation of the code for floors, walls, and ceilings are properly built, maintained, in good repair, and clean.

YUFENG GUO: OK.

MEGAN RISDAL: Yeah.

YUFENG GUO: It’s always comforting to know that your establishment is in good repair.

MEGAN RISDAL:
Yeah, that’s right. And then finally, I’m going to just save another project for later that I have in mind, which is, I want to look at the violations by zip code. So we’ve also got information for each of the facilities with what their address is. So we can look at whether or not there are more violations by zip code and look at a geospatial analysis. But I want to do that a little bit later. So I’m just going to write that CSV to a file. And I’ll be able to use that in another kernel.

YUFENG GUO: Right. And you can imagine– I’m just trying to think about this new output that you’ve created, you could make some kind of mapping with it. You could do one of those fancy color-coded heat maps.

MEGAN RISDAL: Right, yeah.

YUFENG GUO: We have this sort of heat map, which shows the violations by type, but you could also show–

MEGAN RISDAL: Yeah, like a choropleth map geospatial–

YUFENG GUO: There’s a tongue twister.

MEGAN RISDAL: Yeah, choropleth. Yeah, exactly, and you can see, now, how taking just a peek at this dataset has already inspired new questions. And that’s exactly what we want to do for our users. So I’m going to go ahead and give my notebook a title.

YUFENG GUO: Yes, always good to have a title.

MEGAN RISDAL: Yeah. And then I’m going to hit Commit and Run.

YUFENG GUO: OK. So let’s hit that. And while that’s running, a question for you– does the notebook not save if you haven’t clicked Commit and Run? If you were to close that tab before you click that, what would happen to all that code, all that work?

MEGAN RISDAL: So it’s saving a draft. But if you want to save your code and come back to it later and share it with other people, you want to hit Commit and Run. And what that does is it executes the code from top to bottom.

YUFENG GUO: Right. Perfect. So once that’s done, I
guess– what is our next step? What’s our plan here with– because right now, we have a dataset that’s private, but shared between us, and we have this kernel, which I think is still just private to you, right?

MEGAN RISDAL: Yeah, so once this is finished, I’m going to go ahead and click View Snapshot. And this is going to take us to the Notebook Viewer. And from here, this is what I’ll be sharing with the world. And this is what somebody looking at the dataset can come and find. So I’m going to go ahead and, again, share this with you just to make sure that you think that all of our work is ready to be made public.

YUFENG GUO: Right, yeah, so in a team environment, you could do this to essentially do some sort of a code review scenario.

MEGAN RISDAL: Yeah, exactly.

YUFENG GUO: OK. So once you’ve done that, I can go over here on my laptop and, in the dataset, click Kernels and go to your work, which I guess, in this case, is your work.

MEGAN RISDAL: Right.

YUFENG GUO: And we’ll open up your notebook here. And you can see that it loads nicely. And I have the option to either edit or to fork the notebook.

MEGAN RISDAL: Yeah. So why don’t you go ahead and fork it, and just make sure that everything runs as expected, and you can get everything to compile.

YUFENG GUO: So when I fork it, is that then similar to when you fork a repo on GitHub–

MEGAN RISDAL: Right.

YUFENG GUO: –where you make your own copy? Now, this is really mine?

MEGAN RISDAL: Yeah. This is your copy of, not just the code, but also the data that I used and the environment that I used.

YUFENG GUO: OK, gotcha. And so anything you now make changes to on your side won’t affect my copy.

MEGAN RISDAL: Correct.

YUFENG GUO: OK. So now, I’m running it, and this will generate a different kernel. Do I need to change the name? Will there be a name collision there if I leave it the same?

MEGAN RISDAL: You don’t need to change it. So the slug that gets used is your username and then the slug of the notebook title.

YUFENG GUO: Gotcha. So I could change it, but I don’t have to.

MEGAN RISDAL: Right.

YUFENG GUO: Now, if I go back to– oh, we can watch it do its thing, and I could share my fork with other folks.

MEGAN RISDAL:
Yeah, that’s right.

YUFENG GUO: And we can see our data. If I click back now, I guess this is– it should be– once it finishes, it’ll just show up?

MEGAN RISDAL: There it is.

YUFENG GUO: And there it is. All right.

MEGAN RISDAL: Awesome. So what do you think?

YUFENG GUO: It’s pretty good, pretty good. I guess it’s time to make this thing public for real.

MEGAN RISDAL: Yeah, let’s go public.

YUFENG GUO: All right.

MEGAN RISDAL: Cool. So I’m going to go back to the dataset. And I go to Settings, Sharing, and if we think we’re ready, we can click Make Public.

YUFENG GUO: All right, let’s do it. So this is a Make Public Permanently.

MEGAN RISDAL: That’s right.

YUFENG GUO: Great. That’s always good to know what you’re getting into here. Nice.

MEGAN RISDAL: And here we go. And then the next step is, of course, we want to make the kernel public.

YUFENG GUO: Oh, right. Because the kernel itself is separate from the dataset, and so those two concepts are distinct.

MEGAN RISDAL: That’s right.

YUFENG GUO: And so in this situation, it could be one where you wrote some stuff, and then you make yours public. But then I fork it, and it’s private. And I can extend it privately with you or with other folks and then release another version, perhaps, with some different analysis.

MEGAN RISDAL: Right. Yeah, exactly. So that flexibility is up to you.

YUFENG GUO: Awesome.

MEGAN RISDAL: So our dataset is public. And anybody from our community of data scientists can go ahead and explore more about restaurant inspections and violations in Los Angeles County.

YUFENG GUO: That’s right. That’s right. And zooming out and looking
at what we’ve covered today, it’s quite a bit. And the tools are all in this package, in this really nice, seamless platform. I really enjoyed going through it. And we made a notebook and dataset on your side. We were able to share it across privately and then publicly. And we didn’t even get into things like the commenting system and the discussion forums. And there’s so much more to Kaggle. But even this environment of collaboration and sharing is so rich.

MEGAN RISDAL: Yeah. So we really did create a project from start to finish. We got data files off of my local machine and into this reproducible, documented dataset that’s now publicly shared with the world. And you can kind of see how somebody could do this for a school project or as a way to share research.

YUFENG GUO: Absolutely. Yeah. So this notebook is really public. So if you’re watching this video, you can go on Kaggle right now and access this dataset. We’ll include links to the notebooks in the description below the video and share it, and then you’ll be able to see the notebook, the dataset, and post comments, fork your own notebook, make edits. Thanks so much for joining me today, Megan, on the show. It’s been really fun putting together this Kaggle kernel. We’re making this dataset and making it public to the world. If you’ve liked this video, be sure to hit the Like button down below and click Subscribe to get all the episodes of “Cloud AI Adventures” right when they come out. For now, me and Megan, we’re going to go back to working on this kernel. But this time, maybe I can convince her to do it in Python.

MEGAN RISDAL: We’ll see about that.

YUFENG GUO: All right.
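
Editor's sketch (not shown in the video): Megan's notebook was written in R, and its code does not appear in the transcript. As a rough illustration of the steps she describes (read the inspections and violations files, join them by serial number, count violations by month, list the most common violations, and write a by-zip-code summary to CSV for a later kernel), here is a minimal Python version. It uses only the standard library so it runs anywhere; in a real kernel pandas would probably be the more natural choice. The sample rows, the column names (`serial_number`, `activity_date`, `violation_description`, `facility_zip`), and every function name are assumptions made up for this example, not the real dataset schema.

```python
import csv
import io
from collections import Counter

# Tiny made-up stand-ins for the two uploaded files. The column names
# are assumptions for illustration, not the real dataset headers.
INSPECTIONS_CSV = """serial_number,activity_date,facility_zip
A1,2017-01-15,90012
A2,2017-01-20,90012
A3,2017-02-03,90026
"""

VIOLATIONS_CSV = """serial_number,violation_description
A1,Floors walls and ceilings
A1,Food storage
A2,Floors walls and ceilings
A3,Floors walls and ceilings
A3,Hand washing
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_by_serial(inspections, violations):
    """Attach inspection fields to each violation with the same serial number."""
    by_serial = {row["serial_number"]: row for row in inspections}
    joined = []
    for violation in violations:
        inspection = by_serial.get(violation["serial_number"])
        if inspection is not None:  # inner join: skip unmatched violations
            joined.append({**inspection, **violation})
    return joined


def violations_per_month(rows):
    """Count violations per YYYY-MM bucket (dates assumed to be ISO formatted)."""
    return Counter(row["activity_date"][:7] for row in rows)


def top_violations(rows, n=10):
    """Return the n most frequent violation descriptions with their counts."""
    return Counter(row["violation_description"] for row in rows).most_common(n)


def violations_per_zip(rows):
    """Count violations per facility zip code."""
    return Counter(row["facility_zip"] for row in rows)


def write_counts_csv(counts, key_name, out):
    """Write a {key: count} mapping as a two-column CSV to a file-like object."""
    writer = csv.writer(out)
    writer.writerow([key_name, "violations"])
    for key, count in sorted(counts.items()):
        writer.writerow([key, count])


rows = join_by_serial(read_rows(INSPECTIONS_CSV), read_rows(VIOLATIONS_CSV))
monthly = violations_per_month(rows)
top = top_violations(rows, n=2)

# In a kernel this would be an output file; an in-memory buffer keeps the
# sketch self-contained.
buffer = io.StringIO()
write_counts_csv(violations_per_zip(rows), "facility_zip", buffer)

print(dict(monthly))
print(top)
print(buffer.getvalue())
```

With real files, `read_rows` would be swapped for `csv.DictReader` over an opened file, and the column names would need to match the actual headers in the published dataset.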

Glenn Chapman

28 Comments

  1. I love how they decorate their laptops. I'm thinking of buying some stickers soon, gotta get some flames so it looks like it's speeding, also some skulls for dead code and some tensorflow and python ones. Does anyone know where I can get those tensorflow and python stickers?

  2. 12:05 I see a smiling clown with a triangular hat flipping the bird in the spacing of that dataset.

  3. Thanks for the video!! I love these walkthroughs for newbs like me and yes please do other vids in python as more of us are familiar with that.

  4. In the view of "Big Data", once we have a large raw data file, does using pandas, numpy processing data on Google Cloud Storage have a comparable performance with Hadoop HDFS and Spark?

    And in the view of "Machine Learning", how does sklearn compare with Spark-MLlib ?

    Thank you.

  5. Thanks for Nice Video.
    Where can I buy those sticker you have on your laptop, I like it.

  6. Fantastic primer about the amazing developments at Kaggle! This will greatly facilitate the data science community's need for collaborative modeling. Thanks so much! Megan really inspired Kaggle usage, with the example dataset. I'll be visiting and interacting at Kaggle, a lot more, for sure. Megan, arguably the most beautiful data scientist in the world, thanks so much again for drawing attention to the amazing accomplishments at Kaggle. Inspire on!

  7. can anyone tell me the definition of kernel in data science here briefly ?

  8. Awesome presentation to you both! Have her on more, very pleasant to listen to, very informative…Go KagGoo!

  9. I am a simple man I see a pretty woman, I click on video, I hit like, I comment.
