About these real-life datasets...

Welcome to Module #1!

Here, you'll get access to a few real-life datasets. When I say real, I mean it: the data in these datasets comes from real projects I’ve worked on in the past few years:

  1. An educational online game's usage data -- covering more than 100,000 rounds played by real players.
  2. A blueberry plant's growth data -- including moisture levels, light levels and photos of the plants.
  3. A slice of the Data36 blog's traffic data -- the log of article reads for all SQL-related articles over a five-month period -- which means 130,000+ rows and 1,000,000+ data points to analyze.

    Disclaimer: for privacy reasons, I had to change a few things in the Data36 dataset before I made it available to you. Why? Well, I collected the data in a privacy-first manner to begin with, and I have not stored any personal information (IP addresses, email addresses, location data, etc.) of website visitors… But for the course, I went even further with data anonymization to prevent any possible privacy issues. This doesn't affect your analyses: the dataset still shows real user behaviour -- it's just hashed, anonymized and changed in a few other simple ways (that I won't tell you here ;-)).

As you can see, I tried to bring diverse projects, so you can see and work with all kinds of data. At the same time, you'll also see that from a "practical" point of view, the datasets are very similar.

In the next lecture, I'll show you how you can download these datasets (to your remote server or to your computer). And if you are using a remote server, I'll also show you how you can automate the data load, so you can get your hands on the freshly generated data each day -- to make your analyses live and even more real.
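To give you a rough feel for what an automated daily data load could look like, here is a minimal Python sketch. It is only an illustration under my own assumptions, not the method from the upcoming lectures: `DATASET_URL` is a made-up placeholder (the real URLs come with each dataset's documentation), and the function names `dated_path` and `download` are hypothetical.

```python
import shutil
import urllib.request
from datetime import date
from pathlib import Path

# Hypothetical placeholder: the real dataset URLs are given in the later lectures.
DATASET_URL = "https://example.com/dataset.csv"


def dated_path(target_dir: Path, day: date, stem: str = "dataset") -> Path:
    """Build a per-day file name such as dataset_2024-01-31.csv."""
    return target_dir / f"{stem}_{day.isoformat()}.csv"


def download(url: str, dest: Path) -> Path:
    """Stream the file at `url` to `dest`, creating the folder if needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (commented out because the URL above is only a placeholder):
# download(DATASET_URL, dated_path(Path("data"), date.today()))
```

Saving each day's file under its own date means a scheduled run never overwrites the previous day's data. A scheduler such as cron could then run a script like this once a day.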

After that, the lectures will guide you through the projects/datasets one by one. For each, you'll see the "documentation" with three key elements:

  • the structure of the data: where the different tables, columns and data points are in the given datasets, what means what, etc.
  • how to get the data: URLs and/or command-line commands (what exactly you should do to download the data to your server or computer)
  • random project/analysis ideas: a few ideas by me. What do I recommend analyzing in these datasets first? I'll focus on what looks good in a junior/aspiring data scientist's portfolio. I'll give you both simpler and more advanced analysis ideas. Feel free to use these as a source of inspiration -- and I encourage you to work on your own ideas if you have better ones (which I'm sure you will).

Note: I tried to format these lectures to be very similar to database documentation sets that you'd get in a workplace.

At the end of the module, you'll find two bonus videos that were published on my YouTube channel: a crontab tutorial (how to run automations on a server) and an upload-a-file-to-a-server tutorial.