About these "random" datasets...

If you don't have any real-life datasets at hand for your data science project… APIs won't help… web scraping is not an option… Well, you can always generate the data yourself. I've done this several times before when creating data science courses (e.g. as you probably know, the data in the Junior Data Scientist's First Month course is entirely generated by a Python script I wrote). And it works like a charm.

Obviously, artificial datasets only work for artificial projects, so this method won't take you beyond a small portfolio demo project. But for that, it's just perfect.

Let's talk about the datasets in this module!

You'll get access to two randomly generated datasets:

  1. dogs & cats: this is a simpler and smaller dataset that holds ~20,000 lines of data about… you guessed it: dogs and cats. Using this dataset, you can create all sorts of simple but good-looking analyses. I attached an "unknown" dataset as well, so you can also play around with some basic machine learning classification techniques.
  2. random e-commerce data: this one is a more complex dataset from a random e-commerce store. The product they sell is "secret" -- and it doesn't really matter. (As an extra task, you can try to guess it, of course.) Here, you'll get three data files (first_visit, returning_visit, purchase) that contain 1,000,000+ lines of data in total to be analyzed.

This module is pretty similar to module #1 (real-life datasets), except that the data is not real but artificial, of course.

The rest is really the same though.

For both datasets in this module, you'll get:

  • documentation about the structure of the data
  • instructions on how to get and download the data
  • and a few random project/analysis ideas

As an extra, I'll also attach the Jupyter Notebooks that I used to generate these datasets. Looking at my Python code, you'll realize that while I call the datasets in this module "randomly generated," it actually takes quite a bit of brainwork (and code) to come up with something meaningful. Anyway, feel free to rerun my Python code and generate even more data for these projects… Or you can tweak these notebooks further and create your own artificial datasets!
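
To show what "random but meaningful" can look like, here's a minimal sketch. It is an illustration only, not the code from the notebooks: the column names, the weight numbers, and the `generate_pets` and `to_csv` helpers are all assumptions made up for this example. The point is that the randomness follows a rule (here: dogs tend to be heavier than cats), so there is a pattern for an analysis to find.

```python
import csv
import io
import random


def generate_pets(n_rows, seed=42):
    """Return a list of dict rows describing made-up dogs and cats."""
    rng = random.Random(seed)  # fixed seed, so reruns give the same data
    rows = []
    for pet_id in range(1, n_rows + 1):
        species = rng.choice(["dog", "cat"])
        # Hypothetical rule: dogs are heavier on average than cats.
        # These means and standard deviations are invented for the example.
        if species == "dog":
            mean_kg, sd_kg = 20.0, 6.0
        else:
            mean_kg, sd_kg = 4.5, 1.0
        # Clip at 0.5 kg so no pet gets a zero or negative weight.
        weight_kg = max(0.5, rng.gauss(mean_kg, sd_kg))
        rows.append(
            {"pet_id": pet_id, "species": species, "weight_kg": round(weight_kg, 1)}
        )
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["pet_id", "species", "weight_kg"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


pets = generate_pets(1000)
csv_text = to_csv(pets)
print(csv_text.splitlines()[0])  # the header line
```

Changing the seed or the number of rows gives you a different (or bigger) dataset with the same underlying pattern, which is roughly what "generate even more data" means in practice.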

Enjoy this module!
