So what is Machine Learning?
In layman's terms, we feed data to a machine and the machine learns from that data. When new data is provided, the machine uses what it has learned to make a decision or a prediction.
In supervised learning, the data is labelled (i.e., every input is tagged with its corresponding output). The machine is trained on those input-output pairs so that it can make decisions about new inputs. For instance, at school the teacher first showed us how a specific problem is solved, and we then applied the same method to other problems.
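To make this concrete, here is a minimal sketch that uses only the Python standard library. The fruit measurements, the labels and the `predict` helper are all made up for illustration, and a simple 1-nearest-neighbour rule stands in for whatever model you would actually train.

```python
import math

# Labelled training data: each input (weight in grams, diameter in cm)
# is tagged with its known output. All values are invented.
training_data = [
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
    ((120.0, 6.0), "lemon"),
    ((110.0, 5.5), "lemon"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]

print(predict((160.0, 7.2)))  # a new, unlabelled fruit
print(predict((115.0, 5.8)))
```

The "teacher" here is the set of labels: the machine never sees a rule for telling the fruits apart, only examples with the right answer attached.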
In unsupervised learning, the data is not labelled. The machine has to make sense of the given data on its own and find hidden patterns in it in order to make predictions. Think of a grown-up like you and me: we don't need guidance in our daily activities, we figure things out on our own.
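As a rough sketch of "finding hidden patterns", the snippet below groups unlabelled numbers into two clusters with a tiny k-means loop. The data and the `two_means` helper are hypothetical, and starting the centers at the minimum and maximum is a simplification chosen to keep the example deterministic.

```python
# Unlabelled data: only inputs, no outputs. Values are invented.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]

def two_means(values, iterations=10):
    """Split 1-D values into two groups with a simple k-means loop (k=2)."""
    centers = [min(values), max(values)]  # deterministic starting guess
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer center.
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = two_means(points)
print(centers)
print(groups)
```

Nobody told the machine that there are "small" and "large" values; the two groups emerge from the data itself.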
Then there is reinforcement learning. Suppose you were dropped on an isolated island. You would have to learn how to live there: how to adapt to the changing climate, what to eat and what not to eat. Basically, you follow a trial-and-error approach, because you are new to the surroundings and the only way to learn is from your own experience.
Reinforcement learning is a learning method in which an agent interacts with its environment by taking actions and discovering the errors and rewards that result.
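The island story can be sketched as a tiny trial-and-error loop. Everything here is an assumption made for illustration: the action names, the fixed rewards in `REWARDS`, and the epsilon-greedy strategy (mostly pick the best known action, sometimes try a random one), which is just one common way for an agent to balance exploring and exploiting.

```python
import random

# Hypothetical island: each action has a fixed reward the agent does not know.
REWARDS = {"eat_berries": 1.0, "eat_mushrooms": -1.0, "catch_fish": 2.0}

def explore_island(steps=50, epsilon=0.2, seed=0):
    """Learn which action is best by trial and error (epsilon-greedy)."""
    rng = random.Random(seed)
    actions = list(REWARDS)
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}

    def value(a):
        # Average reward seen so far for action a (0.0 if never tried).
        return totals[a] / counts[a] if counts[a] else 0.0

    for step in range(steps):
        if step < len(actions):
            action = actions[step]            # try every action once first
        elif rng.random() < epsilon:
            action = rng.choice(actions)      # explore: pick a random action
        else:
            action = max(actions, key=value)  # exploit: best action known so far
        reward = REWARDS[action]              # the environment responds
        totals[action] += reward
        counts[action] += 1

    return max(actions, key=value)

print(explore_island())
```

The agent is never told which food is good. It only sees the reward after acting, and its estimates improve with experience.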
One of the most important things to do before you start working on a problem is to create a data dictionary. A data dictionary describes what each column or feature of your dataset actually means.
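A data dictionary can be as simple as a mapping from column name to description. The sketch below assumes a made-up customer dataset; the column names and the `undocumented_columns` helper are hypothetical.

```python
# Hypothetical data dictionary for a small customer dataset.
data_dictionary = {
    "customer_id": "Unique identifier for each customer",
    "age": "Age of the customer in years",
    "income": "Annual income in US dollars",
    "churned": "1 if the customer left the service, 0 otherwise",
}

def undocumented_columns(columns, dictionary):
    """Return the columns that have no entry in the data dictionary."""
    return [c for c in columns if c not in dictionary]

# Columns as they might appear in the actual dataset.
dataset_columns = ["customer_id", "age", "income", "region", "churned"]
print(undocumented_columns(dataset_columns, data_dictionary))
```

A quick check like this flags columns nobody has described yet, before they end up in a model.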
Once your business problem statement is ready, data exploration is the next step: analysing, summarising, visualising and becoming familiar with the dataset. A data science project is not just about creating models. Any time you build a machine learning model, you have to preprocess the data so that the model can be trained properly. It is commonly said that around 70% of the total time goes into exploring, cleaning and preparing the data.
- Univariate Analysis - Univariate analysis means analysis of a single variable. It mainly describes the characteristics of the variable.
-- If the variable is numerical, patterns can be found by looking at the mean, mode, median, range, variance, maximum, minimum, quartiles and standard deviation. These can be displayed using histograms and frequency distribution tables; boxplots are the best choice for visualizing outliers.
-- If the variable is categorical, we can use either a bar chart or a pie chart to see the distribution of the classes in the variable.
- Bivariate Analysis - Bivariate analysis involves checking the relationship between two variables simultaneously.
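Both kinds of analysis can be sketched with the standard library alone. The sample values below are invented, and the `pearson` helper is a hand-written Pearson correlation coefficient, used here as one possible measure of the relationship between two numerical variables.

```python
import statistics
from collections import Counter

# Invented sample: ages, incomes (in thousands) and a categorical column.
ages = [22, 25, 25, 31, 38, 45, 52]
incomes = [20, 24, 26, 35, 41, 50, 58]
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi", "Mumbai"]

# Univariate, numerical: summary statistics for a single variable.
age_summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "range": max(ages) - min(ages),
    "stdev": statistics.stdev(ages),
}

# Univariate, categorical: how often each class occurs.
city_counts = Counter(cities)

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Bivariate: relationship between two variables at once.
print(age_summary)
print(city_counts)
print(pearson(ages, incomes))
```

In practice you would usually plot these as well (a histogram or boxplot for `ages`, a bar chart for `city_counts`, a scatter plot for age against income), but the numbers alone already tell you a lot.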
Data wrangling, or data cleaning, is the process of identifying and correcting or removing inaccurate records in a dataset. Common steps include:
- Removing duplicates
- Variable transformation
- Variable creation
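Here is a minimal sketch of these three steps on a handful of invented records. The field names, the choice of a base-10 log as the transformation and the `savings` feature are all assumptions made for illustration.

```python
import math

# Invented raw records; the second row is an exact duplicate of the first.
raw_rows = [
    {"id": 1, "income": 1000.0, "expenses": 400.0},
    {"id": 1, "income": 1000.0, "expenses": 400.0},
    {"id": 2, "income": 100000.0, "expenses": 30000.0},
]

def remove_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def add_features(row):
    """Return a copy of the row with one transformed and one new variable."""
    new_row = dict(row)
    # Variable transformation: a log scale compresses very large incomes.
    new_row["log_income"] = math.log10(row["income"])
    # Variable creation: derive savings from two existing columns.
    new_row["savings"] = row["income"] - row["expenses"]
    return new_row

clean_rows = [add_features(r) for r in remove_duplicates(raw_rows)]
for row in clean_rows:
    print(row)
```

On a real project you would most likely do this with a dataframe library, but the logic stays the same: drop repeated rows, reshape existing variables and derive new ones.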