Welcome to StatisticalExcelence! This repository is dedicated to sharing my learning journey in statistics as it applies to data science. Here, you'll find notes, code snippets, and resources that I find useful. Let's dive into the world of data together and uncover statistical insights!
Data = Facts and figures collected, measured, summarized, and analyzed for presentation and interpretation.
- Categorical data = Data grouped by a category (also known as Qualitative data)
- Nominal = Data with no rank or order, like eye color (Brown, Hazel, Brown) or phone name (iPhone, Samsung, iPhone). With this data you can count and then calculate the % of total based on the counts. If you use a number as a category label, you cannot do arithmetic (+, -, *, /, ^) on it; you can only count.
- Ordinal = Data with rank, but where the distance between ranks is unknown, like: Bad, Good, Great; or *, **, ***, ****; or 1st, 2nd, 3rd. You usually count how many are in each category and then calculate the % of total. In some cases, when a number is used to represent the category, you can calculate an average (mean).
- Quantitative data = Numeric data
- Interval = Data with rank and a fixed distance between each rank, but where zero is either not on the scale (like IQ or SAT scores) or is just another point on the scale (like Fahrenheit or Celsius temperature). Calculations like counting, % of count total, differences, and averages are OK, but ratios between two numbers are not OK.
- Ratio = Data with rank, a fixed distance between each rank, and a zero that means nothing exists. Like money, weight, height, time, Kelvin temperature. Calculations like counting, % of count total, differences, averages, and ratios are OK.
- Discrete numbers are countable values with gaps between successive values, like 1, 2, 3…, or 1.2, 1.3, 1.4…. They answer the question HOW MANY?
- Continuous numbers can take any value over a continuous range, and their precision depends on the measuring instrument. Like: time, weight, temperature, money (money does not seem continuous, but many statisticians treat it as such). They answer the question HOW MUCH?
Example:
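As a minimal sketch of the rules above, the snippet below counts a nominal variable and then contrasts an interval scale (Celsius) with a ratio scale (Kelvin). The eye colors come from the notes; the two temperature readings and all variable names are made up for illustration.

```python
from collections import Counter

# Nominal data: labels with no order, so counting is the only valid operation.
eye_colors = ["Brown", "Hazel", "Brown"]  # values taken from the notes above
counts = Counter(eye_colors)
total = sum(counts.values())
percent_of_total = {color: 100 * n / total for color, n in counts.items()}

# Interval vs. ratio: the same two temperatures on two scales.
# The readings (10 and 20 degrees C) are hypothetical illustration values.
celsius_a, celsius_b = 10.0, 20.0
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15

# Differences are meaningful on both scales and agree with each other.
difference_c = celsius_b - celsius_a
difference_k = kelvin_b - kelvin_a

# Ratios are only meaningful where zero means "nothing exists".
naive_ratio_c = celsius_b / celsius_a  # 2.0, yet 20 C is not "twice as hot" as 10 C
ratio_k = kelvin_b / kelvin_a          # roughly 1.035 on the Kelvin scale

print(percent_of_total)
print(difference_c, round(difference_k, 2), naive_ratio_c, round(ratio_k, 3))
```

The Celsius ratio changes if you switch to Fahrenheit, while the Kelvin ratio does not depend on an arbitrary zero point, which is the practical reason ratios are reserved for ratio data.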
- Other types of data:
- Cross-sectional data = Data collected at the same or nearly the same point in time. Ex: This cross-sectional data was collected on 11/5/2021
- Time-series data = Data collected over time. Ex: This data was collected for Amazon over a 5-day period
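One way to picture the difference is by the shape of the data. In the sketch below, the company names, prices, and dates are all hypothetical; the cross-section holds many entities on one date, while the time series holds one entity on many dates.

```python
from datetime import date, timedelta

# Cross-sectional: several entities observed at (nearly) the same point in time.
# Company names and prices are hypothetical.
snapshot_date = date(2021, 11, 5)
cross_sectional = {"Company A": 101.5, "Company B": 47.2, "Company C": 310.0}

# Time series: one entity observed at successive points in time.
start = date(2021, 11, 1)
prices = [100.0, 101.2, 99.8, 102.5, 103.1]  # hypothetical 5-day series
time_series = [(start + timedelta(days=i), p) for i, p in enumerate(prices)]

# A cross-section supports comparisons between entities on one date...
highest = max(cross_sectional, key=cross_sectional.get)

# ...while a time series supports comparisons across dates for one entity.
change_over_period = time_series[-1][1] - time_series[0][1]

print(snapshot_date, highest)
print(time_series[0][0], time_series[-1][0], round(change_over_period, 2))
```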
Numerical facts like: Percentages, averages, probability…
Examples:
• 5% of the people who took the CPA exam earned a score of 90 or above.
• The average cost for a Toyota Camry last year was $23,000.
• The probability that you will get a six when you roll a die is 1/6 or 0.167.
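Each of these three kinds of numerical fact can be computed in a line or two. The sketch below uses made-up exam scores and car prices (they are not real CPA or Toyota figures) and only the Python standard library.

```python
from fractions import Fraction
from statistics import mean

# Percentage: share of exam takers scoring 90 or above (scores are made up).
scores = [95, 72, 88, 64, 91, 70, 55, 83, 77, 60]
pct_90_plus = 100 * sum(s >= 90 for s in scores) / len(scores)

# Average: mean of a few hypothetical car prices in dollars.
prices = [22_500, 23_000, 23_500]
average_price = mean(prices)

# Probability: one favourable outcome out of six equally likely die faces.
p_six = Fraction(1, 6)
p_six_rounded = round(float(p_six), 3)

print(pct_90_plus, average_price, p_six, p_six_rounded)
```

Keeping the probability as a `Fraction` preserves the exact value 1/6; 0.167 is only its rounded decimal form.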
Def: “Statistics is the art and science of collecting, analysing, presenting and interpreting data to help make informed decisions.”
Definition for data analysis: Raw Data => Use Statistical Methods => Useful information for Decision Makers
Raw data that is summarized and presented:
• Tabular: Table of information
• Graphical: Charts and Graphs
• Numerical: Like an average (mean) or proportion
• The process of using data obtained from a sample to make estimates and test hypotheses about the characteristics (attributes) of a population.
• Take a sample from the population and draw reasonable conclusions that can help to estimate the unknown future.
Population: The set of all elements of interest in a particular study. (In many situations it is too costly to get data from all the elements in the population.) Example => Census: Collecting data for a population
Sample: A subset of the population. Example => Sample survey: Collecting data for a sample
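To make the population/sample distinction concrete, here is a small simulation. Everything in it is an assumption for illustration: the population is 10,000 simulated heights, the sample size is 200, and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated heights in centimetres.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Census: measure every element of the population.
population_mean = mean(population)

# Sample survey: measure only a random subset (here 200 elements).
sample = random.sample(population, k=200)
sample_mean = mean(sample)

# The sample mean serves as an estimate of the (usually unknown) population mean.
estimation_error = sample_mean - population_mean

print(f"census mean: {population_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
print(f"error:       {estimation_error:+.2f}")
```

In a real study the census line is the part you usually cannot afford to run, which is why the sample estimate matters.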
Why tabular and graphical? Because it is hard to see patterns and trends when you are looking at raw data! Our goal is to turn the raw data into useful information so that patterns and trends become visible, which helps in decision making.
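A quick way to see this is to turn a raw list into a frequency table with a crude text chart. The survey ratings below are hypothetical, and the layout is just one possible sketch rather than a recommended reporting format.

```python
from collections import Counter

# Hypothetical raw data: one satisfaction rating per survey response.
raw = ["Good", "Bad", "Great", "Good", "Good", "Great", "Bad", "Good"]
order = ["Bad", "Good", "Great"]  # ordinal categories, lowest to highest

counts = Counter(raw)
total = len(raw)

# Tabular summary: count and percent of total per category.
table = [(label, counts[label], 100 * counts[label] / total) for label in order]

# Crude text "chart": one # per response makes the pattern visible at a glance.
for label, n, pct in table:
    print(f"{label:<6}{n:>3}{pct:>7.1f}%  {'#' * n}")
```

Reading the raw list, it is not obvious which rating dominates; in the table and bars it is immediate.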