Introduction to Statistics for Data Science

Site: Pedagogy Trainings - Learning Management System
Course: Pedagogy Trainings - Learning Management System
Book: Introduction to Statistics for Data Science
Date: Monday, 13 July 2020, 2:39 PM

1. An Introduction

Learning Objectives:
  1. To examine who needs statistics and how it can be used
  2. To provide a brief history of the use of statistics
  3. To outline the sub-divisions within statistics

1.1. Why should you take this course and who uses statistics anyhow?

The word statistics means different things to different people. To the manager of a manufacturing plant, statistics are the amounts of pollution being released into the atmosphere. To a food and drug administrator, statistics is the likely percentage of undesirable effects in the population using a new drug. To a bank, statistics is the chance that a customer will repay his or her loan on time. To a student, statistics are the grades in his or her subjects. All of them are using statistics to ease decision making.

The core objective of this program is to help you understand why statistics is important and how it can be used in the professional world to solve real-world challenges.

1.2. History of Statistics

The word statistik comes from the Italian word statista (meaning 'statesman'). It was first used by Gottfried Achenwall (1719 - 1772), a professor at Marburg and Göttingen in Germany.

Dr. E. A. W. Zimmerman introduced the word statistics into England. Its use was popularized by Sir John Sinclair in his work Statistical Account of Scotland 1791 - 1799. Long before the eighteenth century, however, people had been recording and using data.

2. Grouping and Displaying Data to Convey Meaning (Tables and Graphs)

The production manager of a large carpet company is responsible for the output of approximately 500 carpet looms. To avoid measuring the daily output (in meters) of every loom, he samples the output of 30 looms each day and draws a conclusion about the average carpet production of all 500 looms.
The table below shows the meters produced by each of the 30 looms in yesterday's sample.

Using the methods introduced in this program, we can help the production manager draw the right conclusion.

A collection of data is called a data set, and a single observation is called a data point.

2.1. How can we arrange data?

For data to be useful, our observations must be organized so that we can pick out patterns and come to concrete conclusions.

DATA COLLECTION
Statisticians select their observations so that all relevant groups are represented in the data. For example, to determine the potential market for a new product, analysts might study 100 consumers in a certain geographical area. The analysts must make certain that this group contains people representing variables such as income, race, and education.

2.2. Examples of raw data

Information before it is arranged and analyzed is called raw data. It is raw because it has not yet been processed by statistical methods.
The table below shows a sample of raw data in tabular form: 20 pairs of average grades in high school and college.

2.3. Arranging and Constructing a Frequency Distribution

ARRANGING DATA USING THE DATA ARRAY AND THE FREQUENCY DISTRIBUTION

A data array is one of the simplest ways to present data. It arranges the values in ascending or descending order. See the tables below for examples of data arrays.
Data arrays offer several advantages over raw data:
  1. One can quickly notice the lowest and highest values in the data
  2. The data can be divided into sections easily
  3. Values appearing more than once can be identified
  4. Distance can be measured between the succeeding values in the data
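The four advantages listed above can be illustrated with a minimal Python sketch. The values in `raw_data` are hypothetical and chosen only for illustration; they are not taken from the course tables.

```python
from collections import Counter

# Hypothetical raw data (illustrative values only, not from the course tables).
raw_data = [162, 158, 160, 158, 164, 157, 160, 163]

# A data array is the raw data arranged in order (ascending here).
data_array = sorted(raw_data)

# 1. The lowest and highest values are the first and last entries.
lowest, highest = data_array[0], data_array[-1]

# 2. The array divides easily into sections, for example two halves.
middle = len(data_array) // 2
lower_half, upper_half = data_array[:middle], data_array[middle:]

# 3. Values that appear more than once can be identified by counting.
repeated = sorted(v for v, n in Counter(data_array).items() if n > 1)

# 4. The distance between each pair of succeeding values.
gaps = [b - a for a, b in zip(data_array, data_array[1:])]

print(data_array)
print(lowest, highest)
print(lower_half, upper_half)
print(repeated)
print(gaps)
```

Sorting is the only step that creates the data array; the remaining lines simply read off properties that are hard to see in the unsorted list.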
Despite these advantages, a data array is not always helpful. Because it lists every observation, it is cumbersome to display when the data set is large. To make sense of the data, we need to compress the information while still being able to use it for interpretation and decision making. This can be done using a frequency distribution.

Do we have a Better Way to Arrange Data? A Frequency Distribution

One way we can compress the data is to use a frequency distribution, also called a frequency table. Arranging data in an array differs from arranging it in a frequency table. Let's take an example to understand this.



A frequency distribution is a table that organizes data into classes, that is, into groups of values describing one characteristic of the data.
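As a rough sketch of this definition, the Python function below counts how many data points fall into each class. The function name `frequency_distribution`, its parameters, the half-open class boundaries, and the sample values are all assumptions made for illustration, not part of the course material.

```python
def frequency_distribution(data, start, width, num_classes):
    """Count how many data points fall into each equal-width class.

    Classes are half-open: [start, start + width), [start + width, ...), ...
    Values that fall outside every class are ignored.
    """
    counts = [0] * num_classes
    for value in data:
        index = int((value - start) // width)
        if 0 <= index < num_classes:
            counts[index] += 1
    return [
        ((start + i * width, start + (i + 1) * width), counts[i])
        for i in range(num_classes)
    ]


# Hypothetical data (illustrative values only).
data = [157, 158, 158, 160, 160, 162, 163, 164]
table = frequency_distribution(data, start=156, width=3, num_classes=3)

for (low, high), count in table:
    print(f"{low} to under {high}: {count}")
```

The eight individual observations are compressed into three class counts, which is the trade-off a frequency distribution makes: some detail is lost, but the overall pattern becomes easier to see.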



Because we need to make the class intervals of equal size, the number of classes determines the width of each class. To find the width of the intervals, we can use the equation below:
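A common form of the class-width equation, given here as an assumption rather than as the exact formula used in this course, divides the range of the data by the number of classes:

```latex
\text{Width of class interval} =
  \frac{\text{largest value in data} - \text{smallest value in data}}
       {\text{number of class intervals}}
```

For example, hypothetical data ranging from 157 to 164 and split into 3 classes would give a width of (164 - 157) / 3, or about 2.33, which would typically be rounded up to a convenient value such as 3.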


2.4. Graphing Frequencies

Graphing Frequencies

2.5. Chapter review

Chapter review

2.6. Equations used along

Equations used along

2.7. Review and Application Exercises

Review and Application Exercises

3. Measures of Central Tendency | Dispersion in Frequency Distribution

Measures of Central Tendency | Dispersion in Frequency Distribution

3.1. MCT: The Arithmetic Mean

MCT: The Arithmetic Mean

3.2. MCT: The Weighted Mean

MCT: The Weighted Mean

3.3. MCT: The Median

MCT: The Median

3.4. MCT: The Mode

MCT: The Mode

3.5. Why is Dispersion Important?

Why is Dispersion Important?

3.6. Measures of Dispersion

Measures of Dispersion

3.7. Measures of Average Deviation

Measures of Average Deviation

3.8. The Coefficient of Variation: Relative Dispersion

The Coefficient of Variation: Relative Dispersion

3.9. Conclusion: Using Flow Charts

Conclusion: Using Flow Charts

4. Probability Part 1: An Introduction

Probability Part 1: An Introduction

4.1. Probability: The study of odds and ends

Probability: The study of odds and ends

4.2. Basic Terminology & types of Probability

Basic Terminology & types of Probability

4.3. Probability Rules

Probability Rules

4.4. Probabilities: Conditions of Statistical Independence

Probabilities: Conditions of Statistical Independence

4.5. Probabilities: Conditions of Statistical dependence

Probabilities: Conditions of Statistical dependence

4.6. Probabilities: Bayes' Theorem

Probabilities: Bayes' Theorem

5. Probability Part 2: Distributions

Probability Part 2: Distributions

5.1. What is a Probability Distribution?

What is a Probability Distribution?

5.2. Random Variables

Random Variables

5.3. Binomial Distribution

Binomial Distribution

5.4. Poisson Distribution

Poisson Distribution

5.5. Normal Distribution: of a Continuous Random Variable

Normal Distribution: of a Continuous Random Variable

5.6. Choosing the correct Probability Distribution

Choosing the correct Probability Distribution

6. Sampling in Statistics & its Distributions

Sampling in Statistics & its Distributions

6.1. An introduction to Sampling

An introduction to sampling

6.2. Random & Non-random Sampling

Random & Non-random Sampling

7. Testing of Hypothesis: One Sample tests

Testing of Hypothesis: One Sample tests

7.1. Basic concepts of Hypothesis testing

Basic concepts of Hypothesis testing

7.2. Testing Means when the Population Std. Deviation is known

Testing Means when the Population Std. Deviation is known

7.3. Testing Means when the Population Std. Deviation is not known

Testing Means when the Population Std. Deviation is not known

8. Chi-Square and Analysis of Variance

Chi-Square and Analysis of Variance

8.1. Chi-Square as a test of Independence

Chi-Square as a test of Independence

8.2. Analysis of Variance

Analysis of Variance

8.3. Inferences about a Population Variance

Inferences about a Population Variance

8.4. Inferences about two Population Variances

Inferences about two Population Variances

9. Correlation & Regression

Correlation & Regression

9.1. Introduction to Regression Analysis

Introduction to Regression Analysis

9.2. Estimating using a Regression Line

Estimating using a Regression Line

9.3. Correlation Analysis

Correlation Analysis

9.4. Making Inferences about Population Parameters

Making Inferences about Population Parameters

9.5. Limitations, Errors and caveats to Regression Analysis

Limitations, Errors and caveats to Regression Analysis