Meritshot Tutorials

Statistics in R

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It provides tools and methodologies to make informed decisions or predictions based on data, especially when faced with uncertainty.

Key Components of Statistics:

  1. Data Collection: Gathering information or observations in a systematic way.
    • Example: Conducting surveys, experiments, or using existing data.
  2. Data Organization: Arranging and summarizing raw data to make it more understandable.
    • Example: Tabulating survey results.
  3. Data Analysis: Applying mathematical methods to understand patterns, relationships, and trends in data.
    • Example: Using mean, variance, and correlation to find relationships between variables.
  4. Interpretation: Making sense of the results from the analysis to derive insights and conclusions.
    • Example: Concluding that a new drug is effective based on test results.
  5. Presentation: Conveying data and insights clearly through reports, graphs, or charts.
    • Example: Using histograms, pie charts, or dashboards to present survey results.

Branches of Statistics:

  1. Descriptive Statistics:
    • Focuses on summarizing and describing the main features of a dataset.
    • Key Tools: Mean, median, mode, standard deviation, and histograms.
    • Example: Finding the average age of students in a class.
  2. Inferential Statistics:
    • Makes predictions or inferences about a population based on a sample of data.
    • Key Tools: Confidence intervals, hypothesis testing, and regression analysis.
    • Example: Predicting election results based on a survey of a small group of voters.

Types of Data in Statistics:

  1. Qualitative Data: Describes attributes or categories.
    • Example: Gender, race.
  2. Quantitative Data: Represents numerical values.
    • Example: Age, income.
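
The distinction between the two data types can be sketched in R. This is a minimal illustration, assuming the built-in iris dataset, where Species is a categorical column and Sepal.Length is a numeric one; the variable names species_counts and mean_sepal_length are chosen here for illustration only.

```r
# Built-in dataset: Species is qualitative, Sepal.Length is quantitative
data(iris)

is.factor(iris$Species)        # qualitative data is typically stored as a factor
is.numeric(iris$Sepal.Length)  # quantitative data is stored as numeric

# Qualitative data is usually summarized by counts per category
species_counts <- table(iris$Species)
species_counts

# Quantitative data is usually summarized by numerical measures
mean_sepal_length <- mean(iris$Sepal.Length)
mean_sepal_length
```

The data type determines which summaries make sense: counting categories for qualitative data, and computing measures such as the mean for quantitative data.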

Importance of Statistics:

  • Decision Making: Helps in making informed decisions based on data rather than guesswork.
  • Understanding Variability: Identifies patterns, trends, and variations in data.
  • Predicting Future Outcomes: Provides methods for forecasting based on historical data.
  • Risk Assessment: Assists in understanding risks and making judgments in uncertain situations.

Statistics in R programming involves a wide range of functions, libraries, and techniques for statistical analysis. R is widely used in the field of statistics because of its extensive set of built-in tools for handling data, performing statistical tests, and visualizing results.

In summary, statistics is essential for transforming raw data into meaningful information, which can be used in research, business, science, and various other fields for decision-making and problem-solving.

Next, we will look at the main concepts and types of statistics and how to implement them in R.

1.  Descriptive Statistics in R

Descriptive statistics provide simple summaries about the sample and the measures. These include measures of central tendency (mean, median, mode) and measures of variability (variance, standard deviation, range).

a) Basic Descriptive Statistics

You can begin with a simple dataset in R. For this, you can use built-in datasets like mtcars or iris.

# Loading built-in datasets
data(mtcars)
data(iris)

# Viewing the first few rows of mtcars dataset
head(mtcars)

1.1  Measures of Central Tendency

  • Mean: The average of the values.

mean(mtcars$mpg) # Mean of miles per gallon (mpg)

  • Median: The middle value when the data is sorted.

median(mtcars$mpg)

  • Mode: The most frequent value in the dataset (R does not have a built-in mode function, so you can use a package or write a custom function).

# Install the modeest package to calculate the mode
install.packages("modeest")
library(modeest)
mfv(mtcars$mpg) # Most frequent value (mode) of mpg
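
The custom-function route mentioned for the mode can be sketched in base R. The helper name get_mode is hypothetical (it is not part of R or of any package), and this version returns every value tied for the highest count rather than picking one.

```r
# Hypothetical helper: returns all values that occur most often in x
get_mode <- function(x) {
  unique_values <- unique(x)
  # Count how many times each unique value appears
  counts <- tabulate(match(x, unique_values))
  unique_values[counts == max(counts)]
}

get_mode(c(1, 2, 2, 3, 3, 3))  # a single mode
get_mode(mtcars$mpg)           # may return several values when there are ties
```

Returning all tied values avoids silently hiding the fact that a dataset can have more than one mode.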

1.2  Measures of Dispersion

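  • Variance: The spread of the data around the mean.

var(mtcars$mpg)

  • Standard Deviation: The square root of the variance.

sd(mtcars$mpg)

  • Range: The difference between the maximum and minimum values. Note that range() returns the minimum and maximum themselves, so wrap it in diff() to get the difference.

range(mtcars$mpg) # Minimum and maximum
diff(range(mtcars$mpg)) # Maximum minus minimum

  • Interquartile Range (IQR): The range between the 1st and 3rd quartiles.

IQR(mtcars$mpg)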
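
The relationships between these measures can be checked directly. The sketch below assumes the built-in mtcars dataset; the variable names (mpg, mpg_spread, quartiles) are chosen here for illustration.

```r
mpg <- mtcars$mpg

# The standard deviation is the square root of the variance
all.equal(sd(mpg), sqrt(var(mpg)))

# range() returns c(min, max); diff() turns that into a single spread value
mpg_range <- range(mpg)
mpg_spread <- diff(mpg_range)
mpg_spread

# The IQR is the 3rd quartile minus the 1st quartile
quartiles <- quantile(mpg, c(0.25, 0.75))
all.equal(IQR(mpg), unname(quartiles[2] - quartiles[1]))
```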

1.3  Data Summary

  • Summary Statistics: Provides a full summary (min, 1st quartile, median, mean, 3rd quartile, max).

summary(mtcars$mpg)

1.4  Visualizing Descriptive Statistics

Visualization helps you understand how the data is distributed.

Histogram:

hist(mtcars$mpg, main = "Histogram of MPG", xlab = "Miles Per Gallon", col = "lightblue")

Boxplot:

boxplot(mtcars$mpg, main = "Boxplot of MPG", ylab = "Miles Per Gallon", col = "lightgreen")

Scatter Plot:

plot(mtcars$wt, mtcars$mpg, main = "Scatter Plot", xlab = "Weight", ylab = "MPG", col = "red")

2.  Inferential Statistics in R

Inferential statistics involve making predictions or inferences about a population based on a sample of data. The common techniques include hypothesis testing, confidence intervals, correlation, and regression analysis.

a) Basic Inferential Statistics

2.1  Hypothesis Testing

  • T-Test (used to compare a sample mean with a reference value, or the means of two groups):

# One-sample t-test

t.test(mtcars$mpg, mu = 20) # Test if the mean of mpg is different from 20

# Two-sample t-test

t.test(mpg ~ am, data = mtcars) # Compare mpg between automatic and manual cars
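
The object returned by t.test() can be stored and inspected rather than only printed. The sketch below assumes the mtcars dataset and a conventional 0.05 significance level (an assumption, not something the test itself prescribes); the name tt_result is illustrative. By default, t.test() does not assume equal variances in the two groups (Welch's t-test).

```r
# Store the two-sample t-test result instead of only printing it
tt_result <- t.test(mpg ~ am, data = mtcars)

tt_result$statistic  # t statistic
tt_result$p.value    # p-value of the test
tt_result$conf.int   # confidence interval for the difference in means
tt_result$estimate   # mean mpg in each group

# Decision at an assumed 0.05 significance level
if (tt_result$p.value < 0.05) {
  message("The difference in mean mpg between the groups is statistically significant.")
} else {
  message("No statistically significant difference in mean mpg was detected.")
}
```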

  • Chi-Square Test (used for categorical data):

The Chi-Square test is useful when working with categorical data to determine if there's an association between two categorical variables.

# Create a contingency table and perform chi-square test
chisq_test <- table(mtcars$am, mtcars$cyl) # Transmission type vs Cylinders
chisq.test(chisq_test)
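
With a small dataset such as mtcars, R may warn that the chi-square approximation could be inaccurate. A minimal sketch of how to inspect why, assuming the commonly cited rule of thumb that expected cell counts should be at least 5 (a convention, not a hard rule); the names contingency and chisq_result are illustrative.

```r
# Contingency table of transmission type versus number of cylinders
contingency <- table(Transmission = mtcars$am, Cylinders = mtcars$cyl)

# suppressWarnings() hides the approximation warning so we can inspect the result
chisq_result <- suppressWarnings(chisq.test(contingency))

chisq_result$expected           # expected counts under independence
any(chisq_result$expected < 5)  # TRUE means some cells are sparsely populated
chisq_result$p.value
```

When some expected counts are small, an exact test such as fisher.test(contingency) is one commonly used alternative.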

2.2  Confidence Intervals

Confidence intervals give a range of plausible values for a population parameter (like the mean), based on sample data.

Confidence Interval for Mean:

# The t.test function gives confidence intervals by default
t.test(mtcars$mpg)

This output provides the confidence interval for the mean of mpg.
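
To see where that interval comes from, the 95% confidence interval for the mean can be computed by hand as mean ± t × standard error and compared with the t.test() output. This is a sketch using the mtcars data; the variable names are illustrative.

```r
x <- mtcars$mpg
n <- length(x)

standard_error <- sd(x) / sqrt(n)
# Critical t value for a two-sided 95% interval with n - 1 degrees of freedom
t_critical <- qt(0.975, df = n - 1)

manual_ci <- mean(x) + c(-1, 1) * t_critical * standard_error
builtin_ci <- as.numeric(t.test(x)$conf.int)

manual_ci
all.equal(manual_ci, builtin_ci)  # the two intervals agree
```

A different confidence level can be requested with the conf.level argument, for example t.test(x, conf.level = 0.99).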

b) Advanced Inferential Statistics

2.3  Correlation

Correlation measures the strength and direction of the linear relationship between two continuous variables.

  • Pearson Correlation Coefficient (for linear relationships):

cor(mtcars$mpg, mtcars$wt) # Correlation between mpg and weight

  • Spearman Rank Correlation (for monotonic relationships that need not be linear):

cor(mtcars$mpg, mtcars$wt, method = "spearman")
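
cor() returns only the coefficient. To also test whether a correlation differs significantly from zero, cor.test() can be used, as in the sketch below. The argument exact = FALSE is set here as an assumption to avoid a warning about tied values in the Spearman case; the result names are illustrative.

```r
# Pearson correlation with a significance test
pearson_result <- cor.test(mtcars$mpg, mtcars$wt)
pearson_result$estimate  # correlation coefficient
pearson_result$p.value   # p-value for the null hypothesis of zero correlation
pearson_result$conf.int  # confidence interval for the coefficient

# Spearman rank correlation; exact = FALSE uses an approximation when ties exist
spearman_result <- cor.test(mtcars$mpg, mtcars$wt,
                            method = "spearman", exact = FALSE)
spearman_result$estimate
```

A negative coefficient here indicates that heavier cars tend to have lower mpg.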

2.4  Linear Regression

Regression analysis allows us to model and analyze relationships between variables.

  • Simple Linear Regression:

# Fitting a linear model for mpg based on weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)

This will output the regression equation and significance of the relationship between mpg and wt.
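
Beyond reading the printed summary, the fitted coefficients can be extracted and used for prediction. This is a sketch; the data frame new_cars and its weights are made-up inputs for illustration.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted line: mpg = intercept + slope * wt
coefficients_fit <- coef(model)
coefficients_fit

# Proportion of variance in mpg explained by weight
r_squared <- summary(model)$r.squared
r_squared

# Predict mpg for hypothetical cars (wt is in units of 1000 lbs in mtcars)
new_cars <- data.frame(wt = c(2.5, 3.5))
predicted_mpg <- predict(model, newdata = new_cars)
predicted_mpg
```

The negative slope means that each additional unit of weight is associated with a lower predicted mpg.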

  • Multiple Linear Regression:

# Fitting a multiple linear regression model
model_mult <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model_mult)

2.5  ANOVA (Analysis of Variance)

ANOVA is used to compare the means of three or more groups.

  • One-Way ANOVA:

# ANOVA to check if mpg differs across cylinder types
aov_model <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_model)

  • Two-Way ANOVA (allows for interaction between variables):

# Two-way ANOVA with mpg, cylinder type, and transmission type
aov_model2 <- aov(mpg ~ factor(cyl) * factor(am), data = mtcars)
summary(aov_model2)
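
The ANOVA table can also be read programmatically. The sketch below refits the one-way model, compares the group means that the test is about, and pulls the F statistic and p-value out of the summary table. The variable names are illustrative.

```r
aov_model <- aov(mpg ~ factor(cyl), data = mtcars)

# Mean mpg per cylinder group: these are the means being compared
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
group_means

# summary() returns a list; the first element is the ANOVA table
aov_table <- summary(aov_model)[[1]]
f_value <- aov_table[["F value"]][1]
p_value <- aov_table[["Pr(>F)"]][1]

f_value
p_value
```

A small p-value suggests that at least one group mean differs from the others; it does not say which one.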

2.6  Non-Parametric Tests

When the assumptions of parametric tests (like normality) are not met, non-parametric tests like the Wilcoxon test are used.

  • Wilcoxon Test (non-parametric alternative to the t-test):

wilcox.test(mtcars$mpg ~ mtcars$am) # Compare mpg between automatic and manual cars
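
One way to decide whether the normality assumption is plausible is the Shapiro-Wilk test, run separately for each group. This is a sketch of one common check, not the only approach; the 0.05 threshold and the variable names are assumptions. The argument exact = FALSE is used because the mpg values contain ties.

```r
automatic_mpg <- mtcars$mpg[mtcars$am == 0]
manual_mpg <- mtcars$mpg[mtcars$am == 1]

# Shapiro-Wilk test: a small p-value suggests the data are not normally distributed
shapiro_automatic <- shapiro.test(automatic_mpg)
shapiro_manual <- shapiro.test(manual_mpg)
shapiro_automatic$p.value
shapiro_manual$p.value

# Wilcoxon rank-sum test as the non-parametric comparison of the two groups
wilcox_result <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
wilcox_result$p.value
```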

2.7  Bootstrapping

Bootstrapping is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.

Bootstrap Example (using the boot package):

library(boot)

# Defining a statistic function (mean)
boot_mean <- function(data, indices) {
  return(mean(data[indices]))
}

# Applying bootstrapping on mpg data
results <- boot(data = mtcars$mpg, statistic = boot_mean, R = 1000)
results

This generates multiple resamples of the data and computes the statistic (e.g., the mean) for each sample.
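
The resampling idea can also be written out by hand in base R, which makes the "sampling with replacement" step explicit. This is a sketch, assuming 1000 resamples and a percentile-based 95% interval; the seed value and variable names are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed so the resamples are reproducible

x <- mtcars$mpg
n_resamples <- 1000

# Each resample draws length(x) values from x with replacement, then takes the mean
boot_means <- replicate(
  n_resamples,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Bootstrap standard error of the mean
boot_se <- sd(boot_means)
boot_se

# Percentile 95% confidence interval for the mean
boot_ci <- quantile(boot_means, c(0.025, 0.975))
boot_ci
```

The spread of boot_means approximates the sampling variability of the mean without relying on a normality assumption.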

3. Moving to Advanced Visualization in R

Advanced visualizations help in presenting the results of your statistical analysis more effectively.

  • Using ggplot2 for Data Visualization:

library(ggplot2)

# Scatter plot with a regression line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red")

Summary Workflow for Descriptive and Inferential Statistics in R

1.      Data Loading:

  • Start by importing data using read.csv() or by loading built-in datasets.

2.      Descriptive Statistics:

  • Compute central tendency (mean, median, mode) and variability (range, IQR, variance, standard deviation).
  • Visualize data using histograms, boxplots, and scatter plots.

3.      Inferential Statistics:

  • Apply hypothesis testing (t-tests, ANOVA, chi-square tests).
  • Conduct correlation and regression analysis.
  • Use non-parametric tests if assumptions of normality are not met.
  • Implement bootstrapping for more robust statistical estimates.

By following these steps, you can transition from basic to advanced levels of both descriptive and inferential statistics in R, making your data analysis more powerful and comprehensive.