Analysis for gapminder data with R
Throughout this work, you will use data from gapminder, which tracks demographic data in countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.
Packages
If you have not installed these packages yet, run the following lines:
install.packages("gapminder")
install.packages("dplyr")
install.packages("ggplot2")
Load the packages into the working environment:
library(gapminder) # For the gapminder data set
library(dplyr)
library(ggplot2)
Subject: Calculate center measures
For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize(), a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.
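As a minimal sketch of how group_by() and summarize() work together, the example below uses a small made-up data frame (the names toy, group and value are illustrative assumptions, not part of gapminder). Each group is collapsed into a single row of summary statistics.

```r
library(dplyr)

# Hypothetical toy data: two groups with three values each
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 60)
)

# group_by() marks the groups; summarize() then collapses each group to one row
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(mean_value = mean(value),
            median_value = median(value))

toy_summary
```

In group b the mean sits above the median because one value is much larger than the others, which is the kind of difference the exercises below look for at the continent level.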
Question 1;
1a) Create a dataset called gap2007 that contains only data from the year 2007.
# Explore the data set
?gapminder         # help page for the data
ls(gapminder)      # variable names
str(gapminder)     # structure
summary(gapminder)
# Create the dataset of 2007 data (called gap2007 in the exercise text)
gapminder_2007 <- gapminder %>%
  filter(year == 2007) # equivalent: filter(gapminder, year == 2007)
1b) Using gap2007, calculate the mean and median life expectancy for each continent. Don’t worry about naming the new columns produced by summarize().
## Compute groupwise mean and median lifeExp
# Extra: mean and median for the whole world
# mean_lifeexp <- round(mean(gapminder_2007$lifeExp), digits = 3)
# medn_lifeexp <- round(median(gapminder_2007$lifeExp), digits = 3)
# Compute the groupwise statistics and show them as a spreadsheet-style table
View(gapminder_2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp)))
1c) Confirm the trends that you see in the medians by generating side-by-side box plots of life expectancy for each continent.
## Generate box plots of lifeExp for each continent
ggplot(gapminder_2007, aes(x = continent, y = lifeExp)) +
  facet_wrap(~year) +
  geom_boxplot(outlier.colour = "red") +
  ggtitle("Box plots of lifeExp for each continent for 2007 across all countries") +
  # Jittered points (width 0.09) show the distribution and density of the data on the box plots
  geom_jitter(position = position_jitter(width = 0.09, height = 0), alpha = 1/10)
Subject: Measures of variability
Let’s extend the powerful group_by() and summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.
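To illustrate why a robust measure can be useful, here is a small sketch with made-up numbers (x_clean and x_outlier are hypothetical vectors, not gapminder data). Replacing a single value with an extreme one changes the standard deviation substantially, while the IQR stays the same because it only depends on the middle half of the data.

```r
# Hypothetical values: the same sample with and without one extreme observation
x_clean   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
x_outlier <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# The standard deviation reacts strongly to the extreme value
sd_clean   <- sd(x_clean)
sd_outlier <- sd(x_outlier)

# The IQR is computed from the quartiles only, so it does not move here
iqr_clean   <- IQR(x_clean)
iqr_outlier <- IQR(x_outlier)

c(sd_clean = sd_clean, sd_outlier = sd_outlier)
c(iqr_clean = iqr_clean, iqr_outlier = iqr_outlier)
```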
Question 2;
2a) The gap2007 dataset that you created in an earlier exercise is available in your workspace.
2b) For each continent in gap2007, summarize life expectancy using the sd(), the IQR(), and the count of countries, n(). No need to name the new columns produced here. The n() function within your summarize() call does not take any arguments.
# Compute groupwise measures of spread
gapminder_2007 %>% # For each continent in gap2007
  group_by(continent) %>%
  summarize("StandardDeviation" = sd(lifeExp),
            "InterQuartileRange" = IQR(lifeExp),
            "Number" = n())
2c) Graphically compare the spread of these distributions by constructing overlaid density plots of life expectancy broken down by continent.
# Generate overlaid density plots (for Q2c)
gapminder_2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) + # aes = aesthetics
  facet_wrap(~year) +
  geom_density(alpha = 0.7) + # alpha is transparency
  ggtitle("Overlaid density plots of lifeExp for each continent for 2007 across all countries") +
  theme(legend.title = element_text(color = "black",
                                    size = 14,
                                    face = "bold"),
        legend.background = element_rect(fill = "gray90",
                                         size = 0.5, # border width; ggplot2 >= 3.4.0 calls this linewidth
                                         linetype = "dashed")) +
  labs(x = "Life expectancy (years)", y = "Density")
Question 3
Choose measures for center and spread. Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them. Using the shapes of the density plots, calculate the most appropriate measures of center and spread for the following:
3a) The distribution of life expectancy in the countries of the Americas. Note you’ll need to apply a filter here.
# Compute stats for lifeExp in the Americas
View(gapminder_2007 %>%
  filter(continent == "Americas") %>%
  summarize("LifeExpected_Mean_Americas2007" = mean(lifeExp),
            "LifeExpected_StandardDeviation_Americas2007" = sd(lifeExp)))
# Optional: density plot of lifeExp in the Americas
# gapminder_2007 %>%
#   filter(continent == "Americas") %>%
#   ggplot(aes(x = lifeExp, fill = continent)) + facet_wrap(~year) + # aes = aesthetics
#   geom_density(alpha = 0.5) # alpha is transparency
3b) The distribution of country populations across the entire gap2007 data set.
# Compute stats for population
# Median/IQR play the same role as mean/sd, but are more robust to outliers and non-normal data
View(gapminder_2007 %>%
  summarize("Population_Median_2007" = median(pop),
            "Population_IQR_2007" = IQR(pop)))
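The reasoning behind picking the median and IQR for a strongly skewed variable can be sketched with a few made-up numbers. The data frame toy_pop and its column pop_millions are hypothetical stand-ins, not real population figures. Two very large values pull the mean far away from where most of the observations lie, while the median stays among the typical values.

```r
library(dplyr)

# Hypothetical, strongly right-skewed "populations" in millions
toy_pop <- data.frame(
  pop_millions = c(1, 2, 3, 5, 8, 10, 20, 50, 300, 1300)
)

# Compute both pairs of measures side by side for comparison
pop_stats <- toy_pop %>%
  summarize(mean_pop   = mean(pop_millions),
            median_pop = median(pop_millions),
            sd_pop     = sd(pop_millions),
            iqr_pop    = IQR(pop_millions))

pop_stats
```

Here the mean is 169.9 while the median is 9, so the mean describes almost none of the individual values well.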
Subject: Shape and Transformation
Modality (unimodal, bimodal, multimodal or uniform)
Skew (right-skewed, left-skewed, symmetric, etc.)
Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure. Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).
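The effect of a log transformation on right skew can be checked numerically before plotting anything. The sketch below uses simulated values rather than gapminder, and skew_gap() is a hypothetical helper defined only for this illustration: a crude indicator that measures how far the mean sits from the median in units of the standard deviation. Values near zero suggest a roughly symmetric distribution.

```r
set.seed(1)  # arbitrary seed so the simulated values are reproducible

# Simulated right-skewed values standing in for country populations (assumption)
pop_sim <- rlnorm(500, meanlog = 16, sdlog = 1)

# Crude skew indicator: distance between mean and median, in sd units
skew_gap <- function(x) (mean(x) - median(x)) / sd(x)

skew_raw <- skew_gap(pop_sim)       # before the transformation
skew_log <- skew_gap(log(pop_sim))  # after taking the natural log

c(raw = skew_raw, logged = skew_log)
```

The indicator for the logged values should be much closer to zero than for the raw values, which matches what the density plots in Question 4 show visually.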
Question 4
Using the gap2007 data:
4a) Create a density plot of the population variable.
# Create density plot of old variable
gapminder_2007 %>%
  ggplot(aes(x = pop)) +
  geom_density() +
  facet_wrap(~year) +
  # scale_x_log10() + # optional: a log scale gives a more detailed view
  ggtitle("Density plot of the population variable for 2007 data")
4b) Mutate a new column called log_pop that is the natural log of the population and save it back into gap2007.
# Transform the skewed pop variable
gapminder_2007 <- gapminder_2007 %>%
mutate(log_pop = log(pop))
4c) Create a density plot of your transformed variable.
# Create density plot of new variable
gapminder_2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density() +
  facet_wrap(~year) +
  ggtitle("Density plot of LOG(the population variable) for 2007 data")
Subject: Identifying outliers
Characteristics of a data distribution are:
- Center
- Variability
- Shape
- Outliers
Question 5
5a) Filter gap2007 so that it only contains observations from Asia, then create a new variable called is_outlier that is TRUE for countries with life expectancy less than 50. Assign the result to gap_asia.
# Filter for Asia, add column indicating outliers
gap_asia <- gapminder_2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)
5b) Filter gap_asia to remove all outliers, then create another box plot of the remaining life expectancies.
# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(is_outlier == FALSE) %>% # filter(!is_outlier) is an equivalent alternative
  ggplot(aes(x = 1, y = lifeExp)) +
  facet_wrap(~year) +
  geom_boxplot() +
  ggtitle("Box plot of lifeExp without outliers for 2007 data")
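The exercise above uses a fixed cutoff of 50 years. As an alternative sketch, outliers can also be flagged with the common 1.5 × IQR rule, which is the default rule geom_boxplot() uses to draw individual outlier points. The vector life_exp below contains made-up values for illustration only, and the names fence_low and fence_high are assumptions of this example.

```r
# Hypothetical life expectancies with one unusually low value
life_exp <- c(43.8, 59.7, 62.1, 64.7, 65.5, 67.3,
              70.2, 71.0, 72.4, 74.2, 78.6, 82.6)

# Quartiles and the fences 1.5 * IQR beyond them
q1 <- quantile(life_exp, 0.25, names = FALSE)
q3 <- quantile(life_exp, 0.75, names = FALSE)
fence_low  <- q1 - 1.5 * (q3 - q1)
fence_high <- q3 + 1.5 * (q3 - q1)

# Flag every value that falls outside the fences
is_outlier <- life_exp < fence_low | life_exp > fence_high

life_exp[is_outlier]
```

With these values only 43.8 falls outside the fences. Unlike a fixed threshold, this rule adapts to the spread of whichever group it is applied to.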