Data Analysis Exercise

# load dslabs package
library("dslabs")

Warning: package 'dslabs' was built under R version 4.4.3

library("dplyr")

Warning: package 'dplyr' was built under R version 4.4.3


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library("ggplot2")

Warning: package 'ggplot2' was built under R version 4.4.3

# get an overview of data structure
str(gapminder)

'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...

# get a summary of data
summary(gapminder)

                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586

# determine the type of object gapminder is
class(gapminder)

[1] "data.frame"

# Keep only the African countries
africadata <- gapminder %>%
  filter(continent == "Africa")

# Create a new data frame with only 'infant_mortality' and 'life_expectancy'
africa_mort_life <- africadata %>%
  select(infant_mortality, life_expectancy)

# Create another data frame with only 'population' and 'life_expectancy'
africa_pop_life <- africadata %>%
  select(population, life_expectancy)

# Check the structure of the first new object
str(africa_mort_life)

'data.frame':   2907 obs. of  2 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...

# Check a summary of the data to understand distributions and NA presence
summary(africa_mort_life)

 infant_mortality life_expectancy
 Min.   : 11.40   Min.   :13.20  
 1st Qu.: 62.20   1st Qu.:48.23  
 Median : 93.40   Median :53.98  
 Mean   : 95.12   Mean   :54.38  
 3rd Qu.:124.70   3rd Qu.:60.10  
 Max.   :237.40   Max.   :77.60  
 NA's   :226

# Check the structure of the second new object
str(africa_pop_life)

'data.frame':   2907 obs. of  2 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...

# Summary of population and life expectancy
summary(africa_pop_life)

   population        life_expectancy
 Min.   :    41538   Min.   :13.20  
 1st Qu.:  1605232   1st Qu.:48.23  
 Median :  5570982   Median :53.98  
 Mean   : 12235961   Mean   :54.38  
 3rd Qu.: 13888152   3rd Qu.:60.10  
 Max.   :182201962   Max.   :77.60  
 NA's   :51

I created two new data frames from the “africadata” object:
- “africa_mort_life”: Contains only the “infant_mortality” and “life_expectancy” columns.
- “africa_pop_life”: Contains only the “population” and “life_expectancy” columns.

Both data frames contain 2907 rows (one for each African country-year observation) and 2 columns. I used str() to inspect the structure (data types and column names) and summary() to examine the distribution and detect missing values.
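
The missing-value counts that summary() reports can also be pulled out directly. The following is a minimal sketch that rebuilds africadata so it runs on its own and counts the NA values per column; the names cols and na_counts are introduced here only for illustration.

```r
library(dslabs)
library(dplyr)

# Rebuild the Africa subset so this block runs on its own
africadata <- gapminder %>%
  filter(continent == "Africa")

# Count missing values in the three columns used in this exercise
# (cols and na_counts are illustrative names, not part of the original analysis)
cols <- c("infant_mortality", "life_expectancy", "population")
na_counts <- colSums(is.na(africadata[, cols]))
na_counts
```

The counts should agree with the NA's rows in the summary() output above (226 for infant_mortality and 51 for population).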

# Load dplyr package
library(dplyr)

# Create the object with infant_mortality and life_expectancy
africa_mort_life <- africadata %>%
  select(infant_mortality, life_expectancy)

# Plot life expectancy vs infant mortality
ggplot(africa_mort_life, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  labs(
    title = "Life expectancy vs Infant mortality in Africa",
    x = "Infant Mortality (deaths per 1,000)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

Warning: Removed 226 rows containing missing values or values outside the scale range
(`geom_point()`).

This plot shows a negative correlation between infant mortality and life expectancy. Countries with higher infant mortality tend to have lower life expectancy.
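
The direction of this relationship can also be checked numerically. As a rough sketch (not part of the original analysis), the Pearson correlation can be computed with cor(), dropping the rows where infant mortality is missing; the name r_mort_life is an assumption made for this example.

```r
library(dslabs)
library(dplyr)

# Rebuild the Africa subset so this block runs on its own
africadata <- gapminder %>%
  filter(continent == "Africa")

# Pearson correlation, using only rows where both values are present
r_mort_life <- cor(
  africadata$infant_mortality,
  africadata$life_expectancy,
  use = "complete.obs"
)
r_mort_life
```

A value below zero is consistent with the downward pattern in the scatter plot.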

# Create population vs life expectancy data from africadata
africa_pop_life <- africadata %>%
  select(population, life_expectancy)

# Plot life expectancy vs population size
ggplot(africa_pop_life, aes(x = population, y = life_expectancy)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  scale_x_log10() +  # log scale for population
  labs(
    title = "Life expectancy vs Population size",
    x = "Population (log scale)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

Warning: Removed 51 rows containing missing values or values outside the scale range
(`geom_point()`).

Here, we observe a positive correlation between population size and life expectancy.

However, the data displays streaks of points. When we examine the original africadata again, we notice that each country has multiple entries over time (from 1960 to 2016). These streaks represent temporal data points for each country, where each streak corresponds to a single country progressing over time.
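
One way to confirm this reading is to count the rows per country and to connect each country's points in time order. The sketch below is only an illustration; rows_per_country and streak_plot are names made up for this example.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

# Rebuild the Africa subset and sort it so each country's rows run in year order
africadata <- gapminder %>%
  filter(continent == "Africa") %>%
  arrange(country, year)

# Rows (years) per African country; droplevels() removes the non-African
# countries that are still stored as unused factor levels
rows_per_country <- table(droplevels(africadata$country))
range(rows_per_country)

# Join each country's points in year order to make the streaks explicit
streak_plot <- ggplot(
  africadata,
  aes(x = population, y = life_expectancy, group = country)
) +
  geom_path(alpha = 0.5, na.rm = TRUE) +
  scale_x_log10() +
  labs(
    x = "Population (log scale)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
streak_plot
```

Each path in the resulting plot is one country, which is why the scatter plot above appears as a set of separate streaks.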

# Now perform analysis for year 2000
africadata_year2000 <- subset(africadata, year == 2000)
# get an overview of data structure
str(africadata_year2000)

'data.frame':   51 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ infant_mortality: num  33.9 128.3 89.3 52.4 96.2 ...
 $ life_expectancy : num  73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
 $ fertility       : num  2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
 $ population      : num  31183658 15058638 6949366 1736579 11607944 ...
 $ gdp             : num  5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...

# summary of data
summary(africadata_year2000)

         country        year      infant_mortality life_expectancy
 Algeria     : 1   Min.   :2000   Min.   : 12.30   Min.   :37.60  
 Angola      : 1   1st Qu.:2000   1st Qu.: 60.80   1st Qu.:51.75  
 Benin       : 1   Median :2000   Median : 80.30   Median :54.30  
 Botswana    : 1   Mean   :2000   Mean   : 78.93   Mean   :56.36  
 Burkina Faso: 1   3rd Qu.:2000   3rd Qu.:103.30   3rd Qu.:60.00  
 Burundi     : 1   Max.   :2000   Max.   :143.30   Max.   :75.00  
 (Other)     :45                                                  
   fertility       population             gdp               continent 
 Min.   :1.990   Min.   :    81154   Min.   :2.019e+08   Africa  :51  
 1st Qu.:4.150   1st Qu.:  2304687   1st Qu.:1.274e+09   Americas: 0  
 Median :5.550   Median :  8799165   Median :3.238e+09   Asia    : 0  
 Mean   :5.156   Mean   : 15659800   Mean   :1.155e+10   Europe  : 0  
 3rd Qu.:5.960   3rd Qu.: 17391242   3rd Qu.:8.654e+09   Oceania : 0  
 Max.   :7.730   Max.   :122876723   Max.   :1.329e+11                
                                                                      
                       region  
 Eastern Africa           :16  
 Western Africa           :16  
 Middle Africa            : 8  
 Northern Africa          : 6  
 Southern Africa          : 5  
 Australia and New Zealand: 0  
 (Other)                  : 0

# Plot life expectancy vs infant mortality for the year 2000
ggplot(africadata_year2000, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point(color = "steelblue", alpha = 0.7) +
  labs(
    title = "Life Expectancy vs Infant Mortality, Year 2000",
    x = "Infant mortality (deaths per 1,000)",
    y = "Life expectancy (years)"
  ) +
  theme_minimal()

# Now we will fit models to the year 2000 data to see whether there are significant relationships:

# Model 1: Predict life expectancy using infant mortality
fit1 <- lm(life_expectancy ~ infant_mortality, data = africadata_year2000)

# Model 2: Predict life expectancy using population size
fit2 <- lm(life_expectancy ~ population, data = africadata_year2000)

# Print model summaries to screen
summary(fit1)


Call:
lm(formula = life_expectancy ~ infant_mortality, data = africadata_year2000)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6651  -3.7087   0.9914   4.0408   8.6817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08

lm(formula = life_expectancy ~ infant_mortality, data = africadata_year2000)


Call:
lm(formula = life_expectancy ~ infant_mortality, data = africadata_year2000)

Coefficients:
     (Intercept)  infant_mortality  
         71.2933           -0.1892

summary(fit2)


Call:
lm(formula = life_expectancy ~ population, data = africadata_year2000)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.429  -4.602  -2.568   3.800  18.802 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
population  2.756e-08  5.459e-08   0.505    0.616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.524 on 49 degrees of freedom
Multiple R-squared:  0.005176,  Adjusted R-squared:  -0.01513 
F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159

lm(formula = life_expectancy ~ population, data = africadata_year2000)


Call:
lm(formula = life_expectancy ~ population, data = africadata_year2000)

Coefficients:
(Intercept)   population  
  5.593e+01    2.756e-08

I fitted two simple linear regression models using the lm() function on African countries in the year 2000:

Model 1 (fit1): Life expectancy as a function of infant mortality.
- The model shows a negative relationship between infant mortality and life expectancy.
- The slope for infant mortality is negative (-0.189) and statistically significant (p = 2.83e-08), which confirms this inverse relationship. The model explains about 47% of the variance in life expectancy (R-squared = 0.47).

Model 2 (fit2): Life expectancy as a function of population size.
- This model shows no meaningful relationship between population size and life expectancy.
- The slope is not statistically significant (p = 0.616), and the R-squared is close to zero (0.005).

Among African countries in the year 2000, infant mortality is a much stronger predictor of life expectancy than population size.
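
To compare the two fits side by side without reading through the full summary() printouts, the key numbers can be collected into one small table. This is a sketch under the same setup as above; fit_stats() and model_stats are hypothetical names introduced for this example and are not part of the original analysis.

```r
library(dslabs)
library(dplyr)

# Same subset as above: African countries in the year 2000
africadata_year2000 <- gapminder %>%
  filter(continent == "Africa", year == 2000)

fit1 <- lm(life_expectancy ~ infant_mortality, data = africadata_year2000)
fit2 <- lm(life_expectancy ~ population, data = africadata_year2000)

# Illustrative helper: slope, its p-value, and R-squared
# for a simple one-predictor lm fit
fit_stats <- function(fit) {
  s <- summary(fit)
  data.frame(
    slope = s$coefficients[2, "Estimate"],
    p_value = s$coefficients[2, "Pr(>|t|)"],
    r_squared = s$r.squared
  )
}

model_stats <- cbind(
  model = c("fit1", "fit2"),
  rbind(fit_stats(fit1), fit_stats(fit2))
)
model_stats
```

The values in model_stats should match the corresponding entries in the two summary() outputs shown earlier.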

# Additional Data Exploration – this part was contributed by Talia C. Swanson

library(dslabs)
library(dplyr)
library(ggplot2)

# Load the chosen dataset
data("historic_co2")

# Display the historic_co2 dataset (a tibble prints only its first 10 rows)
historic_co2

# A tibble: 694 × 3
    year   co2 source   
   <dbl> <dbl> <chr>    
 1  1959  316. Mauna Loa
 2  1960  317. Mauna Loa
 3  1961  318. Mauna Loa
 4  1962  318. Mauna Loa
 5  1963  319. Mauna Loa
 6  1964  320. Mauna Loa
 7  1965  320. Mauna Loa
 8  1966  321. Mauna Loa
 9  1967  322. Mauna Loa
10  1968  323. Mauna Loa
# ℹ 684 more rows

# Check structure of data
str(historic_co2)

spc_tbl_ [694 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ year  : num [1:694] 1959 1960 1961 1962 1963 ...
 $ co2   : num [1:694] 316 317 318 318 319 ...
 $ source: chr [1:694] "Mauna Loa" "Mauna Loa" "Mauna Loa" "Mauna Loa" ...

The str() function shows the structure of the historic_co2 dataset. It has 694 observations and 3 variables:
- year: the year of measurement
- co2: the CO₂ concentration
- source: the measurement source

This helps us understand the type of data we’re working with and confirms that the dataset is ready for analysis and visualization.

# Preview the first rows and look at the descriptive statistics
head(historic_co2)

# A tibble: 6 × 3
   year   co2 source   
  <dbl> <dbl> <chr>    
1  1959  316. Mauna Loa
2  1960  317. Mauna Loa
3  1961  318. Mauna Loa
4  1962  318. Mauna Loa
5  1963  319. Mauna Loa
6  1964  320. Mauna Loa

summary(historic_co2)

      year              co2           source         
 Min.   :-803182   Min.   :177.7   Length:694        
 1st Qu.:-470498   1st Qu.:206.7   Class :character  
 Median : -43278   Median :236.9   Mode  :character  
 Mean   :-219753   Mean   :245.9                     
 3rd Qu.:  -8924   3rd Qu.:271.8                     
 Max.   :   2018   Max.   :408.5

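The head() function displays the first 6 rows of the dataset. These rows are Mauna Loa measurements in chronological order, starting at about 316 ppm in 1959.

The summary() function provides descriptive statistics for each variable: year, co2, and source. The years range from -803182 to 2018 with a median of -43278, so most of the record lies far in the past, and the Mauna Loa rows shown by head() cover only its most recent part. CO₂ concentrations range from 177.7 to 408.5 ppm.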
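
Because summary() pools all rows, it does not show how the year and CO₂ ranges differ between measurement sources. A minimal sketch of a per-source breakdown follows; co2_by_source and its column names are assumptions made for this example.

```r
library(dslabs)
library(dplyr)

data("historic_co2")

# Row count, year range, and CO2 range for each measurement source
co2_by_source <- historic_co2 %>%
  group_by(source) %>%
  summarise(
    n = n(),
    first_year = min(year),
    last_year = max(year),
    min_co2 = min(co2),
    max_co2 = max(co2)
  )
co2_by_source
```

This makes it easier to see which source contributes the negative years reported by summary().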

# Plotting CO2 concentration over all time
ggplot(historic_co2, aes(x = year, y = co2)) +
  geom_line(color = "blue", linewidth = 1) +
  labs(title = "Historical CO2 Concentration Over Time",
       x = "Year",
       y = "CO2 (ppm)") +
  theme_minimal()

The historic_co2 dataset contains atmospheric CO₂ concentrations measured over time; the first table above shows its first few observations. In this visualization, CO₂ concentrations fluctuate up and down over most of the record, but from the mid-1900s onward they increase steadily, indicating a long-term rise in atmospheric CO₂ levels.
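
The plot above draws all rows as a single line, although the dataset combines more than one measurement source. As a possible variation (a sketch, not the original author's plot), mapping source to color draws one line per source; co2_plot is a name made up for this example.

```r
library(dslabs)
library(ggplot2)

data("historic_co2")

# One line per measurement source instead of a single joined line
co2_plot <- ggplot(historic_co2, aes(x = year, y = co2, color = source)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Historical CO2 Concentration Over Time, by Source",
    x = "Year",
    y = "CO2 (ppm)",
    color = "Source"
  ) +
  theme_minimal()
co2_plot
```

The linewidth argument assumes ggplot2 3.4.0 or later.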

# Filter for recent years 2000 onwards as target
recent_co2 <- historic_co2 %>% filter(year >= 2000)

# Display the table of points used in the plot (for recent years)
recent_co2

# A tibble: 20 × 3
    year   co2 source   
   <dbl> <dbl> <chr>    
 1  2000  370. Mauna Loa
 2  2001  371. Mauna Loa
 3  2002  373. Mauna Loa
 4  2003  376. Mauna Loa
 5  2004  378. Mauna Loa
 6  2005  380. Mauna Loa
 7  2006  382. Mauna Loa
 8  2007  384. Mauna Loa
 9  2008  386. Mauna Loa
10  2009  387. Mauna Loa
11  2010  390. Mauna Loa
12  2011  392. Mauna Loa
13  2012  394. Mauna Loa
14  2013  397. Mauna Loa
15  2014  399. Mauna Loa
16  2015  401. Mauna Loa
17  2016  404. Mauna Loa
18  2017  407. Mauna Loa
19  2018  409. Mauna Loa
20  2001  368. Ice Cores

# This will plot our graph of atmospheric CO2 for years 2000 and above
ggplot(recent_co2, aes(x = year, y = co2)) +
  geom_line(color = "red") +
  geom_point(color = "darkred") +
  labs(title = "CO2 Levels Since 2000",
       x = "Year",
       y = "CO2 (ppm)") +
  theme_minimal()

This visualization focuses on CO₂ levels from the year 2000 to 2018, the last year in the dataset. During this period, CO₂ concentrations continue to increase at a steady rate without any major decreases or fluctuations. This suggests that atmospheric CO₂ levels have risen consistently in recent years.
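
Note that the recent_co2 table includes one Ice Cores row (year 2001) alongside the Mauna Loa rows, so the filtered data mixes two sources. A minimal sketch of restricting the data to Mauna Loa and checking the year-over-year increase directly is shown below; mauna_loa_recent is a hypothetical name used only here.

```r
library(dslabs)
library(dplyr)

data("historic_co2")

# Keep only the Mauna Loa measurements from 2000 onward, in year order
mauna_loa_recent <- historic_co2 %>%
  filter(year >= 2000, source == "Mauna Loa") %>%
  arrange(year)

# TRUE if every year's value is higher than the previous year's
all(diff(mauna_loa_recent$co2) > 0)
```

Given the values in the table above, this check is expected to return TRUE, which supports the description of a steady rise.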