Modeling Change In Per Hectare Crop Yield Over Time

In this blog post, I’ll walk my readers through the R code I wrote in one of my Kaggle Notebooks to visualize the trends of different crop yields in most populated countries over time (years). Besides that, I’ll also visualize the population growth in the world and the populated countries. Initially, I have worked on a few countries in this notebook to optimally utilize the time. However, readers can use this code for the countries and crops of their choice. At the end, I’ll model and visualize the change in per hectare crop yield over time in the selected countries.

Load libraries

In this data visualization and analysis post, I’ll use the following R packages. I have labeled the category of each package as data science, modeling, visualization and tables.

# Data science
library(tidyverse)
library(janitor)
library(DataExplorer)
library(skimr)

# Modeling
library(tidymodels)

# Visualization
library(ggrepel)

# Tables
library(gt)

Upload data

In this blog post, I have used the data for crop production (yield) and population in countries provided on github. I used the read_csv function to remotely read the data from github.

crop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv")

land <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/land_use_vs_yield_change_in_cereal_production.csv")

Exploratory data analysis

The first task of any data analysis and visualization project after data loading is the Exploratory Data Analysis or EDA. Exploratory data analysis helps analysts swiftly understanding the data structure and patterns in it.

Data structure

The first and foremost thing we do in EDA is understanding the data structure. Luckily, developers have built few packages such as skimr that can help us in quickly understanding the data structure. Here, I used the skim function of skimr package to see different attributes of the data.

  • See the structure of crop data below.
skim(crop)
Table 1: Data summary
Name crop
Number of rows 13075
Number of columns 14
_______________________
Column type frequency:
character 2
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Entity 0 1.00 4 39 0 249 0
Code 1919 0.85 3 8 0 214 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1.00 1990.37 16.73 1961.00 1976.00 1991.00 2005.00 2018.00 ▇▆▇▇▇
Wheat (tonnes per hectare) 4974 0.62 2.43 1.69 0.00 1.23 1.99 3.12 10.67 ▇▅▂▁▁
Rice (tonnes per hectare) 4604 0.65 3.16 1.85 0.20 1.77 2.74 4.16 10.68 ▇▇▃▁▁
Maize (tonnes per hectare) 2301 0.82 3.02 3.13 0.03 1.14 1.83 3.92 36.76 ▇▁▁▁▁
Soybeans (tonnes per hectare) 7114 0.46 1.45 0.75 0.00 0.86 1.33 1.90 5.95 ▇▇▂▁▁
Potatoes (tonnes per hectare) 3059 0.77 15.40 9.29 0.84 8.64 13.41 20.05 75.30 ▇▅▁▁▁
Beans (tonnes per hectare) 5066 0.61 1.09 0.82 0.03 0.59 0.83 1.35 9.18 ▇▁▁▁▁
Peas (tonnes per hectare) 6840 0.48 1.48 1.01 0.04 0.72 1.15 1.99 7.16 ▇▃▁▁▁
Cassava (tonnes per hectare) 5887 0.55 9.34 5.11 1.00 5.55 8.67 11.99 38.58 ▇▇▁▁▁
Barley (tonnes per hectare) 6342 0.51 2.23 1.50 0.09 1.05 1.88 3.02 9.15 ▇▆▂▁▁
Cocoa beans (tonnes per hectare) 8466 0.35 0.39 0.28 0.00 0.24 0.36 0.49 3.43 ▇▁▁▁▁
Bananas (tonnes per hectare) 4166 0.68 15.20 12.08 0.66 5.94 11.78 20.79 77.59 ▇▃▁▁▁

skim function provides information about:

  • Data dimensions i.e., number of rows and columns in the data.
  • Data types i.e., character, factor or numeric, and their frequency.
  • Descriptive statistics or attributes of each column based on data type.
  • For example, if the data is of numeric type, it provides statistics such as mean, standard deviation, quantiles and missing values.
  • And if the data is of character type, it provides character data descriptive and missing values.

In short, this function is very useful and you can use it to quickly understand the structure of any data you are working on.

  • Below is the structure of population data.
skim(land)
Table 2: Data summary
Name land
Number of rows 49259
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Entity 0 1.00 4 50 0 286 0
Code 3178 0.94 3 8 0 235 0
Year 0 1.00 1 10 0 258 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Cereal yield index 38837 0.21 169.29 8.43900e+01 0 111.0 147 209 1335 ▇▁▁▁▁
Change to land area used for cereal production since 1961 38837 0.21 133.71 1.22230e+02 0 94.0 112 145 3503 ▇▁▁▁▁
Total population (Gapminder) 2376 0.95 29827896.89 2.53086e+08 905 201733.5 1542937 5886795 7713467904 ▇▁▁▁▁
  • If you have looked through the summaries of both data sets, you might have noticed that both the crop and population data have missing values in some of the columns.

  • We can also visualize the amount of missing data in each column using the plot_missing function from the DataExplorer library. These missing data plots categorize the amount of missing data into 3 categories such as OK, Good and Bad. Columns under Bad category needs to be fixed before moving towards data analysis.

Crop data

crop %>%
  plot_missing()

Population data

land %>% 
  plot_missing()

After the data structure, we can start exploring the data distributions. Here, I ’ll start with the population data. Before visualizing the distribution of population, I’ll calculate the population in million column to use instead of raw population data. At first, I’ll only visualize the worldwide population data from 2019.

Population data distribution (2019)

# Population in Millions
Pop <- land %>%
  filter(!is.na(Code)) %>%
  mutate(Year = as.numeric(Year)) %>% 
  mutate(pop = `Total population (Gapminder)`/1000000)

## Distribution (2019)
Pop %>% 
  filter(Entity != "World") %>%
  group_by(Entity) %>%
  filter(Year == max(Year)) %>%
  ungroup() %>% 
  ggplot(aes(x = pop)) +
  geom_histogram(fill = "steelblue") +
  labs(x = "Population", y = "Count") +
  theme_bw()

  • The population data is right skewed.

Now, I’ll extract the top 10 populated countries and present them along with their population using gt function. gt library is another wonderful package of R that builds professional quality tables in Rmarkdown.

Pop %>%
  filter(Entity != "World") %>%
  group_by(Entity) %>%
  filter(Year == max(Year)) %>%
  ungroup() %>% 
  slice_max(pop, n = 10) %>% 
  select(Entity, pop) %>%
  rename(Country = Entity, `Population (millions)` = pop) %>% 
  gt() %>% 
  tab_header(
    title = "Top 10 Populated Countries of the World"
  )
Top 10 Populated Countries of the World
Country Population (millions)
China 1433.784
India 1366.418
United States 329.065
Indonesia 270.626
Pakistan 216.565
Brazil 211.050
Nigeria 200.964
Bangladesh 163.046
Russia 145.872
Mexico 127.576

These are the top 10 populated countries according to the data. However, I’ll use only the data from China, India, United States, Pakistan, Brazil and Russia for comparison in subsequent steps.

Now, I’ll plot the worldwide population growth trend from 1900 to 2019. Afterwards, I’ll plot the population growth trends in the selected 6 countries.

Worldwide Population Growth

Pop %>% 
  filter(Entity == "World") %>% 
  filter(Year>=1900) %>% 
  ggplot(aes(x = Year, y = pop)) +
  geom_line() +
  geom_point() +
  labs(x= "Year", y = "Population (Millions)") +
  theme_bw()

Populated countries

Pop %>% 
  filter(Entity %in% c("Pakistan", "United States", "India", "China", "Brazil", "Russia")) %>%
  filter(Year>=1900) %>%  
  ggplot(aes(x = Year, y = pop, color = Entity)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Population (millions)", color = NULL) +
  theme_bw() +
  theme(legend.position = "top")

Now, I’ll move towards crop data and will do some cleaning before making visualizations.

  • At first, I’ll gather the data in a tidy format for subsequent visualizations and data analysis.

  • Also, I’ll clean the crop names to make them easily readable.

Crop <- crop %>% 
  gather(crop, yield, `Wheat (tonnes per hectare)`:`Bananas (tonnes per hectare)`, na.rm = FALSE, convert = TRUE) %>%
  mutate(crop = str_remove(crop, regex(" \\(tonnes per hectare\\)", ignore_case = TRUE)))

Crop
## # A tibble: 143,825 x 5
##    Entity      Code   Year crop  yield
##    <chr>       <chr> <dbl> <chr> <dbl>
##  1 Afghanistan AFG    1961 Wheat 1.02 
##  2 Afghanistan AFG    1962 Wheat 0.974
##  3 Afghanistan AFG    1963 Wheat 0.832
##  4 Afghanistan AFG    1964 Wheat 0.951
##  5 Afghanistan AFG    1965 Wheat 0.972
##  6 Afghanistan AFG    1966 Wheat 0.867
##  7 Afghanistan AFG    1967 Wheat 1.12 
##  8 Afghanistan AFG    1968 Wheat 1.16 
##  9 Afghanistan AFG    1969 Wheat 1.19 
## 10 Afghanistan AFG    1970 Wheat 0.956
## # ... with 143,815 more rows

Distribution of crop yields

  • Distribution of crop yield data obtained from 1961 to 2019.
Crop %>% 
  ggplot(aes(x = yield)) +
  geom_histogram(fill = "steelblue") +
  theme_bw() +
  labs(x = "Yield") +
  facet_wrap(~crop, scales = "free")

Jitter plots

Crop %>% 
  ggplot(aes(x = reorder(crop, yield), y = yield, color = crop)) +
  geom_jitter(position = position_jitter(0.2), alpha = 0.1) +
  labs(x = "Crop", y = "Yield") +
  theme_bw() +
  theme(legend.position = "none")

Modeling per hectare crop yield over time

Now, I’ll use simple linear models to estimate the improvement in crop yield (tons per hectare) per year. For this purpose, I’ll use tidymodels library. In tidymodels, we can optimally use the tidy functionality to analyze data without creating subsets of data. We can nest the variables for which we want to run the model simultaneously.

For instance, I want to model yield per hectare over years for all of the provided crops in the selected six populated countries. I can easily remove the unwanted columns from the data and then can nest the yield and year information for subsequent linear regression analysis. After the analysis, I can easily un-nest the tidied information of the model fits including slop coefficients, standard error, confidence intervals and p-value.

  • Slop coefficient = change in crop yield (tons per hectare ) per year.
slopes <- Pop_Crop %>% 
  select(-ID, -Code, -`Cereal yield index`, -`Change to land area used for cereal production since 1961`, -`Total population (Gapminder)`, -pop) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans") & !is.na(yield)) %>%
  nest(yields = c(Year, yield)) %>% 
  mutate(
    fit = map(yields, ~ lm(yield ~ Year, data = .x))
    ) %>% 
  mutate(tidied = map(fit, tidy)) %>%
  unnest(tidied) %>% 
  filter(term == "Year") %>%
  mutate(p.value = p.adjust(p.value)) %>% 
  select(-yields, -fit)

slopes
## # A tibble: 53 x 7
##    Entity crop     term  estimate std.error statistic  p.value
##    <chr>  <chr>    <chr>    <dbl>     <dbl>     <dbl>    <dbl>
##  1 Brazil Wheat    Year   0.0366    0.00222     16.5  1.17e-21
##  2 Brazil Rice     Year   0.0755    0.00490     15.4  2.22e-20
##  3 Brazil Maize    Year   0.0709    0.00395     18.0  1.96e-23
##  4 Brazil Soybeans Year   0.0388    0.00154     25.1  1.48e-30
##  5 Brazil Potatoes Year   0.440     0.0148      29.7  2.51e-34
##  6 Brazil Beans    Year   0.00770   0.00113      6.81 9.10e- 8
##  7 Brazil Barley   Year   0.0466    0.00319     14.6  2.30e-19
##  8 Brazil Bananas  Year  -0.0349    0.0169      -2.06 4.36e- 2
##  9 Brazil Peas     Year   0.0385    0.00717      5.37 7.30e- 5
## 10 China  Wheat    Year   0.0880    0.00141     62.6  8.23e-52
## # ... with 43 more rows

Visualization of slope coefficients

  • Now, I’ll visualize the slope coefficients against p-values of null hypothesis.
  • This visualization will help us in understanding that which countries have demonstrated the improved yield over time in each crop.
slopes %>%
  ggplot(aes(x = estimate, y = p.value, label = Entity)) +
  geom_vline(
    xintercept = 0, lty = 2,
    size = 1.5, alpha = 0.7, color = "gray50"
  ) +
  geom_point(aes(color = crop), alpha = 0.8, size = 2.5, show.legend = FALSE) +
  scale_y_log10() +
  facet_wrap(~crop) +
  geom_text_repel(size = 3) +
  theme_light() +
  theme(strip.text = element_text(size = 12)) +
  labs(x = "Change in yield (tons per hectare) per year", y = "P-value")

  • The vertical line indicates slope = 0 no change in per hectare crop yield.
  • Countries on the right side of the dashed line have demonstrated improvement in per hectare yield over time.
  • While, countries on the left side have demonstrated decrease in yield over time.

That’s all for this blog post. I hope you have enjoyed reading it and might get to know some new functions and R packages. You can use this code freely to clean, visualize and analyze your data. If you got any questions, leave those in comments and I’ll be happy to reply.

Muhammad Mohsin Raza
Muhammad Mohsin Raza
Data Science Fellow

My research interests include yield loss modeling, disease detection, GIS and Remote Sensing applications.