pacu: Precision Agriculture Computational Utilities - Yield monitor data

Yield monitor data

Proposed workflow

We propose the following workflow for processing raw yield monitor data:

  1. Read the data in
  2. Compute summary statistics and visualize the data
  3. Check the yield data
  4. Fix potential issues
  5. Process the data
  6. Examine the yield map

To illustrate the yield monitor workflow, we will look at data from the research paper Prairie strips improve biodiversity and the delivery of multiple ecosystem services from corn–soybean croplands. Specifically, we will use the 2012 data.

Reading the data in

extd.dir <- system.file("extdata", package = "pacu")
raw.yield <- st_read(file.path(extd.dir, '2012-basswood.shp'), quiet = TRUE)
boundary <- st_read(file.path(extd.dir, 'boundary.shp'), quiet = TRUE)

We can see that the raw yield data has multiple columns. Some of them are more or less relevant depending on how we choose to process the yield data and produce a yield map.

names(raw.yield)
##  [1] "ID"         "LONGITUDE"  "LATITUDE"   "FLOW"       "TIME"      
##  [6] "CYCLES"     "DISTANCE"   "SWATH"      "MOISTURE"   "STATUS"    
## [11] "PASS"       "SERIAL"     "FIELD"      "LOAD"       "CROP"      
## [16] "GPS"        "PDOP"       "ALTITUDE"   "DRY_BU_AC"  "DAY"       
## [21] "MONTH"      "DAYOFMONTH" "HOUR"       "MINUTE"     "SECOND"    
## [26] "TIMELAPSE"  "SPEED"      "geometry"

Compute summary statistics and visualize the data

Our main objective in this step of the proposed workflow is to make sure that the data conforms to what we would expect of yield monitor data. For instance, we can examine the raw yield data or the distance between measurements.

Examining the raw yield data

cols <- function(n) hcl.colors(n, 'Temps', rev = TRUE)
plot(raw.yield['DRY_BU_AC'], pal = cols)

The boxplot shows some observations that appear to be abnormally high. These can be a result of sudden speed or direction change, GPS errors, and other sources of unknown variability.

boxplot(raw.yield$DRY_BU_AC)

Check the yield monitor data with pa_check_yield()

In this step, pa_check_yield() searches for potential problems in the data so they can be addressed before the yield map is generated. For instance, when units are not supplied to pa_yield(), the function attempts to guess them. Because pa_check_yield() reports those guesses, the user can confirm or override them before processing.

The user can check for issues relevant to all available algorithms or target a specific algorithm.

chk <- pa_check_yield(input = raw.yield,
               algorithm = 'all')
chk
## Field information
## -------------------------
## # of points        4240  
## Approx. area (ha)  9.8529
## Approx. area (ac)  24.352
## Latitude           41.539
## Longitude          -93.27
## CRS                NA    
## -------------------------
## The CRS is missing from the input object. The pa_yield function will default to EPSG:4326
## 
## 
## Algorithm: Simple
## Checking column names and units
## --------------------------------
## variable  column     mean  units 
## ---       ---        ---   ---   
## yield     DRY_BU_AC  116.7 bu/ac 
## moisture  MOISTURE   17    %     
## interval  CYCLES     2     s     
## --------------------------------
## Checking data values
## ---------------------------------------
## variable  min  max  median  NAs  extreme 
## ---       ---  ---  ---     ---  ---     
## yield     10.25 372  118.553 0    6       
## moisture  6.90 22.9 17.3000 0    17      
## interval  1    3    2       0    0       
## ---------------------------------------
## Warning in print.check.yield(x): Extreme values identified. These these are
## values outside of the range of the mean ± 3sd
## 
## 
## Algorithm: RITAS
## Checking column names and units
## -------------------------------
## variable  column    mean  units 
## ---       ---       ---   ---   
## mass      -         -     -     
## flow      FLOW      18.80 lb/s  
## moisture  MOISTURE  17    %     
## interval  CYCLES    2     s     
## angle     -         -     -     
## width     SWATH     240   in    
## distance  DISTANCE  147.2 in    
## -------------------------------
## Checking data values
## ---------------------------------------
## variable  min  max  median  NAs  extreme 
## ---       ---  ---  ---     ---  ---     
## mass      -    -    -       -    -       
## flow      0.3  40.2 18.4000 0    0       
## moisture  6.90 22.9 17.3000 0    17      
## interval  1    3    2       0    0       
## angle     -    -    -       -    -       
## width     240  240  240     0    0       
## distance  5    304  153     0    0       
## ---------------------------------------
## Warning in print.check.yield(x): Extreme values identified. These these are
## values outside of the range of the mean ± 3sd
## Warning in print.check.yield(x): Column angle not found but can be estimated
## within pa_yield
## Median overlap between harvest polygons is 3.1 % 
## If this value seems high, please check the units of distance and swath.

Fix potential issues

pa_check_yield() searches for several kinds of issues that could prevent pa_yield() from working correctly. It is therefore worth addressing the identified issues before processing the data.

Example: missing yield column

Let us imagine a scenario in which we want to use the simple algorithm to process the data. To do so, we need a column with the yield data. The pa_yield() and pa_check_yield() functions will try to identify which columns of the input data frame contain the relevant information, and the units if those are not supplied. We can create a toy example in which the yield column has an uncommon name, preventing the functions from identifying it.

toy.example <- raw.yield 
names(toy.example) <- gsub('DRY_BU_AC', 'NOT_A_COMMON_NAME', names(toy.example))
chk <- pa_check_yield(toy.example, algorithm = 'simple')
chk
## Field information
## -------------------------
## # of points        4240  
## Approx. area (ha)  9.8529
## Approx. area (ac)  24.352
## Latitude           41.539
## Longitude          -93.27
## CRS                NA    
## -------------------------
## The CRS is missing from the input object. The pa_yield function will default to EPSG:4326
## 
## 
## Algorithm: Simple
## Checking column names and units
## -------------------------------
## variable  column    mean  units 
## ---       ---       ---   ---   
## yield     -         -     -     
## moisture  MOISTURE  17    %     
## interval  CYCLES    2     s     
## -------------------------------
## Checking data values
## ---------------------------------------
## variable  min  max  median  NAs  extreme 
## ---       ---  ---  ---     ---  ---     
## yield     -    -    -       -    -       
## moisture  6.90 22.9 17.3000 0    17      
## interval  1    3    2       0    0       
## ---------------------------------------
## Warning in print.check.yield(x): Extreme values identified. These these are
## values outside of the range of the mean ± 3sd
## Warning in print.check.yield(x): Columns yield and moisture are needed for the
## simple algorithm

To deal with this issue, we can rename the column to a more common name such as “yield” or “YIELD”. Additionally, a valid solution would be to provide the column name to the pa_yield() function directly when processing the data. Here, we rename the column simply to “yield” and we can see that the algorithm is able to identify it once again.

names(toy.example) <- gsub('NOT_A_COMMON_NAME', 'yield', names(toy.example))
chk <- pa_check_yield(toy.example, algorithm = 'simple')
chk
## Field information
## -------------------------
## # of points        4240  
## Approx. area (ha)  9.8529
## Approx. area (ac)  24.352
## Latitude           41.539
## Longitude          -93.27
## CRS                NA    
## -------------------------
## The CRS is missing from the input object. The pa_yield function will default to EPSG:4326
## 
## 
## Algorithm: Simple
## Checking column names and units
## -------------------------------
## variable  column    mean  units 
## ---       ---       ---   ---   
## yield     yield     116.7 bu/ac 
## moisture  MOISTURE  17    %     
## interval  CYCLES    2     s     
## -------------------------------
## Checking data values
## ---------------------------------------
## variable  min  max  median  NAs  extreme 
## ---       ---  ---  ---     ---  ---     
## yield     10.25 372  118.553 0    6       
## moisture  6.90 22.9 17.3000 0    17      
## interval  1    3    2       0    0       
## ---------------------------------------
## Warning in print.check.yield(x): Extreme values identified. These these are
## values outside of the range of the mean ± 3sd

Processing the data and examining the yield maps and diagnostic plots

The pa_yield() function can produce yield maps using two algorithms: simple and ritas. The simple algorithm allows moisture standardization, data cleaning, and smoothing. By default, it does not conduct any area-based aggregations. The ritas algorithm (Damiano & Niemi, 2020) follows a constructive framework, in which the steps are: rectangle creation; intersection assignment; tessellation; apportioning; and smoothing.

Simple aggregation of yield data

Initial example

When algorithm is simple, the function will search for the columns that indicate yield, moisture, and time. Note that, since different crops have a different conversion factors from bushel to pounds, it is important to specify the lbs.per.bushel argument when the US standard system of units is used. In this case, we are producing a yield map for a maize crop, thus, 56 lbs/bushel. Additionally, the function supports the output of the yield map in both U.S. standard and metric unit systems. In this example, we have chosen “metric”.

ymp1 <- pa_yield(input = raw.yield,
                 boundary = boundary,
                 algorithm = 'simple',
                 lbs.per.bushel = 56,
                 unit.system = 'metric',
                 verbose = FALSE)

The “ymp1” object contains information on the yield processing algorithm, smoothing method, conversion factor used, moisture, and a summary of the yield data.

ymp1
## Yield data processing algorithm: simple 
## Smoothing method: none 
## Conversion factor:  56 lbs/bushel
## Adj. moisture: 17.01373%
## Yield  summary (t/ha)
## -----------------
## Min.       0.635 
## 1st Qu.    4.4947
## Median     7.4521
## Mean       7.3107
## 3rd Qu.    10.0796
## Max.       23.054
## -----------------

We can visualize the yield map by plotting it

pa_plot(ymp1)

Unit conversion and moisture adjustment

In the previous map, we can see that the units are “t/ha”. However, a user might want to produce yield maps using US standard units. Additionally, the user can set the moisture level to the market standard moisture. When no moisture adjustment is provided, the function adjusts the moisture level to the average moisture in the data set.

ymp2 <- pa_yield(input = raw.yield,
                 boundary = boundary,
                 algorithm = 'simple',
                 unit.system = 'standard',
                 moisture.adj = 15.5,
                 lbs.per.bushel = 56,
                 verbose = FALSE)
ymp2
## Yield data processing algorithm: simple 
## Smoothing method: none 
## Conversion factor:  56 lbs/bushel
## Adj. moisture: 15.5%
## Yield  summary (bushel/acre)
## -----------------
## Min.       9.9172
## 1st Qu.    70.247
## Median     116.47
## Mean       114.26
## 3rd Qu.    157.53
## Max.       360.31
## -----------------

We can visualize the yield map in bushels/acre and at a moisture level of 15.5%

pa_plot(ymp2)

Outlier removal

In the previous maps we used simple aggregation and moisture standardization. We can also add a cleaning step to remove outliers from the raw yield data.

ymp3 <- pa_yield(input = raw.yield,
                 boundary = boundary,
                 algorithm = 'simple',
                 unit.system = 'metric',
                 clean = TRUE,
                 clean.sd = 3,
                 lbs.per.bushel = 56,
                 verbose = FALSE)

This cleaning step will remove observations outside of a 3 standard deviation range from the field mean. This will often leave empty spots in the field if no smoothing method is selected.

pa_plot(ymp3)

Inverse distance weighted interpolation

If we want to interpolate the data, the function can interpolate the yield map using two prediction methods: inverse distance weighted (IDW) and kriging. The IDW method is deterministic and is much faster than the kriging method. The kriging method is stochastic and the computation time scales with \(O(n^{3})\), where n is the number of observations.

Here, we can produce an interpolated yield map using the IDW method. By default, the function uses a power of 2, but that can be overridden by passing the argument idp to the function.

If the argument grid is not specified, the function will provide smoothed predictions at the same points as the input data. This provides predictions for locations where observations may have been removed during the cleaning step.

ymp4 <- pa_yield(input = raw.yield,
                 boundary = boundary, 
                 algorithm = 'simple',
                 unit.system = 'metric',
                 clean = TRUE,
                 clean.sd = 3,
                 smooth.method = 'idw',
                 lbs.per.bushel = 56,
                 verbose = FALSE)
ymp4
## Yield data processing algorithm: simple 
## Smoothing method: idw 
## Conversion factor:  56 lbs/bushel
## Adj. moisture: 17.01373%
## Yield  summary (t/ha)
## -----------------
## Min.       0.635 
## 1st Qu.    4.4886
## Median     7.4437
## Mean       7.2890
## 3rd Qu.    10.0523
## Max.       17.426
## -----------------
pa_plot(ymp4)

Kriging

Alternatively, we can interpolate the maps using the Kriging method. As noted earlier, the computational time scales quickly as the number of observations increases. Therefore, we will add the argument “maxdist” to decrease computational time. It should be noted that the value at which the user sets the “maxdist” argument should be determined after examining the variogram.

ymp5 <- pa_yield(input = raw.yield,
                 boundary = boundary,
                 algorithm = 'simple',
                 unit.system = 'metric',
                 clean = TRUE,
                 clean.sd = 3,
                 smooth.method = 'krige',
                 lbs.per.bushel = 56,
                 verbose = FALSE,
                 maxdist = 50)
ymp5
## Yield data processing algorithm: simple 
## Smoothing method: krige 
## Conversion factor:  56 lbs/bushel
## Adj. moisture: 17.01373%
## Yield  summary (t/ha)
## -----------------
## Min.       1.2440
## 1st Qu.    5.1989
## Median     7.6114
## Mean       7.4753
## 3rd Qu.    9.7014
## Max.       13.910
## -----------------
## Variogram model: Mat (Matern) 
## Partial sill:  122865.5 
## Range:  107.4692 
## Kappa:  0.3
pa_plot(ymp5, plot.var = 'yield')

We can also investigate the variogram

pa_plot(ymp5, plot.type = 'variogram')

RITAS

Initial example

When algorithm is ritas, the function expects an argument data.columns that indicates crop mass or crop flow, moisture, interval, angle, swath, and distance. Additionally, the function expects a data.units argument to indicate the units of the data columns. When neither of these arguments is provided, the function uses a dictionary of common names to search for likely matches and then attempts to guess the units.

The ritas algorithm is more computationally intensive than the simple algorithm. Some of the steps can be sped up by specifying the cores argument, which allows some of the processing steps to run in parallel.

ymp6 <- pa_yield(input = raw.yield,
                 algorithm = 'ritas',
                 lbs.per.bushel = 56,
                 unit.system = 'metric',
                 verbose = FALSE)

In this case, since the function is able to guess the columns and the units correctly, this would be the same as the chunk below.

ymp6 <- pa_yield(input = raw.yield,
                 data.columns = c(flow = 'FLOW', moisture = 'MOISTURE', interval = 'CYCLES', width = 'SWATH', distance = 'DISTANCE'),
                 data.units = c(flow = 'lb/s', moisture = '%', interval = 's', width = 'in', distance = 'in'),
                 unit.system = 'metric',
                 algorithm = 'ritas',
                 verbose = FALSE) 
pa_plot(ymp6)

Complete RITAS algorithm

In the previous yield map, we did not apply smoothing, so technically we did not complete the full ritas workflow. Following the algorithm’s steps, the best results are often obtained by setting smooth.method = "krige". The function also includes the option to krige on the log scale when the yield distribution is heavily skewed. In that case, set fun = "log".

ymp7 <- pa_yield(input = raw.yield,
                 boundary = boundary,
                 algorithm = 'ritas',
                 smooth.method = 'krige',
                 unit.system = 'metric',
                 lbs.per.bushel = 56,
                 verbose = FALSE,
                 maxdist = 50)
ymp7
## Yield data processing algorithm: ritas 
## Smoothing method: krige 
## Conversion factor:  56 lbs/bushel
## Adj. moisture: 17.01373%
## Yield  summary (t/ha)
## -----------------
## Min.       1.6147
## 1st Qu.    6.2176
## Median     9.0567
## Mean       8.8636
## 3rd Qu.    11.634
## Max.       17.456
## -----------------
## Variogram model: Mat (Matern) 
## Partial sill:  194472.9 
## Range:  197.6451 
## Kappa:  0.3
pa_plot(ymp7)