Detection of coffee fruits on tree branches using computer vision

ABSTRACT Coffee farmers lack efficient tools for obtaining sufficient and reliable information on the maturation stage of coffee fruits before harvest. In this study, we propose a computer vision system to detect and classify Coffea arabica (L.) fruits on tree branches into three classes: unripe (green), ripe (cherry), and overripe (dry). Based on deep learning algorithms, the computer vision model YOLO (You Only Look Once) was trained on 387 images taken from coffee branches using a smartphone. The YOLOv3 and YOLOv4, and their smaller versions (tiny), were assessed for fruit detection. The YOLOv4 and YOLOv4-tiny showed better performance when compared to YOLOv3, especially when smaller network sizes are considered. The mean average precision (mAP) for a network size of 800 × 800 pixels was equal to 81 %, 79 %, 78 %, and 77 % for YOLOv4, YOLOv4-tiny, YOLOv3, and YOLOv3-tiny, respectively. Despite the similar performance, the YOLOv4 feature extractor was more robust when images had greater object densities and for the detection of unripe fruits, which are generally more difficult to detect due to the color similarity to leaves in the background, partial occlusion by leaves and fruits, and lighting effects. This study shows the potential of computer vision systems based on deep learning to guide the decision-making of coffee farmers in more objective ways.


Introduction
Coffee demand has increased along with the demand for high-quality products. The supply of high-quality coffee is attributed mainly to improvements in selective harvesting, preferably of ripe fruits (Pineda et al., 2022). Enhancing selective harvest has allowed for the emergence of special products. To meet the increasing demands, new technologies and good crop management practices are needed to improve the quality of harvested coffee without harming the environment.
Most coffee farmers lack efficient tools for obtaining sufficient and reliable information about the maturation stage of coffee fruits before harvest (Ramos et al., 2017). Tracking the maturation stage of coffee fruits can help determine adequate harvesting periods based on the percentage of mature fruits on tree branches (Ramos et al., 2018; Rodríguez et al., 2020). This information is essential for crop management and adequately supports decision-making (Martello et al., 2022).
The color of fruit samples is traditionally used to assess the maturation of coffee fruits, and the evaluation can be visual or made with colorimeters. Colorimeters measure the color of the fruit surface but without spatial representativeness (Oliveira et al., 2016). Visual classification, in turn, can be subjective and relies on the evaluator's experience.
In recent decades, systems based on computer vision have been widely applied to detect and classify fruits (Bazame et al., 2021; Ning et al., 2022; Thendral and David, 2022; Wang et al., 2019; Wu et al., 2020a). Few studies have reported on the classification of coffee fruits before the harvest, which can aid the decision-making of coffee farmers (Avendano et al., 2017; Ramos et al., 2018). However, most of these studies adopted techniques that require first extracting various features and then feeding them to the classification algorithm.
Recent advances in computer vision systems based on deep learning allow several features to be extracted automatically. For example, the YOLO (You Only Look Once) algorithm is a popular computer vision algorithm that has been used in several challenges in agriculture. YOLO has previously been used to detect flowers for robotic pollination (Li et al., 2022), fruit load and maturation (Cuong et al., 2022; Fu et al., 2022; Mirhaji et al., 2021), and weeds (Parico and Ahamed, 2020). Therefore, this study aims to implement and explore different YOLO algorithms to detect coffee fruits on tree branches and classify the fruits according to the different maturation stages.

Data acquisition and labeling
The dataset used in this study consists of 387 RGB images of coffee fruits on tree branches (Figure 1). We used a smartphone to photograph the fruits before harvest, between 12 and 29 May 2020, on a commercial farm of arabica coffee (Catuaí 144) in the municipality of Patos de Minas, Minas Gerais State, Brazil (18°32'28.55"S, 46°3'51.17"W, altitude 1020 m). Although the pictures were taken near the harvest, the uneven flowering of the crop over time resulted in pictures of coffee fruits with a mix of maturation stages. To develop a computer vision model that is robust to different field conditions, the pictures were taken from different angles, sides, and plants randomly selected across coffee lines. This resulted in a diverse scenario under different lighting conditions. The pictures were taken without zoom or flash and were saved with an image resolution of 72 dpi. The smartphone camera automatically adjusted the white balance. The images were then randomly split into a training set (~80 % = 310 images) and a testing set (~20 % = 77 images).
The images were annotated considering three stages (classes) of coffee fruit maturation: unripe (green), ripe (cherry), and overripe (or dry). The annotation was carried out using the graphical user interface Yolo Mark (Bochkovskiy et al., 2020).
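As an illustration of the random split described above, the following minimal Python sketch reproduces the 310/77 partition. The file names, the seed, and the `split_dataset` helper are hypothetical and are not taken from the study; only the 387-image total and the ~80/20 proportion come from the text.

```python
import random

def split_dataset(image_names, test_fraction=0.2, seed=0):
    """Shuffle image names and split them into training and testing lists."""
    names = sorted(image_names)                 # fixed order so the shuffle is reproducible
    rng = random.Random(seed)                   # seed is an arbitrary, assumed value
    rng.shuffle(names)
    n_test = int(len(names) * test_fraction)    # 387 * 0.2 = 77.4, truncated to 77
    return names[n_test:], names[:n_test]

# Hypothetical file names standing in for the 387 smartphone images.
images = [f"img_{i:04d}.jpg" for i in range(387)]
train, test = split_dataset(images)
print(len(train), len(test))
```

With 387 images, truncating 20 % gives 77 testing images and leaves 310 for training, which matches the counts reported above.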
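For readers unfamiliar with the annotation format, the sketch below converts a pixel bounding box into the plain-text label line commonly used by Darknet-style YOLO tools (class index followed by the box center, width, and height, all normalized by the image size). The class ordering and the `to_yolo_line` helper are assumptions made for illustration; the study does not state which index was assigned to each class.

```python
# Assumed class ordering; the text does not specify the indices used.
CLASSES = ["unripe", "ripe", "overripe"]

def to_yolo_line(class_name, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel box (corner coordinates) to a normalized YOLO label line."""
    class_id = CLASSES.index(class_name)
    x_center = (x_min + x_max) / 2 / img_w   # box center, relative to image width
    y_center = (y_min + y_max) / 2 / img_h   # box center, relative to image height
    width = (x_max - x_min) / img_w          # box width, relative to image width
    height = (y_max - y_min) / img_h         # box height, relative to image height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical ripe fruit in a 1000 x 800 pixel image.
line = to_yolo_line("ripe", 100, 200, 300, 400, 1000, 800)
print(line)  # 1 0.200000 0.375000 0.200000 0.250000
```

Because the coordinates are normalized, the same label file remains valid when the image is resampled to any of the network sizes.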

Computer vision algorithm
This study chose the YOLO algorithm for object detection (Redmon and Farhadi, 2018). YOLO belongs to a family of one-stage object detectors and is popular for its speed and accuracy (Wu et al., 2020a). In this study, we assessed the improvements of the latest YOLO version, YOLOv4 (Bochkovskiy et al., 2020), compared to its former version, YOLOv3 (Redmon and Farhadi, 2018). The improvements of the YOLOv4 over its former version include the Mish activation function (Misra, 2019), CutMix and mosaic data augmentation, Cross-Stage Partial connections (CSP), Cross mini-Batch Normalization (CmBN), Spatial Pyramid Pooling (SPP) (He et al., 2015) and Path Aggregation Network (PANet) blocks, and the Complete Intersection over Union (CIoU) loss (Zheng et al., 2019), among others.
Besides the YOLOv3 and YOLOv4, a smaller version of these models, termed "tiny", was also assessed. The YOLO-tiny models were developed with fewer convolutional layers and are suitable for constrained devices, such as mobile phones (Tang, 2018), microcomputers, and microcontrollers.
The object detection models were trained considering different network sizes, with images resampled to match the corresponding network size. The network sizes adopted were 320 × 320, 416 × 416, 512 × 512, 608 × 608, 704 × 704, and 800 × 800 pixels. For training, the batch size was set to 32 in the forward pass and the number of iterations was equal to 6000. The confidence threshold (c) and the non-maximum suppression threshold adopted were 0.25 and 0.45, respectively. The performance criterion was tracked for each training iteration using the test set. The weights with the best performance were adopted as the final weights for the model.
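To make the post-processing step concrete, the sketch below applies a confidence threshold of 0.25 followed by a greedy, per-class non-maximum suppression with an overlap threshold of 0.45. It is a simplified illustration and not the Darknet implementation; treating 0.45 as an intersection-over-union threshold and suppressing only within the same class are assumptions, and the example detections are invented.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def postprocess(detections, conf_thresh=0.25, nms_thresh=0.45):
    """Filter (box, score, class_id) tuples by confidence, then apply greedy NMS."""
    candidates = [d for d in detections if d[1] >= conf_thresh]
    candidates.sort(key=lambda d: d[1], reverse=True)   # highest confidence first
    kept = []
    for det in candidates:
        # Keep the box unless it overlaps too much with a kept box of the same class.
        if all(det[2] != k[2] or iou(det[0], k[0]) <= nms_thresh for k in kept):
            kept.append(det)
    return kept

# Invented detections: two overlapping "ripe" boxes, one low-confidence box, one isolated box.
raw = [
    ((10, 10, 50, 50), 0.90, 1),
    ((12, 12, 52, 52), 0.60, 1),      # overlaps the first box (IoU about 0.82), suppressed
    ((100, 100, 140, 140), 0.20, 0),  # below the confidence threshold, discarded
    ((200, 200, 240, 240), 0.55, 2),
]
kept = postprocess(raw)
print([(score, cls) for _, score, cls in kept])
```

Raising the confidence threshold removes more low-scoring boxes (fewer false positives, more missed fruits), whereas raising the suppression threshold tolerates more overlapping boxes, which matters for clustered fruits.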

Performance evaluation
The performance of the computer vision algorithms was measured by the mean value of the average precisions (mAP) obtained for all classes detected, considering an intersection over union of 50 %. The average precision (Eq. (1)) is the average value of 11 points on the precision/recall curve for pre-determined confidence thresholds for the same class. The precision (Eq. (2)) and recall (Eq. (3)) are computed for 11 equally spaced confidence thresholds (c = 0.0, 0.1, …, 1.0), and the precision at each recall level is interpolated by taking the maximum precision measured for a threshold whose corresponding recall r' equals or exceeds r (Eq. (4)):

$$AP = \frac{1}{11} \sum_{r \in \{0.0, 0.1, \ldots, 1.0\}} p(r) \qquad (1)$$

$$\mathrm{Precision}(c) = \frac{TP(c)}{TP(c) + FP(c)} \qquad (2)$$

$$\mathrm{Recall}(c) = \frac{TP(c)}{TP(c) + FN(c)} \qquad (3)$$

$$p(r) = \max_{r' \geq r} p(r') \qquad (4)$$

where: AP is the average precision, TP are true positives, FP are false positives, FN are false negatives, p(r) is the precision at recall level r, p(r') is the precision at recall level r', and c is the confidence threshold.
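The 11-point interpolation described above can be sketched in a few lines of Python. The function names and the precision/recall pairs are hypothetical; the sketch assumes that precision and recall have already been computed for one class at an intersection over union of 50 %, and that a recall level with no measured point contributes zero precision.

```python
def average_precision_11pt(recalls, precisions):
    """11-point interpolated average precision for one class."""
    total = 0.0
    for i in range(11):
        r = i / 10                                # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision among points with recall >= r.
        reachable = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(reachable) if reachable else 0.0
    return total / 11

def mean_average_precision(ap_per_class):
    """mAP as the plain mean of the per-class average precisions."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical precision/recall pairs measured at decreasing confidence thresholds.
recalls = [0.2, 0.4, 0.6, 0.8]
precisions = [1.0, 0.8, 0.6, 0.5]
ap = average_precision_11pt(recalls, precisions)
print(round(ap, 3))
```

In this invented example the recall never reaches 0.9, so the two highest recall levels contribute zero, which is how missed fruits lower the average precision even when the detections that are made are precise.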

Results and Discussion
The results are presented and discussed in three subsections. The first subsection discusses the general performance obtained by the object detection algorithms and highlights the main findings of this study. The following subsections detail more specific outcomes from the algorithms concerning performance scores for the different classes and object densities, respectively.

General performance obtained by YOLO algorithms
The performance of coffee fruit detection for each YOLO algorithm and network size, as measured by their mean average precision (mAP), is presented in Figure 2. For the YOLOv4, YOLOv4-tiny, and YOLOv3, the mAP stabilized near the network size of 608 × 608 pixels. In contrast, the performance of the YOLOv3-tiny continued to increase until the network size of 704 × 704 pixels. Despite stabilizing, the performance of the algorithms still showed slight improvements up to 800 × 800 pixels. In general, both the YOLOv4 and YOLOv4-tiny outperformed the YOLOv3. The YOLOv4-800 scored the highest mAP (81 %), followed by YOLOv4-tiny-800 (79 %), YOLOv3-800 (78 %), and YOLOv3-tiny-800 (77 %).
The smaller the network size, the more the YOLOv4 and YOLOv4-tiny outperform the YOLOv3 and YOLOv3-tiny. In contrast, when larger network sizes are considered, for instance, 704 × 704 and 800 × 800 pixels, the difference in performance between the YOLOv3 and YOLOv3-tiny is negligible. Perhaps the most important outcome here is that the YOLOv4-tiny outperforms the YOLOv3. This means that the updates made for the latest YOLO version were crucial to improve its performance, even when considering a restricted number of convolutional layers. The YOLOv4-tiny requires ~90 % fewer floating-point operations (measured in billions) than the YOLOv3, which means that its model weights not only occupy less space on a hard drive but can also be run much faster.
The detections made by the YOLO algorithms for three random images from the dataset, considering the network size of 800 × 800 pixels, are shown in Figures 3A, 3B, and 3C. The mAP obtained for each image is also displayed in the figure, where YOLOv4 consistently outperforms the other algorithms. YOLOv4 better detects fruits that overlap (Figure 3C) or are in the shade (Figure 3A). It also better detects unripe (green) fruits, even when they are visually smaller in the background and between the leaves (Figure 3B). Another adaptation that could further improve model detection is that suggested by Liu et al. (2020). The authors adapted the YOLO algorithm to use a circular bounding box rather than the traditional rectangular one. Because of the tomato shape, the circular bounding box allowed for better object detection under challenging lighting conditions, branch and leaf occlusion, and overlapping of tomatoes. The proposed algorithm performed better than the other methods and improved detection under occlusion conditions. In Figures 3A, 3B, and 3C, YOLOv4 generally detected occluded or overlapped objects better, even under challenging settings.
The high performance of YOLOv3-tiny (Figure 3A) deserves special attention. There seems to be a surplus of detections (bounding boxes) in the figure, which, despite resulting in high recall (0.92), results in lower precision (0.70) because of the large number of false positives (see Eqs. (2) and (3)). This is an outcome of poorly predicted boxes for this specific image that were not adequately removed by the confidence threshold and non-maximum suppression post-processing. In contrast, the YOLOv3 model predicted coffee fruits in this figure with lower confidence, resulting in fewer boxes and higher precision (0.83), but much lower recall (0.42) and mAP. Although the YOLOv3-tiny and YOLOv4 models obtained a similar mAP for the example image (Figure 2), YOLOv4 resulted in far better predictions, with both high precision (0.88) and recall (0.88).
To better assess the trade-offs between precision and recall, Figure 4 shows the distribution of performance scores (mAP, precision, and recall) for each test set image. Despite the overall higher median and mean mAP obtained from all images in the test set for YOLOv4, there are clear trends in the precision and recall trade-offs that can be assessed. The mAP is obtained by considering a set of different confidence thresholds, whereas the final precision and recall are calculated assuming a pre-set confidence threshold (c = 0.25). As discussed above, obtaining high precision at the expense of too many missed detections (false negatives) leads to a lower recall. For example, the YOLOv3 algorithm scores, for most network sizes, relatively higher precision but lower recall. In contrast, the YOLOv4 algorithm shows the opposite behavior, scoring relatively higher recall and lower precision.
Despite the general trends observed in the precision-recall trade-offs of the different algorithms, the results may partially be attributed to the random weight adjustment process during training. In this study, the final weights of the models were set as the weights obtained after the training iteration that resulted in the highest mAP for the test set from all 6000 iterations. However, predictions from weights scoring similar mAP can present different precision-recall trade-offs. Thus, ultimately, the final user of the model decides whether it is more important to identify all true positives regardless of a few false positives, or whether predicting false positives can be detrimental or costly to the final objective. In general, similar values of precision and recall indicate a well-balanced model and a robust precision-recall trade-off.
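The trade-off discussed above can be illustrated by recomputing precision and recall for the same set of detections at two confidence thresholds. The detections, the number of annotated fruits, and the helper names below are invented for illustration and do not come from the test set of this study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts of true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def sweep(scored_detections, n_ground_truth, thresholds):
    """scored_detections: list of (confidence, is_true_positive) pairs for one image."""
    rows = []
    for c in thresholds:
        kept = [hit for conf, hit in scored_detections if conf >= c]
        tp = sum(kept)                 # detections that match an annotated fruit
        fp = len(kept) - tp            # detections with no matching fruit
        fn = n_ground_truth - tp       # annotated fruits left undetected
        rows.append((c, *precision_recall(tp, fp, fn)))
    return rows

# Invented image with 5 annotated fruits and 6 candidate detections.
dets = [(0.90, True), (0.80, True), (0.60, False),
        (0.50, True), (0.30, False), (0.26, True)]
rows = sweep(dets, n_ground_truth=5, thresholds=[0.25, 0.70])
for c, p, r in rows:
    print(f"c={c:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy case the stricter threshold removes both false positives but also discards two correct detections, so precision rises while recall falls, which is the same pattern reported for YOLOv3 relative to YOLOv4.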

Performance by detection class
The average precision (AP) obtained for each class highlights a close performance between YOLOv4 and the other models for detecting ripe and overripe coffee fruits, especially for larger network sizes (Figure 5). For example, for ripe fruits and a network size of 800 × 800 pixels, the YOLOv4-tiny, YOLOv3, and YOLOv3-tiny scored APs of 83 %, 84 %, and 80 %, respectively, while YOLOv4 scored an AP (84 %) higher by 1, 0.4, and 4 percentage points, respectively. For overripe fruits, the YOLOv4-tiny, YOLOv3, and YOLOv3-tiny scored APs of 78 %, 77 %, and 76 %, respectively, while YOLOv4 scored an AP (80 %) higher by 2, 3, and 4 percentage points, respectively. YOLOv4 stands out in detecting unripe (green) coffee fruits, which are generally more difficult to detect because of leaves on the branches and in the background. YOLOv4 scored an AP of 80 % for unripe fruits and a network size of 800 × 800 pixels, which is higher by 4, 7, and 4 percentage points than those scored by YOLOv4-tiny, YOLOv3, and YOLOv3-tiny, respectively. The difference is even higher when smaller network sizes are considered.
Other computer vision systems were also developed to predict the maturation stage of coffee fruits on tree branches (Ramos et al., 2018). The computer vision system classifies coffee fruits after building a 3D model of on-branch coffee fruits and results in classification efficacy between 42 % and 92 % for the different classes of the maturation stage. A computer vision model to detect coffee fruits and classify their maturation stage during harvest was proposed by Bazame et al. (2021). The authors then mapped the maturation stage across the coffee plantation with an mAP of 86 %, 85 %, and 80 % for unripe, ripe, and overripe fruits, respectively. The lower mAP for unripe fruits in this study, compared to that of Bazame et al. (2021), can be attributed to the environment where images were taken. Here, pictures were taken of on-branch coffee fruits with a diverse background, including leaves and shades, whereas Bazame et al. (2021) collected data inside the harvester, where the environment had controlled illumination and a contrasting background. Besides, the authors also registered a lower mAP score for overripe fruits.
A further opportunity for the present study could be related to predicting coffee yield from full lateral pictures of coffee plants, as proposed by Idol and Youkhana (2020). However, obtaining such information for field scales requires collecting images along with geographic coordinates at higher rates. Besides, data collection at higher rates by autonomous systems has been proposed in different studies. For example, an autonomous robot to monitor vineyard water potential was proposed by Saiz-Rubio et al. (2021). Autonomous robots have even been proposed to perform actions, such as tomato harvesting (Liu et al., 2020), strawberry harvesting (Xiong et al., 2020), and weed control (Wu et al., 2020b).
Detection of coffee fruits using computer vision. Sci. Agric. v.80, e20220064, 2023

Performance for different object densities
It is harder for a smaller network to detect coffee fruits in higher object-density scenarios. This is because resizing images to lower resolution may blur the boundaries of fruits. This behavior is evident in Figure 6, which shows lower median mAP (red dashed lines) obtained for smaller networks and steeper slopes for the ordinary least squares regression fitted to data (blue line). For example, the YOLOv3 and YOLOv3-tiny models resulted in mAP lower than 70 % and 57 %, respectively, in 50 % of the images in the test set for a network size of 320 × 320 pixels. YOLOv4-tiny and YOLOv4 were more robust in extracting features and avoided these effects for the smaller network sizes. For YOLOv4-tiny and YOLOv4, 50 % of the test set images scored mAP equal to or higher than […].
An adaptation of the YOLOv3 model for the detection of litchi (YOLOv3-Litchi) in images with a high density of fruits has been proposed by Wang et al. (2021). The authors adapted the model to have fewer convolutions than the original YOLOv3 and to predict from feature maps at higher resolutions, which increased the accuracy of detecting objects in images with high densities of small fruits.
As the network size and, therefore, the resolution of resized images increases, the problem is mitigated. For example, the regression slopes for the YOLOv3-tiny models flattened from -0.975 to -0.329 as the network size increased from 320 × 320 to 800 × 800 pixels. Overall, the fitted regressions had gentler slopes (closer to 0) for scores obtained using larger network sizes. This is especially true for the YOLOv4 algorithm, whose slope was only -0.257 for the network size of 800 × 800 pixels. Input images at higher resolutions mean larger networks and usually better performance in object detection, but they may also increase the time required to predict (Wang et al., 2021) or constrain the model to hardware with higher computing power. The YOLOv4-tiny also performed better than YOLOv3-tiny in this regard, even at smaller network sizes, which can be attributed to its more robust feature extractor.
The developed models better detect ripe coffee fruits, which contrast better with the background of the images. In contrast, the performance in detecting unripe (green) fruits was considerably lower, which can be attributed to the coffee fruits being partially occluded by leaves (of similar color) and in the shade. Overall, the YOLOv4 algorithm was more robust in detecting unripe fruits and less influenced by object density in images.
Future studies could advance this research in many directions. The image acquisition could be associated with geographic coordinates or even captured by an automated system, allowing for the spatialization of such information. The continuous collection of images from all sides of coffee plants could also be used to estimate fruit count and, therefore, plant yield.
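The slope and median reported for each model can be obtained from per-image scores with an ordinary least squares fit, as sketched below. The per-image object counts and mAP values are invented, and the sketch assumes that object density is expressed as the number of annotated fruits per image and mAP in percent; the units behind the reported slopes are not stated in the text.

```python
from statistics import mean, median

def density_trend(object_counts, map_scores):
    """Ordinary least squares fit of per-image mAP against object count."""
    x_bar, y_bar = mean(object_counts), mean(map_scores)
    sxx = sum((x - x_bar) ** 2 for x in object_counts)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(object_counts, map_scores))
    slope = sxy / sxx                      # change in mAP per additional object
    intercept = y_bar - slope * x_bar
    return slope, intercept, median(map_scores)

# Invented per-image values: number of fruits and mAP (%) for four test images.
counts = [5, 10, 20, 40]
scores = [90, 85, 75, 55]
slope, intercept, med = density_trend(counts, scores)
print(round(slope, 3), round(intercept, 3), med)
```

A slope closer to zero indicates that the per-image mAP depends less on how many fruits appear in the image, which is the sense in which the larger network sizes are described as more robust.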

Figure 1 - Image acquisition for coffee fruits on tree branches.

Figure 2 - Performance of the different computer vision algorithms and network sizes assessed to detect coffee fruits on branches. mAP = mean values of average precisions.

Figure 3 - Coffee fruit detections made by YOLO algorithms considering a network size of 800 × 800 pixels for three arbitrary images representing fruits (A) in the shade, (B) between the leaves, and (C) overlapped.

Figure 4 - Distribution of performance scores obtained for each image of the test set by the different computer vision algorithms and network sizes used in this study. mAP = mean values of average precisions.

Figure 5 - Performance of the different computer vision algorithms and network sizes assessed for each class of detection. AP = Average precision.

Figure 6 - Performance obtained by the different computer vision algorithms and network sizes assessed for each image of the test set separately. The red dashed line represents the median mAP. The blue line represents the ordinary least squares regression fit to the data. Steeper slopes mean that it is more difficult for the model to detect objects when object density is higher in the dataset. mAP = mean values of average precisions.