Effort estimation for software products targeted at the manufacturing sector using machine learning algorithms

Abstract

Paper aims  This study investigates the accuracy of machine learning algorithms for estimating the effort required for software development in the manufacturing sector, aiming to identify the most effective algorithms according to the nature and complexity of the data and the number of available attributes.

Originality  This work distinguishes itself from other studies in the field of effort prediction by utilizing a data repository that consists exclusively of projects from the manufacturing sector. This approach ensures that the specific characteristics of manufacturing projects are reflected in the predictions, addressing a gap in the existing literature. Another notable contribution of this study is the comparative analysis of various machine learning algorithms assessed under different dimensionality scenarios (three and five variables). Although this factor is crucial for enhancing effort estimation accuracy, it has received limited attention in the literature.

Research method  The techniques investigated in this work were (i) Support Vector Regression (SVR), (ii) Gradient Boosting Machines (GBM), (iii) eXtreme Gradient Boosting (XGBoost), (iv) Random Forest (RF), (v) Extreme Learning Machine (ELM), and (vi) Linear Regression (LR). Performance measures such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R2) were used to compare the results achieved by each model, considering a dataset of 230 records originating from various countries.

Main findings  The comparison among machine learning models revealed significant performance variations depending on the number of variables and the evaluation metrics adopted. GBM stood out for its robustness in complex scenarios, while SVR achieved the lowest mean absolute error. ELM, in turn, proved effective with fewer variables but showed sensitivity to outliers and less stability in more complex contexts. Among all the techniques evaluated, XGBoost yielded the weakest performance in terms of RMSE and R2.

Implications for theory and practice  This study contributes by applying these models to the manufacturing sector and comparing scenarios with three and five variables. The results support a more informed selection of models based on project complexity and data dimensionality. Further research in this area will allow stronger theoretical and practical conclusions to be drawn.

Keywords:
Software effort estimation; Software in the manufacturing sector; Software project management; Machine learning

1. Introduction

In software development, accurately estimating the effort required early in a project is essential for its success (Kassaymeh et al., 2024; Lavingia et al., 2024; Van Hai et al., 2022a). Software Development Effort Estimation (SDEE) involves forecasting the work and time needed to deliver a system within set constraints (López-Martín, 2022). Inaccurate estimates can disrupt later planning stages, reduce product quality, and lead to financial losses, especially when organizational needs are not met (Kaushik et al., 2020; Kumar et al., 2020).

The increasing availability of historical project data and the growing need for accurate forecasting in various scientific domains have driven the development of robust and efficient techniques capable of modeling stochastic dependencies between past and future observations (Bontempi et al., 2013).

Although similar techniques are frequently used in the literature, results often vary significantly. In the Artificial Intelligence field, for instance, Jiang et al. (2019) highlighted the effectiveness of Artificial Neural Networks. However, other studies have noted drawbacks such as slow convergence, low accuracy, and high sensitivity to initial parameters, often requiring metaheuristic algorithms to improve their performance (Kassaymeh et al., 2024). These issues contribute to increased complexity and longer implementation times.

In the context of Machine Learning (ML), various authors (Kassaymeh et al., 2024; Lavingia et al., 2024; Varshini & Kumari, 2024; Al-Betar et al., 2023; Rahman et al., 2023; Rao et al., 2024; Sharma & Vijayvargiya, 2020) have reported promising results in software effort estimation. Rankovic et al. (2021) emphasize that ML techniques are effective for SDEE due to their ability to learn from data with minimal human intervention. Nevertheless, there is still no consensus on the best ML method, as no single approach has proven to be universally optimal.

A review of the related literature also reveals a noticeable lack of studies specifically focused on SDEE in the manufacturing sector, which constitutes the central theme of this research. Moreover, many authors do not specify the types of variables used in model construction, hindering comparative analysis and limiting the applicability of results to other studies. It was further observed that most research efforts are conducted within the scope of specific systems or organizations, thereby restricting the generalizability of their findings.

This study aims to estimate software development effort in the manufacturing sector using international data from the ISBSG repository. A comparative analysis is performed on the performance of various machine learning models, namely GBM, SVR, ELM, RF, LR, and XGBoost, using two variable sets (with three and five attributes) to evaluate the effect of dimensionality. Model performance is assessed based on R2, RMSE, and MAE. The goal is to identify the most effective models in relation to data complexity and variable availability, supporting more accurate and context-sensitive effort estimations.

This research is justified by its focus on the manufacturing sector, which allows for integrating domain-specific characteristics into effort estimation. Using an international database enhances the representativeness of the results, making the predictions more accurate and generalizable. Additionally, the study compares different machine learning models, analyzing the impact of dimensionality (three and five variables) on predictive performance — an approach that has been underexplored in the literature. The results provide valuable insights for selecting algorithms best suited to the data complexity and application context.

The scientific contributions are as follows:

  • Software effort estimation tailored explicitly for the manufacturing sector, integrating the unique characteristics of this domain and contributing to more realistic and context-aligned estimates for industrial applications.

  • Utilization of an international database (ISBSG) comprising projects from various countries, enabling a comprehensive and representative analysis of different software development practices and realities.

  • Comparative evaluation of multiple machine learning models, including SVR, GBM, XGBoost, RF, ELM, and LR, based on performance metrics such as R2, RMSE, and MAE.

  • Investigation of the impact of data dimensionality on model performance, with tests conducted in scenarios with three and five variables, offering insights into the trade-off between model simplicity and accuracy.

  • Practical contribution to selecting predictive models in effort estimation, providing guidelines based on data nature and organizational objectives.

The remainder of this article is structured as follows: Section 2 presents the methodology, including the dataset and its statistical characteristics, the algorithms used, the performance metrics, and the proposed framework. Section 3 presents and discusses the results, while Section 4 offers the conclusions.

2. Methodology

2.1. Information about the data repository

This study primarily relied on a database provided by the International Software Benchmarking Standards Group (ISBSG). The July 2022 version of the repository was used in this research. The repository includes contributions from multiple countries, with the largest being Spain (19.7%), Switzerland (19.5%), the United States (19.1%), Australia (7.6%), and Japan (7.5%). Brazil ranks twelfth, contributing 1.4% of the total data. Other contributing countries include Finland, China, France, Canada, India, and Denmark. Considering all selected characteristics and after cleaning the dataset, which involved removing missing values and outliers, the resulting sample consisted of 230 projects.

2.2. Dependent and predictor variables

The variable 'normalized effort,' measured in hours, was defined as the dependent (output) variable of the model. This variable encompasses all phases of the software development lifecycle, including planning, specification, design, construction, testing, and implementation.

The first predictor defined was 'relative size,' which corresponds to the functional size of the software, measured using the Functional Size Measurement (FSM) Method. This method includes approaches such as COSMIC, FiSMA, IFPUG 4+, IFPUG old, LOC, Mark II, and NESMA.

The second predictor defined was 'team size.' This variable represents the number of individuals who worked at any point during the project development.

The third predictor variable defined was 'development platform', which specifies the main development platform (as determined by the operating system used). Each project is classified as: PC, Mid Range (MR), Main Frame (MF), or Multiplatform (Multi). The fourth predictor variable is 'language type'. This variable defines the type of programming language used in the project, namely: 3GL, 4GL, and Application Generator (ApG).
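
To make the setup concrete, a minimal preprocessing sketch in Python is shown below; the file name and column labels are hypothetical (the exact ISBSG field names are not reproduced here), and the two categorical predictors are one-hot encoded before being passed to the regression models.

```python
import pandas as pd

# Hypothetical file and column names for illustration; the actual ISBSG
# field labels differ. The cleaned sample contains 230 projects.
df = pd.read_csv("isbsg_manufacturing.csv")

X = df[["relative_size", "team_size", "development_platform", "language_type"]]
y = df["normalized_effort"]  # dependent variable, in hours

# One-hot encode the categorical predictors (PC/MR/MF/Multi; 3GL/4GL/ApG)
X = pd.get_dummies(X, columns=["development_platform", "language_type"])
```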

2.3. Regression methods for effort prediction

2.3.1. Support Vector Regression

Support Vector Regression (SVR) identifies support vectors near a hyperplane to maximize the margin based on a threshold from the target value. It uses kernel functions to handle non-linear problems, with a linear kernel selected in this study (Box et al., 2015).

The forecast is obtained through the linear regression stated as

y = \sum_{i=1}^{n} \varpi_i K(x_i, x_j) + b    (1)

in which y is the vector of outputs, \varpi_i is a weight, K(x_i, x_j) is a kernel function, equivalent to an inner product between observations (x_i, x_j) in some feature space, b is the bias parameter, and n is the number of observations. To estimate the parameters, the following optimization problem is formulated:

\min_{\omega, b, \xi, \xi^*} \; \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)    (2)
subject to: y_i - \omega x_i - b \le \varepsilon + \xi_i^*,    (3)
\omega x_i + b - y_i \le \varepsilon + \xi_i,    (4)
\xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, n,    (5)

where C is a penalty factor, \varepsilon defines the \varepsilon-insensitive loss function, and \xi and \xi^* are slack variables. Several kernel functions can be employed, described as

Gaussian: k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{\delta^2} \right)    (6)
Linear: k(x_i, x_j) = \langle x_i, x_j \rangle    (7)
Polynomial: k(x_i, x_j) = \left( 1 + \langle x_i, x_j \rangle \right)^p    (8)

where \delta is the width of the Gaussian kernel and p is the degree of the polynomial kernel.
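
As an illustration, an SVR with the linear kernel adopted in this study can be fitted as sketched below with scikit-learn; the values of C and epsilon are illustrative defaults rather than the study's tuned settings, and X_train/y_train denote a previously prepared training split.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# C is the penalty factor of Equation 2; epsilon is the tube width of
# Equations 3-4. Both values are illustrative, not the study's settings.
svr = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0, epsilon=0.1))
svr.fit(X_train, y_train)
effort_pred = svr.predict(X_test)
```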

2.3.2. Gradient Boosting Machines

Gradient Boosting Machines (GBM) is an ML technique renowned for modeling complex regression and classification problems, including software effort estimation. The step-by-step procedure of GBM is described as follows:

Let D = \{(x_i, y_i)\}_{i=1}^{n} be the adopted dataset, and consider the loss function (for regression problems) computed as follows:

L[y_i, F(x)] = \frac{1}{2} [y_i - F(x)]^2    (9)

where y_i is the i-th observed output and F(x) is the function related to the model used to obtain the predicted values. The model is initialized with a constant value,

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)    (10)

where \gamma is the initial predicted value. In the context of regression problems, the \arg\min over \gamma means finding the value of \gamma that minimizes the loss function L. Setting the derivative of the total loss with respect to \gamma to zero gives

\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} L(y_i, \gamma) = 0    (11)
\gamma = \frac{1}{n} \sum_{i=1}^{n} y_i    (12)

For m = 1 to M:

  1. Compute the pseudo-residual for the i-th output value in the regression tree m, that is,

    r_{i,m} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \ldots, n;    (13)

  2. Fit a regression tree to the residuals r_{i,m} and create terminal regions R_{j,m} for all j = 1, \ldots, J_m. In other words, R_{j,m} represents the j-th leaf of the m-th tree;

  3. For j = 1, \ldots, J_m, compute

    \gamma_{j,m} = \arg\min_{\gamma} \sum_{x_i \in R_{j,m}} L\big(y_i, F_{m-1}(x_i) + \gamma\big),    (14)

where, in this step, the predicted value for each j-th leaf is given by Equation 14. In fact, for each leaf, the predicted value is an average, as observed in Equation 12;

  4. Update F_m(x) (the new prediction for the output) for the m-th regression tree, that is,

    F_m(x) = F_{m-1}(x) + \eta \sum_{j=1}^{J_m} \gamma_{j,m} \, I(x \in R_{j,m}),    (15)

where F_{m-1}(x) is the previous prediction and \eta is the learning rate, used to reduce the sensitivity of the predictions to individual outputs and to damp the effect of each tree on the new prediction. Finally, the summation represents the addition of the new predicted values to the previous \gamma value.

The final prediction is then computed as \hat{y} = F_M(x) (Ribeiro, 2021).
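
A minimal sketch of this procedure using scikit-learn's GradientBoostingRegressor, whose squared-error loss corresponds to Equation 9; the number of trees, learning rate, and tree depth below are illustrative choices, not the settings used in the experiments.

```python
from sklearn.ensemble import GradientBoostingRegressor

# n_estimators = M (number of trees) and learning_rate = η of Equation 15;
# loss="squared_error" matches the regression loss of Equation 9.
gbm = GradientBoostingRegressor(loss="squared_error", n_estimators=100,
                                learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
effort_pred = gbm.predict(X_test)  # the final prediction F_M(x)
```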

2.3.3. eXtreme Gradient Boosting

eXtreme Gradient Boosting (XGBoost) was developed by Chen & Guestrin (2016), incorporating the boosting model proposed by Friedman (2001). XGBoost is a tree-based ML algorithm known for its efficiency, speed, and performance (Jabeur et al., 2024).

The predicted output is given by a sum of individual predictions, computed as follows:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,    (16)

where K is the number of trees, \hat{y}_i is the i-th forecast output, and f_k is a function in the space F. The objective function to be optimized (a minimization problem) in the XGBoost approach is represented by:

obj = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \quad \text{with} \;\; \Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2,    (17)

in which l is a differentiable convex loss function that computes the difference between the forecast output \hat{y}_i and the real value y_i, usually represented by the mean squared error. \Omega(f_k) is the regularization term, \gamma is a threshold for the gain, \lambda is the regularization on the leaf weights, and T is the number of leaves. Considering the predicted value at the s-th step, denoted by \hat{y}_i^{(s)}, the objective function described in Equation 17 can be rewritten as follows:

obj^{(s)} = \sum_{i=1}^{n} \left[ g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \right] + \Omega(f_k),    (18)

where, taking into account the Taylor expansion of the mean squared error up to the second order,

g_i = \partial_{\hat{y}^{(s-1)}} l(y_i, \hat{y}_i^{(s-1)}) \quad \text{and} \quad h_i = \partial^2_{\hat{y}^{(s-1)}} l(y_i, \hat{y}_i^{(s-1)}).    (19)

By reformulating Equation 18, the objective function at step s is computed as follows:

obj^{(s)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T,    (20)

where I_j is the set of indices of observations assigned to the j-th leaf of the tree. The first term of Equation 20 is quadratic in w_j, and the optimal w_j for a given tree structure q(x) is computed as follows:

w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}    (21)

Finally, the objective function can be rewritten as follows:

obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T,    (22)

where the objective function depends only on g_i and h_i (Ribeiro, 2021).
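
A brief sketch with the xgboost library, mapping the regularization terms of Equation 17 onto its parameters; all values shown are illustrative rather than the study's configuration.

```python
from xgboost import XGBRegressor

# reg_lambda = λ (leaf-weight regularization) and gamma = γ (gain threshold)
# of Equation 17; the objective is the squared-error loss. Illustrative values.
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3,
                   reg_lambda=1.0, gamma=0.0, objective="reg:squarederror")
xgb.fit(X_train, y_train)
effort_pred = xgb.predict(X_test)
```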

2.3.4. Random Forest (RF)

Random Forest (RF) is a machine learning method that uses an ensemble of decision trees to improve prediction accuracy and reduce overfitting. Each tree is built using a random sample of the data and a random selection of variables, promoting diversity among the trees. The final prediction is made by averaging individual tree predictions for regression or using majority voting for classification. RF is popular for its robustness, accuracy, and capacity to handle large datasets and noisy features (Breiman, 2001).

Mathematically, the final prediction achieved by RF can be computed as in Equation 23, where \hat{y} is the predicted output, m is the number of trees, and f_k(x) represents the prediction of the k-th tree for the input vector x (Ribeiro, 2021):

\hat{y} = \frac{1}{m} \sum_{k=1}^{m} f_k(x)    (23)
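
The averaging in Equation 23 is exactly what scikit-learn's RandomForestRegressor performs internally, as the brief sketch below shows (settings illustrative):

```python
from sklearn.ensemble import RandomForestRegressor

# n_estimators = m trees of Equation 23; predict() returns their average.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
effort_pred = rf.predict(X_test)
```
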
2.3.5. Extreme Learning Machine (ELM)

The Extreme Learning Machine (ELM) is a supervised learning method for single-hidden-layer feedforward neural networks (SLFN), introduced by Huang et al. (2006). ELM assigns the hidden-layer weights randomly and computes the output weights analytically, which eliminates the need for iterative optimization and significantly reduces training time (Huang et al., 2011).

Mathematically, given a dataset of n observations and m inputs, D = \{(x_i, y_i) \mid x_i \in R^m, y_i \in R\}, where x is the vector of inputs and y is the output, an ensemble model uses an aggregation function G that combines the predictions of K base models f_1(x), \ldots, f_K(x) into a single forecast:

\hat{y} = G\big(f_1(x), \ldots, f_K(x)\big),    (24)

where \hat{y} is the forecast value. Overall, the success of ensemble models is related to the diversity of their base models (Ribeiro, 2021).
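
Because ELM has no standard scikit-learn implementation, a minimal NumPy sketch of a single-hidden-layer ELM is given below; the hidden-layer size and tanh activation are assumptions, since the paper does not report its ELM configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=50):
    """Random input weights; output weights solved analytically (no iteration)."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random, never trained
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```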

2.3.6. Linear Regression (LR)

Linear regression is one of the simplest and most widely used statistical models for understanding the relationship between a dependent variable and one or more independent variables (Freedman, 2009).

Mathematically, the linear regression model can be expressed as:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon    (25)

where y is the dependent variable, x1,x2,...,xp are the independent variables, β0 is the intercept, β1,...,βp are the coefficients, and ε is the error term. The coefficients are estimated using the least squares criterion, which minimizes the residual sum of squares:

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (26)
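
Equations 25 and 26 can be solved directly by ordinary least squares; a short NumPy sketch, assuming X_train is already a numeric matrix, is shown below.

```python
import numpy as np

# Append a column of ones so β0 (the intercept) is estimated jointly.
X_design = np.column_stack([np.ones(len(X_train)), X_train])
# np.linalg.lstsq minimizes the residual sum of squares of Equation 26.
beta, _, _, _ = np.linalg.lstsq(X_design, y_train, rcond=None)
y_pred = X_design @ beta
```
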
2.3.7. Mean Absolute Percentage Error (MAPE)

The Mean Absolute Percentage Error (MAPE) is a widely used metric for evaluating the accuracy of forecasting and regression models. It measures the average absolute difference between predicted and actual values, expressed as a percentage of the actual values (Hyndman & Koehler, 2006). Its formula is given by Equation 27:

MAPE = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|    (27)

where n is the number of observations, y_t is the actual value of observation t, and \hat{y}_t is the corresponding predicted value.

2.4. Performance measures

No single performance measure can fully evaluate the performance of an algorithm; therefore, this study utilized the evaluation measures RMSE, MAE, and R2, complemented by the MAPE defined in Section 2.3.7.

By definition, RMSE is the average distance of a data point from the fitted line, measured along a vertical line, and can be computed using Equation 28 (Chou et al., 2012):

RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (A_i - \tilde{A}_i)^2 }    (28)

According to Al Betar et al. (2023), MAE is a good choice when outliers are not a major concern, as it is a less sensitive measure of accuracy. MAE is given by Equation 29.

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| A_i - \tilde{A}_i \right|    (29)

R2, or the coefficient of determination, is a statistical measure indicating the proportion of the variance in the dependent variable that is explained by the regression model. In simpler terms, R2 measures how well the data fit the regression model (Myers et al., 2012). Its formula is given by Equation 30.

R^2 = 1 - \frac{ \sum_{i=1}^{n} (\tilde{A}_i - A_i)^2 }{ \sum_{i=1}^{n} (\tilde{A}_i - \bar{\tilde{A}})^2 }    (30)

In Equations 28, 29, and 30, \tilde{A}_i is the actual value, A_i is the predicted value, \bar{\tilde{A}} is the mean of the actual values, and n is the number of data samples.
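
A compact sketch implementing Equations 27 to 30 directly, following the convention above that Ã holds the actual values and A the predictions:

```python
import numpy as np

def evaluate(actual, predicted):
    err = actual - predicted
    rmse = np.sqrt(np.mean(err ** 2))                                 # Eq. 28
    mae = np.mean(np.abs(err))                                        # Eq. 29
    r2 = 1 - np.sum(err ** 2) / np.sum((actual - actual.mean()) ** 2)  # Eq. 30
    mape = 100 * np.mean(np.abs(err / actual))                        # Eq. 27
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "MAPE": mape}
```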

2.5. K-fold cross-validation

To evaluate the performance of a model more robustly and reliably, k-fold cross-validation with k = 5 was used. In this process, the dataset is initially randomly divided into approximately equal-sized subsets, or folds, each containing one-fifth of the data. Subsequently, the iteration among the folds begins, with one fold used for testing (validation) and the remaining four folds used for training the model. The model is trained on the four training folds and evaluated on the test fold, generating a performance measure for that specific iteration of cross-validation. This process is repeated five times, with each fold used once as the test set and the others as the training set.
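
A sketch of this 5-fold procedure is shown below; model stands for any of the estimators from Section 2.3, evaluate is the hypothetical metrics helper sketched in Section 2.4, and X and y are assumed to be pandas objects.

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Four folds train the model; the remaining fold is held out for testing.
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[test_idx])
    fold_scores.append(evaluate(y.iloc[test_idx].to_numpy(), pred))
```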

2.6. Proposed framework

The methodology starts with acquiring data from the International ISBSG repository, followed by a data cleaning process to ensure dataset quality and consistency. Then, multiple predictive models, including SVR, ELM, GBM, RF, XGBoost, and LR, are applied, with each model undergoing a 5-fold cross-validation for robust performance evaluation.

After training and testing the models, performance metrics like RMSE, MAE, and R2 are calculated. The observed versus predicted values are compared to assess accuracy both visually and statistically. A T-test is performed to statistically compare the techniques and identify significant differences. Additionally, the importance of each input variable is calculated to assess its influence on predictions.

Finally, the results are compared to identify the most effective model and draw insights from the findings. This process is visually presented in Figure 1.

Figure 1
Proposed framework.

3. Results and discussions

In this section, the results of the SVR, ELM, RF, XGB, GBM, and LR models are presented for the performance metrics RMSE, MAE, and R2, along with the observed versus predicted values, a comparison of the techniques using the T-test, and the importance of the variables.

3.1. Statistical analysis of the data

Table 1 presents a statistical summary for the numerical variables: effort, relative size, and team size. The effort variable, with a mean of 6525.54 and a median of 3926, shows high variability, suggesting that while most projects require modest effort, a few large-scale projects significantly raise the average. Similarly, the relative size variable, with a standard deviation of 910.66, indicates substantial dispersion, with most projects falling between the 1st quartile (170.75) and 3rd quartile (1041.25), but larger projects also exist. The team size variable has a median of 6 members, with most projects involving small teams of 4 to 6 members. The presence of extreme values in effort and relative size highlights the need for robust modeling techniques to manage variability without being overly affected by outliers.

Table 1
Descriptive statistics for numerical variables.

Figure 2 shows box-plots for effort, team size, and functional size, revealing positively skewed distributions with most data concentrated at lower values. Outliers indicate exceptional cases of high effort, large teams, and large functional sizes. The medians are near the lower limits of the boxes, suggesting that most projects have modest effort, team size, and functional size. While the overall dispersion is moderate, the extreme values highlight the need for careful consideration of outliers in the analysis.

Figure 2
Graphical analysis of numerical variables.

The categorical variables used in this study are development platform and language type. Figure 3 presents the analysis of the development platform variable, showing that most projects are developed in MF environments, which account for 62% of the total. PC and Multi environments represent 17% and 16% of projects, respectively, while MR platforms make up only 5%. This highlights the dominance of large-scale environments in the analyzed projects.

Figure 3
Frequency of qualitative variables.

Regarding language type, there is a balance between third-generation languages (3GL), which account for 49% of cases, and fourth-generation languages (4GL), which represent 44%. Languages categorized as ApG are present in 7% of projects. These results suggest that while traditional languages like Java and C dominate, higher-level abstraction languages like SQL also play a significant role.

In conclusion, the data reveals that most projects are concentrated in MF environments and predominantly use third-generation languages.

3.2. Predictive performance of the models using RMSE, MAE, and R2

3.2.1. RMSE

The RMSE measures the dispersion of prediction errors and provides an idea of the magnitude of errors that the model is making in units of the dependent variable. In other words, the lower the RMSE, the better the model's performance in fitting the data.

The analysis of the results, presented in Tables 2 and 3 and Figure 4, showed that the performance of effort prediction models for software projects varied according to the number of variables used. With three variables, the ELM model achieved the lowest root mean square error (RMSE = 5648.579), indicating higher accuracy in this simpler scenario. However, when the number of variables increased to five, ELM was outperformed, with the GBM model delivering the best result (RMSE = 5824.245), closely followed by SVR (RMSE = 5848.167). These findings suggest that GBM has greater robustness and adaptability to increasing data complexity, while ELM performs better in lower-dimensional settings. SVR, in turn, showed sensitivity to the amount of available information, with a significant improvement in performance as more variables were included. Therefore, selecting the most appropriate model should consider not only error metrics but also the quantity and quality of variables in the dataset.

Table 2
Statistical indicators for each of the models throughout the cross-validation task based on RMSE for three variables.
Table 3
Statistical indicators for each of the models throughout the cross-validation task based on RMSE for five variables.
Figure 4
Performance graph of models according to RMSE for three and five variables.
3.2.2. MAE

Similar to RMSE, lower MAE values indicate better model fit. Based on MAE analysis, SVR achieved the best performance in both scenarios, three and five input variables, with the lowest mean absolute errors (2837.984 and 2727.108, respectively). This suggests SVR's consistency and accuracy, particularly as MAE is less sensitive to outliers than RMSE. GBM ranked second with three variables (MAE = 2901.099), while RF held that position with five. The modest reduction in MAE with more input variables indicates a slight benefit from additional information. Overall, SVR demonstrated strong generalization and low average deviation, whereas GBM showed robust performance across metrics. These findings, summarized in Tables 4 and 5 and Figure 5, highlight the complementary value of using both RMSE and MAE for a well-rounded evaluation.

Table 4
Statistical indicators for each of the models throughout the cross-validation task based on MAE for three variables.
Table 5
Statistical indicators for each of the models throughout the cross-validation task based on MAE for five variables.
Figure 5
Performance graph of models according to MAE for three and five variables.
3.2.3. R2

R2 values range from 0 to 1 and indicate how well predicted values align with observed data, with values closer to 1 reflecting better model accuracy. When using three input variables, ELM achieved the highest R2 (0.7059), explaining approximately 70.6% of the variance, followed by SVR (0.6670) and LR (0.6557). In contrast, with five variables, GBM performed best (R2 = 0.7483), followed by RF (0.7220) and SVR (0.6422). These results suggest that ELM excels in lower-dimensional scenarios, GBM adapts well to higher-dimensional data, and SVR, while not achieving the highest R2, shows consistent generalization and strong performance in MAE. Together, the metrics highlight the trade-offs between overall fit and error minimization. The complete results are shown in Tables 6 and 7 and Figure 6.

Table 6
Statistical indicators for each of the models throughout the cross-validation task based on R2 for three variables.
Table 7
Statistical indicators for each of the models throughout the cross-validation task based on R2 for five variables.
Figure 6
Performance graph of models according to R2 for three and five variables.
3.2.4. MAPE

MAPE assesses model performance by expressing prediction errors as percentages. With three input variables, ELM recorded the highest MAPE, indicating poor accuracy, while tree-based models, GBM, RF, and XGBoost, performed significantly better, with RF and XGBoost achieving the lowest and most stable errors. With five variables, most models improved, and XGBoost slightly outperformed the others, while ELM continued to exhibit high MAPE, reinforcing its limited generalization with increased dimensionality. Overall, tree-based ensemble models (especially RF and XGBoost) proved more effective in minimizing relative error. SVR and LR showed moderate results but were consistently outperformed by ensemble approaches. These outcomes are detailed in Tables 8 and 9 and illustrated in Figure 7.

Table 8
Statistical indicators for each of the models throughout the cross-validation task based on MAPE for three variables.
Table 9
Statistical indicators for each of the models throughout the cross-validation task based on MAPE for five variables.
Figure 7
Performance graph of models according to MAPE for three and five variables.

The radar chart (Figure 8) summarizes model performance across RMSE, MAE, R2, and MAPE. ELM excelled in RMSE and R2, indicating good absolute fit and variance explanation, but performed poorly in MAPE, showing high relative error. GBM showed balanced performance, ranking well in RMSE and R2, and moderately in MAE and MAPE, confirming its robustness. LR had weaker results across all metrics, especially in R2, indicating limited explanatory power. SVR led in MAE and did well in MAPE, but underperformed in RMSE and R2, suggesting good average error control but less variance capture. RF delivered strong, consistent performance, particularly in MAPE, making it a reliable and well-rounded option. XGBoost topped MAPE, with strong MAE results, though it ranked lower in RMSE and R2, highlighting its strength in relative accuracy over absolute fit.

Figure 8
Radar chart indicating the position of best and worst results for the models.

To complement RMSE, MAE, and R2, scatter plots of observed vs. predicted effort values were used to assess error distribution and generalization for the top models (ELM, GBM, SVR), based on five-fold cross-validation. In Figure 9, ELM showed high variability across folds, with greater dispersion, particularly for higher effort values, indicating difficulties in capturing complex patterns. This inconsistency, including some large prediction errors, may stem from ELM’s random weight initialization and sensitivity to outliers, which undermine its reliability.

Figure 9
Observed versus predicted values for the ELM model.

Visually, the GBM model showed the most consistent performance among the three models analyzed. The predictions are well aligned with the ideal line across all folds, especially for low and medium effort values. Despite some deviations at extreme values, the model demonstrated low dispersion and good generalization ability, suggesting that it is robust and reliable for the task of effort estimation, as shown in Figure 10.

Figure 10
Observed versus predicted values for the GBM model.

The SVR model showed intermediate performance (Figure 11). Compared to the ELM, the predictions are closer to the ideal line, although with slightly more dispersion than the GBM. The model performed reasonably well across the folds and showed stable performance, with moderate errors in the more extreme cases. Overall, the SVR appears to be a viable alternative, although not as precise as the GBM.

Figure 11
Observed versus predicted values for the SVR model.

Based solely on the visual analysis of the graphs, it can be concluded that the GBM model is the most suitable for effort estimation, followed by the SVR. The ELM, although showing potential, exhibited greater variability and imprecision in its predictions, which limits its practical applicability in this context.

Figure 12 graphically compares the confidence intervals of MAEs for different techniques using an independent samples t-test.

Figure 12
Comparison of techniques using confidence interval.

Through a confidence interval, an inference can be made as follows: if the CI does not contain zero, then H0 is rejected, indicating a statistically significant difference between the groups. Since all intervals cross zero, none of the pairwise differences between the techniques are statistically significant.
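
For reference, the sketch below implements this inference on per-fold MAE values using Welch's t-interval; the two input arrays would hold the cross-validation MAEs of any pair of techniques.

```python
import numpy as np
from scipy import stats

def mae_diff_ci(mae_a, mae_b, alpha=0.05):
    """CI for the difference in mean MAE between two techniques (Welch's t)."""
    a, b = np.asarray(mae_a, float), np.asarray(mae_b, float)
    diff = a.mean() - b.mean()
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    se = np.sqrt(va + vb)
    df = se**4 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))  # Welch df
    t = stats.t.ppf(1 - alpha / 2, df)
    return diff - t * se, diff + t * se  # H0 rejected if zero lies outside
```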

The variable team size showed the highest importance score (0.4109), indicating its strong influence on model predictions, followed by functional size (0.3312), which also contributed significantly. In contrast, language type (0.0123) and development type (0.007) had minimal impact, as illustrated in Figure 13. These findings suggest that project complexity and scale, reflected in team size and functional size, are key drivers of effort estimation. The limited influence of language and development type may stem from low variability in the dataset or indirect effects. The marked difference in variable importance highlights the potential to streamline future models by excluding low-impact features, improving both efficiency and focus. Moreover, this insight can inform data collection priorities by emphasizing the most predictive attributes.

Figure 13
Importance of predictor variables.
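
The paper does not state how the importance scores were computed; permutation importance is one common choice and is sketched below, reusing the hypothetical gbm model and held-out split from the earlier sketches.

```python
from sklearn.inspection import permutation_importance

# Mean drop in score when each predictor is shuffled on held-out data.
result = permutation_importance(gbm, X_test, y_test, n_repeats=30,
                                random_state=0)
for name, score in zip(X_test.columns, result.importances_mean):
    print(f"{name}: {score:.4f}")
```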

4. Conclusion

The comparative analysis between different machine learning models applied to effort prediction in software projects showed that the performance of the algorithms varies significantly depending on the number of variables used and the evaluation metric considered. Among the models analyzed, the GBM stood out for its robustness, consistency, and ability to generalize, achieving the best results in terms of the coefficient of determination (R2) and maintaining competitive performance in both RMSE and MAE. This highlights its suitability for handling more complex scenarios with multiple variables.

The SVR, in turn, showed the lowest MAE values with both three and five variables, indicating excellent performance in terms of mean absolute error. Its stability across different scenarios, despite having lower explanatory power (R2), reinforces its value as an accurate and reliable model, especially when the goal is to minimize average deviations in the estimates.

On the other hand, the ELM demonstrated good performance in contexts with lower dimensionality, especially in scenarios with three variables, but exhibited greater variability and a drop in performance as complexity increased. This suggests that ELM can be useful in situations with fewer attributes and constrained execution time, although its sensitivity to outliers limits its robustness.

Other models, such as RF and LR, showed intermediate performance, while XGBoost achieved the weakest results in terms of RMSE and R2, indicating difficulty in adapting to the analyzed dataset with respect to absolute fit.

The graphical analysis of observed vs. predicted values reinforced these conclusions, with GBM showing the best adherence to the ideal prediction line, followed by SVR. Although promising, the ELM exhibited greater dispersion and inconsistency across the folds. Additionally, the statistical analysis through the T-test indicated that the differences between the models, although evident in the metrics, were not statistically significant within the adopted confidence intervals.

Finally, the variable importance analysis highlighted that factors related to Team Size (0.4109) and Functional Size (0.3312) were the most influential predictors in the model, whereas Language Type (0.0123) and Development Type (0.007) had minimal impact. This indicates that team-related and functional aspects are key determinants of model performance. The findings suggest that less relevant variables may be excluded in future analyses to enhance model efficiency and guide data collection priorities.

Therefore, it is concluded that the choice of the ideal model should consider not only performance metrics but also the number and nature of the available variables. GBM stands out as the best overall alternative, especially in more complex contexts, while SVR offers an effective solution for minimizing mean error, and ELM can be advantageous in scenarios with fewer variables and computational constraints.

Although the results obtained are promising, this study has some limitations that should be considered. The main limitation relates to the quantity and variety of data available, which may have restricted the models' ability to capture more complex nuances of development effort. The inclusion of additional predictive variables, such as team experience, software quality, number of reported defects, user satisfaction level, and especially the total project cost, could enhance the robustness and explanatory power of the models. Therefore, we suggest that future studies explore the incorporation of these attributes, if available, to enrich the analysis.

Another limitation is related to the difficulty of making direct comparisons with other studies in the literature, as many use different data subsets or specific approaches for distinct sectors. To mitigate this issue, we recommend that future research test the proposed model using databases with similar characteristics, which could enable more consistent comparisons between techniques and improve the generalization of the findings presented here.

Acknowledgements

The authors wish to thank the Federal University of Technology – Paraná, Pato Branco Campus, for its support throughout this research, as well as the professors whose valuable guidance and contributions were essential to the progress of this study.

Data availability

Research data is only available upon request.

  • How to cite this article:
    Lenhart, D., Ribeiro, M. H., Trojan, F. (2025). Effort estimation for software products targeted at the manufacturing sector using machine learning algorithms. Production, 35, e20240092. DOI: https://doi.org/10.1590/0103-6513.20240092
  • Financial Support
    Author Ribeiro thanks the Araucária Foundation (Grant number: PRD2023361000550) for its financial support of this work.
  • Ethical Statement
    This research did not involve experiments with human participants, nor the collection of personal or sensitive data. Therefore, ethical approval and informed consent were not required.

References

  • Al-Betar, M. A., Kassaymeh, S., Makhadmeh, S. N., Fraihat, S., & Abdullah, S. (2023). Feedforward neural network-based augmented salp swarm optimizer for accurate software development cost forecasting. Applied Soft Computing, 149, 111008. http://doi.org/10.1016/j.asoc.2023.111008
    » http://doi.org/10.1016/j.asoc.2023.111008
  • Bontempi, G., Ben Taieb, S., & Le Borgne, Y. A. (2013). Machine learning strategies for time series forecasting. In: M.-A. Aufaure, E. Zimányi (Eds.), Business intelligence (pp. 62-77). Berlin: Springer. http://doi.org/10.1007/978-3-642-36318-4_3
    » http://doi.org/10.1007/978-3-642-36318-4_3
  • Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis: forecasting and control. Hoboken: John Wiley & Sons.
  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. http://doi.org/10.1023/A:1010933404324
    » http://doi.org/10.1023/A:1010933404324
  • Chen, T., & Guestrin, C. (2016). Xgboost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). New York: Association for Computing Machinery. http://doi.org/10.1145/2939672.2939785
    » http://doi.org/10.1145/2939672.2939785
  • Chou, J. S., Cheng, M. Y., Wu, Y. W., & Wu, C. C. (2012). Forecasting enterprise resource planning software effort using evolutionary support vector machine inference model. International Journal of Project Management, 30(8), 967-977. http://doi.org/10.1016/j.ijproman.2012.02.003
    » http://doi.org/10.1016/j.ijproman.2012.02.003
  • Freedman, D. A. (2009). Statistical models: theory and practice. Cambridge: Cambridge University Press. http://doi.org/10.1017/CBO9780511815867
    » http://doi.org/10.1017/CBO9780511815867
  • Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189-1232. http://doi.org/10.1214/aos/1013203451
    » http://doi.org/10.1214/aos/1013203451
  • Huang, G.-B., Wang, D. H., & Lan, Y. (2011). Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2), 107-122. http://doi.org/10.1007/s13042-011-0019-y
    » http://doi.org/10.1007/s13042-011-0019-y
  • Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2006). Extreme learning machine: theory and applications. Neurocomputing, 70(1–3), 489-501. http://doi.org/10.1016/j.neucom.2005.12.126
    » http://doi.org/10.1016/j.neucom.2005.12.126
  • Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688. http://doi.org/10.1016/j.ijforecast.2006.03.001
    » http://doi.org/10.1016/j.ijforecast.2006.03.001
  • Jabeur, S. B., Mefteh-Wali, S., & Viviani, J. L. (2024). Forecasting gold price with the XGBoost algorithm and SHAP interaction values. Annals of Operations Research, 334(1), 679-699. http://doi.org/10.1007/s10479-021-04187-w
    » http://doi.org/10.1007/s10479-021-04187-w
  • Jiang, J., Chen, Z., Wang, Y., Peng, T., Zhu, S., & Shi, L. (2019). Parameter estimation for PMSM based on a back propagation neural network optimized by chaotic artificial fish swarm algorithm. International Journal of Computers, Communications & Control, 14(6), 615-632. http://doi.org/10.15837/ijccc.2019.6.3705
    » http://doi.org/10.15837/ijccc.2019.6.3705
  • Kassaymeh, S., Alweshah, M., Al-Betar, M. A., Hammouri, A. I., & Al-Ma’aitah, M. A. (2024). Software effort estimation modeling and fully connected artificial neural network optimization using soft computing techniques. Cluster Computing, 27(1), 737-760. http://doi.org/10.1007/s10586-023-03979-y
    » http://doi.org/10.1007/s10586-023-03979-y
  • Kaushik, A., Tayal, D. K., & Yadav, K. (2020). A comparative analysis on effort estimation for agile and non-agile software projects using DBN-ALO. Arabian Journal for Science and Engineering, 45(4), 2605-2618. http://doi.org/10.1007/s13369-019-04250-6
    » http://doi.org/10.1007/s13369-019-04250-6
  • Kumar, P. S., Behera, H. S., Kumari, A., Nayak, J., & Naik, B. (2020). Advancement from neural networks to deep learning in software effort estimation: perspective of two decades. Computer Science Review, 38, 100288. http://doi.org/10.1016/j.cosrev.2020.100288
    » http://doi.org/10.1016/j.cosrev.2020.100288
  • Lavingia, K., Patel, R., Patel, V., & Lavingia, A. (2024). Software effort estimation using machine learning algorithms. Scalable Computing: Practice and Experience, 25(2), 1276-1285. http://doi.org/10.12694/scpe.v25i2.2213
    » http://doi.org/10.12694/scpe.v25i2.2213
  • López-Martín, C. (2022). Machine learning techniques for software testing effort prediction. Software Quality Journal, 30(1), 65-100. http://doi.org/10.1007/s11219-020-09545-8
    » http://doi.org/10.1007/s11219-020-09545-8
  • Myers, R. H., Montgomery, D. C., Vining, G. G., & Robinson, T. J. (2012). Generalized linear models: with applications in engineering and the sciences. Hoboken: John Wiley & Sons.
  • Rahman, M., Roy, P. P., Ali, M., Gonçalves, T., & Sarwar, H. (2023). Software effort estimation using machine learning technique. International Journal of Advanced Computer Science and Applications, 14(4), 822-827. http://doi.org/10.14569/IJACSA.2023.0140491
    » http://doi.org/10.14569/IJACSA.2023.0140491
  • Rankovic, N., Rankovic, D., Ivanovic, M., & Lazic, L. (2021). A new approach to software effort estimation using different artificial neural network architectures and Taguchi orthogonal arrays. IEEE Access: Practical Innovations, Open Solutions, 9, 26926-26936. http://doi.org/10.1109/ACCESS.2021.3057807
    » http://doi.org/10.1109/ACCESS.2021.3057807
  • Rao, K. E., Terlapu, P. R. V., Naidu, P. A., Kumar, T. R., & Pydi, B. M. (2024). Feature importance for software development effort estimation using multi level ensemble approaches. Bulletin of Electrical Engineering and Informatics, 13(2), 1090-1102. http://doi.org/10.11591/eei.v13i2.5531
    » http://doi.org/10.11591/eei.v13i2.5531
  • Ribeiro, M. H. D. M. (2021). Time series forecasting based on ensemble learning methods applied to agribusiness, epidemiology, energy demand, and renewable energy (Doctoral dissertation). Pontifícia Universidade Católica do Paraná, Curitiba.
  • Sharma, S., & Vijayvargiya, S. (2020). Applying soft computing techniques for software project effort estimation modelling. In Nanoelectronics, Circuits and Communication Systems: Proceedings of NCCS 2019 (pp. 211-227). Singapore: Springer.
  • Van Hai, V., Nhung, H. L. T. K., Prokopova, Z., Silhavy, R., & Silhavy, P. (2022a). Toward improving the efficiency of software development effort estimation via clustering analysis. IEEE Access: Practical Innovations, Open Solutions, 10, 83249-83264. http://doi.org/10.1109/ACCESS.2022.3185393
    » http://doi.org/10.1109/ACCESS.2022.3185393
  • Varshini, A. G. P., & Kumari, K. A. (2024). Software effort estimation using stacked ensemble technique and hybrid principal component regression and multivariate adaptive regression splines. Wireless Personal Communications, 134(4), 2259-2278. http://doi.org/10.1007/s11277-024-11010-9
    » http://doi.org/10.1007/s11277-024-11010-9

Edited by

  • Editor(s)
    Adriana Leiras

Publication Dates

  • Publication in this collection
    17 Nov 2025
  • Date of issue
    2025

History

  • Received
    05 Sept 2024
  • Accepted
    22 Sept 2025
Creative Common - by 4.0
This is an Open Access article distributed under the terms of the Creative Commons Attribution license (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.