
Universal Approximators: New Approach for the Curve Fit of the COVID-19 Infected Population

ABSTRACT

Fuzzy systems that use the Takagi-Sugeno inference method with linear outputs are widely known to uniformly approximate any polynomial with high precision and, as a consequence of the Weierstrass approximation theorem, any continuous function. A further advantage of these methods is that they yield an explicit expression of the defuzzified output as a function of the system’s inputs. The purpose of this study is to describe the dynamics of a collected data set through the behavior of the tangential envelope and the local concavity of the curve to be fitted. The functions that define the envelope and its concavity are identified by means of a hybrid system that combines fuzzy clustering with the Takagi-Sugeno inference method. The analyzed data set is the worldwide number of confirmed cases of COVID-19, the infectious disease caused by the severe acute respiratory syndrome coronavirus. The proposed fuzzy method, in first-order and second-order versions, is compared with curves built by the least squares method, using as accuracy measure the maximum absolute difference between the fitted values and the data, both normalized at each instant. In these comparisons, both fuzzy approaches match the collected data better than the least squares curves, with the second-order fuzzy approximation being the best of all.

Keywords:
mathematical modeling; fuzzy clustering; Takagi-Sugeno inference method; regression analysis

1 INTRODUCTION

The universal approximation property is the basis for theoretical research and practical applications of fuzzy systems, particularly those using the Takagi-Sugeno (TS) inference method. There is currently a line of research in this direction that includes a wide range of investigations on the accuracy of TS fuzzy systems [1], [2], [6], [9].

The method proposed in this work is based on a fuzzy identification technique whose process has two components: fuzzy clustering and TS fuzzy inference [8]. The collected data set is organized into inputs and an output for the first stage of the method, which clusters the data set by fuzzy similarity. The Gustafson and Kessel algorithm is chosen for this purpose [4]. This stage determines the fuzzy sets that are the antecedents of the TS fuzzy inference. The second stage provides a defuzzified output, which is the explicit function that fits the collected data. The data set comes with two inputs: the time t, where each unit represents one day, and x(t), the number of people infected by a disease. The data set is reorganized by adding one output, calculated as the first-order finite variation of x(t), and the fuzzy method is applied to obtain the envelope fit curve. For the concavity fit curve, the inputs are the same and the added output is the second-order finite variation. In short, the forward and central finite difference methods [3] are used to organize the data set for the fuzzy identification.

The pandemic known as COVID-19 (CO for corona, VI for virus, D for disease, and 19 for the year it was discovered) began with an outbreak reported in China on December 31, 2019. The data have been recorded daily on the Worldometer portal [10]. The data set collected for this study, which provides the confirmed numbers of people infected by the coronavirus, is used to obtain a curve that interprets the dynamics of the spread of the infection. Samples for this investigation were collected from January 22 to November 15, 2020.

In order to measure the precision of the fuzzy identification methodology, the result obtained by the proposed method is compared with fit curves of polynomial, exponential, Gaussian, and power type, obtained through the least squares method [3]. Exponential and Gaussian curves are the most widely accepted for interpreting the dynamics of the data collected from January to November 2020 [7]. To compare the accuracy of the methods, the maximum of the absolute value of the difference between the fit curve values and the data, both normalized at each instant, is used.

This work is organized as follows: Section 2 explains the theoretical foundations of the methodology; Section 3 develops the methodology; Section 4 presents the results; and Section 5 concludes with final considerations.

2 THEORETICAL BACKGROUND

In general, clustering is an unsupervised classification of data, resulting in groups of elements called clusters. The purpose of this technique is to organize the patterns represented by vectors or points in the multidimensional space, according to a mathematical measure of similarity. Fuzzy clustering allows the elements of the data to belong to all clusters simultaneously with different membership degrees. There are many fuzzy clustering techniques; this study uses the algorithm developed by Gustafson and Kessel (GK) [4]. At the beginning of the GK algorithm the following elements are considered:

  • Euclidean distance as a measure of similarity;

  • an initial set of cluster centers, chosen from the equally distributed data set;

  • a specific number of clusters, m, and a tolerance, tol, as a stopping criterion.

The GK algorithm stands out for changing the geometry of the cluster in each iteration, interpreting the similarities of the clusters more precisely.

Let $z_i = (t_i, x(t_i), s(t_i))$, $i = 1, \dots, N$, be a given data set that has been clustered, where $(t_i, x(t_i))$ are the inputs and $s(t_i)$ is the output. As a consequence of this process, four fundamental elements are obtained for the construction of the TS inference method, the next step of the methodology. These elements are:

  • the centers of the clusters, $v_j$, $j = 1, 2, \dots, m$, which will be the centers of the Gaussian-type membership functions corresponding to the antecedents of the TS inference method;

  • $\mu_{ij}$, the entries of the membership matrix, which contains the membership degree of each element $z_i = (t_i, x(t_i), s(t_i))$, $i = 1, \dots, N$, to each cluster;

  • the projections of the highest membership levels of each cluster onto the t axis, that is, the sets given by

$$A_j = \{\, t_i : \mu_{ij}(t_i, x(t_i), s(t_i)) \geq \alpha \,\}, \quad j = 1, \dots, m,$$

    where α is determined in the interval [0.5, 1[ by an optimization process in order to approximate the points of the projected cluster to a Gaussian function [5];

  • the standard deviation of the Gaussian membership function, given by $\sigma_j = \beta\,(\max A_j - \min A_j)$, where β is determined by the same optimization process.
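As a minimal illustration of the last two items, the following Python sketch computes a projected set $A_j$ and the spread $\sigma_j$ for one cluster. It assumes that $A_j$ gathers the days whose membership degree is at least α. The function names, the membership values, and the values of α and β are hypothetical and are not taken from the study, where α and β result from an optimization process.

```python
def alpha_cut_projection(times, memberships, alpha):
    """Return the times whose membership degree in one cluster is >= alpha."""
    return [t for t, mu in zip(times, memberships) if mu >= alpha]


def gaussian_sigma(projected_times, beta):
    """Spread of the Gaussian antecedent: beta * (max A_j - min A_j)."""
    if not projected_times:
        raise ValueError("empty projection; lower alpha or check the memberships")
    return beta * (max(projected_times) - min(projected_times))


# Made-up membership degrees of six days in a single cluster (illustration only).
times = [1, 2, 3, 4, 5, 6]
memberships = [0.10, 0.55, 0.95, 0.97, 0.60, 0.20]

A_j = alpha_cut_projection(times, memberships, alpha=0.5)
sigma_j = gaussian_sigma(A_j, beta=0.5)
print(A_j, sigma_j)
```

A larger α keeps fewer days in $A_j$ and therefore narrows the Gaussian, while β rescales the resulting width.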

These four elements are essential for building the antecedents of the fuzzy TS inference method. The consequents of this inference are obtained by means of a local multivariate regression of the data outputs. The fuzzy rules are then established by:

Rule j: If t is $A_j$ then $D_j(t) = \theta_{j0} + \theta_{j1}\,t + \theta_{j2}\,x(t)$.

The final step is the defuzzification of the TS inference method, which is a weighted average of the rule outputs $D_j(t)$, with the membership degrees of the antecedents as weights. This final step provides an explicit expression δ(t) that relates the data inputs to the data outputs, given by:

$$\delta(t) = \frac{\displaystyle\sum_{j=1}^{m} \exp\!\left(-\frac{(t - v_j)^2}{2\sigma_j^2}\right)\left(\theta_{j0} + \theta_{j1}\,t + \theta_{j2}\,x(t)\right)}{\displaystyle\sum_{j=1}^{m} \exp\!\left(-\frac{(t - v_j)^2}{2\sigma_j^2}\right)}. \qquad (2.1)$$
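Equation (2.1) can be evaluated directly once the centers, spreads, and consequent coefficients are known. The sketch below is only an illustration: the function name `ts_output` and the two-rule parameters in the example are assumptions, not the values identified in this study.

```python
import math


def ts_output(t, x_t, centers, sigmas, thetas):
    """Evaluate the defuzzified Takagi-Sugeno output delta(t) of Equation (2.1).

    centers[j], sigmas[j]: center v_j and spread sigma_j of the j-th Gaussian antecedent.
    thetas[j]: tuple (theta_j0, theta_j1, theta_j2) of the j-th linear consequent.
    """
    numerator = 0.0
    denominator = 0.0
    for v, sigma, (th0, th1, th2) in zip(centers, sigmas, thetas):
        # Gaussian membership degree of t in the j-th antecedent.
        weight = math.exp(-((t - v) ** 2) / (2.0 * sigma ** 2))
        numerator += weight * (th0 + th1 * t + th2 * x_t)
        denominator += weight
    return numerator / denominator


# Two hypothetical rules with constant consequents 1 and 3, centered at t = 0 and t = 10.
# Halfway between the centers both rules fire equally, so the output is their mean.
value = ts_output(5.0, 0.0, centers=[0.0, 10.0], sigmas=[2.0, 2.0],
                  thetas=[(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)])
print(value)
```

Far from every center all weights can underflow to zero, so a practical implementation would need to guard the division.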

3 CURVE FITTING THROUGH FUZZY IDENTIFICATION METHODOLOGY

In the Worldometer data set [10], $\{(t_i, x(t_i))\}_{i = 1, 2, \dots, N}$, N = 300 represents the total number of days from January 22 to November 15, and $x(t_i)$ is the number of confirmed infected individuals in the world at time $t_i$. These data are organized in two matrices with three columns each, $D_f = (t_i, x(t_i), \Delta_f x(t_i))$ and $D_c = (t_i, x(t_i), \Delta_c x(t_i))$, $i = 1, 2, \dots, N$, where:

  • $t_i$ is the first column of each matrix. The data are provided daily, so each $t_i$ represents one day.

  • $x(t_i)$ is the second column of both matrices, representing the number of infected people on day $t_i$.

  • $\Delta_f x(t_i) = x(t_{i+1}) - x(t_i)$ is the third column of the matrix $D_f$, representing the daily variation for a unit step. The last row is filled with the same value as the penultimate row, on the grounds that the difference between consecutive entries of the data set can be considered minimal for one day of the disease’s spread. In summary, the forward finite difference method [3] has been applied.

  • $\Delta_c x(t_i) = x(t_{i+1}) - 2x(t_i) + x(t_{i-1})$ is the third column of the matrix $D_c$, representing the variation of the variation. For the same reason, the first row is filled with the result for i = 2 and the last row is filled with the same value as the penultimate row. The method is the central finite difference of order two [3].
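The construction of the third columns of the two matrices can be sketched as follows. The function names and the toy series are hypothetical; only the difference formulas and the way the boundary rows are filled follow the description above.

```python
def forward_difference(x):
    """x(t_{i+1}) - x(t_i) for each day; the last row repeats the penultimate value."""
    if len(x) < 2:
        raise ValueError("need at least two observations")
    diffs = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    diffs.append(diffs[-1])
    return diffs


def central_second_difference(x):
    """x(t_{i+1}) - 2 x(t_i) + x(t_{i-1}); the first and last rows copy their neighbours."""
    if len(x) < 3:
        raise ValueError("need at least three observations")
    inner = [x[i + 1] - 2 * x[i] + x[i - 1] for i in range(1, len(x) - 1)]
    return [inner[0]] + inner + [inner[-1]]


# Toy cumulative counts (not the Worldometer data), one value per day.
x = [1, 4, 9, 16, 25]
t = list(range(1, len(x) + 1))

D_f = list(zip(t, x, forward_difference(x)))
D_c = list(zip(t, x, central_second_difference(x)))
print(D_f)
print(D_c)
```

Both matrices keep N rows, so the same inputs $(t_i, x(t_i))$ can be paired with either output in the identification step.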

Using the fuzzy identification method [8], the following parameters are obtained for each data set:

  • $D_f$: m = 6 for the number of clusters, tol = 0.005, α = 0.93 and β = 0.37.

  • $D_c$: m = 6 for the number of clusters, tol = 0.6, α = 0.5 and β = 0.85.

As a result, two functions are determined through Equation (2.1), denoted by $\delta_f(t) = \delta_f(t, x(t))$ and $\delta_c(t) = \delta_c(t, x(t))$. The function $\delta_f$ represents the values of the tangential envelope of the fit curve to be determined, and $\delta_c$ represents its local concavity. The parameters of the functions $\delta_f$ and $\delta_c$ are detailed in Table 1 and Table 2, respectively.

Table 1:
Elements of the function $\delta_f$ of Equation (2.1).

Table 2:
Elements of the function $\delta_c$ of Equation (2.1).

The function $\delta_f$ that fits the data $D_f = (t_i, x(t_i), \Delta_f x(t_i))$ is shown in Figure 1 (a) along with the data, and Figure 1 (b) shows the data $D_c = (t_i, x(t_i), \Delta_c x(t_i))$ and the graph of $\delta_c$.

Figure 1:
The functions $\delta_f(t)$ and $\delta_c(t)$.

The steps of the algorithm that computes the formula approximating the values of $x(t_i)$, which involves the functions $\delta_f(t)$ and $\delta_c(t)$, are detailed next. Assume that the curve g(t) that is going to fit the data is smooth enough for the Taylor polynomials used below to exist on an open interval around the observation days, and define $g(t_1) = x(t_1)$, where $x(t_1)$ is the first data value, corresponding to the number of infected individuals at t = 1. The following steps lead to the explicit formula for the first-order approximation of g(t):

  • Step 1: Consider the first-degree Taylor polynomial of g(t) at the point $t_1$, denoted by p(t), with the expression $p(t) = g(t_1) + g'(t_1)(t - t_1)$. Since $\delta_f(t)$ is the curve representing the derivative of g(t), an approximation for the value $x(t_2)$ can be calculated as follows:

$$p(t_2) = g(t_1) + g'(t_1)(t_2 - t_1) \approx x(t_1) + \delta_f(t_1)(t_2 - t_1) = x(t_1) + \delta_f(t_1), \qquad (3.1)$$

    recalling that $t_{i+1} - t_i = 1$ for all $i = 1, \dots, N - 1$ in the collected data set. Therefore:

$$x(t_2) \approx x(t_1) + \delta_f(t_1). \qquad (3.2)$$

  • Step 2: An approximation for $x(t_3)$ is calculated in the same manner, using the first-degree Taylor polynomial of g(t) at $t_2$, to obtain:

$$x(t_3) \approx x(t_2) + \delta_f(t_2). \qquad (3.3)$$

    Thus, substituting (3.2) into (3.3), the approximate value for $x(t_3)$ follows the formula:

$$x(t_3) \approx x(t_1) + \delta_f(t_1) + \delta_f(t_2).$$

  • Step k: Proceeding inductively over the subsequent observation times, one concludes:

$$x(t_k) \approx x(t_1) + \sum_{i=1}^{k-1} \delta_f(t_i), \quad k = 2, \dots, N. \qquad (3.4)$$
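Because consecutive observations are one day apart, Equation (3.4) amounts to a running sum of the values of $\delta_f$ started at $x(t_1)$. A minimal sketch, assuming the values $\delta_f(t_1), \dots, \delta_f(t_N)$ have already been evaluated and stored in a list (the function name and the numbers are hypothetical):

```python
from itertools import accumulate


def first_order_fit(x1, delta_f_values):
    """Approximations of x(t_1), ..., x(t_N) according to Equation (3.4).

    delta_f_values[i] holds delta_f(t_{i+1}) in 0-based indexing. The last entry
    is not used, because the sum in (3.4) stops at index k - 1.
    """
    # accumulate(..., initial=...) requires Python 3.8 or later.
    return list(accumulate(delta_f_values[:-1], initial=x1))


# Toy values: x(t_1) = 10 and four evaluations of delta_f.
fitted = first_order_fit(10, [1, 2, 3, 4])
print(fitted)
```

Each fitted value reuses the previous one, so the whole curve is obtained in a single pass over the data.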

It is noteworthy that the formula of Equation (3.4) can be extended to calculate intermediate values or for prediction purposes. For instance, given $t \in \,]t_{k-1}, t_k[$, the value of g(t) is calculated as

$$g(t) = x(t_1) + \sum_{i=1}^{k-1} \delta_f(t_i) + \delta_f(t)\,(t - t_{k-1}). \qquad (3.5)$$

Following the same reasoning with the second-degree Taylor polynomial, another approximation for $x(t_k)$ is obtained:

$$x(t_k) \approx x(t_1) + \sum_{i=1}^{k-1} \left[ \delta_f(t_i) + \frac{1}{2}\,\delta_c(t_i) \right], \quad k = 2, \dots, N. \qquad (3.6)$$
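With a unit step, the second-degree Taylor polynomial at $t_i$ contributes the first-order term plus half of the second-order term, which is the increment accumulated in Equation (3.6). The following sketch mirrors that recursion; the function name and the toy values are assumptions made for illustration.

```python
def second_order_fit(x1, delta_f_values, delta_c_values):
    """Approximations of x(t_1), ..., x(t_N) according to Equation (3.6).

    Both lists hold one value per observation day. Their last entries are not
    used, because the sum in (3.6) stops at index k - 1.
    """
    if len(delta_f_values) != len(delta_c_values):
        raise ValueError("delta_f and delta_c must have the same length")
    fitted = [x1]
    for d_f, d_c in zip(delta_f_values[:-1], delta_c_values[:-1]):
        # Unit-step Taylor increment: g'(t_i) * 1 + g''(t_i) * 1**2 / 2.
        fitted.append(fitted[-1] + d_f + 0.5 * d_c)
    return fitted


# Toy values: constant slope 1 and constant concavity 2.
fitted_2 = second_order_fit(0.0, [1.0, 1.0, 1.0], [2.0, 2.0, 2.0])
print(fitted_2)
```

Setting every value of $\delta_c$ to zero recovers the first-order approximation of Equation (3.4).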

As with the first-order case, formula (3.6) can be extended to intermediate points $t \in \,]t_{k-1}, t_k[$ in a manner similar to (3.5), namely

$$g_2(t) = x(t_1) + \sum_{i=1}^{k-1} \delta_f(t_i)\,(t - t_{k-1}) + \frac{\delta_c(t)}{2}\,(t - t_{k-1})^2, \qquad (3.7)$$

where $g_2(t)$ is the second-order fuzzy approximation for the collected data. The flowchart of the whole procedure is shown in Figure 2.

Figure 2:
Flowchart of the fuzzy fit curves methodology.

Another use of formulas (3.4), (3.6) and (3.5), (3.7) is to predict future values or to estimate past values of the collected data. This investigation is in progress. Furthermore, if the smoothness of the candidate curve for fitting the observed data is known, the same formulas can be generalized to degree r of smoothness.

4 RESULTS

The methodology described in Section 3 is applied to the data set of confirmed numbers of people infected by COVID-19. The resulting curves are compared with those obtained through the following types of regression:

  • 1. a polynomial fit of degree 4 given by the function

$$a_1(t) = -0.0004776\,t^4 + 0.94\,t^3 + 424.9\,t^2 - 2.372 \cdot 10^4\,t + 2.575 \cdot 10^5;$$

  • 2. an exponential fit given by the function

$$a_2(t) = 1.525 \cdot 10^6 \exp(0.01218\,t);$$

  • 3. a Gaussian fit given by the function

$$a_3(t) = 6.655 \cdot 10^7 \exp\!\left(-\left(\frac{t - 384.3}{164.6}\right)^{2}\right);$$

  • 4. a power fit given by the function

$$a_4(t) = 28.67\,t^{2.531}.$$

These curves $a_l$, $l = 1, 2, 3, 4$, are obtained using the least squares method [3] for the collected data.
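For reference, the four regression curves can be evaluated with the coefficients reported above. The sketch assumes that the Gaussian fit has the form $c \exp(-((t - b)/w)^2)$, which is a reading of the printed expression; the function names are illustrative.

```python
import math


def a1(t):
    """Degree-4 polynomial fit."""
    return -0.0004776 * t**4 + 0.94 * t**3 + 424.9 * t**2 - 2.372e4 * t + 2.575e5


def a2(t):
    """Exponential fit."""
    return 1.525e6 * math.exp(0.01218 * t)


def a3(t):
    """Gaussian fit, assumed to be c * exp(-((t - b) / w) ** 2)."""
    return 6.655e7 * math.exp(-(((t - 384.3) / 164.6) ** 2))


def a4(t):
    """Power fit."""
    return 28.67 * t**2.531


# Values of the four curves on the last day of the sample, t = 300.
for name, curve in (("a1", a1), ("a2", a2), ("a3", a3), ("a4", a4)):
    print(name, round(curve(300)))
```

On the last day all four curves return values of the order of $5 \cdot 10^7$, the same order of magnitude as the largest number of cases in the data set.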

To compare the result of the proposed fuzzy curve fit method with the curves $a_l$, $l = 1, 2, 3, 4$, the following accuracy measure is chosen:

$$\text{Error} = \max_{t_i} \left| \frac{x(t_i)}{M_I} - \frac{a_l(t_i)}{M_I} \right|, \qquad (4.1)$$

where

$$M_I = \max_{t_i} x(t_i) = 55{,}388{,}299,$$

that is, the values $x(t_i)$ and $a_l(t_i)$ are normalized. The comparison result is shown in Table 3. The lowest value of the Error of Equation (4.1) corresponds to the second-order fuzzy curve fit methodology presented in this work, followed by the first-order curve constructed through the same methodology.
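The accuracy measure of Equation (4.1) is straightforward to compute for any fitted curve. The sketch below uses made-up numbers rather than the study's data, and the function name is hypothetical.

```python
def normalized_max_error(observed, fitted):
    """Largest absolute gap between data and fit after dividing both by max(observed)."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    m_i = max(observed)  # plays the role of M_I in Equation (4.1)
    return max(abs(x / m_i - a / m_i) for x, a in zip(observed, fitted))


# Made-up observations and fitted values for three days.
observed = [10.0, 20.0, 40.0]
fitted = [12.0, 18.0, 40.0]
error = normalized_max_error(observed, fitted)
print(error)
```

Since both series are divided by the same constant, the measure equals the largest absolute residual expressed as a fraction of the peak of the data.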

Table 3:
Comparison of the goodness of fit resulting from the measure (4.1) under the different methodologies.

Figure 3 shows the graphs of the curves $a_l$, $l = 1, 2, 3, 4$, along with the graphs of the first-order and second-order functions obtained by the new approach developed in this research.

Figure 3:
Comparison of the methodologies with the data collected.

It is worth noting that the complete procedure of clustering and fuzzy inference, in both cases (first-order and second-order approximation), takes between 0.009085 and 0.015856 seconds on a computer with an Intel i7 processor, 60 GB of RAM, and a 256 GB SSD. The computational effort is therefore negligible for a standard computer.

5 CONCLUSION

A new fuzzy approach to curve fitting is presented, with promising results compared to the classic least squares method for four types of curves. The comparison is made on data corresponding to the confirmed numbers of individuals infected by COVID-19 in the world, using a chosen accuracy measure. Two types of fuzzy approximations are provided, representing the tangential envelope and the local concavity of the curve that fits the collected data. The smallest errors in the comparative study are obtained by the two curves built through the fuzzy identification system. This result is due to the intrinsic characteristics of the fuzzy methodology, in which the inference method applied is Takagi-Sugeno. These systems are proven to be universal approximators, in the sense that they can approximate continuous functions with high precision, in addition to providing an explicit continuous function as the defuzzified output. As future work, the use of this methodology as a tool for prediction modeling is in progress.

REFERENCES

  • 1
    K. Bart. Fuzzy Systems as Universal Approximators. Trans. on Computers, 43(1994), 1153-1162.
  • 2
    J.J. Buckley. Sugeno type controllers are universal controllers. Fuzzy Sets and Systems, 53(3) (1993), 299-303.
  • 3
    R.L. Burden & J.D. Faires. “Numerical Analysis”. Brooks/Cole Cengage Learning, United States, 9 ed. (2011).
  • 4
    D.E. Gustafson & W. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In “Proceedings of the IEEE Control and Decision Conference” (1979), p. 761-766.
  • 5
    R.M. Jafelice & A.M.A. Bertone. “Biological Models via Interval Type-2 Fuzzy Sets”. Springer Briefs in Mathematics (2020).
  • 6
    J. Kim, K. Koo & J. Lee. Monotonic fuzzy systems as universal approximators for monotonic functions. Intell. Autom. Soft Comput., 18(1) (2012), 13-31.
  • 7
    E.Z. Martinez, D.C. Aragon & A.A. Nunes. Short-term forecasting of daily COVID-19 cases in Brazil by using the Holt’s model. Revista da Sociedade Brasileira de Medicina Tropical, 53 (2020). URL https://preprints.scielo.org/index.php/scielo/preprint/view/667
  • 8
    J.B. Martins, A.M.A. Bertone & K. Yamanaka. Novel Fuzzy System Identification: Comparative Study and Application for Data Forecasting. IEEE Latin America Transactions, 17 (2019), 1793-1799.
  • 9
    L. Wang & J. Mendel. Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. Neural Netw., 3(5) (1992), 807-814.
  • 10
    Worldometer (2020). URL https://www.worldometers.info/coronavirus

Publication Dates

  • Publication in this collection
    27 June 2022
  • Date of issue
    May-Aug 2022

History

  • Received
    10 Dec 2020
  • Accepted
    06 Sept 2021
Sociedade Brasileira de Matemática Aplicada e Computacional - SBMAC Rua Maestro João Seppe, nº. 900, 16º. andar - Sala 163, Cep: 13561-120 - SP / São Carlos - Brasil, +55 (16) 3412-9752 - São Carlos - SP - Brazil
E-mail: sbmac@sbmac.org.br