Chemoface: a Novel Free User-Friendly Interface for Chemometrics

A software package for multivariate analysis was developed to provide a free computational tool with a user-friendly graphical interface for researchers, professors and students interested in chemometrics. Chemoface has modules capable of solving problems related to experimental design, pattern recognition, classification and multivariate calibration. A variety of graphs and tables can be obtained to explore the results. In this work, the main features of Chemoface are explored using case studies reported in the literature, such as optimization of the adsorption of indigo dye on chitosan using a full factorial design, exploratory analysis of propolis samples characterized by ESI-MS (electrospray ionization mass spectrometry) using PCA (principal component analysis) and HCA (hierarchical cluster analysis), MIA-QSAR (multivariate image analysis applied to quantitative structure-activity relationships) modeling for prediction of a kinetic parameter related to the activity of peptides against dengue using PLS (partial least squares), and classification of wine samples from different varieties using PLS-DA (PLS for discriminant analysis). All examples are illustrated with graphs and tables obtained in Chemoface.


Introduction
A new scientific concept was introduced in the 1970s: chemometrics, a science concerned with performing calculations on measurements taken from a chemical process or system in order to obtain information about the state of that system by means of mathematical or statistical methods. Because of the complex origin of the data involved in chemometric work and the need for extensive calculations, the low processing capabilities of computers at the time limited research. 1,2 Important advances in computation have been achieved since then, and chemometrics has spread into many research fields related to chemistry, such as food science, 3 soil science, 4 clinical analysis 5 and pharmaceutical sciences, 6 among others. 7 Thus, many methods, and especially computational tools for chemometric calculations, have been developed.
Currently, a number of specialized programs for chemometric calculations are marketed. Those with reasonably friendly interfaces are expensive commercial products, [8][9][10] which can be limiting for classrooms with many computers and for students. On the other hand, freely licensed programs 11,12 still lag in user-friendly graphical interfaces and usually require some command-line programming, which creates difficulties for less experienced users. Although some toolboxes with graphical interfaces can facilitate the use of these programs, 13,14 each is specific to a particular chemometric method.
Therefore, a new software package for chemometrics, named Chemoface, was developed to provide a free computational tool with a user-friendly graphical interface for researchers, professors and students interested in this science. Chemoface includes several modules that can solve problems related to design of experiments, pattern recognition, classification and multivariate calibration. Files of different formats can be imported. It also produces a variety of high-quality graphs and tables for exploring results. In this work, the main features of Chemoface are presented using case studies reported in the literature.

Requirements
Chemoface was developed in the MATLAB 15 environment. It is a stand-alone application and does not require a MATLAB license or installation to run. Only the MATLAB Compiler Runtime (MCR) must be installed, and it is freely available along with Chemoface.
MCR is a set of shared libraries that provides complete support for all the features of MATLAB.
Computational performance depends on the size of the data sets and on the hardware. The examples presented in this work were run on a laptop with a Core i3 processor and 4 GB of RAM. Large data sets (about 100 × 10000) were also tested successfully in the Pattern Recognition, Multivariate Calibration, Data Plot and Data Organization modules.

Modules and applications
Chemoface consists of five modules, which can be accessed from the software home screen: Experimental Design, Pattern Recognition, Multivariate Calibration, Data Plot and Data Organization. In all modules, Chemoface identifies samples in rows and variables in columns. Figure 1 shows the home screen and the Multivariate Calibration module of Chemoface.

Experimental Design module
This module solves problems related to design of experiments using full factorial, fractional factorial, central composite, Plackett-Burman and mixture designs. 16,17 The results can be explored using effect tables and Pareto charts. The user can adjust various parameters related to the design and analysis of experiments, such as the number of factors, number of repetitions in the assays, number of central points, fraction size in fractional designs, simplex type and constraints on the component proportions in mixture designs, and confidence level for significance tests. Surface and contour graphs are also available, with settings for linear, linear-interactions, quadratic and pure quadratic models. For both surface and contour plots, options are available to plot the experimental data and to account only for the significant regression coefficients. Statistics for models obtained only with significant regression coefficients are also computed.
Some features of this module are illustrated by analyzing an experiment on the removal of indigo carmine dye from aqueous solutions using cross-linked chitosan, originally reported by Cestari et al. 18 The effects of the amount of chitosan (100-300 mg), concentration of dye (2.0-5.0 × 10⁻⁵ mol L⁻¹) and temperature (25-35 °C) on dye adsorption on chitosan were evaluated by a 2³ full factorial design. The responses were obtained in duplicate. In the original work, the authors evaluated the design using only a table of the effects and their respective errors. Here, this experimental design was also evaluated using other tools available in Chemoface.
The Pareto chart of the effects is presented in Figure 2a. The graph provides a clear visualization of the factor effects, and indicates that the amount of chitosan exhibited an antagonistic effect, while the temperature presented a synergistic effect. A significant interaction effect between the amount of chitosan and temperature was also verified. The third-order interaction effect was significant, but it was attributed mainly to the amount of chitosan and temperature, since neither the main effect nor the second-order interactions of dye concentration were significant. A surface plot (Figure 2b) of dye adsorption against the amount of chitosan and temperature shows that increasing the chitosan mass from 100 to 300 mg decreases dye adsorption, whereas increasing the temperature from 25 to 35 °C increases it. The statistical results for the model (Figure 3) indicate a significant linear fit (R² > 0.9; p-value of F-test < 0.05), and confirm chitosan mass and temperature as significant effects based on the regression coefficients.
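The effect estimates behind a Pareto chart such as the one discussed above can be sketched in a few lines. The following is a minimal illustration, not Chemoface's implementation: the function names are hypothetical, factors are coded as -1/+1, and the responses are synthetic numbers invented for the example (they are not the data of Cestari et al.).

```python
from itertools import product
from math import sqrt

def full_factorial(k):
    """Coded (-1/+1) runs of a 2^k full factorial design in standard order."""
    # product() varies the last position fastest; reversing makes factor 1 alternate fastest
    return [tuple(reversed(levels)) for levels in product((-1, 1), repeat=k)]

def effects(design, responses):
    """Main and interaction effects; responses[i] holds the replicates of run i."""
    k = len(design[0])
    means = [sum(r) / len(r) for r in responses]
    table = {}
    # every non-empty subset of factors defines one contrast column
    for mask in range(1, 2 ** k):
        factors = [j for j in range(k) if (mask >> j) & 1]
        contrast = 0.0
        for run, ybar in zip(design, means):
            sign = 1
            for j in factors:
                sign *= run[j]
            contrast += sign * ybar
        name = "x".join(str(j + 1) for j in factors)
        # effect = mean response at (+) minus mean response at (-)
        table[name] = contrast / (len(design) / 2)
    return table

def effect_standard_error(responses):
    """Standard error of an effect from the pooled variance of the replicates."""
    ss, dof, n_total = 0.0, 0, 0
    for r in responses:
        m = sum(r) / len(r)
        ss += sum((y - m) ** 2 for y in r)
        dof += len(r) - 1
        n_total += len(r)
    return sqrt(4 * (ss / dof) / n_total)

# Synthetic duplicate responses (hypothetical, for illustration only)
design = full_factorial(3)
responses = [
    [10 - 2 * x1 + x3 + 0.5 * x1 * x3 + e for e in (-0.1, 0.1)]
    for (x1, x2, x3) in design
]
table = effects(design, responses)
for name, value in sorted(table.items(), key=lambda item: -abs(item[1])):
    print(f"effect {name}: {value:+.3f}")
```

Sorting the effects by absolute value, as in the final loop, gives the ordering a Pareto chart displays. Comparing each effect with a multiple of the standard error is one common way to judge significance.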

Pattern Recognition module
The Pattern Recognition module performs principal component analysis (PCA) 19 and hierarchical cluster analysis (HCA). 20 Several pre-processing methods can be easily applied to the data set, such as mean centering, autoscaling, smoothing/derivative, normalization and multiplicative scatter correction, as well as spectral conversions (absorbance/transmittance). Graphs for 2D and 3D PCA can be generated individually for scores and loadings, in addition to biplots. Sample classes can be inserted, and graphs colored according to these classes can be obtained. HCA can be performed using Euclidean or Mahalanobis distance with nearest-neighbor, furthest-neighbor or average linkage. A color can be assigned to each group of nodes in dendrograms based on a threshold. PCA can also be applied to the input data for HCA.
The functionality of this module is illustrated through an exploratory analysis of a data set from the characterization of propolis harvested in different seasons, reported in the literature. 21 Alcoholic extracts of propolis samples harvested in Spring, Summer and Autumn were analyzed by electrospray ionization mass spectrometry (ESI-MS). The mass spectra were expressed as the intensities of the individual [M − H]− ions of the most intense ions in the fingerprint of each sample. Some ions were identified as polyphenolic compounds. In the original work, the results were autoscaled and explored by PCA using a PC1 × PC4 plot. Here, the non-preprocessed data set was analyzed by PCA and HCA. A 3D biplot of scores and loadings (Figure 4a) reveals the distinction of samples from the three seasons. The main feature of Spring propolis was the high intensity of the ion of m/z 255. The ions of m/z 301, 315, 353 and 515 stood out in Summer propolis. High intensities of the ions of m/z 300 and 363 were typical of Autumn propolis. Similar characteristics were also observed in the original work. The HCA dendrogram (Figure 4b), obtained using Euclidean distance and average linkage, confirms the insights from PCA: the samples from the three seasons are distinguished, with the Summer samples better separated from the rest.
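Two of the pre-processing options listed above, mean centering and autoscaling, can be illustrated as follows. This is a minimal sketch with hypothetical function names, assuming samples in rows and variables in columns (the convention stated for Chemoface) and the sample standard deviation with n − 1 in the denominator. It does not reproduce Chemoface's own code.

```python
from math import sqrt

def mean_center(X):
    """Subtract the column mean from every variable (column)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def autoscale(X):
    """Mean center, then divide each column by its sample standard deviation."""
    n = len(X)
    centered = mean_center(X)
    # assumes no column is constant; a zero standard deviation would need special handling
    stds = [sqrt(sum(v * v for v in col) / (n - 1)) for col in zip(*centered)]
    return [[v / s for v, s in zip(row, stds)] for row in centered]

# Three samples (rows), two variables (columns) on very different scales
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(autoscale(X))
```

After autoscaling, every variable has zero mean and unit variance, so variables measured on large scales no longer dominate PCA or distance-based methods such as HCA.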
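The clustering behind a dendrogram built with Euclidean distance and average linkage, as used for Figure 4b, can be sketched as below. This is a naive illustrative implementation with a hypothetical function name and a tiny invented data set. It is neither the propolis data nor Chemoface's code, and it is not optimized for large data sets.

```python
from math import dist

def average_linkage(X):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average of all pairwise Euclidean distances between the two clusters
                d = sum(dist(X[i], X[j]) for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical samples forming two well-separated groups
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
for left, right, height in average_linkage(X):
    print(left, right, round(height, 3))
```

Each merge distance is the height at which two branches join in the dendrogram. Cutting the tree at a chosen threshold height yields the groups that can then be colored.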

Multivariate Calibration module
This module performs multivariate calibration using multiple linear regression (MLR), principal component regression (PCR) and partial least squares regression (PLS), as well as modeling for classification by discriminant analysis (PLS-DA, PCR-DA and MLR-DA). 22,23 Leave-one-out cross-validation (LOO-CV) can also be performed. Performance parameters for the models, such as the widely used root mean square error (RMSE) and correlation coefficient (R²), are calculated for the cross-validation, calibration and test sets. Additional statistical parameters proposed by Roy and co-workers, [24][25][26][27] namely r²m and r²p, are also calculated for validation purposes. An r²m above 0.5 indicates not only that a good correlation between the experimental and predicted values was obtained for the test set, but also that the absolute experimental and predicted values are congruent. The r²p parameter gives insight into the statistical difference between R² for calibration and R² for y-randomization (values above 0.5 are acceptable). New data sets can be inserted for external validation or for new predictions using the current calibration model. A variety of options for data pre-processing are available. Models for multiple independent variables can be built simultaneously. The data set can be easily divided into calibration and test sets, either manually or automatically using the Kennard-Stone algorithm. 28 A number of charts and tables can be obtained to assist the exploration of results.
Some features of this module for PLS regression are illustrated by a study on the modeling of a kinetic parameter related to the activities of modified peptides against dengue type 2 using MIA-QSAR (multivariate image analysis applied to quantitative structure-activity relationship). 29 In MIA-QSAR, two-dimensional images of chemical structures are correlated with bioactivities and are supposed to codify chemical properties. 30 In this study, a total of 54 [...] The first two samples selected are the furthest ones from each other. The next sample is selected by its distance from the previously selected samples. 28 The molecular figures were imported using the Data Organization module of Chemoface, as described further below. An outlier detection test was applied using a leverages × studentized residuals plot (Figure 5a). This test was not applied in the original work, and the absence of outliers in the data set was confirmed here. RMSE and R² for cross-validation corroborate 6 LV (latent variables) as the appropriate number of PLS components (Figure 6). The model performance (Table 1) corroborates the results of Silla et al. 29 and supports the correct random selection of test samples by the authors. [26][27] Finally, the measured × predicted property plot for the training, cross-validation and test sets suggests a good predictive ability for the PLS model (Figure 5b).
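The sample selection described above (start from the two most distant samples, then repeatedly add a sample according to its distance from those already chosen) corresponds to the usual max-min formulation of the Kennard-Stone algorithm. A minimal sketch follows, with a hypothetical function name and invented one-dimensional data. It assumes Euclidean distance and at least two samples to select, and it is not Chemoface's implementation.

```python
from math import dist

def kennard_stone(X, n_select):
    """Return the indices of n_select samples chosen by the max-min criterion."""
    n = len(X)
    # start with the two samples that are furthest from each other
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(X[pair[0]], X[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select and remaining:
        # for each candidate take the distance to its nearest selected sample,
        # then pick the candidate for which that distance is largest
        nxt = max(remaining, key=lambda i: min(dist(X[i], X[s]) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Hypothetical samples described by a single variable
X = [[0.0], [1.0], [4.0], [10.0]]
print(kennard_stone(X, 3))  # -> [0, 3, 2]
```

The selected indices span the variable space as uniformly as possible. The samples not selected form the complementary set.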
A classical data set 31 was used to illustrate classification analysis by PLS-DA. The data set refers to wine samples from three varieties (Barbera, Grignolino and Barolo), which were characterized by measurements of alcohol, total phenols, flavonoids, color intensity, hue, the ratio of optical density at 280 nm to that at 315 nm, and proline. In the original work, the data set was evaluated in order to build classification models. Classification ability was 97.7% using methods such as PCA, KNN (K-nearest neighbors) and SIMCA (soft independent modeling of class analogies). In Chemoface, the data set was autoscaled. An outlier test was applied using a leverages × studentized residuals plot (Figure 7a), and 12 samples were excluded from a total of 178. Of the remaining 166 samples, 55 were selected for the test set using the Kennard-Stone algorithm, and 111 were used in the calibration step. A plot of the percentage of successful classification in cross-validation indicates 2 LV as appropriate (Figure 7b). The 2 LV model performed well, with about 100% successful classification (Table 2, Figure 8a). The score plot for the training and test sets showed excellent sample discrimination (Figure 8b).

Data Plot module
Scatter plots for data sets can be obtained using the Data Plot module. This is especially useful for plotting spectral data. Graphs can be plotted for both original and preprocessed data.

Data Organization module
The Data Organization module allows importing numerical data from .txt, .dat and .csv files, and images in .bmp format. Multiple files, such as spectra files, can be imported simultaneously. The process of importing images (.bmp) is based on converting them into a three-way array containing the RGB values for each pixel. Then the values of R, G and B are summed for each pixel, resulting in a two-way array (matrix). Finally, this matrix is unfolded to generate a vector. This is particularly useful for importing molecular figures to be used as descriptors in MIA-QSAR models. 30

Inserting and exporting data
Numerical data can be inserted into Chemoface in two ways: they can be typed directly into the tables, or copied from any numerical spreadsheet or from a text file (separated by spaces or tabs) and pasted directly into the module tables.
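The image import steps described for the Data Organization module (RGB three-way array, per-pixel sum of R, G and B, unfolding into a vector) can be sketched as follows. The function name is hypothetical, the image is represented as a nested list rather than read from a .bmp file, and the row-by-row unfolding order is an assumption, since the text does not specify it.

```python
def unfold_rgb_image(image):
    """Turn a rows x columns x 3 RGB array into a flat vector of R+G+B sums."""
    # step 1: sum R, G and B for each pixel -> two-way array (matrix)
    summed = [[r + g + b for (r, g, b) in row] for row in image]
    # step 2: unfold the matrix row by row into a single vector (assumed order)
    return [value for row in summed for value in row]

# Hypothetical 2 x 2 image: white, black and two colored pixels
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(10, 20, 30), (1, 2, 3)],
]
print(unfold_rgb_image(image))  # -> [765, 0, 60, 6]

# One unfolded vector per molecule gives the rows of a descriptor matrix
descriptor_matrix = [unfold_rgb_image(img) for img in (image, image)]
```

Stacking the vectors in this way follows the samples-in-rows, variables-in-columns layout that Chemoface uses in all modules. It requires every image to have the same dimensions.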
Commands to transpose the data set and to delete specific rows and columns are available.
After the data are entered, they can be saved to a text file (.txt) structured by Chemoface; only this type of text file can be opened by the software. Data from other text files (unstructured and not saved by Chemoface) may be inserted by copying and pasting, as explained earlier. Models obtained by MLR, PCR or PLS can also be saved for further use in the software.
All procedures to insert or save data described above are carried out through the "File" main menu of the Chemoface modules.
The figures obtained can be exported to various image formats with high resolution. The numerical data from graphs can be copied and used in other graphing software. The data tables can also be copied.

Conclusion
The goal of the Chemoface project is to offer a comprehensive, free computational tool with a user-friendly graphical interface for researchers, professors and students dealing with common practices in chemometrics.
A number of other functions, graphs and tables, in addition to those presented in this work, are available in Chemoface. This version includes the main methods used in chemometrics, but new features and other chemometric methods, such as multivariate curve resolution and three-way approaches, may be implemented in the future.
Development of the program remains open, and contributions from other researchers are welcome.
The software can be freely downloaded from the Department of Food Science of the Federal University of Lavras, Minas Gerais State, Brazil. 32

Figure 1. Home screen (a) and Multivariate Calibration module (b) of Chemoface.

Figure 2. Pareto chart (a) and surface plot (b) for the 2³ full factorial design for evaluation of the effects of the amount of chitosan (Q, mg), dye concentration (C, 10⁻⁵ mol L⁻¹) and temperature (T, °C) on dye adsorption on chitosan.

Figure 3. Chemoface output for statistical parameters of the linear model relating amount of chitosan (Q), dye concentration (C) and temperature (T) against dye adsorption on chitosan.

Figure 6. RMSE (a) and R² (b) in the cross-validation of the MIA-QSAR model.

Figure 7. Leverages × studentized residuals for outlier diagnostic (a), and percentage of successful classification for cross-validation (b) of the PLS-DA model.

Figure 8. Predicted classes for test samples of wines (a) and PLS-DA scores multiplot for calibration and test samples (b).

Table 1. PLS performance for the prediction of kcat of modified peptides against dengue type 2 using the MIA-QSAR model

Table 2. Success of classification of PLS-DA and SIMCA models for classification of wine samples