A text classification model is proposed according to the features of texts. Based on this model, an optimized objective function is designed by utilizing the occurrence frequency of each feature in each category. According to the relation matrix of text resources and features, an improved genetic algorithm is adopted for the solution, with integral matrix crossover, transposition and recombination of the entire population. Finally, sample data of manufacturing text information from professional resource database systems is taken as an example to illustrate the proposed model and solution for feature dimension reduction and text classification. The crossover and mutation probabilities of the algorithm are compared vertically and horizontally to determine a group of better parameters. The experimental results show that the proposed method is fast and effective.
Text classification; genetic algorithm; dimension reduction; manufacturing text
According to incomplete statistics, the total amount of scientific and technological information provided by the internet worldwide exceeds 20 TB [1] and increases at a rate higher than 5% every year [2]. The internet brings not only manufacturing resources information but also some problems. On one hand, manufacturing enterprises have a large demand for manufacturing resources information. On the other hand, customers have to spend a lot of time acquiring the desired information because manufacturing resources information accumulates on a large scale, and sometimes the resource requester cannot find the required resource information accurately and in time. How to carry out classified retrieval and management of such a large amount of manufacturing resources information on the internet has therefore long been a hot area of research [3]. In text classification, texts are normally expressed through the vector space model [4]. However, the number of dimensions is usually large after a text is converted to a vector space, which can easily result in low classification efficiency, and the vector space also contains many null values. The initial features therefore have to be selected to reduce the number of feature dimensions and increase the efficiency of classification [5].
Dong et al. [6] proposed a machinery-oriented information text classifier, which adopts the document frequency method to extract primary features, then uses the grey relational degree to select secondary features, reducing the feature dimensionality and weakening the relations between terms, and finally uses the Bayesian classification algorithm for text classification. Liu et al. [7] proposed a cloud manufacturing text classification method based on Naive Bayes and SVM (support vector machine) under a cloud computing environment. In general, research on the classification of manufacturing text is not common and the applicable methods are limited to traditional text classification algorithms, which can mainly be divided into two categories: one is based on statistics, such as the Bayesian classification algorithm [8], support vector machine [9], K-nearest neighbor classifier [10] and Rocchio algorithm [11]; the other is based on rules, such as decision trees, rough sets and fuzzy sets [12]. Both suffer from complex computation and have great difficulty classifying large-scale text sets.
Because neither the statistics-based nor the rule-based category of traditional algorithms copes well with large-scale text classification, owing to their high computational complexity, a corresponding in-depth study is necessary. In this paper, text classification based on an improved genetic algorithm is studied, in which column crossover of the entire population is adopted for the optimization solution.
Model for text classification
Manufacturing text information in the broad sense describes the software and hardware elements necessary for completing production activities over the whole life cycle, including all elements involved in related procedures such as design, manufacturing and maintenance. Manufacturing resources in the narrow sense mainly refer to the material elements necessary for producing a part, including machines, tools, fixtures, measuring instruments and materials. For both kinds of machinery information described as texts, each element can be divided into many subclasses. For example, the machine element can be divided into turning machine, milling machine, boring mill, grinding machine, planer, gear hobbing machine, gear slotting machine, sawing machine, broaching machine, tapping machine, numerical control machine and other machines. The purpose of text information classification is to provide the most suitable information requested by the information users.
According to the vector space model, a text can be expressed as a vector. Suppose the text information is divided into t classes and class $i$ ($i = 1, 2, \ldots, t$) can be expressed by $n_i$ features, so the total number of features of the vector space is:

$$n = \sum_{i=1}^{t} n_i$$
Therefore, the relationship between text information and features can be described as shown in Formula 1:
Where: the matrix of relation between the resource and features is composed of the elements $a_i$, each of which indicates whether the corresponding feature is included in the text information (0 means excluded and 1 means included); the operational sign for relation denotes "belongs to"; and $p = \sum_{i=1}^{n} a_i$ refers to the number of dimensions of the text in vector space, that is, the number of effective features.
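As a minimal sketch of this relation, a text can be held as a 0/1 vector over the features of the vector space, with 1 marking an included feature and 0 an excluded one, and the number of effective features obtained as the number of ones. The feature names and the helper `relation_vector` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set


def relation_vector(features: Sequence[str], text_terms: Set[str]) -> List[int]:
    """Return one 0/1 entry per feature: 1 if the feature occurs in the text."""
    return [1 if f in text_terms else 0 for f in features]


# Hypothetical feature list of the vector space (illustrative names only).
features = ["lathe", "spindle", "milling", "cutter", "crane"]
# Terms found in one hypothetical text.
text_terms = {"spindle", "lathe", "speed"}

a = relation_vector(features, text_terms)
p = sum(a)  # number of effective features of this text
print(a, p)  # [1, 1, 0, 0, 0] 2
```

Terms that are not in the feature list (here "speed") simply do not appear in the vector.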
Procedures of solutions for classification of text information
Since each element in text information involves many classes and many initial features are used to express each class, which results in low computation speed for classification and poor suitability for large-scale computing, the initial features have to be reduced. Because only 0 or 1 exists in the matrix of relation between resource and features, the vector representation of a feature can be easily changed by changing the sequence of columns, and thus the matrix of relation between resource and features is very suitable for genetic coding in a genetic algorithm (GA). In addition, compared with traditional methods, the genetic algorithm is an adaptive global optimal searching algorithm and performs particularly well for some large-scale complex non-linear systems [13]. Therefore, the genetic algorithm is selected in this paper for the optimization solution and its procedure is shown in Fig. 1.
Construction of optimized objective function
Suppose the text information is divided into t classes and class $i$ ($i = 1, 2, \ldots, t$) can be expressed by $n_i$ features, so the total number of features of the vector space is $n = \sum_{i=1}^{t} n_i$. The number of texts in class $i$ is written here as $m_i$, so the total number of manufacturing texts is $m = \sum_{i=1}^{t} m_i$, and the frequency of occurrence of the $j$-th feature in class $i$ is written here as $w_{ij}$ ($j = 1, 2, \ldots, n_i$).
Then, for any vector in vector space, the optimized objective function for dimension reduction can be constructed into Formula (2):
Where p is the number of effective features. Obviously, with the same p, the bigger the value of f(p) is, the higher the frequency of occurrence of the effective features in the corresponding class and the higher their importance.
Suppose T denotes the text information to be classified and the frequencies of occurrence of the features of each class in this text information are given ($n_i$ refers to the number of features in class $i$). The optimized objective function for text classification and prediction can then be expressed as Formula (3):
Obviously, the bigger the value of f(T) is, the higher the relevance between the text to be classified and the features in this class. The text is therefore categorized into the class for which f(T) is biggest.
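As a rough illustration of classifying by the largest objective value, the sketch below assumes a simple score: for each class, the frequency of each selected feature in the text is multiplied by that feature's frequency in the class, and the products are summed. This scoring form, the frequency values and the helper names `class_score` and `classify` are assumptions made for illustration and are not taken from Formula (3).

```python
from typing import Dict, List, Tuple


def class_score(text_freq: Dict[str, int],
                class_freq: Dict[str, float],
                selected: List[str]) -> float:
    """Assumed score: sum of (frequency in text) x (frequency in class)."""
    return sum(text_freq.get(f, 0) * class_freq.get(f, 0.0) for f in selected)


def classify(text_freq: Dict[str, int],
             class_freqs: Dict[str, Dict[str, float]],
             selected: Dict[str, List[str]]) -> Tuple[str, Dict[str, float]]:
    """Score every class and return the class with the largest score."""
    scores = {name: class_score(text_freq, freq, selected[name])
              for name, freq in class_freqs.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores


# Hypothetical per-class feature frequencies (values are made up).
class_freqs = {
    "lifting machinery": {"crane": 0.8, "hoist": 0.6},
    "crushing equipment": {"crusher": 0.9, "jaw": 0.5},
}
# Features kept for each class after dimension reduction (hypothetical).
selected = {
    "lifting machinery": ["crane", "hoist"],
    "crushing equipment": ["crusher", "jaw"],
}
# Feature counts in the text to be classified (hypothetical).
text_freq = {"crane": 2, "jaw": 1}

label, scores = classify(text_freq, class_freqs, selected)
print(label)  # lifting machinery
```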
Solution based on genetic algorithm
(1) Chromosome coding
Take the matrix of relation between text information and features as the chromosomal gene code, as shown in Table 1 (resources are divided into t classes, class $i$ can be expressed by $n_i$ features, and the total number of features of the vector space is $n = \sum_{i=1}^{t} n_i$).
The value of $a_i$ is 0 or 1, so the number of effective features is $p = \sum_{i=1}^{n} a_i$, i.e. the number of feature vectors in the reduced dimension. Both the initial population and the number of features for dimensionality reduction are randomly generated. To decrease the calculation workload, experienced technical staff can narrow the reduction down to an approximate range according to actual GA operations.
(2) Crossover
To ensure the crossover effect across the population, a method with column crossover, transposition and recombination of the entire population is adopted. First, the chromosomes of the entire population are used to form a big matrix (the number of chromosomes is Popsize and each chromosome comprises n columns), and then column crossover, transposition and recombination are performed. The detailed crossover process is shown in Fig. 2.
A crossover interval is randomly selected from the big matrix. For example, select the crossover interval of the parent chromosomes A, move the interval to the front columns of A and reposition the remaining columns in sequence to obtain the child chromosomes A'. The advantages of this method are that all chromosomes take part in the column crossover and the number of "1" genes in each chromosome remains the same, which ensures the uniformity of the final dimension reduction. In addition, the parent chromosomes do not have to be considered individually in this method and the obtained generation has certain variations, which helps keep the diversity of the population.
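The column crossover described above can be sketched as follows, under the assumption that "adding the interval to the front" means moving a contiguous block of columns of the whole population matrix to the front while the remaining columns keep their order. The function names are hypothetical; the sketch only illustrates why every chromosome is affected and why the number of ones in each chromosome is preserved.

```python
import random
from typing import List

Population = List[List[int]]  # Popsize rows, n columns of 0/1 genes


def column_crossover(pop: Population, start: int, end: int) -> Population:
    """Move columns [start, end) of the whole population matrix to the front."""
    return [row[start:end] + row[:start] + row[end:] for row in pop]


def random_column_crossover(pop: Population, rng: random.Random) -> Population:
    """Pick a random non-empty column interval and apply the crossover."""
    n = len(pop[0])
    start = rng.randrange(0, n)
    end = rng.randrange(start + 1, n + 1)
    return column_crossover(pop, start, end)


pop = [
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
]
child = column_crossover(pop, 2, 4)
for row in child:
    print(row)
# [0, 1, 1, 0, 0]
# [1, 0, 0, 1, 0]
# [1, 0, 0, 0, 1]
```

Because each row is only permuted, every child chromosome keeps the same number of selected features as its parent.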
(3) Mutation
Generate a random number from 0 to n as the column number of the population matrix and another two random numbers from 0 to Popsize as the position flags for interchange. Once mutation occurs, the gene values at the corresponding positions are interchanged. The positions for interchange should be selected so that the gene values of the two chromosomes differ, which changes the original gene values as far as possible and helps generate new solutions.
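A minimal sketch of this interchange mutation is given below. It assumes 0-based indices with exclusive upper bounds (columns 0 to n-1, chromosomes 0 to Popsize-1) and retries a limited number of times until the two selected genes differ; the function name and the retry limit are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Population = List[List[int]]


def interchange_mutation(pop: Population, rng: random.Random,
                         max_tries: int = 100) -> Optional[Tuple[int, int, int]]:
    """Swap the gene values of two chromosomes in one random column (in place).

    Returns (column, row_a, row_b), or None if no pair with different
    gene values was found within max_tries attempts.
    """
    popsize, n = len(pop), len(pop[0])
    for _ in range(max_tries):
        col = rng.randrange(n)                        # column number
        row_a, row_b = rng.sample(range(popsize), 2)  # two distinct chromosomes
        if pop[row_a][col] != pop[row_b][col]:        # only swap differing genes
            pop[row_a][col], pop[row_b][col] = pop[row_b][col], pop[row_a][col]
            return col, row_a, row_b
    return None


pop = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
before = [row[:] for row in pop]
moved = interchange_mutation(pop, random.Random(0))
print(moved, pop)
```

In this sketch the swap leaves the number of ones in the chosen column unchanged, while each of the two affected chromosomes changes in exactly one gene.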
(4) GA operations procedure
Step 1: Given the population size Popsize and the number of features after dimension reduction p, randomly generate the initial population pop(k), find the chromosome with the biggest fitness function value and record its optimization function value.
Step 2: Calculate the optimization function value of each chromosome in population pop(k). Find the biggest value of this function and the corresponding chromosome i, and also find the smallest value of this function and the corresponding chromosome j. The recorded optimal chromosome and its optimization function value are updated by comparing them with the values of chromosomes i and j.
Step 3: Stop the calculation if the ending conditions are met, and then output the optimal chromosome and the optimal solution. Otherwise, given the crossover probability, perform crossover on pop(k) as in Fig. 2 to obtain the crossed population.
Step 4: Carry out mutation on the crossed population to obtain the mutated population.
Step 5: Take the mutated population as pop(k+1), let k = k + 1, and then return to Step 2.
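The overall loop of Steps 1 to 5 can be sketched as below. Several points are assumptions made for illustration: the fitness function is a stand-in (a sum of hypothetical per-feature frequencies), the best chromosome is kept by overwriting the current worst one, the ending condition is a fixed number of generations, and the mutation skips the check that the swapped genes differ. The names `run_ga`, `column_crossover` and `column_swap_mutation` are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Chromosome = List[int]
Population = List[Chromosome]


def run_ga(pop: Population,
           fitness: Callable[[Chromosome], float],
           crossover: Callable[[Population, random.Random], Population],
           mutate: Callable[[Population, random.Random], None],
           pc: float, pm: float, generations: int,
           rng: random.Random) -> Tuple[Chromosome, List[float]]:
    """Return the best chromosome and the best fitness after each generation."""
    pop = [row[:] for row in pop]
    best = max(pop, key=fitness)[:]       # Step 1: best of the initial population
    history = [fitness(best)]
    for _ in range(generations):          # Step 3: assumed ending condition
        if rng.random() < pc:             # Step 3: crossover with probability pc
            pop = crossover(pop, rng)
        if rng.random() < pm:             # Step 4: mutation with probability pm
            mutate(pop, rng)
        # Step 2: evaluate; assumed elitism overwrites the worst chromosome
        # with the best one recorded so far, then updates the record.
        worst = min(range(len(pop)), key=lambda i: fitness(pop[i]))
        pop[worst] = best[:]
        best = max(pop, key=fitness)[:]
        history.append(fitness(best))     # Step 5: continue with next generation
    return best, history


def column_crossover(pop: Population, rng: random.Random) -> Population:
    n = len(pop[0])
    start = rng.randrange(n)
    end = rng.randrange(start + 1, n + 1)
    return [r[start:end] + r[:start] + r[end:] for r in pop]


def column_swap_mutation(pop: Population, rng: random.Random) -> None:
    col = rng.randrange(len(pop[0]))
    a, b = rng.sample(range(len(pop)), 2)
    pop[a][col], pop[b][col] = pop[b][col], pop[a][col]


weights = [0.9, 0.1, 0.4, 0.8, 0.2, 0.7]  # hypothetical feature frequencies


def fitness(chrom: Chromosome) -> float:
    return sum(w for w, g in zip(weights, chrom) if g)


rng = random.Random(1)
pop0 = []
for _ in range(8):
    genes = [1, 1, 1, 0, 0, 0]
    rng.shuffle(genes)
    pop0.append(genes)

best, history = run_ga(pop0, fitness, column_crossover, column_swap_mutation,
                       pc=0.93, pm=0.15, generations=200, rng=rng)
print(best, history[-1])
```

The probabilities 0.93 and 0.15 are the values selected in the case study; because the best chromosome is always reinserted, the recorded best fitness never decreases from one generation to the next.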
To validate the proposed algorithm, sample data is taken from the China Knowledge Resource Integrated Database (http://www.cnki.net/), WanFang Data (http://g.wanfangdata.com.cn/), the China Science and Technology Journal Database and other professional resource database systems. The data contains 5681 items for training and 2447 items for prediction. The sample data is divided into 12 classes (e.g. concrete machinery, lifting machinery, motor vehicles industry, crushing equipment, port machinery and so on) and the data not belonging to these 12 classes is labelled "else". To increase the test speed, the document frequency method [14] is first used to screen the features initially. 195 candidate features are left and their distribution in each class is shown in Table 2. The algorithm is then implemented in VB.NET and the ending condition is set to 30000 generations.
Table 2. Screening of initial features in each class
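The initial screening by document frequency can be sketched as follows: count, for each term, the number of documents that contain it and keep the terms whose count reaches a threshold. The toy documents, the threshold and the function names are hypothetical; the thresholds actually used in the case study are not stated here.

```python
from collections import Counter
from typing import Iterable, List, Set


def document_frequency(docs: Iterable[Set[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df: Counter = Counter()
    for terms in docs:
        df.update(set(terms))  # each document counts a term at most once
    return df


def screen_features(docs: Iterable[Set[str]], min_df: int) -> List[str]:
    """Keep the terms that occur in at least min_df documents."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)


# Hypothetical tokenized documents.
docs = [
    {"crane", "hoist", "load"},
    {"crane", "boom"},
    {"crusher", "jaw", "load"},
    {"crane", "load"},
]
print(screen_features(docs, min_df=2))  # ['crane', 'load']
```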
To obtain better values of the crossover probability and mutation probability, the mutation probability is first fixed at 0.15 while the crossover probability varies from 0.90 to 1, as shown in Table 3. The crossover probabilities with the better operation results are selected: 0.91, 0.93, 0.94 and 0.97. Each of these crossover probabilities is then fixed in turn while the mutation probability is changed from 0.10 to 0.20. The comparison of the operation results is shown in Table 4. Finally, a set of better parameters for the genetic algorithm is determined: crossover probability 0.93 and mutation probability 0.15. The operation results of the GA under these parameters are shown in Fig. 3. It can be seen from Fig. 3 that the algorithm converges steadily to the optimal result over the generations of population iteration, indicating that the algorithm has good engineering application value.
Table 3. Vertical comparison of algorithm operation when the dimension of features is reduced to 50
Table 4. Horizontal comparison of algorithm operation when the dimension of features is reduced to 50
To check the overall performance of the proposed algorithm, the prediction data is used to compare running time and prediction accuracy when the dimension of features is reduced to 50, 60, 70, 80, 90 and 100 respectively, as shown in Table 5. It can be concluded that the running time of the optimization algorithm is related to the number of iteration generations and not to the reduced dimension of features, whereas the prediction time is related to the reduced dimension of features: the fewer the features, the shorter the prediction time.
Table 5. Comparison of prediction accuracy
(1) Dimensionality reduction of text features can effectively increase the running speed and mitigate the dimension explosion. However, fewer features are not always better, since too few features can harm the accuracy of the final classification. In this case, for example, when the number of features is reduced to fewer than 60, the relative accuracy decreases considerably. A proper balance between feature dimension reduction and accuracy should therefore be selected according to the actual situation.
(2) The method with integral matrix crossover, transposition and recombination of the entire population involves all chromosomes in the crossover, which helps generate new solutions and keep the diversity of the population. Reserving the optimal chromosome in each generation helps good genes to be inherited in the population and helps the algorithm converge rapidly to the optimal solution.
(3) The selection of initial features can greatly affect the accuracy of the prediction result, so the initial feature screening of each class should reserve the key or exclusive feature attributes of the class.
(4) A group of better parameters can be obtained through vertical and horizontal comparison of algorithm performance (only one parameter is changed at a time), drawing on the experience of technical staff. Orthogonal or uniform experimental design can be introduced in actual operation to decide the better parameters of the algorithm rapidly.
Manufacturing text information on the internet accumulates massively along with the rapid development of information technology. In this paper, a model for text information classification is constructed according to the features of text resources to reveal the mapping relation between text information and features, and the genetic algorithm is adopted to solve the feature dimension reduction. The case study shows that the proposed method is fast and effective.
This paper was supported by the National Natural Science Foundation of China (Grant No.61104171) and Qing Lan Project of JiangSu Province of China (Grant No.2014).
[1] Li J, Furuse K, Yamaguchi K. Focused crawling by exploiting anchor text using decision tree. Www Special Interest Tracks & Posters of International Conference on World Wide Web, 2005:1190-1191.
[2] Chen J, Li Q, Wang L, et al. Automatically Generating an e-Textbook on the Web. Advances in Web-Based Learning - ICWL 2004. Springer Berlin Heidelberg, 2004:35-42.
[3] Nguyen M H, Torre F D L. Optimal feature selection for support vector machines. Pattern Recognition, 2010, 43(3):584-591.
[4] Liu H, Sun J, Liu L, et al. Feature selection with dynamic mutual information. Pattern Recognition, 2009, 42(7):1330-1339.
[5] Destrero A, Mosci S, Mol C D, et al. Feature selection for high-dimensional data. Computational Management Science, 2009, 6(1):25-40.
[6] Dong Lili, Wei Shenghui. Design of Mechanical Information Text Classifier. Microelectronics & Computer, 2012, 29(4):142-145.
[7] Liu Kan, Liu Zhong. An Algorithm of Manufacturing Text Classification Oriented Cloud Computing Environment. Machine Design and Manufacturing Engineering, 2013, 42(1):28-31.
[8] Wang S, Jiang L, Li C. Adapting naive Bayes tree for text classification. Knowledge & Information Systems, 2015, 44(1):77-89.
[9] B. Ramesh, J.G.R. Sathiaseelan. An Advanced Multi Class Instance Selection based Support Vector Machine for Text Classification. Procedia Computer Science, 2015, 57(1):1124-1130.
[10] Chavan S, Ran F, Nicholls I A. Acute Toxicity-Supported Chronic Toxicity Prediction: A k-Nearest Neighbor Coupled Read-Across Strategy. International Journal of Molecular Sciences, 2015, 16:11659-11677.
[11] Whitehead N P, Scherer W T, Smith M C. Use of Natural Language Processing to Discover Evidence of Systems Thinking. IEEE Systems Journal, 2015:1-10.
[12] Wang X, Liu X, Pedrycz W, et al. Fuzzy rule based decision trees. Pattern Recognition, 2015, 48(1):50-59.
[13] Kuldeep Kumar. Genetic Algorithm Review. International Journal of Technical Research, 2013, 2(3):24-29.
[14] Yang K F, Zhang Y K, Yan L I. Feature Selection Method Based on Document Frequency. Computer Engineering, 2010, 36(17):33-36.
Statement
The author(s) declare(s) that there is no conflict of interests regarding the publication of this article.
Publication in this collection
15 June 2016
24 June 2016