Technological Indicators of Nanocellulose Advances Obtained from Data and Text Mining Applied to Patent Documents

Nanocelluloses are remarkable cellulose-based nanomaterials with potential for innovation and sustainable appeal. Their advances can be assessed using patent indicators and text-mining techniques. This study aimed to analyze the advances in nanocellulose based on indicators compiled from patents filed at the United States Patent and Trademark Office (USPTO) from 2000 to 2012. Assignees, technological subjects, highly cited patents, applications and types of nanocellulose were obtained by mining structured and unstructured data. The results highlighted the different interests in the USA market, mainly after 2007. Terms mined from titles and abstracts added further information to the analysis. However, although the method was useful, it was not sufficient to identify all applications and types of nanocellulose in the sample analyzed; it is therefore recommended that other parts of the documents be included in future analyses.


Introduction
Nanocellulose is an emerging and sustainable nanomaterial that has been the target of scientific and technological research because of its interesting functional properties and its mechanical properties, which are superior to those of regular cellulose fibers and other conventional reinforcement materials [1-5]. Leaders in cellulose and pulp production, such as the USA, China and Brazil, as well as countries with large forest reserves, notably Canada, Finland and Sweden, have supported nanocellulose developments. The United Kingdom and France may also be counted in this group because of their initiatives and contributions [5]. The global nanocellulose market has good prospects, especially in the USA, where the market is forecast to reach US$ 250 million by 2020 [6]. Despite the many potential applications of nanocellulose [1-4,7], it is estimated that it will be used mainly in composite materials, paper, medical devices, filters and electronics [6].
Nanocellulose is a generic term for cellulose-based nanomaterials, and many other nomenclatures can be found in the literature [1,2,4,5]. Basically, there are two types of cellulose-based nanomaterials, whose nomenclatures have recently been standardized as cellulose nanofibrils (CF) and cellulose nanocrystals (CC) [8]; the main difference between them lies in the degree of crystallinity of the cellulose [1-3]. CF comprises crystalline and amorphous domains of cellulose chains, while CC contains only crystalline cellulose. Wood and pulp, natural fibers, plants in general, and forest and agricultural residues are examples of sources for CF and CC, which are prepared by mechanical and acid hydrolysis processes (top-down approaches), respectively. Bacterial fermentation (bottom-up approach) is another process that produces CC; its product has been called bacterial cellulose (BC) in most studies to distinguish it from the CF and CC obtained by top-down approaches [1-4].
Despite the renewable character, promising properties and research efforts in nanocellulose, considerable challenges remain. For example, even though a scale-up in nanocellulose production is expected in the coming years [6], costs should be minimized [1,2,5]. The systematic characterization of nanocellulose sources, properties and behaviors, such as mechanical, rheological and surface properties and the interaction with living tissues, is also not completely resolved. Furthermore, because of the tendency of nanocellulose to agglomerate, more developments are needed in surface modification and functionalization [1-3,9].
Patent indicators also have some limitations. Not all inventions are patented, since some companies prefer secrecy and other mechanisms to gain market dominance. There is also evidence that patenting behavior differs across industries and countries over time. Furthermore, differing standards across patent offices affect patent numbers even though the underlying inventive activity may remain unaffected, and counting patent documents gives the same weight to all patents regardless of their economic and technical value [10-12]. One approach to selecting economically and technically relevant patent documents is to analyze patents filed in the USA market because of its great economic importance [10-12], despite home advantage issues [19]. Citation analysis is another common method for highlighting relevant documents and tracking knowledge. There is evidence that the number of citations a patent receives reflects its technological and commercial influence [10-12].
Because of the increasing volume of information available and the challenges of collecting and analyzing this amount of data and text, data-mining and text-mining techniques have been researched so that scientists, managers and engineers can keep up with what has been done in a specific research field, especially in the context of engineering [5,10-12,16,17]. Data mining applied to patent documents deals with the structured part of the documents, which essentially comprises the bibliographic data, such as assignees, inventors, priority country (country of the first application), dates and classification codes [11,12,16]. However, data-mining results might not be sufficient to identify clearly the technical aspects of inventions, for instance fabrication routes, applications and products involved. To fill this gap, text-mining techniques may help by analyzing the unstructured texts (also called free texts) of patent documents, such as titles, abstracts, technical reports and claims [14-16]. Data mining and text mining complement each other in building a complete picture of competitors, historical developments and trends, technological monitoring, hotspots, gaps and opportunities to support strategic technological planning [11,14,15,17].
Mining unstructured texts is a newer, more empirical and much less developed field than mining the structured data of patent documents, and it is therefore a current subject of research [14-16,20-26]. There are advances in methodological practices and tools for analysis and visualization, such as text summarization, noun-phrase extraction and natural language processing, clustering, vector representation and mapping [16,20-26]. Titles and abstracts have been the main sources of unstructured text, although all parts of the documents can eventually be explored [16].
There are few studies on technological indicators and patent analyses of nanocellulose. Durán, Lemes and Seabra [7] aimed to update the technological advances in cellulose nanocrystals by analyzing the content of patent documents obtained from important repositories without using any quantitative technique, which limited the number of documents evaluated. Despite the lack of some methodological details, for instance the search expression and the criteria used to select the documents analyzed, the authors provided an assessment of the preparation, treatments and applications of the nanomaterial. Charreau et al. [4] sought to visualize the technological development of cellulose nanocrystals, microfibrillated cellulose and bacterial cellulose using data-mining techniques. They obtained evolution trends, top patentees and most cited patents. Milanez et al. [5] forecasted worldwide trends in scientific publications and patent applications using growth curves, and compared the evolution of nanocellulose with that of other nanomaterials and the relation between scientific and technological advances among the USA, Japan, China, Canada, Brazil and some European countries (France, Sweden, the UK, Germany and Finland). However, these studies explored only the structured data of patent documents, leaving out potentially relevant information, such as materials, processes and other technical aspects, that could be assessed by text mining. Furthermore, none of them focused on nanocellulose patenting activity in the USA market, which may be as significant for nanocellulose [6] as it already is for nanotechnology [27-29].
Hence, the purpose of this study was to map nanocellulose patenting activity in patent applications filed at the United States Patent and Trademark Office (USPTO I) from 2000 to 2012. It provides the evolution of patenting and technological subjects, priority countries, main patent assignees and most cited patents, with a focus on applications and materials, especially the type of nanocellulose.

Data-mining procedure to compile and analyze patent document indicators
Technological indicators were developed by applying a data-mining procedure to a dataset of 500 full-text patent documents on nanocellulose filed at the United States Patent and Trademark Office (USPTO) [30] from 2000 to 2012; the retrieval procedure, databases used and other criteria are described in Topic 2.3. An additional analytical element was the identification of the top 15 most cited patent documents using the Derwent Innovations Index website [31]; the citation counts were obtained according to the procedure also explained in detail in Topic 2.3 II.
The indicators were compiled and analyzed according to the guidelines and procedures recommended by the OECD Patent Statistics Manual [12] and by the field of quantitative studies [10,14,16]. The indicators were developed with the support of the VantagePoint® (version 5.0) software. Data-mining techniques were applied to select the filing years, priority countries, assignees' names and International Patent Classification (IPC) codes from the sample. The annual number of patents filed in the USA was compared with the worldwide number III. The technological subject was depicted using the subclass level of the IPC codes.
I The USPTO is the federal agency that issues patents and trademarks in the USA.
II The citation analysis was conducted on July 16th, 2014; it therefore counted the citations that the documents had received up to that date.
III The worldwide value considered the patent records retrieved from the Derwent Innovations Index database.
Assignees' names were standardized because this information is not controlled at filing; the name adopted was the one found on the assignee's website, as recommended by the OECD Patent Statistics Manual [12]. These websites were also consulted for further information about each assignee. Only assignees with five or more patents were ranked. An activity profile was developed for each assignee by considering its main subject categories and the mined terms that characterized the type of nanocellulose and the application (obtained according to the procedure in Topic 2.2). For the top cited patents, a profile was developed taking into account the number of citations, priority year, assignee name, subject categories and mined terms. Co-assigned patents were also identified as a way to detect collaborative developments [11,12].
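The standardization, ranking and co-assignee steps can be sketched as follows. This is a minimal Python illustration, not the VantagePoint workflow used in the study: the alias table, the raw name variants and the function names are hypothetical, and in practice the standardized names would come from the assignees' websites.

```python
from collections import Counter
from itertools import combinations

# Hypothetical alias table: normalized raw variants -> standardized name.
ALIASES = {
    "procter & gamble co": "Procter & Gamble",
    "the procter and gamble company": "Procter & Gamble",
    "upm kymmene corp": "UPM-Kymmene",
    "stora enso oyj": "Stora Enso",
}

def standardize(name):
    """Map a raw assignee name to its standardized form, if one is known."""
    key = " ".join(name.lower().replace(".", "").replace(",", "").split())
    return ALIASES.get(key, name.strip())

def rank_assignees(patents, min_patents=5):
    """Rank assignees by patent count, keeping those at or above the threshold."""
    counts = Counter()
    for assignees in patents:
        # A set ensures each assignee is counted once per patent.
        counts.update({standardize(a) for a in assignees})
    return [(n, c) for n, c in counts.most_common() if c >= min_patents]

def co_assigned_pairs(patents):
    """Count how often two standardized assignees share a patent."""
    pairs = Counter()
    for assignees in patents:
        names = sorted({standardize(a) for a in assignees})
        pairs.update(combinations(names, 2))
    return pairs
```

Each patent is represented here as a list of raw assignee names; a patent with two or more distinct standardized names contributes to the co-assignee count.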

Text-mining procedure to compile and analyze patent document indicators
Noun-phrase terms related to the types of nanocellulose, namely cellulose nanofibrils (CF), cellulose nanocrystals (CC) and bacterial cellulose (BC), and to the applications of the inventions were extracted from patent titles and abstracts using the following procedure, which is analogous to practices found in other studies [10,14,16]. Natural language processing (NLP) was used to mine phrases from the title and abstract texts [32]. Stop words were removed, and routines were applied to identify and group meaningful terms concerning applications and types of nanocellulose. Because of the uncontrolled language of patent texts, only terms that appeared in at least five documents were considered [16].
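A minimal sketch of the document-frequency filter follows. It replaces the NLP noun-phrase extraction with simple pattern matching against a small, hypothetical thesaurus (`TERM_GROUPS`), so it only illustrates the grouping of term variants and the five-document threshold, not the full procedure.

```python
import re
from collections import Counter

# Hypothetical thesaurus: spelling variants grouped under one canonical term.
TERM_GROUPS = {
    "bacterial cellulose": r"\b(?:bacterial|microbial) cellulose\b",
    "cellulose nanocrystals": r"\bcellulose nanocrystals?\b|\bnanocrystalline cellulose\b",
    "cellulose nanofibrils": r"\bcellulose nanofibrils?\b|\bnanofibrillated cellulose\b",
}

def document_frequency(texts, groups=TERM_GROUPS):
    """Count, for each canonical term, the number of texts that mention it."""
    df = Counter()
    for text in texts:
        lowered = text.lower()
        for term, pattern in groups.items():
            if re.search(pattern, lowered):
                df[term] += 1  # one count per document, however often it occurs
    return df

def frequent_terms(texts, min_docs=5, groups=TERM_GROUPS):
    """Keep only the terms found in at least `min_docs` documents."""
    df = document_frequency(texts, groups)
    return {term: n for term, n in df.items() if n >= min_docs}
```

Here each text would be the concatenated title and abstract of one patent document.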
Although Derwent Innovations Index (DII) patent records are attractive for compiling patent indicators because of their coverage, we sought to explore the potential of the full-text documents to provide useful information, particularly from their original titles and abstracts. Titles and abstracts from the DII records were not used because the database rewrites them to provide more information, which would affect the final analysis. The full-text data were then imported into the data- and text-mining software, and the final sample comprised 500 patent documents after duplicate records were carefully removed. The study therefore drew on two sources of information for the same invention: bibliographic data from the DII records, and titles and abstracts from the full texts retrieved from the USPTO database.
The most cited patents filed in the USA were retrieved by repeating the search in the DII database, this time using the US patent number (the same number used to retrieve the full-text document from the USPTO database). After the search, the patents were sorted from most cited to least cited, and the top 15 were selected for assessment.
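The ranking step amounts to a descending sort on citation counts. The sketch below assumes the counts are already available as a mapping from patent number to number of citations; the function name, the tie-breaking rule (by patent number) and the placeholder identifiers are hypothetical.

```python
def top_cited(citations, n=15):
    """Return the n most cited patents as (patent_number, citations) pairs.

    Ties are broken by patent number so that the ranking is reproducible.
    """
    ranked = sorted(citations.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Hypothetical counts with placeholder identifiers.
example = {"US-A": 3, "US-B": 0, "US-C": 7, "US-D": 3}
top_two = top_cited(example, n=2)
```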

Statistical calculations performed
The annual growth rate ($G_i$) was obtained using Equation 1, where $N_i$ is the number of patent documents in year $i$ and $N_{i-1}$ is the number of patent documents in the preceding year ($i-1$).
$$G_i = \left( \frac{N_i - N_{i-1}}{N_{i-1}} \right) \times 100 \qquad (1)$$
The share ($S$) of patent documents was calculated using Equation 2, where $S_i$ is the number of patent documents from an assignee, technological subject, application or type of nanocellulose ($i$) and $S_t$ is the total number of patent documents in the context of the analysis ($t$).
$$S = \frac{S_i}{S_t} \times 100 \qquad (2)$$
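As a worked illustration, the two indicators can be computed as below. The sketch assumes that Equation 1 is the usual year-over-year percentage change, 100 (N_i - N_{i-1}) / N_{i-1}, and that Equation 2 expresses the share as a percentage, 100 S_i / S_t; the function names are hypothetical.

```python
def annual_growth_rates(counts_by_year):
    """Percentage growth of each year relative to the preceding year.

    Years whose preceding year is missing or has zero documents are skipped,
    because the rate is undefined in those cases.
    """
    rates = {}
    for year in sorted(counts_by_year):
        previous = counts_by_year.get(year - 1)
        if not previous:  # missing year or zero count
            continue
        rates[year] = (counts_by_year[year] - previous) / previous * 100
    return rates

def share(part, total):
    """Percentage of `total` documents that fall in the category `part`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return part / total * 100
```

For instance, `share(329, 500)` gives about 65.8, the percentage of sample documents that received at least one citation.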

Evolution, priority countries and technological subjects
The number of patent documents increased rapidly from 2000 to 2010, both worldwide and at the USPTO, as shown in Figure 1; this might be a consequence of advances in nanocellulose knowledge and of the promising forecast market size for this nanomaterial [1-5]. The decrease in 2011 and 2012 might be due to the regular 18-month period during which patent applications remain unpublished and to possible database indexing delays [35]. Interestingly, the patents filed at the USPTO accounted on average for 36% of the total number of nanocellulose patent applications worldwide, with a peak of 85% in 2004. This result highlights the importance of the USA market for nanocellulose.
The priority countries of the patents filed at the USPTO were mainly major producers of cellulose and pulp (the USA, Japan, India, Finland, Sweden, Germany and Brazil) [36], as shown in Figure 2. As expected, the USA had a home advantage in the number of patents accumulated over the period of analysis [19], but the presence of other countries and regions suggests that they were also interested in the American market. The absence of Canada, an important country for the development of nanocellulose [5], may be a consequence of the recurrent practice of its assignees, who prefer to file their patents first at the USPTO and then extend the protection to Canada [12].
Medical, dental or toilet preparations (A61K) accounted for 51% of the patents filed at the USPTO, although this subject was not identified as a relevant technological topic for nanocellulose [6]. In part, this can be explained by the fact that the technological development of this subject decreased in the 2008-2011 period V, as shown in Figure 3 VI, and that the nanomaterial is not the focus of development. The number of patent documents on the use of nanocellulose in the preparation of food and beverages (A23L) also declined in the same period. On the other hand, a considerable increase between the 2004-2007 and 2008-2011 periods was observed for technological subjects related to materials engineering and processes. The number of patents in subject D21H (pulp compositions or paper treatment) grew 533%, which might be a consequence of pulp industry interests, while C08B (polysaccharides) rose 236%, suggesting that the nanomaterial became the focus of development. Layered products (B32B) and macromolecular compositions (C08L) rose 164% and 90.9%, respectively, between the same periods; these subjects can be associated with nanocellulose surface modifications and composite materials, respectively. Furthermore, fermentation and enzyme-using processes (C12P) increased 100%, which is partly associated with bacterial cellulose or with nanocellulose as a secondary product of biofuel production.
V Although not all patent documents from 2011 had been indexed when the search was conducted, the observed trends seem unlikely to change.
VI See the complete meaning of each technological subject at www.wipo.int/ipcpub/
The mined terms concerning the types of nanocellulose could be associated with the top technological subjects, as can be seen in Table 1, despite the low number of patent documents found. Bacterial cellulose was associated with all subjects and was the preferred type for A61K, A61P, A61L, C12P and A23L. This may be a consequence of its peculiar characteristics compared with nanocelluloses obtained by top-down approaches: bacterial cellulose has higher purity and water absorption capacity [2]. Moreover, the subject C12P is consistent with the bottom-up process used to obtain bacterial cellulose.
Pulp compositions or paper treatments (D21H) were linked mainly to cellulose nanofibril developments, as a consequence of the mechanical defibrillation of pulp. The area of polymers (C08L) and compounding processes (C08J) appeared with all types, suggesting that developments in composite materials did not claim a specific type of cellulose-based nanomaterial as reinforcement, despite the apparent importance of this area for cellulose nanocrystals. There was also no preferred type for layered products (B32B).
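The period-to-period growth by IPC subclass discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the subclass is taken as the first four characters of an IPC code (for example, "D21H 11/18" gives "D21H"), the 2000-2003 period is inferred from the two periods named in the text, the year attached to each document is whichever year the analysis uses, and all names are hypothetical.

```python
from collections import Counter

# 2004-2007 and 2008-2011 are named in the text; 2000-2003 is an assumption.
PERIODS = [(2000, 2003), (2004, 2007), (2008, 2011)]

def period_of(year):
    """Return the label of the period containing `year`, or None."""
    for start, end in PERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    return None

def subclass_counts_by_period(documents):
    """Count documents per (subclass, period); documents are (year, codes) pairs."""
    counts = Counter()
    for year, codes in documents:
        period = period_of(year)
        if period is None:
            continue
        # A set ensures a subclass is counted once per document.
        for subclass in {code.strip().upper()[:4] for code in codes}:
            counts[(subclass, period)] += 1
    return counts

def growth_between(counts, subclass, earlier, later):
    """Percentage growth of a subclass between two periods (None if undefined)."""
    before = counts[(subclass, earlier)]
    if before == 0:
        return None
    return (counts[(subclass, later)] - before) / before * 100
```

With purely hypothetical counts, a subclass that goes from 3 documents in one period to 19 in the next would show a growth of about 533%.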

Main assignees patenting profile
The main assignees together held only 26% of the patent documents analyzed, as shown in Table 2. Patenting behavior may differ across these assignees because of their own interests and strategies [12,17]. At least one application addressed by the patents of each assignee could be identified from the key terms extracted, as shown in Table 2, even though the number of patents associated with that application was in most cases low compared with the assignee's total. In many cases the type of nanocellulose involved could also be recognized, except for the pharmaceutical companies (Pfizer, Ranbaxy Lab., Aurobindo Pharma. and Teva Pharma.) and for FMC Co., Kimberly-Clark and Swetree Tech. The lack of information about the type of nanocellulose may be part of their patenting strategy [12] or may reflect limitations of the methodological procedures adopted in this study. Nevertheless, the mined terms improved the analysis because they suggest which types of nanocellulose some assignees prefer.
The nanocellulose patents from Procter & Gamble, whose interest lies in consumer goods [37], were associated with detergent compositions, a result visible in both the technological subject and the application terms. Bacterial cellulose was found in only 12.5% of its patent documents. In turn, the patent documents from FMC Co., a chemical manufacturing company [38], were related to a broad portfolio of applications, including medical, dental or toiletry preparations (A61K), cosmetics, and food and beverages. Additionally, FMC Co. produces Avicel®, a product based on microcrystalline cellulose commonly used in pharmaceutical compositions, which also underlines the company's nanocellulose-related interests [39]. Kimberly-Clark is a company that produces paper-based consumer products for personal care [40]; 77.8% of its patent documents were related to medical devices (A61F), while 44.4% indicated absorbent articles as the application.
The patents from FPInnovations, a research center working with the sustainable forestry, pulp, paper and packaging industries [41], mainly addressed polysaccharides and derivatives (C08B) as well as macromolecular compositions (C08L). The types of nanocellulose involved in its technologies included both cellulose nanocrystals and cellulose nanofibrils, while the main applications mapped were adhesives, hydrogels and nanocomposites. The patent documents assigned to the University of Texas mainly addressed medical device applications using bacterial cellulose, and its technologies involved fermentation processes, microorganisms and enzyme technologies. UT-Battelle is a private non-profit company created by a partnership between the University of Tennessee and the Battelle Memorial Institute [42]. It showed a broad technological profile, including preparations for medical, dental or toilet purposes (A61K) as well as implantable filters, prostheses and other medical devices (A61F). It also had patents on processes for the conversion of energy (H01M), which is aligned with its role of managing and operating the Oak Ridge National Laboratory for the U.S. Department of Energy [42]. The mined titles and abstracts also showed that 75% of its technologies included bacterial cellulose as the type of nanocellulose and that 37.5% indicated nanocomposite applications.
The Finnish companies Stora Enso and UPM-Kymmene and the Swedish company Swetree Technologies are linked to forest products, pulp and paper [43-45], and their patent interests were associated with pulp compositions. However, the nanocellulose applications mapped were different: Stora Enso focused on coatings, UPM-Kymmene on drilling compositions and Swetree Tech. on nanocomposites. Cellulose microfibrils were present in the technologies of the Finnish companies, which also co-assigned two patent applications on the manufacture of nanostructured paper or board for printing applications in the period of analysis. Finally, the French company Sofradim Products [46] showed interest in fermentation processes (C12P) and in methods or apparatus for sterilizing materials or objects (A61L), while the application claimed was clearly related to medical devices (100%) using bacterial cellulose (66.7%).

Analysis of the highly cited patents from the sample
Of the 500 US patent documents in the sample analyzed, 329 (65.8%) had received at least one citation by July 2014. Naturally, older documents tend to have more citations because there has been more time to cite them. The top 15 most cited patents had at least 19 citations each and covered a variety of technological developments in nanocellulose, as can be seen in Table 3. Companies owning highly cited patents (Table 3) were also among the assignees with the greatest number of patents (Table 2). Most of the assignees of these highly cited patents are from the USA, except for two from Japan (University of Tokyo and Sharp) and one from Switzerland (Roche). Moreover, universities seemed to play an important role in the development of new nanocellulose technologies, since they were assignees of one third of the patents analyzed. This agrees with other authors' statements about the strong dependence of technological advances in nanocellulose on science [5,10,12].
The content of each document could be understood from the technological subject and, in some cases, from the application and type-of-nanocellulose terms mined from the title and abstract. For instance, the most cited patent refers to compostable containers (application term and B65D code) made of polymers (C08L, C08K) as layered products (B32B), with many processes involved (see subjects B05D, B29C, C08J and D21J). Indeed, it belongs to an American company that also owns a trademark for biodegradable containers [47]; however, it was not possible to identify the type of nanocellulose used in the product. Another example is the second most cited document, which claimed an electronic paper comprising a layered product (B32B), a device controlled by optical characteristics (G02F) and circuits (G09G).
Seven patent documents were related to medical, dental or toilet preparations (A61K). Two of them claimed pharmaceutical compositions and another two claimed absorbent articles as the application of the invention. Bacterial cellulose emerged as the preferred type of nanocellulose in three documents. Five documents were classified under layered products (B32B), and cellulose nanofiber was claimed as the nanomaterial in one of them. Medical devices (A61F) appeared as a subject in four documents; bacterial cellulose was identified in one of them and, again, absorbent articles were the application found in two. Bacterial cellulose was the only type that could be identified in two patent documents (US2007027108 and US2010210501), and fermentation processes (C12P) were present in three documents, even though one claimed biofuels as the application. On the other hand, cellulose nanofiber was claimed in only one document (US2010233481), but this is a generic name that does not indicate the manufacturing approach of the nanocellulose used. Concerning biofuels, two technologies linked nanocellulose to this application (US2007178569 and US2008190013), which suggests that nanocellulose would be a secondary product of the invention.

Conclusion
This paper has given an account of the technological monitoring of nanocellulose, focusing on the patenting activity in patent applications filed at the United States Patent and Trademark Office (USPTO) from 2000 to 2012. The evolution and priority countries were assessed, and a profile containing the assignee name, priority year, technological subject and terms mined from titles and abstracts was used to evaluate the interests of the main assignees and the top cited patents.
A rapid annual growth of patenting activity was observed in the period, accompanying the worldwide trend. Patent applications at the USPTO accounted on average for 36% of annual world applications, which is due at least partially to the forecast size of the USA nanocellulose market and to the emergence of developments in the nanomaterial. The patents filed at the USPTO came not only from USA assignees but also from other countries and regions, mainly Europe, Japan and India. The highlighted assignees were from the USA, Japan, Finland, India, Canada, France and Sweden. The outcomes also showed the dependence of nanocellulose technologies on science, because universities and research institutes were present among the main assignees and the most cited documents.
Medical, dental or toilet preparations accounted for 51% of the patents filed at the USPTO. However, this subject started decreasing in 2008, whereas technological subjects related to materials and processes increased strongly after 2008. Both the assignee and the top cited document profiles provided evidence of the use of nanocellulose in a variety of applications, including cosmetic and drug formulations, biodegradable compositions, absorbent articles, methods to produce nanocellulose and products containing it, and technologies regarding biofuels with nanocellulose as a secondary product. Composite materials and surface treatments tend to increase as subjects of nanocellulose technological development in the near future. Although technologies involving bacterial cellulose appear to be more developed, technologies containing cellulose nanofibrils or nanocrystals have emerged consistently in recent years, probably as a consequence of the establishment of pilot plants worldwide and in the USA.
The data- and text-mining tools and complementary procedures used in this study to compile and support the analysis of the indicators brought advances in highlighting the types of nanocellulose and their applications. Moreover, text-mining indicators could complement the technological subjects and enhance the monitoring assessment. However, limitations to be overcome in future research were also observed. For example, it was not possible to identify the type of nanocellulose for all patent documents, and when it was identified, the number of patent documents involved was low. The absence of information may also be part of the assignees' strategy or may indicate that the nanomaterial is not the focus of development. Even so, applying the text-mining approach to other parts of the patent documents, such as the detailed description of the invention and/or the patent claims, could improve the outcomes in future research.

Figure 1. Annual number of patent documents in nanocellulose worldwide and at the USPTO and the relation between them from 2000 to 2010. Source: USPTO and DII.

Figure 2. The main priority countries of patent documents filed at the USPTO. Source: DII.

Table 1. Types of nanocellulose for the top technological subjects.

Table 2. Main assignee profile compiled from patent documents in nanocellulose.

Table 3. Profile of the top 15 most cited patents from the sample analyzed.