Seyfi et al. [11 Seyfi A, Patel A. A focused crawler combinatory link and content model based on T-Graph principles. Comput Stand Interfaces. 2016;43:1-11; 12 Seyfi A, Patel A, Celestino Júnior J. Empirical evaluation of the link and content-based focused Treasure-Crawler. Comput Stand Interfaces. 2016;44:54-62.] |
DDS, VSM |
Full page text, anchor text, sub-section heading (ISH), section heading containing the ISH, main heading, data around the link, and target information |
TF-IDF |
14% generic seed URLs and 22% on-topic seed URLs |
Recall, Harvest Rate |
0.27 |
Yajun et al. [10 Liu WJ, Du YJ. A novel focused crawler based on cell-like membrane computing optimization algorithm. Neurocomputing [Internet]. 2014;123:266-80. Available from: http://dx.doi.org/10.1016/j.neucom.2013.06.039] |
CMCOA, VSM |
Full page text, anchor text, title text and surrounding text of paragraphs |
Evolution regular, communication regular, TF-IDF |
3 Seed URLs for each topic |
Harvest Rate, Average Relevance and Average Errors |
0.297 |
Almpanidis et al. [8 Almpanidis G, Kotropoulos C, Pitas I. Combining text and link analysis for focused crawling-An application for vertical search engines. Inf Syst. 2007;32(6):886-908.] |
VSM and HITS |
Full page text and anchor text |
LSI |
|
Precision, Recall and Harvest rate |
0.21 |
Chen et al. [9 Chen Z, Ma J, Lei J, Yuan B, Lian L, Song L. A cross-language focused crawling algorithm based on multiple relevance prediction strategies. Comput Math with Appl [Internet]. 2009;57(6):1057-72. Available from: http://dx.doi.org/10.1016/j.camwa.2008.09.021] |
VSM |
Full page text, anchor text, URL address and link structure |
TF |
3 Seed URLs for each topic |
Harvest Rate, Sum of info, average running time |
0.22 |
Mani Sekhar et al. [14 Mani Sekhar SR, Siddesh GM, Manvi SS, Srinivasa KG. Optimized focused Web Crawler with Natural Language Processing based relevance measure in bioinformatics web sources. Cybern Inf Technol. 2019;19(2):146-58.] |
VSM |
Full page text and anchor text |
TF-IDF |
|
Operating time and harvest rate |
0.21 |
Farag et al. [13 Farag MMG, Lee S, Fox EA. Focused crawler for events. Int J Digit Libr. 2018;19(1):3-19.] |
VSM |
Topic, Date and Location |
TF-IDF, Regular Expression and NER |
38 Seed URLs for each topic |
Precision, Recall, F1-Measure and Harvest Rate |
0.26 |
Singh et al. [15 Singh B, Kumar Gupta D, Mohan Singh R. Improved Architecture of Focused Crawler on the basis of Content and Link Analysis. Int J Mod Educ Comput Sci. 2017;9(11):33-40.] |
VSM |
Full page text and link context |
TF-IDF |
10 Seed URLs for each topic |
Harvest Rate |
0.21 |
Geng et al. [16 Geng Z, Shang D, Zhu Q, Wu Q, Han Y. Research on improved focused crawler and its application in food safety public opinion analysis. 2017 Chinese Autom Congr [Internet]. 2017;2847-52. Available from: http://ieeexplore.ieee.org/document/8243261/] |
VSM and multifactor correlation coefficient |
Full page text and crawler theme |
TF-IDF |
|
Harvest Rate, Precision and Recall |
0.22 |
Xu et al. [17 Xu G, Jiang P, Ma C, Daneshmand M. A Focused Crawler Model Based on Mutation Improving Particle Swarm Optimization Algorithm. Proc - 2018 IEEE Int Conf Ind Internet, ICII 2018. 2018;(Icii):173-4.] |
Particle Swarm Optimization |
Full page text, anchor text, surrounding text and URL text |
TF-IDF |
100 Seed URLs for each topic |
Harvest Rate |
0.29 |
Rungsawang et al. [7 Rungsawang A, Angkawattanawit N. Learnable topic-specific web crawler. J Netw Comput Appl. 2005;28(2):97-114.] |
VSM and BHITS |
Title, full page text, anchor text and link context |
TF-IDF |
10 Seed URLs for each topic |
Harvest Rate |
0.27 |
Kumar et al. [18 Kumar M, Vig R. Learnable Focused Meta Crawling Through Web. Procedia Technol [Internet]. 2012;6(1994):606-11. Available from: http://dx.doi.org/10.1016/j.protcy.2012.10.073] |
VSM, Hub score and Authority score |
Full page text and anchor text |
TF-IDF |
Seed URLs are generated from ODP |
Harvest Rate |
0.26 |
Goyal et al. [19 Goyal N, Bhatia R, Kumar M. A genetic algorithm based focused web crawler for automatic webpage classification. IET Conf Publ. 2016;2016(CP739).] |
Genetic algorithm |
Title, full page text, anchor text, paragraph text, list text, bold text and heading text |
Cosine similarity |
http://www.stanford.edu/ is crawled up to depth 6. |
Harvest Rate |
0.26 |
Zhao et al. [20 Zhao F, Zhou J, Nie C, Huang H, Jin H. SmartCrawler: A two-stage crawler for efficiently harvesting deep-web interfaces. IEEE Trans Serv Comput. 2016;9(4):608-20.] |
Cosine Similarity |
Full page text, Context of URL, anchor text, and text around URL |
TF-IDF |
100 Seed URLs for each topic |
Harvest Rate |
0.29 |
Jung-ran Park et al. [21 Park JR, Yang C, Tosaka Y, Ping Q, Mimouni H El. Developing an automatic crawling system for populating a digital repository of professional development resources: A pilot study. J Electron Resour Librariansh. 2016;28(2):63-72.] |
Cosine similarity and HITS |
Full page text and anchor text |
TF-IDF |
15 Seed URLs for each topic |
Harvest Rate |
0.26 |
Chen et al. [22 Chen X, Zhang X. HAWK: A focused crawler with content and link analysis. IEEE Int Conf E-bus Eng ICEBE'08 - Work AiR'08, EM2I'08, SOAIC'08, SOKM'08, BIMA'08, DKEEE'08. 2008;677-80.] |
Cosine similarity and PageRank |
Full page text and anchor text |
TF-IDF |
Seed URLs are generated from ODP |
Harvest Rate |
0.29 |
Rawat et al. [23 Rawat S, Patil DR. Efficient focused crawling based on best first search. Proc 2013 3rd IEEE Int Adv Comput Conf IACC 2013. 2013;908-11.] |
Cosine Similarity |
Full page text and anchor text |
TF-IDF |
|
Harvest Rate |
0.21 |
Hati et al. [24 Hati D, Sahoo B, Kumar A. Adaptive focused crawling based on link analysis. ICETC 2010 - 2010 2nd Int Conf Educ Technol Comput. 2010;4:455-60.] |
Cosine Similarity |
Full page text, anchor text, cohesive text, and relevance score of parent pages |
TF-IDF |
1 Seed URL for each topic |
Harvest Rate |
0.27 |
Mangaravite et al. [25 Mangaravite V, Tavares De Assis G, Ferreira AA. Improving the efficiency of a genre-aware approach to focused crawling based on link context. Proc - 2012 8th Lat Am Web Congr LA-WEB 2012. 2012;17-23.] |
Cosine Similarity |
Full page text, anchor text, title text and URL text |
TF-IDF |
3 Seed URLs for each topic |
Harvest Rate |
0.27 |
Wei et al. [26 Zhao W, Guan Z, Cao Z, Liu Z. Mining and harvesting high quality topical resources from the web. Chinese J Electron. 2016;25(1):48-57.] |
VSM, cash gain and RVM |
Full page text and link context |
TF-IDF |
10 Seed URLs for each topic |
Harvest rate, precision and recall |
0.34 |
Gupta et al. [27 Gupta S, Duhan N, Bansal P. An Approach for Focused Crawler to Harvest Digital Academic Documents in Online Digital Libraries. Int J Inf Retr Res. 2019;9(3):23-47.] |
Cosine Similarity |
Full page text, keyword text and title text |
TF-IDF |
15-20 Seed URLs for each topic |
Precision and Recall |
|