Extracting information from PDF documents for use in automatic indexing of e-books

GIL-LEIVA, Isidoro; FUJITA, Mariângela Spotti Lopes; REDIGOLO, Franciele Marques; SARAN, Jordan Ferreira

ORIGINAL • Transinformação 34 • 2022 • https://doi.org/10.1590/2318-0889202234e210069

Abstract

The number of electronic books entering libraries in PDF format grows every day, which complicates some processes traditionally carried out manually by librarians, such as subject assignment, and makes them almost unfeasible. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, this work presents an evaluation of tools for extracting information from books in PDF format that could later be used as raw material for an automatic indexing system. We first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and GROBID). Because PDFAct achieved the best performance, we then carried out a second evaluation of its ability to identify and extract information from the books, such as titles, tables of contents, sections, titles of tables and graphs, and bibliographic references, all of which are relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest.

Keywords
Software evaluation; PDFMiner.six; PDFAct; PDF-extract; PDFExtract; GROBID; Automatic indexing

Software	TOC extraction	Extraction of text parts	Extraction types	Semantic organization	Conversion formats	Open source	Last update
PDFMiner.six	-	-	paragraphs	-	XML, HTML, TXT	yes	March 2021
PDFAct	-	X	words, blocks, lines, pages, paragraphs	X	XML, TXT, JSON	yes	May 2021
PDF-extract	-	X	paragraphs	-	XML	yes	September 2015
PDFExtract	-	-	pages, paragraphs	X	HTML	yes	April 2021
GROBID	-	X	paragraphs	X	XML	yes	June 2021
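
As an illustration of the kind of raw output summarized in the table, the following minimal sketch obtains plain text with PDFMiner.six and splits it into paragraph-like blocks. It is not the procedure used in the evaluation. `extract_text` is part of `pdfminer.high_level`; the helper names `split_blocks` and `extract_blocks` and the file name `book.pdf` are hypothetical, and treating blank lines and form feeds as block boundaries is an assumption about the plain-text output, not a guarantee.

```python
def split_blocks(text):
    """Split plain text into paragraph-like blocks.

    Assumption: blocks are separated by blank lines and pages by form feeds.
    Lines inside a block are joined with single spaces; words hyphenated
    across line breaks are not repaired.
    """
    blocks = []
    for page in text.split("\f"):
        for chunk in page.split("\n\n"):
            lines = [line.strip() for line in chunk.splitlines() if line.strip()]
            if lines:
                blocks.append(" ".join(lines))
    return blocks


def extract_blocks(pdf_path):
    """Extract text from a PDF with PDFMiner.six and return its blocks."""
    # Imported here so split_blocks can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return split_blocks(extract_text(pdf_path))


if __name__ == "__main__":
    # Demonstration on a small in-memory sample instead of a real file.
    sample = "Chapter 1\n\nFirst line\nsecond line\n\n\fPage two\n"
    for block in split_blocks(sample):
        print(block)
```

With pdfminer.six installed, `extract_blocks("book.pdf")` would return the blocks of a real file. The blocks carry no labels such as title or section heading, which is consistent with the dashes in the PDFMiner.six row above.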

Text parts extracted	PDFMiner.six	PDFAct	PDF-extract	PDFExtract	GROBID
TOC	-	-	-	-	-
Title	-	X	X	-	X
Subtitle	-	-	-	-	-
Sections and subsections	-	X	-	-	-
First paragraph of a section/subsection	-	-	-	-	-
Titles of tables and graphs	-	X	-	-	-
References	-	X	X	-	X
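
To make the matrix above easier to compare across tools, the short sketch below counts how many of the seven text parts each tool is marked as extracting. The marks are copied from the table, with row labels given in English; the variable names are illustrative. The count is only a tally of the X marks and is not the score reported for each tool.

```python
# Tools in the column order of the table above.
tools = ["PDFMiner.six", "PDFAct", "PDF-extract", "PDFExtract", "GROBID"]

# One row per text part; "X" means the part is extracted, "-" means it is not.
matrix = {
    "TOC": "- - - - -",
    "Title": "- X X - X",
    "Subtitle": "- - - - -",
    "Sections and subsections": "- X - - -",
    "First paragraph of a section/subsection": "- - - - -",
    "Titles of tables and graphs": "- X - - -",
    "References": "- X X - X",
}

# Count the X marks in each tool's column.
counts = {
    tool: sum(row.split()[i] == "X" for row in matrix.values())
    for i, tool in enumerate(tools)
}

for tool, n in counts.items():
    print(f"{tool}: {n} of {len(matrix)} parts")
```

PDFAct is marked for four of the seven parts, PDF-extract and GROBID for two each, and no tool is marked for the TOC, subtitles, or the first paragraph of a section.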

Tool/library	Score
PDFMiner.six	3
PDFAct	9
PDF-extract	5
PDFExtract	3
GROBID	5

	P	P	P	P	I	I	I	I	E	E	E	E
Corpus	T	TOC	L	R	T	TOC	L	R	T	TOC	L	R
Book 1	3	0	0	0	2	0	1	1	1	0	0	0
Book 2	1	0	0	0	2	0	1	1	0	0	0	0
Book 3	0	0	0	0	3	0	1	1	0	0	0	0
Book 4	2	0	0	0	3	0	0	0	0	0	0	0
Book 5	0	0	1	0	3	0	1	0	0	0	0	0
Book 6	0	0	0	0	3	0	1	0	3	0	0	0
Book 7	2	0	0	0	2	0	1	0	3	0	0	0
Book 8	3	0	0	0	2	0	0	0	1	0	0	0
Book 9	0	0	0	0	2	0	1	1	2	0	0	0
Book 10	1	0	0	0	2	0	1	1	2	0	0	0
Book 11	3	0	0	0	3	0	0	0	1	0	0	0
Book 12	0	0	0	0	1	0	1	0	0	0	0	0
Book 13	0	0	0	0	3	0	1	0	0	0	0	0
Book 14	0	0	0	0	3	0	0	0	0	0	0	0
Book 15	2	0	0	0	2	0	1	0	1	0	0	0
Book 16	0	0	0	0	3	0	1	1	1	0	0	0
Book 17	2	0	0	0	1	0	1	1	0	0	0	0
Book 18	2	0	0	0	2	0	1	1	0	0	0	0
Book 19	0	0	0	0	1	0	1	1	3	0	0	0
Book 20	3	0	0	0	1	0	1	0	0	0	0	0
Total	11	0	1	0	20	0	16	9	10	0	0	0
Mean (%)	55	0	5	0	100	0	80	45	50	0	0	0
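
The summary rows of the table above can be checked with a few lines of code. The sketch below assumes, based only on the numbers themselves and not on any statement in the source, that the Total row counts the books with a non-zero value in a column and that the percentage row is that count divided by the 20 books in the corpus. The three columns shown are copied from the table; the labels such as `P/T` simply combine the two header rows, and the function name `summarize` is illustrative.

```python
# Per-book values for three columns of the table, in corpus order (books 1 to 20).
columns = {
    "P/T": [3, 1, 0, 2, 0, 0, 2, 3, 0, 1, 3, 0, 0, 0, 2, 0, 2, 2, 0, 3],
    "I/L": [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "E/T": [1, 0, 0, 0, 0, 3, 3, 1, 2, 2, 1, 0, 0, 0, 1, 1, 0, 0, 3, 0],
}


def summarize(values):
    """Return (number of books with a non-zero value, that number as a percentage)."""
    total = sum(1 for v in values if v != 0)
    percent = 100 * total / len(values)
    return total, percent


for label, values in columns.items():
    total, percent = summarize(values)
    print(f"{label}: total={total}, percent={percent:.0f}")
```

Under this reading the three columns give 11 (55%), 16 (80%), and 10 (50%), which agrees with the summary rows of the table.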

[1] Correspondence to: I. GIL-LEIVA. E-mail: isgil@um.es.