pdfbox - How to get style information of elements in PDF using Apache Tika?
I am playing around with Apache Tika to extract text from PDF files. I would like to know how to get style information such as font size, text color, and whether a specific piece of text (a few words) is in italics, bold, etc. using Apache Tika.
Is it possible to get this type of information?
Also, is it possible to get table information using Apache Tika? For example, information about the start of a table, the start of the first row, the first cell, etc.
It may be more convenient to use an API such as PDFTextStream. Tika extracts the raw textual information from a PDF, while PDFTextStream gives you structured text correlated with info such as character encoding, height, region of the text, etc.
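As a rough sketch of what you could do once a library hands you per-character info: text in a PDF usually carries a font reference rather than explicit bold/italic markup, so style is often guessed from the font name. The font name itself would have to come from whichever library exposes it (with PDFBox, for example, `TextPosition.getFont().getName()` inside a `PDFTextStripper` subclass). The class and method names below are hypothetical, and the name-matching heuristic is an assumption that will miss fonts with non-descriptive names.

```java
import java.util.Locale;

// Hypothetical helper: guesses bold/italic from a PDF font name.
public class FontStyleGuess {

    // Embedded subset fonts are usually named with a six-letter tag and a
    // plus sign, e.g. "ABCDEF+Arial-BoldMT". Drop that tag if present.
    public static String baseName(String fontName) {
        int plus = fontName.indexOf('+');
        return plus == 6 ? fontName.substring(plus + 1) : fontName;
    }

    // Assumption: a bold face has "bold" somewhere in its name.
    public static boolean looksBold(String fontName) {
        return baseName(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }

    // Assumption: an italic face has "italic" or "oblique" in its name.
    public static boolean looksItalic(String fontName) {
        String n = baseName(fontName).toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    public static void main(String[] args) {
        String[] samples = {"ABCDEF+Arial-BoldMT", "TimesNewRomanPS-ItalicMT", "Helvetica"};
        for (String s : samples) {
            System.out.println(s + " bold=" + looksBold(s) + " italic=" + looksItalic(s));
        }
    }
}
```

This only covers bold and italic. Font size and position would still need to be read from the library's own per-character data, and nothing here addresses table detection.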