How to read the text of a pdf file?



Momergil
12th October 2011, 22:12
Hello!

I already know how to read text from a .txt file using QFile, QTextStream and so forth, but I don't know how to open a .pdf file and read its content. I recently tried to do it the same way, and what I got was:


"%PDF-1.5
%µµµµ
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(pt-BR) /StructTreeRoot 8 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 3 0 R] >>
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.2 841.92] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
4 0 obj
<</Filter/FlateDecode/Length 147>>
stream
xœM±
Â0„÷@ÞáÆ?C“üià 8”M«(²AÁÇ7YTnø8à ®Æ’Ù£ëÌ·#lßc#†$â ¦Y3˜µmÎR0lN&¶Õu@^7é&…Å¥ÔFŠ…’ªxE'U3 ½Ÿªj讜#djé˜ÛÑ €×Uq *H;)¦,+¯·Úûņ/~[LsÄì_#ó
endstream
endobj
5 0 obj
<</Type/Font/Subtype/TrueType/Name/F1/BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FontDescriptor 6 0 R/FirstChar 32/LastChar 120/Widths 14 0 R>>
endobj
6 0 obj
<</Type/FontDescriptor/FontName/Times#20New#20Roman/Flags 32/ItalicAngle 0/Ascent 891/Descent -216/CapHeight 693/AvgWidth 401/MaxWidth 2568/FontWeight 400/XHeight 250/Leading 42/StemV 40/FontBBox[ -568 -216 2000 693] >>
endobj
7 0 obj
<</Author(Martin)/Creator(þÿ

while the actual text in the document was "texto aqui".

So how do I open a .pdf file and read its content from within a Qt application?

Thanks!

Momergil

ChrisW67
13th October 2011, 00:44
The same way you would without Qt: you use a third-party library or utility, e.g. Poppler (http://poppler.freedesktop.org/) or PoDoFo (http://podofo.sourceforge.net/about.html), or you write your own based on the public PDF reference material (http://www.adobe.com/devnet/pdf/pdf_reference.html). Qt does not contain any ability to interpret PDF files.

Momergil
13th October 2011, 14:58
Hello Chris,

thanks very much.


God bless,

Momergil

peterlee
22nd January 2016, 06:23
Did you mean to extract text from pdf files (http://www.pqscan.com/extract-text/)? I wonder whether there is any difference between pdf text extraction and pdf to text conversion (http://www.pqscan.com/pdf-to-text/)? Which approach is simpler and faster? Any suggestions would be appreciated. Thanks in advance.



Best regards,
Lee

anda_skoa
22nd January 2016, 10:45
Running a converter tool is likely easier, as this just means running a child process and gathering its output or reading its result file.
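
As a rough illustration of the child-process approach, here is a minimal sketch in plain C++. It assumes a POSIX system (for popen) and that the pdftotext utility from Poppler is installed and on the PATH; passing "-" as the output file makes pdftotext write to standard output. The names shellQuote, runAndCapture and pdfToText are made up for this example. In a Qt application, QProcess would be the more natural way to do the same thing.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Wrap a string in single quotes so a POSIX shell treats it as one argument.
std::string shellQuote(const std::string &arg) {
    std::string quoted = "'";
    for (char c : arg) {
        if (c == '\'')
            quoted += "'\\''";  // close the quote, add an escaped quote, reopen
        else
            quoted += c;
    }
    quoted += "'";
    return quoted;
}

// Run a shell command as a child process and return everything it
// writes to standard output. Throws if the command cannot be started
// or finishes with a non-zero status.
std::string runAndCapture(const std::string &command) {
    FILE *pipe = popen(command.c_str(), "r");  // POSIX, not standard C++
    if (!pipe)
        throw std::runtime_error("could not start: " + command);

    std::string output;
    std::array<char, 4096> buffer;
    std::size_t n;
    while ((n = fread(buffer.data(), 1, buffer.size(), pipe)) > 0)
        output.append(buffer.data(), n);

    if (pclose(pipe) != 0)
        throw std::runtime_error("command failed: " + command);
    return output;
}

// Hypothetical helper: extract the text of a PDF by running pdftotext
// (assumed to be installed) and gathering its output.
std::string pdfToText(const std::string &pdfPath) {
    return runAndCapture("pdftotext " + shellQuote(pdfPath) + " -");
}
```

Calling pdfToText("document.pdf") would then hand back the extracted text as a single string, without any of the page or layout information a library could provide.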

Using a PDF library is more code but also potentially gives you information on paging, formatting, etc.

Cheers,
_