How to read the text of a pdf file? [Archive] - Qt Centre Forum

Momergil

12th October 2011, 22:12

Hello!

I already know how to read text from a .txt file using QFile, QTextStream and so forth, but I don't know how to open a .pdf file and read its content. I recently tried reading one the same way, and what I got was:

"%PDF-1.5
%µµµµ
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(pt-BR) /StructTreeRoot 8 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 3 0 R] >>
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.2 841.92] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
4 0 obj
<</Filter/FlateDecode/Length 147>>
stream
[147 bytes of FlateDecode-compressed binary data, not printable as text]
endstream
endobj
5 0 obj
<</Type/Font/Subtype/TrueType/Name/F1/BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FontDescriptor 6 0 R/FirstChar 32/LastChar 120/Widths 14 0 R>>
endobj
6 0 obj
<</Type/FontDescriptor/FontName/Times#20New#20Roman/Flags 32/ItalicAngle 0/Ascent 891/Descent -216/CapHeight 693/AvgWidth 401/MaxWidth 2568/FontWeight 400/XHeight 250/Leading 42/StemV 40/FontBBox[ -568 -216 2000 693] >>
endobj
7 0 obj
<</Author(Martin)/Creator(þÿ

while the text inside was "texto aqui".

So how do I open a .pdf file and read its content in a Qt application?

Thanks!

Momergil

ChrisW67

13th October 2011, 00:44

The same way you would without Qt: use a third-party library or utility, e.g. Poppler (http://poppler.freedesktop.org/) or PoDoFo (http://podofo.sourceforge.net/about.html), or write your own parser based on the public PDF reference material (http://www.adobe.com/devnet/pdf/pdf_reference.html). Qt does not contain any ability to interpret PDF files.
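To illustrate why reading the file as plain text is not enough, here is a rough sketch in standard C++ that inspects the raw bytes of a PDF. It only reads the version from the "%PDF-x.y" header and counts how often the /FlateDecode filter name appears, which hints at how many streams are compressed. The function names pdfVersion and countName are made up for this example, and this is nowhere near a real parser: getting the actual text out still means decompressing and interpreting the content streams, which is what libraries like Poppler do for you.

```cpp
#include <cstddef>
#include <string>

// Returns the version from a leading "%PDF-x.y" header (e.g. "1.5"),
// or an empty string if the data does not start with a PDF header.
// Hypothetical helper for illustration only.
std::string pdfVersion(const std::string& data) {
    const std::string magic = "%PDF-";
    if (data.compare(0, magic.size(), magic) != 0) {
        return "";
    }
    // The version runs from the end of the magic to the end of the first line.
    const std::size_t end = data.find_first_of("\r\n", magic.size());
    const std::size_t len = (end == std::string::npos)
                                ? std::string::npos
                                : end - magic.size();
    return data.substr(magic.size(), len);
}

// Counts non-overlapping occurrences of a PDF name such as "/FlateDecode".
// This is a plain substring scan, so it is only a rough indicator of how
// many streams are compressed, not a structural parse.
std::size_t countName(const std::string& data, const std::string& name) {
    if (name.empty()) {
        return 0;
    }
    std::size_t count = 0;
    for (std::size_t pos = data.find(name); pos != std::string::npos;
         pos = data.find(name, pos + name.size())) {
        ++count;
    }
    return count;
}
```

Run against the dump in the first post, pdfVersion would report 1.5 and countName(data, "/FlateDecode") would find the filter on object 4, the page's content stream. That is where "texto aqui" lives, in compressed form.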

Momergil

13th October 2011, 14:58

Hello Chris,

thanks very much.

God bless,

Momergil

peterlee

22nd January 2016, 06:23

Did you mean extracting text from PDF files (http://www.pqscan.com/extract-text/)? I wonder whether there is any difference between PDF text extraction and PDF-to-text conversion (http://www.pqscan.com/pdf-to-text/). Which approach is simpler and faster? Any suggestions will be appreciated. Thanks in advance.

Best regards,
Lee

anda_skoa

22nd January 2016, 10:45

Running a converter tool is likely easier, as this just means running a child process and gathering its output or reading its result file.
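A minimal sketch of the child-process approach, assuming a POSIX system and that Poppler's pdftotext utility is installed and on the PATH. Passing "-" as the output file makes pdftotext write the text to stdout. The helper names shellQuote, runAndCapture, pdfToTextCommand and extractPdfText are invented for this example, and error handling is kept to the bare minimum.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Wraps an argument in single quotes for a POSIX shell, escaping any
// embedded single quotes, so paths with spaces are passed intact.
std::string shellQuote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') {
            out += "'\\''";  // close quote, escaped quote, reopen quote
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

// Runs a shell command as a child process and returns everything it
// wrote to stdout. Throws if the process cannot be started or exits
// with a non-zero status.
std::string runAndCapture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");  // POSIX, not ISO C++
    if (!pipe) {
        throw std::runtime_error("popen failed for: " + command);
    }
    std::string output;
    std::array<char, 4096> buffer;
    std::size_t n = 0;
    while ((n = std::fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), n);
    }
    if (pclose(pipe) != 0) {
        throw std::runtime_error("command failed: " + command);
    }
    return output;
}

// Builds the converter command line; "-" sends the text to stdout.
std::string pdfToTextCommand(const std::string& pdfPath) {
    return "pdftotext " + shellQuote(pdfPath) + " -";
}

// Extracts the text of a PDF by delegating to the external converter.
std::string extractPdfText(const std::string& pdfPath) {
    return runAndCapture(pdfToTextCommand(pdfPath));
}
```

In a Qt application, QProcess would be the more natural way to do the same thing: start the program with an argument list, wait for it to finish and call readAllStandardOutput(). That also avoids going through a shell. popen is used here only to keep the sketch free of dependencies.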

Using a PDF library is more code but also potentially gives you information on paging, formatting, etc.

Cheers,
_
