Text file



wirasto
6th July 2010, 12:30
How can I tell whether a file is a text file (like *.txt, *.c, *.cpp, etc.) rather than a binary file (like *.pdf, *.jpg, *.exe, etc.)?
On Linux the file extension is not important, right? So I think detecting by file extension is not a good way...

Thanks in advance...

borisbn
6th July 2010, 16:19
Usually, text files don't contain characters with codes below 0x20, except for \r, \n and \t.
You can examine each byte in the file, and if there are several bytes below 0x20 (other than \r, \n and \t), you can say that the file is NOT text.
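
A minimal Python sketch of this heuristic, for illustration only. The function names and the max_suspicious threshold are hypothetical; the post only says "several" such bytes, so the cut-off is an assumption you would tune yourself.

```python
# Control bytes below 0x20 that are normal in text: tab, line feed, carriage return.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}


def count_suspicious_bytes(data: bytes) -> int:
    """Count bytes below 0x20 other than tab, LF and CR."""
    return sum(1 for b in data if b < 0x20 and b not in ALLOWED_CONTROL)


def looks_like_text(data: bytes, max_suspicious: int = 0) -> bool:
    """Treat data as text if it has at most max_suspicious control bytes.

    max_suspicious is a hypothetical knob, not something from the post.
    """
    return count_suspicious_bytes(data) <= max_suspicious


def file_looks_like_text(path: str, max_suspicious: int = 0) -> bool:
    """Apply the same check to the whole contents of a file."""
    with open(path, "rb") as f:
        return looks_like_text(f.read(), max_suspicious)
```

Note that this check alone accepts any byte from 0x20 upwards, so it will also accept some binary data, and it rejects UTF-16 text because of its NUL bytes.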

squidge
6th July 2010, 16:30
If by 'text' you mean US-ASCII (like most .c, .cpp, etc. files), then just confirm that every byte in the first 1 KB or so of the file is between 32 and 126 (with the exceptions borisbn mentions: \r, \n and \t).
'.txt' files are more complicated: they could be in Unicode, which makes detection a lot more difficult (unless you only support UTF-8, whose multi-byte sequences are easy to recognise and validate).
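
A rough Python sketch of that idea, assuming you only care about US-ASCII and UTF-8. The names classify_sample and classify_file and the 1024-byte sample size are illustrative choices, not an established API. An incremental decoder is used so that a multi-byte UTF-8 sequence cut off at the end of the sample is not reported as an error.

```python
import codecs

ALLOWED_CONTROL = "\t\n\r"


def is_ascii_text(sample: bytes) -> bool:
    """True if every byte is printable ASCII (32..126) or tab/LF/CR."""
    return all(32 <= b <= 126 or b in (9, 10, 13) for b in sample)


def is_utf8_text(sample: bytes) -> bool:
    """True if the sample is valid UTF-8 without unexpected control characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a sequence truncated at the end of the sample.
        text = decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return not any(
        (ord(ch) < 32 and ch not in ALLOWED_CONTROL) or ch == "\x7f"
        for ch in text
    )


def classify_sample(sample: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'binary' for a chunk of file data."""
    if is_ascii_text(sample):
        return "ascii"
    if is_utf8_text(sample):
        return "utf-8"
    return "binary"


def classify_file(path: str, sample_size: int = 1024) -> str:
    """Classify a file by looking only at its first sample_size bytes."""
    with open(path, "rb") as f:
        return classify_sample(f.read(sample_size))
```

Other Unicode encodings such as UTF-16 will come out as 'binary' here, which matches the caveat that supporting more than UTF-8 is harder.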

Septi
6th July 2010, 17:03
On Linux, there's a utility called 'file' that uses a set of tests to figure out the file type. For instance, on my box:

$ file .bashrc /usr/bin/xfwm4 Videos/it/yasnippet.avi

gives:

.bashrc: ASCII text
/usr/bin/xfwm4: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped
Videos/it/yasnippet.avi: RIFF (little-endian) data, AVI, 588 x 560, ~15 fps, video:

For text files the description always contains the word "text", for executables "executable", and for various binary formats "data". You can drop the filenames with --brief, or have some nice bedtime reading with man file. So you can just run file from your app, read its output, and search for strings like "text". If I remember right, there is a library behind it (libmagic) that actually does the job, and you can probably just link against it, but I'm not sure. I also don't know of any cross-platform way to do it :(
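
As a rough illustration of the "run file and search for text" approach, here is a Python sketch. It assumes the file utility is installed and on the PATH; describe_file, description_is_text and is_text_file are hypothetical helper names. The word match is only as reliable as the wording of file's output.

```python
import re
import subprocess


def describe_file(path: str) -> str:
    """Return the description printed by `file --brief PATH`."""
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def description_is_text(description: str) -> bool:
    """True if the description contains the standalone word 'text'."""
    return re.search(r"\btext\b", description) is not None


def is_text_file(path: str) -> bool:
    """Combine both steps: ask `file`, then look for the word 'text'."""
    return description_is_text(describe_file(path))
```

For example, description_is_text("ASCII text") is true, while the ELF and RIFF descriptions shown above do not contain the word "text" and are rejected.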

squidge
6th July 2010, 18:00
On Linux, there's also a MIME library that can figure out the type of a file and return e.g. "text/html", but again, that's really not portable beyond Linux unless you port that library yourself along with its dependencies.