
View Full Version : Detect language from Unicode text



nikhilqt
18th March 2012, 23:51
Is there any way to detect the language of a given Unicode string? Please let me know about any approaches you feel might help me.
Thanks in advance. :confused:

wysota
19th March 2012, 00:08
You can use statistics -- search the input for common words (or sets of words) that are unique to a given language. You can also try looking at character sets, but that will only detect alphabets, not languages.
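To illustrate the character-set part of this, here is a minimal Python sketch (the function name and approach are mine, not from the thread) that finds the dominant script of a string by looking at Unicode character names. As noted above, this identifies the writing system, not the language:

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the dominant script (alphabet) of a string by inspecting
    Unicode character names. Detects the writing system only -- e.g.
    LATIN covers English, French, German and many more languages."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        # The first word of a Unicode character name is usually the
        # script, e.g. "CYRILLIC SMALL LETTER A", "LATIN SMALL LETTER A".
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None
```

For example, `dominant_script("Привет")` yields `"CYRILLIC"`, which narrows the candidates to a family of languages but does not pick one.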

ChrisW67
19th March 2012, 00:28
It would be unusual to get a single string in isolation: external data can help. For example, if the string is part of an address and you also collect the country, you have another criterion to help narrow the possibilities. The presence of RTL characters or direction marks can also limit the options.
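The RTL check is cheap to do with the Unicode bidirectional classes; a small sketch (function name is mine):

```python
import unicodedata

def contains_rtl(text):
    """True if the string contains right-to-left characters: Hebrew or
    Arabic letters (bidi classes R / AL, which also covers the explicit
    direction marks) or RTL embedding/override controls (RLE / RLO)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL", "RLE", "RLO")
               for ch in text)
```

A hit immediately rules out all left-to-right-only languages, which is exactly the kind of option-limiting described above.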

nikhilqt
19th March 2012, 00:33
Thanks, wysota. That gives me an idea: I could store each language's Unicode alphabet together with the language name, then do a "contains" check to see which alphabet the string's characters belong to. Can I do Unicode alphabet comparisons like that?


It would be unusual to get a single string in isolation: external data can help. For example, if the string is part of an address and you also collect the country, you have another criterion to help narrow the possibilities. The presence of RTL characters or direction marks can also limit the options.
Yeah, nice idea. But my sample data has all the languages mixed together. Let me see if I can find any language-related information in the file itself.
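The alphabet-comparison idea from the previous post can indeed be done by comparing code points against Unicode block ranges. A sketch under the same caveat wysota raised (a script narrows the choices but rarely identifies one language); the ranges and language lists below are illustrative, not exhaustive:

```python
# Illustrative mapping from Unicode code-point ranges (alphabets)
# to some of the languages written in them. Real coverage would
# need many more ranges and much longer language lists.
RANGES = {
    (0x0370, 0x03FF): ["Greek"],
    (0x0400, 0x04FF): ["Russian", "Ukrainian", "Bulgarian", "Serbian"],
    (0x0590, 0x05FF): ["Hebrew", "Yiddish"],
    (0x0600, 0x06FF): ["Arabic", "Persian", "Urdu"],
}

def candidate_languages(text):
    """Return the set of languages whose alphabet ranges contain at
    least one character of the input string (the 'contains' check)."""
    langs = set()
    for ch in text:
        cp = ord(ch)
        for (lo, hi), names in RANGES.items():
            if lo <= cp <= hi:
                langs.update(names)
    return langs
```

Note that for shared alphabets such as Cyrillic or Latin this returns several candidates, which is exactly why the statistical word-level approach is still needed on top.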

wysota
19th March 2012, 00:47
If you have a large text with mixed languages it will be extremely hard to separate them. Besides, naively checking every possible entry will take ages. You need to preprocess the text, building some kind of dictionary that counts words or phrases, filtering out entries common to many languages, and then employ some statistical apparatus to guess the language. The easiest to guess is definitely Chinese, along with most other Asian languages and languages whose alphabet is used by only a small number of languages (Arabic, Hebrew, etc.). English will probably be the most difficult to detect, as many texts quote terms coming from English, especially technical texts.

Language recognition is a domain in itself; don't expect to write a 100-200 line program that will do what you want. You can improve your chances by connecting to some online ontology database (such as WordNet for English) to detect phrases or even sentences that have a meaning in a particular language.
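A toy version of the statistical approach described above, just to show the shape of it: count how many of each language's common (stop) words occur in the text and pick the best scorer. The word lists here are tiny illustrative samples; a real system needs far larger dictionaries and the preprocessing and filtering steps mentioned:

```python
# Tiny illustrative stop-word lists -- nowhere near enough for
# real detection, but enough to show the scoring idea.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is"},
    "french":  {"le", "la", "et", "les", "est"},
    "german":  {"der", "die", "und", "ist", "das"},
}

def guess_language(text):
    """Score each language by how many of its common words appear
    in the text; return the best scorer, or None on no evidence."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Even this toy shows the failure mode wysota mentions: a German sentence quoting English terms will score points for both languages, so real systems weigh evidence over much larger samples.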

ChrisW67
19th March 2012, 00:52
Then you have a real problem. The only workable approach would be statistical, and even then, have fun with text like:


Tom approached the man in uniform, "Je ne parle pas bien français. Pouvez-vous m'aider à trouver un poste de police?" The man replied, "Je ne parle pas français non plus. Sprechen Sie deutsch?"

wysota
19th March 2012, 09:50
If you want statistics, then you definitely need more than 1000 tokens.