String similarity check [Archive]

sedi

23rd June 2012, 16:45

Is there any ready-made string similarity check in Qt?
I've read about Levenshtein distance or soundex, is there anything like this built in?

According to this thread (http://www.mikrocontroller.net/topic/164870)(sorry, German language) I've implemented a counting of small substrings that gives me quite reasonable results for Names (which I need to compare) when I take substrings of n=2 chars and an 80% threshold.

I'd like to know if there is a better routine fast enough that I can just take and use.

This is what I use right now (corrections on style and speed are welcome!)

header:

bool isSimilar(QString a, QString b, qreal percentage=80, int n = 2, Qt::CaseSensitivity caseSense= Qt::CaseInsensitive);

implementation:

bool MainWindow::isSimilar(QString a, QString b, qreal percentage, int n, Qt::CaseSensitivity caseSense)
//Iterates substrings in groups of n chars from a und finds these in b.
//The number of hits is then divided by the length of the shorter string.
//To properly take word beginnings and endings into account
//spaces are being inserted before and after the strings.
{
if (a.isEmpty()||b.isEmpty()) return false;
qreal hits=0;
a=QString(" ").repeated(n-1)+a+QString(" ").repeated(n-1);
b=QString(" ").repeated(n-1)+b+QString(" ").repeated(n-1);
QString part;
for (int i=0;i<a.count()-(n-1);i++)
{
part=a.mid(i,n);
if (b.contains(part,caseSense)) hits++;
}
if (a.length()<b.length()) return (percentage < (100*hits/(a.length()-(n-1))));
else return (percentage < (100*hits/(b.length()-(n-1))));
}

For the name "Markus Bertram" I get these results:

Bertram, Markus - 93,3
Markus E. Bertram - 100
Marcus Emil Bertram - 86,7
marc bertram - 84,6 (case-insensitive)
Martin Bertram - 73,3 (false)
Martin Bergmann - 46,7 (false)

amleto

23rd June 2012, 17:26

don't know anything off hand.

your current implementation should at least include this change

const QString spaces = QString(" ").repeated(n-1);
a = spaces + a + spaces;
// etc

wysota

23rd June 2012, 18:07

The most basic approaches I know are Hamming distance and Edit Distance (aka Levenshtein Distance). There are other string metrics available, though:
http://en.wikipedia.org/wiki/String_metric

Edit distance is quite easy to implement efficiently using QtConcurrent or OpenCL..