sedi

23rd June 2012, 17:45

Is there any ready-made string similarity check in Qt?

I've read about Levenshtein distance or soundex, is there anything like this built in?

According to this thread (http://www.mikrocontroller.net/topic/164870)(sorry, German language) I've implemented a counting of small substrings that gives me quite reasonable results for Names (which I need to compare) when I take substrings of n=2 chars and an 80% threshold.

I'd like to know if there is a better routine fast enough that I can just take and use.

This is what I use right now (corrections on style and speed are welcome!)

header:

bool isSimilar(QString a, QString b, qreal percentage=80, int n = 2, Qt::CaseSensitivity caseSense= Qt::CaseInsensitive);

implementation:

bool MainWindow::isSimilar(QString a, QString b, qreal percentage, int n, Qt::CaseSensitivity caseSense)

//Iterates substrings in groups of n chars from a und finds these in b.

//The number of hits is then divided by the length of the shorter string.

//To properly take word beginnings and endings into account

//spaces are being inserted before and after the strings.

{

if (a.isEmpty()||b.isEmpty()) return false;

qreal hits=0;

a=QString(" ").repeated(n-1)+a+QString(" ").repeated(n-1);

b=QString(" ").repeated(n-1)+b+QString(" ").repeated(n-1);

QString part;

for (int i=0;i<a.count()-(n-1);i++)

{

part=a.mid(i,n);

if (b.contains(part,caseSense)) hits++;

}

if (a.length()<b.length()) return (percentage < (100*hits/(a.length()-(n-1))));

else return (percentage < (100*hits/(b.length()-(n-1))));

}

For the name "Markus Bertram" I get these results:

Bertram, Markus - 93,3

Markus E. Bertram - 100

Marcus Emil Bertram - 86,7

marc bertram - 84,6 (case-insensitive)

Martin Bertram - 73,3 (false)

Martin Bergmann - 46,7 (false)

I've read about Levenshtein distance or soundex, is there anything like this built in?

According to this thread (http://www.mikrocontroller.net/topic/164870)(sorry, German language) I've implemented a counting of small substrings that gives me quite reasonable results for Names (which I need to compare) when I take substrings of n=2 chars and an 80% threshold.

I'd like to know if there is a better routine fast enough that I can just take and use.

This is what I use right now (corrections on style and speed are welcome!)

header:

bool isSimilar(QString a, QString b, qreal percentage=80, int n = 2, Qt::CaseSensitivity caseSense= Qt::CaseInsensitive);

implementation:

bool MainWindow::isSimilar(QString a, QString b, qreal percentage, int n, Qt::CaseSensitivity caseSense)

//Iterates substrings in groups of n chars from a und finds these in b.

//The number of hits is then divided by the length of the shorter string.

//To properly take word beginnings and endings into account

//spaces are being inserted before and after the strings.

{

if (a.isEmpty()||b.isEmpty()) return false;

qreal hits=0;

a=QString(" ").repeated(n-1)+a+QString(" ").repeated(n-1);

b=QString(" ").repeated(n-1)+b+QString(" ").repeated(n-1);

QString part;

for (int i=0;i<a.count()-(n-1);i++)

{

part=a.mid(i,n);

if (b.contains(part,caseSense)) hits++;

}

if (a.length()<b.length()) return (percentage < (100*hits/(a.length()-(n-1))));

else return (percentage < (100*hits/(b.length()-(n-1))));

}

For the name "Markus Bertram" I get these results:

Bertram, Markus - 93,3

Markus E. Bertram - 100

Marcus Emil Bertram - 86,7

marc bertram - 84,6 (case-insensitive)

Martin Bertram - 73,3 (false)

Martin Bergmann - 46,7 (false)