This is the third article in a short series on fuzzy matching:
- Example algorithms
- Testing and context
In this article I will consider the difference between context-dependent and context-independent fuzziness, and think about how fuzzy matching systems can be tested.
Context-dependent and context-independent fuzziness
If you are trying to do fuzzy matching of strings, then the previous article introduced one way of doing that – the Levenshtein distance. The Levenshtein distance knows nothing about the meaning of the strings being compared, so it can be described as a context-independent approach.
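As a quick reminder of the previous article, here is a minimal sketch of the Levenshtein distance in Python, using the standard dynamic-programming formulation (the function name and examples are just for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Note that nothing in the code cares what the strings mean – hence context-independent.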
There are situations where this works, but others where it doesn’t. For instance, to a human the following strings are likely to fall into two groups of two:
- Fred Brooks Consulting Services Ltd.
- F. Brooks Limited
- Jo Smith and Sons PLC
- J. Smith
This is despite the big textual differences within each pair. A human has subconsciously switched to the domain of company names, which contains a subdomain of people’s names. In this context (in the UK at least) “Limited” and “Ltd.” are synonyms, “Fred” can be abbreviated to “F.” and “Jo” to “J.”, and “Consulting Services” and “and Sons PLC” are relatively low importance, filler-type words.
If you wanted a computer to say that the list of four names above splits into two pairs, you would need to encode the rules of the context so that the computer can follow them. For instance: remove all low-importance (low-entropy) words, and convert each word to a standard form chosen from among its synonyms (e.g. “Limited” to “Ltd”).
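A sketch of what those context rules might look like in code. The word lists here are purely illustrative assumptions – a real system would need much larger, carefully curated, context-specific lists:

```python
# Illustrative rules only: real word lists would be far larger and
# would be specific to the context (here, UK company names).
SYNONYMS = {"limited": "ltd", "ltd.": "ltd", "f.": "fred", "j.": "jo"}
FILLER = {"consulting", "services", "and", "sons", "plc"}

def normalise(name: str) -> str:
    words = name.lower().split()
    words = [SYNONYMS.get(w, w) for w in words]    # standardise synonyms
    words = [w for w in words if w not in FILLER]  # drop filler words
    return " ".join(words)

# Both names in each pair reduce to the same normalised form:
print(normalise("Fred Brooks Consulting Services Ltd."))  # fred brooks ltd
print(normalise("F. Brooks Limited"))                     # fred brooks ltd
print(normalise("Jo Smith and Sons PLC"))                 # jo smith
print(normalise("J. Smith"))                              # jo smith
```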
It’s important to know which context is the appropriate one, as different contexts could have conflicting rules. For instance, if you were compiling a list of synonyms, then jump/leap seems like a sensible pair, but different contexts would have different synonyms for jumper:
- Jumper/sweater – clothing
- Jumper/shunt – electronics
- Jumper/hunter – horses
Not all manipulation of text needs to be context-specific. It’s possible to encode rules about how words are formed from smaller elements (the rules of morphology), e.g. how a singular becomes a plural, how an adjective becomes an adverb etc. These rules can then be applied to reduce words to just their most important part, throwing away the bits that change according to the rules. So history, histories and historical would all be converted to something like histor, showing that they are all different aspects of the same word. This process is referred to as stemming.
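A toy illustration of suffix-stripping – real stemmers, such as the Porter stemmer, have many more rules plus conditions on when each rule applies, but the shape of the idea is the same (the suffix list here is made up for this one example):

```python
# A handful of made-up suffix rules; a real stemmer (e.g. Porter's
# algorithm) has far more rules and conditions.
SUFFIXES = ["ies", "ical", "y"]

def stem(word: str) -> str:
    # Try the longest matching suffix first, and keep a minimum stem length.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["history", "histories", "historical"]:
    print(w, "->", stem(w))  # all three print "-> histor"
```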
It’s possible to combine context-dependent and context-independent fuzzy matching. For instance, filler words are removed (context-dependent) and then what’s left is compared via the Levenshtein distance (context-independent).
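A sketch of that combination: strip context-dependent filler, then score what remains with the context-independent Levenshtein distance. The filler list, the `fuzzy_match` name and the threshold of 3 are all assumptions made for this example:

```python
# Illustrative filler words for a company-names context.
FILLER = {"consulting", "services", "and", "sons", "plc", "limited", "ltd."}

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def fuzzy_match(a: str, b: str, max_distance: int = 3) -> bool:
    # Context-dependent step: drop filler words.
    strip = lambda s: " ".join(w for w in s.lower().split() if w not in FILLER)
    # Context-independent step: edit distance on what's left.
    return levenshtein(strip(a), strip(b)) <= max_distance

print(fuzzy_match("Fred Brooks Consulting Services Ltd.",
                  "F. Brooks Limited"))  # True
```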
Note that the context-dependent parts of fuzzy matching can be just as arbitrary as the context-independent ones. With the context-independent Levenshtein distance, the arbitrary choice is how great a distance is acceptable. With context-dependent approaches you need to decide which words to include in the list of filler words to delete, which words deserve a list of synonyms, and which words go in each list of synonyms.
This isn’t intended to be a comprehensive description of context-dependent fuzzy matching because that is, as the name suggests, context-dependent, and I can’t cover all contexts here.
Testing fuzzy matching
If you refer to the first article in the series, you can see that I split the problem of fuzzy matching into two parts: how a match is scored, and where to draw the dividing line between scores that are good enough and scores that aren’t. If you were doing white box testing, where you could see the inner workings of the code, you could try to test the two parts separately.
However, I will assume that you are doing black box testing, and so can’t get this level of detail. Instead, the best you can do is a more general exploration of how the system copes with variability and ambiguity.
In the case of speech recognition (for which you might use a hidden Markov model, or HMM), you could try:
- Different kinds of speaker;
- Different ways of speaking for one speaker (quietly/loudly, fast/slowly, happy/sad etc.);
- Different amounts and kinds of background noise (radio, railway station, beard scrape);
- Ambiguous sounds.
What counts as ambiguous will depend a bit on the context, but it will be things like words that rhyme with each other, letters that sound alike (S and F, or the many letters that rhyme with E), the digits 5 and 9, etc.
In the case of text matching, you could start with a perfect match, then move further and further away. If you know two or more words that the system’s interested in, preferably ones that are similar to each other, you could work out a word containing typos that’s mid-way between two of the important words and see which, if either, is picked. For instance, if the system’s looking for hammer and spanner, then the nonsense word hanner is mid-way between the two (it’s at a Levenshtein distance of two from both).
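You can check the hammer/spanner claim with the dynamic-programming edit distance from earlier in the series (repeated here so the example stands alone):

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# The nonsense word "hanner" sits exactly mid-way between the two
# words the system is looking for:
print(levenshtein("hammer", "hanner"))   # 2
print(levenshtein("spanner", "hanner"))  # 2
```

A word like this makes a good test input, because neither target is an obviously better match than the other.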
There’s an awful lot more to testing fuzzy matching than this – much more than I could cover in a single article. As well as purely technical concerns it’s important to consider the human side of things. For instance, if the system is doing fuzzy matching of photos of people’s faces, does the matching work equally well or poorly for different skin colours, genders, ages etc.?