By Aleksandra Vercauteren, Senior NLU Engineer at Faction XYZ.
As a company specialised in natural language understanding, we rely on word embeddings as one of the building blocks of our technology. Our NLU models need to be capable of correctly ‘understanding’ what is written or said. One way of doing this is by using an intent classification model: when a user inputs a sentence, the model predicts an intent. Accurate intent classification is thus crucial for the interaction to flow smoothly. Although our models are the most accurate that we know of for Dutch and French, the classification sometimes goes wrong, and occasionally this results in funny conversations. Very recently, a chatbot built on our NLU models failed to recognize a user’s witty sarcasm, and responded in the following passive-aggressive manner:
User: Thanks for not answering, genius.
Bot: You’re welcome!
The sentence was misclassified as a “thank you” intent, to which the bot very politely gave a correct response. A lot of the misclassifications we have identified can be retraced to the word embeddings we make use of and the way they are created.
There are several ways of training word embeddings, but what most of them have in common is that the whole conception of word embeddings is based on the idea that lexical semantics is distributional: words with similar meanings occur in similar contexts, and similar contexts contain similar words.
You shall know a word by the company it keeps!
John Rupert Firth — A synopsis of linguistic theory
But is that really enough? Can you get the full meaning of a word by just looking at its context? What does context mean? And where does it go wrong? By sharing these questions here, I hope the marvellous complexity of human language gives you the same sense of astonishment it gives me.
Putting the context into the word.
Simply put, the classic word2vec word embeddings represent words as the average of the contexts they can appear in. The ‘context’ a word occurs in can be described at a variety of levels. You can speak of the syntactic context of a word. For instance, some words have a different part of speech — and thus also a different meaning — depending on the syntactic context they occur in.
I work very hard because I love my work.
The first occurrence of work is a verb and designates an action. The second occurrence, on the other hand, is a noun and designates an activity. Although we can agree that both words are similar in a number of respects (they are homophones and homonyms, and they share the same root work), they are dissimilar in meaning. Nevertheless, they will have one single vector representation that summarises both of these words and the contexts they can appear in.
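To make the "one vector for both uses" point concrete, here is a minimal sketch. It is not word2vec: it simply counts neighbouring words, which is a much cruder stand-in for a learned embedding. The function name context_counts and the window size of 2 are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """Map each word form to a Counter of the words seen within `window` tokens of it.

    Illustrative count-based stand-in for an embedding, not word2vec itself.
    """
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts

corpus = ["I work very hard because I love my work"]
contexts = context_counts(corpus)

# The verb "work" and the noun "work" share one key, so their contexts are merged.
work_contexts = contexts["work"]
print(sorted(work_contexts.items()))
```

The single entry for work mixes the neighbours of the verb (I, very, hard) with the neighbours of the noun (love, my), which is the blending described above.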
Let’s look at another example:
There is coffee on the kitchen floor.
I would like a coffee, please.
Here, both occurrences of coffee are nouns, but their meaning is different due to the context in which they occur. In the first sentence, coffee designates a substance, it is a mass noun, while in the second sentence, coffee designates a cup of black gold. How do we know? The syntactic context. The presence of the indefinite determiner ‘a’ indicates that this occurrence of coffee is countable, hence we automatically infer that there should be a cup involved. However, you cannot do this trick with all words. Sand will never be countable, no matter which determiner you put in front of it. So yes, semantics is distributional, in the sense that the meaning of words can depend on the context they occur in, and that words cannot occur in certain contexts because their meaning simply doesn’t allow for it.
The semantic context can also ‘change’ the meaning of a word. Take a look at the following example.
The bass played the bass.
Both instances of bass are nouns, both are preceded by the same determiner, but they clearly have a different meaning. We are dealing with a fish with musical aspirations here. How do we know? Well, only basses can be played, while basses cannot. Observe that although the syntactic context (more specifically the word order and subject-object relations) helps us to interpret each occurrence of bass, it is mainly the semantics of played interacting with the semantics of the basses that helps us disambiguate the meaning. In the example below, we switch around the subject-object relation and the word order, but we still get the same two interpretations of bass thanks to the presence of played:
The bass was played by the bass. (wait… maybe basses can be played after all… The drama of being a fish!)
Finally, we have the pragmatic context. Pragmatics refers to language use, and a lot of human nature comes into play here. Humans use language for a variety of purposes, not only for transmitting ideas and information, but also to get other people to do things. Humans use different variants of a language to show their inclusion in social groups or to comply with certain social rules. Most relevant for our discussion here is the role of register. In different social contexts, we tend to use different language. The differences can be in pronunciation (I dunno vs. I don’t know) but also in the vocabulary we choose to use. We can ask a friend to grab us a coffee (in a cup), but we are likely to ask a colleague to bring us a coffee, please. Exact same syntax, same semantics, even same intention (I am too lazy to get a coffee, so I want you to do it without explicitly telling you what to do, because I value our relationship and the social rules we live by prescribe that we be polite, hence use questions instead of imperatives to maintain good relations). Different context though. Not linguistic, but pragmatic.
Can you can a can as a canner can can a can?
One of the reasons (there are many more, word embeddings are awesome!) why word embeddings are so popular is that they allow us to calculate word similarity. As mentioned above, similar words can occur in similar contexts, so given that word vectors represent the average context a word can occur in, we expect similar words to have similar word vectors. However, just like context, word similarity can refer to a number of concepts.
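Similarity between word vectors is commonly measured with cosine similarity, the cosine of the angle between two vectors. The sketch below shows the calculation on made-up three-dimensional vectors; the numbers in toy_vectors are invented for illustration and do not come from any trained model, and the helper names are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank all other words in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "angry": [0.9, 0.1, 0.0],
    "furious": [0.8, 0.2, 0.1],
    "coffee": [0.0, 0.1, 0.9],
}

print(most_similar("angry", toy_vectors))
```

With real pre-trained vectors the same ranking logic applies, but what ends up at the top depends on which kind of context dominated during training.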
What does it mean exactly for words to be “similar”? Words can be similar in a number of respects: they can have the same spelling (address and address) or the same or a similar pronunciation (too and two). Words can be similar in that they have the same root: canner, can and can have the same root can, but can doesn’t, it has the other root can. Remember to breathe and don’t get confused.
Words can also have the same part of speech: work and sing are similar in that they are both verbs, work and song are similar in that they are both nouns (and work and work are similar in that they have the same root work).
Finally, words can have the same or a similar meaning. But again, what does it mean to have a similar meaning? The meaning of two words can be similar in that they refer to entities of the same ontological class, for instance apple and steak, which both refer to entities of the class of edible objects. Apple and pear have even more similar meanings, since they not only both refer to entities in the class of edible objects, but also to entities in the class of fruits, and even more restrictively, the class of pome fruits. However, most people think of synonyms or near synonyms when speaking of similar words: words that refer to the same entity or action or concept. Word embeddings don’t. They think of several types of similarity, and mix them all together.
Survival of the fittest.
As I mentioned above, word vectors represent the average of the contexts they occur in. We feed a huge corpus to a smart algorithm and we get a bunch of numerical representations of the words in the corpus. Similar vectors represent similar words, because they occur in similar contexts.
Due to the complexity of the word-context interplay, vector similarity won’t always represent the type of similarity that we are after. For instance, one of the words with a vector most similar to the vector of can is cannot (check your favorite word vectors!). I think we can all agree that these two words do not exactly mean the same thing, so why are their vectors so similar? It is pretty simple actually: one aspect of the context both words can occur in has a much bigger weight than the other aspects, and that will determine the shape of their vectors. In this case the syntactic context wins. Both can and cannot can only occur in a very restricted set of syntactic contexts. Being auxiliary verbs, they can only occur in a position preceding other auxiliary verbs or lexical verbs, or at the beginning of a sentence in case the sentence is a question. Additionally, their poor semantic content makes both can and cannot compatible with a very large variety of verbs and nouns, so the semantic context they can appear in is pretty much undefined. As a consequence, the vectors of these words will mainly represent their syntactic distribution, and from a word vector point of view they will be very similar to words with the same syntactic distribution and poor semantic content, such as all other auxiliary verbs. The presence of the negation in cannot does not have sufficient weight to compensate for the syntactic similarity between can and cannot. Luckily there are ways to make sure that the quite relevant semantic difference is taken into account, for instance by making use of a hard-coded negation extractor, which identifies negative elements such as no, but also un- and anti-. Since we started making use of such a negation extractor, our bots became less cocky.
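As a rough idea of what a hard-coded negation extractor could look like, here is a minimal sketch. It is not our production implementation: the word list (beyond no), the function name extract_negations and the vocabulary check are all assumptions made for illustration. The vocabulary check is there because a bare prefix match would also flag words like under or uncle.

```python
# Hypothetical word list; only "no", "un-" and "anti-" are named in the text above.
NEGATION_WORDS = {"no", "not", "never", "cannot"}
NEGATION_PREFIXES = ("un", "anti")

def extract_negations(tokens, vocabulary):
    """Return the tokens that carry a negation.

    A token counts as negated if it is a known negation word, ends in "n't",
    or starts with a negative prefix whose remainder is a known word.
    """
    found = []
    for token in tokens:
        lower = token.lower()
        if lower in NEGATION_WORDS or lower.endswith("n't"):
            found.append(token)
            continue
        for prefix in NEGATION_PREFIXES:
            # Strip the prefix and an optional hyphen, as in "anti-social".
            stem = lower[len(prefix):].lstrip("-")
            if lower.startswith(prefix) and stem in vocabulary:
                found.append(token)
                break
    return found

vocabulary = {"happy", "social", "able"}
tokens = "I cannot say I am unhappy under this anti-social uncle".split()
negations = extract_negations(tokens, vocabulary)
print(negations)
```

The extracted negations can then be passed to the classifier as an extra signal next to the word vectors, so that can and cannot no longer look nearly identical to the model.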
Syntactic context determines the vector representation for all function words, such as determiners, numerals, prepositions and pronouns. Their semantic content simply isn’t strong enough to show through in their vector representation, and their syntactic distribution is so restricted that the word vectors will basically represent the syntactic context they so frequently occur in.
Another case is when the pragmatic context has the biggest weight. If you look up the words that are most similar to nope in the fastText pre-trained word vectors, you will notice that the top ten contains a lot of short words that are frequently used in an informal register, such as anyways, fwiw, yeah and hmmm. Again, these words have little semantic content. They are also syntactically quite independent (they can be ‘sentences’ on their own). This leaves room for pragmatics to determine their distribution, and this will be represented in their word vectors.
Does the semantic context have any influence at all? Of course it does! Especially with words that have a strong, specialized semantic content, a rather free syntactic distribution and no pragmatic ties to any register. This is the case for a lot of nouns, verbs and adjectives. The words most similar to furious are, among others, angry, infuriated and enraged, which are synonyms. However, the more general the semantic content of a word, and thus the bigger the variety of semantic contexts it can occur in, the more ‘general’ the semantic representation in its word vector will be. Take for instance make, which has create in its top ten, but also give, bring and get, which, like make, are verbs that have multiple meanings depending on the context they occur in (I made it! vs. I made an omelette). So apparently the average meaning (whatever that means) of make is similar to the average meaning of give. Very often the semantic similarity represented by the word vectors will rather be class similarity. Apricot is similar to plums and pears because they appear in fruity contexts, and flu is similar to dengue and measles because they occur in feverish contexts.
The reality-possibility discrepancy.
So why do synonyms rarely pop up when looking for similar vectors? Just because two words can appear in the same context doesn’t mean that they will. Synonyms often differ from each other in that one of the alternatives is used very frequently with one of its other meanings. Think for instance of our beloved friend the bass. The most similar word is guitar, and there is no fish whatsoever in the top ten, probably because we don’t talk about the fish as often as we do about the instrument. Another reason is that words are rarely synonyms in all contexts: dialect and variant can be used interchangeably when talking linguistics, but that is the only context in which that is possible. Variant can occur in so many other contexts, and these contexts will pull the word vector in a certain direction, away from dialect. Synonyms are also not always synonyms in all variants of a language. Verlof (holiday or leave) is a synonym of vakantie (holiday) in Flemish Dutch, but not in Dutch from the Netherlands. Finally, synonyms can be preferred in certain registers. Bro is a synonym of friend, but not when you are talking to your boss.
It is thus clear that using word vectors to find synonyms isn’t a very good idea. There are some alternative approaches to building a good synonym suggester though. For one, you can make use of a built-in thesaurus and predict which one of the candidate synonyms is the most likely to occur in the given context using word prediction techniques. In any case, a rule of thumb is that you need to take into account the specific context, because whether two words are synonyms or not often depends on the context they occur in.
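The thesaurus-plus-prediction idea can be sketched as follows. This is a toy under heavy assumptions: the thesaurus entry, the four-sentence corpus and the function names are all invented, and plain sentence-level co-occurrence counts stand in for a real word prediction model.

```python
from collections import Counter

def train_cooccurrence(sentences):
    """Count how often two different word forms appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        forms = set(sentence.lower().split())
        for a in forms:
            for b in forms:
                if a != b:
                    counts[(a, b)] += 1
    return counts

def suggest_synonym(tokens, index, thesaurus, cooccurrence):
    """Pick the thesaurus candidate that co-occurs most with the rest of the sentence."""
    candidates = thesaurus.get(tokens[index], [])
    if not candidates:
        return None
    context = [t for i, t in enumerate(tokens) if i != index]

    def score(candidate):
        return sum(cooccurrence[(candidate, t)] for t in context)

    return max(candidates, key=score)

# Hypothetical thesaurus entry and toy corpus, for illustration only.
thesaurus = {"variety": ["dialect", "variant"]}
corpus = [
    "we speak a dialect of dutch",
    "this dialect of dutch is spoken in flanders",
    "a new variant of the virus spreads",
    "the variant of the virus is mild",
]
cooccurrence = train_cooccurrence(corpus)

linguistic = suggest_synonym("she speaks a variety of dutch".split(), 3, thesaurus, cooccurrence)
medical = suggest_synonym("a new variety of the virus".split(), 2, thesaurus, cooccurrence)
print(linguistic, medical)
```

The same candidate list yields a different suggestion in each sentence, which is the point: the choice is made per context rather than per word.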
An average conclusion.
Word vectors represent the average context of a word. All aspects of the context: syntax, semantics and pragmatics. If one aspect has more weight than another, this will be visible in the word vector representation. If a word occurs more frequently in one context than in the other, this will be visible in the word vector representation. If a word occurs more frequently with one of its meanings, this will be visible in the word vector representation. So if you want to use word vectors to calculate similarity, beware of the similarity complexity.
Although word vectors are very useful in a number of respects and have allowed us to make great progress in NLU, they are based on a rather simplistic approach to language (we have to start somewhere…). There are so many subtleties in language that humans understand intuitively and juggle with without even thinking about it. We are capable of inferring a lot of meaning that isn’t even linguistically expressed, just based on the context in which we use language, or our knowledge about and relation with the people we interact with, and so many other factors. So let’s keep up the good work and deal with the shortcomings of word vectors, because one day I want a pet robot. Any ideas to make it better? Get in touch!