Analyze the geometry of representions.
Understanding the representation space is a hot topic in the area of NLP since distributed representations made huge improvments on variety of NLP tasks and we do not know why. Exisiting probing methods use a learned classifier’s accuracy, mutual information or complexity as a proxy for the quality of a representation. However, classifier-based probes fail to reflect the difference because different representations may require different classifiers. In this post:
A learning task can usually be decomposed into three components:
However, there are so many factors can affect the performance of the whole learning process. For example, the optimizer (the algorithm) stops before convergence or we choose a bad initialization point. The above figure list all the possible factors. Clearly, the quality of the representation space is not the only factor.
Preliminary experiments are conducted on the supersense role labeling task with BERT-base-cased and ELMo original model. Many different classifiers are trained and evaluated, from logistic regression to two layer neural network. For each specific classifier, we train for \(10\) times using \(10\) different random seeds and record the minimum and maximum accuracy for these \(10\) runs. The following table shows our initial experiment results.
From the table above, we observe:
The above figure and table show that evaluating the quality of representations based on the learned classifiers is not reliable. Bad initialization point or wrong choice of model type can also lead to a bad predictive performance. We can conclude that a bad predictive performance does not necessarily mean bad representation and vice versa.
In this post, we study the question: Can we evaluate the quality of a representation for an NLP task directly witout relying on classifiers as proxy?
Given an NLP task, we want to distentangle the evaluation of a representation \(E\) from the classifier \(h\) that are trained over it. To do so, the first step is to characterize all classifiers supported by a representation.
From the viewpoint of a classifier, it is trained to find a set of parameters such that the prediction of the classifier is consistent with the examples in the training set. Geometrically speaking, each learned classifier is a decision boundary in the high dimension space as showed in the following figure.
In the above figure, it is a simple binary classification problem. There are many classifiers that can separate these two classes. This figure shows two linear (\(h_1\) and \(h_2\)) and a non-linear (\(h_3\)) examples. This suggests us that a representation $E$ can admit a set of classifiers that are consistent with the training set and a learner choose one of them. Given a set \(\mathcal{H}\) of classifiers of interest, the subset of classifiers that are consistent with a given dataset represents the version space with respect to \(\mathcal{H}\). However, the original definition of version space does not allow errors. Here, to account for the errors or noise in the data, we define a \(\epsilon\)-version space: the set of classifiers that can achieve less than \(\epsilon\) error on a given dataset.
Suppose \(\mathcal{H}\) is the whole hypothesis space consisting of all possible classifiers \(h\) of interest. The \(\epsilon\)-version space \(V_{\epsilon}(\mathcal{H}, E, D)\) expressed by a representation \(E\) for a labeled dataset \(D\) is defined as:
\[V_{\epsilon}(\mathcal{H}, E, D)\triangleq \{h\in \mathcal{H}\vert err(h, E, D)\le \epsilon\}\]where \(err\) is training error.
Be note here, the \(\epsilon\)-version space \(V_{\epsilon}(\mathcal{H},E,D)\) is only a set of learned classifiers and does not involve any learning. Intuitively, the larger \(\epsilon\)-version space would make the learning process easier to find a consistent classifier.
Previous classifier-based probing work measures the quality
of a representation by investigating properties of a
specific \(h\in V_{\epsilon}\). For example, some work
restricts the probe model to be linear. Commonly measured
properties include generalization error
Now, the next question is how can we find the \(\epsilon\)-version space for each representation. From the definition, \(V_{\epsilon}(\mathcal{H},E,D)\) is a infinite set. Although it is impossible to enumerate all the possible classifiers, geometry perspective indeed provide some insights about \(V_{\epsilon}(\mathcal{H},E,D)\).
We know that each classifier \(h\in V_{\epsilon}(\mathcal{H},E,D)\) is a decision boundary in the representation space and it can be in the form of any shape, like the following figure shows. On the left, the decision boundaries can be linear or non-linear. On the right, the decision boundary has to be a circle.
We also know that a set of piecewise linear functions can mimic any functions. If we use a set of piecewise linear functions to mimic the decision boundaries (the middle ones in the above figure), we find that the space is split into different groups, each of which contains points with exactly one label. Because of linear functions, these groups are seen as convex regions in the representation space (the bottom ones in the above figure). Any classifier in \(V_{\epsilon}(\mathcal{H},E,D)\) must cross the region between the groups with different labels; these are the regions that separate labels from each other, as shown in the gray areas in the bottom of above figure. So, an intuitive idea is to use these grey areas to approximate the \(V_{\epsilon}(\mathcal{H},E,D)\).
Although finding the set of all decision boundaries remain hard, finding the regions between convex groups that these piecewise linear function splits the data into is less so. Grouping points in the space is well-defined problem: Clustering. However, in this case, our clustering problem has different criteria:
Now, we successfully transform the problem of finding \(\epsilon\)-version space into a clustering problem with special criteria.
To find clusters that satisify the above criteria, we propose DirectProbe, a simple yet effective heuristic clustering algorithm:
The above animation describes the algorithm. When the new cluster overlaps (green dashed lines) with the red ones, it stops the merging and find the next closest pair.
Now, we have developed DirectProbe, a heuristic clustering approach that can approximate \(\epsilon\)-version space. Next, we apply this approach on variety of representations and tasks and analyze the results.
After applying the DirectProbe, we could end up \(n\) clusters.
By using the number of clusters, we can answer the question:
Can a linear classifier fit the training set for a task with a given representation?
To validate our claim, we use the training accuracy of a linear SVM classifier. If a linear SVM can perfectly fit (\(100\%\) accuracy), then there exist linear decision boundaries that separate the labels. The following table shows the results of our experiments. We can observe that almost all of the representations we experimented are linear separable for most of the tasks. We think this maybe the reason why linear model usually work for BERT-family models. The only exception is RoBERT-large on the pos tagging task. DirectProbe ends up with \(1487\) clusters and linear SVM can not achieve \(100\%\) training accuracy either.
Be note, linear separability does not mean the task is easy or that the best classifier should be a linear one.
The distance between clusters is another important property we should consider. We apply the DirectProbe on each layer of BERT-base-cased model for five tasks. For each layer(space), we compute the minimum distance between all pairs of clusters. The next figure shows our results.
The horizonal axis is the layer index of BERT-base-cased, the left vertical axis (blue line) is the best classifier accuracy and right vertical axis (red line) is the minimum distance between all pairs of clusters. We observe that both best classifier accuracy and minimum distance show similar trends across different layers: first increasing, then decreasing. Although not a simple linear relation, it shows that minimum distance correlates with the best performance for an embedding space. Using the minimum distances of different layers, we answer the question:
How do different layers of BERT differ in their representations for a task?
The standard pipeline of using contextual embedding models is to fine-tuning the original model on a specific task and then deploy fine-tuned model. Fine-tuning usually can improve the performance, which makes us wondering:
What changes in the embedding space after fine-tuning?
We apply the DirectProbe on the last layer of BERT-base-cased before and after fine-tuning for five tasks. Similarily, we compute the minimum distances between all pairs of clusters. The next table shows the results.
We can observe that both the best classifier accuracy and minimum distance show a big boost. It means that fine-tuning pushes the clusters away from each other in the representation space, which results in a larger \(\epsilon\)-version space. This verifies our assumption:
A larger $\epsilon$-version space admits more classifiers and allows for better generalization
The distance between clusters can also confuses classifiers. By comparing the distances between clusters, we answer the question:
Which labels for a task are more confusable?
We compute the distances between all pairs of labels based on the last layer of BERT-base-cased^{1}. The distances are evenly split into 3 categories: small, medium and large. We partion all label pairs into these three bins. For each task, we use the predictions of the best classifier to compute the number of misclassified label pairs for each bin. The distribution of all errors is shown in the following table.
We can easily observe that all the errors concentrate on the small distance bin. This means that small distance between clusters indeed confuse a classifier and we can detect it without training classifiers.
As we discussed earlier, any classifier \(h\in V_{\epsilon}(\mathcal{H},E,D)\) is a predictor for the task \(D\) on the representation \(E\). So, as a by-product, the clusters from DirectProbe can also be used as a predictor. The prediction strategy can be very simple: for a test example, we assign it to its closest cluster. We call this accuracy intra-accuracy. Because this strategy is very similar to the nearest neighbor classification (1-kNN), which assigns the unlabeled test point to its closest labeled point, we also compare with the 1-kNN accuracy. The following figure shows the results.
We observe that intra-accuracy always outperforms the simple 1-kNN classifier, showing that DirectProbe can utilize more information from the representation space. Moreover, all the pearson correlation coefficients between best accuracy and intra-accuracy (showed in the parentheses alongside each task title) suggests a high linear correlation between best classifier accuracy and intra-accuracy, which means the intra-accuracy can be a good predictor of the best classifier accuracy for a representation. From this, we argue that intra-accuracy can be interpreted as a benchmark accuracy of a given representation without actually training classifiers.
In this post, we ask the question: what makes a representation good for a task? We answer it by developing DirectProbe, a heuristic approach builds upon hierarichal clustering to approximate the \(\epsilon\)-version space. By applying DirectProbe, we find:
linearly separable. So the number of labels pairs equals the numebr of cluster pairs.
For all tasks, the last layer of BERT-base-cased is ↩