Fine-tuning a pre-trained model on a specific task has become a standard strategy to improve task performance. However, very little is understood about how the fine-tuning process affects the underlying representation and why fine-tuning invariably seems to improve performance.
We applied two probing methods, classifier-based probing and DirectProbe, to variants of BERT representations and tasks. In this post, to analyze how fine-tuning works, we use these two probing methods to examine the representation space before and after the fine-tuning process.
The first probe we used is a classifier-based probe, a common methodology for probing a representation space. We train classifiers over representations to understand how well a representation encodes the labels for a task. For all of our analysis, we use two-layer neural networks as our probes. The classification performance provides a direct assessment of the effect of fine-tuning.
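As a minimal sketch of such a classifier-based probe (the data here is a random stand-in for real contextual embeddings, and the hidden-layer size is an arbitrary choice, not necessarily the one used in our experiments):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy stand-ins for contextual embeddings: 200 examples, 16 dimensions, 3 labels.
X = rng.normal(size=(200, 16))
y = rng.integers(0, 3, size=200)

# A two-layer neural network probe: one hidden layer plus the output layer.
probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
probe.fit(X, y)
acc = probe.score(X, y)  # probing accuracy on this (random) data
```

In practice, `X` would be the embeddings produced by a (possibly fine-tuned) BERT variant and `y` the task labels.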
The second probe we used is DirectProbe, a recently proposed technique that analyzes a representation space from a geometric perspective by clustering. Unlike other probes that analyze the geometry of representations alone, DirectProbe requires both a representation and a labeling task. Probing with respect to a given task is reasonable because fine-tuning creates different representations for different tasks.
DirectProbe is built upon the characterization of not one, but all decision boundaries in a representation space that are consistent with a training set for a given task. It approximates this set of decision boundaries with a supervised clustering algorithm, which returns a set of clusters such that each cluster only contains points with the same label. There are no overlaps between the convex hulls of these clusters. The following figure describes this process. A more detailed description of DirectProbe can be found here.
We use three properties of the clusters returned by DirectProbe to measure the representations:
In this section, we will go over the observations and discoveries from this probing work.
First, let’s look at the classification performance after fine-tuning. We train a two-layer neural network as the classifier on five tasks:
It is commonly accepted that fine-tuning improves task performance. However, during our experiments, we found one exception where fine-tuning does not improve performance.
Fine-tuning Diverges the Training and Test Set
The following table summarizes the classification performances of all the representations and tasks.
We observe that BERT-small does not show improvement after fine-tuning on the supersense function task, which seems odd considering that in all other cases fine-tuning improves the performance. Meanwhile, we also observe that after fine-tuning, the spatial similarity between the training and test sets decreases for all representations and tasks (shown in the last column), indicating that the training and test sets diverge as a result of fine-tuning.
Fine-tuning Memorizes Training Set
To understand why BERT-small’s performance decreases on the supersense function task, we hypothesize that fine-tuning can memorize the training set. To validate this hypothesis, we design the following experiment:
The difference between the subtest and test sets is that subtest is used during fine-tuning but is not visible to the classifiers. The following table summarizes the visibility.
By comparing the learning curves of subtest and test set, we can verify if the fine-tuning process memorizes the subtest. We conduct our experiments on the four tasks using BERT-small. The following figures show the results.
In the above figures, we can observe that before fine-tuning, subtest and test have similar learning curves, while after fine-tuning, subtrain and subtest have similar curves even though subtest is not visible during classifier training. This observation means that the representation memorizes the subtest during fine-tuning, such that subtrain and subtest share exactly the same regularity. Although this observation of memorization cannot explain why BERT-small does not improve after fine-tuning, it points to a possible direction for investigation: if the memorization becomes severe, will the fine-tuning process overfit the training set and fail to generalize to unseen examples? We leave this question for future research.
After analyzing the performance of fine-tuning, we take a deeper look at the geometric changes during fine-tuning. In this subsection, we focus on the linearity of the representations by comparing the number of clusters produced by DirectProbe before and after fine-tuning.
Smaller Representations Require More Complex Classifiers
The following table summarizes the results on BERT-tiny. We only show BERT-tiny here because the other representations are linearly separable even before fine-tuning. In the table, we observe that small representations such as BERT-tiny are non-linear for most of the tasks. Although non-linearity does not necessarily imply poor generalization, it represents a more complex spatial structure and requires a more complex classifier. It would be advisable to use a non-linear classifier with a small representation (say, due to limited resources).
Fine-tuning Makes the Space Simpler
In the above table, we also observe that after fine-tuning, the number of clusters decreases, suggesting that fine-tuning updates the space such that points with different labels are in a simpler spatial configuration.
Next, let’s analyze the spatial structure of the representations during fine-tuning. In this subsection, we focus on tracking the distance between clusters and how these clusters move.
Fine-tuning Pushes Each Label Away From Each Other
We track the minimum distance of each label to all other labels during fine-tuning^{1}. The following figure shows how these distances change in the last layer of BERT-base.
For clarity, we only present the three labels where the minimum distance increases the most, and the three where it increases the least. We also observe that although the overall trend is increasing, the minimum distance of a label can decrease during the course of fine-tuning, e.g. the label STUFF in the supersense role task, suggesting the instability of fine-tuning.
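The distance tracking can be sketched as follows. Note that DirectProbe measures distances between the convex hulls of clusters; this toy version (the name `min_label_distances` is mine) uses the simpler minimum point-to-point distance:

```python
import numpy as np
from scipy.spatial.distance import cdist

def min_label_distances(X, y):
    """For each label, the minimum distance from its points to points of any other label."""
    return {a: float(cdist(X[y == a], X[y != a]).min()) for a in np.unique(y)}
```

Running this on the representation at each fine-tuning checkpoint yields the per-label curves plotted above.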
To better understand how these labels move during fine-tuning, we compute the centroid of each cluster to represent the label position. We select the three closest labels from the POS tagging task and track the paths of each label cluster’s centroid in the last layer of BERT-base during fine-tuning. The following figure shows the 2D PCA projection of these paths.
We observe that before fine-tuning, the centroids of these three labels are close to each other. As fine-tuning proceeds, the centroids move in different directions and end up far away from each other.
We conclude that fine-tuning pushes each label away from the others. This larger gap between labels admits more classifiers consistent with the labels, and allows for better generalization. Note that neither the loss function nor the optimizer explicitly mandates this change. Indeed, since in most cases the labels are originally linearly separable, the learner need not adjust the representation at all.
In the last subsection, we hypothesized that fine-tuning improves task performance by enlarging the gaps between label clusters. A natural inference of this hypothesis is that if a process shrinks the gaps between labels, we should observe decreasing performance. In this subsection, we investigate how fine-tuning for one task affects another.
In this experiment, we fine-tune BERT-base on the PS-role and POS tagging tasks separately and use the fine-tuned models to generate contextualized representations for the PS-fxn task. We choose PS-role because PS-role and PS-fxn are similar tasks. On the other hand, POS tagging and PS-fxn are contradicting tasks: POS tagging requires all prepositions to be grouped together, while PS-fxn requires different prepositions to be far away from each other.
The above table summarizes our cross-task fine-tuning results. The third and fourth columns indicate the number of labels whose minimum distance increased or decreased after fine-tuning. The second column from the right shows the average distance change over all labels. From this table, we observe that fine-tuning on a similar task (PS-role for PS-fxn) still increases the distances (third row), but to a lesser extent compared with the standard fine-tuning process (second row). We also observe a minor improvement when fine-tuning on a similar task (87.75 → 88.53). However, when we fine-tune on a contradicting task, the distances between clusters all decrease (last row), and the performance decreases at the same time (87.75 → 83.24).
In summary, based on the last three subsections, we conclude that fine-tuning injects or removes task-related information from the representation by adjusting the distances between label clusters, even if the original representation is linearly separable. When the original representation does not support a linear classifier, fine-tuning tries to group points with the same label into a small number of clusters, ideally one.
Finally, we analyze the behavior of different layers of the BERT representation. Previous work has already shown that lower layers change little compared to higher layers. Here, we present more about the layer behaviors.
Higher Layers Change More Than the Lower Layers
First, we quantitatively analyze the changes to the different layers. We use each cluster’s centroid to represent each label’s position and quantify its movement by computing the Euclidean distance between the centroids before and after fine-tuning for each label, for all layers. The following figure shows the movements for the POS tagging task using BERT-base.
We can observe that as the layer index increases, the distance becomes larger, suggesting that higher layers change more than lower layers.
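This movement computation can be sketched as follows (the function name `label_movement` is mine; the before/after matrices are assumed to hold the same examples in the same order):

```python
import numpy as np

def label_movement(X_before, X_after, y):
    """Euclidean distance each label's centroid moves during fine-tuning,
    computed from aligned before/after embedding matrices."""
    return {lab: float(np.linalg.norm(X_after[y == lab].mean(axis=0)
                                      - X_before[y == lab].mean(axis=0)))
            for lab in np.unique(y)}
```

Applying this to every layer of the model produces the per-layer movement figure above.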
Higher Layers Do Not Change Arbitrarily
Although we can confirm that higher layers change more than lower layers, we find that the higher layers still remain close to the original representations. To study the dynamics of fine-tuning, we compare the intermediate representation of each layer during fine-tuning to its corresponding original pre-trained one. The similarity between two representations is calculated as the Pearson correlation coefficient of their distance vectors, as described earlier.
One important observation from the above figure is that even though the higher layers change much more than the lower layers, they do not change arbitrarily. Instead, the high Pearson correlation coefficients of the higher layers (more than $0.5$) show a strong linear relation between the original representation and the fine-tuned one, suggesting that fine-tuning pushes each label away from the others while preserving the relative positions of the labels. This means the fine-tuning process encodes task-dependent information while preserving the pre-trained information as much as possible.
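The similarity measure can be sketched as follows, assuming each representation is a matrix with one row per example and rows aligned across the two matrices (the name `representation_similarity` is mine):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def representation_similarity(X1, X2):
    """Pearson correlation between the pairwise-distance vectors of two
    representations of the same examples (rows aligned)."""
    return float(pearsonr(pdist(X1), pdist(X2))[0])
```

A value near 1 means the fine-tuned space preserves the relative geometry of the pre-trained space, even if every point has moved.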
The Labels of Lower Layers Move in Small Regions
To verify whether the lower layers really do not change, for each label we compute the difference between its centroids before and after fine-tuning. The following figure shows the results.
In the above figure, we observe that the movements of labels in lower layers concentrate in a few directions compared to the higher layers, suggesting that the labels in lower layers do change, but do not separate from each other as much as in the higher layers. Note that the motion range of the lower layers is much smaller than that of the higher layers. The two projected dimensions range from \(−1\) to \(3\) and from \(−3\) to \(3\) for layer two, while for layer 12 they range from \(−12\) to \(13\) and \(−12\) to \(8\), suggesting that labels in lower layers only move in a small region compared to higher layers.
In this post, we ask and answer the following three questions:
In most cases, the number of clusters equals the number of labels, so we use the clusters and labels interchangeably. ↩
Weight pruning is another line of work on compressing BERT-style models. In this post, I will survey BERT-related weight pruning papers.
In general, weight pruning means identifying and removing redundant or less essential weights and/or components. Related work can be divided into three categories:
As with data quantization, one characteristic of this line of work is that, after pruning, the weights usually become very sparse. Sparse weights can greatly reduce the model size; however, they do not by themselves speed up inference. As a result, special hardware is needed to achieve a decent computational speed-up after element-wise pruning.
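A minimal sketch of element-wise magnitude pruning, which illustrates why the resulting weights are sparse (this is a generic illustration, not the procedure of any particular paper; ties at the threshold may zero out slightly more than the requested fraction):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude weights until roughly `sparsity`
    fraction of entries are zero."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)
```

The pruned tensor has the same shape as the original, so memory savings only materialize with a sparse storage format, and speed-ups only with hardware that exploits sparsity.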
This line of work empirically shows that the weights of large transformer-based models are indeed redundant. A significant portion of the weights can be removed without hurting the final performance.
Kovaleva et al.
Similarly, Michel et al.
Consistent results are also found in Voita et al.
After empirically showing that the weights are indeed redundant, researchers proposed many pruning methods. One line of work focuses on identifying and removing individual weights of a given model. The importance of each weight can be determined by its absolute value, its gradients, or other measurements defined by designers.
Gordon et al.
Similarly, Sanh et al.
Different from Gordon et al.
Different from element-wise pruning, structured pruning compresses models by identifying and removing component modules, such as attention heads or embedding layers.
Self-attention in transformers is used to let the model find
the most related part of the input. However,
Raganato et al.
Fan et al.
Instead of focusing on the attention heads or entire
transformer layers, Wang et al.
The comparison between weight pruning papers:

| Paper | Memory (BERT: 100%) | Inference | Component | Pretrain | Performance |
|---|---|---|---|---|---|
| Zafrir et al. | N/A | N/A | N/A | N/A | N/A |
| Stock et al. | N/A | N/A | N/A | N/A | N/A |
| Shen et al. | N/A | N/A | N/A | N/A | N/A |
| Gordon et al. | 30~40% | N/A | all weights | True | same |
| Guo et al. | 12%~41% | N/A | all weights | True | worse |
| Sanh et al. | 3%~10% | N/A | layers and heads | False | worse |
| Raganato et al. | N/A | N/A | attention heads | False | similar |
| Fan et al. | 25%~50% | N/A | transformer layers | True | same |
| Wang et al. | 65% | N/A | matrix multiplication | False | slightly worse |
Understanding the representation space is a hot topic in NLP, since distributed representations have brought huge improvements on a variety of NLP tasks and we do not know why. Existing probing methods use a learned classifier’s accuracy, mutual information, or complexity as a proxy for the quality of a representation. However, classifier-based probes can fail to reflect differences between representations because different representations may require different classifiers. In this post:
A learning task can usually be decomposed into three components:
However, many factors can affect the performance of the whole learning process. For example, the optimizer (the algorithm) may stop before convergence, or we may choose a bad initialization point. The above figure lists all the possible factors. Clearly, the quality of the representation space is not the only one.
Preliminary experiments are conducted on the supersense role labeling task with the BERT-base-cased and original ELMo models. Many different classifiers are trained and evaluated, from logistic regression to two-layer neural networks. For each specific classifier, we train \(10\) times using \(10\) different random seeds and record the minimum and maximum accuracy over these \(10\) runs. The following table shows our initial experiment results.
From the table above, we observe:
The above figure and table show that evaluating the quality of representations based on learned classifiers is not reliable. A bad initialization point or a wrong choice of model type can also lead to poor predictive performance. We conclude that poor predictive performance does not necessarily mean a bad representation, and vice versa.
In this post, we study the question: Can we evaluate the quality of a representation for an NLP task directly, without relying on classifiers as a proxy?
Given an NLP task, we want to disentangle the evaluation of a representation \(E\) from the classifiers \(h\) trained over it. To do so, the first step is to characterize all classifiers supported by a representation.
From the viewpoint of a classifier, training means finding a set of parameters such that the classifier’s predictions are consistent with the examples in the training set. Geometrically speaking, each learned classifier is a decision boundary in the high-dimensional space, as shown in the following figure.
The above figure shows a simple binary classification problem. Many classifiers can separate these two classes; the figure shows two linear examples (\(h_1\) and \(h_2\)) and a non-linear one (\(h_3\)). This suggests that a representation \(E\) can admit a set of classifiers that are consistent with the training set, and a learner chooses one of them. Given a set \(\mathcal{H}\) of classifiers of interest, the subset of classifiers consistent with a given dataset is the version space with respect to \(\mathcal{H}\). However, the original definition of version space does not allow errors. Here, to account for errors or noise in the data, we define an \(\epsilon\)-version space: the set of classifiers that achieve less than \(\epsilon\) error on a given dataset.
Suppose \(\mathcal{H}\) is the whole hypothesis space consisting of all possible classifiers \(h\) of interest. The \(\epsilon\)-version space \(V_{\epsilon}(\mathcal{H}, E, D)\) expressed by a representation \(E\) for a labeled dataset \(D\) is defined as:
\[V_{\epsilon}(\mathcal{H}, E, D)\triangleq \{h\in \mathcal{H}\vert err(h, E, D)\le \epsilon\}\]

where \(err\) is the training error.
Note that the \(\epsilon\)-version space \(V_{\epsilon}(\mathcal{H},E,D)\) is just a set of classifiers; it does not involve any learning. Intuitively, a larger \(\epsilon\)-version space makes it easier for the learning process to find a consistent classifier.
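Membership in the \(\epsilon\)-version space is then a simple check on the training error; a minimal sketch (the `predict` argument stands for any trained classifier’s prediction function, and the name `in_eps_version_space` is mine):

```python
import numpy as np

def in_eps_version_space(predict, X, y, eps):
    """h is in the eps-version space V_eps(H, E, D) iff err(h, E, D) <= eps."""
    err = np.mean(predict(X) != y)
    return bool(err <= eps)
```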
Previous classifier-based probing work measures the quality
of a representation by investigating properties of a
specific \(h\in V_{\epsilon}\). For example, some work
restricts the probe model to be linear. Commonly measured
properties include generalization error
Now, the next question is how we can find the \(\epsilon\)-version space for each representation. By definition, \(V_{\epsilon}(\mathcal{H},E,D)\) is an infinite set. Although it is impossible to enumerate all possible classifiers, a geometric perspective does provide some insight into \(V_{\epsilon}(\mathcal{H},E,D)\).
We know that each classifier \(h\in V_{\epsilon}(\mathcal{H},E,D)\) is a decision boundary in the representation space, and it can take any shape, as the following figure shows. On the left, the decision boundaries can be linear or non-linear. On the right, the decision boundary has to be a circle.
We also know that a set of piecewise linear functions can mimic any function. If we use a set of piecewise linear functions to mimic the decision boundaries (the middle ones in the above figure), we find that the space is split into different groups, each of which contains points with exactly one label. Because the pieces are linear, these groups form convex regions in the representation space (the bottom ones in the above figure). Any classifier in \(V_{\epsilon}(\mathcal{H},E,D)\) must cross the regions between groups with different labels; these are the regions that separate labels from each other, shown as the gray areas at the bottom of the above figure. So, an intuitive idea is to use these gray areas to approximate \(V_{\epsilon}(\mathcal{H},E,D)\).
Although finding the set of all decision boundaries remains hard, finding the regions between the convex groups that these piecewise linear functions split the data into is less so. Grouping points in a space is a well-defined problem: clustering. However, in our case, the clustering problem has different criteria:
Now, we successfully transform the problem of finding \(\epsilon\)-version space into a clustering problem with special criteria.
To find clusters that satisfy the above criteria, we propose DirectProbe, a simple yet effective heuristic clustering algorithm:
The above animation illustrates the algorithm. When the new cluster overlaps (green dashed lines) with the red ones, the algorithm stops that merge and finds the next closest pair.
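The merging procedure can be sketched as follows. This is a simplified toy version: the function names `overlaps` and `direct_probe` are mine, the SVM-based hull-overlap test (two point sets are linearly separable iff their convex hulls are disjoint) is an approximation, and the greedy order is a simplification of the actual algorithm:

```python
import numpy as np
from sklearn.svm import LinearSVC

def overlaps(A, B):
    # The convex hulls of two point sets overlap iff the sets are not
    # linearly separable; approximate that test with a near-hard-margin SVM.
    X = np.vstack([A, B])
    y = np.r_[np.zeros(len(A)), np.ones(len(B))]
    return LinearSVC(C=1e6, max_iter=100000).fit(X, y).score(X, y) < 1.0

def direct_probe(X, y):
    """Start from singleton clusters; greedily merge the closest same-label
    pair whose merged set overlaps no other cluster's convex hull."""
    clusters = [(np.array([p]), lab) for p, lab in zip(X, y)]
    merged = True
    while merged:
        merged = False
        # Candidate same-label pairs, closest centroids first.
        pairs = sorted(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))
                    if clusters[i][1] == clusters[j][1]),
            key=lambda ij: np.linalg.norm(
                clusters[ij[0]][0].mean(0) - clusters[ij[1]][0].mean(0)))
        for i, j in pairs:
            cand = np.vstack([clusters[i][0], clusters[j][0]])
            rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
            if all(not overlaps(cand, pts) for pts, _ in rest):
                clusters = rest + [(cand, clusters[i][1])]
                merged = True
                break
    return clusters
```

On well-separated data this terminates with one cluster per label; when no linear separation exists, some labels end up split across several clusters.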
Now we have developed DirectProbe, a heuristic clustering approach that approximates the \(\epsilon\)-version space. Next, we apply it to a variety of representations and tasks and analyze the results.
After applying DirectProbe, we end up with \(n\) clusters.
By using the number of clusters, we can answer the question:
Can a linear classifier fit the training set for a task with a given representation?
To validate our claim, we use the training accuracy of a linear SVM classifier. If a linear SVM can fit perfectly (\(100\%\) accuracy), then there exist linear decision boundaries that separate the labels. The following table shows the results of our experiments. We observe that almost all of the representations we experimented with are linearly separable for most of the tasks. We think this may be why linear models usually work well for BERT-family models. The only exception is RoBERTa-large on the POS tagging task: DirectProbe ends up with \(1487\) clusters, and a linear SVM cannot achieve \(100\%\) training accuracy either.
Note that linear separability does not mean the task is easy or that the best classifier should be a linear one.
The distance between clusters is another important property to consider. We apply DirectProbe to each layer of the BERT-base-cased model for five tasks. For each layer (space), we compute the minimum distance between all pairs of clusters. The next figure shows our results.
The horizontal axis is the layer index of BERT-base-cased; the left vertical axis (blue line) is the best classifier accuracy, and the right vertical axis (red line) is the minimum distance between all pairs of clusters. We observe that both the best classifier accuracy and the minimum distance show similar trends across layers: first increasing, then decreasing. Although the relation is not simply linear, it shows that the minimum distance correlates with the best performance for an embedding space. Using the minimum distances of different layers, we answer the question:
How do different layers of BERT differ in their representations for a task?
The standard pipeline for using contextual embedding models is to fine-tune the original model on a specific task and then deploy the fine-tuned model. Fine-tuning usually improves the performance, which makes us wonder:
What changes in the embedding space after fine-tuning?
We apply DirectProbe to the last layer of BERT-base-cased before and after fine-tuning for five tasks. Similarly, we compute the minimum distances between all pairs of clusters. The next table shows the results.
We can observe that both the best classifier accuracy and minimum distance show a big boost. It means that fine-tuning pushes the clusters away from each other in the representation space, which results in a larger \(\epsilon\)-version space. This verifies our assumption:
A larger $\epsilon$-version space admits more classifiers and allows for better generalization
The distance between clusters can also confuse classifiers. By comparing the distances between clusters, we answer the question:
Which labels for a task are more confusable?
We compute the distances between all pairs of labels based on the last layer of BERT-base-cased^{1}. The distances are evenly split into \(3\) categories: small, medium, and large. We partition all label pairs into these three bins. For each task, we use the predictions of the best classifier to count the misclassified label pairs in each bin. The distribution of all errors is shown in the following table.
We can easily observe that the errors concentrate in the small-distance bin. This means that a small distance between clusters indeed confuses a classifier, and we can detect this without training any classifiers.
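The binning analysis can be sketched as follows (names are mine; `pair_dist` holds the distance for each label pair and `pair_errors` the number of misclassified examples attributed to that pair):

```python
import numpy as np

def errors_by_distance_bin(pair_dist, pair_errors, n_bins=3):
    """Split label pairs into equal-width distance bins and count
    the misclassifications falling in each bin."""
    d = np.asarray(pair_dist, dtype=float)
    edges = np.linspace(d.min(), d.max(), n_bins + 1)
    bins = np.clip(np.digitize(d, edges[1:-1]), 0, n_bins - 1)
    counts = np.zeros(n_bins, dtype=int)
    for b, e in zip(bins, pair_errors):
        counts[b] += e
    return counts
```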
As we discussed earlier, any classifier \(h\in V_{\epsilon}(\mathcal{H},E,D)\) is a predictor for the task \(D\) on the representation \(E\). So, as a by-product, the clusters from DirectProbe can also be used as a predictor. The prediction strategy can be very simple: for a test example, we assign it to its closest cluster. We call this accuracy intra-accuracy. Because this strategy is very similar to the nearest neighbor classification (1-kNN), which assigns the unlabeled test point to its closest labeled point, we also compare with the 1-kNN accuracy. The following figure shows the results.
We observe that intra-accuracy always outperforms the simple 1-kNN classifier, showing that DirectProbe can utilize more information from the representation space. Moreover, the Pearson correlation coefficients between the best accuracy and intra-accuracy (shown in parentheses alongside each task title) suggest a high linear correlation, which means intra-accuracy can be a good predictor of the best classifier accuracy for a representation. From this, we argue that intra-accuracy can be interpreted as a benchmark accuracy of a given representation without actually training classifiers.
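A sketch of the intra-accuracy strategy: each test point is assigned the label of the cluster with the nearest centroid (the actual DirectProbe assignment may measure the distance to the cluster itself rather than to its centroid; the function name is mine):

```python
import numpy as np
from scipy.spatial.distance import cdist

def intra_accuracy(clusters, cluster_labels, X_test, y_test):
    """Assign each test point to the cluster with the nearest centroid."""
    centroids = np.array([c.mean(axis=0) for c in clusters])
    pred = np.array(cluster_labels)[cdist(X_test, centroids).argmin(axis=1)]
    return float(np.mean(pred == y_test))
```

Unlike 1-kNN, which only looks at the single nearest labeled point, this uses the aggregate position of a whole cluster.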
In this post, we asked the question: what makes a representation good for a task? We answered it by developing DirectProbe, a heuristic approach built upon hierarchical clustering that approximates the \(\epsilon\)-version space. By applying DirectProbe, we find:
For all tasks, the last layer of BERT-base-cased is linearly separable, so the number of label pairs equals the number of cluster pairs. ↩
Since 2018, BERT
Data quantization refers to representing each model weight using fewer bits, which reduces the memory footprint of the model and lowers the precision of its numerical calculations. There are a few characteristics of data quantization:
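As a concrete illustration of the idea (generic symmetric int8 quantization, not the scheme of any specific paper below):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform quantization of a float weight tensor to int8,
    plus the scale needed to dequantize."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Each weight now occupies one byte instead of four, at the cost of a small rounding error that grows with the spread of the weights.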
A straightforward idea is to directly truncate the weight bits to the target bandwidth. However, this operation usually leads to a large drop in accuracy (Ganesh et al.)
There are not many papers that directly quantize models of the BERT family.
Zafrir et al.
Similarly, Fan et al.
Instead of quantizing all the weights into the same
bandwidth, Shen et al.
The comparison between data quantization papers:
| Paper | Memory (BERT: 100%) | Inference | Component | Pretrain | Performance |
|---|---|---|---|---|---|
| Zafrir et al. | 25% | N/A | Float Bits | False | very close |
| Fan et al. | 5% | N/A | Float Bits | False | worse |
| Shen et al. | 7.8% | N/A | Float Bits | False | slightly worse |
In this very first post, I will explain why computers need a different representation of language and what has been done along this path. I want to present a “map” of this area to understand where we are on the road to our ultimate target: giving instructions to computers by talking.
Although talking to a computer is a very appealing idea, in reality it is still a hard problem. Why? Because computers “think” differently from humans. Language is full of ambiguity, which in some ways is an advantage because it enables us to use a finite set of symbols to describe an infinite world. However, this ambiguity does not work for computers. Computers cannot understand ambiguity because they only deal with numbers, which are deterministic. For example, computers can perform addition or multiplication, but they cannot understand the difference between the fruit “Apple” and the company “Apple”. To a computer, these two words are the same symbol and thus carry the same meaning.
To let computers handle the ambiguity of natural language, we need to “translate” natural language into representations that computers can work with. The primary element of this translation is translating words, because words are the smallest elements that carry semantic meaning. Over the years, researchers have developed many different representation methods, which evolved through various stages. The following is a summary of these methods.
In the field of word representation, researchers have come a long way. Here, I try to divide this history into different stages. Note that this is only my perspective and not a standard division.
In this post, I will briefly talk about these different methods. More details will be presented in the following posts.
Dictionary lookup is a straightforward idea that existed long before computers (that’s why I call it prehistory). One sentence can describe this method: map each word to a unique number. This unique number is used as the representation of the corresponding word.
Suppose we have a small language, which only consists of three words: I, am, and groot. Now, we need to create a representation of these three words. By assigning each word a unique number, we can create a lookup dictionary shown below:
| Words | ID |
|---|---|
| I | 0 |
| am | 1 |
| groot | 2 |
By creating such a mapping (I$\rightarrow 0$, am$\rightarrow 1$, groot$\rightarrow 2$), we successfully represent the whole language as numbers, which computers can understand. For each given word, we can return its representation by looking it up in the dictionary. Although this is a simple idea, it builds the foundation for all other “modern” methods.
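In code, the whole method is just a lookup table:

```python
# The "dictionary lookup" method: a word-to-ID table for our toy language.
vocab = {"I": 0, "am": 1, "groot": 2}

def encode(words):
    """Replace each word with its unique number."""
    return [vocab[w] for w in words]
```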
A straightforward example is shown in the following figure:
This figure is a table of Morse code. Each letter is mapped to dots and dashes, so the figure forms a simple dictionary. Morse code existed long before the creation of computers, which is why I call dictionary lookup a prehistoric representation method.
Since we call it prehistory, it must have many problems. Mapping to numbers is the only advantage of this dictionary lookup method. There are many drawbacks to this method.
To overcome all these drawbacks, researchers keep trying to develop new representation methods, which leads us to the next stage: Middle Age.
Now we enter the middle age of word representation. In this age, one-hot encoding, along with its variant Bag-of-Words, became the most commonly used method.
The idea of one-hot representation is also simple: create a vector filled with zeros except for the dimension of the word you want to represent, which is set to one. Following the example language of I am groot, we can have three vectors of size \(4\):
|  | I | am | groot | UNKNOWN |
|---|---|---|---|---|
| I | 1 | 0 | 0 | 0 |
| am | 0 | 1 | 0 | 0 |
| groot | 0 | 0 | 1 | 0 |
As we can see, each dimension represents one word. Only the dimension of the word being represented is set to \(1\). For example, the word I is represented as \([1, 0, 0, 0]\). The extra dimension is used for out-of-vocabulary words.
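A minimal sketch of one-hot encoding for this toy vocabulary:

```python
import numpy as np

vocab = {"I": 0, "am": 1, "groot": 2}
UNKNOWN = 3  # extra dimension for out-of-vocabulary words

def one_hot(word):
    """A vector of zeros with a single 1 at the word's dimension."""
    v = np.zeros(len(vocab) + 1)
    v[vocab.get(word, UNKNOWN)] = 1
    return v
```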
The most important advantage of this representation is that it eliminates the wrong ordering assumption of Dictionary Lookup: each word is represented equally. However, this representation still suffers from the other two drawbacks of Dictionary Lookup:
Although one-hot representation has drawbacks, it was widely used years ago. One important extension is the Bag-of-Words representation. Instead of representing a single word, Bag-of-Words is usually used to represent a sentence or a document. One can see Bag-of-Words as the summation of the one-hot representations of each word in a sentence. Following are two example sentences composed in our example I am groot language:
Because not does not belong to the language vocabulary, we have to use the UNKNOWN dimension to represent it.
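The summation view can be sketched as follows (a minimal illustration reusing the toy vocabulary; word counts accumulate in each dimension):

```python
# Bag-of-Words as the sum of one-hot vectors, for the toy
# "I am groot" language with an UNKNOWN dimension at the end.
vocab = ["I", "am", "groot"]

def bag_of_words(sentence):
    vec = [0] * (len(vocab) + 1)
    for word in sentence.split():
        idx = vocab.index(word) if word in vocab else len(vocab)
        vec[idx] += 1
    return vec

print(bag_of_words("I am groot"))      # [1, 1, 1, 0]
print(bag_of_words("I am not groot"))  # [1, 1, 1, 1]  ("not" -> UNKNOWN)
```

The word order of the sentence is completely lost in the resulting vector, which previews one of the drawbacks listed next.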
Bag-of-Words suffers from several drawbacks:
The next stage is the Enlightenment Age, in which the two most important ideas for representing words were proposed. These two ideas had a significant impact on the history of natural language processing research.
The first idea is a famous assumption:
You shall know a word by the company it keeps
This assumption was proposed by John Rupert Firth. I cannot find the exact source of this quotation, but I can find a similar one in Firth's writing:
…the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously.
This assumption laid the foundation for all the learning models that followed to produce word embeddings. What does it mean? It means that the meaning of a word is decided by its context. For example, suppose we have the following sentence:
\[\text{The fluffy __ barked as it chased a cat.}\]Which word should fill this blank? A natural guess is dog. Why can we guess the meaning without seeing the word? Because the context (the other words in the sentence) determines the meaning of the blank. I think this sentence illustrates well the context assumption Firth proposed.
The other important idea is the proposal to use low-dimensional dense vectors to represent words. Using a feature learning model, each word is represented by a dense vector of a fixed dimension that is low compared to the size of the dictionary. A critical property of these dense vectors is that words with similar meanings have similar vectors. We usually call such a dense vector a word embedding or a distributed representation. In the rest of this post, I will use these two terms interchangeably.
Let us see an example of what a dense vector looks like. Recall that in the one-hot representation, we represent the word groot as a binary vector:
\[\text{groot} \rightarrow [0, 0, 1, 0]\]As a dense vector, groot is represented as:
\[\text{groot} \rightarrow [2.56, -0.55, 5.30, -4.06, 0.82]\]The difference is that every dimension now holds a real number instead of mostly zeros. The dimension of this dense vector is fixed: the size of the vocabulary does not change it.
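A common way to check that "similar words have similar vectors" is cosine similarity. The sketch below uses made-up vectors for hypothetical words; the numbers are illustrative, not taken from any trained model:

```python
import math

# Cosine similarity: the standard way to compare dense word vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative, made-up embeddings (not from a real model).
groot = [2.56, -0.55, 5.30, -4.06, 0.82]
tree  = [2.40, -0.60, 5.10, -3.90, 0.90]   # hypothetical similar word
bank  = [-1.20, 3.10, -0.50, 2.20, -2.70]  # hypothetical unrelated word

print(cosine(groot, tree))  # close to 1 (similar direction)
print(cosine(groot, bank))  # much lower (different direction)
```

In a well-trained embedding space, semantically related words point in similar directions, so their cosine similarity is close to 1.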
The earliest detailed analysis of distributed representations I can find is in Chapter 3 of Parallel Distributed Processing.
However, although these two insightful ideas had been proposed, distributed representations did not become popular because, at that time, no good techniques were available to produce high-quality word embeddings. This situation continued until 2003, when Yoshua Bengio published the paper “A Neural Probabilistic Language Model”.
Although Yoshua Bengio pointed out a promising way to produce word embeddings, training them was costly because the neural network involves a lot of computation. Word embedding models remained out of the spotlight in NLP for about another ten years.
In 2013, a team at Google led by Tomas Mikolov published the word embedding toolkit word2vec, which can train word embeddings much faster than previous approaches. It is this word2vec model, introduced in Mikolov's famous paper “Efficient Estimation of Word Representations in Vector Space”, that made the word embedding idea take off.
The classic analogy \(\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})\) shows that the embeddings produced by word2vec capture at least some of the semantic meaning of words. Our representations are finally not just meaningless symbols; they carry the meanings of words.
I call this the Industrial Age because, as in the Industrial Age of human history, significant improvements were made to the forces of production. Since the publication of word2vec, much research has followed the direction of distributed representations. Within a year of word2vec's creation, almost every NLP task had been integrated with word embeddings.
However, although word2vec achieved the best performance on many NLP tasks at the time, people found some drawbacks in word2vec-style models.
To overcome the OOV (out-of-vocabulary) problem, Facebook's AI Research (FAIR) lab created another famous embedding model in 2017: fastText.
Before entering the next age, we need to understand the difference between two concepts: type vectors and token vectors.
Type Vector: A type vector is context-independent. The same word in different sentences (contexts) has the same representation. For example, the word “bank” in the sentences “I need to deposit some money to the bank” and “I am walking along the river bank” has the same representation.
Token Vector: A token vector is context-dependent. The same word in different sentences (contexts) has different representations. For example, the word “bank” in the sentences “I need to deposit some money to the bank” and “I am walking along the river bank” has entirely different representations because it carries a different meaning in each sentence.
Clearly, word2vec produces type vectors. We will talk about token vectors in the next age.
If the creation of type vectors (the word2vec model) marks the Industrial Age, then the creation of BERT marks the modern era. In 2018, Google published another important paper in the history of word representation: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”.
BERT improved almost every single NLP task, and its authors call it the start of a new era of NLP. Instead of creating one embedding per word type, BERT creates embeddings for tokens; it generalizes word2vec-style models from type vectors to token vectors. For example, the word “bank” in the sentences “I need to deposit some money to the bank” and “I am walking along the river bank” has completely different representations.
Also, BERT creates a new scheme for using word embeddings. Previously, we trained embedding models on unlabelled text and applied the resulting embeddings to a specific task without updating the embedding model. With BERT, we pre-train the model on a large unlabelled text corpus, then fine-tune it on a specific task to adjust (update) the model and achieve better performance. After fine-tuning, the model is applied to the test set.
Like the creation of word2vec, the creation of BERT was followed by many related models and works. BERT has become a hot topic in recent years. Although BERT may open a new representation era for us, there is still much we do not know.
In this post, I summarized the development history of word representations in a straightforward way. I did not mention a lot of excellent work (I will talk about it in other posts) because I wanted to keep this post compact enough to deliver the whole picture of word representation research. I also left out the details of the models mentioned here: this post is not about explaining models but about laying out their development history. For each age, I emphasized the key improvement to show why the new models mark a new era.
Starting from this post, I will keep writing about research in the word representation field. I hope you enjoy this series of posts.
A reasonable way is to estimate the size of the subset based on sampling.
Let’s first set up some background. We assume the whole set is \(Y\), and denote its size by \(\vert Y\vert\). The set \(S\subset Y\) is the subset, and \(\vert S\vert\) is what we need to estimate.
In Weston et al., the sampling process is as follows:
Keep randomly choosing (with replacement) an element \(y\) from \(Y\) until \(y\in S\). Record the number of draws, denoted \(N\). Then the size of \(S\) can be estimated as \(\vert S \vert=\frac{\vert Y \vert}{E[N]}\approx\frac{\vert Y\vert}{N}\).
Proof:
The proof is simple. We use \(p\) to denote the probability that a uniformly drawn element \(y\) belongs to \(S\):
\[p=Pr[y\in S] = \frac{\vert S \vert}{\vert Y \vert}\]Based on the sampling process, we can get:
\[Pr[N=i]=(1-p)^{i-1}p\]When \(N=i\), we sampled \(i\) times, meaning the first \(i-1\) draws all failed to pick an element belonging to \(S\). This is a geometric distribution, so its expectation is:
\[E[N] = \frac{1}{p} = \frac{\vert Y\vert}{\vert S \vert}\]Transform the equation above:
\[\vert S \vert = \frac{\vert Y \vert}{E[N]}\]Also, we know that as the number of sampling rounds goes to infinity, the empirical mean equals the expectation:
\[E[N] = \lim\limits_{m\rightarrow\infty}\frac{1}{m}\sum\limits_{i=1}^m N_i\]where \(N_i\) denotes the number of draws in the \(i^{th}\) round of sampling. To estimate quickly, we can set \(m=1\), which finally gives:
\[\vert S\vert \approx \frac{\vert Y\vert}{N}\]Apart from this sampling method, there are other options. One choice is to use the binomial distribution. The process is as follows:
Keep randomly choosing (with replacement) an element \(y\) from \(Y\) for \(N\) times, then count how many of these \(N\) elements belong to \(S\); denote this count \(M\). Clearly, \(M \le N\). Then the size of \(S\) can be estimated as \(\vert S\vert=\frac{E[M]\vert Y \vert}{N}\approx \frac{M\vert Y \vert}{N}\).
Proof:
Similarly, we use \(p\) to denote the probability that any element \(y\in Y\) belongs to \(S\):
\[p=Pr[y\in S] = \frac{\vert S \vert}{\vert Y\vert}\]Based on this sampling method, we know:
\[Pr[M=i] = \binom{N}{i}p^i(1-p)^{N-i}\]This means we independently choose (with replacement) \(N\) elements from \(Y\), of which \(i\) belong to \(S\). This is a binomial distribution, so its expectation is:
\[E[M] = pN = \frac{\vert S \vert}{\vert Y\vert}\cdot N\]Rewrite this as:
\[\vert S\vert = \frac{E[M]}{N}\cdot \vert Y\vert\]Based on the definition of expectation:
\[E[M] = \lim\limits_{m\rightarrow\infty}\frac{1}{m}\sum\limits_{i=1}^m M_i\]Similarly, we can set \(m=1\), which gives:
\[\vert S\vert\approx \frac{M\vert Y\vert}{N}\]To compare these two methods, I ran some experiments. The results are shown below.
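Both estimators can be simulated in a few lines. The set sizes and sample counts below are my own illustrative choices, not the settings used in the experiments:

```python
import random

# Monte Carlo sketch of both estimators for |S|, where S ⊂ Y.
# Here Y = {0, ..., 9999} and S = {0, ..., 499}, so the true size is 500.
random.seed(0)
size_Y, size_S = 10_000, 500

def in_S():
    # Draw one uniform element of Y and check membership in S.
    return random.randrange(size_Y) < size_S

# Geometric method: repeat "draw until the first hit" over many rounds,
# average the draw counts to approximate E[N], then |S| ≈ |Y| / E[N].
def geometric_estimate(rounds=2000):
    total = 0
    for _ in range(rounds):
        n = 1
        while not in_S():
            n += 1
        total += n
    return size_Y / (total / rounds)

# Binomial method: draw N samples, count the hits M, then |S| ≈ M|Y|/N.
def binomial_estimate(n=10_000):
    m = sum(in_S() for _ in range(n))
    return m * size_Y / n

print(round(geometric_estimate()), round(binomial_estimate()))
```

With enough rounds both estimates concentrate near the true size of 500; the single-sample shortcut in the derivations above corresponds to `rounds=1` and a single batch of \(N\) draws.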
I first compare the estimation accuracy of the two sampling methods, producing 4 plots based on different ratios \(\vert S \vert / \vert Y\vert\).
From these four plots, we can conclude:
When the ratio \(\vert S \vert / \vert Y\vert\) is large (\(0.5\)), the binomial method is more stable; as the ratio becomes small (\(0.0005\)), the geometric method becomes more stable.
Another important aspect of sampling is time. After all, what we want is to accelerate the computation. The following 4 plots show the time comparison under the same settings.
The conclusion is:
When the ratio \(\vert S \vert / \vert Y\vert\) is large (\(0.5\)), the geometric method is faster than the binomial method; as the ratio becomes small (\(0.0005\)), the geometric method takes more and more time and the binomial method gains the advantage.