Extreme Multi-label Text Classification:Kim-CNN & XML-CNN

This passage is a learning note about a paper talking about the extreme multi-label text classification.


XMTC -> Extreme Multi-label Text Classification

Finding each document its most relevant subset of labels from an extremely large space of categories.

Training data: {(xi, yi)}n1, xi ∈ X, yi ∈ {0, 1}L
X is the data, y is the label.

Learning a mapping g: X -> {0, 1}L
Our goal is finding a mapping from x to y.

Each document xi is associated with a set of relevant labels, denoted by label vector yi.

Two key challenges in XMTC

When n, L, D are large:

  1. Scalability -> both in training and testing (n -> the number of document; D -> the number of feature)
  2. Data sparsity

So, how to extract richer features representation and how to exploit label correlations are the key challenges.

  • Target-embedding methods

Compress label vectors in target space down to low-dimensional embeddings.

  • Tree-based ensemble methods

Recursively partitions instance space to induce a tree structure.

  • Deep learning for text classification

Automatically extract features from raw text.
Has been remarkable success in multi-class classification.


It instead of using just a bag-of-word features, the roll text is fed into the model.

The resolution may not be that good, so here is a sequence of words, and each word is replaced by the word embedding. So you have a vector here. It’s a word embedding with the associated word.

So the input will form a n by k matrix, where n is number of words in a document, and k is dimension of the word embedding. You can think of this like image, and we are doing convolutional flitters on this image.

Then, what they do is they place one deconvolution flitter sliding through the time dimension.

So the first part is extracting a generalized version of the n-gram features, compared to the traditional bag-of-word or bigram or trigram features.

The second part, the red box there, they are using the filter size of two, but you can also use three or four and so on. That’s way, you are extracting more generalized bigram or trigram features.

After that, you have a feature map, and what they do, they just do max pooling to extract the most strongest signal in that feature map and stack them all together. And finally is the fully connected layer.

So this is a brief introduction of the current strongest method in multi-class classification.

There are several different parts, roughly three.

XML-CNN (Extreme Multi-label CNN)

The first is the convolution, The red box is the convolution filter. We slide through the convolution filter in another dimension. Here we do it opposite direction that we swipe through the convolution filter through the word embedding dimension. You can think in an image is a spatial dimension. So the motivation is like each filters is capturing Moore’s global information, given the whole sequence.

So flying through different dimension of the word embedding is like capturing the most salient features of word among the entire document. So we also have filter of size 248 et cetera that capturing different spatial relations among the embedding matrix.

After that, we also have a feature map. What we do is not the traditional max pooling. What we do is like adaptive max pooling that extract, two to three most secure known features in the feature map. So if we are doing that, the final feature representation here will be even two to three times larger than the previous method. In the extreme multi-label setting, we know that the final output number of label here is very huge.

At last, we need to make a low-rank matrix factorization here. So there is a safe 512th hidden representation in the middle that first project this long feature map to a lower dimension and then we do fully connect.

Memory Consumption: A case study


Vocabulary size: V(30k)
Word embedding dimension: D(300)
Document length: S(500)
Number of labels: L(670K)
Hidden = 512
Pooling units = 128 Conv1D filters: 32
filter sizes = [2, 4, 8] Total number of parameters: Embedding layer: V*D = 9M Conv1D layer: S*32*(2+4+8) = 32K + 64K + 128K = 224K Hidden layer: (128*32*3)*512 = 6.29M Output layer: 512*670K = 343M Total: 358.51M

If using floating precision, the memory for medal parameters is around 1.33GB.

The analysis here is just the minimum of memory you use because when you are doing back propagation and the mini batches, that also depends on the mini batch size you use.

Choice of Loss

We know that if in the Multi-class classification, if you use Softmax, the model will favor only one label and pushing other label to zero.

So in the XMTC setting, you mostly put zero probability output on the other labels. But actually, in this kind of dataset, because each document is only tag with the most relevant, say five or ten labels, it doesn°Øt mean that other label is not relevant.

Using Softmax may not be that good choice, so we consider the most naive Binary Cross-entropy using the Sigmoid.

So how do we learn the model parameter θ?

We simply use the mean-square out less, the loss function here.

Denotes the loss function L(~, ~). The energy function is parameterized by a neural network E(y; x, θ)

First, we need to find out the prediction yi hat, by solving the inference problem and of the energy network, given the fixed model parameter θ. So, exact solving the inference may not be feasible. Then what they do?

Exact solving the inference may be infeasible, solved approximately by gradient descent with fix iterations.

They just do gradient descent with a fixed maximum iteration number, for instance, say just 5. So the do five times obtaining the prediction yi hat and fix that to calculate the gradient based on feature.

Test-time Optimization

  • Computation Graph

So actually calculating the gradient with respect to model parameter θ could be somehow tricky. So, let’s look at the computational graph here. This is the sketch on my note.

The input is x, and you path this through some feature network. Use cache the feature and you will calculate, you first make a forward path based on this to calculate the prediction y. So you need to calculate the gradient with respect to y here in the 3rd box is a model part. And in the backward pass, you also want to calculate the gradient with respect to the model parameter θ.

Now you need to visit multiply times through the model box, for example, the first path it’s like this, the upper part. And the second part is like that, the middle part, because the yi is sort of like the final prediction of y is a dependency among the previous state and previous stat is an input function.

The last, you need to make the derivative multiple times path. But this is the detail and another work in the year is ICML.

ICML(Input Convex Neural Networks)

ICML, which is very similar to the precious structure Prediction Energy Network, except one thing, The design architecture of energy network.

In this work, they design the energy network to be convex with respect to the y. The benefit of that is when you do the test time optimization, solving inference problems put, this will have a global optimal solution.

Of course, to design the network to be convex, you have some assumption or can say constraint need to be made on with parameter W. W(z)1:k-1 are non-negative, activation functions are convex and non-decreasing.

Something to say

In recent months,I have so much to do, such as TOEFL, GRE and final exam, that I don’t have time to update my blog. At the same time, I find that that writing notes by hand is more convenient and comfortable than using Word,which is another reason for reducing updates.

The following photo is two pages of this motes, and please don’t care about ugly handwritings.