ACL 2018 Notes
Shuai Tang
shuaitang93@ucsd.edu
Andrej Zukov-Gregoric
andrej.zukovgregoric@blackrock.com
Contents
1 Sunday, July 15th: Tutorials
  1.1 Neural Approaches to Conversational AI
  1.2 Neural Semantic Parsing
2 Monday, July 16th: Day 1
  2.1 Probabilistic FastText for Multi-Sense Word Embeddings
  2.2 Unsupervised Learning of Distributional Relation Vectors
  2.3 Explicit Retrofitting of Distributional Word Vectors
  2.4 Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External Commonsense Knowledge
  2.5 Multi-Relational Question Answering from Narratives: Machine Reading and Reasoning in Simulated Worlds
  2.6 Simple and Effective Multi-Paragraph Reading Comprehension
  2.7 Semantically Equivalent Adversarial Rules for Debugging NLP Models
  2.8 On the limitations of Unsupervised Bilingual Dictionary Induction
  2.9 Generating Sentences by Editing Prototypes
  2.10 A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings
3 Tuesday, July 17th: Day 2
  3.1 Batch IS NOT Heavy: Learning Word Representations From All Samples
  3.2 Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations
  3.3 Guess Me if You Can: Acronym Disambiguation for Enterprise
4 Wednesday, July 18th: Day 3
  4.1 What you can cram into a single vector: Probing sentence embeddings for linguistic properties
ACL 2018 has ended here in Melbourne, so we decided to spend some time organising our notes
into a more coherent document. The goal of this document is to help you quickly gain an overview
of the current directions in which NLP is moving.
1 Sunday, July 15th: Tutorials
It begins! Sunday starts with Tutorials. They’re divided into morning and afternoon ones. I spent
the morning at Neural Approaches to Conversational AI followed by Neural Semantic Parsing in
the afternoon.
1.1 Neural Approaches to Conversational AI
The tutorial is led by Jianfeng Gao and Michel Galley from MSR Redmond and Lihong Li from
Google Seattle.
Jianfeng opens by saying that dialogue can be viewed as a unifying task in NLP because it necessi-
tates the solving of other tasks such as:
1. reading comprehension for extracting information from dialogue history
2. semantic parsing done either explicitly or implicitly, for being able to reason across KBs
3. dialogue state tracking for being able to encode the state of a conversation across time
4. natural language understanding for being able to identify the intents of dialogue partici-
pants
5. text generation for being able to formulate responses or ask clarification questions.
The tutorial identifies three categories of dialogue agents: (1) question answering agents, (2) task-
oriented dialogue agents, and (3) social bots.
Question Answering Agents
Jianfeng continues by saying that we can divide QA agents into two further categories: (1) text-QA
agents which attempt to answer questions across passages of text (think reading comprehension à la
SQuAD & answer sentence selection à la WikiQA) and (2) KB-QA agents which attempt to answer
questions across knowledge bases.
Papers to read:
Stochastic Answer Networks for Machine Reading Comprehension [20]
Link Prediction using Embedded Knowledge Graphs [28]
A Knowledge-Grounded Neural Conversation Model [11]
Task-oriented agents
Following speech act theory in linguistics, it is clear we often partake in dialogue with the intent
of achieving something. Modelling intent over time thus becomes crucial. Typically, task-oriented
agents are architected to have
1. a natural language understanding (NLU) module for identifying intents and slot filling;
2. a state tracker for tracking conversation state;
3. a dialogue policy which selects the next action based on the current state;
4. a natural language generator (NLG) for response generation.
The tutorial goes on to cover a few recent papers on the above: E2E memory networks [32], Neural
Belief Tracker [23], Hierarchical Policy Learners [25], Integrating Planning for Dialogue Policy
Learning [24], Hybrid Code Networks [34], and An E2E Neural Dialogue System [18].
Finally, the tutorial mentioned the Microsoft Dialogue Challenge at SLT-2018.
Social bots
The rest of the tutorial covered social bots. These are dialogue systems that are fully data-driven
in the sense that they need little interaction with the user’s environment (no need for API calls or
reasoning across KBs) and such models cope well with free-form dialogues.
Papers to read:
Although task-oriented dialogue systems and social bots were originally developed for different
purposes, there is a trend of combining both as a step towards building an open-domain dialogue
agent. For example, on the one hand, [11] presented a fully data-driven and knowledge-grounded
neural conversation model aimed at producing more contentful responses without slot filling. On
the other hand, [37] proposed a task-oriented dialogue agent based on the encoder-decoder model
with chatting capability. These works represent steps toward end-to-end dialogue systems that are
useful in scenarios beyond chitchat.
1.2 Neural Semantic Parsing
Sunday afternoon. Time for tutorial two. The jet lag is real.
The tutorial was led by Luke Zettlemoyer and his colleagues Pradeep Dasigi, Srini Iyer, Alane Suhr,
and Matt Gardner. Slides can be found at https://github.com/allenai/acl2018-semantic-parsing-tutorial.
Luke Zettlemoyer began the tutorial by summarizing recent events in the field. In short, the guiding
motivation behind semantic parsing is that the meaning behind language can be condensed into a
logical form. Parsing out this form should thus be of interest if we are to elucidate the semantic
meaning behind natural language. Historically, researchers tackled this problem using combinatory
categorial grammars (CCGs), but by 2016/17, with the advent of sequence-to-sequence models, it
was clear that neural approaches to parsing text into logical form performed better.
Next up was Alane Suhr, who gave a quick overview of the various datasets in the field. Briefly, she
divided datasets into four categories
1. traditional semantic parsing datasets where the goal is to generate executable representa-
tions,
2. datasets grounded in some environment (think NLVR),
3. Broader domain datasets such as the AMR dataset where the goal is to map any English
sentence into a formal AMR representation, or WikiTableQuestions
4. Sequential language understanding datasets which model sequences of natural language
utterances paired with some logical form (ATIS dataset, SCONE, SQA)
Pradeep Dasigi continued the tutorial by talking about constrained decoding. Traditional semantic parsers used grammar-based parsing algorithms, whereas newer neural semantic parsers use encoder-decoder architectures. However, decoders can generate outputs which are not syntactically or semantically valid. How do we constrain the space of possible generations to be more valid? Two models were discussed: Seq2Tree, which produces syntactically (but not necessarily semantically) valid output, and neural semantic parsing with type constraints, which also produces semantically valid outputs by generating from a grammar that guarantees the generated logical form is well-typed.
Pradeep continued by talking about how semantic parsers are commonly trained. To recap, a common task is: given a question and some context, such as a database, as input, map them to an output logical form. Manually annotating these logical forms is expensive, which makes it hard to pose this as a fully supervised problem. Instead, we can turn to weaker forms of supervision: rather than optimizing for the correct logical form, we optimize for the correct answer as generated by the logical form we output.
There are three common training methods in this space:
1. Maximum Marginal Likelihood, where we want to maximize the probability of an answer given our input; since we output logical forms, we have to marginalize over all logical forms which yield our answer to get its conditional probability. Since this is intractable, there are heuristics for searching across logical forms, which can be either online or offline.
2. Structured Learning Methods, where we try to maximise margins or minimize expected risks.
3. Reinforcement Learning Methods.
Srini Iyer talked about semantic parsing as code generation.
Alane Suhr came back to talk about Context-Dependent Language Understanding. Clearly logical
forms can be constrained even more across time such as in a dialogue setting. For example, if
I prompt a system to “Show me flights from Seattle to Boston next Monday” followed by “On
American Airlines” I am clearly making the second prompt dependent on the first and as such
constraining the space of possible logical forms.
To model this, we can make prompts sequentially dependent, whereby the decoder at time t depends not only on the prompt at time t but also on some condensed representation of previous prompts. This is achieved with a turn-level encoder which produces, at each turn, a vector representation of the history of prompts so far. These representations are then concatenated with every input word embedding entering the encoder at the next time step. Additionally, the decoder can be made dependent on previously outputted queries. The goal is to minimize the token-level cross-entropy loss of the generated output SQL queries.
Interestingly, because each batch consists of an entire interaction, we backpropagate through an entire interaction, which means our gradients are sensitive to the length of our interactions. To remove this sensitivity, an interaction loss can be introduced: a term that rescales the loss to re-normalize it based on the length of the current interaction batch.
Another interesting insight is that positional embeddings can be added to the input hidden states of an attentional module by concatenating to them the positions at which they appear, in reverse order, e.g.: [3; h0], [2; h1], [1; h2], [0; h3].
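A minimal sketch of this reverse-position concatenation (plain NumPy, with a made-up hidden size; the variable names are ours):

```python
import numpy as np

# Hidden states h_0 .. h_3 for a 4-token prompt (hypothetical 5-dim states).
hidden_states = [np.random.randn(5) for _ in range(4)]

# Concatenate each state with its position counted from the end:
# [3; h0], [2; h1], [1; h2], [0; h3].
augmented = [
    np.concatenate(([len(hidden_states) - 1 - i], h))
    for i, h in enumerate(hidden_states)
]
print([a.shape for a in augmented])  # each state grows by one dimension
```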
State of the art performance is achieved on (a modified) ATIS dataset and a detailed ablation study
is provided.
The new CONALA dataset (https://conala-corpus.github.io/) was presented.
Open questions: clarification questions, latent decisions, and learning to distinguish between meaning derived from the current utterance vs. the interaction history.
Finally, Matt Gardner introduced AllenNLP (https://allennlp.org/).
2 Monday, July 16th: Day 1
The World Cup final started at 1:00AM local time. Needless to say, many were tired.
2.1 Probabilistic FastText for Multi-Sense Word Embeddings
Ref: [2]
Background: There are two types of word embeddings: (1) dictionary-level embeddings such as word2vec, and (2) probabilistic word embeddings, where words are assigned a distribution instead of a point in vector space. Probabilistic FastText (PFT) provides probabilistic character-level representations of words. In probabilistic embeddings every word w is associated with a density function such as a single Gaussian or a Gaussian mixture with K components. Individual Gaussian components are good at representing the multiple senses of a word.
Motivation: Create a probabilistic FastText which can better capture the multiple senses of words.
Method: In the simple case a single Gaussian is used. The mean holds much of the semantic information and in the single-Gaussian case is a function of both the n-gram and dictionary features:

$$\mu_w = \frac{1}{|NG_w| + 1}\Big(v_w + \sum_{g \in NG_w} z_g\Big)$$

where $z_g$ is the vector representation of n-gram feature $g$, $NG_w$ is the set of n-gram features for word $w$, and $v_w$ is its dictionary representation.
The model parameters to be learned are $z_g$ and $v_w$. A margin loss is used to push the energy between a word and a true context pair to be higher than between the word and a negative context pair. The energy between two words is defined to be an expected likelihood kernel (the closed form of which is given in the paper).
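A toy sketch of the single-Gaussian mean above, assuming we already have dictionary and n-gram vectors; the names `pft_mean`, `ngram_vectors` and `dict_vectors` are ours, not the paper's:

```python
import numpy as np

def pft_mean(word, ngram_vectors, dict_vectors, n=3):
    """Single-Gaussian mean: average of the dictionary vector and the word's
    character n-gram vectors (a sketch of the formula above; the real model
    also learns covariances and mixture weights)."""
    ngrams = [word[i:i + n] for i in range(len(word) - n + 1)]
    vecs = [ngram_vectors[g] for g in ngrams if g in ngram_vectors]
    total = dict_vectors[word] + sum(vecs)
    return total / (len(vecs) + 1)

# Toy example with random 4-dimensional vectors.
rng = np.random.default_rng(0)
dict_vectors = {"where": rng.normal(size=4)}
ngram_vectors = {g: rng.normal(size=4) for g in ["whe", "her", "ere"]}
print(pft_mean("where", ngram_vectors, dict_vectors))
```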
Results: Evaluated on a range of word-similarity datasets. In the multivariate Gaussian setting it achieves state-of-the-art results at 50 dimensions, though it is less competitive at 300 dimensions. Still, overall it outperforms older methods on the larger word-similarity datasets even at 300 dimensions.
2.2 Unsupervised Learning of Distributional Relation Vectors
Ref: [15]
Background: A remarkable property of word embeddings is that they capture lexical relationships beyond mere similarity; e.g., "a is to b what c is to ?" can easily be answered. More complicated relationships, such as the fact that Macron succeeded Hollande as president, are harder to capture with word embeddings. The paper proposes to learn a relation vector $r_{st}$ between two words $s$ and $t$ in an unsupervised fashion.
Motivation: Traditionally $r_{st}$ is built by averaging the embeddings of the words occurring between $s$ and $t$, but this has two big drawbacks: (1) many words occurring between $s$ and $t$ will be semantically related to $s$ or $t$ but not descriptive of the relationship between the two; (2) it gives too much weight to stop-words, which cannot simply be removed because certain stop-words are crucial for modelling relationships.
Method: First the authors propose a modified GloVe model which, instead of learning

$$\sum_i \sum_{j: x_{ij} \neq 0} f(x_{ij})\big(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log x_{ij}\big)^2,$$

learns the following instead:

$$\sum_i \sum_{j \in J_i} \frac{1}{\sigma_j^2}\big(w_i \cdot \tilde{w}_j + \tilde{b}_j - \mathrm{PMI}_S(i, j)\big)^2$$
The use of smoothed frequency counts and residual-variance-based weighting makes the word embeddings more robust to rare words, which is important in relation extraction as the relation vectors are often estimated from very sparse co-occurrence counts (see the paper for details).
Clearly the objective pushes $w_i \cdot \tilde{w}_j + \tilde{b}_j$ to approximate $\mathrm{PMI}_S(i, j)$. We can think of the word vector $w_i$ as a low-dimensional encoding of $(\mathrm{PMI}_S(i, 1), \ldots, \mathrm{PMI}_S(i, n))$. By replacing $w_i$ with a point in space which models a vector relation, such as $(w_i - w_k) \cdot \tilde{w}_j = \mathrm{PMI}_W(i, j) - \mathrm{PMI}_W(k, j)$, we can begin interpreting relations in terms of PMI.
Learning Global Relation Vectors: So how do they learn the relation vectors $r_{st}$? GloVe is based on statistics about (main word, context word) pairs, whereas relations need statistics on (source word, context word, target word) triples. It turns out there are well-known generalizations of PMI to three arguments. An objective to learn the relation vectors is then:

$$\sum_{j \in J_{ik}} \big(r_{ik} \cdot \tilde{w}_j + \tilde{b}_j - \mathrm{SI}(i, j, k)\big)^2$$
where SI is the three-argument PMI. Computing $r_{ik}$ for every pair of words is infeasible given the number of $(i, j, k)$ triples. Instead, word embeddings are first learned using the modified GloVe objective; then, when learning the relation vectors, the context vectors $\tilde{w}_j$ and biases $\tilde{b}_j$ are kept fixed.
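A rough sketch of fitting one relation vector against precomputed three-way PMI scores while keeping the context vectors and biases fixed; the least-squares loop below is our own illustration, not the authors' implementation:

```python
import numpy as np

def learn_relation_vector(context_vecs, context_bias, si_scores,
                          dim, lr=0.1, epochs=200):
    """Fit a relation vector r for one (source, target) word pair by least
    squares against precomputed three-way PMI scores SI(i, j, k), with the
    context vectors and biases held fixed."""
    rng = np.random.default_rng(0)
    r = rng.normal(scale=0.1, size=dim)
    for _ in range(epochs):
        # residuals: r . w~_j + b~_j - SI(i, j, k) over context words j
        pred = context_vecs @ r + context_bias
        err = pred - si_scores
        grad = context_vecs.T @ err / len(si_scores)
        r -= lr * grad
    return r

# Toy data: 50 context words, 10-dimensional embeddings, fake SI scores.
rng = np.random.default_rng(1)
W = rng.normal(size=(50, 10))
b = rng.normal(size=50)
si = rng.normal(size=50)
print(learn_relation_vector(W, b, si, dim=10)[:3])
```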
Results: The paper trains on a dump of Wikipedia and compares the trained model to other well-known relation representation methods, such as taking the Diff, Avg, or Conc of the context word embeddings. They beat all baselines on well-known relation datasets: the Google Analogy Test Set and the DiffVec dataset. A relation prototypicality study is also conducted on SemEval 2012 Task 2, and a relation extraction study on the NYT corpus.
2.3 Explicit Retrofitting of Distributional Word Vectors
Ref: [12]
Background:
Semantic specialization of distributional word vectors, referred to as retrofitting, is the process of fine-tuning word vectors using external lexical knowledge in order to better embed some semantic relation. Simple similarity measures between word embeddings encode an abstract semantic association instead of a precise semantic relation. For example, it is difficult to discern synonyms from antonyms in distributional spaces. This is particularly debilitating in downstream tasks which rely on more precise semantic relations between words. A standard solution is to move beyond unsupervised approaches in a process called word vector space specialization, or retrofitting, where external resources such as WordNet are used to specialize distributional spaces for lexical relations. This is achieved using two main strategies: (1) joint specialization models, which integrate external constraints into the distributional training procedure, and (2) post-processing models, which inject lexical knowledge retroactively to satisfy external constraints. (2) tends to outperform (1) but suffers from one big drawback: it updates only the vectors of words present in the external resource.
Motivation: Merge (1) and (2).
Method: The paper starts off by presenting a new way of creating constraint-aware training instances that help nudge the model in the right direction. This is done by taking an external resource of (word, word, constraint) triplets $(w_i, w_j, r)$ and then generating examples from it by retrieving the $K$ words closest to $w_i$ and the $K$ words closest to $w_j$, thus forming a micro-batch $M$:

$$M(w_i, w_j, r) = \{(x_i, x_j, g_r(x_i, x_j))\} \cup \big\{(x_i, x_k^m, g(x_i, x_k^m))\big\}_{k=1}^{K} \cup \big\{(x_j, x_k^n, g(x_j, x_k^n))\big\}_{k=1}^{K}$$
where $g_r$ is a distance score in the target space we want to learn, tied to the relation. For example, if we want to learn synonymy, $g_r$ should minimize the distance between $x_i$ and $x_j$. For words in the two sampled sets, we assume that the distance $g(\cdot, \cdot)$ in the specialized space should be the same as in the distributional space.
The paper proposes learning an explicit specialization function $f: X \rightarrow X'$ which, when applied to the distributional vector space $X$, transforms it into a specialized space $X'$ that better captures semantic relations. The function $f$ is set to be a feed-forward neural net.

The micro-batches $M$ (each containing $2K + 1$ examples) are fed to the model, where each training example consists of a pair of distributional (unspecialized) embeddings $x_i$ and $x_j$ and a score $g$ denoting the desired distance between the vectors. The objective then becomes:
$$J_{MSD} = \sum_{i=1}^{N} \big(g(f(x_1^i), f(x_2^i)) - g_i\big)^2$$
which minimizes the mean squared difference between the distances of the specialized vectors produced by $f$ and the desired distances $g_i$.

An additional topological regularization term is added which helps the model not disrupt the beneficial properties of the distributional space:
$$J_{REG} = \sum_{i=1}^{N} g\big(x_1^i, f(x_1^i)\big) + g\big(x_2^i, f(x_2^i)\big)$$
The above simply says, keep vectors in the distributional space close to vectors in the specialized
space.
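A small sketch of the combined objective, assuming $g$ is Euclidean distance and $f$ is any callable (the paper uses a feed-forward net); function and variable names are ours:

```python
import numpy as np

def specialization_loss(f, x1, x2, target_g, reg_weight=1.0):
    """J_MSD + topological regularizer for one micro-batch (a sketch under
    the assumption that g is Euclidean distance; f is any map from the
    distributional space to the specialized space)."""
    def g(a, b):
        return np.linalg.norm(a - b, axis=-1)

    j_msd = np.mean((g(f(x1), f(x2)) - target_g) ** 2)
    j_reg = np.mean(g(x1, f(x1)) + g(x2, f(x2)))
    return j_msd + reg_weight * j_reg

# Toy check with an identity "specialization" function.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
desired = np.zeros(8)  # e.g. synonyms: push pairs together
print(specialization_loss(lambda x: x, x1, x2, desired))
```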
Results: The method is shown to outperform distributional vectors on a bunch of tasks.
2.4 Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External
Commonsense Knowledge
Ref: [9]
Background: Reading comprehension is the task of answering questions about, or with the help of,
a passage of context text. In simple cases such as in SQuAD 1.0, the answer is always a contiguous
span of passage text. Complexities include answers which depend on long distance dependencies,
or even rely on common sense knowledge underivable from the passage alone. The paper focuses
on the cloze-style task where questions are formed from a passage of text by replacing tokens in its
sentences with placeholders and the task is to fill them in.
Motivation: Integrate common sense knowledge to enhance Cloze-style reading comprehension.
Method: The paper uses the Open Mind Common Sense part of ConceptNet, a dataset of common-sense relation triplets (subject, relation, object) where the subject and object can be multi-word expressions. Each training example $(D, Q, A_{1\ldots10})$ is augmented with $P$ external knowledge facts picked heuristically. The problem is modelled using an attention-sum reader model [16]. The knowledge triplet components are separately encoded using the same biGRU used to encode the context tokens. By attending over the dot product between context-word and subject representations, multiplying this with the object representations, and sum-reducing across the object representations, we get back knowledge-enriched context representations.
Results: The Common Nouns and Named Entities partitions of the Children’s Book Test dataset are
used. State-of-the-art results are achieved on the Common Nouns dataset.
2.5 Multi-Relational Question Answering from Narratives: Machine Reading and
Reasoning in Simulated Worlds
Ref: [35]
Background: QA mostly divides into text-QA and KB-QA. However, this division doesn't capture all cases of QA. One such case is what this paper calls multi-relational QA over personal narrative, a special form of QA which is perhaps best understood through an example. Think of how one might store knowledge in a home assistant device so as to be able to query it later. If you're a developer, you might say "There is a new project starting. Its name is Project Alpha. It has 20 developers assigned to it. The delivery date is September next year. Actually, no, it's in November." One way of solving this problem is by learning how to store the information in a knowledge base first. Another way is to store it as text and then answer questions across it. The paper is about the latter.
Motivation: Learn to answer questions about a sequence of sentences where the sentences encode multiple relations between and about entities mentioned in the text.
Method: Four new datasets are created (collectively called TextWorlds) and five models are trained
and tested on them (LogReg, Seq2Seq, BiDAF, MemN2N, DrQA). The most interesting tests are
across-world tests where the model is expected to generalize across domains (e.g. from an academic
domain to a shopping domain). There are two test settings, (1) predicting the correct entity in the text
given the text (n-way classification) (2) predicting the correct spans (note that there can be multiple
spans).
Results: Preliminary results on applying existing models to the datasets are given. Across-world performance is poor; in-world performance is, of course, better. The most challenging examples are those where entity relations are deeply compositional. Could applying recurrent entity networks help?
2.6 Simple and Effective Multi-Paragraph Reading Comprehension
Ref: [10]
Background: Multi-paragraph text-QA is hard. Current neural models can only handle short pas-
sages of text. Whole documents are beyond the scope of current models unless the problem is
somehow pipelined.
Pipeline Method
The paragraph that has the smallest TF-IDF cosine distance to the question is first chosen. Next, a paragraph-level QA model is applied to it to find the answer. To handle noisy labels, such as cases where the answer is contained within multiple spans (some of which might be misleading), a summed objective function is used which optimizes the sum of the probabilities of all answer span starts and ends. For span starts, the objective is the log of the summed exponentiated scores of answer-start tokens over the same sum across all tokens:

$$\log \frac{\sum_{k \in A} e^{s_k}}{\sum_{i=1}^{n} e^{s_i}}$$
This objective is agnostic to how the model distributes probability mass across the possible answer
spans, thus the model can “choose” to focus on only the more relevant spans. The model which
computes the start and end scores is a collection of modules including bi-directional GRUs, the
bi-directional attention mechanism of BiDAF, self-attention modules, and a final prediction layer.
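A minimal sketch of the summed start objective above for a single paragraph (the end objective is analogous), written as a loss, i.e. the negative of the log-ratio:

```python
import numpy as np

def summed_start_objective(start_scores, answer_start_positions):
    """Negative log of the summed probability of all answer-start tokens,
    i.e. -log( sum_{k in A} e^{s_k} / sum_i e^{s_i} )."""
    scores = np.asarray(start_scores, dtype=float)
    log_denom = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    ans = scores[answer_start_positions]
    log_num = np.log(np.sum(np.exp(ans - ans.max()))) + ans.max()
    return -(log_num - log_denom)

# Token scores for a 6-token paragraph; the answer starts at tokens 1 and 4.
print(summed_start_objective([0.2, 2.0, -1.0, 0.5, 1.5, 0.0], [1, 4]))
```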
Confidence method
The second proposed approach builds on a common method which takes span start and end scores $s_i$ and $e_j$ and sums them to form a span confidence score. At test time the model is run on each paragraph and the answer span with the highest confidence is selected. However, this may lead to poorly calibrated confidence scores for two reasons: (1) models trained with a softmax objective need not care about the magnitudes of the inputs to the objective as long as the ratios between them are kept the same; (2) if the model only sees paragraphs that contain answers, it might become too confident in spurious patterns which it relates to an answer being present. Four approaches to mitigate these two problems are explored: (i) let the normalization term in the softmax objective run across all document paragraphs; (ii) merge all paragraphs during training into one; (iii) allow for a no-answer option; (iv) use a sigmoid loss objective, computing the start/end probabilities by applying the sigmoid function to the start/end scores. Since the scores are then evaluated independently, the idea is that they should be comparable across paragraphs.
Results: Three datasets are evaluated on: TriviaQA Web, TriviaQA Unfiltered and SQuAD. Strate-
gies (i), (ii), and (iii) perform best and do not degrade as more paragraphs are considered (at least for
TriviaQA). The most robust strategy is the shared norm (i) which doesn’t degrade on any dataset.
2.7 Semantically Equivalent Adversarial Rules for Debugging NLP Models
Ref: [30]
Background: Adversarial attacks in the computer vision community have been studied extensively.
A classifier might confuse an avocado with a millennial if we imperceptibly change a pixel or two.
Adversarial examples can help us study the robustness of our models as well as help elucidate their
reasoning.
Motivation: Come up with NLP adversarial examples which are semantically equivalent, i.e. we
want them to be worded differently whilst preserving their meaning.
Method: The paper introduces the Semantically Equivalent Adversary (SEA), defined to be a paraphrase $x'$ of input text $x$ which leads to a changed prediction $f(x) \neq f(x')$. Paraphrases are created by translating $x$ into multiple pivot languages and taking the scores of the back-translations to be proportional to $P(x'|x)$. For these scores to be consistent they are normalized into what the paper calls a semantic score $S(x, x')$. Semantic equivalence is then defined as $\mathrm{SemEq}(x, x') = \mathbb{1}[S(x, x') \geq \tau]$.

The paper also introduces Semantically Equivalent Adversarial Rules (SEARs), a rule-based system which generates SEAs. A rule is of the form $r = (a \rightarrow c)$, where the first instance of the antecedent $a$ is replaced by the consequent $c$ in every instance that includes $a$. For example, $r = (\text{movie} \rightarrow \text{film})$ would lead to $r(\text{"Great movie!"}) = \text{"Great film!"}$. More general rules can be formed, such as $(\text{What NOUN} \rightarrow \text{Which NOUN})$.
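A toy, string-level sketch of applying a SEAR-style rule; the real rules also match POS tags and are filtered by the semantic score described above:

```python
def apply_sear_rule(rule, text):
    """Apply a SEAR-style rule (antecedent -> consequent) by replacing the
    first occurrence of the antecedent in the text."""
    antecedent, consequent = rule
    if antecedent in text:
        return text.replace(antecedent, consequent, 1)
    return text

print(apply_sear_rule(("movie", "film"), "Great movie!"))  # -> "Great film!"
```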
Results: The results show that SEAs are very effective at confusing QA models. Tests are performed using BiDAF on SQuAD and with a visual QA model applied to the VQA dataset. SEAs effectively confuse state-of-the-art models, and they can be used to augment the datasets these models are trained on to make them more robust.
2.8 On the limitations of Unsupervised Bilingual Dictionary Induction
Ref: [31]
It is so good to listen to my favourite blogger (so far) talking about his research on analysing unsu-
pervised algorithms for bilingual dictionary induction.
Background: Research on bilingual dictionary induction (BDI) and machine translation has a very long history, and it also has a direct impact on our daily lives, since it matters how we communicate with people who don't speak the language(s) we speak. Recent research on unsupervised approaches to this topic has drawn wide attention, since such approaches don't require labelled data, which can be costly and time-consuming to collect.
[8] proposed an unsupervised four-step method for word translation, and it demonstrates strong results that are sometimes even better than supervised methods. Speaking from a linguistic perspective, the success of [8] might be due to the fact that the languages used in their experiments are linguistically similar, so it is important to study when the proposed method fails, and more generally the limitations of unsupervised methods for BDI.
Motivation: Unsupervised approaches to BDI generally start with two sets (source and target) of word embeddings pre-trained on monolingual corpora, and the task is to build a dictionary that translates words in one language into the other. The paper focuses on studying the impact of a few factors on the performance of the approach proposed in [8], and shows that it works well only when the two languages are linguistically similar and the two sets of word embeddings are derived from the same domain, such as Wikipedia. A new eigenvalue-based graph similarity is proposed to measure how topologically similar two sets of word embeddings are, and the paper shows a strong correlation between the model's performance on BDI and this graph similarity. In addition, weak supervision from identically spelt words in the two languages provides surprisingly good performance across different language pairs.
Graph Similarity: Since, as illustrated in the paper, word embeddings are far from isomorphic across languages, the paper proposes to evaluate how isospectral two sets of word embeddings are by comparing the eigenvalues of their Laplacian matrices. Suppose $G_1$ and $G_2$ are (graphs built from) the two sets of word embeddings of the two languages, $A_1$ and $A_2$ are their adjacency matrices, and $L_1 = D_1 - A_1$ and $L_2 = D_2 - A_2$, where $D_1$ and $D_2$ are the diagonal degree matrices, $(D_i)_{nn} = \sum_m (A_i)_{nm}$, $i \in \{1, 2\}$. The similarity metric is defined as follows:
$$\Delta = \sum_{i=1}^{k} (\lambda_{1i} - \lambda_{2i})^2 \qquad (1)$$

$$\text{where } k = \min_j \left\{ \frac{\sum_{i=1}^{k} \lambda_{ji}}{\sum_{i=1}^{n} \lambda_{ji}} > 0.9 \right\} \qquad (2)$$
The above equations can be interpreted as: 1) find the minimal $k$ such that the top $k$ eigenvalues capture 90% of the energy in the spectrum; 2) calculate the Euclidean distance between the two sets of top-$k$ eigenvalues. The paper shows that the proposed graph similarity is positively correlated with the performance of the unsupervised BDI algorithm, specifically [8]'s method.
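A small sketch of the eigenvalue similarity in Eqs. (1)-(2), assuming the adjacency matrices of the two embedding graphs are already built (how the graphs are constructed from the embeddings is omitted here):

```python
import numpy as np

def eigenvalue_similarity(A1, A2, energy=0.9):
    """Laplacian eigenvalue distance between two graphs given their adjacency
    matrices (smaller = more isospectral). A sketch of Eqs. (1) and (2)."""
    def laplacian_spectrum(A):
        L = np.diag(A.sum(axis=1)) - A
        return np.sort(np.linalg.eigvalsh(L))[::-1]  # descending

    def k_for(lams):
        cum = np.cumsum(lams) / lams.sum()
        return int(np.searchsorted(cum, energy) + 1)

    l1, l2 = laplacian_spectrum(A1), laplacian_spectrum(A2)
    k = min(k_for(l1), k_for(l2))
    return float(np.sum((l1[:k] - l2[:k]) ** 2))

# Toy symmetric adjacency matrices for two 4-node graphs.
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
B = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], float)
print(eigenvalue_similarity(A, B))
```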
Summary of [8]: An unsupervised word translation/BDI algorithm was proposed in [8], which demonstrated strong performance across different language pairs. The algorithm is summarised in 4 steps:
1. Monolingual word embeddings: The word embeddings of each language are derived by running FastText [3] on a monolingual corpus.
2. Adversarial mapping: A linear mapping W is learnt between the two sets of embeddings with an adversarial classifier. Orthonormal regularisation is applied during training.
3. Refinement (Procrustes analysis): A set of translations is generated by retrieving bidirectional nearest neighbours, and the orthogonal Procrustes problem is solved to refine the mapping W (a minimal sketch follows after this list):
$$W^{\star} = \arg\min_{W} \|WX - Y\|_F = UV^{\top} \quad \text{s.t.} \quad U\Sigma V^{\top} = \mathrm{SVD}(YX^{\top})$$
4. Cross-domain similarity local scaling (CSLS): The hubness issue is common in high-dimensional data analysis, and CSLS is proposed to expand high-density areas and condense low-density ones.
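A minimal sketch of the Procrustes refinement step, where the columns of X and Y hold the embeddings of the induced translation pairs; the toy check recovers a known rotation:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes solution W* = U V^T with U S V^T = SVD(Y X^T)."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Toy check: recover a known orthogonal map from noisy pairs.
rng = np.random.default_rng(0)
d, n = 5, 200
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # ground-truth orthogonal map
X = rng.normal(size=(d, n))
Y = Q @ X + 0.01 * rng.normal(size=(d, n))
W = procrustes(X, Y)
print(np.round(np.abs(W - Q).max(), 3))        # close to 0
```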
The paper briefly goes through the unsupervised method for BDI in [8], and then it talks about the
impact of different factors on both linguistics and machine learning side.
Impact of language similarity. Experiments conducted in [8] contain languages that are mostly dependent-marking, except for French, which is mixed-marking. (Dependent-marking languages include English, German, Chinese, Russian, and Spanish, etc. An example in German: "ein Herr" is grammatically correct since "Herr" is a masculine noun, but "ein" is marked to "eine" in "eine Frau" since "Frau" is a feminine noun. An example of head-marking in English: "walk" in "I walk" is marked to "walks" in "John walks". Whether a language is categorised as head- or dependent-marking depends on which of the two happens more frequently; if both happen equally frequently, the language is mixed-marking, and if both head and dependent are marked, it is double-marking.) This paper collected other languages that are mixed-marking (Estonian and Finnish), double-marking (Greek), and dependent-marking (Hungarian, Polish and Turkish) to study the impact of language similarity.

The unsupervised adversarial method failed (close to 0 performance) when learning to translate from English to Estonian, Finnish and Greek, which are not dependent-marking languages as English is. The results suggest that the method in [8] seems to be challenged when pairing English with languages that are not isolating and do not have dependent marking. When translating between two languages that are both mixed-marking, the unsupervised method didn't fail. Thus, language similarity at the morphological level indeed has an impact.
Impact of domain differences The paper collected three corpora to study the impact of domain
differences in training data. The results show that the unsupervised method in [8] failed when source
and target word embeddings were derived from two different domains, which seems to suggest that
corpora in similar or comparable domains are required in order to make the unsupervised method
work.
Impact of hyperparameters. Hyperparameters are also important in determining whether the unsupervised method works or not.
[21] proposed the skip-gram and continuous bag-of-words algorithms for efficient estimation of word vectors. The unsupervised method in [8] doesn't work if the source and target word embeddings are not derived from the same algorithm. Even when both sets of word embeddings are derived from the same algorithm, varying the hyperparameters leads to slightly worse performance.
Impact of evaluation procedure. 1) Part-of-speech: in general, performance on verbs is the lowest across all methods, since the meaning of a verb relies heavily on the context it is situated in, and verbs have more morphological variants. 2) Homographs: surprisingly, words whose translations are homographs are associated with lower precision than other words.
To summarise, it seems that unsupervised approaches, even the very promising method in [8], are still very limited, which also means there is still a lot to explore in unsupervised BDI. Another issue to take into consideration is that weak supervision derived from identically spelt words in the two languages provides solid performance, and it should be considered a strong baseline for challenging scenarios, such as translating between two languages that are not in the same morphological category.
2.9 Generating Sentences by Editing Prototypes
Ref: [13]
(Okay, this is a TACL paper, and it was presented at ACL2018.)
Background: Generative modelling of sentences is hard, as recurrent neural language models tend to produce generic utterances seen in the training set, which leads to poor diversity in generated sentences. Instead of drawing samples from scratch, the paper proposes a novel way of generating sentences by editing prototypes, in the sense that a sentence is generated by conditioning on a prototype sentence and a sampled edit vector.
Method: The prototypes $x'$ of a sentence $x$ are those with high lexical overlap with $x$, as measured by Jaccard distance $d_J$. For each sentence $x$, a set of similar neighbours is collected: $N(x) = \{x' \in X : d_J(x, x') < 0.5\}$.
The proposed model is called the "neural editor". The training objective follows the idea of VAEs [17], and the model has three parts.
I. Neural editor $p_{edit}(x|x', z)$. The proposed neural editor is a left-to-right sequence-to-sequence model with attention. The encoder produces hidden states given the prototype $x'$, and the decoder learns to generate the sentence $x$ conditioned on an edit vector $z$ by concatenating $z$ to the input of the decoder at each time step.
II. Edit prior $p(z)$. Each sample from the prior distribution is a product of two samples independently drawn from two distributions:

$$z_{dir} \sim \mathrm{vMF}(0), \qquad z_{norm} \sim \mathrm{Unif}(0, 10), \qquad z = z_{norm} \cdot z_{dir}$$

where $\mathrm{vMF}(0)$ is a von Mises-Fisher distribution with concentration $\kappa = 0$, which is essentially a uniform distribution over a high-dimensional sphere, and $\mathrm{Unif}(0, 10)$ is a uniform distribution between 0 and 10. The idea is that the prior decomposes into two independent distributions, one determining the direction of the edit vector and the other controlling its length. More intuitively, $z_{dir}$ encodes how the prototype should be edited, and $z_{norm}$ encodes by how much.
III. Approximate edit posterior $q(z|x, x')$. As the prior is composed of two distributions, the approximate edit posterior also outputs two components:

$$q(z_{dir}|x, x') = \mathrm{vMF}(z_{dir}; f_{dir}, \kappa) \propto \exp(\kappa\, z_{dir}^{\top} f_{dir})$$

$$q(z_{norm}|x, x') = \mathrm{Unif}(z_{norm}; [\tilde{f}_{norm}, \tilde{f}_{norm} + \epsilon])$$

where $z_{dir}$ is sampled from $\mathrm{vMF}(z_{dir}; f_{dir}, \kappa)$, defined by a fixed concentration $\kappa$ and a normalised vector $f_{dir}$ indicating the direction of concentration, and $z_{norm}$ is sampled from a uniform distribution. The fun part is how the model calculates $f_{dir}$ and $f_{norm}$.
Suppose we sample a sentence $x$ and its prototype $x'$, and let $f$ represent how the prototype $x'$ should be modified into the target sentence $x$. Instead of learning a function that maps the two sentences to a vector $f$, the model makes use of pretrained word vectors and calculates $f$ by concatenating two vectors: the average of the word vectors of the words added to the prototype, and the average of the word vectors of the words deleted. Essentially, $f$ represents the word-level mismatch between the sampled prototype $x'$ and the target sentence $x$. Then $f_{norm} = \|f\|$ and $f_{dir} = f / f_{norm}$. For stable training, $\tilde{f}_{norm} = \min(f_{norm}, 10)$.
IV. Training objective. As usual, the training objective contains two parts: a reconstruction term $\mathbb{E}_{z \sim q(z|x, x')}[\log p_{edit}(x|x', z)]$ and a KL divergence term $D_{KL}(q(z|x, x')\,\|\,p(z))$. In the proposed parametrisation, the KL divergence term is a constant, and sampling from the vMF distribution can be easily implemented.
Advantages: First, since the prior distribution is defined on a high-dimensional unit sphere, and the
distance between any two edit vectors is encoded as the dot-product between two normalised vectors,
it matches the cosine similarity metric which is naturally used in comparing similarity between word
vectors. Thus, the edit vectors should be able to capture semantic difference between a sampled
target sentence and its prototype.
Second, the proposed parametrisation avoids "posterior collapse", as the KL divergence term in the training objective is a constant that depends only on the concentration $\kappa$ of the vMF distribution. Previous VAEs with a Gaussian prior suffer from the posterior-collapse issue, and [4] proposed a couple of tricks to weaken the auto-regressive decoder used in their VAE; here, no such tricks are necessary, since the training objective effectively contains a single reconstruction term.
Results: Language modelling tasks on Yelp and Google OneBillionWord are conducted, and the
proposed “neural editor” is compared with several baseline models. For every sentence in the test
set, the model retrieves at least one sentence from the training set as the prototype. The “neural
editor” provides improvement on a large number of sentences, while it gives lower log-likelihood
on test sentences which don’t have similar sentences in the training set, as the model is only trained
to make small edits (based on Jaccard distance).
Human evaluations of the plausibility, grammaticality and diversity of the generated sentences are collected for comparison; the "neural editor" has higher scores overall, and it is more stable than the baseline models when the temperature is varied during evaluation.
Thoughts: The parametrisation in this paper reminds me of another paper which also utilised a
very interesting parametrisation to set the KL divergence term in VAEs as a constant. Paper is here
[33].
2.10 A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of
Word Embeddings
Ref: [1]
A very interesting paper about fully self-supervised bilingual dictionary induction.
Motivation: The method proposed in [8] was evaluated on simple scenarios, as the tested languages are morphologically similar to each other, and thus the method yields great performance. However, it doesn't work well, and possibly fails, in challenging scenarios such as translating between English and Finnish, which belong to two different morphological types, or dealing with monolingual word embeddings derived from slightly dissimilar domains.
The paper aims to propose a robust self-supervised learning algorithm for unsupervised bilingual
dictionary induction from monolingual word embeddings only, and the proposed algorithm is capa-
ble of handling challenging scenarios as described above.
Method: The proposed algorithm contains 4 steps, and we will discuss each of them below:
I. Monolingual word embeddings: The word embeddings are learnt using CBOW [21] on monolin-
gual corpora for each language. After learning, word vectors are normalised and zero-centred, and
these 2 steps have been shown to be beneficial to the task. Then word vectors are normalised again
in order to keep a unit length.
II. Fully unsupervised initialisation: The underlying assumption is that the embedding spaces are isometric (distance-preserving; for simplicity, isometry here means there exists a bijective transformation between the two sets of word vectors that preserves the distance between any two words). Thus the similarity matrices of the source and target languages, $M_X = XX^{\top}$ and $M_Z = ZZ^{\top}$, should be equivalent up to a permutation of their rows and columns. However, it is intractable to try every possible permutation.

A simple approximation is proposed to overcome this issue. Each row of $M_X$ and $M_Z$ is sorted independently. Under the assumption of perfect isometry, the same word should get the exact same vector in the similarity matrix across the two languages, so a nearest-neighbour retrieval over the sorted rows serves as an initial dictionary for the following step. In practice, $M_X$ and $M_Z$ are sorted and normalised using the steps mentioned above, and regarded as the initialisation of the word vectors ($X_0$ and $Z_0$) of the source and target languages in the next step; $X_0$ and $Z_0$ are only used in the first iteration of the self-learning procedure, after which the original embeddings are used.
III. Robust self-learning procedure: The procedure consists of two steps (learning the mappings and inducing a dictionary), which are run iteratively until convergence:

$$W_X, W_Z = \arg\max_{W_X, W_Z \in O_d(\mathbb{R})} \sum_i \sum_j D_{ij}\big((X_i W_X) \cdot (Z_j W_Z)\big)$$

$$D_{ij} = \begin{cases} 1 & \text{if } j = \arg\max_k\, (X_i W_X) \cdot (Z_k W_Z) \\ 0 & \text{otherwise} \end{cases}$$

where $O_d(\mathbb{R})$ is the set of orthonormal matrices. The optimisation is guaranteed to converge to a local optimum, and four key steps are introduced to help the objective converge to a good local optimum:
1. At each iteration, elements of the similarity matrix are randomly kept with probability $p$ and the rest are set to 0. This turned out to be crucial for getting reasonable performance when translating between English and Finnish.
2. When learning the mappings $W_X$ and $W_Z$, only the 20,000 most frequent words in each language are used. This drastically decreases the running time during training while giving performance similar to keeping all words.
3. Cross-domain Similarity Local Scaling (CSLS), proposed in [8], is applied when building the dictionary in the second step, as it reduces the hubness issue in high-dimensional embeddings. It is crucial for making the proposed method work, as it is for the method in [8].
4. The dictionary is induced bidirectionally, $D = D_{XZ} + D_{ZX}$, in order to encourage the algorithm to avoid bad local optima.
IV. Refinement (symmetric re-weighting): Once self-learning has converged to a good solution, a symmetric re-weighting is applied in both languages, $W_X = US^{1/2}$ and $W_Z = VS^{1/2}$, where $U$, $S$, $V$ are obtained by running singular value decomposition on $X^{\top}DZ$, i.e. $USV^{\top} = X^{\top}DZ$. This encourages the mappings to explore dimensions beyond the most relevant ones that best match the current solution.
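A small sketch of the symmetric re-weighting step, assuming D is the binary dictionary matrix produced by self-learning:

```python
import numpy as np

def symmetric_reweighting(X, Z, D):
    """With U S V^T = SVD(X^T D Z), map the source space with W_X = U S^{1/2}
    and the target space with W_Z = V S^{1/2} (a sketch of the step above)."""
    U, S, Vt = np.linalg.svd(X.T @ D @ Z)
    W_X = U * np.sqrt(S)          # equivalent to U @ diag(S)^(1/2)
    W_Z = Vt.T * np.sqrt(S)
    return X @ W_X, Z @ W_Z

rng = np.random.default_rng(0)
X, Z = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
D = np.eye(6)                     # toy dictionary: word i translates to word i
X_mapped, Z_mapped = symmetric_reweighting(X, Z, D)
print(X_mapped.shape, Z_mapped.shape)
```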
Results: The experiments are conducted in two conditions, which are 1) when English is the source
language, and the target languages include Italian, German, Finnish, and Spanish, 2) when English
is the target language, and the source languages include Spanish, Italian and Turkish. The ablation
study on the importance of 4 key steps discussed earlier is conducted on the first condition. In
conclusion, the proposed method has three advantages against its comparison partners.
I. Fast: The running time of the proposed self-learning method is much less than previously
proposed unsupervised/self-supervised methods, including [8] and [36].
II. Stable: The proposed method is also much more stable than previously proposed methods. The authors ran each of the three models ([8], [36] and the proposed one) 10 times, and the proposed method didn't fail in any run, while the other two failed in several runs, especially in challenging scenarios.
III. Better: Overall, the proposed method provides stronger performance than the other two methods mentioned above.

Thoughts: The first author has been working on BDI and machine translation for a couple of years, in both supervised and unsupervised fashion. Follow him at http://www.mikelartetxe.com/.
3 Tuesday, July 17th: Day 2
3.1 Batch IS NOT Heavy: Learning Word Representations From All Samples
Ref: [14]
Motivation: Skip-gram, CBOW [21] and GloVe [26] were proposed to learn high-quality word vectors efficiently with stochastic gradient descent (SGD). However, in skip-gram and CBOW the quality of the learnt word vectors is highly sensitive to the empirical distribution used for negative sampling; as pointed out in [22], the unigram distribution raised to the power 0.75 seems to be a sweet spot, and the one-sample learning scheme in SGD causes fluctuation at the early stage of learning. In GloVe, only positive (observed in the training corpus) word-context pairs are used for learning, while all negative (unobserved) pairs are ignored, in contrast to running SVD on top of a Positive Pointwise Mutual Information (PPMI) matrix.

The paper proposes the "AllVec" learning algorithm, which utilises all positive and negative word-context pairs to learn word vectors; a reformulation of the objective function is proposed to enable (full) batch gradient descent, replacing noisy stochastic gradient descent and biased negative sampling.
Method: The objective function is defined as follows:

$$L = \underbrace{\sum_{(w,c) \in S} \alpha^{+}_{wc}\big(r^{+}_{wc} - U_w \tilde{U}_c^{\top}\big)^2}_{L_P:\ \text{observed/positive pairs}} + \underbrace{\sum_{(w,c) \in (V \times V) \setminus S} \alpha^{-}_{wc}\big(r^{-}_{wc} - U_w \tilde{U}_c^{\top}\big)^2}_{L_N:\ \text{unobserved/negative pairs}}$$
where $S$ is the set of observed word-context $(w, c)$ pairs in the training corpus, $V$ is the vocabulary, $U$ is the word embedding matrix, and $\tilde{U}$ is the context embedding matrix.

The objective is a regression problem, and it resembles running SVD on top of a PPMI matrix. Here, $r^{+}_{wc}$ is set to be the PPMI of the pair $(w, c)$, and $r^{-}_{wc}$ can be either 0 or 1. In order to downweight frequently occurring words in the training corpus, $\alpha^{+}_{wc}$ follows the subsampling scheme of GloVe [26]; more importantly, $\alpha^{-}_{wc}$ penalises more when the word $w$ appears more frequently, which is similar to the negative sampling scheme of skip-gram and CBOW [22].
In order to apply full batch gradient descent, a reformulation is applied:

$$\tilde{L} = \underbrace{\sum_{w \in V} \sum_{c \in V} \alpha^{-}_{c}\big(r^{-} - U_w \tilde{U}_c^{\top}\big)^2}_{L_A:\ \text{all pairs}} + \underbrace{\sum_{(w,c) \in S} \big(\alpha^{+}_{wc} - \alpha^{-}_{c}\big)\big(\Delta - U_w \tilde{U}_c^{\top}\big)^2}_{L'_P:\ \text{observed pairs}}$$

where $\Delta = (\alpha^{+}_{wc} r^{+}_{wc} - \alpha^{-}_{c} r^{-})/(\alpha^{+}_{wc} - \alpha^{-}_{c})$ and constant terms are omitted.
As $L_A$ computes a value for every word-context pair, the time complexity is $O(k|V|^2)$, where $k$ is the dimensionality of the vectors and $|V|$ is the size of the vocabulary, so it is crucial to reduce the cost of computing $L_A$. After expanding the square and applying commutativity to rearrange the calculation, the transformed $\tilde{L}_A$ (without constant terms) is:
$$\tilde{L}_A = \sum_{d=0}^{k} \sum_{d'=0}^{k} \Big(\sum_{w \in V} u_{wd}\, u_{wd'}\Big)\Big(\sum_{c \in V} \alpha^{-}_{c}\, \tilde{u}_{cd}\, \tilde{u}_{cd'}\Big) - 2r^{-} \sum_{d=0}^{k} \Big(\sum_{w \in V} u_{wd}\Big)\Big(\sum_{c \in V} \alpha^{-}_{c}\, \tilde{u}_{cd}\Big) = \sum_{d=0}^{k} \sum_{d'=0}^{k} p^{w}_{dd'}\, p^{c}_{dd'} - 2r^{-} \sum_{d=0}^{k} q^{w}_{d}\, q^{c}_{d}$$
In this way, before each iteration, $p^{w}_{dd'}$, $p^{c}_{dd'}$, $q^{w}_{d}$ and $q^{c}_{d}$ can be pre-calculated; the time complexity of computing each of them over all pairs is $O(|V|k^2)$, $O(|V|k^2)$, $O(|V|k)$ and $O(|V|k)$ respectively. With all terms pre-calculated, assembling $\tilde{L}_A$ costs $O(k^2)$. Therefore, in total, including pre-calculation and assembling the pre-calculated terms, the time complexity is $O(2|V|k^2 + 2|V|k + k^2) \approx O(2|V|k^2)$, as the size of the vocabulary $|V|$ is much larger than the dimensionality $k$ of the learnable vectors, and this is much smaller than the original $O(k|V|^2)$.
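A sketch of evaluating the all-pair term with the precomputed quantities instead of looping over all $|V|^2$ pairs (constant term omitted; variable names are ours):

```python
import numpy as np

def all_pair_loss_terms(U, U_ctx, alpha_c, r_neg):
    """Evaluate the all-pair part of the AllVec loss via the precomputed
    terms p^w, p^c, q^w, q^c (a sketch of the decomposition above)."""
    p_w = U.T @ U                                   # k x k
    p_c = (U_ctx * alpha_c[:, None]).T @ U_ctx      # k x k, weighted by alpha^-
    q_w = U.sum(axis=0)                             # length k
    q_c = (alpha_c[:, None] * U_ctx).sum(axis=0)    # length k
    return np.sum(p_w * p_c) - 2.0 * r_neg * np.dot(q_w, q_c)

rng = np.random.default_rng(0)
V, k = 1000, 16
U, U_ctx = rng.normal(size=(V, k)), rng.normal(size=(V, k))
alpha_c = rng.uniform(0.1, 1.0, size=V)   # per-context negative weights
print(all_pair_loss_terms(U, U_ctx, alpha_c, r_neg=0.0))
```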
Results: The training experiments are conducted on four corpora individually, and four comparison partners, including skip-gram, skip-gram with an adaptive sampler [5], GloVe, and LexVec [27], are trained with the same hyperparameter settings. The downstream tasks for evaluation include MEN, MC, RW, RG, WSim, and WRel.

Overall, AllVec outperforms the previously proposed comparison partners, although it requires more iterations to converge than the other SGD-optimised methods (the beauty of SGD).
In order to show that negative samples are important for learning high-quality word vectors, an ablation study illustrates the effect of varying $\alpha^{-}_{wc}$. The performance of the proposed AllVec drops drastically when $\alpha^{-}_{wc}$ is set to 0 (as in GloVe) compared to positive values of $\alpha^{-}_{wc}$, so it is helpful to include unobserved pairs when learning word vectors.
Thoughts: I tend to think of the proposed learning method as a weighted decomposition of the PPMI
matrix without the orthonormal constraint in SVD, and it performs well across all downstream tasks.
I would love to see whether removing the orthonormal constraint would help or not.
3.2 Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations
Ref: [29]
Background: Noun compounds are an important part of language. For example apple cake is a
noun compound signifying a particular type of cake. One way of describing the semantics of a noun-
compound is through paraphrasing. Thus apple cake is a cake made from apples. Noun-compound
paraphrasing may be considered a subtask of the general paraphrasing task.
Motivation: Develop a model which generalizes well to unseen noun-compounds and rare para-
phrases.
Method: Each training example contains two constituents and a paraphrase $(w_2, p, w_1)$, such as (cake, made of, apple), and the model is trained on 3 subtasks: (1) predict $p$ given $w_1$ and $w_2$, (2) predict $w_1$ given $p$ and $w_2$, and (3) predict $w_2$ given $p$ and $w_1$. The paraphrases are collected by taking a list of noun-compounds from SemEval and extracting common POS patterns such as [w$_2$] VERB PREP [w$_1$]. Thus the model actually learns to classify or rank across paraphrase templates. The model itself is a simple LSTM model.
Results: A qualitative analysis is presented, as well as an empirical evaluation on SemEval 2013 Task 4, where the model presented in the paper outperforms other models by a wide margin.
3.3 Guess Me if You Can: Acronym Disambiguation for Enterprise
Ref: [19]
Background: Anyone who has worked in a big company knows that acronyms can be annoying. Never mind usual word polysemy; acronyms are where things get really crazy. Furthermore, acronyms usually appear in isolation without their full meaning spelt out. Therefore, it is particularly useful to develop a system that can automatically resolve the true meanings of acronyms in enterprise documents. A further important distinction is that acronyms can be either external or internal to an organisation. For example, AI might stand for Asset Intelligence within Microsoft, whereas outside it usually stands for Artificial Intelligence. Even when acronyms denote the same thing internally and externally, the contexts within which they appear can be very different: OS might have a lot to do with performance or installation externally, whereas internally at a place like Microsoft it will appear in contexts to do with development or implementation.
Motivation: Solve the limitations of previous work which does not distinguish between external
and internal meanings.
Method: The paper divides Entity Acronym Disambiguation into two sub-problems: (1) Acronym Meaning Mining, which aims at mining acronym/meaning pairs from the enterprise corpus, and (2) Meaning Candidate Ranking, whose goal is to rank the candidate meanings associated with the target acronym. An assumption is made that the acronyms to be disambiguated are provided as input to the system. The paper does not try to optimize the performance of acronym detection (e.g. identifying acronyms beyond the simple capitalization rule, or distinguishing cases where a capitalized term is not an acronym but a regular English word, such as "OK").
I. Acronym Meaning Mining:
Candidate Generation: a phrase is considered to be a meaning candidate for an acronym if (1) the initial letters of the phrase match the acronym and the phrase and the acronym co-occur in at least one document in the enterprise corpus, or (2) it is a valid candidate for the acronym in public knowledge bases (e.g. Wikipedia). A small sketch of the initial-letter check follows after this list.
Popularity Calculation: for each candidate meaning, a popularity score is calculated, which reveals how often the candidate meaning is used as the genuine meaning of the acronym. This is done by finding a normalized count of meanings either across the entire corpus or across documents. The scores are later used as part of the input features for the model.
Candidate Deduplication: heuristic rules are used to deduplicate acronym meanings; e.g. CA, which stands for certificate authority, might also be denoted by Cert Auth or many other variants.
Context Harvesting: context words around each meaning candidate are harvested.
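A toy sketch of the initial-letter check used in candidate generation (rule (1) above); the real system additionally checks corpus co-occurrence and consults public knowledge bases:

```python
import re

def initial_letters_match(acronym, phrase):
    """Heuristic check: does the phrase's sequence of initial letters spell
    the acronym?"""
    initials = "".join(w[0] for w in re.findall(r"[A-Za-z]+", phrase))
    return initials.lower() == acronym.lower()

print(initial_letters_match("CA", "certificate authority"))   # True
print(initial_letters_match("CA", "customer account team"))   # False
```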
II. Meaning Candidate Ranking:
Candidate Ranking: a ranking model is trained to rank candidate meanings as being the
true meaning of an acronym. In order to generate a training set, distant supervision is used.
Contexts are taken and acronym meanings are replaced with the acronyms themselves. The
task then is to train a model to disambiguate these acronyms back into their meanings.
Results: High F1 scores are achieved on both a manual and the main distantly supervised dataset.
More importantly, similar behaviour is seen across both datasets as various features are used. Finally,
the model is compared to well-known entity linking models and is shown to outperform them.
4 Wednesday, July 18th: Day 3
4.1 What you can cram into a single vector: Probing sentence embeddings for linguistic
properties
Ref: [7]
Motivation: Recent research on learning sentence representations from structured data and human-annotated labels has demonstrated strong generalisation ability and transferability, and most such representations can be directly applied to various downstream tasks. However, it is not clear what information has been encoded in the vector representations. This paper proposes 10 probing tasks that capture simple linguistic features, collected to help us understand a little more about these representations.
Method: As mentioned in the paper, 10 probing tasks are categorised into three classes, including
surface (SentLen and WC), syntactic (BShift, TreeDepth and TopConst), and semantic information
(Tense, SubNum, ObjNum, SOMO and CoordInv).
1. SentLen: predict the length of a given sentence in terms of the number of words;
2. WordContent: tell which of 1,000 pre-picked mid-frequency words a given sentence contains;
3. BigramShift: test whether a sentence encoder is sensitive to legal word order;
4. TreeDepth: check whether an encoder is capable of inferring the hierarchical structure of sentences;
5. TopConstituent: classify sentences in terms of the sequence of top constituents immediately below the sentence (S) node;
6. Tense: predict the tense of the main-clause verb of a given sentence;
7. Subject Number: predict the number of the subject of the main clause;
8. Object Number: predict the number of the direct object of the main clause;
9. Semantic Odd Man Out (SOMO): tell whether a sentence has been modified by replacing a word. An example modified sentence: "No one could see this Hayes and I wanted to know if it was real or a spoonful" (original: ploy);
10. Coordination Inversion (CoordInv): tell whether a sentence is intact or has been modified by inverting the order of its clauses.
The sentences are all extracted from the Toronto BookCorpus [38]. For each task, the training set
contains 100k sentences, and the validation (development) and test sets contain 10k sentences each.
The classes in each task are balanced to prevent models from simply exploiting class-frequency
biases.
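Concretely, each probing task boils down to training a simple classifier on frozen sentence
embeddings and reading off its test accuracy. A minimal sketch of that recipe is below, using a
scikit-learn logistic-regression probe for simplicity (the paper's probes also include an MLP); the
toy random vectors stand in for real 100k/10k embedding splits.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def probe_accuracy(train_emb, train_y, test_emb, test_y):
        # Fit a simple classifier on frozen embeddings and report how well the
        # probed property (e.g. Tense, SentLen bin) can be predicted from them.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_emb, train_y)
        return clf.score(test_emb, test_y)

    rng = np.random.default_rng(0)
    train_emb, test_emb = rng.normal(size=(1000, 512)), rng.normal(size=(100, 512))
    train_y, test_y = rng.integers(0, 2, 1000), rng.integers(0, 2, 100)
    print(probe_accuracy(train_emb, train_y, test_emb, test_y))  # ~0.5 on random data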
Four types of encoder architecture are evaluated:
1. bag-of-words models, with or without TF-IDF weighting
2. a bidirectional LSTM (biLSTM) that uses the hidden state at the last time step as the
representation
3. a biLSTM with global max-pooling over time applied on top of the hidden states
4. a gated convolutional network.
These encoders are trained with several representation-learning objectives:
1. auto-encoding
2. machine translation (English to French, English to German and English to Finnish)
3. (Seq2Tree) decoding the tree structure of an input sentence given its vector representation
4. (Skip-thought) predicting the next sentence given the current one
5. (NLI) natural language inference.
Human performance is also collected for comparison.
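The difference between encoders 2 and 3 above lies only in how the sentence vector is read out of
the biLSTM, which matters for the results discussed below. A rough PyTorch sketch of the two
read-outs (our illustration, not the authors' code):

    import torch
    import torch.nn as nn

    class BiLSTMEncoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, pooling="max"):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.pooling = pooling  # "max": global max-pooling; "last": last hidden state

        def forward(self, token_ids):
            outputs, _ = self.lstm(self.embed(token_ids))  # (batch, time, 2 * hidden_dim)
            if self.pooling == "max":
                return outputs.max(dim=1).values           # max over the time dimension
            return outputs[:, -1, :]                       # hidden state at the last time step

    encoder = BiLSTMEncoder(vocab_size=50000)
    sentences = torch.randint(0, 50000, (8, 20))           # batch of 8 sentences, 20 tokens each
    print(encoder(sentences).shape)                        # torch.Size([8, 1024])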
Results: Overall, as expected, the training objective that requires models to decode the tree structure
of a sentence from its vector representation outperforms all the others. Some of the best results even
surpass human performance; the probing tasks on which models do best include TopConst, Tense,
SubjNum and ObjNum.
There are several interesting observations to be made from the result tables in the paper.
1. The auto-encoder objective is clearly unable to capture word-content information and also
performs poorly on the semantics-related tasks, yet it performs extremely well on SentLen.
The Seq2Tree objective is likewise very bad at capturing word content, but overall it
outperforms the other objectives on the syntactic and semantics-related tasks.
2. The machine translation objectives give reasonable performance on the word-content task,
as does the natural language inference objective, since both require a clear grasp of word
meanings.
3. The Skip-thought objective gives overall mediocre results compared with the other training
objectives, which suggests that unsupervised/self-supervised transfer learning still has a lot
of room to explore.
4. The choice of representation in the biLSTM matters: global max-pooling over time
outperforms taking only the last hidden state on both the syntactic and the semantics-related
tasks, while the latter does better on the surface-information tasks, SentLen and word
content.
5. Word order matters for most of the tasks.
An important caveat about the probing tasks is that good performance of an encoder on them does
not imply good performance on the downstream tasks in SentEval [6]. However, [7] reports a strong
correlation between performance on WordContent and performance on those downstream tasks;
that is to say, preserving word content is important for getting good results on most of the tasks in
SentEval. (More details about the SentEval package can be found at
https://github.com/facebookresearch/SentEval.)
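For reference, the probing tasks are distributed with SentEval, where you plug your own encoder in
through a batcher function. The sketch below shows roughly what that interface looks like; the
parameter values, task names and the placeholder random-vector encoder are our assumptions and
should be checked against the current version of the package.

    import numpy as np
    import senteval

    def prepare(params, samples):
        # build any vocabulary/preprocessing the encoder needs; nothing to do here
        return

    def batcher(params, batch):
        # batch is a list of tokenised sentences; return one embedding per sentence.
        # A real evaluation would call your trained encoder instead of random vectors.
        return np.random.randn(len(batch), 512)

    params = {"task_path": "SentEval/data", "usepytorch": False, "kfold": 5}
    se = senteval.engine.SE(params, batcher, prepare)
    results = se.eval(["Length", "WordContent", "Tense", "SubjNumber"])
    print(results["Tense"])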
References
[1] M. Artetxe, G. Labaka, and E. Agirre. A robust self-learning method for fully unsupervised
cross-lingual mappings of word embeddings. In ACL, 2018.
[2] B. Athiwaratkun, A. G. Wilson, and A. Anandkumar. Probabilistic fasttext for multi-sense
word embeddings. In ACL, 2018.
[3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword
information. TACL, 5:135–146, 2017.
[4] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio. Generating
sentences from a continuous space. In CoNLL, 2016.
[5] L. Chen, F. Yuan, J. M. Jose, and W. Zhang. Improving negative sampling for word represen-
tation using self-embedded features. In WSDM, 2018.
[6] A. Conneau and D. Kiela. Senteval: An evaluation toolkit for universal sentence representa-
tions. In LREC, 2018.
[7] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni. What you can cram into
a single vector: Probing sentence embeddings for linguistic properties. In ACL, 2018.
[8] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without
parallel data. In ICLR, 2018.
[9] A. Frank and T. Mihaylov. Knowledgeable reader: Enhancing cloze-style reading comprehen-
sion with external commonsense knowledge. In ACL, 2018.
[10] M. Gardner and C. Clark. Simple and effective multi-paragraph reading comprehension. In
ACL, 2018.
[11] M. Ghazvininejad, C. Brockett, M.-W. Chang, W. B. Dolan, J. Gao, W.-t. Yih, and M. Galley.
A knowledge-grounded neural conversation model. In AAAI, 2018.
[12] G. Glavaš and I. Vulić. Explicit retrofitting of distributional word vectors. In ACL, 2018.
[13] K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang. Generating sentences by editing prototypes.
CoRR, abs/1709.08878, 2017.
[14] X. He, X. Xin, F. Yuan, and J. M. Jose. Batch is not heavy: Learning word representations
from all samples. In ACL, 2018.
[15] S. Jameel, Z. Bouraoui, and S. Schockaert. Unsupervised learning of distributional relation
vectors. In ACL, 2018.
[16] R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. Text understanding with the attention
sum reader network. arXiv preprint arXiv:1603.01547, 2016.
[17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
[18] X. Li, Y.-N. Chen, L. Li, J. Gao, and A. Celikyilmaz. End-to-end task-completion neural
dialogue systems. arXiv preprint arXiv:1703.01008, 2017.
[19] Y. Li, F. Tao, A. Fuxman, and B. Zhao. Guess me if you can: Acronym disambiguation for
enterprises. In ACL, 2018.
[20] X. Liu, Y. Shen, K. Duh, and J. Gao. Stochastic answer networks for machine reading com-
prehension. In ACL, 2018.
[21] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781, 2013.
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of
words and phrases and their compositionality. In NIPS, 2013.
[23] N. Mrkšić, D. Ó Séaghdha, T.-H. Wen, B. Thomson, and S. Young. Neural belief tracker:
Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777, 2016.
[24] B. Peng, X. Li, J. Gao, J. Liu, and K.-F. Wong. Deep dyna-q: Integrating planning for task-
completion dialogue policy learning. In Proceedings of the 56th Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2182–2192,
2018.
[25] B. Peng, X. Li, L. Li, J. Gao, A. Celikyilmaz, S. Lee, and K.-F. Wong. Composite task-
completion dialogue policy learning via hierarchical deep reinforcement learning. arXiv
preprint arXiv:1704.03084, 2017.
[26] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation.
In EMNLP, 2014.
[27] A. Salle, M. Idiart, and A. Villavicencio. Enhancing the lexvec distributed word representation
model using positional contexts and external memory. CoRR, abs/1606.01283, 2016.
[28] Y. Shen, P.-S. Huang, M.-W. Chang, and J. Gao. Link prediction using embedded knowledge
graphs. 2016.
[29] V. Shwartz and I. Dagan. Paraphrase to explicate: Revealing implicit noun-compound rela-
tions. In ACL, 2018.
[30] S. Singh, C. Guestrin, and M. T. Ribeiro. Semantically equivalent adversarial rules for debug-
ging nlp models. In ACL, 2018.
[31] A. Søgaard, I. Vulić, and S. Ruder. On the limitations of unsupervised bilingual dictionary
induction. In ACL, 2018.
[32] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in
neural information processing systems, pages 2440–2448, 2015.
[33] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In
NIPS, 2017.
[34] J. D. Williams, K. Asadi, and G. Zweig. Hybrid code networks: practical and effi-
cient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint
arXiv:1702.03274, 2017.
[35] B. Yang, I. Labutov, A. Prakash, and A. Azaria. Multi-relational question answering from
narratives: Machine reading and reasoning in simulated worlds. In ACL, 2018.
[36] M. Zhang, Y. Liu, H. Luan, and M. Sun. Adversarial training for unsupervised bilingual lexicon
induction. In ACL, 2017.
[37] T. Zhao, A. Lu, K. Lee, and M. Eskénazi. Generative encoder-decoder models for task-oriented
spoken dialog systems with chatting capability. In SIGDIAL Conference, 2017.
[38] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Align-
ing books and movies: Towards story-like visual explanations by watching movies and reading
books. ICCV, pages 19–27, 2015.