ACL 2018 Notes
Shuai Tang
shuaitang93@ucsd.edu
Andrej Zukov-Gregoric
andrej.zukovgregoric@blackrock.com
Contents
1 Sunday, July 15th: Tutorials
  1.1 Neural Approaches to Conversational AI
  1.2 Neural Semantic Parsing
2 Monday, July 16th: Day 1
  2.1 Probabilistic FastText for Multi-Sense Word Embeddings
  2.2 Unsupervised Learning of Distributional Relation Vectors
  2.3 Explicit Retrofitting of Distributional Word Vectors
  2.4 Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External Commonsense Knowledge
  2.5 Multi-Relational Question Answering from Narratives: Machine Reading and Reasoning in Simulated Worlds
  2.6 Simple and Effective Multi-Paragraph Reading Comprehension
  2.7 Semantically Equivalent Adversarial Rules for Debugging NLP Models
  2.8 On the limitations of Unsupervised Bilingual Dictionary Induction
  2.9 Generating Sentences by Editing Prototypes
  2.10 A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings
3 Tuesday, July 17th: Day 2
  3.1 Batch IS NOT Heavy: Learning Word Representations From All Samples
  3.2 Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations
  3.3 Guess Me if You Can: Acronym Disambiguation for Enterprise
4 Wednesday, July 18th: Day 3
  4.1 What you can cram into a single vector: Probing sentence embeddings for linguistic properties
ACL 2018 has ended here in Melbourne, so we decided to spend some time organising our notes
into a more coherent document. The goal of this document is to help you quickly gain an overview
of the current directions in which NLP is moving.
1 Sunday, July 15th: Tutorials
It begins! Sunday starts with Tutorials. They’re divided into morning and afternoon ones. I spent
the morning at Neural Approaches to Conversational AI followed by Neural Semantic Parsing in
the afternoon.
1.1 Neural Approaches to Conversational AI
The tutorial is led by Jianfeng Gao and Michel Galley from MSR Redmond and Lihong Li from
Google Seattle.
Jianfeng opens by saying that dialogue can be viewed as a unifying task in NLP because it necessi-
tates the solving of other tasks such as:
1. reading comprehension for extracting information from dialogue history
2. semantic parsing done either explicitly or implicitly, for being able to reason across KBs
3. dialogue state tracking for being able to encode the state of a conversation across time
4. natural language understanding for being able to identify the intents of dialogue partici-
pants
5. text generation for being able to formulate responses or ask clarification questions.
The tutorial identifies three categories of dialogue agents: (1) question answering agents, (2) task-
oriented dialogue agents, and (3) social bots.
Question Answering Agents
Jianfeng continues by saying that we can divide QA agents into two further categories: (1) text-QA
agents which attempt to answer questions across passages of text (think reading comprehension à la
SQuAD & answer sentence selection à la WikiQA) and (2) KB-QA agents which attempt to answer
questions across knowledge bases.
Papers to read:
Stochastic Answer Networks for Machine Reading Comprehension [20]
Link Prediction using Embedded Knowledge Graphs [28]
A Knowledge-Grounded Neural Conversation Model [11]
Task-oriented agents
Following speech act theory in linguistics, it is clear we often partake in dialogue with the intent
of achieving something. Modelling intent over time thus becomes crucial. Typically, task-oriented
agents are architected to have
1. a natural language understanding (NLU) module for identifying intents and slot filling;
2. a state tracker for tracking conversation state;
3. a dialogue policy which selects the next action based on the current state;
4. a natural language generator (NLG) for response generation.
The tutorial goes on to cover a few recent papers on the above: E2E memory networks [32], Neural
Belief Tracker [23], Hierarchical Policy Learners [25], Integrating Planning for Dialogue Policy
Learning [24], Hybrid Code Networks [34], and An E2E Neural Dialogue System [18].
Finally, the tutorial mentioned the Microsoft Dialogue Challenge at SLT-2018.
Social bots
The rest of the tutorial covered social bots. These are dialogue systems that are fully data-driven
in the sense that they need little interaction with the user’s environment (no need for API calls or
reasoning across KBs) and such models cope well with free-form dialogues.
Papers to read:
Although task-oriented dialogue systems and social bots were originally developed for different
purposes, there is a trend of combining both as a step towards building an open-domain dialogue
agent. For example, on the one hand, [11] presented a fully data-driven and knowledge-grounded
neural conversation model aimed at producing more contentful responses without slot filling. On
the other hand, [37] proposed a task-oriented dialogue agent based on the encoder-decoder model
with chatting capability. These works represent steps toward end-to-end dialogue systems that are
useful in scenarios beyond chitchat.
1.2 Neural Semantic Parsing
Sunday afternoon. Time for tutorial two. The jet lag is real.
The tutorial was led by Luke Zettlemoyer and his colleagues Pradeep Dasigi, Srini Iyer, Alane Suhr,
and Matt Gardner. Slides can be found at https://github.com/allenai/acl2018-semantic-parsing-tutorial.
Luke Zettlemoyer began the tutorial by summarizing recent events in the field. In short, the guiding
motivation behind semantic parsing is that the meaning behind language can be condensed into a
logical form. Parsing out this form should thus be of interest if we are to elucidate the semantic
meaning behind natural language. Historically, researchers tackled this problem using combinatory
categorial grammars (CCGs), but by 2016/17, with the advent of sequence-to-sequence models, it
was clear that neural approaches to parsing text into logical form performed better.
Next up was Alane Suhr, who gave a quick overview of the various datasets in the field. Briefly, she
divided datasets into four categories
1. traditional semantic parsing datasets where the goal is to generate executable representa-
tions,
2. datasets grounded in some environment (think NLVR),
3. Broader domain datasets such as the AMR dataset where the goal is to map any English
sentence into a formal AMR representation, or WikiTableQuestions
4. Sequential language understanding datasets which model sequences of natural language
utterances paired with some logical form (ATIS dataset, SCONE, SQA)
Pradeep Dasigi continued the tutorial by talking about constrained decoding. Traditional semantic parsers used grammar-based parsing algorithms, whereas newer neural semantic parsers use encoder-decoder architectures. However, decoders can generate outputs which are not syntactically or semantically valid. How do we constrain the space of possible generations to be more valid? Two models were discussed: Seq2Tree, which produces syntactically (but not necessarily semantically) valid output, and neural semantic parsing with type constraints, which also produces semantically valid outputs by generating from a grammar that guarantees the generated logical form is well-typed.
Pradeep continued by talking about how semantic parsers are commonly trained. To recap, a common task is: given a question and some context, such as a database, as input, map them to an output logical form. Manually annotating these logical forms is expensive, which makes it hard to pose this as a fully supervised problem. Instead, we can turn to weaker forms of supervision: rather than optimizing for the correct logical form, we optimize for the correct answer as generated by the logical form we output.
There are three common training methods in this space:
1. Maximum Marginal Likelihood, where we want to maximize the probability of an answer given our input; since we output logical forms, we have to marginalize over all logical forms which yield our answer to get its conditional probability. Since this is intractable, there are heuristics for searching across logical forms, which can be either online or offline.
2. Structured Learning Methods, where we try to maximise margins or minimize expected risks.
3. Reinforcement Learning Methods.
Srini Iyer talked about semantic parsing as code generation.
Alane Suhr came back to talk about Context-Dependent Language Understanding. Clearly logical
forms can be constrained even more across time such as in a dialogue setting. For example, if
I prompt a system to “Show me flights from Seattle to Boston next Monday” followed by “On
American Airlines” I am clearly making the second prompt dependent on the first and as such
constraining the space of possible logical forms.
To model this, we can make prompts sequentially dependent, whereby the decoder at time t depends not only on the prompt at time t but also on some condensed representation of previous prompts. This is achieved with a turn-level encoder which produces, at each turn, a vector representation of the history of prompts so far. These representations are then concatenated with every input word embedding entering the encoder at the next time step. Additionally, the decoder can be made dependent on previously outputted queries. The goal is to minimize the token-level cross-entropy loss of the generated output SQL queries.
Interestingly, because each batch consists of an entire interaction, we backpropagate through an entire interaction, which means our gradients are sensitive to the length of our interactions. To remove this sensitivity, an interaction loss can be introduced: a term that rescales the loss to re-normalize it based on the length of the current interaction batch.
Another interesting insight is that positional embeddings can be added to the input hidden states of an attentional module by concatenating to them the positions at which they appear, in reverse order, e.g.: [3; h0], [2; h1], [1; h2], [0; h3].
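A minimal sketch of this reverse-position concatenation (plain NumPy, with a made-up hidden size; the variable names are ours):

```python
import numpy as np

# Hidden states h_0 .. h_3 for a 4-token prompt (hypothetical 5-dim states).
hidden_states = [np.random.randn(5) for _ in range(4)]

# Concatenate each state with its position counted from the end:
# [3; h0], [2; h1], [1; h2], [0; h3].
augmented = [
    np.concatenate(([len(hidden_states) - 1 - i], h))
    for i, h in enumerate(hidden_states)
]
print([a.shape for a in augmented])  # each state grows by one dimension
```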
State of the art performance is achieved on (a modified) ATIS dataset and a detailed ablation study
is provided.
The new CONALA dataset (https://conala-corpus.github.io/) was presented.
Open questions: clarification questions, latent decisions, and learning to distinguish between meaning derived from the current utterance vs. the interaction history.
Finally, Matt Gardner introduced AllenNLP (https://allennlp.org/).
2 Monday, July 16th: Day 1
The World Cup final started at 1:00AM local time. Needless to say, many were tired.
2.1 Probabilistic FastText for Multi-Sense Word Embeddings
Ref: [2]
Background: There are two types of word embeddings: (1) dictionary-level embeddings such as word2vec, and (2) probabilistic word embeddings, where words are assigned a distribution instead of a point in vector space. Probabilistic FastText (PFT) provides probabilistic character-level representations of words. In probabilistic embeddings every word w is associated with a density function such as a single Gaussian or a Gaussian mixture with K components. Individual Gaussian components are good at representing the multiple senses of a word.
Motivation: Create a probabilistic FastText which can better capture the multiple senses of words.
Method: In the simple case a single Gaussian is used. The mean holds much of the semantic information and in the single-Gaussian case is a function of both the n-gram and dictionary features:

$$\mu_w = \frac{1}{|NG_w| + 1}\Big(v_w + \sum_{g \in NG_w} z_g\Big)$$

where $z_g$ is the vector representation of n-gram feature $g$, $NG_w$ is the set of n-gram features for word $w$, and $v_w$ is its dictionary representation.
The model parameters to be learned are $z_g$ and $v_w$. A margin loss is used to push the energy between a word and a true context pair to be higher than between the word and a negative context pair. The energy between two words is defined to be an expected likelihood kernel (the closed form of which is given in the paper).
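A toy sketch of the single-Gaussian mean above, assuming we already have dictionary and n-gram vectors; the names `pft_mean`, `ngram_vectors` and `dict_vectors` are ours, not the paper's:

```python
import numpy as np

def pft_mean(word, ngram_vectors, dict_vectors, n=3):
    """Single-Gaussian mean: average of the dictionary vector and the word's
    character n-gram vectors (a sketch of the formula above; the real model
    also learns covariances and mixture weights)."""
    ngrams = [word[i:i + n] for i in range(len(word) - n + 1)]
    vecs = [ngram_vectors[g] for g in ngrams if g in ngram_vectors]
    total = dict_vectors[word] + sum(vecs)
    return total / (len(vecs) + 1)

# Toy example with random 4-dimensional vectors.
rng = np.random.default_rng(0)
dict_vectors = {"where": rng.normal(size=4)}
ngram_vectors = {g: rng.normal(size=4) for g in ["whe", "her", "ere"]}
print(pft_mean("where", ngram_vectors, dict_vectors))
```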
Results: Evaluated on a range of word-similarity datasets. In the multivariate Gaussian setting it achieves state-of-the-art results at 50 dimensions, though it is less competitive at 300 dimensions. Still, overall it outperforms older methods on the larger word-similarity datasets even at 300 dimensions.
2.2 Unsupervised Learning of Distributional Relation Vectors
Ref: [15]
Background: A remarkable property of word embeddings is that they capture lexical relationships beyond mere similarity; e.g., "a is to b what c is to ?" can easily be answered. More complicated relationships, such as the fact that Macron succeeded Hollande as president, are harder to capture with word embeddings. The paper proposes to learn a relation vector $r_{st}$ between two words $s$ and $t$ in an unsupervised fashion.
Motivation: Traditionally $r_{st}$ is built by averaging the embeddings of the words occurring between $s$ and $t$, but this has two big drawbacks: (1) many words occurring between $s$ and $t$ will be semantically related to $s$ or $t$ but not descriptive of the relationship between the two; (2) it gives too much weight to stop-words, which cannot simply be removed because certain stop-words are crucial for modelling relationships.
Method: First the authors propose a modified GloVe model which, instead of learning

$$\sum_i \sum_{j: x_{ij} \neq 0} f(x_{ij})\big(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log x_{ij}\big)^2,$$

learns the following instead:

$$\sum_i \sum_{j \in J_i} \frac{1}{\sigma_j^2}\big(w_i \cdot \tilde{w}_j + \tilde{b}_j - \mathrm{PMI}_S(i, j)\big)^2$$
The use of smoothed frequency counts and residual-variance-based weighting makes the word embeddings more robust to rare words, which is important in relation extraction as the relation vectors are often estimated from very sparse co-occurrence counts (see the paper for details).
Clearly the objective pushes $w_i \cdot \tilde{w}_j + \tilde{b}_j$ to approximate $\mathrm{PMI}_S(i, j)$. We can think of the word vector $w_i$ as a low-dimensional encoding of $(\mathrm{PMI}_S(i, 1), \ldots, \mathrm{PMI}_S(i, n))$. By replacing $w_i$ with a point in space which models a vector relation, such as $(w_i - w_k) \cdot \tilde{w}_j = \mathrm{PMI}_W(i, j) - \mathrm{PMI}_W(k, j)$, we can begin interpreting relations in terms of PMI.
Learning Global Relation Vectors: So how do they learn the relation vectors $r_{st}$? GloVe is based on statistics about (main word, context word) pairs, whereas relations need statistics on (source word, context word, target word) triples. It turns out there are well-known generalizations of PMI to three arguments. An objective to learn the relation vectors is then:

$$\sum_{j \in J_{ik}} \big(r_{ik} \cdot \tilde{w}_j + \tilde{b}_j - \mathrm{SI}(i, j, k)\big)^2$$
where SI is the three-argument PMI. Computing $r_{ik}$ for every pair of words is infeasible given the number of $(i, j, k)$ triples. Instead, word embeddings are first learned using the modified GloVe objective; then, when learning the relation vectors, the context vectors $\tilde{w}_j$ and biases $\tilde{b}_j$ are kept fixed.
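A rough sketch of fitting one relation vector against precomputed three-way PMI scores while keeping the context vectors and biases fixed; the least-squares loop below is our own illustration, not the authors' implementation:

```python
import numpy as np

def learn_relation_vector(context_vecs, context_bias, si_scores,
                          dim, lr=0.1, epochs=200):
    """Fit a relation vector r for one (source, target) word pair by least
    squares against precomputed three-way PMI scores SI(i, j, k), with the
    context vectors and biases held fixed."""
    rng = np.random.default_rng(0)
    r = rng.normal(scale=0.1, size=dim)
    for _ in range(epochs):
        # residuals: r . w~_j + b~_j - SI(i, j, k) over context words j
        pred = context_vecs @ r + context_bias
        err = pred - si_scores
        grad = context_vecs.T @ err / len(si_scores)
        r -= lr * grad
    return r

# Toy data: 50 context words, 10-dimensional embeddings, fake SI scores.
rng = np.random.default_rng(1)
W = rng.normal(size=(50, 10))
b = rng.normal(size=50)
si = rng.normal(size=50)
print(learn_relation_vector(W, b, si, dim=10)[:3])
```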
Results: The paper trains on a dump of Wikipedia and compares the trained model to other well-known relation representation methods, such as taking the Diff, Avg, or Conc of the context word embeddings. They beat all baselines on well-known relation datasets: the Google Analogy Test Set and the DiffVec dataset. A relation prototypicality study is also conducted on SemEval 2012 Task 2, and a relation extraction study on the NYT corpus.
2.3 Explicit Retrofitting of Distributional Word Vectors
Ref: [12]
Background:
Semantic specialization of distributional word vectors, referred to as retrofitting, is the process of fine-tuning word vectors using external lexical knowledge in order to better embed some semantic relation. Simple similarity measures between word embeddings encode an abstract semantic association instead of a precise semantic relation. For example, it is difficult to discern synonyms from antonyms in distributional spaces. This is particularly debilitating in downstream tasks which rely on more precise semantic relations between words. A standard solution is to move beyond unsupervised approaches in a process called word vector space specialization, or retrofitting, where external resources such as WordNet are used to specialize distributional spaces for lexical relations. This is achieved using two main strategies: (1) joint specialization models, which integrate external constraints into the distributional training procedure, and (2) post-processing models, which inject lexical knowledge retroactively to satisfy external constraints. (2) tends to outperform (1) but suffers from one big drawback: it updates only the vectors of words present in the external resource.
Motivation: Merge (1) and (2).
Method: The paper starts off by presenting a new way of creating constraint-aware training instances that help nudge the model in the right direction. This is done by taking an external resource of (word, word, constraint) triplets $(w_i, w_j, r)$ and then generating examples from it by retrieving the $K$ words closest to $w_i$ and the $K$ words closest to $w_j$, thus forming a micro-batch $M$:

$$M(w_i, w_j, r) = \{(x_i, x_j, g_r(x_i, x_j))\} \cup \big\{(x_i, x_k^m, g(x_i, x_k^m))\big\}_{k=1}^{K} \cup \big\{(x_j, x_k^n, g(x_j, x_k^n))\big\}_{k=1}^{K}$$
where $g_r$ is a distance score in the target space we want to learn, tied to the relation. For example, if we want to learn synonymy, $g_r$ should minimize the distance between $x_i$ and $x_j$. For words in the two sampled sets, we assume that the distance $g(\cdot, \cdot)$ in the specialized space should be the same as in the distributional space.
The paper proposes learning an explicit specialization function $f: X \rightarrow X'$ which, when applied to the distributional vector space $X$, transforms it into a specialized space $X'$ that better captures semantic relations. The function $f$ is set to be a feed-forward neural net.

The micro-batches $M$ (each containing $2K + 1$ examples) are fed to the model, where each training example consists of a pair of distributional (unspecialized) embeddings $x_i$ and $x_j$ and a score $g$ denoting the desired distance between the vectors. The objective then becomes:
$$J_{MSD} = \sum_{i=1}^{N} \big(g(f(x_1^i), f(x_2^i)) - g_i\big)^2$$
which minimizes the mean squared difference between the distances of the specialized vectors produced by $f$ and the desired distances $g_i$.

An additional topological regularization term is added which helps the model not disrupt the beneficial properties of the distributional space:
$$J_{REG} = \sum_{i=1}^{N} g\big(x_1^i, f(x_1^i)\big) + g\big(x_2^i, f(x_2^i)\big)$$
The above simply says, keep vectors in the distributional space close to vectors in the specialized
space.
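A small sketch of the combined objective, assuming $g$ is Euclidean distance and $f$ is any callable (the paper uses a feed-forward net); function and variable names are ours:

```python
import numpy as np

def specialization_loss(f, x1, x2, target_g, reg_weight=1.0):
    """J_MSD + topological regularizer for one micro-batch (a sketch under
    the assumption that g is Euclidean distance; f is any map from the
    distributional space to the specialized space)."""
    def g(a, b):
        return np.linalg.norm(a - b, axis=-1)

    j_msd = np.mean((g(f(x1), f(x2)) - target_g) ** 2)
    j_reg = np.mean(g(x1, f(x1)) + g(x2, f(x2)))
    return j_msd + reg_weight * j_reg

# Toy check with an identity "specialization" function.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
desired = np.zeros(8)  # e.g. synonyms: push pairs together
print(specialization_loss(lambda x: x, x1, x2, desired))
```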
Results: The method is shown to outperform distributional vectors on a bunch of tasks.
2.4 Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External
Commonsense Knowledge
Ref: [9]
Background: Reading comprehension is the task of answering questions about, or with the help of,
a passage of context text. In simple cases such as in SQuAD 1.0, the answer is always a contiguous
span of passage text. Complexities include answers which depend on long distance dependencies,
or even rely on common sense knowledge underivable from the passage alone. The paper focuses
on the cloze-style task where questions are formed from a passage of text by replacing tokens in its
sentences with placeholders and the task is to fill them in.
Motivation: Integrate common sense knowledge to enhance Cloze-style reading comprehension.
Method: The paper uses the Open Mind Common Sense part of ConceptNet, a dataset of common-sense relation triplets (subject, relation, object) where the subject and object can be multi-word expressions. Each training example $(D, Q, A_{1\ldots10})$ is augmented with $P$ external knowledge facts picked heuristically. The problem is modelled using an attention-sum reader model [16]. The knowledge triplet components are separately encoded using the same biGRU used to encode the context tokens. By attending over the dot product between context-word and subject representations, multiplying this with the object representations, and sum-reducing across the object representations, we get back knowledge-enriched context representations.
Results: The Common Nouns and Named Entities partitions of the Children’s Book Test dataset are
used. State-of-the-art results are achieved on the Common Nouns dataset.
2.5 Multi-Relational Question Answering from Narratives: Machine Reading and
Reasoning in Simulated Worlds
Ref: [35]
Background: QA mostly divides into text-QA and KB-QA. However, this division doesn't capture all cases of QA. One such case is what this paper calls multi-relational QA over personal narrative, a special form of QA which is perhaps best understood through an example. Think of how one might store knowledge in a home assistant device so as to be able to query it later. If you're a developer, you might say "There is a new project starting. Its name is Project Alpha. It has 20 developers assigned to it. The delivery date is September next year. Actually, no, it's in November." One way of solving this problem is by learning how to store the information in a knowledge base first. Another way is to store it as text and then answer questions across it. The paper is about the latter.
Motivation: Learn to answer questions about a sequence of sentences where the sentences encode multiple relations between and about entities mentioned in the text.
Method: Four new datasets are created (collectively called TextWorlds) and five models are trained
and tested on them (LogReg, Seq2Seq, BiDAF, MemN2N, DrQA). The most interesting tests are
across-world tests where the model is expected to generalize across domains (e.g. from an academic
domain to a shopping domain). There are two test settings, (1) predicting the correct entity in the text
given the text (n-way classification) (2) predicting the correct spans (note that there can be multiple
spans).
Results: Preliminary results on applying existing models to the datasets are given. Across-world performance is poor; in-world performance is, of course, better. The most challenging examples are those where entity relations are deeply compositional. Could applying recurrent entity networks help?
2.6 Simple and Effective Multi-Paragraph Reading Comprehension
Ref: [10]
Background: Multi-paragraph text-QA is hard. Current neural models can only handle short pas-
sages of text. Whole documents are beyond the scope of current models unless the problem is
somehow pipelined.
Pipeline Method
The paragraph that has the smallest TF-IDF cosine distance to the question is first chosen. Next, a paragraph-level QA model is applied to it to find the answer. To handle noisy labels, such as cases where the answer is contained within multiple spans (some of which might be misleading), a summed objective function is used which optimizes the sum of the probabilities of all answer span starts and ends. For span starts, the objective is the log of the summed exponentiated scores of answer-start tokens over the same sum across all tokens:

$$\log \frac{\sum_{k \in A} e^{s_k}}{\sum_{i=1}^{n} e^{s_i}}$$
This objective is agnostic to how the model distributes probability mass across the possible answer
spans, thus the model can “choose” to focus on only the more relevant spans. The model which
computes the start and end scores is a collection of modules including bi-directional GRUs, the
bi-directional attention mechanism of BiDAF, self-attention modules, and a final prediction layer.
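A minimal sketch of the summed start objective above for a single paragraph (the end objective is analogous), written as a loss, i.e. the negative of the log-ratio:

```python
import numpy as np

def summed_start_objective(start_scores, answer_start_positions):
    """Negative log of the summed probability of all answer-start tokens,
    i.e. -log( sum_{k in A} e^{s_k} / sum_i e^{s_i} )."""
    scores = np.asarray(start_scores, dtype=float)
    log_denom = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    ans = scores[answer_start_positions]
    log_num = np.log(np.sum(np.exp(ans - ans.max()))) + ans.max()
    return -(log_num - log_denom)

# Token scores for a 6-token paragraph; the answer starts at tokens 1 and 4.
print(summed_start_objective([0.2, 2.0, -1.0, 0.5, 1.5, 0.0], [1, 4]))
```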
Confidence method
The second proposed approach builds on a common method which takes span start and end scores $s_i$ and $e_j$ and sums them to form a span confidence score. At test time the model is run on each paragraph and the answer span with the highest confidence is selected. However, this may lead to poorly calibrated confidence scores for two reasons: (1) models trained with a softmax objective need not care about the magnitudes of the inputs to the objective as long as the ratios between them are kept the same; (2) if the model only sees paragraphs that contain answers, it might become too confident in spurious patterns which it relates to an answer being present. Four approaches to mitigate these two problems are explored: (i) let the normalization term in the softmax objective run across all document paragraphs; (ii) merge all paragraphs during training into one; (iii) allow for a no-answer option; (iv) use a sigmoid loss objective, computing the start/end probabilities by applying the sigmoid function to the start/end scores. Since the scores are then evaluated independently, the idea is that they should be comparable across paragraphs.
Results: Three datasets are evaluated on: TriviaQA Web, TriviaQA Unfiltered and SQuAD. Strate-
gies (i), (ii), and (iii) perform best and do not degrade as more paragraphs are considered (at least for
TriviaQA). The most robust strategy is the shared norm (i) which doesn’t degrade on any dataset.
2.7 Semantically Equivalent Adversarial Rules for Debugging NLP Models
Ref: [30]
Background: Adversarial attacks in the computer vision community have been studied extensively.
A classifier might confuse an avocado with a millennial if we imperceptibly change a pixel or two.
Adversarial examples can help us study the robustness of our models as well as help elucidate their
reasoning.
Motivation: Come up with NLP adversarial examples which are semantically equivalent, i.e. we
want them to be worded differently whilst preserving their meaning.
Method: The paper introduces the Semantically Equivalent Adversary (SEA), defined to be a paraphrase $x'$ of input text $x$ which leads to a changed prediction $f(x) \neq f(x')$. Paraphrases are created by translating $x$ into multiple pivot languages and taking the scores of the back-translations to be proportional to $P(x'|x)$. For these scores to be consistent they are normalized into what the paper calls a semantic score $S(x, x')$. Semantic equivalence is then defined as $\mathrm{SemEq}(x, x') = \mathbb{1}[S(x, x') \geq \tau]$.

The paper also introduces Semantically Equivalent Adversarial Rules (SEARs), a rule-based system which generates SEAs. A rule is of the form $r = (a \rightarrow c)$, where the first instance of the antecedent $a$ is replaced by the consequent $c$ in every instance that includes $a$. For example, $r = (\text{movie} \rightarrow \text{film})$ would lead to $r(\text{"Great movie!"}) = \text{"Great film!"}$. More general rules can be formed, such as $(\text{What NOUN} \rightarrow \text{Which NOUN})$.
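A toy, string-level sketch of applying a SEAR-style rule; the real rules also match POS tags and are filtered by the semantic score described above:

```python
def apply_sear_rule(rule, text):
    """Apply a SEAR-style rule (antecedent -> consequent) by replacing the
    first occurrence of the antecedent in the text."""
    antecedent, consequent = rule
    if antecedent in text:
        return text.replace(antecedent, consequent, 1)
    return text

print(apply_sear_rule(("movie", "film"), "Great movie!"))  # -> "Great film!"
```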
Results: The results show that SEAs are very effective at confusing QA models. Tests are performed using BiDAF on SQuAD and with a visual QA model applied to the VQA dataset. SEAs effectively confuse state-of-the-art models, and they can be used to augment the datasets these models are trained on to make them more robust.
2.8 On the limitations of Unsupervised Bilingual Dictionary Induction
Ref: [31]
It is so good to listen to my favourite blogger (so far) talking about his research on analysing unsu-
pervised algorithms for bilingual dictionary induction.
Background: Research on bilingual dictionary induction (BDI) and machine translation has a very long history, and it also has a direct impact on our daily lives, since it matters how we communicate with people who don't speak the language(s) we speak. Recent research on unsupervised approaches to this topic has drawn wide attention, since such approaches don't require labelled data, which can be costly and time-consuming to collect.
[8] proposed an unsupervised four-step method for word translation, and it demonstrates strong results that are sometimes even better than supervised methods. Speaking from a linguistic perspective, the success of [8] might be due to the fact that the languages used in their experiments are linguistically similar, so it is important to study when the proposed method fails, and more generally the limitations of unsupervised methods for BDI.
Motivation: Unsupervised approaches to BDI generally start with two sets (source and target) of word embeddings pre-trained on monolingual corpora, and the task is to build a dictionary that translates words in one language into the other. The paper focuses on studying the impact of a few factors on the performance of the approach proposed in [8], and shows that it works well only when the two languages are linguistically similar and the two sets of word embeddings are derived from the same domain, such as Wikipedia. A new eigenvalue-based graph similarity is proposed to measure how topologically similar two sets of word embeddings are, and the paper shows a strong correlation between the model's performance on BDI and this graph similarity. In addition, weak supervision from identically spelt words in the two languages provides surprisingly good performance across different language pairs.
Graph Similarity: Since, as illustrated in the paper, word embeddings are far from isomorphic across languages, the paper proposes to evaluate how isospectral two sets of word embeddings are by comparing the eigenvalues of their Laplacian matrices. Suppose $G_1$ and $G_2$ are (graphs built from) the two sets of word embeddings of the two languages, $A_1$ and $A_2$ are their adjacency matrices, and $L_1 = D_1 - A_1$ and $L_2 = D_2 - A_2$, where $D_1$ and $D_2$ are the diagonal degree matrices, $(D_i)_{nn} = \sum_m (A_i)_{nm}$, $i \in \{1, 2\}$. The similarity metric is defined as follows:
$$\Delta = \sum_{i=1}^{k} (\lambda_{1i} - \lambda_{2i})^2 \qquad (1)$$

$$\text{where } k = \min_j \left\{ \frac{\sum_{i=1}^{k} \lambda_{ji}}{\sum_{i=1}^{n} \lambda_{ji}} > 0.9 \right\} \qquad (2)$$
The above equations can be interpreted as: 1) find the minimal $k$ such that the top $k$ eigenvalues capture 90% of the energy in the spectrum; 2) calculate the Euclidean distance between the two sets of top-$k$ eigenvalues. The paper shows that the proposed graph similarity is positively correlated with the performance of the unsupervised BDI algorithm, specifically [8]'s method.
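A small sketch of the eigenvalue similarity in Eqs. (1)-(2), assuming the adjacency matrices of the two embedding graphs are already built (how the graphs are constructed from the embeddings is omitted here):

```python
import numpy as np

def eigenvalue_similarity(A1, A2, energy=0.9):
    """Laplacian eigenvalue distance between two graphs given their adjacency
    matrices (smaller = more isospectral). A sketch of Eqs. (1) and (2)."""
    def laplacian_spectrum(A):
        L = np.diag(A.sum(axis=1)) - A
        return np.sort(np.linalg.eigvalsh(L))[::-1]  # descending

    def k_for(lams):
        cum = np.cumsum(lams) / lams.sum()
        return int(np.searchsorted(cum, energy) + 1)

    l1, l2 = laplacian_spectrum(A1), laplacian_spectrum(A2)
    k = min(k_for(l1), k_for(l2))
    return float(np.sum((l1[:k] - l2[:k]) ** 2))

# Toy symmetric adjacency matrices for two 4-node graphs.
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
B = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], float)
print(eigenvalue_similarity(A, B))
```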
Summary of [8]: An unsupervised word translation/BDI algorithm was proposed in [8], which demonstrated strong performance across different language pairs. The algorithm is summarised in 4 steps:
1. Monolingual word embeddings: The word embeddings of each language are derived by running FastText [3] on a monolingual corpus.
2. Adversarial mapping: A linear mapping W is learnt between the two sets of embeddings with an adversarial classifier. Orthonormal regularisation is applied during training.
3. Refinement (Procrustes analysis): A set of translations is generated by retrieving bidirectional nearest neighbours, and the orthogonal Procrustes problem is solved to refine the mapping W (a minimal sketch follows after this list):
$$W^{\star} = \arg\min_{W} \|WX - Y\|_F = UV^{\top} \quad \text{s.t.} \quad U\Sigma V^{\top} = \mathrm{SVD}(YX^{\top})$$
4. Cross-domain similarity local scaling (CSLS): The hubness issue is common in high-dimensional data analysis, and CSLS is proposed to expand high-density areas and condense low-density ones.
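A minimal sketch of the Procrustes refinement step, where the columns of X and Y hold the embeddings of the induced translation pairs; the toy check recovers a known rotation:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes solution W* = U V^T with U S V^T = SVD(Y X^T)."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Toy check: recover a known orthogonal map from noisy pairs.
rng = np.random.default_rng(0)
d, n = 5, 200
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # ground-truth orthogonal map
X = rng.normal(size=(d, n))
Y = Q @ X + 0.01 * rng.normal(size=(d, n))
W = procrustes(X, Y)
print(np.round(np.abs(W - Q).max(), 3))        # close to 0
```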
The paper briefly goes through the unsupervised method for BDI in [8], and then it talks about the
impact of different factors on both linguistics and machine learning side.
Impact of language similarity. Experiments conducted in [8] contain languages that are mostly dependent-marking, except for French, which is mixed-marking. (Dependent-marking languages include English, German, Chinese, Russian, and Spanish, etc. An example in German: "ein Herr" is grammatically correct since "Herr" is a masculine noun, but "ein" is marked to "eine" in "eine Frau" since "Frau" is a feminine noun. An example of head-marking in English: "walk" in "I walk" is marked to "walks" in "John walks". Whether a language is categorised as head- or dependent-marking depends on which of the two happens more frequently; if both happen equally frequently, the language is mixed-marking, and if both head and dependent are marked, it is double-marking.) This paper collected other languages that are mixed-marking (Estonian and Finnish), double-marking (Greek), and dependent-marking (Hungarian, Polish and Turkish) to study the impact of language similarity.

The unsupervised adversarial method failed (close to 0 performance) when learning to translate from English to Estonian, Finnish and Greek, which are not dependent-marking languages as English is. The results suggest that the method in [8] seems to be challenged when pairing English with languages that are not isolating and do not have dependent marking. When translating between two languages that are both mixed-marking, the unsupervised method didn't fail. Thus, language similarity at the morphological level indeed has an impact.
Impact of domain differences The paper collected three corpora to study the impact of domain
differences in training data. The results show that the unsupervised method in [8] failed when source
and target word embeddings were derived from two different domains, which seems to suggest that
corpora in similar or comparable domains are required in order to make the unsupervised method
work.
Impact of hyperparameters. Hyperparameters are also important in determining whether the unsupervised method works or not.
[21] proposed the skip-gram and continuous bag-of-words algorithms for efficient estimation of word vectors. The unsupervised method in [8] doesn't work if the source and target word embeddings are not derived from the same algorithm. Even when both sets of word embeddings are derived from the same algorithm, varying the hyperparameters leads to slightly worse performance.
Impact of evaluation procedure. 1) Part-of-speech: in general, performance on verbs is the lowest across all methods, since the meaning of a verb relies heavily on the context it is situated in, and verbs have more morphological variants. 2) Homographs: surprisingly, words whose translations are homographs are associated with lower precision than other words.
To summarise, it seems that unsupervised approaches, even the very promising method in [8], are still very limited, which also means there is still a lot to explore in unsupervised BDI. Another issue to take into consideration is that weak supervision derived from identically spelt words in the two languages provides solid performance, and it should be considered a strong baseline for challenging scenarios, such as translating between two languages that are not in the same morphological category.
2.9 Generating Sentences by Editing Prototypes
Ref: [13]
(Okay, this is a TACL paper, and it was presented at ACL2018.)
Background: Generative modelling of sentences is hard, as recurrent neural language models tend to produce generic utterances seen in the training set, which leads to poor diversity in generated sentences. Instead of drawing samples from scratch, the paper proposes a novel way of generating sentences by editing prototypes, in the sense that a sentence is generated by conditioning on a prototype sentence and a sampled edit vector.
Method: The prototypes $x'$ of a sentence $x$ are those with high lexical overlap with $x$, as measured by Jaccard distance $d_J$. For each sentence $x$, a set of similar neighbours is collected: $N(x) = \{x' \in X : d_J(x, x') < 0.5\}$.
The proposed model is called the "neural editor". The training objective follows the idea of VAEs [17], and the model has three parts.
I. Neural editor $p_{edit}(x|x', z)$. The proposed neural editor is a left-to-right sequence-to-sequence model with attention. The encoder produces hidden states given the prototype $x'$, and the decoder learns to generate the sentence $x$ conditioned on an edit vector $z$ by concatenating $z$ to the input of the decoder at each time step.
II. Edit prior $p(z)$. Each sample from the prior distribution is a product of two samples independently drawn from two distributions:

$$z_{dir} \sim \mathrm{vMF}(0), \qquad z_{norm} \sim \mathrm{Unif}(0, 10), \qquad z = z_{norm} \cdot z_{dir}$$

where $\mathrm{vMF}(0)$ is a von Mises-Fisher distribution with concentration $\kappa = 0$, which is essentially a uniform distribution over a high-dimensional sphere, and $\mathrm{Unif}(0, 10)$ is a uniform distribution between 0 and 10. The idea is that the prior decomposes into two independent distributions, one determining the direction of the edit vector and the other controlling its length. More intuitively, $z_{dir}$ encodes how the prototype should be edited, and $z_{norm}$ encodes by how much.
III. Approximate edit posterior $q(z|x, x')$. As the prior is composed of two distributions, the approximate edit posterior also outputs two components:

$$q(z_{dir}|x, x') = \mathrm{vMF}(z_{dir}; f_{dir}, \kappa) \propto \exp(\kappa\, z_{dir}^{\top} f_{dir})$$

$$q(z_{norm}|x, x') = \mathrm{Unif}(z_{norm}; [\tilde{f}_{norm}, \tilde{f}_{norm} + \epsilon])$$

where $z_{dir}$ is sampled from $\mathrm{vMF}(z_{dir}; f_{dir}, \kappa)$, defined by a fixed concentration $\kappa$ and a normalised vector $f_{dir}$ indicating the direction of concentration, and $z_{norm}$ is sampled from a uniform distribution. The fun part is how the model calculates $f_{dir}$ and $f_{norm}$.
Suppose we sample a sentence $x$ and its prototype $x'$, and let $f$ represent how the prototype $x'$ should be modified into the target sentence $x$. Instead of learning a function that maps the two sentences to a vector $f$, the model makes use of pretrained word vectors and calculates $f$ by concatenating two vectors: the average of the word vectors of the words added to the prototype, and the average of the word vectors of the words deleted. Essentially, $f$ represents the word-level mismatch between the sampled prototype $x'$ and the target sentence $x$. Then $f_{norm} = \|f\|$ and $f_{dir} = f / f_{norm}$. For stable training, $\tilde{f}_{norm} = \min(f_{norm}, 10)$.
IV. Training objective. As usual, the training objective contains two parts: a reconstruction term $\mathbb{E}_{z \sim q(z|x, x')}[\log p_{edit}(x|x', z)]$ and a KL divergence term $D_{KL}(q(z|x, x')\,\|\,p(z))$. In the proposed parametrisation, the KL divergence term is a constant, and sampling from the vMF distribution can be easily implemented.
Advantages: First, since the prior distribution is defined on a high-dimensional unit sphere, and the
distance between any two edit vectors is encoded as the dot-product between two normalised vectors,
it matches the cosine similarity metric which is naturally used in comparing similarity between word
vectors. Thus, the edit vectors should be able to capture semantic difference between a sampled
target sentence and its prototype.
Second, the proposed parametrisation avoids "posterior collapse", as the KL divergence term in the training objective is a constant that depends only on the concentration $\kappa$ of the vMF distribution. Previous VAEs with a Gaussian prior suffer from the posterior-collapse issue, and [4] proposed a couple of tricks to weaken the auto-regressive decoder used in their VAE; here, no such tricks are necessary, since the training objective effectively contains a single reconstruction term.
Results: Language modelling tasks on Yelp and Google OneBillionWord are conducted, and the
proposed “neural editor” is compared with several baseline models. For every sentence in the test
set, the model retrieves at least one sentence from the training set as the prototype. The “neural
editor” provides improvement on a large number of sentences, while it gives lower log-likelihood
on test sentences which don’t have similar sentences in the training set, as the model is only trained
to make small edits (based on Jaccard distance).
Human evaluations of the plausibility, grammaticality and diversity of the generated sentences are collected for comparison; the "neural editor" has higher scores overall, and it is more stable than the baseline models when the temperature is varied during evaluation.
Thoughts: The parametrisation in this paper reminds me of another paper which also utilised a
very interesting parametrisation to set the KL divergence term in VAEs as a constant. Paper is here
[33].
2.10 A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of
Word Embeddings
Ref: [1]
A very interesting paper about fully self-supervised bilingual dictionary induction.
Motivation: The method proposed in [8] was evaluated on simple scenarios, as the tested languages are morphologically similar to each other, and thus the method yields great performance. However, it doesn't work well, and possibly fails, in challenging scenarios such as translating between English and Finnish, which belong to two different morphological types, or dealing with monolingual word embeddings derived from slightly dissimilar domains.
The paper aims to propose a robust self-supervised learning algorithm for unsupervised bilingual
dictionary induction from monolingual word embeddings only, and the proposed algorithm is capa-
ble of handling challenging scenarios as described above.
Method: The proposed algorithm contains 4 steps, and we will discuss each of them below:
I. Monolingual word embeddings: The word embeddings are learnt using CBOW [21] on monolin-
gual corpora for each language. After learning, word vectors are normalised and zero-centred, and
these 2 steps have been shown to be beneficial to the task. Then word vectors are normalised again
in order to keep a unit length.
II. Fully unsupervised initialisation: The underlying assumption is that the embedding spaces are isometric (distance-preserving; for simplicity, isometry here means there exists a bijective transformation between the two sets of word vectors that preserves the distance between any two words). Thus the similarity matrices of the source and target languages, $M_X = XX^{\top}$ and $M_Z = ZZ^{\top}$, should be equivalent up to a permutation of their rows and columns. However, it is intractable to try every possible permutation.

A simple approximation is proposed to overcome this issue. Each row of $M_X$ and $M_Z$ is sorted independently. Under the assumption of perfect isometry, the same word should get the exact same vector in the similarity matrix across the two languages, so a nearest-neighbour retrieval over the sorted rows serves as an initial dictionary for the following step. In practice, $M_X$ and $M_Z$ are sorted and normalised using the steps mentioned above, and regarded as the initialisation of the word vectors ($X_0$ and $Z_0$) of the source and target languages in the next step; $X_0$ and $Z_0$ are only used in the first iteration of the self-learning procedure, after which the original embeddings are used.
III. Robust self-learning procedure: The procedure consists of two steps (learning the mappings and inducing a dictionary), which are run iteratively until convergence:

$$W_X, W_Z = \arg\max_{W_X, W_Z \in O_d(\mathbb{R})} \sum_i \sum_j D_{ij}\big((X_i W_X) \cdot (Z_j W_Z)\big)$$

$$D_{ij} = \begin{cases} 1 & \text{if } j = \arg\max_k\, (X_i W_X) \cdot (Z_k W_Z) \\ 0 & \text{otherwise} \end{cases}$$

where $O_d(\mathbb{R})$ is the set of orthonormal matrices. The optimisation is guaranteed to converge to a local optimum, and four key steps are introduced to help the objective converge to a good local optimum:
1. At each iteration, elements of the similarity matrix are randomly kept with probability $p$ and the rest are set to 0. This turned out to be crucial for getting reasonable performance when translating between English and Finnish.
2. When learning the mappings $W_X$ and $W_Z$, only the 20,000 most frequent words in each language are used. This drastically decreases the running time during training while giving performance similar to keeping all words.
3. Cross-domain Similarity Local Scaling (CSLS), proposed in [8], is applied when building the dictionary in the second step, as it reduces the hubness issue in high-dimensional embeddings. It is crucial for making the proposed method work, as it is for the method in [8].
4. The dictionary is induced bidirectionally, $D = D_{XZ} + D_{ZX}$, in order to encourage the algorithm to avoid bad local optima.
IV. Refinement (symmetric re-weighting): Once self-learning has converged to a good solution, a symmetric re-weighting is applied in both languages, $W_X = US^{1/2}$ and $W_Z = VS^{1/2}$, where $U$, $S$, $V$ are obtained by running singular value decomposition on $X^{\top}DZ$, i.e. $USV^{\top} = X^{\top}DZ$. This encourages the mappings to explore dimensions beyond the most relevant ones that best match the current solution.
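A small sketch of the symmetric re-weighting step, assuming D is the binary dictionary matrix produced by self-learning:

```python
import numpy as np

def symmetric_reweighting(X, Z, D):
    """With U S V^T = SVD(X^T D Z), map the source space with W_X = U S^{1/2}
    and the target space with W_Z = V S^{1/2} (a sketch of the step above)."""
    U, S, Vt = np.linalg.svd(X.T @ D @ Z)
    W_X = U * np.sqrt(S)          # equivalent to U @ diag(S)^(1/2)
    W_Z = Vt.T * np.sqrt(S)
    return X @ W_X, Z @ W_Z

rng = np.random.default_rng(0)
X, Z = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
D = np.eye(6)                     # toy dictionary: word i translates to word i
X_mapped, Z_mapped = symmetric_reweighting(X, Z, D)
print(X_mapped.shape, Z_mapped.shape)
```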
Results: The experiments are conducted in two conditions, which are 1) when English is the source
language, and the target languages include Italian, German, Finnish, and Spanish, 2) when English
is the target language, and the source languages include Spanish, Italian and Turkish. The ablation
study on the importance of 4 key steps discussed earlier is conducted on the first condition. In
conclusion, the proposed method has three advantages against its comparison partners.
I. Fast: The running time of the proposed self-learning method is much less than previously
proposed unsupervised/self-supervised methods, including [8] and [36].
II. Stable: The proposed method is also much more stable than previously proposed methods. The authors ran each of the three models ([8], [36] and the proposed one) 10 times, and the proposed method didn't fail in any run, while the other two failed in several runs, especially in challenging scenarios.
III. Better: Overall, the proposed method provides stronger performance than the other two methods mentioned above.

Thoughts: The first author has been working on BDI and machine translation for a couple of years, in both supervised and unsupervised fashion. Follow him at http://www.mikelartetxe.com/.
3 Tuesday, July 17th: Day 2
3.1 Batch IS NOT Heavy: Learning Word Representations From All Samples
Ref: [14]
Motivation: Skip-gram, CBOW [21] and GloVe [26] were proposed to learn high-quality word vectors efficiently with stochastic gradient descent (SGD). However, in skip-gram and CBOW the quality of the learnt word vectors is highly sensitive to the empirical distribution used for negative sampling; as pointed out in [22], the unigram distribution raised to the power 0.75 seems to be a sweet spot, and the one-sample learning scheme in SGD causes fluctuation at the early stage of learning. In GloVe, only positive (observed in the training corpus) word-context pairs are used for learning, while all negative (unobserved) pairs are ignored, in contrast to running SVD on top of a Positive Pointwise Mutual Information (PPMI) matrix.

The paper proposes the "AllVec" learning algorithm, which utilises all positive and negative word-context pairs to learn word vectors; a reformulation of the objective function is proposed to enable (full) batch gradient descent, replacing noisy stochastic gradient descent and biased negative sampling.
Method: The objective function is defined as follows:

$$L = \underbrace{\sum_{(w,c) \in S} \alpha^{+}_{wc}\big(r^{+}_{wc} - U_w \tilde{U}_c^{\top}\big)^2}_{L_P:\ \text{observed/positive pairs}} + \underbrace{\sum_{(w,c) \in (V \times V) \setminus S} \alpha^{-}_{wc}\big(r^{-}_{wc} - U_w \tilde{U}_c^{\top}\big)^2}_{L_N:\ \text{unobserved/negative pairs}}$$
where $S$ is the set of observed word-context $(w, c)$ pairs in the training corpus, $V$ is the vocabulary, $U$ is the word embedding matrix, and $\tilde{U}$ is the context embedding matrix.

The objective is a regression problem, and it resembles running SVD on top of a PPMI matrix. Here, $r^{+}_{wc}$ is set to be the PPMI of the pair $(w, c)$, and $r^{-}_{wc}$ can be either 0 or 1. In order to downweight frequently occurring words in the training corpus, $\alpha^{+}_{wc}$ follows the subsampling scheme of GloVe [26]; more importantly, $\alpha^{-}_{wc}$ penalises more when the word $w$ appears more frequently, which is similar to the negative sampling scheme of skip-gram and CBOW [22].
In order to apply full batch gradient descent, a reformulation is applied:

$$\tilde{L} = \underbrace{\sum_{w \in V} \sum_{c \in V} \alpha^{-}_{c}\big(r^{-} - U_w \tilde{U}_c^{\top}\big)^2}_{L_A:\ \text{all pairs}} + \underbrace{\sum_{(w,c) \in S} \big(\alpha^{+}_{wc} - \alpha^{-}_{c}\big)\big(\Delta - U_w \tilde{U}_c^{\top}\big)^2}_{L'_P:\ \text{observed pairs}}$$

where $\Delta = (\alpha^{+}_{wc} r^{+}_{wc} - \alpha^{-}_{c} r^{-})/(\alpha^{+}_{wc} - \alpha^{-}_{c})$ and constant terms are omitted.
As $L_A$ computes a value for every word-context pair, the time complexity is $O(k|V|^2)$, where $k$ is the dimensionality of the vectors and $|V|$ is the size of the vocabulary, so it is crucial to reduce the cost of computing $L_A$. After expanding the square and applying commutativity to rearrange the calculation, the transformed $\tilde{L}_A$ (without constant terms) is:
$$\tilde{L}_A = \sum_{d=0}^{k} \sum_{d'=0}^{k} \Big(\sum_{w \in V} u_{wd}\, u_{wd'}\Big)\Big(\sum_{c \in V} \alpha^{-}_{c}\, \tilde{u}_{cd}\, \tilde{u}_{cd'}\Big) - 2r^{-} \sum_{d=0}^{k} \Big(\sum_{w \in V} u_{wd}\Big)\Big(\sum_{c \in V} \alpha^{-}_{c}\, \tilde{u}_{cd}\Big) = \sum_{d=0}^{k} \sum_{d'=0}^{k} p^{w}_{dd'}\, p^{c}_{dd'} - 2r^{-} \sum_{d=0}^{k} q^{w}_{d}\, q^{c}_{d}$$
In this way, before each iteration, $p^{w}_{dd'}$, $p^{c}_{dd'}$, $q^{w}_{d}$ and $q^{c}_{d}$ can be pre-calculated; the time complexity of computing each of them over all pairs is $O(|V|k^2)$, $O(|V|k^2)$, $O(|V|k)$ and $O(|V|k)$ respectively. With all terms pre-calculated, assembling $\tilde{L}_A$ costs $O(k^2)$. Therefore, in total, including pre-calculation and assembling the pre-calculated terms, the time complexity is $O(2|V|k^2 + 2|V|k + k^2) \approx O(2|V|k^2)$, as the size of the vocabulary $|V|$ is much larger than the dimensionality $k$ of the learnable vectors, and this is much smaller than the original $O(k|V|^2)$.
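A sketch of evaluating the all-pair term with the precomputed quantities instead of looping over all $|V|^2$ pairs (constant term omitted; variable names are ours):

```python
import numpy as np

def all_pair_loss_terms(U, U_ctx, alpha_c, r_neg):
    """Evaluate the all-pair part of the AllVec loss via the precomputed
    terms p^w, p^c, q^w, q^c (a sketch of the decomposition above)."""
    p_w = U.T @ U                                   # k x k
    p_c = (U_ctx * alpha_c[:, None]).T @ U_ctx      # k x k, weighted by alpha^-
    q_w = U.sum(axis=0)                             # length k
    q_c = (alpha_c[:, None] * U_ctx).sum(axis=0)    # length k
    return np.sum(p_w * p_c) - 2.0 * r_neg * np.dot(q_w, q_c)

rng = np.random.default_rng(0)
V, k = 1000, 16
U, U_ctx = rng.normal(size=(V, k)), rng.normal(size=(V, k))
alpha_c = rng.uniform(0.1, 1.0, size=V)   # per-context negative weights
print(all_pair_loss_terms(U, U_ctx, alpha_c, r_neg=0.0))
```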
Results: The training experiments are conducted on four corpora individually, and four comparison partners, including skip-gram, skip-gram with an adaptive sampler [5], GloVe, and LexVec [27], are trained with the same hyperparameter settings. The downstream tasks for evaluation include MEN, MC, RW, RG, WSim, and WRel.

Overall, AllVec outperforms the previously proposed comparison partners, although it requires more iterations to converge than the other SGD-optimised methods (the beauty of SGD).
In order to show that negative samples are important for learning high-quality word vectors, an ablation study illustrates the effect of varying $\alpha^{-}_{wc}$. The performance of the proposed AllVec drops drastically when $\alpha^{-}_{wc}$ is set to 0 (as in GloVe) compared to positive values of $\alpha^{-}_{wc}$, so it is helpful to include unobserved pairs when learning word vectors.
Thoughts: I tend to think of the proposed learning method as a weighted decomposition of the PPMI
matrix without the orthonormal constraint in SVD, and it performs well across all downstream tasks.
I would love to see whether removing the orthonormal constraint would help or not.
3.2 Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations
Ref: [29]
Background: Noun compounds are an important part of language. For example apple cake is a
noun compound signifying a particular type of cake. One way of describing the semantics of a noun-
compound is through paraphrasing. Thus apple cake is a cake made from apples. Noun-compound
paraphrasing may be considered a subtask of the general paraphrasing task.
Motivation: Develop a model which generalizes well to unseen noun-compounds and rare para-
phrases.
Method: Each training example contains two constituents and a paraphrase $(w_2, p, w_1)$, such as (cake, made of, apple), and the model is trained on 3 subtasks: (1) predict $p$ given $w_1$ and $w_2$, (2) predict $w_1$ given $p$ and $w_2$, and (3) predict $w_2$ given $p$ and $w_1$. The paraphrases are collected by taking a list of noun-compounds from SemEval and extracting common POS patterns such as [w$_2$] VERB PREP [w$_1$]. Thus the model actually learns to classify or rank across paraphrase templates. The model itself is a simple LSTM model.
Results: A qualitative analysis is presented, as well as an empirical evaluation on SemEval 2013 Task 4, where the model presented in the paper outperforms other models by a wide margin.
3.3 Guess Me if You Can: Acronym Disambiguation for Enterprise
Ref: [19]
Background: Anyone who has worked in a big company knows that acronyms can be annoying. Never mind usual word polysemy; acronyms are where things get really crazy. Furthermore, acronyms usually appear in isolation without their full meaning spelt out. Therefore, it is particularly useful to develop a system that can automatically resolve the true meanings of acronyms in enterprise documents. A further important distinction is that acronyms can be either external or internal to an organisation. For example, AI might stand for Asset Intelligence within Microsoft, whereas outside it usually stands for Artificial Intelligence. Even when acronyms denote the same thing internally and externally, the contexts within which they appear can be very different: OS might have a lot to do with performance or installation externally, whereas internally at a place like Microsoft it will appear in contexts to do with development or implementation.
Motivation: Solve the limitations of previous work which does not distinguish between external
and internal meanings.
Method: The paper divides Entity Acronym Disambiguation into two sub-problems: (1) Acronym Meaning Mining, which aims at mining acronym/meaning pairs from the enterprise corpus, and (2) Meaning Candidate Ranking, whose goal is to rank the candidate meanings associated with the target acronym. An assumption is made that the acronyms to be disambiguated are provided as input to the system. The paper does not try to optimize the performance of acronym detection (e.g. identifying acronyms beyond the simple capitalization rule, or distinguishing cases where a capitalized term is not an acronym but a regular English word, such as "OK").
I. Acronym Meaning Mining:
Candidate Generation: a phrase is considered to be a meaning candidate for an acronym if (1) the initial letters of the phrase match the acronym and the phrase and the acronym co-occur in at least one document in the enterprise corpus, or (2) it is a valid candidate for the acronym in public knowledge bases (e.g. Wikipedia). A small sketch of the initial-letter check follows after this list.
Popularity Calculation: for each candidate meaning, a popularity score is calculated, which reveals how often the candidate meaning is used as the genuine meaning of the acronym. This is done by finding a normalized count of meanings either across the entire corpus or across documents. The scores are later used as part of the input features for the model.
Candidate Deduplication: heuristic rules are used to deduplicate acronym meanings; e.g. CA, which stands for certificate authority, might also be denoted by Cert Auth or many other variants.
Context Harvesting: context words around each meaning candidate are harvested.
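A toy sketch of the initial-letter check used in candidate generation (rule (1) above); the real system additionally checks corpus co-occurrence and consults public knowledge bases:

```python
import re

def initial_letters_match(acronym, phrase):
    """Heuristic check: does the phrase's sequence of initial letters spell
    the acronym?"""
    initials = "".join(w[0] for w in re.findall(r"[A-Za-z]+", phrase))
    return initials.lower() == acronym.lower()

print(initial_letters_match("CA", "certificate authority"))   # True
print(initial_letters_match("CA", "customer account team"))   # False
```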
II. Meaning Candidate Ranking:
Candidate Ranking: a ranking model is trained to rank candidate meanings as being the
true meaning of an acronym. In order to generate a training set, distant supervision is used.
Contexts are taken and acronym meanings are replaced with the acronyms themselves. The
task then is to train a model to disambiguate these acronyms back into their meanings.
Results: High F1 scores are achieved on both a manual and the main distantly supervised dataset.
More importantly, similar behaviour is seen across both datasets as various features are used. Finally,
the model is compared to well-known entity linking models and is shown to outperform them.
4 Wednesday, July 18th: Day 3
4.1 What you can cram into a single vector: Probing sentence embeddings for linguistic
properties
Ref: [7]
Motivation: Recent research on learning sentence representations from structured data and human-annotated labels has demonstrated strong generalisation ability and transferability, and most such representations can be directly applied to various downstream tasks. However, it is not clear what information has been encoded in the vector representations. This paper proposes 10 probing tasks that capture simple linguistic features, collected to help us understand a little more about these representations.
Method: As mentioned in the paper, 10 probing tasks are categorised into three classes, including
surface (SentLen and WC), syntactic (BShift, TreeDepth and TopConst), and semantic information
(Tense, SubNum, ObjNum, SOMO and CoordInv).
1. SentLen: predict the length of a given sentence in terms of the number of words;
2. WordContent: tell which of 1,000 pre-picked mid-frequency words a given sentence contains;
3. BigramShift: test whether a sentence encoder is sensitive to legal word order;
4. TreeDepth: check whether an encoder is capable of inferring the hierarchical structure of sentences;
5. TopConstituent: classify sentences in terms of the sequence of top constituents immediately below the sentence (S) node;
6. Tense: predict the tense of the main-clause verb of a given sentence;
7. Subject Number: predict the number of the subject of the main clause;
8. Object Number: predict the number of the direct object of the main clause;
9. Semantic Odd Man Out (SOMO): tell whether a sentence has been modified by replacing a word. An example modified sentence: "No one could see this Hayes and I wanted to know if it was real or a spoonful" (original: ploy);
10. Coordination Inversion (CoordInv): tell whether a sentence is intact or has been modified by inverting the order of its clauses.
The sentences are all extracted from the Toronto BookCorpus [38]. For each task, the training set
contains 100k sentences, and the validation (development) and test sets contain 10k sentences each.
The classes in each task are balanced to prevent models from simply exploiting class-frequency
biases.
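Concretely, each probing task boils down to training a simple classifier on frozen sentence
embeddings and reading off its test accuracy. A minimal sketch of that recipe is below, using a
scikit-learn logistic-regression probe for simplicity (the paper's probes also include an MLP); the
toy random vectors stand in for real 100k/10k embedding splits.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def probe_accuracy(train_emb, train_y, test_emb, test_y):
        # Fit a simple classifier on frozen embeddings and report how well the
        # probed property (e.g. Tense, SentLen bin) can be predicted from them.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_emb, train_y)
        return clf.score(test_emb, test_y)

    rng = np.random.default_rng(0)
    train_emb, test_emb = rng.normal(size=(1000, 512)), rng.normal(size=(100, 512))
    train_y, test_y = rng.integers(0, 2, 1000), rng.integers(0, 2, 100)
    print(probe_accuracy(train_emb, train_y, test_emb, test_y))  # ~0.5 on random data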
Four types of encoder architecture are evaluated:
1. bag-of-words models, with or without TF-IDF weighting
2. a bidirectional LSTM (biLSTM) that uses the hidden state at the last time step as the
representation
3. a biLSTM with global max-pooling over time applied on top of the hidden states
4. a gated convolutional network.
These encoders are trained with several representation-learning objectives:
1. auto-encoding
2. machine translation (English to French, English to German and English to Finnish)
3. (Seq2Tree) decoding the tree structure of an input sentence given its vector representation
4. (Skip-thought) predicting the next sentence given the current one
5. (NLI) natural language inference.
Human performance is also collected for comparison.
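The difference between encoders 2 and 3 above lies only in how the sentence vector is read out of
the biLSTM, which matters for the results discussed below. A rough PyTorch sketch of the two
read-outs (our illustration, not the authors' code):

    import torch
    import torch.nn as nn

    class BiLSTMEncoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, pooling="max"):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.pooling = pooling  # "max": global max-pooling; "last": last hidden state

        def forward(self, token_ids):
            outputs, _ = self.lstm(self.embed(token_ids))  # (batch, time, 2 * hidden_dim)
            if self.pooling == "max":
                return outputs.max(dim=1).values           # max over the time dimension
            return outputs[:, -1, :]                       # hidden state at the last time step

    encoder = BiLSTMEncoder(vocab_size=50000)
    sentences = torch.randint(0, 50000, (8, 20))           # batch of 8 sentences, 20 tokens each
    print(encoder(sentences).shape)                        # torch.Size([8, 1024])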
Results: Overall, as expected, the training objective that requires models to decode the tree structure
of a sentence from its vector representation outperforms all the others. Some of the best results even
surpass human performance; the probing tasks on which models do best include TopConst, Tense,
SubjNum and ObjNum.
There are several interesting observations to be made from the result tables in the paper.
1. The auto-encoder objective is clearly unable to capture word-content information and also
performs poorly on the semantics-related tasks, yet it performs extremely well on SentLen.
The Seq2Tree objective is likewise very bad at capturing word content, but overall it
outperforms the other objectives on the syntactic and semantics-related tasks.
2. The machine translation objectives give reasonable performance on the word-content task,
as does the natural language inference objective, since both require a clear grasp of word
meanings.
3. The Skip-thought objective gives overall mediocre results compared with the other training
objectives, which suggests that unsupervised/self-supervised transfer learning still has a lot
of room to explore.
4. The choice of representation in the biLSTM matters: global max-pooling over time
outperforms taking only the last hidden state on both the syntactic and the semantics-related
tasks, while the latter does better on the surface-information tasks, SentLen and word
content.
5. Word order matters for most of the tasks.
An important caveat about the probing tasks is that good performance of an encoder on them does
not imply good performance on the downstream tasks in SentEval [6]. However, [7] reports a strong
correlation between performance on WordContent and performance on those downstream tasks;
that is to say, preserving word content is important for getting good results on most of the tasks in
SentEval. (More details about the SentEval package can be found at
https://github.com/facebookresearch/SentEval.)
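For reference, the probing tasks are distributed with SentEval, where you plug your own encoder in
through a batcher function. The sketch below shows roughly what that interface looks like; the
parameter values, task names and the placeholder random-vector encoder are our assumptions and
should be checked against the current version of the package.

    import numpy as np
    import senteval

    def prepare(params, samples):
        # build any vocabulary/preprocessing the encoder needs; nothing to do here
        return

    def batcher(params, batch):
        # batch is a list of tokenised sentences; return one embedding per sentence.
        # A real evaluation would call your trained encoder instead of random vectors.
        return np.random.randn(len(batch), 512)

    params = {"task_path": "SentEval/data", "usepytorch": False, "kfold": 5}
    se = senteval.engine.SE(params, batcher, prepare)
    results = se.eval(["Length", "WordContent", "Tense", "SubjNumber"])
    print(results["Tense"])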
References
[1] M. Artetxe, G. Labaka, and E. Agirre. A robust self-learning method for fully unsupervised
cross-lingual mappings of word embeddings. In ACL, 2018.
[2] B. Athiwaratkun, A. G. Wilson, and A. Anandkumar. Probabilistic fasttext for multi-sense
word embeddings. In ACL, 2018.
[3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword
information. TACL, 5:135–146, 2017.
[4] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio. Generating
sentences from a continuous space. In CoNLL, 2016.
[5] L. Chen, F. Yuan, J. M. Jose, and W. Zhang. Improving negative sampling for word represen-
tation using self-embedded features. In WSDM, 2018.
[6] A. Conneau and D. Kiela. Senteval: An evaluation toolkit for universal sentence representa-
tions. In LREC, 2018.
[7] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni. What you can cram into
a single vector: Probing sentence embeddings for linguistic properties. In ACL, 2018.
[8] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without
parallel data. In ICLR, 2018.
[9] A. Frank and T. Mihaylov. Knowledgeable reader: Enhancing cloze-style reading comprehen-
sion with external commonsense knowledge. In ACL, 2018.
[10] M. Gardner and C. Clark. Simple and effective multi-paragraph reading comprehension. In
ACL, 2018.
[11] M. Ghazvininejad, C. Brockett, M.-W. Chang, W. B. Dolan, J. Gao, W.-t. Yih, and M. Galley.
A knowledge-grounded neural conversation model. In AAAI, 2018.
[12] G. Glavaš and I. Vulić. Explicit retrofitting of distributional word vectors. In ACL, 2018.
[13] K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang. Generating sentences by editing prototypes.
CoRR, abs/1709.08878, 2017.
[14] X. He, X. Xin, F. Yuan, and J. M. Jose. Batch is not heavy: Learning word representations
from all samples. In ACL, 2018.
[15] S. Jameel, Z. Bouraoui, and S. Schockaert. Unsupervised learning of distributional relation
vectors. In ACL, 2018.
[16] R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. Text understanding with the attention
sum reader network. arXiv preprint arXiv:1603.01547, 2016.
[17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
[18] X. Li, Y.-N. Chen, L. Li, J. Gao, and A. Celikyilmaz. End-to-end task-completion neural
dialogue systems. arXiv preprint arXiv:1703.01008, 2017.
[19] Y. Li, F. Tao, A. Fuxman, and B. Zhao. Guess me if you can: Acronym disambiguation for
enterprises. In ACL, 2018.
[20] X. Liu, Y. Shen, K. Duh, and J. Gao. Stochastic answer networks for machine reading com-
prehension. In ACL, 2018.
[21] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781, 2013.
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of
words and phrases and their compositionality. In NIPS, 2013.
[23] N. Mrkšić, D. Ó Séaghdha, T.-H. Wen, B. Thomson, and S. Young. Neural belief tracker:
Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777, 2016.
[24] B. Peng, X. Li, J. Gao, J. Liu, and K.-F. Wong. Deep dyna-q: Integrating planning for task-
completion dialogue policy learning. In Proceedings of the 56th Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2182–2192,
2018.
[25] B. Peng, X. Li, L. Li, J. Gao, A. Celikyilmaz, S. Lee, and K.-F. Wong. Composite task-
completion dialogue policy learning via hierarchical deep reinforcement learning. arXiv
preprint arXiv:1704.03084, 2017.
[26] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation.
In EMNLP, 2014.
[27] A. Salle, M. Idiart, and A. Villavicencio. Enhancing the lexvec distributed word representation
model using positional contexts and external memory. CoRR, abs/1606.01283, 2016.
[28] Y. Shen, P.-S. Huang, M.-W. Chang, and J. Gao. Link prediction using embedded knowledge
graphs. 2016.
[29] V. Shwartz and I. Dagan. Paraphrase to explicate: Revealing implicit noun-compound rela-
tions. In ACL, 2018.
[30] S. Singh, C. Guestrin, and M. T. Ribeiro. Semantically equivalent adversarial rules for debug-
ging nlp models. In ACL, 2018.
[31] A. Søgaard, I. Vulić, and S. Ruder. On the limitations of unsupervised bilingual dictionary
induction. In ACL, 2018.
[32] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in
neural information processing systems, pages 2440–2448, 2015.
[33] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In
NIPS, 2017.
[34] J. D. Williams, K. Asadi, and G. Zweig. Hybrid code networks: practical and effi-
cient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint
arXiv:1702.03274, 2017.
[35] B. Yang, I. Labutov, A. Prakash, and A. Azaria. Multi-relational question answering from
narratives: Machine reading and reasoning in simulated worlds. In ACL, 2018.
[36] M. Zhang, Y. Liu, H. Luan, and M. Sun. Adversarial training for unsupervised bilingual lexicon
induction. In ACL, 2017.
[37] T. Zhao, A. Lu, K. Lee, and M. Eskénazi. Generative encoder-decoder models for task-oriented
spoken dialog systems with chatting capability. In SIGDIAL Conference, 2017.
[38] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Align-
ing books and movies: Towards story-like visual explanations by watching movies and reading
books. ICCV, pages 19–27, 2015.