Distributed Representations of Words and Phrases and their Compositionality (Mikolov 2013)
Efficient Estimation of Word Representations in Vector Space (Mikolov 2013)
Natural Language Processing (almost) from Scratch (Collobert and Weston 2011)
Exploiting Similarities among Languages for Machine Translation (Mikolov 2013)
Zero-Shot Learning Through Cross-Modal Transfer (Socher 2013)
$x_{i,j} = TFIDF(t_i, d_j)$
$TFIDF(t_i, d_j) = TF(t_i, d_j) ⋅ IDF(t_i, d_j)$
$TF(t_i, d_j) = \frac{freq(t_i, d_j)}{|d_j|}$
$IDF(t_i, d_j) = log(\frac{|D|}{|\{d_k : t_i ∈ d_k\}|})$
Doc1 = "He is a lazy boy. She is also lazy."
Doc2 = "John is a lazy person."
Raw frequencies
| Doc1 | Doc2 | |
| a | 1 | 1 |
| also | 1 | 0 |
| boy | 1 | 0 |
| he | 1 | 0 |
| is | 2 | 1 |
| john | 0 | 1 |
| lazy | 2 | 1 |
| person | 0 | 1 |
| she | 1 | 0 |
TF-IDF
| Doc1 | Doc2 | |
| a | 0 | 0 |
| also | 0.077 | 0 |
| boy | 0.077 | 0 |
| he | 0.077 | 0 |
| is | 0 | 0 |
| john | 0 | 0.138 |
| lazy | 0 | 0 |
| person | 0 | 0.138 |
| she | 0.077 | 0 |
Given vocabulary $V$,
approximate a conditional distribution over words:
$$ P(W_o | W_i), W_i \in V, W_o \in V$$
$$ P(W_{i+1} | {W_{i-n}, ..., W_{i-1}, W_i}), W_j \in V, j \in \mathbb{N} $$
The cat is walking in the bedroom
A dog was running in a room
|
|
Output probabilities are modeled using the softmax function
$x = [x_1, x_2, ..., x_n]$
$$softmax(x_i) = \frac{e^{x_i}}{\sum_{j}{e^{x_j}}}$$
Cross-entropy is used as loss
$E(y,t) = - \sum_{m=1}^{V}{t_m \cdot log(y_m)} = - log(y_o)$
Given an input word $W_k$ and an output word $W_j$ in the context of $W_k$,
let $x$ be the one-hot representation for $W_k$
$x = [x_1, x_2, ..., x_k, ...,x_V], x_k = 1, x_n = 0, n \neq k$
$t = [t_1, t_2, ..., t_j, ...,t_V]$
$t_j = 1$
$t_n = 0, n \neq j$
activation $h = x^TW = W_{(k, :)} := v_{w_i}$
score $u_j = {v'}_{w_j}^{T} ⋅ h$, where ${v'}_{w_j}^{T} = {W'}_{(:,j)}$
$P(w_j | w_i) = y_j = \frac{exp(u_j)}{\sum_{j'=1}^{V}{exp(u_{j'})}}$
$E(y,t) = - \sum_{m=1}^{V}{t_m \cdot log(y_m)} = - log(y_o)$
$\frac{∂E}{∂{w'}_{ij}} = \frac{∂E}{∂u_j} ⋅ \frac{∂u_j}{∂{w'}_{ij}} = (y_j - t_j) ⋅ h_i$
${v'}_{w_{ij}}^{new} ← {v'}_{w_{ij}}^{old} - η ⋅ (y_j - t_j) ⋅ h_i$
$\frac{∂E}{∂w_{ki}} = \sum_{j=1}^{V}{(\frac{∂E}{∂u_j} ⋅ \frac{∂u_j}{∂h_i})} ⋅ \frac{∂h_i}{∂w_{ki}} = \\ = \sum_{j=1}^{V}{((y_j - t_j) ⋅ w'_{ij})} ⋅ x_k$
$v^{(new)}_{ki} ← v^{(old)}_{ki} - η ⋅ \sum_{j=1}^{V}{((y_j - t_j) ⋅ w'_{ij})} ⋅ x_k$
A Neural Probabilistic Language Model (Bengio 2003)
$ ./word2vec
-train data.txt
-output vectors.txt
-size 200
-window 5
-cbow 1
-hs 0
-negative 5
-sample 1e-4
-alpha 0.0025
-min-count 10
In the paper: $P(w_i) = 1 - \sqrt{\frac{sample}{f(w_i)}}$
In the code: $P(w_i) = 1 - \big({\sqrt{\frac{sample \cdot |V|}{f(w_i)}} + \frac{sample \cdot |V|}{f(w_i)}}\big)$
$ ./word2phrase -train text.txt -output phrases.txt -threshold 100 -min-count 10
Join two adjacent words with an underscore if their joint score is higher than threshold:
$score(w_i, w_j) = \frac{(count(w_i, w_j) - min\_count) \times |V|}{count(w_i) \times count(w_j)}$
Distributed Representations of Sentences and Documents (Le and Mikolov 2014)
The cat is walking in the bedroom
$\{the, cat, is, walking, wal, walk, walki, alk, alki, alkin, ..., \\ in, the, bedroom, bed, bedr, bedro, edro, edroo, ...\}$
$ ./fasttext
skipgram
-input text8
-output text8_model
-epoch 1
-lr 0.5
-minn 2
-maxn 5
-dim 300
$ ./fasttext nn text8_model.bin
Query word? gearshift gearing 0.790762 flywheels 0.779804 flywheel 0.777859 gears 0.776133 driveshafts 0.756345 driveshaft 0.755679 daisywheel 0.749998 wheelsets 0.748578 epicycles 0.744268 gearboxes 0.73986
__label__sauce __label__cheese how much does potato starch affect a cheese sauce recipe __label__food-safety __label__acidity dangerous pathogens capable of growing in acidic environments __label__cast-iron __label__stove how do i cover up the white spots on my cast iron stove __label__restaurant michelin three star restaurant; but if the chef is not there __label__knife-skills __label__dicing without knife skills how can i quickly and accurately dice vegetables __label__storage-method __label__equipment __label__bread whats the purpose of a bread box __label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home __label__chocolate american equivalent for british chocolate terms __label__baking __label__oven __label__convection fan bake vs bake __label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise regulation and balancing of readymade packed mayonnaise and other sauces
$ ./fasttext
supervised
-input cooking.train
-output model_cooking
-lr 1.0
-epoch 5
-wordNgrams 2
$ ./fasttext predict-prob model_cooking.bin - 5
Query: Why not put knives in the dishwasher?
__label__knives 0.335938 __label__equipment 0.0664063 __label__food-safety 0.0664063 __label__cleaning 0.0605469 __label__storage-method 0.0332031