Skip to content

IshwarKulkarni/WordEmbedding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A simple word embedding application.

What's so special about this: Well, nothing other than this is written in pure C++, no external libraries Written mostly as an exercise and learn word embedding in detail.

Reads documents into a 'coprus', that is then fed to a 'Embedding Model' This model can be trained any number of times and 'serialized' to be loaded later.

Corpus:

Uses comma separated --sources argument to load up the words and setnece structure.
Also allows for stop words and other ignored words to instantiated. Can be serialized and has a 
non-deterministic behaviour. 

Model Class:

Has two subclasses:
    CBoW : Continuos Bag of words, given multiple context words predict a target word. 
    SkipGram: A single root word predicts a set of 'context words'
Both use negative sampling technique with about 20 negative sample per positive sample to 
improve perf. 

Embedding Matrix:

The "Matrix", is actually implemented as simply an orderer collection of Word Vector 
(or embeddings) that is interfaced using a 'Span' structure (as the math here only needs 
simple dot products between rows on the matrix. 

Evaluators:

Can evaluate a model based on 'closes-ness' of synonym words, 'far-ness' of antonym words and 
'meh-ness' of random words. Printing these for each training epoch gives a good view of the 
process

Misc Notes:

Uses GCC and C++17 std. using the micro arch option to GCC, we can use AVX2 on modern Intel 
CPU's. Datasets inclued one-corpus of Tolstoy's 'War and Peace', Multi-style document of POTUS 
speeches and finally the Reuters dataset.   

TODO:

1. AVX2 vectorization. - Completed with '-march=native' cmake option.
2. Get better datasets. - Added Retuers dataset.
3. Better 'sampleWord()' in corpus - using Freq based sampling/
4. Adding evaluation scripts - EmbeddingEvaluator.
5. CUDA!! -- May be not. 

About

A pure C++ Implementation of word Embedding

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors