Ported to the latest gensim, LINE model dimensionality fixed, output format extended #9
Conversation
GTmac
left a comment
LGTM. Thanks for the effort!
GTmac
left a comment
This seems to be your clone of DeepWalk -- could you change this to the official DeepWalk repo? Thanks!
GTmac
left a comment
Thanks for adding this! I actually feel it would be better to set the default number of workers to a smaller value (for example, gensim word2vec uses 3: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L660). Sometimes it is not desirable to use up all the CPUs, especially when you are running a model on a shared server. Let me know your thoughts :-)
GTmac
left a comment
LGTM. Thanks for adding this!
src/baseline.py
@@ -1,7 +1,10 @@
from gensim.models import Word2Vec
from gensim.models.word2vec import Word2Vec
Nit: could you remove that extra space? Thanks!
GTmac
left a comment
LGTM. Thanks!
GTmac
left a comment
LGTM. Thanks!
GTmac
left a comment
This commit is touching the model part, so have you tested this change in terms of classification performance? Do you still get similar classification F1 score compared to the numbers in the paper / README file? Thanks!
Up to you. In all other functions I set the default number of workers to at least half of the available logical CPUs: int(cpu_num() / 2) + 1. A hardcoded value is undesirable: the host might have only 1 or 2 cores, in which case a hardcoded value of 3 could hurt performance, unlike a value derived from cpu_num.
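For reference, a minimal sketch of the default described above. It assumes that cpu_num reports the number of logical CPUs; `os.cpu_count()` stands in for it here, and the helper name `default_workers` is hypothetical rather than taken from the PR.

```python
import os


def default_workers(cpu_total=None):
    """Return int(cpu_total / 2) + 1, i.e. at least half of the logical CPUs.

    cpu_total is a hypothetical parameter added for illustration; when it is
    omitted, os.cpu_count() is used as a stand-in for cpu_num().
    """
    if cpu_total is None:
        # os.cpu_count() may return None if the count cannot be determined.
        cpu_total = os.cpu_count() or 1
    return int(cpu_total / 2) + 1


if __name__ == "__main__":
    # Show how the default scales with the host size.
    for n in (1, 2, 4, 16):
        print(n, "logical CPUs ->", default_workers(n), "workers")
```

On a 1-core or 2-core host this yields 1 or 2 workers, which is the case where a fixed value of 3 would oversubscribe the machine; on a 16-core shared server it yields 9, which is the case the reviewer is concerned about.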
I have not changed any model parameters or internals beyond adapting to the updated gensim API. I trained and evaluated the HARP + DeepWalk / LINE embeddings on my own datasets, and the results look fine.
In the official DeepWalk, walk persistence expects text items, not numbers (I have opened a pull request against the official repository). I'm not sure whether the need for numeric values arose from bugs that appeared and were fixed while porting HARP to the updated DeepWalk and gensim, or whether HARP genuinely requires numeric walk items from DeepWalk. Either way, HARP works fine with the extended version of DeepWalk in my repository (which accepts numeric walk items), but I have not tested whether it works with the official repository without that extension.
I just verified: the latest official DeepWalk lacks the support for numeric walk items that HARP needs, so the specified repository should be used until the DeepWalk pull request is merged.
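To illustrate the kind of mismatch discussed above (this is not DeepWalk's actual code), here is a small sketch of writing walks to a text stream. It assumes walks are lists of node IDs written one walk per line, space-separated; the function name `write_walks` is hypothetical. Joining numeric items directly fails because `str.join` accepts only strings, so each item is converted first.

```python
import io


def write_walks(walks, fout):
    """Write one walk per line, items separated by single spaces.

    str() is applied to every item so that both text and numeric
    node IDs can be persisted.
    """
    for walk in walks:
        fout.write(" ".join(str(v) for v in walk) + "\n")


if __name__ == "__main__":
    numeric_walks = [[1, 2, 3], [4, 5]]

    # Without the str() conversion, joining integers raises TypeError.
    try:
        " ".join(numeric_walks[0])
    except TypeError as err:
        print("plain join fails:", err)

    buf = io.StringIO()
    write_walks(numeric_walks, buf)
    print(buf.getvalue(), end="")
```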
The number of workers is set to 1 by default and to
Fix for #8 (ported to the latest gensim), output extended with the .mat format