[Feat] Copy-free save and load for cuckoo hashtable #243
Lifann wants to merge 7 commits into tensorflow:master
Conversation

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). For more information, open the CLA check for this pull request.
Force-pushed from 3fdfd57 to 1d4fe07.
```cpp
Status Save(OpKernelContext* ctx, const string filepath,
            const size_t buffer_size) {
  int64 value_dim = value_shape_.dim_size(0);
```

Review comment: `tensorflow::int64` is returned from `dim_size` and also used in `table_->save()`.
```cpp
Status Load(OpKernelContext* ctx, const string filepath,
            const size_t buffer_size) {
  int64 value_dim = value_shape_.dim_size(0);
  return table_->export_values(ctx, value_dim);
}
```
```cpp
Status Save(OpKernelContext* ctx, const string filepath,
            const size_t buffer_size) {
  int64 value_dim = value_shape_.dim_size(0);
  return table_->save(ctx, value_dim, filepath, buffer_size);
}
```

Review comment: `SaveToFile` might be better for possible extension in the future.
```cpp
Status Load(OpKernelContext* ctx, const string filepath,
            const size_t buffer_size) {
  int64 value_dim = value_shape_.dim_size(0);
  return table_->load(ctx, value_dim, filepath, buffer_size);
}
```

Review comment: `LoadFromFile` might be better for possible extension in the future.
```cpp
using ValueType = ValueArrayBase<T>;

template <typename T>
class HostFileBuffer {
```

Review comment: This looks like code repeated from `HostFileBuffer` in lookup_table_op_cpu.h. If so, I recommend moving it to tensorflow_recommenders_addons/dynamic_embedding/core/utils/host_file_buffer.h.
tensorflow_recommenders_addons/dynamic_embedding/core/ops/cuckoo_hashtable_ops.cc
```python
    self.resource_handle, self._key_dtype, self._value_dtype)
return keys, values

def save(self, filepath, buffer_size=4194304, name=None):
```

Review comment: `save_to_file` would be better, because `save_to_hdfs` is possible in the future.
```python
    value_dtype=self._value_dtype,
    buffer_size=buffer_size)

def load(self, filepath, buffer_size=4194304, name=None):
```

Review comment: `load_from_file` or `load_from_localfile` would be better.
This feature is very useful. Looking forward to it. May I ask how to use it? In our case, we use Estimator and save checkpoints per epoch, so should we customize a saver to save the tables manually? But if so, how can we get all the tables to save? Currently it only supports usage like …
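One way the question above could be answered is with a table registry that a custom saver walks at checkpoint time. The sketch below is purely illustrative: `DummyTable` and `save_all_tables` are hypothetical names standing in for the PR's cuckoo hashtable and its `save(filepath, buffer_size)` API; the real op streams to `.keys`/`.values` files in chunks.

```python
# Hypothetical sketch: register every dynamic embedding table on
# creation, then save each one at a checkpoint boundary.
import os
import tempfile

class DummyTable:
    """Stand-in for the PR's hashtable; save() mirrors its file layout."""
    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)  # key -> value

    def save(self, filepath, buffer_size=4194304):
        # The real op streams in buffer_size chunks; here we just dump.
        with open(filepath + ".keys", "w") as kf, \
             open(filepath + ".values", "w") as vf:
            for k, v in self.data.items():
                kf.write(f"{k}\n")
                vf.write(f"{v}\n")

_TABLE_REGISTRY = []

def register(table):
    _TABLE_REGISTRY.append(table)
    return table

def save_all_tables(ckpt_dir):
    # One file pair per table, named after the table.
    for t in _TABLE_REGISTRY:
        t.save(os.path.join(ckpt_dir, t.name))

t1 = register(DummyTable("emb_user", {1: 0.5, 2: 0.25}))
t2 = register(DummyTable("emb_item", {7: 0.125}))
ckpt_dir = tempfile.mkdtemp()
save_all_tables(ckpt_dir)
saved = sorted(os.listdir(ckpt_dir))
```

With an Estimator, such a registry could be walked from a `CheckpointSaverListener`; the registry itself is an assumption here, not something the PR provides.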
```cpp
size_t new_max_size = max_size_;
size_t capacity = table_->get_capacity();

size_t cur_size = table_->get_size(stream);
```

Review comment: From our profiling results, this kernel is expensive. I suggest caching the last size and checking whether the table needs to expand given the number of keys expected to be added. If not, we could add the expected count to the cached size and just return. This way we reduce the number of calls to get_size.

Reply: Caching the size introduces a lag between the recorded and the real-time size. Save and load are usually low-frequency operations, which may leave a large gap between the recorded and real-time size.
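The reviewer's caching idea can be sketched as follows. All names here are illustrative, not from the PR: the point is that the expensive `get_size` query only runs when the cached estimate says an expansion might be needed.

```python
# Sketch of the cached-size optimization under discussion.
class CachedSizeTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = {}
        self._cached_size = 0
        self.get_size_calls = 0  # instrumentation for the example

    def _get_size(self):
        # Stands in for the expensive device-side size query.
        self.get_size_calls += 1
        return len(self._items)

    def maybe_expand(self, expected_new_keys, load_factor=0.75):
        threshold = self.capacity * load_factor
        if self._cached_size + expected_new_keys > threshold:
            # Slow path: refresh the real size only when the cached
            # estimate says the table might need to grow.
            self._cached_size = self._get_size()
            if self._cached_size + expected_new_keys > threshold:
                self.capacity *= 2
        self._cached_size += expected_new_keys

    def insert(self, keys):
        keys = list(keys)
        self.maybe_expand(len(keys))
        for k in keys:
            self._items[k] = True

t = CachedSizeTable(capacity=100)
for batch in range(8):
    t.insert(range(batch * 10, batch * 10 + 10))
```

Eight batches of 10 keys trigger only one real size query (at the batch that crosses the 75% load factor), illustrating the saving; the reply's caveat is the drift of `_cached_size` between refreshes.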
```cpp
~HostFileBuffer() { Close(); }

void Put(const T value) {
```

Review comment: May I ask whether this works if T is tstring? I remember that sometimes the key is tstring.

Reply: I haven't used a string key type. I think it can be solved by abstracting the host buffer class and using template specialization.
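To make the tstring question concrete: fixed-width packing works for numeric keys, but string keys need a variable-length encoding, which is what a specialized buffer class would have to emit. One common option (illustrative only, not the PR's format) is length-prefixed records:

```python
# Hedged sketch: length-prefixed serialization for string keys, the
# kind of thing a template specialization for tstring could produce.
import io
import struct

def put_string(sink, s):
    data = s.encode("utf-8")
    sink.write(struct.pack("<I", len(data)))  # 4-byte length prefix
    sink.write(data)

def read_strings(source):
    out = []
    while True:
        head = source.read(4)
        if not head:
            break
        (n,) = struct.unpack("<I", head)
        out.append(source.read(n).decode("utf-8"))
    return out

sink = io.BytesIO()
for key in ["user:42", "item:7", "ctx"]:
    put_string(sink, key)
sink.seek(0)
```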
```cpp
size_t key_buffer_size = buffer_size;
string key_tmpfile = filepath + ".keys.tmp";
string key_file = filepath + ".keys";
auto key_buffer = HostFileBuffer<K>(ctx, key_tmpfile, key_buffer_size,
```

Review comment: Would it be better to abstract the file out of the HostFileBuffer class? We depend heavily on HDFS and would like to add HDFS support based on your work. Thanks.
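The abstraction the reviewer asks for amounts to injecting the output stream instead of having the buffer open a local path itself. A minimal sketch, with hypothetical names, of a buffer that writes through any file-like sink (a local file today, an HDFS client's stream later):

```python
# Sketch: decouple buffering from file I/O by injecting the sink.
import io
import struct

class HostBuffer:
    def __init__(self, sink, buffer_size):
        self.sink = sink              # any object with .write(bytes)
        self.buffer_size = buffer_size
        self._buf = bytearray()

    def put(self, value):
        # Accumulate int64 values; flush once the buffer is full.
        self._buf += struct.pack("<q", value)
        if len(self._buf) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self._buf:
            self.sink.write(bytes(self._buf))
            self._buf.clear()

sink = io.BytesIO()  # could equally be an HDFS writer's stream
buf = HostBuffer(sink, buffer_size=16)
for v in range(5):
    buf.put(v)
buf.flush()
```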
```cpp
size_t total_keys = 0;
size_t total_values = 0;
while (nkeys > 0) {
  nkeys = key_buffer.Fill();
```

Review comment: Is it possible that nkeys is 0?

Reply: If nkeys is 0, then it leaves an empty file.
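The loop and the empty-table edge case can be sketched like this (illustrative Python, not the PR's code): `fill` pulls up to `buffer_size` keys per call and returns how many it got, and the loop stops at 0. An empty table returns 0 on the very first call and leaves an empty output behind, matching the reply.

```python
# Sketch of the export loop: repeatedly Fill() until nothing remains.
def export_keys(table_keys, buffer_size):
    it = iter(table_keys)
    out = []

    def fill():
        chunk = []
        for _ in range(buffer_size):
            try:
                chunk.append(next(it))
            except StopIteration:
                break
        out.extend(chunk)
        return len(chunk)

    total = 0
    nkeys = buffer_size  # prime the loop, as in `while (nkeys > 0)`
    while nkeys > 0:
        nkeys = fill()
        total += nkeys
    return out, total

keys, total = export_keys(range(10), buffer_size=4)
empty, empty_total = export_keys([], buffer_size=4)
```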
Force-pushed from dfc5667 to cf6c24f.
- Also fix the horovod compile failure on macOS (they were caused by the same reason)
…t reference with an error address under high concurrency and server disconnection.
```cpp
} catch (const std::exception &err) {
  LOG(ERROR) << "RedisHandler error in PipeExecRead for slices "
             << hkey.data() << " -- " << err.what();
  error_ptr = std::current_exception();
```

This is a good solution; we want to solve this problem too. When can it be merged?
Description

Brief description of the PR:
Dynamic embedding tables can be extremely large relative to memory limits, so saving and loading them through the traditional TensorFlow checkpoint mechanism consumes a lot of memory. This PR provides a method to save and load files for dynamic embedding tables without a full-volume copy.
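The idea behind the copy-free approach can be sketched as follows (an assumed illustration, not the PR's code): instead of materializing the whole table in host memory before writing, stream key/value pairs in fixed-size chunks, so peak extra memory is bounded by the buffer size rather than the table size.

```python
# Minimal sketch of chunked, copy-free save and load.
import io
import itertools
import struct

def streamed_save(items, sink, buffer_size=4):
    # Pull at most buffer_size records at a time from the table iterator.
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, buffer_size))
        if not chunk:
            break
        for k, v in chunk:
            sink.write(struct.pack("<qd", k, v))  # int64 key, float64 value

def streamed_load(source):
    rec = struct.calcsize("<qd")
    table = {}
    while True:
        data = source.read(rec)
        if not data:
            break
        k, v = struct.unpack("<qd", data)
        table[k] = v
    return table

sink = io.BytesIO()
streamed_save(((i, i * 0.5) for i in range(10)), sink, buffer_size=4)
sink.seek(0)
restored = streamed_load(sink)
```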
Type of change
Checklist:
How Has This Been Tested?
Yes