@JulianPinzaru

Fixes #565.

Proposed changes

Use Python's standard-library fcntl module to take an exclusive lock while writing data to a file. This prevents multiple processes from writing to the same file simultaneously and overwriting each other's data.

return written_characters


def append_to_file(filepath=None, data=None, mode='a', **kwargs):
Contributor

Since this function strictly appends, it might be good to take out the mode kwarg and just pass mode='a' to util.write_to_file.
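
One way to read this suggestion, as a sketch only: append_to_file drops its own mode parameter and delegates to write_to_file with mode='a'. The body of write_to_file is copied from the diff under review, with one flush() call added here as an assumption so that buffered data reaches the file before the lock is released. The bare write_to_file call stands in for util.write_to_file.

```python
import fcntl


def write_to_file(filepath=None, data=None, mode='w', **kwargs) -> int:
    '''Write data under an exclusive flock, as in the diff under review.'''
    assert mode == 'w' or mode == 'a'

    with open(filepath, mode, **kwargs) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        written_characters = f.write(data)
        f.flush()  # assumption: not in the PR; flush before unlocking
        fcntl.flock(f, fcntl.LOCK_UN)
    return written_characters


def append_to_file(filepath=None, data=None, **kwargs) -> int:
    '''Append-only wrapper: the mode is fixed, so callers cannot override it.'''
    return write_to_file(filepath, data, mode='a', **kwargs)
```

With this shape, a caller that still passes mode=... to append_to_file gets a TypeError instead of silently switching to overwrite mode.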

Comment on lines +14 to +25
def write_to_file(filepath=None, data=None, mode='w', **kwargs) -> int:
    '''
    Concurrency safe function for writing data to the file.
    Param mode: file open mode ('a' for appending, 'w' for writing)
    '''
    assert mode == 'w' or mode == 'a'

    with open(filepath, mode, **kwargs) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        written_characters = f.write(data)
        fcntl.flock(f, fcntl.LOCK_UN)
    return written_characters
Contributor

This question comes from my lack of knowledge:
If another process tries to access the file while it is locked, does that raise an IOError, or does the process automatically wait for the lock to be released?
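
For reference, and as far as the standard library documents it: a plain fcntl.flock(f, fcntl.LOCK_EX) call waits until the lock is free, and it only raises OSError when fcntl.LOCK_NB is OR-ed in and the lock is already held. The sketch below illustrates the non-blocking case; try_lock is a hypothetical helper, not part of this PR.

```python
import fcntl


def try_lock(filepath):
    '''Hypothetical helper: True if an exclusive lock can be taken right now.'''
    with open(filepath, 'a') as f:
        try:
            # LOCK_NB turns "wait for the lock" into "fail immediately".
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        fcntl.flock(f, fcntl.LOCK_UN)
        return True
```

flock locks are advisory, so a process that opens the file without calling flock is neither blocked nor given an error.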

    with open(filepath, mode, **kwargs) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        written_characters = f.write(data)
        fcntl.flock(f, fcntl.LOCK_UN)
Contributor

It may be worth explicitly closing the file.

@bmblumenfeld
Contributor

@JulianPinzaru awesome work! I left a couple comments. Should be good to go after addressing those! Let me know if you have any questions!

@youngj
Contributor

youngj commented Feb 27, 2020

I did some tests with this approach. Although it ensures that one process finishes writing before the other process starts, the output file could still contain data from both processes if the first process writes more data than the second process.

Here's an example script to show how this can happen:

foo.py:

import argparse
import time
import fcntl

def write_to_file(filepath=None, data=None, mode='w', **kwargs) -> int:
    with open(filepath, mode, **kwargs) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        for i in range(1, 1000):
            f.write(data)
        time.sleep(3)
        for i in range(1, 1000):
            f.write(data)
        fcntl.flock(f, fcntl.LOCK_UN)

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('--foo', required=True, help='foo')
    args = parser.parse_args()

    write_to_file('foo.txt', args.foo + '\n')

Test two processes writing at once; the first process that opens the file writes more data than the second:
(python foo.py --foo=aaaaaaaaaaaaaaaaaaaaaa &); sleep 1; python foo.py --foo=b

The output file contains a bunch of b's followed by a bunch of \0 characters, followed by a bunch of a's.
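
A likely explanation for that layout, offered as an assumption rather than something stated in the thread: open() in 'w' mode truncates the file at open time, before flock is ever called, while the first writer keeps its old file offset. The sketch below replays the same timeline with two file handles in a single process. The locks are left out because the ordering is fixed by hand, and reproduce is a hypothetical name.

```python
def reproduce(filepath):
    '''Replay the two-writer timeline with two handles in one process.'''
    first = open(filepath, 'w')    # first writer opens (and truncates)
    first.write('a' * 10)
    first.flush()                  # first writer's offset is now 10

    second = open(filepath, 'w')   # second writer opens: file shrinks to 0 bytes

    first.write('a' * 10)          # lands at offset 10, past the new end of file
    first.close()

    second.write('bbb')            # lands at offset 0
    second.close()

    with open(filepath, 'rb') as f:
        return f.read()
```

On a POSIX filesystem the gap between the short second write and the first writer's offset reads back as zero bytes, which matches the b's, \0 characters, a's pattern described above.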

Instead of using flock, I think the best approach is to write a temporary file with a unique name in the same directory, then rename it to the desired filename when it is complete. With this approach, flock isn't necessary because each writer has a unique filename to write to, and the filesystem will ensure that the rename is atomic.

This approach also ensures that read_from_file will always see a consistent version of the file without needing to acquire a lock even if it is being written to concurrently by another process.
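
A minimal sketch of the temp-file-plus-rename idea, assuming a POSIX filesystem and overwrite semantics only (it does not cover mode='a'). write_to_file_atomic is a hypothetical name, and the use of tempfile.mkstemp and os.replace is one possible choice rather than something specified in this thread.

```python
import os
import tempfile


def write_to_file_atomic(filepath, data) -> int:
    '''Hypothetical overwrite-only writer using a temp file and a rename.'''
    directory = os.path.dirname(os.path.abspath(filepath))
    # Unique name in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            written_characters = f.write(data)
        # os.replace overwrites the destination; on POSIX this is an atomic rename.
        os.replace(tmp_path, filepath)
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return written_characters
```

Readers then see either the old file or the new one, never a partial write. One caveat with this particular sketch: mkstemp creates the file readable and writable only by the owner, so the resulting file's permissions may differ from those of a file created with a plain open().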

@hathix hathix self-requested a review June 5, 2020 12:37
Member

@hathix hathix left a comment

Please review feedback from this thread.

Development

Successfully merging this pull request may close these issues.

Avoid concurrency errors when saving cache files

5 participants