Ebnt 384 dvc #123

Open · mike0sv wants to merge 22 commits into dev from EBNT-384-dvc
Changes from all commits (22 commits, all by mike0sv):

- 96489b0 EBNT-407 dataset and metric deletion
- a890e1e EBNT-407 small fixes
- cc3400d EBNT-404 evaluation results
- 8b1a7e1 EBNT-404 tests
- 69b6462 EBNT-404 add fields to sql model
- add294b EBNT-404 add fields to sql model
- 6ad878b Merge branch 'EBNT-259-datasets' into EBNT-404-evaluation-results
- 58794b0 EBNT-384 dvc dataset source
- 0e41308 Merge branch 'EBNT-259-datasets' into EBNT-384-dvc
- ca6e05c fix tests
- f73d576 EBNT-384 fix tests
- 8bdee26 EBNT-384 windows tests
- a70207a EBNT-384 windows tests
- f91f41c EBNT-384 windows tests
- 6cd09da EBNT-384 FUUUUUU
- 332947c EBNT-384 add dvc ext
- e95e6f5 EBNT-384 no dvc
- a911882 EBNT-384 no dvc
- 2381379 EBNT-384 add dvc install
- e456464 EBNT-384 add dvc import
- 9f4e6b5 EBNT-384 no color
- ada049c EBNT-384 local imports
New file (`@@ -0,0 +1,3 @@`):

```python
from .dataset_source import DvcBlob, create_dvc_source

__all__ = ['DvcBlob', 'create_dvc_source']
```
New file (`@@ -0,0 +1,33 @@`):

```python
import contextlib

import dvc.api
from dvc.repo import Repo

from ebonite.core.objects.artifacts import Blob, Blobs, StreamContextManager
from ebonite.core.objects.dataset_source import DatasetSource
from ebonite.repository.dataset.artifact import ArtifactDatasetSource, DatasetReader


class DvcBlob(Blob):
    def __init__(self, path: str, repo: str = None, rev: str = None, remote: str = None, mode: str = 'r',
                 encoding: str = None):
        self.path = path
        self.repo = repo
        self.rev = rev
        self.remote = remote
        self.mode = mode
        self.encoding = encoding

    def materialize(self, path):
        Repo.get(self.remote, self.path, out=path, rev=self.rev)  # TODO tests

    @contextlib.contextmanager
    def bytestream(self) -> StreamContextManager:
        with dvc.api.open(self.path, self.repo, self.rev, self.remote, self.mode, self.encoding) as f:
            yield f


def create_dvc_source(path: str, reader: DatasetReader, repo, rev: str = None, remote: str = None, mode: str = 'r',
                      encoding: str = None) -> DatasetSource:
    artifacts = Blobs.from_blobs({path: DvcBlob(path, repo, rev, remote, mode, encoding)})
    return ArtifactDatasetSource(reader, artifacts)
```
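`DvcBlob` above implements ebonite's `Blob` interface: `materialize` pulls the tracked file to a local path (via `Repo.get`) and `bytestream` yields a readable file object (via `dvc.api.open`). A rough, dependency-free sketch of that contract, assuming only the two method names from the diff (the `InMemoryBlob` class itself is hypothetical, not part of ebonite):

```python
import contextlib
import io


class InMemoryBlob:
    """Toy stand-in for a Blob: holds bytes instead of pulling them from DVC."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def materialize(self, path: str):
        # DvcBlob does this with Repo.get; here we just write the bytes out
        with open(path, 'wb') as f:
            f.write(self.payload)

    @contextlib.contextmanager
    def bytestream(self):
        # DvcBlob does this with dvc.api.open; here we wrap an in-memory buffer
        yield io.BytesIO(self.payload)


blob = InMemoryBlob(b'col1,col2\n123,asdf\n')
with blob.bytestream() as stream:
    header = stream.readline()
```

A reader built on top of such a blob only ever touches `bytestream`, which is why `create_dvc_source` can hand any `DatasetReader` a `DvcBlob` keyed by its path.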
`@@ -22,7 +22,7 @@ class NumpyNdarrayWriter(DatasetWriter):`

```diff
     """DatasetWriter implementation for numpy ndarray"""

     def write(self, dataset: Dataset) -> Tuple[DatasetReader, ArtifactCollection]:
-        return NumpyNdarrayReader(), ArtifactCollection.from_blobs(
+        return NumpyNdarrayReader(dataset.dataset_type), ArtifactCollection.from_blobs(
             {DATA_FILE: LazyBlob(lambda: save_npz(dataset.data))})
```

Contributor comment on `write`: How does the commit mechanic of DVC work with writing datasets?
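The hunk above now passes `dataset.dataset_type` into `NumpyNdarrayReader`, so the reader can restore typed data from the `.npz` blob that the `LazyBlob` defers serializing. The payload itself is just numpy's npz format; a minimal sketch of the round trip (these `save_npz`/`load_npz` helpers are assumptions mirroring the names in the diff, not ebonite's actual implementation):

```python
import io

import numpy as np


def save_npz(data: np.ndarray) -> bytes:
    # serialize an array into npz bytes, roughly what the LazyBlob defers
    buf = io.BytesIO()
    np.savez(buf, data=data)
    return buf.getvalue()


def load_npz(raw: bytes) -> np.ndarray:
    # what a reader would do with the blob's bytestream
    return np.load(io.BytesIO(raw))['data']


arr = np.arange(6).reshape(2, 3)
restored = load_npz(save_npz(arr))
```

Because serialization is wrapped in a lambda, no bytes are produced until the artifact repository (DVC or otherwise) actually persists the blob.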
`@@ -191,16 +191,17 @@ class PandasReader(DatasetReader):`

```diff
     """DatasetReader for pandas dataframes

     :param format: PandasFormat instance to use
-    :param data_type: DataFrameType to use for aligning read data
+    :param dataset_type: DataFrameType to use for aligning read data
     """

-    def __init__(self, format: PandasFormat, data_type: DataFrameType):
-        self.data_type = data_type
+    def __init__(self, format: PandasFormat, dataset_type: DataFrameType, path: str = None):
+        super(PandasReader, self).__init__(dataset_type)
+        self.path = path or PANDAS_DATA_FILE
         self.format = format

     def read(self, artifacts: ArtifactCollection) -> Dataset:
-        with artifacts.blob_dict() as blobs, blobs[PANDAS_DATA_FILE].bytestream() as b:
-            return Dataset.from_object(self.data_type.align(self.format.read(b)))
+        with artifacts.blob_dict() as blobs, blobs[self.path].bytestream() as b:
+            return Dataset.from_object(self.dataset_type.align(self.format.read(b)))
```

Contributor comment on `self.path = path or PANDAS_DATA_FILE`: What if the dataset is in a remote location, or in any sort of container? How will that work?

`@@ -209,12 +210,14 @@ class PandasWriter(DatasetWriter):`

```diff
     :param format: PandasFormat instance to use
     """

-    def __init__(self, format: PandasFormat):
+    def __init__(self, format: PandasFormat, path: str = None):
+        self.path = path or PANDAS_DATA_FILE
         self.format = format

     def write(self, dataset: Dataset) -> Tuple[DatasetReader, ArtifactCollection]:
         blob = LazyBlob(lambda: self.format.write(dataset.data))
-        return PandasReader(self.format, dataset.dataset_type), ArtifactCollection.from_blobs({PANDAS_DATA_FILE: blob})
+        return (PandasReader(self.format, dataset.dataset_type, self.path),
+                ArtifactCollection.from_blobs({self.path: blob}))

 # class PandasJdbcDatasetSource(_PandasDatasetSource):
 #     def __init__(self, dataset_type: DatasetType, table: str, connection: str,
```
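The writer/reader pair in the hunks above is symmetric: `write` returns a reader already configured with the same `path` key it used for the artifact, so a later `read` looks up the right blob. A dependency-free sketch of that handshake, using the stdlib `csv` module (the classes here are simplified stand-ins, not ebonite's API):

```python
import csv
import io

DEFAULT_DATA_FILE = 'data.csv'  # default path key, like PANDAS_DATA_FILE in the diff


class CsvReader:
    def __init__(self, path: str = None):
        self.path = path or DEFAULT_DATA_FILE

    def read(self, artifacts: dict) -> list:
        # look up the blob under the same key the writer used
        return list(csv.reader(io.StringIO(artifacts[self.path])))


class CsvWriter:
    def __init__(self, path: str = None):
        self.path = path or DEFAULT_DATA_FILE

    def write(self, rows: list):
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        # return a matching reader plus the artifacts, like PandasWriter does
        return CsvReader(self.path), {self.path: buf.getvalue()}


reader, artifacts = CsvWriter('data1.csv').write([['col1', 'col2'], ['123', 'asdf']])
rows = reader.read(artifacts)
```

Making the path configurable is what lets the DVC source point the reader at an existing tracked file (`data1.csv` in the tests below) instead of the hard-coded default name.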
`@@ -27,3 +27,5 @@ lightgbm==2.3.1`

```diff
 torch==1.4.0+cpu ; sys_platform != "darwin"
 torch==1.4.0 ; sys_platform == "darwin"
+
+dvc==1.1.7
```

Empty file added.
New file (`@@ -0,0 +1,3 @@`):

```
col1,col2
123,asdf
456,cvbx
```
New file (`@@ -0,0 +1,95 @@`):

```python
import contextlib
import os
import shutil

import pandas as pd
import pytest

from ebonite.core.analyzer.dataset import DatasetAnalyzer
from ebonite.ext.pandas import DataFrameType
from ebonite.ext.pandas.dataset_source import PandasFormatCsv, PandasReader
from ebonite.ext.s3 import S3ArtifactRepository
from ebonite.utils import fs
from tests.conftest import docker_test
from tests.ext.test_s3.conftest import ACCESS_KEY, SECRET_KEY  # noqa


@pytest.fixture
def dvc_repo_factory(tmpdir):
    def dvc_repo(remote, remote_kwargs=None):
        repo_path = tmpdir
        from dvc.repo import Repo
        repo = Repo.init(repo_path, no_scm=True)

        with repo.config.edit() as conf:
            remote_config = {'url': remote}
            if remote_kwargs is not None:
                remote_config.update(remote_kwargs)
            conf['remote']['storage'] = remote_config
            conf['core']['remote'] = 'storage'

        shutil.copy(fs.current_module_path('data1.csv'), repo_path)
        data1_path = os.path.join(repo_path, 'data1.csv')
        assert os.path.exists(data1_path)
        repo.add([data1_path])
        assert os.path.exists(data1_path + '.dvc')
        repo.push()
        os.remove(data1_path)
        shutil.rmtree(os.path.join(repo_path, '.dvc', 'cache'), ignore_errors=True)

        return repo_path

    return dvc_repo


@pytest.fixture
def local_dvc_repo(tmpdir_factory, dvc_repo_factory):
    storage_path = str(tmpdir_factory.mktemp('storage'))
    return dvc_repo_factory(storage_path)


@contextlib.contextmanager
def override_env(**envs):
    prev = {e: os.environ.get(e, None) for e in envs.keys()}
    try:
        for e, val in envs.items():
            os.environ[e] = val
        yield
    finally:
        for e, val in prev.items():
            if val is not None:
                os.environ[e] = val


@pytest.fixture
def s3_dvc_repo(s3server, dvc_repo_factory):
    url = f'http://localhost:{s3server}'

    with override_env(AWS_ACCESS_KEY_ID=ACCESS_KEY, AWS_SECRET_ACCESS_KEY=SECRET_KEY,
                      S3_ACCESS_KEY=ACCESS_KEY, S3_SECRET_KEY=SECRET_KEY):
        S3ArtifactRepository('dvc-bucket', url)._ensure_bucket()  # noqa
        return dvc_repo_factory('s3://dvc-bucket',
                                {'endpointurl': url})


def test_create_dvc_source__local(local_dvc_repo):
    dt = DataFrameType(['col1', 'col2'], ['int64', 'string'], [])
    from ebonite.ext.dvc import create_dvc_source
    ds = create_dvc_source(path='data1.csv',
                           reader=PandasReader(PandasFormatCsv(), dt, 'data1.csv'),
                           repo=local_dvc_repo)
    dataset = ds.read()
    assert isinstance(dataset.data, pd.DataFrame)
    assert DatasetAnalyzer.analyze(dataset.data) == dt


@docker_test
def test_create_dvc_source_s3(s3_dvc_repo):
    dt = DataFrameType(['col1', 'col2'], ['int64', 'string'], [])
    from ebonite.ext.dvc import create_dvc_source
    ds = create_dvc_source(path='data1.csv',
                           reader=PandasReader(PandasFormatCsv(), dt, 'data1.csv'),
                           repo=s3_dvc_repo)
    dataset = ds.read()
    assert isinstance(dataset.data, pd.DataFrame)
    assert DatasetAnalyzer.analyze(dataset.data) == dt
```
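The `override_env` helper in the test file above is a generic pattern for temporarily setting environment variables. A standalone variant is sketched below (the `MY_TEST_VAR` name is arbitrary); note that, unlike the fixture's `finally` block, this version also unsets variables that did not exist before entering the context, which is arguably the safer restore behavior:

```python
import contextlib
import os


@contextlib.contextmanager
def override_env(**envs):
    # remember previous values, set the new ones, restore on exit
    prev = {e: os.environ.get(e) for e in envs}
    try:
        os.environ.update(envs)
        yield
    finally:
        for e, val in prev.items():
            if val is None:
                # variable was unset before: remove it again
                os.environ.pop(e, None)
            else:
                os.environ[e] = val


with override_env(MY_TEST_VAR='42'):
    inside = os.environ['MY_TEST_VAR']
outside = os.environ.get('MY_TEST_VAR')
```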
Contributor comment: What's `rev`? Also, docs would be very appreciated.