A place for discussion & code review by mingchaoliao · Pull Request #1 · glennpai/simple-graph-etl-python

mingchaoliao · 2022-04-18T18:21:33Z

No description provided.

glennpai · 2022-04-18T21:08:27Z

Tracking new feature requests unrelated to this review in #2.
Please continue to use this PR for review-related discussion.

Maxxxxz · 2022-04-18T21:21:33Z

+        if file_list_resp.status_code == 200:
+            objs = file_list_resp.json()['value']
+
+            for obj in objs:


I'd like to point out these lines in particular. There are two ways to determine whether an object is a file or a folder. The first is to check that it has the file property and then use the download URL, since you know it will exist. The second is to check only that it has a download URL.
I think the way it is done here is fine, since we don't care about folders, only downloadable content in the drive. It is still worth pointing out, because we can also use these properties to check for images and/or videos, which could be useful elsewhere.

Included this change in pull request #3.
Checking for the existence of obj['file'] is almost the same as what we're doing above, but the 'file' property exists only on file-type objects. I don't know whether any non-file objects have a download URL (folders do not), but this change should prevent us from grabbing them if there are.
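The two checks discussed above can be sketched as follows. This is only an illustration, assuming each item is the dict decoded from the list response; the helper names and the sample items are hypothetical and not part of the PR.

```python
from typing import Any, Dict

DOWNLOAD_URL_KEY = "@microsoft.graph.downloadUrl"


def is_file(item: Dict[str, Any]) -> bool:
    """True when the item carries the 'file' property (folders do not)."""
    return "file" in item


def is_downloadable(item: Dict[str, Any]) -> bool:
    """True when the item exposes a download URL."""
    return DOWNLOAD_URL_KEY in item


def media_kind(item: Dict[str, Any]) -> str:
    """Return 'image', 'video', or 'other' depending on which property is present."""
    for facet in ("image", "video"):
        if facet in item:
            return facet
    return "other"


# Hypothetical sample items shaped like entries of the response's 'value' list.
sample_items = [
    {"name": "report.csv", "file": {}, DOWNLOAD_URL_KEY: "https://example.invalid/report"},
    {"name": "photo.png", "file": {}, "image": {}, DOWNLOAD_URL_KEY: "https://example.invalid/photo"},
    {"name": "archive", "folder": {}},
]
file_names = [item["name"] for item in sample_items if is_file(item)]
```

Using a membership test instead of obj['file'] avoids a KeyError on items that lack the property.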

Add filenames function, create local path param

mingchaoliao

1. Add a requirements.txt file containing your project's dependencies.
2. The file names are hard to follow because they are all lowercase with no separator. I think most people prefer all-lowercase names with underscores to improve readability. You can follow the PEP 8 style guide: https://peps.python.org/pep-0008/#package-and-module-names
3. I encourage you to add type hints wherever possible. Like JavaScript, Python is dynamically typed. Writing explicitly typed code (the same reason people replace JS with TypeScript) can greatly reduce type-related bugs and improve readability, since other developers can easily inspect the code in an IDE (e.g. see the exact parameter structure, jump to the class definition). See more about Python's type hint system at https://docs.python.org/3/library/typing.html
4. Microsoft has a Graph API SDK for Python, https://github.com/microsoftgraph/msgraph-sdk-python-core. Are you able to use it instead of making the calls manually with requests?
5. The way you download/upload files won't work well with large files. For large files (several gigabytes), people usually download/upload using streams or in multiple chunks, although I don't think we will run into that issue very often.
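As a small illustration of the type-hint suggestion, here is a sketch of what annotated code could look like. The class name follows the rename proposed later in this review, base_url is the only attribute visible in the diff, and the function name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentLibraryConfig:
    """Typed configuration holder; fields other than base_url are omitted here."""

    base_url: str


def children_url(config: DocumentLibraryConfig, remote_path: str) -> str:
    """Build the listing URL for a remote folder, mirroring the f-string in the diff."""
    return f"{config.base_url}/root:/{remote_path}:/children"


# Hypothetical base URL, used only to exercise the function.
example_config = DocumentLibraryConfig(base_url="https://example.invalid/drive")
example_url = children_url(example_config, "reports")
```

With annotations like these, an IDE or a type checker can flag a call that passes the wrong kind of argument before the code runs.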

mingchaoliao · 2022-04-20T02:53:20Z

@@ -0,0 +1,2 @@
+.env
+venv


Missing a newline at the end of the file.

mingchaoliao · 2022-04-20T03:06:10Z

@@ -0,0 +1,14 @@
+from setuptools import find_packages, setup
+
+setup(


I assume you eventually want to render the README file on your package's PyPI project page. PyPI won't pick up the README automatically; you have to read the file and assign its contents to long_description. See more details at https://packaging.python.org/en/latest/guides/making-a-pypi-friendly-readme/
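A minimal sketch of that approach, following the linked guide: read the README and pass it to setup() together with long_description_content_type. The helper name and the package name below are placeholders, not values from this repository.

```python
from pathlib import Path
from typing import Any, Dict


def build_setup_kwargs(readme_path: str = "README.md") -> Dict[str, Any]:
    """Collect setup() keyword arguments, including the README as long_description."""
    long_description = Path(readme_path).read_text(encoding="utf-8")
    return {
        "name": "example-package",  # placeholder, not the project's real name
        "long_description": long_description,
        # Tells PyPI to render the description as Markdown.
        "long_description_content_type": "text/markdown",
    }


# In setup.py this would be used as follows (not executed here):
#   from setuptools import setup
#   setup(**build_setup_kwargs())
```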

mingchaoliao · 2022-04-20T03:25:29Z

+Module to unify and simplify configuration of a SharePoint document library for use in
+a Python ETL
+"""
+class DocumentLibrary:


Based on the content of this class and how it is used in the SimpleETL class, I feel that calling it DocumentLibraryConfig, DocumentLibraryOptions, or something similar might be more appropriate.

mingchaoliao · 2022-04-20T03:40:20Z

+        self.library = document_library
+        self.__thumbprint = thumbprint
+        self.__private_key = private_key
+        self.__token = self.__acquire_token()


It's better to move the token exchange out of the constructor and run it right before the first action (read/write/etc.) that requires authentication, for the following reasons:

It doesn't make sense to spend time on it before it is actually needed. If someone sets up the package but ends up not using it, the work is wasted, although developers shouldn't do that.

It is harder to test. Generally speaking, in OOP (with a few exceptions) the only thing a constructor should do is dependency injection, meaning it assigns the dependencies passed by the caller to instance members. We usually don't have to test constructors, but putting business logic in one requires you to test the constructor itself.
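One way to defer the token exchange is sketched below. The class and the injected acquire_token callable are hypothetical stand-ins for the PR's own class and its __acquire_token method, and token expiry is deliberately not handled.

```python
from typing import Callable, Dict, Optional


class GraphClient:
    """Hypothetical client that only illustrates deferring the token exchange."""

    def __init__(self, acquire_token: Callable[[], str]) -> None:
        # The constructor only stores its dependency; no network call happens here.
        self._acquire_token = acquire_token
        self._token: Optional[str] = None

    def _get_token(self) -> str:
        # Exchange credentials on first use, then reuse the cached token.
        if self._token is None:
            self._token = self._acquire_token()
        return self._token

    def auth_headers(self) -> Dict[str, str]:
        """Headers for an authenticated request; triggers the exchange if needed."""
        return {"Authorization": "Bearer " + self._get_token()}
```

Because the token provider is injected, a test can pass a fake callable and never touch the real authentication flow.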

mingchaoliao · 2022-04-20T03:44:26Z

+
+
+    @staticmethod
+    def __get_item_id(file_items, target_name):


It doesn't have to stay inside the class as a static method. It looks like a utility function, so it can move into a dedicated utility .py file, or stay in the same file as the class but outside the class body.
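As a module-level function it might look like the sketch below. The body is an assumption, since the original implementation is not shown in this hunk: it is taken to return the id of the first item whose name matches, or None.

```python
from typing import Any, Dict, Iterable, Optional


def get_item_id(file_items: Iterable[Dict[str, Any]], target_name: str) -> Optional[str]:
    """Return the 'id' of the first item named target_name, or None if absent.

    Assumed behavior; the PR's real __get_item_id may differ.
    """
    for item in file_items:
        if item.get("name") == target_name:
            return item.get("id")
    return None
```

A plain function like this can be imported and tested without instantiating the class.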

mingchaoliao · 2022-04-20T04:44:20Z

+        raise Exception(result.get('error'))
+
+
+    def filenames(self, remote_path):


My personal preference: listFiles (list_files in PEP 8 style) is more self-explanatory.

mingchaoliao · 2022-04-20T05:08:45Z

+            objs = file_list_resp.json()['value']
+            for obj in objs:
+                if obj['file']:
+                    filenames.append(obj['name'])


I think it can be simplified to:

files = filter(lambda obj: obj['file'], objs)
filenames = [file['name'] for file in files]
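A runnable variant of this simplification is sketched below as a single comprehension. It swaps the obj['file'] lookup for a membership test, on the assumption (stated earlier in this thread) that the 'file' property exists only on file objects, so a direct lookup would raise KeyError for folders. The function name and sample data are hypothetical.

```python
from typing import Any, Dict, List


def file_names(objs: List[Dict[str, Any]]) -> List[str]:
    """Names of the items that carry the 'file' property."""
    return [obj["name"] for obj in objs if "file" in obj]


# Hypothetical items shaped like the response's 'value' list.
example_objs = [
    {"name": "a.csv", "file": {}},
    {"name": "subfolder", "folder": {}},
    {"name": "b.csv", "file": {}},
]
example_names = file_names(example_objs)
```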

mingchaoliao · 2022-04-20T05:13:42Z

+                file_data = requests.get(obj['@microsoft.graph.downloadUrl'])
+                if file_data.status_code == 200:
+                    try:
+                        clean_path = re.sub(r'^(\\|\/)+|(\\|\/)+$', '', local_path)


It seems like you can use os.path.normpath(...) here: https://docs.python.org/3/library/os.path.html#os.path.normpath
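One caveat: normpath collapses repeated separators and drops a trailing one, but it keeps a leading separator, which the regex in the diff removes. A sketch that combines the two is shown below; it uses posixpath so the result does not depend on the operating system, and the function name is hypothetical.

```python
import posixpath


def clean_local_path(local_path: str) -> str:
    """Normalize a path and drop leading/trailing separators."""
    # Treat backslashes as separators so both styles are handled alike.
    unified = local_path.replace("\\", "/")
    # normpath collapses repeated separators and resolves '.' and '..' segments.
    normalized = posixpath.normpath(unified)
    # normpath keeps a leading separator, so strip it explicitly.
    return normalized.strip("/")
```

Note that normpath turns an empty string into '.', so an empty local_path comes back as '.' rather than an empty string, unlike the regex version.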

mingchaoliao · 2022-04-20T05:23:13Z

+        list_url = f'{self.library.base_url}/root:/{remote_path}:/children'
+        file_list_response = requests.get(list_url,
+            headers={'Authorization': 'Bearer ' + self.__token})


Hmm, I saw similar code in the filenames method. To improve reusability, you could create another method that handles listing files/dirs and have both the filenames and delete methods use it, rather than each doing its own file listing.
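A sketch of such a shared helper follows. The function names and the injected http_get parameter are hypothetical; in the PR's code http_get would presumably be requests.get. Paging through large folders is not handled here.

```python
from typing import Any, Callable, Dict, List


def list_children(base_url: str, remote_path: str, token: str,
                  http_get: Callable[..., Any]) -> List[Dict[str, Any]]:
    """Fetch the items under remote_path; the single place that builds the listing call."""
    url = f"{base_url}/root:/{remote_path}:/children"
    response = http_get(url, headers={"Authorization": "Bearer " + token})
    # Let the HTTP layer raise on 4xx/5xx responses.
    response.raise_for_status()
    return response.json()["value"]


def filenames(base_url: str, remote_path: str, token: str,
              http_get: Callable[..., Any]) -> List[str]:
    """Names of the files under remote_path, built on the shared listing helper."""
    items = list_children(base_url, remote_path, token, http_get)
    return [item["name"] for item in items if "file" in item]
```

A delete method could call list_children the same way to look up the item it needs to remove.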

mingchaoliao · 2022-04-20T05:27:43Z

+                    headers={'Authorization': 'Bearer ' + self.__token})
+                if delete_response.status_code != 204:
+                    raise Exception(f'Failed to delete {file_name}. \
+                        {delete_response.raise_for_status()}')


raise_for_status() will itself raise an exception when the response is an error (4xx/5xx), so your raise Exception(...) is never reached in that case. There are several cases like that in this file.
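One alternative is to build the error message from the status code instead of calling raise_for_status() inside the f-string. The sketch below assumes only that the response object has a status_code attribute; the exception class and function name are hypothetical.

```python
from typing import Any


class GraphRequestError(Exception):
    """Hypothetical error type for failed calls."""


def ensure_status(response: Any, expected_status: int, action: str) -> None:
    """Raise a descriptive error unless the response has the expected status."""
    if response.status_code != expected_status:
        raise GraphRequestError(
            f"Failed to {action}: expected HTTP {expected_status}, "
            f"got {response.status_code}"
        )
```

For the delete case in the diff this would be called as ensure_status(delete_response, 204, f"delete {file_name}").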

mingchaoliao · 2022-04-20T06:51:32Z

+            filenames (string[]): List of file names in the remote_path directory
+        """
+        filenames = []
+        file_list_resp = requests.get(f'{self.library.base_url}/root:/{remote_path}:/children',


You may be able to use the filter and select query parameters to:

return files only https://docs.microsoft.com/en-us/graph/query-parameters#filter-parameter

return only the attributes you need https://docs.microsoft.com/en-us/graph/query-parameters#select-parameter

This comment also applies to the other places that make API calls.
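A sketch of how the two query parameters could be appended to the listing URL is shown below. The function name is hypothetical, and the filter expression used in the usage note is an unverified example; whether a given expression is accepted for this endpoint should be checked against the linked documentation.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def build_children_url(base_url: str, remote_path: str,
                       select: Optional[str] = None,
                       filter_expr: Optional[str] = None) -> str:
    """Listing URL with optional $select and $filter query parameters."""
    params: Dict[str, str] = {}
    if select:
        params["$select"] = select
    if filter_expr:
        params["$filter"] = filter_expr
    url = f"{base_url}/root:/{remote_path}:/children"
    if params:
        # Keep '$' and ',' readable; spaces are percent-encoded by quote().
        url += "?" + urlencode(params, safe="$,", quote_via=quote)
    return url
```

For example, build_children_url(base, "reports", select="id,name,file") asks only for the three attributes the filenames method needs, and a filter such as "file ne null" (an assumption, not a confirmed expression) could be passed as filter_expr.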

glennpai · 2022-04-20T13:01:39Z

1. Add a requirements.txt file containing your project's dependencies.

2. The file names are hard to follow because they are all lowercase with no separator. I think most people prefer all-lowercase names with underscores to improve readability. You can follow the PEP 8 style guide: https://peps.python.org/pep-0008/#package-and-module-names

3. I encourage you to add type hints wherever possible. Like JavaScript, Python is dynamically typed. Writing explicitly typed code (the same reason people replace JS with TypeScript) can greatly reduce type-related bugs and improve readability, since other developers can easily inspect the code in an IDE (e.g. see the exact parameter structure, jump to the class definition). See more about Python's type hint system at https://docs.python.org/3/library/typing.html

4. Microsoft has a Graph API SDK for Python, https://github.com/microsoftgraph/msgraph-sdk-python-core. Are you able to use it instead of making the calls manually with requests?

5. The way you download/upload files won't work well with large files. For large files (several gigabytes), people usually download/upload using streams or in multiple chunks, although I don't think we will run into that issue very often.

1. This can be done.
2. This can be done.
3. I will look into stronger typing. I prefer explicit type definitions but had not previously used them in Python.
4. The official SDK is a powerful tool but unfortunately is not ideal for working with OIT SET restricted document libraries. From the authentication methods to our document library config, it is likely simpler to make a reusable request wrapper package for most actions than to weave in an existing SDK. When originally researching existing libs for our first SET SharePoint ETL, we found that nearly all official libs didn't work well with our required config and that third-party libs were deprecated or abandoned.
5. I would like to gauge how important this issue is to fix. According to the Graph API docs, the current method supports up to 60 MiB in one request. If we believe our SET integrations would ever need to upload files larger than 60 MiB, I would be open to modifying this to support larger files. It may also be worth creating a separate function to handle large uploads while leaving a simpler small-file upload function. I am open to ideas on how to approach this.
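For the download side of the large-file question, one possible shape is a helper that writes a stream of chunks to disk instead of holding the whole body in memory. This is a sketch under assumptions, not the PR's code: the function name is hypothetical, and the chunks are expected to come from something like requests' response.iter_content() on a request made with stream=True.

```python
from typing import BinaryIO, Iterable


def write_chunks(chunks: Iterable[bytes], dest: BinaryIO) -> int:
    """Write each chunk to dest as it arrives and return the total bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            dest.write(chunk)
            total += len(chunk)
    return total
```

With requests, this might be used as write_chunks(response.iter_content(chunk_size=1024 * 1024), open_file), where response was obtained with stream=True and open_file is a file opened in binary write mode. Chunked uploads would need a separate mechanism and are not covered by this sketch.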

Christopher Glenn and others added 6 commits April 17, 2022 00:45

Genesis

f57792b

Update README.md

be973b5

Update README.md

e97ad50

Update README.md

0800cd4

Update README.md

23566bb

Update docstrings

fe41043

Maxxxxz reviewed Apr 18, 2022

Christopher Glenn and others added 2 commits April 18, 2022 18:57

Add filenames function, create local path param

64632a5

Merge pull request #3 from glennpai/add-filenames

97b1887

mingchaoliao commented Apr 20, 2022

Update README.md

cdaf4d4
