Skip to content

A place for discussion & code review#1

Draft
mingchaoliao wants to merge 9 commits intotemp-branchfrom
main
Draft

A place for discussion & code review#1
mingchaoliao wants to merge 9 commits intotemp-branchfrom
main

Conversation

@mingchaoliao
Copy link
Copy Markdown
Collaborator

No description provided.

@glennpai
Copy link
Copy Markdown
Owner

Tracking new feature requests unrelated to this review under -> #2
Please continue to use this PR for review-related discussion.

if file_list_resp.status_code == 200:
objs = file_list_resp.json()['value']

for obj in objs:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to point these lines out in particular. There are a few ways to determine if an object is a file or a folder. You can check that it has the file property, and then use the download URL since you know it will exist. The second way is to just check that it has a download URL.
I think the way it is done here is fine since we don't care about folders, and only downloadable content in the drive. Just something worth pointing out since we can also use the properties to check for images and/or videos as well. This could be useful elsewhere.

Copy link
Copy Markdown
Owner

@glennpai glennpai Apr 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Included change in pull -> #3
The existence of obj['file'] is almost the same as what we're doing above but this 'file' property will exist only on filetype objects. I don't know if there are any non-filetype objects that would have a downloadURL (folders do not) but this change should prevent us from grabbing those if there are.

Christopher Glenn and others added 2 commits April 18, 2022 18:57
Copy link
Copy Markdown
Collaborator Author

@mingchaoliao mingchaoliao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. add a requirements.txt file containing your project's dependencies.
  2. I found out that the file names are hard to follow as they are all lowercase and have no separator. I think most people prefer all lowercase and use underscores to improve readability. You can probably follow the PEP8 style guide https://peps.python.org/pep-0008/#package-and-module-names
  3. I encourage you to consider adding types to all possible places. Similar to JavaScript, Python is a weak-type language. But writing strong-typed code (like why we replace JS with TypeScript) can greatly reduce type related bugs and improve code readability (as other developers can easily inspect the code in an IDE, e.g. know exactly the parameter structure, go to the class definition, etc.). see more about Python type hint system at https://docs.python.org/3/library/typing.html
  4. It seems like Microsoft has a Graph API SDK for python, https://github.com/microsoftgraph/msgraph-sdk-python-core, are you able to utilize it? (instead of doing it manually using requests).
  5. The way you download/upload files won't work well with large files. For large files (several Gig), people usually download/upload using streams, or do it in multiple chunks. Although I don't think we will run into that issue very often.

Comment thread .gitignore
@@ -0,0 +1,2 @@
.env
venv No newline at end of file
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a new line at the end of the file.

Comment thread setup.py
@@ -0,0 +1,14 @@
from setuptools import find_packages, setup

setup(
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you eventually want to render the README file to your package's profile page on the PyPI. PyPI won't automatically pick up the README file. You have to read the file and assign it to long_description. see more details at https://packaging.python.org/en/latest/guides/making-a-pypi-friendly-readme/

Module to unify and simplify configuration of a SharePoint document library for use in
a Python ETL
"""
class DocumentLibrary:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the content of this class and how it is used in the SimpleETL class. I feel like calling it DocumentLibraryConfig, DocumentLibraryOptions, or something similar might be more approperiate.

self.library = document_library
self.__thumbprint = thumbprint
self.__private_key = private_key
self.__token = self.__acquire_token()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to move the token exchange process out of the constructor and probably run it right before the first time an action (read/write/etc) requiring the authentication is executed, giving the following reasons:

  1. It doesn't make sense to spend time before it is actually needed. If someone set up the package but ends up not using it, then the resource is wasted, although developers shouldn't do that.
  2. It is harder to test. General speaking, in OOP, (with a few exceptions), the only thing a constructor should do is dependency injection, meaning assigns dependencies from the caller to its instance members. We usually don't have to test constructors, but doing business logic in it will then requires you to test the constructor itself.



@staticmethod
def __get_item_id(file_items, target_name):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have to stay inside the class as a static method. It seems like a utility function and can put it into a dedicate utility py file, or still stays in the same file with the class but not inside the class.

raise Exception(result.get('error'))


def filenames(self, remote_path):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal preference, listFiles is more self-explainable.

Comment on lines +92 to +95
objs = file_list_resp.json()['value']
for obj in objs:
if obj['file']:
filenames.append(obj['name'])
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be simplified to:

files = filter(lambda obj: obj['file'], objs)
filenames = [ file['name'] for file in files ]

file_data = requests.get(obj['@microsoft.graph.downloadUrl'])
if file_data.status_code == 200:
try:
clean_path = re.sub(r'^(\\|\/)+|(\\|\/)+$', '', local_path)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like you can use os.path.normpath(...) here https://docs.python.org/3/library/os.path.html#os.path.normpath

Comment on lines +147 to +149
list_url = f'{self.library.base_url}/root:/{remote_path}:/children'
file_list_response = requests.get(list_url,
headers={'Authorization': 'Bearer ' + self.__token})
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I saw similar code in the filenames method. To improve reusability, maybe you can create another method to handle the list files/dirs work and let filenames and delete methods use it rather than doing their own file listing.

headers={'Authorization': 'Bearer ' + self.__token})
if delete_response.status_code != 204:
raise Exception(f'Failed to delete {file_name}. \
{delete_response.raise_for_status()}')
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise_for_status will actually raise an exception. It makes your raise Exception(...) pointless. There are several cases like that in this file.

filenames (string[]): List of file names in the remote_path directory
"""
filenames = []
file_list_resp = requests.get(f'{self.library.base_url}/root:/{remote_path}:/children',
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be able to use filter and select query parameter to:

  1. return files only https://docs.microsoft.com/en-us/graph/query-parameters#filter-parameter
  2. return only the attributes you need https://docs.microsoft.com/en-us/graph/query-parameters#select-parameter

This comment also applies to other places that make api calls.

@glennpai
Copy link
Copy Markdown
Owner

glennpai commented Apr 20, 2022

  1. add a requirements.txt file containing your project's dependencies.
  2. I found out that the file names are hard to follow as they are all lowercase and have no separator. I think most people prefer all lowercase and use underscores to improve readability. You can probably follow the PEP8 style guide https://peps.python.org/pep-0008/#package-and-module-names
  3. I encourage you to consider adding types to all possible places. Similar to JavaScript, Python is a weak-type language. But writing strong-typed code (like why we replace JS with TypeScript) can greatly reduce type related bugs and improve code readability (as other developers can easily inspect the code in an IDE, e.g. know exactly the parameter structure, go to the class definition, etc.). see more about Python type hint system at https://docs.python.org/3/library/typing.html
  4. It seems like Microsoft has a Graph API SDK for python, https://github.com/microsoftgraph/msgraph-sdk-python-core, are you able to utilize it? (instead of doing it manually using requests).
  5. The way you download/upload files won't work well with large files. For large files (several Gig), people usually download/upload using streams, or do it in multiple chunks. Although I don't think we will run into that issue very often.
  1. This can be done.
  2. This can be done.
  3. I will look into stronger typing. I prefer explicit type definitions but had not previously used them in Python.
  4. The official SDK is a powerful tool but unfortunately is not ideal for working with OIT SET restricted document libraries. From the authentication methods to our document library config, it is likely simpler to make a reusable request wrapper package for most actions than trying to weave in an existing SDK. When originally researching existing libs for our first SET SharePoint ETL, we found that nearly all official libs didn't work well with our required config or that third-party libs were deprecated or abandoned.
  5. This is an issue I would like to gauge how important it would be to fix. According to the Graph API docs, the current method supports up to 60 MiB in one request. If we believe our SET integrations would ever need to upload files larger than 60 MiB, I would be open to modifying this to support larger files. It may also be worth creating a separate function to handle large uploads while leaving a simpler small-file upload function. I am open to ideas on how to approach this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants