Skip to content

use curl_cffi instead of requests#2

Open
Likhithsai2580 wants to merge 3 commits intomainfrom
dev
Open

use curl_cffi instead of requests#2
Likhithsai2580 wants to merge 3 commits intomainfrom
dev

Conversation

@Likhithsai2580
Copy link
Copy Markdown
Owner

@Likhithsai2580 Likhithsai2580 commented Apr 13, 2025

User description

✅ Download resume with Range headers

✅ HTTP/2 and HTTP/3 support

✅ Native SSL (no weird cert errors like requests)

✅ Proxy support (SOCKS, HTTP, etc.)

✅ Built-in timeout and retry handling

✅ Way better performance for multi-part downloads


Description

  • Replaced requests with curl_cffi for better performance and support for HTTP/2 and HTTP/3.
  • Enhanced the YouTube downloader to support separate video and audio downloads, improving quality.
  • Implemented better progress tracking and merging of video and audio streams.
  • Improved error handling and logging in the download manager.

Changes walkthrough 📝

Relevant files
Enhancement
gui.py
Enhanced YouTube Downloader with curl_cffi                             

gui.py

  • Replaced requests with curl_cffi for HTTP requests.
  • Enhanced video and audio downloading capabilities.
  • Added separate handling for video and audio streams.
  • Improved progress tracking and merging of video and audio.
  • +362/-94
    idm.py
    Improved Download Manager with curl_cffi                                 

    idm.py

  • Switched from requests to curl_cffi for HTTP operations.
  • Improved directory handling for downloads and temporary files.
  • Added permission checks for output directories.
  • Enhanced error handling and logging.
  • +170/-126

    💡 Penify usage:
    Comment /help on the PR to get a list of all available Penify tools and their descriptions

    @penify-dev penify-dev bot added enhancement New feature or request Review effort [1-5]: 4 labels Apr 13, 2025
    @penify-dev
    Copy link
    Copy Markdown

    penify-dev bot commented Apr 13, 2025

    PR Review 🔍

    (Review updated until commit 448b328)

    ⏱️ Estimated effort to review [1-5]

    4, because the PR introduces significant changes by replacing the requests library with curl_cffi, which involves multiple modifications across several methods and classes. The complexity of handling video and audio streams separately adds to the review effort.

    🧪 Relevant tests

    No

    ⚡ Possible issues

    Possible Bug: In the _get_total_size method, if the Content-Length header is not present, it raises a RuntimeError. This could lead to unhandled exceptions if the server response is unexpected.

    Possible Bug: The _download_video_audio method does not handle exceptions that may arise from the download method of the video and audio streams, which could lead to silent failures.

    🔒 Security concerns

    No

    @penify-dev
    Copy link
    Copy Markdown

    penify-dev bot commented Apr 13, 2025

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Best practice
    Implement context management for the response object to enhance resource handling

    Use a context manager for the response object to ensure proper resource management and
    avoid potential memory leaks.

    idm.py [288]

    -resp = self.session.get(
    +with self.session.get(
     
    Suggestion importance[1-10]: 9

    Why: This suggestion correctly identifies a best practice for resource management by using a context manager for the response object. This change will help prevent memory leaks and ensure that resources are properly released.

    9
    Ensure proper closing of the response object to prevent resource leaks

    Ensure that the response object is properly closed in all scenarios, including when an
    exception occurs during the request.

    gui.py [56-61]

    -resp = self.session.head(self.download_url, allow_redirects=True, impersonate="chrome110")
    +try:
    +    resp = self.session.head(self.download_url, allow_redirects=True, impersonate="chrome110")
    +    # Additional logic here
    +finally:
    +    resp.close()
     
    Suggestion importance[1-10]: 8

    Why: The suggestion correctly identifies the need to ensure the response object is closed to prevent resource leaks, which is crucial for maintaining application performance and stability.

    8
    Possible bug
    Add error handling for missing Content-Length header to prevent unexpected behavior

    Handle the case where the response does not contain the 'Content-Length' header to avoid
    potential errors.

    gui.py [58]

    -self.total_size = int(resp.headers.get('Content-Length', 0))
    +content_length = resp.headers.get('Content-Length')
    +if content_length is None:
    +    raise RuntimeError("Content-Length header is missing.")
    +self.total_size = int(content_length)
     
    Suggestion importance[1-10]: 9

    Why: This suggestion is crucial as it addresses a potential bug that could lead to runtime errors if the 'Content-Length' header is missing, significantly improving the robustness of the code.

    9
    Performance
    Add a timeout to the request to prevent indefinite blocking

    Consider adding a timeout to the HEAD request to avoid hanging indefinitely.

    gui.py [56]

    -resp = self.session.head(self.download_url, allow_redirects=True, impersonate="chrome110")
    +resp = self.session.head(self.download_url, allow_redirects=True, impersonate="chrome110", timeout=10)
     
    Suggestion importance[1-10]: 7

    Why: Adding a timeout to the HEAD request is a good practice to prevent indefinite blocking, though it is a minor improvement compared to resource management.

    7
    Optimize session initialization to prevent redundant session creation

    Ensure that the session is re-initialized only when necessary to avoid unnecessary
    overhead.

    idm.py [465]

    -self.session = curl_requests.Session()
    +if not self.session:
    +    self.session = curl_requests.Session()
     
    Suggestion importance[1-10]: 6

    Why: While optimizing session initialization is beneficial, the current implementation does not check if the session is already initialized, which could lead to redundant session creation. The suggestion could be improved by checking if self.session is None before re-initializing.

    6
    Maintainability
    Improve error handling for header access to provide clearer logging

    Handle potential exceptions specifically when accessing the Content-Length header to avoid
    misleading logs.

    idm.py [213]

    -if (content_length := resp.headers.get('Content-Length')):
    +content_length = resp.headers.get('Content-Length')
    +if content_length is not None:
     
    Suggestion importance[1-10]: 7

    Why: The suggestion improves error handling for accessing the Content-Length header, which can lead to clearer logs. However, the proposed change does not address all potential issues, such as cases where the header might not exist at all.

    7
    Possible issue
    Verify the compatibility of the impersonate parameter with the new library to avoid runtime errors

    Ensure that the impersonate parameter is supported by the curl_cffi library to avoid
    potential issues.

    gui.py [110-115]

    -resp = self.session.get(self.download_url, headers=headers, stream=True, timeout=30, impersonate="chrome110", verify=False)
    +resp = self.session.get(self.download_url, headers=headers, stream=True, timeout=30, verify=False)
     
    Suggestion importance[1-10]: 6

    Why: This suggestion addresses a potential compatibility issue with the new library, which is important but not critical, hence a moderate score.

    6

    @Likhithsai2580
    Copy link
    Copy Markdown
    Owner Author

    /describe

    @penify-dev
    Copy link
    Copy Markdown

    penify-dev bot commented Apr 20, 2025

    PR Description updated to latest commit (448b328)

    @Likhithsai2580
    Copy link
    Copy Markdown
    Owner Author

    /review

    @penify-dev
    Copy link
    Copy Markdown

    penify-dev bot commented Apr 20, 2025

    Persistent review updated to latest commit 448b328

    try:
    resp = self.session.head(self.download_url, allow_redirects=True, impersonate="chrome110")
    self.total_size = int(resp.headers.get('Content-Length', 0))
    if self.total_size == 0:
    Copy link
    Copy Markdown

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Consider adding error handling for the _get_total_size method to manage cases where the Content-Length header is missing, instead of raising a RuntimeError. This can be done by logging a warning and setting a default value for self.total_size. [important]


    # Set up callback and download video
    self.video_stream.on_progress = video_progress_callback
    self.video_stream.download(output_path=os.path.dirname(self.video_file), filename=os.path.basename(self.video_file))
    Copy link
    Copy Markdown

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    In the _download_video_audio method, ensure that exceptions from the download calls for both video and audio streams are caught and handled appropriately to avoid silent failures. You can log the error and return a failure status. [important]

    if self.shutdown_flag.is_set():
    self._save_resume_data()
    return
    resp = self.session.get(
    Copy link
    Copy Markdown

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    In the download_segment method, consider implementing a retry mechanism for the HTTP request to handle transient network issues more gracefully. This can improve the robustness of the download process. [medium]

    try:
    self.output_path.mkdir(parents=True, exist_ok=True, mode=0o755)
    temp_file = self.resume_file.with_name(f"{self.resume_file.name}.tmp")
    temp_file = self.resume_file.with_suffix('.tmp')
    Copy link
    Copy Markdown

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    In the _save_resume_data method, ensure that the temporary file is removed in case of an exception during the save process to prevent leaving stale files. This can be done in a finally block. [medium]

    @Likhithsai2580
    Copy link
    Copy Markdown
    Owner Author

    /improve

    @penify-dev
    Copy link
    Copy Markdown

    penify-dev bot commented Apr 20, 2025

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Maintainability
    Enhance the cleanup process by verifying file existence before deletion

    Ensure that the cleanup process in _clear_resume_data is robust by checking for the
    existence of files before attempting to unlink them.

    idm.py [112]

    -if self.resume_file.exists():
    +if self.resume_file.exists() and self.resume_file.is_file():
     
    Suggestion importance[1-10]: 8

    Why: This suggestion improves the robustness of the cleanup process by ensuring that the code checks for file existence before attempting to delete them, which can prevent potential errors during execution.

    8
    Possible issue
    Implement a retry mechanism for handling unexpected HTTP status codes

    Handle the case where the server does not respond with a valid status code more gracefully
    by implementing a retry mechanism.

    idm.py [297-298]

     if resp.status_code not in (200, 206):
    +    logger.error(f"Part {part_id} failed: HTTP {resp.status_code}. Retrying...")
    +    continue  # Optionally implement a retry mechanism here
     
    Suggestion importance[1-10]: 6

    Why: The suggestion is valid as it proposes a retry mechanism for handling unexpected HTTP status codes, which can improve robustness. However, the implementation is only partially complete as it suggests a comment without providing a full retry logic.

    6
    Initialize the session only if it hasn't been set yet to prevent potential errors

    Ensure that the session is properly initialized before making any requests to avoid
    potential issues with uninitialized session usage.

    idm.py [45]

    -self.session = curl_requests.Session()
    +if not hasattr(self, 'session'):
    +    self.session = curl_requests.Session()
     
    Suggestion importance[1-10]: 3

    Why: The suggestion addresses a potential issue with session initialization, but the current implementation already initializes the session correctly. Adding a check for initialization is unnecessary and does not improve the code significantly.

    3
    Performance
    Add a timeout to the session to prevent indefinite hanging on network requests

    Consider adding a timeout to the session initialization to prevent hanging indefinitely on
    network requests.

    idm.py [45]

    -self.session = curl_requests.Session()
    +self.session = curl_requests.Session(timeout=10)  # Set a timeout for the session
     
    Suggestion importance[1-10]: 4

    Why: While adding a timeout to the session is a good practice, the current implementation does not support a timeout parameter in the way suggested. The suggestion does not accurately reflect the capabilities of the curl_requests.Session().

    4

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    1 participant