pandas: Starter and Jupyter tutorial. Guidelines for efficient ingest. #297

Conversation
Caution: Review failed. The pull request is closed.

Walkthrough

Replaces inline pandas content on the DataFrame connect index with a seealso link, adds a pandas integration landing page plus three tutorials (starter, Jupyter, efficient ingest), and adds an anchor label to the Dask usage doc. All changes are documentation-only.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Pandas as pandas.DataFrame.to_sql()
    participant SA as SQLAlchemy Engine
    participant Crate as CrateDB
    User->>Pandas: to_sql(table, engine, method=insert_bulk, chunksize=N)
    note right of Pandas: chunk DataFrame into batches
    Pandas->>SA: send batch payload
    SA->>Crate: BULK INSERT (batch)
    Crate-->>SA: ACK / status
    SA-->>Pandas: batch result
    Pandas-->>User: completed (all batches processed)
    rect rgba(235,245,255,0.25)
        note over Pandas,Crate: If chunksize=None, default batching behavior applies
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
```python
import crate
import pandas as pd
from sqlalchemy import create_engine
from crate.client.sqlalchemy.support import insert_bulk
```
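To see how these pieces fit together with the batching flow from the walkthrough above, here is a minimal sketch; the connection URL, example DataFrame, table name, and chunk size are placeholder assumptions rather than values from the document under review:

```python
import pandas as pd
from sqlalchemy import create_engine
from crate.client.sqlalchemy.support import insert_bulk

# Placeholder connection URL and data; adjust to your CrateDB instance.
engine = create_engine("crate://localhost:4200")
df = pd.DataFrame({"value": range(100_000)})

# method=insert_bulk routes each chunk through CrateDB's bulk-insert path;
# chunksize controls how many rows pandas hands over per batch.
df.to_sql(
    name="demo",
    con=engine,
    index=False,
    if_exists="replace",
    method=insert_bulk,
    chunksize=5_000,
)
```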
That code isn't using sqlalchemy-cratedb yet, so we should update it. Maybe you could lend a hand here to update and verify this document & code snippet, @karynzv?
Improved with 925c986.
Force-pushed f1ebc44 to 961bff7.
Force-pushed 1bc9ce5 to 925c986.
Actionable comments posted: 5
🧹 Nitpick comments (2)
docs/integrate/pandas/efficient-ingest.md (1)
34-34: Consider using a conventional table name without hyphens.

The table name `"cratedb-demo"` contains a hyphen, which requires quoting in SQL. While `to_sql()` handles this automatically, using an underscore (`cratedb_demo`) or removing the separator entirely follows more conventional naming practices and avoids potential quoting issues in manual SQL queries.

Apply this change if you prefer conventional naming:

```diff
-    name="cratedb-demo",
+    name="cratedb_demo",
```

docs/integrate/pandas/index.md (1)
14-19: Consider consolidating redundant content.

The two paragraphs both introduce pandas but with slightly different wording and capitalization styles. The second paragraph's note about "Pandas (stylized as pandas)" creates confusion after already using lowercase in the first paragraph.

Consider consolidating into a single, clearer introduction:

```diff
-[pandas] is a fast, powerful, flexible, and easy-to-use open-source data analysis
-and manipulation tool, built on top of the Python programming language.
-
-Pandas (stylized as pandas) is a software library written for the Python programming
-language for data manipulation and analysis. In particular, it offers data structures
-and operations for manipulating numerical tables and time series.
+[pandas] is a fast, powerful, flexible, and easy-to-use open-source data analysis
+and manipulation tool, built on top of the Python programming language. It offers
+data structures and operations for manipulating numerical tables and time series.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/connect/df/index.md (1 hunks)
- docs/integrate/dask/usage.md (1 hunks)
- docs/integrate/pandas/efficient-ingest.md (1 hunks)
- docs/integrate/pandas/index.md (1 hunks)
- docs/integrate/pandas/tutorial-jupyter.md (1 hunks)
- docs/integrate/pandas/tutorial-start.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/integrate/pandas/tutorial-start.md
43-43: Spaces inside code span elements
(MD038, no-space-in-code)
docs/integrate/pandas/tutorial-jupyter.md
43-43: Spaces inside code span elements
(MD038, no-space-in-code)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build docs
🔇 Additional comments (20)
docs/integrate/dask/usage.md (1)
2-2: LGTM! Documentation anchor added for cross-referencing.

The new anchor `(dask-efficient-ingest)=` enables consistent cross-referencing across the documentation, complementing the efficient ingestion guides being added for pandas.

docs/integrate/pandas/efficient-ingest.md (3)
1-18: LGTM! Clear introduction and prerequisites.

The guide provides a solid foundation for understanding bulk inserts and correctly references `sqlalchemy-cratedb` as a prerequisite.
42-54: LGTM! Practical performance guidance.

The performance observations and chunksize tuning advice provide valuable, actionable guidance for users working with large datasets.
55-60: LGTM! Clear conclusion.

The conclusion appropriately wraps up the tutorial and provides helpful links to further resources.
docs/integrate/pandas/tutorial-start.md (3)
1-19: LGTM! Clear introduction and specific requirements.

The tutorial provides a solid foundation, and the specific version requirements (CrateDB 5.2, pandas 2, crate 0.31) help users ensure compatibility.
20-41: LGTM! Valid SQL schema.

The `CREATE TABLE` statement uses appropriate data types for the customer dataset fields.
51-68: LGTM! Correct pandas query pattern.

The `pd.read_sql()` usage correctly demonstrates connecting to CrateDB and loading data into a DataFrame.
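As a rough illustration of that pattern (connection URL, table name, and query are placeholders, not taken from the tutorial):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder URL; the crate dialect speaks to CrateDB's HTTP endpoint.
engine = create_engine("crate://localhost:4200")

# read_sql executes the query on CrateDB and returns the result as a DataFrame.
df = pd.read_sql("SELECT * FROM customers LIMIT 100", con=engine)
print(df.head())
```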
docs/integrate/pandas/tutorial-jupyter.md (6)
1-52: LGTM! Comprehensive setup instructions.

The tutorial provides clear, step-by-step instructions for setting up CrateDB Cloud, Python, and Jupyter Notebook.

88-117: LGTM! Web scraping implementation is correct.

The function properly handles Wikipedia HTML parsing and ticker symbol formatting for Yahoo Finance compatibility.

161-187: LGTM! Financial data download implementation is correct.

The function properly uses yfinance, handles DataFrame construction, and formats columns appropriately for CrateDB insertion.
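A minimal sketch of that kind of download-and-reshape step; the example tickers and period are arbitrary, and only the column names (closing_date, ticker, close_value) are taken from the review above — this is not the tutorial's actual function:

```python
import yfinance as yf

# Placeholder tickers and period; the tutorial's own function differs.
raw = yf.download(["AAPL", "MSFT"], period="5d")["Close"]

# Reshape wide data (one column per ticker) into long rows of
# (closing_date, ticker, close_value), matching the target table layout.
tidy = (
    raw.reset_index()
    .melt(id_vars="Date", var_name="ticker", value_name="close_value")
    .rename(columns={"Date": "closing_date"})
)
print(tidy.head())
```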
207-216: LGTM! Standard database connection pattern.

The psycopg2 connection code appropriately uses placeholders for sensitive credentials.
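For orientation, a sketch of such a connection; the hostname, credentials, and sslmode choice are placeholder assumptions for a CrateDB Cloud cluster reached over the PostgreSQL wire protocol:

```python
import psycopg2 as ps

# Placeholder credentials; CrateDB exposes the PostgreSQL wire protocol on port 5432.
conn = ps.connect(
    host="<your-cluster>.cratedb.net",
    port=5432,
    user="admin",
    password="<password>",
    sslmode="require",
)
conn.autocommit = True
cursor = conn.cursor()
```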
230-234: LGTM! Table creation function is appropriate for tutorial context.

The function correctly uses double quotes for identifier quoting. While the table name is concatenated directly into the SQL statement, this is acceptable for tutorial code where the table name is controlled by the script author rather than external input.
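A sketch of what such a helper could look like; the column names come from the review above, while the column types and the surrounding `cursor` object are assumptions:

```python
def create_table(table_name):
    # Double quotes preserve the identifier exactly as written; acceptable here
    # because the table name is set by the script author, not by external input.
    stmt = (
        'CREATE TABLE IF NOT EXISTS "{}" '
        "(closing_date TIMESTAMP, ticker TEXT, close_value DOUBLE PRECISION)"
    ).format(table_name)
    cursor.execute(stmt)
```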
446-511: LGTM! Effective tutorial conclusion with practical examples.

The `update_table` function effectively orchestrates all the helper functions, and the final test section provides a complete end-to-end example with query demonstrations.

docs/connect/df/index.md (1)
14-17: LGTM! Consistent documentation structure.

The change appropriately centralizes pandas documentation by replacing inline content with a reference to the dedicated pandas integration page, maintaining consistency with the Dask and Polars sections.
docs/integrate/pandas/index.md (6)
1-9: LGTM! The page anchor, heading, and floating logo are correctly structured using MyST syntax.

21-36: LGTM! The Data Model section provides a clear, well-structured overview of pandas' core concepts relevant to CrateDB integration.

38-46: LGTM! The Learn section properly uses MyST cross-reference syntax and includes relevant tutorials and external resources.

48-50: LGTM! The Code examples section correctly links to the GitHub examples repository.

52-58: LGTM! The toctree directive is correctly structured with appropriate depth and hidden navigation for the three tutorial pages.
61-63: External links verified: All referenced URLs return HTTP 200 and correctly point to the intended resources.
The reviewed snippet from docs/integrate/pandas/tutorial-jupyter.md:

```python
import math

def insert_values(table_name, data):

    values_array = []

    # adding each closing date, ticker and close value tuple to values array
    for row in range(len(data)):

        # saving entries from the ith row as a list of date values
        row_values = data.iloc[row, :]

        # checking if there is a NaN entry and setting it to -1
        close_value = row_values['close_value']
        if (math.isnan(close_value)):
            close_value = -1

        # formatting date entries to match timestamp format
        closing_date = row_values['closing_date'].strftime("%Y-%m-%d")
        closing_date = "'{}'".format(
            closing_date + "{time}".format(time="T00:00:00Z"))

        # putting a comma between values tuples, but not on the last tuple
        values_array.append("({},\'{}\',{})".format(
            closing_date, row_values['ticker'], close_value))
```

* After all the row values have been added to the `values_array` list, I create a new table with the specified name (if it does not already exist) using the `create_table` function.
* Then I create the first part of the SQL `INSERT` statement, which includes the table name and the column names we insert into (`closing_date`, `ticker`, and `close_value`). This part of the statement is stored in the `insert_stmt` variable.
* Next, I add the values tuples from `values_array` to the `insert_stmt`, separated by commas. The final SQL `INSERT` statement is created by concatenating the `insert_stmt` variable and a semicolon at the end.

```python
# creates a new table (in case it does not exist yet)
create_table(table_name)

# first part of the insert statement
insert_stmt = "INSERT INTO \"{}\" (closing_date, ticker, close_value) VALUES ".format(
    table_name)

# adding data tuples to the insert statement
insert_stmt += ", ".join(values_array) + ";"
```

* Finally, the function executes the `INSERT` statement using the `cursor.execute()` method, and prints out a message indicating how many rows were inserted into the table.

```python
cursor.execute(insert_stmt)

print("Inserted " + str(len(data)) + " rows in CrateDB")
```

In summary, in `insert_values`, I take the table name and the data, format the data into a SQL `INSERT` statement, and insert the data into the specified table.

This is what the complete function looks like:

```python
import math

def insert_values(table_name, data):

    values_array = []

    # adding each closing date, ticker and close value tuple to values array
    for row in range(len(data)):

        # saving entries from the ith row as a list of date values
        row_values = data.iloc[row, :]

        # checking if there is a NaN entry and setting it to -1
        close_value = row_values['close_value']
        if (math.isnan(close_value)):
            close_value = -1

        # formatting date entries to match timestamp format
        closing_date = row_values['closing_date'].strftime("%Y-%m-%d")
        closing_date = "'{}'".format(
            closing_date + "{time}".format(time="T00:00:00Z"))

        # putting a comma between values tuples, but not on the last tuple
        values_array.append("({},\'{}\',{})".format(
            closing_date, row_values['ticker'], close_value))

    # creates a new table (in case it does not exist yet)
    create_table(table_name)

    # first part of the insert statement
    insert_stmt = "INSERT INTO \"{}\" (closing_date, ticker, close_value) VALUES ".format(
        table_name)

    # adding data tuples to the insert statement
    insert_stmt += ", ".join(values_array) + ";"

    cursor.execute(insert_stmt)

    print("Inserted " + str(len(data)) + " rows in CrateDB")
```
SQL injection risk: Use parameterized queries instead of string concatenation.

The insert_values function builds SQL INSERT statements using string formatting without proper parameterization. While this may work for the tutorial's trusted data source (Yahoo Finance), users adapting this code for other sources could face SQL injection vulnerabilities.

Specifically, at line 289, the ticker value is directly interpolated into the SQL string. If a ticker contained a single quote (e.g., O'REILLY), it would break the SQL statement.

Consider adding a note in the documentation recommending parameterized queries for production use:

> **Note for Production Use**: This tutorial uses string concatenation for simplicity.
> For production systems, especially when handling untrusted data sources, use parameterized
> queries via `cursor.executemany()` or similar methods to prevent SQL injection vulnerabilities.

Alternatively, you could demonstrate the safer approach using executemany:
```python
def insert_values(table_name, data):
    create_table(table_name)

    # Prepare data as list of tuples
    values = []
    for row in range(len(data)):
        row_values = data.iloc[row, :]
        close_value = row_values['close_value'] if not math.isnan(row_values['close_value']) else -1
        closing_date = row_values['closing_date']
        values.append((closing_date, row_values['ticker'], close_value))

    # Use parameterized query
    insert_stmt = f'INSERT INTO "{table_name}" (closing_date, ticker, close_value) VALUES (%s, %s, %s)'
    cursor.executemany(insert_stmt, values)
    print(f"Inserted {len(data)} rows in CrateDB")
```

🤖 Prompt for AI Agents
In docs/integrate/pandas/tutorial-jupyter.md around lines 260 to 369, the
insert_values example builds SQL via string concatenation which allows SQL
injection (ticker values can break the statement); update the doc to warn
readers and replace the example with a safe parameterized approach: keep the
create_table call, build a list of tuples (closing_date, ticker, close_value)
while converting NaNs to -1 and keeping closing_date as a datetime or ISO
string, then use a parameterized INSERT statement with cursor.executemany()
(e.g. VALUES (%s, %s, %s)) instead of formatting values into the SQL string;
also add the suggested “Note for Production Use” admonition about parameterized
queries for untrusted data.
@coderabbitai: Please carry this over into a dedicated issue. Thanks.
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Create a GitHub issue for the SQL injection security concern
gh issue create \
--repo crate/cratedb-guide \
--title "Security: SQL injection risk in pandas Jupyter tutorial" \
--body "## Description
The \`insert_values\` function in the pandas Jupyter tutorial (\`docs/integrate/pandas/tutorial-jupyter.md\`, lines 260-369) builds SQL INSERT statements using string concatenation without proper parameterization. This creates potential SQL injection vulnerabilities when users adapt the code for other data sources.
## Problem
At line 289, ticker values are directly interpolated into the SQL string:
\`\`\`python
values_array.append(\"({},\\'{}\\'',{})\".format(
closing_date, row_values['ticker'], close_value))
\`\`\`
If a ticker contained a single quote (e.g., \`O'REILLY\`), it would break the SQL statement and could allow injection attacks with untrusted data sources.
## Suggested Solution
1. Add a warning note about the security implications for production use
2. Demonstrate the safer approach using parameterized queries with \`cursor.executemany()\`
Example safe implementation:
\`\`\`python
def insert_values(table_name, data):
create_table(table_name)
# Prepare data as list of tuples
values = []
for row in range(len(data)):
row_values = data.iloc[row, :]
close_value = row_values['close_value'] if not math.isnan(row_values['close_value']) else -1
closing_date = row_values['closing_date']
values.append((closing_date, row_values['ticker'], close_value))
# Use parameterized query
insert_stmt = f'INSERT INTO \"{table_name}\" (closing_date, ticker, close_value) VALUES (%s, %s, %s)'
cursor.executemany(insert_stmt, values)
print(f\"Inserted {len(data)} rows in CrateDB\")
\`\`\`
## References
- Original discussion: https://github.com/crate/cratedb-guide/pull/297#discussion_r2392441430
- File: \`docs/integrate/pandas/tutorial-jupyter.md\`
## Requested by
@amotl" \
--assignee amotl
```
Unrecoverable error: Request timed out after 300000ms (requestId=cec6e24b-9a18-4cf8-95ed-698bdec4fa1a)
🏁 Script executed:
```bash
#!/bin/bash
# Retry creating the GitHub issue for SQL injection security concern
gh issue create \
--repo crate/cratedb-guide \
--title "Security: SQL injection risk in pandas Jupyter tutorial" \
--assignee amotl \
--body "## Description
The insert_values function in the pandas Jupyter tutorial (docs/integrate/pandas/tutorial-jupyter.md, lines 260-369) builds SQL INSERT statements using string concatenation without proper parameterization. This creates potential SQL injection vulnerabilities when users adapt the code for other data sources.
## Problem
At line 289, ticker values are directly interpolated into the SQL string. If a ticker contained a single quote (e.g., O'REILLY), it would break the SQL statement and could allow injection attacks with untrusted data sources.
## Suggested Solution
1. Add a warning note about the security implications for production use
2. Demonstrate the safer approach using parameterized queries with cursor.executemany()
Use parameterized queries instead of string concatenation to prevent SQL injection.
## References
- Original discussion: https://github.com/crate/cratedb-guide/pull/297#discussion_r2392441430
- File: docs/integrate/pandas/tutorial-jupyter.md
Requested by @amotl"
```
Length of output: 49
@amotl Done! I've created issue #354 to track the SQL injection security concern in the pandas Jupyter tutorial. The issue includes:
- Description of the problem with string concatenation in the `insert_values` function
- The security implications for production use
- Suggested solution using parameterized queries with `cursor.executemany()`
- References back to this PR and comment thread
You can view the issue here: #354
Actionable comments posted: 0
♻️ Duplicate comments (2)
docs/integrate/pandas/tutorial-jupyter.md (2)
255-346: SQL injection risk already flagged.

This code segment was previously flagged for SQL injection vulnerabilities due to string concatenation in SQL statement construction. The user has acknowledged this and requested it be tracked separately.
358-380: Datetime import must appear before first usage.

The `select_last_inserted_date` function uses `datetime.strptime()` at line 374, but the `datetime` import is not shown until line 384 (in the context of the next function). Users following the tutorial sequentially will encounter a `NameError`.

The datetime import should be added at the beginning of the notebook along with other imports (after line 147 or in a consolidated imports section). If showing imports incrementally for pedagogical reasons, the import must appear before line 358.
🧹 Nitpick comments (3)
docs/integrate/pandas/tutorial-jupyter.md (3)
53-53: Fix bare URL to use proper Markdown link syntax.

The bare URL at line 53 should be formatted as a proper Markdown link for better accessibility and rendering.

Apply this diff:

```diff
-Now I can access my Jupyter Notebook by opening the URL printed in the terminal after running this last command. In my case, it is at http://localhost:8888/
+Now I can access my Jupyter Notebook by opening the URL printed in the terminal after running this last command. In my case, it is at <http://localhost:8888/>
```
145-148: Specify language identifier for code fence.

The code fence starting at line 145 is missing a language identifier, which should be `python` to enable syntax highlighting.

Apply this diff:

```diff
-```
+```python
 import yfinance as yf
 import pandas as pd
```

282-282: Remove trailing space in inline code.

There's an extra space before the closing backtick in `` `insert_stmt ` `` at line 282.

Apply this diff:

```diff
-* Next, I add the values tuples from `values_array` to the `insert_stmt `, separated by commas. The final SQL `INSERT` statement is created by concatenating the `insert_stmt `variable and a semicolon at the end.
+* Next, I add the values tuples from `values_array` to the `insert_stmt`, separated by commas. The final SQL `INSERT` statement is created by concatenating the `insert_stmt` variable and a semicolon at the end.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- docs/integrate/pandas/index.md (1 hunks)
- docs/integrate/pandas/tutorial-jupyter.md (1 hunks)
- docs/integrate/pandas/tutorial-start.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/integrate/pandas/index.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/integrate/pandas/tutorial-jupyter.md
53-53: Bare URL used
(MD034, no-bare-urls)
145-145: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
239-239: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
282-282: Spaces inside code span elements
(MD038, no-space-in-code)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build docs
🔇 Additional comments (2)
docs/integrate/pandas/tutorial-jupyter.md (1)
1-481: LGTM with minor improvements needed.

This is a comprehensive, well-structured tutorial that effectively demonstrates end-to-end data collection and storage workflows. The step-by-step explanations and code examples provide clear guidance for users learning to integrate pandas, yfinance, and CrateDB.
The main issues to address are:
- The datetime import ordering (critical - already flagged)
- SQL injection pattern (already flagged for separate tracking)
- Minor markdown formatting issues (optional improvements)
Consider adding a consolidated imports section early in the tutorial to help users ensure all dependencies are installed upfront:

```python
# Required installations:
# pip install beautifulsoup4 requests yfinance pandas psycopg2-binary

import requests
from bs4 import BeautifulSoup
import yfinance as yf
import pandas as pd
import psycopg2 as ps
import math
from datetime import datetime, timedelta
```

docs/integrate/pandas/tutorial-start.md (1)
1-95: LGTM! Well-structured starter tutorial.

This tutorial provides a clear, concise introduction to integrating CrateDB with pandas. The code examples are correct and follow a logical progression from setup through data loading, querying, analysis, and visualization. All previously flagged issues (variable naming, method calls, formatting) have been resolved.
The tutorial effectively demonstrates:
- CrateDB setup options
- Table creation and bulk data import with COPY FROM
- Reading data into pandas DataFrames
- Performing aggregations with groupby
- Data visualization with matplotlib
Minor note: Users should be aware that the `COPY FROM 'file:///path/to/Customers.csv'` command requires the file to be accessible to the CrateDB server, not just the client machine. This is correctly documented but worth emphasizing for users new to CrateDB.
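For illustration, the same statement can be issued from Python with the crate client; the host and table name are placeholder assumptions, and the CSV path must be readable by the CrateDB server process:

```python
from crate import client

connection = client.connect("http://localhost:4200")
cursor = connection.cursor()

# The file URI is resolved on the CrateDB server, not on the client machine.
cursor.execute("COPY customers FROM 'file:///path/to/Customers.csv'")

# Make the newly imported rows visible to subsequent queries.
cursor.execute("REFRESH TABLE customers")
```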
About
Continue adding integration guides from the community forum.