Pass s3:// file URLs directly to API in BedrockConverseModel
#3663
| format = item.media_type.split('/')[1] | ||
| assert format in ('jpeg', 'png', 'gif', 'webp'), f'Unsupported image format: {format}' | ||
| image: ImageBlockTypeDef = {'format': format, 'source': {'bytes': downloaded_item['data']}} | ||
| image: ImageBlockTypeDef = {'format': format, 'source': cast(Any, source)} |
Instead of casting this to Any, can we fix the source type hint to be DocumentSourceTypeDef?
Updated
| 'name': name, | ||
| 'format': item.format, | ||
| 'source': {'bytes': downloaded_item['data']}, | ||
| 'source': cast(Any, source), |
Same as above.
Updated
| format = downloaded_item['data_type'] | ||
| source: dict[str, Any] | ||
| if item.url.startswith('s3://'): | ||
| source = {'s3Location': {'uri': item.url}} |
There's also a bucketOwner field that users may want to set. Maybe we can tell them to encode it as a query param on the URL, and parse it out here?
Do you mean something like s3://my-bucket/key?bucketOwner=owner?
Yep that's what I was thinking
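The query-parameter convention discussed above could be parsed along these lines. This is only a minimal sketch using the standard library: the helper name `s3_location_from_url` is hypothetical, and stripping the query string from the `uri` before sending it is an assumption, not something the PR confirms.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def s3_location_from_url(url: str) -> dict[str, str]:
    """Build an s3Location-style dict from an s3:// URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != 's3':
        raise ValueError(f'Not an s3:// URL: {url!r}')
    # Assumption: the query string is only a carrier for bucketOwner,
    # so it is dropped from the URI itself.
    location = {'uri': urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))}
    owner = parse_qs(parts.query).get('bucketOwner')
    if owner:
        # parse_qs returns a list per key; take the first value.
        location['bucketOwner'] = owner[0]
    return location
```

Calling it with `s3://my-bucket/key?bucketOwner=owner` would yield a dict with both `uri` and `bucketOwner`, while a URL without the parameter yields only `uri`.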
| if item.url.startswith('s3://'): | ||
| source = {'s3Location': {'uri': item.url}} | ||
| else: | ||
| downloaded_item = await download_item(item, data_format='bytes', type_format='extension') |
download_item currently has logic gating for gs:// URLs; let's check s3:// URLs there as well
It seems the existing code in download_item checks for gs:// and youtube URLs:
    if item.url.startswith('gs://'):
        raise UserError('Downloading from protocol "gs://" is not supported.')
    elif isinstance(item, VideoUrl) and item.is_youtube:
        raise UserError('Downloading YouTube videos is not supported.')
What check do you mean for s3:// here?
The same kind of check, raising an error saying that download_item does not support s3:// URLs.
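A rough sketch of what such a guard might look like, written as a standalone function so it runs on its own. The `UserError` class here is a local stand-in for the library's exception, `ensure_downloadable` is a hypothetical name, and the s3:// message simply mirrors the existing gs:// wording as an assumption.

```python
class UserError(Exception):
    """Local stand-in for the library's UserError so this sketch is self-contained."""


def ensure_downloadable(url: str) -> None:
    """Raise if the URL uses a scheme that should not be downloaded (hypothetical helper)."""
    # Object-store schemes are rejected rather than fetched; the s3:// entry
    # is the addition being discussed in this thread.
    for scheme in ('gs://', 's3://'):
        if url.startswith(scheme):
            raise UserError(f'Downloading from protocol "{scheme}" is not supported.')
```

With this shape, an https:// URL passes through silently, while gs:// and s3:// URLs raise before any network request is attempted.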
Ah OK, so we need to stop supporting download for s3:// URLs altogether. Updated.
I kept the check pretty simple. I'm not sure whether we should go for proper URL parsing here, since the only expected query parameter is bucketOwner.
If it's not drastically more code, I'd prefer proper URL parsing
We should document this feature in input.md. At the bottom there's already a section on uploaded files to Google, can you mention S3 files + BedrockConverseModel there as well please?
@DouweM I have updated the doc. Added a paragraph for S3 + BedrockConverseModel. Also updated the section above to mention that S3 files will not be downloaded.
Looking at the doc, I am a bit confused. Given that we have updated download_item to skip downloading s3:// URLs altogether, won't that apply to all models? For example, if we pass an s3:// URL to a model that can't download the file itself, then because of our check Pydantic AI will also stop downloading it for that model, right?
I'm not sure my question is clear enough. Basically, in this PR we are passing the s3:// URL directly to BedrockConverseModel, but we have also updated the download_item function to raise an error if an s3:// URL is passed. This would mean we stop downloading from s3:// URLs for other models too, no?
| format = item.media_type.split('/')[1] | ||
| assert format in ('jpeg', 'png', 'gif', 'webp'), f'Unsupported image format: {format}' | ||
| image: ImageBlockTypeDef = {'format': format, 'source': {'bytes': downloaded_item['data']}} | ||
| image: ImageBlockTypeDef = {'format': format, 'source': cast(DocumentSourceTypeDef, source)} |
We shouldn't need to cast if we hint the type of source as source: DocumentSourceTypeDef
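To illustrate the point about annotating instead of casting, here is a small sketch. The `S3Location` and `Source` TypedDicts are local stand-ins defined only for this example (they are not the real boto3 stub types), and `build_source` is a hypothetical helper.

```python
from typing import TypedDict


class S3Location(TypedDict, total=False):
    # Stand-in for the S3 location shape used in this thread.
    uri: str
    bucketOwner: str


class Source(TypedDict, total=False):
    # Stand-in for a source that is either inline bytes or an S3 location.
    bytes: bytes
    s3Location: S3Location


def build_source(url: str, downloaded: bytes = b'') -> Source:
    # Declaring the variable's type up front lets a type checker validate
    # both dict literals below, so no cast(...) is needed at the use site.
    source: Source
    if url.startswith('s3://'):
        source = {'s3Location': {'uri': url}}
    else:
        source = {'bytes': downloaded}
    return source
```

Because `source` is declared with the TypedDict type before the branches, each assignment is checked against that type directly, which is the effect the review comment is asking for.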
Closes #3621