
Refactor/hf df functions #33

Merged
jpcompartir merged 25 commits into main from refactor/hf_df_functions
Nov 20, 2025

Conversation

@jpcompartir
Owner

Version bump to 0.1.2

Release notes:

EndpointR 0.1.2

  • File writing improvements: hf_embed_df() and hf_classify_df() now write intermediate results as .parquet files to output_dir directories, similar to improvements in 0.1.1 for OpenAI functions

  • Parameter changes: Renamed the batch_size argument to chunk_size across hf_embed_df(), hf_classify_df(), and oai_complete_df() for consistency

  • New chunking functions: Introduced hf_embed_chunks() and hf_classify_chunks() for more efficient batch processing with better error handling

  • Dependency update: Package now depends on arrow for faster .parquet file writing and reading

  • Metadata tracking: Hugging Face functions that write to files (hf_embed_df(), hf_classify_df(), hf_embed_chunks(), hf_classify_chunks()) now write metadata.json to output directories containing:

    • Endpoint URL and API key name used
    • Processing parameters (chunk_size, concurrent_requests, timeout, max_retries)
    • Inference parameters (truncate, max_length)
    • Timestamp and row counts
    • Useful for debugging, reproducibility, and tracking which models/endpoints were used
  • max_length parameter: Added max_length parameter to hf_classify_df() and hf_classify_chunks() for text truncation control. Note: hf_embed_df() handles truncation automatically via endpoint configuration (set AUTO_TRUNCATE in endpoint settings)

  • New utility functions:

    • hf_get_model_max_length() - Retrieve maximum token length for a Hugging Face model
    • hf_get_endpoint_info() - Retrieve detailed information about a Hugging Face Inference Endpoint
  • Improved reporting: Chunked/batch processing functions now report total successes and failures at completion
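The chunked file-writing workflow described above can be sketched in R as follows. This is a hedged illustration, not the package's documented interface: chunk_size, concurrent_requests, and output_dir come from the notes, but the text_var argument name and the defaults shown here are assumptions.

```r
# Sketch: embed a data frame in chunks, writing intermediate .parquet
# files plus metadata.json to output_dir (argument names beyond those
# in the release notes are illustrative assumptions)
library(EndpointR)

results <- hf_embed_df(
  df = reviews,                 # a data frame with a text column
  text_var = text,              # column to embed (assumed argument name)
  endpoint_url = Sys.getenv("HF_ENDPOINT_URL"),
  key_name = "HF_API_KEY",      # name of the env var holding the API key
  chunk_size = 256,             # rows per chunk (formerly batch_size)
  concurrent_requests = 4,
  output_dir = "embeddings_out" # .parquet chunks + metadata.json land here
)

# intermediate chunks are .parquet files; read them back with arrow:
embeddings <- arrow::open_dataset("embeddings_out") |> dplyr::collect()

# metadata.json records the endpoint URL, key name, processing and
# inference parameters, timestamp, and row counts:
meta <- jsonlite::fromJSON(file.path("embeddings_out", "metadata.json"))
```

Reading the chunks back through arrow::open_dataset() rather than individual read calls is one way to reassemble the output; the metadata.json read is useful for checking which endpoint and parameters produced a given directory of results.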

pass output_file to hf_embed_chunks from inside hf_embed_df to fix the filename issue
…r debugging people's code/errors (including my own)
add max_length to hf_classify_df and hf_classify_chunks
move hf_classify_df over to hf_classify_chunks not hf_classify_batch

remove old comments from hf_embed_df
… is to turn on 'AUTO_TRUNCATE' in the setup of the endpoint
build rd files
add comma in embed test
add test/dev docs to .Rbuildignore
@jpcompartir jpcompartir merged commit 3531dd6 into main Nov 20, 2025
1 check passed
@jpcompartir jpcompartir deleted the refactor/hf_df_functions branch December 3, 2025 17:10
