PineconePDFExtractor

PineconePDFExtractor is a Python library that extracts text from PDF files for use with Pinecone.

Installation

Use the package manager pip to install PineconePDFExtractor.

Google Colab

pip install PineconePDFExtractor

Check the latest version here:

https://pypi.org/project/PineconePDFExtractor/

Usage

from pdf.PineconePDFExtractor import PdfProcessor

# Create a PdfProcessor instance with a batch size of 200
extractor = PdfProcessor(200)

# Process a list of PDF files
result = extractor.process_files(['file1.pdf', 'file2.pdf'])

# The result is a dictionary with the batch size and a list of documents.
# Each document is a dictionary with these keys:
#   id       - file name without extension
#   metadata - dictionary holding the number of pages
#   source   - file path
#   text     - extracted text

# Example result
# {
#   'batch_size': 200,
#   'documents': [
#     {
#       'id': 'file1',
#       'metadata': {
#         'pages': 1
#       },
#       'source': 'file1.pdf',
#       'text': 'This is the extracted text from file1.pdf'
#     },
#     {
#       'id': 'file2',
#       'metadata': {
#         'pages': 2
#       },
#       'source': 'file2.pdf',
#       'text': 'This is the extracted text from file2.pdf'
#     }
#   ]
# }
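The result carries a batch size alongside the documents, so a natural next step is to split the documents into groups of that size before sending them anywhere. The sketch below shows one way to do that. It assumes only the result shape documented above; `iter_batches` is a hypothetical helper written for this example and is not part of PineconePDFExtractor, and the sample data is made up for illustration.

```python
from typing import Any, Dict, Iterator, List


def iter_batches(result: Dict[str, Any]) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most result['batch_size'] documents.

    Hypothetical helper: relies only on the 'batch_size' and
    'documents' keys shown in the example result.
    """
    size = result["batch_size"]
    if size <= 0:
        raise ValueError("batch_size must be positive")
    docs = result["documents"]
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


# Illustrative data in the documented shape (not real extractor output)
sample = {
    "batch_size": 2,
    "documents": [
        {
            "id": f"file{i}",
            "metadata": {"pages": 1},
            "source": f"file{i}.pdf",
            "text": "...",
        }
        for i in range(1, 6)
    ],
}

# Collect the document ids in each batch
batch_ids = [[doc["id"] for doc in batch] for batch in iter_batches(sample)]
print(batch_ids)
```

With a real result from `process_files`, each yielded batch could then be passed to whatever upload or indexing step comes next.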
