Only submit changes/deltas when processing bulk requests #191

Open
kiivihal wants to merge 7 commits into main from feature/bulk-delta-storage
Conversation

@kiivihal
Member

Processing identical data puts a heavy load on downstream services such as the Elasticsearch index and the Fuseki triplestore. This pull request changes the bulk indexing workflow to first store the data in blob storage (MinIO) and to track the records and their changes in an embedded database (DuckDB). After the final call to drop-orphans, the changes are calculated and submitted for processing.
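As a rough illustration of the change tracking described above, here is a minimal sketch in Go. It is not the code in this pull request: the `tracker` and `recordState` types and their methods are hypothetical, and an in-memory map stands in for the embedded database. It assumes a record counts as changed when its content hash differs from the stored one, and as an orphan when it was not seen in the current revision.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// recordState is a hypothetical stand-in for one row in the tracking database.
type recordState struct {
	Hash     string // content hash of the last stored payload
	Revision int    // dataset revision in which the record was last seen
}

// tracker is a hypothetical in-memory replacement for the embedded database.
type tracker struct {
	revision int
	records  map[string]recordState
	changed  map[string]bool
}

func newTracker() *tracker {
	return &tracker{
		records: make(map[string]recordState),
		changed: make(map[string]bool),
	}
}

// startRevision begins a new bulk run and resets the set of changed records.
func (t *tracker) startRevision() {
	t.revision++
	t.changed = make(map[string]bool)
}

// observe registers one record from a bulk request and reports whether its
// content differs from the previously stored version.
func (t *tracker) observe(id string, payload []byte) bool {
	sum := sha256.Sum256(payload)
	hash := hex.EncodeToString(sum[:])

	prev, seen := t.records[id]
	isChanged := !seen || prev.Hash != hash
	t.records[id] = recordState{Hash: hash, Revision: t.revision}
	if isChanged {
		t.changed[id] = true
	}
	return isChanged
}

// finish plays the role of the final drop-orphans call: it returns the ids
// that need to be submitted downstream and the ids that were not seen in the
// current revision. Orphans are removed from the tracker.
func (t *tracker) finish() (changed, orphans []string) {
	for id := range t.changed {
		changed = append(changed, id)
	}
	for id, st := range t.records {
		if st.Revision < t.revision {
			orphans = append(orphans, id)
			delete(t.records, id)
		}
	}
	sort.Strings(changed)
	sort.Strings(orphans)
	return changed, orphans
}
```

With this shape, resubmitting an unchanged dataset yields an empty `changed` list, so nothing is sent to the index or the triplestore.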

For SPARQL inserts and updates, we remove the 'drop' statement for each named graph and replace it with a delta insert statement, which deletes only the triples that were removed and adds the new ones.

@gitguardian

gitguardian bot commented Apr 25, 2023

⚠️ GitGuardian has uncovered 4 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | Secret | Commit | Filename |
| --- | --- | --- | --- |
| 6465341 | Generic High Entropy Secret | edb25ac | docker-compose.yml |
| 6465341 | Generic High Entropy Secret | edb25ac | hub3.toml |
| 6465341 | Generic High Entropy Secret | b8ae4ce | hub3.toml |
| 6753863 | Redis Server Password | 81ac24d | docker-compose.yml |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, following best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


feat(go.mod): add oklog/ulid v1.3.1 dependency
chore(hub3.toml): change minio endpoint port from 9000 to 9010
refactor(hub3/fragments/graph.go): modify Reader method to return the length of the byte array

refactor(rdfstream.go): reorganize import statements
feat(rdfstream.go): add HubID and OrgID to the fragment metadata in IndexFragments() function

refactor(resource.go): fix typo in CreateDateRange error message
refactor(resource.go): rename year variable to date in padYears function
refactor(resource.go): rename formattedDate variable in padYears function
refactor(resource.go): fix typo in hyphenateDate error message
refactor(resource.go): rename splitDate function to splitPeriod for clarity

refactor(resource.go): improve error messages in SetContextLevels and NewResourceMap functions
fix(resource.go): fix typo in error message in SetContextLevels function

refactor(sparql.go): add omitempty to SparqlUpdate struct fields
feat(config/bulk.go): add StoreRequests field to Bulk struct
feat(config/elasticsearch.go): add LogRequests field to BulkConfig struct

feat(handle_upload.go): add GetGraph method to Service struct
feat(options.go): add SetLogRequests option to Service struct

feat(parser.go): add support for logging raw requests in bulk parser service

feat(parser.go): add support for storing bulk request to disk for debugging
feat(parser.go): add support for storing graphs to MinIO
fix(parser.go): fix variable naming inconsistency in setDataSet function

refactor(parser.go): remove unused code and comments
feat(parser.go): add HubID and OrgID to RDF bulk request
fix(parser.go): use IterTriplesOrdered instead of IterTriples to serialize triples in order

refactor(service.go): reformat code for better readability
feat(service.go): add logRequests boolean option to NewService function
refactor(config): remove unused SparqlUsername and SparqlPassword fields from RDF configuration
feat(config): enable storing only changed triples in the triple store
feat(bulk): add support for storing RDF data in Redis for delta updates
feat(bulk): add support for finding and dropping orphaned graphs in Redis for delta updates
refactor(bulk): remove unused SetDBPath option

refactor(parser.go): remove unused imports and variables
feat(parser.go): add setUpdateDataset and dataset methods to safely access and modify the dataset
feat(parser.go): add storeGraphDeltasOld method to be removed later
feat(parser.go): add storeGraphDeltas method to store graph deltas in redis and S3
feat(parser.go): add dropGraphOrphans method to drop orphan graphs from redis and triple store
feat(parser.go): add incrementRevision method to increment the revision of the dataset
feat(parser.go): add process method to process requests and increment revisions
feat(parser.go): add stats field to Stats struct to track graphs stored

refactor(service.go): remove unused imports and variables
feat(service.go): add Redis support to the bulk service to store and retrieve data
… project up to date and reduce clutter

chore(go.sum): update dependencies
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 6 (rating A)

Coverage: no coverage information
Duplication: 0.0%
