Explicit auth and emulating Google Cloud Storage as object storage for Iceberg #26854

@szisiu

Our development and test environments use MinIO as an S3-compatible object store for Iceberg, while production uses GCS. As a result, the development setup has to bundle heavy Amazon S3 libraries for the Hive Metastore (HMS). This adds unnecessary dependencies, increases build size, and makes development diverge from production.

We could replace MinIO with a lightweight, GCS-compatible emulator to better align our development environment with production. The recommended solution is fsouza/fake-gcs-server, which can be easily integrated into our existing containerized setup.

The new Trino native GCS file system (fs.native-gcs.enabled=true) does not provide a mechanism to disable authentication, making it impossible to use with local, unauthenticated GCS emulators like fake-gcs-server.

Even when gcs.endpoint is configured to point to a local emulator, the underlying Google Cloud client library still attempts a real OAuth2 token exchange with the public Google endpoint (https://oauth2.googleapis.com/token). A dummy credential file has to be provided to avoid an "ADC not found" error, but the real authentication service rejects that dummy credential, so the request fails.

There should be a way to use the recommended native client for local testing without it attempting to contact external authentication services.

Environment

  • Trino Version: latest
  • Connector: Iceberg (iceberg.catalog.type=hive_metastore)
  • Storage Emulator: fsouza/fake-gcs-server
  • Metastore: Hive Metastore 4.1
  • Setup: Docker Compose

Configuration

  1. docker-compose.yml template:

    services:
      gcs-emulator:
        image: fsouza/fake-gcs-server
        container_name: gcs-emulator
        ports:
          - "4443:4443"
        command: -scheme http -public-host gcs-emulator:4443
    
      mysql-db:
        image: mysql:8.0
        container_name: mysql-db
        environment:
          - MYSQL_ROOT_PASSWORD=secret
          - MYSQL_DATABASE=metastore
          - MYSQL_USER=hive
          - MYSQL_PASSWORD=hive
    
      hive-metastore:
        image: apache/hive:4.0.0
        container_name: hive-metastore
        depends_on: [mysql-db]
        ports: ["9083:9083"]
        volumes:
          - ./hive/conf/metastore-site.xml:/opt/hive/conf/metastore-site.xml
    
      trino:
        image: trinodb/trino:latest
        container_name: trino
        ports: ["8080:8080"]
        volumes:
          - ./trino/catalog:/etc/trino/catalog
  2. metastore-site.xml:

    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://mysql-db:3306/metastore</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.cj.jdbc.Driver</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>hive</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>hive</value>
        </property>
        <property>
            <name>fs.gs.project.id</name>
            <value>dummy</value> 
        </property>
        <property>
            <name>fs.gs.auth.type</name>
            <value>NONE</value> <!-- or UNAUTHENTICATED, see https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v3.1.8/gcs/CONFIGURATION.md#user-credentials -->
        </property>
        <property>
            <name>google.cloud.storage.api.endpoint</name>
            <value>http://gcs-emulator:4443</value>
        </property>
    </configuration>
  3. iceberg.properties (the problematic configuration):

    connector.name=iceberg
    iceberg.catalog.type=hive_metastore
    hive.metastore.uri=thrift://hive-metastore:9083
    
    # Configuration for Trino's Native GCS Client, see: https://trino.io/docs/current/object-storage/file-system-gcs.html#general-configuration
    fs.native-gcs.enabled=true
    gcs.endpoint=http://gcs-emulator:4443
    
    # A dummy key file is required to prevent "ADC not found" errors,
    # but this key triggers a real auth attempt.
    # The content of the file is a syntactically valid but fake key.
    gcs.json-key-file-path=/etc/trino/dummy-credentials.json 

    (Note: A dummy credentials file has to be created and mounted into the Trino container at this path; the docker-compose.yml template above does not include that mount. Even with the file in place, the process fails.)

    # Generate the key
    openssl genrsa -out private_key.pem 2048
    openssl pkcs8 -topk8 -inform PEM -outform PEM -nocrypt -in private_key.pem
    
    # Copy the PKCS#8 output into the "private_key" field of the JSON file,
    # replacing each newline with \n
  4. Start the services and prepare the emulator:

    # Start all containers
    docker-compose up -d
    
    # Create a bucket in the emulator
    curl -X POST -H "Content-Type: application/json" \
      --data '{"name": "test-bucket"}' \
      http://localhost:4443/storage/v1/b
  5. Run the failing SQL command via the Trino CLI or any client:

    CREATE SCHEMA iceberg.test_schema
    WITH ( location = 'gs://test-bucket/test_schema' );
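
For step 3, the manual newline escaping can be left to a JSON serializer. The following is a minimal sketch, assuming the usual service-account key layout. The function name `build_dummy_credentials` and every placeholder value (key ID, client e-mail, client ID) are made up for illustration and do not correspond to a real account.

```python
import json


def build_dummy_credentials(private_key_pem: str, project_id: str = "dummy") -> str:
    """Return service-account-style JSON text wrapping a throwaway private key.

    All identifiers below are placeholders (assumptions), not real account data.
    """
    # Normalize to exactly one trailing newline, as PEM files usually end.
    key = private_key_pem.strip() + "\n"
    doc = {
        "type": "service_account",
        "project_id": project_id,
        "private_key_id": "0" * 40,  # placeholder
        "private_key": key,  # json.dumps turns newlines into \n escapes
        "client_email": f"fake@{project_id}.iam.gserviceaccount.com",  # placeholder
        "client_id": "000000000000000000000",  # placeholder
        "token_uri": "https://oauth2.googleapis.com/token",
    }
    return json.dumps(doc, indent=2)
```

Passing the PEM text printed by the `openssl pkcs8` command and writing the returned string to `dummy-credentials.json` should produce a file the client can parse. As described in this report, such a file still triggers the real token exchange.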

Actual Result

The query fails with an error indicating an authentication failure with the real Google OAuth2 endpoint.

Query failed: GCS service error listing files: gs://test-bucket/test_schema.db/

...
Caused by: com.google.cloud.storage.StorageException: Error getting access token for service account: 400 Bad Request
POST https://oauth2.googleapis.com/token
{"error":"invalid_grant","error_description":"Invalid grant: account not found"}
...

Analysis and Suggested Solution

The root cause is that the native GCS client's authentication logic is tightly coupled with the official Google authentication libraries, which do not recognize the gcs.endpoint property for authentication calls. The client sees a service account key and immediately attempts to exchange it for a real token.

A new configuration property is needed to explicitly disable this behavior for local testing. I suggest adding a property like gcs.authentication-type with a possible value of NONE.
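
To make the intended behavior concrete, here is a toy model of which hosts the client ends up contacting for a given catalog configuration. It is not Trino code: `parse_properties` and `hosts_contacted` are hypothetical helpers, and `gcs.authentication-type` is the property proposed in this issue, not an existing Trino option.

```python
from urllib.parse import urlsplit

# Host observed in the stack trace above.
GOOGLE_TOKEN_HOST = "oauth2.googleapis.com"


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def hosts_contacted(props):
    """Toy model: return the set of hosts a client would call."""
    hosts = set()
    # Storage calls follow gcs.endpoint when it is set.
    endpoint = props.get("gcs.endpoint", "https://storage.googleapis.com")
    hosts.add(urlsplit(endpoint).hostname)
    # Hypothetical property from this proposal; anything other than NONE
    # models today's behavior, where the token exchange ignores gcs.endpoint.
    auth_type = props.get("gcs.authentication-type", "DEFAULT").upper()
    if auth_type != "NONE":
        hosts.add(GOOGLE_TOKEN_HOST)
    return hosts
```

With the configuration from this report, the model yields both `gcs-emulator` and `oauth2.googleapis.com`; adding `gcs.authentication-type=NONE` leaves only `gcs-emulator`, which is the outcome the proposal is after.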
