Why does querying this Parquet file report "Scanning of nested columns in Parquet files is disabled"? #102

@l1t1

Data: https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet (7,134,977,202 bytes)
I use the Python script from #78, modified to print the elapsed time of each query:

#!/usr/bin/env python3
import readline  # imported for its side effect: line editing and history in input()
import time
from argparse import ArgumentParser
from tableauhyperapi import HyperProcess, Connection, Telemetry, CreateMode, HyperException
# hyperapi-cli: an interactive HyperAPI SQL CLI.
#
# This script allows you to interactively execute SQL commands via HyperAPI.
#
# Usage:
#   ./hyperapi-cli.py [optional hyper database file]

def main():
    parser = ArgumentParser("HyperAPI interactive cli.")
    parser.add_argument("database", type=str, nargs='?',
                        help="A Hyper file to attach on startup")

    args = parser.parse_args()
    create_mode = CreateMode.CREATE_IF_NOT_EXISTS if args.database else CreateMode.NONE

    with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper_process:
        try:
            with Connection(hyper_process.endpoint, args.database, create_mode) as connection:
                while True:
                    try:
                        sql = input("> ")
                    except (EOFError, KeyboardInterrupt):
                        return
                    try:
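                        # perf_counter() is a monotonic clock suited to measuring intervals.
                        t = time.perf_counter()
                        with connection.execute_query(sql) as result:
                            print("\t".join(str(column.name)
                                  for column in result.schema.columns))
                            for row in result:
                                print("\t".join(str(column) for column in row))
                        # The elapsed time includes fetching and printing all rows.
                        print(round(time.perf_counter() - t, 3), "s\n")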
                    except HyperException as exception:
                        print(f"Error executing SQL: {exception}")
        except HyperException as exception:
            print(f"Unable to connect to the database: {exception}")


if __name__ == "__main__":
    main()

Query result:

> select count(*) from external('./hacknernews.parquet');
"count"
28737557
0.779 s

> select * from external('./hacknernews.parquet') limit 1;
Error executing SQL: Scanning of nested columns in Parquet files is disabled.
Hint: Do not select group column kids when scanning the file
Context: 0xfa6b0e2f

DuckDB can run select * on the same file:

D describe select * from 'd:/hacknernews.parquet';
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ BIGINT      │ YES     │         │         │         │
│ deleted     │ UTINYINT    │ YES     │         │         │         │
│ type        │ BLOB        │ YES     │         │         │         │
│ by          │ BLOB        │ YES     │         │         │         │
│ time        │ BIGINT      │ YES     │         │         │         │
│ text        │ BLOB        │ YES     │         │         │         │
│ dead        │ UTINYINT    │ YES     │         │         │         │
│ parent      │ BIGINT      │ YES     │         │         │         │
│ poll        │ BIGINT      │ YES     │         │         │         │
│ kids        │ BIGINT[]    │ YES     │         │         │         │
│ url         │ BLOB        │ YES     │         │         │         │
│ score       │ INTEGER     │ YES     │         │         │         │
│ title       │ BLOB        │ YES     │         │         │         │
│ parts       │ BIGINT[]    │ YES     │         │         │         │
│ descendants │ INTEGER     │ YES     │         │         │         │
├─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 15 rows                                                 6 columns │
└───────────────────────────────────────────────────────────────────┘
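
The hint in the error suggests not selecting the group column kids. As a possible workaround, here is a minimal sketch that builds a query listing only the non-list columns, using the column types from the DuckDB output above. The helper name `build_flat_select` is hypothetical, and I am assuming that parts (also a list column) would trigger the same error as kids. I have not verified that Hyper accepts the resulting query for this file.

```python
# Column names and types copied from the DuckDB describe output in this issue.
COLUMNS = {
    "id": "BIGINT",
    "deleted": "UTINYINT",
    "type": "BLOB",
    "by": "BLOB",
    "time": "BIGINT",
    "text": "BLOB",
    "dead": "UTINYINT",
    "parent": "BIGINT",
    "poll": "BIGINT",
    "kids": "BIGINT[]",
    "url": "BLOB",
    "score": "INTEGER",
    "title": "BLOB",
    "parts": "BIGINT[]",
    "descendants": "INTEGER",
}


def build_flat_select(columns, path, limit=1):
    """Hypothetical helper: build a SELECT that skips list-typed columns."""
    # DuckDB prints list types with a trailing "[]" (e.g. BIGINT[]).
    flat = [name for name, col_type in columns.items()
            if not col_type.endswith("[]")]
    # Double-quote identifiers because names such as "by" and "time" may be keywords.
    column_list = ", ".join(f'"{name}"' for name in flat)
    return f"select {column_list} from external('{path}') limit {limit}"


sql = build_flat_select(COLUMNS, "./hacknernews.parquet")
print(sql)
```

The printed statement can be pasted at the > prompt of the script above. It leaves out kids and parts, so it does not answer why select * is rejected, only how to read the remaining 13 columns.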
