From fb9d4e7afe68a21e65d3f4407e675cab28974654 Mon Sep 17 00:00:00 2001 From: Pavithra Eswaramoorthy Date: Mon, 13 May 2024 11:38:39 +0200 Subject: [PATCH 01/27] Documentation Restructure MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fixes #111 The structure has diverged a fair bit from the initial proposal The Getting Started tutorials will cover a lot of csp.baselib usage details, hence it's not a huge focus in the concepts section (→ Already tracking in a separate issue) The "How-to" guides only have the migrated content from old docs right now, we'll be updating all the pages to follow a "how-to" format (→ Opened a new issue) The docs authoring workflow will change a little with the new GitHub sidebar, we'll add relevant docs to make this easier Note: This is a manual squash and rewrite of 7894f05069dbd0fa15446d95cf2701aa0a40454d --- docs/wiki/0.-Introduction.md | 871 ---------- docs/wiki/5.-Adapters.md | 1517 ----------------- docs/wiki/6.-Dynamic-Graphs.md | 110 -- docs/wiki/9.-Caching.md | 3 - docs/wiki/Home.md | 84 +- docs/wiki/_Footer.md | 1 + docs/wiki/_Sidebar.md | 61 + docs/wiki/api-references/Base-Adapters-API.md | 110 ++ .../Base-Nodes-API.md} | 284 +-- .../api-references/Functional-Methods-API.md | 64 + .../Input-Output-Adapters-API.md | 360 ++ .../Math-and-Logic-Nodes-API.md} | 16 +- .../Random-Time-Series-Generators-API.md} | 17 +- .../Statistical-Nodes-API.md} | 698 +------- .../csp.Struct-API.md} | 10 +- docs/wiki/api-references/csp.dynamic-API.md | 49 + .../csp.profiler-API.md} | 63 +- docs/wiki/concepts/Adapters.md | 15 + docs/wiki/concepts/CSP-Graph.md | 114 ++ docs/wiki/concepts/CSP-Node.md | 271 +++ docs/wiki/concepts/Execution-Modes.md | 243 +++ docs/wiki/concepts/Historical-Buffers.md | 133 ++ .../Build-CSP-from-Source.md} | 149 +- docs/wiki/dev-guides/Contribute.md | 9 + docs/wiki/dev-guides/GitHub-Conventions.md | 73 + .../dev-guides/Local-Development-Setup.md | 87 + .../Release-Process.md} |
192 +-- docs/wiki/dev-guides/Roadmap.md | 17 + docs/wiki/get-started/First-Steps.md | 48 + docs/wiki/get-started/Installation.md | 20 + docs/wiki/how-tos/Add-Cycles-in-Graphs.md | 52 + docs/wiki/how-tos/Create-Dynamic-Baskets.md | 58 + docs/wiki/how-tos/Profile-CSP-Code.md | 77 + docs/wiki/how-tos/Use-Statistical-Nodes.md | 433 +++++ .../Write-Historical-Input-Adapters.md | 415 +++++ docs/wiki/how-tos/Write-Output-Adapters.md | 317 ++++ .../how-tos/Write-Realtime-Input-Adapters.md | 407 +++++ docs/wiki/references/Examples.md | 7 + docs/wiki/references/Glossary.md | 142 ++ 39 files changed, 3864 insertions(+), 3733 deletions(-) delete mode 100644 docs/wiki/0.-Introduction.md delete mode 100644 docs/wiki/5.-Adapters.md delete mode 100644 docs/wiki/6.-Dynamic-Graphs.md delete mode 100644 docs/wiki/9.-Caching.md create mode 100644 docs/wiki/_Footer.md create mode 100644 docs/wiki/_Sidebar.md create mode 100644 docs/wiki/api-references/Base-Adapters-API.md rename docs/wiki/{1.-Generic-Nodes-(csp.baselib).md => api-references/Base-Nodes-API.md} (54%) create mode 100644 docs/wiki/api-references/Functional-Methods-API.md create mode 100644 docs/wiki/api-references/Input-Output-Adapters-API.md rename docs/wiki/{2.-Math-Nodes-(csp.math).md => api-references/Math-and-Logic-Nodes-API.md} (85%) rename docs/wiki/{4.-Random-Time-Series-Generation-(csp.random).md => api-references/Random-Time-Series-Generators-API.md} (92%) rename docs/wiki/{3.-Statistics-Nodes-(csp.stats).md => api-references/Statistical-Nodes-API.md} (76%) rename docs/wiki/{7.-csp.Struct.md => api-references/csp.Struct-API.md} (87%) create mode 100644 docs/wiki/api-references/csp.dynamic-API.md rename docs/wiki/{8.-Profiler.md => api-references/csp.profiler-API.md} (67%) create mode 100644 docs/wiki/concepts/Adapters.md create mode 100644 docs/wiki/concepts/CSP-Graph.md create mode 100644 docs/wiki/concepts/CSP-Node.md create mode 100644 docs/wiki/concepts/Execution-Modes.md create mode 100644 
docs/wiki/concepts/Historical-Buffers.md rename docs/wiki/{98.-Building-From-Source.md => dev-guides/Build-CSP-from-Source.md} (64%) create mode 100644 docs/wiki/dev-guides/Contribute.md create mode 100644 docs/wiki/dev-guides/GitHub-Conventions.md create mode 100644 docs/wiki/dev-guides/Local-Development-Setup.md rename docs/wiki/{99.-Developer.md => dev-guides/Release-Process.md} (52%) create mode 100644 docs/wiki/dev-guides/Roadmap.md create mode 100644 docs/wiki/get-started/First-Steps.md create mode 100644 docs/wiki/get-started/Installation.md create mode 100644 docs/wiki/how-tos/Add-Cycles-in-Graphs.md create mode 100644 docs/wiki/how-tos/Create-Dynamic-Baskets.md create mode 100644 docs/wiki/how-tos/Profile-CSP-Code.md create mode 100644 docs/wiki/how-tos/Use-Statistical-Nodes.md create mode 100644 docs/wiki/how-tos/Write-Historical-Input-Adapters.md create mode 100644 docs/wiki/how-tos/Write-Output-Adapters.md create mode 100644 docs/wiki/how-tos/Write-Realtime-Input-Adapters.md create mode 100644 docs/wiki/references/Examples.md create mode 100644 docs/wiki/references/Glossary.md diff --git a/docs/wiki/0.-Introduction.md b/docs/wiki/0.-Introduction.md deleted file mode 100644 index ea23d3a36..000000000 --- a/docs/wiki/0.-Introduction.md +++ /dev/null @@ -1,871 +0,0 @@ -# Graph building concepts - -When writing csp code there will be runtime components in the form of `csp.node` methods, as well as graph-building components in the form of `csp.graph` components. - -It is important to understand that `csp.graph` components will only be executed once at application startup in order to construct the graph. -Once the graph is constructed, `csp.graph` code is no longer needed. -Once the graph is run, only inputs, csp.nodes and outputs will be active as data flows through the graph, driven by input ticks. 
-For example, this is a simple bit of graph code: - -```python -import csp -from csp import ts -from datetime import datetime - - -@csp.node -def spread(bid: ts[float], ask: ts[float]) -> ts[float]: - if csp.valid(bid, ask): - return ask - bid - - -@csp.graph -def my_graph(): - bid = csp.const(1.0) - ask = csp.const(2.0) - bid = csp.multiply(bid, csp.const(4)) - ask = csp.multiply(ask, csp.const(3)) - s = spread(bid, ask) - - csp.print('spread', s) - csp.print('bid', bid) - csp.print('ask', ask) - - -if __name__ == '__main__': - csp.run(my_graph, starttime=datetime.utcnow()) -``` - -In this simple example `my_graph` is defined as a `csp.graph` component. -This method will be called once by `csp.run` in order to construct the graph. -`csp.const` defines a constant value as a timeseries which will tick once upon startup (this is effectively an input). - -`bid = csp.multiply(bid, csp.const(4))` will insert a `csp.multiply` node to do timeseries multiplication. -`bid` and `ask` are then connected to the user-defined `csp.node` `spread`. -`bid`/`ask` and the calculated `spread` are then linked to the `csp.print` output to print the results. - -In order to help visualize this graph, you can call `csp.show_graph`: - -![359407708](https://github.com/Point72/csp/assets/3105306/8cc50ad4-68f9-4199-9695-11c136e3946c) - -The result of this would be: - -``` -2020-04-02 15:33:38.256724 bid:4.0 -2020-04-02 15:33:38.256724 ask:6.0 -2020-04-02 15:33:38.256724 spread:2.0 -``` - -## Anatomy of a csp.node - -The heart of a calculation graph is its csp.nodes, which run the computations. -`csp.node` methods can take any number of scalar and timeseries arguments, and can return 0 → N timeseries outputs. -Timeseries inputs/outputs should be thought of as the edges that connect components of the graph. -These "edges" can tick whenever they have a new value. -Every tick is associated with a value and the time of the tick.
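As a mental model (this is hypothetical plain Python, not the csp API), an edge can be pictured as a time-ordered list of `(datetime, value)` ticks, and a node as a function that re-evaluates with the latest values of its inputs whenever any of them tick:

```python
from datetime import datetime

# Hypothetical plain-Python model of csp "edges" (NOT the csp API):
# an edge is a time-ordered list of (timestamp, value) ticks.
t0 = datetime(2020, 4, 2, 15, 33, 38)
bid_edge = [(t0, 4.0)]  # bid = 1.0 * 4, as in the graph above
ask_edge = [(t0, 6.0)]  # ask = 2.0 * 3

def run_spread(bid_ticks, ask_ticks):
    """Replay ticks in time order, keeping only the latest value of each
    input (csp's default single-value buffer) and emitting a spread tick
    whenever both inputs are valid."""
    events = sorted([(t, "bid", v) for t, v in bid_ticks] +
                    [(t, "ask", v) for t, v in ask_ticks])
    latest = {}
    out = []
    for t, name, v in events:
        latest[name] = v
        if "bid" in latest and "ask" in latest:  # like csp.valid(bid, ask)
            out.append((t, latest["ask"] - latest["bid"]))
    return out

print(run_spread(bid_edge, ask_edge))  # [(t0, 2.0)]
```

This mirrors the `spread` output above: once both inputs are valid, each tick produces `ask - bid = 2.0`.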
-csp.nodes can have various other features; here is an example of a csp.node that demonstrates many of them. -Keep in mind that nodes will execute repeatedly as inputs tick with new data. -They may (or may not) generate an output as a result of an input tick. - -```python -from datetime import timedelta - -@csp.node # 1 -def demo_node(n: int, xs: ts[float], ys: ts[float]) -> ts[float]: # 2 - with csp.alarms(): # 3 - # Define an alarm time-series of type bool # 4 - alarm = csp.alarm(bool) # 5 - # 6 - with csp.state(): # 7 - # Create a state variable bound to the node # 8 - s_sum = 0.0 # 9 - # 10 - with csp.start(): # 11 - # Code block that executes once on start of the engine # 12 - # one can set timeseries properties here as well, such as # 13 - # csp.set_buffering_policy(xs, tick_count=5) # 14 - # csp.set_buffering_policy(xs, tick_history=timedelta(minutes=1))# 15 - # csp.make_passive(xs) # 16 - csp.schedule_alarm(alarm, timedelta(seconds=1), True) # 17 - # 18 - with csp.stop(): # 19 - pass # code block to execute when the engine is done # 20 - # 21 - if csp.ticked(xs, ys) and csp.valid(xs, ys): # 22 - s_sum += xs * ys # 23 - # 24 - if csp.ticked(alarm): # 25 - csp.schedule_alarm(alarm, timedelta(seconds=1), True) # 26 - return s_sum # 27 -``` - -Let's review line by line: - -1\) Every csp node must start with the **`@csp.node`** decorator. - -2\) `csp` nodes are fully typed and type-checking is strictly enforced. -All arguments must be typed, as well as all outputs. -Outputs are typed using function annotation syntax. - -Single outputs can be unnamed; multiple outputs must be named. -When using multiple outputs, annotate the type using **`def my_node(inputs) → csp.Outputs(name1=ts[T], name2=ts[V])`** where `T` and `V` are the respective types of `name1` and `name2`. - -Note the syntax of timeseries inputs: they are denoted by **`ts[type]`**.
-Scalars can be passed in as regular types; in this example we pass in `n`, which expects an `int`. - -3\) **`with csp.alarms()`**: nodes can (optionally) declare internal alarms; every instance of the node will get its own alarm that can be scheduled and acts just like a timeseries input. -All alarms must be declared within the alarms context. - -5\) Instantiate an alarm in the alarms context using the `csp.alarm(typ)` function. This creates an alarm which is a time-series of type `typ`. - -7\) **`with csp.state()`**: optional state variables can be defined under the state context. -Note that variables declared in state will live across invocations of the method. - -9\) An example declaration and initialization of state variable `s_sum`. -It is good practice to name state variables prefixed with `s_`, which is the convention in the `csp` codebase. - -11\) **`with csp.start()`**: an optional block to execute code at the start of the engine. -Generally this is used to set up initial timers, set input timeseries properties such as buffer sizes, or make inputs passive. - -14-15) **`csp.set_buffering_policy`**: nodes can request a certain amount of history be kept on the incoming time series; this can be denoted in number of ticks or in time. -By setting a buffering policy, nodes can access historical values of the timeseries (by default only the last value is kept). - -16\) **`csp.make_passive`** / **`csp.make_active`**: Nodes may not need to react to all of their inputs; they may just need their latest value. -For performance purposes the node can mark an input as passive to avoid triggering the node unnecessarily. -`make_active` can be called to reactivate an input. - -17\) **`csp.schedule_alarm`**: schedules a one-shot tick on the given alarm input. -The values given are the timedelta before the alarm triggers and the value it will have when it triggers. -Note that `schedule_alarm` can be called multiple times on the same alarm to schedule multiple triggers.
- -19\) **`with csp.stop()`** is an optional block that executes when the engine is done running. - -22\) all nodes will have if conditions to react to different inputs. -**`csp.ticked()`** takes any number of inputs and returns true if **any** of the inputs ticked. -**`csp.valid`** similarly takes any number of inputs; however, it only returns true if **all** inputs are valid. -Valid means that an input has had at least one tick and so it has a "current value". - -23\) One of the benefits of `csp` is that you always have easy access to the latest value of all inputs. -`xs` and `ys` on lines 22-23 will always have the latest value of both inputs, even if only one of them just ticked. - -25\) This demonstrates how an alarm can be treated like any other input. - -27\) We tick our running "sum" as an output here every second. - -## Basket inputs - -In addition to single time-series inputs, a node can also accept a **basket** of time series as an argument. -A basket is essentially a collection of timeseries which can be passed in as a single argument. -Baskets can either be list baskets or dict baskets. -Individual timeseries in a basket can tick independently, and they can be looked at and reacted to individually or as a collection. - -For example: - -```python -@csp.node # 1 -def demo_basket_node( # 2 - list_basket: [ts[int]], # 3 - dict_basket: {str: ts[int]} # 4 -) -> ts[float]: # 5 - # 6 - if csp.ticked(list_basket): # 7 - return sum(list_basket.validvalues()) # 8 - # 9 - if csp.ticked(list_basket[3]): # 10 - return list_basket[3] # 11 - # 12 - if csp.ticked(dict_basket): # 13 - # can iterate over ticked key,items # 14 - # for k,v in dict_basket.tickeditems():# 15 - # ... # 16 - return sum(dict_basket.tickedvalues()) # 17 -``` - -3\) Note the syntax of basket inputs: list baskets are noted as `[ts[type]]` (a list of time series) and dict baskets are `{key_type: ts[ts_type]}` (a dictionary of timeseries keyed by type `key_type`).
- -7\) Just like single timeseries, we can react to a basket if it ticked. -The convention is the same as passing multiple inputs to `csp.ticked`: `csp.ticked` is true if **any** basket input ticked. -`csp.valid` is true if **all** basket inputs are valid. - -8\) baskets have various iterators to access their inputs: - -- **`tickedvalues`**: iterator of values of all ticked inputs -- **`tickedkeys`**: iterator of keys of all ticked inputs (keys are list indices for list baskets) -- **`tickeditems`**: iterator of (key,value) tuples of ticked inputs -- **`validvalues`**: iterator of values of all valid inputs -- **`validkeys`**: iterator of keys of all valid inputs -- **`validitems`**: iterator of (key,value) tuples of valid inputs -- **`keys`**: list of keys on the basket (**dictionary baskets only**) - -10-11) This demonstrates the ability to access an individual element of a -basket and react to it, as well as access its current value. - -## Node Outputs - -Nodes can return any number of outputs (including no outputs, in which case it is considered an "output" or sink node, -see [Graph Pruning](https://github.com/Point72/csp/wiki/0.-Introduction#graph-pruning)). -Nodes with single outputs can return the output as an unnamed output. -Nodes returning multiple outputs must have them be named. -When a node is called at graph building time, if it has a single unnamed output, the return value is an edge representing the output, which can be passed into other nodes. -If the outputs are named, the return value is an object with the outputs available as attributes. -For example (the examples below also demonstrate various ways to output the data): - -```python -@csp.node -def single_unnamed_outputs(n: ts[int]) -> ts[int]: - # can either do - return n - # or - # csp.output(n) to continue processing after output - - -@csp.node -def multiple_named_outputs(n: ts[int]) -> csp.Outputs(y=ts[int], z=ts[float]): - # can do - # csp.output(y=n, z=n+1.)
to output to multiple outputs - # or separate the outputs to tick out at separate points: - # csp.output(y=n) - # ... - # csp.output(z=n+1.) - # or can return multiple values with: - return csp.output(y=n, z=n+1.) - -@csp.graph -def my_graph(n: ts[int]): - x = single_unnamed_outputs(n) - # x represents the output edge of single_unnamed_outputs, - # we can pass it as a time series input to other nodes - csp.print('x', x) - - - result = multiple_named_outputs(n) - # result holds all the outputs of multiple_named_outputs, which can be accessed as attributes - csp.print('y', result.y) - csp.print('z', result.z) -``` - -## Basket Outputs - -Similarly to inputs, a node can also produce a basket of timeseries as an output. -For example: - -```python -class MyStruct(csp.Struct): # 1 - symbol: str # 2 - index: int # 3 - value: float # 4 - # 5 -@csp.node # 6 -def demo_basket_output_node( # 7 - in_: ts[MyStruct], # 8 - symbols: [str], # 9 - num_symbols: int # 10 -) -> csp.Outputs( # 11 - dict_basket=csp.OutputBasket( # 12 - Dict[str, ts[float]], # 13 - shape="symbols", # 14 - ), # 15 - list_basket=csp.OutputBasket( # 16 - List[ts[float]], # 17 - shape="num_symbols" # 18 - ), # 19 -): # 20 - # 21 - if csp.ticked(in_): # 22 - # output to dict basket # 23 - csp.output(dict_basket[in_.symbol], in_.value) - # alternate output syntax, can output multiple keys at once - # csp.output(dict_basket={in_.symbol: in_.value}) - # output to list basket - csp.output(list_basket[in_.index], in_.value) - # alternate output syntax, can output multiple keys at once - # csp.output(list_basket={in_.index: in_.value}) -``` - -11-20) Note the output declaration syntax. -A basket output can be either named or unnamed (both examples here are named), and its shape can be specified two ways. -The `shape` parameter is used with a scalar value that defines the shape of the basket, or the name of the scalar argument (a dict basket expects `shape` to be a list of keys;
list baskets expect `shape` to be an `int`). -`shape_of` is used to take the shape of an input basket and apply it to the output basket. - -23+) There are several choices for output syntax. -The following work for both list and dict baskets: - -- `csp.output(basket={key: value, key2: value2, ...})` -- `csp.output(basket[key], value)` -- `csp.output({key: value}) # only works if the basket is the only output` - -## Generic Types - -`csp` supports syntax for generic types as well. -Generic types are denoted by a string (typically `'T'`). -When a node is called the type of the argument will get bound to the given type variable, and further inputs / outputs will be checked and bound to said typevar. -Note that the string syntax `'~T'` denotes that the argument expects the *value* of a type, rather than a type itself: - -```python -@csp.node -def sample(trigger: ts[object], x: ts['T']) -> ts['T']: - '''will return current value of x on trigger ticks''' - with csp.start(): - csp.make_passive(x) - - if csp.ticked(trigger) and csp.valid(x): - return x - - -@csp.node -def const(value: '~T') -> ts['T']: - ... -``` - -`sample` takes a timeseries of type `'T'` as an input, and returns a timeseries of type `'T'`. -This allows us to pass in a `ts[int]` for example, and get a `ts[int]` as an output, or `ts[bool]` → `ts[bool]`. - -`const` takes value as an *instance* of type `T`, and returns a timeseries of type `T`. -So we can call `const(5)` and get a `ts[int]` output, or `const('hello!')` and get a `ts[str]` output, etc. - -## Engine Time - -The `csp` engine always maintains its current view of time. -The current time of the engine can be accessed at any time within a csp.node by calling `csp.now()`. - -## Graph Propagation and Single-dispatch - -The `csp` graph propagation algorithm ensures that all nodes are executed *once* per engine cycle, and in the correct order.
-Correct order means that all input dependencies of a given node are guaranteed to have been evaluated before the node itself is executed. -Take this graph for example: - -![359407953](https://github.com/Point72/csp/assets/3105306/d9416353-6755-4e37-8467-01da516499cf) - -On a given cycle let's say the `bid` input ticks. -The `csp` engine will ensure that **`mid`** is executed, followed by **`spread`**, and only once **`spread`**'s output is updated will **`quote`** be called. -When **`quote`** executes it will have the latest values of the `mid` and `spread` calc for this cycle. - -## Graph Pruning - -One should note a subtle optimization technique in `csp` graphs. -Any part of a graph that is created at graph building time, but is NOT connected to any output nodes, will be pruned from the graph and will not exist during runtime. -An output is defined as either an output adapter or a `csp.node` without any outputs of its own. -The idea here is that we can avoid doing work if it doesn't result in any output being generated. -In general it's best practice for all csp.nodes to be **side-effect free**; in other words, they shouldn't mutate any state outside of the node. -Assuming all nodes are side-effect free, pruning the graph would not have any noticeable effects. - -## Anatomy of a `csp.graph` - -To reiterate, csp.graph methods are called in order to construct the graph and are only executed before the engine is run. -csp.graph methods don't do anything special; they are essentially regular Python methods, but they can be defined to accept inputs and generate outputs similar to csp.nodes. -This is solely used for type checking. -csp.graph methods can be created to encapsulate components of a graph, and can be called from other csp.graph methods in order to help facilitate graph building.
- -Simple example: - -```python -@csp.graph -def calc_symbol_pnl(symbol: str, trades: ts[Trade]) -> ts[float]: - # sub-graph code needed to compute pnl for given symbol and symbol's trades - # sub-graph can subscribe to market data for the symbol as needed - ... - - -@csp.graph -def calc_portfolio_pnl(symbols: [str]) -> ts[float]: - symbol_pnl = [] - for symbol in symbols: - symbol_trades = trade_adapter.subscribe(symbol) - symbol_pnl.append(calc_symbol_pnl(symbol, symbol_trades)) - - return csp.sum(symbol_pnl) -``` - -In this simple example we have a csp.graph component `calc_symbol_pnl` which encapsulates computing pnl for a single symbol. -`calc_portfolio_pnl` is a graph that computes portfolio level pnl, it invokes the symbol-level pnl calc for every symbol, then sums up the results for the portfolio level pnl. - -# Historical Buffers - -`csp` provides access to historical input data as well. -By default only the last value of an input is kept in memory, however one can request history to be kept on an input either by number of ticks or by time using **csp.set_buffering_policy.** - -The methods **csp.value_at**, **csp.time_at** and **csp.item_at** can be used to retrieve historical input values. -Each node should call **csp.set_buffering_policy** to make sure that its inputs are configured to store sufficiently long history for correct implementation. -For example, let's assume that we have a stream of data and we want to create equally sized buckets from the data. 
-A possible implementation of such a node would be: - -```python -@csp.node -def data_bin_generator(bin_size: int, input: ts['T']) -> ts[['T']]: - with csp.start(): - assert bin_size > 0 - # This makes sure that input stores at least bin_size entries - csp.set_buffering_policy(input, tick_count=bin_size) - if csp.ticked(input) and (csp.num_ticks(input) % bin_size == 0): - return [csp.value_at(input, -i) for i in range(bin_size)] -``` - -In this example, we use **`csp.set_buffering_policy(input, tick_count=bin_size)`** to ensure that the buffer history contains at least **`bin_size`** elements. -Note that an input can be shared by multiple nodes; if multiple nodes provide size requirements, the buffer size would be resolved to the maximum size to support all requests. - -Alternatively, **`csp.set_buffering_policy`** supports a **`timedelta`** parameter **`tick_history`** instead of **`tick_count`**. -If **`tick_history`** is provided, the buffer will scale dynamically to ensure that any period of length **`tick_history`** will fit into the history buffer. - -To identify when there are enough samples to construct a bin we use **`csp.num_ticks(input) % bin_size == 0`**. -The function **`csp.num_ticks`** returns the total number of ticks for a given time series. -NOTE: The actual size of the history buffer is usually less than **`csp.num_ticks`** as the buffer is dynamically truncated to satisfy the set policy. - -The past values in this example are accessed using **`csp.value_at`**.
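The effect of `tick_count` buffering can be imitated outside of csp (a plain-Python analogy, not the csp API) with a bounded `deque` standing in for the engine's history buffer:

```python
from collections import deque

# Hypothetical plain-Python analogy (NOT the csp API) for what
# csp.set_buffering_policy(input, tick_count=bin_size) provides:
# a bounded buffer of the most recent ticks, emitting a bin every
# bin_size ticks, like the data_bin_generator node above.
def bin_stream(values, bin_size):
    buffer = deque(maxlen=bin_size)  # engine keeps at most bin_size ticks
    num_ticks = 0
    bins = []
    for v in values:
        buffer.append(v)
        num_ticks += 1  # plays the role of csp.num_ticks(input)
        if num_ticks % bin_size == 0:
            # like [csp.value_at(input, -i) for i in range(bin_size)],
            # i.e. newest value first
            bins.append([buffer[-1 - i] for i in range(bin_size)])
    return bins

print(bin_stream([1, 2, 3, 4, 5, 6, 7], 3))  # [[3, 2, 1], [6, 5, 4]]
```

As in the node above, the buffer only ever holds the last `bin_size` ticks, while the tick counter keeps growing.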
-The various historical access methods take the same arguments and return the value, time and tuple of `(time,value)` respectively: - -- **`csp.value_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns **value** at requested `index_or_time` -- **`csp.time_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns **datetime** at requested `index_or_time` -- **`csp.item_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns tuple of `(datetime,value)` at requested `index_or_time` - - **`ts`**: the name of the input - - **`index_or_time`**: - - If providing an **index**, this represents how many ticks back to retrieve **and should be \<= 0**. - 0 indicates the current value, -1 is the previous value, etc. - - If providing **time** one can either provide a datetime for absolute time, or a timedelta for how far back to access. - **NOTE** that timedelta must be negative to represent time in the past. - - **`duplicate_policy`**: when requesting history by datetime or timedelta, it's possible that there could be multiple values that match the given time. - **`duplicate_policy`** can be provided to control the behavior of what to return in this case. - The default policy is to return the LAST_VALUE that exists at the given time. - - **`default`**: value to be returned if the requested time is out of the history bounds (if default is not provided and a request is out of bounds an exception will be raised). - -To illustrate the usage of history access using **timedelta** indexing, consider a possible implementation of a function that sums up samples taken every second for each period of **n_seconds** of the input time series.
-If the value ticks slower than every second then this implementation could sample the same value more than once (this is just an illustration; it's NOT recommended to use such an implementation in a real application as it could be implemented more efficiently): - -```python -@csp.node -def sample_sum(n_seconds: int, input: ts[int], default_sample_value: int = 0) -> ts[int]: - with csp.alarms(): - a = csp.alarm(bool) - with csp.start(): - assert n_seconds > 0 - # This makes sure that input stores at least n_seconds seconds - csp.set_buffering_policy(input, tick_history=timedelta(seconds=n_seconds)) - # Flag the input as passive since we don't need to react to its ticks - csp.make_passive(input) - # Schedule the first sample in n_seconds-1 from start, to also capture the initial value - csp.schedule_alarm(a, timedelta(seconds=n_seconds - 1), True) - if csp.ticked(a): - # Schedule the next sample in n_seconds from start - csp.schedule_alarm(a, timedelta(seconds=n_seconds), True) - res = 0 - for i in range(n_seconds): - res += csp.value_at(input, timedelta(seconds=-i), default=default_sample_value) - return res -``` - -## Historical Range Access - -In similar fashion, the methods **`csp.values_at`**, **`csp.times_at`** and **`csp.items_at`** can be used to retrieve a range of historical input values as numpy arrays. -The bin generator example above can be accomplished more efficiently with range access: - -```python -@csp.node -def data_bin_generator(bin_size: int, input: ts['T']) -> ts[['T']]: - with csp.start(): - assert bin_size > 0 - # This makes sure that input stores at least bin_size entries - csp.set_buffering_policy(input, tick_count=bin_size) - if csp.ticked(input) and (csp.num_ticks(input) % bin_size == 0): - return csp.values_at(input, -bin_size + 1, 0).tolist() -``` - -The past values in this example are accessed using **`csp.values_at`**.
-The various historical range access methods take the same arguments and return the values, times and tuple of `(times,values)` respectively: - -- **`csp.values_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`: - returns values in specified range as a numpy array -- **`csp.times_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`: - returns times in specified range as a numpy array -- **`csp.items_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`: - returns a tuple of (times, values) numpy arrays - - **`ts`**: the name of the input - - **`start_index_or_time`**: - - If providing an **index**, this represents how many ticks back to retrieve **and should be \<= 0**. - 0 indicates the current value, -1 is the previous value, etc. - - If providing **time** one can either provide a datetime for absolute time, or a timedelta for how far back to access. - **NOTE that timedelta must be negative** to represent time in the past. - - If **None** is provided, the range will begin "from the beginning" - i.e., the oldest tick in the buffer. - - **`end_index_or_time`**: same as `start_index_or_time`. - - If **None** is provided, the range will go "until the end" - i.e., the newest tick in the buffer. - - **`start_index_policy`**: only for use with datetime/timedelta as the start and end parameters. - - **`TimeIndexPolicy.INCLUSIVE`**: if there is a tick exactly at the requested time, include it - - **`TimeIndexPolicy.EXCLUSIVE`**: if there is a tick exactly at the requested time, exclude it - - **`TimeIndexPolicy.EXTRAPOLATE`**: if there is a tick at the beginning timestamp, include it.
Otherwise, if there is a tick before the beginning timestamp, force a tick at the beginning timestamp with the prevailing value at the time. - - **`end_index_policy`**: only for use with datetime/timedelta as the start and end parameters. - - **`TimeIndexPolicy.INCLUSIVE`**: if there is a tick exactly at the requested time, include it - - **`TimeIndexPolicy.EXCLUSIVE`**: if there is a tick exactly at the requested time, exclude it - - **`TimeIndexPolicy.EXTRAPOLATE`**: if there is a tick at the end timestamp, include it. - Otherwise, if there is a tick before the end timestamp, force a tick at the end timestamp with the prevailing value at the time. - -Range access is optimized at the C++ layer and for this reason it's far more efficient than calling the single-value access methods in a loop; it should be substituted in where possible. - -Below is a rolling average example to illustrate the use of timedelta indexing. -Note that `timedelta(seconds=-n_seconds)` is equivalent to `csp.now() - timedelta(seconds=n_seconds)`, since datetime indexing is supported. - -```python -@csp.node -def rolling_average(x: ts[float], n_seconds: int) -> ts[float]: - with csp.start(): - assert n_seconds > 0 - csp.set_buffering_policy(x, tick_history=timedelta(seconds=n_seconds)) - if csp.ticked(x): - avg = np.mean(csp.values_at(x, timedelta(seconds=-n_seconds), timedelta(seconds=0), - csp.TimeIndexPolicy.INCLUSIVE, csp.TimeIndexPolicy.INCLUSIVE)) - csp.output(avg) -``` - -When accessing all elements within the buffering policy window like -this, it would be more succinct to pass None as the start and end time, -but datetime/timedelta allows for more general use (e.g. rolling average -between 5 seconds and 1 second ago, or average specifically between -9:30:00 and 10:00:00). - -# Cyclical graph - `csp.feedback` - -By definition of the graph building code, csp graphs can only produce acyclic graphs. -However, there are many occasions where a cycle may be required.
-For example, let's say you want part of your graph to simulate an exchange. -That part of the graph would need to accept new orders and return acks and executions. -However, the acks / executions would likely need to *feed back* into the same part of the graph that generated the orders. -For this reason, the `csp.feedback` construct exists. -Using `csp.feedback` one can wire a feedback as an input to a node, and effectively bind the actual edge that feeds it later in the graph. -Note that internally the graph is still acyclic. -Internally `csp.feedback` creates a pair of output and input adapters that are bound together. -When a timeseries that is bound to a feedback ticks, it is fed to the feedback, which then schedules the tick on its bound input to be executed on the **next engine cycle**. -The next engine cycle will execute with the same engine time as the cycle that generated it. - -- **`csp.feedback(ts_type)`**: `ts_type` is the type of the timeseries (e.g. `int`, `str`). - This returns an instance of a feedback object. - - **`out()`**: this method returns the timeseries edge which can be passed as an input to your node. - - **`bind(ts)`**: this method is called to bind an edge as the source of the feedback after the fact. - -A simple example should help demonstrate a possible usage. -Let's say we want to simulate acking orders that are generated from a node called `my_algo`. -In addition to generating the orders, `my_algo` also needs to receive the execution reports (this is demonstrated in example `e_13_feedback.py`). - -The graph code would look something like this: - -```python -# Simulate acking an order -@csp.node -def my_exchange(order: ts[Order]) -> ts[ExecReport]: - # ... impl details ... - -@csp.node -def my_algo(exec_report: ts[ExecReport]) -> ts[Order]: - # .. impl details ...
-
-@csp.graph
-def my_graph():
-    # create the feedback first so that we can refer to it later
-    exec_report_fb = csp.feedback(ExecReport)
-
-    # generate orders, passing feedback out() which isn't bound yet
-    orders = my_algo(exec_report_fb.out())
-
-    # get exec_reports from the "simulator"
-    exec_report = my_exchange(orders)
-
-    # now bind the exec reports to the feedback, finishing the "loop"
-    exec_report_fb.bind(exec_report)
-```
-
-The graph would end up looking like this.
-It remains acyclic, but the `FeedbackOutputDef` is bound to the `FeedbackInputDef` here; any tick to out will push the tick to in on the next cycle:
-
-![366521848](https://github.com/Point72/csp/assets/3105306/c4f920ff-49f9-4a52-8404-7c1989768da7)
-
-# Collecting Graph Outputs
-
-If the `csp.graph` passed to `csp.run` has outputs, the full timeseries will be returned from `csp.run` like so:
-
-**outputs example**
-
-```python
-import csp
-from datetime import datetime, timedelta
-
-@csp.graph
-def my_graph() -> ts[int]:
-    return csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1)))
-
-if __name__ == '__main__':
-    res = csp.run(my_graph, starttime=datetime(2021,11,8))
-    print(res)
-```
-
-result:
-
-```raw
-{0: [(datetime.datetime(2021, 11, 8, 0, 0), 1), (datetime.datetime(2021, 11, 8, 0, 0, 1), 2)]}
-```
-
-Note that the result is a list of `(datetime, value)` tuples.
-
-You can also use `csp.add_graph_output` to add outputs.
-These do not need to be in the top-level graph called directly from `csp.run`.
-
-This gives the same result:
-
-**add_graph_output example**
-
-```python
-@csp.graph
-def my_graph():
-    csp.add_graph_output('a', csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1))))
-```
-
-In addition to Python outputs like the above, you can set the optional `csp.run` argument `output_numpy` to `True` to get outputs as numpy arrays:
-
-**numpy outputs**
-
-```python
-result = csp.run(my_graph, starttime=datetime(2021,11,8), output_numpy=True)
-```
-
-result:
-
-```raw
-{0: (array(['2021-11-08T00:00:00.000000000', '2021-11-08T00:00:01.000000000'], dtype='datetime64[ns]'), array([1, 2], dtype=int64))}
-```
-
-Note that the result there is a tuple per output, containing two numpy arrays, one with the datetimes and one with the values.
-
-# Realtime / Simulation Modes
-
-The `csp` engine can be run in two flavors: realtime and simulation.
-
-In simulation mode, the engine is always run at full speed, pulling in time-based data from its input adapters and running it through the graph.
-All inputs in simulation are driven off the provided timestamped data of its inputs.
-
-In realtime mode, the engine runs in wallclock time as of "now".
-Realtime engines can get data from realtime adapters which source data on separate threads and pass them through to the engine (e.g. think of ActiveMQ events happening on an ActiveMQ thread and being passed along to the engine in "realtime").
-
-Since engines can run in both simulated and realtime mode, users should **always** use **`csp.now()`** to get the current time in csp.nodes.
-
-## Simulation Mode
-
-Simulation mode is the default mode of the engine.
-As stated above, simulation mode is used when you want your engine to crunch through historical data as fast as possible.
-In simulation mode, the engine runs on some historical data that is fed in through various adapters.
-The adapters provide events by time, and they are streamed into the engine via the adapter timeseries in time order.
-csp.timer and csp.node alarms are scheduled and executed in "historical time" as well. -Note that there is no strict requirement for simulated runs to run on historical dates. -As long as the engine is not in realtime mode, it remains in simulation mode until the provided endtime, even if endtime is in the future. - -## Realtime Mode - -Realtime mode is opted into by passing `realtime=True` to `csp.run(...)`. -When run in realtime mode, the engine will run in simulation mode from the provided starttime → wallclock "now" as of the time of calling run. -Once the simulation run is done, the engine switches into realtime mode. -Under realtime mode, external realtime adapters will be able to send data into the engine thread. -All time based inputs such as csp.timer and alarms will switch to executing in wallclock time as well. - -As always, `csp.now()` should still be used in csp.node code, even when running in realtime mode. -`csp.now()` will be the time assigned to the current engine cycle. - -## csp.PushMode - -When consuming data from input adapters there are three choices on how one can consume the data: - -| PushMode | EngineMode | Description | -| :------- | :--------- | :---------- | -| **LAST_VALUE** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once with the last value on a given timestamp | -|   | Realtime | all ticks that occurred since previous engine cycle will collapse / conflate to the latest value | -| **NON_COLLAPSING** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once per engine cycle. 
subsequent cycles will execute with the same time |
-|   | Realtime | all ticks that occurred since previous engine cycle will be ticked across subsequent engine cycles as fast as possible |
-| **BURST** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once with a list of all values |
-|   | Realtime | all ticks that occurred since previous engine cycle will tick once with a list of all the values |
-
-## Realtime Group Event Synchronization
-
-The `csp` framework supports properly synchronizing events across multiple timeseries that are sourced from the same realtime adapter.
-A classical example of this is a market data feed.
-Say you consume bid, ask and trade as 3 separate time series for the same product / exchange.
-Since the data flows in asynchronously from a separate thread, bid, ask and trade events could end up executing in the engine at arbitrary slices of time, leading to crossed books and trades that are out of range of the bid/ask.
-The engine can properly provide a correct synchronous view of all the inputs, regardless of their PushModes.
-It's up to adapter implementations to determine which inputs are part of a synchronous "PushGroup".
-
-Here's a classical example.
-An application wants to consume conflating bid/ask as LAST_VALUE but it doesn't want to conflate trades, so they are consumed as NON_COLLAPSING.
-
-Let's say we have this sequence of events on the actual market data feed's thread, coming in on the wire in this order.
-The columns denote the time the callbacks come in off the market data thread.
-
-| Event | T | T+1 | T+2 | T+3 | T+4 | T+5 | T+6 |
-| :---- | :------ | :------ | :------ | :----- | :----- | :----- | :------ |
-| BID | 100.00 | 100.01 |  | 99.97 | 99.98 | 99.99 |  |
-| ASK | 100.02 |  | 100.03 |  |  |  | 100.00 |
-| TRADE |  |  | 100.02 |  |  | 100.03 |  |
-
-Without any synchronization you can end up with nonsensical views based on random timing.
-Here's one such possibility (bid/ask are still LAST_VALUE, trade is NON_COLLAPSING).
-
-Here ET is engine time.
-Let's assume the engine had a huge delay and hasn't processed any data submitted above yet.
-Without any synchronization, bid/ask would completely conflate, and trade would unroll over multiple engine cycles:
-
-| Event | ET | ET+1 |
-| :---- | :----- | :----- |
-| BID | 99.99 |  |
-| ASK | 100.00 |  |
-| TRADE | 100.02 | 100.03 |
-
-However, since market data adapters will group bid/ask/trade inputs together, the engine won't let bid/ask events advance ahead of trade events since trade is NON_COLLAPSING.
-NON_COLLAPSING inputs will essentially act as a barrier, not allowing events queued behind the barrier to tick before the barrier is complete.
-Let's assume again that the engine had a huge delay and hasn't processed any data submitted above.
-With proper barrier synchronization the engine cycles would look like this under the same conditions:
-
-| Event | ET | ET+1 | ET+2 |
-| :---- | :------ | :------ | :------ |
-| BID | 100.01 | 99.99 |  |
-| ASK | 100.03 |  | 100.00 |
-| TRADE | 100.02 | 100.03 |  |
-
-Note how the last ask tick of 100.00 got held up to a separate cycle (ET+2) so that trade could tick with the correct view of bid/ask at the time of the second trade (ET+1).
-
-As another example, let's say the engine got delayed briefly at wire time T, so it was able to process T+1 data.
-Similarly it got briefly delayed at time T+4 until after T+6. The engine would be able to process all data at time T+1, T+2, T+3 and T+6, leading to this sequence of engine cycles.
-The equivalent "wire time" is denoted in parentheses:
-
-| Event | ET (T+1) | ET+1 (T+2) | ET+2 (T+3) | ET+3 (T+5) | ET+4 (T+6) |
-| :---- | :------- | :--------- | :--------- | :--------- | :--------- |
-| BID | 100.01 |  | 99.97 | 99.99 |  |
-| ASK | 100.02 | 100.03 |  |  | 100.00 |
-| TRADE |  | 100.02 |  | 100.03 |  |
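The barrier behavior in the tables above can be sketched in plain Python. This is an illustrative simulation only, not the csp engine or its API: queued `(input, value)` events are grouped into engine cycles, LAST_VALUE inputs conflate by overwriting within a cycle, and a NON_COLLAPSING input ("TRADE" here) closes the cycle so later events cannot merge past it.

```python
# Illustrative sketch (NOT csp API) of NON_COLLAPSING barrier semantics:
# LAST_VALUE inputs conflate within a cycle; a NON_COLLAPSING tick acts as
# a barrier that forces subsequent events into the next engine cycle.

def unroll_cycles(events, non_collapsing=frozenset({"TRADE"})):
    """Group queued (name, value) events into engine cycles."""
    cycles, current = [], {}
    for name, value in events:
        # A pending NON_COLLAPSING tick closes the current cycle.
        if any(k in non_collapsing for k in current):
            cycles.append(current)
            current = {}
        current[name] = value  # LAST_VALUE inputs simply overwrite
    if current:
        cycles.append(current)
    return cycles

# The wire sequence from the tables above, queued during an engine delay:
wire_events = [
    ("BID", 100.00), ("ASK", 100.02),    # T
    ("BID", 100.01),                     # T+1
    ("ASK", 100.03), ("TRADE", 100.02),  # T+2
    ("BID", 99.97),                      # T+3
    ("BID", 99.98),                      # T+4
    ("BID", 99.99), ("TRADE", 100.03),   # T+5
    ("ASK", 100.00),                     # T+6
]

print(unroll_cycles(wire_events))
```

Running this reproduces the three barrier-synchronized cycles shown in the ET/ET+1/ET+2 table above.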
diff --git a/docs/wiki/5.-Adapters.md b/docs/wiki/5.-Adapters.md
deleted file mode 100644
index 7b7b97f7d..000000000
--- a/docs/wiki/5.-Adapters.md
+++ /dev/null
@@ -1,1517 +0,0 @@
-# Intro
-
-To get various data sources into and out of the graph, various Input and Output Adapters are available, such as CSV, Parquet, and database adapters (amongst others).
-Users can also write their own input and output adapters, as explained below.
-
-There are two types of Input Adapters: **Historical** (aka Simulated) adapters and **Realtime** Adapters.
-Historical adapters are used to feed historical timeseries data into the graph from some data source which has timeseries data.
-Realtime adapters are used to feed in live event-based data in realtime, generally events created from external sources on separate threads.
-
-There is no distinction between historical and realtime output adapters, since outputs need not care whether the timeseries data wired into them is generated from realtime or historical inputs.
-
-In CSP terminology, a single adapter corresponds to a single timeseries edge in the graph.
-There are common cases where a single data source may be used to provide data to multiple adapter (timeseries) instances; for example, a single CSV file with price data for many stocks can be read once but provide data to many individual adapters, one per stock.
-In such cases an AdapterManager is used to coordinate management of the single source (CSV file, database, Kafka connection, etc) and provide data to the individual adapters.
-
-Note that adapters can be quickly written and prototyped in Python, and if needed can be moved to a C++ implementation for more efficiency.
-
-# Kafka
-
-The Kafka adapter is a user adapter to stream data from a Kafka bus as a reactive time series. It leverages the [librdkafka](https://github.com/confluentinc/librdkafka) C/C++ library internally.
-
-The `KafkaAdapterManager` instance represents a single connection to a broker.
-A single connection can subscribe and/or publish to multiple topics.
-
-## API
-
-```python
-KafkaAdapterManager(
-    broker,
-    start_offset: typing.Union[KafkaStartOffset,timedelta,datetime] = None,
-    group_id: str = None,
-    group_id_prefix: str = '',
-    max_threads=100,
-    max_queue_size=1000000,
-    auth=False,
-    security_protocol='SASL_SSL',
-    sasl_kerberos_keytab='',
-    sasl_kerberos_principal='',
-    ssl_ca_location='',
-    sasl_kerberos_service_name='kafka',
-    rd_kafka_conf_options=None,
-    debug: bool = False,
-    poll_timeout: timedelta = timedelta(seconds=1)
-):
-```
-
-- **`broker`**: name of the Kafka broker, such as `protocol://host:port`
-
-- **`start_offset`**: signifies where to start the stream playback from (defaults to `KafkaStartOffset.LATEST`).
-  Can be one of the `KafkaStartOffset` enum values or:
-
-  - `datetime`: to replay from the given absolute time
-  - `timedelta`: this will be taken as an absolute offset from starttime to playback from
-
-- **`group_id`**: if set, this adapter will behave as a consume-once consumer.
-  `start_offset` may not be set in this case since the adapter will always replay from the last consumed offset.
-
-- **`group_id_prefix`**: when not passing an explicit group_id, a prefix can be supplied that will be used to prefix the UUID generated for the group_id
-
-- **`max_threads`**: maximum number of threads to create for consumers.
-  The topics are round-robin'd onto threads to balance the load.
-  The adapter won't create more threads than topics.
-
-- **`max_queue_size`**: maximum size of the (internal to Kafka) message queue.
-  If the queue is full, messages can be dropped, so the default is very large.
-
-## MessageMapper
-
-In order to publish or subscribe, you need to define a MsgMapper.
-These are the supported message types:
-
-- **`JSONTextMessageMapper(datetime_type = DateTimeType.UNKNOWN)`**
-- **`ProtoMessageMapper(datetime_type = DateTimeType.UNKNOWN)`**
-
-You should choose the `DateTimeType` based on how you want (when publishing) or expect (when subscribing) your datetimes to be represented on the wire.
-The supported options are:
-
-- `UINT64_NANOS`
-- `UINT64_MICROS`
-- `UINT64_MILLIS`
-- `UINT64_SECONDS`
-
-The enum is defined in [csp/adapters/utils.py](https://github.com/Point72/csp/blob/main/csp/adapters/utils.py#L5).
-
-Note the `JSONTextMessageMapper` currently does not have support for lists.
-To subscribe to JSON data with lists, simply subscribe using the `RawTextMessageMapper` and process the text into JSON (e.g. via `json.loads`).
-
-## Subscribing and Publishing
-
-Once you have a `KafkaAdapterManager` object and a `MsgMapper` object, you can subscribe to topics using the following method:
-
-```python
-KafkaAdapterManager.subscribe(
-    ts_type: type,
-    msg_mapper: MsgMapper,
-    topic: str,
-    key=None,
-    field_map: typing.Union[dict,str] = None,
-    meta_field_map: dict = None,
-    push_mode: csp.PushMode = csp.PushMode.LAST_VALUE,
-    adjust_out_of_order_time: bool = False
-):
-```
-
-- **`ts_type`**: the timeseries type you want to get the data on. This can be a `csp.Struct` or basic timeseries type
-- **`msg_mapper`**: the `MsgMapper` object discussed above
-- **`topic`**: the topic to subscribe to
-- **`key`**: the key to subscribe to. If `None`, then this will subscribe to all messages on the topic. Note that in this "wildcard" mode, all messages will tick as "live" since replay in engine time cannot be supported
-- **`field_map`**: dictionary of `{message_field: struct_field}` to define how the subscribed message gets mapped onto the struct
-- **`meta_field_map`**: to extract meta information from the kafka message, provide a meta_field_map dictionary of meta field info → struct field name to place it into.
-  The following meta fields are currently supported:
-  - **`"partition"`**: which partition the message came from
-  - **`"offset"`**: the kafka offset of the given message
-  - **`"live"`**: whether this message is "live" and not being replayed
-  - **`"timestamp"`**: timestamp of the kafka message
-  - **`"key"`**: key of the message
-- **`push_mode`**: `csp.PushMode` (LAST_VALUE, NON_COLLAPSING, BURST)
-- **`adjust_out_of_order_time`**: in some cases it has been seen that kafka can produce out of order messages, even for the same key.
-  This allows the adapter to be more lax and let them through by forcing time to max(time, prev time)
-
-Similarly, you can publish on topics using the following method:
-
-```python
-KafkaAdapterManager.publish(
-    msg_mapper: MsgMapper,
-    topic: str,
-    key: str,
-    x: ts['T'],
-    field_map: typing.Union[dict,str] = None
-):
-```
-
-- **`msg_mapper`**: same as above
-- **`topic`**: same as above
-- **`key`**: key to publish to
-- **`x`**: the timeseries to publish
-- **`field_map`**: dictionary of `{struct_field: message_field}` to define how the struct gets mapped onto the published message.
-  Note this dictionary is the opposite of the field_map in subscribe()
-
-## Known Issues
-
-If you are having issues, such as not getting any output or the application simply locking up, start by ensuring that you are logging the adapter's `status()` with a `csp.print`/`log` call and set `debug=True`.
-Then follow the known issues below.
-
-- Reason: `GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (No Kerberos credentials available)`
-
-  - **Resolution**: Kafka uses Kerberos tickets for authentication. You need to set up a Kerberos token first
-
-- `Message received on unknown topic: errcode: Broker: Group authorization failed error: FindCoordinator response error: Group authorization failed.`
-
-  - **Resolution**: Kafka brokers running on Windows are case-sensitive to the Kerberos token.
When creating a Kerberos token with kinit, make sure to use the principal name with a case-sensitive user id.
-
-- `authentication: SASL handshake failed (start (-4)): SASL(-4): no mechanism available: No worthy mechs found (after 0ms in state AUTH_REQ)`
-
-  - **Resolution**: cyrus-sasl-gssapi needs to be installed on the box for Kafka Kerberos authentication
-
-- `Message error on topic "an-example-topic". errcode: Broker: Topic authorization failed error: Subscribed topic not available: an-example-topic: Broker: Topic authorization failed)`
-
-  - **Resolution**: The user account does not have access to the topic
-
-# Parquet
-
-## ParquetReader
-
-The `ParquetReader` adapter is a generic user adapter to stream data from [Apache Parquet](https://parquet.apache.org/) files as a CSP time series.
-The `ParquetReader` adapter supports only flat (non-hierarchical) parquet files with all the primitive types that are supported by the CSP framework.
-
-### API
-
-```python
-ParquetReader(
-    self,
-    filename_or_list,
-    symbol_column=None,
-    time_column=None,
-    tz=None
-):
-    """
-    :param filename_or_list: The specifier of the file/files to be read. Can be either:
-        - Instance of str, in which case it's interpreted as the path of a single file to be read
-        - A callable, in which case it's interpreted as a generator function that will be called like f(starttime, endtime) where starttime and endtime
-          are the start and end times of the current engine run. It's expected to generate a sequence of filenames to read.
-        - Iterable container, for example a list of files to read
-    :param symbol_column: An optional parameter that specifies the name of the symbol column in the file, if there is any
-    :param time_column: A mandatory specification of the time column name in the parquet files. This column will be used to inject the row values
-        from parquet at the given timestamps.
- :param tz: The pytz timezone of the timestamp column, should only be provided if the time_column in parquet file doesn't have tz info. -""" -``` - -### Subscription - -```python -def subscribe( - self, - symbol, - typ, - field_map=None, - push_mode: csp.PushMode = csp.PushMode.NON_COLLAPSING -): - """Subscribe to the rows corresponding to a given symbol - This form of subscription can be used only if non empty symbol_column was supplied during ParquetReader construction. - :param symbol: The symbol to subscribe to, for example 'AAPL' - :param typ: The type of the CSP time series subscription. Can either be a primitive type like int or alternatively a type - that inherits from csp.Struct, in which case each instance of the struct will be constructed from the matching file columns. - :param field_map: A map of the fields from parquet columns for the CSP time series. If typ is a primitive, then field_map should be - a string specifying the column name, if typ is a csp Struct then field_map should be a str->str dictionary of the form - {column_name:struct_field_name}. For structs field_map can be omitted in which case we expect a one to one match between the given Struct - fields and the parquet files columns. - :param push_mode: A push mode for the output adapter - """ - -def subscribe_all( - self, - typ, - field_map=None, - push_mode: csp.PushMode = csp.PushMode.NON_COLLAPSING -): - """Subscribe to all rows of the input files. - :param typ: The type of the CSP time series subscription. Can either be a primitive type like int or alternatively a type - that inherits from csp.Struct, in which case each instance of the struct will be constructed from the matching file columns. - :param field_map: A map of the fields from parquet columns for the CSP time series. If typ is a primitive, then field_map should be - a string specifying the column name, if typ is a csp Struct then field_map should be a str->str dictionary of the form - {column_name:struct_field_name}. 
For structs field_map can be omitted in which case we expect a one-to-one match between the given Struct
-        fields and the parquet file columns.
-    :param push_mode: A push mode for the output adapter
-    """
-```
-
-Parquet reader provides two subscription methods.
-**`subscribe`** produces a time series of only the rows that correspond to the given symbol, while
-**`subscribe_all`** produces a time series of all rows in the parquet files.
-
-## ParquetWriter
-
-The ParquetWriter adapter is a generic user adapter to stream data from CSP time series to [Apache Parquet](https://parquet.apache.org/) files.
-The `ParquetWriter` adapter supports only flat (non-hierarchical) parquet files with all the primitive types that are supported by the `csp` framework.
-Any time series of Struct objects will be flattened to multiple columns.
-
-### Construction
-
-```python
-ParquetWriter(
-    self,
-    file_name: Optional[str],
-    timestamp_column_name,
-    config: Optional[ParquetOutputConfig] = None,
-    filename_provider: Optional[csp.ts[str]] = None
-):
-    """
-    :param file_name: The path of the output parquet file. Must be provided if no filename_provider is specified. If both file_name and filename_provider are specified then file_name will be used as the initial output file name until filename_provider provides a new file name.
-    :param timestamp_column_name: Required field; if None is provided then no timestamp will be written.
-    :param config: Optional configuration of how the file should be written (such as compression, block size, ...).
-    :param filename_provider: An optional time series of file paths. When the filename_provider time series provides a new file path, the previously open file will be closed and all subsequent data will be written to the new file provided by the path. This enables partitioning and splitting the data based on time.
- """ -``` - -### Publishing - -```python -def publish_struct( - self, - value: ts[csp.Struct], - field_map: Dict[str, str] = None -): - """Publish a time series of csp.Struct objects to file - - :param value: The time series of Struct objects that should be published. - :param field_map: An optional dict str->str of the form {struct_field_name:column_name} that maps the names of the - structure fields to the column names to which the values should be written. If the field_map is non None, then only - the fields that are specified in the field_map will be written to file. If field_map is not provided then all fields - of a structure will be written to columns that match exactly the field_name. - """ - -def publish( - self, - column_name, - value: ts[object] -): - """Publish a time series of primitive type to file - :param column_name: The name of the parquet file column to which the data should be written to - :param value: The time series that should be published - """ -``` - -Parquet writer provides two publishing methods. -**`publish_struct`** is used to publish time series of **`csp.Struct`** objects while **`publish`** is used to publish primitive time series. -The columns in the written parquet file is a union of all columns that were published (the order is preserved). -A new row is written to parquet file whenever any of the inputs ticks. -For the given row, any column that corresponds to a time series that didn't tick, will have null values. 
- -### Example of using ParquetReader and ParquetWriter - -```python -import tempfile -from datetime import datetime, timedelta - -import csp -from csp.adapters.parquet import ParquetOutputConfig, ParquetReader, ParquetWriter - - -class Dummy(csp.Struct): - int_val: int - float_val: float - - -@csp.graph -def write_struct(file_name: str): - st = datetime(2020, 1, 1) - - curve = csp.curve(Dummy, [(st + timedelta(seconds=1), Dummy(int_val=1, float_val=1.0)), - (st + timedelta(seconds=2), Dummy(int_val=2, float_val=2.0)), - (st + timedelta(seconds=3), Dummy(int_val=3, float_val=3.0))]) - writer = ParquetWriter(file_name=file_name, timestamp_column_name='csp_time', - config=ParquetOutputConfig(allow_overwrite=True)) - writer.publish_struct(curve) - - -@csp.graph -def write_series(file_name: str): - st = datetime(2020, 1, 1) - - curve_int = csp.curve(int, [(st + timedelta(seconds=i), i * 5) for i in range(10)]) - curve_str = csp.curve(str, [(st + timedelta(seconds=i), f'str_{i}') for i in range(10)]) - writer = ParquetWriter(file_name=file_name, timestamp_column_name='csp_time', - config=ParquetOutputConfig(allow_overwrite=True)) - writer.publish('int_vals', curve_int) - writer.publish('str_vals', curve_str) - - -@csp.graph -def writer_graph(struct_file_name: str, series_file_name: str): - write_struct(struct_file_name) - write_series(series_file_name) - - -@csp.graph -def reader_graph(series_file_name: str): - reader = ParquetReader(series_file_name, time_column='csp_time') - csp.print('Read as struct', reader.subscribe_all(Dummy)) - csp.print('Read as single int column', reader.subscribe_all(int, 'int_val')) - csp.print('Read as single float column', reader.subscribe_all(float, 'float_val')) - - -if __name__ == '__main__': - with tempfile.NamedTemporaryFile(suffix='.parquet') as struct_file: - struct_file.file.close() - with tempfile.NamedTemporaryFile(suffix='.parquet') as series_file: - series_file.file.close() - g = csp.run(writer_graph, struct_file.name, 
series_file.name,
-                        starttime=datetime(2020, 1, 1), endtime=timedelta(minutes=1))
-            g = csp.run(reader_graph, struct_file.name,
-                        starttime=datetime(2020, 1, 1), endtime=timedelta(minutes=1))
-```
-
-# DBReader
-
-The DBReader adapter is a generic user adapter to stream data from a database as a reactive time series.
-It leverages sqlalchemy internally in order to be able to access various DB backends.
-
-Please refer to the [SQLAlchemy Docs](https://docs.sqlalchemy.org/en/13/core/tutorial.html) for information on how to create sqlalchemy connections.
-
-The DBReader instance represents a single connection to a database.
-From a single reader you can subscribe to various streams: either the entire stream of data (which would basically represent the result of a single join) or, if a symbol column is declared, subscribe by symbol, which will then demultiplex rows to the right adapter.
-
-## API
-
-```python
-DBReader(self, connection, time_accessor, table_name=None, schema_name=None, query=None, symbol_column=None, constraint=None):
-    """
-    :param connection: sqlalchemy engine or (already connected) connection object.
-    :param time_accessor: TimeAccessor object
-    :param table_name: name of table in database as a string
-    :param query: either string query or sqlalchemy query object. Ex: "select * from users"
-    :param symbol_column: name of symbol column in table as a string
-    :param constraint: additional sqlalchemy constraints for query. Ex: constraint = db.text('PRICE>:price').bindparams(price = 100.0)
-    """
-```
-
-- **connection**: sqlalchemy engine or existing connection object.
-- **time_accessor**: see below
-- **table_name**: either table or query is required.
-  If passing a table_name then this table will be queried against for subscribe calls
-- **query**: (optional) if a table isn't supplied the user can provide a direct query string or sqlalchemy query object.
-  This is useful if you want to run a join call.
-  For basic single-table queries passing table_name is preferred
-- **symbol_column**: (optional) in order to be able to demux rows by some column, pass `symbol_column`.
-  An example use case is a database with data stored for many symbols in a single table, where you want to have a timeseries tick per symbol.
-- **constraint**: (optional) additional sqlalchemy constraints for the query. Ex: `constraint = db.text('PRICE>:price').bindparams(price=100.0)`
-
-## TimeAccessor
-
-All data fed into `csp` must be time based.
-`TimeAccessor` is a helper class that defines how to extract timestamp information from the results of the data.
-Users can define their own `TimeAccessor` implementation or use pre-canned ones:
-
-- `TimestampAccessor(self, time_column, tz=None)`: use this if there already exists a single datetime column.
-  Provide the column name and optionally the timezone of the column (if it's timezone-less in the db)
-- `DateTimeAccessor(self, date_column, time_column, tz=None)`: use this if there are two separate columns for date and time; this accessor will combine the two columns to create a single datetime.
-  Optionally pass tz if the time column is timezone-less in the db
-
-User implementations would have to extend the `TimeAccessor` interface.
-In addition to defining how to convert db columns to timestamps, accessors are also used to augment the query to limit the data to the graph's start and end times.
-
-Once you have a DBReader object created, you can subscribe to time series from it using the following methods:
-
-- `subscribe(self, symbol, typ, field_map=None)`
-- `subscribe_all(self, typ, field_map=None)`
-
-Both of these calls expect `typ` to be a `csp.Struct` type.
-`field_map` is a dictionary of `{ db_column : struct_column }` mappings that define how to map the database column names to the fields on the struct.
-
-`subscribe` is used to subscribe to a stream for the given symbol (symbol_column is required when creating DBReader).
-
-`subscribe_all` is used to retrieve all the data resulting from the request as a single timeseries.
-
-# Symphony
-
-The Symphony adapter allows for reading and writing of messages from the [Symphony](https://symphony.com/) message platform using [`requests`](https://requests.readthedocs.io/en/latest/) and the [Symphony SDK](https://docs.developers.symphony.com/).
-
-# Slack
-
-The Slack adapter allows for reading and writing of messages from the [Slack](https://slack.com) message platform using the [Slack Python SDK](https://slack.dev/python-slack-sdk/).
-
-# Writing Input and Output Adapters
-
-## Input Adapters
-
-There are two main categories of input adapters: historical and realtime.
-When writing historical adapters you will need to implement a "pull" adapter, which pulls data from a historical data source in time order, one event at a time.
-There are also ManagedSimAdapters for feeding multiple "managed" pull adapters from a single source (more on that below).
-When writing realtime adapters, you will need to implement a "push" adapter, which will get data from a separate thread that drives external events and "pushes" them into the engine as they occur.
-
-When writing input adapters it is also very important to note the difference between "graph building time" and "runtime" versions of your adapter.
-For example, `csp.adapters.csv` has a `CSVReader` class that is used at graph building time.
-**Graph build time components** solely *describe* the adapter.
-They are meant to do little else than keep track of the type of adapter and its parameters, which will then be used to construct the actual adapter implementation when the engine is constructed from the graph description.
-It is the runtime implementation that actually runs during the engine execution phase to process data.
-
-For clarity of this distinction, in the descriptions below we will denote graph build time components with *--graph--* and runtime implementations with *--impl--*.
-
-### Historical Adapters
-
-There are two flavors of historical input adapters that can be written.
-The simplest one is a PullInputAdapter.
-A PullInputAdapter can be used to convert a single source into a single timeseries.
-The csp.curve implementation is a good example of this.
-Single source to single timeseries adapters are of limited use, however; the more typical use case is for AdapterManager based input adapters to service multiple InputAdapters from a single source.
-For this one would use an AdapterManager to coordinate processing of the data source, and ManagedSimInputAdapter as the individual timeseries providers.
-
-#### PullInputAdapter - Python
-
-To write a Python based PullInputAdapter one must write a class that derives from csp.impl.pulladapter.PullInputAdapter.
-The derived type should then define two methods:
-
-- `def start(self, start_time, end_time)`: this will be called at the start of the engine with the start/end times of the engine.
-  start_time and end_time will be tz-unaware datetime objects in UTC time.
-  At this point the adapter should open its resource and seek to the requested starttime.
-- `def next(self)`: this method will be repeatedly called by the engine.
-  The adapter should return the next event as a (time, value) tuple.
-  If there are no more events, then the method should return None.
-
-The PullInputAdapter that you define will be used as the runtime *--impl--*.
-You also need to define a *--graph--* time representation of the time series edge.
-In order to do this you should define a csp.impl.wiring.py_pull_adapter_def.
-The py_pull_adapter_def creates a *--graph--* time representation of your adapter:
-
-```python
-def py_pull_adapter_def(name, adapterimpl, out_type, **kwargs)
-```
-
-- **`name`**: string name for the adapter
-- **`adapterimpl`**: a derived implementation of csp.impl.pulladapter.PullInputAdapter
-- **`out_type`**: the type of the output, should be a `ts[]` type. Note this can use tvar types if a subsequent argument defines the tvar
-- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the PullInputAdapter implementation
-
-Note that the \*\*kwargs passed to py_pull_adapter_def should be the names and types of the variables, like arg1=type1, arg2=type2.
-These are the names of the kwargs that the returned input adapter will take and pass through to the PullInputAdapter implementation, and the types expected for the values of those args.
-
-csp.curve is a good simple example of this:
-
-```python
-import copy
-from csp.impl.pulladapter import PullInputAdapter
-from csp.impl.wiring import py_pull_adapter_def
-from csp import ts
-from datetime import timedelta
-
-
-class Curve(PullInputAdapter):
-    def __init__(self, typ, data):
-        ''' data should be a list of tuples of (datetime, value) or (timedelta, value)'''
-        self._data = data
-        self._index = 0
-        super().__init__()
-
-    def start(self, start_time, end_time):
-        if isinstance(self._data[0][0], timedelta):
-            self._data = copy.copy(self._data)
-            for idx, data in enumerate(self._data):
-                self._data[idx] = (start_time + data[0], data[1])
-
-        while self._index < len(self._data) and self._data[self._index][0] < start_time:
-            self._index += 1
-
-        super().start(start_time, end_time)
-
-    def next(self):
-        if self._index < len(self._data):
-            time, value = self._data[self._index]
-            if time <= self._end_time:
-                self._index += 1
-                return time, value
-        return None
-
-
-curve = py_pull_adapter_def('curve', Curve, ts['T'], typ='T', data=list)
-```
-
-Now curve can be called in graph code to create a curve input
adapter:
-
-```python
-x = csp.curve(int, [ (t1, v1), (t2, v2), .. ])
-csp.print('x', x)
-```
-
-See the example "e_14_user_adapters_01_pullinput.py" for more details.
-
-#### PullInputAdapter - C++
-
-**Step 1)** PullInputAdapter impl
-
-The C++ API is similar to the Python PullInputAdapter API, and one can leverage it to improve the performance of an adapter implementation.
-The *--impl--* is very similar to the Python pull adapter.
-One should derive from `PullInputAdapter`, a templatized base class (templatized on the type of the timeseries) and define these methods:
-
-- **`start(DateTime start, DateTime end)`**: similar to the Python API start, called when the engine starts.
-  Open the resource and seek to the start time here
-- **`stop()`**: called on engine shutdown, clean up the resource
-- **`bool next(DateTime & t, T & value)`**: if there is data to provide, sets the next time and value for the adapter and returns true.
-  Otherwise, return false
-
-**Step 2)** Expose creator func to python
-
-Now that we have a C++ impl defined, we need to expose a Python creator for it.
-Define a method that conforms to the signature
-
-```cpp
-csp::InputAdapter * create_my_adapter(
-    csp::AdapterManager * manager,
-    PyEngine * pyengine,
-    PyTypeObject * pyType,
-    PushMode pushMode,
-    PyObject * args)
-```
-
-- **`manager`**: will be nullptr for pull adapters
-- **`pyengine`**: PyEngine engine wrapper object
-- **`pyType`**: this is the type of the timeseries input adapter to be created as a PyTypeObject.
-  one can switch on this type using switchPyType to create the properly typed instance
-- **`pushMode`**: the csp PushMode for the adapter (pass through to base InputAdapter)
-- **`args`**: arguments to pass to the adapter impl
-
-Then simply register the creator method:
-
-**`REGISTER_INPUT_ADAPTER(_my_adapter, create_my_adapter)`**
-
-This will register the method name onto your Python module, to be accessed as your_module.method_name.
-Note this uses csp/python/InitHelpers which is used in the \_cspimpl module.
-To do this in a separate python module, you need to register InitHelpers in that module.
-
-**Step 3)** Define your *--graph--* time adapter
-
-It is now a one-liner to wrap your impl in a graph time construct using csp.impl.wiring.input_adapter_def:
-
-```python
-my_adapter = input_adapter_def('my_adapter', my_module._my_adapter, ts[int], arg1=int, arg2={str:'foo'})
-```
-
-my_adapter can now be called with arg1, arg2 to create adapters in your graph.
-Note that the arguments are typed using v=t syntax. v=(t,default) is used to define arguments with defaults.
-
-Also note that all input adapters implicitly get a push_mode argument that is defaulted to csp.PushMode.LAST_VALUE.
-
-#### ManagedSimInputAdapter - Python
-
-In most cases you will likely want to expose a single source of data into multiple input adapters.
-For this use case your adapter should define an AdapterManager *--graph--* time component, and AdapterManagerImpl *--impl--* runtime component.
-The AdapterManager *--graph--* time component just represents the parameters needed to create the *--impl--* AdapterManager.
-It's the *--impl--* that will have the actual implementation that will open the data source, parse the data and provide it to individual Adapters.
-
-Similarly you will need to define a derived ManagedSimInputAdapter *--impl--* component to handle events directed at an individual time series adapter.
-
-**NOTE** It is highly recommended not to open any resources in the *--graph--* time component.
-Graph time components can be pruned and/or memoized into a single instance, so opening resources at graph time shouldn't be necessary.
-
-#### AdapterManager - **--graph-- time**
-
-The graph time AdapterManager doesn't need to derive from any interface.
-It should be initialized with any information the impl needs in order to open/process the data source (i.e. CSV file, time column, db connection information, etc.).
-It should also have an API to create individual timeseries adapters.
-These adapters will then get passed the adapter manager *--impl--* as an argument when they are created, so that they can register themselves for processing.
-The AdapterManager also needs to define a **\_create** method.
-The **\_create** is the bridge between the *--graph--* time AdapterManager representation and the runtime *--impl--* object.
-**\_create** will be called on the *--graph--* time AdapterManager which will in turn create the *--impl--* instance.
-\_create will get two arguments, engine (this represents the runtime engine object that will run the graph) and a memo dict which can optionally be used for any memoization that one might want.
-
-Let's take a look at CSVReader as an example:
-
-```python
-# GRAPH TIME
-class CSVReader:
-    def __init__(self, filename, time_converter, delimiter=',', symbol_column=None):
-        self._filename = filename
-        self._symbol_column = symbol_column
-        self._delimiter = delimiter
-        self._time_converter = time_converter
-
-    def subscribe(self, symbol, typ, field_map=None):
-        return CSVReadAdapter(self, symbol, typ, field_map)
-
-    def _create(self, engine, memo):
-        return CSVReaderImpl(engine, self)
-```
-
-- **`__init__`**: as you can see, all `__init__` does is keep the parameters that the impl will need.
-- **`subscribe`**: API to create an individual timeseries / edge from this file for the given symbol.
-  typ denotes the type of the timeseries to create (i.e. `ts[int]`) and field_map is used for mapping columns onto csp.Struct types.
-  Note that subscribe returns a CSVReadAdapter instance.
-  CSVReadAdapter is the *--graph--* time representation of the edge (similar to how we defined csp.curve above).
-  We pass it `self` as its first argument, which will be used to create the AdapterManager *--impl--*
-- **`_create`**: the method to create the *--impl--* object from the given *--graph--* time representation of the manager
-
-The CSVReader would then be used in graph building code like so:
-
-```python
-reader = CSVReader('my_data.csv', time_formatter, symbol_column='SYMBOL', delimiter='|')
-# aapl will represent a ts[PriceQuantity] edge that will tick with rows from
-# the csv file matching on SYMBOL column AAPL
-aapl = reader.subscribe('AAPL', PriceQuantity)
-```
-
-##### AdapterManager - **--impl-- runtime**
-
-The AdapterManager *--impl--* is responsible for opening the data source, parsing and processing all the data and managing all the adapters it needs to feed.
-The impl class should derive from csp.impl.adaptermanager.AdapterManagerImpl and implement the following methods:
-
-- **`start(self, starttime, endtime)`**: this is called when the engine starts up.
-  At this point the impl should open the resource providing the data and seek to starttime.
-  starttime/endtime will be tz-unaware datetime objects in UTC time
-- **`stop(self)`**: this is called at the end of the run, resources should be cleaned up at this point
-- **`process_next_sim_timeslice(self, now)`**: this method will be called multiple times through the run.
-  On the initial call, now will be starttime.
-  The impl's responsibility is to process all data at the given timestamp (more on how to do this below).
-  The method should return the next time in the data source, or None if there is no more data to process.
-  The method will be called again with the provided timestamp as "now" in the next iteration.
-  **NOTE** that process_next_sim_timeslice is required to move ahead in time.
-  In most cases the resource data can be supplied in time order; if not, it would have to be sorted up front.
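The timeslice contract can be sketched in plain Python. `ToySliceSource` below is an illustrative stand-in, not a csp class; it only demonstrates the "process everything at `now`, then return the next distinct time or None" loop:

```python
from datetime import datetime, timedelta


class ToySliceSource:
    """Illustrative stand-in for an AdapterManagerImpl over time-ordered (time, value) rows."""
    def __init__(self, rows):
        self._rows = rows   # assumed already in time order, as the engine requires
        self._index = 0

    def process_next_sim_timeslice(self, now):
        # process every row stamped exactly `now` (a real impl would push_tick here) ...
        while self._index < len(self._rows) and self._rows[self._index][0] == now:
            self._index += 1
        # ... then return the next distinct time, or None once the data is exhausted
        if self._index < len(self._rows):
            return self._rows[self._index][0]
        return None


t0 = datetime(2020, 1, 1)
src = ToySliceSource([(t0, "a"), (t0, "b"), (t0 + timedelta(seconds=1), "c")])
assert src.process_next_sim_timeslice(t0) == t0 + timedelta(seconds=1)
assert src.process_next_sim_timeslice(t0 + timedelta(seconds=1)) is None
```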
-
-process_next_sim_timeslice should parse data for a given time/row of data and then push it through to any registered ManagedSimInputAdapter that matches on the given row.
-
-##### ManagedSimInputAdapter - **--impl-- runtime**
-
-Users will need to define ManagedSimInputAdapter derived types to represent the individual timeseries adapter *--impl--* objects.
-Objects should derive from csp.impl.adaptermanager.ManagedSimInputAdapter.
-
-ManagedSimInputAdapter.`__init__` takes two arguments:
-
-- **`typ`**: this is the type of the timeseries, i.e. int for a `ts[int]`
-- **`field_map`**: Optional, field_map is a dictionary used to map source column names → csp.Struct field names.
-
-ManagedSimInputAdapter defines a method `push_tick()` which takes the value to feed the input for the given timeslice (as defined by "now" at the adapter manager level).
-There is also a convenience method called `process_dict()` which will take a dictionary of `{column : value}` entries and convert it properly into the right value based on the given **field_map**.
-
-##### ManagedSimInputAdapter - **--graph-- time**
-
-As with the csp.curve example, we need to define a graph-time construct that represents a ManagedSimInputAdapter edge.
-In order to define this we use py_managed_adapter_def.
-py_managed_adapter_def is AdapterManager-aware and will properly create the AdapterManager *--impl--* the first time it's encountered.
-It will then pass the manager impl as an argument to the ManagedSimInputAdapter.
-
-```python
-def py_managed_adapter_def(name, adapterimpl, out_type, manager_type, **kwargs):
-"""
-Create a graph representation of a python managed sim input adapter.
-:param name: string name for the adapter
-:param adapterimpl: a derived implementation of csp.impl.adaptermanager.ManagedSimInputAdapter
-:param out_type: the type of the output, should be a ts[] type.
Note this can use tvar types if a subsequent argument defines the tvar
-:param manager_type: the type of the graph time representation of the AdapterManager that will manage this adapter
-:param kwargs: **kwargs will be passed through as arguments to the ManagedSimInputAdapter implementation
-the first argument to the implementation will be the adapter manager impl instance
-"""
-```
-
-##### Example - CSVReader
-
-Putting this all together, let's take a look at a CSVReader implementation
-and step through what's going on:
-
-```python
-import csv as pycsv
-from datetime import datetime
-
-from csp import ts
-from csp.impl.adaptermanager import AdapterManagerImpl, ManagedSimInputAdapter
-from csp.impl.wiring import py_managed_adapter_def
-
-# GRAPH TIME
-class CSVReader:
-    def __init__(self, filename, time_converter, delimiter=',', symbol_column=None):
-        self._filename = filename
-        self._symbol_column = symbol_column
-        self._delimiter = delimiter
-        self._time_converter = time_converter
-
-    def subscribe(self, symbol, typ, field_map=None):
-        return CSVReadAdapter(self, symbol, typ, field_map)
-
-    def _create(self, engine, memo):
-        return CSVReaderImpl(engine, self)
-```
-
-Here we define CSVReader, our AdapterManager *--graph--* time representation.
-It holds the parameters that will be used for the impl, it implements a `subscribe()` call for users to create timeseries and defines a \_create method to create a runtime *--impl--* instance from the graph time representation.
-Note how in `subscribe` we pass self to the CSVReadAdapter; this is what binds the input adapter to this AdapterManager
-
-```python
-# RUN TIME
-class CSVReaderImpl(AdapterManagerImpl):                        # 1
-    def __init__(self, engine, adapterRep):                     # 2
-        super().__init__(engine)                                # 3
-                                                                # 4
-        self._rep = adapterRep                                  # 5
-        self._inputs = {}                                       # 6
-        self._csv_reader = None                                 # 7
-        self._next_row = None                                   # 8
-                                                                # 9
-    def start(self, starttime, endtime):                        # 10
-        self._csv_reader = pycsv.DictReader(                    # 11
-            open(self._rep._filename, 'r'),                     # 12
-            delimiter=self._rep._delimiter                      # 13
-        )                                                       # 14
-        self._next_row = None                                   # 15
-                                                                # 16
-        for row in self._csv_reader:                            # 17
-            time = self._rep._time_converter(row)               # 18
-            self._next_row = row                                # 19
-            if time >= starttime:                               # 20
-                break                                           # 21
-                                                                # 22
-    def stop(self):                                             # 23
-        self._csv_reader = None                                 # 24
-                                                                # 25
-    def register_input_adapter(self, symbol, adapter):          # 26
-        if symbol not in self._inputs:                          # 27
-            self._inputs[symbol] = []                           # 28
-        self._inputs[symbol].append(adapter)                    # 29
-                                                                # 30
-    def process_next_sim_timeslice(self, now):                  # 31
-        if not self._next_row:                                  # 32
-            return None                                         # 33
-                                                                # 34
-        while True:                                             # 35
-            time = self._rep._time_converter(self._next_row)    # 36
-            if time > now:                                      # 37
-                return time                                     # 38
-            self.process_row(self._next_row)                    # 39
-            try:                                                # 40
-                self._next_row = next(self._csv_reader)         # 41
-            except StopIteration:                               # 42
-                return None                                     # 43
-                                                                # 44
-    def process_row(self, row):                                 # 45
-        symbol = row[self._rep._symbol_column]                  # 46
-        if symbol in self._inputs:                              # 47
-            for input in self._inputs.get(symbol, []):          # 48
-                input.process_dict(row)                         # 49
-```
-
-CSVReaderImpl is the runtime *--impl--*.
-It gets created when the engine is being built from the described graph.
-
-- **lines 10-21 - start()**: this is the start method that gets called with the time range the graph will be run against.
-  Here we open our resource (pycsv.DictReader) and scan through the data until we reach the requested starttime.
-
-- **lines 23-24 - stop()**: this is the stop call that gets called when the engine is done running and is shut down, we free our resource here
-
-- **lines 26-29**: the CSVReader allows one to subscribe to many symbols from one file.
-  symbols are keyed by a provided SYMBOL column.
-  The individual adapters will self-register with the CSVReaderImpl when they are created with the requested symbol.
-  CSVReaderImpl keeps track of what adapters have been registered for what symbol in its self.\_inputs map
-
-- **lines 31-43**: this is the main method that gets invoked repeatedly throughout the run.
-  For every distinct timestamp in the file, this method will get invoked once and the method is expected to go through the resource data for all points with time now, process the row and push the data to any matching adapters.
-  The method returns the next timestamp when it's done processing all data for "now", or None if there is no more data.
-  **NOTE** that the csv impl expects the data to be in time order.
-  process_next_sim_timeslice must advance time forward.
-
-- **lines 45-49**: this method takes a row of data (provided as a dict from DictReader), extracts the symbol and pushes the row through to all input adapters that match
-
-```python
-class CSVReadAdapterImpl(ManagedSimInputAdapter):               # 1
-    def __init__(self, managerImpl, symbol, typ, field_map):    # 2
-        managerImpl.register_input_adapter(symbol, self)        # 3
-        super().__init__(typ, field_map)                        # 4
-                                                                # 5
-CSVReadAdapter = py_managed_adapter_def(                        # 6
-    'csvadapter',
-    CSVReadAdapterImpl,
-    ts['T'],
-    CSVReader,
-    symbol=str,
-    typ='T',
-    field_map=(object, None)
-)
-```
-
-- **line 3**: this is where the instance of an adapter *--impl--* registers itself with the CSVReaderImpl.
-- **line 6+**: this is where we define CSVReadAdapter, the *--graph--* time representation of a CSV adapter, returned from CSVReader.subscribe
-
-See example "e_14_user_adapters_02_adaptermanager_siminput" for another example of how to write a managed sim adapter manager.
-
-### Realtime Adapters
-
-#### PushInputAdapter - Python
-
-To write a Python based PushInputAdapter one must write a class that derives from csp.impl.pushadapter.PushInputAdapter.
-The derived type should then define two methods:
-
-- `def start(self, start_time, end_time)`: this will be called at the start of the engine with the start/end times of the engine.
-  start_time and end_time will be tz-unaware datetime objects in UTC time (generally these aren't needed for realtime adapters).
-  At this point the adapter should open its resource / connect the data source / start any driver threads that are needed.
-- `def stop(self)`: this method will be called when the engine is done running.
-  At this point any open threads should be stopped and resources cleaned up.
-
-The PushInputAdapter that you define will be used as the runtime *--impl--*.
-You also need to define a *--graph--* time representation of the time series edge.
-In order to do this you should define a csp.impl.wiring.py_push_adapter_def.
-The py_push_adapter_def creates a *--graph--* time representation of your adapter:
-
-```python
-def py_push_adapter_def(name, adapterimpl, out_type, **kwargs)
-```
-
-- **`name`**: string name for the adapter
-- **`adapterimpl`**: a derived implementation of csp.impl.pushadapter.PushInputAdapter
-- **`out_type`**: the type of the output, should be a `ts[]` type.
-  Note this can use tvar types if a subsequent argument defines the tvar
-- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the PushInputAdapter implementation
-
-Note that the \*\*kwargs passed to py_push_adapter_def should be the names and types of the variables, like arg1=type1, arg2=type2.
-These are the names of the kwargs that the returned input adapter will take and pass through to the PushInputAdapter implementation, and the types expected for the values of those args.
-
-Example e_14_user_adapters_03_pushinput.py demonstrates a simple example of this:
-
-```python
-from csp.impl.pushadapter import PushInputAdapter
-from csp.impl.wiring import py_push_adapter_def
-import csp
-from csp import ts
-from datetime import datetime, timedelta
-import threading
-import time
-
-
-# The Impl object is created at runtime when the graph is converted into the runtime engine
-# it does not exist at graph building time!
-class MyPushAdapterImpl(PushInputAdapter):
-    def __init__(self, interval):
-        print("MyPushAdapterImpl::__init__")
-        self._interval = interval
-        self._thread = None
-        self._running = False
-
-    def start(self, starttime, endtime):
-        """ start will get called at the start of the engine, at which point the push
-        input adapter should start its thread that will push the data onto the adapter. Note
-        that push adapters will ALWAYS have a separate thread driving ticks into the csp engine thread
-        """
-        print("MyPushAdapterImpl::start")
-        self._running = True
-        self._thread = threading.Thread(target=self._run)
-        self._thread.start()
-
-    def stop(self):
-        """ stop will get called at the end of the run, at which point resources should
-        be cleaned up
-        """
-        print("MyPushAdapterImpl::stop")
-        if self._running:
-            self._running = False
-            self._thread.join()
-
-    def _run(self):
-        counter = 0
-        while self._running:
-            self.push_tick(counter)
-            counter += 1
-            time.sleep(self._interval.total_seconds())
-
-
-# MyPushAdapter is the graph-building time construct. This is simply a representation of what the
-# input adapter is and how to create it, including the Impl to create and arguments to pass into it
-MyPushAdapter = py_push_adapter_def('MyPushAdapter', MyPushAdapterImpl, ts[int], interval=timedelta)
-```
-
-Note how the `_run` thread loop calls **self.push_tick**.
-This is the call that ticks data from the adapter thread into the csp engine.
-
-Now MyPushAdapter can be called in graph code to create a timeseries that is sourced by MyPushAdapterImpl:
-
-```python
-@csp.graph
-def my_graph():
-    # At this point we create the graph-time representation of the input adapter. This will be converted
-    # into the impl once the graph is done constructing and the engine is created in order to run
-    data = MyPushAdapter(timedelta(seconds=1))
-    csp.print('data', data)
-```
-
-#### GenericPushAdapter
-
-If you don't need as much control as PushInputAdapter provides, or if you have some existing source of data on a thread you can't control, another option is to use the higher-level abstraction csp.GenericPushAdapter.
-csp.GenericPushAdapter wraps a csp.PushInputAdapter implementation internally and provides a simplified interface.
-The downside of csp.GenericPushAdapter is that you lose some control of when the input feed starts and stops.
-
-Let's take a look at the example found in "e_14_generic_push_adapter":
-
-```python
-# This is an example of some separate thread providing data
-class Driver:
-    def __init__(self, adapter : csp.GenericPushAdapter):
-        self._adapter = adapter
-        self._active = False
-        self._thread = None
-
-    def start(self):
-        self._active = True
-        self._thread = threading.Thread(target=self._run)
-        self._thread.start()
-
-    def stop(self):
-        if self._active:
-            self._active = False
-            self._thread.join()
-
-    def _run(self):
-        print("driver thread started")
-        counter = 0
-        # Optionally, we can wait for the adapter to start before proceeding
-        # Alternatively we can start pushing data, but push_tick may fail and return False if
-        # the csp engine isn't ready yet
-        self._adapter.wait_for_start()
-
-        while self._active and not self._adapter.stopped():
-            self._adapter.push_tick(counter)
-            counter += 1
-            time.sleep(1)
-
-@csp.graph
-def my_graph():
-    adapter = csp.GenericPushAdapter(int)
-    driver = Driver(adapter)
-    # Note that
the driver thread starts *before* the engine is started here, which means some ticks may potentially get dropped if the
-    # data source doesn't wait for the adapter to start. This may be ok for some feeds, but not others
-    driver.start()
-
-    # Let's be nice and shut down the driver thread when the engine is done
-    csp.schedule_on_engine_stop(driver.stop)
-```
-
-In this example we have this dummy Driver class which simply represents some external source of data which arrives on a thread that's completely independent of the engine.
-We pass along a csp.GenericPushAdapter instance to this thread, which can then call adapter.push_tick to get data into the engine (the `push_tick` call in `_run`).
-
-The `wait_for_start` call in `_run` shows an optional feature which allows the unrelated thread to wait for the adapter to be ready to accept data before ticking data onto it.
-If push_tick is called before the engine starts / the adapter is ready to receive data, it will simply drop the data.
-Note that GenericPushAdapter.push_tick will return a bool to indicate whether the data was successfully pushed to the engine or not.
-
-### Realtime AdapterManager
-
-In most cases you will likely want to expose a single source of data into multiple input adapters.
-For this use case your adapter should define an AdapterManager *--graph--* time component, and AdapterManagerImpl *--impl--* runtime component.
-The AdapterManager *--graph--* time component just represents the parameters needed to create the *--impl--* AdapterManager.
-It's the *--impl--* that will have the actual implementation that will open the data source, parse the data and provide it to individual Adapters.
-
-Similarly you will need to define a derived PushInputAdapter *--impl--* component to handle events directed at an individual time series adapter.
-
-**NOTE** It is highly recommended not to open any resources in the *--graph--* time component.
-Graph time components can be pruned and/or memoized into a single instance, so opening resources at graph time shouldn't be necessary.
-
-#### AdapterManager - **--graph-- time**
-
-The graph time AdapterManager doesn't need to derive from any interface.
-It should be initialized with any information the impl needs in order to open/process the data source (i.e. ActiveMQ connection information, server host port, multicast channels, config files, etc.).
-It should also have an API to create individual timeseries adapters.
-These adapters will then get passed the adapter manager *--impl--* as an argument when they are created, so that they can register themselves for processing.
-The AdapterManager also needs to define a **\_create** method.
-The **\_create** is the bridge between the *--graph--* time AdapterManager representation and the runtime *--impl--* object.
-**\_create** will be called on the *--graph--* time AdapterManager which will in turn create the *--impl--* instance.
-\_create will get two arguments, engine (this represents the runtime engine object that will run the graph) and a memo dict which can optionally be used for any memoization that one might want.
-
-Let's take a look at the example found in
-"e_14_user_adapters_04_adaptermanager_pushinput":
-
-```python
-# This object represents our AdapterManager at graph time.
It describes the manager's properties
-# and will be used to create the actual impl when it's time to build the engine
-class MyAdapterManager:
-    def __init__(self, interval: timedelta):
-        """
-        Normally one would pass properties of the manager here, i.e. filename,
-        message bus, etc
-        """
-        self._interval = interval
-
-    def subscribe(self, symbol, push_mode=csp.PushMode.NON_COLLAPSING):
-        """ User facing API to subscribe to a timeseries stream from this adapter manager """
-        # This will return a graph-time timeseries edge representing an edge from this
-        # adapter manager for the given symbol / arguments
-        return MyPushAdapter(self, symbol, push_mode=push_mode)
-
-    def _create(self, engine, memo):
-        """ This method will get called at engine build time, at which point the graph time manager representation
-        will create the actual impl that will be used for runtime
-        """
-        # Normally you would pass the arguments down into the impl here
-        return MyAdapterManagerImpl(engine, self._interval)
-```
-
-- **\_\_init\_\_** - as you can see, all \_\_init\_\_ does is keep the parameters that the impl will need.
-- **subscribe** - API to create an individual timeseries / edge from this manager for the given symbol.
-  The interface defined here is up to the adapter writer, but generally "subscribe" is recommended, and it should take any number of arguments needed to define a single stream of data.
-  *MyPushAdapter* is the *--graph--* time representation of the edge, which will be described below.
-  We pass it *self* as its first argument, which will be used to create the AdapterManager *--impl--*
-- **\_create** - the method to create the *--impl--* object from the given *--graph--* time representation of the manager
-
-MyAdapterManager would then be used in graph building code like so:
-
-```python
-adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
-data = adapter_manager.subscribe('AAPL', push_mode=csp.PushMode.LAST_VALUE)
-csp.print('AAPL last_value', data)
-```
-
-#### AdapterManager - **--impl-- runtime**
-
-The AdapterManager *--impl--* is responsible for opening the data source, parsing and processing all the data and managing all the adapters it needs to feed.
-The impl class should derive from csp.impl.adaptermanager.AdapterManagerImpl and implement the following methods:
-
-- **`start(self, starttime, endtime)`**: this is called when the engine starts up.
-  At this point the impl should open the resource providing the data and start up any thread(s) needed to listen to and react to external data.
-  starttime/endtime will be tz-unaware datetime objects in UTC time, though typically these aren't needed for realtime adapters
-- **`stop(self)`**: this is called at the end of the run, resources should be cleaned up at this point
-- **`process_next_sim_timeslice(self, now)`**: this is used by sim adapters, for realtime adapter managers we simply return None
-
-In the example manager, we spawn a processing thread in the `start()` call.
-This thread runs in a loop until it is shut down, and will generate random data to tick out to the registered input adapters.
-Data is passed to a given adapter by calling **push_tick()**.
-
-#### PushInputAdapter - **--impl-- runtime**
-
-Users will need to define PushInputAdapter derived types to represent the individual timeseries adapter *--impl--* objects.
-Objects should derive from csp.impl.pushadapter.PushInputAdapter.
-
-PushInputAdapter defines a method `push_tick()` which takes the value to feed the input timeseries.
-
-#### PushInputAdapter - **--graph-- time**
-
-Similar to the standalone PushInputAdapter described above, we need to define a graph-time construct that represents a PushInputAdapter edge.
-In order to define this we use py_push_adapter_def again, but this time we pass the adapter manager *--graph--* time type so that it gets constructed properly.
-When the PushInputAdapter instance is created it will also receive an instance of the adapter manager *--impl--*, which it can then self-register on.
-
-```python
-def py_push_adapter_def(name, adapterimpl, out_type, manager_type=None, memoize=True, force_memoize=False, **kwargs):
-"""
-Create a graph representation of a python push input adapter.
-:param name: string name for the adapter
-:param adapterimpl: a derived implementation of csp.impl.pushadapter.PushInputAdapter
-:param out_type: the type of the output, should be a ts[] type.
Note this can use tvar types if a subsequent argument defines the tvar
-:param manager_type: the type of the graph time representation of the AdapterManager that will manage this adapter
-:param kwargs: **kwargs will be passed through as arguments to the PushInputAdapter implementation
-the first argument to the implementation will be the adapter manager impl instance
-"""
-```
-
-#### Example
-
-Continuing with the --graph-- time AdapterManager described above, we
-now define the impl:
-
-```python
-# This is the actual manager impl that will be created and executed during runtime
-class MyAdapterManagerImpl(AdapterManagerImpl):
-    def __init__(self, engine, interval):
-        super().__init__(engine)
-
-        # These are just used to simulate a data source
-        self._interval = interval
-        self._counter = 0
-
-        # We will keep track of requested input adapters here
-        self._inputs = {}
-
-        # Our driving thread, all realtime adapters will need a separate thread of execution that
-        # drives data into the engine thread
-        self._running = False
-        self._thread = None
-
-    def start(self, starttime, endtime):
-        """ start will get called at the start of the engine run.
At this point - one would start up the realtime data source / spawn the driving thread(s) and - subscribe to the needed data """ - self._running = True - self._thread = threading.Thread(target=self._run) - self._thread.start() - - def stop(self): - """ This will be called at the end of the engine run, at which point resources should be - closed and cleaned up """ - if self._running: - self._running = False - self._thread.join() - - def register_input_adapter(self, symbol, adapter): - """ Actual PushInputAdapters will self register when they are created as part of the engine - This is the place we gather all requested input adapters and their properties - """ - if symbol not in self._inputs: - self._inputs[symbol] = [] - # Keep a list of adapters by key in case we get duplicate adapters (should be memoized in reality) - self._inputs[symbol].append(adapter) - - def process_next_sim_timeslice(self, now): - """ This method is only used by simulated / historical adapters, for realtime we just return None """ - return None - - def _run(self): - """ Our driving thread, in reality this will be reacting to external events, parsing the data and - pushing it into the respective adapter - """ - symbols = list(self._inputs.keys()) - while self._running: - # Lets pick a random symbol from the requested symbols - symbol = symbols[random.randint(0, len(symbols) - 1)] - adapters = self._inputs[symbol] - data = MyData(symbol=symbol, value=self._counter) - self._counter += 1 - for adapter in adapters: - adapter.push_tick(data) - - time.sleep(self._interval.total_seconds()) -``` - -Then we define our PushInputAdapter --impl--, which basically just -self-registers with the adapter manager --impl-- upon construction. We -also define our PushInputAdapter *--graph--* time construct using `py_push_adapter_def`. - -```python -# The Impl object is created at runtime when the graph is converted into the runtime engine -# it does not exist at graph building time. 
a managed sim adapter impl will get the
-# adapter manager runtime impl as its first argument
-class MyPushAdapterImpl(PushInputAdapter):
-    def __init__(self, manager_impl, symbol):
-        print(f"MyPushAdapterImpl::__init__ {symbol}")
-        manager_impl.register_input_adapter(symbol, self)
-        super().__init__()
-
-
-MyPushAdapter = py_push_adapter_def('MyPushAdapter', MyPushAdapterImpl, ts[MyData], MyAdapterManager, symbol=str)
-```
-
-And then we can run our adapter in a csp graph:
-
-```python
-@csp.graph
-def my_graph():
-    print("Start of graph building")
-
-    adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
-    symbols = ['AAPL', 'IBM', 'TSLA', 'GS', 'JPM']
-    for symbol in symbols:
-        # your data source might tick faster than the engine thread can consume it
-        # push_mode can be used to control how buffered-up tick events get processed
-        # LAST_VALUE will conflate and only tick the latest value since the last cycle
-        data = adapter_manager.subscribe(symbol, csp.PushMode.LAST_VALUE)
-        csp.print(symbol + " last_value", data)
-
-        # BURST will change the timeseries type from ts[T] to ts[[T]] (list of ticks)
-        # that will tick with all values that have buffered since the last engine cycle
-        data = adapter_manager.subscribe(symbol, csp.PushMode.BURST)
-        csp.print(symbol + " burst", data)
-
-        # NON_COLLAPSING will tick all events without collapsing, unrolling the events
-        # over multiple engine cycles
-        data = adapter_manager.subscribe(symbol, csp.PushMode.NON_COLLAPSING)
-        csp.print(symbol + " non_collapsing", data)
-
-    print("End of graph building")
-
-
-csp.run(my_graph, starttime=datetime.utcnow(), endtime=timedelta(seconds=10), realtime=True)
-```
-
-Do note that realtime adapters will only run in realtime engines (note the `realtime=True` argument to `csp.run`).
-
-## Output Adapters
-
-Output adapters are used to define graph outputs, and they differ from input adapters in a number of important ways.
-Output adapters also differ from terminal nodes, e.g.
regular `csp.node` instances that do not define outputs, and instead consume and emit their inputs inside their `csp.ticked` blocks.
-
-For many use cases, it will be sufficient to omit writing an output adapter entirely.
-Consider the following example of a terminal node that writes an input dictionary timeseries to a file.
-
-```python
-@csp.node
-def write_to_file(x: ts[Dict], filename: str):
-    if csp.ticked(x):
-        with open(filename, "a") as fp:
-            fp.write(json.dumps(x))
-```
-
-This is a perfectly fine node, and serves its purpose.
-Unlike input adapters, output adapters do not need to differentiate between *historical* and *realtime* mode.
-Input adapters drive the execution of the graph, whereas output adapters are reactive to their input nodes and subject to the graph's execution.
-
-However, there are a number of reasons why you might want to define an output adapter instead of using a vanilla node.
-The most important of these is when you want to share resources across a number of output adapters (e.g. with a Manager), or between an input and an output node, e.g. reading data from a websocket, routing it through your csp graph, and publishing data *to the same websocket connection*.
-For most use cases, a vanilla csp node will suffice, but let's explore some anyway.
-
-### OutputAdapter - Python
-
-To write a Python based OutputAdapter one must write a class that derives from `csp.impl.outputadapter.OutputAdapter`.
-The derived type should define the method:
-
-- `def on_tick(self, time: datetime, value: object)`: this will be called when the input to the output adapter ticks.
-
-The OutputAdapter that you define will be used as the runtime *--impl--*. You also need to define a *--graph--* time representation of the time series edge.
-In order to do this you should define a csp.impl.wiring.py_output_adapter_def.
-The py_output_adapter_def creates a *--graph--* time representation of your adapter:
-
-**def py_output_adapter_def(name, adapterimpl, \*\*kwargs)**
-
-- **`name`**: string name for the adapter
-- **`adapterimpl`**: a derived implementation of `csp.impl.outputadapter.OutputAdapter`
-- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the OutputAdapter implementation
-
-Note that the `**kwargs` passed to py_output_adapter_def should be the names and types of the variables, like `arg1=type1, arg2=type2`.
-These are the names of the kwargs that the returned output adapter will take and pass through to the OutputAdapter implementation, and the types expected for the values of those args.
-
-Here is a simple example of the same filewriter from above:
-
-```python
-from csp.impl.outputadapter import OutputAdapter
-from csp.impl.wiring import py_output_adapter_def
-from csp import ts
-import csp
-from json import dumps
-from datetime import datetime, timedelta
-
-
-class MyFileWriterAdapterImpl(OutputAdapter):
-    def __init__(self, filename: str):
-        super().__init__()
-        self._filename = filename
-
-    def start(self):
-        self._fp = open(self._filename, "a")
-
-    def stop(self):
-        self._fp.close()
-
-    def on_tick(self, time, value):
-        self._fp.write(dumps(value) + "\n")
-
-
-MyFileWriterAdapter = py_output_adapter_def(
-    name='MyFileWriterAdapter',
-    adapterimpl=MyFileWriterAdapterImpl,
-    input=ts['T'],
-    filename=str,
-)
-```
-
-Now our adapter can be called in graph code:
-
-```python
-@csp.graph
-def my_graph():
-    curve = csp.curve(
-        data=[
-            (timedelta(seconds=0), {"a": 1, "b": 2, "c": 3}),
-            (timedelta(seconds=1), {"a": 1, "b": 2, "c": 3}),
-            (timedelta(seconds=1), {"a": 1, "b": 2, "c": 3}),
-        ],
-        typ=object,
-    )
-
-    MyFileWriterAdapter(curve, filename="testfile.jsonl")
-```
-
-As explained above, we could also do this via a single node (this is probably the best version between the three):
-
-```python
-@csp.node
-def dump_json(data:
ts['T'], filename: str):
-    with csp.state():
-        s_file = None
-    with csp.start():
-        s_file = open(filename, "w")
-    with csp.stop():
-        s_file.close()
-    if csp.ticked(data):
-        s_file.write(json.dumps(data) + "\n")
-        s_file.flush()
-```
-
-### OutputAdapter - C++
-
-TODO
-
-### OutputAdapter with Manager
-
-Adapter managers function the same way for output adapters as for input adapters, i.e. to manage a single shared resource from the manager across a variety of discrete output adapters.
-
-### InputOutputAdapter - Python
-
-As a last example, let's tie everything together and implement a managed push input adapter combined with a managed output adapter.
-This example is available in `e_14_user_adapters_05_adaptermanager_inputoutput`.
-
-First, we will define our adapter manager.
-In this example, we're going to cheat a little bit and combine our adapter manager (graph time) and our adapter manager impl (run time).
-
-```python
-class MyAdapterManager(AdapterManagerImpl):
-    '''
-    This example adapter will generate random `MyData` structs every `interval`. This simulates an upstream
-    data feed, which we "connect" to only a single time. We then multiplex the results to an arbitrary
-    number of subscribers via the `subscribe` method.
-
-    We can also receive messages via the `publish` method from an arbitrary number of publishers. These messages
-    are demultiplexed into a number of outputs, simulating sharing a connection to a downstream feed or responses
-    to the upstream feed.
- ''' - def __init__(self, interval: timedelta): - self._interval = interval - self._counter = 0 - self._subscriptions = {} - self._publications = {} - self._running = False - self._thread = None - - def subscribe(self, symbol): - '''This method creates a new input adapter implementation via the manager.''' - return _my_input_adapter(self, symbol, push_mode=csp.PushMode.NON_COLLAPSING) - - def publish(self, data: ts['T'], symbol: str): - '''This method creates a new output adapter implementation via the manager.''' - return _my_output_adapter(self, data, symbol) - - def _create(self, engine, memo): - # We'll avoid having a second class and make our AdapterManager and AdapterManagerImpl the same - super().__init__(engine) - return self - - def start(self, starttime, endtime): - self._running = True - self._thread = threading.Thread(target=self._run) - self._thread.start() - - def stop(self): - if self._running: - self._running = False - self._thread.join() - - # print closing of the resources - for name in self._publications.values(): - print("closing asset {}".format(name)) - - def register_subscription(self, symbol, adapter): - if symbol not in self._subscriptions: - self._subscriptions[symbol] = [] - self._subscriptions[symbol].append(adapter) - - def register_publication(self, symbol): - if symbol not in self._publications: - self._publications[symbol] = "publication_{}".format(symbol) - - def _run(self): - '''This method runs in a background thread and generates random input events to push to the corresponding adapter''' - symbols = list(self._subscriptions.keys()) - while self._running: - # Lets pick a random symbol from the requested symbols - symbol = symbols[random.randint(0, len(symbols) - 1)] - - data = MyData(symbol=symbol, value=self._counter) - - self._counter += 1 - - for adapter in self._subscriptions[symbol]: - # push to all the subscribers - adapter.push_tick(data) - - time.sleep(self._interval.total_seconds()) - - def _on_tick(self, symbol, 
value): - '''This method just writes the data to the appropriate outbound "channel"''' - print("{}:{}".format(self._publications[symbol], value)) -``` - -This adapter manager is a bit of a silly example, but it demonstrates the core concepts. -The adapter manager will demultiplex a shared stream (in this case, the stream defined in `_run`  is a random sequence of `MyData` structs) between all the input adapters it manages. -The input adapter itself will do nothing more than let the adapter manager know that it exists: - -```python -class MyInputAdapterImpl(PushInputAdapter): - '''Our input adapter is a very simple implementation, and just - defers its work back to the manager who is expected to deal with - sharing a single connection. - ''' - def __init__(self, manager, symbol): - manager.register_subscription(symbol, self) - super().__init__() -``` - -Similarly, the adapter manager will multiplex the output adapter streams, in this case combining them into streams of print statements. -And similar to the input adapter, the output adapter does relatively little more than letting the adapter manager know that it has work available, using its triggered `on_tick` method to call the adapter manager's `_on_tick` method. 
-
-```python
-class MyOutputAdapterImpl(OutputAdapter):
-    '''Similarly, our output adapter is simple as well, deferring
-    its functionality to the manager
-    '''
-    def __init__(self, manager, symbol):
-        manager.register_publication(symbol)
-        self._manager = manager
-        self._symbol = symbol
-        super().__init__()
-
-    def on_tick(self, time, value):
-        self._manager._on_tick(self._symbol, value)
-```
-
-As a last step, we need to ensure that the runtime adapter implementations are registered with our graph:
-
-```python
-_my_input_adapter = py_push_adapter_def(name='MyInputAdapter', adapterimpl=MyInputAdapterImpl, out_type=ts[MyData], manager_type=MyAdapterManager, symbol=str)
-_my_output_adapter = py_output_adapter_def(name='MyOutputAdapter', adapterimpl=MyOutputAdapterImpl, manager_type=MyAdapterManager, input=ts['T'], symbol=str)
-```
-
-To test this example, we will:
-
-- instantiate our manager
-- subscribe to a certain number of input adapter "streams" (which the adapter manager will demultiplex out of a single random node)
-- print the data
-- sink each stream into a smaller number of output adapters (which the adapter manager will multiplex into print statements)
-
-```python
-@csp.graph
-def my_graph():
-    adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
-
-    data_1 = adapter_manager.subscribe("data_1")
-    data_2 = adapter_manager.subscribe("data_2")
-    data_3 = adapter_manager.subscribe("data_3")
-
-    csp.print("data_1", data_1)
-    csp.print("data_2", data_2)
-    csp.print("data_3", data_3)
-
-    # pump two streams into 1 output and 1 stream into another
-    adapter_manager.publish(data_1, "data_1")
-    adapter_manager.publish(data_2, "data_1")
-    adapter_manager.publish(data_3, "data_3")
-```
-
-Here is the result of a single run:
-
-```
-2023-02-15 19:14:53.859951 data_1:MyData(symbol=data_1, value=0)
-publication_data_1:MyData(symbol=data_1, value=0)
-2023-02-15 19:14:54.610281 data_3:MyData(symbol=data_3, value=1)
-publication_data_3:MyData(symbol=data_3, value=1)
-2023-02-15 19:14:55.361157 data_3:MyData(symbol=data_3, value=2)
-publication_data_3:MyData(symbol=data_3, value=2)
-2023-02-15 19:14:56.112030 data_2:MyData(symbol=data_2, value=3)
-publication_data_1:MyData(symbol=data_2, value=3)
-2023-02-15 19:14:56.862881 data_2:MyData(symbol=data_2, value=4)
-publication_data_1:MyData(symbol=data_2, value=4)
-2023-02-15 19:14:57.613775 data_1:MyData(symbol=data_1, value=5)
-publication_data_1:MyData(symbol=data_1, value=5)
-2023-02-15 19:14:58.364408 data_3:MyData(symbol=data_3, value=6)
-publication_data_3:MyData(symbol=data_3, value=6)
-2023-02-15 19:14:59.115290 data_2:MyData(symbol=data_2, value=7)
-publication_data_1:MyData(symbol=data_2, value=7)
-2023-02-15 19:14:59.866160 data_2:MyData(symbol=data_2, value=8)
-publication_data_1:MyData(symbol=data_2, value=8)
-2023-02-15 19:15:00.617068 data_1:MyData(symbol=data_1, value=9)
-publication_data_1:MyData(symbol=data_1, value=9)
-2023-02-15 19:15:01.367955 data_2:MyData(symbol=data_2, value=10)
-publication_data_1:MyData(symbol=data_2, value=10)
-2023-02-15 19:15:02.118259 data_3:MyData(symbol=data_3, value=11)
-publication_data_3:MyData(symbol=data_3, value=11)
-2023-02-15 19:15:02.869170 data_2:MyData(symbol=data_2, value=12)
-publication_data_1:MyData(symbol=data_2, value=12)
-2023-02-15 19:15:03.620047 data_1:MyData(symbol=data_1, value=13)
-publication_data_1:MyData(symbol=data_1, value=13)
-closing asset publication_data_1
-closing asset publication_data_3
-```
-
-Although simple, this example demonstrates the utility of the adapters and adapter managers.
-An input resource is managed by one entity, distributed across a variety of downstream subscribers.
-Then a collection of streams is piped back into a single entity.
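The demultiplex/multiplex bookkeeping the manager performs can also be sketched outside of csp in a few lines of plain Python. All names below (`TinyManager`, `push`, callbacks standing in for adapters) are illustrative, not csp APIs:

```python
# Minimal sketch of the manager's demux/mux bookkeeping (no csp required).
# Callbacks stand in for input adapters; channel names stand in for output adapters.
class TinyManager:
    def __init__(self):
        self._subscriptions = {}   # symbol -> list of subscriber callbacks
        self._publications = {}    # symbol -> outbound channel name

    def subscribe(self, symbol, callback):
        self._subscriptions.setdefault(symbol, []).append(callback)

    def register_publication(self, symbol):
        self._publications.setdefault(symbol, f"publication_{symbol}")

    def push(self, symbol, value):
        # demultiplex: one upstream event fans out to every subscriber of that key
        for cb in self._subscriptions.get(symbol, []):
            cb(value)

    def publish(self, symbol, value):
        # multiplex: many publishers funnel into one named outbound channel
        return f"{self._publications[symbol]}:{value}"


mgr = TinyManager()
seen = []
mgr.subscribe("data_1", seen.append)
mgr.subscribe("data_1", seen.append)   # two subscribers share one upstream feed
mgr.register_publication("data_1")
mgr.push("data_1", 42)
print(seen)                      # [42, 42]
print(mgr.publish("data_1", 42))  # publication_data_1:42
```

The real adapter manager adds threading and engine scheduling on top of exactly this kind of registry.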
diff --git a/docs/wiki/6.-Dynamic-Graphs.md b/docs/wiki/6.-Dynamic-Graphs.md
deleted file mode 100644
index d9c188ace..000000000
--- a/docs/wiki/6.-Dynamic-Graphs.md
+++ /dev/null
@@ -1,110 +0,0 @@
-`csp` graphs are somewhat limiting in that they cannot change shape once the process starts up.
-`csp` dynamic graphs address this issue by introducing a construct that allows applications to dynamically add / remove sub-graphs from a running graph.
-
-# csp.DynamicBasket
-
-`csp` dynamic baskets are a pre-requisite construct needed for dynamic graphs.
-csp.DynamicBaskets work just like regular static `csp` baskets, however dynamic baskets can change their shape over time.
-csp.DynamicBaskets can only be created from either `csp` nodes or from csp.dynamic calls, as described below.
-A node can take a csp.DynamicBasket as an input or generate a dynamic basket as an output.
-Dynamic baskets are always dictionary-style baskets, where time series can be added by key.
-Note that timeseries can also be removed from dynamic baskets.
-
-## Syntax
-
-Dynamic baskets are denoted by the type `csp.DynamicBasket[key_type, ts_type]`, so for example `csp.DynamicBasket[str,int]` would be a dynamic basket that will have keys of type str, and timeseries of type int.
-One can also use the non-python shorthand `{ ts[str] : ts[int] }` to signify the same.
-
-## Generating dynamic basket output
-
-For nodes that generate dynamic basket output, they would use the same interface as regular basket outputs.
-The difference being that if you output a key that hasn't been seen before, it will automatically be added to the dynamic basket.
-In order to remove a key from a dynamic basket output, you would use the csp.remove_dynamic_key method.
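The add-on-first-output behaviour, and the per-cycle shape events a dynamic basket exposes, can be pictured with a plain-Python sketch. `ShapeTracker` and its methods are invented names for illustration, not csp internals:

```python
# Illustrative per-cycle shape tracking for a dynamic-basket-like dict.
# Not csp code; csp enforces these rules inside the engine.
class ShapeTracker:
    def __init__(self):
        self.data = {}
        self.added, self.removed = [], []

    def output(self, key, value):
        if key not in self.data:
            self.added.append(key)   # first tick of a new key adds it to the basket
        self.data[key] = value

    def remove_key(self, key):
        if key in self.added:
            # mirrors csp's rule: a key cannot be added and removed in one cycle
            raise ValueError(f"key {key!r} added and removed in the same cycle")
        self.removed.append(key)
        self.data.pop(key, None)

    def end_cycle(self):
        # what consumers would see on the basket's .shape for this cycle
        events = (self.added, self.removed)
        self.added, self.removed = [], []
        return events


basket = ShapeTracker()
basket.output("AAPL", 1)
print(basket.end_cycle())   # (['AAPL'], [])
```

Downstream consumers react to those added/removed events exactly the way the `.shape` property is consumed in the node example that follows.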
-**NOTE** that it is illegal to add and remove a key in the same cycle:
-
-```python
-@csp.node
-def dynamic_demultiplex_example(data: ts['T'], key: ts['K']) -> csp.DynamicBasket['K', 'T']:
-    if csp.ticked(data) and csp.valid(key):
-        csp.output({key: data})
-
-
-        ## To remove a key, which wouldn't be done in this example node:
-        ## csp.remove_dynamic_key(key)
-```
-
-To remove a key one would use `csp.remove_dynamic_key`.
-For a single unnamed output, the method expects the key.
-For named outputs, the arguments would be `csp.remove_dynamic_key(output_name, key)`.
-
-## Consuming dynamic basket input
-
-Taking dynamic baskets as input is exactly the same as static baskets.
-There is one additional bit of information available on dynamic basket inputs though, which is the `.shape` property.
-As keys are added or removed, the `basket.shape` property will tick with the change events.
-The `.shape` property behaves effectively as a `ts[csp.DynamicBasketEvents]`:
-
-```python
-@csp.node
-def consume_dynamic_basket(data: csp.DynamicBasket[str, int]):
-    if csp.ticked(data.shape):
-        for key in data.shape.added:
-            print(f'key {key} was added')
-        for key in data.shape.removed:
-            print(f'key {key} was removed')
-
-
-    if csp.ticked(data):
-        for key, value in data.tickeditems():
-            # ...regular basket access here
-```
-
-# csp.dynamic
-
-- **`csp.dynamic(trigger, sub_graph, graph_args...) → csp.DynamicBasket[ ... ]`**
-  - **`trigger`**: a csp.DynamicBasket input.
-    As new keys are added to the basket, they will trigger sub_graph instances to be created.
-    As keys are removed, they will shut down their respective sub-graphs.
-  - **`sub_graph`**: a regular csp.graph method that will be wired as new keys are added on trigger
-  - **`graph_args`**: these are the args passed to the sub_graph at the time of creation.
- Note the special semantics of argument passing to dynamic sub-graphs: - - **`scalars`**: can be passed as is, assuming they are known at main graph build time - - **`timeseries`** - can be passed as is, assuming they are known at main graph build time - - **`csp.snap(ts)`**: this will convert a timeseries input to a **`scalar`** at the time of graph creation, allowing you to get a "dynamic" scalar value to use at sub_graph build time - - **`csp.snapkey()`**: this will pass through the key that was added which triggered this dynamic sub-graph. - One can use this to get the key triggering the sub-graph. - - **`csp.attach()`**: this will pass through the timeseries of the input trigger for the key which triggered this dynamic sub-graph. - For example, say we have a dynamic basket of `{ symbol : ts[orders ]}` as our input trigger. - As a new symbol is added, we will trigger a sub-graph to process this symbol. - Say we also want to feed in the `ts[orders]` for the given symbol into our sub_graph, we would pass `csp.attach()` as the argument. - - **`output`**: every output of sub_graph (if there are any) will be returned as a member of a csp.DynamicBasket output. - As new keys are added to the trigger, which generates sub-graphs, keys will be added to the output dynamic basket - (Note, output keys will only generate on first tick of some output data, not upon instantiation of the sub-graph, since csp.DynamicBasket requires all keys to have valid values) - -```python -@csp.graph -def my_sub_graph(symbol : str, orders : ts[ Orders ], portfolio_position : ts[int], some_scalar : int) -> ts[Fill]: - ... regular csp.graph code ... 
- - -@csp.graph -def main(): - # position as ts[int] - portfolio_position = get_portfolio_position() - - - all_orders = get_orders() - # demux fat-pipe of orders into a dynamic basket keyed by symbol - demuxed_orders = csp.dynamic_demultiplex(all_orders, all_orders.symbol) - - - result = csp.dynamic(demuxed_orders, my_sub_graph, - csp.snap(all_orders.symbol), # Grab scalar value of all_orders.symbol at time of instantiation - #csp.snapkey(), # Alternative way to grab the key that instantiated the sub-graph - csp.attach(), # extract the demuxed_orders[symbol] time series of the symbol being created in the sub_graph - portfolio_position, # pass in regular ts[] - 123) # pass in some scalar - - - # process result.fills which will be a csp.DynamicBasket of { symbol : ts[Fill] } -``` diff --git a/docs/wiki/9.-Caching.md b/docs/wiki/9.-Caching.md deleted file mode 100644 index e93519600..000000000 --- a/docs/wiki/9.-Caching.md +++ /dev/null @@ -1,3 +0,0 @@ -`csp` provides a caching layer of graph outputs. The caching layer is generally a parquet writer/reader wrapper of graph outputs. The system automatically manages resolving the run time of the engine and resolving whether the data can be read from cache or isn't available in cache (in which case data will be written to cache). Future runs can then read the data from cache and avoid calculations of the same data. Goals of the caching layer: - -More documentation to follow! diff --git a/docs/wiki/Home.md b/docs/wiki/Home.md index 8901882eb..9d7ae95f8 100644 --- a/docs/wiki/Home.md +++ b/docs/wiki/Home.md @@ -1,70 +1,38 @@ -`csp` ("Composable Stream Processing") is a functional-like reactive -language that makes time-series stream processing simple to do.  The -main reactive engine is a C++ based engine which has been exposed to -python ( other languages may optionally be extended in future versions -). `csp` applications define a connected graph of components using a -declarative language (which is essentially python).  
Once a graph is -constructed it can be run using the C++ engine. Graphs are composed of -some number of "input" adapters, a set of connected calculation "nodes" -and at the end sent off to "output" adapters. Inputs as well as the -engine can be seamlessly run in simulation mode using historical input -adapters or in realtime mode using realtime input adapters. + + + + CSP logo mark - text will be black in light color mode and white in dark color mode. + -# Contents +CSP (Composable Stream Processing) is a library for high-performance real-time event stream processing in Python. -- [0. Introduction](https://github.com/Point72/csp/wiki/0.-Introduction) -- [1. Generic Nodes (csp.baselib)]() -- [2. Math Nodes (csp.math)]() -- [3. Statistics Nodes (csp.stats)]() -- [4. Random Time Series Generation]() -- [5. Adapters](https://github.com/Point72/csp/wiki/5.-Adapters) -- [6. Dynamic Graphs](https://github.com/Point72/csp/wiki/6.-Dynamic-Graphs) -- [7. csp.Struct](https://github.com/Point72/csp/wiki/7.-csp.Struct) -- [8. Profiler](https://github.com/Point72/csp/wiki/8.-Profiler) -- [9. Caching](https://github.com/Point72/csp/wiki/9.-Caching) +## Key Features -# Installation +- **Powerful C++ Engine:** Execute the graph using CSP's C++ Graph Processing Engine +- **Simulation (i.e., offline) mode:** Test workflows on historical data and quickly move to real-time data in deployment +- **Infrastructure-agnostic:** Connect to any data format or storage database, using built-in (Parquet, Kafka, etc.) 
or custom adapters +- **Highly-customizable:** Write your own input and output adapters for any data/storage formats, and real-time adapters for specific workflows +- **PyData interoperability:** Use your favorite libraries from the Scientific Python Ecosystem for numerical and statistical computations +- **Functional/declarative style:** Write concise and composable code for stream processing by building graphs in Python -We ship binary wheels to install `csp` on MacOS and Linux via `pip`: + -```bash -pip install csp -``` +## Get Started -Other platforms will need to see the instructions to [build `csp` from -source](https://github.com/Point72/csp/wiki/98.-Building-From-Source). +- [Install CSP](Installation) and [write your first CSP program](First-Steps) +- Learn more about [nodes](CSP-Node), [graphs](CSP-Graph), and [execution modes](Execution-Modes) +- Learn to extend CSP with [adapters](Adapters) -We plan to create conda packages on conda-forge and ship binaries for Windows in -the near future. + -# Contributing +> \[!TIP\] +> Find relevant docs with GitHub’s search function, use `repo:Point72/csp type:wiki ` to search the documentation Wiki Pages. -Contributions are welcome on this project. We distribute under the terms of the [Apache 2.0 license](https://github.com/Point72/csp/blob/main/LICENSE). +## Community -For **bug reports** or **small feature requests**, please open an issue on our [issues page](https://github.com/Point72/csp/issues). +- [Contribute](Contribute) to CSP and help improve the project +- Read about future plans in the [project roadmap](Roadmap) -For **questions** or to discuss **larger changes or features**, please use our [discussions page](https://github.com/Point72/csp/discussions). +## License -For **contributions**, please see our [developer documentation](https://github.com/Point72/csp/wiki/99.-Developer). We have `help wanted` and `good first issue` tags on our issues page, so these are a great place to start. 
- -For **documentation updates**, make PRs that update the pages in `/docs/wiki`. The documentation is pushed to the GitHub wiki automatically through a GitHub workflow. Note that direct updates to this wiki will be overwritten. - -# Roadmap - -We do not have a formal roadmap, but we're happy to discuss features, improvements, new adapters, etc, in our [discussions area](https://github.com/Point72/csp/discussions). Here are some high level items we hope to accomplish in the next few months: - -- Support `clang` compiler and full MacOS support ([#33](https://github.com/Point72/csp/issues/33) / [#132](https://github.com/Point72/csp/pull/132)) -- Support `msvc` compiler and full Windows support ([#109](https://github.com/Point72/csp/issues/109)) -- Establish a better pattern for adapters ([#165](https://github.com/Point72/csp/discussions/165)) - -## Adapters and Extensions - -- Redis Pub/Sub Adapter with [Redis-plus-plus](https://github.com/sewenew/redis-plus-plus) ([#61](https://github.com/Point72/csp/issues/61)) -- C++-based websocket adapter - - Client adapter in [#152](https://github.com/Point72/csp/pull/152) -- C++-based HTTP/SSE adapter -- Add support for other graph viewers, including interactive / standalone / Jupyter - -## Other Open Source Projects - -- `csp-gateway`: Application development framework, built with [FastAPI](https://fastapi.tiangolo.com) and [Perspective](https://github.com/finos/perspective). This is a library we have built internally at Point72 on top of `csp` that we hope to open source later in 2024. It allows for easier construction of modular `csp` applications, along with a pluggable REST/WebSocket API and interactive UI. +CSP is licensed under the Apache 2.0 license. See the [LICENSE](https://github.com/Point72/csp/blob/main/LICENSE) file for details. 
diff --git a/docs/wiki/_Footer.md b/docs/wiki/_Footer.md
new file mode 100644
index 000000000..602a25509
--- /dev/null
+++ b/docs/wiki/_Footer.md
@@ -0,0 +1 @@
+_This wiki is autogenerated. To make updates, open a PR against the original source file in [`docs/wiki`](https://github.com/Point72/csp/tree/main/docs/wiki)._
diff --git a/docs/wiki/_Sidebar.md b/docs/wiki/_Sidebar.md
new file mode 100644
index 000000000..cd137edf5
--- /dev/null
+++ b/docs/wiki/_Sidebar.md
@@ -0,0 +1,61 @@
+
+**[Home](Home)**
+
+**Get Started (Tutorials)**
+
+- [Installation](Installation)
+- [First steps](First-Steps)
+
+
+
+**Concepts**
+
+- [CSP Node](CSP-Node)
+- [CSP Graph](CSP-Graph)
+- [Historical Buffers](Historical-Buffers)
+- [Execution Modes](Execution-Modes)
+- [Adapters](Adapters)
+
+**How-to guides**
+
+- [Use Statistical Nodes](Use-Statistical-Nodes)
+- Use Adapters (coming soon)
+- [Add Cycles in Graphs](Add-Cycles-in-Graphs)
+- [Create Dynamic Baskets](Create-Dynamic-Baskets)
+- Write Adapters:
+  - [Write Historical Input Adapters](Write-Historical-Input-Adapters)
+  - [Write Realtime Input Adapters](Write-Realtime-Input-Adapters)
+  - [Write Output Adapters](Write-Output-Adapters)
+- [Profile CSP Code](Profile-CSP-Code)
+
+**References**
+
+- API Reference
+  - [Base Nodes API](Base-Nodes-API)
+  - [Base Adapters API](Base-Adapters-API)
+  - [Math and Logic Nodes API](Math-and-Logic-Nodes-API)
+  - [Statistical Nodes API](Statistical-Nodes-API)
+  - [Functional Methods API](Functional-Methods-API)
+  - [Adapters (Kafka, Parquet, DBReader) API](Input-Output-Adapters-API)
+  - [Random Time Series Generators API](Random-Time-Series-Generators-API)
+  - [`csp.Struct` API](csp.Struct-API)
+  - [`csp.dynamic` API](csp.dynamic-API)
+  - [`csp.profiler` API](csp.profiler-API)
+- [Examples](Examples)
+- [Glossary of Terms](Glossary)
+
+**Developer Guide**
+
+- [Contributing](Contribute)
+- [Development Setup](Local-Development-Setup)
+- [Build CSP from Source](Build-CSP-from-Source)
+- 
[GitHub Conventions (for maintainers)](GitHub-Conventions)
+- [Release Process (for maintainers)](Release-Process)
+- [Roadmap](Roadmap)
diff --git a/docs/wiki/api-references/Base-Adapters-API.md b/docs/wiki/api-references/Base-Adapters-API.md
new file mode 100644
index 000000000..a72820cc9
--- /dev/null
+++ b/docs/wiki/api-references/Base-Adapters-API.md
@@ -0,0 +1,110 @@
+`csp.baselib` defines some generally useful adapters, which are also imported directly into the CSP namespace when importing CSP.
+
+These are all graph-time constructs.
+
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [`csp.timer`](#csptimer)
+- [`csp.const`](#cspconst)
+- [`csp.curve`](#cspcurve)
+- [`csp.add_graph_output`](#cspadd_graph_output)
+- [`csp.feedback`](#cspfeedback)
+
+## `csp.timer`
+
+```python
+csp.timer(
+    interval: timedelta,
+    value: '~T' = True,
+    allow_deviation: bool = False
+)
+```
+
+This will create a repeating timer edge that will tick on the given `timedelta` with the given value (value defaults to `True`, returning a `ts[bool]`).
+
+Args:
+
+- **`interval`**: how often to tick value
+- **`value`**: the actual value that will tick every interval (defaults to the value `True`)
+- **`allow_deviation`**: When running in realtime the engine will ensure timers execute exactly when they are requested on their intervals.
+  If your engine begins to lag, timers will still execute at the expected time "in the past" as the engine catches up
+  (imagine a `csp.timer` firing every 1/2 second while the engine becomes delayed for 1 second;
+  by default the half-second ticks will still execute until time catches up to wallclock).
+  When `allow_deviation` is `True`, and the engine is in realtime mode, subsequent timers will always be scheduled from the current wallclock + interval,
+  so they won't end up lagging behind, at the expense of the timer skewing.
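The catch-up behaviour described for `allow_deviation` can be sketched with plain `datetime` arithmetic. This is a standalone illustration of the scheduling rule, not csp's actual scheduler; `next_fire` is an invented helper name:

```python
from datetime import datetime, timedelta

def next_fire(scheduled, now, interval, allow_deviation):
    """Where a timer's next tick would be scheduled after the current one fires."""
    if allow_deviation:
        # reschedule from wallclock: no catch-up ticks, but the grid may skew
        return now + interval
    # stay on the original grid: if the engine lagged, fires land "in the past"
    # and execute back-to-back until time catches up
    return scheduled + interval

start = datetime(2024, 1, 1)
half = timedelta(seconds=0.5)
# suppose the engine stalls for a second after the fire scheduled at t+0.5s
now = start + timedelta(seconds=1.5)

print(next_fire(start + half, now, half, allow_deviation=False))  # t+1.0s, already in the past
print(next_fire(start + half, now, half, allow_deviation=True))   # t+2.0s, measured from wallclock
```

With `allow_deviation=False` the timer never drops a tick; with `True` it never fires late, trading tick count for punctuality.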
+
+## `csp.const`
+
+```python
+csp.const(
+    value: '~T',
+    delay: timedelta = timedelta()
+)
+```
+
+This will create an edge that ticks one time with the value provided.
+By default this will tick at the start of the engine; a `delay` can be provided to delay the tick.
+
+## `csp.curve`
+
+```python
+csp.curve(
+    typ: 'T',
+    data: typing.Union[list, tuple]
+)
+```
+
+This allows you to convert a list of non-CSP data into a ticking edge in CSP.
+
+Args:
+
+- **`typ`**: is the type of the value of the data of this edge
+- **`data`**: is either a list of tuples of `(datetime, value)`, or a tuple of two equal-length numpy ndarrays, the first with datetimes and the second with values.
+  In either case, that will tick on the returned edge into the engine, and the data must be in time order.
+  Note that for the list of tuples case, you can also provide tuples of (timedelta, value) where timedelta will be the offset from the engine's start time.
+
+## `csp.add_graph_output`
+
+```python
+csp.add_graph_output(
+    key: object,
+    input: ts['T'],
+    tick_count: int = -1,
+    tick_history: timedelta = timedelta()
+)
+```
+
+This allows you to connect an edge as a "graph output".
+All edges added as outputs will be returned to the caller from `csp.run` as a dictionary of `key: [(datetime, value)]`
+(list of datetime, values that ticked on the edge) or if `csp.run` is passed `output_numpy=True`, as a dictionary of
+`key: (array, array)` (tuple of two numpy arrays, one with datetimes and one with values).
+See [Collecting Graph Outputs](https://github.com/Point72/csp/wiki/0.-Introduction#collecting-graph-outputs)
+
+Args:
+
+- **`key`**: key to return the results as from `csp.run`
+- **`input`**: edge to connect
+- **`tick_count`**: number of ticks to keep in the buffer (defaults to -1, meaning all ticks)
+- **`tick_history`**: window of time over which to keep ticks (defaults to keeping all history)
+
+## `csp.feedback`
+
+```python
+csp.feedback(typ)
+```
+
+`csp.feedback` is a construct that can be used to create artificial loops in the graph.
+Use feedbacks to delay-bind an input to a node so that you can create a loop
+(think of writing a simulated exchange that takes orders in and needs to feed responses back to the originating node).
+
+`csp.feedback` itself is not an edge; it's a construct that allows you to access the delayed edge / bind a delayed input.
+
+Args:
+
+- **`typ`**: type of the edge's data to be bound
+
+Methods:
+
+- **`out()`**: call this method on the feedback object to get the edge which can be wired as an input
+- **`bind(x: ts[object])`**: call this to bind an edge to the feedback
diff --git a/docs/wiki/1.-Generic-Nodes-(csp.baselib).md b/docs/wiki/api-references/Base-Nodes-API.md
similarity index 54%
rename from docs/wiki/1.-Generic-Nodes-(csp.baselib).md
rename to docs/wiki/api-references/Base-Nodes-API.md
index 913b39e66..81acf4b8d 100644
--- a/docs/wiki/1.-Generic-Nodes-(csp.baselib).md
+++ b/docs/wiki/api-references/Base-Nodes-API.md
@@ -1,114 +1,43 @@
-# Intro
-
 CSP comes with some basic constructs readily available and commonly used.
-The latest set of baselib nodes / adapters can be found in the csp.baselib module.
-
-All of the nodes / adapters noted here are imported directly into the csp namespace when importing csp.
-These are all graph-time constructs.
- -# Adapters - -## `timer` - -```python -csp.timer( - interval: timedelta, - value: '~T' = True, - allow_deviation: bool = False -) -``` - -This will create a repeating timer edge that will tick on the given `timedelta` with the given value (value defaults to `True`, returning a `ts[bool]`) - -Args: - -- **`interval`**: how often to tick value -- **`value`**: the actual value that will tick every interval (defaults to the value `True`) -- **`allow_deviation`**: When running in realtime the engine will ensure timers execute exactly when they requested on their intervals. - If your engine begins to lag, timers will still execute at the expected time "in the past" as the engine catches up - (imagine having a `csp.timer` fire every 1/2 second but the engine becomes delayed for 1 second. - By default the half seconds will still execute until time catches up to wallclock). - When `allow_deviation` is `True`, and the engine is in realtime mode, subsequent timers will always be scheduled from the current wallclock + interval, - so they won't end up lagging behind at the expensive of the timer skewing. - -## `const` - -```python -csp.const( - value: '~T', - delay: timedelta = timedelta() -) -``` - -This will create an edge that ticks one time with the value provided. -By default this will tick at the start of the engine, delta can be provided to delay the tick - -## `curve` - -```python -csp.curve( - typ: 'T', - data: typing.Union[list, tuple] -) -``` - -This allows you to convert a list of non-csp data into a ticking edge in csp - -Args: - -- **`typ`**: is the type of the value of the data of this edge -- **`data`**: is either a list of tuples of `(datetime, value)`, or a tuple of two equal-length numpy ndarrays, the first with datetimes and the second with values. - In either case, that will tick on the returned edge into the engine, and the data must be in time order. 
- Note that for the list of tuples case, you can also provide tuples of (timedelta, value) where timedelta will be the offset from the engine's start time. - -## `add_graph_output` - -```python -csp.add_graph_output( - key: object, - input: ts['T'], - tick_count: int = -1, - tick_history: timedelta = timedelta() -) -``` - -This allows you to connect an edge as a "graph output". -All edges added as outputs will be returned to the caller from `csp.run` as a dictionary of `key: [(datetime, value)]` -(list of datetime, values that ticked on the edge) or if `csp.run` is passed `output_numpy=True`, as a dictionary of -`key: (array, array)` (tuple of two numpy arrays, one with datetimes and one with values). -See [Collecting Graph Outputs](https://github.com/Point72/csp/wiki/0.-Introduction#collecting-graph-outputs) - -Args: - -- **`key`**: key to return the results as from csp.run -- **`input`**: edge to connect -- **`tick_count`**: number of ticks to keep in the buffer (defaults to -1 - all ticks) -- **`tick_history`**: amount of ticks to keep by time window (defaults to keeping all history) - -## `feedback` - -```python -csp.feedback(typ) -``` - -`csp.feedback` is a construct that can be used to create artificial loops in the graph. -Use feedbacks in order to delay bind an input to a node in order to be able to create a loop -(think of writing a simulated exchange that takes orders in and needs to feed responses back to the originating node). - -`csp.feedback` itself is not an edge, its a construct that allows you to access the delayed edge / bind a delayed input. - -Args: - -- **`typ`**: type of the edge's data to be bound - -Methods: +The latest set of base nodes can be found in the `csp.baselib` module. 
-- **`out()`**: call this method on the feedback object to get the edge which can be wired as an input -- **`bind(x: ts[object])`**: call this to bind an edge to the feedback +All of the nodes noted here are imported directly into the CSP namespace when importing CSP. -# Basic Nodes +These are all graph-time constructs. -## `print` +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [`csp.print`](#cspprint) +- [`csp.log`](#csplog) +- [`csp.sample`](#cspsample) +- [`csp.firstN`](#cspfirstn) +- [`csp.count`](#cspcount) +- [`csp.delay`](#cspdelay) +- [`csp.diff`](#cspdiff) +- [`csp.merge`](#cspmerge) +- [`csp.split`](#cspsplit) +- [`csp.filter`](#cspfilter) +- [`csp.drop_dups`](#cspdrop_dups) +- [`csp.unroll`](#cspunroll) +- [`csp.collect`](#cspcollect) +- [`csp.flatten`](#cspflatten) +- [`csp.default`](#cspdefault) +- [`csp.gate`](#cspgate) +- [`csp.apply`](#cspapply) +- [`csp.null_ts`](#cspnull_ts) +- [`csp.stop_engine`](#cspstop_engine) +- [`csp.multiplex`](#cspmultiplex) +- [`csp.demultiplex`](#cspdemultiplex) +- [`csp.dynamic_demultiplex`](#cspdynamic_demultiplex) +- [`csp.dynamic_collect`](#cspdynamic_collect) +- [`csp.drop_nans`](#cspdrop_nans) +- [`csp.times`](#csptimes) +- [`csp.times_ns`](#csptimes_ns) +- [`csp.accum`](#cspaccum) +- [`csp.exprtk`](#cspexprtk) + +## `csp.print` ```python csp.print( @@ -118,7 +47,7 @@ csp.print( This node will print (using python `print()`) the time, tag and value of `x` for every tick of `x` -## `log` +## `csp.log` ```python csp.log( @@ -132,7 +61,7 @@ csp.log( ``` Similar to `csp.print`, this will log ticks using the logger on the provided level. -The default 'csp' logger is used if none is provided to the node. +The default CSP logger is used if none is provided to the node. Args: @@ -141,7 +70,7 @@ Args: This can be useful when printing large strings in log calls. 
If individual time-series values are subject to modification *after* the log call, then the user must pass in a copy of the time-series if they wish to have proper threaded logging. -## `sample` +## `csp.sample` ```python csp.sample( @@ -154,7 +83,7 @@ Use this to down-sample an input. `csp.sample` will return the current value of `x` any time trigger ticks. This can be combined with `csp.timer` to sample the input on a time interval. -## `firstN` +## `csp.firstN` ```python csp.firstN( @@ -165,15 +94,15 @@ csp.firstN( Only output the first `N` ticks of the input. -## `count` +## `csp.count` ```python -csp.count(x: ts[object]) → ts[int] +csp.count(x: ts[object]) → ts[int] ``` Returns the ticking count of ticks of the input -## `delay` +## `csp.delay` ```python csp.delay( @@ -184,7 +113,7 @@ csp.delay( This will delay all ticks of the input `x` by the given `delay`, which can be given as a `timedelta` to delay a specified amount of time, or as an int to delay a specified number of ticks (delay must be positive) -## `diff` +## `csp.diff` ```python csp.diff( @@ -195,7 +124,7 @@ csp.diff( When `x` ticks, output difference between current tick and value time or ticks ago (once that exists) -## `merge` +## `csp.merge` ```python csp.merge( x: ts['T'], y: ts['T']) → ts['T'] @@ -203,9 +132,9 @@ csp.merge( x: ts['T'], y: ts['T']) → ts['T'] Merges the two timeseries `x` and `y` into a single series. If both tick on the same cycle, the first input (`x`) wins and the value of `y` is dropped. -For loss-less merging see `csp.flatten` +For loss-less merging see `csp.flatten` -## `split` +## `csp.split` ```python csp.split( @@ -219,7 +148,7 @@ If `flag` is `True` when `x` ticks, output 'true' will tick with the value of `x If `flag` is `False` at the time of the input tick, then 'false' will tick. Note that if flag is not valid at the time of the input tick, the input will be dropped. 
-## `filter` +## `csp.filter` ```python csp.filter(flag: ts[bool], x: ts['T']) → ts['T'] @@ -228,7 +157,7 @@ csp.filter(flag: ts[bool], x: ts['T']) → ts['T'] Will only tick out input ticks of `x` if the current value of `flag` is `True`. If flag is `False`, or if flag is not valid (hasn't ticked yet) then `x` is suppressed. -## `drop_dups` +## `csp.drop_dups` ```python csp.drop_dups(x: ts['T']) → ts['T'] @@ -236,34 +165,34 @@ csp.drop_dups(x: ts['T']) → ts['T'] Will drop consecutive duplicate values from the input. -## `unroll` +## `csp.unroll` ```python csp.unroll(x: ts[['T']]) → ts['T'] ``` -Given a timeseries of a *list* of values, unroll will "unroll" the values in the list into a timeseries of the elements. +Given a timeseries of a *list* of values, unroll will "unroll" the values in the list into a timeseries of the elements. `unroll` will ensure to preserve the order across all list ticks. Ticks will be unrolled in subsequent engine cycles. -## `collect` +## `csp.collect` ```python csp.collect(x: [ts['T']]) → ts[['T']] ``` -Given a basket of inputs, return a timeseries of a *list* of all values that ticked +Given a basket of inputs, return a timeseries of a *list* of all values that ticked -## `flatten` +## `csp.flatten` ```python csp.flatten(x: [ts['T']]) → ts['T'] ``` Given a basket of inputs, return all ticks across all inputs as a single timeseries of type 'T' -(This is similar to `csp.merge` except that it can take more than two inputs, and is lossless) +(This is similar to `csp.merge` except that it can take more than two inputs, and is lossless) -## `default` +## `csp.default` ```python csp.default( @@ -276,7 +205,7 @@ csp.default( Defaults the input series to the value of `default` at start of the engine, or after `delay` if `delay` is provided. If `x` ticks right at the start of the engine, or before `delay` if `delay` is provided, `default` value will be discarded. 
-## `gate` +## `csp.gate` ```python csp.gate( @@ -290,7 +219,7 @@ csp.gate( While open, the input will tick out as a single value burst. While closed, input ticks will buffer up until they can be released. -## `apply` +## `csp.apply` ```python csp.apply( @@ -302,7 +231,7 @@ csp.apply( Applies the provided callable `f` on every tick of the input and returns the result of the callable. -## `null_ts` +## `csp.null_ts` ```python csp.null_ts(typ: 'T') @@ -310,7 +239,7 @@ csp.null_ts(typ: 'T') Returns a "null" timeseries of the given type which will never tick. -## `stop_engine` +## `csp.stop_engine` ```python csp.stop_engine(x: ts['T']) @@ -318,7 +247,7 @@ csp.stop_engine(x: ts['T']) Forces the engine to stop if `x` ticks -## `multiplex` +## `csp.multiplex` ```python csp.multiplex( @@ -339,7 +268,7 @@ Args: the input basket whenever the key ticks (defaults to `False`) - **`raise_on_bad_key`**: if `True` an exception will be raised if key ticks with an unrecognized key (defaults to `False`) -## `demultiplex` +## `csp.demultiplex` ```python csp.demultiplex( @@ -350,18 +279,18 @@ csp.demultiplex( ) → {key: ts['T']} ``` -Given a single timeseries input, a key timeseries to demultiplex on and a set of expected keys, will output the given input onto the corresponding basket output of the current value of `key`. +Given a single timeseries input, a key timeseries to demultiplex on and a set of expected keys, will output the given input onto the corresponding basket output of the current value of `key`. A good example use case of this is demultiplexing a timeseries of trades by account. -Assuming your trade struct has an account field, you can `demultiplex(trades, trades.account, [ 'acct1', 'acct2', ... ])`. +Assuming your trade struct has an account field, you can `demultiplex(trades, trades.account, [ 'acct1', 'acct2', ... ])`. 
 Args:
 
 - **`x`**: the input timeseries to demultiplex
 - **`key`**: a ticking timeseries of the current key to output to
-- **`keys`**: a list of expected keys that will define the shape of the output basket.  The list of keys must be known at graph building time
+- **`keys`**: a list of expected keys that will define the shape of the output basket. The list of keys must be known at graph building time
 - **`raise_on_bad_key`**: if `True` an exception will be raised if key ticks with an unrecognized key (defaults to `False`)
 
-## `dynamic_demultiplex`
+## `csp.dynamic_demultiplex`
 
 ```python
 csp.dynamic_demultiplex(
@@ -372,7 +301,7 @@ csp.dynamic_demultiplex(
 
 Similar to `csp.demultiplex`, this version will return a [Dynamic Basket](https://github.com/Point72/csp/wiki/6.-Dynamic-Graphs) output that will dynamically add new keys as they are seen.
 
-## `dynamic_collect`
+## `csp.dynamic_collect`
 
 ```python
 csp.dynamic_collect(
@@ -382,7 +311,7 @@ csp.dynamic_collect(
 
 Similar to `csp.collect`, this function takes a [Dynamic Basket](https://github.com/Point72/csp/wiki/6.-Dynamic-Graphs) input and returns a dictionary of the key-value pairs corresponding to the values that ticked.
 
-## `drop_nans`
+## `csp.drop_nans`
 
 ```python
 csp.drop_nans(x: ts[float]) → ts[float]
 ```
 
 Filters nan (Not-a-number) values out of the time series.
-## `times` +## `csp.times` ```python csp.times(x: ts['T']) → ts[datetime] @@ -398,7 +327,7 @@ csp.times(x: ts['T']) → ts[datetime] Given a timeseries, returns the time at which that series ticks -## `times_ns` +## `csp.times_ns` ```python csp.times_ns(x: ts['T']) → ts[int] @@ -406,7 +335,7 @@ csp.times_ns(x: ts['T']) → ts[int] Given a timeseries, returns the epoch time in nanoseconds at which that series ticks -## `accum` +## `csp.accum` ```python csp.accum(x: ts["T"], start: "~T" = 0) -> ts["T"] @@ -414,72 +343,7 @@ csp.accum(x: ts["T"], start: "~T" = 0) -> ts["T"] Given a timeseries, accumulate via `+=` with starting value `start`. -# Math and Logic nodes - -See [Math Nodes](). - -# Functional Methods - -Edges in csp contain some methods to serve as syntactic sugar for stringing nodes together in a pipeline. This makes it easier to read/modify workflows and avoids the need for nested brackets. - -## `apply` - -```python -Edge.apply(self, func, *args, **kwargs) -``` - -Calls `csp.apply` on the edge with the provided python `func`. - -Args: - -- **`func`**: A scalar function that will be applied on each value of the Edge. If a different output type is returned, pass a tuple `(f, typ)`, where `typ` is the output type of f -- **`args`**: Positional arguments passed into `func` -- **`kwargs`**: Dictionary of keyword arguments passed into func - -## `pipe` - -```python -Edge.pipe(self, node, *args, **kwargs) -``` - -Calls the `node` on the edge. - -Args: - -- **`node`**: A graph node that will be applied to the Edge, which is passed into node as the first argument. - Alternatively, a `(node, edge_keyword)` tuple where `edge_keyword` is a string indicating the keyword of node that expects the edge. 
-- **`args`**: Positional arguments passed into `node` -- **`kwargs`**: Dictionary of keyword arguments passed into `node` - -## `run` - -```python -Edge.run(self, node, *args, **kwargs) -``` - -Alias for `csp.run(self, *args, **kwargs)` - -## Example of functional methods - -```python -import csp -from datetime import datetime, timedelta -import math - -(csp.timer(timedelta(minutes=1)) - .pipe(csp.count) - .pipe(csp.delay, timedelta(seconds=1)) - .pipe((csp.sample, 'x'), trigger=csp.timer(timedelta(minutes=2))) - .apply((math.sin, float)) - .apply(math.pow, 3) - .pipe(csp.firstN, 10) - .run(starttime=datetime(2000,1,1), endtime=datetime(2000,1,2))) - -``` - -# Other nodes - -## `exprtk` +## `csp.exprtk` ```python csp.exprtk( @@ -498,8 +362,8 @@ Args: - **`expression_str`**: an expression, as per the [C++ Mathematical Expression Library](http://www.partow.net/programming/exprtk/) (see [readme](http://www.partow.net/programming/exprtk/code/readme.txt) - **`inputs`**: a dict basket of timeseries. The keys will correspond to the variables in the expression. The timeseries can be of float or string -- **`state_vars`**: an optional dictionary of variables to be held in state between executions, and assignable within the expression.  Keys are the variable names and values are the starting values +- **`state_vars`**: an optional dictionary of variables to be held in state between executions, and assignable within the expression. Keys are the variable names and values are the starting values - **`trigger`**: an optional trigger for when to calculate. By default will calculate on any input tick - **`functions`**: an optional dictionary whose keys are function names that can be used in the expression, and whose values are of the form `(("arg1", ..), "function body")`, for example `{"foo": (("x","y"), "x\*y")}` -- **`constants`**: an optional dictionary of constants.  Keys are constant names and values are their values +- **`constants`**: an optional dictionary of constants. 
Keys are constant names and values are their values - **`output_ndarray`**: if `True`, output ndarray (1D) instead of float. Note that to output `ndarray`, the expression needs to use return like `return [a, b, c]`. The length of the array can vary between ticks. diff --git a/docs/wiki/api-references/Functional-Methods-API.md b/docs/wiki/api-references/Functional-Methods-API.md new file mode 100644 index 000000000..3f2447fa9 --- /dev/null +++ b/docs/wiki/api-references/Functional-Methods-API.md @@ -0,0 +1,64 @@ +Edges in CSP contain some methods to serve as syntactic sugar for stringing nodes together in a pipeline. This makes it easier to read/modify workflows and avoids the need for nested brackets. + +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [`apply`](#apply) +- [`pipe`](#pipe) +- [`run`](#run) +- [Example of functional methods](#example-of-functional-methods) + +## `apply` + +```python +Edge.apply(self, func, *args, **kwargs) +``` + +Calls `csp.apply` on the edge with the provided python `func`. + +Args: + +- **`func`**: A scalar function that will be applied on each value of the Edge. If a different output type is returned, pass a tuple `(f, typ)`, where `typ` is the output type of f +- **`args`**: Positional arguments passed into `func` +- **`kwargs`**: Dictionary of keyword arguments passed into func + +## `pipe` + +```python +Edge.pipe(self, node, *args, **kwargs) +``` + +Calls the `node` on the edge. + +Args: + +- **`node`**: A graph node that will be applied to the Edge, which is passed into node as the first argument. + Alternatively, a `(node, edge_keyword)` tuple where `edge_keyword` is a string indicating the keyword of node that expects the edge. 
+- **`args`**: Positional arguments passed into `node` +- **`kwargs`**: Dictionary of keyword arguments passed into `node` + +## `run` + +```python +Edge.run(self, node, *args, **kwargs) +``` + +Alias for `csp.run(self, *args, **kwargs)` + +## Example of functional methods + +```python +import csp +from datetime import datetime, timedelta +import math + +(csp.timer(timedelta(minutes=1)) + .pipe(csp.count) + .pipe(csp.delay, timedelta(seconds=1)) + .pipe((csp.sample, 'x'), trigger=csp.timer(timedelta(minutes=2))) + .apply((math.sin, float)) + .apply(math.pow, 3) + .pipe(csp.firstN, 10) + .run(starttime=datetime(2000,1,1), endtime=datetime(2000,1,2))) + +``` diff --git a/docs/wiki/api-references/Input-Output-Adapters-API.md b/docs/wiki/api-references/Input-Output-Adapters-API.md new file mode 100644 index 000000000..7dcd70d75 --- /dev/null +++ b/docs/wiki/api-references/Input-Output-Adapters-API.md @@ -0,0 +1,360 @@ +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Kafka](#kafka) + - [API](#api) + - [MessageMapper](#messagemapper) + - [Subscribing and Publishing](#subscribing-and-publishing) + - [Known Issues](#known-issues) +- [Parquet](#parquet) + - [ParquetReader](#parquetreader) + - [API](#api-1) + - [Subscription](#subscription) + - [ParquetWriter](#parquetwriter) + - [Construction](#construction) + - [Publishing](#publishing) +- [DBReader](#dbreader) + - [TimeAccessor](#timeaccessor) +- [Symphony](#symphony) +- [Slack](#slack) + +## Kafka + +The Kafka adapter is a user adapter to stream data from a Kafka bus as a reactive time series. It leverages the [librdkafka](https://github.com/confluentinc/librdkafka) C/C++ library internally. + +The `KafkaAdapterManager` instance represents a single connection to a broker. +A single connection can subscribe and/or publish to multiple topics. 
+
+### API
+
+```python
+KafkaAdapterManager(
+    broker,
+    start_offset: typing.Union[KafkaStartOffset,timedelta,datetime] = None,
+    group_id: str = None,
+    group_id_prefix: str = '',
+    max_threads=100,
+    max_queue_size=1000000,
+    auth=False,
+    security_protocol='SASL_SSL',
+    sasl_kerberos_keytab='',
+    sasl_kerberos_principal='',
+    ssl_ca_location='',
+    sasl_kerberos_service_name='kafka',
+    rd_kafka_conf_options=None,
+    debug: bool = False,
+    poll_timeout: timedelta = timedelta(seconds=1)
+):
+```
+
+- **`broker`**: name of the Kafka broker, such as `protocol://host:port`
+
+- **`start_offset`**: signifies where to start the stream playback from (defaults to `KafkaStartOffset.LATEST`).
+  Can be one of the `KafkaStartOffset` enum types or:
+
+  - `datetime`: to replay from the given absolute time
+  - `timedelta`: this will be taken as an absolute offset from starttime to playback from
+
+- **`group_id`**: if set, this adapter will behave as a consume-once consumer.
+  `start_offset` may not be set in this case since the adapter will always replay from the last consumed offset.
+
+- **`group_id_prefix`**: when not passing an explicit group_id, a prefix can be supplied that will be used to prefix the UUID generated for the group_id
+
+- **`max_threads`**: maximum number of threads to create for consumers.
+  The topics are round-robin'd onto threads to balance the load.
+  The adapter won't create more threads than topics.
+
+- **`max_queue_size`**: maximum size of the (internal to Kafka) message queue.
+  If the queue is full, messages can be dropped, so the default is very large.
+
+### MessageMapper
+
+In order to publish or subscribe, you need to define a MsgMapper.
+These are the supported message types:
+
+- **`JSONTextMessageMapper(datetime_type = DateTimeType.UNKNOWN)`**
+- **`ProtoMessageMapper(datetime_type = DateTimeType.UNKNOWN)`**
+
+You should choose the `DateTimeType` based on how you want (when publishing) or expect (when subscribing) your datetimes to be represented on the wire.
+The supported options are:
+
+- `UINT64_NANOS`
+- `UINT64_MICROS`
+- `UINT64_MILLIS`
+- `UINT64_SECONDS`
+
+The enum is defined in [csp/adapters/utils.py](https://github.com/Point72/csp/blob/main/csp/adapters/utils.py#L5).
+
+Note that the `JSONTextMessageMapper` currently does not have support for lists.
+To subscribe to JSON data with lists, simply subscribe using the `RawTextMessageMapper` and process the text into JSON (e.g. via `json.loads`).
+
+### Subscribing and Publishing
+
+Once you have a `KafkaAdapterManager` object and a `MsgMapper` object, you can subscribe to topics using the following method:
+
+```python
+KafkaAdapterManager.subscribe(
+    ts_type: type,
+    msg_mapper: MsgMapper,
+    topic: str,
+    key=None,
+    field_map: typing.Union[dict,str] = None,
+    meta_field_map: dict = None,
+    push_mode: csp.PushMode = csp.PushMode.LAST_VALUE,
+    adjust_out_of_order_time: bool = False
+):
+```
+
+- **`ts_type`**: the timeseries type you want to get the data on. This can be a `csp.Struct` or basic timeseries type
+- **`msg_mapper`**: the `MsgMapper` object discussed above
+- **`topic`**: the topic to subscribe to
+- **`key`**: The key to subscribe to. If `None`, then this will subscribe to all messages on the topic. Note that in this "wildcard" mode, all messages will tick as "live" as replay in engine time cannot be supported
+- **`field_map`**: dictionary of `{message_field: struct_field}` to define how the subscribed message gets mapped onto the struct
+- **`meta_field_map`**: to extract meta information from the kafka message, provide a meta_field_map dictionary of meta field info → struct field name to place it into.
+  The following meta fields are currently supported:
+  - **`"partition"`**: which partition the message came from
+  - **`"offset"`**: the kafka offset of the given message
+  - **`"live"`**: whether this message is "live" and not being replayed
+  - **`"timestamp"`**: timestamp of the kafka message
+  - **`"key"`**: key of the message
+- **`push_mode`**: `csp.PushMode` (LAST_VALUE, NON_COLLAPSING, BURST)
+- **`adjust_out_of_order_time`**: in some cases it has been seen that kafka can produce out of order messages, even for the same key.
+  This allows the adapter to be more lax and let such messages through by forcing time to max(time, prev time)
+
+Similarly, you can publish on topics using the following method:
+
+```python
+KafkaAdapterManager.publish(
+    msg_mapper: MsgMapper,
+    topic: str,
+    key: str,
+    x: ts['T'],
+    field_map: typing.Union[dict,str] = None
+):
+```
+
+- **`msg_mapper`**: same as above
+- **`topic`**: same as above
+- **`key`**: key to publish to
+- **`x`**: the timeseries to publish
+- **`field_map`**: dictionary of {struct_field: message_field} to define how the struct gets mapped onto the published message.
+  Note this dictionary is the opposite of the field_map in subscribe()
+
+### Known Issues
+
+If you are having issues, such as not getting any output or the application simply locking up, start by ensuring that you are logging the adapter's `status()` with a `csp.print`/`log` call and set `debug=True`.
+Then follow the known issues below.
+
+- Reason: `GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (No Kerberos credentials available)`
+
+  - **Resolution**: Kafka uses Kerberos tickets for authentication. You need to set up a Kerberos token first
+
+- `Message received on unknown topic: errcode: Broker: Group authorization failed error: FindCoordinator response error: Group authorization failed.`
+
+  - **Resolution**: Kafka brokers running on Windows are case-sensitive to the Kerberos token.
When creating the Kerberos token with kinit, make sure to use the principal name with a case-sensitive user id.
+
+- `authentication: SASL handshake failed (start (-4)): SASL(-4): no mechanism available: No worthy mechs found (after 0ms in state AUTH_REQ)`
+
+  - **Resolution**: cyrus-sasl-gssapi needs to be installed on the box for Kafka Kerberos authentication
+
+- `Message error on topic "an-example-topic". errcode: Broker: Topic authorization failed error: Subscribed topic not available: an-example-topic: Broker: Topic authorization failed)`
+
+  - **Resolution**: The user account does not have access to the topic
+
+## Parquet
+
+### ParquetReader
+
+The `ParquetReader` adapter is a generic user adapter to stream data from [Apache Parquet](https://parquet.apache.org/) files as a CSP time series.
+The `ParquetReader` adapter supports only flat (non-hierarchical) parquet files with all the primitive types that are supported by the CSP framework.
+
+#### API
+
+```python
+ParquetReader(
+    self,
+    filename_or_list,
+    symbol_column=None,
+    time_column=None,
+    tz=None
+):
+    """
+    :param filename_or_list: The specifier of the file/files to be read. Can be either:
+        - Instance of str, in which case it's interpreted as a path of a single file to be read
+        - A callable, in which case it's interpreted as a generator function that will be called like f(starttime, endtime) where starttime and endtime
+          are the start and end times of the current engine run. It's expected to generate a sequence of filenames to read.
+        - Iterable container, for example a list of files to read
+    :param symbol_column: An optional parameter that specifies the name of the symbol column in the file, if there is any
+    :param time_column: A mandatory specification of the time column name in the parquet files. This column will be used to inject the row values
+        from parquet at the given timestamps.
+ :param tz: The pytz timezone of the timestamp column, should only be provided if the time_column in parquet file doesn't have tz info. +""" +``` + +#### Subscription + +```python +def subscribe( + self, + symbol, + typ, + field_map=None, + push_mode: csp.PushMode = csp.PushMode.NON_COLLAPSING +): + """Subscribe to the rows corresponding to a given symbol + This form of subscription can be used only if non empty symbol_column was supplied during ParquetReader construction. + :param symbol: The symbol to subscribe to, for example 'AAPL' + :param typ: The type of the CSP time series subscription. Can either be a primitive type like int or alternatively a type + that inherits from csp.Struct, in which case each instance of the struct will be constructed from the matching file columns. + :param field_map: A map of the fields from parquet columns for the CSP time series. If typ is a primitive, then field_map should be + a string specifying the column name, if typ is a csp.Struct then field_map should be a str->str dictionary of the form + {column_name:struct_field_name}. For structs field_map can be omitted in which case we expect a one to one match between the given Struct + fields and the parquet files columns. + :param push_mode: A push mode for the output adapter + """ + +def subscribe_all( + self, + typ, + field_map=None, + push_mode: csp.PushMode = csp.PushMode.NON_COLLAPSING +): + """Subscribe to all rows of the input files. + :param typ: The type of the CSP time series subscription. Can either be a primitive type like int or alternatively a type + that inherits from csp.Struct, in which case each instance of the struct will be constructed from the matching file columns. + :param field_map: A map of the fields from parquet columns for the CSP time series. If typ is a primitive, then field_map should be + a string specifying the column name, if typ is a csp.Struct then field_map should be a str->str dictionary of the form + {column_name:struct_field_name}. 
For structs field_map can be omitted in which case we expect a one to one match between the given Struct + fields and the parquet files columns. + :param push_mode: A push mode for the output adapter + """ +``` + +Parquet reader provides two subscription methods. +**`subscribe`** produces a time series only of the rows that correspond to the given symbol, +\*\*`subscribe_all`\*\*produces a time series of all rows in the parquet files. + +### ParquetWriter + +The ParquetWriter adapter is a generic user adapter to stream data from CSP time series to [Apache Parquet](https://parquet.apache.org/) files. +`ParquetWriter` adapter supports only flat (non hierarchical) parquet files with all the primitive types that are supported by the CSP framework. +Any time series of Struct objects will be flattened to multiple columns. + +#### Construction + +```python +ParquetWriter( + self, + file_name: Optional[str], + timestamp_column_name, + config: Optional[ParquetOutputConfig] = None, + filename_provider: Optional[csp.ts[str]] = None +): + """ + :param file_name: The path of the output parquet file name. Must be provided if no filename_provider specified. If both file_name and filename_provider are specified then file_name will be used as the initial output file name until filename_provider provides a new file name. + :param timestamp_column_name: Required field, if None is provided then no timestamp will be written. + :param config: Optional configuration of how the file should be written (such as compression, block size,...). + :param filename_provider: An optional time series that provides a times series of file paths. When a filename_provider time series provides a new file path, the previous open file name will be closed and all subsequent data will be written to the new file provided by the path. This enable partitioning and splitting the data based on time. 
+    """
+```
+
+#### Publishing
+
+```python
+def publish_struct(
+    self,
+    value: ts[csp.Struct],
+    field_map: Dict[str, str] = None
+):
+    """Publish a time series of csp.Struct objects to file
+
+    :param value: The time series of Struct objects that should be published.
+    :param field_map: An optional dict str->str of the form {struct_field_name:column_name} that maps the names of the
+    structure fields to the column names to which the values should be written. If the field_map is not None, then only
+    the fields that are specified in the field_map will be written to file. If field_map is not provided then all fields
+    of a structure will be written to columns that match the field names exactly.
+    """
+
+def publish(
+    self,
+    column_name,
+    value: ts[object]
+):
+    """Publish a time series of primitive type to file
+    :param column_name: The name of the parquet file column to which the data should be written
+    :param value: The time series that should be published
+    """
+```
+
+Parquet writer provides two publishing methods.
+**`publish_struct`** is used to publish time series of **`csp.Struct`** objects while **`publish`** is used to publish primitive time series.
+The columns in the written parquet file are a union of all columns that were published (the order is preserved).
+A new row is written to the parquet file whenever any of the inputs ticks.
+For a given row, any column that corresponds to a time series that didn't tick will have a null value.
+
+## DBReader
+
+The DBReader adapter is a generic user adapter to stream data from a database as a reactive time series.
+It leverages sqlalchemy internally in order to be able to access various DB backends.
+
+Please refer to the [SQLAlchemy Docs](https://docs.sqlalchemy.org/en/13/core/tutorial.html) for information on how to create sqlalchemy connections.
+
+The DBReader instance represents a single connection to a database.
+From a single reader you can subscribe to various streams, either the entire stream of data (which would basically represent the result of a single join) or, if a symbol column is declared, subscribe by symbol which will then demultiplex rows to the right adapter.
+
+```python
+DBReader(self, connection, time_accessor, table_name=None, schema_name=None, query=None, symbol_column=None, constraint=None):
+    """
+    :param connection: sqlalchemy engine or (already connected) connection object.
+    :param time_accessor: TimeAccessor object
+    :param table_name: name of table in database as a string
+    :param query: either string query or sqlalchemy query object. Ex: "select * from users"
+    :param symbol_column: name of symbol column in table as a string
+    :param constraint: additional sqlalchemy constraints for query. Ex: constraint = db.text('PRICE>:price').bindparams(price = 100.0)
+    """
+```
+
+- **connection**: sqlalchemy engine or existing connection object.
+- **time_accessor**: see below
+- **table_name**: either table or query is required.
+  If passing a table_name then this table will be queried against for subscribe calls
+- **query**: (optional) if table isn't supplied the user can provide a direct query string or sqlalchemy query object.
+  This is useful if you want to run a join call.
+  For basic single-table queries passing table_name is preferred
+- **symbol_column**: (optional) in order to be able to demux rows by some column, pass `symbol_column`.
+  An example case for this is if the database has data stored for many symbols in a single table, and you want to have a timeseries tick per symbol.
+- **constraint**: (optional) additional sqlalchemy constraints for query. Ex: `constraint = db.text('PRICE>:price').bindparams(price=100.0)`
+
+### TimeAccessor
+
+All data fed into CSP must be time based.
+`TimeAccessor` is a helper class that defines how to extract timestamp information from the results of the data.
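The job a `TimeAccessor` does — turning row columns into a single timestamp, as the pre-canned `DateTimeAccessor` does for separate date and time columns — can be sketched in plain Python. This is a hypothetical helper for illustration only, not the csp implementation:

```python
from datetime import date, time, datetime, timezone
from typing import Optional

def combine_date_time(d: date, t: time, tz: Optional[timezone] = None) -> datetime:
    # Combine separate date and time columns into one timestamp,
    # attaching a timezone only when the columns are timezone-less.
    dt = datetime.combine(d, t)
    if tz is not None and dt.tzinfo is None:
        dt = dt.replace(tzinfo=tz)
    return dt

# A row with separate 'date' and 'time' columns, as a DB query might return
row = {"date": date(2021, 11, 8), "time": time(9, 30)}
stamp = combine_date_time(row["date"], row["time"], tz=timezone.utc)
# stamp == datetime(2021, 11, 8, 9, 30, tzinfo=timezone.utc)
```

A real accessor additionally tells the reader how to constrain the query to the engine's start/end times.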
+Users can define their own `TimeAccessor` implementation or use pre-canned ones:
+
+- `TimestampAccessor(self, time_column, tz=None)`: use this if there exists a single datetime column already.
+  Provide the column name and optionally the timezone of the column (if it's timezone-less in the db)
+- `DateTimeAccessor(self, date_column, time_column, tz=None)`: use this if there are two separate columns for date and time; this accessor will combine the two columns to create a single datetime.
+  Optionally pass tz if the time column is timezone-less in the db
+
+User implementations would have to extend the `TimeAccessor` interface.
+In addition to defining how to convert db columns to timestamps, accessors are also used to augment the query to limit the data to the graph's start and end times.
+
+Once you have a DBReader object created, you can subscribe to time series from it using the following methods:
+
+- `subscribe(self, symbol, typ, field_map=None)`
+- `subscribe_all(self, typ, field_map=None)`
+
+Both of these calls expect `typ` to be a `csp.Struct` type.
+`field_map` is a dictionary of `{ db_column : struct_column }` mappings that define how to map the database column names to the fields on the struct.
+
+`subscribe` is used to subscribe to a stream for the given symbol (symbol_column is required when creating DBReader)
+
+`subscribe_all` is used to retrieve all the data resulting from the request as a single timeseries.
+
+## Symphony
+
+The Symphony adapter allows for reading and writing of messages from the [Symphony](https://symphony.com/) message platform using [`requests`](https://requests.readthedocs.io/en/latest/) and the [Symphony SDK](https://docs.developers.symphony.com/).
+
+## Slack
+
+The Slack adapter allows for reading and writing of messages from the [Slack](https://slack.com) message platform using the [Slack Python SDK](https://slack.dev/python-slack-sdk/).
diff --git a/docs/wiki/2.-Math-Nodes-(csp.math).md b/docs/wiki/api-references/Math-and-Logic-Nodes-API.md similarity index 85% rename from docs/wiki/2.-Math-Nodes-(csp.math).md rename to docs/wiki/api-references/Math-and-Logic-Nodes-API.md index 4da20dda2..0f0d7d788 100644 --- a/docs/wiki/2.-Math-Nodes-(csp.math).md +++ b/docs/wiki/api-references/Math-and-Logic-Nodes-API.md @@ -1,18 +1,18 @@ -# Math and Logic nodes - -In an effort not to bloat the wiki, the following boolean and mathematical operations are available which should be self explanatory. +The following boolean and mathematical operations are available which should be self explanatory. Also note that there is syntactic sugar in place when wiring a graph. Edges have most operators overloaded includes `+`, `-`, `*`, `/`, `**`, `>`, `>=`, `<`, `<=`, `==`, `!=`, so you can have code like `csp.const(1) + csp.const(2)` work properly. Right hand side values will also automatically be upgraded to `csp.const()` if its detected that its not an edge, so something like `x = csp.const(1) + 2` will work as well. -## Binary logical operators +## Table of Contents + +1. Binary logical operators - **`csp.not_(ts[bool]) → ts[bool]`** - **`csp.and_(x: [ts[bool]]) → ts[bool]`** - **`csp.or_(x: [ts[bool]]) → ts[bool]`** -## Binary mathematical operators +2. Binary mathematical operators - **`csp.add(x: ts['T'], y: ts['T']) → ts['T']`** - **`csp.sub(x: ts['T'], y: ts['T']) → ts['T']`** @@ -22,7 +22,7 @@ Right hand side values will also automatically be upgraded to `csp.const( ts[Fill]: + ... regular csp.graph code ... 
+ + +@csp.graph +def main(): + # position as ts[int] + portfolio_position = get_portfolio_position() + + + all_orders = get_orders() + # demux fat-pipe of orders into a dynamic basket keyed by symbol + demuxed_orders = csp.dynamic_demultiplex(all_orders, all_orders.symbol) + + + result = csp.dynamic(demuxed_orders, my_sub_graph, + csp.snap(all_orders.symbol), # Grab scalar value of all_orders.symbol at time of instantiation + #csp.snapkey(), # Alternative way to grab the key that instantiated the sub-graph + csp.attach(), # extract the demuxed_orders[symbol] time series of the symbol being created in the sub_graph + portfolio_position, # pass in regular ts[] + 123) # pass in some scalar + + + # process result.fills which will be a csp.DynamicBasket of { symbol : ts[Fill] } +``` diff --git a/docs/wiki/8.-Profiler.md b/docs/wiki/api-references/csp.profiler-API.md similarity index 67% rename from docs/wiki/8.-Profiler.md rename to docs/wiki/api-references/csp.profiler-API.md index 3a485ccea..e0e5c0c36 100644 --- a/docs/wiki/8.-Profiler.md +++ b/docs/wiki/api-references/csp.profiler-API.md @@ -1,6 +1,4 @@ -The `csp.profiler` library allows users to time cycle/node executions during a graph run. There are two available utilities. - -# Profiler: runtime profiling +## `csp.profiler()` Users can simply run graphs under a `Profiler()` context to extract profiling information. The code snippet below runs a graph in profile mode and extracts the profiling data by calling `results()`. @@ -44,64 +42,7 @@ ProfilerInfo additionally comes with some useful utilities. These are: - **`ProfilerInfo.max_exec_node(self)`** - Returns the node type which had the most total executions as a tuple: `(name, node_stat)` where node_stat is a dictionary with the same keys as `node_stats[elem]` -One can use these metrics to identify bottlenecks/inefficiencies in their graphs. - -## Profiling a real-time csp.graph - -The `csp.profiler` library provides a GUI for profiling real-time csp graphs. 
-One can access this GUI by adding a `http_port`  argument to their profiler call. - -```python -with profiler.Profiler(http_port=8888) as p: - results = csp.run(graph, starttime=st, endtime=et) # run the graph normally -``` - -This will open up the GUI on `localhost:8888` (as http_port=8888) which will display real-time node timing, cycle timing and memory snapshots. -Profiling stats will be calculated whenever you refresh the page or call a GET request. -Additionally, you can add the `format=json`argument (`localhost:8888?format=json`) to your request to receive the ProfilerInfo as a `JSON`  object rather than the `HTML` display. - -Users can add the `display_graphs=True` flag to include bar/pie charts of node execution times in the web UI. -The matplotlib package is required to use the flag. - -```python -with profiler.Profiler(http_port=8888, display_graphs=True) as p: - ... -``` - -new_profiler - -## Saving raw profiling data to a file - -Users can save individual node execution times and individual cycle execution times to a `.csv` file if they desire. -This is useful if you want to apply your own analysis e.g. calculate percentiles. -To do this, simply add the flags `node_file=` or `cycle_file=` - -```python -with profiler.Profiler(cycle_file="cycle_data.csv", node_file="node_data.csv") as p: - ... -``` - -After the graph is run, the file `node_data.csv`  contains: - -``` -Node Type,Execution Time -count,1.9814e-05 -cast_int_to_float,1.2791e-05 -_time_window_updates,4.759e-06 -... -``` - -After the graph is run, the file `cycle_data.csv`  contains: - -``` -Execution Time -9.4757e-05 -4.5205e-05 -2.2873e-05 -... -``` - -# graph_info: build-time information +## `profiler.graph_info()` Users can also extract build-time information about the graph without running it by calling profiler.graph_info. The code snippet below shows how to call graph_info. 
diff --git a/docs/wiki/concepts/Adapters.md b/docs/wiki/concepts/Adapters.md
new file mode 100644
index 000000000..fc4e54907
--- /dev/null
+++ b/docs/wiki/concepts/Adapters.md
@@ -0,0 +1,15 @@
+To get various data sources into and out of the graph, various Input and Output Adapters are available, such as CSV, Parquet, and database adapters (amongst others).
+Users can also write their own input and output adapters, as explained in the how-to guides.
+
+There are two types of Input Adapters: **Historical** (aka Simulated) adapters and **Realtime** Adapters.
+
+Historical adapters are used to feed historical timeseries data into the graph from some data source which has timeseries data.
+Realtime Adapters are used to feed in live event-based data in realtime, generally events created from external sources on separate threads.
+
+There is no distinction between Historical and Realtime output adapters, since outputs need not care whether the timeseries data wired into them is generated from realtime or historical inputs.
+
+In CSP terminology, a single adapter corresponds to a single timeseries edge in the graph.
+There are common cases where a single data source may be used to provide data to multiple adapter (timeseries) instances; for example, a single CSV file with price data for many stocks can be read once but used to provide data to many individual adapters, one per stock.
+In such cases an AdapterManager is used to coordinate management of the single source (CSV file, database, Kafka connection, etc.) and provide data to individual adapters.
+
+Note that adapters can be quickly written and prototyped in Python, and if needed can be moved to a C++ implementation for more efficiency.
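The manager's role described above — reading the shared source once and routing rows to per-key timeseries — can be sketched in plain Python. This is illustrative only, not the actual csp AdapterManager API:

```python
from collections import defaultdict

# Rows as they might appear in a single shared CSV file of price data
rows = [
    {"symbol": "AAPL", "price": 150.0},
    {"symbol": "MSFT", "price": 300.0},
    {"symbol": "AAPL", "price": 151.0},
]

def demux_by_symbol(rows):
    # One pass over the shared source, routing each row to the
    # per-symbol series that an adapter manager would feed.
    series = defaultdict(list)
    for row in rows:
        series[row["symbol"]].append(row["price"])
    return dict(series)

per_symbol = demux_by_symbol(rows)
# per_symbol == {"AAPL": [150.0, 151.0], "MSFT": [300.0]}
```

Each list here stands in for one adapter edge; the real manager pushes values into the engine instead of collecting them.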
diff --git a/docs/wiki/concepts/CSP-Graph.md b/docs/wiki/concepts/CSP-Graph.md
new file mode 100644
index 000000000..c84b6542f
--- /dev/null
+++ b/docs/wiki/concepts/CSP-Graph.md
@@ -0,0 +1,114 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Anatomy of a `csp.graph`](#anatomy-of-a-cspgraph)
+- [Graph Propagation and Single-dispatch](#graph-propagation-and-single-dispatch)
+- [Graph Pruning](#graph-pruning)
+- [Collecting Graph Outputs](#collecting-graph-outputs)
+
+## Anatomy of a `csp.graph`
+
+`csp.graph` methods are called in order to construct the graph and are only executed before the engine is run.
+`csp.graph` methods don't do anything special; they are essentially regular Python methods, but they can be defined to accept inputs and generate outputs similar to `csp.nodes`.
+These input and output declarations are solely used for type checking.
+`csp.graph` methods can be created to encapsulate components of a graph, and can be called from other `csp.graph` methods in order to help facilitate graph building.
+
+Simple example:
+
+```python
+@csp.graph
+def calc_symbol_pnl(symbol: str, trades: ts[Trade]) -> ts[float]:
+    # sub-graph code needed to compute pnl for given symbol and symbol's trades
+    # sub-graph can subscribe to market data for the symbol as needed
+    ...
+
+
+@csp.graph
+def calc_portfolio_pnl(symbols: [str]) -> ts[float]:
+    symbol_pnl = []
+    for symbol in symbols:
+        symbol_trades = trade_adapter.subscribe(symbol)
+        symbol_pnl.append(calc_symbol_pnl(symbol, symbol_trades))
+
+    return csp.sum(symbol_pnl)
+```
+
+In this simple example we have a `csp.graph` component `calc_symbol_pnl` which encapsulates computing pnl for a single symbol.
+`calc_portfolio_pnl` is a graph that computes portfolio-level pnl; it invokes the symbol-level pnl calc for every symbol, then sums up the results for the portfolio-level pnl.
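Because `csp.graph` functions are ordinary Python executed at build time, composing them simply builds up a data structure of edges before the engine runs. A toy model of that wiring (all names here are hypothetical stand-ins, not the csp API):

```python
class Edge:
    # A built edge: the operation it represents and its input edges
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs

def subscribe(symbol):
    # stand-in for an input adapter returning an edge
    return Edge("subscribe:" + symbol, [])

def calc_symbol_pnl(symbol, trades):
    # stand-in for the per-symbol sub-graph
    return Edge("pnl:" + symbol, [trades])

def calc_portfolio_pnl(symbols):
    # plain Python loop wiring one sub-graph per symbol, then a sum node
    return Edge("sum", [calc_symbol_pnl(s, subscribe(s)) for s in symbols])

graph = calc_portfolio_pnl(["AAPL", "MSFT"])
# graph.op == "sum", with one pnl input edge per symbol
```

Nothing "runs" here: calling the graph functions only records structure, which is exactly what happens before `csp.run` starts the engine.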
+
+## Graph Propagation and Single-dispatch
+
+The CSP graph propagation algorithm ensures that all nodes are executed *once* per engine cycle, and in the correct order.
+Correct order means that all input dependencies of a given node are guaranteed to have been evaluated before the node itself is executed.
+Take this graph for example:
+
+![359407953](https://github.com/Point72/csp/assets/3105306/d9416353-6755-4e37-8467-01da516499cf)
+
+On a given cycle let's say the `bid` input ticks.
+The CSP engine will ensure that **`mid`** is executed, followed by **`spread`**, and only once **`spread`**'s output is updated will **`quote`** be called.
+When **`quote`** executes it will have the latest values of the `mid` and `spread` calcs for this cycle.
+
+## Graph Pruning
+
+One should note a subtle optimization technique in CSP graphs.
+Any part of a graph that is created at graph building time, but is NOT connected to any output nodes, will be pruned from the graph and will not exist during runtime.
+An output is defined as either an output adapter or a `csp.node` without any outputs of its own.
+The idea here is that we can avoid doing work if it doesn't result in any output being generated.
+In general it's best practice for all `csp.nodes` to be **side-effect free**; in other words, they shouldn't mutate any state outside of the node.
+Assuming all nodes are side-effect free, pruning the graph would not have any noticeable effects.
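The single-dispatch rule above can be sketched as a rank ordering over the example graph: every node runs at most once per cycle, only after all of its inputs. This is a toy illustration, not the engine's actual scheduler:

```python
# Dependencies for the bid/ask -> mid -> spread -> quote example
deps = {
    "mid": ["bid", "ask"],
    "spread": ["mid"],
    "quote": ["mid", "spread"],
}

def dispatch_order(deps):
    # Rank of a node = 1 + max rank of its inputs; sources have rank 1.
    ranks = {}
    def rank(node):
        if node not in ranks:
            ranks[node] = 1 + max((rank(d) for d in deps.get(node, [])), default=0)
        return ranks[node]
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    return sorted(nodes, key=rank)

order = dispatch_order(deps)
# inputs always precede dependents: mid before spread, spread before quote
```

Sorting by rank guarantees `quote` sees this cycle's already-updated `mid` and `spread` values, matching the behavior described above.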
+ +## Collecting Graph Outputs + +If the `csp.graph` passed to `csp.run` has outputs, the full timeseries will be returned from `csp.run` like so: + +**outputs example** + +```python +import csp +from datetime import datetime, timedelta + +@csp.graph +def my_graph() -> ts[int]: + return csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1))) + +if __name__ == '__main__': + res = csp.run(my_graph, starttime=datetime(2021,11,8)) + print(res) +``` + +result: + +```raw +{0: [(datetime.datetime(2021, 11, 8, 0, 0), 1), (datetime.datetime(2021, 11, 8, 0, 0, 1), 2)]} +``` + +Note that the result is a list of `(datetime, value)` tuples. + +You can also use [csp.add_graph_output]() to add outputs. +These do not need to be in the top-level graph called directly from `csp.run`. + +This gives the same result: + +**add_graph_output example** + +```python +@csp.graph +def my_graph(): + csp.add_graph_output('a', csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1)))) +``` + +In addition to python outputs like above, you can set the optional `csp.run` argument `output_numpy` to `True` to get outputs as numpy arrays: + +**numpy outputs** + +```python +result = csp.run(my_graph, starttime=datetime(2021,11,8), output_numpy=True) +``` + +result: + +```raw +{0: (array(['2021-11-08T00:00:00.000000000', '2021-11-08T00:00:01.000000000'], dtype='datetime64[ns]'), array([1, 2], dtype=int64))} +``` + +Note that the result there is a tuple per output, containing two numpy arrays, one with the datetimes and one with the values. 
diff --git a/docs/wiki/concepts/CSP-Node.md b/docs/wiki/concepts/CSP-Node.md
new file mode 100644
index 000000000..229bfdc34
--- /dev/null
+++ b/docs/wiki/concepts/CSP-Node.md
@@ -0,0 +1,271 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Anatomy of a `csp.node`](#anatomy-of-a-cspnode)
+- [Basket inputs](#basket-inputs)
+- [**Node Outputs**](#node-outputs)
+- [Basket Outputs](#basket-outputs)
+- [Generic Types](#generic-types)
+
+## Anatomy of a `csp.node`
+
+At the heart of a calculation graph are the `csp.nodes` that run the computations.
+`csp.node` methods can take any number of scalar and timeseries arguments, and can return 0 → N timeseries outputs.
+Timeseries inputs/outputs should be thought of as the edges that connect components of the graph.
+These "edges" can tick whenever they have a new value.
+Every tick is associated with a value and the time of the tick.
+`csp.nodes` can have various other features; here is an example of a `csp.node` that demonstrates many of them.
+Keep in mind that nodes will execute repeatedly as inputs tick with new data.
+They may (or may not) generate an output as a result of an input tick.
+
+```python
+from datetime import timedelta
+
+@csp.node                                                            # 1
+def demo_node(n: int, xs: ts[float], ys: ts[float]) -> ts[float]:    # 2
+    with csp.alarms():                                               # 3
+        # Define an alarm time-series of type bool                   # 4
+        alarm = csp.alarm(bool)                                      # 5
+                                                                     # 6
+    with csp.state():                                                # 7
+        # Create a state variable bound to the node                  # 8
+        s_sum = 0.0                                                  # 9
+                                                                     # 10
+    with csp.start():                                                # 11
+        # Code block that executes once on start of the engine       # 12
+        # one can set timeseries properties here as well, such as    # 13
+        # csp.set_buffering_policy(xs, tick_count=5)                 # 14
+        # csp.set_buffering_policy(xs, tick_history=timedelta(minutes=1)) # 15
+        # csp.make_passive(xs)                                       # 16
+        csp.schedule_alarm(alarm, timedelta(seconds=1), True)        # 17
+                                                                     # 18
+    with csp.stop():                                                 # 19
+        pass  # code block to execute when the engine is done        # 20
+                                                                     # 21
+    if csp.ticked(xs, ys) and csp.valid(xs, ys):                     # 22
+        s_sum += xs * ys                                             # 23
+                                                                     # 24
+    if csp.ticked(alarm):                                            # 25
+        csp.schedule_alarm(alarm, timedelta(seconds=1), True)        # 26
+        return s_sum                                                 # 27
+```
+
+Let's review line by line
+
+1\) Every CSP node must start with the **`@csp.node`** decorator
+
+2\) CSP nodes are fully typed and type-checking is strictly enforced.
+All arguments must be typed, as well as all outputs.
+Outputs are typed using function annotation syntax.
+
+Single outputs can be unnamed; multiple outputs must be named.
+When using multiple outputs, annotate the type using **`def my_node(inputs) → csp.Outputs(name1=ts['T'], name2=ts['V'])`** where `T` and `V` are the respective types of `name1` and `name2`.
+
+Note the syntax of timeseries inputs, they are denoted by **`ts[type]`**.
+Scalars can be passed in as regular types; in this example we pass in `n` which expects a type of `int`.
+
+3\) **`with csp.alarms()`**: nodes can (optionally) declare internal alarms; every instance of the node will get its own alarm that can be scheduled and acts just like a timeseries input.
+All alarms must be declared within the alarms context.
+
+5\) Instantiate an alarm in the alarms context using the `csp.alarm(typ)` function. This creates an alarm which is a time-series of type `typ`.
+
+7\) **`with csp.state()`**: optional state variables can be defined under the state context.
+Note that variables declared in state will live across invocations of the method.
+
+9\) An example declaration and initialization of state variable `s_sum`.
+It is good practice to name state variables prefixed with `s_`, which is the convention in the CSP codebase.
+
+11\) **`with csp.start()`**: an optional block to execute code at the start of the engine.
+Generally this is used to set up initial timers, set input timeseries properties such as buffer sizes, or make inputs passive.
+
+14-15) **`csp.set_buffering_policy`**: nodes can request a certain amount of history be kept on the incoming time series; this can be denoted in number of ticks or in time.
+By setting a buffering policy, nodes can access historical values of the timeseries (by default only the last value is kept).
+
+16\) **`csp.make_passive`** / **`csp.make_active`**: Nodes may not need to react to all of their inputs; they may just need their latest value.
+For performance purposes the node can mark an input as passive to avoid triggering the node unnecessarily.
+`make_active` can be called to reactivate an input.
+
+17\) **`csp.schedule_alarm`**: schedules a one-shot tick on the given alarm input.
+The values given are the timedelta before the alarm triggers and the value it will have when it triggers.
+Note that `schedule_alarm` can be called multiple times on the same alarm to schedule multiple triggers.
+
+19\) **`with csp.stop()`**: an optional block that executes when the engine is done running.
+
+22\) All nodes will have `if` conditions to react to different inputs.
+**`csp.ticked()`** takes any number of inputs and returns true if **any** of the inputs ticked.
+**`csp.valid()`** similarly takes any number of inputs; however, it only returns true if **all** inputs are valid.
+Valid means that an input has had at least one tick and so it has a "current value".
+
+23\) One of the benefits of CSP is that you always have easy access to the latest value of all inputs.
+`xs` and `ys` on lines 22-23 will always have the latest value of both inputs, even if only one of them just ticked.
+
+25\) This demonstrates how an alarm can be treated like any other input.
+
+27\) We tick our running "sum" as an output here every second.
+
+## Basket inputs
+
+In addition to single time-series inputs, a node can also accept a **basket** of time series as an argument.
+A basket is essentially a collection of timeseries which can be passed in as a single argument.
+Baskets can either be list baskets or dict baskets.
+Individual timeseries in a basket can tick independently, and they can be looked at and reacted to individually or as a collection.
+
+For example:
+
+```python
+@csp.node                                    # 1
+def demo_basket_node(                        # 2
+    list_basket: [ts[int]],                  # 3
+    dict_basket: {str: ts[int]}              # 4
+) -> ts[float]:                              # 5
+                                             # 6
+    if csp.ticked(list_basket):              # 7
+        return sum(list_basket.validvalues())  # 8
+                                             # 9
+    if csp.ticked(list_basket[3]):           # 10
+        return list_basket[3]                # 11
+                                             # 12
+    if csp.ticked(dict_basket):              # 13
+        # can iterate over ticked key,items  # 14
+        # for k,v in dict_basket.tickeditems():  # 15
+        #     ...                            # 16
+        return sum(dict_basket.tickedvalues())  # 17
+```
+
+3\) Note the syntax of basket inputs.
+List baskets are noted as `[ts[type]]` (a list of time series) and dict baskets are `{key_type: ts[ts_type]}` (a dictionary of timeseries keyed by type `key_type`). It is also possible to use the `List[ts[int]]` and `Dict[str, ts[int]]` typing notation.
+
+7\) Just like single timeseries, we can react to a basket if it ticked.
+The convention is the same as passing multiple inputs to `csp.ticked`: `csp.ticked` is true if **any** basket input ticked.
+`csp.valid` is true if **all** basket inputs are valid.
+
+8\) Baskets have various iterators to access their inputs:
+
+- **`tickedvalues`**: iterator of values of all ticked inputs
+- **`tickedkeys`**: iterator of keys of all ticked inputs (keys are list indices for list baskets)
+- **`tickeditems`**: iterator of (key, value) tuples of ticked inputs
+- **`validvalues`**: iterator of values of all valid inputs
+- **`validkeys`**: iterator of keys of all valid inputs
+- **`validitems`**: iterator of (key, value) tuples of valid inputs
+- **`keys`**: list of keys on the basket (**dictionary baskets only**)
+
+10-11) This demonstrates the ability to access an individual element of a
+basket and react to it as well as access its current value
+
+## **Node Outputs**
+
+Nodes can return any number of outputs (including no outputs, in which case it is considered an "output" or sink node,
+see [Graph Pruning](https://github.com/Point72/csp/wiki/CSP-Graph#graph-pruning)).
+Nodes with single outputs can return the output as an unnamed output.
+Nodes returning multiple outputs must have them be named.
+When a node is called at graph building time, if it has a single unnamed output the return variable is an edge representing the output, which can be passed into other nodes.
+An output timeseries cannot be ticked more than once in a given node invocation.
+If the outputs are named, the return value is an object with the outputs available as attributes.
+For example (the examples below demonstrate various ways to output the data as well):
+
+```python
+@csp.node
+def single_unnamed_outputs(n: ts[int]) -> ts[int]:
+    # can either do
+    return n
+    # or
+    # csp.output(n) to continue processing after the output
+
+
+@csp.node
+def multiple_named_outputs(n: ts[int]) -> csp.Outputs(y=ts[int], z=ts[float]):
+    # can do
+    # csp.output(y=n, z=n+1.) to output to multiple outputs
+    # or separate the outputs to tick out at separate points:
+    # csp.output(y=n)
+    # ...
+    # csp.output(z=n+1.)
+    # or can return multiple values with:
+    return csp.output(y=n, z=n+1.)
+
+@csp.graph
+def my_graph(n: ts[int]):
+    x = single_unnamed_outputs(n)
+    # x represents the output edge of single_unnamed_outputs,
+    # we can pass it as a time series input to other nodes
+    csp.print('x', x)
+
+
+    result = multiple_named_outputs(n)
+    # result holds all the outputs of multiple_named_outputs, which can be accessed as attributes
+    csp.print('y', result.y)
+    csp.print('z', result.z)
+```
+
+## Basket Outputs
+
+Similarly to inputs, a node can also produce a basket of timeseries as an output.
+For example:
+
+```python
+class MyStruct(csp.Struct):                                            # 1
+    symbol: str                                                        # 2
+    index: int                                                         # 3
+    value: float                                                       # 4
+                                                                       # 5
+@csp.node                                                              # 6
+def demo_basket_output_node(                                           # 7
+    in_: ts[MyStruct],                                                 # 8
+    symbols: [str],                                                    # 9
+    num_symbols: int                                                   # 10
+) -> csp.Outputs(                                                      # 11
+    dict_basket=csp.OutputBasket({str: ts[float]}, shape="symbols"),   # 15
+    list_basket=csp.OutputBasket([ts[float]], shape="num_symbols"),    # 16
+):                                                                     # 17
+                                                                       # 18
+    if csp.ticked(in_):                                                # 19
+        # output to dict basket                                        # 20
+        csp.output(dict_basket[in_.symbol], in_.value)                 # 21
+        # alternate output syntax, can output multiple keys at once    # 22
+        # csp.output(dict_basket={in_.symbol: in_.value})              # 23
+        # output to list basket                                        # 24
+        csp.output(list_basket[in_.index], in_.value)                  # 25
+        # alternate output syntax, can output multiple keys at once    # 26
+        # csp.output(list_basket={in_.index: in_.value})               # 27
+```
+
+11-17) Note the output declaration syntax.
+A basket output can be either named or unnamed (both examples here are named), and its shape can be specified in two ways.
+The `shape` parameter takes either a scalar value that defines the shape of the basket or the name of a scalar argument to use (a dict basket expects `shape` to be a list of keys; a list basket expects `shape` to be an `int`).
+`shape_of` is used to take the shape of an input basket and apply it to the output basket.
+
+20+) There are several choices for output syntax.
+The following work for both list and dict baskets: + +- `csp.output(basket={key: value, key2: value2, ...})` +- `csp.output(basket[key], value)` +- `csp.output({key: value}) # only works if the basket is the only output` + +## Generic Types + +CSP supports syntax for generic types as well. +To denote a generic type we use a string (typically `'T'` is used) to denote a generic type. +When a node is called the type of the argument will get bound to the given type variable, and further inputs / outputs will be checked and bound to said typevar. +Note that the string syntax `'~T'` denotes the argument expects the *value* of a type, rather than a type itself: + +```python +@csp.node +def sample(trigger: ts[object], x: ts['T']) -> ts['T']: + '''will return current value of x on trigger ticks''' + with csp.state(): + csp.make_passive(x) + + if csp.ticked(trigger) and csp.valid(x): + return x + + +@csp.node +def const(value: '~T') -> ts['T']: + ... +``` + +`sample` takes a timeseries of type `'T'` as an input, and returns a timeseries of type `'T'`. +This allows us to pass in a `ts[int]` for example, and get a `ts[int]` as an output, or `ts[bool]` → `ts[bool]` + +`const` takes value as an *instance* of type `T`, and returns a timeseries of type `T`. +So we can call `const(5)` and get a `ts[int]` output, or `const('hello!')` and get a `ts[str]` output, etc... diff --git a/docs/wiki/concepts/Execution-Modes.md b/docs/wiki/concepts/Execution-Modes.md new file mode 100644 index 000000000..46902a820 --- /dev/null +++ b/docs/wiki/concepts/Execution-Modes.md @@ -0,0 +1,243 @@ +The CSP engine can be run in two flavors, realtime and simulation. + +In simulation mode, the engine is always run at full speed pulling in time-based data from its input adapters and running them through the graph. +All inputs in simulation are driven off the provided timestamped data of its inputs. + +In realtime mode, the engine runs in wallclock time as of "now". 
+Realtime engines can get data from realtime adapters which source data on separate threads and pass them through to the engine (i.e. think of ActiveMQ events happening on an ActiveMQ thread and being passed along to the engine in "realtime").
+
+Since engines can run in both simulated and realtime mode, users should **always** use **`csp.now()`** to get the current time in `csp.node`s.
+
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Simulation Mode](#simulation-mode)
+- [Realtime Mode](#realtime-mode)
+- [csp.PushMode](#csppushmode)
+- [Realtime Group Event Synchronization](#realtime-group-event-synchronization)
+
+## Simulation Mode
+
+Simulation mode is the default mode of the engine.
+As stated above, simulation mode is used when you want your engine to crunch through historical data as fast as possible.
+In simulation mode, the engine runs on some historical data that is fed in through various adapters.
+The adapters provide events by time, and they are streamed into the engine via the adapter timeseries in time order.
+`csp.timer` and `csp.node` alarms are scheduled and executed in "historical time" as well.
+Note that there is no strict requirement for simulated runs to run on historical dates.
+As long as the engine is not in realtime mode, it remains in simulation mode until the provided endtime, even if endtime is in the future.
+
+## Realtime Mode
+
+Realtime mode is opted into by passing `realtime=True` to `csp.run(...)`.
+When run in realtime mode, the engine will run in simulation mode from the provided starttime → wallclock "now" as of the time of calling run.
+Once the simulation run is done, the engine switches into realtime mode.
+Under realtime mode, external realtime adapters will be able to send data into the engine thread.
+All time-based inputs such as `csp.timer` and alarms will switch to executing in wallclock time as well.
+
+As always, `csp.now()` should still be used in `csp.node` code, even when running in realtime mode.
+`csp.now()` will be the time assigned to the current engine cycle.
+
+## csp.PushMode
+
+When consuming data from input adapters, there are three choices on how one can consume the data:
+
+| PushMode | EngineMode | Description |
+| :------- | :--------- | :---------- |
+| **LAST_VALUE** | Simulation | all ticks from the input source with duplicate timestamps (on the same timeseries) will tick once with the last value on a given timestamp |
+| | Realtime | all ticks that occurred since the previous engine cycle will collapse / conflate to the latest value |
+| **NON_COLLAPSING** | Simulation | all ticks from the input source with duplicate timestamps (on the same timeseries) will tick once per engine cycle. Subsequent cycles will execute with the same time |
+| | Realtime | all ticks that occurred since the previous engine cycle will be ticked across subsequent engine cycles as fast as possible |
+| **BURST** | Simulation | all ticks from the input source with duplicate timestamps (on the same timeseries) will tick once with a list of all the values |
+| | Realtime | all ticks that occurred since the previous engine cycle will tick once with a list of all the values |
+
+## Realtime Group Event Synchronization
+
+The CSP framework supports properly synchronizing events across multiple timeseries that are sourced from the same realtime adapter.
+A classical example of this is a market data feed.
+Say you consume bid, ask and trade as 3 separate time series for the same product / exchange.
+Since the data flows in asynchronously from a separate thread, bid, ask and trade events could end up executing in the engine at arbitrary slices of time, leading to crossed books and trades that are out of range of the bid/ask.
+The engine can provide a correct synchronous view of all the inputs, regardless of their PushModes. 
+It's up to adapter implementations to determine which inputs are part of a synchronous "PushGroup".
+
+Here's a classical example.
+An application wants to consume conflated bid/ask as LAST_VALUE, but it doesn't want to conflate trades, so they are consumed as NON_COLLAPSING.
+
+Let's say we have this sequence of events on the actual market data feed's thread, coming in on the wire in this order.
+The columns denote the time the callbacks come in off the market data thread.
+
+| Event | T      | T+1    | T+2    | T+3   | T+4   | T+5   | T+6    |
+| :---- | :----- | :----- | :----- | :---- | :---- | :---- | :----- |
+| BID   | 100.00 | 100.01 |        | 99.97 | 99.98 | 99.99 |        |
+| ASK   | 100.02 |        | 100.03 |       |       |       | 100.00 |
+| TRADE |        |        | 100.02 |       |       |       | 100.03 |
+
+Without any synchronization you can end up with nonsensical views based on random timing.
+Here's one such possibility (bid/ask are still LAST_VALUE, trade is NON_COLLAPSING).
+
+Here, ET is engine time.
+Let's assume the engine had a huge delay and hasn't processed any of the data submitted above yet.
+Without any synchronization, bid/ask would completely conflate, and trade would unroll over multiple engine cycles:
+
+| Event | ET     | ET+1   |
+| :---- | :----- | :----- |
+| BID   | 99.99  |        |
+| ASK   | 100.00 |        |
+| TRADE | 100.02 | 100.03 |
+
+However, since market data adapters will group bid/ask/trade inputs together, the engine won't let bid/ask events advance ahead of trade events since trade is NON_COLLAPSING.
+NON_COLLAPSING inputs essentially act as a barrier, not allowing events ahead of the barrier to tick before the barrier is complete.
+Let's assume again that the engine had a huge delay and hasn't processed any data submitted above.
+With proper barrier synchronization the engine cycles would look like this under the same conditions:
+
+| Event | ET     | ET+1   | ET+2   |
+| :---- | :----- | :----- | :----- |
+| BID   | 100.01 | 99.99  |        |
+| ASK   | 100.03 |        | 100.00 |
+| TRADE | 100.02 | 100.03 |        |
+
+Note how the last ask tick of 100.00 got held up to a separate cycle (ET+2) so that trade could tick with the correct view of bid/ask at the time of the second trade (ET+1).
+
+As another example, let's say the engine got delayed briefly at wire time T, so it was able to process T+1 data.
+Similarly, it got briefly delayed at time T+4 until after T+6. The engine would then be able to process all data at times T+1, T+2, T+3 and T+6, leading to this sequence of engine cycles.
+The equivalent "wire time" is denoted in parentheses.
+
+| Event | ET (T+1) | ET+1 (T+2) | ET+2 (T+3) | ET+3 (T+5) | ET+4 (T+6) |
+| :---- | :------- | :--------- | :--------- | :--------- | :--------- |
+| BID   | 100.01   |            | 99.97      | 99.99      |            |
+| ASK   | 100.02   | 100.03     |            |            | 100.00     |
+| TRADE |          | 100.02     |            |            | 100.03     |
diff --git a/docs/wiki/concepts/Historical-Buffers.md b/docs/wiki/concepts/Historical-Buffers.md new file mode 100644 index 000000000..8eafc3c2c --- /dev/null +++ b/docs/wiki/concepts/Historical-Buffers.md @@ -0,0 +1,133 @@ +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Historical Buffers](#historical-buffers) +- [Historical Range Access](#historical-range-access) + +## Historical Buffers + +CSP can provide access to historical input data as well. +By default only the last value of an input is kept in memory, however one can request history to be kept on an input either by number of ticks or by time using **csp.set_buffering_policy.** + +The methods **csp.value_at**, **csp.time_at** and **csp.item_at** can be used to retrieve historical input values. +Each node should call **csp.set_buffering_policy** to make sure that its inputs are configured to store sufficiently long history for correct implementation. +For example, let's assume that we have a stream of data and we want to create equally sized buckets from the data. +A possible implementation of such a node would be: + +```python +@csp.node +def data_bin_generator(bin_size: int, input: ts['T']) -> ts[['T']]: + with csp.start(): + assert bin_size > 0 + # This makes sure that input stores at least bin_size entries + csp.set_buffering_policy(input, tick_count=bin_size) + if csp.ticked(input) and (csp.num_ticks(input) % bin_size == 0): + return [csp.value_at(input, -i) for i in range(bin_size)] +``` + +In this example, we use **`csp.set_buffering_policy(input, tick_count=bin_size)`** to ensure that the buffer history contains at least **`bin_size`** elements. +Note that an input can be shared by multiple nodes, if multiple nodes provide size requirements, the buffer size would be resolved to the maximum size to support all requests. 
+
+Alternatively, **`csp.set_buffering_policy`** supports a **`timedelta`** parameter **`tick_history`** instead of **`tick_count`**.
+If **`tick_history`** is provided, the buffer will scale dynamically to ensure that any period of length **`tick_history`** will fit into the history buffer.
+
+To identify when there are enough samples to construct a bin, we use **`csp.num_ticks(input) % bin_size == 0`**.
+The function **`csp.num_ticks`** returns the total number of ticks for a given time series.
+NOTE: The actual size of the history buffer is usually less than **`csp.num_ticks`**, as the buffer is dynamically truncated to satisfy the set policy.
+
+The past values in this example are accessed using **`csp.value_at`**.
+The various historical access methods take the same arguments and return the value, time and tuple of `(time, value)` respectively:
+
+- **`csp.value_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns the **value** of the timeseries at the requested `index_or_time`
+- **`csp.time_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns the **datetime** of the timeseries at the requested `index_or_time`
+- **`csp.item_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns a tuple of `(datetime, value)` of the timeseries at the requested `index_or_time`
+  - **`ts`**: the name of the input
+  - **`index_or_time`**:
+    - If providing an **index**, this represents how many ticks back to retrieve **and should be \<= 0**.
+      0 indicates the current value, -1 is the previous value, etc.
+    - If providing a **time**, one can either provide a datetime for absolute time, or a timedelta for how far back to access.
+      **NOTE** that the timedelta must be negative to represent time in the past.
+  - **`duplicate_policy`**: when requesting history by datetime or timedelta, it's possible that there could be multiple values that match the given time. 
+    **`duplicate_policy`** can be provided to control the behavior of what to return in this case.
+    The default policy is to return the LAST_VALUE that exists at the given time.
+  - **`default`**: value to be returned if the requested time is out of the history bounds (if no default is provided and a request is out of bounds, an exception will be raised).
+
+The following demonstrates a possible way to compute a rolling sum over the past N ticks.
+Please note that this is for demonstration purposes only and is not efficient.
+A more efficient vectorized version is shown below, though even that would not be recommended for a rolling sum: `csp.stats.sum` would be more efficient still, with its in-line C++ calculation.
+
+```python
+@csp.node
+def rolling_sum(x: ts[float], tick_count: int) -> ts[float]:
+    with csp.start():
+        csp.set_buffering_policy(x, tick_count=tick_count)
+
+    if csp.ticked(x):
+        return sum(csp.value_at(x, -i) for i in range(min(csp.num_ticks(x), tick_count)))
+```
+
+## Historical Range Access
+
+In similar fashion, the methods **`csp.values_at`**, **`csp.times_at`** and **`csp.items_at`** can be used to retrieve a range of historical input values as numpy arrays.
+The `rolling_sum` example above can be accomplished more efficiently with range access:
+
+```python
+@csp.node
+def rolling_sum(x: ts[float], tick_count: int) -> ts[float]:
+    with csp.start():
+        csp.set_buffering_policy(x, tick_count=tick_count)

+    if csp.ticked(x):
+        return csp.values_at(x).sum()
+```
+
+The past values in this example are accessed using **`csp.values_at`**. 
+The various historical range access methods take the same arguments and return the values, times and a tuple of `(times, values)` respectively:
+
+- **`csp.values_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
+  returns values in the specified range as a numpy array
+- **`csp.times_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
+  returns times in the specified range as a numpy array
+- **`csp.items_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
+  returns a tuple of `(times, values)` numpy arrays
+  - **`ts`**: the name of the input
+  - **`start_index_or_time`**:
+    - If providing an **index**, this represents how many ticks back to retrieve **and should be \<= 0**.
+      0 indicates the current value, -1 is the previous value, etc.
+    - If providing a **time**, one can either provide a datetime for absolute time, or a timedelta for how far back to access.
+      **NOTE that the timedelta must be negative** to represent time in the past.
+    - If **None** is provided, the range will begin "from the beginning" - i.e., the oldest tick in the buffer.
+  - **`end_index_or_time`**: same as `start_index_or_time`
+    - If **None** is provided, the range will go "until the end" - i.e., the newest tick in the buffer.
+  - **`start_index_policy`**: only for use with datetime/timedelta as the start and end parameters.
+    - **`TimeIndexPolicy.INCLUSIVE`**: if there is a tick exactly at the requested time, include it
+    - **`TimeIndexPolicy.EXCLUSIVE`**: if there is a tick exactly at the requested time, exclude it
+    - **`TimeIndexPolicy.EXTRAPOLATE`**: if there is a tick at the beginning timestamp, include it. 
+      Otherwise, if there is a tick before the beginning timestamp, force a tick at the beginning timestamp with the prevailing value at the time.
+  - **`end_index_policy`**: only for use with datetime/timedelta as the start and end parameters.
+    - **`TimeIndexPolicy.INCLUSIVE`**: if there is a tick exactly at the requested time, include it
+    - **`TimeIndexPolicy.EXCLUSIVE`**: if there is a tick exactly at the requested time, exclude it
+    - **`TimeIndexPolicy.EXTRAPOLATE`**: if there is a tick at the end timestamp, include it.
+      Otherwise, if there is a tick before the end timestamp, force a tick at the end timestamp with the prevailing value at the time.
+
+Range access is optimized at the C++ layer; for this reason it's far more efficient than calling the single-value access methods in a loop, and it should be substituted in where possible.
+
+Below is a rolling average example to illustrate the use of timedelta indexing.
+Note that `timedelta(seconds=-n_seconds)` is equivalent to `csp.now() - timedelta(seconds=n_seconds)`, since datetime indexing is supported.
+
+```python
+# assumes: import numpy as np; from datetime import timedelta
+@csp.node
+def rolling_average(x: ts[float], n_seconds: int) -> ts[float]:
+    with csp.start():
+        assert n_seconds > 0
+        csp.set_buffering_policy(x, tick_history=timedelta(seconds=n_seconds))
+    if csp.ticked(x):
+        avg = np.mean(csp.values_at(x, timedelta(seconds=-n_seconds), timedelta(seconds=0),
+                                    csp.TimeIndexPolicy.INCLUSIVE, csp.TimeIndexPolicy.INCLUSIVE))
+        csp.output(avg)
+```
+
+When accessing all elements within the buffering policy window like
+this, it would be more succinct to pass None as the start and end time,
+but datetime/timedelta allows for more general use (e.g. 
rolling average +between 5 seconds and 1 second ago, or average specifically between +9:30:00 and 10:00:00) diff --git a/docs/wiki/98.-Building-From-Source.md b/docs/wiki/dev-guides/Build-CSP-from-Source.md similarity index 64% rename from docs/wiki/98.-Building-From-Source.md rename to docs/wiki/dev-guides/Build-CSP-from-Source.md index cc39d82be..6aeda4226 100644 --- a/docs/wiki/98.-Building-From-Source.md +++ b/docs/wiki/dev-guides/Build-CSP-from-Source.md @@ -1,6 +1,34 @@ -`csp` is written in Python and C++ with Python and C++ build dependencies. While prebuilt wheels are provided for end users, it is also straightforward to build `csp` from either the Python [source distribution](https://packaging.python.org/en/latest/specifications/source-distribution-format/) or the GitHub repository. - -As a convenience, `csp` uses a `Makefile` for commonly used commands. You can print the main available commands by running `make` with no arguments +CSP is written in Python and C++ with Python and C++ build dependencies. While prebuilt wheels are provided for end users, it is also straightforward to build CSP from either the Python [source distribution](https://packaging.python.org/en/latest/specifications/source-distribution-format/) or the GitHub repository. 
+ +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Make commands](#make-commands) +- [Prerequisites](#prerequisites) +- [Building with Conda on Linux](#building-with-conda-on-linux) + - [Install conda](#install-conda) + - [Clone](#clone) + - [Install build dependencies](#install-build-dependencies) + - [Build](#build) +- [Building with a system package manager](#building-with-a-system-package-manager) + - [Clone](#clone-1) + - [Install build dependencies](#install-build-dependencies-1) + - [Linux](#linux) + - [MacOS](#macos) + - [Install Python dependencies](#install-python-dependencies) + - [Build](#build-1) + - [Building on `aarch64` Linux](#building-on-aarch64-linux) +- [Lint and Autoformat](#lint-and-autoformat) +- [Testing](#testing) +- [Troubleshooting](#troubleshooting) + - [MacOS](#macos-1) + - [vcpkg install failed](#vcpkg-install-failed) + - [Building thrift:arm64-osx/thrift:x64-osx failed](#building-thriftarm64-osxthriftx64-osx-failed) + - [CMake was unable to find a build program corresponding to "Unix Makefiles".](#cmake-was-unable-to-find-a-build-program-corresponding-to-unix-makefiles) + +## Make commands + +As a convenience, CSP uses a `Makefile` for commonly used commands. You can print the main available commands by running `make` with no arguments ```bash > make @@ -13,27 +41,40 @@ lint run lints test run the tests ``` -# Prerequisites +## Prerequisites -`csp` has a few system-level dependencies which you can install from your machine package manager. Other package managers like `conda`, `nix`, etc, should also work fine. Currently, `csp` relies on the `GNU` compiler toolchain only. +CSP has a few system-level dependencies which you can install from your machine package manager. Other package managers like `conda`, `nix`, etc, should also work fine. Currently, CSP relies on the `GNU` compiler toolchain only. 
-# Building with Conda on Linux +## Building with Conda on Linux The easiest way to get started on a Linux machine is by installing the necessary dependencies in a self-contained conda environment. -Tweak this script to create a conda environment, install the build dependencies, build, and install a development version of `csp` into the environment. Note that we use [micromamba](https://mamba.readthedocs.io/en/latest/index.html) in this example, but [Anaconda](https://www.anaconda.com/download), [Miniconda](https://docs.anaconda.com/free/miniconda/index.html), [Miniforge](https://github.com/conda-forge/miniforge), etc, should all work fine. +Tweak this script to create a conda environment, install the build dependencies, build, and install a development version of CSP into the environment. -## Install Conda +### Install conda ```bash -# download and install micromamba for Linux/Mac -"${SHELL}" <(curl -L micro.mamba.pm/install.sh) +mkdir ~/github +cd ~/github + +# this downloads a Linux x86_64 build, change your architecture to match your development machine +# see https://conda-forge.org/miniforge/ for alternate download links + +wget https://github.com/conda-forge/miniforge/releases/download/23.3.1-1/Mambaforge-23.3.1-1-Linux-x86_64.sh +chmod 755 Mambaforge-23.3.1-1-Linux-x86_64.sh +./Mambaforge-23.3.1-1-Linux-x86_64.sh -b -f -u -p csp_venv + +. 
~/github/csp_venv/etc/profile.d/conda.sh -# on windows powershell -# Invoke-Expression ((Invoke-WebRequest -Uri https://micro.mamba.pm/install.ps1).Content) +# optionally, run this if you want to set up conda in your .bashrc +# conda init bash + +conda config --add channels conda-forge +conda config --set channel_priority strict +conda activate base ``` -## Clone +### Clone ```bash git clone https://github.com/Point72/csp.git @@ -41,7 +82,7 @@ cd csp git submodule update --init --recursive ``` -## Install build dependencies +### Install build dependencies ```bash # Note the operating system, change as needed @@ -50,22 +91,32 @@ micromamba create -n csp -f conda/dev-environment-unix.yml micromamba activate csp ``` -## Build +### Build ```bash -make build-conda +make build -# finally install into the csp conda environment +# on aarch64 linux, comment the above command and use this instead +# VCPKG_FORCE_SYSTEM_BINARIES=1 make build + +# finally install into the csp_venv conda environment make develop ``` -## A note about dependencies +If you didn’t do `conda init bash` you’ll need to re-add conda to your shell environment and activate the `csp` environment to use it: + +```bash +. ~/github/csp_venv/etc/profile.d/conda.sh +conda activate csp -In Conda, we pull our dependencies from the Conda environment by setting the environment variable `CSP_USE_VCPKG=0`. This will force the build to not pull dependencies from vcpkg. This may or may not work in other environments or with packages provided by other package managers or built from source, but there is too much variability for us to support alternative patterns. 
+# make sure everything works +cd ~/github/csp +make test +``` -# Building with a system package manager +## Building with a system package manager -## Clone +### Clone Clone the repo and submodules with: @@ -75,9 +126,9 @@ cd csp git submodule update --init --recursive ``` -## Install build dependencies +### Install build dependencies -### Linux +#### Linux **Debian/Ubuntu/etc** @@ -103,7 +154,7 @@ sudo make dependencies-fedora sudo dnf group install "Development Tools" ``` -### MacOS +#### MacOS **Homebrew** @@ -114,7 +165,7 @@ make dependencies-mac # brew install bison cmake flex make ninja ``` -## Install Python dependencies +### Install Python dependencies Python build and develop dependencies are specified in the `pyproject.toml`, but you can manually install them: @@ -129,16 +180,13 @@ make requirements Note that these dependencies would otherwise be installed normally as part of [PEP517](https://peps.python.org/pep-0517/) / [PEP518](https://peps.python.org/pep-0518/). -## Build +### Build Build the python project in the usual manner: ```bash make build -# on aarch64 linux, comment the above command and use this instead -# VCPKG_FORCE_SYSTEM_BINARIES=1 make build - # or # python setup.py build build_ext --inplace ``` @@ -151,20 +199,15 @@ On `aarch64` Linux the VCPKG_FORCE_SYSTEM_BINARIES environment variable must be VCPKG_FORCE_SYSTEM_BINARIES=1 make build ``` -## Using System Dependencies - -By default, we pull and build dependencies with [vcpkg](https://vcpkg.io/en/). We only support non-vendored dependencies via Conda (see [A note about dependencies](#A-note-about-dependencies) above). +## Lint and Autoformat -# Lint and Autoformat - -`csp` has listing and auto formatting. +CSP has listing and auto formatting. 
| Language | Linter | Autoformatter | Description | | :------- | :----- | :------------ | :---------- | | C++ | `clang-format` | `clang-format` | Style | | Python | `ruff` | `ruff` | Style | | Python | `isort` | `isort` | Imports | -| Markdown | `mdformat` / `codespell` | `mdformat` / `codespell` | Style/Spelling | **C++ Linting** @@ -188,7 +231,7 @@ make fix-cpp make lint-py # or # python -m isort --check csp/ setup.py -# python -m ruff check csp/ setup.py +# python -m ruff csp/ setup.py ``` **Python Autoformatting** @@ -200,27 +243,9 @@ make fix-py # python -m ruff format csp/ setup.py ``` -**Documentation Linting** - -```bash -make lint-docs -# or -# python -m mdformat --check docs/wiki/ README.md examples/README.md -# python -m codespell_lib docs/wiki/ README.md examples/README.md -``` - -**Documentation Autoformatting** +## Testing -```bash -make fix-docs -# or -# python -m mdformat docs/wiki/ README.md examples/README.md -# python -m codespell_lib --write docs/wiki/ README.md examples/README.md -``` - -# Testing - -`csp` has both Python and C++ tests. The bulk of the functionality is tested in Python, which can be run via `pytest`. First, install the Python development dependencies with +CSP has both Python and C++ tests. The bulk of the functionality is tested in Python, which can be run via `pytest`. First, install the Python development dependencies with ```bash make develop @@ -289,17 +314,17 @@ There are a few test flags available: - **`CSP_TEST_KAFKA`** - **`CSP_TEST_SKIP_EXAMPLES`**: skip tests of examples folder -# Troubleshooting +## Troubleshooting -## MacOS +### MacOS -### vcpkg install failed +#### vcpkg install failed Check the `vcpkg-manifest-install.log` files, and install the corresponding packages if needed. For example, you may need to `brew install pkg-config`. 
-### Building thrift:arm64-osx/thrift:x64-osx failed +#### Building thrift:arm64-osx/thrift:x64-osx failed ``` Thrift requires bison > 2.5, but the default `/usr/bin/bison` is version 2.3. @@ -311,7 +336,7 @@ On ARM: `export PATH="/opt/homebrew/opt/bison/bin:$PATH"` On Intel: `export PATH="/usr/local/opt/bison/bin:$PATH"` -### CMake was unable to find a build program corresponding to "Unix Makefiles". +#### CMake was unable to find a build program corresponding to "Unix Makefiles". Complete error message: diff --git a/docs/wiki/dev-guides/Contribute.md b/docs/wiki/dev-guides/Contribute.md new file mode 100644 index 000000000..7de8b8d33 --- /dev/null +++ b/docs/wiki/dev-guides/Contribute.md @@ -0,0 +1,9 @@ +Contributions are welcome on this project. We distribute under the terms of the [Apache 2.0 license](https://github.com/Point72/csp/blob/main/LICENSE). + +For **bug reports** or **small feature requests**, please open an issue on our [issues page](https://github.com/Point72/csp/issues). + +For **questions** or to discuss **larger changes or features**, please use our [discussions page](https://github.com/Point72/csp/discussions). + +For **contributions**, please see our [developer documentation](https://github.com/Point72/csp/wiki/99.-Developer). We have `help wanted` and `good first issue` tags on our issues page, so these are a great place to start. + +For **documentation updates**, make PRs that update the pages in `/docs/wiki`. The documentation is pushed to the GitHub wiki automatically through a GitHub workflow. Note that direct updates to this wiki will be overwritten. diff --git a/docs/wiki/dev-guides/GitHub-Conventions.md b/docs/wiki/dev-guides/GitHub-Conventions.md new file mode 100644 index 000000000..a60d64530 --- /dev/null +++ b/docs/wiki/dev-guides/GitHub-Conventions.md @@ -0,0 +1,73 @@ +## Triaging Issues + +The bug tracker is both a venue for users to communicate to the +developers about defects and a database of known defects. 
It's up to the
+maintainers to ensure issues are high quality.
+
+We have a number of labels that can be applied to sort issues into
+categories. If you notice newly created issues that are poorly labeled,
+consider adding labels that apply or removing ones that do not.
+
+The issue template encourages users to write bug reports that clearly
+describe the problem they are having and include steps to reproduce the
+issue. However, users sometimes ignore the template or are not used to
+GitHub and make mistakes in formatting or communication.
+
+If you are able to infer what they meant and are able to understand the
+issue, feel free to edit their issue description to fix formatting or
+correct issues with a script demonstrating the issue.
+
+If there is still not enough information or if the issue is unclear,
+request more information from the submitter. If they do not respond or
+do not clarify sufficiently, close the issue. Try to be polite and have
+empathy for inexperienced issue authors.
+
+## How to check out a PR locally
+
+This workflow is described in the [GitHub
+docs](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally).
+
+1. Identify the pull request ID. This is the number of the pull request
+   in the GitHub UI, which shows up in the URL for the pull request. For
+   example, https://github.com/Point72/csp/pull/98 has PR ID 98.
+
+1. Fetch the pull request ref and assign it to a local branch name.
+
+   ```bash
+   git fetch upstream pull/<PR ID>/head:LOCAL_BRANCH_NAME
+   ```
+
+   where `<PR ID>` is the PR ID number and `LOCAL_BRANCH_NAME` is a name
+   chosen for the PR branch in your local checkout of CSP.
+
+1. Switch to the PR branch
+
+   ```bash
+   git switch LOCAL_BRANCH_NAME
+   ```
+
+1. Rebuild CSP
+
+## Pushing Fixups to Pull Requests
+
+Sometimes pull requests don't quite make it across the finish line. 
In +cases where only a small fixup is required to make a PR mergeable and +the author of the pull request is unresponsive to requests, the best +course of action is often to push to the pull request directly to +resolve the issues. + +To do this, check out the pull request locally using the above +instructions. Then make the changes needed for the pull request and push +the local branch back to GitHub: + +```bash +git push upstream LOCAL_BRANCH_NAME +``` + +Where `LOCAL_BRANCH_NAME` is the name you gave to the PR branch when you +fetched it from GitHub. + +Note that if the user who created the pull request selected the option +to forbid pushes to their pull request, you will instead need to +recreate the pull request by pushing the PR branch to your fork and +making a pull request like normal. diff --git a/docs/wiki/dev-guides/Local-Development-Setup.md b/docs/wiki/dev-guides/Local-Development-Setup.md new file mode 100644 index 000000000..15de91ccf --- /dev/null +++ b/docs/wiki/dev-guides/Local-Development-Setup.md @@ -0,0 +1,87 @@ +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Step 1: Build CSP from Source](#step-1-build-csp-from-source) +- [Step 2: Configuring Git and GitHub for Development](#step-2-configuring-git-and-github-for-development) + - [Create your fork](#create-your-fork) + - [Configure remotes](#configure-remotes) + - [Authenticating with GitHub](#authenticating-with-github) + - [Configure commit signing](#configure-commit-signing) +- [Guidelines](#guidelines) + +## Step 1: Build CSP from Source + +To work on CSP, you are going to need to build it from source. See +[Build CSP from Source](Build-CSP-from-Source) for +detailed build instructions. + +Once you've built CSP from a `git` clone, you will also need to +configure `git` and your GitHub account for CSP development. + +## Step 2: Configuring Git and GitHub for Development + +### Create your fork + +The first step is to create a personal fork of CSP. 
To do so, click
+the "fork" button at https://github.com/Point72/csp, or just navigate
+[here](https://github.com/Point72/csp/fork) in your browser. Set the
+owner of the repository to your personal GitHub account if it is not
+already set that way and click "Create fork".
+
+### Configure remotes
+
+Next, you should set some names for the `git` remotes corresponding to
+the main Point72 repository and your fork. If you started with a clone
+of the main `Point72` repository, you could do something like:
+
+```bash
+cd csp
+git remote rename origin upstream
+
+# for SSH authentication
+git remote add origin git@github.com:<username>/csp.git
+
+# for HTTP authentication
+git remote add origin https://github.com/<username>/csp.git
+```
+
+### Authenticating with GitHub
+
+If you have not already configured `ssh` access to GitHub, you can find
+instructions to do so
+[here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh),
+including instructions to create an SSH key if you have not done
+so. Authenticating with SSH is usually the easiest route. If you are working in
+an environment that does not allow SSH connections to GitHub, you can look into
+[configuring a hardware
+passkey](https://docs.github.com/en/authentication/authenticating-with-a-passkey/about-passkeys)
+or adding a [personal access
+token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)
+to avoid the need to type in your password every time you push to your fork.
+
+### Configure commit signing
+
+Additionally, you will need to configure your local `git` setup and
+GitHub account to use [commit
+signing](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification). All
+commits to the `csp` repository must be signed to increase the
+difficulty of a supply-chain attack against the CSP codebase. 
The
+easiest way to do this is to [configure `git` to sign commits with your
+SSH
+key](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification). You
+can also use a [GPG
+key](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#gpg-commit-signature-verification)
+to sign commits.
+
+In either case, you must also add your public key to your GitHub account
+as a signing key. Note that if you have already added an SSH key as an
+authentication key, you will need to add it again [as a signing
+key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).
+
+## Guidelines
+
+After developing a change locally, ensure that both [lints](Build-CSP-from-Source#lint-and-autoformat) and [tests](Build-CSP-from-Source#testing) pass. Commits should be squashed into logical units, and all commits must be signed (e.g. with the `-s` git flag). CSP requires [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin) for all contributions.
+
+If your work is still in-progress, open a [draft pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests#draft-pull-requests). Otherwise, open a normal pull request. It might take a few days for a maintainer to review and provide feedback, so please be patient. If a maintainer asks for changes, please make said changes and squash your commits if necessary. If everything looks good to go, a maintainer will approve and merge your changes for inclusion in the next release.
+
+Please note that non-substantive changes, large changes without prior discussion, etc., are not accepted and pull requests may be closed. 
diff --git a/docs/wiki/99.-Developer.md b/docs/wiki/dev-guides/Release-Process.md similarity index 52% rename from docs/wiki/99.-Developer.md rename to docs/wiki/dev-guides/Release-Process.md index 1c37d8fc2..94866e573 100644 --- a/docs/wiki/99.-Developer.md +++ b/docs/wiki/dev-guides/Release-Process.md @@ -1,157 +1,16 @@ -# tl;dr - -After developing a change locally, ensure that both [lints](https://github.com/Point72/csp/wiki/98.-Building-From-Source#lint-and-autoformat) and [tests](https://github.com/Point72/csp/wiki/98.-Building-From-Source#testing) pass. Commits should be squashed into logical units, and all commits must be signed (e.g. with the `-s` git flag). `csp` requires [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin) for all contributions. - -If your work is still in-progress, open a [draft pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests#draft-pull-requests). Otherwise, open a normal pull request. It might take a few days for a maintainer to review and provide feedback, so please be patient. If a maintainer asks for changes, please make said changes and squash your commits if necessary. If everything looks good to go, a maintainer will approve and merge your changes for inclusion in the next release. - -**Please note that non substantive changes, large changes without prior discussion, etc, are not accepted and pull requests may be closed.** - -# Setting up a development environment - -To work on `csp`, you are going to need to build it from source. See -https://github.com/Point72/csp/wiki/98.-Building-From-Source for -detailed build instructions. - -Once you've built `csp` from a `git` clone, you will also need to -configure `git` and your GitHub account for `csp` development. 
- -## Configuring Git and GitHub for Development - -### Create your fork - -The first step is to create a personal fork of `csp`. To do so, click -the "fork" button at https://github.com/Point72/csp, or just navigate -[here](https://github.com/Point72/csp/fork) in your browser. Set the -owner of the repository to your personal GitHub account if it is not -already set that way and click "Create fork". - -### Configure remotes - -Next, you should set some names for the `git` remotes corresponding to -main Point72 repository and your fork. If you started with a clone of -the main `Point72` repository, you could do something like: - -```bash -cd csp -git remote rename origin upstream - -# for SSH authentication -git remote add origin git@github.com:/csp.git - -# for HTTP authentication -git remote add origin https://github.com//csp.git -``` - -### Authenticating with GitHub - -If you have not already configured `ssh` access to GitHub, you can find -instructions to do so -[here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh), -including instructions to create an SSH key if you have not done -so. Authenticating with SSH is usually the easiest route. If you are working in -an environment that does not allow SSH connections to GitHub, you can look into -[configuring a hardware -passkey](https://docs.github.com/en/authentication/authenticating-with-a-passkey/about-passkeys) -or adding a [personal access -token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) -to avoid the need to type in your password every time you push to your fork. - -### Configure commit signing - -Additionally, you will need to configure your local `git` setup and -GitHub account to use [commit -signing](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification). 
All -commits to the `csp` repository must be signed to increase the -difficulty of a supply-chain attack against the `csp` codebase. The -easiest way to do this is to [configure `git` to sign commits with your -SSH -key](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification). You -can also use a [GPG -key](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#gpg-commit-signature-verification) -to sign commits. - -In either case, you must also add your public key to your github account -as a signing key. Note that if you have already added an SSH key as an -authentication key, you will need to add it again [as a signing -key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account). - -# Github for maintainers - -## Triaging Issues - -The bug tracker is both a venue for users to communicate to the -developers about defects and a database of known defects. It's up to the -maintainers to ensure issues are high quality. - -We have a number of labels that can be applied to sort issues into -categories. If you notice newly created issues that are poorly labeled, -consider adding or removing some labels that do not apply to the issue. - -The issue template encourages users to write bug reports that clearly -describe the problem they are having and include steps to reproduce the -issue. However, users sometimes ignore the template or are not used to -GitHub and make mistakes in formatting or communication. - -If you are able to infer what they meant and are able to understand the -issue, feel free to edit their issue description to fix formatting or -correct issues with a script demonstrating the issue. - -If there is still not enough information or if the issue is unclear, -request more information from the submitter. 
If they do not respond or -do not clarify sufficiently, close the issue. Try to be polite and have -empathy for inexperienced issue authors. - -## How to check out a PR locally - -This workflow is described in the [GitHub -docs](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally). - -1. Identify the pull request ID. This is the number of the pull request - in the GitHub UI, which shows up in the URL for the pull request. For - example, https://github.com/Point72/csp/pull/98 has PR ID 98. - -1. Fetch the pull request ref and assign it to a local branch name. - - ```bash - git fetch upstream pull//HEAD/:LOCAL_BRANCH_NAME - ``` - - where `` is the PR ID number and `LOCAL_BRANCH_NAME` is a name - chosen for the PR branch in your local checkout of `csp`. - -1. Switch to the PR branch - - ```bash - git switch LOCAL_BRANCH_NAME - ``` - -1. Rebuild `csp` - -## Pushing Fixups to Pull Requests - -Sometimes pull requests don't quite make it across the finish line. In -cases where only a small fixup is required to make a PR mergeable and -the author of the pull request is unresponsive to requests, the best -course of action is often to push to the pull request directly to -resolve the issues. - -To do this, check out the pull request locally using the above -instructions. Then make the changes needed for the pull request and push -the local branch back to GitHub: - -```bash -git push upstream LOCAL_BRANCH_NAME -``` - -Where `LOCAL_BRANCH_NAME` is the name you gave to the PR branch when you -fetched it from GitHub. - -Note that if the user who created the pull request selected the option -to forbid pushes to their pull request, you will instead need to -recreate the pull request by pushing the PR branch to your fork and -making a pull request like normal. 
- -# Release Manual +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Doing a "normal" release](#doing-a-normal-release) + - [Choosing a version number](#choosing-a-version-number) + - [Preparing and tagging a release](#preparing-and-tagging-a-release) + - [Releasing to PyPI](#releasing-to-pypi) + - [A developer's first release](#a-developers-first-release) + - [Doing the release](#doing-the-release) + - [Download release artifacts from github actions](#download-release-artifacts-from-github-actions) + - [Optionally upload to testpypi to test "pip install"](#optionally-upload-to-testpypi-to-test-pip-install) + - [Upload to pypi](#upload-to-pypi) +- [Dealing with release mistakes](#dealing-with-release-mistakes) ## Doing a "normal" release @@ -176,7 +35,7 @@ different potential impact on users. This is the most common kind of release. A patch release should only include fixes for bugs or other changes that cannot impact code a user writes with the `csp` package. A user should be able to safely - upgrade `csp` from the previous version to a new patch release with + upgrade CSP from the previous version to a new patch release with no changes to the output of their code and no new errors being raised, except for fixed bugs. Whether or not a bug fix is sufficiently impactful to break backward compatibility is a @@ -215,6 +74,13 @@ Follow these steps when it's time to tag a new release. Before doing this, you will need to ensure `bump2version` is installed into your development environment. +> \[!NOTE\] +> The following steps assume you have a personal fork of csp. +> If you are working from the main `Point72/csp` repo, use `origin` +> instead of `upstream` in the git commands. Specifically, +> `git pull origin main --tags` in the step 1, +> and `git push origin main --follow-tags` in step 7. + 1. 
Ensure your local clone of `csp` is synced up with GitHub, including any tags that have been pushed since you last synced: @@ -283,18 +149,20 @@ actions running, one for the push to `main` and one for the new tag. You want to inspect the action running for the new tag. Once the run finishes, there should be a new release on the ["Releases" page](https://github.com/Point72/csp/releases). +If the release is in "Draft" state, click on the pencil icon to +"Edit" and publish it with the "Publish release" button. ### Releasing to PyPI #### A developer's first release If this is your first release, you will need an account on pypi.org and -your account will need to be added as a maintainer to the `csp` project -on pypi. You will also need to have two factor authentication enabled on +your account will need to be added as a maintainer to the CSP project +on PyPI. You will also need to have two factor authentication enabled on your PyPI account. Once that is set up, navigate to the API token page in your PyPI -settings and generate an API token scoped to the `csp` project. **Do not** +settings and generate an API token scoped to the CSP project. **Do not** navigate away from the page displaying the API token before the next step. @@ -321,7 +189,7 @@ content: #### Doing the release -##### Download release artifacts from github actions +#### Download release artifacts from github actions Make sure you are in the root of the `csp` repository and execute the following commands. @@ -347,7 +215,7 @@ twine check --strict dist/* This happens as part of the CI so this should only be a double-check. -##### Optionally upload to testpypi to test "pip install" +#### Optionally upload to testpypi to test "pip install" ``` twine upload --repository testpypi dist/* @@ -363,12 +231,12 @@ pip install --index-url https://test.pypi.org/simple --extra-index-url https://p Note that `extra-index-url` is necessary to ensure downloading dependencies succeeds. 
-##### Upload to pypi +#### Upload to pypi If you are sure the release is ready, you can upload to pypi like so: ```bash -twine upload --repository csp dist/*` +twine upload --repository csp dist/* ``` Note that this assumes you have a `.pypirc` set up as explained above. diff --git a/docs/wiki/dev-guides/Roadmap.md b/docs/wiki/dev-guides/Roadmap.md new file mode 100644 index 000000000..25bb34c4d --- /dev/null +++ b/docs/wiki/dev-guides/Roadmap.md @@ -0,0 +1,17 @@ +We do not have a formal roadmap, but we're happy to discuss features, improvements, new adapters, etc, in our [discussions area](https://github.com/Point72/csp/discussions). + +Here are some high level items we hope to accomplish in the next few months: + +- Support `msvc` compiler and full Windows support ([#109](https://github.com/Point72/csp/issues/109)) +- Establish a better pattern for adapters ([#165](https://github.com/Point72/csp/discussions/165)) +- Parallelization to improve runtime, for historical/offline distributions +- Support for cross-process communication in realtime distributions + +## Adapters and Extensions + +- C++-based HTTP/SSE adapter +- Add support for other graph viewers, including interactive / standalone / Jupyter + +## Other Open Source Projects + +- `csp-gateway`: Application development framework, built with [FastAPI](https://fastapi.tiangolo.com) and [Perspective](https://github.com/finos/perspective). This is a library we have built internally at Point72 on top of `csp` that we hope to open source later in 2024. It allows for easier construction of modular `csp` applications, along with a pluggable REST/WebSocket API and interactive UI. 
diff --git a/docs/wiki/get-started/First-Steps.md b/docs/wiki/get-started/First-Steps.md
new file mode 100644
index 000000000..eb9bc0ec8
--- /dev/null
+++ b/docs/wiki/get-started/First-Steps.md
@@ -0,0 +1,48 @@
+When writing CSP code, there will be runtime components in the form of `csp.node` methods, as well as graph-building components in the form of `csp.graph` components.
+
+It is important to understand that `csp.graph` components will only be executed once at application startup in order to construct the graph.
+Once the graph is constructed, `csp.graph` code is no longer needed.
+Once the graph is run, only inputs, `csp.node`s and outputs will be active as data flows through the graph, driven by input ticks.
+
+For example, this is a simple bit of graph code:
+
+```python
+import csp
+from csp import ts
+from datetime import datetime
+
+
+@csp.node
+def spread(bid: ts[float], ask: ts[float]) -> ts[float]:
+    if csp.valid(bid, ask):
+        return ask - bid
+
+
+@csp.graph
+def my_graph():
+    bid = csp.const(1.0)
+    ask = csp.const(2.0)
+    bid = csp.multiply(bid, csp.const(4))
+    ask = csp.multiply(ask, csp.const(3))
+    s = spread(bid, ask)
+
+    csp.print('spread', s)
+    csp.print('bid', bid)
+    csp.print('ask', ask)
+
+
+if __name__ == '__main__':
+    csp.run(my_graph, starttime=datetime.utcnow())
+```
+
+To help visualize this graph, you can call `csp.show_graph`:
+
+![359407708](https://github.com/Point72/csp/assets/3105306/8cc50ad4-68f9-4199-9695-11c136e3946c)
+
+The output would look like this (the timestamps will reflect the run's actual start time):
+
+```
+2020-04-02 15:33:38.256724 bid:4.0
+2020-04-02 15:33:38.256724 ask:6.0
+2020-04-02 15:33:38.256724 spread:2.0
+```
diff --git a/docs/wiki/get-started/Installation.md b/docs/wiki/get-started/Installation.md
new file mode 100644
index 000000000..f71043475
--- /dev/null
+++ b/docs/wiki/get-started/Installation.md
@@ -0,0 +1,20 @@
+## `pip`
+
+We ship binary wheels to install CSP on macOS and Linux via `pip`:
+
+```bash
+pip install csp
+```
+
+## `conda`
+
+CSP is
available on `conda` for Linux and Mac:
+
+```bash
+conda install csp -c conda-forge
+```
+
+## Source installation
+
+For other platforms, follow the instructions to [build CSP from
+source](Build-CSP-from-Source).
diff --git a/docs/wiki/how-tos/Add-Cycles-in-Graphs.md b/docs/wiki/how-tos/Add-Cycles-in-Graphs.md
new file mode 100644
index 000000000..d8fba4312
--- /dev/null
+++ b/docs/wiki/how-tos/Add-Cycles-in-Graphs.md
@@ -0,0 +1,52 @@
+By definition of the graph-building code, CSP graphs can only produce acyclic graphs.
+However, there are many occasions where a cycle may be required.
+For example, let's say you want part of your graph to simulate an exchange.
+That part of the graph would need to accept new orders and return acks and executions.
+However, the acks / executions would likely need to *feedback* into the same part of the graph that generated the orders.
+For this reason, the `csp.feedback` construct exists.
+Using `csp.feedback`, one can wire a feedback as an input to a node and bind the actual edge that feeds it later in the graph.
+Note that internally the graph is still acyclic.
+Internally, `csp.feedback` creates a pair of output and input adapters that are bound together.
+When a timeseries that is bound to a feedback ticks, it is fed to the feedback, which then schedules the tick on its bound input to be executed on the **next engine cycle**.
+The next engine cycle executes with the same engine time as the cycle that generated it, but as a separate evaluation pass.
+
+- **`csp.feedback(ts_type)`**: `ts_type` is the type of the timeseries (i.e. `int`, `str`).
+  This returns an instance of a feedback object.
+  - **`out()`**: this method returns the timeseries edge, which can be passed as an input to your node
+  - **`bind(ts)`**: this method is called to bind an edge as the source of the feedback after the fact
+
+A simple example should help demonstrate a possible usage.
+Let's say we want to simulate acking orders that are generated from a node called `my_algo`.
+In addition to generating the orders, `my_algo` also needs to receive the execution reports (this is demonstrated in example `e_13_feedback.py`).
+
+The graph code would look something like this:
+
+```python
+# Simulate acking an order
+@csp.node
+def my_exchange(order: ts[Order]) -> ts[ExecReport]:
+    ...  # impl details
+
+
+@csp.node
+def my_algo(exec_report: ts[ExecReport]) -> ts[Order]:
+    ...  # impl details
+
+
+@csp.graph
+def my_graph():
+    # create the feedback first so that we can refer to it later
+    exec_report_fb = csp.feedback(ExecReport)
+
+    # generate orders, passing feedback out() which isn't bound yet
+    orders = my_algo(exec_report_fb.out())
+
+    # get exec_reports from "simulator"
+    exec_report = my_exchange(orders)
+
+    # now bind the exec reports to the feedback, finishing the "loop"
+    exec_report_fb.bind(exec_report)
+```
+
+The graph would end up looking like this.
+It remains acyclic, but the `FeedbackOutputDef` is bound to the `FeedbackInputDef`; any tick to `out` will push the tick to `in` on the next cycle:
+
+![366521848](https://github.com/Point72/csp/assets/3105306/c4f920ff-49f9-4a52-8404-7c1989768da7)
diff --git a/docs/wiki/how-tos/Create-Dynamic-Baskets.md b/docs/wiki/how-tos/Create-Dynamic-Baskets.md
new file mode 100644
index 000000000..9457d68bf
--- /dev/null
+++ b/docs/wiki/how-tos/Create-Dynamic-Baskets.md
@@ -0,0 +1,58 @@
+CSP graphs are somewhat limiting in that they cannot change shape once the process starts up.
+CSP dynamic graphs address this issue by introducing a construct that allows applications to dynamically add / remove sub-graphs from a running graph.
+
+`csp.DynamicBasket`s are a prerequisite construct needed for dynamic graphs.
+`csp.DynamicBasket`s work just like regular static CSP baskets; however, dynamic baskets can change their shape over time.
+`csp.DynamicBasket`s can only be created from either CSP nodes or from `csp.dynamic` calls, as described below.
+A node can take a `csp.DynamicBasket` as an input or generate a dynamic basket as an output.
+Dynamic baskets are always dictionary-style baskets, where time series can be added by key.
+Note that timeseries can also be removed from dynamic baskets.
+
+## Syntax
+
+Dynamic baskets are denoted by the type `csp.DynamicBasket[key_type, ts_type]`, so for example `csp.DynamicBasket[str, int]` would be a dynamic basket that will have keys of type `str` and timeseries of type `int`.
+One can also use the non-Python shorthand `{ ts[str] : ts[int] }` to signify the same.
+
+## Generating dynamic basket output
+
+Nodes that generate dynamic basket output use the same interface as regular basket outputs.
+The difference is that if you output a key that hasn't been seen before, it will automatically be added to the dynamic basket.
+In order to remove a key from a dynamic basket output, you would use the `csp.remove_dynamic_key` method.
+**NOTE** that it is illegal to add and remove a key in the same cycle:
+
+```python
+@csp.node
+def dynamic_demultiplex_example(data: ts['T'], key: ts['K']) -> csp.DynamicBasket['K', 'T']:
+    if csp.ticked(data) and csp.valid(key):
+        csp.output({key: data})
+
+        ## To remove a key, which wouldn't be done in this example node:
+        ## csp.remove_dynamic_key(key)
+```
+
+To remove a key, one would use `csp.remove_dynamic_key`.
+For a single unnamed output, the method expects the key.
+For named outputs, the arguments would be `csp.remove_dynamic_key(output_name, key)`.
+
+## Consuming dynamic basket input
+
+Taking dynamic baskets as input is exactly the same as taking static baskets.
+There is one additional bit of information available on dynamic basket inputs, though, which is the `.shape` property.
+As keys are added or removed, the `basket.shape` property will tick with the change events.
+The `.shape` property behaves effectively as a `ts[csp.DynamicBasketEvents]`:
+
+```python
+@csp.node
+def consume_dynamic_basket(data: csp.DynamicBasket[str, int]):
+    if csp.ticked(data.shape):
+        for key in data.shape.added:
+            print(f'key {key} was added')
+        for key in data.shape.removed:
+            print(f'key {key} was removed')
+
+    if csp.ticked(data):
+        for key, value in data.tickeditems():
+            ...  # regular basket access here
+```
diff --git a/docs/wiki/how-tos/Profile-CSP-Code.md b/docs/wiki/how-tos/Profile-CSP-Code.md
new file mode 100644
index 000000000..15e7a5c30
--- /dev/null
+++ b/docs/wiki/how-tos/Profile-CSP-Code.md
@@ -0,0 +1,77 @@
+The `csp.profiler` library allows users to time cycle/node executions during a graph run. There are two available utilities: runtime profiling with `profiler.Profiler` and build-time graph analysis with `profiler.graph_info`.
+
+One can use these metrics to identify bottlenecks/inefficiencies in their graphs.
+
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Profiling a real-time csp.graph](#profiling-a-real-time-cspgraph)
+- [Saving raw profiling data to a file](#saving-raw-profiling-data-to-a-file)
+- [graph_info: build-time information](#graph_info-build-time-information)
+
+## Profiling a real-time `csp.graph`
+
+The `csp.profiler` library provides a GUI for profiling real-time CSP graphs.
+One can access this GUI by adding an `http_port` argument to the profiler call.
+
+```python
+with profiler.Profiler(http_port=8888) as p:
+    results = csp.run(graph, starttime=st, endtime=et)  # run the graph normally
+```
+
+This will open up the GUI on `localhost:8888` (as `http_port=8888`), which will display real-time node timing, cycle timing and memory snapshots.
+Profiling stats will be calculated whenever you refresh the page or call a GET request.
+Additionally, you can add the `format=json` argument (`localhost:8888?format=json`) to your request to receive the ProfilerInfo as a JSON object rather than the HTML display.
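The `format=json` request can be issued from ordinary Python; below is a hypothetical client-side sketch (the function name and return shape are illustrative, not part of csp), assuming a graph is currently running with `profiler.Profiler(http_port=8888)`:

```python
import json
import urllib.request


def fetch_profile(port: int = 8888) -> dict:
    """Request profiling stats as JSON from a graph running with
    profiler.Profiler(http_port=port)."""
    url = f"http://localhost:{port}?format=json"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())
```

This can be polled periodically from a monitoring process while the graph runs.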
+
+Users can add the `display_graphs=True` flag to include bar/pie charts of node execution times in the web UI.
+The matplotlib package is required to use the flag.
+
+```python
+with profiler.Profiler(http_port=8888, display_graphs=True) as p:
+    ...
+```
+
+new_profiler
+
+## Saving raw profiling data to a file
+
+Users can save individual node execution times and individual cycle execution times to a `.csv` file if they desire.
+This is useful if you want to apply your own analysis, e.g. calculating percentiles.
+To do this, simply add the flags `node_file=` or `cycle_file=`:
+
+```python
+with profiler.Profiler(cycle_file="cycle_data.csv", node_file="node_data.csv") as p:
+    ...
+```
+
+After the graph is run, the file `node_data.csv` contains:
+
+```
+Node Type,Execution Time
+count,1.9814e-05
+cast_int_to_float,1.2791e-05
+_time_window_updates,4.759e-06
+...
+```
+
+After the graph is run, the file `cycle_data.csv` contains:
+
+```
+Execution Time
+9.4757e-05
+4.5205e-05
+2.2873e-05
+...
+```
+
+## graph_info: build-time information
+
+Users can also extract build-time information about the graph without running it by calling `profiler.graph_info`.
+
+The code snippet below shows how to call `graph_info`.
+ +```python +from csp import profiler + +info = profiler.graph_info(graph) +``` diff --git a/docs/wiki/how-tos/Use-Statistical-Nodes.md b/docs/wiki/how-tos/Use-Statistical-Nodes.md new file mode 100644 index 000000000..a5e0f0084 --- /dev/null +++ b/docs/wiki/how-tos/Use-Statistical-Nodes.md @@ -0,0 +1,433 @@ +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Introduction](#introduction) +- [Working with a single-valued time series](#working-with-a-single-valued-time-series) +- [Working with a NumPy time series](#working-with-a-numpy-time-series) +- [Working with a basket of time series](#working-with-a-basket-of-time-series) +- [Cross-sectional statistics](#cross-sectional-statistics) +- [Expanding window statistics](#expanding-window-statistics) +- [Common user options](#common-user-options) + - [Intervals](#intervals) + - [Triggers, samplers and resets](#triggers-samplers-and-resets) + - [Data validity](#data-validity) + - [NaN handling](#nan-handling) + - [Weighted statistics](#weighted-statistics) +- [Numerical stability](#numerical-stability) + - [The `recalc` parameter](#the-recalc-parameter) + +## Introduction + +The `csp.stats` library provides rolling window calculations on time series data in CSP. +The goal of the library is to provide a uniform, robust interface for statistical calculations in CSP. +Each computation is a `csp.graph` which consists of one or more nodes that perform a given computation. +Users can treat these graphs as a "black box" with specified inputs and outputs as provided in the API reference. +Example statistics graphs for *mean* and *standard deviation* are provided below to give a rough idea of how the graphs work. 
+ +**Mean using a tick-specified interval** +![437686747](https://github.com/Point72/csp/assets/3105306/5586a355-e405-45c3-aa6d-c64754fd6c26) + +**Standard deviation using a tick-specified interval** +![437686748](https://github.com/Point72/csp/assets/3105306/8ae2ab7a-413d-4175-89d5-5b252401a83e) + +Rolling windows can either be specified by the number of ticks in the window or the time duration of the window. +Users can specify minimum window sizes for results as well as the minimum number of data points for a valid computation. +Standard NaN handling is provided with two different options. +Weighting is available for relevant stats functions such as sums, mean, covariance, and skew. + +## Working with a single-valued time series + +Time series of float and int types can be used for all stats functions, except those listed as "NumPy Specific". +Internally, all values are cast to float-type. +`NaN` values in the series (if applicable) are allowed and will be handled as specified by the `ignore_na` flag. + +If you are performing the same calculation on many different series, **it is highly recommended that you use a NumPy array.** +NumPy array inputs result in a much smaller CSP graph which can drastically improve performance. +If different series tick asynchronously, then sometimes using single-input calculations cannot be avoided. +However, you can consider sampling your data at regularly specified intervals, and then using the sampled values to create a NumPy array which is provided to the calculation. + +## Working with a NumPy time series + +All statistics functions work on both single-input time series and time series of NumPy arrays. +NumPy arrays provide the ability to perform the same calculation on many different elements within the same `csp.node`, and therefore drastically reduce the overall size of the CSP graph. +The performance benefits of using NumPy arrays for large-scale computations (i.e. 
thousands of symbols) are substantial: per benchmarking, computations can be orders of magnitude faster.
+To convert a list of individual series into a NumPy array, use the `csp.stats.list_to_numpy` conversion node.
+To convert back to a basket of series, use the `csp.stats.numpy_to_list` converter.
+
+All calculations on NumPy arrays are performed element-wise, with the exception of `cov_matrix` and `corr_matrix`, which are defined in the statistical sense.
+Arrays of arbitrary dimension are supported, as well as array views such as transposes and slices.
+The data type of arrays must be float, not int.
+If your data is integer valued, convert the array to a float type using NumPy's `astype` method.
+Basic mathematical operations (such as addition, multiplication, etc.) are defined on NumPy array time series using NumPy's built-in functions, which allow for proper broadcasting rules.
+
+## Working with a basket of time series
+
+There are two ways that users can run stats functions on a listbasket of time series.
+If the data in the time series ticks together (or *relatively* together), then users can convert their listbasket data into a NumPy array time series
+using the `list_to_numpy` node, run the calculations they want, and then convert back to a listbasket using the `numpy_to_list` node.
+Since NumPy arrays only require one node per computation, whereas a list of `N` time series will require `N` nodes, this method is highly efficient even for small graphs.
+Below is a diagram of the workflow for a listbasket with 2 elements.
+
+**A sum over a listbasket with 2 elements**
+![437687654](https://github.com/Point72/csp/assets/3105306/0e12b9ff-9461-497c-895d-3b1c33669235)
+
+If the data does not tick (or is sampled) at the same time, or the computations are fundamentally different in nature (i.e. different intervals), then the NumPy method will not provide the desired functionality.
+Instead, if users wish to store all their individual time series in a listbasket, then they must use single input stats with standard CSP listbasket syntax. +This method is significantly slower than using NumPy arrays, since the graphs must be much larger. +However, depending on your use case, this may be unavoidable. +If possible, it is highly recommended that you consider transformations to your data that allow it to be stored in NumPy arrays, such as sampling at given intervals. + +## Cross-sectional statistics + +The `stats` library also exposes an option to compute cross-sectional statistics. +Cross-sectional statistics are statistics which are computed using every value in the window at each iteration. +These computations are less efficient than rolling window functions that employ smart updating. +However, some computations may have to be applied cross-sectionally, and some users may want to apply cross-sectional statistics for small window calculations that require high numerical stability. + +To use cross-sectional statistics, use the `csp.stats.cross_sectional` utility to receive all data in the current window. +Then, use `csp.apply` to use your own function on the cross-sectional data. +The `cross_sectional` function allows for the same user options as standard stats functions (such as triggering and sampling). 
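Conceptually, a tick-interval cross-sectional window behaves like this pure-Python sketch (illustrative only — this is not csp code or the library's implementation):

```python
def cross_sectional_windows(values, interval, min_window):
    """Emit the rolling window of up to `interval` most-recent values,
    starting once at least `min_window` values have been seen."""
    out = []
    for i in range(len(values)):
        # window holds the last `interval` values seen so far
        window = values[max(0, i + 1 - interval): i + 1]
        if len(window) >= min_window:
            out.append(window)
    return out
```

With `values=[1, 2, 3, 4, 5]`, `interval=3` and `min_window=2`, this yields `[[1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]`.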
+An example of using `csp.stats.cross_sectional` is shown below:
+
+```python
+# Starttime: 2020-01-01 00:00:00
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3, '2020-01-04': 4, '2020-01-05': 5}
+cs = cross_sectional(x, interval=3, min_window=2)
+cs
+```
+
+```python
+{'2020-01-02': [1,2], '2020-01-03': [1,2,3], '2020-01-04': [2,3,4], '2020-01-05': [3,4,5]}
+```
+
+```python
+# Calculate a cross-sectional mean
+cs_mean = csp.apply(cs, lambda v: sum(v)/len(v), float)
+cs_mean
+```
+
+```python
+{'2020-01-02': 1.5, '2020-01-03': 2.0, '2020-01-04': 3.0, '2020-01-05': 4.0}
+```
+
+## Expanding window statistics
+
+An expanding window holds all ticks of its underlying time series - in other words, the window grows unbounded as you receive more data points.
+To use an expanding window, either don't specify an interval or set `interval=None`.
+An example of an expanding window sum is shown below:
+
+```python
+# Starttime: 2020-01-01 00:00:00
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3, '2020-01-04': 4, '2020-01-05': 5}
+sum(x)
+```
+
+```python
+{'2020-01-01': 1, '2020-01-02': 3, '2020-01-03': 6, '2020-01-04': 10, '2020-01-05': 15}
+```
+
+## Common user options
+
+### Intervals
+
+Intervals can be specified as a tick window or a time window.
+Tick windows are int arguments while time windows are timedelta arguments.
+For example,
+
+- `csp.stats.mean(x, interval=4)` will calculate a rolling mean over the last 4 ticks of data.
+- `csp.stats.mean(x, interval=timedelta(seconds=4))` will calculate a rolling mean over the last 4 seconds of data.
+
+Time intervals are inclusive at the right endpoint but **exclusive** at the left endpoint.
+For example, if `x` ticks every second with a value of `1` and I call `csp.stats.sum(x, timedelta(seconds=1))`, then my output will be `1` at all times.
+It will not be `2`, since the left endpoint value (which ticked *exactly* one second ago) is not included.
+
+Tick intervals include `NaN` values.
+For example, a tick interval of size `10` with `9` `NaN` values in the interval will only use the single non-NaN value for computations.
+For more information on `NaN` handling, see the "NaN handling" section.
+
+If no interval is specified, then the calculation will be treated as an expanding window statistic and all data will be cumulative (see the above section on Expanding Window Statistics).
+
+### Triggers, samplers and resets
+
+**Triggers** are optional arguments which *trigger* a computation of the statistic.
+If no trigger is provided as an argument, the statistic will be computed every time `x` ticks, i.e. `x` becomes the trigger.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+trigger = {'2020-01-02': True}
+
+sum(x, interval=2)
+```
+
+```python
+{'2020-01-02': 3, '2020-01-03': 5}
+```
+
+```python
+sum(x, interval=2, trigger=trigger)
+```
+
+```python
+# No result at day 3
+{'2020-01-02': 3}
+```
+
+**Samplers** are optional arguments which *sample* the data.
+Samplers are used to signify when the data, `x`, *should* tick.
+If no sampler is provided, the data is sampled whenever `x` ticks, i.e. `x` becomes the sampler.
+
+- If the sampler ticks and `x` does as well, then the tick is treated as valid data
+- If the sampler ticks but `x` does not, then the tick is treated as `NaN` data
+- If the sampler does not tick but `x` does, then the tick is ignored completely
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+sampler = {'2020-01-01': True, '2020-01-03': True}
+
+sum(x, interval=2)
+```
+
+```python
+{'2020-01-02': 3, '2020-01-03': 5}
+```
+
+```python
+sum(x, interval=2, sampler=sampler)
+```
+
+```python
+# Tick on day 2 is ignored
+{'2020-01-03': 4}
+```
+
+**Resets** are optional arguments which *reset* the interval, clearing all existing data.
+Whenever reset ticks, the data is cleared.
+If no reset is provided, then the data is never reset.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+reset = {'2020-01-02 12:00:00': True}
+
+sum(x, interval=2)
+```
+
+```python
+{'2020-01-02': 3, '2020-01-03': 5}
+```
+
+```python
+sum(x, interval=2, reset=reset)
+```
+
+```python
+# Data is reset after day 2
+{'2020-01-02': 3, '2020-01-03': 3}
+```
+
+**Important:** the order of operations between all three actions is as follows: reset, sample, trigger.
+If all three series were to tick at the same time: the data is first reset, then sampled, and then a computation is triggered.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+reset = {'2020-01-03': True}
+
+# Trigger = sampler = x. Reset, trigger and sampler therefore all tick at 2020-01-03
+
+sum(x, interval=2, reset=reset)
+```
+
+```python
+# The data is first reset, then 3 is sampled, and then the sum is computed
+{'2020-01-02': 3, '2020-01-03': 3}
+```
+
+### Data validity
+
+**Minimum window size** (`min_window`) is the smallest allowable window before returning a computation.
+If a time window interval is used, then `min_window` must also be a `timedelta`.
+If a tick interval is used, then `min_window` must also be an `int`.
+Minimum window is a startup condition: once the minimum window size is reached, it will never go away.
+For example, if you have a minimum window of 5 ticks with a 10 tick interval, then once 5 ticks of data have occurred, computations will always be returned when triggered.
+By *default*, the minimum window size is equal to the interval itself.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+sum(x, interval=2, min_window=1)
+```
+
+```python
+{'2020-01-01': 1, '2020-01-02': 3, '2020-01-03': 5}
+```
+
+```python
+sum(x, interval=timedelta(days=2), min_window=timedelta(days=1))
+```
+
+```python
+# Assuming graph start time is 2020-01-01
+{'2020-01-02': 3, '2020-01-03': 5}
+```
+
+**Minimum data points** (`min_data_points`) is the number of *valid* (non-NaN) data points that must exist in the current window for a valid computation.
+By default, `min_data_points` is 0.
+However, if you are dealing with frequently NaN data, you may want to ensure that stats computations provide meaningful results.
+If the interval has fewer than `min_data_points` valid values, the computation is considered too noisy and NaN is returned instead.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': nan, '2020-01-03': 3}
+
+sum(x, interval=2)
+```
+
+```python
+{'2020-01-02': 1, '2020-01-03': 3}
+```
+
+```python
+sum(x, interval=2, min_data_points=2)
+```
+
+```python
+# We only have 1 valid data point
+{'2020-01-02': nan, '2020-01-03': nan}
+```
+
+### NaN handling
+
+The stats library provides a uniform interface for NaN handling.
+Functions have an `ignore_na` parameter which is a bool argument (default value is `True`).
+
+- If `ignore_na=True`, then NaN values are "ignored" in the computation but still included in the interval
+- If `ignore_na=False`, then NaN values make the whole computation NaN ("poison" the interval) as long as they are present in the interval
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': nan, '2020-01-03': 3, '2020-01-04': 4}
+
+sum(x, interval=2, ignore_na=True)
+```
+
+```python
+{'2020-01-02': 1, '2020-01-03': 3, '2020-01-04': 7}
+```
+
+```python
+sum(x, interval=2, ignore_na=False)
+```
+
+```python
+# NaN at t=2 is only out of the interval by t=4
+{'2020-01-02': nan, '2020-01-03': nan, '2020-01-04': 7}
+```
+
+For exponential moving calculations, **EMA NaN handling** is slightly different.
+If `ignore_na=True`, then NaN values are completely discarded.
+If `ignore_na=False`, then NaN values do not poison the interval, but rather count as a tick with no data.
+This affects the reweighting of past data points when the next tick with valid data is added.
+For a detailed explanation, see the EMA section.
+
+### Weighted statistics
+
+**Weights** is an optional time-series which gives a relative weight to each data point.
+Weighted statistics are available for: *sum(), mean(), var(), cov(), stddev(), sem(), corr(), skew(), kurt(), cov_matrix()* and *corr_matrix()*.
+Since weights are relative, they do not need to be normalized by the user.
+Weights do not necessarily need to tick at the same time as the data: the weights are *sampled* whenever the data sampler ticks.
+For higher-order statistics such as variance, covariance, correlation, standard deviation, standard error, skewness and kurtosis, weights are interpreted as *frequency weights*.
+This means that a weight of 1 corresponds to that observation occurring once and a weight of 2 signifies that observation occurring twice.
+
+If either the data *or* its corresponding weight is NaN, then the weighted data point is collectively treated as NaN.
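To make the *frequency weights* interpretation concrete, here is a plain-Python sketch (not csp code) showing that an integer weight of 2 behaves exactly as if the observation appeared twice in the window:

```python
# Plain-Python illustration of frequency weights (not part of csp.stats):
# a weight of 2 on an observation is equivalent to that observation
# appearing twice in the window.
data = [1.0, 2.0, 3.0]
weights = [1, 2, 1]

# Expand each observation according to its integer weight
expanded = [v for v, w in zip(data, weights) for _ in range(w)]  # [1.0, 2.0, 2.0, 3.0]

weighted_mean = sum(v * w for v, w in zip(data, weights)) / sum(weights)
assert weighted_mean == sum(expanded) / len(expanded)  # both are 2.0
```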
+ +```python +# Single valued time series + +x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3, '2020-01-04': 4} +weights = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-04': 1} + +sum(x, interval=2, weights=weights) +``` + +```python +# Weight of 2 applied to x=3, as it is sampled +{'2020-01-02': 5, '2020-01-03': 10, '2020-01-04': 10} +``` + +```python +mean(x, interval=2, weights=weights) +``` + +```python +# Weighted mean +{'2020-01-02': 1.667, '2020-01-03': 2.5, '2020-01-04': 3.333} +``` + +If the time-series is of type `float`, then the weights series is also of type `float`. +If the time-series is of type `np.ndarray`, then the weights series is sometimes of type `np.ndarray` and sometimes of type `float`. +For element-wise statistics *sum(), mean(), var(), stddev(), sem(), skew(), kurt()* the weights are element-wise as well. +For *cov_matrix()* and *corr_matrix(),* the weights are of type float since they apply to the data vector collectively. +Consult the individual function references for more details. + +```python +# NumPy applied element-wise + +x = {'2020-01-01': [1,1], '2020-01-02': [2,2], '2020-01-03': [3,3]} +weights = {'2020-01-01': [1,2], '2020-01-02': [2,1], '2020-01-03': [1,1]} + +sum(x, interval=2, weights=weights) + +``` + +```python +{'2020-01-02': [5,4], '2020-01-03': [7,5]} +``` + +```python +mean(x, interval=2, weights=weights) +``` + +```python +# Weighted mean +{'2020-01-02': [1.667, 1.333], '2020-01-03': [2.333, 2.5]} +``` + +## Numerical stability + +Stats functions are not guaranteed to be numerically stable due to the nature of a rolling window calculation. +These functions implement online algorithms which have increased risk of floating point precision errors, especially when the data is ill-conditioned. +**Users are recommended to apply their own data cleaning** before calling these functions. +Data cleaning may include clipping large, erroneous values to be NaN or normalizing data based on historical ranges. 
+Cleaning can be implemented using the `csp.apply` node (see the baselib documentation) with your cleaning pipeline expressed within a callable object (function).
+If numerical stability is paramount, then cross-sectional calculations can be used at the cost of efficiency (see the section above on Cross-Sectional Statistics).
+
+Where possible, `csp.stats` algorithms are chosen to maximize stability while maintaining their online efficiency.
+For example, rolling variance is calculated using Welford's online algorithm and rolling sums are calculated using Kahan's algorithm if `precise=True` is set.
+Floating-point error can still accumulate when the functions are used on large data streams, especially if the interval used is small in comparison to the quantity of data.
+Each stats method that is prone to floating-point error exposes a **recalc parameter** which is an optional time-series argument to trigger a clean recalculation of the statistic.
+The recalculation clears any accumulated floating-point error up to that point.
+
+### The `recalc` parameter
+
+The `recalc` parameter is an optional time-series argument designed to stop unbounded floating-point error accumulation in rolling `csp.stats` functions.
+When `recalc` ticks, the next calculation of the desired statistic will be computed with all data in the window.
+This clears any accumulated error from prior intervals.
+The parameter is meant to be used heuristically for use cases involving large data streams and small interval sizes, which cause values to be continuously added to and removed from the window.
+Periodically triggering a recalculation will limit the floating-point error accumulation caused by these updates; for example, a user could set `recalc` to tick every 100 intervals of their data.
+The cost of triggering a recalculation is efficiency: since all data in the window must be processed, it is not as fast as doing the calculation in the standard online fashion.
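To see why such error accumulates, the following plain-Python sketch (independent of csp) contrasts an online rolling sum, which adds incoming and subtracts outgoing values, with a full recomputation over the window — the kind of recomputation that `recalc` triggers:

```python
# Plain-Python sketch of rolling-sum drift (not csp code).
# Online update: sum += incoming - outgoing. Each update can introduce
# floating-point rounding error that a full recomputation avoids.
stream = [0.1, 0.2, 0.0, 0.0]
interval = 2

online = 0.0
window = []
for v in stream:
    window.append(v)
    online += v
    if len(window) > interval:
        online -= window.pop(0)

exact = sum(window)          # full recomputation over the current window
assert exact == 0.0          # the window is now [0.0, 0.0]
assert online != 0.0         # residual error from the earlier adds/subtracts
assert abs(online) < 1e-15   # tiny, but nonzero
```

Replacing `online` with `exact` periodically is exactly the effect of ticking `recalc`: the accumulated rounding error is discarded at the cost of re-reading the whole window.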
+
+A basic example using the `recalc` parameter is provided below.
+
+```python
+x = {'2020-01-01': 0.1, '2020-01-02': 0.2, '2020-01-03': 0, '2020-01-04': 0}
+sum(x, interval=2)
+```
+
+```python
+# floating-point error has caused the sum to not perfectly go to zero
+{'2020-01-02': 0.3, '2020-01-03': 0.19999999, '2020-01-04': -0.00000001}
+```
+
+```python
+recalc = {'2020-01-04': True}
+sum(x, interval=2, recalc=recalc)
+```
+
+```python
+# at day 4, a clean recalculation clears the floating-point error from the previous data
+{'2020-01-02': 0.3, '2020-01-03': 0.19999999, '2020-01-04': 0}
+```
diff --git a/docs/wiki/how-tos/Write-Historical-Input-Adapters.md b/docs/wiki/how-tos/Write-Historical-Input-Adapters.md
new file mode 100644
index 000000000..ac5435c59
--- /dev/null
+++ b/docs/wiki/how-tos/Write-Historical-Input-Adapters.md
@@ -0,0 +1,415 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Introduction](#introduction)
+- [Types of Historical Adapters](#types-of-historical-adapters)
+- [PullInputAdapter](#pullinputadapter)
+  - [PullInputAdapter - Python](#pullinputadapter---python)
+  - [PullInputAdapter - C++](#pullinputadapter---c)
+- [AdapterManager and ManagedSimInputAdapter - Python](#adaptermanager-and-managedsiminputadapter---python)
+  - [AdapterManager - **--graph-- time**](#adaptermanager-----graph---time)
+  - [AdapterManager - **--impl-- runtime**](#adaptermanager-----impl---runtime)
+  - [ManagedSimInputAdapter - **--impl-- runtime**](#managedsiminputadapter-----impl---runtime)
+  - [ManagedSimInputAdapter - **--graph-- time**](#managedsiminputadapter-----graph---time)
+  - [Example - CSVReader](#example---csvreader)
+
+## Introduction
+
+There are two main categories of input adapters: historical and realtime.
+
+When writing historical adapters you will need to implement a "pull" adapter, which pulls data from a historical data source in time order, one event at a time.
+
+There are also ManagedSimInputAdapters for feeding multiple "managed" pull adapters from a single source (more on that below).
+
+When writing input adapters it is also very important to understand the difference between "graph building time" and "runtime" versions of your adapter.
+For example, `csp.adapters.csv` has a `CSVReader` class that is used at graph building time.
+
+**Graph build time components** solely *describe* the adapter.
+They are meant to do little else than keep track of the type of adapter and its parameters, which will then be used to construct the actual adapter implementation when the engine is constructed from the graph description.
+It is the runtime implementation that actually runs during the engine execution phase to process data.
+
+For clarity of this distinction, in the descriptions below we will denote graph build time components with *--graph--* and runtime implementations with *--impl--*.
+
+## Types of Historical Adapters
+
+There are two flavors of historical input adapters that can be written.
+The simplest one is a PullInputAdapter.
+A PullInputAdapter can be used to convert a single source into a single timeseries.
+The `csp.curve` implementation is a good example of this.
+Single source to single timeseries adapters are of limited use, however; the more typical use case is for AdapterManager based input adapters to service multiple InputAdapters from a single source.
+For this one would use an AdapterManager to coordinate processing of the data source, and ManagedSimInputAdapter as the individual timeseries providers.
+
+## PullInputAdapter
+
+### PullInputAdapter - Python
+
+To write a Python based `PullInputAdapter` one must write a class that derives from `csp.impl.pulladapter.PullInputAdapter`.
+The derived type should define two methods:
+
+- `def start(self, start_time, end_time)`: this will be called at the start of the engine with the start/end times of the engine.
+  `start_time` and `end_time` will be tz-unaware datetime objects in UTC time.
+  At this point the adapter should open its resource and seek to the requested starttime.
+- `def next(self)`: this method will be repeatedly called by the engine.
+  The adapter should return the next event as a `(time, value)` tuple.
+  If there are no more events, then the method should return `None`.
+
+The `PullInputAdapter` that you define will be used as the runtime *--impl--*.
+You also need to define a *--graph--* time representation of the time series edge.
+In order to do this you should define a `csp.impl.wiring.py_pull_adapter_def`.
+The `py_pull_adapter_def` creates a *--graph--* time representation of your adapter:
+
+```python
+def py_pull_adapter_def(name, adapterimpl, out_type, **kwargs)
+```
+
+- **`name`**: string name for the adapter
+- **`adapterimpl`**: a derived implementation of `csp.impl.pulladapter.PullInputAdapter`
+- **`out_type`**: the type of the output, should be a `ts[]` type. Note this can use tvar types if a subsequent argument defines the tvar
+- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the `PullInputAdapter` implementation
+
+Note that the \*\*kwargs passed to `py_pull_adapter_def` should be the names and types of the variables, like `arg1=type1, arg2=type2`.
+These are the names of the kwargs that the returned input adapter will take and pass through to the `PullInputAdapter` implementation, and the types expected for the values of those args.
+
+`csp.curve` is a good simple example of this:
+
+```python
+import copy
+from csp.impl.pulladapter import PullInputAdapter
+from csp.impl.wiring import py_pull_adapter_def
+from csp import ts
+from datetime import timedelta
+
+
+class Curve(PullInputAdapter):
+    def __init__(self, typ, data):
+        ''' data should be a list of tuples of (datetime, value) or (timedelta, value)'''
+        self._data = data
+        self._index = 0
+        super().__init__()
+
+    def start(self, start_time, end_time):
+        if isinstance(self._data[0][0], timedelta):
+            self._data = copy.copy(self._data)
+            for idx, data in enumerate(self._data):
+                self._data[idx] = (start_time + data[0], data[1])
+
+        while self._index < len(self._data) and self._data[self._index][0] < start_time:
+            self._index += 1
+
+        super().start(start_time, end_time)
+
+    def next(self):
+        if self._index < len(self._data):
+            time, value = self._data[self._index]
+            if time <= self._end_time:
+                self._index += 1
+                return time, value
+        return None
+
+
+curve = py_pull_adapter_def('curve', Curve, ts['T'], typ='T', data=list)
+```
+
+Now `curve` can be called in graph code to create a curve input adapter:
+
+```python
+x = csp.curve(int, [ (t1, v1), (t2, v2), .. ])
+csp.print('x', x)
+```
+
+See example [e_14_user_adapters_01_pullinput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_01_pullinput.py) for more details.
+
+### PullInputAdapter - C++
+
+**Step 1)** `PullInputAdapter` impl
+
+Similar to the Python `PullInputAdapter` API is the C++ API, which one can leverage to improve the performance of an adapter implementation.
+The *--impl--* is very similar to the Python pull adapter.
+One should derive from `PullInputAdapter`, a templatized base class (templatized on the type of the timeseries) and define these methods:
+
+- **`start(DateTime start, DateTime end)`**: similar to the Python API start, called when the engine starts.
+  Open the resource and seek to the start time here
+- **`stop()`**: called on engine shutdown, clean up the resource here
+- **`bool next(DateTime & t, T & value)`**: if there is data to provide, sets the next time and value for the adapter and returns true.
+  Otherwise, returns false
+
+**Step 2)** Expose creator func to python
+
+Now that we have a C++ impl defined, we need to expose a python creator for it.
+Define a method that conforms to the signature
+
+```cpp
+csp::InputAdapter * create_my_adapter(
+    csp::AdapterManager * manager,
+    PyEngine * pyengine,
+    PyTypeObject * pyType,
+    PushMode pushMode,
+    PyObject * args)
+```
+
+- **`manager`**: will be nullptr for pull adapters
+- **`pyengine`**: PyEngine engine wrapper object
+- **`pyType`**: this is the type of the timeseries input adapter to be created as a `PyTypeObject`.
+  One can switch on this type using `switchPyType` to create the properly typed instance
+- **`pushMode`**: the CSP PushMode for the adapter (pass through to the base InputAdapter)
+- **`args`**: arguments to pass to the adapter impl
+
+Then simply register the creator method:
+
+**`REGISTER_INPUT_ADAPTER(_my_adapter, create_my_adapter)`**
+
+This will register the method onto your python module, to be accessed as `module.methodname` (here, `_my_adapter`).
+Note this uses `csp/python/InitHelpers` which is used in the `_cspimpl` module.
+To do this in a separate python module, you need to register `InitHelpers` in that module.
+
+**Step 3)** Define your *--graph--* time adapter
+
+It is now a one-liner to wrap your impl in a graph time construct using `csp.impl.wiring.input_adapter_def`:
+
+```python
+my_adapter = input_adapter_def('my_adapter', my_module._my_adapter, ts[int], arg1=int, arg2={str:'foo'})
+```
+
+`my_adapter` can now be called with `arg1, arg2` to create adapters in your graph.
+Note that the arguments are typed using `v=t` syntax. `v=(t,default)` is used to define arguments with defaults.
+
+Also note that all input adapters implicitly get a push_mode argument that is defaulted to `csp.PushMode.LAST_VALUE`.
+
+## AdapterManager and ManagedSimInputAdapter - Python
+
+In most cases you will likely want to expose a single source of data into multiple input adapters.
+For this use case your adapter should define an AdapterManager *--graph--* time component, and an AdapterManagerImpl *--impl--* runtime component.
+The AdapterManager *--graph--* time component just represents the parameters needed to create the *--impl--* AdapterManager.
+It's the *--impl--* that will have the actual implementation that will open the data source, parse the data and provide it to individual adapters.
+
+Similarly you will need to define a derived ManagedSimInputAdapter *--impl--* component to handle events directed at an individual time series adapter.
+
+**NOTE**: It is highly recommended not to open any resources in the *--graph--* time component.
+Graph time components can be pruned and/or memoized into a single instance, so opening resources at graph time shouldn't be necessary.
+
+### AdapterManager - **--graph-- time**
+
+The graph time AdapterManager doesn't need to derive from any interface.
+It should be initialized with any information the impl needs in order to open/process the data source (e.g. CSV file, time column, DB connection information, etc.).
+It should also have an API to create individual timeseries adapters.
+These adapters will then get passed the adapter manager *--impl--* as an argument when they are created, so that they can register themselves for processing.
+The AdapterManager also needs to define a **\_create** method.
+**\_create** is the bridge between the *--graph--* time AdapterManager representation and the runtime *--impl--* object.
+**\_create** will be called on the *--graph--* time AdapterManager which will in turn create the *--impl--* instance.
+\_create will get two arguments: engine (this represents the runtime engine object that will run the graph) and a memo dict which can optionally be used for any memoization that one might want.
+
+Let's take a look at [`CSVReader`](https://github.com/Point72/csp/blob/main/csp/adapters/csv.py) as an example:
+
+```python
+# GRAPH TIME
+class CSVReader:
+    def __init__(self, filename, time_converter, delimiter=',', symbol_column=None):
+        self._filename = filename
+        self._symbol_column = symbol_column
+        self._delimiter = delimiter
+        self._time_converter = time_converter
+
+    def subscribe(self, symbol, typ, field_map=None):
+        return CSVReadAdapter(self, symbol, typ, field_map)
+
+    def _create(self, engine, memo):
+        return CSVReaderImpl(engine, self)
+```
+
+- **`__init__`**: as you can see, all `__init__` does is keep the parameters that the impl will need.
+- **`subscribe`**: API to create an individual timeseries / edge from this file for the given symbol.
+  `typ` denotes the type of the timeseries to create (i.e. `ts[int]`) and `field_map` is used for mapping columns onto `csp.Struct` types.
+  Note that subscribe returns a `CSVReadAdapter` instance.
+  `CSVReadAdapter` is the *--graph--* time representation of the edge (similar to how we defined `csp.curve` above).
+  We pass it `self` as its first argument, which will be used to create the AdapterManager *--impl--*.
+- **`_create`**: the method to create the *--impl--* object from the given *--graph--* time representation of the manager.
+
+The `CSVReader` would then be used in graph building code like so:
+
+```python
+reader = CSVReader('my_data.csv', time_formatter, symbol_column='SYMBOL', delimiter='|')
+# aapl will represent a ts[PriceQuantity] edge that will tick with rows from
+# the csv file matching on SYMBOL column AAPL
+aapl = reader.subscribe('AAPL', PriceQuantity)
+```
+
+### AdapterManager - **--impl-- runtime**
+
+The AdapterManager *--impl--* is responsible for opening the data source, parsing and processing all the data, and managing all the adapters it needs to feed.
+The impl class should derive from `csp.impl.adaptermanager.AdapterManagerImpl` and implement the following methods:
+
+- **`start(self, starttime, endtime)`**: this is called when the engine starts up.
+  At this point the impl should open the resource providing the data and seek to starttime.
+  starttime/endtime will be tz-unaware datetime objects in UTC time.
+- **`stop(self)`**: this is called at the end of the run; resources should be cleaned up at this point.
+- **`process_next_sim_timeslice(self, now)`**: this method will be called multiple times through the run.
+  The initial call will provide now with starttime.
+  The impl's responsibility is to process all data at the given timestamp (more on how to do this below).
+  The method should return the next time in the data source, or None if there is no more data to process.
+  The method will be called again with the provided timestamp as "now" in the next iteration.
+  **NOTE** that process_next_sim_timeslice is required to move ahead in time.
+  In most cases the resource data can be supplied in time order; if not, it would have to be sorted up front.
+
+`process_next_sim_timeslice` should parse the data for a given time/row and then push it through to any registered `ManagedSimInputAdapter` that matches on the given row.
+
+### ManagedSimInputAdapter - **--impl-- runtime**
+
+Users will need to define `ManagedSimInputAdapter` derived types to represent the individual timeseries adapter *--impl--* objects.
+Objects should derive from `csp.impl.adaptermanager.ManagedSimInputAdapter`.
+
+`ManagedSimInputAdapter.__init__` takes two arguments:
+
+- **`typ`**: this is the type of the timeseries, i.e. `int` for a `ts[int]`
+- **`field_map`**: optional, a dictionary used to map source column names → `csp.Struct` field names.
+
+`ManagedSimInputAdapter` defines a method `push_tick()` which takes the value to feed the input for the given timeslice (as defined by "now" at the adapter manager level).
+There is also a convenience method called `process_dict()` which will take a dictionary of `{column: value}` entries and convert it properly into the right value based on the given **field_map**.
+
+### ManagedSimInputAdapter - **--graph-- time**
+
+As with the `csp.curve` example, we need to define a graph-time construct that represents a `ManagedSimInputAdapter` edge.
+In order to define this we use `py_managed_adapter_def`.
+`py_managed_adapter_def` is AdapterManager-"aware" and will properly create the AdapterManager *--impl--* the first time it's encountered.
+It will then pass the manager impl as an argument to the `ManagedSimInputAdapter`.
+
+```python
+def py_managed_adapter_def(name, adapterimpl, out_type, manager_type, **kwargs):
+"""
+Create a graph representation of a python managed sim input adapter.
+:param name: string name for the adapter
+:param adapterimpl: a derived implementation of csp.impl.adaptermanager.ManagedSimInputAdapter
+:param out_type: the type of the output, should be a ts[] type.
Note this can use tvar types if a subsequent argument defines the tvar
+:param manager_type: the type of the graph time representation of the AdapterManager that will manage this adapter
+:param kwargs: **kwargs will be passed through as arguments to the ManagedSimInputAdapter implementation
+the first argument to the implementation will be the adapter manager impl instance
+"""
+```
+
+### Example - CSVReader
+
+Putting this all together, let's take a look at a `CSVReader` implementation
+and step through what's going on:
+
+```python
+import csv as pycsv
+from datetime import datetime
+
+from csp import ts
+from csp.impl.adaptermanager import AdapterManagerImpl, ManagedSimInputAdapter
+from csp.impl.wiring import py_managed_adapter_def
+
+# GRAPH TIME
+class CSVReader:
+    def __init__(self, filename, time_converter, delimiter=',', symbol_column=None):
+        self._filename = filename
+        self._symbol_column = symbol_column
+        self._delimiter = delimiter
+        self._time_converter = time_converter
+
+    def subscribe(self, symbol, typ, field_map=None):
+        return CSVReadAdapter(self, symbol, typ, field_map)
+
+    def _create(self, engine, memo):
+        return CSVReaderImpl(engine, self)
+```
+
+Here we define CSVReader, our AdapterManager *--graph--* time representation.
+It holds the parameters that will be used for the impl, it implements a `subscribe()` call for users to create timeseries, and defines a `_create` method to create a runtime *--impl--* instance from the graph-time representation.
+Note how on line 17 we pass `self` to the `CSVReadAdapter`; this is what binds the input adapter to this AdapterManager.
+
+```python
+# RUN TIME
+class CSVReaderImpl(AdapterManagerImpl):                      # 1
+    def __init__(self, engine, adapterRep):                   # 2
+        super().__init__(engine)                              # 3
+                                                              # 4
+        self._rep = adapterRep                                # 5
+        self._inputs = {}                                     # 6
+        self._csv_reader = None                               # 7
+        self._next_row = None                                 # 8
+                                                              # 9
+    def start(self, starttime, endtime):                      # 10
+        self._csv_reader = pycsv.DictReader(                  # 11
+            open(self._rep._filename, 'r'),                   # 12
+            delimiter=self._rep._delimiter                    # 13
+        )                                                     # 14
+        self._next_row = None                                 # 15
+                                                              # 16
+        for row in self._csv_reader:                          # 17
+            time = self._rep._time_converter(row)             # 18
+            self._next_row = row                              # 19
+            if time >= starttime:                             # 20
+                break                                         # 21
+                                                              # 22
+    def stop(self):                                           # 23
+        self._csv_reader = None                               # 24
+                                                              # 25
+    def register_input_adapter(self, symbol, adapter):        # 26
+        if symbol not in self._inputs:                        # 27
+            self._inputs[symbol] = []                         # 28
+        self._inputs[symbol].append(adapter)                  # 29
+                                                              # 30
+    def process_next_sim_timeslice(self, now):                # 31
+        if not self._next_row:                                # 32
+            return None                                       # 33
+                                                              # 34
+        while True:                                           # 35
+            time = self._rep._time_converter(self._next_row)  # 36
+            if time > now:                                    # 37
+                return time                                   # 38
+            self.process_row(self._next_row)                  # 39
+            try:                                              # 40
+                self._next_row = next(self._csv_reader)       # 41
+            except StopIteration:                             # 42
+                return None                                   # 43
+                                                              # 44
+    def process_row(self, row):                               # 45
+        symbol = row[self._rep._symbol_column]                # 46
+        if symbol in self._inputs:                            # 47
+            for input in self._inputs.get(symbol, []):        # 48
+                input.process_dict(row)                       # 49
+```
+
+`CSVReaderImpl` is the runtime *--impl--*.
+It gets created when the engine is being built from the described graph.
+
+- **lines 10-21 - start()**: this is the start method that gets called with the time range the graph will be run against.
+  Here we open our resource (`pycsv.DictReader`) and scan through the data until we reach the requested starttime.
+
+- **lines 23-24 - stop()**: this is the stop call that gets called when the engine is done running and is shutdown; we free our resource here
+
+- **lines 26-29**: the `CSVReader` allows one to subscribe to many symbols from one file.
+  Symbols are keyed by a provided `SYMBOL` column.
+  The individual adapters will self-register with the `CSVReaderImpl` when they are created with the requested symbol.
+  `CSVReaderImpl` keeps track of what adapters have been registered for what symbol in its `self._inputs` map.
+
+- **lines 31-43**: this is the main method that gets invoked repeatedly throughout the run.
+  For every distinct timestamp in the file, this method will get invoked once and the method is expected to go through the resource data for all points with time "now", process the rows and push the data to any matching adapters.
+  The method returns the next timestamp when it's done processing all data for "now", or None if there is no more data.
+  **NOTE** that the csv impl expects the data to be in time order.
+  `process_next_sim_timeslice` must advance time forward.
+
+- **lines 45-49**: this method takes a row of data (provided as a dict from `DictReader`), extracts the symbol and pushes the row through to all input adapters that match
+
+```python
+class CSVReadAdapterImpl(ManagedSimInputAdapter):             # 1
+    def __init__(self, managerImpl, symbol, typ, field_map):  # 2
+        managerImpl.register_input_adapter(symbol, self)      # 3
+        super().__init__(typ, field_map)                      # 4
+                                                              # 5
+CSVReadAdapter = py_managed_adapter_def(                      # 6
+    'csvadapter',
+    CSVReadAdapterImpl,
+    ts['T'],
+    CSVReader,
+    symbol=str,
+    typ='T',
+    field_map=(object, None)
+)
+```
+- **line 6+**: this is where we define `CSVReadAdapter`, the *--graph--* time representation of a CSV adapter, returned from `CSVReader.subscribe` + +See example [e_14_user_adapters_02_adaptermanager_siminput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_02_adaptermanager_siminput.py) for another example of how to write a managed sim adapter manager. diff --git a/docs/wiki/how-tos/Write-Output-Adapters.md b/docs/wiki/how-tos/Write-Output-Adapters.md new file mode 100644 index 000000000..8fb92d0ff --- /dev/null +++ b/docs/wiki/how-tos/Write-Output-Adapters.md @@ -0,0 +1,317 @@ +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Output Adapters](#output-adapters) + - [OutputAdapter - Python](#outputadapter---python) + - [OutputAdapter - C++](#outputadapter---c) + - [OutputAdapter with Manager](#outputadapter-with-manager) + - [InputOutputAdapter - Python](#inputoutputadapter---python) + +## Output Adapters + +Output adapters are used to define graph outputs, and they differ from input adapters in a number of important ways. +Output adapters also differ from terminal nodes, e.g. regular `csp.node` instances that do not define outputs, and instead consume and emit their inputs inside their `csp.ticked` blocks. + +For many use cases, it will be sufficient to omit writing an output adapter entirely. +Consider the following example of a terminal node that writes an input dictionary timeseries to a file. + +```python +@csp.node +def write_to_file(x: ts[Dict], filename: str): + if csp.ticked(x): + with open(filename, "a") as fp: + fp.write(json.dumps(x)) +``` + +This is a perfectly fine node, and serves its purpose. +Unlike input adapters, output adapters do not need to differentiate between *historical* and *realtime* mode. +Input adapters drive the execution of the graph, whereas output adapters are reactive to their input nodes and subject to the graph's execution. 
+
+However, there are a number of reasons why you might want to define an output adapter instead of using a vanilla node.
+The most important of these is when you want to share resources across a number of output adapters (e.g. with a Manager), or between an input and an output node, e.g. reading data from a websocket, routing it through your CSP graph, and publishing data *to the same websocket connection*.
+For most use cases, a vanilla CSP node will suffice, but let's explore some examples anyway.
+
+### OutputAdapter - Python
+
+To write a Python based OutputAdapter one must write a class that derives from `csp.impl.outputadapter.OutputAdapter`.
+The derived type should define the method:
+
+- `def on_tick(self, time: datetime, value: object)`: this will be called when the input to the output adapter ticks.
+
+The OutputAdapter that you define will be used as the runtime *--impl--*. You also need to define a *--graph--* time representation of the time series edge.
+In order to do this you should define a `csp.impl.wiring.py_output_adapter_def`.
+The `py_output_adapter_def` creates a *--graph--* time representation of your adapter:
+
+**def py_output_adapter_def(name, adapterimpl, \*\*kwargs)**
+
+- **`name`**: string name for the adapter
+- **`adapterimpl`**: a derived implementation of `csp.impl.outputadapter.OutputAdapter`
+- **`kwargs`**: \*\*kwargs will be passed through as arguments to the OutputAdapter implementation
+
+Note that the `**kwargs` passed to `py_output_adapter_def` should be the names and types of the variables, like `arg1=type1, arg2=type2`.
+These are the names of the kwargs that the returned output adapter will take and pass through to the OutputAdapter implementation, and the types expected for the values of those args.
+
+Here is a simple example of the same filewriter from above:
+
+```python
+from csp.impl.outputadapter import OutputAdapter
+from csp.impl.wiring import py_output_adapter_def
+from csp import ts
+import csp
+from json import dumps
+from datetime import datetime, timedelta
+
+
+class MyFileWriterAdapterImpl(OutputAdapter):
+    def __init__(self, filename: str):
+        super().__init__()
+        self._filename = filename
+
+    def start(self):
+        self._fp = open(self._filename, "a")
+
+    def stop(self):
+        self._fp.close()
+
+    def on_tick(self, time, value):
+        self._fp.write(dumps(value) + "\n")
+
+
+MyFileWriterAdapter = py_output_adapter_def(
+    name='MyFileWriterAdapter',
+    adapterimpl=MyFileWriterAdapterImpl,
+    input=ts['T'],
+    filename=str,
+)
+```
+
+Now our adapter can be called in graph code:
+
+```python
+@csp.graph
+def my_graph():
+    curve = csp.curve(
+        data=[
+            (timedelta(seconds=0), {"a": 1, "b": 2, "c": 3}),
+            (timedelta(seconds=1), {"a": 1, "b": 2, "c": 3}),
+            (timedelta(seconds=1), {"a": 1, "b": 2, "c": 3}),
+        ],
+        typ=object,
+    )
+
+    MyFileWriterAdapter(curve, filename="testfile.jsonl")
+```
+
+As explained above, we could also do this via a single node (this is arguably the best of the three versions):
+
+```python
+@csp.node
+def dump_json(data: ts['T'], filename: str):
+    with csp.state():
+        s_file = None
+    with csp.start():
+        s_file = open(filename, "w")
+    with csp.stop():
+        s_file.close()
+    if csp.ticked(data):
+        s_file.write(json.dumps(data) + "\n")
+        s_file.flush()
+```
+
+### OutputAdapter - C++
+
+TODO
+
+### OutputAdapter with Manager
+
+Adapter managers function the same way for output adapters as for input adapters, i.e. to manage a single shared resource from the manager across a variety of discrete output adapters.
+
+### InputOutputAdapter - Python
+
+As a last example, let's tie everything together and implement a managed push input adapter combined with a managed output adapter.
+This example is available in [e_14_user_adapters_06_adaptermanager_inputoutput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_06_adaptermanager_inputoutput.py).
+
+First, we will define our adapter manager.
+In this example, we're going to cheat a little bit and combine our adapter manager (graph time) and our adapter manager impl (run time).
+
+```python
+class MyAdapterManager(AdapterManagerImpl):
+    '''
+    This example adapter will generate random `MyData` structs every `interval`. This simulates an upstream
+    data feed, which we "connect" to only a single time. We then multiplex the results to an arbitrary
+    number of subscribers via the `subscribe` method.
+
+    We can also receive messages via the `publish` method from an arbitrary number of publishers. These messages
+    are demultiplexed into a number of outputs, simulating sharing a connection to a downstream feed or responses
+    to the upstream feed.
+    '''
+    def __init__(self, interval: timedelta):
+        self._interval = interval
+        self._counter = 0
+        self._subscriptions = {}
+        self._publications = {}
+        self._running = False
+        self._thread = None
+
+    def subscribe(self, symbol):
+        '''This method creates a new input adapter implementation via the manager.'''
+        return _my_input_adapter(self, symbol, push_mode=csp.PushMode.NON_COLLAPSING)
+
+    def publish(self, data: ts['T'], symbol: str):
+        '''This method creates a new output adapter implementation via the manager.'''
+        return _my_output_adapter(self, data, symbol)
+
+    def _create(self, engine, memo):
+        # We'll avoid having a second class and make our AdapterManager and AdapterManagerImpl the same
+        super().__init__(engine)
+        return self
+
+    def start(self, starttime, endtime):
+        self._running = True
+        self._thread = threading.Thread(target=self._run)
+        self._thread.start()
+
+    def stop(self):
+        if self._running:
+            self._running = False
+            self._thread.join()
+
+        # print closing of the resources
+        for name in self._publications.values():
+            print("closing asset {}".format(name))
+
+    def register_subscription(self, symbol, adapter):
+        if symbol not in self._subscriptions:
+            self._subscriptions[symbol] = []
+        self._subscriptions[symbol].append(adapter)
+
+    def register_publication(self, symbol):
+        if symbol not in self._publications:
+            self._publications[symbol] = "publication_{}".format(symbol)
+
+    def _run(self):
+        '''This method runs in a background thread and generates random input events to push to the corresponding adapter'''
+        symbols = list(self._subscriptions.keys())
+        while self._running:
+            # Let's pick a random symbol from the requested symbols
+            symbol = symbols[random.randint(0, len(symbols) - 1)]
+
+            data = MyData(symbol=symbol, value=self._counter)
+
+            self._counter += 1
+
+            for adapter in self._subscriptions[symbol]:
+                # push to all the subscribers
+                adapter.push_tick(data)
+
+            time.sleep(self._interval.total_seconds())
+
+    def _on_tick(self, symbol, value):
+        '''This method just writes the data to the appropriate outbound "channel"'''
+        print("{}:{}".format(self._publications[symbol], value))
+```
+
+This adapter manager is a bit of a silly example, but it demonstrates the core concepts.
+The adapter manager will demultiplex a shared stream (in this case, the stream defined in `_run` is a random sequence of `MyData` structs) between all the input adapters it manages.
+The input adapter itself will do nothing more than let the adapter manager know that it exists:
+
+```python
+class MyInputAdapterImpl(PushInputAdapter):
+    '''Our input adapter is a very simple implementation, and just
+    defers its work back to the manager, which is expected to deal with
+    sharing a single connection.
+    '''
+    def __init__(self, manager, symbol):
+        manager.register_subscription(symbol, self)
+        super().__init__()
+```
+
+Similarly, the adapter manager will multiplex the output adapter streams, in this case combining them into streams of print statements.
+And similar to the input adapter, the output adapter does little more than let the adapter manager know that it has work available, using its triggered `on_tick` method to call the adapter manager's `_on_tick` method.
+
+```python
+class MyOutputAdapterImpl(OutputAdapter):
+    '''Similarly, our output adapter is simple as well, deferring
+    its functionality to the manager
+    '''
+    def __init__(self, manager, symbol):
+        manager.register_publication(symbol)
+        self._manager = manager
+        self._symbol = symbol
+        super().__init__()
+
+    def on_tick(self, time, value):
+        self._manager._on_tick(self._symbol, value)
+```
+
+As a last step, we need to ensure that the runtime adapter implementations are registered with our graph:
+
+```python
+_my_input_adapter = py_push_adapter_def(name='MyInputAdapter', adapterimpl=MyInputAdapterImpl, out_type=ts[MyData], manager_type=MyAdapterManager, symbol=str)
+_my_output_adapter = py_output_adapter_def(name='MyOutputAdapter', adapterimpl=MyOutputAdapterImpl, manager_type=MyAdapterManager, input=ts['T'], symbol=str)
+```
+
+To test this example, we will:
+
+- instantiate our manager
+- subscribe to a certain number of input adapter "streams" (which the adapter manager will demultiplex out of a single random node)
+- print the data
+- sink each stream into a smaller number of output adapters (which the adapter manager will multiplex into print statements)
+
+```python
+@csp.graph
+def my_graph():
+    adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
+
+    data_1 = adapter_manager.subscribe("data_1")
+    data_2 = adapter_manager.subscribe("data_2")
+    data_3 = adapter_manager.subscribe("data_3")
+
+    csp.print("data_1", data_1)
+    csp.print("data_2", data_2)
+    csp.print("data_3", data_3)
+
+    # pump two streams into 1 output and 1 stream into another
+    adapter_manager.publish(data_1, "data_1")
+    adapter_manager.publish(data_2, "data_1")
+    adapter_manager.publish(data_3, "data_3")
+```
+
+Here is the result of a single run:
+
+```
+2023-02-15 19:14:53.859951 data_1:MyData(symbol=data_1, value=0)
+publication_data_1:MyData(symbol=data_1, value=0)
+2023-02-15 19:14:54.610281 data_3:MyData(symbol=data_3, value=1)
+publication_data_3:MyData(symbol=data_3, value=1)
+2023-02-15 19:14:55.361157 data_3:MyData(symbol=data_3, value=2)
+publication_data_3:MyData(symbol=data_3, value=2)
+2023-02-15 19:14:56.112030 data_2:MyData(symbol=data_2, value=3)
+publication_data_1:MyData(symbol=data_2, value=3)
+2023-02-15 19:14:56.862881 data_2:MyData(symbol=data_2, value=4)
+publication_data_1:MyData(symbol=data_2, value=4)
+2023-02-15 19:14:57.613775 data_1:MyData(symbol=data_1, value=5)
+publication_data_1:MyData(symbol=data_1, value=5)
+2023-02-15 19:14:58.364408 data_3:MyData(symbol=data_3, value=6)
+publication_data_3:MyData(symbol=data_3, value=6)
+2023-02-15 19:14:59.115290 data_2:MyData(symbol=data_2, value=7)
+publication_data_1:MyData(symbol=data_2, value=7)
+2023-02-15 19:14:59.866160 data_2:MyData(symbol=data_2, value=8)
+publication_data_1:MyData(symbol=data_2, value=8)
+2023-02-15 19:15:00.617068 data_1:MyData(symbol=data_1, value=9)
+publication_data_1:MyData(symbol=data_1, value=9)
+2023-02-15 19:15:01.367955 data_2:MyData(symbol=data_2, value=10)
+publication_data_1:MyData(symbol=data_2, value=10)
+2023-02-15 19:15:02.118259 data_3:MyData(symbol=data_3, value=11)
+publication_data_3:MyData(symbol=data_3, value=11)
+2023-02-15 19:15:02.869170 data_2:MyData(symbol=data_2, value=12)
+publication_data_1:MyData(symbol=data_2, value=12)
+2023-02-15 19:15:03.620047 data_1:MyData(symbol=data_1, value=13)
+publication_data_1:MyData(symbol=data_1, value=13)
+closing asset publication_data_1
+closing asset publication_data_3
+```
+
+Although simple, this example demonstrates the utility of the adapters and adapter managers.
+An input resource is managed by one entity and distributed across a variety of downstream subscribers.
+Then a collection of streams is piped back into a single entity.
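Stripped of the engine machinery, the bookkeeping the manager performs is just a fan-out map for subscriptions and a fan-in channel map for publications. A plain-Python sketch of that pattern (the `ToyManager` name and callback-based API are illustrative, not part of csp):

```python
class ToyManager:
    """Fan-out (demultiplex) to subscribers; fan-in (multiplex) to channels."""

    def __init__(self):
        self._subscriptions = {}  # symbol -> list of subscriber callbacks
        self._publications = {}   # symbol -> outbound channel name

    def register_subscription(self, symbol, callback):
        # mirrors the input adapters self-registering with the manager
        self._subscriptions.setdefault(symbol, []).append(callback)

    def register_publication(self, symbol):
        # mirrors the output adapters claiming an outbound "channel"
        self._publications.setdefault(symbol, "publication_{}".format(symbol))

    def demultiplex(self, symbol, value):
        # one shared upstream source, many downstream subscribers
        for callback in self._subscriptions.get(symbol, []):
            callback(value)

    def multiplex(self, symbol, value):
        # many upstream publishers, one named outbound channel
        return "{}:{}".format(self._publications[symbol], value)


received = []
manager = ToyManager()
manager.register_subscription("data_1", received.append)
manager.register_subscription("data_1", received.append)
manager.register_publication("data_1")
manager.demultiplex("data_1", 42)      # both subscribers receive 42
out = manager.multiplex("data_1", 42)  # "publication_data_1:42"
```

In the csp version above, `register_subscription`/`register_publication` play exactly these roles, with `push_tick` and `_on_tick` as the delivery mechanisms.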
diff --git a/docs/wiki/how-tos/Write-Realtime-Input-Adapters.md b/docs/wiki/how-tos/Write-Realtime-Input-Adapters.md
new file mode 100644
index 000000000..10eedc4cd
--- /dev/null
+++ b/docs/wiki/how-tos/Write-Realtime-Input-Adapters.md
@@ -0,0 +1,407 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Introduction](#introduction)
+- [PushInputAdapter - Python](#pushinputadapter---python)
+- [GenericPushAdapter](#genericpushadapter)
+- [Realtime AdapterManager](#realtime-adaptermanager)
+  - [AdapterManager - **graph-- time**](#adaptermanager---graph---time)
+  - [AdapterManager - **impl-- runtime**](#adaptermanager---impl---runtime)
+  - [PushInputAdapter - **--impl-- runtime**](#pushinputadapter-----impl---runtime)
+  - [PushInputAdapter - **--graph-- time**](#pushinputadapter----graph---time)
+  - [Example](#example)
+
+## Introduction
+
+There are two main categories of input adapters: historical and realtime.
+
+When writing realtime adapters, you will need to implement a "push" adapter, which will get data from a separate thread that drives external events and "pushes" them into the engine as they occur.
+
+When writing input adapters it is also very important to understand the difference between "graph building time" and "runtime" versions of your adapter.
+For example, `csp.adapters.csv` has a `CSVReader` class that is used at graph building time.
+**Graph build time components** solely *describe* the adapter.
+They are meant to do little more than keep track of the type of adapter and its parameters, which will then be used to construct the actual adapter implementation when the engine is constructed from the graph description.
+It is the runtime implementation that actually runs during the engine execution phase to process data.
+
+For clarity of this distinction, in the descriptions below we will denote graph build time components with *--graph--* and runtime implementations with *--impl--*.
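This split between a lightweight description and a runtime object can be sketched in plain Python. The class and method names below (`CSVReaderDescription`, `CSVReaderRuntime`) are purely illustrative stand-ins, not csp's actual API:

```python
# Illustrative sketch (hypothetical names, not csp's API): a graph build time
# component records parameters only; the runtime impl is created from it when
# the engine is constructed, and is the thing that actually opens resources.

class CSVReaderDescription:
    """Graph build time: describes the adapter, opens nothing."""

    def __init__(self, filename, delimiter=","):
        self._filename = filename
        self._delimiter = delimiter

    def _create(self, engine):
        """Bridge to runtime: called when the engine is built from the graph."""
        return CSVReaderRuntime(engine, self._filename, self._delimiter)


class CSVReaderRuntime:
    """Runtime impl: this is what would open the file and process data."""

    def __init__(self, engine, filename, delimiter):
        self._engine = engine
        self._filename = filename
        self._delimiter = delimiter


description = CSVReaderDescription("trades.csv")    # cheap, no I/O
impl = description._create(engine="engine-handle")  # built at engine-construction time
```

In csp the equivalent bridge for adapter managers is a `_create(engine, memo)` method on the graph-time object, as the manager examples in this page show.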
+
+## PushInputAdapter - Python
+
+To write a Python based `PushInputAdapter` one must write a class that derives from `csp.impl.pushadapter.PushInputAdapter`.
+The derived type should define two methods:
+
+- `def start(self, start_time, end_time)`: this will be called at the start of the engine with the start/end times of the engine.
+  start_time and end_time will be tz-unaware datetime objects in UTC time (generally these aren't needed for realtime adapters).
+  At this point the adapter should open its resource / connect the data source / start any driver threads that are needed.
+- `def stop(self)`: this method will be called when the engine is done running.
+  At this point any open threads should be stopped and resources cleaned up.
+
+The `PushInputAdapter` that you define will be used as the runtime *--impl--*.
+You also need to define a *--graph--* time representation of the time series edge.
+In order to do this you should define a `csp.impl.wiring.py_push_adapter_def`.
+The `py_push_adapter_def` creates a *--graph--* time representation of your adapter:
+
+**def py_push_adapter_def(name, adapterimpl, out_type, \*\*kwargs)**
+
+- **`name`**: string name for the adapter
+- **`adapterimpl`**: a derived implementation of
+  `csp.impl.pushadapter.PushInputAdapter`
+- **`out_type`**: the type of the output, should be a `ts[]` type.
+  Note this can use tvar types if a subsequent argument defines the
+  tvar.
+- **`kwargs`**: \*\*kwargs will be passed through as arguments to the
+  PushInputAdapter implementation
+
+Note that the \*\*kwargs passed to `py_push_adapter_def` should be the names and types of the variables, like `arg1=type1, arg2=type2`.
+These are the names of the kwargs that the returned input adapter will take and pass through to the `PushInputAdapter` implementation, and the types expected for the values of those args.
+ +Example [e_14_user_adapters_03_pushinput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_03_pushinput.py) demonstrates a simple example of this. + +```python +from csp.impl.pushadapter import PushInputAdapter +from csp.impl.wiring import py_push_adapter_def +import csp +from csp import ts +from datetime import datetime, timedelta +import threading +import time + + +# The Impl object is created at runtime when the graph is converted into the runtime engine +# it does not exist at graph building time! +class MyPushAdapterImpl(PushInputAdapter): + def __init__(self, interval): + print("MyPushAdapterImpl::__init__") + self._interval = interval + self._thread = None + self._running = False + + def start(self, starttime, endtime): + """ start will get called at the start of the engine, at which point the push + input adapter should start its thread that will push the data onto the adapter. Note + that push adapters will ALWAYS have a separate thread driving ticks into the csp engine thread + """ + print("MyPushAdapterImpl::start") + self._running = True + self._thread = threading.Thread(target=self._run) + self._thread.start() + + def stop(self): + """ stop will get called at the end of the run, at which point resources should + be cleaned up + """ + print("MyPushAdapterImpl::stop") + if self._running: + self._running = False + self._thread.join() + + def _run(self): + counter = 0 + while self._running: + self.push_tick(counter) + counter += 1 + time.sleep(self._interval.total_seconds()) + + +# MyPushAdapter is the graph-building time construct. This is simply a representation of what the +# input adapter is and how to create it, including the Impl to create and arguments to pass into it +MyPushAdapter = py_push_adapter_def('MyPushAdapter', MyPushAdapterImpl, ts[int], interval=timedelta) +``` + +Note how line 41 calls **self.push_tick**. +This is the call to get data from the adapter thread ticking into the CSP engine. 
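The threading pattern behind `push_tick` can be illustrated without csp at all: a driving thread produces values and hands them across to a consumer through a queue. This is only a toy stand-in (the `ToyPushSource` name and the `Queue` hand-off are assumptions for illustration; in csp the hand-off into the engine is done by `push_tick` itself):

```python
import threading
import time
from queue import Queue


class ToyPushSource:
    """A driving thread that 'pushes' ticks to a consumer-side queue."""

    def __init__(self, n_ticks, interval):
        self._n_ticks = n_ticks
        self._interval = interval
        self._thread = None
        self.queue = Queue()  # stand-in for the engine's input channel

    def start(self):
        # mirrors PushInputAdapter.start: spawn the driving thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def stop(self):
        # mirrors PushInputAdapter.stop: wait for the thread to finish
        self._thread.join()

    def _run(self):
        for counter in range(self._n_ticks):
            self.queue.put(counter)  # the analogue of self.push_tick(counter)
            time.sleep(self._interval)


source = ToyPushSource(n_ticks=5, interval=0.01)
source.start()
source.stop()
ticks = [source.queue.get() for _ in range(5)]  # [0, 1, 2, 3, 4]
```

The essential point carries over to the real adapter: the engine thread and the data-producing thread are different threads, and `push_tick` is the only sanctioned crossing point between them.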
+
+Now `MyPushAdapter` can be called in graph code to create a timeseries that is sourced by `MyPushAdapterImpl`:
+
+```python
+@csp.graph
+def my_graph():
+    # At this point we create the graph-time representation of the input adapter. This will be converted
+    # into the impl once the graph is done constructing and the engine is created in order to run
+    data = MyPushAdapter(timedelta(seconds=1))
+    csp.print('data', data)
+```
+
+## GenericPushAdapter
+
+If you don't need as much control as `PushInputAdapter` provides, or if you have some existing source of data on a thread you can't control, another option is to use the higher-level abstraction `csp.GenericPushAdapter`.
+`csp.GenericPushAdapter` wraps a `csp.PushInputAdapter` implementation internally and provides a simplified interface.
+The downside of `csp.GenericPushAdapter` is that you lose some control of when the input feed starts and stops.
+
+Let's take a look at the example found in [e_14_generic_push_adapter.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_generic_push_adapter.py):
+
+```python
+# This is an example of some separate thread providing data
+class Driver:
+    def __init__(self, adapter: csp.GenericPushAdapter):
+        self._adapter = adapter
+        self._active = False
+        self._thread = None
+
+    def start(self):
+        self._active = True
+        self._thread = threading.Thread(target=self._run)
+        self._thread.start()
+
+    def stop(self):
+        if self._active:
+            self._active = False
+            self._thread.join()
+
+    def _run(self):
+        print("driver thread started")
+        counter = 0
+        # Optionally, we can wait for the adapter to start before proceeding
+        # Alternatively we can start pushing data, but push_tick may fail and return False if
+        # the csp engine isn't ready yet
+        self._adapter.wait_for_start()
+
+        while self._active and not self._adapter.stopped():
+            self._adapter.push_tick(counter)
+            counter += 1
+            time.sleep(1)
+
+@csp.graph
+def my_graph():
+    adapter = csp.GenericPushAdapter(int)
+    driver = Driver(adapter)
+    # Note that the driver thread starts *before* the engine is started here, which means some ticks may potentially get dropped if the
+    # data source doesn't wait for the adapter to start. This may be ok for some feeds, but not others
+    driver.start()
+
+    # Let's be nice and shut down the driver thread when the engine is done
+    csp.schedule_on_engine_stop(driver.stop)
+```
+
+In this example we have this dummy `Driver` class which simply represents some external source of data which arrives on a thread that's completely independent of the engine.
+We pass along a `csp.GenericPushAdapter` instance to this thread, which can then call `adapter.push_tick` to get data into the engine (see line 27).
+
+On line 24 we can also see an optional feature which allows the unrelated thread to wait for the adapter to be ready to accept data before ticking data onto it.
+If push_tick is called before the engine starts / the adapter is ready to receive data, it will simply drop the data.
+Note that `GenericPushAdapter.push_tick` will return a bool to indicate whether the data was successfully pushed to the engine or not.
+
+## Realtime `AdapterManager`
+
+In most cases you will likely want to expose a single source of data into multiple input adapters.
+For this use case your adapter should define an `AdapterManager` *--graph--* time component, and an `AdapterManagerImpl` *--impl--* runtime component.
+The `AdapterManager` *--graph--* time component just represents the parameters needed to create the *--impl--* `AdapterManager`.
+It's the *--impl--* that will have the actual implementation that will open the data source, parse the data and provide it to individual adapters.
+
+Similarly you will need to define a derived `PushInputAdapter` *--impl--* component to handle events directed at an individual time series adapter.
+
+**NOTE** It is highly recommended not to open any resources in the *--graph--* time component.
+Graph time components can be pruned and/or memoized into a single instance; opening resources at graph time shouldn't be necessary.
+
+### AdapterManager - **graph-- time**
+
+The graph time `AdapterManager` doesn't need to derive from any interface.
+It should be initialized with any information the impl needs in order to open/process the data source (i.e. ActiveMQ connection information, server host/port, multicast channels, config files, etc.).
+It should also have an API to create individual timeseries adapters.
+These adapters will then get passed the adapter manager *--impl--* as an argument when they are created, so that they can register themselves for processing.
+The `AdapterManager` also needs to define a **\_create** method.
+**\_create** is the bridge between the *--graph--* time `AdapterManager` representation and the runtime *--impl--* object.
+**\_create** will be called on the *--graph--* time `AdapterManager`, which will in turn create the *--impl--* instance.
+\_create will get two arguments: engine (this represents the runtime engine object that will run the graph) and a memo dict, which can optionally be used for any memoization one might want.
+
+Let's take a look at the example found in [e_14_user_adapters_04_adaptermanager_pushinput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_04_adaptermanager_pushinput.py):
+
+```python
+# This object represents our AdapterManager at graph time. It describes the manager's properties
+# and will be used to create the actual impl when it's time to build the engine
+class MyAdapterManager:
+    def __init__(self, interval: timedelta):
+        """
+        Normally one would pass properties of the manager here, i.e. filename,
+        message bus, etc
+        """
+        self._interval = interval
+
+    def subscribe(self, symbol, push_mode=csp.PushMode.NON_COLLAPSING):
+        """ User facing API to subscribe to a timeseries stream from this adapter manager """
+        # This will return a graph-time timeseries edge representing an edge from this
+        # adapter manager for the given symbol / arguments
+        return MyPushAdapter(self, symbol, push_mode=push_mode)
+
+    def _create(self, engine, memo):
+        """ This method will get called at engine build time, at which point the graph time manager representation
+        will create the actual impl that will be used for runtime
+        """
+        # Normally you would pass the arguments down into the impl here
+        return MyAdapterManagerImpl(engine, self._interval)
+```
+
+- **\_\_init\_\_** - as you can see, all \_\_init\_\_ does is keep the parameters that the impl will need.
+- **subscribe** - API to create an individual timeseries / edge from this manager for the given symbol.
+  The interface defined here is up to the adapter writer, but generally "subscribe" is recommended, and it should take any number of arguments needed to define a single stream of data.
+  *MyPushAdapter* is the *--graph--* time representation of the edge, which will be described below.
+  We pass it *self* as its first argument, which will be used to create the `AdapterManager` *--impl--*
+- **\_create** - the method to create the *--impl--* object from the given *--graph--* time representation of the manager
+
+`MyAdapterManager` would then be used in graph building code like so:
+
+```python
+adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
+data = adapter_manager.subscribe('AAPL', push_mode=csp.PushMode.LAST_VALUE)
+csp.print('AAPL last_value', data)
+```
+
+### AdapterManager - **impl-- runtime**
+
+The `AdapterManager` *--impl--* is responsible for opening the data source, parsing and processing all the data and managing all the adapters it needs to feed.
+The impl class should derive from `csp.impl.adaptermanager.AdapterManagerImpl` and implement the following methods:
+
+- **`start(self, starttime, endtime)`**: this is called when the engine starts up.
+  At this point the impl should open the resource providing the data and start up any thread(s) needed to listen to and react to external data.
+  starttime/endtime will be tz-unaware datetime objects in UTC time, though typically these aren't needed for realtime adapters
+- **`stop(self)`**: this is called at the end of the run; resources should be cleaned up at this point
+- **`process_next_sim_timeslice(self, now)`**: this is used by sim adapters; for realtime adapter managers we simply return None
+
+In the example manager, we spawn a processing thread in the `start()` call.
+This thread runs in a loop until it is shut down, and will generate random data to tick out to the registered input adapters.
+Data is passed to a given adapter by calling `push_tick()`.
+
+### PushInputAdapter - **--impl-- runtime**
+
+Users will need to define `PushInputAdapter` derived types to represent the individual timeseries adapter *--impl--* objects.
+Objects should derive from `csp.impl.pushadapter.PushInputAdapter`.
+
+`PushInputAdapter` defines a method `push_tick()` which takes the value to feed the input timeseries.
+
+### PushInputAdapter - **--graph-- time**
+
+Similar to the standalone `PushInputAdapter` described above, we need to define a graph-time construct that represents a `PushInputAdapter` edge.
+In order to define this we use `py_push_adapter_def` again, but this time we pass the adapter manager *--graph--* time type so that it gets constructed properly.
+When the `PushInputAdapter` instance is created it will also receive an instance of the adapter manager *--impl--*, which it can then self-register on.
+
+```python
+def py_push_adapter_def(name, adapterimpl, out_type, manager_type=None, memoize=True, force_memoize=False, **kwargs):
+"""
+Create a graph representation of a python push input adapter.
+:param name: string name for the adapter
+:param adapterimpl: a derived implementation of csp.impl.pushadapter.PushInputAdapter
+:param out_type: the type of the output, should be a ts[] type.
Note this can use tvar types if a subsequent argument defines the tvar
+:param manager_type: the type of the graph time representation of the AdapterManager that will manage this adapter
+:param kwargs: **kwargs will be passed through as arguments to the PushInputAdapter implementation
+the first argument to the implementation will be the adapter manager impl instance
+"""
+```
+
+### Example
+
+Continuing with the *--graph--* time `AdapterManager` described above, we now define the impl:
+
+```python
+# This is the actual manager impl that will be created and executed during runtime
+class MyAdapterManagerImpl(AdapterManagerImpl):
+    def __init__(self, engine, interval):
+        super().__init__(engine)
+
+        # These are just used to simulate a data source
+        self._interval = interval
+        self._counter = 0
+
+        # We will keep track of requested input adapters here
+        self._inputs = {}
+
+        # Our driving thread, all realtime adapters will need a separate thread of execution that
+        # drives data into the engine thread
+        self._running = False
+        self._thread = None
+
+    def start(self, starttime, endtime):
+        """ start will get called at the start of the engine run.
At this point + one would start up the realtime data source / spawn the driving thread(s) and + subscribe to the needed data """ + self._running = True + self._thread = threading.Thread(target=self._run) + self._thread.start() + + def stop(self): + """ This will be called at the end of the engine run, at which point resources should be + closed and cleaned up """ + if self._running: + self._running = False + self._thread.join() + + def register_input_adapter(self, symbol, adapter): + """ Actual PushInputAdapters will self register when they are created as part of the engine + This is the place we gather all requested input adapters and their properties + """ + if symbol not in self._inputs: + self._inputs[symbol] = [] + # Keep a list of adapters by key in case we get duplicate adapters (should be memoized in reality) + self._inputs[symbol].append(adapter) + + def process_next_sim_timeslice(self, now): + """ This method is only used by simulated / historical adapters, for realtime we just return None """ + return None + + def _run(self): + """ Our driving thread, in reality this will be reacting to external events, parsing the data and + pushing it into the respective adapter + """ + symbols = list(self._inputs.keys()) + while self._running: + # Lets pick a random symbol from the requested symbols + symbol = symbols[random.randint(0, len(symbols) - 1)] + adapters = self._inputs[symbol] + data = MyData(symbol=symbol, value=self._counter) + self._counter += 1 + for adapter in adapters: + adapter.push_tick(data) + + time.sleep(self._interval.total_seconds()) +``` + +Then we define our `PushInputAdapter` *--impl--*, which basically just +self-registers with the adapter manager *--impl--* upon construction. We +also define our `PushInputAdapter` *--graph--* time construct using `py_push_adapter_def`. + +```python +# The Impl object is created at runtime when the graph is converted into the runtime engine +# it does not exist at graph building time. 
A push adapter impl will get the +# adapter manager runtime impl as its first argument +class MyPushAdapterImpl(PushInputAdapter): + def __init__(self, manager_impl, symbol): + print(f"MyPushAdapterImpl::__init__ {symbol}") + manager_impl.register_input_adapter(symbol, self) + super().__init__() + + +MyPushAdapter = py_push_adapter_def('MyPushAdapter', MyPushAdapterImpl, ts[MyData], MyAdapterManager, symbol=str) +``` + +And then we can run our adapter in a CSP graph: + +```python +@csp.graph +def my_graph(): + print("Start of graph building") + + adapter_manager = MyAdapterManager(timedelta(seconds=0.75)) + symbols = ['AAPL', 'IBM', 'TSLA', 'GS', 'JPM'] + for symbol in symbols: + # your data source might tick faster than the engine thread can consume it + # push_mode can be used to control how buffered-up tick events will get processed + # LAST_VALUE will conflate and only tick the latest value since the last cycle + data = adapter_manager.subscribe(symbol, csp.PushMode.LAST_VALUE) + csp.print(symbol + " last_value", data) + + # BURST will change the timeseries type from ts[T] to ts[[T]] (list of ticks) + # that will tick with all values that have buffered since the last engine cycle + data = adapter_manager.subscribe(symbol, csp.PushMode.BURST) + csp.print(symbol + " burst", data) + + # NON_COLLAPSING will tick all events without collapsing, unrolling the events + # over multiple engine cycles + data = adapter_manager.subscribe(symbol, csp.PushMode.NON_COLLAPSING) + csp.print(symbol + " non_collapsing", data) + + print("End of graph building") + + +csp.run(my_graph, starttime=datetime.utcnow(), endtime=timedelta(seconds=10), realtime=True) +``` + +Do note that realtime adapters will only run in realtime engines (note the `realtime=True` argument to `csp.run`).
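The three push modes differ only in how they drain ticks that queue up between engine cycles. As a rough illustration, here is a plain-Python sketch (no `csp` required, and purely illustrative — the real engine implements this internally) of what each mode delivers when three ticks have buffered since the last cycle:

```python
# Illustrative only: how a backlog of queued ticks is delivered under each
# push mode. Each element of the returned list represents one engine cycle.
def drain(backlog, push_mode):
    if push_mode == "LAST_VALUE":
        # Conflate: a single cycle ticks only the newest value.
        return [backlog[-1]]
    if push_mode == "BURST":
        # A single cycle ticks once with the whole backlog as a list.
        return [list(backlog)]
    if push_mode == "NON_COLLAPSING":
        # No conflation: the backlog unrolls over one cycle per value.
        return list(backlog)
    raise ValueError(f"unknown push mode: {push_mode}")


backlog = [1, 2, 3]
print(drain(backlog, "LAST_VALUE"))      # [3]
print(drain(backlog, "BURST"))           # [[1, 2, 3]]
print(drain(backlog, "NON_COLLAPSING"))  # [1, 2, 3]
```

In other words, `LAST_VALUE` trades completeness for latency, `BURST` keeps everything but collapses it into one cycle, and `NON_COLLAPSING` keeps both the values and their cycle-by-cycle ordering.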
diff --git a/docs/wiki/references/Examples.md b/docs/wiki/references/Examples.md new file mode 100644 index 000000000..412293196 --- /dev/null +++ b/docs/wiki/references/Examples.md @@ -0,0 +1,7 @@ +> \[!WARNING\] +> This page is a work in progress. + + diff --git a/docs/wiki/references/Glossary.md b/docs/wiki/references/Glossary.md new file mode 100644 index 000000000..d66dd4722 --- /dev/null +++ b/docs/wiki/references/Glossary.md @@ -0,0 +1,142 @@ +> \[!WARNING\] +> This page is a work in progress. + +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Terms](#terms) + - [Engine time](#engine-time) + - [Event streaming](#event-streaming) + - [Time series](#time-series) + - [Tick](#tick) + - [Node](#node) + - [Graph](#graph) + - [Alarm](#alarm) + - [Adapter](#adapter) + - [Realtime](#realtime) + - [Wiring (or graph building time)](#wiring-or-graph-building-time) + - [Graph run time](#graph-run-time) + - [Ticked (as in csp.ticked)](#ticked-as-in-cspticked) + - [Valid (as in csp.valid)](#valid-as-in-cspvalid) + - [Push mode](#push-mode) + - [Edge](#edge) + - [Delayed edge](#delayed-edge) + - [Feedback](#feedback) + - [Struct](#struct) + - [List basket](#list-basket) + - [Dict basket](#dict-basket) + - [Dynamic graph](#dynamic-graph) + - [Push input adapter](#push-input-adapter) + - [Pull input adapter](#pull-input-adapter) + - [Output adapter](#output-adapter) + - [Managed sim adapter](#managed-sim-adapter) + - [Adapter manager](#adapter-manager) + +## Terms + + + +### Engine time + +The CSP engine always maintains its current view of time. 
+The current time of the engine can be accessed at any time within a `csp.node` by calling `csp.now()` + +### Event streaming + + + +### Time series + + + +### Tick + + + +### Node + + + +### Graph + + + +### Alarm + + + +### Adapter + + + +### Realtime + + + +### Wiring (or graph building time) + + + +### Graph run time + + + +### Ticked (as in csp.ticked) + + + +### Valid (as in csp.valid) + + + +### Push mode + + + +### Edge + + + +### Delayed edge + + + +### Feedback + + + +### Struct + + + +### List basket + + + +### Dict basket + + + +### Dynamic graph + + + +### Push input adapter + + + +### Pull input adapter + + + +### Output adapter + + + +### Managed sim adapter + + + +### Adapter manager + + From e72e6205aa3a61b850fa066da528b05f4a2776a0 Mon Sep 17 00:00:00 2001 From: Pavithra Eswaramoorthy Date: Wed, 10 Apr 2024 23:28:11 +0530 Subject: [PATCH 02/27] :broom: Fix links to old docs pages Signed-off-by: Pavithra Eswaramoorthy --- README.md | 2 +- docs/wiki/api-references/Base-Adapters-API.md | 2 +- docs/wiki/api-references/Base-Nodes-API.md | 4 ++-- docs/wiki/concepts/CSP-Graph.md | 2 +- docs/wiki/concepts/CSP-Node.md | 2 +- docs/wiki/dev-guides/Contribute.md | 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 3af71a95d..1e9abd50b 100644 --- a/README.md +++ b/README.md @@ -61,7 +61,7 @@ See [our wiki!](https://github.com/Point72/csp/wiki) ## Development -Check out the [Developer Documentation](https://github.com/Point72/csp/wiki/99.-Developer) +Check out the [contribution guide](https://github.com/Point72/csp/wiki/Contribute) and [local development instructions](https://github.com/Point72/csp/wiki/Local-Development-Setup). 
## Authors diff --git a/docs/wiki/api-references/Base-Adapters-API.md b/docs/wiki/api-references/Base-Adapters-API.md index a72820cc9..ecae3aa91 100644 --- a/docs/wiki/api-references/Base-Adapters-API.md +++ b/docs/wiki/api-references/Base-Adapters-API.md @@ -79,7 +79,7 @@ This allows you to connect an edge as a "graph output". All edges added as outputs will be returned to the caller from `csp.run` as a dictionary of `key: [(datetime, value)]` (list of datetime, values that ticked on the edge) or if `csp.run` is passed `output_numpy=True`, as a dictionary of `key: (array, array)` (tuple of two numpy arrays, one with datetimes and one with values). -See [Collecting Graph Outputs](https://github.com/Point72/csp/wiki/0.-Introduction#collecting-graph-outputs) +See [Collecting Graph Outputs](CSP-Graph#collecting-graph-outputs) Args: diff --git a/docs/wiki/api-references/Base-Nodes-API.md b/docs/wiki/api-references/Base-Nodes-API.md index 81acf4b8d..ae073051f 100644 --- a/docs/wiki/api-references/Base-Nodes-API.md +++ b/docs/wiki/api-references/Base-Nodes-API.md @@ -299,7 +299,7 @@ csp.dynamic_demultiplex( ) → {ts['K']: ts['T']} ``` -Similar to `csp.demultiplex`, this version will return a [Dynamic Basket](https://github.com/Point72/csp/wiki/6.-Dynamic-Graphs) output that will dynamically add new keys as they are seen. +Similar to `csp.demultiplex`, this version will return a [Dynamic Basket](Create-Dynamic-Baskets) output that will dynamically add new keys as they are seen. ## `csp.dynamic_collect` @@ -309,7 +309,7 @@ csp.dynamic_collect( ) → ts[{'K': 'T'}] ``` -Similar to `csp.collect`, this function takes a [Dynamic Basket](https://github.com/Point72/csp/wiki/6.-Dynamic-Graphs) input and returns a dictionary of the key-value pairs corresponding to the values that ticked. +Similar to `csp.collect`, this function takes a [Dynamic Basket](Create-Dynamic-Baskets) input and returns a dictionary of the key-value pairs corresponding to the values that ticked. 
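To make the demultiplex/collect round trip concrete, here is a plain-Python analogy (illustrative only — this is not the `csp` API, which operates on dynamic baskets of ticking time series rather than plain lists and dicts) of routing keyed events into per-key streams and then gathering the latest value per key:

```python
# Illustrative analogy: route keyed events into per-key buckets (demultiplex),
# then gather the latest value per key (collect). In csp these operations work
# on dynamic baskets of time series, not plain containers.
from collections import defaultdict


def demultiplex(events):
    """events is a list of (key, value) pairs; new keys appear dynamically."""
    per_key = defaultdict(list)
    for key, value in events:
        per_key[key].append(value)
    return dict(per_key)


def collect(per_key):
    """Return {key: latest value} for every key that has ticked."""
    return {key: values[-1] for key, values in per_key.items()}


events = [("AAPL", 1), ("IBM", 2), ("AAPL", 3)]
buckets = demultiplex(events)  # {'AAPL': [1, 3], 'IBM': [2]}
latest = collect(buckets)      # {'AAPL': 3, 'IBM': 2}
```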
## `csp.drop_nans` diff --git a/docs/wiki/concepts/CSP-Graph.md b/docs/wiki/concepts/CSP-Graph.md index c84b6542f..03fadb7e8 100644 --- a/docs/wiki/concepts/CSP-Graph.md +++ b/docs/wiki/concepts/CSP-Graph.md @@ -84,7 +84,7 @@ result: Note that the result is a list of `(datetime, value)` tuples. -You can also use [csp.add_graph_output]() to add outputs. +You can also use [csp.add_graph_output](Base-Adapters-API#cspadd_graph_output) to add outputs. These do not need to be in the top-level graph called directly from `csp.run`. This gives the same result: diff --git a/docs/wiki/concepts/CSP-Node.md b/docs/wiki/concepts/CSP-Node.md index 229bfdc34..bf72fc325 100644 --- a/docs/wiki/concepts/CSP-Node.md +++ b/docs/wiki/concepts/CSP-Node.md @@ -155,7 +155,7 @@ basket and react to it as well as access its current value ## **Node Outputs** Nodes can return any number of outputs (including no outputs, in which case it is considered an "output" or sink node, -see [Graph Pruning](https://github.com/Point72/csp/wiki/0.-Introduction#graph-pruning)). +see [Graph Pruning](CSP-Graph#graph-pruning)). Nodes with single outputs can return the output as an unnamed output. Nodes returning multiple outputs must have them be named. When a node is called at graph building time, if it is a single unnamed node the return variable is an edge representing the output which can be passed into other nodes. diff --git a/docs/wiki/dev-guides/Contribute.md b/docs/wiki/dev-guides/Contribute.md index 7de8b8d33..453422bcd 100644 --- a/docs/wiki/dev-guides/Contribute.md +++ b/docs/wiki/dev-guides/Contribute.md @@ -4,6 +4,6 @@ For **bug reports** or **small feature requests**, please open an issue on our [ For **questions** or to discuss **larger changes or features**, please use our [discussions page](https://github.com/Point72/csp/discussions). -For **contributions**, please see our [developer documentation](https://github.com/Point72/csp/wiki/99.-Developer). 
We have `help wanted` and `good first issue` tags on our issues page, so these are a great place to start. +For **contributions**, please see our [developer documentation](Local-Development-Setup). We have `help wanted` and `good first issue` tags on our issues page, so these are a great place to start. For **documentation updates**, make PRs that update the pages in `/docs/wiki`. The documentation is pushed to the GitHub wiki automatically through a GitHub workflow. Note that direct updates to this wiki will be overwritten. From 118a2fbf23bb7ff25624e54a50afd57df90673ce Mon Sep 17 00:00:00 2001 From: Arham Chopra Date: Wed, 10 Apr 2024 15:50:40 -0400 Subject: [PATCH 03/27] Fix to_json serialization for floats Signed-off-by: Arham Chopra --- cpp/csp/python/PyStructToJson.cpp | 46 +++++++++++++++++++++++++-- csp/tests/impl/test_struct.py | 52 +++++++++++++++++++++++++++++-- 2 files changed, 93 insertions(+), 5 deletions(-) diff --git a/cpp/csp/python/PyStructToJson.cpp b/cpp/csp/python/PyStructToJson.cpp index 1226326cc..b7e10ef7a 100644 --- a/cpp/csp/python/PyStructToJson.cpp +++ b/cpp/csp/python/PyStructToJson.cpp @@ -21,6 +21,32 @@ inline rapidjson::Value toJson( const T& val, const CspType& typ, rapidjson::Doc return rapidjson::Value( val ); } +// Helper function for parsing doubles +inline rapidjson::Value doubleToJson( const double& val, rapidjson::Document& doc ) +{ + // NOTE: Rapidjson adds support for this in a future release. Remove this when we upgrade rapidjson to a version + // after 07/16/2023 and use kWriteNanAndInfNullFlag in the writer. + // + // To be compatible with other JSON libraries, we cannot use the default approach that rapidjson has to + // serializing NaN, and (+/-)Infs. We need to manually convert them to NULLs. Rapidjson adds support for this + // in a future release. 
+ if ( std::isnan( val ) || std::isinf( val ) ) + { + return rapidjson::Value(); + } + else + { + return rapidjson::Value( val ); + } +} + +// Helper function to convert doubles into json format recursively, by properly handlings NaNs, and Infs +template<> +inline rapidjson::Value toJson( const double& val, const CspType& typ, rapidjson::Document& doc, PyObject * callable ) +{ + return doubleToJson( val, doc ); +} + // Helper function to convert Enums into json format recursively template<> inline rapidjson::Value toJson( const CspEnum& val, const CspType& typ, rapidjson::Document& doc, PyObject * callable ) @@ -183,7 +209,21 @@ rapidjson::Value pyDictKeyToName( PyObject * py_key, rapidjson::Document& doc ) else if( PyFloat_Check( py_key ) ) { auto key = PyFloat_AsDouble( py_key ); - val.SetString( std::to_string( key ), doc.GetAllocator() ); + auto json_obj = doubleToJson( key, doc ); + if ( json_obj.IsNull() ) + { + auto * str_obj = PyObject_Str( py_key ); + Py_ssize_t len = 0; + const char * str = PyUnicode_AsUTF8AndSize( str_obj, &len ); + CSP_THROW( ValueError, "Cannot serialize " + std::string( str ) + " to key in JSON" ); + } + else + { + // Convert to string + std::stringstream s; + s << key; + val.SetString( s.str(), doc.GetAllocator() ); + } } else { @@ -255,12 +295,12 @@ rapidjson::Value pyObjectToJson( PyObject * value, rapidjson::Document& doc, PyO } else if( PyFloat_Check( value ) ) { - return rapidjson::Value( fromPython( value ) ); + return doubleToJson( fromPython( value ), doc ); } else if( PyUnicode_Check( value ) ) { Py_ssize_t len; - auto str = PyUnicode_AsUTF8AndSize( value , &len ); + auto str = PyUnicode_AsUTF8AndSize( value, &len ); rapidjson::Value str_val; str_val.SetString( str, len, doc.GetAllocator() ); return str_val; diff --git a/csp/tests/impl/test_struct.py b/csp/tests/impl/test_struct.py index 02caacf89..c588478f0 100644 --- a/csp/tests/impl/test_struct.py +++ b/csp/tests/impl/test_struct.py @@ -1282,6 +1282,18 @@ class 
MyStruct(csp.Struct): result_dict = {"b": False, "i": 456, "f": 1.73, "s": "789"} self.assertEqual(json.loads(test_struct.to_json()), result_dict) + test_struct = MyStruct(b=False, i=456, f=float("nan"), s="789") + result_dict = {"b": False, "i": 456, "f": None, "s": "789"} + self.assertEqual(json.loads(test_struct.to_json()), result_dict) + + test_struct = MyStruct(b=False, i=456, f=float("inf"), s="789") + result_dict = {"b": False, "i": 456, "f": None, "s": "789"} + self.assertEqual(json.loads(test_struct.to_json()), result_dict) + + test_struct = MyStruct(b=False, i=456, f=float("-inf"), s="789") + result_dict = {"b": False, "i": 456, "f": None, "s": "789"} + self.assertEqual(json.loads(test_struct.to_json()), result_dict) + def test_to_json_enums(self): from enum import Enum as PyEnum @@ -1434,8 +1446,13 @@ class MyStruct(csp.Struct): result_dict = {"i": 456, "l_any": l_l_i} self.assertEqual(json.loads(test_struct.to_json()), result_dict) - l_any = [[1, 2], "hello", [4, 3.2, [6, [7], (8, True, 10.5, (11, [12, False]))]]] - l_any_result = [[1, 2], "hello", [4, 3.2, [6, [7], [8, True, 10.5, [11, [12, False]]]]]] + l_any = [[1, float("nan")], [float("INFINITY"), float("-inf")]] + test_struct = MyStruct(i=456, l_any=l_any) + result_dict = {"i": 456, "l_any": [[1, None], [None, None]]} + self.assertEqual(json.loads(test_struct.to_json()), result_dict) + + l_any = [[1, 2], "hello", [4, 3.2, [6, [7], (8, True, 10.5, (11, [float("nan"), False]))]]] + l_any_result = [[1, 2], "hello", [4, 3.2, [6, [7], [8, True, 10.5, [11, [None, False]]]]]] test_struct = MyStruct(i=456, l_any=l_any) result_dict = {"i": 456, "l_any": l_any_result} self.assertEqual(json.loads(test_struct.to_json()), result_dict) @@ -1444,6 +1461,7 @@ def test_to_json_dict(self): class MyStruct(csp.Struct): i: int = 123 d_i: typing.Dict[int, int] + d_f: typing.Dict[float, int] d_dt: typing.Dict[str, datetime] d_d_s: typing.Dict[str, typing.Dict[str, str]] d_any: dict @@ -1458,6 +1476,12 @@ class 
MyStruct(csp.Struct): result_dict = {"i": 456, "d_i": d_i_res} self.assertEqual(json.loads(test_struct.to_json()), result_dict) + d_f = {1.2: 2, 2.3: 4, 3.4: 6, 4.5: 7} + d_f_res = {str(k): v for k, v in d_f.items()} + test_struct = MyStruct(i=456, d_f=d_f) + result_dict = {"i": 456, "d_f": d_f_res} + self.assertEqual(json.loads(test_struct.to_json()), result_dict) + dt = datetime.now(tz=pytz.utc) d_dt = {"d1": dt, "d2": dt} test_struct = MyStruct(i=456, d_dt=d_dt) @@ -1475,6 +1499,12 @@ class MyStruct(csp.Struct): result_dict = {"i": 456, "d_any": d_i_res} self.assertEqual(json.loads(test_struct.to_json()), result_dict) + d_f = {1.2: 2, 2.3: 4, 3.4: 6, 4.5: 7} + d_f_res = {str(k): v for k, v in d_f.items()} + test_struct = MyStruct(i=456, d_any=d_f) + result_dict = {"i": 456, "d_any": d_f_res} + self.assertEqual(json.loads(test_struct.to_json()), result_dict) + dt = datetime.now(tz=pytz.utc) d_dt = {"d1": dt, "d2": dt} test_struct = MyStruct(i=456, d_any=d_dt) @@ -1487,6 +1517,24 @@ class MyStruct(csp.Struct): result_dict = {"i": 456, "d_any": d_any_res} self.assertEqual(json.loads(test_struct.to_json()), result_dict) + d_f = {float("nan"): 2, 2.3: 4, 3.4: 6, 4.5: 7} + d_f_res = {str(k): v for k, v in d_f.items()} + test_struct = MyStruct(i=456, d_any=d_f) + with self.assertRaises(ValueError): + test_struct.to_json() + + d_f = {float("inf"): 2, 2.3: 4, 3.4: 6, 4.5: 7} + d_f_res = {str(k): v for k, v in d_f.items()} + test_struct = MyStruct(i=456, d_any=d_f) + with self.assertRaises(ValueError): + test_struct.to_json() + + d_f = {float("-inf"): 2, 2.3: 4, 3.4: 6, 4.5: 7} + d_f_res = {str(k): v for k, v in d_f.items()} + test_struct = MyStruct(i=456, d_any=d_f) + with self.assertRaises(ValueError): + test_struct.to_json() + def test_to_json_struct(self): class MySubSubStruct(csp.Struct): b: bool = True From 8adb1a7f06daa8c28cbe70b87b247e2a4dce80a3 Mon Sep 17 00:00:00 2001 From: Arham Chopra Date: Thu, 11 Apr 2024 13:51:51 -0400 Subject: [PATCH 04/27] Upgrade 
baseline in vcpkg.json Signed-off-by: Arham Chopra --- vcpkg | 2 +- vcpkg.json | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/vcpkg b/vcpkg index 0f9a4f65e..43d81795a 160000 --- a/vcpkg +++ b/vcpkg @@ -1 +1 @@ -Subproject commit 0f9a4f65e4d06fd70a8d839ec3a1eafc05652e70 +Subproject commit 43d81795a513e2ca6648354786178714f33c8b6f diff --git a/vcpkg.json b/vcpkg.json index e45b4a9d9..ad1fa7e96 100644 --- a/vcpkg.json +++ b/vcpkg.json @@ -23,5 +23,5 @@ "overrides": [ { "name": "arrow", "version": "15.0.0"} ], - "builtin-baseline": "288e8bebf4ca67c1f8ebd49366b03650cfd9eb7d" + "builtin-baseline": "43d81795a513e2ca6648354786178714f33c8b6f" } From f6b0963b7bea00681ab19e116448f492fa515d1e Mon Sep 17 00:00:00 2001 From: Arham Chopra Date: Wed, 10 Apr 2024 19:22:33 -0400 Subject: [PATCH 05/27] Parse None natively in to_json method Signed-off-by: Arham Chopra --- cpp/csp/python/PyStructToJson.cpp | 6 +++++- csp/tests/impl/test_struct.py | 21 +++++++++++++++++---- 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/cpp/csp/python/PyStructToJson.cpp b/cpp/csp/python/PyStructToJson.cpp index b7e10ef7a..1894ecf24 100644 --- a/cpp/csp/python/PyStructToJson.cpp +++ b/cpp/csp/python/PyStructToJson.cpp @@ -285,7 +285,11 @@ rapidjson::Value pyDictToJson( PyObject * py_dict, rapidjson::Document& doc, PyO rapidjson::Value pyObjectToJson( PyObject * value, rapidjson::Document& doc, PyObject * callable, bool is_recursing ) { - if( PyBool_Check( value ) ) + if( value == Py_None ) + { + return rapidjson::Value(); + } + else if( PyBool_Check( value ) ) { return rapidjson::Value( fromPython( value ) ); } diff --git a/csp/tests/impl/test_struct.py b/csp/tests/impl/test_struct.py index c588478f0..b5e634a9d 100644 --- a/csp/tests/impl/test_struct.py +++ b/csp/tests/impl/test_struct.py @@ -1451,8 +1451,13 @@ class MyStruct(csp.Struct): result_dict = {"i": 456, "l_any": [[1, None], [None, None]]} self.assertEqual(json.loads(test_struct.to_json()), result_dict) - 
l_any = [[1, 2], "hello", [4, 3.2, [6, [7], (8, True, 10.5, (11, [float("nan"), False]))]]] - l_any_result = [[1, 2], "hello", [4, 3.2, [6, [7], [8, True, 10.5, [11, [None, False]]]]]] + l_any = [[None], None, [1, 2, None]] + test_struct = MyStruct(i=456, l_any=l_any) + result_dict = {"i": 456, "l_any": l_any} + self.assertEqual(json.loads(test_struct.to_json()), result_dict) + + l_any = [[1, 2], "hello", [4, 3.2, [6, [7], (None, True, 10.5, (11, [float("nan"), None, False]))]]] + l_any_result = [[1, 2], "hello", [4, 3.2, [6, [7], [None, True, 10.5, [11, [None, None, False]]]]]] test_struct = MyStruct(i=456, l_any=l_any) result_dict = {"i": 456, "l_any": l_any_result} self.assertEqual(json.loads(test_struct.to_json()), result_dict) @@ -1511,8 +1516,16 @@ class MyStruct(csp.Struct): result_dict = json.loads(test_struct.to_json()) self.assertEqual({k: datetime.fromisoformat(d) for k, d in result_dict["d_any"].items()}, d_dt) - d_any = {"b1": {1: "k1", "d2": {4: 5.5}}, "b2": {"d3": {}, "d4": {"d5": {"d6": {"d7": {}}}}}} - d_any_res = {"b1": {"1": "k1", "d2": {"4": 5.5}}, "b2": {"d3": {}, "d4": {"d5": {"d6": {"d7": {}}}}}} + d_any = { + "b1": {1: "k1", "d2": {4: 5.5}}, + "b2": {"d3": {}, "d4": {"d5": {"d6": {"d7": {}}}}, "d8": None}, + "b3": None, + } + d_any_res = { + "b1": {"1": "k1", "d2": {"4": 5.5}}, + "b2": {"d3": {}, "d4": {"d5": {"d6": {"d7": {}}}}, "d8": None}, + "b3": None, + } test_struct = MyStruct(i=456, d_any=d_any) result_dict = {"i": 456, "d_any": d_any_res} self.assertEqual(json.loads(test_struct.to_json()), result_dict) From eaec4a09220d1e0b16bdaf8052101c4268b6c3a8 Mon Sep 17 00:00:00 2001 From: Arham Chopra Date: Thu, 11 Apr 2024 17:43:15 -0400 Subject: [PATCH 06/27] Update baseline to stable version Signed-off-by: Arham Chopra --- vcpkg.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vcpkg.json b/vcpkg.json index ad1fa7e96..be0aea177 100644 --- a/vcpkg.json +++ b/vcpkg.json @@ -23,5 +23,5 @@ "overrides": [ { "name": "arrow", 
"version": "15.0.0"} ], - "builtin-baseline": "43d81795a513e2ca6648354786178714f33c8b6f" + "builtin-baseline": "8a12a368c8af1e39fa597793fb58555f54414d0b" } From f11b8fca5eaf51d3f43293678864730ec603b2d5 Mon Sep 17 00:00:00 2001 From: Rob Ambalu Date: Wed, 17 Apr 2024 10:30:37 -0400 Subject: [PATCH 07/27] PushPullInputAdapter - fix to previous patch that fixed out of order time handling. Need to account for the null event which signifies end of replay Signed-off-by: Rob Ambalu --- cpp/csp/engine/PushPullInputAdapter.cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cpp/csp/engine/PushPullInputAdapter.cpp b/cpp/csp/engine/PushPullInputAdapter.cpp index c61ce9760..051b0a204 100644 --- a/cpp/csp/engine/PushPullInputAdapter.cpp +++ b/cpp/csp/engine/PushPullInputAdapter.cpp @@ -81,7 +81,7 @@ PushPullInputAdapter::PullDataEvent * PushPullInputAdapter::nextPullEvent() auto * event = m_poppedPullEvents.front(); m_poppedPullEvents.pop(); - if( m_adjustOutOfOrderTime ) + if( m_adjustOutOfOrderTime && event ) event -> time = std::max( event -> time, rootEngine() -> now() ); return event; From 86c4e6d441249089066ab4efa59fd7102602910e Mon Sep 17 00:00:00 2001 From: Arham Chopra Date: Wed, 17 Apr 2024 11:53:19 -0400 Subject: [PATCH 08/27] Revert "Upgrade baseline in vcpkg.json" --- vcpkg | 2 +- vcpkg.json | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/vcpkg b/vcpkg index 43d81795a..0f9a4f65e 160000 --- a/vcpkg +++ b/vcpkg @@ -1 +1 @@ -Subproject commit 43d81795a513e2ca6648354786178714f33c8b6f +Subproject commit 0f9a4f65e4d06fd70a8d839ec3a1eafc05652e70 diff --git a/vcpkg.json b/vcpkg.json index be0aea177..e45b4a9d9 100644 --- a/vcpkg.json +++ b/vcpkg.json @@ -23,5 +23,5 @@ "overrides": [ { "name": "arrow", "version": "15.0.0"} ], - "builtin-baseline": "8a12a368c8af1e39fa597793fb58555f54414d0b" + "builtin-baseline": "288e8bebf4ca67c1f8ebd49366b03650cfd9eb7d" } From abcd3076915d382fbce59bd4c8c383f724a8a5b4 Mon Sep 17 00:00:00 2001 From: Tim 
Paine <3105306+timkpaine@users.noreply.github.com> Date: Wed, 17 Apr 2024 22:27:51 -0400 Subject: [PATCH 09/27] Move websocket example after merge Signed-off-by: Tim Paine <3105306+timkpaine@users.noreply.github.com> --- .../websocket/e1_websocket_client.py} | 0 .../websocket/{e1_websocket_output.py => e2_websocket_output.py} | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename examples/{3_using_adapters/websocket_client.py => 03_using_adapters/websocket/e1_websocket_client.py} (100%) rename examples/03_using_adapters/websocket/{e1_websocket_output.py => e2_websocket_output.py} (100%) diff --git a/examples/3_using_adapters/websocket_client.py b/examples/03_using_adapters/websocket/e1_websocket_client.py similarity index 100% rename from examples/3_using_adapters/websocket_client.py rename to examples/03_using_adapters/websocket/e1_websocket_client.py diff --git a/examples/03_using_adapters/websocket/e1_websocket_output.py b/examples/03_using_adapters/websocket/e2_websocket_output.py similarity index 100% rename from examples/03_using_adapters/websocket/e1_websocket_output.py rename to examples/03_using_adapters/websocket/e2_websocket_output.py From 4ea4799bd27a3737401c7dad749bc64c12ded268 Mon Sep 17 00:00:00 2001 From: Adam Glustein Date: Wed, 17 Apr 2024 15:15:19 -0400 Subject: [PATCH 10/27] Maintain the type of a list-derived object when converting a struct in to_dict (#199) Signed-off-by: Adam Glustein --- cpp/csp/python/PyStructList.hi | 12 +++++++++++- csp/impl/struct.py | 4 +--- csp/tests/impl/test_struct.py | 13 +++++++++++++ 3 files changed, 25 insertions(+), 4 deletions(-) diff --git a/cpp/csp/python/PyStructList.hi b/cpp/csp/python/PyStructList.hi index bf829ea6c..7e09a5a2d 100644 --- a/cpp/csp/python/PyStructList.hi +++ b/cpp/csp/python/PyStructList.hi @@ -394,6 +394,16 @@ static PyMappingMethods py_struct_list_as_mapping = { py_struct_list_ass_subscript }; +static PyObject * +PyStructList_new( PyTypeObject *type, PyObject *args, PyObject *kwds ) 
+{ + // Since the PyStructList has no real meaning when created from Python, we can reconstruct the PSL's value + // by just treating it as a list. Thus, we simply override the tp_new behaviour to return a list object here. + // Again, since we don't have tp_init for the PSL, we need to rely on the Python list's tp_init function. + + return PyObject_Call( ( PyObject * ) &PyList_Type, args, kwds ); // Calls both tp_new and tp_init for a Python list +} + template static int PyStructList_tp_clear( PyStructList * self ) @@ -437,7 +447,7 @@ template PyTypeObject PyStructList::PyType = { .tp_clear = ( inquiry ) PyStructList_tp_clear, .tp_methods = PyStructList_methods, .tp_alloc = PyType_GenericAlloc, - .tp_new = PyType_GenericNew, + .tp_new = PyStructList_new, .tp_free = PyObject_GC_Del, }; diff --git a/csp/impl/struct.py b/csp/impl/struct.py index bf2206aa5..89eaa254b 100644 --- a/csp/impl/struct.py +++ b/csp/impl/struct.py @@ -108,9 +108,7 @@ def _obj_to_python(cls, obj): ) elif isinstance(obj, dict): return {k: cls._obj_to_python(v) for k, v in obj.items()} - elif isinstance(obj, list): - return list(cls._obj_to_python(v) for v in obj) - elif isinstance(obj, (tuple, set)): + elif isinstance(obj, (list, tuple, set)): return type(obj)(cls._obj_to_python(v) for v in obj) elif isinstance(obj, csp.Enum): return obj.name # handled in _obj_from_python diff --git a/csp/tests/impl/test_struct.py b/csp/tests/impl/test_struct.py index b5e634a9d..0920aa5cc 100644 --- a/csp/tests/impl/test_struct.py +++ b/csp/tests/impl/test_struct.py @@ -697,6 +697,19 @@ def test_from_dict_with_enum(self): struct = StructWithDefaults.from_dict({"e": MyEnum.A}) self.assertEqual(MyEnum.A, getattr(struct, "e")) + def test_from_dict_with_list_derived_type(self): + class ListDerivedType(list): + def __init__(self, iterable=None): + super().__init__(iterable) + + class StructWithListDerivedType(csp.Struct): + ldt: ListDerivedType + + s1 = StructWithListDerivedType(ldt=ListDerivedType([1,2])) + 
self.assertTrue(isinstance(s1.to_dict()['ldt'], ListDerivedType)) + s2 = StructWithListDerivedType.from_dict(s1.to_dict()) + self.assertEqual(s1, s2) + def test_from_dict_loop_no_defaults(self): looped = StructNoDefaults.from_dict(StructNoDefaults(a1=[9, 10]).to_dict()) self.assertEqual(looped, StructNoDefaults(a1=[9, 10])) From 5103026dc297221196081af54f57bc13aa280b30 Mon Sep 17 00:00:00 2001 From: Pavithra Eswaramoorthy Date: Tue, 23 Apr 2024 13:30:46 +0530 Subject: [PATCH 11/27] Re-apply lost updates in dev guides (#202) * Re-apply updates to build csp from source Signed-off-by: Pavithra Eswaramoorthy * Add note about DCO in the Contribute.md Signed-off-by: Pavithra Eswaramoorthy * Add back install notes about perl-ipc & git * Add back Using System Dependencies section Signed-off-by: Pavithra Eswaramoorthy --------- Signed-off-by: Pavithra Eswaramoorthy --- docs/wiki/dev-guides/Build-CSP-from-Source.md | 79 +++++++++++-------- docs/wiki/dev-guides/Contribute.md | 5 ++ 2 files changed, 51 insertions(+), 33 deletions(-) diff --git a/docs/wiki/dev-guides/Build-CSP-from-Source.md b/docs/wiki/dev-guides/Build-CSP-from-Source.md index 6aeda4226..3ecec0be1 100644 --- a/docs/wiki/dev-guides/Build-CSP-from-Source.md +++ b/docs/wiki/dev-guides/Build-CSP-from-Source.md @@ -10,6 +10,7 @@ CSP is written in Python and C++ with Python and C++ build dependencies. While p - [Clone](#clone) - [Install build dependencies](#install-build-dependencies) - [Build](#build) + - [A note about dependencies](#a-note-about-dependencies) - [Building with a system package manager](#building-with-a-system-package-manager) - [Clone](#clone-1) - [Install build dependencies](#install-build-dependencies-1) @@ -18,6 +19,7 @@ CSP is written in Python and C++ with Python and C++ build dependencies. 
While p - [Install Python dependencies](#install-python-dependencies) - [Build](#build-1) - [Building on `aarch64` Linux](#building-on-aarch64-linux) +- [Using System Dependencies](#using-system-dependencies) - [Lint and Autoformat](#lint-and-autoformat) - [Testing](#testing) - [Troubleshooting](#troubleshooting) @@ -49,29 +51,16 @@ CSP has a few system-level dependencies which you can install from your machine The easiest way to get started on a Linux machine is by installing the necessary dependencies in a self-contained conda environment. -Tweak this script to create a conda environment, install the build dependencies, build, and install a development version of CSP into the environment. +Tweak this script to create a conda environment, install the build dependencies, build, and install a development version of `csp` into the environment. Note that we use [micromamba](https://mamba.readthedocs.io/en/latest/index.html) in this example, but [Anaconda](https://www.anaconda.com/download), [Miniconda](https://docs.anaconda.com/free/miniconda/index.html), [Miniforge](https://github.com/conda-forge/miniforge), etc, should all work fine. ### Install conda ```bash -mkdir ~/github -cd ~/github +# download and install micromamba for Linux/Mac +"${SHELL}" <(curl -L micro.mamba.pm/install.sh) -# this downloads a Linux x86_64 build, change your architecture to match your development machine -# see https://conda-forge.org/miniforge/ for alternate download links - -wget https://github.com/conda-forge/miniforge/releases/download/23.3.1-1/Mambaforge-23.3.1-1-Linux-x86_64.sh -chmod 755 Mambaforge-23.3.1-1-Linux-x86_64.sh -./Mambaforge-23.3.1-1-Linux-x86_64.sh -b -f -u -p csp_venv - -. 
~/github/csp_venv/etc/profile.d/conda.sh - -# optionally, run this if you want to set up conda in your .bashrc -# conda init bash - -conda config --add channels conda-forge -conda config --set channel_priority strict -conda activate base +# on windows powershell +# Invoke-Expression ((Invoke-WebRequest -Uri https://micro.mamba.pm/install.ps1).Content) ``` ### Clone @@ -88,31 +77,31 @@ git submodule update --init --recursive # Note the operating system, change as needed # Linux and MacOS should use the unix dev environment spec micromamba create -n csp -f conda/dev-environment-unix.yml + +# uncomment below if the build fails because git isn't new enough +# +# micromamba install -y -n csp git + +# uncomment below if the build fails because perl-ipc-system is missing +# (this happens on some RHEL7 systems) +# +# micromamba install -y -n csp perl-ipc-system-simple + micromamba activate csp ``` ### Build ```bash -make build - -# on aarch64 linux, comment the above command and use this instead -# VCPKG_FORCE_SYSTEM_BINARIES=1 make build +make build-conda -# finally install into the csp_venv conda environment +# finally install into the csp conda environment make develop ``` -If you didn’t do `conda init bash` you’ll need to re-add conda to your shell environment and activate the `csp` environment to use it: +### A note about dependencies -```bash -. ~/github/csp_venv/etc/profile.d/conda.sh -conda activate csp - -# make sure everything works -cd ~/github/csp -make test -``` +In Conda, we pull our dependencies from the Conda environment by setting the environment variable `CSP_USE_VCPKG=0`. This will force the build to not pull dependencies from vcpkg. This may or may not work in other environments or with packages provided by other package managers or built from source, but there is too much variability for us to support alternative patterns. 
## Building with a system package manager @@ -187,6 +176,8 @@ Build the python project in the usual manner: ```bash make build +# on aarch64 linux, comment the above command and use this instead +# VCPKG_FORCE_SYSTEM_BINARIES=1 make build # or # python setup.py build build_ext --inplace ``` @@ -199,6 +190,10 @@ On `aarch64` Linux the VCPKG_FORCE_SYSTEM_BINARIES environment variable must be VCPKG_FORCE_SYSTEM_BINARIES=1 make build ``` +## Using System Dependencies + +By default, we pull and build dependencies with [vcpkg](https://vcpkg.io/en/). We only support non-vendored dependencies via Conda (see [A note about dependencies](#a-note-about-dependencies) above). + ## Lint and Autoformat CSP has listing and auto formatting. @@ -231,7 +226,7 @@ make fix-cpp make lint-py # or # python -m isort --check csp/ setup.py -# python -m ruff csp/ setup.py +# python -m ruff check csp/ setup.py ``` **Python Autoformatting** @@ -243,6 +238,24 @@ make fix-py # python -m ruff format csp/ setup.py ``` +**Documentation Linting** + +```bash +make lint-docs +# or +# python -m mdformat --check docs/wiki/ README.md examples/README.md +# python -m codespell_lib docs/wiki/ README.md examples/README.md +``` + +**Documentation Autoformatting** + +```bash +make fix-docs +# or +# python -m mdformat docs/wiki/ README.md examples/README.md +# python -m codespell_lib --write docs/wiki/ README.md examples/README.md +``` + ## Testing CSP has both Python and C++ tests. The bulk of the functionality is tested in Python, which can be run via `pytest`. First, install the Python development dependencies with diff --git a/docs/wiki/dev-guides/Contribute.md b/docs/wiki/dev-guides/Contribute.md index 453422bcd..19025b023 100644 --- a/docs/wiki/dev-guides/Contribute.md +++ b/docs/wiki/dev-guides/Contribute.md @@ -1,5 +1,10 @@ Contributions are welcome on this project. We distribute under the terms of the [Apache 2.0 license](https://github.com/Point72/csp/blob/main/LICENSE). 
+> \[!NOTE\] +> CSP requires [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin) for all contributions. +> This is enforced by a [Probot GitHub App](https://probot.github.io/apps/dco/), which checks that commits are "signed". +> Read [instructions to configure commit signing](Local-Development-Setup#configure-commit-signing). + For **bug reports** or **small feature requests**, please open an issue on our [issues page](https://github.com/Point72/csp/issues). For **questions** or to discuss **larger changes or features**, please use our [discussions page](https://github.com/Point72/csp/discussions). From 64559cb81b817c5c11038b2d8be928f5da8e9013 Mon Sep 17 00:00:00 2001 From: Adam Glustein Date: Mon, 29 Apr 2024 16:18:19 -0400 Subject: [PATCH 12/27] Include AS statement in SQL build query regardless of sqlalchemy version (#205) Signed-off-by: Adam Glustein --- csp/adapters/db.py | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/csp/adapters/db.py b/csp/adapters/db.py index 1aba12407..8316c1a67 100644 --- a/csp/adapters/db.py +++ b/csp/adapters/db.py @@ -351,10 +351,7 @@ def build_query(self, starttime, endtime): elif self._rep._use_raw_user_query: return db.text(self._rep._query) else: # self._rep._query - if _SQLALCHEMY_2: - from_obj = db.text(f"({self._rep._query})") - else: - from_obj = db.text(f"({self._rep._query}) AS user_query") + from_obj = db.text(f"({self._rep._query}) AS user_query") time_columns = self._rep._time_accessor.get_time_columns(self._rep._connection) if time_columns: From 2abdf593e035eba7b014a7f9b927f07c448299ec Mon Sep 17 00:00:00 2001 From: Rob Ambalu Date: Fri, 3 May 2024 10:23:49 -0400 Subject: [PATCH 13/27] Update vcpkg baseline (#209) * update vcpkg baseline. 
forced vcpkg triplet for linux to x64-linux (was defaulting to the community x64-linux-dynamic for some reason, which fails when building boost on the latest baseline) --- setup.py | 15 ++++++++++++-- vcpkg | 2 +- vcpkg.json | 57 +++++++++++++++++++++++++++++------------------- 3 files changed, 45 insertions(+), 29 deletions(-) diff --git a/setup.py b/setup.py index aee9c79d9..085e442e5 100644 --- a/setup.py +++ b/setup.py @@ -22,6 +22,11 @@ # - omit coverage/gprof, not implemented ) +if sys.platform == "linux": + VCPKG_TRIPLET = "x64-linux" +else: + VCPKG_TRIPLET = None + # This will be used for e.g. the sdist if CSP_USE_VCPKG: if not os.path.exists("vcpkg"): @@ -30,12 +35,15 @@ subprocess.call(["git", "submodule", "update", "--init", "--recursive"]) if not os.path.exists("vcpkg/buildtrees"): subprocess.call(["git", "pull"], cwd="vcpkg") + args = ["install"] + if VCPKG_TRIPLET is not None: + args.append(f"--triplet={VCPKG_TRIPLET}") if os.name == "nt": subprocess.call(["bootstrap-vcpkg.bat"], cwd="vcpkg", shell=True) - subprocess.call(["vcpkg.bat", "install"], cwd="vcpkg", shell=True) + subprocess.call(["vcpkg.bat"] + args, cwd="vcpkg", shell=True) else: subprocess.call(["./bootstrap-vcpkg.sh"], cwd="vcpkg") - subprocess.call(["./vcpkg", "install"], cwd="vcpkg") + subprocess.call(["./vcpkg"] + args, cwd="vcpkg") python_version = f"{sys.version_info.major}.{sys.version_info.minor}" @@ -54,6 +62,9 @@ "-DCSP_USE_VCPKG=ON", ] ) + + if VCPKG_TRIPLET is not None: + cmake_args.append( f"-DVCPKG_TARGET_TRIPLET={VCPKG_TRIPLET}" ) else: cmake_args.append("-DCSP_USE_VCPKG=OFF") diff --git a/vcpkg b/vcpkg index 0f9a4f65e..04b0cf2b3 160000 --- a/vcpkg +++ b/vcpkg @@ -1 +1 @@ -Subproject commit 0f9a4f65e4d06fd70a8d839ec3a1eafc05652e70 +Subproject commit 04b0cf2b3fd1752d3c3db969cbc10ba0a4613cee diff --git a/vcpkg.json b/vcpkg.json index e45b4a9d9..a54276c50 100644 --- a/vcpkg.json +++ b/vcpkg.json @@ -1,27 +1,32 @@ { - "name": "main", - "version-string": "latest", - 
"dependencies": [ - "abseil", - "arrow", - "brotli", - "exprtk", - "gtest", - { - "name": "librdkafka", - "features": ["ssl"] - }, - "lz4", - "openssl", - "parquet", - "protobuf", - "rapidjson", - "thrift", - "utf8proc", - "websocketpp" - ], - "overrides": [ - { "name": "arrow", "version": "15.0.0"} - ], - "builtin-baseline": "288e8bebf4ca67c1f8ebd49366b03650cfd9eb7d" - } + "name": "main", + "version-string": "latest", + "dependencies": [ + "abseil", + "arrow", + "brotli", + "exprtk", + "gtest", + { + "name": "librdkafka", + "features": [ + "ssl" + ] + }, + "lz4", + "openssl", + "parquet", + "protobuf", + "rapidjson", + "thrift", + "utf8proc", + "websocketpp" + ], + "overrides": [ + { + "name": "arrow", + "version": "15.0.0" + } + ], + "builtin-baseline": "04b0cf2b3fd1752d3c3db969cbc10ba0a4613cee" +} From e2a4f2b303749a3c15f386e7d1d70bc1f7c5fabc Mon Sep 17 00:00:00 2001 From: Adam Glustein Date: Wed, 1 May 2024 11:39:59 -0400 Subject: [PATCH 14/27] Fix interrupt handling issues in csp: ensure first node is stopped and reset signaled flag across runs (#206) * Fix various interrupt handling issues in csp Signed-off-by: Adam Glustein * Add comment explaining signal handling in multiple engine threads Signed-off-by: Adam Glustein --------- Signed-off-by: Adam Glustein --- cpp/csp/engine/RootEngine.cpp | 23 ++++++++++--- cpp/csp/engine/RootEngine.h | 1 + cpp/csp/python/PyNode.cpp | 19 ++++++----- cpp/csp/python/csptestlibimpl.cpp | 32 ++++++++++++++++++ csp/tests/test_engine.py | 54 +++++++++++++++++++++++++++++++ 5 files changed, 114 insertions(+), 15 deletions(-) diff --git a/cpp/csp/engine/RootEngine.cpp b/cpp/csp/engine/RootEngine.cpp index a65f3a5d1..2ddc0eaca 100644 --- a/cpp/csp/engine/RootEngine.cpp +++ b/cpp/csp/engine/RootEngine.cpp @@ -11,12 +11,24 @@ namespace csp { -static volatile bool g_SIGNALED = false; +static volatile int g_SIGNAL_COUNT = 0; +/* +The signal count variable is maintained to ensure that multiple engine threads shutdown properly. 
+ +An interrupt should cause all running engines to stop, but should not affect future runs in the same process. +Thus, each root engine keeps track of the signal count when it's created. When an interrupt occurs, one engine thread +handles the interrupt by incrementing the count. Then, all other root engines detect the signal by comparing their +initial count to the current count. + +Future runs after the interrupt remain unaffected since they are initialized with the updated signal count, and will +only consider themselves "interrupted" if another signal is received during their execution. +*/ + static struct sigaction g_prevSIGTERMaction; static void handle_SIGTERM( int signum ) { - g_SIGNALED = true; + g_SIGNAL_COUNT++; if( g_prevSIGTERMaction.sa_handler ) (*g_prevSIGTERMaction.sa_handler)( signum ); } @@ -58,6 +70,7 @@ RootEngine::RootEngine( const Dictionary & settings ) : Engine( m_cycleStepTable m_cycleCount( 0 ), m_settings( settings ), m_inRealtime( false ), + m_initSignalCount( g_SIGNAL_COUNT ), m_pushEventQueue( m_settings.queueWaitTime > TimeDelta::ZERO() ) { if( settings.get( "profile", false ) ) @@ -78,7 +91,7 @@ RootEngine::~RootEngine() bool RootEngine::interrupted() const { - return g_SIGNALED; + return g_SIGNAL_COUNT != m_initSignalCount; } void RootEngine::preRun( DateTime start, DateTime end ) @@ -131,7 +144,7 @@ void RootEngine::processEndCycle() void RootEngine::runSim( DateTime end ) { m_inRealtime = false; - while( m_scheduler.hasEvents() && m_state == State::RUNNING && !g_SIGNALED ) + while( m_scheduler.hasEvents() && m_state == State::RUNNING && !interrupted() ) { m_now = m_scheduler.nextTime(); if( m_now > end ) @@ -161,7 +174,7 @@ void RootEngine::runRealtime( DateTime end ) m_inRealtime = true; bool haveEvents = false; - while( m_state == State::RUNNING && !g_SIGNALED ) + while( m_state == State::RUNNING && !interrupted() ) { TimeDelta waitTime; if( !m_pendingPushEvents.hasEvents() ) diff --git a/cpp/csp/engine/RootEngine.h 
b/cpp/csp/engine/RootEngine.h index e3999d572..746704580 100644 --- a/cpp/csp/engine/RootEngine.h +++ b/cpp/csp/engine/RootEngine.h @@ -127,6 +127,7 @@ class RootEngine : public Engine PendingPushEvents m_pendingPushEvents; Settings m_settings; bool m_inRealtime; + int m_initSignalCount; PushEventQueue m_pushEventQueue; diff --git a/cpp/csp/python/PyNode.cpp b/cpp/csp/python/PyNode.cpp index ccad4e715..ba4ebbd7d 100644 --- a/cpp/csp/python/PyNode.cpp +++ b/cpp/csp/python/PyNode.cpp @@ -212,18 +212,17 @@ void PyNode::start() void PyNode::stop() { - PyObjectPtr rv = PyObjectPtr::own( PyObject_CallMethod( m_gen.ptr(), "close", nullptr ) ); - if( !rv.ptr() ) + if( this -> rootEngine() -> interrupted() && PyErr_CheckSignals() == -1 ) { - if( PyErr_Occurred() == PyExc_KeyboardInterrupt ) - { - PyErr_Clear(); - rv = PyObjectPtr::own( PyObject_CallMethod( m_gen.ptr(), "close", nullptr ) ); - } - - if( !rv.ptr() ) - CSP_THROW( PythonPassthrough, "" ); + // When an interrupt occurs a KeyboardInterrupt exception is raised in Python, which we need to clear + // before calling "close" on the generator. Else, the close method will fail due to the unhandled + // exception, and we lose the state of the generator before the "finally" block that calls stop() is executed. 
+ PyErr_Clear(); } + + PyObjectPtr rv = PyObjectPtr::own( PyObject_CallMethod( m_gen.ptr(), "close", nullptr ) ); + if( !rv.ptr() ) + CSP_THROW( PythonPassthrough, "" ); } PyNode * PyNode::create( PyEngine * pyengine, PyObject * inputs, PyObject * outputs, PyObject * gen ) diff --git a/cpp/csp/python/csptestlibimpl.cpp b/cpp/csp/python/csptestlibimpl.cpp index 23bd299f5..f5a750bf8 100644 --- a/cpp/csp/python/csptestlibimpl.cpp +++ b/cpp/csp/python/csptestlibimpl.cpp @@ -66,6 +66,37 @@ EXPORT_CPPNODE( start_n2_throw ); } +namespace interrupt_stop_test +{ + +using namespace csp::python; + +void setStatus( const DialectGenericType & obj_, int64_t idx ) +{ + PyObjectPtr obj = PyObjectPtr::own( toPython( obj_ ) ); + PyObjectPtr list = PyObjectPtr::own( PyObject_GetAttrString( obj.get(), "stopped" ) ); + PyList_SET_ITEM( list.get(), idx, Py_True ); +} + +DECLARE_CPPNODE( set_stop_index ) +{ + INIT_CPPNODE( set_stop_index ) {} + + SCALAR_INPUT( DialectGenericType, obj_ ); + SCALAR_INPUT( int64_t, idx ); + + START() {} + INVOKE() {} + + STOP() + { + setStatus( obj_, idx ); + } +}; +EXPORT_CPPNODE( set_stop_index ); + +} + } } @@ -73,6 +104,7 @@ EXPORT_CPPNODE( start_n2_throw ); // Test nodes REGISTER_CPPNODE( csp::cppnodes::testing::stop_start_test, start_n1_set_value ); REGISTER_CPPNODE( csp::cppnodes::testing::stop_start_test, start_n2_throw ); +REGISTER_CPPNODE( csp::cppnodes::testing::interrupt_stop_test, set_stop_index ); static PyModuleDef _csptestlibimpl_module = { PyModuleDef_HEAD_INIT, diff --git a/csp/tests/test_engine.py b/csp/tests/test_engine.py index 3ae54afbb..028b5d6f9 100644 --- a/csp/tests/test_engine.py +++ b/csp/tests/test_engine.py @@ -2064,6 +2064,60 @@ def g() -> ts[int]: csp.run(g, starttime=datetime(2020, 1, 1), endtime=timedelta()) self.assertTrue(status["started"] and status["stopped"]) + def test_interrupt_stops_all_nodes(self): + @csp.node + def n(l: list, idx: int): + with csp.stop(): + l[idx] = True + + @csp.node + def raise_interrupt(): + 
with csp.alarms(): + a = csp.alarm(bool) + with csp.start(): + csp.schedule_alarm(a, timedelta(seconds=1), True) + if csp.ticked(a): + import signal + os.kill(os.getpid(), signal.SIGINT) + + # Python nodes + @csp.graph + def g(l: list): + n(l, 0) + n(l, 1) + n(l, 2) + raise_interrupt() + + stopped = [False, False, False] + with self.assertRaises(KeyboardInterrupt): + csp.run(g, stopped, starttime=datetime.utcnow(), endtime=timedelta(seconds=60), realtime=True) + + for element in stopped: + self.assertTrue(element) + + # C++ nodes + class RTI: + def __init__(self): + self.stopped = [False, False, False] + + @csp.node(cppimpl=_csptestlibimpl.set_stop_index) + def n2(obj_: object, idx: int): + return + + @csp.graph + def g2(rti: RTI): + n2(rti, 0) + n2(rti, 1) + n2(rti, 2) + raise_interrupt() + + rti = RTI() + with self.assertRaises(KeyboardInterrupt): + csp.run(g2, rti, starttime=datetime.utcnow(), endtime=timedelta(seconds=60), realtime=True) + + for element in rti.stopped: + self.assertTrue(element) + if __name__ == "__main__": unittest.main() From 89eda4cf2d6143b4b851e32a3f29b568281fcac2 Mon Sep 17 00:00:00 2001 From: Will Rieger Date: Mon, 6 May 2024 12:09:36 -0400 Subject: [PATCH 15/27] fix @217 | add tests Signed-off-by: Will Rieger --- .../adapters/websocket/ClientInputAdapter.cpp | 2 +- .../adapters/websocket/ClientInputAdapter.h | 2 - .../adapters/websocket/ClientOutputAdapter.h | 1 - csp/tests/adapters/test_websocket.py | 83 +++++++++++++++++++ 4 files changed, 84 insertions(+), 4 deletions(-) create mode 100644 csp/tests/adapters/test_websocket.py diff --git a/cpp/csp/adapters/websocket/ClientInputAdapter.cpp b/cpp/csp/adapters/websocket/ClientInputAdapter.cpp index 91fe230c1..ad1db9afc 100644 --- a/cpp/csp/adapters/websocket/ClientInputAdapter.cpp +++ b/cpp/csp/adapters/websocket/ClientInputAdapter.cpp @@ -30,7 +30,7 @@ void ClientInputAdapter::processMessage( std::string payload, PushBatch* batch ) if( type() -> type() == CspType::Type::STRUCT ) { - auto 
tick = m_converter -> asStruct( &payload, payload.length() ); + auto tick = m_converter -> asStruct( (void*)payload.data(), payload.length() ); pushTick( std::move(tick), batch ); } else if ( type() -> type() == CspType::Type::STRING ) { diff --git a/cpp/csp/adapters/websocket/ClientInputAdapter.h b/cpp/csp/adapters/websocket/ClientInputAdapter.h index 93711bbf6..93ae2614b 100644 --- a/cpp/csp/adapters/websocket/ClientInputAdapter.h +++ b/cpp/csp/adapters/websocket/ClientInputAdapter.h @@ -1,8 +1,6 @@ #ifndef _IN_CSP_ADAPTERS_WEBSOCKETS_CLIENT_INPUTADAPTER_H #define _IN_CSP_ADAPTERS_WEBSOCKETS_CLIENT_INPUTADAPTER_H -#include -#include #include #include #include diff --git a/cpp/csp/adapters/websocket/ClientOutputAdapter.h b/cpp/csp/adapters/websocket/ClientOutputAdapter.h index 64726837e..865831c8f 100644 --- a/cpp/csp/adapters/websocket/ClientOutputAdapter.h +++ b/cpp/csp/adapters/websocket/ClientOutputAdapter.h @@ -5,7 +5,6 @@ #include #include #include -#include namespace csp::adapters::websocket { diff --git a/csp/tests/adapters/test_websocket.py b/csp/tests/adapters/test_websocket.py new file mode 100644 index 000000000..401133d22 --- /dev/null +++ b/csp/tests/adapters/test_websocket.py @@ -0,0 +1,83 @@ +import os +import pytz +import threading +import unittest +from datetime import datetime + +import csp +from csp import ts + +if os.environ.get("CSP_TEST_WEBSOCKET"): + import tornado.ioloop + import tornado.web + import tornado.websocket + + from csp.adapters.websocket import JSONTextMessageMapper, RawTextMessageMapper, Status, WebsocketAdapterManager + + class EchoWebsocketHandler(tornado.websocket.WebSocketHandler): + def on_message(self, msg): + return self.write_message(msg) + + +@unittest.skipIf(not os.environ.get("CSP_TEST_WEBSOCKET"), "Skipping websocket adapter tests") +class TestWebsocket(unittest.TestCase): + @classmethod + def setUpClass(cls): + cls.app = tornado.web.Application([(r"/ws", EchoWebsocketHandler)]) + cls.app.listen(8000) + cls.io_loop 
= tornado.ioloop.IOLoop.current() + cls.io_thread = threading.Thread(target=cls.io_loop.start) + cls.io_thread.start() + + @classmethod + def tearDownClass(cls): + cls.io_loop.add_callback(cls.io_loop.stop) + if cls.io_thread: + cls.io_thread.join() + + def test_send_recv_msg(self): + @csp.node + def send_msg_on_open(status: ts[Status]) -> ts[str]: + if csp.ticked(status): + return "Hello, World!" + + @csp.graph + def g(): + ws = WebsocketAdapterManager("ws://localhost:8000/ws") + status = ws.status() + ws.send(send_msg_on_open(status)) + recv = ws.subscribe(str, RawTextMessageMapper()) + + csp.add_graph_output("recv", recv) + csp.stop_engine(recv) + + msgs = csp.run(g, starttime=datetime.now(pytz.UTC), realtime=True) + assert len(msgs) == 1 + assert msgs["recv"][0][1] == "Hello, World!" + + def test_send_recv_json(self): + class MsgStruct(csp.Struct): + a: int + b: str + + @csp.node + def send_msg_on_open(status: ts[Status]) -> ts[str]: + if csp.ticked(status): + return MsgStruct(a=1234, b="im a string").to_json() + + @csp.graph + def g(): + ws = WebsocketAdapterManager("ws://localhost:8000/ws") + status = ws.status() + ws.send(send_msg_on_open(status)) + recv = ws.subscribe(MsgStruct, JSONTextMessageMapper()) + + csp.add_graph_output("recv", recv) + csp.stop_engine(recv) + + msgs = csp.run(g, starttime=datetime.now(pytz.UTC), realtime=True) + assert len(msgs) == 1 + obj = msgs["recv"][0][1] + assert isinstance(obj, MsgStruct) + assert obj.a == 1234 + assert obj.b == "im a string" From da9a84f126d063f1b014f1584dab22a60f4fd967 Mon Sep 17 00:00:00 2001 From: Rob Ambalu Date: Mon, 6 May 2024 12:20:01 -0400 Subject: [PATCH 16/27] minor bugfix to unroll cppimpl. Missing cast from vector value to ElemT, which for bool would be a vector value of unsigned char. 
This was triggering a CSP_ASSERT in debug builds Signed-off-by: Rob Ambalu --- cpp/csp/cppnodes/baselibimpl.cpp | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/cpp/csp/cppnodes/baselibimpl.cpp b/cpp/csp/cppnodes/baselibimpl.cpp index ab06c52cc..a0cd2a029 100644 --- a/cpp/csp/cppnodes/baselibimpl.cpp +++ b/cpp/csp/cppnodes/baselibimpl.cpp @@ -379,11 +379,11 @@ DECLARE_CPPNODE( unroll ) { size_t idx = 0; if( !s_pending ) - CSP_OUTPUT( v[idx++] ); + CSP_OUTPUT( static_cast<ElemT>( v[idx++] ) ); s_pending += sz - idx; for( ; idx < sz; ++idx ) - csp.schedule_alarm( alarm, TimeDelta::ZERO(), v[idx] ); + csp.schedule_alarm( alarm, TimeDelta::ZERO(), static_cast<ElemT>( v[idx] ) ); } } From efe0fd6ce2add81a0bb24d6fb61056bb2499236d Mon Sep 17 00:00:00 2001 From: Tim Paine Date: Fri, 3 May 2024 10:02:57 -0400 Subject: [PATCH 17/27] Add format check to lint step Signed-off-by: Tim Paine --- Makefile | 4 +--- docs/wiki/dev-guides/Build-CSP-from-Source.md | 1 + 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/Makefile b/Makefile index 2b0177ef3..77c0fbe71 100644 --- a/Makefile +++ b/Makefile @@ -38,6 +38,7 @@ install: ## install library lint-py: python -m isort --check csp/ examples/ setup.py python -m ruff check csp/ examples/ setup.py + python -m ruff format --check csp/ examples/ setup.py lint-cpp: # clang-format --dry-run -Werror -i -style=file `find ./cpp/ -name "*.*pp"` @@ -191,9 +192,6 @@ dependencies-debian: ## install dependencies for linux dependencies-fedora: ## install dependencies for linux yum install -y automake bison ccache cmake curl flex perl-IPC-Cmd tar unzip zip -dependencies-alma: ## install dependencies for linux - dnf install -y automake bison ccache cmake curl flex perl-IPC-Cmd tar unzip zip - dependencies-vcpkg: ## install dependnecies via vcpkg cd vcpkg && ./bootstrap-vcpkg.sh && ./vcpkg install diff --git a/docs/wiki/dev-guides/Build-CSP-from-Source.md b/docs/wiki/dev-guides/Build-CSP-from-Source.md index 3ecec0be1..0ccaaf424 
100644 --- a/docs/wiki/dev-guides/Build-CSP-from-Source.md +++ b/docs/wiki/dev-guides/Build-CSP-from-Source.md @@ -227,6 +227,7 @@ make lint-py # or # python -m isort --check csp/ setup.py # python -m ruff check csp/ setup.py +# python -m ruff format --check csp/ setup.py ``` **Python Autoformatting** From da4d5e359c7a5d5ceccce0ef0aa921bfe7c50136 Mon Sep 17 00:00:00 2001 From: Tim Paine Date: Fri, 3 May 2024 10:03:12 -0400 Subject: [PATCH 18/27] fix format changes that will now result in lint failures Signed-off-by: Tim Paine --- csp/tests/impl/test_struct.py | 305 +++++++++++++++++++++++----------- csp/tests/test_engine.py | 9 +- 2 files changed, 212 insertions(+), 102 deletions(-) diff --git a/csp/tests/impl/test_struct.py b/csp/tests/impl/test_struct.py index 0920aa5cc..6f76339a9 100644 --- a/csp/tests/impl/test_struct.py +++ b/csp/tests/impl/test_struct.py @@ -171,20 +171,92 @@ def __init__(self, x: int): # items[:-2] are normal values of the given type that should be handled, # items[-2] is a normal value for non-generic and non-str types and None for generic and str types (the purpose is to test the raise of TypeError if a single object instead of a sequence is passed), # items[-1] is a value of a different type that is not convertible to the give type for non-generic types and None for generic types (the purpose is to test the raise of TypeError if an object of the wrong type is passed). 
-pystruct_list_test_values = { - int : [4, 2, 3, 5, 6, 7, 8, 's'], +pystruct_list_test_values = { + int: [4, 2, 3, 5, 6, 7, 8, "s"], bool: [True, True, True, False, True, False, True, 2], - float: [1.4, 3.2, 2.7, 1.0, -4.5, -6.0, -2.0, 's'], - datetime: [datetime(2022, 12, 6, 1, 2, 3), datetime(2022, 12, 7, 2, 2, 3), datetime(2022, 12, 8, 3, 2, 3), datetime(2022, 12, 9, 4, 2, 3), datetime(2022, 12, 10, 5, 2, 3), datetime(2022, 12, 11, 6, 2, 3), datetime(2022, 12, 13, 7, 2, 3), timedelta(seconds=.123)], - timedelta: [timedelta(seconds=.123), timedelta(seconds=12), timedelta(seconds=1), timedelta(seconds=.5), timedelta(seconds=123), timedelta(seconds=70), timedelta(seconds=700), datetime(2022, 12, 8, 3, 2, 3)], - date: [date(2022, 12, 6), date(2022, 12, 7), date(2022, 12, 8), date(2022, 12, 9), date(2022, 12, 10), date(2022, 12, 11), date(2022, 12, 13), timedelta(seconds=.123)], - time: [time(1, 2, 3), time(2, 2, 3), time(3, 2, 3), time(4, 2, 3), time(5, 2, 3), time(6, 2, 3), time(7, 2, 3), timedelta(seconds=.123)], - str : ['s', 'pqr', 'masd', 'wes', 'as', 'm', None, 5], - csp.Struct: [SimpleStruct(a = 1), AnotherSimpleStruct(b = 'sd'), SimpleStruct(a = 3), AnotherSimpleStruct(b = 'sdf'), SimpleStruct(a = -4), SimpleStruct(a = 5), SimpleStruct(a = 7), 4], # untyped struct list - SimpleStruct: [SimpleStruct(a = 1), SimpleStruct(a = 3), SimpleStruct(a = -1), SimpleStruct(a = -4), SimpleStruct(a = 5), SimpleStruct(a = 100), SimpleStruct(a = 1200), AnotherSimpleStruct(b = 'sd')], - SimpleEnum: [SimpleEnum.A, SimpleEnum.C, SimpleEnum.B, SimpleEnum.B, SimpleEnum.B, SimpleEnum.C, SimpleEnum.C, AnotherSimpleEnum.D], + float: [1.4, 3.2, 2.7, 1.0, -4.5, -6.0, -2.0, "s"], + datetime: [ + datetime(2022, 12, 6, 1, 2, 3), + datetime(2022, 12, 7, 2, 2, 3), + datetime(2022, 12, 8, 3, 2, 3), + datetime(2022, 12, 9, 4, 2, 3), + datetime(2022, 12, 10, 5, 2, 3), + datetime(2022, 12, 11, 6, 2, 3), + datetime(2022, 12, 13, 7, 2, 3), + timedelta(seconds=0.123), + ], + timedelta: [ + 
timedelta(seconds=0.123), + timedelta(seconds=12), + timedelta(seconds=1), + timedelta(seconds=0.5), + timedelta(seconds=123), + timedelta(seconds=70), + timedelta(seconds=700), + datetime(2022, 12, 8, 3, 2, 3), + ], + date: [ + date(2022, 12, 6), + date(2022, 12, 7), + date(2022, 12, 8), + date(2022, 12, 9), + date(2022, 12, 10), + date(2022, 12, 11), + date(2022, 12, 13), + timedelta(seconds=0.123), + ], + time: [ + time(1, 2, 3), + time(2, 2, 3), + time(3, 2, 3), + time(4, 2, 3), + time(5, 2, 3), + time(6, 2, 3), + time(7, 2, 3), + timedelta(seconds=0.123), + ], + str: ["s", "pqr", "masd", "wes", "as", "m", None, 5], + csp.Struct: [ + SimpleStruct(a=1), + AnotherSimpleStruct(b="sd"), + SimpleStruct(a=3), + AnotherSimpleStruct(b="sdf"), + SimpleStruct(a=-4), + SimpleStruct(a=5), + SimpleStruct(a=7), + 4, + ], # untyped struct list + SimpleStruct: [ + SimpleStruct(a=1), + SimpleStruct(a=3), + SimpleStruct(a=-1), + SimpleStruct(a=-4), + SimpleStruct(a=5), + SimpleStruct(a=100), + SimpleStruct(a=1200), + AnotherSimpleStruct(b="sd"), + ], + SimpleEnum: [ + SimpleEnum.A, + SimpleEnum.C, + SimpleEnum.B, + SimpleEnum.B, + SimpleEnum.B, + SimpleEnum.C, + SimpleEnum.C, + AnotherSimpleEnum.D, + ], list: [[1], [1, 2, 1], [6], [8, 3, 5], [3], [11, 8], None, None], # generic type list - SimpleClass: [SimpleClass(x = 1), SimpleClass(x = 5), SimpleClass(x = 9), SimpleClass(x = -1), SimpleClass(x = 2), SimpleClass(x = 3), None, None], # generic type user-defined + SimpleClass: [ + SimpleClass(x=1), + SimpleClass(x=5), + SimpleClass(x=9), + SimpleClass(x=-1), + SimpleClass(x=2), + SimpleClass(x=3), + None, + None, + ], # generic type user-defined } @@ -705,8 +777,8 @@ def __init__(self, iterable=None): class StructWithListDerivedType(csp.Struct): ldt: ListDerivedType - s1 = StructWithListDerivedType(ldt=ListDerivedType([1,2])) - self.assertTrue(isinstance(s1.to_dict()['ldt'], ListDerivedType)) + s1 = StructWithListDerivedType(ldt=ListDerivedType([1, 2])) + 
self.assertTrue(isinstance(s1.to_dict()["ldt"], ListDerivedType)) s2 = StructWithListDerivedType.from_dict(s1.to_dict()) self.assertEqual(s1, s2) @@ -1813,14 +1885,15 @@ def custom_jsonifier(obj): json.loads(test_struct.to_json(custom_jsonifier)) def test_list_field_append(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [] ) + + s = A(a=[]) s.a.append(v[0]) - + self.assertEqual(s.a, [v[0]]) s.a.append(v[1]) @@ -1834,14 +1907,15 @@ class A(csp.Struct): s.a.append(v[-1]) def test_list_field_insert(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [] ) + + s = A(a=[]) s.a.insert(0, v[0]) - + self.assertEqual(s.a, [v[0]]) s.a.insert(1, v[1]) @@ -1864,19 +1938,20 @@ class A(csp.Struct): s.a.insert(-1, v[-1]) def test_list_field_pop(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A(a = [v[0], v[1], v[2], v[3], v[4]]) + + s = A(a=[v[0], v[1], v[2], v[3], v[4]]) b = s.a.pop() - + self.assertEqual(s.a, [v[0], v[1], v[2], v[3]]) self.assertEqual(b, v[4]) b = s.a.pop(-1) - + self.assertEqual(s.a, [v[0], v[1], v[2]]) self.assertEqual(b, v[3]) @@ -1884,13 +1959,13 @@ class A(csp.Struct): self.assertEqual(s.a, [v[0], v[2]]) self.assertEqual(b, v[1]) - + with self.assertRaises(IndexError) as e: 
s.a.pop() s.a.pop() s.a.pop() - - s = A(a = [v[0], v[1], v[2], v[3], v[4]]) + + s = A(a=[v[0], v[1], v[2], v[3], v[4]]) b = s.a.pop(-3) @@ -1904,14 +1979,15 @@ class A(csp.Struct): s.a.pop(4) def test_list_field_set_item(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2]] ) + + s = A(a=[v[0], v[1], v[2]]) s.a.__setitem__(0, v[3]) - + self.assertEqual(s.a, [v[3], v[1], v[2]]) s.a[1] = v[4] @@ -1927,7 +2003,7 @@ class A(csp.Struct): with self.assertRaises(IndexError) as e: s.a[-100] = v[0] - + s.a[5:6] = [v[0], v[1], v[2]] self.assertEqual(s.a, [v[3], v[4], v[5], v[0], v[1], v[2]]) @@ -1944,7 +2020,7 @@ class A(csp.Struct): self.assertEqual(s.a, [v[3], v[1], v[2], v[2], v[5]]) - # Check if not str or generic type (as str is a sequence of str) + # Check if not str or generic type (as str is a sequence of str) if v[-2] is not None: with self.assertRaises(TypeError) as e: s.a[1:4] = v[-2] @@ -1964,41 +2040,67 @@ class A(csp.Struct): self.assertEqual(s.a, [v[3], v[1], v[2], v[2], v[5]]) def test_list_field_reverse(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2], v[3]] ) + + s = A(a=[v[0], v[1], v[2], v[3]]) s.a.reverse() - + self.assertEqual(s.a, [v[3], v[2], v[1], v[0]]) - + def test_list_field_sort(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in 
python""" # Not using pystruct_list_test_values, as sort() tests are of different semantics (order and sorting key existance matters). - values = { - int : [1, 5, 2, 2, -1, -5, 's'], - float: [1.4, 5.2, 2.7, 2.7, -1.4, -5.2, 's'], - datetime: [datetime(2022, 12, 6, 1, 2, 3), datetime(2022, 12, 8, 3, 2, 3), datetime(2022, 12, 7, 2, 2, 3), datetime(2022, 12, 7, 2, 2, 3), datetime(2022, 12, 5, 2, 2, 3), datetime(2022, 12, 3, 2, 2, 3), None], - timedelta: [timedelta(seconds=1), timedelta(seconds=123), timedelta(seconds=12), timedelta(seconds=12), timedelta(seconds=.1), timedelta(seconds=.01), None], - date: [date(2022, 12, 6), date(2022, 12, 8), date(2022, 12, 7), date(2022, 12, 7), date(2022, 12, 5), date(2022, 12, 3), None], + values = { + int: [1, 5, 2, 2, -1, -5, "s"], + float: [1.4, 5.2, 2.7, 2.7, -1.4, -5.2, "s"], + datetime: [ + datetime(2022, 12, 6, 1, 2, 3), + datetime(2022, 12, 8, 3, 2, 3), + datetime(2022, 12, 7, 2, 2, 3), + datetime(2022, 12, 7, 2, 2, 3), + datetime(2022, 12, 5, 2, 2, 3), + datetime(2022, 12, 3, 2, 2, 3), + None, + ], + timedelta: [ + timedelta(seconds=1), + timedelta(seconds=123), + timedelta(seconds=12), + timedelta(seconds=12), + timedelta(seconds=0.1), + timedelta(seconds=0.01), + None, + ], + date: [ + date(2022, 12, 6), + date(2022, 12, 8), + date(2022, 12, 7), + date(2022, 12, 7), + date(2022, 12, 5), + date(2022, 12, 3), + None, + ], time: [time(5, 2, 3), time(7, 2, 3), time(6, 2, 3), time(6, 2, 3), time(4, 2, 3), time(3, 2, 3), None], - str : ['s', 'xyz', 'w', 'w', 'bds', 'a', None], + str: ["s", "xyz", "w", "w", "bds", "a", None], } for typ, v in values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2], v[3], v[4], v[5]] ) - + + s = A(a=[v[0], v[1], v[2], v[3], v[4], v[5]]) + s.a.sort() - + self.assertEqual(s.a, [v[5], v[4], v[0], v[2], v[3], v[1]]) s.a.sort(reverse=True) - + self.assertEqual(s.a, [v[1], v[2], v[3], v[0], v[4], v[5]]) with self.assertRaises(TypeError) as e: @@ -2012,16 +2114,17 @@ class 
A(csp.Struct): s.a.sort(key=abs) self.assertEqual(s.a, [v[0], v[4], v[2], v[3], v[1], v[5]]) - + def test_list_field_extend(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2]] ) + + s = A(a=[v[0], v[1], v[2]]) s.a.extend([v[3]]) - + self.assertEqual(s.a, [v[0], v[1], v[2], v[3]]) s.a.extend([]) @@ -2029,25 +2132,26 @@ class A(csp.Struct): self.assertEqual(s.a, [v[0], v[1], v[2], v[3], v[4], v[5]]) - # Check if not str or generic type (as str is a sequence of str) + # Check if not str or generic type (as str is a sequence of str) if v[-2] is not None: with self.assertRaises(TypeError) as e: s.a.extend(v[-2]) - - # Check if not generic type + + # Check if not generic type if v[-1] is not None: with self.assertRaises(TypeError) as e: s.a.extend([v[-1]]) - + def test_list_field_remove(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[0], v[2]] ) + + s = A(a=[v[0], v[1], v[0], v[2]]) s.a.remove(v[0]) - + self.assertEqual(s.a, [v[1], v[0], v[2]]) s.a.remove(v[2]) @@ -2058,32 +2162,34 @@ class A(csp.Struct): s.a.remove(v[3]) def test_list_field_clear(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2], v[3]] ) + + s = A(a=[v[0], v[1], 
v[2], v[3]]) s.a.clear() - + self.assertEqual(s.a, []) - + def test_list_field_del(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2], v[3]] ) + + s = A(a=[v[0], v[1], v[2], v[3]]) del s.a[0] - + self.assertEqual(s.a, [v[1], v[2], v[3]]) del s.a[1] self.assertEqual(s.a, [v[1], v[3]]) - s = A( a = [v[0], v[1], v[2], v[3]] ) + s = A(a=[v[0], v[1], v[2], v[3]]) del s.a[1:3] self.assertEqual(s.a, [v[0], v[3]]) @@ -2094,16 +2200,17 @@ class A(csp.Struct): with self.assertRaises(IndexError) as e: del s.a[5] - + def test_list_field_inplace_concat(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1]] ) + + s = A(a=[v[0], v[1]]) s.a.__iadd__([v[2], v[3]]) - + self.assertEqual(s.a, [v[0], v[1], v[2], v[3]]) s.a += (v[4], v[5]) @@ -2117,22 +2224,23 @@ class A(csp.Struct): with self.assertRaises(TypeError) as e: s.a += v[-1] - # Check if not generic type + # Check if not generic type if v[-1] is not None: with self.assertRaises(TypeError) as e: s.a += [v[-1]] - + self.assertEqual(s.a, [v[0], v[1], v[2], v[3], v[4], v[5]]) - + def test_list_field_inplace_repeat(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1]] ) + + s = A(a=[v[0], v[1]]) 
s.a.__imul__(1) - + self.assertEqual(s.a, [v[0], v[1]]) s.a *= 2 @@ -2143,10 +2251,10 @@ class A(csp.Struct): s.a *= [3] with self.assertRaises(TypeError) as e: - s.a *= 's' - + s.a *= "s" + s.a *= 0 - + self.assertEqual(s.a, []) s.a += [v[2], v[3]] @@ -2154,18 +2262,19 @@ class A(csp.Struct): self.assertEqual(s.a, [v[2], v[3]]) s.a *= -1 - + self.assertEqual(s.a, []) - + def test_list_field_lifetime(self): - '''Ensure that the lifetime of PyStructList field exceeds the lifetime of struct holding it''' + """Ensure that the lifetime of PyStructList field exceeds the lifetime of struct holding it""" + class A(csp.Struct): a: [int] - - s = A( a = [1, 2, 3] ) + + s = A(a=[1, 2, 3]) l = s.a del s - + self.assertEqual(l, [1, 2, 3]) diff --git a/csp/tests/test_engine.py b/csp/tests/test_engine.py index 028b5d6f9..4a2998f87 100644 --- a/csp/tests/test_engine.py +++ b/csp/tests/test_engine.py @@ -2078,8 +2078,9 @@ def raise_interrupt(): csp.schedule_alarm(a, timedelta(seconds=1), True) if csp.ticked(a): import signal + os.kill(os.getpid(), signal.SIGINT) - + # Python nodes @csp.graph def g(l: list): @@ -2094,12 +2095,12 @@ def g(l: list): for element in stopped: self.assertTrue(element) - + # C++ nodes class RTI: def __init__(self): self.stopped = [False, False, False] - + @csp.node(cppimpl=_csptestlibimpl.set_stop_index) def n2(obj_: object, idx: int): return @@ -2114,7 +2115,7 @@ def g2(rti: RTI): rti = RTI() with self.assertRaises(KeyboardInterrupt): csp.run(g2, rti, starttime=datetime.utcnow(), endtime=timedelta(seconds=60), realtime=True) - + for element in rti.stopped: self.assertTrue(element) From 7441fb552e74ebdb7742ffa6831873679f257f98 Mon Sep 17 00:00:00 2001 From: Rob Ambalu Date: Mon, 6 May 2024 15:22:48 -0400 Subject: [PATCH 19/27] Add build-debug option to Makefile so we dont forget the proper incantations (#222) * Add build-debug option to Makefile so we dont forget the proper incantations Signed-off-by: Rob Ambalu --- Makefile | 5 ++++- 1 file changed, 4 
insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 2b0177ef3..86316ca64 100644 --- a/Makefile +++ b/Makefile @@ -11,7 +11,7 @@ endif ######### # BUILD # ######### -.PHONY: requirements develop build build-conda install +.PHONY: requirements develop build build-debug build-conda install requirements: ## install python dev and runtime dependencies python -m pip install toml @@ -24,6 +24,9 @@ develop: requirements ## install dependencies and build library build: ## build the library python setup.py build build_ext --inplace -- -- -j$(NPROC) +build-debug: ## build the library ( DEBUG ) - May need a make clean when switching from regular build to build-debug and vice versa + SKBUILD_CONFIGURE_OPTIONS="" DEBUG=1 python setup.py build build_ext --inplace -- -- -j$(NPROC) + build-conda: ## build the library in Conda CSP_USE_VCPKG=0 python setup.py build build_ext --inplace -- -- -j$(NPROC) From d9ac41d5d2843a5ad5b5562f5a1daa6f22c431d5 Mon Sep 17 00:00:00 2001 From: Tim Paine <3105306+timkpaine@users.noreply.github.com> Date: Tue, 7 May 2024 18:36:59 -0400 Subject: [PATCH 20/27] Pin linters to narrow range to avoid noise Signed-off-by: Tim Paine <3105306+timkpaine@users.noreply.github.com> --- conda/dev-environment-unix.yml | 6 +++--- pyproject.toml | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/conda/dev-environment-unix.yml b/conda/dev-environment-unix.yml index b520d6402..490ae2036 100644 --- a/conda/dev-environment-unix.yml +++ b/conda/dev-environment-unix.yml @@ -8,7 +8,7 @@ dependencies: - build - bump2version>=1 - cmake - - codespell + - codespell>=2.2.6,<2.3 - compilers - cyrus-sasl - exprtk @@ -23,7 +23,7 @@ dependencies: - libboost-headers - lz4-c - mamba - - mdformat + - mdformat>=0.7.17,<0.8 - ninja - numpy - pillow @@ -43,7 +43,7 @@ dependencies: - rapidjson - requests - ruamel.yaml - - ruff + - ruff>=0.3,<0.4 - scikit-build - slack-sdk - sqlalchemy diff --git a/pyproject.toml b/pyproject.toml index dafa366c6..e19191af5 
100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -60,9 +60,9 @@ develop = [ "twine", "wheel", # lint - "codespell", + "codespell>=2.2.6,<2.3", "isort>=5,<6", - "mdformat", + "mdformat>=0.7.17,<0.8", "ruff>=0.3,<0.4", # test "pytest", From 086c9d5cc7c018c83d15f466be0b3101aff2410e Mon Sep 17 00:00:00 2001 From: Tim Paine Date: Fri, 3 May 2024 10:03:30 -0400 Subject: [PATCH 21/27] Add placeholder block to build action for service tests (in another PR) Signed-off-by: Tim Paine --- .github/workflows/build.yml | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index 69697d244..9e4ff3778 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -625,6 +625,20 @@ jobs: run: make test if: ${{ contains( 'numpy', matrix.package )}} + ########################### + #~~~~~~~~~~~~~~~~~~~~~~~~~# + #~~~~~~|#############|~~~~# + #~~~~~~|#|~~~~~~~/##/~~~~~# + #~~~~~~|#|~~~~~/##/~~~~~~~# + #~~~~~~~~~~~~/##/~~~~~~~~~# + #~~~~~~~~~~/##/~~~~~~~~~~~# + #~~~~~~~~/##/~~~~~~~~~~~~~# + #~~~~~~/##/~~~~~~~~~~~~~~~# + #~~~~~~~~~~~~~~~~~~~~~~~~~# + # Test Service Adapters # + #~~~~~~~~~~~~~~~~~~~~~~~~~# + # Coming soon! 
+ ############################# #~~~~~~~~~~~~~~~~~~~~~~~~~~~# #~~~~~~|#############|~~~~~~# From 8a0d881986486d053cd36d41b5c0e862120e941b Mon Sep 17 00:00:00 2001 From: Adam Glustein Date: Tue, 7 May 2024 17:29:26 -0400 Subject: [PATCH 22/27] Remove all caching code from CSP (#213) Signed-off-by: Adam Glustein --- csp/__init__.py | 3 +- csp/baselib.py | 5 +- csp/cache_support.py | 16 - csp/impl/config.py | 51 - csp/impl/managed_dataset/__init__.py | 0 .../aggregation_period_utils.py | 87 - .../cache_partition_argument_serializer.py | 101 - .../cache_user_custom_object_serializer.py | 11 - csp/impl/managed_dataset/dataset_metadata.py | 85 - .../managed_dataset/dateset_name_constants.py | 4 - csp/impl/managed_dataset/datetime_utils.py | 8 - csp/impl/managed_dataset/managed_dataset.py | 487 ---- .../managed_dataset_lock_file_util.py | 111 - .../managed_dataset_merge_utils.py | 431 --- .../managed_dataset_path_resolver.py | 470 ---- .../managed_dataset/managed_parquet_writer.py | 340 --- csp/impl/mem_cache.py | 14 +- csp/impl/types/instantiation_type_resolver.py | 29 +- csp/impl/wiring/base_parser.py | 17 +- csp/impl/wiring/cache_support/__init__.py | 0 .../cache_support/cache_config_resolver.py | 22 - .../wiring/cache_support/cache_type_mapper.py | 55 - .../dataset_partition_cached_data.py | 662 ----- .../wiring/cache_support/graph_building.py | 745 ----- .../partition_files_container.py | 99 - .../cache_support/runtime_cache_manager.py | 74 - csp/impl/wiring/context.py | 19 +- csp/impl/wiring/graph.py | 320 +-- csp/impl/wiring/graph_parser.py | 72 +- csp/impl/wiring/node.py | 35 +- csp/impl/wiring/node_parser.py | 14 +- csp/impl/wiring/outputs.py | 15 - csp/impl/wiring/runtime.py | 37 +- csp/impl/wiring/signature.py | 1 - csp/impl/wiring/special_output_names.py | 3 - csp/impl/wiring/threaded_runtime.py | 4 - csp/tests/impl/test_struct.py | 305 ++- csp/tests/test_caching.py | 2438 ----------------- csp/tests/test_engine.py | 27 +- csp/tests/test_parsing.py | 33 +- 40 
files changed, 261 insertions(+), 6989 deletions(-) delete mode 100644 csp/cache_support.py delete mode 100644 csp/impl/config.py delete mode 100644 csp/impl/managed_dataset/__init__.py delete mode 100644 csp/impl/managed_dataset/aggregation_period_utils.py delete mode 100644 csp/impl/managed_dataset/cache_partition_argument_serializer.py delete mode 100644 csp/impl/managed_dataset/cache_user_custom_object_serializer.py delete mode 100644 csp/impl/managed_dataset/dataset_metadata.py delete mode 100644 csp/impl/managed_dataset/dateset_name_constants.py delete mode 100644 csp/impl/managed_dataset/datetime_utils.py delete mode 100644 csp/impl/managed_dataset/managed_dataset.py delete mode 100644 csp/impl/managed_dataset/managed_dataset_lock_file_util.py delete mode 100644 csp/impl/managed_dataset/managed_dataset_merge_utils.py delete mode 100644 csp/impl/managed_dataset/managed_dataset_path_resolver.py delete mode 100644 csp/impl/managed_dataset/managed_parquet_writer.py delete mode 100644 csp/impl/wiring/cache_support/__init__.py delete mode 100644 csp/impl/wiring/cache_support/cache_config_resolver.py delete mode 100644 csp/impl/wiring/cache_support/cache_type_mapper.py delete mode 100644 csp/impl/wiring/cache_support/dataset_partition_cached_data.py delete mode 100644 csp/impl/wiring/cache_support/graph_building.py delete mode 100644 csp/impl/wiring/cache_support/partition_files_container.py delete mode 100644 csp/impl/wiring/cache_support/runtime_cache_manager.py delete mode 100644 csp/tests/test_caching.py diff --git a/csp/__init__.py b/csp/__init__.py index b7cc55c17..05a05a682 100644 --- a/csp/__init__.py +++ b/csp/__init__.py @@ -4,7 +4,6 @@ from csp.curve import curve from csp.dataframe import DataFrame from csp.impl.builtin_functions import * -from csp.impl.config import Config from csp.impl.constants import UNSET from csp.impl.enum import DynamicEnum, Enum from csp.impl.error_handling import set_print_full_exception_stack @@ -30,7 +29,7 @@ from csp.math 
import * from csp.showgraph import show_graph -from . import cache_support, stats +from . import stats __version__ = "0.0.3" diff --git a/csp/baselib.py b/csp/baselib.py index 090520441..fb2593c81 100644 --- a/csp/baselib.py +++ b/csp/baselib.py @@ -283,10 +283,7 @@ def get_basket_field(dict_basket: {"K": ts["V"]}, field_name: str) -> OutputBask :param field_name: :return: """ - if isinstance(dict_basket, csp.impl.wiring.cache_support.graph_building.WrappedCachedStructBasket): - return dict_basket.get_basket_field(field_name) - else: - return {k: getattr(v, field_name) for k, v in dict_basket.items()} + return {k: getattr(v, field_name) for k, v in dict_basket.items()} @node(cppimpl=_cspbaselibimpl.sample) diff --git a/csp/cache_support.py b/csp/cache_support.py deleted file mode 100644 index 1ff9ab5a0..000000000 --- a/csp/cache_support.py +++ /dev/null @@ -1,16 +0,0 @@ -from csp.impl.config import BaseCacheConfig, CacheCategoryConfig, CacheConfig -from csp.impl.managed_dataset.cache_user_custom_object_serializer import CacheObjectSerializer -from csp.impl.managed_dataset.dataset_metadata import TimeAggregation -from csp.impl.wiring import GraphCacheOptions, NoCachedDataException -from csp.impl.wiring.cache_support.cache_config_resolver import CacheConfigResolver - -__all__ = [ - "BaseCacheConfig", - "CacheCategoryConfig", - "CacheConfig", - "CacheConfigResolver", - "CacheObjectSerializer", - "GraphCacheOptions", - "NoCachedDataException", - "TimeAggregation", -] diff --git a/csp/impl/config.py b/csp/impl/config.py deleted file mode 100644 index a44145ccf..000000000 --- a/csp/impl/config.py +++ /dev/null @@ -1,51 +0,0 @@ -from typing import Dict, List - -from csp.impl.managed_dataset.cache_user_custom_object_serializer import CacheObjectSerializer -from csp.impl.struct import Struct -from csp.utils.file_permissions import FilePermissions, RWXPermissions - - -class BaseCacheConfig(Struct): - data_folder: str - read_folders: List[str] # Additional read folders from 
which the data should be read if available - lock_file_permissions: FilePermissions = FilePermissions( - user_permissions=RWXPermissions.READ | RWXPermissions.WRITE, - group_permissions=RWXPermissions.READ | RWXPermissions.WRITE, - others_permissions=RWXPermissions.READ | RWXPermissions.WRITE, - ) - data_file_permissions: FilePermissions = FilePermissions( - user_permissions=RWXPermissions.READ | RWXPermissions.WRITE, - group_permissions=RWXPermissions.READ, - others_permissions=RWXPermissions.READ, - ) - merge_existing_files: bool = True - - -class CacheCategoryConfig(BaseCacheConfig): - category: List[str] - - -class CacheConfig(BaseCacheConfig): - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - if not hasattr(self, "cache_serializers"): - self.cache_serializers = {} - - allow_overwrite: bool - # An optional override of output folders by category - # For example: - # category_overrides = [ - # CacheCategoryConfig(category=['forecasts'], data_folder='possibly_group_cached_forecasts_path'), - # CacheCategoryConfig(category=['forecasts', 'active_research'], data_folder='possibly_user_specific_forecasts_paths'), - # ] - # All forecasts except for forecasts that are under active_research will be read from/written to possibly_group_cached_forecasts_path. - # It would commonly be a path that is shared by the research team. On the other hand all forecasts under active_research will be written - # to possibly_user_specific_forecasts_paths which can be a private path of the current user that currently researching the forecast and - # needs to redump it often - it's not ready to share with the team yet. 
- category_overrides: List[CacheCategoryConfig] - graph_overrides: Dict[object, BaseCacheConfig] - cache_serializers: Dict[type, CacheObjectSerializer] - - -class Config(Struct): - cache_config: CacheConfig diff --git a/csp/impl/managed_dataset/__init__.py b/csp/impl/managed_dataset/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/csp/impl/managed_dataset/aggregation_period_utils.py b/csp/impl/managed_dataset/aggregation_period_utils.py deleted file mode 100644 index bf85d805b..000000000 --- a/csp/impl/managed_dataset/aggregation_period_utils.py +++ /dev/null @@ -1,87 +0,0 @@ -import datetime -import glob -import os - -from csp.impl.managed_dataset.dataset_metadata import TimeAggregation - - -class AggregationPeriodUtils: - _AGG_LEVELS_GLOB_EXPRESSIONS = { - TimeAggregation.DAY: ["[0-9]" * 4, "[0-9]" * 2, "[0-9]" * 2], - TimeAggregation.MONTH: ["[0-9]" * 4, "[0-9]" * 2], - TimeAggregation.QUARTER: ["[0-9]" * 4, "Q[0-9]"], - TimeAggregation.YEAR: ["[0-9]" * 4], - } - - def __init__(self, aggregation_period: TimeAggregation): - self._aggregation_period = aggregation_period - - def resolve_period_start(self, cur_time: datetime.datetime): - if self._aggregation_period == TimeAggregation.DAY: - return datetime.datetime(cur_time.year, cur_time.month, cur_time.day) - elif self._aggregation_period == TimeAggregation.MONTH: - return datetime.datetime(cur_time.year, cur_time.month, 1) - elif self._aggregation_period == TimeAggregation.QUARTER: - return datetime.datetime(cur_time.year, ((cur_time.month - 1) // 3) * 3 + 1, 1) - elif self._aggregation_period == TimeAggregation.YEAR: - return datetime.datetime(cur_time.year, 1, 1) - else: - raise RuntimeError(f"Unsupported aggregation period {self._aggregation_period}") - - def resolve_period_end(self, cur_time: datetime.datetime, exclusive_end=True): - if self._aggregation_period == TimeAggregation.DAY: - res = datetime.datetime(cur_time.year, cur_time.month, cur_time.day) + 
datetime.timedelta(days=1) - elif self._aggregation_period == TimeAggregation.MONTH: - next_month_date = cur_time + datetime.timedelta(days=32 - cur_time.day) - res = datetime.datetime(next_month_date.year, next_month_date.month, 1) - elif self._aggregation_period == TimeAggregation.QUARTER: - extra_months = (3 - cur_time.month) % 3 - next_quarter_date = cur_time + datetime.timedelta(days=31 * extra_months + 32 - cur_time.day) - res = datetime.datetime(next_quarter_date.year, next_quarter_date.month, 1) - elif self._aggregation_period == TimeAggregation.YEAR: - res = datetime.datetime(cur_time.year + 1, 1, 1) - else: - raise RuntimeError(f"Unsupported aggregation period {self._aggregation_period}") - if not exclusive_end: - res -= datetime.timedelta(microseconds=1) - return res - - def resolve_period_start_end(self, cur_time: datetime.datetime, exclusive_end=True): - return self.resolve_period_start(cur_time), self.resolve_period_end(cur_time, exclusive_end=exclusive_end) - - def get_sub_folder_name(self, cur_time: datetime.datetime): - if self._aggregation_period == TimeAggregation.DAY: - return cur_time.strftime("%Y/%m/%d") - elif self._aggregation_period == TimeAggregation.MONTH: - return cur_time.strftime("%Y/%m") - elif self._aggregation_period == TimeAggregation.QUARTER: - quarter_index = (cur_time.month - 1) // 3 + 1 - return cur_time.strftime(f"%Y/Q{quarter_index}") - elif self._aggregation_period == TimeAggregation.YEAR: - return cur_time.strftime("%Y") - else: - raise RuntimeError(f"Unsupported aggregation period {self._aggregation_period}") - - def iterate_periods_in_date_range(self, start_time: datetime.datetime, end_time: datetime.datetime): - assert start_time <= end_time - period_start, period_end = self.resolve_period_start_end(start_time) - while period_start <= end_time: - yield period_start, period_end - period_start = period_end - period_end = self.resolve_period_end(period_start) - - def get_agg_bound_folder(self, root_folder: str, 
is_starttime: bool): - """Return the first/last partition folder for the dataset - :param root_folder: - :param is_starttime: - :return: - """ - glob_expressions = self._AGG_LEVELS_GLOB_EXPRESSIONS[self._aggregation_period] - cur = root_folder - ind = 0 if is_starttime else -1 - for glob_exp in glob_expressions: - cur_list = sorted(glob.glob(os.path.join(glob.escape(cur), glob_exp))) - if not cur_list: - return None - cur = os.path.join(cur, cur_list[ind]) - return cur diff --git a/csp/impl/managed_dataset/cache_partition_argument_serializer.py b/csp/impl/managed_dataset/cache_partition_argument_serializer.py deleted file mode 100644 index 2935b1f6b..000000000 --- a/csp/impl/managed_dataset/cache_partition_argument_serializer.py +++ /dev/null @@ -1,101 +0,0 @@ -import hashlib -import io -import ruamel.yaml -from abc import ABCMeta, abstractmethod - -from csp.impl.struct import Struct - - -class SerializedArgument: - def __init__(self, arg, serializer): - self._arg = arg - self._serializer = serializer - self._arg_as_string = None - self._arg_as_dict = None - self._arg_as_yaml_string = None - - def __str__(self): - return self.arg_as_string - - @property - def arg(self): - return self._arg - - @property - def arg_as_string(self): - if self._arg_as_string is None: - self._arg_as_string = self._serializer.to_string(self) - return self._arg_as_string - - @property - def arg_as_yaml_string(self): - if self._arg_as_yaml_string is None: - yaml = ruamel.yaml.YAML() - string_io = io.StringIO() - yaml.dump(self.arg_as_dict, string_io) - self._arg_as_yaml_string = string_io.getvalue() - return self._arg_as_yaml_string - - @property - def arg_as_dict(self): - if self._arg_as_dict is None: - self._arg_as_dict = self._serializer.to_json_dict(self) - return self._arg_as_dict - - -class CachePartitionArgumentSerializer(metaclass=ABCMeta): - @abstractmethod - def to_json_dict(self, value: SerializedArgument): - """ - :param value: The value to serialize - :returns: Should return a 
dict that will be written to yaml file - """ - raise NotImplementedError() - - @abstractmethod - def from_json_dict(self, value): - """ - :param value: The dict that is read from yaml file - :returns: Should return the deserialized object - """ - raise NotImplementedError() - - @abstractmethod - def to_string(self, value: SerializedArgument): - """Serialize the given object to a string (this string will be the partition folder name) - - :param value: The value to serialize - :returns: Should return a string that will be the folder name - """ - raise NotImplementedError() - - def __call__(self, value): - return SerializedArgument(value, self) - - -class StructPartitionArgumentSerializer(CachePartitionArgumentSerializer): - def __init__(self, typ): - self._typ = typ - - def to_json_dict(self, value: SerializedArgument): - """ - :param value: The value to serialize - :returns: Should return a dict that will be written to yaml file - """ - assert isinstance(value.arg, self._typ) - return value.arg.to_dict() - - def from_json_dict(self, value) -> Struct: - """ - :param value: The dict that is read from yaml file - :returns: Should return the deserialized object - """ - return self._typ.from_dict(value) - - def to_string(self, value: SerializedArgument): - """Serialize the given object to a string (this string will be the partition folder name) - - :param value: The value to serialize - :returns: Should return a string that will be the folder name - """ - return f"struct_{hashlib.md5(value.arg_as_yaml_string.encode()).hexdigest()}" diff --git a/csp/impl/managed_dataset/cache_user_custom_object_serializer.py b/csp/impl/managed_dataset/cache_user_custom_object_serializer.py deleted file mode 100644 index ff6377961..000000000 --- a/csp/impl/managed_dataset/cache_user_custom_object_serializer.py +++ /dev/null @@ -1,11 +0,0 @@ -from abc import ABCMeta, abstractmethod - - -class CacheObjectSerializer(metaclass=ABCMeta): - @abstractmethod - def serialize_to_bytes(self, value): 
- raise NotImplementedError - - @abstractmethod - def deserialize_from_bytes(self, value): - raise NotImplementedError diff --git a/csp/impl/managed_dataset/dataset_metadata.py b/csp/impl/managed_dataset/dataset_metadata.py deleted file mode 100644 index 136b3e648..000000000 --- a/csp/impl/managed_dataset/dataset_metadata.py +++ /dev/null @@ -1,85 +0,0 @@ -from enum import Enum, auto -from typing import Dict - -from csp.impl.struct import Struct -from csp.impl.wiring.cache_support.cache_type_mapper import CacheTypeMapper - - -class OutputType(Enum): - PARQUET = auto() - - -class TimeAggregation(Enum): - DAY = auto() - MONTH = auto() - QUARTER = auto() - YEAR = auto() - - -class DictBasketInfo(Struct): - key_type: object - value_type: object - - @classmethod - def _postprocess_dict_to_python(cls, d): - d["key_type"] = CacheTypeMapper.type_to_json(d["key_type"]) - d["value_type"] = CacheTypeMapper.type_to_json(d["value_type"]) - return d - - @classmethod - def _preprocess_dict_from_python(cls, d): - d["key_type"] = CacheTypeMapper.json_to_type(d["key_type"]) - d["value_type"] = CacheTypeMapper.json_to_type(d["value_type"]) - return d - - -class DatasetMetadata(Struct): - version: str = "1.0.0" - name: str - output_type: OutputType = OutputType.PARQUET - time_aggregation: TimeAggregation = TimeAggregation.DAY - columns: Dict[str, object] - dict_basket_columns: Dict[str, DictBasketInfo] - partition_columns: Dict[str, type] - timestamp_column_name: str - split_columns_to_files: bool = False - - @classmethod - def _postprocess_dict_to_python(cls, d): - output_type = d.get("output_type") - if output_type is not None: - d["output_type"] = output_type.name - time_aggregation = d.get("time_aggregation") - if time_aggregation is not None: - d["time_aggregation"] = time_aggregation.name - columns = d["columns"] - if columns: - d["columns"] = {k: CacheTypeMapper.type_to_json(v) for k, v in columns.items()} - - partition_columns = d.get("partition_columns") - if 
partition_columns: - d["partition_columns"] = {k: CacheTypeMapper.type_to_json(v) for k, v in partition_columns.items()} - - return d - - @classmethod - def _preprocess_dict_from_python(cls, d): - output_type = d.get("output_type") - if output_type is not None: - d["output_type"] = OutputType[output_type] - time_aggregation = d.get("time_aggregation") - if time_aggregation is not None: - d["time_aggregation"] = TimeAggregation[time_aggregation] - columns = d["columns"] - if columns: - d["columns"] = {k: CacheTypeMapper.json_to_type(v) for k, v in columns.items()} - partition_columns = d.get("partition_columns") - if partition_columns: - d["partition_columns"] = {k: CacheTypeMapper.json_to_type(v) for k, v in partition_columns.items()} - - return d - - @classmethod - def load_metadata(cls, file_path: str): - with open(file_path, "r") as f: - return DatasetMetadata.from_yaml(f.read()) diff --git a/csp/impl/managed_dataset/dateset_name_constants.py b/csp/impl/managed_dataset/dateset_name_constants.py deleted file mode 100644 index 3e67f8128..000000000 --- a/csp/impl/managed_dataset/dateset_name_constants.py +++ /dev/null @@ -1,4 +0,0 @@ -class DatasetNameConstants: - UNNAMED_OUTPUT_NAME = "csp_unnamed_output" - CSP_TIMESTAMP = "csp_timestamp" - PARTITION_ARGUMENT_FILE_NAME = ".csp_argument_value" diff --git a/csp/impl/managed_dataset/datetime_utils.py b/csp/impl/managed_dataset/datetime_utils.py deleted file mode 100644 index e2c350be3..000000000 --- a/csp/impl/managed_dataset/datetime_utils.py +++ /dev/null @@ -1,8 +0,0 @@ -from datetime import date, timedelta - -ONE_DAY_DELTA = timedelta(days=1) - - -def get_dates_in_range(start: date, end: date, inclusive_end=True): - n_days = (end - start).days + int(inclusive_end) - return [start + ONE_DAY_DELTA * i for i in range(n_days)] diff --git a/csp/impl/managed_dataset/managed_dataset.py b/csp/impl/managed_dataset/managed_dataset.py deleted file mode 100644 index dbf6d5f5f..000000000 --- 
a/csp/impl/managed_dataset/managed_dataset.py +++ /dev/null @@ -1,487 +0,0 @@ -import logging -import os -import tempfile -from datetime import date, datetime, timedelta -from typing import Dict, List, Optional, Tuple, Union - -import csp -from csp.impl.config import BaseCacheConfig -from csp.impl.enum import Enum -from csp.impl.managed_dataset.cache_partition_argument_serializer import ( - SerializedArgument, - StructPartitionArgumentSerializer, -) -from csp.impl.managed_dataset.dataset_metadata import DatasetMetadata, DictBasketInfo, TimeAggregation -from csp.impl.managed_dataset.dateset_name_constants import DatasetNameConstants -from csp.impl.managed_dataset.managed_dataset_lock_file_util import LockContext, ManagedDatasetLockUtil -from csp.impl.managed_dataset.managed_dataset_path_resolver import DatasetPaths -from csp.impl.struct import Struct -from csp.utils.file_permissions import FilePermissions, apply_file_permissions, create_folder_with_permissions -from csp.utils.rm_utils import rm_file_or_folder - - -class _MetadataRWUtil: - def __init__(self, dataset, metadata_file_path, metadata, lock_file_permissions, data_file_permissions): - self._dataset = dataset - self._metadata = metadata - self._metadata_file_path = metadata_file_path - self._lock_file_util = ManagedDatasetLockUtil(lock_file_permissions) - self._data_file_permissions = data_file_permissions - - def _write_metadata(self): - locked_folder = os.path.dirname(self._metadata_file_path) - with self._lock_file_util.write_lock(locked_folder): - if os.path.exists(self._metadata_file_path): - return - - file_base_name = os.path.basename(self._metadata_file_path) - create_folder_with_permissions(locked_folder, self._dataset.cache_config.data_file_permissions) - with tempfile.NamedTemporaryFile(mode="w+", prefix=file_base_name, dir=locked_folder, delete=False) as f: - try: - yaml = self._metadata.to_yaml() - f.file.write(yaml) - f.file.flush() - apply_file_permissions(f.name, self._data_file_permissions) 
- os.rename(f.name, self._metadata_file_path) - except: - rm_file_or_folder(f.name) - raise - - def load_existing_or_store_metadata(self): - """Loads existing metadata if no metadata exists, also will store the current metadata to file - - :return: A tuple of (loaded_metadata, file_lock) where file lock locks the metadata folder for reading (the file lock is in acquired state) - """ - if not os.path.exists(self._metadata_file_path): - self._write_metadata() - - locked_folder = os.path.dirname(self._metadata_file_path) - read_lock = self._lock_file_util.read_lock(locked_folder) - read_lock.lock() - try: - with open(self._metadata_file_path, "r") as f: - existing_metadata = self._metadata.from_yaml(f.read()) - return existing_metadata, read_lock - except: - read_lock.unlock() - raise - - -class ManagedDatasetPartition: - """A single partition of a dataset, this basically represents the lowest level of the chain dataset->partition. - - Single partition corresponds to a single instance of partition values. For example if there is a dataset that is partitioned on columns - a:int, b:float, c:str then a single partition would for example correspond to (1, 1.0, 'str1') while (2,2.0, 'str2') would be a different - partition object. - Single partition corresponds to a single instance of graph. 
- """ - - # In the future we are going to support containers as well, for now just primitives - PARTITION_TYPE_STR_CONVERTORS = { - bool: str, - int: str, - float: str, - str: str, - datetime: lambda v: v.strftime("%Y%m%d_%H%M%S_%f"), - date: lambda v: v.strftime("%Y%m%d_%H%M%S_%f"), - timedelta: lambda v: f"td_{int(v.total_seconds() * 1e6)}us", - } - - def __init__(self, dataset, partition_values: Optional[Dict[str, object]] = None): - """ - :param dataset: An instance of ManagedDataset to which the partition belongs - :param partition_values: A dictionary of partition column name to value of the column for the given partition - """ - self._dataset = dataset - self._values_tuple, self._values_dict = self._normalize_partition_values(partition_values) - self._data_paths = None - - def get_data_for_period(self, starttime: datetime, endtime: datetime, missing_range_handler): - return self.data_paths.get_data_files_in_range( - starttime, - endtime, - missing_range_handler=missing_range_handler, - split_columns_to_files=self.dataset.metadata.split_columns_to_files, - ) - - @property - def data_paths(self): - if self._data_paths is None: - dataset_data_paths = self._dataset.data_paths - if dataset_data_paths: - self._data_paths = dataset_data_paths.get_partition_paths(self._values_dict) - return self._data_paths - - @property - def dataset(self): - return self._dataset - - @property - def value_tuple(self): - return self._values_tuple - - @property - def value_dict(self): - return self._values_dict - - def _create_folder_with_permissions(self, cur_root_path, folder_permissions): - if not os.path.exists(cur_root_path): - try: - os.mkdir(cur_root_path) - apply_file_permissions(cur_root_path, folder_permissions) - return True - except FileExistsError: - pass - return False - - def create_root_folder(self, cache_config): - if os.path.exists(self.data_paths.root_folder): - return - cur_root_path = self.dataset.data_paths.root_folder - rel_path = 
os.path.relpath(self.data_paths.root_folder, cur_root_path) - path_parts = list(filter(None, os.path.normpath(rel_path).split(os.sep))) - assert path_parts[0] == "data" - cur_root_path = os.path.join(cur_root_path, path_parts[0]) - - file_permissions = cache_config.data_file_permissions - folder_permissions = cache_config.data_file_permissions.get_folder_permissions() - - self._create_folder_with_permissions(cur_root_path, folder_permissions) - values_dict = self._values_dict - assert len(values_dict) + 1 == len(path_parts) - lock_util = ManagedDatasetLockUtil(cache_config.lock_file_permissions) - - with self.dataset.use_lock_context(): - for sub_folder, argument_value in zip(path_parts[1:], values_dict.values()): - cur_root_path = os.path.join(cur_root_path, sub_folder) - self._create_folder_with_permissions(cur_root_path, folder_permissions) - if isinstance(argument_value, SerializedArgument): - value_file_path = os.path.join(cur_root_path, DatasetNameConstants.PARTITION_ARGUMENT_FILE_NAME) - if not os.path.exists(value_file_path): - with lock_util.write_lock(value_file_path, is_lock_in_root_folder=True) as lock_file: - if not os.path.exists(value_file_path): - with open(value_file_path, "w") as value_file: - value_file.write(argument_value.arg_as_yaml_string) - apply_file_permissions(value_file_path, file_permissions) - rm_file_or_folder(lock_file.file_path, is_file=True) - - def publish_file(self, file_name, start_time, end_time, file_permissions=None, lock_file_permissions=None): - output_file_name = self.data_paths.get_output_file_name( - start_time, end_time, split_columns_to_files=self.dataset.metadata.split_columns_to_files - ) - # We might try to publish some files that are already there. Example - # We ran 20210101-20210102. We now run 20210102-20210103, since the data is not fully in cache we will run the graph, the data for 20210102 will be generated again. 
- if os.path.exists(output_file_name): - rm_file_or_folder(file_name) - return - - if file_permissions is not None: - if os.path.isdir(file_name): - folder_permissions = file_permissions.get_folder_permissions() - apply_file_permissions(file_name, folder_permissions) - for f in os.listdir(file_name): - apply_file_permissions(os.path.join(file_name, f), file_permissions) - else: - apply_file_permissions(file_name, file_permissions) - - with self.dataset.use_lock_context(): - lock_util = ManagedDatasetLockUtil(lock_file_permissions) - with lock_util.write_lock(output_file_name, is_lock_in_root_folder=True) as lock: - if os.path.exists(output_file_name): - logging.warning(f"Not publishing {output_file_name} since it already exists") - rm_file_or_folder(file_name) - else: - os.rename(file_name, output_file_name) - lock.delete_file() - - def merge_files(self, start_time: datetime, end_time: datetime, cache_config, parquet_output_config): - from csp.impl.managed_dataset.managed_dataset_merge_utils import SinglePartitionFileMerger - - with self.dataset.use_lock_context(): - file_merger = SinglePartitionFileMerger( - dataset_partition=self, - start_time=start_time, - end_time=end_time, - cache_config=cache_config, - parquet_output_config=parquet_output_config, - ) - file_merger.merge_files() - - def cleanup_unneeded_files(self, start_time: datetime, end_time: datetime, cache_config): - unused_files = self.data_paths.get_unused_files( - starttime=start_time, endtime=end_time, split_columns_to_files=self.dataset.metadata.split_columns_to_files - ) - if unused_files: - with self.dataset.use_lock_context(): - lock_util = ManagedDatasetLockUtil(cache_config.lock_file_permissions) - for f in unused_files: - try: - with lock_util.write_lock( - f, is_lock_in_root_folder=True, timeout_seconds=0, retry_period_seconds=0 - ) as lock: - rm_file_or_folder(f) - lock.delete_file() - except BlockingIOError: - logging.warning(f"Not removing {f} since it's currently locked") - - def 
partition_merge_lock(self, start_time: datetime, end_time: datetime): - raise NotImplementedError() - - def _get_type_convertor(self, typ): - if issubclass(typ, Enum): - return str - elif issubclass(typ, Struct): - return StructPartitionArgumentSerializer(typ) - else: - return self.PARTITION_TYPE_STR_CONVERTORS[typ] - - def _normalize_partition_values(self, partition_values): - metadata = self._dataset.metadata - if partition_values: - assert len(partition_values) == len(metadata.partition_columns) - assert partition_values.keys() == metadata.partition_columns.keys() - ordered_partition_values = ((k, partition_values[k]) for k in metadata.partition_columns) - partition_values = {k: self._get_type_convertor(type(v))(v) for k, v in ordered_partition_values} - values_tuple = tuple(partition_values.values()) - else: - assert not hasattr(metadata, "partition_columns") - values_tuple = tuple() - return values_tuple, partition_values - - -class ManagedDataset: - """A single dataset; this represents the highest level of the chain dataset->partition. - - A single dataset corresponds to a set of dataset_partitions, all having an identical schema but different partition keys. - For example, consider caching trades for each ticker and date: a single dataset represents all the "trades" and has the trade - schema attached to it. Each partition is part of the dataset but corresponds to a different ticker. - A single dataset corresponds to a single "graph" function (it defines paths and schemas for all instances of this graph).
- """ - - SUPPORTED_PARTITION_TYPES = set(ManagedDatasetPartition.PARTITION_TYPE_STR_CONVERTORS.keys()) - - def __init__( - self, - name, - category: List[str] = None, - timestamp_column_name: str = None, - columns_types: Dict[str, object] = None, - partition_columns: Dict[str, type] = None, - *, - cache_config: BaseCacheConfig, - split_columns_to_files: Optional[bool], - time_aggregation: TimeAggregation, - dict_basket_column_types: Dict[str, Union[Tuple[type, type], DictBasketInfo]] = None, - ): - """ - :param name: The name of the dataset: - :param category: The category classification of the dataset, for example ['stats', 'daily'], or ['forecasts'], - this is being used as part of the path of the dataset on disk - :param timestamp_column_name: The name of the timestamp column in the parquet files. - :param columns_types: A dictionary of name->type of dataset column types. - :param partition_columns: A dictionary of partitioning columns of the dataset. This columns are not written into parquet files but instead - are used as part of the dataset partition path. - :param cache_config: The cache configuration for the data set - :param split_columns_to_files: A boolean that specifies whether the data of the dataset is split across files. 
- :param time_aggregation: The data aggregation period for the dataset - :param dict_basket_column_types: The dictionary basket columns of the dataset - """ - self._name = name - self._category = category if category else [] - self._cache_config = cache_config - self._lock_context = None - self._metadata = DatasetMetadata( - name=name, - split_columns_to_files=True if split_columns_to_files else False, - time_aggregation=time_aggregation, - ) - dict_basket_columns = self._normalize_dict_basket_types(dict_basket_column_types) - if dict_basket_columns: - self._metadata.dict_basket_columns = dict_basket_columns - if timestamp_column_name: - self._metadata.timestamp_column_name = timestamp_column_name - self._metadata.columns = columns_types if columns_types else {} - if partition_columns: - self._metadata.partition_columns = partition_columns - - self._data_paths: Optional[DatasetPaths] = None - self._set_folders(cache_config.data_folder, getattr(cache_config, "read_folders", None)) - - @classmethod - def _normalize_dict_basket_types(cls, dict_basket_column_types): - if not dict_basket_column_types: - return None - dict_types = {} - for name, type_entry in dict_basket_column_types.items(): - if isinstance(type_entry, DictBasketInfo): - dict_types[name] = type_entry - else: - key_type, value_type = type_entry - dict_types[name] = DictBasketInfo(key_type=key_type, value_type=value_type) - return dict_types - - @classmethod - def load_from_disk(cls, cache_config, name, data_category: Optional[List[str]] = None): - data_paths = DatasetPaths( - parent_folder=cache_config.data_folder, - read_folders=getattr(cache_config, "read_folders", None), - name=name, - data_category=data_category, - ) - metadata_file_path = data_paths.get_metadata_file_path(existing=True) - if metadata_file_path: - with open(metadata_file_path, "r") as f: - metadata = DatasetMetadata.from_yaml(f.read()) - res = ManagedDataset( - name=metadata.name, - category=data_category, - 
timestamp_column_name=metadata.timestamp_column_name, - columns_types=metadata.columns, - cache_config=cache_config, - split_columns_to_files=metadata.split_columns_to_files, - time_aggregation=metadata.time_aggregation, - dict_basket_column_types=getattr(metadata, "dict_basket_columns", None), - ) - if hasattr(metadata, "partition_columns"): - res.metadata.partition_columns = metadata.partition_columns - return res - else: - return None - - @classmethod - def is_supported_partition_type(cls, typ): - if typ in ManagedDataset.SUPPORTED_PARTITION_TYPES or ( - isinstance(typ, type) and (issubclass(typ, Enum) or issubclass(typ, Struct)) - ): - return True - else: - return False - - @property - def cache_config(self): - assert self._cache_config is not None - return self._cache_config - - def use_lock_context(self): - if self._lock_context is None: - self._lock_context = LockContext(self) - return ManagedDatasetLockUtil.set_dataset_context(self._lock_context) - - def validate_and_lock_metadata( - self, - lock_file_permissions: Optional[FilePermissions] = None, - data_file_permissions: Optional[FilePermissions] = None, - read: bool = False, - write: bool = False, - ): - """Validate that the code metadata corresponds to the existing metadata on disk. If necessary, writes the metadata file to disk. - - :param lock_file_permissions: The permissions of the lock files that are created for safely accessing metadata. - :param data_file_permissions: The permissions of the written metadata files. - :param read: A bool that specifies whether the dataset will be read. - :param write: A bool that specifies whether the dataset will be written. - - Note: validation differs for read vs. written datasets; for read datasets we allow slightly more relaxed schemas. - - :return: An obtained "shared" lock that locks the dataset schema.
Caller is responsible for releasing the lock - """ - assert self.data_paths is not None - with self.use_lock_context(): - metadata = self.metadata - metadata_file_path = self.data_paths.get_metadata_file_path(existing=not write) - metadata_rw_util = _MetadataRWUtil( - dataset=self, - metadata_file_path=metadata_file_path, - metadata=self.metadata, - lock_file_permissions=lock_file_permissions, - data_file_permissions=data_file_permissions, - ) - existing_meta, read_lock = metadata_rw_util.load_existing_or_store_metadata() - is_metadata_different = False - - if write: - is_metadata_different = existing_meta != self.metadata - else: - for field in DatasetMetadata.metadata(): - if field not in ("columns", "dict_basket_columns"): - if getattr(existing_meta, field, None) != getattr(self.metadata, field, None): - is_metadata_different = True - # The read metadata must be a subset of the existing metadata - existing_meta_columns = existing_meta.columns - existing_meta_dict_columns = getattr(existing_meta, "dict_basket_columns", None) - for col_name, col_type in metadata.columns.items(): - existing_type = existing_meta_columns.get(col_name, None) - if existing_type is None or existing_type != col_type: - is_metadata_different = True - break - cur_dict_basket_columns = getattr(metadata, "dict_basket_columns", None) - if cur_dict_basket_columns or existing_meta_dict_columns: - if cur_dict_basket_columns is None or existing_meta_dict_columns is None: - is_metadata_different = True - else: - for col_name, col_info in metadata.dict_basket_columns.items(): - if not existing_meta_dict_columns: - is_metadata_different = True - break - existing_meta_column_info = existing_meta_dict_columns.get(col_name) - if existing_meta_column_info is None: - is_metadata_different = True - break - existing_type = existing_meta_column_info.value_type - cur_type = col_info.value_type - - if issubclass(existing_type, csp.Struct): - if not issubclass(cur_type, csp.Struct): - is_metadata_different = 
True - break - existing_meta = existing_type.metadata() - for field, field_type in cur_type.metadata().items(): - if existing_meta.get(field) != field_type: - is_metadata_different = True - else: - is_metadata_different = existing_type is not cur_type - - if is_metadata_different: - read_lock.unlock() - raise RuntimeError( - f"Metadata mismatch at {metadata_file_path}\nCurrent:\n{metadata}\nExisting:{existing_meta}\n" - ) - return read_lock - - def get_partition(self, partition_values: Dict[str, object]): - """Get a partition object that corresponds to the given instance of partition key->value mapping. - :param partition_values: - """ - return ManagedDatasetPartition(self, partition_values) - - @property - def category(self): - return self._category - - @property - def parent_folder(self): - if self._data_paths is None: - return None - return self._data_paths.parent_folder - - def _set_folders(self, parent_folder, read_folders): - assert self._data_paths is None - if parent_folder: - self._data_paths = DatasetPaths( - parent_folder=parent_folder, - read_folders=read_folders, - name=self._name, - data_category=self._category, - time_aggregation=self.metadata.time_aggregation, - ) - else: - assert not read_folders, "Provided read folders without parent folder" - self._lock_context = None - - @property - def data_paths(self) -> DatasetPaths: - return self._data_paths - - @property - def metadata(self): - return self._metadata diff --git a/csp/impl/managed_dataset/managed_dataset_lock_file_util.py b/csp/impl/managed_dataset/managed_dataset_lock_file_util.py deleted file mode 100644 index d638cec64..000000000 --- a/csp/impl/managed_dataset/managed_dataset_lock_file_util.py +++ /dev/null @@ -1,111 +0,0 @@ -import os -import threading -import typing -from contextlib import contextmanager - -from csp.utils.file_permissions import create_folder_with_permissions -from csp.utils.lock_file import LockFile - -if typing.TYPE_CHECKING: - from .managed_dataset import 
ManagedDataset - - -class LockContext: - def __init__(self, data_set: "ManagedDataset"): - self._data_set = data_set - - def resolve_lock_file_path_and_create_folders(self, file_path: str, use_read_folders: bool): - parent_folder, lock_file_path = self._data_set.data_paths.resolve_lock_file_path( - file_path, use_read_folders=use_read_folders - ) - # We need to make sure that the root folder is created with the right permissions - create_folder_with_permissions(parent_folder, self._data_set.cache_config.data_file_permissions) - create_folder_with_permissions( - os.path.dirname(lock_file_path), self._data_set.cache_config.lock_file_permissions - ) - return lock_file_path - - -class ManagedDatasetLockUtil: - _READ_WRITE_LOCK_FILE_NAME = ".csp_read_write_lock" - _MERGE_LOCK_FILE_NAME = ".csp_merge_lock" - _TLS = threading.local() - - def __init__(self, lock_file_permissions): - self._lock_file_permissions = lock_file_permissions - - @classmethod - @contextmanager - def set_dataset_context(cls, lock_context: LockContext): - prev = getattr(cls._TLS, "instance", None) - try: - cls._TLS.instance = lock_context - yield lock_context - finally: - if prev is not None: - cls._TLS.instance = prev - else: - delattr(cls._TLS, "instance") - - @classmethod - def get_cur_context(cls): - res = getattr(cls._TLS, "instance", None) - if res is None: - raise RuntimeError("Trying to get lock context without any context set") - return res - - def _create_lock(self, file_path, lock_name, shared, is_lock_in_root_folder, timeout_seconds, retry_period_seconds): - cur_context = self.get_cur_context() - if os.path.isfile(file_path) or is_lock_in_root_folder: - base_path = os.path.splitext(os.path.basename(file_path))[0] - dir_name = os.path.dirname(file_path) - lock_file_name = f"{lock_name}.{base_path}" - return LockFile( - file_path=cur_context.resolve_lock_file_path_and_create_folders( - os.path.join(dir_name, lock_file_name), use_read_folders=shared - ), - shared=shared, - 
file_permissions=self._lock_file_permissions, - timeout_seconds=timeout_seconds, - retry_period_seconds=retry_period_seconds, - ) - else: - return LockFile( - file_path=cur_context.resolve_lock_file_path_and_create_folders( - os.path.join(file_path, lock_name), use_read_folders=shared - ), - shared=shared, - file_permissions=self._lock_file_permissions, - timeout_seconds=timeout_seconds, - retry_period_seconds=retry_period_seconds, - ) - - def write_lock(self, file_path, is_lock_in_root_folder=None, timeout_seconds=None, retry_period_seconds=None): - return self._create_lock( - file_path, - lock_name=self._READ_WRITE_LOCK_FILE_NAME, - shared=False, - is_lock_in_root_folder=is_lock_in_root_folder, - timeout_seconds=timeout_seconds, - retry_period_seconds=retry_period_seconds, - ) - - def read_lock(self, file_path, is_lock_in_root_folder=None, timeout_seconds=None, retry_period_seconds=None): - return self._create_lock( - file_path, - lock_name=self._READ_WRITE_LOCK_FILE_NAME, - shared=True, - is_lock_in_root_folder=is_lock_in_root_folder, - timeout_seconds=timeout_seconds, - retry_period_seconds=retry_period_seconds, - ) - - def merge_lock(self, file_path, is_lock_in_root_folder=None, timeout_seconds=None, retry_period_seconds=None): - return self._create_lock( - file_path, - lock_name=self._MERGE_LOCK_FILE_NAME, - shared=False, - is_lock_in_root_folder=is_lock_in_root_folder, - timeout_seconds=timeout_seconds, - retry_period_seconds=retry_period_seconds, - ) diff --git a/csp/impl/managed_dataset/managed_dataset_merge_utils.py b/csp/impl/managed_dataset/managed_dataset_merge_utils.py deleted file mode 100644 index ef2ffeb72..000000000 --- a/csp/impl/managed_dataset/managed_dataset_merge_utils.py +++ /dev/null @@ -1,431 +0,0 @@ -import datetime -import itertools -import os -import pytz -import tempfile -from typing import Optional - -import csp -from csp.adapters.output_adapters.parquet import ParquetOutputConfig, ParquetWriter -from csp.cache_support import 
CacheConfig -from csp.impl.managed_dataset.aggregation_period_utils import AggregationPeriodUtils -from csp.impl.managed_dataset.managed_dataset import ManagedDatasetPartition -from csp.impl.managed_dataset.managed_dataset_lock_file_util import ManagedDatasetLockUtil -from csp.utils.file_permissions import apply_file_permissions -from csp.utils.lock_file import MultipleFilesLock - - -def _pa(): - """ - Lazy import pyarrow - """ - import pyarrow - - return pyarrow - - -def _create_wip_file(output_folder, start_time, is_folder: Optional[bool] = False): - prefix = start_time.strftime("%Y%m%d_H%M%S_%f") if start_time else "merge_" - - if is_folder: - return tempfile.mkdtemp(dir=output_folder, suffix="_WIP", prefix=prefix) - else: - fd, cur_file_path = tempfile.mkstemp(dir=output_folder, suffix="_WIP", prefix=prefix) - os.close(fd) - return cur_file_path - - -class _SingleBasketMergeData: - def __init__(self, basket_name, basket_types, input_files, basket_data_input_files): - self.basket_data_input_files = basket_data_input_files - self.basket_name = basket_name - self.basket_types = basket_types - self.count_column_name = f"{basket_name}__csp_value_count" - if issubclass(self.basket_types.value_type, csp.Struct): - self.data_column_names = [f"{basket_name}.{c}" for c in self.basket_types.value_type.metadata()] - else: - self.data_column_names = [basket_name] - self.symbol_column_name = f"{basket_name}__csp_symbol" - self._cur_basket_data_row_group = None - self._cur_row_group_data_table = None - self._cur_row_group_symbol_table = None - self._cur_row_group_last_returned_index = int(-1) - - def _load_row_group(self, next_row_group_index=None): - if next_row_group_index is None: - next_row_group_index = self._cur_basket_data_row_group + 1 - self._cur_basket_data_row_group = next_row_group_index - do_iter = True - while do_iter: - if self._cur_basket_data_row_group < self.basket_data_input_files[self.data_column_names[0]].num_row_groups: - self._cur_row_group_data_tables 
= [ - self.basket_data_input_files[c].read_row_group(self._cur_basket_data_row_group) - for c in self.data_column_names - ] - self._cur_row_group_symbol_table = self.basket_data_input_files[self.symbol_column_name].read_row_group( - self._cur_basket_data_row_group - ) - if self._cur_row_group_data_tables[0].shape[0] > 0: - do_iter = False - else: - self._cur_basket_data_row_group += 1 - else: - self._cur_row_group_data_tables = None - self._cur_row_group_symbol_table = None - do_iter = False - self._cur_row_group_last_returned_index = int(-1) - return self._cur_row_group_data_tables is not None - - @property - def _num_remaining_rows_cur_chunk(self): - if not self._cur_row_group_data_tables: - return 0 - remaining_items_cur_group = ( - self._cur_row_group_data_tables[0].shape[0] - 1 - self._cur_row_group_last_returned_index - ) - return remaining_items_cur_group - - def _skip_rows(self, num_rows_to_skip): - remaining_items_cur_chunk = self._num_remaining_rows_cur_chunk - while num_rows_to_skip > 0: - if num_rows_to_skip >= remaining_items_cur_chunk: - num_rows_to_skip -= remaining_items_cur_chunk - assert self._load_row_group() or num_rows_to_skip == 0 - else: - self._cur_row_group_last_returned_index += int(num_rows_to_skip) - num_rows_to_skip = 0 - - def _iter_chunks(self, row_indices, full_column_tables): - count_table = full_column_tables[self.count_column_name].columns[0] - count_table_cum_sum = count_table.to_pandas().cumsum() - if self._cur_basket_data_row_group is None: - self._load_row_group(0) - - if row_indices is None: - if count_table_cum_sum.empty: - return - num_rows_to_return = int(count_table_cum_sum.iloc[-1]) - else: - if row_indices.size == 0: - if not count_table_cum_sum.empty: - self._skip_rows(count_table_cum_sum.iloc[-1]) - return - - num_rows_to_return = int(count_table_cum_sum[row_indices[-1]]) - if row_indices[0] != 0: - skipped_rows = int(count_table_cum_sum[row_indices[0] - 1]) - self._skip_rows(skipped_rows) - num_rows_to_return -= 
skipped_rows - - while num_rows_to_return > 0: - s_i = self._cur_row_group_last_returned_index + 1 - if num_rows_to_return < self._num_remaining_rows_cur_chunk: - e_i = s_i + num_rows_to_return - self._skip_rows(num_rows_to_return) - num_rows_to_return = 0 - yield (self._cur_row_group_symbol_table[s_i:e_i],) + tuple( - t[s_i:e_i] for t in self._cur_row_group_data_tables - ) - else: - num_read_rows = self._num_remaining_rows_cur_chunk - e_i = s_i + num_read_rows - num_rows_to_return -= num_read_rows - yield (self._cur_row_group_symbol_table[s_i:e_i],) + tuple( - t[s_i:e_i] for t in self._cur_row_group_data_tables - ) - assert self._load_row_group() or num_rows_to_return == 0 - - -class _MergeFileInfo(csp.Struct): - file_path: str - start_time: datetime.datetime - end_time: datetime.datetime - - -class SinglePartitionFileMerger: - def __init__( - self, - dataset_partition: ManagedDatasetPartition, - start_time, - end_time, - cache_config: CacheConfig, - parquet_output_config: ParquetOutputConfig, - ): - self._dataset_partition = dataset_partition - self._start_time = start_time - self._end_time = end_time - self._cache_config = cache_config - self._parquet_output_config = parquet_output_config.copy().resolve_compression() - # TODO: cleanup all reference to existing files and backup files - self._split_columns_to_files = getattr(dataset_partition.dataset.metadata, "split_columns_to_files", False) - self._aggregation_period_utils = AggregationPeriodUtils( - self._dataset_partition.dataset.metadata.time_aggregation - ) - - def _is_overwrite_allowed(self): - allow_overwrite = getattr(self._cache_config, "allow_overwrite", None) - if allow_overwrite is not None: - return allow_overwrite - allow_overwrite = getattr(self._parquet_output_config, "allow_overwrite", None) - return bool(allow_overwrite) - - def _resolve_merged_output_file_name(self, merge_candidates): - output_file_name = self._dataset_partition.data_paths.get_output_file_name( - 
start_time=merge_candidates[0].start_time, - end_time=merge_candidates[-1].end_time, - split_columns_to_files=self._split_columns_to_files, - ) - - return output_file_name - - def _iterate_file_chunks(self, file_name, start_cutoff=None): - dataset = self._dataset_partition.dataset - parquet_file = _pa().parquet.ParquetFile(file_name) - if start_cutoff: - for i in range(parquet_file.metadata.num_row_groups): - time_stamps = parquet_file.read_row_group(i, [dataset.metadata.timestamp_column_name])[ - dataset.metadata.timestamp_column_name - ].to_pandas() - row_indices = time_stamps.index.values[(time_stamps > pytz.utc.localize(start_cutoff))] - - if row_indices.size == 0: - continue - - full_table = parquet_file.read_row_group(i)[row_indices[0] : row_indices[-1] + 1] - yield full_table - else: - for i in range(parquet_file.metadata.num_row_groups): - yield parquet_file.read_row_group(i) - - def _iter_column_names(self, include_regular_columns=True, include_basket_data_columns=True): - dataset = self._dataset_partition.dataset - if include_regular_columns: - yield dataset.metadata.timestamp_column_name - for c in dataset.metadata.columns.keys(): - yield c - if hasattr(dataset.metadata, "dict_basket_columns"): - for c, t in dataset.metadata.dict_basket_columns.items(): - if include_regular_columns: - yield f"{c}__csp_value_count" - if include_basket_data_columns: - if issubclass(t.value_type, csp.Struct): - for field_name in t.value_type.metadata(): - yield f"{c}.{field_name}" - else: - yield c - yield f"{c}__csp_symbol" - - def _iter_column_files(self, folder, include_regular_columns=True, include_basket_data_columns=True): - for c in self._iter_column_names( - include_regular_columns=include_regular_columns, include_basket_data_columns=include_basket_data_columns - ): - yield c, os.path.join(folder, f"{c}.parquet") - - def _iterate_folder_chunks(self, file_name, start_cutoff=None): - dataset = self._dataset_partition.dataset - input_files = {} - for c, f in 
self._iter_column_files(file_name, include_basket_data_columns=False): - input_files[c] = _pa().parquet.ParquetFile(f) - - basket_data_input_files = {} - for c, f in self._iter_column_files(file_name, include_regular_columns=False): - basket_data_input_files[c] = _pa().parquet.ParquetFile(f) - - timestamp_column_reader = input_files[dataset.metadata.timestamp_column_name] - - basked_data = ( - { - k: _SingleBasketMergeData(k, v, input_files, basket_data_input_files) - for k, v in dataset.metadata.dict_basket_columns.items() - } - if getattr(dataset.metadata, "dict_basket_columns", None) - else {} - ) - - if start_cutoff: - for i in range(timestamp_column_reader.metadata.num_row_groups): - time_stamps = timestamp_column_reader.read_row_group(i, [dataset.metadata.timestamp_column_name])[ - dataset.metadata.timestamp_column_name - ].to_pandas() - row_indices = time_stamps.index.values[(time_stamps > pytz.utc.localize(start_cutoff))] - - full_column_tables = {} - truncated_column_tables = {} - for c in self._iter_column_names(include_basket_data_columns=False): - full_table = input_files[c].read_row_group(i) - full_column_tables[c] = full_table - if row_indices.size > 0: - truncated_column_tables[c] = full_table[row_indices[0] : row_indices[-1] + 1] - - if row_indices.size > 0: - yield ( - truncated_column_tables, - ( - v._iter_chunks(row_indices=row_indices, full_column_tables=full_column_tables) - for v in basked_data.values() - ), - ) - else: - for v in basked_data.values(): - assert ( - len(list(v._iter_chunks(row_indices=row_indices, full_column_tables=full_column_tables))) - == 0 - ) - else: - for i in range(timestamp_column_reader.metadata.num_row_groups): - truncated_column_tables = {} - for c in self._iter_column_names(include_basket_data_columns=False): - truncated_column_tables[c] = input_files[c].read_row_group(i) - yield ( - truncated_column_tables, - ( - v._iter_chunks(row_indices=None, full_column_tables=truncated_column_tables) - for v in 
basked_data.values() - ), - ) - - def _iterate_chunks(self, file_name, start_cutoff=None): - if self._dataset_partition.dataset.metadata.split_columns_to_files: - return self._iterate_folder_chunks(file_name, start_cutoff) - else: - return self._iterate_file_chunks(file_name, start_cutoff) - - def _iterate_merged_batches(self, merge_candidates): - iters = [] - # Here we need both start time and end time to be exclusive - start_cutoff = merge_candidates[0].start_time - datetime.timedelta(microseconds=1) - end_cutoff = merge_candidates[-1].end_time + datetime.timedelta(microseconds=1) - - for merge_candidate in merge_candidates: - merged_file_cutoff_start = None - if merge_candidate.start_time <= start_cutoff: - merged_file_cutoff_start = start_cutoff - assert end_cutoff > merge_candidate.end_time - iters.append(self._iterate_chunks(merge_candidate.file_path, start_cutoff=merged_file_cutoff_start)) - start_cutoff = merge_candidate.end_time - return itertools.chain(*iters) - - def _merged_data_folders(self, aggregation_folder, merge_candidates): - output_file_name = self._resolve_merged_output_file_name(merge_candidates) - - file_permissions = self._cache_config.data_file_permissions - folder_permission = file_permissions.get_folder_permissions() - - wip_file = _create_wip_file(aggregation_folder, start_time=None, is_folder=True) - apply_file_permissions(wip_file, folder_permission) - writers = {} - try: - for (column1, src_file_name), (column2, file_name) in zip( - self._iter_column_files(merge_candidates[0].file_path), self._iter_column_files(wip_file) - ): - assert column1 == column2 - schema = _pa().parquet.read_schema(src_file_name) - writers[column1] = _pa().parquet.ParquetWriter( - file_name, - schema=schema, - compression=self._parquet_output_config.compression, - version=ParquetWriter.PARQUET_VERSION, - ) - for batch, basket_batches in self._iterate_merged_batches(merge_candidates): - for column_name, values in batch.items(): - 
writers[column_name].write_table(values) - - for single_basket_column_batches in basket_batches: - for batch_columns in single_basket_column_batches: - for single_column_table in batch_columns: - writer = writers[single_column_table.column_names[0]] - writer.write_table(single_column_table) - finally: - for writer in writers.values(): - writer.close() - - for _, f in self._iter_column_files(wip_file): - apply_file_permissions(f, file_permissions) - - os.rename(wip_file, output_file_name) - - def _merge_data_files(self, aggregation_folder, merge_candidates): - output_file_name = self._resolve_merged_output_file_name(merge_candidates) - - file_permissions = self._cache_config.data_file_permissions - - wip_file = _create_wip_file(aggregation_folder, start_time=None, is_folder=False) - schema = _pa().parquet.read_schema(merge_candidates[0].file_path) - with _pa().parquet.ParquetWriter( - wip_file, - schema=schema, - compression=self._parquet_output_config.compression, - version=ParquetWriter.PARQUET_VERSION, - ) as parquet_writer: - for batch in self._iterate_merged_batches(merge_candidates): - parquet_writer.write_table(batch) - - apply_file_permissions(wip_file, file_permissions) - os.rename(wip_file, output_file_name) - - def _resolve_merge_candidates(self, existing_files): - if not existing_files or len(existing_files) <= 1: - return None - - merge_candidates = [] - - for (file_period_start, file_period_end), file_path in existing_files.items(): - if not merge_candidates: - merge_candidates.append( - _MergeFileInfo(file_path=file_path, start_time=file_period_start, end_time=file_period_end) - ) - continue - assert file_period_start >= merge_candidates[-1].start_time - if merge_candidates[-1].end_time + datetime.timedelta(microseconds=1) >= file_period_start: - merge_candidates.append( - _MergeFileInfo(file_path=file_path, start_time=file_period_start, end_time=file_period_end) - ) - elif len(merge_candidates) <= 1: - merge_candidates.clear() - 
merge_candidates.append( - _MergeFileInfo(file_path=file_path, start_time=file_period_start, end_time=file_period_end) - ) - else: - break - if len(merge_candidates) > 1: - return merge_candidates - return None - - def _merge_single_period(self, aggregation_folder, aggregation_period_start, aggregation_period_end): - lock_file_utils = ManagedDatasetLockUtil(self._cache_config.lock_file_permissions) - continue_merge = True - while continue_merge: - with lock_file_utils.merge_lock(aggregation_folder): - existing_files, _ = self._dataset_partition.data_paths.get_data_files_in_range( - aggregation_period_start, - aggregation_period_end, - missing_range_handler=lambda *args, **kwargs: True, - split_columns_to_files=self._split_columns_to_files, - truncate_data_periods=False, - include_read_folders=False, - ) - merge_candidates = self._resolve_merge_candidates(existing_files) - if not merge_candidates: - break - lock_file_paths = [r.file_path for r in merge_candidates] - locks = [lock_file_utils.write_lock(f, is_lock_in_root_folder=True) for f in lock_file_paths] - all_files_lock = MultipleFilesLock(locks) - if not all_files_lock.lock(): - break - - if self._dataset_partition.dataset.metadata.split_columns_to_files: - self._merged_data_folders(aggregation_folder, merge_candidates) - else: - self._merge_data_files(aggregation_folder, merge_candidates) - all_files_lock.unlock() - - def merge_files(self): - for ( - aggregation_period_start, - aggregation_period_end, - ) in self._aggregation_period_utils.iterate_periods_in_date_range( - start_time=self._start_time, end_time=self._end_time - ): - aggregation_period_end -= datetime.timedelta(microseconds=1) - aggregation_folder = self._dataset_partition.data_paths.get_output_folder_name(aggregation_period_start) - self._merge_single_period(aggregation_folder, aggregation_period_start, aggregation_period_end) diff --git a/csp/impl/managed_dataset/managed_dataset_path_resolver.py 
b/csp/impl/managed_dataset/managed_dataset_path_resolver.py deleted file mode 100644 index 7ded37133..000000000 --- a/csp/impl/managed_dataset/managed_dataset_path_resolver.py +++ /dev/null @@ -1,470 +0,0 @@ -import datetime -import glob -import os -from typing import Callable, Dict, List, Optional, Union - -import csp -from csp.impl.constants import UNSET -from csp.impl.managed_dataset.aggregation_period_utils import AggregationPeriodUtils -from csp.impl.managed_dataset.dataset_metadata import OutputType, TimeAggregation -from csp.impl.managed_dataset.dateset_name_constants import DatasetNameConstants - - -class DatasetPartitionPaths: - _FILE_EXTENSION_BY_TYPE = {OutputType.PARQUET: ".parquet"} - _FOLDER_DATA_GLOB_EXPRESSION = ( - "[0-9]" * 8 + "_" + "[0-9]" * 6 + "_" + "[0-9]" * 6 + "-" + "[0-9]" * 8 + "_" + "[0-9]" * 6 + "_" + "[0-9]" * 6 - ) - - DATA_FOLDER = "data" - - def __init__( - self, - dataset_root_folder: str, - dataset_read_folders, - partitioning_values: Dict[str, str] = None, - time_aggregation: TimeAggregation = TimeAggregation.DAY, - ): - self._partition_values = tuple(partitioning_values.values()) - self._time_aggregation = time_aggregation - if self._partition_values: - sub_folder_parts = list(map(str, self._partition_values)) - else: - sub_folder_parts = [] - - self._root_folder = os.path.join(dataset_root_folder, self.DATA_FOLDER, *sub_folder_parts) - self._read_folders = [os.path.join(v, self.DATA_FOLDER, *sub_folder_parts) for v in dataset_read_folders] - self._aggregation_period_utils = AggregationPeriodUtils(time_aggregation) - - @property - def root_folder(self): - return self._root_folder - - @classmethod - def _parse_file_name_times(cls, file_name): - base_name = os.path.basename(file_name) - start = datetime.datetime.strptime(base_name[:22], "%Y%m%d_%H%M%S_%f") - end = datetime.datetime.strptime(base_name[23:45], "%Y%m%d_%H%M%S_%f") - return (start, end) - - def get_period_start_time(self, start_time: datetime.datetime) -> 
datetime.datetime: - """Compute the start of the period for the given timestamp - :param start_time: - :return: - """ - return AggregationPeriodUtils(self._time_aggregation).resolve_period_start(start_time) - - def get_file_cutoff_time(self, start_time: datetime.datetime) -> datetime.datetime: - """Compute the latest time that should be written to the file for which the data start at a given time - :param start_time: - :return: - """ - return AggregationPeriodUtils(self._time_aggregation).resolve_period_end(start_time) - - def _get_existing_data_bound_for_root_folder(self, is_starttime, root_folder, split_columns_to_files): - agg_bound_folder = self._aggregation_period_utils.get_agg_bound_folder( - root_folder=root_folder, is_starttime=is_starttime - ) - if agg_bound_folder is None: - return None - if split_columns_to_files: - all_files = sorted(glob.glob(f"{glob.escape(agg_bound_folder)}/{self._FOLDER_DATA_GLOB_EXPRESSION}")) - else: - all_files = sorted(glob.glob(f"{glob.escape(agg_bound_folder)}/*.parquet")) - if not all_files: - return None - index = 0 if is_starttime else -1 - return self._parse_file_name_times(all_files[index])[index] - - def _iterate_root_and_read_folders(self, include_root_folder=True, include_read_folders=True): - if include_root_folder: - yield self._root_folder - if include_read_folders: - for f in self._read_folders: - yield f - - def _get_existing_data_bound_time( - self, is_starttime, *, split_columns_to_files: bool, include_root_folder=True, include_read_folders=True - ): - res = None - - for root_folder in self._iterate_root_and_read_folders( - include_root_folder=include_root_folder, include_read_folders=include_read_folders - ): - cur_res = self._get_existing_data_bound_for_root_folder(is_starttime, root_folder, split_columns_to_files) - if res is None or (cur_res is not None and ((cur_res < res) == is_starttime)): - res = cur_res - return res - - def _normalize_start_end_time( - self, - starttime: datetime.datetime, - endtime: 
Union[datetime.datetime, datetime.timedelta], - split_columns_to_files: bool, - ): - if starttime is None: - starttime = self._get_existing_data_bound_time(True, split_columns_to_files=split_columns_to_files) - if starttime is None: - return None, None - - if endtime is None: - endtime = self._get_existing_data_bound_time(False, split_columns_to_files=split_columns_to_files) - if endtime is None: - return None, None - elif isinstance(endtime, datetime.timedelta): - endtime = starttime + endtime - return starttime, endtime - - def _list_files_on_disk( - self, - starttime: datetime.datetime, - endtime: Union[datetime.datetime, datetime.timedelta], - split_columns_to_files=False, - return_unused=False, - include_read_folders=True, - ): - if starttime is None or endtime is None: - return [] - - files_with_times = [] - unused_files = [] - for period_start, _ in self._aggregation_period_utils.iterate_periods_in_date_range(starttime, endtime): - file_by_base_name = {} - for root_folder in self._iterate_root_and_read_folders(include_read_folders=include_read_folders): - date_output_folder = self.get_output_folder_name(period_start, root_folder) - if split_columns_to_files: - files = glob.glob(f"{glob.escape(date_output_folder)}/" + self._FOLDER_DATA_GLOB_EXPRESSION) - else: - files = glob.glob(f"{glob.escape(date_output_folder)}/*.parquet") - for f in files: - base_name = os.path.basename(f) - if base_name not in file_by_base_name: - file_by_base_name[base_name] = f - sorted_base_names = sorted(file_by_base_name) - files = [file_by_base_name[f] for f in sorted_base_names] - - for file in files: - file_start, file_end = self._parse_file_name_times(file) - # Files are sorted ascending by start_time, end_time. 
For a given start time, we want to keep the highest end_time - new_record = (file_start, file_end, file) - if files_with_times and files_with_times[-1][0] == file_start: - unused_files.append(files_with_times[-1][-1]) - files_with_times[-1] = new_record - elif files_with_times and files_with_times[-1][1] >= file_end: - # The file is fully included in the previous file range - unused_files.append(file) - else: - files_with_times.append(new_record) - return unused_files if return_unused else files_with_times - - def get_unused_files( - self, - starttime: datetime.datetime, - endtime: Union[datetime.datetime, datetime.timedelta], - split_columns_to_files=False, - ): - starttime, endtime = self._normalize_start_end_time(starttime, endtime, split_columns_to_files) - return self._list_files_on_disk( - starttime=starttime, - endtime=endtime, - split_columns_to_files=split_columns_to_files, - return_unused=True, - include_read_folders=False, - ) - - def get_data_files_in_range( - self, - starttime: datetime.datetime, - endtime: Union[datetime.datetime, datetime.timedelta], - missing_range_handler: Callable[[datetime.datetime, datetime.datetime], bool] = None, - split_columns_to_files=False, - truncate_data_periods=True, - include_read_folders=True, - ): - """Retrieve a list of all files in the given time range (inclusive) - :param starttime: The start time of the period - :param endtime: The end time of the period - :param missing_range_handler: A function that handles missing data. Will be called with (missing_period_starttime, missing_period_endtime); - should return True if the missing data is not an error, and False otherwise (in which case an exception will be raised). - By default, if no missing_range_handler is specified, the function will raise an exception on any missing data.
- :param split_columns_to_files: A boolean that specifies whether the columns are split into separate files - :param truncate_data_periods: A boolean that specifies whether the time period of each file should be truncated to the period that is consumed for a given - time range. For example, consider a file that exists for the period (20210101-20210201) and we pass in starttime=20210115 and endtime=20210116; then - for the file above the period (key of the returned dict) will be truncated to (20210115,20210116); if the flag is set to False then - (20210101,20210201) will be returned as a key instead. - :param include_read_folders: A boolean that specifies whether the files in "read_folders" should be included - :returns A tuple (files, full_coverage) where files is a dictionary of period->file_path and full_coverage is a boolean that is True only - if the whole requested period is covered by the files, False otherwise - """ - starttime, endtime = self._normalize_start_end_time(starttime, endtime, split_columns_to_files) - # It's a boolean, but since we need to modify it from within an internal function, we make it a one-element list - full_coverage = [True] - - def handle_missing_period_error_reporting(start, end): - if not missing_range_handler or not missing_range_handler(start, end): - raise RuntimeError(f"Missing cache data for range {start} to {end}") - full_coverage[0] = False - - res = {} - - files_with_times = self._list_files_on_disk( - starttime=starttime, - endtime=endtime, - split_columns_to_files=split_columns_to_files, - include_read_folders=include_read_folders, - ) - - if starttime: - for period_start, _ in self._aggregation_period_utils.iterate_periods_in_date_range(starttime, endtime): - prev_end = None - for file_start, file_end, file in files_with_times: - file_new_data_start = file_start - - if prev_end is not None and prev_end >= file_start: - if file_end <= prev_end: - # The period of this file is fully covered in the previous one - continue -
if truncate_data_periods: - file_new_data_start = prev_end + datetime.timedelta(microseconds=1) - - if ( - (starttime <= file_new_data_start <= endtime) - or (starttime <= file_end <= endtime) - or (file_new_data_start <= starttime <= endtime <= file_end) - ): - if truncate_data_periods and starttime > file_new_data_start: - file_new_data_start = starttime - if file_end > endtime and truncate_data_periods: - file_end = endtime - res[(file_new_data_start, file_end)] = file - prev_end = file_end - - if not res: - if starttime is not None or endtime is not None: - handle_missing_period_error_reporting(starttime, endtime) - return {}, False - else: - ONE_MICRO = datetime.timedelta(microseconds=1) - - dict_iter = iter(res.keys()) - period_start, period_end = next(dict_iter) - if period_start > starttime: - handle_missing_period_error_reporting(starttime, period_start - ONE_MICRO) - - for cur_start, cur_end in dict_iter: - if cur_start > period_end + ONE_MICRO: - handle_missing_period_error_reporting(period_end + ONE_MICRO, cur_start - ONE_MICRO) - period_end = cur_end - if period_end < endtime: - handle_missing_period_error_reporting(period_end + ONE_MICRO, endtime) - - return res, full_coverage[0] - - def get_output_folder_name(self, start_time: Union[datetime.datetime, datetime.date], root_folder=None): - root_folder = root_folder or self._root_folder - return os.path.join(root_folder, self._aggregation_period_utils.get_sub_folder_name(start_time)) - - def get_output_file_name( - self, - start_time: datetime.datetime, - end_time: datetime.datetime, - output_type: OutputType = OutputType.PARQUET, - split_columns_to_files: bool = False, - ): - assert end_time >= start_time - if output_type not in (OutputType.PARQUET,): - raise NotImplementedError(f"Unsupported output type: {output_type}") - - output_folder = self.get_output_folder_name(start_time=start_time) - assert end_time <= self._aggregation_period_utils.resolve_period_end(start_time, exclusive_end=False) - if 
split_columns_to_files: - file_extension = "" - else: - file_extension = self._FILE_EXTENSION_BY_TYPE[output_type] - return os.path.join( - output_folder, - f"{start_time.strftime('%Y%m%d_%H%M%S_%f')}-{end_time.strftime('%Y%m%d_%H%M%S_%f')}{file_extension}", - ) - - -class DatasetPartitionKey: - def __init__(self, value_dict): - self._value_dict = value_dict - self._key = None - - @property - def kwargs(self): - return self._value_dict - - def _get_key(self): - if self._key is None: - self._key = tuple(self._value_dict.items()) - return self._key - - def __str__(self): - return f"DatasetPartitionKey({self._value_dict})" - - def __repr__(self): - return str(self) - - def __eq__(self, other): - if not isinstance(other, DatasetPartitionKey): - return False - return self._get_key() == other._get_key() - - def __hash__(self): - return hash(self._get_key()) - - -class DatasetPaths(object): - DATASET_METADATA_FILE_NAME = "dataset_meta.yml" - - def __init__( - self, - parent_folder: str, - read_folders: str, - name: str, - time_aggregation=TimeAggregation.DAY, - data_category: Optional[List[str]] = None, - ): - self._name = name - self._time_aggregation = time_aggregation - self._data_category = data_category - - # Note we must call the list on data_category since we want a copy that we're going to modify - dataset_sub_folder_parts = list(data_category) if data_category else [] - dataset_sub_folder_parts.append(name) - self._dataset_sub_folder_parts_str = os.path.join(*dataset_sub_folder_parts) - self._parent_folder = parent_folder - self._dataset_root_folder = os.path.abspath(os.path.join(parent_folder, self._dataset_sub_folder_parts_str)) - self._dataset_read_root_folders = ( - [os.path.abspath(os.path.join(v, *dataset_sub_folder_parts)) for v in read_folders] if read_folders else [] - ) - - def get_partition_paths(self, partitioning_values: Dict[str, str] = None): - return DatasetPartitionPaths( - self.root_folder, - self._dataset_read_root_folders, - 
partitioning_values, - time_aggregation=self._time_aggregation, - ) - - @property - def parent_folder(self): - return self._parent_folder - - @property - def root_folder(self): - return self._dataset_root_folder - - @classmethod - def _get_metadata_file_path(cls, root_folder): - return os.path.join(root_folder, cls.DATASET_METADATA_FILE_NAME) - - def get_metadata_file_path(self, existing: bool): - """ - Get the metadata file path. If "existing" is True, any metadata file from either the root folder or the read folders will be returned (whichever exists), or None if no - metadata file exists. If "existing" is False, the metadata path for the "root_folder" will be returned, whether or not it exists. - :param existing: - :return: - """ - if not existing: - return os.path.join(self.root_folder, self.DATASET_METADATA_FILE_NAME) - - for folder in self._iter_root_folders(True): - file_path = os.path.join(folder, self.DATASET_METADATA_FILE_NAME) - if os.path.exists(file_path): - return file_path - return None - - def _iter_root_folders(self, use_read_folders): - yield self._dataset_root_folder - if use_read_folders: - for f in self._dataset_read_root_folders: - yield f - - def _resolve_partitions_recursively(self, metadata, cur_path, columns, column_index=0): - if column_index >= len(columns): - yield {} - return - - col_name = columns[column_index] - col_type = metadata.partition_columns[col_name] - - for sub_folder in os.listdir(cur_path): - cur_value, sub_folder_full = self._load_value_from_path(cur_path, sub_folder, col_type) - if cur_value is not UNSET: - for res in self._resolve_partitions_recursively(metadata, sub_folder_full, columns, column_index + 1): - d = {col_name: cur_value} - d.update(**res) - yield d - - def _load_value_from_path(self, cur_path, sub_folder, col_type): - cur_value = UNSET - sub_folder_full = os.path.join(cur_path, sub_folder) - if issubclass(col_type, csp.Struct): - if os.path.isdir(sub_folder_full) and sub_folder.startswith("struct_"): -
value_file = os.path.join(sub_folder_full, DatasetNameConstants.PARTITION_ARGUMENT_FILE_NAME) - if os.path.exists(value_file): - with open(value_file, "r") as f: - cur_value = col_type.from_yaml(f.read()) - elif col_type in (int, float, str): - try: - cur_value = col_type(sub_folder) - except ValueError: - pass - elif col_type is datetime.date: - try: - cur_value = datetime.datetime.strptime(sub_folder, "%Y%m%d_000000_000000").date() - except ValueError: - pass - elif col_type is datetime.datetime: - try: - cur_value = datetime.datetime.strptime(sub_folder, "%Y%m%d_%H%M%S_%f") - except ValueError: - pass - elif col_type is datetime.timedelta: - try: - if sub_folder.startswith("td_") and sub_folder.endswith("us"): - cur_value = datetime.timedelta(microseconds=int(sub_folder[3:-2])) - except ValueError: - pass - elif col_type is bool: - if sub_folder == "True": - cur_value = True - elif sub_folder == "False": - cur_value = False - else: - raise RuntimeError(f"Unsupported partition value type {col_type}: {sub_folder}") - return cur_value, sub_folder_full - - def get_partition_keys(self, metadata): - if not hasattr(metadata, "partition_columns") or not metadata.partition_columns: - return [DatasetPartitionKey({})] - - results_set = set() - results = [] - - columns = list(metadata.partition_columns) - for root_folder in self._iter_root_folders(True): - data_folder = os.path.join(root_folder, DatasetPartitionPaths.DATA_FOLDER) - for res in self._resolve_partitions_recursively(metadata, data_folder, columns=columns): - key = DatasetPartitionKey(res) - if key not in results_set: - results_set.add(key) - results.append(key) - return results - - def resolve_lock_file_path(self, desired_path, use_read_folders): - """ - :param desired_path: The desired path of the lock as if it was in the data folder (this path is modified to a separate path) - :param use_read_folders: A boolean flag specifying whether the read folders should be tried as the prefix for the current
desired path - :return: A tuple of (parent_folder, file_path) where parent_folder is the LAST non lock specific folder in the path (anything after this is lock specific and should - be created with different permissions) - """ - for f in self._iter_root_folders(use_read_folders=use_read_folders): - if os.path.commonprefix((desired_path, f)) == f: - parent_folder = f[: -len(self._dataset_sub_folder_parts_str)] - rel_path = os.path.relpath(desired_path, parent_folder) - return parent_folder, os.path.join(parent_folder, ".locks", rel_path) - raise RuntimeError(f"Unable to resolve lock file path for file {desired_path}") diff --git a/csp/impl/managed_dataset/managed_parquet_writer.py b/csp/impl/managed_dataset/managed_parquet_writer.py deleted file mode 100644 index c1b8ba414..000000000 --- a/csp/impl/managed_dataset/managed_parquet_writer.py +++ /dev/null @@ -1,340 +0,0 @@ -import datetime -import os -from typing import Dict, Optional, TypeVar, Union - -import csp -from csp.adapters.parquet import ParquetOutputConfig, ParquetWriter -from csp.impl.managed_dataset.cache_user_custom_object_serializer import CacheObjectSerializer -from csp.impl.managed_dataset.dateset_name_constants import DatasetNameConstants -from csp.impl.managed_dataset.managed_dataset import ManagedDatasetPartition -from csp.impl.managed_dataset.managed_dataset_merge_utils import _create_wip_file -from csp.impl.wiring import Context -from csp.impl.wiring.cache_support.partition_files_container import PartitionFileContainer -from csp.impl.wiring.outputs import OutputsContainer -from csp.impl.wiring.special_output_names import ALL_SPECIAL_OUTPUT_NAMES, CSP_CACHE_ENABLED_OUTPUT - -T = TypeVar("T") - - -def _pa(): - """ - Lazy import pyarrow - """ - import pyarrow - - return pyarrow - - -def _create_output_file_or_folder(data_paths, cur_file_start_time, split_columns_to_files): - output_folder = data_paths.get_output_folder_name(start_time=cur_file_start_time) - if not os.path.exists(output_folder): - 
os.makedirs(output_folder, exist_ok=True) - s_cur_file_path = _create_wip_file(output_folder, cur_file_start_time, is_folder=split_columns_to_files) - return s_cur_file_path - - -def _generate_empty_parquet_files(dataset_partition, existing_file, files_to_generate, parquet_output_config): - if not files_to_generate: - return - - if os.path.isdir(existing_file): - file_schemas = { - f: _pa().parquet.ParquetFile(os.path.join(existing_file, f)).schema.to_arrow_schema() - for f in os.listdir(existing_file) - if f.endswith(".parquet") - } - for (s, e), dir_name in files_to_generate.items(): - for f_name, schema in file_schemas.items(): - with _pa().parquet.ParquetWriter( - os.path.join(dir_name, f_name), - schema=schema, - compression=parquet_output_config.compression, - version=ParquetWriter.PARQUET_VERSION, - ): - pass - PartitionFileContainer.get_instance().add_generated_file( - dataset_partition, s, e, dir_name, parquet_output_config - ) - else: - file_info = _pa().parquet.ParquetFile(existing_file) - schema = file_info.schema.to_arrow_schema() - - for (s, e), f_name in files_to_generate.items(): - with _pa().parquet.ParquetWriter( - f_name, - schema=schema, - compression=parquet_output_config.compression, - version=ParquetWriter.PARQUET_VERSION, - ): - PartitionFileContainer.get_instance().add_generated_file( - dataset_partition, s, e, f_name, parquet_output_config - ) - - -@csp.node -def _cache_filename_provider_custom_time( - dataset_partition: ManagedDatasetPartition, - config: Optional[ParquetOutputConfig], - split_columns_to_files: Optional[bool], - timestamp_ts: csp.ts[datetime.datetime], -) -> csp.ts[str]: - with csp.state(): - s_data_paths = dataset_partition.data_paths - s_last_start_time = None - s_cur_file_path = None - s_cur_file_cutoff_time = None - s_empty_files_to_generate = {} - s_last_closed_file = None - - with csp.start(): - config = config.copy() if config is not None else ParquetOutputConfig() - config.resolve_compression() - - with csp.stop(): 
- # We need to check that s_cur_file_path since, if the engine had a startup error, s_cur_file_path is undefined - if "s_cur_file_path" in locals() and s_cur_file_path: - PartitionFileContainer.get_instance().add_generated_file( - dataset_partition, s_last_start_time, timestamp_ts, s_cur_file_path, config - ) - _generate_empty_parquet_files(dataset_partition, s_last_closed_file, s_empty_files_to_generate, config) - - if csp.ticked(timestamp_ts): - if s_cur_file_cutoff_time is None: - s_last_start_time = timestamp_ts - s_cur_file_cutoff_time = s_data_paths.get_file_cutoff_time(s_last_start_time) - s_cur_file_path = _create_output_file_or_folder(s_data_paths, s_last_start_time, split_columns_to_files) - return s_cur_file_path - elif timestamp_ts >= s_cur_file_cutoff_time: - PartitionFileContainer.get_instance().add_generated_file( - dataset_partition, - s_last_start_time, - s_cur_file_cutoff_time - datetime.timedelta(microseconds=1), - s_cur_file_path, - config, - ) - s_last_closed_file = s_cur_file_path - s_last_start_time = s_cur_file_cutoff_time - s_cur_file_cutoff_time = s_data_paths.get_file_cutoff_time(s_last_start_time) - # There might be some empty files in the middle, we need to take care of this by creating a bunch of empty files on the way - while s_cur_file_cutoff_time <= timestamp_ts: - s_cur_file_path = _create_output_file_or_folder(s_data_paths, s_last_start_time, split_columns_to_files) - s_empty_files_to_generate[ - (s_last_start_time, s_cur_file_cutoff_time - datetime.timedelta(microseconds=1)) - ] = s_cur_file_path - s_last_start_time = s_cur_file_cutoff_time - s_cur_file_cutoff_time = s_data_paths.get_file_cutoff_time(s_last_start_time) - - s_cur_file_path = _create_output_file_or_folder(s_data_paths, s_last_start_time, split_columns_to_files) - return s_cur_file_path - - -def _finalize_current_output_file( - data_paths, config, dataset_partition, now, cur_file_path, split_columns_to_files, last_start_time, cache_enabled -): - if cur_file_path: -
PartitionFileContainer.get_instance().add_generated_file( - dataset_partition, last_start_time, now - datetime.timedelta(microseconds=1), cur_file_path, config - ) - - if cache_enabled: - output_folder = data_paths.get_output_folder_name(start_time=now) - if not os.path.exists(output_folder): - os.makedirs(output_folder, exist_ok=True) - return _create_wip_file(output_folder, now, is_folder=split_columns_to_files) - else: - return "" - - -@csp.node -def _cache_filename_provider( - dataset_partition: ManagedDatasetPartition, - config: Optional[ParquetOutputConfig], - split_columns_to_files: Optional[bool], - cache_control_ts: csp.ts[bool], - default_cache_enabled: bool, -) -> csp.ts[str]: - with csp.alarms(): - a_update_file_alarm = csp.alarm(bool) - - with csp.state(): - s_data_paths = dataset_partition.data_paths - s_last_start_time = None - s_cur_file_path = None - s_cache_enabled = default_cache_enabled - - with csp.start(): - config = config if config is not None else ParquetOutputConfig() - csp.schedule_alarm(a_update_file_alarm, datetime.timedelta(), False) - - with csp.stop(): - # We need to check that s_cur_file_path since, if the engine had a startup error, s_cur_file_path is undefined - if "s_cur_file_path" in locals() and s_cur_file_path: - if s_cache_enabled: - PartitionFileContainer.get_instance().add_generated_file( - dataset_partition, s_last_start_time, csp.now(), s_cur_file_path, config - ) - - if csp.ticked(cache_control_ts): - if cache_control_ts: - # We didn't write and need to start writing - if not s_cache_enabled: - s_cache_enabled = True - s_cur_file_path = _finalize_current_output_file( - s_data_paths, - config, - dataset_partition, - csp.now(), - s_cur_file_path, - split_columns_to_files, - s_last_start_time, - s_cache_enabled, - ) - s_last_start_time = csp.now() - cutoff_time = s_data_paths.get_file_cutoff_time(s_last_start_time) - csp.schedule_alarm(a_update_file_alarm, cutoff_time, False) - return s_cur_file_path - else: - # It's a bit
ugly for now, we will keep writing even when cache is disabled but then we will throw away the written data. - # we need a better way to address this in the future - if s_cache_enabled: - s_cache_enabled = False - s_cur_file_path = _finalize_current_output_file( - s_data_paths, - config, - dataset_partition, - csp.now(), - s_cur_file_path, - split_columns_to_files, - s_last_start_time, - s_cache_enabled, - ) - s_last_start_time = csp.now() - cutoff_time = s_data_paths.get_file_cutoff_time(s_last_start_time) - csp.schedule_alarm(a_update_file_alarm, cutoff_time, False) - return s_cur_file_path - - if csp.ticked(a_update_file_alarm) and s_last_start_time != csp.now(): - s_cur_file_path = _finalize_current_output_file( - s_data_paths, - config, - dataset_partition, - csp.now(), - s_cur_file_path, - split_columns_to_files, - s_last_start_time, - s_cache_enabled, - ) - s_last_start_time = csp.now() - cutoff_time = s_data_paths.get_file_cutoff_time(s_last_start_time) - csp.schedule_alarm(a_update_file_alarm, cutoff_time, False) - return s_cur_file_path - - -@csp.node -def _serialize_value(value: csp.ts["T"], type_serializer: CacheObjectSerializer) -> csp.ts[bytes]: - if csp.ticked(value): - csp.output(type_serializer.serialize_to_bytes(value)) - - -def create_managed_parquet_writer_node( - function_name: str, - dataset_partition: ManagedDatasetPartition, - values: OutputsContainer, - field_mapping: Dict[str, Union[str, Dict[str, str]]], - config: Optional[ParquetOutputConfig] = None, - data_timestamp_column_name=None, - controlled_cache: bool = False, - default_cache_enabled: bool = True, -): - metadata = dataset_partition.dataset.metadata - if data_timestamp_column_name is None: - timestamp_column_name = getattr(metadata, "timestamp_column_name", None) - else: - timestamp_column_name = data_timestamp_column_name - config = config.copy() if config else ParquetOutputConfig() - config.allow_overwrite = True - cache_serializers = 
Context.instance().config.cache_config.cache_serializers
-
-    split_columns_to_files = metadata.split_columns_to_files
-
-    if controlled_cache:
-        cache_control_ts = values[CSP_CACHE_ENABLED_OUTPUT]
-    else:
-        cache_control_ts = csp.const(True)
-        default_cache_enabled = True
-
-    if not isinstance(values, OutputsContainer):
-        values = OutputsContainer(**{DatasetNameConstants.UNNAMED_OUTPUT_NAME: values})
-
-    if data_timestamp_column_name and data_timestamp_column_name != DatasetNameConstants.CSP_TIMESTAMP:
-        timestamp_ts = values
-        for k in data_timestamp_column_name.split("."):
-            timestamp_ts = getattr(timestamp_ts, k)
-        writer = ParquetWriter(
-            file_name=None,
-            timestamp_column_name=None,
-            config=config,
-            filename_provider=_cache_filename_provider_custom_time(
-                dataset_partition=dataset_partition,
-                config=config,
-                split_columns_to_files=split_columns_to_files,
-                timestamp_ts=timestamp_ts,
-            ),
-            split_columns_to_files=split_columns_to_files,
-        )
-    else:
-        writer = ParquetWriter(
-            file_name=None,
-            timestamp_column_name=timestamp_column_name,
-            config=config,
-            filename_provider=_cache_filename_provider(
-                dataset_partition=dataset_partition,
-                config=config,
-                split_columns_to_files=split_columns_to_files,
-                cache_control_ts=cache_control_ts,
-                default_cache_enabled=default_cache_enabled,
-            ),
-            split_columns_to_files=split_columns_to_files,
-        )
-
-    all_columns = set()
-    for key, value in values._items():
-        if key in ALL_SPECIAL_OUTPUT_NAMES:
-            continue
-        if isinstance(value, dict):
-            basket_metadata = metadata.dict_basket_columns[key]
-            writer.publish_dict_basket(
-                key, value, key_type=basket_metadata.key_type, value_type=basket_metadata.value_type
-            )
-        elif isinstance(value.tstype.typ, type) and issubclass(value.tstype.typ, csp.Struct):
-            s_field_map = field_mapping.get(key)
-            for k, v in s_field_map.items():
-                try:
-                    if v in all_columns:
-                        raise RuntimeError(f"Found multiple writers of column {v}")
-                except TypeError:
-                    raise RuntimeError(f"Invalid cache field name mapping: {v}")
-                all_columns.add(v)
-
-            writer.publish_struct(value, field_map=field_mapping.get(key))
-        else:
-            col_name = field_mapping.get(key, key)
-            try:
-                if col_name in all_columns:
-                    raise RuntimeError(f"Found multiple writers of column {col_name}")
-            except TypeError:
-                raise RuntimeError(f"Invalid cache field name mapping: {col_name}")
-            all_columns.add(col_name)
-            type_serializer = cache_serializers.get(value.tstype.typ)
-            if type_serializer:
-                writer.publish(col_name, _serialize_value(value, type_serializer))
-            else:
-                writer.publish(col_name, value)
-    if (
-        data_timestamp_column_name
-        and data_timestamp_column_name not in all_columns
-        and data_timestamp_column_name != DatasetNameConstants.CSP_TIMESTAMP
-    ):
-        raise RuntimeError(
-            f"{data_timestamp_column_name} specified as timestamp column but no writers for this column found"
-        )
diff --git a/csp/impl/mem_cache.py b/csp/impl/mem_cache.py
index f4886465e..784b3de2f 100644
--- a/csp/impl/mem_cache.py
+++ b/csp/impl/mem_cache.py
@@ -1,9 +1,9 @@
 import copy
 import inspect
-import logging
 import threading
 from collections import namedtuple
 from functools import wraps
+from warnings import warn
 
 from csp.impl.constants import UNSET
 
@@ -149,11 +149,6 @@ def _preprocess_args(args):
         yield (arg_name, normalize_arg(arg_value))
 
 
-class _WarnedFlag:
-    def __init__(self):
-        self.value = False
-
-
 def function_full_name(f):
     """A utility function that can be used for implementation of function_name for csp_memoized_graph_object
     :param f:
@@ -177,7 +172,6 @@ def csp_memoized(func=None, *, force_memoize=False, function_name=None, is_user_
     :param is_user_data: A flag that specifies whether the memoized object is user object or graph object
     :return:
     """
-    warned_flag = _WarnedFlag()
 
     def _impl(func):
         func_args = _resolve_func_args(func)
@@ -204,10 +198,8 @@ def __call__(*args, **kwargs):
             except TypeError as e:
                 if force_memoize:
                     raise
-                if not warned_flag.value:
-                    logging_context = function_name if function_name else str(func)
-                    logging.debug(f"Not memoizing output of {str(logging_context)}: {str(e)}")
-                    warned_flag.value = True
+                logging_context = function_name if function_name else str(func)
+                warn(f"Not memoizing output of {str(logging_context)}: {str(e)}", Warning)
                 cur_item = func(*args, **kwargs)
             else:
                 if cur_item is UNSET:
diff --git a/csp/impl/types/instantiation_type_resolver.py b/csp/impl/types/instantiation_type_resolver.py
index 22e5e66cd..baafef2ee 100644
--- a/csp/impl/types/instantiation_type_resolver.py
+++ b/csp/impl/types/instantiation_type_resolver.py
@@ -21,7 +21,7 @@ def __init__(self):
         self._type_registry: typing.Dict[typing.Tuple[type, type], type] = {}
         self._add_type_upcast(int, float, float)
 
-    def resolve_type(self, expected_type: type, new_type: type, allow_subtypes: bool, raise_on_error=True):
+    def resolve_type(self, expected_type: type, new_type: type, raise_on_error=True):
         if expected_type == new_type:
             return expected_type
         if expected_type is object or new_type is object:
@@ -57,7 +57,7 @@ def resolve_type(self, expected_type: type, new_type: type, allow_subtypes: bool
             else:
                 return None
 
-        if allow_subtypes and inspect.isclass(expected_type) and inspect.isclass(new_type):
+        if inspect.isclass(expected_type) and inspect.isclass(new_type):
             if issubclass(expected_type, new_type):
                 # Generally if B inherits from A, we want to resolve from A, the only exception
                 # is "Generic types". Dict[int, int] inherits from dict but we want the type to be resolved to the generic type
@@ -203,7 +203,6 @@ def __init__(
         values: typing.List[object],
         forced_tvars: typing.Union[typing.Dict[str, typing.Type], None],
         is_input=True,
-        allow_subtypes=True,
         allow_none_ts=False,
     ):
         self._function_name = function_name
@@ -211,7 +210,6 @@ def __init__(
         self._arguments = values
         self._forced_tvars = forced_tvars
         self._def_name = "inputdef" if is_input else "outputdef"
-        self._allow_subtypes = allow_subtypes
         self._allow_none_ts = allow_none_ts
         self._tvars: typing.Dict[str, type] = {}
 
@@ -318,9 +316,7 @@ def _rec_validate_type_spec_vs_type_spec_and_resolve_tvars(
                 return False
         else:
             # At this point it must be a scalar value
-            res_type = UpcastRegistry.instance().resolve_type(
-                expected_sub_type, actual_sub_type, allow_subtypes=self._allow_subtypes, raise_on_error=False
-            )
+            res_type = UpcastRegistry.instance().resolve_type(expected_sub_type, actual_sub_type, raise_on_error=False)
             return res_type is expected_sub_type
         return True
 
@@ -391,12 +387,7 @@ def _add_scalar_value(self, arg, in_out_def):
     def _is_scalar_value_matching_spec(self, inp_def_type, arg):
         if inp_def_type is typing.Any:
             return True
-        if (
-            UpcastRegistry.instance().resolve_type(
-                inp_def_type, type(arg), allow_subtypes=self._allow_subtypes, raise_on_error=False
-            )
-            is inp_def_type
-        ):
+        if UpcastRegistry.instance().resolve_type(inp_def_type, type(arg), raise_on_error=False) is inp_def_type:
             return True
         if CspTypingUtils.is_union_type(inp_def_type):
             types = inp_def_type.__args__
@@ -533,9 +524,7 @@ def _add_t_var_resolution(self, tvar, resolved_type, arg=None):
                 self._raise_arg_mismatch_error(arg=self._cur_arg, tvar_info={tvar: old_tvar_type})
             return
 
-        combined_type = UpcastRegistry.instance().resolve_type(
-            resolved_type, old_tvar_type, allow_subtypes=self._allow_subtypes, raise_on_error=False
-        )
+        combined_type = UpcastRegistry.instance().resolve_type(resolved_type, old_tvar_type, raise_on_error=False)
         if combined_type is None:
             conflicting_tvar_types = self._conflicting_tvar_types.get(tvar)
             if conflicting_tvar_types is None:
@@ -605,9 +594,7 @@ def _try_resolve_tvar_conflicts(self):
             assert resolved_type, f'"{tvar}" was not resolved'
             for conflicting_type in conflicting_types:
                 if (
-                    UpcastRegistry.instance().resolve_type(
-                        resolved_type, conflicting_type, allow_subtypes=self._allow_subtypes, raise_on_error=False
-                    )
+                    UpcastRegistry.instance().resolve_type(resolved_type, conflicting_type, raise_on_error=False)
                     is not resolved_type
                 ):
                     raise TypeError(
@@ -627,7 +614,6 @@ def __init__(
         input_definitions: typing.Tuple[InputDef],
         arguments: typing.List[object],
        forced_tvars: typing.Union[typing.Dict[str, typing.Type], None],
-        allow_subtypes: bool = True,
         allow_none_ts: bool = False,
     ):
         self._scalar_inputs: typing.List[object] = []
@@ -637,7 +623,6 @@ def __init__(
             input_or_output_definitions=input_definitions,
             values=arguments,
             forced_tvars=forced_tvars,
-            allow_subtypes=allow_subtypes,
             allow_none_ts=allow_none_ts,
         )
 
@@ -692,14 +677,12 @@ def __init__(
         output_definitions: typing.Tuple[OutputDef],
         values: typing.List[object],
         forced_tvars: typing.Union[typing.Dict[str, typing.Type], None],
-        allow_subtypes=True,
     ):
         super().__init__(
             function_name=function_name,
             input_or_output_definitions=output_definitions,
             values=values,
             forced_tvars=forced_tvars,
-            allow_subtypes=allow_subtypes,
             allow_none_ts=False,
         )
diff --git a/csp/impl/wiring/base_parser.py b/csp/impl/wiring/base_parser.py
index 760d803d4..5b90fc6d6 100644
--- a/csp/impl/wiring/base_parser.py
+++ b/csp/impl/wiring/base_parser.py
@@ -20,11 +20,9 @@
     OutputTypeError,
 )
 from csp.impl.types.container_type_normalizer import ContainerTypeNormalizer
-from csp.impl.types.tstype import TsType
 from csp.impl.types.type_annotation_normalizer_transformer import TypeAnnotationNormalizerTransformer
 from csp.impl.types.typing_utils import CspTypingUtils
 from csp.impl.warnings import WARN_PYTHONIC
-from csp.impl.wiring.special_output_names import CSP_CACHE_ENABLED_OUTPUT
 
 LEGACY_METHODS = {"__alarms__", "__state__", "__start__", "__stop__", "__outputs__", "__return__"}
 
@@ -92,7 +90,7 @@ def wrapper(*args, **kwargs):
 class BaseParser(ast.NodeTransformer, metaclass=ABCMeta):
     _DEBUG_PARSE = False
 
-    def __init__(self, name, raw_func, func_frame, debug_print=False, add_cache_control_output=False):
+    def __init__(self, name, raw_func, func_frame, debug_print=False):
         self._name = name
         self._outputs = []
         self._special_outputs = tuple()
@@ -115,7 +113,6 @@ def __init__(self, name, raw_func, func_frame, debug_print=False, add_cache_cont
         body = ast.parse(source)
         self._funcdef = body.body[0]
         self._type_annotation_normalizer.normalize_type_annotations(self._funcdef)
-        self._add_cache_control_output = add_cache_control_output
 
     def _eval_expr(self, exp):
         return eval(
@@ -548,15 +545,3 @@ def _postprocess_basket_outputs(self, main_func_signature, enforce_shape_for_bas
                 output.typ.shape_func = self._compile_function(shape_func)
             else:
                 output.typ.shape_func = lambda *args, s=output.typ.shape: s
-
-    def _resolve_special_outputs(self):
-        if self._add_cache_control_output:
-            self._special_outputs += (
-                OutputDef(
-                    name=CSP_CACHE_ENABLED_OUTPUT,
-                    typ=TsType[bool],
-                    kind=ArgKind.TS,
-                    ts_idx=self._outputs[-1].ts_idx + 1,
-                    shape=None,
-                ),
-            )
diff --git a/csp/impl/wiring/cache_support/__init__.py b/csp/impl/wiring/cache_support/__init__.py
deleted file mode 100644
index e69de29bb..000000000
diff --git a/csp/impl/wiring/cache_support/cache_config_resolver.py b/csp/impl/wiring/cache_support/cache_config_resolver.py
deleted file mode 100644
index a1b512332..000000000
--- a/csp/impl/wiring/cache_support/cache_config_resolver.py
+++ /dev/null
@@ -1,22 +0,0 @@
-from typing import List
-
-from csp.impl.config import CacheConfig
-
-
-class CacheConfigResolver:
-    def __init__(self, cache_config: CacheConfig):
-        from csp.impl.wiring.cache_support.graph_building import CacheCategoryOverridesTree
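[Editor's note: the `mem_cache.py` hunk earlier in this patch drops the one-shot `_WarnedFlag` guard and instead calls `warnings.warn` every time memoization fails on unhashable arguments. A minimal sketch of that fallback pattern, using a standalone `memoized` decorator invented for illustration — not csp's actual `csp_memoized` implementation:]

```python
import warnings
from functools import wraps


def memoized(func):
    """Cache results by argument tuple; warn and recompute when args are unhashable.

    Illustrative only -- mirrors the warn-and-fall-through behavior the patch
    introduces, not csp's real memoization machinery.
    """
    cache = {}

    @wraps(func)
    def wrapper(*args):
        try:
            if args in cache:
                return cache[args]
        except TypeError as e:  # unhashable argument, e.g. a list
            warnings.warn(f"Not memoizing output of {func}: {e}", Warning)
            return func(*args)
        result = cache[args] = func(*args)
        return result

    return wrapper


@memoized
def total(xs):
    return sum(xs)


total((1, 2, 3))  # hashable tuple: result is cached
total([1, 2, 3])  # list is unhashable: warns and recomputes on every call
```

[Unlike the removed `_WarnedFlag` version, nothing de-duplicates the warning here; Python's default warning filter collapses repeats per call site, which is presumably why the flag became unnecessary.]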
- - self._cache_config = cache_config - if cache_config: - self._cache_category_overrides = CacheCategoryOverridesTree.construct_from_cache_config(cache_config) - self._graph_overrides = getattr(cache_config, "graph_overrides", {}) - else: - self._cache_category_overrides = None - self._graph_overrides = None - - def resolve_cache_config(self, graph: object, category: List[str]): - resolved_config = self._graph_overrides.get(graph, None) - if resolved_config is None: - resolved_config = self._cache_category_overrides.resolve_root_folder(category) - return resolved_config diff --git a/csp/impl/wiring/cache_support/cache_type_mapper.py b/csp/impl/wiring/cache_support/cache_type_mapper.py deleted file mode 100644 index f2a697025..000000000 --- a/csp/impl/wiring/cache_support/cache_type_mapper.py +++ /dev/null @@ -1,55 +0,0 @@ -import datetime -from typing import Union - -import csp.typing -from csp.impl.types.typing_utils import CspTypingUtils -from csp.utils.qualified_name_utils import QualifiedNameUtils - - -class CacheTypeMapper: - STRING_TO_TYPE_MAPPING = { - "datetime": datetime.datetime, - "date": datetime.date, - "timedelta": datetime.timedelta, - "int": int, - "float": float, - "str": str, - "bool": bool, - } - TYPE_TO_STRING_MAPPING = {v: k for k, v in STRING_TO_TYPE_MAPPING.items()} - ARRAY_TYPE_NAME_TO_TYPE = { - "ARRAY": csp.typing.Numpy1DArray, - "MULTI_DIM_ARRAY": csp.typing.NumpyNDArray, - } - ARRAY_TYPE_TO_TYPE_NAME = {v: k for k, v in ARRAY_TYPE_NAME_TO_TYPE.items()} - - @classmethod - def json_to_type(cls, typ: Union[str, dict]): - if isinstance(typ, str): - python_type = cls.STRING_TO_TYPE_MAPPING.get(typ) - if python_type is None: - python_type = QualifiedNameUtils.get_object_from_qualified_name(typ) - if python_type is None: - raise TypeError(f"Unsupported arrow serialization type {typ}") - return python_type - else: - array_type = None - if isinstance(typ, dict) and len(typ) == 1: - typ_key, typ_value = next(iter(typ.items())) - array_type = 
cls.ARRAY_TYPE_NAME_TO_TYPE.get(typ_key) - if array_type is None: - raise TypeError(f"Trying to deserialize invalid type: {typ}") - return array_type[cls.json_to_type(typ_value)] - - @classmethod - def type_to_json(cls, typ): - str_type = cls.TYPE_TO_STRING_MAPPING.get(typ) - if str_type is None: - if CspTypingUtils.is_generic_container(typ): - origin = CspTypingUtils.get_origin(typ) - type_name = cls.ARRAY_TYPE_TO_TYPE_NAME.get(origin) - if type_name is not None: - return {type_name: cls.type_to_json(typ.__args__[0])} - - return QualifiedNameUtils.get_qualified_object_name(typ) - return str_type diff --git a/csp/impl/wiring/cache_support/dataset_partition_cached_data.py b/csp/impl/wiring/cache_support/dataset_partition_cached_data.py deleted file mode 100644 index 4b227aad6..000000000 --- a/csp/impl/wiring/cache_support/dataset_partition_cached_data.py +++ /dev/null @@ -1,662 +0,0 @@ -import datetime -import itertools -import logging -import numpy -import os -import pytz -import shutil -from concurrent.futures.thread import ThreadPoolExecutor -from typing import Callable, Dict, List, Optional - -import csp -from csp.adapters.output_adapters.parquet import resolve_array_shape_column_name -from csp.impl.managed_dataset.aggregation_period_utils import AggregationPeriodUtils -from csp.impl.managed_dataset.dateset_name_constants import DatasetNameConstants -from csp.impl.types.typing_utils import CspTypingUtils - - -class DataSetCachedData: - def __init__(self, dataset, cache_serializers, data_set_partition_calculator_func): - self._dataset = dataset - self._cache_serializers = cache_serializers - self._data_set_partition_calculator_func = data_set_partition_calculator_func - - def get_partition_keys(self): - return self._dataset.data_paths.get_partition_keys(self._dataset.metadata) - - def __call__(self, *args, **kwargs): - return DatasetPartitionCachedData( - self._data_set_partition_calculator_func(*args, **kwargs), self._cache_serializers - ) - - -class 
DatasetPartitionCachedData: - def __init__(self, dataset_partition, cache_serializers): - self._dataset_partition = dataset_partition - self._cache_serializers = cache_serializers - - @property - def metadata(self): - return self._dataset_partition.dataset.metadata - - @classmethod - def _normalize_time(cls, time: datetime.datetime, drop_tz_info=False): - res = None - if time is not None: - if isinstance(time, datetime.timedelta): - return time - if time.tzinfo is None: - res = pytz.utc.localize(time) - else: - res = time.astimezone(pytz.UTC) - if res is not None and drop_tz_info: - res = res.replace(tzinfo=None) - return res - - def _get_shape_columns(self, column_list): - for c in column_list: - c_type = self.metadata.columns.get(c) - if c_type and CspTypingUtils.is_numpy_nd_array_type(c_type): - yield c, resolve_array_shape_column_name(c) - - def _get_shape_columns_dict(self, column_list): - return dict(self._get_shape_columns(column_list)) - - def _get_array_columns(self, column_list): - for c in column_list: - c_type = self.metadata.columns.get(c) - if c_type and CspTypingUtils.is_numpy_array_type(c_type): - yield c - - def _get_array_columns_set(self, column_list): - return set(self._get_array_columns(column_list)) - - def get_data_files_for_period( - self, - starttime: Optional[datetime.datetime] = None, - endtime: Optional[datetime.datetime] = None, - missing_range_handler: Callable[[datetime.datetime, datetime.datetime], bool] = None, - ): - """Retrieve a list of all files in the given time range (inclusive) - :param starttime: The start time of the period - :param endtime: The end time of the period - :param missing_range_handler: A function that handles missing data. 
Will be called with (missing_period_starttime, missing_period_endtime), - should return True, if the missing data is not an error, should return False otherwise (in which case an exception will be raised) - """ - return self._dataset_partition.get_data_for_period( - self._normalize_time(starttime, True), self._normalize_time(endtime, True), missing_range_handler - )[0] - - def _truncate_df(self, starttime, endtime, df): - import pandas - - if starttime is not None: - starttime = pytz.UTC.localize(starttime) - if endtime is not None: - endtime = pytz.UTC.localize(endtime) - - timestamp_column_name = self._remove_unnamed_output_prefix(self.metadata.timestamp_column_name) - - if starttime is not None or endtime is not None: - mask = pandas.Series(True, df.index) - if starttime is not None: - mask &= df[timestamp_column_name] >= starttime - if endtime is not None: - mask &= df[timestamp_column_name] <= endtime - df = df[mask].reset_index(drop=True) - return df - - @classmethod - def _remove_unnamed_output_prefix(cls, value): - unnamed_prefix = f"{DatasetNameConstants.UNNAMED_OUTPUT_NAME}." 
- if isinstance(value, str): - return value.replace(unnamed_prefix, "") - else: - value.columns = [c.replace(unnamed_prefix, "") for c in value.columns] - - def _load_single_file_all_columns( - self, starttime, endtime, file_path, column_list, basket_column_list, struct_basket_sub_columns - ): - import numpy - import pandas - - df = pandas.read_parquet(file_path, columns=column_list) - self._remove_unnamed_output_prefix(df) - df = self._truncate_df(starttime, endtime, df) - - shape_columns = self._get_shape_columns_dict(column_list) - if shape_columns: - columns_to_drop = [] - for k, v in shape_columns.items(): - df[k] = numpy.array([a.reshape(s) for a, s in zip(df[k], df[v])], dtype=object) - columns_to_drop.append(v) - df = df.drop(columns=columns_to_drop) - return df - - def _create_empty_full_array(self, dtype, field_array_shape, pandas_dtype): - import numpy - - if numpy.issubdtype(dtype, numpy.integer) or numpy.issubdtype(dtype, numpy.floating): - field_array = numpy.full(field_array_shape, numpy.nan) - pandas_dtype = float - elif numpy.issubdtype(dtype, numpy.datetime64): - field_array = numpy.full(field_array_shape, None, dtype=dtype) - else: - field_array = numpy.full(field_array_shape, None, dtype=object) - pandas_dtype = object - return field_array, pandas_dtype - - def _convert_array_columns(self, arrow_columns, column_list, array_columns, shape_columns): - if not array_columns: - return arrow_columns, column_list - - new_column_values = [] - new_column_list = [] - - shape_columns_names = set(shape_columns.values()) - shape_column_arrays = {} - for c, v in zip(column_list, arrow_columns): - if c in shape_columns_names: - shape_column_arrays[c] = numpy.array(v) - - for c, v in zip(column_list, arrow_columns): - if c in shape_columns_names: - continue - if c in array_columns: - numpy_v = numpy.array(v, dtype=object) - shape_col_name = shape_columns.get(c) - if shape_col_name: - shape_col = shape_column_arrays[shape_col_name] - numpy_v = 
numpy.array([v.reshape(shape) for v, shape in zip(numpy_v, shape_col)], dtype=object) - new_column_values.append(numpy_v) - else: - new_column_values.append(v.to_pandas()) - new_column_list.append(c) - return new_column_values, new_column_list - - def _load_data_split_to_columns( - self, starttime, endtime, file_path, column_list, basket_column_list, struct_basket_sub_columns - ): - import numpy - import pandas - from pyarrow import Table - from pyarrow.parquet import ParquetFile - - value_arrays = [] - for c in column_list: - parquet_file = ParquetFile(os.path.join(file_path, f"{c}.parquet")) - value_arrays.append(parquet_file.read().columns[0]) - - array_columns = self._get_array_columns_set(column_list) - shape_columns = self._get_shape_columns_dict(column_list) - # If there are no array use the pyarrow table from arrays to pandas as it is faster, otherwise we need to convert columns since arrays are not - # pyarrow native types - if array_columns and value_arrays and value_arrays and value_arrays[0]: - value_arrays, column_list = self._convert_array_columns( - value_arrays, column_list, array_columns, shape_columns - ) - res = pandas.DataFrame.from_dict(dict(zip(column_list, value_arrays))) - else: - res = Table.from_arrays(value_arrays, column_list).to_pandas() - - self._remove_unnamed_output_prefix(res) - - if basket_column_list: - basket_dfs = [] - columns_l0 = list(res.columns) - columns_l1 = [""] * len(columns_l0) - - for column in basket_column_list: - value_type = self.metadata.dict_basket_columns[column].value_type - - if issubclass(value_type, csp.Struct): - columns = struct_basket_sub_columns.get(column, value_type.metadata().keys()) - value_columns = [f"{column}.{k}" for k in columns] - else: - assert ( - column not in struct_basket_sub_columns - ), f"Specified sub columns for {column} but it's not a struct" - value_columns = [column] - value_files = [os.path.join(file_path, f"{value_column}.parquet") for value_column in value_columns] - symbol_file 
= os.path.join(file_path, f"{column}__csp_symbol.parquet") - value_count_file = os.path.join(file_path, f"{column}__csp_value_count.parquet") - symbol_data = ParquetFile(symbol_file).read().columns[0].to_pandas() - value_data = [ParquetFile(value_file).read().columns[0].to_pandas() for value_file in value_files] - value_count_data_array = ParquetFile(value_count_file).read().columns[0].to_pandas().values - - if len(value_count_data_array) == 0 or value_count_data_array[-1] == 0: - continue - - cycle_indices = value_count_data_array.cumsum() - 1 - value_count_indices = numpy.indices(value_count_data_array.shape)[0] - good_index_mask = numpy.full(cycle_indices.shape, True) - good_index_mask[1:] = cycle_indices[1:] != cycle_indices[:-1] - - index_array = numpy.full(len(symbol_data), numpy.nan) - index_array[cycle_indices[good_index_mask]] = value_count_indices[good_index_mask] - basked_data_index = pandas.Series(index_array).bfill().astype(int).values - - data_dict = {"index": basked_data_index, "symbol": symbol_data} - for value_column, data in zip(value_columns, value_data): - data_dict[value_column] = data - - basket_data_raw = pandas.DataFrame(data_dict) - if basket_data_raw.empty: - continue - else: - all_symbols = basket_data_raw["symbol"].unique() - all_symbols.sort() - field_array_shape = (value_count_indices.size, all_symbols.size) - sym_indices = numpy.searchsorted(all_symbols, basket_data_raw.symbol.values) - - field_matrices = {} - for f in value_columns: - pandas_dtype = basket_data_raw[f].dtype - dtype = basket_data_raw[f].values.dtype - field_array, pandas_dtype = self._create_empty_full_array( - dtype, field_array_shape, pandas_dtype - ) - - field_array[basked_data_index, sym_indices] = basket_data_raw[f] - - field_matrices[f] = pandas.DataFrame(field_array, columns=all_symbols, dtype=pandas_dtype) - - # pandas pivot_table is WAAAAY to slow, we have to implement our own here - basket_data_aligned = pandas.concat( - field_matrices.values(), 
keys=list(field_matrices.keys()), axis=1 - ) - - if column == DatasetNameConstants.UNNAMED_OUTPUT_NAME: - l0, l1 = zip(*basket_data_aligned.columns) - if issubclass(value_type, csp.Struct): - unnamed_prefix_len = len(DatasetNameConstants.UNNAMED_OUTPUT_NAME) + 1 - l0 = [k[unnamed_prefix_len:] for k in l0] - basket_data_aligned.columns = list(zip(l0, l1)) - columns_l0 += l0 - else: - basket_data_aligned.columns = list(l1) - columns_l0 = None - columns_l1 += l1 - else: - columns_l0 += basket_data_aligned.columns.get_level_values(0).tolist() - columns_l1 += basket_data_aligned.columns.get_level_values(1).tolist() - basket_dfs.append(basket_data_aligned) - res = pandas.concat([res] + basket_dfs, axis=1) - if columns_l0: - res.columns = [columns_l0, columns_l1] - - return self._truncate_df(starttime, endtime, res) - - def _read_flat_data_from_files(self, symbol_file, value_files, num_values_to_skip, num_values_to_read): - import pyarrow.parquet - - parquet_files = [pyarrow.parquet.ParquetFile(symbol_file)] - if value_files: - parquet_files += [pyarrow.parquet.ParquetFile(file) for file in value_files.values()] - symbol_parquet_file = parquet_files[0] - for row_group_index in range(symbol_parquet_file.num_row_groups): - row_group = symbol_parquet_file.read_row_group(row_group_index, []) - if num_values_to_skip >= row_group.num_rows: - num_values_to_skip -= row_group.num_rows - continue - row_group_batches = [f.read_row_group(row_group_index).to_batches()[0] for f in parquet_files] - column_names = list(itertools.chain(*(batch.schema.names for batch in row_group_batches))) - column_values = list(itertools.chain(*(batch.columns for batch in row_group_batches))) - row_group_table = pyarrow.Table.from_arrays(column_values, column_names) - - cur_row_group_start_index = num_values_to_skip - num_values_to_skip = 0 - cur_row_group_num_values_to_read = int( - min(row_group.num_rows - cur_row_group_start_index, num_values_to_read) - ) - num_values_to_read -= 
int(cur_row_group_num_values_to_read) - yield row_group_table.slice(cur_row_group_start_index, cur_row_group_num_values_to_read) - if num_values_to_read == 0: - return - assert num_values_to_read == 0 - - def _load_flat_basket_data( - self, - starttime, - endtime, - timestamp_file_name, - symbol_file_name, - value_count_file_name, - value_files, - need_timestamp=True, - ): - import numpy - import pyarrow.parquet - - timestamps_arrow_array = pyarrow.parquet.ParquetFile(timestamp_file_name).read()[0] - timestamps_array = numpy.array(timestamps_arrow_array) - - if timestamps_array.size == 0: - return None - cond = numpy.full(timestamps_array.shape, True) - if starttime is not None: - cond = (timestamps_array >= numpy.datetime64(starttime)) & cond - if endtime is not None: - cond = (timestamps_array <= numpy.datetime64(endtime)) & cond - mask_indices = numpy.where(cond)[0] - if mask_indices.size == 0: - return None - start_index, end_index = mask_indices[0], mask_indices[-1] - value_counts = numpy.array(pyarrow.parquet.ParquetFile(value_count_file_name).read()[0]) - num_values_to_skip = value_counts[:start_index].sum() - value_counts_sub_array = value_counts[start_index : end_index + 1] - value_counts_sub_array_cumsum = value_counts_sub_array.cumsum() - num_values_to_read = value_counts_sub_array_cumsum[-1] if value_counts_sub_array_cumsum.size > 0 else 0 - if num_values_to_read == 0: - return None - res = pyarrow.concat_tables( - filter( - None, - self._read_flat_data_from_files(symbol_file_name, value_files, num_values_to_skip, num_values_to_read), - ) - ) - - if need_timestamp: - timestamps_full = numpy.full(num_values_to_read, None, timestamps_array.dtype) - timestamp_array_size = len(res) - timestamps_sub_array = timestamps_array[start_index : end_index + 1] - timestamps_sub_array = timestamps_sub_array[value_counts_sub_array != 0] - value_counts_sub_array_cumsum_aux = value_counts_sub_array_cumsum[ - value_counts_sub_array_cumsum < timestamp_array_size - ] - 
timestamps_full[0] = timestamps_sub_array[0] - timestamps_full[value_counts_sub_array_cumsum_aux] = timestamps_sub_array[1:] - null_indices = numpy.where(numpy.isnat(timestamps_full))[0] - non_null_indices = numpy.where(~numpy.isnat(timestamps_full))[0] - fill_indices = non_null_indices[numpy.searchsorted(non_null_indices, null_indices, side="right") - 1] - timestamps_full[null_indices] = timestamps_full[fill_indices] - res = res.add_column(0, self.metadata.timestamp_column_name, pyarrow.array(timestamps_full)) - return res - - def _get_flat_basket_df_for_period( - self, - basket_field_name: str, - symbol_column: str, - struct_fields: List[str] = None, - starttime: Optional[datetime.datetime] = None, - endtime: Optional[datetime.datetime] = None, - missing_range_handler: Callable[[datetime.datetime, datetime.datetime], bool] = None, - num_threads=1, - load_values=True, - concat=True, - need_timestamp=True, - ): - starttime = self._normalize_time(starttime, True) - endtime = self._normalize_time(endtime, True) - data_files = self.get_data_files_for_period(starttime, endtime, missing_range_handler) - - if basket_field_name is None: - basket_field_name = DatasetNameConstants.UNNAMED_OUTPUT_NAME - - if basket_field_name not in self.metadata.dict_basket_columns: - raise RuntimeError(f"No basket {basket_field_name} is returned from graph") - - symbol_files = [os.path.join(f, f"{basket_field_name}__csp_symbol.parquet") for f in data_files.values()] - if load_values: - value_type = self.metadata.dict_basket_columns[basket_field_name].value_type - if issubclass(value_type, csp.Struct): - struct_fields = struct_fields if struct_fields is not None else list(value_type.metadata().keys()) - value_files = [ - {field: os.path.join(f, f"{basket_field_name}.{field}.parquet") for field in struct_fields} - for f in data_files.values() - ] - else: - assert ( - struct_fields is None - ), f"Trying to provide struct_fields for non struct output {basket_field_name}" - value_files = [ - 
{basket_field_name: os.path.join(f, f"{basket_field_name}.parquet")} for f in data_files.values() - ] - else: - value_files = list(itertools.repeat({}, len(symbol_files))) - value_count_files = [ - os.path.join(f, f"{basket_field_name}__csp_value_count.parquet") for f in data_files.values() - ] - timestamp_files = [ - os.path.join(f, f"{self.metadata.timestamp_column_name}.parquet") for f in data_files.values() - ] - file_tuples = [ - (t, s, c, d) for t, s, c, d in zip(timestamp_files, symbol_files, value_count_files, value_files) - ] - - if num_threads > 1: - with ThreadPoolExecutor(max_workers=num_threads) as pool: - tasks = [ - pool.submit(self._load_flat_basket_data, starttime, endtime, *tup, need_timestamp=need_timestamp) - for tup in file_tuples - ] - results = list(task.result() for task in tasks) - else: - results = [ - self._load_flat_basket_data(starttime, endtime, *tup, need_timestamp=need_timestamp) - for tup in file_tuples - ] - results = list(filter(None, results)) - if not results: - return None - if concat: - import pyarrow - - return pyarrow.concat_tables(results) - else: - return results - - def get_flat_basket_df_for_period( - self, - symbol_column: str, - basket_field_name: str = None, - struct_fields: List[str] = None, - starttime: Optional[datetime.datetime] = None, - endtime: Optional[datetime.datetime] = None, - missing_range_handler: Callable[[datetime.datetime, datetime.datetime], bool] = None, - num_threads=1, - ): - res = self._get_flat_basket_df_for_period( - basket_field_name=basket_field_name, - symbol_column=symbol_column, - struct_fields=struct_fields, - starttime=starttime, - endtime=endtime, - missing_range_handler=missing_range_handler, - num_threads=num_threads, - concat=True, - ) - if res is None: - return None - res_df = res.to_pandas() - res_df.rename(columns={res_df.columns[1]: symbol_column}, inplace=True) - res_df.columns = [self._remove_unnamed_output_prefix(c) for c in res_df.columns] - return res_df - - def 
get_all_basket_ids_in_range( - self, - basket_field_name=None, - starttime: Optional[datetime.datetime] = None, - endtime: Optional[datetime.datetime] = None, - missing_range_handler: Callable[[datetime.datetime, datetime.datetime], bool] = None, - num_threads=1, - ): - import numpy - - symbol_column_name = "__csp_symbol__" - parquet_tables = self._get_flat_basket_df_for_period( - basket_field_name=basket_field_name, - symbol_column=symbol_column_name, - starttime=starttime, - endtime=endtime, - missing_range_handler=missing_range_handler, - num_threads=num_threads, - load_values=False, - concat=False, - need_timestamp=False, - ) - unique_arrays = [numpy.unique(numpy.array(t[0])) for t in parquet_tables] - return sorted(numpy.unique(numpy.concatenate(unique_arrays + unique_arrays))) - - def invalidate_cache( - self, starttime: Optional[datetime.datetime] = None, endtime: Optional[datetime.datetime] = None - ): - existing_data = self.get_data_files_for_period(starttime, endtime, lambda *args, **kwargs: True) - - if not existing_data: - return - - aggregation_period_utils = AggregationPeriodUtils(self.metadata.time_aggregation) - if starttime is not None: - agg_period_starttime = aggregation_period_utils.resolve_period_start(starttime) - if starttime != agg_period_starttime: - raise RuntimeError( - f"Trying to invalidate data starting on {starttime} - invalidation should be for full aggregation period (starting on {agg_period_starttime})" - ) - - if endtime is not None: - agg_period_endtime = aggregation_period_utils.resolve_period_end(endtime, exclusive_end=False) - if endtime != agg_period_endtime: - raise RuntimeError( - f"Trying to invalidate data ending on {endtime} - invalidation should be for full aggregation period (ending on {agg_period_endtime})" - ) - - root_folders_to_possibly_remove = set() - for k, v in existing_data.items(): - output_folder_name = self._dataset_partition.data_paths.get_output_folder_name(k[0]) - 
root_folders_to_possibly_remove.add(os.path.dirname(output_folder_name)) - logging.info(f"Removing {output_folder_name}") - shutil.rmtree(output_folder_name) - partition_root_folder = self._dataset_partition.data_paths.root_folder - while root_folders_to_possibly_remove: - aux = root_folders_to_possibly_remove - root_folders_to_possibly_remove = set() - for v in aux: - if not v.startswith(partition_root_folder): - continue - can_remove = True - for item in os.listdir(v): - if not item.startswith(".") and not item.endswith("_WIP"): - can_remove = False - break - if can_remove: - logging.info(f"Removing {v}") - shutil.rmtree(v) - root_folders_to_possibly_remove.add(os.path.dirname(v)) - - def get_data_df_for_period( - self, - starttime: Optional[datetime.datetime] = None, - endtime: Optional[datetime.datetime] = None, - missing_range_handler: Callable[[datetime.datetime, datetime.datetime], bool] = None, - data_loader_function: Callable[[str, Optional[List[str]]], object] = None, - column_list=None, - basket_column_list=None, - struct_basket_sub_columns: Optional[Dict[str, List[str]]] = None, - combine=True, - num_threads=1, - ): - """Load the cached data for the given time range (inclusive) into dataframes - :param starttime: The start time of the period - :param endtime: The end time of the period - :param missing_range_handler: A function that handles missing data. Will be called with (missing_period_starttime, missing_period_endtime), - should return True if the missing data is not an error, and False otherwise (in which case an exception will be raised) - :param data_loader_function: A custom loader function that overrides the default pandas read. If not None, it implies combine=False. The
The file_path is the path of the file to be loaded and column_list is the list - of columns to be loaded (if column_list is None then all columns should be loaded) - :param column_list: The list of columns to be loaded. If None specified then all columns will be loaded - :param basket_column_list: The list of basket columns to be loaded. If None specified then all basket columns will be loaded. - :param struct_basket_sub_columns: A dictionary of {basket_name: List[str]} that specifies which sub columns of the basket should be loaded. Only valid for - struct baskets - :param combine: Combine the loaded data frames into a single dataframe (if False, will return a list of dataframes). If data_loader_function - is specified then combine is always treated as False - :param num_threads: The number of threads to use for loading the data - """ - starttime = self._normalize_time(starttime) - endtime = self._normalize_time(endtime) - if endtime is not None and isinstance(endtime, datetime.timedelta): - endtime = starttime + endtime - - data_files = self.get_data_files_for_period(starttime, endtime, missing_range_handler) - if data_loader_function is None: - if self.metadata.split_columns_to_files: - data_loader_function = self._load_data_split_to_columns - else: - data_loader_function = self._load_single_file_all_columns - else: - combine = False - - if column_list is None: - column_list = list(self.metadata.columns.keys()) - shape_columns = self._get_shape_columns_dict(column_list) - if shape_columns: - column_list += list(shape_columns.values()) - - if basket_column_list is None: - basket_columns = getattr(self.metadata, "dict_basket_columns", None) - basket_column_list = list(basket_columns.keys()) if basket_columns else None - - if struct_basket_sub_columns is None: - struct_basket_sub_columns = {} - if basket_column_list: - for col in basket_column_list: - value_type = self.metadata.dict_basket_columns[col].value_type - if issubclass(value_type, csp.Struct): - 
self.metadata.dict_basket_columns[col] - struct_basket_sub_columns[col] = list(value_type.metadata().keys()) - else: - if "" in struct_basket_sub_columns: - struct_basket_sub_columns[DatasetNameConstants.UNNAMED_OUTPUT_NAME] = struct_basket_sub_columns.pop("") - for k, v in struct_basket_sub_columns.items(): - if k not in basket_column_list: - raise RuntimeError(f"Specified sub columns for basket '{k}' but it's not loaded from file: {v}") - - if self.metadata.timestamp_column_name not in column_list: - column_list = [self.metadata.timestamp_column_name] + column_list - - if num_threads > 1: - with ThreadPoolExecutor(max_workers=num_threads) as pool: - tasks = [ - pool.submit( - data_loader_function, - file_start_time, - file_end_time, - data_file, - column_list, - basket_column_list, - struct_basket_sub_columns, - ) - for (file_start_time, file_end_time), data_file in data_files.items() - ] - dfs = [task.result() for task in tasks] - else: - dfs = [ - data_loader_function( - file_start_time, - file_end_time, - data_file, - column_list, - basket_column_list, - struct_basket_sub_columns, - ) - for (file_start_time, file_end_time), data_file in data_files.items() - ] - - dfs = [df for df in dfs if len(df) > 0] - - # For now we do it in one process, in the future might push it into multiprocessing load - for k, typ in self._dataset_partition.dataset.metadata.columns.items(): - serializer = self._cache_serializers.get(typ) - if serializer: - for df in dfs: - df[k] = df[k].apply(lambda v: serializer.deserialize_from_bytes(v) if v is not None else None) - - if combine: - if len(dfs) > 0: - import pandas - - return pandas.concat(dfs, ignore_index=True) - else: - return None - else: - return dfs diff --git a/csp/impl/wiring/cache_support/graph_building.py b/csp/impl/wiring/cache_support/graph_building.py deleted file mode 100644 index e4731e190..000000000 --- a/csp/impl/wiring/cache_support/graph_building.py +++ /dev/null @@ -1,745 +0,0 @@ -import copy -import os -from 
datetime import datetime, timedelta -from typing import Dict, List, Optional, Set, Tuple, TypeVar, Union - -import csp -from csp.adapters.parquet import ParquetOutputConfig -from csp.impl.config import CacheCategoryConfig, CacheConfig, Config -from csp.impl.managed_dataset.cache_user_custom_object_serializer import CacheObjectSerializer -from csp.impl.managed_dataset.dataset_metadata import TimeAggregation -from csp.impl.managed_dataset.dateset_name_constants import DatasetNameConstants -from csp.impl.managed_dataset.managed_dataset import ManagedDataset -from csp.impl.managed_dataset.managed_dataset_lock_file_util import ManagedDatasetLockUtil -from csp.impl.mem_cache import normalize_arg -from csp.impl.struct import Struct -from csp.impl.types import tstype -from csp.impl.types.common_definitions import ArgKind, OutputBasketContainer, OutputDef -from csp.impl.types.tstype import ts -from csp.impl.types.typing_utils import CspTypingUtils -from csp.utils.qualified_name_utils import QualifiedNameUtils - -# relative to avoid cycles -from ..context import Context -from ..edge import Edge -from ..node import node -from ..outputs import OutputsContainer -from ..signature import Signature -from ..special_output_names import UNNAMED_OUTPUT_NAME -from .cache_config_resolver import CacheConfigResolver - -T = TypeVar("T") - - -class _UnhashableObjectWrapper: - def __init__(self, obj): - self._obj = obj - - def __hash__(self): - return hash(id(self._obj)) - - def __eq__(self, other): - return id(self._obj) == id(other._obj) - - -class _CacheManagerKey: - def __init__(self, scalars, *extra_args): - self._normalized_scalars = tuple(self._normalize_scalars(scalars)) + tuple(extra_args) - self._hash = hash(self._normalized_scalars) - - def __hash__(self): - return self._hash - - def __eq__(self, other): - return self._hash == other._hash and self._normalized_scalars == other._normalized_scalars - - @classmethod - def _normalize_scalars(cls, scalars): - for scalar in scalars: - 
normalized_scalar = normalize_arg(scalar) - try: - hash(normalized_scalar) - except TypeError: - yield _UnhashableObjectWrapper(scalar) - else: - yield normalized_scalar - - -class WrappedStructEdge(Edge): - def __init__(self, wrapped_edge, parquet_reader, field_map): - super().__init__( - tstype=wrapped_edge.tstype, - nodedef=wrapped_edge.nodedef, - output_idx=wrapped_edge.output_idx, - basket_idx=wrapped_edge.basket_idx, - ) - self._parquet_reader = parquet_reader - self._field_map = field_map - self._single_field_edges = {} - - def __getattr__(self, key): - res = self._single_field_edges.get(key, None) - if res is not None: - return res - elemtype = self.tstype.typ.metadata().get(key) - if elemtype is None: - raise AttributeError("'%s' object has no attribute '%s'" % (self.tstype.typ.__name__, key)) - res = self._parquet_reader.subscribe_all(elemtype, field_map=self._field_map[key]) - self._single_field_edges[key] = res - return res - - -class WrappedCachedStructBasket(dict): - def __init__(self, typ, name, wrapped_edges, parquet_reader): - super().__init__(**wrapped_edges) - self._typ = typ - self._name = name - self._parquet_reader = parquet_reader - self._shape = None - self._field_dicts = {} - - def get_basket_field(self, field_name): - res = self._field_dicts.get(field_name) - if res is None: - if self._shape is None: - self._shape = list(self.keys()) - # res = self._parquet_reader.subscribe_dict_basket_struct_column(self._typ, self._name, self._shape, field_name) - res = {k: getattr(v, field_name) for k, v in self.items()} - self._field_dicts[field_name] = res - return res - - -class CacheCategoryOverridesTree: - """A utility class that is used to resolve category overrides for a given category like ['level_1', 'level_2', ...]
- The basic implementation is a tree of levels - """ - - def __init__(self, cur_level_key: str = None, cur_level_value: CacheCategoryConfig = None, parent=None): - self.cur_level_key = cur_level_key - self.cur_level_value = cur_level_value - self.parent = parent - self.children = {} - - def _get_path_str(self): - path = [] - cur = self - while cur is not None: - if cur.cur_level_value is not None: - path.append(cur.cur_level_value) - cur = cur.parent - return str(reversed(path)) - - def __str__(self): - return f"CacheCategoryOverridesTree({self._get_path_str()}:{self.cur_level_value})" - - def __repr__(self): - return self.__str__() - - def _get_child(self, key: str): - res = self.children.get(key) - if res is None: - res = CacheCategoryOverridesTree(cur_level_key=key, parent=self) - self.children[key] = res - return res - - def _add_override(self, override: CacheCategoryConfig, cur_level_index=0): - if cur_level_index < len(override.category): - self._get_child(override.category[cur_level_index])._add_override(override, cur_level_index + 1) - else: - if self.cur_level_value is not None: - raise RuntimeError(f"Trying to override cache directory for {self._get_path_str()} more than once") - self.cur_level_value = override - - def resolve_root_folder(self, category: List[str], cur_level: int = 0): - """ - :param category: The category of the dataset - :return: A config override or the default config for the given category - """ - if cur_level == len(category): - return self.cur_level_value - - # We want the longest match possible, so first attempt resolving using children - child = self.children.get(category[cur_level]) - if child is not None: - child_res = child.resolve_root_folder(category, cur_level + 1) - if child_res is not None: - return child_res - - return self.cur_level_value - - @classmethod - def construct_from_cache_config(cls, cache_config: Optional[CacheConfig] = None): - res = CacheCategoryOverridesTree() - - if cache_config is None: - raise 
RuntimeError("data_folder must be set in global cache_config to use caching") - - res.cur_level_value = cache_config - - if hasattr(cache_config, "category_overrides"): - for override in cache_config.category_overrides: - res._add_override(override) - return res - - @classmethod - def construct_from_config(cls, config: Optional[Config] = None): - if config is not None and hasattr(config, "cache_config"): - return cls.construct_from_cache_config(config.cache_config) - return CacheCategoryOverridesTree() - - -class ContextCacheInfo(Struct): - """Graph building context storage class - - Should be stored inside a Context object, contains all the data collected during the graph building time to enable support - of caching - """ - - # A dictionary from tuple (function_id, scalar_arguments) to a corresponding GraphCacheManager - cache_managers: dict - # Dictionary of a graph function id to ManagedDataset and field mapping that corresponds to it - managed_datasets_by_graph_object: Dict[object, Tuple[ManagedDataset, Dict[str, str]]] - # The object that is used to resolve the underlying sub folders of the given graph - cache_data_paths_resolver: CacheConfigResolver - - -class _EngineLockRelease: - """A utility that will release the given lock on engine stop""" - - def __init__(self): - self.cur_lock = None - - @node - def release_node(self): - with csp.stop(): - if self.cur_lock: - self.cur_lock.unlock() - self.cur_lock = None - - -class _MissingPeriodCallback: - def __init__(self): - self.first_missing_period = None - - def __call__(self, start, end): - if self.first_missing_period is None: - self.first_missing_period = (start, end) - return True - - -@node -def _deserialize_value(value: ts[bytes], type_serializer: CacheObjectSerializer, typ: "T") -> ts["T"]: - if csp.ticked(value): - return type_serializer.deserialize_from_bytes(value) - - -class GraphBuildPartitionCacheManager(object): - """A utility class that "manages" cache at graph building time - - One instance is
created per (dataset, partition_values) - """ - - def __init__( - self, - function_name, - dataset: ManagedDataset, - partition_values, - expected_outputs, - cache_options, - csp_cache_start=None, - csp_cache_end=None, - csp_timestamp_shift=None, - ): - self._function_name = function_name - self.dataset_partition = dataset.get_partition(partition_values) - self._outputs = None - self._written_outputs = None - self._cache_options = cache_options - self._context = Context.TLS.instance - self._csp_cache_start = csp_cache_start if csp_cache_start else Context.TLS.instance.start_time - self._csp_cache_end = csp_cache_end if csp_cache_end else Context.TLS.instance.end_time - self._csp_timestamp_shift = csp_timestamp_shift if csp_timestamp_shift else timedelta() - missing_period_callback = _MissingPeriodCallback() - data_for_period, is_full_period_covered = self.get_data_for_period( - self._csp_cache_start, self._csp_cache_end, missing_range_handler=missing_period_callback - ) - cache_config = Context.TLS.instance.config.cache_config - self._first_missing_period = missing_period_callback.first_missing_period - if is_full_period_covered: - from csp.adapters.parquet import ParquetReader - - cache_serializers = cache_config.cache_serializers - # We need to release the lock at the end of the run; the generator is not guaranteed to do so if it's not called - engine_lock_releaser = _EngineLockRelease() - reader = ParquetReader( - filename_or_list=self._read_files_provider( - dataset, data_for_period, cache_config.lock_file_permissions, engine_lock_releaser - ), - time_column=self.dataset_partition.dataset.metadata.timestamp_column_name, - split_columns_to_files=self.dataset_partition.dataset.metadata.split_columns_to_files, - start_time=csp_cache_start, - end_time=csp_cache_end, - allow_overlapping_periods=True, - time_shift=self._csp_timestamp_shift, - ) - # We need to instantiate the node to have it run - engine_lock_releaser.release_node() - self._outputs = OutputsContainer() -
is_unnamed_output = False - for output in expected_outputs: - output_name = output.name - if output_name is None: - output_name = DatasetNameConstants.UNNAMED_OUTPUT_NAME - is_unnamed_output = True - if output.kind == ArgKind.BASKET_TS: - output_dict = reader.subscribe_dict_basket(typ=output.typ.typ, name=output_name, shape=output.shape) - output_value = WrappedCachedStructBasket(output.typ.typ, output_name, output_dict, reader) - else: - assert output.kind == ArgKind.TS - if isinstance(output.typ.typ, type) and issubclass(output.typ.typ, Struct): - # Reverse field mapping - write_field_map = cache_options.field_mapping.get(output_name) - field_map = {v: k for k, v in write_field_map.items()} - # Wrap the edge to allow single column reading - output_value = WrappedStructEdge( - reader.subscribe_all(typ=output.typ.typ, field_map=field_map), reader, write_field_map - ) - else: - type_serializer = cache_serializers.get(output.typ.typ) - if type_serializer: - output_value = _deserialize_value( - reader.subscribe_all( - typ=bytes, field_map=cache_options.field_mapping.get(output_name, output_name) - ), - type_serializer, - output.typ.typ, - ) - else: - output_value = reader.subscribe_all( - typ=output.typ.typ, field_map=cache_options.field_mapping.get(output_name, output_name) - ) - if is_unnamed_output: - assert len(expected_outputs) == 1 - self._outputs = output_value - else: - self._outputs[output_name] = output_value - - def _read_files_provider(self, dataset, data_files, lock_file_permissions, engine_lock_releaser): - assert data_files - items_iter = iter(data_files.items()) - finished = False - num_failures = 0 - next_filename = None - lock_util = ManagedDatasetLockUtil(lock_file_permissions) - while not finished: - if num_failures > 10: - raise RuntimeError( - f"Failed to read cached files too many times, last attempted file is {next_filename}" - ) - try: - (next_start_time, next_end_time), next_filename = next(items_iter) - except StopIteration: - finished = 
True - continue - is_file = os.path.isfile(next_filename) - is_dir = os.path.isdir(next_filename) - if not is_file and not is_dir: - data_files, _ = self.get_data_for_period(next_start_time, self._csp_cache_end) - assert data_files - items_iter = iter(data_files.items()) - num_failures += 1 - with dataset.use_lock_context(): - lock = lock_util.read_lock(next_filename, is_file) - lock.lock() - engine_lock_releaser.cur_lock = lock - try: - if os.path.exists(next_filename): - num_failures = 0 - yield next_filename - else: - data_files, _ = self.get_data_for_period(next_start_time, self._csp_cache_end) - assert data_files - items_iter = iter(data_files.items()) - num_failures += 1 - finally: - lock.unlock() - engine_lock_releaser.cur_lock = None - - @property - def first_missing_period(self): - return self._first_missing_period - - @property - def is_force_cache_read(self): - return hasattr(self._cache_options, "data_timestamp_column_name") - - @property - def outputs(self): - return self._outputs - - @property - def written_outputs(self): - return self._written_outputs - - @classmethod - def _resolve_anonymous_dataset_category(cls, cache_options): - category = getattr(cache_options, "category", None) - if category is None: - category = ["csp_unnamed_cache"] - return category - - @classmethod - def get_dataset_for_func(cls, graph, func, cache_options, data_folder): - category = cls._resolve_anonymous_dataset_category(cache_options) - cache_config_resolver = None - if isinstance(data_folder, Config): - cache_config_resolver = CacheConfigResolver(data_folder.cache_config) - elif isinstance(data_folder, CacheConfig): - cache_config_resolver = CacheConfigResolver(data_folder) - if isinstance(data_folder, CacheConfigResolver): - cache_config_resolver = data_folder - - if cache_config_resolver is None: - cache_config_resolver = CacheConfigResolver(CacheConfig(data_folder=data_folder)) - - cache_config = cache_config_resolver.resolve_cache_config(graph, category) - - 
dataset_name = getattr(cache_options, "dataset_name", None) if cache_options else None - if dataset_name is None: - dataset_name = f"{QualifiedNameUtils.get_qualified_object_name(func)}" - return ManagedDataset.load_from_disk(cache_config=cache_config, name=dataset_name, data_category=category) - - @classmethod - def _resolve_dataset(cls, graph, func, signature, cache_options, expected_outputs, tvars): - context_cache_data = Context.TLS.instance.cache_data - # We might have the dataset already - - func_id = id(func) - existing_dataset_and_field_mapping = context_cache_data.managed_datasets_by_graph_object.get(func_id) - if existing_dataset_and_field_mapping is not None: - cache_options.field_mapping = existing_dataset_and_field_mapping[1] - return existing_dataset_and_field_mapping[0] - - dataset_name = getattr(cache_options, "dataset_name", None) - partition_columns = {input.name: input.typ for input in signature.scalars} - column_types = {} - dict_basket_column_types = {} - - if len(expected_outputs) == 1 and expected_outputs[0].name is None: - timestamp_column_auto_prefix = f"{DatasetNameConstants.UNNAMED_OUTPUT_NAME}." 
- cur_def = expected_outputs[0] - expected_outputs = ( - OutputDef( - name=DatasetNameConstants.UNNAMED_OUTPUT_NAME, - typ=cur_def.typ, - kind=cur_def.kind, - ts_idx=cur_def.ts_idx, - shape=cur_def.shape, - ), - ) - else: - timestamp_column_auto_prefix = "" - - field_mapping = cache_options.field_mapping - for i, out in enumerate(expected_outputs): - if out.kind == ArgKind.BASKET_TS: - # Let's make sure that we're handling dict basket - if isinstance(out.shape, list) and tstype.isTsType(out.typ): - dict_basket_column_types[out.name] = signature.resolve_basket_key_type(i, tvars), out.typ.typ - else: - raise NotImplementedError(f"Caching of basket output {out.name} of type {out.typ} is unsupported") - elif isinstance(out.typ.typ, type) and issubclass(out.typ.typ, Struct): - struct_field_mapping = field_mapping.get(out.name) - if struct_field_mapping is None: - if cache_options.prefix_struct_names: - struct_col_types = {f"{out.name}.{k}": v for k, v in out.typ.typ.metadata().items()} - else: - struct_col_types = out.typ.typ.metadata() - column_types.update(struct_col_types) - struct_field_mapping = {n1: n2 for n1, n2 in zip(out.typ.typ.metadata(), struct_col_types)} - field_mapping[out.name] = struct_field_mapping - else: - for k, v in out.typ.typ.metadata().items(): - cache_col_name = struct_field_mapping.get(k, k) - column_types[cache_col_name] = v - else: - name = field_mapping.get(out.name, out.name) - column_types[name] = out.typ.typ - - if hasattr(cache_options, "data_timestamp_column_name"): - timestamp_column_name = timestamp_column_auto_prefix + cache_options.data_timestamp_column_name - else: - timestamp_column_name = "csp_timestamp" - - category = cls._resolve_anonymous_dataset_category(cache_options) - resolved_cache_config = context_cache_data.cache_data_paths_resolver.resolve_cache_config(graph, category) - dataset = cls._create_dataset( - func, - resolved_cache_config, - category, - dataset_name=dataset_name, - 
timestamp_column_name=timestamp_column_name, - columns_types=column_types, - partition_columns=partition_columns, - split_columns_to_files=cache_options.split_columns_to_files, - time_aggregation=cache_options.time_aggregation, - dict_basket_column_types=dict_basket_column_types, - ) - context_cache_data.managed_datasets_by_graph_object[func_id] = dataset, field_mapping - return dataset - - @classmethod - def _get_qualified_function_name(cls, func): - return f"{func.__module__}.{func.__name__}" - - @classmethod - def _create_dataset( - cls, - func, - cache_config, - category, - dataset_name, - timestamp_column_name: str = None, - columns_types: Dict[str, object] = None, - partition_columns: Dict[str, type] = None, - *, - split_columns_to_files: Optional[bool], - time_aggregation: TimeAggregation, - dict_basket_column_types: Dict[type, Union[type, Tuple[type, type]]], - ): - name = dataset_name if dataset_name else f"{QualifiedNameUtils.get_qualified_object_name(func)}" - dataset = ManagedDataset( - name=name, - category=category, - cache_config=cache_config, - timestamp_column_name=timestamp_column_name, - columns_types=columns_types, - partition_columns=partition_columns, - split_columns_to_files=split_columns_to_files, - time_aggregation=time_aggregation, - dict_basket_column_types=dict_basket_column_types, - ) - return dataset - - def get_data_for_period(self, starttime: datetime, endtime: datetime, missing_range_handler): - res, full_period_covered = self.dataset_partition.get_data_for_period( - starttime=starttime - self._csp_timestamp_shift, - endtime=endtime - self._csp_timestamp_shift, - missing_range_handler=missing_range_handler, - ) - if self._csp_timestamp_shift: - res = { - (start + self._csp_timestamp_shift, end + self._csp_timestamp_shift): path - for (start, end), path in res.items() - } - - return res, full_period_covered - - def _fix_outputs_for_caching(self, outputs): - if isinstance(outputs, OutputsContainer) and UNNAMED_OUTPUT_NAME in outputs: 
- outputs_dict = dict(outputs._items()) - outputs_dict[DatasetNameConstants.UNNAMED_OUTPUT_NAME] = outputs_dict.pop(UNNAMED_OUTPUT_NAME) - return OutputsContainer(**outputs_dict) - else: - return outputs - - def cache_outputs(self, outputs): - from csp.impl.managed_dataset.managed_parquet_writer import create_managed_parquet_writer_node - - outputs = self._fix_outputs_for_caching(outputs) - assert self._written_outputs is None - self._written_outputs = outputs - create_managed_parquet_writer_node( - function_name=self._function_name, - dataset_partition=self.dataset_partition, - values=outputs, - field_mapping=self._cache_options.field_mapping, - config=getattr(self._cache_options, "parquet_output_config", None), - data_timestamp_column_name=getattr(self.dataset_partition.dataset.metadata, "timestamp_column_name", None), - controlled_cache=self._cache_options.controlled_cache, - default_cache_enabled=self._cache_options.default_cache_enabled, - ) - - @classmethod - def create_cache_manager( - cls, - graph, - func, - signature, - non_ignored_scalars, - all_scalars, - cache_options, - expected_outputs, - tvars, - csp_cache_start=None, - csp_cache_end=None, - csp_timestamp_shift=None, - ): - if not hasattr(Context.TLS, "instance"): - raise RuntimeError("Graph must be instantiated under a wiring context") - - assert Context.TLS.instance.start_time is not None - assert Context.TLS.instance.end_time is not None - key = _CacheManagerKey( - all_scalars, id(func), tuple(tvars.items()), csp_cache_start, csp_cache_end, csp_timestamp_shift - ) - existing_cache_manager = Context.TLS.instance.cache_data.cache_managers.get(key) - if existing_cache_manager is not None: - return existing_cache_manager - - # We're going to modify field mapping, so we need to make a copy - cache_options = cache_options.copy() - cache_options.field_mapping = dict(cache_options.field_mapping) - - for output in expected_outputs: - if output.kind == ArgKind.TS and isinstance(output.typ.typ, type) and 
issubclass(output.typ.typ, Struct): - struct_field_map = cache_options.field_mapping.get(output.name) - if struct_field_map: - # We don't want to omit any fields from the cache otherwise read data will be different from what's written - # so whatever the user doesn't map, we will map to the original field. - full_field_map = copy.copy(struct_field_map) - for k in output.typ.typ.metadata(): - if k not in full_field_map: - full_field_map[k] = k - cache_options.field_mapping[output.name] = full_field_map - - dataset = cls._resolve_dataset(graph, func, signature, cache_options, expected_outputs, tvars) - - partition_values = dict(zip((i.name for i in signature.scalars), non_ignored_scalars)) - res = GraphBuildPartitionCacheManager( - function_name=QualifiedNameUtils.get_qualified_object_name(func), - dataset=dataset, - partition_values=partition_values, - expected_outputs=expected_outputs, - cache_options=cache_options, - csp_cache_start=csp_cache_start, - csp_cache_end=csp_cache_end, - csp_timestamp_shift=csp_timestamp_shift, - ) - Context.TLS.instance.cache_data.cache_managers[key] = res - return res - - -class GraphCacheOptions(Struct): - # The name of the dataset to which the data will be written - optional - dataset_name: str - # An optional column mapping for scalar time series, the mapping should be string (the name of the column in parquet file), - # for struct time series it should be a map of {struct_field_name:column_name}. - field_mapping: Dict[str, Union[str, Dict[str, str]]] - # A boolean that specifies whether struct fields should be prefixed with the output name. 
For example for a graph output - # named "o" and a field named "f", if prefix_struct_names is True then the column will be "o.f" else the column will be "f" - prefix_struct_names: bool = True - # This is an advanced usage of graph caching, in some instances we want to override timestamp and write data with custom timestamp, - # if this is specified, the values from the given column will be used as the timestamp column - data_timestamp_column_name: str - # Optional category specification for the dataset, can only be specified if using the default dataset. An example of category - # can be ['daily_statistics', 'market_prices']. This category will be part of the files path. Additionally, cache paths can be - # overridden for different categories. - category: List[str] - # Inputs to ignore for caching purposes - ignored_inputs: Set[str] - # A boolean that specifies whether each column should be written to a separate file - split_columns_to_files: bool - # The configuration of the written files - parquet_output_config: ParquetOutputConfig - # Data aggregation period - time_aggregation: TimeAggregation = TimeAggregation.DAY - # A boolean flag that specifies whether the node/graph provides a ts that specifies that the output should be cached - controlled_cache: bool = False - # The default value of whether at start the outputs should be cached, ignored if controlled_cache is False - default_cache_enabled: bool = False - - -class ResolvedGraphCacheOptions(Struct): - """A struct with all resolved graph cache options""" - - dataset_name: str - field_mapping: dict - prefix_struct_names: bool - data_timestamp_column_name: str - category: List[str] - ignored_inputs: Set[str] - split_columns_to_files: bool - parquet_output_config: ParquetOutputConfig - time_aggregation: TimeAggregation - controlled_cache: bool - default_cache_enabled: bool - - -def resolve_graph_cache_options(signature: Signature, cache_enabled, cache_options: GraphCacheOptions): - """Called at graph building 
time to validate that the given caching options are valid for the given signature - - :param signature: The signature of the cached graph - :param cache_enabled: A boolean that specifies whether caching is enabled - :param cache_options: Graph cache read/write options - :return: - """ - if cache_enabled: - if cache_options is None: - cache_options = GraphCacheOptions() - - field_mapping = getattr(cache_options, "field_mapping", None) - split_columns_to_files = getattr(cache_options, "split_columns_to_files", None) - has_basket_outputs = False - - if signature._ts_inputs: - all_ts_ignored = False - ignored_inputs = getattr(cache_options, "ignored_inputs", None) - if ignored_inputs: - all_ts_ignored = True - for input in signature._ts_inputs: - if input.name not in ignored_inputs: - all_ts_ignored = False - if not all_ts_ignored: - raise NotImplementedError("Caching of graph with ts arguments is unsupported") - if not signature._outputs: - raise NotImplementedError("Caching of graph without outputs is unsupported") - for output in signature._outputs: - if isinstance(output.typ, OutputBasketContainer): - if CspTypingUtils.get_origin(output.typ.typ) is List: - raise NotImplementedError("Caching of list basket outputs is unsupported") - has_basket_outputs = True - elif not tstype.isTsType(output.typ): - if tstype.isTsStaticBasket(output.typ): - if CspTypingUtils.get_origin(output.typ) is List: - raise NotImplementedError("Caching of list basket outputs is unsupported") - else: - raise TypeError( - f"Cached output basket {output.name} must have shape provided using with_shape or with_shape_of" - ) - assert tstype.isTsType(output.typ) or isinstance(output.typ, OutputBasketContainer) - ignored_inputs = getattr(cache_options, "ignored_inputs", set()) - for input in signature.scalars: - if input.name not in ignored_inputs and not ManagedDataset.is_supported_partition_type(input.typ): - raise NotImplementedError( - f"Caching is unsupported for argument type {input.typ} (argument
{input.name})" - ) - resolved_cache_options = ResolvedGraphCacheOptions(prefix_struct_names=cache_options.prefix_struct_names) - - for attr in ( - "dataset_name", - "data_timestamp_column_name", - "category", - "ignored_inputs", - "parquet_output_config", - "time_aggregation", - "controlled_cache", - "default_cache_enabled", - ): - if hasattr(cache_options, attr): - setattr(resolved_cache_options, attr, getattr(cache_options, attr)) - if has_basket_outputs: - if split_columns_to_files is False: - raise RuntimeError("Cached graph with output basket must set split_columns_to_files to True") - split_columns_to_files = True - elif split_columns_to_files is None: - split_columns_to_files = False - - resolved_cache_options.split_columns_to_files = split_columns_to_files - - resolved_cache_options.field_mapping = {} if field_mapping is None else field_mapping - else: - if cache_options: - raise RuntimeError("cache_options must be None if caching is disabled") - resolved_cache_options = None - return resolved_cache_options diff --git a/csp/impl/wiring/cache_support/partition_files_container.py b/csp/impl/wiring/cache_support/partition_files_container.py deleted file mode 100644 index 319bae705..000000000 --- a/csp/impl/wiring/cache_support/partition_files_container.py +++ /dev/null @@ -1,99 +0,0 @@ -import threading -from datetime import datetime -from typing import Dict, Tuple - -from csp.adapters.output_adapters.parquet import ParquetOutputConfig - - -class SinglePartitionFiles: - def __init__(self, dataset_partition, parquet_output_config): - self._dataset_partition = dataset_partition - self._parquet_output_config = parquet_output_config - # A mapping of (start, end)->file_path - self._files_by_period: Dict[Tuple[datetime, datetime], str] = {} - - @property - def dataset_partition(self): - return self._dataset_partition - - @property - def parquet_output_config(self): - return self._parquet_output_config - - @property - def files_by_period(self): - return 
self._files_by_period - - def add_file(self, start_time: datetime, end_time: datetime, file_path: str): - self._files_by_period[(start_time, end_time)] = file_path - - -class PartitionFileContainer: - TLS = threading.local() - - def __init__(self, cache_config): - # A mapping of id(dataset_partition)->(start_time,end_time)->file_path - self._files_by_partition_and_period: Dict[int, SinglePartitionFiles] = {} - self._cache_config = cache_config - - @classmethod - def get_instance(cls): - return cls.TLS.instance - - def __enter__(self): - assert not hasattr(self.TLS, "instance") - self.TLS.instance = self - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - try: - # We don't want to finalize cache if there is an exception - if exc_val is None: - # First let's publish all files - for partition_files in self._files_by_partition_and_period.values(): - for (start_time, end_time), file_path in partition_files.files_by_period.items(): - partition_files.dataset_partition.publish_file( - file_path, - start_time, - end_time, - self._cache_config.data_file_permissions, - lock_file_permissions=self._cache_config.lock_file_permissions, - ) - - # Let's now merge whatever we can - if self._cache_config.merge_existing_files: - for partition_files in self._files_by_partition_and_period.values(): - for (start_time, end_time), file_path in partition_files.files_by_period.items(): - partition_files.dataset_partition.merge_files( - start_time, - end_time, - cache_config=self._cache_config, - parquet_output_config=partition_files.parquet_output_config, - ) - partition_files.dataset_partition.cleanup_unneeded_files( - start_time=start_time, end_time=end_time, cache_config=self._cache_config - ) - finally: - del self.TLS.instance - - @property - def files_by_partition(self): - return self._files_by_partition_and_period - - def add_generated_file( - self, - dataset_partition, - start_time: datetime, - end_time: datetime, - file_path: str, - parquet_output_config: 
ParquetOutputConfig, - ): - key = id(dataset_partition) - - single_partition_files = self._files_by_partition_and_period.get(key) - if single_partition_files is None: - single_partition_files = SinglePartitionFiles(dataset_partition, parquet_output_config) - self._files_by_partition_and_period[key] = single_partition_files - else: - assert single_partition_files.parquet_output_config == parquet_output_config - single_partition_files.add_file(start_time, end_time, file_path) diff --git a/csp/impl/wiring/cache_support/runtime_cache_manager.py b/csp/impl/wiring/cache_support/runtime_cache_manager.py deleted file mode 100644 index 4ef9da7a5..000000000 --- a/csp/impl/wiring/cache_support/runtime_cache_manager.py +++ /dev/null @@ -1,74 +0,0 @@ -from typing import Dict, List - -import csp -from csp.impl.managed_dataset.managed_dataset import ManagedDataset, ManagedDatasetPartition -from csp.impl.wiring.cache_support.partition_files_container import PartitionFileContainer - - -class _DatasetRecord(csp.Struct): - dataset: ManagedDataset - read: bool = False - write: bool = False - - -class RuntimeCacheManager: - def __init__(self, cache_config, cache_data): - self._cache_config = cache_config - self._partition_file_container = PartitionFileContainer(cache_config) - self._datasets: Dict[int, _DatasetRecord] = {} - self._dataset_write_partitions: List[ManagedDatasetPartition] = [] - self._dataset_read_partitions: List[ManagedDatasetPartition] = [] - self._read_locks = [] - for graph_cache_manager in cache_data.cache_managers.values(): - if graph_cache_manager.outputs is not None: - self.add_read_partition(graph_cache_manager.dataset_partition) - else: - self.add_write_partition(graph_cache_manager.dataset_partition) - - def _validate_and_lock_datasets(self): - res = [] - for dataset_record in self._datasets.values(): - res.append( - dataset_record.dataset.validate_and_lock_metadata( - lock_file_permissions=self._cache_config.lock_file_permissions, - 
data_file_permissions=self._cache_config.data_file_permissions, - read=dataset_record.read, - write=dataset_record.write, - ) - ) - return res - - def __enter__(self): - self._read_locks = [] - self._read_locks.extend(self._validate_and_lock_datasets()) - for partition in self._dataset_write_partitions: - partition.create_root_folder(self._cache_config) - - self._partition_file_container.__enter__() - - def __exit__(self, exc_type, exc_val, exc_tb): - try: - for lock in self._read_locks: - lock.unlock() - finally: - self._read_locks.clear() - - self._partition_file_container.__exit__(exc_type, exc_val, exc_tb) - - def _add_dataset(self, dataset: ManagedDataset, read=False, write=False): - dataset_id = id(dataset) - dataset_record = self._datasets.get(dataset_id) - if dataset_record is None: - dataset_record = _DatasetRecord(dataset=dataset, read=read, write=write) - self._datasets[dataset_id] = dataset_record - return - dataset_record.read |= read - dataset_record.write |= write - - def add_write_partition(self, dataset_partition: ManagedDatasetPartition): - self._add_dataset(dataset_partition.dataset, write=True) - self._dataset_write_partitions.append(dataset_partition) - - def add_read_partition(self, dataset_partition: ManagedDatasetPartition): - self._add_dataset(dataset_partition.dataset, read=True) - self._dataset_read_partitions.append(dataset_partition) diff --git a/csp/impl/wiring/context.py b/csp/impl/wiring/context.py index 73156fcd5..d9bacf532 100644 --- a/csp/impl/wiring/context.py +++ b/csp/impl/wiring/context.py @@ -1,8 +1,6 @@ import threading from datetime import datetime -from typing import Optional -from csp.impl.config import Config from csp.impl.mem_cache import CspGraphObjectsMemCache @@ -13,24 +11,13 @@ def __init__( self, start_time: datetime = None, end_time: datetime = None, - config: Optional[Config] = None, is_global_instance: bool = False, ): - from csp.impl.wiring.cache_support.cache_config_resolver import CacheConfigResolver - from 
csp.impl.wiring.cache_support.graph_building import ContextCacheInfo - self.roots = [] self.start_time = start_time self.end_time = end_time self.mem_cache = None self.delayed_nodes = [] - self.config = config - - self.cache_data = ContextCacheInfo( - cache_managers={}, - managed_datasets_by_graph_object={}, - cache_data_paths_resolver=CacheConfigResolver(getattr(config, "cache_config", None)), - ) self._is_global_instance = is_global_instance if hasattr(self.TLS, "instance") and self.TLS.instance._is_global_instance: @@ -39,11 +26,7 @@ def __init__( # Note we don't copy everything here from the global context since it may cause undesired behaviors. # roots - We want to accumulate roots that are only relevant for the current run, not all the roots in the global context. # start_time, end_time - not even set in the global context - # mem_cache - can cause issues with dynamic graph nodes and with caching (i.e cached graphs can be built differently based on the run period - # and the data that is available on the disk) - # config - can differ between runs - # cache_data - Generally we don't support caching for global contexts. There are a bunch of issues around it at the moment (one reason is that - # at wiring time we need to know start and end time of the graph). 
+ # mem_cache - can cause issues with dynamic graph nodes for delayed_node in prev.delayed_nodes: # The copy of the delayed node will add all the new delayed nodes to the current context delayed_node.copy() diff --git a/csp/impl/wiring/graph.py b/csp/impl/wiring/graph.py index 392bd167d..0446b65c9 100644 --- a/csp/impl/wiring/graph.py +++ b/csp/impl/wiring/graph.py @@ -1,31 +1,13 @@ -import datetime import inspect -import threading import types -from contextlib import contextmanager -import csp.impl.wiring.edge from csp.impl.constants import UNSET from csp.impl.error_handling import ExceptionContext -from csp.impl.managed_dataset.dateset_name_constants import DatasetNameConstants from csp.impl.mem_cache import csp_memoized_graph_object, function_full_name -from csp.impl.types.common_definitions import InputDef from csp.impl.types.instantiation_type_resolver import GraphOutputTypeResolver -from csp.impl.wiring import Signature -from csp.impl.wiring.cache_support.dataset_partition_cached_data import DataSetCachedData, DatasetPartitionCachedData -from csp.impl.wiring.cache_support.graph_building import ( - GraphBuildPartitionCacheManager, - GraphCacheOptions, - resolve_graph_cache_options, -) -from csp.impl.wiring.context import Context from csp.impl.wiring.graph_parser import GraphParser -from csp.impl.wiring.outputs import CacheWriteOnlyOutputsContainer, OutputsContainer -from csp.impl.wiring.special_output_names import ALL_SPECIAL_OUTPUT_NAMES, UNNAMED_OUTPUT_NAME - - -class NoCachedDataException(RuntimeError): - pass +from csp.impl.wiring.outputs import OutputsContainer +from csp.impl.wiring.special_output_names import UNNAMED_OUTPUT_NAME class _GraphDefMetaUsingAux: @@ -42,120 +24,11 @@ def using(self, **_forced_tvars): def __call__(self, *args, **kwargs): return self._graph_meta._instantiate(self._forced_tvars, *args, **kwargs) - def cache_periods(self, start_time, end_time): - return self._graph_meta.cached_data(start_time, end_time) - - def cached_data(self, 
data_folder=None, _forced_tvars=None) -> DatasetPartitionCachedData: - """Get the proxy object for accessing the graph cached data. - This is the basic interface for inspecting cache files and loading cached data as dataframes - :param data_folder: The root folder of the cache or an instance of CacheDataPathResolver - :return: An instance of DatasetPartitionCachedData to access the graph cached data - """ - if data_folder is None: - data_folder = Context.instance().config - return self._graph_meta.cached_data(data_folder, _forced_tvars) - - def cached(self, *args, **kwargs): - """A utility function to ensure that a graph is read from cache - For example if there is a cached graph g. - Calling g(a1, a2, ...) can either read it from cache or write the results to cache if no cached data is found. - Calling g.cached(a1, a2, ...) forces reading from cache; if no cached data is found then an exception will be raised. - :param args: Positional arguments to the graph - :param kwargs: Keyword arguments to the graph - """ - return self._graph_meta.cached(*args, _forced_tvars=self._forced_tvars, **kwargs) - - -class _ForceCached: - """This class is an ugly workaround to avoid instantiating cached graphs. - The problem: - my_graph.cached(...) - is implemented by calling the regular code path of the graph instantiation and checking whether the graph is actually read from cache. This is a - problem since the user doesn't expect the graph to be instantiated if they use the "cached" property. We also can't provide an argument "force_cached" to the instantiation - function since it's mem-cached and an extra argument would cause calls to graph.cached(...) and graph(...) to result in different instances, which is wrong.
- - This class is a workaround to pass this "require_cached" flag not via arguments - """ - - _INSTANCE = threading.local() - - @classmethod - def is_force_cached(cls): - if not hasattr(cls._INSTANCE, "force_cached"): - return False - return cls._INSTANCE.force_cached - - @classmethod - @contextmanager - def force_cached(cls): - prev_value = cls.is_force_cached() - try: - cls._INSTANCE.force_cached = True - yield - finally: - cls._INSTANCE.force_cached = prev_value - - -class _CacheProxy: - """A helper class that allows to access cached data in a given time range, that can be smaller than the engine run time - - Usage: - my_graph.cached[start:end] - The cached property of the graph will return an instance of _CacheProxy which can then be called with the appropriate parameters. - """ - - def __init__(self, owner, csp_cache_start=None, csp_cache_end=None, csp_timestamp_shift=None): - self._owner = owner - self._csp_cache_start = csp_cache_start - self._csp_cache_end = csp_cache_end - self._csp_timestamp_shift = csp_timestamp_shift - - def __getitem__(self, item): - assert isinstance(item, slice), "cached item range must be a slice" - assert item.step is None, "Providing step for cache range is not supported" - res = _CacheProxy(self._owner, csp_timestamp_shift=self._csp_timestamp_shift) - res._csp_cache_start = item.start - # The range values are exclusive but for caching purposes we need inclusive end time - res._csp_cache_end = item.stop - datetime.timedelta(microseconds=1) - return res - - def shifted(self, csp_timestamp_shift: datetime.timedelta): - return _CacheProxy( - self._owner, - csp_cache_start=self._csp_cache_start, - csp_cache_end=self._csp_cache_end, - csp_timestamp_shift=csp_timestamp_shift, - ) - - def __call__(self, *args, _forced_tvars=None, **kwargs): - with _ForceCached.force_cached(): - return self._owner._cached_impl( - _forced_tvars, - self._csp_cache_start, - self._csp_cache_end, - args, - kwargs, - 
csp_timestamp_shift=self._csp_timestamp_shift, - ) - class GraphDefMeta(type): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self._instantiate_func = self._instantiate_impl - ignored_inputs = getattr(self._cache_options, "ignored_inputs", None) - if self._cache_options and ignored_inputs is not None: - non_ignored_inputs = [input for input in self._signature.inputs if input.name not in ignored_inputs] - self._cache_signature = Signature( - name=self._signature._name, - inputs=[ - InputDef(name=s.name, typ=s.typ, kind=s.kind, basket_kind=s.basket_kind, ts_idx=None, arg_idx=i) - for i, s in enumerate(non_ignored_inputs) - ], - outputs=self._signature._outputs, - defaults={k: v for k, v in self._signature._defaults.items() if k not in ignored_inputs}, - ) - else: - self._cache_signature = self._signature if self.memoize or self.force_memoize: if self.wrapped_node: full_name = function_full_name(self.wrapped_node._impl) @@ -171,103 +44,7 @@ def _extract_forced_tvars(cls, d): return d.pop("_forced_tvars") return {} - @property - def _cached_function(self): - return self.wrapped_node._impl if self.wrapped_node else self._func - - def cache_periods(self, start_time, end_time): - from csp.impl.managed_dataset.aggregation_period_utils import AggregationPeriodUtils - - agg_period_utils = AggregationPeriodUtils(self._cache_options.time_aggregation) - return list(agg_period_utils.iterate_periods_in_date_range(start_time, end_time)) - - def cached_data(self, data_folder=None, _forced_tvars=None) -> DatasetPartitionCachedData: - """Get the proxy object for accessing the graph cached data. - This is the basic interface for inspecting cache files and loading cached data as dataframes - :param data_folder: An instance of string (data folder), csp.Config or csp.cache_support.CacheConfig with the appropriate cache config. Note that - only if one of these configs is passed in will category resolution and custom data type serialization be handled properly.
Pass in a string only if none of the above - features are used. - :return: An instance of DatasetPartitionCachedData to access the graph cached data - """ - if not self._cache_options: - raise RuntimeError("Trying to get cached data from graph that doesn't cache") - if data_folder is None: - data_folder = Context.instance().config - - if isinstance(data_folder, csp.Config): - cache_config = data_folder.cache_config - elif isinstance(data_folder, csp.cache_support.CacheConfig): - cache_config = data_folder - else: - cache_config = None - - if cache_config: - cache_serializers = cache_config.cache_serializers - else: - cache_serializers = {} - - cache_signature = self._cache_signature - - dataset = GraphBuildPartitionCacheManager.get_dataset_for_func( - graph=self, func=self._cached_function, cache_options=self._cache_options, data_folder=data_folder - ) - if dataset is None: - return None - - def _get_dataset_partition(*args, **kwargs): - inputs, scalars, tvars = cache_signature.parse_inputs(_forced_tvars, *args, **kwargs) - partition_values = dict(zip((i.name for i in cache_signature._inputs), scalars)) - return dataset.get_partition(partition_values) - - return DataSetCachedData(dataset, cache_serializers, _get_dataset_partition) - - @property - def cached(self) -> _CacheProxy: - """ - Usage: - my_graph.cached[start:end] - Will return an instance of _CacheProxy which can then be called with the appropriate parameters. - :return: A cache proxy that can be used to limit the time of the returned graph. - """ - return _CacheProxy(self) - - def _cached_impl(self, _forced_tvars, csp_cache_start, csp_cache_end, args, kwargs, csp_timestamp_shift=None): - """A utility function to ensure that a graph is read from cache - For example if there is a cached graph g. - Calling g(a1, a2, ...) can either read it from cache or write the results to cache if no cached data is found. - Calling g.cached(a1, a2, ...)
forces reading from cache; if no cached data is found then an exception will be raised. - :param args: Positional arguments to the graph - :param csp_cache_start: The start time of the cached data before which we don't want to load any data - :param csp_cache_end: The end time of the cached data after which we don't want to load any data - :param kwargs: Keyword arguments to the graph - """ - if Context.TLS.instance.config and hasattr(Context.TLS.instance.config, "cache_config") and self._cache_options: - read_from_cache, res, _ = self._instantiate_func( - _forced_tvars, self._signature, args, kwargs, csp_cache_start, csp_cache_end, csp_timestamp_shift - ) - assert read_from_cache - else: - raise NoCachedDataException( - f"No data found in cache for {self._signature._name} for the given run period, seems like cache_config is unset" - ) - return res - - def _raise_if_forced_cache_read(self, missing_period=None): - if _ForceCached.is_force_cached(): - if missing_period: - missing_period_str = f" {str(missing_period[0])} to {str(missing_period[1])}" - raise NoCachedDataException( - f"No data found in cache for {self._signature._name} for period{missing_period_str}" - ) - else: - raise NoCachedDataException( - f"No data found in cache for {self._signature._name} for the given run period" - ) - - def _instantiate_impl( - self, _forced_tvars, signature, args, kwargs, csp_cache_start=None, csp_cache_end=None, csp_timestamp_shift=None - ): - read_from_cache = False + def _instantiate_impl(self, _forced_tvars, signature, args, kwargs): inputs, scalars, tvars = signature.parse_inputs(_forced_tvars, *args, allow_none_ts=True, **kwargs) basket_shape_eval_inputs = list(scalars) @@ -278,62 +55,13 @@ def _instantiate_impl( expected_outputs = signature.resolve_output_definitions( tvars=tvars, basket_shape_eval_inputs=basket_shape_eval_inputs ) - if signature.special_outputs: - expected_regular_outputs = tuple(v for v in expected_outputs if v.name not in ALL_SPECIAL_OUTPUT_NAMES) -
else: - expected_regular_outputs = expected_outputs - - cache_manager = None - if ( - hasattr(Context.TLS, "instance") - and Context.TLS.instance.config - and hasattr(Context.TLS.instance.config, "cache_config") - and self._cache_options - ): - ignored_inputs = getattr(self._cache_options, "ignored_inputs", set()) - cache_scalars = tuple(s for s, s_def in zip(scalars, signature.scalars) if s_def.name not in ignored_inputs) - - cache_manager = GraphBuildPartitionCacheManager.create_cache_manager( - graph=self, - func=self._cached_function, - signature=self._cache_signature, - non_ignored_scalars=cache_scalars, - all_scalars=scalars, - cache_options=self._cache_options, - expected_outputs=expected_regular_outputs, - tvars=tvars, - csp_cache_start=csp_cache_start, - csp_cache_end=csp_cache_end, - csp_timestamp_shift=csp_timestamp_shift, - ) - allow_non_cached_read = True - if cache_manager: - if cache_manager.outputs is not None: - res = cache_manager.outputs - read_from_cache = True - elif cache_manager.written_outputs is not None: - res = cache_manager.written_outputs - else: - self._raise_if_forced_cache_read(cache_manager.first_missing_period) - res = self._func(*args, **kwargs) - cache_manager.cache_outputs(res) - allow_non_cached_read = not cache_manager.is_force_cache_read - else: - self._raise_if_forced_cache_read() - res = self._func(*args, **kwargs) - - if read_from_cache: - expected_outputs = expected_regular_outputs + res = self._func(*args, **kwargs) # Validate graph return values if isinstance(res, OutputsContainer): outputs_raw = [] for e_o in expected_outputs: - output_name = ( - e_o.name - if e_o.name - else (DatasetNameConstants.UNNAMED_OUTPUT_NAME if read_from_cache else UNNAMED_OUTPUT_NAME) - ) + output_name = e_o.name if e_o.name else UNNAMED_OUTPUT_NAME cur_o = res._get(output_name, UNSET) if cur_o is UNSET: raise KeyError(f"Output {output_name} is not returned from the graph") @@ -360,26 +88,17 @@ def _instantiate_impl( 
output_definitions=expected_outputs, values=outputs_raw, forced_tvars=tvars, - allow_subtypes=self._cache_options is None, ) if signature.special_outputs: - if not read_from_cache: - if expected_outputs[0].name is None: - res = next(iter(res._values())) - else: - res = OutputsContainer(**{k: v for k, v in res._items() if k not in ALL_SPECIAL_OUTPUT_NAMES}) + if expected_outputs[0].name is None: + res = next(iter(res._values())) + else: + res = OutputsContainer(**{k: v for k, v in res._items() if k != UNNAMED_OUTPUT_NAME}) - return read_from_cache, res, allow_non_cached_read + return res def _instantiate(self, _forced_tvars, *args, **kwargs): - _, res, allow_non_cached_read = self._instantiate_func(_forced_tvars, self._signature, args=args, kwargs=kwargs) - if not allow_non_cached_read: - if isinstance(res, csp.impl.wiring.edge.Edge): - return CacheWriteOnlyOutputsContainer(iter([res])) - else: - return CacheWriteOnlyOutputsContainer(iter(res)) - else: - return res + return self._instantiate_func(_forced_tvars, self._signature, args=args, kwargs=kwargs) def __call__(cls, *args, **kwargs): return cls._instantiate(None, *args, **kwargs) @@ -400,14 +119,9 @@ def _create_graph( signature, memoize, force_memoize, - cache, - cache_options, wrapped_function=None, wrapped_node=None, ): - resolved_cache_options = resolve_graph_cache_options( - signature=signature, cache_enabled=cache, cache_options=cache_options - ) return GraphDefMeta( func_name, (object,), @@ -416,7 +130,6 @@ "_func": impl, "memoize": memoize, "force_memoize": force_memoize, - "_cache_options": resolved_cache_options, "__wrapped__": wrapped_function, "__module__": wrapped_function.__module__, "wrapped_node": wrapped_node, @@ -430,18 +143,15 @@ def graph( *, memoize=True, force_memoize=False, - cache: bool = False, - cache_options: GraphCacheOptions = None, name=None, debug_print=False, ): """ :param func: - :param memoize: Specify whether the node should be memoized (default True) -
:param force_memoize: If True, the node will be memoized even if csp.memoize(False) was called. Usually it should not be set, set + :param memoize: Specify whether the graph should be memoized (default True) + :param force_memoize: If True, the graph will be memoized even if csp.memoize(False) was called. Usually it should not be set, set this to True ONLY if memoization is required to guarantee correctness of the function (i.e. the function must be called at most once for each set of parameters). - :param cache_options: The options for graph caching :param name: Provide a custom name for the constructed graph type :param debug_print: A boolean that specifies that processed function should be printed :return: @@ -450,12 +160,10 @@ def graph( def _impl(func): with ExceptionContext(): - add_cache_control_output = cache_options is not None and getattr(cache_options, "controlled_cache", False) parser = GraphParser( name or func.__name__, func, func_frame, - add_cache_control_output=add_cache_control_output, debug_print=debug_print, ) parser.parse() @@ -468,8 +176,6 @@ def _impl(func): signature, memoize, force_memoize, - cache, - cache_options, wrapped_function=func, ) diff --git a/csp/impl/wiring/graph_parser.py b/csp/impl/wiring/graph_parser.py index 36c243174..efe73cdc7 100644 --- a/csp/impl/wiring/graph_parser.py +++ b/csp/impl/wiring/graph_parser.py @@ -2,19 +2,18 @@ from csp.impl.wiring import Signature from csp.impl.wiring.base_parser import BaseParser, CspParseError, _pythonic_depr_warning -from csp.impl.wiring.special_output_names import CSP_CACHE_ENABLED_OUTPUT, UNNAMED_OUTPUT_NAME +from csp.impl.wiring.special_output_names import UNNAMED_OUTPUT_NAME class GraphParser(BaseParser): _DEBUG_PARSE = False - def __init__(self, name, raw_func, func_frame, debug_print=False, add_cache_control_output=False): + def __init__(self, name, raw_func, func_frame, debug_print=False): super().__init__( name=name, raw_func=raw_func, func_frame=func_frame,
debug_print=debug_print, - add_cache_control_output=add_cache_control_output, ) def visit_FunctionDef(self, node): @@ -51,26 +50,17 @@ def visit_Return(self, node): if len(self._outputs) and node.value is None: raise CspParseError("return does not return values with non empty outputs") - if self._add_cache_control_output: - if isinstance(node.value, ast.Call): - parsed_return = self.visit_Call(node.value) - if isinstance(parsed_return, ast.Return): - return parsed_return + if isinstance(node.value, ast.Call): + if len(self._outputs) > 1: + self._validate_output(node) - returned_value = node.value - return self._wrap_returned_value_and_add_special_outputs(returned_value) - else: - if isinstance(node.value, ast.Call): - if len(self._outputs) > 1: - self._validate_output(node) - - parsed_return = self.visit_Call(node.value) - if isinstance(parsed_return, ast.Call): - return ast.Return(value=parsed_return, lineno=node.lineno, end_lineno=node.end_lineno) + parsed_return = self.visit_Call(node.value) + if isinstance(parsed_return, ast.Call): + return ast.Return(value=parsed_return, lineno=node.lineno, end_lineno=node.end_lineno) - return parsed_return + return parsed_return - return node + return node def _parse_single_output_definition(self, name, arg_type_node, ts_idx, typ=None): return self._parse_single_output_definition_with_shapes( @@ -101,11 +91,8 @@ def _parse_return(self, node, special_outputs=None): raise CspParseError("returning from graph without any outputs defined", node.lineno) elif ( len(self._signature._outputs) == 1 and self._signature._outputs[0].name is None - ): # graph only has one unnamed output - if self._add_cache_control_output: - return self._wrap_returned_value_and_add_special_outputs(node.args[0]) - else: - return ast.Return(value=node.args[0], lineno=node.lineno, end_lineno=node.end_lineno) + ): # graph only has one unnamed output: + return ast.Return(value=node.args[0], lineno=node.lineno, end_lineno=node.end_lineno) else: node.keywords = 
[ast.keyword(arg=self._signature._outputs[0].name, value=node.args[0])] node.args.clear() @@ -160,32 +147,10 @@ def visit_Expr(self, node): return res.value return res - def _get_special_output_name_mapping(self): - """ - :return: A dict mapping local_variable->output_name - """ - return {CSP_CACHE_ENABLED_OUTPUT: CSP_CACHE_ENABLED_OUTPUT} - def visit_Call(self, node: ast.Call): if (isinstance(node.func, ast.Name) and node.func.id == "__return__") or BaseParser._is_csp_output_call(node): special_outputs = {} - if self._add_cache_control_output: - special_outputs = self._get_special_output_name_mapping() return self._parse_return(node, special_outputs) - if ( - isinstance(node.func, ast.Attribute) - and isinstance(node.func.value, ast.Name) - and node.func.value.id == "csp" - and node.func.attr == "set_cache_enable_ts" - ): - if len(node.args) != 1 or node.keywords: - raise CspParseError("Invalid call to csp.set_cache_enable_ts", node.lineno) - if self._add_cache_control_output: - return ast.Assign(targets=[ast.Name(id=CSP_CACHE_ENABLED_OUTPUT, ctx=ast.Store())], value=node.args[0]) - else: - raise CspParseError( - "Invalid call to csp.set_cache_enable_ts in graph with non controlled cache", node.lineno - ) return self.generic_visit(node) def _is_ts_args_removed_from_signature(self): @@ -193,23 +158,10 @@ def _is_ts_args_removed_from_signature(self): def _parse_impl(self): self._inputs, input_defaults, self._outputs = self.parse_func_signature(self._funcdef) - self._resolve_special_outputs() # Should have inputs and outputs at this point self._signature = Signature( self._name, self._inputs, self._outputs, input_defaults, special_outputs=self._special_outputs ) - # We need to set default value for the cache control variable as the first command in the function - if self._add_cache_control_output: - self._funcdef.body = [ - ast.Assign( - targets=[ast.Name(id=CSP_CACHE_ENABLED_OUTPUT, ctx=ast.Store())], - value=ast.Call( - func=ast.Attribute(value=ast.Name(id="csp", 
ctx=ast.Load()), attr="null_ts", ctx=ast.Load()), - args=[ast.Name(id="bool", ctx=ast.Load())], - keywords=[], - ), - ) - ] + self._funcdef.body self.generic_visit(self._funcdef) newfuncdef = ast.FunctionDef(name=self._funcdef.name, body=self._funcdef.body, returns=None) diff --git a/csp/impl/wiring/node.py b/csp/impl/wiring/node.py index e4cfeab0b..a905bee36 100644 --- a/csp/impl/wiring/node.py +++ b/csp/impl/wiring/node.py @@ -195,16 +195,12 @@ def _create_node( cppimpl, pre_create_hook, name, - cache: bool = False, - cache_options=None, ): - create_wrapper = cache or cache_options parser = NodeParser( name, func, func_frame, debug_print=debug_print, - add_cache_control_output=cache_options and cache_options.controlled_cache, ) parser.parse() @@ -223,26 +219,7 @@ def _create_node( "__doc__": parser._docstring, }, ) - if create_wrapper: - from csp.impl.wiring.graph import _create_graph - - def wrapper(*args, **kwargs): - return nodetype(*args, **kwargs) - - return _create_graph( - name, - func.__doc__, - wrapper, - parser._signature.copy(drop_alarms=True), - memoize, - force_memoize, - cache, - cache_options, - wrapped_function=func, - wrapped_node=nodetype, - ) - else: - return nodetype + return nodetype def _node_internal_use( @@ -254,8 +231,6 @@ def _node_internal_use( debug_print=False, cppimpl=None, pre_create_hook=None, - cache: bool = False, - cache_options=None, name=None, ): """A decorator similar to the @node decorator that exposes some internal arguments that shouldn't be visible to users""" @@ -272,8 +247,6 @@ def _impl(func): cppimpl=cppimpl, pre_create_hook=pre_create_hook, name=name or func.__name__, - cache=cache, - cache_options=cache_options, ) if func is None: @@ -291,8 +264,6 @@ def node( force_memoize=False, debug_print=False, cppimpl=None, - cache: bool = False, - cache_options=None, name=None, ): """ @@ -303,8 +274,6 @@ def node( set of parameters).
:param debug_print: A boolean that specifies that processed function should be printed :param cppimpl: - :param cache: - :param cache_options: :param name: Provide a custom name for the constructed node type, helpful when viewing a graph with many same-named nodes :return: """ @@ -317,7 +286,5 @@ def node( debug_print=debug_print, cppimpl=cppimpl, pre_create_hook=None, - cache=cache, - cache_options=cache_options, name=name, ) diff --git a/csp/impl/wiring/node_parser.py b/csp/impl/wiring/node_parser.py index 818ea2f7d..af6726ead 100644 --- a/csp/impl/wiring/node_parser.py +++ b/csp/impl/wiring/node_parser.py @@ -13,7 +13,6 @@ from csp.impl.wiring import Signature from csp.impl.wiring.ast_utils import ASTUtils from csp.impl.wiring.base_parser import BaseParser, CspParseError, _pythonic_depr_warning -from csp.impl.wiring.special_output_names import CSP_CACHE_ENABLED_OUTPUT class _SingleProxyFuncArgResolver(object): @@ -96,13 +95,12 @@ class NodeParser(BaseParser): _INPUT_PROXY_VARNAME = "input_proxy" _OUTPUT_PROXY_VARNAME = "output_proxy" - def __init__(self, name, raw_func, func_frame, debug_print=False, add_cache_control_output=False): + def __init__(self, name, raw_func, func_frame, debug_print=False): super().__init__( name=name, raw_func=raw_func, func_frame=func_frame, debug_print=debug_print, - add_cache_control_output=add_cache_control_output, ) self._stateblock = [] self._startblock = [] @@ -661,14 +659,6 @@ def _parse_engine_end_time(self, node): keywords=[], ) - def _parse_csp_enable_cache(self, node): - if len(node.args) != 1 or node.keywords: - raise CspParseError("Invalid call to csp.enable_cache", node.lineno) - - output = self._ts_outproxy_expr(CSP_CACHE_ENABLED_OUTPUT) - res = ast.BinOp(left=output, op=ast.Add(), right=node.args[0]) - return res - def _parse_csp_engine_stats(self, node): if len(node.args) or len(node.keywords): raise CspParseError("csp.engine_stats takes no arguments", node.lineno) @@ -770,7 +760,6 @@ def 
_is_ts_args_removed_from_signature(self): def _parse_impl(self): self._inputs, input_defaults, self._outputs = self.parse_func_signature(self._funcdef) idx = self._parse_special_blocks(self._funcdef.body) - self._resolve_special_outputs() self._signature = Signature( self._name, self._inputs, @@ -918,7 +907,6 @@ def _init_internal_maps(cls): "csp.set_buffering_policy": cls._parse_set_buffering_policy, "csp.engine_start_time": cls._parse_engine_start_time, "csp.engine_end_time": cls._parse_engine_end_time, - "csp.enable_cache": cls._parse_csp_enable_cache, "csp.engine_stats": cls._parse_csp_engine_stats, } diff --git a/csp/impl/wiring/outputs.py b/csp/impl/wiring/outputs.py index 90afe0769..1e0b1ecb0 100644 --- a/csp/impl/wiring/outputs.py +++ b/csp/impl/wiring/outputs.py @@ -35,18 +35,3 @@ def _get(self, item, dflt=None): def __repr__(self): return "OutputsContainer( %s )" % (",".join("%s=%r" % (k, v) for k, v in self._items())) - - -class CacheWriteOnlyOutputsContainer(list): - def __repr__(self): - return f'CacheWriteOnlyOutputsContainer( {",".join(v for v in self)} )' - - def __getattr__(self, item): - raise RuntimeError( - "Outputs of graphs with custom data_timestamp_column_name must be read using .cached property" - ) - - def __getitem__(self, item): - raise RuntimeError( - "Outputs of graphs with custom data_timestamp_column_name must be read using .cached property" - ) diff --git a/csp/impl/wiring/runtime.py b/csp/impl/wiring/runtime.py index bcc7dd15b..d334bd1e1 100644 --- a/csp/impl/wiring/runtime.py +++ b/csp/impl/wiring/runtime.py @@ -3,15 +3,13 @@ import time from collections import deque from datetime import datetime, timedelta -from typing import Optional from csp.impl.__cspimpl import _cspimpl -from csp.impl.config import Config from csp.impl.error_handling import ExceptionContext from csp.impl.wiring.adapters import _graph_return_adapter from csp.impl.wiring.context import Context from csp.impl.wiring.edge import Edge -from csp.impl.wiring.outputs 
import CacheWriteOnlyOutputsContainer, OutputsContainer +from csp.impl.wiring.outputs import OutputsContainer from csp.profiler import Profiler, graph_info MAX_END_TIME = datetime(2261, 12, 31, 23, 59, 50, 999999) @@ -34,14 +32,14 @@ def _normalize_run_times(starttime, endtime, realtime): return starttime, endtime -def build_graph(f, *args, config: Optional[Config] = None, starttime=None, endtime=None, realtime=False, **kwargs): +def build_graph(f, *args, starttime=None, endtime=None, realtime=False, **kwargs): assert ( (starttime is None) == (endtime is None) ), "Start time and end time should either both be specified or none of them should be specified when building a graph" if starttime: starttime, endtime = _normalize_run_times(starttime, endtime, realtime) with ExceptionContext(), GraphRunInfo(starttime=starttime, endtime=endtime, realtime=realtime), Context( - start_time=starttime, end_time=endtime, config=config + start_time=starttime, end_time=endtime ) as c: # Setup the profiler if within a profiling context if Profiler.instance() is not None and not Profiler.instance().initialized: @@ -54,7 +52,7 @@ def build_graph(f, *args, config: Optional[Config] = None, starttime=None, endti processed_outputs = OutputsContainer() - if outputs is not None and not isinstance(outputs, CacheWriteOnlyOutputsContainer): + if outputs is not None: if isinstance(outputs, Edge): processed_outputs[0] = outputs elif isinstance(outputs, list): @@ -112,19 +110,6 @@ def _build_engine(engine, context, memo=None): return engine -def _run_engine(engine, starttime, endtime, context_config=None, cache_data=None): - # context = Context.TLS.instance - cache_config = getattr(context_config, "cache_config", None) if context_config else None - if cache_config: - from csp.impl.wiring.cache_support.runtime_cache_manager import RuntimeCacheManager - - runtime_cache_manager = RuntimeCacheManager(cache_config, cache_data) - with runtime_cache_manager: - return engine.run(starttime, endtime) - 
else: - return engine.run(starttime, endtime) - - class GraphRunInfo: TLS = threading.local() @@ -174,7 +159,6 @@ def run( *args, starttime=None, endtime=MAX_END_TIME, - config: Optional[Config] = None, queue_wait_time=None, realtime=False, output_numpy=False, @@ -193,9 +177,6 @@ def run( orig_g.context = None if isinstance(g, Context): - if config is not None: - raise RuntimeError("Config can not be specified when running a built graph") - if g.start_time is not None: assert ( (g.start_time, g.end_time) == (starttime, endtime) @@ -212,8 +193,6 @@ def run( engine = _cspimpl.PyEngine(**engine_settings) engine = _build_engine(engine, g) - context_config = g.config - cache_data = g.cache_data mem_cache = g.mem_cache # Release graph construct at this point to free up all the edge / nodedef memory thats no longer needed del g @@ -224,18 +203,14 @@ def run( time.sleep((starttime - datetime.utcnow()).total_seconds()) with mem_cache: - return _run_engine( - engine, starttime=starttime, endtime=endtime, context_config=context_config, cache_data=cache_data - ) + return engine.run(starttime, endtime) if isinstance(g, Edge): - if config is not None: - raise RuntimeError("Config can not be specified when running a built graph") return run(lambda: g, starttime=starttime, endtime=endtime, **engine_settings) # wrapped in a _WrappedContext so that we can give up the mem before run graph = _WrappedContext( - build_graph(g, *args, starttime=starttime, endtime=endtime, realtime=realtime, config=config, **kwargs) + build_graph(g, *args, starttime=starttime, endtime=endtime, realtime=realtime, **kwargs) ) with GraphRunInfo(starttime=starttime, endtime=endtime, realtime=realtime): return run(graph, starttime=starttime, endtime=endtime, **engine_settings) diff --git a/csp/impl/wiring/signature.py b/csp/impl/wiring/signature.py index eb058274f..2a3ebdbb0 100644 --- a/csp/impl/wiring/signature.py +++ b/csp/impl/wiring/signature.py @@ -100,7 +100,6 @@ def parse_inputs(self, forced_tvars, 
*args, allow_subtypes=True, allow_none_ts=F input_definitions=self._inputs[self._num_alarms :], arguments=flat_args, forced_tvars=forced_tvars, - allow_subtypes=allow_subtypes, allow_none_ts=allow_none_ts, ) diff --git a/csp/impl/wiring/special_output_names.py b/csp/impl/wiring/special_output_names.py index 94b2c9dc6..3f66c0636 100644 --- a/csp/impl/wiring/special_output_names.py +++ b/csp/impl/wiring/special_output_names.py @@ -1,4 +1 @@ -CSP_CACHE_ENABLED_OUTPUT = "__csp_cache_enable_ts" UNNAMED_OUTPUT_NAME = "__csp__unnamed_output__" - -ALL_SPECIAL_OUTPUT_NAMES = {CSP_CACHE_ENABLED_OUTPUT} diff --git a/csp/impl/wiring/threaded_runtime.py b/csp/impl/wiring/threaded_runtime.py index 0726e3c44..55e326017 100644 --- a/csp/impl/wiring/threaded_runtime.py +++ b/csp/impl/wiring/threaded_runtime.py @@ -1,8 +1,6 @@ import threading -from typing import Optional import csp -from csp.impl.config import Config from csp.impl.pushadapter import PushInputAdapter from csp.impl.types.tstype import ts from csp.impl.wiring import MAX_END_TIME, py_push_adapter_def @@ -110,7 +108,6 @@ def run_on_thread( *args, starttime=None, endtime=MAX_END_TIME, - config: Optional[Config] = None, queue_wait_time=None, realtime=False, auto_shutdown=False, @@ -122,7 +119,6 @@ def run_on_thread( *args, starttime=starttime, endtime=endtime, - config=config, queue_wait_time=queue_wait_time, realtime=realtime, auto_shutdown=auto_shutdown, diff --git a/csp/tests/impl/test_struct.py b/csp/tests/impl/test_struct.py index 0920aa5cc..6f76339a9 100644 --- a/csp/tests/impl/test_struct.py +++ b/csp/tests/impl/test_struct.py @@ -171,20 +171,92 @@ def __init__(self, x: int): # items[:-2] are normal values of the given type that should be handled, # items[-2] is a normal value for non-generic and non-str types and None for generic and str types (the purpose is to test the raise of TypeError if a single object instead of a sequence is passed), # items[-1] is a value of a different type that is not convertible to the 
given type for non-generic types and None for generic types (the purpose is to test the raise of TypeError if an object of the wrong type is passed). -pystruct_list_test_values = { - int : [4, 2, 3, 5, 6, 7, 8, 's'], +pystruct_list_test_values = { + int: [4, 2, 3, 5, 6, 7, 8, "s"], bool: [True, True, True, False, True, False, True, 2], - float: [1.4, 3.2, 2.7, 1.0, -4.5, -6.0, -2.0, 's'], - datetime: [datetime(2022, 12, 6, 1, 2, 3), datetime(2022, 12, 7, 2, 2, 3), datetime(2022, 12, 8, 3, 2, 3), datetime(2022, 12, 9, 4, 2, 3), datetime(2022, 12, 10, 5, 2, 3), datetime(2022, 12, 11, 6, 2, 3), datetime(2022, 12, 13, 7, 2, 3), timedelta(seconds=.123)], - timedelta: [timedelta(seconds=.123), timedelta(seconds=12), timedelta(seconds=1), timedelta(seconds=.5), timedelta(seconds=123), timedelta(seconds=70), timedelta(seconds=700), datetime(2022, 12, 8, 3, 2, 3)], - date: [date(2022, 12, 6), date(2022, 12, 7), date(2022, 12, 8), date(2022, 12, 9), date(2022, 12, 10), date(2022, 12, 11), date(2022, 12, 13), timedelta(seconds=.123)], - time: [time(1, 2, 3), time(2, 2, 3), time(3, 2, 3), time(4, 2, 3), time(5, 2, 3), time(6, 2, 3), time(7, 2, 3), timedelta(seconds=.123)], - str : ['s', 'pqr', 'masd', 'wes', 'as', 'm', None, 5], - csp.Struct: [SimpleStruct(a = 1), AnotherSimpleStruct(b = 'sd'), SimpleStruct(a = 3), AnotherSimpleStruct(b = 'sdf'), SimpleStruct(a = -4), SimpleStruct(a = 5), SimpleStruct(a = 7), 4], # untyped struct list - SimpleStruct: [SimpleStruct(a = 1), SimpleStruct(a = 3), SimpleStruct(a = -1), SimpleStruct(a = -4), SimpleStruct(a = 5), SimpleStruct(a = 100), SimpleStruct(a = 1200), AnotherSimpleStruct(b = 'sd')], - SimpleEnum: [SimpleEnum.A, SimpleEnum.C, SimpleEnum.B, SimpleEnum.B, SimpleEnum.B, SimpleEnum.C, SimpleEnum.C, AnotherSimpleEnum.D], + float: [1.4, 3.2, 2.7, 1.0, -4.5, -6.0, -2.0, "s"], + datetime: [ + datetime(2022, 12, 6, 1, 2, 3), + datetime(2022, 12, 7, 2, 2, 3), + datetime(2022, 12, 8, 3, 2, 3), + datetime(2022, 12, 9, 4, 2, 3), + 
datetime(2022, 12, 10, 5, 2, 3), + datetime(2022, 12, 11, 6, 2, 3), + datetime(2022, 12, 13, 7, 2, 3), + timedelta(seconds=0.123), + ], + timedelta: [ + timedelta(seconds=0.123), + timedelta(seconds=12), + timedelta(seconds=1), + timedelta(seconds=0.5), + timedelta(seconds=123), + timedelta(seconds=70), + timedelta(seconds=700), + datetime(2022, 12, 8, 3, 2, 3), + ], + date: [ + date(2022, 12, 6), + date(2022, 12, 7), + date(2022, 12, 8), + date(2022, 12, 9), + date(2022, 12, 10), + date(2022, 12, 11), + date(2022, 12, 13), + timedelta(seconds=0.123), + ], + time: [ + time(1, 2, 3), + time(2, 2, 3), + time(3, 2, 3), + time(4, 2, 3), + time(5, 2, 3), + time(6, 2, 3), + time(7, 2, 3), + timedelta(seconds=0.123), + ], + str: ["s", "pqr", "masd", "wes", "as", "m", None, 5], + csp.Struct: [ + SimpleStruct(a=1), + AnotherSimpleStruct(b="sd"), + SimpleStruct(a=3), + AnotherSimpleStruct(b="sdf"), + SimpleStruct(a=-4), + SimpleStruct(a=5), + SimpleStruct(a=7), + 4, + ], # untyped struct list + SimpleStruct: [ + SimpleStruct(a=1), + SimpleStruct(a=3), + SimpleStruct(a=-1), + SimpleStruct(a=-4), + SimpleStruct(a=5), + SimpleStruct(a=100), + SimpleStruct(a=1200), + AnotherSimpleStruct(b="sd"), + ], + SimpleEnum: [ + SimpleEnum.A, + SimpleEnum.C, + SimpleEnum.B, + SimpleEnum.B, + SimpleEnum.B, + SimpleEnum.C, + SimpleEnum.C, + AnotherSimpleEnum.D, + ], list: [[1], [1, 2, 1], [6], [8, 3, 5], [3], [11, 8], None, None], # generic type list - SimpleClass: [SimpleClass(x = 1), SimpleClass(x = 5), SimpleClass(x = 9), SimpleClass(x = -1), SimpleClass(x = 2), SimpleClass(x = 3), None, None], # generic type user-defined + SimpleClass: [ + SimpleClass(x=1), + SimpleClass(x=5), + SimpleClass(x=9), + SimpleClass(x=-1), + SimpleClass(x=2), + SimpleClass(x=3), + None, + None, + ], # generic type user-defined } @@ -705,8 +777,8 @@ def __init__(self, iterable=None): class StructWithListDerivedType(csp.Struct): ldt: ListDerivedType - s1 = StructWithListDerivedType(ldt=ListDerivedType([1,2])) - 
self.assertTrue(isinstance(s1.to_dict()['ldt'], ListDerivedType)) + s1 = StructWithListDerivedType(ldt=ListDerivedType([1, 2])) + self.assertTrue(isinstance(s1.to_dict()["ldt"], ListDerivedType)) s2 = StructWithListDerivedType.from_dict(s1.to_dict()) self.assertEqual(s1, s2) @@ -1813,14 +1885,15 @@ def custom_jsonifier(obj): json.loads(test_struct.to_json(custom_jsonifier)) def test_list_field_append(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [] ) + + s = A(a=[]) s.a.append(v[0]) - + self.assertEqual(s.a, [v[0]]) s.a.append(v[1]) @@ -1834,14 +1907,15 @@ class A(csp.Struct): s.a.append(v[-1]) def test_list_field_insert(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [] ) + + s = A(a=[]) s.a.insert(0, v[0]) - + self.assertEqual(s.a, [v[0]]) s.a.insert(1, v[1]) @@ -1864,19 +1938,20 @@ class A(csp.Struct): s.a.insert(-1, v[-1]) def test_list_field_pop(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A(a = [v[0], v[1], v[2], v[3], v[4]]) + + s = A(a=[v[0], v[1], v[2], v[3], v[4]]) b = s.a.pop() - + self.assertEqual(s.a, [v[0], v[1], v[2], v[3]]) self.assertEqual(b, v[4]) b = s.a.pop(-1) - + self.assertEqual(s.a, [v[0], v[1], v[2]]) self.assertEqual(b, v[3]) @@ -1884,13 +1959,13 
@@ class A(csp.Struct): self.assertEqual(s.a, [v[0], v[2]]) self.assertEqual(b, v[1]) - + with self.assertRaises(IndexError) as e: s.a.pop() s.a.pop() s.a.pop() - - s = A(a = [v[0], v[1], v[2], v[3], v[4]]) + + s = A(a=[v[0], v[1], v[2], v[3], v[4]]) b = s.a.pop(-3) @@ -1904,14 +1979,15 @@ class A(csp.Struct): s.a.pop(4) def test_list_field_set_item(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2]] ) + + s = A(a=[v[0], v[1], v[2]]) s.a.__setitem__(0, v[3]) - + self.assertEqual(s.a, [v[3], v[1], v[2]]) s.a[1] = v[4] @@ -1927,7 +2003,7 @@ class A(csp.Struct): with self.assertRaises(IndexError) as e: s.a[-100] = v[0] - + s.a[5:6] = [v[0], v[1], v[2]] self.assertEqual(s.a, [v[3], v[4], v[5], v[0], v[1], v[2]]) @@ -1944,7 +2020,7 @@ class A(csp.Struct): self.assertEqual(s.a, [v[3], v[1], v[2], v[2], v[5]]) - # Check if not str or generic type (as str is a sequence of str) + # Check if not str or generic type (as str is a sequence of str) if v[-2] is not None: with self.assertRaises(TypeError) as e: s.a[1:4] = v[-2] @@ -1964,41 +2040,67 @@ class A(csp.Struct): self.assertEqual(s.a, [v[3], v[1], v[2], v[2], v[5]]) def test_list_field_reverse(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2], v[3]] ) + + s = A(a=[v[0], v[1], v[2], v[3]]) s.a.reverse() - + self.assertEqual(s.a, [v[3], v[2], v[1], v[0]]) - + def test_list_field_sort(self): - ''' Was a BUG when the struct with list field was not recognizing 
changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" # Not using pystruct_list_test_values, as sort() tests are of different semantics (order and sorting key existence matters). - values = { - int : [1, 5, 2, 2, -1, -5, 's'], - float: [1.4, 5.2, 2.7, 2.7, -1.4, -5.2, 's'], - datetime: [datetime(2022, 12, 6, 1, 2, 3), datetime(2022, 12, 8, 3, 2, 3), datetime(2022, 12, 7, 2, 2, 3), datetime(2022, 12, 7, 2, 2, 3), datetime(2022, 12, 5, 2, 2, 3), datetime(2022, 12, 3, 2, 2, 3), None], - timedelta: [timedelta(seconds=1), timedelta(seconds=123), timedelta(seconds=12), timedelta(seconds=12), timedelta(seconds=.1), timedelta(seconds=.01), None], - date: [date(2022, 12, 6), date(2022, 12, 8), date(2022, 12, 7), date(2022, 12, 7), date(2022, 12, 5), date(2022, 12, 3), None], + values = { + int: [1, 5, 2, 2, -1, -5, "s"], + float: [1.4, 5.2, 2.7, 2.7, -1.4, -5.2, "s"], + datetime: [ + datetime(2022, 12, 6, 1, 2, 3), + datetime(2022, 12, 8, 3, 2, 3), + datetime(2022, 12, 7, 2, 2, 3), + datetime(2022, 12, 7, 2, 2, 3), + datetime(2022, 12, 5, 2, 2, 3), + datetime(2022, 12, 3, 2, 2, 3), + None, + ], + timedelta: [ + timedelta(seconds=1), + timedelta(seconds=123), + timedelta(seconds=12), + timedelta(seconds=12), + timedelta(seconds=0.1), + timedelta(seconds=0.01), + None, + ], + date: [ + date(2022, 12, 6), + date(2022, 12, 8), + date(2022, 12, 7), + date(2022, 12, 7), + date(2022, 12, 5), + date(2022, 12, 3), + None, + ], time: [time(5, 2, 3), time(7, 2, 3), time(6, 2, 3), time(6, 2, 3), time(4, 2, 3), time(3, 2, 3), None], - str : ['s', 'xyz', 'w', 'w', 'bds', 'a', None], + str: ["s", "xyz", "w", "w", "bds", "a", None], } for typ, v in values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2], v[3], v[4], v[5]] ) - + + s = A(a=[v[0], v[1], v[2], v[3], v[4], v[5]]) + s.a.sort() - + self.assertEqual(s.a, [v[5], v[4], v[0], v[2], v[3], v[1]]) s.a.sort(reverse=True) 
- + self.assertEqual(s.a, [v[1], v[2], v[3], v[0], v[4], v[5]]) with self.assertRaises(TypeError) as e: @@ -2012,16 +2114,17 @@ class A(csp.Struct): s.a.sort(key=abs) self.assertEqual(s.a, [v[0], v[4], v[2], v[3], v[1], v[5]]) - + def test_list_field_extend(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2]] ) + + s = A(a=[v[0], v[1], v[2]]) s.a.extend([v[3]]) - + self.assertEqual(s.a, [v[0], v[1], v[2], v[3]]) s.a.extend([]) @@ -2029,25 +2132,26 @@ class A(csp.Struct): self.assertEqual(s.a, [v[0], v[1], v[2], v[3], v[4], v[5]]) - # Check if not str or generic type (as str is a sequence of str) + # Check if not str or generic type (as str is a sequence of str) if v[-2] is not None: with self.assertRaises(TypeError) as e: s.a.extend(v[-2]) - - # Check if not generic type + + # Check if not generic type if v[-1] is not None: with self.assertRaises(TypeError) as e: s.a.extend([v[-1]]) - + def test_list_field_remove(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[0], v[2]] ) + + s = A(a=[v[0], v[1], v[0], v[2]]) s.a.remove(v[0]) - + self.assertEqual(s.a, [v[1], v[0], v[2]]) s.a.remove(v[2]) @@ -2058,32 +2162,34 @@ class A(csp.Struct): s.a.remove(v[3]) def test_list_field_clear(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in 
pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2], v[3]] ) + + s = A(a=[v[0], v[1], v[2], v[3]]) s.a.clear() - + self.assertEqual(s.a, []) - + def test_list_field_del(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1], v[2], v[3]] ) + + s = A(a=[v[0], v[1], v[2], v[3]]) del s.a[0] - + self.assertEqual(s.a, [v[1], v[2], v[3]]) del s.a[1] self.assertEqual(s.a, [v[1], v[3]]) - s = A( a = [v[0], v[1], v[2], v[3]] ) + s = A(a=[v[0], v[1], v[2], v[3]]) del s.a[1:3] self.assertEqual(s.a, [v[0], v[3]]) @@ -2094,16 +2200,17 @@ class A(csp.Struct): with self.assertRaises(IndexError) as e: del s.a[5] - + def test_list_field_inplace_concat(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1]] ) + + s = A(a=[v[0], v[1]]) s.a.__iadd__([v[2], v[3]]) - + self.assertEqual(s.a, [v[0], v[1], v[2], v[3]]) s.a += (v[4], v[5]) @@ -2117,22 +2224,23 @@ class A(csp.Struct): with self.assertRaises(TypeError) as e: s.a += v[-1] - # Check if not generic type + # Check if not generic type if v[-1] is not None: with self.assertRaises(TypeError) as e: s.a += [v[-1]] - + self.assertEqual(s.a, [v[0], v[1], v[2], v[3], v[4], v[5]]) - + def test_list_field_inplace_repeat(self): - ''' Was a BUG when the struct with list field was not recognizing changes made to this field in python''' + """Was a BUG when the struct with list field was not recognizing changes made to this field in python""" for typ, v in 
pystruct_list_test_values.items(): + class A(csp.Struct): a: [typ] - - s = A( a = [v[0], v[1]] ) + + s = A(a=[v[0], v[1]]) s.a.__imul__(1) - + self.assertEqual(s.a, [v[0], v[1]]) s.a *= 2 @@ -2143,10 +2251,10 @@ class A(csp.Struct): s.a *= [3] with self.assertRaises(TypeError) as e: - s.a *= 's' - + s.a *= "s" + s.a *= 0 - + self.assertEqual(s.a, []) s.a += [v[2], v[3]] @@ -2154,18 +2262,19 @@ class A(csp.Struct): self.assertEqual(s.a, [v[2], v[3]]) s.a *= -1 - + self.assertEqual(s.a, []) - + def test_list_field_lifetime(self): - '''Ensure that the lifetime of PyStructList field exceeds the lifetime of struct holding it''' + """Ensure that the lifetime of PyStructList field exceeds the lifetime of struct holding it""" + class A(csp.Struct): a: [int] - - s = A( a = [1, 2, 3] ) + + s = A(a=[1, 2, 3]) l = s.a del s - + self.assertEqual(l, [1, 2, 3]) diff --git a/csp/tests/test_caching.py b/csp/tests/test_caching.py deleted file mode 100644 index 04c86de13..000000000 --- a/csp/tests/test_caching.py +++ /dev/null @@ -1,2438 +0,0 @@ -# import collections -# import csp -# import csp.typing -# import glob -# import math -# import numpy -# import os -# import pandas -# import pytz -# import re -# import tempfile -# from typing import Dict -# import unittest -# from csp import Config, graph, node, ts -# from csp.adapters.parquet import ParquetOutputConfig -# from csp.cache_support import BaseCacheConfig, CacheCategoryConfig, CacheConfig, CacheConfigResolver, GraphCacheOptions, NoCachedDataException -# from csp.impl.managed_dataset.cache_user_custom_object_serializer import CacheObjectSerializer -# from csp.impl.managed_dataset.dataset_metadata import TimeAggregation -# from csp.impl.managed_dataset.managed_dataset_path_resolver import DatasetPartitionKey -# from csp.impl.types.instantiation_type_resolver import TSArgTypeMismatchError -# from csp.utils.object_factory_registry import Injected, register_injected_object, set_new_registry_thread_instance -# from datetime import 
date, datetime, timedelta -# from csp.tests.utils.typed_curve_generator import TypedCurveGenerator - - -# class _DummyStructWithTimestamp(csp.Struct): -# val: int -# timestamp: datetime - - -# class _GraphTempCacheFolderConfig: -# def __init__(self, allow_overwrite=False, merge_existing_files=True): -# self._temp_folder = None -# self._allow_overwrite = allow_overwrite -# self._merge_existing_files = merge_existing_files - -# def __enter__(self): -# assert self._temp_folder is None -# self._temp_folder = tempfile.TemporaryDirectory(prefix='csp_unit_tests') -# return Config(cache_config=CacheConfig(data_folder=self._temp_folder.name, allow_overwrite=self._allow_overwrite, -# merge_existing_files=self._merge_existing_files)) - -# def __exit__(self, exc_type, exc_val, exc_tb): -# if self._temp_folder: -# self._temp_folder.cleanup() -# self._temp_folder = None - - -# @csp.node -# def csp_sorted(x: ts[['T']]) -> ts[['T']]: -# if csp.ticked(x): -# return sorted(x) - - -# class TestCaching(unittest.TestCase): - -# EXPECTED_OUTPUT_TEST_SIMPLE = {'i': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 1), (datetime(2020, 3, 1, 22, 32, 2, 2002), 2), (datetime(2020, 3, 1, 23, 33, 3, 3003), 3), -# (datetime(2020, 3, 2, 0, 34, 4, 4004), 4), (datetime(2020, 3, 2, 1, 35, 5, 5005), 5), (datetime(2020, 3, 2, 2, 36, 6, 6006), 6)], -# 'd': [(datetime(2020, 3, 1, 21, 31, 1, 1001), date(2020, 1, 2)), (datetime(2020, 3, 1, 22, 32, 2, 2002), date(2020, 1, 3)), -# (datetime(2020, 3, 1, 23, 33, 3, 3003), date(2020, 1, 4)), (datetime(2020, 3, 2, 0, 34, 4, 4004), date(2020, 1, 5)), -# (datetime(2020, 3, 2, 1, 35, 5, 5005), date(2020, 1, 6)), (datetime(2020, 3, 2, 2, 36, 6, 6006), date(2020, 1, 7))], -# 'dt': [(datetime(2020, 3, 1, 21, 31, 1, 1001), datetime(2020, 1, 2, 0, 0, 0, 1)), -# (datetime(2020, 3, 1, 22, 32, 2, 2002), datetime(2020, 1, 3, 0, 0, 0, 2)), -# (datetime(2020, 3, 1, 23, 33, 3, 3003), datetime(2020, 1, 4, 0, 0, 0, 3)), -# (datetime(2020, 3, 2, 0, 34, 4, 4004), datetime(2020, 1, 5, 0, 
0, 0, 4)), -# (datetime(2020, 3, 2, 1, 35, 5, 5005), datetime(2020, 1, 6, 0, 0, 0, 5)), -# (datetime(2020, 3, 2, 2, 36, 6, 6006), datetime(2020, 1, 7, 0, 0, 0, 6))], -# 'f': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 0.2), (datetime(2020, 3, 1, 22, 32, 2, 2002), 0.4), -# (datetime(2020, 3, 1, 23, 33, 3, 3003), 0.6000000000000001), (datetime(2020, 3, 2, 0, 34, 4, 4004), 0.8), -# (datetime(2020, 3, 2, 1, 35, 5, 5005), 1.0), (datetime(2020, 3, 2, 2, 36, 6, 6006), 1.2000000000000002)], -# 's': [(datetime(2020, 3, 1, 21, 31, 1, 1001), '1'), (datetime(2020, 3, 1, 22, 32, 2, 2002), '2'), -# (datetime(2020, 3, 1, 23, 33, 3, 3003), '3'), (datetime(2020, 3, 2, 0, 34, 4, 4004), '4'), -# (datetime(2020, 3, 2, 1, 35, 5, 5005), '5'), (datetime(2020, 3, 2, 2, 36, 6, 6006), '6')], -# 'b': [(datetime(2020, 3, 1, 21, 31, 1, 1001), True), (datetime(2020, 3, 1, 22, 32, 2, 2002), False), -# (datetime(2020, 3, 1, 23, 33, 3, 3003), True), (datetime(2020, 3, 2, 0, 34, 4, 4004), False), -# (datetime(2020, 3, 2, 1, 35, 5, 5005), True), (datetime(2020, 3, 2, 2, 36, 6, 6006), False)], -# 'simple_leaf_node': [(datetime(2020, 3, 1, 20, 30), 1)], -# 'p1_i': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 33), (datetime(2020, 3, 1, 22, 32, 2, 2002), 34), -# (datetime(2020, 3, 1, 23, 33, 3, 3003), 35), (datetime(2020, 3, 2, 0, 34, 4, 4004), 36), -# (datetime(2020, 3, 2, 1, 35, 5, 5005), 37), (datetime(2020, 3, 2, 2, 36, 6, 6006), 38)], -# 'p1_d': [(datetime(2020, 3, 1, 21, 31, 1, 1001), date(2021, 1, 2)), (datetime(2020, 3, 1, 22, 32, 2, 2002), date(2021, 1, 3)), -# (datetime(2020, 3, 1, 23, 33, 3, 3003), date(2021, 1, 4)), (datetime(2020, 3, 2, 0, 34, 4, 4004), date(2021, 1, 5)), -# (datetime(2020, 3, 2, 1, 35, 5, 5005), date(2021, 1, 6)), (datetime(2020, 3, 2, 2, 36, 6, 6006), date(2021, 1, 7))], -# 'p1_dt': [(datetime(2020, 3, 1, 21, 31, 1, 1001), datetime(2020, 6, 7, 1, 2, 3, 5)), -# (datetime(2020, 3, 1, 22, 32, 2, 2002), datetime(2020, 6, 8, 1, 2, 3, 6)), -# (datetime(2020, 3, 1, 23, 33, 3, 3003), 
datetime(2020, 6, 9, 1, 2, 3, 7)),
-# (datetime(2020, 3, 2, 0, 34, 4, 4004), datetime(2020, 6, 10, 1, 2, 3, 8)),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), datetime(2020, 6, 11, 1, 2, 3, 9)),
-# (datetime(2020, 3, 2, 2, 36, 6, 6006), datetime(2020, 6, 12, 1, 2, 3, 10))],
-# 'p1_f': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 11.4), (datetime(2020, 3, 1, 22, 32, 2, 2002), 17.1),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 22.8), (datetime(2020, 3, 2, 0, 34, 4, 4004), 28.5),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 34.2), (datetime(2020, 3, 2, 2, 36, 6, 6006), 39.900000000000006)],
-# 'p1_s': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 'my_str1'), (datetime(2020, 3, 1, 22, 32, 2, 2002), 'my_str2'),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 'my_str3'), (datetime(2020, 3, 2, 0, 34, 4, 4004), 'my_str4'),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 'my_str5'), (datetime(2020, 3, 2, 2, 36, 6, 6006), 'my_str6')],
-# 'p1_b': [(datetime(2020, 3, 1, 21, 31, 1, 1001), False), (datetime(2020, 3, 1, 22, 32, 2, 2002), True),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), False), (datetime(2020, 3, 2, 0, 34, 4, 4004), True),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), False), (datetime(2020, 3, 2, 2, 36, 6, 6006), True)],
-# 'p1_simple_leaf_node': [(datetime(2020, 3, 1, 20, 30), 1)],
-# 'p2_i': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 33), (datetime(2020, 3, 1, 22, 32, 2, 2002), 34),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 35), (datetime(2020, 3, 2, 0, 34, 4, 4004), 36),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 37), (datetime(2020, 3, 2, 2, 36, 6, 6006), 38)],
-# 'p2_d': [(datetime(2020, 3, 1, 21, 31, 1, 1001), date(2021, 1, 3)), (datetime(2020, 3, 1, 22, 32, 2, 2002), date(2021, 1, 4)),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), date(2021, 1, 5)), (datetime(2020, 3, 2, 0, 34, 4, 4004), date(2021, 1, 6)),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), date(2021, 1, 7)), (datetime(2020, 3, 2, 2, 36, 6, 6006), date(2021, 1, 8))],
-# 'p2_dt': [(datetime(2020, 3, 1, 21, 31, 1, 1001), datetime(2020, 6, 7, 1, 2, 3, 6)),
-# (datetime(2020, 3, 1, 22, 32, 2, 2002), datetime(2020, 6, 8, 1, 2, 3, 7)),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), datetime(2020, 6, 9, 1, 2, 3, 8)),
-# (datetime(2020, 3, 2, 0, 34, 4, 4004), datetime(2020, 6, 10, 1, 2, 3, 9)),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), datetime(2020, 6, 11, 1, 2, 3, 10)),
-# (datetime(2020, 3, 2, 2, 36, 6, 6006), datetime(2020, 6, 12, 1, 2, 3, 11))],
-# 'p2_f': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 11.4), (datetime(2020, 3, 1, 22, 32, 2, 2002), 17.1),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 22.8), (datetime(2020, 3, 2, 0, 34, 4, 4004), 28.5),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 34.2), (datetime(2020, 3, 2, 2, 36, 6, 6006), 39.900000000000006)],
-# 'p2_s': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 'my_str1'), (datetime(2020, 3, 1, 22, 32, 2, 2002), 'my_str2'),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 'my_str3'), (datetime(2020, 3, 2, 0, 34, 4, 4004), 'my_str4'),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 'my_str5'), (datetime(2020, 3, 2, 2, 36, 6, 6006), 'my_str6')],
-# 'p2_b': [(datetime(2020, 3, 1, 21, 31, 1, 1001), True), (datetime(2020, 3, 1, 22, 32, 2, 2002), False),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), True), (datetime(2020, 3, 2, 0, 34, 4, 4004), False),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), True), (datetime(2020, 3, 2, 2, 36, 6, 6006), False)],
-# 'p2_simple_leaf_node': [(datetime(2020, 3, 1, 20, 30), 1)],
-# 'named1_i': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 2), (datetime(2020, 3, 1, 22, 32, 2, 2002), 3),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 4), (datetime(2020, 3, 2, 0, 34, 4, 4004), 5),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 6), (datetime(2020, 3, 2, 2, 36, 6, 6006), 7)],
-# 'named1_f': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 10), (datetime(2020, 3, 1, 22, 32, 2, 2002), 20),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 30), (datetime(2020, 3, 2, 0, 34, 4, 4004), 40),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 50), (datetime(2020, 3, 2, 2, 36, 6, 6006), 60)],
-# 'named2_i2': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 3), (datetime(2020, 3, 1, 22, 32, 2, 2002), 4),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 5), (datetime(2020, 3, 2, 0, 34, 4, 4004), 6),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 7), (datetime(2020, 3, 2, 2, 36, 6, 6006), 8)],
-# 'named2_f2': [(datetime(2020, 3, 1, 21, 31, 1, 1001), 20), (datetime(2020, 3, 1, 22, 32, 2, 2002), 40),
-# (datetime(2020, 3, 1, 23, 33, 3, 3003), 60), (datetime(2020, 3, 2, 0, 34, 4, 4004), 80),
-# (datetime(2020, 3, 2, 1, 35, 5, 5005), 100), (datetime(2020, 3, 2, 2, 36, 6, 6006), 120)],
-# 'i_sample': [(datetime(2020, 3, 1, 22, 30), 1), (datetime(2020, 3, 2, 0, 30), 3), (datetime(2020, 3, 2, 2, 30), 5)]}
-# EXPECTED_FILES = ['csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/dataset_meta.yml',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/dataset_meta.yml',
-# 'dummy_stats/sub_category/dataset1/data/2020/03/01/20200301_203000_000000-20200301_235959_999999.parquet',
-# 'dummy_stats/sub_category/dataset1/data/2020/03/02/20200302_000000_000000-20200302_030000_000000.parquet',
-# 'dummy_stats/sub_category/dataset1/dataset_meta.yml',
-# 'dummy_stats/sub_category/dataset2/data/2020/03/01/20200301_203000_000000-20200301_235959_999999.parquet',
-# 'dummy_stats/sub_category/dataset2/data/2020/03/02/20200302_000000_000000-20200302_030000_000000.parquet',
-# 'dummy_stats/sub_category/dataset2/dataset_meta.yml']
-# _SPLIT_COLUMNS_EXPECTED_FILES = ['csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/b.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/csp_timestamp.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/d.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/dt.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/f.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/i.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/s.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/simple_leaf_node.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/b.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/csp_timestamp.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/d.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/dt.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/f.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/i.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/s.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/simple_leaf_node.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_no_part/dataset_meta.yml',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999/b.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999/csp_timestamp.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999/d.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999/dt.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999/f.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999/i.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999/s.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/01/20200301_203000_000000-20200301_235959_999999/simple_leaf_node.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000/b.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000/csp_timestamp.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000/d.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000/dt.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000/f.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000/i.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000/s.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210101_000000_000000/20200606_010203_000004/5.7/my_str/True/2020/03/02/20200302_000000_000000-20200302_030000_000000/simple_leaf_node.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999/b.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999/csp_timestamp.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999/d.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999/dt.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999/f.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999/i.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999/s.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/01/20200301_203000_000000-20200301_235959_999999/simple_leaf_node.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000/b.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000/csp_timestamp.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000/d.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000/dt.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000/f.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000/i.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000/s.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/data/32/20210102_000000_000000/20200606_010203_000005/5.7/my_str/False/2020/03/02/20200302_000000_000000-20200302_030000_000000/simple_leaf_node.parquet',
-# 'csp_unnamed_cache/test_caching.make_sub_graph_partitioned/dataset_meta.yml',
-# 'dummy_stats/sub_category/dataset1/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/csp_timestamp.parquet',
-# 'dummy_stats/sub_category/dataset1/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/f.parquet',
-# 'dummy_stats/sub_category/dataset1/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/i.parquet',
-# 'dummy_stats/sub_category/dataset1/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/csp_timestamp.parquet',
-# 'dummy_stats/sub_category/dataset1/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/f.parquet',
-# 'dummy_stats/sub_category/dataset1/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/i.parquet', 'dummy_stats/sub_category/dataset1/dataset_meta.yml',
-# 'dummy_stats/sub_category/dataset2/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/csp_timestamp.parquet',
-# 'dummy_stats/sub_category/dataset2/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/f2.parquet',
-# 'dummy_stats/sub_category/dataset2/data/2020/03/01/20200301_203000_000000-20200301_235959_999999/i2.parquet',
-# 'dummy_stats/sub_category/dataset2/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/csp_timestamp.parquet',
-# 'dummy_stats/sub_category/dataset2/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/f2.parquet',
-# 'dummy_stats/sub_category/dataset2/data/2020/03/02/20200302_000000_000000-20200302_030000_000000/i2.parquet', 'dummy_stats/sub_category/dataset2/dataset_meta.yml']
-
-# class _EdgeOutputSettings(csp.Enum):
-# FIRST_CYCLE = 0x1
-# LAST_CYCLE = 0x2
-# BOTH_EDGES = FIRST_CYCLE | LAST_CYCLE
-
-# def _create_graph(self, split_columns_to_files):
-# func_run_count = [0]
-
-# def cache_options(**kwargs):
-# return GraphCacheOptions(split_columns_to_files=split_columns_to_files, **kwargs)
-
-# @node
-# def pass_through(v: ts['T']) -> ts['T']:
-# with csp.start():
-# func_run_count[0] += 1
-
-# if csp.ticked(v):
-# return v
-
-# def make_curve_pass_through(f):
-# values = [(timedelta(hours=v, minutes=v, seconds=v, milliseconds=v, microseconds=v), f(v)) for v in range(1, 20)]
-# typ = type(values[0][1])
-# return pass_through(csp.curve(typ, values))
-
-# simple_leaf_node = [None]
-
-# @graph(cache=True, cache_options=cache_options())
-# def make_sub_graph_no_part() -> csp.Outputs(i=ts[int], d=ts[date], dt=ts[datetime], f=ts[float], s=ts[str], b=ts[bool], simple_leaf_node=ts[int]):
-# return csp.output(i=make_curve_pass_through(lambda v: v),
-# d=make_curve_pass_through(lambda v: date(2020, 1, 1) + timedelta(days=v)),
-# dt=make_curve_pass_through(lambda v: datetime(2020, 1, 1) + timedelta(days=v, microseconds=v)),
-# f=make_curve_pass_through(lambda v: v * .2),
-# s=make_curve_pass_through(str),
-# b=make_curve_pass_through(lambda v: bool(v % 2)),
-# simple_leaf_node=simple_leaf_node[0])
-
-# @graph(cache=True, cache_options=cache_options())
-# def make_sub_graph_partitioned(i_v: int, d_v: date, dt_v: datetime, f_v: float, s_v: str, b_v: bool) -> csp.Outputs(i=ts[int], d=ts[date], dt=ts[datetime], f=ts[float], s=ts[str], b=ts[bool], simple_leaf_node=ts[int]):
-# no_part_sub_graph = make_sub_graph_no_part()
-
-# return csp.output(i=make_curve_pass_through(lambda v: i_v + v),
-# d=make_curve_pass_through(lambda v: d_v + timedelta(days=v)),
-# dt=make_curve_pass_through(lambda v: dt_v + timedelta(days=v, microseconds=v)),
-# f=make_curve_pass_through(lambda v: v * f_v + f_v),
-# s=make_curve_pass_through(lambda v: s_v + str(v)),
-# b=make_curve_pass_through(lambda v: bool(v % 2) ^ b_v),
-# simple_leaf_node=no_part_sub_graph.simple_leaf_node)
-
-# @graph(cache=True, cache_options=cache_options(dataset_name='dataset1', category=['dummy_stats', 'sub_category']))
-# def named_managed_graph_col_set_1() -> csp.Outputs(i=ts[int], f=ts[float]):
-# return csp.output(i=make_curve_pass_through(lambda v: v + 1),
-# f=make_curve_pass_through(lambda v: v * 10.0))
-
-# @graph(cache=True, cache_options=cache_options(dataset_name='dataset2', category=['dummy_stats', 'sub_category']))
-# def named_managed_graph_col_set_2() -> csp.Outputs(i2=ts[int], f2=ts[float]):
-# return csp.output(i2=make_curve_pass_through(lambda v: v + 2),
-# f2=make_curve_pass_through(lambda v: v * 20.0))
-
-# @graph
-# def my_graph(require_cached: bool = False):
-# self.maxDiff = 20000
-# simple_leaf_node[0] = pass_through(csp.const(1))
-# sub_graph = make_sub_graph_no_part()
-# sub_graph_partitioned = make_sub_graph_partitioned.cached if require_cached else make_sub_graph_partitioned
-# named_managed_graph_col_set_1_g = named_managed_graph_col_set_1.cached if require_cached else named_managed_graph_col_set_1
-# named_managed_graph_col_set_2_g = named_managed_graph_col_set_2.cached if require_cached else named_managed_graph_col_set_2
-# sub_graph_part_1 = sub_graph_partitioned(i_v=32, d_v=date(2021, 1, 1), dt_v=datetime(2020, 6, 6, 1, 2, 3, 4),
-# f_v=5.7, s_v='my_str', b_v=True)
-# sub_graph_part_2 = sub_graph_partitioned(i_v=32, d_v=date(2021, 1, 2), dt_v=datetime(2020, 6, 6, 1, 2, 3, 5),
-# f_v=5.7, s_v='my_str', b_v=False)
-# named_col_set_1 = named_managed_graph_col_set_1_g()
-# named_col_set_2 = named_managed_graph_col_set_2_g()
-# for k in sub_graph:
-# csp.add_graph_output(k, sub_graph[k])
-# for k in sub_graph_part_1:
-# csp.add_graph_output(f'p1_{k}', sub_graph_part_1[k])
-# for k in sub_graph_part_2:
-# csp.add_graph_output(f'p2_{k}', sub_graph_part_2[k])
-# for k in named_col_set_1:
-# csp.add_graph_output(f'named1_{k}', named_col_set_1[k])
-# for k in named_col_set_2:
-# csp.add_graph_output(f'named2_{k}', named_col_set_2[k])
-# csp.add_graph_output('i_sample', pass_through(csp.sample(csp.timer(timedelta(hours=2), 1), sub_graph.i)))
-
-# return func_run_count, my_graph
-
-# def test_simple_graph(self):
-# for split_columns_to_files in (True, False):
-# with csp.memoize(False):
-# func_run_count, my_graph = self._create_graph(split_columns_to_files=split_columns_to_files)
-
-# with _GraphTempCacheFolderConfig() as config:
-# g1 = csp.run(my_graph, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390),
-# config=config)
-# self.assertTrue(len(g1) > 0)
-# func_run_count1 = func_run_count[0]
-# # leaf node is that same in all, it's repeated 3 times
-# self.assertEqual(len(g1) - 2, func_run_count1)
-# self.assertEqual(g1, self.EXPECTED_OUTPUT_TEST_SIMPLE)
-# g2 = csp.run(my_graph, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390),
-# config=config)
-# self.assertEqual(g1, g2)
-# func_run_count2 = func_run_count[0]
-# # When the sub graph is read from cache, we only have one "pass_through" for i_sample
-# self.assertEqual(func_run_count1 + 1, func_run_count2)
-# g3 = csp.run(my_graph, require_cached=True, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390),
-# config=config)
-# func_run_count3 = func_run_count[0]
-# self.assertEqual(g1, g3)
-# self.assertEqual(func_run_count2 + 1, func_run_count3)
-# files_in_cache = self._get_files_in_cache(config)
-
-# if split_columns_to_files:
-# aux_files = []
-# for f in files_in_cache:
-# if f.endswith('.parquet'):
-# aux_files.append(os.path.dirname(f) + '.parquet')
-# else:
-# aux_files.append(f)
-# aux_files = sorted(set(aux_files))
-# self.assertEqual(aux_files, self.EXPECTED_FILES)
-# self.assertEqual(files_in_cache, self._SPLIT_COLUMNS_EXPECTED_FILES)
-# else:
-# self.assertEqual(files_in_cache, self.EXPECTED_FILES)
-
-# def _get_files_in_cache(self, config):
-# all_files_and_folders = sorted(glob.glob(f'{config.cache_config.data_folder}/**', recursive=True))
-# files_in_cache = [v.replace(f'{config.cache_config.data_folder}/', '') for v in all_files_and_folders if os.path.isfile(v)]
-# # When we right from command line, the tests import paths differ. So let's support it as well
-# files_in_cache = [f.replace('csp.tests.test_caching', 'test_caching') for f in files_in_cache]
-# files_in_cache = [f.replace('/csp.tests.', '/') for f in files_in_cache]
-# return files_in_cache
-
-# def test_no_cache(self):
-# for split_columns_to_files in (True, False):
-# func_run_count, my_graph_func = self._create_graph(split_columns_to_files=split_columns_to_files)
-# g1 = csp.run(my_graph_func, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390))
-# func_run_count1 = func_run_count[0]
-# g2 = csp.run(my_graph_func, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390))
-# func_run_count2 = func_run_count[0]
-# self.assertEqual(g1, g2)
-# self.assertEqual(func_run_count1 * 2, func_run_count2)
-
-# with self.assertRaises(NoCachedDataException):
-# g3 = csp.run(my_graph_func, require_cached=True, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390))
-
-# def _get_all_files(self, config):
-# return sorted(glob.glob(f'{config.cache_config.data_folder}/**/*.parquet', recursive=True))
-
-# def _get_default_graph_caching_kwargs(self, split_columns_to_files):
-# if split_columns_to_files:
-# graph_kwargs = {'cache_options': GraphCacheOptions(split_columns_to_files=True)}
-# else:
-# graph_kwargs = {}
-# return graph_kwargs
-
-# def test_merge(self):
-# for merge_existing_files in (True, False):
-# for split_columns_to_files in (True, False):
-# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_files)
-
-# def _time_to_seconds(t):
-# return t.hour * 3600 + t.minute * 60 + t.second
-
-# @csp.node()
-# def my_node() -> csp.Outputs(hours=ts[int], minutes=ts[int], seconds=ts[int]):
-# with csp.alarms():
-# alarm = csp.alarm( int )
-# with csp.start():
-# csp.schedule_alarm(alarm, timedelta(), _time_to_seconds(csp.now()))
-# if csp.ticked(alarm):
-# csp.schedule_alarm(alarm, timedelta(seconds=60), alarm + 60)
-# return csp.output(hours=alarm // 3600, minutes=alarm // 60, seconds=alarm)
-
-# @csp.graph(cache=True, **graph_kwargs)
-# def sub_graph() -> csp.Outputs(hours=ts[int], minutes=ts[int], seconds=ts[int]):
-# node = my_node()
-# return csp.output(hours=node.hours, minutes=node.minutes, seconds=node.seconds)
-
-# def _validate_file_df(g, start_time, dt, g_start=None, g_end=None):
-# end_time = start_time + dt if isinstance(dt, timedelta) else dt
-# g_start = g_start if g_start else start_time
-# g_end = g_end if g_end else end_time
-# g_end = g_start + g_end if isinstance(g_end, timedelta) else g_end
-# df = sub_graph.cached_data(config.cache_config.data_folder)().get_data_df_for_period(start_time, dt)
-
-# self.assertTrue((df.seconds.diff()[1:] == 60).all())
-# self.assertTrue((df.minutes == df.seconds // 60).all())
-# self.assertTrue((df.hours == df.seconds // 3600).all())
-# self.assertTrue(df.iloc[-1]['csp_timestamp'] == end_time)
-# self.assertTrue(df.iloc[0]['csp_timestamp'] == start_time)
-# self.assertTrue(df.iloc[0]['seconds'] == _time_to_seconds(start_time))
-# self.assertEqual(g['seconds'][0][1], _time_to_seconds(g_start))
-# self.assertEqual(g['seconds'][-1][1], _time_to_seconds(g_end))
-
-# def graph():
-# res = sub_graph()
-# csp.add_graph_output('seconds', res.seconds)
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True, merge_existing_files=merge_existing_files) as config:
-# missing_range_handler = lambda start, end: True
-# start_time1 = datetime(2020, 3, 1, 9, 30, tzinfo=pytz.utc)
-# dt1 = timedelta(hours=0, minutes=60)
-# g = csp.run(graph, starttime=start_time1, endtime=dt1, config=config)
-# files = list(sub_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period().values())
-# self.assertEqual(len(files), 1)
-# _validate_file_df(g, start_time1, dt1)
-
-# start_time2 = start_time1 + timedelta(minutes=180)
-# dt2 = dt1
-# g = csp.run(graph, starttime=start_time2, endtime=dt2, config=config)
-# files = list(sub_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period(missing_range_handler=missing_range_handler).values())
-
-# if config.cache_config.merge_existing_files:
-# self.assertEqual(len(files), 2)
-# _validate_file_df(g, start_time2, dt2)
-# # Test repeated writing of the same file
-# g = csp.run(graph, starttime=start_time2, endtime=dt2, config=config)
-# _validate_file_df(g, start_time2, dt2)
-
-# start_time3 = start_time2 + dt2 - timedelta(minutes=5)
-# dt3 = timedelta(minutes=15)
-# g = csp.run(graph, starttime=start_time3, endtime=dt3, config=config)
-# files = list(sub_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period(missing_range_handler=missing_range_handler).values())
-
-# if config.cache_config.merge_existing_files:
-# self.assertEqual(len(files), 2)
-# _validate_file_df(g, start_time2, start_time3 + dt3, g_start=start_time3, g_end=dt3)
-
-# start_time4 = start_time2 - timedelta(minutes=5)
-# dt4 = timedelta(minutes=15)
-# g = csp.run(graph, starttime=start_time4, endtime=dt4, config=config)
-# files = list(sub_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period(missing_range_handler=missing_range_handler).values())
-# if config.cache_config.merge_existing_files:
-# self.assertEqual(len(files), 2)
-# _validate_file_df(g, start_time4, start_time3 + dt3, g_start=start_time4, g_end=dt4)
-
-# start_time5 = start_time1 + timedelta(minutes=40)
-# dt5 = timedelta(minutes=200)
-# g = csp.run(graph, starttime=start_time5, endtime=dt5, config=config)
-# files = list(sub_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period().values())
-# if config.cache_config.merge_existing_files:
-# self.assertEqual(len(files), 1)
-# _validate_file_df(g, start_time1, start_time3 + dt3, g_start=start_time5, g_end=dt5)
-
-# g = csp.run(graph, starttime=start_time1 + timedelta(minutes=10), endtime=dt1, config=config)
-# files = list(sub_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period().values())
-# if config.cache_config.merge_existing_files:
-# self.assertEqual(len(files), 1)
-# _validate_file_df(g, start_time1, start_time3 + dt3, g_start=start_time1 + timedelta(minutes=10), g_end=dt1)
-
-# start_time6 = start_time1 - timedelta(minutes=10)
-# dt6 = start_time3 + dt3 + timedelta(minutes=10)
-# g = csp.run(graph, starttime=start_time6, endtime=dt6, config=config)
-# files = list(sub_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period().values())
-# self.assertEqual(len(files), 1)
-# _validate_file_df(g, start_time6, dt6)
-
-# def test_folder_overrides(self):
-# for split_columns_to_files in (True, False):
-# start_time = datetime(2020, 3, 1, 20, 30)
-# end_time = start_time + timedelta(seconds=1)
-
-# @csp.graph(cache=True)
-# def g1() -> csp.Outputs(o=csp.ts[int]):
-# return csp.output(o=csp.null_ts(int))
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(category=['C1', 'C2', 'C3'], split_columns_to_files=split_columns_to_files))
-# def g2() -> csp.Outputs(o=csp.ts[int]):
-# return csp.output(o=csp.null_ts(int))
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(category=['C1', 'C2', 'C3_2'], split_columns_to_files=split_columns_to_files))
-# def g3() -> csp.Outputs(o=csp.ts[int]):
-# return csp.output(o=csp.null_ts(int))
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(category=['C1', 'C2'], split_columns_to_files=split_columns_to_files))
-# def g4() -> csp.Outputs(o=csp.ts[int]):
-# return csp.output(o=csp.null_ts(int))
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(category=['C1'], split_columns_to_files=split_columns_to_files))
-# def g5() -> csp.Outputs(o=csp.ts[int]):
-# return csp.output(o=csp.null_ts(int))
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(dataset_name='named_dataset1', split_columns_to_files=split_columns_to_files))
-# def g6() -> csp.Outputs(o=csp.ts[int]):
-# return csp.output(o=csp.null_ts(int))
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(dataset_name='named_dataset2', category=['C1', 'C2'], split_columns_to_files=split_columns_to_files))
-# def g7() -> csp.Outputs(o=csp.ts[int]):
-# return csp.output(o=csp.null_ts(int))
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(dataset_name='named_dataset3', category=['C1', 'C2'], split_columns_to_files=split_columns_to_files))
-# def g8() -> csp.Outputs(o=csp.ts[int]):
-# return csp.output(o=csp.null_ts(int))
-
-# @csp.graph
-# def g():
-# g1(), g2(), g3(), g4(), g5(), g6(), g7(), g8()
-
-# def _get_data_folders_for_config(config):
-# all_folders = sorted({os.path.dirname(v) for v in self._get_files_in_cache(config)})
-# return sorted({v[:v.index('/data')] for v in all_folders if '/data' in v})
-
-# with _GraphTempCacheFolderConfig() as config:
-# with _GraphTempCacheFolderConfig() as config2:
-# config_copy = Config.from_dict(config.to_dict())
-# root_folder = config_copy.cache_config.data_folder
-# config_copy.cache_config.data_folder = os.path.join(root_folder, "default_output_folder")
-# config_copy.cache_config.category_overrides = [
-# CacheCategoryConfig(category=['C1'],
-# data_folder=os.path.join(root_folder, 'C1_O')),
-# CacheCategoryConfig(category=['C1', 'C2'],
-# data_folder=os.path.join(root_folder, 'C1_C2_O')),
-# CacheCategoryConfig(category=['C1', 'C2', 'C3'],
-# data_folder=os.path.join(root_folder, 'C1_C2_C3_O'))
-# ]
-# config_copy.cache_config.graph_overrides = {g8: BaseCacheConfig(data_folder=config2.cache_config.data_folder)}
-# csp.run(g, starttime=start_time, endtime=end_time, config=config_copy)
-# data_folders = _get_data_folders_for_config(config)
-# data_folders2 = _get_data_folders_for_config(config2)
-# expected_dataset_folders = {
-# 'g1': 'default_output_folder/csp_unnamed_cache/test_caching.g1', 'g2': 'C1_C2_C3_O/C1/C2/C3/test_caching.g2',
-# 'g3': 'C1_C2_O/C1/C2/C3_2/test_caching.g3', 'g4': 'C1_C2_O/C1/C2/test_caching.g4', 'g5': 'C1_O/C1/test_caching.g5',
-# 'g6': 'default_output_folder/csp_unnamed_cache/named_dataset1', 'g7': 'C1_C2_O/C1/C2/named_dataset2'}
-# expected_dataset_folders2 = {'g8': 'C1/C2/named_dataset3'}
-# self.assertEqual(data_folders, sorted(expected_dataset_folders.values()))
-# self.assertEqual(data_folders2, sorted(expected_dataset_folders2.values()))
-
-# full_path = lambda v: os.path.join(root_folder, v)
-# get_data_files = lambda g, f: g.cached_data(full_path(f))().get_data_files_for_period(start_time, end_time)
-
-# self.assertEqual(1, len(get_data_files(g1, "default_output_folder")))
-# self.assertEqual(1, len(get_data_files(g2, "C1_C2_C3_O")))
-# self.assertEqual(1, len(get_data_files(g3, "C1_C2_O")))
-# self.assertEqual(1, len(get_data_files(g4, "C1_C2_O")))
-# self.assertEqual(1, len(get_data_files(g5, "C1_O")))
-# self.assertEqual(1, len(get_data_files(g6, "default_output_folder")))
-# self.assertEqual(1, len(get_data_files(g7, "C1_C2_O")))
-
-# data_path_resolver = CacheConfigResolver(config_copy.cache_config)
-# get_data_files = lambda g: g.cached_data(data_path_resolver)().get_data_files_for_period(start_time, end_time)
-# self.assertEqual(1, len(get_data_files(g1)))
-# self.assertEqual(1, len(get_data_files(g2)))
-# self.assertEqual(1, len(get_data_files(g3)))
-# self.assertEqual(1, len(get_data_files(g4)))
-# self.assertEqual(1, len(get_data_files(g5)))
-# self.assertEqual(1, len(get_data_files(g6)))
-# self.assertEqual(1, len(get_data_files(g7)))
-# self.assertEqual(1, len(get_data_files(g8)))
-
-# def test_caching_reads_only_needed_columns(self):
-# for split_columns_to_files in (True, False):
-# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_files)
-
-# class MyS(csp.Struct):
-# x: int
-# y: int
-
-# @graph(cache=True, **graph_kwargs)
-# def g(s: str) -> csp.Outputs(o=csp.ts[MyS]):
-# t = csp.engine_start_time()
-# o_ts = csp.curve(MyS, [(t + timedelta(seconds=v), MyS(x=v, y=v * 2)) for v in range(20)])
-# return csp.output(o=o_ts)
-
-# @graph
-# def g_x_reader(s: str) -> csp.Outputs(o=csp.ts[int]):
-# return csp.count(g('A').o.x)
-
-# @graph
-# def g_delayed_demux(s: str) -> csp.ts[int]:
-# demux = csp.DelayedDemultiplex(g('A').o.x, g('A').o.x)
-# return csp.count(demux.demultiplex(1))
-
-# with _GraphTempCacheFolderConfig() as config:
-# csp.run(g, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-# files = g.cached_data(config.cache_config.data_folder)('A').get_data_files_for_period(datetime(2020, 1, 1), datetime(2020, 1, 1) + timedelta(seconds=20))
-# self.assertEqual(len(files), 1)
-# file = next(iter(files.values()))
-# if split_columns_to_files:
-# # Let's fake the data file by removing the column y
-# file_to_remove = os.path.join(file, 'o.y.parquet')
-# self.assertTrue(os.path.exists(file_to_remove))
-# os.unlink(file_to_remove)
-# self.assertFalse(os.path.exists(file_to_remove))
-# with self.assertRaisesRegex(Exception, r'.*IOError.*Failed to open .*o\.y.*'):
-# csp.run(g, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-# else:
-# df = pandas.read_parquet(file)
-# # Let's fake the data file by removing the column y. We want to make sure that we don't attempt to read column y
-# df.drop(columns=['o.y']).to_parquet(file)
-# with self.assertRaisesRegex(RuntimeError, r'Missing column o\.y.*'):
-# csp.run(g, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-# # This should not raise since we don't try to read the y column
-# csp.run(g_x_reader, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-# csp.run(g_delayed_demux, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-
-# def test_enum_serialization(self):
-# for split_columns_to_files in (True, False):
-# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_files)
-
-# class MyEnum(csp.Enum):
-# X = csp.Enum.auto()
-# Y = csp.Enum.auto()
-# ZZZ = csp.Enum.auto()
-
-# raiseExc = [False]
-
-# @graph(cache=True, **graph_kwargs)
-# def g(s: str) -> csp.Outputs(o=csp.ts[MyEnum]):
-# if raiseExc[0]:
-# raise RuntimeError("Shouldn't get here")
-# o_ts = csp.curve(MyEnum, [(timedelta(seconds=1), MyEnum.X), (timedelta(seconds=1), MyEnum.Y), (timedelta(seconds=2), MyEnum.ZZZ), (timedelta(seconds=3), MyEnum.X)])
-# return csp.output(o=o_ts)
-
-# from csp.utils.qualified_name_utils import QualifiedNameUtils
-# QualifiedNameUtils.register_type(MyEnum)
-
-# with _GraphTempCacheFolderConfig() as config:
-# csp.run(g, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-# raiseExc[0] = True
-# cached_res = csp.run(g, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-# enum_values = [v[1] for v in cached_res['o']]
-# data_df = g.cached_data(config.cache_config.data_folder)('A').get_data_df_for_period()
-# self.assertEqual(data_df['o'].tolist(), ['X', 'Y', 'ZZZ', 'X'])
-# self.assertEqual(enum_values, [MyEnum.X, MyEnum.Y, MyEnum.ZZZ, MyEnum.X])
-
-# def test_enum_field_serialization(self):
-# for split_columns_to_files in (True, False):
-# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_files)
-# from csp.tests.impl.test_enum import MyEnum
-
-# class MyStruct(csp.Struct):
-# e: MyEnum
-
-# raiseExc = [False]
-
-# @graph(cache=True, **graph_kwargs)
-# def g(s: str) -> csp.Outputs(o=csp.ts[MyStruct]):
-# if raiseExc[0]:
-# raise RuntimeError("Shouldn't get here")
-# make_s = lambda v: MyStruct(e=v) if v is not None else MyStruct()
-# o_ts = csp.curve(MyStruct, [(timedelta(seconds=1), make_s(MyEnum.A)), (timedelta(seconds=1), make_s(MyEnum.B)),
-# (timedelta(seconds=2), make_s(MyEnum.C)), (timedelta(seconds=3), make_s(MyEnum.A)),
-# (timedelta(seconds=4), make_s(None))])
-# return csp.output(o=o_ts)
-
-# with _GraphTempCacheFolderConfig() as config:
-# csp.run(g, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-# raiseExc[0] = True
-# cached_res = csp.run(g, 'A', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config)
-# enum_values = [v[1].e if hasattr(v[1], 'e') else None for v in cached_res['o']]
-# data_df = g.cached_data(config.cache_config.data_folder)('A').get_data_df_for_period()
-# self.assertEqual(data_df['o.e'].tolist(), ['A', 'B', 'C', 'A', None])
-# self.assertEqual(enum_values, [MyEnum.A, MyEnum.B, MyEnum.C, MyEnum.A, None])
-
-# def test_nested_struct_caching(self):
-# for split_columns_to_files in (True, False):
-# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_files)
-# from csp.tests.impl.test_enum import MyEnum
-# class MyStruct1(csp.Struct):
-# v_int: int
-# v_str: str
-# e: MyEnum
-
-# class MyStruct2(csp.Struct):
-# v: MyStruct1
-# v_float: float
-
-# class MyStruct3(MyStruct2):
-# v2: MyStruct2
-
-# from csp.utils.qualified_name_utils import QualifiedNameUtils
-# QualifiedNameUtils.register_type(MyStruct1)
-# QualifiedNameUtils.register_type(MyStruct2)
-
-# raiseExc = [False]
-
-# struct_values = [MyStruct3(),
-# MyStruct3(v=MyStruct1(v_int=1)),
-# MyStruct3(v=MyStruct1(v_int=2)),
-# 
MyStruct3(v=MyStruct1(v_int=3, v_str='3_val')), -# MyStruct3(v=MyStruct1(v_str='4_val')), -# MyStruct3(v=MyStruct1(v_str='5_val'), v2=MyStruct2(v_float=5.5, v=MyStruct1(v_int=6, v_str='6_val', e=MyEnum.B)), v_float=6.5), -# MyStruct3(v=MyStruct1()) -# ] - -# @graph(cache=True, **graph_kwargs) -# def g() -> csp.Outputs(o=csp.ts[MyStruct3]): -# if raiseExc[0]: -# raise RuntimeError("Shouldn't get here") -# o_ts = csp.curve(MyStruct3, [(timedelta(seconds=i), v) for i, v in enumerate(struct_values)]) -# return csp.output(o=o_ts) - -# @graph -# def g2(): -# csp.add_graph_output('o', g().o) -# csp.add_graph_output('o.v', g().o.v) -# csp.add_graph_output('o.v_float', g().o.v_float) - -# @graph -# def g3(): -# csp.add_graph_output('o.v_float', g().o.v_float) - -# with _GraphTempCacheFolderConfig() as config: -# csp.run(g, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config) -# raiseExc[0] = True -# cached_res = csp.run(g2, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config) -# cached_float = csp.run(g3, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config) -# cached_values = list(zip(*cached_res['o']))[1] -# cached_v_values = list(zip(*cached_res['o.v']))[1] -# expected_v_values = [getattr(v, 'v') for v in cached_values if hasattr(v, 'v')] -# self.assertEqual(len(struct_values), len(cached_values)) -# for v1, v2 in zip(struct_values, cached_values): -# self.assertEqual(v1, v2) -# self.assertEqual(len(cached_v_values), len(expected_v_values)) -# for v1, v2 in zip(cached_v_values, expected_v_values): -# self.assertEqual(v1, v2) -# self.assertEqual(cached_float['o.v_float'], cached_res['o.v_float']) - -# def test_caching_same_timestamp_with_missing_values(self): -# for split_columns_to_files in (True, False): -# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_files) - -# @csp.node -# def my_node() -> csp.Outputs(v1=csp.ts[int], v2=csp.ts[int], v3=csp.ts[int]): -# with 
csp.alarms(): -# a = csp.alarm( int ) -# with csp.start(): -# csp.schedule_alarm(a, timedelta(0), 0) -# csp.schedule_alarm(a, timedelta(0), 1) -# csp.schedule_alarm(a, timedelta(0), 2) -# csp.schedule_alarm(a, timedelta(0), 3) -# if csp.ticked(a): -# if a == 0: -# csp.output(v1=10 + a, v2=20 + a) -# elif a == 1: -# csp.output(v1=10 + a, v3=30 + a) -# else: -# csp.output(v1=10 + a, v2=20 + a, v3=30 + a) - -# @graph(cache=True, **graph_kwargs) -# def g() -> csp.Outputs(v1=csp.ts[int], v2=csp.ts[int], v3=csp.ts[int]): -# outs = my_node() -# return csp.output(v1=outs.v1, v2=outs.v2, v3=outs.v3) - -# @graph -# def main(): -# csp.add_graph_output('l', csp_sorted(csp.collect([g().v1, g().v2, g().v3]))) - -# with _GraphTempCacheFolderConfig() as config: -# out1 = csp.run(main, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config) -# out2 = csp.run(main, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config) -# self.assertEqual(out1, out2) - -# def test_timestamp_with_nanos_caching(self): -# for split_columns_to_files in (True, False): -# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_files) -# timestamp_value = pandas.Timestamp('2020-01-01 00:00:00') + pandas.to_timedelta(123, 'ns') - -# @csp.node -# def my_node() -> csp.ts[datetime]: -# with csp.alarms(): -# a = csp.alarm( datetime ) -# with csp.start(): -# csp.schedule_alarm(a, timedelta(seconds=1), timestamp_value) -# if csp.ticked(a): -# return a - -# @graph(cache=True, **graph_kwargs) -# def g() -> csp.Outputs(t=csp.ts[datetime]): -# return csp.output(t=my_node()) - -# with _GraphTempCacheFolderConfig() as config: -# csp.run(g, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=20), config=config) - -# data_path_resolver = CacheConfigResolver(config.cache_config) - -# data_df = g.cached_data(data_path_resolver)().get_data_df_for_period() - -# self.assertEqual(timestamp_value.nanosecond, 123) -# self.assertEqual(len(data_df), 1) -# 
self.assertEqual(data_df.t.iloc[0].tz_localize(None), timestamp_value) - -# def test_unsupported_basket_caching(self): -# with self.assertRaisesRegex(NotImplementedError, "Caching of list basket outputs is unsupported"): -# @csp.graph(cache=True) -# def g_bad() -> csp.Outputs(list_basket=[csp.ts[str]]): -# raise RuntimeError() - -# with self.assertRaisesRegex(TypeError, "Cached output basket dict_basket must have shape provided using with_shape or with_shape_of"): -# @csp.graph(cache=True) -# def g_bad() -> csp.Outputs(dict_basket=csp.OutputBasket(Dict[str, csp.ts[str]])): -# raise RuntimeError() - -# with self.assertRaisesRegex(RuntimeError, "Cached graph with output basket must set split_columns_to_files to True"): -# @csp.graph(cache=True, cache_options=GraphCacheOptions(split_columns_to_files=False)) -# def g_bad() -> csp.Outputs(dict_basket=csp.OutputBasket(Dict[str, csp.ts[str]], shape=[1,2,3])): -# raise RuntimeError() -# # TODO: add shape validation check here - -# def test_simple_dict_basket_caching(self): -# def shape_func(l=None): -# if l is None: -# return ['x', 'y', 'z'] -# return l - -# @csp.node -# def my_node() -> csp.Outputs(scalar1=csp.ts[int], dict_basket= -# csp.OutputBasket(Dict[str, csp.ts[int]], shape=shape_func()), scalar=csp.ts[int]): -# with csp.alarms(): -# a_index = csp.alarm( int ) -# with csp.start(): -# csp.schedule_alarm(a_index, timedelta(), 0) -# if csp.ticked(a_index) and a_index < 10: -# if a_index == 1: -# csp.schedule_alarm(a_index, timedelta(), 2) -# else: -# csp.schedule_alarm(a_index, timedelta(seconds=1), a_index + 1) - -# if a_index == 0: -# csp.output(scalar1=1, dict_basket={'x': 1, 'y': 2, 'z': 3}, scalar=2) -# elif a_index == 1: -# csp.output(dict_basket={'x': 2, 'z': 3}, scalar=3) -# elif a_index == 2: -# csp.output(dict_basket={'x': 3, 'z': 34}) -# elif a_index == 3: -# csp.output(scalar1=5) -# elif a_index == 4: -# csp.output(dict_basket={'x': 45}) - -# @csp.graph(cache=True) -# def g_bad() -> 
csp.Outputs(scalar1=csp.ts[int], dict_basket=csp.OutputBasket(Dict[str, csp.ts[int]], shape=shape_func()), scalar=csp.ts[int]): -# # __outputs__(dict_basket={'T': csp.ts['K']}.with_shape(shape_func(['xx']))) -# # -# # return csp.output( dict_basket={'xx': csp.const(1)}) - -# return csp.output(scalar1=my_node().scalar1, dict_basket=my_node().dict_basket, scalar=my_node().scalar) - -# # @csp.node -# # def g_bad(): -# # __outputs__(scalar1=csp.ts[int], dict_basket={'T': csp.ts['K']}.with_shape(shape_func()), scalar=csp.ts[int]) -# # return csp.output(scalar1=5, dict_basket={'x': 1}, scalar=3) - -# @graph -# def run_graph(g: object): -# g_bad() - -# with _GraphTempCacheFolderConfig() as config: -# csp.run(run_graph, g_bad, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30), config=config) - -# def test_simple_basket_caching(self): -# for typ in (int, bool, float, str, datetime, date, TypedCurveGenerator.SimpleEnum, TypedCurveGenerator.SimpleStruct, TypedCurveGenerator.NestedStruct): -# @graph(cache=True) -# def cached_graph() -> csp.Outputs(v1=csp.OutputBasket(Dict[str, csp.ts[typ]], shape=['0', '', '2']), v2=csp.ts[int]): -# curve_generator = TypedCurveGenerator() - -# return csp.output(v1={'0': curve_generator.gen_transformed_curve(typ, 0, 100, 1, skip_indices=[5, 6, 7], duplicate_timestamp_indices=[8, 9]), -# '': curve_generator.gen_transformed_curve(typ, 13, 100, 1, skip_indices=[5, 7, 9]), -# '2': curve_generator.gen_transformed_curve(typ, 27, 100, 1, skip_indices=[5, 6]) -# }, -# v2=curve_generator.gen_int_curve(100, 10, 1, skip_indices=[2], duplicate_timestamp_indices=[7, 8])) - -# @graph -# def run_graph(force_cached: bool = False): -# g = cached_graph.cached if force_cached else cached_graph -# csp.add_graph_output('v1[0]', g().v1['0']) -# csp.add_graph_output('v1[1]', g().v1['']) -# csp.add_graph_output('v1[2]', g().v1['2']) -# csp.add_graph_output('v2', g().v2) - -# with _GraphTempCacheFolderConfig() as config: -# res = csp.run(run_graph, 
starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30), config=config) -# res2 = csp.run(run_graph, True, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30), config=config) -# self.assertEqual(res, res2) - -# def test_basket_caching_first_last_cycles(self): -# for basket_edge_settings in self._EdgeOutputSettings: -# for scalar_edge_settings in self._EdgeOutputSettings: -# @graph(cache=True) -# def cached_graph(basket_edge_settings: self._EdgeOutputSettings, scalar_edge_settings: self._EdgeOutputSettings) -> csp.Outputs( -# v1=csp.OutputBasket(Dict[str, csp.ts[int]], shape=['0', '1']), v2=csp.ts[int]): -# curve_generator = TypedCurveGenerator() -# output_scalar_on_initial_cycle = bool(scalar_edge_settings.value & self._EdgeOutputSettings.FIRST_CYCLE.value) -# output_basket_on_initial_cycle = bool(basket_edge_settings.value & self._EdgeOutputSettings.FIRST_CYCLE.value) -# basket_skip_indices = [] if bool(basket_edge_settings.value & self._EdgeOutputSettings.LAST_CYCLE.value) else [3] -# scalar_skip_indices = [] if bool(scalar_edge_settings.value & self._EdgeOutputSettings.LAST_CYCLE.value) else [3] - -# return csp.output(v1={'0': curve_generator.gen_int_curve(0, 3, 1, output_on_initial_cycle=output_basket_on_initial_cycle, skip_indices=basket_skip_indices), -# '1': curve_generator.gen_int_curve(13, 3, 1, output_on_initial_cycle=output_basket_on_initial_cycle, skip_indices=basket_skip_indices)}, -# v2=curve_generator.gen_int_curve(100, 3, 1, output_on_initial_cycle=output_scalar_on_initial_cycle, skip_indices=scalar_skip_indices)) - -# @graph -# def run_graph(basket_edge_settings: self._EdgeOutputSettings, scalar_edge_settings: self._EdgeOutputSettings, force_cached: bool = False): -# g = cached_graph.cached if force_cached else cached_graph -# g_res = g(basket_edge_settings, scalar_edge_settings) -# csp.add_graph_output('v1[0]', g_res.v1['0']) -# csp.add_graph_output('v1[1]', g_res.v1['1']) -# csp.add_graph_output('v2', g_res.v2) - -# with 
_GraphTempCacheFolderConfig() as config: -# res = csp.run(run_graph, basket_edge_settings, scalar_edge_settings, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30), config=config) -# res2 = csp.run(run_graph, basket_edge_settings, scalar_edge_settings, True, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30), config=config) -# self.assertEqual(res, res2) - -# def test_basket_multiday_read_write(self): -# # 5 hours - means we will have data each day and the last few days are empty. With 49 hours we will have some days in the middle having empty -# # data and we want to see that it's handled properly (this test actually found a hidden bug) -# for curve_hours in (5, 49): -# for typ in (int, bool, float, str, datetime, date, TypedCurveGenerator.SimpleEnum, TypedCurveGenerator.SimpleStruct, TypedCurveGenerator.NestedStruct): -# @graph(cache=True) -# def cached_graph() -> csp.Outputs(v1=csp.OutputBasket(Dict[str, csp.ts[typ]], shape=['0', '', '2']), v2=csp.ts[int]): -# curve_generator = TypedCurveGenerator(period=timedelta(hours=curve_hours)) - -# return csp.output(v1={'0': curve_generator.gen_transformed_curve(typ, 0, 10, 1, skip_indices=[5, 6, 7], duplicate_timestamp_indices=[8, 9]), -# '': curve_generator.gen_transformed_curve(typ, 13, 10, 1, skip_indices=[5, 7, 9]), -# '2': curve_generator.gen_transformed_curve(typ, 27, 10, 1, skip_indices=[5, 6]) -# }, -# v2=curve_generator.gen_int_curve(100, 10, 1, skip_indices=[2], duplicate_timestamp_indices=[7, 8])) - -# @graph -# def run_graph(force_cached: bool = False): -# g = cached_graph.cached if force_cached else cached_graph -# csp.add_graph_output('v1[0]', g().v1['0']) -# csp.add_graph_output('v1[1]', g().v1['']) -# csp.add_graph_output('v1[2]', g().v1['2']) -# csp.add_graph_output('v2', g().v2) - -# self.maxDiff = None -# with _GraphTempCacheFolderConfig() as config: -# res = csp.run(run_graph, starttime=datetime(2020, 1, 1), endtime=timedelta(days=5) - timedelta(microseconds=1), config=config) -# 
res2 = csp.run(run_graph, True, starttime=datetime(2020, 1, 1), endtime=timedelta(days=5) - timedelta(microseconds=1), config=config) -# self.assertEqual(res, res2) -# data_path_resolver = CacheConfigResolver(config.cache_config) -# # A sanity check that we can load the data with some empty dataframes on some days -# base_data_df = cached_graph.cached_data(data_path_resolver)().get_data_df_for_period() - -# def test_merge_baskets(self): -# def _simple_struct_to_dict(o): -# if o is None: -# return None -# return {c: getattr(o, c, None) for c in TypedCurveGenerator.SimpleStruct.metadata()} - -# for batch_size in (117, None): -# output_config = ParquetOutputConfig() if batch_size is None else ParquetOutputConfig(batch_size=batch_size) -# for typ in (int, bool, float, str, datetime, date, TypedCurveGenerator.SimpleEnum, TypedCurveGenerator.SimpleStruct, TypedCurveGenerator.NestedStruct): -# @graph(cache=True) -# def base_graph() -> csp.Outputs(v1=csp.OutputBasket(Dict[str, csp.ts[typ]], shape=['COL1', 'COL2', 'COL3']), v2=csp.ts[int]): -# curve_generator = TypedCurveGenerator(period=timedelta(seconds=7)) -# return csp.output(v1={'COL1': curve_generator.gen_transformed_curve(typ, 0, 2600, 1, skip_indices=[95, 96, 97], duplicate_timestamp_indices=[98, 99]), -# 'COL2': curve_generator.gen_transformed_curve(typ, 13, 2600, 1, skip_indices=[95, 97, 99, 1090]), -# 'COL3': curve_generator.gen_transformed_curve(typ, 27, 2600, 1, skip_indices=[95, 96]) -# }, -# v2=curve_generator.gen_int_curve(100, 2600, 1, skip_indices=[92], duplicate_timestamp_indices=[97, 98])) - -# @graph(cache=True, cache_options=GraphCacheOptions(parquet_output_config=output_config)) -# def cached_graph() -> csp.Outputs(v1=csp.OutputBasket(Dict[str, csp.ts[typ]], shape=['COL1', 'COL2', 'COL3']), v2=csp.ts[int]): -# return csp.output(v1=base_graph.cached().v1, -# v2=base_graph.cached().v2) - -# @graph -# def run_graph(force_cached: bool = False): -# g = cached_graph.cached if force_cached else cached_graph 
-# csp.add_graph_output('COL1', g().v1['COL1']) -# csp.add_graph_output('COL2', g().v1['COL2']) -# csp.add_graph_output('COL3', g().v1['COL3']) -# csp.add_graph_output('v2', g().v2) - -# # enough to check this just for one type -# merge_existing_files = typ is int -# with _GraphTempCacheFolderConfig(allow_overwrite=True, merge_existing_files=merge_existing_files) as config: -# base_data_outputs = csp.run(base_graph, starttime=datetime(2020, 3, 1, 9, 20, tzinfo=pytz.utc), -# endtime=datetime(2020, 3, 1, 14, 0, tzinfo=pytz.utc), -# config=config) - -# aux_dfs = [pandas.DataFrame(dict(zip(['csp_timestamp', k], zip(*v)))) for k, v in base_data_outputs.items()] -# for aux_df in aux_dfs: -# repeated_timestamp_mask = 1 - (aux_df['csp_timestamp'].shift(1) != aux_df['csp_timestamp']).astype(int) -# aux_df['cycle_count'] = repeated_timestamp_mask.cumsum() * repeated_timestamp_mask -# aux_df.set_index(['csp_timestamp', 'cycle_count'], inplace=True) - -# # this does not work as of pandas==1.4.0 -# # expected_base_df = pandas.concat(aux_dfs, axis=1) -# expected_base_df = aux_dfs[0] -# for df in aux_dfs[1:]: -# expected_base_df = expected_base_df.merge(df, left_index=True, right_index=True, how="outer") -# expected_base_df = expected_base_df.reset_index().drop(columns=['cycle_count']) - -# expected_base_df.columns = [['csp_timestamp', 'v1', 'v1', 'v1', 'v2'], ['', 'COL1', 'COL2', 'COL3', '']] -# expected_base_df = expected_base_df[['csp_timestamp', 'v2', 'v1']] -# expected_base_df['csp_timestamp'] = expected_base_df['csp_timestamp'].dt.tz_localize(pytz.utc) -# if typ is datetime: -# for c in ['COL1', 'COL2', 'COL3']: -# expected_base_df.loc[:, ('v1', c)] = expected_base_df.loc[:, ('v1', c)].dt.tz_localize(pytz.utc) -# if typ is TypedCurveGenerator.SimpleEnum: -# for c in ['COL1', 'COL2', 'COL3']: -# expected_base_df.loc[:, ('v1', c)] = expected_base_df.loc[:, ('v1', c)].apply(lambda v: v.name if isinstance(v, TypedCurveGenerator.SimpleEnum) else v) -# if typ is 
TypedCurveGenerator.SimpleStruct: -# for k in TypedCurveGenerator.SimpleStruct.metadata(): -# for c in ['COL1', 'COL2', 'COL3']: -# expected_base_df.loc[:, (f'v1.{k}', c)] = expected_base_df.loc[:, ('v1', c)].apply(lambda v: getattr(v, k, None) if v else v) -# expected_base_df.drop(columns=['v1'], inplace=True, level=0) -# if typ is TypedCurveGenerator.NestedStruct: -# for k in TypedCurveGenerator.NestedStruct.metadata(): -# for c in ['COL1', 'COL2', 'COL3']: -# if k == 'value2': -# expected_base_df.loc[:, (f'v1.{k}', c)] = expected_base_df.loc[:, ('v1', c)].apply(lambda v: _simple_struct_to_dict(getattr(v, k, None)) if v else v) -# else: -# expected_base_df.loc[:, (f'v1.{k}', c)] = expected_base_df.loc[:, ('v1', c)].apply(lambda v: getattr(v, k, None) if v else v) -# expected_base_df.drop(columns=['v1'], inplace=True, level=0) - -# data_path_resolver = CacheConfigResolver(config.cache_config) -# base_data_df = base_graph.cached_data(data_path_resolver)().get_data_df_for_period() -# self.assertTrue(base_data_df.fillna(-111111).eq(expected_base_df.fillna(-111111)).all().all()) -# missing_range_handler = lambda start, end: True -# start_time1 = datetime(2020, 3, 1, 9, 30, tzinfo=pytz.utc) -# dt1 = timedelta(hours=0, minutes=60) -# res1 = csp.run(run_graph, starttime=start_time1, endtime=dt1, config=config) -# files = list(cached_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period().values()) -# self.assertEqual(len(files), 1) -# res1_df = cached_graph.cached_data(data_path_resolver)().get_data_df_for_period(start_time1, dt1) -# self.assertTrue( -# expected_base_df[expected_base_df.csp_timestamp.between(start_time1, start_time1 + dt1)].reset_index(drop=True).fillna(-111111).eq(res1_df.fillna(-111111)).all().all()) -# start_time2 = start_time1 + timedelta(minutes=180) -# dt2 = dt1 -# res2 = csp.run(run_graph, starttime=start_time2, endtime=dt2, config=config) -# files = 
list(cached_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period(missing_range_handler=missing_range_handler).values()) -# self.assertEqual(len(files), 2) -# res2_df = cached_graph.cached_data(data_path_resolver)().get_data_df_for_period(start_time2, dt2) -# self.assertTrue( -# expected_base_df[expected_base_df.csp_timestamp.between(start_time2, start_time2 + dt2)].reset_index(drop=True).fillna(-111111).eq(res2_df.fillna(-111111)).all().all()) - -# # # Test repeated writing of the same file -# res2b = csp.run(run_graph, starttime=start_time2, endtime=dt2, config=config) -# self.assertEqual(res2b, res2) - -# start_time3 = start_time2 + dt2 - timedelta(minutes=5) -# dt3 = timedelta(minutes=15) -# res3 = csp.run(run_graph, starttime=start_time3, endtime=dt3, config=config) -# files = list(cached_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period(missing_range_handler=missing_range_handler).values()) -# if config.cache_config.merge_existing_files: -# self.assertEqual(len(files), 2) -# res3_df = cached_graph.cached_data(data_path_resolver)().get_data_df_for_period(start_time3, dt3) -# self.assertTrue( -# expected_base_df[expected_base_df.csp_timestamp.between(start_time3, start_time3 + dt3)].reset_index(drop=True).fillna(-111111).eq(res3_df.fillna(-111111)).all().all()) - -# start_time4 = start_time2 - timedelta(minutes=5) -# dt4 = timedelta(minutes=15) -# res4 = csp.run(run_graph, starttime=start_time4, endtime=dt4, config=config) -# files = list(cached_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period(missing_range_handler=missing_range_handler).values()) -# if config.cache_config.merge_existing_files: -# self.assertEqual(len(files), 2) -# res4_df = cached_graph.cached_data(data_path_resolver)().get_data_df_for_period(start_time4, dt4) -# self.assertTrue( -# expected_base_df[expected_base_df.csp_timestamp.between(start_time4, start_time4 + 
dt4)].reset_index(drop=True).fillna(-111111).eq(res4_df.fillna(-111111)).all().all()) - -# start_time5 = start_time1 + timedelta(minutes=40) -# dt5 = timedelta(minutes=200) - -# res5 = csp.run(run_graph, starttime=start_time5, endtime=dt5, config=config) -# files = list(cached_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period(missing_range_handler=missing_range_handler).values()) -# if config.cache_config.merge_existing_files: -# self.assertEqual(len(files), 1) -# res5_df = cached_graph.cached_data(data_path_resolver)().get_data_df_for_period(start_time5, dt5) -# self.assertTrue(expected_base_df[expected_base_df.csp_timestamp.between(start_time5, start_time5 + dt5)].reset_index(drop=True).equals(res5_df)) - -# start_time6 = start_time1 + timedelta(minutes=10) -# res6 = csp.run(run_graph, starttime=start_time6, endtime=dt1, config=config) -# files = list(cached_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period(missing_range_handler=missing_range_handler).values()) -# if config.cache_config.merge_existing_files: -# self.assertEqual(len(files), 1) -# res6_df = cached_graph.cached_data(data_path_resolver)().get_data_df_for_period(start_time6, dt1) -# self.assertTrue( -# expected_base_df[expected_base_df.csp_timestamp.between(start_time6, start_time6 + dt1)].reset_index(drop=True).fillna(-111111).eq(res6_df.fillna(-111111)).all().all()) -# start_time7 = start_time1 - timedelta(minutes=10) -# dt7 = start_time3 + dt3 + timedelta(minutes=10) -# res7 = csp.run(run_graph, starttime=start_time7, endtime=dt7, config=config) -# files = list(cached_graph.cached_data(config.cache_config.data_folder)().get_data_files_for_period().values()) -# self.assertEqual(len(files), 1) -# res7_df = cached_graph.cached_data(data_path_resolver)().get_data_df_for_period(start_time7, dt7) -# self.assertTrue(expected_base_df[expected_base_df.csp_timestamp.between(start_time7, 
dt7)].reset_index(drop=True).fillna(-111111).eq(res7_df.fillna(-111111)).all().all()) - -# def test_subtype_dict_caching(self): -# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config: -# for cache in (True, False): -# @graph(cache=cache) -# def main() -> csp.Outputs(o=csp.OutputBasket(Dict[str, csp.ts[TypedCurveGenerator.SimpleStruct]], shape=(['A', 'B']))): -# curve_generator = TypedCurveGenerator(period=timedelta(seconds=1)) -# return csp.output(o={ -# 'A': curve_generator.gen_transformed_curve(TypedCurveGenerator.SimpleSubStruct, 100, 10, 1), -# 'B': curve_generator.gen_transformed_curve(TypedCurveGenerator.SimpleSubStruct, 500, 10, 1), -# }) - -# start_time = datetime(2021, 1, 1) -# end_time = start_time + timedelta(seconds=11) -# if cache: -# with self.assertRaises(csp.impl.types.instantiation_type_resolver.ArgTypeMismatchError): -# csp.run(main, starttime=start_time, endtime=end_time, config=config) -# else: -# csp.run(main, starttime=start_time, endtime=end_time, config=config) - -# def test_subclass_caching(self): -# @csp.graph -# def main() -> csp.Outputs(o=csp.ts[TypedCurveGenerator.SimpleStruct]): -# return csp.output(o=csp.const(TypedCurveGenerator.SimpleSubStruct())) - -# @csp.graph(cache=True) -# def main_cached() -> csp.Outputs(o=csp.ts[TypedCurveGenerator.SimpleStruct]): -# return csp.output(o=csp.const(TypedCurveGenerator.SimpleSubStruct())) - -# start_time = datetime(2021, 1, 1) -# end_time = start_time + timedelta(seconds=11) -# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config: -# csp.run(main, starttime=start_time, endtime=end_time, config=config) -# # Cached graphs must return exact types -# with self.assertRaises(csp.impl.types.instantiation_type_resolver.TSArgTypeMismatchError): -# csp.run(main_cached, starttime=start_time, endtime=end_time, config=config) - -# def test_key_subset(self): -# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config: -# @graph(cache=True, 
cache_options=GraphCacheOptions(ignored_inputs={'tickers'})) -# def main(tickers: [str]) -> csp.Outputs(prices=csp.OutputBasket(Dict[str, csp.ts[float]], shape="tickers")): -# curve_generator = TypedCurveGenerator(period=timedelta(seconds=1)) -# return csp.output(prices={ -# 'AAPL': curve_generator.gen_transformed_curve(float, 100, 10, 1), -# 'IBM': curve_generator.gen_transformed_curve(float, 500, 10, 1), -# }) - -# start_time = datetime(2021, 1, 1) -# end_time = start_time + timedelta(seconds=11) -# res1 = csp.run(main, ['AAPL', 'IBM'], starttime=start_time, endtime=end_time, config=config) -# res2 = csp.run(main, ['AAPL'], starttime=start_time, endtime=end_time, config=config) -# res3 = csp.run(main, ['IBM'], starttime=start_time, endtime=end_time, config=config) -# self.assertEqual(len(res1), 2) -# self.assertEqual(len(res2), 1) -# self.assertEqual(len(res3), 1) -# self.assertEqual(res1['prices[AAPL]'], res2['prices[AAPL]']) -# self.assertEqual(res1['prices[IBM]'], res3['prices[IBM]']) - -# def test_simple_node_caching(self): -# throw_exc = [False] - -# @csp.node(cache=True) -# def main_node() -> csp.Outputs(x=csp.ts[int]): -# with csp.alarms(): -# a = csp.alarm( int ) -# with csp.start(): -# if throw_exc[0]: -# raise RuntimeError("Shouldn't get here, node should be cached") -# csp.schedule_alarm(a, timedelta(), 0) - -# if csp.ticked(a): -# csp.schedule_alarm(a, timedelta(seconds=1), a + 1) -# return csp.output(x=a) - -# start_time = datetime(2021, 1, 1) -# end_time = start_time + timedelta(seconds=11) -# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config: -# res1 = csp.run(main_node, starttime=start_time, endtime=end_time, config=config) -# throw_exc[0] = True -# res2 = csp.run(main_node, starttime=start_time, endtime=end_time, config=config) -# self.assertEqual(res1, res2) - -# def test_node_caching_with_args(self): -# throw_exc = [False] - -# @csp.node(cache=True, cache_options=GraphCacheOptions(ignored_inputs={'input_ts', 'input_basket'})) -# 
def main_node(input_ts: csp.ts[int], input_basket: {str: csp.ts[int]}, addition: int = Injected('addition_value')) -> csp.Outputs(
-# o1=csp.ts[int], o2=csp.OutputBasket(Dict[str, csp.ts[int]], shape_of='input_basket')):
-# with csp.alarms():
-# a = csp.alarm( int )
-# with csp.start():
-# if throw_exc[0]:
-# raise RuntimeError("Shouldn't get here, node should be cached")
-# csp.schedule_alarm(a, timedelta(), -42)
-# if csp.ticked(input_ts):
-# csp.output(o1=input_ts + addition)
-# for k, v in input_basket.tickeditems():
-# csp.output(o2={k: v + addition})
-
-# def main_graph():
-# curve_generator = TypedCurveGenerator(period=timedelta(seconds=1))
-# return main_node(curve_generator.gen_int_curve(0, 10, 1), {'1': curve_generator.gen_int_curve(10, 10, 1), '2': curve_generator.gen_int_curve(20, 10, 1)})
-
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(seconds=11)
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# with set_new_registry_thread_instance():
-# register_injected_object('addition_value', 42)
-# res1 = csp.run(main_graph, starttime=start_time, endtime=end_time, config=config)
-# throw_exc[0] = True
-# res2 = csp.run(main_graph, starttime=start_time, endtime=end_time, config=config)
-# self.assertEqual(res1, res2)
-
-# def test_caching_int_as_float(self):
-# @csp.graph(cache=True)
-# def main_cached() -> csp.Outputs(o=csp.ts[float]):
-# return csp.output(o=csp.const.using(T=int)(int(42)))
-
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(seconds=11)
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# res1 = csp.run(main_cached, starttime=start_time, endtime=end_time, config=config)
-# res2 = csp.run(main_cached, starttime=start_time, endtime=end_time, config=config)
-# cached_val = res2['o'][0][1]
-# self.assertIs(type(cached_val), float)
-# self.assertEqual(cached_val, 42.0)
-
-# def test_consecutive_files_merge(self):
-# for split_columns_to_files in (True, False):
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(split_columns_to_files=split_columns_to_files))
-# def main_cached() -> csp.Outputs(o=csp.ts[float]):
-# return csp.output(o=csp.const.using(T=int)(int(42)))
-
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(seconds=11)
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# csp.run(main_cached, starttime=start_time, endtime=end_time, config=config)
-# csp.run(main_cached, starttime=end_time + timedelta(microseconds=1), endtime=end_time + timedelta(seconds=1), config=config)
-# files = list(main_cached.cached_data(config.cache_config.data_folder)().get_data_files_for_period().items())
-# self.assertEqual(len(files), 1)
-# self.assertEqual(files[0][0], (start_time, start_time + timedelta(seconds=12)))
-
-# def test_aggregation(self):
-# ref_date = datetime(2021, 1, 1)
-# dfs = []
-# for aggregation_period in TimeAggregation:
-# for split_columns_to_files in (True, False):
-# @csp.node(cache=True, cache_options=GraphCacheOptions(split_columns_to_files=split_columns_to_files,
-# time_aggregation=aggregation_period))
-# def n1() -> csp.Outputs(c=csp.ts[int]):
-# with csp.alarms():
-# a_t = csp.alarm( date )
-# with csp.start():
-# first_out_time = ref_date + timedelta(days=math.ceil((csp.now() - ref_date).total_seconds() / 86400 / 5) * 5)
-# csp.schedule_alarm(a_t, first_out_time, ref_date.date())
-
-# if csp.ticked(a_t):
-# csp.schedule_alarm(a_t, timedelta(days=5), csp.now().date())
-# return csp.output(c=int((csp.now() - ref_date).total_seconds() / 86400))
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# for i in range(100):
-# csp.run(n1, starttime=ref_date + timedelta(days=7 * i), endtime=timedelta(days=8, microseconds=1), config=config)
-
-# all_parquet_files = glob.glob(os.path.join(config.cache_config.data_folder, '**', '*.parquet'), recursive=True)
-# files_for_period = n1.cached_data(config.cache_config.data_folder)().get_data_files_for_period()
-# dfs.append(n1.cached_data(config.cache_config.data_folder)().get_data_df_for_period())
-# self.assertTrue((dfs[-1]['c'].diff().iloc[1:] == 5).all())
-# num_parquet_files = len(all_parquet_files) // 2 if split_columns_to_files else len(all_parquet_files)
-
-# if aggregation_period == TimeAggregation.DAY:
-# self.assertEqual(len(files_for_period), 702)
-# self.assertEqual(num_parquet_files, 702)
-# elif aggregation_period == TimeAggregation.MONTH:
-# self.assertEqual(len(files_for_period), 24)
-# self.assertEqual(num_parquet_files, 24)
-# elif aggregation_period == TimeAggregation.QUARTER:
-# self.assertEqual(len(files_for_period), 8)
-# self.assertEqual(num_parquet_files, 8)
-# else:
-# self.assertEqual(len(files_for_period), 2)
-# self.assertEqual(num_parquet_files, 2)
-# for df1, df2 in zip(dfs[0:-1], dfs[1:]):
-# self.assertTrue((df1 == df2).all().all())
-
-# def test_struct_column_subset_read(self):
-# for split_columns_to_files in (True, False):
-# @graph(cache=True, cache_options=GraphCacheOptions(ignored_inputs={'t'}, split_columns_to_files=split_columns_to_files))
-# def g(t: 'T' = TypedCurveGenerator.SimpleSubStruct) -> csp.Outputs(o=csp.ts['T']):
-# curve_generator = TypedCurveGenerator(period=timedelta(seconds=1))
-# return csp.output(o=curve_generator.gen_transformed_curve(t, 0, 10, 1))
-
-# @graph
-# def g_single_col() -> csp.Outputs(value=csp.ts[float]):
-
-# return csp.output(value=g().o.value2)
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(seconds=11)
-# res1 = csp.run(g, starttime=start_time, endtime=end_time, config=config)
-# res2 = csp.run(g.cached, TypedCurveGenerator.SimpleSubStruct, starttime=start_time, endtime=end_time, config=config)
-# res3 = csp.run(g, TypedCurveGenerator.SimpleStruct, starttime=start_time, endtime=end_time, config=config)
-# res4 = csp.run(g.cached, TypedCurveGenerator.SimpleStruct, starttime=start_time, endtime=end_time, config=config)
-# # Since now we try to write with a different schema, this should raise
-# with self.assertRaisesRegex(RuntimeError, "Metadata mismatch .*"):
-# res5 = csp.run(g, TypedCurveGenerator.SimpleStruct, starttime=start_time, endtime=end_time + timedelta(seconds=1), config=config)
-
-# self.assertEqual(res1, res2)
-# self.assertEqual(res3, res4)
-# self.assertEqual(len(res1['o']), len(res3['o']))
-# self.assertNotEqual(res1, res3)
-# for (t1, v1), (t2, v2) in zip(res1['o'], res3['o']):
-# v1_aux = TypedCurveGenerator.SimpleStruct()
-# v1_aux.copy_from(v1)
-# self.assertEqual(t1, t2)
-# self.assertEqual(v1_aux, v2)
-# res5 = csp.run(g_single_col, starttime=start_time, endtime=end_time, config=config)
-# files = g.cached_data(config)().get_data_files_for_period()
-# self.assertEqual(len(files), 1)
-# file = next(iter(files.values()))
-# if split_columns_to_files:
-# os.unlink(os.path.join(file, 'o.value1.parquet'))
-# else:
-# import pandas
-# df = pandas.read_parquet(file)
-# df = df.drop(columns=['o.value1'])
-# df.to_parquet(file)
-
-# res6 = csp.run(g_single_col, starttime=start_time, endtime=end_time, config=config)
-# self.assertEqual(res5, res6)
-# # Since we removed some data when trying to read all again, we should fail
-# if split_columns_to_files:
-# with self.assertRaisesRegex(Exception, 'IOError.*'):
-# res7 = csp.run(g.cached, TypedCurveGenerator.SimpleSubStruct, starttime=start_time, endtime=end_time, config=config)
-# else:
-# with self.assertRaisesRegex(RuntimeError, '.*Missing column o.value1.*'):
-# res7 = csp.run(g.cached, TypedCurveGenerator.SimpleSubStruct, starttime=start_time, endtime=end_time, config=config)
-
-# def test_basket_struct_column_subset_read(self):
-# @graph(cache=True, cache_options=GraphCacheOptions(ignored_inputs={'t'}))
-# def g(t: 'T' = TypedCurveGenerator.SimpleSubStruct) -> csp.Outputs(o=csp.OutputBasket(Dict[str, csp.ts['T']], shape=['my_key'])) :
-# curve_generator = TypedCurveGenerator(period=timedelta(seconds=1))
-# return csp.output(o={'my_key': curve_generator.gen_transformed_curve(t, 0, 10, 1)})
-
-# @graph(cache=True, cache_options=GraphCacheOptions(ignored_inputs={'t'}))
-# def g_unnamed_out(t: 'T' = TypedCurveGenerator.SimpleSubStruct) -> csp.OutputBasket(Dict[str, csp.ts['T']], shape=['my_key']):
-# return g.cached(t).o
-
-# @graph
-# def g_single_col(unnamed: bool = False) -> csp.Outputs(value=csp.ts[float]):
-
-# if unnamed:
-# res = csp.get_basket_field(g_unnamed_out(), 'value2')
-# else:
-# res = csp.get_basket_field(g().o, 'value2')
-
-# return csp.output(value=res['my_key'])
-
-# def verify_all(x: csp.ts[bool]):
-# self.assertTrue(x is not None)
-
-# @graph
-# def g_verify_multiple_type():
-# verify_all(g_unnamed_out(TypedCurveGenerator.SimpleStruct)['my_key'].value2 == g_unnamed_out(TypedCurveGenerator.SimpleSubStruct)['my_key'].value2)
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(seconds=11)
-# res1 = csp.run(g, starttime=start_time, endtime=end_time, config=config)
-# res2 = csp.run(g.cached, TypedCurveGenerator.SimpleSubStruct, starttime=start_time, endtime=end_time, config=config)
-# res3 = csp.run(g, TypedCurveGenerator.SimpleStruct, starttime=start_time, endtime=end_time, config=config)
-# res4 = csp.run(g.cached, TypedCurveGenerator.SimpleStruct, starttime=start_time, endtime=end_time, config=config)
-# res_unnamed = csp.run(g_unnamed_out, starttime=start_time, endtime=end_time, config=config)
-# res_unnamed_cached = csp.run(g_unnamed_out.cached, starttime=start_time, endtime=end_time, config=config)
-# csp.run(g_verify_multiple_type, starttime=start_time, endtime=end_time, config=config)
-# self.assertEqual(res_unnamed, res_unnamed_cached)
-# self.assertEqual(res_unnamed['my_key'], res1['o[my_key]'])
-# # Since now we try to write with a different schema, this should raise
-# with self.assertRaisesRegex(RuntimeError, "Metadata mismatch .*"):
-# res5 = csp.run(g, TypedCurveGenerator.SimpleStruct, starttime=start_time, endtime=end_time + timedelta(seconds=1), config=config)
-
-# self.assertEqual(res1, res2)
-# self.assertEqual(res3, res4)
-# self.assertEqual(len(res1['o[my_key]']), len(res3['o[my_key]']))
-# self.assertNotEqual(res1, res3)
-# for (t1, v1), (t2, v2) in zip(res1['o[my_key]'], res3['o[my_key]']):
-# v1_aux = TypedCurveGenerator.SimpleStruct()
-# v1_aux.copy_from(v1)
-# self.assertEqual(t1, t2)
-# self.assertEqual(v1_aux, v2)
-# res5 = csp.run(g_single_col, False, starttime=start_time, endtime=end_time, config=config)
-# res5_unnamed = csp.run(g_single_col, True, starttime=start_time, endtime=end_time, config=config)
-# self.assertEqual(res5, res5_unnamed)
-# files = g.cached_data(config)().get_data_files_for_period()
-# self.assertEqual(len(files), 1)
-# file = next(iter(files.values()))
-# # TODO: uncomment
-# # os.unlink(os.path.join(file, 'o.value1.parquet'))
-
-# res6 = csp.run(g_single_col, starttime=start_time, endtime=end_time, config=config)
-# self.assertEqual(res5, res6)
-# # Since we removed some data when trying to read all again, we should fail
-# # TODO: uncomment
-# # with self.assertRaisesRegex(Exception, 'IOError.*'):
-# # res7 = csp.run(g.cached, TypedCurveGenerator.SimpleSubStruct, starttime=start_time, endtime=end_time, config=config)
-
-# def test_unnamed_output_caching(self):
-# for split_columns_to_files in (True, False):
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(split_columns_to_files=split_columns_to_files))
-# def g_scalar() -> csp.ts[int]:
-# gen = TypedCurveGenerator()
-# return gen.gen_int_curve(0, 10, 1)
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(split_columns_to_files=split_columns_to_files))
-# def g_struct() -> csp.ts[TypedCurveGenerator.SimpleStruct]:
-# gen = TypedCurveGenerator()
-# return gen.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, 0, 10, 1)
-
-# @csp.graph(cache=True)
-# def g_scalar_basket() -> csp.OutputBasket(Dict[str, csp.ts[int]] , shape=['k1', 'k2']):
-# gen = TypedCurveGenerator()
-# return {'k1': gen.gen_int_curve(0, 10, 1),
-# 'k2': gen.gen_int_curve(100, 10, 1)}
-
-# @csp.graph(cache=True)
-# def g_struct_basket() -> csp.OutputBasket(Dict[str, csp.ts[TypedCurveGenerator.SimpleStruct]], shape=['k1', 'k2']):
-# gen = TypedCurveGenerator()
-# return {'k1': gen.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, 0, 10, 1),
-# 'k2': gen.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, 100, 10, 1)}
-
-# def run_test_single_graph(g_func):
-# res1 = csp.run(g_func, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390),
-# config=config)
-# res2 = csp.run(g_func, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390),
-# config=config)
-# res3 = csp.run(g_func.cached, starttime=datetime(2020, 3, 1, 20, 30), endtime=timedelta(hours=0, minutes=390),
-# config=config)
-# self.assertEqual(res1, res2)
-# self.assertEqual(res2, res3)
-# return res1
-
-# run_test_single_graph(g_scalar)
-# run_test_single_graph(g_struct)
-# run_test_single_graph(g_scalar_basket)
-# run_test_single_graph(g_struct_basket)
-
-# res1_df = g_scalar.cached_data(config)().get_data_df_for_period()
-# res2_df = g_struct.cached_data(config)().get_data_df_for_period()
-# res3_df = g_scalar_basket.cached_data(config)().get_data_df_for_period()
-# res4_df = g_struct_basket.cached_data(config)().get_data_df_for_period()
-# self.assertEqual(list(res1_df.columns), ['csp_timestamp', 'csp_unnamed_output'])
-# self.assertEqual(list(res2_df.columns), ['csp_timestamp', 'value1', 'value2'])
-# self.assertEqual(list(res3_df.columns), ['csp_timestamp', 'k1', 'k2'])
-# self.assertEqual(list(res4_df.columns), [('csp_timestamp', ''), ('value1', 'k1'), ('value1', 'k2'), ('value2', 'k1'), ('value2', 'k2')])
-
-# for df in (res1_df, res2_df, res3_df, res4_df):
-# self.assertEqual(len(df), 11)
-
-# def test_basket_ids_retrieval(self):
-# for aggregation_period in (TimeAggregation.MONTH, TimeAggregation.DAY,):
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(ignored_inputs={'keys'}, time_aggregation=aggregation_period))
-# def g(keys: object) -> csp.OutputBasket(Dict[str, csp.ts[TypedCurveGenerator.SimpleStruct]], shape="keys"):
-# gen = TypedCurveGenerator()
-# return {keys[0]: gen.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, 0, 10, 1),
-# keys[1]: gen.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, 100, 10, 1)}
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(ignored_inputs={'keys'}))
-# def g_named_output(keys: object) -> csp.Outputs(out=csp.OutputBasket(Dict[str, csp.ts[TypedCurveGenerator.SimpleStruct]], shape="keys")):
-
-# gen = TypedCurveGenerator()
-# return csp.output(out={keys[0]: gen.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, 0, 10, 1),
-# keys[1]: gen.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, 100, 10, 1)})
-
-# csp.run(g, ['k1', 'k2'], starttime=datetime(2020, 3, 1), endtime=datetime(2020, 3, 1, 23, 59, 59, 999999),
-# config=config)
-# csp.run(g, ['k3', 'k4'], starttime=datetime(2020, 3, 2), endtime=datetime(2020, 3, 2, 23, 59, 59, 999999),
-# config=config)
-# csp.run(g_named_output, ['k1', 'k2'], starttime=datetime(2020, 3, 1), endtime=datetime(2020, 3, 1, 23, 59, 59, 999999),
-# config=config)
-# csp.run(g_named_output, ['k3', 'k4'], starttime=datetime(2020, 3, 2), endtime=datetime(2020, 3, 2, 23, 59, 59, 999999),
-# config=config)
-# self.assertEqual(g.cached_data(config)().get_all_basket_ids_in_range(), ['k1', 'k2', 'k3', 'k4'])
-# self.assertEqual(g.cached_data(config)().get_all_basket_ids_in_range(starttime=datetime(2020, 3, 1), endtime=datetime(2020, 3, 1, 23, 59, 59, 999999)),
-# ['k1', 'k2'])
-# self.assertEqual(g.cached_data(config)().get_all_basket_ids_in_range(starttime=datetime(2020, 3, 2), endtime=datetime(2020, 3, 2, 23, 59, 59, 999999)),
-# ['k3', 'k4'])
-# self.assertEqual(g_named_output.cached_data(config)().get_all_basket_ids_in_range('out'), ['k1', 'k2', 'k3', 'k4'])
-# self.assertEqual(g_named_output.cached_data(config)().get_all_basket_ids_in_range('out', starttime=datetime(2020, 3, 1), endtime=datetime(2020, 3, 1, 23, 59, 59, 999999)),
-# ['k1', 'k2'])
-# self.assertEqual(g_named_output.cached_data(config)().get_all_basket_ids_in_range('out', starttime=datetime(2020, 3, 2), endtime=datetime(2020, 3, 2, 23, 59, 59, 999999)),
-# ['k3', 'k4'])
-
-# def test_custom_time_fields(self):
-# from csp.impl.wiring.graph import NoCachedDataException
-# import numpy
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(data_timestamp_column_name='timestamp'))
-# def g1() -> csp.ts[_DummyStructWithTimestamp]:
-# s = csp.engine_start_time()
-# return csp.curve(_DummyStructWithTimestamp, [(s + timedelta(hours=1 + i),
-# _DummyStructWithTimestamp(val=i, timestamp=s + timedelta(hours=(2 * i) ** 2))) for i in range(10)])
-
-# @csp.graph(cache=True, cache_options=GraphCacheOptions(data_timestamp_column_name='timestamp'))
-# def g2() -> csp.Outputs(timestamp=csp.ts[datetime], values=csp.OutputBasket(Dict[str, csp.ts[int]], shape=['v1', 'v2'])):
-# s = csp.engine_start_time()
-# values = {}
-# values['v1'] = csp.curve(int, [(s + timedelta(hours=1 + i), i) for i in range(10)])
-# values['v2'] = csp.curve(int, [(s + timedelta(hours=1 + i), i * 100) for i in range(10)])
-# t = csp.curve(datetime, [(s + timedelta(hours=1 + i), s + timedelta(hours=(2 * i) ** 2)) for i in range(10)])
-# return csp.output(timestamp=t, values=values)
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# s = datetime(2021, 1, 1)
-# csp.run(g1, starttime=s, endtime=timedelta(hours=100), config=config)
-# csp.run(g2, starttime=s, endtime=timedelta(hours=100), config=config)
-# data_files = g1.cached_data(config)().get_data_files_for_period()
-# data_files2 = g2.cached_data(config)().get_data_files_for_period()
-# res = csp.run(g1.cached, starttime=datetime(2021, 1, 1), endtime=datetime(2021, 1, 14, 12, 0), config=config)
-# res2 = csp.run(g2.cached, starttime=datetime(2021, 1, 1), endtime=datetime(2021, 1, 14, 12, 0), config=config)
-# with self.assertRaises(NoCachedDataException):
-# csp.run(g1.cached, starttime=datetime(2021, 1, 1), endtime=datetime(2021, 1, 14, 12, 1), config=config)
-
-# self.assertEqual(list(data_files.keys()), list(data_files2.keys()))
-# self.assertEqual([(k, v.val) for k, v in res[0]], res2['values[v1]'])
-# all_file_time_ranges = list(data_files.keys())
-# expected_start_end = res[0][0][1].timestamp, res[0][-1][1].timestamp
-# actual_start_end = all_file_time_ranges[0][0], all_file_time_ranges[-1][1]
-# self.assertEqual(expected_start_end, actual_start_end)
-# data_df = g1.cached_data(config)().get_data_df_for_period()
-# data_df2 = g2.cached_data(config)().get_data_df_for_period()
-# self.assertTrue(all((data_df.timestamp.diff().dt.total_seconds() / 3600).values[1:].astype(int) == numpy.diff(((numpy.arange(0, 10) * 2) ** 2))))
-# self.assertTrue(all(data_df.val.values == (numpy.arange(0, 10))))
-# self.assertTrue((data_df['val'] == data_df2['values']['v1']).all())
-# self.assertTrue((data_df['val'] * 100 == data_df2['values']['v2']).all())
-
-# def test_cached_with_start_stop_times(self):
-# @csp.graph(cache=True)
-# def g() -> csp.ts[int]:
-# return csp.curve(int, [(datetime(2021, 1, 1), 1), (datetime(2021, 1, 2), 2), (datetime(2021, 1, 3), 3)])
-
-# @csp.graph
-# def g2(csp_cache_start: object = None) -> csp.ts[int]:
-# end = csp.engine_end_time() - timedelta(days=1, microseconds=-1)
-# if csp_cache_start:
-# cached_g = g.cached[csp_cache_start:end]
-# else:
-# cached_g = g.cached[:end]
-# return csp.delay(cached_g(), timedelta(days=1))
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# res1 = csp.run(g, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# res2 = csp.run(g2, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1) + timedelta(days=1), config=config)
-# res1_transformed = [(v1 + timedelta(days=1), v2) for (v1, v2) in res1[0]]
-# self.assertEqual(res1_transformed, res2[0])
-# with self.assertRaises(NoCachedDataException):
-# csp.run(g2, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1) + timedelta(days=1, microseconds=1), config=config)
-
-# res3 = csp.run(g2, datetime(2021, 1, 1, 1), starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1) + timedelta(days=1), config=config)
-# self.assertEqual(res3[0], res2[0][1:])
-
-# def test_cached_graph_not_instantiated(self):
-# raise_exception = [False]
-
-# @csp.graph(cache=True)
-# def g() -> csp.ts[int]:
-# assert not raise_exception[0]
-# return csp.curve(int, [(datetime(2021, 1, 1), 1), (datetime(2021, 1, 2), 2), (datetime(2021, 1, 3), 3)])
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# csp.run(g, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# raise_exception[0] = True
-# csp.run(g, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# with self.assertRaises(NoCachedDataException):
-# csp.run(g.cached, starttime=datetime(2021, 1, 1), endtime=timedelta(days=4, microseconds=-1), config=config)
-
-# def test_caching_with_struct_arguments(self):
-# @csp.graph(cache=True)
-# def g(value: TypedCurveGenerator.SimpleStruct) -> csp.ts[TypedCurveGenerator.SimpleStruct]:
-# return csp.curve(TypedCurveGenerator.SimpleStruct, [(datetime(2021, 1, 1), value)])
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# s = TypedCurveGenerator.SimpleStruct(value1=42)
-# res = csp.run(g, s, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# res2 = csp.run(g.cached, s, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# self.assertEqual(res, res2)
-
-# def test_caching_user_types(self):
-# class OrderedDictSerializer(CacheObjectSerializer):
-# def serialize_to_bytes(self, value):
-# import pickle
-# return pickle.dumps(value)
-
-# def deserialize_from_bytes(self, value):
-# import pickle
-# return pickle.loads(value)
-
-# @csp.graph(cache=True)
-# def g() -> csp.ts[collections.OrderedDict]:
-# return csp.curve(collections.OrderedDict, [(datetime(2021, 1, 1), collections.OrderedDict({1: 2, 3: 4}))])
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# # We don't know how to serialize ordereddict, this should raise
-# with self.assertRaises(TypeError):
-# res = csp.run(g, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-
-# config.cache_config.cache_serializers[collections.OrderedDict] = OrderedDictSerializer()
-# res = csp.run(g, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# res2 = csp.run(g.cached, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# self.assertEqual(res, res2)
-# res_df = g.cached_data(config)().get_data_df_for_period()
-# self.assertEqual(res_df['csp_unnamed_output'].iloc[0], res[0][0][1])
-
-# def test_special_character_partitioning(self):
-# # Since we're using glob to locate the files on disk, there was a bug that special characters in the partition values broke the partition data
-# # lookup. This test tests that it works now.
-
-# for split_columns_to_files in (True, False):
-# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_files)
-
-# @csp.graph(cache=True, **graph_kwargs)
-# def g(x1: str, x2: str) -> csp.ts[str]:
-# return csp.curve(str, [(datetime(2021, 1, 1), x1), (datetime(2021, 1, 1) + timedelta(seconds=1), x2)])
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# x1 = "[][]"
-# x2 = "*x*)("
-# res = csp.run(g, x1, x2, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# res2 = csp.run(g.cached, x1, x2, starttime=datetime(2021, 1, 1), endtime=timedelta(days=3, microseconds=-1), config=config)
-# self.assertEqual(res, res2)
-# df = g.cached_data(config)(x1, x2).get_data_df_for_period()
-# self.assertEqual(df['csp_unnamed_output'].tolist(), [x1, x2])
-
-# def test_cutoff_bug(self):
-# """Test for bug that was there of +-1 micro second offset, that caused some stitch data to be missing
-# :return:
-# """
-# for split_columns_to_files in (True, False):
-# if split_columns_to_files:
-# cache_options = GraphCacheOptions(split_columns_to_files=True)
-# else:
-# cache_options = GraphCacheOptions(split_columns_to_files=True)
-# cache_options.time_aggregation = TimeAggregation.MONTH
-
-# @csp.graph(cache=True, cache_options=cache_options)
-# def g() -> csp.ts[int]:
-# l = [(datetime(2021, 1, 1), 1), (datetime(2021, 1, 1, 23, 59, 59, 999999), 2),
-# (datetime(2021, 1, 2), 3), (datetime(2021, 1, 2, 23, 59, 59, 999999), 4),
-# (datetime(2021, 1, 3), 5), (datetime(2021, 1, 3, 23, 59, 59, 999999), 6)]
-# l = [v for v in l if csp.engine_start_time() <= v[0] <= csp.engine_end_time()]
-# return csp.curve(int, l)
-
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# csp.run(g, starttime=datetime(2021, 1, 1), endtime=datetime(2021, 1, 1, 23, 59, 59, 999999), config=config)
-# csp.run(g, starttime=datetime(2021, 1, 1, 12), endtime=datetime(2021, 1, 3, 23, 59, 59, 999999), config=config)
-# self.assertEqual(g.cached_data(config)().get_data_df_for_period()['csp_unnamed_output'].tolist(), [1, 2, 3, 4, 5, 6])
-
-# def test_scalar_flat_basket_loading(self):
-# @csp.graph(cache=True)
-# def simple_cached() -> csp.Outputs(i=csp.OutputBasket(Dict[str, csp.ts[int]], shape=['V1', 'V2']),
-# s=csp.OutputBasket(Dict[str, csp.ts[str]], shape=['V3', 'V4'])):
-
-# i_v1 = csp.curve(int, [(timedelta(hours=10), 1), (timedelta(hours=10), 1), (timedelta(hours=30), 1)])
-# i_v2 = csp.curve(int, [(timedelta(hours=10), 10), (timedelta(hours=20), 11)])
-# s_v3 = csp.curve(str, [(timedelta(hours=30), "val1")])
-# s_v4 = csp.curve(str, [(timedelta(hours=10), "val2"), (timedelta(hours=20), "val3")])
-# return csp.output(i={'V1': i_v1, 'V2': i_v2}, s={'V3': s_v3, 'V4': s_v4})
-
-# @csp.graph(cache=True)
-# def simple_cached_unnamed() -> csp.OutputBasket(Dict[str, csp.ts[int]], shape=['V1', 'V2']):
-
-# return csp.output(simple_cached().i)
-
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(hours=30)
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# csp.run(simple_cached, starttime=start_time, endtime=end_time, config=config)
-# csp.run(simple_cached_unnamed, starttime=start_time, endtime=end_time, config=config)
-# df_ref_full = simple_cached.cached_data(config)().get_data_df_for_period().stack(dropna=False).reset_index().drop(columns=['level_0']).rename(columns={'level_1': 'symbol'})
-# df_ref_full['csp_timestamp'] = df_ref_full['csp_timestamp'].ffill().dt.tz_localize(None)
-# df_ref_full = df_ref_full[df_ref_full.symbol.str.len() > 0].reset_index(drop=True)
-
-# for start_dt, end_dt in ((None, None),
-# (timedelta(hours=10), None),
-# (timedelta(hours=10), timedelta(hours=10, microseconds=1)),
-# (timedelta(hours=10), timedelta(hours=20)),
-# (timedelta(hours=10, microseconds=1), timedelta(hours=20)),
-# (timedelta(hours=10, microseconds=1), None),
-# (timedelta(hours=10, microseconds=1), timedelta(hours=10, microseconds=2)),
-# (timedelta(hours=10, microseconds=1), timedelta(hours=30))):
-
-# cur_start = start_time + start_dt if start_dt else None
-# cur_end = start_time + end_dt if end_dt else None
-# mask = df_ref_full.index >= 0
-# if cur_start:
-# mask &= df_ref_full.csp_timestamp >= cur_start
-# if cur_end:
-# mask &= df_ref_full.csp_timestamp <= cur_end
-# df_ref = df_ref_full[mask]
-
-# df_ref_i = df_ref[['csp_timestamp', 'symbol', 'i']][~df_ref.i.isna()].reset_index(drop=True)
-# df_ref_s = df_ref[['csp_timestamp', 'symbol', 's']][~df_ref.s.isna()].reset_index(drop=True)
-
-# i_df_flat = simple_cached.cached_data(config)().get_flat_basket_df_for_period(basket_field_name='i', symbol_column='symbol',
-# starttime=cur_start, endtime=cur_end)
-# s_df_flat = simple_cached.cached_data(config)().get_flat_basket_df_for_period(basket_field_name='s', symbol_column='symbol',
-# starttime=cur_start, endtime=cur_end)
-# unnamed_flat = simple_cached_unnamed.cached_data(config)().get_flat_basket_df_for_period(symbol_column='symbol',
-# starttime=cur_start, endtime=cur_end)
-# self.assertTrue((i_df_flat == df_ref_i).all().all())
-# self.assertTrue((s_df_flat == df_ref_s).all().all())
-
-# # We can't rename columns when None is returned so we have to add this check
-# if unnamed_flat is None:
-# self.assertTrue(len(df_ref_i) == 0)
-# else:
-# self.assertTrue((unnamed_flat.rename(columns={'csp_unnamed_output': 'i'}) == df_ref_i).all().all())
-
-# def test_struct_flat_basket_loading(self):
-# @csp.graph(cache=True)
-# def simple_cached() -> csp.Outputs(ret=csp.OutputBasket(Dict[str, csp.ts[TypedCurveGenerator.SimpleSubStruct]], shape=['V1', 'V2'])):
-
-# i_v1 = csp.curve(int, [(timedelta(hours=10), 1), (timedelta(hours=10), 1), (timedelta(hours=30), 1)])
-# i_v2 = csp.curve(int, [(timedelta(hours=10), 10), (timedelta(hours=20), 11)])
-# s_v3 = csp.curve(str, [(timedelta(hours=30), "val1")])
-# s_v4 = csp.curve(str, [(timedelta(hours=10), "val2"), (timedelta(hours=20), "val3")])
-# res = {}
-# res['V1'] = TypedCurveGenerator.SimpleSubStruct.fromts(value1=i_v1, value3=s_v3)
-# res['V2'] = TypedCurveGenerator.SimpleSubStruct.fromts(value1=i_v2, value3=s_v4)
-# return csp.output(ret=res)
-
-# @csp.graph(cache=True)
-# def simple_cached_unnamed() -> csp.OutputBasket(Dict[str, csp.ts[TypedCurveGenerator.SimpleSubStruct]], shape=['V1', 'V2']):
-# return csp.output(simple_cached().ret)
-
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(hours=30)
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# csp.run(simple_cached, starttime=start_time, endtime=end_time, config=config)
-# csp.run(simple_cached_unnamed, starttime=start_time, endtime=end_time, config=config)
-# df_ref_full = simple_cached.cached_data(config)().get_data_df_for_period().stack(dropna=False).reset_index().drop(columns=['level_0']).rename(columns={'level_1': 'symbol'})
-# df_ref_full['csp_timestamp'] = df_ref_full['csp_timestamp'].ffill().dt.tz_localize(None)
-# mask = (df_ref_full.symbol.str.len() > 0) & (~df_ref_full['ret.value1'].isna() | ~df_ref_full['ret.value2'].isna() | ~df_ref_full['ret.value2'].isna())
-# df_ref_full = df_ref_full[mask].reset_index(drop=True)
-# df_ref_full = df_ref_full[['csp_timestamp', 'symbol', 'ret.value1', 'ret.value2', 'ret.value3']]
-
-# for start_dt, end_dt in ((None, None),
-# (timedelta(hours=10), None),
-# (timedelta(hours=10), timedelta(hours=10, microseconds=1)),
-# (timedelta(hours=10), timedelta(hours=20)),
-# (timedelta(hours=10, microseconds=1), timedelta(hours=20)),
-# (timedelta(hours=10, microseconds=1), None),
-# (timedelta(hours=10, microseconds=1), timedelta(hours=10, microseconds=2)),
-# (timedelta(hours=10, microseconds=1), timedelta(hours=30))):
-
-# cur_start = start_time + start_dt if start_dt else None
-# cur_end = start_time + end_dt if end_dt else None
-# mask = df_ref_full.index >= 0
-# if cur_start:
-# mask &= df_ref_full.csp_timestamp >= cur_start
-# if cur_end:
-# mask &= df_ref_full.csp_timestamp <= cur_end
-# df_ref = df_ref_full[mask].fillna(-999).reset_index(drop=True)
-
-# df_flat = simple_cached.cached_data(config)().get_flat_basket_df_for_period(basket_field_name='ret', symbol_column='symbol',
-# starttime=cur_start, endtime=cur_end)
-
-# unnamed_flat = simple_cached_unnamed.cached_data(config)().get_flat_basket_df_for_period(symbol_column='symbol',
-# starttime=cur_start, endtime=cur_end)
-# if unnamed_flat is not None:
-# unnamed_flat_normalized = unnamed_flat.rename(columns=dict(zip(unnamed_flat.columns, df_flat.columns)))
-# if df_flat is None:
-# self.assertTrue(len(df_ref) == 0)
-# self.assertTrue(unnamed_flat is None)
-# else:
-# self.assertTrue((df_flat.fillna(-999) == df_ref.fillna(-999)).all().all())
-# self.assertTrue((df_flat.fillna(-999) == unnamed_flat_normalized.fillna(-999)).all().all())
-
-# for c in TypedCurveGenerator.SimpleSubStruct.metadata().keys():
-# df_flat_single_col = simple_cached.cached_data(config)().get_flat_basket_df_for_period(basket_field_name='ret', symbol_column='symbol', struct_fields=[c],
-# starttime=cur_start, endtime=cur_end)
-# if df_flat_single_col is None:
-# self.assertTrue(len(df_ref) == 0)
-# continue
-# df_flat_single_col_ref = df_ref[df_flat_single_col.columns]
-# self.assertTrue((df_flat_single_col.fillna(-999) == df_flat_single_col_ref.fillna(-999)).all().all())
-
-# def test_simple_time_shift(self):
-# @csp.graph(cache=True)
-# def simple_cached() -> csp.ts[int]:
-
-# return csp.curve(int, [(timedelta(hours=i), i) for i in range(72)])
-
-# @csp.graph
-# def cached_data_shifted(shift: timedelta) -> csp.ts[int]:
-# return simple_cached.cached.shifted(csp_timestamp_shift=shift)()
-
-# def to_df(res):
-# return pandas.DataFrame({'timestamp': res[0][0], 'value': res[0][1]})
-
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(hours=71)
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# csp.run(simple_cached, starttime=start_time, endtime=end_time, config=config)
-# with self.assertRaises(NoCachedDataException):
-# csp.run(cached_data_shifted, timedelta(minutes=1), starttime=start_time, endtime=end_time, config=config, output_numpy=True)
-
-# ref_df = to_df(csp.run(cached_data_shifted, timedelta(), starttime=start_time, endtime=end_time, config=config, output_numpy=True))
-# td12 = timedelta(hours=12)
-
-# shifted_df = ref_df.shift(12).iloc[12:, :].reset_index(drop=True)
-# shifted_df['timestamp'] += td12
-# res_df1 = to_df(csp.run(cached_data_shifted, td12, starttime=start_time + td12, endtime=end_time, config=config, output_numpy=True))
-# res_df2 = to_df(csp.run(cached_data_shifted, td12, starttime=start_time + td12, endtime=end_time + td12, config=config, output_numpy=True))
-# self.assertTrue((shifted_df == res_df1).all().all())
-# self.assertTrue((ref_df.value == res_df2.value).all())
-
-# def test_struct_basket_time_shift(self):
-# @csp.graph(cache=True)
-# def struct_cached() -> csp.OutputBasket(Dict[str, csp.ts[TypedCurveGenerator.SimpleStruct]], shape=['A', 'B']):
-
-# generator = TypedCurveGenerator(period=timedelta(hours=1))
-# return {
-# 'A': generator.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, start_value=0, num_cycles=71, increment=1, duplicate_timestamp_indices=[11], skip_indices=[3, 15]),
-# 'B': generator.gen_transformed_curve(TypedCurveGenerator.SimpleStruct, start_value=100, num_cycles=71, increment=1, duplicate_timestamp_indices=[10, 11, 12, 13, 14],
-# skip_indices=[3, 15])
-# }
-
-# @csp.node
-# def dict_builder(x: {str: csp.ts[TypedCurveGenerator.SimpleStruct]}) -> csp.ts[object]:
-# res = {'timestamp': csp.now()}
-# ticked_items = {k: v for k, v in x.tickeditems()}
-# for k in x.keys():
-# res[k] = ticked_items.get(k)
-# return res
-
-# @csp.graph
-# def cached_data_shifted(shift: timedelta) -> csp.ts[object]:
-# return dict_builder(struct_cached.cached.shifted(csp_timestamp_shift=shift)())
-
-# def to_df(res):
-# keys = list(res[0][1][0].keys())
-# values = [[v for k, v in d.items()] for d in res[0][1]]
-# return pandas.DataFrame(dict(zip(keys, zip(*values))))
-
-# start_time = datetime(2021, 1, 1)
-# end_time = start_time + timedelta(hours=71)
-# with _GraphTempCacheFolderConfig(allow_overwrite=True) as config:
-# csp.run(struct_cached, starttime=start_time, endtime=end_time, config=config)
-# with self.assertRaises(NoCachedDataException):
-# csp.run(cached_data_shifted, timedelta(minutes=1), starttime=start_time, endtime=end_time, config=config, output_numpy=True)
-
-# ref_df = to_df(csp.run(cached_data_shifted, timedelta(), starttime=start_time, endtime=end_time, config=config, output_numpy=True))
-# td12 = timedelta(hours=12)
-
-# ref_df1 = ref_df.copy()
-# ref_df1.timestamp += td12
-# ref_df1 = ref_df1[ref_df1.timestamp.between(start_time + td12, end_time)].reset_index(drop=True)
-# res_df1 = to_df(csp.run(cached_data_shifted, td12, starttime=start_time + td12, endtime=end_time, config=config, output_numpy=True))
-# self.assertTrue((res_df1.fillna(-1) == ref_df1.fillna(-1)).all().all())
-
-# ref_df2 = ref_df.copy()
-# ref_df2.timestamp += td12
-# ref_df2 = ref_df2[ref_df2.timestamp.between(start_time + td12, end_time + td12)].reset_index(drop=True)
-# res_df2 = to_df(csp.run(cached_data_shifted, td12, starttime=start_time + td12, endtime=end_time + td12, config=config, output_numpy=True))
-# self.assertTrue((ref_df2.fillna(-1) == res_df2.fillna(-1)).all().all())
-
-# ref_df2 = ref_df.copy()
-# ref_df2.timestamp -= td12
-# ref_df2 = ref_df2[ref_df2.timestamp.between(start_time - td12, end_time - td12)].reset_index(drop=True)
-# res_df2 = to_df(csp.run(cached_data_shifted, -td12, starttime=start_time - td12, endtime=end_time - td12, config=config, output_numpy=True))
-# self.assertTrue((ref_df2.fillna(-1) == res_df2.fillna(-1)).all().all())
-
-# def test_caching_separate_folder(self):
-# @csp.graph(cache=True)
-# def g(name: str) -> csp.ts[float]:
-# if name == 'a':
-# return csp.curve(float, [(timedelta(seconds=i), i) for i in range(10)])
-# else:
-# return csp.curve(float, [(timedelta(seconds=i), i * 2) for i in range(10)])
-
-# with _GraphTempCacheFolderConfig() as config:
-# with _GraphTempCacheFolderConfig() as config2:
-# res1 = csp.run(g, 'a', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30), config=config)
-# # Write to a different folder data for a different day and key
-# res2 = csp.run(g, 'a', starttime=datetime(2020, 1, 2), endtime=timedelta(seconds=30), config=config2)
-# res2b = csp.run(g, 'b', starttime=datetime(2020, 1, 2), endtime=timedelta(seconds=30), config=config2)
-
-# config3 = config.copy()
-# files1 = g.cached_data(config3)('a').get_data_files_for_period()
-# self.assertEqual(len(files1), 1)
-
-# csp.run(g.cached, 'a', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30), config=config)
-# with self.assertRaises(NoCachedDataException):
-# csp.run(g.cached, 'a', starttime=datetime(2020, 1, 2), endtime=timedelta(seconds=30), config=config)
-# with self.assertRaises(NoCachedDataException):
-# csp.run(g.cached, 'b', starttime=datetime(2020, 1, 2), endtime=timedelta(seconds=30), config=config)
-
-# config3.cache_config.read_folders = [config2.cache_config.data_folder]
-# files2 = g.cached_data(config3)('a').get_data_files_for_period(missing_range_handler=lambda *args, **kwargs: True)
-# self.assertEqual(len(files2), 2)
-
-# res3 = csp.run(g, 'a', starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30), config=config2)
-# res4 = csp.run(g, 'a', starttime=datetime(2020, 1, 2), endtime=timedelta(seconds=30), config=config2)
-# res4b = csp.run(g, 'b', starttime=datetime(2020, 1, 2), endtime=timedelta(seconds=30), config=config2)
-# self.assertEqual(res1, res3)
-# self.assertEqual(res2, res4)
-# self.assertEqual(res2b, res4b)
-
-# def test_cache_invalidation(self):
-# for split_columns_to_file in (True, False):
-# with _GraphTempCacheFolderConfig() as config:
-# graph_kwargs = self._get_default_graph_caching_kwargs(split_columns_to_file)
-
-# @csp.graph(cache=True, **graph_kwargs)
-# def my_graph(val: str) -> csp.ts[float]:
-# return csp.curve(float, [(timedelta(days=i), float(i)) for i in range(5)])
-
-# start1 = datetime(2021, 1, 1)
-# end1 = start1 + timedelta(days=5, microseconds=-1)
-
-# self.assertTrue(my_graph.cached_data(config) is None)
-# csp.run(my_graph, 'val1', starttime=start1, endtime=end1, config=config)
-# # We should be able to invalidate cache when no cached data exists yet
-# my_graph.cached_data(config)('val2').invalidate_cache()
-# csp.run(my_graph, 'val2', starttime=start1, endtime=end1, config=config)
-# cached_data1 = my_graph.cached_data(config)('val1').get_data_df_for_period()
-# cached_data2 = my_graph.cached_data(config)('val2').get_data_df_for_period()
-# self.assertTrue((cached_data1 == cached_data2).all().all())
-# self.assertEqual(len(cached_data1), 5)
-# my_graph.cached_data(config)('val2').invalidate_cache(start1 + timedelta(days=1), end1)
-# cached_data2_after_invalidation = my_graph.cached_data(config)('val2').get_data_df_for_period()
-# self.assertTrue((cached_data1.head(1) == cached_data2_after_invalidation).all().all())
-# with self.assertRaises(NoCachedDataException):
-# csp.run(my_graph.cached, 'val2', starttime=start1, endtime=end1, config=config)
-# # this should run fine, we still have data
-# csp.run(my_graph.cached, 'val2', starttime=start1, endtime=start1 + timedelta(days=1, microseconds=-1), config=config)
-# my_graph.cached_data(config)('val2').invalidate_cache()
-# # now we have no data
-# with self.assertRaises(NoCachedDataException):
-# csp.run(my_graph.cached, 'val2', starttime=start1, endtime=start1 + timedelta(days=1, microseconds=-1), config=config)
-# my_graph.cached_data(config)('val1').invalidate_cache()
-# self.assertTrue(my_graph.cached_data(config)('val1').get_data_df_for_period() is None)
-# # We should still be able to invalidate
-# 
my_graph.cached_data(config)('val1').invalidate_cache() -# # We should have no data in the data folder -# self.assertFalse(os.listdir(os.path.join(my_graph.cached_data(config)._dataset.data_paths.root_folder, 'data'))) - -# def test_controlled_cache(self): -# for default_cache_enabled in (True, False): -# with _GraphTempCacheFolderConfig() as config: -# @csp.graph(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def graph_unnamed_output1() -> csp.ts[float]: -# csp.set_cache_enable_ts(csp.curve(bool, [(timedelta(seconds=5), True), (timedelta(seconds=6.1), False), (timedelta(seconds=8), True)])) - -# return csp.output(csp.curve(float, [(timedelta(seconds=i), float(i)) for i in range(10)])) - -# @csp.graph(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def graph_unnamed_output2() -> csp.ts[float]: -# csp.set_cache_enable_ts(csp.curve(bool, [(timedelta(seconds=5), True), (timedelta(seconds=6.1), False), (timedelta(seconds=8), True)])) - -# return (csp.curve(float, [(timedelta(seconds=i), float(i)) for i in range(10)])) - -# @csp.graph(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def graph_single_named_output() -> csp.Outputs(res1=csp.ts[float]): -# csp.set_cache_enable_ts(csp.curve(bool, [(timedelta(seconds=5), True), (timedelta(seconds=6.1), False), (timedelta(seconds=8), True)])) - -# return csp.output(res1=csp.curve(float, [(timedelta(seconds=i), float(i)) for i in range(10)])) - -# @csp.graph(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def graph_multiple_outputs() -> csp.Outputs(res1=csp.ts[float], res2=csp.OutputBasket(Dict[str, csp.ts[float]], shape=['value'])): -# csp.set_cache_enable_ts(csp.curve(bool, [(timedelta(seconds=5), True), (timedelta(seconds=6.1), False), (timedelta(seconds=8), 
True)])) - -# res = csp.curve(float, [(timedelta(seconds=i), float(i)) for i in range(10)]) -# return csp.output(res1=res, res2={'value': res}) - -# @csp.graph(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def main_graph_cached_named_output() -> csp.Outputs(res1=csp.ts[float]): -# csp.set_cache_enable_ts(csp.curve(bool, [(timedelta(seconds=5), True), (timedelta(seconds=6.1), False), (timedelta(seconds=8), True)])) - -# return csp.output(csp.curve(float, [(timedelta(seconds=i), float(i)) for i in range(10)])) - -# @csp.node(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def node_unnamed_output() -> csp.ts[float]: -# with csp.alarms(): -# a_enable = csp.alarm( bool ) -# a_value = csp.alarm( float ) -# with csp.start(): -# csp.schedule_alarm(a_enable, timedelta(seconds=5), True) -# csp.schedule_alarm(a_enable, timedelta(seconds=6.1), False) -# csp.schedule_alarm(a_enable, timedelta(seconds=8), True) -# csp.schedule_alarm(a_value, timedelta(), 0) -# if csp.ticked(a_enable): -# csp.enable_cache(a_enable) -# if csp.ticked(a_value): -# if a_value < 9: -# csp.schedule_alarm(a_value, timedelta(seconds=1), a_value + 1) -# if a_value == 6: -# return a_value -# elif a_value == 8: -# return csp.output(a_value) -# raise NotImplementedError() -# else: -# csp.output(a_value) - -# @csp.node(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def node_single_named_output() -> csp.Outputs(res=csp.ts[float]): -# with csp.alarms(): -# a_enable = csp.alarm( bool ) -# a_value = csp.alarm( float ) -# with csp.start(): -# csp.schedule_alarm(a_enable, timedelta(seconds=5), True) -# csp.schedule_alarm(a_enable, timedelta(seconds=6.1), False) -# csp.schedule_alarm(a_enable, timedelta(seconds=8), True) -# csp.schedule_alarm(a_value, timedelta(), 0) -# if csp.ticked(a_enable): -# 
csp.enable_cache(a_enable) -# if csp.ticked(a_value): -# if a_value < 9: -# csp.schedule_alarm(a_value, timedelta(seconds=1), a_value + 1) -# if a_value == 6: -# return a_value -# elif a_value == 8: -# return csp.output(res=a_value) -# raise NotImplementedError() -# else: -# csp.output(res=a_value) - -# @csp.node(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def node_multiple_outputs() -> csp.Outputs(res1=csp.ts[float], res2=csp.OutputBasket(Dict[str, csp.ts[float]], shape=['value'])): -# with csp.alarms(): -# a_enable = csp.alarm( bool ) -# a_value = csp.alarm( float ) -# with csp.start(): -# csp.schedule_alarm(a_enable, timedelta(seconds=5), True) -# csp.schedule_alarm(a_enable, timedelta(seconds=6.1), False) -# csp.schedule_alarm(a_enable, timedelta(seconds=8), True) -# csp.schedule_alarm(a_value, timedelta(), 0) -# if csp.ticked(a_enable): -# csp.enable_cache(a_enable) -# if csp.ticked(a_value): -# if a_value < 9: -# csp.schedule_alarm(a_value, timedelta(seconds=1), a_value + 1) -# csp.output(res2={'value': a_value}) -# return csp.output(res1=a_value) -# raise NotImplementedError() - -# starttime = datetime(2020, 1, 1, 9, 29) -# endtime = starttime + timedelta(minutes=20) - -# results = [] -# for g in [graph_unnamed_output1, graph_unnamed_output2, graph_single_named_output, graph_multiple_outputs, node_unnamed_output, -# node_single_named_output, node_multiple_outputs]: -# csp.run(g, starttime=starttime, endtime=endtime, config=config) -# results.append(g.cached_data(config)().get_data_df_for_period(missing_range_handler=lambda *a, **ka: True)) -# for res in results: -# if default_cache_enabled: -# self.assertEqual(len(res), 9) -# else: -# self.assertEqual(len(res), 4) -# combined_df = pandas.concat(results[:1] + [res.drop(columns=['csp_timestamp']) for res in results], axis=1) -# self.assertEqual(list(combined_df.columns), -# ['csp_timestamp', 'csp_unnamed_output', 'csp_unnamed_output', 
'csp_unnamed_output', 'res1', -# ('res1', ''), ('res2', 'value'), 'csp_unnamed_output', 'res', ('res1', ''), ('res2', 'value')]) -# self.assertTrue((combined_df.iloc[:, 1:].diff(axis=1).iloc[:, 1:] == 0).all().all()) - -# def test_controlled_cache_never_set(self): -# """ -# Test that if we never output the controolled set control, we don't get any errors -# :return: -# """ -# for default_cache_enabled in (True, False): -# with _GraphTempCacheFolderConfig() as config: -# @csp.graph(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def graph_unnamed_output1() -> csp.ts[float]: -# return csp.output(csp.null_ts(float)) - -# @csp.node(cache=True, cache_options=GraphCacheOptions(controlled_cache=True, default_cache_enabled=default_cache_enabled)) -# def node_unnamed_output() -> csp.ts[float]: -# with csp.alarms(): -# a_enable = csp.alarm( bool ) -# a_value = csp.alarm( float ) -# if False: -# csp.output(a_value) - -# starttime = datetime(2020, 1, 1, 9, 29) -# endtime = starttime + timedelta(minutes=20) - -# csp.run(graph_unnamed_output1, starttime=starttime, endtime=endtime, config=config) - -# def test_controlled_cache_bug(self): -# """ -# There was a bug when we run across multiple aggregation periods that the cache enabled was not handled properly, this test was written to reproduce the bug -# and fix it. Here we have aggregation period of 1 month but we are running across 2 months. 
-# :return: -# """ -# @csp.graph( -# cache=True, -# cache_options=GraphCacheOptions( -# controlled_cache=True, -# time_aggregation=TimeAggregation.MONTH)) -# def cached_g() -> csp.ts[int]: - -# csp.set_cache_enable_ts(csp.curve(bool, [(datetime(2004, 7, 1), True), (datetime(2004, 8, 2), False)])) -# return csp.curve(int, [(datetime(2004, 6, 30), 1), -# (datetime(2004, 7, 1), 2), -# (datetime(2004, 7, 2), 3), -# (datetime(2004, 8, 1), 4), -# (datetime(2004, 8, 1, 1), 5), -# (datetime(2004, 8, 2), 6), -# ]) - -# @csp.graph( -# cache=True, -# cache_options=GraphCacheOptions( -# controlled_cache=True, -# time_aggregation=TimeAggregation.MONTH)) -# def cached_g_struct() -> csp.ts[_DummyStructWithTimestamp]: - -# csp.set_cache_enable_ts(csp.curve(bool, [(datetime(2004, 7, 1), True), (datetime(2004, 8, 2), False)])) -# return _DummyStructWithTimestamp.fromts(val=cached_g()) - -# @csp.graph( -# cache=True, -# cache_options=GraphCacheOptions( -# controlled_cache=True, -# time_aggregation=TimeAggregation.MONTH)) -# def cached_g_basket() -> csp.OutputBasket(Dict[str, csp.ts[int]], shape=['a', 'b']): - -# csp.set_cache_enable_ts(csp.curve(bool, [(datetime(2004, 7, 1), True), (datetime(2004, 8, 2), False)])) -# return {'a': csp.curve(int, [(datetime(2004, 6, 30), 1), -# (datetime(2004, 7, 1), 2), -# (datetime(2004, 7, 2), 3), -# (datetime(2004, 8, 1), 4), -# (datetime(2004, 8, 1, 1), 5), -# (datetime(2004, 8, 2), 6), -# ]), -# 'b': csp.curve(int, [(datetime(2004, 6, 30), 1), (datetime(2004, 8, 1, 1), 5), ]) -# } - -# @csp.graph( -# cache=True, -# cache_options=GraphCacheOptions( -# controlled_cache=True, -# time_aggregation=TimeAggregation.MONTH)) -# def cached_g_basket_struct() -> csp.OutputBasket(Dict[str, csp.ts[_DummyStructWithTimestamp]], shape=['a', 'b']): - -# csp.set_cache_enable_ts(csp.curve(bool, [(datetime(2004, 7, 1), True), (datetime(2004, 8, 2), False)])) -# aux = cached_g_basket() -# return {'a': _DummyStructWithTimestamp.fromts(val=aux['a']), -# 'b': 
_DummyStructWithTimestamp.fromts(val=aux['b'])} - -# def g(): -# cached_g() -# cached_g_struct() -# cached_g_basket() -# cached_g_basket_struct() - -# with _GraphTempCacheFolderConfig() as config: -# starttime = datetime(2004, 6, 30) -# endtime = datetime(2004, 8, 2, 23, 59, 59, 999999) -# csp.run(g, starttime=starttime, endtime=endtime, config=config) -# df = cached_g.cached_data(config)().get_data_df_for_period() -# self.assertEqual(df.csp_unnamed_output.tolist(), [2, 3, 4, 5]) -# struct_df = cached_g_struct.cached_data(config)().get_data_df_for_period() -# self.assertEqual(struct_df.val.tolist(), [2, 3, 4, 5]) -# basket_df = cached_g_basket.cached_data(config)().get_data_df_for_period() -# self.assertEqual(basket_df.a.tolist(), [2, 3, 4, 5]) -# self.assertEqual(basket_df.b.fillna(-1).tolist(), [-1, -1, -1, 5]) -# struct_basket_df = cached_g_basket_struct.cached_data(config)().get_data_df_for_period() -# self.assertEqual(struct_basket_df.val.a.tolist(), [2, 3, 4, 5]) -# self.assertEqual(struct_basket_df.val.b.fillna(-1).tolist(), [-1, -1, -1, 5]) - -# def test_numpy_1d_array_caching(self): -# for split_columns_to_files in (True, False): -# cache_args = self._get_default_graph_caching_kwargs(split_columns_to_files=split_columns_to_files) - -# for typ in (int, bool, float, str): -# a1 = numpy.array([1, 2, 3, 4, 0], dtype=typ) -# a2 = numpy.array([[1, 2], [3324, 4]], dtype=typ)[:, 0] -# self.assertTrue(a1.flags.c_contiguous) -# self.assertFalse(a2.flags.c_contiguous) - -# @csp.graph(cache=True, **cache_args) -# def g1() -> csp.ts[csp.typing.Numpy1DArray[typ]]: -# return csp.flatten([csp.const(a1), csp.const(a2)]) - -# @csp.graph(cache=True, cache_options=GraphCacheOptions(parquet_output_config=ParquetOutputConfig(batch_size=3), -# split_columns_to_files=split_columns_to_files)) -# def g2() -> csp.ts[csp.typing.Numpy1DArray[typ]]: -# return csp.flatten([csp.const(a1), csp.const(a2)]) - -# @csp.node(cache=True, **cache_args) -# def n1() -> 
csp.Outputs(arr1=csp.ts[csp.typing.Numpy1DArray[typ]], arr2=csp.ts[numpy.ndarray]): -# with csp.alarms(): -# a_values1 = csp.alarm( csp.typing.Numpy1DArray ) -# a_values2 = csp.alarm( numpy.ndarray ) - -# with csp.start(): -# csp.schedule_alarm(a_values1, timedelta(0), a1) -# csp.schedule_alarm(a_values1, timedelta(seconds=1), a2) -# csp.schedule_alarm(a_values2, timedelta(0), numpy.array([numpy.nan, 1])) -# csp.schedule_alarm(a_values2, timedelta(0), numpy.array([2, numpy.nan, 3])) - -# if csp.ticked(a_values1): -# csp.output(arr1=a_values1) -# if csp.ticked(a_values2): -# csp.output(arr2=a_values2) - -# def verify_equal_array(expected_list, result): -# res_list = [v for t, v in result[0]] -# self.assertEqual(len(expected_list), len(res_list)) -# for e, r in zip(expected_list, res_list): -# self.assertTrue((e == r).all()) - -# def verify_n1_result(expected_list, result): -# verify_equal_array(expected_list, {0: result['arr1']}) -# arr2_values = [v for _, v in result['arr2']] -# expected_arr2_values = numpy.array([numpy.array([numpy.nan, 1.]), numpy.array([2., numpy.nan, 3.])], dtype=object) -# self.assertEqual(len(arr2_values), len(expected_arr2_values)) -# for v1, v2 in zip(arr2_values, expected_arr2_values): -# self.assertTrue(((v1 == v2) | (numpy.isnan(v1) & (numpy.isnan(v1) == numpy.isnan(v1)))).all()) - -# with _GraphTempCacheFolderConfig() as config: -# starttime = datetime(2020, 1, 1, 9, 29) -# endtime = starttime + timedelta(minutes=20) -# expected_list = [a1, a2] - -# res = csp.run(g1, starttime=starttime, endtime=endtime, config=config) -# verify_equal_array(expected_list, res) -# res = csp.run(g1.cached, starttime=starttime, endtime=endtime, config=config) -# verify_equal_array(expected_list, res) -# res = csp.run(n1, starttime=starttime, endtime=endtime, config=config) -# verify_n1_result(expected_list, res) -# res = csp.run(n1.cached, starttime=starttime, endtime=endtime, config=config) -# verify_n1_result(expected_list, res) -# res = csp.run(g2, 
starttime=starttime, endtime=endtime, config=config) -# verify_equal_array(expected_list, res) -# res = csp.run(g2.cached, starttime=starttime, endtime=endtime, config=config) -# verify_equal_array(expected_list, res) - -# def test_numpy_wrong_type_errors(self): -# @csp.graph(cache=True) -# def g1() -> csp.ts[csp.typing.Numpy1DArray[int]]: -# return csp.const(numpy.zeros(1, dtype=float)) - -# @csp.graph(cache=True) -# def g2() -> csp.ts[csp.typing.Numpy1DArray[object]]: -# return csp.const(numpy.zeros(1, dtype=object)) - -# @csp.node(cache=True) -# def n1() -> csp.ts[csp.typing.Numpy1DArray[int]]: -# with csp.alarms(): -# a_out = csp.alarm( bool ) -# with csp.start(): -# csp.schedule_alarm(a_out, timedelta(), True) -# if csp.ticked(a_out): -# return numpy.zeros(1, dtype=float) - -# with _GraphTempCacheFolderConfig() as config: -# with self.assertRaisesRegex(TSArgTypeMismatchError, re.escape("In function g1: Expected ts[csp.typing.Numpy1DArray[int]] for return value, got ts[csp.typing.Numpy1DArray[float]]")): -# csp.run(g1, starttime=datetime(2020, 1, 1), endtime=timedelta(minutes=20), config=config) - -# with _GraphTempCacheFolderConfig() as config: -# with self.assertRaisesRegex(TypeError, re.escape("Unsupported array value type when writing to parquet:DIALECT_GENERIC")): -# csp.run(g2, starttime=datetime(2020, 1, 1), endtime=timedelta(minutes=20), config=config) - -# with _GraphTempCacheFolderConfig() as config: -# with self.assertRaisesRegex(TypeError, re.escape("Expected array of type dtype('int64') got dtype('float64')")): -# csp.run(n1, starttime=datetime(2020, 1, 1), endtime=timedelta(minutes=20), config=config) - -# def test_basket_array_caching(self): -# @csp.graph(cache=True) -# def g1() -> csp.OutputBasket(Dict[str, csp.ts[csp.typing.Numpy1DArray[int]]], shape=['a']): -# a = numpy.zeros(3, dtype=int) -# return { -# 'a': csp.const(a) -# } - -# with _GraphTempCacheFolderConfig() as config: -# starttime = datetime(2020, 1, 1, 9, 29) -# endtime = starttime + 
timedelta(minutes=20) - -# with self.assertRaisesRegex(NotImplementedError, re.escape('Writing of baskets with array values is not supported')): -# res = csp.run(g1, starttime=starttime, endtime=endtime, config=config) - -# def test_multi_dimensional_array_caching(self): -# a1 = numpy.array([1, 2, 3, 4, 0], dtype=float) -# a2 = numpy.array([[1, 2], [3, 4]], dtype=float) -# expected_df = pandas.DataFrame.from_dict({'csp_timestamp': [pytz.utc.localize(datetime(2020, 1, 1, 9, 29))] * 2, 'csp_unnamed_output': [a1, a2]}) -# for split_columns_to_files in (True, False): -# cache_args = self._get_default_graph_caching_kwargs(split_columns_to_files=split_columns_to_files) - -# @csp.graph(cache=True, **cache_args) -# def g1() -> csp.ts[csp.typing.NumpyNDArray[float]]: -# return csp.flatten([csp.const(a1), csp.const(a2)]) - -# @csp.graph(cache=True, **cache_args) -# def g2() -> csp.Outputs(res=csp.ts[csp.typing.NumpyNDArray[float]]): -# return csp.output(res=g1()) - -# with _GraphTempCacheFolderConfig() as config: -# starttime = datetime(2020, 1, 1, 9, 29) -# endtime = starttime + timedelta(minutes=20) -# csp.run(g2, starttime=starttime, endtime=endtime, config=config) -# res_cached = csp.run(g1.cached, starttime=starttime, endtime=endtime, config=config) -# df = g1.cached_data(config)().get_data_df_for_period() -# df2 = g2.cached_data(config)().get_data_df_for_period() -# self.assertEqual(len(df), len(expected_df)) -# self.assertTrue((df.csp_timestamp == expected_df.csp_timestamp).all()) -# self.assertTrue(all([(v1 == v2).all() for v1, v2 in zip(df['csp_unnamed_output'], expected_df['csp_unnamed_output'])])) -# cached_values = list(zip(*res_cached[0]))[1] -# self.assertTrue(all([(v1 == v2).all() for v1, v2 in zip(cached_values, expected_df['csp_unnamed_output'])])) -# # We need to check the named column as well. 
-# self.assertEqual(len(df2), len(expected_df)) -# self.assertTrue(all([(v1 == v2).all() for v1, v2 in zip(df2['res'], expected_df['csp_unnamed_output'])])) - -# def test_read_folder_data_load_as_df(self): -# @csp.graph(cache=True) -# def g1() -> csp.ts[float]: -# return csp.const(42.0) - -# with _GraphTempCacheFolderConfig() as config1: -# with _GraphTempCacheFolderConfig() as config2: -# starttime = datetime(2020, 1, 1, 9, 29) -# endtime = starttime + timedelta(minutes=20) -# csp.run(g1, starttime=starttime, endtime=endtime, config=config1) -# res1_cached = g1.cached_data(config1)().get_data_df_for_period() -# config2.cache_config.read_folders = [config1.cache_config.data_folder] -# res2_cached = g1.cached_data(config2)().get_data_df_for_period() -# self.assertTrue((res1_cached == res2_cached).all().all()) - -# def test_multiple_readers_different_shapes(self): -# @csp.graph(cache=True, cache_options=GraphCacheOptions(ignored_inputs={'shape', 'dummy'})) -# def g(shape: [str], dummy: object) -> csp.OutputBasket(Dict[str, csp.ts[str]], shape="shape"): -# res = {} -# for v in shape: -# res[v] = csp.const(f'{v}_value') -# return res - -# @csp.graph -# def read_g(): -# __outputs__(v1={str: csp.ts[str]}, v2={str: csp.ts[str]}) -# df = pandas.DataFrame({'dummy': [1]}) - -# v2_a = g.cached(['b', 'c', 'd'], df) -# v2_b = g.cached(['b', 'c', 'd'], df) -# assert id(v2_a) == id(v2_b) - -# return csp.output(v1=g.cached(['a', 'b'], df), v2=v2_a) - -# with _GraphTempCacheFolderConfig() as config: -# starttime = datetime(2020, 1, 1, 9, 29) -# endtime = starttime + timedelta(minutes=1) -# csp.run(g, ['a', 'b', 'c', 'd'], None, starttime=starttime, endtime=endtime, config=config) -# res = csp.run(read_g, starttime=starttime, endtime=endtime, config=config) -# self.assertEqual(sorted(res.keys()), ['v1[a]', 'v1[b]', 'v2[b]', 'v2[c]', 'v2[d]']) - -# def test_basket_partial_cache_load(self): -# @csp.graph(cache=True) -# def g1() -> csp.OutputBasket(Dict[str, 
csp.ts[TypedCurveGenerator.SimpleStruct]], shape=['A', 'B']): -# return {'A': csp.curve(TypedCurveGenerator.SimpleStruct, [(timedelta(seconds=0), TypedCurveGenerator.SimpleStruct(value1=0)), -# (timedelta(seconds=1), TypedCurveGenerator.SimpleStruct(value1=2)), -# ]), -# 'B': csp.curve(TypedCurveGenerator.SimpleStruct, [(timedelta(seconds=1), TypedCurveGenerator.SimpleStruct(value1=3)), -# (timedelta(seconds=2), TypedCurveGenerator.SimpleStruct(value1=4)), -# ]) -# } - -# @csp.graph(cache=True) -# def g2() -> csp.Outputs(my_named_output=csp.OutputBasket(Dict[str, csp.ts[TypedCurveGenerator.SimpleStruct]], shape=['A', 'B'])): -# return csp.output(my_named_output=g1()) - -# with _GraphTempCacheFolderConfig() as config: -# starttime = datetime(2020, 1, 1, 9, 29) -# endtime = starttime + timedelta(minutes=1) -# csp.run(g2, starttime=starttime, endtime=endtime, config=config) -# res1 = g1.cached_data(config)().get_data_df_for_period() -# res2 = g1.cached_data(config)().get_data_df_for_period(struct_basket_sub_columns={'': ['value1']}) -# with self.assertRaisesRegex(RuntimeError, re.escape("Specified sub columns for basket 'csp_unnamed_output' but it's not loaded from file") + '.*'): -# res3 = g2.cached_data(config)().get_data_df_for_period(struct_basket_sub_columns={'': ['value1']}) -# res3 = g2.cached_data(config)().get_data_df_for_period(struct_basket_sub_columns={'my_named_output': ['value1']}) - -# self.assertEqual(res1.columns.levels[0].to_list(), ['csp_timestamp', 'value1', 'value2']) -# self.assertEqual(res1.columns.levels[1].to_list(), ['', 'A', 'B']) - -# self.assertEqual(res2.columns.levels[0].to_list(), ['csp_timestamp', 'value1']) -# self.assertEqual(res2.columns.levels[1].to_list(), ['', 'A', 'B']) - -# self.assertEqual(res3.columns.levels[0].to_list(), ['csp_timestamp', 'my_named_output.value1']) -# self.assertEqual(res3.columns.levels[1].to_list(), ['', 'A', 'B']) - -# self.assertTrue((res1['value1'].fillna(-111111) == 
res2['value1'].fillna(-111111)).all().all()) -# self.assertTrue((res1['value1'].fillna(-111111) == res3['my_named_output.value1'].fillna(-111111)).all().all()) -# self.assertTrue((res1['csp_timestamp'] == res2['csp_timestamp']).all().all()) -# self.assertTrue((res1['csp_timestamp'] == res3['csp_timestamp']).all().all()) - -# def test_partition_retrieval(self): -# @csp.graph(cache=True) -# def g1(i: int, d: date, dt: datetime, td: timedelta, f_val: float, s: str, b: bool, struct: TypedCurveGenerator.SimpleStruct) -> csp.Outputs(v1=csp.ts[TypedCurveGenerator.SimpleStruct], v2=csp.ts[float]): -# return csp.output(v1=csp.const(struct), v2=csp.const(f_val)) - -# @csp.graph(cache=True) -# def g2() -> csp.ts[int]: -# return csp.const(42) - -# s1 = datetime(2020, 1, 1) -# e1 = s1 + timedelta(hours=70, microseconds=-1) -# with _GraphTempCacheFolderConfig() as config: -# # i: int, d: date, dt: datetime, td: timedelta, f: float, s: str, b: bool, struct: TypedCurveGenerator.SimpleStruct -# csp.run(g1, i=1, d=date(2013, 5, 8), dt=datetime(2025, 3, 6, 11, 20, 59, 999599), td=timedelta(seconds=5), f_val=5.3, s="test1", b=False, -# struct=TypedCurveGenerator.SimpleStruct(value1=53), -# starttime=s1, endtime=e1, config=config) -# csp.run(g1, i=52, d=date(2013, 5, 31), dt=datetime(2025, 3, 5), td=timedelta(days=100), f_val=7.8, s="test2", b=True, struct=TypedCurveGenerator.SimpleStruct(value1=-53), starttime=s1, -# endtime=e1, config=config) -# csp.run(g2, starttime=s1, endtime=e1, config=config) -# g1_keys = g1.cached_data(config).get_partition_keys() -# g2_keys = g2.cached_data(config).get_partition_keys() -# self.assertEqual([DatasetPartitionKey({'i': 1, -# 'd': date(2013, 5, 8), -# 'dt': datetime(2025, 3, 6, 11, 20, 59, 999599), -# 'td': timedelta(seconds=5), -# 'f_val': 5.3, -# 's': 'test1', -# 'b': False, -# 'struct': TypedCurveGenerator.SimpleStruct(value1=53)}), -# DatasetPartitionKey({'i': 52, -# 'd': date(2013, 5, 31), -# 'dt': datetime(2025, 3, 5), -# 'td': 
timedelta(days=100), -# 'f_val': 7.8, -# 's': 'test2', -# 'b': True, -# 'struct': TypedCurveGenerator.SimpleStruct(value1=-53)})], -# g1_keys) -# # self.assertEqual(len(g1_keys), 2) -# self.assertEqual([DatasetPartitionKey({})], g2_keys) -# df1 = g1.cached_data(config)(**g1_keys[0].kwargs).get_data_df_for_period() -# df2 = g2.cached_data(config)(**g2_keys[0].kwargs).get_data_df_for_period() -# self.assertTrue((df1.fillna(-42) == pandas.DataFrame({'csp_timestamp': [pytz.utc.localize(s1)], 'v1.value1': [53], 'v1.value2': [-42], 'v2': [5.3]})).all().all()) -# self.assertTrue((df2 == pandas.DataFrame({'csp_timestamp': [pytz.utc.localize(s1)], 'csp_unnamed_output': [42]})).all().all()) - - -# if __name__ == '__main__': -# unittest.main() diff --git a/csp/tests/test_engine.py b/csp/tests/test_engine.py index 028b5d6f9..09bb25ecb 100644 --- a/csp/tests/test_engine.py +++ b/csp/tests/test_engine.py @@ -31,18 +31,6 @@ def _dummy_node(): raise NotImplementedError() -@csp.graph(cache=True) -def _dummy_graph_cached() -> csp.ts[float]: - raise NotImplementedError() - return csp.const(1) - - -@csp.node(cache=True) -def _dummy_node_cached() -> csp.ts[float]: - raise NotImplementedError() - return 1 - - class TestEngine(unittest.TestCase): def test_simple(self): @csp.node @@ -1303,7 +1291,7 @@ def my_node(val: int) -> ts[int]: def dummy(v: ts[int]) -> ts[int]: return v - @csp.graph(cache=True) + @csp.graph def my_ranked_node(val: int, rank: int = 0) -> csp.Outputs(val=ts[int]): res = my_node(val) for i in range(rank): @@ -1833,12 +1821,10 @@ def test_graph_node_pickling(self): """Checks for a bug that we had when transitioning to python 3.8 - the graphs and nodes became unpicklable :return: """ - from csp.tests.test_engine import _dummy_graph, _dummy_graph_cached, _dummy_node, _dummy_node_cached + from csp.tests.test_engine import _dummy_graph, _dummy_node self.assertEqual(_dummy_graph, pickle.loads(pickle.dumps(_dummy_graph))) self.assertEqual(_dummy_node, 
pickle.loads(pickle.dumps(_dummy_node))) - self.assertEqual(_dummy_graph_cached, pickle.loads(pickle.dumps(_dummy_graph_cached))) - self.assertEqual(_dummy_node_cached, pickle.loads(pickle.dumps(_dummy_node_cached))) def test_memoized_object(self): @csp.csp_memoized @@ -2078,8 +2064,9 @@ def raise_interrupt(): csp.schedule_alarm(a, timedelta(seconds=1), True) if csp.ticked(a): import signal + os.kill(os.getpid(), signal.SIGINT) - + # Python nodes @csp.graph def g(l: list): @@ -2094,12 +2081,12 @@ def g(l: list): for element in stopped: self.assertTrue(element) - + # C++ nodes class RTI: def __init__(self): self.stopped = [False, False, False] - + @csp.node(cppimpl=_csptestlibimpl.set_stop_index) def n2(obj_: object, idx: int): return @@ -2114,7 +2101,7 @@ def g2(rti: RTI): rti = RTI() with self.assertRaises(KeyboardInterrupt): csp.run(g2, rti, starttime=datetime.utcnow(), endtime=timedelta(seconds=60), realtime=True) - + for element in rti.stopped: self.assertTrue(element) diff --git a/csp/tests/test_parsing.py b/csp/tests/test_parsing.py index 5fb1fdbaf..18ec0c6d5 100644 --- a/csp/tests/test_parsing.py +++ b/csp/tests/test_parsing.py @@ -1,6 +1,6 @@ import sys import unittest -from datetime import date, datetime, timedelta +from datetime import datetime, timedelta from typing import Callable, Dict, List import csp @@ -986,37 +986,6 @@ def test_list_inside_callable(self): def graph(v: Dict[str, Callable[[], str]]): pass - def test_graph_caching_parsing(self): - with self.assertRaisesRegex( - NotImplementedError, "Caching is unsupported for argument type typing.List\\[int\\] \\(argument x\\)" - ): - - @csp.graph(cache=True) - def graph(x: List[int]): - __outputs__(o=ts[int]) - pass - - with self.assertRaisesRegex( - NotImplementedError, "Caching is unsupported for argument type typing.Dict\\[int, int\\] \\(argument x\\)" - ): - - @csp.graph(cache=True) - def graph(x: Dict[int, int]): - __outputs__(o=ts[int]) - pass - - with 
self.assertRaisesRegex(NotImplementedError, "Caching of list basket outputs is unsupported"): - - @csp.graph(cache=True) - def graph(): - __outputs__(o=[ts[int]]) - pass - - @csp.graph(cache=True) - def graph(a1: datetime, a2: date, a3: int, a4: float, a5: str, a6: bool): - __outputs__(o=ts[int]) - pass - def test_list_default_value(self): # There was a bug parsing list default value @csp.graph From 471e1426bb942fb19c6f5ca4b47887c3624068d0 Mon Sep 17 00:00:00 2001 From: Tim Paine <3105306+timkpaine@users.noreply.github.com> Date: Tue, 7 May 2024 18:37:16 -0400 Subject: [PATCH 23/27] Run autofixers with pinned up packages Signed-off-by: Tim Paine <3105306+timkpaine@users.noreply.github.com> --- docs/wiki/concepts/Execution-Modes.md | 14 +++++++------- docs/wiki/dev-guides/Build-CSP-from-Source.md | 10 +++++----- setup.py | 2 +- 3 files changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/wiki/concepts/Execution-Modes.md b/docs/wiki/concepts/Execution-Modes.md index 46902a820..5d288f8c0 100644 --- a/docs/wiki/concepts/Execution-Modes.md +++ b/docs/wiki/concepts/Execution-Modes.md @@ -41,14 +41,14 @@ As always, `csp.now()` should still be used in `csp.node` code, even when runnin When consuming data from input adapters there are three choices on how one can consume the data: -| PushMode | EngineMode | Description | -| :------- | :--------- | :---------- | -| **LAST_VALUE** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once with the last value on a given timestamp | -| | Realtime | all ticks that occurred since previous engine cycle will collapse / conflate to the latest value | +| PushMode | EngineMode | Description | +| :----------------- | :--------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **LAST_VALUE** | Simulation | all ticks from input source with duplicate timestamps 
(on the same timeseries) will tick once with the last value on a given timestamp | +| | Realtime | all ticks that occurred since previous engine cycle will collapse / conflate to the latest value | | **NON_COLLAPSING** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once per engine cycle. subsequent cycles will execute with the same time | -| | Realtime | all ticks that occurred since previous engine cycle will be ticked across subsequent engine cycles as fast as possible | -| **BURST** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once with a list of all values | -| | Realtime | all ticks that occurred since previous engine cycle will tick once with a list of all the values | +| | Realtime | all ticks that occurred since previous engine cycle will be ticked across subsequent engine cycles as fast as possible | +| **BURST** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once with a list of all values | +| | Realtime | all ticks that occurred since previous engine cycle will tick once with a list of all the values | ## Realtime Group Event Synchronization diff --git a/docs/wiki/dev-guides/Build-CSP-from-Source.md b/docs/wiki/dev-guides/Build-CSP-from-Source.md index 0ccaaf424..fc480feae 100644 --- a/docs/wiki/dev-guides/Build-CSP-from-Source.md +++ b/docs/wiki/dev-guides/Build-CSP-from-Source.md @@ -198,11 +198,11 @@ By default, we pull and build dependencies with [vcpkg](https://vcpkg.io/en/). W CSP has listing and auto formatting. 
-| Language | Linter | Autoformatter | Description | -| :------- | :----- | :------------ | :---------- | -| C++ | `clang-format` | `clang-format` | Style | -| Python | `ruff` | `ruff` | Style | -| Python | `isort` | `isort` | Imports | +| Language | Linter | Autoformatter | Description | +| :------- | :------------- | :------------- | :---------- | +| C++ | `clang-format` | `clang-format` | Style | +| Python | `ruff` | `ruff` | Style | +| Python | `isort` | `isort` | Imports | **C++ Linting** diff --git a/setup.py b/setup.py index 085e442e5..d8ec75779 100644 --- a/setup.py +++ b/setup.py @@ -64,7 +64,7 @@ ) if VCPKG_TRIPLET is not None: - cmake_args.append( f"-DVCPKG_TARGET_TRIPLET={VCPKG_TRIPLET}" ) + cmake_args.append(f"-DVCPKG_TARGET_TRIPLET={VCPKG_TRIPLET}") else: cmake_args.append("-DCSP_USE_VCPKG=OFF") From 7b3c3a78bbdd6627babe60c82f6d05c14ff07a8e Mon Sep 17 00:00:00 2001 From: Rob Ambalu Date: Wed, 8 May 2024 10:07:03 -0400 Subject: [PATCH 24/27] Python 3.12 build support (#221) * resolve #13 - Python 3.12 build support. * python 3.12 - fix unbound local issue - changed how ts inputs to PyNode are reset to null due to new LOAD_FAST vs LOAD_FAST_CHECK opcodes in Python 3.12. 
Inject DELETE opcodes into bytecode rather than setting directly to null in c++ * cibuildwheel 2.11.2 -> 2.16.5 Signed-off-by: Rob Ambalu --- .github/actions/setup-caches/action.yml | 1 + .github/actions/setup-dependencies/action.yml | 1 + .github/actions/setup-python/action.yml | 3 ++- .github/workflows/build.yml | 20 +++++++++++++++++++ .github/workflows/conda.yml | 2 +- CMakeLists.txt | 2 +- conda/dev-environment-unix.yml | 2 +- cpp/csp/python/PyNode.cpp | 16 +++++++-------- csp/impl/wiring/node_parser.py | 6 +++++- csp/impl/wiring/runtime.py | 6 +++--- csp/tests/test_random.py | 2 +- pyproject.toml | 3 ++- 12 files changed, 46 insertions(+), 18 deletions(-) diff --git a/.github/actions/setup-caches/action.yml b/.github/actions/setup-caches/action.yml index d3e59ba18..b3296b5e4 100644 --- a/.github/actions/setup-caches/action.yml +++ b/.github/actions/setup-caches/action.yml @@ -10,6 +10,7 @@ inputs: - 'cp39' - 'cp310' - 'cp311' + - 'cp312' default: 'cp39' runs: diff --git a/.github/actions/setup-dependencies/action.yml b/.github/actions/setup-dependencies/action.yml index e08d822c7..dac9f9051 100644 --- a/.github/actions/setup-dependencies/action.yml +++ b/.github/actions/setup-dependencies/action.yml @@ -10,6 +10,7 @@ inputs: - 'cp39' - 'cp310' - 'cp311' + - 'cp312' default: 'cp39' runs: diff --git a/.github/actions/setup-python/action.yml b/.github/actions/setup-python/action.yml index cbe61ec80..76bc9e33b 100644 --- a/.github/actions/setup-python/action.yml +++ b/.github/actions/setup-python/action.yml @@ -10,6 +10,7 @@ inputs: - '3.9' - '3.10' - '3.11' + - '3.12' default: '3.9' runs: @@ -33,4 +34,4 @@ runs: - name: Install cibuildwheel and twine shell: bash - run: pip install cibuildwheel==2.11.2 twine + run: pip install cibuildwheel==2.16.5 twine diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index 9e4ff3778..c7c758f8e 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -179,11 +179,13 @@ jobs: - "3.9" - 
"3.10" - "3.11" + - "3.12" cibuildwheel: - "cp38" - "cp39" - "cp310" - "cp311" + - "cp312" is-full-run: - ${{ needs.initialize.outputs.FULL_RUN == 'true' }} exclude: @@ -198,24 +200,41 @@ jobs: cibuildwheel: "cp310" - python-version: "3.8" cibuildwheel: "cp311" + - python-version: "3.8" + cibuildwheel: "cp312" - python-version: "3.9" cibuildwheel: "cp38" - python-version: "3.9" cibuildwheel: "cp310" - python-version: "3.9" cibuildwheel: "cp311" + - python-version: "3.9" + cibuildwheel: "cp312" - python-version: "3.10" cibuildwheel: "cp38" - python-version: "3.10" cibuildwheel: "cp39" - python-version: "3.10" cibuildwheel: "cp311" + - python-version: "3.10" + cibuildwheel: "cp312" - python-version: "3.11" cibuildwheel: "cp38" - python-version: "3.11" cibuildwheel: "cp39" - python-version: "3.11" cibuildwheel: "cp310" + - python-version: "3.11" + cibuildwheel: "cp312" + - python-version: "3.12" + cibuildwheel: "cp38" + - python-version: "3.12" + cibuildwheel: "cp39" + - python-version: "3.12" + cibuildwheel: "cp310" + - python-version: "3.12" + cibuildwheel: "cp311" + ############################################## # Things to exclude if not a full matrix run # @@ -402,6 +421,7 @@ jobs: - 3.9 - "3.10" - 3.11 + - 3.12 is-full-run: - ${{ needs.initialize.outputs.FULL_RUN == 'true' }} exclude: diff --git a/.github/workflows/conda.yml b/.github/workflows/conda.yml index f54a1034b..d6073fd0a 100644 --- a/.github/workflows/conda.yml +++ b/.github/workflows/conda.yml @@ -52,7 +52,7 @@ jobs: - name: Set up Caches uses: ./.github/actions/setup-caches with: - cibuildwheel: 'cp311' + cibuildwheel: 'cp312' - name: Python Lint Steps run: make lint diff --git a/CMakeLists.txt b/CMakeLists.txt index c6d542d4c..6699ae74d 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -98,7 +98,7 @@ if(CSP_USE_CCACHE) endif() if(NOT DEFINED CSP_PYTHON_VERSION) - set(CSP_PYTHON_VERSION 3.11) + set(CSP_PYTHON_VERSION 3.12) endif() # Path to python folder for autogen diff --git 
a/conda/dev-environment-unix.yml b/conda/dev-environment-unix.yml index 490ae2036..80b4af66e 100644 --- a/conda/dev-environment-unix.yml +++ b/conda/dev-environment-unix.yml @@ -38,7 +38,7 @@ dependencies: - pytest-asyncio - pytest-cov - pytest-sugar - - python<3.12 + - python<3.13 - python-rapidjson - rapidjson - requests diff --git a/cpp/csp/python/PyNode.cpp b/cpp/csp/python/PyNode.cpp index ba4ebbd7d..d13e1b808 100644 --- a/cpp/csp/python/PyNode.cpp +++ b/cpp/csp/python/PyNode.cpp @@ -61,8 +61,8 @@ void PyNode::init( PyObjectPtr inputs, PyObjectPtr outputs ) m_localVars = ( PyObject *** ) calloc( numInputs(), sizeof( PyObject ** ) ); //printf( "Starting %s slots: %ld rank: %d\n", name(), slots, rank() ); - PyCodeObject * code = ( PyCodeObject * ) pygen -> gi_code; #if IS_PRE_PYTHON_3_11 + PyCodeObject * code = ( PyCodeObject * ) pygen -> gi_code; Py_ssize_t numCells = PyTuple_GET_SIZE( code -> co_cellvars ); size_t cell2argIdx = 0; for( int stackloc = code -> co_argcount; stackloc < code -> co_nlocals + numCells; ++stackloc ) @@ -82,12 +82,13 @@ void PyNode::init( PyObjectPtr inputs, PyObjectPtr outputs ) continue; var = &( ( ( PyCellObject * ) *var ) -> ob_ref ); } -//PY311 changes +//PY311+ changes #else + _PyInterpreterFrame * frame = ( _PyInterpreterFrame * ) pygen -> gi_iframe; + PyCodeObject * code = frame -> f_code; int localPlusIndex = 0; for( int stackloc = code -> co_argcount; stackloc < code -> co_nlocalsplus; ++stackloc, ++localPlusIndex ) { - _PyInterpreterFrame * frame = ( _PyInterpreterFrame * ) pygen -> gi_iframe; PyObject **var = &frame -> localsplus[stackloc]; auto kind = _PyLocals_GetKind(code -> co_localspluskinds, localPlusIndex ); @@ -113,19 +114,18 @@ void PyNode::init( PyObjectPtr inputs, PyObjectPtr outputs ) std::string vartype = PyUnicode_AsUTF8( PyTuple_GET_ITEM( *var, 0 ) ); int index = fromPython( PyTuple_GET_ITEM( *var, 1 ) ); - //decref tuple at this point its no longer needed and will be replaced - Py_DECREF( *var ); - if( 
vartype == INPUT_VAR_VAR ) { CSP_ASSERT( !isInputBasket( index ) ); m_localVars[ index ] = var; - //assign null to location so users get reference before assignment errors - *var = nullptr; + //These vars will be "deleted" from the python stack after start continue; } + //decref tuple at this point its no longer needed and will be replaced + Py_DECREF( *var ); + PyObject * newvalue = nullptr; if( vartype == NODEREF_VAR ) newvalue = toPython( reinterpret_cast( static_cast(this) ) ); diff --git a/csp/impl/wiring/node_parser.py b/csp/impl/wiring/node_parser.py index af6726ead..6aa3f023a 100644 --- a/csp/impl/wiring/node_parser.py +++ b/csp/impl/wiring/node_parser.py @@ -843,10 +843,14 @@ def _parse_impl(self): else: start_and_body = startblock + body + # delete ts_var variables *after* start so that they raise Unbound local exceptions if they get accessed before first tick + del_vars = [] + for v in ts_vars: + del_vars.append(ast.Delete(targets=[ast.Name(id=v.targets[0].id, ctx=ast.Del())])) # Yield before start block so we can setup stack frame before executing # However, this initial yield shouldn't be within the try-finally block, since if a node does not start, it's stop() logic should not be invoked # This avoids an issue where one node raises an exception upon start(), and then other nodes execute their stop() without having ever started - start_and_body = [ast.Expr(value=ast.Yield(value=None))] + start_and_body + start_and_body = [ast.Expr(value=ast.Yield(value=None))] + del_vars + start_and_body newbody = init_block + start_and_body newfuncdef = ast.FunctionDef(name=self._name, body=newbody, returns=None) diff --git a/csp/impl/wiring/runtime.py b/csp/impl/wiring/runtime.py index d334bd1e1..ca408c786 100644 --- a/csp/impl/wiring/runtime.py +++ b/csp/impl/wiring/runtime.py @@ -18,7 +18,7 @@ def _normalize_run_times(starttime, endtime, realtime): if starttime is None: if realtime: - starttime = datetime.utcnow() + starttime = 
datetime.now(pytz.UTC).replace(tzinfo=None) else: raise RuntimeError("starttime argument is required") if endtime is None: @@ -199,8 +199,8 @@ def run( mem_cache.clear(clear_user_objects=False) # Ensure we dont start running realtime engines before starttime if its in the future - if starttime > datetime.utcnow() and realtime: - time.sleep((starttime - datetime.utcnow()).total_seconds()) + if starttime > datetime.now(pytz.UTC).replace(tzinfo=None) and realtime: + time.sleep((starttime - datetime.now(pytz.UTC)).total_seconds()) with mem_cache: return engine.run(starttime, endtime) diff --git a/csp/tests/test_random.py b/csp/tests/test_random.py index c43e02bdf..3a875ce1d 100644 --- a/csp/tests/test_random.py +++ b/csp/tests/test_random.py @@ -116,7 +116,7 @@ def test_brownian_motion(self): endtime=timedelta(seconds=100), )[0] err = bm_out[-1][1] - data.sum(axis=0) - self.assertAlmostEquals(np.abs(err).max(), 0.0) + self.assertAlmostEqual(np.abs(err).max(), 0.0) def test_brownian_motion_1d(self): mean = 10.0 diff --git a/pyproject.toml b/pyproject.toml index e19191af5..24881afdc 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -38,6 +38,7 @@ classifiers = [ "Programming Language :: Python :: 3.9", "Programming Language :: Python :: 3.10", "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", ] @@ -113,7 +114,7 @@ slack = [ ignore = [] [tool.cibuildwheel] -build = "cp38-* cp39-* cp310-* cp311-*" +build = "cp38-* cp39-* cp310-* cp311-* cp312-*" test-command = "echo 'TODO'" test-requires = [ "pytest", From 68927fe17285f2bb16b5dd04fc6d1f8859e098e3 Mon Sep 17 00:00:00 2001 From: Rob Ambalu Date: Thu, 9 May 2024 12:52:04 -0400 Subject: [PATCH 25/27] Update to arrow / pyarrow 16 (#210) * Upgrade pyarrow to 16.0.0 Signed-off-by: Rob Ambalu --- conda/dev-environment-unix.yml | 4 +- cpp/csp/python/adapters/CMakeLists.txt | 2 +- .../pyarrow-15.0.0/arrow/python/ipc.cc | 67 - .../pyarrow-15.0.0/arrow/util/int_util.h | 137 - 
.../arrow/util/int_util_overflow.h | 118 - .../pyarrow-15.0.0/arrow/util/io_util.h | 420 - .../pyarrow-15.0.0/arrow/util/iterator.h | 568 -- .../arrow/util/key_value_metadata.h | 98 - .../pyarrow-15.0.0/arrow/util/launder.h | 35 - .../pyarrow-15.0.0/arrow/util/list_util.h | 55 - .../pyarrow-15.0.0/arrow/util/logging.h | 259 - .../pyarrow-15.0.0/arrow/util/macros.h | 191 - .../vendored/pyarrow-15.0.0/arrow/util/map.h | 63 - .../arrow/util/math_constants.h | 32 - .../pyarrow-15.0.0/arrow/util/memory.h | 43 - .../pyarrow-15.0.0/arrow/util/mutex.h | 85 - .../pyarrow-15.0.0/arrow/util/parallel.h | 102 - .../pyarrow-15.0.0/arrow/util/pcg_random.h | 33 - .../pyarrow-15.0.0/arrow/util/print.h | 77 - .../pyarrow-15.0.0/arrow/util/queue.h | 29 - .../pyarrow-15.0.0/arrow/util/range.h | 258 - .../pyarrow-15.0.0/arrow/util/ree_util.h | 582 -- .../pyarrow-15.0.0/arrow/util/regex.h | 51 - .../pyarrow-15.0.0/arrow/util/rle_encoding.h | 826 -- .../arrow/util/rows_to_batches.h | 163 - .../vendored/pyarrow-15.0.0/arrow/util/simd.h | 44 - .../pyarrow-15.0.0/arrow/util/small_vector.h | 511 -- .../vendored/pyarrow-15.0.0/arrow/util/sort.h | 78 - .../pyarrow-15.0.0/arrow/util/spaced.h | 98 - .../vendored/pyarrow-15.0.0/arrow/util/span.h | 132 - .../pyarrow-15.0.0/arrow/util/stopwatch.h | 48 - .../pyarrow-15.0.0/arrow/util/string.h | 173 - .../arrow/util/string_builder.h | 84 - .../pyarrow-15.0.0/arrow/util/task_group.h | 106 - .../pyarrow-15.0.0/arrow/util/tdigest.h | 104 - .../pyarrow-15.0.0/arrow/util/test_common.h | 90 - .../pyarrow-15.0.0/arrow/util/thread_pool.h | 620 -- .../vendored/pyarrow-15.0.0/arrow/util/time.h | 83 - .../pyarrow-15.0.0/arrow/util/tracing.h | 45 - .../vendored/pyarrow-15.0.0/arrow/util/trie.h | 243 - .../pyarrow-15.0.0/arrow/util/type_fwd.h | 69 - .../pyarrow-15.0.0/arrow/util/type_traits.h | 46 - .../pyarrow-15.0.0/arrow/util/ubsan.h | 87 - .../pyarrow-15.0.0/arrow/util/union_util.h | 31 - .../pyarrow-15.0.0/arrow/util/unreachable.h | 30 - 
.../vendored/pyarrow-15.0.0/arrow/util/uri.h | 118 - .../vendored/pyarrow-15.0.0/arrow/util/utf8.h | 59 - .../pyarrow-15.0.0/arrow/util/value_parsing.h | 928 --- .../pyarrow-15.0.0/arrow/util/vector.h | 172 - .../pyarrow-15.0.0/arrow/util/visibility.h | 83 - .../arrow/util/windows_compatibility.h | 40 - .../pyarrow-15.0.0/arrow/util/windows_fixup.h | 52 - .../vendored/portable-snippets/debug-trap.h | 83 - .../vendored/portable-snippets/safe-math.h | 1072 --- .../pyarrow-15.0.0/arrow/vendored/xxhash.h | 18 - .../arrow/vendored/xxhash/xxhash.h | 6773 ----------------- .../arrow/python/CMakeLists.txt | 0 .../arrow/python/api.h | 0 .../arrow/python/arrow_to_pandas.cc | 161 +- .../arrow/python/arrow_to_pandas.h | 0 .../arrow/python/arrow_to_python_internal.h | 0 .../arrow/python/async.h | 0 .../arrow/python/benchmark.cc | 0 .../arrow/python/benchmark.h | 0 .../arrow/python/common.cc | 49 +- .../arrow/python/common.h | 0 .../arrow/python/csv.cc | 0 .../arrow/python/csv.h | 0 .../arrow/python/datetime.cc | 0 .../arrow/python/datetime.h | 0 .../arrow/python/decimal.cc | 0 .../arrow/python/decimal.h | 0 .../arrow/python/deserialize.cc | 0 .../arrow/python/deserialize.h | 0 .../arrow/python/extension_type.cc | 2 +- .../arrow/python/extension_type.h | 2 +- .../arrow/python/filesystem.cc | 0 .../arrow/python/filesystem.h | 24 +- .../arrow/python/flight.cc | 0 .../arrow/python/flight.h | 0 .../arrow/python/gdb.cc | 0 .../arrow/python/gdb.h | 0 .../arrow/python/helpers.cc | 2 + .../arrow/python/helpers.h | 0 .../arrow/python/inference.cc | 5 +- .../arrow/python/inference.h | 0 .../arrow/python/init.cc | 0 .../arrow/python/init.h | 0 .../arrow/python/io.cc | 15 +- .../arrow/python/io.h | 0 .../pyarrow-16.0.0/arrow/python/ipc.cc | 133 + .../arrow/python/ipc.h | 20 + .../arrow/python/iterators.h | 0 .../arrow/python/lib.h | 0 .../arrow/python/lib_api.h | 102 +- .../arrow/python/numpy_convert.cc | 83 +- .../arrow/python/numpy_convert.h | 6 +- .../arrow/python/numpy_internal.h | 0 
.../arrow/python/numpy_interop.h | 7 + .../arrow/python/numpy_to_arrow.cc | 32 +- .../arrow/python/numpy_to_arrow.h | 0 .../arrow/python/parquet_encryption.cc | 0 .../arrow/python/parquet_encryption.h | 0 .../arrow/python/pch.h | 0 .../arrow/python/platform.h | 0 .../arrow/python/pyarrow.cc | 0 .../arrow/python/pyarrow.h | 0 .../arrow/python/pyarrow_api.h | 0 .../arrow/python/pyarrow_lib.h | 0 .../arrow/python/python_test.cc | 17 +- .../arrow/python/python_test.h | 0 .../arrow/python/python_to_arrow.cc | 70 +- .../arrow/python/python_to_arrow.h | 0 .../arrow/python/serialize.cc | 0 .../arrow/python/serialize.h | 0 .../arrow/python/type_traits.h | 0 .../arrow/python/udf.cc | 0 .../arrow/python/udf.h | 0 .../arrow/python/visibility.h | 0 setup.py | 1 + vcpkg.json | 3 +- 121 files changed, 523 insertions(+), 16629 deletions(-) delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/ipc.cc delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/int_util.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/int_util_overflow.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/io_util.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/iterator.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/key_value_metadata.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/launder.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/list_util.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/logging.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/macros.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/map.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/math_constants.h delete mode 100644 
cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/memory.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/mutex.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/parallel.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/pcg_random.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/print.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/queue.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/range.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/ree_util.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/regex.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/rle_encoding.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/rows_to_batches.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/simd.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/small_vector.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/sort.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/spaced.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/span.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/stopwatch.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/string.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/string_builder.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/task_group.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/tdigest.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/test_common.h delete mode 100644 
cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/thread_pool.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/time.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/tracing.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/trie.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/type_fwd.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/type_traits.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/ubsan.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/union_util.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/unreachable.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/uri.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/utf8.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/value_parsing.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/vector.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/visibility.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/windows_compatibility.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/windows_fixup.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/portable-snippets/debug-trap.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/portable-snippets/safe-math.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/xxhash.h delete mode 100644 cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/xxhash/xxhash.h rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/CMakeLists.txt (100%) mode change 100644 => 100755 rename 
cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/api.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/arrow_to_pandas.cc (94%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/arrow_to_pandas.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/arrow_to_python_internal.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/async.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/benchmark.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/benchmark.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/common.cc (80%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/common.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/csv.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/csv.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/datetime.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/datetime.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/decimal.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/decimal.h 
(100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/deserialize.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/deserialize.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/extension_type.cc (99%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/extension_type.h (97%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/filesystem.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/filesystem.h (90%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/flight.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/flight.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/gdb.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/gdb.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/helpers.cc (99%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/helpers.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/inference.cc (99%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/inference.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => 
pyarrow-16.0.0}/arrow/python/init.cc (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/init.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/io.cc (96%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/io.h (100%) mode change 100644 => 100755 create mode 100755 cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/ipc.cc rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/ipc.h (73%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/iterators.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/lib.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/lib_api.h (51%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/numpy_convert.cc (90%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/numpy_convert.h (94%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/numpy_internal.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/numpy_interop.h (92%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/numpy_to_arrow.cc (96%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/numpy_to_arrow.h (100%) mode change 100644 => 100755 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => 
pyarrow-16.0.0}/arrow/python/parquet_encryption.cc (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/parquet_encryption.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/pch.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/platform.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/pyarrow.cc (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/pyarrow.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/pyarrow_api.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/pyarrow_lib.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/python_test.cc (98%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/python_test.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/python_to_arrow.cc (95%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/python_to_arrow.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/serialize.cc (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/serialize.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/type_traits.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/udf.cc (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/udf.h (100%)
 mode change 100644 => 100755
 rename cpp/csp/python/adapters/vendored/{pyarrow-15.0.0 => pyarrow-16.0.0}/arrow/python/visibility.h (100%)
 mode change 100644 => 100755

diff --git a/conda/dev-environment-unix.yml b/conda/dev-environment-unix.yml
index 80b4af66e..266f05268 100644
--- a/conda/dev-environment-unix.yml
+++ b/conda/dev-environment-unix.yml
@@ -18,7 +18,7 @@ dependencies:
   - gtest
   - httpx>=0.20,<1
   - isort>=5,<6
-  - libarrow=15
+  - libarrow=16
   - librdkafka
   - libboost-headers
   - lz4-c
@@ -28,7 +28,7 @@ dependencies:
   - numpy
   - pillow
   - psutil
-  - pyarrow=15
+  - pyarrow=16
   - pandas
   - pillow
   - polars
diff --git a/cpp/csp/python/adapters/CMakeLists.txt b/cpp/csp/python/adapters/CMakeLists.txt
index a44cd8d38..512182c42 100644
--- a/cpp/csp/python/adapters/CMakeLists.txt
+++ b/cpp/csp/python/adapters/CMakeLists.txt
@@ -6,7 +6,7 @@ if(CSP_BUILD_KAFKA_ADAPTER)
 endif()
 
 if(CSP_BUILD_PARQUET_ADAPTER)
-    set(VENDORED_PYARROW_ROOT "${CMAKE_SOURCE_DIR}/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/")
+    set(VENDORED_PYARROW_ROOT "${CMAKE_SOURCE_DIR}/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/")
     set(ARROW_PYTHON_SRCS
         ${VENDORED_PYARROW_ROOT}/arrow/python/arrow_to_pandas.cc
         ${VENDORED_PYARROW_ROOT}/arrow/python/benchmark.cc
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/ipc.cc b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/ipc.cc
deleted file mode 100644
index 934818224..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/ipc.cc
+++ /dev/null
@@ -1,67 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#include "ipc.h"
-
-#include <memory>
-
-#include "arrow/python/pyarrow.h"
-
-namespace arrow {
-namespace py {
-
-PyRecordBatchReader::PyRecordBatchReader() {}
-
-Status PyRecordBatchReader::Init(std::shared_ptr<Schema> schema, PyObject* iterable) {
-  schema_ = std::move(schema);
-
-  iterator_.reset(PyObject_GetIter(iterable));
-  return CheckPyError();
-}
-
-std::shared_ptr<Schema> PyRecordBatchReader::schema() const { return schema_; }
-
-Status PyRecordBatchReader::ReadNext(std::shared_ptr<RecordBatch>* batch) {
-  PyAcquireGIL lock;
-
-  if (!iterator_) {
-    // End of stream
-    batch->reset();
-    return Status::OK();
-  }
-
-  OwnedRef py_batch(PyIter_Next(iterator_.obj()));
-  if (!py_batch) {
-    RETURN_IF_PYERROR();
-    // End of stream
-    batch->reset();
-    iterator_.reset();
-    return Status::OK();
-  }
-
-  return unwrap_batch(py_batch.obj()).Value(batch);
-}
-
-Result<std::shared_ptr<RecordBatchReader>> PyRecordBatchReader::Make(
-    std::shared_ptr<Schema> schema, PyObject* iterable) {
-  auto reader = std::shared_ptr<PyRecordBatchReader>(new PyRecordBatchReader());
-  RETURN_NOT_OK(reader->Init(std::move(schema), iterable));
-  return reader;
-}
-
-} // namespace py
-} // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/int_util.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/int_util.h
deleted file mode 100644
index 
59a2ac710..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/int_util.h +++ /dev/null @@ -1,137 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include - -#include "arrow/status.h" - -#include "arrow/util/visibility.h" - -namespace arrow { - -class DataType; -struct ArraySpan; -struct Scalar; - -namespace internal { - -ARROW_EXPORT -uint8_t DetectUIntWidth(const uint64_t* values, int64_t length, uint8_t min_width = 1); - -ARROW_EXPORT -uint8_t DetectUIntWidth(const uint64_t* values, const uint8_t* valid_bytes, - int64_t length, uint8_t min_width = 1); - -ARROW_EXPORT -uint8_t DetectIntWidth(const int64_t* values, int64_t length, uint8_t min_width = 1); - -ARROW_EXPORT -uint8_t DetectIntWidth(const int64_t* values, const uint8_t* valid_bytes, int64_t length, - uint8_t min_width = 1); - -ARROW_EXPORT -void DowncastInts(const int64_t* source, int8_t* dest, int64_t length); - -ARROW_EXPORT -void DowncastInts(const int64_t* source, int16_t* dest, int64_t length); - -ARROW_EXPORT -void DowncastInts(const int64_t* source, int32_t* dest, int64_t length); - -ARROW_EXPORT -void DowncastInts(const int64_t* source, int64_t* dest, int64_t length); - -ARROW_EXPORT 
-void DowncastUInts(const uint64_t* source, uint8_t* dest, int64_t length); - -ARROW_EXPORT -void DowncastUInts(const uint64_t* source, uint16_t* dest, int64_t length); - -ARROW_EXPORT -void DowncastUInts(const uint64_t* source, uint32_t* dest, int64_t length); - -ARROW_EXPORT -void DowncastUInts(const uint64_t* source, uint64_t* dest, int64_t length); - -ARROW_EXPORT -void UpcastInts(const int32_t* source, int64_t* dest, int64_t length); - -template -inline typename std::enable_if<(sizeof(InputInt) >= sizeof(OutputInt))>::type CastInts( - const InputInt* source, OutputInt* dest, int64_t length) { - DowncastInts(source, dest, length); -} - -template -inline typename std::enable_if<(sizeof(InputInt) < sizeof(OutputInt))>::type CastInts( - const InputInt* source, OutputInt* dest, int64_t length) { - UpcastInts(source, dest, length); -} - -template -ARROW_EXPORT void TransposeInts(const InputInt* source, OutputInt* dest, int64_t length, - const int32_t* transpose_map); - -ARROW_EXPORT -Status TransposeInts(const DataType& src_type, const DataType& dest_type, - const uint8_t* src, uint8_t* dest, int64_t src_offset, - int64_t dest_offset, int64_t length, const int32_t* transpose_map); - -/// \brief Do vectorized boundschecking of integer-type array indices. The -/// indices must be nonnegative and strictly less than the passed upper -/// limit (which is usually the length of an array that is being indexed-into). -ARROW_EXPORT -Status CheckIndexBounds(const ArraySpan& values, uint64_t upper_limit); - -/// \brief Boundscheck integer values to determine if they are all between the -/// passed upper and lower limits (inclusive). Upper and lower bounds must be -/// the same type as the data and are not currently casted. -ARROW_EXPORT -Status CheckIntegersInRange(const ArraySpan& values, const Scalar& bound_lower, - const Scalar& bound_upper); - -/// \brief Use CheckIntegersInRange to determine whether the passed integers -/// can fit safely in the passed integer type. 
This helps quickly determine if -/// integer narrowing (e.g. int64->int32) is safe to do. -ARROW_EXPORT -Status IntegersCanFit(const ArraySpan& values, const DataType& target_type); - -/// \brief Convenience for boundschecking a single Scalar value -ARROW_EXPORT -Status IntegersCanFit(const Scalar& value, const DataType& target_type); - -/// Upcast an integer to the largest possible width (currently 64 bits) - -template -typename std::enable_if< - std::is_integral::value && std::is_signed::value, int64_t>::type -UpcastInt(Integer v) { - return v; -} - -template -typename std::enable_if< - std::is_integral::value && std::is_unsigned::value, uint64_t>::type -UpcastInt(Integer v) { - return v; -} - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/int_util_overflow.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/int_util_overflow.h deleted file mode 100644 index ffe78be24..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/int_util_overflow.h +++ /dev/null @@ -1,118 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#pragma once - -#include -#include -#include - -#include "arrow/status.h" -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -// "safe-math.h" includes from the Windows headers. -#include "arrow/util/windows_compatibility.h" -#include "arrow/vendored/portable-snippets/safe-math.h" -// clang-format off (avoid include reordering) -#include "arrow/util/windows_fixup.h" -// clang-format on - -namespace arrow { -namespace internal { - -// Define functions AddWithOverflow, SubtractWithOverflow, MultiplyWithOverflow -// with the signature `bool(T u, T v, T* out)` where T is an integer type. -// On overflow, these functions return true. Otherwise, false is returned -// and `out` is updated with the result of the operation. - -#define OP_WITH_OVERFLOW(_func_name, _psnip_op, _type, _psnip_type) \ - [[nodiscard]] static inline bool _func_name(_type u, _type v, _type* out) { \ - return !psnip_safe_##_psnip_type##_##_psnip_op(out, u, v); \ - } - -#define OPS_WITH_OVERFLOW(_func_name, _psnip_op) \ - OP_WITH_OVERFLOW(_func_name, _psnip_op, int8_t, int8) \ - OP_WITH_OVERFLOW(_func_name, _psnip_op, int16_t, int16) \ - OP_WITH_OVERFLOW(_func_name, _psnip_op, int32_t, int32) \ - OP_WITH_OVERFLOW(_func_name, _psnip_op, int64_t, int64) \ - OP_WITH_OVERFLOW(_func_name, _psnip_op, uint8_t, uint8) \ - OP_WITH_OVERFLOW(_func_name, _psnip_op, uint16_t, uint16) \ - OP_WITH_OVERFLOW(_func_name, _psnip_op, uint32_t, uint32) \ - OP_WITH_OVERFLOW(_func_name, _psnip_op, uint64_t, uint64) - -OPS_WITH_OVERFLOW(AddWithOverflow, add) -OPS_WITH_OVERFLOW(SubtractWithOverflow, sub) -OPS_WITH_OVERFLOW(MultiplyWithOverflow, mul) -OPS_WITH_OVERFLOW(DivideWithOverflow, div) - -#undef OP_WITH_OVERFLOW -#undef OPS_WITH_OVERFLOW - -// Define function NegateWithOverflow with the signature `bool(T u, T* out)` -// where T is a signed integer type. On overflow, these functions return true. -// Otherwise, false is returned and `out` is updated with the result of the -// operation. 
- -#define UNARY_OP_WITH_OVERFLOW(_func_name, _psnip_op, _type, _psnip_type) \ - [[nodiscard]] static inline bool _func_name(_type u, _type* out) { \ - return !psnip_safe_##_psnip_type##_##_psnip_op(out, u); \ - } - -#define SIGNED_UNARY_OPS_WITH_OVERFLOW(_func_name, _psnip_op) \ - UNARY_OP_WITH_OVERFLOW(_func_name, _psnip_op, int8_t, int8) \ - UNARY_OP_WITH_OVERFLOW(_func_name, _psnip_op, int16_t, int16) \ - UNARY_OP_WITH_OVERFLOW(_func_name, _psnip_op, int32_t, int32) \ - UNARY_OP_WITH_OVERFLOW(_func_name, _psnip_op, int64_t, int64) - -SIGNED_UNARY_OPS_WITH_OVERFLOW(NegateWithOverflow, neg) - -#undef UNARY_OP_WITH_OVERFLOW -#undef SIGNED_UNARY_OPS_WITH_OVERFLOW - -/// Signed addition with well-defined behaviour on overflow (as unsigned) -template -SignedInt SafeSignedAdd(SignedInt u, SignedInt v) { - using UnsignedInt = typename std::make_unsigned::type; - return static_cast(static_cast(u) + - static_cast(v)); -} - -/// Signed subtraction with well-defined behaviour on overflow (as unsigned) -template -SignedInt SafeSignedSubtract(SignedInt u, SignedInt v) { - using UnsignedInt = typename std::make_unsigned::type; - return static_cast(static_cast(u) - - static_cast(v)); -} - -/// Signed negation with well-defined behaviour on overflow (as unsigned) -template -SignedInt SafeSignedNegate(SignedInt u) { - using UnsignedInt = typename std::make_unsigned::type; - return static_cast(~static_cast(u) + 1); -} - -/// Signed left shift with well-defined behaviour on negative numbers or overflow -template -SignedInt SafeLeftShift(SignedInt u, Shift shift) { - using UnsignedInt = typename std::make_unsigned::type; - return static_cast(static_cast(u) << shift); -} - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/io_util.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/io_util.h deleted file mode 100644 index 113b1bdd9..000000000 --- 
a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/io_util.h +++ /dev/null @@ -1,420 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#ifndef _WIN32 -#define ARROW_HAVE_SIGACTION 1 -#endif - -#include -#include -#include -#include -#include - -#if ARROW_HAVE_SIGACTION -#include // Needed for struct sigaction -#endif - -#include "arrow/status.h" -#include "arrow/type_fwd.h" -#include "arrow/util/macros.h" -#include "arrow/util/windows_fixup.h" - -namespace arrow { -namespace internal { - -// NOTE: 8-bit path strings on Windows are encoded using UTF-8. -// Using MBCS would fail encoding some paths. 
- -#if defined(_WIN32) -using NativePathString = std::wstring; -#else -using NativePathString = std::string; -#endif - -class ARROW_EXPORT PlatformFilename { - public: - struct Impl; - - ~PlatformFilename(); - PlatformFilename(); - PlatformFilename(const PlatformFilename&); - PlatformFilename(PlatformFilename&&); - PlatformFilename& operator=(const PlatformFilename&); - PlatformFilename& operator=(PlatformFilename&&); - explicit PlatformFilename(NativePathString path); - explicit PlatformFilename(const NativePathString::value_type* path); - - const NativePathString& ToNative() const; - std::string ToString() const; - - PlatformFilename Parent() const; - Result Real() const; - - // These functions can fail for character encoding reasons. - static Result FromString(std::string_view file_name); - Result Join(std::string_view child_name) const; - - PlatformFilename Join(const PlatformFilename& child_name) const; - - bool operator==(const PlatformFilename& other) const; - bool operator!=(const PlatformFilename& other) const; - - // Made public to avoid the proliferation of friend declarations. - const Impl* impl() const { return impl_.get(); } - - private: - std::unique_ptr impl_; - - explicit PlatformFilename(Impl impl); -}; - -/// Create a directory if it doesn't exist. -/// -/// Return whether the directory was created. -ARROW_EXPORT -Result CreateDir(const PlatformFilename& dir_path); - -/// Create a directory and its parents if it doesn't exist. -/// -/// Return whether the directory was created. -ARROW_EXPORT -Result CreateDirTree(const PlatformFilename& dir_path); - -/// Delete a directory's contents (but not the directory itself) if it exists. -/// -/// Return whether the directory existed. -ARROW_EXPORT -Result DeleteDirContents(const PlatformFilename& dir_path, - bool allow_not_found = true); - -/// Delete a directory tree if it exists. -/// -/// Return whether the directory existed. 
-ARROW_EXPORT -Result DeleteDirTree(const PlatformFilename& dir_path, bool allow_not_found = true); - -// Non-recursively list the contents of the given directory. -// The returned names are the children's base names, not including dir_path. -ARROW_EXPORT -Result> ListDir(const PlatformFilename& dir_path); - -/// Delete a file if it exists. -/// -/// Return whether the file existed. -ARROW_EXPORT -Result DeleteFile(const PlatformFilename& file_path, bool allow_not_found = true); - -/// Return whether a file exists. -ARROW_EXPORT -Result FileExists(const PlatformFilename& path); - -// TODO expose this more publicly to make it available from io/file.h? -/// A RAII wrapper for a file descriptor. -/// -/// The underlying file descriptor is automatically closed on destruction. -/// Moving is supported with well-defined semantics. -/// Furthermore, closing is idempotent. -class ARROW_EXPORT FileDescriptor { - public: - FileDescriptor() = default; - explicit FileDescriptor(int fd) : fd_(fd) {} - FileDescriptor(FileDescriptor&&); - FileDescriptor& operator=(FileDescriptor&&); - - ~FileDescriptor(); - - Status Close(); - - /// May return -1 if closed or default-initialized - int fd() const { return fd_.load(); } - - /// Detach and return the underlying file descriptor - int Detach(); - - bool closed() const { return fd_.load() == -1; } - - protected: - static void CloseFromDestructor(int fd); - - std::atomic fd_{-1}; -}; - -/// Open a file for reading and return a file descriptor. -ARROW_EXPORT -Result FileOpenReadable(const PlatformFilename& file_name); - -/// Open a file for writing and return a file descriptor. -ARROW_EXPORT -Result FileOpenWritable(const PlatformFilename& file_name, - bool write_only = true, bool truncate = true, - bool append = false); - -/// Read from current file position. Return number of bytes read. -ARROW_EXPORT -Result FileRead(int fd, uint8_t* buffer, int64_t nbytes); -/// Read from given file position. Return number of bytes read. 
-ARROW_EXPORT -Result FileReadAt(int fd, uint8_t* buffer, int64_t position, int64_t nbytes); - -ARROW_EXPORT -Status FileWrite(int fd, const uint8_t* buffer, const int64_t nbytes); -ARROW_EXPORT -Status FileTruncate(int fd, const int64_t size); - -ARROW_EXPORT -Status FileSeek(int fd, int64_t pos); -ARROW_EXPORT -Status FileSeek(int fd, int64_t pos, int whence); -ARROW_EXPORT -Result FileTell(int fd); -ARROW_EXPORT -Result FileGetSize(int fd); - -ARROW_EXPORT -Status FileClose(int fd); - -struct Pipe { - FileDescriptor rfd; - FileDescriptor wfd; - - Status Close() { return rfd.Close() & wfd.Close(); } -}; - -ARROW_EXPORT -Result CreatePipe(); - -ARROW_EXPORT -Status SetPipeFileDescriptorNonBlocking(int fd); - -class ARROW_EXPORT SelfPipe { - public: - static Result> Make(bool signal_safe); - virtual ~SelfPipe(); - - /// \brief Wait for a wakeup. - /// - /// Status::Invalid is returned if the pipe has been shutdown. - /// Otherwise the next sent payload is returned. - virtual Result Wait() = 0; - - /// \brief Wake up the pipe by sending a payload. - /// - /// This method is async-signal-safe if `signal_safe` was set to true. - virtual void Send(uint64_t payload) = 0; - - /// \brief Wake up the pipe and shut it down. 
- virtual Status Shutdown() = 0; -}; - -ARROW_EXPORT -int64_t GetPageSize(); - -struct MemoryRegion { - void* addr; - size_t size; -}; - -ARROW_EXPORT -Status MemoryMapRemap(void* addr, size_t old_size, size_t new_size, int fildes, - void** new_addr); -ARROW_EXPORT -Status MemoryAdviseWillNeed(const std::vector& regions); - -ARROW_EXPORT -Result GetEnvVar(const char* name); -ARROW_EXPORT -Result GetEnvVar(const std::string& name); -ARROW_EXPORT -Result GetEnvVarNative(const char* name); -ARROW_EXPORT -Result GetEnvVarNative(const std::string& name); - -ARROW_EXPORT -Status SetEnvVar(const char* name, const char* value); -ARROW_EXPORT -Status SetEnvVar(const std::string& name, const std::string& value); -ARROW_EXPORT -Status DelEnvVar(const char* name); -ARROW_EXPORT -Status DelEnvVar(const std::string& name); - -ARROW_EXPORT -std::string ErrnoMessage(int errnum); -#if _WIN32 -ARROW_EXPORT -std::string WinErrorMessage(int errnum); -#endif - -ARROW_EXPORT -std::shared_ptr StatusDetailFromErrno(int errnum); -#if _WIN32 -ARROW_EXPORT -std::shared_ptr StatusDetailFromWinError(int errnum); -#endif -ARROW_EXPORT -std::shared_ptr StatusDetailFromSignal(int signum); - -template -Status StatusFromErrno(int errnum, StatusCode code, Args&&... args) { - return Status::FromDetailAndArgs(code, StatusDetailFromErrno(errnum), - std::forward(args)...); -} - -template -Status IOErrorFromErrno(int errnum, Args&&... args) { - return StatusFromErrno(errnum, StatusCode::IOError, std::forward(args)...); -} - -#if _WIN32 -template -Status StatusFromWinError(int errnum, StatusCode code, Args&&... args) { - return Status::FromDetailAndArgs(code, StatusDetailFromWinError(errnum), - std::forward(args)...); -} - -template -Status IOErrorFromWinError(int errnum, Args&&... args) { - return StatusFromWinError(errnum, StatusCode::IOError, std::forward(args)...); -} -#endif - -template -Status StatusFromSignal(int signum, StatusCode code, Args&&... 
args) { - return Status::FromDetailAndArgs(code, StatusDetailFromSignal(signum), - std::forward(args)...); -} - -template -Status CancelledFromSignal(int signum, Args&&... args) { - return StatusFromSignal(signum, StatusCode::Cancelled, std::forward(args)...); -} - -ARROW_EXPORT -int ErrnoFromStatus(const Status&); - -// Always returns 0 on non-Windows platforms (for Python). -ARROW_EXPORT -int WinErrorFromStatus(const Status&); - -ARROW_EXPORT -int SignalFromStatus(const Status&); - -class ARROW_EXPORT TemporaryDir { - public: - ~TemporaryDir(); - - /// '/'-terminated path to the temporary dir - const PlatformFilename& path() { return path_; } - - /// Create a temporary subdirectory in the system temporary dir, - /// named starting with `prefix`. - static Result> Make(const std::string& prefix); - - private: - PlatformFilename path_; - - explicit TemporaryDir(PlatformFilename&&); -}; - -class ARROW_EXPORT SignalHandler { - public: - typedef void (*Callback)(int); - - SignalHandler(); - explicit SignalHandler(Callback cb); -#if ARROW_HAVE_SIGACTION - explicit SignalHandler(const struct sigaction& sa); -#endif - - Callback callback() const; -#if ARROW_HAVE_SIGACTION - const struct sigaction& action() const; -#endif - - protected: -#if ARROW_HAVE_SIGACTION - // Storing the full sigaction allows to restore the entire signal handling - // configuration. - struct sigaction sa_; -#else - Callback cb_; -#endif -}; - -/// \brief Return the current handler for the given signal number. -ARROW_EXPORT -Result GetSignalHandler(int signum); - -/// \brief Set a new handler for the given signal number. -/// -/// The old signal handler is returned. -ARROW_EXPORT -Result SetSignalHandler(int signum, const SignalHandler& handler); - -/// \brief Reinstate the signal handler -/// -/// For use in signal handlers. This is needed on platforms without sigaction() -/// such as Windows, as the default signal handler is restored there as -/// soon as a signal is raised. 
-ARROW_EXPORT -void ReinstateSignalHandler(int signum, SignalHandler::Callback handler); - -/// \brief Send a signal to the current process -/// -/// The thread which will receive the signal is unspecified. -ARROW_EXPORT -Status SendSignal(int signum); - -/// \brief Send a signal to the given thread -/// -/// This function isn't supported on Windows. -ARROW_EXPORT -Status SendSignalToThread(int signum, uint64_t thread_id); - -/// \brief Get an unpredictable random seed -/// -/// This function may be slightly costly, so should only be used to initialize -/// a PRNG, not to generate a large amount of random numbers. -/// It is better to use this function rather than std::random_device, unless -/// absolutely necessary (e.g. to generate a cryptographic secret). -ARROW_EXPORT -int64_t GetRandomSeed(); - -/// \brief Get the current thread id -/// -/// In addition to having the same properties as std::thread, the returned value -/// is a regular integer value, which is more convenient than an opaque type. -ARROW_EXPORT -uint64_t GetThreadId(); - -/// \brief Get the current memory used by the current process in bytes -/// -/// This function supports Windows, Linux, and Mac and will return 0 otherwise -ARROW_EXPORT -int64_t GetCurrentRSS(); - -/// \brief Get the total memory available to the system in bytes -/// -/// This function supports Windows, Linux, and Mac and will return 0 otherwise -ARROW_EXPORT -int64_t GetTotalMemoryBytes(); - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/iterator.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/iterator.h deleted file mode 100644 index 5e716d0fd..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/iterator.h +++ /dev/null @@ -1,568 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. 
See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include -#include -#include -#include -#include -#include - -#include "arrow/result.h" -#include "arrow/status.h" -#include "arrow/util/compare.h" -#include "arrow/util/functional.h" -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -template -class Iterator; - -template -struct IterationTraits { - /// \brief a reserved value which indicates the end of iteration. By - /// default this is NULLPTR since most iterators yield pointer types. - /// Specialize IterationTraits if different end semantics are required. - /// - /// Note: This should not be used to determine if a given value is a - /// terminal value. Use IsIterationEnd (which uses IsEnd) instead. This - /// is only for returning terminal values. - static T End() { return T(NULLPTR); } - - /// \brief Checks to see if the value is a terminal value. 
- /// A method is used here since T is not necessarily comparable in many - /// cases even though it has a distinct final value - static bool IsEnd(const T& val) { return val == End(); } -}; - -template -T IterationEnd() { - return IterationTraits::End(); -} - -template -bool IsIterationEnd(const T& val) { - return IterationTraits::IsEnd(val); -} - -template -struct IterationTraits> { - /// \brief by default when iterating through a sequence of optional, - /// nullopt indicates the end of iteration. - /// Specialize IterationTraits if different end semantics are required. - static std::optional End() { return std::nullopt; } - - /// \brief by default when iterating through a sequence of optional, - /// nullopt (!has_value()) indicates the end of iteration. - /// Specialize IterationTraits if different end semantics are required. - static bool IsEnd(const std::optional& val) { return !val.has_value(); } - - // TODO(bkietz) The range-for loop over Iterator> yields - // Result> which is unnecessary (since only the unyielded end optional - // is nullopt. Add IterationTraits::GetRangeElement() to handle this case -}; - -/// \brief A generic Iterator that can return errors -template -class Iterator : public util::EqualityComparable> { - public: - /// \brief Iterator may be constructed from any type which has a member function - /// with signature Result Next(); - /// End of iterator is signalled by returning IteratorTraits::End(); - /// - /// The argument is moved or copied to the heap and kept in a unique_ptr. Only - /// its destructor and its Next method (which are stored in function pointers) are - /// referenced after construction. 
- /// - /// This approach is used to dodge MSVC linkage hell (ARROW-6244, ARROW-6558) when using - /// an abstract template base class: instead of being inlined as usual for a template - /// function the base's virtual destructor will be exported, leading to multiple - /// definition errors when linking to any other TU where the base is instantiated. - template - explicit Iterator(Wrapped has_next) - : ptr_(new Wrapped(std::move(has_next)), Delete), next_(Next) {} - - Iterator() : ptr_(NULLPTR, [](void*) {}) {} - - /// \brief Return the next element of the sequence, IterationTraits::End() when the - /// iteration is completed. Calling this on a default constructed Iterator - /// will result in undefined behavior. - Result Next() { return next_(ptr_.get()); } - - /// Pass each element of the sequence to a visitor. Will return any error status - /// returned by the visitor, terminating iteration. - template - Status Visit(Visitor&& visitor) { - for (;;) { - ARROW_ASSIGN_OR_RAISE(auto value, Next()); - - if (IsIterationEnd(value)) break; - - ARROW_RETURN_NOT_OK(visitor(std::move(value))); - } - - return Status::OK(); - } - - /// Iterators will only compare equal if they are both null. - /// Equality comparability is required to make an Iterator of Iterators - /// (to check for the end condition). 
- bool Equals(const Iterator& other) const { return ptr_ == other.ptr_; } - - explicit operator bool() const { return ptr_ != NULLPTR; } - - class RangeIterator { - public: - RangeIterator() : value_(IterationTraits::End()) {} - - explicit RangeIterator(Iterator i) - : value_(IterationTraits::End()), - iterator_(std::make_shared(std::move(i))) { - Next(); - } - - bool operator!=(const RangeIterator& other) const { return value_ != other.value_; } - - RangeIterator& operator++() { - Next(); - return *this; - } - - Result operator*() { - ARROW_RETURN_NOT_OK(value_.status()); - - auto value = std::move(value_); - value_ = IterationTraits::End(); - return value; - } - - private: - void Next() { - if (!value_.ok()) { - value_ = IterationTraits::End(); - return; - } - value_ = iterator_->Next(); - } - - Result value_; - std::shared_ptr iterator_; - }; - - RangeIterator begin() { return RangeIterator(std::move(*this)); } - - RangeIterator end() { return RangeIterator(); } - - /// \brief Move every element of this iterator into a vector. - Result> ToVector() { - std::vector out; - for (auto maybe_element : *this) { - ARROW_ASSIGN_OR_RAISE(auto element, maybe_element); - out.push_back(std::move(element)); - } - // ARROW-8193: On gcc-4.8 without the explicit move it tries to use the - // copy constructor, which may be deleted on the elements of type T - return std::move(out); - } - - private: - /// Implementation of deleter for ptr_: Casts from void* to the wrapped type and - /// deletes that. - template - static void Delete(void* ptr) { - delete static_cast(ptr); - } - - /// Implementation of Next: Casts from void* to the wrapped type and invokes that - /// type's Next member function. - template - static Result Next(void* ptr) { - return static_cast(ptr)->Next(); - } - - /// ptr_ is a unique_ptr to void with a custom deleter: a function pointer which first - /// casts from void* to a pointer to the wrapped type then deletes that. 
- std::unique_ptr ptr_; - - /// next_ is a function pointer which first casts from void* to a pointer to the wrapped - /// type then invokes its Next member function. - Result (*next_)(void*) = NULLPTR; -}; - -template -struct TransformFlow { - using YieldValueType = T; - - TransformFlow(YieldValueType value, bool ready_for_next) - : finished_(false), - ready_for_next_(ready_for_next), - yield_value_(std::move(value)) {} - TransformFlow(bool finished, bool ready_for_next) - : finished_(finished), ready_for_next_(ready_for_next), yield_value_() {} - - bool HasValue() const { return yield_value_.has_value(); } - bool Finished() const { return finished_; } - bool ReadyForNext() const { return ready_for_next_; } - T Value() const { return *yield_value_; } - - bool finished_ = false; - bool ready_for_next_ = false; - std::optional yield_value_; -}; - -struct TransformFinish { - template - operator TransformFlow() && { // NOLINT explicit - return TransformFlow(true, true); - } -}; - -struct TransformSkip { - template - operator TransformFlow() && { // NOLINT explicit - return TransformFlow(false, true); - } -}; - -template -TransformFlow TransformYield(T value = {}, bool ready_for_next = true) { - return TransformFlow(std::move(value), ready_for_next); -} - -template -using Transformer = std::function>(T)>; - -template -class TransformIterator { - public: - explicit TransformIterator(Iterator it, Transformer transformer) - : it_(std::move(it)), - transformer_(std::move(transformer)), - last_value_(), - finished_() {} - - Result Next() { - while (!finished_) { - ARROW_ASSIGN_OR_RAISE(std::optional next, Pump()); - if (next.has_value()) { - return std::move(*next); - } - ARROW_ASSIGN_OR_RAISE(last_value_, it_.Next()); - } - return IterationTraits::End(); - } - - private: - // Calls the transform function on the current value. Can return in several ways - // * If the next value is requested (e.g. 
skip) it will return an empty optional - // * If an invalid status is encountered that will be returned - // * If finished it will return IterationTraits::End() - // * If a value is returned by the transformer that will be returned - Result> Pump() { - if (!finished_ && last_value_.has_value()) { - auto next_res = transformer_(*last_value_); - if (!next_res.ok()) { - finished_ = true; - return next_res.status(); - } - auto next = *next_res; - if (next.ReadyForNext()) { - if (IsIterationEnd(*last_value_)) { - finished_ = true; - } - last_value_.reset(); - } - if (next.Finished()) { - finished_ = true; - } - if (next.HasValue()) { - return next.Value(); - } - } - if (finished_) { - return IterationTraits::End(); - } - return std::nullopt; - } - - Iterator it_; - Transformer transformer_; - std::optional last_value_; - bool finished_ = false; -}; - -/// \brief Transforms an iterator according to a transformer, returning a new Iterator. -/// -/// The transformer will be called on each element of the source iterator and for each -/// call it can yield a value, skip, or finish the iteration. When yielding a value the -/// transformer can choose to consume the source item (the default, ready_for_next = true) -/// or to keep it and it will be called again on the same value. -/// -/// This is essentially a more generic form of the map operation that can return 0, 1, or -/// many values for each of the source items. -/// -/// The transformer will be exposed to the end of the source sequence -/// (IterationTraits::End) in case it needs to return some penultimate item(s). -/// -/// Any invalid status returned by the transformer will be returned immediately. -template -Iterator MakeTransformedIterator(Iterator it, Transformer op) { - return Iterator(TransformIterator(std::move(it), std::move(op))); -} - -template -struct IterationTraits> { - // The end condition for an Iterator of Iterators is a default constructed (null) - // Iterator. 
- static Iterator End() { return Iterator(); } - static bool IsEnd(const Iterator& val) { return !val; } -}; - -template -class FunctionIterator { - public: - explicit FunctionIterator(Fn fn) : fn_(std::move(fn)) {} - - Result Next() { return fn_(); } - - private: - Fn fn_; -}; - -/// \brief Construct an Iterator which invokes a callable on Next() -template ::ValueType> -Iterator MakeFunctionIterator(Fn fn) { - return Iterator(FunctionIterator(std::move(fn))); -} - -template -Iterator MakeEmptyIterator() { - return MakeFunctionIterator([]() -> Result { return IterationTraits::End(); }); -} - -template -Iterator MakeErrorIterator(Status s) { - return MakeFunctionIterator([s]() -> Result { - ARROW_RETURN_NOT_OK(s); - return IterationTraits::End(); - }); -} - -/// \brief Simple iterator which yields the elements of a std::vector -template -class VectorIterator { - public: - explicit VectorIterator(std::vector v) : elements_(std::move(v)) {} - - Result Next() { - if (i_ == elements_.size()) { - return IterationTraits::End(); - } - return std::move(elements_[i_++]); - } - - private: - std::vector elements_; - size_t i_ = 0; -}; - -template -Iterator MakeVectorIterator(std::vector v) { - return Iterator(VectorIterator(std::move(v))); -} - -/// \brief Simple iterator which yields *pointers* to the elements of a std::vector. -/// This is provided to support T where IterationTraits::End is not specialized -template -class VectorPointingIterator { - public: - explicit VectorPointingIterator(std::vector v) : elements_(std::move(v)) {} - - Result Next() { - if (i_ == elements_.size()) { - return NULLPTR; - } - return &elements_[i_++]; - } - - private: - std::vector elements_; - size_t i_ = 0; -}; - -template -Iterator MakeVectorPointingIterator(std::vector v) { - return Iterator(VectorPointingIterator(std::move(v))); -} - -/// \brief MapIterator takes ownership of an iterator and a function to apply -/// on every element. The mapped function is not allowed to fail. 
-template -class MapIterator { - public: - explicit MapIterator(Fn map, Iterator it) - : map_(std::move(map)), it_(std::move(it)) {} - - Result Next() { - ARROW_ASSIGN_OR_RAISE(I i, it_.Next()); - - if (IsIterationEnd(i)) { - return IterationTraits::End(); - } - - return map_(std::move(i)); - } - - private: - Fn map_; - Iterator it_; -}; - -/// \brief MapIterator takes ownership of an iterator and a function to apply -/// on every element. The mapped function is not allowed to fail. -template , - typename To = internal::call_traits::return_type> -Iterator MakeMapIterator(Fn map, Iterator it) { - return Iterator(MapIterator(std::move(map), std::move(it))); -} - -/// \brief Like MapIterator, but where the function can fail. -template , - typename To = typename internal::call_traits::return_type::ValueType> -Iterator MakeMaybeMapIterator(Fn map, Iterator it) { - return Iterator(MapIterator(std::move(map), std::move(it))); -} - -struct FilterIterator { - enum Action { ACCEPT, REJECT }; - - template - static Result> Reject() { - return std::make_pair(IterationTraits::End(), REJECT); - } - - template - static Result> Accept(To out) { - return std::make_pair(std::move(out), ACCEPT); - } - - template - static Result> MaybeAccept(Result maybe_out) { - return std::move(maybe_out).Map(Accept); - } - - template - static Result> Error(Status s) { - return s; - } - - template - class Impl { - public: - explicit Impl(Fn filter, Iterator it) : filter_(filter), it_(std::move(it)) {} - - Result Next() { - To out = IterationTraits::End(); - Action action; - - for (;;) { - ARROW_ASSIGN_OR_RAISE(From i, it_.Next()); - - if (IsIterationEnd(i)) { - return IterationTraits::End(); - } - - ARROW_ASSIGN_OR_RAISE(std::tie(out, action), filter_(std::move(i))); - - if (action == ACCEPT) return out; - } - } - - private: - Fn filter_; - Iterator it_; - }; -}; - -/// \brief Like MapIterator, but where the function can fail or reject elements. 
-template < - typename Fn, typename From = typename internal::call_traits::argument_type<0, Fn>, - typename Ret = typename internal::call_traits::return_type::ValueType, - typename To = typename std::tuple_element<0, Ret>::type, - typename Enable = typename std::enable_if::type, FilterIterator::Action>::value>::type> -Iterator MakeFilterIterator(Fn filter, Iterator it) { - return Iterator( - FilterIterator::Impl(std::move(filter), std::move(it))); -} - -/// \brief FlattenIterator takes an iterator generating iterators and yields a -/// unified iterator that flattens/concatenates in a single stream. -template -class FlattenIterator { - public: - explicit FlattenIterator(Iterator> it) : parent_(std::move(it)) {} - - Result Next() { - if (IsIterationEnd(child_)) { - // Pop from parent's iterator. - ARROW_ASSIGN_OR_RAISE(child_, parent_.Next()); - - // Check if final iteration reached. - if (IsIterationEnd(child_)) { - return IterationTraits::End(); - } - - return Next(); - } - - // Pop from child_ and check for depletion. 
- ARROW_ASSIGN_OR_RAISE(T out, child_.Next()); - if (IsIterationEnd(out)) { - // Reset state such that we pop from parent on the recursive call - child_ = IterationTraits>::End(); - - return Next(); - } - - return out; - } - - private: - Iterator> parent_; - Iterator child_ = IterationTraits>::End(); -}; - -template -Iterator MakeFlattenIterator(Iterator> it) { - return Iterator(FlattenIterator(std::move(it))); -} - -template -Iterator MakeIteratorFromReader( - const std::shared_ptr& reader) { - return MakeFunctionIterator([reader] { return reader->Next(); }); -} - -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/key_value_metadata.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/key_value_metadata.h deleted file mode 100644 index 8702ce73a..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/key_value_metadata.h +++ /dev/null @@ -1,98 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
-
-#pragma once
-
-#include <cstdint>
-#include <memory>
-#include <string>
-#include <unordered_map>
-#include <utility>
-#include <vector>
-
-#include "arrow/result.h"
-#include "arrow/status.h"
-#include "arrow/util/macros.h"
-#include "arrow/util/visibility.h"
-
-namespace arrow {
-
-/// \brief A container for key-value pair type metadata. Not thread-safe
-class ARROW_EXPORT KeyValueMetadata {
- public:
-  KeyValueMetadata();
-  KeyValueMetadata(std::vector<std::string> keys, std::vector<std::string> values);
-  explicit KeyValueMetadata(const std::unordered_map<std::string, std::string>& map);
-
-  static std::shared_ptr<KeyValueMetadata> Make(std::vector<std::string> keys,
-                                                std::vector<std::string> values);
-
-  void ToUnorderedMap(std::unordered_map<std::string, std::string>* out) const;
-  void Append(std::string key, std::string value);
-
-  Result<std::string> Get(const std::string& key) const;
-  bool Contains(const std::string& key) const;
-  // Note that deleting may invalidate known indices
-  Status Delete(const std::string& key);
-  Status Delete(int64_t index);
-  Status DeleteMany(std::vector<int64_t> indices);
-  Status Set(const std::string& key, const std::string& value);
-
-  void reserve(int64_t n);
-
-  int64_t size() const;
-  const std::string& key(int64_t i) const;
-  const std::string& value(int64_t i) const;
-  const std::vector<std::string>& keys() const { return keys_; }
-  const std::vector<std::string>& values() const { return values_; }
-
-  std::vector<std::pair<std::string, std::string>> sorted_pairs() const;
-
-  /// \brief Perform linear search for key, returning -1 if not found
-  int FindKey(const std::string& key) const;
-
-  std::shared_ptr<KeyValueMetadata> Copy() const;
-
-  /// \brief Return a new KeyValueMetadata by combining the passed metadata
-  /// with this KeyValueMetadata. Colliding keys will be overridden by the
-  /// passed metadata. Assumes keys in both containers are unique
-  std::shared_ptr<KeyValueMetadata> Merge(const KeyValueMetadata& other) const;
-
-  bool Equals(const KeyValueMetadata& other) const;
-  std::string ToString() const;
-
- private:
-  std::vector<std::string> keys_;
-  std::vector<std::string> values_;
-
-  ARROW_DISALLOW_COPY_AND_ASSIGN(KeyValueMetadata);
-};
-
-/// \brief Create a KeyValueMetadata instance
-///
-/// \param pairs key-value mapping
-ARROW_EXPORT std::shared_ptr<KeyValueMetadata> key_value_metadata(
-    const std::unordered_map<std::string, std::string>& pairs);
-
-/// \brief Create a KeyValueMetadata instance
-///
-/// \param keys sequence of metadata keys
-/// \param values sequence of corresponding metadata values
-ARROW_EXPORT std::shared_ptr<KeyValueMetadata> key_value_metadata(
-    std::vector<std::string> keys, std::vector<std::string> values);
-
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/launder.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/launder.h
deleted file mode 100644
index 9e4533c4b..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/launder.h
+++ /dev/null
@@ -1,35 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#pragma once
-
-#include <new>
-
-namespace arrow {
-namespace internal {
-
-#if __cpp_lib_launder
-using std::launder;
-#else
-template <typename T>
-constexpr T* launder(T* p) noexcept {
-  return p;
-}
-#endif
-
-}  // namespace internal
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/list_util.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/list_util.h
deleted file mode 100644
index 58deb8019..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/list_util.h
+++ /dev/null
@@ -1,55 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#pragma once
-
-#include <cstdint>
-#include <utility>
-
-#include "arrow/array/data.h"
-#include "arrow/result.h"
-
-namespace arrow {
-namespace list_util {
-namespace internal {
-
-/// \brief Calculate the smallest continuous range of values used by the
-/// var-length list-like input (list, map and list-view types).
-///
-/// \param input The input array such that is_var_length_list_like(input.type)
-/// is true
-/// \return A pair of (offset, length) describing the range
-ARROW_EXPORT Result<std::pair<int64_t, int64_t>> RangeOfValuesUsed(
-    const ArraySpan& input);
-
-/// \brief Calculate the sum of the sizes of all valid lists or list-views
-///
-/// This is usually the same as the length of the RangeOfValuesUsed() range, but
-/// it can be:
-/// - Smaller: when the child array contains many values that are not
-/// referenced by the lists or list-views in the parent array
-/// - Greater: when the list-views share child array ranges
-///
-/// \param input The input array such that is_var_length_list_like(input.type)
-/// is true
-/// \return The sum of all list or list-view sizes
-ARROW_EXPORT Result<int64_t> SumOfLogicalListSizes(const ArraySpan& input);
-
-}  // namespace internal
-
-}  // namespace list_util
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/logging.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/logging.h
deleted file mode 100644
index 2baa56056..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/logging.h
+++ /dev/null
@@ -1,259 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.
See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#ifdef GANDIVA_IR - -// The LLVM IR code doesn't have an NDEBUG mode. And, it shouldn't include references to -// streams or stdc++. So, making the DCHECK calls void in that case. - -#define ARROW_IGNORE_EXPR(expr) ((void)(expr)) - -#define DCHECK(condition) ARROW_IGNORE_EXPR(condition) -#define DCHECK_OK(status) ARROW_IGNORE_EXPR(status) -#define DCHECK_EQ(val1, val2) ARROW_IGNORE_EXPR(val1) -#define DCHECK_NE(val1, val2) ARROW_IGNORE_EXPR(val1) -#define DCHECK_LE(val1, val2) ARROW_IGNORE_EXPR(val1) -#define DCHECK_LT(val1, val2) ARROW_IGNORE_EXPR(val1) -#define DCHECK_GE(val1, val2) ARROW_IGNORE_EXPR(val1) -#define DCHECK_GT(val1, val2) ARROW_IGNORE_EXPR(val1) - -#else // !GANDIVA_IR - -#include -#include -#include - -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -namespace arrow { -namespace util { - -enum class ArrowLogLevel : int { - ARROW_DEBUG = -1, - ARROW_INFO = 0, - ARROW_WARNING = 1, - ARROW_ERROR = 2, - ARROW_FATAL = 3 -}; - -#define ARROW_LOG_INTERNAL(level) ::arrow::util::ArrowLog(__FILE__, __LINE__, level) -#define ARROW_LOG(level) ARROW_LOG_INTERNAL(::arrow::util::ArrowLogLevel::ARROW_##level) - -#define ARROW_IGNORE_EXPR(expr) ((void)(expr)) - -#define ARROW_CHECK_OR_LOG(condition, level) \ - ARROW_PREDICT_TRUE(condition) \ - ? ARROW_IGNORE_EXPR(0) \ - : ::arrow::util::Voidify() & ARROW_LOG(level) << " Check failed: " #condition " " - -#define ARROW_CHECK(condition) ARROW_CHECK_OR_LOG(condition, FATAL) - -// If 'to_call' returns a bad status, CHECK immediately with a logged message -// of 'msg' followed by the status. 
-#define ARROW_CHECK_OK_PREPEND(to_call, msg, level) \ - do { \ - ::arrow::Status _s = (to_call); \ - ARROW_CHECK_OR_LOG(_s.ok(), level) \ - << "Operation failed: " << ARROW_STRINGIFY(to_call) << "\n" \ - << (msg) << ": " << _s.ToString(); \ - } while (false) - -// If the status is bad, CHECK immediately, appending the status to the -// logged message. -#define ARROW_CHECK_OK(s) ARROW_CHECK_OK_PREPEND(s, "Bad status", FATAL) - -#define ARROW_CHECK_EQ(val1, val2) ARROW_CHECK((val1) == (val2)) -#define ARROW_CHECK_NE(val1, val2) ARROW_CHECK((val1) != (val2)) -#define ARROW_CHECK_LE(val1, val2) ARROW_CHECK((val1) <= (val2)) -#define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2)) -#define ARROW_CHECK_GE(val1, val2) ARROW_CHECK((val1) >= (val2)) -#define ARROW_CHECK_GT(val1, val2) ARROW_CHECK((val1) > (val2)) - -#ifdef NDEBUG -#define ARROW_DFATAL ::arrow::util::ArrowLogLevel::ARROW_WARNING - -// CAUTION: DCHECK_OK() always evaluates its argument, but other DCHECK*() macros -// only do so in debug mode. 
- -#define ARROW_DCHECK(condition) \ - while (false) ARROW_IGNORE_EXPR(condition); \ - while (false) ::arrow::util::detail::NullLog() -#define ARROW_DCHECK_OK(s) \ - ARROW_IGNORE_EXPR(s); \ - while (false) ::arrow::util::detail::NullLog() -#define ARROW_DCHECK_EQ(val1, val2) \ - while (false) ARROW_IGNORE_EXPR(val1); \ - while (false) ARROW_IGNORE_EXPR(val2); \ - while (false) ::arrow::util::detail::NullLog() -#define ARROW_DCHECK_NE(val1, val2) \ - while (false) ARROW_IGNORE_EXPR(val1); \ - while (false) ARROW_IGNORE_EXPR(val2); \ - while (false) ::arrow::util::detail::NullLog() -#define ARROW_DCHECK_LE(val1, val2) \ - while (false) ARROW_IGNORE_EXPR(val1); \ - while (false) ARROW_IGNORE_EXPR(val2); \ - while (false) ::arrow::util::detail::NullLog() -#define ARROW_DCHECK_LT(val1, val2) \ - while (false) ARROW_IGNORE_EXPR(val1); \ - while (false) ARROW_IGNORE_EXPR(val2); \ - while (false) ::arrow::util::detail::NullLog() -#define ARROW_DCHECK_GE(val1, val2) \ - while (false) ARROW_IGNORE_EXPR(val1); \ - while (false) ARROW_IGNORE_EXPR(val2); \ - while (false) ::arrow::util::detail::NullLog() -#define ARROW_DCHECK_GT(val1, val2) \ - while (false) ARROW_IGNORE_EXPR(val1); \ - while (false) ARROW_IGNORE_EXPR(val2); \ - while (false) ::arrow::util::detail::NullLog() - -#else -#define ARROW_DFATAL ::arrow::util::ArrowLogLevel::ARROW_FATAL - -#define ARROW_DCHECK ARROW_CHECK -#define ARROW_DCHECK_OK ARROW_CHECK_OK -#define ARROW_DCHECK_EQ ARROW_CHECK_EQ -#define ARROW_DCHECK_NE ARROW_CHECK_NE -#define ARROW_DCHECK_LE ARROW_CHECK_LE -#define ARROW_DCHECK_LT ARROW_CHECK_LT -#define ARROW_DCHECK_GE ARROW_CHECK_GE -#define ARROW_DCHECK_GT ARROW_CHECK_GT - -#endif // NDEBUG - -#define DCHECK ARROW_DCHECK -#define DCHECK_OK ARROW_DCHECK_OK -#define DCHECK_EQ ARROW_DCHECK_EQ -#define DCHECK_NE ARROW_DCHECK_NE -#define DCHECK_LE ARROW_DCHECK_LE -#define DCHECK_LT ARROW_DCHECK_LT -#define DCHECK_GE ARROW_DCHECK_GE -#define DCHECK_GT ARROW_DCHECK_GT - -// This code is adapted from 
-// https://github.com/ray-project/ray/blob/master/src/ray/util/logging.h. - -// To make the logging lib pluggable with other logging libs and make -// the implementation unawared by the user, ArrowLog is only a declaration -// which hide the implementation into logging.cc file. -// In logging.cc, we can choose different log libs using different macros. - -// This is also a null log which does not output anything. -class ARROW_EXPORT ArrowLogBase { - public: - virtual ~ArrowLogBase() {} - - virtual bool IsEnabled() const { return false; } - - template - ArrowLogBase& operator<<(const T& t) { - if (IsEnabled()) { - Stream() << t; - } - return *this; - } - - protected: - virtual std::ostream& Stream() = 0; -}; - -class ARROW_EXPORT ArrowLog : public ArrowLogBase { - public: - ArrowLog(const char* file_name, int line_number, ArrowLogLevel severity); - ~ArrowLog() override; - - /// Return whether or not current logging instance is enabled. - /// - /// \return True if logging is enabled and false otherwise. - bool IsEnabled() const override; - - /// The init function of arrow log for a program which should be called only once. - /// - /// \param appName The app name which starts the log. - /// \param severity_threshold Logging threshold for the program. - /// \param logDir Logging output file name. If empty, the log won't output to file. - static void StartArrowLog(const std::string& appName, - ArrowLogLevel severity_threshold = ArrowLogLevel::ARROW_INFO, - const std::string& logDir = ""); - - /// The shutdown function of arrow log, it should be used with StartArrowLog as a pair. - static void ShutDownArrowLog(); - - /// Install the failure signal handler to output call stack when crash. - /// If glog is not installed, this function won't do anything. - static void InstallFailureSignalHandler(); - - /// Uninstall the signal actions installed by InstallFailureSignalHandler. 
- static void UninstallSignalAction(); - - /// Return whether or not the log level is enabled in current setting. - /// - /// \param log_level The input log level to test. - /// \return True if input log level is not lower than the threshold. - static bool IsLevelEnabled(ArrowLogLevel log_level); - - private: - ARROW_DISALLOW_COPY_AND_ASSIGN(ArrowLog); - - // Hide the implementation of log provider by void *. - // Otherwise, lib user may define the same macro to use the correct header file. - void* logging_provider_; - /// True if log messages should be logged and false if they should be ignored. - bool is_enabled_; - - static ArrowLogLevel severity_threshold_; - - protected: - std::ostream& Stream() override; -}; - -// This class make ARROW_CHECK compilation pass to change the << operator to void. -// This class is copied from glog. -class ARROW_EXPORT Voidify { - public: - Voidify() {} - // This has to be an operator with a precedence lower than << but - // higher than ?: - void operator&(ArrowLogBase&) {} -}; - -namespace detail { - -/// @brief A helper for the nil log sink. -/// -/// Using this helper is analogous to sending log messages to /dev/null: -/// nothing gets logged. -class NullLog { - public: - /// The no-op output operator. - /// - /// @param [in] t - /// The object to send into the nil sink. - /// @return Reference to the updated object. - template - NullLog& operator<<(const T& t) { - return *this; - } -}; - -} // namespace detail -} // namespace util -} // namespace arrow - -#endif // GANDIVA_IR diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/macros.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/macros.h deleted file mode 100644 index b5675faa1..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/macros.h +++ /dev/null @@ -1,191 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. 
See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include - -#define ARROW_EXPAND(x) x -#define ARROW_STRINGIFY(x) #x -#define ARROW_CONCAT(x, y) x##y - -// From Google gutil -#ifndef ARROW_DISALLOW_COPY_AND_ASSIGN -#define ARROW_DISALLOW_COPY_AND_ASSIGN(TypeName) \ - TypeName(const TypeName&) = delete; \ - void operator=(const TypeName&) = delete -#endif - -#ifndef ARROW_DEFAULT_MOVE_AND_ASSIGN -#define ARROW_DEFAULT_MOVE_AND_ASSIGN(TypeName) \ - TypeName(TypeName&&) = default; \ - TypeName& operator=(TypeName&&) = default -#endif - -#define ARROW_UNUSED(x) (void)(x) -#define ARROW_ARG_UNUSED(x) -// -// GCC can be told that a certain branch is not likely to be taken (for -// instance, a CHECK failure), and use that information in static analysis. -// Giving it this information can help it optimize for the common case in -// the absence of better information (ie. -fprofile-arcs). 
-// -#if defined(__GNUC__) -#define ARROW_PREDICT_FALSE(x) (__builtin_expect(!!(x), 0)) -#define ARROW_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1)) -#define ARROW_NORETURN __attribute__((noreturn)) -#define ARROW_NOINLINE __attribute__((noinline)) -#define ARROW_PREFETCH(addr) __builtin_prefetch(addr) -#elif defined(_MSC_VER) -#define ARROW_NORETURN __declspec(noreturn) -#define ARROW_NOINLINE __declspec(noinline) -#define ARROW_PREDICT_FALSE(x) (x) -#define ARROW_PREDICT_TRUE(x) (x) -#define ARROW_PREFETCH(addr) -#else -#define ARROW_NORETURN -#define ARROW_PREDICT_FALSE(x) (x) -#define ARROW_PREDICT_TRUE(x) (x) -#define ARROW_PREFETCH(addr) -#endif - -#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER) -#define ARROW_RESTRICT __restrict -#else -#define ARROW_RESTRICT -#endif - -// ---------------------------------------------------------------------- -// C++/CLI support macros (see ARROW-1134) - -#ifndef NULLPTR - -#ifdef __cplusplus_cli -#define NULLPTR __nullptr -#else -#define NULLPTR nullptr -#endif - -#endif // ifndef NULLPTR - -// ---------------------------------------------------------------------- - -// clang-format off -// [[deprecated]] is only available in C++14, use this for the time being -// This macro takes an optional deprecation message -#ifdef __COVERITY__ -# define ARROW_DEPRECATED(...) -#else -# define ARROW_DEPRECATED(...) [[deprecated(__VA_ARGS__)]] -#endif - -#ifdef __COVERITY__ -# define ARROW_DEPRECATED_ENUM_VALUE(...) -#else -# define ARROW_DEPRECATED_ENUM_VALUE(...) 
[[deprecated(__VA_ARGS__)]] -#endif - -// clang-format on - -// Macros to disable deprecation warnings - -#ifdef __clang__ -#define ARROW_SUPPRESS_DEPRECATION_WARNING \ - _Pragma("clang diagnostic push"); \ - _Pragma("clang diagnostic ignored \"-Wdeprecated-declarations\"") -#define ARROW_UNSUPPRESS_DEPRECATION_WARNING _Pragma("clang diagnostic pop") -#elif defined(__GNUC__) -#define ARROW_SUPPRESS_DEPRECATION_WARNING \ - _Pragma("GCC diagnostic push"); \ - _Pragma("GCC diagnostic ignored \"-Wdeprecated-declarations\"") -#define ARROW_UNSUPPRESS_DEPRECATION_WARNING _Pragma("GCC diagnostic pop") -#elif defined(_MSC_VER) -#define ARROW_SUPPRESS_DEPRECATION_WARNING \ - __pragma(warning(push)) __pragma(warning(disable : 4996)) -#define ARROW_UNSUPPRESS_DEPRECATION_WARNING __pragma(warning(pop)) -#else -#define ARROW_SUPPRESS_DEPRECATION_WARNING -#define ARROW_UNSUPPRESS_DEPRECATION_WARNING -#endif - -// ---------------------------------------------------------------------- - -// macros to disable padding -// these macros are portable across different compilers and platforms -//[https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L1355] -#if !defined(MANUALLY_ALIGNED_STRUCT) -#if defined(_MSC_VER) -#define MANUALLY_ALIGNED_STRUCT(alignment) \ - __pragma(pack(1)); \ - struct __declspec(align(alignment)) -#define STRUCT_END(name, size) \ - __pragma(pack()); \ - static_assert(sizeof(name) == size, "compiler breaks packing rules") -#elif defined(__GNUC__) || defined(__clang__) -#define MANUALLY_ALIGNED_STRUCT(alignment) \ - _Pragma("pack(1)") struct __attribute__((aligned(alignment))) -#define STRUCT_END(name, size) \ - _Pragma("pack()") static_assert(sizeof(name) == size, "compiler breaks packing rules") -#else -#error Unknown compiler, please define structure alignment macros -#endif -#endif // !defined(MANUALLY_ALIGNED_STRUCT) - -// ---------------------------------------------------------------------- -// Convenience macro disabling a 
particular UBSan check in a function - -#if defined(__clang__) -#define ARROW_DISABLE_UBSAN(feature) __attribute__((no_sanitize(feature))) -#else -#define ARROW_DISABLE_UBSAN(feature) -#endif - -// ---------------------------------------------------------------------- -// Machine information - -#if INTPTR_MAX == INT64_MAX -#define ARROW_BITNESS 64 -#elif INTPTR_MAX == INT32_MAX -#define ARROW_BITNESS 32 -#else -#error Unexpected INTPTR_MAX -#endif - -// ---------------------------------------------------------------------- -// From googletest -// (also in parquet-cpp) - -// When you need to test the private or protected members of a class, -// use the FRIEND_TEST macro to declare your tests as friends of the -// class. For example: -// -// class MyClass { -// private: -// void MyMethod(); -// FRIEND_TEST(MyClassTest, MyMethod); -// }; -// -// class MyClassTest : public testing::Test { -// // ... -// }; -// -// TEST_F(MyClassTest, MyMethod) { -// // Can call MyClass::MyMethod() here. -// } - -#define FRIEND_TEST(test_case_name, test_name) \ - friend class test_case_name##_##test_name##_Test diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/map.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/map.h deleted file mode 100644 index 552390906..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/map.h +++ /dev/null @@ -1,63 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#pragma once
-
-#include <utility>
-
-#include "arrow/result.h"
-
-namespace arrow {
-namespace internal {
-
-/// Helper providing single-lookup conditional insertion into std::map or
-/// std::unordered_map. If `key` exists in the container, an iterator to that pair
-/// will be returned. If `key` does not exist in the container, `gen(key)` will be
-/// invoked and its return value inserted.
-template <typename Map, typename Gen>
-auto GetOrInsertGenerated(Map* map, typename Map::key_type key, Gen&& gen)
-    -> decltype(map->begin()->second = gen(map->begin()->first), map->begin()) {
-  decltype(gen(map->begin()->first)) placeholder{};
-
-  auto it_success = map->emplace(std::move(key), std::move(placeholder));
-  if (it_success.second) {
-    // insertion of placeholder succeeded, overwrite it with gen()
-    const auto& inserted_key = it_success.first->first;
-    auto* value = &it_success.first->second;
-    *value = gen(inserted_key);
-  }
-  return it_success.first;
-}
-
-template <typename Map, typename Gen>
-auto GetOrInsertGenerated(Map* map, typename Map::key_type key, Gen&& gen)
-    -> Result<decltype(map->begin()->second = gen(map->begin()->first).ValueOrDie(),
-                       map->begin())> {
-  decltype(gen(map->begin()->first).ValueOrDie()) placeholder{};
-
-  auto it_success = map->emplace(std::move(key), std::move(placeholder));
-  if (it_success.second) {
-    // insertion of placeholder succeeded, overwrite it with gen()
-    const auto& inserted_key = it_success.first->first;
-    auto* value = &it_success.first->second;
-    ARROW_ASSIGN_OR_RAISE(*value, gen(inserted_key));
-  }
-  return it_success.first;
-}
-
-}  // namespace internal
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/math_constants.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/math_constants.h deleted file mode 100644 index 7ee87c5d6..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/math_constants.h +++ /dev/null @@ -1,32 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include - -// Not provided by default in MSVC, -// and _USE_MATH_DEFINES is not reliable with unity builds -#ifndef M_PI -#define M_PI 3.14159265358979323846 -#endif -#ifndef M_PI_2 -#define M_PI_2 1.57079632679489661923 -#endif -#ifndef M_PI_4 -#define M_PI_4 0.785398163397448309616 -#endif diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/memory.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/memory.h deleted file mode 100644 index 4250d0694..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/memory.h +++ /dev/null @@ -1,43 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. 
The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include - -#include "arrow/util/macros.h" - -namespace arrow { -namespace internal { - -// A helper function for doing memcpy with multiple threads. This is required -// to saturate the memory bandwidth of modern cpus. -void parallel_memcopy(uint8_t* dst, const uint8_t* src, int64_t nbytes, - uintptr_t block_size, int num_threads); - -// A helper function for checking if two wrapped objects implementing `Equals` -// are equal. -template -bool SharedPtrEquals(const std::shared_ptr& left, const std::shared_ptr& right) { - if (left == right) return true; - if (left == NULLPTR || right == NULLPTR) return false; - return left->Equals(*right); -} - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/mutex.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/mutex.h deleted file mode 100644 index ac63cf70c..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/mutex.h +++ /dev/null @@ -1,85 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. 
The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include - -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -namespace arrow { -namespace util { - -/// A wrapper around std::mutex since we can't use it directly in -/// public headers due to C++/CLI. -/// https://docs.microsoft.com/en-us/cpp/standard-library/mutex#remarks -class ARROW_EXPORT Mutex { - public: - Mutex(); - Mutex(Mutex&&) = default; - Mutex& operator=(Mutex&&) = default; - - /// A Guard is falsy if a lock could not be acquired. - class ARROW_EXPORT Guard { - public: - Guard() : locked_(NULLPTR, [](Mutex* mutex) {}) {} - Guard(Guard&&) = default; - Guard& operator=(Guard&&) = default; - - explicit operator bool() const { return bool(locked_); } - - void Unlock() { locked_.reset(); } - - private: - explicit Guard(Mutex* locked); - - std::unique_ptr locked_; - friend Mutex; - }; - - Guard TryLock(); - Guard Lock(); - - private: - struct Impl; - std::unique_ptr impl_; -}; - -#ifndef _WIN32 -/// Return a pointer to a process-wide, process-specific Mutex that can be used -/// at any point in a child process. NULL is returned when called in the parent. -/// -/// The rule is to first check that getpid() corresponds to the parent process pid -/// and, if not, call this function to lock any after-fork reinitialization code. -/// Like this: -/// -/// std::atomic pid{getpid()}; -/// ... 
-/// if (pid.load() != getpid()) { -/// // In child process -/// auto lock = GlobalForkSafeMutex()->Lock(); -/// if (pid.load() != getpid()) { -/// // Reinitialize internal structures after fork -/// ... -/// pid.store(getpid()); -ARROW_EXPORT -Mutex* GlobalForkSafeMutex(); -#endif - -} // namespace util -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/parallel.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/parallel.h deleted file mode 100644 index 80f60fbdb..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/parallel.h +++ /dev/null @@ -1,102 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include - -#include "arrow/status.h" -#include "arrow/util/functional.h" -#include "arrow/util/thread_pool.h" -#include "arrow/util/vector.h" - -namespace arrow { -namespace internal { - -// A parallelizer that takes a `Status(int)` function and calls it with -// arguments between 0 and `num_tasks - 1`, on an arbitrary number of threads. 
- -template -Status ParallelFor(int num_tasks, FUNCTION&& func, - Executor* executor = internal::GetCpuThreadPool()) { - std::vector> futures(num_tasks); - - for (int i = 0; i < num_tasks; ++i) { - ARROW_ASSIGN_OR_RAISE(futures[i], executor->Submit(func, i)); - } - auto st = Status::OK(); - for (auto& fut : futures) { - st &= fut.status(); - } - return st; -} - -template ::ValueType> -Future> ParallelForAsync( - std::vector inputs, FUNCTION&& func, - Executor* executor = internal::GetCpuThreadPool()) { - std::vector> futures(inputs.size()); - for (size_t i = 0; i < inputs.size(); ++i) { - ARROW_ASSIGN_OR_RAISE(futures[i], executor->Submit(func, i, std::move(inputs[i]))); - } - return All(std::move(futures)) - .Then([](const std::vector>& results) -> Result> { - return UnwrapOrRaise(results); - }); -} - -// A parallelizer that takes a `Status(int)` function and calls it with -// arguments between 0 and `num_tasks - 1`, in sequence or in parallel, -// depending on the input boolean. - -template -Status OptionalParallelFor(bool use_threads, int num_tasks, FUNCTION&& func, - Executor* executor = internal::GetCpuThreadPool()) { - if (use_threads) { - return ParallelFor(num_tasks, std::forward(func), executor); - } else { - for (int i = 0; i < num_tasks; ++i) { - RETURN_NOT_OK(func(i)); - } - return Status::OK(); - } -} - -// A parallelizer that takes a `Result(int index, T item)` function and -// calls it with each item from the input array, in sequence or in parallel, -// depending on the input boolean. 
- -template ::ValueType> -Future> OptionalParallelForAsync( - bool use_threads, std::vector inputs, FUNCTION&& func, - Executor* executor = internal::GetCpuThreadPool()) { - if (use_threads) { - return ParallelForAsync(std::move(inputs), std::forward(func), executor); - } else { - std::vector result(inputs.size()); - for (size_t i = 0; i < inputs.size(); ++i) { - ARROW_ASSIGN_OR_RAISE(result[i], func(i, inputs[i])); - } - return result; - } -} - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/pcg_random.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/pcg_random.h deleted file mode 100644 index 768f23282..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/pcg_random.h +++ /dev/null @@ -1,33 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#pragma once - -#include "arrow/vendored/pcg/pcg_random.hpp" // IWYU pragma: export - -namespace arrow { -namespace random { - -using pcg32 = ::arrow_vendored::pcg32; -using pcg64 = ::arrow_vendored::pcg64; -using pcg32_fast = ::arrow_vendored::pcg32_fast; -using pcg64_fast = ::arrow_vendored::pcg64_fast; -using pcg32_oneseq = ::arrow_vendored::pcg32_oneseq; -using pcg64_oneseq = ::arrow_vendored::pcg64_oneseq; - -} // namespace random -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/print.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/print.h deleted file mode 100644 index 82cea473c..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/print.h +++ /dev/null @@ -1,77 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
template - -#pragma once - -#include -#include "arrow/util/string.h" - -using arrow::internal::ToChars; - -namespace arrow { -namespace internal { - -namespace detail { - -template -struct TuplePrinter { - static void Print(OStream* os, const Tuple& t) { - TuplePrinter::Print(os, t); - *os << std::get(t); - } -}; - -template -struct TuplePrinter { - static void Print(OStream* os, const Tuple& t) {} -}; - -} // namespace detail - -// Print elements from a tuple to a stream, in order. -// Typical use is to pack a bunch of existing values with std::forward_as_tuple() -// before passing it to this function. -template -void PrintTuple(OStream* os, const std::tuple& tup) { - detail::TuplePrinter, sizeof...(Args)>::Print(os, tup); -} - -template -struct PrintVector { - const Range& range_; - const Separator& separator_; - - template // template to dodge inclusion of - friend Os& operator<<(Os& os, PrintVector l) { - bool first = true; - os << "["; - for (const auto& element : l.range_) { - if (first) { - first = false; - } else { - os << l.separator_; - } - os << ToChars(element); // use ToChars to avoid locale dependence - } - os << "]"; - return os; - } -}; -template -PrintVector(const Range&, const Separator&) -> PrintVector; -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/queue.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/queue.h deleted file mode 100644 index 6c71fa6e1..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/queue.h +++ /dev/null @@ -1,29 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include "arrow/vendored/ProducerConsumerQueue.h" - -namespace arrow { -namespace util { - -template -using SpscQueue = arrow_vendored::folly::ProducerConsumerQueue; - -} -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/range.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/range.h deleted file mode 100644 index 205532879..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/range.h +++ /dev/null @@ -1,258 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#pragma once - -#include -#include -#include -#include -#include -#include -#include - -namespace arrow::internal { - -/// Create a vector containing the values from start up to stop -template -std::vector Iota(T start, T stop) { - if (start > stop) { - return {}; - } - std::vector result(static_cast(stop - start)); - std::iota(result.begin(), result.end(), start); - return result; -} - -/// Create a vector containing the values from 0 up to length -template -std::vector Iota(T length) { - return Iota(static_cast(0), length); -} - -/// Create a range from a callable which takes a single index parameter -/// and returns the value of iterator on each call and a length. -/// Only iterators obtained from the same range should be compared, the -/// behaviour generally similar to other STL containers. -template -class LazyRange { - private: - // callable which generates the values - // has to be defined at the beginning of the class for type deduction - const Generator gen_; - // the length of the range - int64_t length_; -#ifdef _MSC_VER - // workaround to VS2010 not supporting decltype properly - // see https://stackoverflow.com/questions/21782846/decltype-for-class-member-function - static Generator gen_static_; -#endif - - public: -#ifdef _MSC_VER - using return_type = decltype(gen_static_(0)); -#else - using return_type = decltype(gen_(0)); -#endif - - /// Construct a new range from a callable and length - LazyRange(Generator gen, int64_t length) : gen_(gen), length_(length) {} - - // Class of the dependent iterator, created implicitly by begin and end - class RangeIter { - public: - using difference_type = int64_t; - using value_type = return_type; - using reference = const value_type&; - using pointer = const value_type*; - using iterator_category = std::forward_iterator_tag; - -#ifdef _MSC_VER - // msvc complains about unchecked iterators, - // see https://stackoverflow.com/questions/21655496/error-c4996-checked-iterators - using _Unchecked_type = typename 
LazyRange::RangeIter; -#endif - - RangeIter() = delete; - RangeIter(const RangeIter& other) = default; - RangeIter& operator=(const RangeIter& other) = default; - - RangeIter(const LazyRange& range, int64_t index) - : range_(&range), index_(index) {} - - const return_type operator*() const { return range_->gen_(index_); } - - RangeIter operator+(difference_type length) const { - return RangeIter(*range_, index_ + length); - } - - // pre-increment - RangeIter& operator++() { - ++index_; - return *this; - } - - // post-increment - RangeIter operator++(int) { - auto copy = RangeIter(*this); - ++index_; - return copy; - } - - bool operator==(const typename LazyRange::RangeIter& other) const { - return this->index_ == other.index_ && this->range_ == other.range_; - } - - bool operator!=(const typename LazyRange::RangeIter& other) const { - return this->index_ != other.index_ || this->range_ != other.range_; - } - - int64_t operator-(const typename LazyRange::RangeIter& other) const { - return this->index_ - other.index_; - } - - bool operator<(const typename LazyRange::RangeIter& other) const { - return this->index_ < other.index_; - } - - private: - // parent range reference - const LazyRange* range_; - // current index - int64_t index_; - }; - - friend class RangeIter; - - // Create a new begin const iterator - RangeIter begin() { return RangeIter(*this, 0); } - - // Create a new end const iterator - RangeIter end() { return RangeIter(*this, length_); } -}; - -/// Helper function to create a lazy range from a callable (e.g. lambda) and length -template -LazyRange MakeLazyRange(Generator&& gen, int64_t length) { - return LazyRange(std::forward(gen), length); -} - -/// \brief A helper for iterating multiple ranges simultaneously, similar to C++23's -/// zip() view adapter modelled after python's built-in zip() function. -/// -/// \code {.cpp} -/// const std::vector& tables = ... -/// std::function()> GetNames = ... 
-/// for (auto [table, name] : Zip(tables, GetNames())) { -/// static_assert(std::is_same_v); -/// static_assert(std::is_same_v); -/// // temporaries (like this vector of strings) are kept alive for the -/// // duration of a loop and are safely movable). -/// RegisterTableWithName(std::move(name), &table); -/// } -/// \endcode -/// -/// The zipped sequence ends as soon as any of its member ranges ends. -/// -/// Always use `auto` for the loop's declaration; it will always be a tuple -/// of references so for example using `const auto&` will compile but will -/// *look* like forcing const-ness even though the members of the tuple are -/// still mutable references. -/// -/// NOTE: we *could* make Zip a more full fledged range and enable things like -/// - gtest recognizing it as a container; it currently doesn't since Zip is -/// always mutable so this breaks: -/// EXPECT_THAT(Zip(std::vector{0}, std::vector{1}), -/// ElementsAre(std::tuple{0, 1})); -/// - letting it be random access when possible so we can do things like *sort* -/// parallel ranges -/// - ... -/// -/// However doing this will increase the compile time overhead of using Zip as -/// long as we're still using headers. Therefore until we can use c++20 modules: -/// *don't* extend Zip. -template -struct Zip; - -template -Zip(Ranges&&...) -> Zip, std::index_sequence_for>; - -template -struct Zip, std::index_sequence> { - explicit Zip(Ranges... ranges) : ranges_(std::forward(ranges)...) {} - - std::tuple ranges_; - - using sentinel = std::tuple(ranges_)))...>; - constexpr sentinel end() { return {std::end(std::get(ranges_))...}; } - - struct iterator : std::tuple(ranges_)))...> { - using std::tuple(ranges_)))...>::tuple; - - constexpr auto operator*() { - return std::tuple(*this))...>{*std::get(*this)...}; - } - - constexpr iterator& operator++() { - (++std::get(*this), ...); - return *this; - } - - constexpr bool operator!=(const sentinel& s) const { - bool all_iterators_valid = (... 
&& (std::get(*this) != std::get(s))); - return all_iterators_valid; - } - }; - constexpr iterator begin() { return {std::begin(std::get(ranges_))...}; } -}; - -/// \brief A lazy sequence of integers which starts from 0 and never stops. -/// -/// This can be used in conjunction with Zip() to emulate python's built-in -/// enumerate() function: -/// -/// \code {.cpp} -/// const std::vector& tables = ... -/// for (auto [i, table] : Zip(Enumerate<>, tables)) { -/// std::cout << "#" << i << ": " << table.name() << std::endl; -/// } -/// \endcode -template -constexpr auto Enumerate = [] { - struct { - struct sentinel {}; - constexpr sentinel end() const { return {}; } - - struct iterator { - I value{0}; - - constexpr I operator*() { return value; } - - constexpr iterator& operator++() { - ++value; - return *this; - } - - constexpr std::true_type operator!=(sentinel) const { return {}; } - }; - constexpr iterator begin() const { return {}; } - } out; - - return out; -}(); - -} // namespace arrow::internal diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/ree_util.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/ree_util.h deleted file mode 100644 index a3e745ba8..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/ree_util.h +++ /dev/null @@ -1,582 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include - -#include "arrow/array/data.h" -#include "arrow/type_traits.h" -#include "arrow/util/checked_cast.h" -#include "arrow/util/macros.h" - -namespace arrow { -namespace ree_util { - -/// \brief Get the child array holding the run ends from an REE array -inline const ArraySpan& RunEndsArray(const ArraySpan& span) { return span.child_data[0]; } - -/// \brief Get the child array holding the data values from an REE array -inline const ArraySpan& ValuesArray(const ArraySpan& span) { return span.child_data[1]; } - -/// \brief Get a pointer to run ends values of an REE array -template -const RunEndCType* RunEnds(const ArraySpan& span) { - assert(RunEndsArray(span).type->id() == CTypeTraits::ArrowType::type_id); - return RunEndsArray(span).GetValues(1); -} - -/// \brief Perform basic validations on the parameters of an REE array -/// and its two children arrays -/// -/// All the checks complete in O(1) time. 
Consequently, this function: -/// - DOES NOT check that run_ends is sorted and all-positive -/// - DOES NOT check the actual contents of the run_ends and values arrays -Status ValidateRunEndEncodedChildren(const RunEndEncodedType& type, - int64_t logical_length, - const std::shared_ptr& run_ends_data, - const std::shared_ptr& values_data, - int64_t null_count, int64_t logical_offset); - -/// \brief Compute the logical null count of an REE array -int64_t LogicalNullCount(const ArraySpan& span); - -namespace internal { - -/// \brief Uses binary-search to find the physical offset given a logical offset -/// and run-end values -/// -/// \return the physical offset or run_ends_size if the physical offset is not -/// found in run_ends -template -int64_t FindPhysicalIndex(const RunEndCType* run_ends, int64_t run_ends_size, int64_t i, - int64_t absolute_offset) { - assert(absolute_offset + i >= 0); - auto it = std::upper_bound(run_ends, run_ends + run_ends_size, absolute_offset + i); - int64_t result = std::distance(run_ends, it); - assert(result <= run_ends_size); - return result; -} - -/// \brief Uses binary-search to calculate the range of physical values (and -/// run-ends) necessary to represent the logical range of values from -/// offset to length -/// -/// \return a pair of physical offset and physical length -template -std::pair FindPhysicalRange(const RunEndCType* run_ends, - int64_t run_ends_size, int64_t length, - int64_t offset) { - const int64_t physical_offset = - FindPhysicalIndex(run_ends, run_ends_size, 0, offset); - // The physical length is calculated by finding the offset of the last element - // and adding 1 to it, so first we ensure there is at least one element. 
- if (length == 0) { - return {physical_offset, 0}; - } - const int64_t physical_index_of_last = FindPhysicalIndex( - run_ends + physical_offset, run_ends_size - physical_offset, length - 1, offset); - - assert(physical_index_of_last < run_ends_size - physical_offset); - return {physical_offset, physical_index_of_last + 1}; -} - -/// \brief Uses binary-search to calculate the number of physical values (and -/// run-ends) necessary to represent the logical range of values from -/// offset to length -template -int64_t FindPhysicalLength(const RunEndCType* run_ends, int64_t run_ends_size, - int64_t length, int64_t offset) { - auto [_, physical_length] = - FindPhysicalRange(run_ends, run_ends_size, length, offset); - // GH-37107: This is a workaround for GCC 7. GCC 7 doesn't ignore - // variables in structured binding automatically from unused - // variables when one of these variables are used. - // See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81767 - ARROW_UNUSED(_); - return physical_length; -} - -/// \brief Find the physical index into the values array of the REE ArraySpan -/// -/// This function uses binary-search, so it has a O(log N) cost. -template -int64_t FindPhysicalIndex(const ArraySpan& span, int64_t i, int64_t absolute_offset) { - const int64_t run_ends_size = RunEndsArray(span).length; - return FindPhysicalIndex(RunEnds(span), run_ends_size, i, absolute_offset); -} - -/// \brief Find the physical length of an REE ArraySpan -/// -/// The physical length of an REE is the number of physical values (and -/// run-ends) necessary to represent the logical range of values from -/// offset to length. -/// -/// Avoid calling this function if the physical length can be established in -/// some other way (e.g. when iterating over the runs sequentially until the -/// end). This function uses binary-search, so it has a O(log N) cost. 
-template -int64_t FindPhysicalLength(const ArraySpan& span) { - return FindPhysicalLength( - /*run_ends=*/RunEnds(span), - /*run_ends_size=*/RunEndsArray(span).length, - /*length=*/span.length, - /*offset=*/span.offset); -} - -template -struct PhysicalIndexFinder; - -// non-inline implementations for each run-end type -ARROW_EXPORT int64_t FindPhysicalIndexImpl16(PhysicalIndexFinder& self, - int64_t i); -ARROW_EXPORT int64_t FindPhysicalIndexImpl32(PhysicalIndexFinder& self, - int64_t i); -ARROW_EXPORT int64_t FindPhysicalIndexImpl64(PhysicalIndexFinder& self, - int64_t i); - -/// \brief Stateful version of FindPhysicalIndex() that caches the result of -/// the previous search and uses it to optimize the next search. -/// -/// When new queries for the physical index of a logical index come in, -/// binary search is performed again but the first candidate checked is the -/// result of the previous search (cached physical index) instead of the -/// midpoint of the run-ends array. -/// -/// If that test fails, internal::FindPhysicalIndex() is called with one of the -/// partitions defined by the cached index. If the queried logical indices -/// follow an increasing or decreasing pattern, this first test is much more -/// effective in (1) finding the answer right away (close logical indices belong -/// to the same runs) or (2) discarding many more candidates than probing -/// the midpoint would. -/// -/// The most adversarial case (i.e. alternating between 0 and length-1 queries) -/// only adds one extra binary search probe when compared to always starting -/// binary search from the midpoint without any of these optimizations. -/// -/// \tparam RunEndCType The numeric type of the run-ends array. 
-template -struct PhysicalIndexFinder { - const ArraySpan array_span; - const RunEndCType* run_ends; - int64_t last_physical_index = 0; - - explicit PhysicalIndexFinder(const ArrayData& data) - : array_span(data), - run_ends(RunEndsArray(array_span).template GetValues(1)) { - assert(CTypeTraits::ArrowType::type_id == - ::arrow::internal::checked_cast(*data.type) - .run_end_type() - ->id()); - } - - /// \brief Find the physical index into the values array of the REE array. - /// - /// \pre 0 <= i < array_span.length() - /// \param i the logical index into the REE array - /// \return the physical index into the values array - int64_t FindPhysicalIndex(int64_t i) { - if constexpr (std::is_same_v) { - return FindPhysicalIndexImpl16(*this, i); - } else if constexpr (std::is_same_v) { - return FindPhysicalIndexImpl32(*this, i); - } else { - static_assert(std::is_same_v, "Unsupported RunEndCType."); - return FindPhysicalIndexImpl64(*this, i); - } - } -}; - -} // namespace internal - -/// \brief Find the physical index into the values array of the REE ArraySpan -/// -/// This function uses binary-search, so it has a O(log N) cost. -ARROW_EXPORT int64_t FindPhysicalIndex(const ArraySpan& span, int64_t i, - int64_t absolute_offset); - -/// \brief Find the physical length of an REE ArraySpan -/// -/// The physical length of an REE is the number of physical values (and -/// run-ends) necessary to represent the logical range of values from -/// offset to length. -/// -/// Avoid calling this function if the physical length can be established in -/// some other way (e.g. when iterating over the runs sequentially until the -/// end). This function uses binary-search, so it has a O(log N) cost. 
-ARROW_EXPORT int64_t FindPhysicalLength(const ArraySpan& span);
-
-/// \brief Find the physical range of physical values referenced by the REE in
-/// the logical range from offset to offset + length
-///
-/// \return a pair of physical offset and physical length
-ARROW_EXPORT std::pair<int64_t, int64_t> FindPhysicalRange(const ArraySpan& span,
-                                                           int64_t offset,
-                                                           int64_t length);
-
-// Publish PhysicalIndexFinder outside of the internal namespace.
-template <typename RunEndCType>
-using PhysicalIndexFinder = internal::PhysicalIndexFinder<RunEndCType>;
-
-template <typename RunEndCType>
-class RunEndEncodedArraySpan {
- private:
-  struct PrivateTag {};
-
- public:
-  /// \brief Iterator representing the current run during iteration over a
-  /// run-end encoded array
-  class Iterator {
-   public:
-    Iterator(PrivateTag, const RunEndEncodedArraySpan& span, int64_t logical_pos,
-             int64_t physical_pos)
-        : span(span), logical_pos_(logical_pos), physical_pos_(physical_pos) {}
-
-    /// \brief Return the physical index of the run
-    ///
-    /// The values array can be addressed with this index to get the value
-    /// that makes up the run.
-    ///
-    /// NOTE: if this Iterator is equal to RunEndEncodedArraySpan::end(),
-    /// the value returned is undefined.
-    int64_t index_into_array() const { return physical_pos_; }
-
-    /// \brief Return the initial logical position of the run
-    ///
-    /// If this Iterator is equal to RunEndEncodedArraySpan::end(), this is
-    /// the same as RunEndEncodedArraySpan::length().
-    int64_t logical_position() const { return logical_pos_; }
-
-    /// \brief Return the logical position immediately after the run.
-    ///
-    /// Pre-condition: *this != RunEndEncodedArraySpan::end()
-    int64_t run_end() const { return span.run_end(physical_pos_); }
-
-    /// \brief Returns the logical length of the run.
-    ///
-    /// Pre-condition: *this != RunEndEncodedArraySpan::end()
-    int64_t run_length() const { return run_end() - logical_pos_; }
-
-    /// \brief Check if the iterator is at the end of the array.
- /// - /// This can be used to avoid paying the cost of a call to - /// RunEndEncodedArraySpan::end(). - /// - /// \return true if the iterator is at the end of the array - bool is_end(const RunEndEncodedArraySpan& span) const { - return logical_pos_ >= span.length(); - } - - Iterator& operator++() { - logical_pos_ = span.run_end(physical_pos_); - physical_pos_ += 1; - return *this; - } - - Iterator operator++(int) { - const Iterator prev = *this; - ++(*this); - return prev; - } - - Iterator& operator--() { - physical_pos_ -= 1; - logical_pos_ = (physical_pos_ > 0) ? span.run_end(physical_pos_ - 1) : 0; - return *this; - } - - Iterator operator--(int) { - const Iterator prev = *this; - --(*this); - return prev; - } - - bool operator==(const Iterator& other) const { - return logical_pos_ == other.logical_pos_; - } - - bool operator!=(const Iterator& other) const { - return logical_pos_ != other.logical_pos_; - } - - public: - const RunEndEncodedArraySpan& span; - - private: - int64_t logical_pos_; - int64_t physical_pos_; - }; - - // Prevent implicit ArrayData -> ArraySpan conversion in - // RunEndEncodedArraySpan instantiation. - explicit RunEndEncodedArraySpan(const ArrayData& data) = delete; - - /// \brief Construct a RunEndEncodedArraySpan from an ArraySpan and new - /// absolute offset and length. - /// - /// RunEndEncodedArraySpan{span, off, len} is equivalent to: - /// - /// span.SetSlice(off, len); - /// RunEndEncodedArraySpan{span} - /// - /// ArraySpan::SetSlice() updates the null_count to kUnknownNullCount, but - /// we don't need that here as REE arrays have null_count set to 0 by - /// convention. 
- explicit RunEndEncodedArraySpan(const ArraySpan& array_span, int64_t offset, - int64_t length) - : array_span_{array_span}, - run_ends_(RunEnds(array_span_)), - length_(length), - offset_(offset) { - assert(array_span_.type->id() == Type::RUN_END_ENCODED); - } - - explicit RunEndEncodedArraySpan(const ArraySpan& array_span) - : RunEndEncodedArraySpan(array_span, array_span.offset, array_span.length) {} - - int64_t offset() const { return offset_; } - int64_t length() const { return length_; } - - int64_t PhysicalIndex(int64_t logical_pos) const { - return internal::FindPhysicalIndex(run_ends_, RunEndsArray(array_span_).length, - logical_pos, offset_); - } - - /// \brief Create an iterator from a logical position and its - /// pre-computed physical offset into the run ends array - /// - /// \param logical_pos is an index in the [0, length()] range - /// \param physical_offset the pre-calculated PhysicalIndex(logical_pos) - Iterator iterator(int64_t logical_pos, int64_t physical_offset) const { - return Iterator{PrivateTag{}, *this, logical_pos, physical_offset}; - } - - /// \brief Create an iterator from a logical position - /// - /// \param logical_pos is an index in the [0, length()] range - Iterator iterator(int64_t logical_pos) const { - if (logical_pos < length()) { - return iterator(logical_pos, PhysicalIndex(logical_pos)); - } - // If logical_pos is above the valid range, use length() as the logical - // position and calculate the physical address right after the last valid - // physical position. Which is the physical index of the last logical - // position, plus 1. - return (length() == 0) ? 
iterator(0, PhysicalIndex(0)) - : iterator(length(), PhysicalIndex(length() - 1) + 1); - } - - /// \brief Create an iterator representing the logical begin of the run-end - /// encoded array - Iterator begin() const { return iterator(0, PhysicalIndex(0)); } - - /// \brief Create an iterator representing the first invalid logical position - /// of the run-end encoded array - /// - /// \warning Avoid calling end() in a loop, as it will recompute the physical - /// length of the array on each call (O(log N) cost per call). - /// - /// \par You can write your loops like this instead: - /// \code - /// for (auto it = array.begin(), end = array.end(); it != end; ++it) { - /// // ... - /// } - /// \endcode - /// - /// \par Or this version that does not look like idiomatic C++, but removes - /// the need for calling end() completely: - /// \code - /// for (auto it = array.begin(); !it.is_end(array); ++it) { - /// // ... - /// } - /// \endcode - Iterator end() const { - return iterator(length(), - (length() == 0) ? PhysicalIndex(0) : PhysicalIndex(length() - 1) + 1); - } - - // Pre-condition: physical_pos < RunEndsArray(array_span_).length); - inline int64_t run_end(int64_t physical_pos) const { - assert(physical_pos < RunEndsArray(array_span_).length); - // Logical index of the end of the run at physical_pos with offset applied - const int64_t logical_run_end = - std::max(static_cast(run_ends_[physical_pos]) - offset(), 0); - // The current run may go further than the logical length, cap it - return std::min(logical_run_end, length()); - } - - private: - const ArraySpan& array_span_; - const RunEndCType* run_ends_; - const int64_t length_; - const int64_t offset_; -}; - -/// \brief Iterate over two run-end encoded arrays in runs or sub-runs that are -/// inside run boundaries on both inputs -/// -/// Both RunEndEncodedArraySpan should have the same logical length. Instances -/// of this iterator only hold references to the RunEndEncodedArraySpan inputs. 
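The sub-run merging described above can be illustrated with a hypothetical helper (not the Arrow iterator itself): because a sub-run boundary occurs wherever either input has a run boundary, the merged boundaries are simply the sorted union of both run-ends arrays:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the Arrow iterator): given the run-ends of two
// REE arrays with the same logical length, the merged sub-run boundaries
// are the sorted union of both boundary sets.
std::vector<int64_t> MergedRunEnds(const std::vector<int64_t>& left,
                                   const std::vector<int64_t>& right) {
  std::vector<int64_t> merged;
  size_t l = 0, r = 0;
  while (l < left.size() && r < right.size()) {
    const int64_t end = std::min(left[l], right[r]);  // next boundary in either input
    merged.push_back(end);
    if (left[l] == end) ++l;   // advance whichever input(s) end here
    if (right[r] == end) ++r;
  }
  return merged;
}
```

Each emitted boundary corresponds to one position of the real iterator, where `run_end()` is the minimum of the two children's run ends.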
-template <typename Left, typename Right>
-class MergedRunsIterator {
- private:
-  using LeftIterator = typename Left::Iterator;
-  using RightIterator = typename Right::Iterator;
-
-  MergedRunsIterator(LeftIterator left_it, RightIterator right_it,
-                     int64_t common_logical_length, int64_t common_logical_pos)
-      : ree_iterators_{std::move(left_it), std::move(right_it)},
-        logical_length_(common_logical_length),
-        logical_pos_(common_logical_pos) {}
-
- public:
-  /// \brief Construct a MergedRunsIterator positioned at logical position 0.
-  ///
-  /// Pre-condition: left.length() == right.length()
-  MergedRunsIterator(const Left& left, const Right& right)
-      : MergedRunsIterator(left.begin(), right.begin(), left.length(), 0) {
-    assert(left.length() == right.length());
-  }
-
-  static Result<MergedRunsIterator> MakeBegin(const Left& left, const Right& right) {
-    if (left.length() != right.length()) {
-      return Status::Invalid(
-          "MergedRunsIterator expects RunEndEncodedArraySpans of the same length");
-    }
-    return MergedRunsIterator(left, right);
-  }
-
-  static Result<MergedRunsIterator> MakeEnd(const Left& left, const Right& right) {
-    if (left.length() != right.length()) {
-      return Status::Invalid(
-          "MergedRunsIterator expects RunEndEncodedArraySpans of the same length");
-    }
-    return MergedRunsIterator(left.end(), right.end(), left.length(), left.length());
-  }
-
-  /// \brief Return the left RunEndEncodedArraySpan child
-  const Left& left() const { return std::get<0>(ree_iterators_).span; }
-
-  /// \brief Return the right RunEndEncodedArraySpan child
-  const Right& right() const { return std::get<1>(ree_iterators_).span; }
-
-  /// \brief Return the initial logical position of the run
-  ///
-  /// If is_end(), this is the same as length().
-  int64_t logical_position() const { return logical_pos_; }
-
-  /// \brief Whether the iterator is at logical position 0.
- bool is_begin() const { return logical_pos_ == 0; } - - /// \brief Whether the iterator has reached the end of both arrays - bool is_end() const { return logical_pos_ == logical_length_; } - - /// \brief Return the logical position immediately after the run. - /// - /// Pre-condition: !is_end() - int64_t run_end() const { - const auto& left_it = std::get<0>(ree_iterators_); - const auto& right_it = std::get<1>(ree_iterators_); - return std::min(left_it.run_end(), right_it.run_end()); - } - - /// \brief returns the logical length of the current run - /// - /// Pre-condition: !is_end() - int64_t run_length() const { return run_end() - logical_pos_; } - - /// \brief Return a physical index into the values array of a given input, - /// pointing to the value of the current run - template - int64_t index_into_array() const { - return std::get(ree_iterators_).index_into_array(); - } - - int64_t index_into_left_array() const { return index_into_array<0>(); } - int64_t index_into_right_array() const { return index_into_array<1>(); } - - MergedRunsIterator& operator++() { - auto& left_it = std::get<0>(ree_iterators_); - auto& right_it = std::get<1>(ree_iterators_); - - const int64_t left_run_end = left_it.run_end(); - const int64_t right_run_end = right_it.run_end(); - - if (left_run_end < right_run_end) { - logical_pos_ = left_run_end; - ++left_it; - } else if (left_run_end > right_run_end) { - logical_pos_ = right_run_end; - ++right_it; - } else { - logical_pos_ = left_run_end; - ++left_it; - ++right_it; - } - return *this; - } - - MergedRunsIterator operator++(int) { - MergedRunsIterator prev = *this; - ++(*this); - return prev; - } - - MergedRunsIterator& operator--() { - auto& left_it = std::get<0>(ree_iterators_); - auto& right_it = std::get<1>(ree_iterators_); - - // The logical position of each iterator is the run_end() of the previous run. 
-    const int64_t left_logical_pos = left_it.logical_position();
-    const int64_t right_logical_pos = right_it.logical_position();
-
-    if (left_logical_pos < right_logical_pos) {
-      --right_it;
-      logical_pos_ = std::max(left_logical_pos, right_it.logical_position());
-    } else if (left_logical_pos > right_logical_pos) {
-      --left_it;
-      logical_pos_ = std::max(left_it.logical_position(), right_logical_pos);
-    } else {
-      --left_it;
-      --right_it;
-      logical_pos_ = std::max(left_it.logical_position(), right_it.logical_position());
-    }
-    return *this;
-  }
-
-  MergedRunsIterator operator--(int) {
-    MergedRunsIterator prev = *this;
-    --(*this);
-    return prev;
-  }
-
-  bool operator==(const MergedRunsIterator& other) const {
-    return logical_pos_ == other.logical_position();
-  }
-
-  bool operator!=(const MergedRunsIterator& other) const { return !(*this == other); }
-
- private:
-  std::tuple<LeftIterator, RightIterator> ree_iterators_;
-  const int64_t logical_length_;
-  int64_t logical_pos_;
-};
-
-}  // namespace ree_util
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/regex.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/regex.h
deleted file mode 100644
index 590fbac71..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/regex.h
+++ /dev/null
@@ -1,51 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#pragma once
-
-#include <cassert>
-#include <initializer_list>
-#include <regex>
-#include <string_view>
-#include <utility>
-
-#include "arrow/util/visibility.h"
-
-namespace arrow {
-namespace internal {
-
-/// Match regex against target and produce string_views out of matches.
-inline bool RegexMatch(const std::regex& regex, std::string_view target,
-                       std::initializer_list<std::string_view*> out_matches) {
-  assert(regex.mark_count() == out_matches.size());
-
-  std::match_results<std::string_view::const_iterator> match;
-  if (!std::regex_match(target.begin(), target.end(), match, regex)) {
-    return false;
-  }
-
-  // Match #0 is the whole matched sequence
-  assert(regex.mark_count() + 1 == match.size());
-  auto out_it = out_matches.begin();
-  for (size_t i = 1; i < match.size(); ++i) {
-    **out_it++ = target.substr(match.position(i), match.length(i));
-  }
-  return true;
-}
-
-}  // namespace internal
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/rle_encoding.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/rle_encoding.h
deleted file mode 100644
index e0f569006..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/rle_encoding.h
+++ /dev/null
@@ -1,826 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -// Imported from Apache Impala (incubating) on 2016-01-29 and modified for use -// in parquet-cpp, Arrow - -#pragma once - -#include -#include -#include -#include - -#include "arrow/util/bit_block_counter.h" -#include "arrow/util/bit_run_reader.h" -#include "arrow/util/bit_stream_utils.h" -#include "arrow/util/bit_util.h" -#include "arrow/util/macros.h" - -namespace arrow { -namespace util { - -/// Utility classes to do run length encoding (RLE) for fixed bit width values. If runs -/// are sufficiently long, RLE is used, otherwise, the values are just bit-packed -/// (literal encoding). -/// For both types of runs, there is a byte-aligned indicator which encodes the length -/// of the run and the type of the run. -/// This encoding has the benefit that when there aren't any long enough runs, values -/// are always decoded at fixed (can be precomputed) bit offsets OR both the value and -/// the run length are byte aligned. This allows for very efficient decoding -/// implementations. -/// The encoding is: -/// encoded-block := run* -/// run := literal-run | repeated-run -/// literal-run := literal-indicator < literal bytes > -/// repeated-run := repeated-indicator < repeated value. padded to byte boundary > -/// literal-indicator := varint_encode( number_of_groups << 1 | 1) -/// repeated-indicator := varint_encode( number_of_repetitions << 1 ) -// -/// Each run is preceded by a varint. The varint's least significant bit is -/// used to indicate whether the run is a literal run or a repeated run. 
The rest -/// of the varint is used to determine the length of the run (eg how many times the -/// value repeats). -// -/// In the case of literal runs, the run length is always a multiple of 8 (i.e. encode -/// in groups of 8), so that no matter the bit-width of the value, the sequence will end -/// on a byte boundary without padding. -/// Given that we know it is a multiple of 8, we store the number of 8-groups rather than -/// the actual number of encoded ints. (This means that the total number of encoded values -/// cannot be determined from the encoded data, since the number of values in the last -/// group may not be a multiple of 8). For the last group of literal runs, we pad -/// the group to 8 with zeros. This allows for 8 at a time decoding on the read side -/// without the need for additional checks. -// -/// There is a break-even point when it is more storage efficient to do run length -/// encoding. For 1 bit-width values, that point is 8 values. They require 2 bytes -/// for both the repeated encoding or the literal encoding. This value can always -/// be computed based on the bit-width. -/// TODO: think about how to use this for strings. The bit packing isn't quite the same. -// -/// Examples with bit-width 1 (eg encoding booleans): -/// ---------------------------------------- -/// 100 1s followed by 100 0s: -/// <1, padded to 1 byte> <0, padded to 1 byte> -/// - (total 4 bytes) -// -/// alternating 1s and 0s (200 total): -/// 200 ints = 25 groups of 8 -/// <25 bytes of values, bitpacked> -/// (total 26 bytes, 1 byte overhead) -// - -/// Decoder class for RLE encoded data. -class RleDecoder { - public: - /// Create a decoder object. buffer/buffer_len is the decoded data. - /// bit_width is the width of each value (before encoding). 
-  RleDecoder(const uint8_t* buffer, int buffer_len, int bit_width)
-      : bit_reader_(buffer, buffer_len),
-        bit_width_(bit_width),
-        current_value_(0),
-        repeat_count_(0),
-        literal_count_(0) {
-    DCHECK_GE(bit_width_, 0);
-    DCHECK_LE(bit_width_, 64);
-  }
-
-  RleDecoder() : bit_width_(-1) {}
-
-  void Reset(const uint8_t* buffer, int buffer_len, int bit_width) {
-    DCHECK_GE(bit_width, 0);
-    DCHECK_LE(bit_width, 64);
-    bit_reader_.Reset(buffer, buffer_len);
-    bit_width_ = bit_width;
-    current_value_ = 0;
-    repeat_count_ = 0;
-    literal_count_ = 0;
-  }
-
-  /// Gets the next value. Returns false if there are no more.
-  template <typename T>
-  bool Get(T* val);
-
-  /// Gets a batch of values. Returns the number of decoded elements.
-  template <typename T>
-  int GetBatch(T* values, int batch_size);
-
-  /// Like GetBatch but add spacing for null entries
-  template <typename T>
-  int GetBatchSpaced(int batch_size, int null_count, const uint8_t* valid_bits,
-                     int64_t valid_bits_offset, T* out);
-
-  /// Like GetBatch but the values are then decoded using the provided dictionary
-  template <typename T>
-  int GetBatchWithDict(const T* dictionary, int32_t dictionary_length, T* values,
-                       int batch_size);
-
-  /// Like GetBatchWithDict but add spacing for null entries
-  ///
-  /// Null entries will be zero-initialized in `values` to avoid leaking
-  /// private data.
-  template <typename T>
-  int GetBatchWithDictSpaced(const T* dictionary, int32_t dictionary_length, T* values,
-                             int batch_size, int null_count, const uint8_t* valid_bits,
-                             int64_t valid_bits_offset);
-
- protected:
-  ::arrow::bit_util::BitReader bit_reader_;
-  /// Number of bits needed to encode the value. Must be between 0 and 64.
-  int bit_width_;
-  uint64_t current_value_;
-  int32_t repeat_count_;
-  int32_t literal_count_;
-
- private:
-  /// Fills literal_count_ and repeat_count_ with next values. Returns false if there
-  /// are no more.
-  template <typename T>
-  bool NextCounts();
-
-  /// Utility methods for retrieving spaced values.
- template - int GetSpaced(Converter converter, int batch_size, int null_count, - const uint8_t* valid_bits, int64_t valid_bits_offset, T* out); -}; - -/// Class to incrementally build the rle data. This class does not allocate any memory. -/// The encoding has two modes: encoding repeated runs and literal runs. -/// If the run is sufficiently short, it is more efficient to encode as a literal run. -/// This class does so by buffering 8 values at a time. If they are not all the same -/// they are added to the literal run. If they are the same, they are added to the -/// repeated run. When we switch modes, the previous run is flushed out. -class RleEncoder { - public: - /// buffer/buffer_len: preallocated output buffer. - /// bit_width: max number of bits for value. - /// TODO: consider adding a min_repeated_run_length so the caller can control - /// when values should be encoded as repeated runs. Currently this is derived - /// based on the bit_width, which can determine a storage optimal choice. - /// TODO: allow 0 bit_width (and have dict encoder use it) - RleEncoder(uint8_t* buffer, int buffer_len, int bit_width) - : bit_width_(bit_width), bit_writer_(buffer, buffer_len) { - DCHECK_GE(bit_width_, 0); - DCHECK_LE(bit_width_, 64); - max_run_byte_size_ = MinBufferSize(bit_width); - DCHECK_GE(buffer_len, max_run_byte_size_) << "Input buffer not big enough."; - Clear(); - } - - /// Returns the minimum buffer size needed to use the encoder for 'bit_width' - /// This is the maximum length of a single run for 'bit_width'. - /// It is not valid to pass a buffer less than this length. - static int MinBufferSize(int bit_width) { - /// 1 indicator byte and MAX_VALUES_PER_LITERAL_RUN 'bit_width' values. - int max_literal_run_size = 1 + static_cast(::arrow::bit_util::BytesForBits( - MAX_VALUES_PER_LITERAL_RUN * bit_width)); - /// Up to kMaxVlqByteLength indicator and a single 'bit_width' value. 
- int max_repeated_run_size = - ::arrow::bit_util::BitReader::kMaxVlqByteLength + - static_cast(::arrow::bit_util::BytesForBits(bit_width)); - return std::max(max_literal_run_size, max_repeated_run_size); - } - - /// Returns the maximum byte size it could take to encode 'num_values'. - static int MaxBufferSize(int bit_width, int num_values) { - // For a bit_width > 1, the worst case is the repetition of "literal run of length 8 - // and then a repeated run of length 8". - // 8 values per smallest run, 8 bits per byte - int bytes_per_run = bit_width; - int num_runs = static_cast(::arrow::bit_util::CeilDiv(num_values, 8)); - int literal_max_size = num_runs + num_runs * bytes_per_run; - - // In the very worst case scenario, the data is a concatenation of repeated - // runs of 8 values. Repeated run has a 1 byte varint followed by the - // bit-packed repeated value - int min_repeated_run_size = - 1 + static_cast(::arrow::bit_util::BytesForBits(bit_width)); - int repeated_max_size = num_runs * min_repeated_run_size; - - return std::max(literal_max_size, repeated_max_size); - } - - /// Encode value. Returns true if the value fits in buffer, false otherwise. - /// This value must be representable with bit_width_ bits. - bool Put(uint64_t value); - - /// Flushes any pending values to the underlying buffer. - /// Returns the total number of bytes written - int Flush(); - - /// Resets all the state in the encoder. - void Clear(); - - /// Returns pointer to underlying buffer - uint8_t* buffer() { return bit_writer_.buffer(); } - int32_t len() { return bit_writer_.bytes_written(); } - - private: - /// Flushes any buffered values. If this is part of a repeated run, this is largely - /// a no-op. - /// If it is part of a literal run, this will call FlushLiteralRun, which writes - /// out the buffered literal values. - /// If 'done' is true, the current run would be written even if it would normally - /// have been buffered more. 
This should only be called at the end, when the - /// encoder has received all values even if it would normally continue to be - /// buffered. - void FlushBufferedValues(bool done); - - /// Flushes literal values to the underlying buffer. If update_indicator_byte, - /// then the current literal run is complete and the indicator byte is updated. - void FlushLiteralRun(bool update_indicator_byte); - - /// Flushes a repeated run to the underlying buffer. - void FlushRepeatedRun(); - - /// Checks and sets buffer_full_. This must be called after flushing a run to - /// make sure there are enough bytes remaining to encode the next run. - void CheckBufferFull(); - - /// The maximum number of values in a single literal run - /// (number of groups encodable by a 1-byte indicator * 8) - static const int MAX_VALUES_PER_LITERAL_RUN = (1 << 6) * 8; - - /// Number of bits needed to encode the value. Must be between 0 and 64. - const int bit_width_; - - /// Underlying buffer. - ::arrow::bit_util::BitWriter bit_writer_; - - /// If true, the buffer is full and subsequent Put()'s will fail. - bool buffer_full_; - - /// The maximum byte size a single run can take. - int max_run_byte_size_; - - /// We need to buffer at most 8 values for literals. This happens when the - /// bit_width is 1 (so 8 values fit in one byte). - /// TODO: generalize this to other bit widths - int64_t buffered_values_[8]; - - /// Number of values in buffered_values_ - int num_buffered_values_; - - /// The current (also last) value that was written and the count of how - /// many times in a row that value has been seen. This is maintained even - /// if we are in a literal run. If the repeat_count_ get high enough, we switch - /// to encoding repeated runs. - uint64_t current_value_; - int repeat_count_; - - /// Number of literals in the current run. This does not include the literals - /// that might be in buffered_values_. 
Only after we've got a group big enough - /// can we decide if they should part of the literal_count_ or repeat_count_ - int literal_count_; - - /// Pointer to a byte in the underlying buffer that stores the indicator byte. - /// This is reserved as soon as we need a literal run but the value is written - /// when the literal run is complete. - uint8_t* literal_indicator_byte_; -}; - -template -inline bool RleDecoder::Get(T* val) { - return GetBatch(val, 1) == 1; -} - -template -inline int RleDecoder::GetBatch(T* values, int batch_size) { - DCHECK_GE(bit_width_, 0); - int values_read = 0; - - auto* out = values; - - while (values_read < batch_size) { - int remaining = batch_size - values_read; - - if (repeat_count_ > 0) { // Repeated value case. - int repeat_batch = std::min(remaining, repeat_count_); - std::fill(out, out + repeat_batch, static_cast(current_value_)); - - repeat_count_ -= repeat_batch; - values_read += repeat_batch; - out += repeat_batch; - } else if (literal_count_ > 0) { - int literal_batch = std::min(remaining, literal_count_); - int actual_read = bit_reader_.GetBatch(bit_width_, out, literal_batch); - if (actual_read != literal_batch) { - return values_read; - } - - literal_count_ -= literal_batch; - values_read += literal_batch; - out += literal_batch; - } else { - if (!NextCounts()) return values_read; - } - } - - return values_read; -} - -template -inline int RleDecoder::GetSpaced(Converter converter, int batch_size, int null_count, - const uint8_t* valid_bits, int64_t valid_bits_offset, - T* out) { - if (ARROW_PREDICT_FALSE(null_count == batch_size)) { - converter.FillZero(out, out + batch_size); - return batch_size; - } - - DCHECK_GE(bit_width_, 0); - int values_read = 0; - int values_remaining = batch_size - null_count; - - // Assume no bits to start. 
- arrow::internal::BitRunReader bit_reader(valid_bits, valid_bits_offset, - /*length=*/batch_size); - arrow::internal::BitRun valid_run = bit_reader.NextRun(); - while (values_read < batch_size) { - if (ARROW_PREDICT_FALSE(valid_run.length == 0)) { - valid_run = bit_reader.NextRun(); - } - - DCHECK_GT(batch_size, 0); - DCHECK_GT(valid_run.length, 0); - - if (valid_run.set) { - if ((repeat_count_ == 0) && (literal_count_ == 0)) { - if (!NextCounts()) return values_read; - DCHECK((repeat_count_ > 0) ^ (literal_count_ > 0)); - } - - if (repeat_count_ > 0) { - int repeat_batch = 0; - // Consume the entire repeat counts incrementing repeat_batch to - // be the total of nulls + values consumed, we only need to - // get the total count because we can fill in the same value for - // nulls and non-nulls. This proves to be a big efficiency win. - while (repeat_count_ > 0 && (values_read + repeat_batch) < batch_size) { - DCHECK_GT(valid_run.length, 0); - if (valid_run.set) { - int update_size = std::min(static_cast(valid_run.length), repeat_count_); - repeat_count_ -= update_size; - repeat_batch += update_size; - valid_run.length -= update_size; - values_remaining -= update_size; - } else { - // We can consume all nulls here because we would do so on - // the next loop anyways. 
- repeat_batch += static_cast(valid_run.length); - valid_run.length = 0; - } - if (valid_run.length == 0) { - valid_run = bit_reader.NextRun(); - } - } - RunType current_value = static_cast(current_value_); - if (ARROW_PREDICT_FALSE(!converter.IsValid(current_value))) { - return values_read; - } - converter.Fill(out, out + repeat_batch, current_value); - out += repeat_batch; - values_read += repeat_batch; - } else if (literal_count_ > 0) { - int literal_batch = std::min(values_remaining, literal_count_); - DCHECK_GT(literal_batch, 0); - - // Decode the literals - constexpr int kBufferSize = 1024; - RunType indices[kBufferSize]; - literal_batch = std::min(literal_batch, kBufferSize); - int actual_read = bit_reader_.GetBatch(bit_width_, indices, literal_batch); - if (ARROW_PREDICT_FALSE(actual_read != literal_batch)) { - return values_read; - } - if (!converter.IsValid(indices, /*length=*/actual_read)) { - return values_read; - } - int skipped = 0; - int literals_read = 0; - while (literals_read < literal_batch) { - if (valid_run.set) { - int update_size = std::min(literal_batch - literals_read, - static_cast(valid_run.length)); - converter.Copy(out, indices + literals_read, update_size); - literals_read += update_size; - out += update_size; - valid_run.length -= update_size; - } else { - converter.FillZero(out, out + valid_run.length); - out += valid_run.length; - skipped += static_cast(valid_run.length); - valid_run.length = 0; - } - if (valid_run.length == 0) { - valid_run = bit_reader.NextRun(); - } - } - literal_count_ -= literal_batch; - values_remaining -= literal_batch; - values_read += literal_batch + skipped; - } - } else { - converter.FillZero(out, out + valid_run.length); - out += valid_run.length; - values_read += static_cast(valid_run.length); - valid_run.length = 0; - } - } - DCHECK_EQ(valid_run.length, 0); - DCHECK_EQ(values_remaining, 0); - return values_read; -} - -// Converter for GetSpaced that handles runs that get returned -// directly as 
output.
-template <typename T>
-struct PlainRleConverter {
-  T kZero = {};
-  inline bool IsValid(const T& values) const { return true; }
-  inline bool IsValid(const T* values, int32_t length) const { return true; }
-  inline void Fill(T* begin, T* end, const T& run_value) const {
-    std::fill(begin, end, run_value);
-  }
-  inline void FillZero(T* begin, T* end) { std::fill(begin, end, kZero); }
-  inline void Copy(T* out, const T* values, int length) const {
-    std::memcpy(out, values, length * sizeof(T));
-  }
-};
-
-template <typename T>
-inline int RleDecoder::GetBatchSpaced(int batch_size, int null_count,
-                                      const uint8_t* valid_bits,
-                                      int64_t valid_bits_offset, T* out) {
-  if (null_count == 0) {
-    return GetBatch(out, batch_size);
-  }
-
-  PlainRleConverter<T> converter;
-  arrow::internal::BitBlockCounter block_counter(valid_bits, valid_bits_offset,
-                                                 batch_size);
-
-  int total_processed = 0;
-  int processed = 0;
-  arrow::internal::BitBlockCount block;
-
-  do {
-    block = block_counter.NextFourWords();
-    if (block.length == 0) {
-      break;
-    }
-    if (block.AllSet()) {
-      processed = GetBatch(out, block.length);
-    } else if (block.NoneSet()) {
-      converter.FillZero(out, out + block.length);
-      processed = block.length;
-    } else {
-      processed = GetSpaced<T, T, PlainRleConverter<T>>(
-          converter, block.length, block.length - block.popcount, valid_bits,
-          valid_bits_offset, out);
-    }
-    total_processed += processed;
-    out += block.length;
-    valid_bits_offset += block.length;
-  } while (processed == block.length);
-  return total_processed;
-}
-
-static inline bool IndexInRange(int32_t idx, int32_t dictionary_length) {
-  return idx >= 0 && idx < dictionary_length;
-}
-
-// Converter for GetSpaced that handles runs of returned dictionary
-// indices.
-template -struct DictionaryConverter { - T kZero = {}; - const T* dictionary; - int32_t dictionary_length; - - inline bool IsValid(int32_t value) { return IndexInRange(value, dictionary_length); } - - inline bool IsValid(const int32_t* values, int32_t length) const { - using IndexType = int32_t; - IndexType min_index = std::numeric_limits::max(); - IndexType max_index = std::numeric_limits::min(); - for (int x = 0; x < length; x++) { - min_index = std::min(values[x], min_index); - max_index = std::max(values[x], max_index); - } - - return IndexInRange(min_index, dictionary_length) && - IndexInRange(max_index, dictionary_length); - } - inline void Fill(T* begin, T* end, const int32_t& run_value) const { - std::fill(begin, end, dictionary[run_value]); - } - inline void FillZero(T* begin, T* end) { std::fill(begin, end, kZero); } - - inline void Copy(T* out, const int32_t* values, int length) const { - for (int x = 0; x < length; x++) { - out[x] = dictionary[values[x]]; - } - } -}; - -template -inline int RleDecoder::GetBatchWithDict(const T* dictionary, int32_t dictionary_length, - T* values, int batch_size) { - // Per https://github.com/apache/parquet-format/blob/master/Encodings.md, - // the maximum dictionary index width in Parquet is 32 bits. 
- using IndexType = int32_t; - DictionaryConverter converter; - converter.dictionary = dictionary; - converter.dictionary_length = dictionary_length; - - DCHECK_GE(bit_width_, 0); - int values_read = 0; - - auto* out = values; - - while (values_read < batch_size) { - int remaining = batch_size - values_read; - - if (repeat_count_ > 0) { - auto idx = static_cast(current_value_); - if (ARROW_PREDICT_FALSE(!IndexInRange(idx, dictionary_length))) { - return values_read; - } - T val = dictionary[idx]; - - int repeat_batch = std::min(remaining, repeat_count_); - std::fill(out, out + repeat_batch, val); - - /* Upkeep counters */ - repeat_count_ -= repeat_batch; - values_read += repeat_batch; - out += repeat_batch; - } else if (literal_count_ > 0) { - constexpr int kBufferSize = 1024; - IndexType indices[kBufferSize]; - - int literal_batch = std::min(remaining, literal_count_); - literal_batch = std::min(literal_batch, kBufferSize); - - int actual_read = bit_reader_.GetBatch(bit_width_, indices, literal_batch); - if (ARROW_PREDICT_FALSE(actual_read != literal_batch)) { - return values_read; - } - if (ARROW_PREDICT_FALSE(!converter.IsValid(indices, /*length=*/literal_batch))) { - return values_read; - } - converter.Copy(out, indices, literal_batch); - - /* Upkeep counters */ - literal_count_ -= literal_batch; - values_read += literal_batch; - out += literal_batch; - } else { - if (!NextCounts()) return values_read; - } - } - - return values_read; -} - -template -inline int RleDecoder::GetBatchWithDictSpaced(const T* dictionary, - int32_t dictionary_length, T* out, - int batch_size, int null_count, - const uint8_t* valid_bits, - int64_t valid_bits_offset) { - if (null_count == 0) { - return GetBatchWithDict(dictionary, dictionary_length, out, batch_size); - } - arrow::internal::BitBlockCounter block_counter(valid_bits, valid_bits_offset, - batch_size); - using IndexType = int32_t; - DictionaryConverter converter; - converter.dictionary = dictionary; - 
converter.dictionary_length = dictionary_length; - - int total_processed = 0; - int processed = 0; - arrow::internal::BitBlockCount block; - do { - block = block_counter.NextFourWords(); - if (block.length == 0) { - break; - } - if (block.AllSet()) { - processed = GetBatchWithDict(dictionary, dictionary_length, out, block.length); - } else if (block.NoneSet()) { - converter.FillZero(out, out + block.length); - processed = block.length; - } else { - processed = GetSpaced>( - converter, block.length, block.length - block.popcount, valid_bits, - valid_bits_offset, out); - } - total_processed += processed; - out += block.length; - valid_bits_offset += block.length; - } while (processed == block.length); - return total_processed; -} - -template -bool RleDecoder::NextCounts() { - // Read the next run's indicator int, it could be a literal or repeated run. - // The int is encoded as a vlq-encoded value. - uint32_t indicator_value = 0; - if (!bit_reader_.GetVlqInt(&indicator_value)) return false; - - // lsb indicates if it is a literal run or repeated run - bool is_literal = indicator_value & 1; - uint32_t count = indicator_value >> 1; - if (is_literal) { - if (ARROW_PREDICT_FALSE(count == 0 || count > static_cast(INT32_MAX) / 8)) { - return false; - } - literal_count_ = count * 8; - } else { - if (ARROW_PREDICT_FALSE(count == 0 || count > static_cast(INT32_MAX))) { - return false; - } - repeat_count_ = count; - T value = {}; - if (!bit_reader_.GetAligned( - static_cast(::arrow::bit_util::CeilDiv(bit_width_, 8)), &value)) { - return false; - } - current_value_ = static_cast(value); - } - return true; -} - -/// This function buffers input values 8 at a time. After seeing all 8 values, -/// it decides whether they should be encoded as a literal or repeated run. 
-inline bool RleEncoder::Put(uint64_t value) { - DCHECK(bit_width_ == 64 || value < (1ULL << bit_width_)); - if (ARROW_PREDICT_FALSE(buffer_full_)) return false; - - if (ARROW_PREDICT_TRUE(current_value_ == value)) { - ++repeat_count_; - if (repeat_count_ > 8) { - // This is just a continuation of the current run, no need to buffer the - // values. - // Note that this is the fast path for long repeated runs. - return true; - } - } else { - if (repeat_count_ >= 8) { - // We had a run that was long enough but it has ended. Flush the - // current repeated run. - DCHECK_EQ(literal_count_, 0); - FlushRepeatedRun(); - } - repeat_count_ = 1; - current_value_ = value; - } - - buffered_values_[num_buffered_values_] = value; - if (++num_buffered_values_ == 8) { - DCHECK_EQ(literal_count_ % 8, 0); - FlushBufferedValues(false); - } - return true; -} - -inline void RleEncoder::FlushLiteralRun(bool update_indicator_byte) { - if (literal_indicator_byte_ == NULL) { - // The literal indicator byte has not been reserved yet, get one now. - literal_indicator_byte_ = bit_writer_.GetNextBytePtr(); - DCHECK(literal_indicator_byte_ != NULL); - } - - // Write all the buffered values as bit packed literals - for (int i = 0; i < num_buffered_values_; ++i) { - bool success = bit_writer_.PutValue(buffered_values_[i], bit_width_); - DCHECK(success) << "There is a bug in using CheckBufferFull()"; - } - num_buffered_values_ = 0; - - if (update_indicator_byte) { - // At this point we need to write the indicator byte for the literal run. - // We only reserve one byte, to allow for streaming writes of literal values. - // The logic makes sure we flush literal runs often enough to not overrun - // the 1 byte. 
- DCHECK_EQ(literal_count_ % 8, 0); - int num_groups = literal_count_ / 8; - int32_t indicator_value = (num_groups << 1) | 1; - DCHECK_EQ(indicator_value & 0xFFFFFF00, 0); - *literal_indicator_byte_ = static_cast(indicator_value); - literal_indicator_byte_ = NULL; - literal_count_ = 0; - CheckBufferFull(); - } -} - -inline void RleEncoder::FlushRepeatedRun() { - DCHECK_GT(repeat_count_, 0); - bool result = true; - // The lsb of 0 indicates this is a repeated run - int32_t indicator_value = repeat_count_ << 1 | 0; - result &= bit_writer_.PutVlqInt(static_cast(indicator_value)); - result &= bit_writer_.PutAligned( - current_value_, static_cast(::arrow::bit_util::CeilDiv(bit_width_, 8))); - DCHECK(result); - num_buffered_values_ = 0; - repeat_count_ = 0; - CheckBufferFull(); -} - -/// Flush the values that have been buffered. At this point we decide whether -/// we need to switch between the run types or continue the current one. -inline void RleEncoder::FlushBufferedValues(bool done) { - if (repeat_count_ >= 8) { - // Clear the buffered values. They are part of the repeated run now and we - // don't want to flush them out as literals. - num_buffered_values_ = 0; - if (literal_count_ != 0) { - // There was a current literal run. All the values in it have been flushed - // but we still need to update the indicator byte. - DCHECK_EQ(literal_count_ % 8, 0); - DCHECK_EQ(repeat_count_, 8); - FlushLiteralRun(true); - } - DCHECK_EQ(literal_count_, 0); - return; - } - - literal_count_ += num_buffered_values_; - DCHECK_EQ(literal_count_ % 8, 0); - int num_groups = literal_count_ / 8; - if (num_groups + 1 >= (1 << 6)) { - // We need to start a new literal run because the indicator byte we've reserved - // cannot store more values. 
- DCHECK(literal_indicator_byte_ != NULL); - FlushLiteralRun(true); - } else { - FlushLiteralRun(done); - } - repeat_count_ = 0; -} - -inline int RleEncoder::Flush() { - if (literal_count_ > 0 || repeat_count_ > 0 || num_buffered_values_ > 0) { - bool all_repeat = literal_count_ == 0 && (repeat_count_ == num_buffered_values_ || - num_buffered_values_ == 0); - // There is something pending, figure out if it's a repeated or literal run - if (repeat_count_ > 0 && all_repeat) { - FlushRepeatedRun(); - } else { - DCHECK_EQ(literal_count_ % 8, 0); - // Buffer the last group of literals to 8 by padding with 0s. - for (; num_buffered_values_ != 0 && num_buffered_values_ < 8; - ++num_buffered_values_) { - buffered_values_[num_buffered_values_] = 0; - } - literal_count_ += num_buffered_values_; - FlushLiteralRun(true); - repeat_count_ = 0; - } - } - bit_writer_.Flush(); - DCHECK_EQ(num_buffered_values_, 0); - DCHECK_EQ(literal_count_, 0); - DCHECK_EQ(repeat_count_, 0); - - return bit_writer_.bytes_written(); -} - -inline void RleEncoder::CheckBufferFull() { - int bytes_written = bit_writer_.bytes_written(); - if (bytes_written + max_run_byte_size_ > bit_writer_.buffer_len()) { - buffer_full_ = true; - } -} - -inline void RleEncoder::Clear() { - buffer_full_ = false; - current_value_ = 0; - repeat_count_ = 0; - num_buffered_values_ = 0; - literal_count_ = 0; - literal_indicator_byte_ = NULL; - bit_writer_.Clear(); -} - -} // namespace util -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/rows_to_batches.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/rows_to_batches.h deleted file mode 100644 index 8ad254df2..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/rows_to_batches.h +++ /dev/null @@ -1,163 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. 
See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include "arrow/record_batch.h" -#include "arrow/result.h" -#include "arrow/status.h" -#include "arrow/table_builder.h" -#include "arrow/util/iterator.h" - -#include - -namespace arrow::util { - -namespace detail { - -// Default identity function row accessor. Used to for the common case where the value -// of each row iterated over is it's self also directly iterable. -[[nodiscard]] constexpr inline auto MakeDefaultRowAccessor() { - return [](auto& x) -> Result { return std::ref(x); }; -} - -// Meta-function to check if a type `T` is a range (iterable using `std::begin()` / -// `std::end()`). `is_range::value` will be false if `T` is not a valid range. -template -struct is_range : std::false_type {}; - -template -struct is_range())), - decltype(std::end(std::declval()))>> : std::true_type { -}; - -} // namespace detail - -/// Delete overload for `const Range&& rows` because the data's lifetime must exceed -/// the lifetime of the function call. 
`data` will be read when client uses the -/// `RecordBatchReader` -template -[[nodiscard]] typename std::enable_if_t::value, - Result>> -/* Result>> */ RowsToBatches( - const std::shared_ptr& schema, const Range&& rows, - DataPointConvertor&& data_point_convertor, - RowAccessor&& row_accessor = detail::MakeDefaultRowAccessor(), - MemoryPool* pool = default_memory_pool(), - const std::size_t batch_size = 1024) = delete; - -/// \brief Utility function for converting any row-based structure into an -/// `arrow::RecordBatchReader` (this can be easily converted to an `arrow::Table` using -/// `arrow::RecordBatchReader::ToTable()`). -/// -/// Examples of supported types: -/// - `std::vector>>` -/// - `std::vector` - -/// If `rows` (client’s row-based structure) is not a valid C++ range, the client will -/// need to either make it iterable, or make an adapter/wrapper that is a valid C++ -/// range. - -/// The client must provide a `DataPointConvertor` callable type that will convert the -/// structure’s data points into the corresponding arrow types. - -/// Complex nested rows can be supported by providing a custom `row_accessor` instead -/// of the default. - -/// Example usage: -/// \code{.cpp} -/// auto IntConvertor = [](ArrayBuilder& array_builder, int value) { -/// return static_cast(array_builder).Append(value); -/// }; -/// std::vector> data = {{1, 2, 4}, {5, 6, 7}}; -/// auto batches = RowsToBatches(kTestSchema, data, IntConvertor); -/// \endcode - -/// \param[in] schema - The schema to be used in the `RecordBatchReader` - -/// \param[in] rows - Iterable row-based structure that will be converted to arrow -/// batches - -/// \param[in] data_point_convertor - Client provided callable type that will convert -/// the structure’s data points into the corresponding arrow types. The convertor must -/// return an error `Status` if an error happens during conversion. 
- -/// \param[in] row_accessor - In the common case where the value of each row iterated -/// over is it's self also directly iterable, the client can just use the default. -/// The provided callable must take the values of the `rows` range and return a -/// `std::reference_wrapper` to the data points in a given row. The data points -/// must be in order of their corresponding fields in the schema. -/// see: /ref `MakeDefaultRowAccessor` - -/// \param[in] pool - The MemoryPool to use for allocations. - -/// \param[in] batch_size - Number of rows to insert into each RecordBatch. - -/// \return `Result>>` result will be a -/// `std::shared_ptr>` if not errors occurred, else an error status. -template -[[nodiscard]] typename std::enable_if_t::value, - Result>> -/* Result>> */ RowsToBatches( - const std::shared_ptr& schema, const Range& rows, - DataPointConvertor&& data_point_convertor, - RowAccessor&& row_accessor = detail::MakeDefaultRowAccessor(), - MemoryPool* pool = default_memory_pool(), const std::size_t batch_size = 1024) { - auto make_next_batch = - [pool = pool, batch_size = batch_size, rows_ittr = std::begin(rows), - rows_ittr_end = std::end(rows), schema = schema, - row_accessor = std::forward(row_accessor), - data_point_convertor = std::forward( - data_point_convertor)]() mutable -> Result> { - if (rows_ittr == rows_ittr_end) return NULLPTR; - - ARROW_ASSIGN_OR_RAISE(auto record_batch_builder, - RecordBatchBuilder::Make(schema, pool, batch_size)); - - for (size_t i = 0; i < batch_size && (rows_ittr != rows_ittr_end); - i++, std::advance(rows_ittr, 1)) { - int col_index = 0; - ARROW_ASSIGN_OR_RAISE(const auto row, row_accessor(*rows_ittr)); - - // If the accessor returns a `std::reference_wrapper` unwrap if - const auto& row_unwrapped = [&]() { - if constexpr (detail::is_range::value) - return row; - else - return row.get(); - }(); - - for (auto& data_point : row_unwrapped) { - ArrayBuilder* array_builder = record_batch_builder->GetField(col_index); - 
ARROW_RETURN_IF(array_builder == NULLPTR, - Status::Invalid("array_builder == NULLPTR")); - - ARROW_RETURN_NOT_OK(data_point_convertor(*array_builder, data_point)); - col_index++; - } - } - - ARROW_ASSIGN_OR_RAISE(auto result, record_batch_builder->Flush()); - return result; - }; - return RecordBatchReader::MakeFromIterator(MakeFunctionIterator(make_next_batch), - schema); -} - -} // namespace arrow::util diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/simd.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/simd.h deleted file mode 100644 index ee9105d5f..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/simd.h +++ /dev/null @@ -1,44 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#pragma once - -#ifdef _MSC_VER -// MSVC x86_64/arm64 - -#if defined(_M_AMD64) || defined(_M_X64) -#include -#endif - -#else -// gcc/clang (possibly others) - -#if defined(ARROW_HAVE_BMI2) -#include -#endif - -#if defined(ARROW_HAVE_AVX2) || defined(ARROW_HAVE_AVX512) -#include -#elif defined(ARROW_HAVE_SSE4_2) -#include -#endif - -#ifdef ARROW_HAVE_NEON -#include -#endif - -#endif diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/small_vector.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/small_vector.h deleted file mode 100644 index 52e191c4c..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/small_vector.h +++ /dev/null @@ -1,511 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#pragma once - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "arrow/util/aligned_storage.h" -#include "arrow/util/macros.h" - -namespace arrow { -namespace internal { - -template -struct StaticVectorStorageBase { - using storage_type = AlignedStorage; - - storage_type static_data_[N]; - size_t size_ = 0; - - void destroy() noexcept {} -}; - -template -struct StaticVectorStorageBase { - using storage_type = AlignedStorage; - - storage_type static_data_[N]; - size_t size_ = 0; - - ~StaticVectorStorageBase() noexcept { destroy(); } - - void destroy() noexcept { storage_type::destroy_several(static_data_, size_); } -}; - -template ::value> -struct StaticVectorStorage : public StaticVectorStorageBase { - using Base = StaticVectorStorageBase; - using typename Base::storage_type; - - using Base::size_; - using Base::static_data_; - - StaticVectorStorage() noexcept = default; - - constexpr storage_type* storage_ptr() { return static_data_; } - - constexpr const storage_type* const_storage_ptr() const { return static_data_; } - - // Adjust storage size, but don't initialize any objects - void bump_size(size_t addend) { - assert(size_ + addend <= N); - size_ += addend; - } - - void ensure_capacity(size_t min_capacity) { assert(min_capacity <= N); } - - // Adjust storage size, but don't destroy any objects - void reduce_size(size_t reduce_by) { - assert(reduce_by <= size_); - size_ -= reduce_by; - } - - // Move objects from another storage, but don't destroy any objects currently - // stored in *this. - // You need to call destroy() first if necessary (e.g. in a - // move assignment operator). 
- void move_construct(StaticVectorStorage&& other) noexcept { - size_ = other.size_; - if (size_ != 0) { - // Use a compile-time memcpy size (N) for trivial types - storage_type::move_construct_several(other.static_data_, static_data_, size_, N); - } - } - - constexpr size_t capacity() const { return N; } - - constexpr size_t max_size() const { return N; } - - void reserve(size_t n) {} - - void clear() { - storage_type::destroy_several(static_data_, size_); - size_ = 0; - } -}; - -template -struct SmallVectorStorage { - using storage_type = AlignedStorage; - - storage_type static_data_[N]; - size_t size_ = 0; - storage_type* data_ = static_data_; - size_t dynamic_capacity_ = 0; - - SmallVectorStorage() noexcept = default; - - ~SmallVectorStorage() { destroy(); } - - constexpr storage_type* storage_ptr() { return data_; } - - constexpr const storage_type* const_storage_ptr() const { return data_; } - - void bump_size(size_t addend) { - const size_t new_size = size_ + addend; - ensure_capacity(new_size); - size_ = new_size; - } - - void ensure_capacity(size_t min_capacity) { - if (dynamic_capacity_) { - // Grow dynamic storage if necessary - if (min_capacity > dynamic_capacity_) { - size_t new_capacity = std::max(dynamic_capacity_ * 2, min_capacity); - reallocate_dynamic(new_capacity); - } - } else if (min_capacity > N) { - switch_to_dynamic(min_capacity); - } - } - - void reduce_size(size_t reduce_by) { - assert(reduce_by <= size_); - size_ -= reduce_by; - } - - void destroy() noexcept { - storage_type::destroy_several(data_, size_); - if (dynamic_capacity_) { - delete[] data_; - } - } - - void move_construct(SmallVectorStorage&& other) noexcept { - size_ = other.size_; - dynamic_capacity_ = other.dynamic_capacity_; - if (dynamic_capacity_) { - data_ = other.data_; - other.data_ = other.static_data_; - other.dynamic_capacity_ = 0; - other.size_ = 0; - } else if (size_ != 0) { - // Use a compile-time memcpy size (N) for trivial types - 
storage_type::move_construct_several(other.static_data_, static_data_, size_, N); - } - } - - constexpr size_t capacity() const { return dynamic_capacity_ ? dynamic_capacity_ : N; } - - constexpr size_t max_size() const { return std::numeric_limits::max(); } - - void reserve(size_t n) { - if (dynamic_capacity_) { - if (n > dynamic_capacity_) { - reallocate_dynamic(n); - } - } else if (n > N) { - switch_to_dynamic(n); - } - } - - void clear() { - storage_type::destroy_several(data_, size_); - size_ = 0; - } - - private: - void switch_to_dynamic(size_t new_capacity) { - dynamic_capacity_ = new_capacity; - data_ = new storage_type[new_capacity]; - storage_type::move_construct_several_and_destroy_source(static_data_, data_, size_); - } - - void reallocate_dynamic(size_t new_capacity) { - assert(new_capacity >= size_); - auto new_data = new storage_type[new_capacity]; - storage_type::move_construct_several_and_destroy_source(data_, new_data, size_); - delete[] data_; - dynamic_capacity_ = new_capacity; - data_ = new_data; - } -}; - -template -class StaticVectorImpl { - private: - Storage storage_; - - T* data_ptr() { return storage_.storage_ptr()->get(); } - - constexpr const T* const_data_ptr() const { - return storage_.const_storage_ptr()->get(); - } - - public: - using size_type = size_t; - using difference_type = ptrdiff_t; - using value_type = T; - using pointer = T*; - using const_pointer = const T*; - using reference = T&; - using const_reference = const T&; - using iterator = T*; - using const_iterator = const T*; - using reverse_iterator = std::reverse_iterator; - using const_reverse_iterator = std::reverse_iterator; - - constexpr StaticVectorImpl() noexcept = default; - - // Move and copy constructors - StaticVectorImpl(StaticVectorImpl&& other) noexcept { - storage_.move_construct(std::move(other.storage_)); - } - - StaticVectorImpl& operator=(StaticVectorImpl&& other) noexcept { - if (ARROW_PREDICT_TRUE(&other != this)) { - // TODO move_assign? 
- storage_.destroy(); - storage_.move_construct(std::move(other.storage_)); - } - return *this; - } - - StaticVectorImpl(const StaticVectorImpl& other) { - init_by_copying(other.storage_.size_, other.const_data_ptr()); - } - - StaticVectorImpl& operator=(const StaticVectorImpl& other) noexcept { - if (ARROW_PREDICT_TRUE(&other != this)) { - assign_by_copying(other.storage_.size_, other.data()); - } - return *this; - } - - // Automatic conversion from std::vector, for convenience - StaticVectorImpl(const std::vector& other) { // NOLINT: explicit - init_by_copying(other.size(), other.data()); - } - - StaticVectorImpl(std::vector&& other) noexcept { // NOLINT: explicit - init_by_moving(other.size(), other.data()); - } - - StaticVectorImpl& operator=(const std::vector& other) { - assign_by_copying(other.size(), other.data()); - return *this; - } - - StaticVectorImpl& operator=(std::vector&& other) noexcept { - assign_by_moving(other.size(), other.data()); - return *this; - } - - // Constructing from count and optional initialization value - explicit StaticVectorImpl(size_t count) { - storage_.bump_size(count); - auto* p = storage_.storage_ptr(); - for (size_t i = 0; i < count; ++i) { - p[i].construct(); - } - } - - StaticVectorImpl(size_t count, const T& value) { - storage_.bump_size(count); - auto* p = storage_.storage_ptr(); - for (size_t i = 0; i < count; ++i) { - p[i].construct(value); - } - } - - StaticVectorImpl(std::initializer_list values) { - storage_.bump_size(values.size()); - auto* p = storage_.storage_ptr(); - for (auto&& v : values) { - // Unfortunately, cannot move initializer values - p++->construct(v); - } - } - - // Size inspection - - constexpr bool empty() const { return storage_.size_ == 0; } - - constexpr size_t size() const { return storage_.size_; } - - constexpr size_t capacity() const { return storage_.capacity(); } - - constexpr size_t max_size() const { return storage_.max_size(); } - - // Data access - - T& operator[](size_t i) { return 
data_ptr()[i]; } - - constexpr const T& operator[](size_t i) const { return const_data_ptr()[i]; } - - T& front() { return data_ptr()[0]; } - - constexpr const T& front() const { return const_data_ptr()[0]; } - - T& back() { return data_ptr()[storage_.size_ - 1]; } - - constexpr const T& back() const { return const_data_ptr()[storage_.size_ - 1]; } - - T* data() { return data_ptr(); } - - constexpr const T* data() const { return const_data_ptr(); } - - // Iterators - - iterator begin() { return iterator(data_ptr()); } - - constexpr const_iterator begin() const { return const_iterator(const_data_ptr()); } - - constexpr const_iterator cbegin() const { return const_iterator(const_data_ptr()); } - - iterator end() { return iterator(data_ptr() + storage_.size_); } - - constexpr const_iterator end() const { - return const_iterator(const_data_ptr() + storage_.size_); - } - - constexpr const_iterator cend() const { - return const_iterator(const_data_ptr() + storage_.size_); - } - - reverse_iterator rbegin() { return reverse_iterator(end()); } - - constexpr const_reverse_iterator rbegin() const { - return const_reverse_iterator(end()); - } - - constexpr const_reverse_iterator crbegin() const { - return const_reverse_iterator(end()); - } - - reverse_iterator rend() { return reverse_iterator(begin()); } - - constexpr const_reverse_iterator rend() const { - return const_reverse_iterator(begin()); - } - - constexpr const_reverse_iterator crend() const { - return const_reverse_iterator(begin()); - } - - // Mutations - - void reserve(size_t n) { storage_.reserve(n); } - - void clear() { storage_.clear(); } - - void push_back(const T& value) { - storage_.bump_size(1); - storage_.storage_ptr()[storage_.size_ - 1].construct(value); - } - - void push_back(T&& value) { - storage_.bump_size(1); - storage_.storage_ptr()[storage_.size_ - 1].construct(std::move(value)); - } - - template - void emplace_back(Args&&... 
args) { - storage_.bump_size(1); - storage_.storage_ptr()[storage_.size_ - 1].construct(std::forward(args)...); - } - - template - iterator insert(const_iterator insert_at, InputIt first, InputIt last) { - const size_t n = storage_.size_; - const size_t it_size = static_cast(last - first); // XXX might be O(n)? - const size_t pos = static_cast(insert_at - const_data_ptr()); - storage_.bump_size(it_size); - auto* p = storage_.storage_ptr(); - if (it_size == 0) { - return p[pos].get(); - } - const size_t end_pos = pos + it_size; - - // Move [pos; n) to [end_pos; end_pos + n - pos) - size_t i = n; - size_t j = end_pos + n - pos; - while (j > std::max(n, end_pos)) { - p[--j].move_construct(&p[--i]); - } - while (j > end_pos) { - p[--j].move_assign(&p[--i]); - } - assert(j == end_pos); - // Copy [first; last) to [pos; end_pos) - j = pos; - while (j < std::min(n, end_pos)) { - p[j++].assign(*first++); - } - while (j < end_pos) { - p[j++].construct(*first++); - } - assert(first == last); - return p[pos].get(); - } - - void resize(size_t n) { - const size_t old_size = storage_.size_; - if (n > storage_.size_) { - storage_.bump_size(n - old_size); - auto* p = storage_.storage_ptr(); - for (size_t i = old_size; i < n; ++i) { - p[i].construct(T{}); - } - } else { - auto* p = storage_.storage_ptr(); - for (size_t i = n; i < old_size; ++i) { - p[i].destroy(); - } - storage_.reduce_size(old_size - n); - } - } - - void resize(size_t n, const T& value) { - const size_t old_size = storage_.size_; - if (n > storage_.size_) { - storage_.bump_size(n - old_size); - auto* p = storage_.storage_ptr(); - for (size_t i = old_size; i < n; ++i) { - p[i].construct(value); - } - } else { - auto* p = storage_.storage_ptr(); - for (size_t i = n; i < old_size; ++i) { - p[i].destroy(); - } - storage_.reduce_size(old_size - n); - } - } - - private: - template - void init_by_copying(size_t n, InputIt src) { - storage_.bump_size(n); - auto* dest = storage_.storage_ptr(); - for (size_t i = 0; i < n; 
++i, ++src) { - dest[i].construct(*src); - } - } - - template - void init_by_moving(size_t n, InputIt src) { - init_by_copying(n, std::make_move_iterator(src)); - } - - template - void assign_by_copying(size_t n, InputIt src) { - const size_t old_size = storage_.size_; - if (n > old_size) { - storage_.bump_size(n - old_size); - auto* dest = storage_.storage_ptr(); - for (size_t i = 0; i < old_size; ++i, ++src) { - dest[i].assign(*src); - } - for (size_t i = old_size; i < n; ++i, ++src) { - dest[i].construct(*src); - } - } else { - auto* dest = storage_.storage_ptr(); - for (size_t i = 0; i < n; ++i, ++src) { - dest[i].assign(*src); - } - for (size_t i = n; i < old_size; ++i) { - dest[i].destroy(); - } - storage_.reduce_size(old_size - n); - } - } - - template - void assign_by_moving(size_t n, InputIt src) { - assign_by_copying(n, std::make_move_iterator(src)); - } -}; - -template -using StaticVector = StaticVectorImpl>; - -template -using SmallVector = StaticVectorImpl>; - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/sort.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/sort.h deleted file mode 100644 index cdffe0b23..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/sort.h +++ /dev/null @@ -1,78 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include -#include -#include -#include - -namespace arrow { -namespace internal { - -template > -std::vector ArgSort(const std::vector& values, Cmp&& cmp = {}) { - std::vector indices(values.size()); - std::iota(indices.begin(), indices.end(), 0); - std::sort(indices.begin(), indices.end(), - [&](int64_t i, int64_t j) -> bool { return cmp(values[i], values[j]); }); - return indices; -} - -template -size_t Permute(const std::vector& indices, std::vector* values) { - if (indices.size() <= 1) { - return indices.size(); - } - - // mask indicating which of values are in the correct location - std::vector sorted(indices.size(), false); - - size_t cycle_count = 0; - - for (auto cycle_start = sorted.begin(); cycle_start != sorted.end(); - cycle_start = std::find(cycle_start, sorted.end(), false)) { - ++cycle_count; - - // position in which an element belongs WRT sort - auto sort_into = static_cast(cycle_start - sorted.begin()); - - if (indices[sort_into] == sort_into) { - // trivial cycle - sorted[sort_into] = true; - continue; - } - - // resolve this cycle - const auto end = sort_into; - for (int64_t take_from = indices[sort_into]; take_from != end; - take_from = indices[sort_into]) { - std::swap(values->at(sort_into), values->at(take_from)); - sorted[sort_into] = true; - sort_into = take_from; - } - sorted[sort_into] = true; - } - - return cycle_count; -} - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/spaced.h 
b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/spaced.h deleted file mode 100644 index 8265e1d22..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/spaced.h +++ /dev/null @@ -1,98 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include - -#include "arrow/util/bit_run_reader.h" - -namespace arrow { -namespace util { -namespace internal { - -/// \brief Compress the buffer to spaced, excluding the null entries. -/// -/// \param[in] src the source buffer -/// \param[in] num_values the size of source buffer -/// \param[in] valid_bits bitmap data indicating position of valid slots -/// \param[in] valid_bits_offset offset into valid_bits -/// \param[out] output the output buffer spaced -/// \return The size of spaced buffer. 
-template -inline int SpacedCompress(const T* src, int num_values, const uint8_t* valid_bits, - int64_t valid_bits_offset, T* output) { - int num_valid_values = 0; - - arrow::internal::SetBitRunReader reader(valid_bits, valid_bits_offset, num_values); - while (true) { - const auto run = reader.NextRun(); - if (run.length == 0) { - break; - } - std::memcpy(output + num_valid_values, src + run.position, run.length * sizeof(T)); - num_valid_values += static_cast(run.length); - } - - return num_valid_values; -} - -/// \brief Relocate values in buffer into positions of non-null values as indicated by -/// a validity bitmap. -/// -/// \param[in, out] buffer the in-place buffer -/// \param[in] num_values total size of buffer including null slots -/// \param[in] null_count number of null slots -/// \param[in] valid_bits bitmap data indicating position of valid slots -/// \param[in] valid_bits_offset offset into valid_bits -/// \return The number of values expanded, including nulls. -template -inline int SpacedExpand(T* buffer, int num_values, int null_count, - const uint8_t* valid_bits, int64_t valid_bits_offset) { - // Point to end as we add the spacing from the back. 
- int idx_decode = num_values - null_count; - - // Depending on the number of nulls, some of the value slots in buffer may - // be uninitialized, and this will cause valgrind warnings / potentially UB - std::memset(static_cast(buffer + idx_decode), 0, null_count * sizeof(T)); - if (idx_decode == 0) { - // All nulls, nothing more to do - return num_values; - } - - arrow::internal::ReverseSetBitRunReader reader(valid_bits, valid_bits_offset, - num_values); - while (true) { - const auto run = reader.NextRun(); - if (run.length == 0) { - break; - } - idx_decode -= static_cast(run.length); - assert(idx_decode >= 0); - std::memmove(buffer + run.position, buffer + idx_decode, run.length * sizeof(T)); - } - - // Otherwise caller gave an incorrect null_count - assert(idx_decode == 0); - return num_values; -} - -} // namespace internal -} // namespace util -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/span.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/span.h deleted file mode 100644 index 4254fec75..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/span.h +++ /dev/null @@ -1,132 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#pragma once - -#include -#include -#include -#include -#include - -namespace arrow::util { - -/// std::span polyfill. -/// -/// Does not support static extents. -template -class span { - static_assert(sizeof(T), - R"( -std::span allows contiguous_iterators instead of just pointers, the enforcement -of which requires T to be a complete type. arrow::util::span does not support -contiguous_iterators, but T is still required to be a complete type to prevent -writing code which would break when it is replaced by std::span.)"); - - public: - using element_type = T; - using value_type = std::remove_cv_t; - using iterator = T*; - using const_iterator = T const*; - - span() = default; - span(const span&) = default; - span& operator=(const span&) = default; - - template >> - // NOLINTNEXTLINE runtime/explicit - constexpr span(span mut) : span{mut.data(), mut.size()} {} - - constexpr span(T* data, size_t count) : data_{data}, size_{count} {} - - constexpr span(T* begin, T* end) - : data_{begin}, size_{static_cast(end - begin)} {} - - template < - typename R, - typename DisableUnlessConstructibleFromDataAndSize = - decltype(span(std::data(std::declval()), std::size(std::declval()))), - typename DisableUnlessSimilarTypes = std::enable_if_t()))>>, - std::decay_t>>> - // NOLINTNEXTLINE runtime/explicit, non-const reference - constexpr span(R&& range) : span{std::data(range), std::size(range)} {} - - constexpr T* begin() const { return data_; } - constexpr T* end() const { return data_ + size_; } - constexpr T* data() const { return data_; } - - constexpr size_t size() const { return size_; } - constexpr size_t size_bytes() const { return size_ * sizeof(T); } - constexpr bool empty() const { return size_ == 0; } - - constexpr T& operator[](size_t i) { return data_[i]; } - constexpr const T& operator[](size_t i) const { return data_[i]; } - - constexpr span subspan(size_t offset) const { - if (offset > size_) return {data_, data_}; - return {data_ + offset, size_ - offset}; - } 
- - constexpr span subspan(size_t offset, size_t count) const { - auto out = subspan(offset); - if (count < out.size_) { - out.size_ = count; - } - return out; - } - - constexpr bool operator==(span const& other) const { - if (size_ != other.size_) return false; - - if constexpr (std::is_integral_v) { - if (size_ == 0) { - return true; // memcmp does not handle null pointers, even if size_ == 0 - } - return std::memcmp(data_, other.data_, size_bytes()) == 0; - } else { - T* ptr = data_; - for (T const& e : other) { - if (*ptr++ != e) return false; - } - return true; - } - } - constexpr bool operator!=(span const& other) const { return !(*this == other); } - - private: - T* data_{}; - size_t size_{}; -}; - -template -span(R& range) -> span>; - -template -span(T*, size_t) -> span; - -template -constexpr span as_bytes(span s) { - return {reinterpret_cast(s.data()), s.size_bytes()}; -} - -template -constexpr span as_writable_bytes(span s) { - return {reinterpret_cast(s.data()), s.size_bytes()}; -} - -} // namespace arrow::util diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/stopwatch.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/stopwatch.h deleted file mode 100644 index db4e67f59..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/stopwatch.h +++ /dev/null @@ -1,48 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include - -namespace arrow { -namespace internal { - -class StopWatch { - // This clock should give us wall clock time - using ClockType = std::chrono::steady_clock; - - public: - StopWatch() {} - - void Start() { start_ = ClockType::now(); } - - // Returns time in nanoseconds. - uint64_t Stop() { - auto stop = ClockType::now(); - std::chrono::nanoseconds d = stop - start_; - assert(d.count() >= 0); - return static_cast(d.count()); - } - - private: - std::chrono::time_point start_; -}; - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/string.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/string.h deleted file mode 100644 index d7e377773..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/string.h +++ /dev/null @@ -1,173 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include -#include -#include -#include -#include - -#if __has_include() -#include -#endif - -#include "arrow/result.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class Status; - -ARROW_EXPORT std::string HexEncode(const uint8_t* data, size_t length); - -ARROW_EXPORT std::string Escape(const char* data, size_t length); - -ARROW_EXPORT std::string HexEncode(const char* data, size_t length); - -ARROW_EXPORT std::string HexEncode(std::string_view str); - -ARROW_EXPORT std::string Escape(std::string_view str); - -ARROW_EXPORT Status ParseHexValue(const char* hex_pair, uint8_t* out); - -ARROW_EXPORT Status ParseHexValues(std::string_view hex_string, uint8_t* out); - -namespace internal { - -/// Like std::string_view::starts_with in C++20 -inline bool StartsWith(std::string_view s, std::string_view prefix) { - return s.length() >= prefix.length() && - (s.empty() || s.substr(0, prefix.length()) == prefix); -} - -/// Like std::string_view::ends_with in C++20 -inline bool EndsWith(std::string_view s, std::string_view suffix) { - return s.length() >= suffix.length() && - (s.empty() || s.substr(s.length() - suffix.length()) == suffix); -} - -/// \brief Split a string with a delimiter -ARROW_EXPORT -std::vector SplitString(std::string_view v, char delim, - int64_t limit = 0); - -/// \brief Join strings with a delimiter -ARROW_EXPORT -std::string JoinStrings(const std::vector& strings, - std::string_view delimiter); - -/// \brief Join strings with a delimiter -ARROW_EXPORT -std::string JoinStrings(const 
std::vector& strings, - std::string_view delimiter); - -/// \brief Trim whitespace from left and right sides of string -ARROW_EXPORT -std::string TrimString(std::string value); - -ARROW_EXPORT -bool AsciiEqualsCaseInsensitive(std::string_view left, std::string_view right); - -ARROW_EXPORT -std::string AsciiToLower(std::string_view value); - -ARROW_EXPORT -std::string AsciiToUpper(std::string_view value); - -/// \brief Search for the first instance of a token and replace it or return nullopt if -/// the token is not found. -ARROW_EXPORT -std::optional Replace(std::string_view s, std::string_view token, - std::string_view replacement); - -/// \brief Get boolean value from string -/// -/// If "1", "true" (case-insensitive), returns true -/// If "0", "false" (case-insensitive), returns false -/// Otherwise, returns Status::Invalid -ARROW_EXPORT -arrow::Result ParseBoolean(std::string_view value); - -#if __has_include() - -namespace detail { -template -struct can_to_chars : public std::false_type {}; - -template -struct can_to_chars< - T, std::void_t(), std::declval(), - std::declval>()))>> - : public std::true_type {}; -} // namespace detail - -/// \brief Whether std::to_chars exists for the current value type. -/// -/// This is useful as some C++ libraries do not implement all specified overloads -/// for std::to_chars. -template -inline constexpr bool have_to_chars = detail::can_to_chars::value; - -/// \brief An ergonomic wrapper around std::to_chars, returning a std::string -/// -/// For most inputs, the std::string result will not incur any heap allocation -/// thanks to small string optimization. -/// -/// Compared to std::to_string, this function gives locale-agnostic results -/// and might also be faster. -template -std::string ToChars(T value, Args&&... args) { - if constexpr (!have_to_chars) { - // Some C++ standard libraries do not yet implement std::to_chars for all types, - // in which case we have to fallback to std::string. 
- return std::to_string(value); - } else { - // According to various sources, the GNU libstdc++ and Microsoft's C++ STL - // allow up to 15 bytes of small string optimization, while clang's libc++ - // goes up to 22 bytes. Choose the pessimistic value. - std::string out(15, 0); - auto res = std::to_chars(&out.front(), &out.back(), value, args...); - while (res.ec != std::errc{}) { - assert(res.ec == std::errc::value_too_large); - out.resize(out.capacity() * 2); - res = std::to_chars(&out.front(), &out.back(), value, args...); - } - const auto length = res.ptr - out.data(); - assert(length <= static_cast(out.length())); - out.resize(length); - return out; - } -} - -#else // !__has_include() - -template -inline constexpr bool have_to_chars = false; - -template -std::string ToChars(T value, Args&&... args) { - return std::to_string(value); -} - -#endif - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/string_builder.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/string_builder.h deleted file mode 100644 index 7c05ccd51..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/string_builder.h +++ /dev/null @@ -1,84 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. 
See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include -#include - -#include "arrow/util/visibility.h" - -namespace arrow { -namespace util { - -namespace detail { - -class ARROW_EXPORT StringStreamWrapper { - public: - StringStreamWrapper(); - ~StringStreamWrapper(); - - std::ostream& stream() { return ostream_; } - std::string str(); - - protected: - std::unique_ptr sstream_; - std::ostream& ostream_; -}; - -} // namespace detail - -template -void StringBuilderRecursive(std::ostream& stream, Head&& head) { - stream << head; -} - -template -void StringBuilderRecursive(std::ostream& stream, Head&& head, Tail&&... tail) { - StringBuilderRecursive(stream, std::forward(head)); - StringBuilderRecursive(stream, std::forward(tail)...); -} - -template -std::string StringBuilder(Args&&... args) { - detail::StringStreamWrapper ss; - StringBuilderRecursive(ss.stream(), std::forward(args)...); - return ss.str(); -} - -/// CRTP helper for declaring string representation. Defines operator<< -template -class ToStringOstreamable { - public: - ~ToStringOstreamable() { - static_assert( - std::is_same().ToString()), std::string>::value, - "ToStringOstreamable depends on the method T::ToString() const"); - } - - private: - const T& cast() const { return static_cast(*this); } - - friend inline std::ostream& operator<<(std::ostream& os, const ToStringOstreamable& t) { - return os << t.cast().ToString(); - } -}; - -} // namespace util -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/task_group.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/task_group.h deleted file mode 100644 index 7c05ccd51..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/task_group.h +++ /dev/null @@ -1,106 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements.
See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include - -#include "arrow/status.h" -#include "arrow/type_fwd.h" -#include "arrow/util/cancel.h" -#include "arrow/util/functional.h" -#include "arrow/util/macros.h" -#include "arrow/util/type_fwd.h" -#include "arrow/util/visibility.h" - -namespace arrow { -namespace internal { - -/// \brief A group of related tasks -/// -/// A TaskGroup executes tasks with the signature `Status()`. -/// Execution can be serial or parallel, depending on the TaskGroup -/// implementation. When Finish() returns, it is guaranteed that all -/// tasks have finished, or at least one has errored. -/// -/// Once an error has occurred any tasks that are submitted to the task group -/// will not run. The call to Append will simply return without scheduling the -/// task. -/// -/// If the task group is parallel it is possible that multiple tasks could be -/// running at the same time and one of those tasks fails. This will put the -/// task group in a failure state (so additional tasks cannot be run) however -/// it will not interrupt running tasks. Finish will not complete -/// until all running tasks have finished, even if one task fails. -/// -/// Once a task group has finished new tasks may not be added to it. 
If you need to start -/// a new batch of work then you should create a new task group. -class ARROW_EXPORT TaskGroup : public std::enable_shared_from_this { - public: - /// Add a Status-returning function to execute. Execution order is - /// undefined. The function may be executed immediately or later. - template - void Append(Function&& func) { - return AppendReal(std::forward(func)); - } - - /// Wait for execution of all tasks (and subgroups) to be finished, - /// or for at least one task (or subgroup) to error out. - /// The returned Status propagates the error status of the first failing - /// task (or subgroup). - virtual Status Finish() = 0; - - /// Returns a future that will complete the first time all tasks are finished. - /// This should be called only after all top level tasks - /// have been added to the task group. - /// - /// If you are using a TaskGroup asynchronously there are a few considerations to keep - /// in mind. The tasks should not block on I/O, etc (defeats the purpose of using - /// futures) and should not be doing any nested locking or you run the risk of the tasks - /// getting stuck in the thread pool waiting for tasks which cannot get scheduled. - /// - /// Primarily this call is intended to help migrate existing work written with TaskGroup - /// in mind to using futures without having to do a complete conversion on the first - /// pass. - virtual Future<> FinishAsync() = 0; - - /// The current aggregate error Status. Non-blocking, useful for stopping early. - virtual Status current_status() = 0; - - /// Whether some tasks have already failed. Non-blocking, useful for stopping early. - virtual bool ok() const = 0; - - /// How many tasks can typically be executed in parallel. - /// This is only a hint, useful for testing or debugging. 
- virtual int parallelism() = 0; - - static std::shared_ptr MakeSerial(StopToken = StopToken::Unstoppable()); - static std::shared_ptr MakeThreaded(internal::Executor*, - StopToken = StopToken::Unstoppable()); - - virtual ~TaskGroup() = default; - - protected: - TaskGroup() = default; - ARROW_DISALLOW_COPY_AND_ASSIGN(TaskGroup); - - virtual void AppendReal(FnOnce task) = 0; -}; - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/tdigest.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/tdigest.h deleted file mode 100644 index 308df4688..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/tdigest.h +++ /dev/null @@ -1,104 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -// approximate quantiles from arbitrary length dataset with O(1) space -// based on 'Computing Extremely Accurate Quantiles Using t-Digests' from Dunning & Ertl -// - https://arxiv.org/abs/1902.04023 -// - https://github.com/tdunning/t-digest - -#pragma once - -#include -#include -#include - -#include "arrow/util/logging.h" -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class Status; - -namespace internal { - -class ARROW_EXPORT TDigest { - public: - explicit TDigest(uint32_t delta = 100, uint32_t buffer_size = 500); - ~TDigest(); - TDigest(TDigest&&); - TDigest& operator=(TDigest&&); - - // reset and re-use this tdigest - void Reset(); - - // validate data integrity - Status Validate() const; - - // dump internal data, only for debug - void Dump() const; - - // buffer a single data point, consume internal buffer if full - // this function is intensively called and performance critical - // call it only if you are sure no NAN exists in input data - void Add(double value) { - DCHECK(!std::isnan(value)) << "cannot add NAN"; - if (ARROW_PREDICT_FALSE(input_.size() == input_.capacity())) { - MergeInput(); - } - input_.push_back(value); - } - - // skip NAN on adding - template - typename std::enable_if::value>::type NanAdd(T value) { - if (!std::isnan(value)) Add(value); - } - - template - typename std::enable_if::value>::type NanAdd(T value) { - Add(static_cast(value)); - } - - // merge with other t-digests, called infrequently - void Merge(const std::vector& others); - void Merge(const TDigest& other); - - // calculate quantile - double Quantile(double q) const; - - double Min() const { return Quantile(0); } - double Max() const { return Quantile(1); } - double Mean() const; - - // check if this tdigest contains no valid data points - bool is_empty() const; - - private: - // merge input data with current tdigest - void MergeInput() const; - - // input buffer, size = buffer_size * sizeof(double) - mutable std::vector 
input_; - - // hide other members with pimpl - class TDigestImpl; - std::unique_ptr impl_; -}; - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/test_common.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/test_common.h deleted file mode 100644 index 511daed1e..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/test_common.h +++ /dev/null @@ -1,90 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#pragma once - -#include - -#include "arrow/testing/gtest_util.h" -#include "arrow/util/iterator.h" - -namespace arrow { - -struct TestInt { - TestInt(); - TestInt(int i); // NOLINT runtime/explicit - int value; - - bool operator==(const TestInt& other) const; - - friend std::ostream& operator<<(std::ostream& os, const TestInt& v); -}; - -template <> -struct IterationTraits { - static TestInt End() { return TestInt(); } - static bool IsEnd(const TestInt& val) { return val == IterationTraits::End(); } -}; - -struct TestStr { - TestStr(); - TestStr(const std::string& s); // NOLINT runtime/explicit - TestStr(const char* s); // NOLINT runtime/explicit - explicit TestStr(const TestInt& test_int); - std::string value; - - bool operator==(const TestStr& other) const; - - friend std::ostream& operator<<(std::ostream& os, const TestStr& v); -}; - -template <> -struct IterationTraits { - static TestStr End() { return TestStr(); } - static bool IsEnd(const TestStr& val) { return val == IterationTraits::End(); } -}; - -std::vector RangeVector(unsigned int max, unsigned int step = 1); - -template -inline Iterator VectorIt(std::vector v) { - return MakeVectorIterator(std::move(v)); -} - -template -inline Iterator PossiblySlowVectorIt(std::vector v, bool slow = false) { - auto iterator = MakeVectorIterator(std::move(v)); - if (slow) { - return MakeTransformedIterator(std::move(iterator), - [](T item) -> Result> { - SleepABit(); - return TransformYield(item); - }); - } else { - return iterator; - } -} - -template -inline void AssertIteratorExhausted(Iterator& it) { - ASSERT_OK_AND_ASSIGN(T next, it.Next()); - ASSERT_TRUE(IsIterationEnd(next)); -} - -Transformer MakeFilter(std::function filter); - -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/thread_pool.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/thread_pool.h deleted file mode 100644 index 44b1e227b..000000000 --- 
a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/thread_pool.h +++ /dev/null @@ -1,620 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include -#include -#include -#include - -#include "arrow/result.h" -#include "arrow/status.h" -#include "arrow/util/cancel.h" -#include "arrow/util/config.h" -#include "arrow/util/functional.h" -#include "arrow/util/future.h" -#include "arrow/util/iterator.h" -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -#if defined(_MSC_VER) -// Disable harmless warning for decorated name length limit -#pragma warning(disable : 4503) -#endif - -namespace arrow { - -/// \brief Get the capacity of the global thread pool -/// -/// Return the number of worker threads in the thread pool to which -/// Arrow dispatches various CPU-bound tasks. This is an ideal number, -/// not necessarily the exact number of threads at a given point in time. -/// -/// You can change this number using SetCpuThreadPoolCapacity(). 
-ARROW_EXPORT int GetCpuThreadPoolCapacity(); - -/// \brief Set the capacity of the global thread pool -/// -/// Set the number of worker threads int the thread pool to which -/// Arrow dispatches various CPU-bound tasks. -/// -/// The current number is returned by GetCpuThreadPoolCapacity(). -ARROW_EXPORT Status SetCpuThreadPoolCapacity(int threads); - -namespace internal { - -// Hints about a task that may be used by an Executor. -// They are ignored by the provided ThreadPool implementation. -struct TaskHints { - // The lower, the more urgent - int32_t priority = 0; - // The IO transfer size in bytes - int64_t io_size = -1; - // The approximate CPU cost in number of instructions - int64_t cpu_cost = -1; - // An application-specific ID - int64_t external_id = -1; -}; - -class ARROW_EXPORT Executor { - public: - using StopCallback = internal::FnOnce; - - virtual ~Executor(); - - // Spawn a fire-and-forget task. - template - Status Spawn(Function&& func) { - return SpawnReal(TaskHints{}, std::forward(func), StopToken::Unstoppable(), - StopCallback{}); - } - template - Status Spawn(Function&& func, StopToken stop_token) { - return SpawnReal(TaskHints{}, std::forward(func), std::move(stop_token), - StopCallback{}); - } - template - Status Spawn(TaskHints hints, Function&& func) { - return SpawnReal(hints, std::forward(func), StopToken::Unstoppable(), - StopCallback{}); - } - template - Status Spawn(TaskHints hints, Function&& func, StopToken stop_token) { - return SpawnReal(hints, std::forward(func), std::move(stop_token), - StopCallback{}); - } - template - Status Spawn(TaskHints hints, Function&& func, StopToken stop_token, - StopCallback stop_callback) { - return SpawnReal(hints, std::forward(func), std::move(stop_token), - std::move(stop_callback)); - } - - // Transfers a future to this executor. Any continuations added to the - // returned future will run in this executor. Otherwise they would run - // on the same thread that called MarkFinished. 
- // - // This is necessary when (for example) an I/O task is completing a future. - // The continuations of that future should run on the CPU thread pool keeping - // CPU heavy work off the I/O thread pool. So the I/O task should transfer - // the future to the CPU executor before returning. - // - // By default this method will only transfer if the future is not already completed. If - // the future is already completed then any callback would be run synchronously and so - // no transfer is typically necessary. However, in cases where you want to force a - // transfer (e.g. to help the scheduler break up units of work across multiple cores) - // then you can override this behavior with `always_transfer`. - template - Future Transfer(Future future) { - return DoTransfer(std::move(future), false); - } - - // Overload of Transfer which will always schedule callbacks on new threads even if the - // future is finished when the callback is added. - // - // This can be useful in cases where you want to ensure parallelism - template - Future TransferAlways(Future future) { - return DoTransfer(std::move(future), true); - } - - // Submit a callable and arguments for execution. Return a future that - // will return the callable's result value once. - // The callable's arguments are copied before execution. - template > - Result Submit(TaskHints hints, StopToken stop_token, Function&& func, - Args&&... 
args) { - using ValueType = typename FutureType::ValueType; - - auto future = FutureType::Make(); - auto task = std::bind(::arrow::detail::ContinueFuture{}, future, - std::forward(func), std::forward(args)...); - struct { - WeakFuture weak_fut; - - void operator()(const Status& st) { - auto fut = weak_fut.get(); - if (fut.is_valid()) { - fut.MarkFinished(st); - } - } - } stop_callback{WeakFuture(future)}; - ARROW_RETURN_NOT_OK(SpawnReal(hints, std::move(task), std::move(stop_token), - std::move(stop_callback))); - - return future; - } - - template > - Result Submit(StopToken stop_token, Function&& func, Args&&... args) { - return Submit(TaskHints{}, stop_token, std::forward(func), - std::forward(args)...); - } - - template > - Result Submit(TaskHints hints, Function&& func, Args&&... args) { - return Submit(std::move(hints), StopToken::Unstoppable(), - std::forward(func), std::forward(args)...); - } - - template > - Result Submit(Function&& func, Args&&... args) { - return Submit(TaskHints{}, StopToken::Unstoppable(), std::forward(func), - std::forward(args)...); - } - - // Return the level of parallelism (the number of tasks that may be executed - // concurrently). This may be an approximate number. - virtual int GetCapacity() = 0; - - // Return true if the thread from which this function is called is owned by this - // Executor. Returns false if this Executor does not support this property. - virtual bool OwnsThisThread() { return false; } - - // Return true if this is the current executor being called - // n.b. 
this defaults to just calling OwnsThisThread - // unless the threadpool is disabled - virtual bool IsCurrentExecutor() { return OwnsThisThread(); } - - /// \brief An interface to represent something with a custom destructor - /// - /// \see KeepAlive - class ARROW_EXPORT Resource { - public: - virtual ~Resource() = default; - }; - - /// \brief Keep a resource alive until all executor threads have terminated - /// - /// Executors may have static storage duration. In particular, the CPU and I/O - /// executors are currently implemented this way. These threads may access other - /// objects with static storage duration such as the OpenTelemetry runtime context - /// the default memory pool, or other static executors. - /// - /// The order in which these objects are destroyed is difficult to control. In order - /// to ensure those objects remain alive until all threads have finished those objects - /// should be wrapped in a Resource object and passed into this method. The given - /// shared_ptr will be kept alive until all threads have finished their worker loops. - virtual void KeepAlive(std::shared_ptr resource); - - protected: - ARROW_DISALLOW_COPY_AND_ASSIGN(Executor); - - Executor() = default; - - template , typename FTSync = typename FT::SyncType> - Future DoTransfer(Future future, bool always_transfer = false) { - auto transferred = Future::Make(); - if (always_transfer) { - CallbackOptions callback_options = CallbackOptions::Defaults(); - callback_options.should_schedule = ShouldSchedule::Always; - callback_options.executor = this; - auto sync_callback = [transferred](const FTSync& result) mutable { - transferred.MarkFinished(result); - }; - future.AddCallback(sync_callback, callback_options); - return transferred; - } - - // We could use AddCallback's ShouldSchedule::IfUnfinished but we can save a bit of - // work by doing the test here. 
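The `Spawn`/`Submit` overloads above all funnel into one implementation that packages a callable into a task which completes a future. Standard C++ expresses the same contract with `std::async`/`std::future`; the sketch below illustrates that pattern only and is not Arrow's implementation (`Square` and `SubmitAndWait` are illustrative names):

```cpp
#include <future>

int Square(int x) { return x * x; }

// Submit a callable for execution and wait on the resulting future.
// std::async plays the role that SpawnReal plays above: it runs the
// callable (possibly on another thread) and completes the returned future.
int SubmitAndWait(int value) {
    std::future<int> fut = std::async(std::launch::async, Square, value);
    return fut.get();  // block until the task has produced its result
}
```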
- auto callback = [this, transferred](const FTSync& result) mutable { - auto spawn_status = - Spawn([transferred, result]() mutable { transferred.MarkFinished(result); }); - if (!spawn_status.ok()) { - transferred.MarkFinished(spawn_status); - } - }; - auto callback_factory = [&callback]() { return callback; }; - if (future.TryAddCallback(callback_factory)) { - return transferred; - } - // If the future is already finished and we aren't going to force spawn a thread - // then we don't need to add another layer of callback and can return the original - // future - return future; - } - - // Subclassing API - virtual Status SpawnReal(TaskHints hints, FnOnce task, StopToken, - StopCallback&&) = 0; -}; - -/// \brief An executor implementation that runs all tasks on a single thread using an -/// event loop. -/// -/// Note: Any sort of nested parallelism will deadlock this executor. Blocking waits are -/// fine but if one task needs to wait for another task it must be expressed as an -/// asynchronous continuation. -class ARROW_EXPORT SerialExecutor : public Executor { - public: - template - using TopLevelTask = internal::FnOnce(Executor*)>; - - ~SerialExecutor() override; - - int GetCapacity() override { return 1; }; - bool OwnsThisThread() override; - Status SpawnReal(TaskHints hints, FnOnce task, StopToken, - StopCallback&&) override; - - // Return the number of tasks either running or in the queue. - int GetNumTasks(); - - /// \brief Runs the TopLevelTask and any scheduled tasks - /// - /// The TopLevelTask (or one of the tasks it schedules) must either return an invalid - /// status or call the finish signal. Failure to do this will result in a deadlock. For - /// this reason it is preferable (if possible) to use the helper methods (below) - /// RunSynchronously/RunSerially which delegates the responsibility onto a Future - /// producer's existing responsibility to always mark a future finished (which can - /// someday be aided by ARROW-12207). 
- template , - typename FTSync = typename FT::SyncType> - static FTSync RunInSerialExecutor(TopLevelTask initial_task) { - Future fut = SerialExecutor().Run(std::move(initial_task)); - return FutureToSync(fut); - } - - /// \brief Transform an AsyncGenerator into an Iterator - /// - /// An event loop will be created and each call to Next will power the event loop with - /// the calling thread until the next item is ready to be delivered. - /// - /// Note: The iterator's destructor will run until the given generator is fully - /// exhausted. If you wish to abandon iteration before completion then the correct - /// approach is to use a stop token to cause the generator to exhaust early. - template - static Iterator IterateGenerator( - internal::FnOnce()>>(Executor*)> initial_task) { - auto serial_executor = std::unique_ptr(new SerialExecutor()); - auto maybe_generator = std::move(initial_task)(serial_executor.get()); - if (!maybe_generator.ok()) { - return MakeErrorIterator(maybe_generator.status()); - } - auto generator = maybe_generator.MoveValueUnsafe(); - struct SerialIterator { - SerialIterator(std::unique_ptr executor, - std::function()> generator) - : executor(std::move(executor)), generator(std::move(generator)) {} - ARROW_DISALLOW_COPY_AND_ASSIGN(SerialIterator); - ARROW_DEFAULT_MOVE_AND_ASSIGN(SerialIterator); - ~SerialIterator() { - // A serial iterator must be consumed before it can be destroyed. Allowing it to - // do otherwise would lead to resource leakage. There will likely be deadlocks at - // this spot in the future but these will be the result of other bugs and not the - // fact that we are forcing consumption here. - - // If a streaming API needs to support early abandonment then it should be done so - // with a cancellation token and not simply discarding the iterator and expecting - // the underlying work to clean up correctly. 
- if (executor && !executor->IsFinished()) { - while (true) { - Result maybe_next = Next(); - if (!maybe_next.ok() || IsIterationEnd(*maybe_next)) { - break; - } - } - } - } - - Result Next() { - executor->Unpause(); - // This call may lead to tasks being scheduled in the serial executor - Future next_fut = generator(); - next_fut.AddCallback([this](const Result& res) { - // If we're done iterating we should drain the rest of the tasks in the executor - if (!res.ok() || IsIterationEnd(*res)) { - executor->Finish(); - return; - } - // Otherwise we will break out immediately, leaving the remaining tasks for - // the next call. - executor->Pause(); - }); -#ifdef ARROW_ENABLE_THREADING - // future must run on this thread - // Borrow this thread and run tasks until the future is finished - executor->RunLoop(); -#else - next_fut.Wait(); -#endif - if (!next_fut.is_finished()) { - // Not clear this is possible since RunLoop wouldn't generally exit - // unless we paused/finished which would imply next_fut has been - // finished. - return Status::Invalid( - "Serial executor terminated before next result computed"); - } - // At this point we may still have tasks in the executor, that is ok. - // We will run those tasks the next time through. - return next_fut.result(); - } - - std::unique_ptr executor; - std::function()> generator; - }; - return Iterator(SerialIterator{std::move(serial_executor), std::move(generator)}); - } - -#ifndef ARROW_ENABLE_THREADING - // run a pending task from loop - // returns true if any tasks were run in the last go round the loop (i.e. 
if it - // returns false, all executors are waiting) - static bool RunTasksOnAllExecutors(); - static SerialExecutor* GetCurrentExecutor(); - - bool IsCurrentExecutor() override; - -#endif - - protected: - virtual void RunLoop(); - - // State uses mutex - struct State; - std::shared_ptr state_; - - SerialExecutor(); - - // We mark the serial executor "finished" when there should be - // no more tasks scheduled on it. It's not strictly needed but - // can help catch bugs where we are trying to use the executor - // after we are done with it. - void Finish(); - bool IsFinished(); - // We pause the executor when we are running an async generator - // and we have received an item that we can deliver. - void Pause(); - void Unpause(); - - template ::SyncType> - Future Run(TopLevelTask initial_task) { - auto final_fut = std::move(initial_task)(this); - final_fut.AddCallback([this](const FTSync&) { Finish(); }); - RunLoop(); - return final_fut; - } - -#ifndef ARROW_ENABLE_THREADING - // we have to run tasks from all live executors - // during RunLoop if we don't have threading - static std::unordered_set all_executors; - // a pointer to the last one called by the loop - // so all tasks get spawned equally - // on multiple calls to RunTasksOnAllExecutors - static SerialExecutor* last_called_executor; - // without threading we can't tell which executor called the - // current process - so we set it in spawning the task - static SerialExecutor* current_executor; -#endif // ARROW_ENABLE_THREADING -}; - -#ifdef ARROW_ENABLE_THREADING - -/// An Executor implementation spawning tasks in FIFO manner on a fixed-size -/// pool of worker threads. -/// -/// Note: Any sort of nested parallelism will deadlock this executor. Blocking waits are -/// fine but if one task needs to wait for another task it must be expressed as an -/// asynchronous continuation. 
-class ARROW_EXPORT ThreadPool : public Executor { - public: - // Construct a thread pool with the given number of worker threads - static Result> Make(int threads); - - // Like Make(), but takes care that the returned ThreadPool is compatible - // with destruction late at process exit. - static Result> MakeEternal(int threads); - - // Destroy thread pool; the pool will first be shut down - ~ThreadPool() override; - - // Return the desired number of worker threads. - // The actual number of workers may lag a bit before being adjusted to - // match this value. - int GetCapacity() override; - - // Return the number of tasks either running or in the queue. - int GetNumTasks(); - - bool OwnsThisThread() override; - // Dynamically change the number of worker threads. - // - // This function always returns immediately. - // If fewer threads are running than this number, new threads are spawned - // on-demand when needed for task execution. - // If more threads are running than this number, excess threads are reaped - // as soon as possible. - Status SetCapacity(int threads); - - // Heuristic for the default capacity of a thread pool for CPU-bound tasks. - // This is exposed as a static method to help with testing. - static int DefaultCapacity(); - - // Shutdown the pool. Once the pool starts shutting down, new tasks - // cannot be submitted anymore. - // If "wait" is true, shutdown waits for all pending tasks to be finished. - // If "wait" is false, workers are stopped as soon as currently executing - // tasks are finished. 
- Status Shutdown(bool wait = true); - - // Wait for the thread pool to become idle - // - // This is useful for sequencing tests - void WaitForIdle(); - - void KeepAlive(std::shared_ptr resource) override; - - struct State; - - protected: - FRIEND_TEST(TestThreadPool, SetCapacity); - FRIEND_TEST(TestGlobalThreadPool, Capacity); - ARROW_FRIEND_EXPORT friend ThreadPool* GetCpuThreadPool(); - - ThreadPool(); - - Status SpawnReal(TaskHints hints, FnOnce task, StopToken, - StopCallback&&) override; - - // Collect finished worker threads, making sure the OS threads have exited - void CollectFinishedWorkersUnlocked(); - // Launch a given number of additional workers - void LaunchWorkersUnlocked(int threads); - // Get the current actual capacity - int GetActualCapacity(); - - static std::shared_ptr MakeCpuThreadPool(); - - std::shared_ptr sp_state_; - State* state_; - bool shutdown_on_destroy_; -}; -#else // ARROW_ENABLE_THREADING -// an executor implementation which pretends to be a thread pool but runs everything -// on the main thread using a static queue (shared between all thread pools, otherwise -// cross-threadpool dependencies will break everything) -class ARROW_EXPORT ThreadPool : public SerialExecutor { - public: - ARROW_FRIEND_EXPORT friend ThreadPool* GetCpuThreadPool(); - - static Result> Make(int threads); - - // Like Make(), but takes care that the returned ThreadPool is compatible - // with destruction late at process exit. - static Result> MakeEternal(int threads); - - // Destroy thread pool; the pool will first be shut down - ~ThreadPool() override; - - // Return the desired number of worker threads. - // The actual number of workers may lag a bit before being adjusted to - // match this value. - int GetCapacity() override; - - virtual int GetActualCapacity(); - - bool OwnsThisThread() override { return true; } - - // Dynamically change the number of worker threads. 
- // without threading this is equal to the - // number of tasks that can be running at once - // (inside each other) - Status SetCapacity(int threads); - - static int DefaultCapacity() { return 8; } - - // Shutdown the pool. Once the pool starts shutting down, new tasks - // cannot be submitted anymore. - // If "wait" is true, shutdown waits for all pending tasks to be finished. - // If "wait" is false, workers are stopped as soon as currently executing - // tasks are finished. - Status Shutdown(bool wait = true); - - // Wait for the thread pool to become idle - // - // This is useful for sequencing tests - void WaitForIdle(); - - protected: - static std::shared_ptr MakeCpuThreadPool(); - ThreadPool(); -}; - -#endif // ARROW_ENABLE_THREADING - -// Return the process-global thread pool for CPU-bound tasks. -ARROW_EXPORT ThreadPool* GetCpuThreadPool(); - -/// \brief Potentially run an async operation serially (if use_threads is false) -/// \see RunSerially -/// -/// If `use_threads` is true, the global CPU executor is used. -/// If `use_threads` is false, a temporary SerialExecutor is used. -/// `get_future` is called (from this thread) with the chosen executor and must -/// return a future that will eventually finish. This function returns once the -/// future has finished. -template -typename Fut::SyncType RunSynchronously(FnOnce get_future, - bool use_threads) { - if (use_threads) { - auto fut = std::move(get_future)(GetCpuThreadPool()); - return FutureToSync(fut); - } else { - return SerialExecutor::RunInSerialExecutor(std::move(get_future)); - } -} - -/// \brief Potentially iterate an async generator serially (if use_threads is false) -/// \see IterateGenerator -/// -/// If `use_threads` is true, the global CPU executor will be used. Each call to -/// the iterator will simply wait until the next item is available. Tasks may run in -/// the background between calls. -/// -/// If `use_threads` is false, the calling thread only will be used. 
Each call to -/// the iterator will use the calling thread to do enough work to generate one item. -/// Tasks will be left in a queue until the next call and no work will be done between -/// calls. -template -Iterator IterateSynchronously( - FnOnce()>>(Executor*)> get_gen, bool use_threads) { - if (use_threads) { - auto maybe_gen = std::move(get_gen)(GetCpuThreadPool()); - if (!maybe_gen.ok()) { - return MakeErrorIterator(maybe_gen.status()); - } - return MakeGeneratorIterator(*maybe_gen); - } else { - return SerialExecutor::IterateGenerator(std::move(get_gen)); - } -} - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/time.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/time.h deleted file mode 100644 index 981eab596..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/time.h +++ /dev/null @@ -1,83 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
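The `use_threads` switch described above can be sketched in plain C++: the same task either runs on a worker thread (caller waits on a future) or inline on the calling thread. This is an illustration of the dispatch pattern, not Arrow's `RunSynchronously`; the names below are made up for the example.

```cpp
#include <future>

// Run a task either on a worker thread or serially on the caller.
int RunMaybeThreaded(int (*task)(), bool use_threads) {
    if (use_threads) {
        // Threaded path: a worker runs the task, the caller waits on it.
        std::future<int> fut = std::async(std::launch::async, task);
        return fut.get();
    }
    // Serial path: the calling thread does the work itself.
    return task();
}

int FortyTwo() { return 42; }
```

Either path returns the same result; only where the work happens differs, which is exactly the contract the `use_threads` flag promises.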
- -#pragma once - -#include -#include -#include - -#include "arrow/type_fwd.h" -#include "arrow/util/visibility.h" - -namespace arrow { -namespace util { - -enum DivideOrMultiply { - MULTIPLY, - DIVIDE, -}; - -ARROW_EXPORT -std::pair GetTimestampConversion(TimeUnit::type in_unit, - TimeUnit::type out_unit); - -// Converts a Timestamp value into another Timestamp value. -// -// This function takes care of properly transforming from one unit to another. -// -// \param[in] in the input type. Must be TimestampType. -// \param[in] out the output type. Must be TimestampType. -// \param[in] value the input value. -// -// \return The converted value, or an error. -ARROW_EXPORT Result ConvertTimestampValue(const std::shared_ptr& in, - const std::shared_ptr& out, - int64_t value); - -template -decltype(std::declval()(std::chrono::seconds{}, std::declval()...)) -VisitDuration(TimeUnit::type unit, Visitor&& visitor, Args&&... args) { - switch (unit) { - default: - case TimeUnit::SECOND: - break; - case TimeUnit::MILLI: - return visitor(std::chrono::milliseconds{}, std::forward(args)...); - case TimeUnit::MICRO: - return visitor(std::chrono::microseconds{}, std::forward(args)...); - case TimeUnit::NANO: - return visitor(std::chrono::nanoseconds{}, std::forward(args)...); - } - return visitor(std::chrono::seconds{}, std::forward(args)...); -} - -/// Convert a count of seconds to the corresponding count in a different TimeUnit -struct CastSecondsToUnitImpl { - template - int64_t operator()(Duration, int64_t seconds) { - auto duration = std::chrono::duration_cast(std::chrono::seconds{seconds}); - return static_cast(duration.count()); - } -}; - -inline int64_t CastSecondsToUnit(TimeUnit::type unit, int64_t seconds) { - return VisitDuration(unit, CastSecondsToUnitImpl{}, seconds); -} - -} // namespace util -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/tracing.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/tracing.h 
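The `CastSecondsToUnit` helper above dispatches on a runtime time unit and applies `std::chrono::duration_cast`. The core conversion can be sketched as follows — a simplified stand-in with `TimeUnit` spelled as a plain enum, not the Arrow type:

```cpp
#include <chrono>
#include <cstdint>

// Simplified stand-in for TimeUnit::type.
enum class Unit { Second, Milli, Micro, Nano };

// Convert a count of seconds to the corresponding count in another unit,
// dispatching a runtime unit tag to the right duration_cast.
constexpr std::int64_t SecondsToUnit(Unit unit, std::int64_t s) {
    using std::chrono::duration_cast;
    const std::chrono::seconds secs{s};
    switch (unit) {
        case Unit::Milli: return duration_cast<std::chrono::milliseconds>(secs).count();
        case Unit::Micro: return duration_cast<std::chrono::microseconds>(secs).count();
        case Unit::Nano:  return duration_cast<std::chrono::nanoseconds>(secs).count();
        default:          return secs.count();
    }
}
```

`duration_cast` carries the unit arithmetic, which is why the visitor in the deleted header only needs to pick the target duration type.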
deleted file mode 100644 index d78082564..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/tracing.h +++ /dev/null @@ -1,45 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include - -#include "arrow/util/visibility.h" - -namespace arrow { -namespace util { -namespace tracing { - -class ARROW_EXPORT SpanDetails { - public: - virtual ~SpanDetails() {} -}; - -class ARROW_EXPORT Span { - public: - Span() noexcept; - /// True if this span has been started with START_SPAN - bool valid() const; - /// End the span early - void reset(); - std::unique_ptr details; -}; - -} // namespace tracing -} // namespace util -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/trie.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/trie.h deleted file mode 100644 index 7815d4d1e..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/trie.h +++ /dev/null @@ -1,243 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. 
The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "arrow/status.h" -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -namespace arrow { -namespace internal { - -// A non-zero-terminated small string class. -// std::string usually has a small string optimization -// (see review at https://shaharmike.com/cpp/std-string/) -// but this one allows tight control and optimization of memory layout. 
-template -class SmallString { - public: - SmallString() : length_(0) {} - - template - SmallString(const T& v) { // NOLINT implicit constructor - *this = std::string_view(v); - } - - SmallString& operator=(const std::string_view s) { -#ifndef NDEBUG - CheckSize(s.size()); -#endif - length_ = static_cast(s.size()); - std::memcpy(data_, s.data(), length_); - return *this; - } - - SmallString& operator=(const std::string& s) { - *this = std::string_view(s); - return *this; - } - - SmallString& operator=(const char* s) { - *this = std::string_view(s); - return *this; - } - - explicit operator std::string_view() const { return std::string_view(data_, length_); } - - const char* data() const { return data_; } - size_t length() const { return length_; } - bool empty() const { return length_ == 0; } - char operator[](size_t pos) const { -#ifdef NDEBUG - assert(pos <= length_); -#endif - return data_[pos]; - } - - SmallString substr(size_t pos) const { - return SmallString(std::string_view(*this).substr(pos)); - } - - SmallString substr(size_t pos, size_t count) const { - return SmallString(std::string_view(*this).substr(pos, count)); - } - - template - bool operator==(T&& other) const { - return std::string_view(*this) == std::string_view(std::forward(other)); - } - - template - bool operator!=(T&& other) const { - return std::string_view(*this) != std::string_view(std::forward(other)); - } - - protected: - uint8_t length_; - char data_[N]; - - void CheckSize(size_t n) { assert(n <= N); } -}; - -template -std::ostream& operator<<(std::ostream& os, const SmallString& str) { - return os << std::string_view(str); -} - -// A trie class for byte strings, optimized for small sets of short strings. -// This class is immutable by design, use a TrieBuilder to construct it. 
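The `SmallString<N>` class above trades `std::string`'s flexibility for a fixed inline buffer and a one-byte length, giving tight control over memory layout. A stripped-down sketch of the same idea (illustrative only; `FixedString` is not the Arrow type, and it assumes the input fits in the buffer):

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Fixed-capacity, non-zero-terminated small string: one length byte
// plus an inline character buffer, no heap allocation.
template <std::size_t N>
struct FixedString {
    std::uint8_t length_ = 0;
    char data_[N] = {};

    // Assumes s.size() <= N (the real class checks this in debug builds).
    constexpr FixedString(std::string_view s) {
        length_ = static_cast<std::uint8_t>(s.size());
        for (std::size_t i = 0; i < s.size(); ++i) data_[i] = s[i];
    }
    constexpr std::string_view view() const { return {data_, length_}; }
    constexpr bool empty() const { return length_ == 0; }
};
```

Because the buffer is inline and the length fits in one byte, `sizeof(FixedString<N>)` is fixed at compile time — the property the trie below relies on to pack each node into exactly 16 bytes.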
-class ARROW_EXPORT Trie { - using index_type = int16_t; - using fast_index_type = int_fast16_t; - static constexpr auto kMaxIndex = std::numeric_limits::max(); - - public: - Trie() : size_(0) {} - Trie(Trie&&) = default; - Trie& operator=(Trie&&) = default; - - int32_t Find(std::string_view s) const { - const Node* node = &nodes_[0]; - fast_index_type pos = 0; - if (s.length() > static_cast(kMaxIndex)) { - return -1; - } - fast_index_type remaining = static_cast(s.length()); - - while (remaining > 0) { - auto substring_length = node->substring_length(); - if (substring_length > 0) { - auto substring_data = node->substring_data(); - if (remaining < substring_length) { - // Input too short - return -1; - } - for (fast_index_type i = 0; i < substring_length; ++i) { - if (s[pos++] != substring_data[i]) { - // Mismatching substring - return -1; - } - --remaining; - } - if (remaining == 0) { - // Matched node exactly - return node->found_index_; - } - } - // Lookup child using next input character - if (node->child_lookup_ == -1) { - // Input too long - return -1; - } - auto c = static_cast(s[pos++]); - --remaining; - auto child_index = lookup_table_[node->child_lookup_ * 256 + c]; - if (child_index == -1) { - // Child not found - return -1; - } - node = &nodes_[child_index]; - } - - // Input exhausted - if (node->substring_.empty()) { - // Matched node exactly - return node->found_index_; - } else { - return -1; - } - } - - Status Validate() const; - - void Dump() const; - - protected: - static constexpr size_t kNodeSize = 16; - static constexpr auto kMaxSubstringLength = - kNodeSize - 2 * sizeof(index_type) - sizeof(int8_t); - - struct Node { - // If this node is a valid end of string, index of found string, otherwise -1 - index_type found_index_; - // Base index for child lookup in lookup_table_ (-1 if no child nodes) - index_type child_lookup_; - // The substring for this node. 
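The `Find` walk above follows one byte of input per step until the string is exhausted or no child matches. A minimal sketch of that lookup structure — assumed layout only, using a per-node `std::map` where Arrow packs 16-byte nodes with substring compression and a flat 256-entry child table:

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal byte-trie: each node maps the next character to a child index,
// and found_index == -1 marks "not a complete entry".
struct TrieSketch {
    struct Node {
        int found_index = -1;
        std::map<char, int> children;  // Arrow uses a flat 256-entry table
    };
    std::vector<Node> nodes;

    TrieSketch() : nodes(1) {}  // entry 0 is the root node

    void Add(const std::string& s, int index) {
        int cur = 0;
        for (char c : s) {
            auto it = nodes[cur].children.find(c);
            if (it == nodes[cur].children.end()) {
                const int next = static_cast<int>(nodes.size());
                nodes[cur].children[c] = next;
                nodes.emplace_back();
                cur = next;
            } else {
                cur = it->second;
            }
        }
        nodes[cur].found_index = index;
    }

    int Find(const std::string& s) const {
        int cur = 0;
        for (char c : s) {
            auto it = nodes[cur].children.find(c);
            if (it == nodes[cur].children.end()) return -1;  // no such child
            cur = it->second;
        }
        return nodes[cur].found_index;  // -1 unless s is a complete entry
    }
};
```

Note how a strict prefix of an inserted string still lands on a node but reports `-1`, matching the distinction the real `Find` draws between "matched node exactly" and merely running out of input.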
- SmallString substring_; - - fast_index_type substring_length() const { - return static_cast(substring_.length()); - } - const char* substring_data() const { return substring_.data(); } - }; - - static_assert(sizeof(Node) == kNodeSize, "Unexpected node size"); - - ARROW_DISALLOW_COPY_AND_ASSIGN(Trie); - - void Dump(const Node* node, const std::string& indent) const; - - // Node table: entry 0 is the root node - std::vector nodes_; - - // Indexed lookup structure: gives index in node table, or -1 if not found - std::vector lookup_table_; - - // Number of entries - index_type size_; - - friend class TrieBuilder; -}; - -class ARROW_EXPORT TrieBuilder { - using index_type = Trie::index_type; - using fast_index_type = Trie::fast_index_type; - - public: - TrieBuilder(); - Status Append(std::string_view s, bool allow_duplicate = false); - Trie Finish(); - - protected: - // Extend the lookup table by 256 entries, return the index of the new span - Status ExtendLookupTable(index_type* out_lookup_index); - // Split the node given by the index at the substring index `split_at` - Status SplitNode(fast_index_type node_index, fast_index_type split_at); - // Append an already constructed child node to the parent - Status AppendChildNode(Trie::Node* parent, uint8_t ch, Trie::Node&& node); - // Create a matching child node from this parent - Status CreateChildNode(Trie::Node* parent, uint8_t ch, std::string_view substring); - Status CreateChildNode(Trie::Node* parent, char ch, std::string_view substring); - - Trie trie_; - - static constexpr auto kMaxIndex = std::numeric_limits::max(); -}; - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/type_fwd.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/type_fwd.h deleted file mode 100644 index 6d904f19b..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/type_fwd.h +++ /dev/null @@ -1,69 +0,0 @@ -// Licensed to the Apache Software 
Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -namespace arrow { - -namespace internal { -struct Empty; -} // namespace internal - -template -class WeakFuture; -class FutureWaiter; - -class TimestampParser; - -namespace internal { - -class Executor; -class TaskGroup; -class ThreadPool; -class CpuInfo; - -namespace tracing { - -struct Scope; - -} // namespace tracing -} // namespace internal - -struct Compression { - /// \brief Compression algorithm - enum type { - UNCOMPRESSED, - SNAPPY, - GZIP, - BROTLI, - ZSTD, - LZ4, - LZ4_FRAME, - LZO, - BZ2, - LZ4_HADOOP - }; -}; - -namespace util { -class AsyncTaskScheduler; -class Compressor; -class Decompressor; -class Codec; -} // namespace util - -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/type_traits.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/type_traits.h deleted file mode 100644 index c19061524..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/type_traits.h +++ /dev/null @@ -1,46 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. 
See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include - -namespace arrow { -namespace internal { - -/// \brief Metafunction to allow checking if a type matches any of another set of types -template -struct IsOneOf : std::false_type {}; /// Base case: nothing has matched - -template -struct IsOneOf { - /// Recursive case: T == U or T matches any other types provided (not including U). - static constexpr bool value = std::is_same::value || IsOneOf::value; -}; - -/// \brief Shorthand for using IsOneOf + std::enable_if -template -using EnableIfIsOneOf = typename std::enable_if::value, T>::type; - -/// \brief is_null_pointer from C++17 -template -struct is_null_pointer : std::is_same::type> { -}; - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/ubsan.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/ubsan.h deleted file mode 100644 index 900d8011d..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/ubsan.h +++ /dev/null @@ -1,87 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. 
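The `IsOneOf` metafunction deleted above lost its template parameter lists in this hunk; as a standalone sketch (my reconstruction of the same variadic recursion, not the vendored text verbatim), the technique looks like this:

```cpp
#include <cassert>
#include <type_traits>

// Base case: an empty candidate list matches nothing.
template <typename T, typename... Args>
struct IsOneOf : std::false_type {};

// Recursive case: T matches if it equals the head U or any type in the tail.
template <typename T, typename U, typename... Args>
struct IsOneOf<T, U, Args...>
    : std::integral_constant<bool, std::is_same<T, U>::value ||
                                       IsOneOf<T, Args...>::value> {};

// Shorthand combining IsOneOf with std::enable_if, as in the header.
template <typename T, typename... Args>
using EnableIfIsOneOf = typename std::enable_if<IsOneOf<T, Args...>::value, T>::type;

static_assert(IsOneOf<int, float, int, char>::value, "int is in the list");
static_assert(!IsOneOf<double, float, int, char>::value, "double is not");
```

Each instantiation peels one candidate type off the pack, so the check is resolved entirely at compile time.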
The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -// Contains utilities for making UBSan happy. - -#pragma once - -#include -#include -#include - -#include "arrow/util/macros.h" - -namespace arrow { -namespace util { - -namespace internal { - -constexpr uint8_t kNonNullFiller = 0; - -} // namespace internal - -/// \brief Returns maybe_null if not null or a non-null pointer to an arbitrary memory -/// that shouldn't be dereferenced. -/// -/// Memset/Memcpy are undefined when a nullptr is passed as an argument use this utility -/// method to wrap locations where this could happen. -/// -/// Note: Flatbuffers has UBSan warnings if a zero length vector is passed. -/// https://github.com/google/flatbuffers/pull/5355 is trying to resolve -/// them. 
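The note above concerns keeping memcpy/memset away from null pointers; the surrounding SafeLoad/SafeStore helpers rely on the related trick of routing every unaligned or type-punned access through `std::memcpy`, which is well-defined where a cast-and-dereference would not be. A minimal standalone sketch of that pattern (my names, not Arrow's API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Read a T from a byte buffer that may not satisfy T's alignment.
// std::memcpy is well-defined here, whereas casting the pointer to
// `const T*` and dereferencing it would be undefined behavior.
template <typename T>
T LoadUnaligned(const uint8_t* p) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy punning requires a trivially copyable type");
  T ret;
  std::memcpy(&ret, p, sizeof(T));
  return ret;
}

// The matching store: write T into a possibly unaligned destination.
template <typename T>
void StoreUnaligned(uint8_t* p, T value) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy punning requires a trivially copyable type");
  std::memcpy(p, &value, sizeof(T));
}
```

Compilers recognize this idiom and typically emit a single (unaligned) load or store instruction, so there is no runtime cost for the added safety.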
-template <typename T>
-inline T* MakeNonNull(T* maybe_null = NULLPTR) {
-  if (ARROW_PREDICT_TRUE(maybe_null != NULLPTR)) {
-    return maybe_null;
-  }
-
-  return const_cast<T*>(reinterpret_cast<const T*>(&internal::kNonNullFiller));
-}
-
-template <typename T>
-inline std::enable_if_t<std::is_trivially_copyable_v<T>, T> SafeLoadAs(
-    const uint8_t* unaligned) {
-  std::remove_const_t<T> ret;
-  std::memcpy(&ret, unaligned, sizeof(T));
-  return ret;
-}
-
-template <typename T>
-inline std::enable_if_t<std::is_trivially_copyable_v<T>, T> SafeLoad(const T* unaligned) {
-  std::remove_const_t<T> ret;
-  std::memcpy(&ret, unaligned, sizeof(T));
-  return ret;
-}
-
-template <typename U, typename T>
-inline std::enable_if_t<std::is_trivially_copyable_v<T> &&
-                            std::is_trivially_copyable_v<U> && sizeof(T) == sizeof(U),
-                        U>
-SafeCopy(T value) {
-  std::remove_const_t<U> ret;
-  std::memcpy(&ret, &value, sizeof(T));
-  return ret;
-}
-
-template <typename T>
-inline std::enable_if_t<std::is_trivially_copyable_v<T>, void> SafeStore(void* unaligned,
-                                                                         T value) {
-  std::memcpy(unaligned, &value, sizeof(T));
-}
-
-}  // namespace util
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/union_util.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/union_util.h
deleted file mode 100644
index 0f30d5a32..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/union_util.h
+++ /dev/null
@@ -1,31 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#include <cstdint>
-#include "arrow/array/data.h"
-
-namespace arrow {
-namespace union_util {
-
-/// \brief Compute the number of of logical nulls in a sparse union array
-int64_t LogicalSparseUnionNullCount(const ArraySpan& span);
-
-/// \brief Compute the number of of logical nulls in a dense union array
-int64_t LogicalDenseUnionNullCount(const ArraySpan& span);
-
-}  // namespace union_util
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/unreachable.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/unreachable.h
deleted file mode 100644
index d2e383e71..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/unreachable.h
+++ /dev/null
@@ -1,30 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#pragma once
-
-#include "arrow/util/visibility.h"
-
-#include <string_view>
-
-namespace arrow {
-
-[[noreturn]] ARROW_EXPORT void Unreachable(const char* message = "Unreachable");
-
-[[noreturn]] ARROW_EXPORT void Unreachable(std::string_view message);
-
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/uri.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/uri.h
deleted file mode 100644
index 855a61408..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/uri.h
+++ /dev/null
@@ -1,118 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#pragma once
-
-#include <cstdint>
-#include <memory>
-#include <string>
-#include <string_view>
-#include <utility>
-#include <vector>
-
-#include "arrow/type_fwd.h"
-#include "arrow/util/visibility.h"
-
-namespace arrow {
-namespace internal {
-
-/// \brief A parsed URI
-class ARROW_EXPORT Uri {
- public:
-  Uri();
-  ~Uri();
-  Uri(Uri&&);
-  Uri& operator=(Uri&&);
-
-  // XXX Should we use std::string_view instead?  These functions are
-  // not performance-critical.
-
-  /// The URI scheme, such as "http", or the empty string if the URI has no
-  /// explicit scheme.
- std::string scheme() const; - - /// Convenience function that returns true if the scheme() is "file" - bool is_file_scheme() const; - - /// Whether the URI has an explicit host name. This may return true if - /// the URI has an empty host (e.g. "file:///tmp/foo"), while it returns - /// false is the URI has not host component at all (e.g. "file:/tmp/foo"). - bool has_host() const; - /// The URI host name, such as "localhost", "127.0.0.1" or "::1", or the empty - /// string is the URI does not have a host component. - std::string host() const; - - /// The URI port number, as a string such as "80", or the empty string is the URI - /// does not have a port number component. - std::string port_text() const; - /// The URI port parsed as an integer, or -1 if the URI does not have a port - /// number component. - int32_t port() const; - - /// The username specified in the URI. - std::string username() const; - /// The password specified in the URI. - std::string password() const; - - /// The URI path component. - std::string path() const; - - /// The URI query string - std::string query_string() const; - - /// The URI query items - /// - /// Note this API doesn't allow differentiating between an empty value - /// and a missing value, such in "a&b=1" vs. "a=&b=1". - Result>> query_items() const; - - /// Get the string representation of this URI. - const std::string& ToString() const; - - /// Factory function to parse a URI from its string representation. - Status Parse(const std::string& uri_string); - - private: - struct Impl; - std::unique_ptr impl_; -}; - -/// Percent-encode the input string, for use e.g. as a URI query parameter. -/// -/// This will escape directory separators, making this function unsuitable -/// for encoding URI paths directly. See UriFromAbsolutePath() instead. 
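UriEscape's contract, described above (escape everything unsafe in a query parameter, including directory separators), can be illustrated with a toy percent-encoder. This is a sketch of the general RFC 3986 rule, not Arrow's implementation:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Percent-encode every byte outside the RFC 3986 "unreserved" set as %XX.
// Note that '/' is escaped too, which is why this style of encoder is
// unsuitable for URI paths.
std::string PercentEncode(std::string_view s) {
  static const char* kHex = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : s) {
    // Unreserved characters pass through untouched.
    bool unreserved =
        std::isalnum(c) != 0 || c == '-' || c == '_' || c == '.' || c == '~';
    if (unreserved) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back('%');
      out.push_back(kHex[c >> 4]);
      out.push_back(kHex[c & 0xF]);
    }
  }
  return out;
}
```

For example, `PercentEncode("a b/c")` yields `"a%20b%2Fc"`: both the space and the directory separator are escaped.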
-ARROW_EXPORT -std::string UriEscape(std::string_view s); - -ARROW_EXPORT -std::string UriUnescape(std::string_view s); - -/// Encode a host for use within a URI, such as "localhost", -/// "127.0.0.1", or "[::1]". -ARROW_EXPORT -std::string UriEncodeHost(std::string_view host); - -/// Whether the string is a syntactically valid URI scheme according to RFC 3986. -ARROW_EXPORT -bool IsValidUriScheme(std::string_view s); - -/// Create a file uri from a given absolute path -ARROW_EXPORT -Result UriFromAbsolutePath(std::string_view path); - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/utf8.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/utf8.h deleted file mode 100644 index ca93fab5b..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/utf8.h +++ /dev/null @@ -1,59 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#pragma once - -#include -#include -#include -#include - -#include "arrow/type_fwd.h" -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -namespace arrow { -namespace util { - -// Convert a UTF8 string to a wstring (either UTF16 or UTF32, depending -// on the wchar_t width). 
-ARROW_EXPORT Result<std::wstring> UTF8ToWideString(std::string_view source);
-
-// Similarly, convert a wstring to a UTF8 string.
-ARROW_EXPORT Result<std::string> WideStringToUTF8(const std::wstring& source);
-
-// Convert UTF8 string to a UTF16 string.
-ARROW_EXPORT Result<std::u16string> UTF8StringToUTF16(std::string_view source);
-
-// Convert UTF16 string to a UTF8 string.
-ARROW_EXPORT Result<std::string> UTF16StringToUTF8(std::u16string_view source);
-
-// This function needs to be called before doing UTF8 validation.
-ARROW_EXPORT void InitializeUTF8();
-
-ARROW_EXPORT bool ValidateUTF8(const uint8_t* data, int64_t size);
-
-ARROW_EXPORT bool ValidateUTF8(std::string_view str);
-
-// Skip UTF8 byte order mark, if any.
-ARROW_EXPORT
-Result<const uint8_t*> SkipUTF8BOM(const uint8_t* data, int64_t size);
-
-static constexpr uint32_t kMaxUnicodeCodepoint = 0x110000;
-
-}  // namespace util
-}  // namespace arrow
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/value_parsing.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/value_parsing.h
deleted file mode 100644
index b3c711840..000000000
--- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/value_parsing.h
+++ /dev/null
@@ -1,928 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
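SkipUTF8BOM, declared above, only has to recognize the three-byte sequence EF BB BF at the start of a buffer. A freestanding sketch of that check (simplified: it returns the adjusted pointer directly rather than Arrow's Result wrapper):

```cpp
#include <cassert>
#include <cstdint>

// If `data` begins with the UTF-8 byte order mark (EF BB BF), return a
// pointer just past it; otherwise return `data` unchanged.
inline const uint8_t* SkipBOM(const uint8_t* data, int64_t size) {
  if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
    return data + 3;
  }
  return data;
}
```

The size guard matters: a two-byte buffer that happens to start with EF BB must not be read past its end.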
- -// This is a private header for string-to-number parsing utilities - -#pragma once - -#include -#include -#include -#include -#include -#include -#include -#include - -#include "arrow/type.h" -#include "arrow/type_traits.h" -#include "arrow/util/checked_cast.h" -#include "arrow/util/config.h" -#include "arrow/util/macros.h" -#include "arrow/util/time.h" -#include "arrow/util/visibility.h" -#include "arrow/vendored/datetime.h" -#include "arrow/vendored/strptime.h" - -namespace arrow { - -/// \brief A virtual string to timestamp parser -class ARROW_EXPORT TimestampParser { - public: - virtual ~TimestampParser() = default; - - virtual bool operator()(const char* s, size_t length, TimeUnit::type out_unit, - int64_t* out, - bool* out_zone_offset_present = NULLPTR) const = 0; - - virtual const char* kind() const = 0; - - virtual const char* format() const; - - /// \brief Create a TimestampParser that recognizes strptime-like format strings - static std::shared_ptr MakeStrptime(std::string format); - - /// \brief Create a TimestampParser that recognizes (locale-agnostic) ISO8601 - /// timestamps - static std::shared_ptr MakeISO8601(); -}; - -namespace internal { - -/// \brief The entry point for conversion from strings. -/// -/// Specializations of StringConverter for `ARROW_TYPE` must define: -/// - A default constructible member type `value_type` which will be yielded on a -/// successful parse. -/// - The static member function `Convert`, callable with signature -/// `(const ARROW_TYPE& t, const char* s, size_t length, value_type* out)`. -/// `Convert` returns truthy for successful parses and assigns the parsed values to -/// `*out`. Parameters required for parsing (for example a timestamp's TimeUnit) -/// are acquired from the type parameter `t`. 
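The `is_parseable` trait defined just below the StringConverter contract uses the pre-C++17 detection idiom: overload resolution between a templated `Test(U*)` that names `StringConverter<T>::value_type` and a variadic fallback. The same idiom in isolation, applied to a hypothetical trait of my own naming:

```cpp
#include <cassert>
#include <type_traits>

// A converter that advertises a value_type, and one that does not
// (both hypothetical stand-ins for StringConverter specializations).
struct IntConverter { using value_type = int; };
struct Unparseable {};

// Detection idiom: the Test(U*) overload participates only when
// U::value_type names a valid type; otherwise SFINAE discards it and
// the variadic fallback is chosen instead.
template <typename C>
struct has_value_type {
  template <typename U, typename V = typename U::value_type>
  static std::true_type Test(U*);

  template <typename U>
  static std::false_type Test(...);

  static constexpr bool value = decltype(Test<C>(nullptr))::value;
};

static_assert(has_value_type<IntConverter>::value, "value_type detected");
static_assert(!has_value_type<Unparseable>::value, "no value_type");
```

A pointer argument is a better match than `...`, so the fallback only wins when substitution into the first overload fails, which is exactly the condition being detected.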
-template -struct StringConverter; - -template -struct is_parseable { - template ::value_type> - static std::true_type Test(U*); - - template - static std::false_type Test(...); - - static constexpr bool value = decltype(Test(NULLPTR))::value; -}; - -template -using enable_if_parseable = enable_if_t::value, R>; - -template <> -struct StringConverter { - using value_type = bool; - - bool Convert(const BooleanType&, const char* s, size_t length, value_type* out) { - if (length == 1) { - // "0" or "1"? - if (s[0] == '0') { - *out = false; - return true; - } - if (s[0] == '1') { - *out = true; - return true; - } - return false; - } - if (length == 4) { - // "true"? - *out = true; - return ((s[0] == 't' || s[0] == 'T') && (s[1] == 'r' || s[1] == 'R') && - (s[2] == 'u' || s[2] == 'U') && (s[3] == 'e' || s[3] == 'E')); - } - if (length == 5) { - // "false"? - *out = false; - return ((s[0] == 'f' || s[0] == 'F') && (s[1] == 'a' || s[1] == 'A') && - (s[2] == 'l' || s[2] == 'L') && (s[3] == 's' || s[3] == 'S') && - (s[4] == 'e' || s[4] == 'E')); - } - return false; - } -}; - -// Ideas for faster float parsing: -// - http://rapidjson.org/md_doc_internals.html#ParsingDouble -// - https://github.com/google/double-conversion [used here] -// - https://github.com/achan001/dtoa-fast - -ARROW_EXPORT -bool StringToFloat(const char* s, size_t length, char decimal_point, float* out); - -ARROW_EXPORT -bool StringToFloat(const char* s, size_t length, char decimal_point, double* out); - -template <> -struct StringConverter { - using value_type = float; - - explicit StringConverter(char decimal_point = '.') : decimal_point(decimal_point) {} - - bool Convert(const FloatType&, const char* s, size_t length, value_type* out) { - return ARROW_PREDICT_TRUE(StringToFloat(s, length, decimal_point, out)); - } - - private: - const char decimal_point; -}; - -template <> -struct StringConverter { - using value_type = double; - - explicit StringConverter(char decimal_point = '.') : 
decimal_point(decimal_point) {} - - bool Convert(const DoubleType&, const char* s, size_t length, value_type* out) { - return ARROW_PREDICT_TRUE(StringToFloat(s, length, decimal_point, out)); - } - - private: - const char decimal_point; -}; - -// NOTE: HalfFloatType would require a half<->float conversion library - -inline uint8_t ParseDecimalDigit(char c) { return static_cast(c - '0'); } - -#define PARSE_UNSIGNED_ITERATION(C_TYPE) \ - if (length > 0) { \ - uint8_t digit = ParseDecimalDigit(*s++); \ - result = static_cast(result * 10U); \ - length--; \ - if (ARROW_PREDICT_FALSE(digit > 9U)) { \ - /* Non-digit */ \ - return false; \ - } \ - result = static_cast(result + digit); \ - } else { \ - break; \ - } - -#define PARSE_UNSIGNED_ITERATION_LAST(C_TYPE) \ - if (length > 0) { \ - if (ARROW_PREDICT_FALSE(result > std::numeric_limits::max() / 10U)) { \ - /* Overflow */ \ - return false; \ - } \ - uint8_t digit = ParseDecimalDigit(*s++); \ - result = static_cast(result * 10U); \ - C_TYPE new_result = static_cast(result + digit); \ - if (ARROW_PREDICT_FALSE(--length > 0)) { \ - /* Too many digits */ \ - return false; \ - } \ - if (ARROW_PREDICT_FALSE(digit > 9U)) { \ - /* Non-digit */ \ - return false; \ - } \ - if (ARROW_PREDICT_FALSE(new_result < result)) { \ - /* Overflow */ \ - return false; \ - } \ - result = new_result; \ - } - -inline bool ParseUnsigned(const char* s, size_t length, uint8_t* out) { - uint8_t result = 0; - - do { - PARSE_UNSIGNED_ITERATION(uint8_t); - PARSE_UNSIGNED_ITERATION(uint8_t); - PARSE_UNSIGNED_ITERATION_LAST(uint8_t); - } while (false); - *out = result; - return true; -} - -inline bool ParseUnsigned(const char* s, size_t length, uint16_t* out) { - uint16_t result = 0; - do { - PARSE_UNSIGNED_ITERATION(uint16_t); - PARSE_UNSIGNED_ITERATION(uint16_t); - PARSE_UNSIGNED_ITERATION(uint16_t); - PARSE_UNSIGNED_ITERATION(uint16_t); - PARSE_UNSIGNED_ITERATION_LAST(uint16_t); - } while (false); - *out = result; - return true; -} - -inline bool 
ParseUnsigned(const char* s, size_t length, uint32_t* out) { - uint32_t result = 0; - do { - PARSE_UNSIGNED_ITERATION(uint32_t); - PARSE_UNSIGNED_ITERATION(uint32_t); - PARSE_UNSIGNED_ITERATION(uint32_t); - PARSE_UNSIGNED_ITERATION(uint32_t); - PARSE_UNSIGNED_ITERATION(uint32_t); - - PARSE_UNSIGNED_ITERATION(uint32_t); - PARSE_UNSIGNED_ITERATION(uint32_t); - PARSE_UNSIGNED_ITERATION(uint32_t); - PARSE_UNSIGNED_ITERATION(uint32_t); - - PARSE_UNSIGNED_ITERATION_LAST(uint32_t); - } while (false); - *out = result; - return true; -} - -inline bool ParseUnsigned(const char* s, size_t length, uint64_t* out) { - uint64_t result = 0; - do { - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - PARSE_UNSIGNED_ITERATION(uint64_t); - - PARSE_UNSIGNED_ITERATION_LAST(uint64_t); - } while (false); - *out = result; - return true; -} - -#undef PARSE_UNSIGNED_ITERATION -#undef PARSE_UNSIGNED_ITERATION_LAST - -template -bool ParseHex(const char* s, size_t length, T* out) { - // lets make sure that the length of the string is not too big - if (!ARROW_PREDICT_TRUE(sizeof(T) * 2 >= length && length > 0)) { - return false; - } - T result = 0; - for (size_t i = 0; i < length; i++) { - result = static_cast(result << 4); - if (s[i] >= '0' && s[i] <= '9') { - result = static_cast(result | (s[i] - '0')); - } else if (s[i] >= 'A' && s[i] <= 'F') { - result = 
static_cast(result | (s[i] - 'A' + 10)); - } else if (s[i] >= 'a' && s[i] <= 'f') { - result = static_cast(result | (s[i] - 'a' + 10)); - } else { - /* Non-digit */ - return false; - } - } - *out = result; - return true; -} - -template -struct StringToUnsignedIntConverterMixin { - using value_type = typename ARROW_TYPE::c_type; - - bool Convert(const ARROW_TYPE&, const char* s, size_t length, value_type* out) { - if (ARROW_PREDICT_FALSE(length == 0)) { - return false; - } - // If it starts with 0x then its hex - if (length > 2 && s[0] == '0' && ((s[1] == 'x') || (s[1] == 'X'))) { - length -= 2; - s += 2; - - return ARROW_PREDICT_TRUE(ParseHex(s, length, out)); - } - // Skip leading zeros - while (length > 0 && *s == '0') { - length--; - s++; - } - return ParseUnsigned(s, length, out); - } -}; - -template <> -struct StringConverter : public StringToUnsignedIntConverterMixin { - using StringToUnsignedIntConverterMixin::StringToUnsignedIntConverterMixin; -}; - -template <> -struct StringConverter - : public StringToUnsignedIntConverterMixin { - using StringToUnsignedIntConverterMixin::StringToUnsignedIntConverterMixin; -}; - -template <> -struct StringConverter - : public StringToUnsignedIntConverterMixin { - using StringToUnsignedIntConverterMixin::StringToUnsignedIntConverterMixin; -}; - -template <> -struct StringConverter - : public StringToUnsignedIntConverterMixin { - using StringToUnsignedIntConverterMixin::StringToUnsignedIntConverterMixin; -}; - -template -struct StringToSignedIntConverterMixin { - using value_type = typename ARROW_TYPE::c_type; - using unsigned_type = typename std::make_unsigned::type; - - bool Convert(const ARROW_TYPE&, const char* s, size_t length, value_type* out) { - static constexpr auto max_positive = - static_cast(std::numeric_limits::max()); - // Assuming two's complement - static constexpr unsigned_type max_negative = max_positive + 1; - bool negative = false; - unsigned_type unsigned_value = 0; - - if (ARROW_PREDICT_FALSE(length == 
0)) { - return false; - } - // If it starts with 0x then its hex - if (length > 2 && s[0] == '0' && ((s[1] == 'x') || (s[1] == 'X'))) { - length -= 2; - s += 2; - - if (!ARROW_PREDICT_TRUE(ParseHex(s, length, &unsigned_value))) { - return false; - } - *out = static_cast(unsigned_value); - return true; - } - - if (*s == '-') { - negative = true; - s++; - if (--length == 0) { - return false; - } - } - // Skip leading zeros - while (length > 0 && *s == '0') { - length--; - s++; - } - if (!ARROW_PREDICT_TRUE(ParseUnsigned(s, length, &unsigned_value))) { - return false; - } - if (negative) { - if (ARROW_PREDICT_FALSE(unsigned_value > max_negative)) { - return false; - } - // To avoid both compiler warnings (with unsigned negation) - // and undefined behaviour (with signed negation overflow), - // use the expanded formula for 2's complement negation. - *out = static_cast(~unsigned_value + 1); - } else { - if (ARROW_PREDICT_FALSE(unsigned_value > max_positive)) { - return false; - } - *out = static_cast(unsigned_value); - } - return true; - } -}; - -template <> -struct StringConverter : public StringToSignedIntConverterMixin { - using StringToSignedIntConverterMixin::StringToSignedIntConverterMixin; -}; - -template <> -struct StringConverter : public StringToSignedIntConverterMixin { - using StringToSignedIntConverterMixin::StringToSignedIntConverterMixin; -}; - -template <> -struct StringConverter : public StringToSignedIntConverterMixin { - using StringToSignedIntConverterMixin::StringToSignedIntConverterMixin; -}; - -template <> -struct StringConverter : public StringToSignedIntConverterMixin { - using StringToSignedIntConverterMixin::StringToSignedIntConverterMixin; -}; - -namespace detail { - -// Inline-able ISO-8601 parser - -using ts_type = TimestampType::c_type; - -template -static inline bool ParseHH(const char* s, Duration* out) { - uint8_t hours = 0; - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 0, 2, &hours))) { - return false; - } - if 
(ARROW_PREDICT_FALSE(hours >= 24)) { - return false; - } - *out = std::chrono::duration_cast(std::chrono::hours(hours)); - return true; -} - -template -static inline bool ParseHH_MM(const char* s, Duration* out) { - uint8_t hours = 0; - uint8_t minutes = 0; - if (ARROW_PREDICT_FALSE(s[2] != ':')) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 0, 2, &hours))) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 3, 2, &minutes))) { - return false; - } - if (ARROW_PREDICT_FALSE(hours >= 24)) { - return false; - } - if (ARROW_PREDICT_FALSE(minutes >= 60)) { - return false; - } - *out = std::chrono::duration_cast(std::chrono::hours(hours) + - std::chrono::minutes(minutes)); - return true; -} - -template -static inline bool ParseHHMM(const char* s, Duration* out) { - uint8_t hours = 0; - uint8_t minutes = 0; - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 0, 2, &hours))) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 2, 2, &minutes))) { - return false; - } - if (ARROW_PREDICT_FALSE(hours >= 24)) { - return false; - } - if (ARROW_PREDICT_FALSE(minutes >= 60)) { - return false; - } - *out = std::chrono::duration_cast(std::chrono::hours(hours) + - std::chrono::minutes(minutes)); - return true; -} - -template -static inline bool ParseHH_MM_SS(const char* s, Duration* out) { - uint8_t hours = 0; - uint8_t minutes = 0; - uint8_t seconds = 0; - if (ARROW_PREDICT_FALSE(s[2] != ':') || ARROW_PREDICT_FALSE(s[5] != ':')) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 0, 2, &hours))) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 3, 2, &minutes))) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 6, 2, &seconds))) { - return false; - } - if (ARROW_PREDICT_FALSE(hours >= 24)) { - return false; - } - if (ARROW_PREDICT_FALSE(minutes >= 60)) { - return false; - } - if (ARROW_PREDICT_FALSE(seconds >= 60)) { - return false; - } - *out = 
std::chrono::duration_cast(std::chrono::hours(hours) + - std::chrono::minutes(minutes) + - std::chrono::seconds(seconds)); - return true; -} - -static inline bool ParseSubSeconds(const char* s, size_t length, TimeUnit::type unit, - uint32_t* out) { - // The decimal point has been peeled off at this point - - // Fail if number of decimal places provided exceeds what the unit can hold. - // Calculate how many trailing decimal places are omitted for the unit - // e.g. if 4 decimal places are provided and unit is MICRO, 2 are missing - size_t omitted = 0; - switch (unit) { - case TimeUnit::MILLI: - if (ARROW_PREDICT_FALSE(length > 3)) { - return false; - } - if (length < 3) { - omitted = 3 - length; - } - break; - case TimeUnit::MICRO: - if (ARROW_PREDICT_FALSE(length > 6)) { - return false; - } - if (length < 6) { - omitted = 6 - length; - } - break; - case TimeUnit::NANO: - if (ARROW_PREDICT_FALSE(length > 9)) { - return false; - } - if (length < 9) { - omitted = 9 - length; - } - break; - default: - return false; - } - - if (ARROW_PREDICT_TRUE(omitted == 0)) { - return ParseUnsigned(s, length, out); - } else { - uint32_t subseconds = 0; - bool success = ParseUnsigned(s, length, &subseconds); - if (ARROW_PREDICT_TRUE(success)) { - switch (omitted) { - case 1: - *out = subseconds * 10; - break; - case 2: - *out = subseconds * 100; - break; - case 3: - *out = subseconds * 1000; - break; - case 4: - *out = subseconds * 10000; - break; - case 5: - *out = subseconds * 100000; - break; - case 6: - *out = subseconds * 1000000; - break; - case 7: - *out = subseconds * 10000000; - break; - case 8: - *out = subseconds * 100000000; - break; - default: - // Impossible case - break; - } - return true; - } else { - return false; - } - } -} - -} // namespace detail - -template -static inline bool ParseYYYY_MM_DD(const char* s, Duration* since_epoch) { - uint16_t year = 0; - uint8_t month = 0; - uint8_t day = 0; - if (ARROW_PREDICT_FALSE(s[4] != '-') || ARROW_PREDICT_FALSE(s[7] != 
'-')) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 0, 4, &year))) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 5, 2, &month))) { - return false; - } - if (ARROW_PREDICT_FALSE(!ParseUnsigned(s + 8, 2, &day))) { - return false; - } - arrow_vendored::date::year_month_day ymd{arrow_vendored::date::year{year}, - arrow_vendored::date::month{month}, - arrow_vendored::date::day{day}}; - if (ARROW_PREDICT_FALSE(!ymd.ok())) return false; - - *since_epoch = std::chrono::duration_cast( - arrow_vendored::date::sys_days{ymd}.time_since_epoch()); - return true; -} - -static inline bool ParseTimestampISO8601(const char* s, size_t length, - TimeUnit::type unit, TimestampType::c_type* out, - bool* out_zone_offset_present = NULLPTR) { - using seconds_type = std::chrono::duration; - - // We allow the following zone offset formats: - // - (none) - // - Z - // - [+-]HH(:?MM)? - // - // We allow the following formats for all units: - // - "YYYY-MM-DD" - // - "YYYY-MM-DD[ T]hhZ?" - // - "YYYY-MM-DD[ T]hh:mmZ?" - // - "YYYY-MM-DD[ T]hh:mm:ssZ?" - // - // We allow the following formats for unit == MILLI, MICRO, or NANO: - // - "YYYY-MM-DD[ T]hh:mm:ss.s{1,3}Z?" - // - // We allow the following formats for unit == MICRO, or NANO: - // - "YYYY-MM-DD[ T]hh:mm:ss.s{4,6}Z?" - // - // We allow the following formats for unit == NANO: - // - "YYYY-MM-DD[ T]hh:mm:ss.s{7,9}Z?" - // - // UTC is always assumed, and the DataType's timezone is ignored. 
- // - - if (ARROW_PREDICT_FALSE(length < 10)) return false; - - seconds_type seconds_since_epoch; - if (ARROW_PREDICT_FALSE(!ParseYYYY_MM_DD(s, &seconds_since_epoch))) { - return false; - } - - if (length == 10) { - *out = util::CastSecondsToUnit(unit, seconds_since_epoch.count()); - return true; - } - - if (ARROW_PREDICT_FALSE(s[10] != ' ') && ARROW_PREDICT_FALSE(s[10] != 'T')) { - return false; - } - - if (out_zone_offset_present) { - *out_zone_offset_present = false; - } - - seconds_type zone_offset(0); - if (s[length - 1] == 'Z') { - --length; - if (out_zone_offset_present) *out_zone_offset_present = true; - } else if (s[length - 3] == '+' || s[length - 3] == '-') { - // [+-]HH - length -= 3; - if (ARROW_PREDICT_FALSE(!detail::ParseHH(s + length + 1, &zone_offset))) { - return false; - } - if (s[length] == '+') zone_offset *= -1; - if (out_zone_offset_present) *out_zone_offset_present = true; - } else if (s[length - 5] == '+' || s[length - 5] == '-') { - // [+-]HHMM - length -= 5; - if (ARROW_PREDICT_FALSE(!detail::ParseHHMM(s + length + 1, &zone_offset))) { - return false; - } - if (s[length] == '+') zone_offset *= -1; - if (out_zone_offset_present) *out_zone_offset_present = true; - } else if ((s[length - 6] == '+' || s[length - 6] == '-') && (s[length - 3] == ':')) { - // [+-]HH:MM - length -= 6; - if (ARROW_PREDICT_FALSE(!detail::ParseHH_MM(s + length + 1, &zone_offset))) { - return false; - } - if (s[length] == '+') zone_offset *= -1; - if (out_zone_offset_present) *out_zone_offset_present = true; - } - - seconds_type seconds_since_midnight; - switch (length) { - case 13: // YYYY-MM-DD[ T]hh - if (ARROW_PREDICT_FALSE(!detail::ParseHH(s + 11, &seconds_since_midnight))) { - return false; - } - break; - case 16: // YYYY-MM-DD[ T]hh:mm - if (ARROW_PREDICT_FALSE(!detail::ParseHH_MM(s + 11, &seconds_since_midnight))) { - return false; - } - break; - case 19: // YYYY-MM-DD[ T]hh:mm:ss - case 21: // YYYY-MM-DD[ T]hh:mm:ss.s - case 22: // YYYY-MM-DD[ T]hh:mm:ss.ss 
- case 23: // YYYY-MM-DD[ T]hh:mm:ss.sss - case 24: // YYYY-MM-DD[ T]hh:mm:ss.ssss - case 25: // YYYY-MM-DD[ T]hh:mm:ss.sssss - case 26: // YYYY-MM-DD[ T]hh:mm:ss.ssssss - case 27: // YYYY-MM-DD[ T]hh:mm:ss.sssssss - case 28: // YYYY-MM-DD[ T]hh:mm:ss.ssssssss - case 29: // YYYY-MM-DD[ T]hh:mm:ss.sssssssss - if (ARROW_PREDICT_FALSE(!detail::ParseHH_MM_SS(s + 11, &seconds_since_midnight))) { - return false; - } - break; - default: - return false; - } - - seconds_since_epoch += seconds_since_midnight; - seconds_since_epoch += zone_offset; - - if (length <= 19) { - *out = util::CastSecondsToUnit(unit, seconds_since_epoch.count()); - return true; - } - - if (ARROW_PREDICT_FALSE(s[19] != '.')) { - return false; - } - - uint32_t subseconds = 0; - if (ARROW_PREDICT_FALSE( - !detail::ParseSubSeconds(s + 20, length - 20, unit, &subseconds))) { - return false; - } - - *out = util::CastSecondsToUnit(unit, seconds_since_epoch.count()) + subseconds; - return true; -} - -#if defined(_WIN32) || defined(ARROW_WITH_MUSL) -static constexpr bool kStrptimeSupportsZone = false; -#else -static constexpr bool kStrptimeSupportsZone = true; -#endif - -/// \brief Returns time since the UNIX epoch in the requested unit -static inline bool ParseTimestampStrptime(const char* buf, size_t length, - const char* format, bool ignore_time_in_day, - bool allow_trailing_chars, TimeUnit::type unit, - int64_t* out) { - // NOTE: strptime() is more than 10x faster than arrow_vendored::date::parse(). 
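The `ParseTimestampISO8601` routine deleted above splits an ISO 8601 string into a date part (via `ParseYYYY_MM_DD`), a time-of-day part, an optional zone offset, and optional subseconds, all resolved against the UNIX epoch. A heavily simplified, hypothetical stand-in (names and the single accepted `"YYYY-MM-DD hh:mm:ss"` format are mine, not Arrow's; the real parser also handles a `T` separator, zone offsets, and fractional seconds) might look like:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Days between a proleptic-Gregorian civil date and 1970-01-01, using
// Howard Hinnant's algorithm (the same approach as the vendored date lib).
static int64_t days_from_civil(int y, int m, int d) {
    y -= m <= 2;
    const int era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);            // [0, 399]
    const unsigned doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  // [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // [0, 146096]
    return era * 146097LL + static_cast<int>(doe) - 719468;
}

// Hypothetical, simplified stand-in for the deleted parser: accepts only
// "YYYY-MM-DD hh:mm:ss" (UTC assumed) and yields seconds since the epoch.
static bool parse_timestamp_seconds(const char* s, int64_t* out) {
    int y, mo, d, h, mi, sec;
    if (std::sscanf(s, "%4d-%2d-%2d %2d:%2d:%2d", &y, &mo, &d, &h, &mi, &sec) != 6)
        return false;
    if (mo < 1 || mo > 12 || d < 1 || d > 31 || h > 23 || mi > 59 || sec > 59)
        return false;
    *out = days_from_civil(y, mo, d) * 86400LL + h * 3600 + mi * 60 + sec;
    return true;
}
```

Like the deleted code, all validation is done with early `return false` rather than exceptions, which is what makes the `ARROW_PREDICT_FALSE` branch hints worthwhile on the hot CSV-parsing path.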
- // The buffer may not be nul-terminated - std::string clean_copy(buf, length); - struct tm result; - memset(&result, 0, sizeof(struct tm)); -#ifdef _WIN32 - char* ret = arrow_strptime(clean_copy.c_str(), format, &result); -#else - char* ret = strptime(clean_copy.c_str(), format, &result); -#endif - if (ret == NULLPTR) { - return false; - } - if (!allow_trailing_chars && static_cast(ret - clean_copy.c_str()) != length) { - return false; - } - // ignore the time part - arrow_vendored::date::sys_seconds secs = - arrow_vendored::date::sys_days(arrow_vendored::date::year(result.tm_year + 1900) / - (result.tm_mon + 1) / std::max(result.tm_mday, 1)); - if (!ignore_time_in_day) { - secs += (std::chrono::hours(result.tm_hour) + std::chrono::minutes(result.tm_min) + - std::chrono::seconds(result.tm_sec)); -#ifndef _WIN32 - secs -= std::chrono::seconds(result.tm_gmtoff); -#endif - } - *out = util::CastSecondsToUnit(unit, secs.time_since_epoch().count()); - return true; -} - -template <> -struct StringConverter { - using value_type = int64_t; - - bool Convert(const TimestampType& type, const char* s, size_t length, value_type* out) { - return ParseTimestampISO8601(s, length, type.unit(), out); - } -}; - -template <> -struct StringConverter - : public StringToSignedIntConverterMixin { - using StringToSignedIntConverterMixin::StringToSignedIntConverterMixin; -}; - -template -struct StringConverter> { - using value_type = typename DATE_TYPE::c_type; - - using duration_type = - typename std::conditional::value, - arrow_vendored::date::days, - std::chrono::milliseconds>::type; - - bool Convert(const DATE_TYPE& type, const char* s, size_t length, value_type* out) { - if (ARROW_PREDICT_FALSE(length != 10)) { - return false; - } - - duration_type since_epoch; - if (ARROW_PREDICT_FALSE(!ParseYYYY_MM_DD(s, &since_epoch))) { - return false; - } - - *out = static_cast(since_epoch.count()); - return true; - } -}; - -template -struct StringConverter> { - using value_type = typename 
TIME_TYPE::c_type; - - // We allow the following formats for all units: - // - "hh:mm" - // - "hh:mm:ss" - // - // We allow the following formats for unit == MILLI, MICRO, or NANO: - // - "hh:mm:ss.s{1,3}" - // - // We allow the following formats for unit == MICRO, or NANO: - // - "hh:mm:ss.s{4,6}" - // - // We allow the following formats for unit == NANO: - // - "hh:mm:ss.s{7,9}" - - bool Convert(const TIME_TYPE& type, const char* s, size_t length, value_type* out) { - const auto unit = type.unit(); - std::chrono::seconds since_midnight; - - if (length == 5) { - if (ARROW_PREDICT_FALSE(!detail::ParseHH_MM(s, &since_midnight))) { - return false; - } - *out = - static_cast(util::CastSecondsToUnit(unit, since_midnight.count())); - return true; - } - - if (ARROW_PREDICT_FALSE(length < 8)) { - return false; - } - if (ARROW_PREDICT_FALSE(!detail::ParseHH_MM_SS(s, &since_midnight))) { - return false; - } - - *out = static_cast(util::CastSecondsToUnit(unit, since_midnight.count())); - - if (length == 8) { - return true; - } - - if (ARROW_PREDICT_FALSE(s[8] != '.')) { - return false; - } - - uint32_t subseconds_count = 0; - if (ARROW_PREDICT_FALSE( - !detail::ParseSubSeconds(s + 9, length - 9, unit, &subseconds_count))) { - return false; - } - - *out += subseconds_count; - return true; - } -}; - -/// \brief Convenience wrappers around internal::StringConverter. 
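The time-type converter above combines `ParseHH_MM_SS` with `ParseSubSeconds`, whose key trick is zero-padding omitted fractional digits to the target unit (so `".5"` in a nanosecond column means 500000000). A minimal sketch of that behavior, under my own hypothetical names and with nanoseconds hard-coded as the unit (the deleted code is generic over `TimeUnit`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified "hh:mm:ss[.s{1,9}]" parser returning total
// nanoseconds since midnight. Digit validation of hh/mm/ss is left to the
// range checks, as in the fixed-position parsing the deleted code used.
static bool parse_time_ns(const char* s, size_t length, int64_t* out) {
    auto two = [&](size_t i) { return (s[i] - '0') * 10 + (s[i + 1] - '0'); };
    if (length < 8 || s[2] != ':' || s[5] != ':') return false;
    const int h = two(0), m = two(3), sec = two(6);
    if (h < 0 || h > 23 || m < 0 || m > 59 || sec < 0 || sec > 59) return false;
    int64_t ns = (h * 3600LL + m * 60 + sec) * 1000000000LL;
    if (length == 8) { *out = ns; return true; }
    if (s[8] != '.' || length > 18) return false;  // at most 9 fractional digits
    int64_t frac = 0;
    for (size_t i = 9; i < length; ++i) {
        if (s[i] < '0' || s[i] > '9') return false;
        frac = frac * 10 + (s[i] - '0');
    }
    // Zero-pad the decimal places that were omitted for the unit,
    // mirroring the deleted ParseSubSeconds switch over `omitted`.
    for (size_t digits = length - 9; digits < 9; ++digits) frac *= 10;
    *out = ns + frac;
    return true;
}
```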
-template -bool ParseValue(const T& type, const char* s, size_t length, - typename StringConverter::value_type* out) { - return StringConverter{}.Convert(type, s, length, out); -} - -template -enable_if_parameter_free ParseValue( - const char* s, size_t length, typename StringConverter::value_type* out) { - static T type; - return StringConverter{}.Convert(type, s, length, out); -} - -} // namespace internal -} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/vector.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/vector.h deleted file mode 100644 index e3c0a67cf..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/vector.h +++ /dev/null @@ -1,172 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#pragma once - -#include -#include -#include - -#include "arrow/result.h" -#include "arrow/util/algorithm.h" -#include "arrow/util/functional.h" -#include "arrow/util/logging.h" - -namespace arrow { -namespace internal { - -template -std::vector DeleteVectorElement(const std::vector& values, size_t index) { - DCHECK(!values.empty()); - DCHECK_LT(index, values.size()); - std::vector out; - out.reserve(values.size() - 1); - for (size_t i = 0; i < index; ++i) { - out.push_back(values[i]); - } - for (size_t i = index + 1; i < values.size(); ++i) { - out.push_back(values[i]); - } - return out; -} - -template -std::vector AddVectorElement(const std::vector& values, size_t index, - T new_element) { - DCHECK_LE(index, values.size()); - std::vector out; - out.reserve(values.size() + 1); - for (size_t i = 0; i < index; ++i) { - out.push_back(values[i]); - } - out.emplace_back(std::move(new_element)); - for (size_t i = index; i < values.size(); ++i) { - out.push_back(values[i]); - } - return out; -} - -template -std::vector ReplaceVectorElement(const std::vector& values, size_t index, - T new_element) { - DCHECK_LE(index, values.size()); - std::vector out; - out.reserve(values.size()); - for (size_t i = 0; i < index; ++i) { - out.push_back(values[i]); - } - out.emplace_back(std::move(new_element)); - for (size_t i = index + 1; i < values.size(); ++i) { - out.push_back(values[i]); - } - return out; -} - -template -std::vector FilterVector(std::vector values, Predicate&& predicate) { - auto new_end = std::remove_if(values.begin(), values.end(), - [&](const T& value) { return !predicate(value); }); - values.erase(new_end, values.end()); - return values; -} - -template ()(std::declval()))> -std::vector MapVector(Fn&& map, const std::vector& source) { - std::vector out; - out.reserve(source.size()); - std::transform(source.begin(), source.end(), std::back_inserter(out), - std::forward(map)); - return out; -} - -template ()(std::declval()))> -std::vector MapVector(Fn&& map, 
std::vector&& source) { - std::vector out; - out.reserve(source.size()); - std::transform(std::make_move_iterator(source.begin()), - std::make_move_iterator(source.end()), std::back_inserter(out), - std::forward(map)); - return out; -} - -/// \brief Like MapVector, but where the function can fail. -template , - typename To = typename internal::call_traits::return_type::ValueType> -Result> MaybeMapVector(Fn&& map, const std::vector& source) { - std::vector out; - out.reserve(source.size()); - ARROW_RETURN_NOT_OK(MaybeTransform(source.begin(), source.end(), - std::back_inserter(out), std::forward(map))); - return std::move(out); -} - -template , - typename To = typename internal::call_traits::return_type::ValueType> -Result> MaybeMapVector(Fn&& map, std::vector&& source) { - std::vector out; - out.reserve(source.size()); - ARROW_RETURN_NOT_OK(MaybeTransform(std::make_move_iterator(source.begin()), - std::make_move_iterator(source.end()), - std::back_inserter(out), std::forward(map))); - return std::move(out); -} - -template -std::vector FlattenVectors(const std::vector>& vecs) { - std::size_t sum = 0; - for (const auto& vec : vecs) { - sum += vec.size(); - } - std::vector out; - out.reserve(sum); - for (const auto& vec : vecs) { - out.insert(out.end(), vec.begin(), vec.end()); - } - return out; -} - -template -Result> UnwrapOrRaise(std::vector>&& results) { - std::vector out; - out.reserve(results.size()); - auto end = std::make_move_iterator(results.end()); - for (auto it = std::make_move_iterator(results.begin()); it != end; it++) { - if (!it->ok()) { - return it->status(); - } - out.push_back(it->MoveValueUnsafe()); - } - return std::move(out); -} - -template -Result> UnwrapOrRaise(const std::vector>& results) { - std::vector out; - out.reserve(results.size()); - for (const auto& result : results) { - if (!result.ok()) { - return result.status(); - } - out.push_back(result.ValueUnsafe()); - } - return std::move(out); -} - -} // namespace internal -} // namespace 
arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/visibility.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/visibility.h deleted file mode 100644 index b0fd79029..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/visibility.h +++ /dev/null @@ -1,83 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
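The deleted `vector.h` provides functional-style helpers that build a fresh `std::vector` (with the output size reserved up front) instead of mutating in place. A self-contained sketch of the two simplest ones, written to the same contract as the originals but not copied from them:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Map each element through fn into a new vector of the result type.
template <typename Fn, typename T,
          typename R = decltype(std::declval<Fn>()(std::declval<T>()))>
std::vector<R> MapVector(Fn&& fn, const std::vector<T>& in) {
    std::vector<R> out;
    out.reserve(in.size());
    std::transform(in.begin(), in.end(), std::back_inserter(out),
                   std::forward<Fn>(fn));
    return out;
}

// Keep only the elements for which `keep` returns true. Note the predicate
// is inverted for remove_if: the helper's contract is "keep if true".
template <typename T, typename Pred>
std::vector<T> FilterVector(std::vector<T> v, Pred&& keep) {
    v.erase(std::remove_if(v.begin(), v.end(),
                           [&](const T& x) { return !keep(x); }),
            v.end());
    return v;
}
```

The `Result`-aware variants (`MaybeMapVector`, `UnwrapOrRaise`) follow the same shape but short-circuit on the first non-OK `Status`, which is why they are the ones used where the mapped function can fail.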
- -#pragma once - -#if defined(_WIN32) || defined(__CYGWIN__) -// Windows - -#if defined(_MSC_VER) -#pragma warning(disable : 4251) -#else -#pragma GCC diagnostic ignored "-Wattributes" -#endif - -#if defined(__cplusplus) && defined(__GNUC__) && !defined(__clang__) -// Use C++ attribute syntax where possible to avoid GCC parser bug -// (https://stackoverflow.com/questions/57993818/gcc-how-to-combine-attribute-dllexport-and-nodiscard-in-a-struct-de) -#define ARROW_DLLEXPORT [[gnu::dllexport]] -#define ARROW_DLLIMPORT [[gnu::dllimport]] -#else -#define ARROW_DLLEXPORT __declspec(dllexport) -#define ARROW_DLLIMPORT __declspec(dllimport) -#endif - -#ifdef ARROW_STATIC -#define ARROW_EXPORT -#define ARROW_FRIEND_EXPORT -#define ARROW_TEMPLATE_EXPORT -#elif defined(ARROW_EXPORTING) -#define ARROW_EXPORT ARROW_DLLEXPORT -// For some reason [[gnu::dllexport]] doesn't work well with friend declarations -#define ARROW_FRIEND_EXPORT __declspec(dllexport) -#define ARROW_TEMPLATE_EXPORT ARROW_DLLEXPORT -#else -#define ARROW_EXPORT ARROW_DLLIMPORT -#define ARROW_FRIEND_EXPORT __declspec(dllimport) -#define ARROW_TEMPLATE_EXPORT ARROW_DLLIMPORT -#endif - -#define ARROW_NO_EXPORT -#define ARROW_FORCE_INLINE __forceinline - -#else - -// Non-Windows - -#define ARROW_FORCE_INLINE - -#if defined(__cplusplus) && (defined(__GNUC__) || defined(__clang__)) -#ifndef ARROW_EXPORT -#define ARROW_EXPORT [[gnu::visibility("default")]] -#endif -#ifndef ARROW_NO_EXPORT -#define ARROW_NO_EXPORT [[gnu::visibility("hidden")]] -#endif -#else -// Not C++, or not gcc/clang -#ifndef ARROW_EXPORT -#define ARROW_EXPORT -#endif -#ifndef ARROW_NO_EXPORT -#define ARROW_NO_EXPORT -#endif -#endif - -#define ARROW_FRIEND_EXPORT -#define ARROW_TEMPLATE_EXPORT - -#endif // Non-Windows diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/windows_compatibility.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/windows_compatibility.h deleted file mode 100644 index 
ea0d01675..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/windows_compatibility.h +++ /dev/null @@ -1,40 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifdef _WIN32 - -// Windows defines min and max macros that mess up std::min/max -#ifndef NOMINMAX -#define NOMINMAX -#endif - -#define WIN32_LEAN_AND_MEAN - -// Set Windows 7 as a conservative minimum for Apache Arrow -#if defined(_WIN32_WINNT) && _WIN32_WINNT < 0x601 -#undef _WIN32_WINNT -#endif -#ifndef _WIN32_WINNT -#define _WIN32_WINNT 0x601 -#endif - -#include -#include - -#include "arrow/util/windows_fixup.h" - -#endif // _WIN32 diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/windows_fixup.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/windows_fixup.h deleted file mode 100644 index 2949ac4ab..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/util/windows_fixup.h +++ /dev/null @@ -1,52 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. 
The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -// This header needs to be included multiple times. - -#ifdef _WIN32 - -#ifdef max -#undef max -#endif -#ifdef min -#undef min -#endif - -// The Windows API defines macros from *File resolving to either -// *FileA or *FileW. Need to undo them. -#ifdef CopyFile -#undef CopyFile -#endif -#ifdef CreateFile -#undef CreateFile -#endif -#ifdef DeleteFile -#undef DeleteFile -#endif - -// Other annoying Windows macro definitions... -#ifdef IN -#undef IN -#endif -#ifdef OUT -#undef OUT -#endif - -// Note that we can't undefine OPTIONAL, because it can be used in other -// Windows headers... - -#endif // _WIN32 diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/portable-snippets/debug-trap.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/portable-snippets/debug-trap.h deleted file mode 100644 index 6d039064d..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/portable-snippets/debug-trap.h +++ /dev/null @@ -1,83 +0,0 @@ -/* Debugging assertions and traps - * Portable Snippets - https://github.com/nemequ/portable-snippets - * Created by Evan Nemerson - * - * To the extent possible under law, the authors have waived all - * copyright and related or neighboring rights to this code. 
For - * details, see the Creative Commons Zero 1.0 Universal license at - * https://creativecommons.org/publicdomain/zero/1.0/ - */ - -#if !defined(PSNIP_DEBUG_TRAP_H) -#define PSNIP_DEBUG_TRAP_H - -#if !defined(PSNIP_NDEBUG) && defined(NDEBUG) && !defined(PSNIP_DEBUG) -# define PSNIP_NDEBUG 1 -#endif - -#if defined(__has_builtin) && !defined(__ibmxl__) -# if __has_builtin(__builtin_debugtrap) -# define psnip_trap() __builtin_debugtrap() -# elif __has_builtin(__debugbreak) -# define psnip_trap() __debugbreak() -# endif -#endif -#if !defined(psnip_trap) -# if defined(_MSC_VER) || defined(__INTEL_COMPILER) -# define psnip_trap() __debugbreak() -# elif defined(__ARMCC_VERSION) -# define psnip_trap() __breakpoint(42) -# elif defined(__ibmxl__) || defined(__xlC__) -# include -# define psnip_trap() __trap(42) -# elif defined(__DMC__) && defined(_M_IX86) - static inline void psnip_trap(void) { __asm int 3h; } -# elif defined(__i386__) || defined(__x86_64__) - static inline void psnip_trap(void) { __asm__ __volatile__("int $03"); } -# elif defined(__thumb__) - static inline void psnip_trap(void) { __asm__ __volatile__(".inst 0xde01"); } -# elif defined(__aarch64__) - static inline void psnip_trap(void) { __asm__ __volatile__(".inst 0xd4200000"); } -# elif defined(__arm__) - static inline void psnip_trap(void) { __asm__ __volatile__(".inst 0xe7f001f0"); } -# elif defined (__alpha__) && !defined(__osf__) - static inline void psnip_trap(void) { __asm__ __volatile__("bpt"); } -# elif defined(_54_) - static inline void psnip_trap(void) { __asm__ __volatile__("ESTOP"); } -# elif defined(_55_) - static inline void psnip_trap(void) { __asm__ __volatile__(";\n .if (.MNEMONIC)\n ESTOP_1\n .else\n ESTOP_1()\n .endif\n NOP"); } -# elif defined(_64P_) - static inline void psnip_trap(void) { __asm__ __volatile__("SWBP 0"); } -# elif defined(_6x_) - static inline void psnip_trap(void) { __asm__ __volatile__("NOP\n .word 0x10000000"); } -# elif defined(__STDC_HOSTED__) && (__STDC_HOSTED__ 
== 0) && defined(__GNUC__) -# define psnip_trap() __builtin_trap() -# else -# include -# if defined(SIGTRAP) -# define psnip_trap() raise(SIGTRAP) -# else -# define psnip_trap() raise(SIGABRT) -# endif -# endif -#endif - -#if defined(HEDLEY_LIKELY) -# define PSNIP_DBG_LIKELY(expr) HEDLEY_LIKELY(expr) -#elif defined(__GNUC__) && (__GNUC__ >= 3) -# define PSNIP_DBG_LIKELY(expr) __builtin_expect(!!(expr), 1) -#else -# define PSNIP_DBG_LIKELY(expr) (!!(expr)) -#endif - -#if !defined(PSNIP_NDEBUG) || (PSNIP_NDEBUG == 0) -# define psnip_dbg_assert(expr) do { \ - if (!PSNIP_DBG_LIKELY(expr)) { \ - psnip_trap(); \ - } \ - } while (0) -#else -# define psnip_dbg_assert(expr) -#endif - -#endif /* !defined(PSNIP_DEBUG_TRAP_H) */ diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/portable-snippets/safe-math.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/portable-snippets/safe-math.h deleted file mode 100644 index 7f6426ac7..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/portable-snippets/safe-math.h +++ /dev/null @@ -1,1072 +0,0 @@ -/* Overflow-safe math functions - * Portable Snippets - https://github.com/nemequ/portable-snippets - * Created by Evan Nemerson - * - * To the extent possible under law, the authors have waived all - * copyright and related or neighboring rights to this code. 
For - * details, see the Creative Commons Zero 1.0 Universal license at - * https://creativecommons.org/publicdomain/zero/1.0/ - */ - -#if !defined(PSNIP_SAFE_H) -#define PSNIP_SAFE_H - -#if !defined(PSNIP_SAFE_FORCE_PORTABLE) -# if defined(__has_builtin) -# if __has_builtin(__builtin_add_overflow) && !defined(__ibmxl__) -# define PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW -# endif -# elif defined(__GNUC__) && (__GNUC__ >= 5) && !defined(__INTEL_COMPILER) -# define PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW -# endif -# if defined(__has_include) -# if __has_include() -# define PSNIP_SAFE_HAVE_INTSAFE_H -# endif -# elif defined(_WIN32) -# define PSNIP_SAFE_HAVE_INTSAFE_H -# endif -#endif /* !defined(PSNIP_SAFE_FORCE_PORTABLE) */ - -#if defined(__GNUC__) -# define PSNIP_SAFE_LIKELY(expr) __builtin_expect(!!(expr), 1) -# define PSNIP_SAFE_UNLIKELY(expr) __builtin_expect(!!(expr), 0) -#else -# define PSNIP_SAFE_LIKELY(expr) !!(expr) -# define PSNIP_SAFE_UNLIKELY(expr) !!(expr) -#endif /* defined(__GNUC__) */ - -#if !defined(PSNIP_SAFE_STATIC_INLINE) -# if defined(__GNUC__) -# define PSNIP_SAFE__COMPILER_ATTRIBUTES __attribute__((__unused__)) -# else -# define PSNIP_SAFE__COMPILER_ATTRIBUTES -# endif - -# if defined(HEDLEY_INLINE) -# define PSNIP_SAFE__INLINE HEDLEY_INLINE -# elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L -# define PSNIP_SAFE__INLINE inline -# elif defined(__GNUC_STDC_INLINE__) -# define PSNIP_SAFE__INLINE __inline__ -# elif defined(_MSC_VER) && _MSC_VER >= 1200 -# define PSNIP_SAFE__INLINE __inline -# else -# define PSNIP_SAFE__INLINE -# endif - -# define PSNIP_SAFE__FUNCTION PSNIP_SAFE__COMPILER_ATTRIBUTES static PSNIP_SAFE__INLINE -#endif - -// !defined(__cplusplus) added for Solaris support -#if !defined(__cplusplus) && defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L -# define psnip_safe_bool _Bool -#else -# define psnip_safe_bool int -#endif - -#if !defined(PSNIP_SAFE_NO_FIXED) -/* For maximum portability include the exact-int module from - 
portable snippets. */ -# if \ - !defined(psnip_int64_t) || !defined(psnip_uint64_t) || \ - !defined(psnip_int32_t) || !defined(psnip_uint32_t) || \ - !defined(psnip_int16_t) || !defined(psnip_uint16_t) || \ - !defined(psnip_int8_t) || !defined(psnip_uint8_t) -# include -# if !defined(psnip_int64_t) -# define psnip_int64_t int64_t -# endif -# if !defined(psnip_uint64_t) -# define psnip_uint64_t uint64_t -# endif -# if !defined(psnip_int32_t) -# define psnip_int32_t int32_t -# endif -# if !defined(psnip_uint32_t) -# define psnip_uint32_t uint32_t -# endif -# if !defined(psnip_int16_t) -# define psnip_int16_t int16_t -# endif -# if !defined(psnip_uint16_t) -# define psnip_uint16_t uint16_t -# endif -# if !defined(psnip_int8_t) -# define psnip_int8_t int8_t -# endif -# if !defined(psnip_uint8_t) -# define psnip_uint8_t uint8_t -# endif -# endif -#endif /* !defined(PSNIP_SAFE_NO_FIXED) */ -#include -#include - -#if !defined(PSNIP_SAFE_SIZE_MAX) -# if defined(__SIZE_MAX__) -# define PSNIP_SAFE_SIZE_MAX __SIZE_MAX__ -# elif defined(PSNIP_EXACT_INT_HAVE_STDINT) -# include -# endif -#endif - -#if defined(PSNIP_SAFE_SIZE_MAX) -# define PSNIP_SAFE__SIZE_MAX_RT PSNIP_SAFE_SIZE_MAX -#else -# define PSNIP_SAFE__SIZE_MAX_RT (~((size_t) 0)) -#endif - -#if defined(PSNIP_SAFE_HAVE_INTSAFE_H) -/* In VS 10, stdint.h and intsafe.h both define (U)INTN_MIN/MAX, which - triggers warning C4005 (level 1). */ -# if defined(_MSC_VER) && (_MSC_VER == 1600) -# pragma warning(push) -# pragma warning(disable:4005) -# endif -# include -# if defined(_MSC_VER) && (_MSC_VER == 1600) -# pragma warning(pop) -# endif -#endif /* defined(PSNIP_SAFE_HAVE_INTSAFE_H) */ - -/* If there is a type larger than the one we're concerned with it's - * likely much faster to simply promote the operands, perform the - * requested operation, verify that the result falls within the - * original type, then cast the result back to the original type. 
*/ - -#if !defined(PSNIP_SAFE_NO_PROMOTIONS) - -#define PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, op_name, op) \ - PSNIP_SAFE__FUNCTION psnip_safe_##name##_larger \ - psnip_safe_larger_##name##_##op_name (T a, T b) { \ - return ((psnip_safe_##name##_larger) a) op ((psnip_safe_##name##_larger) b); \ - } - -#define PSNIP_SAFE_DEFINE_LARGER_UNARY_OP(T, name, op_name, op) \ - PSNIP_SAFE__FUNCTION psnip_safe_##name##_larger \ - psnip_safe_larger_##name##_##op_name (T value) { \ - return (op ((psnip_safe_##name##_larger) value)); \ - } - -#define PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(T, name) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, add, +) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, sub, -) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, mul, *) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, div, /) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, mod, %) \ - PSNIP_SAFE_DEFINE_LARGER_UNARY_OP (T, name, neg, -) - -#define PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(T, name) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, add, +) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, sub, -) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, mul, *) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, div, /) \ - PSNIP_SAFE_DEFINE_LARGER_BINARY_OP(T, name, mod, %) - -#define PSNIP_SAFE_IS_LARGER(ORIG_MAX, DEST_MAX) ((DEST_MAX / ORIG_MAX) >= ORIG_MAX) - -#if defined(__GNUC__) && ((__GNUC__ >= 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)) && defined(__SIZEOF_INT128__) && !defined(__ibmxl__) -#define PSNIP_SAFE_HAVE_128 -typedef __int128 psnip_safe_int128_t; -typedef unsigned __int128 psnip_safe_uint128_t; -#endif /* defined(__GNUC__) */ - -#if !defined(PSNIP_SAFE_NO_FIXED) -#define PSNIP_SAFE_HAVE_INT8_LARGER -#define PSNIP_SAFE_HAVE_UINT8_LARGER -typedef psnip_int16_t psnip_safe_int8_larger; -typedef psnip_uint16_t psnip_safe_uint8_larger; - -#define PSNIP_SAFE_HAVE_INT16_LARGER -typedef psnip_int32_t psnip_safe_int16_larger; -typedef psnip_uint32_t psnip_safe_uint16_larger; - 
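The comment in the deleted `safe-math.h` spells out its portable fallback strategy: promote both operands to a larger type, do the arithmetic there, and only accept the result if it fits back in the original type. A hypothetical concrete instance of that idea for `int32_t` addition (when available, the real header prefers `__builtin_add_overflow` and friends instead):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Overflow-checked int32 addition via promotion to int64, sketching the
// PSNIP_SAFE_DEFINE_LARGER_BINARY_OP fallback path from safe-math.h.
static bool safe_add_i32(int32_t a, int32_t b, int32_t* out) {
    const int64_t wide = static_cast<int64_t>(a) + static_cast<int64_t>(b);
    if (wide < std::numeric_limits<int32_t>::min() ||
        wide > std::numeric_limits<int32_t>::max()) {
        return false;  // result would not round-trip through int32_t
    }
    *out = static_cast<int32_t>(wide);
    return true;
}
```

This is also why the header invests so much effort in the `psnip_safe_*_larger` typedef ladder above: the promotion trick only works when a strictly larger integer type exists, which `PSNIP_SAFE_IS_LARGER` verifies without itself overflowing by computing `(DEST_MAX / ORIG_MAX) >= ORIG_MAX`.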
-#define PSNIP_SAFE_HAVE_INT32_LARGER
-typedef psnip_int64_t psnip_safe_int32_larger;
-typedef psnip_uint64_t psnip_safe_uint32_larger;
-
-#if defined(PSNIP_SAFE_HAVE_128)
-#define PSNIP_SAFE_HAVE_INT64_LARGER
-typedef psnip_safe_int128_t psnip_safe_int64_larger;
-typedef psnip_safe_uint128_t psnip_safe_uint64_larger;
-#endif /* defined(PSNIP_SAFE_HAVE_128) */
-#endif /* !defined(PSNIP_SAFE_NO_FIXED) */
-
-#define PSNIP_SAFE_HAVE_LARGER_SCHAR
-#if PSNIP_SAFE_IS_LARGER(SCHAR_MAX, SHRT_MAX)
-typedef short psnip_safe_schar_larger;
-#elif PSNIP_SAFE_IS_LARGER(SCHAR_MAX, INT_MAX)
-typedef int psnip_safe_schar_larger;
-#elif PSNIP_SAFE_IS_LARGER(SCHAR_MAX, LONG_MAX)
-typedef long psnip_safe_schar_larger;
-#elif PSNIP_SAFE_IS_LARGER(SCHAR_MAX, LLONG_MAX)
-typedef long long psnip_safe_schar_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(SCHAR_MAX, 0x7fff)
-typedef psnip_int16_t psnip_safe_schar_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(SCHAR_MAX, 0x7fffffffLL)
-typedef psnip_int32_t psnip_safe_schar_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(SCHAR_MAX, 0x7fffffffffffffffLL)
-typedef psnip_int64_t psnip_safe_schar_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (SCHAR_MAX <= 0x7fffffffffffffffLL)
-typedef psnip_safe_int128_t psnip_safe_schar_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_SCHAR
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_UCHAR
-#if PSNIP_SAFE_IS_LARGER(UCHAR_MAX, USHRT_MAX)
-typedef unsigned short psnip_safe_uchar_larger;
-#elif PSNIP_SAFE_IS_LARGER(UCHAR_MAX, UINT_MAX)
-typedef unsigned int psnip_safe_uchar_larger;
-#elif PSNIP_SAFE_IS_LARGER(UCHAR_MAX, ULONG_MAX)
-typedef unsigned long psnip_safe_uchar_larger;
-#elif PSNIP_SAFE_IS_LARGER(UCHAR_MAX, ULLONG_MAX)
-typedef unsigned long long psnip_safe_uchar_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(UCHAR_MAX, 0xffffU)
-typedef psnip_uint16_t psnip_safe_uchar_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(UCHAR_MAX, 0xffffffffUL)
-typedef psnip_uint32_t psnip_safe_uchar_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(UCHAR_MAX, 0xffffffffffffffffULL)
-typedef psnip_uint64_t psnip_safe_uchar_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (UCHAR_MAX <= 0xffffffffffffffffULL)
-typedef psnip_safe_uint128_t psnip_safe_uchar_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_UCHAR
-#endif
-
-#if CHAR_MIN == 0 && defined(PSNIP_SAFE_HAVE_LARGER_UCHAR)
-#define PSNIP_SAFE_HAVE_LARGER_CHAR
-typedef psnip_safe_uchar_larger psnip_safe_char_larger;
-#elif CHAR_MIN < 0 && defined(PSNIP_SAFE_HAVE_LARGER_SCHAR)
-#define PSNIP_SAFE_HAVE_LARGER_CHAR
-typedef psnip_safe_schar_larger psnip_safe_char_larger;
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_SHRT
-#if PSNIP_SAFE_IS_LARGER(SHRT_MAX, INT_MAX)
-typedef int psnip_safe_short_larger;
-#elif PSNIP_SAFE_IS_LARGER(SHRT_MAX, LONG_MAX)
-typedef long psnip_safe_short_larger;
-#elif PSNIP_SAFE_IS_LARGER(SHRT_MAX, LLONG_MAX)
-typedef long long psnip_safe_short_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(SHRT_MAX, 0x7fff)
-typedef psnip_int16_t psnip_safe_short_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(SHRT_MAX, 0x7fffffffLL)
-typedef psnip_int32_t psnip_safe_short_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(SHRT_MAX, 0x7fffffffffffffffLL)
-typedef psnip_int64_t psnip_safe_short_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (SHRT_MAX <= 0x7fffffffffffffffLL)
-typedef psnip_safe_int128_t psnip_safe_short_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_SHRT
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_USHRT
-#if PSNIP_SAFE_IS_LARGER(USHRT_MAX, UINT_MAX)
-typedef unsigned int psnip_safe_ushort_larger;
-#elif PSNIP_SAFE_IS_LARGER(USHRT_MAX, ULONG_MAX)
-typedef unsigned long psnip_safe_ushort_larger;
-#elif PSNIP_SAFE_IS_LARGER(USHRT_MAX, ULLONG_MAX)
-typedef unsigned long long psnip_safe_ushort_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(USHRT_MAX, 0xffff)
-typedef psnip_uint16_t psnip_safe_ushort_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(USHRT_MAX, 0xffffffffUL)
-typedef psnip_uint32_t psnip_safe_ushort_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(USHRT_MAX, 0xffffffffffffffffULL)
-typedef psnip_uint64_t psnip_safe_ushort_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (USHRT_MAX <= 0xffffffffffffffffULL)
-typedef psnip_safe_uint128_t psnip_safe_ushort_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_USHRT
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_INT
-#if PSNIP_SAFE_IS_LARGER(INT_MAX, LONG_MAX)
-typedef long psnip_safe_int_larger;
-#elif PSNIP_SAFE_IS_LARGER(INT_MAX, LLONG_MAX)
-typedef long long psnip_safe_int_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(INT_MAX, 0x7fff)
-typedef psnip_int16_t psnip_safe_int_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(INT_MAX, 0x7fffffffLL)
-typedef psnip_int32_t psnip_safe_int_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(INT_MAX, 0x7fffffffffffffffLL)
-typedef psnip_int64_t psnip_safe_int_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (INT_MAX <= 0x7fffffffffffffffLL)
-typedef psnip_safe_int128_t psnip_safe_int_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_INT
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_UINT
-#if PSNIP_SAFE_IS_LARGER(UINT_MAX, ULONG_MAX)
-typedef unsigned long psnip_safe_uint_larger;
-#elif PSNIP_SAFE_IS_LARGER(UINT_MAX, ULLONG_MAX)
-typedef unsigned long long psnip_safe_uint_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(UINT_MAX, 0xffff)
-typedef psnip_uint16_t psnip_safe_uint_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(UINT_MAX, 0xffffffffUL)
-typedef psnip_uint32_t psnip_safe_uint_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(UINT_MAX, 0xffffffffffffffffULL)
-typedef psnip_uint64_t psnip_safe_uint_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (UINT_MAX <= 0xffffffffffffffffULL)
-typedef psnip_safe_uint128_t psnip_safe_uint_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_UINT
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_LONG
-#if PSNIP_SAFE_IS_LARGER(LONG_MAX, LLONG_MAX)
-typedef long long psnip_safe_long_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(LONG_MAX, 0x7fff)
-typedef psnip_int16_t psnip_safe_long_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(LONG_MAX, 0x7fffffffLL)
-typedef psnip_int32_t psnip_safe_long_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(LONG_MAX, 0x7fffffffffffffffLL)
-typedef psnip_int64_t psnip_safe_long_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (LONG_MAX <= 0x7fffffffffffffffLL)
-typedef psnip_safe_int128_t psnip_safe_long_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_LONG
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_ULONG
-#if PSNIP_SAFE_IS_LARGER(ULONG_MAX, ULLONG_MAX)
-typedef unsigned long long psnip_safe_ulong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(ULONG_MAX, 0xffff)
-typedef psnip_uint16_t psnip_safe_ulong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(ULONG_MAX, 0xffffffffUL)
-typedef psnip_uint32_t psnip_safe_ulong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(ULONG_MAX, 0xffffffffffffffffULL)
-typedef psnip_uint64_t psnip_safe_ulong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (ULONG_MAX <= 0xffffffffffffffffULL)
-typedef psnip_safe_uint128_t psnip_safe_ulong_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_ULONG
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_LLONG
-#if !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(LLONG_MAX, 0x7fff)
-typedef psnip_int16_t psnip_safe_llong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(LLONG_MAX, 0x7fffffffLL)
-typedef psnip_int32_t psnip_safe_llong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(LLONG_MAX, 0x7fffffffffffffffLL)
-typedef psnip_int64_t psnip_safe_llong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (LLONG_MAX <= 0x7fffffffffffffffLL)
-typedef psnip_safe_int128_t psnip_safe_llong_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_LLONG
-#endif
-
-#define PSNIP_SAFE_HAVE_LARGER_ULLONG
-#if !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(ULLONG_MAX, 0xffff)
-typedef psnip_uint16_t psnip_safe_ullong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(ULLONG_MAX, 0xffffffffUL)
-typedef psnip_uint32_t psnip_safe_ullong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(ULLONG_MAX, 0xffffffffffffffffULL)
-typedef psnip_uint64_t psnip_safe_ullong_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (ULLONG_MAX <= 0xffffffffffffffffULL)
-typedef psnip_safe_uint128_t psnip_safe_ullong_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_ULLONG
-#endif
-
-#if defined(PSNIP_SAFE_SIZE_MAX)
-#define PSNIP_SAFE_HAVE_LARGER_SIZE
-#if PSNIP_SAFE_IS_LARGER(PSNIP_SAFE_SIZE_MAX, USHRT_MAX)
-typedef unsigned short psnip_safe_size_larger;
-#elif PSNIP_SAFE_IS_LARGER(PSNIP_SAFE_SIZE_MAX, UINT_MAX)
-typedef unsigned int psnip_safe_size_larger;
-#elif PSNIP_SAFE_IS_LARGER(PSNIP_SAFE_SIZE_MAX, ULONG_MAX)
-typedef unsigned long psnip_safe_size_larger;
-#elif PSNIP_SAFE_IS_LARGER(PSNIP_SAFE_SIZE_MAX, ULLONG_MAX)
-typedef unsigned long long psnip_safe_size_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(PSNIP_SAFE_SIZE_MAX, 0xffff)
-typedef psnip_uint16_t psnip_safe_size_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(PSNIP_SAFE_SIZE_MAX, 0xffffffffUL)
-typedef psnip_uint32_t psnip_safe_size_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && PSNIP_SAFE_IS_LARGER(PSNIP_SAFE_SIZE_MAX, 0xffffffffffffffffULL)
-typedef psnip_uint64_t psnip_safe_size_larger;
-#elif !defined(PSNIP_SAFE_NO_FIXED) && defined(PSNIP_SAFE_HAVE_128) && (PSNIP_SAFE_SIZE_MAX <= 0xffffffffffffffffULL)
-typedef psnip_safe_uint128_t psnip_safe_size_larger;
-#else
-#undef PSNIP_SAFE_HAVE_LARGER_SIZE
-#endif
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_SCHAR)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(signed char, schar)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_UCHAR)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(unsigned char, uchar)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_CHAR)
-#if CHAR_MIN == 0
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(char, char)
-#else
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(char, char)
-#endif
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_SHORT)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(short, short)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_USHORT)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(unsigned short, ushort)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_INT)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(int, int)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_UINT)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(unsigned int, uint)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_LONG)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(long, long)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_ULONG)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(unsigned long, ulong)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_LLONG)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(long long, llong)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_ULLONG)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(unsigned long long, ullong)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_LARGER_SIZE)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(size_t, size)
-#endif
-
-#if !defined(PSNIP_SAFE_NO_FIXED)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(psnip_int8_t, int8)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(psnip_uint8_t, uint8)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(psnip_int16_t, int16)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(psnip_uint16_t, uint16)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(psnip_int32_t, int32)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(psnip_uint32_t, uint32)
-#if defined(PSNIP_SAFE_HAVE_128)
-PSNIP_SAFE_DEFINE_LARGER_SIGNED_OPS(psnip_int64_t, int64)
-PSNIP_SAFE_DEFINE_LARGER_UNSIGNED_OPS(psnip_uint64_t, uint64)
-#endif
-#endif
-
-#endif /* !defined(PSNIP_SAFE_NO_PROMOTIONS) */
-
-#define PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(T, name, op_name) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_##op_name(T* res, T a, T b) { \
-    return !__builtin_##op_name##_overflow(a, b, res); \
-  }
-
-#define PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(T, name, op_name, min, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_##op_name(T* res, T a, T b) { \
-    const psnip_safe_##name##_larger r = psnip_safe_larger_##name##_##op_name(a, b); \
-    *res = (T) r; \
-    return (r >= min) && (r <= max); \
-  }
-
-#define PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(T, name, op_name, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_##op_name(T* res, T a, T b) { \
-    const psnip_safe_##name##_larger r = psnip_safe_larger_##name##_##op_name(a, b); \
-    *res = (T) r; \
-    return (r <= max); \
-  }
-
-#define PSNIP_SAFE_DEFINE_SIGNED_ADD(T, name, min, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_add (T* res, T a, T b) { \
-    psnip_safe_bool r = !( ((b > 0) && (a > (max - b))) || \
-                           ((b < 0) && (a < (min - b))) ); \
-    if(PSNIP_SAFE_LIKELY(r)) \
-      *res = a + b; \
-    return r; \
-  }
-
-#define PSNIP_SAFE_DEFINE_UNSIGNED_ADD(T, name, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_add (T* res, T a, T b) { \
-    *res = (T) (a + b); \
-    return !PSNIP_SAFE_UNLIKELY((b > 0) && (a > (max - b))); \
-  }
-
-#define PSNIP_SAFE_DEFINE_SIGNED_SUB(T, name, min, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_sub (T* res, T a, T b) { \
-    psnip_safe_bool r = !((b > 0 && a < (min + b)) || \
-                          (b < 0 && a > (max + b))); \
-    if(PSNIP_SAFE_LIKELY(r)) \
-      *res = a - b; \
-    return r; \
-  }
-
-#define PSNIP_SAFE_DEFINE_UNSIGNED_SUB(T, name, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_sub (T* res, T a, T b) { \
-    *res = a - b; \
-    return !PSNIP_SAFE_UNLIKELY(b > a); \
-  }
-
-#define PSNIP_SAFE_DEFINE_SIGNED_MUL(T, name, min, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_mul (T* res, T a, T b) { \
-    psnip_safe_bool r = 1; \
-    if (a > 0) { \
-      if (b > 0) { \
-        if (a > (max / b)) { \
-          r = 0; \
-        } \
-      } else { \
-        if (b < (min / a)) { \
-          r = 0; \
-        } \
-      } \
-    } else { \
-      if (b > 0) { \
-        if (a < (min / b)) { \
-          r = 0; \
-        } \
-      } else { \
-        if ( (a != 0) && (b < (max / a))) { \
-          r = 0; \
-        } \
-      } \
-    } \
-    if(PSNIP_SAFE_LIKELY(r)) \
-      *res = a * b; \
-    return r; \
-  }
-
-#define PSNIP_SAFE_DEFINE_UNSIGNED_MUL(T, name, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_mul (T* res, T a, T b) { \
-    *res = (T) (a * b); \
-    return !PSNIP_SAFE_UNLIKELY((a > 0) && (b > 0) && (a > (max / b))); \
-  }
-
-#define PSNIP_SAFE_DEFINE_SIGNED_DIV(T, name, min, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_div (T* res, T a, T b) { \
-    if (PSNIP_SAFE_UNLIKELY(b == 0)) { \
-      *res = 0; \
-      return 0; \
-    } else if (PSNIP_SAFE_UNLIKELY(a == min && b == -1)) { \
-      *res = min; \
-      return 0; \
-    } else { \
-      *res = (T) (a / b); \
-      return 1; \
-    } \
-  }
-
-#define PSNIP_SAFE_DEFINE_UNSIGNED_DIV(T, name, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_div (T* res, T a, T b) { \
-    if (PSNIP_SAFE_UNLIKELY(b == 0)) { \
-      *res = 0; \
-      return 0; \
-    } else { \
-      *res = a / b; \
-      return 1; \
-    } \
-  }
-
-#define PSNIP_SAFE_DEFINE_SIGNED_MOD(T, name, min, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_mod (T* res, T a, T b) { \
-    if (PSNIP_SAFE_UNLIKELY(b == 0)) { \
-      *res = 0; \
-      return 0; \
-    } else if (PSNIP_SAFE_UNLIKELY(a == min && b == -1)) { \
-      *res = min; \
-      return 0; \
-    } else { \
-      *res = (T) (a % b); \
-      return 1; \
-    } \
-  }
-
-#define PSNIP_SAFE_DEFINE_UNSIGNED_MOD(T, name, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_mod (T* res, T a, T b) { \
-    if (PSNIP_SAFE_UNLIKELY(b == 0)) { \
-      *res = 0; \
-      return 0; \
-    } else { \
-      *res = a % b; \
-      return 1; \
-    } \
-  }
-
-#define PSNIP_SAFE_DEFINE_SIGNED_NEG(T, name, min, max) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_neg (T* res, T value) { \
-    psnip_safe_bool r = value != min; \
-    *res = PSNIP_SAFE_LIKELY(r) ? -value : max; \
-    return r; \
-  }
-
-#define PSNIP_SAFE_DEFINE_INTSAFE(T, name, op, isf) \
-  PSNIP_SAFE__FUNCTION psnip_safe_bool \
-  psnip_safe_##name##_##op (T* res, T a, T b) { \
-    return isf(a, b, res) == S_OK; \
-  }
-
-#if CHAR_MIN == 0
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(char, char, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(char, char, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(char, char, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_CHAR)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(char, char, add, CHAR_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(char, char, sub, CHAR_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(char, char, mul, CHAR_MAX)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(char, char, CHAR_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(char, char, CHAR_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(char, char, CHAR_MAX)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(char, char, CHAR_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(char, char, CHAR_MAX)
-#else /* CHAR_MIN != 0 */
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(char, char, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(char, char, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(char, char, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_CHAR)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(char, char, add, CHAR_MIN, CHAR_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(char, char, sub, CHAR_MIN, CHAR_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(char, char, mul, CHAR_MIN, CHAR_MAX)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(char, char, CHAR_MIN, CHAR_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(char, char, CHAR_MIN, CHAR_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(char, char, CHAR_MIN, CHAR_MAX)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(char, char, CHAR_MIN, CHAR_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(char, char, CHAR_MIN, CHAR_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(char, char, CHAR_MIN, CHAR_MAX)
-#endif
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(signed char, schar, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(signed char, schar, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(signed char, schar, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_SCHAR)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(signed char, schar, add, SCHAR_MIN, SCHAR_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(signed char, schar, sub, SCHAR_MIN, SCHAR_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(signed char, schar, mul, SCHAR_MIN, SCHAR_MAX)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(signed char, schar, SCHAR_MIN, SCHAR_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(signed char, schar, SCHAR_MIN, SCHAR_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(signed char, schar, SCHAR_MIN, SCHAR_MAX)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(signed char, schar, SCHAR_MIN, SCHAR_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(signed char, schar, SCHAR_MIN, SCHAR_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(signed char, schar, SCHAR_MIN, SCHAR_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned char, uchar, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned char, uchar, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned char, uchar, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_UCHAR)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned char, uchar, add, UCHAR_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned char, uchar, sub, UCHAR_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned char, uchar, mul, UCHAR_MAX)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(unsigned char, uchar, UCHAR_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(unsigned char, uchar, UCHAR_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(unsigned char, uchar, UCHAR_MAX)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(unsigned char, uchar, UCHAR_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(unsigned char, uchar, UCHAR_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(short, short, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(short, short, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(short, short, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_SHORT)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(short, short, add, SHRT_MIN, SHRT_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(short, short, sub, SHRT_MIN, SHRT_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(short, short, mul, SHRT_MIN, SHRT_MAX)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(short, short, SHRT_MIN, SHRT_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(short, short, SHRT_MIN, SHRT_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(short, short, SHRT_MIN, SHRT_MAX)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(short, short, SHRT_MIN, SHRT_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(short, short, SHRT_MIN, SHRT_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(short, short, SHRT_MIN, SHRT_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned short, ushort, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned short, ushort, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned short, ushort, mul)
-#elif defined(PSNIP_SAFE_HAVE_INTSAFE_H)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned short, ushort, add, UShortAdd)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned short, ushort, sub, UShortSub)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned short, ushort, mul, UShortMult)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_USHORT)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned short, ushort, add, USHRT_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned short, ushort, sub, USHRT_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned short, ushort, mul, USHRT_MAX)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(unsigned short, ushort, USHRT_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(unsigned short, ushort, USHRT_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(unsigned short, ushort, USHRT_MAX)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(unsigned short, ushort, USHRT_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(unsigned short, ushort, USHRT_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(int, int, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(int, int, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(int, int, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_INT)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(int, int, add, INT_MIN, INT_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(int, int, sub, INT_MIN, INT_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(int, int, mul, INT_MIN, INT_MAX)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(int, int, INT_MIN, INT_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(int, int, INT_MIN, INT_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(int, int, INT_MIN, INT_MAX)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(int, int, INT_MIN, INT_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(int, int, INT_MIN, INT_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(int, int, INT_MIN, INT_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned int, uint, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned int, uint, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned int, uint, mul)
-#elif defined(PSNIP_SAFE_HAVE_INTSAFE_H)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned int, uint, add, UIntAdd)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned int, uint, sub, UIntSub)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned int, uint, mul, UIntMult)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_UINT)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned int, uint, add, UINT_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned int, uint, sub, UINT_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned int, uint, mul, UINT_MAX)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(unsigned int, uint, UINT_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(unsigned int, uint, UINT_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(unsigned int, uint, UINT_MAX)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(unsigned int, uint, UINT_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(unsigned int, uint, UINT_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(long, long, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(long, long, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(long, long, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_LONG)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(long, long, add, LONG_MIN, LONG_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(long, long, sub, LONG_MIN, LONG_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(long, long, mul, LONG_MIN, LONG_MAX)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(long, long, LONG_MIN, LONG_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(long, long, LONG_MIN, LONG_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(long, long, LONG_MIN, LONG_MAX)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(long, long, LONG_MIN, LONG_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(long, long, LONG_MIN, LONG_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(long, long, LONG_MIN, LONG_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned long, ulong, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned long, ulong, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned long, ulong, mul)
-#elif defined(PSNIP_SAFE_HAVE_INTSAFE_H)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned long, ulong, add, ULongAdd)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned long, ulong, sub, ULongSub)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned long, ulong, mul, ULongMult)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_ULONG)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned long, ulong, add, ULONG_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned long, ulong, sub, ULONG_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned long, ulong, mul, ULONG_MAX)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(unsigned long, ulong, ULONG_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(unsigned long, ulong, ULONG_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(unsigned long, ulong, ULONG_MAX)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(unsigned long, ulong, ULONG_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(unsigned long, ulong, ULONG_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(long long, llong, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(long long, llong, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(long long, llong, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_LLONG)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(long long, llong, add, LLONG_MIN, LLONG_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(long long, llong, sub, LLONG_MIN, LLONG_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(long long, llong, mul, LLONG_MIN, LLONG_MAX)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(long long, llong, LLONG_MIN, LLONG_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(long long, llong, LLONG_MIN, LLONG_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(long long, llong, LLONG_MIN, LLONG_MAX)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(long long, llong, LLONG_MIN, LLONG_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(long long, llong, LLONG_MIN, LLONG_MAX)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(long long, llong, LLONG_MIN, LLONG_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned long long, ullong, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned long long, ullong, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(unsigned long long, ullong, mul)
-#elif defined(PSNIP_SAFE_HAVE_INTSAFE_H)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned long long, ullong, add, ULongLongAdd)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned long long, ullong, sub, ULongLongSub)
-PSNIP_SAFE_DEFINE_INTSAFE(unsigned long long, ullong, mul, ULongLongMult)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_ULLONG)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned long long, ullong, add, ULLONG_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned long long, ullong, sub, ULLONG_MAX)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(unsigned long long, ullong, mul, ULLONG_MAX)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(unsigned long long, ullong, ULLONG_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(unsigned long long, ullong, ULLONG_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(unsigned long long, ullong, ULLONG_MAX)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(unsigned long long, ullong, ULLONG_MAX)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(unsigned long long, ullong, ULLONG_MAX)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(size_t, size, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(size_t, size, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(size_t, size, mul)
-#elif defined(PSNIP_SAFE_HAVE_INTSAFE_H)
-PSNIP_SAFE_DEFINE_INTSAFE(size_t, size, add, SizeTAdd)
-PSNIP_SAFE_DEFINE_INTSAFE(size_t, size, sub, SizeTSub)
-PSNIP_SAFE_DEFINE_INTSAFE(size_t, size, mul, SizeTMult)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_SIZE)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(size_t, size, add, PSNIP_SAFE__SIZE_MAX_RT)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(size_t, size, sub, PSNIP_SAFE__SIZE_MAX_RT)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(size_t, size, mul, PSNIP_SAFE__SIZE_MAX_RT)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(size_t, size, PSNIP_SAFE__SIZE_MAX_RT)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(size_t, size, PSNIP_SAFE__SIZE_MAX_RT)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(size_t, size, PSNIP_SAFE__SIZE_MAX_RT)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(size_t, size, PSNIP_SAFE__SIZE_MAX_RT)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(size_t, size, PSNIP_SAFE__SIZE_MAX_RT)
-
-#if !defined(PSNIP_SAFE_NO_FIXED)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int8_t, int8, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int8_t, int8, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int8_t, int8, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_INT8)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int8_t, int8, add, (-0x7fLL-1), 0x7f)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int8_t, int8, sub, (-0x7fLL-1), 0x7f)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int8_t, int8, mul, (-0x7fLL-1), 0x7f)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(psnip_int8_t, int8, (-0x7fLL-1), 0x7f)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(psnip_int8_t, int8, (-0x7fLL-1), 0x7f)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(psnip_int8_t, int8, (-0x7fLL-1), 0x7f)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(psnip_int8_t, int8, (-0x7fLL-1), 0x7f)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(psnip_int8_t, int8, (-0x7fLL-1), 0x7f)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(psnip_int8_t, int8, (-0x7fLL-1), 0x7f)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint8_t, uint8, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint8_t, uint8, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint8_t, uint8, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_UINT8)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint8_t, uint8, add, 0xff)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint8_t, uint8, sub, 0xff)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint8_t, uint8, mul, 0xff)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(psnip_uint8_t, uint8, 0xff)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(psnip_uint8_t, uint8, 0xff)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(psnip_uint8_t, uint8, 0xff)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(psnip_uint8_t, uint8, 0xff)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(psnip_uint8_t, uint8, 0xff)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int16_t, int16, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int16_t, int16, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int16_t, int16, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_INT16)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int16_t, int16, add, (-32767-1), 0x7fff)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int16_t, int16, sub, (-32767-1), 0x7fff)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int16_t, int16, mul, (-32767-1), 0x7fff)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(psnip_int16_t, int16, (-32767-1), 0x7fff)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(psnip_int16_t, int16, (-32767-1), 0x7fff)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(psnip_int16_t, int16, (-32767-1), 0x7fff)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(psnip_int16_t, int16, (-32767-1), 0x7fff)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(psnip_int16_t, int16, (-32767-1), 0x7fff)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(psnip_int16_t, int16, (-32767-1), 0x7fff)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint16_t, uint16, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint16_t, uint16, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint16_t, uint16, mul)
-#elif defined(PSNIP_SAFE_HAVE_INTSAFE_H) && defined(_WIN32)
-PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint16_t, uint16, add, UShortAdd)
-PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint16_t, uint16, sub, UShortSub)
-PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint16_t, uint16, mul, UShortMult)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_UINT16)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint16_t, uint16, add, 0xffff)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint16_t, uint16, sub, 0xffff)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint16_t, uint16, mul, 0xffff)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(psnip_uint16_t, uint16, 0xffff)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(psnip_uint16_t, uint16, 0xffff)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(psnip_uint16_t, uint16, 0xffff)
-#endif
-PSNIP_SAFE_DEFINE_UNSIGNED_DIV(psnip_uint16_t, uint16, 0xffff)
-PSNIP_SAFE_DEFINE_UNSIGNED_MOD(psnip_uint16_t, uint16, 0xffff)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int32_t, int32, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int32_t, int32, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int32_t, int32, mul)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_INT32)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int32_t, int32, add, (-0x7fffffffLL-1), 0x7fffffffLL)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int32_t, int32, sub, (-0x7fffffffLL-1), 0x7fffffffLL)
-PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int32_t, int32, mul, (-0x7fffffffLL-1), 0x7fffffffLL)
-#else
-PSNIP_SAFE_DEFINE_SIGNED_ADD(psnip_int32_t, int32, (-0x7fffffffLL-1), 0x7fffffffLL)
-PSNIP_SAFE_DEFINE_SIGNED_SUB(psnip_int32_t, int32, (-0x7fffffffLL-1), 0x7fffffffLL)
-PSNIP_SAFE_DEFINE_SIGNED_MUL(psnip_int32_t, int32, (-0x7fffffffLL-1), 0x7fffffffLL)
-#endif
-PSNIP_SAFE_DEFINE_SIGNED_DIV(psnip_int32_t, int32, (-0x7fffffffLL-1), 0x7fffffffLL)
-PSNIP_SAFE_DEFINE_SIGNED_MOD(psnip_int32_t, int32, (-0x7fffffffLL-1), 0x7fffffffLL)
-PSNIP_SAFE_DEFINE_SIGNED_NEG(psnip_int32_t, int32, (-0x7fffffffLL-1), 0x7fffffffLL)
-
-#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint32_t, uint32, add)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint32_t, uint32, sub)
-PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint32_t, uint32, mul)
-#elif defined(PSNIP_SAFE_HAVE_INTSAFE_H) && defined(_WIN32)
-PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint32_t, uint32, add, UIntAdd)
-PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint32_t, uint32, sub, UIntSub)
-PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint32_t, uint32, mul, UIntMult)
-#elif defined(PSNIP_SAFE_HAVE_LARGER_UINT32)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint32_t, uint32, add, 0xffffffffUL)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint32_t, uint32, sub, 0xffffffffUL)
-PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint32_t, uint32, mul, 0xffffffffUL)
-#else
-PSNIP_SAFE_DEFINE_UNSIGNED_ADD(psnip_uint32_t, uint32, 0xffffffffUL)
-PSNIP_SAFE_DEFINE_UNSIGNED_SUB(psnip_uint32_t, uint32, 0xffffffffUL)
-PSNIP_SAFE_DEFINE_UNSIGNED_MUL(psnip_uint32_t, uint32, 0xffffffffUL) -#endif -PSNIP_SAFE_DEFINE_UNSIGNED_DIV(psnip_uint32_t, uint32, 0xffffffffUL) -PSNIP_SAFE_DEFINE_UNSIGNED_MOD(psnip_uint32_t, uint32, 0xffffffffUL) - -#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW) -PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int64_t, int64, add) -PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int64_t, int64, sub) -PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_int64_t, int64, mul) -#elif defined(PSNIP_SAFE_HAVE_LARGER_INT64) -PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int64_t, int64, add, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) -PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int64_t, int64, sub, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) -PSNIP_SAFE_DEFINE_PROMOTED_SIGNED_BINARY_OP(psnip_int64_t, int64, mul, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) -#else -PSNIP_SAFE_DEFINE_SIGNED_ADD(psnip_int64_t, int64, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) -PSNIP_SAFE_DEFINE_SIGNED_SUB(psnip_int64_t, int64, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) -PSNIP_SAFE_DEFINE_SIGNED_MUL(psnip_int64_t, int64, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) -#endif -PSNIP_SAFE_DEFINE_SIGNED_DIV(psnip_int64_t, int64, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) -PSNIP_SAFE_DEFINE_SIGNED_MOD(psnip_int64_t, int64, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) -PSNIP_SAFE_DEFINE_SIGNED_NEG(psnip_int64_t, int64, (-0x7fffffffffffffffLL-1), 0x7fffffffffffffffLL) - -#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW) -PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint64_t, uint64, add) -PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint64_t, uint64, sub) -PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(psnip_uint64_t, uint64, mul) -#elif defined(PSNIP_SAFE_HAVE_INTSAFE_H) && defined(_WIN32) -PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint64_t, uint64, add, ULongLongAdd) -PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint64_t, uint64, sub, ULongLongSub) -PSNIP_SAFE_DEFINE_INTSAFE(psnip_uint64_t, uint64, mul, 
ULongLongMult) -#elif defined(PSNIP_SAFE_HAVE_LARGER_UINT64) -PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint64_t, uint64, add, 0xffffffffffffffffULL) -PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint64_t, uint64, sub, 0xffffffffffffffffULL) -PSNIP_SAFE_DEFINE_PROMOTED_UNSIGNED_BINARY_OP(psnip_uint64_t, uint64, mul, 0xffffffffffffffffULL) -#else -PSNIP_SAFE_DEFINE_UNSIGNED_ADD(psnip_uint64_t, uint64, 0xffffffffffffffffULL) -PSNIP_SAFE_DEFINE_UNSIGNED_SUB(psnip_uint64_t, uint64, 0xffffffffffffffffULL) -PSNIP_SAFE_DEFINE_UNSIGNED_MUL(psnip_uint64_t, uint64, 0xffffffffffffffffULL) -#endif -PSNIP_SAFE_DEFINE_UNSIGNED_DIV(psnip_uint64_t, uint64, 0xffffffffffffffffULL) -PSNIP_SAFE_DEFINE_UNSIGNED_MOD(psnip_uint64_t, uint64, 0xffffffffffffffffULL) - -#endif /* !defined(PSNIP_SAFE_NO_FIXED) */ - -#define PSNIP_SAFE_C11_GENERIC_SELECTION(res, op) \ - _Generic((*res), \ - char: psnip_safe_char_##op, \ - unsigned char: psnip_safe_uchar_##op, \ - short: psnip_safe_short_##op, \ - unsigned short: psnip_safe_ushort_##op, \ - int: psnip_safe_int_##op, \ - unsigned int: psnip_safe_uint_##op, \ - long: psnip_safe_long_##op, \ - unsigned long: psnip_safe_ulong_##op, \ - long long: psnip_safe_llong_##op, \ - unsigned long long: psnip_safe_ullong_##op) - -#define PSNIP_SAFE_C11_GENERIC_BINARY_OP(op, res, a, b) \ - PSNIP_SAFE_C11_GENERIC_SELECTION(res, op)(res, a, b) -#define PSNIP_SAFE_C11_GENERIC_UNARY_OP(op, res, v) \ - PSNIP_SAFE_C11_GENERIC_SELECTION(res, op)(res, v) - -#if defined(PSNIP_SAFE_HAVE_BUILTIN_OVERFLOW) -#define psnip_safe_add(res, a, b) !__builtin_add_overflow(a, b, res) -#define psnip_safe_sub(res, a, b) !__builtin_sub_overflow(a, b, res) -#define psnip_safe_mul(res, a, b) !__builtin_mul_overflow(a, b, res) -#define psnip_safe_div(res, a, b) !__builtin_div_overflow(a, b, res) -#define psnip_safe_mod(res, a, b) !__builtin_mod_overflow(a, b, res) -#define psnip_safe_neg(res, v) PSNIP_SAFE_C11_GENERIC_UNARY_OP (neg, res, v) - -#elif 
defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) -/* The are no fixed-length or size selections because they cause an - * error about _Generic specifying two compatible types. Hopefully - * this doesn't cause problems on exotic platforms, but if it does - * please let me know and I'll try to figure something out. */ - -#define psnip_safe_add(res, a, b) PSNIP_SAFE_C11_GENERIC_BINARY_OP(add, res, a, b) -#define psnip_safe_sub(res, a, b) PSNIP_SAFE_C11_GENERIC_BINARY_OP(sub, res, a, b) -#define psnip_safe_mul(res, a, b) PSNIP_SAFE_C11_GENERIC_BINARY_OP(mul, res, a, b) -#define psnip_safe_div(res, a, b) PSNIP_SAFE_C11_GENERIC_BINARY_OP(div, res, a, b) -#define psnip_safe_mod(res, a, b) PSNIP_SAFE_C11_GENERIC_BINARY_OP(mod, res, a, b) -#define psnip_safe_neg(res, v) PSNIP_SAFE_C11_GENERIC_UNARY_OP (neg, res, v) -#endif - -#if !defined(PSNIP_SAFE_HAVE_BUILTINS) && (defined(PSNIP_SAFE_EMULATE_NATIVE) || defined(PSNIP_BUILTIN_EMULATE_NATIVE)) -# define __builtin_sadd_overflow(a, b, res) (!psnip_safe_int_add(res, a, b)) -# define __builtin_saddl_overflow(a, b, res) (!psnip_safe_long_add(res, a, b)) -# define __builtin_saddll_overflow(a, b, res) (!psnip_safe_llong_add(res, a, b)) -# define __builtin_uadd_overflow(a, b, res) (!psnip_safe_uint_add(res, a, b)) -# define __builtin_uaddl_overflow(a, b, res) (!psnip_safe_ulong_add(res, a, b)) -# define __builtin_uaddll_overflow(a, b, res) (!psnip_safe_ullong_add(res, a, b)) - -# define __builtin_ssub_overflow(a, b, res) (!psnip_safe_int_sub(res, a, b)) -# define __builtin_ssubl_overflow(a, b, res) (!psnip_safe_long_sub(res, a, b)) -# define __builtin_ssubll_overflow(a, b, res) (!psnip_safe_llong_sub(res, a, b)) -# define __builtin_usub_overflow(a, b, res) (!psnip_safe_uint_sub(res, a, b)) -# define __builtin_usubl_overflow(a, b, res) (!psnip_safe_ulong_sub(res, a, b)) -# define __builtin_usubll_overflow(a, b, res) (!psnip_safe_ullong_sub(res, a, b)) - -# define __builtin_smul_overflow(a, b, res) (!psnip_safe_int_mul(res, 
a, b)) -# define __builtin_smull_overflow(a, b, res) (!psnip_safe_long_mul(res, a, b)) -# define __builtin_smulll_overflow(a, b, res) (!psnip_safe_llong_mul(res, a, b)) -# define __builtin_umul_overflow(a, b, res) (!psnip_safe_uint_mul(res, a, b)) -# define __builtin_umull_overflow(a, b, res) (!psnip_safe_ulong_mul(res, a, b)) -# define __builtin_umulll_overflow(a, b, res) (!psnip_safe_ullong_mul(res, a, b)) -#endif - -#endif /* !defined(PSNIP_SAFE_H) */ diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/xxhash.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/xxhash.h deleted file mode 100644 index a33cdf861..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/xxhash.h +++ /dev/null @@ -1,18 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#include "arrow/vendored/xxhash/xxhash.h" diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/xxhash/xxhash.h b/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/xxhash/xxhash.h deleted file mode 100644 index a18e8c762..000000000 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/vendored/xxhash/xxhash.h +++ /dev/null @@ -1,6773 +0,0 @@ -/* - * xxHash - Extremely Fast Hash algorithm - * Header File - * Copyright (C) 2012-2021 Yann Collet - * - * BSD 2-Clause License (https://www.opensource.org/licenses/bsd-license.php) - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are - * met: - * - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following disclaimer - * in the documentation and/or other materials provided with the - * distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS - * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT - * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR - * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT - * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, - * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT - * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, - * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY - * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- * - * You can contact the author at: - * - xxHash homepage: https://www.xxhash.com - * - xxHash source repository: https://github.com/Cyan4973/xxHash - */ - -/*! - * @mainpage xxHash - * - * xxHash is an extremely fast non-cryptographic hash algorithm, working at RAM speed - * limits. - * - * It is proposed in four flavors, in three families: - * 1. @ref XXH32_family - * - Classic 32-bit hash function. Simple, compact, and runs on almost all - * 32-bit and 64-bit systems. - * 2. @ref XXH64_family - * - Classic 64-bit adaptation of XXH32. Just as simple, and runs well on most - * 64-bit systems (but _not_ 32-bit systems). - * 3. @ref XXH3_family - * - Modern 64-bit and 128-bit hash function family which features improved - * strength and performance across the board, especially on smaller data. - * It benefits greatly from SIMD and 64-bit without requiring it. - * - * Benchmarks - * --- - * The reference system uses an Intel i7-9700K CPU, and runs Ubuntu x64 20.04. - * The open source benchmark program is compiled with clang v10.0 using -O3 flag. 
- * - * | Hash Name | ISA ext | Width | Large Data Speed | Small Data Velocity | - * | -------------------- | ------- | ----: | ---------------: | ------------------: | - * | XXH3_64bits() | @b AVX2 | 64 | 59.4 GB/s | 133.1 | - * | MeowHash | AES-NI | 128 | 58.2 GB/s | 52.5 | - * | XXH3_128bits() | @b AVX2 | 128 | 57.9 GB/s | 118.1 | - * | CLHash | PCLMUL | 64 | 37.1 GB/s | 58.1 | - * | XXH3_64bits() | @b SSE2 | 64 | 31.5 GB/s | 133.1 | - * | XXH3_128bits() | @b SSE2 | 128 | 29.6 GB/s | 118.1 | - * | RAM sequential read | | N/A | 28.0 GB/s | N/A | - * | ahash | AES-NI | 64 | 22.5 GB/s | 107.2 | - * | City64 | | 64 | 22.0 GB/s | 76.6 | - * | T1ha2 | | 64 | 22.0 GB/s | 99.0 | - * | City128 | | 128 | 21.7 GB/s | 57.7 | - * | FarmHash | AES-NI | 64 | 21.3 GB/s | 71.9 | - * | XXH64() | | 64 | 19.4 GB/s | 71.0 | - * | SpookyHash | | 64 | 19.3 GB/s | 53.2 | - * | Mum | | 64 | 18.0 GB/s | 67.0 | - * | CRC32C | SSE4.2 | 32 | 13.0 GB/s | 57.9 | - * | XXH32() | | 32 | 9.7 GB/s | 71.9 | - * | City32 | | 32 | 9.1 GB/s | 66.0 | - * | Blake3* | @b AVX2 | 256 | 4.4 GB/s | 8.1 | - * | Murmur3 | | 32 | 3.9 GB/s | 56.1 | - * | SipHash* | | 64 | 3.0 GB/s | 43.2 | - * | Blake3* | @b SSE2 | 256 | 2.4 GB/s | 8.1 | - * | HighwayHash | | 64 | 1.4 GB/s | 6.0 | - * | FNV64 | | 64 | 1.2 GB/s | 62.7 | - * | Blake2* | | 256 | 1.1 GB/s | 5.1 | - * | SHA1* | | 160 | 0.8 GB/s | 5.6 | - * | MD5* | | 128 | 0.6 GB/s | 7.8 | - * @note - * - Hashes which require a specific ISA extension are noted. SSE2 is also noted, - * even though it is mandatory on x64. - * - Hashes with an asterisk are cryptographic. Note that MD5 is non-cryptographic - * by modern standards. - * - Small data velocity is a rough average of algorithm's efficiency for small - * data. For more accurate information, see the wiki. - * - More benchmarks and strength tests are found on the wiki: - * https://github.com/Cyan4973/xxHash/wiki - * - * Usage - * ------ - * All xxHash variants use a similar API. 
Changing the algorithm is a trivial - * substitution. - * - * @pre - * For functions which take an input and length parameter, the following - * requirements are assumed: - * - The range from [`input`, `input + length`) is valid, readable memory. - * - The only exception is if the `length` is `0`, `input` may be `NULL`. - * - For C++, the objects must have the *TriviallyCopyable* property, as the - * functions access bytes directly as if it was an array of `unsigned char`. - * - * @anchor single_shot_example - * **Single Shot** - * - * These functions are stateless functions which hash a contiguous block of memory, - * immediately returning the result. They are the easiest and usually the fastest - * option. - * - * XXH32(), XXH64(), XXH3_64bits(), XXH3_128bits() - * - * @code{.c} - * #include - * #include "xxhash.h" - * - * // Example for a function which hashes a null terminated string with XXH32(). - * XXH32_hash_t hash_string(const char* string, XXH32_hash_t seed) - * { - * // NULL pointers are only valid if the length is zero - * size_t length = (string == NULL) ? 0 : strlen(string); - * return XXH32(string, length, seed); - * } - * @endcode - * - * @anchor streaming_example - * **Streaming** - * - * These groups of functions allow incremental hashing of unknown size, even - * more than what would fit in a size_t. - * - * XXH32_reset(), XXH64_reset(), XXH3_64bits_reset(), XXH3_128bits_reset() - * - * @code{.c} - * #include - * #include - * #include "xxhash.h" - * // Example for a function which hashes a FILE incrementally with XXH3_64bits(). - * XXH64_hash_t hashFile(FILE* f) - * { - * // Allocate a state struct. Do not just use malloc() or new. - * XXH3_state_t* state = XXH3_createState(); - * assert(state != NULL && "Out of memory!"); - * // Reset the state to start a new hashing session. 
- * XXH3_64bits_reset(state); - * char buffer[4096]; - * size_t count; - * // Read the file in chunks - * while ((count = fread(buffer, 1, sizeof(buffer), f)) != 0) { - * // Run update() as many times as necessary to process the data - * XXH3_64bits_update(state, buffer, count); - * } - * // Retrieve the finalized hash. This will not change the state. - * XXH64_hash_t result = XXH3_64bits_digest(state); - * // Free the state. Do not use free(). - * XXH3_freeState(state); - * return result; - * } - * @endcode - * - * @file xxhash.h - * xxHash prototypes and implementation - */ - -#if defined (__cplusplus) -extern "C" { -#endif - -/* **************************** - * INLINE mode - ******************************/ -/*! - * @defgroup public Public API - * Contains details on the public xxHash functions. - * @{ - */ -#ifdef XXH_DOXYGEN -/*! - * @brief Gives access to internal state declaration, required for static allocation. - * - * Incompatible with dynamic linking, due to risks of ABI changes. - * - * Usage: - * @code{.c} - * #define XXH_STATIC_LINKING_ONLY - * #include "xxhash.h" - * @endcode - */ -# define XXH_STATIC_LINKING_ONLY -/* Do not undef XXH_STATIC_LINKING_ONLY for Doxygen */ - -/*! - * @brief Gives access to internal definitions. - * - * Usage: - * @code{.c} - * #define XXH_STATIC_LINKING_ONLY - * #define XXH_IMPLEMENTATION - * #include "xxhash.h" - * @endcode - */ -# define XXH_IMPLEMENTATION -/* Do not undef XXH_IMPLEMENTATION for Doxygen */ - -/*! - * @brief Exposes the implementation and marks all functions as `inline`. - * - * Use these build macros to inline xxhash into the target unit. - * Inlining improves performance on small inputs, especially when the length is - * expressed as a compile-time constant: - * - * https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html - * - * It also keeps xxHash symbols private to the unit, so they are not exported. 
- * - * Usage: - * @code{.c} - * #define XXH_INLINE_ALL - * #include "xxhash.h" - * @endcode - * Do not compile and link xxhash.o as a separate object, as it is not useful. - */ -# define XXH_INLINE_ALL -# undef XXH_INLINE_ALL -/*! - * @brief Exposes the implementation without marking functions as inline. - */ -# define XXH_PRIVATE_API -# undef XXH_PRIVATE_API -/*! - * @brief Emulate a namespace by transparently prefixing all symbols. - * - * If you want to include _and expose_ xxHash functions from within your own - * library, but also want to avoid symbol collisions with other libraries which - * may also include xxHash, you can use @ref XXH_NAMESPACE to automatically prefix - * any public symbol from xxhash library with the value of @ref XXH_NAMESPACE - * (therefore, avoid empty or numeric values). - * - * Note that no change is required within the calling program as long as it - * includes `xxhash.h`: Regular symbol names will be automatically translated - * by this header. - */ -# define XXH_NAMESPACE /* YOUR NAME HERE */ -# undef XXH_NAMESPACE -#endif - -#if (defined(XXH_INLINE_ALL) || defined(XXH_PRIVATE_API)) \ - && !defined(XXH_INLINE_ALL_31684351384) - /* this section should be traversed only once */ -# define XXH_INLINE_ALL_31684351384 - /* give access to the advanced API, required to compile implementations */ -# undef XXH_STATIC_LINKING_ONLY /* avoid macro redef */ -# define XXH_STATIC_LINKING_ONLY - /* make all functions private */ -# undef XXH_PUBLIC_API -# if defined(__GNUC__) -# define XXH_PUBLIC_API static __inline __attribute__((unused)) -# elif defined (__cplusplus) || (defined (__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) /* C99 */) -# define XXH_PUBLIC_API static inline -# elif defined(_MSC_VER) -# define XXH_PUBLIC_API static __inline -# else - /* note: this version may generate warnings for unused static functions */ -# define XXH_PUBLIC_API static -# endif - - /* - * This part deals with the special case where a unit wants to inline 
xxHash, - * but "xxhash.h" has previously been included without XXH_INLINE_ALL, - * such as part of some previously included *.h header file. - * Without further action, the new include would just be ignored, - * and functions would effectively _not_ be inlined (silent failure). - * The following macros solve this situation by prefixing all inlined names, - * avoiding naming collision with previous inclusions. - */ - /* Before that, we unconditionally #undef all symbols, - * in case they were already defined with XXH_NAMESPACE. - * They will then be redefined for XXH_INLINE_ALL - */ -# undef XXH_versionNumber - /* XXH32 */ -# undef XXH32 -# undef XXH32_createState -# undef XXH32_freeState -# undef XXH32_reset -# undef XXH32_update -# undef XXH32_digest -# undef XXH32_copyState -# undef XXH32_canonicalFromHash -# undef XXH32_hashFromCanonical - /* XXH64 */ -# undef XXH64 -# undef XXH64_createState -# undef XXH64_freeState -# undef XXH64_reset -# undef XXH64_update -# undef XXH64_digest -# undef XXH64_copyState -# undef XXH64_canonicalFromHash -# undef XXH64_hashFromCanonical - /* XXH3_64bits */ -# undef XXH3_64bits -# undef XXH3_64bits_withSecret -# undef XXH3_64bits_withSeed -# undef XXH3_64bits_withSecretandSeed -# undef XXH3_createState -# undef XXH3_freeState -# undef XXH3_copyState -# undef XXH3_64bits_reset -# undef XXH3_64bits_reset_withSeed -# undef XXH3_64bits_reset_withSecret -# undef XXH3_64bits_update -# undef XXH3_64bits_digest -# undef XXH3_generateSecret - /* XXH3_128bits */ -# undef XXH128 -# undef XXH3_128bits -# undef XXH3_128bits_withSeed -# undef XXH3_128bits_withSecret -# undef XXH3_128bits_reset -# undef XXH3_128bits_reset_withSeed -# undef XXH3_128bits_reset_withSecret -# undef XXH3_128bits_reset_withSecretandSeed -# undef XXH3_128bits_update -# undef XXH3_128bits_digest -# undef XXH128_isEqual -# undef XXH128_cmp -# undef XXH128_canonicalFromHash -# undef XXH128_hashFromCanonical - /* Finally, free the namespace itself */ -# undef 
XXH_NAMESPACE - - /* employ the namespace for XXH_INLINE_ALL */ -# define XXH_NAMESPACE XXH_INLINE_ - /* - * Some identifiers (enums, type names) are not symbols, - * but they must nonetheless be renamed to avoid redeclaration. - * Alternative solution: do not redeclare them. - * However, this requires some #ifdefs, and has a more dispersed impact. - * Meanwhile, renaming can be achieved in a single place. - */ -# define XXH_IPREF(Id) XXH_NAMESPACE ## Id -# define XXH_OK XXH_IPREF(XXH_OK) -# define XXH_ERROR XXH_IPREF(XXH_ERROR) -# define XXH_errorcode XXH_IPREF(XXH_errorcode) -# define XXH32_canonical_t XXH_IPREF(XXH32_canonical_t) -# define XXH64_canonical_t XXH_IPREF(XXH64_canonical_t) -# define XXH128_canonical_t XXH_IPREF(XXH128_canonical_t) -# define XXH32_state_s XXH_IPREF(XXH32_state_s) -# define XXH32_state_t XXH_IPREF(XXH32_state_t) -# define XXH64_state_s XXH_IPREF(XXH64_state_s) -# define XXH64_state_t XXH_IPREF(XXH64_state_t) -# define XXH3_state_s XXH_IPREF(XXH3_state_s) -# define XXH3_state_t XXH_IPREF(XXH3_state_t) -# define XXH128_hash_t XXH_IPREF(XXH128_hash_t) - /* Ensure the header is parsed again, even if it was previously included */ -# undef XXHASH_H_5627135585666179 -# undef XXHASH_H_STATIC_13879238742 -#endif /* XXH_INLINE_ALL || XXH_PRIVATE_API */ - -/* **************************************************************** - * Stable API - *****************************************************************/ -#ifndef XXHASH_H_5627135585666179 -#define XXHASH_H_5627135585666179 1 - -/*! @brief Marks a global symbol. 
*/ -#if !defined(XXH_INLINE_ALL) && !defined(XXH_PRIVATE_API) -# if defined(WIN32) && defined(_MSC_VER) && (defined(XXH_IMPORT) || defined(XXH_EXPORT)) -# ifdef XXH_EXPORT -# define XXH_PUBLIC_API __declspec(dllexport) -# elif XXH_IMPORT -# define XXH_PUBLIC_API __declspec(dllimport) -# endif -# else -# define XXH_PUBLIC_API /* do nothing */ -# endif -#endif - -#ifdef XXH_NAMESPACE -# define XXH_CAT(A,B) A##B -# define XXH_NAME2(A,B) XXH_CAT(A,B) -# define XXH_versionNumber XXH_NAME2(XXH_NAMESPACE, XXH_versionNumber) -/* XXH32 */ -# define XXH32 XXH_NAME2(XXH_NAMESPACE, XXH32) -# define XXH32_createState XXH_NAME2(XXH_NAMESPACE, XXH32_createState) -# define XXH32_freeState XXH_NAME2(XXH_NAMESPACE, XXH32_freeState) -# define XXH32_reset XXH_NAME2(XXH_NAMESPACE, XXH32_reset) -# define XXH32_update XXH_NAME2(XXH_NAMESPACE, XXH32_update) -# define XXH32_digest XXH_NAME2(XXH_NAMESPACE, XXH32_digest) -# define XXH32_copyState XXH_NAME2(XXH_NAMESPACE, XXH32_copyState) -# define XXH32_canonicalFromHash XXH_NAME2(XXH_NAMESPACE, XXH32_canonicalFromHash) -# define XXH32_hashFromCanonical XXH_NAME2(XXH_NAMESPACE, XXH32_hashFromCanonical) -/* XXH64 */ -# define XXH64 XXH_NAME2(XXH_NAMESPACE, XXH64) -# define XXH64_createState XXH_NAME2(XXH_NAMESPACE, XXH64_createState) -# define XXH64_freeState XXH_NAME2(XXH_NAMESPACE, XXH64_freeState) -# define XXH64_reset XXH_NAME2(XXH_NAMESPACE, XXH64_reset) -# define XXH64_update XXH_NAME2(XXH_NAMESPACE, XXH64_update) -# define XXH64_digest XXH_NAME2(XXH_NAMESPACE, XXH64_digest) -# define XXH64_copyState XXH_NAME2(XXH_NAMESPACE, XXH64_copyState) -# define XXH64_canonicalFromHash XXH_NAME2(XXH_NAMESPACE, XXH64_canonicalFromHash) -# define XXH64_hashFromCanonical XXH_NAME2(XXH_NAMESPACE, XXH64_hashFromCanonical) -/* XXH3_64bits */ -# define XXH3_64bits XXH_NAME2(XXH_NAMESPACE, XXH3_64bits) -# define XXH3_64bits_withSecret XXH_NAME2(XXH_NAMESPACE, XXH3_64bits_withSecret) -# define XXH3_64bits_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_withSeed) -# define XXH3_64bits_withSecretandSeed XXH_NAME2(XXH_NAMESPACE, XXH3_64bits_withSecretandSeed) -# define XXH3_createState XXH_NAME2(XXH_NAMESPACE, XXH3_createState) -# define XXH3_freeState XXH_NAME2(XXH_NAMESPACE, XXH3_freeState) -# define XXH3_copyState XXH_NAME2(XXH_NAMESPACE, XXH3_copyState) -# define XXH3_64bits_reset XXH_NAME2(XXH_NAMESPACE, XXH3_64bits_reset) -# define XXH3_64bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, XXH3_64bits_reset_withSeed) -# define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, XXH3_64bits_reset_withSecret) -# define XXH3_64bits_reset_withSecretandSeed XXH_NAME2(XXH_NAMESPACE, XXH3_64bits_reset_withSecretandSeed) -# define XXH3_64bits_update XXH_NAME2(XXH_NAMESPACE, XXH3_64bits_update) -# define XXH3_64bits_digest XXH_NAME2(XXH_NAMESPACE, XXH3_64bits_digest) -# define XXH3_generateSecret XXH_NAME2(XXH_NAMESPACE, XXH3_generateSecret) -# define XXH3_generateSecret_fromSeed XXH_NAME2(XXH_NAMESPACE, XXH3_generateSecret_fromSeed) -/* XXH3_128bits */ -# define XXH128 XXH_NAME2(XXH_NAMESPACE, XXH128) -# define XXH3_128bits XXH_NAME2(XXH_NAMESPACE, XXH3_128bits) -# define XXH3_128bits_withSeed XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_withSeed) -# define XXH3_128bits_withSecret XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_withSecret) -# define XXH3_128bits_withSecretandSeed XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_withSecretandSeed) -# define XXH3_128bits_reset XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset) -# define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSeed) -# define XXH3_128bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret) -# define XXH3_128bits_reset_withSecretandSeed XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecretandSeed) -# define XXH3_128bits_update XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_update) -# define XXH3_128bits_digest XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_digest) -# define XXH128_isEqual XXH_NAME2(XXH_NAMESPACE, XXH128_isEqual) -# define 
XXH128_cmp XXH_NAME2(XXH_NAMESPACE, XXH128_cmp) -# define XXH128_canonicalFromHash XXH_NAME2(XXH_NAMESPACE, XXH128_canonicalFromHash) -# define XXH128_hashFromCanonical XXH_NAME2(XXH_NAMESPACE, XXH128_hashFromCanonical) -#endif - - -/* ************************************* -* Compiler specifics -***************************************/ - -/* specific declaration modes for Windows */ -#if !defined(XXH_INLINE_ALL) && !defined(XXH_PRIVATE_API) -# if defined(WIN32) && defined(_MSC_VER) && (defined(XXH_IMPORT) || defined(XXH_EXPORT)) -# ifdef XXH_EXPORT -# define XXH_PUBLIC_API __declspec(dllexport) -# elif XXH_IMPORT -# define XXH_PUBLIC_API __declspec(dllimport) -# endif -# else -# define XXH_PUBLIC_API /* do nothing */ -# endif -#endif - -#if defined (__GNUC__) -# define XXH_CONSTF __attribute__((const)) -# define XXH_PUREF __attribute__((pure)) -# define XXH_MALLOCF __attribute__((malloc)) -#else -# define XXH_CONSTF /* disable */ -# define XXH_PUREF -# define XXH_MALLOCF -#endif - -/* ************************************* -* Version -***************************************/ -#define XXH_VERSION_MAJOR 0 -#define XXH_VERSION_MINOR 8 -#define XXH_VERSION_RELEASE 2 -/*! @brief Version number, encoded as two digits each */ -#define XXH_VERSION_NUMBER (XXH_VERSION_MAJOR *100*100 + XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE) - -/*! - * @brief Obtains the xxHash version. - * - * This is mostly useful when xxHash is compiled as a shared library, - * since the returned value comes from the library, as opposed to header file. - * - * @return @ref XXH_VERSION_NUMBER of the invoked library. - */ -XXH_PUBLIC_API XXH_CONSTF unsigned XXH_versionNumber (void); - - -/* **************************** -* Common basic types -******************************/ -#include /* size_t */ -/*! - * @brief Exit code for the streaming API. 
- */ -typedef enum { - XXH_OK = 0, /*!< OK */ - XXH_ERROR /*!< Error */ -} XXH_errorcode; - - -/*-********************************************************************** -* 32-bit hash -************************************************************************/ -#if defined(XXH_DOXYGEN) /* Don't show include */ -/*! - * @brief An unsigned 32-bit integer. - * - * Not necessarily defined to `uint32_t` but functionally equivalent. - */ -typedef uint32_t XXH32_hash_t; - -#elif !defined (__VMS) \ - && (defined (__cplusplus) \ - || (defined (__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) /* C99 */) ) -# include - typedef uint32_t XXH32_hash_t; - -#else -# include -# if UINT_MAX == 0xFFFFFFFFUL - typedef unsigned int XXH32_hash_t; -# elif ULONG_MAX == 0xFFFFFFFFUL - typedef unsigned long XXH32_hash_t; -# else -# error "unsupported platform: need a 32-bit type" -# endif -#endif - -/*! - * @} - * - * @defgroup XXH32_family XXH32 family - * @ingroup public - * Contains functions used in the classic 32-bit xxHash algorithm. - * - * @note - * XXH32 is useful for older platforms, with no or poor 64-bit performance. - * Note that the @ref XXH3_family provides competitive speed for both 32-bit - * and 64-bit systems, and offers true 64/128 bit hash results. - * - * @see @ref XXH64_family, @ref XXH3_family : Other xxHash families - * @see @ref XXH32_impl for implementation details - * @{ - */ - -/*! - * @brief Calculates the 32-bit hash of @p input using xxHash32. - * - * Speed on Core 2 Duo @ 3 GHz (single thread, SMHasher benchmark): 5.4 GB/s - * - * See @ref single_shot_example "Single Shot Example" for an example. - * - * @param input The block of data to be hashed, at least @p length bytes in size. - * @param length The length of @p input, in bytes. - * @param seed The 32-bit seed to alter the hash's output predictably. - * - * @pre - * The memory between @p input and @p input + @p length must be valid, - * readable, contiguous memory. 
However, if @p length is `0`, @p input may be - * `NULL`. In C++, this also must be *TriviallyCopyable*. - * - * @return The calculated 32-bit hash value. - * - * @see - * XXH64(), XXH3_64bits_withSeed(), XXH3_128bits_withSeed(), XXH128(): - * Direct equivalents for the other variants of xxHash. - * @see - * XXH32_createState(), XXH32_update(), XXH32_digest(): Streaming version. - */ -XXH_PUBLIC_API XXH_PUREF XXH32_hash_t XXH32 (const void* input, size_t length, XXH32_hash_t seed); - -#ifndef XXH_NO_STREAM -/*! - * Streaming functions generate the xxHash value from an incremental input. - * This method is slower than single-call functions, due to state management. - * For small inputs, prefer `XXH32()` and `XXH64()`, which are better optimized. - * - * An XXH state must first be allocated using `XXH*_createState()`. - * - * Start a new hash by initializing the state with a seed using `XXH*_reset()`. - * - * Then, feed the hash state by calling `XXH*_update()` as many times as necessary. - * - * The function returns an error code, with 0 meaning OK, and any other value - * meaning there is an error. - * - * Finally, a hash value can be produced anytime, by using `XXH*_digest()`. - * This function returns the nn-bits hash as an int or long long. - * - * It's still possible to continue inserting input into the hash state after a - * digest, and generate new hash values later on by invoking `XXH*_digest()`. - * - * When done, release the state using `XXH*_freeState()`. - * - * @see streaming_example at the top of @ref xxhash.h for an example. - */ - -/*! - * @typedef struct XXH32_state_s XXH32_state_t - * @brief The opaque state struct for the XXH32 streaming API. - * - * @see XXH32_state_s for details. - */ -typedef struct XXH32_state_s XXH32_state_t; - -/*! - * @brief Allocates an @ref XXH32_state_t. - * - * Must be freed with XXH32_freeState(). - * @return An allocated XXH32_state_t on success, `NULL` on failure. 
- */ -XXH_PUBLIC_API XXH_MALLOCF XXH32_state_t* XXH32_createState(void); -/*! - * @brief Frees an @ref XXH32_state_t. - * - * Must be allocated with XXH32_createState(). - * @param statePtr A pointer to an @ref XXH32_state_t allocated with @ref XXH32_createState(). - * @return XXH_OK. - */ -XXH_PUBLIC_API XXH_errorcode XXH32_freeState(XXH32_state_t* statePtr); -/*! - * @brief Copies one @ref XXH32_state_t to another. - * - * @param dst_state The state to copy to. - * @param src_state The state to copy from. - * @pre - * @p dst_state and @p src_state must not be `NULL` and must not overlap. - */ -XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, const XXH32_state_t* src_state); - -/*! - * @brief Resets an @ref XXH32_state_t to begin a new hash. - * - * This function resets and seeds a state. Call it before @ref XXH32_update(). - * - * @param statePtr The state struct to reset. - * @param seed The 32-bit seed to alter the hash result predictably. - * - * @pre - * @p statePtr must not be `NULL`. - * - * @return @ref XXH_OK on success, @ref XXH_ERROR on failure. - */ -XXH_PUBLIC_API XXH_errorcode XXH32_reset (XXH32_state_t* statePtr, XXH32_hash_t seed); - -/*! - * @brief Consumes a block of @p input to an @ref XXH32_state_t. - * - * Call this to incrementally consume blocks of data. - * - * @param statePtr The state struct to update. - * @param input The block of data to be hashed, at least @p length bytes in size. - * @param length The length of @p input, in bytes. - * - * @pre - * @p statePtr must not be `NULL`. - * @pre - * The memory between @p input and @p input + @p length must be valid, - * readable, contiguous memory. However, if @p length is `0`, @p input may be - * `NULL`. In C++, this also must be *TriviallyCopyable*. - * - * @return @ref XXH_OK on success, @ref XXH_ERROR on failure. - */ -XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* statePtr, const void* input, size_t length); - -/*! 
- * @brief Returns the calculated hash value from an @ref XXH32_state_t.
- *
- * @note
- * Calling XXH32_digest() will not affect @p statePtr, so you can update,
- * digest, and update again.
- *
- * @param statePtr The state struct to calculate the hash from.
- *
- * @pre
- * @p statePtr must not be `NULL`.
- *
- * @return The calculated xxHash32 value from that state.
- */
-XXH_PUBLIC_API XXH_PUREF XXH32_hash_t XXH32_digest (const XXH32_state_t* statePtr);
-#endif /* !XXH_NO_STREAM */
-
-/******* Canonical representation *******/
-
-/*
- * The default return values from XXH functions are unsigned 32 and 64 bit
- * integers.
- * This is the simplest and fastest format for further post-processing.
- *
- * However, this leaves open the question of what the order is on the byte level,
- * since little and big endian conventions will store the same number differently.
- *
- * The canonical representation settles this issue by mandating big-endian
- * convention, the same convention as human-readable numbers (large digits first).
- *
- * When writing hash values to storage, sending them over a network, or printing
- * them, it's highly recommended to use the canonical representation to ensure
- * portability across a wider range of systems, present and future.
- *
- * The following functions allow transformation of hash values to and from
- * canonical format.
- */
-
-/*!
- * @brief Canonical (big endian) representation of @ref XXH32_hash_t.
- */
-typedef struct {
- unsigned char digest[4]; /*!< Hash bytes, big endian */
-} XXH32_canonical_t;
-
-/*!
- * @brief Converts an @ref XXH32_hash_t to a big endian @ref XXH32_canonical_t.
- *
- * @param dst The @ref XXH32_canonical_t pointer to be stored to.
- * @param hash The @ref XXH32_hash_t to be converted.
- *
- * @pre
- * @p dst must not be `NULL`.
- */
-XXH_PUBLIC_API void XXH32_canonicalFromHash(XXH32_canonical_t* dst, XXH32_hash_t hash);
-
-/*!
- * @brief Converts an @ref XXH32_canonical_t to a native @ref XXH32_hash_t.
- *
- * @param src The @ref XXH32_canonical_t to convert.
- *
- * @pre
- * @p src must not be `NULL`.
- *
- * @return The converted hash.
- */
-XXH_PUBLIC_API XXH_PUREF XXH32_hash_t XXH32_hashFromCanonical(const XXH32_canonical_t* src);
-
-
-/*! @cond Doxygen ignores this part */
-#ifdef __has_attribute
-# define XXH_HAS_ATTRIBUTE(x) __has_attribute(x)
-#else
-# define XXH_HAS_ATTRIBUTE(x) 0
-#endif
-/*! @endcond */
-
-/*! @cond Doxygen ignores this part */
-/*
- * C23 __STDC_VERSION__ number hasn't been specified yet. For now
- * leave as `201711L` (C17 + 1).
- * TODO: Update to the correct value when it's been specified.
- */
-#define XXH_C23_VN 201711L
-/*! @endcond */
-
-/*! @cond Doxygen ignores this part */
-/* C-language Attributes are added in C23. */
-#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= XXH_C23_VN) && defined(__has_c_attribute)
-# define XXH_HAS_C_ATTRIBUTE(x) __has_c_attribute(x)
-#else
-# define XXH_HAS_C_ATTRIBUTE(x) 0
-#endif
-/*! @endcond */
-
-/*! @cond Doxygen ignores this part */
-#if defined(__cplusplus) && defined(__has_cpp_attribute)
-# define XXH_HAS_CPP_ATTRIBUTE(x) __has_cpp_attribute(x)
-#else
-# define XXH_HAS_CPP_ATTRIBUTE(x) 0
-#endif
-/*! @endcond */
-
-/*! @cond Doxygen ignores this part */
-/*
- * Define XXH_FALLTHROUGH macro for annotating switch case with the 'fallthrough' attribute
- * introduced in C++17 and C23.
- * C++17 : https://en.cppreference.com/w/cpp/language/attributes/fallthrough
- * C23 : https://en.cppreference.com/w/c/language/attributes/fallthrough
- */
-#if XXH_HAS_C_ATTRIBUTE(fallthrough) || XXH_HAS_CPP_ATTRIBUTE(fallthrough)
-# define XXH_FALLTHROUGH [[fallthrough]]
-#elif XXH_HAS_ATTRIBUTE(__fallthrough__)
-# define XXH_FALLTHROUGH __attribute__ ((__fallthrough__))
-#else
-# define XXH_FALLTHROUGH /* fallthrough */
-#endif
-/*! @endcond */
-
-/*! @cond Doxygen ignores this part */
-/*
- * Define XXH_NOESCAPE for annotated pointers in public API.
- * https://clang.llvm.org/docs/AttributeReference.html#noescape
- * As of writing this, only supported by clang.
- */
-#if XXH_HAS_ATTRIBUTE(noescape)
-# define XXH_NOESCAPE __attribute__((noescape))
-#else
-# define XXH_NOESCAPE
-#endif
-/*! @endcond */
-
-
-/*!
- * @}
- * @ingroup public
- * @{
- */
-
-#ifndef XXH_NO_LONG_LONG
-/*-**********************************************************************
-* 64-bit hash
-************************************************************************/
-#if defined(XXH_DOXYGEN) /* don't include */
-/*!
- * @brief An unsigned 64-bit integer.
- *
- * Not necessarily defined to `uint64_t` but functionally equivalent.
- */
-typedef uint64_t XXH64_hash_t;
-#elif !defined (__VMS) \
- && (defined (__cplusplus) \
- || (defined (__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) /* C99 */) )
-# include <stdint.h>
- typedef uint64_t XXH64_hash_t;
-#else
-# include <limits.h>
-# if defined(__LP64__) && ULONG_MAX == 0xFFFFFFFFFFFFFFFFULL
- /* LP64 ABI says uint64_t is unsigned long */
- typedef unsigned long XXH64_hash_t;
-# else
- /* the following type must have a width of 64 bits */
- typedef unsigned long long XXH64_hash_t;
-# endif
-#endif
-
-/*!
- * @}
- *
- * @defgroup XXH64_family XXH64 family
- * @ingroup public
- * @{
- * Contains functions used in the classic 64-bit xxHash algorithm.
- *
- * @note
- * XXH3 provides competitive speed for both 32-bit and 64-bit systems,
- * and offers true 64/128 bit hash results.
- * It provides better speed for systems with vector processing capabilities.
- */
-
-/*!
- * @brief Calculates the 64-bit hash of @p input using xxHash64.
- *
- * This function usually runs faster on 64-bit systems, but slower on 32-bit
- * systems (see benchmark).
- *
- * @param input The block of data to be hashed, at least @p length bytes in size.
- * @param length The length of @p input, in bytes.
- * @param seed The 64-bit seed to alter the hash's output predictably.
- * - * @pre - * The memory between @p input and @p input + @p length must be valid, - * readable, contiguous memory. However, if @p length is `0`, @p input may be - * `NULL`. In C++, this also must be *TriviallyCopyable*. - * - * @return The calculated 64-bit hash. - * - * @see - * XXH32(), XXH3_64bits_withSeed(), XXH3_128bits_withSeed(), XXH128(): - * Direct equivalents for the other variants of xxHash. - * @see - * XXH64_createState(), XXH64_update(), XXH64_digest(): Streaming version. - */ -XXH_PUBLIC_API XXH_PUREF XXH64_hash_t XXH64(XXH_NOESCAPE const void* input, size_t length, XXH64_hash_t seed); - -/******* Streaming *******/ -#ifndef XXH_NO_STREAM -/*! - * @brief The opaque state struct for the XXH64 streaming API. - * - * @see XXH64_state_s for details. - */ -typedef struct XXH64_state_s XXH64_state_t; /* incomplete type */ - -/*! - * @brief Allocates an @ref XXH64_state_t. - * - * Must be freed with XXH64_freeState(). - * @return An allocated XXH64_state_t on success, `NULL` on failure. - */ -XXH_PUBLIC_API XXH_MALLOCF XXH64_state_t* XXH64_createState(void); - -/*! - * @brief Frees an @ref XXH64_state_t. - * - * Must be allocated with XXH64_createState(). - * @param statePtr A pointer to an @ref XXH64_state_t allocated with @ref XXH64_createState(). - * @return XXH_OK. - */ -XXH_PUBLIC_API XXH_errorcode XXH64_freeState(XXH64_state_t* statePtr); - -/*! - * @brief Copies one @ref XXH64_state_t to another. - * - * @param dst_state The state to copy to. - * @param src_state The state to copy from. - * @pre - * @p dst_state and @p src_state must not be `NULL` and must not overlap. - */ -XXH_PUBLIC_API void XXH64_copyState(XXH_NOESCAPE XXH64_state_t* dst_state, const XXH64_state_t* src_state); - -/*! - * @brief Resets an @ref XXH64_state_t to begin a new hash. - * - * This function resets and seeds a state. Call it before @ref XXH64_update(). - * - * @param statePtr The state struct to reset. 
- * @param seed The 64-bit seed to alter the hash result predictably. - * - * @pre - * @p statePtr must not be `NULL`. - * - * @return @ref XXH_OK on success, @ref XXH_ERROR on failure. - */ -XXH_PUBLIC_API XXH_errorcode XXH64_reset (XXH_NOESCAPE XXH64_state_t* statePtr, XXH64_hash_t seed); - -/*! - * @brief Consumes a block of @p input to an @ref XXH64_state_t. - * - * Call this to incrementally consume blocks of data. - * - * @param statePtr The state struct to update. - * @param input The block of data to be hashed, at least @p length bytes in size. - * @param length The length of @p input, in bytes. - * - * @pre - * @p statePtr must not be `NULL`. - * @pre - * The memory between @p input and @p input + @p length must be valid, - * readable, contiguous memory. However, if @p length is `0`, @p input may be - * `NULL`. In C++, this also must be *TriviallyCopyable*. - * - * @return @ref XXH_OK on success, @ref XXH_ERROR on failure. - */ -XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH_NOESCAPE XXH64_state_t* statePtr, XXH_NOESCAPE const void* input, size_t length); - -/*! - * @brief Returns the calculated hash value from an @ref XXH64_state_t. - * - * @note - * Calling XXH64_digest() will not affect @p statePtr, so you can update, - * digest, and update again. - * - * @param statePtr The state struct to calculate the hash from. - * - * @pre - * @p statePtr must not be `NULL`. - * - * @return The calculated xxHash64 value from that state. - */ -XXH_PUBLIC_API XXH_PUREF XXH64_hash_t XXH64_digest (XXH_NOESCAPE const XXH64_state_t* statePtr); -#endif /* !XXH_NO_STREAM */ -/******* Canonical representation *******/ - -/*! - * @brief Canonical (big endian) representation of @ref XXH64_hash_t. - */ -typedef struct { unsigned char digest[sizeof(XXH64_hash_t)]; } XXH64_canonical_t; - -/*! - * @brief Converts an @ref XXH64_hash_t to a big endian @ref XXH64_canonical_t. - * - * @param dst The @ref XXH64_canonical_t pointer to be stored to. 
- * @param hash The @ref XXH64_hash_t to be converted. - * - * @pre - * @p dst must not be `NULL`. - */ -XXH_PUBLIC_API void XXH64_canonicalFromHash(XXH_NOESCAPE XXH64_canonical_t* dst, XXH64_hash_t hash); - -/*! - * @brief Converts an @ref XXH64_canonical_t to a native @ref XXH64_hash_t. - * - * @param src The @ref XXH64_canonical_t to convert. - * - * @pre - * @p src must not be `NULL`. - * - * @return The converted hash. - */ -XXH_PUBLIC_API XXH_PUREF XXH64_hash_t XXH64_hashFromCanonical(XXH_NOESCAPE const XXH64_canonical_t* src); - -#ifndef XXH_NO_XXH3 - -/*! - * @} - * ************************************************************************ - * @defgroup XXH3_family XXH3 family - * @ingroup public - * @{ - * - * XXH3 is a more recent hash algorithm featuring: - * - Improved speed for both small and large inputs - * - True 64-bit and 128-bit outputs - * - SIMD acceleration - * - Improved 32-bit viability - * - * Speed analysis methodology is explained here: - * - * https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html - * - * Compared to XXH64, expect XXH3 to run approximately - * ~2x faster on large inputs and >3x faster on small ones, - * exact differences vary depending on platform. - * - * XXH3's speed benefits greatly from SIMD and 64-bit arithmetic, - * but does not require it. - * Most 32-bit and 64-bit targets that can run XXH32 smoothly can run XXH3 - * at competitive speeds, even without vector support. Further details are - * explained in the implementation. - * - * XXH3 has a fast scalar implementation, but it also includes accelerated SIMD - * implementations for many common platforms: - * - AVX512 - * - AVX2 - * - SSE2 - * - ARM NEON - * - WebAssembly SIMD128 - * - POWER8 VSX - * - s390x ZVector - * This can be controlled via the @ref XXH_VECTOR macro, but it automatically - * selects the best version according to predefined macros. 
For the x86 family, an - * automatic runtime dispatcher is included separately in @ref xxh_x86dispatch.c. - * - * XXH3 implementation is portable: - * it has a generic C90 formulation that can be compiled on any platform, - * all implementations generate exactly the same hash value on all platforms. - * Starting from v0.8.0, it's also labelled "stable", meaning that - * any future version will also generate the same hash value. - * - * XXH3 offers 2 variants, _64bits and _128bits. - * - * When only 64 bits are needed, prefer invoking the _64bits variant, as it - * reduces the amount of mixing, resulting in faster speed on small inputs. - * It's also generally simpler to manipulate a scalar return type than a struct. - * - * The API supports one-shot hashing, streaming mode, and custom secrets. - */ -/*-********************************************************************** -* XXH3 64-bit variant -************************************************************************/ - -/*! - * @brief 64-bit unseeded variant of XXH3. - * - * This is equivalent to @ref XXH3_64bits_withSeed() with a seed of 0, however - * it may have slightly better performance due to constant propagation of the - * defaults. - * - * @see - * XXH32(), XXH64(), XXH3_128bits(): equivalent for the other xxHash algorithms - * @see - * XXH3_64bits_withSeed(), XXH3_64bits_withSecret(): other seeding variants - * @see - * XXH3_64bits_reset(), XXH3_64bits_update(), XXH3_64bits_digest(): Streaming version. - */ -XXH_PUBLIC_API XXH_PUREF XXH64_hash_t XXH3_64bits(XXH_NOESCAPE const void* input, size_t length); - -/*! - * @brief 64-bit seeded variant of XXH3 - * - * This variant generates a custom secret on the fly based on default secret - * altered using the `seed` value. - * - * While this operation is decently fast, note that it's not completely free. - * - * @note - * seed == 0 produces the same results as @ref XXH3_64bits(). 
- *
- * @param input The data to hash
- * @param length The length of @p input, in bytes
- * @param seed The 64-bit seed to alter the state.
- */
-XXH_PUBLIC_API XXH_PUREF XXH64_hash_t XXH3_64bits_withSeed(XXH_NOESCAPE const void* input, size_t length, XXH64_hash_t seed);
-
-/*!
- * The bare minimum size for a custom secret.
- *
- * @see
- * XXH3_64bits_withSecret(), XXH3_64bits_reset_withSecret(),
- * XXH3_128bits_withSecret(), XXH3_128bits_reset_withSecret().
- */
-#define XXH3_SECRET_SIZE_MIN 136
-
-/*!
- * @brief 64-bit variant of XXH3 with a custom "secret".
- *
- * It's possible to provide any blob of bytes as a "secret" to generate the hash.
- * This makes it more difficult for an external actor to prepare an intentional collision.
- * The main condition is that @p secretSize *must* be large enough (>= XXH3_SECRET_SIZE_MIN).
- * However, the quality of the secret impacts the dispersion of the hash algorithm.
- * Therefore, the secret _must_ look like a bunch of random bytes.
- * Avoid "trivial" or structured data such as repeated sequences or a text document.
- * Whenever in doubt about the "randomness" of the blob of bytes,
- * consider employing XXH3_generateSecret() instead (see below).
- * It will generate a proper high-entropy secret derived from the blob of bytes.
- * Another advantage of using XXH3_generateSecret() is that
- * it guarantees that all bits within the initial blob of bytes
- * will impact every bit of the output.
- * This is not necessarily the case when using the blob of bytes directly
- * because, when hashing _small_ inputs, only a portion of the secret is employed.
- */
-XXH_PUBLIC_API XXH_PUREF XXH64_hash_t XXH3_64bits_withSecret(XXH_NOESCAPE const void* data, size_t len, XXH_NOESCAPE const void* secret, size_t secretSize);
-
-
-/******* Streaming *******/
-#ifndef XXH_NO_STREAM
-/*
- * Streaming requires state maintenance.
- * This operation costs memory and CPU.
- * As a consequence, streaming is slower than one-shot hashing.
- * For better performance, prefer one-shot functions whenever applicable.
- */
-
-/*!
- * @brief The state struct for the XXH3 streaming API.
- *
- * @see XXH3_state_s for details.
- */
-typedef struct XXH3_state_s XXH3_state_t;
-XXH_PUBLIC_API XXH_MALLOCF XXH3_state_t* XXH3_createState(void);
-XXH_PUBLIC_API XXH_errorcode XXH3_freeState(XXH3_state_t* statePtr);
-
-/*!
- * @brief Copies one @ref XXH3_state_t to another.
- *
- * @param dst_state The state to copy to.
- * @param src_state The state to copy from.
- * @pre
- * @p dst_state and @p src_state must not be `NULL` and must not overlap.
- */
-XXH_PUBLIC_API void XXH3_copyState(XXH_NOESCAPE XXH3_state_t* dst_state, XXH_NOESCAPE const XXH3_state_t* src_state);
-
-/*!
- * @brief Resets an @ref XXH3_state_t to begin a new hash.
- *
- * This function resets `statePtr` and generates a secret with default parameters. Call it before @ref XXH3_64bits_update().
- * The digest will be equivalent to `XXH3_64bits()`.
- *
- * @param statePtr The state struct to reset.
- *
- * @pre
- * @p statePtr must not be `NULL`.
- *
- * @return @ref XXH_OK on success, @ref XXH_ERROR on failure.
- *
- */
-XXH_PUBLIC_API XXH_errorcode XXH3_64bits_reset(XXH_NOESCAPE XXH3_state_t* statePtr);
-
-/*!
- * @brief Resets an @ref XXH3_state_t with a 64-bit seed to begin a new hash.
- *
- * This function resets `statePtr` and generates a secret from `seed`. Call it before @ref XXH3_64bits_update().
- * The digest will be equivalent to `XXH3_64bits_withSeed()`.
- *
- * @param statePtr The state struct to reset.
- * @param seed The 64-bit seed to alter the state.
- *
- * @pre
- * @p statePtr must not be `NULL`.
- *
- * @return @ref XXH_OK on success, @ref XXH_ERROR on failure.
- *
- */
-XXH_PUBLIC_API XXH_errorcode XXH3_64bits_reset_withSeed(XXH_NOESCAPE XXH3_state_t* statePtr, XXH64_hash_t seed);
-
-/*!
- * XXH3_64bits_reset_withSecret():
- * `secret` is referenced, it _must outlive_ the hash streaming session.
- * Similar to one-shot API, `secretSize` must be >= `XXH3_SECRET_SIZE_MIN`, - * and the quality of produced hash values depends on secret's entropy - * (secret's content should look like a bunch of random bytes). - * When in doubt about the randomness of a candidate `secret`, - * consider employing `XXH3_generateSecret()` instead (see below). - */ -XXH_PUBLIC_API XXH_errorcode XXH3_64bits_reset_withSecret(XXH_NOESCAPE XXH3_state_t* statePtr, XXH_NOESCAPE const void* secret, size_t secretSize); - -/*! - * @brief Consumes a block of @p input to an @ref XXH3_state_t. - * - * Call this to incrementally consume blocks of data. - * - * @param statePtr The state struct to update. - * @param input The block of data to be hashed, at least @p length bytes in size. - * @param length The length of @p input, in bytes. - * - * @pre - * @p statePtr must not be `NULL`. - * @pre - * The memory between @p input and @p input + @p length must be valid, - * readable, contiguous memory. However, if @p length is `0`, @p input may be - * `NULL`. In C++, this also must be *TriviallyCopyable*. - * - * @return @ref XXH_OK on success, @ref XXH_ERROR on failure. - */ -XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH_NOESCAPE XXH3_state_t* statePtr, XXH_NOESCAPE const void* input, size_t length); - -/*! - * @brief Returns the calculated XXH3 64-bit hash value from an @ref XXH3_state_t. - * - * @note - * Calling XXH3_64bits_digest() will not affect @p statePtr, so you can update, - * digest, and update again. - * - * @param statePtr The state struct to calculate the hash from. - * - * @pre - * @p statePtr must not be `NULL`. - * - * @return The calculated XXH3 64-bit hash value from that state. 
- */ -XXH_PUBLIC_API XXH_PUREF XXH64_hash_t XXH3_64bits_digest (XXH_NOESCAPE const XXH3_state_t* statePtr); -#endif /* !XXH_NO_STREAM */ - -/* note : canonical representation of XXH3 is the same as XXH64 - * since they both produce XXH64_hash_t values */ - - -/*-********************************************************************** -* XXH3 128-bit variant -************************************************************************/ - -/*! - * @brief The return value from 128-bit hashes. - * - * Stored in little endian order, although the fields themselves are in native - * endianness. - */ -typedef struct { - XXH64_hash_t low64; /*!< `value & 0xFFFFFFFFFFFFFFFF` */ - XXH64_hash_t high64; /*!< `value >> 64` */ -} XXH128_hash_t; - -/*! - * @brief Unseeded 128-bit variant of XXH3 - * - * The 128-bit variant of XXH3 has more strength, but it has a bit of overhead - * for shorter inputs. - * - * This is equivalent to @ref XXH3_128bits_withSeed() with a seed of 0, however - * it may have slightly better performance due to constant propagation of the - * defaults. - * - * @see - * XXH32(), XXH64(), XXH3_64bits(): equivalent for the other xxHash algorithms - * @see - * XXH3_128bits_withSeed(), XXH3_128bits_withSecret(): other seeding variants - * @see - * XXH3_128bits_reset(), XXH3_128bits_update(), XXH3_128bits_digest(): Streaming version. - */ -XXH_PUBLIC_API XXH_PUREF XXH128_hash_t XXH3_128bits(XXH_NOESCAPE const void* data, size_t len); -/*! @brief Seeded 128-bit variant of XXH3. @see XXH3_64bits_withSeed(). */ -XXH_PUBLIC_API XXH_PUREF XXH128_hash_t XXH3_128bits_withSeed(XXH_NOESCAPE const void* data, size_t len, XXH64_hash_t seed); -/*! @brief Custom secret 128-bit variant of XXH3. @see XXH3_64bits_withSecret(). */ -XXH_PUBLIC_API XXH_PUREF XXH128_hash_t XXH3_128bits_withSecret(XXH_NOESCAPE const void* data, size_t len, XXH_NOESCAPE const void* secret, size_t secretSize); - -/******* Streaming *******/ -#ifndef XXH_NO_STREAM -/* - * Streaming requires state maintenance. 
- * This operation costs memory and CPU.
- * As a consequence, streaming is slower than one-shot hashing.
- * For better performance, prefer one-shot functions whenever applicable.
- *
- * XXH3_128bits uses the same XXH3_state_t as XXH3_64bits().
- * Use the already declared XXH3_createState() and XXH3_freeState().
- *
- * All reset and streaming functions have the same meaning as their 64-bit counterparts.
- */
-
-/*!
- * @brief Resets an @ref XXH3_state_t to begin a new hash.
- *
- * This function resets `statePtr` and generates a secret with default parameters. Call it before @ref XXH3_128bits_update().
- * The digest will be equivalent to `XXH3_128bits()`.
- *
- * @param statePtr The state struct to reset.
- *
- * @pre
- * @p statePtr must not be `NULL`.
- *
- * @return @ref XXH_OK on success, @ref XXH_ERROR on failure.
- *
- */
-XXH_PUBLIC_API XXH_errorcode XXH3_128bits_reset(XXH_NOESCAPE XXH3_state_t* statePtr);
-
-/*!
- * @brief Resets an @ref XXH3_state_t with a 64-bit seed to begin a new hash.
- *
- * This function resets `statePtr` and generates a secret from `seed`. Call it before @ref XXH3_128bits_update().
- * The digest will be equivalent to `XXH3_128bits_withSeed()`.
- *
- * @param statePtr The state struct to reset.
- * @param seed The 64-bit seed to alter the state.
- *
- * @pre
- * @p statePtr must not be `NULL`.
- *
- * @return @ref XXH_OK on success, @ref XXH_ERROR on failure.
- *
- */
-XXH_PUBLIC_API XXH_errorcode XXH3_128bits_reset_withSeed(XXH_NOESCAPE XXH3_state_t* statePtr, XXH64_hash_t seed);
-/*! @brief Custom secret 128-bit variant of XXH3. @see XXH3_64bits_reset_withSecret(). */
-XXH_PUBLIC_API XXH_errorcode XXH3_128bits_reset_withSecret(XXH_NOESCAPE XXH3_state_t* statePtr, XXH_NOESCAPE const void* secret, size_t secretSize);
-
-/*!
- * @brief Consumes a block of @p input to an @ref XXH3_state_t.
- *
- * Call this to incrementally consume blocks of data.
- *
- * @param statePtr The state struct to update.
- * @param input The block of data to be hashed, at least @p length bytes in size.
- * @param length The length of @p input, in bytes.
- *
- * @pre
- * @p statePtr must not be `NULL`.
- * @pre
- * The memory between @p input and @p input + @p length must be valid,
- * readable, contiguous memory. However, if @p length is `0`, @p input may be
- * `NULL`. In C++, this also must be *TriviallyCopyable*.
- *
- * @return @ref XXH_OK on success, @ref XXH_ERROR on failure.
- */
-XXH_PUBLIC_API XXH_errorcode XXH3_128bits_update (XXH_NOESCAPE XXH3_state_t* statePtr, XXH_NOESCAPE const void* input, size_t length);
-
-/*!
- * @brief Returns the calculated XXH3 128-bit hash value from an @ref XXH3_state_t.
- *
- * @note
- * Calling XXH3_128bits_digest() will not affect @p statePtr, so you can update,
- * digest, and update again.
- *
- * @param statePtr The state struct to calculate the hash from.
- *
- * @pre
- * @p statePtr must not be `NULL`.
- *
- * @return The calculated XXH3 128-bit hash value from that state.
- */
-XXH_PUBLIC_API XXH_PUREF XXH128_hash_t XXH3_128bits_digest (XXH_NOESCAPE const XXH3_state_t* statePtr);
-#endif /* !XXH_NO_STREAM */
-
-/* The following helper functions make it possible to compare XXH128_hash_t values.
- * Since XXH128_hash_t is a structure, this capability is not offered by the language.
- * Note: For better performance, these functions can be inlined using XXH_INLINE_ALL */
-
-/*!
- * XXH128_isEqual():
- * Return: 1 if `h1` and `h2` are equal, 0 if they are not.
- */
-XXH_PUBLIC_API XXH_PUREF int XXH128_isEqual(XXH128_hash_t h1, XXH128_hash_t h2);
-
-/*!
- * @brief Compares two @ref XXH128_hash_t values.
- * This comparator is compatible with stdlib's `qsort()`/`bsearch()`.
- * - * @return: >0 if *h128_1 > *h128_2 - * =0 if *h128_1 == *h128_2 - * <0 if *h128_1 < *h128_2 - */ -XXH_PUBLIC_API XXH_PUREF int XXH128_cmp(XXH_NOESCAPE const void* h128_1, XXH_NOESCAPE const void* h128_2); - - -/******* Canonical representation *******/ -typedef struct { unsigned char digest[sizeof(XXH128_hash_t)]; } XXH128_canonical_t; - - -/*! - * @brief Converts an @ref XXH128_hash_t to a big endian @ref XXH128_canonical_t. - * - * @param dst The @ref XXH128_canonical_t pointer to be stored to. - * @param hash The @ref XXH128_hash_t to be converted. - * - * @pre - * @p dst must not be `NULL`. - */ -XXH_PUBLIC_API void XXH128_canonicalFromHash(XXH_NOESCAPE XXH128_canonical_t* dst, XXH128_hash_t hash); - -/*! - * @brief Converts an @ref XXH128_canonical_t to a native @ref XXH128_hash_t. - * - * @param src The @ref XXH128_canonical_t to convert. - * - * @pre - * @p src must not be `NULL`. - * - * @return The converted hash. - */ -XXH_PUBLIC_API XXH_PUREF XXH128_hash_t XXH128_hashFromCanonical(XXH_NOESCAPE const XXH128_canonical_t* src); - - -#endif /* !XXH_NO_XXH3 */ -#endif /* XXH_NO_LONG_LONG */ - -/*! - * @} - */ -#endif /* XXHASH_H_5627135585666179 */ - - - -#if defined(XXH_STATIC_LINKING_ONLY) && !defined(XXHASH_H_STATIC_13879238742) -#define XXHASH_H_STATIC_13879238742 -/* **************************************************************************** - * This section contains declarations which are not guaranteed to remain stable. - * They may change in future versions, becoming incompatible with a different - * version of the library. - * These declarations should only be used with static linking. - * Never use them in association with dynamic linking! - ***************************************************************************** */ - -/* - * These definitions are only present to allow static allocation - * of XXH states, on stack or in a struct, for example. - * Never **ever** access their members directly. - */ - -/*! 
- * @internal
- * @brief Structure for XXH32 streaming API.
- *
- * @note This is only defined when @ref XXH_STATIC_LINKING_ONLY,
- * @ref XXH_INLINE_ALL, or @ref XXH_IMPLEMENTATION is defined. Otherwise it is
- * an opaque type. This allows fields to safely be changed.
- *
- * Typedef'd to @ref XXH32_state_t.
- * Do not access the members of this struct directly.
- * @see XXH64_state_s, XXH3_state_s
- */
-struct XXH32_state_s {
- XXH32_hash_t total_len_32; /*!< Total length hashed, modulo 2^32 */
- XXH32_hash_t large_len; /*!< Whether the hash is >= 16 (handles @ref total_len_32 overflow) */
- XXH32_hash_t v[4]; /*!< Accumulator lanes */
- XXH32_hash_t mem32[4]; /*!< Internal buffer for partial reads. Treated as unsigned char[16]. */
- XXH32_hash_t memsize; /*!< Amount of data in @ref mem32 */
- XXH32_hash_t reserved; /*!< Reserved field. Do not read or write to it. */
-}; /* typedef'd to XXH32_state_t */
-
-
-#ifndef XXH_NO_LONG_LONG /* defined when there is no 64-bit support */
-
-/*!
- * @internal
- * @brief Structure for XXH64 streaming API.
- *
- * @note This is only defined when @ref XXH_STATIC_LINKING_ONLY,
- * @ref XXH_INLINE_ALL, or @ref XXH_IMPLEMENTATION is defined. Otherwise it is
- * an opaque type. This allows fields to safely be changed.
- *
- * Typedef'd to @ref XXH64_state_t.
- * Do not access the members of this struct directly.
- * @see XXH32_state_s, XXH3_state_s
- */
-struct XXH64_state_s {
- XXH64_hash_t total_len; /*!< Total length hashed. This is always 64-bit. */
- XXH64_hash_t v[4]; /*!< Accumulator lanes */
- XXH64_hash_t mem64[4]; /*!< Internal buffer for partial reads. Treated as unsigned char[32]. */
- XXH32_hash_t memsize; /*!< Amount of data in @ref mem64 */
- XXH32_hash_t reserved32; /*!< Reserved field, needed for padding anyway */
- XXH64_hash_t reserved64; /*!< Reserved field. Do not read or write to it.
 */
-}; /* typedef'd to XXH64_state_t */
-
-#ifndef XXH_NO_XXH3
-
-#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) /* >= C11 */
-# include <stdalign.h>
-# define XXH_ALIGN(n) alignas(n)
-#elif defined(__cplusplus) && (__cplusplus >= 201103L) /* >= C++11 */
-/* In C++ alignas() is a keyword */
-# define XXH_ALIGN(n) alignas(n)
-#elif defined(__GNUC__)
-# define XXH_ALIGN(n) __attribute__ ((aligned(n)))
-#elif defined(_MSC_VER)
-# define XXH_ALIGN(n) __declspec(align(n))
-#else
-# define XXH_ALIGN(n) /* disabled */
-#endif
-
-/* Old GCC versions only accept the attribute after the type in structures. */
-#if !(defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L)) /* C11+ */ \
- && ! (defined(__cplusplus) && (__cplusplus >= 201103L)) /* >= C++11 */ \
- && defined(__GNUC__)
-# define XXH_ALIGN_MEMBER(align, type) type XXH_ALIGN(align)
-#else
-# define XXH_ALIGN_MEMBER(align, type) XXH_ALIGN(align) type
-#endif
-
-/*!
- * @brief The size of the internal XXH3 buffer.
- *
- * This is the optimal update size for incremental hashing.
- *
- * @see XXH3_64b_update(), XXH3_128b_update().
- */
-#define XXH3_INTERNALBUFFER_SIZE 256
-
-/*!
- * @internal
- * @brief Default size of the secret buffer (and @ref XXH3_kSecret).
- *
- * This is the size used in @ref XXH3_kSecret and the seeded functions.
- *
- * Not to be confused with @ref XXH3_SECRET_SIZE_MIN.
- */
-#define XXH3_SECRET_DEFAULT_SIZE 192
-
-/*!
- * @internal
- * @brief Structure for XXH3 streaming API.
- *
- * @note This is only defined when @ref XXH_STATIC_LINKING_ONLY,
- * @ref XXH_INLINE_ALL, or @ref XXH_IMPLEMENTATION is defined.
- * Otherwise it is an opaque type.
- * Never use this definition in combination with a dynamic library.
- * This allows fields to safely be changed in the future.
- *
- * @note ** This structure has a strict alignment requirement of 64 bytes!! **
- * Do not allocate this with `malloc()` or `new`,
- * it will not be sufficiently aligned.
- * Use @ref XXH3_createState() and @ref XXH3_freeState(), or stack allocation.
- *
- * Typedef'd to @ref XXH3_state_t.
- * Never access the members of this struct directly.
- *
- * @see XXH3_INITSTATE() for stack initialization.
- * @see XXH3_createState(), XXH3_freeState().
- * @see XXH32_state_s, XXH64_state_s
- */
-struct XXH3_state_s {
- XXH_ALIGN_MEMBER(64, XXH64_hash_t acc[8]);
- /*!< The 8 accumulators. See @ref XXH32_state_s::v and @ref XXH64_state_s::v */
- XXH_ALIGN_MEMBER(64, unsigned char customSecret[XXH3_SECRET_DEFAULT_SIZE]);
- /*!< Used to store a custom secret generated from a seed. */
- XXH_ALIGN_MEMBER(64, unsigned char buffer[XXH3_INTERNALBUFFER_SIZE]);
- /*!< The internal buffer. @see XXH32_state_s::mem32 */
- XXH32_hash_t bufferedSize;
- /*!< The amount of memory in @ref buffer, @see XXH32_state_s::memsize */
- XXH32_hash_t useSeed;
- /*!< Reserved field. Needed for padding on 64-bit. */
- size_t nbStripesSoFar;
- /*!< Number of stripes processed. */
- XXH64_hash_t totalLen;
- /*!< Total length hashed. 64-bit even on 32-bit targets. */
- size_t nbStripesPerBlock;
- /*!< Number of stripes per block. */
- size_t secretLimit;
- /*!< Size of @ref customSecret or @ref extSecret */
- XXH64_hash_t seed;
- /*!< Seed for _withSeed variants. Must be zero otherwise, @see XXH3_INITSTATE() */
- XXH64_hash_t reserved64;
- /*!< Reserved field. */
- const unsigned char* extSecret;
- /*!< Reference to an external secret for the _withSecret variants, NULL
- * for other variants. */
- /* note: there may be some padding at the end due to alignment on 64 bytes */
-}; /* typedef'd to XXH3_state_t */
-
-#undef XXH_ALIGN_MEMBER
-
-/*!
- * @brief Initializes a stack-allocated `XXH3_state_s`.
- *
- * When the @ref XXH3_state_t structure is merely emplaced on the stack,
- * it should be initialized with XXH3_INITSTATE() or a memset()
- * in case its first reset uses XXH3_NNbits_reset_withSeed().
- * This init can be omitted if the first reset uses default or _withSecret mode. - * This operation isn't necessary when the state is created with XXH3_createState(). - * Note that this doesn't prepare the state for a streaming operation, - * it's still necessary to use XXH3_NNbits_reset*() afterwards. - */ -#define XXH3_INITSTATE(XXH3_state_ptr) \ - do { \ - XXH3_state_t* tmp_xxh3_state_ptr = (XXH3_state_ptr); \ - tmp_xxh3_state_ptr->seed = 0; \ - tmp_xxh3_state_ptr->extSecret = NULL; \ - } while(0) - - -/*! - * simple alias to pre-selected XXH3_128bits variant - */ -XXH_PUBLIC_API XXH_PUREF XXH128_hash_t XXH128(XXH_NOESCAPE const void* data, size_t len, XXH64_hash_t seed); - - -/* === Experimental API === */ -/* Symbols defined below must be considered tied to a specific library version. */ - -/*! - * XXH3_generateSecret(): - * - * Derive a high-entropy secret from any user-defined content, named customSeed. - * The generated secret can be used in combination with `*_withSecret()` functions. - * The `_withSecret()` variants are useful to provide a higher level of protection - * than 64-bit seed, as it becomes much more difficult for an external actor to - * guess how to impact the calculation logic. - * - * The function accepts as input a custom seed of any length and any content, - * and derives from it a high-entropy secret of length @p secretSize into an - * already allocated buffer @p secretBuffer. - * - * The generated secret can then be used with any `*_withSecret()` variant. - * The functions @ref XXH3_128bits_withSecret(), @ref XXH3_64bits_withSecret(), - * @ref XXH3_128bits_reset_withSecret() and @ref XXH3_64bits_reset_withSecret() - * are part of this list. They all accept a `secret` parameter - * which must be large enough for implementation reasons (>= @ref XXH3_SECRET_SIZE_MIN) - * _and_ feature very high entropy (consist of random-looking bytes). 
- * These conditions can be a high bar to meet, so @ref XXH3_generateSecret() can - * be employed to ensure proper quality. - * - * @p customSeed can be anything. It can have any size, even small ones, - * and its content can be anything, even "poor entropy" sources such as a bunch - * of zeroes. The resulting `secret` will nonetheless provide all required qualities. - * - * @pre - * - @p secretSize must be >= @ref XXH3_SECRET_SIZE_MIN - * - When @p customSeedSize > 0, supplying NULL as customSeed is undefined behavior. - * - * Example code: - * @code{.c} - * #include <stdio.h> - * #include <string.h> - * #include <stdlib.h> - * #define XXH_STATIC_LINKING_ONLY // expose unstable API - * #include "xxhash.h" - * // Hashes argv[2] using the entropy from argv[1]. - * int main(int argc, char* argv[]) - * { - * char secret[XXH3_SECRET_SIZE_MIN]; - * if (argc != 3) { return 1; } - * XXH3_generateSecret(secret, sizeof(secret), argv[1], strlen(argv[1])); - * XXH64_hash_t h = XXH3_64bits_withSecret( - * argv[2], strlen(argv[2]), - * secret, sizeof(secret) - * ); - * printf("%016llx\n", (unsigned long long) h); - * } - * @endcode - */ -XXH_PUBLIC_API XXH_errorcode XXH3_generateSecret(XXH_NOESCAPE void* secretBuffer, size_t secretSize, XXH_NOESCAPE const void* customSeed, size_t customSeedSize); - -/*! - * @brief Generate the same secret as the _withSeed() variants. - * - * The generated secret can be used in combination with - * `*_withSecret()` and `_withSecretandSeed()` variants. - * - * Example C++ `std::string` hash class: - * @code{.cpp} - * #include <string> - * #define XXH_STATIC_LINKING_ONLY // expose unstable API - * #include "xxhash.h" - * // Slow, seeds each time - * class HashSlow { - * XXH64_hash_t seed; - * public: - * HashSlow(XXH64_hash_t s) : seed{s} {} - * size_t operator()(const std::string& x) const { - * return size_t{XXH3_64bits_withSeed(x.c_str(), x.length(), seed)}; - * } - * }; - * // Fast, caches the seeded secret for future uses.
- * class HashFast { - * unsigned char secret[XXH3_SECRET_SIZE_MIN]; - * public: - * HashFast(XXH64_hash_t s) { - * XXH3_generateSecret_fromSeed(secret, s); - * } - * size_t operator()(const std::string& x) const { - * return size_t{ - * XXH3_64bits_withSecret(x.c_str(), x.length(), secret, sizeof(secret)) - * }; - * } - * }; - * @endcode - * @param secretBuffer A writable buffer of @ref XXH3_SECRET_SIZE_MIN bytes - * @param seed The seed to seed the state. - */ -XXH_PUBLIC_API void XXH3_generateSecret_fromSeed(XXH_NOESCAPE void* secretBuffer, XXH64_hash_t seed); - -/*! - * These variants generate hash values using either - * @p seed for "short" keys (< XXH3_MIDSIZE_MAX = 240 bytes) - * or @p secret for "large" keys (>= XXH3_MIDSIZE_MAX). - * - * This generally benefits speed, compared to `_withSeed()` or `_withSecret()`. - * `_withSeed()` has to generate the secret on the fly for "large" keys. - * It's fast, but can be perceptible for "not so large" keys (< 1 KB). - * `_withSecret()` has to generate the masks on the fly for "small" keys, - * which requires more instructions than _withSeed() variants. - * Therefore, the _withSecretandSeed() variant combines the best of both worlds. - * - * When @p secret has been generated by XXH3_generateSecret_fromSeed(), - * this variant produces *exactly* the same results as `_withSeed()` variant, - * hence offering only a pure speed benefit on "large" input, - * by skipping the need to regenerate the secret for every large input. - * - * Another usage scenario is to hash the secret to a 64-bit hash value, - * for example with XXH3_64bits(), which then becomes the seed, - * and then employ both the seed and the secret in _withSecretandSeed(). - * On top of speed, an added benefit is that each bit in the secret - * has a 50% chance to swap each bit in the output, via its impact on the seed.
- * - * This is not guaranteed when using the secret directly in "small data" scenarios, - * because only portions of the secret are employed for small data. - */ -XXH_PUBLIC_API XXH_PUREF XXH64_hash_t -XXH3_64bits_withSecretandSeed(XXH_NOESCAPE const void* data, size_t len, - XXH_NOESCAPE const void* secret, size_t secretSize, - XXH64_hash_t seed); -/*! @copydoc XXH3_64bits_withSecretandSeed() */ -XXH_PUBLIC_API XXH_PUREF XXH128_hash_t -XXH3_128bits_withSecretandSeed(XXH_NOESCAPE const void* input, size_t length, - XXH_NOESCAPE const void* secret, size_t secretSize, - XXH64_hash_t seed64); -#ifndef XXH_NO_STREAM -/*! @copydoc XXH3_64bits_withSecretandSeed() */ -XXH_PUBLIC_API XXH_errorcode -XXH3_64bits_reset_withSecretandSeed(XXH_NOESCAPE XXH3_state_t* statePtr, - XXH_NOESCAPE const void* secret, size_t secretSize, - XXH64_hash_t seed64); -/*! @copydoc XXH3_64bits_withSecretandSeed() */ -XXH_PUBLIC_API XXH_errorcode -XXH3_128bits_reset_withSecretandSeed(XXH_NOESCAPE XXH3_state_t* statePtr, - XXH_NOESCAPE const void* secret, size_t secretSize, - XXH64_hash_t seed64); -#endif /* !XXH_NO_STREAM */ - -#endif /* !XXH_NO_XXH3 */ -#endif /* XXH_NO_LONG_LONG */ -#if defined(XXH_INLINE_ALL) || defined(XXH_PRIVATE_API) -# define XXH_IMPLEMENTATION -#endif - -#endif /* defined(XXH_STATIC_LINKING_ONLY) && !defined(XXHASH_H_STATIC_13879238742) */ - - -/* ======================================================================== */ -/* ======================================================================== */ -/* ======================================================================== */ - - -/*-********************************************************************** - * xxHash implementation - *-********************************************************************** - * xxHash's implementation used to be hosted inside xxhash.c. - * - * However, inlining requires implementation to be visible to the compiler, - * hence be included alongside the header. 
- * Previously, implementation was hosted inside xxhash.c, - * which was then #included when inlining was activated. - * This construction created issues with a few build and install systems, - * as it required xxhash.c to be stored in the /include directory. - * - * xxHash implementation is now directly integrated within xxhash.h. - * As a consequence, xxhash.c is no longer needed in /include. - * - * xxhash.c is still available and is still useful. - * In a "normal" setup, when xxhash is not inlined, - * xxhash.h only exposes the prototypes and public symbols, - * while xxhash.c can be built into an object file xxhash.o - * which can then be linked into the final binary. - ************************************************************************/ - -#if ( defined(XXH_INLINE_ALL) || defined(XXH_PRIVATE_API) \ - || defined(XXH_IMPLEMENTATION) ) && !defined(XXH_IMPLEM_13a8737387) -# define XXH_IMPLEM_13a8737387 - -/* ************************************* -* Tuning parameters -***************************************/ - -/*! - * @defgroup tuning Tuning parameters - * @{ - * - * Various macros to control xxHash's behavior. - */ -#ifdef XXH_DOXYGEN -/*! - * @brief Define this to disable 64-bit code. - * - * Useful if only using the @ref XXH32_family and you have a strict C90 compiler. - */ -# define XXH_NO_LONG_LONG -# undef XXH_NO_LONG_LONG /* don't actually */ -/*! - * @brief Controls how unaligned memory is accessed. - * - * By default, access to unaligned memory is controlled by `memcpy()`, which is - * safe and portable. - * - * Unfortunately, on some target/compiler combinations, the generated assembly - * is sub-optimal. - * - * The below switch allows selection of a different access method - * in the search for improved performance. - * - * @par Possible options: - * - * - `XXH_FORCE_MEMORY_ACCESS=0` (default): `memcpy` - * @par - * Use `memcpy()`. Safe and portable.
Note that most modern compilers will - * eliminate the function call and treat it as an unaligned access. - * - * - `XXH_FORCE_MEMORY_ACCESS=1`: `__attribute__((aligned(1)))` - * @par - * Depends on compiler extensions and is therefore not portable. - * This method is safe _if_ your compiler supports it, - * and *generally* as fast or faster than `memcpy`. - * - * - `XXH_FORCE_MEMORY_ACCESS=2`: Direct cast - * @par - * Casts directly and dereferences. This method doesn't depend on the - * compiler, but it violates the C standard as it directly dereferences an - * unaligned pointer. It can generate buggy code on targets which do not - * support unaligned memory accesses, but in some circumstances, it's the - * only known way to get the most performance. - * - * - `XXH_FORCE_MEMORY_ACCESS=3`: Byteshift - * @par - * Also portable. This can generate the best code on old compilers which don't - * inline small `memcpy()` calls, and it might also be faster on big-endian - * systems which lack a native byteswap instruction. However, some compilers - * will emit literal byteshifts even if the target supports unaligned access. - * - * - * @warning - * Methods 1 and 2 rely on implementation-defined behavior. Use these with - * care, as what works on one compiler/platform/optimization level may cause - * another to read garbage data or even crash. - * - * See https://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html for details. - * - * Prefer these methods in priority order (0 > 3 > 1 > 2) - */ -# define XXH_FORCE_MEMORY_ACCESS 0 - -/*! - * @def XXH_SIZE_OPT - * @brief Controls how much xxHash optimizes for size. - * - * xxHash, when compiled, tends to result in a rather large binary size. This - * is mostly due to heavy usage of forced inlining and constant folding of the - * @ref XXH3_family to increase performance. - * - * However, some developers prefer size over speed. This option can - * significantly reduce the size of the generated code.
When using the `-Os` - * or `-Oz` options on GCC or Clang, this is defined to 1 by default, - * otherwise it is defined to 0. - * - * Most of these size optimizations can be controlled manually. - * - * This is a number from 0-2. - * - `XXH_SIZE_OPT` == 0: Default. xxHash makes no size optimizations. Speed - * comes first. - * - `XXH_SIZE_OPT` == 1: Default for `-Os` and `-Oz`. xxHash is more - * conservative and disables hacks that increase code size. It implies the - * options @ref XXH_NO_INLINE_HINTS == 1, @ref XXH_FORCE_ALIGN_CHECK == 0, - * and @ref XXH3_NEON_LANES == 8 if they are not already defined. - * - `XXH_SIZE_OPT` == 2: xxHash tries to make itself as small as possible. - * Performance may cry. For example, the single shot functions just use the - * streaming API. - */ -# define XXH_SIZE_OPT 0 - -/*! - * @def XXH_FORCE_ALIGN_CHECK - * @brief If defined to non-zero, adds a special path for aligned inputs (XXH32() - * and XXH64() only). - * - * This is an important performance trick for architectures without decent - * unaligned memory access performance. - * - * It checks for input alignment, and when conditions are met, uses a "fast - * path" employing direct 32-bit/64-bit reads, resulting in _dramatically - * faster_ read speed. - * - * The check costs one initial branch per hash, which is generally negligible, - * but not zero. - * - * Moreover, it's not useful to generate an additional code path if memory - * access uses the same instruction for both aligned and unaligned - * addresses (e.g. x86 and aarch64). - * - * In these cases, the alignment check can be removed by setting this macro to 0. - * Then the code will always use unaligned memory access. - * Align check is automatically disabled on x86, x64, ARM64, and some ARM chips - * which are platforms known to offer good unaligned memory access performance. - * - * It is also disabled by default when @ref XXH_SIZE_OPT >= 1. - * - * This option does not affect XXH3 (only XXH32 and XXH64).
- */ -# define XXH_FORCE_ALIGN_CHECK 0 - -/*! - * @def XXH_NO_INLINE_HINTS - * @brief When non-zero, sets all functions to `static`. - * - * By default, xxHash tries to force the compiler to inline almost all internal - * functions. - * - * This can usually improve performance due to reduced jumping and improved - * constant folding, but significantly increases the size of the binary which - * might not be favorable. - * - * Additionally, sometimes the forced inlining can be detrimental to performance, - * depending on the architecture. - * - * XXH_NO_INLINE_HINTS marks all internal functions as static, giving the - * compiler full control on whether to inline or not. - * - * When not optimizing (-O0), using `-fno-inline` with GCC or Clang, or if - * @ref XXH_SIZE_OPT >= 1, this will automatically be defined. - */ -# define XXH_NO_INLINE_HINTS 0 - -/*! - * @def XXH3_INLINE_SECRET - * @brief Determines whether to inline the XXH3 withSecret code. - * - * When the secret size is known, the compiler can improve the performance - * of XXH3_64bits_withSecret() and XXH3_128bits_withSecret(). - * - * However, if the secret size is not known, it doesn't have any benefit. This - * happens when xxHash is compiled into a global symbol. Therefore, if - * @ref XXH_INLINE_ALL is *not* defined, this will be defined to 0. - * - * Additionally, this defaults to 0 on GCC 12+, which has an issue with function pointers - * that are *sometimes* force inline on -Og, and it is impossible to automatically - * detect this optimization level. - */ -# define XXH3_INLINE_SECRET 0 - -/*! - * @def XXH32_ENDJMP - * @brief Whether to use a jump for `XXH32_finalize`. - * - * For performance, `XXH32_finalize` uses multiple branches in the finalizer. - * This is generally preferable for performance, - * but depending on exact architecture, a jmp may be preferable. - * - * This setting is only possibly making a difference for very small inputs. - */ -# define XXH32_ENDJMP 0 - -/*! 
- * @internal - * @brief Redefines old internal names. - * - * For compatibility with code that uses xxHash's internals before the names - * were changed to improve namespacing. There is no other reason to use this. - */ -# define XXH_OLD_NAMES -# undef XXH_OLD_NAMES /* don't actually use, it is ugly. */ - -/*! - * @def XXH_NO_STREAM - * @brief Disables the streaming API. - * - * When xxHash is not inlined and the streaming functions are not used, disabling - * the streaming functions can improve code size significantly, especially with - * the @ref XXH3_family which tends to make constant folded copies of itself. - */ -# define XXH_NO_STREAM -# undef XXH_NO_STREAM /* don't actually */ -#endif /* XXH_DOXYGEN */ -/*! - * @} - */ - -#ifndef XXH_FORCE_MEMORY_ACCESS /* can be defined externally, on command line for example */ - /* prefer __packed__ structures (method 1) for GCC - * < ARMv7 with unaligned access (e.g. Raspbian armhf) still uses byte shifting, so we use memcpy - * which for some reason does unaligned loads. 
*/ -# if defined(__GNUC__) && !(defined(__ARM_ARCH) && __ARM_ARCH < 7 && defined(__ARM_FEATURE_UNALIGNED)) -# define XXH_FORCE_MEMORY_ACCESS 1 -# endif -#endif - -#ifndef XXH_SIZE_OPT - /* default to 1 for -Os or -Oz */ -# if (defined(__GNUC__) || defined(__clang__)) && defined(__OPTIMIZE_SIZE__) -# define XXH_SIZE_OPT 1 -# else -# define XXH_SIZE_OPT 0 -# endif -#endif - -#ifndef XXH_FORCE_ALIGN_CHECK /* can be defined externally */ - /* don't check on sizeopt, x86, aarch64, or arm when unaligned access is available */ -# if XXH_SIZE_OPT >= 1 || \ - defined(__i386) || defined(__x86_64__) || defined(__aarch64__) || defined(__ARM_FEATURE_UNALIGNED) \ - || defined(_M_IX86) || defined(_M_X64) || defined(_M_ARM64) || defined(_M_ARM) /* visual */ -# define XXH_FORCE_ALIGN_CHECK 0 -# else -# define XXH_FORCE_ALIGN_CHECK 1 -# endif -#endif - -#ifndef XXH_NO_INLINE_HINTS -# if XXH_SIZE_OPT >= 1 || defined(__NO_INLINE__) /* -O0, -fno-inline */ -# define XXH_NO_INLINE_HINTS 1 -# else -# define XXH_NO_INLINE_HINTS 0 -# endif -#endif - -#ifndef XXH3_INLINE_SECRET -# if (defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 12) \ - || !defined(XXH_INLINE_ALL) -# define XXH3_INLINE_SECRET 0 -# else -# define XXH3_INLINE_SECRET 1 -# endif -#endif - -#ifndef XXH32_ENDJMP -/* generally preferable for performance */ -# define XXH32_ENDJMP 0 -#endif - -/*! - * @defgroup impl Implementation - * @{ - */ - - -/* ************************************* -* Includes & Memory related functions -***************************************/ -#if defined(XXH_NO_STREAM) -/* nothing */ -#elif defined(XXH_NO_STDLIB) - -/* When requesting to disable any mention of stdlib, - * the library loses the ability to invoke malloc / free. - * In practice, it means that functions like `XXH*_createState()` - * will always fail, and return NULL. - * This flag is useful in situations where - * xxhash.h is integrated into some kernel, embedded or limited environment - * without access to dynamic allocation.
- */ - -static XXH_CONSTF void* XXH_malloc(size_t s) { (void)s; return NULL; } -static void XXH_free(void* p) { (void)p; } - -#else - -/* - * Modify the local functions below should you wish to use - * different memory routines for malloc() and free() - */ -#include <stdlib.h> - -/*! - * @internal - * @brief Modify this function to use a different routine than malloc(). - */ -static XXH_MALLOCF void* XXH_malloc(size_t s) { return malloc(s); } - -/*! - * @internal - * @brief Modify this function to use a different routine than free(). - */ -static void XXH_free(void* p) { free(p); } - -#endif /* XXH_NO_STDLIB */ - -#include <string.h> - -/*! - * @internal - * @brief Modify this function to use a different routine than memcpy(). - */ -static void* XXH_memcpy(void* dest, const void* src, size_t size) -{ - return memcpy(dest,src,size); -} - -#include <limits.h> /* ULLONG_MAX */ - - -/* ************************************* -* Compiler Specific Options -***************************************/ -#ifdef _MSC_VER /* Visual Studio warning fix */ -# pragma warning(disable : 4127) /* disable: C4127: conditional expression is constant */ -#endif - -#if XXH_NO_INLINE_HINTS /* disable inlining hints */ -# if defined(__GNUC__) || defined(__clang__) -# define XXH_FORCE_INLINE static __attribute__((unused)) -# else -# define XXH_FORCE_INLINE static -# endif -# define XXH_NO_INLINE static -/* enable inlining hints */ -#elif defined(__GNUC__) || defined(__clang__) -# define XXH_FORCE_INLINE static __inline__ __attribute__((always_inline, unused)) -# define XXH_NO_INLINE static __attribute__((noinline)) -#elif defined(_MSC_VER) /* Visual Studio */ -# define XXH_FORCE_INLINE static __forceinline -# define XXH_NO_INLINE static __declspec(noinline) -#elif defined (__cplusplus) \ - || (defined (__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)) /* C99 */ -# define XXH_FORCE_INLINE static inline -# define XXH_NO_INLINE static -#else -# define XXH_FORCE_INLINE static -# define XXH_NO_INLINE static -#endif - -#if
XXH3_INLINE_SECRET -# define XXH3_WITH_SECRET_INLINE XXH_FORCE_INLINE -#else -# define XXH3_WITH_SECRET_INLINE XXH_NO_INLINE -#endif - - -/* ************************************* -* Debug -***************************************/ -/*! - * @ingroup tuning - * @def XXH_DEBUGLEVEL - * @brief Sets the debugging level. - * - * XXH_DEBUGLEVEL is expected to be defined externally, typically via the - * compiler's command line options. The value must be a number. - */ -#ifndef XXH_DEBUGLEVEL -# ifdef DEBUGLEVEL /* backwards compat */ -# define XXH_DEBUGLEVEL DEBUGLEVEL -# else -# define XXH_DEBUGLEVEL 0 -# endif -#endif - -#if (XXH_DEBUGLEVEL>=1) -# include <assert.h> /* note: can still be disabled with NDEBUG */ -# define XXH_ASSERT(c) assert(c) -#else -# if defined(__INTEL_COMPILER) -# define XXH_ASSERT(c) XXH_ASSUME((unsigned char) (c)) -# else -# define XXH_ASSERT(c) XXH_ASSUME(c) -# endif -#endif - -/* note: use after variable declarations */ -#ifndef XXH_STATIC_ASSERT -# if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) /* C11 */ -# define XXH_STATIC_ASSERT_WITH_MESSAGE(c,m) do { _Static_assert((c),m); } while(0) -# elif defined(__cplusplus) && (__cplusplus >= 201103L) /* C++11 */ -# define XXH_STATIC_ASSERT_WITH_MESSAGE(c,m) do { static_assert((c),m); } while(0) -# else -# define XXH_STATIC_ASSERT_WITH_MESSAGE(c,m) do { struct xxh_sa { char x[(c) ? 1 : -1]; }; } while(0) -# endif -# define XXH_STATIC_ASSERT(c) XXH_STATIC_ASSERT_WITH_MESSAGE((c),#c) -#endif - -/*! - * @internal - * @def XXH_COMPILER_GUARD(var) - * @brief Used to prevent unwanted optimizations for @p var. - * - * It uses an empty GCC inline assembly statement with a register constraint - * which forces @p var into a general purpose register (e.g. eax, ebx, ecx - * on x86) and marks it as modified. - * - * This is used in a few places to avoid unwanted autovectorization (e.g. - * XXH32_round()). All vectorization we want is explicit via intrinsics, - * and _usually_ isn't wanted elsewhere.
- * - * We also use it to prevent unwanted constant folding for AArch64 in - * XXH3_initCustomSecret_scalar(). - */ -#if defined(__GNUC__) || defined(__clang__) -# define XXH_COMPILER_GUARD(var) __asm__("" : "+r" (var)) -#else -# define XXH_COMPILER_GUARD(var) ((void)0) -#endif - -/* Specifically for NEON vectors which use the "w" constraint, on - * Clang. */ -#if defined(__clang__) && defined(__ARM_ARCH) && !defined(__wasm__) -# define XXH_COMPILER_GUARD_CLANG_NEON(var) __asm__("" : "+w" (var)) -#else -# define XXH_COMPILER_GUARD_CLANG_NEON(var) ((void)0) -#endif - -/* ************************************* -* Basic Types -***************************************/ -#if !defined (__VMS) \ - && (defined (__cplusplus) \ - || (defined (__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) /* C99 */) ) -# include <stdint.h> - typedef uint8_t xxh_u8; -#else - typedef unsigned char xxh_u8; -#endif -typedef XXH32_hash_t xxh_u32; - -#ifdef XXH_OLD_NAMES -# warning "XXH_OLD_NAMES is planned to be removed starting v0.9. If the program depends on it, consider moving away from it by employing newer type names directly" -# define BYTE xxh_u8 -# define U8 xxh_u8 -# define U32 xxh_u32 -#endif - -/* *** Memory access *** */ - -/*! - * @internal - * @fn xxh_u32 XXH_read32(const void* ptr) - * @brief Reads an unaligned 32-bit integer from @p ptr in native endianness. - * - * Affected by @ref XXH_FORCE_MEMORY_ACCESS. - * - * @param ptr The pointer to read from. - * @return The 32-bit native endian integer from the bytes at @p ptr. - */ - -/*! - * @internal - * @fn xxh_u32 XXH_readLE32(const void* ptr) - * @brief Reads an unaligned 32-bit little endian integer from @p ptr. - * - * Affected by @ref XXH_FORCE_MEMORY_ACCESS. - * - * @param ptr The pointer to read from. - * @return The 32-bit little endian integer from the bytes at @p ptr. - */ - -/*! - * @internal - * @fn xxh_u32 XXH_readBE32(const void* ptr) - * @brief Reads an unaligned 32-bit big endian integer from @p ptr.
- * - * Affected by @ref XXH_FORCE_MEMORY_ACCESS. - * - * @param ptr The pointer to read from. - * @return The 32-bit big endian integer from the bytes at @p ptr. - */ - -/*! - * @internal - * @fn xxh_u32 XXH_readLE32_align(const void* ptr, XXH_alignment align) - * @brief Like @ref XXH_readLE32(), but has an option for aligned reads. - * - * Affected by @ref XXH_FORCE_MEMORY_ACCESS. - * Note that when @ref XXH_FORCE_ALIGN_CHECK == 0, the @p align parameter is - * always @ref XXH_alignment::XXH_unaligned. - * - * @param ptr The pointer to read from. - * @param align Whether @p ptr is aligned. - * @pre - * If @p align == @ref XXH_alignment::XXH_aligned, @p ptr must be 4 byte - * aligned. - * @return The 32-bit little endian integer from the bytes at @p ptr. - */ - -#if (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==3)) -/* - * Manual byteshift. Best for old compilers which don't inline memcpy. - * We actually directly use XXH_readLE32 and XXH_readBE32. - */ -#elif (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==2)) - -/* - * Force direct memory access. Only works on CPU which support unaligned memory - * access in hardware. - */ -static xxh_u32 XXH_read32(const void* memPtr) { return *(const xxh_u32*) memPtr; } - -#elif (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==1)) - -/* - * __attribute__((aligned(1))) is supported by gcc and clang. Originally the - * documentation claimed that it only increased the alignment, but actually it - * can decrease it on gcc, clang, and icc: - * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69502, - * https://gcc.godbolt.org/z/xYez1j67Y. - */ -#ifdef XXH_OLD_NAMES -typedef union { xxh_u32 u32; } __attribute__((packed)) unalign; -#endif -static xxh_u32 XXH_read32(const void* ptr) -{ - typedef __attribute__((aligned(1))) xxh_u32 xxh_unalign32; - return *((const xxh_unalign32*)ptr); -} - -#else - -/* - * Portable and safe solution. Generally efficient. 
- * see: https://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html - */ -static xxh_u32 XXH_read32(const void* memPtr) -{ - xxh_u32 val; - XXH_memcpy(&val, memPtr, sizeof(val)); - return val; -} - -#endif /* XXH_FORCE_DIRECT_MEMORY_ACCESS */ - - -/* *** Endianness *** */ - -/*! - * @ingroup tuning - * @def XXH_CPU_LITTLE_ENDIAN - * @brief Whether the target is little endian. - * - * Defined to 1 if the target is little endian, or 0 if it is big endian. - * It can be defined externally, for example on the compiler command line. - * - * If it is not defined, - * a runtime check (which is usually constant folded) is used instead. - * - * @note - * This is not necessarily defined to an integer constant. - * - * @see XXH_isLittleEndian() for the runtime check. - */ -#ifndef XXH_CPU_LITTLE_ENDIAN -/* - * Try to detect endianness automatically, to avoid the nonstandard behavior - * in `XXH_isLittleEndian()` - */ -# if defined(_WIN32) /* Windows is always little endian */ \ - || defined(__LITTLE_ENDIAN__) \ - || (defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__) -# define XXH_CPU_LITTLE_ENDIAN 1 -# elif defined(__BIG_ENDIAN__) \ - || (defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__) -# define XXH_CPU_LITTLE_ENDIAN 0 -# else -/*! - * @internal - * @brief Runtime check for @ref XXH_CPU_LITTLE_ENDIAN. - * - * Most compilers will constant fold this. - */ -static int XXH_isLittleEndian(void) -{ - /* - * Portable and well-defined behavior. - * Don't use static: it is detrimental to performance. 
- */ - const union { xxh_u32 u; xxh_u8 c[4]; } one = { 1 }; - return one.c[0]; -} -# define XXH_CPU_LITTLE_ENDIAN XXH_isLittleEndian() -# endif -#endif - - - - -/* **************************************** -* Compiler-specific Functions and Macros -******************************************/ -#define XXH_GCC_VERSION (__GNUC__ * 100 + __GNUC_MINOR__) - -#ifdef __has_builtin -# define XXH_HAS_BUILTIN(x) __has_builtin(x) -#else -# define XXH_HAS_BUILTIN(x) 0 -#endif - - - -/* - * C23 and future versions have standard "unreachable()". - * Once it has been implemented reliably we can add it as an - * additional case: - * - * ``` - * #if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= XXH_C23_VN) - * # include <stddef.h> - * # ifdef unreachable - * # define XXH_UNREACHABLE() unreachable() - * # endif - * #endif - * ``` - * - * Note C++23 also has std::unreachable() which can be detected - * as follows: - * ``` - * #if defined(__cpp_lib_unreachable) && (__cpp_lib_unreachable >= 202202L) - * # include <utility> - * # define XXH_UNREACHABLE() std::unreachable() - * #endif - * ``` - * NB: `__cpp_lib_unreachable` is defined in the `<version>` header. - * We don't use that as including `<utility>` in `extern "C"` blocks - * doesn't work on GCC12 - */ - -#if XXH_HAS_BUILTIN(__builtin_unreachable) -# define XXH_UNREACHABLE() __builtin_unreachable() - -#elif defined(_MSC_VER) -# define XXH_UNREACHABLE() __assume(0) - -#else -# define XXH_UNREACHABLE() -#endif - -#if XXH_HAS_BUILTIN(__builtin_assume) -# define XXH_ASSUME(c) __builtin_assume(c) -#else -# define XXH_ASSUME(c) if (!(c)) { XXH_UNREACHABLE(); } -#endif - -/*! - * @internal - * @def XXH_rotl32(x,r) - * @brief 32-bit rotate left. - * - * @param x The 32-bit integer to be rotated. - * @param r The number of bits to rotate. - * @pre - * @p r > 0 && @p r < 32 - * @note - * @p x and @p r may be evaluated multiple times. - * @return The rotated result.
- */ -#if !defined(NO_CLANG_BUILTIN) && XXH_HAS_BUILTIN(__builtin_rotateleft32) \ - && XXH_HAS_BUILTIN(__builtin_rotateleft64) -# define XXH_rotl32 __builtin_rotateleft32 -# define XXH_rotl64 __builtin_rotateleft64 -/* Note: although _rotl exists for minGW (GCC under windows), performance seems poor */ -#elif defined(_MSC_VER) -# define XXH_rotl32(x,r) _rotl(x,r) -# define XXH_rotl64(x,r) _rotl64(x,r) -#else -# define XXH_rotl32(x,r) (((x) << (r)) | ((x) >> (32 - (r)))) -# define XXH_rotl64(x,r) (((x) << (r)) | ((x) >> (64 - (r)))) -#endif - -/*! - * @internal - * @fn xxh_u32 XXH_swap32(xxh_u32 x) - * @brief A 32-bit byteswap. - * - * @param x The 32-bit integer to byteswap. - * @return @p x, byteswapped. - */ -#if defined(_MSC_VER) /* Visual Studio */ -# define XXH_swap32 _byteswap_ulong -#elif XXH_GCC_VERSION >= 403 -# define XXH_swap32 __builtin_bswap32 -#else -static xxh_u32 XXH_swap32 (xxh_u32 x) -{ - return ((x << 24) & 0xff000000 ) | - ((x << 8) & 0x00ff0000 ) | - ((x >> 8) & 0x0000ff00 ) | - ((x >> 24) & 0x000000ff ); -} -#endif - - -/* *************************** -* Memory reads -*****************************/ - -/*! - * @internal - * @brief Enum to indicate whether a pointer is aligned. - */ -typedef enum { - XXH_aligned, /*!< Aligned */ - XXH_unaligned /*!< Possibly unaligned */ -} XXH_alignment; - -/* - * XXH_FORCE_MEMORY_ACCESS==3 is an endian-independent byteshift load. - * - * This is ideal for older compilers which don't inline memcpy. 
- */ -#if (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==3)) - -XXH_FORCE_INLINE xxh_u32 XXH_readLE32(const void* memPtr) -{ - const xxh_u8* bytePtr = (const xxh_u8 *)memPtr; - return bytePtr[0] - | ((xxh_u32)bytePtr[1] << 8) - | ((xxh_u32)bytePtr[2] << 16) - | ((xxh_u32)bytePtr[3] << 24); -} - -XXH_FORCE_INLINE xxh_u32 XXH_readBE32(const void* memPtr) -{ - const xxh_u8* bytePtr = (const xxh_u8 *)memPtr; - return bytePtr[3] - | ((xxh_u32)bytePtr[2] << 8) - | ((xxh_u32)bytePtr[1] << 16) - | ((xxh_u32)bytePtr[0] << 24); -} - -#else -XXH_FORCE_INLINE xxh_u32 XXH_readLE32(const void* ptr) -{ - return XXH_CPU_LITTLE_ENDIAN ? XXH_read32(ptr) : XXH_swap32(XXH_read32(ptr)); -} - -static xxh_u32 XXH_readBE32(const void* ptr) -{ - return XXH_CPU_LITTLE_ENDIAN ? XXH_swap32(XXH_read32(ptr)) : XXH_read32(ptr); -} -#endif - -XXH_FORCE_INLINE xxh_u32 -XXH_readLE32_align(const void* ptr, XXH_alignment align) -{ - if (align==XXH_unaligned) { - return XXH_readLE32(ptr); - } else { - return XXH_CPU_LITTLE_ENDIAN ? *(const xxh_u32*)ptr : XXH_swap32(*(const xxh_u32*)ptr); - } -} - - -/* ************************************* -* Misc -***************************************/ -/*! @ingroup public */ -XXH_PUBLIC_API unsigned XXH_versionNumber (void) { return XXH_VERSION_NUMBER; } - - -/* ******************************************************************* -* 32-bit hash functions -*********************************************************************/ -/*! - * @} - * @defgroup XXH32_impl XXH32 implementation - * @ingroup impl - * - * Details on the XXH32 implementation. 
- * @{ - */ - /* #define instead of static const, to be used as initializers */ -#define XXH_PRIME32_1 0x9E3779B1U /*!< 0b10011110001101110111100110110001 */ -#define XXH_PRIME32_2 0x85EBCA77U /*!< 0b10000101111010111100101001110111 */ -#define XXH_PRIME32_3 0xC2B2AE3DU /*!< 0b11000010101100101010111000111101 */ -#define XXH_PRIME32_4 0x27D4EB2FU /*!< 0b00100111110101001110101100101111 */ -#define XXH_PRIME32_5 0x165667B1U /*!< 0b00010110010101100110011110110001 */ - -#ifdef XXH_OLD_NAMES -# define PRIME32_1 XXH_PRIME32_1 -# define PRIME32_2 XXH_PRIME32_2 -# define PRIME32_3 XXH_PRIME32_3 -# define PRIME32_4 XXH_PRIME32_4 -# define PRIME32_5 XXH_PRIME32_5 -#endif - -/*! - * @internal - * @brief Normal stripe processing routine. - * - * This shuffles the bits so that any bit from @p input impacts several bits in - * @p acc. - * - * @param acc The accumulator lane. - * @param input The stripe of input to mix. - * @return The mixed accumulator lane. - */ -static xxh_u32 XXH32_round(xxh_u32 acc, xxh_u32 input) -{ - acc += input * XXH_PRIME32_2; - acc = XXH_rotl32(acc, 13); - acc *= XXH_PRIME32_1; -#if (defined(__SSE4_1__) || defined(__aarch64__) || defined(__wasm_simd128__)) && !defined(XXH_ENABLE_AUTOVECTORIZE) - /* - * UGLY HACK: - * A compiler fence is the only thing that prevents GCC and Clang from - * autovectorizing the XXH32 loop (pragmas and attributes don't work for some - * reason) without globally disabling SSE4.1. - * - * The reason we want to avoid vectorization is because despite working on - * 4 integers at a time, there are multiple factors slowing XXH32 down on - * SSE4: - * - There's a ridiculous amount of lag from pmulld (10 cycles of latency on - * newer chips!) making it slightly slower to multiply four integers at - * once compared to four integers independently. Even when pmulld was - * fastest, Sandy/Ivy Bridge, it is still not worth it to go into SSE - * just to multiply unless doing a long operation. 
- * - * - Four instructions are required to rotate, - * movqda tmp, v // not required with VEX encoding - * pslld tmp, 13 // tmp <<= 13 - * psrld v, 19 // x >>= 19 - * por v, tmp // x |= tmp - * compared to one for scalar: - * roll v, 13 // reliably fast across the board - * shldl v, v, 13 // Sandy Bridge and later prefer this for some reason - * - * - Instruction level parallelism is actually more beneficial here because - * the SIMD actually serializes this operation: While v1 is rotating, v2 - * can load data, while v3 can multiply. SSE forces them to operate - * together. - * - * This is also enabled on AArch64, as Clang is *very aggressive* in vectorizing - * the loop. NEON is only faster on the A53, and with the newer cores, it is less - * than half the speed. - * - * Additionally, this is used on WASM SIMD128 because it JITs to the same - * SIMD instructions and has the same issue. - */ - XXH_COMPILER_GUARD(acc); -#endif - return acc; -} - -/*! - * @internal - * @brief Mixes all bits to finalize the hash. - * - * The final mix ensures that all input bits have a chance to impact any bit in - * the output digest, resulting in an unbiased distribution. - * - * @param hash The hash to avalanche. - * @return The avalanched hash. - */ -static xxh_u32 XXH32_avalanche(xxh_u32 hash) -{ - hash ^= hash >> 15; - hash *= XXH_PRIME32_2; - hash ^= hash >> 13; - hash *= XXH_PRIME32_3; - hash ^= hash >> 16; - return hash; -} - -#define XXH_get32bits(p) XXH_readLE32_align(p, align) - -/*! - * @internal - * @brief Processes the last 0-15 bytes of @p ptr. - * - * There may be up to 15 bytes remaining to consume from the input. - * This final stage will digest them to ensure that all input bytes are present - * in the final mix. - * - * @param hash The hash to finalize. - * @param ptr The pointer to the remaining input. - * @param len The remaining length, modulo 16. - * @param align Whether @p ptr is aligned. - * @return The finalized hash. - * @see XXH64_finalize(). 
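As a quick check on the avalanche comment above: each step (a xorshift, then a multiply by an odd constant) is invertible on 32-bit words, so the finalizer is a bijection and never collides two distinct pre-hashes. A standalone reproduction, with the primes copied from the definitions above:

```c
#include <assert.h>
#include <stdint.h>

/* XXH32's final avalanche, reproduced standalone. Zero is a fixed point
 * of every step, which is why avalanche32(0) == 0; the real hash folds
 * seed, primes, and length in before reaching this stage. */
static uint32_t avalanche32(uint32_t h)
{
    h ^= h >> 15;
    h *= 0x85EBCA77u;   /* XXH_PRIME32_2 */
    h ^= h >> 13;
    h *= 0xC2B2AE3Du;   /* XXH_PRIME32_3 */
    h ^= h >> 16;
    return h;
}
```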
- */ -static XXH_PUREF xxh_u32 -XXH32_finalize(xxh_u32 hash, const xxh_u8* ptr, size_t len, XXH_alignment align) -{ -#define XXH_PROCESS1 do { \ - hash += (*ptr++) * XXH_PRIME32_5; \ - hash = XXH_rotl32(hash, 11) * XXH_PRIME32_1; \ -} while (0) - -#define XXH_PROCESS4 do { \ - hash += XXH_get32bits(ptr) * XXH_PRIME32_3; \ - ptr += 4; \ - hash = XXH_rotl32(hash, 17) * XXH_PRIME32_4; \ -} while (0) - - if (ptr==NULL) XXH_ASSERT(len == 0); - - /* Compact rerolled version; generally faster */ - if (!XXH32_ENDJMP) { - len &= 15; - while (len >= 4) { - XXH_PROCESS4; - len -= 4; - } - while (len > 0) { - XXH_PROCESS1; - --len; - } - return XXH32_avalanche(hash); - } else { - switch(len&15) /* or switch(bEnd - p) */ { - case 12: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 8: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 4: XXH_PROCESS4; - return XXH32_avalanche(hash); - - case 13: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 9: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 5: XXH_PROCESS4; - XXH_PROCESS1; - return XXH32_avalanche(hash); - - case 14: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 10: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 6: XXH_PROCESS4; - XXH_PROCESS1; - XXH_PROCESS1; - return XXH32_avalanche(hash); - - case 15: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 11: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 7: XXH_PROCESS4; - XXH_FALLTHROUGH; /* fallthrough */ - case 3: XXH_PROCESS1; - XXH_FALLTHROUGH; /* fallthrough */ - case 2: XXH_PROCESS1; - XXH_FALLTHROUGH; /* fallthrough */ - case 1: XXH_PROCESS1; - XXH_FALLTHROUGH; /* fallthrough */ - case 0: return XXH32_avalanche(hash); - } - XXH_ASSERT(0); - return hash; /* reaching this point is deemed impossible */ - } -} - -#ifdef XXH_OLD_NAMES -# define PROCESS1 XXH_PROCESS1 -# define PROCESS4 XXH_PROCESS4 -#else -# undef XXH_PROCESS1 -# undef XXH_PROCESS4 -#endif - -/*! 
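XXH32_ENDJMP chooses between the two tail strategies in XXH32_finalize; whichever is faster on a given target, both must consume the residual bytes in the same order. A toy model makes that equivalence testable (`mix1`/`mix4` are placeholder steps, not the real XXH_PROCESS1/XXH_PROCESS4 macros):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy stand-ins for the 4-byte and 1-byte processing steps. */
static uint32_t mix4(uint32_t h, const uint8_t* p)
{
    uint32_t w = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (h ^ w) * 2654435761u;
}
static uint32_t mix1(uint32_t h, uint8_t b) { return (h ^ b) * 2246822519u; }

/* Compact re-rolled tail, as in the !XXH32_ENDJMP branch. */
static uint32_t tail_loop(uint32_t h, const uint8_t* p, size_t len)
{
    len &= 15;
    while (len >= 4) { h = mix4(h, p); p += 4; len -= 4; }
    while (len > 0)  { h = mix1(h, *p++); --len; }
    return h;
}

/* Unrolled fallthrough tail, in the spirit of the ENDJMP switch. */
static uint32_t tail_switch(uint32_t h, const uint8_t* p, size_t len)
{
    switch ((len & 15) / 4) {        /* 4-byte steps via fallthrough */
    case 3: h = mix4(h, p); p += 4;  /* fall through */
    case 2: h = mix4(h, p); p += 4;  /* fall through */
    case 1: h = mix4(h, p); p += 4;  /* fall through */
    default: break;
    }
    switch ((len & 15) % 4) {        /* then single bytes */
    case 3: h = mix1(h, *p); p++;    /* fall through */
    case 2: h = mix1(h, *p); p++;    /* fall through */
    case 1: h = mix1(h, *p); p++;    /* fall through */
    default: break;
    }
    return h;
}

static const uint8_t tail_data[16] =
    { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
```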
- * @internal - * @brief The implementation for @ref XXH32(). - * - * @param input , len , seed Directly passed from @ref XXH32(). - * @param align Whether @p input is aligned. - * @return The calculated hash. - */ -XXH_FORCE_INLINE XXH_PUREF xxh_u32 -XXH32_endian_align(const xxh_u8* input, size_t len, xxh_u32 seed, XXH_alignment align) -{ - xxh_u32 h32; - - if (input==NULL) XXH_ASSERT(len == 0); - - if (len>=16) { - const xxh_u8* const bEnd = input + len; - const xxh_u8* const limit = bEnd - 15; - xxh_u32 v1 = seed + XXH_PRIME32_1 + XXH_PRIME32_2; - xxh_u32 v2 = seed + XXH_PRIME32_2; - xxh_u32 v3 = seed + 0; - xxh_u32 v4 = seed - XXH_PRIME32_1; - - do { - v1 = XXH32_round(v1, XXH_get32bits(input)); input += 4; - v2 = XXH32_round(v2, XXH_get32bits(input)); input += 4; - v3 = XXH32_round(v3, XXH_get32bits(input)); input += 4; - v4 = XXH32_round(v4, XXH_get32bits(input)); input += 4; - } while (input < limit); - - h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) - + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18); - } else { - h32 = seed + XXH_PRIME32_5; - } - - h32 += (xxh_u32)len; - - return XXH32_finalize(h32, input, len&15, align); -} - -/*! @ingroup XXH32_family */ -XXH_PUBLIC_API XXH32_hash_t XXH32 (const void* input, size_t len, XXH32_hash_t seed) -{ -#if !defined(XXH_NO_STREAM) && XXH_SIZE_OPT >= 2 - /* Simple version, good for code maintenance, but unfortunately slow for small inputs */ - XXH32_state_t state; - XXH32_reset(&state, seed); - XXH32_update(&state, (const xxh_u8*)input, len); - return XXH32_digest(&state); -#else - if (XXH_FORCE_ALIGN_CHECK) { - if ((((size_t)input) & 3) == 0) { /* Input is 4-bytes aligned, leverage the speed benefit */ - return XXH32_endian_align((const xxh_u8*)input, len, seed, XXH_aligned); - } } - - return XXH32_endian_align((const xxh_u8*)input, len, seed, XXH_unaligned); -#endif -} - - - -/******* Hash streaming *******/ -#ifndef XXH_NO_STREAM -/*! 
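Pieced together, XXH32_round, the finalize steps, the avalanche, and XXH32_endian_align above form the entire one-shot path. A from-scratch transcription (simplified: no alignment fast path, no asserts) that reproduces xxHash's published digest 0x02CC5D05 for empty input with seed 0:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define P1 0x9E3779B1u  /* XXH_PRIME32_1 */
#define P2 0x85EBCA77u  /* XXH_PRIME32_2 */
#define P3 0xC2B2AE3Du  /* XXH_PRIME32_3 */
#define P4 0x27D4EB2Fu  /* XXH_PRIME32_4 */
#define P5 0x165667B1u  /* XXH_PRIME32_5 */

static uint32_t rot(uint32_t x, unsigned r) { return (x << r) | (x >> (32 - r)); }

static uint32_t le32(const uint8_t* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint32_t round32(uint32_t acc, uint32_t input)
{
    return rot(acc + input * P2, 13) * P1;
}

static uint32_t xxh32(const uint8_t* p, size_t len, uint32_t seed)
{
    const uint8_t* const end = p + len;
    uint32_t h;

    if (len >= 16) {                       /* 16-byte stripes, 4 lanes */
        uint32_t v1 = seed + P1 + P2, v2 = seed + P2;
        uint32_t v3 = seed, v4 = seed - P1;
        do {
            v1 = round32(v1, le32(p)); p += 4;
            v2 = round32(v2, le32(p)); p += 4;
            v3 = round32(v3, le32(p)); p += 4;
            v4 = round32(v4, le32(p)); p += 4;
        } while (p <= end - 16);
        h = rot(v1, 1) + rot(v2, 7) + rot(v3, 12) + rot(v4, 18);
    } else {
        h = seed + P5;
    }
    h += (uint32_t)len;

    while (end - p >= 4) {                 /* finalize: 4-byte steps */
        h = rot(h + le32(p) * P3, 17) * P4;
        p += 4;
    }
    while (p < end)                        /* then single bytes */
        h = rot(h + (*p++) * P5, 11) * P1;

    h ^= h >> 15; h *= P2;                 /* avalanche */
    h ^= h >> 13; h *= P3;
    h ^= h >> 16;
    return h;
}

static const uint8_t xxh_sample[20] =
    { 'x','x','h','a','s','h',' ','s','a','m','p','l','e',' ','d','a','t','a','!','!' };
```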
@ingroup XXH32_family */ -XXH_PUBLIC_API XXH32_state_t* XXH32_createState(void) -{ - return (XXH32_state_t*)XXH_malloc(sizeof(XXH32_state_t)); -} -/*! @ingroup XXH32_family */ -XXH_PUBLIC_API XXH_errorcode XXH32_freeState(XXH32_state_t* statePtr) -{ - XXH_free(statePtr); - return XXH_OK; -} - -/*! @ingroup XXH32_family */ -XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dstState, const XXH32_state_t* srcState) -{ - XXH_memcpy(dstState, srcState, sizeof(*dstState)); -} - -/*! @ingroup XXH32_family */ -XXH_PUBLIC_API XXH_errorcode XXH32_reset(XXH32_state_t* statePtr, XXH32_hash_t seed) -{ - XXH_ASSERT(statePtr != NULL); - memset(statePtr, 0, sizeof(*statePtr)); - statePtr->v[0] = seed + XXH_PRIME32_1 + XXH_PRIME32_2; - statePtr->v[1] = seed + XXH_PRIME32_2; - statePtr->v[2] = seed + 0; - statePtr->v[3] = seed - XXH_PRIME32_1; - return XXH_OK; -} - - -/*! @ingroup XXH32_family */ -XXH_PUBLIC_API XXH_errorcode -XXH32_update(XXH32_state_t* state, const void* input, size_t len) -{ - if (input==NULL) { - XXH_ASSERT(len == 0); - return XXH_OK; - } - - { const xxh_u8* p = (const xxh_u8*)input; - const xxh_u8* const bEnd = p + len; - - state->total_len_32 += (XXH32_hash_t)len; - state->large_len |= (XXH32_hash_t)((len>=16) | (state->total_len_32>=16)); - - if (state->memsize + len < 16) { /* fill in tmp buffer */ - XXH_memcpy((xxh_u8*)(state->mem32) + state->memsize, input, len); - state->memsize += (XXH32_hash_t)len; - return XXH_OK; - } - - if (state->memsize) { /* some data left from previous update */ - XXH_memcpy((xxh_u8*)(state->mem32) + state->memsize, input, 16-state->memsize); - { const xxh_u32* p32 = state->mem32; - state->v[0] = XXH32_round(state->v[0], XXH_readLE32(p32)); p32++; - state->v[1] = XXH32_round(state->v[1], XXH_readLE32(p32)); p32++; - state->v[2] = XXH32_round(state->v[2], XXH_readLE32(p32)); p32++; - state->v[3] = XXH32_round(state->v[3], XXH_readLE32(p32)); - } - p += 16-state->memsize; - state->memsize = 0; - } - - if (p <= bEnd-16) { - const 
xxh_u8* const limit = bEnd - 16; - - do { - state->v[0] = XXH32_round(state->v[0], XXH_readLE32(p)); p+=4; - state->v[1] = XXH32_round(state->v[1], XXH_readLE32(p)); p+=4; - state->v[2] = XXH32_round(state->v[2], XXH_readLE32(p)); p+=4; - state->v[3] = XXH32_round(state->v[3], XXH_readLE32(p)); p+=4; - } while (p<=limit); - - } - - if (p < bEnd) { - XXH_memcpy(state->mem32, p, (size_t)(bEnd-p)); - state->memsize = (unsigned)(bEnd-p); - } - } - - return XXH_OK; -} - - -/*! @ingroup XXH32_family */ -XXH_PUBLIC_API XXH32_hash_t XXH32_digest(const XXH32_state_t* state) -{ - xxh_u32 h32; - - if (state->large_len) { - h32 = XXH_rotl32(state->v[0], 1) - + XXH_rotl32(state->v[1], 7) - + XXH_rotl32(state->v[2], 12) - + XXH_rotl32(state->v[3], 18); - } else { - h32 = state->v[2] /* == seed */ + XXH_PRIME32_5; - } - - h32 += state->total_len_32; - - return XXH32_finalize(h32, (const xxh_u8*)state->mem32, state->memsize, XXH_aligned); -} -#endif /* !XXH_NO_STREAM */ - -/******* Canonical representation *******/ - -/*! - * @ingroup XXH32_family - * The default return values from XXH functions are unsigned 32 and 64 bit - * integers. - * - * The canonical representation uses big endian convention, the same convention - * as human-readable numbers (large digits first). - * - * This way, hash values can be written into a file or buffer, remaining - * comparable across different systems. - * - * The following functions allow transformation of hash values to and from their - * canonical format. - */ -XXH_PUBLIC_API void XXH32_canonicalFromHash(XXH32_canonical_t* dst, XXH32_hash_t hash) -{ - XXH_STATIC_ASSERT(sizeof(XXH32_canonical_t) == sizeof(XXH32_hash_t)); - if (XXH_CPU_LITTLE_ENDIAN) hash = XXH_swap32(hash); - XXH_memcpy(dst, &hash, sizeof(*dst)); -} -/*! 
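The canonical form described above is just the hash serialized most-significant byte first. A minimal sketch of the write/read roundtrip using plain stdlib code rather than the XXH32_canonical_t API:

```c
#include <assert.h>
#include <stdint.h>

/* Canonical (big-endian) serialization, mirroring XXH32_canonicalFromHash
 * and XXH32_hashFromCanonical: the byte order in the output buffer is
 * fixed, so hashes written on any host compare equal. */
static void to_canonical(uint8_t dst[4], uint32_t hash)
{
    dst[0] = (uint8_t)(hash >> 24);
    dst[1] = (uint8_t)(hash >> 16);
    dst[2] = (uint8_t)(hash >> 8);
    dst[3] = (uint8_t)(hash);
}

static uint32_t from_canonical(const uint8_t src[4])
{
    return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16)
         | ((uint32_t)src[2] << 8)  |  (uint32_t)src[3];
}
```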
@ingroup XXH32_family */ -XXH_PUBLIC_API XXH32_hash_t XXH32_hashFromCanonical(const XXH32_canonical_t* src) -{ - return XXH_readBE32(src); -} - - -#ifndef XXH_NO_LONG_LONG - -/* ******************************************************************* -* 64-bit hash functions -*********************************************************************/ -/*! - * @} - * @ingroup impl - * @{ - */ -/******* Memory access *******/ - -typedef XXH64_hash_t xxh_u64; - -#ifdef XXH_OLD_NAMES -# define U64 xxh_u64 -#endif - -#if (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==3)) -/* - * Manual byteshift. Best for old compilers which don't inline memcpy. - * We actually directly use XXH_readLE64 and XXH_readBE64. - */ -#elif (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==2)) - -/* Force direct memory access. Only works on CPU which support unaligned memory access in hardware */ -static xxh_u64 XXH_read64(const void* memPtr) -{ - return *(const xxh_u64*) memPtr; -} - -#elif (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==1)) - -/* - * __attribute__((aligned(1))) is supported by gcc and clang. Originally the - * documentation claimed that it only increased the alignment, but actually it - * can decrease it on gcc, clang, and icc: - * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69502, - * https://gcc.godbolt.org/z/xYez1j67Y. - */ -#ifdef XXH_OLD_NAMES -typedef union { xxh_u32 u32; xxh_u64 u64; } __attribute__((packed)) unalign64; -#endif -static xxh_u64 XXH_read64(const void* ptr) -{ - typedef __attribute__((aligned(1))) xxh_u64 xxh_unalign64; - return *((const xxh_unalign64*)ptr); -} - -#else - -/* - * Portable and safe solution. Generally efficient. 
- * see: https://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html - */ -static xxh_u64 XXH_read64(const void* memPtr) -{ - xxh_u64 val; - XXH_memcpy(&val, memPtr, sizeof(val)); - return val; -} - -#endif /* XXH_FORCE_DIRECT_MEMORY_ACCESS */ - -#if defined(_MSC_VER) /* Visual Studio */ -# define XXH_swap64 _byteswap_uint64 -#elif XXH_GCC_VERSION >= 403 -# define XXH_swap64 __builtin_bswap64 -#else -static xxh_u64 XXH_swap64(xxh_u64 x) -{ - return ((x << 56) & 0xff00000000000000ULL) | - ((x << 40) & 0x00ff000000000000ULL) | - ((x << 24) & 0x0000ff0000000000ULL) | - ((x << 8) & 0x000000ff00000000ULL) | - ((x >> 8) & 0x00000000ff000000ULL) | - ((x >> 24) & 0x0000000000ff0000ULL) | - ((x >> 40) & 0x000000000000ff00ULL) | - ((x >> 56) & 0x00000000000000ffULL); -} -#endif - - -/* XXH_FORCE_MEMORY_ACCESS==3 is an endian-independent byteshift load. */ -#if (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==3)) - -XXH_FORCE_INLINE xxh_u64 XXH_readLE64(const void* memPtr) -{ - const xxh_u8* bytePtr = (const xxh_u8 *)memPtr; - return bytePtr[0] - | ((xxh_u64)bytePtr[1] << 8) - | ((xxh_u64)bytePtr[2] << 16) - | ((xxh_u64)bytePtr[3] << 24) - | ((xxh_u64)bytePtr[4] << 32) - | ((xxh_u64)bytePtr[5] << 40) - | ((xxh_u64)bytePtr[6] << 48) - | ((xxh_u64)bytePtr[7] << 56); -} - -XXH_FORCE_INLINE xxh_u64 XXH_readBE64(const void* memPtr) -{ - const xxh_u8* bytePtr = (const xxh_u8 *)memPtr; - return bytePtr[7] - | ((xxh_u64)bytePtr[6] << 8) - | ((xxh_u64)bytePtr[5] << 16) - | ((xxh_u64)bytePtr[4] << 24) - | ((xxh_u64)bytePtr[3] << 32) - | ((xxh_u64)bytePtr[2] << 40) - | ((xxh_u64)bytePtr[1] << 48) - | ((xxh_u64)bytePtr[0] << 56); -} - -#else -XXH_FORCE_INLINE xxh_u64 XXH_readLE64(const void* ptr) -{ - return XXH_CPU_LITTLE_ENDIAN ? XXH_read64(ptr) : XXH_swap64(XXH_read64(ptr)); -} - -static xxh_u64 XXH_readBE64(const void* ptr) -{ - return XXH_CPU_LITTLE_ENDIAN ? 
XXH_swap64(XXH_read64(ptr)) : XXH_read64(ptr); -} -#endif - -XXH_FORCE_INLINE xxh_u64 -XXH_readLE64_align(const void* ptr, XXH_alignment align) -{ - if (align==XXH_unaligned) - return XXH_readLE64(ptr); - else - return XXH_CPU_LITTLE_ENDIAN ? *(const xxh_u64*)ptr : XXH_swap64(*(const xxh_u64*)ptr); -} - - -/******* xxh64 *******/ -/*! - * @} - * @defgroup XXH64_impl XXH64 implementation - * @ingroup impl - * - * Details on the XXH64 implementation. - * @{ - */ -/* #define rather that static const, to be used as initializers */ -#define XXH_PRIME64_1 0x9E3779B185EBCA87ULL /*!< 0b1001111000110111011110011011000110000101111010111100101010000111 */ -#define XXH_PRIME64_2 0xC2B2AE3D27D4EB4FULL /*!< 0b1100001010110010101011100011110100100111110101001110101101001111 */ -#define XXH_PRIME64_3 0x165667B19E3779F9ULL /*!< 0b0001011001010110011001111011000110011110001101110111100111111001 */ -#define XXH_PRIME64_4 0x85EBCA77C2B2AE63ULL /*!< 0b1000010111101011110010100111011111000010101100101010111001100011 */ -#define XXH_PRIME64_5 0x27D4EB2F165667C5ULL /*!< 0b0010011111010100111010110010111100010110010101100110011111000101 */ - -#ifdef XXH_OLD_NAMES -# define PRIME64_1 XXH_PRIME64_1 -# define PRIME64_2 XXH_PRIME64_2 -# define PRIME64_3 XXH_PRIME64_3 -# define PRIME64_4 XXH_PRIME64_4 -# define PRIME64_5 XXH_PRIME64_5 -#endif - -/*! @copydoc XXH32_round */ -static xxh_u64 XXH64_round(xxh_u64 acc, xxh_u64 input) -{ - acc += input * XXH_PRIME64_2; - acc = XXH_rotl64(acc, 31); - acc *= XXH_PRIME64_1; - return acc; -} - -static xxh_u64 XXH64_mergeRound(xxh_u64 acc, xxh_u64 val) -{ - val = XXH64_round(0, val); - acc ^= val; - acc = acc * XXH_PRIME64_1 + XXH_PRIME64_4; - return acc; -} - -/*! @copydoc XXH32_avalanche */ -static xxh_u64 XXH64_avalanche(xxh_u64 hash) -{ - hash ^= hash >> 33; - hash *= XXH_PRIME64_2; - hash ^= hash >> 29; - hash *= XXH_PRIME64_3; - hash ^= hash >> 32; - return hash; -} - - -#define XXH_get64bits(p) XXH_readLE64_align(p, align) - -/*! 
- * @internal - * @brief Processes the last 0-31 bytes of @p ptr. - * - * There may be up to 31 bytes remaining to consume from the input. - * This final stage will digest them to ensure that all input bytes are present - * in the final mix. - * - * @param hash The hash to finalize. - * @param ptr The pointer to the remaining input. - * @param len The remaining length, modulo 32. - * @param align Whether @p ptr is aligned. - * @return The finalized hash - * @see XXH32_finalize(). - */ -static XXH_PUREF xxh_u64 -XXH64_finalize(xxh_u64 hash, const xxh_u8* ptr, size_t len, XXH_alignment align) -{ - if (ptr==NULL) XXH_ASSERT(len == 0); - len &= 31; - while (len >= 8) { - xxh_u64 const k1 = XXH64_round(0, XXH_get64bits(ptr)); - ptr += 8; - hash ^= k1; - hash = XXH_rotl64(hash,27) * XXH_PRIME64_1 + XXH_PRIME64_4; - len -= 8; - } - if (len >= 4) { - hash ^= (xxh_u64)(XXH_get32bits(ptr)) * XXH_PRIME64_1; - ptr += 4; - hash = XXH_rotl64(hash, 23) * XXH_PRIME64_2 + XXH_PRIME64_3; - len -= 4; - } - while (len > 0) { - hash ^= (*ptr++) * XXH_PRIME64_5; - hash = XXH_rotl64(hash, 11) * XXH_PRIME64_1; - --len; - } - return XXH64_avalanche(hash); -} - -#ifdef XXH_OLD_NAMES -# define PROCESS1_64 XXH_PROCESS1_64 -# define PROCESS4_64 XXH_PROCESS4_64 -# define PROCESS8_64 XXH_PROCESS8_64 -#else -# undef XXH_PROCESS1_64 -# undef XXH_PROCESS4_64 -# undef XXH_PROCESS8_64 -#endif - -/*! - * @internal - * @brief The implementation for @ref XXH64(). - * - * @param input , len , seed Directly passed from @ref XXH64(). - * @param align Whether @p input is aligned. - * @return The calculated hash. 
- */
-XXH_FORCE_INLINE XXH_PUREF xxh_u64
-XXH64_endian_align(const xxh_u8* input, size_t len, xxh_u64 seed, XXH_alignment align)
-{
-    xxh_u64 h64;
-    if (input==NULL) XXH_ASSERT(len == 0);
-
-    if (len>=32) {
-        const xxh_u8* const bEnd = input + len;
-        const xxh_u8* const limit = bEnd - 31;
-        xxh_u64 v1 = seed + XXH_PRIME64_1 + XXH_PRIME64_2;
-        xxh_u64 v2 = seed + XXH_PRIME64_2;
-        xxh_u64 v3 = seed + 0;
-        xxh_u64 v4 = seed - XXH_PRIME64_1;
-
-        do {
-            v1 = XXH64_round(v1, XXH_get64bits(input)); input+=8;
-            v2 = XXH64_round(v2, XXH_get64bits(input)); input+=8;
-            v3 = XXH64_round(v3, XXH_get64bits(input)); input+=8;
-            v4 = XXH64_round(v4, XXH_get64bits(input)); input+=8;
-        } while (input<limit);
-
-        h64 = XXH_rotl64(v1, 1) + XXH_rotl64(v2, 7)
-            + XXH_rotl64(v3, 12) + XXH_rotl64(v4, 18);
-        h64 = XXH64_mergeRound(h64, v1);
-        h64 = XXH64_mergeRound(h64, v2);
-        h64 = XXH64_mergeRound(h64, v3);
-        h64 = XXH64_mergeRound(h64, v4);
-
-    } else {
-        h64  = seed + XXH_PRIME64_5;
-    }
-
-    h64 += (xxh_u64) len;
-
-    return XXH64_finalize(h64, input, len, align);
-}
-
-
-/*! @ingroup XXH64_family */
-XXH_PUBLIC_API XXH64_hash_t XXH64 (XXH_NOESCAPE const void* input, size_t len, XXH64_hash_t seed)
-{
-#if !defined(XXH_NO_STREAM) && XXH_SIZE_OPT >= 2
-    /* Simple version, good for code maintenance, but unfortunately slow for small inputs */
-    XXH64_state_t state;
-    XXH64_reset(&state, seed);
-    XXH64_update(&state, (const xxh_u8*)input, len);
-    return XXH64_digest(&state);
-#else
-    if (XXH_FORCE_ALIGN_CHECK) {
-        if ((((size_t)input) & 7)==0) {  /* Input is aligned, let's leverage the speed advantage */
-            return XXH64_endian_align((const xxh_u8*)input, len, seed, XXH_aligned);
-    }   }
-
-    return XXH64_endian_align((const xxh_u8*)input, len, seed, XXH_unaligned);
-
-#endif
-}
-
-/******* Hash Streaming *******/
-#ifndef XXH_NO_STREAM
-/*! @ingroup XXH64_family*/
-XXH_PUBLIC_API XXH64_state_t* XXH64_createState(void)
-{
-    return (XXH64_state_t*)XXH_malloc(sizeof(XXH64_state_t));
-}
-/*! @ingroup XXH64_family */
-XXH_PUBLIC_API XXH_errorcode XXH64_freeState(XXH64_state_t* statePtr)
-{
-    XXH_free(statePtr);
-    return XXH_OK;
-}
-
-/*! @ingroup XXH64_family */
-XXH_PUBLIC_API void XXH64_copyState(XXH_NOESCAPE XXH64_state_t* dstState, const XXH64_state_t* srcState)
-{
-    XXH_memcpy(dstState, srcState, sizeof(*dstState));
-}
-
-/*!
@ingroup XXH64_family */ -XXH_PUBLIC_API XXH_errorcode XXH64_reset(XXH_NOESCAPE XXH64_state_t* statePtr, XXH64_hash_t seed) -{ - XXH_ASSERT(statePtr != NULL); - memset(statePtr, 0, sizeof(*statePtr)); - statePtr->v[0] = seed + XXH_PRIME64_1 + XXH_PRIME64_2; - statePtr->v[1] = seed + XXH_PRIME64_2; - statePtr->v[2] = seed + 0; - statePtr->v[3] = seed - XXH_PRIME64_1; - return XXH_OK; -} - -/*! @ingroup XXH64_family */ -XXH_PUBLIC_API XXH_errorcode -XXH64_update (XXH_NOESCAPE XXH64_state_t* state, XXH_NOESCAPE const void* input, size_t len) -{ - if (input==NULL) { - XXH_ASSERT(len == 0); - return XXH_OK; - } - - { const xxh_u8* p = (const xxh_u8*)input; - const xxh_u8* const bEnd = p + len; - - state->total_len += len; - - if (state->memsize + len < 32) { /* fill in tmp buffer */ - XXH_memcpy(((xxh_u8*)state->mem64) + state->memsize, input, len); - state->memsize += (xxh_u32)len; - return XXH_OK; - } - - if (state->memsize) { /* tmp buffer is full */ - XXH_memcpy(((xxh_u8*)state->mem64) + state->memsize, input, 32-state->memsize); - state->v[0] = XXH64_round(state->v[0], XXH_readLE64(state->mem64+0)); - state->v[1] = XXH64_round(state->v[1], XXH_readLE64(state->mem64+1)); - state->v[2] = XXH64_round(state->v[2], XXH_readLE64(state->mem64+2)); - state->v[3] = XXH64_round(state->v[3], XXH_readLE64(state->mem64+3)); - p += 32 - state->memsize; - state->memsize = 0; - } - - if (p+32 <= bEnd) { - const xxh_u8* const limit = bEnd - 32; - - do { - state->v[0] = XXH64_round(state->v[0], XXH_readLE64(p)); p+=8; - state->v[1] = XXH64_round(state->v[1], XXH_readLE64(p)); p+=8; - state->v[2] = XXH64_round(state->v[2], XXH_readLE64(p)); p+=8; - state->v[3] = XXH64_round(state->v[3], XXH_readLE64(p)); p+=8; - } while (p<=limit); - - } - - if (p < bEnd) { - XXH_memcpy(state->mem64, p, (size_t)(bEnd-p)); - state->memsize = (unsigned)(bEnd-p); - } - } - - return XXH_OK; -} - - -/*! 
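The buffering scheme in XXH64_update generalizes to any block-oriented hash: stage bytes until a full block is available, mix it into the accumulator, and keep the remainder for digest time. A toy sketch (the mix step is a placeholder, not XXH64's rounds) showing that chunked feeding matches one-shot feeding:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Streaming state: accumulator + a 32-byte staging buffer, echoing the
 * mem64/memsize fields of XXH64_state_t. */
typedef struct { uint32_t acc; uint8_t buf[32]; size_t fill; } stream_t;

static void stream_init(stream_t* s) { s->acc = 0x165667B1u; s->fill = 0; }

static void stream_mix(stream_t* s)            /* placeholder block mix */
{
    for (size_t i = 0; i < 32; i++)
        s->acc = (s->acc ^ s->buf[i]) * 2654435761u + 2246822519u;
    s->fill = 0;
}

static void stream_update(stream_t* s, const uint8_t* p, size_t len)
{
    while (len > 0) {
        size_t take = 32 - s->fill;
        if (take > len) take = len;
        memcpy(s->buf + s->fill, p, take);     /* stage into the buffer */
        s->fill += take; p += take; len -= take;
        if (s->fill == 32) stream_mix(s);      /* a full block is ready */
    }
}

static uint32_t stream_digest(const stream_t* s)
{
    uint32_t h = s->acc;
    for (size_t i = 0; i < s->fill; i++)       /* leftover tail bytes */
        h = (h ^ s->buf[i]) * 2246822519u;
    return h;
}

static uint8_t stream_msg[100];
static void stream_msg_init(void)
{
    for (int i = 0; i < 100; i++) stream_msg[i] = (uint8_t)(i * 7);
}
```

Because blocks are always consumed in the same order regardless of how the input is split, any chunking of the same byte stream yields the same digest, which is the property the real update/digest pair relies on.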
@ingroup XXH64_family */ -XXH_PUBLIC_API XXH64_hash_t XXH64_digest(XXH_NOESCAPE const XXH64_state_t* state) -{ - xxh_u64 h64; - - if (state->total_len >= 32) { - h64 = XXH_rotl64(state->v[0], 1) + XXH_rotl64(state->v[1], 7) + XXH_rotl64(state->v[2], 12) + XXH_rotl64(state->v[3], 18); - h64 = XXH64_mergeRound(h64, state->v[0]); - h64 = XXH64_mergeRound(h64, state->v[1]); - h64 = XXH64_mergeRound(h64, state->v[2]); - h64 = XXH64_mergeRound(h64, state->v[3]); - } else { - h64 = state->v[2] /*seed*/ + XXH_PRIME64_5; - } - - h64 += (xxh_u64) state->total_len; - - return XXH64_finalize(h64, (const xxh_u8*)state->mem64, (size_t)state->total_len, XXH_aligned); -} -#endif /* !XXH_NO_STREAM */ - -/******* Canonical representation *******/ - -/*! @ingroup XXH64_family */ -XXH_PUBLIC_API void XXH64_canonicalFromHash(XXH_NOESCAPE XXH64_canonical_t* dst, XXH64_hash_t hash) -{ - XXH_STATIC_ASSERT(sizeof(XXH64_canonical_t) == sizeof(XXH64_hash_t)); - if (XXH_CPU_LITTLE_ENDIAN) hash = XXH_swap64(hash); - XXH_memcpy(dst, &hash, sizeof(*dst)); -} - -/*! @ingroup XXH64_family */ -XXH_PUBLIC_API XXH64_hash_t XXH64_hashFromCanonical(XXH_NOESCAPE const XXH64_canonical_t* src) -{ - return XXH_readBE64(src); -} - -#ifndef XXH_NO_XXH3 - -/* ********************************************************************* -* XXH3 -* New generation hash designed for speed on small keys and vectorization -************************************************************************ */ -/*! - * @} - * @defgroup XXH3_impl XXH3 implementation - * @ingroup impl - * @{ - */ - -/* === Compiler specifics === */ - -#if ((defined(sun) || defined(__sun)) && __cplusplus) /* Solaris includes __STDC_VERSION__ with C++. 
Tested with GCC 5.5 */
-#  define XXH_RESTRICT   /* disable */
-#elif defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L /* >= C99 */
-#  define XXH_RESTRICT   restrict
-#elif (defined (__GNUC__) && ((__GNUC__ > 3) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1))) \
-   || (defined (__clang__)) \
-   || (defined (_MSC_VER) && (_MSC_VER >= 1400)) \
-   || (defined (__INTEL_COMPILER) && (__INTEL_COMPILER >= 1300))
-/*
- * There are a LOT more compilers that recognize __restrict but this
- * covers the major ones.
- */
-#  define XXH_RESTRICT   __restrict
-#else
-#  define XXH_RESTRICT   /* disable */
-#endif
-
-#if (defined(__GNUC__) && (__GNUC__ >= 3))  \
-  || (defined(__INTEL_COMPILER) && (__INTEL_COMPILER >= 800)) \
-  || defined(__clang__)
-#    define XXH_likely(x) __builtin_expect(x, 1)
-#    define XXH_unlikely(x) __builtin_expect(x, 0)
-#else
-#    define XXH_likely(x) (x)
-#    define XXH_unlikely(x) (x)
-#endif
-
-#ifndef XXH_HAS_INCLUDE
-#  ifdef __has_include
-#    define XXH_HAS_INCLUDE(x) __has_include(x)
-#  else
-#    define XXH_HAS_INCLUDE(x) 0
-#  endif
-#endif
-
-#if defined(__GNUC__) || defined(__clang__)
-#  if defined(__ARM_FEATURE_SVE)
-#    include <arm_sve.h>
-#  endif
-#  if defined(__ARM_NEON__) || defined(__ARM_NEON) \
-   || (defined(_M_ARM) && _M_ARM >= 7) \
-   || defined(_M_ARM64) || defined(_M_ARM64EC) \
-   || (defined(__wasm_simd128__) && XXH_HAS_INCLUDE(<arm_neon.h>)) /* WASM SIMD128 via SIMDe */
-#    define inline __inline__ /* circumvent a clang bug */
-#    include <arm_neon.h>
-#    undef inline
-#  elif defined(__AVX2__)
-#    include <immintrin.h>
-#  elif defined(__SSE2__)
-#    include <emmintrin.h>
-#  endif
-#endif
-
-#if defined(_MSC_VER)
-#  include <intrin.h>
-#endif
-
-/*
- * One goal of XXH3 is to make it fast on both 32-bit and 64-bit, while
- * remaining a true 64-bit/128-bit hash function.
- *
- * This is done by prioritizing a subset of 64-bit operations that can be
- * emulated without too many steps on the average 32-bit machine.
- * - * For example, these two lines seem similar, and run equally fast on 64-bit: - * - * xxh_u64 x; - * x ^= (x >> 47); // good - * x ^= (x >> 13); // bad - * - * However, to a 32-bit machine, there is a major difference. - * - * x ^= (x >> 47) looks like this: - * - * x.lo ^= (x.hi >> (47 - 32)); - * - * while x ^= (x >> 13) looks like this: - * - * // note: funnel shifts are not usually cheap. - * x.lo ^= (x.lo >> 13) | (x.hi << (32 - 13)); - * x.hi ^= (x.hi >> 13); - * - * The first one is significantly faster than the second, simply because the - * shift is larger than 32. This means: - * - All the bits we need are in the upper 32 bits, so we can ignore the lower - * 32 bits in the shift. - * - The shift result will always fit in the lower 32 bits, and therefore, - * we can ignore the upper 32 bits in the xor. - * - * Thanks to this optimization, XXH3 only requires these features to be efficient: - * - * - Usable unaligned access - * - A 32-bit or 64-bit ALU - * - If 32-bit, a decent ADC instruction - * - A 32 or 64-bit multiply with a 64-bit result - * - For the 128-bit variant, a decent byteswap helps short inputs. - * - * The first two are already required by XXH32, and almost all 32-bit and 64-bit - * platforms which can run XXH32 can run XXH3 efficiently. - * - * Thumb-1, the classic 16-bit only subset of ARM's instruction set, is one - * notable exception. - * - * First of all, Thumb-1 lacks support for the UMULL instruction which - * performs the important long multiply. This means numerous __aeabi_lmul - * calls. - * - * Second of all, the 8 functional registers are just not enough. - * Setup for __aeabi_lmul, byteshift loads, pointers, and all arithmetic need - * Lo registers, and this shuffling results in thousands more MOVs than A32. - * - * A32 and T32 don't have this limitation. They can access all 14 registers, - * do a 32->64 multiply with UMULL, and the flexible operand allowing free - * shifts is helpful, too. 
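The `x ^= x >> 47` argument above can be verified directly: with a shift count of 32 or more, the 32-bit emulation reads only the high half and writes only the low half, while a smaller shift needs a funnel shift plus a second xorshift. A sketch with an explicit hi/lo pair:

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value split into two 32-bit halves, as a 32-bit ALU sees it. */
typedef struct { uint32_t lo, hi; } u64pair;

static u64pair xorshift47(u64pair x)     /* x ^= x >> 47: the cheap case */
{
    x.lo ^= x.hi >> (47 - 32);           /* one cross-half move, hi untouched */
    return x;
}

static u64pair xorshift13(u64pair x)     /* x ^= x >> 13: the costly case */
{
    x.lo ^= (x.lo >> 13) | (x.hi << (32 - 13));  /* funnel shift */
    x.hi ^= x.hi >> 13;
    return x;
}

static u64pair split(uint64_t v)
{
    u64pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}
static uint64_t join(u64pair p) { return ((uint64_t)p.hi << 32) | p.lo; }
```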
- * - * Therefore, we do a quick sanity check. - * - * If compiling Thumb-1 for a target which supports ARM instructions, we will - * emit a warning, as it is not a "sane" platform to compile for. - * - * Usually, if this happens, it is because of an accident and you probably need - * to specify -march, as you likely meant to compile for a newer architecture. - * - * Credit: large sections of the vectorial and asm source code paths - * have been contributed by @easyaspi314 - */ -#if defined(__thumb__) && !defined(__thumb2__) && defined(__ARM_ARCH_ISA_ARM) -# warning "XXH3 is highly inefficient without ARM or Thumb-2." -#endif - -/* ========================================== - * Vectorization detection - * ========================================== */ - -#ifdef XXH_DOXYGEN -/*! - * @ingroup tuning - * @brief Overrides the vectorization implementation chosen for XXH3. - * - * Can be defined to 0 to disable SIMD or any of the values mentioned in - * @ref XXH_VECTOR_TYPE. - * - * If this is not defined, it uses predefined macros to determine the best - * implementation. - */ -# define XXH_VECTOR XXH_SCALAR -/*! - * @ingroup tuning - * @brief Possible values for @ref XXH_VECTOR. - * - * Note that these are actually implemented as macros. - * - * If this is not defined, it is detected automatically. - * internal macro XXH_X86DISPATCH overrides this. - */ -enum XXH_VECTOR_TYPE /* fake enum */ { - XXH_SCALAR = 0, /*!< Portable scalar version */ - XXH_SSE2 = 1, /*!< - * SSE2 for Pentium 4, Opteron, all x86_64. - * - * @note SSE2 is also guaranteed on Windows 10, macOS, and - * Android x86. - */ - XXH_AVX2 = 2, /*!< AVX2 for Haswell and Bulldozer */ - XXH_AVX512 = 3, /*!< AVX512 for Skylake and Icelake */ - XXH_NEON = 4, /*!< - * NEON for most ARMv7-A, all AArch64, and WASM SIMD128 - * via the SIMDeverywhere polyfill provided with the - * Emscripten SDK. 
- */
-    XXH_VSX    = 5,  /*!< VSX and ZVector for POWER8/z13 (64-bit) */
-    XXH_SVE    = 6,  /*!< SVE for some ARMv8-A and ARMv9-A */
-};
-/*!
- * @ingroup tuning
- * @brief Selects the minimum alignment for XXH3's accumulators.
- *
- * When using SIMD, this should match the alignment required for said vector
- * type, so, for example, 32 for AVX2.
- *
- * Default: Auto detected.
- */
-#  define XXH_ACC_ALIGN 8
-#endif
-
-/* Actual definition */
-#ifndef XXH_DOXYGEN
-#  define XXH_SCALAR 0
-#  define XXH_SSE2   1
-#  define XXH_AVX2   2
-#  define XXH_AVX512 3
-#  define XXH_NEON   4
-#  define XXH_VSX    5
-#  define XXH_SVE    6
-#endif
-
-#ifndef XXH_VECTOR   /* can be defined on command line */
-#  if defined(__ARM_FEATURE_SVE)
-#    define XXH_VECTOR XXH_SVE
-#  elif ( \
-        defined(__ARM_NEON__) || defined(__ARM_NEON) /* gcc */ \
-     || defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC) /* msvc */ \
-     || (defined(__wasm_simd128__) && XXH_HAS_INCLUDE(<arm_neon.h>)) /* wasm simd128 via SIMDe */ \
-   ) && ( \
-        defined(_WIN32) || defined(__LITTLE_ENDIAN__) /* little endian only */ \
-     || (defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__) \
-   )
-#    define XXH_VECTOR XXH_NEON
-#  elif defined(__AVX512F__)
-#    define XXH_VECTOR XXH_AVX512
-#  elif defined(__AVX2__)
-#    define XXH_VECTOR XXH_AVX2
-#  elif defined(__SSE2__) || defined(_M_AMD64) || defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP == 2))
-#    define XXH_VECTOR XXH_SSE2
-#  elif (defined(__PPC64__) && defined(__POWER8_VECTOR__)) \
-     || (defined(__s390x__) && defined(__VEC__)) \
-     && defined(__GNUC__) /* TODO: IBM XL */
-#    define XXH_VECTOR XXH_VSX
-#  else
-#    define XXH_VECTOR XXH_SCALAR
-#  endif
-#endif
-
-/* __ARM_FEATURE_SVE is only supported by GCC & Clang. */
-#if (XXH_VECTOR == XXH_SVE) && !defined(__ARM_FEATURE_SVE)
-#  ifdef _MSC_VER
-#    pragma warning(once : 4606)
-#  else
-#    warning "__ARM_FEATURE_SVE isn't supported. Use SCALAR instead."
-# endif -# undef XXH_VECTOR -# define XXH_VECTOR XXH_SCALAR -#endif - -/* - * Controls the alignment of the accumulator, - * for compatibility with aligned vector loads, which are usually faster. - */ -#ifndef XXH_ACC_ALIGN -# if defined(XXH_X86DISPATCH) -# define XXH_ACC_ALIGN 64 /* for compatibility with avx512 */ -# elif XXH_VECTOR == XXH_SCALAR /* scalar */ -# define XXH_ACC_ALIGN 8 -# elif XXH_VECTOR == XXH_SSE2 /* sse2 */ -# define XXH_ACC_ALIGN 16 -# elif XXH_VECTOR == XXH_AVX2 /* avx2 */ -# define XXH_ACC_ALIGN 32 -# elif XXH_VECTOR == XXH_NEON /* neon */ -# define XXH_ACC_ALIGN 16 -# elif XXH_VECTOR == XXH_VSX /* vsx */ -# define XXH_ACC_ALIGN 16 -# elif XXH_VECTOR == XXH_AVX512 /* avx512 */ -# define XXH_ACC_ALIGN 64 -# elif XXH_VECTOR == XXH_SVE /* sve */ -# define XXH_ACC_ALIGN 64 -# endif -#endif - -#if defined(XXH_X86DISPATCH) || XXH_VECTOR == XXH_SSE2 \ - || XXH_VECTOR == XXH_AVX2 || XXH_VECTOR == XXH_AVX512 -# define XXH_SEC_ALIGN XXH_ACC_ALIGN -#elif XXH_VECTOR == XXH_SVE -# define XXH_SEC_ALIGN XXH_ACC_ALIGN -#else -# define XXH_SEC_ALIGN 8 -#endif - -#if defined(__GNUC__) || defined(__clang__) -# define XXH_ALIASING __attribute__((may_alias)) -#else -# define XXH_ALIASING /* nothing */ -#endif - -/* - * UGLY HACK: - * GCC usually generates the best code with -O3 for xxHash. - * - * However, when targeting AVX2, it is overzealous in its unrolling resulting - * in code roughly 3/4 the speed of Clang. - * - * There are other issues, such as GCC splitting _mm256_loadu_si256 into - * _mm_loadu_si128 + _mm256_inserti128_si256. This is an optimization which - * only applies to Sandy and Ivy Bridge... which don't even support AVX2. - * - * That is why when compiling the AVX2 version, it is recommended to use either - * -O2 -mavx2 -march=haswell - * or - * -O2 -mavx2 -mno-avx256-split-unaligned-load - * for decent performance, or to use Clang instead. 
- * - * Fortunately, we can control the first one with a pragma that forces GCC into - * -O2, but the other one we can't control without "failed to inline always - * inline function due to target mismatch" warnings. - */ -#if XXH_VECTOR == XXH_AVX2 /* AVX2 */ \ - && defined(__GNUC__) && !defined(__clang__) /* GCC, not Clang */ \ - && defined(__OPTIMIZE__) && XXH_SIZE_OPT <= 0 /* respect -O0 and -Os */ -# pragma GCC push_options -# pragma GCC optimize("-O2") -#endif - -#if XXH_VECTOR == XXH_NEON - -/* - * UGLY HACK: While AArch64 GCC on Linux does not seem to care, on macOS, GCC -O3 - * optimizes out the entire hashLong loop because of the aliasing violation. - * - * However, GCC is also inefficient at load-store optimization with vld1q/vst1q, - * so the only option is to mark it as aliasing. - */ -typedef uint64x2_t xxh_aliasing_uint64x2_t XXH_ALIASING; - -/*! - * @internal - * @brief `vld1q_u64` but faster and alignment-safe. - * - * On AArch64, unaligned access is always safe, but on ARMv7-a, it is only - * *conditionally* safe (`vld1` has an alignment bit like `movdq[ua]` in x86). - * - * GCC for AArch64 sees `vld1q_u8` as an intrinsic instead of a load, so it - * prohibits load-store optimizations. Therefore, a direct dereference is used. - * - * Otherwise, `vld1q_u8` is used with `vreinterpretq_u8_u64` to do a safe - * unaligned load. - */ -#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__) -XXH_FORCE_INLINE uint64x2_t XXH_vld1q_u64(void const* ptr) /* silence -Wcast-align */ -{ - return *(xxh_aliasing_uint64x2_t const *)ptr; -} -#else -XXH_FORCE_INLINE uint64x2_t XXH_vld1q_u64(void const* ptr) -{ - return vreinterpretq_u64_u8(vld1q_u8((uint8_t const*)ptr)); -} -#endif - -/*! - * @internal - * @brief `vmlal_u32` on low and high halves of a vector. - * - * This is a workaround for AArch64 GCC < 11 which implemented arm_neon.h with - * inline assembly and was therefore incapable of merging the `vget_{low, high}_u32` - * with `vmlal_u32`.
- */ -#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 11 -XXH_FORCE_INLINE uint64x2_t -XXH_vmlal_low_u32(uint64x2_t acc, uint32x4_t lhs, uint32x4_t rhs) -{ - /* Inline assembly is the only way */ - __asm__("umlal %0.2d, %1.2s, %2.2s" : "+w" (acc) : "w" (lhs), "w" (rhs)); - return acc; -} -XXH_FORCE_INLINE uint64x2_t -XXH_vmlal_high_u32(uint64x2_t acc, uint32x4_t lhs, uint32x4_t rhs) -{ - /* This intrinsic works as expected */ - return vmlal_high_u32(acc, lhs, rhs); -} -#else -/* Portable intrinsic versions */ -XXH_FORCE_INLINE uint64x2_t -XXH_vmlal_low_u32(uint64x2_t acc, uint32x4_t lhs, uint32x4_t rhs) -{ - return vmlal_u32(acc, vget_low_u32(lhs), vget_low_u32(rhs)); -} -/*! @copydoc XXH_vmlal_low_u32 - * Assume the compiler converts this to vmlal_high_u32 on aarch64 */ -XXH_FORCE_INLINE uint64x2_t -XXH_vmlal_high_u32(uint64x2_t acc, uint32x4_t lhs, uint32x4_t rhs) -{ - return vmlal_u32(acc, vget_high_u32(lhs), vget_high_u32(rhs)); -} -#endif - -/*! - * @ingroup tuning - * @brief Controls the NEON to scalar ratio for XXH3 - * - * This can be set to 2, 4, 6, or 8. - * - * ARM Cortex CPUs are _very_ sensitive to how their pipelines are used. - * - * For example, the Cortex-A73 can dispatch 3 micro-ops per cycle, but only 2 of those - * can be NEON. If you are only using NEON instructions, you are only using 2/3 of the CPU - * bandwidth. - * - * This is even more noticeable on the more advanced cores like the Cortex-A76 which - * can dispatch 8 micro-ops per cycle, but still only 2 NEON micro-ops at once. - * - * Therefore, to make the most out of the pipeline, it is beneficial to run 6 NEON lanes - * and 2 scalar lanes, which is chosen by default. - * - * This does not apply to Apple processors or 32-bit processors, which run better with - * full NEON. These will default to 8. Additionally, size-optimized builds run 8 lanes. 
- * - * This change benefits CPUs with large micro-op buffers without negatively affecting - * most other CPUs: - * - * | Chipset | Dispatch type | NEON only | 6:2 hybrid | Diff. | - * |:----------------------|:--------------------|----------:|-----------:|------:| - * | Snapdragon 730 (A76) | 2 NEON/8 micro-ops | 8.8 GB/s | 10.1 GB/s | ~16% | - * | Snapdragon 835 (A73) | 2 NEON/3 micro-ops | 5.1 GB/s | 5.3 GB/s | ~5% | - * | Marvell PXA1928 (A53) | In-order dual-issue | 1.9 GB/s | 1.9 GB/s | 0% | - * | Apple M1 | 4 NEON/8 micro-ops | 37.3 GB/s | 36.1 GB/s | ~-3% | - * - * It also seems to fix some bad codegen on GCC, making it almost as fast as clang. - * - * When using WASM SIMD128, if this is 2 or 6, SIMDe will scalarize 2 of the lanes, meaning - * it effectively becomes a worse version of 4. - * - * @see XXH3_accumulate_512_neon() - */ -# ifndef XXH3_NEON_LANES -# if (defined(__aarch64__) || defined(__arm64__) || defined(_M_ARM64) || defined(_M_ARM64EC)) \ - && !defined(__APPLE__) && XXH_SIZE_OPT <= 0 -# define XXH3_NEON_LANES 6 -# else -# define XXH3_NEON_LANES XXH_ACC_NB -# endif -# endif -#endif /* XXH_VECTOR == XXH_NEON */ - -/* - * VSX and Z Vector helpers. - * - * This is very messy, and any pull requests to clean this up are welcome. - * - * There are a lot of problems with supporting VSX and s390x, due to - * inconsistent intrinsics, spotty coverage, and multiple endiannesses. - */ -#if XXH_VECTOR == XXH_VSX -/* Annoyingly, these headers _may_ define three macros: `bool`, `vector`, - * and `pixel`. This is a problem for obvious reasons. - * - * These keywords are unnecessary; the spec literally says they are - * equivalent to `__bool`, `__vector`, and `__pixel` and may be undef'd - * after including the header. - * - * We use pragma push_macro/pop_macro to keep the namespace clean.
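The push_macro/pop_macro namespace trick described here can be illustrated in isolation. A minimal sketch, assuming hypothetical macro values and helper names (nothing here comes from the real vector headers):

```c
#include <assert.h>

/* Sketch of the pragma push_macro/pop_macro pattern: save a user-defined
 * macro, let a "header" reuse the name, then restore the user's definition.
 * Supported by GCC, Clang, and MSVC. */
#define vector "user vector"     /* pretend user code defined this macro */

#pragma push_macro("vector")     /* save the user's definition */
#undef vector                    /* the name is now free for the header */
#define vector int               /* stand-in for what a vector header does */
static vector header_value = 3;  /* expands to: static int header_value = 3; */
#pragma pop_macro("vector")      /* restore the user's definition */

static const char* user_value(void)
{
    return vector;               /* expands to the user's string again */
}
```

After `pop_macro`, the user's `vector` macro is intact even though the "header" redefined the same name in between.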
*/ -# pragma push_macro("bool") -# pragma push_macro("vector") -# pragma push_macro("pixel") -/* silence potential macro redefined warnings */ -# undef bool -# undef vector -# undef pixel - -# if defined(__s390x__) -# include <s390intrin.h> -# else -# include <altivec.h> -# endif - -/* Restore the original macro values, if applicable. */ -# pragma pop_macro("pixel") -# pragma pop_macro("vector") -# pragma pop_macro("bool") - -typedef __vector unsigned long long xxh_u64x2; -typedef __vector unsigned char xxh_u8x16; -typedef __vector unsigned xxh_u32x4; - -/* - * UGLY HACK: Similar to aarch64 macOS GCC, s390x GCC has the same aliasing issue. - */ -typedef xxh_u64x2 xxh_aliasing_u64x2 XXH_ALIASING; - -# ifndef XXH_VSX_BE -# if defined(__BIG_ENDIAN__) \ - || (defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__) -# define XXH_VSX_BE 1 -# elif defined(__VEC_ELEMENT_REG_ORDER__) && __VEC_ELEMENT_REG_ORDER__ == __ORDER_BIG_ENDIAN__ -# warning "-maltivec=be is not recommended. Please use native endianness." -# define XXH_VSX_BE 1 -# else -# define XXH_VSX_BE 0 -# endif -# endif /* !defined(XXH_VSX_BE) */ - -# if XXH_VSX_BE -# if defined(__POWER9_VECTOR__) || (defined(__clang__) && defined(__s390x__)) -# define XXH_vec_revb vec_revb -# else -/*! - * A polyfill for POWER9's vec_revb(). - */ -XXH_FORCE_INLINE xxh_u64x2 XXH_vec_revb(xxh_u64x2 val) -{ - xxh_u8x16 const vByteSwap = { 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00, - 0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08 }; - return vec_perm(val, val, vByteSwap); -} -# endif -# endif /* XXH_VSX_BE */ - -/*! - * Performs an unaligned vector load and byte swaps it on big endian.
- */ -XXH_FORCE_INLINE xxh_u64x2 XXH_vec_loadu(const void *ptr) -{ - xxh_u64x2 ret; - XXH_memcpy(&ret, ptr, sizeof(xxh_u64x2)); -# if XXH_VSX_BE - ret = XXH_vec_revb(ret); -# endif - return ret; -} - -/* - * vec_mulo and vec_mule are very problematic intrinsics on PowerPC - * - * These intrinsics weren't added until GCC 8, despite existing for a while, - * and they are endian dependent. Also, their meanings swap depending on the version. - * */ -# if defined(__s390x__) - /* s390x is always big endian, no issue on this platform */ -# define XXH_vec_mulo vec_mulo -# define XXH_vec_mule vec_mule -# elif defined(__clang__) && XXH_HAS_BUILTIN(__builtin_altivec_vmuleuw) && !defined(__ibmxl__) -/* Clang has a better way to control this: we can just use the builtin, which doesn't swap. */ - /* The IBM XL Compiler (which defines __clang__) only implements the vec_* operations */ -# define XXH_vec_mulo __builtin_altivec_vmulouw -# define XXH_vec_mule __builtin_altivec_vmuleuw -# else -/* gcc needs inline assembly */ -/* Adapted from https://github.com/google/highwayhash/blob/master/highwayhash/hh_vsx.h.
*/ -XXH_FORCE_INLINE xxh_u64x2 XXH_vec_mulo(xxh_u32x4 a, xxh_u32x4 b) -{ - xxh_u64x2 result; - __asm__("vmulouw %0, %1, %2" : "=v" (result) : "v" (a), "v" (b)); - return result; -} -XXH_FORCE_INLINE xxh_u64x2 XXH_vec_mule(xxh_u32x4 a, xxh_u32x4 b) -{ - xxh_u64x2 result; - __asm__("vmuleuw %0, %1, %2" : "=v" (result) : "v" (a), "v" (b)); - return result; -} -# endif /* XXH_vec_mulo, XXH_vec_mule */ -#endif /* XXH_VECTOR == XXH_VSX */ - -#if XXH_VECTOR == XXH_SVE -#define ACCRND(acc, offset) \ -do { \ - svuint64_t input_vec = svld1_u64(mask, xinput + offset); \ - svuint64_t secret_vec = svld1_u64(mask, xsecret + offset); \ - svuint64_t mixed = sveor_u64_x(mask, secret_vec, input_vec); \ - svuint64_t swapped = svtbl_u64(input_vec, kSwap); \ - svuint64_t mixed_lo = svextw_u64_x(mask, mixed); \ - svuint64_t mixed_hi = svlsr_n_u64_x(mask, mixed, 32); \ - svuint64_t mul = svmad_u64_x(mask, mixed_lo, mixed_hi, swapped); \ - acc = svadd_u64_x(mask, acc, mul); \ -} while (0) -#endif /* XXH_VECTOR == XXH_SVE */ - -/* prefetch - * can be disabled by declaring the XXH_NO_PREFETCH build macro */ -#if defined(XXH_NO_PREFETCH) -# define XXH_PREFETCH(ptr) (void)(ptr) /* disabled */ -#else -# if XXH_SIZE_OPT >= 1 -# define XXH_PREFETCH(ptr) (void)(ptr) -# elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86)) /* _mm_prefetch() not defined outside of x86/x64 */ -# include <mmintrin.h> /* https://msdn.microsoft.com/fr-fr/library/84szxsww(v=vs.90).aspx */ -# define XXH_PREFETCH(ptr) _mm_prefetch((const char*)(ptr), _MM_HINT_T0) -# elif defined(__GNUC__) && ( (__GNUC__ >= 4) || ( (__GNUC__ == 3) && (__GNUC_MINOR__ >= 1) ) ) -# define XXH_PREFETCH(ptr) __builtin_prefetch((ptr), 0 /* rw==read */, 3 /* locality */) -# else -# define XXH_PREFETCH(ptr) (void)(ptr) /* disabled */ -# endif -#endif /* XXH_NO_PREFETCH */ - - -/* ========================================== - * XXH3 default settings - * ========================================== */ - -#define XXH_SECRET_DEFAULT_SIZE 192 /* minimum
XXH3_SECRET_SIZE_MIN */ - -#if (XXH_SECRET_DEFAULT_SIZE < XXH3_SECRET_SIZE_MIN) -# error "default keyset is not large enough" -#endif - -/*! Pseudorandom secret taken directly from FARSH. */ -XXH_ALIGN(64) static const xxh_u8 XXH3_kSecret[XXH_SECRET_DEFAULT_SIZE] = { - 0xb8, 0xfe, 0x6c, 0x39, 0x23, 0xa4, 0x4b, 0xbe, 0x7c, 0x01, 0x81, 0x2c, 0xf7, 0x21, 0xad, 0x1c, - 0xde, 0xd4, 0x6d, 0xe9, 0x83, 0x90, 0x97, 0xdb, 0x72, 0x40, 0xa4, 0xa4, 0xb7, 0xb3, 0x67, 0x1f, - 0xcb, 0x79, 0xe6, 0x4e, 0xcc, 0xc0, 0xe5, 0x78, 0x82, 0x5a, 0xd0, 0x7d, 0xcc, 0xff, 0x72, 0x21, - 0xb8, 0x08, 0x46, 0x74, 0xf7, 0x43, 0x24, 0x8e, 0xe0, 0x35, 0x90, 0xe6, 0x81, 0x3a, 0x26, 0x4c, - 0x3c, 0x28, 0x52, 0xbb, 0x91, 0xc3, 0x00, 0xcb, 0x88, 0xd0, 0x65, 0x8b, 0x1b, 0x53, 0x2e, 0xa3, - 0x71, 0x64, 0x48, 0x97, 0xa2, 0x0d, 0xf9, 0x4e, 0x38, 0x19, 0xef, 0x46, 0xa9, 0xde, 0xac, 0xd8, - 0xa8, 0xfa, 0x76, 0x3f, 0xe3, 0x9c, 0x34, 0x3f, 0xf9, 0xdc, 0xbb, 0xc7, 0xc7, 0x0b, 0x4f, 0x1d, - 0x8a, 0x51, 0xe0, 0x4b, 0xcd, 0xb4, 0x59, 0x31, 0xc8, 0x9f, 0x7e, 0xc9, 0xd9, 0x78, 0x73, 0x64, - 0xea, 0xc5, 0xac, 0x83, 0x34, 0xd3, 0xeb, 0xc3, 0xc5, 0x81, 0xa0, 0xff, 0xfa, 0x13, 0x63, 0xeb, - 0x17, 0x0d, 0xdd, 0x51, 0xb7, 0xf0, 0xda, 0x49, 0xd3, 0x16, 0x55, 0x26, 0x29, 0xd4, 0x68, 0x9e, - 0x2b, 0x16, 0xbe, 0x58, 0x7d, 0x47, 0xa1, 0xfc, 0x8f, 0xf8, 0xb8, 0xd1, 0x7a, 0xd0, 0x31, 0xce, - 0x45, 0xcb, 0x3a, 0x8f, 0x95, 0x16, 0x04, 0x28, 0xaf, 0xd7, 0xfb, 0xca, 0xbb, 0x4b, 0x40, 0x7e, -}; - -static const xxh_u64 PRIME_MX1 = 0x165667919E3779F9ULL; /*!< 0b0001011001010110011001111001000110011110001101110111100111111001 */ -static const xxh_u64 PRIME_MX2 = 0x9FB21C651E98DF25ULL; /*!< 0b1001111110110010000111000110010100011110100110001101111100100101 */ - -#ifdef XXH_OLD_NAMES -# define kSecret XXH3_kSecret -#endif - -#ifdef XXH_DOXYGEN -/*! - * @brief Calculates a 32-bit to 64-bit long multiply. - * - * Implemented as a macro. 
- * - * Wraps `__emulu` on MSVC x86 because it tends to call `__allmul` when it doesn't - * need to (but it shouldn't need to anyways, it is about 7 instructions to do - * a 64x64 multiply...). Since we know that this will _always_ emit `MULL`, we - * use that instead of the normal method. - * - * If you are compiling for platforms like Thumb-1 and don't have a better option, - * you may also want to write your own long multiply routine here. - * - * @param x, y Numbers to be multiplied - * @return 64-bit product of the low 32 bits of @p x and @p y. - */ -XXH_FORCE_INLINE xxh_u64 -XXH_mult32to64(xxh_u64 x, xxh_u64 y) -{ - return (x & 0xFFFFFFFF) * (y & 0xFFFFFFFF); -} -#elif defined(_MSC_VER) && defined(_M_IX86) -# define XXH_mult32to64(x, y) __emulu((unsigned)(x), (unsigned)(y)) -#else -/* - * Downcast + upcast is usually better than masking on older compilers like - * GCC 4.2 (especially 32-bit ones), all without affecting newer compilers. - * - * The other method, (x & 0xFFFFFFFF) * (y & 0xFFFFFFFF), will AND both operands - * and perform a full 64x64 multiply -- entirely redundant on 32-bit. - */ -# define XXH_mult32to64(x, y) ((xxh_u64)(xxh_u32)(x) * (xxh_u64)(xxh_u32)(y)) -#endif - -/*! - * @brief Calculates a 64->128-bit long multiply. - * - * Uses `__uint128_t` and `_umul128` if available, otherwise uses a scalar - * version. - * - * @param lhs , rhs The 64-bit integers to be multiplied - * @return The 128-bit result represented in an @ref XXH128_hash_t. - */ -static XXH128_hash_t -XXH_mult64to128(xxh_u64 lhs, xxh_u64 rhs) -{ - /* - * GCC/Clang __uint128_t method. - * - * On most 64-bit targets, GCC and Clang define a __uint128_t type. - * This is usually the best way as it usually uses a native long 64-bit - * multiply, such as MULQ on x86_64 or MUL + UMULH on aarch64. - * - * Usually. - * - * Despite being a 32-bit platform, Clang (and emscripten) define this type - * even though they lack the arithmetic for it.
This results in a laggy - * compiler builtin call which calculates a full 128-bit multiply. - * In that case it is best to use the portable one. - * https://github.com/Cyan4973/xxHash/issues/211#issuecomment-515575677 - */ -#if (defined(__GNUC__) || defined(__clang__)) && !defined(__wasm__) \ - && defined(__SIZEOF_INT128__) \ - || (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128) - - __uint128_t const product = (__uint128_t)lhs * (__uint128_t)rhs; - XXH128_hash_t r128; - r128.low64 = (xxh_u64)(product); - r128.high64 = (xxh_u64)(product >> 64); - return r128; - - /* - * MSVC for x64's _umul128 method. - * - * xxh_u64 _umul128(xxh_u64 Multiplier, xxh_u64 Multiplicand, xxh_u64 *HighProduct); - * - * This compiles to single operand MUL on x64. - */ -#elif (defined(_M_X64) || defined(_M_IA64)) && !defined(_M_ARM64EC) - -#ifndef _MSC_VER -# pragma intrinsic(_umul128) -#endif - xxh_u64 product_high; - xxh_u64 const product_low = _umul128(lhs, rhs, &product_high); - XXH128_hash_t r128; - r128.low64 = product_low; - r128.high64 = product_high; - return r128; - - /* - * MSVC for ARM64's __umulh method. - * - * This compiles to the same MUL + UMULH as GCC/Clang's __uint128_t method. - */ -#elif defined(_M_ARM64) || defined(_M_ARM64EC) - -#ifndef _MSC_VER -# pragma intrinsic(__umulh) -#endif - XXH128_hash_t r128; - r128.low64 = lhs * rhs; - r128.high64 = __umulh(lhs, rhs); - return r128; - -#else - /* - * Portable scalar method. Optimized for 32-bit and 64-bit ALUs. - * - * This is a fast and simple grade school multiply, which is shown below - * with base 10 arithmetic instead of base 0x100000000. 
- * - * 9 3 // D2 lhs = 93 - * x 7 5 // D2 rhs = 75 - * ---------- - * 1 5 // D2 lo_lo = (93 % 10) * (75 % 10) = 15 - * 4 5 | // D2 hi_lo = (93 / 10) * (75 % 10) = 45 - * 2 1 | // D2 lo_hi = (93 % 10) * (75 / 10) = 21 - * + 6 3 | | // D2 hi_hi = (93 / 10) * (75 / 10) = 63 - * --------- - * 2 7 | // D2 cross = (15 / 10) + (45 % 10) + 21 = 27 - * + 6 9 | | // D2 upper = (27 / 10) + (45 / 10) + 63 = 69 - * --------- - * 6 9 7 5 // D4 res = ((27 * 10) % 100) + (15 % 10) + (69 * 100) = 6975 - * - * The reasons for adding the products like this are: - * 1. It avoids manual carry tracking. Just like how - * (9 * 9) + 9 + 9 = 99, the same applies with this for UINT64_MAX. - * This avoids a lot of complexity. - * - * 2. It hints for, and on Clang, compiles to, the powerful UMAAL - * instruction available in ARM's Digital Signal Processing extension - * in 32-bit ARMv6 and later, which is shown below: - * - * void UMAAL(xxh_u32 *RdLo, xxh_u32 *RdHi, xxh_u32 Rn, xxh_u32 Rm) - * { - * xxh_u64 product = (xxh_u64)*RdLo * (xxh_u64)*RdHi + Rn + Rm; - * *RdLo = (xxh_u32)(product & 0xFFFFFFFF); - * *RdHi = (xxh_u32)(product >> 32); - * } - * - * This instruction was designed for efficient long multiplication, and - * allows this to be calculated in only 4 instructions at speeds - * comparable to some 64-bit ALUs. - * - * 3. It isn't terrible on other platforms. Usually this will be a couple - * of 32-bit ADD/ADCs. - */ - - /* First calculate all of the cross products. */ - xxh_u64 const lo_lo = XXH_mult32to64(lhs & 0xFFFFFFFF, rhs & 0xFFFFFFFF); - xxh_u64 const hi_lo = XXH_mult32to64(lhs >> 32, rhs & 0xFFFFFFFF); - xxh_u64 const lo_hi = XXH_mult32to64(lhs & 0xFFFFFFFF, rhs >> 32); - xxh_u64 const hi_hi = XXH_mult32to64(lhs >> 32, rhs >> 32); - - /* Now add the products together. These will never overflow.
*/ - xxh_u64 const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi; - xxh_u64 const upper = (hi_lo >> 32) + (cross >> 32) + hi_hi; - xxh_u64 const lower = (cross << 32) | (lo_lo & 0xFFFFFFFF); - - XXH128_hash_t r128; - r128.low64 = lower; - r128.high64 = upper; - return r128; -#endif -} - -/*! - * @brief Calculates a 64-bit to 128-bit multiply, then XOR folds it. - * - * The reason for the separate function is to prevent passing too many structs - * around by value. This will hopefully inline the multiply, but we don't force it. - * - * @param lhs , rhs The 64-bit integers to multiply - * @return The low 64 bits of the product XOR'd by the high 64 bits. - * @see XXH_mult64to128() - */ -static xxh_u64 -XXH3_mul128_fold64(xxh_u64 lhs, xxh_u64 rhs) -{ - XXH128_hash_t product = XXH_mult64to128(lhs, rhs); - return product.low64 ^ product.high64; -} - -/*! Seems to produce slightly better code on GCC for some reason. */ -XXH_FORCE_INLINE XXH_CONSTF xxh_u64 XXH_xorshift64(xxh_u64 v64, int shift) -{ - XXH_ASSERT(0 <= shift && shift < 64); - return v64 ^ (v64 >> shift); -} - -/* - * This is a fast avalanche stage, - * suitable when input bits are already partially mixed - */ -static XXH64_hash_t XXH3_avalanche(xxh_u64 h64) -{ - h64 = XXH_xorshift64(h64, 37); - h64 *= PRIME_MX1; - h64 = XXH_xorshift64(h64, 32); - return h64; -} - -/* - * This is a stronger avalanche, - * inspired by Pelle Evensen's rrmxmx - * preferable when input has not been previously mixed - */ -static XXH64_hash_t XXH3_rrmxmx(xxh_u64 h64, xxh_u64 len) -{ - /* this mix is inspired by Pelle Evensen's rrmxmx */ - h64 ^= XXH_rotl64(h64, 49) ^ XXH_rotl64(h64, 24); - h64 *= PRIME_MX2; - h64 ^= (h64 >> 35) + len ; - h64 *= PRIME_MX2; - return XXH_xorshift64(h64, 28); -} - - -/* ========================================== - * Short keys - * ========================================== - * One of the shortcomings of XXH32 and XXH64 was that their performance was - * sub-optimal on short lengths. 
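The carry handling in the portable multiply above can be checked in isolation. A standalone sketch of the same grade-school recombination (the `u128_t` type and function name here are illustrative, not the library's API):

```c
#include <assert.h>
#include <stdint.h>

/* Re-creation of the portable 64x64->128 multiply described above:
 * four 32x32->64 cross products, recombined without losing carries. */
typedef struct { uint64_t low64; uint64_t high64; } u128_t;

static u128_t mult64to128_portable(uint64_t lhs, uint64_t rhs)
{
    uint64_t const lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t const hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t const lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t const hi_hi = (lhs >> 32)        * (rhs >> 32);

    /* cross gathers the middle 32-bit column plus the carry out of lo_lo;
     * as noted above, these additions can never overflow. */
    uint64_t const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t const upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;
    uint64_t const lower = (cross << 32) | (lo_lo & 0xFFFFFFFF);

    u128_t r = { lower, upper };
    return r;
}
```

The extreme case (2^64 - 1)^2 = 2^128 - 2^65 + 1 exercises every carry path, and the 93 * 75 = 6975 example from the comments checks the small case.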
It used an iterative algorithm which strongly - * favored lengths that were a multiple of 4 or 8. - * - * Instead of iterating over individual inputs, we use a set of single shot - * functions which piece together a range of lengths and operate in constant time. - * - * Additionally, the number of multiplies has been significantly reduced. This - * reduces latency, especially when emulating 64-bit multiplies on 32-bit. - * - * Depending on the platform, this may or may not be faster than XXH32, but it - * is almost guaranteed to be faster than XXH64. - */ - -/* - * At very short lengths, there isn't enough input to fully hide secrets, or use - * the entire secret. - * - * There is also only a limited amount of mixing we can do before significantly - * impacting performance. - * - * Therefore, we use different sections of the secret and always mix two secret - * samples with an XOR. This should have no effect on performance on the - * seedless or withSeed variants because everything _should_ be constant folded - * by modern compilers. - * - * The XOR mixing hides individual parts of the secret and increases entropy. - * - * This adds an extra layer of strength for custom secrets. 
- */ -XXH_FORCE_INLINE XXH_PUREF XXH64_hash_t -XXH3_len_1to3_64b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed) -{ - XXH_ASSERT(input != NULL); - XXH_ASSERT(1 <= len && len <= 3); - XXH_ASSERT(secret != NULL); - /* - * len = 1: combined = { input[0], 0x01, input[0], input[0] } - * len = 2: combined = { input[1], 0x02, input[0], input[1] } - * len = 3: combined = { input[2], 0x03, input[0], input[1] } - */ - { xxh_u8 const c1 = input[0]; - xxh_u8 const c2 = input[len >> 1]; - xxh_u8 const c3 = input[len - 1]; - xxh_u32 const combined = ((xxh_u32)c1 << 16) | ((xxh_u32)c2 << 24) - | ((xxh_u32)c3 << 0) | ((xxh_u32)len << 8); - xxh_u64 const bitflip = (XXH_readLE32(secret) ^ XXH_readLE32(secret+4)) + seed; - xxh_u64 const keyed = (xxh_u64)combined ^ bitflip; - return XXH64_avalanche(keyed); - } -} - -XXH_FORCE_INLINE XXH_PUREF XXH64_hash_t -XXH3_len_4to8_64b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed) -{ - XXH_ASSERT(input != NULL); - XXH_ASSERT(secret != NULL); - XXH_ASSERT(4 <= len && len <= 8); - seed ^= (xxh_u64)XXH_swap32((xxh_u32)seed) << 32; - { xxh_u32 const input1 = XXH_readLE32(input); - xxh_u32 const input2 = XXH_readLE32(input + len - 4); - xxh_u64 const bitflip = (XXH_readLE64(secret+8) ^ XXH_readLE64(secret+16)) - seed; - xxh_u64 const input64 = input2 + (((xxh_u64)input1) << 32); - xxh_u64 const keyed = input64 ^ bitflip; - return XXH3_rrmxmx(keyed, len); - } -} - -XXH_FORCE_INLINE XXH_PUREF XXH64_hash_t -XXH3_len_9to16_64b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed) -{ - XXH_ASSERT(input != NULL); - XXH_ASSERT(secret != NULL); - XXH_ASSERT(9 <= len && len <= 16); - { xxh_u64 const bitflip1 = (XXH_readLE64(secret+24) ^ XXH_readLE64(secret+32)) + seed; - xxh_u64 const bitflip2 = (XXH_readLE64(secret+40) ^ XXH_readLE64(secret+48)) - seed; - xxh_u64 const input_lo = XXH_readLE64(input) ^ bitflip1; - xxh_u64 const input_hi = XXH_readLE64(input + len - 8) ^ bitflip2; - 
xxh_u64 const acc = len - + XXH_swap64(input_lo) + input_hi - + XXH3_mul128_fold64(input_lo, input_hi); - return XXH3_avalanche(acc); - } -} - -XXH_FORCE_INLINE XXH_PUREF XXH64_hash_t -XXH3_len_0to16_64b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed) -{ - XXH_ASSERT(len <= 16); - { if (XXH_likely(len > 8)) return XXH3_len_9to16_64b(input, len, secret, seed); - if (XXH_likely(len >= 4)) return XXH3_len_4to8_64b(input, len, secret, seed); - if (len) return XXH3_len_1to3_64b(input, len, secret, seed); - return XXH64_avalanche(seed ^ (XXH_readLE64(secret+56) ^ XXH_readLE64(secret+64))); - } -} - -/* - * DISCLAIMER: There are known *seed-dependent* multicollisions here due to - * multiplication by zero, affecting hashes of lengths 17 to 240. - * - * However, they are very unlikely. - * - * Keep this in mind when using the unseeded XXH3_64bits() variant: As with all - * unseeded non-cryptographic hashes, it does not attempt to defend itself - * against specially crafted inputs, only random inputs. - * - * Compared to classic UMAC where a 1 in 2^31 chance of 4 consecutive bytes - * cancelling out the secret is taken an arbitrary number of times (addressed - * in XXH3_accumulate_512), this collision is very unlikely with random inputs - * and/or proper seeding: - * - * This only has a 1 in 2^63 chance of 8 consecutive bytes cancelling out, in a - * function that is only called up to 16 times per hash with up to 240 bytes of - * input. - * - * This is not too bad for a non-cryptographic hash function, especially with - * only 64 bit outputs. - * - * The 128-bit variant (which trades some speed for strength) is NOT affected - * by this, although it is always a good idea to use a proper seed if you care - * about strength. 
- */ -XXH_FORCE_INLINE xxh_u64 XXH3_mix16B(const xxh_u8* XXH_RESTRICT input, - const xxh_u8* XXH_RESTRICT secret, xxh_u64 seed64) -{ -#if defined(__GNUC__) && !defined(__clang__) /* GCC, not Clang */ \ - && defined(__i386__) && defined(__SSE2__) /* x86 + SSE2 */ \ - && !defined(XXH_ENABLE_AUTOVECTORIZE) /* Define to disable like XXH32 hack */ - /* - * UGLY HACK: - * GCC for x86 tends to autovectorize the 128-bit multiply, resulting in - * slower code. - * - * By forcing seed64 into a register, we disrupt the cost model and - * cause it to scalarize. See `XXH32_round()` - * - * FIXME: Clang's output is still _much_ faster -- On an AMD Ryzen 3600, - * XXH3_64bits @ len=240 runs at 4.6 GB/s with Clang 9, but 3.3 GB/s on - * GCC 9.2, despite both emitting scalar code. - * - * GCC generates much better scalar code than Clang for the rest of XXH3, - * which is why finding a more optimal codepath is an interest. - */ - XXH_COMPILER_GUARD(seed64); -#endif - { xxh_u64 const input_lo = XXH_readLE64(input); - xxh_u64 const input_hi = XXH_readLE64(input+8); - return XXH3_mul128_fold64( - input_lo ^ (XXH_readLE64(secret) + seed64), - input_hi ^ (XXH_readLE64(secret+8) - seed64) - ); - } -} - -/* For mid range keys, XXH3 uses a Mum-hash variant. */ -XXH_FORCE_INLINE XXH_PUREF XXH64_hash_t -XXH3_len_17to128_64b(const xxh_u8* XXH_RESTRICT input, size_t len, - const xxh_u8* XXH_RESTRICT secret, size_t secretSize, - XXH64_hash_t seed) -{ - XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); (void)secretSize; - XXH_ASSERT(16 < len && len <= 128); - - { xxh_u64 acc = len * XXH_PRIME64_1; -#if XXH_SIZE_OPT >= 1 - /* Smaller and cleaner, but slightly slower. 
*/ - unsigned int i = (unsigned int)(len - 1) / 32; - do { - acc += XXH3_mix16B(input+16 * i, secret+32*i, seed); - acc += XXH3_mix16B(input+len-16*(i+1), secret+32*i+16, seed); - } while (i-- != 0); -#else - if (len > 32) { - if (len > 64) { - if (len > 96) { - acc += XXH3_mix16B(input+48, secret+96, seed); - acc += XXH3_mix16B(input+len-64, secret+112, seed); - } - acc += XXH3_mix16B(input+32, secret+64, seed); - acc += XXH3_mix16B(input+len-48, secret+80, seed); - } - acc += XXH3_mix16B(input+16, secret+32, seed); - acc += XXH3_mix16B(input+len-32, secret+48, seed); - } - acc += XXH3_mix16B(input+0, secret+0, seed); - acc += XXH3_mix16B(input+len-16, secret+16, seed); -#endif - return XXH3_avalanche(acc); - } -} - -#define XXH3_MIDSIZE_MAX 240 - -XXH_NO_INLINE XXH_PUREF XXH64_hash_t -XXH3_len_129to240_64b(const xxh_u8* XXH_RESTRICT input, size_t len, - const xxh_u8* XXH_RESTRICT secret, size_t secretSize, - XXH64_hash_t seed) -{ - XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); (void)secretSize; - XXH_ASSERT(128 < len && len <= XXH3_MIDSIZE_MAX); - - #define XXH3_MIDSIZE_STARTOFFSET 3 - #define XXH3_MIDSIZE_LASTOFFSET 17 - - { xxh_u64 acc = len * XXH_PRIME64_1; - xxh_u64 acc_end; - unsigned int const nbRounds = (unsigned int)len / 16; - unsigned int i; - XXH_ASSERT(128 < len && len <= XXH3_MIDSIZE_MAX); - for (i=0; i<8; i++) { - acc += XXH3_mix16B(input+(16*i), secret+(16*i), seed); - } - /* last bytes */ - acc_end = XXH3_mix16B(input + len - 16, secret + XXH3_SECRET_SIZE_MIN - XXH3_MIDSIZE_LASTOFFSET, seed); - XXH_ASSERT(nbRounds >= 8); - acc = XXH3_avalanche(acc); -#if defined(__clang__) /* Clang */ \ - && (defined(__ARM_NEON) || defined(__ARM_NEON__)) /* NEON */ \ - && !defined(XXH_ENABLE_AUTOVECTORIZE) /* Define to disable */ - /* - * UGLY HACK: - * Clang for ARMv7-A tries to vectorize this loop, similar to GCC x86. - * Everywhere else, it uses scalar code.
- * - * For 64->128-bit multiplies, even if the NEON was 100% optimal, it - * would still be slower than UMAAL (see XXH_mult64to128). - * - * Unfortunately, Clang doesn't handle the long multiplies properly and - * converts them to the nonexistent "vmulq_u64" intrinsic, which is then - * scalarized into an ugly mess of VMOV.32 instructions. - * - * This mess is difficult to avoid without turning autovectorization - * off completely, but they are usually relatively minor and/or not - * worth it to fix. - * - * This loop is the easiest to fix, as unlike XXH32, this pragma - * _actually works_ because it is a loop vectorization instead of an - * SLP vectorization. - */ - #pragma clang loop vectorize(disable) -#endif - for (i=8 ; i < nbRounds; i++) { - /* - * Prevents clang from unrolling the acc loop and interleaving with this one. - */ - XXH_COMPILER_GUARD(acc); - acc_end += XXH3_mix16B(input+(16*i), secret+(16*(i-8)) + XXH3_MIDSIZE_STARTOFFSET, seed); - } - return XXH3_avalanche(acc + acc_end); - } -} - - -/* ======= Long Keys ======= */ - -#define XXH_STRIPE_LEN 64 -#define XXH_SECRET_CONSUME_RATE 8 /* nb of secret bytes consumed at each accumulation */ -#define XXH_ACC_NB (XXH_STRIPE_LEN / sizeof(xxh_u64)) - -#ifdef XXH_OLD_NAMES -# define STRIPE_LEN XXH_STRIPE_LEN -# define ACC_NB XXH_ACC_NB -#endif - -#ifndef XXH_PREFETCH_DIST -# ifdef __clang__ -# define XXH_PREFETCH_DIST 320 -# else -# if (XXH_VECTOR == XXH_AVX512) -# define XXH_PREFETCH_DIST 512 -# else -# define XXH_PREFETCH_DIST 384 -# endif -# endif /* __clang__ */ -#endif /* XXH_PREFETCH_DIST */ - -/* - * These macros are to generate an XXH3_accumulate() function. - * The two arguments select the name suffix and target attribute. - * - * The name of this symbol is XXH3_accumulate_<name>() and it calls - * XXH3_accumulate_512_<name>(). - * - * It may be useful to hand implement this function if the compiler fails to - * optimize the inline function.
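The token-pasting template mechanism described here can be sketched with a toy kernel. All `demo_` names below are hypothetical stand-ins, and this sketch omits the target attribute that the real macro relies on:

```c
#include <stddef.h>

/* Toy version of the accumulate template: one macro stamps out
 * demo_accumulate_<name>(), which loops over "stripes" and forwards each
 * one to a matching demo_accumulate_512_<name>() kernel. */
#define DEMO_ACCUMULATE_TEMPLATE(name)                             \
    static unsigned long long                                      \
    demo_accumulate_##name(unsigned long long acc,                 \
                           const unsigned long long* stripes,      \
                           size_t nbStripes)                       \
    {                                                              \
        size_t n;                                                  \
        for (n = 0; n < nbStripes; n++)                            \
            acc = demo_accumulate_512_##name(acc, stripes[n]);     \
        return acc;                                                \
    }

/* Stand-in for a real per-target stripe kernel. */
static unsigned long long
demo_accumulate_512_scalar(unsigned long long acc, unsigned long long stripe)
{
    return acc + stripe;
}

DEMO_ACCUMULATE_TEMPLATE(scalar) /* defines demo_accumulate_scalar() */
```

One invocation per target keeps the stripe loop (and its prefetching, in the real code) in a single place while each kernel stays specialized.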
- */ -#define XXH3_ACCUMULATE_TEMPLATE(name) \ -void \ -XXH3_accumulate_##name(xxh_u64* XXH_RESTRICT acc, \ - const xxh_u8* XXH_RESTRICT input, \ - const xxh_u8* XXH_RESTRICT secret, \ - size_t nbStripes) \ -{ \ - size_t n; \ - for (n = 0; n < nbStripes; n++ ) { \ - const xxh_u8* const in = input + n*XXH_STRIPE_LEN; \ - XXH_PREFETCH(in + XXH_PREFETCH_DIST); \ - XXH3_accumulate_512_##name( \ - acc, \ - in, \ - secret + n*XXH_SECRET_CONSUME_RATE); \ - } \ -} - - -XXH_FORCE_INLINE void XXH_writeLE64(void* dst, xxh_u64 v64) -{ - if (!XXH_CPU_LITTLE_ENDIAN) v64 = XXH_swap64(v64); - XXH_memcpy(dst, &v64, sizeof(v64)); -} - -/* Several intrinsic functions below are supposed to accept __int64 as argument, - * as documented in https://software.intel.com/sites/landingpage/IntrinsicsGuide/ . - * However, several environments do not define __int64 type, - * requiring a workaround. - */ -#if !defined (__VMS) \ - && (defined (__cplusplus) \ - || (defined (__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) /* C99 */) ) - typedef int64_t xxh_i64; -#else - /* the following type must have a width of 64-bit */ - typedef long long xxh_i64; -#endif - - -/* - * XXH3_accumulate_512 is the tightest loop for long inputs, and it is the most optimized. - * - * It is a hardened version of UMAC, based off of FARSH's implementation. - * - * This was chosen because it adapts quite well to 32-bit, 64-bit, and SIMD - * implementations, and it is ridiculously fast. - * - * We harden it by mixing the original input to the accumulators as well as the product. - * - * This means that in the (relatively likely) case of a multiply by zero, the - * original input is preserved. - * - * On 128-bit inputs, we swap 64-bit pairs when we add the input to improve - * cross-pollination, as otherwise the upper and lower halves would be - * essentially independent. - * - * This doesn't matter on 64-bit hashes since they all get merged together in - * the end, so we skip the extra step. 
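A scalar model may make the hardening described above concrete: XOR the input with the secret, multiply the 32-bit halves of the result, and add the swapped raw input back in so a zero product cannot erase it. This two-lane sketch uses illustrative names and is not the real 8-lane kernel:

```c
#include <stddef.h>
#include <stdint.h>

/* Two-lane scalar model of one accumulate step, as described above:
 *   acc[i ^ 1] += input[i]                 (swapped add preserves the input)
 *   acc[i]     += lo32(key) * hi32(key)    (the UMAC-style product)
 * where key = input[i] ^ secret[i]. */
static void accumulate_lane_pair(uint64_t acc[2],
                                 const uint64_t input[2],
                                 const uint64_t secret[2])
{
    size_t i;
    for (i = 0; i < 2; i++) {
        uint64_t const data_key = input[i] ^ secret[i];
        acc[i ^ 1] += input[i];
        acc[i]     += (data_key & 0xFFFFFFFF) * (data_key >> 32);
    }
}
```

Even when `data_key` has a zero half and the product vanishes, the raw input still reaches the accumulators through the swapped add.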
- * - * Both XXH3_64bits and XXH3_128bits use this subroutine. - */ - -#if (XXH_VECTOR == XXH_AVX512) \ - || (defined(XXH_DISPATCH_AVX512) && XXH_DISPATCH_AVX512 != 0) - -#ifndef XXH_TARGET_AVX512 -# define XXH_TARGET_AVX512 /* disable attribute target */ -#endif - -XXH_FORCE_INLINE XXH_TARGET_AVX512 void -XXH3_accumulate_512_avx512(void* XXH_RESTRICT acc, - const void* XXH_RESTRICT input, - const void* XXH_RESTRICT secret) -{ - __m512i* const xacc = (__m512i *) acc; - XXH_ASSERT((((size_t)acc) & 63) == 0); - XXH_STATIC_ASSERT(XXH_STRIPE_LEN == sizeof(__m512i)); - - { - /* data_vec = input[0]; */ - __m512i const data_vec = _mm512_loadu_si512 (input); - /* key_vec = secret[0]; */ - __m512i const key_vec = _mm512_loadu_si512 (secret); - /* data_key = data_vec ^ key_vec; */ - __m512i const data_key = _mm512_xor_si512 (data_vec, key_vec); - /* data_key_lo = data_key >> 32; */ - __m512i const data_key_lo = _mm512_srli_epi64 (data_key, 32); - /* product = (data_key & 0xffffffff) * (data_key_lo & 0xffffffff); */ - __m512i const product = _mm512_mul_epu32 (data_key, data_key_lo); - /* xacc[0] += swap(data_vec); */ - __m512i const data_swap = _mm512_shuffle_epi32(data_vec, (_MM_PERM_ENUM)_MM_SHUFFLE(1, 0, 3, 2)); - __m512i const sum = _mm512_add_epi64(*xacc, data_swap); - /* xacc[0] += product; */ - *xacc = _mm512_add_epi64(product, sum); - } -} -XXH_FORCE_INLINE XXH_TARGET_AVX512 XXH3_ACCUMULATE_TEMPLATE(avx512) - -/* - * XXH3_scrambleAcc: Scrambles the accumulators to improve mixing. - * - * Multiplication isn't perfect, as explained by Google in HighwayHash: - * - * // Multiplication mixes/scrambles bytes 0-7 of the 64-bit result to - * // varying degrees. In descending order of goodness, bytes - * // 3 4 2 5 1 6 0 7 have quality 228 224 164 160 100 96 36 32. - * // As expected, the upper and lower bytes are much worse. 
- * - * Source: https://github.com/google/highwayhash/blob/0aaf66b/highwayhash/hh_avx2.h#L291 - * - * Since our algorithm uses a pseudorandom secret to add some variance into the - * mix, we don't need to (or want to) mix as often or as much as HighwayHash does. - * - * This isn't as tight as XXH3_accumulate, but still written in SIMD to avoid - * extraction. - * - * Both XXH3_64bits and XXH3_128bits use this subroutine. - */ - -XXH_FORCE_INLINE XXH_TARGET_AVX512 void -XXH3_scrambleAcc_avx512(void* XXH_RESTRICT acc, const void* XXH_RESTRICT secret) -{ - XXH_ASSERT((((size_t)acc) & 63) == 0); - XXH_STATIC_ASSERT(XXH_STRIPE_LEN == sizeof(__m512i)); - { __m512i* const xacc = (__m512i*) acc; - const __m512i prime32 = _mm512_set1_epi32((int)XXH_PRIME32_1); - - /* xacc[0] ^= (xacc[0] >> 47) */ - __m512i const acc_vec = *xacc; - __m512i const shifted = _mm512_srli_epi64 (acc_vec, 47); - /* xacc[0] ^= secret; */ - __m512i const key_vec = _mm512_loadu_si512 (secret); - __m512i const data_key = _mm512_ternarylogic_epi32(key_vec, acc_vec, shifted, 0x96 /* key_vec ^ acc_vec ^ shifted */); - - /* xacc[0] *= XXH_PRIME32_1; */ - __m512i const data_key_hi = _mm512_srli_epi64 (data_key, 32); - __m512i const prod_lo = _mm512_mul_epu32 (data_key, prime32); - __m512i const prod_hi = _mm512_mul_epu32 (data_key_hi, prime32); - *xacc = _mm512_add_epi64(prod_lo, _mm512_slli_epi64(prod_hi, 32)); - } -} - -XXH_FORCE_INLINE XXH_TARGET_AVX512 void -XXH3_initCustomSecret_avx512(void* XXH_RESTRICT customSecret, xxh_u64 seed64) -{ - XXH_STATIC_ASSERT((XXH_SECRET_DEFAULT_SIZE & 63) == 0); - XXH_STATIC_ASSERT(XXH_SEC_ALIGN == 64); - XXH_ASSERT(((size_t)customSecret & 63) == 0); - (void)(&XXH_writeLE64); - { int const nbRounds = XXH_SECRET_DEFAULT_SIZE / sizeof(__m512i); - __m512i const seed_pos = _mm512_set1_epi64((xxh_i64)seed64); - __m512i const seed = _mm512_mask_sub_epi64(seed_pos, 0xAA, _mm512_set1_epi8(0), seed_pos); - - const __m512i* const src = (const __m512i*) ((const void*) 
XXH3_kSecret); - __m512i* const dest = ( __m512i*) customSecret; - int i; - XXH_ASSERT(((size_t)src & 63) == 0); /* control alignment */ - XXH_ASSERT(((size_t)dest & 63) == 0); - for (i=0; i < nbRounds; ++i) { - dest[i] = _mm512_add_epi64(_mm512_load_si512(src + i), seed); - } } -} - -#endif - -#if (XXH_VECTOR == XXH_AVX2) \ - || (defined(XXH_DISPATCH_AVX2) && XXH_DISPATCH_AVX2 != 0) - -#ifndef XXH_TARGET_AVX2 -# define XXH_TARGET_AVX2 /* disable attribute target */ -#endif - -XXH_FORCE_INLINE XXH_TARGET_AVX2 void -XXH3_accumulate_512_avx2( void* XXH_RESTRICT acc, - const void* XXH_RESTRICT input, - const void* XXH_RESTRICT secret) -{ - XXH_ASSERT((((size_t)acc) & 31) == 0); - { __m256i* const xacc = (__m256i *) acc; - /* Unaligned. This is mainly for pointer arithmetic, and because - * _mm256_loadu_si256 requires a const __m256i * pointer for some reason. */ - const __m256i* const xinput = (const __m256i *) input; - /* Unaligned. This is mainly for pointer arithmetic, and because - * _mm256_loadu_si256 requires a const __m256i * pointer for some reason. 
*/ - const __m256i* const xsecret = (const __m256i *) secret; - - size_t i; - for (i=0; i < XXH_STRIPE_LEN/sizeof(__m256i); i++) { - /* data_vec = xinput[i]; */ - __m256i const data_vec = _mm256_loadu_si256 (xinput+i); - /* key_vec = xsecret[i]; */ - __m256i const key_vec = _mm256_loadu_si256 (xsecret+i); - /* data_key = data_vec ^ key_vec; */ - __m256i const data_key = _mm256_xor_si256 (data_vec, key_vec); - /* data_key_lo = data_key >> 32; */ - __m256i const data_key_lo = _mm256_srli_epi64 (data_key, 32); - /* product = (data_key & 0xffffffff) * (data_key_lo & 0xffffffff); */ - __m256i const product = _mm256_mul_epu32 (data_key, data_key_lo); - /* xacc[i] += swap(data_vec); */ - __m256i const data_swap = _mm256_shuffle_epi32(data_vec, _MM_SHUFFLE(1, 0, 3, 2)); - __m256i const sum = _mm256_add_epi64(xacc[i], data_swap); - /* xacc[i] += product; */ - xacc[i] = _mm256_add_epi64(product, sum); - } } -} -XXH_FORCE_INLINE XXH_TARGET_AVX2 XXH3_ACCUMULATE_TEMPLATE(avx2) - -XXH_FORCE_INLINE XXH_TARGET_AVX2 void -XXH3_scrambleAcc_avx2(void* XXH_RESTRICT acc, const void* XXH_RESTRICT secret) -{ - XXH_ASSERT((((size_t)acc) & 31) == 0); - { __m256i* const xacc = (__m256i*) acc; - /* Unaligned. This is mainly for pointer arithmetic, and because - * _mm256_loadu_si256 requires a const __m256i * pointer for some reason. 
 */
-        const __m256i* const xsecret = (const __m256i *) secret;
-        const __m256i prime32 = _mm256_set1_epi32((int)XXH_PRIME32_1);
-
-        size_t i;
-        for (i=0; i < XXH_STRIPE_LEN/sizeof(__m256i); i++) {
-            /* xacc[i] ^= (xacc[i] >> 47) */
-            __m256i const acc_vec     = xacc[i];
-            __m256i const shifted     = _mm256_srli_epi64 (acc_vec, 47);
-            __m256i const data_vec    = _mm256_xor_si256  (acc_vec, shifted);
-            /* xacc[i] ^= xsecret; */
-            __m256i const key_vec     = _mm256_loadu_si256(xsecret+i);
-            __m256i const data_key    = _mm256_xor_si256  (data_vec, key_vec);
-
-            /* xacc[i] *= XXH_PRIME32_1; */
-            __m256i const data_key_hi = _mm256_srli_epi64 (data_key, 32);
-            __m256i const prod_lo     = _mm256_mul_epu32  (data_key, prime32);
-            __m256i const prod_hi     = _mm256_mul_epu32  (data_key_hi, prime32);
-            xacc[i] = _mm256_add_epi64(prod_lo, _mm256_slli_epi64(prod_hi, 32));
-        }
-    }
-}
-
-XXH_FORCE_INLINE XXH_TARGET_AVX2 void XXH3_initCustomSecret_avx2(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
-{
-    XXH_STATIC_ASSERT((XXH_SECRET_DEFAULT_SIZE & 31) == 0);
-    XXH_STATIC_ASSERT((XXH_SECRET_DEFAULT_SIZE / sizeof(__m256i)) == 6);
-    XXH_STATIC_ASSERT(XXH_SEC_ALIGN <= 64);
-    (void)(&XXH_writeLE64);
-    XXH_PREFETCH(customSecret);
-    {   __m256i const seed = _mm256_set_epi64x((xxh_i64)(0U - seed64), (xxh_i64)seed64, (xxh_i64)(0U - seed64), (xxh_i64)seed64);
-
-        const __m256i* const src  = (const __m256i*) ((const void*) XXH3_kSecret);
-        __m256i* dest = ( __m256i*) customSecret;
-
-# if defined(__GNUC__) || defined(__clang__)
-        /*
-         * On GCC & Clang, marking 'dest' as modified will cause the compiler to:
-         *   - not extract the secret from SSE registers in the internal loop
-         *   - use fewer common registers, and avoid pushing these registers onto the stack
-         */
-        XXH_COMPILER_GUARD(dest);
-# endif
-        XXH_ASSERT(((size_t)src & 31) == 0); /* control alignment */
-        XXH_ASSERT(((size_t)dest & 31) == 0);
-
-        /* GCC -O2 needs the loop unrolled manually */
-        dest[0] = _mm256_add_epi64(_mm256_load_si256(src+0), seed);
-        dest[1] =
_mm256_add_epi64(_mm256_load_si256(src+1), seed); - dest[2] = _mm256_add_epi64(_mm256_load_si256(src+2), seed); - dest[3] = _mm256_add_epi64(_mm256_load_si256(src+3), seed); - dest[4] = _mm256_add_epi64(_mm256_load_si256(src+4), seed); - dest[5] = _mm256_add_epi64(_mm256_load_si256(src+5), seed); - } -} - -#endif - -/* x86dispatch always generates SSE2 */ -#if (XXH_VECTOR == XXH_SSE2) || defined(XXH_X86DISPATCH) - -#ifndef XXH_TARGET_SSE2 -# define XXH_TARGET_SSE2 /* disable attribute target */ -#endif - -XXH_FORCE_INLINE XXH_TARGET_SSE2 void -XXH3_accumulate_512_sse2( void* XXH_RESTRICT acc, - const void* XXH_RESTRICT input, - const void* XXH_RESTRICT secret) -{ - /* SSE2 is just a half-scale version of the AVX2 version. */ - XXH_ASSERT((((size_t)acc) & 15) == 0); - { __m128i* const xacc = (__m128i *) acc; - /* Unaligned. This is mainly for pointer arithmetic, and because - * _mm_loadu_si128 requires a const __m128i * pointer for some reason. */ - const __m128i* const xinput = (const __m128i *) input; - /* Unaligned. This is mainly for pointer arithmetic, and because - * _mm_loadu_si128 requires a const __m128i * pointer for some reason. 
*/ - const __m128i* const xsecret = (const __m128i *) secret; - - size_t i; - for (i=0; i < XXH_STRIPE_LEN/sizeof(__m128i); i++) { - /* data_vec = xinput[i]; */ - __m128i const data_vec = _mm_loadu_si128 (xinput+i); - /* key_vec = xsecret[i]; */ - __m128i const key_vec = _mm_loadu_si128 (xsecret+i); - /* data_key = data_vec ^ key_vec; */ - __m128i const data_key = _mm_xor_si128 (data_vec, key_vec); - /* data_key_lo = data_key >> 32; */ - __m128i const data_key_lo = _mm_shuffle_epi32 (data_key, _MM_SHUFFLE(0, 3, 0, 1)); - /* product = (data_key & 0xffffffff) * (data_key_lo & 0xffffffff); */ - __m128i const product = _mm_mul_epu32 (data_key, data_key_lo); - /* xacc[i] += swap(data_vec); */ - __m128i const data_swap = _mm_shuffle_epi32(data_vec, _MM_SHUFFLE(1,0,3,2)); - __m128i const sum = _mm_add_epi64(xacc[i], data_swap); - /* xacc[i] += product; */ - xacc[i] = _mm_add_epi64(product, sum); - } } -} -XXH_FORCE_INLINE XXH_TARGET_SSE2 XXH3_ACCUMULATE_TEMPLATE(sse2) - -XXH_FORCE_INLINE XXH_TARGET_SSE2 void -XXH3_scrambleAcc_sse2(void* XXH_RESTRICT acc, const void* XXH_RESTRICT secret) -{ - XXH_ASSERT((((size_t)acc) & 15) == 0); - { __m128i* const xacc = (__m128i*) acc; - /* Unaligned. This is mainly for pointer arithmetic, and because - * _mm_loadu_si128 requires a const __m128i * pointer for some reason. 
 */
-        const __m128i* const xsecret = (const __m128i *) secret;
-        const __m128i prime32 = _mm_set1_epi32((int)XXH_PRIME32_1);
-
-        size_t i;
-        for (i=0; i < XXH_STRIPE_LEN/sizeof(__m128i); i++) {
-            /* xacc[i] ^= (xacc[i] >> 47) */
-            __m128i const acc_vec     = xacc[i];
-            __m128i const shifted     = _mm_srli_epi64   (acc_vec, 47);
-            __m128i const data_vec    = _mm_xor_si128    (acc_vec, shifted);
-            /* xacc[i] ^= xsecret[i]; */
-            __m128i const key_vec     = _mm_loadu_si128  (xsecret+i);
-            __m128i const data_key    = _mm_xor_si128    (data_vec, key_vec);
-
-            /* xacc[i] *= XXH_PRIME32_1; */
-            __m128i const data_key_hi = _mm_shuffle_epi32(data_key, _MM_SHUFFLE(0, 3, 0, 1));
-            __m128i const prod_lo     = _mm_mul_epu32    (data_key, prime32);
-            __m128i const prod_hi     = _mm_mul_epu32    (data_key_hi, prime32);
-            xacc[i] = _mm_add_epi64(prod_lo, _mm_slli_epi64(prod_hi, 32));
-        }
-    }
-}
-
-XXH_FORCE_INLINE XXH_TARGET_SSE2 void XXH3_initCustomSecret_sse2(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
-{
-    XXH_STATIC_ASSERT((XXH_SECRET_DEFAULT_SIZE & 15) == 0);
-    (void)(&XXH_writeLE64);
-    {   int const nbRounds = XXH_SECRET_DEFAULT_SIZE / sizeof(__m128i);
-
-# if defined(_MSC_VER) && defined(_M_IX86) && _MSC_VER < 1900
-        /* MSVC 32bit mode does not support _mm_set_epi64x before 2015 */
-        XXH_ALIGN(16) const xxh_i64 seed64x2[2] = { (xxh_i64)seed64, (xxh_i64)(0U - seed64) };
-        __m128i const seed = _mm_load_si128((__m128i const*)seed64x2);
-# else
-        __m128i const seed = _mm_set_epi64x((xxh_i64)(0U - seed64), (xxh_i64)seed64);
-# endif
-        int i;
-
-        const void* const src16 = XXH3_kSecret;
-        __m128i* dst16 = (__m128i*) customSecret;
-# if defined(__GNUC__) || defined(__clang__)
-        /*
-         * On GCC & Clang, marking 'dest' as modified will cause the compiler to:
-         *   - not extract the secret from SSE registers in the internal loop
-         *   - use fewer common registers, and avoid pushing these registers onto the stack
-         */
-        XXH_COMPILER_GUARD(dst16);
-# endif
-        XXH_ASSERT(((size_t)src16 & 15) == 0); /* control alignment */
-        XXH_ASSERT(((size_t)dst16 & 15) == 0);
-
-        for (i=0; i < nbRounds; ++i) {
-            dst16[i] = _mm_add_epi64(_mm_load_si128((const __m128i *)src16+i), seed);
-        }
-    }
-}
-
-#endif
-
-#if (XXH_VECTOR == XXH_NEON)
-
-/* forward declarations for the scalar routines */
-XXH_FORCE_INLINE void
-XXH3_scalarRound(void* XXH_RESTRICT acc, void const* XXH_RESTRICT input,
-                 void const* XXH_RESTRICT secret, size_t lane);
-
-XXH_FORCE_INLINE void
-XXH3_scalarScrambleRound(void* XXH_RESTRICT acc,
-                         void const* XXH_RESTRICT secret, size_t lane);
-
-/*!
- * @internal
- * @brief The bulk processing loop for NEON and WASM SIMD128.
- *
- * The NEON code path is actually partially scalar when running on AArch64. This
- * is to optimize the pipelining and can give up to a 15% speedup depending on the
- * CPU, and it also mitigates some GCC codegen issues.
- *
- * @see XXH3_NEON_LANES for configuring this and details about this optimization.
- *
- * NEON's 32-bit to 64-bit long multiply takes a half vector of 32-bit
- * integers instead of the other platforms which mask full 64-bit vectors,
- * so the setup is more complicated than just shifting right.
- *
- * Additionally, there is an optimization for 4 lanes at once noted below.
- *
- * Since, as stated, the optimal number of lanes for Cortexes is 6,
- * there need to be *three* versions of the accumulate operation used
- * for the remaining 2 lanes.
- *
- * WASM's SIMD128 uses SIMDe's arm_neon.h polyfill because the intrinsics overlap
- * nearly perfectly.
- */ - -XXH_FORCE_INLINE void -XXH3_accumulate_512_neon( void* XXH_RESTRICT acc, - const void* XXH_RESTRICT input, - const void* XXH_RESTRICT secret) -{ - XXH_ASSERT((((size_t)acc) & 15) == 0); - XXH_STATIC_ASSERT(XXH3_NEON_LANES > 0 && XXH3_NEON_LANES <= XXH_ACC_NB && XXH3_NEON_LANES % 2 == 0); - { /* GCC for darwin arm64 does not like aliasing here */ - xxh_aliasing_uint64x2_t* const xacc = (xxh_aliasing_uint64x2_t*) acc; - /* We don't use a uint32x4_t pointer because it causes bus errors on ARMv7. */ - uint8_t const* xinput = (const uint8_t *) input; - uint8_t const* xsecret = (const uint8_t *) secret; - - size_t i; -#ifdef __wasm_simd128__ - /* - * On WASM SIMD128, Clang emits direct address loads when XXH3_kSecret - * is constant propagated, which results in it converting it to this - * inside the loop: - * - * a = v128.load(XXH3_kSecret + 0 + $secret_offset, offset = 0) - * b = v128.load(XXH3_kSecret + 16 + $secret_offset, offset = 0) - * ... - * - * This requires a full 32-bit address immediate (and therefore a 6 byte - * instruction) as well as an add for each offset. - * - * Putting an asm guard prevents it from folding (at the cost of losing - * the alignment hint), and uses the free offset in `v128.load` instead - * of adding secret_offset each time which overall reduces code size by - * about a kilobyte and improves performance. - */ - XXH_COMPILER_GUARD(xsecret); -#endif - /* Scalar lanes use the normal scalarRound routine */ - for (i = XXH3_NEON_LANES; i < XXH_ACC_NB; i++) { - XXH3_scalarRound(acc, input, secret, i); - } - i = 0; - /* 4 NEON lanes at a time. 
*/ - for (; i+1 < XXH3_NEON_LANES / 2; i+=2) { - /* data_vec = xinput[i]; */ - uint64x2_t data_vec_1 = XXH_vld1q_u64(xinput + (i * 16)); - uint64x2_t data_vec_2 = XXH_vld1q_u64(xinput + ((i+1) * 16)); - /* key_vec = xsecret[i]; */ - uint64x2_t key_vec_1 = XXH_vld1q_u64(xsecret + (i * 16)); - uint64x2_t key_vec_2 = XXH_vld1q_u64(xsecret + ((i+1) * 16)); - /* data_swap = swap(data_vec) */ - uint64x2_t data_swap_1 = vextq_u64(data_vec_1, data_vec_1, 1); - uint64x2_t data_swap_2 = vextq_u64(data_vec_2, data_vec_2, 1); - /* data_key = data_vec ^ key_vec; */ - uint64x2_t data_key_1 = veorq_u64(data_vec_1, key_vec_1); - uint64x2_t data_key_2 = veorq_u64(data_vec_2, key_vec_2); - - /* - * If we reinterpret the 64x2 vectors as 32x4 vectors, we can use a - * de-interleave operation for 4 lanes in 1 step with `vuzpq_u32` to - * get one vector with the low 32 bits of each lane, and one vector - * with the high 32 bits of each lane. - * - * The intrinsic returns a double vector because the original ARMv7-a - * instruction modified both arguments in place. AArch64 and SIMD128 emit - * two instructions from this intrinsic. - * - * [ dk11L | dk11H | dk12L | dk12H ] -> [ dk11L | dk12L | dk21L | dk22L ] - * [ dk21L | dk21H | dk22L | dk22H ] -> [ dk11H | dk12H | dk21H | dk22H ] - */ - uint32x4x2_t unzipped = vuzpq_u32( - vreinterpretq_u32_u64(data_key_1), - vreinterpretq_u32_u64(data_key_2) - ); - /* data_key_lo = data_key & 0xFFFFFFFF */ - uint32x4_t data_key_lo = unzipped.val[0]; - /* data_key_hi = data_key >> 32 */ - uint32x4_t data_key_hi = unzipped.val[1]; - /* - * Then, we can split the vectors horizontally and multiply which, as for most - * widening intrinsics, have a variant that works on both high half vectors - * for free on AArch64. A similar instruction is available on SIMD128. 
- * - * sum = data_swap + (u64x2) data_key_lo * (u64x2) data_key_hi - */ - uint64x2_t sum_1 = XXH_vmlal_low_u32(data_swap_1, data_key_lo, data_key_hi); - uint64x2_t sum_2 = XXH_vmlal_high_u32(data_swap_2, data_key_lo, data_key_hi); - /* - * Clang reorders - * a += b * c; // umlal swap.2d, dkl.2s, dkh.2s - * c += a; // add acc.2d, acc.2d, swap.2d - * to - * c += a; // add acc.2d, acc.2d, swap.2d - * c += b * c; // umlal acc.2d, dkl.2s, dkh.2s - * - * While it would make sense in theory since the addition is faster, - * for reasons likely related to umlal being limited to certain NEON - * pipelines, this is worse. A compiler guard fixes this. - */ - XXH_COMPILER_GUARD_CLANG_NEON(sum_1); - XXH_COMPILER_GUARD_CLANG_NEON(sum_2); - /* xacc[i] = acc_vec + sum; */ - xacc[i] = vaddq_u64(xacc[i], sum_1); - xacc[i+1] = vaddq_u64(xacc[i+1], sum_2); - } - /* Operate on the remaining NEON lanes 2 at a time. */ - for (; i < XXH3_NEON_LANES / 2; i++) { - /* data_vec = xinput[i]; */ - uint64x2_t data_vec = XXH_vld1q_u64(xinput + (i * 16)); - /* key_vec = xsecret[i]; */ - uint64x2_t key_vec = XXH_vld1q_u64(xsecret + (i * 16)); - /* acc_vec_2 = swap(data_vec) */ - uint64x2_t data_swap = vextq_u64(data_vec, data_vec, 1); - /* data_key = data_vec ^ key_vec; */ - uint64x2_t data_key = veorq_u64(data_vec, key_vec); - /* For two lanes, just use VMOVN and VSHRN. 
*/ - /* data_key_lo = data_key & 0xFFFFFFFF; */ - uint32x2_t data_key_lo = vmovn_u64(data_key); - /* data_key_hi = data_key >> 32; */ - uint32x2_t data_key_hi = vshrn_n_u64(data_key, 32); - /* sum = data_swap + (u64x2) data_key_lo * (u64x2) data_key_hi; */ - uint64x2_t sum = vmlal_u32(data_swap, data_key_lo, data_key_hi); - /* Same Clang workaround as before */ - XXH_COMPILER_GUARD_CLANG_NEON(sum); - /* xacc[i] = acc_vec + sum; */ - xacc[i] = vaddq_u64 (xacc[i], sum); - } - } -} -XXH_FORCE_INLINE XXH3_ACCUMULATE_TEMPLATE(neon) - -XXH_FORCE_INLINE void -XXH3_scrambleAcc_neon(void* XXH_RESTRICT acc, const void* XXH_RESTRICT secret) -{ - XXH_ASSERT((((size_t)acc) & 15) == 0); - - { xxh_aliasing_uint64x2_t* xacc = (xxh_aliasing_uint64x2_t*) acc; - uint8_t const* xsecret = (uint8_t const*) secret; - - size_t i; - /* WASM uses operator overloads and doesn't need these. */ -#ifndef __wasm_simd128__ - /* { prime32_1, prime32_1 } */ - uint32x2_t const kPrimeLo = vdup_n_u32(XXH_PRIME32_1); - /* { 0, prime32_1, 0, prime32_1 } */ - uint32x4_t const kPrimeHi = vreinterpretq_u32_u64(vdupq_n_u64((xxh_u64)XXH_PRIME32_1 << 32)); -#endif - - /* AArch64 uses both scalar and neon at the same time */ - for (i = XXH3_NEON_LANES; i < XXH_ACC_NB; i++) { - XXH3_scalarScrambleRound(acc, secret, i); - } - for (i=0; i < XXH3_NEON_LANES / 2; i++) { - /* xacc[i] ^= (xacc[i] >> 47); */ - uint64x2_t acc_vec = xacc[i]; - uint64x2_t shifted = vshrq_n_u64(acc_vec, 47); - uint64x2_t data_vec = veorq_u64(acc_vec, shifted); - - /* xacc[i] ^= xsecret[i]; */ - uint64x2_t key_vec = XXH_vld1q_u64(xsecret + (i * 16)); - uint64x2_t data_key = veorq_u64(data_vec, key_vec); - /* xacc[i] *= XXH_PRIME32_1 */ -#ifdef __wasm_simd128__ - /* SIMD128 has multiply by u64x2, use it instead of expanding and scalarizing */ - xacc[i] = data_key * XXH_PRIME32_1; -#else - /* - * Expanded version with portable NEON intrinsics - * - * lo(x) * lo(y) + (hi(x) * lo(y) << 32) - * - * prod_hi = hi(data_key) * lo(prime) << 32 - * - 
* Since we only need 32 bits of this multiply a trick can be used, reinterpreting the vector - * as a uint32x4_t and multiplying by { 0, prime, 0, prime } to cancel out the unwanted bits - * and avoid the shift. - */ - uint32x4_t prod_hi = vmulq_u32 (vreinterpretq_u32_u64(data_key), kPrimeHi); - /* Extract low bits for vmlal_u32 */ - uint32x2_t data_key_lo = vmovn_u64(data_key); - /* xacc[i] = prod_hi + lo(data_key) * XXH_PRIME32_1; */ - xacc[i] = vmlal_u32(vreinterpretq_u64_u32(prod_hi), data_key_lo, kPrimeLo); -#endif - } - } -} -#endif - -#if (XXH_VECTOR == XXH_VSX) - -XXH_FORCE_INLINE void -XXH3_accumulate_512_vsx( void* XXH_RESTRICT acc, - const void* XXH_RESTRICT input, - const void* XXH_RESTRICT secret) -{ - /* presumed aligned */ - xxh_aliasing_u64x2* const xacc = (xxh_aliasing_u64x2*) acc; - xxh_u8 const* const xinput = (xxh_u8 const*) input; /* no alignment restriction */ - xxh_u8 const* const xsecret = (xxh_u8 const*) secret; /* no alignment restriction */ - xxh_u64x2 const v32 = { 32, 32 }; - size_t i; - for (i = 0; i < XXH_STRIPE_LEN / sizeof(xxh_u64x2); i++) { - /* data_vec = xinput[i]; */ - xxh_u64x2 const data_vec = XXH_vec_loadu(xinput + 16*i); - /* key_vec = xsecret[i]; */ - xxh_u64x2 const key_vec = XXH_vec_loadu(xsecret + 16*i); - xxh_u64x2 const data_key = data_vec ^ key_vec; - /* shuffled = (data_key << 32) | (data_key >> 32); */ - xxh_u32x4 const shuffled = (xxh_u32x4)vec_rl(data_key, v32); - /* product = ((xxh_u64x2)data_key & 0xFFFFFFFF) * ((xxh_u64x2)shuffled & 0xFFFFFFFF); */ - xxh_u64x2 const product = XXH_vec_mulo((xxh_u32x4)data_key, shuffled); - /* acc_vec = xacc[i]; */ - xxh_u64x2 acc_vec = xacc[i]; - acc_vec += product; - - /* swap high and low halves */ -#ifdef __s390x__ - acc_vec += vec_permi(data_vec, data_vec, 2); -#else - acc_vec += vec_xxpermdi(data_vec, data_vec, 2); -#endif - xacc[i] = acc_vec; - } -} -XXH_FORCE_INLINE XXH3_ACCUMULATE_TEMPLATE(vsx) - -XXH_FORCE_INLINE void -XXH3_scrambleAcc_vsx(void* XXH_RESTRICT acc, const 
void* XXH_RESTRICT secret) -{ - XXH_ASSERT((((size_t)acc) & 15) == 0); - - { xxh_aliasing_u64x2* const xacc = (xxh_aliasing_u64x2*) acc; - const xxh_u8* const xsecret = (const xxh_u8*) secret; - /* constants */ - xxh_u64x2 const v32 = { 32, 32 }; - xxh_u64x2 const v47 = { 47, 47 }; - xxh_u32x4 const prime = { XXH_PRIME32_1, XXH_PRIME32_1, XXH_PRIME32_1, XXH_PRIME32_1 }; - size_t i; - for (i = 0; i < XXH_STRIPE_LEN / sizeof(xxh_u64x2); i++) { - /* xacc[i] ^= (xacc[i] >> 47); */ - xxh_u64x2 const acc_vec = xacc[i]; - xxh_u64x2 const data_vec = acc_vec ^ (acc_vec >> v47); - - /* xacc[i] ^= xsecret[i]; */ - xxh_u64x2 const key_vec = XXH_vec_loadu(xsecret + 16*i); - xxh_u64x2 const data_key = data_vec ^ key_vec; - - /* xacc[i] *= XXH_PRIME32_1 */ - /* prod_lo = ((xxh_u64x2)data_key & 0xFFFFFFFF) * ((xxh_u64x2)prime & 0xFFFFFFFF); */ - xxh_u64x2 const prod_even = XXH_vec_mule((xxh_u32x4)data_key, prime); - /* prod_hi = ((xxh_u64x2)data_key >> 32) * ((xxh_u64x2)prime >> 32); */ - xxh_u64x2 const prod_odd = XXH_vec_mulo((xxh_u32x4)data_key, prime); - xacc[i] = prod_odd + (prod_even << v32); - } } -} - -#endif - -#if (XXH_VECTOR == XXH_SVE) - -XXH_FORCE_INLINE void -XXH3_accumulate_512_sve( void* XXH_RESTRICT acc, - const void* XXH_RESTRICT input, - const void* XXH_RESTRICT secret) -{ - uint64_t *xacc = (uint64_t *)acc; - const uint64_t *xinput = (const uint64_t *)(const void *)input; - const uint64_t *xsecret = (const uint64_t *)(const void *)secret; - svuint64_t kSwap = sveor_n_u64_z(svptrue_b64(), svindex_u64(0, 1), 1); - uint64_t element_count = svcntd(); - if (element_count >= 8) { - svbool_t mask = svptrue_pat_b64(SV_VL8); - svuint64_t vacc = svld1_u64(mask, xacc); - ACCRND(vacc, 0); - svst1_u64(mask, xacc, vacc); - } else if (element_count == 2) { /* sve128 */ - svbool_t mask = svptrue_pat_b64(SV_VL2); - svuint64_t acc0 = svld1_u64(mask, xacc + 0); - svuint64_t acc1 = svld1_u64(mask, xacc + 2); - svuint64_t acc2 = svld1_u64(mask, xacc + 4); - svuint64_t acc3 = 
svld1_u64(mask, xacc + 6); - ACCRND(acc0, 0); - ACCRND(acc1, 2); - ACCRND(acc2, 4); - ACCRND(acc3, 6); - svst1_u64(mask, xacc + 0, acc0); - svst1_u64(mask, xacc + 2, acc1); - svst1_u64(mask, xacc + 4, acc2); - svst1_u64(mask, xacc + 6, acc3); - } else { - svbool_t mask = svptrue_pat_b64(SV_VL4); - svuint64_t acc0 = svld1_u64(mask, xacc + 0); - svuint64_t acc1 = svld1_u64(mask, xacc + 4); - ACCRND(acc0, 0); - ACCRND(acc1, 4); - svst1_u64(mask, xacc + 0, acc0); - svst1_u64(mask, xacc + 4, acc1); - } -} - -XXH_FORCE_INLINE void -XXH3_accumulate_sve(xxh_u64* XXH_RESTRICT acc, - const xxh_u8* XXH_RESTRICT input, - const xxh_u8* XXH_RESTRICT secret, - size_t nbStripes) -{ - if (nbStripes != 0) { - uint64_t *xacc = (uint64_t *)acc; - const uint64_t *xinput = (const uint64_t *)(const void *)input; - const uint64_t *xsecret = (const uint64_t *)(const void *)secret; - svuint64_t kSwap = sveor_n_u64_z(svptrue_b64(), svindex_u64(0, 1), 1); - uint64_t element_count = svcntd(); - if (element_count >= 8) { - svbool_t mask = svptrue_pat_b64(SV_VL8); - svuint64_t vacc = svld1_u64(mask, xacc + 0); - do { - /* svprfd(svbool_t, void *, enum svfprop); */ - svprfd(mask, xinput + 128, SV_PLDL1STRM); - ACCRND(vacc, 0); - xinput += 8; - xsecret += 1; - nbStripes--; - } while (nbStripes != 0); - - svst1_u64(mask, xacc + 0, vacc); - } else if (element_count == 2) { /* sve128 */ - svbool_t mask = svptrue_pat_b64(SV_VL2); - svuint64_t acc0 = svld1_u64(mask, xacc + 0); - svuint64_t acc1 = svld1_u64(mask, xacc + 2); - svuint64_t acc2 = svld1_u64(mask, xacc + 4); - svuint64_t acc3 = svld1_u64(mask, xacc + 6); - do { - svprfd(mask, xinput + 128, SV_PLDL1STRM); - ACCRND(acc0, 0); - ACCRND(acc1, 2); - ACCRND(acc2, 4); - ACCRND(acc3, 6); - xinput += 8; - xsecret += 1; - nbStripes--; - } while (nbStripes != 0); - - svst1_u64(mask, xacc + 0, acc0); - svst1_u64(mask, xacc + 2, acc1); - svst1_u64(mask, xacc + 4, acc2); - svst1_u64(mask, xacc + 6, acc3); - } else { - svbool_t mask = 
svptrue_pat_b64(SV_VL4); - svuint64_t acc0 = svld1_u64(mask, xacc + 0); - svuint64_t acc1 = svld1_u64(mask, xacc + 4); - do { - svprfd(mask, xinput + 128, SV_PLDL1STRM); - ACCRND(acc0, 0); - ACCRND(acc1, 4); - xinput += 8; - xsecret += 1; - nbStripes--; - } while (nbStripes != 0); - - svst1_u64(mask, xacc + 0, acc0); - svst1_u64(mask, xacc + 4, acc1); - } - } -} - -#endif - -/* scalar variants - universal */ - -#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__)) -/* - * In XXH3_scalarRound(), GCC and Clang have a similar codegen issue, where they - * emit an excess mask and a full 64-bit multiply-add (MADD X-form). - * - * While this might not seem like much, as AArch64 is a 64-bit architecture, only - * big Cortex designs have a full 64-bit multiplier. - * - * On the little cores, the smaller 32-bit multiplier is used, and full 64-bit - * multiplies expand to 2-3 multiplies in microcode. This has a major penalty - * of up to 4 latency cycles and 2 stall cycles in the multiply pipeline. - * - * Thankfully, AArch64 still provides the 32-bit long multiply-add (UMADDL) which does - * not have this penalty and does the mask automatically. - */ -XXH_FORCE_INLINE xxh_u64 -XXH_mult32to64_add64(xxh_u64 lhs, xxh_u64 rhs, xxh_u64 acc) -{ - xxh_u64 ret; - /* note: %x = 64-bit register, %w = 32-bit register */ - __asm__("umaddl %x0, %w1, %w2, %x3" : "=r" (ret) : "r" (lhs), "r" (rhs), "r" (acc)); - return ret; -} -#else -XXH_FORCE_INLINE xxh_u64 -XXH_mult32to64_add64(xxh_u64 lhs, xxh_u64 rhs, xxh_u64 acc) -{ - return XXH_mult32to64((xxh_u32)lhs, (xxh_u32)rhs) + acc; -} -#endif - -/*! - * @internal - * @brief Scalar round for @ref XXH3_accumulate_512_scalar(). - * - * This is extracted to its own function because the NEON path uses a combination - * of NEON and scalar. 
- */ -XXH_FORCE_INLINE void -XXH3_scalarRound(void* XXH_RESTRICT acc, - void const* XXH_RESTRICT input, - void const* XXH_RESTRICT secret, - size_t lane) -{ - xxh_u64* xacc = (xxh_u64*) acc; - xxh_u8 const* xinput = (xxh_u8 const*) input; - xxh_u8 const* xsecret = (xxh_u8 const*) secret; - XXH_ASSERT(lane < XXH_ACC_NB); - XXH_ASSERT(((size_t)acc & (XXH_ACC_ALIGN-1)) == 0); - { - xxh_u64 const data_val = XXH_readLE64(xinput + lane * 8); - xxh_u64 const data_key = data_val ^ XXH_readLE64(xsecret + lane * 8); - xacc[lane ^ 1] += data_val; /* swap adjacent lanes */ - xacc[lane] = XXH_mult32to64_add64(data_key /* & 0xFFFFFFFF */, data_key >> 32, xacc[lane]); - } -} - -/*! - * @internal - * @brief Processes a 64 byte block of data using the scalar path. - */ -XXH_FORCE_INLINE void -XXH3_accumulate_512_scalar(void* XXH_RESTRICT acc, - const void* XXH_RESTRICT input, - const void* XXH_RESTRICT secret) -{ - size_t i; - /* ARM GCC refuses to unroll this loop, resulting in a 24% slowdown on ARMv6. */ -#if defined(__GNUC__) && !defined(__clang__) \ - && (defined(__arm__) || defined(__thumb2__)) \ - && defined(__ARM_FEATURE_UNALIGNED) /* no unaligned access just wastes bytes */ \ - && XXH_SIZE_OPT <= 0 -# pragma GCC unroll 8 -#endif - for (i=0; i < XXH_ACC_NB; i++) { - XXH3_scalarRound(acc, input, secret, i); - } -} -XXH_FORCE_INLINE XXH3_ACCUMULATE_TEMPLATE(scalar) - -/*! - * @internal - * @brief Scalar scramble step for @ref XXH3_scrambleAcc_scalar(). - * - * This is extracted to its own function because the NEON path uses a combination - * of NEON and scalar. 
- */
-XXH_FORCE_INLINE void
-XXH3_scalarScrambleRound(void* XXH_RESTRICT acc,
-                         void const* XXH_RESTRICT secret,
-                         size_t lane)
-{
-    xxh_u64* const xacc = (xxh_u64*) acc;   /* presumed aligned */
-    const xxh_u8* const xsecret = (const xxh_u8*) secret;   /* no alignment restriction */
-    XXH_ASSERT((((size_t)acc) & (XXH_ACC_ALIGN-1)) == 0);
-    XXH_ASSERT(lane < XXH_ACC_NB);
-    {
-        xxh_u64 const key64 = XXH_readLE64(xsecret + lane * 8);
-        xxh_u64 acc64 = xacc[lane];
-        acc64 = XXH_xorshift64(acc64, 47);
-        acc64 ^= key64;
-        acc64 *= XXH_PRIME32_1;
-        xacc[lane] = acc64;
-    }
-}
-
-/*!
- * @internal
- * @brief Scrambles the accumulators after a large chunk has been read
- */
-XXH_FORCE_INLINE void
-XXH3_scrambleAcc_scalar(void* XXH_RESTRICT acc, const void* XXH_RESTRICT secret)
-{
-    size_t i;
-    for (i=0; i < XXH_ACC_NB; i++) {
-        XXH3_scalarScrambleRound(acc, secret, i);
-    }
-}
-
-XXH_FORCE_INLINE void
-XXH3_initCustomSecret_scalar(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
-{
-    /*
-     * We need a separate pointer for the hack below,
-     * which requires a non-const pointer.
-     * Any decent compiler will optimize this out otherwise.
-     */
-    const xxh_u8* kSecretPtr = XXH3_kSecret;
-    XXH_STATIC_ASSERT((XXH_SECRET_DEFAULT_SIZE & 15) == 0);
-
-#if defined(__GNUC__) && defined(__aarch64__)
-    /*
-     * UGLY HACK:
-     * GCC and Clang generate a bunch of MOV/MOVK pairs for aarch64, and they are
-     * placed sequentially, in order, at the top of the unrolled loop.
-     *
-     * While MOVK is great for generating constants (2 cycles for a 64-bit
-     * constant compared to 4 cycles for LDR), it fights for bandwidth with
-     * the arithmetic instructions.
-     *
-     *   I   L   S
-     *       MOVK
-     *       MOVK
-     *       MOVK
-     *       MOVK
-     *   ADD
-     *   SUB     STR
-     *           STR
-     * By forcing loads from memory (as the asm line causes the compiler to assume
-     * that kSecretPtr has been changed), the pipelines are used more
-     * efficiently:
-     *   I   L   S
-     *       LDR
-     *   ADD LDR
-     *   SUB     STR
-     *           STR
-     *
-     * See XXH3_NEON_LANES for details on the pipeline.
- * - * XXH3_64bits_withSeed, len == 256, Snapdragon 835 - * without hack: 2654.4 MB/s - * with hack: 3202.9 MB/s - */ - XXH_COMPILER_GUARD(kSecretPtr); -#endif - { int const nbRounds = XXH_SECRET_DEFAULT_SIZE / 16; - int i; - for (i=0; i < nbRounds; i++) { - /* - * The asm hack causes the compiler to assume that kSecretPtr aliases with - * customSecret, and on aarch64, this prevented LDP from merging two - * loads together for free. Putting the loads together before the stores - * properly generates LDP. - */ - xxh_u64 lo = XXH_readLE64(kSecretPtr + 16*i) + seed64; - xxh_u64 hi = XXH_readLE64(kSecretPtr + 16*i + 8) - seed64; - XXH_writeLE64((xxh_u8*)customSecret + 16*i, lo); - XXH_writeLE64((xxh_u8*)customSecret + 16*i + 8, hi); - } } -} - - -typedef void (*XXH3_f_accumulate)(xxh_u64* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, size_t); -typedef void (*XXH3_f_scrambleAcc)(void* XXH_RESTRICT, const void*); -typedef void (*XXH3_f_initCustomSecret)(void* XXH_RESTRICT, xxh_u64); - - -#if (XXH_VECTOR == XXH_AVX512) - -#define XXH3_accumulate_512 XXH3_accumulate_512_avx512 -#define XXH3_accumulate XXH3_accumulate_avx512 -#define XXH3_scrambleAcc XXH3_scrambleAcc_avx512 -#define XXH3_initCustomSecret XXH3_initCustomSecret_avx512 - -#elif (XXH_VECTOR == XXH_AVX2) - -#define XXH3_accumulate_512 XXH3_accumulate_512_avx2 -#define XXH3_accumulate XXH3_accumulate_avx2 -#define XXH3_scrambleAcc XXH3_scrambleAcc_avx2 -#define XXH3_initCustomSecret XXH3_initCustomSecret_avx2 - -#elif (XXH_VECTOR == XXH_SSE2) - -#define XXH3_accumulate_512 XXH3_accumulate_512_sse2 -#define XXH3_accumulate XXH3_accumulate_sse2 -#define XXH3_scrambleAcc XXH3_scrambleAcc_sse2 -#define XXH3_initCustomSecret XXH3_initCustomSecret_sse2 - -#elif (XXH_VECTOR == XXH_NEON) - -#define XXH3_accumulate_512 XXH3_accumulate_512_neon -#define XXH3_accumulate XXH3_accumulate_neon -#define XXH3_scrambleAcc XXH3_scrambleAcc_neon -#define XXH3_initCustomSecret XXH3_initCustomSecret_scalar - 
-#elif (XXH_VECTOR == XXH_VSX) - -#define XXH3_accumulate_512 XXH3_accumulate_512_vsx -#define XXH3_accumulate XXH3_accumulate_vsx -#define XXH3_scrambleAcc XXH3_scrambleAcc_vsx -#define XXH3_initCustomSecret XXH3_initCustomSecret_scalar - -#elif (XXH_VECTOR == XXH_SVE) -#define XXH3_accumulate_512 XXH3_accumulate_512_sve -#define XXH3_accumulate XXH3_accumulate_sve -#define XXH3_scrambleAcc XXH3_scrambleAcc_scalar -#define XXH3_initCustomSecret XXH3_initCustomSecret_scalar - -#else /* scalar */ - -#define XXH3_accumulate_512 XXH3_accumulate_512_scalar -#define XXH3_accumulate XXH3_accumulate_scalar -#define XXH3_scrambleAcc XXH3_scrambleAcc_scalar -#define XXH3_initCustomSecret XXH3_initCustomSecret_scalar - -#endif - -#if XXH_SIZE_OPT >= 1 /* don't do SIMD for initialization */ -# undef XXH3_initCustomSecret -# define XXH3_initCustomSecret XXH3_initCustomSecret_scalar -#endif - -XXH_FORCE_INLINE void -XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc, - const xxh_u8* XXH_RESTRICT input, size_t len, - const xxh_u8* XXH_RESTRICT secret, size_t secretSize, - XXH3_f_accumulate f_acc, - XXH3_f_scrambleAcc f_scramble) -{ - size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE; - size_t const block_len = XXH_STRIPE_LEN * nbStripesPerBlock; - size_t const nb_blocks = (len - 1) / block_len; - - size_t n; - - XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); - - for (n = 0; n < nb_blocks; n++) { - f_acc(acc, input + n*block_len, secret, nbStripesPerBlock); - f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN); - } - - /* last partial block */ - XXH_ASSERT(len > XXH_STRIPE_LEN); - { size_t const nbStripes = ((len - 1) - (block_len * nb_blocks)) / XXH_STRIPE_LEN; - XXH_ASSERT(nbStripes <= (secretSize / XXH_SECRET_CONSUME_RATE)); - f_acc(acc, input + nb_blocks*block_len, secret, nbStripes); - - /* last stripe */ - { const xxh_u8* const p = input + len - XXH_STRIPE_LEN; -#define XXH_SECRET_LASTACC_START 7 /* not aligned on 8, last secret 
is different from acc & scrambler */ - XXH3_accumulate_512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START); - } } -} - -XXH_FORCE_INLINE xxh_u64 -XXH3_mix2Accs(const xxh_u64* XXH_RESTRICT acc, const xxh_u8* XXH_RESTRICT secret) -{ - return XXH3_mul128_fold64( - acc[0] ^ XXH_readLE64(secret), - acc[1] ^ XXH_readLE64(secret+8) ); -} - -static XXH64_hash_t -XXH3_mergeAccs(const xxh_u64* XXH_RESTRICT acc, const xxh_u8* XXH_RESTRICT secret, xxh_u64 start) -{ - xxh_u64 result64 = start; - size_t i = 0; - - for (i = 0; i < 4; i++) { - result64 += XXH3_mix2Accs(acc+2*i, secret + 16*i); -#if defined(__clang__) /* Clang */ \ - && (defined(__arm__) || defined(__thumb__)) /* ARMv7 */ \ - && (defined(__ARM_NEON) || defined(__ARM_NEON__)) /* NEON */ \ - && !defined(XXH_ENABLE_AUTOVECTORIZE) /* Define to disable */ - /* - * UGLY HACK: - * Prevent autovectorization on Clang ARMv7-a. Exact same problem as - * the one in XXH3_len_129to240_64b. Speeds up shorter keys > 240b. - * XXH3_64bits, len == 256, Snapdragon 835: - * without hack: 2063.7 MB/s - * with hack: 2560.7 MB/s - */ - XXH_COMPILER_GUARD(result64); -#endif - } - - return XXH3_avalanche(result64); -} - -#define XXH3_INIT_ACC { XXH_PRIME32_3, XXH_PRIME64_1, XXH_PRIME64_2, XXH_PRIME64_3, \ - XXH_PRIME64_4, XXH_PRIME32_2, XXH_PRIME64_5, XXH_PRIME32_1 } - -XXH_FORCE_INLINE XXH64_hash_t -XXH3_hashLong_64b_internal(const void* XXH_RESTRICT input, size_t len, - const void* XXH_RESTRICT secret, size_t secretSize, - XXH3_f_accumulate f_acc, - XXH3_f_scrambleAcc f_scramble) -{ - XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[XXH_ACC_NB] = XXH3_INIT_ACC; - - XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, (const xxh_u8*)secret, secretSize, f_acc, f_scramble); - - /* converge into final hash */ - XXH_STATIC_ASSERT(sizeof(acc) == 64); - /* do not align on 8, so that the secret is different from the accumulator */ -#define XXH_SECRET_MERGEACCS_START 11 - XXH_ASSERT(secretSize >= sizeof(acc) + 
XXH_SECRET_MERGEACCS_START); - return XXH3_mergeAccs(acc, (const xxh_u8*)secret + XXH_SECRET_MERGEACCS_START, (xxh_u64)len * XXH_PRIME64_1); -} - -/* - * It's important for performance to transmit secret's size (when it's static) - * so that the compiler can properly optimize the vectorized loop. - * This makes a big performance difference for "medium" keys (<1 KB) when using AVX instruction set. - * When the secret size is unknown, or on GCC 12 where the mix of NO_INLINE and FORCE_INLINE - * breaks -Og, this is XXH_NO_INLINE. - */ -XXH3_WITH_SECRET_INLINE XXH64_hash_t -XXH3_hashLong_64b_withSecret(const void* XXH_RESTRICT input, size_t len, - XXH64_hash_t seed64, const xxh_u8* XXH_RESTRICT secret, size_t secretLen) -{ - (void)seed64; - return XXH3_hashLong_64b_internal(input, len, secret, secretLen, XXH3_accumulate, XXH3_scrambleAcc); -} - -/* - * It's preferable for performance that XXH3_hashLong is not inlined, - * as it results in a smaller function for small data, easier to the instruction cache. - * Note that inside this no_inline function, we do inline the internal loop, - * and provide a statically defined secret size to allow optimization of vector loop. - */ -XXH_NO_INLINE XXH_PUREF XXH64_hash_t -XXH3_hashLong_64b_default(const void* XXH_RESTRICT input, size_t len, - XXH64_hash_t seed64, const xxh_u8* XXH_RESTRICT secret, size_t secretLen) -{ - (void)seed64; (void)secret; (void)secretLen; - return XXH3_hashLong_64b_internal(input, len, XXH3_kSecret, sizeof(XXH3_kSecret), XXH3_accumulate, XXH3_scrambleAcc); -} - -/* - * XXH3_hashLong_64b_withSeed(): - * Generate a custom key based on alteration of default XXH3_kSecret with the seed, - * and then use this key for long mode hashing. - * - * This operation is decently fast but nonetheless costs a little bit of time. - * Try to avoid it whenever possible (typically when seed==0). - * - * It's important for performance that XXH3_hashLong is not inlined. 
Not sure - * why (uop cache maybe?), but the difference is large and easily measurable. - */ -XXH_FORCE_INLINE XXH64_hash_t -XXH3_hashLong_64b_withSeed_internal(const void* input, size_t len, - XXH64_hash_t seed, - XXH3_f_accumulate f_acc, - XXH3_f_scrambleAcc f_scramble, - XXH3_f_initCustomSecret f_initSec) -{ -#if XXH_SIZE_OPT <= 0 - if (seed == 0) - return XXH3_hashLong_64b_internal(input, len, - XXH3_kSecret, sizeof(XXH3_kSecret), - f_acc, f_scramble); -#endif - { XXH_ALIGN(XXH_SEC_ALIGN) xxh_u8 secret[XXH_SECRET_DEFAULT_SIZE]; - f_initSec(secret, seed); - return XXH3_hashLong_64b_internal(input, len, secret, sizeof(secret), - f_acc, f_scramble); - } -} - -/* - * It's important for performance that XXH3_hashLong is not inlined. - */ -XXH_NO_INLINE XXH64_hash_t -XXH3_hashLong_64b_withSeed(const void* XXH_RESTRICT input, size_t len, - XXH64_hash_t seed, const xxh_u8* XXH_RESTRICT secret, size_t secretLen) -{ - (void)secret; (void)secretLen; - return XXH3_hashLong_64b_withSeed_internal(input, len, seed, - XXH3_accumulate, XXH3_scrambleAcc, XXH3_initCustomSecret); -} - - -typedef XXH64_hash_t (*XXH3_hashLong64_f)(const void* XXH_RESTRICT, size_t, - XXH64_hash_t, const xxh_u8* XXH_RESTRICT, size_t); - -XXH_FORCE_INLINE XXH64_hash_t -XXH3_64bits_internal(const void* XXH_RESTRICT input, size_t len, - XXH64_hash_t seed64, const void* XXH_RESTRICT secret, size_t secretLen, - XXH3_hashLong64_f f_hashLong) -{ - XXH_ASSERT(secretLen >= XXH3_SECRET_SIZE_MIN); - /* - * If an action is to be taken if `secretLen` condition is not respected, - * it should be done here. - * For now, it's a contract pre-condition. - * Adding a check and a branch here would cost performance at every hash. - * Also, note that function signature doesn't offer room to return an error. 
- */ - if (len <= 16) - return XXH3_len_0to16_64b((const xxh_u8*)input, len, (const xxh_u8*)secret, seed64); - if (len <= 128) - return XXH3_len_17to128_64b((const xxh_u8*)input, len, (const xxh_u8*)secret, secretLen, seed64); - if (len <= XXH3_MIDSIZE_MAX) - return XXH3_len_129to240_64b((const xxh_u8*)input, len, (const xxh_u8*)secret, secretLen, seed64); - return f_hashLong(input, len, seed64, (const xxh_u8*)secret, secretLen); -} - - -/* === Public entry point === */ - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH64_hash_t XXH3_64bits(XXH_NOESCAPE const void* input, size_t length) -{ - return XXH3_64bits_internal(input, length, 0, XXH3_kSecret, sizeof(XXH3_kSecret), XXH3_hashLong_64b_default); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH64_hash_t -XXH3_64bits_withSecret(XXH_NOESCAPE const void* input, size_t length, XXH_NOESCAPE const void* secret, size_t secretSize) -{ - return XXH3_64bits_internal(input, length, 0, secret, secretSize, XXH3_hashLong_64b_withSecret); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH64_hash_t -XXH3_64bits_withSeed(XXH_NOESCAPE const void* input, size_t length, XXH64_hash_t seed) -{ - return XXH3_64bits_internal(input, length, seed, XXH3_kSecret, sizeof(XXH3_kSecret), XXH3_hashLong_64b_withSeed); -} - -XXH_PUBLIC_API XXH64_hash_t -XXH3_64bits_withSecretandSeed(XXH_NOESCAPE const void* input, size_t length, XXH_NOESCAPE const void* secret, size_t secretSize, XXH64_hash_t seed) -{ - if (length <= XXH3_MIDSIZE_MAX) - return XXH3_64bits_internal(input, length, seed, XXH3_kSecret, sizeof(XXH3_kSecret), NULL); - return XXH3_hashLong_64b_withSecret(input, length, seed, (const xxh_u8*)secret, secretSize); -} - - -/* === XXH3 streaming === */ -#ifndef XXH_NO_STREAM -/* - * Malloc's a pointer that is always aligned to align. - * - * This must be freed with `XXH_alignedFree()`. - * - * malloc typically guarantees 16 byte alignment on 64-bit systems and 8 byte - * alignment on 32-bit. 
This isn't enough for the 32 byte aligned loads in AVX2 - * or on 32-bit, the 16 byte aligned loads in SSE2 and NEON. - * - * This underalignment previously caused a rather obvious crash which went - * completely unnoticed due to XXH3_createState() not actually being tested. - * Credit to RedSpah for noticing this bug. - * - * The alignment is done manually: Functions like posix_memalign or _mm_malloc - * are avoided: To maintain portability, we would have to write a fallback - * like this anyways, and besides, testing for the existence of library - * functions without relying on external build tools is impossible. - * - * The method is simple: Overallocate, manually align, and store the offset - * to the original behind the returned pointer. - * - * Align must be a power of 2 and 8 <= align <= 128. - */ -static XXH_MALLOCF void* XXH_alignedMalloc(size_t s, size_t align) -{ - XXH_ASSERT(align <= 128 && align >= 8); /* range check */ - XXH_ASSERT((align & (align-1)) == 0); /* power of 2 */ - XXH_ASSERT(s != 0 && s < (s + align)); /* empty/overflow */ - { /* Overallocate to make room for manual realignment and an offset byte */ - xxh_u8* base = (xxh_u8*)XXH_malloc(s + align); - if (base != NULL) { - /* - * Get the offset needed to align this pointer. - * - * Even if the returned pointer is aligned, there will always be - * at least one byte to store the offset to the original pointer. - */ - size_t offset = align - ((size_t)base & (align - 1)); /* base % align */ - /* Add the offset for the now-aligned pointer */ - xxh_u8* ptr = base + offset; - - XXH_ASSERT((size_t)ptr % align == 0); - - /* Store the offset immediately before the returned pointer. */ - ptr[-1] = (xxh_u8)offset; - return ptr; - } - return NULL; - } -} -/* - * Frees an aligned pointer allocated by XXH_alignedMalloc(). Don't pass - * normal malloc'd pointers, XXH_alignedMalloc has a specific data layout. 
- */ -static void XXH_alignedFree(void* p) -{ - if (p != NULL) { - xxh_u8* ptr = (xxh_u8*)p; - /* Get the offset byte we added in XXH_malloc. */ - xxh_u8 offset = ptr[-1]; - /* Free the original malloc'd pointer */ - xxh_u8* base = ptr - offset; - XXH_free(base); - } -} -/*! @ingroup XXH3_family */ -/*! - * @brief Allocate an @ref XXH3_state_t. - * - * Must be freed with XXH3_freeState(). - * @return An allocated XXH3_state_t on success, `NULL` on failure. - */ -XXH_PUBLIC_API XXH3_state_t* XXH3_createState(void) -{ - XXH3_state_t* const state = (XXH3_state_t*)XXH_alignedMalloc(sizeof(XXH3_state_t), 64); - if (state==NULL) return NULL; - XXH3_INITSTATE(state); - return state; -} - -/*! @ingroup XXH3_family */ -/*! - * @brief Frees an @ref XXH3_state_t. - * - * Must be allocated with XXH3_createState(). - * @param statePtr A pointer to an @ref XXH3_state_t allocated with @ref XXH3_createState(). - * @return XXH_OK. - */ -XXH_PUBLIC_API XXH_errorcode XXH3_freeState(XXH3_state_t* statePtr) -{ - XXH_alignedFree(statePtr); - return XXH_OK; -} - -/*! 
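The `XXH_alignedMalloc`/`XXH_alignedFree` pair above relies on the overallocate-then-align trick: allocate `size + align` bytes, round the pointer up, and stash the distance moved in the byte immediately before the returned pointer so the free routine can walk back to the original base. A minimal standalone sketch of the same scheme (hypothetical names `aligned_malloc`/`aligned_free`, not the xxHash symbols) is:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the manual-alignment scheme described above.
 * Requires: align is a power of 2, 8 <= align <= 128. */
static void* aligned_malloc(size_t s, size_t align)
{
    /* Overallocate so there is always room to round up AND to store
     * the offset byte just before the returned pointer. */
    uint8_t* base = (uint8_t*)malloc(s + align);
    if (base == NULL) return NULL;
    {
        /* Distance to the next aligned address; in [1, align], so even an
         * already-aligned base leaves one byte of headroom for the offset. */
        size_t offset = align - ((size_t)base & (align - 1));
        uint8_t* ptr = base + offset;
        ptr[-1] = (uint8_t)offset;   /* remember how far we moved */
        return ptr;
    }
}

static void aligned_free(void* p)
{
    if (p != NULL) {
        uint8_t* ptr = (uint8_t*)p;
        free(ptr - ptr[-1]);         /* walk back to the original base */
    }
}
```

This is why the real code insists on `align <= 128`: the offset must fit in a single byte, and storing it inline avoids any dependence on `posix_memalign` or `_mm_malloc` being available.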
@ingroup XXH3_family */ -XXH_PUBLIC_API void -XXH3_copyState(XXH_NOESCAPE XXH3_state_t* dst_state, XXH_NOESCAPE const XXH3_state_t* src_state) -{ - XXH_memcpy(dst_state, src_state, sizeof(*dst_state)); -} - -static void -XXH3_reset_internal(XXH3_state_t* statePtr, - XXH64_hash_t seed, - const void* secret, size_t secretSize) -{ - size_t const initStart = offsetof(XXH3_state_t, bufferedSize); - size_t const initLength = offsetof(XXH3_state_t, nbStripesPerBlock) - initStart; - XXH_ASSERT(offsetof(XXH3_state_t, nbStripesPerBlock) > initStart); - XXH_ASSERT(statePtr != NULL); - /* set members from bufferedSize to nbStripesPerBlock (excluded) to 0 */ - memset((char*)statePtr + initStart, 0, initLength); - statePtr->acc[0] = XXH_PRIME32_3; - statePtr->acc[1] = XXH_PRIME64_1; - statePtr->acc[2] = XXH_PRIME64_2; - statePtr->acc[3] = XXH_PRIME64_3; - statePtr->acc[4] = XXH_PRIME64_4; - statePtr->acc[5] = XXH_PRIME32_2; - statePtr->acc[6] = XXH_PRIME64_5; - statePtr->acc[7] = XXH_PRIME32_1; - statePtr->seed = seed; - statePtr->useSeed = (seed != 0); - statePtr->extSecret = (const unsigned char*)secret; - XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); - statePtr->secretLimit = secretSize - XXH_STRIPE_LEN; - statePtr->nbStripesPerBlock = statePtr->secretLimit / XXH_SECRET_CONSUME_RATE; -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_64bits_reset(XXH_NOESCAPE XXH3_state_t* statePtr) -{ - if (statePtr == NULL) return XXH_ERROR; - XXH3_reset_internal(statePtr, 0, XXH3_kSecret, XXH_SECRET_DEFAULT_SIZE); - return XXH_OK; -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_64bits_reset_withSecret(XXH_NOESCAPE XXH3_state_t* statePtr, XXH_NOESCAPE const void* secret, size_t secretSize) -{ - if (statePtr == NULL) return XXH_ERROR; - XXH3_reset_internal(statePtr, 0, secret, secretSize); - if (secret == NULL) return XXH_ERROR; - if (secretSize < XXH3_SECRET_SIZE_MIN) return XXH_ERROR; - return XXH_OK; -} - -/*! 
@ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_64bits_reset_withSeed(XXH_NOESCAPE XXH3_state_t* statePtr, XXH64_hash_t seed) -{ - if (statePtr == NULL) return XXH_ERROR; - if (seed==0) return XXH3_64bits_reset(statePtr); - if ((seed != statePtr->seed) || (statePtr->extSecret != NULL)) - XXH3_initCustomSecret(statePtr->customSecret, seed); - XXH3_reset_internal(statePtr, seed, NULL, XXH_SECRET_DEFAULT_SIZE); - return XXH_OK; -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_64bits_reset_withSecretandSeed(XXH_NOESCAPE XXH3_state_t* statePtr, XXH_NOESCAPE const void* secret, size_t secretSize, XXH64_hash_t seed64) -{ - if (statePtr == NULL) return XXH_ERROR; - if (secret == NULL) return XXH_ERROR; - if (secretSize < XXH3_SECRET_SIZE_MIN) return XXH_ERROR; - XXH3_reset_internal(statePtr, seed64, secret, secretSize); - statePtr->useSeed = 1; /* always, even if seed64==0 */ - return XXH_OK; -} - -/*! - * @internal - * @brief Processes a large input for XXH3_update() and XXH3_digest_long(). - * - * Unlike XXH3_hashLong_internal_loop(), this can process data that overlaps a block. 
- * - * @param acc Pointer to the 8 accumulator lanes - * @param nbStripesSoFarPtr In/out pointer to the number of leftover stripes in the block - * @param nbStripesPerBlock Number of stripes in a block - * @param input Input pointer - * @param nbStripes Number of stripes to process - * @param secret Secret pointer - * @param secretLimit Offset of the last block in @p secret - * @param f_acc Pointer to an XXH3_accumulate implementation - * @param f_scramble Pointer to an XXH3_scrambleAcc implementation - * @return Pointer past the end of @p input after processing - */ -XXH_FORCE_INLINE const xxh_u8 * -XXH3_consumeStripes(xxh_u64* XXH_RESTRICT acc, - size_t* XXH_RESTRICT nbStripesSoFarPtr, size_t nbStripesPerBlock, - const xxh_u8* XXH_RESTRICT input, size_t nbStripes, - const xxh_u8* XXH_RESTRICT secret, size_t secretLimit, - XXH3_f_accumulate f_acc, - XXH3_f_scrambleAcc f_scramble) -{ - const xxh_u8* initialSecret = secret + *nbStripesSoFarPtr * XXH_SECRET_CONSUME_RATE; - /* Process full blocks */ - if (nbStripes >= (nbStripesPerBlock - *nbStripesSoFarPtr)) { - /* Process the initial partial block... 
*/ - size_t nbStripesThisIter = nbStripesPerBlock - *nbStripesSoFarPtr; - - do { - /* Accumulate and scramble */ - f_acc(acc, input, initialSecret, nbStripesThisIter); - f_scramble(acc, secret + secretLimit); - input += nbStripesThisIter * XXH_STRIPE_LEN; - nbStripes -= nbStripesThisIter; - /* Then continue the loop with the full block size */ - nbStripesThisIter = nbStripesPerBlock; - initialSecret = secret; - } while (nbStripes >= nbStripesPerBlock); - *nbStripesSoFarPtr = 0; - } - /* Process a partial block */ - if (nbStripes > 0) { - f_acc(acc, input, initialSecret, nbStripes); - input += nbStripes * XXH_STRIPE_LEN; - *nbStripesSoFarPtr += nbStripes; - } - /* Return end pointer */ - return input; -} - -#ifndef XXH3_STREAM_USE_STACK -# if XXH_SIZE_OPT <= 0 && !defined(__clang__) /* clang doesn't need additional stack space */ -# define XXH3_STREAM_USE_STACK 1 -# endif -#endif -/* - * Both XXH3_64bits_update and XXH3_128bits_update use this routine. - */ -XXH_FORCE_INLINE XXH_errorcode -XXH3_update(XXH3_state_t* XXH_RESTRICT const state, - const xxh_u8* XXH_RESTRICT input, size_t len, - XXH3_f_accumulate f_acc, - XXH3_f_scrambleAcc f_scramble) -{ - if (input==NULL) { - XXH_ASSERT(len == 0); - return XXH_OK; - } - - XXH_ASSERT(state != NULL); - { const xxh_u8* const bEnd = input + len; - const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extSecret; -#if defined(XXH3_STREAM_USE_STACK) && XXH3_STREAM_USE_STACK >= 1 - /* For some reason, gcc and MSVC seem to suffer greatly - * when operating accumulators directly into state. - * Operating into stack space seems to enable proper optimization. 
- * clang, on the other hand, doesn't seem to need this trick */ - XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[8]; - XXH_memcpy(acc, state->acc, sizeof(acc)); -#else - xxh_u64* XXH_RESTRICT const acc = state->acc; -#endif - state->totalLen += len; - XXH_ASSERT(state->bufferedSize <= XXH3_INTERNALBUFFER_SIZE); - - /* small input : just fill in tmp buffer */ - if (len <= XXH3_INTERNALBUFFER_SIZE - state->bufferedSize) { - XXH_memcpy(state->buffer + state->bufferedSize, input, len); - state->bufferedSize += (XXH32_hash_t)len; - return XXH_OK; - } - - /* total input is now > XXH3_INTERNALBUFFER_SIZE */ - #define XXH3_INTERNALBUFFER_STRIPES (XXH3_INTERNALBUFFER_SIZE / XXH_STRIPE_LEN) - XXH_STATIC_ASSERT(XXH3_INTERNALBUFFER_SIZE % XXH_STRIPE_LEN == 0); /* clean multiple */ - - /* - * Internal buffer is partially filled (always, except at beginning) - * Complete it, then consume it. - */ - if (state->bufferedSize) { - size_t const loadSize = XXH3_INTERNALBUFFER_SIZE - state->bufferedSize; - XXH_memcpy(state->buffer + state->bufferedSize, input, loadSize); - input += loadSize; - XXH3_consumeStripes(acc, - &state->nbStripesSoFar, state->nbStripesPerBlock, - state->buffer, XXH3_INTERNALBUFFER_STRIPES, - secret, state->secretLimit, - f_acc, f_scramble); - state->bufferedSize = 0; - } - XXH_ASSERT(input < bEnd); - if (bEnd - input > XXH3_INTERNALBUFFER_SIZE) { - size_t nbStripes = (size_t)(bEnd - 1 - input) / XXH_STRIPE_LEN; - input = XXH3_consumeStripes(acc, - &state->nbStripesSoFar, state->nbStripesPerBlock, - input, nbStripes, - secret, state->secretLimit, - f_acc, f_scramble); - XXH_memcpy(state->buffer + sizeof(state->buffer) - XXH_STRIPE_LEN, input - XXH_STRIPE_LEN, XXH_STRIPE_LEN); - - } - /* Some remaining input (always) : buffer it */ - XXH_ASSERT(input < bEnd); - XXH_ASSERT(bEnd - input <= XXH3_INTERNALBUFFER_SIZE); - XXH_ASSERT(state->bufferedSize == 0); - XXH_memcpy(state->buffer, input, (size_t)(bEnd-input)); - state->bufferedSize = (XXH32_hash_t)(bEnd-input); -#if 
defined(XXH3_STREAM_USE_STACK) && XXH3_STREAM_USE_STACK >= 1 - /* save stack accumulators into state */ - XXH_memcpy(state->acc, acc, sizeof(acc)); -#endif - } - - return XXH_OK; -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_64bits_update(XXH_NOESCAPE XXH3_state_t* state, XXH_NOESCAPE const void* input, size_t len) -{ - return XXH3_update(state, (const xxh_u8*)input, len, - XXH3_accumulate, XXH3_scrambleAcc); -} - - -XXH_FORCE_INLINE void -XXH3_digest_long (XXH64_hash_t* acc, - const XXH3_state_t* state, - const unsigned char* secret) -{ - xxh_u8 lastStripe[XXH_STRIPE_LEN]; - const xxh_u8* lastStripePtr; - - /* - * Digest on a local copy. This way, the state remains unaltered, and it can - * continue ingesting more input afterwards. - */ - XXH_memcpy(acc, state->acc, sizeof(state->acc)); - if (state->bufferedSize >= XXH_STRIPE_LEN) { - /* Consume remaining stripes then point to remaining data in buffer */ - size_t const nbStripes = (state->bufferedSize - 1) / XXH_STRIPE_LEN; - size_t nbStripesSoFar = state->nbStripesSoFar; - XXH3_consumeStripes(acc, - &nbStripesSoFar, state->nbStripesPerBlock, - state->buffer, nbStripes, - secret, state->secretLimit, - XXH3_accumulate, XXH3_scrambleAcc); - lastStripePtr = state->buffer + state->bufferedSize - XXH_STRIPE_LEN; - } else { /* bufferedSize < XXH_STRIPE_LEN */ - /* Copy to temp buffer */ - size_t const catchupSize = XXH_STRIPE_LEN - state->bufferedSize; - XXH_ASSERT(state->bufferedSize > 0); /* there is always some input buffered */ - XXH_memcpy(lastStripe, state->buffer + sizeof(state->buffer) - catchupSize, catchupSize); - XXH_memcpy(lastStripe + catchupSize, state->buffer, state->bufferedSize); - lastStripePtr = lastStripe; - } - /* Last stripe */ - XXH3_accumulate_512(acc, - lastStripePtr, - secret + state->secretLimit - XXH_SECRET_LASTACC_START); -} - -/*! 
@ingroup XXH3_family */ -XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_digest (XXH_NOESCAPE const XXH3_state_t* state) -{ - const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extSecret; - if (state->totalLen > XXH3_MIDSIZE_MAX) { - XXH_ALIGN(XXH_ACC_ALIGN) XXH64_hash_t acc[XXH_ACC_NB]; - XXH3_digest_long(acc, state, secret); - return XXH3_mergeAccs(acc, - secret + XXH_SECRET_MERGEACCS_START, - (xxh_u64)state->totalLen * XXH_PRIME64_1); - } - /* totalLen <= XXH3_MIDSIZE_MAX: digesting a short input */ - if (state->useSeed) - return XXH3_64bits_withSeed(state->buffer, (size_t)state->totalLen, state->seed); - return XXH3_64bits_withSecret(state->buffer, (size_t)(state->totalLen), - secret, state->secretLimit + XXH_STRIPE_LEN); -} -#endif /* !XXH_NO_STREAM */ - - -/* ========================================== - * XXH3 128 bits (a.k.a XXH128) - * ========================================== - * XXH3's 128-bit variant has better mixing and strength than the 64-bit variant, - * even without counting the significantly larger output size. - * - * For example, extra steps are taken to avoid the seed-dependent collisions - * in 17-240 byte inputs (See XXH3_mix16B and XXH128_mix32B). - * - * This strength naturally comes at the cost of some speed, especially on short - * lengths. Note that longer hashes are about as fast as the 64-bit version - * due to it using only a slight modification of the 64-bit loop. - * - * XXH128 is also more oriented towards 64-bit machines. It is still extremely - * fast for a _128-bit_ hash on 32-bit (it usually clears XXH64). - */ - -XXH_FORCE_INLINE XXH_PUREF XXH128_hash_t -XXH3_len_1to3_128b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed) -{ - /* A doubled version of 1to3_64b with different constants. 
*/ - XXH_ASSERT(input != NULL); - XXH_ASSERT(1 <= len && len <= 3); - XXH_ASSERT(secret != NULL); - /* - * len = 1: combinedl = { input[0], 0x01, input[0], input[0] } - * len = 2: combinedl = { input[1], 0x02, input[0], input[1] } - * len = 3: combinedl = { input[2], 0x03, input[0], input[1] } - */ - { xxh_u8 const c1 = input[0]; - xxh_u8 const c2 = input[len >> 1]; - xxh_u8 const c3 = input[len - 1]; - xxh_u32 const combinedl = ((xxh_u32)c1 <<16) | ((xxh_u32)c2 << 24) - | ((xxh_u32)c3 << 0) | ((xxh_u32)len << 8); - xxh_u32 const combinedh = XXH_rotl32(XXH_swap32(combinedl), 13); - xxh_u64 const bitflipl = (XXH_readLE32(secret) ^ XXH_readLE32(secret+4)) + seed; - xxh_u64 const bitfliph = (XXH_readLE32(secret+8) ^ XXH_readLE32(secret+12)) - seed; - xxh_u64 const keyed_lo = (xxh_u64)combinedl ^ bitflipl; - xxh_u64 const keyed_hi = (xxh_u64)combinedh ^ bitfliph; - XXH128_hash_t h128; - h128.low64 = XXH64_avalanche(keyed_lo); - h128.high64 = XXH64_avalanche(keyed_hi); - return h128; - } -} - -XXH_FORCE_INLINE XXH_PUREF XXH128_hash_t -XXH3_len_4to8_128b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed) -{ - XXH_ASSERT(input != NULL); - XXH_ASSERT(secret != NULL); - XXH_ASSERT(4 <= len && len <= 8); - seed ^= (xxh_u64)XXH_swap32((xxh_u32)seed) << 32; - { xxh_u32 const input_lo = XXH_readLE32(input); - xxh_u32 const input_hi = XXH_readLE32(input + len - 4); - xxh_u64 const input_64 = input_lo + ((xxh_u64)input_hi << 32); - xxh_u64 const bitflip = (XXH_readLE64(secret+16) ^ XXH_readLE64(secret+24)) + seed; - xxh_u64 const keyed = input_64 ^ bitflip; - - /* Shift len to the left to ensure it is even, this avoids even multiplies. 
*/ - XXH128_hash_t m128 = XXH_mult64to128(keyed, XXH_PRIME64_1 + (len << 2)); - - m128.high64 += (m128.low64 << 1); - m128.low64 ^= (m128.high64 >> 3); - - m128.low64 = XXH_xorshift64(m128.low64, 35); - m128.low64 *= PRIME_MX2; - m128.low64 = XXH_xorshift64(m128.low64, 28); - m128.high64 = XXH3_avalanche(m128.high64); - return m128; - } -} - -XXH_FORCE_INLINE XXH_PUREF XXH128_hash_t -XXH3_len_9to16_128b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed) -{ - XXH_ASSERT(input != NULL); - XXH_ASSERT(secret != NULL); - XXH_ASSERT(9 <= len && len <= 16); - { xxh_u64 const bitflipl = (XXH_readLE64(secret+32) ^ XXH_readLE64(secret+40)) - seed; - xxh_u64 const bitfliph = (XXH_readLE64(secret+48) ^ XXH_readLE64(secret+56)) + seed; - xxh_u64 const input_lo = XXH_readLE64(input); - xxh_u64 input_hi = XXH_readLE64(input + len - 8); - XXH128_hash_t m128 = XXH_mult64to128(input_lo ^ input_hi ^ bitflipl, XXH_PRIME64_1); - /* - * Put len in the middle of m128 to ensure that the length gets mixed to - * both the low and high bits in the 128x64 multiply below. - */ - m128.low64 += (xxh_u64)(len - 1) << 54; - input_hi ^= bitfliph; - /* - * Add the high 32 bits of input_hi to the high 32 bits of m128, then - * add the long product of the low 32 bits of input_hi and XXH_PRIME32_2 to - * the high 64 bits of m128. - * - * The best approach to this operation is different on 32-bit and 64-bit. - */ - if (sizeof(void *) < sizeof(xxh_u64)) { /* 32-bit */ - /* - * 32-bit optimized version, which is more readable. - * - * On 32-bit, it removes an ADC and delays a dependency between the two - * halves of m128.high64, but it generates an extra mask on 64-bit. - */ - m128.high64 += (input_hi & 0xFFFFFFFF00000000ULL) + XXH_mult32to64((xxh_u32)input_hi, XXH_PRIME32_2); - } else { - /* - * 64-bit optimized (albeit more confusing) version. 
- * - * Uses some properties of addition and multiplication to remove the mask: - * - * Let: - * a = input_hi.lo = (input_hi & 0x00000000FFFFFFFF) - * b = input_hi.hi = (input_hi & 0xFFFFFFFF00000000) - * c = XXH_PRIME32_2 - * - * a + (b * c) - * Inverse Property: x + y - x == y - * a + (b * (1 + c - 1)) - * Distributive Property: x * (y + z) == (x * y) + (x * z) - * a + (b * 1) + (b * (c - 1)) - * Identity Property: x * 1 == x - * a + b + (b * (c - 1)) - * - * Substitute a, b, and c: - * input_hi.hi + input_hi.lo + ((xxh_u64)input_hi.lo * (XXH_PRIME32_2 - 1)) - * - * Since input_hi.hi + input_hi.lo == input_hi, we get this: - * input_hi + ((xxh_u64)input_hi.lo * (XXH_PRIME32_2 - 1)) - */ - m128.high64 += input_hi + XXH_mult32to64((xxh_u32)input_hi, XXH_PRIME32_2 - 1); - } - /* m128 ^= XXH_swap64(m128 >> 64); */ - m128.low64 ^= XXH_swap64(m128.high64); - - { /* 128x64 multiply: h128 = m128 * XXH_PRIME64_2; */ - XXH128_hash_t h128 = XXH_mult64to128(m128.low64, XXH_PRIME64_2); - h128.high64 += m128.high64 * XXH_PRIME64_2; - - h128.low64 = XXH3_avalanche(h128.low64); - h128.high64 = XXH3_avalanche(h128.high64); - return h128; - } } -} - -/* - * Assumption: `secret` size is >= XXH3_SECRET_SIZE_MIN - */ -XXH_FORCE_INLINE XXH_PUREF XXH128_hash_t -XXH3_len_0to16_128b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed) -{ - XXH_ASSERT(len <= 16); - { if (len > 8) return XXH3_len_9to16_128b(input, len, secret, seed); - if (len >= 4) return XXH3_len_4to8_128b(input, len, secret, seed); - if (len) return XXH3_len_1to3_128b(input, len, secret, seed); - { XXH128_hash_t h128; - xxh_u64 const bitflipl = XXH_readLE64(secret+64) ^ XXH_readLE64(secret+72); - xxh_u64 const bitfliph = XXH_readLE64(secret+80) ^ XXH_readLE64(secret+88); - h128.low64 = XXH64_avalanche(seed ^ bitflipl); - h128.high64 = XXH64_avalanche( seed ^ bitfliph); - return h128; - } } -} - -/* - * A bit slower than XXH3_mix16B, but handles multiply by zero better. 
- */ -XXH_FORCE_INLINE XXH128_hash_t -XXH128_mix32B(XXH128_hash_t acc, const xxh_u8* input_1, const xxh_u8* input_2, - const xxh_u8* secret, XXH64_hash_t seed) -{ - acc.low64 += XXH3_mix16B (input_1, secret+0, seed); - acc.low64 ^= XXH_readLE64(input_2) + XXH_readLE64(input_2 + 8); - acc.high64 += XXH3_mix16B (input_2, secret+16, seed); - acc.high64 ^= XXH_readLE64(input_1) + XXH_readLE64(input_1 + 8); - return acc; -} - - -XXH_FORCE_INLINE XXH_PUREF XXH128_hash_t -XXH3_len_17to128_128b(const xxh_u8* XXH_RESTRICT input, size_t len, - const xxh_u8* XXH_RESTRICT secret, size_t secretSize, - XXH64_hash_t seed) -{ - XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); (void)secretSize; - XXH_ASSERT(16 < len && len <= 128); - - { XXH128_hash_t acc; - acc.low64 = len * XXH_PRIME64_1; - acc.high64 = 0; - -#if XXH_SIZE_OPT >= 1 - { - /* Smaller, but slightly slower. */ - unsigned int i = (unsigned int)(len - 1) / 32; - do { - acc = XXH128_mix32B(acc, input+16*i, input+len-16*(i+1), secret+32*i, seed); - } while (i-- != 0); - } -#else - if (len > 32) { - if (len > 64) { - if (len > 96) { - acc = XXH128_mix32B(acc, input+48, input+len-64, secret+96, seed); - } - acc = XXH128_mix32B(acc, input+32, input+len-48, secret+64, seed); - } - acc = XXH128_mix32B(acc, input+16, input+len-32, secret+32, seed); - } - acc = XXH128_mix32B(acc, input, input+len-16, secret, seed); -#endif - { XXH128_hash_t h128; - h128.low64 = acc.low64 + acc.high64; - h128.high64 = (acc.low64 * XXH_PRIME64_1) - + (acc.high64 * XXH_PRIME64_4) - + ((len - seed) * XXH_PRIME64_2); - h128.low64 = XXH3_avalanche(h128.low64); - h128.high64 = (XXH64_hash_t)0 - XXH3_avalanche(h128.high64); - return h128; - } - } -} - -XXH_NO_INLINE XXH_PUREF XXH128_hash_t -XXH3_len_129to240_128b(const xxh_u8* XXH_RESTRICT input, size_t len, - const xxh_u8* XXH_RESTRICT secret, size_t secretSize, - XXH64_hash_t seed) -{ - XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); (void)secretSize; - XXH_ASSERT(128 < len && len <= XXH3_MIDSIZE_MAX); 
- - { XXH128_hash_t acc; - unsigned i; - acc.low64 = len * XXH_PRIME64_1; - acc.high64 = 0; - /* - * We set as `i` as offset + 32. We do this so that unchanged - * `len` can be used as upper bound. This reaches a sweet spot - * where both x86 and aarch64 get simple agen and good codegen - * for the loop. - */ - for (i = 32; i < 160; i += 32) { - acc = XXH128_mix32B(acc, - input + i - 32, - input + i - 16, - secret + i - 32, - seed); - } - acc.low64 = XXH3_avalanche(acc.low64); - acc.high64 = XXH3_avalanche(acc.high64); - /* - * NB: `i <= len` will duplicate the last 32-bytes if - * len % 32 was zero. This is an unfortunate necessity to keep - * the hash result stable. - */ - for (i=160; i <= len; i += 32) { - acc = XXH128_mix32B(acc, - input + i - 32, - input + i - 16, - secret + XXH3_MIDSIZE_STARTOFFSET + i - 160, - seed); - } - /* last bytes */ - acc = XXH128_mix32B(acc, - input + len - 16, - input + len - 32, - secret + XXH3_SECRET_SIZE_MIN - XXH3_MIDSIZE_LASTOFFSET - 16, - (XXH64_hash_t)0 - seed); - - { XXH128_hash_t h128; - h128.low64 = acc.low64 + acc.high64; - h128.high64 = (acc.low64 * XXH_PRIME64_1) - + (acc.high64 * XXH_PRIME64_4) - + ((len - seed) * XXH_PRIME64_2); - h128.low64 = XXH3_avalanche(h128.low64); - h128.high64 = (XXH64_hash_t)0 - XXH3_avalanche(h128.high64); - return h128; - } - } -} - -XXH_FORCE_INLINE XXH128_hash_t -XXH3_hashLong_128b_internal(const void* XXH_RESTRICT input, size_t len, - const xxh_u8* XXH_RESTRICT secret, size_t secretSize, - XXH3_f_accumulate f_acc, - XXH3_f_scrambleAcc f_scramble) -{ - XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[XXH_ACC_NB] = XXH3_INIT_ACC; - - XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, secret, secretSize, f_acc, f_scramble); - - /* converge into final hash */ - XXH_STATIC_ASSERT(sizeof(acc) == 64); - XXH_ASSERT(secretSize >= sizeof(acc) + XXH_SECRET_MERGEACCS_START); - { XXH128_hash_t h128; - h128.low64 = XXH3_mergeAccs(acc, - secret + XXH_SECRET_MERGEACCS_START, - (xxh_u64)len * 
XXH_PRIME64_1); - h128.high64 = XXH3_mergeAccs(acc, - secret + secretSize - - sizeof(acc) - XXH_SECRET_MERGEACCS_START, - ~((xxh_u64)len * XXH_PRIME64_2)); - return h128; - } -} - -/* - * It's important for performance that XXH3_hashLong() is not inlined. - */ -XXH_NO_INLINE XXH_PUREF XXH128_hash_t -XXH3_hashLong_128b_default(const void* XXH_RESTRICT input, size_t len, - XXH64_hash_t seed64, - const void* XXH_RESTRICT secret, size_t secretLen) -{ - (void)seed64; (void)secret; (void)secretLen; - return XXH3_hashLong_128b_internal(input, len, XXH3_kSecret, sizeof(XXH3_kSecret), - XXH3_accumulate, XXH3_scrambleAcc); -} - -/* - * It's important for performance to pass @p secretLen (when it's static) - * to the compiler, so that it can properly optimize the vectorized loop. - * - * When the secret size is unknown, or on GCC 12 where the mix of NO_INLINE and FORCE_INLINE - * breaks -Og, this is XXH_NO_INLINE. - */ -XXH3_WITH_SECRET_INLINE XXH128_hash_t -XXH3_hashLong_128b_withSecret(const void* XXH_RESTRICT input, size_t len, - XXH64_hash_t seed64, - const void* XXH_RESTRICT secret, size_t secretLen) -{ - (void)seed64; - return XXH3_hashLong_128b_internal(input, len, (const xxh_u8*)secret, secretLen, - XXH3_accumulate, XXH3_scrambleAcc); -} - -XXH_FORCE_INLINE XXH128_hash_t -XXH3_hashLong_128b_withSeed_internal(const void* XXH_RESTRICT input, size_t len, - XXH64_hash_t seed64, - XXH3_f_accumulate f_acc, - XXH3_f_scrambleAcc f_scramble, - XXH3_f_initCustomSecret f_initSec) -{ - if (seed64 == 0) - return XXH3_hashLong_128b_internal(input, len, - XXH3_kSecret, sizeof(XXH3_kSecret), - f_acc, f_scramble); - { XXH_ALIGN(XXH_SEC_ALIGN) xxh_u8 secret[XXH_SECRET_DEFAULT_SIZE]; - f_initSec(secret, seed64); - return XXH3_hashLong_128b_internal(input, len, (const xxh_u8*)secret, sizeof(secret), - f_acc, f_scramble); - } -} - -/* - * It's important for performance that XXH3_hashLong is not inlined. 
- */ -XXH_NO_INLINE XXH128_hash_t -XXH3_hashLong_128b_withSeed(const void* input, size_t len, - XXH64_hash_t seed64, const void* XXH_RESTRICT secret, size_t secretLen) -{ - (void)secret; (void)secretLen; - return XXH3_hashLong_128b_withSeed_internal(input, len, seed64, - XXH3_accumulate, XXH3_scrambleAcc, XXH3_initCustomSecret); -} - -typedef XXH128_hash_t (*XXH3_hashLong128_f)(const void* XXH_RESTRICT, size_t, - XXH64_hash_t, const void* XXH_RESTRICT, size_t); - -XXH_FORCE_INLINE XXH128_hash_t -XXH3_128bits_internal(const void* input, size_t len, - XXH64_hash_t seed64, const void* XXH_RESTRICT secret, size_t secretLen, - XXH3_hashLong128_f f_hl128) -{ - XXH_ASSERT(secretLen >= XXH3_SECRET_SIZE_MIN); - /* - * If an action is to be taken if `secret` conditions are not respected, - * it should be done here. - * For now, it's a contract pre-condition. - * Adding a check and a branch here would cost performance at every hash. - */ - if (len <= 16) - return XXH3_len_0to16_128b((const xxh_u8*)input, len, (const xxh_u8*)secret, seed64); - if (len <= 128) - return XXH3_len_17to128_128b((const xxh_u8*)input, len, (const xxh_u8*)secret, secretLen, seed64); - if (len <= XXH3_MIDSIZE_MAX) - return XXH3_len_129to240_128b((const xxh_u8*)input, len, (const xxh_u8*)secret, secretLen, seed64); - return f_hl128(input, len, seed64, secret, secretLen); -} - - -/* === Public XXH128 API === */ - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH128_hash_t XXH3_128bits(XXH_NOESCAPE const void* input, size_t len) -{ - return XXH3_128bits_internal(input, len, 0, - XXH3_kSecret, sizeof(XXH3_kSecret), - XXH3_hashLong_128b_default); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH128_hash_t -XXH3_128bits_withSecret(XXH_NOESCAPE const void* input, size_t len, XXH_NOESCAPE const void* secret, size_t secretSize) -{ - return XXH3_128bits_internal(input, len, 0, - (const xxh_u8*)secret, secretSize, - XXH3_hashLong_128b_withSecret); -} - -/*! 
@ingroup XXH3_family */ -XXH_PUBLIC_API XXH128_hash_t -XXH3_128bits_withSeed(XXH_NOESCAPE const void* input, size_t len, XXH64_hash_t seed) -{ - return XXH3_128bits_internal(input, len, seed, - XXH3_kSecret, sizeof(XXH3_kSecret), - XXH3_hashLong_128b_withSeed); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH128_hash_t -XXH3_128bits_withSecretandSeed(XXH_NOESCAPE const void* input, size_t len, XXH_NOESCAPE const void* secret, size_t secretSize, XXH64_hash_t seed) -{ - if (len <= XXH3_MIDSIZE_MAX) - return XXH3_128bits_internal(input, len, seed, XXH3_kSecret, sizeof(XXH3_kSecret), NULL); - return XXH3_hashLong_128b_withSecret(input, len, seed, secret, secretSize); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH128_hash_t -XXH128(XXH_NOESCAPE const void* input, size_t len, XXH64_hash_t seed) -{ - return XXH3_128bits_withSeed(input, len, seed); -} - - -/* === XXH3 128-bit streaming === */ -#ifndef XXH_NO_STREAM -/* - * All initialization and update functions are identical to 64-bit streaming variant. - * The only difference is the finalization routine. - */ - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_128bits_reset(XXH_NOESCAPE XXH3_state_t* statePtr) -{ - return XXH3_64bits_reset(statePtr); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_128bits_reset_withSecret(XXH_NOESCAPE XXH3_state_t* statePtr, XXH_NOESCAPE const void* secret, size_t secretSize) -{ - return XXH3_64bits_reset_withSecret(statePtr, secret, secretSize); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_128bits_reset_withSeed(XXH_NOESCAPE XXH3_state_t* statePtr, XXH64_hash_t seed) -{ - return XXH3_64bits_reset_withSeed(statePtr, seed); -} - -/*! 
@ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_128bits_reset_withSecretandSeed(XXH_NOESCAPE XXH3_state_t* statePtr, XXH_NOESCAPE const void* secret, size_t secretSize, XXH64_hash_t seed) -{ - return XXH3_64bits_reset_withSecretandSeed(statePtr, secret, secretSize, seed); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_128bits_update(XXH_NOESCAPE XXH3_state_t* state, XXH_NOESCAPE const void* input, size_t len) -{ - return XXH3_64bits_update(state, input, len); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_digest (XXH_NOESCAPE const XXH3_state_t* state) -{ - const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extSecret; - if (state->totalLen > XXH3_MIDSIZE_MAX) { - XXH_ALIGN(XXH_ACC_ALIGN) XXH64_hash_t acc[XXH_ACC_NB]; - XXH3_digest_long(acc, state, secret); - XXH_ASSERT(state->secretLimit + XXH_STRIPE_LEN >= sizeof(acc) + XXH_SECRET_MERGEACCS_START); - { XXH128_hash_t h128; - h128.low64 = XXH3_mergeAccs(acc, - secret + XXH_SECRET_MERGEACCS_START, - (xxh_u64)state->totalLen * XXH_PRIME64_1); - h128.high64 = XXH3_mergeAccs(acc, - secret + state->secretLimit + XXH_STRIPE_LEN - - sizeof(acc) - XXH_SECRET_MERGEACCS_START, - ~((xxh_u64)state->totalLen * XXH_PRIME64_2)); - return h128; - } - } - /* len <= XXH3_MIDSIZE_MAX : short code */ - if (state->seed) - return XXH3_128bits_withSeed(state->buffer, (size_t)state->totalLen, state->seed); - return XXH3_128bits_withSecret(state->buffer, (size_t)(state->totalLen), - secret, state->secretLimit + XXH_STRIPE_LEN); -} -#endif /* !XXH_NO_STREAM */ -/* 128-bit utility functions */ - -#include <string.h> /* memcmp, memcpy */ - -/* return : 1 is equal, 0 if different */ -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API int XXH128_isEqual(XXH128_hash_t h1, XXH128_hash_t h2) -{ - /* note : XXH128_hash_t is compact, it has no padding byte */ - return !(memcmp(&h1, &h2, sizeof(h1))); -} - -/* This prototype is compatible with stdlib's qsort.
- * @return : >0 if *h128_1 > *h128_2 - * <0 if *h128_1 < *h128_2 - * =0 if *h128_1 == *h128_2 */ -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API int XXH128_cmp(XXH_NOESCAPE const void* h128_1, XXH_NOESCAPE const void* h128_2) -{ - XXH128_hash_t const h1 = *(const XXH128_hash_t*)h128_1; - XXH128_hash_t const h2 = *(const XXH128_hash_t*)h128_2; - int const hcmp = (h1.high64 > h2.high64) - (h2.high64 > h1.high64); - /* note : bets that, in most cases, hash values are different */ - if (hcmp) return hcmp; - return (h1.low64 > h2.low64) - (h2.low64 > h1.low64); -} - - -/*====== Canonical representation ======*/ -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API void -XXH128_canonicalFromHash(XXH_NOESCAPE XXH128_canonical_t* dst, XXH128_hash_t hash) -{ - XXH_STATIC_ASSERT(sizeof(XXH128_canonical_t) == sizeof(XXH128_hash_t)); - if (XXH_CPU_LITTLE_ENDIAN) { - hash.high64 = XXH_swap64(hash.high64); - hash.low64 = XXH_swap64(hash.low64); - } - XXH_memcpy(dst, &hash.high64, sizeof(hash.high64)); - XXH_memcpy((char*)dst + sizeof(hash.high64), &hash.low64, sizeof(hash.low64)); -} - -/*! @ingroup XXH3_family */ -XXH_PUBLIC_API XXH128_hash_t -XXH128_hashFromCanonical(XXH_NOESCAPE const XXH128_canonical_t* src) -{ - XXH128_hash_t h; - h.high64 = XXH_readBE64(src); - h.low64 = XXH_readBE64(src->digest + 8); - return h; -} - - - -/* ========================================== - * Secret generators - * ========================================== - */ -#define XXH_MIN(x, y) (((x) > (y)) ? (y) : (x)) - -XXH_FORCE_INLINE void XXH3_combine16(void* dst, XXH128_hash_t h128) -{ - XXH_writeLE64( dst, XXH_readLE64(dst) ^ h128.low64 ); - XXH_writeLE64( (char*)dst+8, XXH_readLE64((char*)dst+8) ^ h128.high64 ); -} - -/*! 
@ingroup XXH3_family */ -XXH_PUBLIC_API XXH_errorcode -XXH3_generateSecret(XXH_NOESCAPE void* secretBuffer, size_t secretSize, XXH_NOESCAPE const void* customSeed, size_t customSeedSize) -{ -#if (XXH_DEBUGLEVEL >= 1) - XXH_ASSERT(secretBuffer != NULL); - XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); -#else - /* production mode, assert() are disabled */ - if (secretBuffer == NULL) return XXH_ERROR; - if (secretSize < XXH3_SECRET_SIZE_MIN) return XXH_ERROR; -#endif - - if (customSeedSize == 0) { - customSeed = XXH3_kSecret; - customSeedSize = XXH_SECRET_DEFAULT_SIZE; - } -#if (XXH_DEBUGLEVEL >= 1) - XXH_ASSERT(customSeed != NULL); -#else - if (customSeed == NULL) return XXH_ERROR; -#endif - - /* Fill secretBuffer with a copy of customSeed - repeat as needed */ - { size_t pos = 0; - while (pos < secretSize) { - size_t const toCopy = XXH_MIN((secretSize - pos), customSeedSize); - memcpy((char*)secretBuffer + pos, customSeed, toCopy); - pos += toCopy; - } } - - { size_t const nbSeg16 = secretSize / 16; - size_t n; - XXH128_canonical_t scrambler; - XXH128_canonicalFromHash(&scrambler, XXH128(customSeed, customSeedSize, 0)); - for (n=0; n { } }; +template <> +struct WrapBytes { + static inline PyObject* Wrap(const char* data, int64_t length) { + return PyUnicode_FromStringAndSize(data, length); + } +}; + template <> struct WrapBytes { static inline PyObject* Wrap(const char* data, int64_t length) { @@ -147,6 +154,13 @@ struct WrapBytes { } }; +template <> +struct WrapBytes { + static inline PyObject* Wrap(const char* data, int64_t length) { + return PyBytes_FromStringAndSize(data, length); + } +}; + template <> struct WrapBytes { static inline PyObject* Wrap(const char* data, int64_t length) { @@ -189,7 +203,9 @@ static inline bool ListTypeSupported(const DataType& type) { return true; case Type::FIXED_SIZE_LIST: case Type::LIST: - case Type::LARGE_LIST: { + case Type::LARGE_LIST: + case Type::LIST_VIEW: + case Type::LARGE_LIST_VIEW: { const auto& list_type = 
checked_cast(type); return ListTypeSupported(*list_type.value_type()); } @@ -241,7 +257,8 @@ Status SetBufferBase(PyArrayObject* arr, const std::shared_ptr& buffer) } inline void set_numpy_metadata(int type, const DataType* datatype, PyArray_Descr* out) { - auto metadata = reinterpret_cast(out->c_metadata); + auto metadata = + reinterpret_cast(PyDataType_C_METADATA(out)); if (type == NPY_DATETIME) { if (datatype->id() == Type::TIMESTAMP) { const auto& timestamp_type = checked_cast(*datatype); @@ -262,7 +279,7 @@ Status PyArray_NewFromPool(int nd, npy_intp* dims, PyArray_Descr* descr, MemoryP // // * Track allocations // * Get better performance through custom allocators - int64_t total_size = descr->elsize; + int64_t total_size = PyDataType_ELSIZE(descr); for (int i = 0; i < nd; ++i) { total_size *= dims[i]; } @@ -523,8 +540,9 @@ class PandasWriter { void SetDatetimeUnit(NPY_DATETIMEUNIT unit) { PyAcquireGIL lock; - auto date_dtype = reinterpret_cast( - PyArray_DESCR(reinterpret_cast(block_arr_.obj()))->c_metadata); + auto date_dtype = + reinterpret_cast(PyDataType_C_METADATA( + PyArray_DESCR(reinterpret_cast(block_arr_.obj())))); date_dtype->meta.base = unit; } @@ -606,40 +624,40 @@ inline Status ConvertAsPyObjects(const PandasOptions& options, const ChunkedArra using ArrayType = typename TypeTraits::ArrayType; using Scalar = typename MemoizationTraits::Scalar; - ::arrow::internal::ScalarMemoTable memo_table(options.pool); - std::vector unique_values; - int32_t memo_size = 0; - - auto WrapMemoized = [&](const Scalar& value, PyObject** out_values) { - int32_t memo_index; - RETURN_NOT_OK(memo_table.GetOrInsert(value, &memo_index)); - if (memo_index == memo_size) { - // New entry - RETURN_NOT_OK(wrap_func(value, out_values)); - unique_values.push_back(*out_values); - ++memo_size; - } else { - // Duplicate entry - Py_INCREF(unique_values[memo_index]); - *out_values = unique_values[memo_index]; + auto convert_chunks = [&](auto&& wrap_func) -> Status { + for (int c = 0; 
c < data.num_chunks(); c++) { + const auto& arr = arrow::internal::checked_cast(*data.chunk(c)); + RETURN_NOT_OK(internal::WriteArrayObjects(arr, wrap_func, out_values)); + out_values += arr.length(); } return Status::OK(); }; - auto WrapUnmemoized = [&](const Scalar& value, PyObject** out_values) { - return wrap_func(value, out_values); - }; - - for (int c = 0; c < data.num_chunks(); c++) { - const auto& arr = arrow::internal::checked_cast(*data.chunk(c)); - if (options.deduplicate_objects) { - RETURN_NOT_OK(internal::WriteArrayObjects(arr, WrapMemoized, out_values)); - } else { - RETURN_NOT_OK(internal::WriteArrayObjects(arr, WrapUnmemoized, out_values)); - } - out_values += arr.length(); + if (options.deduplicate_objects) { + // GH-40316: only allocate a memo table if deduplication is enabled. + ::arrow::internal::ScalarMemoTable memo_table(options.pool); + std::vector unique_values; + int32_t memo_size = 0; + + auto WrapMemoized = [&](const Scalar& value, PyObject** out_values) { + int32_t memo_index; + RETURN_NOT_OK(memo_table.GetOrInsert(value, &memo_index)); + if (memo_index == memo_size) { + // New entry + RETURN_NOT_OK(wrap_func(value, out_values)); + unique_values.push_back(*out_values); + ++memo_size; + } else { + // Duplicate entry + Py_INCREF(unique_values[memo_index]); + *out_values = unique_values[memo_index]; + } + return Status::OK(); + }; + return convert_chunks(std::move(WrapMemoized)); + } else { + return convert_chunks(std::forward(wrap_func)); } - return Status::OK(); } Status ConvertStruct(PandasOptions options, const ChunkedArray& data, @@ -736,9 +754,11 @@ Status DecodeDictionaries(MemoryPool* pool, const std::shared_ptr& den return Status::OK(); } -template -Status ConvertListsLike(PandasOptions options, const ChunkedArray& data, - PyObject** out_values) { +template +enable_if_list_like ConvertListsLike(PandasOptions options, + const ChunkedArray& data, + PyObject** out_values) { + using ListArrayT = typename TypeTraits::ArrayType; // Get 
column of underlying value arrays ArrayVector value_arrays; for (int c = 0; c < data.num_chunks(); c++) { @@ -812,6 +832,26 @@ Status ConvertListsLike(PandasOptions options, const ChunkedArray& data, return Status::OK(); } +// TODO GH-40579: optimize ListView conversion to avoid unnecessary copies +template +enable_if_list_view ConvertListsLike(PandasOptions options, + const ChunkedArray& data, + PyObject** out_values) { + using ListViewArrayType = typename TypeTraits::ArrayType; + using NonViewType = + std::conditional_t; + using NonViewClass = typename TypeTraits::ArrayType; + ArrayVector list_arrays; + for (int c = 0; c < data.num_chunks(); c++) { + const auto& arr = checked_cast(*data.chunk(c)); + ARROW_ASSIGN_OR_RAISE(auto non_view_array, + NonViewClass::FromListView(arr, options.pool)); + list_arrays.emplace_back(non_view_array); + } + auto chunked_array = std::make_shared(list_arrays); + return ConvertListsLike(options, *chunked_array, out_values); +} + template Status ConvertMapHelper(F1 resetRow, F2 addPairToRow, F3 stealRow, const ChunkedArray& data, PyArrayObject* py_keys, @@ -1154,7 +1194,8 @@ struct ObjectWriterVisitor { } template - enable_if_t::value || is_fixed_size_binary_type::value, + enable_if_t::value || is_binary_view_like_type::value || + is_fixed_size_binary_type::value, Status> Visit(const Type& type) { auto WrapValue = [](const std::string_view& view, PyObject** out) { @@ -1327,16 +1368,14 @@ struct ObjectWriterVisitor { } template - enable_if_t::value || is_var_length_list_type::value, - Status> - Visit(const T& type) { - using ArrayType = typename TypeTraits::ArrayType; + enable_if_t::value || is_list_view_type::value, Status> Visit( + const T& type) { if (!ListTypeSupported(*type.value_type())) { return Status::NotImplemented( "Not implemented type for conversion from List to Pandas: ", type.value_type()->ToString()); } - return ConvertListsLike(options, data, out_values); + return ConvertListsLike(options, data, out_values); } Status 
Visit(const MapType& type) { return ConvertMap(options, data, out_values); } @@ -1350,13 +1389,10 @@ struct ObjectWriterVisitor { std::is_same::value || std::is_same::value || std::is_same::value || - std::is_same::value || - std::is_same::value || std::is_same::value || (std::is_base_of::value && !std::is_same::value) || - std::is_base_of::value || - std::is_base_of::value, + std::is_base_of::value, Status> Visit(const Type& type) { return Status::NotImplemented("No implemented conversion to object dtype: ", @@ -2086,8 +2122,10 @@ static Status GetPandasWriterType(const ChunkedArray& data, const PandasOptions& break; case Type::STRING: // fall through case Type::LARGE_STRING: // fall through + case Type::STRING_VIEW: // fall through case Type::BINARY: // fall through case Type::LARGE_BINARY: + case Type::BINARY_VIEW: case Type::NA: // fall through case Type::FIXED_SIZE_BINARY: // fall through case Type::STRUCT: // fall through @@ -2189,6 +2227,8 @@ static Status GetPandasWriterType(const ChunkedArray& data, const PandasOptions& case Type::FIXED_SIZE_LIST: case Type::LIST: case Type::LARGE_LIST: + case Type::LIST_VIEW: + case Type::LARGE_LIST_VIEW: case Type::MAP: { auto list_type = std::static_pointer_cast(data.type()); if (!ListTypeSupported(*list_type->value_type())) { @@ -2273,6 +2313,14 @@ std::shared_ptr GetStorageChunkedArray(std::shared_ptr(std::move(storage_arrays), value_type); }; +// Helper function to decode RunEndEncodedArray +Result> GetDecodedChunkedArray( + std::shared_ptr arr) { + ARROW_ASSIGN_OR_RAISE(Datum decoded, compute::RunEndDecode(arr)); + DCHECK(decoded.is_chunked_array()); + return decoded.chunked_array(); +}; + class ConsolidatedBlockCreator : public PandasBlockCreator { public: using PandasBlockCreator::PandasBlockCreator; @@ -2302,6 +2350,11 @@ class ConsolidatedBlockCreator : public PandasBlockCreator { if (arrays_[column_index]->type()->id() == Type::EXTENSION) { arrays_[column_index] = GetStorageChunkedArray(arrays_[column_index]); 
} + // In case of a RunEndEncodedArray default to the values type + else if (arrays_[column_index]->type()->id() == Type::RUN_END_ENCODED) { + ARROW_ASSIGN_OR_RAISE(arrays_[column_index], + GetDecodedChunkedArray(arrays_[column_index])); + } return GetPandasWriterType(*arrays_[column_index], options_, out); } } @@ -2499,6 +2552,8 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options, std::shared_ptr arr, PyObject* py_ref, PyObject** out) { if (options.decode_dictionaries && arr->type()->id() == Type::DICTIONARY) { + // XXX we should return an error as below if options.zero_copy_only + // is true, but that would break compatibility with existing tests. const auto& dense_type = checked_cast(*arr->type()).value_type(); RETURN_NOT_OK(DecodeDictionaries(options.pool, dense_type, &arr)); @@ -2534,6 +2589,18 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options, if (arr->type()->id() == Type::EXTENSION) { arr = GetStorageChunkedArray(arr); } + // In case of a RunEndEncodedArray decode the array + else if (arr->type()->id() == Type::RUN_END_ENCODED) { + if (options.zero_copy_only) { + return Status::Invalid("Need to dencode a RunEndEncodedArray, but ", + "only zero-copy conversions allowed"); + } + ARROW_ASSIGN_OR_RAISE(arr, GetDecodedChunkedArray(arr)); + + // Because we built a new array when we decoded the RunEndEncodedArray + // the final resulting numpy array should own the memory through a Capsule + py_ref = nullptr; + } PandasWriter::type output_type; RETURN_NOT_OK(GetPandasWriterType(*arr, modified_options, &output_type)); diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/arrow_to_pandas.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/arrow_to_pandas.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/arrow_to_pandas.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/arrow_to_pandas.h diff --git 
a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/arrow_to_python_internal.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/arrow_to_python_internal.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/arrow_to_python_internal.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/arrow_to_python_internal.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/async.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/async.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/async.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/async.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/benchmark.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/benchmark.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/benchmark.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/benchmark.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/benchmark.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/benchmark.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/benchmark.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/benchmark.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/common.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/common.cc old mode 100644 new mode 100755 similarity index 80% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/common.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/common.cc index 6fe2ed4da..2f44a9122 --- 
a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/common.cc +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/common.cc @@ -19,6 +19,7 @@ #include <cstdlib> #include <mutex> +#include <sstream> #include <string> #include "arrow/memory_pool.h" @@ -90,9 +91,15 @@ class PythonErrorDetail : public StatusDetail { std::string ToString() const override { // This is simple enough not to need the GIL - const auto ty = reinterpret_cast<PyTypeObject*>(exc_type_.obj()); - // XXX Should we also print traceback? - return std::string("Python exception: ") + ty->tp_name; + Result<std::string> result = FormatImpl(); + + if (result.ok()) { + return result.ValueOrDie(); + } else { + // Fallback to just the exception type + const auto ty = reinterpret_cast<PyTypeObject*>(exc_type_.obj()); + return std::string("Python exception: ") + ty->tp_name; + } } void RestorePyError() const { @@ -131,6 +138,42 @@ class PythonErrorDetail : public StatusDetail { } protected: + Result<std::string> FormatImpl() const { + PyAcquireGIL lock; + + // Use traceback.format_exception() + OwnedRef traceback_module; + RETURN_NOT_OK(internal::ImportModule("traceback", &traceback_module)); + + OwnedRef fmt_exception; + RETURN_NOT_OK(internal::ImportFromModule(traceback_module.obj(), "format_exception", + &fmt_exception)); + + OwnedRef formatted; + formatted.reset(PyObject_CallFunctionObjArgs(fmt_exception.obj(), exc_type_.obj(), + exc_value_.obj(), exc_traceback_.obj(), + NULL)); + RETURN_IF_PYERROR(); + + std::stringstream ss; + ss << "Python exception: "; + Py_ssize_t num_lines = PySequence_Length(formatted.obj()); + RETURN_IF_PYERROR(); + + for (Py_ssize_t i = 0; i < num_lines; ++i) { + Py_ssize_t line_size; + + PyObject* line = PySequence_GetItem(formatted.obj(), i); + RETURN_IF_PYERROR(); + + const char* data = PyUnicode_AsUTF8AndSize(line, &line_size); + RETURN_IF_PYERROR(); + + ss << std::string_view(data, line_size); + } + return ss.str(); + } + PythonErrorDetail() = default; OwnedRefNoGIL exc_type_, exc_value_, exc_traceback_; diff --git
a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/common.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/common.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/common.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/common.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/csv.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/csv.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/csv.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/csv.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/csv.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/csv.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/csv.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/csv.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/datetime.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/datetime.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/datetime.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/datetime.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/datetime.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/datetime.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/datetime.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/datetime.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/decimal.cc 
b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/decimal.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/decimal.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/decimal.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/decimal.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/decimal.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/decimal.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/decimal.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/deserialize.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/deserialize.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/deserialize.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/deserialize.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/deserialize.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/deserialize.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/deserialize.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/deserialize.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/extension_type.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/extension_type.cc old mode 100644 new mode 100755 similarity index 99% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/extension_type.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/extension_type.cc index 3ccc171c8..be66b4a1c --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/extension_type.cc +++ 
b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/extension_type.cc @@ -72,7 +72,7 @@ PyObject* DeserializeExtInstance(PyObject* type_class, static const char* kExtensionName = "arrow.py_extension_type"; -std::string PyExtensionType::ToString() const { +std::string PyExtensionType::ToString(bool show_metadata) const { PyAcquireGIL lock; std::stringstream ss; diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/extension_type.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/extension_type.h old mode 100644 new mode 100755 similarity index 97% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/extension_type.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/extension_type.h index e433d9aca..e6523824e --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/extension_type.h +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/extension_type.h @@ -33,7 +33,7 @@ class ARROW_PYTHON_EXPORT PyExtensionType : public ExtensionType { // Implement extensionType API std::string extension_name() const override { return extension_name_; } - std::string ToString() const override; + std::string ToString(bool show_metadata = false) const override; bool ExtensionEquals(const ExtensionType& other) const override; diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/filesystem.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/filesystem.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/filesystem.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/filesystem.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/filesystem.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/filesystem.h old mode 100644 new mode 100755 similarity index 90% rename from 
cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/filesystem.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/filesystem.h index 003fd5cb8..194b226ac --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/filesystem.h +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/filesystem.h @@ -26,9 +26,7 @@ #include "arrow/python/visibility.h" #include "arrow/util/macros.h" -namespace arrow { -namespace py { -namespace fs { +namespace arrow::py::fs { class ARROW_PYTHON_EXPORT PyFileSystemVtable { public: @@ -83,16 +81,24 @@ class ARROW_PYTHON_EXPORT PyFileSystem : public arrow::fs::FileSystem { bool Equals(const FileSystem& other) const override; + /// \cond FALSE + using FileSystem::CreateDir; + using FileSystem::DeleteDirContents; + using FileSystem::GetFileInfo; + using FileSystem::OpenAppendStream; + using FileSystem::OpenOutputStream; + /// \endcond + Result GetFileInfo(const std::string& path) override; Result> GetFileInfo( const std::vector& paths) override; Result> GetFileInfo( const arrow::fs::FileSelector& select) override; - Status CreateDir(const std::string& path, bool recursive = true) override; + Status CreateDir(const std::string& path, bool recursive) override; Status DeleteDir(const std::string& path) override; - Status DeleteDirContents(const std::string& path, bool missing_dir_ok = false) override; + Status DeleteDirContents(const std::string& path, bool missing_dir_ok) override; Status DeleteRootDirContents() override; Status DeleteFile(const std::string& path) override; @@ -107,10 +113,10 @@ class ARROW_PYTHON_EXPORT PyFileSystem : public arrow::fs::FileSystem { const std::string& path) override; Result> OpenOutputStream( const std::string& path, - const std::shared_ptr& metadata = {}) override; + const std::shared_ptr& metadata) override; Result> OpenAppendStream( const std::string& path, - const std::shared_ptr& metadata = {}) override; + const std::shared_ptr& metadata) override; Result 
NormalizePath(std::string path) override; @@ -121,6 +127,4 @@ class ARROW_PYTHON_EXPORT PyFileSystem : public arrow::fs::FileSystem { PyFileSystemVtable vtable_; }; -} // namespace fs -} // namespace py -} // namespace arrow +} // namespace arrow::py::fs diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/flight.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/flight.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/flight.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/flight.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/flight.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/flight.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/flight.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/flight.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/gdb.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/gdb.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/gdb.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/gdb.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/gdb.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/gdb.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/gdb.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/gdb.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/helpers.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/helpers.cc old mode 100644 new mode 100755 similarity index 99% rename from 
cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/helpers.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/helpers.cc index c266abc16..2c86c86a9 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/helpers.cc +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/helpers.cc @@ -63,6 +63,8 @@ std::shared_ptr GetPrimitiveType(Type::type type) { GET_PRIMITIVE_TYPE(STRING, utf8); GET_PRIMITIVE_TYPE(LARGE_BINARY, large_binary); GET_PRIMITIVE_TYPE(LARGE_STRING, large_utf8); + GET_PRIMITIVE_TYPE(BINARY_VIEW, binary_view); + GET_PRIMITIVE_TYPE(STRING_VIEW, utf8_view); GET_PRIMITIVE_TYPE(INTERVAL_MONTH_DAY_NANO, month_day_nano_interval); default: return nullptr; diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/helpers.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/helpers.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/helpers.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/helpers.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/inference.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/inference.cc old mode 100644 new mode 100755 similarity index 99% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/inference.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/inference.cc index 9537aec57..10116f9af --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/inference.cc +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/inference.cc @@ -468,10 +468,7 @@ class TypeInferrer { if (numpy_dtype_count_ > 0) { // All NumPy scalars and Nones/nulls if (numpy_dtype_count_ + none_count_ == total_count_) { - std::shared_ptr type; - RETURN_NOT_OK(NumPyDtypeToArrow(numpy_unifier_.current_dtype(), &type)); - *out = type; - return Status::OK(); + return 
NumPyDtypeToArrow(numpy_unifier_.current_dtype()).Value(out); } // The "bad path": data contains a mix of NumPy scalars and diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/inference.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/inference.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/inference.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/inference.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/init.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/init.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/init.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/init.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/init.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/init.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/init.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/init.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/io.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/io.cc old mode 100644 new mode 100755 similarity index 96% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/io.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/io.cc index 43f8297c5..197f8b9d3 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/io.cc +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/io.cc @@ -92,9 +92,12 @@ class PythonFile { Status Seek(int64_t position, int whence) { RETURN_NOT_OK(CheckClosed()); + // NOTE: `long long` is at least 64 bits in the C standard, the cast below is + // therefore safe. 
+ // whence: 0 for relative to start of file, 2 for end of file - PyObject* result = cpp_PyObject_CallMethod(file_.obj(), "seek", "(ni)", - static_cast(position), whence); + PyObject* result = cpp_PyObject_CallMethod(file_.obj(), "seek", "(Li)", + static_cast(position), whence); Py_XDECREF(result); PY_RETURN_IF_ERROR(StatusCode::IOError); return Status::OK(); @@ -103,16 +106,16 @@ class PythonFile { Status Read(int64_t nbytes, PyObject** out) { RETURN_NOT_OK(CheckClosed()); - PyObject* result = cpp_PyObject_CallMethod(file_.obj(), "read", "(n)", - static_cast(nbytes)); + PyObject* result = cpp_PyObject_CallMethod(file_.obj(), "read", "(L)", + static_cast(nbytes)); PY_RETURN_IF_ERROR(StatusCode::IOError); *out = result; return Status::OK(); } Status ReadBuffer(int64_t nbytes, PyObject** out) { - PyObject* result = cpp_PyObject_CallMethod(file_.obj(), "read_buffer", "(n)", - static_cast(nbytes)); + PyObject* result = cpp_PyObject_CallMethod(file_.obj(), "read_buffer", "(L)", + static_cast(nbytes)); PY_RETURN_IF_ERROR(StatusCode::IOError); *out = result; return Status::OK(); diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/io.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/io.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/io.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/io.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/ipc.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/ipc.cc new file mode 100755 index 000000000..0ed152242 --- /dev/null +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/ipc.cc @@ -0,0 +1,133 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "ipc.h" + +#include + +#include "arrow/compute/cast.h" +#include "arrow/python/pyarrow.h" + +namespace arrow { +namespace py { + +PyRecordBatchReader::PyRecordBatchReader() {} + +Status PyRecordBatchReader::Init(std::shared_ptr schema, PyObject* iterable) { + schema_ = std::move(schema); + + iterator_.reset(PyObject_GetIter(iterable)); + return CheckPyError(); +} + +std::shared_ptr PyRecordBatchReader::schema() const { return schema_; } + +Status PyRecordBatchReader::ReadNext(std::shared_ptr* batch) { + PyAcquireGIL lock; + + if (!iterator_) { + // End of stream + batch->reset(); + return Status::OK(); + } + + OwnedRef py_batch(PyIter_Next(iterator_.obj())); + if (!py_batch) { + RETURN_IF_PYERROR(); + // End of stream + batch->reset(); + iterator_.reset(); + return Status::OK(); + } + + return unwrap_batch(py_batch.obj()).Value(batch); +} + +Result> PyRecordBatchReader::Make( + std::shared_ptr schema, PyObject* iterable) { + auto reader = std::shared_ptr(new PyRecordBatchReader()); + RETURN_NOT_OK(reader->Init(std::move(schema), iterable)); + return reader; +} + +CastingRecordBatchReader::CastingRecordBatchReader() = default; + +Status CastingRecordBatchReader::Init(std::shared_ptr parent, + std::shared_ptr schema) { + std::shared_ptr src = parent->schema(); + + // The check for names has already been done in Python where it's easier to + // generate a nice error message. 
+ int num_fields = schema->num_fields(); + if (src->num_fields() != num_fields) { + return Status::Invalid("Number of fields not equal"); + } + + // Ensure all columns can be cast before succeeding + for (int i = 0; i < num_fields; i++) { + if (!compute::CanCast(*src->field(i)->type(), *schema->field(i)->type())) { + return Status::TypeError("Field ", i, " cannot be cast from ", + src->field(i)->type()->ToString(), " to ", + schema->field(i)->type()->ToString()); + } + } + + parent_ = std::move(parent); + schema_ = std::move(schema); + + return Status::OK(); +} + +std::shared_ptr CastingRecordBatchReader::schema() const { return schema_; } + +Status CastingRecordBatchReader::ReadNext(std::shared_ptr* batch) { + std::shared_ptr out; + ARROW_RETURN_NOT_OK(parent_->ReadNext(&out)); + if (!out) { + batch->reset(); + return Status::OK(); + } + + auto num_columns = out->num_columns(); + auto options = compute::CastOptions::Safe(); + ArrayVector columns(num_columns); + for (int i = 0; i < num_columns; i++) { + const Array& src = *out->column(i); + if (!schema_->field(i)->nullable() && src.null_count() > 0) { + return Status::Invalid( + "Can't cast array that contains nulls to non-nullable field at index ", i); + } + + ARROW_ASSIGN_OR_RAISE(columns[i], + compute::Cast(src, schema_->field(i)->type(), options)); + } + + *batch = RecordBatch::Make(schema_, out->num_rows(), std::move(columns)); + return Status::OK(); +} + +Result> CastingRecordBatchReader::Make( + std::shared_ptr parent, std::shared_ptr schema) { + auto reader = std::shared_ptr(new CastingRecordBatchReader()); + ARROW_RETURN_NOT_OK(reader->Init(parent, schema)); + return reader; +} + +Status CastingRecordBatchReader::Close() { return parent_->Close(); } + +} // namespace py +} // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/ipc.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/ipc.h old mode 100644 new mode 100755 similarity index 73% rename from 
cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/ipc.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/ipc.h index 92232ed83..2c16d8c96 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/ipc.h +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/ipc.h @@ -48,5 +48,25 @@ class ARROW_PYTHON_EXPORT PyRecordBatchReader : public RecordBatchReader { OwnedRefNoGIL iterator_; }; +class ARROW_PYTHON_EXPORT CastingRecordBatchReader : public RecordBatchReader { + public: + std::shared_ptr schema() const override; + + Status ReadNext(std::shared_ptr* batch) override; + + static Result> Make( + std::shared_ptr parent, std::shared_ptr schema); + + Status Close() override; + + protected: + CastingRecordBatchReader(); + + Status Init(std::shared_ptr parent, std::shared_ptr schema); + + std::shared_ptr parent_; + std::shared_ptr schema_; +}; + } // namespace py } // namespace arrow diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/iterators.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/iterators.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/iterators.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/iterators.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/lib.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/lib.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/lib.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/lib.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/lib_api.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/lib_api.h old mode 100644 new mode 100755 similarity index 51% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/lib_api.h rename to 
cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/lib_api.h index 12bb219b3..6c4fee277 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/lib_api.h +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/lib_api.h @@ -1,4 +1,4 @@ -/* Generated by Cython 3.0.8 */ +/* Generated by Cython 3.0.10 */ #ifndef __PYX_HAVE_API__pyarrow__lib #define __PYX_HAVE_API__pyarrow__lib @@ -102,9 +102,9 @@ static int (*__pyx_api_f_7pyarrow_3lib_pyarrow_is_table)(PyObject *) = 0; #define pyarrow_is_table __pyx_api_f_7pyarrow_3lib_pyarrow_is_table static int (*__pyx_api_f_7pyarrow_3lib_pyarrow_is_batch)(PyObject *) = 0; #define pyarrow_is_batch __pyx_api_f_7pyarrow_3lib_pyarrow_is_batch -#ifndef __PYX_HAVE_RT_ImportFunction_3_0_8 -#define __PYX_HAVE_RT_ImportFunction_3_0_8 -static int __Pyx_ImportFunction_3_0_8(PyObject *module, const char *funcname, void (**f)(void), const char *sig) { +#ifndef __PYX_HAVE_RT_ImportFunction_3_0_10 +#define __PYX_HAVE_RT_ImportFunction_3_0_10 +static int __Pyx_ImportFunction_3_0_10(PyObject *module, const char *funcname, void (**f)(void), const char *sig) { PyObject *d = 0; PyObject *cobj = 0; union { @@ -144,53 +144,53 @@ static int import_pyarrow__lib(void) { PyObject *module = 0; module = PyImport_ImportModule("pyarrow.lib"); if (!module) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "box_memory_pool", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_box_memory_pool, "PyObject *( arrow::MemoryPool *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_buffer", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_buffer, "PyObject *(std::shared_ptr< arrow::Buffer> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_resizable_buffer", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_resizable_buffer, "PyObject *(std::shared_ptr< arrow::ResizableBuffer> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_data_type", (void 
(**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_data_type, "PyObject *(std::shared_ptr< arrow::DataType> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_field", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_field, "PyObject *(std::shared_ptr< arrow::Field> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_schema", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_schema, "PyObject *(std::shared_ptr< arrow::Schema> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_scalar", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_scalar, "PyObject *(std::shared_ptr< arrow::Scalar> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_array, "PyObject *(std::shared_ptr< arrow::Array> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_chunked_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_chunked_array, "PyObject *(std::shared_ptr< arrow::ChunkedArray> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_sparse_coo_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_sparse_coo_tensor, "PyObject *(std::shared_ptr< arrow::SparseCOOTensor> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_sparse_csc_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_sparse_csc_matrix, "PyObject *(std::shared_ptr< arrow::SparseCSCMatrix> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_sparse_csf_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_sparse_csf_tensor, "PyObject *(std::shared_ptr< arrow::SparseCSFTensor> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_sparse_csr_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_sparse_csr_matrix, "PyObject 
*(std::shared_ptr< arrow::SparseCSRMatrix> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_tensor, "PyObject *(std::shared_ptr< arrow::Tensor> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_batch", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_batch, "PyObject *(std::shared_ptr< arrow::RecordBatch> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_wrap_table", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_table, "PyObject *(std::shared_ptr< arrow::Table> const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_buffer", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_buffer, "std::shared_ptr< arrow::Buffer> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_data_type", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_data_type, "std::shared_ptr< arrow::DataType> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_field", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_field, "std::shared_ptr< arrow::Field> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_schema", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_schema, "std::shared_ptr< arrow::Schema> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_scalar", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_scalar, "std::shared_ptr< arrow::Scalar> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_array, "std::shared_ptr< arrow::Array> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_chunked_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_chunked_array, 
"std::shared_ptr< arrow::ChunkedArray> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_sparse_coo_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_sparse_coo_tensor, "std::shared_ptr< arrow::SparseCOOTensor> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_sparse_csc_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_sparse_csc_matrix, "std::shared_ptr< arrow::SparseCSCMatrix> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_sparse_csf_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_sparse_csf_tensor, "std::shared_ptr< arrow::SparseCSFTensor> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_sparse_csr_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_sparse_csr_matrix, "std::shared_ptr< arrow::SparseCSRMatrix> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_tensor, "std::shared_ptr< arrow::Tensor> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_batch", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_batch, "std::shared_ptr< arrow::RecordBatch> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_unwrap_table", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_table, "std::shared_ptr< arrow::Table> (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_internal_check_status", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_internal_check_status, "int (arrow::Status const &)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_internal_convert_status", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_internal_convert_status, "PyObject *(arrow::Status const &)") < 0) goto bad; - if 
(__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_buffer", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_buffer, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_data_type", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_data_type, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_metadata", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_metadata, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_field", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_field, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_schema", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_schema, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_array, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_chunked_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_chunked_array, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_scalar", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_scalar, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_tensor, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_sparse_coo_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_sparse_coo_tensor, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_sparse_csr_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_sparse_csr_matrix, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_sparse_csc_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_sparse_csc_matrix, "int (PyObject *)") < 0) goto 
bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_sparse_csf_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_sparse_csf_tensor, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_table", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_table, "int (PyObject *)") < 0) goto bad; - if (__Pyx_ImportFunction_3_0_8(module, "pyarrow_is_batch", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_batch, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "box_memory_pool", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_box_memory_pool, "PyObject *( arrow::MemoryPool *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_buffer", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_buffer, "PyObject *(std::shared_ptr< arrow::Buffer> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_resizable_buffer", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_resizable_buffer, "PyObject *(std::shared_ptr< arrow::ResizableBuffer> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_data_type", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_data_type, "PyObject *(std::shared_ptr< arrow::DataType> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_field", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_field, "PyObject *(std::shared_ptr< arrow::Field> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_schema", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_schema, "PyObject *(std::shared_ptr< arrow::Schema> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_scalar", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_scalar, "PyObject *(std::shared_ptr< arrow::Scalar> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_array", (void 
(**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_array, "PyObject *(std::shared_ptr< arrow::Array> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_chunked_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_chunked_array, "PyObject *(std::shared_ptr< arrow::ChunkedArray> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_sparse_coo_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_sparse_coo_tensor, "PyObject *(std::shared_ptr< arrow::SparseCOOTensor> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_sparse_csc_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_sparse_csc_matrix, "PyObject *(std::shared_ptr< arrow::SparseCSCMatrix> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_sparse_csf_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_sparse_csf_tensor, "PyObject *(std::shared_ptr< arrow::SparseCSFTensor> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_sparse_csr_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_sparse_csr_matrix, "PyObject *(std::shared_ptr< arrow::SparseCSRMatrix> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_tensor, "PyObject *(std::shared_ptr< arrow::Tensor> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_batch", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_batch, "PyObject *(std::shared_ptr< arrow::RecordBatch> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_wrap_table", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_wrap_table, "PyObject *(std::shared_ptr< arrow::Table> const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_buffer", (void 
(**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_buffer, "std::shared_ptr< arrow::Buffer> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_data_type", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_data_type, "std::shared_ptr< arrow::DataType> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_field", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_field, "std::shared_ptr< arrow::Field> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_schema", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_schema, "std::shared_ptr< arrow::Schema> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_scalar", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_scalar, "std::shared_ptr< arrow::Scalar> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_array, "std::shared_ptr< arrow::Array> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_chunked_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_chunked_array, "std::shared_ptr< arrow::ChunkedArray> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_sparse_coo_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_sparse_coo_tensor, "std::shared_ptr< arrow::SparseCOOTensor> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_sparse_csc_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_sparse_csc_matrix, "std::shared_ptr< arrow::SparseCSCMatrix> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_sparse_csf_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_sparse_csf_tensor, "std::shared_ptr< arrow::SparseCSFTensor> (PyObject *)") < 
0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_sparse_csr_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_sparse_csr_matrix, "std::shared_ptr< arrow::SparseCSRMatrix> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_tensor, "std::shared_ptr< arrow::Tensor> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_batch", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_batch, "std::shared_ptr< arrow::RecordBatch> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_unwrap_table", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_unwrap_table, "std::shared_ptr< arrow::Table> (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_internal_check_status", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_internal_check_status, "int (arrow::Status const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_internal_convert_status", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_internal_convert_status, "PyObject *(arrow::Status const &)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_buffer", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_buffer, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_data_type", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_data_type, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_metadata", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_metadata, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_field", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_field, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_schema", (void 
(**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_schema, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_array, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_chunked_array", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_chunked_array, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_scalar", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_scalar, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_tensor, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_sparse_coo_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_sparse_coo_tensor, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_sparse_csr_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_sparse_csr_matrix, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_sparse_csc_matrix", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_sparse_csc_matrix, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_sparse_csf_tensor", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_sparse_csf_tensor, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_table", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_table, "int (PyObject *)") < 0) goto bad; + if (__Pyx_ImportFunction_3_0_10(module, "pyarrow_is_batch", (void (**)(void))&__pyx_api_f_7pyarrow_3lib_pyarrow_is_batch, "int (PyObject *)") < 0) goto bad; Py_DECREF(module); module = 0; return 0; bad: diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_convert.cc 
b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_convert.cc old mode 100644 new mode 100755 similarity index 90% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_convert.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_convert.cc index 497068076..5fd2cb511 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_convert.cc +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_convert.cc @@ -46,7 +46,7 @@ NumPyBuffer::NumPyBuffer(PyObject* ao) : Buffer(nullptr, 0) { PyArrayObject* ndarray = reinterpret_cast(ao); auto ptr = reinterpret_cast(PyArray_DATA(ndarray)); data_ = const_cast(ptr); - size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize; + size_ = PyArray_NBYTES(ndarray); capacity_ = size_; is_mutable_ = !!(PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE); } @@ -59,12 +59,11 @@ NumPyBuffer::~NumPyBuffer() { #define TO_ARROW_TYPE_CASE(NPY_NAME, FACTORY) \ case NPY_##NPY_NAME: \ - *out = FACTORY(); \ - break; + return FACTORY(); namespace { -Status GetTensorType(PyObject* dtype, std::shared_ptr* out) { +Result> GetTensorType(PyObject* dtype) { if (!PyObject_TypeCheck(dtype, &PyArrayDescr_Type)) { return Status::TypeError("Did not pass numpy.dtype object"); } @@ -84,11 +83,8 @@ Status GetTensorType(PyObject* dtype, std::shared_ptr* out) { TO_ARROW_TYPE_CASE(FLOAT16, float16); TO_ARROW_TYPE_CASE(FLOAT32, float32); TO_ARROW_TYPE_CASE(FLOAT64, float64); - default: { - return Status::NotImplemented("Unsupported numpy type ", descr->type_num); - } } - return Status::OK(); + return Status::NotImplemented("Unsupported numpy type ", descr->type_num); } Status GetNumPyType(const DataType& type, int* type_num) { @@ -120,15 +116,21 @@ Status GetNumPyType(const DataType& type, int* type_num) { } // namespace -Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr* out) { +Result> NumPyScalarToArrowDataType(PyObject* scalar) { + PyArray_Descr* descr = 
PyArray_DescrFromScalar(scalar); + OwnedRef descr_ref(reinterpret_cast(descr)); + return NumPyDtypeToArrow(descr); +} + +Result> NumPyDtypeToArrow(PyObject* dtype) { if (!PyObject_TypeCheck(dtype, &PyArrayDescr_Type)) { return Status::TypeError("Did not pass numpy.dtype object"); } PyArray_Descr* descr = reinterpret_cast(dtype); - return NumPyDtypeToArrow(descr, out); + return NumPyDtypeToArrow(descr); } -Status NumPyDtypeToArrow(PyArray_Descr* descr, std::shared_ptr* out) { +Result> NumPyDtypeToArrow(PyArray_Descr* descr) { int type_num = fix_numpy_type_num(descr->type_num); switch (type_num) { @@ -148,23 +150,18 @@ Status NumPyDtypeToArrow(PyArray_Descr* descr, std::shared_ptr* out) { TO_ARROW_TYPE_CASE(UNICODE, utf8); case NPY_DATETIME: { auto date_dtype = - reinterpret_cast(descr->c_metadata); + reinterpret_cast(PyDataType_C_METADATA(descr)); switch (date_dtype->meta.base) { case NPY_FR_s: - *out = timestamp(TimeUnit::SECOND); - break; + return timestamp(TimeUnit::SECOND); case NPY_FR_ms: - *out = timestamp(TimeUnit::MILLI); - break; + return timestamp(TimeUnit::MILLI); case NPY_FR_us: - *out = timestamp(TimeUnit::MICRO); - break; + return timestamp(TimeUnit::MICRO); case NPY_FR_ns: - *out = timestamp(TimeUnit::NANO); - break; + return timestamp(TimeUnit::NANO); case NPY_FR_D: - *out = date32(); - break; + return date32(); case NPY_FR_GENERIC: return Status::NotImplemented("Unbound or generic datetime64 time unit"); default: @@ -173,32 +170,25 @@ Status NumPyDtypeToArrow(PyArray_Descr* descr, std::shared_ptr* out) { } break; case NPY_TIMEDELTA: { auto timedelta_dtype = - reinterpret_cast(descr->c_metadata); + reinterpret_cast(PyDataType_C_METADATA(descr)); switch (timedelta_dtype->meta.base) { case NPY_FR_s: - *out = duration(TimeUnit::SECOND); - break; + return duration(TimeUnit::SECOND); case NPY_FR_ms: - *out = duration(TimeUnit::MILLI); - break; + return duration(TimeUnit::MILLI); case NPY_FR_us: - *out = duration(TimeUnit::MICRO); - break; + return 
duration(TimeUnit::MICRO); case NPY_FR_ns: - *out = duration(TimeUnit::NANO); - break; + return duration(TimeUnit::NANO); case NPY_FR_GENERIC: return Status::NotImplemented("Unbound or generic timedelta64 time unit"); default: return Status::NotImplemented("Unsupported timedelta64 time unit"); } } break; - default: { - return Status::NotImplemented("Unsupported numpy type ", descr->type_num); - } } - return Status::OK(); + return Status::NotImplemented("Unsupported numpy type ", descr->type_num); } #undef TO_ARROW_TYPE_CASE @@ -230,9 +220,8 @@ Status NdarrayToTensor(MemoryPool* pool, PyObject* ao, strides[i] = array_strides[i]; } - std::shared_ptr type; - RETURN_NOT_OK( - GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray)), &type)); + ARROW_ASSIGN_OR_RAISE( + auto type, GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray)))); *out = std::make_shared(type, data, shape, strides, dim_names); return Status::OK(); } @@ -435,9 +424,9 @@ Status NdarraysToSparseCOOTensor(MemoryPool* pool, PyObject* data_ao, PyObject* PyArrayObject* ndarray_data = reinterpret_cast(data_ao); std::shared_ptr data = std::make_shared(data_ao); - std::shared_ptr type_data; - RETURN_NOT_OK(GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray_data)), - &type_data)); + ARROW_ASSIGN_OR_RAISE( + auto type_data, + GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray_data)))); std::shared_ptr coords; RETURN_NOT_OK(NdarrayToTensor(pool, coords_ao, {}, &coords)); @@ -462,9 +451,9 @@ Status NdarraysToSparseCSXMatrix(MemoryPool* pool, PyObject* data_ao, PyObject* PyArrayObject* ndarray_data = reinterpret_cast(data_ao); std::shared_ptr data = std::make_shared(data_ao); - std::shared_ptr type_data; - RETURN_NOT_OK(GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray_data)), - &type_data)); + ARROW_ASSIGN_OR_RAISE( + auto type_data, + GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray_data)))); std::shared_ptr indptr, indices; RETURN_NOT_OK(NdarrayToTensor(pool, indptr_ao, {}, &indptr)); @@ -491,9 
+480,9 @@ Status NdarraysToSparseCSFTensor(MemoryPool* pool, PyObject* data_ao, PyObject* const int ndim = static_cast(shape.size()); PyArrayObject* ndarray_data = reinterpret_cast(data_ao); std::shared_ptr data = std::make_shared(data_ao); - std::shared_ptr type_data; - RETURN_NOT_OK(GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray_data)), - &type_data)); + ARROW_ASSIGN_OR_RAISE( + auto type_data, + GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray_data)))); std::vector> indptr(ndim - 1); std::vector> indices(ndim); diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_convert.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_convert.h old mode 100644 new mode 100755 similarity index 94% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_convert.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_convert.h index 10451077a..2d1086e13 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_convert.h +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_convert.h @@ -49,9 +49,11 @@ class ARROW_PYTHON_EXPORT NumPyBuffer : public Buffer { }; ARROW_PYTHON_EXPORT -Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr* out); +Result> NumPyDtypeToArrow(PyObject* dtype); ARROW_PYTHON_EXPORT -Status NumPyDtypeToArrow(PyArray_Descr* descr, std::shared_ptr* out); +Result> NumPyDtypeToArrow(PyArray_Descr* descr); +ARROW_PYTHON_EXPORT +Result> NumPyScalarToArrowDataType(PyObject* scalar); ARROW_PYTHON_EXPORT Status NdarrayToTensor(MemoryPool* pool, PyObject* ao, const std::vector& dim_names, diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_internal.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_internal.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_internal.h rename to 
cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_internal.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_interop.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_interop.h old mode 100644 new mode 100755 similarity index 92% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_interop.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_interop.h index ce7baed25..7ea7d6e16 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_interop.h +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_interop.h @@ -67,6 +67,13 @@ #define NPY_INT32_IS_INT 0 #endif +// Backported NumPy 2 API (can be removed if numpy 2 is required) +#if NPY_ABI_VERSION < 0x02000000 +#define PyDataType_ELSIZE(descr) ((descr)->elsize) +#define PyDataType_C_METADATA(descr) ((descr)->c_metadata) +#define PyDataType_FIELDS(descr) ((descr)->fields) +#endif + namespace arrow { namespace py { diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_to_arrow.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_to_arrow.cc old mode 100644 new mode 100755 similarity index 96% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_to_arrow.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_to_arrow.cc index 2727ce32f..460b1d0ce --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_to_arrow.cc +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_to_arrow.cc @@ -196,7 +196,7 @@ class NumPyConverter { mask_ = reinterpret_cast(mo); } length_ = static_cast(PyArray_SIZE(arr_)); - itemsize_ = static_cast(PyArray_DESCR(arr_)->elsize); + itemsize_ = static_cast(PyArray_ITEMSIZE(arr_)); stride_ = static_cast(PyArray_STRIDES(arr_)[0]); } @@ -296,7 +296,7 @@ class NumPyConverter { PyArrayObject* mask_; int64_t length_; int64_t 
stride_; - int itemsize_; + int64_t itemsize_; bool from_pandas_; compute::CastOptions cast_options_; @@ -462,8 +462,7 @@ template inline Status NumPyConverter::ConvertData(std::shared_ptr* data) { RETURN_NOT_OK(PrepareInputData(data)); - std::shared_ptr input_type; - RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast(dtype_), &input_type)); + ARROW_ASSIGN_OR_RAISE(auto input_type, NumPyDtypeToArrow(dtype_)); if (!input_type->Equals(*type_)) { RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_, type_, @@ -479,7 +478,8 @@ inline Status NumPyConverter::ConvertData(std::shared_ptr* d RETURN_NOT_OK(PrepareInputData(data)); - auto date_dtype = reinterpret_cast(dtype_->c_metadata); + auto date_dtype = + reinterpret_cast(PyDataType_C_METADATA(dtype_)); if (dtype_->type_num == NPY_DATETIME) { // If we have inbound datetime64[D] data, this needs to be downcasted // separately here from int64_t to int32_t, because this data is not @@ -490,7 +490,7 @@ inline Status NumPyConverter::ConvertData(std::shared_ptr* d Status s = StaticCastBuffer(**data, length_, pool_, data); RETURN_NOT_OK(s); } else { - RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast(dtype_), &input_type)); + ARROW_ASSIGN_OR_RAISE(input_type, NumPyDtypeToArrow(dtype_)); if (!input_type->Equals(*type_)) { // The null bitmap was already computed in VisitNative() RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_, @@ -498,7 +498,7 @@ inline Status NumPyConverter::ConvertData(std::shared_ptr* d } } } else { - RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast(dtype_), &input_type)); + ARROW_ASSIGN_OR_RAISE(input_type, NumPyDtypeToArrow(dtype_)); if (!input_type->Equals(*type_)) { RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_, type_, cast_options_, pool_, data)); @@ -515,7 +515,8 @@ inline Status NumPyConverter::ConvertData(std::shared_ptr* d RETURN_NOT_OK(PrepareInputData(data)); - auto date_dtype = 
reinterpret_cast(dtype_->c_metadata); + auto date_dtype = + reinterpret_cast(PyDataType_C_METADATA(dtype_)); if (dtype_->type_num == NPY_DATETIME) { // If we have inbound datetime64[D] data, this needs to be downcasted // separately here from int64_t to int32_t, because this data is not @@ -531,7 +532,7 @@ inline Status NumPyConverter::ConvertData(std::shared_ptr* d } *data = std::move(result); } else { - RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast(dtype_), &input_type)); + ARROW_ASSIGN_OR_RAISE(input_type, NumPyDtypeToArrow(dtype_)); if (!input_type->Equals(*type_)) { // The null bitmap was already computed in VisitNative() RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_, @@ -539,7 +540,7 @@ inline Status NumPyConverter::ConvertData(std::shared_ptr* d } } } else { - RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast(dtype_), &input_type)); + ARROW_ASSIGN_OR_RAISE(input_type, NumPyDtypeToArrow(dtype_)); if (!input_type->Equals(*type_)) { RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_, type_, cast_options_, pool_, data)); @@ -629,11 +630,11 @@ namespace { // NumPy unicode is UCS4/UTF32 always constexpr int kNumPyUnicodeSize = 4; -Status AppendUTF32(const char* data, int itemsize, int byteorder, +Status AppendUTF32(const char* data, int64_t itemsize, int byteorder, ::arrow::internal::ChunkedStringBuilder* builder) { // The binary \x00\x00\x00\x00 indicates a nul terminator in NumPy unicode, // so we need to detect that here to truncate if necessary. Yep. 
- int actual_length = 0; + Py_ssize_t actual_length = 0; for (; actual_length < itemsize / kNumPyUnicodeSize; ++actual_length) { const char* code_point = data + actual_length * kNumPyUnicodeSize; if ((*code_point == '\0') && (*(code_point + 1) == '\0') && @@ -706,7 +707,7 @@ Status NumPyConverter::Visit(const StringType& type) { auto AppendNonNullValue = [&](const uint8_t* data) { if (is_binary_type) { if (ARROW_PREDICT_TRUE(util::ValidateUTF8(data, itemsize_))) { - return builder.Append(data, itemsize_); + return builder.Append(data, static_cast(itemsize_)); } else { return Status::Invalid("Encountered non-UTF8 binary value: ", HexEncode(data, itemsize_)); @@ -751,12 +752,13 @@ Status NumPyConverter::Visit(const StructType& type) { PyAcquireGIL gil_lock; // Create converters for each struct type field - if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) { + if (PyDataType_FIELDS(dtype_) == NULL || !PyDict_Check(PyDataType_FIELDS(dtype_))) { return Status::TypeError("Expected struct array"); } for (auto field : type.fields()) { - PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str()); + PyObject* tup = + PyDict_GetItemString(PyDataType_FIELDS(dtype_), field->name().c_str()); if (tup == NULL) { return Status::Invalid("Missing field '", field->name(), "' in struct array"); } diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_to_arrow.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_to_arrow.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/numpy_to_arrow.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/numpy_to_arrow.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/parquet_encryption.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/parquet_encryption.cc old mode 100644 new mode 100755 similarity index 100% rename from 
cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/parquet_encryption.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/parquet_encryption.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/parquet_encryption.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/parquet_encryption.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/parquet_encryption.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/parquet_encryption.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pch.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pch.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pch.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pch.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/platform.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/platform.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/platform.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/platform.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pyarrow.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pyarrow.cc old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pyarrow.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pyarrow.cc diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pyarrow.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pyarrow.h old mode 100644 new mode 100755 similarity index 100% rename from 
cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pyarrow.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pyarrow.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pyarrow_api.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pyarrow_api.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pyarrow_api.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pyarrow_api.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pyarrow_lib.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pyarrow_lib.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/pyarrow_lib.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/pyarrow_lib.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_test.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_test.cc old mode 100644 new mode 100755 similarity index 98% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_test.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_test.cc index 01ab8a303..746bf4109 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_test.cc +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_test.cc @@ -174,10 +174,14 @@ Status TestOwnedRefNoGILMoves() { } } -std::string FormatPythonException(const std::string& exc_class_name) { +std::string FormatPythonException(const std::string& exc_class_name, + const std::string& exc_value) { std::stringstream ss; ss << "Python exception: "; ss << exc_class_name; + ss << ": "; + ss << exc_value; + ss << "\n"; return ss.str(); } @@ -205,7 +209,8 @@ Status TestCheckPyErrorStatus() { } PyErr_SetString(PyExc_TypeError, "some error"); - 
ASSERT_OK(check_error(st, "some error", FormatPythonException("TypeError"))); + ASSERT_OK( + check_error(st, "some error", FormatPythonException("TypeError", "some error"))); ASSERT_TRUE(st.IsTypeError()); PyErr_SetString(PyExc_ValueError, "some error"); @@ -223,7 +228,8 @@ Status TestCheckPyErrorStatus() { } PyErr_SetString(PyExc_NotImplementedError, "some error"); - ASSERT_OK(check_error(st, "some error", FormatPythonException("NotImplementedError"))); + ASSERT_OK(check_error(st, "some error", + FormatPythonException("NotImplementedError", "some error"))); ASSERT_TRUE(st.IsNotImplemented()); // No override if a specific status code is given @@ -246,7 +252,8 @@ Status TestCheckPyErrorStatusNoGIL() { lock.release(); ASSERT_TRUE(st.IsUnknownError()); ASSERT_EQ(st.message(), "zzzt"); - ASSERT_EQ(st.detail()->ToString(), FormatPythonException("ZeroDivisionError")); + ASSERT_EQ(st.detail()->ToString(), + FormatPythonException("ZeroDivisionError", "zzzt")); return Status::OK(); } } @@ -257,7 +264,7 @@ Status TestRestorePyErrorBasics() { ASSERT_FALSE(PyErr_Occurred()); ASSERT_TRUE(st.IsUnknownError()); ASSERT_EQ(st.message(), "zzzt"); - ASSERT_EQ(st.detail()->ToString(), FormatPythonException("ZeroDivisionError")); + ASSERT_EQ(st.detail()->ToString(), FormatPythonException("ZeroDivisionError", "zzzt")); RestorePyError(st); ASSERT_TRUE(PyErr_Occurred()); diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_test.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_test.h old mode 100644 new mode 100755 similarity index 100% rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_test.h rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_test.h diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_to_arrow.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_to_arrow.cc old mode 100644 new mode 100755 similarity index 95% rename 
from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_to_arrow.cc rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_to_arrow.cc index 23b92598e..79da47567 --- a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_to_arrow.cc +++ b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_to_arrow.cc @@ -386,8 +386,7 @@ class PyValue { } } else if (PyArray_CheckAnyScalarExact(obj)) { // validate that the numpy scalar has np.datetime64 dtype - std::shared_ptr numpy_type; - RETURN_NOT_OK(NumPyDtypeToArrow(PyArray_DescrFromScalar(obj), &numpy_type)); + ARROW_ASSIGN_OR_RAISE(auto numpy_type, NumPyScalarToArrowDataType(obj)); if (!numpy_type->Equals(*type)) { return Status::NotImplemented("Expected np.datetime64 but got: ", numpy_type->ToString()); @@ -406,7 +405,7 @@ class PyValue { RETURN_NOT_OK(PopulateMonthDayNano::Field( obj, &output.months, &found_attrs)); // on relativeoffset weeks is a property calculated from days. On - // DateOffset is is a field on its own. timedelta doesn't have a weeks + // DateOffset is a field on its own. timedelta doesn't have a weeks // attribute. 
PyObject* pandas_date_offset_type = internal::BorrowPandasDataOffsetType(); bool is_date_offset = pandas_date_offset_type == (PyObject*)Py_TYPE(obj); @@ -466,8 +465,7 @@ class PyValue { } } else if (PyArray_CheckAnyScalarExact(obj)) { // validate that the numpy scalar has np.datetime64 dtype - std::shared_ptr numpy_type; - RETURN_NOT_OK(NumPyDtypeToArrow(PyArray_DescrFromScalar(obj), &numpy_type)); + ARROW_ASSIGN_OR_RAISE(auto numpy_type, NumPyScalarToArrowDataType(obj)); if (!numpy_type->Equals(*type)) { return Status::NotImplemented("Expected np.timedelta64 but got: ", numpy_type->ToString()); @@ -488,6 +486,10 @@ class PyValue { return view.ParseString(obj); } + static Status Convert(const BinaryViewType*, const O&, I obj, PyBytesView& view) { + return view.ParseString(obj); + } + static Status Convert(const FixedSizeBinaryType* type, const O&, I obj, PyBytesView& view) { ARROW_RETURN_NOT_OK(view.ParseString(obj)); @@ -501,8 +503,8 @@ class PyValue { } template - static enable_if_string Convert(const T*, const O& options, I obj, - PyBytesView& view) { + static enable_if_t::value || is_string_view_type::value, Status> + Convert(const T*, const O& options, I obj, PyBytesView& view) { if (options.strict) { // Strict conversion, force output to be unicode / utf8 and validate that // any binary values are utf8 @@ -572,20 +574,15 @@ struct PyConverterTrait; template struct PyConverterTrait< - T, - enable_if_t<(!is_nested_type::value && !is_interval_type::value && - !is_extension_type::value && !is_binary_view_like_type::value) || - std::is_same::value>> { + T, enable_if_t<(!is_nested_type::value && !is_interval_type::value && + !is_extension_type::value) || + std::is_same::value>> { using type = PyPrimitiveConverter; }; template -struct PyConverterTrait> { - // not implemented -}; - -template -struct PyConverterTrait> { +struct PyConverterTrait< + T, enable_if_t::value || is_list_view_type::value>> { using type = PyListConverter; }; @@ -701,11 +698,22 @@ class 
PyPrimitiveConverter:: PyBytesView view_; }; +template +struct OffsetTypeTrait { + using type = typename T::offset_type; +}; + template -class PyPrimitiveConverter> +struct OffsetTypeTrait> { + using type = int64_t; +}; + +template +class PyPrimitiveConverter< + T, enable_if_t::value || is_binary_view_like_type::value>> : public PrimitiveConverter { public: - using OffsetType = typename T::offset_type; + using OffsetType = typename OffsetTypeTrait::type; Status Append(PyObject* value) override { if (PyValue::IsNull(this->options_, value)) { @@ -796,7 +804,6 @@ class PyListConverter : public ListConverter { return this->list_builder_->AppendNull(); } - RETURN_NOT_OK(this->list_builder_->Append()); if (PyArray_Check(value)) { RETURN_NOT_OK(AppendNdarray(value)); } else if (PySequence_Check(value)) { @@ -817,6 +824,21 @@ class PyListConverter : public ListConverter { } protected: + // MapType does not support args in the Append() method + Status AppendTo(const MapType*, int64_t size) { return this->list_builder_->Append(); } + + // FixedSizeListType does not support args in the Append() method + Status AppendTo(const FixedSizeListType*, int64_t size) { + return this->list_builder_->Append(); + } + + // ListType requires the size argument in the Append() method + // in order to be convertible to a ListViewType. ListViewType + // requires the size argument in the Append() method always. 
+ Status AppendTo(const BaseListType*, int64_t size) { + return this->list_builder_->Append(true, size); + } + Status ValidateBuilder(const MapType*) { if (this->list_builder_->key_builder()->null_count() > 0) { return Status::Invalid("Invalid Map: key field cannot contain null values"); @@ -829,11 +851,14 @@ class PyListConverter : public ListConverter { Status AppendSequence(PyObject* value) { int64_t size = static_cast(PySequence_Size(value)); + RETURN_NOT_OK(AppendTo(this->list_type_, size)); RETURN_NOT_OK(this->list_builder_->ValidateOverflow(size)); return this->value_converter_->Extend(value, size); } Status AppendIterable(PyObject* value) { + auto size = static_cast(PyObject_Size(value)); + RETURN_NOT_OK(AppendTo(this->list_type_, size)); PyObject* iterator = PyObject_GetIter(value); OwnedRef iter_ref(iterator); while (PyObject* item = PyIter_Next(iterator)) { @@ -850,6 +875,7 @@ class PyListConverter : public ListConverter { return Status::Invalid("Can only convert 1-dimensional array values"); } const int64_t size = PyArray_SIZE(ndarray); + RETURN_NOT_OK(AppendTo(this->list_type_, size)); RETURN_NOT_OK(this->list_builder_->ValidateOverflow(size)); const auto value_type = this->value_converter_->builder()->type(); @@ -1043,7 +1069,8 @@ class PyStructConverter : public StructConverter case KeyKind::BYTES: return AppendDict(dict, bytes_field_names_.obj()); default: - RETURN_NOT_OK(InferKeyKind(PyDict_Items(dict))); + OwnedRef item_ref(PyDict_Items(dict)); + RETURN_NOT_OK(InferKeyKind(item_ref.obj())); if (key_kind_ == KeyKind::UNKNOWN) { // was unable to infer the type which means that all keys are absent return AppendEmpty(); @@ -1089,6 +1116,7 @@ class PyStructConverter : public StructConverter Result> GetKeyValuePair(PyObject* seq, int index) { PyObject* pair = PySequence_GetItem(seq, index); RETURN_IF_PYERROR(); + OwnedRef pair_ref(pair); // ensure reference count is decreased at scope end if (!PyTuple_Check(pair) || PyTuple_Size(pair) != 2) { return 
internal::InvalidType(pair, "was expecting tuple of (key, value) pair");
}
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_to_arrow.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_to_arrow.h
old mode 100644
new mode 100755
similarity index 100%
rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/python_to_arrow.h
rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/python_to_arrow.h
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/serialize.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/serialize.cc
old mode 100644
new mode 100755
similarity index 100%
rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/serialize.cc
rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/serialize.cc
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/serialize.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/serialize.h
old mode 100644
new mode 100755
similarity index 100%
rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/serialize.h
rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/serialize.h
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/type_traits.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/type_traits.h
old mode 100644
new mode 100755
similarity index 100%
rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/type_traits.h
rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/type_traits.h
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/udf.cc b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/udf.cc
old mode 100644
new mode 100755
similarity index 100%
rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/udf.cc
rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/udf.cc
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/udf.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/udf.h
old mode 100644
new mode 100755
similarity index 100%
rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/udf.h
rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/udf.h
diff --git a/cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/visibility.h b/cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/visibility.h
old mode 100644
new mode 100755
similarity index 100%
rename from cpp/csp/python/adapters/vendored/pyarrow-15.0.0/arrow/python/visibility.h
rename to cpp/csp/python/adapters/vendored/pyarrow-16.0.0/arrow/python/visibility.h
diff --git a/setup.py b/setup.py
index d8ec75779..a3f3c6e7c 100644
--- a/setup.py
+++ b/setup.py
@@ -38,6 +38,7 @@
     args = ["install"]
     if VCPKG_TRIPLET is not None:
         args.append(f"--triplet={VCPKG_TRIPLET}")
+
     if os.name == "nt":
         subprocess.call(["bootstrap-vcpkg.bat"], cwd="vcpkg", shell=True)
         subprocess.call(["vcpkg.bat"] + args, cwd="vcpkg", shell=True)
diff --git a/vcpkg.json b/vcpkg.json
index a54276c50..cb6bfdb18 100644
--- a/vcpkg.json
+++ b/vcpkg.json
@@ -25,7 +25,8 @@
     "overrides": [
         {
             "name": "arrow",
-            "version": "15.0.0"
+            "version": "16.0.0",
+            "port-version" : 1
         }
     ],
     "builtin-baseline": "04b0cf2b3fd1752d3c3db969cbc10ba0a4613cee"

From d6479c5b30d198f5ca5e78286a0c22d23d843e96 Mon Sep 17 00:00:00 2001
From: Adam Glustein
Date: Thu, 9 May 2024 16:11:48 -0400
Subject: [PATCH 26/27] Upgrade CSP to C++20; build websocket against C++17;
 rename .hi files (#224)

Signed-off-by: Adam Glustein
---
 CMakeLists.txt                                      |  5 +----
 cpp/csp/adapters/websocket/CMakeLists.txt           |  3 +++
 cpp/csp/python/CMakeLists.txt                       | 10 +++++-----
 cpp/csp/python/PyStruct.cpp                         |  2 +-
 cpp/csp/python/PyStructList.h                       |  2 +-
 .../python/{PyStructList.hi => PyStructList_impl.h} |  2 +-
 cpp/csp/python/adapters/CMakeLists.txt              |  8 +++++---
 7 files changed, 17 insertions(+), 15 deletions(-)
 rename cpp/csp/python/{PyStructList.hi => PyStructList_impl.h} (99%)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 6699ae74d..97cf3c088 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -3,7 +3,7 @@
 #########################
 cmake_minimum_required(VERSION 3.20.0)
 project(csp VERSION "0.0.3")
-set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD 20)
 ###################################################################################################################################################
 # CMake Dependencies #
@@ -153,9 +153,6 @@
 endif()
 ###################################################################################################################################################
 # Flags #
 #########
-# Compiler version flags
-set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++17")
-
 # Optimization Flags
 if(WIN32)
     if(CMAKE_BUILD_TYPE_LOWER STREQUAL debug)
diff --git a/cpp/csp/adapters/websocket/CMakeLists.txt b/cpp/csp/adapters/websocket/CMakeLists.txt
index 73513d975..879a156c4 100644
--- a/cpp/csp/adapters/websocket/CMakeLists.txt
+++ b/cpp/csp/adapters/websocket/CMakeLists.txt
@@ -1,5 +1,8 @@
 csp_autogen( csp.adapters.websocket_types websocket_types WEBSOCKET_HEADER WEBSOCKET_SOURCE )
+# Need to build websocket adapter under cpp17 standard due to websocketpp incompatibility issues
+set(CMAKE_CXX_STANDARD 17)
+
 set(WS_CLIENT_HEADER_FILES
     ClientAdapterManager.h
     ClientInputAdapter.h
diff --git a/cpp/csp/python/CMakeLists.txt b/cpp/csp/python/CMakeLists.txt
index 13f934f12..ae4328ada 100644
--- a/cpp/csp/python/CMakeLists.txt
+++ b/cpp/csp/python/CMakeLists.txt
@@ -4,7 +4,8 @@ set(CSPTYPESIMPL_PUBLIC_HEADERS
     PyCspEnum.h
     PyCspType.h
     PyStruct.h
-    PyStructList.h)
+    PyStructList.h
+    PyStructList_impl.h)

 add_library(csptypesimpl
     csptypesimpl.cpp
@@ -12,8 +13,7 @@ add_library(csptypesimpl
     PyCspEnum.cpp
     PyCspType.cpp
     PyStruct.cpp
-    PyStructToJson.cpp
-    PyStructList.hi)
+    PyStructToJson.cpp)
 set_target_properties(csptypesimpl PROPERTIES PUBLIC_HEADER "${CSPTYPESIMPL_PUBLIC_HEADERS}")
 target_compile_definitions(csptypesimpl PUBLIC RAPIDJSON_HAS_STDSTRING=1)
 target_link_libraries(csptypesimpl csp_core csp_types)
@@ -42,7 +42,8 @@ set(CSPIMPL_PUBLIC_HEADERS
     PyOutputProxy.h
     PyConstants.h
     PyStructToJson.h
-    PyStructList.h)
+    PyStructList.h
+    PyStructList_impl.h)

 add_library(cspimpl SHARED
     cspimpl.cpp
@@ -73,7 +74,6 @@ add_library(cspimpl SHARED
     PyManagedSimInputAdapter.cpp
     PyTimerAdapter.cpp
     PyConstants.cpp
-    PyStructList.hi
     ${CSPIMPL_PUBLIC_HEADERS})

 set_target_properties(cspimpl PROPERTIES PUBLIC_HEADER "${CSPIMPL_PUBLIC_HEADERS}")
diff --git a/cpp/csp/python/PyStruct.cpp b/cpp/csp/python/PyStruct.cpp
index ccb821311..63e65a26d 100644
--- a/cpp/csp/python/PyStruct.cpp
+++ b/cpp/csp/python/PyStruct.cpp
@@ -3,7 +3,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 #include
 #include
diff --git a/cpp/csp/python/PyStructList.h b/cpp/csp/python/PyStructList.h
index b26116c06..7a8944ed0 100644
--- a/cpp/csp/python/PyStructList.h
+++ b/cpp/csp/python/PyStructList.h
@@ -14,7 +14,7 @@ struct PyStructList : public PyObject
 {
     using ElemT = typename CspType::Type::toCArrayElemType::type;

-    PyStructList( PyStruct * p, std::vector & v, const CspType & type ) : pystruct( p ), vector( v ), field_type( type )
+    PyStructList( PyStruct * p, std::vector & v, const CspType & type ) : pystruct( p ), vector( v ), field_type( type )
     {
         Py_INCREF( pystruct );
     }
diff --git a/cpp/csp/python/PyStructList.hi b/cpp/csp/python/PyStructList_impl.h
similarity index 99%
rename from cpp/csp/python/PyStructList.hi
rename to cpp/csp/python/PyStructList_impl.h
index 7e09a5a2d..884cf3ea0 100644
--- a/cpp/csp/python/PyStructList.hi
+++ b/cpp/csp/python/PyStructList_impl.h
@@ -435,7 +435,7 @@ PyStructList_dealloc( PyStructList * self )
 }

 template PyTypeObject PyStructList::PyType = {
-    PyVarObject_HEAD_INIT( NULL, 0 )
+    .ob_base = PyVarObject_HEAD_INIT( NULL, 0 )
     .tp_name = "_cspimpl.PyStructList",
     .tp_basicsize = sizeof( PyStructList ),
     .tp_itemsize = 0,
diff --git a/cpp/csp/python/adapters/CMakeLists.txt b/cpp/csp/python/adapters/CMakeLists.txt
index 512182c42..828e67da9 100644
--- a/cpp/csp/python/adapters/CMakeLists.txt
+++ b/cpp/csp/python/adapters/CMakeLists.txt
@@ -35,7 +35,9 @@ if(CSP_BUILD_PARQUET_ADAPTER)
 endif()

 if(CSP_BUILD_WS_CLIENT_ADAPTER)
-    add_library(websocketadapterimpl SHARED websocketadapterimpl.cpp)
-    target_link_libraries(websocketadapterimpl csp_core csp_engine cspimpl csp_websocket_client_adapter)
-    install(TARGETS websocketadapterimpl RUNTIME DESTINATION bin/ LIBRARY DESTINATION lib/)
+    set(CMAKE_CXX_STANDARD 17)
+    add_library(websocketadapterimpl SHARED websocketadapterimpl.cpp)
+    target_link_libraries(websocketadapterimpl csp_core csp_engine cspimpl csp_websocket_client_adapter)
+    install(TARGETS websocketadapterimpl RUNTIME DESTINATION bin/ LIBRARY DESTINATION lib/)
+    set(CMAKE_CXX_STANDARD 20)
 endif()

From b99ee3cc6ffdc5aed7ef695c542a39299f2e1439 Mon Sep 17 00:00:00 2001
From: M Bussonnier
Date: Mon, 13 May 2024 13:41:26 +0200
Subject: [PATCH 27/27] Rewritten branch

Empty commit.

This is an empty commit; it is kept for informational purposes.

This is a rewritten branch that squashes the content of PR #142 into a
single commit (keeping a merge commit).

Due to the Merkle-tree nature of git, all subsequent commits have been
rewritten and have different hashes, so a force push will be needed.

For completeness, here is the current list of commits on main that have
been rewritten, with their new counterparts:

old     new
e1e1f82 7b3c3a7 : Python 3.12 build support (#221)
a32cef3 2abdf59 : Update vcpkg baseline (#209)
8bae523 c51a86b : Merge pull request #189 from Point72/pavithraes/fix-links
0964fad 7a25f45 : Merge pull request #219 from Point72/wrr/fix_ws_json_mapper
755debf cc60b87 : Merge pull request #191 from Point72/ac/fix_to_json_parsing_floats
5c7e55f ba90a02 : Merge pull request #200 from Point72/tkp/docs
2568689 abcd307 : Move websocket example after merge
f9b7e62 da4d5e3 : fix format changes that will now result in lint failures
687dc2e d6479c5 : Upgrade CSP to C++20; build websocket against C++17; rename .hi files (#224)
9b442d5 8adb1a7 : Upgrade baseline in vcpkg.json
92b7e34 d9ac41d : Pin linters to narrow range to avoid noise
ee1aaf2 f11b8fc : PushPullInputAdapter - fix to previous patch that fixed out of order time handling. Need to account for the null event which signifies end of replay
fef2fac 68927fe : Update to arrow / pyarrow 16 (#210)
4717f54 086c9d5 : Add placeholder block to build action for service tests (in another PR)
306a530 ef1a239 : Merge pull request #195 from Point72/bugfix/push_pull_ooo_patch
fc239d5 4ea4799 : Maintain the type of a list-derived object when converting a struct in to_dict (#199)
adc79fd f6b0963 : Parse None natively in to_json method
3871e4a df3bb2d : Merge pull request #196 from Point72/revert-194-ac/upgrade_vcpkg_43d81795a
6c57fb3 8a0d881 : Remove all caching code from CSP (#213)
9729984 e70e4d7 : Merge pull request #174 from Point72/tkp/slacktut
964c77e 12cceeb : Merge pull request #223 from Point72/tkp/lint
801aa60 86c4e6d : Revert "Upgrade baseline in vcpkg.json"
7b07bea 64559cb : Include AS statement in SQL build query regardless of sqlalchemy version (#205)
4621584 5103026 : Re-apply lost updates in dev guides (#202)
5cde2c7 68dda2f : Merge pull request #192 from Point72/ac/parse_none_to_json
6b4f38b efe0fd6 : Add format check to lint step
4a7dc36 eaec4a0 : Update baseline to stable version
c5acdc8 3ca95b3 : Merge pull request #194 from Point72/ac/upgrade_vcpkg_43d81795a
00c2a06 471e142 : Run autofixers with pinned up packages
b748dc7 7441fb5 : Add build-debug option to Makefile so we dont forget the proper incantations (#222)
24c5818 89eda4c : fix @217 | add tests
96a47b9 118a2fb : Fix to_json serialization for floats
895563c 78e7aca : Merge pull request #211 from Point72/tkp/checklint
063b137 e2a4f2b : Fix interrupt handling issues in csp: ensure first node is stopped and reset signaled flag across runs (#206)
7197c77 da9a84f : minor bugfix to unroll cppimpl. Missing cast from vector value to ElemT, which for bool would be a vector value of unsigned char. This was triggering a CSP_ASSERT in debug builds

The exception is the squashed one:

faee778 : Merge pull request #142 from Point72/pavithraes/docs-restructure

which is a squash of all of #142.

All commits before that, back to 5d7eeb, are unchanged.

You can use the following command, together with this commit message, to
check each of those commits against its counterpart:

```
$ git show HEAD |tail -n +16 |head -n 35 | cut -f 1 -d: | xargs -L1 git diff --stat
```
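The Merkle-tree cascade described in the commit message above can be reproduced in a throwaway repository. This is a hypothetical demo, not part of the patch series; it assumes `git` is installed and on `PATH`. With the author, message, and timestamps of the second commit held fixed, its ID still changes when its parent is rewritten, because a commit ID hashes the parent's ID:

```shell
# Hypothetical throwaway-repo demo: a commit's ID depends on its parent's ID,
# so rewriting an early commit changes the hash of every descendant commit.
set -e
tmp="$(mktemp -d)"
cd "$tmp"
git init -q demo
cd demo
git config user.name "demo"
git config user.email "demo@example.com"
# Pin timestamps so the "second" commit is otherwise byte-identical both times.
export GIT_AUTHOR_DATE="2024-05-13T12:00:00+0000"
export GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
second_before="$(git rev-parse HEAD)"
# Rewrite history: new root commit, then recreate "second" with identical
# content, message, author, and dates -- only the parent differs.
git checkout -q --orphan rewritten
git commit -q --allow-empty -m "first (reworded)"
git commit -q --allow-empty -m "second"
second_after="$(git rev-parse HEAD)"
echo "before: $second_before"
echo "after:  $second_after"
```

The two printed IDs differ, which is why every commit after the squashed one needed a new hash and a force push.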