| 1 | +--- |
| 2 | +title: Apache DataFu-Spark 2.1.0 Released |
| 3 | +author: Eyal Allweil |
| 4 | +license: > |
| 5 | + Licensed to the Apache Software Foundation (ASF) under one or more |
| 6 | + contributor license agreements. See the NOTICE file distributed with |
| 7 | + this work for additional information regarding copyright ownership. |
| 8 | + The ASF licenses this file to You under the Apache License, Version 2.0 |
| 9 | + (the "License"); you may not use this file except in compliance with |
| 10 | + the License. You may obtain a copy of the License at |
| 11 | + |
| 12 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 13 | + |
| 14 | + Unless required by applicable law or agreed to in writing, software |
| 15 | + distributed under the License is distributed on an "AS IS" BASIS, |
| 16 | + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 17 | + See the License for the specific language governing permissions and |
| 18 | + limitations under the License. |
| 19 | +--- |
| 20 | + |
| 21 | +I'd like to announce the release of Apache DataFu-Spark 2.1.0. |
| 22 | + |
| 23 | +In this release, Spark versions 3.0.0 to 3.4.2 are supported. |
| 24 | + |
| 25 | +<br> |
| 26 | + |
| 27 | +**Additions** |
| 28 | + |
| 29 | +* Add dedupByAllExcept method (DATAFU-167). This is a new method for reducing rows when there is one column whose value is not important, but you don't want to lose any actual data from the other columns. For example, suppose a server creates events with an autogenerated event id, and events are sometimes duplicated. You don't want duplicate rows that differ only in their event ids, but if any of the other fields are distinct you want to keep the rows (with their event ids). |
| 30 | + |
| 31 | +* Add collectNumberOrderedElements (DATAFU-176). This is a new UDAF for aggregating and collecting data that may be skewed. For example, suppose you want to create a list of the top customers for a company. Using a window function would require sending all the data for a given company to the same executor. This method instead filters rows out in the combiner stage, so less data has to be sent. |
| 32 | + |
| 33 | +**Improvements** |
| 34 | + |
| 35 | +* Spark 3.0.0 - 3.4.x supported (DATAFU-175, DATAFU-179) |
| 36 | +* Expose dedupRandomN in Python (DATAFU-180) |
| 37 | + |
| 38 | +**Breaking changes** |
| 39 | + |
| 40 | +* The four deprecated classes in SparkUDAFs (MultiSet, MultiArraySet, MapMerge and CountDistinctUpTo) have been removed. They are replaced by new versions which use the Spark Aggregator API. |
| 41 | + |
| 42 | +<br> |
| 43 | + |
| 44 | +The source release can be obtained from: |
| 45 | + |
| 46 | +http://www.apache.org/dyn/closer.cgi/datafu/apache-datafu-2.1.0/apache-datafu-sources-2.1.0.tgz |
| 47 | + |
| 48 | +Artifacts for DataFu are published in Apache's Maven Repository: |
| 49 | + |
| 50 | +https://repository.apache.org/content/groups/public/org/apache/datafu/ |
| 51 | + |
| 52 | +Please visit the [Download](/docs/download.html) page for instructions on building from source or retrieving the artifacts in your build system. |
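
The two additions described above can be illustrated with a minimal sketch in plain Python. This is not the DataFu API: the function names `dedup_by_all_except`, `top_n_partial` and `top_n_merge`, the sample data, and the choice to keep the first row seen for each duplicate group are all assumptions made for illustration. The second part imitates a combiner by trimming each partition to its top `n` rows before the partitions are merged.

```python
from heapq import nlargest
from itertools import chain


def dedup_by_all_except(rows, ignored):
    """Keep one row per distinct combination of every field except `ignored`.

    Hypothetical helper: rows are dicts with hashable values. Keeping the
    first row seen is an assumption, not documented DataFu behaviour.
    """
    kept = {}
    for row in rows:
        # Build the grouping key from every field except the ignored one.
        key = tuple(sorted((k, v) for k, v in row.items() if k != ignored))
        kept.setdefault(key, row)
    return list(kept.values())


def top_n_partial(rows, n, key):
    """Combiner-style step: shrink one partition to at most n rows."""
    return nlargest(n, rows, key=key)


def top_n_merge(partials, n, key):
    """Final step: merge the already-trimmed partitions and keep the top n."""
    return nlargest(n, chain.from_iterable(partials), key=key)


# Events 1 and 2 differ only in their autogenerated id, so one is dropped.
events = [
    {"event_id": 1, "user": "a", "action": "click"},
    {"event_id": 2, "user": "a", "action": "click"},
    {"event_id": 3, "user": "a", "action": "view"},
]
deduped = dedup_by_all_except(events, "event_id")

# Rows are (company, customer, revenue), split across two partitions.
partitions = [
    [("acme", "c1", 50), ("acme", "c2", 10), ("acme", "c5", 1)],
    [("acme", "c3", 70), ("acme", "c4", 5)],
]


def revenue(row):
    return row[2]


partials = [top_n_partial(p, 2, revenue) for p in partitions]
top_customers = top_n_merge(partials, 2, revenue)
```

Because each partition forwards at most `n` rows, the merge step handles at most `n` rows per partition no matter how many rows the company has, which is the point of filtering in the combiner stage.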