-
Notifications
You must be signed in to change notification settings - Fork 105
Description
Consider the following USQL (reflecting the issue we've observed in our production development):
@data =
SELECT
*
FROM (VALUES
("A",(decimal?)2, "LabelA","A:1"),
("A",(decimal?)null,"LabelA","A:2"),
("A",(decimal?)1, "LabelA","A:3"),
("B",(decimal?)4, "LabelB","B:1")) AS T(Name, Value, Type, Id);
@result =
SELECT
Name,
SUM(Value) AS Sum,
Type,
AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
FROM @data
GROUP BY Name, Type;
OUTPUT @result TO "/res.csv" USING Outputters.Csv(outputHeader:true);
The aggregator in this case is the the sample custom aggregator from the USQL reference doc (our production code is different, but the problem is demonstrable with the sample UDAGG code):
using Microsoft.Analytics.Interfaces;
namespace AggTest
{
public class genericAggregator : IAggregate<string, string, string>
{
string AggregatedValue;
public override void Init()
{
AggregatedValue = "";
}
public override void Accumulate(string ValueToAgg, string GroupByValue)
{
AggregatedValue += ValueToAgg + ",";
}
public override string Terminate()
{
// remove last comma
return AggregatedValue.Substring(0, AggregatedValue.Length - 1);
}
}
}
When executed, either within an ADLA instance in Azure or using the USQL local run environment within Visual Studio 2017, the result is:
"Name","Sum","Type","RowId"
"A",,"LabelA","A,A,A"
"B",4,"LabelB","B"
The built-in USQL SUM aggregator has returned NULL rather than the expected output of 3 for the row with Name A. Removing the call to the user-defined aggregator returns a rowset with the expected SUM aggregation value of 3:
"Name","Sum","Type"
"A",3,"LabelA"
"B",4,"LabelB"
This is clearly inconsistent behaviour for the SUM aggregate which shouldn't care if a UDAGG is included in the processing of the same group.
Interestingly, if the @Result query is modified to:
@result =
SELECT
Name,
AVG(Value) AS Avg,
SUM(Value) AS Sum,
Type,
AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
FROM @data
GROUP BY Name, Type;
Then the output is:
"Name","Avg","Sum","Type","RowId"
"A",1.5,3,"LabelA","A,A,A"
"B",4,4,"LabelB","B"
In this case the introduction of the AVG aggregator both produces the expected average as well as coaxing the SUM aggregator into also producing the correct answer!
At present we've implemented two workarounds:
- Use the null coalescing operator within the SUM, i.e.:
SUM(Value ?? 0.0m) AS Sum
- Filter the rowset to ensure that NULLs in the field to sum aren't included in the group (this has the disadvantage that if more than one SUM aggregator is used in a single query and each of the fields may have NULLs then this adds complication).
Obviously neither work around is ideal as we may care that the result of a SUM aggregate is actually NULL if there were no non-NULL values to sum in the group.