About Me

My photo
Northglenn, Colorado, United States
I'm primarily a BI Developer on the Microsoft stack. I do sometimes touch upon other Microsoft stacks ( web development, application development, and sql server development).

Thursday, January 12, 2012

SQL CLR Aggregate: Median

Since Sql Server doesn't have an aggregate for Median, I figured, that I'll create my first SQL CLR to handle this problem.

So, I created my intial project, using the .Net Framework 3.5 and C# SQL CLR aggregate template, and started coding:

Code Snippet
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections.Generic;


[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
    Format.UserDefined,
    IsInvariantToDuplicates = false,
    IsInvariantToNulls = false,
    IsInvariantToOrder = false,
    MaxByteSize = 8000)]
public struct Median : IBinarySerialize
{
    //Variables to hold the values;
    private List<double> ld;

    public void Init()
    {
        ld = new List<double>();
    }

    public void Accumulate(SqlDouble Value)
    {
        if (!Value.IsNull)
        {
            ld.Add(Value.Value);
        }
    }

    ///
    /// Merge the partially computed aggregate with this aggregate.
    ///
    /// The other partial results to be merged
    public void Merge(Median Group)
    {
        this.ld.AddRange(Group.ld.ToArray());
    }

    ///
    /// Called at the end of aggregation, to return the results.
    ///
    /// The median of all inputted values
    public SqlDouble Terminate()
    {
        //debug: return (SqlDouble)ld.Count;

        //special case 0 values
        if (ld.Count == 0)
            return SqlDouble.Null;

        ld.Sort();
        int index = (int)ld.Count / 2;

        if (ld.Count % 2 == 0)
        {
            return (SqlDouble)(((double)ld[index] + (double)ld[index - 1]) / 2);
        }
        else
        {
            return (SqlDouble)((double)ld[index]);
        }
    }


    #region IBinarySerialize Members

    public void Read(System.IO.BinaryReader r)
    {
        int cnt = r.ReadInt32();
        this.ld = new List<double>(cnt);
        for(int i = 0; i < cnt; i++)
        {
            this.ld.Add(r.ReadDouble());
        }
    }

    public void Write(System.IO.BinaryWriter w)
    {
        w.Write(this.ld.Count);
        foreach (double d in this.ld)
        {
            w.Write(d);
        }
    }

    #endregion
}


Some of the intial changes that I've made from the default:
1) The SqlUserDefinedAggregate default was Format.Native. The problem with using the default Serializer is that I needed a way to store my values, preferably in a List. By changing the format to Format.UserDefined  -- I was able to create my own serializer. In this case I used the IBinarySerialize interface.

2) I also changed the default IsInvariant attributes.
  • IsInvariantToDuplicates - set to false since I want duplicates.
  • IsInvariantToNull - set to false since I will handle Null values in my code.
  • IsInvariantToOrder - set to false, I can do this in code.

Once the code is compiled, you will need to place the dll on the same machine as the SQL Server is located.

Using the following query to set the assembly and aggregate name:
Code Snippet
IF EXISTS (SELECT name FROM sysobjects WHERE name = 'Median')
   DROP AGGREGATE Median
go

IF EXISTS (SELECT name FROM sys.assemblies WHERE name = 'MyClrCode')
   DROP ASSEMBLY MyClrCode
go


CREATE ASSEMBLY MyClrCode
FROM 'C:\Median.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE AGGREGATE dbo.Median
(@input float)
RETURNS float
EXTERNAL NAME MyClrCode.Median

Once done, time to do some testing. An example:
Code Snippet
--Test 1 value
select
dbo.Median(x.y)
from
(
    select 1 as y
)x
GO

--Test Accuracy
select
dbo.Median(x.y)
from
(
    select 1 as y
    union all
    select 2 as y
)x
GO

11 comments:

Anonymous said...

Tanks for solution but if 2 user call the median function at same time the median function throw System.NullReferenceException.

For reproduce the bug :
select
(select dbo.median(field1) from table),(select dbo.median(field2) from table)

William Andrus said...

I don't have that problem. I did up to 6 dbo.Median calls at the same time before stopping. I did notice that I still have a buffer issue.

I originally had the attribute set at MaxByteSize = 8000, then I changed it to MaxByteSize = -1 which worked for more. I think anything over 1000+ seems to give me a buffer error. (Assuming I deployed my changes to my dev environment)

Unknown said...

I believe you're misunderstanding the invariant flags. IsInvariantToOrder means that it doesnt matter what order the incoming data is in, it will be yield the same result. This allows the optimizer to ignore any order by statement within that aggregate.
A concatenate aggregate would NOT be order invariant, but a sum, avg, and your implementation of median ARE order invariant.

Thanks for the post!

William Andrus said...

I see, the Invariants are used by sql-server query processor to determine the most optimized query execution plan/path.

My execution, doesn't care about the order, so setting it to true -- might actually make it run quicker.

Anonymous said...

I need to implement a clr aggregate median in a project that I am working on, but it doesn't say anything about the legal use of code on your site. Can you send me any licensing restrictions you have on the code that you post on the blog?

William Andrus said...

It's all open and free to use. Nothing propriety here.

William Andrus said...

If need be, you can grab the code also here: https://github.com/andrusw/SQLCLR-Library/blob/master/SQLCLR/Math/Median/Median.cs

Anonymous said...
This comment has been removed by the author.
Anonymous said...

Just to clarify it is a closed source / for profit project. Still ok? Thanks for getting back to me so quick.

William Andrus said...

Jason,

It's okay.

Unknown said...

Hey andrus,
great post. Helped me a lot. Thanks a ton.