en-UShe-IL
You are here:   Blog
Register   |  Login

Blog Archive:

Maximize
* Can be used in order to search for older blogs Entries

Search in blogs


Blog Categories:

Maximize
* Can be used in order to search for blogs Entries by Categories

Blog Tags:

Maximize
* Can be used in order to search for blogs by keywords

TNWikiSummit


Awared MVP

 


Microsoft® Community Contributor 


Microsoft® Community Contributor


 Read first, before you you use the blog! Maximize

Recent Entries

Minimize
דצמ23

Written by: ronen ariely
23/12/2015 19:34 RssIcon


Background

Several days ago I stumbled a new blog that was published with what seem to me as partial solution. The original blog based on real customer request, and it discusses on using hash function in order to create a first filter, and then working on the filtered data in order to get the final result as requested (two steps job). The blog included only the first step (as a solution for the client?!?), probably focusing on the main idea of using Hash function and splitting the job into two steps.

In this blog firstly I will discuss a very small improvement to the query in the original post, for the first step. Next, I will show a simple full solution (one-step job) for the client's request. 

* Originally, this post wasn't meant to be published as a blog, but I wanted to send the original author some comments including some options to improve his original solution. I just needed a place to post the queries temporary, so I used my blog interface. As such I did not spend lot of time in order to improve the queries... I am sure that these query can be re-write much better, once we will have all the information of the real case study.

* Using the original DDL+DML My full solution seem to use the same resource as the first step in the original post, according to the exaction plan. As this is not a real case for me, and I do not have the full description of the original client request or any information regarding the client hardware, software, SQL Sever versions & edition, ans so on... this blog will not assumed that this is a better solution, but it's just come in order to give another solution, using different approach.

* A hash function is a "Many-to-One function", which mean it maps single or multiple values in group A (input) into single value In group B (result). It's a deterministic function which mean it always produces the same result for the same input, but it can return the same result for different input.

Our Case tudy

We have simple table named MemberSessionsm which contain MemberId column. The table also includes several other columns, that describes users session parameters. We need to find members with the exact same sessions parameters.

DDL+DML

CREATE SCHEMA Operation;
GO
 
CREATE TABLE Operation.MemberSessions(
       Id              INT    IDENTITY (1,1)       NOT NULL ,
       MemberId        INT                         NOT NULL ,
       LoginDateTime   DATETIME2(0)                NOT NULL ,
       EndDateTime     DATETIME2(0)                NULL ,
       EndReasonId     TINYINT                     NULL ,
       CONSTRAINT pk_MemberSessions_c_Id
       PRIMARY KEY CLUSTERED(Id ASC)
);
GO
 
TRUNCATE TABLE Operation.MemberSessions
INSERT INTO Operation.MemberSessions(
       MemberId ,
       LoginDateTime ,
       EndDateTime ,
       EndReasonId
)
VALUES
       (1     , '2015-12-03 13:01:10'    , '2015-12-03 15:50:53'    , 1) ,
       (1     , '2015-12-08 12:47:53'    , '2015-12-08 13:11:43'    , 1) ,
       (2     , '2015-12-03 13:01:10'    , '2015-12-03 15:50:53'    , 1) ,
       (2     , '2015-12-08 12:47:53'    , '2015-12-08 13:11:43'    , 1) ,
       (3     , '2015-12-14 09:45:01'    , '2015-12-14 09:48:52'    , 1) ,
       (3     , '2015-12-15 10:33:41'    , '2015-12-15 11:02:17'    , 2) ,
       (4     , '2012-08-09 11:17:05'    , '2012-08-09 15:02:07'    , 1) ,
       (5     , '2012-11-23 19:51:44'    , '2012-11-23 20:15:16'    , 1) ,
       (5     , '2014-09-28 04:10:26'    , '2014-09-28 06:00:37'    , 3) ,
       (5     , '2006-10-29 17:10:47'    , '2006-10-29 20:33:40'    , 2);
GO

 

Queries and discussion

The original blog based on the idea to use the built-in function CHECKSUM in order to calculate single integer for each row from all the columns that we wants to compare. Next we use the function CHECKSUM_AGG in order to aggregate these result grouping by the the MemberId column. For each row that we have the same values in all the columns we will get the same result for the hash function CHECKSUM, but even if we have different values we might get the same result. The assumption is that in this approach we can filter the original table, which include lot of data, into a small SET of records. This will not give us the final result, but will insure that the final result is in these rows. Next step, we will have to go over all these result and confirm that these are actually different members with the same session parameters.


WITH MembersWithChecksum(MemberId,ChecksumValue,ChecksumCount) AS (
    SELECT
        MemberId        = MemberId ,
        ChecksumValue   = CHECKSUM_AGG (CHECKSUM (LoginDateTime , EndDateTime , EndReasonId)) ,
        ChecksumCount   = COUNT (*) OVER (PARTITION BY CHECKSUM_AGG (CHECKSUM (LoginDateTime , EndDateTime , EndReasonId)))
    FROM Operation.MemberSessions
    GROUP BY MemberId
)
SELECT
    MemberGroup = DENSE_RANK () OVER (ORDER BY ChecksumValue ASC), --  <<<<== this mean another sort which we do not need
    MemberId    = MemberId
FROM MembersWithChecksum
WHERE ChecksumCount > 1
ORDER BY MemberGroup ASC, MemberId ASC;
GO

As you can see in the code above, I added a comment regarding extra sorting, which we do not need. I recommend a small improvement by removing the use of the DENSE_RANK in the query. There is no reason to replace the "original" ChecksumValue created using hash functions with an ordered MemberGroup. we can stay with the original value as this has nothing to do with the final result.


WITH MembersWithChecksum(MemberId,ChecksumValue,ChecksumCount) AS (
    SELECT
        MemberId        = MemberId ,
        ChecksumValue   = CHECKSUM_AGG (CHECKSUM (LoginDateTime , EndDateTime , EndReasonId)) ,
        ChecksumCount   = COUNT (*) OVER (PARTITION BY CHECKSUM_AGG (CHECKSUM (LoginDateTime , EndDateTime , EndReasonId)))
    FROM Operation.MemberSessions
    GROUP BY MemberId
)
SELECT MemberGroup = ChecksumValue, MemberId
FROM MembersWithChecksum
WHERE ChecksumCount > 1
ORDER BY ChecksumValue ASC, MemberId ASC;
GO



As mentioned the above queries will not fit the client request, and it can only be used as first step. In some cases using this approach of splitting the batch into several steps might improve the performance, especially if we insert the data from previous step into temporary table and indexes it for next step. But in most cases SQL Server query engine will create us good solution and it is much better to execute the query as one step. Let's examine a simple one-step solution.

* My basic approach is to use the RANK function instead of hash function. RANKing the data ordered by all the columns that need to be checked will return the result immediately, as the RANK will give us the MemberGroup value (each row with the same session parameters will get the same RANK). Since we need only the rows with more then one row in the MemberGroup I will filter using the COUNT function. Last step I use ROW_NUMBER in order to select one row for each MemberID.


;with MyCTE as (
    SELECT MemberId, RANK() OVER (order by LoginDateTime, EndDateTime, EndReasonId) R
    from Operation.MemberSessions
),
MyCTE2 as (
    SELECT *, COUNT(R) OVER (partition by R) CR
    from MyCTE
),
MyCTE3 as (
    SELECT MemberId, R as MemberGroup, ROW_NUMBER() OVER (partition by MemberId order by R) RN
    FROM MyCTE2
    where CR > 1
)
SELECT DISTINCT MemberId,MemberGroup
FROM MyCTE3
where RN = 1
GO

* Check the Execution Plan! This final solution in one step show the same resources as step one in the previews solution (the improved one).


Last section :-)

Let's take out case one more step, and check a bit more complex request. In the above queries I returned any MemberID that have at least one session with the same parameters as another MemberID, but what if we need to get only these that all their session is the same as another MemberID? This is not the request, but I find it a good idea to ad this case as well.

I will base by solution on using "FOR XML PATH" in order to concatenate all sessions MemberGroup into one string.

Option 1: Using RANK function


-- looking for users that all their sessions the same!
;with MyCTE as (
    SELECT MemberId, RANK() OVER (order by LoginDateTime, EndDateTime, EndReasonId) R
    from Operation.MemberSessions
)
,MyCTE2 as (
    SELECT
        MemberId,
        STUFF(
            (
                SELECT ', ' + CAST(R AS VARCHAR(MAX))
                FROM MyCTE
                WHERE (MemberId = O.MemberId)
                FOR XML PATH ('')
            ),1,2,''
        ) AS NameValues
    FROM MyCTE O
    GROUP BY MemberId
)
,MyCTE3 as (
    SELECT MemberId, RANK() OVER (order by NameValues) MemberGroup
    from MyCTE2
)
, MyCTE4 as (
    SELECT MemberId, MemberGroup, COUNT(MemberGroup) OVER (partition by MemberGroup) C
    FROM MyCTE3
)
SELECT * FROM MyCTE4
where C > 1
GO


 

Option 2: Using FORMAT function

We can improve the above query dramatically (in some cases) simply by replacing the use of the function RANK with converted values into text and concatenate all the values together, keeping same space for each column (padding with balank spaces if needed).

Instead of using:
RANK() OVER (order by LoginDateTime, EndDateTime, EndReasonId) R
we can use:
FORMAT( LoginDateTime, 'yyyyMMddhhmmssfff', 'en-US' ) + FORMAT( EndDateTime, 'yyyyMMddhhmmssfff', 'en-US' ) + CONVERT(CHAR(3),EndReasonId) as R

-- looking for users that all the same!
;with MyCTE as (
    SELECT
        --Id,
        MemberId, --LoginDateTime, EndDateTime, EndReasonId,
        FORMAT( LoginDateTime, 'yyyyMMddhhmmssfff', 'en-US' ) + FORMAT( EndDateTime, 'yyyyMMddhhmmssfff', 'en-US' ) + CONVERT(CHAR(3),EndReasonId) as R
        --RANK() OVER (order by LoginDateTime, EndDateTime, EndReasonId) R
    from Operation.MemberSessions
    --order by id
)
,MyCTE2 as (
    SELECT
        MemberId,
        STUFF(
            (
                SELECT ', ' + CAST(R AS VARCHAR(MAX))
                FROM MyCTE
                WHERE (MemberId = O.MemberId)
                FOR XML PATH ('')
            ),1,2,''
        ) AS NameValues
    FROM MyCTE O
    GROUP BY MemberId
)
,MyCTE3 as (
    SELECT MemberId, RANK() OVER (order by NameValues) MemberGroup
    from MyCTE2
)
, MyCTE4 as (
    SELECT MemberId, MemberGroup, COUNT(MemberGroup) OVER (partition by MemberGroup) C
    FROM MyCTE3
)
SELECT * FROM MyCTE4
where C > 1
GO


 

That's it for today :-)
I hope this was clear and useful
Ronen,