A SQL case: Relating groups based on membership relations

The other day when I visited a group page on Facebook and looked at the "Related groups" section I started to think about how that SQL question might be implemented.

image 
Screen capture from the MSDN Sweden Facebook group.

This is not a list that the group owner can control, instead the list automatically shows groups that has a lot of members in common with the group you're currently visiting.

Without any knowledge of Facebooks design for this, let's assume that the database structure is like this, a very ordinary many-to-many relationship:

image

To have some test data I filled these tables with the following data.

image

As you can see the "Comic Book Characters" and the "Superheroes" groups have a lot of members in common while the "Celebrities" group is an island.

So, based on a given GroupId, how can we get a resultset that lists all of the other groups with a column that specifies the amount of shared members? Let's build this step by step, in fact I do this as I'm writing so the final solution is not totally clear to me yet, even though I have some idea. If you want to follow along and run these SQL commands you can download the database file here (this database also includes the final SQL command as a stored procedure).

Let's say that the given GroupId is 2 (so this might come from a querystring or some other input, in these commands the number 2 will be hard coded). This means that the final result we want is this:

GroupId GroupName SharedMemberCount
1 Comic Book Characters 4
3 Celebrities 0

Let's start with selecting all groups except the given one;

SELECT
    GroupId, GroupName
FROM
    Groups
WHERE
    GroupId <> 2

Now, we will need a nested select statement to add the third column, I usually put the outer structure in place, and then I start writing the SELECT statement within the parenthesis:

SELECT
    GroupId, GroupName,
    (SELECT something) AS SharedMemberCount
FROM
    Groups
WHERE
    GroupId <> 2

Let's start with just counting the total numbers of members in each group, to do this the Groups table needs an alias so that we can distinguish the two GroupId columns;

SELECT
    GroupId, GroupName,
    (SELECT
        COUNT(*)
    FROM
        PersonGroupMemberships AS pgm
    WHERE
        pgm.GroupId = g.GroupId    
    ) AS SharedMemberCount
FROM
    Groups AS g
WHERE
    GroupId <> 2

This is the current resultset;

 image

Now we need to filter the inner select so that it only counts the members that are also members in the given group (GroupId=2).

SELECT
    GroupId, GroupName,
    (SELECT
        COUNT(*)
    FROM
        PersonGroupMemberships AS pgm
    WHERE
        pgm.GroupId = g.GroupId
        AND pgm.PersonId IN 
            (SELECT PersonId
            FROM PersonGroupMemberships 
            WHERE GroupId=2)    
    ) AS SharedMemberCount
FROM
    Groups AS g
WHERE
    GroupId <> 2

And really, this was no brain surgery. We simply say that the PersonId of the rows we count has to have a value that corresponds to any PersonId that is a member of group 2 by using the IN statement and yet another nested SELECT statement.

Next we should add sorting so that the groups with the most shared members comes on top, I also add a TOP statement since we probably just want the most related groups, not all groups. These changes wont make any difference with this testdata, but they would if we had more groups:

SELECT TOP 4
    GroupId, GroupName,
    (SELECT
        COUNT(*)
    FROM
        PersonGroupMemberships AS pgm
    WHERE
        pgm.GroupId = g.GroupId
        AND pgm.PersonId IN 
            (SELECT PersonId
            FROM PersonGroupMemberships 
            WHERE GroupId=2)    
    ) AS SharedMemberCount
FROM
    Groups AS g
WHERE
    GroupId <> 2
ORDER BY
    SharedMemberCount DESC

And now we get the resultset we wanted;

 image

To think about
Of course you wouldn't want to display the "Celebrities" group in this case, since it has no members in common with the "Superheroes" group. How could you filter the resultset so that it only includes groups where SharedMemberCount is more than zero? You might wanna use this technique.

(By the way, I have installed SQL Server 2008 Express, and it gives neat intellisense when working with SQL queries! :-)

No comments :

Post a Comment