Selecting random ids using TOP and a CTE

While testing visualizations in a Flex application, I needed to do some underlying data cleanup in SQL Server.  One of my tasks was to manually update an entity table and set the status column to one of three possibilities.  Status group A and B both needed to be roughly 20% of my tables total record count, and status group C would be the remaining rows that weren’t touched by status A or status B.  Oh and there’s one more thing, the ids in each status group can not be in sequential order, they have to be random.

At first I thought no sweat.  My dataset is still small ( only 2000 rows ), so if we want uber control I could do the math and generate my id lists by hand.  Yes, hand crafting is possible and under a deadline that kind of logic almost makes sense.  However, I already know the table I’m working with will grow in the future, and I’ll probably have to do this data update again, so why not do this right?  While playing around with different select statements I had a “EUREKA!” moment.  SQL Server’s TOP operator supports PERCENT, not just number.  I couldn’t believe it.  I use TOP at least once a week and I always forget about TOP PERCENT.  Since I already know how to select random rows via CTE, it was time to put it all together.

Before giving you the final SQL, here are the important parts to be familiar with.  Also, for the sake of example I’m using the AdventureWorks database so you can play along at home.

TOP PERCENT

If you just need 50% of the rows in a table, but you’re not concerned about the sequence returned, you can fire this query.  This will give you a sequential listing of ProductIDs

[sql]
SELECT TOP 50 PERCENT ProductID
FROM Production.Product
ORDER BY ProductID
[/sql]

Which will look something like this.

SQL's TOP operator returns rows sequentially

COMMON TABLE EXPRESSION

Now let’s say you want to randomly pull all rows from a table.  This can be achieved using this CTE.

[sql]
WITH data( ProductID ) AS (
SELECT ProductID
FROM Production.Product
)
SELECT ProductID
FROM data
ORDER BY NEWID()
[/sql]

Which will look like this

Common Table Expressions in SQLSERVER are super helpful

If you’re looking to randomly select values from a pre-determined list, see my CTE sample here.

So now that you’ve seen TOP PERCENT and CTE in action, it’s time to put these together and solve my initial task of creating randomly selected groups of ids, of a percent size.

RANDOMLY SELECT TOP PERCENT

Putting it all together, here is the query I used to create my first status group.

[sql]
WITH data( ProductID ) AS (
SELECT ProductID
FROM Production.Product
)
SELECT TOP 20 PERCENT ProductID
FROM data
ORDER BY NEWID()
[/sql]

Which gives me a dataset that is 20% of all rows in Production.Product, and the ids are in random order.

And there you have it. Randomly selecting a percent sized data set from a table in SQL Server. The SQL here is really pretty simple, but for some reason I always forget TOP PERCENT. I’m hoping this post will help me remember TOP PERCENT, and maybe even help somebody else with some TSQL.

Leave a comment