Selecting random ids using TOP and a CTE

While testing visualizations in a Flex application, I needed to do some underlying data cleanup in SQL Server.  One of my tasks was to manually update an entity table and set the status column to one of three possibilities.  Status group A and B both needed to be roughly 20% of my tables total record count, and status group C would be the remaining rows that weren’t touched by status A or status B.  Oh and there’s one more thing, the ids in each status group can not be in sequential order, they have to be random.

At first I thought no sweat.  My dataset is still small ( only 2000 rows ), so if we want uber control I could do the math and generate my id lists by hand.  Yes, hand crafting is possible and under a deadline that kind of logic almost makes sense.  However, I already know the table I’m working with will grow in the future, and I’ll probably have to do this data update again, so why not do this right?  While playing around with different select statements I had a “EUREKA!” moment.  SQL Server’s TOP operator supports PERCENT, not just number.  I couldn’t believe it.  I use TOP at least once a week and I always forget about TOP PERCENT.  Since I already know how to select random rows via CTE, it was time to put it all together.

Before giving you the final SQL, here are the important parts to be familiar with.  Also, for the sake of example I’m using the AdventureWorks database so you can play along at home.

TOP PERCENT

If you just need 50% of the rows in a table, but you’re not concerned about the sequence returned, you can fire this query.  This will give you a sequential listing of ProductIDs

[sql]
SELECT TOP 50 PERCENT ProductID
FROM Production.Product
ORDER BY ProductID
[/sql]

Which will look something like this.

SQL's TOP operator returns rows sequentially

COMMON TABLE EXPRESSION

Now let’s say you want to randomly pull all rows from a table.  This can be achieved using this CTE.

[sql]
WITH data( ProductID ) AS (
SELECT ProductID
FROM Production.Product
)
SELECT ProductID
FROM data
ORDER BY NEWID()
[/sql]

Which will look like this

Common Table Expressions in SQLSERVER are super helpful

If you’re looking to randomly select values from a pre-determined list, see my CTE sample here.

So now that you’ve seen TOP PERCENT and CTE in action, it’s time to put these together and solve my initial task of creating randomly selected groups of ids, of a percent size.

RANDOMLY SELECT TOP PERCENT

Putting it all together, here is the query I used to create my first status group.

[sql]
WITH data( ProductID ) AS (
SELECT ProductID
FROM Production.Product
)
SELECT TOP 20 PERCENT ProductID
FROM data
ORDER BY NEWID()
[/sql]

Which gives me a dataset that is 20% of all rows in Production.Product, and the ids are in random order.

And there you have it. Randomly selecting a percent sized data set from a table in SQL Server. The SQL here is really pretty simple, but for some reason I always forget TOP PERCENT. I’m hoping this post will help me remember TOP PERCENT, and maybe even help somebody else with some TSQL.

Select random value from a range of values

Earlier I blogged about creating random numbers using tsql functions.  Here are two techniques for selecting a random value from a pre-defined range of values in a tsql script.  The first technique uses a table variable ( MSSQL 2000 + ), and the second uses a Common Table Expression or CTE ( MSSQL 2005+ ).

Select a random value using a table variable


-- var to hold random integer
declare @field_val int

-- create table var to hold value range [ 0, 512, 1024, 2048, 4096 ]
-- inserting the first value sets the structure for the table variable
SELECT 0 AS 'num'
INTO #temp

-- insert data into table var
INSERT INTO #temp VALUES ( 512 )
INSERT INTO #temp VALUES ( 1024 )
INSERT INTO #temp VALUES ( 2048 )
INSERT INTO #temp VALUES ( 4096 )

-- assign random value
SELECT TOP 1 @field_val = num FROM #temp ORDER BY NEWID()

-- show value
SELECT @field_val

-- drop the table variable
DROP TABLE #temp

Select a random value using a CTE

-- define our data table
WITH data( car )
AS
(
	-- UNION together our range of values
	SELECT 'audi' AS 'car'
	UNION
	SELECT 'bmw' AS 'car'
	UNION
	SELECT 'infinity' AS 'car'
	UNION
	SELECT 'lexus' AS 'car'
	UNION
	SELECT 'porsche' AS 'car'
)
-- select a random value
SELECT TOP 1 car FROM data
ORDER BY NEWID()

Both of these techniques can be used with numbers or text. Just be sure to mind your quotes, and variable datatypes.  Being able to pick a random value in data generation scripts has proven very useful.  I hope this helps somebody else out as well.