What is the best way to get a random ordering?

I have a query where I want the resulting records to be ordered randomly. It uses a clustered index, so if I do not include an order by it will likely return records in the order of that index. How can I ensure a random row order?

I understand that it will likely not be "truly" random, pseudo-random is good enough for my needs.

Answers 6

  • This is an old question, but one aspect of the discussion is missing, in my opinion -- PERFORMANCE. ORDER BY NewId() is the general answer. When someone get's fancy they add that you should really wrap NewID() in CheckSum(), you know, for performance!

    The problem with this method, is that you're still guaranteed a full index scan and then a complete sort of the data. If you've worked with any serious data volume this can rapidly become expensive. Look at this typical execution plan, and note how the sort takes 96% of your time ...

    enter image description here

    To give you a sense of how this scales, I'll give you two examples from a database I work with.

    • TableA - has 50,000 rows across 2500 data pages. The random query generates 145 reads in 42ms.
    • Table B - has 1.2 million rows across 114,000 data pages. Running Order By newid() on this table generates 53,700 reads and takes 16 seconds.

    The moral of the story is that if you have large tables (think billions of rows) or need to run this query frequently the newid() method breaks down. So what's a boy to do?


    In SQL 2005 a new capability called TABLESAMPLE was created. I've only seen one article discussing it's use...there should be more. MSDN Docs here. First an example:

    SELECT Top (20) *
    FROM Northwind..Orders TABLESAMPLE(20 PERCENT)

    The idea behind table sample is to give you approximately the subset size you ask for. SQL numbers each data page and selects X percent of those pages. The actual number of rows you get back can vary based on what exists in the selected pages.

    So how do I use it? Select a subset size that more than covers the number of rows you need, then add a Top(). The idea is you can make your ginormous table appear smaller prior to the expensive sort.

    Personally I've been using it to in effect limit the size of my table. So on that million row table doing top(20)...TABLESAMPLE(20 PERCENT) the query drops to 5600 reads in 1600ms. There is also a REPEATABLE() option where you can pass a "Seed" for page selection. This should result in a stable sample selection.

    Anyway, just thought this should be added to the discussion. Hope it helps someone.

  • Pradeep Adiga's first suggestion, ORDER BY NEWID(), is fine and something I've used in the past for this reason.

    Be careful with using RAND() - in many contexts it is only executed once per statement so ORDER BY RAND() will have no effect (as you are getting the same result out of RAND() for each row).

    For instance:

    SELECT display_name, RAND() FROM tr_person

    returns each name from our person table and a "random" number, which is the same for each row. The number does vary each time you run the query, but is the same for each row each time.

    To show that the same is the case with RAND() used in an ORDER BY clause, I try:

    SELECT display_name FROM tr_person ORDER BY RAND(), display_name

    The results are still ordered by the name indicating that the earlier sort field (the one expected to be random) has no effect so presumably always has the same value.

    Ordering by NEWID() does work though, because if NEWID() was not always reassessed the purpose of UUIDs would be broken when inserting many new rows in one statemnt with unique identifiers as they key, so:

    SELECT display_name FROM tr_person ORDER BY NEWID()

    does order the names "randomly".

    Other DBMS

    The above is true for MSSQL (2005 and 2008 at least, and if I remember rightly 2000 as well). A function returning a new UUID should be evaluated every time in all DBMSs NEWID() is under MSSQL but it is worth verifying this in the documentation and/or by your own tests. The behaviour of other arbitrary-result functions, like RAND(), is more likely to vary between DBMSs, so again check the documentation.

    Also I've seen ordering by UUID values being ignored in some contexts as the DB assumes that the type has no meaningful ordering. If you find this to be that case explicitly cast the UUID to a string type in the ordering clause, or wrap some other function around it like CHECKSUM() in SQL Server (there may be a small performance difference from this too as the ordering will be done on a 32-bit values not a 128-bit one, though whether the benefit of that outweighs the cost of running CHECKSUM() per value first I'll leave you to test).

    Side Note

    If you want an arbitrary but somewhat repeatable ordering, order by some relatively uncontrolled subset of the data in the rows themselves. For instance either or these will return the names in an arbitrary but repeatable order:

    SELECT display_name FROM tr_person ORDER BY CHECKSUM(display_name), display_name -- order by the checksum of some of the row's data
    SELECT display_name FROM tr_person ORDER BY SUBSTRING(display_name, LEN(display_name)/2, 128) -- order by part of the name field, but not in any an obviously recognisable order)

    Arbitrary but repeatable orderings are not often useful in applications, though can be useful in testing if you want to test some code on results in a variety of orders but want to be able to repeat each run the same way several times (for getting average timing results over several runs, or testing that a fix you have made to the code does remove a problem or inefficiency previously highlighted by a particular input resultset, or just for testing that your code is "stable" in that is returns the same result each time if sent the same data in a given order).

    This trick can also be used to get more arbitrary results from functions, which do not allow non-deterministic calls like NEWID() within their body. Again, this is not something that is likely to be often useful in the real world but could come in handy if you want a function to return something random and "random-ish" is good enough (but be careful to remember the rules that determine when user defined functions evaluted, i.e. usually only once per row, or your results may not be what you expect/require).


    As EBarr points out, there can be performance issues with any of the above. For more than a few rows you are almost garanteed to see the output spooled out to tempdb before the requested number of rows being read back in the right order, which means that even if you are looking for the top 10 you might find a full index scan (or worse, table scan) happens along with a huge block of writing to tempdb. Therefor it can be vitally important, as with most things, to benchmark with realistic data before using this in production.

  • Many tables have a relatively dense (few missing values) indexed numeric ID column.

    This allows us to determine the range of existing values, and choose rows using randomly-generated ID values in that range. This works best when the number of rows to be returned is relatively small, and the range of ID values is densely populated (so the chance of generating a missing value is small enough).

    To illustrate, the following code chooses 100 distinct random users from the Stack Overflow table of users, which has 8,123,937 rows.

    The first step is to determine the range of ID values, an efficient operation due to the index:

        @MinID integer,
        @Range integer,
        @Rows bigint = 100;
    --- Find the range of values
        @MinID = MIN(U.Id),
        @Range = 1 + MAX(U.Id) - MIN(U.Id)
    FROM dbo.Users AS U;

    Range query

    The plan reads one row from each end of the index.

    Now we generate 100 distinct random IDs in the range (with matching rows in the users table) and return those rows:

    WITH Random (ID) AS
        -- Find @Rows distinct random user IDs that exist
        FROM dbo.Users AS U
            -- Random ID
            VALUES (@MinID + (CONVERT(integer, CRYPT_GEN_RANDOM(4)) % @Range))
        ) AS Random (ID)
            SELECT 1
            FROM dbo.Users AS U2
                -- Ensure the row continues to exist
                WITH (REPEATABLEREAD)
            WHERE U2.Id = Random.ID
    FROM Random AS R
    JOIN dbo.Users AS U3
        ON U3.Id = R.ID
    -- QO model hint required to get a non-blocking flow distinct

    random rows query

    The plan shows that in this case 601 random numbers were needed to find 100 matching rows. It is pretty quick:

    Table 'Users'. Scan count 1, logical reads 1937, physical reads 2, read-ahead reads 408
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0
     SQL Server Execution Times:
       CPU time = 0 ms,  elapsed time = 9 ms.

    Try it on the Stack Exchange Data Explorer.

  • As I explained in this article, in order to shuffle the SQL result set, you need to use a database-specific function call.

    Note that sorting a large result set using a RANDOM function might turn out to be very slow, so make sure you do that on small result sets.

    If you have to shuffle a large result set and limit it afterward, then it's better to use the SQL Server TABLESAMPLE in SQL Server instead of a random function in the ORDER BY clause.

    So, assuming we have the following database table:

    enter image description here

    And the following rows in the song table:

    | id | artist                          | title                              |
    | 1  | Miyagi & ???????? ft. ??? ????? | I Got Love                         |
    | 2  | HAIM                            | Don't Save Me (Cyril Hahn Remix)   |
    | 3  | 2Pac ft. DMX                    | Rise Of A Champion (GalilHD Remix) |
    | 4  | Ed Sheeran & Passenger          | No Diggity (Kygo Remix)            |
    | 5  | JP Cooper ft. Mali-Koa          | All This Love                      |

    On SQL Server, you need to use the NEWID function, as illustrated by the following example:

        CONCAT(CONCAT(artist, ' - '), title) AS song
    FROM song

    When running the aforementioned SQL query on SQL Server, we are going to get the following result set:

    | song                                              |
    | Miyagi & ???????? ft. ??? ????? - I Got Love      |
    | JP Cooper ft. Mali-Koa - All This Love            |
    | HAIM - Don't Save Me (Cyril Hahn Remix)           |
    | Ed Sheeran & Passenger - No Diggity (Kygo Remix)  |
    | 2Pac ft. DMX - Rise Of A Champion (GalilHD Remix) |

    Notice that the songs are being listed in random order, thanks to the NEWID function call used by the ORDER BY clause.

  • This is an old thread but came across this recently; so updating a method that has worked for me and gives good performance. This assumes your table has an IDENTITY or similar column:

    DECLARE @r decimal(8,6) = rand()
    SELECT @r
    SELECT  TOP 100 *
    FROM    TableA
    ORDER BY ID % @r

Related Questions