Bill's Blog: Pivot Query

Pivot Query (this is for Oracle Database)
A pivot query is when you want to take some data such as:

C1        C2      C3
-----     -----   ------
a1        b1      x1
a1        b1      x2
a1        b1      x3
…

and you would like to display it as:

C1        C2      C3(1)   C3(2)   C3(3)
-----     -----   ------  -----   -----
a1        b1      x1      x2      x3
…

Basically, it is turning rows into columns. For example, taking the distinct jobs within a department and
making them columns, so the output would look like:

DEPTNO     JOB_1     JOB_2     JOB_3
---------- --------- --------- ---------
        10 CLERK     MANAGER   PRESIDENT
        20 ANALYST   CLERK     MANAGER
        30 CLERK     MANAGER   SALESMAN

instead of

DEPTNO     JOB
---------- ---------
        10 CLERK
        10 MANAGER
        10 PRESIDENT
        20 ANALYST
        20 CLERK
        20 MANAGER
        30 CLERK
        30 MANAGER
        30 SALESMAN

I'm going to show two examples for pivots. The first will be another implementation of the
preceding question. The second shows how to pivot any result set in a generic fashion and gives you
a template for doing so.

In the first case, let's say you wanted to show the top 3 salary earners in each department as
COLUMNS. That is, the query would return exactly 1 row per department, and the row would have 4
columns: the DEPTNO, the name of the highest paid employee in the department, the name of the next
highest paid, and so on. Using analytic functions this is almost easy (before these functions
this was virtually impossible):

ops$tkyte@DEV816> select deptno,
       max(decode(seq,1,ename,null)) highest_paid,
       max(decode(seq,2,ename,null)) second_highest,
       max(decode(seq,3,ename,null)) third_highest
  from ( SELECT deptno, ename,
                row_number() OVER
                   (PARTITION BY deptno
                        ORDER BY sal desc NULLS LAST ) seq
           FROM emp )
 where seq <= 3
 group by deptno
/

    DEPTNO HIGHEST_PA SECOND_HIG THIRD_HIGH
---------- ---------- ---------- ----------
        10 KING       CLARK      MILLER
        20 SCOTT      FORD       JONES
        30 BLAKE      ALLEN      TURNER

That simply created an inner result set that had a sequence assigned to employees BY DEPTNO in
order of SAL. The WHERE clause in the outer query keeps only rows with sequences 1, 2, or 3, and
the decode assigns each of them to the correct "column". The group by gets rid of the redundant
rows and we are left with our collapsed result. It may be easier to understand what I mean by that
if you see the result set without the group by and max:

scott@TKYTE816> select deptno,
       (decode(seq,1,ename,null)) highest_paid,
       (decode(seq,2,ename,null)) second_highest,
       (decode(seq,3,ename,null)) third_highest
  from ( SELECT deptno, ename,
                row_number() OVER
                   (PARTITION BY deptno
                        ORDER BY sal desc NULLS LAST ) seq
           FROM emp )
 where seq <= 3
/

    DEPTNO HIGHEST_PA SECOND_HIG THIRD_HIGH
---------- ---------- ---------- ----------
        10 KING
        10            CLARK
        10                       MILLER
        20 SCOTT
        20            FORD
        20                       JONES
        30 ALLEN
        30            BLAKE
        30                       MARTIN

9 rows selected.

The MAX aggregate function will be applied within each group of the GROUP BY column DEPTNO. In any
given DEPTNO above, only one row will have a non-null value for HIGHEST_PAID; the remaining rows in
that group will always be NULL. The MAX function will pick out the non-null row and keep that for
us. Hence the group by and MAX will collapse our result set, removing the NULL values from it and
giving us what we want. (The department 30 names in this listing differ from the first one; it was
captured in a different session, scott@TKYTE816, apparently against different EMP data.)
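To see the collapsing step in isolation, here is a minimal sketch against DUAL. The three inline rows are made up for illustration and stand in for one DEPTNO group after the DECODE, where only one row carries a name. Aggregates such as MAX skip NULLs, so the group reduces to the one real value:

```sql
-- Illustrative data only: one non-null name and two NULLs,
-- mimicking the HIGHEST_PAID column for a single department.
select max(highest_paid) as highest_paid
  from ( select 'KING' as highest_paid from dual
         union all
         select null from dual
         union all
         select null from dual );
```

This returns a single row containing KING. MIN would work equally well here, since there is only one non-null value per group to choose from.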

In general, if you have a table T with columns C1, C2 and you would like to get a result like:

C1     C2(1)   C2(2)   …   C2(N)

where column C1 is to stay "cross record" and column C2 is to be pivoted to be "in record" (the
values of C2 are to become columns instead of rows), you will generate a query of the form:

select c1,
       max(decode(rn,1,c2,null)) c2_1,
       max(decode(rn,2,c2,null)) c2_2,
       …
       max(decode(rn,N,c2,null)) c2_N
  from ( select c1, c2,
                row_number() over ( partition by c1
                                    order by <something> ) rn
           from T
       )
 group by c1

In the above example, C1 was simply DEPTNO and C2 was ENAME. Since we ordered by SAL DESC, the
first three columns we retrieved were the top three paid employees in that department (bearing in
mind that if four people made the top three, we would of course lose one).
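As a sketch, the same template can be applied to the distinct-jobs layout shown at the start of this post. Here C1 is DEPTNO and C2 is JOB; the extra DISTINCT inline view and the alphabetical ordering by JOB are my assumptions about how to reproduce that layout, and the query assumes the standard EMP table with at most three distinct jobs per department:

```sql
-- Sketch: distinct jobs per department, pivoted into three columns.
-- A fourth distinct job in a department would simply not be shown.
select deptno,
       max(decode(rn, 1, job, null)) job_1,
       max(decode(rn, 2, job, null)) job_2,
       max(decode(rn, 3, job, null)) job_3
  from ( select deptno, job,
                row_number() over ( partition by deptno
                                        order by job ) rn
           from ( select distinct deptno, job from emp )
       )
 group by deptno
 order by deptno;
```

The DISTINCT has to happen before ROW_NUMBER is assigned; otherwise duplicate jobs in a department would each take up a sequence number and a column.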

The second example is a more generic "I want to pivot my result set". Here, instead of having a
single column C1 to anchor on and a single column C2 to pivot, we'll look at the more general case
where C1 is a set of columns, as is C2. As it turns out, this is very similar to the above.
Suppose you want to report, by JOB and DEPTNO, the employees in that job and their salaries. The
report needs to have the employees going ACROSS the page as columns, however, not down the page,
and the same with their salaries. Additionally, the employees need to appear from left to right in
order of their salary. The steps would be as follows. First, find the largest number of employees
in any one DEPTNO, JOB group:

scott@TKYTE816> select max(count(*)) from emp group by deptno, job;

MAX(COUNT(*))
-------------
            4

That tells us how many times each pivoted column must be repeated. Now we can generate the query:

scott@TKYTE816> select deptno, job,
       max( decode( rn, 1, ename, null )) ename_1,
       max( decode( rn, 1, sal, null )) sal_1,
       max( decode( rn, 2, ename, null )) ename_2,
       max( decode( rn, 2, sal, null )) sal_2,
       max( decode( rn, 3, ename, null )) ename_3,
       max( decode( rn, 3, sal, null )) sal_3,
       max( decode( rn, 4, ename, null )) ename_4,
       max( decode( rn, 4, sal, null )) sal_4
  from ( select deptno, job, ename, sal,
                row_number() over ( partition by deptno, job
                                        order by sal, ename ) rn
           from emp
       )
 group by deptno, job
/

DEPTNO JOB       ENAME_1    SAL_1 ENAME_2    SAL_2 ENAME_3    SAL_3 ENAME_ SAL_4
------ --------- ---------- ----- ---------- ----- ---------- ----- ------ -----
    10 CLERK     MILLER      1300
    10 MANAGER   CLARK       2450
    10 PRESIDENT KING        5000
    20 ANALYST   FORD        3000 SCOTT       3000
    20 CLERK     SMITH        800 ADAMS       1100
    20 MANAGER   JONES       2975
    30 CLERK     JAMES         99
    30 MANAGER   BLAKE         99
    30 SALESMAN  ALLEN         99 MARTIN        99 TURNER        99 WARD      99

9 rows selected.

In general, to pivot a result set, we can generalize further. If you have a set of columns C1, C2,
C3, … CN and you want to keep columns C1 .. Cx cross record (going down the page) and Cx+1 … CN
in record (across the page), you can:

select C1, C2, … CX,
       max(decode(rn,1,C{X+1},null)) cx+1_1, … max(decode(rn,1,CN,null)) CN_1,
       max(decode(rn,2,C{X+1},null)) cx+1_2, … max(decode(rn,2,CN,null)) CN_2,
       …
       max(decode(rn,N,C{X+1},null)) cx+1_N, … max(decode(rn,N,CN,null)) CN_N
  from ( select C1, C2, … CN,
                row_number() over ( partition by C1, C2, … CX
                                    order by <something> ) rn
           from T
       )
 group by C1, C2, … CX

In the example, we used C1, C2 = DEPTNO, JOB and C3, C4 = ENAME, SAL.

One other thing we must know is the MAXIMUM number of rows per partition we anticipate. This will
dictate the number of columns we will be generating. Without it we cannot pivot. SQL needs to
know the number of columns and there is no way around that fact. That leads us into the next more
generic example of pivoting. If we do not know the number of total columns until runtime, we'll
have to use dynamic SQL to deal with the fact that the SELECT list is variable. We can use PL/SQL
to demonstrate how to do this and end up with a generic routine that can be reused whenever you
need a pivot. This routine will have the following specification:

scott@TKYTE816> create or replace package my_pkg
as
    type refcursor is ref cursor;
    type array is table of varchar2(30);

    procedure pivot( p_max_cols       in number   default NULL,
                     p_max_cols_query in varchar2 default NULL,
                     p_query          in varchar2,
                     p_anchor         in array,
                     p_pivot          in array,
                     p_cursor in out refcursor );
end;
/

Package created.

Here, you must send in either P_MAX_COLS or P_MAX_COLS_QUERY. SQL needs to know the number of
columns in a query and this parameter will allow us to build a query with the proper number of
columns. The value you should send in here will be the output of a query similar to:

scott@TKYTE816> select max(count(*)) from emp group by deptno, job;

That is, it is the largest number of rows in any one group: the values that are currently in ROWS
and that we will put into COLUMNS. You can either send in the query to get this number, or the
number itself if you already know it.

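The P_QUERY parameter is simply the query that gathers your data together. Using the last example
from above, the query would be the inline view on its own:

select deptno, job, ename, sal,
       row_number() over ( partition by deptno, job
                               order by sal, ename ) rn
  from emp

Note that it must return the row number under the column name RN, because that is the name the
generated DECODE calls refer to.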

The next two inputs are arrays of column names. The P_ANCHOR tells us what columns will stay CROSS
RECORD (down the page) and P_PIVOT states the columns that will go IN RECORD (across the page). In
our example from above, P_ANCHOR = ( DEPTNO, JOB ) and P_PIVOT = (ENAME,SAL). Skipping
over the implementation for a moment, the entire call put together might look like this:

scott@TKYTE816> variable x refcursor
scott@TKYTE816> set autoprint on

scott@TKYTE816> begin
    my_pkg.pivot
    ( p_max_cols_query => 'select max(count(*)) from emp
                            group by deptno,job',
      p_query => 'select deptno, job, ename, sal,
                         row_number() over ( partition by deptno, job
                                             order by sal, ename ) rn
                    from emp a',
      p_anchor => my_pkg.array( 'DEPTNO','JOB' ),
      p_pivot => my_pkg.array( 'ENAME', 'SAL' ),
      p_cursor => :x );
end;
/

PL/SQL procedure successfully completed.

DEPTNO JOB       ENAME_ SAL_1 ENAME_2    SAL_2 ENAME_3    SAL_3 ENAME_ SAL_4
------ --------- ------ ----- ---------- ----- ---------- ----- ------ -----
    10 CLERK     MILLER  1300
    10 MANAGER   CLARK   2450
    10 PRESIDENT KING    5000
    20 ANALYST   FORD    3000 SCOTT       3000
    20 CLERK     SMITH    800 ADAMS       1100
    20 MANAGER   JONES   2975
    30 CLERK     JAMES     99
    30 MANAGER   BLAKE     99
    30 SALESMAN  ALLEN     99 MARTIN        99 TURNER        99 WARD      99

9 rows selected.

As you can see, that dynamically rewrote our query using the generalized template we developed.
The implementation of the package body is straightforward:

scott@TKYTE816> create or replace package body my_pkg
as

procedure pivot( p_max_cols       in number   default NULL,
                 p_max_cols_query in varchar2 default NULL,
                 p_query          in varchar2,
                 p_anchor         in array,
                 p_pivot          in array,
                 p_cursor in out refcursor )
as
    l_max_cols number;
    l_query    long;
    l_cnames   array;
begin
    -- figure out the number of columns we must support
    -- we either KNOW this or we have a query that can tell us
    if ( p_max_cols is not null )
    then
        l_max_cols := p_max_cols;
    elsif ( p_max_cols_query is not null )
    then
        execute immediate p_max_cols_query into l_max_cols;
    else
        raise_application_error(-20001, 'Cannot figure out max cols');
    end if;


    -- Now, construct the query that can answer the question for us...
    -- start with the C1, C2, ... CX columns:

    l_query := 'select ';
    for i in 1 .. p_anchor.count
    loop
        l_query := l_query || p_anchor(i) || ',';
    end loop;

    -- Now add in the C{x+1}... CN columns to be pivoted:
    -- the format is "max(decode(rn,1,C{X+1},null)) cx+1_1"

    for i in 1 .. l_max_cols
    loop
        for j in 1 .. p_pivot.count
        loop
            l_query := l_query ||
                'max(decode(rn,'||i||','||
                            p_pivot(j)||',null)) ' ||
                            p_pivot(j) || '_' || i || ',';
        end loop;
    end loop;

    -- Now just add in the original query
    l_query := rtrim(l_query,',')||' from ( '||p_query||') group by ';

    -- and then the group by columns...

    for i in 1 .. p_anchor.count
    loop
        l_query := l_query || p_anchor(i) || ',';
    end loop;
    l_query := rtrim(l_query,',');

    -- and return it
    execute immediate 'alter session set cursor_sharing=force';
    open p_cursor for l_query;
    execute immediate 'alter session set cursor_sharing=exact';
end;

end;
/

Package body created.

It only does a little string manipulation to rewrite the query and open a REF CURSOR dynamically.
In the likely event the query had a predicate with constants and such in it, we set CURSOR_SHARING
to FORCE for the parse of this query and then back to EXACT afterwards, to facilitate bind
variables (see the section on tuning for more information on that). Now we have a fully parsed
query that is ready to be fetched from.
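For comparison, here is a sketch of the other way to call the routine: passing the column count directly in P_MAX_COLS rather than a query that computes it. The value 4 is the one the max(count(*)) query returned earlier; everything else mirrors the call shown above and assumes the same EMP table and SQL*Plus session settings.

```sql
-- SQL*Plus sketch: same report, with the column count supplied directly.
-- 4 is assumed to still be the largest DEPTNO, JOB group size in EMP.
variable x refcursor
set autoprint on

begin
    my_pkg.pivot
    ( p_max_cols => 4,
      p_query    => 'select deptno, job, ename, sal,
                            row_number() over ( partition by deptno, job
                                                    order by sal, ename ) rn
                       from emp',
      p_anchor   => my_pkg.array( 'DEPTNO', 'JOB' ),
      p_pivot    => my_pkg.array( 'ENAME', 'SAL' ),
      p_cursor   => :x );
end;
/
```

With these arguments the procedure builds a statement that begins select DEPTNO,JOB,max(decode(rn,1,ENAME,null)) ENAME_1,max(decode(rn,1,SAL,null)) SAL_1, and so on through ENAME_4 and SAL_4, followed by the original query as an inline view and group by DEPTNO,JOB. If the hard-coded count is too small, the extra rows in a group are silently left out, which is the reason to prefer P_MAX_COLS_QUERY when the data changes.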


Tom,

I have a tricky turning "columns into rows" predicament...

Here are the steps to set up my problem...

----------------------------
-- TARGET table to store list of account_ids
----------------------------
create table demo_account_list
(
   account_id number
)
/

----------------------------
-- SOURCE table of account_ids
----------------------------
create table demo_account_sources
(
   acct_1 number
, acct_2 number
, acct_3 number
, acct_4 number
)
/

----------------------------
-- Populate the account_id SOURCE table
----------------------------
insert into demo_account_sources values (1,2,3,4);
insert into demo_account_sources values (13,22,433,44261);
insert into demo_account_sources values (10,342,32342,33443);
insert into demo_account_sources values (15,26,737,48);

commit;

------------------------------
-- Want to capture only the EVEN account_ids
-- This doesn't work!
------------------------------
insert into demo_account_list
(
   select acct_1, acct_2, acct_3, acct_4
     from demo_account_sources
    -----------------------------
    -- Want to load only EVEN account_ids
    -----------------------------
    where mod(acct_1,2) = 0
       or mod(acct_2,2) = 0
       or mod(acct_3,2) = 0
       or mod(acct_4,2) = 0
)
/

----------------------------
Of course, the above query returns this error...
----------------------------
insert into demo_account_list
            *
ERROR at line 1:
ORA-00913: too many values
----------------------------

My question is, how can I write this query in such a way that I can select >1 rows yet insert into
a single column?

Thanks,

Robert

Followup   December 20, 2006 - 7pm US/Eastern:

ops$tkyte%ORA10GR2> select decode( r, 1, acct_1, 2, acct_2, 3, acct_3, 4, acct_4 ) acct
  from demo_account_sources,
       (select 1 r from dual union all select 2 from dual
        union all select 3 from dual union all select 4 from dual )
/

      ACCT
----------
         1
        13
        10
        15
         2
        22
       342
        26
         3
       433
     32342
       737
         4
     44261
     33443
        48

16 rows selected.
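The followup shows how to turn the four columns into rows but stops short of the load itself. A minimal sketch of the remaining step, assuming the two demo tables exactly as created in the question, wraps that query in an inline view, keeps only the even values, and inserts them into the single-column target:

```sql
-- Sketch: unpivot the four ACCT_n columns into rows, then load only the even ids.
-- The cartesian join to the 4-row generator yields one row per source column.
insert into demo_account_list ( account_id )
select acct
  from ( select decode( r, 1, acct_1, 2, acct_2, 3, acct_3, 4, acct_4 ) acct
           from demo_account_sources,
                ( select 1 r from dual union all select 2 from dual
                  union all select 3 from dual union all select 4 from dual )
       )
 where mod( acct, 2 ) = 0;
```

With the four sample rows from the question, this should insert the eight even values: 2, 4, 22, 10, 342, 32342, 26 and 48. The filter has to sit outside the inline view because the ACCT alias is not visible to a WHERE clause at the same query level.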

Monday, July 7, 2014