Welcome to Aginity SQL Academy’s documentation!

SQL Academy for Data Analysts

Lesson #1: Calculate Internal Rate of Return (IRR) Directly in Redshift

Internal Rate of Return (IRR) is a calculation frequently used to estimate the profitability of an investment.

I encountered it at a large mortgage lender where IRR was one of many calculations used to determine the health of individual loans. The inputs were the mortgage principal, the previous monthly payments, and estimated monthly payments to the end of the loan period. So, there could easily be more than 300 inputs to the IRR calculation. The standard practice was to transfer a sample of borrower data from the data warehouse to the SAS server, run the IRR calculation there, and transfer the results back to the data warehouse.

The type of iterative calculation required by IRR is not traditionally done on MPP data warehouse platforms. However, with a bit of advanced SQL thinking, this type of calculation can take advantage of the distributed nature of the MPP to score very large populations.

As with all of our lessons, if you want to catalog this code in Aginity Pro or Aginity Team, click the catalog icon to skip right to the queries to add to your catalog.

Step 1 - Introduction to IRR

\[\mathrm{NPV} = \sum_{n=0}^{N}\frac{C_n}{\left(1+r\right)^n} = 0\]

So, what is the IRR calculation?

Advanced SQL writers come from a range of backgrounds. You may not have a computer science degree, or you may not remember much of your high school or college math, so I’ll cover a few details here. The Additional Resources section below has links to several useful articles regarding IRR. I’ll give only a minimal explanation here, so you should read those. The purpose here is to demonstrate some useful SQL techniques. If you need to implement this, or any other financial calculation, in production, then work with an experienced financial engineer.

In the equation above, our goal is to find “r”, the interest rate that makes the Net Present Value (NPV) equal to 0. In the early 19th century, Évariste Galois proved that polynomials like this one (of degree five or higher) have no general algebraic solution; that is, there is no “formula” for finding “r”. You have probably encountered these types of functions in your mathematical career. Hopefully, your math teacher discussed a bit about how your fancy scientific calculator, or computer program, was able to solve these equations without a formula.

In Computer Science, the study of finding “good enough” solutions to impossible mathematical problems is called Numerical Methods or Numerical Analysis. For this problem, we are trying to find the “r” that makes NPV equal to zero. So, with our computer program, we can use this method:

1. Make a guess for "r"
2. Calculate the NPV based on that guess
3. Evaluate whether or not the NPV resulting from that guess is "close enough" to 0
  a. if it is "close enough" then we are done and our guess is a "good enough" result for r
  b. if not, then we need a better guess
4. Improve the guess and go back to step 2

Step 2 - Relational Database Equivalent of Iteration

It is clear how this type of algorithm can be implemented in a traditional programming language, maybe even in a database system with a procedural language available. However, the power of the MPP lies in SQL, not in procedural code. In order to implement this in SQL, we have to change our thinking from “one at a time” to “all at once”. Think of this simple case. In a programming language, you might print the numbers one through 10 like this:

for i=1, 10 do
  print(i)
end

Assuming that I have a numbers table, in SQL I achieve the same result like this:

select num
from numbers
where num <= 10
order by num
;



In the procedural language, it is clear that the computer “did” something (printed 1), then “did” something else (printed 2). Did this happen in the database system? Maybe, maybe not. The point is that SQL is focused on the result set. The SET is only in a particular order because I asked for it to be displayed that way. There is a SET of numbers. I don’t care about the order in which they were read off the disk. Let’s go a step further:

total = 0
for i=1, 10 do
  total = total + i
end
print(total)

vs

select sum(num) as total
from
    (select num
     from numbers
     where num <= 10
     ) as inner_select
;

Again, the order of the process described by the procedural language matches what the system actually does. Does the database system keep a running total when it calculates a SUM? I don’t know. It doesn’t matter. The SQL says that, of the SET returned by the inner_select, give me the sum. When there is a logical ordering of operations – in order to SUM over a set I first have to identify a set – then that order of operations is clear in the SQL.

So, how does this apply to our problem of making guesses? In procedural languages, we make one guess at a time, evaluate it, and make another guess; in SQL we can make all of our guesses at the same time.
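As a tiny illustration of that shift in thinking, here is a minimal sketch (with hypothetical two-row sets) of the cross join idea that Step 4 will use for real: every payment period is paired with every guess in a single result set.

-- minimal sketch: a cross join pairs every period with every guess,
-- so all guesses can be evaluated in one pass
select p.period, g.guess
from (select 1 as period union all select 2) as p
cross join (select 0.10 as guess union all select 0.12) as g
order by p.period, g.guess
;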

Step 3 - IRR Baby Steps

In breaking down a calculation like this, I like to start with all the terms on a single row and work backward from there. We can easily generate as many examples as we want by using the IRR calculation in Excel. So, let’s look at a simple Excel example:

Excel Example 1

Period   Payment   IRR (Excel)
0        -10,000
1          2,400
2          2,400
3          2,400
4          2,400
5          2,400
6          2,400   11.530473216%

Note

IRR Excel Function

_images/excel_1.png

Remember that anything raised to the power of “0” is equal to 1. So, we can use a “0” period to get the value of the payment “as-is”.
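A quick sanity check you can run yourself (a one-line sketch, not tied to our data) confirms that the period-0 denominator is 1, so the initial payment passes through the formula unchanged:

-- any base raised to the power 0 returns 1
select power(1 + 0.115305, 0) as period_zero_factor;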

If “Net Present Value” is the sum of a bunch of things, let’s call those pre-summed things “Present Value”. So, for every payment period we need:

  • an ID so we can handle multiple borrowers
  • the period number
  • the payment for that period
-- example 1
drop table if exists payment_data;
create temp table payment_data
(id int, period int, payment int)
;

insert into payment_data VALUES
(1, 0, -10000), (1, 1, 2400), (1, 2, 2400), (1, 3, 2400), (1, 4, 2400), (1, 5, 2400), (1, 6, 2400)
;

-- section 1
select *
from payment_data
order by id, period
;
_images/result_1.png

Let’s build up the calculation. Remember, we are trying to get an NPV of 0. For our example data, Excel has told us that happens when r is about “0.115305”.

\[\mathrm{NPV} = \sum_{n=0}^{N}\frac{C_n}{\left(1+r\right)^n} = 0\]
  1. for a given period, the numerator is the payment for that period
  2. inside the parentheses is (1 + 0.115305)
  3. the denominator is that quantity raised to the power of the period: power((1 + 0.115305), period)

So, the query is

select id, period, payment, payment/(power(1+0.115305, period)) as pv
from payment_data
order by id, period
;

Let’s sum that to see whether it is correct:

select sum(pv) as npv from
  (select id, period, payment, payment/(power(1+0.115305, period)) as pv
   from payment_data
   order by id, period
  ) as inner_query
;
_images/result_2.png

So, that’s pretty close to 0. Theoretically, we can get as close to 0 as we want by continually adding decimal places to our value of r.

In this case, we “cheated” by getting Excel to tell us the correct value for now. Next we are going to evaluate r over a range of “guesses” to determine which value of r produces an NPV close enough to 0.

Step 4 - Longer Strides

So, we know from Excel that the IRR for our data is between 11% and 12%. Let’s explore a few values.

First a table of guesses:

drop table if exists guesses;
create temp table guesses as
select num*.01 as guess
from numbers
;
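If you don’t already have a numbers table (first assumed back in Step 2), here is a minimal sketch of one way to build it in Redshift. The table and column names are simply what these examples expect, and the six cross joins of a ten-row digits table yield one million sequential integers:

-- build a small digits table, then cross join it to generate 1..1,000,000
drop table if exists digits;
create temp table digits (d int);
insert into digits values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);

drop table if exists numbers;
create temp table numbers as
select a.d + b.d*10 + c.d*100 + e.d*1000 + f.d*10000 + g.d*100000 + 1 as num
from digits a, digits b, digits c, digits e, digits f, digits g
;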

In the previous example, we had the “guess” hard coded. Now we want our guesses table to drive the guess. So, every row of our payment data needs its own guess. In SQL, we can achieve this by using cross join. In SQL development we always need to keep in mind an estimate of the sizes of our input and output sets so things don’t get out of hand. We have 7 periods; we’ll look at 10 guesses initially. That will be 70 inner rows that will aggregate to 10 npv_guesses to evaluate. Rather than make this strictly true by pre-limiting our guesses table, we’ll assume that Redshift is smart enough to do that limitation for us. If not, then we’ll have 7 million inner rows that will be filtered down to 70. For Redshift, we won’t consider that to be a big deal for now.

So our inner query is this:

select id, period, payment, guess, payment/(power(1+guess, period)) as pv
from payment_data
cross join guesses
where guess between 0.06 and .15
order by id, period, guess

We can’t tell much by looking at this level of detail, so let’s aggregate

select id, guess, sum(pv) as npv_guess
from
  (select id, period, payment, guess, payment/(power(1+guess, period)) as pv
   from payment_data
   cross join guesses
   where guess between 0.06 and .15  -- an arbitrary limit on guess for easy viewing
   order by id, period, guess
  ) as inner_query
group by id, guess
order by id, guess
;
_images/result_3.png

We can see that the value closest to 0 is .12. Let’s dial in additional precision by adding decimals to our guesses, then re-running the aggregate query:

drop table if exists guesses;
create temp table guesses as
select num*.001 as guess
from numbers
;

Run the npv_guess query again.

_images/result_4.png

Now there are 100 rows of output and closest to 0 is .115. Let’s jump a couple levels of precision and re-run the aggregate query.

drop table if exists guesses;
create temp table guesses as
select num*.00001 as guess
from numbers
;
_images/result_5.png

Now we have 10,000 rows with the closest being 0.11530.

Note

If you are working through these queries, go ahead and put the decimal back to “0.01” in the guesses table so we get faster execution times for the rest of the examples.

Step 5 - Crossing the Finish Line

Now we can see the shape of where we are going. We are making all of our guesses “at the same time”; at least as part of the same result set. From that set, we need to find the one that is closest to 0; that is, the npv_guess that has the minimum absolute value.

As our levels of aggregation continue to grow, we need to be comfortable with the technique of SQL Window functions.
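If window functions are new to you, here is a minimal refresher sketch, reusing the numbers table from Step 2: the frame clause “rows unbounded preceding” turns SUM into a running total per row.

-- running total of 1..10 using a window function
select num,
       sum(num) over (order by num rows unbounded preceding) as running_total
from numbers
where num <= 10
order by num
;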

Let’s rewrite the previous query with a window function. Also, remember the previous note to take the guesses table back down to two decimal places for faster execution.

select *, payment/(power(1+guess, period)) as pv,
       sum(payment/(power(1+guess, period))) over(partition by id, guess order by period rows unbounded preceding) as npv_guess,
       max(period) over(partition by id, guess) as max_period
from payment_data
cross join guesses
order by id, guess, period

Now we have our payment data, pv, and npv_guess on the same row. The npv_guess aggregation is being driven by the sum using the window function. For aggregating at the next level, where we are going to find the npv_guess closest to 0, we need to choose a row. The row we want is the one with the last period for our data. So, we have a max_period aggregation that we’ll use for a filter at the next level. Note that the final “order by” clause here and in examples below is for us to use in visualizing the output. The aggregation here is based on the “order by” clause inside the window function.

Any time we use window functions, we want to add test cases to make sure that the functions are working as expected. So, let’s add a couple more IRR examples from Excel:

Excel Example 2

Period   Payment   IRR (Excel)
0         -1,000
1            120
2            360
3            100
4            240
5            480   8%

Excel Example 3

Period   Payment   IRR (Excel)
0        -18,000
1          3,100
2          2,400
3          2,400
4          2,400
5          2,400
6          2,400
7          3,000
8          3,200
9          3,600   7%
insert into payment_data VALUES
(2, 0, -1000), (2, 1, 120), (2, 2, 360), (2, 3, 100), (2, 4, 240), (2, 5, 480),
(3, 0, -18000), (3, 1, 3100), (3, 2, 2400), (3, 3, 2400), (3, 4, 2400), (3, 5, 2400), (3, 6, 2400),
(3, 7, 3000), (3, 8, 3200), (3, 9, 3600)
;

At this level, we have all of our guesses, along with their distances from 0 (absolute value), and identification of which of these is the closest to 0.

select id, guess, abs(npv_guess) as abs_npv_guess,
       min(abs(npv_guess)) over(partition by id) as min_abs_npv_guess
from
 (select *, payment/(power(1+guess, period)) as pv,
      sum(payment/(power(1+guess, period))) over(partition by id, guess order by period rows unbounded preceding) as npv_guess,
      max(period) over(partition by id, guess) as max_period
  from payment_data
  cross join guesses
  order by id, guess, period
 ) as payment_level
where period = max_period
order by id
;

So, one additional filter gives the final query:

select id, guess as irr
from
  (select id, guess, abs(npv_guess) as abs_npv_guess,
       min(abs(npv_guess)) over(partition by id) as min_abs_npv_guess
   from
     (select *, payment/(power(1+guess, period)) as pv,
          sum(payment/(power(1+guess, period))) over(partition by id, guess order by period rows unbounded preceding) as npv_guess,
          max(period) over(partition by id, guess) as max_period
      from payment_data
      cross join guesses
      order by id, guess, period
     ) as payment_level
   where period = max_period
   order by id, guess
     ) as guess_level
 where abs_npv_guess = min_abs_npv_guess
 order by id
;
_images/result_6.png

IRR Sample Aginity Catalog Assets

Lesson #5: Learn to Create Basic Recency, Frequency and Monetary Segments

RFM (Recency, Frequency, Monetary) analysis is a simple, easy-to-understand, and widely used marketing model for behavior-based customer segmentation. It groups customers based on their transaction history – how recently they transacted, how often they transacted, and how much they spent.

RFM models help divide customers into categories or clusters, identifying those who are more likely to respond to promotions and who are good candidates for future personalization.

See also

As with all of our lessons, if you want to catalog this code in Aginity Pro or Aginity Team, click the catalog icon to skip right to the queries to add to your catalog.

Step 1 - The Required Data

For the sake of this example, we keep the data simple, but you can expand the approach to any data you have available.

We are going to mock up some data using basic SQL UNION ALL statements. We will use two tables that are very typical in RFM modeling: TRANSACTION_HEADER and TRANSACTION_DETAIL.

Here is the structure of each table.

TRANSACTION_HEADER

Column name        Column Description                                               Data Type
customer_id        A unique identifier of the customer that purchased.              Integer
transaction_id     A unique identifier of a transaction.                            Integer
transaction_date   The date on which the transaction occurred.                      Date

TRANSACTION_DETAIL

Column name        Column Description                                               Data Type
transaction_id     A unique identifier of a transaction. FK to TRANSACTION_HEADER.  Integer
quantity           The quantity of items purchased.                                 Integer
net_amount         The total amount of items purchased.                             Decimal(14,3)
Sample RFM Data Scripts

The following SQL scripts are portable and are used to simulate data for the RFM model.

TRANSACTION_HEADER

  -- transaction_header data

  select cast(123456 as integer) as customer_id, cast(11111 as integer) as transaction_id, cast('2019-01-01' as date) as transaction_date
  union all
  select cast(123456 as integer) as customer_id, cast(11112 as integer) as transaction_id, cast('2019-01-04' as date) as transaction_date
  union all
  select cast(123456 as integer) as customer_id, cast(11113 as integer) as transaction_id, cast('2019-01-07' as date) as transaction_date
  union all
  select cast(123456 as integer) as customer_id, cast(11114 as integer) as transaction_id, cast('2019-01-10' as date) as transaction_date
  union all
  select cast(123456 as integer) as customer_id, cast(11115 as integer) as transaction_id, cast('2019-01-14' as date) as transaction_date
  union all
  select cast(123456 as integer) as customer_id, cast(11116 as integer) as transaction_id, cast('2019-01-17' as date) as transaction_date
  union all
  select cast(123456 as integer) as customer_id, cast(11117 as integer) as transaction_id, cast('2019-01-20' as date) as transaction_date
  union all
  select cast(123456 as integer) as customer_id, cast(11118 as integer) as transaction_id, cast('2019-01-27' as date) as transaction_date
  union all
  select cast(234567 as integer) as customer_id, cast(21115 as integer) as transaction_id, cast('2019-01-14' as date) as transaction_date
  union all
  select cast(234567 as integer) as customer_id, cast(21116 as integer) as transaction_id, cast('2019-01-15' as date) as transaction_date
  union all
  select cast(234567 as integer) as customer_id, cast(21117 as integer) as transaction_id, cast('2019-01-16' as date) as transaction_date
  union all
  select cast(234567 as integer) as customer_id, cast(21118 as integer) as transaction_id, cast('2019-01-17' as date) as transaction_date

TRANSACTION_DETAIL

      --transaction_detail data

      select cast(11111 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(11112 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(11112 as integer) as transaction_id, cast(1 as integer) as quantity, cast(7.25 as decimal(13,2)) as net_amount
      union all
      select cast(11113 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(11112 as integer) as transaction_id, cast(15 as integer) as quantity, cast(1.10 as decimal(13,2)) as net_amount
      union all
      select cast(11114 as integer) as transaction_id, cast(1 as integer) as quantity, cast(25.34 as decimal(13,2)) as net_amount
      union all
      select cast(11114 as integer) as transaction_id, cast(2 as integer) as quantity, cast(14.32 as decimal(13,2)) as net_amount
      union all
      select cast(11114 as integer) as transaction_id, cast(1 as integer) as quantity, cast(7.10 as decimal(13,2)) as net_amount
      union all
      select cast(11115 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(11116 as integer) as transaction_id, cast(1 as integer) as quantity, cast(8.10 as decimal(13,2)) as net_amount
      union all
      select cast(11117 as integer) as transaction_id, cast(2 as integer) as quantity, cast(23.10 as decimal(13,2)) as net_amount
      union all
      select cast(11118 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(21115 as integer) as transaction_id, cast(14 as integer) as quantity, cast(4.10 as decimal(13,2)) as net_amount
      union all
      select cast(21116 as integer) as transaction_id, cast(16 as integer) as quantity, cast(8.10 as decimal(13,2)) as net_amount
      union all
      select cast(21117 as integer) as transaction_id, cast(4 as integer) as quantity, cast(23.10 as decimal(13,2)) as net_amount
      union all
      select cast(21118 as integer) as transaction_id, cast(1 as integer) as quantity, cast(43.10 as decimal(13,2)) as net_amount

Step 2 - The RFM Query

We are using a WITH statement to collapse two passes over the data into a single SQL statement. Let’s start with the statement itself, which returns the aggregated analytics and the segment each customer falls into.
    with customer_metrics as
    (
    select
     th.customer_id,
     count(distinct th.transaction_id) as trips_per_period,
     sum(td.quantity * td.net_amount) as total_spend_per_period,
 datediff(day, max(th.transaction_date), current_date) AS days_since_last_transaction
    from  TRANSACTION_HEADER th
    join  TRANSACTION_DETAIL td on th.transaction_id = td.transaction_id
    where th.transaction_date > dateadd(day, cast ($lookback_days as integer)*-1,current_date)
    AND td.quantity > 0             -- returns ignored
    group by th.customer_id),
    rfm as
    (
    select customer_id,
      ntile($buckets) over (order by days_since_last_transaction desc) as r,
      ntile($buckets) over (order by trips_per_period desc) as f,
      ntile($buckets) over (order by total_spend_per_period desc) as m,
      trips_per_period,
      total_spend_per_period,
      days_since_last_transaction
    from customer_metrics
    )
    select customer_id
         , r
         , f
         , m
         , trips_per_period
         , total_spend_per_period
         , days_since_last_transaction
         , ntile($buckets) over (order by 1.0*r+1.0*f+1.0*m)
    from rfm;  -- weights on final calculation - default to 1.0
Breakdown of the SQL

There are two queries embedded in the WITH statement: customer_metrics and rfm.

The customer_metrics query will aggregate:

  1. The trips_per_period by counting distinct transaction IDs (F, or Frequency).
  2. The total_spend_per_period by summing quantity multiplied by net_amount (M, or Monetary).
  3. The days_since_last_transaction by finding the difference between the current date and the last purchase date (R, or Recency).

In this query we have two parameters: $lookback_days, which controls how far back from the current date to look when segmenting customer purchases, and $buckets, which specifies the number of segments you want the query to return.

The rfm query then uses the windowing function ntile, which takes the ordered data from the customer_metrics query and splits it into groups of equal size (an equal number of rows per group).
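If ntile is unfamiliar, here is a self-contained minimal sketch: ntile(4) assigns each ordered row to one of four equal-sized buckets.

-- 8 rows split into 4 buckets of 2 rows each
select num, ntile(4) over (order by num) as bucket
from (select 1 as num union all select 2 union all select 3 union all select 4
      union all select 5 union all select 6 union all select 7 union all select 8) as t
order by num
;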

The final query as shown below brings together all the information from the WITH queries and displays it along with a final ntile of the RFM calculation.

    select customer_id
          , r
          , f
          , m
          , trips_per_period
          , total_spend_per_period
          , days_since_last_transaction
          , ntile($buckets) over (order by 1.0*r+1.0*f+1.0*m)
     from rfm;  -- weights on final calculation - default to 1.0

Note

The 1.0 you see in the query above represents equal weight for the R, F, and M components. In some cases an organization may want to weight each measure differently when performing the final segmentation. For instance, you may choose to apply a weight of 2 to monetary and a weight of 1 each to frequency and recency, as sketched below.
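As a minimal sketch of that idea (the 2.0 weight is purely hypothetical), you would swap the final select of the WITH statement above for something like this:

select customer_id
     , ntile($buckets) over (order by 1.0*r + 1.0*f + 2.0*m) as weighted_segment
from rfm;  -- monetary weighted twice as heavily as recency and frequency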


RFM Sample Aginity Catalog Assets

There are six assets you can add to your catalog. I chose to add them as shown below.

_images/RFMCatalog.png

These queries are written using ANSI-standard SQL, so they should work across most database platforms. Just select a connection in the Pro/Team editor, then either double-click the catalog item to execute it, or drag and drop the catalog item to expose the code and run it.

DATA-transaction_header
    (select cast(123456 as integer) as customer_id, cast(11111 as integer) as transaction_id, cast('2019-01-01' as date) as transaction_date
    union all
    select cast(123456 as integer) as customer_id, cast(11112 as integer) as transaction_id, cast('2019-01-04' as date) as transaction_date
    union all
    select cast(123456 as integer) as customer_id, cast(11113 as integer) as transaction_id, cast('2019-01-07' as date) as transaction_date
    union all
    select cast(123456 as integer) as customer_id, cast(11114 as integer) as transaction_id, cast('2019-01-10' as date) as transaction_date
    union all
    select cast(123456 as integer) as customer_id, cast(11115 as integer) as transaction_id, cast('2019-01-14' as date) as transaction_date
    union all
    select cast(123456 as integer) as customer_id, cast(11116 as integer) as transaction_id, cast('2019-01-17' as date) as transaction_date
    union all
    select cast(123456 as integer) as customer_id, cast(11117 as integer) as transaction_id, cast('2019-01-20' as date) as transaction_date
    union all
    select cast(123456 as integer) as customer_id, cast(11118 as integer) as transaction_id, cast('2019-01-27' as date) as transaction_date
    union all
    select cast(234567 as integer) as customer_id, cast(21115 as integer) as transaction_id, cast('2019-01-14' as date) as transaction_date
    union all
    select cast(234567 as integer) as customer_id, cast(21116 as integer) as transaction_id, cast('2019-01-15' as date) as transaction_date
    union all
    select cast(234567 as integer) as customer_id, cast(21117 as integer) as transaction_id, cast('2019-01-16' as date) as transaction_date
    union all
    select cast(234567 as integer) as customer_id, cast(21118 as integer) as transaction_id, cast('2019-01-17' as date) as transaction_date)
DATA-transaction_detail
      (select cast(11111 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(11112 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(11112 as integer) as transaction_id, cast(1 as integer) as quantity, cast(7.25 as decimal(13,2)) as net_amount
      union all
      select cast(11113 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(11112 as integer) as transaction_id, cast(15 as integer) as quantity, cast(1.10 as decimal(13,2)) as net_amount
      union all
      select cast(11114 as integer) as transaction_id, cast(1 as integer) as quantity, cast(25.34 as decimal(13,2)) as net_amount
      union all
      select cast(11114 as integer) as transaction_id, cast(2 as integer) as quantity, cast(14.32 as decimal(13,2)) as net_amount
      union all
      select cast(11114 as integer) as transaction_id, cast(1 as integer) as quantity, cast(7.10 as decimal(13,2)) as net_amount
      union all
      select cast(11115 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(11116 as integer) as transaction_id, cast(1 as integer) as quantity, cast(8.10 as decimal(13,2)) as net_amount
      union all
      select cast(11117 as integer) as transaction_id, cast(2 as integer) as quantity, cast(23.10 as decimal(13,2)) as net_amount
      union all
      select cast(11118 as integer) as transaction_id, cast(3 as integer) as quantity, cast(3.10 as decimal(13,2)) as net_amount
      union all
      select cast(21115 as integer) as transaction_id, cast(14 as integer) as quantity, cast(4.10 as decimal(13,2)) as net_amount
      union all
      select cast(21116 as integer) as transaction_id, cast(16 as integer) as quantity, cast(8.10 as decimal(13,2)) as net_amount
      union all
      select cast(21117 as integer) as transaction_id, cast(4 as integer) as quantity, cast(23.10 as decimal(13,2)) as net_amount
      union all
      select cast(21118 as integer) as transaction_id, cast(1 as integer) as quantity, cast(43.10 as decimal(13,2)) as net_amount)
FML - Days Since Last Transaction (Recency)

This asset is a reusable formula that calculates the days between execution run time (current date) and the maximum transaction date for each customer.

current_date - max(th.transaction_date)
FML - Total Spend in Period (Monetary)

This asset is a reusable formula that calculates the aggregation (sum) of net sales, defined as quantity multiplied by net_amount over the specified time period.

sum(td.quantity * td.net_amount)
FML - Trips per Period (Frequency)

This asset is a reusable formula that counts the number of distinct transactions within the specified time period.

  count(distinct th.transaction_id)
Sample - RFM Query
This asset uses the formulas above and then calculates the segmentation using the windowed analytic function ntile.
      with customer_metrics as
      (
      select
      th.customer_id,
      @{/Samples/Sample Data Science Queries - All Platforms/Recency-Frequency-Monetary Samples/FML - Trips per Period (Frequency)} as trips_per_period,
      @{/Samples/Sample Data Science Queries - All Platforms/Recency-Frequency-Monetary Samples/FML - Total Spend in Period (Monetary)} as total_spend_per_period,
      @{/Samples/Sample Data Science Queries - All Platforms/Recency-Frequency-Monetary Samples/FML - Days Since Last Transaction (Recency)} AS days_since_last_transaction
      from  @{/Samples/Sample Data Science Queries - All Platforms/Recency-Frequency-Monetary Samples/DATA-transaction_header}
      join  (SELECT * FROM @{/Samples/Sample Data Science Queries - All Platforms/Recency-Frequency-Monetary Samples/DATA-transaction_detail}) td
      on th.transaction_id = td.transaction_id
      where td.quantity > 0             -- returns ignored
      --and th.transaction_date > dateadd(day, cast($lookback_days as integer)*-1,current_date)  -- consider the past 365 days - customization opportunity (1)
      group by th.customer_id),
      rfm as
      (
      select customer_id,
      ntile($buckets) over (order by days_since_last_transaction desc) as r,  -- split into 10 bins - customization opportunity (2)
      ntile($buckets) over (order by trips_per_period desc) as f,
      ntile($buckets) over (order by total_spend_per_period desc) as m,
      trips_per_period,
      total_spend_per_period,
      days_since_last_transaction
      from customer_metrics
      )
      select customer_id
         , r
         , f
         , m
         , trips_per_period
         , total_spend_per_period
         , days_since_last_transaction
         , ntile($buckets) over (order by 1.0*r+1.0*f+1.0*m)
      from rfm

SQL Academy for Data Engineers

Lesson #2: Essential Redshift Utilities: Generate DDL and Search Table Metadata

A decade ago, technologists tended to specialize in a few systems. In the database realm, you might have focused on Oracle or DB2 or SQL Server. Even with massively parallel processing (MPP) databases we tended to specialize in Teradata or Netezza or Greenplum. However, over the past few years, I have worked on projects on all of these systems and more, including cloud-based systems like Hive, Spark, Redshift, Snowflake, and BigQuery. When I worked only in Oracle and only used an Oracle SQL editor, then I knew exactly where to find my store of SQL snippets for doing things like querying the database system tables.

However, as the number of databases I worked with each day expanded, so did the number of places I had to look for my old code, and the amount of time I had to spend researching the intricacies of system tables for those databases. In the well-known Workbench series of products from Aginity, each product focused on one database platform (Workbench for Netezza, Workbench for Redshift, etc.). Aginity Pro is more than a unified platform for executing SQL against many databases. Now there is an “Active Catalog” where I can store, organize, manage, and execute my hard-won SQL.

As SQL analysts and engineers, there are a few things we do every time we approach a new set of data or a new project. Many of those things involve querying the system tables of the current database. The Active Catalog included with Aginity Pro includes a “Sample Catalog” directory with, among other things, system table queries that are ready to use out of the box. Below, I’ll discuss two of these. Neither is a particularly sophisticated use of the Active Catalog, but being familiar with their location and usage means that new projects can start immediately with data investigation rather than with a Google search for how to query the system tables in [your database platform here].

Step 1 - Search Table Metadata

Gathering all of the bits and pieces that make up a DDL statement requires a fairly large “system” query. Searching for column names, by contrast, should be simple. In fact, in most systems it is simple, once you finish searching Google for the right tables to query. I have researched how to search partial column names on probably 30 database systems over the years. When I start a new project, I frequently get vague instructions like, “you should find what you need in the Current Customer table”. When I finally get a database connection, I see table names like FRDST_CST_MN. Huh?

The column-search queries included in the catalog are all very straightforward. For example, here is the definition of Search for Columns by partial name - Public schema:

  select table_catalog as database, table_schema, table_name, column_name, ordinal_position as column_sequence
  from information_schema.columns
  where lower(table_schema) = 'public'
  and lower(column_name) like $partial_column_name

Note

The $partial_column_name is a parameter and is usually used with % wildcard characters.
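For example, running the asset with $partial_column_name set to '%cust%' is equivalent to this hand-substituted query (the pattern is just an illustration):

-- hand-substituted example: find public-schema columns whose names contain "cust"
select table_catalog as database, table_schema, table_name, column_name, ordinal_position as column_sequence
from information_schema.columns
where lower(table_schema) = 'public'
and lower(column_name) like '%cust%'
;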

Step 2 - Generate Drop Table Query

In some cases you can string together SQL statements to get more value from them. For instance, we often want to search the database catalog for table names that match a pattern and then generate DROP statements to clean up the database. The first query below will search for all tables in the information schema that match a name sequence.

  select table_catalog as database, table_schema, table_name
  from information_schema.tables
  where lower(table_name) like lower($partial_table_name);

You can now embed this SQL inside another SQL statement to generate the DROP statements:

  select 'drop table ' + table_name + ';'
  from
    (
      select table_catalog as database, table_schema, table_name
      from information_schema.tables
      where lower(table_name) like lower($partial_table_name)
    ) a;

When executed for tables that match the pattern ‘%pendo%’, it returns the following results:

drop table pendo_featureevents;
drop table pendo_visitor;
drop table pendo_trackevents;

Step 3 - Generate DDL

When we sit down to a new project, we frequently need to work with the Data Definition Language (DDL). In most database systems, the actual structure of database objects is spread across several tables. Table-level properties are one place, columns another place, constraints another. Some systems provide a view to pull all of these sources together so that we can easily query the DDL of an existing table. Redshift does not provide a built-in view for this, but Amazon has provided an example query on Github.

These 230 lines of SQL provided by Amazon allow an administrator to create a view that can then be queried to assemble the DDL.

    (
    SELECT
    table_id
    ,schemaname
    ,tablename
    ,seq
    ,ddl
    FROM
    (
    SELECT
    c.oid::bigint as table_id
    ,n.nspname AS schemaname
    ,c.relname AS tablename
    ,2 AS seq
    ,'CREATE TABLE IF NOT EXISTS ' + QUOTE_IDENT(n.nspname) + '.' + QUOTE_IDENT(c.relname) + '' AS ddl
    FROM pg_namespace AS n
    INNER JOIN pg_class AS c ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
    --OPEN PARENT COLUMN LIST
    UNION SELECT c.oid::bigint as table_id,n.nspname AS schemaname, c.relname AS tablename, 5 AS seq, '(' AS ddl
    FROM pg_namespace AS n
    INNER JOIN pg_class AS c ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
    --COLUMN LIST
    UNION SELECT
    table_id
    ,schemaname
    ,tablename
    ,seq
    ,'\t' + col_delim + col_name + ' ' + col_datatype + ' ' + col_nullable + ' ' + col_default + ' ' + col_encoding AS ddl
    FROM
    (
    SELECT
    c.oid::bigint as table_id
    ,n.nspname AS schemaname
    ,c.relname AS tablename
    ,100000000 + a.attnum AS seq
    ,CASE WHEN a.attnum > 1 THEN ',' ELSE '' END AS col_delim
    ,QUOTE_IDENT(a.attname) AS col_name
    ,CASE WHEN STRPOS(UPPER(format_type(a.atttypid, a.atttypmod)), 'CHARACTER VARYING') > 0
      THEN REPLACE(UPPER(format_type(a.atttypid, a.atttypmod)), 'CHARACTER VARYING', 'VARCHAR')
     WHEN STRPOS(UPPER(format_type(a.atttypid, a.atttypmod)), 'CHARACTER') > 0
      THEN REPLACE(UPPER(format_type(a.atttypid, a.atttypmod)), 'CHARACTER', 'CHAR')
     ELSE UPPER(format_type(a.atttypid, a.atttypmod))
     END AS col_datatype
    ,CASE WHEN format_encoding((a.attencodingtype)::integer) = 'none'
     THEN 'ENCODE RAW'
     ELSE 'ENCODE ' + format_encoding((a.attencodingtype)::integer)
     END AS col_encoding
    ,CASE WHEN a.atthasdef IS TRUE THEN 'DEFAULT ' + adef.adsrc ELSE '' END AS col_default
    ,CASE WHEN a.attnotnull IS TRUE THEN 'NOT NULL' ELSE '' END AS col_nullable
    FROM pg_namespace AS n
    INNER JOIN pg_class AS c ON n.oid = c.relnamespace
    INNER JOIN pg_attribute AS a ON c.oid = a.attrelid
    LEFT OUTER JOIN pg_attrdef AS adef ON a.attrelid = adef.adrelid AND a.attnum = adef.adnum
    WHERE c.relkind = 'r'
     AND a.attnum > 0
    ORDER BY a.attnum
    )
    --CONSTRAINT LIST
    UNION (SELECT
    c.oid::bigint as table_id
    ,n.nspname AS schemaname
    ,c.relname AS tablename
    ,200000000 + CAST(con.oid AS INT) AS seq
    ,'\t,' + pg_get_constraintdef(con.oid) AS ddl
    FROM pg_constraint AS con
    INNER JOIN pg_class AS c ON c.relnamespace = con.connamespace AND c.oid = con.conrelid
    INNER JOIN pg_namespace AS n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r' AND pg_get_constraintdef(con.oid) NOT LIKE 'FOREIGN KEY%'
    ORDER BY seq)
    --CLOSE PARENT COLUMN LIST
    UNION SELECT c.oid::bigint as table_id,n.nspname AS schemaname, c.relname AS tablename, 299999999 AS seq, ')' AS ddl
    FROM pg_namespace AS n
    INNER JOIN pg_class AS c ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'

    --DISTSTYLE
    UNION SELECT
    c.oid::bigint as table_id
    ,n.nspname AS schemaname
    ,c.relname AS tablename
    ,300000001 AS seq
    ,CASE WHEN c.reldiststyle = 0 THEN 'DISTSTYLE EVEN'
    WHEN c.reldiststyle = 1 THEN 'DISTSTYLE KEY'
    WHEN c.reldiststyle = 8 THEN 'DISTSTYLE ALL'
    WHEN c.reldiststyle = 9 THEN 'DISTSTYLE AUTO'
    ELSE '<<Error - UNKNOWN DISTSTYLE>>'
    END AS ddl
    FROM pg_namespace AS n
    INNER JOIN pg_class AS c ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
    --DISTKEY COLUMNS
    UNION SELECT
    c.oid::bigint as table_id
    ,n.nspname AS schemaname
    ,c.relname AS tablename
    ,400000000 + a.attnum AS seq
    ,' DISTKEY (' + QUOTE_IDENT(a.attname) + ')' AS ddl
    FROM pg_namespace AS n
    INNER JOIN pg_class AS c ON n.oid = c.relnamespace
    INNER JOIN pg_attribute AS a ON c.oid = a.attrelid
    WHERE c.relkind = 'r'
    AND a.attisdistkey IS TRUE
    AND a.attnum > 0
    --SORTKEY COLUMNS
    UNION select table_id,schemaname, tablename, seq,
       case when min_sort <0 then 'INTERLEAVED SORTKEY (' else ' SORTKEY (' end as ddl
    from (SELECT
    c.oid::bigint as table_id
    ,n.nspname AS schemaname
    ,c.relname AS tablename
    ,499999999 AS seq
    ,min(attsortkeyord) min_sort FROM pg_namespace AS n
    INNER JOIN  pg_class AS c ON n.oid = c.relnamespace
    INNER JOIN pg_attribute AS a ON c.oid = a.attrelid
    WHERE c.relkind = 'r'
    AND abs(a.attsortkeyord) > 0
    AND a.attnum > 0
    group by 1,2,3,4 )
    UNION (SELECT
    c.oid::bigint as table_id
    ,n.nspname AS schemaname
    ,c.relname AS tablename
    ,500000000 + abs(a.attsortkeyord) AS seq
    ,CASE WHEN abs(a.attsortkeyord) = 1
    THEN '\t' + QUOTE_IDENT(a.attname)
    ELSE '\t, ' + QUOTE_IDENT(a.attname)
    END AS ddl
    FROM  pg_namespace AS n
    INNER JOIN pg_class AS c ON n.oid = c.relnamespace
    INNER JOIN pg_attribute AS a ON c.oid = a.attrelid
    WHERE c.relkind = 'r'
    AND abs(a.attsortkeyord) > 0
    AND a.attnum > 0
    ORDER BY abs(a.attsortkeyord))
    UNION SELECT
    c.oid::bigint as table_id
    ,n.nspname AS schemaname
    ,c.relname AS tablename
    ,599999999 AS seq
    ,'\t)' AS ddl
    FROM pg_namespace AS n
    INNER JOIN  pg_class AS c ON n.oid = c.relnamespace
    INNER JOIN  pg_attribute AS a ON c.oid = a.attrelid
    WHERE c.relkind = 'r'
    AND abs(a.attsortkeyord) > 0
    AND a.attnum > 0
    --END SEMICOLON
    UNION SELECT c.oid::bigint as table_id ,n.nspname AS schemaname, c.relname AS tablename, 600000000 AS seq, ';' AS ddl
    FROM  pg_namespace AS n
    INNER JOIN pg_class AS c ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'

    )
    ORDER BY table_id,schemaname, tablename, seq
    )
    where schemaname = $schema and tablename = $table_name

Note

You will be prompted to supply an exact schema and table_name in this example.
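If you create the view itself from Amazon’s original script instead of using the catalog version above, querying it looks something like this (a minimal sketch; admin.v_generate_tbl_ddl is the name used in the Amazon example, and my_table is a placeholder):

-- reassemble the DDL for one table, in seq order
select ddl
from admin.v_generate_tbl_ddl
where schemaname = 'public'
  and tablename = 'my_table'
order by seq
;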

Redshift Utilities Aginity Catalog Assets

There are seven assets you can add to your catalog. I chose to add them as shown below.

_images/RSUtilityCatalog.png

These queries are specific to Redshift but could serve as a pattern for other database platforms. Just select a Redshift connection in the Pro/Team editor, then either double-click the catalog item to execute it, or drag and drop the catalog item to expose the code and run it.

Search for Columns by exact name - All schemas

This asset searches the information schema for columns with an exact name match across all schemas.

  select table_catalog as database, table_schema, table_name, column_name, ordinal_position as column_sequence
  from information_schema.columns
  where lower(column_name) = $column_name
Search for Columns by exact name - Public schema

This asset searches the information schema for columns with an exact name match in just the public schema.

  select table_catalog as database, table_schema, table_name, column_name, ordinal_position as column_sequence
  from information_schema.columns
  where lower(table_schema) = 'public'
  and lower(column_name) = $column_name
Search for Columns by partial name - All schemas

This asset searches the information schema for columns with a partial name match across all schemas.

  select table_catalog as database, table_schema, table_name, column_name, ordinal_position as column_sequence
  from information_schema.columns
  where lower(column_name) like $partial_column_name
Search for Columns by partial name - Public schema

This asset searches the information schema for columns with a partial name match in just the public schema.

  select table_catalog as database, table_schema, table_name, column_name, ordinal_position as column_sequence
  from information_schema.columns
  where lower(table_schema) = 'public'
  and lower(column_name) like $partial_column_name
Search for Tables by partial name - All schemas

This asset searches the information schema for tables with a partial name match across all schemas.

  select table_catalog as database, table_schema, table_name
  from information_schema.tables
  where lower(table_name) like lower($partial_table_name)
Search for Tables by partial name - Public schema

This asset searches the information schema for tables with a partial name match in just the public schema.

  select table_catalog as database, table_schema, table_name
  from information_schema.tables
  where lower(table_schema) = 'public'
  and lower(table_name) like $partial_table_name

Lesson #3: Generating SQL to Profile Table Data

If you are here to use the data-profiling-SQL-generating example provided with Aginity Pro, you can jump straight to Step 4 - Aginity Active Catalog Example. If you want to dig into the concepts behind that example, then read on.

A large part of the day-to-day work of both data analysts and data engineers is to explore data in database tables. For small datasets, analysts have many tools available. In addition to showing us five-number summaries, those tools might display distribution graphs or plots of relationships between columns. However, in the world of cloud-based data these tools might not be available. The data might be too big to bring back to the tool. We might have other restrictions related to connections, or even legal concerns about transferring data. Sometimes, in order to begin exploring our data, we need to begin that exploration directly in SQL.

Of course, SQL is great for aggregating data. MPP cloud databases like Redshift, Snowflake, Netezza, and Hive are optimized for responding to this kind of SQL query. SQL is the most straightforward language for expressing ideas about data relationships and performing manipulations on data. However, there aren’t generally available facilities to explore “all” the data. For example, in Python, R, or SAS, there are language keywords that allow me to say “apply this function to all of the columns of this dataset”. On the other hand, in SQL we have to specifically list each column transformation that we want to be performed.

All database systems have the concept of “system tables” or “system catalog” or “catalog tables” that are tables or views that contain information about all of the database objects, including table and column names. In this tutorial, we’ll learn how to use these system tables to automatically generate SQL that can summarize our data.

Step 1 - What do you mean “generate SQL”?

I love programming languages. A lot of languages are great for generating source code, either of their own language or of a different language. SQL is not a programming language that naturally comes to mind when we think about doing a lot of string manipulation. However, database administrators have a long history of writing SQL that generates SQL.

You might run into the term “dynamic SQL”. This is the term used in database systems that include a procedural language: the procedural language is used to generate SQL, which is then submitted later within the same procedural program.

For our purposes, we are technically using “static SQL”; that is, when the SQL statement is submitted to the database system, it is a complete SQL statement. However, before submitting that SQL statement, we will first submit a prior statement whose output is the static SQL that will profile our data. This idea will become clearer as we work through some examples, starting with the minimal sketch below.
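Here is a minimal sketch of the idea: the output of this statement is itself a set of (hypothetical) row-count queries, one per table in the public schema.

-- SQL that generates SQL: each output row is a runnable statement
select 'select count(*) as row_count from '
       || table_schema || '.' || table_name || ';' as generated_sql
from information_schema.tables
where table_schema = 'public'
;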

Step 2 - What are “system tables”?

Just one more term before we jump into an example. Database “system tables” or “system views” hold information about the database itself. In Redshift (and in most database systems based on PostgreSQL), the view information_schema.columns contains all of the columns for all of the tables in the currently active database.

_images/generate_columns_table.png

As an initial example, let’s say that we want to generate a SQL statement to get the maximum value from each column in the stl_plan_info table that has “node” in its name. Here is a query to get the column names we are interested in:

select column_name, data_type
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type = 'integer'
and column_name like '%node%'
;
_images/generate_results_1.png

Using some SQL string concatenation, we can generate the aggregation SQL based on those criteria:

select
'select '||quote_literal(column_name)||' as col_name, max('||column_name||') as max_value from stl_plan_info;' as generated_sql
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type = 'integer'
and column_name like '%node%'
;
_images/generate_results_2.png

You’ll notice that this naive example produces one SQL statement per column, which might not be what we want. We’ll revisit this issue later.

Note

A note on the examples used here.

Redshift includes STL Tables that contain log information for events that occur on the system. Some of these tables are only accessible to administrators. However, STL Tables pertaining to the queries that you execute are available to you. In order for you to follow along with this tutorial, when we need to profile some data we’ll point to one of these STL Tables that is guaranteed to be available.

Method

As you can guess from looking at the brief example above, SQL that generates SQL can get complicated pretty quickly. The general method for developing these queries is first to figure out what you want the profiling SQL to look like. Then, write a simple query to generate just one portion of that target SQL. Keep adding to your query until you achieve the target SQL.

In this case, we have a wonderfully complex SQL-generating query provided by Aginity as an example in the Active Catalog. Below, we’ll look at some general principles, then explore this example query.

Step 3 - Data Profiling

If we already know which columns of a table are “interesting”, then we can just write SQL to explore those columns. However, when we are presented with a new table, we don’t know which columns are interesting and which are not. We don’t know whether a column contains just a few values repeated over and over, or whether there are millions of unique values. We don’t know whether the date columns represent only today, or whether they stretch back 20 or 30 years. Getting this knowledge over the data is one reason that we profile.

Another reason that we profile tables is to get a handle on data quality. If one day our STATE table has 50 unique values and the next day it has 482 unique values, then we might need to investigate the ETL process because something has clearly gone wrong. Other changes are more subtle. If the average transaction_count is 4,927,642 one day and it is 3,477,923 the next, then is there a problem with the data? Maybe, maybe not. However, we can capture the profile data each day and store it in a table. Then we can check the standard deviation of the average_transaction_count to see whether there might be a problem worth investigating.
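As a minimal sketch of that idea (the profile_history table and its columns are assumptions; you would populate it each day with the output of the profiling queries built later in this lesson):

-- a table you maintain yourself to track daily profile results
create table if not exists profile_history (
    profile_date    date,
    table_name      varchar(128),
    column_name     varchar(128),
    row_count       bigint,
    distinct_count  bigint,
    average         double precision
);

-- compare each day's average to the column's overall mean and standard deviation
select profile_date, column_name, average,
       avg(average) over (partition by column_name) as overall_avg,
       stddev_samp(average) over (partition by column_name) as overall_stddev
from profile_history
where table_name = 'transaction_detail'   -- hypothetical table being tracked
order by column_name, profile_date
;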

Data Types

This might seem obvious, but it bears emphasizing. Different data types have different aggregation functions. Taking the average (mean) of a list of dates doesn’t make any sense. You might be able to get the max and min of a character string, but that doesn’t really give the same insight as the same aggregation over numeric data.

A final consideration is that physical data types are not always useful for determining aggregation. For example, the ID column of a table might be of type bigint. You can take the min, max, and average of this column, but that doesn’t tell you anything useful about the data. So, feel free to create your own version of this example – that’s why Aginity provided it – that takes into account your local naming standards and business knowledge to avoid performing aggregation on columns that won’t provide useful information.
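For example, a small tweak to the column-selection logic (the naming convention here is an assumption) can skip surrogate-key columns entirely:

-- select numeric columns to profile, excluding ID-style columns
select column_name, ordinal_position
from information_schema.columns
where table_schema = $schema
  and table_name = $table_name
  and data_type in ('bigint', 'double precision', 'integer', 'numeric', 'real', 'smallint')
  and lower(column_name) <> 'id'
  and right(lower(column_name), 3) <> '_id'
;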

Step 4 - Aginity Active Catalog Example

In Aginity Pro, in the “Catalog” tab, you should see a folder labeled Sample Catalog. Open the folder path Sample Catalog -> Examples -> Redshift -> Data Profile.

_images/generate_catalog_navigation.png

Set up your Query Window with this query:

select query from @{/Sample Catalog/Examples/Redshift/Data Profile/Profile Data in Table};

and run with these parameters:

_images/generate_params.png

Your results will look like this:

_images/generate_results_3.png

with the “query” column containing the text of the data profiling SQL.

Here is the SQL that was returned with some formatting applied.

 SELECT 'starttime',
        count(starttime) AS ROW_COUNT,
        min(starttime) AS MIN,
        max(starttime) AS MAX,
        count(DISTINCT starttime) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'endtime',
        count(endtime) AS ROW_COUNT,
        min(endtime) AS MIN,
        max(endtime) AS MAX,
        count(DISTINCT endtime) AS distinct_count
 FROM pg_catalog.stl_query;

 SELECT 'userid',
        count(userid) AS ROW_COUNT,
        min(userid) AS MIN,
        max(userid) AS MAX,
        avg(userid) AS average,
        count(DISTINCT userid) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'query',
        count(query) AS ROW_COUNT,
        min(query) AS MIN,
        max(query) AS MAX,
        avg(query) AS average,
        count(DISTINCT query) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'xid',
        count(xid) AS ROW_COUNT,
        min(xid) AS MIN,
        max(xid) AS MAX,
        avg(xid) AS average,
        count(DISTINCT xid) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'pid',
        count(pid) AS ROW_COUNT,
        min(pid) AS MIN,
        max(pid) AS MAX,
        avg(pid) AS average,
        count(DISTINCT pid) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'aborted',
        count(aborted) AS ROW_COUNT,
        min(aborted) AS MIN,
        max(aborted) AS MAX,
        avg(aborted) AS average,
        count(DISTINCT aborted) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'insert_pristine',
        count(insert_pristine) AS ROW_COUNT,
        min(insert_pristine) AS MIN,
        max(insert_pristine) AS MAX,
        avg(insert_pristine) AS average,
        count(DISTINCT insert_pristine) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'concurrency_scaling_status',
        count(concurrency_scaling_status) AS ROW_COUNT,
        min(concurrency_scaling_status) AS MIN,
        max(concurrency_scaling_status) AS MAX,
        avg(concurrency_scaling_status) AS average,
        count(DISTINCT concurrency_scaling_status) AS distinct_count
 FROM pg_catalog.stl_query;


 SELECT 'label',
        count(label) AS ROW_COUNT,
        max(top10_literals) AS top10_literals,
        count(DISTINCT label) AS distinct_count
 FROM pg_catalog.stl_query r
 CROSS JOIN
   (SELECT listagg(label, ',') top10_literals
    FROM
      (SELECT top 10 label
       FROM
         (SELECT label,
                 count(*) cnt
          FROM pg_catalog.stl_query
          GROUP BY label)
       ORDER BY cnt DESC)) AS rr
 UNION ALL
 SELECT 'database',
        count(DATABASE) AS ROW_COUNT,
        max(top10_literals) AS top10_literals,
        count(DISTINCT DATABASE) AS distinct_count
 FROM pg_catalog.stl_query r
 CROSS JOIN
 (SELECT listagg(DATABASE, ',') top10_literals
    FROM
      (SELECT top 10 DATABASE
       FROM
         (SELECT DATABASE,
                 count(*) cnt
          FROM pg_catalog.stl_query
          GROUP BY DATABASE)
       ORDER BY cnt DESC)) AS rr
 UNION ALL
 SELECT 'querytxt',
        count(querytxt) AS ROW_COUNT,
        max(top10_literals) AS top10_literals,
        count(DISTINCT querytxt) AS distinct_count
 FROM pg_catalog.stl_query r
 CROSS JOIN
   (SELECT listagg(querytxt, ',') top10_literals
    FROM
      (SELECT top 10 querytxt
       FROM
         (SELECT querytxt,
                 count(*) cnt
          FROM pg_catalog.stl_query
          GROUP BY querytxt)
       ORDER BY cnt DESC)) AS rr ;

Here we see three SQL statements: one for the two time columns, one for the seven numeric columns, and one for the three text columns.

Step 5 - Digging into the Example

Let’s open up the example. In the Active Catalog, navigate as before but rather than double clicking on Profile Data in Table, this time drag it into the Query Window. This will expand the Catalog Item so that it looks like this:

  (select
  case

      when section = 'numeric' then
          'select '''||column_name||''', count('||column_name||') as row_count, min('||column_name||') as min, max('||column_name||') as max, avg('||column_name||') as average,
              count(distinct '||column_name||') as distinct_count
          from '||$schema||'.'||$table_name
      when section = 'text' then
          'select '''||column_name||''', count('||column_name||') as row_count,  max(top10_literals) as top10_literals,
              count(distinct '||column_name||') as distinct_count
          from '||$schema||'.'||$table_name||' r
          cross join (select listagg( '||column_name||', '','') top10_literals from (select top 10 '||column_name||' from (select '||column_name||', count(*) cnt from '||$schema||'.'||$table_name||' group by '||column_name||') order by cnt desc)) as rr '
      when section = 'datetime' then
          'select '''||column_name||''', count('||column_name||') as row_count, min('||column_name||') as min, max('||column_name||') as max,
              count(distinct '||column_name||') as distinct_count
          from '||$schema||'.'||$table_name
  end ||
  case when ordinal_position = (
  select max(ordinal_position) from
  (
      select column_name, ordinal_position, 'numeric'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('bigint', 'double precision', 'integer', 'numeric', 'real', 'smallint')
              union all
      select column_name, ordinal_position, 'text'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('"char"', 'character', 'character varying', 'text')
              union all
      select column_name, ordinal_position, 'datetime'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('abstime', 'date', 'timestamp with time zone', 'timestamp without time zone')
  ) c2 where c2.section = c1.section ) then ';' else
  ' union all'  end as query, section, ordinal_position
  from
  (
      select column_name, ordinal_position, 'numeric'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('bigint', 'double precision', 'integer', 'numeric', 'real', 'smallint')
              union all
      select column_name, ordinal_position, 'text'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('"char"', 'character', 'character varying', 'text')
              union all
      select column_name, ordinal_position, 'datetime'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('abstime', 'date', 'timestamp with time zone', 'timestamp without time zone')
  ) c1
  ) r order by section , ordinal_position;

You can learn a lot by digging into this example and adapting it for your own purposes, creating your own Active Catalog entries. Here, I’ll draw your attention to two particular aspects.

The problem of stopping

In our initial example, we generated one statement per “node” column. We might try to combine these into a single statement with this (broken) code:

select
'select '||quote_literal(column_name)||' as col_name, max('||column_name||') as max_value from stl_plan_info union all' as generated_sql
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type = 'integer'
and column_name like '%node%'
;

which, after formatting, produces this SQL

SELECT 'plannode'    AS col_name,
       max(plannode) AS max_value
FROM   stl_plan_info
UNION ALL
SELECT 'nodeid'    AS col_name,
       max(nodeid) AS max_value
FROM   stl_plan_info
UNION ALL

The final “UNION ALL” doesn’t belong there. This is a common problem when generating “delimited” items: you need to handle the last item slightly differently than the rest of the items. How does our Active Catalog example handle this?

Look particularly at this construction, which joins two case expressions with the concatenation operator:

case .. end || case .. end as query

The first case statement sets up generation of the aggregate functions, very similarly to our simplified example, with one clause for each data type. The second case statement checks whether the current row holds the max(ordinal_position) for its section. ordinal_position is given to us by the information_schema.columns view as the ordering of column names in the table. So, this second clause finds the last column name and appends a semicolon (;) in that case; otherwise, it appends ' union all'.

In our simplified example, which doesn’t need the first case statement, it looks like this:

select
'select '||quote_literal(column_name)||' as col_name,
    max('||column_name||') as max_value from stl_plan_info' ||
case when ordinal_position =
  (select max(ordinal_position) from
    (select column_name, ordinal_position from information_schema.columns
     where table_name = 'stl_plan_info' and data_type = 'integer' and column_name like '%node%')
   )
  then ';' else ' union all'
  end as generated_sql
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type = 'integer'
and column_name like '%node%'
order by ordinal_position
;

which produces the desired output, giving us a single, well-constructed SQL statement covering both columns. In our simplified version, there is only one possible row in the subselect with max(ordinal_position). Because the Active Catalog version handles multiple data types, it can get multiple result rows. It uses a technique that we won’t cover in depth here, called a correlated subquery, to manage that situation; that is what the c1 and c2 aliases in the code above are for.
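
If the correlated subquery idea is new to you, here is a minimal sketch of the pattern on its own. It groups simply by exact data_type rather than the catalog’s data-type families, so treat it as an illustration rather than the catalog code: the inner query references the outer query’s alias to find the last column within each group.

-- minimal sketch: flag the last column in each data_type group
-- the inner query (alias c2) is correlated to the outer row (alias c1)
select c1.column_name,
       c1.ordinal_position,
       case when c1.ordinal_position =
                 (select max(c2.ordinal_position)
                  from information_schema.columns c2
                  where c2.table_name = c1.table_name
                    and c2.data_type  = c1.data_type)
            then ';' else ' union all' end as terminator
from information_schema.columns c1
where c1.table_name = 'stl_plan_info'
order by c1.ordinal_position;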

Rolling up using list_agg()

A final consideration with the Active Catalog version is the situation with the “top 10” most frequently occurring values for text data. Here is the generated code for the querytxt column:

select 'querytxt', count(querytxt) as row_count,  max(top10_literals) as top10_literals,
            count(distinct querytxt) as distinct_count
        from pg_catalog.stl_query r
        cross join (select listagg( querytxt, ',') top10_literals
                    from
                     (select top 10 querytxt from
                          (select querytxt, count(*) cnt
                           from pg_catalog.stl_query group by querytxt
                          ) order by cnt desc
                     )
                   ) as rr ;

Let’s examine this from the inside out. The innermost subselect returns the count(*) for each value of our target text column. The next layer up uses a handy Redshift feature, the TOP clause, to get the top 10 querytxt values by count. But we want those querytxt values rolled up onto a single row rather than spread across multiple rows. This is exactly what the LISTAGG function does. It takes the text rows from the querytxt column and concatenates them into one long string named “top10_literals”. This probably isn’t too useful for a complex text value like querytxt, but LISTAGG is great to have in your bag of tricks.
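
If you haven’t used LISTAGG before, here is a tiny, self-contained sketch (the values are made up purely for illustration) showing how it rolls several rows into one delimited string:

-- hypothetical example: roll three rows into a single comma-separated string
select listagg(color, ', ') within group (order by color) as all_colors
from (select 'red' as color union all select 'green' union all select 'blue') as colors;
-- expected result: blue, green, red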

Conclusion

First, use the Active Catalog. Seriously, right now open a Redshift connection tab in Aginity Pro, navigate to the Profile Data in Table object, double-click it, put select * in front of the inserted text, hit F5, and start using the generated code.

The idea of SQL that generates SQL is kind of mind-bending, but it is a technique that has been in use for several decades. You can find examples of it all over the web for every conceivable database system. I always learn something from deconstructing other people’s code.

When writing your own SQL-generating SQL, start slowly. Once you have a technique that works, put that into the Active Catalog so that you can find it when you need it and even share it with your teammates.

Lesson #4: How to hide a picture of a cat in Redshift

I didn’t start out trying to put a picture of a cat into Redshift. My motivation isn’t any less strange, though. The product manager of Aginity Team put out a call for “interesting” things to do with Aginity Team. I thought, “what could be more interesting than having Aginity Team print a picture of Aginity’s Enterprise Product Manager and Director of Solution Architecture, George L’Heureux?” Of course, encouraging you to put a picture of George, or of any of your co-workers, into your Redshift database would just be creepy. So, you can use an ASCII picture of a cat, hence the title of this article. The desktop software, Aginity Pro, works identically to Aginity Team, so you should be able to follow this tutorial with either version of the software.

The truth is that those of us who wrangle data are often asked to do seemingly strange things with the tools at hand: query the Google Analytics API from Redshift; pull data from a GIS system directly into Snowflake; build an ad hoc ETL system inside Hive while we are waiting for IT to start the “real” project; etc. In this article, we will take some arbitrary string data, compress it according to a well-known algorithm by using some advanced SQL windowing functions, store the results in a database, read the compressed data, and uncompress the data using Redshift’s regular expression syntax. The fact that the string data makes a picture of a cat, or of your co-worker, is just for fun. Along the way, we’ll overcome some particular challenges that Redshift has with looping over functions.

Step 1 - Getting your string input

I don’t have any particular advice on an ASCII art converter program. Search Google and you will find many of them. I kept my resulting output to 100 columns wide and 51 rows, which worked well for my image. Also, I stayed away from output that included single quotes. Here is the image I was working with.

/////::::::::::::::::---------------------------------------::::::::::::::::::::::////////////++++++
/:::::::::::::----------------------------------------------------:::::::::::::::::::://////////++++
::::::::::::------------------------------://+ossshho+//--------------::::::::::::::::::::://///////
::::::::-----------------------------:/ohdNNNMNMMMMMMNNNdy+:--------------:::::::::::::::::::://////
::::-----------------------.....--:sdNNNMMMMMMMMMMMMMMMMMMNd+------------------:::::::::::::::::////
---------------------...........-sdNMMMMNMMMMMNNNMMMNNMMMMMMNd+---------------------:::::::::::::://
-----------------.............-odMMMMMNNNmNmmdddmmmmddmmNNNNNMNdo:----------------------::::::::::::
--------------..............-sdNMMNNmmdhhhhssoosyyssssssyhhmmNNMNms:----------------------::::::::::
----------.................:yNMMNNmdhyssooo+++++++++++++ossyyhmNMMNd+------------------------:::::::
------....................-hNMMMmhyysoo++++////////////+++oosyhdmNMMNy--..---------------------:::::
-----.....................+NMMNmhysso+++/////////:::////++++osyhdmNNNNs-.....--------------------:::
---.......................sNMMNdysso+++////////////////+++++oooyhdNMMMm/.......-------------------::
-........................-mMMMNdyso++++//////////////////+++++osyhmNMMNs-.........------------------
-........................:NMMMNdyso+++///::::-::/::::-:::///++osyhmNNMNh:...........----------------
.........................-mMMNNmysoo++//::----.-:------::////++syhmNMMNh:.............--------------
.........................-dNMNNmhysoooo++++++/:-:::///+oooo+osssshmNNNms-...............------------
..........................+NNNNmhyssyyo++///://::::///://////+shyydNNNm+.................-----------
..........................-mNNNmhssys+++sys/+//////+o+++/yhyssoosyhNNNh:..................----------
...........................+NNNmysooosysss+///+o++++so/////++ooo+oymNms......................-------
...........................:hmNms++++++//////++o+++oso+/::///+++oosdNmo.......................------
...........................:oyNms++++////////++oo+/oso+/::::///++oodNd+.......................------
............................/ymdooo+//::::::/+oo+/:/oso+:-----:/+syhmh/.......................------
.............................smdsso+/:---://++++/:-:+ooo+/:---::/oyhdy:.........................----
.............................-ydyso//:::://:+o++++++ssss+////////oyhdo-.........................----
.............................-sdhyo++//+oossyyysso++syyyyyssso++oshmms-..........................---
..............................omdsoo+/+yyhhhhysso+/+oyyhdhddhs+ssydmmh/.........................----
.............................-ommhsss/+shyhhy+/:::::/:/osysyhsyshdmmmd+.........................-.--
..............................+mmdhyyyoos+/oys+/-...:/+ysoyysyhhddmmmd+-.........................---
..............................-ymmdhhhhyy+//+sssoooossyoooshyhddmmNmmy/-.........................---
.............................../mNmdhhhhyo/:/+++++++oo++oyyhhddmmNNmd+-........................-..--
...............................-hNNmddddhs/:/+o++++++++o+shdmmmNNNmmy:..........................-..-
..............................:ymmNNmmmmdyo//+osssssso+/ohdmNNNNNNmy/...........................----
...........................-+ymNhomNNNNmmdhsoosssoo+syssydmNNNNNNmh+............................----
.....................-:/oydmNNNNy-+dmNNNmmmdhhyyysooyhhddmNNNNNNNNmdyo:-.........................---
................-:/oydmNNNNNNNNms..-+dmNNNmmmmdddddhdmmmmmNNNmhNNNNNNNmdhs+:-.....................--
.........--/+oyhdmNNNNNNNNNNNNNms..``-sdmmNNNNmNmmmNNNNmNNNmy/+mmNNNNNNNNNNmdyo+/:-..............---
--..-:/oyhdmNNNNNNNNNNNNNNNNNNmms..````-ohdmmNNNNNNNNmNNmy+:..smmNNNNNNNNNNNNNNNmmdyo/:-.........---
-/+shdNNNNNNNNNNNNNNNNNNNNNNNNmms..```  `.:osyhhddddhyy+:.```-hmmmmmNNNNNNNNNNNNNNNNNNmdyo/:--------
dmNNNNNNNNNNNNNNNNNmNNNNNNNNNNmmy.`````    `.:/+//:-.````````-dmmmmmmmmmmNNNNNNNNNNNNNNNNNNmhy+:----
NNNNNNNNNNNNNNNNNNmNNNNNNNNNNmmmh.```````  `.-::``   `````  `/mmmmmmmmmmmNNmNNNNNNNNNNNNNNNNNNNmhs/-
NNNNNNNNNNNNNNNNNNNmNNNNNNNNNmmmd.` ```````shddy:  `````    `ommmmmmmmmmmmNNNNNmNNNNNNNNNNNNNNNNNNms
NNNNNNNNNNNNNNNNNNNNNNNNNNmmmmmmm-``   ```.ymmhyh:````      `ymmmNmmmmmmmNNNNNNNmNNNNNNNNNNNNNNNNNNm
NNNNNNNNNNNNNNNNNNNNNNNNmNNmmmmmm:``   ``.`:Ndyhd/``       `.mmmmNmmmmmmNNNNNNNNmNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNmmmmmmmmm+`````````+Ndhdh``       ``/mmmmmmNmmmNNmNNNNNNmmNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNmNNNNNNmmNmmmmms`````````ymhddh-``  ` ````smmmmmmNmmNNNNNNNNNNNmNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNmNNmmmmmh.``````.omdhmdhd:````````.dmmNmmmNmmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNmNNmNmmd.``````/hmhhmmymh.```````:mmmNmmNNmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNmNNNmNmm:`````.smdhhdmhmd:```````smmmmmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNmmNNmmNm+`````:dNdhhdddmd/```.``.hmmNmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNmNNNmmms.```.+mmhsyyddmd+`..```:mmNmmNNmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNmh..`..hmmhhyhmmmds`..```ommNmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

You can use whatever techniques you like for getting data into Redshift. The COPY command is very handy. Here is what I did:

drop table my_text;
create table my_text (seq int IDENTITY, line varchar(255));

copy my_text
from 's3://your_s3_bucket/txt/my_picture_ascii.txt'
credentials 'aws_access_key_id=XXXXXXXXX;aws_secret_access_key=ZZzzZZZzzZZZZ'
format as FIXEDWIDTH 'line:100'
;

However, note the sentences in the CREATE TABLE documentation that say:

With a COPY operation, the data is loaded in parallel and distributed to the node slices.
To be sure that the identity values are unique, Amazon Redshift skips a number of values
when creating the identity values.  As a result, identity values are unique and sequential,
but not consecutive, and the order might not match the order in the source files.

What I found is that with a small file like this, although there are gaps in the seq numbers, my lines come back in the right order as long as I order by seq. If you run into problems, you can add a sequence number to your text file to keep things straight.
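
Before moving on, it’s worth a quick sanity check on the load. This sketch (against the my_text table created above) confirms the row count and lets you eyeball the gaps in the identity values:

-- sanity check: expect 51 rows, with seq increasing but not necessarily consecutive
select count(*) as line_count,
       min(seq) as first_seq,
       max(seq) as last_seq
from my_text;

select seq, line
from my_text
order by seq
limit 5;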

Step 2 - Encoding the Data

We don’t want to just insert our text picture into a table and then pull it back out. Where’s the challenge in that? We need to encode it somehow. I looked at the DEFLATE algorithm, but decided not to tackle encoding that in SQL today. Our data is plain ASCII with a lot of repeats. After a bit of searching, Run Length Encoding seemed like a good choice for this exercise.

Run Length Encoding

For Run Length Encoding, we are going to take a string, such as the first line of our ASCII picture:

/////::::::::::::::::---------------------------------------::::::::::::::::::::::////////////++++++

and encode all of the repeats. I found several implementations online and it seems pretty loose as to whether the character or number is first. For our purposes, we’ll put the character first. So the encoding for this line is:

/5:16-39:22/12+6

because the string has 5 slashes, followed by 16 colons, followed by 39 dashes, and so on.
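
As a quick check on the encoding, the run lengths should add back up to the 100-character line width:

-- the run lengths in /5:16-39:22/12+6 sum to the original line width
select 5 + 16 + 39 + 22 + 12 + 6 as decoded_length;  -- returns 100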

Iterating in Redshift

The Redshift substring function will be useful here,

select seq, line, substring(line, 1, 1) as letter
from my_text
order by seq
;
_images/cat_result_1.png

That gives us the first letter of each line. What we really want is each letter of each line turned into rows of our query result. For example, Hive has functions that can pull a string apart into an array and can “explode” that array into rows of a query result. Redshift doesn’t have that, so we will have to handle that ourselves.

In order to explode our letters from the string into rows, we need to iterate over each letter. We want a row for each letter of the original string. We can use the SQL cross join against a “numbers” table to achieve this effect.

Note

Sidebar numbers

Needing a set of numbers is so common in SQL that many database systems have a function to generate numbers as needed. For example, PostgreSQL 9 has generate_series(). Unfortunately, Redshift and many other MPPs lack this feature; fortunately in MPPs, we have lots of things we can count in order to generate numbers.

In Redshift, the stl_plan_info table has at least one record for every query you have run. If you have been using Redshift for a while, then this will be sufficient. Our string is only 100 characters wide, so we are only going to use the numbers from 1 through 100.

drop table if exists numbers;
create temp table numbers as
select num from
(select cast(row_number() over (partition by 1) as int) as num from stl_plan_info
) inner_query
;
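
Since we only need the numbers 1 through 100, it’s worth confirming that the generated table actually covers that range (on a lightly used cluster, stl_plan_info may not have many rows):

-- confirm we generated at least 100 numbers
select count(*) as how_many, max(num) as max_num
from numbers;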

To achieve this iteration, we can modify our query like this:

select seq, num, line, substring(line, num, 1) as letter
from my_text
cross join numbers
where num <= length(line)
order by seq, num
;
_images/cat_result_2.png

For each row in the my_text table, the cross join gives us a copy for each row of our numbers table. Our substring call now uses num from the numbers table as its start position. We don’t need copies for all of the rows in the numbers table, so we limit the copies to the length of the text. Now we have our string pivoted so that each letter is on a separate row.
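
To see the explode effect in isolation, here is a minimal sketch that pulls a short literal string apart into one row per character using the same numbers table:

-- minimal sketch: explode the string 'abc' into one row per letter
select num, substring('abc', num, 1) as letter
from numbers
where num <= length('abc')
order by num;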

Where are we heading? We want to count the 5 slashes, then the 16 colons, and so on down the result set.

Grouping with windowing functions

Redshift supports the standard SQL window functions. My first thought was to use count() or row_number() grouped over our letters. However, letters that we have already seen can reappear later in a line. For example, the slash characters that begin the first line also end it. This reappearance foils any attempt to use grouping functions alone. First, we need to mark each change, preserving the character position where the change occurs. We use the LAG() function to bring in the previous letter for comparison and then some case logic to detect that a change has happened.

select seq, num, line, substring(line, num, 1) as letter,
   lag(letter) over(partition by seq order by num) as previous_letter,
   case when num = 1 then 1
        when letter <> previous_letter then num
   else 0 end as group_start
from my_text
cross join numbers
where num <= length(line)
order by seq, num
;
_images/cat_result_3.png

Next, we pair each group_start with the following group’s start using the LEAD() function, filtering out all of the rows that don’t start a new group. Now we have one row for each group. Each of those rows has enough information to calculate the beginning and end of a substring for that group, which we’ll call a “chunk”.

select seq, group_start,
       nvl(lead(group_start) over(partition by seq order by group_start), length(line)+1) as next_group_start,
       substring(line, group_start, next_group_start - group_start) as chunk,
       substring(chunk, 1, 1) || length(chunk) as encoded_chunk
from
(select seq, num, line, substring(line, num, 1) as letter,
       lag(letter) over(partition by seq order by num) as previous_letter,
       case when num = 1 then 1
            when letter <> previous_letter then num
       else 0 end as group_start
    from my_text
    cross join numbers
    where num <= length(line)
    order by seq, num
 )
where group_start <> 0
order by seq, group_start
;
_images/cat_result_4.png
Rolling up with the listagg function

Now we have each chunk “encoded” with the character and count. We only need to bring all of the encoded chunks up to the same row so that the encodings for the entire line are together. The Redshift LISTAGG() function is what we want here.

select seq as id, listagg(encoded_chunk) within group (order by group_start) as value
from (
    select seq, group_start,
           nvl(lead(group_start) over(partition by seq order by group_start), length(line)+1) as next_group_start,
           substring(line, group_start, next_group_start - group_start) as chunk,
           substring(chunk, 1, 1) || length(chunk) as encoded_chunk
    from
    (select seq, num, line, substring(line, num, 1) as letter,
           lag(letter) over(partition by seq order by num) as previous_letter,
           case when num = 1 then 1
                when letter <> previous_letter then num
           else 0 end as group_start
        from my_text
        cross join numbers
        where num <= length(line)
        order by seq, num
     )
    where group_start <> 0
    order by seq, group_start
)
group by seq
order by seq
;
_images/cat_result_5.png

We have now encoded our string lines into an RLE format. Before we decode, let’s store that in a table:

create table dev.public.rle_data as
select * from ...
;

Step 3 - Decoding the Data

Now our picture is “hidden” in a Redshift table, with each line compressed into RLE format. We used iteration over the substring() function plus some windowing functions to get there. To pull the data out, we are going to use iteration over Redshift regular expression functions.

Here is our first row of data:

/5:16-39:22/12+6

We have a target character, followed by a number that tells how many times to repeat the character. We do have a REPEAT() function in Redshift.

select repeat('/', 5);

So, if we can identify the components, we can build the string for that repetition. We know from the previous section that once we have those repetition strings on separate rows, we can use the LISTAGG() function to pull them together into the same row, reconstituting the original string.
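
To make that concrete, a single encoded chunk such as /5 can be rebuilt with SUBSTRING(), CAST(), and REPEAT() (a minimal sketch on a literal value):

-- decode one RLE chunk: the character is the first byte, the count is everything after it
select repeat(substring('/5', 1, 1),
              cast(substring('/5', 2) as int)) as expanded;  -- returns /////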

Iterating using Regular Expression functions

Redshift supports several regular expression functions. Understanding regular expressions is one of the most important skills a programmer can have, and with modern database systems, that includes SQL programmers. MPP SQL systems vary widely in their support for regular expressions. In Redshift, the REGEXP_SUBSTR() function is straightforward but limited.

select *, REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, 1)
from rle_data a
order by id
;
_images/cat_result_6.png

With this function we are substringing the value column, which contains the RLE-encoded string, using the pattern \\D\\d{1,}, starting from the beginning of the string and extracting the first occurrence of the pattern, which is /5. Developing regular expression patterns begins with thinking clearly about what you are trying to achieve. In this case, we set up our RLE to be “a character followed by some digits”. Another way to characterize that is “a non-digit followed by at least 1 digit”, which is what the pattern \\D\\d{1,} expresses.
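
The occurrence argument (the fourth parameter) is what lets us walk through the string one match at a time; here is a minimal sketch on a literal value:

-- the fourth argument of REGEXP_SUBSTR picks which match to return
select regexp_substr('/5:16-39:22/12+6', '\\D\\d{1,}', 1, 2) as second_match;  -- returns :16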

With Hive, we could “explode” directly into an array with multiple occurrences of the regular expression. With Redshift, we’ll reuse the numbers-table technique from earlier to drive iteration over the occurrence parameter of REGEXP_SUBSTR(). The only thing we need to know is when to stop iterating, which could be different for each string. Fortunately, the REGEXP_COUNT() function will help with that.

select *, REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)
from rle_data a
cross join numbers n
where num <=REGEXP_COUNT(value, '\\D\\d{1,}')
order by id, num
;
_images/cat_result_7.png

For each pattern occurrence, we want to expand it using the REPEAT() function shown above. For that function, we need the first character of the occurrence and the number of times to repeat, which is everything in the pattern occurrence after the first character. Let’s look at all of that data on a single row:

select row_number() over (partition by 1) as key,
       id as line_number, num as pat_number, value as rle_line,
       REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)  as pattern_occurrence,
       substring(pattern_occurrence, 1, 1) as rle_char,
       cast(substring(pattern_occurrence, 2) as int) as char_count,
       repeat(rle_char, char_count) as expanded_pattern
from
 (select *, REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)
  from rle_data a
  cross join numbers n
  where num <=REGEXP_COUNT(value, '\\D\\d{1,}')
  order by id, num
 ) as rle_data
order by id, num
 ;
_images/cat_result_8.png
Rolling up

Once again, we can use the LISTAGG() function to roll this data up to a single row.

select line_number, rle_line, listagg(expanded_pattern) within group (order by line_number, pat_number) as full_line
from
    (select row_number() over (partition by 1) as key,
           id as line_number, num as pat_number, value as rle_line,
           REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)  as pattern_occurrence,
           substring(pattern_occurrence, 1, 1) as rle_char,
           cast(substring(pattern_occurrence, 2) as int) as char_count,
           repeat(rle_char, char_count) as expanded_pattern
    from
     (select *, REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)
      from rle_data a
      cross join numbers n
      where num <=REGEXP_COUNT(value, '\\D\\d{1,}')
      order by id, num
     ) as rle_data
    order by id, num
    ) rle_detail
group by line_number, rle_line
order by line_number
;
_images/cat_result_9.png

Now the full_line column has your complete picture, decoded from RLE format.

Conclusion

You will probably never need to hide an ASCII picture of a cat (or a co-worker) inside your Redshift database. However, you will certainly need to iterate over subsets of data, use windowing functions to group data, and use regular expressions to manipulate strings.

As modern data analysts and data engineers, we need to expand our toolbox to include all types of string manipulations. The next time you are asked to do something out of the ordinary in Redshift, Snowflake, or any other cloud database, you will be ready.