TWC_STAQ_AWS

Homework:

  1. AWS UI login user/pass: waiting for the systems team to create.
  2. AWS API keys to run COPY commands loading Redshift from S3 and UNLOAD commands writing Redshift data back to S3. (Waiting for Systems)
  3. Tableau PostgreSQL driver for connecting to Redshift. (See link below)
  4. Redshift database user/pass, also used for the Tableau connection. (Temporary ones created below)
  5. SQL Workbench/J UI install
  6. PostgreSQL Docs: http://www.postgresql.org/docs/9.4/static/queries-with.html
  7. Link to JDBC Driver for Redshift: http://downloads.tableausoftware.com/drivers/tableau/8.2/TableauDrivers.dmg
  8. Link to SQLWorkbench: http://www.sql-workbench.net/Workbench-Build117.zip
  9. Link to JDBC Driver jar file for SQLWorkbench: https://jdbc.postgresql.org/download/postgresql-9.4-1200.jdbc41.jar
  10. Java SDK: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
In [150]:
#URL In Caption Below:
#jdbc:postgresql://wfx-nonprod-analytics.citte9qqv5jw.us-east-1.redshift.amazonaws.com:5439/wfxanalytics
Image(url='http://s3.amazonaws.com/wfxnp-dev-robby/screen_redshift.png')
Out[150]:

Temporary user/pass for training

In []:
staq_user='twc_staq'
staq_pass='Lg3@*JKU'
url='wfx-nonprod-analytics.citte9qqv5jw.us-east-1.redshift.amazonaws.com'
database='wfxanalytics'
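The temporary credentials above can be smoke-tested straight from Python before any Tableau or SQL Workbench setup. This is a sketch only: no such cell appears in the notebook, and psycopg2 is an assumed driver (Redshift speaks the PostgreSQL wire protocol on port 5439).

```python
def redshift_dsn(host, port, user, password, dbname):
    """Build a libpq-style DSN usable by psycopg2 (or psql)."""
    return "host=%s port=%s user=%s password=%s dbname=%s" % (
        host, port, user, password, dbname)

def smoke_test(host, port, user, password, dbname):
    """Run `select 1` against the cluster; True on success.
    psycopg2 is an assumed dependency, not shown in this notebook."""
    import psycopg2
    conn = psycopg2.connect(redshift_dsn(host, port, user, password, dbname))
    try:
        cur = conn.cursor()
        cur.execute("select 1")
        return cur.fetchone()[0] == 1
    finally:
        conn.close()
```

Usage with the variables above would be `smoke_test(url, 5439, staq_user, staq_pass, database)`.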

A. Data Prep and Formatting

  • Retrieve file from s3
  • Data Prep and formatting
  • Upload final data to s3 after splitting and compressing
In [3]:
getfroms3(prefix='staq/staq_sample_output.xlsx')
Downloaded 1:	staq_sample_output.xlsx: 7.91637611389 MB
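The getfroms3 helper is defined outside this notebook. A minimal reconstruction with boto3 could look like the sketch below; boto3, the bucket name, and the exact print format are all assumptions (the original may use boto 2 or the AWS CLI).

```python
import os

MB = 1024.0 * 1024.0  # bytes per megabyte, matching the helper's size printout

def getfroms3(prefix, bucket='redshift_etl_files'):
    """Download every object under `prefix` from S3 to the working directory,
    printing name and size the way the output above suggests.
    boto3 and the bucket name are assumptions."""
    import boto3
    s3 = boto3.client('s3')
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for i, obj in enumerate(resp.get('Contents', []), 1):
        fname = os.path.basename(obj['Key'])
        s3.download_file(bucket, obj['Key'], fname)
        print('Downloaded %d:\t%s: %s MB' % (i, fname, obj['Size'] / MB))
```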

In [8]:
#Normalize column names and cell values
staq_sample = pd.read_excel('staq_sample_output.xlsx')
# lowercase, drop spaces, and strip the trailing '($)' unit marker from column names
staq_sample.columns = [col.lower().replace(' ','').rstrip('($)') for col in staq_sample.columns]
# '-' is the spreadsheet's null marker
staq_sample = staq_sample.replace('-', np.nan)
# normalize both date columns to ISO yyyy-mm-dd strings for the Redshift load
staq_sample['date'] = staq_sample['date'].apply(lambda x: x.strftime('%Y-%m-%d'))
staq_sample['datevalue'] = staq_sample['datevalue'].apply(lambda x: x.strftime('%Y-%m-%d'))
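The column normalization in the list comprehension above can be pulled into a named helper to make the intent explicit. One caveat worth documenting: rstrip('($)') removes any trailing run of the characters (, $, and ), not the literal suffix '($)'; that happens to be harmless for these headers.

```python
def normalize_col(col):
    """Lowercase, drop spaces, then strip trailing '(', '$', ')' characters.
    Mirrors the list comprehension used on staq_sample.columns above."""
    return col.lower().replace(' ', '').rstrip('($)')
```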
In [120]:
staq_sample.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 65533 entries, 0 to 65532
Data columns (total 24 columns):
lineitem            65533 non-null object
order               65533 non-null object
date                65533 non-null object
lineitemid          65533 non-null int64
orderid             65533 non-null int64
orderponumber       64479 non-null object
rate                65533 non-null float64
totalimpressions    65533 non-null int64
totalclicks         65533 non-null int64
lineitemdate        65533 non-null object
typeofinv           65533 non-null object
directornetwork     65533 non-null object
typeofrevenue       65533 non-null object
metarefresh         65533 non-null bool
revenue             65533 non-null float64
placement           65533 non-null object
lineitemtype        65533 non-null object
position            64519 non-null object
size                65533 non-null object
source              12869 non-null object
site                65473 non-null object
property            65473 non-null object
billedimp           65533 non-null float64
datevalue           65533 non-null object
dtypes: bool(1), float64(3), int64(4), object(16)
memory usage: 12.1+ MB

In [28]:
#Max string length per column, to size the varchars in the table DDL
staq_sample.applymap(lambda x: len(str(x))).max()
Out[28]:
lineitem            236
order               115
date                 19
lineitemid            9
orderid               9
orderponumber        20
rate                  5
totalimpressions      8
totalclicks           6
lineitemdate         18
typeofinv            10
directornetwork       9
typeofrevenue         8
metarefresh           5
revenue              16
placement            14
lineitemtype         14
position             21
size                 11
source               18
site                 19
property              7
billedimp            11
datevalue            19
dtype: int64
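These maxima informed the varchar widths in the DDL that follows, where the widths were chosen by hand. Purely as a sketch, one mechanical policy would be to round each maximum up to the next power of two between a floor and a cap:

```python
def varchar_width(max_len, floor=32, cap=512):
    """Round a max observed string length up to a power-of-two varchar width,
    bounded by `floor` and `cap`. A sketch, not the notebook's actual rule."""
    width = floor
    while width < max_len and width < cap:
        width *= 2
    return min(width, cap)
```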

B. Database Table DDL and Loading

In [75]:
%%file /ebs/staq/staq_sample_ddl.sql
drop table if exists staq.staq_sample;
create table staq.staq_sample(
lineitem varchar(512),
order_name varchar(512),
date date,
lineitemid varchar(64),
orderid varchar(64),
orderponumber varchar(64),
rate double precision,
total_impressions bigint,
total_clicks bigint,
lineitem_date varchar(64),
type_of_inv varchar(64),
director_network varchar(64),
type_of_revenue varchar(32),
metarefresh boolean,
revenue double precision,
placement varchar(64),
lineitem_type varchar(64),
position varchar(64),
cr_size varchar(64),
source varchar(32),
site varchar(64),
property varchar(64),
billed_imps double precision,
datevalue date
)
diststyle key distkey(lineitemid)
sortkey(date, lineitemid, orderid);
Overwriting /ebs/staq/staq_sample_ddl.sql

In [76]:
%%bash -s "$devhost_rs" "$devport_rs" "$staq_user" "$staq_pass" "$devdb_rs"
PGPASSWORD=$4 psql -h $1 -p $2 -U $3 $5 -f '/ebs/staq/staq_sample_ddl.sql'
DROP TABLE
CREATE TABLE

In [95]:
%%bash -s "$devhost_rs" "$devport_rs" "$staq_user" "$staq_pass" "$devdb_rs"
PGPASSWORD=$4 psql -h $1 -p $2 -U $3 $5 -c """set search_path to 'staq';\
select * from pg_table_def where schemaname='staq'"""
 schemaname |  tablename  |      column       |          type          | encoding | distkey | sortkey | notnull 
------------+-------------+-------------------+------------------------+----------+---------+---------+---------
 staq       | staq_sample | lineitem          | character varying(512) | none     | f       |       0 | f
 staq       | staq_sample | order_name        | character varying(512) | none     | f       |       0 | f
 staq       | staq_sample | date              | date                   | none     | f       |       1 | f
 staq       | staq_sample | lineitemid        | character varying(64)  | none     | t       |       2 | f
 staq       | staq_sample | orderid           | character varying(64)  | none     | f       |       3 | f
 staq       | staq_sample | orderponumber     | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | rate              | double precision       | none     | f       |       0 | f
 staq       | staq_sample | total_impressions | bigint                 | none     | f       |       0 | f
 staq       | staq_sample | total_clicks      | bigint                 | none     | f       |       0 | f
 staq       | staq_sample | lineitem_date     | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | type_of_inv       | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | director_network  | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | type_of_revenue   | character varying(32)  | none     | f       |       0 | f
 staq       | staq_sample | metarefresh       | boolean                | none     | f       |       0 | f
 staq       | staq_sample | revenue           | double precision       | none     | f       |       0 | f
 staq       | staq_sample | placement         | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | lineitem_type     | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | position          | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | cr_size           | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | source            | character varying(32)  | none     | f       |       0 | f
 staq       | staq_sample | site              | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | property          | character varying(64)  | none     | f       |       0 | f
 staq       | staq_sample | billed_imps       | double precision       | none     | f       |       0 | f
 staq       | staq_sample | datevalue         | date                   | none     | f       |       0 | f
(24 rows)


In [122]:
staq_sample.to_csv('staq_sample.csv', header=False, index=False)
In [123]:
uploadtos3(filepattern='staq_sample.csv', s3path='staq', split=True, compression=True)
Finished Splitting ./staq_sample.csv
Completed Gzipping ./staq_sample.csvai
Completed Gzipping ./staq_sample.csvab
Completed Gzipping ./staq_sample.csvac
Completed Gzipping ./staq_sample.csvah
Completed Gzipping ./staq_sample.csvad
Completed Gzipping ./staq_sample.csvaj
Completed Gzipping ./staq_sample.csvaf
Completed Gzipping ./staq_sample.csvae
Completed Gzipping ./staq_sample.csvag
Completed Gzipping ./staq_sample.csvaa
Successfully uploaded staq/staq_sample.csvai.gz
Successfully uploaded staq/staq_sample.csvaa.gz
Successfully uploaded staq/staq_sample.csvaf.gz
Successfully uploaded staq/staq_sample.csvag.gz
Successfully uploaded staq/staq_sample.csvad.gz
Successfully uploaded staq/staq_sample.csvac.gz
Successfully uploaded staq/staq_sample.csvab.gz
Successfully uploaded staq/staq_sample.csvah.gz
Successfully uploaded staq/staq_sample.csvaj.gz
Successfully uploaded staq/staq_sample.csvae.gz
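uploadtos3 is another helper defined outside this notebook. The split-and-compress stage it reports can be sketched in pure Python; the suffixes mimic split(1)'s default aa, ab, … naming, and the real helper may well shell out to split and gzip instead.

```python
import gzip
import os
import string

def split_and_gzip(path, n_parts=10):
    """Split `path` into up to `n_parts` byte ranges with split(1)-style
    suffixes (...aa, ...ab, ...) and gzip each part; return the .gz names.
    A sketch of what uploadtos3 appears to do before the S3 upload."""
    chunk = os.path.getsize(path) // n_parts + 1
    parts = []
    with open(path, 'rb') as src:
        for i in range(n_parts):
            data = src.read(chunk)
            if not data:
                break
            part = '%sa%s.gz' % (path, string.ascii_lowercase[i])
            with gzip.open(part, 'wb') as dst:
                dst.write(data)
            parts.append(part)
    return parts
```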

COPY command for loading data from S3 into Redshift

In [2]:
%%file /ebs/staq/staq_sample_copy.sql
copy staq.staq_sample from 's3://redshift_etl_files/staq/staq_sample.csv'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
gzip
csv;
Overwriting /ebs/staq/staq_sample_copy.sql

In [126]:
%%bash -s "$devhost_rs" "$devport_rs" "$staq_user" "$staq_pass" "$devdb_rs"
PGPASSWORD=$4 psql -h $1 -p $2 -U $3 $5 -f '/ebs/staq/staq_sample_copy.sql'
COPY

psql:/ebs/staq/staq_sample_copy.sql:4: INFO:  Load into table 'staq_sample' completed, 65533 record(s) loaded successfully.
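Homework item 2 also calls for UNLOAD in the other direction. A minimal counterpart to the COPY above might look like the following; the credentials are the same placeholders, and the object prefix is hypothetical.

```sql
unload ('select * from staq.staq_sample')
to 's3://redshift_etl_files/staq/staq_sample_unload_'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
gzip
addquotes;
```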

Check Final Table in Redshift

In [140]:
querydev("""select * from staq.staq_sample limit 3""")
Out[140]:
                                                                                lineitem  \
0  100177742-1_INTL_RON_Leaderboard_PLT_(TWC iPad)_NETWORK_LANG (SP)_GEO-IP Country (US)   
1                 100177868-1_RON_Paid Search_PLT (TWC)_NETWORK_GEO-IP Country (US)_RISK   
2                 100178395-1_RON_Small Rect_PLT(TWC Android Phone)_GEO-Cntxl DMA(multi)   

                                                 order_name        date lineitemid    orderid orderponumber  rate  \
0  106614_DoubleClick Advertising Exchange WEATHER.COM 2014  2015-01-01  100002857  149807897        106614    15   
1  106614_DoubleClick Advertising Exchange WEATHER.COM 2014  2015-01-01  100269137  149807897        106614     0   
2                                     108100_JACKSONHOLE013  2015-01-01  101540417  202821857        108100    11   

   total_impressions  total_clicks      lineitem_date type_of_inv director_network type_of_revenue metarefresh  \
0              53926           236  1000028571/1/2015     PREEMPT          Network        RevShare       False   
1                 46             0  1002691371/1/2015     PREEMPT          Network        RevShare       False   
2               6951            11  1015404171/1/2015  GUARANTEED           Direct             CPM       False   

     revenue placement lineitem_type      position    cr_size source            site property  billed_imps   datevalue  
0  31.084387      Ipad   Ad Exchange  pos=ipad_728   728 x 90    Adx  Weather Mobile   Mobile        60837  2015-01-01  
1   0.036949       S&D   Ad Exchange    pos=wx_pds  300 x 250    Adx     Weather.com  Display           57  2015-01-01  
2  76.461000   Android      Standard    pos=top300   320 x 50         Weather Mobile   Mobile            0  2015-01-01  
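querydev is yet another helper defined outside this notebook. A plausible reconstruction with pandas and SQLAlchemy follows; both libraries are assumptions, as is the percent-escaping of the password, which matters here because the training password contains '@'.

```python
from urllib.parse import quote_plus  # Python 3; the original notebook is Python 2

def redshift_sqlalchemy_url(host, port, user, password, dbname):
    """postgresql:// URL with the password percent-escaped so characters
    like '@' do not break the URL."""
    return 'postgresql://%s:%s@%s:%s/%s' % (
        user, quote_plus(password), host, port, dbname)

def querydev(sql, **conn_kwargs):
    """Run `sql` against the dev cluster and return a DataFrame.
    pandas/SQLAlchemy are assumed; the real helper is not shown."""
    import pandas as pd
    from sqlalchemy import create_engine
    engine = create_engine(redshift_sqlalchemy_url(**conn_kwargs))
    return pd.read_sql(sql, engine)
```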
In [130]:
from IPython.display import HTML, Javascript, display
In [138]:
js = """
<script type='text/javascript' src='https://online.tableausoftware.com/javascripts/api/viz_v1.js'>\
</script><div class='tableauPlaceholder' style='width: 1024px; height: 768px;'>\
<object class='tableauViz' width='1024' height='768' style='display:none;'>\
<param name='host_url' value='https%3A%2F%2Fonline.tableausoftware.com%2F' /> \
<param name='site_root' value='&#47;t&#47;theweatherchannelllcweatherfx' />\
<param name='name' value='staq_demo&#47;Sheet1' /><param name='tabs' value='no' />\
<param name='toolbar' value='yes' /></object></div>"""

Sample output (URL) displayed as a hosted HTML/JS file from S3 after publishing a Tableau extract using the Redshift connection

  • Interactive
In [139]:
HTML(js)
Out[139]: