Many people are kind of scared of using programming language. And sure, if you haven’t any experience using these ‘command line’ tools the learning curve may seem steep. And why should you learn a language like Python, if you have a fantastic tool like Alteryx to process your data?
Well – sometimes a few lines of code is easier than using drag-and-drop tools.
I had to make a visualisation in Tableau using all kind of metrics fromt the Worldbank. GDP, population, GINI, Income Share, etcetera. All these individual datasets can be found usint the Worldbank DataBank, but it sure cost a lot of time to download these sets, and combine them into one. It’s not a database, so Alteryx (nor Tableau) can connect to the data directly.
Luckily, the Worldbank offers an extensive API (https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation) which you can use to download the data. But this data is in the very ETL-unfriendly XML format, or in JSON – so you need to untangle all the data after you have fetched to data using this API
Python
Because of the ‘unfriendly’ nature of the output (like many APIs) several people have written wrappers around is so the data can be used more easily in applcations. Multiple Python modules are available – and that’s where I looked at. The “wbgapi” module is one of them.
It is really easy to get data from the Worldbank into Python using this module. You only need a few lines of code:
import (use) the module
<code>import wbgapi as wb</code>
define the worldbank indicator you need
<code>indicators = ['NY.GDP.MKTP.CD GDP</code>
get data: period is 2010 to 2019<br><code>df = wb.data.DataFrame(indicators, time=range(2010, 2019), numericTimeKeys=True, columns='series').reset_index()</code>
The data is now ‘in’ Python – an extra line exports the data to CSV:
<code>df.to_csv('worldbank.csv',index=False)</code>
You can use Python to import the data directly into a database, but this can sometimes be hard because of the needed database drivers and the exact configution. What if we could use Alteryx for that – without the CSV, and even using the Python code in Alteryx…
Python in Alteryx
Using Python in Alteryx is simple, start by dragging the Python tool on the canvas. You can select ‘interactive’ or ‘production’; just use ‘interactive’ for the moment.
A small Python environment will load, called a Notebook. This allows for easy code-entry and execution. This Python environment is slightly different from a ‘normal’ Notebook, since this is running inside Alteryx. The code had two parts: one which installs the needed modules (usually done with the ‘pip’ command), and a different part for the execution of the code. For both parts Alteryx helps using a template and sample-code.
Installing Python modules/packages in Alteryx
The ‘wbgapi’ module/library is not default, so we need to install it. You install a module using Package.installPackages command. In this case, it will look like:
Package.installPackages(['wbgapi'])
if you need to install more packages, just enter them separated by a comma:
Package.installPackages(['pandas','numpy','wbgapi','pantab'])
Fetch data
Already the next step is fetching the data from the Worldbank API. First we need to define all the indicators we need. If you only need one – GDP in US$ for example:
indicators = ['NY.GDP.MKTP.CD' ]
But I needed more indicators, so I added these comma separated, and added a comment (using the # sign) so I know which indicator is which:
[gist id=”3837669″]indicators = ['NY.GDP.MKTP.CD' # GDP (current US$)
'SP.POP.TOTL', # Population, total
'SE.PRM.ENRR', # School enrollment, primary (% gross)
'EN.ATM.CO2E.PC', # CO2 emissions (metric tons per capita)
'SI.POV.NAHC', # Poverty headcount ratio at national poverty lines (% of population)
'SP.DYN.LE00.IN', # Life expectancy at birth, total (years)
'SI.POV.NAHC', # Poverty headcount ratio at national poverty lines (% of population)
'NY.GNP.PCAP.CD', # GNI per capita, Atlas method (current US$)
'IQ.SCI.OVRL', # Statistical Capacity score (Overall average)
'SI.POV.URHC', # Urban poverty headcount ratio at national poverty lines (% of urban population)
'SI.POV.RUHC', # Rural poverty headcount ratio at national poverty lines (% of rural population)
'SI.POV.NAGP', # Poverty gap at national poverty lines (%)
'SI.POV.URGP', # Urban poverty gap at national poverty lines (%)
'SI.POV.RUGP', # Rural poverty gap at national poverty lines (%)
'SI.POV.GINI', # GINI index
'SI.DST.FRST.10', # Income share held by lowest 10%
'SI.DST.FRST.20', # Income share held by lowest 20%
'SI.DST.02ND.20', # Income share held by second 20%
'SI.DST.03RD.20', # Income share held by third 20%
'SI.DST.04TH.20', # Income share held by fourth 20%
'SI.DST.05TH.20', # Income share held by highest 20%
'SI.DST.10TH.10', # Income share held by highest 10%
'IC.REG.DURS', # Time required to start a business (days)
'IC.ELC.TIME', # Time required to get electricity (days)
'IC.TAX.GIFT.ZS', # Firms expected to give gifts in meetings with tax officials (% of firms)
'IC.FRM.FEMM.ZS' # Firms with female top manager (% of firms) ]
Fetching the data itself is done using the module-specific function call: wb – data – Dataframe with option – reset index: (a Dataframe is a very common kind of dataset within python, like a table)
df = wb.data.DataFrame(indicators, time=range(2010, 2019), numericTimeKeys=True, columns='series').reset_index()
Now the data is downloaded – but how do we process this data in Alteryx? A special ‘Alteryx’ python function. The Alteryx Python tool has 6 different outputs, with Alteryx.write(df,1)
the dataframe is written to the first output.
The final Alteryx flow – Python + configuration, and a Browse tool on the first output – looks like this:
And if you run the workflow the data shows up in the browse tool – so you can use it in Alteryx for further processing: (the ‘nulls’ are expected; not every indicator is available for each country and/or year)
Best of both worlds
If you know just a little about Python programming, or if you just copy/paste some sample code from websites you can save yourself a lot of time. Fetching the different indicators using ‘pure’ Alteryx would cost me a lot of time because I would need to extract the data from either XML or JSON, and using Python I was able to fetch 25 indicators at once.
Using only Python was a different path, but using the available tools in Alteryx (pivot, clean, insert in database) are much faster (for me at the moment…) than using Python.
And you can do a lot more with Python in Alteryx: data from other sources can be used as input, you can use Python to do all kind of AI/Data science stuff (besides the R related Predication tools which are included in Alteryx), and so many more. Read more about the Python tool at https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Tool-Mastery-Python/ta-p/197860,
and about sharing training models at https://community.alteryx.com/t5/Data-Science-Blog/Embedding-a-Model-in-a-Workflow-with-a-Python-Tool/ba-p/436204
Let me know if you have done something cool using Python and Alteryx, or are planning to!