//
home

Latest Post

Python vs SSIS

I am a long time MS BI stack evangelist, however recently I have been incorporating Python into my ETL arsenal. In several cases I have opted for python over SSIS. Python in combination with Dask and Panda can accomplish many if not all ETL tasks… And here’s why:

Cost

Python is a free and completely open source language maintained by the Python Software Foundation. Most of its packages are available at no cost and you can contribute your own functionality and\or enhance existing functionality if you encounter a bug or need a new feature. SSIS obviously requires a license for every environment you are running on and you may wait quite awhile for bug fixes and\or new features.

Testability

Python has several unit testing frameworks such as PYtest and Unittest. You can design your ETL process in such a way as it will be easily testable to ensure that your process is behaving as expected. SSIS on the other hand does not really offer any built in testing frameworks other then those available from 3rd party vendors.

Scalability

Dask is specifically designed for large datasets and is easily scalable across n nodes. Whereas a single SSIS data flow\package is restricted by the memory available on the current node. With Python you can easily scale out your ETL process across many nodes. Of course you can decide to utilize the SSIS “scale out” feature which does allow for a master\workers configuration. This can get quite complex and costly to set up as you now need licenses for all worker nodes. Set up can be further complicated if you are running your etl in a cloud configuration such as AWS or Azure.