Zero to PySpark on AWS EMR - quickly, sensibly.
Run PySpark code in the 'cloud' with the Amazon Web Services (AWS) Elastic MapReduce (EMR) service in a few simple steps, using this cookiecutter project template!
pip install -U "cookiecutter>=1.7"
cookiecutter --no-input https://github.com/daniel-cortez-stevenson/cookiecutter-pyspark-cloud.git
cd pyspark-cloud
make install
pyspark_cloud
Your console will look something like:
AWS :cloud: CloudFormation Template for EMR: Simple Spark cluster deployment with infrastructure as code
A Command-Line Interface for Running PySpark 'Jobs': For production :rocket: runs via EMR Step API
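As a rough illustration of the command-line idea above, here is a minimal sketch of a CLI that dispatches to named jobs. It is not the template's actual implementation: the `word_count` job, the `--input` option, and the example S3 path are hypothetical names chosen for illustration, and the job body only returns a description instead of starting Spark.

```python
import argparse


def word_count(args):
    """Hypothetical job: a real one would build a SparkSession and process args.input."""
    return f"word_count would read from {args.input}"


# Hypothetical registry mapping job names to callables.
JOBS = {"word_count": word_count}


def build_parser():
    """Build a parser that takes a job name plus the options passed on to that job."""
    parser = argparse.ArgumentParser(prog="pyspark_cloud")
    parser.add_argument("job", choices=sorted(JOBS), help="name of the job to run")
    parser.add_argument("--input", required=True, help="input location for the job")
    return parser


def main(argv=None):
    """Parse the arguments and run the selected job, returning its result."""
    args = build_parser().parse_args(argv)
    return JOBS[args.job](args)


# Example invocation with an illustrative (made-up) S3 path.
message = main(["word_count", "--input", "s3://example-bucket/input"])
print(message)
```

Keeping every job behind one entry point means the same argument list can, in principle, be reused whether the job is started locally or submitted as an EMR step.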
Log Like a Pro: Save time debugging in style :dancer:
Wrap Scala with Python :snake:: Use libraries that haven't been included in the PySpark API, such as SnowballStemmer!
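One common way to reach Scala or Java code from PySpark is through the Py4J gateway that a running session exposes. The sketch below assumes that approach, which may differ from how this template wraps its Scala code. The helper name `jvm_class` is hypothetical, and `_jvm` is a private PySpark attribute, so treat this as an illustration, not a stable API.

```python
from functools import reduce


def jvm_class(spark, fully_qualified_name):
    """Resolve a JVM class handle by walking the package path on the Py4J view.

    `spark` is assumed to be an active SparkSession. Attribute access on
    `spark.sparkContext._jvm` walks JVM packages one segment at a time.
    """
    jvm = spark.sparkContext._jvm  # private attribute: an assumption of this sketch
    return reduce(getattr, fully_qualified_name.split("."), jvm)
```

With a live session, a call such as `jvm_class(spark, "java.lang.System")` should return a handle whose static methods can be called from Python, provided the class is on the cluster's classpath.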
Simplify Workflows with Make :white_check_mark:: A Makefile with commands for installation, development, and deployment.
make [COMMAND]
make s3dist
Organize Your Code: Package code shared between 'jobs' in a Python module of your package called common.
Extend the PySpark API: An example of extending the PySpark SQL DataFrame class, which allows chaining custom transformations with dot (.) notation.
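The chaining pattern described above can be sketched without Spark. The example below attaches a `transform` method to a small stand-in class. `Frame`, `with_doubled`, and `keep_greater_than` are hypothetical names, and the template's own extension may look different. With PySpark installed, the same assignment would target `pyspark.sql.DataFrame`.

```python
class Frame:
    """Stand-in for pyspark.sql.DataFrame so the sketch runs without Spark."""

    def __init__(self, rows):
        self.rows = list(rows)


def transform(self, func, *args, **kwargs):
    """Call func(self, *args, **kwargs) and return its result so calls can be chained."""
    return func(self, *args, **kwargs)


# Attach the method to the class; every instance gains `.transform(...)`.
Frame.transform = transform


def with_doubled(frame):
    """Custom transformation: double every value."""
    return Frame(value * 2 for value in frame.rows)


def keep_greater_than(frame, threshold):
    """Custom transformation: keep values strictly above the threshold."""
    return Frame(value for value in frame.rows if value > threshold)


# Custom transformations chained with dot notation.
result = Frame([1, 2, 3]).transform(with_doubled).transform(keep_greater_than, 2)
print(result.rows)
```

Because each transformation takes a frame and returns a frame, the steps read top to bottom in the order they run. PySpark 3.0 and later also ship a built-in `DataFrame.transform`, so the patching step may be unnecessary there.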
Development Framework: All the tools you need
As defined in the CloudFormation template
git clone https://github.com/daniel-cortez-stevenson/cookiecutter-pyspark-cloud.git
cd cookiecutter-pyspark-cloud
conda create -n cookiecutter -y "python=3.7"
conda activate cookiecutter
pip install -r requirements.txt
Make any changes to the template, as you wish.
Create your project from the template:
cd ..
cookiecutter ./cookiecutter-pyspark-cloud
cd *your-repo_name*
git init
git add .
git commit -m "Initial Commit"
conda deactivate
conda create -n *your-repo_name* -y "python=3.6"
conda activate *your-repo_name*
make install-dev
Contributions are welcome! Thanks!
Submit a Bug or Feature Request
Most of the ideas expressed in this repo are not new, but rather expressed in a new way. Thanks, folks! :raised_hands:
DataFrame extension snippet