Practical projects for building data and software skills.
2023-06-23
Introduction
I recently gave a talk on career options in data.
I mentioned in the Q&A period that relevant and realistic side projects were a great way to build and display your technical skills, especially for people without prior experience in data or software roles.
Someone asked for project ideas for each of the roles I presented about, but there was only so much I could say with a couple of minutes left in the presentation.
So, I did some brainstorming and came up with what I feel are practical, achievable, and interesting projects.
I designed one project per "area" of work, because in some cases, it's hard to isolate the work for a single role.
For example, it's hard for an ML platform engineer to implement model versioning without any models to version.
Each project starts with a "minimum viable product" (MVP), which is the minimum amount of work you need to do to have a complete project.
Then, there are optional extension tasks which enable you to dive more deeply into any role and increase the complexity of the project.
I've also included some of my own experience with these types of projects so you can trust that I'm not just making stuff up.
I want to show you that working on these projects has helped me, that they can help you too, and that they're not as hard as they may seem.
I've also included some books for further learning, if that's more your style.
If you want to work in a team, I suggest working on all parts of the MVP together so everyone gains some experience with all roles.
Then, you can split up the extension tasks and work on them individually or in pairs, choosing the ones that interest you most.
Having some knowledge of the other roles' work will make you more well-rounded anyway, which is helpful for job searching and working.
The extension tasks are all optional.
You can do as many or as few as you like, for whichever roles you like, and in almost any order.
You may find that you prefer a different role's work than you expected.
Most importantly, follow your interests and have fun!
Tools
Although you can use any tools you'd like, I'm going to suggest some that I think may make these projects easier.
I suggest using Python as your programming language, especially if you're not an experienced developer.
Python is a popular language for data and ML work, and the ecosystem makes it easy to get started.
If you need to build a service in Python, I suggest FastAPI or Flask, because they're simple but powerful.
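For instance, a FastAPI service can be as small as this (the module and route names are just examples):

```python
# main.py: a minimal FastAPI service; run with `uvicorn main:app --reload`
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    # Simple liveness check; returns 200 with a small JSON body
    return {"status": "ok"}
```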
If you need to containerise something, use Docker for the build instructions and Docker Compose for service deployment.
Although you can use Kubernetes, it's more complex and not necessary for a small project.
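As a rough sketch, the build instructions and service definition might look like this, assuming a FastAPI app like the one above (file contents are illustrative):

```dockerfile
# Dockerfile: build instructions for a small Python service
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

```yaml
# docker-compose.yml: runs the image built from the Dockerfile above
services:
  api:
    build: .
    ports:
      - "8000:8000"
```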
If you don't have a favourite code editor, I suggest using VS Code.
It's free and has a great extension ecosystem.
I would stay away from general-purpose text editors like Notepad or TextEdit.
You should store all your code, documentation, notebooks, etc., in a Git repo.
I suggest hosting your project on GitHub.
If you don't have an account, you can sign up for free.
They have great documentation and learning resources.
You should develop "features" (or work on tasks) by committing progress to a branch dedicated to that feature, then update the main branch by opening and merging a pull request for that branch.
This is called "feature branching" and is a common practice in software development.
It's a good habit to get into, and it will make it easier to collaborate with others.
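In practice, the flow looks roughly like this (the branch and file names are just examples):

```bash
git checkout -b feature/etl-pipeline   # create and switch to a feature branch
git add etl.py
git commit -m "Add ETL pipeline"
git push -u origin feature/etl-pipeline
# then open a pull request on GitHub and merge it into main
```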
In general, pick tools that reduce toil so you can focus on learning and building.
Insights project
In this project, you will be preparing a transactional dataset for analysis, then uncovering and communicating insights within it.
Real data is ideal for learning.
Try working with a small business to use some of their data in exchange for analytics services.
Otherwise you can find suitable datasets online, like a subset of the AdventureWorks data files.
I suggest interacting with the data in a Jupyter Notebook so you can iterate faster.
My experience
Near the end of my undergraduate degree, I discovered a passion for dealing with large amounts of data.
Unfortunately, this was not something I had the opportunity to explore in my degree.
I invented a project for myself that was much like the one I'm describing to you now, although I didn't know what it would entail when I started.
I signed a contract with a local cafe, gaining access to their point-of-sale data in exchange for building a predictive model of their product sales.
Although I went into the project with the intention of learning data science alone, I ended up learning a lot about analytics engineering and data analysis too.
The skills I gained made it easy to find jobs in data science and engineering once I started looking.
MVP
Write code to extract the data, transform and clean it, and load it into storage (ETL); there's a sketch after this list.
Your storage can be a Parquet or CSV file.
Visualise trends and distributions in the data.
Feel free to use a no-code tool like Google Sheets.
Predict the future or unknowns using the data.
Use a library like scikit-learn for this.
Write a document explaining your work.
Include diagrams for the insights and an entity-relationship diagram (ERD) for the dataset.
Bonus: Make the entire process automatically refresh on a schedule, even if your data is static.
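Here's a minimal sketch of the ETL and prediction steps, assuming a hypothetical transactions.csv with date, product, and quantity columns:

```python
# A minimal ETL + prediction sketch. The file and column names are
# hypothetical; adapt them to your dataset. Writing Parquet requires
# a Parquet engine such as pyarrow.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Extract
raw = pd.read_csv("transactions.csv")

# Transform: parse dates, drop incomplete rows, aggregate daily sales
raw["date"] = pd.to_datetime(raw["date"])
clean = raw.dropna(subset=["product", "quantity"])
daily = clean.groupby("date", as_index=False)["quantity"].sum()

# Load: store the cleaned data as a Parquet file
daily.to_parquet("daily_sales.parquet", index=False)

# Predict: fit a simple trend model of sales against day number
daily["day_number"] = (daily["date"] - daily["date"].min()).dt.days
model = LinearRegression()
model.fit(daily[["day_number"]].to_numpy(), daily["quantity"].to_numpy())
next_day = daily["day_number"].max() + 1
print("Predicted next-day sales:", model.predict([[next_day]])[0])
```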
Data Analyst
Identify useful questions to ask about the data, and communicate the value and impact of your insights.
Perform your queries and transformations in SQL or Pandas instead of no-code tools.
Use an interactive tool like Looker Studio or Streamlit to display your visualisations (see the Streamlit sketch after this list).
Allow users to select parameters and specify filters, like date ranges.
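For the Streamlit route, a dashboard with a date-range filter might look something like this, reusing the hypothetical Parquet file from the MVP sketch:

```python
# dashboard.py: a minimal Streamlit sketch; run with
# `streamlit run dashboard.py`. Assumes the daily_sales.parquet file
# from the MVP sketch (file and column names are hypothetical).
import pandas as pd
import streamlit as st

daily = pd.read_parquet("daily_sales.parquet")

st.title("Daily sales")

# Let users pick a date range to filter the chart
start, end = st.date_input(
    "Date range", value=(daily["date"].min(), daily["date"].max())
)
mask = (daily["date"] >= pd.Timestamp(start)) & (daily["date"] <= pd.Timestamp(end))
st.line_chart(daily.loc[mask], x="date", y="quantity")
```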
Data Scientist
Create additional predictive models, covering different model types.
Perform statistical analyses and report your findings.
Quantify uncertainty (e.g. confidence or prediction intervals) and significance; there's a bootstrap sketch after this list.
Perform causal inference to determine the causes of observed phenomena.
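As one example of quantifying uncertainty, a bootstrap confidence interval takes only a few lines (the data file is the hypothetical one from the MVP sketch):

```python
# A minimal bootstrap sketch: estimate a 95% confidence interval for
# mean daily sales by resampling with replacement.
import numpy as np
import pandas as pd

daily = pd.read_parquet("daily_sales.parquet")
quantity = daily["quantity"].to_numpy()

rng = np.random.default_rng(42)
boot_means = [
    rng.choice(quantity, size=len(quantity), replace=True).mean()
    for _ in range(10_000)
]

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean: {quantity.mean():.1f}, 95% CI: ({lower:.1f}, {upper:.1f})")
```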
Analytics Engineer
Build the ETL pipeline with a workflow engine like Prefect (see the sketch after this list).
Use a data warehouse for storage.
DuckDB is great for a simple project because it's embedded.
Use dbt to perform data validation, and to create wide tables, dimensional models, or views.
Create a data catalogue with documentation on tables, columns, and their lineages.
Ingest additional data sources and relate them to your existing data.
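To give a feel for the pipeline piece, here's a minimal Prefect flow that loads the hypothetical transactions CSV into DuckDB (assumes Prefect 2.x; the file and table names are illustrative):

```python
import duckdb
import pandas as pd
from prefect import flow, task

@task
def extract() -> pd.DataFrame:
    return pd.read_csv("transactions.csv")

@task
def transform(raw: pd.DataFrame) -> pd.DataFrame:
    raw["date"] = pd.to_datetime(raw["date"])
    return raw.dropna(subset=["product", "quantity"])

@task
def load(clean: pd.DataFrame) -> None:
    # DuckDB can query the local `clean` DataFrame directly by name
    con = duckdb.connect("warehouse.duckdb")
    con.execute("CREATE OR REPLACE TABLE transactions AS SELECT * FROM clean")
    con.close()

@flow
def etl():
    load(transform(extract()))

if __name__ == "__main__":
    etl()
```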
Models project
In this project, you will be creating a deep learning model, then automating, optimising, and evaluating it.
Find an open-source dataset online.
I suggest searching Hugging Face or Kaggle for a dataset intended for the type of model you will create.
If you need compute resources for training, I suggest using Google Colab or Kaggle Notebooks.
Both are free with limitations.
You can execute accelerator-based programs in notebooks.
There are also Python wrappers for the CUDA and OpenCL APIs (e.g. PyCUDA and PyOpenCL).
My experience
I haven't worked on a modelling team before, nor have I completed a comprehensive project like this myself.
However, I've worked at companies in the machine learning space and have seen coworkers do similar work.
Many of these tasks may be difficult to approach as a beginner, but any progress you make is valuable.
The few times I've dabbled in training and managing my own models have helped me contextualise the work of my coworkers.
MVP
Train a simple deep learning model on the data (see the training loop sketch after this list).
Implement automatic model performance evaluations.
Compare model performance when adjusting parameters and varying approaches.
Implement model compression.
Write a thorough document demonstrating and explaining your work, including images where appropriate.
Bonus: Containerise your work.
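To show the overall shape of the MVP's training step, here's a minimal PyTorch training loop on synthetic data; swap in your real dataset and model:

```python
import torch
from torch import nn

# Synthetic data: 1,000 samples, 20 features, binary labels
X = torch.randn(1000, 20)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)

model = nn.Sequential(
    nn.Linear(20, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Basic full-batch training loop
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Evaluate accuracy (use a held-out split in a real project)
with torch.no_grad():
    accuracy = ((model(X) > 0).float() == y).float().mean()
print(f"Final loss: {loss.item():.3f}, accuracy: {accuracy.item():.2%}")
```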
ML Scientist
Design the model architecture yourself with research and experimentation.
Test and evaluate hypotheses regarding model design and configuration.
Implement model explainability methods to contextualise model predictions.
Identify an open-source model that could be used as a basis for this task, and determine the appropriate transfer learning technique to apply to it.
ML Engineer
Productionise the model and model training code.
Deploy the model and make it accessible through a service.
Implement automated hyperparameter tuning (see the Optuna sketch after this list).
Create an ensemble model by combining multiple models to improve performance.
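For the hyperparameter tuning task, a minimal Optuna sketch might look like this, reusing the synthetic data idea from the MVP sketch (all names are illustrative):

```python
import optuna
import torch
from torch import nn

# Synthetic data with a train/validation split
X = torch.randn(1000, 20)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

def objective(trial: optuna.Trial) -> float:
    # Search over the learning rate and hidden layer size
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    hidden = trial.suggest_int("hidden", 8, 64)

    model = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(50):
        optimizer.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()  # validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print("Best parameters:", study.best_params)
```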
ML Platform Engineer
Automate the model deployment process.
Implement automatic data and model versioning.
Set up monitoring for data drift and performance degradation.
Create and utilise a feature store, if it makes sense for your use case.
ML Accelerators Engineer
Calculate your model's training and/or serving performance costs, and show your work.
Implement multiple different compression strategies and compare them (see the quantisation sketch after this list).
Attempt to optimise part of the training process using accelerator-based programming.
It's okay if you don't actually make it faster; the goal is to learn.
Build a model server using TensorFlow Serving, Triton Inference Server, or similar.
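As one compression strategy to start from, here's a minimal dynamic quantisation sketch in PyTorch; compare the resulting size (and accuracy on your data) against the original model:

```python
import os
import torch
from torch import nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

# Dynamic quantisation: convert Linear weights from float32 to int8
quantised = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_on_disk(m: nn.Module, path: str) -> int:
    # Save the state dict and measure the file size in bytes
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path)
    os.remove(path)
    return size

print("Original: ", size_on_disk(model, "fp32.pt"), "bytes")
print("Quantised:", size_on_disk(quantised, "int8.pt"), "bytes")
```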
Systems project
In this project, you will be building an application that uses a machine learning model.
Your application can do anything you want.
For inspiration, check out past submissions to AI hackathons on DevPost.
If it works for your use case, I suggest using an embedded database such as SQLite to simplify setup.
I suggest using Postman to simplify sending requests to your application.
My experience
I consider myself a backend engineer with a focus on data systems and an interest in infrastructure.
Therefore, I often build pieces of larger data and ML systems at work, but not entire systems end-to-end.
I recently started working on a project in my spare time which follows this same structure.
Despite having a few years of experience under my belt, the project has been an excellent up-skilling opportunity for me.
It's helping me become a much stronger developer, which has a direct impact on my work.
MVP
Build a basic backend (server) for the application (see the sketch after this list).
Incorporate a request to an externally hosted model.
Design and create a self-hosted operational database for data storage.
Connect it to the backend, implementing a data access layer.
Containerise the backend with Docker and deploy it as a service with Docker Compose.
Switch to a non-embedded database (if applicable), deploy it as a service, and update the backend to use it.
Build a basic command-line or frontend (web) interface for presentation.
Bonus: Host and use your own open-source model.
It should be deployed as its own service.
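Here's a minimal sketch of the first few backend pieces: a FastAPI service, a call to an externally hosted model, and a SQLite data access layer. The model URL and payload shape are placeholders; adapt them to the API you actually use:

```python
import sqlite3

import httpx
from fastapi import FastAPI

app = FastAPI()
MODEL_URL = "https://example.com/v1/predict"  # hypothetical endpoint

def get_connection() -> sqlite3.Connection:
    # Simple data access layer backed by an embedded SQLite file
    con = sqlite3.connect("app.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS predictions (input TEXT, output TEXT)"
    )
    return con

@app.post("/predict")
async def predict(text: str) -> dict:
    # Call the externally hosted model
    async with httpx.AsyncClient() as client:
        response = await client.post(MODEL_URL, json={"input": text})
    output = response.json()

    # Persist the request and result
    con = get_connection()
    with con:
        con.execute(
            "INSERT INTO predictions VALUES (?, ?)", (text, str(output))
        )
    con.close()
    return {"input": text, "output": output}
```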
Data Platform Engineer
Set up a schema migration system.
Consider incorporating non-relational databases (e.g. document, vector, graph) if they make sense for your use case.
Incorporate a messaging system or event-driven pipeline.
Implement an in-memory cache to reduce computation or networking for frequently requested data (see the sketch after this list).
Learn about data replication strategies.
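For the caching task, a tiny in-memory cache with time-to-live (TTL) eviction might look like this; in a real system you might reach for functools.lru_cache or Redis instead:

```python
import time

class TTLCache:
    """In-memory cache that evicts entries after a fixed TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store: dict = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self.store[key]  # evict expired entry
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=60)
cache.set("daily_sales", 1234)
print(cache.get("daily_sales"))  # 1234 until the TTL expires
```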
Infrastructure Engineer
Implement a CI/CD pipeline, i.e. automated testing and deployment of code (see the workflow sketch after this list).
Set up application monitoring and/or request tracing.
Deploy your services with Kubernetes instead of Docker Compose.
Use Minikube if you don't want cloud hosting.
Host your application with a cloud services provider, and handle the transition.
Define your cloud resources (if any) with Terraform.
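For CI, a minimal GitHub Actions workflow that runs your tests on every push and pull request could look like this (paths and versions are illustrative):

```yaml
# .github/workflows/ci.yml: a minimal CI sketch
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest
```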
Backend Engineer
Handle errors, and return appropriate status codes and messages (see the sketch after this list).
Write unit tests, mocking dependencies if needed.
Create an OpenAPI specification.
Use it to generate documentation, a client library, and/or server stubs.
Implement user registration and login.
Rate limit the backend's requests to the model and users' requests to the backend.
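For error handling, the idea is to return clear status codes instead of letting exceptions bubble up as 500s. Here's a minimal FastAPI sketch, with a hypothetical in-memory lookup standing in for a database:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
ITEMS = {1: "first item"}  # stand-in for a real database lookup

@app.get("/items/{item_id}")
def get_item(item_id: int) -> dict:
    if item_id not in ITEMS:
        # Return 404 with a clear message instead of a generic 500
        raise HTTPException(status_code=404, detail="Item not found")
    return {"id": item_id, "item": ITEMS[item_id]}
```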
Books
If projects aren't your thing, if you want more structured learning, or if you want to supplement your project work, I suggest reading books.
I've only read a few of these myself, but I've heard good things about most of them.
Most of these books are published by O'Reilly, which offers a subscription service directly and sells copies through Amazon.
Some libraries have O'Reilly subscriptions, so check if yours does.
Data Analyst
Communicating with Data:
How to communicate your insights persuasively and understand your audience.
SQL for Data Analysis:
Guide to using SQL to prepare data, analyse groups and trends, and perform A/B testing.
SQL is a data analyst's main way of interacting with data.
Data Scientist
Data Science for Business:
Statistical analysis methods with a focus on business applications, and how to think like a data scientist.
Python for Data Analysis:
Data wrangling, visualisation, and statistical analysis in Python.
Teaches Python and the Pandas library, which are popular tools for data scientists.
Causal Inference in Python (early release):
Teaches causal inference, which is essential for determining the cause of downstream effects.
It is a key skill for data scientists that is often assessed in interviews.
Analytics Engineer
Data Pipelines Pocket Reference:
Short book on building, validating, and monitoring data pipelines.
Data pipelines are an analytics engineer's bread and butter.
Fundamentals of Data Engineering:
Overview of the total data engineering life cycle.
Useful for broadening your understanding beyond the typical transformations and analytics work.
ML Scientist
Deep Learning:
Classic textbook overview of deep learning methods.
Quite dense, but a worthwhile read.
Transformers for Machine Learning:
Comprehensive coverage of transformer architectures and techniques.
Transformers have revolutionised ML in the last few years, so it's useful to be particularly familiar with them.
ML Engineer
Designing Machine Learning Systems:
Overview of building ML systems end-to-end, which is especially useful for "all-in-one" MLE roles.
The author also maintains an excellent blog.
ML Platform Engineer
Practical MLOps:
Practical guide to MLOps principles and tools, plus case studies.
Designing Machine Learning Systems:
Overview of building ML systems end-to-end, rather than just the traditional ML engineering parts.
The author also maintains an excellent blog.
ML Accelerators Engineer
CUDA Programming:
Introduction to parallel programming with CUDA, which is the most popular API for GPU programming.
Data Platform Engineer
Designing Data-Intensive Applications:
Comprehensive overview of data storage and retrieval, distributed systems, and data processing.
This is a must-read for data platform engineers, and one of my favourite technical books.
Fundamentals of Data Engineering:
Overview of the total data engineering life cycle.
Useful for gaining context on the analytics side of data engineering.
Infrastructure Engineer
Site Reliability Engineering:
Collection of essays on the principles and practices of site reliability engineering.
Particularly useful for infrastructure engineers working on high-scale production systems.
Backend Engineer
There are many possibilities here, depending on your tools and specialty.
FastAPI (early release):
Building services with FastAPI, a popular Python web framework.
Covers general backend principles too, like layered architectures, authentication, testing, and more.
Designing Data-Intensive Applications:
Comprehensive overview of data storage and retrieval, distributed systems, and data processing.
Important for backend engineers working on high-scale data systems, especially on teams without data platform engineers.