Practical projects for building data and software skills.
2023-06-23
Introduction
I recently gave a talk on career options in data.
I mentioned in the Q&A period that relevant and realistic side projects were a great way to build and display your technical skills, especially for people without prior experience in data or software roles.
Someone asked for project ideas for each of the roles I presented about, but there was only so much I could say with a couple of minutes left in the presentation.
So, I did some brainstorming and came up with what I feel are practical, achievable, and interesting projects.
I designed one project per "area" of work, because in some cases, it's hard to isolate the work for a single role.
For example, it's hard for an ML platform engineer to implement model versioning without any models to version.
Each project starts with a "minimum viable product" (MVP), which is the minimum amount of work you need to do to have a complete project.
Then, there are optional extension tasks which enable you to dive more deeply into any role and increase the complexity of the project.
I've also included some of my own experience with these types of projects so you can trust that I'm not just making stuff up.
I want to show you that working on these projects has helped me, that they can help you too, and that they're not as hard as they may seem.
I've also included some books for further learning, if that's more your style.
If you want to work in a team, I suggest working on all parts of the MVP together so everyone gains some experience with all roles.
Then, you can split up the extension tasks and work on them individually or in pairs, choosing the ones that interest you most.
Having some knowledge of the other roles' work will make you more well-rounded anyway, which is helpful for job searching and working.
The extension tasks are all optional.
You can do as many or as few as you like, for whichever roles you like, and in almost any order.
You may find that you prefer a different role's work than you expected.
Most importantly, follow your interests and have fun!
Tools
Although you can use any tools you'd like, I'm going to suggest some that I think may make these projects easier.
I suggest using Python as your programming language, especially if you're not an experienced developer.
Python is a popular language for data and ML work, and the ecosystem makes it easy to get started.
If you need to build a service in Python, I suggest FastAPI or Flask, because they're simple but powerful.
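For instance, a FastAPI service can be as small as this (the module and route names are just examples):

```python
# main.py: a minimal FastAPI service; run with `uvicorn main:app --reload`
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    # Simple liveness check; returns 200 with a small JSON body
    return {"status": "ok"}
```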
If you need to containerise something, use Docker for the build instructions and Docker Compose for service deployment.
Although you can use Kubernetes, it's more complex and not necessary for a small project.
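As a rough sketch, the build instructions and service definition might look like this, assuming a FastAPI app like the one above (file contents are illustrative):

```dockerfile
# Dockerfile: build instructions for a small Python service
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

```yaml
# docker-compose.yml: runs the image built from the Dockerfile above
services:
  api:
    build: .
    ports:
      - "8000:8000"
```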
If you don't have a favourite code editor, I suggest using VS Code.
It's free and has a great extension ecosystem.
I would stay away from general-purpose text editors like Notepad or TextEdit.
You should store all your code, documentation, notebooks, etc., in a Git repo.
I suggest hosting your project on GitHub.
If you don't have an account, you can sign up for free.
They have great documentation and learning resources.
You should develop "features" (or work on tasks) by committing progress to a branch dedicated to that feature, then update the main branch by opening and merging a pull request for that branch.
This is called "feature branching" and is a common practice in software development.
It's a good habit to get into, and it will make it easier to collaborate with others.
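In practice, the flow looks roughly like this (the branch and file names are just examples):

```bash
git checkout -b feature/etl-pipeline   # create and switch to a feature branch
git add etl.py
git commit -m "Add ETL pipeline"
git push -u origin feature/etl-pipeline
# then open a pull request on GitHub and merge it into main
```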
In general, pick tools that reduce toil so you can focus on learning and building.
Insights project
In this project, you will be preparing a transactional dataset for analysis, then uncovering and communicating insights within it.
Real data is ideal for learning.
Try working with a small business to use some of their data in exchange for analytics services.
Otherwise you can find suitable datasets online, like a subset of the AdventureWorks data files.
I suggest interacting with the data in a Jupyter Notebook so you can iterate faster.
My experience
Near the end of my undergraduate degree, I discovered a passion for dealing with large amounts of data.
Unfortunately, this was not something I had the opportunity to explore in my degree.
I invented a project for myself that was much like the one I'm describing to you now, although I didn't know what it would entail when I started.
I signed a contract with a local cafe, gaining access to their point-of-sale data in exchange for building a predictive model of their product sales.
Although I went into the project with the intention of learning data science alone, I ended up learning a lot about analytics engineering and data analysis too.
The skills I gained made it easy to find jobs in data science and engineering once I started looking.
MVP
Write code to extract the data, transform and clean it, and load it into storage (ETL); there's a sketch after this list.
Your storage can be a Parquet or CSV file.
Visualise trends and distributions in the data.
Feel free to use a no-code tool like Google Sheets.
Predict the future or unknowns using the data.
Use a library like scikit-learn for this.
Write a document explaining your work.
Include diagrams for the insights and an entity-relationship diagram (ERD) for the dataset.
Bonus: Make the entire process automatically refresh on a schedule, even if your data is static.
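Here's a minimal sketch of the ETL and prediction steps, assuming a hypothetical transactions.csv with date, product, and quantity columns:

```python
# A minimal ETL + prediction sketch. The file and column names are
# hypothetical; adapt them to your dataset. Writing Parquet requires
# a Parquet engine such as pyarrow.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Extract
raw = pd.read_csv("transactions.csv")

# Transform: parse dates, drop incomplete rows, aggregate daily sales
raw["date"] = pd.to_datetime(raw["date"])
clean = raw.dropna(subset=["product", "quantity"])
daily = clean.groupby("date", as_index=False)["quantity"].sum()

# Load: store the cleaned data as a Parquet file
daily.to_parquet("daily_sales.parquet", index=False)

# Predict: fit a simple trend model of sales against day number
daily["day_number"] = (daily["date"] - daily["date"].min()).dt.days
model = LinearRegression()
model.fit(daily[["day_number"]].to_numpy(), daily["quantity"].to_numpy())
next_day = daily["day_number"].max() + 1
print("Predicted next-day sales:", model.predict([[next_day]])[0])
```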
Data Analyst
Identify useful questions to ask about the data, and communicate the value and impact of your insights.
Perform your queries and transformations in SQL or Pandas instead of no-code tools.
Use an interactive tool like Looker Studio or Streamlit to display your visualisations (see the Streamlit sketch after this list).
Allow users to select parameters and specify filters, like date ranges.
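For the Streamlit route, a dashboard with a date-range filter might look something like this, reusing the hypothetical Parquet file from the MVP sketch:

```python
# dashboard.py: a minimal Streamlit sketch; run with
# `streamlit run dashboard.py`. Assumes the daily_sales.parquet file
# from the MVP sketch (file and column names are hypothetical).
import pandas as pd
import streamlit as st

daily = pd.read_parquet("daily_sales.parquet")

st.title("Daily sales")

# Let users pick a date range to filter the chart
start, end = st.date_input(
    "Date range", value=(daily["date"].min(), daily["date"].max())
)
mask = (daily["date"] >= pd.Timestamp(start)) & (daily["date"] <= pd.Timestamp(end))
st.line_chart(daily.loc[mask], x="date", y="quantity")
```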
Data Scientist
Create additional predictive models, covering different model types.
Perform statistical analyses and report your findings.
Quantify uncertainty (e.g. confidence or prediction intervals) and significance; there's a bootstrap sketch after this list.
Perform causal inference to determine the causes of observed phenomena.
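As one example of quantifying uncertainty, a bootstrap confidence interval takes only a few lines (the data file is the hypothetical one from the MVP sketch):

```python
# A minimal bootstrap sketch: estimate a 95% confidence interval for
# mean daily sales by resampling with replacement.
import numpy as np
import pandas as pd

daily = pd.read_parquet("daily_sales.parquet")
quantity = daily["quantity"].to_numpy()

rng = np.random.default_rng(42)
boot_means = [
    rng.choice(quantity, size=len(quantity), replace=True).mean()
    for _ in range(10_000)
]

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean: {quantity.mean():.1f}, 95% CI: ({lower:.1f}, {upper:.1f})")
```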
Analytics Engineer
Build the ETL pipeline with a workflow engine like Prefect (see the sketch after this list).
Use a data warehouse for storage.
DuckDB is great for a simple project because it's embedded.
Use dbt to perform data validation, and to create wide tables, dimensional models, or views.
Create a data catalogue with documentation on tables, columns, and their lineages.
Ingest additional data sources and relate them to your existing data.
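To give a feel for the pipeline piece, here's a minimal Prefect flow that loads the hypothetical transactions CSV into DuckDB (assumes Prefect 2.x; the file and table names are illustrative):

```python
import duckdb
import pandas as pd
from prefect import flow, task

@task
def extract() -> pd.DataFrame:
    return pd.read_csv("transactions.csv")

@task
def transform(raw: pd.DataFrame) -> pd.DataFrame:
    raw["date"] = pd.to_datetime(raw["date"])
    return raw.dropna(subset=["product", "quantity"])

@task
def load(clean: pd.DataFrame) -> None:
    # DuckDB can query the local `clean` DataFrame directly by name
    con = duckdb.connect("warehouse.duckdb")
    con.execute("CREATE OR REPLACE TABLE transactions AS SELECT * FROM clean")
    con.close()

@flow
def etl():
    load(transform(extract()))

if __name__ == "__main__":
    etl()
```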
Models project
In this project, you will be creating a deep learning model, then automating, optimising, and evaluating it.
Find an open-source dataset online.
I suggest searching Hugging Face or Kaggle for a dataset intended for the type of model you will create.
If you need compute resources for training, I suggest using Google Colab or Kaggle Notebooks.
Both are free with limitations.
You can execute accelerator-based programs in notebooks.
There are also Python wrappers for the CUDA and OpenCL APIs (e.g. PyCUDA and PyOpenCL).
My experience
I haven't worked on a modelling team before, nor have I completed a comprehensive project like this myself.
However, I've worked at companies in the machine learning space and have seen coworkers do similar work.
Many of these tasks may be difficult to approach as a beginner, but any progress you make is valuable.
The few times I've dabbled in training and managing my own models have helped me contextualise the work of my coworkers.
MVP
Train a simple deep learning model on the data (see the training loop sketch after this list).
Implement automatic model performance evaluations.
Compare model performance when adjusting parameters and varying approaches.
Implement model compression.
Write a thorough document demonstrating and explaining your work, including images where appropriate.
Bonus: Containerise your work.
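To show the overall shape of the MVP's training step, here's a minimal PyTorch training loop on synthetic data; swap in your real dataset and model:

```python
import torch
from torch import nn

# Synthetic data: 1,000 samples, 20 features, binary labels
X = torch.randn(1000, 20)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)

model = nn.Sequential(
    nn.Linear(20, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Basic full-batch training loop
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Evaluate accuracy (use a held-out split in a real project)
with torch.no_grad():
    accuracy = ((model(X) > 0).float() == y).float().mean()
print(f"Final loss: {loss.item():.3f}, accuracy: {accuracy.item():.2%}")
```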
ML Scientist
Design the model architecture yourself with research and experimentation.
Test and evaluate hypotheses regarding model design and configuration.
Implement model explainability methods to contextualise model predictions.
Identify an open-source model that could be used as a basis for this task, and determine the appropriate transfer learning technique to apply to it.
ML Engineer
Productionise the model and model training code.
Deploy the model and make it accessible through a service.
Implement automated hyperparameter tuning (see the Optuna sketch after this list).
Create an ensemble model by combining multiple models to improve performance.
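For the hyperparameter tuning task, a minimal Optuna sketch might look like this, reusing the synthetic data idea from the MVP sketch (all names are illustrative):

```python
import optuna
import torch
from torch import nn

# Synthetic data with a train/validation split
X = torch.randn(1000, 20)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

def objective(trial: optuna.Trial) -> float:
    # Search over the learning rate and hidden layer size
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    hidden = trial.suggest_int("hidden", 8, 64)

    model = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(50):
        optimizer.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()  # validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print("Best parameters:", study.best_params)
```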
ML Platform Engineer
Automate the model deployment process.
Implement automatic data and model versioning.
Set up monitoring for data drift and performance degradation.
Create and utilise a feature store, if it makes sense for your use case.
ML Accelerators Engineer
Calculate your model's training and/or serving performance costs, and show your work.
Implement multiple different compression strategies and compare them (see the quantisation sketch after this list).
Attempt to optimise part of the training process using accelerator-based programming.
It's okay if you don't actually make it faster; the goal is to learn.
Build a model server using TensorFlow Serving, Triton Inference Server, or similar.
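As one compression strategy to start from, here's a minimal dynamic quantisation sketch in PyTorch; compare the resulting size (and accuracy on your data) against the original model:

```python
import os
import torch
from torch import nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

# Dynamic quantisation: convert Linear weights from float32 to int8
quantised = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_on_disk(m: nn.Module, path: str) -> int:
    # Save the state dict and measure the file size in bytes
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path)
    os.remove(path)
    return size

print("Original: ", size_on_disk(model, "fp32.pt"), "bytes")
print("Quantised:", size_on_disk(quantised, "int8.pt"), "bytes")
```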
Systems project
In this project, you will be building an application that uses a machine learning model.
Your application can do anything you want.
For inspiration, check out past submissions to AI hackathons on DevPost.
If it works for your use case, I suggest using an embedded database such as SQLite to simplify setup.
I suggest using Postman to simplify sending requests to your application.
My experience
I consider myself a backend engineer with a focus on data systems and an interest in infrastructure.
Therefore, I often build pieces of larger data and ML systems at work, but not entire systems end-to-end.
I recently started working on a project in my spare time which follows this same structure.
Despite having a few years of experience under my belt, the project has been an excellent up-skilling opportunity for me.
It's helping me become a much stronger developer, which has a direct impact on my work.
MVP
Build a basic backend (server) for the application (see the sketch after this list).
Incorporate a request to an externally hosted model.
Design and create a self-hosted operational database for data storage.
Connect it to the backend, implementing a data access layer.
Containerise the backend with Docker and deploy it as a service with Docker Compose.
Switch to a non-embedded database (if applicable), deploy it as a service, and update the backend to use it.
Build a basic command-line or frontend (web) interface for presentation.
Bonus: Host and use your own open-source model.
It should be deployed as its own service.
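Here's a minimal sketch of the first few backend pieces: a FastAPI service, a call to an externally hosted model, and a SQLite data access layer. The model URL and payload shape are placeholders; adapt them to the API you actually use:

```python
import sqlite3

import httpx
from fastapi import FastAPI

app = FastAPI()
MODEL_URL = "https://example.com/v1/predict"  # hypothetical endpoint

def get_connection() -> sqlite3.Connection:
    # Simple data access layer backed by an embedded SQLite file
    con = sqlite3.connect("app.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS predictions (input TEXT, output TEXT)"
    )
    return con

@app.post("/predict")
async def predict(text: str) -> dict:
    # Call the externally hosted model
    async with httpx.AsyncClient() as client:
        response = await client.post(MODEL_URL, json={"input": text})
    output = response.json()

    # Persist the request and result
    con = get_connection()
    with con:
        con.execute(
            "INSERT INTO predictions VALUES (?, ?)", (text, str(output))
        )
    con.close()
    return {"input": text, "output": output}
```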
Data Platform Engineer
Set up a schema migration system.
Consider incorporating non-relational databases (e.g. document, vector, graph) if they make sense for your use case.
Incorporate a messaging system or event-driven pipeline.
Implement an in-memory cache to reduce computation or networking for frequently requested data (see the sketch after this list).
Learn about data replication strategies.
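For the caching task, a tiny in-memory cache with time-to-live (TTL) eviction might look like this; in a real system you might reach for functools.lru_cache or Redis instead:

```python
import time

class TTLCache:
    """In-memory cache that evicts entries after a fixed TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store: dict = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self.store[key]  # evict expired entry
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=60)
cache.set("daily_sales", 1234)
print(cache.get("daily_sales"))  # 1234 until the TTL expires
```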
Infrastructure Engineer
Implement a CI/CD pipeline, i.e. automated testing and deployment of code (see the workflow sketch after this list).
Set up application monitoring and/or request tracing.
Deploy your services with Kubernetes instead of Docker Compose.
Use Minikube if you don't want cloud hosting.
Host your application with a cloud services provider, and handle the transition.
Define your cloud resources (if any) with Terraform.
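For CI, a minimal GitHub Actions workflow that runs your tests on every push and pull request could look like this (paths and versions are illustrative):

```yaml
# .github/workflows/ci.yml: a minimal CI sketch
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest
```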
Backend Engineer
Handle errors, and return appropriate status codes and messages (see the sketch after this list).
Write unit tests, mocking dependencies if needed.
Create an OpenAPI specification.
Use it to generate documentation, a client library, and/or server stubs.
Implement user registration and login.
Rate limit the backend's requests to the model and users' requests to the backend.
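For error handling, the idea is to return clear status codes instead of letting exceptions bubble up as 500s. Here's a minimal FastAPI sketch, with a hypothetical in-memory lookup standing in for a database:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
ITEMS = {1: "first item"}  # stand-in for a real database lookup

@app.get("/items/{item_id}")
def get_item(item_id: int) -> dict:
    if item_id not in ITEMS:
        # Return 404 with a clear message instead of a generic 500
        raise HTTPException(status_code=404, detail="Item not found")
    return {"id": item_id, "item": ITEMS[item_id]}
```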
Books
If projects aren't your thing, if you want more structured learning, or if you want to supplement your project work, I suggest reading books.
I've only read a few of these myself, but I've heard good things about most of them.
Most of these books are published by O'Reilly, which offers a subscription service directly and sells copies through Amazon.
Some libraries have O'Reilly subscriptions, so check if yours does.
Data Analyst
Communicating with Data:
How to communicate your insights persuasively and understand your audience.
SQL for Data Analysis:
Guide to using SQL to prepare data, analyse groups and trends, and perform A/B testing.
SQL is a data analyst's main way of interacting with data.
Data Scientist
Data Science for Business:
Statistical analysis methods with a focus on business applications, and how to think like a data scientist.
Python for Data Analysis:
Data wrangling, visualisation, and statistical analysis in Python.
Teaches Python and the Pandas library, which are popular tools for data scientists.
Causal Inference in Python (early release):
Teaches causal inference, which is essential for determining the cause of downstream effects.
It is a key skill for data scientists that is often assessed in interviews.
Analytics Engineer
Data Pipelines Pocket Reference:
Short book on building, validating, and monitoring data pipelines.
Data pipelines are an analytics engineer's bread and butter.
Fundamentals of Data Engineering:
Overview of the total data engineering life cycle.
Useful for broadening your understanding beyond the typical transformations and analytics work.
ML Scientist
Deep Learning:
Classic textbook overview of deep learning methods.
Quite dense, but a worthwhile read.
Transformers for Machine Learning:
Comprehensive coverage of transformer architectures and techniques.
Transformers have revolutionised ML in the last few years, so it's useful to be particularly familiar with them.
ML Engineer
Designing Machine Learning Systems:
Overview of building ML systems end-to-end, which is especially useful for "all-in-one" MLE roles.
The author also maintains an excellent blog.
ML Platform Engineer
Practical MLOps:
Practical guide to MLOps principles and tools, plus case studies.
Designing Machine Learning Systems:
Overview of building ML systems end-to-end, rather than just the traditional ML engineering parts.
The author also maintains an excellent blog.
ML Accelerators Engineer
CUDA Programming:
Introduction to parallel programming with CUDA, which is the most popular API for GPU programming.
Data Platform Engineer
Designing Data-Intensive Applications:
Comprehensive overview of data storage and retrieval, distributed systems, and data processing.
This is a must-read for data platform engineers, and one of my favourite technical books.
Fundamentals of Data Engineering:
Overview of the total data engineering life cycle.
Useful for gaining context on the analytics side of data engineering.
Infrastructure Engineer
Site Reliability Engineering:
Collection of essays on the principles and practices of site reliability engineering.
Particularly useful for infrastructure engineers working on high-scale production systems.
Backend Engineer
There are many possibilities here, depending on your tools and specialty.
FastAPI (early release):
Building services with FastAPI, a popular Python web framework.
Covers general backend principles too, like layered architectures, authentication, testing, and more.
Designing Data-Intensive Applications:
Comprehensive overview of data storage and retrieval, distributed systems, and data processing.
Important for backend engineers working on high-scale data systems, especially on teams without data platform engineers.