Practical projects for building skills in software, data, and ML.
2023-06-23
I recently gave a talk on career options in software. I mentioned in the Q&A period that relevant and realistic side projects were a great way to build and display your technical skills, especially for people without prior experience in data or software roles.
Someone asked for project ideas for each of the roles I presented about, but there was only so much I could say with a couple minutes left in the presentation. So, I did some brainstorming and came up with what I feel are practical, achievable, and interesting projects.
I recommend not using GitHub Copilot or other AI auto-complete tools. The purpose of these projects is learning, which happens best when you struggle through figuring something out yourself. Of course, you can use AI to help you understand things, but don't let it do your work for you. The important part is that you're the one driving the process.
Update (June 2025): I finally finished updating this post to reflect the changes I made to the "software careers" post. Thank you for your patience! I promise I put a lot of time into this update.
I designed one project per "area" of work, because in some cases, it's hard to isolate the work for a single role. I also dissolved the "platforms" group and added the roles within it to the areas they primarily support. For example, it's hard for a machine learning platform engineer to implement model versioning without any models to version.
Each project starts with a "minimum viable product" (MVP), which is the minimum amount of work you need to do to have a complete project. Then, there are optional extension tasks which enable you to dive more deeply into any role and increase the complexity of the project. The focus is on building foundational knowledge rather than completing the project quickly, so the tasks are often purposefully "inefficient".
I've also included some of my own experience with these types of projects so you can trust that I'm not just making stuff up. I want to show you that working on these projects has helped me, that they can help you too, and that they're not as hard as they may seem.
These projects are all designed to be implemented by a single individual so that you develop a broad skillset. If you want to work in a team, I suggest working on all parts of the MVP together so everyone gains some experience with all roles. Then, you can split up the extension tasks and work on them individually or in pairs, choosing the ones that interest you most. Having some knowledge of the other roles' work will make you more well-rounded anyway, which is helpful for job searching and working.
Note that some roles' MVP work and extensions will be harder or take longer than others'. This is particularly true for foundational engineering like platform work. It's not that these roles themselves are necessarily harder, but the impact of their work scales across the company, and so the results are shared by many people.
The extension tasks are all optional. You can do as many or as few as you like, for whichever roles you like, and in almost any order. You may find that you prefer different roles' work than you expected.
Most importantly, follow your interests and have fun!
Although you can use any tools you'd like, I'm going to suggest some that I think may make these projects easier.
I suggest using Python as your programming language, especially if you're not an experienced developer. Python is a popular language for data and ML work, and the ecosystem makes it easy to get started. This is less relevant for the "products" project, where you can feel free to use whichever language you're most comfortable in. If you need to build a service in Python, I suggest FastAPI or Flask, because they're simple but powerful.
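To give a sense of how little code a Flask service needs, here's a minimal app with a single JSON endpoint. The route and port are arbitrary choices for illustration, and Flask is assumed to be installed (`pip install flask`):

```python
# Minimal Flask service with one JSON endpoint.
# Assumes Flask is installed (pip install flask).
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # A simple health check, handy once you containerise the app.
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(port=8000)
```

Run it and visit `http://localhost:8000/health` to see the JSON response. FastAPI looks very similar; it adds request validation and automatic API docs on top.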
If you need to containerise something, use Docker for the build instructions and Docker Compose for service deployment. Although you can use Kubernetes, it's more complex and not necessary for a small project.
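As a rough sketch of what Docker Compose gives you, here's a hypothetical `docker-compose.yml` that runs an application container alongside a Postgres database. Every service name and value here is a placeholder, not something your project must use:

```yaml
# docker-compose.yml: run the app and its database together.
# All names and values below are placeholders for illustration.
services:
  app:
    build: .          # build from the Dockerfile in this directory
    ports:
      - "8000:8000"   # expose the app on localhost:8000
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
```

A single `docker compose up` then builds and starts both services, which is most of what you'd need Kubernetes for at this scale.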
If you don't have a favourite code editor, I suggest using VS Code. It's free and has a great extension ecosystem. I would stay away from general-purpose text editors like Notepad or TextEdit.
You should store all your code, documentation, notebooks, etc., in a Git repo. I suggest hosting your project on GitHub. If you don't have an account, you can sign up for free. They have great documentation and learning resources.
You should develop "features" (or work on tasks) by committing progress to a branch dedicated to that feature, and updating the main branch by opening then merging a pull request for your feature branch. This is called "feature branching" and is a common practice in software development. It's a good habit to get into, and it will make it easier to collaborate with others.
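You can practise the feature-branching flow locally before involving GitHub. Here's a sketch with made-up branch, file, and commit names; in a real project you'd `git push` the branch and merge it through a pull request instead of merging locally:

```shell
# Set up a throwaway repo (the config lines just identify the commit author).
git init demo && cd demo
git config user.email "you@example.com"
git config user.name "You"
git commit --allow-empty -m "Initial commit"

# Do the work on a branch dedicated to the feature.
git checkout -b add-sales-report
echo "report" > report.txt
git add report.txt
git commit -m "Add sales report"

# Back on the default branch, merge the feature in.
# (On GitHub, this step is the pull request merge.)
git checkout -
git merge --no-ff add-sales-report -m "Merge feature"
```

The `--no-ff` flag keeps a merge commit in the history, so you can see where each feature landed, much like a merged pull request does.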
In general, pick tools that reduce toil so you can focus on learning and building.
I have chosen to make the MVP versions of the projects completely cloud-free so that you can easily develop and run the code entirely on your own laptop. Knowing your way around one of the major cloud providers is a useful skill, but costs money unless you can stay within a free tier or get credits, and can add significant work to your project.
If you decide to adopt a cloud provider, I suggest either Google Cloud Platform (GCP) or Amazon Web Services (AWS). GCP has a massive advantage for analytics, and I would strongly recommend it for the Insights project. AWS has a better developer ecosystem for the Products project, but GCP is alright too. Otherwise, GCP is generally easier to navigate. I would stay away from Azure unless you have credits.
If you want to deploy your project to the cloud for practical reasons, but don't actually care to learn infrastructure engineering or the major cloud providers, consider using a Platform-as-a-Service (PaaS) offering. PaaS providers are way easier to get started with and are often much cheaper for side projects. Flexibility is limited, but it's a nice middle ground between laptop-only and major cloud. This market is rapidly changing, so do your own research.
In this project, you will be preparing a relational dataset for analysis then uncovering and communicating insights within it.
Real data is ideal for learning. Try working with a small business to use some of their data in exchange for analytics services. However, you will have to be extremely careful not to expose the business's data, or even commit it to a private repository. Additionally, you will need to make sure to deliver what you promise.
Otherwise you can find suitable datasets online. AdventureWorks is a fake store with transactional data that can be easily downloaded. There are also a myriad of online APIs that you can fetch data from, like PokéAPI, REST Countries, and the Battle.net APIs.
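For example, pulling a record from PokéAPI needs only the standard library. The `summarise` helper and the fields it keeps are my own choices for illustration, not anything PokéAPI prescribes, though the field names match its documented response format:

```python
# Fetch one record from PokéAPI and flatten it into an analysis-friendly row.
import json
import urllib.request

def fetch_pokemon(name: str) -> dict:
    # PokéAPI is free and requires no authentication.
    url = f"https://pokeapi.co/api/v2/pokemon/{name}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def summarise(record: dict) -> dict:
    # Keep just a few fields; "types" is nested, so flatten it to a list.
    return {
        "name": record["name"],
        "weight": record["weight"],
        "types": [t["type"]["name"] for t in record["types"]],
    }

if __name__ == "__main__":
    print(summarise(fetch_pokemon("pikachu")))
```

Looping a fetch like this over many records, with a polite delay between requests, is how you'd build a dataset from any of these APIs.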
The best choice, if you can find it, is a dataset related to an interest of yours. That will help keep you motivated to explore the data, and you'll be able to dive in deeper than others due to your unique context. Plus, you are less likely to use the same dataset as others, which means it will stand out to hiring teams. If a dataset doesn't exist in a downloadable or API format, you can consider building it yourself with web scraping.
I suggest interacting with the data in a Jupyter Notebook so you can iterate faster.
Near the end of my undergraduate degree, I discovered a passion for dealing with large amounts of data. Unfortunately, this was not something I had the opportunity to explore in my degree. I invented a project for myself that was much like the one I'm describing to you now, although I didn't know what it would entail when I started.
I signed a contract with a local cafe, gaining access to their point-of-sale data in exchange for building a predictive model of their product sales. Although I went into the project with the intention of learning data science alone, I ended up learning a lot about analytics engineering and data analysis too. The skills I gained made it easy to find jobs in data science and engineering once I started looking.
I currently work as a data platform engineer, so I also have lots of real-world experience with the tasks for that role.
In this project, you will be creating a deep learning model, then automating, optimising, and evaluating it.
Find an open-source dataset online. I suggest searching Hugging Face or Kaggle for a dataset intended for the type of model you will create. You can also find some hidden gems through Google Dataset Search.
The most common machine learning frameworks are PyTorch, TensorFlow, and JAX. I suggest PyTorch Lightning because it's high-level and approachable, but still fundamentally PyTorch.
If you need compute resources for training or inference, I suggest using Google Colab or Kaggle Notebooks. Both are free with limitations. There are many serverless AI hosting platforms emerging these days, and some of them have free tiers. I suggest seeing what your options are and trying one out, so you aren't limited by compute availability.
You can execute accelerator-based programs in notebooks (example). There are also Python wrappers for the CUDA and OpenCL APIs. I also suggest checking out Triton or ThunderKittens, which are much more user-friendly ways to program GPUs.
I haven't worked on a modelling team before, nor have I completed a comprehensive project like this myself. However, I've worked at companies in the machine learning space and have seen coworkers do similar work.
Many of these tasks may be difficult to approach as a beginner, but any progress you make is valuable. The few times I've dabbled in training and managing my own models have helped me contextualise the work of my coworkers.
In this project, you will be building a web application that uses a machine learning model. The best ML-powered products improve as the model improves, rather than papering over a shortcoming in the currently available models.
Your application can do anything you want. For inspiration, check out past submissions to AI hackathons on DevPost or lablab.ai.
If it works for your use case, I suggest starting with an embedded database such as SQLite to simplify setup.
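SQLite ships with Python's standard library, so there's no server to install or run; the whole database lives in one file (or in memory). A sketch of storing model outputs, with a made-up table schema and placeholder data:

```python
# SQLite is embedded: the database is just a file, with no server process.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a filename like "app.db" to persist
conn.execute("CREATE TABLE predictions (input TEXT, label TEXT)")
conn.execute(
    "INSERT INTO predictions VALUES (?, ?)",
    ("a photo of a cat", "cat"),  # placeholder data for illustration
)
conn.commit()

rows = conn.execute("SELECT label FROM predictions").fetchall()
print(rows)  # [('cat',)]
```

If you outgrow it later, the schema and queries translate to Postgres or MySQL with few changes, so starting embedded costs you little.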
I suggest using Postman to simplify sending requests to your application.
I consider myself a generalist software engineer who primarily works on backend products, with a focus on data systems and an interest in infrastructure. (Phew, that's a mouthful!) Therefore, I often build pieces of larger data and ML systems at work, but not entire systems end-to-end.
I recently started working on a project in my spare time which follows this same structure. Despite having a few years of experience under my belt, the project has been an excellent up-skilling opportunity for me. It's helping me become a much stronger developer, which has a direct impact on my work.