Data projects

Practical projects for building data and software skills.

2023-06-23

Introduction

I recently gave a talk on career options in data. I mentioned in the Q&A period that relevant and realistic side projects were a great way to build and display your technical skills, especially for people without prior experience in data or software roles. Someone asked for project ideas for each of the roles I had presented, but there was only so much I could say with a couple of minutes left in the presentation.

So, I did some brainstorming and came up with what I feel are practical, achievable, and interesting projects. I designed one project per "area" of work, because in some cases, it's hard to isolate the work for a single role. For example, it's hard for an ML platform engineer to implement model versioning without any models to version.

Each project starts with a "minimal viable product" (MVP), which is the minimum amount of work you need to do to have a complete project. Then, there are optional extension tasks which enable you to dive more deeply into any role and increase the complexity of the project.

I've also included some of my own experience with these types of projects so you can trust that I'm not just making stuff up. I want to show you that working on these projects has helped me, that they can help you too, and that they're not as hard as they may seem. I've also included some books for further learning, if that's more your style.

If you want to work in a team, I suggest working on all parts of the MVP together so everyone gains some experience with all roles. Then, you can split up the extension tasks and work on them individually or in pairs, choosing the ones that interest you most. Having some knowledge of the other roles' work will make you more well-rounded anyway, which is helpful for job searching and working.

The extension tasks are all optional. You can do as many or as few as you like, for whichever roles you like, and in almost any order. You may find that you enjoy the work of a different role more than you expected.

Most importantly, follow your interests and have fun!

Tools

Although you can use any tools you'd like, I'm going to suggest some that I think may make these projects easier.

I suggest using Python as your programming language, especially if you're not an experienced developer. Python is a popular language for data and ML work, and the ecosystem makes it easy to get started. If you need to build a service in Python, I suggest FastAPI or Flask, because they're simple but powerful.

If you need to containerise something, use Docker for the build instructions and Docker Compose for service deployment. Although you can use Kubernetes, it's more complex and not necessary for a small project.

If you don't have a favourite code editor, I suggest using VS Code. It's free and has a great extension ecosystem. I would stay away from general-purpose text editors like Notepad or TextEdit.

You should store all your code, documentation, notebooks, etc., in a Git repo. I suggest hosting your project on GitHub. If you don't have an account, you can sign up for free. GitHub has great documentation and learning resources.

You should develop "features" (or work on tasks) by committing progress to a branch dedicated to that feature, then updating the main branch by opening and merging a pull request for your feature branch. This is called "feature branching" and is a common practice in software development. It's a good habit to get into, and it will make it easier to collaborate with others.

In general, pick tools that reduce toil so you can focus on learning and building.

Insights project

In this project, you will be preparing a transactional dataset for analysis, then uncovering and communicating insights within it.

Real data is ideal for learning. Try working with a small business to use some of their data in exchange for analytics services. Otherwise, you can find suitable datasets online, like a subset of the AdventureWorks data files.

I suggest interacting with the data in a Jupyter Notebook so you can iterate faster.
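
As a rough sketch of what an early notebook cell might look like, the example below cleans a tiny transactional dataset and computes revenue per product using only the Python standard library. The sample data, the column names (order_id, product, quantity, unit_price), and the helper functions are all hypothetical. A real dataset will have a different shape, and you may well prefer a dataframe library for this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a real transactions export.
RAW_CSV = """order_id,product,quantity,unit_price
1001,Latte,2,4.50
1002,Muffin,1,3.00
1003,Latte,1,4.50
1004,Muffin,,3.00
1005,Espresso,3,3.00
"""


def load_transactions(text):
    """Parse CSV text into typed rows, skipping rows with a missing quantity."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        if not record["quantity"]:
            # Incomplete row: dropped here, but worth investigating in a real project.
            continue
        rows.append({
            "order_id": int(record["order_id"]),
            "product": record["product"],
            "quantity": int(record["quantity"]),
            "unit_price": float(record["unit_price"]),
        })
    return rows


def revenue_by_product(rows):
    """Sum quantity * unit_price for each product."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product"]] += row["quantity"] * row["unit_price"]
    return dict(totals)


transactions = load_transactions(RAW_CSV)
revenue = revenue_by_product(transactions)

# Print products from highest to lowest revenue.
for product, total in sorted(revenue.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{product}: {total:.2f}")
```

Keeping the cleaning step and the aggregation step in separate functions makes it easier to rerun one of them while iterating in a notebook.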

My experience

Near the end of my undergraduate degree, I discovered a passion for dealing with large amounts of data. Unfortunately, this was not something I had the opportunity to explore in my degree. I invented a project for myself that was much like the one I'm describing to you now, although I didn't know what it would entail when I started.

I signed a contract with a local cafe, gaining access to their point-of-sale data in exchange for building a predictive model of their product sales. Although I went into the project with the intention of learning data science alone, I ended up learning a lot about analytics engineering and data analysis too. The skills I gained made it easy to find jobs in data science and engineering once I started looking.

MVP

Data Analyst

Data Scientist

Analytics Engineer

Models project

In this project, you will be creating a deep learning model, then automating, optimising, and evaluating it.

Find an open-source dataset online. I suggest searching Hugging Face or Kaggle for a dataset intended for the type of model you will create.

If you need compute resources for training, I suggest using Google Colab or Kaggle Notebooks. Both are free with limitations.

You can execute accelerator-based programs in notebooks (example). There are also Python wrappers for the CUDA and OpenGL APIs.

My experience

I haven't worked on a modelling team before, nor have I completed a comprehensive project like this myself. However, I've worked at companies in the machine learning space and have seen coworkers do similar work.

Many of these tasks may be difficult to approach as a beginner, but any progress you make is valuable. The few times I've dabbled in training and managing my own models have helped me contextualise the work of my coworkers.

MVP

ML Scientist

ML Engineer

ML Platform Engineer

ML Accelerators Engineer

Systems project

In this project, you will be building an application that uses a machine learning model.

Your application can do anything you want. For inspiration, check out past submissions to AI hackathons on DevPost.

If it works for your use case, I suggest using an embedded database such as SQLite to simplify setup.
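
To give a sense of how little setup an embedded database needs, here is a minimal sketch using the sqlite3 module from Python's standard library. The predictions table, its columns, and the two helper functions are hypothetical examples of what an application built around a model might store. They are not a prescribed schema.

```python
import sqlite3

# ":memory:" keeps this example self-contained. Pass a file path such as
# "app.db" (a hypothetical name) to persist data between runs.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per prediction the application has served.
conn.execute("""
    CREATE TABLE IF NOT EXISTS predictions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        input_text TEXT NOT NULL,
        label TEXT NOT NULL,
        score REAL NOT NULL
    )
""")


def save_prediction(conn, input_text, label, score):
    """Insert one prediction and return its row id."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO predictions (input_text, label, score) VALUES (?, ?, ?)",
            (input_text, label, score),
        )
    return cur.lastrowid


def recent_predictions(conn, limit=10):
    """Return the most recent predictions, newest first."""
    return conn.execute(
        "SELECT input_text, label, score FROM predictions ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()


first_id = save_prediction(conn, "great coffee", "positive", 0.93)
save_prediction(conn, "slow service", "negative", 0.81)
rows = recent_predictions(conn)
print(rows)
```

There is no server to install or configure here. The database is either a single file or, as in this sketch, held in memory.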

I suggest using Postman to simplify sending requests to your application.

My experience

I consider myself a backend engineer with a focus on data systems and an interest in infrastructure. Therefore, I often build pieces of larger data and ML systems at work, but not entire systems end-to-end.

I recently started working on a project in my spare time which follows this same structure. Despite having a few years of experience under my belt, the project has been an excellent up-skilling opportunity for me. It's helping me become a much stronger developer, which has a direct impact on my work.

MVP

Data Platform Engineer

Infrastructure Engineer

Backend Engineer

Books

If projects aren't your thing, if you want more structured learning, or if you want to supplement your project work, I suggest reading books. I've only read a few of these myself, but I've heard good things about most of them.

Most of these books are published by O'Reilly, which directly offers a subscription service and indirectly sells through Amazon. Some libraries have O'Reilly subscriptions, so check if yours does.

Data Analyst

Data Scientist

Analytics Engineer

ML Scientist

ML Engineer

ML Platform Engineer

ML Accelerators Engineer

Data Platform Engineer

Infrastructure Engineer

Backend Engineer