Data science project quality varies a lot from person to person. Even within the same lab or team, you can see a massive disparity in the structure and robustness of the underlying analysis.
Here is a list of things I've seen throughout my journey in data science:
- Not a single comment in the whole codebase.
- A folder structure so deeply nested it is very difficult to follow.
- Code that cannot even be run without using the same computer as the author, because the configuration wasn't stored in code.
- A project setup that involves contacting the original author, because the README.md contains no information on what to do.
- No documentation whatsoever.
- No tests, with the code barely checked for correctness on a single input.
- A project that relies on scripts being run in a specific, unspecified order and that uses global state.
This kind of poor code quality is not a big problem when a project has just started. At that stage, things are usually still blurry enough that quality doesn't matter too much. However, once the project takes a more solid form with clear goals, poor code quality will bring everything to a halt and become a major time sink.
The solution to this type of issue is simple: introduce code review in all of your projects. In this data science tip I'll explain:
- what a code review is
- how I personally like to run them
- the major benefits of doing code reviews
- what to do if you are working alone
If you are a data scientist or a graduate student without a software engineering background, this short tip will be massively useful!
What is a Code Review
In a nutshell, a code review is a gate that you put between your new code and the code you already have. Before merging a new piece of code or analysis, you need to have it checked by other people.
These people are usually teammates working on the same project. Some of them will review your new contribution and leave comments in places where there could be improvements, like:
- Something doesn't work and it should.
- The code isn't clear and documentation would help.
- There is a better way to do a part of the analysis.
- A general question about some part of your code that the teammate doesn't know about.
Usually there are multiple rounds of review: several people review your code, and then you address each of their comments. This triggers another round of review, where people can leave more comments or approve your code.
Once the code is approved by enough people, you merge the improved version and continue your analysis. Similarly, you will periodically have to review other people's code and leave comments or approve it.
There are many ways to implement code review. However, if you are using a remote repository host like GitLab or GitHub, the functionality is already built into the software and works quite well for most types of files.
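As a minimal sketch of what this gate looks like with git, here is the local side of the flow; the branch name, file name, and commit message are made up for illustration:

```shell
# Sketch of the review gate with git; names here are hypothetical.
git switch -c improve-preprocessing    # new work happens on its own branch
echo "# refactored helpers" >> preprocessing.py
git add preprocessing.py
git commit -m "Refactor preprocessing into reusable functions"
# git push -u origin improve-preprocessing
# ...then open a merge/pull request in the GitLab or GitHub web UI,
# where teammates comment, request changes, and eventually approve.
```

The key point is that the new branch never touches the main codebase directly; the merge request is the gate.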
How I Like to Run Code Reviews for Data Science Projects
In an ideal world, you would have multiple people working on the same data science project and collaborating heavily. In practice, I've rarely seen that.
The people composing a data science team are usually all working on their own projects, which are not always well aligned with those of the others.
I still like to run code reviews when data science teams are organized like this, because it lets everyone on the team stay aware of what the others are doing and share knowledge.
Here is my recipe for running effective code reviews.
- Each team member needs to first comment on or approve the ongoing code review (usually called a merge request or pull request) before continuing their own analysis. This is important because an ongoing code review is a work in progress, and it needs to be cleared as fast as possible.
- The code review needs to happen on small merge requests. The more code there is to review, the longer it will take and the more back-and-forth there will be. Always aiming for small, incremental changes to the codebase is best.
- The merge request needs to be properly documented and easy to test (e.g. by having comments or documentation explaining how to test the new thing). Otherwise the reviewers will spend too much of their time figuring out the merge request.
- The code review can be stopped whenever the author has one approval and no major comments left to address. Sometimes there are comments that are not objectively useful. Having the option to disregard some of them, or to leave them as feedback, is valuable.
With these four points, the code review process becomes pleasant and each team member learns a lot in the process!
At a more technical level, I like to structure the codebase for optimal code review as follows:
- The stable analysis that corresponds to the latest milestone I've reached in the project (e.g. a baseline classifier for task X) lives on the `main`/`master` branch of my codebase.
- The current milestone I'm working on lives on a branch that is ahead of the `main`/`master` branch, which I call `dev`.
- Every time I want to make an incremental step toward that milestone, I work on a branch out of `dev` that usually has a meaningful name for my experiment, like `data-viz-loso` or something like that.
- When I trigger a code review, I want people to review the `data-viz-loso` type of work so that I can merge it into my `dev` branch.
- Whenever I'm confident that I've hit my milestone, I prepare a presentation of my work and merge `dev` into the `main`/`master` branch if there is good support that I've hit my milestone.
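The branch structure above can be sketched with standard git commands; the branch names are the ones used in this post, and the commits are stand-ins for real work:

```shell
# Sketch of the main -> dev -> experiment branch layout described above.
git switch main                        # stable, milestone-level analysis
git switch -c dev                      # current milestone, ahead of main
git switch -c data-viz-loso dev        # one incremental experiment
# ...commit work here, then open a merge request from data-viz-loso
#    into dev and run the code review on it...
git switch dev
git merge --no-ff data-viz-loso        # merge once the review is approved
# when the milestone is reached and presented:
git switch main
git merge --no-ff dev
```

Using `--no-ff` is a personal preference here: it keeps an explicit merge commit marking where each reviewed experiment landed.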
This way of working ensures that you always have a very solid codebase and that you are working in small, tested steps.
Let's outline the major benefits of doing code review for a data science team.
Major Benefits of Code Reviews
There are six major benefits to doing code reviews in a data science project:
- Higher quality code: having the code double-checked by many pairs of eyes and reworked before being accepted into the main codebase increases its quality. It also forces you and your teammates to prepare your code for other people from the get-go!
- Fewer errors in the analysis: bugs in a data science experiment are a nightmare. They are hard to find and can have drastic effects on the results (making them suspiciously interesting). Having people run and test your code is a sure way to reduce the chance of bugs finding their way into the main analysis!
- Continuous learning of new techniques: by looking at other people's code and figuring out how they structure their analysis, you have unlimited potential to learn from them. This is massively helpful and ensures that the best coders on your team continuously teach the more junior people!
- Improved documentation: since you know there is a 100% probability that someone other than you will read and try to understand your code, you are more inclined to document your analysis thoroughly. Otherwise, you will invariably get comments from teammates who don't understand what your new snippet is about.
- Higher context awareness of your project by your peers: this is a big benefit! Since everyone on your team will be aware of what you are working on, it will be easier to have meaningful discussions about your project and to bounce ideas around!
- Lots of practice talking about your project's usefulness and results: by having to justify each new piece of code you bring into your main codebase, you will gain valuable practice in putting your project's usefulness front and center. This will come in handy whenever it's time to talk to stakeholders or prepare a research paper!
It's also much more motivating to know that your code is being seen and that someone other than you is "working" on your problem!
What to do if Working Alone?
Now, you might not have the chance to work on a project within a team or with teammates. If that's the case, I still think you should work in a code-review cadence with yourself.
It might feel a bit awkward, but taking the time to review your changes critically helps a lot. Knowing that you can't just yank your code into the main codebase forces you to adopt a much more robust structure. This makes you much more likely to produce high-quality code, especially if your project is a long-running one.
When doing solo code reviews, I like to review my merge requests the day after I create them. This lets my brain rest on what I did the day(s) before and come back with a fresh, critical pair of eyes. I usually spot improvements and bugs much more easily that way.
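As a minimal sketch of this solo review, assuming the branch layout from earlier in the post, the next-day pass can be as simple as diffing the experiment branch against `dev` before merging:

```shell
# Self-review of a merge request the next day; branch names follow
# the main/dev/experiment layout used in this post.
git switch data-viz-loso
git log dev..data-viz-loso --oneline    # the commits under review
git diff dev...data-viz-loso            # only the changes this branch adds
```

The three-dot form of `git diff` compares against the common ancestor with `dev`, so you see just your own additions rather than unrelated changes.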
However, it's always best to find someone to review your code, and to review other people's code yourself. It's more motivating and fun to work with peers than alone!
Once implemented, code review will have a major impact on how you think about your own data science projects. It will yield higher quality code and much more robust analyses, and ensure a minimum of documentation!
Not only that, but it's a massive learning opportunity and leads to a spirit of collaboration within a team that makes for great professional growth!
I highly recommend it. Don't hesitate to reach out at email@example.com if you have questions about implementing it within your team.
For machine-learning content, check out my YouTube channel, which covers these topics on a technical and theoretical level!
Have a wonderful week everyone! 👋