Beginner's Guide to Kaggle

NOVEMBER 2, 2023

Intro:

Kaggle, a popular platform for data science competitions, can be intimidating for beginners to get into. After all, some of the listed competitions have had over $1,000,000 prize pools and hundreds of competitors. Companies provide datasets and descriptions of the problems on Kaggle. Participants can then download the data and build models to make predictions and then submit their prediction results to Kaggle.


The problem is that students who have just begun learning data analytics or data science might find Kaggle a little intimidating. How do you start? What problems should you begin with? How many problems do you need to solve? Should you participate in Kaggle competitions? Is it worth it? How do you find the right Kaggle competition for your level, from beginner to advanced?


In this guide, I want to explain how you can start solving Kaggle problems, and how to make the most of this platform.

Types of Competitions on Kaggle

Kaggle competitions are machine learning problems built around data science projects, created by Kaggle itself or by other firms. You can win real money if you compete effectively. Kaggle also provides coding programs for placements to help you get a job as a machine learning engineer at your dream company.


For each competition, Kaggle usually provides a training dataset and a test dataset. All submissions are evaluated on the test dataset using the competition's evaluation metric. The results are shown on a leaderboard so that participants can track their relative progress. The scores shown on the leaderboard during the competition are known as the “public score”, which is calculated on a fraction of the test dataset; that fraction is specified separately for each competition. The final score, called the “private score”, is revealed after the competition closes and is typically calculated on the rest of the test dataset, which the public score never touches. The final ranking of the competition is determined by the private score.
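To make the public/private distinction concrete, here is a minimal sketch of how one submission could be scored twice. It assumes the private score uses the test rows left out of the public score, which is how many competitions are set up. The 30% public fraction, the accuracy metric, the function names, and the toy labels are all my own assumptions for illustration, not Kaggle's actual implementation.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def leaderboard_scores(y_true, y_pred, public_fraction=0.3, seed=0):
    """Score one submission on a public subset and on the remaining rows.

    public_fraction and seed are illustrative assumptions; the real split
    is chosen by the competition host and hidden from participants.
    """
    indices = list(range(len(y_true)))
    random.Random(seed).shuffle(indices)  # fixed, reproducible split
    cut = int(len(indices) * public_fraction)
    public_idx, private_idx = indices[:cut], indices[cut:]
    public = accuracy([y_true[i] for i in public_idx],
                      [y_pred[i] for i in public_idx])
    private = accuracy([y_true[i] for i in private_idx],
                       [y_pred[i] for i in private_idx])
    return public, private


# Toy labels, made up for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

public, private = leaderboard_scores(y_true, y_pred)
print(f"public score:  {public:.2f}")
print(f"private score: {private:.2f}")
```

The two numbers can differ noticeably on small samples, which is one reason a good public rank does not guarantee a good final rank.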


There are many different categories of competition with different incentives and goals. Some of them are:

Category: Description

Featured
- public competitions
- with significant prize money
- goal is to solve commercial problems

Research
- public competitions
- goals are research/scientific in nature or serve a public good
- with cash prizes or invitations to conferences or publication in peer-reviewed journals

Community
- public competitions
- competitions are created by community members
- with cash prizes for the best competition
- for beginners and intermediates

Playground
- public competitions
- set up to be fun, quirky and idea-driven
- without any prize

Simulations
- you submit an agent to compete in a simulated environment
- public competitions
- without cash prizes

Analytics
- open-ended exploration
- with cash prizes
- public competitions
- exclusively for data analytics

Getting Started
- public competitions
- without cash prizes
- for machine learning beginners
- always available and have no deadline

Steps to get started with Kaggle

1. Pick a language
2. Build a strong foundation
3. Topics you can learn before you start solving Kaggle competitions
4. Topics you need to be familiar with in order to make the most of Kaggle competitions
5. Your first Kaggle Project

1. Pick a language

This guide isn’t about which language you should learn first. Most learners go with Python, R, or SQL. The point is to select one language that works for you.

2. Build a strong foundation

The number one reason why most students fail or quit Kaggle is that Kaggle has prerequisites. I believe that if you are completely new to data analytics, you need to spend the first few months building a strong foundation before you register at Kaggle.

Note: Kaggle does offer introductory courses and competitions for complete beginners. Still, I’d recommend that you build your foundations elsewhere and then jump into Kaggle.

If you are completely new to analytics, then you can consider the following MOOCs:

- For Python: Statistics with Python Specialization
- For R: Data Science Foundations with R
- For SQL + Excel: Excel to MySQL: Analytic Techniques for Business Specialization


Each program has a decent mix of lectures, reading material, assignments, and (optional) projects. All the courses are freely auditable. (Read: A Guide on How to Sign up for Coursera Courses for Free)

Make sure you solve all the practical assignments and projects.

3. Topics you can learn before you start solving Kaggle competitions

The aforementioned courses are exhaustive enough to cover all the fundamental topics you need to learn in order to get into Kaggle. But in case you are learning from another source, or just need an explicit list, here it is:


Basic R/Python/SQL Programming -> Data Reading -> Data Cleaning & Manipulation -> Exploratory Data Analysis -> Statistical Analysis -> Feature Engineering and Selection -> Model Validation and Selection
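To show how these topics chain together in practice, here is a minimal sketch in plain Python that touches several of them in order: reading, cleaning, a bit of exploratory analysis, one engineered feature, and a hold-out validation. The data is made up for illustration, and the column names, the `is_child` feature, and the threshold rule are my own assumptions rather than a recommended model. It skips formal statistical analysis and model selection to stay short.

```python
import csv
import io
import statistics

# Made-up toy data, for illustration only (two rows have a missing age)
RAW = """age,fare,survived
22,7.25,0
38,71.28,1
,8.05,0
35,53.10,1
54,51.86,0
2,21.08,1
27,11.13,1
,30.07,0
"""

# 1. Data reading
rows = list(csv.DictReader(io.StringIO(RAW)))

# 2. Data cleaning: convert types and fill missing ages with the median
known_ages = [float(r["age"]) for r in rows if r["age"]]
median_age = statistics.median(known_ages)
for r in rows:
    r["age"] = float(r["age"]) if r["age"] else median_age
    r["fare"] = float(r["fare"])
    r["survived"] = int(r["survived"])

# 3. Exploratory data analysis: compare the mean fare for each outcome
for outcome in (0, 1):
    fares = [r["fare"] for r in rows if r["survived"] == outcome]
    print(f"survived={outcome}: mean fare {statistics.mean(fares):.2f}")

# 4. Feature engineering: derive a new boolean feature from age
for r in rows:
    r["is_child"] = r["age"] < 16

# 5. Model validation: fit a simple rule on the first rows,
#    then measure accuracy on rows the rule has never seen
train, valid = rows[:6], rows[6:]
threshold = statistics.mean(r["fare"] for r in train)


def predict(r):
    """Toy rule: predict 1 for children or above-average fares."""
    return int(r["is_child"] or r["fare"] > threshold)


accuracy = sum(predict(r) == r["survived"] for r in valid) / len(valid)
print(f"hold-out accuracy: {accuracy:.2f}")
```

In a real project you would use a library such as pandas for steps 1 to 4 and a proper model for step 5, but the order of the steps stays the same.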


Note: Some topics lean towards ML; others apply to both data analytics and data science.

5. Your First Kaggle Project

Once you’ve studied the foundational courses, you can start with your first Kaggle Project. As I mentioned, you can directly jump into Kaggle, too. But, studying the courses first has a higher return on investment.

The “Getting Started” competitions are great for beginners because they give you a low-stakes environment to learn, and they are also supported by many community-created tutorials. You can select any project from this list.


I’d recommend going for the most popular project: Titanic - Machine Learning from Disaster. What’s more, you can also find a tutorial on the same project to help you understand how to actually solve this competition.
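If you want a feel for what a first Titanic submission looks like, here is a minimal sketch. The competition expects a CSV file with a `PassengerId` column and a `Survived` column. Everything else here is an assumption for illustration: the three input rows are made up, the function names are hypothetical, and the "women survive, men do not" rule is only a naive baseline, not the method used in the tutorial.

```python
import csv
import io

# Made-up rows that mimic a few columns of the competition's test file
TEST_CSV = """PassengerId,Pclass,Sex,Age
1,3,male,30
2,1,female,41
3,2,male,19
"""


def predict_survived(passenger):
    """Naive baseline: predict that women survived and men did not."""
    return 1 if passenger["Sex"] == "female" else 0


def build_submission(test_file, out_file):
    """Write one PassengerId,Survived row per passenger in test_file."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger in csv.DictReader(test_file):
        writer.writerow([passenger["PassengerId"], predict_survived(passenger)])


# Written to an in-memory buffer here; for a real submission you would
# open the downloaded test file and write to a file on disk instead.
out = io.StringIO()
build_submission(io.StringIO(TEST_CSV), out)
submission_text = out.getvalue()
print(submission_text)
```

Once a baseline like this is on the leaderboard, every later model has a concrete score to beat.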

Things to keep in mind

1. Set Incremental Goals

The zone of proximal development (ZPD) is an educational concept which states that you can enhance your learning by working or studying something that’s *slightly above your current aptitude.* As you gain more competence, you are able to solve even more difficult problems. But, the condition is: you need to work on problems slightly above your current aptitude.


Most Kaggle participants will never win a single competition, and that’s completely fine. If you set winning as your very first milestone, you may feel discouraged and lose motivation after a few tries.

Incremental targets make the journey more enjoyable. For example:

- Make a submission that beats the benchmark solution.
- Score in the top 50% in one competition.
- Score in the top 25% in one competition.
- Score in the top 25% in three competitions.
- Score in the top 10% in one competition.
- Win a competition!

This strategy will allow you to measure your progress and improvement along the way.
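If you like to track this with numbers, the percentage targets above are simple arithmetic on your leaderboard rank. Here is a small sketch; the function names and the example rank of 120 out of 1,000 teams are hypothetical.

```python
def top_percent(rank, total_teams):
    """Leaderboard position as a 'top X%' figure, where rank 1 is best."""
    if not 1 <= rank <= total_teams:
        raise ValueError("rank must be between 1 and total_teams")
    return 100 * rank / total_teams


# Percentage thresholds taken from the milestone list above
MILESTONES = [50, 25, 10]


def milestones_reached(rank, total_teams):
    """Return the 'top X%' milestones that a given rank satisfies."""
    pct = top_percent(rank, total_teams)
    return [m for m in MILESTONES if pct <= m]


# Hypothetical example: rank 120 out of 1,000 teams is the top 12%
print(milestones_reached(120, 1000))
```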

2. Do not use Kaggle Exclusively

3. Focus on learning

The first and most important thing on Kaggle is to keep the focus on learning, at least when you are getting started. Kaggle is a highly gamified platform, and it is very easy to get stuck in the loop of chasing a better rank.


The focus should be on learning new things from this platform. Some of the things you can learn on Kaggle are:


- The algorithms that are best suited to a certain problem or dataset
- Data transformation required for algorithms
- Techniques to increase or decrease the number of features
- Best visualization to capture the insights and trends clearly
- The most commonly used libraries and packages, and why they are used

4. Start solving the problem at a high level

As you know, data analytics isn’t just about code or creating elaborate models. It’s about problem-solving. When you encounter a problem, the healthy approach is to take your hands off the keyboard and think about the problem at a high level first.

Apart from statistics, coding, and modelling, you also need domain expertise. If you are working with a dataset on the population growth of several countries, it’s important that you understand each and every feature of the dataset and how the features are related in the real world. I repeat, analytics isn’t just about numbers - the data isn’t just “numbers”.

What’s worse, some people dive into a problem without first understanding it. They might create a working model, sure. But in the end they will realize that there was a simpler solution available - a solution that they would have thought of had they spent enough time thinking about the problem in the first place.

5. Study other public notebooks

This platform hosts many leading data scientists and has over 5 million registered users. Through its public notebooks and discussion forums, it makes it possible to learn directly from the experts. I am not sure how many other industries provide such a level of democracy.


Studying the notebooks shared in the competitions will help you learn different ways of solving the same problem.

6. Is it worth competing if you don’t have a realistic chance of winning?

Yes!

No matter how experienced you are in data science, you can improve your skills by participating in competitions in this continuously growing and developing field. These data science competitions will challenge you at the edge of your own capabilities. The more time and effort you put into Kaggle competitions, the quicker you will get comfortable with the libraries and programming languages that you use.


Remember, you’re not necessarily committing to be a long-term Kaggler. If you find out that you dislike the format, then it’s no big deal. In fact, many people use Kaggle as a stepping stone before moving on to their own projects or becoming full-time data scientists.


Some beginners never start because they’re worried about low ranks showing up in their profile. Of course, competition anxiety is a real phenomenon, and it isn’t limited to Kaggle. However, low ranks are really not a big deal. No one else will judge you because they were all beginners at one point.
