What is Data Science?
Although Data Science has become widely popular in the last few years, the term has not yet found a place in the English dictionary. However, Oxford defines Data as “Facts and statistics collected together for reference or analysis” and Science as “The intellectual and practical activity encompassing the systematic study of the structure and behaviour of the physical and natural world through observation and experiment”.
Irrespective of the above definitions, I am more impressed with the role of a Data Scientist, who is “expected to analyze and interpret complex digital data to assist a business in its decision-making”.
Try Googling the term “Data Science Definition” and you will find tons of websites with various definitions. I define it as understanding data, driven by numbers, to help a business make informed decisions.
Data Science is:
- Data-driven decision making
- Building future strategies based on experience, by looking at past data
- Minimizing manual, judgment-based errors in prediction
There are many use cases for Data Science, but I will list a few for ease of understanding:
- Spam detection
- Credit Card Fraud detection
- Discount targeting by retail shops
- Cross Selling / upselling
- Image recognition and Image analysis
- Speech recognition
- Sentiment analysis
- Airline Route Planning
- Delivery Logistics
A career in Data Science is both exciting and challenging. I like challenges, I love working with numbers and I am already enjoying administering data!
After a lot of research online, below is my understanding of the different areas I need to specialise in to become a Data Scientist.
Mathematics forms the core of Data Science. If you are good at Maths, chances are you can excel in your role as a Data Scientist. To gain deep insight into the data you are working with, you need to be good at Mathematics. Not every Data Scientist has a PhD, but to get started we need to learn the basics of the subjects listed below.
I will dedicate an entire post to this subject, but here I would like to list a few Maths topics and their relevance to Data Science:
Descriptive Statistics: As the name implies, it is used to describe the properties of the given data. It tells us how the data is distributed, through measures such as the mean, median, mode and standard deviation.
Probability: Life is full of uncertainty, and we must tame it to use it for our needs. What is the probability that it rains today? What is the probability of heads when a coin is tossed? Probability is the backbone of mastering statistics for Machine Learning.
Inferential Statistics: The method of drawing conclusions about an entire population from a sample of data is called Inferential Statistics. Examples include making inferences about a population from a sample, finding out whether one model is better than another, and checking whether adding or removing a feature improves a model.
Linear Algebra: Machine Learning concepts are tied to Linear Algebra concepts. As humans, when we are given a picture of a cat, we identify it. Our brains are trained over a period of time to do that: they process many properties of the picture and recognise it. If we are to make a machine do the same, we need Linear Algebra, a mathematical toolbox that offers helpful techniques for manipulating groups of numbers simultaneously (typically in matrices).
Calculus: A good knowledge of basic calculus is essential for understanding and selecting Machine Learning algorithms.
MIT OpenCourseWare is a good place to start if you are looking for good sources to learn mathematics.
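To make the descriptive statistics and probability topics above a little more concrete, here is a minimal sketch using only Python's standard library. The `orders` list is made-up data and `estimate_heads_probability` is a hypothetical helper name I chose for illustration; it simply simulates fair coin tosses and reports the fraction of heads.

```python
import random
import statistics

# Hypothetical sample: daily order counts for a small shop (made-up numbers).
orders = [12, 15, 11, 15, 20, 18, 15, 9, 22, 13]

# Descriptive statistics: summarise how the data is distributed.
summary = {
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "mode": statistics.mode(orders),
    "stdev": statistics.stdev(orders),  # sample standard deviation
}


def estimate_heads_probability(n_tosses, seed=0):
    """Estimate P(heads) for a fair coin by simulating n_tosses tosses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses


if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value}")
    print("estimated P(heads):", estimate_heads_probability(10_000))
```

With more tosses the estimate tends to settle closer to 0.5, which is a small taste of why sample size matters once we move on to inferential statistics.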
There are many programming languages you can pick up for the data science discipline; however, the most popular ones are R and Python. While R is really the language statisticians use and sits right at the top of the list of data languages, I chose Python as it is comparatively easier to learn. Python is an open source language backed by a huge online community, and it is growing in strength every day. Thousands of libraries have been added to it, which make it a multi-purpose language.
We use the programming language to do Data Analysis, Data Exploration and Data Visualization.
I will do a separate post on the basics of Python for Data Science.
Finally, the most important part of Data Science: prediction, which is where Machine Learning comes in. In practice it is the ability to develop computer programs that can make predictions based on data. A Machine Learning workflow is the process required for carrying out a Machine Learning project. The main prerequisite for Machine Learning is Data Analysis.
The workflow runs as follows:
Observations or Getting Data – Cleaning the Data – Aggregating the Data – Exploring the Data – Visualizing the Data – Training an ML Algorithm – Testing the Model – Repeating the process
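The workflow above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the data is simulated (hours studied versus exam score, with a made-up relationship), every function name is hypothetical, and the "ML algorithm" is a simple least-squares line fit. The aggregation and visualization steps are left out to keep the sketch short.

```python
import random
import statistics


def get_data(n=50, seed=42):
    """Getting data: simulate (hours_studied, exam_score) pairs. Entirely made up."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hours = rng.uniform(0, 10)
        score = 40 + 5 * hours + rng.gauss(0, 3)  # assumed true relationship plus noise
        rows.append((hours, score))
    rows.append((None, 55.0))  # a deliberately broken record to clean up later
    return rows


def clean(rows):
    """Cleaning: drop records with missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def explore(rows):
    """Exploring: a quick numeric summary of the cleaned data."""
    return {
        "n": len(rows),
        "mean_hours": statistics.mean([x for x, _ in rows]),
        "mean_score": statistics.mean([y for _, y in rows]),
    }


def split(rows, test_fraction=0.2, seed=0):
    """Hold back part of the data so the model is tested on unseen records."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def train(rows):
    """Training: fit score = intercept + slope * hours by ordinary least squares."""
    mean_x = statistics.mean([x for x, _ in rows])
    mean_y = statistics.mean([y for _, y in rows])
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return slope, mean_y - slope * mean_x


def evaluate(model, rows):
    """Testing: mean absolute error of the predictions on held-out data."""
    slope, intercept = model
    return statistics.mean([abs(y - (intercept + slope * x)) for x, y in rows])


def run_workflow():
    rows = clean(get_data())
    summary = explore(rows)
    train_rows, test_rows = split(rows)
    model = train(train_rows)
    return model, evaluate(model, test_rows), summary


if __name__ == "__main__":
    model, error, summary = run_workflow()
    print("summary:", summary)
    print("slope, intercept:", model)
    print("mean absolute error on test data:", error)
```

If the test error is not good enough, we go back to an earlier step (more data, better cleaning, a different algorithm) and repeat the process.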
Machine Learning Algorithms are commonly categorized into two main types:
- Supervised: we provide a known, labelled dataset to train the algorithm and then use the trained model to predict the target for test data.
- Unsupervised: we provide unlabelled data and expect the machine to find structure in it on its own, such as groups of similar records.
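The difference between the two types is easier to see side by side. Below is a rough sketch, not a production approach: a one-nearest-neighbour classifier stands in for supervised learning and a basic two-cluster k-means stands in for unsupervised learning. The data (a count of suspicious words per email) and both function names are made up for illustration.

```python
def nearest_neighbour_predict(train_points, train_labels, query):
    """Supervised: return the label of the closest labelled training point."""
    distances = [abs(point - query) for point in train_points]
    return train_labels[distances.index(min(distances))]


def two_means(points, iterations=10):
    """Unsupervised: split 1-D points into two groups with a basic k-means (k=2)."""
    centres = [min(points), max(points)]  # simple starting guess
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for point in points:
            # Assign each point to whichever centre is closer.
            nearest = 0 if abs(point - centres[0]) <= abs(point - centres[1]) else 1
            groups[nearest].append(point)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


# Hypothetical feature: number of suspicious words in an email.
labelled_counts = [0, 1, 2, 8, 9, 10]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

if __name__ == "__main__":
    # Supervised: labels are known, so we can predict a label for a new email.
    print(nearest_neighbour_predict(labelled_counts, labels, 7))

    # Unsupervised: no labels, the algorithm only discovers two groups.
    print(two_means([1, 2, 3, 10, 11, 12]))
```

Note that the unsupervised version never says which group is spam; it only finds that the data falls into two groups, and interpreting them is left to us.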
Communication is a skill that, it goes without saying, is important for any career. I learnt from many online articles that a Data Scientist is expected to communicate a lot with different stakeholders within the organisation. A Data Scientist is also expected to take a structured approach to thinking through a problem.
The most important of all is practice, and there is no alternative to it. All the learning and skills above need to be put into practice. There are many online platforms that run Data Science competitions and hackathons. We can participate, pick up a dataset and start practicing. The ones I use are Kaggle and Analytics Vidhya, which have huge resources along with competition datasets.
Dive in and start your Data Science journey…!