There are some commonly used terms in the learning path of Data Science. I have made a list of a few such terms and try to explain each briefly and as simply as possible. Do not take these as definitions; instead, treat them as a basic understanding of each term, since there are plenty of websites dedicated to the formal definitions.

For ease of understanding I try to give an example for most of the terms in the best possible way. If there is a better way of putting an example (which I am sure there would be), I would gladly welcome suggestions and feedback as comments.

Part 1:

Algorithm:

It is a process or set of rules to complete a task. It is the backbone of computer science. We have different types of algorithms for different tasks.
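As a tiny illustration of "a set of rules to complete a task", here is a made-up example (not from the text above): an algorithm that scans a list once to find the largest value.

```python
# A simple algorithm: find the largest value in a list by scanning it once.
def find_max(values):
    largest = values[0]          # start with the first value
    for v in values[1:]:         # compare every remaining value
        if v > largest:
            largest = v          # keep the biggest one seen so far
    return largest

print(find_max([3, 7, 2, 9, 4]))  # 9
```

The same task could be solved by other algorithms (for example, sorting first and taking the last element); an algorithm is just one well-defined recipe among possibly many.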

Artificial Intelligence (AI):

As the term says it is an intelligence exhibited by the machines. The term “artificial intelligence” is applied when a machine mimics “cognitive” functions that humans associate with other human minds, such as “learning” and “problem solving”.

Example: Voice assistants, self-driving cars, and all the robots we hear about.

Bayes Theorem:

This describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

Example:  If cancer is related to age, then, using Bayes’ theorem, a person’s age can be used to more accurately assess the probability that they have cancer, compared to the assessment of the probability of cancer made without knowledge of the person’s age.
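The cancer-and-age example can be worked through numerically. The probabilities below are entirely hypothetical, chosen only to show the mechanics of the formula P(A|B) = P(B|A) × P(A) / P(B):

```python
# Bayes' theorem: P(cancer | age group) = P(age group | cancer) * P(cancer) / P(age group)
# All numbers below are hypothetical, for illustration only.
p_cancer = 0.01            # prior: 1% of people have cancer
p_age_given_cancer = 0.50  # 50% of cancer patients are in this age group
p_age = 0.15               # 15% of the whole population is in this age group

p_cancer_given_age = p_age_given_cancer * p_cancer / p_age
print(round(p_cancer_given_age, 4))  # 0.0333
```

Knowing the person's age group raised the estimated probability from the 1% prior to about 3.3%, which is exactly the kind of update the example describes.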

Bias:

In general, this means favouring or disfavouring one thing, person or group compared with another, in a way that is usually considered unfair.

Example: You need to pick a sample of 100 working people in a city whose population has a 60:40 male-to-female ratio. If your sample consists of 80 males and 20 females, it is biased towards males and the survey results do not represent the city.
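A quick sanity check like the one below can catch this kind of sampling bias, using the numbers from the example:

```python
# Compare the sample's composition against the known population split.
population = {"male": 0.60, "female": 0.40}  # city-wide proportions
sample = {"male": 80, "female": 20}          # who actually got surveyed

n = sum(sample.values())
for group, target in population.items():
    observed = sample[group] / n
    print(f"{group}: expected {target:.2f}, observed {observed:.2f}")
# male: expected 0.60, observed 0.80 -> the sample over-represents males
```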

Big Data:

These are large and complex datasets which traditional data processing software finds difficult to deal with. Big Data is often described along five dimensions: Volume, Variety, Velocity, Veracity and Value.

Binomial Distribution:

In simple words, this can be thought of as the probability distribution of an event that has only 2 possible outcomes, repeated over a number of independent trials.

The best example I can think of is tossing a coin, which has only 2 outcomes: Heads or Tails. Another example is PASS or FAIL in a test.
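The coin-tossing example can be computed directly from the binomial formula C(n, k) · p^k · (1 − p)^(n − k); the sketch below uses only the standard library:

```python
from math import comb

def binomial_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 5 heads in 10 fair coin tosses
print(round(binomial_pmf(10, 5, 0.5), 4))  # 0.2461
```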

Categorical variable:

It is a variable that can take one of a limited, and usually fixed, number of possible values. It can be numeric or non-numeric.

Example: In a school dataset, a column for student age can be treated as a categorical variable, because school-going students usually fall within the ages of 4 to 16. Similarly, the gender variable is also categorical, but non-numeric.
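A small sketch of the school example, with hypothetical records, showing both a numeric and a non-numeric categorical variable:

```python
from collections import Counter

# Hypothetical school records: age is a numeric categorical variable
# (a small fixed range of values), gender a non-numeric one.
ages = [5, 7, 7, 12, 16, 7]
genders = ["female", "male", "male", "female", "female", "male"]

print(Counter(ages))     # frequency of each age category
print(Counter(genders))  # frequency of each gender category
print(set(ages) <= set(range(4, 17)))  # True: all ages fall in the fixed 4-16 range
```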

Chi-squared test:

Pronounced 'KAI-squared' and written as the χ2 test, it is a statistical hypothesis test which measures goodness of fit. It tests whether two categorical variables are independent of each other in determining the outcome.

Example: In a dataset of employees' saving habits, we can try to find out whether gender (MALE / FEMALE) determines the saving habit (YES / NO) or not.
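The saving-habits example might be run as follows. The counts are hypothetical, and SciPy (a third-party library) is assumed to be available:

```python
from scipy.stats import chi2_contingency  # third-party; assumed available

# Hypothetical contingency table: rows = gender, columns = saves (yes / no)
observed = [[30, 20],   # male:   30 save, 20 do not
            [35, 15]]   # female: 35 save, 15 do not

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p-value = {p:.3f}")
# A small p-value (commonly below 0.05) would suggest gender and
# saving habit are NOT independent; a large one suggests they are.
```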

Classification:

This is a technique for identifying which of a set of categories a new observation belongs to. It is done by training the algorithm on a dataset of observations whose categories are already known.

Example: Marking an email into “spam” or “non-spam” classes.
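The spam example can be sketched with scikit-learn (a third-party library, assumed available); the training messages below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB  # third-party; assumed available

# Tiny, made-up training set with known categories
messages = ["win money now", "cheap prize win", "meeting at noon", "lunch tomorrow?"]
labels = ["spam", "non-spam", "non-spam", "non-spam"]
labels = ["spam", "spam", "non-spam", "non-spam"]  # category for each message

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)      # turn text into word counts
model = MultinomialNB().fit(X, labels)      # train on observations with known classes

# Classify a new, unseen message
print(model.predict(vectorizer.transform(["win a cheap prize"])))  # ['spam']
```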

Clustering:

It is grouping a set of objects in such a way that objects in the same group are more similar to each other than to objects in other groups.

Example: Clustering employees in an organization based on their performance metrics as “Outstanding”, “Excellent”, “Normal”, “Needs Improvement”.
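A minimal sketch of the employee example using k-means clustering from scikit-learn (third-party, assumed available), with hypothetical performance scores; note that k-means only assigns numeric cluster labels, and naming them "Outstanding" etc. would be a separate step:

```python
import numpy as np
from sklearn.cluster import KMeans  # third-party; assumed available

# Hypothetical performance scores for 8 employees (one feature each)
scores = np.array([[95], [92], [75], [73], [71], [50], [48], [90]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)
print(kmeans.labels_)  # employees with similar scores share a cluster label
```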

Coefficient:

It simply means a number used to multiply a variable. In the equation y = 10x, y is 10 times x: x is the variable and 10 is the coefficient of x. Coefficients are commonly used in finding the correlation between 2 variables and in measuring the reliability of a test's consistency over time.

Confidence interval:

A Confidence Interval is a range of values we are fairly sure our true value lies in. It is usually stated together with a confidence level, such as 95%.
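As a minimal sketch with made-up measurements, a 95% confidence interval for a mean can be computed with the normal approximation (1.96 standard errors either side):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of 8 measurements
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]

m, s, n = mean(data), stdev(data), len(data)
margin = 1.96 * s / sqrt(n)  # 1.96 covers ~95% of a normal distribution

print(f"95% CI: ({m - margin:.2f}, {m + margin:.2f})")
```

For small samples like this one, a t-distribution multiplier would be slightly more appropriate than 1.96, but the idea is the same: a range we are fairly sure contains the true mean.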

Correlation:

The term “correlation” refers to a mutual relationship or association between quantities. It determines how two variables are related to each other: whether they are positively or negatively related. In addition, it determines the degree to which the values tend to move together, i.e. the strength of the relationship. It takes values between -1 and 1: a negative value means that as one variable increases the other decreases, a positive value means they move in the same direction, and a value of zero or near zero means there is no correlation.
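The two directions can be seen with NumPy (third-party, assumed available) on made-up data:

```python
import numpy as np  # third-party; assumed available

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # moves with x
y_down = [10, 8, 6, 4, 2]   # moves against x

print(np.corrcoef(x, y_up)[0, 1])    # ~1.0  (perfect positive correlation)
print(np.corrcoef(x, y_down)[0, 1])  # ~-1.0 (perfect negative correlation)
```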

Covariance:

Covariance is similar to correlation in that it measures the mutual relationship of two variables. It captures whether the relationship is positive or negative but, unlike correlation, it does not measure the strength of the relationship on a fixed scale. It can take any value from negative infinity to positive infinity.
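The difference in scale can be seen with NumPy (third-party, assumed available) on made-up data: rescaling one variable changes the covariance but leaves the correlation untouched.

```python
import numpy as np  # third-party; assumed available

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Covariance depends on the units of the data...
print(np.cov(x, y)[0, 1])                     # 5.0
print(np.cov(x, [v * 100 for v in y])[0, 1])  # 500.0 (same data, scaled units)

# ...while correlation stays bounded between -1 and 1.
print(np.corrcoef(x, y)[0, 1])                # ~1.0 either way
```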

Cross-validation:

It is a model validation technique for assessing how the results of a statistical analysis will generalise to an independent data set, i.e. how well the model performs on data it has not seen. It works by splitting the original dataset into several parts, training the model on all subsets except one, and using the held-out subset for testing. The process is repeated for each subset in turn and the results are averaged to give a final estimate that is less prone to bias.
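A minimal sketch of 5-fold cross-validation with scikit-learn (third-party, assumed available), using its built-in Iris dataset and a logistic regression classifier as placeholder choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # third-party; assumed available

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, test on the held-out fold, repeat 5 times.
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy score per held-out fold
print(scores.mean())  # the averaged final estimate
```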

To be continued in Part-2

Categories: DataScience