Basic Data Science Interview Questions For Fresher

Q.1 What is marginal probability?

Marginal probability is the probability of a single event occurring on its own, irrespective of the outcomes of any other events. It is obtained by summing (or integrating) the joint probabilities of the event of interest over all possible outcomes of the other events.

Q.2 What are the probability axioms?

The probability axioms, sometimes known as the probability laws or probability principles, are the fundamental rules that govern how probabilities behave in probability theory and statistics.

There are three fundamental axioms of probability:

  1. Non-Negativity Axiom: the probability of any event is never negative, i.e., P(A) ≥ 0.
  2. Normalization Axiom: the probability of the entire sample space is 1, i.e., P(S) = 1.
  3. Additivity Axiom: for mutually exclusive (disjoint) events, the probability of their union is the sum of their probabilities, i.e., P(A ∪ B) = P(A) + P(B).

Q.3 What is conditional probability?

Conditional probability refers to the probability of an event occurring given that another event has already occurred. Mathematically, it is defined as the probability of event A occurring, given that event B has occurred, and is denoted by P(A|B).

The formula for conditional probability is:

P(A|B) = P(A ∩ B) / P(B)

where:

  • P(A|B) is the conditional probability of event A given event B.
  • P(A ∩ B) is the joint probability of both events A and B occurring simultaneously.
  • P(B) is the probability of event B occurring.
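For example, suppose a fair six-sided die is rolled. Let A be the event "the roll is even" and B the event "the roll is greater than 3". Then P(A ∩ B) = P({4, 6}) = 2/6 and P(B) = 3/6, so P(A|B) = (2/6) / (3/6) = 2/3.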

Q.4 What is Bayes’ Theorem and when is it used in data science?

Bayes' Theorem gives the probability of an event occurring given some related condition or evidence; it is essentially a statement about conditional probability. It is also known as the formula for the "probability of causes".
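Formally, for events A and B with P(B) > 0:

P(A|B) = P(B|A) · P(A) / P(B)

As a quick illustrative calculation (the numbers are made up): suppose a disease affects 1% of a population, a test detects it 99% of the time when the disease is present, and it gives a false positive 5% of the time. Then

P(disease | positive) = (0.99 × 0.01) / (0.99 × 0.01 + 0.05 × 0.99) ≈ 0.167,

which shows how Bayes' Theorem revises a prior belief (1%) in light of new evidence.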

In data science, Bayes’ Theorem is used primarily in:

  1. Bayesian Inference
  2. Machine Learning
  3. Text Classification
  4. Medical Diagnosis
  5. Predictive Modeling

When working with ambiguous or sparse data, Bayes’ Theorem is very helpful since it enables data scientists to continually revise their assumptions and come to more sensible conclusions.

Q.5 Define variance and conditional variance.

Variance is a statistical measure that quantifies the spread or dispersion of the data points in a dataset. It indicates how far individual data points deviate from the dataset's mean (average), i.e., how "scattered" the data are.

Conditional Variance

Conditional variance measures the dispersion or variability of a random variable given that another random variable takes a particular value (or that a particular event has occurred). In other words, it is the variance of a random variable computed with respect to its conditional distribution, given known information about another variable.

Q.6 Explain the concepts of mean, median, mode, and standard deviation.

Mean: The mean, often referred to as the average, is calculated by summing up all the values in a dataset and then dividing by the total number of values.

Median: When data are sorted in either ascending or descending order, the median is the value in the middle of the dataset. The median is the average of the two middle values when the number of data points is even.
In comparison to the mean, the median is less impacted by extreme numbers, making it a more reliable indicator of central tendency.

Mode: The value that appears most frequently in a dataset is the mode. One mode (unimodal), several modes (multimodal), or no mode (if all values occur with the same frequency) can all exist in a dataset.

Standard deviation: The standard deviation measures the spread or dispersion of data points in a dataset. It is the square root of the variance and expresses the typical deviation of data points from the mean, in the same units as the data.
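As a small illustration, the snippet below computes all four statistics for a made-up sample using Python's standard library:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample values

print(statistics.mean(data))    # 5.0   -> sum of values / number of values
print(statistics.median(data))  # 4.5   -> average of the two middle values (even count)
print(statistics.mode(data))    # 4     -> the most frequent value
print(statistics.stdev(data))   # ~2.14 -> sample standard deviation
```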

Q.7 What is the normal distribution and standard normal distribution?

The normal distribution, also known as the Gaussian distribution or bell curve, is a continuous probability distribution characterized by its symmetric, bell-shaped curve. It is defined by two parameters: the mean (μ), which determines the center of the distribution, and the standard deviation (σ), which determines its spread or dispersion. The curve is symmetric around the mean, and probabilities taper off equally in both directions as values move further from the mean, so extreme values in the two tails are equally rare. Note that not every symmetric distribution is normal, even though the normal distribution is symmetric.

The standard normal distribution, also known as the Z distribution, is a special case of the normal distribution where the mean (μ) is 0 and the standard deviation (σ) is 1. It is a standardized form of the normal distribution, allowing for easy comparison of scores or observations from different normal distributions.
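Any normal random variable X with mean μ and standard deviation σ can be converted to the standard normal scale with the z-score Z = (X − μ) / σ. For example (illustrative numbers), if exam scores follow a normal distribution with μ = 70 and σ = 10, a score of 85 corresponds to Z = (85 − 70) / 10 = 1.5, i.e., 1.5 standard deviations above the mean.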

Q.8 What is SQL, and what does it stand for?

SQL stands for Structured Query Language. It is a specialized programming language used for managing and manipulating relational databases, designed for tasks such as database management, data retrieval, data manipulation, and data definition.

Q.9 Explain the differences between SQL and NoSQL databases.

SQL (Structured Query Language) and NoSQL (Not Only SQL) databases differ in their data structures, schemas, query languages, and use cases. The main differences between SQL and NoSQL databases are:

  • Data model: SQL databases are relational databases; they organise and store data using a structured schema of tables, rows, and columns. NoSQL databases use a variety of data models, such as document-based (e.g., JSON and BSON), key-value pairs, column families, and graphs.
  • Schema: SQL databases have a fixed schema, so the structure of the data must be defined before inserting it, and changing the schema later can be a difficult process. NoSQL databases frequently employ a dynamic or schema-less approach, allowing data to be inserted without first defining a predetermined schema.
  • Query language: SQL databases use SQL, a powerful and standardised query language that supports complex operations such as joins, aggregations, and subqueries. NoSQL databases use query languages or APIs that are often tailored to their particular data model.

Q.10 What are the primary SQL database management systems (DBMS)?

The main SQL (Structured Query Language) database management systems (DBMS) are relational database systems, both open source and commercial, that are widely used for managing and processing structured data. Some of the most popular SQL database management systems are listed below:

  1. MySQL
  2. PostgreSQL
  3. Oracle Database
  4. Microsoft SQL Server
  5. SQLite
  6. MariaDB
  7. IBM Db2

Q.11 What is the ER model in SQL?

The Entity-Relationship (ER) model is a conceptual framework used in database design to represent the data entities in a database and the relationships between them. Although it is not part of the SQL language itself, the ER model is frequently used alongside SQL when designing the structure of relational databases.

Q.12 What is data transformation?

The process of transforming data from one structure, format, or representation into another is referred to as data transformation. In order to make the data more suited for a given goal, such as analysis, visualisation, reporting, or storage, this procedure may involve a variety of actions and changes to the data. Data integration, cleansing, and analysis depend heavily on data transformation, which is a common stage in data preparation and processing pipelines.

Q.13 What are the main components of a SQL query?

A relational database’s data can be retrieved, modified, or managed via a SQL (Structured Query Language) query. The operation of a SQL query is defined by a number of essential components, each of which serves a different function.

  1. SELECT
  2. FROM
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. ORDER BY
  7. LIMIT
  8. JOIN
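As a sketch, the query below strings most of these clauses together against a hypothetical orders table (the table, columns, and values are invented for illustration; JOIN is omitted for brevity), run through Python's built-in sqlite3 module:

```python
import sqlite3

# In-memory database with a hypothetical orders table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 120.0), (2, "bob", 80.0), (3, "alice", 200.0), (4, "carol", 50.0)],
)

# SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT
query = """
    SELECT customer, SUM(amount) AS total_spent
    FROM orders
    WHERE amount > 60            -- filter rows before grouping
    GROUP BY customer            -- one summary row per customer
    HAVING SUM(amount) > 100     -- filter groups after aggregation
    ORDER BY total_spent DESC    -- sort the result
    LIMIT 5                      -- cap the number of rows returned
"""
for row in conn.execute(query):
    print(row)                   # ('alice', 320.0)
conn.close()
```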

Q.14 What is a primary key?

A primary key is a column (or a combination of columns) in a relational database table that uniquely identifies each record. Primary key values must be unique, every row must have a primary key value, and that value cannot be NULL.

Q.15 What is the purpose of the GROUP BY clause, and how is it used?

In SQL, the GROUP BY clause is used to create summary rows from rows that share the same values in a set of specified columns. It is frequently used together with aggregate functions such as SUM, COUNT, AVG, MAX, or MIN to perform computations on groups of rows rather than on individual rows. Using the GROUP BY clause, we can produce summary reports and perform more in-depth data analysis.

Q.16 What is the WHERE clause used for, and how is it used to filter data?

In SQL, the WHERE clause is used to filter rows from a table or result set according to predetermined criteria. It enables us to pick only the rows that satisfy particular requirements or follow a pattern. A key element of SQL queries, the WHERE clause is frequently used for data retrieval and manipulation.

Q.17 How do you retrieve distinct values from a column in SQL?

Using the DISTINCT keyword in combination with the SELECT command, we can extract distinct values from a column in SQL. The DISTINCT keyword filters out duplicate values and returns only the unique values from the specified column.

Q.18 What is the HAVING clause?

The HAVING clause is used together with the GROUP BY clause to filter query results based on the output of aggregate functions. Unlike the WHERE clause, which filters rows before they are grouped, the HAVING clause filters groups of rows after they have been grouped by one or more columns.

Q.19 How do you handle missing or NULL values in a database table?

Missing or NULL values can arise for various reasons, such as incomplete data entry, optional fields, or data extraction processes. Common ways to handle them are listed below; a short SQL sketch follows the list.

  1. Replace NULL with Placeholder Values
  2. Handle NULL Values in Queries
  3. Use Default Values
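A minimal sketch of these strategies, assuming a hypothetical employees table (table name, columns, and values are invented for illustration) and using Python's built-in sqlite3 module:

```python
import sqlite3

# Hypothetical employees table with a NULL salary (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("alice", 90000.0), ("bob", None)])

# 1. Replace NULL with a placeholder value at query time.
print(conn.execute(
    "SELECT name, COALESCE(salary, 0) FROM employees").fetchall())
# [('alice', 90000.0), ('bob', 0)]

# 2. Handle NULLs explicitly in queries with IS NULL / IS NOT NULL.
print(conn.execute(
    "SELECT name FROM employees WHERE salary IS NOT NULL").fetchall())
# [('alice',)]

# 3. Use a column default so future inserts never store NULL.
conn.execute("CREATE TABLE new_hires (name TEXT, salary REAL DEFAULT 0)")
conn.close()
```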

Q.20 What is the difference between supervised and unsupervised machine learning?

The differences between supervised learning and unsupervised learning are as follows:

  • Definition: Supervised learning is the part of machine learning where the target variable is known and the data is labeled. Unsupervised learning is used when we do not have labeled data and are not sure about the target variable.
  • Objective: The objective of supervised learning is to predict an outcome or classify the data. The objective of unsupervised learning is to discover patterns among the features of the dataset and group similar data points together.
  • Algorithms: Supervised learning algorithms include regression (Linear, Logistic, etc.) and classification (Decision Tree Classifier, Support Vector Classifier, etc.). Unsupervised learning algorithms include dimensionality reduction (Principal Component Analysis, etc.) and clustering (KMeans, DBSCAN, etc.).
  • Evaluation metrics: Supervised learning uses metrics such as Mean Squared Error and Accuracy. Unsupervised learning uses metrics such as Silhouette score and Inertia.
  • Use cases: Supervised learning is used for predictive modeling and spam detection. Unsupervised learning is used for anomaly detection and customer segmentation.

Q.21 What is linear regression, and what are the different assumptions of the linear regression algorithm?

Linear Regression: It is a type of supervised learning where we model a linear relationship between the predictor and response variables. It is based on the linear equation:

ŷ = β₁x + β₀,

where

  • ŷ = response / dependent variable
  • β₁ = slope of the linear regression
  • β₀ = intercept of the linear regression
  • x = predictor / independent variable(s)

There are 4 assumptions we make about a Linear regression problem:

  • Linear relationship: There is a linear relationship between the predictor and the response variable, i.e., as the values of the predictor variable change, the response variable changes linearly (increases or decreases).
  • Normality: The residuals (errors) of the model are normally distributed, i.e., symmetric about zero.
  • Independence: The observations (and hence the errors) are independent of each other, and there is little or no correlation (multicollinearity) among the predictor variables.
  • Homoscedasticity: The variance of the residuals is constant across all values of the predictor variables, i.e., the spread of the errors does not grow or shrink as the predictor values change.
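Putting the equation above into practice, here is a minimal sketch with scikit-learn fitted to synthetic data (the values and the random seed are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data roughly following y = 3x + 2 with a little noise (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # estimated slope and intercept, close to 3 and 2
print(model.predict([[5.0]]))             # prediction for a new x value
```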

Q.22 Logistic regression is a classification technique, so why is it called regression and not classification?

While logistic regression is used for classification, it still maintains a regression structure underneath. The key idea is to model the probability of an event occurring (e.g., class 1 in binary classification) using a linear combination of features, and then apply a logistic (Sigmoid) function to transform this linear combination into a probability between 0 and 1. This transformation is what makes it suitable for classification tasks.

In essence, while logistic regression is indeed used for classification, it retains the mathematical and structural characteristics of a regression model, hence the name.

Q.23 What is the logistic function (sigmoid function) in logistic regression?

Sigmoid Function: It is a mathematical function characterized by its S-shaped curve. The sigmoid squashes any real-valued input into a value between 0 and 1, which is why it is also called a squashing function. It is given as:

σ(x) = 1 / (1 + e^(−x))

Some of the properties of the sigmoid function are:

  • Range: (0, 1)
  • It is monotonically increasing and differentiable everywhere.
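A tiny sketch of the function in Python (purely illustrative):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.0067, 0.5, 0.9933]
```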

Q.24 What is overfitting and how can it be overcome?

Overfitting occurs when a model fits the training data so closely that it fails to generalize to unseen/future data. This often happens when the model is trained on noisy data and learns the noise as if it were a real pattern.

To avoid Overfitting and overcome this problem in machine learning, one can follow the following rules:

  • Feature selection : Sometimes the training data has too many features which might not be necessary for our problem statement. In that case, we use only the necessary features that serve our purpose
  • Cross Validation : A powerful technique to detect and reduce overfitting. The training dataset is split into several folds; the model is repeatedly trained on some folds and validated on the held-out fold, giving a more reliable estimate of how it will generalize.
  • Regularization : Regularization is the technique to supplement the loss with a penalty term so as to reduce overfitting. This penalty term regulates the overall loss function, thus creating a well trained model.
  • Ensemble models : These models learn the features and combine the results from different training models into a single prediction.

Q.25 What is a support vector machine (SVM), and what are its key components?

Support Vector machines are a type of Supervised algorithm which can be used for both Regression and Classification problems. In SVMs, the main goal is to find a hyperplane which will be used to segregate different data points into classes. Any new data point will be classified based on this defined hyperplane.

Support Vector Machines are highly effective when dealing with high-dimensional spaces and can handle non-linear data very well. However, if the number of features is much greater than the number of data samples, they are susceptible to overfitting.

The key components of SVM are:

  • Kernel function: A mapping function that projects the data points into a higher-dimensional feature space, so that a separating hyperplane can be found even for non-linearly separable data.
  • Hyperplane: The decision boundary used to differentiate between the classes of data points.
  • Margin: The distance between the support vectors and the hyperplane; SVM tries to maximize this margin.
  • C: A regularization parameter that balances margin maximization against the misclassification of training points.
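A minimal scikit-learn sketch showing these components on synthetic data (dataset and parameter values are illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic binary-classification dataset (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The RBF kernel maps points into a higher-dimensional space; C controls the
# trade-off between a wide margin and misclassifying training points.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))             # training accuracy
print(len(clf.support_vectors_))   # number of support vectors found
```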

Q.26 Explain the k-nearest neighbors (KNN) algorithm.

The k-Nearest Neighbors (KNN) algorithm is a simple and versatile supervised machine learning algorithm used for both classification and regression tasks. KNN makes predictions by memorizing the training data points rather than building an explicit model, which is why it is also called a "lazy learner" or "memory-based" model.

KNN relies on the principle that similar data points tend to belong to the same class or have similar target values. In the training phase, KNN simply stores the entire dataset of feature vectors and their corresponding class labels (for classification) or target values (for regression). To make a prediction for a new point, it calculates the distance between that point and all the points in the training dataset (commonly using Euclidean or Manhattan distance) and aggregates the labels of the k nearest neighbours.

(Note : Choosing an appropriate value for k is crucial. A small k may result in noisy predictions, while a large k can smooth out the decision boundaries. The choice of distance metric and feature scaling also impact KNN’s performance.)
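A brief scikit-learn sketch, using the Iris dataset and k = 5 purely as an illustrative starting point; features are scaled because KNN is distance based:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features, then classify each point by its 5 nearest neighbours.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on held-out data
```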

Q.27 What is the Naïve Bayes algorithm, what are the different assumptions of Naïve Bayes?

The Naïve Bayes algorithm is a probabilistic classification algorithm based on Bayes’ theorem with a “naïve” assumption of feature independence within each class. It is commonly used for both binary and multi-class classification tasks, particularly in situations where simplicity, speed, and efficiency are essential.

The main assumptions that Naïve Bayes theorem makes are:

  1. Feature independence: It assumes that the features are conditionally independent given the class, i.e., the presence or absence of one feature does not affect any other feature.
  2. Equal importance: It assumes that all features contribute equally to the prediction (equal weight).
  3. Normality: For continuous features, Gaussian Naïve Bayes assumes that within each class the feature values follow a normal distribution.
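A minimal sketch of Gaussian Naïve Bayes in scikit-learn on synthetic data (the dataset and seed are made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary-classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# GaussianNB treats each feature as conditionally independent and
# normally distributed within each class.
nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))        # accuracy on held-out data
print(nb.predict_proba(X_test[:1]))    # class probabilities for one sample
```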

Q.28 What are decision trees, and how do they work?

Decision trees are a popular machine learning algorithm used for both classification and regression tasks. They work by creating a tree-like structure of decisions based on the input features to make predictions or decisions. Briefly, their core concepts and how they work:

  • Decision trees consist of nodes and edges.
  • The tree starts with a root node and branches into internal nodes that represent features or attributes.
  • These nodes contain decision rules that split the data into subsets.
  • Edges connect nodes and indicate the possible decisions or outcomes.
  • Leaf nodes represent the final predictions or decisions.

The objective of each split is to increase the homogeneity of the resulting subsets, which is often measured using criteria like mean squared error (for regression) or Gini impurity (for classification). Decision trees can handle a variety of attributes and can effectively capture complex data relationships. They can, however, overfit, especially when deep or complex. To reduce overfitting, strategies like pruning and restricting tree depth are applied.
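A short scikit-learn sketch that fits a depth-limited tree to the Iris dataset and prints the learned decision rules (dataset and depth are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Limiting depth is one simple way to reduce overfitting.
tree = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0)
tree.fit(X, y)

# Print the learned rules: root node, internal splits, and leaf predictions.
print(export_text(
    tree, feature_names=["sepal len", "sepal wid", "petal len", "petal wid"]))
```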

Q.29 Explain the concepts of entropy and information gain in decision trees.

Entropy: Entropy is a measure of randomness. In machine learning, entropy measures the randomness or impurity in a dataset. It is given as:

E = -Σ pᵢ log₂(pᵢ), where pᵢ = probability of class "i".

Information gain: It is defined as the reduction in entropy obtained by splitting the dataset on a feature. If a decision tree split produces more than one child node, the weighted average of the children's entropies is used.

Information gain = E(parent) - Σ wⱼ E(childⱼ), where E = entropy and wⱼ = fraction of samples in child node j.
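A small sketch of both quantities computed by hand in Python (the class counts are made up):

```python
import math

def entropy(class_counts):
    # H = -sum(p_i * log2(p_i)) over classes with non-zero probability.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

parent = entropy([8, 8])                        # perfectly mixed node -> 1.0 bit
left, right = entropy([7, 1]), entropy([1, 7])  # two purer child nodes after a split
# Information gain = parent entropy - weighted average of child entropies.
gain = parent - (8 / 16) * left - (8 / 16) * right
print(round(parent, 3), round(gain, 3))         # 1.0 and about 0.456
```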

Q.30 What is the difference between the bagging and boosting model?

  • Definition: Bagging (bootstrap aggregating) is an ensemble modelling method in which predictions from different models, each trained on a bootstrap sample of the data, are combined to give an aggregated result. Boosting is an ensemble method in which multiple weak learners are trained sequentially and combined to get a stronger model with more robust predictions.
  • Purpose: Bagging is used when dealing with models that have high variance (overfitting). Boosting is used when dealing with models that have high bias (underfitting), and it can reduce variance as well.
  • Robustness to noise and sensitivity: Bagging is more robust due to averaging, which makes it less sensitive to outliers and noise. Boosting is more sensitive to the presence of outliers, which makes it somewhat less robust than bagging.
  • Model training and dependence: In bagging, the models are trained in parallel and are typically independent of each other. In boosting, the models are trained sequentially, and each model depends on the output of the previous ones.
  • Examples: Bagging includes Random Forest and bagged decision trees; boosting includes AdaBoost, Gradient Boosting, and XGBoost.

Q.31 Describe random forests and their advantages over single-decision trees.

Random Forests are an ensemble learning technique that combines multiple decision trees to improve predictive accuracy and reduce overfitting. The advantages it has over single decision trees are:

  • Improved Generalization: Single decision trees are prone to overfitting, especially when they become deep and complex. Random Forests mitigate this issue by averaging predictions from multiple trees, resulting in a more generalized model that performs better on unseen data
  • Better Handling of High-Dimensional Data : Random Forests are effective at handling datasets with a large number of features. They select a random subset of features for each tree, which can improve the performance when there are many irrelevant or noisy features
  • Robustness to Outliers: Random Forests are more robust to outliers because they combine predictions from multiple trees, which can better handle extreme cases

Q.32 What is K-Means, and how will it work?

K-Means is an unsupervised machine learning algorithm used for clustering, i.e., grouping similar data points together. It aims to partition a dataset into K clusters, where each cluster represents a group of data points that are close to each other in terms of some similarity measure. K-Means works as follows (a short scikit-learn sketch appears after the steps):

  • Choose the number of clusters K and initialize K centroids (for example, by picking K random data points).
  • For each data point, calculate its distance to each of the K centroids and assign it to the cluster whose centroid is closest.
  • Recalculate each centroid as the mean of the data points currently assigned to its cluster.
  • Repeat the assignment and update steps until the centroids stop changing (or a maximum number of iterations is reached).
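A minimal sketch on two synthetic blobs of points (data and parameters are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of 2-D points (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # learned centroids, near (0, 0) and (5, 5)
print(kmeans.labels_[:5])        # cluster assignment of the first five points
```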

Q.33 What is a confusion matrix? Explain with an example.

Confusion matrix is a table used to evaluate the performance of a classification model by presenting a comprehensive view of the model’s predictions compared to the actual class labels. It provides valuable information for assessing the model’s accuracy, precision, recall, and other performance metrics in a binary or multi-class classification problem.

A well-known example is a cancer-screening confusion matrix:

                          Actual: Cancer           Actual: Not Cancer
  Predicted: Cancer       True Positive (TP)       False Positive (FP)
  Predicted: Not Cancer   False Negative (FN)      True Negative (TN)

  • TP (True Positive) = The number of instances correctly predicted as the positive class
  • TN (True Negative) = The number of instances correctly predicted as the negative class
  • FP (False Positive) = The number of instances incorrectly predicted as the positive class
  • FN (False Negative) = The number of instances incorrectly predicted as the negative class
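Putting these counts into code, here is a small scikit-learn sketch with hypothetical labels (1 = cancer, 0 = not cancer; the vectors are made up):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical cancer-screening labels: 1 = cancer, 0 = not cancer.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes (ordered [1, 0]).
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[3 1]    -> TP = 3, FN = 1
#  [1 3]]   -> FP = 1, TN = 3
```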

Q.34 What is a classification report? Explain the parameters used to interpret the results of classification tasks, with an example.

A classification report is a summary of the performance of a classification model, providing various metrics that help assess the quality of the model’s predictions on a classification task.

The parameters used in a classification report typically include:

  • Precision: Precision is the ratio of true positive predictions to the total predicted positives. It measures the accuracy of positive predictions made by the model.

Precision = TP/(TP+FP)

  • Recall (Sensitivity or True Positive Rate): Recall is the ratio of true positive predictions to the total actual positives. It measures the model’s ability to identify all positive instances correctly.

Recall = TP / (TP + FN)

  • Accuracy: Accuracy is the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances. It measures the overall correctness of the model’s predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • F1-Score: The F1-Score is the harmonic mean of precision and recall. It provides a balanced measure of both precision and recall and is particularly useful when dealing with imbalanced datasets.

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

where,

  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative
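Continuing the confusion-matrix example above (same illustrative labels), scikit-learn can produce the full report in one call:

```python
from sklearn.metrics import classification_report

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Precision, recall, F1-score and support for each class, plus overall accuracy.
# For class 1 here: precision = recall = 0.75; overall accuracy = 0.75.
print(classification_report(y_true, y_pred))
```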
