Intermediate Data Science Interview Questions

Q.35 Explain the uniform distribution.

The uniform distribution, sometimes called the rectangular distribution, is a fundamental probability distribution in statistics. It is characterised by a constant probability density function (PDF) over a finite interval [a, b]. In simpler terms, every value within the specified range has an equal chance of occurring.
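
For the continuous uniform distribution on the interval [a, b], the PDF is constant over that interval:

[Tex]f(x) = \frac{1}{b-a}, \quad a \le x \le b[/Tex]

and zero elsewhere; the mean is [Tex]\frac{a+b}{2}[/Tex] and the variance is [Tex]\frac{(b-a)^2}{12}[/Tex].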

Q.36 Describe the Bernoulli distribution.

The Bernoulli distribution is a discrete probability distribution for a random variable that takes only two values: 1 (success) with probability p and 0 (failure) with probability 1 − p. It models a single trial with exactly two possible outcomes, such as a single coin toss (heads or tails) or whether a single email is spam or not. It is the building block of the binomial distribution, which counts the number of successes over several independent Bernoulli trials.
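
For reference, the probability mass function (PMF) of a Bernoulli random variable X with success probability p is:

[Tex]P(X = x) = p^{x}(1-p)^{1-x}, \quad x \in \{0, 1\}[/Tex]

with mean p and variance p(1 − p).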

Q.37 What is the binomial distribution?

The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, where each trial has only two possible outcomes: success or failure. The outcomes are often referred to as “success” and “failure,” but they can represent any dichotomous outcome, such as heads or tails, yes or no, or defective or non-defective.

The fundamental assumptions of a binomial distribution are that each trial has exactly two possible outcomes (success or failure), the probability of success is the same for every trial, the trials are independent of one another, and the number of trials is fixed in advance.
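
If X counts the number of successes in n independent trials, each with success probability p, its PMF is:

[Tex]P(X = k) = \binom{n}{k} p^{k}(1-p)^{n-k}, \quad k = 0, 1, \dots, n[/Tex]

with mean np and variance np(1 − p).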

Q.38 Explain the exponential distribution and where it’s commonly used.

The exponential distribution is the probability distribution of the amount of time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. It is a special case of the gamma distribution, and it is the continuous analogue of the geometric distribution. A key property is memorylessness: the probability of waiting an additional amount of time does not depend on how long you have already waited.
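
With rate parameter [Tex]\lambda > 0[/Tex], the exponential PDF is:

[Tex]f(x) = \lambda e^{-\lambda x}, \quad x \ge 0[/Tex]

with mean [Tex]1/\lambda[/Tex] and variance [Tex]1/\lambda^{2}[/Tex].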

Common applications of the exponential distribution include:

  1. Reliability Engineering
  2. Queueing Theory
  3. Telecommunications
  4. Finance
  5. Natural Phenomena
  6. Survival Analysis

Q.39 Describe the Poisson distribution and its characteristics.

The Poisson distribution is a probability distribution that describes the number of events that occur within a fixed interval of time or space when the events happen at a constant mean rate and are independent of the time since the last event.
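
If events occur at a mean rate of [Tex]\lambda[/Tex] per interval, the probability of observing k events in an interval is:

[Tex]P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots[/Tex]

Both the mean and the variance of the Poisson distribution equal [Tex]\lambda[/Tex].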

Key characteristics of the Poisson distribution include:

  1. Discreteness: The Poisson distribution is used to model the number of discrete events that occur within a fixed interval.
  2. Constant Mean Rate: The events occur at a constant mean rate per unit of time or space.
  3. Independence: The occurrences of events are assumed to be independent of each other. The probability of multiple events occurring in a given interval is calculated based on the assumption of independence.

Q.40 Explain the t-distribution and its relationship with the normal distribution.

The t-distribution, also known as the Student’s t-distribution, is used in statistics for inferences about population means when the sample size is small and the population standard deviation is unknown. The shape of the t-distribution is similar to the normal distribution, but it has heavier tails.

Relationship between the t-distribution and the normal distribution: The t-distribution converges to the normal distribution as the degrees of freedom increase. When the degrees of freedom become very large, the t-distribution approaches the standard normal distribution (normal distribution with mean 0 and standard deviation 1). This happens because, as the sample size grows, the sample standard deviation becomes an increasingly accurate estimate of the population standard deviation, so the extra uncertainty that gives the t-distribution its heavier tails disappears.

Q.41 Describe the chi-squared distribution.

The chi-squared distribution is a continuous probability distribution that arises in statistics and probability theory. It is commonly denoted as χ2 (chi-squared) and is associated with degrees of freedom. The chi-squared distribution is particularly used to model the distribution of the sum of squared independent standard normal random variables. It is also used to determine if data series are independent, the goodness of fit of a data distribution, and the level of confidence in the variance and standard deviation of a random variable with a normal distribution.
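
Formally, if [Tex]Z_1, Z_2, \dots, Z_k[/Tex] are independent standard normal random variables, then

[Tex]\chi^{2} = \sum_{i=1}^{k} Z_i^{2}[/Tex]

follows a chi-squared distribution with k degrees of freedom; its mean is k and its variance is 2k.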

Q.42 What is the difference between z-test, F-test, and t-test?

The z-test, t-test, and F-test are all statistical hypothesis tests used in different situations and for different purposes. Here's an overview of each test and the key differences between them; a short code sketch appears at the end of this answer.

z-test

  • Used to compare a sample mean to a known population mean when the population standard deviation is known.
  • Most frequently used with large sample sizes or when the population standard deviation is known.
  • The test statistic follows a standard normal distribution when the test's assumptions are met.

t-test

  • Used to compare a sample mean to a known or assumed population mean when the population standard deviation is unknown.
  • The test statistic follows a t-distribution, whose shape depends on the degrees of freedom.
  • The sample standard deviation (s) is used in place of the population standard deviation when computing the test statistic.

F-test

  • Used to compare the variances of two or more samples; it is commonly used in analysis of variance (ANOVA) and regression analysis.
  • The most common variant is the two-sample F-test, which compares the variances of two independent samples.
  • The F-distribution has a separate set of degrees of freedom for each sample.

In summary, the choice between a z-test, t-test, or F-test depends on the specific research question and the characteristics of the data.
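
As a quick illustration, here is a minimal sketch (using SciPy and synthetic data) of a one-sample t-test and an ANOVA F-test; the sample values and the hypothesised mean of 50 are made up for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One-sample t-test: is the sample mean different from a hypothesised mean of 50?
sample = rng.normal(loc=52, scale=5, size=30)
t_stat, t_pval = stats.ttest_1samp(sample, popmean=50)
print(f"t-test: t = {t_stat:.3f}, p = {t_pval:.3f}")

# F-test via one-way ANOVA: do three groups share the same mean?
group_a = rng.normal(50, 5, 30)
group_b = rng.normal(53, 5, 30)
group_c = rng.normal(49, 5, 30)
f_stat, f_pval = stats.f_oneway(group_a, group_b, group_c)
print(f"F-test: F = {f_stat:.3f}, p = {f_pval:.3f}")
```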

Q.43 What is the central limit theorem, and why is it significant in statistics?

The Central Limit Theorem states that, regardless of the shape of the population distribution, the distribution of the sample means approaches a normal distribution as the sample size increases (provided the samples are independent and identically distributed with finite variance). This is true even if the population distribution is not normal. The larger the sample size, the closer the sampling distribution of the sample mean will be to a normal distribution. This is significant because it justifies using normal-based inference, such as z- and t-procedures and confidence intervals for means, even when the underlying data are not normally distributed.

Q.44 Describe the process of hypothesis testing, including null and alternative hypotheses.

Hypothesis testing is a statistical method used to make inferences about population parameters based on sample data. It is a systematic way of evaluating statements or hypotheses about a population using observed sample data. It compares two mutually exclusive statements about a population to identify which one is better supported by the sample data.

  • Null hypothesis (H0): The null hypothesis is the default assumption or assertion that there is no effect or no association between the two measured groups or variables. In other words, it is the baseline assumption, founded on existing knowledge of the problem.
  • Alternative hypothesis (H1): The alternative hypothesis is the statement that contradicts the null hypothesis; it is what the test seeks evidence for, and it is accepted when the null hypothesis is rejected.

Q.45 How do you calculate a confidence interval, and what does it represent?

A confidence interval (CI) is a statistical range, or interval estimate, for a population parameter, such as the population mean or population proportion, based on sample data. The following steps are used to calculate a confidence interval (a worked sketch appears at the end of this answer).

  1. Collect Sample Data
  2. Choose a Confidence Level
  3. Select the Appropriate Statistical Method
  4. Calculate the Margin of Error (MOE)
  5. Calculate the Confidence Interval
  6. Interpret the Confidence Interval

Confidence interval represents a range of values within which we believe, with a specified level of confidence (e.g., 95%), that the true population parameter lies.
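
As a minimal sketch, assuming a 95% confidence interval for a mean with an unknown population standard deviation (so a t-based interval) and made-up sample values:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])  # made-up measurements

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```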

Q.46 What is a p-value in Statistics?

The p-value (probability value) is a key concept in statistics and hypothesis testing. It is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true. A small p-value (typically below the chosen significance level, e.g., 0.05) is taken as evidence against the null hypothesis and indicates that the result is statistically significant; a large p-value means the data are consistent with the null hypothesis.

Q.47 Explain Type I and Type II errors in hypothesis testing.

Rejecting a null hypothesis that is actually true in the population results in a type I error (false-positive); failing to reject a null hypothesis that is actually untrue in the population results in a type II error (false-negative).

Although Type I and Type II errors cannot be completely avoided, the investigator can reduce their likelihood by increasing the sample size, since a larger sample is less likely to differ substantially from the population.

Q.48 What is the significance level (alpha) in hypothesis testing?

The significance level, usually denoted as α (alpha), is a key threshold in hypothesis testing: it sets the bar for judging whether the results of a statistical test are statistically significant. It represents the maximum acceptable probability of committing a Type I error, i.e., mistakenly rejecting a true null hypothesis.

The role of the significance level in hypothesis testing involves:

  1. Setting the Significance Level
  2. Interpreting the Significance Level
  3. Hypothesis Testing Using Significance Level
  4. Choice of Significance Level

Q.49 How can you calculate the correlation coefficient between two variables?

The correlation coefficient quantifies the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient is the most widely used version. It can be calculated with the following steps (the formula is given after the list).

  1. Collect Data
  2. Calculate the Means
  3. Calculate the Covariance
  4. Calculate the Standard Deviations
  5. Calculate the Pearson Correlation Coefficient (r)
  6. Interpret the Correlation Coefficient.
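
Putting the steps together, the Pearson correlation coefficient for paired observations [Tex](x_i, y_i)[/Tex] is:

[Tex]r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}[/Tex]

where [Tex]\bar{x}[/Tex] and [Tex]\bar{y}[/Tex] are the sample means; r ranges from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship). In practice, np.corrcoef in NumPy computes this directly.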

Q.50 What is covariance, and how is it related to correlation?

Both covariance and correlation are statistical metrics that describe how two variables are related to one another. However, they serve slightly different purposes and have different interpretations.

  • Covariance: Covariance measures the degree to which two variables change together. It expresses how much the values of one variable tend to rise or fall in relation to changes in the other variable. Its magnitude depends on the units of the variables, so it is not directly comparable across datasets.
  • Correlation: Correlation is a standardised measure of the strength and direction of a linear relationship between two variables. It is obtained by dividing the covariance by the product of the two variables' standard deviations, which scales the result to the range −1 to +1.
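
The relationship can be written as:

[Tex]\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}[/Tex]

where [Tex]\mathrm{Cov}(X, Y)[/Tex] is the covariance and [Tex]\sigma_X[/Tex], [Tex]\sigma_Y[/Tex] are the standard deviations of X and Y.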

Q.51 Explain how to perform a hypothesis test for comparing two population means.

When comparing two population means, a hypothesis test is used to determine whether there is sufficient statistical evidence to claim that the means of the two populations differ significantly. Commonly used tests include the paired t-test (for dependent samples) and the two-sample t-test (for independent samples). The general procedure for carrying out such a test is as follows (a code sketch follows the steps).

  1. Formulate Hypotheses
  2. Choose the Significance Level
  3. Collect Data
  4. Define Test Statistic
  5. Draw a Conclusion
  6. Final Results
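
As a minimal sketch of a two-sample (independent) t-test on made-up data, using SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Made-up samples from two populations, e.g., scores of two groups
group_1 = rng.normal(loc=70, scale=8, size=40)
group_2 = rng.normal(loc=74, scale=8, size=40)

# Two-sample t-test (Welch's version, which does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group_1, group_2, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```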

Q.52 Explain the concept of normalization in database design.

Normalisation is a technique in database design that helps organise data efficiently by minimising data duplication and enhancing data integrity. It involves dividing a big, complicated table into smaller, related tables while making sure that the connections between data elements are preserved. The basic objective of normalisation is to reduce data anomalies (insertion, update, and deletion anomalies), which can occur when data is stored in an unorganised way.

Q.53 What is database denormalization?

Database denormalization is the process of intentionally introducing redundancy into a relational database by merging tables or incorporating redundant data to enhance query performance. Unlike normalization, which minimizes data redundancy for consistency, denormalization prioritizes query speed. By reducing the number of joins required, denormalization can improve read performance for complex queries. However, it may lead to data inconsistencies and increased maintenance complexity. Denormalization is often employed in scenarios where read-intensive operations outweigh the importance of maintaining a fully normalized database structure. Careful consideration and trade-offs are essential to strike a balance between performance and data integrity.

Q.54 Define different types of SQL functions.

SQL functions can be categorized into several types based on their functionality.

  1. Scalar Functions
  2. Aggregate Functions
  3. Window Functions
  4. Table-Valued Functions
  5. System Functions
  6. User-Defined Functions
  7. Conversion Functions
  8. Conditional Functions

Q.55 Explain the difference between INNER JOIN and LEFT JOIN.

INNER JOIN and LEFT JOIN are two types of SQL JOIN operations used to combine data from multiple tables in a relational database. Here are some of the main differences between them, followed by a short runnable example.

INNER JOIN

  • Returns only the rows that have a match in the designated columns of both tables being joined.
  • A row is not included in the result set if it has no match in the other table.
  • Useful when we want to retrieve data only where both tables satisfy a specific matching criterion.

LEFT JOIN

  • Returns all rows from the left table and the matching rows from the right table.
  • If a left-table row has no match, the columns from the right table are returned as NULL values for that row.
  • Ensures that every row from the left table appears in the result set, even if there are no matches for it in the right table.
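
A minimal, self-contained sketch of the difference, using Python's built-in sqlite3 module and two hypothetical tables (customers and orders) invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables: every order belongs to a customer, but not every customer has orders
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 90.0), (12, 2, 40.0);
""")

# INNER JOIN: only customers that have at least one matching order
cur.execute("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
""")
print("INNER JOIN:", cur.fetchall())   # Carol is dropped (no orders)

# LEFT JOIN: every customer, with NULL (None) where no order matches
cur.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
""")
print("LEFT JOIN:", cur.fetchall())    # Carol appears with amount = None

conn.close()
```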

Q.56 What is a subquery, and how can it be used in SQL?

A subquery, also referred to as an inner query or nested query, is a query that is nested within another SQL query. It lets us retrieve data from one or more tables based on the results of another query. Subqueries are used for a variety of tasks, including data retrieval, computations, and filtering, and they can appear in SELECT, FROM, WHERE, and HAVING clauses.

Q.57 How do you perform mathematical calculations in SQL queries?

In SQL, we can perform mathematical calculations in queries using arithmetic operators and functions. Here are some common methods for performing mathematical calculations.

  1. Arithmetic Operators
  2. Mathematical Functions
  3. Aggregate Functions
  4. Custom Expressions

Q.58 What is the purpose of the CASE statement in SQL?

The SQL CASE statement is a flexible conditional expression that can be used to implement conditional logic inside a query. It lets us return different values or perform different actions based on predefined conditions, similar to if/else logic in programming languages.

Q.59 What is the difference between a database and a data warehouse?

Database: Databases are optimised for storing, retrieving, and managing structured data, prioritising consistency and real-time (transactional) processing. They are frequently used for operational functions like order processing, inventory control, and customer interactions.

Data Warehouse: Data warehouses are made for processing analytical data. They are designed to facilitate sophisticated querying and reporting by storing and processing massive amounts of historical data from various sources. Business intelligence, data analysis, and decision-making all employ data warehouses.

Q.60 What is regularization in machine learning? State the differences between L1 and L2 regularization.

Regularization: Regularization is a technique used to restrict model overfitting during training by adding a penalty term to the loss function. The penalty controls the complexity of the model, thereby reducing overfitting and improving generalization.

The following are the differences between L1 and L2 regularization:

  • Definition: L1 regularization (Lasso) adds a penalty that can drive some of the model's weights to be exactly zero, whereas L2 regularization (Ridge) adds a penalty that shrinks the weights to be as close to zero as possible without making them exactly zero.
  • Interpretability: L1 selects a subset of the most important features while eliminating the less important ones; L2 keeps all the features but assigns smaller weights to the less important ones.
  • Formula: the Lasso loss is [Tex]L_1 = \text{Loss} + \lambda \sum_{i} |w_i|[/Tex] and the Ridge loss is [Tex]L_2 = \text{Loss} + \lambda \sum_{i} w_i^{2}[/Tex], where Loss is the model loss, [Tex]\lambda[/Tex] is the regularization controlling parameter, and w are the weights of the model.
  • Robustness: L1 is more sensitive to outliers and noisy data, since it can eliminate the affected features entirely; L2 is more robust to the presence of outliers and noisy data.
  • Computational efficiency: L1 is computationally more expensive; L2 is computationally less expensive.
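
A minimal sketch of fitting both penalties with scikit-learn on synthetic data; the regularization strengths (alpha values) are placeholder choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 1.5, 0, 0, 0, 0])   # only 3 informative features
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: several coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrink but stay non-zero

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```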

Q.61 Explain the concepts of bias-variance trade-off in machine learning.

When creating predictive models, the bias-variance trade-off is a key concept in machine learning that deals with finding the right balance between two sources of error, bias and variance. It plays a crucial role in model selection and understanding the generalization performance of a machine learning algorithm. Here’s an explanation of these concepts:

  • Bias: Bias is the error that arises when a model is too simple to capture the true relationship in the data; it is the difference between the model's average prediction and the actual or expected values. This difference is known as the bias error, or error due to bias.
  • Variance: In machine learning, variance is the amount by which a predictive model's performance changes when it is trained on different subsets of the training data. It measures how sensitive the model is to the particular training subset it sees, i.e., how much it adapts to a new subset of the training data.

The four combinations of bias and variance can be summarised as:

  • Low Bias, Low Variance: Best fit (the ideal scenario).
  • High Bias, Low Variance: Underfitting.
  • Low Bias, High Variance: Overfitting.
  • High Bias, High Variance: The model fails to capture the underlying patterns (the worst case).
As data science professionals, our focus should be to achieve the best-fit model, i.e., low bias and low variance. A model with low bias and low variance can capture the underlying patterns in the data (low bias) and is not overly sensitive to changes in the training data (low variance). This is the ideal situation for a machine learning model, since it can generalize effectively to new, previously unseen data and deliver consistent and accurate predictions. In practice, however, this ideal is rarely fully achievable.

If the algorithm is too simple (e.g., a hypothesis with a linear equation), it tends to suffer from high bias and low variance, making it prone to underfitting. If the algorithm fits too complicated a hypothesis (e.g., a high-degree polynomial), it tends to have high variance and low bias, and it will perform poorly on new data. The compromise between these two situations is known as the bias-variance trade-off: an algorithm cannot be made more complex and less complex at the same time, so we aim for the balance point in between.

Q.62 How do we choose the appropriate kernel function in SVM?

A kernel function is responsible for converting the original data points into a higher-dimensional feature space. Choosing the appropriate kernel function in a Support Vector Machine is a crucial step, as it determines how well the SVM can capture the underlying patterns in your data. Below are some guidelines for choosing a suitable kernel function, with a code sketch after the list:

  • If the dataset exhibits linear relationship

In this case, we should use Linear Kernel function. It is simple, computationally efficient and less prone to overfitting. For example, text classification, sentiment analysis, etc.

  • If the dataset requires a probabilistic approach

The sigmoid kernel is suitable when the data resembles a sigmoid function or when you have prior knowledge suggesting this shape. For example, Risk assessment, Financial applications, etc.

  • If the dataset is Simple Non Linear in nature

In this case, use a polynomial kernel function. Polynomial kernels are useful when we are trying to capture a moderate level of non-linearity. For example, image and speech recognition, etc.

  • If the dataset is highly non-linear in nature, or we do not know the underlying relationship

In that case, a Radial basis function is the best choice. RBF kernel can handle highly complex dataset and is useful when you’re unsure about the data’s underlying distribution. For example, Financial forecasting, bioinformatics, etc.
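
A minimal sketch of trying the different kernels with scikit-learn and comparing them with cross-validation; the synthetic dataset and default hyperparameters are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:>8}: mean accuracy = {scores.mean():.3f}")
```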

Q.63 How does Naïve Bayes handle categorical and continuous features?

Naive Bayes is a probabilistic approach that assumes the features are independent of each other given the class. For categorical features, it calculates the probabilities associated with each class label based on the observed frequencies of feature values within each class in the training data, i.e., the conditional probability of a feature given a class, P(feature | class). For continuous features, variants such as Gaussian Naive Bayes model P(feature | class) with a continuous distribution, typically a normal distribution whose mean and variance are estimated from the training data for each class. To make predictions, Naive Bayes calculates the posterior probability of each class given the observed feature values and selects the class with the highest posterior probability (maximum a posteriori, or MAP, estimation).

Q.64 What is Laplace smoothing (add-one smoothing) and why is it used in Naïve Bayes?

In Naïve Bayes, the conditional probability of a feature given a class label is estimated as P(feature | class) from the training data. When using this in a classification problem (say, text classification), there could be a word that never appears in a particular class in the training data. In those cases, the estimated probability of that feature given the class will be zero, which zeroes out the entire product of probabilities and creates a big problem when making predictions.

To overcome this problem, we use Laplace smoothing. Laplace smoothing addresses the zero-probability problem by adding a small constant (usually 1) to the count of each feature in each class and to the total count of features in each class. Without smoothing, if any feature is missing in a class, the probability of that class given the features becomes zero, making the classifier overly confident and potentially leading to incorrect classifications.
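
For word counts in text classification, the smoothed estimate (with add-one smoothing) can be written as:

[Tex]P(w \mid c) = \frac{\text{count}(w, c) + 1}{\sum_{w'} \text{count}(w', c) + |V|}[/Tex]

where count(w, c) is the number of times word w appears in documents of class c and |V| is the size of the vocabulary.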

Q.65 What are imbalanced datasets and how can we handle them?

Imbalanced datasets are datasets in which the distribution of class labels (or target values) is heavily skewed, meaning that one class has significantly more instances than any other class. Imbalanced datasets pose challenges because models trained on such data can have a bias toward the majority class, leading to poor performance on the minority class, which is often of greater interest. This will lead to the model not generalizing well on the unseen data.

To handle imbalanced datasets, we can approach the following methods:

  • Resampling (increasing or decreasing the number of samples), as in the sketch after this list:
    • Up-sampling: We increase the number of minority-class samples, either by sampling with replacement or by generating synthetic examples. A popular example is SMOTE (Synthetic Minority Over-sampling Technique).
    • Down-sampling: Alternatively, we randomly reduce the majority class so that it becomes comparable in size to the minority class.
  • Ensemble methods (using models that can handle imbalanced datasets inherently):
    • Bagging: Techniques like Random Forests can mitigate the impact of class imbalance by constructing multiple decision trees from bootstrapped samples.
    • Boosting: Algorithms like AdaBoost and XGBoost can give more importance to misclassified minority-class examples in each iteration, improving their representation in the final model.
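
A minimal sketch of random up-sampling of the minority class with scikit-learn's resample utility; the class labels and sizes are made up:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced dataset: 950 negatives, 50 positives
df = pd.DataFrame({
    "feature": np.random.default_rng(0).normal(size=1000),
    "label": [0] * 950 + [1] * 50,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Up-sample the minority class with replacement to match the majority class size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["label"].value_counts())
```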

Q.66 What are outliers in the dataset and how can we detect and remove them?

An outlier is a data point that is significantly different from the other data points. Outliers usually lie in the extremes of the distribution and stand out compared to the rest of the data points.

For detecting Outliers we can use the following approaches:

  • Visual inspection: The easiest way, which involves plotting the data points in a scatter plot, box plot, etc.
  • Statistics: Using measures of central tendency and spread, we can determine whether a data point falls significantly far from the mean, median, etc., making it a potential outlier.
  • Z-score: If a data point has a very high absolute Z-score (commonly greater than 3), it can be flagged as an outlier (a code sketch appears at the end of this answer).

For removing the outliers, we can use the following:

  • Removing the outliers manually
  • Applying transformations, such as a logarithmic or square-root transformation, to compress extreme values
  • Performing imputation, wherein the outliers are replaced with values such as the mean, median, or mode
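
A minimal sketch of flagging outliers with a Z-score rule and an IQR rule on made-up data; the |z| > 3 and 1.5 × IQR thresholds are common conventions, not fixed requirements:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=50, scale=5, size=200), [120, 135]])  # made-up data + 2 outliers

# Z-score rule: flag points with |z| > 3
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```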

Q.67 What is the curse of dimensionality, and how can we overcome it?

When dealing with a dataset that has high dimensionality (a large number of features), we often encounter various issues. Some of the problems faced while dealing with high-dimensional data are listed below:

  • Computational expense: The biggest problem with handling a dataset with a vast number of features is that it takes a long time to process and train a model on it, which wastes both time and computational resources.
  • Data sparsity: In high-dimensional spaces, data points tend to be far from each other (high sparsity). This makes it harder to find the underlying patterns between features and can be a hindrance to proper analysis.
  • Visualisation issues and overfitting: It is easy to visualise 2D and 3D data, but beyond that it becomes difficult to visualise the data properly. Furthermore, with more features, some of them may be correlated or provide misleading information, which can cause the model to overfit during training.

These issues are what are generally termed as “Curse of Dimensionality”.

To overcome this, we can follow different approaches – some of which are mentioned below:

  • Feature selection: Often, not all the features are necessary. It is the practitioner's job to select the features that are actually needed to solve the given problem statement.
  • Feature engineering: Sometimes we may need a feature that is a combination of many other features. This method can, in general, reduce the feature count in the dataset.
  • Dimensionality reduction techniques: These techniques reduce the number of features in a dataset while preserving as much useful information as possible. Well-known examples include Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).
  • Regularization: Regularization techniques like L1 and L2 are useful for controlling the impact each feature has on the model during training.

Q.68 How does the random forest algorithm handle feature selection?

Mentioned below is how Random Forest handles feature selection (see the sketch after the list):

  • When creating the individual trees in the Random Forest ensemble, a random subset of features is assigned to each tree; this is called feature bagging. Feature bagging introduces randomness and diversity among the trees.
  • After training, each feature is assigned an "importance score" based on how much it contributes to reducing the error of the model. Features that consistently contribute to improving the model's accuracy across multiple trees are deemed more important.
  • The features are then ranked by their importance scores. Features with higher importance scores are considered more influential in making predictions.
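
A minimal sketch of reading feature importances from scikit-learn's RandomForestClassifier on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by their impurity-based importance scores
ranked = sorted(enumerate(forest.feature_importances_), key=lambda t: t[1], reverse=True)
for idx, score in ranked:
    print(f"feature_{idx}: importance = {score:.3f}")
```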

Q.69 What is feature engineering? Explain the different feature engineering methods.

Feature Engineering: Feature engineering is the process of preparing and transforming raw data for better analysis. It involves steps such as selecting, transforming, creating, and deleting features to suit the problem at hand. Feature engineering is a useful tool which can be used for:

  • Improving the model's performance and data interpretability
  • Reducing computational costs
  • Surfacing hidden patterns for better analysis results.

Some of the different methods of doing feature engineering are mentioned below:

  • Principal Component Analysis (PCA): It identifies orthogonal axes (principal components) in the data that capture the maximum variance, thereby reducing the number of features.
  • Encoding: It is a technique for converting data into numerical representations that carry meaning. It can be done in two ways:
    • One-Hot Encoding: used when we need to encode nominal categorical data
    • Label Encoding: used when we need to encode ordinal categorical data
  • Feature Transformation: Sometimes we can create new columns, essential for better modelling, just by combining or modifying one or more existing columns.

Q.70 How do we deal with categorical text values in machine learning?

Often, we encounter data that has categorical text values, for example male/female or first-class/second-class/third-class. These categorical text values can be divided into two types, and we deal with them as follows (see the sketch after the list):

  • If it is categorical nominal data: If the data does not have any inherent order (e.g., male/female), we perform one-hot encoding to convert each category into a binary indicator column.
  • If it is categorical ordinal data: When there is an order associated with the text data, we use label (ordinal) encoding, in which the numerical conversion follows the order of the categories (e.g., Elementary / Middle / High / Graduate).
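
A minimal pandas sketch of both encodings, using a made-up column for each case:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],            # nominal
    "education": ["Middle", "High", "Elementary", "Graduate"]  # ordinal
})

# Nominal data: one-hot encoding
one_hot = pd.get_dummies(df["gender"], prefix="gender")

# Ordinal data: map categories to integers that respect their order
order = {"Elementary": 0, "Middle": 1, "High": 2, "Graduate": 3}
df["education_encoded"] = df["education"].map(order)

print(pd.concat([df, one_hot], axis=1))
```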

Q.71 What is DBSCAN and how do we use it?

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm used for grouping together data points that are close to each other in high-density regions and labeling data points in low-density regions as outliers or noise. Here is how it works, with a short usage sketch after the steps:

  • For each data point in the dataset, DBSCAN examines the neighbourhood within a predefined distance threshold (eps); points that have at least min_samples neighbours within eps are labelled core points.
  • DBSCAN identifies dense regions by connecting core points that lie within each other's eps neighbourhood.
  • DBSCAN forms clusters by grouping together data points that are density-reachable from one another; points that are not density-reachable from any core point are labelled as noise.
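
A minimal usage sketch with scikit-learn; the eps and min_samples values are placeholders that normally need tuning:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Labels: cluster indices; -1 marks points treated as noise/outliers
print("clusters found:", set(db.labels_) - {-1})
print("noise points  :", (db.labels_ == -1).sum())
```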

Q.72 How does the EM (Expectation-Maximization) algorithm work in clustering?

The Expectation-Maximization (EM) algorithm is a probabilistic approach used for clustering data with mixture models. EM is commonly used when the true cluster assignments are not known and there is uncertainty about which cluster a data point belongs to. Here is how it works (a usage sketch follows the steps):

  • First, the number of clusters K and initial estimates of the model parameters are specified.
  • Then, for each data point, the likelihood (responsibility) of it belonging to each of the K clusters is calculated. This is called the Expectation (E) step.
  • Based on these responsibilities, the model parameters are updated. This is called the Maximization (M) step.
  • The algorithm then checks for convergence by comparing the change in log-likelihood or in the parameter values between iterations.
  • If it has converged, we stop; if not, the E-step and M-step are repeated until convergence is reached.
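
In practice, EM clustering is most often run through a Gaussian mixture model; a minimal scikit-learn sketch (K = 3 is a placeholder choice):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# GaussianMixture fits the mixture with the EM algorithm internally
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

labels = gmm.predict(X)            # hard cluster assignments
probs = gmm.predict_proba(X[:5])   # soft responsibilities from the E-step
print("converged:", gmm.converged_, "| log-likelihood bound:", round(gmm.lower_bound_, 3))
```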

Q.73 Explain the concept of silhouette score in clustering evaluation.

Silhouette score is a metric used to evaluate the quality of clusters produced by a clustering algorithm. Here is how it works (a usage sketch follows the steps):

  • For each data point, the average distance between the data point and all other data points in the same cluster is first calculated; call this a (the cohesion).
  • Then, for the same data point, the average distance b between the data point and all data points in the nearest neighbouring cluster (i.e., the closest cluster to which it is not assigned) is calculated (the separation).
  • The silhouette coefficient for each data point is then given by: S = (b – a) / max(a, b)
    • If –1 < S < 0, the data point is closer to a neighbouring cluster than to its own cluster.
    • If S is close to zero, the data point is on or very close to the decision boundary between two neighbouring clusters.
    • If 0 < S < 1, the data point is well within its own cluster and far from neighbouring clusters.
The silhouette score of a clustering is the mean silhouette coefficient over all data points.
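
A minimal sketch of computing the score with scikit-learn; the clustering method and K are placeholders:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette score:", round(silhouette_score(X, labels), 3))
```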

Q.74 What is the relationship between eigenvalues and eigenvectors in PCA?

In Principal Component Analysis (PCA), eigenvalues and eigenvectors play a crucial role in the transformation of the original data into a new coordinate system. Let us first define the essential terms:

  • Eigenvalues: Each eigenvalue is associated with an eigenvector and represents the magnitude of the variance (spread or extent) of the data along that eigenvector.
  • Eigenvectors: Eigenvectors are the directions or axes in the original feature space along which the data varies the most, i.e., exhibits the most variance.

The relationship between them is given as:

[Tex]A\,v = \lambda v[/Tex], where

A = the covariance matrix of the data (in PCA, the eigen decomposition is applied to the covariance matrix of the features)

v = eigenvector

[Tex]\lambda[/Tex] = eigenvalue

A larger eigenvalue implies that the corresponding eigenvector captures more of the variance in the data. The sum of all eigenvalues equals the total variance in the original data. Therefore, the proportion of total variance explained by each principal component can be calculated by dividing its eigenvalue by the sum of all eigenvalues.
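
A minimal NumPy sketch of this relationship: eigen-decompose the covariance matrix and convert eigenvalues into explained-variance ratios (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))  # synthetic correlated data

cov = np.cov(X, rowvar=False)                        # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues/eigenvectors (ascending)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort in descending order

explained_ratio = eigvals / eigvals.sum()
print("eigenvalues          :", np.round(eigvals, 3))
print("explained variance % :", np.round(100 * explained_ratio, 1))
```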

Q.75 What is the cross-validation technique in machine learning?

Cross-validation is a resampling technique used in machine learning to assess and validate the performance of a predictive model. It helps in estimating how well a model is likely to perform on unseen data, making it a crucial step in model evaluation and selection. Cross-validation also helps in detecting and avoiding overfitting. Some of the widely known cross-validation techniques are listed below, with a short sketch after the list:

  • K-Fold Cross-Validation: In this, the data is divided into K subsets, and K iterations of training and testing are performed.
  • Stratified K-Fold Cross-Validation: This technique ensures that each fold has approximately the same proportion of classes as the original dataset (helpful in handling data imbalance)
  • Shuffle-Split Cross-Validation: It randomly shuffles the data and splits it into training and testing sets.
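
A minimal sketch of stratified 5-fold cross-validation with scikit-learn; the model and dataset are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("fold accuracies:", scores.round(3), "| mean:", scores.mean().round(3))
```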

Q.76 What are ROC and AUC? Explain their significance in binary classification.

Receiver Operating Characteristic (ROC) is a graphical representation of a binary classifier’s performance. It plots the true positive rate (TPR) vs the false positive rate (FPR) at different classification thresholds.

True positive rate (TPR): It is the ratio of true positive predictions to the total number of actual positives.

TPR (Recall) = TP / (TP + FN)

False positive rate (FPR): It is the ratio of false positive predictions to the total number of actual negatives.

FPR = FP / (FP + TN)

Area Under the Curve (AUC) as the name suggests is the area under the ROC curve. The AUC is a scalar value that quantifies the overall performance of a binary classification model and ranges from 0 to 1, where a model with an AUC of 0.5 indicates random guessing, and an AUC of 1 represents a perfect classifier.
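
A minimal sketch of computing the ROC curve and AUC with scikit-learn from predicted probabilities on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)   # points of the ROC curve
print("AUC:", round(roc_auc_score(y_te, probs), 3))
```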

Q.77 Describe gradient descent and its role in optimizing machine learning models.

Gradient descent is a fundamental optimization algorithm used to minimize a cost or loss function in machine learning and deep learning. Its primary role is to iteratively adjust the parameters of a machine learning model to find the values that minimize the cost function, thereby improving the model's predictive performance. Here's how gradient descent helps in optimizing machine learning models (a small sketch follows the list):

  1. Minimizing Cost functions: The primary goal of gradient descent is to find parameter values that result in the lowest possible loss on the training data.
  2. Convergence: The algorithm continues to iterate and update the parameters until it meets a predefined convergence criterion, which can be a maximum number of iterations or achieving a desired level of accuracy.
  3. Generalization: Gradient descent helps the optimized model generalize well to new, unseen data, often in combination with techniques such as regularization and early stopping.
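
A minimal NumPy sketch of batch gradient descent fitting a simple linear model y ≈ w·x + b on synthetic data; the learning rate and iteration count are placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=200)   # synthetic data, true w=3, b=2

w, b = 0.0, 0.0
lr = 0.01                                             # learning rate (step size)

for _ in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * x)                   # d(MSE)/dw
    grad_b = 2 * np.mean(error)                       # d(MSE)/db
    w -= lr * grad_w                                  # step against the gradient
    b -= lr * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```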

Q.78 Describe batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Batch Gradient Descent: In Batch Gradient Descent, the entire training dataset is used to compute the gradient of the cost function with respect to the model parameters (weights and biases) in each iteration. This means that all training examples are processed before a single parameter update is made. It converges to a more accurate minimum of the cost function but can be slow, especially in a high dimensionality space.

Stochastic Gradient Descent: In Stochastic Gradient Descent, only one randomly selected training example is used to compute the gradient and update the parameters in each iteration. The selection of examples is done independently for each iteration. This allows faster updates and can handle large datasets because it processes one example at a time, but the high variance of its updates can make convergence noisier and slower.

Mini-Batch Gradient Descent: Mini-Batch Gradient Descent strikes a balance between BGD and SGD. It divides the training dataset into small, equally sized subsets called mini-batches. In each iteration, a mini-batch is randomly sampled, and the gradient is computed based on this mini-batch. It utilizes parallelism well and takes advantage of modern hardware like GPUs, but it can still exhibit some level of variance in updates compared to Batch Gradient Descent.

Q.79 Explain the Apriori algorithm in association rule mining.

Association rule mining is a technique for finding relationships between two or more different items. Apriori is one of the most frequently used and simplest association rule mining algorithms. It uses prior knowledge of the properties of frequent itemsets and is based on the Apriori property, which states that:

“All non-empty subsets of a frequent itemset must also be frequent”
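
A tiny illustrative sketch (not an optimised implementation) that counts itemset supports over hypothetical transactions; candidate k-itemsets are built only from frequent (k−1)-itemsets, which is how the Apriori property prunes the search space:

```python
from itertools import combinations

# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 0.6  # itemset must appear in at least 60% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

k = 2
while frequent:
    print(f"frequent {k - 1}-itemsets:", [set(s) for s in frequent])
    # Candidates come from unions of frequent (k-1)-itemsets (Apriori pruning)
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    k += 1
```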
