Fundamentals of Data Analysis Interview Questions

What is a Numpy array?

Answer: A Numpy array is a data structure that stores elements of the same data type in a contiguous block of memory.

How can you create a Numpy array?

Answer: You can create a Numpy array by using the numpy.array() function or by converting other data structures like lists or tuples into Numpy arrays.
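
As a minimal sketch of the approaches mentioned above (the variable names and values are illustrative, not from the original):

```python
import numpy as np

# Illustrative inputs: a list, a tuple, and a nested list.
from_list = np.array([1, 2, 3])
from_tuple = np.array((1.5, 2.5, 3.5))
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(from_tuple.dtype)  # float64
print(matrix.shape)      # (2, 3)
```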

What is the difference between a Numpy array and a Python list?

Answer: Numpy arrays are more efficient for numerical operations and occupy less memory compared to Python lists. Numpy arrays also support vectorized operations.

How can you access elements in a Numpy array?

Answer: You can access elements by indexing with square brackets, using zero-based integer positions. For example, my_array[0] returns the first element, my_array[-1] returns the last element, and my_array[1, 2] returns the element in row 1, column 2 of a 2D array.

How can you perform mathematical operations on Numpy arrays?

Answer: Numpy provides a wide range of mathematical functions and operators that can be directly applied to Numpy arrays, allowing for element-wise operations.

Can you change the size of a Numpy array after it is created?

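Answer: Generally no. The number of elements in a Numpy array is fixed when it is created. To get an array of a different size, you create a new one, for example with numpy.append() or numpy.concatenate(), both of which return a new array. Reshaping changes the dimensions of an array but keeps the total number of elements the same.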

What is the shape of a Numpy array?

Answer: The shape of a Numpy array refers to its dimensions, specified as a tuple. For example, a 2D array with 3 rows and 4 columns has a shape of (3, 4).

How can you perform slicing on a Numpy array?

Answer: Slicing allows you to extract a portion of a Numpy array. You can specify the range of indices using the colon operator, such as my_array[1:4] to slice elements from index 1 to 3.
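
A short sketch of the slice described above, plus a 2D case; the array contents are illustrative assumptions:

```python
import numpy as np

my_array = np.arange(10, 20)        # values 10 through 19
middle = my_array[1:4]              # elements at indices 1, 2, and 3

grid = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns, values 0 through 11
block = grid[0:2, 1:3]              # rows 0-1 and columns 1-2

print(middle)  # [11 12 13]
print(block)   # [[1 2]
               #  [5 6]]
```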

Can a Numpy array contain elements of different data types?

Answer: Not in the usual sense. All elements of a Numpy array share a single data type (dtype). If you pass mixed values to numpy.array(), Numpy converts them to a common type, for example integers to floats. An array with dtype=object can hold arbitrary Python objects, but it gives up most of the speed and memory benefits of Numpy.

What is broadcasting in Numpy arrays?

Answer: Broadcasting is a feature in Numpy that allows for element-wise operations between arrays with different shapes by implicitly expanding the smaller array to match the larger one.
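
The following sketch illustrates broadcasting with a (2, 3) array and two smaller arrays; the values are made up for the example:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,): stretched across both rows
column = np.array([[100], [200]])    # shape (2, 1): stretched across all columns

plus_row = matrix + row
plus_column = matrix + column

print(plus_row)     # [[11 22 33]
                    #  [14 25 36]]
print(plus_column)  # [[101 102 103]
                    #  [204 205 206]]
```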

How can you perform aggregation functions on a Numpy array?

Answer: Numpy provides various aggregation functions like numpy.mean(), numpy.sum(), numpy.min(), numpy.max(), etc., to compute statistics on Numpy arrays.

How can you concatenate Numpy arrays?

Answer: Numpy provides the numpy.concatenate() function to concatenate two or more Numpy arrays either horizontally (along columns) or vertically (along rows).

What is the difference between shallow copy and deep copy of a Numpy array?

Answer: A shallow copy creates a new array object that references the original data, while a deep copy creates a completely independent copy of the array, including the data.
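
One way to see the difference is to modify the original and check both copies. This sketch assumes view() for the shallow copy and copy() for the deep copy:

```python
import numpy as np

original = np.array([1, 2, 3, 4])
shallow = original.view()  # new array object, same underlying data
deep = original.copy()     # independent data buffer

original[0] = 99

print(shallow[0])  # 99, because the view sees the change
print(deep[0])     # 1, because the copy is unaffected
print(np.shares_memory(original, shallow))  # True
print(np.shares_memory(original, deep))     # False
```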

Can you sort elements in a Numpy array?

Answer: Yes, Numpy provides the numpy.sort() function to sort elements in a Numpy array in ascending order. You can also specify the axis and sorting algorithm.

How can you reshape a Numpy array?

Answer: You can reshape a Numpy array using the numpy.reshape() function, which returns a new array with the desired shape. The total number of elements must remain the same.
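
A small sketch of reshaping, including what happens when the element counts do not match (names are illustrative):

```python
import numpy as np

flat = np.arange(6)                      # 6 elements
two_by_three = np.reshape(flat, (2, 3))  # 2 * 3 = 6, so this works
three_by_two = flat.reshape(3, -1)       # -1 lets Numpy infer the missing dimension

# A shape whose element count differs from the original raises ValueError.
try:
    np.reshape(flat, (4, 2))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(two_by_three.shape)  # (2, 3)
print(three_by_two.shape)  # (3, 2)
print(mismatch_raised)     # True
```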

Numpy Operations

What is broadcasting in Numpy?

Answer: Broadcasting is a mechanism in Numpy that allows for performing element-wise operations between arrays of different shapes by implicitly expanding the smaller array to match the larger one.

How can you perform element-wise addition of two Numpy arrays?

Answer: You can use the numpy.add() function or simply use the + operator to perform element-wise addition of two Numpy arrays.

What is the purpose of the numpy.dot() function?

Answer: The numpy.dot() function is used for matrix multiplication or dot product between two Numpy arrays. It can also be used for matrix-vector multiplication and inner product computation.
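
The three uses mentioned above can be sketched as follows, with small made-up inputs:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
v = np.array([1, 1])

matrix_product = np.dot(a, b)  # matrix multiplication
matrix_vector = np.dot(a, v)   # matrix-vector multiplication
inner = np.dot(np.array([1, 2, 3]), np.array([4, 5, 6]))  # inner product

print(matrix_product)  # [[19 22]
                       #  [43 50]]
print(matrix_vector)   # [3 7]
print(inner)           # 32
```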

How can you perform element-wise multiplication of two Numpy arrays?

Answer: You can use the numpy.multiply() function or simply use the * operator to perform element-wise multiplication of two Numpy arrays.

What is the purpose of the numpy.sum() function?

Answer: The numpy.sum() function is used to calculate the sum of elements in a Numpy array. It can also be used to calculate column-wise or row-wise sums by specifying the axis parameter.
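
As an illustration of the axis parameter (the same convention applies to numpy.mean(), numpy.min(), and similar functions):

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])

total = np.sum(data)                # all elements
column_sums = np.sum(data, axis=0)  # collapse rows: one sum per column
row_sums = np.sum(data, axis=1)     # collapse columns: one sum per row

print(total)        # 21
print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
```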

How can you find the minimum and maximum values in a Numpy array?

Answer: You can use the numpy.min() function to find the minimum value and the numpy.max() function to find the maximum value in a Numpy array. Both functions can also operate along a specified axis.

What is the purpose of the numpy.mean() function?

Answer: The numpy.mean() function calculates the arithmetic mean of elements in a Numpy array. It can also calculate column-wise or row-wise means by specifying the axis parameter.

How can you calculate the standard deviation of a Numpy array?

Answer: You can use the numpy.std() function to calculate the standard deviation of elements in a Numpy array. It can also calculate column-wise or row-wise standard deviations by specifying the axis parameter.

What is the purpose of the numpy.transpose() function?

Answer: The numpy.transpose() function is used to interchange the rows and columns of a Numpy array, effectively creating a transpose of the original array.

What is the purpose of the numpy.argmax() function?

Answer: The numpy.argmax() function is used to find the index of the maximum value in a Numpy array. It can also operate along a specified axis.
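
A brief sketch with illustrative values; note that when the maximum occurs more than once, the index of the first occurrence is returned:

```python
import numpy as np

values = np.array([3, 9, 2, 9])
first_max = np.argmax(values)        # the first 9 is at index 1

grid = np.array([[1, 7, 3],
                 [8, 2, 9]])
per_column = np.argmax(grid, axis=0)  # row index of the maximum in each column
per_row = np.argmax(grid, axis=1)     # column index of the maximum in each row

print(first_max)   # 1
print(per_column)  # [1 0 1]
print(per_row)     # [1 2]
```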

How can you calculate the cumulative sum of elements in a Numpy array?

Answer: You can use the numpy.cumsum() function to calculate the cumulative sum of elements in a Numpy array. The resulting array will have the same shape as the original array.

What is the purpose of the numpy.unique() function?

Answer: The numpy.unique() function is used to find the unique elements in a Numpy array. It returns a sorted array of unique values.

How can you perform element-wise exponentiation on a Numpy array?

Answer: You can use the numpy.exp() function to calculate e raised to the power of each element. To raise each element to an arbitrary power, use the numpy.power() function or the ** operator.

What is the purpose of the numpy.round() function?

Answer: The numpy.round() function is used to round the elements of a Numpy array to the nearest integer or a specified decimal place.

How can you perform element-wise comparison between two Numpy arrays?

Answer: You can use comparison operators like ==, !=, >, <, >=, and <= to perform element-wise comparison between two Numpy arrays, which will result in a boolean array.

Pandas

Indexing, Selecting, and Filtering Data

What is the purpose of indexing in Pandas?

Answer: Indexing in Pandas allows you to access and retrieve specific subsets of data from a DataFrame or Series based on labels or positions.

How can you select a single column from a DataFrame in Pandas?

Answer: You can select a single column from a DataFrame in Pandas by using square brackets or dot notation with the column name.

What is the difference between loc and iloc in Pandas?

Answer: The loc attribute is used for label-based indexing, while iloc is used for integer-based indexing. The loc attribute uses row and column labels, while iloc uses integer positions.
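
A minimal sketch of the difference, using a small made-up DataFrame with string row labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.5, 2.5, 3.5]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "y"]     # row label "b", column label "y"
by_position = df.iloc[1, 1]     # second row, second column

label_slice = df.loc["a":"b"]   # label slices include the end label
position_slice = df.iloc[0:1]   # position slices exclude the end position

print(by_label, by_position)                   # 2.5 2.5
print(len(label_slice), len(position_slice))   # 2 1
```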

How can you select rows based on a condition in Pandas?

Answer: You can select rows based on a condition in Pandas by using boolean indexing. For example, df[df['column'] > 10] will select rows where the value in 'column' is greater than 10.

How can you select rows and columns simultaneously in Pandas?

Answer: You can select rows and columns simultaneously in Pandas by using the loc or iloc attribute with row and column labels or positions.

What is the purpose of the isin() method in Pandas?

Answer: The isin() method in Pandas is used to filter rows based on multiple values in a specific column. It returns a boolean Series indicating whether each element in the column is one of the specified values.

How can you select a specific range of rows or columns in Pandas?

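Answer: You can select a range of rows or columns by slicing with loc or iloc. For example, df.iloc[0:5] selects the first five rows by position, and df.loc[:, 'a':'c'] selects the columns from 'a' through 'c' by label. Note that loc slices include the end label, while iloc slices exclude the end position.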

What is the purpose of the query() method in Pandas?

Answer: The query() method in Pandas allows you to filter rows based on a boolean expression using a more concise syntax. It is particularly useful when dealing with large datasets.

How can you filter rows based on multiple conditions in Pandas?

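Answer: You can filter rows based on multiple conditions in Pandas by combining the conditions using logical operators such as & for AND and | for OR. Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators.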
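
For illustration, a sketch with a hypothetical DataFrame of ages and cities. Each comparison is wrapped in parentheses so that it is evaluated before & or |:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 35, 45, 52],
    "city": ["Paris", "Paris", "Lyon", "Paris"],
})

# AND: older than 30 and living in Paris.
both = df[(df["age"] > 30) & (df["city"] == "Paris")]

# OR: older than 50 or living in Lyon.
either = df[(df["age"] > 50) | (df["city"] == "Lyon")]

print(both.index.tolist())    # [1, 3]
print(either.index.tolist())  # [2, 3]
```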

What is the purpose of the between() method in Pandas?

Answer: The between() method in Pandas is used to filter rows based on whether the values in a column fall within a specified range.

How can you select a random sample of rows from a DataFrame in Pandas?

Answer: You can select a random sample of rows from a DataFrame in Pandas by using the sample() method and specifying the number of rows or the fraction of rows to be sampled.

What is the purpose of the nsmallest() method in Pandas?

Answer: The nsmallest() method in Pandas is used to return the n smallest values from a specific column in a DataFrame.

How can you select the first or last n rows from a DataFrame in Pandas?

Answer: You can select the first or last n rows from a DataFrame in Pandas by using the head() or tail() method with the number of rows to be selected.

What is the purpose of the duplicated() method in Pandas?

Answer: The duplicated() method in Pandas is used to identify and filter rows that contain duplicate values based on one or more columns.

How can you drop rows or columns with missing values in Pandas?

Answer: You can drop rows or columns with missing values in Pandas by using the dropna() method. By default, it drops every row that contains at least one missing value; pass axis=1 to drop columns instead.

Merging and Concatenation of Data

What is the purpose of merging data in Pandas?

Answer: Merging data in Pandas allows you to combine multiple DataFrames based on common columns or indices to create a single, unified dataset.

How can you merge two DataFrames horizontally in Pandas?

Answer: You can merge two DataFrames horizontally in Pandas by using the merge() function and specifying the common column(s) to merge on.

What is the difference between inner join and outer join in Pandas?

Answer: In Pandas, an inner join returns only the rows that have matching values in both DataFrames, while an outer join returns all rows from both DataFrames, filling in missing values with NaN.
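
The difference can be sketched with two small made-up DataFrames that share only some keys:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = pd.merge(left, right, on="key", how="inner")  # keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # keys present in either

print(sorted(inner["key"]))  # [2, 3]
print(sorted(outer["key"]))  # [1, 2, 3, 4]
print(outer["right_val"].isna().sum())  # 1 (key 1 has no match on the right)
```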

How can you merge two DataFrames vertically in Pandas?

Answer: You can merge two DataFrames vertically in Pandas by using the concat() function and specifying the axis parameter as 0.

What is the purpose of concatenating data in Pandas?

Answer: Concatenating data in Pandas allows you to combine multiple DataFrames or Series along a particular axis (row-wise or column-wise) to create a larger dataset.

How can you concatenate DataFrames horizontally in Pandas?

Answer: You can concatenate DataFrames horizontally in Pandas by using the concat() function and specifying the axis parameter as 1.

What is the purpose of the join() method in Pandas?

Answer: The join() method in Pandas is used to join two or more DataFrames based on their indices.

How can you perform a left join in Pandas?

Answer: You can perform a left join in Pandas by using the merge() function with the “how” parameter set to “left”.

What is the purpose of the suffixes parameter in the merge() function?

Answer: The suffixes parameter in the merge() function is used to specify the suffixes to add to overlapping column names in case of a merge conflict.

How can you merge two DataFrames based on multiple columns in Pandas?

Answer: You can merge two DataFrames based on multiple columns in Pandas by passing a list of column names to the “on” parameter of the merge() function.

What is the purpose of the indicator parameter in the merge() function?

Answer: The indicator parameter in the merge() function adds a column, named _merge by default, that indicates the source of each row: left_only, right_only, or both.
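
A short sketch of indicator=True on an outer merge; the DataFrames are illustrative:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "left_val": ["a", "b"]})
right = pd.DataFrame({"key": [2, 3], "right_val": ["x", "y"]})

merged = pd.merge(left, right, on="key", how="outer", indicator=True)

# Map each key to the value pandas wrote in the "_merge" column.
source_by_key = dict(zip(merged["key"], merged["_merge"].astype(str)))
print(source_by_key)  # {1: 'left_only', 2: 'both', 3: 'right_only'}
```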

How can you perform an inner join on multiple columns in Pandas?

Answer: You can perform an inner join on multiple columns in Pandas by passing a list of column names to the “on” parameter of the merge() function.

What is the purpose of the how parameter in the merge() function?

Answer: The how parameter in the merge() function is used to specify the type of join to perform (inner, outer, left, or right).

How can you concatenate DataFrames with different column names in Pandas?

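Answer: You can concatenate DataFrames with different column names in Pandas by using the concat() function. It aligns columns by name, so a column missing from one DataFrame is filled with NaN in the result. If differently named columns hold the same data, rename them first with the rename() method so that they line up.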

What is the purpose of the ignore_index parameter in the concat() function?

Answer: The ignore_index parameter in the concat() function is used to reset the index of the resulting concatenated DataFrame.
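
To illustrate the effect, this sketch concatenates two made-up DataFrames with and without ignore_index:

```python
import pandas as pd

first = pd.DataFrame({"v": [1, 2]})
second = pd.DataFrame({"v": [3, 4]})

kept = pd.concat([first, second])                         # original labels repeat
renumbered = pd.concat([first, second], ignore_index=True)  # fresh 0..n-1 index

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```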

Grouping and Cross Tabulation

What is the purpose of grouping data in Pandas?

Answer: Grouping data in Pandas allows you to split the data into groups based on one or more criteria and perform calculations or analysis on each group separately.

How can you group data in Pandas based on a single column?

Answer: You can group data in Pandas based on a single column by using the groupby() function and specifying the column to group on.

What is the result of a groupby operation in Pandas?

Answer: The result of a groupby operation in Pandas is a GroupBy object, which represents a collection of groups.

How can you perform calculations on grouped data in Pandas?

Answer: You can perform calculations on grouped data in Pandas by applying an aggregation function, such as sum(), mean(), count(), etc., to the GroupBy object.

What is the purpose of the agg() function in Pandas?

Answer: The agg() function in Pandas is used to apply multiple aggregation functions to grouped data and obtain a summarized result.

How can you group data in Pandas based on multiple columns?

Answer: You can group data in Pandas based on multiple columns by passing a list of column names to the groupby() function.

What is the difference between the size() and count() functions in Pandas groupby?

Answer: The size() function in Pandas groupby returns the number of rows in each group, while the count() function returns the number of non-null values in each group.
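
The difference only shows up when a group contains missing values, as in this small illustrative example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "b"],
    "v": [1.0, np.nan, 3.0],  # group "a" has one missing value
})

sizes = df.groupby("g").size()         # rows per group, NaN included
counts = df.groupby("g")["v"].count()  # non-null values of "v" per group

print(sizes.to_dict())   # {'a': 2, 'b': 1}
print(counts.to_dict())  # {'a': 1, 'b': 1}
```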

How can you rename the columns of a grouped DataFrame in Pandas?

Answer: You can rename the columns of a grouped DataFrame in Pandas by using the rename() function or by directly assigning new column names to the columns attribute.

What is the purpose of cross tabulation in Pandas?

Answer: Cross tabulation in Pandas is used to compute a cross-tabulation table that shows the frequency distribution of variables across different dimensions.

How can you perform cross tabulation in Pandas?

Answer: You can perform cross tabulation in Pandas by using the crosstab() function and specifying the variables to cross-tabulate.

What is the purpose of the margins parameter in the crosstab() function?

Answer: The margins parameter in the crosstab() function is used to include row and column totals in the resulting cross-tabulation table.
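
A sketch of crosstab() with margins=True on hypothetical survey data; the totals appear in a row and column labeled All:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "answer": ["yes", "no", "no", "no", "yes"],
})

table = pd.crosstab(df["gender"], df["answer"], margins=True)

print(table.loc["F", "yes"])    # 2
print(table.loc["M", "All"])    # 2 (row total)
print(table.loc["All", "All"])  # 5 (grand total)
```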

How can you customize the aggregation function used in a groupby operation?

Answer: You can customize the aggregation function used in a groupby operation by defining a custom function and passing it to the agg() function.

What is the purpose of the transform() function in Pandas groupby?

Answer: The transform() function in Pandas groupby is used to perform group-wise transformations on the data, returning an object that is the same shape as the original DataFrame.
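
As an example of a group-wise transformation, this sketch subtracts each group's mean from its rows; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "b", "b"],
    "v": [1.0, 3.0, 10.0, 20.0],
})

# One value per original row: the mean of that row's group.
group_mean = df.groupby("g")["v"].transform("mean")
df["centered"] = df["v"] - group_mean

print(group_mean.tolist())      # [2.0, 2.0, 15.0, 15.0]
print(df["centered"].tolist())  # [-1.0, 1.0, -5.0, 5.0]
```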

How can you filter data based on group-specific conditions in Pandas groupby?

Answer: You can filter data based on group-specific conditions in Pandas groupby by using the filter() function and specifying the condition to apply on each group.
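
A minimal sketch, assuming we want to keep only groups whose values sum to more than 10 (the threshold and data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "b", "b"],
    "v": [1.0, 3.0, 10.0, 20.0],
})

# The function receives each group as a DataFrame and returns True to keep it.
large_groups = df.groupby("g").filter(lambda group: group["v"].sum() > 10)

print(large_groups["g"].unique().tolist())  # ['b']
print(len(large_groups))                    # 2
```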

What is the purpose of the as_index parameter in the groupby() function?

Answer: The as_index parameter in the groupby() function is used to control whether the grouped columns should be included as part of the DataFrame’s index.

Data Visualization

Univariate Analysis

What is univariate analysis in statistics?

Answer: Univariate analysis is a statistical analysis technique that focuses on examining and describing a single variable at a time.

What are the common measures of central tendency used in univariate analysis?

Answer: The common measures of central tendency used in univariate analysis are mean, median, and mode.

How is the mean calculated in univariate analysis?

Answer: The mean in univariate analysis is calculated by summing up all the values in the dataset and dividing the sum by the total number of values.

What does the median represent in univariate analysis?

Answer: The median in univariate analysis represents the middle value of a dataset when it is arranged in ascending or descending order.

How is the mode determined in univariate analysis?

Answer: The mode in univariate analysis is the value or values that appear most frequently in the dataset.

What is the purpose of the histogram in univariate analysis?

Answer: The purpose of a histogram in univariate analysis is to visually represent the distribution of a single variable by dividing it into bins and displaying the frequency or density of observations in each bin.

How is the skewness of a variable calculated in univariate analysis?

Answer: Skewness in univariate analysis is a measure of the asymmetry of the distribution and can be calculated using statistical formulas or software functions.
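
As one example of a software function, a pandas Series has a skew() method. The data below is made up so that one sample has a long right tail and the other a long left tail:

```python
import pandas as pd

right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 20])  # one large value on the right
left_tailed = -right_tailed                        # mirror image: long left tail

right_skew = right_tailed.skew()
left_skew = left_tailed.skew()

print(right_skew > 0)  # True
print(left_skew < 0)   # True
```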

What does a positive skewness value indicate in univariate analysis?

Answer: A positive skewness value in univariate analysis indicates that the distribution is skewed to the right, with a long tail on the right side.

What does a negative skewness value indicate in univariate analysis?

Answer: A negative skewness value in univariate analysis indicates that the distribution is skewed to the left, with a long tail on the left side.

How is the kurtosis of a variable calculated in univariate analysis?

Answer: Kurtosis in univariate analysis describes how heavy the tails of a distribution are relative to a normal distribution (it is often loosely described as peakedness or flatness). It can be calculated using statistical formulas or software functions.

What does a high kurtosis value indicate in univariate analysis?

Answer: A high kurtosis value in univariate analysis indicates a distribution with heavy tails and a peaked center, suggesting a more extreme distribution compared to a normal distribution.

What does a low kurtosis value indicate in univariate analysis?

Answer: A low kurtosis value in univariate analysis indicates a distribution with lighter tails and a flatter center, suggesting a less extreme distribution compared to a normal distribution.

How can outliers be identified in univariate analysis?

Answer: Outliers in univariate analysis can be identified using statistical methods such as the z-score or interquartile range (IQR).
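
The IQR method can be sketched as follows. The 1.5 multiplier is a common convention rather than a fixed rule, and the data is illustrative:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print(q1, q3)    # 11.0 12.5
print(outliers)  # [100]
```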

What is the purpose of box plots in univariate analysis?

Answer: Box plots in univariate analysis provide a visual representation of the distribution, median, quartiles, and potential outliers of a variable.

How can the spread or variability of a variable be measured in univariate analysis?

Answer: The spread or variability of a variable in univariate analysis can be measured using statistical measures such as the range, variance, and standard deviation.

Bivariate Analysis

What is bivariate analysis in statistics?

Answer: Bivariate analysis is a statistical analysis technique that focuses on examining the relationship between two variables.

What are the types of variables used in bivariate analysis?

Answer: The types of variables used in bivariate analysis can be categorical (nominal or ordinal) or continuous (interval or ratio).

How can the relationship between two categorical variables be analyzed in bivariate analysis?

Answer: The relationship between two categorical variables can be analyzed using contingency tables and measures like chi-square tests or Cramér’s V.

How can the relationship between a categorical variable and a continuous variable be analyzed in bivariate analysis?

Answer: The relationship between a categorical variable and a continuous variable can be analyzed using techniques like t-tests, ANOVA, or non-parametric tests.

What does a positive correlation coefficient indicate in bivariate analysis?

Answer: A positive correlation coefficient indicates that the two variables tend to move in the same direction: as one variable increases, the other tends to increase as well.

How can the strength of a correlation be determined in bivariate analysis?

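Answer: The strength of a correlation in bivariate analysis is determined by the absolute value of the correlation coefficient, which ranges from -1 to +1. Values close to -1 or +1 indicate a strong relationship, while values close to 0 indicate a weak or no linear relationship.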

What is the purpose of scatter plots in bivariate analysis?

Answer: Scatter plots in bivariate analysis are used to visually represent the relationship between two continuous variables, with each data point plotted as a point on the graph.

What is a correlation matrix in bivariate analysis?

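Answer: A correlation matrix is a table that shows the correlation coefficient for every pair of variables in a dataset. Each cell holds the correlation between the row variable and the column variable, and the diagonal entries are 1.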

How can the relationship between two variables be analyzed when one variable is continuous and the other is categorical?

Answer: The relationship between a continuous variable and a categorical variable can be analyzed using techniques like box plots, violin plots, or ANOVA.

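What is a scatterplot matrix in bivariate analysis?

Answer: A scatterplot matrix is a grid of scatter plots in which each panel plots one pair of variables against each other, so that all pairwise relationships in a dataset can be inspected at once.

How can outliers be identified in bivariate analysis?

Answer: Outliers in bivariate analysis can be identified using techniques such as scatter plots, where data points that deviate significantly from the overall pattern may be considered outliers.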

What is a residual plot in bivariate analysis?

Answer: A residual plot in bivariate analysis is used to examine the residuals (the differences between the observed and predicted values) of a regression model to check for any patterns or deviations.

What is the purpose of cross-tabulation in bivariate analysis?

Answer: Cross-tabulation in bivariate analysis is used to summarize and compare the distribution of one variable across different categories of another variable.

How can the strength and direction of a relationship between two variables be assessed in bivariate analysis?

Answer: The strength and direction of a relationship between two variables in bivariate analysis can be assessed using correlation coefficients such as Pearson’s correlation or Spearman’s rank correlation.
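
The sketch below compares the two coefficients on made-up data with a perfectly monotonic but non-linear relationship. Spearman's coefficient is computed here as Pearson's correlation on the ranks, which is its standard definition:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 2  # monotonic but not linear

pearson = x.corr(y)                  # Pearson is the default method
spearman = x.rank().corr(y.rank())   # Pearson on ranks = Spearman

print(pearson < 1.0)         # True: the relationship is not perfectly linear
print(round(spearman, 6))    # 1.0: the relationship is perfectly monotonic
```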

Multivariate Analysis

What is multivariate analysis?

Answer: Multivariate analysis is a statistical analysis technique that focuses on examining the relationship between multiple variables simultaneously.

What is the difference between univariate and multivariate analysis?

Answer: Univariate analysis involves analyzing a single variable, while multivariate analysis involves analyzing multiple variables and studying their interrelationships.

What are the types of multivariate analysis techniques?

Answer: The types of multivariate analysis techniques include factor analysis, cluster analysis, discriminant analysis, and principal component analysis.

How does factor analysis work in multivariate analysis?

Answer: Factor analysis is used to identify underlying factors or latent variables that explain the common variance among a set of observed variables.

How does discriminant analysis work in multivariate analysis?

Answer: Discriminant analysis is used to determine which variables discriminate between two or more groups or categories.

What is principal component analysis (PCA) in multivariate analysis?

Answer: Principal component analysis is a dimensionality reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components.
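
A minimal sketch of PCA using a singular value decomposition in Numpy. This is one common way to compute it, not the only one; the synthetic data and variable names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Three variables; the second is strongly correlated with the first.
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# 1. Center each variable at zero.
centered = data - data.mean(axis=0)

# 2. The rows of Vt are the principal directions, ordered by variance explained.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained_variance = S ** 2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 3. Project onto the first two components to reduce 3 variables to 2.
scores = centered @ Vt[:2].T

print(scores.shape)      # (200, 2)
print(explained_ratio)   # share of total variance per component
```

The resulting component scores are uncorrelated with each other, which can be checked with np.corrcoef(scores.T).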

What is the purpose of multivariate analysis of variance (MANOVA)?

Answer: Multivariate analysis of variance is used to simultaneously analyze the differences between groups across multiple dependent variables.

What is canonical correlation analysis in multivariate analysis?

Answer: Canonical correlation analysis is used to measure the relationship between two sets of variables and determine the maximum correlation between them.

What is the purpose of correspondence analysis in multivariate analysis?

Answer: Correspondence analysis is a technique used to visualize and analyze the associations between categorical variables in a contingency table.

How can you interpret the results of a multivariate analysis?

Answer: The interpretation of multivariate analysis results involves understanding the relationships between variables, identifying patterns or clusters, and making inferences or predictions based on the analysis.

What are the assumptions of multivariate analysis?

Answer: The assumptions of multivariate analysis include normality, linearity, homoscedasticity, and absence of multicollinearity among the variables.

How can you handle missing data in multivariate analysis?

Answer: Missing data in multivariate analysis can be handled through techniques such as imputation, deletion of missing cases, or using statistical models that accommodate missing data.

What is the difference between exploratory and confirmatory multivariate analysis?

Answer: Exploratory multivariate analysis involves exploring and discovering patterns or relationships in the data, while confirmatory multivariate analysis tests pre-defined hypotheses or models.

How can you assess the significance of findings in multivariate analysis?

Answer: The significance of findings in multivariate analysis can be assessed using statistical tests such as multivariate analysis of variance (MANOVA), chi-square tests, or hypothesis testing.