[Introduction] Job-hunting season will be upon us again before long. The seasoned candidates have nothing to fear, of course, but a little extra preparation never hurts. GitHub user kojino has compiled 120 common data science interview questions; we hope they give your job search a boost.
Author | kojino
Compiled by | Xiaowen
GitHub repository:
https://github.com/kojino/120-Data-Science-Interview-Questions
【Communication】
5 questions
A/B testing, or more broadly multivariate testing, is the testing of different elements of a user's experience to determine which variation helps the business achieve its goal more effectively (e.g. increasing conversions). The elements tested can be copy on a web page, button colors, different user interfaces, different email subject lines, calls to action, offers, etc.
https://www.quora.com/What-is-a-confidence-interval-in-laymans-terms
【Data Analysis】
27 questions
R^2 is a goodness-of-fit measure: the variance explained by the regression divided by the total variance.
The more predictors you add, the higher R^2 becomes,
hence use adjusted R^2, which adjusts for the degrees of freedom,
or use error metrics on held-out data instead. A small sketch of both R^2 variants follows.
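A minimal sketch (numpy assumed; y, y_hat, and p are placeholders for the targets, the predictions, and the number of predictors):

import numpy as np

def r2_and_adjusted_r2(y, y_hat, p):
    # Plain R^2 and adjusted R^2 for a fit that used p predictors.
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)              # residual (unexplained) variance
    ss_tot = np.sum((y - np.mean(y)) ** 2)         # total variance
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra predictors
    return r2, adj_r2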
High dimensionality makes clustering hard, because having lots of dimensions means that everything is "far away" from each other.
For example, to cover a fixed fraction of the data's volume, we need to capture a wider and wider range of each variable as the number of variables increases.
All samples are close to the edge of the sample space, and this is bad news because prediction is much more difficult near the edges of the training sample.
The sampling density decreases exponentially as p increases and hence the data becomes much more sparse without significantly more data.
One remedy is to run PCA to reduce the dimensionality first, for example as sketched below.
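A minimal sketch (scikit-learn assumed; X is a placeholder for the raw feature matrix, and 10 components is an illustrative choice): standardize, project onto the top principal components, then cluster in the reduced space.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X_scaled = StandardScaler().fit_transform(X)               # X: (n_samples, n_features)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)   # keep the top 10 directions
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_reduced)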
Statistically,
It depends on the quality of your data, for example, if your data is biased, just getting more data won’t help.
It depends on your model. If your model suffers from high bias, getting more data won’t improve your test results beyond a point. You’d need to add more features, etc.
Practically,
Also, there's a tradeoff between having more data and the additional storage, computational power, and memory it requires. Hence, always think about the cost of having more data.
Data sets have errors. You won't find them all, but you might find some: that 212-year-old man, that 9-foot-tall woman.
Variables can have skewness, outliers, etc. Then the arithmetic mean might not be useful, which means the standard deviation isn't useful either.
Variables can be multimodal! If a variable is multimodal, then anything based on its mean or median is going to be suspect.
Proper exploratory data analysis.
In every data analysis task, there's the exploratory phase where you're just graphing things, testing things on small sets of the data, summarizing simple statistics, and getting rough ideas of what hypotheses you might want to pursue further.
Then there's the exploitation phase, where you look deeply into a set of hypotheses.
The exploratory phase will generate lots of possible hypotheses, and the exploitation phase will let you really understand a few of them. Balance the two and you'll avoid wasting time on many of the things that end up meaningless (though not all of them).
Data analysis is a repeated cycle of setting up a new hypothesis and trying to refute the null hypothesis.
The scientific method is eminently inductive: we elaborate a hypothesis, test it and refute it or not. As a result, we come up with new hypotheses which are in turn tested and so on. This is an iterative process, as science always is.
Run the features through a Gradient Boosting Machine or Random Forest to generate plots of relative importance and information gain for each feature in the ensemble (a sketch follows).
Look at the variables added in forward variable selection.
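A rough scikit-learn sketch (X_train, y_train, and feature_names are placeholders):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
ranked = sorted(zip(feature_names, rf.feature_importances_), key=lambda t: t[1], reverse=True)
for name, score in ranked[:10]:            # ten most informative features
    print(f"{name}: {score:.3f}")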
Remove rows with missing values - this works well if 1) the values are missing at random (see Vinay Prabhu's answer for more details) and 2) you don't lose too much of the dataset after doing so.
Build another predictive model to predict the missing values - This could be a whole project in itself, so simple techniques are usually used here.
Use a model that can incorporate missing data - like a random forest or any tree-based method. (A small imputation sketch follows.)
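A simple middle-ground sketch (pandas assumed; 'data.csv' is a placeholder, numeric columns assumed): fill gaps with the column median and keep an indicator column so the model can still learn from the missingness itself.

import pandas as pd

df = pd.read_csv("data.csv")                           # placeholder dataset
for col in df.columns:
    if df[col].isna().any():
        df[col + "_was_missing"] = df[col].isna().astype(int)
        df[col] = df[col].fillna(df[col].median())     # median is robust to outliers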
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.
Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the efficiency of extrapolating the fitted model to new data provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.
principal component regression
PCA
ridge / lasso / elastic net regression
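For example, a ridge sketch (scikit-learn assumed; the alpha grid and X_train, y_train are illustrative): the L2 penalty keeps correlated coefficients from blowing up in opposite directions.

from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train, y_train)
print(ridge.alpha_)    # penalty strength chosen by cross-validation
print(ridge.coef_)     # shrunken, more stable coefficients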
Univariate feature selection, where a statistical test is applied to each feature individually. You retain only the best features according to the test outcome scores.
"Recursive Feature Elimination":
First, train a model with all the features and evaluate its performance on held-out data.
Then drop, say, the 10% weakest features (e.g. the features with the smallest absolute coefficients in a linear model) and retrain on the remaining features.
Iterate until you observe a sharp drop in the predictive accuracy of the model.
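A sketch of the same idea with scikit-learn's built-in recursive feature elimination (placeholder data; step=0.1 drops the weakest 10% of features each round):

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

selector = RFECV(LogisticRegression(max_iter=1000), step=0.1, cv=5)
selector.fit(X_train, y_train)
print(selector.support_)       # boolean mask of the features that survived
print(selector.n_features_)    # number of features kept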
p > n.
If some of the explanatory variables are perfectly correlated (positively or negatively) then the coefficients would not be unique.
The dataset might be heterogeneous, in which case it is recommended to cluster the data into different subsets wisely and then fit different models to different subsets. Alternatively, use models that can handle heterogeneity nicely, such as nonparametric models (trees).
The assumption is that a group of weak learners can be combined to form a strong learner.
Hence the combined model is expected to perform better than an individual model.
Assumptions:
average out biases
reduce variance
Bagging works because some underlying learning algorithms are unstable: slightly different inputs lead to very different outputs. If you can take advantage of this instability by running multiple instances, it can be shown that the reduced instability leads to lower error. If you want to understand why, the original bagging paper ( http://www.springerlink.com/cont...) has a section called "why bagging works".
Boosting works because of the focus on better defining the "decision edge". By reweighting examples near the margin (the positive and negative examples) you get a reduced error (see http://citeseerx.ist.psu.edu/vie...)
Use the outputs of your models as inputs to a meta-model.
For example, if you're doing binary classification, you can use all the probability outputs of your individual models as inputs to a final logistic regression (or any model, really) that can combine the probability estimates.
One very important point is to make sure that the outputs of your models are out-of-sample predictions: the predicted value for any row in your dataframe should NOT depend on the actual value for that row (see the stacking sketch below).
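A stacking sketch (scikit-learn assumed; the model choices are illustrative). cross_val_predict produces out-of-fold probabilities, so no row's meta-feature depends on its own label:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

base_models = [RandomForestClassifier(n_estimators=200, random_state=0),
               GradientBoostingClassifier(random_state=0)]
meta_features = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
meta_model = LogisticRegression().fit(meta_features, y_train)   # combines the base probabilities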
If one room shows much heavier usage than the others, that room is over-utilized. It may help to account for room capacity and normalize the data.
Like PageRank, with each user corresponding to a webpage and following a user corresponding to linking to that page.
One way you could do this is by storing a "skill level" for each user and a "difficulty level" for each problem. We assume that the probability that a user solves a problem depends only on the skill of the user and the difficulty of the problem. Then we maximize the likelihood of the data to find the hidden skill and difficulty levels.
The Rasch model for dichotomous data takes the form:
Pr(X_ni = 1) = exp(β_n − δ_i) / (1 + exp(β_n − δ_i)),
where β_n is the ability of person n and δ_i is the difficulty of item i.
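A tiny sketch of that success probability (numpy assumed):

import numpy as np

def p_solve(beta_user, delta_item):
    # Rasch model: probability that a user with skill beta solves an item of difficulty delta.
    return 1.0 / (1.0 + np.exp(-(beta_user - delta_item)))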
Some people would take the mean rank of each sushi. If I wanted something simple, I would use the median, since ranks are (strictly speaking) ordinal and not interval, so adding them is a bit risky (but people do it all the time and you probably won't be far wrong).
Collaborative filtering: you have your votes, and we can calculate the similarity to each representative and select the most similar one.
For the liberal and Republican parties, find the mean vote vector and pick the representative closest to that center point.
Reduce the text to a more compact form (e.g. fingerprinting, bag of words), then compare it with other texts by calculating a similarity score (see the sketch below).
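A bag-of-words similarity sketch (scikit-learn assumed; the two documents are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["text of the first article ...", "text of the second article ..."]   # placeholders
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
print(cosine_similarity(vectors[0], vectors[1]))      # close to 1 means very similar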
KNN
choose a small value of k that still has a low SSE (elbow method)
https://bl.ocks.org/rpgove/0060ff3b656618e9136b
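A minimal elbow-method sketch (scikit-learn and matplotlib assumed; X is a placeholder data matrix): plot the within-cluster SSE against k and pick the bend.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), sse, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster SSE")
plt.show()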
collaborative filtering
【Predictive Modeling】
19 questions
Start by fitting a simple model (multivariate regression, logistic regression), do some feature engineering accordingly, and then try some more complicated models. Always split the dataset into train, validation, and test sets, and use cross-validation to check performance.
Determine if the problem is classification or regression
Favor simple models that run quickly and you can easily explain.
Mention cross validation as a means to evaluate the model.
Plot and visualize the data.
A model that has high training accuracy might have low test accuracy. Without further knowledge, it is hard to know which dataset represents the population data, and thus the generalizability of the algorithm is hard to measure. This should be mitigated by repeatedly splitting into train and test sets (as in cross-validation).
When there is a change in the data distribution, this is called dataset shift. If the train and test data have different distributions, then the classifier will likely overfit to the train data.
This issue can be overcome by using a more general learning method.
This can occur when:
P(y|x) are the same but P(x) are different. (covariate shift)
P(y|x) are different. (concept shift)
The causes can be:
Training samples are obtained in a biased way. (sample selection bias)
Train is different from test because of temporal, spatial changes. (non-stationary environments)
Solution to covariate shift
importance-weighted cross-validation
We can have regularization such as L1 or L2 to reduce variance (increase bias).
Changes to the algorithm:
Use tree-based methods instead of regression methods, as they are more resistant to outliers. For statistical tests, use nonparametric tests instead of parametric ones.
Use robust error metrics such as MAE or Huber Loss instead of MSE.
Changes to the data:
Winsorizing the data
Transforming the data (e.g. log)
Remove them only if you’re certain they’re anomalies not worth predicting
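For the winsorizing option above, a small sketch (numpy assumed; x is a placeholder feature, and the 1st/99th percentile cutoffs are an illustrative choice):

import numpy as np

lo, hi = np.percentile(x, [1, 99])
x_winsorized = np.clip(x, lo, hi)     # extreme values are capped, not dropped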
MSE penalizes outliers more heavily. MAE is more robust in that sense, but it is harder to fit a model with MAE because it is not smoothly differentiable and has no simple closed-form solution. So when there is less variability in the data and the model is computationally easy to fit, we should use MAE; if that's not the case, we should use MSE.
MSE: easier to compute the gradient. MAE: requires linear programming (or subgradient methods) to optimize.
MAE more robust to outliers. If the consequences of large errors are great, use MSE
MSE corresponds to maximizing likelihood of Gaussian random variables
Accuracy: proportion of instances you predict correctly. Pros: intuitive, easy to explain, Cons: works poorly when the class labels are imbalanced and the signal from the data is weak
AUROC: plot FPR on the x-axis and TPR on the y-axis for different thresholds. Given a random positive instance and a random negative instance, the AUC is the probability that the model ranks the positive one higher. Pros: works well for testing the ability to distinguish the two classes. Cons: predictions can't be interpreted as probabilities (because AUC is determined by rankings), so the model's uncertainty can't be explained.
Log-loss/deviance: Pros: an error metric based on probabilities. Cons: very sensitive to confident false positives and false negatives.
When there are more than 2 classes, we can run k binary classifications and add up the log-losses. Some metrics, like AUC, are only applicable in the binary case.
Things to look at: N, P, linearly separable?, features independent?, likely to overfit?, speed, performance, memory usage
Logistic Regression
features roughly linear, problem roughly linearly separable
robust to noise, use l1,l2 regularization for model selection, avoid overfitting
the outputs come as probabilities
efficient and the computation can be distributed
can be used as a baseline for other algorithms
(-) cannot easily handle categorical features
SVM
with a nonlinear kernel, can deal with problems that are not linearly separable
(-) slow to train, for most industry scale applications, not really efficient
Naive Bayes
computationally efficient when P is large by alleviating the curse of dimensionality
works surprisingly well in some cases even when the independence assumption doesn't hold
with word frequencies as features, the independence assumption can be seen as reasonable, so the algorithm is commonly used for text categorization
(-) assumes each feature is conditionally independent of every other feature given the class, which rarely holds exactly
Tree Ensembles
good for large N and large P, can deal with categorical features very well
nonparametric, so no need to worry as much about outliers
GBTs work better, but the parameters are harder to tune
RF works out of the box, but usually performs worse than GBT
Deep Learning
works well for some classification tasks (e.g. image)
used to squeeze something out of the problem
Regularization is useful for reducing variance in the model, i.e. avoiding overfitting. For example, we can use L1 regularization in Lasso regression to penalize large coefficients (a sketch follows).
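A minimal Lasso sketch (scikit-learn assumed; X_train, y_train, and feature_names are placeholders): the L1 penalty drives the coefficients of unhelpful features to exactly zero.

from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5).fit(X_train, y_train)
kept = [(name, coef) for name, coef in zip(feature_names, lasso.coef_) if coef != 0.0]
print(kept)     # only the features that survived the penalty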
When we add irrelevant features, the model's tendency to overfit increases because those features introduce more noise. When two variables are correlated, they can also be harder to interpret in a regression.
curse of dimensionality
adding random noise makes the model more complicated without making it more useful
computational cost
Ask someone for more details.
Build a time series model on the training data with a seven-day cycle, and then apply it to the new data that has only two days of observations.
Ask someone for more details.
Build a regression function to estimate the number of retweets as a function of time t.
To determine whether a single regression function suffices, check whether there are clusters in the trends of retweet counts.
If not, we have to add features to the regression function.
Features + the number of retweets on the first and second days -> predict the seventh day.
https://en.wikipedia.org/wiki/Dynamic_time_warping
We can collect social media data using the Twitter, Facebook, and Instagram APIs. Then, for Twitter for example, we can construct features from each tweet: the tweet date, the number of favorites and retweets, and of course the features created from the tweet content itself. Then use a multivariate time series model to predict the weather.
Ask someone for more details.
We can do so by building a recommendation engine. The simplest approach is to show content that is popular with other users, which is still a valid strategy if, for example, the content consists of news articles. To be more accurate, we can build content-based filtering or collaborative filtering. If there is enough usage data, we can try collaborative filtering and recommend content that similar users have consumed. If there isn't, we can recommend similar items based on a vectorization of the items (content-based filtering).
Find pairs of people who are strongly related but not yet connected in the weighted connection graph.
friend connections (neighbors)
Check-ins: people who are repeatedly at the same location.
same college, workplace
Evaluate by randomly dropping existing connections from the graph and testing whether the algorithm recovers them.
Define similarity as how strong the two people are connected
Given a certain feature, we can calculate the similarity based on
ref. News Feed Optimization
Affinity score: how close the content creator and the users are
Weight: weight for the edge type (comment, like, tag, etc.). Emphasis on features the company wants to promote
Time decay: the older the less important
For each candidate recipient, assign a score for how likely the user is to send them an email.
The rest is feature engineering:
number of past emails, how many responses, the last time they exchanged an email, whether the last email ends with a question mark, features about the other user, etc.
Ask someone for more details.
The people the user has emailed the most in the past, weighted by time decay.
build a master dataset with local demographic information available for each location.
local income levels, proximity to traffic, weather, population density, proximity to other businesses
a reference dataset on local, regional, and national macroeconomic conditions (e.g. unemployment, inflation, prime interest rate, etc.)
any data on the local franchise owner-operators, to the degree the manager
identify a set of KPIs acceptable to the management that had requested the analysis concerning the most desirable factors surrounding a franchise
quarterly operating profit, ROI, EVA, pay-down rate, etc.
run econometric models to understand the relative significance of each variable
run machine learning algorithms to predict the performance of each location candidate
Based on the past frequencies of word sequences, we can construct conditional probabilities over the next words that can appear (an n-gram model). The sequences with the highest conditional probabilities show up as the top candidates (a toy sketch follows the improvements listed below).
To further improve this algorithm,
we can put more weight on past sequences which showed up more recently and near your location to account for trends
show your recent searches given partial data
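A toy bigram sketch of the core idea (the corpus is a placeholder; a real system would use higher-order n-grams, smoothing, recency, and location weights):

from collections import Counter, defaultdict

corpus = ["data science interview", "data science jobs", "data engineering jobs"]  # placeholder
next_word = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        next_word[w1][w2] += 1           # count how often w2 follows w1

def suggest(prev, k=3):
    return [w for w, _ in next_word[prev].most_common(k)]

print(suggest("data"))                   # -> ['science', 'engineering']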
Based on frequency and amount of donations, graduation year, major, etc, construct a supervised regression (or binary classification) algorithm.
Based on the past pickup locations of passengers around the same time of day and day of the week (month, year), construct a model of where pickups are most likely.
Ask someone for more details.
Based on the number of past pickups
account for periodicity (seasonal, monthly, weekly, daily, hourly)
special events (concerts, festivals, etc.) from tweets
Build one feature vector each for team A and team B. Take the difference of the two vectors and use it as the input to predict the probability that team A wins. Train the model using past tournament data and make predictions for the new tournament by running the trained model for each round of the tournament.
Some extensions:
Experiment with different ways of consolidating the two team vectors into one (e.g. concatenating, averaging, etc.)
Consider using an RNN-type model that looks at the time series data.
This is equivalent to making the model more robust to outliers.
See Q3.
【Probability】
19 questions
Let p be the probability that the population eventually dies out and condition on the first generation: p = 1/4 + 1/4·p + 1/2·p^2 => p = 1/2 (taking the smaller root).
1 − (0.8)^4, i.e. the complement of seeing no shooting star in all four 15-minute intervals. Or, we can model it as a Poisson process.
Roll it 3 times: each roll sets one bit of the result.
For each roll, record a 0 if the value is 1-3, and a 1 otherwise. The result is between 0 (000) and 7 (111) and uniformly distributed (3 independent rolls). Repeat the three rolls whenever 0 is obtained; the process then stops on a value from 1 to 7, each equally likely (see the sketch below).
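A sketch of the three-roll scheme just described:

import random

def rand_1_to_7():
    while True:
        bits = [0 if random.randint(1, 6) <= 3 else 1 for _ in range(3)]
        value = 4 * bits[0] + 2 * bits[1] + bits[2]
        if value != 0:          # reject 000 so that 1..7 stay equally likely
            return value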
Flip twice: if HT, output H; if TH, output T; otherwise (HH or TT) flip twice again. HT and TH each occur with probability p(1 − p), so the output is fair.
more than two standard deviations
Plug the value into the CDF of the same random variable.
1/3
The gender ratio is 1:1, and the expected number of children per couple is 2. Let X be the number of children until a girl is born (which happens with probability 1/2); X follows a geometric distribution with parameter 1/2.
The outcome follows a multinomial distribution with n = 12 and k = 3, but the classes are indistinguishable.
the probability of a hash collision: 1-(10!/10^10)
the expected number of hash collisions (keys that land in an already occupied bucket): 10 − 10·(1 − (9/10)^10) ≈ 3.49
the expected number of buckets that are unused: 10·(9/10)^10 ≈ 3.49
Lyfts arrive first: 2!·3!/5! = 1/10
Ubers arrive first: the same, 1/10
100+60-20=140
24C5·(1 + 5·(24 − 5)) / (24C5·24C5) = (1 + 5·19)/24C5 = 96/42504 = 4/1771
1
Shorter. Regression to the mean
less than $3
4/13
Events: F = "picked a fair coin", T = "10 heads in a row"
(1) P(F|T) = P(T|F)P(F)/P(T) (Bayes' formula)
(2) P(T) = P(T|F)P(F) + P(T|¬F)P(¬F) (law of total probability)
Substituting (2) into (1): P(F|T) = P(T|F)P(F)/(P(T|F)P(F) + P(T|¬F)P(¬F)) = 1 / (1 + P(T|¬F)P(¬F)/(P(T|F)P(F)))
Numerically: 1/(1 + 0.001 · 2^10 / 0.999).
With 2^10 ≈ 1000 and 0.999 ≈ 1, this simplifies to roughly 1/2 (a numeric check follows).
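A quick numeric check of that calculation (values from the classic version of the question: 999 fair coins and 1 double-headed coin):

p_fair = 0.999                    # prior probability of picking a fair coin
p_heads10_fair = 0.5 ** 10        # 10 heads in a row with a fair coin
p_heads10_unfair = 1.0            # the double-headed coin always lands heads
posterior = (p_heads10_fair * p_fair) / (
    p_heads10_fair * p_fair + p_heads10_unfair * (1 - p_fair))
print(posterior)                  # about 0.494, i.e. roughly 1/2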
The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
⇒ If the p-value is small, the observed data would be unlikely under the null hypothesis, so we reject it.
【Product Metrics】
15 questions
advertising-driven: Pageviews and daily actives, CTR, CPC (cost per click)
click-ads
display-ads
service-driven: number of purchases, conversion rate
productivity tool: same as premium subscriptions
MOOC: same as premium subscriptions, completion rate
e-commerce: number of purchases, conversion rate, Hourly, daily, weekly, monthly, quarterly, and annual sales, Cost of goods sold, Inventory levels, Site traffic, Unique visitors versus returning visitors, Customer service phone call count, Average resolution time
subscription
churn, CoCA, ARPU, MRR, LTV
premium subscriptions:
heavily on engagement and interaction: uses AU ratios, email summary by type, and push notification summary by type, resurrection ratio
messaging product:
Average Revenue Per Paid User
Average Revenue Per User
Break down the KPIs into their components and find where the change comes from,
then further break down that component KPI by channel, user cluster, etc., and relate the change to any campaigns or shifts in user behavior in that segment.
for similar restaurants (they should define similarity), average increase in revenue gain per coupon, average increase in customers per coupon
define efficiency
rate for each action, duration users stay, CTR for sponsored feed posts
ref. News Feed Optimization
Affinity score: how close the content creator and the users are
Weight: weight for the edge type (comment, like, tag, etc.). Emphasis on features the company wants to promote
Time decay: the older the less important
Run an A/B test on different balance ratios and compare.
There is a gradual, step-function-type scaling mechanism until the imbalance of requests to drivers is alleviated, and then the reverse as too many drivers come online, enticed by the surge pricing structure.
I would bet the algorithm is custom tailored and calibrated to each location as price elasticities almost certainly vary across different cities depending on a huge multitude of variables: income, distance/sprawl, traffic patterns, car ownership, etc. With the massive troves of user data that Uber probably has collected, they most likely have tweaked the algos for each city to adjust for these varying sensitivities to surge pricing. Throw in some machine learning and incredibly rich data and you've got yourself an incredible, constantly-evolving algorithm.
Netflix uses data to estimate the potential market size for an original series before giving it the go-ahead.
subscription based services
【Programming】
19 questions
Recursive programming (solution given in code in the repo).
Store counts of all the hashtags in a dictionary and take the top 10 by value (see the sketch below).
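A sketch of the counting step (tweets is a placeholder list of tweet strings):

from collections import Counter

counts = Counter(token for tweet in tweets
                 for token in tweet.split() if token.startswith("#"))
top_10 = counts.most_common(10)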
Greedy solution (repeatedly add the item with the best value/weight ratio as much as possible, then move on to the next).
Greedy
https://en.wikipedia.org/wiki/Reservoir_sampling
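A reservoir-sampling sketch: keep k items from a stream of unknown length, each with equal probability.

import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # fill the reservoir first
        else:
            j = random.randint(0, i)          # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir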
https://www.quora.com/What-is-the-method-to-calculate-a-square-root-by-hand?redirected_qid=664405
https://en.wikipedia.org/wiki/Newton's_method#Square_root_of_a_number
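A Newton's-method sketch for square roots: iterate x <- (x + a/x)/2 until it stabilizes.

def newton_sqrt(a, tol=1e-10):
    x = a if a >= 1 else 1.0      # any positive starting guess works
    while abs(x * x - a) > tol:
        x = 0.5 * (x + a / x)
    return x

print(newton_sqrt(2))             # ~1.41421356...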
sort then select the highest and the lowest 2.5%
When could it make your algorithms run slower?
Ask someone for more details.
compute in parallel when communication cost < computation cost
ensemble trees
minibatch
cross validation
forward propagation
minibatch
not suitable for online learning
(INNER) JOIN: returns records that have matching values in both tables.
LEFT (OUTER) JOIN: returns all records from the left table, plus the matched records from the right table.
RIGHT (OUTER) JOIN: returns all records from the right table, plus the matched records from the left table.
FULL (OUTER) JOIN: returns all records when there is a match in either the left or the right table.
Change the subquery to a join.
Primary keys are columns whose value combinations must be unique within a table, so that each row can be referenced unambiguously. Foreign keys are columns that reference columns (often primary keys) in other tables.
(Assuming the faculty names live in a FACULTY table keyed by faculty_id:)
SELECT faculty_name
FROM FACULTY c
JOIN (
    SELECT b.faculty_id
    FROM (SELECT course_id FROM COURSES WHERE course_name = 'xxx') a
    JOIN COURSE_FACULTY b ON a.course_id = b.course_id
) d ON c.faculty_id = d.faculty_id
SELECT id, AVG(click_count)
FROM (
    SELECT id, COUNT(click) AS click_count
    FROM IMPRESSIONS
    GROUP BY id, MONTH(date)
) t
GROUP BY id
EMPLOYEES containing: Emp_ID (Primary key) and Emp_Name
EMPLOYEE_DEPT containing: Emp_ID (Foreign key) and Dept_ID (Foreign key)
DEPTS containing: Dept_ID (Primary key) and Dept_Name
SELECT Dept_Name, COUNT(1)
FROM DEPTS a
RIGHT JOIN EMPLOYEE_DEPT b ON a.Dept_ID = b.Dept_ID
GROUP BY Dept_Name
(Use a LEFT JOIN from DEPTS and COUNT(b.Emp_ID) instead if departments with zero employees should also appear.)
【Statistical Inference】
15 questions
Plot the distributions of multiple features for both A and B and make sure that they have the same shape. More rigorously, we can conduct a permutation test to see if the distributions are the same.
MANOVA to compare different means
Verify the sampling algorithm is random.
The user might not act the same way as they would have had they not seen the other bucket. You are essentially adding an additional variable, whether the user peeked at the other bucket, which is not random across groups.
Same as the previous question: the above problem can happen at a larger scale.
Ask someone for more details.
one control, 20 treatment, if the sample size for each group is big enough.
Ways to attempt to correct for this include changing your confidence level (e.g. Bonferroni Correction) or doing family-wide tests before you dive in to the individual metrics (e.g. Fisher's Protected LSD).
lower the variability by modifying the KPI
cap values
percentile metrics
log transform
https://www.quora.com/How-would-you-run-an-A-B-test-if-the-observations-are-extremely-right-skewed
If the experiments are mutually exclusive (each user is in at most one), running them at the same time is OK.
Type I error: rejecting H0 when H0 is true.
Type II error: failing to reject H0 when Ha is true.
For randomly selected listings with more than one picture, hide one randomly chosen picture for group A and show all pictures for group B. Compare the booking rates of the two groups.
Ask someone for more details.
The best way I know to quantify the impact of performance is to isolate just that factor using a slowdown experiment, i.e., add a delay in an A/B test.
A method for parameter optimization (fitting a model). We choose the parameters that maximize the likelihood function, i.e. how likely the observed data are under the model with those parameters.
maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters. MLE can be seen as a special case of the maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters, or as a variant of the MAP that ignores the prior and which therefore is unregularized.
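A toy check of the idea (numpy/scipy assumed; the data is a placeholder sample): for i.i.d. Gaussian observations with known variance, the mu that maximizes the likelihood is simply the sample mean.

import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([1.2, 0.8, 1.5, 0.9, 1.1])                     # placeholder observations
neg_log_likelihood = lambda mu: np.sum((data - mu) ** 2) / 2   # Gaussian with sigma = 1
print(minimize_scalar(neg_log_likelihood).x, data.mean())      # both ~1.1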
For Gaussian mixtures and nonparametric models, a well-defined (closed-form) maximum of the likelihood may not exist.
MAP estimates the parameters that maximize the posterior distribution, which combines the prior distribution with the likelihood of the data. MLE is a special case of MAP in which the prior is an uninformative uniform distribution.
The method of moments (MOM) sets sample moments equal to their population counterparts and solves for the parameters. MOM is not used much anymore because maximum likelihood estimators have a higher probability of being close to the quantities to be estimated and are more often unbiased.
For example, a 95% confidence interval is an interval such that, when intervals are constructed in the same way for repeated samples, the constructed intervals contain the true mean 95% of the time.
if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level.
Unbiasedness means that the expectation of the estimator is equal to the population value we are estimating. This is desirable in inference because the goal is to explain the dataset as accurately as possible. However, this is not always desirable for data analysis or predictive modeling as there is the bias variance tradeoff. We sometimes want to prioritize the generalizability and avoid overfitting by reducing variance and thus increasing bias.
-END-