by Cecelia Shao


It seems like everyone wants to be a data scientist these days, from PhD students to data analysts to your old college roommate who keeps messaging you on LinkedIn to ‘grab coffee’.


Perhaps you’ve had the same inkling that you should at least explore some data science positions and see what the hype is about. Maybe you’ve seen articles like Vicki Boykis’s, which states:

What is becoming clear is that, in the late stage of the hype cycle, data science is asymptotically moving closer to engineering, and the moving forward are less visualization and statistics-based, and …:


Concepts like unit testing and continuous integration rapidly found its way into the jargon and the toolset commonly used by data scientist and numerical scientist working on ML engineering.

or like Tim Hopper’s:


What’s not clear is how you can leverage your experience as a software engineer into a data science position. Some other questions you might have are:


What should I prioritize learning?


Are there best practices or tools that are different for data scientists?


Will my current skill set carry over to a data science role?


This article will provide a background on the data scientist role and why your background might be a good fit for data science, plus tangible stepwise actions that you, as a developer, can take to ramp up on data science.


Data Scientist versus Data Engineer

First things first, we should distinguish between two complementary roles: Data Scientist versus Data Engineer. While both of these roles handle machine learning models, their interaction with these models, as well as the requirements and nature of the work, varies widely.

Note: The Data Engineer role that is specialized for machine learning can also appear in job descriptions as ‘Software Engineer, Machine Learning’ or ‘Machine Learning Engineer’.

As part of , data scientists will perform the statistical analysis required to determine which machine learning approach to use, then begin prototyping and building out those models.


Machine learning engineers will often collaborate with data scientists before and after this modeling process: (1) building data pipelines to feed data into these models and (2) designing an engineering system that serves these models and ensures continuous model health.


The diagram below is one way to view this continuum of skills:


There is a wealth of online resources on the difference between Data Scientists and Data Engineers — make sure to check out:


As a disclaimer, this article primarily covers the Data Scientist role, with some nods towards the Machine Learning Engineering side (especially relevant if you’re looking at a position in a smaller company where you might have to serve as both). If you’re interested in learning how to transition into a Data Engineer or Machine Learning Engineer role, let us know in the comments below!

Your advantage as a developer

To everyone’s detriment, classes around machine learning like ‘Introduction to Data Science in Python’ or Andrew Ng’s Coursera course do not cover concepts and best practices from software engineering like unit testing, writing modular reusable code, CI/CD, or version control. Even some of the most advanced machine learning teams still do not use these practices for their machine learning code, leading to a disturbing trend…


Pete Warden described this trend as ‘’:


we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.


While you may not see these ‘software engineering’ skills explicitly stated in data scientist job descriptions, already having a good grasp of them will make you far more effective as a data scientist. They will also come in handy when it’s time to answer programming questions during your data science interviews.
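To make the habits mentioned above concrete, here is a minimal sketch of a small, reusable preprocessing function with tests kept right next to it. The `standardize` helper and its test cases are hypothetical, written only for illustration; in a real project you would more likely rely on an established library implementation and a test runner such as pytest.

```python
import math


def standardize(values):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    if not values:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    if variance == 0:
        # A constant column carries no signal; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


def test_zero_mean_unit_variance():
    scaled = standardize([1.0, 2.0, 3.0, 4.0])
    assert abs(sum(scaled) / len(scaled)) < 1e-9
    assert abs(sum(s * s for s in scaled) / len(scaled) - 1.0) < 1e-9


def test_constant_column_maps_to_zeros():
    assert standardize([5.0, 5.0]) == [0.0, 0.0]


def test_empty_input_is_rejected():
    try:
        standardize([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


if __name__ == "__main__":
    # Tiny stand-in for a test runner so the file can be run directly.
    test_zero_mean_unit_variance()
    test_constant_column_maps_to_zeros()
    test_empty_input_is_rejected()
    print("all tests passed")
```

Because the function takes plain inputs and returns plain outputs, it can be checked automatically on every commit, which is the kind of practice a CI pipeline builds on.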


For some interesting perspective from the other side, check out ’s piece on ‘’ on skills that he recommends data scientists should learn to “write better code, interact better with software developers, and ultimately save you time and headaches”.


Ramping up on data science

It’s great that your software engineering background gives you a good foundation, but what’s the next step towards becoming a data scientist? Josh Wills’ tongue-in-cheek tweet on the definition of a data scientist is surprisingly accurate:

It hints at one of the topics you should catch up on if you’re interested in pursuing a data scientist role or career: statistics. In this next section, we’ll cover great resources for:


  • Building ML-specific knowledge


  • Building industry knowledge


  • Tools in the ML stack


  • Skills and qualifications


Building ML-specific knowledge

It’s most effective to build a combination of theory-based knowledge around probability and statistics as well as applied skills in things like data wrangling or training models on GPUs/distributed compute.


One way to frame the knowledge you’re gaining is to reference it against the machine learning workflow.


See from Skymind AI


Here we list some of the best resources you can find on machine learning. An exhaustive list would be impossible, and to save space (and reading time) we left out very popular resources such as Andrew Ng’s Coursera course or Kaggle.



  • (free courses that teach very applied skills across Practical Deep Learning for Coders, Cutting Edge Deep Learning for Coders, Computational Linear Algebra, and Introduction to Machine Learning for Coders)


  • Khan Academy

  • and YouTube channel

  • Udacity courses (including )


  • track


Textbooks: *tried to find free PDFs online for most of these*




  • (for a good starting point, see )


  • (for computer vision)


Meetups: *primarily NYC-based ones*


For a cool starting point, check out Will Wolf’s ‘’ on how you can structure your time across studying specific topics and working on projects to showcase expertise in a low-cost remote location.

Building industry-specific knowledge

If you have an inkling that you would like to work in a specific industry such as healthcare, financial services, consumer goods, or retail, it is invaluable to catch up on the pain points and developments of that industry as they relate to data and machine learning.


One pro tip: scan the websites of vertical-specific AI startups to see how they position their value proposition and where machine learning comes into play. This will give you ideas for specific areas of machine learning to study and topics for projects to showcase your work.

We can walk through an example: let’s say I’m interested in working in healthcare.


  1. Through a quick Google search for “machine learning healthcare”, I found this list from Healthcareweekly.com on ‘’

You can also do quick searches on or with “healthcare” as a keyword


2. Let’s take one of the companies featured on the list, BenevolentAI, as an example.


3. BenevolentAI’s website states:


We are an AI company with end-to-end capability from early drug discovery to late-stage clinical development. BenevolentAI combines the power of computational medicine and advanced AI with the principles of open systems and cloud computing to transform the way medicines are designed, developed, tested and brought to market.

We built the Benevolent Platform to better understand disease and to design new, and improve existing treatments, from vast quantities of biomedical information. We believe our technology empowers scientists to develop medicines faster and more cost-efficiently.

A new research paper is published every 30 seconds yet scientists currently only use a fraction of the knowledge available to understand the cause of disease and propose new treatments. Our platform ingests, ‘reads’ and contextualises vast quantities of information drawn from written documents, databases and experimental results. It is able to make infinitely more deductions and inferences across these disparate, complex data sources, identifying and creating relationships, trends and patterns, that would be impossible for a human being to make alone.

4. Immediately you can see that BenevolentAI is using natural language processing (NLP) and is probably working with knowledge graphs, since it is identifying relationships between diseases and treatment research.


5. If you check BenevolentAI’s career page, you can see that they’re hiring for a Senior Machine Learning Researcher. This is a senior role, so it’s not a perfect example, but take a look at the skills and qualifications they’re asking for below:



  • natural language processing, knowledge graph inference, active learning and biochemical modeling

  • structured and unstructured data sources

  • Bayesian model approaches

  • knowledge of modern tools for ML


This should give you some steps for what to approach next:


  • working with structured data

  • working with unstructured data

  • classifying relationships in knowledge graphs (see a good resource )


  • learning Bayesian probability and modeling approaches

  • working on an NLP project (that is, with text data)
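As a starting point for a first project in this direction, the sketch below turns a few sentences into a toy disease-to-treatment graph by counting which terms appear together. Everything in it is an assumption made for illustration: the two small vocabularies, the sample sentences, and the co-occurrence heuristic itself, which is far simpler than the knowledge graph inference a company like BenevolentAI describes.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical vocabularies; a real project would draw these from a curated ontology.
DISEASES = {"asthma", "diabetes"}
TREATMENTS = {"insulin", "metformin", "salbutamol"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_edges(sentences):
    """Count how often each (disease, treatment) pair shares a sentence."""
    edges = Counter()
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        for disease, treatment in product(tokens & DISEASES, tokens & TREATMENTS):
            edges[(disease, treatment)] += 1
    return edges


# Made-up sample sentences standing in for unstructured text such as abstracts.
sample_sentences = [
    "Metformin remains a common first therapy for type 2 diabetes.",
    "Insulin dosing in diabetes was compared with metformin alone.",
    "Salbutamol provided rapid relief in acute asthma.",
]

edges = cooccurrence_edges(sample_sentences)
for (disease, treatment), count in edges.most_common():
    print(f"{disease} -> {treatment}: {count} sentence(s)")
```

The edge counts are structured data derived from unstructured text, so even a toy like this touches several items from the list above and gives you something to extend with a proper NLP library later.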


We’re not recommending that you apply to the companies you find through your search. Rather, look at how they describe their customers’ pain points, their value propositions, and the skills they list in their job descriptions, and use that to guide your research.


Tools in the ML stack

In the BenevolentAI Senior Machine Learning Researcher job description, they ask for “knowledge of modern tools for ML, such as Tensorflow, PyTorch, etc…”


Learning these modern tools for ML can seem daunting since the space is always changing. To break up the learning process into manageable pieces, remember to anchor your thinking around the machine learning workflow from above: “What tool can help me with this part of the workflow?”

To see which tools accompany each step of this machine learning workflow, check out ’s ‘’ which covers tools like , , and .


Tactically speaking, and are the most common programming languages data scientists use, and you will encounter add-on packages designed for data science applications, such as and , and matplotlib. These languages are interpreted, rather than compiled, leaving the data scientist free to focus on the problem rather than nuances of the language. It’s worth investing time learning object-oriented programming to understand the implementation of data structures as classes.
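For a sense of what “data structures as classes” can look like, here is a minimal sketch in Python. The `RunningStats` class is a hypothetical example, not part of any library mentioned here; it keeps a running mean and sample variance using Welford's online update, so values can be fed in one at a time without storing them all.

```python
class RunningStats:
    """Running mean and sample variance, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold a new observation into the running statistics (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 until two values are seen."""
        if self.count < 2:
            return 0.0
        return self._m2 / (self.count - 1)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(f"n={stats.count} mean={stats.mean:.3f} variance={stats.variance:.3f}")
```

The class bundles state (`count`, `mean`) with the operations that keep it consistent, which is the same pattern the array and table types in the data science packages follow at a much larger scale.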


To catch up on ML frameworks like TensorFlow, Keras, and PyTorch, make sure to go to their documentation and try implementing their tutorials end-to-end.


At the end of the day, you want to make sure that you’re building out projects that showcase these modern tools for data collection and wrangling, machine learning experiment management, and modeling.


For some inspiration for your projects, check out ’s piece on ‘’


Skills and qualifications

We left this section for last since it aggregates much of the information from the previous sections, but is specifically geared towards data science interview preparation. There are six main topics during a data scientist interview:


  1. Coding

  2. Product

  3. SQL

  4. A/B testing

  5. Machine Learning

  6. Probability (see a good definition vs. Statistics )


You’ll notice that one of these topics is not like the others (Product). For data science positions, as well as business metrics and impact is crucial.
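Of the topics listed above, A/B testing is one where your coding and statistics preparation meet. The sketch below is one common way to compare two conversion rates, a pooled two-proportion z-test, written with only the standard library. The function name and the visitor and conversion counts are made up for illustration, and an interviewer may well expect a different test or a discussion of its assumptions (independent users, reasonably large samples).

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 2,000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Being able to explain each line, especially why the pooled rate is used for the standard error, tends to matter more than remembering the formula.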


Some useful aggregations of data science interview questions:



As you interview for roles, you’ll come across companies that are still building up their data infrastructure or may not have a solid understanding of how their data science team fits into the larger company value.

These companies may still be climbing up this hierarchy of needs below.


For some expectation setting around data science interviews, I would recommend reading Tim Hopper’s piece on ‘’


Thanks for reading! We hope this guide helps you understand whether data science is a career you should consider and how to begin that journey!

