Intro to Data Science Skills (Part 1)

AndReda Mind
3 min read · May 12, 2021

“Everybody loves a data scientist,” wrote Simon Rogers (2012) in the Guardian. Rogers also traced this newfound love for number crunching to a quote by Google’s Hal Varian, who declared that the sexy job in the next ten years will be statisticians. So the important question here is: what skills should a data scientist have?

I’ll start by talking about the technical and non-technical skills you should work on.

Since data science is not a discipline traditionally taught at universities, contemporary data scientists come from diverse backgrounds such as engineering, statistics, and physics.

In order to succeed as a data scientist you should be:

Curious.

Argumentative.

Judgmental.

Hint: using complicated machine learning algorithms does not always guarantee better performance, so keep it simple.
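For instance, here is a minimal sketch of what "keep it simple" means in practice, assuming scikit-learn is installed (the dataset and model choices are mine, just for illustration): train a simple baseline next to a more complex model and let the numbers decide.

```
# A minimal sketch: compare a simple baseline against a more complex model.
# Assumes scikit-learn; the built-in dataset is just for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Simple model: plain logistic regression.
simple = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# More complex model: a random forest with many trees.
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

print("Logistic regression:", accuracy_score(y_test, simple.predict(X_test)))
print("Random forest:      ", accuracy_score(y_test, forest.predict(X_test)))
```

If the simple model scores about the same, it wins: it is faster, cheaper, and easier to explain.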

The technical requirements:

1- The first skill you need is knowing how to program, or at least having some computational thinking (algorithms and data structures); see the first sketch after this list.

2- You need to know some algebra, at least up to analytic geometry, and hopefully some calculus, basic probability, and basic statistics (see the second sketch after this list).

3- Relational databases (see the third sketch after this list).

4- You should be able to use cloud services (Google Cloud, IBM Cloud, AWS) so you can work with up-to-date tools such as Python, UNIX commands, pandas, Jupyter, and Apache Spark.
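To make the first point concrete, here is a small Python sketch of the kind of computational thinking I mean: a classic binary search over a sorted list (the function name is mine, just for illustration).

```
def binary_search(items, target):
    """Return the index of target in the sorted list items, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2       # middle index of the remaining range
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1           # target must be in the upper half
        else:
            hi = mid - 1           # target must be in the lower half
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # prints 3
```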
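For the second point, a few lines of Python cover the basics of descriptive statistics and probability, using only the standard library (the numbers are example data):

```
import statistics
from math import comb

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Basic descriptive statistics.
print("mean:", statistics.mean(data))     # 5.0
print("stdev:", statistics.pstdev(data))  # population standard deviation: 2.0

# Basic probability: chance of exactly 3 heads in 5 fair coin flips.
p = comb(5, 3) * 0.5**3 * 0.5**2
print("P(3 heads in 5 flips):", p)        # 0.3125
```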
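And for relational databases, Python ships with SQLite, so you can practice SQL without installing anything (the table and column names here are just an example):

```
import sqlite3

# An in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scientists (name TEXT, background TEXT)")
conn.executemany(
    "INSERT INTO scientists VALUES (?, ?)",
    [("Alice", "statistics"), ("Bob", "engineering"), ("Carol", "physics")],
)

for row in conn.execute("SELECT name FROM scientists WHERE background = 'physics'"):
    print(row)  # ('Carol',)
conn.close()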

Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
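A minimal taste of pandas (the data here is made up; in practice you would often start from a file, for example with pd.read_csv on a hypothetical "sales.csv"):

```
import pandas as pd

# Build a small DataFrame in memory.
df = pd.DataFrame({
    "city": ["Cairo", "Giza", "Cairo", "Alexandria"],
    "sales": [120, 95, 140, 80],
})

print(df.head())                           # first rows of the table
print(df.groupby("city")["sales"].sum())   # total sales per city
```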

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
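Once installed (for example with pip install notebook), you mix code, output, and narrative in one document. A typical notebook cell might look like this (a sketch, assuming matplotlib is installed; the chart renders inline right below the cell):

```
# A typical Jupyter cell: the plot appears inline under the code.
import matplotlib.pyplot as plt

xs = range(10)
plt.plot(xs, [x**2 for x in xs])
plt.title("Live code and visualization in one document")
plt.show()
```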

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It can also work with data stored in Apache Hadoop.
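A minimal PySpark sketch, assuming pyspark is installed (the data is just an example; real workloads would read from files or a Hadoop cluster):

```
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("intro-example").getOrCreate()

# A tiny DataFrame; the same code scales to much larger data.
df = spark.createDataFrame(
    [("Cairo", 120), ("Giza", 95), ("Cairo", 140)],
    ["city", "sales"],
)
df.groupBy("city").sum("sales").show()

spark.stop()
```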

Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop and Spark are very well suited to dealing with big amounts of data.
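The two also work together: Spark can read files straight out of HDFS, Hadoop's distributed file system. Here is a sketch (the cluster address and file path are hypothetical; replace them with your own):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-example").getOrCreate()

# Hypothetical HDFS path: replace host, port, and file with your own.
lines = spark.read.text("hdfs://namenode:9000/data/logs.txt")
print(lines.count())  # number of lines in the distributed file

spark.stop()
```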
