A question I am often asked is: what is data science, and what skills do I need to be a successful data scientist? For the first question, I have a simple answer:
Data Science is an umbrella term for a set of statistical and computational techniques that suddenly seem very important for the future of the world.
It's important to recognize this: data science is not all new stuff. It includes technical capabilities like relational databases and machine learning that have been around for decades. What's new is how these techniques can come together to transform big data into game changers for industry, government, academia, and the ordinary citizen.
So what are these techniques? What are the skills you need to learn to be a successful data scientist? I categorize them into five "shopping bags" of skills:
Systems refers to the physical infrastructure necessary to manage big data, and the distributed computing systems necessary to process it. The skill sets you need in this area include familiarity with cloud computing services, such as Amazon Web Services (AWS); distributed storage and processing with Hadoop and, increasingly, Apache Spark; and knowledge of High Performance Computing (HPC) techniques.
Big Data Management refers to the software and strategies used to store, organize and query big data. Many systems still use SQL and relational databases, and knowledge of these is a must for any data scientist, but newer database technologies such as NoSQL stores (especially MongoDB), semantic databases and graph databases are increasingly being used. Also important in this area is the emerging concept of the "data lake", as opposed to the data warehouse.
Programming is the glue that brings everything together, be it writing code to manage or transform data, or user-side app development. Among the most popular languages are Java, C++, and Python.
Analytics is at the core of what most people consider data science. This is about being able to transform data into knowledge, insights and even wisdom. Good statistical training is an absolute must for this area. On top of this, visualization, data mining and machine learning are the most important techniques. Learning the R language is a good starting point too.
Human Data Interaction is about how to make analysis "move the needle" in positive and not negative ways for human beings. This bag is all about the strategy of making data science work for the world. It includes areas such as policy, strategy, ethics, security, and application of data science methods in different domains.
The job of the data scientist is to make recipes from the ingredients in these bags. As a data scientist, you don't have to be an expert in all these areas, but you should have breadth of skills across them, and depth in one or two.
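To make the idea of a "recipe" concrete, here is a minimal sketch in Python that combines ingredients from three of the bags: a relational database queried with SQL (Big Data Management), a short script to glue things together (Programming), and a summary statistic (Analytics). The table name `sales`, its columns, the function name `mean_sales_by_region` and the sample figures are all made up for illustration; a real project would work with far larger data and a production database rather than an in-memory one.

```python
import sqlite3
import statistics


def mean_sales_by_region(rows):
    """Return the mean sale amount per region.

    `rows` is an iterable of (region, amount) pairs. The data is loaded
    into a throwaway in-memory SQLite table (hypothetical schema), read
    back with SQL, and summarized with the standard library.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Data management: store the records in a relational table.
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

        # Programming: pull the data back out, one region at a time.
        regions = [
            region
            for (region,) in conn.execute(
                "SELECT DISTINCT region FROM sales ORDER BY region"
            ).fetchall()
        ]

        result = {}
        for region in regions:
            amounts = [
                amount
                for (amount,) in conn.execute(
                    "SELECT amount FROM sales WHERE region = ?", (region,)
                ).fetchall()
            ]
            # Analytics: reduce the raw numbers to a summary statistic.
            result[region] = statistics.mean(amounts)
        return result
    finally:
        conn.close()


# Made-up sample data, for illustration only.
sample = [
    ("north", 10.0),
    ("north", 14.0),
    ("south", 7.0),
    ("south", 9.0),
    ("south", 11.0),
]
print(mean_sales_by_region(sample))
```

The averaging could equally be done inside the database with `AVG` and `GROUP BY`; it is kept in Python here only to show the hand-off between the data management and analytics steps.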