Already crowned the best job in America for 2016, data science is a profession whose definition and required skill set are in a constant state of flux. Advances in technology and business demand drive its evolution in an ever-changing industry. In this article, we take a closer look at the role of a data scientist in 2016.
Dave Holtz writes that the title ‘data scientist’ is often used as a blanket title to describe a set of jobs that are drastically different. He attributes this to the fact that the field of data science is still in its infancy and so is ill-defined. Adopting the all-encompassing sub-title of being part of an ‘interdisciplinary field’, a data scientist works to extract knowledge or insights from large volumes of data in various forms.
The age of big data is upon us, and it’s here to stay. With more data being collected than ever before, extracting value from this data is only going to become more intricate and demanding as time goes on. The logic behind the big data economy is shaping our personal lives in ways that we probably can’t even conceive or predict; every electronic move that we make produces a statistic and insight into our life.
As participants in the consumer economy, we are mined for data whenever we connect to a website or electronic service. A data scientist is there to collect, clean and analyse the data that we provide, and to make predictions from it, using a combination of computer science, statistical analysis and detailed business knowledge.
The following diagram shows the skill sets required of a data scientist. As we can see, the role combines a broader range of skills and expertise than that of a typical Big Data Developer or Business Analyst.
WHY ARE DATA SCIENTISTS DIFFERENT?
Rivera and Haverson suggest that, whilst previous data professionals focused on past movements and the interpretation of data, a data scientist tends to be more mathematically focused, concentrating on providing insight into future patterns identified from past and current data. If one takes the two words literally – ‘science’ implying knowledge gained through systematic study; ‘data’ being an information set of qualitative or quantitative variables – a data scientist can be defined as one who systematically studies the organisation and properties of information.
Notwithstanding the crucial role of statisticians and others who study data analytics, the role of a data scientist, described by Anjul Bhambhri as part analyst, part artist, is set to revolutionise the way that traditional data is analysed and used.
THE GROWING DEMAND FOR DATA SCIENTISTS
The success of business networking site LinkedIn is a prime example of the crucial benefit that data scientists are bringing to business intelligence. As an enterprise that relies almost solely on the data transferred by its 380,000,000 users making connections with each other, LinkedIn is utilising those professionals with the training and curiosity to make discoveries in the world of big data.
LinkedIn, alongside other large knowledge-based companies such as Facebook and Google, is using data scientists to bring structure to large quantities of formless data, to determine which of it is valuable, and to identify systematic relationships between the variables.
A recent survey of C-suite executives by KPMG found 99% of respondents thought analysis of big data was important to their strategy next year. In an age where enterprise data is expected to exceed 240 exabytes per day by 2020, the need for data scientists with the skills to extract valuable insights from this data is greater than ever. However, an article by Travis Wright for VentureBeat suggests that demand for data scientists is very much outstripping supply and that companies in the United States alone will need to hire between 140,000 and 190,000 data scientists if they are to keep up with the new data economy.
Ironically, there is a great deal of conflicting data on the average salary for a data scientist. What is clear, however, is that the average salary does tend to reflect the high level of demand for data scientists. Not surprisingly, if employers are asking candidates to be experienced with data mining algorithms, able to work comprehensively in languages like R and Python, experienced in working with large databases (SQL or similar), implementing Java applications and manipulating NoSQL databases (to quote about 10% of a job specification) – all with the ability to communicate this to a non-technical audience – an average salary of about $120,000 doesn’t seem too far-fetched.
THE ROLE OF A DATA SCIENTIST
Whilst the role of a data scientist crosses over with more conventional data analysis positions, there are some stark differences.
A data analyst or architect can extract information from large sets of data, yet they are bound by the SQL queries and analytics packages used to slice these datasets. Through an advanced knowledge of machine learning and programming/engineering, data scientists can manipulate data at will, uncovering deeper insight. They are not bound by these programmes.
Whilst your typical data analyst looks to the past and what’s happened, a data scientist must go beyond this and look to the future. Through application of advanced statistics and complex data modelling they must uncover patterns and make future predictions.
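To make the contrast concrete, here is a minimal sketch in Python. The sales figures and the function names `describe`, `fit_line` and `forecast` are invented for illustration, and a straight-line trend is only the simplest possible stand-in for the "complex data modelling" described above; real forecasting work would need far more care.

```python
# Hypothetical monthly sales figures, invented for illustration only.
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def describe(values):
    """Backward-looking summary: what has already happened."""
    return {"total": sum(values), "mean": sum(values) / len(values)}


def fit_line(values):
    """Ordinary least-squares fit of value = intercept + slope * month_index."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast(values, months_ahead):
    """Forward-looking estimate: extrapolate the fitted trend line."""
    intercept, slope = fit_line(values)
    return intercept + slope * (len(values) - 1 + months_ahead)


summary = describe(monthly_sales)        # the analyst's view of the past
next_month = forecast(monthly_sales, 1)  # the scientist's estimate of the future
print(summary, next_month)
```

The first function only reports on data already collected; the second uses the same data to say something about a month that has not happened yet, which is the shift in emphasis this section describes.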
THE SKILLS OF A DATA SCIENTIST
Successful data analytics relies on being able to clean, integrate and transform the data – and this is the crucial combination of skills all data scientists must possess. By combining a scientific background with computational and analytical skills, you can put yourself a ‘cut above the rest’.
Figure 2 below shows the main areas of focus for a typical data science discipline.
Figure 2. Data Science Focus Areas
But let’s dig deeper into the actual skills required to become a data scientist. Mark van Rijmenam, CEO at Datafloq, recommends that data scientists possess the following skills: statistical, mathematical and ethical, as well as a high degree of predictive modelling experience in order to build the algorithms necessary to ask the right questions and find the right answers.
Ferris Jumah from LinkedIn goes further to neatly group the skills required, despite the huge array of skills and different job roles a data scientist might perform.
A DATA SCIENTIST MUST…
Look at data with a mathematical mind-set. Skills such as machine learning, data mining, data analysis and statistics are crucial. A data scientist will need to interpret and represent data mathematically.
Use a common language to access, explore and model data. Knowledge of a statistical programming language will be critical. Languages like R, Python or MATLAB, and a database querying language like SQL are some of the most popular skills in demand. Data extraction, exploration and hypothesis testing are central to the data science practice.
Develop strong computer science and software engineering backgrounds. This involves developing a skill set which could include Java, C++ or knowledge of algorithms and Hadoop. These skills will be used to leverage data to architect systems.
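The second point above, combining a query language with a statistical language for extraction, exploration and hypothesis testing, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `visits` table, its invented values, the use of an in-memory SQLite database, and the choice of a permutation test as the hypothesis test.

```python
import random
import sqlite3
from statistics import mean

# Hypothetical data: spend per visit under two page designs (invented values).
rows = [("A", 10.0), ("A", 12.0), ("A", 11.0), ("A", 13.0), ("A", 9.0),
        ("B", 14.0), ("B", 15.0), ("B", 13.0), ("B", 16.0), ("B", 17.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (variant TEXT, spend REAL)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)


def spend_for(variant):
    """Extraction: pull one group's values out of the database with SQL."""
    cur = conn.execute(
        "SELECT spend FROM visits WHERE variant = ? ORDER BY rowid", (variant,)
    )
    return [row[0] for row in cur]


group_a, group_b = spend_for("A"), spend_for("B")

# Exploration: how different do the two groups look?
observed_diff = mean(group_b) - mean(group_a)


def permutation_p_value(a, b, n_permutations=5_000, seed=0):
    """Hypothesis test: how often does a random relabelling of the visits
    produce a gap in means at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(mean(b) - mean(a))
    pooled = a + b
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[len(a):]) - mean(pooled[:len(a)]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations


p_value = permutation_p_value(group_a, group_b)
print(f"difference in means: {observed_diff}, p-value: {p_value:.4f}")
```

A small p-value suggests the gap between the two groups is unlikely to be down to chance alone. The same three steps would apply with R or MATLAB in place of Python, and with a production database in place of SQLite.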
TOOLS OF A DATA SCIENTIST
Unlike your typical programmer, who may use a standardised set of tools, data scientists tend to use a wide array of ever-changing tools. This is because the data science landscape is evolving rapidly, with many new tools still far from maturity. That being said, below we’ve compiled a series of popular tools for data scientists aligned to specific practices:
For data extraction and analysis, the tools are really just the programming languages a data scientist uses. These are typically Python, R and SQL.
A data scientist may choose to have their own database into which they can extract data for analysis. MySQL is among the most popular for handling reasonably sized datasets. Moving into the realms of big data, they would typically turn to tools like Hive or Redshift. You’d also be surprised how far most data scientists can go with the average CSV file before it falls over.
Among the most commonly mentioned tools for data visualisation are D3.js and Tableau. With D3.js, if you can imagine a data visualisation, a data scientist can build it using the library. Tableau is the most popular data visualisation tool out there at the moment, allowing data to be compiled from hundreds of inputs and then easily transformed into visualisations.
Machine learning is perhaps the area most in flux, with new tools emerging daily. The most established and widely used is perhaps scikit-learn, a machine learning library for Python. Then of course there is Spark MLlib, which is Apache’s own machine learning library for Spark and Hadoop.
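As a small illustration of the CSV and database workflow mentioned above, the sketch below reads a CSV export with Python’s standard library and then loads the same rows into a database so they can be queried with SQL. The data, column names and `users` table are invented, and an in-memory SQLite database stands in for MySQL purely so that the example runs without a server.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export (invented rows); in practice this would be a file on disk.
csv_text = """user_id,country,purchases
1,UK,3
2,US,5
3,UK,2
4,DE,4
"""

# Working directly with the CSV: fine while the data fits comfortably in memory.
records = list(csv.DictReader(io.StringIO(csv_text)))
total_purchases = sum(int(r["purchases"]) for r in records)

# Loading the same rows into a database so they can be sliced with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER, country TEXT, purchases INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(int(r["user_id"]), r["country"], int(r["purchases"])) for r in records],
)

# Aggregate purchases per country with a GROUP BY query.
by_country = dict(
    conn.execute("SELECT country, SUM(purchases) FROM users GROUP BY country")
)
print(total_purchases, by_country)
```

For a handful of rows the CSV alone is enough; the database starts to earn its keep once the questions involve grouping, joining or filtering across much larger datasets.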