What is data science?
Discover what a data scientist does and how to become a successful data scientist
What is a data scientist?
A data scientist leads research projects to extract valuable information from big data and is skilled in technology, mathematics, business, and communications. Organizations use this information to make better decisions, solve complex problems, and improve their operations. By revealing actionable insights hidden in large datasets, a data scientist can significantly improve his or her company’s ability to achieve its goals. That's why data scientists are in high demand and even considered "rock stars" in the business world.
Introduction to data science
What is data science?
Data science is the scientific study of data to gain knowledge. This field combines multiple disciplines to extract knowledge from massive datasets for the purpose of making informed decisions and predictions. Data scientists, data analysts, data architects, data engineers, statisticians, database administrators, and business analysts all work in the data science field.
The need for data science is growing rapidly as the amount of data increases exponentially and companies depend more heavily on analytics to drive revenue and innovation. For example, as business interactions become more digital, more data is created, presenting new opportunities to derive insights into how to better personalize experiences, improve service and customer satisfaction, develop new and enhanced products, and increase sales. Additionally, in the business world and beyond, data science has the potential to help solve some of the world's most difficult challenges.
What does a data scientist do?
A data scientist collects, analyzes, and interprets big data to uncover patterns and insights, make predictions, and create actionable plans. Big data can be defined as datasets that have greater variety, volume, and velocity than earlier methods of data management were equipped to handle. Data scientists work with many types of big data, including:
- Structured data, which is typically organized in rows and columns and includes words and numbers such as names, dates, and credit card information. For example, a data scientist in the utility industry might analyze tables of power generation and usage data to help reduce costs and detect patterns that could cause equipment to fail.
- Unstructured data, which is unorganized and includes text in document files, social media and mobile data, website content, and videos. For example, a data scientist in the retail industry might answer a question about improving the customer experience by analyzing unstructured call center notes, emails, surveys, and social media posts.
Additionally, a dataset can be described as quantitative (structured numerical data) or as qualitative, also called categorical, which means the data is not represented through numerical values and can be grouped into categories. It's important for data scientists to know the type of data they're working with, as it directly impacts the type of analyses they perform and the types of graphs they can use to visualize the data.
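As a rough illustration of why the data type matters, the sketch below picks a different summary depending on whether a column is quantitative or categorical. The records, column names, and helper functions are hypothetical and exist only for this example.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records; the column names and values are invented for illustration.
records = [
    {"region": "north", "monthly_usage_kwh": 310.0},
    {"region": "south", "monthly_usage_kwh": 275.5},
    {"region": "north", "monthly_usage_kwh": 402.0},
    {"region": "east", "monthly_usage_kwh": 298.5},
]


def is_quantitative(values):
    """Treat a column as quantitative when every value is an int or float."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def summarize(column, values):
    """Choose a summary that fits the kind of data in the column."""
    if is_quantitative(values):
        # Numerical data supports arithmetic summaries.
        return {
            "column": column,
            "kind": "quantitative",
            "mean": mean(values),
            "median": median(values),
        }
    # Categorical data is summarized by counting the members of each group.
    return {"column": column, "kind": "categorical", "counts": dict(Counter(values))}


summaries = {
    column: summarize(column, [row[column] for row in records])
    for column in records[0]
}

for summary in summaries.values():
    print(summary)
```

The same distinction usually carries over to charts: counts per category suit a bar chart, while numerical values suit a histogram or scatter plot.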
To gain knowledge from all these data types, data scientists utilize their skills in:
- Computer programming. Data scientists write queries using languages such as Julia, R, or Python to pull data from their company's database. Python is the language of choice for many data scientists because it's easy to learn and use, even for people without coding experience, and offers prebuilt data science modules for data analysis.
- Mathematics, statistics, and probability. Data scientists draw on these skills to analyze data, test hypotheses, and build machine learning models—files that data scientists train to recognize certain types of patterns. Data scientists use trained machine learning models to discover the relationships in data, make predictions about data, and figure out solutions to problems. Instead of building and training models from scratch, data scientists can also take advantage of automated machine learning to access production-ready machine learning models.
- Domain knowledge. To translate data into relevant and meaningful insights that drive business outcomes, data scientists also need domain knowledge—an understanding of the industry and company where they work. Here are some examples of how data scientists would apply their domain knowledge to solve industry-specific problems.
|Industry||Types of data science projects|
|Business||New product development and product enhancements; Supply chain and inventory management; Customer service improvements; Product recommendations to e-commerce customers|
|Entertainment||Understanding of media content usage patterns; Content development based on target market data; Content performance measurement; Customized recommendations based on user preferences|
|Finance and banking||Prevention of fraud and other security breaches; Risk management of investment portfolios; Virtual assistants to help customers with questions|
|Government||Constituent satisfaction monitoring; Fraud detection, such as social disability claims|
|Healthcare||Evidence-based drug therapy and cost-effectiveness of new drugs; Real-time tracking of disease outbreaks; Wearable trackers to improve patient care|
|Telecommunications||Service improvements based on user preferences and locations; Minimization of dropped calls and other service issues|
|Utilities||Smart meter analysis to improve utility usage and customer satisfaction; Improved asset and workforce management|
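The computer programming skill described above, writing queries to pull data from a database, might look something like the following minimal sketch. It uses Python's built-in `sqlite3` module with an in-memory database standing in for a company database; the `orders` table, its columns, and its rows are assumptions made up for the example.

```python
import sqlite3

# In-memory database standing in for a company database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", 120.0, "2024-01-05"),
        ("bob", 80.0, "2024-01-07"),
        ("alice", 45.5, "2024-02-11"),
    ],
)

# Parameterized query: total spend per customer on or after a given date.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE order_date >= ?
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query, ("2024-01-01",)).fetchall()
conn.close()

for customer, total in totals:
    print(customer, total)
```

Passing the date as a query parameter rather than formatting it into the SQL string keeps the query reusable and avoids injection problems.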
One more skill is critical to what a data scientist does: effectively communicating the results of their analyses to managers, executives, and other stakeholders. This is one of the most important parts of the job. Data scientists need to make their findings easy for a non-technical audience to understand, so that the audience can use the insights to make informed decisions. Therefore, data scientists need to be skilled in:
- Communications, public speaking, and data visualization. Great data scientists have strong verbal communication skills, including storytelling and public speaking. In the field of data science, a picture is truly worth a thousand words. Presenting data science findings using graphs and charts enables the audience to quickly understand the data, in as little as five seconds. For that reason, successful data scientists take their data visualizations as seriously as their analyses.
Data science processes and deliverables
Data science processes
Data scientists generally follow the same process to complete their projects:
The data scientist works with stakeholders to clearly define the problem they want to solve or question they need to answer, along with the project's objectives and solution requirements.
Based on the business problem, the data scientist decides which analytic approach to follow: 1) descriptive, for more information about the current status, 2) diagnostic, to understand what is happening and why, 3) predictive, to forecast what will happen, or 4) prescriptive, to understand how to solve the problem.
The data scientist identifies and acquires the data needed to achieve the desired result. This could involve querying databases, extracting information from websites (web scraping), or obtaining data from files. The data might be internally available, or the team might need to purchase the data. In some cases, organizations might need to collect new data to be able to successfully run a project.
Cleaning the data is typically the most time-consuming step. To create the dataset for modeling, the data scientist converts all the data into the same format, organizes the data, removes what's not needed, and replaces any missing data.
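The cleaning tasks named above (one consistent format, removing what's not needed, replacing missing data) can be sketched as follows. The rows, the two date formats, and the choice to fill gaps with the median are all assumptions for illustration; real projects pick a fill strategy that suits the data.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw rows with mixed date formats, an unneeded column, and a missing value.
raw_rows = [
    {"date": "2024-03-01", "usage": "310", "notes": "ok"},
    {"date": "03/02/2024", "usage": "", "notes": "meter offline"},
    {"date": "2024-03-03", "usage": "290", "notes": "ok"},
]


def parse_date(text):
    """Convert either assumed input format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


# Median of the values that are present, used to replace the missing ones.
known_usage = [float(row["usage"]) for row in raw_rows if row["usage"]]
fill_value = median(known_usage)

clean_rows = [
    {
        "date": parse_date(row["date"]),  # same format everywhere
        "usage": float(row["usage"]) if row["usage"] else fill_value,  # fill the gap
        # The "notes" column is left out because the model does not need it.
    }
    for row in raw_rows
]

for row in clean_rows:
    print(row)
```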
Once the data is cleaned, a data scientist explores it and applies statistical techniques to reveal the relationships among data features and between those features and the value they predict (known as a label). The predicted label can be a quantitative value, such as the future financial value of something or the duration of a flight delay in minutes.
Exploration and preparation typically involve a great deal of interactive data analysis and visualization—usually using languages such as Python and R in interactive tools and environments that are specifically designed for this task. The scripts used to explore the data are typically hosted in specialized environments such as Jupyter Notebooks. These tools enable data scientists to explore the data programmatically while documenting and sharing the insights they find.
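One common exploratory check is how strongly a feature moves with the label. The sketch below computes a Pearson correlation by hand for a made-up feature and a flight-delay label; the variable names and numbers are hypothetical, and in a notebook this would typically sit next to a scatter plot of the same two columns.

```python
from math import sqrt

# Hypothetical data: an airport congestion score (feature) and delay in minutes (label).
congestion = [1.0, 2.0, 3.0, 4.0, 5.0]
delay_minutes = [4.0, 9.0, 11.0, 18.0, 23.0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


r = pearson(congestion, delay_minutes)
print(f"correlation between congestion and delay: {r:.3f}")
```

A value close to 1 or -1 suggests a strong linear relationship worth modeling; a value near 0 suggests the feature alone says little about the label.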
The data scientist builds and trains prescriptive or descriptive models, then tests and evaluates the model to make sure it answers the question or addresses the business problem. At its simplest, a model is a piece of code that takes an input and produces an output. Creating a machine learning model involves selecting an algorithm, providing it with data, and tuning hyperparameters. Hyperparameters are adjustable parameters that let data scientists control the model training process. For example, with neural networks, the data scientist decides the number of hidden layers and the number of nodes in each layer. Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of hyperparameters that results in the best performance.
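Hyperparameter tuning can be illustrated with a deliberately small stand-in for a real model. The sketch below is not a neural network; it assumes a one-dimensional k-nearest-neighbors classifier whose single hyperparameter is `k`, and it tries each candidate value against a held-out validation set. All data points are invented, and two of the training points are intentionally mislabeled to mimic noise.

```python
from collections import Counter

# Hypothetical labeled points: (feature value, class label).
# (2.2, "high") and (6.2, "low") are deliberate noise.
train = [
    (1.0, "low"), (1.5, "low"), (2.0, "low"), (2.2, "high"),
    (6.0, "high"), (6.2, "low"), (6.5, "high"), (7.0, "high"),
]
validation = [(1.2, "low"), (2.4, "low"), (5.8, "high"), (6.3, "high")]


def knn_predict(x, k):
    """Predict the majority label among the k training points closest to x."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(k):
    """Share of validation points classified correctly for a given k."""
    hits = sum(knn_predict(x, k) == label for x, label in validation)
    return hits / len(validation)


# Grid search: evaluate every candidate value and keep the best one.
candidate_ks = [1, 3, 5, 7]
scores = {k: accuracy(k) for k in candidate_ks}
best_k = max(scores, key=scores.get)  # ties go to the first (smallest) candidate

print(scores)
print("best k:", best_k)
```

With `k = 1` the model copies the noisy neighbors and misclassifies half of the validation points, while larger values smooth the noise out. Evaluating on data the model was not trained on is what makes the comparison meaningful.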
A common question is "Which machine learning algorithm should I use?" A machine learning algorithm turns a dataset into a model. The algorithm the data scientist selects depends primarily on two different aspects of the data science scenario:
- What is the business question the data scientist wants to answer by learning from past data?
- What are the requirements of the data science scenario, including the accuracy, training time, linearity, number of parameters, and number of features?
To help answer these questions, Azure Machine Learning provides a comprehensive portfolio of algorithms, such as Multiclass Decision Forest, Recommendation systems, Neural Network Regression, Multiclass Neural Network, and K-Means Clustering. Each algorithm is designed to address a different type of machine learning problem. In addition, the Azure Machine Learning Algorithm Cheat Sheet helps data scientists choose the right algorithm to answer the business question.
The data scientist delivers the final model with documentation and deploys the model into production after testing, so it can play an active role in a business. Predictions from a deployed model can be used for business decisions.
Visualization tools like Microsoft Power BI, Tableau, Apache Superset, and Metabase make it easy for the data scientist to explore the data and generate beautiful visualizations that show the findings in a way that makes it simple for non-technical audiences to understand.
Data scientists might also use web-based data science notebooks, such as Zeppelin Notebooks, throughout much of the process for data ingestion, discovery, analytics, visualization, and collaboration.
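As a very rough sketch of delivery and deployment, the example below saves the parameters of an already trained model to a file and reloads them the way a production service might before serving predictions. The model is assumed to be a simple linear one, and the file name, field names, and parameter values are hypothetical; real deployments on a platform such as Azure Machine Learning involve considerably more than this.

```python
import json
import os
import tempfile

# Hypothetical parameters of an already trained linear model.
trained_model = {"slope": 2.0, "intercept": 1.0, "version": "example-1"}


def save_model(model, path):
    """Write the model parameters to disk so they can be handed over."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle)


def load_model(path):
    """Read the model parameters back, as a serving process would."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def predict(model, x):
    """Apply the linear model to a single input value."""
    return model["slope"] * x + model["intercept"]


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.json")
    save_model(trained_model, model_path)  # delivery
    deployed = load_model(model_path)      # what production would load

prediction = predict(deployed, 4.0)
print("prediction for x=4.0:", prediction)
```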
Data science methods
Data scientists use statistical methods such as hypothesis testing, factor analysis, regression analysis, and clustering to unearth statistically sound insights.
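Of the methods listed, regression analysis is the easiest to show in a few lines. The following minimal sketch fits a straight line by ordinary least squares; the advertising-spend and sales figures are invented for the example, and the function name is hypothetical.

```python
# Hypothetical data: advertising spend (x) and sales (y), both in thousands.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.9, 5.2, 6.9, 9.1, 10.9]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(spend, sales)
forecast = slope * 6.0 + intercept  # predicted sales at a spend of 6.0

print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast:.2f}")
```

The slope estimates how much sales change per additional unit of spend, which is the kind of quantified relationship the method is meant to surface.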
Data science documentation
Although data science documentation varies by project and industry, it generally includes documentation that shows where the data comes from and how it was modified. This helps other members of the data team effectively use the data moving forward. For example, documentation helps business analysts use visualization tools to interpret the dataset.
Types of data science documentation include:
- Project plans to define the project's business objectives, evaluation metrics, resources, timeline, and budget.
- Data science user stories to generate ideas for data science projects. The data scientist writes the story from the stakeholder's point of view, describing what the stakeholder would like to achieve and the reason the stakeholder is requesting the project.
- Data science model documentation to document the dataset, the experiment's design, and the algorithms.
- Supporting systems documentation including user guides, infrastructure documentation for system maintenance, and code documentation.
How to become a data scientist
There are multiple paths to becoming a data scientist. Requirements usually include a degree in information technology or computer science. However, some IT professionals learn data science by taking bootcamps and online courses, and others earn a data science master's degree or certification.
To learn how to be a data scientist, take advantage of these Microsoft training resources designed to help you:
- Quickly get started. Read the free Packt e-book Principles of Data Science, A beginner's guide to statistical techniques and theory. You'll learn the basics of statistical analysis and machine learning, key terms, and data science processes.
- Build your machine learning skills with Azure, the Microsoft cloud platform. Explore Azure machine learning for data scientists resources, including free training videos, example solution architectures, and customer stories.
- Achieve machine learning expertise on Azure for free, in just 4 weeks. Take an hour a day to learn how to create innovative solutions for complex problems. You'll learn the basics all the way to scaling your machine learning projects using the latest tools and frameworks. The self-paced Zero to hero machine learning path also prepares you for the Azure Data Scientist Associate certificate.
- Get comprehensive training. Take the Microsoft data scientist learning path and choose from a range of self-paced and instructor-led courses. Learn how to create machine learning models, use visual tools, run data science workloads in the cloud, and build applications that support natural language processing.
Get your data scientist certification
Certifications are a great way to demonstrate your data science qualifications and jumpstart your career. Microsoft certified professionals are in high demand and there are jobs available for Azure data scientists right now. Explore the data scientist certifications most sought after by employers:
- Microsoft Certified: Azure Data Scientist Associate. Apply your knowledge of data science and machine learning to implement and run machine learning workloads on Azure using Azure Machine Learning Service.
- Microsoft Certified: Customer Data Platform Specialty. Implement solutions that provide insights into customer profiles and track engagement activities to help improve customer experiences and increase customer retention.
Differences between data analysts and data scientists
Like data scientists, data analysts work with large datasets to uncover trends in data. However, data scientists are typically more technical team members with more expertise and responsibility, such as initiating and leading data science projects, building and training machine learning models, and presenting their findings to executives and at conferences. Some data scientists perform all of these tasks and others focus on specific ones, like training algorithms or building models. Many data scientists began their careers as data analysts, and data analysts can be promoted to data scientist positions within a few years.
|||Data analyst||Data scientist|
|Role||Statistical data analysis||Develop solutions to complex business needs using big data|
|Typical tools||Microsoft Excel, SQL, Tableau, Power BI||SQL, Python, R, Julia, Hadoop, Apache Spark, SAS, Tableau, Machine Learning, Apache Superset, Power BI, Data Science Notebooks|
|Analysis of data types||Structured data||Structured and unstructured data|
|Tasks and duties||
Data scientists lead research projects to extract valuable information and actionable insights from big data. This includes defining the problem to be solved, writing queries to pull the right data from databases, cleaning and sorting the data, building and training machine learning models, and using data visualization techniques to effectively communicate the findings to stakeholders.
Although data science documentation varies by project and industry, it generally includes project plans, user stories, model documentation, and supporting systems documentation such as user guides.