Module 1: Defining Data Science and What Data Scientists Do

In Module 1, you delve into some fundamentals of Data Science. In lesson 1, you listen to how other professionals in the field define what data science is to them and the paths they took to consider data science as a career for themselves. You explore different roles data scientists fulfill, how data analysis is used in data science, and how data scientists follow certain processes to answer questions with that data. Moving on to Lesson 2, the focus shifts to the daily activities of data scientists. This encompasses learning about various real-world data science problems that professionals solve, the skills and qualities needed to be a successful data scientist, and opinions on how “big data” relates to those skills. You also learn a little about various data formats data scientists work with and algorithms used in the field to process data.

Learning Objectives

  • Define data science and its importance in today’s data-driven world.
  • List some paths that can lead to a career in data science.
  • Summarize advice given by seasoned data science professionals to data scientists who are just starting out.
  • Articulate why data science is considered the most in-demand job in the 21st century.
  • Describe what a typical day in the life of a data scientist looks like.
  • Define some of the commonly used terms in data science.
  • Identify some of the key qualities of a successful data scientist.

Welcome to the Course


Video: Course Introduction

This course is an introduction to the field of Data Science, which is a rapidly growing field with a projected 35% growth rate in the US and a median annual salary of $103,000. The course is designed for beginners and does not require prior knowledge or a degree in data science or programming. It covers the basics of data science, including what it is, the role of a data scientist, and the skills needed to become one. The course is divided into three modules, plus an optional module, and covers topics such as big data, artificial intelligence, machine learning, and data mining. Students will learn from instructional videos, videos with data science professionals, readings, practice assessments, and a final case study and quiz. The course is designed to provide a foundational understanding of data science and prepare students for a career in the field.

Welcome to your journey toward an enticing career path in the Data Science field. Data science has witnessed recent remarkable growth due to the abundance of electronic data, computing power, advancements in artificial intelligence, and its demonstrated business value. In the United States, the US Bureau of Labor Statistics projects a 35% growth rate in the field. Currently, the median annual salary is an estimated $103,000. So, begin your journey into this thriving and fascinating field with this course!

Its increased adoption across industries has fueled demand for skilled data scientists who can use data to tell compelling stories to transform and inform businesses. In this introductory course, learn about these opportunities and investigate the needed career skills to become a data scientist. Additionally, this course will help managers and executives who want to learn about this field to help transform their organization to become more data-driven.

This course is designed for beginners, so you don’t need prior knowledge or a degree in data science or programming. It serves as an introduction to the field. Take this course if you are curious about data science and how and why data science is transforming our world. Learn about concepts such as “Big Data,” different branches of artificial intelligence, and how data science leverages these ideas to uncover and tell hidden stories.

This course is the first in several other specializations and programs, including:

  • IBM Data Science Professional Certificate
  • Introduction to Data Science
  • Key Technologies for Business
  • IBM AI Foundations for Business

You will meet several data scientists who will share their insights and experiences in Data Science. You’ll have a foundational understanding of data science through instructional videos and videos with data science professionals, readings, practice assessments where you can practice your skills, glossaries for your reference, and a video that summarizes each lesson. At the end of the course, you will have a case study and a quiz based on it as a final assignment. You explore Data Science job listings in the final peer-reviewed project.

Let’s take a more detailed view of the course content. The course is divided into three modules, plus an optional module. In the first lesson, you’ll explore what data science is, the data scientist’s role in an organization, and what makes a skilled data scientist. Then, you continue this module by learning about what data scientists do. You’ll also learn essential skills for becoming a good data scientist and why significant data matters. You’ll explore handling different file types, study data science topics and algorithms, and discuss the qualities that define a data scientist.

In Module 2, Lesson 1, you’ll explore the interaction of Big Data and Cloud Computing in driving digital transformation. You will also learn foundational concepts, key tools, and data mining techniques essential for data scientists in this dynamic landscape. In the second module lesson, you’ll dive into the exciting concepts of artificial intelligence and data science. Throughout this module, you will explore artificial intelligence concepts like machine learning and deep learning, gaining valuable insights and skills.

In the third module, you’ll embark on a journey to explore the vast and impactful realms where data science plays a pivotal role. This engaging module delves into various activities that shed light on the diverse applications of data science today.

In addition to these three modules, you are also offered an optional module if you’d like to familiarize yourself with data literacy concepts for data science and where you will learn about the data ecosystem. Topics include sources of data that data professionals use in their day-to-day tasks and distinguish between databases, data warehouses, data marts, and data lakes. You gain a basic understanding of how data gets processed, such as the Extract, Transform, and Load (ETL) processes, and working with Data Pipelines.

While participating in the course, if you encounter challenges or have questions, you can find support and answers and connect with other learners and support staff in the course’s discussion forums. And now, welcome to the course. We look forward to you completing this course, earning a valuable certificate, and continuing your path to a data science career.

Reading: Course Syllabus

Reading

Defining Data Science


Reading: Lesson Overview: Defining Data Science

Reading

Video: What is Data Science?

The speaker defines data science as a process of using data to understand the world, validate hypotheses, and uncover insights and trends. It involves translating data into a story to generate insights and inform strategic decisions. Data science is a field that studies data, extracting insights from both structured and unstructured forms. The speaker emphasizes that data science is more about working with data than it is about science, and that it requires curiosity, manipulation, and exploration of data to find answers to questions.

The speaker also highlights the relevance of data science today due to the abundance of data, availability of algorithms, and affordability of open-source software and data storage. This has made it easier and more accessible for people to work with data, making it an exciting time to be a data scientist.

Key points:

  • Data science is a process, not an event
  • It involves using data to understand the world and validate hypotheses
  • Data science is about uncovering insights and trends in data
  • It requires translating data into a story to generate insights
  • Data science is a field that studies data in all its forms
  • It’s more about working with data than it is about science
  • The abundance of data, algorithms, and affordable tools make it an exciting time for data science.

What is Data Science? A Comprehensive Guide

In today’s data-driven world, the term “Data Science” has become a buzzword in the industry. But what exactly is Data Science, and how does it impact our daily lives? In this tutorial blog post, we’ll delve into the world of Data Science, exploring its definition, key components, and applications.

What is Data Science?

Data Science is a multidisciplinary field that combines elements of computer science, statistics, and domain-specific knowledge to extract insights and knowledge from data. It involves using various techniques, tools, and algorithms to analyze and interpret complex data sets, often to drive business decisions or solve real-world problems.

Key Components of Data Science

Data Science encompasses several key components, including:

  1. Data: The foundation of Data Science is data itself. This can include structured data (e.g., databases), unstructured data (e.g., text, images), or semi-structured data (e.g., XML).
  2. Statistics and Mathematics: Statistical and mathematical techniques are used to analyze and model data, including regression, hypothesis testing, and machine learning algorithms.
  3. Computer Science: Data Science relies heavily on computer science concepts, such as programming languages (e.g., Python, R), data structures, and algorithms.
  4. Domain Expertise: Domain-specific knowledge is essential to understand the context and relevance of the data, as well as to identify meaningful insights and recommendations.
  5. Visualization: Data visualization techniques are used to communicate complex data insights to stakeholders, using tools like plots, charts, and dashboards.
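
To make the three forms of data listed above more concrete, here is a minimal sketch using only the Python standard library. The sample values (the customer rows, the XML order, and the review text) are hypothetical and exist only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured data: rows and columns with a fixed schema (hypothetical sample).
structured = io.StringIO("customer_id,age,country\n1,34,CA\n2,51,US\n")
rows = list(csv.DictReader(structured))

# Semi-structured data: tagged and nested, but not a rigid table.
semi_structured = "<order id='7'><item sku='A1' qty='2'/><item sku='B9' qty='1'/></order>"
order = ET.fromstring(semi_structured)
total_qty = sum(int(item.get("qty")) for item in order.iter("item"))

# Unstructured data: free text; any structure has to be imposed by the analyst.
unstructured = "Great product, fast shipping. Would buy again!"
word_count = len(unstructured.split())

print(rows[0]["country"], total_qty, word_count)
```

The practical difference is how much work is needed before analysis: the CSV rows can be queried directly, the XML has to be traversed, and the text needs a choice of representation (here, a simple word count).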

The Data Science Process

The Data Science process typically involves the following steps:

  1. Problem Definition: Identify a business problem or question that can be addressed through data analysis.
  2. Data Collection: Gather relevant data from various sources, including databases, APIs, or files.
  3. Data Cleaning: Clean and preprocess the data to ensure it’s accurate, complete, and consistent.
  4. Exploratory Data Analysis (EDA): Use statistical and visual techniques to understand the data distribution, identify patterns, and detect anomalies.
  5. Modeling: Develop and train machine learning models to predict outcomes, classify data, or identify relationships.
  6. Evaluation: Assess the performance of the model using metrics like accuracy, precision, and recall.
  7. Deployment: Deploy the model in a production-ready environment, such as a web application or API.
  8. Monitoring: Continuously monitor the model’s performance and retrain as necessary to maintain its accuracy.
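
The evaluation step above names accuracy, precision, and recall. The sketch below shows how these three metrics can be computed for a binary classifier. The function name `classification_metrics` and the label lists are illustrative assumptions, not part of the course material.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Accuracy counts all correct predictions, precision asks how many predicted positives were right, and recall asks how many actual positives were found. Which one matters most depends on the problem being solved.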

Applications of Data Science

Data Science has numerous applications across various industries, including:

  1. Business Intelligence: Analyze customer behavior, sales trends, and market dynamics to inform business decisions.
  2. Healthcare: Predict patient outcomes, identify disease patterns, and develop personalized treatment plans.
  3. Finance: Detect fraudulent transactions, predict stock prices, and optimize investment portfolios.
  4. Marketing: Analyze customer behavior, personalize recommendations, and optimize marketing campaigns.
  5. Environmental Science: Monitor climate patterns, predict natural disasters, and optimize resource allocation.

Real-World Examples of Data Science

  1. Netflix Recommendation Engine: Uses machine learning algorithms to recommend TV shows and movies based on user behavior.
  2. Google Search Engine: Employs natural language processing and machine learning to rank search results and predict user intent.
  3. Amazon Product Recommendations: Analyzes customer behavior and product attributes to suggest relevant products.
  4. Weather Forecasting: Uses statistical models and machine learning algorithms to predict weather patterns and issue warnings.

Conclusion

Data Science is a powerful tool for extracting insights and knowledge from complex data sets. By combining statistics, computer science, and domain expertise, Data Science professionals can drive business decisions, solve real-world problems, and create innovative solutions. Whether you’re a beginner or an experienced professional, understanding the fundamentals of Data Science can help you unlock new opportunities and stay ahead in the industry.

Further Reading

  • Data Science Handbook: A comprehensive guide to Data Science, covering topics from data cleaning to machine learning.
  • Python for Data Science: A tutorial on using Python for Data Science, including popular libraries like Pandas and Scikit-learn.
  • Data Science with R: A guide to using R for Data Science, covering topics from data visualization to machine learning.

Get Started with Data Science

  • Take online courses: Websites like Coursera, edX, and Udemy offer a wide range of Data Science courses.
  • Join online communities: Participate in online forums like Kaggle, Reddit, and GitHub to connect with Data Science professionals.
  • Practice with datasets: Use public datasets like Kaggle, UCI Machine Learning Repository, or World Bank Open Data to practice your skills.

We hope this tutorial blog post has provided a comprehensive introduction to Data Science. Whether you’re a beginner or an experienced professional, we encourage you to continue exploring the world of Data Science and stay up-to-date with the latest trends and techniques.

Data science is the field of exploring, manipulating, and analyzing data, and using data to answer questions or make recommendations.

True

Although the different data scientists interviewed in the video gave what seemed like different answers, they all agreed that data science involved using data to answer questions or make recommendations.

Data Science is a process, not an event. It is the process of using data to understand different things, to understand the world.

For me is when you have a model or hypothesis of a problem, and you try to validate that hypothesis or model with your data.

Data science is the art of uncovering the insights and trends that are hiding behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.

Data science is a field about processes and systems to extract data from various forms of whether it is unstructured or structured form.

Data science is the study of data. Like biological sciences is a study of biology, physical sciences, it’s the study of physical reactions. Data is real, data has real properties, and we need to study them if we’re going to work on them.

Data Science involves data and some science. The definition or the name came up in the 80s and 90s when some professors were looking into the statistics curriculum, and they thought it would be better to call it data science. But what is Data Science? I’d see data science as one’s attempt to work with data, to find answers to questions that they are exploring. In a nutshell, it’s more about data than it is about science. If you have data, and you have curiosity, and you’re working with data, and you’re manipulating it, you’re exploring it, the very exercise of going through analyzing data, trying to get some answers from it is data science.

Data science is relevant today because we have tons of data available. We used to worry about lack of data. Now we have a data deluge. In the past, we didn’t have algorithms, now we have algorithms. In the past, the software was expensive, now it’s open source and free. In the past, we couldn’t store large amounts of data, now for a fraction of the cost, we can have gazillions of datasets for a very low cost. So, the tools to work with data, the very availability of data, and the ability to store and analyze data, it’s all cheap, it’s all available, it’s all ubiquitous, it’s here. There’s never been a better time to be a data scientist.

Video: Fundamentals of Data Science

What is Data Science?

Data Science is a field that involves analyzing large amounts of data from various sources to gain new insights and knowledge. It’s not just about analyzing data, but about using that analysis to add value to an organization.

The Process of Data Science

The process of Data Science typically involves:

  1. Defining the problem: Identifying a specific question or issue that the organization wants to answer or solve.
  2. Gathering data: Determining what data is needed to solve the problem and where it will come from.
  3. Analyzing data: Using various models and techniques to explore the data, reveal patterns and outliers, and gain insights.
  4. Communicating results: Presenting the findings to stakeholders in a clear and concise manner, often using data visualization tools.
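
As one illustration of the "reveal patterns and outliers" step above, the sketch below flags values that sit far from the mean using a z-score. The sales figures and the threshold of 2 standard deviations are assumptions chosen for the example, and z-scores are only one of several possible outlier checks.

```python
from statistics import mean, pstdev

# Hypothetical daily sales figures; one day looks unusual.
daily_sales = [102, 98, 105, 97, 101, 99, 250, 103]

avg = mean(daily_sales)
spread = pstdev(daily_sales)  # population standard deviation

# Flag values more than 2 standard deviations away from the mean.
threshold = 2.0
outliers = [x for x in daily_sales if abs(x - avg) / spread > threshold]

print(outliers)
```

Whether a flagged value is an error to remove or a finding worth reporting is a judgment call that depends on the business question.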

The Role of a Data Scientist

A good data scientist is curious, asks questions to clarify the business need, and is able to communicate complex results in a simple way. They use data analysis to provide value to the organization and help it understand its environment, identify opportunities, and solve problems.

The Impact of Data Science

Data Science is changing the way organizations work, use data, and understand the world. It has the power to reveal new knowledge, challenge assumptions, and drive business decisions.

Fundamentals of Data Science: A Comprehensive Guide

Welcome to the world of Data Science! As a rapidly growing field, Data Science has become an essential tool for businesses, organizations, and individuals to make informed decisions, drive innovation, and solve complex problems. In this tutorial blog post, we’ll cover the fundamentals of Data Science, including the key concepts, techniques, and tools you need to get started.

Key Concepts in Data Science

Before we dive into the techniques and tools, let’s cover some key concepts in Data Science:

  1. Data: The foundation of Data Science is data itself. This can include structured data (e.g., databases), unstructured data (e.g., text, images), or semi-structured data (e.g., XML).
  2. Variables: Variables are the individual elements of data that we want to analyze. They can be categorical (e.g., gender, country), numerical (e.g., age, income), or text-based (e.g., customer reviews).
  3. Data Types: Data types refer to the format of the data, such as integers, floats, strings, or dates.
  4. Data Structures: Data structures refer to the way data is organized and stored, such as arrays, lists, or tables.
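
The concepts above map naturally onto Python. The following sketch uses a hypothetical customer record to show categorical, numerical, and text-based variables, their data types, and a simple table-like data structure. All field names and values are made up for illustration.

```python
from datetime import date

# One record: each key is a variable (hypothetical customer data).
customer = {
    "country": "Canada",            # categorical variable, stored as a string
    "age": 42,                      # numerical variable, stored as an integer
    "income": 58250.75,             # numerical variable, stored as a float
    "review": "Quick delivery.",    # text-based variable, stored as a string
    "signup": date(2023, 5, 17),    # date value
}

# A table can be represented as a list of such records.
table = [
    customer,
    {"country": "Brazil", "age": 29, "income": 41000.0,
     "review": "Packaging was damaged.", "signup": date(2024, 1, 3)},
]

# Inspect the data type behind each variable.
type_names = {name: type(value).__name__ for name, value in customer.items()}

# Pull one variable (a column) out of the table.
ages = [row["age"] for row in table]

print(type_names)
print(ages)
```

Note that the variable kind and the data type are separate ideas: "country" and "review" are both stored as strings, but one is categorical and the other is free text.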

Data Science Techniques

Now that we’ve covered the key concepts, let’s explore some essential techniques in Data Science:

  1. Data Cleaning: Data cleaning involves removing errors, inconsistencies, and missing values from the data to ensure it’s accurate and reliable.
  2. Data Transformation: Data transformation involves converting data from one format to another, such as aggregating data or creating new variables.
  3. Data Visualization: Data visualization involves using plots, charts, and other graphical techniques to communicate insights and patterns in the data.
  4. Machine Learning: Machine learning involves using algorithms to train models on data and make predictions or classify new data.
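
Data cleaning and data transformation, the first two techniques above, can be sketched with plain Python. The records and the helper names `clean` and `total_by_region` are hypothetical. Dropping rows with missing or invalid values is only one possible policy; filling them in is another.

```python
from collections import defaultdict

# Hypothetical raw records with inconsistent labels and bad values.
raw = [
    {"region": " East ", "sales": "120.5"},
    {"region": "west", "sales": ""},        # missing value
    {"region": "East", "sales": "80"},
    {"region": "West", "sales": "abc"},     # invalid value
    {"region": "WEST", "sales": "200"},
]


def clean(records):
    """Cleaning: normalize region labels and drop rows with unusable sales."""
    cleaned = []
    for record in records:
        try:
            sales = float(record["sales"])
        except ValueError:
            continue  # skip rows whose sales value cannot be parsed
        region = record["region"].strip().title()
        cleaned.append({"region": region, "sales": sales})
    return cleaned


def total_by_region(records):
    """Transformation: aggregate sales per region."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["sales"]
    return dict(totals)


cleaned = clean(raw)
totals = total_by_region(cleaned)
print(totals)
```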

Data Science Tools

Here are some essential tools you’ll need to get started with Data Science:

  1. Python: Python is a popular programming language used extensively in Data Science. It’s easy to learn and has a vast number of libraries and frameworks.
  2. R: R is a programming language specifically designed for statistical computing and data visualization.
  3. SQL: SQL (Structured Query Language) is a language used to manage and analyze relational databases.
  4. Tableau: Tableau is a data visualization tool that allows you to connect to various data sources and create interactive dashboards.
  5. Jupyter Notebook: Jupyter Notebook is a web-based interactive environment for working with Python, R, and other languages.
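
Two of the tools above, Python and SQL, are often used together. The sketch below runs a SQL aggregation from Python using the standard-library `sqlite3` module and an in-memory database. The `orders` table and its rows are assumptions made up for the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 30.0), ("ben", 12.5), ("ana", 20.0)],
)

# SQL does the aggregation; Python receives the result as a list of tuples.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```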

Data Science Workflow

The Data Science workflow typically involves the following steps:

  1. Problem Definition: Identify a business problem or question that can be addressed through data analysis.
  2. Data Collection: Gather relevant data from various sources, including databases, APIs, or files.
  3. Data Cleaning: Clean and preprocess the data to ensure it’s accurate and reliable.
  4. Data Transformation: Transform the data into a suitable format for analysis.
  5. Data Visualization: Use plots, charts, and other graphical techniques to communicate insights and patterns in the data.
  6. Machine Learning: Use algorithms to train models on the data and make predictions or classify new data.
  7. Deployment: Deploy the model or insights to stakeholders, such as business leaders or customers.
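
A compressed version of this workflow can be sketched in a few lines. The example below covers only collection, cleaning, and a very simple model: a straight line fitted by ordinary least squares. The ad-spend and sales numbers are hypothetical, and a real project would typically rely on a library such as scikit-learn rather than a hand-written fit.

```python
from statistics import mean

# Data collection: hypothetical (ad_spend, sales) observations, one incomplete.
observations = [(1.0, 3.1), (2.0, 4.9), (None, 6.0), (3.0, 7.2), (4.0, 8.8)]

# Data cleaning: drop rows with missing values.
data = [(x, y) for x, y in observations if x is not None and y is not None]


def fit_line(points):
    """Modeling: ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in points)
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(data)


def predict(x):
    """Use the fitted line to estimate sales for a new ad spend."""
    return slope * x + intercept


print(round(slope, 2), round(intercept, 2), round(predict(5.0), 2))
```

The visualization and deployment steps are left out here because they depend heavily on the tools and audience involved.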

Conclusion

In this tutorial blog post, we’ve covered the fundamentals of Data Science, including key concepts, techniques, and tools. We’ve also explored the Data Science workflow and the essential steps involved in working with data. Whether you’re a beginner or an experienced professional, we hope this guide has provided a solid foundation for your Data Science journey.

Everyone you ask will give you a slightly different description of what Data Science is, but most people agree that it has a significant data analysis component. Data analysis isn’t new. What is new is the vast quantity of data available from massively varied sources: from log files, email, social media, sales data, patient information files, sports performance data, sensor data, security cameras, and many more besides. At the same time that there is more data available than ever, we have the computing power needed to make a useful analysis and reveal new knowledge.

Data science can help organizations understand their environments, analyze existing issues, and reveal previously hidden opportunities. Data scientists use data analysis to add to the knowledge of the organization by investigating data, exploring the best way to use it to provide value to the business.

So, what is the process of data science? Many organizations will use data science to focus on a specific problem, and so it’s essential to clarify the question that the organization wants answered. This first and most crucial step defines how the data science project progresses. Good data scientists are curious people who ask questions to clarify the business need.

The next questions are: “what data do we need to solve the problem, and where will that data come from?”. Data scientists can analyze structured and unstructured data from many sources, and depending on the nature of the problem, they can choose to analyze the data in different ways. Using multiple models to explore the data reveals patterns and outliers; sometimes, this will confirm what the organization suspects, but sometimes it will be completely new knowledge, leading the organization to a new approach.

When the data has revealed its insights, the role of the data scientist becomes that of a storyteller, communicating the results to the project stakeholders. Data scientists can use powerful data visualization tools to help stakeholders understand the nature of the results, and the recommended action to take.

Data Science is changing the way we work; it’s changing the way we use data and it’s changing the way organisations understand the world.

Video: The Many Paths to Data Science

The speakers in this conversation are all data scientists, but none of them grew up wanting to be data scientists because the field didn’t exist when they were young. Data science as a field is relatively new, with the term being coined around 2009-2011.

Each speaker has a unique background and path that led them to data science. Some studied statistics, math, or engineering, while others started in business or other fields. They all discovered data science through various experiences, such as working with strategic consulting firms, solving complex problems, or analyzing large data sets.

Some common themes among the speakers include:

  • A strong foundation in quantitative analysis and math
  • Exposure to data science through internships, jobs, or research projects
  • A shift from traditional fields like statistics or engineering to data science
  • The use of data science to solve real-world problems, such as forecasting traffic or analyzing electronic point of sale data

Overall, the speakers’ paths to data science were diverse and often unexpected, but they all share a passion for working with data and driving insights.

Which of the following statements is correct?

As data science is not a discipline traditionally taught at universities, contemporary data scientists come from diverse backgrounds such as engineering, statistics, and physics.

This is correct. Contemporary data scientists come from different backgrounds such as engineering, mathematics, and even psychology. The secret skill is passion for continuous learning of new tools and patience to clean and analyze data.

[SOUND] [MUSIC]

Data science didn’t really exist when I was growing up. It’s not something that I ever woke up and said, I want to be a data scientist when I grow up. No, it didn’t exist. I didn’t know I would be working in data science.

When I grew up, there isn’t that field called data science. And I think it’s really new.

Data science didn’t exist until 2009, 2011. Someone like DJ Patil or Andrew Gelman coined the term. Before that, there was statistics. And I didn’t want to be any of those. I want to be in business. And then I found data science a heck of a lot more interesting.

I studied statistics, that’s how I started. I went through many different stages in my life where I wanted to be a singer and then a doctor. And then I realized that I was good at math. So I chose an area that was focused on quantitative analysis. And from then I do think that I wanted to work with data. Not necessarily data science as it’s known today.

The first time that I had contact with data science, when I was my first year as a mechanical engineering. And strategic consulting firms, they use data science to make decisions. So it was my first contact with data science.

I had a complicated problem that I needed to solve, and the usual techniques that we had at that time couldn’t help with that problem.

I graduated with a math degree in the worst possible time, right after the economic crisis, and you actually had to be useful to get a job. So I went and got a degree in statistics. And then I worked enough jobs that were called data scientist that I suddenly became one.

My undergraduate degree was in business, and I majored in politics, philosophy, and economics. And then I did a masters in business analytics at New York University at the Stern School of Business. When I left my undergrad, the first company I joined, it turned out that they were analyzing electronic point of sale data for retail manufacturers. And what we were doing was data science. But we only really started using that term much later. In fact, I’d say four or five years ago is when we started calling it analytics and data science.

I had several options for my internship here in Canada. And one of the options was to work with data science. I used to work with project development. But I think that was a good choice. And then I start my internship with data science.

I’m a civil engineer by training, so all engineers work with data. I would say the conventional use of data science in my life started with transportation research. I started building large models trying to forecast traffic on streets, trying to determine congestion and greenhouse gas emissions or tailpipe emissions. So I think that’s where my start was. And I started building these models when I was a graduate student at the University of Toronto. Started working with very large data sets, looking at household samples of, say, 150,000 households from half a million trips. And that, too, I’m speaking from mid 90s when this was supposed to be a very large data set, but not in today’s terms. But that’s how I started. I continued working with it. And then I moved to McGill University where I was a professor of transportation engineering. And I built even bigger data models that involved data and analytics. And so I would say, yes, transportation research brought me to data science.

[MUSIC]

Upgraded Plugin: Data Science: The Sexiest Job in the 21st Century

Reading

Practice Assignment: Practice Quiz: Data Science: The Sexiest Job in the 21st Century

Why are companies looking for well-rounded individuals when hiring data scientists?

  • Because data science requires a combination of skills, including subject matter expertise, programming, and communication abilities
  • Because data scientists need to be artistic
  • Because data scientists are responsible for conducting biological experiments
  • Because data science jobs are purely focused on statistics

Why is there a growing demand for data scientists and analytics professionals in various industries?

  • Because of the digital revolution and the need to analyze big data for effective decision-making
  • Because data scientists are primarily focused on conducting biological experiments
  • Because data scientists are mainly responsible for convincing C-suite executives about the benefits of data and analytics
  • Because data scientists are primarily involved in managing customer analytics initiatives

Due to the shortage of data scientists, employers are willing to pay top salaries for their talent, with an average base salary for data scientists reported as $112,000.

  • True
  • False

Video: Advice for New Data Scientists

The speaker offers advice to aspiring data scientists, highlighting four key traits:

  1. Curiosity: A must-have trait to explore and understand data.
  2. Argumentative and judgmental: Having preconceived notions and being able to argue a case helps in forming hypotheses and assumptions, which can be refined through data analysis.
  3. Comfort with analytics platforms: Familiarity with software and computing platforms is important, but secondary to curiosity and argumentative skills.
  4. Ability to tell a story: Being able to communicate findings effectively and tell a compelling story is crucial to making an impact and rising to prominence.

Additionally, the speaker suggests that aspiring data scientists should:

  • Identify their competitive advantage and area of interest (e.g., health, retail, film)
  • Develop analytical skills specific to their chosen industry
  • Apply their skills to real-world problems
  • Share their work and tell the world about their accomplishments

By following these steps, aspiring data scientists can develop a strong foundation and make a meaningful impact in their chosen field.

According to professor Haider, the three important qualities to possess in order to succeed as a data scientist are:

Curious. Judgemental. Argumentative

According to professor Haider, the three important qualities are being curious, judgemental, and argumentative.

[Music] My advice to an aspiring data scientist is to be curious, extremely argumentative and judgmental. Curiosity is an absolute must. If you’re not curious, you would not know what to do with the data. Judgmental because if you do not have preconceived notions about things you wouldn’t know where to begin with. Argumentative because if you can argue and if you can plead a case, at least you can start somewhere and then you learn from data and then you modify your assumptions and hypotheses and your data would help you learn. And you may start at the wrong point. You may say that I thought I believed this, but now with data I know this. So, this allows you a learning process. So, curiosity, being able to take a position, a strong position, and then moving forward with it. The other thing that the data scientist would need is some comfort and flexibility with analytics platforms: some software, some computing platform, but that’s secondary. The most important thing is curiosity and the ability to take positions. Once you have done that, once you’ve analyzed, then you’ve got some answers. And that’s the last thing that a data scientist needs, and that is the ability to tell a story. That once you have your analytics, once you have your tabulations, now you should be able to tell a great story from it. Because if you don’t tell a great story from it, your findings will remain hidden, remain buried, nobody would know. Your rise to prominence is pretty much relying on your ability to tell great stories. A starting point would be to see what is your competitive advantage. Do you want to be a data scientist in any field or a specific field? Because, let’s say you want to be a data scientist and work for an IT firm or a web-based or Internet-based firm, then you need a different set of skills. And if you want to be a data scientist in, let’s say, the health industry, then you need different sets of skills.
So figure out first what you’re interested in, and what is your competitive advantage. Your competitive advantage is not necessarily going to be your analytical skills. Your competitive advantage is your understanding of some aspect of life where you exceed beyond others in understanding that. Maybe it’s film, maybe it’s retail, maybe it’s health, maybe it’s computers. Once you’ve figured out where your expertise lies, then you start acquiring analytical skills. What platforms to learn and those platforms, those tools would be specific to the industry that you’re interested in. And then once you have got some proficiency in the tools, the next thing would be to apply your skills to real problems, and then tell the rest of the world what you can do with it. [Music]

Video: Lesson Summary: Defining Data Science

Here’s a summary of the lesson on defining data science:

What is Data Science?

  • Data science is the study of data to understand the world around us
  • It involves uncovering insights and trends hidden in data
  • Data analysis is not new, but recent advances in computing power and data access have enabled new insights and knowledge

Role of a Data Scientist

  • Data scientists translate data into stories to generate insights
  • They aid strategic decision-making for companies or institutions
  • They deal with structured and unstructured data
  • The process of gleaning insights from data includes clarifying the problem, data collection, analysis, pattern recognition, storytelling, and visualization

Skills Required

  • Curiosity to explore data and come up with meaningful questions
  • Argumentation to explain findings and compel listeners to adjust their ideas
  • Judgment to guide the direction of analysis
  • Versatility, with knowledge of a particular subject, programming, and data analysis skills
  • Good communication skills to tell stories and convey insights
  • Comfort with math and logic

Career Path

  • Focus on mastering data analysis in a specific field and selecting suitable tools
  • Apply expertise to solve real-world issues using data
  • The future of data science involves continuous learning, certification, and adaptation to new technologies and data roles
  • Data scientists must think logically, use algorithms, and follow a methodical approach to achieve successful business results.

Welcome to defining data science,
lesson summary. In this video, we’ll review important
points you learned from the videos and readings in this lesson, we’ll also link together the ideas
from the resources in this lesson. We will quickly recap
what data science is, data scientists’ role in an organization,
what makes a skilled data scientist, and expert advice on how
to acquire these skills. Let’s begin. So,
what really is data science? Simply put,
data science is the study of data, data science uses data to understand
the world around us, some consider data science an art of uncovering the insights
and trends hiding behind data. Data analysis isn’t new,
but recent data access and enhanced computing power drive new
insights and knowledge through analysis. We also have the computing power needed to
analyze this data to reveal new knowledge, with the power of computers, we can dig through this information
to reveal valuable insights. Just like a detective uncovering secrets,
data scientists translate the data into stories to generate insights, these
insights aid strategic decision-making for companies or institutions. Similar to biological or
physical sciences, data science deals with structured and
unstructured data. The process of gleaning insights from data
includes clarifying the problem, data collection, analysis, pattern recognition,
storytelling, and visualization. According to Professor Murtaza Haider from
the Ted Rogers School of Management, curiosity, argumentation, and
judgment are vital for data scientists. Curiosity helps you explore data and
come up with meaningful questions. Good, sound, reasonable arguments help you explain your
findings after learning from the data, compelling the listener to adjust their
ideas based on the new information. Good judgment guides you to
start in the right direction. Skilled data scientists go beyond just
being statisticians or computer experts. Companies are looking for versatile individuals who know
a lot about a particular subject, have some experience in programming and
analyzing data, and can communicate well. Generally, data scientists are comfortable
with math, they’re curious, they’re good at telling stories. Their backgrounds can come from
various fields like economics, engineering, medicine, and more. Once you understand your strengths and
interests, focus on mastering data analysis in that field and
select suitable tools for your industry. As you become proficient, apply your
expertise to solve real world issues using data, similar to solving
mysteries by deciphering clues. So, what does the future look like for
you as a skilled data scientist? Data scientist jobs will also change as
technology changes and data roles develop. To ensure their employees are qualified,
companies will require certification. Data scientists will always need to
think logically, use algorithms, and follow a methodical approach. Most importantly, they must gather
data correctly and carefully analyze the models being used, all aiming to
achieve successful business results. [MUSIC]

Practice Assignment: Practice Quiz: Defining Data Science

Imagine you’re working for a retail company that wants to optimize its product offerings and marketing strategies. In this scenario, you would use Data Science for:
Creating artistic visualizations for in-store displays to attract customers.
Developing algorithms to predict future stock market trends for investment decisions.
Analyzing customer purchase data to identify trends and tailor product recommendations.
Conducting biological experiments to enhance the quality of retail products.

What is the role of data analysis in Data Science and how does it contribute to decision-making?
Data analysis is a recent concept leveraging computing power for pattern recognition.
Data analysis focuses on formulating business questions for organizations.
Data analysis involves gathering insights from data and helps make informed decisions.
Data analysis is no longer relevant due to advanced data visualization tools.

In a healthcare context with patient data, medical histories, and treatment outcomes, Data Science can be applied to:
Predicting future medical advancements using patient data.
Creating artistic visualizations of patient data for aesthetics.
Analyzing patient data for personalized treatment plans.
Automating patient diagnoses and treatment decisions.

Considering an individual with a marketing background transitioning to data science, how might their marketing experience contribute to their data science journey?
Their marketing skills could enhance their ability to perform complex statistical analyses.
Their marketing knowledge could help predict future data science trends.
Their marketing expertise could replace the need for data analysis, given its relevance to business.
Their marketing background might assist in interpreting data to generate actionable insights.

You have just started your career as a data scientist. Which of the following skills should you develop to succeed as a data scientist? You should:
Dedicate your efforts to becoming proficient in technical skills such as mathematics and statistical modeling.
Cultivate curiosity, develop strong positions, and learn to communicate insights effectively through storytelling.
Prioritize industry-specific knowledge above all else to establish a competitive advantage as a data scientist.
Focus primarily on mastering coding languages and analytics tools to excel in data analysis.

Upgraded Plugin: Glossary: Defining Data Science

Reading

Graded Assignment: Graded Quiz: Defining Data Science

You are a data scientist about to start a new project. What would one of your key roles be?
Asking questions to clarify the business need
Focusing solely on data visualization
Collecting vast quantities of data from varied sources
Designing data collection methods

When did the term “data science” come into existence and who is credited with coining the term?
1960s, no specific person credited
Early 2000s, led by business analysts
2009-2011, DJ Patil or Andrew Gelman
1990s, DJ Patil and Andrew Gelman

As an aspiring data scientist, what primary qualities should you possess to succeed in the field?
Proficiency in analytics platforms and software.
Strong expertise in a specific industry.
Curiosity and storytelling skills.
Extensive experience with data analysis software.

What Do Data Scientists Do?


Reading: Lesson Overview: What Do Data Scientists Do?

Reading

Video: A Day in the Life of a Data Scientist

The speaker is sharing their experiences working on various projects that involved data analysis and problem-solving. They mention three specific projects:

  1. Building a recommendation engine: The speaker built a recommendation engine as part of a large organization, which was a complex problem that required collaboration with different teams. They came up with a simple solution that was efficient and easy to understand.
  2. Predicting algae blooms: In university, the speaker worked on a project to predict algae blooms in water using artificial neural networks. This was a complex problem that required a different approach, and they were able to find a solution that helped water treatment companies.
  3. Analyzing TTC complaints: The speaker was approached by the Toronto Transit Commission (TTC) to analyze their complaints data. They had about half a million complaints, and the speaker had to find a way to make sense of the data. They discovered that the 10 days with the most complaints were days of bad weather: unexpected rain, an extreme drop in temperature, heavy snow, or high winds.

The speaker’s approach to problem-solving is to look for simple solutions to complex problems. They emphasize the importance of understanding the root cause of a problem and finding a solution that is easy to understand and implement. In the case of the TTC complaints, the speaker’s discovery of the relationship between extreme weather and complaints helped the TTC understand why they were receiving more complaints on certain days.

As a data scientist, you get to work on a broad spectrum of projects. Which of the following were projects discussed in the video?

Building a recommendation engine.

Correct. The engine was built by Stephen Sherman.

Analyzing hundreds of thousands of complaints filed with the Toronto Transit Commission.

Correct. This project was executed by professor Murtaza Haider.

[Music] I’ve built a recommendation engine before, as part of a large organization and worked through all types of engineers and accounted for different parts of the problem. It’s one of the ones I’m most happy with because ultimately, I came up with a very simple solution that was easy to understand from all levels, from the executives to the engineers and developers. Ultimately, it was just as efficient as something really complex, and they could have spent a lot more time on.

Back in the university, we have a problem that we wanted to predict algae blooms. These algae blooms could cause a rise in toxicity of the water and it could cause problems through the water treatment company. We couldn’t like predict with our chemical engineering background. So we use artificial neural networks to predict when these blooms will reoccur. So the water treatment companies could better handle this problem.

In Toronto, the public transit is operated by Toronto Transit Commission. We call them TTC. It’s one of the largest transit authorities in the region, in North America. And one day they contacted me and said, “We have a problem.” And I said, “Okay, what’s the problem?” They said, “Well, we have complaints data, and we would like to analyze it, and we need your help.” I said, “Fine, I would be very happy to help.” So I said, “How many complaints do you have?” They said, “A few.” I said, “How many?” Maybe half a million. I said, “Well, let’s start working with it.” So I got the data and I started analyzing it. So, basically, they have done a great job of keeping some data in tabular format that was structured data. And in that case, tabular data was when the complaint arrived, who received it, what was the type of the complaint, was it resolved, whose fault was it. And the unstructured part of it was the exchange of e-mails and faxes. So, imagine looking at half a million exchanges of e-mails and trying to get some answers from it. So I started working with it.
The first thing I wanted to know is why would people complain and is there a pattern or is there some days when there are more complaints than others? And I had looked at the data and I analyzed it in all different formats, and I couldn’t find the impetus for complaints being higher on a certain day and lower on others. And it continued for maybe a month or so. And then, one day I was getting off the bus in Toronto, and I was still thinking about it. And I stepped out without looking on the ground, and I stepped into a puddle, puddle of water. And now, I was sort of ankle deep into water, and it was just one foot wet and the other dry. And I was extremely annoyed. And I was walking back and then it hit me, and I said, “Well, wait a second. Today it rained unexpectedly, and I wasn’t prepared for it. That’s why I’m wet, and I wasn’t looking for it.” What if there was a relationship between extreme weather and the type of complaints TTC receives? So I went to Environment Canada’s website, and I got data on rain and precipitation, wind and the like. And there, I found something very interesting. The 10 most excessive days for complaints. The 10 days where people complain the most were the days when the weather was bad. It was unexpected rain, an extreme drop in temperature, too much snow, very windy day. So I went back to the TTC’s executives and I said, “I’ve got good news and bad news.” And the good news is, I know why people would complain excessively on certain days. I know the reason for it. The bad news is, there’s nothing you can do about it. [Music]
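The core of the TTC analysis (count complaints per day, pick the busiest days, then check them against weather records) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method actually used in the project: the dates, complaint categories, and the `bad_weather` flags are invented toy values, and the names `complaints`, `per_day`, `top_days`, and `share_bad` are hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (date, complaint category). Not real TTC data.
complaints = [
    ("2024-01-03", "delay"), ("2024-01-03", "delay"), ("2024-01-03", "crowding"),
    ("2024-01-04", "delay"),
    ("2024-01-05", "delay"), ("2024-01-05", "service"), ("2024-01-05", "delay"),
    ("2024-01-05", "crowding"),
    ("2024-01-06", "service"),
]

# Hypothetical daily weather flags, standing in for an external weather source.
bad_weather = {
    "2024-01-03": True,   # e.g. unexpected rain
    "2024-01-04": False,
    "2024-01-05": True,   # e.g. heavy snow
    "2024-01-06": False,
}

# Step 1: count complaints per day (uses only the structured, tabular fields).
per_day = Counter(date for date, _ in complaints)

# Step 2: rank days by complaint volume and keep the busiest ones.
top_days = [date for date, _ in per_day.most_common(2)]

# Step 3: join with the weather table: what share of busy days had bad weather?
share_bad = sum(bad_weather[d] for d in top_days) / len(top_days)
print(top_days, share_bad)
```

The insight in the talk came from the third step: the complaints table alone showed which days were busy, and the reason only appeared once a second data source was joined to it.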

Video: Data Science Skills & Big Data

Norman White is a Clinical Faculty Member in the Information, Operations, and Management Science Department at NYU’s Stern School of Business. He’s a self-taught programmer with a background in Applied Physics and Economics. He’s been at Stern for a long time, starting as the director of the computer center while pursuing his PhD in Economics and Statistics.

Currently, Norman is the Faculty Director of the Stern Center for Research Computing, which provides specialized hardware and software to faculty and PhD students. He teaches two classes on Mondays: “Design and Development of Web-Based Systems” and “Dealing with Data”, which focuses on Python, Unix, Linux, relational databases, and big data using Jupyter notebooks. Students work in teams on weekly homework and a final project, and each student has their own virtual machine on Amazon Web Services to ensure a standardized environment.

Norman is passionate about technology and is known as a “big data evangelist”. He’s enthusiastic about teaching and helping students learn about data science and technology.

According to Dr. White, his students, who are mostly aspiring data scientists, need to learn many tools such as Python, UNIX commands, pandas, and Jupyter notebook.

True.

(music) I’m Norman White, I’m
a Clinical Faculty Member in the IOMS Department, Information, Operations
and Management Science Department here at Stern. I’ve been here for a long time (laughs), since I got out of college, pretty much. I’m sort of a techy, geeky kind of person. I really like to play with
technology in my spare time. I’m currently Faculty Director of the Stern Center
for Research Computing, in which we have a private cloud that runs lots of
different kinds of systems. Many of our faculty or
PhD students who need specialized hardware and
software will come to us, we’ll spin up a machine
for them, configure it, I’ll help them and advise on them. A lot of the data
scientists, or virtually all the data scientists at
Stern use our facilities. And their PhD students use them a lot. (music) I have an undergraduate
degree in Applied Physics and while I was an
undergrad I took a number of economics courses,
so I ended up deciding to go to business school, but I had, this was in the early
days of computers (laughs) and I had gotten interested in computers. I came to Stern, which was then
NYU Business School downtown and they had a little computer center, and I decided that I was gonna learn two things while I was there. One, I was gonna learn how to program. I had taken one programming
course in college. And I was gonna learn how to touch type. I never did learn how
to touch type (laughs). Or maybe I did but I’ve forgotten now, and back to two finger pecking. But I became a self taught programmer, and then I took a number of courses at IBM because I eventually became the director of the computer center,
while I was getting my PhD in Economics and Statistics at Stern. In 1973, the school formed
a department called Computer Applications
and Information Systems and I was one of the first faculty members in the department and I’ve
been here ever since (laughs). (music) My typical Monday is, I usually
get in around 11 o’clock and I do my email at home first, but I come in and I have
two classes on Monday. I have a class on design and development of web based systems at six o’clock. Two o’clock, I have a
dealing with data class. The class is based on Python notebooks, so we start with the
basics of Unix and Linux, just to get the students used to that. We move onto some Python,
some regular expressions, a lot of relational
databases, some Python Pandas, which is sort of like R
for Python, lets you do mathematical and statistical
calculations in Python. And then I end up with big data, for which, as you probably
know, I’m an evangelist. The students I have, weekly homeworks. I put them in teams and they
have to do a big project at the end of the term, and
they do some really cool things. (music) Yes, in fact, the whole course is taught using Jupyter notebooks. Every student has their
own virtual machine on Amazon Web Services, so we
pre configure all the machines and they get a standard image
that has all of the materials for the course either loaded
on it or in a Jupyter notebook, there are the commands to download it or update the server
with the right software. So everybody is in the same environment, it doesn’t matter what kind of, whether they have a Mac
or a Windows machine or how old it is, everybody
can do everything in the class. (upbeat music)

Video: Understanding Different Types of File Formats

Importance of Understanding File Formats

As a data professional, it’s crucial to understand various file formats, their structures, benefits, and limitations to make informed decisions about the best formats for your data and performance needs.

File Formats Covered

  1. Delimited Text Files: Text files with values separated by delimiters (e.g., commas, tabs, colons). Examples: CSV (Comma-Separated Values) and TSV (Tab-Separated Values).
    • Each row represents a record, with the first row as a column header.
    • Delimited files allow field values of any length and are widely supported.
  2. Microsoft Excel Open XML Spreadsheet (XLSX): An XML-based file format for spreadsheets, allowing multiple worksheets with rows and columns.
    • XLSX is accessible to most applications, secure, and can use all Excel functions.
  3. Extensible Markup Language (XML): A markup language for encoding data, readable by humans and machines.
    • XML is platform-independent, programming language-independent, and self-descriptive.
  4. Portable Document Format (PDF): A file format for presenting documents independently of application software, hardware, and operating systems.
    • PDFs are often used for legal and financial documents and can be used for filling in data.
  5. JavaScript Object Notation (JSON): A text-based open standard for transmitting structured data over the web.
    • JSON is language-independent, easy to use, and compatible with a wide range of browsers, making it ideal for sharing data of any size and type.

These file formats are essential for data professionals to understand, as they are commonly used in various applications and industries.
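To make the delimiter discussion concrete, here is a minimal sketch using Python’s standard `csv` module. The records and the helper names `dump` and `load` are hypothetical, chosen only for illustration. It shows why a literal comma inside a field breaks a naive split on commas, and how either a quoting-aware CSV parser or a tab delimiter avoids the problem.

```python
import csv
import io

# Hypothetical records; the "note" field contains a literal comma.
rows = [
    ["id", "date", "note"],
    ["1", "2024-01-03", "late, crowded bus"],
    ["2", "2024-01-04", "clean station"],
]

def dump(rows, delimiter):
    """Serialize rows to delimited text held in memory."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

def load(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

csv_text = dump(rows, ",")   # the writer quotes the field containing a comma
tsv_text = dump(rows, "\t")  # no quoting needed: the field contains no tab

# A naive split on commas breaks the quoted field into two pieces...
naive_fields = csv_text.splitlines()[1].split(",")
# ...while a real CSV parser and the TSV version both recover three fields.
csv_records = load(csv_text, ",")
tsv_records = load(tsv_text, "\t")

print(len(naive_fields), csv_records[0]["note"], csv_records == tsv_records)
```

Note that every value comes back as a string; the column types mentioned above (dates, integers) are a convention the reader applies, not something the file itself stores.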

As a data professional, you will be working
with a variety of data file types, and formats. It is important to understand the underlying
structure of file formats along with their benefits and limitations. This understanding will support you to make
the right decisions on the formats best suited for your data and performance needs. Some of the standard file formats that we
will cover in this video include delimited text file formats; Microsoft Excel Open XML Spreadsheet, or XLSX; Extensible Markup Language, or XML; Portable Document Format, or PDF; and JavaScript Object Notation, or JSON. Delimited text files are text files used to
store data as text in which each line, or row, has values separated by a delimiter; where a delimiter is a sequence of one or
more characters for specifying the boundary between independent entities or values. Any character can be used to separate the
values, but most common delimiters are the comma, tab, colon, vertical bar, and space. Comma-separated values (or CSVs) and tab-separated
values (or TSVs) are the most commonly used file types in this category. In CSVs, the delimiter is a comma while in
TSVs, the delimiter is a tab. When literal commas are present in text data
and therefore cannot be used as delimiters, TSVs serve as an alternative to CSV format. Tab stops are infrequent in running text. Each row, or horizontal line, in the text
file has a set of values separated by the delimiter, and represents a record. The first row works as a column header, where
each column can have a different type of data. For example, a column can be of date type,
while another can be a string or integer type data. Delimited files allow field values of any
length and are considered a standard format for providing straightforward information
schema. They can be processed by almost all existing
applications. Delimiters also represent one of various means
to specify boundaries in a data stream. Microsoft Excel Open XML Spreadsheet, or XLSX,
is a Microsoft Excel Open XML file format that falls under the spreadsheet file format. It is an XML-based file format created by
Microsoft. In an .XLSX, also known as a workbook, there
can be multiple worksheets. And each worksheet is organized into rows
and columns, at the intersection of which is the cell. Each cell contains data. XLSX uses the open file format, which means
it is generally accessible to most other applications. It can use and save all functions available
in Excel and is also known to be one of the more secure file formats as it cannot save
malicious code. Extensible Markup Language, or XML, is a markup
language with set rules for encoding data. The XML file format is both readable by humans
and machines. It is a self-descriptive language designed
for sending information over the internet. XML is similar to HTML in some respects, but
also has differences. For example, an .XML does not use predefined
tags like .HTML does. XML is platform independent and programming
language independent and therefore simplifies data sharing between various systems. Portable Document Format, or PDF, is a file
format developed by Adobe to present documents independent of application software, hardware,
and operating systems, which means it can be viewed the same way on any device. This format is frequently used in legal and
financial documents and can also be used to fill in data such as for forms. JavaScript Object Notation, or JSON, is a
text-based open standard designed for transmitting structured data over the web. The file format is a language-independent
data format that can be read in any programming language. JSON is easy to use, is compatible with a
wide range of browsers, and is considered as one of the best tools for sharing data
of any size and type, even audio and video. That is one reason, many APIs and Web Services
return data as JSON. In this video, we looked at some popular file
and data formats. In the next video, we will learn about the
different sources of data.
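As a small illustration of the XML and JSON descriptions in this video, the sketch below writes the same record in both formats using only Python’s standard library. The record, its field names, and the XML tag names are hypothetical; XML has no predefined tags, so any tag names could have been chosen.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
record = {"id": 1, "route": "501", "resolved": True, "tags": ["delay", "weather"]}

# JSON: numbers, booleans, strings, and lists survive a round trip as-is.
json_text = json.dumps(record)
from_json = json.loads(json_text)

# XML: the author picks the tag names, and every value is stored as text,
# so the reader has to restore the types.
root = ET.Element("complaint", attrib={"id": str(record["id"])})
ET.SubElement(root, "route").text = record["route"]
ET.SubElement(root, "resolved").text = str(record["resolved"]).lower()
for tag in record["tags"]:
    ET.SubElement(root, "tag").text = tag
xml_text = ET.tostring(root, encoding="unicode")

parsed = ET.fromstring(xml_text)
from_xml = {
    "id": int(parsed.get("id")),
    "route": parsed.findtext("route"),
    "resolved": parsed.findtext("resolved") == "true",
    "tags": [t.text for t in parsed.findall("tag")],
}

print(json_text)
print(xml_text)
```

Both versions are plain text that a person can read, which is what the video means by formats that are readable by humans and machines.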

Video: Data Science Topics and Algorithms

The speaker discusses various topics in data science and machine learning, including:

  1. Regression: The speaker enjoys regression and explains it in simple terms using the analogy of a taxi ride. Regression estimates a constant (e.g., the base fare) and how the outcome (the fare) changes with each input variable (e.g., distance and time).
  2. Data Visualization: The speaker emphasizes the importance of data visualization in communicating insights to non-technical people.
  3. Artificial Neural Networks: The speaker is passionate about neural networks, which are inspired by the human brain and can be used to mimic biological behavior in algorithms.
  4. Nearest Neighbor: The speaker appreciates the simplicity and effectiveness of the nearest neighbor algorithm, which can often produce good results without overfitting.
  5. Structured vs. Unstructured Data: The speaker explains the difference between structured data (e.g., tabular data in Excel) and unstructured data (e.g., text, video, and audio from the web), which requires more sophisticated algorithms to extract insights.

The speaker’s goal is to make complex concepts more accessible and understandable, as they feel that many explanations of regression and other topics are overly complicated.

Using complicated machine learning algorithms does not always guarantee achieving a better performance. Occasionally, a simple algorithm such as k-nearest neighbor can yield a satisfactory performance comparable to the one achieved using a complicated algorithm. It all depends on the data.

True

In any field, and data science is no different, a simple solution is always preferred over a complicated one, especially if the performance is comparable.
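To show how little machinery k-nearest neighbor needs, here is a minimal sketch in plain Python. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy points are hypothetical, and real projects would typically use a library implementation instead.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per point, two classes.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (3.9, 4.1)))
```

There is no training step at all: the algorithm stores the data and compares a new point against it, which is why it is often used as a simple baseline before trying anything more complicated.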

[Music] I really enjoy regression. I’d say regression was maybe one of the
first concepts that I, that really helped me understand data so I enjoy
regression. I really like data visualization. I think it’s a key element
for people to get across their message to people that don’t understand that
well what data science is. Artificial neural networks. I’m really passionate about neural networks
because we have a lot to learn with nature so when we are trying to mimic our, our brain
I think that we can do some applications with this behavior with this biological behavior
in algorithms. Data visualization with R. I love to do this. Nearest neighbor. It’s the simplest but it just gets the best
results so many more times than some overblown, overworked algorithm that’s just as likely
to overfit as it is to make a good fit. So structured data is more like tabular data
things that you’re familiar with in Microsoft Excel format. You’ve got rows and columns and that’s called
structured data. Unstructured data is basically data that is
coming from mostly from web where it’s not tabular. It is not, it’s not in rows and columns. It’s text. It’s sometimes it’s video and audio, so you
would have to deploy more sophisticated algorithms to extract data. And in fact, a lot of times we take unstructured
data and spend a great deal of time and effort to get some structure out of it and then analyze
it. So if you have something which fits nicely
into tables and columns and rows, go ahead. That’s your structured data. But if you see if it’s a weblog or if you’re
trying to get information out of webpages and you’ve got a gazillion web pages, that’s
unstructured data that would require a little bit more effort to get information out of
it. There are thousands of books written on regression
and millions of lectures delivered on regression. And I always feel that they don’t do a good
job of explaining regression because they get into data and models and statistical distributions. Let’s forget about it. Let me explain regression in the simplest
possible terms. If you have ever taken a cab ride, a taxi
ride, you understand regression. Here is how it works. The moment you sit in a cab ride, in a cab,
you see that there’s a fixed amount there. It says $2.50. Whether the cab moves or you get off, this is what you owe to the driver the moment
you step into a cab. That’s a constant. You have to pay that amount if you have stepped
into a cab. Then as it starts moving for every meter or
hundred meters the fare increases by certain amount. So there’s a… there’s a fraction, there’s
a relationship between distance and the amount you would pay above and beyond that constant. And if you’re not moving and you’re stuck
in traffic, then every additional minute you have to pay more. So as the minutes increase, your fare increases. As the distance increases, your fare increases. And while all this is happening you’ve already
paid a base fare which is the constant. This is what regression is. Regression tells you what the base fare is
and what is the relationship between time and the fare you have paid, and the distance
you have traveled and the fare you’ve paid. Even without knowing those relationships in advance,
and just knowing how far and how long people traveled and how much they paid, regression allows
you to compute that constant you didn’t know, that it was $2.50, and it would compute the relationship
between the fare and the distance, and between the fare and the time. That is regression.
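The taxi analogy above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the six rides, the per-kilometer rate (1.75), and the per-minute rate (0.50) are invented for the example, and only the $2.50 base fare comes from the transcript. The helper names `solve_linear_system` and `fit_linear_regression` are hypothetical. The fit uses ordinary least squares via the normal equations, written with the standard library only.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation updates both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the constant term
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical rides: (distance in km, minutes stuck in traffic).
rides = [(1.0, 0.0), (2.0, 5.0), (5.0, 2.0), (3.0, 10.0), (8.0, 1.0), (0.5, 4.0)]
# Fares generated from an assumed tariff: 2.50 base, 1.75 per km, 0.50 per minute.
fares = [2.50 + 1.75 * km + 0.50 * mins for km, mins in rides]

# Regression sees only the rides and the fares, not the tariff itself.
base_fare, per_km, per_minute = fit_linear_regression(rides, fares)
print(f"base fare: {base_fare:.2f}, per km: {per_km:.2f}, per minute: {per_minute:.2f}")
```

Because these made-up fares follow the assumed tariff exactly, the fit should recover the base fare and both rates up to floating-point error. Real fare data would be noisy, and the estimates would only approximate the true tariff.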

Upgraded Plugin: What Makes Someone a Data Scientist?

Reading

Video: Lesson Summary: What Do Data Scientists Do?

The article discusses the role of a data scientist and the skills required to be successful in this field. Data scientists are problem-solvers who use data to find explanations for complex issues. They may work on environmental challenges, such as predicting algae blooms, or build recommendation engines to simplify intricate problems.

The article highlights the importance of education in equipping future data scientists with essential skills, including Python, Unix, Linux, relational databases, and statistical models. Data scientists use a range of tools and techniques, including machine learning algorithms and statistical regression, to process and analyze data.

The article also emphasizes the importance of communication skills in data science. A data scientist must be able to convey complex insights to stakeholders in a compelling narrative. The data they work with comes from a wide variety of sources and formats, including text, video, and audio.

According to Rachel Schutt, chief data scientist at News Corp, what makes a data scientist truly exceptional is curiosity. A successful data scientist is a blend of computer scientist, software engineer, and statistician, with the ability to transform unstructured solutions into structured insights.

The article concludes that data science is not just a profession, but a journey of exploration, innovation, and storytelling. Data scientists, armed with skills, curiosity, and determination, navigate the vast world of data to unravel the extraordinary.

Key takeaways:

  • Data scientists are problem-solvers who use data to find explanations for complex issues.
  • Education equips future data scientists with essential skills.
  • Data scientists use a range of tools and techniques, including machine learning algorithms and statistical regression.
  • Communication skills are critical in data science, as data scientists must convey complex insights to stakeholders.
  • Curiosity is a key trait of successful data scientists.
  • Data science is a journey of exploration, innovation, and storytelling.

Welcome to a realm where
data isn’t just numbers. It’s the gateway to innovation, discovery, and the endless
possibilities that lie ahead. To understand the heart and soul of a data scientist’s world, let’s review what
data scientists do as discussed in the readings
and videos from this lesson. Data scientists investigate and find explanations
for many problems. For example, Dr. Murtaza Haider
found an explanation for why half a million customers complained about public
transit in Toronto. After much investigation, he
found a relationship between unexpected bad weather events and the number of complaints
on that particular day. Data scientists may tackle
environmental challenges such as predicting algae blooms to prevent water toxicity. By harnessing data and the prowess of artificial
neural networks, data scientists can help water treatment companies
safeguard the ecosystem. Norman White, a clinical faculty member at the
Stern School of Business, also discussed the journey
that led him to build a recommendation engine that simplified intricate
problems across departments. This underscores the
essence of data science, solving real world issues
with innovative solutions. Education serves
as a cornerstone, equipping future data scientists
with essential skills. Dr. White’s classroom comes
alive with Python notebooks, revealing the secrets of Unix, Linux, relational databases, and powerful tools like Pandas. Dr. Vincent Granville,
an author on data science, lists
algebra, calculus, and training in probability
and statistics as necessary educational backgrounds
to be a data scientist. He distinguishes between a statistician and
a data scientist. A data scientist
uses statistics, but is not only a statistician. However, data scientists do use a lot of statistical models, such as statistical regression. Regression shows the
probable relationship between two variables, such as the distance you drive and the amount of gas you use. Data scientists also utilize machine learning
algorithms such as nearest neighbor to process what much of the media
refers to as big data. The term should be
used with caution. What was once
considered big data is continually reshaped
by innovation. Tools like Hadoop and software advancements have
shattered traditional limits, ushering in a new era
of possibilities. Neither does Dr. Patel restrict data scientists to dealing with datasets of arbitrary size, nor does he restrict them
to using particular tools. His definition
includes individuals of various academic
backgrounds and training. A data scientist not only unlocks the insights
within a dataset, but conveys a compelling
narrative to stakeholders. It’s this blend of
technical acumen and communication finesse
that sets them apart. The data that
data scientists use comes from a wide
variety of sources. Sometimes video,
sometimes audio, and often unformatted text. Text-based data can
also be structured, such as in tables with
rows and columns, or unstructured like
emails or logs. The data they work with comes in a wide variety of formats, such as delimited text files, spreadsheets, XML, PDFs, and JavaScript Object
Notation, or JSON. What makes a data scientist
truly exceptional? The answer, according
to Rachel Schutt, chief data scientist at News
Corp, lies in curiosity. A successful data scientist is a blend of
computer scientist, software engineer,
and statistician. Their ability to transform unstructured solutions into structured insights
defines their prowess. As we reflect on
the insights from a day in the life of
a data scientist, we realize that data science
isn’t just a profession. It’s a journey of exploration, innovation, and story telling. The world of data is vast and the data scientist,
armed with skills, curiosity and determination, navigates it to unravel
the extraordinary.
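Two of the formats mentioned in the summary above, delimited text and JSON, can be compared with a short Python sketch. The data is hypothetical (the `route`, `complaints`, and `bad_weather` columns and their values are invented for illustration), and only standard-library modules are used.

```python
import csv
import io
import json

# Hypothetical delimited text: a header row followed by two records.
csv_text = "route,complaints,bad_weather\n501,42,yes\n504,7,no\n"

# csv.DictReader uses the header row as keys and yields one dict per record.
records = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values arrive as strings, so the numeric column is converted explicitly.
for record in records:
    record["complaints"] = int(record["complaints"])

# The same records as JSON: every record carries its own field names.
json_text = json.dumps(records, indent=2)
print(json_text)

# Parsing the JSON text gives back the same list of records.
round_tripped = json.loads(json_text)
```

In the delimited file the column names appear once, in the header row, while in the JSON output each record repeats them. This is one reason JSON is often described as self-descriptive.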

Practice Assignment: Practice Quiz: What Makes Someone a Data Scientist?

You have the task of defining the role of a data scientist for a retail company that seeks to improve its product offerings and marketing strategies. In this context, a data scientist would primarily engage in which activity?
Creating artistic visualizations for in-store displays to attract customers.
Developing algorithms to predict future stock market trends for investment decisions.
Analyzing customer purchase data to identify trends and tailor product recommendations.
Conducting biological experiments to enhance the quality of retail products.

What is a key characteristic that defines a data scientist?
A data scientist is someone who can process large data sets quickly and has a strong distrust of statistical models, according to Dr. Vincent Granville.
According to the discussion on big data, a data scientist must meet certain arbitrary thresholds related to data size or use specific tools like Hadoop.
A data scientist is primarily defined by their proficiency in machine learning and black-box algorithms, according to some experts in the field.
A curious mind, fluency in analytics, and effective communication of findings characterize a data scientist.

Dr. Vincent Granville defines a data scientist as someone who relies solely on statistical models for data analysis.
True
False

Glossary: What Do Data Scientists Do?

Reading

Graded Assignment: Graded Quiz: What Data Scientists Do

You are a new data scientist. You have been tasked with coming up with a solution for reducing traffic congestion and improving transportation efficiency. How would you go about it?
Suggest creating more parking lots and garages in the city
Suggest implementation of strict speed limits and traffic fines
Suggest implementation of surge charges for ride-sharing services
Gather and analyze streetcar operations data and identify congested routes

Imagine you take a taxi ride where the initial fare is a fixed amount, and the fare increases based on both the distance traveled and the time spent in traffic. Which concept in data analysis does this scenario closely resemble?
Nearest neighbor algorithm
Unstructured data extraction
Data visualization with R
Regression analysis

You have to pick a file format that meets the following conditions: a) it is self-descriptive for internet-based information sharing, b) it is readable by both humans and machines, and c) it facilitates easy data sharing between different systems. Which file format would you pick?
JavaScript Object Notation (JSON)
Extensible Markup Language (XML)
Microsoft Excel Open XML Spreadsheet (XLSX)
Delimited text file formats (CSV/TSV)

Reading: Summary: What Do Data Scientists Do?

Reading

