Making Data Analytics Predictions for PhDs
Simulation co-authors – Sarah Peterson, PhD
Simulation vetted by data professionals in the Greater Atlanta area
Simulation Objective:
Organize and Analyze a Dataset to Determine Trends Influencing Company Outcomes
Associated Simulation Library:
Background
Careers in data science range from cleaning and converting data to analyzing and experimenting with data to drive company strategy. At the heart of data science is the ability to extract information from data to solve problems and answer questions that drive an organization forward. Data scientists rely on statistics and computer programming, but also on curiosity, analytical creativity and strong communication skills. As a result, the ways a data scientist interacts with data range from straightforward and operationalized to more creative and strategic.
While in some circumstances the outcome sought from the data analysis is known, in most cases careful listening and productive dialogue are required to determine the main driver of value for the company. After that determination is made, the data scientist spends time deciding on parameters, trying out different models, seeking out patterns and making connections within a data set. Throughout, it is important to remember that communicating back what you learn from the data and how you learned it will be essential to successfully conveying the value of the information to team members and/or clients.
The Process
1. Understand the question you are trying to answer using data. Questions should be as specific, measurable and concise as possible.
2. Determine what data you have access to and how you need to process that data in order to answer the question of interest. Data in the real world often isn’t complete, so understanding how to deal with missing data is important.
3. Given the question, identify the main driver of value. Which features or aspects of the data are likely to have the greatest impact on the outcome variable associated with that value?
4. Once these immediate questions are answered, begin an exploratory analysis that can point to trends in the data that drive value. What techniques are best suited to answer these questions?
5. Perform the analysis of the data using a combination of regression, principal component analysis, machine learning, and statistics as necessary.
6. Interpret results and communicate back to team.
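As one illustration of the analysis step, the sketch below fits a single-variable least-squares regression and reports how much of the variation it explains. It is a minimal example under stated assumptions, not the method the exercise prescribes: the `quantity` and `sales` values are invented, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented example values (assumption): units ordered and resulting sales.
quantity = [10, 20, 30, 40, 50]
sales = [1000, 1900, 3100, 3900, 5000]

slope, intercept = fit_line(quantity, sales)
fit_quality = r_squared(quantity, sales, slope, intercept)
print(f"sales = {intercept:.1f} + {slope:.1f} * quantity (R^2 = {fit_quality:.3f})")
```

A high R^2 on the same data used for fitting says little about how the model behaves on new records, so in practice part of the data would be held back for evaluation.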
Statistics
General mathematics
Linear algebra
Programming languages including R, Python, SAS
Machine learning
Optimization
Resources
The Exercise
Using this data set, complete the tasks outlined below that move through Steps 2, 3, 4 and 5 of the data science process. These tasks demonstrate the range of ways data scientists interact with data.
Task 1 - Data Preparation
Cleaning and organizing:
Determine if the data makes sense as it is presented on the spreadsheet. For example, does the data match the column label? Is it in the correct format? Are there basic spelling errors or other typos?
Make any corrections.
Identify any missing data and determine how you will deal with the null fields and different data types (e.g., names, phone numbers, and locations).
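The cleaning steps above can be sketched with Python's standard library. This is only an illustration: the sample rows are invented, and every column name except SALES (`CUSTOMERNAME`, `PHONE`, `CITY`) is an assumption rather than the layout of the actual spreadsheet. The helper `clean_row` and the `MISSING` placeholder set are likewise hypothetical.

```python
import csv
import io

# Invented sample standing in for the exercise spreadsheet (assumption).
RAW = """CUSTOMERNAME,PHONE,CITY,SALES
Acme Corp,404-555-0101,Atlanta,2871.00
 acme corp ,4045550101,atlanta,
Birch Ltd,,Decatur,1520.50
Cedar Inc,404.555.0199,Atlanta,n/a
"""

# Placeholder strings treated as missing values (assumption).
MISSING = {"", "n/a", "na", "null", "none"}


def clean_row(row):
    """Trim whitespace, map placeholders to None, and normalize each field."""
    cleaned = {}
    for key, value in row.items():
        value = (value or "").strip()
        cleaned[key] = None if value.lower() in MISSING else value
    for text_column in ("CUSTOMERNAME", "CITY"):
        if cleaned[text_column]:
            cleaned[text_column] = cleaned[text_column].title()
    if cleaned["PHONE"]:
        # Keep digits only so differently formatted numbers compare equal.
        cleaned["PHONE"] = "".join(ch for ch in cleaned["PHONE"] if ch.isdigit())
    if cleaned["SALES"] is not None:
        cleaned["SALES"] = float(cleaned["SALES"])
    return cleaned


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]

# Count missing values per column to decide how each one should be handled.
missing_counts = {
    column: sum(1 for r in rows if r[column] is None) for column in rows[0]
}
print(missing_counts)
```

Counting the gaps per column first makes the later decision explicit: a missing phone number can usually be left empty, while a missing SALES value has to be dropped or imputed before any modeling.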
Task 2 - Determine Factors
You have been tasked by the company to determine which factors influence sales (denoted as the SALES column in the data set). If possible, derive a method to predict sales.
The company is interested in finding the top 5 factors which drive sales.
They expect you to examine every variable included in the data set.
Using these factors, the company would like to build a machine learning algorithm to predict the sales they can expect from a new customer. What are the challenges associated with this approach?
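One simple way to shortlist candidate factors is to rank numeric columns by the strength of their linear correlation with SALES. The sketch below does this with a hand-written Pearson coefficient. It is one possible starting point, not the method the company expects: the data and all column names other than SALES are invented, and `pearson` and `rank_factors` are hypothetical helpers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_factors(columns, target, top_n=5):
    """Rank numeric columns by absolute correlation with the target column."""
    scores = {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
    ordered = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    return ordered[:top_n]


# Invented numeric columns (assumption); only SALES is named in the exercise.
data = {
    "QUANTITY": [10, 20, 30, 40, 50],
    "PRICE_EACH": [95, 90, 99, 92, 97],
    "DISCOUNT": [5, 4, 3, 2, 1],
    "REGION_CODE": [1, 1, 1, 1, 1],
    "SALES": [1000, 1900, 3100, 3900, 5000],
}

ranked = rank_factors(data, "SALES")
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

The toy data also hints at some of the challenges: QUANTITY and DISCOUNT move together, so they do not count as two independent drivers, correlation captures only linear relationships and says nothing about causation, and text or categorical columns need encoding before they can be scored this way.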
Task 3 - Identify Trends
Now that you have figured out which variables influence sales, the company wants to know if there are other interesting trends in the data. Do certain customers tend to buy at certain times? What drives the deal size?
Define the question you are interested in answering.
Outline the methods you will use to examine the question.
Present the results of the analysis to the company in a professional one-page report.
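For a timing question such as "do certain customers tend to buy at certain times?", a first pass might group orders by calendar month and compare counts and totals. The sketch below assumes each record has a customer, an order date and a sales amount. These field names and all values are invented for illustration and may not match the data set.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Invented order records (assumption); field names are hypothetical.
orders = [
    {"customer": "Acme Corp", "order_date": date(2023, 11, 3), "sales": 4200.0},
    {"customer": "Acme Corp", "order_date": date(2023, 11, 21), "sales": 3900.0},
    {"customer": "Birch Ltd", "order_date": date(2023, 2, 14), "sales": 1500.0},
    {"customer": "Birch Ltd", "order_date": date(2023, 11, 9), "sales": 1700.0},
    {"customer": "Cedar Inc", "order_date": date(2023, 6, 30), "sales": 800.0},
]

# Collect the sales amounts that fall in each calendar month.
sales_by_month = defaultdict(list)
for order in orders:
    sales_by_month[order["order_date"].month].append(order["sales"])

# Summarize each month: number of orders, total sales, average order size.
summary = {
    month: {"orders": len(values), "total": sum(values), "average": mean(values)}
    for month, values in sorted(sales_by_month.items())
}

busiest_month = max(summary, key=lambda m: summary[m]["total"])
print(f"Busiest month by total sales: {busiest_month}")
```

The same grouping pattern works for other keys, such as customer or deal-size bucket, and the resulting summary table can go straight into the one-page report.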
Deliverables
At the end of these tasks you should be able to deliver a Word document which clearly outlines the following:
question you were trying to answer
methods you used to answer the question
results of the analysis
The document should include tables and graphs where appropriate to support your findings. Make sure to include the specifics of how the data was processed in each case.
Additional Tasks
Machine Learning
Skills Used to Perform this Task
Organization
Patience
Statistical Analysis
Data intuition
Analytical
Skills Used in the Field
Communication
Programming
Data visualization
Creativity