export default `
[A valuable skill you can acquire this term is learning to look at your assignment or goals logically and {break these down into smaller tasks} that can be easily translated into code.|][ {OSEMN (Obtain, Scrub, Explore, Model, iNterpret) is a good approach for Data Science}; it can help add structure to your efforts and is recommended.|][ This term, when you work on an assignment, {break it down using this process and then plan out the subtasks for each process step}.|][ Following an organized approach will add structure, facilitate collaboration, and overall improve the quality of your work.|]

[Some of the material in this lecture is from the recommended reading assignments as well as my own experience.|]

[{Software Development and Coding skills require practice}.|][ As with any language you might learn, start slow with Python, R, and SQL and work to increase your fluency.|][ As fluency increases, so will your development speed and your comfort with these languages.|][ {Learning to break projects down into manageable sizes and then implement each piece in the smallest, simplest form that you can} is part of developing this fluency.|][ Once you’re successful with this, expand this small code out to do more.|]

[Data science teams have borrowed from software best practices.|][ Many of the data scientists with whom I spoke said that software development best practices {don’t become useful until you have a good idea of what to build}.|][ At the beginning of a project, a data scientist doesn’t always know what that is.|][ So {there is often a zero-step of exploratory data analysis or experimentation} that must be done in order to know how to define the end of a project.|][ The point is that some parts of the standard software development process, like planning, development, testing, integration, and deployment, are valuable but need to be adapted to the specific goals of Data Science.|]

[{Actionable insight is a key outcome} that shows how data science can answer the business questions we asked when we first started the project.|][ {If your presentation does not trigger action in your audience, it means that your communication was not effective}.|][ Remember that you will be presenting to an audience with no technical background, so the way you communicate the message is key.|]

[{Processes are a network of activities that generate value by transforming inputs into outputs.|} Managing value necessitates good process management.|]

[Examples of good Data Science Process management:|  
-{Going back and iterating on each model separately} to improve them.|][  
-{Turning existing code}, whether it’s written in Python, R, or Java, {into a command-line tool} so it can be reused and combined with other tools.|][ GitHub makes it easier than ever to share your tools with the rest of the world or find ones created by others.|][  
-{Making a process auditable}, so that later on it can be reviewed to verify that it does what it’s expected to do.|][ Automated and well-documented workflows make auditing much easier.|][ If it’s auditable then you don’t have to just take someone’s word for it.|]

[Before you begin the process, {write down a problem statement so that you know what you are after}.|][ Here too, it's tempting to tackle a very big goal right from the beginning.|][ But first try to {develop a series of smaller questions} to help understand how you can meet the goal.|][ Answering those questions can lead to new problem statements.|]

[Here’s an example problem statement with a template to use for formatting:|  

[Cafe A tells you that they want to increase their profits as compared to Cafe B.|
So here you have {the problem and the goal.|}
Now you can translate this statement into the following analytical questions:| ]
[What product is most popular?|  
What product is least popular?|  
How does the price of items at Cafe A compare to that of their competitors at Cafe B?|   
How many customers does Cafe A have as compared to Cafe B?|   
What are the peak hours at Cafe A and Cafe B, and do they overlap?|   
What is the traffic to each cafe?|   
What is the average age of the customers at each cafe?|  
What is the number of repeat customers at each cafe?|]

[After analyzing data to answer these smaller questions, you soon notice that Cafe A sells less coffee than its competitor, Cafe B.|
This changes the problem statement from “How do we increase our profits?|”   to “How do we sell more coffee?|”   ]

[Frame the Problem Statement
{Write a statement that describes the problem, why solving the problem is important and a starting point to begin solving it.|}
A problem statement generally follows the format {:|“The problem P, has the impact I, which affects B, so a good starting point would be S.|”  } ]
[Let’s break the statement down:| ]
[{“The problem P”}:| Here insert the problem as defined by the company.|]
[{“… has the impact I.|”  } Insert the negative impacts/pain points of the problem.|]
[{“… which affects B.|”  } Insert the parties that are affected.|][ It could be the business, the customers or a third party.|]
[{“…, so a good starting point would be S.|”  } Insert where to begin solving the problem, such as a first question to investigate.|]
[For the cafe scenario above, the problem statement would be something like:| ]
[“The problem of low coffee sales, has the impact of decreased profits, which affects Cafe A, so a good starting point would be to compare their coffee price with that of their competitors.|”   ]

[In this step, {you will need to query databases}, using technical skills like SQL (for example, with MySQL) to retrieve and process the data.|][ You may also receive data in file formats like Microsoft Excel.|][ If you are using Python or R, both have packages that can read data from these sources directly into your data science programs.|]

[The types of databases you may encounter include relational databases like PostgreSQL and Oracle, or even non-relational (NoSQL) databases like MongoDB.|][ Another way to obtain data is to scrape it from websites using web scraping tools such as Beautiful Soup, or to connect to Web APIs.|][ Websites such as Facebook and Twitter allow users to connect to their web servers and access their data.|][ All you need to do is use their Web API to request the data.|]

[For your course project {I encourage you to start thinking now about where you will get data and what questions you want to answer}.|][ For example, you might want to look at the US Supreme Court transcripts.| The data set includes various people (justices, attorneys, and parties) as well as what they say in court; the data is split up across multiple files and contains strings with punctuation.|][ I think there are a variety of things to look at, like which justices most often rule together, what words are indicative of a justice ruling for or against a party, which justices are the most talkative, etc.|][ Analyzing those would most likely result in several interesting problem statements.|]

[Next {convert the data from one format to another and consolidate everything} into a single standardized format to make processing and analysis easier.|][ For example, if your data is stored in multiple CSV files, then you will consolidate the CSV data into a single repository, so that you can process and analyze it.|]

[This is the step where {lots of new Data Scientists get overwhelmed} by the coding, and it’s a place where some students in the past have really struggled.|][ {I want to encourage you to learn to break the problem down into manageable subtasks and then implement each one at the smallest, simplest level that you can.|} Once you’re successful with this, expand this small code out to do more.|]

[For example, suppose an assignment requires you to 1) load a directory of files, 2) convert them from CSV data format to JSON data format, and finally 3) write one file of JSON.|][ Then {realize you will have one program composed of three sub-programs} or tasks.|][ {Don't start by trying to write a program to solve this bigger problem}.|][ Get used to looking at the problem as a composition of little programs that each carry out their own task.|][ This part of the planning phase, figuring out how to break a more complex problem into manageable subtasks, is critical.|][ Once you’ve done this you can address each piece one small task at a time.|][ In our example, we want to read a lot of files, or a directory of files.|][ So {I would start by reading one single file before I tried to read multiple files} or a directory of files.|][ Conduct your research and look at other example code, then implement your code a few lines at a time and test it often.|][ {Once this first task is successful it is generally easier to scale up} to read multiple files, or a directory of files.|]

[Python is a common tool for this step, and we will use it this term.|][ For handling bigger data sets later in this term we will use DataPrep, BigQuery and PySpark, among others.|]

[{Scrubbing data also includes the task of extracting and replacing values.|} If you find that data is missing, or that some entries appear to be non-values, this is the time to replace them accordingly.|]

[{Lastly, you will also need to split, merge and extract columns}.|][ For example, for the place of origin, you may have both “City” and “State”.|][ Depending on your requirements, you might need to either merge or split these data.|]

[{Think of this process as organizing and tidying up the data}, removing what is no longer needed, replacing what is missing and standardizing the format across all the data collected.|]
[Now your data is ready to be used, and before you jump into AI and Machine Learning, examine the data by computing descriptive statistics to extract features and test significant variables.|][ Testing significant variables is often done with correlation.|][ The term “feature”, as used in Machine Learning or Modelling, refers to an attribute of the data that helps us identify the characteristics that represent it.|][ For example, “Name”, “Age”, “Gender” are typical features of a members or employees dataset.|][ Data visualization helps to identify significant patterns and trends in your data.|][ We can gain a better picture through simple charts like line charts, bar charts or scatter plots to help understand the features and variables.|]

[The command line may be most useful when you want to explore a new dataset, create some quick visualizations, or compute some aggregate statistics.|][ Having a good data directory structure helps, and a tool called cookiecutter can take care of all the setup and boilerplate for data science projects.|]

[{Data Scientists frequently downsample data in order to quickly prototype.|} If the original data is a terabyte, downsampling it to a couple
hundred megabytes still represents something significant, but is more manageable and may be processed quickly.|]
[Once again, before starting this stage, {bear in mind that the scrubbing and exploring stages are equally crucial to building useful models}.|][ So take your time on those stages instead of jumping right to this one.|]

[  Modelling data starts with reducing the dimensionality of your data set.|][ {Not all your features or values are essential} to your model’s predictions.|][ {You want to select the strongest contributors to the prediction of results.|} We can forecast values using linear regression.|][ We can also use modelling to group data to understand the logic behind those clusters.|][ For example, we can group our e-commerce customers using k-means or hierarchical clustering to understand their behavior on the website.|]

[In the terminology of machine learning, {classification is considered supervised learning}, i.e. learning where a training set of correctly identified observations is available.|][ {The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity} or distance.|]

[In Machine Learning, you will need to be skilled with both supervised and unsupervised algorithms in R and Python, as well as with other statistical analysis tools.|]

[After the modelling process, you will need to be able to calculate evaluation scores such as precision, recall and F1 score for classification.|][ For regressions, you need to be familiar with R squared to measure goodness-of-fit, and with error scores like MAE (Mean Absolute Error) or RMSE (Root Mean Square Error) to measure the distance between the predicted and observed data points.|
{The final and most crucial step, Interpreting Data, refers to the presentation of your data to a non-technical audience}.|][  Data scientists often say formulating a problem statement and communicating results are the most difficult steps.|][ They are also the steps that are the least amenable to automation.|]

[We seek to deliver the results that answer the business questions we asked when we first started the project, together with the actionable insights that we found through the data science process.|]

[Actionable insight is a key outcome: it shows how data science can bring about predictive analytics and, later on, prescriptive analytics.|][ Through these, we learn how to repeat a positive result or prevent a negative outcome.|][ {If our stakeholders know what to do with our results then we have successfully added value with our efforts}.|]

[Use visualizations that your stakeholders can understand and relate to the problem statement.|][ If you don’t present your findings in a useful way, they risk being pointless to your stakeholders.|]

[It is difficult for people to wrap their heads around results when everything is expressed as just a probability.|][ {Stakeholders often just want a 'Does it work?' answer.|} So try to summarize it in language they can understand and act on.|]

[Another issue is the tendency of data scientists to build over-complicated models that stakeholders do not use.|][ Either they don’t understand the model because it is a black box to them, or its output is not something they can consume.|][ {You are better served focusing more on usability as opposed to complexity}.|]

[Involving stakeholders throughout the data science process can help them understand and guide your efforts.|]
[{For the assignments in this course you will be asked to present your work according to the OSEMN model}.|][ Of course this works a lot better if you start off by creating a document (you can begin with just an outline of all five of the model elements) to document your plan.|][ Then follow this plan and update it as new insights and details emerge.|][ Students may wish to read the OSEMN Process Beehive Data Collection article as an example of how to apply this to assignments.|]

[{Be as detailed as necessary to describe what you will accomplish in each step of the process.|} Describe your approach to building the tools to accomplish each step.|][ For example “In the SCRUB step I needed to read a lot of files organized in directories before I could use the data.|][ I started by writing a program to read one single file then went on to read more than one.|][ I had to research how to do this recursively in subdirectories.|][ I tested the recursive code with just a few files in one subdirectory.|][ Once this was successful I was able to expand up to read a large directory tree of files.|”   ]
[{Students’ data science plans will be evaluated on their assignments, beginning with the Data Wrangling assignment in module 4.|} Here are example rubric criteria for evaluation.|]`;
