Coursera | Applied Plotting, Charting & Data Representation in Python（UMich）| Assignment4

Published: 2023/12/8

Links to all related assignments:
Coursera | Applied Plotting, Charting & Data Representation in Python（University of Michigan）| Assignment1
Coursera | Applied Plotting, Charting & Data Representation in Python（University of Michigan）| Assignment2
Coursera | Applied Plotting, Charting & Data Representation in Python（University of Michigan）| Assignment3
Coursera | Applied Plotting, Charting & Data Representation in Python（University of Michigan）| Week3 Practice Assignment
Coursera | Applied Plotting, Charting & Data Representation in Python（University of Michigan）| Assignment4
If I have time (or there is demand), I will put all the code on GitHub.
A plug for my own blog: from now on, my CSDN articles will also be posted there.

Coursera | Applied Plotting, Charting & Data Representation in Python（University of Michigan）| Assignment4

Assignment 4 - Becoming an Independent Data Scientist
- Peer Review
- Code
  - Tips
  - Example
  - Project info
  - Load data and clean data
  - Visualize-KDE
  - Visualize-Line Plot
- Discussion

Assignment 4 - Becoming an Independent Data Scientist

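The final week's assignment nearly killed me, again and again 😦.
This one is an independent project: you find the data yourself and choose the question yourself, so there is a lot of freedom. I spent a long time picking a topic, read many references, and got happily lost in Wikipedia. In the end I could not make up my mind, so I settled on a topic related to the example, which also ties in with my earlier Introduction to Data Science in Python| Assignment4; feel free to use it as a reference. Discussion and suggestions are welcome~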

Peer Review

Code

Before working on this assignment please read these instructions fully. In the submission area, you will notice that you can click the link to Preview the Grading for each step of the assignment. This is the criteria that will be used for peer grading. Please familiarize yourself with the criteria before beginning the assignment.

This assignment requires that you find at least two datasets on the web which are related, and that you visualize these datasets to answer a question with the broad topic of sports or athletics (see below) for the region of Farmington, Michigan, United States, or United States more broadly.

You can merge these datasets with data from different regions if you like! For instance, you might want to compare Farmington, Michigan, United States to Ann Arbor, USA. In that case at least one source file must be about Farmington, Michigan, United States.

You are welcome to choose datasets at your discretion, but keep in mind they will be shared with your peers, so choose appropriate datasets. Sensitive, confidential, illicit, and proprietary materials are not good choices for datasets for this assignment. You are welcome to upload datasets of your own as well, and link to them using a third party repository such as github, bitbucket, pastebin, etc. Please be aware of the Coursera terms of service with respect to intellectual property.

Also, you are welcome to preserve data in its original language, but for the purposes of grading you should provide english translations. You are welcome to provide multiple visuals in different languages if you would like!

As this assignment is for the whole course, you must incorporate principles discussed in the first week, such as having a high data-ink ratio (Tufte) and aligning with Cairo’s principles of truth, beauty, function, and insight.

Here are the assignment instructions:

State the region and the domain category that your data sets are about (e.g., Farmington, Michigan, United States and sports or athletics).
You must state a question about the domain category and region that you identified as being interesting.
You must provide at least two links to available datasets. These could be links to files such as CSV or Excel files, or links to websites which might have data in tabular form, such as Wikipedia pages.
You must upload an image which addresses the research question you stated. In addition to addressing the question, this visual should follow Cairo’s principles of truthfulness, functionality, beauty, and insightfulness.
You must contribute a short (1-2 paragraph) written justification of how your visualization addresses your stated research question.

What do we mean by sports or athletics? For this category we are interested in sporting events or athletics broadly, please feel free to creatively interpret the category when building your research question!

Tips

Wikipedia is an excellent source of data, and I strongly encourage you to explore it for new data sources.
Many governments run open data initiatives at the city, region, and country levels, and these are wonderful resources for localized data sources.
Several international agencies, such as the United Nations, the World Bank, and the Global Open Data Index, are other great places to look for data.
This assignment requires you to convert and clean datafiles. Check out the discussion forums for tips on how to do this from various sources, and share your successes with your fellow students!
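
As a small illustration of the kind of cleaning this tip refers to, here is a minimal sketch. It assumes a table scraped from a Wikipedia page in which season labels look like "1957–58" and some numeric cells carry footnote markers such as "[a]". The column names and values are made up for illustration and are not taken from the datasets used in this post.

```python
import pandas as pd

# Hypothetical raw table, shaped like what a Wikipedia scrape often yields.
raw = pd.DataFrame({
    "Season": ["1957–58", "1958–59", "1959–60"],
    "Wins": ["33", "28[a]", "30"],
    "Losses": ["39", "44", "45[b]"],
})

clean = pd.DataFrame()
# Keep only the starting year of a label such as "1957–58".
clean["Year"] = raw["Season"].str[:4]
# Strip footnote markers like "[a]" before converting to integers.
for col in ["Wins", "Losses"]:
    clean[col] = raw[col].str.replace(r"\[.*?\]", "", regex=True).astype(int)

print(clean)
```

Keeping the year as a string here mirrors the approach taken later in this post, where the `Year` column is used as the merge key.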

Example

Looking for an example? Here’s what our course assistant put together for the Ann Arbor, MI, USA area using sports and athletics as the topic. Example Solution File

Project info

Name:

Summary of win percentages for the Big4 sports teams in Michigan

Region:

Michigan, United States

Category:

Sports and Athletics

Question:

How do the win percentages of the Big4 sports teams in Michigan compare, and how have they trended over time?

Links:

List_of_Detroit_Lions_seasons

List_of_Detroit_Tigers_seasons

List_of_Detroit_Pistons_seasons

List_of_Detroit_Red_Wings_seasons

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import matplotlib.colors as col
import matplotlib.cm as cm
import seaborn as sns
import re

%matplotlib notebook

plt.style.use('seaborn-colorblind')

Load data and clean data

!pip install lxml

dict_datasets={"Tigers":"List of Detroit Tigers seasons - Wikipedia.html",
               "Lions":"List of Detroit Lions seasons - Wikipedia.html",
               "Pistons":"List of Detroit Pistons seasons - Wikipedia.html",
               "RedWings":"List of Detroit Red Wings seasons - Wikipedia.html",
}

# Lions
df_lions=pd.read_html(dict_datasets['Lions'])[1][6:92]

lions=pd.DataFrame()
lions['Year']=df_lions['NFL season']['NFL season']
lions['Wins']=df_lions['Regular season']['Wins'].astype(int)
lions['Losses']=df_lions['Regular season']['Losses'].astype(int)

lions['Win %_Lions']=lions['Wins']/(lions['Wins']+lions['Losses'])

# Tigers
df_tigers=pd.read_html(dict_datasets['Tigers'])[3]

tigers=pd.DataFrame()
tigers[['Year','Wins','Losses']]=df_tigers[['Season','Wins','Losses']].copy()
tigers['Year']=tigers['Year'].astype(str)
tigers['Year']=tigers['Year'].astype(object)
tigers['Wins']=tigers['Wins'].astype(int)
tigers['Losses']=tigers['Losses'].astype(int)
tigers['Win %_Tigers']=tigers['Wins']/(tigers['Wins']+tigers['Losses'])

# Pistons
df_pistons=pd.read_html(dict_datasets['Pistons'])[1][11:74]

pistons=pd.DataFrame()
pistons['Year']=df_pistons['Team Season'].str[:4]
pistons[['Wins','Losses']]=df_pistons[['Wins','Losses']]
pistons['Wins']=pistons['Wins'].astype(int)
pistons['Losses']=pistons['Losses'].astype(int)

pistons['Win %_Pistons']=pistons['Wins']/(pistons['Wins']+pistons['Losses'])

# Red Wings
df_redw=pd.read_html(dict_datasets['RedWings'])[2][:94]

redw=pd.DataFrame()
redw['Year']=df_redw['NHL season']['NHL season'].str[:4]
redw[['Wins','Losses']]=df_redw['Regular season[3][6][7][8]'][['W','L']]
redw=redw.set_index('Year')

# missing 2004
redw.loc['2004',['Wins','Losses']]=redw.loc['2003'][['Wins','Losses']]

redw['Wins']=redw['Wins'].astype(int)
redw['Losses']=redw['Losses'].astype(int)

redw['Win %_RedWings']=redw['Wins']/(redw['Wins']+redw['Losses'])
redw=redw.reset_index()

# Merge data for visualize
Big4_Michigan=pd.merge(lions.drop(['Wins','Losses'], axis=1),tigers.drop(['Wins','Losses'], axis=1),on='Year')
Big4_Michigan=pd.merge(Big4_Michigan,pistons.drop(['Wins','Losses'], axis=1),on='Year')
Big4_Michigan=pd.merge(Big4_Michigan,redw.drop(['Wins','Losses'], axis=1),on='Year')
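
One detail of the merge step is worth spelling out: `pd.merge` defaults to an inner join, so only years that appear in every team's table survive into the merged frame. The sketch below shows this on two toy tables. The values are invented for illustration and are not real season records.

```python
import pandas as pd

# Toy per-team tables (values are invented, not real season records).
lions = pd.DataFrame({"Year": ["1956", "1957", "1958"],
                      "Wins": [9, 8, 4], "Losses": [3, 4, 7]})
pistons = pd.DataFrame({"Year": ["1957", "1958", "1959"],
                        "Wins": [33, 28, 30], "Losses": [39, 44, 45]})

# Win percentage as used in this post: wins over wins plus losses.
lions["Win %_Lions"] = lions["Wins"] / (lions["Wins"] + lions["Losses"])
pistons["Win %_Pistons"] = pistons["Wins"] / (pistons["Wins"] + pistons["Losses"])

# pd.merge defaults to how="inner": only years found in both tables are kept.
merged = pd.merge(lions.drop(columns=["Wins", "Losses"]),
                  pistons.drop(columns=["Wins", "Losses"]),
                  on="Year")
print(merged)
```

Here 1956 and 1959 drop out because each appears in only one table. The `Year` columns must also share a dtype (strings in this post) for the keys to match.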

Visualize-KDE

%matplotlib notebook

# Draw KDE
kde=Big4_Michigan.plot.kde()
[kde.spines[loc].set_visible(False) for loc in ['top', 'right']]
kde.axis([0,1,0,6])
kde.set_title('KDE of Big4 Win % in Michigan\n(1957-2019)',alpha=0.8)
kde.legend(['Lions','Tigers','Pistons','Red Wings'],loc = 'best',frameon=False, title='Big4', fontsize=10)
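
To make clearer what the KDE curve represents, here is a minimal NumPy sketch of a one-dimensional Gaussian kernel density estimate with Scott's rule for the bandwidth, which is the default rule behind `DataFrame.plot.kde`. It is an illustration written from scratch, not the code pandas runs internally, and the sample values are invented.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Evaluate a 1-D Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    # Scott's rule of thumb for the bandwidth in one dimension.
    h = samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, averaged over all samples.
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# Invented win percentages for one hypothetical team.
win_pct = [0.42, 0.55, 0.48, 0.61, 0.39, 0.52, 0.50, 0.45]
grid = np.linspace(-0.5, 1.5, 2001)
density = gaussian_kde_1d(win_pct, grid)

# A density should integrate to roughly 1 over a wide enough range.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A narrow, tall curve means a team's win percentage varies little from season to season; a wide, flat one means it varies a lot. Note that the estimate can place some density outside the interval from 0 to 1 even though a win percentage cannot leave it.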

Visualize-Line Plot

Big4_Michigan_0019=Big4_Michigan[40:]
fig, ((ax1,ax2), (ax3,ax4)) = plt.subplots(2, 2, sharex=True, sharey=True)
axs=[ax1,ax2,ax3,ax4]

fig.suptitle('Big4 Win % in Michigan\n(2000-2019)',alpha=0.8);

# Properties
columns_w=['Win %_Lions','Win %_Tigers','Win %_Pistons','Win %_RedWings']
colors=['g','b','y','r']
titles=['NFL: Lions','MLB: Tigers','NBA: Pistons','NHL: Red Wings']
axis=[0,20,0,0.8]

y=0.5

for i in range(len(axs)):

    # Draw the subplot
    ax=axs[i]
    # ax.plot(Big4_Michigan_0019['Year'],Big4_Michigan_0019[columns_w[i]],c=colors[i], alpha=0.5)
    # sns.lineplot(x=Big4_Michigan_0019['Year'],y=Big4_Michigan_0019[columns_w[i]], alpha=0.5,ax=ax)
    sns.pointplot(x=Big4_Michigan_0019['Year'],y=Big4_Michigan_0019[columns_w[i]],scale = 0.7, alpha=0.5,ax=ax)
    ax.axhline(y=0.5, color='gray', linewidth=1, linestyle='--')
    ax.fill_between(range(0,20), 0.5, Big4_Michigan_0019[columns_w[i]],where=(Big4_Michigan_0019[columns_w[i]]<y), color='red',interpolate=True, alpha=0.3)
    ax.fill_between(range(0,20), 0.5, Big4_Michigan_0019[columns_w[i]],where=(Big4_Michigan_0019[columns_w[i]]>y), color='blue',interpolate=True, alpha=0.3)

    # Beautify the plot
    [ax.spines[loc].set_visible(False) for loc in ['top', 'right']] # Turn off some plot rectangle spines
    ax.set_ylabel('Win % ', alpha=0.8)
    ax.set_xlabel('Year', alpha=0.8)
    ax.set_title(titles[i], fontsize=10, alpha=0.8)
    ax.axis(axis)
    ax.set_xticks(np.append(np.arange(0, 20, 5),19))
    ax.set_xticklabels(['2000','2005','2010','2015','2019'], fontsize=8, alpha=0.8)
    for label in ax.get_xticklabels() + ax.get_yticklabels():
        label.set_fontsize(8)
        label.set_bbox(dict(facecolor='white',edgecolor='white', alpha=0.8))
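
The positional slice `Big4_Michigan[40:]` in the code above depends on how many rows survived the merge, which this post does not show. Filtering on the year value is a more robust alternative. The sketch below uses a hypothetical table that contains every year from 1957 to 2019, with a made-up `Win %_Demo` column, only to illustrate the general point.

```python
import pandas as pd

# Hypothetical merged table with string years and a placeholder column.
years = [str(y) for y in range(1957, 2020)]
big4 = pd.DataFrame({"Year": years, "Win %_Demo": [0.5] * len(years)})

# Positional slice: rows 40 onward, whatever years those happen to be.
by_position = big4[40:]

# Label-based alternative: filter on the year value itself.
year_num = big4["Year"].astype(int)
by_label = big4[(year_num >= 2000) & (year_num <= 2019)]

# In this complete toy table, row 40 is 1997, not 2000.
print(by_position["Year"].iloc[0], by_label["Year"].iloc[0], len(by_label))
```

With the label-based filter, the 20 plotted seasons stay the same even if the merged table gains or loses rows.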

Discussion

Justification:

Before visualizing, the data are loaded from Wikipedia and cleaned. For example, tie games are left out of the win percentage, and the Red Wings' missing 2004 season is filled with the values from 2003.

Two visualizations are used to answer the question. The first figure, the KDE, shows the kernel density estimate of the four teams’ win percentages from 1957 to 2019. From it we can read off the statistical character of each team's win percentage, such as its mean and standard deviation. For instance, the Tigers' win percentage is comparatively stable, showing less variation than the other teams'. Moreover, the win percentages of the Pistons and the Lions tend to sit a little below 0.5.

To look at the trend of each team's win percentage, data from 2000 to 2019 are extracted to generate the second visualization, where the trend for each team is easy to see. For example, over the last 20 years the win percentages of the Lions and the Tigers were often below 0.5, which is not encouraging. Furthermore, although the Red Wings' win percentages look good for most of the period, they are on a decreasing trend, so there is no guarantee that the team will keep this up next season.
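
The trend in this post is read off the plot by eye. One way to put a number on it is the slope of a least-squares line, sketched below with `np.polyfit`. This is an added illustration, not part of the original analysis, and the win percentages are invented rather than taken from any team.

```python
import numpy as np

# Invented win percentages for 20 consecutive seasons (not real team data).
seasons = np.arange(2000, 2020)
win_pct = np.array([0.62, 0.60, 0.63, 0.58, 0.59, 0.61, 0.57, 0.60, 0.56, 0.55,
                    0.57, 0.54, 0.52, 0.53, 0.50, 0.49, 0.47, 0.45, 0.42, 0.40])

# Fit a straight line; the slope is the average change in win % per season.
slope, intercept = np.polyfit(seasons - seasons[0], win_pct, deg=1)

print(f"slope per season: {slope:.4f}")
```

A negative slope supports a "decreasing trend" reading, but a straight line fitted to 20 points describes the past and is a weak basis for predicting the next season.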

Principles (truthfulness, beauty, functionality, insightfulness):

Truthfulness comes first. All data, including Wins and Losses, are taken from Wikipedia, and win percentages are calculated with the formula Wins / (Wins + Losses). The data cleaning was done carefully.
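
As a worked example of the formula, the sketch below computes the win percentage for a hypothetical season record and contrasts it with a common alternative convention that counts a tie as half a win. The alternative is shown only for comparison and is not used in this post; both function names are made up for illustration.

```python
def win_pct(wins, losses):
    """Win percentage as defined in this post: ties are not counted."""
    return wins / (wins + losses)

def win_pct_half_ties(wins, losses, ties):
    """Alternative convention (not used here): a tie counts as half a win."""
    return (wins + 0.5 * ties) / (wins + losses + ties)

# Hypothetical season record: 8 wins, 6 losses, 2 ties.
print(win_pct(8, 6))
print(win_pct_half_ties(8, 6, 2))
```

The two conventions give slightly different values whenever a season includes ties, so it helps to state which one a chart uses.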

On beauty: every element of the visualization is designed deliberately. For example, in the Big4 Win % in Michigan (2000-2019) figure, a horizontal line at 0.5 is drawn for each team. Areas where the win percentage is above 0.5 are filled with blue, and areas where it is below 0.5 are filled with red, for comparison. The colors are chosen to be vivid but easy on the eye.

Functionality is equally important. The KDE is chosen to visualize the four teams' win percentages over a long period because it shows their distribution clearly. For the trend, the most recent 20 years of data are shown as a line plot, which makes the change in win percentage through time easy to follow.

Lastly, insightfulness. The horizontal line shows whether a team won more games than it lost in a given year, and the colored shading shows the team's overall record over the last 20 years. For example, comparing the red and blue areas suggests that the Red Wings have been competitive in the NHL.
