4-1 Summarizing with Statistics

Computing Descriptive Statistics

In [4]:

import pandas as pd
ns_book6 = pd.read_csv('ns_book6.csv', low_memory=False)
ns_book6.head()

Out[4]:

	번호	도서명	저자	출판사	발행년도	ISBN	세트 ISBN	부가기호	권	주제분류번호	도서권수	등록일자
0	1	인공지능과 흙	김동훈 지음	민음사	2021	9788937444319	NaN	NaN	NaN	NaN	1	2021-03-19
1	2	가짜 행복 권하는 사회	김태형 지음	갈매나무	2021	9791190123969	NaN	NaN	NaN	NaN	1	2021-03-19
2	3	나도 한 문장 잘 쓰면 바랄 게 없겠네	김선영 지음	블랙피쉬	2021	9788968332982	NaN	NaN	NaN	NaN	1	2021-03-19
3	4	예루살렘 해변	이도 게펜 지음, 임재희 옮김	문학세계사	2021	9788970759906	NaN	NaN	NaN	NaN	1	2021-03-19
4	5	김성곤의 중국한시기행 : 장강·황하 편	김성곤 지음	김영사	2021	9788934990833	NaN	NaN	NaN	NaN	1	2021-03-19

In [8]:

ns_book6.describe()

Out[8]:

	번호	발행년도	도서권수	대출건수
count	379976.000000	379976.000000	379976.000000	379976.000000
mean	201726.332847	2008.516306	1.135874	11.504629
std	115836.454596	8.780529	0.483343	19.241926
min	1.000000	1947.000000	0.000000	0.000000
25%	102202.750000	2003.000000	1.000000	2.000000
50%	203179.500000	2009.000000	1.000000	6.000000
75%	301630.250000	2015.000000	1.000000	14.000000
max	401681.000000	2650.000000	40.000000	1765.000000

In [10]:

(ns_book6['도서권수'] == 0).sum()

Out[10]:

3206

In [12]:

ns_book7 = ns_book6[ns_book6['도서권수']>0]

In [14]:

ns_book7.describe(percentiles=[0.3, 0.6, 0.9])

Out[14]:

	번호	발행년도	도서권수	대출건수
count	376770.000000	376770.000000	376770.000000	376770.000000
mean	202977.476649	2008.460076	1.145540	11.593439
std	115298.245784	8.773148	0.473853	19.279409
min	1.000000	1947.000000	1.000000	0.000000
30%	124649.700000	2004.000000	1.000000	2.000000
50%	204550.500000	2009.000000	1.000000	6.000000
60%	243537.400000	2011.000000	1.000000	8.000000
90%	361341.100000	2018.000000	2.000000	28.000000
max	401681.000000	2650.000000	40.000000	1765.000000

In [16]:

ns_book7.describe(include='object')

Out[16]:

	도서명	저자	출판사	ISBN	세트 ISBN	부가기호	권	주제분류번호	등록일자
count	376770	376770	376770	376770	55866	308252	61793	359792	376770
unique	336408	248850	21875	350810	14875	17	834	12467	4562
top	승정원일기	세종대왕기념사업회 [편]	문학동네	9788937430299	9788937460005	0	1	813.6	1970-01-01
freq	250	303	4410	206	702	158235	13282	14816	28185

Computing the Mean

In [49]:

x = [10, 20, 30]
total = 0  # named total so the built-in sum() is not shadowed
for value in x:
    total += value
print("Mean:", total / len(x))

Mean: 20.0

In [20]:

ns_book7['대출건수'].mean()

Out[20]:

11.593438968070707

Computing the Median

In [22]:

ns_book7['대출건수'].median()

Out[22]:

6.0

In [24]:

temp_df = pd.DataFrame([1,2,3,4])
temp_df.median()

Out[24]:

0    2.5
dtype: float64
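
The 2.5 above comes from averaging the two middle values of an even-sized sample. A minimal sketch of that rule in plain Python follows; `median_of` is a hypothetical helper name used only for illustration and is not part of pandas.

```python
def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value is the median.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

even_result = median_of([1, 2, 3, 4])
odd_result = median_of([1, 2, 3, 4, 5])
print(even_result, odd_result)
```

With four values the two middle ones are 2 and 3, so the result is 2.5; with five values the middle one, 3, is returned unchanged.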

In [26]:

ns_book7['대출건수'].drop_duplicates().median()

Out[26]:

183.0

Computing the Minimum and Maximum

In [28]:

ns_book7['대출건수'].min()

Out[28]:

0

In [30]:

ns_book7['대출건수'].max()

Out[30]:

1765

Computing Quantiles

In [32]:

ns_book7['대출건수'].quantile(0.25)

Out[32]:

2.0

In [34]:

ns_book7['대출건수'].quantile([0.25,0.5,0.75])

Out[34]:

0.25     2.0
0.50     6.0
0.75    14.0
Name: 대출건수, dtype: float64

In [36]:

pd.Series([1,2,3,4,5]).quantile(0.9)

Out[36]:

4.6

In [38]:

pd.Series([1,2,3,4,5]).quantile(0.9, interpolation='midpoint')

Out[38]:

4.5

In [40]:

pd.Series([1,2,3,4,5]).quantile(0.9, interpolation='nearest')

Out[40]:

5
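
The three `interpolation` options differ only in how they resolve a quantile that falls between two data points. The sketch below reproduces the arithmetic by hand for the same five values, assuming the usual position formula `q * (n - 1)`; `quantile_position` is a hypothetical helper name, not a pandas function.

```python
import math

def quantile_position(n, q):
    """Fractional index of quantile q in a sorted sample of size n."""
    return q * (n - 1)

data = sorted([1, 2, 3, 4, 5])
pos = quantile_position(len(data), 0.9)   # about 3.6, between index 3 and 4
lower, upper = math.floor(pos), math.ceil(pos)
frac = pos - lower

# 'linear': move frac of the way from the lower to the upper neighbour.
linear = data[lower] + frac * (data[upper] - data[lower])
# 'midpoint': average the two neighbours regardless of frac.
midpoint = (data[lower] + data[upper]) / 2
# 'nearest': take the neighbour whose index is closest to pos.
nearest = data[round(pos)]
print(linear, midpoint, nearest)
```

The position 3.6 sits between the values 4 and 5, so linear interpolation gives roughly 4.6, the midpoint is 4.5, and the nearest neighbour is 5.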

In [42]:

borrow_10_flag = ns_book7['대출건수'] < 10

In [44]:

borrow_10_flag.mean()

Out[44]:

0.6402712530190833

In [46]:

ns_book7['대출건수'].quantile(0.65)

Out[46]:

10.0

Computing the Variance

In [55]:

ns_book7['대출건수'].var()

Out[55]:

371.6956304306922

Computing the Standard Deviation

In [58]:

ns_book7['대출건수'].std()

Out[58]:

19.27940949382766

Computing the Mode

In [61]:

ns_book7['도서명'].mode()

Out[61]:

0    승정원일기
Name: 도서명, dtype: object

In [63]:

ns_book7['발행년도'].mode()

Out[63]:

0    2012
Name: 발행년도, dtype: int64
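
`mode()` returns a Series rather than a single value because several values can tie for the highest count. A small sketch with made-up publication years (not taken from the dataset) shows the tie case:

```python
import pandas as pd

# Hypothetical years: 2012 and 2015 both appear twice.
years = pd.Series([2012, 2012, 2015, 2015, 2020])

modes = years.mode()          # every value that shares the top frequency
first_mode = modes.iloc[0]    # pick one explicitly if a single value is needed
print(modes.tolist(), first_mode)
```

When only one answer is needed, indexing with `.iloc[0]` makes the choice explicit instead of relying on there being a single mode.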

Descriptive Statistics on a DataFrame

In [66]:

ns_book7.mean(numeric_only = True)

Out[66]:

번호      202977.476649
발행년도      2008.460076
도서권수         1.145540
대출건수        11.593439
dtype: float64

In [68]:

ns_book7.loc[:, '도서명':].mode()

Out[68]:

	도서명	저자	출판사	발행년도	ISBN	세트 ISBN	부가기호	권	주제분류번호	도서권수	대출건수	등록일자
0	승정원일기	세종대왕기념사업회 [편]	문학동네	2012	9788937430299	9788937460005	0	1	813.6	1	0	1970-01-01

In [70]:

ns_book7.to_csv('ns_book7.csv', index=False)

NumPy's Descriptive Statistics Functions

- Weighted Average

In [74]:

import numpy as np
np.average(ns_book7['대출건수'], weights=1/ns_book7['도서권수'])

Out[74]:

10.543612175385386

In [76]:

np.mean(ns_book7['대출건수']/ns_book7['도서권수'])

Out[76]:

9.873029861445774

In [78]:

ns_book7['대출건수'].sum()/ns_book7['도서권수'].sum()

Out[78]:

10.120503701300958
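
The three cells above give three different numbers because they average different things. The sketch below uses small made-up arrays (`loans` and `copies` are hypothetical stand-ins for the 대출건수 and 도서권수 columns) to make the arithmetic visible.

```python
import numpy as np

loans = np.array([10, 20, 30])   # hypothetical loan counts per title
copies = np.array([1, 2, 5])     # hypothetical number of copies per title

weights = 1 / copies

# np.average with weights is sum(w * x) / sum(w).
weighted = np.average(loans, weights=weights)
manual = (loans * weights).sum() / weights.sum()

# Unweighted mean of the per-copy loan rate of each title.
per_copy_mean = np.mean(loans / copies)

# Total loans divided by total copies.
overall_ratio = loans.sum() / copies.sum()

print(weighted, per_copy_mean, overall_ratio)
```

The weighted average divides by the sum of the weights, the second divides by the number of titles, and the third pools all copies together, so the results generally disagree.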

- Computing Quantiles

In [83]:

np.quantile(ns_book7['대출건수'], [0.25,0.5,0.75])

Out[83]:

array([ 2.,  6., 14.])

- Computing the Variance

In [94]:

np.var(ns_book7['대출건수'])  # NumPy divides by n by default (ddof=0)

Out[94]:

371.694643898775

In [96]:

ns_book7['대출건수'].var()  # pandas divides by n-1 by default (ddof=1)

Out[96]:

371.6956304306922

In [98]:

np.var(ns_book7['대출건수'], ddof=1)

Out[98]:

371.6956304306922
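
The small gap between the two results comes entirely from the divisor. A minimal sketch on a five-value sample (made up for illustration) computes both variances by hand and compares them with `np.var`:

```python
import numpy as np

data = np.array([1, 10, 3, 6, 20])

mean = data.mean()                           # 8.0
squared_dev = ((data - mean) ** 2).sum()     # 226.0

population_var = squared_dev / len(data)         # divide by n (ddof=0)
sample_var = squared_dev / (len(data) - 1)       # divide by n-1 (ddof=1)

print(population_var, np.var(data))
print(sample_var, np.var(data, ddof=1))
```

On a sample this small the two differ noticeably (45.2 versus 56.5), whereas with hundreds of thousands of rows the difference shrinks to the third decimal place, as in the cells above.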

- Computing the Mode

In [101]:

values, counts = np.unique(ns_book7['도서명'], return_counts=True)

In [115]:

max_idx = np.argmax(counts)

In [117]:

values[max_idx]

Out[117]:

'승정원일기'
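
One caveat with the `np.unique` plus `np.argmax` approach is that `np.argmax` returns only the first position of the maximum, so ties are silently dropped. The sketch below uses a made-up array of labels to show this and one possible way to recover every tied value.

```python
import numpy as np

# Hypothetical labels: 'a' and 'b' both appear twice.
titles = np.array(['b', 'a', 'b', 'a', 'c'])

values, counts = np.unique(titles, return_counts=True)  # values come back sorted

top = values[np.argmax(counts)]               # first value with the top count
all_modes = values[counts == counts.max()]    # every value with the top count

print(top, all_modes.tolist())
```

Because `np.unique` sorts its output, the "first" maximum is the alphabetically smallest tied value, not the one that appears first in the data.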

Exercise 3

In [121]:

a = pd.DataFrame([1, 10, 3, 6, 20])
print(a.var())
print(a.std())

0    56.5
dtype: float64
0    7.516648
dtype: float64

Exercise 4

In [139]:

ns_book7[['출판사', '대출건수']].groupby('출판사').mean().sort_values('대출건수', ascending=False).head(10)

Out[139]:

	대출건수
출판사
동녁라이프	224.000000
와이비엠시사	215.000000
환상	210.000000
두앤비컨텐츠(랜덤하우스중앙)	167.000000
개미들출판사-에셀문화사	157.000000
해커스 어학 연구소	149.500000
Test clinic	137.000000
마야출판사	133.333333
해밀 & Co.	133.000000
마그나(Magna)	130.000000

Exercise 5

In [150]:

target_range = np.array(ns_book7['대출건수'].quantile(q=[0.25,0.75]))
target_bool_idx = (ns_book7['대출건수'] >= target_range[0]) & (ns_book7['대출건수'] <= target_range[1])
target_bool_idx.sum() / len(ns_book7) * 100

Out[150]:

51.51737134060568
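
The share is not exactly 50% even though the range runs from the first to the third quartile. One likely reason is that both comparisons are inclusive and many rows sit exactly on the boundary values. The sketch below illustrates the effect with a made-up Series; it is an illustration of the mechanism, not a reproduction of the dataset.

```python
import pandas as pd

# Hypothetical loan counts with repeated values at the quartile boundaries.
loans = pd.Series([1, 2, 2, 2, 3, 4, 5, 5, 5, 9])

q1, q3 = loans.quantile([0.25, 0.75])   # 2.0 and 5.0 for this sample

# between() is inclusive on both ends by default, like >= and <= above.
in_range = loans.between(q1, q3)
share = in_range.mean() * 100

print(q1, q3, share)
```

Here eight of the ten values fall in the closed interval, so the share is about 80% rather than 50%, purely because of ties at 2 and 5.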

4-2 Summarizing Distributions

In [154]:

pip install matplotlib

Requirement already satisfied: matplotlib in /opt/anaconda3/lib/python3.12/site-packages (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: numpy>=1.23 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (24.1)
Requirement already satisfied: pillow>=8 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Note: you may need to restart the kernel to use updated packages.

In [156]:

import pandas as pd
ns_book7 = pd.read_csv('ns_book7.csv', low_memory=False)
ns_book7.head()

Out[156]:

	번호	도서명	저자	출판사	발행년도	ISBN	세트 ISBN	부가기호	권	주제분류번호	도서권수	등록일자
0	1	인공지능과 흙	김동훈 지음	민음사	2021	9788937444319	NaN	NaN	NaN	NaN	1	2021-03-19
1	2	가짜 행복 권하는 사회	김태형 지음	갈매나무	2021	9791190123969	NaN	NaN	NaN	NaN	1	2021-03-19
2	3	나도 한 문장 잘 쓰면 바랄 게 없겠네	김선영 지음	블랙피쉬	2021	9788968332982	NaN	NaN	NaN	NaN	1	2021-03-19
3	4	예루살렘 해변	이도 게펜 지음, 임재희 옮김	문학세계사	2021	9788970759906	NaN	NaN	NaN	NaN	1	2021-03-19
4	5	김성곤의 중국한시기행 : 장강·황하 편	김성곤 지음	김영사	2021	9788934990833	NaN	NaN	NaN	NaN	1	2021-03-19

In [158]:

import matplotlib.pyplot as plt

The scatter() Function

In [166]:

plt.scatter([1,2,3,4],[1,2,3,4])
plt.show()


In [168]:

plt.scatter(ns_book7['번호'], ns_book7['대출건수'])
plt.show()

In [170]:

plt.scatter(ns_book7['도서권수'], ns_book7['대출건수'])
plt.show()

Adjusting Transparency

In [177]:

plt.scatter(ns_book7['도서권수'], ns_book7['대출건수'], alpha=0.3)
plt.show()

In [179]:

average_borrows = ns_book7['대출건수']/ns_book7['도서권수']
plt.scatter(average_borrows, ns_book7['대출건수'], alpha = 0.1)
plt.show()

Drawing Histograms

In [182]:

plt.hist([0,3,5,6,7,7,9,13], bins=5)
plt.show()

In [184]:

import numpy as np
np.histogram_bin_edges([0,3,5,6,7,7,9,13], bins=5)

Out[184]:

array([ 0. ,  2.6,  5.2,  7.8, 10.4, 13. ])
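
The edges are simply the data range split into five equal-width bins of (13 - 0) / 5 = 2.6. As a sketch, `np.histogram` can be used to count how many values land in each bin; these should match the bar heights drawn by `plt.hist` for the same input.

```python
import numpy as np

data = [0, 3, 5, 6, 7, 7, 9, 13]

# counts[i] is the number of values in [edges[i], edges[i+1]);
# only the last bin also includes its right edge.
counts, edges = np.histogram(data, bins=5)

bin_width = edges[1] - edges[0]
print(counts, edges, bin_width)
```

The counts add up to the eight input values, and the maximum, 13, is kept because the final bin is closed on the right.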

In [186]:

np.random.seed(42)  # the same seed always produces the same samples
random_samples = np.random.randn(1000)

In [188]:

print(np.mean(random_samples), np.std(random_samples))

0.019332055822325486 0.9787262077473543
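
As an alternative sketch, NumPy's newer `Generator` interface gives the same kind of reproducibility without touching global state. Note that it uses a different underlying generator, so its samples will not match the values printed above even with the same seed.

```python
import numpy as np

# Two independent generators created from the same seed.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

samples_a = rng_a.standard_normal(1000)
samples_b = rng_b.standard_normal(1000)

# Same seed, same sequence of draws.
same = np.array_equal(samples_a, samples_b)
print(same, samples_a.mean(), samples_a.std())
```

Passing an `rng` object around explicitly keeps one part of an analysis from changing the random numbers seen by another.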

In [190]:

plt.hist(random_samples)
plt.show()

In [192]:

plt.hist(ns_book7['대출건수'])
plt.show()

In [194]:

plt.hist(ns_book7['대출건수'])
plt.yscale('log')
plt.show()

In [198]:

plt.hist(ns_book7['대출건수'], bins=100)
plt.yscale('log')
plt.show()

In [200]:

title_len = ns_book7['도서명'].apply(len)
plt.hist(title_len, bins=100)
plt.show()

In [202]:

plt.hist(title_len, bins=100)
plt.xscale('log')
plt.show()

Drawing Box-and-Whisker Plots

In [205]:

plt.boxplot(ns_book7[['대출건수', '도서권수']])
plt.show()

In [207]:

plt.boxplot(ns_book7[['대출건수', '도서권수']])
plt.yscale('log')
plt.show()

In [209]:

plt.boxplot(ns_book7[['대출건수', '도서권수']], vert=False)
plt.xscale('log')
plt.show()

In [211]:

plt.boxplot(ns_book7[['대출건수', '도서권수']], whis=10)
plt.yscale('log')
plt.show()

In [213]:

plt.boxplot(ns_book7[['대출건수', '도서권수']], whis=(0,100))
plt.yscale('log')
plt.show()
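
The `whis` argument controls how far the whiskers may reach. With a single number, the limit is that multiple of the interquartile range beyond the box (1.5 by default); with a pair such as `(0, 100)`, the whiskers are placed at those percentiles, here the minimum and maximum. The sketch below works through the default rule by hand on a made-up sample.

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(data, [25, 75])   # box edges: 3.25 and 7.75
iqr = q3 - q1
whis = 1.5

lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are drawn individually as outliers.
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(whisker_low, whisker_high, outliers)
```

Raising `whis` to 10 widens the fences so fewer points count as outliers, which is why the plots above show progressively fewer individual markers.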

Exercise 5

In [246]:

selected_rows = (ns_book7['발행년도'] >= 1980) & (ns_book7['발행년도'] <= 2022)
plt.hist(ns_book7[selected_rows]['발행년도'])
plt.show()

Exercise 6

In [249]:

plt.boxplot(ns_book7[selected_rows]['발행년도'])
plt.show()


bbbakery's blog

[혼공파] Week 4: 혼공분석
