c
cgyyy
V1
2022/09/28
Test
HW01
0. Initialization
global path "D:\Stata17_working\HW01" // project root directory
global D0 "$path\data0"
global D1 "$path\data1"
global Out "$path\out" // results: figures and tables
cd "$D1"
sysuse nlsw88.dta, clear
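The later steps rely on several user-written commands (sum2docx, shellout, betterbar) and on the white_tableau scheme, none of which ship with Stata. A minimal sketch of a pre-flight check, using only the built-in which and findfile commands; the local name missing_cmds is an illustrative choice and not part of the original homework:

```stata
* Sketch: collect user-written commands that are not installed yet
local missing_cmds
foreach cmd in sum2docx shellout betterbar {
    capture which `cmd'
    if _rc local missing_cmds `missing_cmds' `cmd'
}
* Schemes are stored as scheme-<name>.scheme files on the adopath
capture findfile scheme-white_tableau.scheme
if _rc display as text "scheme white_tableau not found"
if "`missing_cmds'" != "" display as text "not installed: `missing_cmds'"
```

If anything is reported, install it before running the remaining sections.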
1. Descriptive statistics for nlsw88.dta using sum2docx
sum2docx age grade wage hours ttl_exp tenure using $Out\Table01.docx, ///
replace stats(N mean(%9.2f) sd min(%9.0g) median(%9.0g) max(%9.0g))
shellout $Out\Table01.docx , replace
The result is shown in the figure:

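For a quick on-screen check of the same statistics before opening the Word file, the built-in tabstat command can be used. This is only a sketch for cross-checking, not a replacement for the exported table:

```stata
sysuse nlsw88, clear
* Same variables and statistics as the sum2docx call, shown in the Results window
tabstat age grade wage hours ttl_exp tenure, ///
    statistics(n mean sd min median max) columns(statistics) format(%9.2f)
```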
2. Generating variables
gen age2 = age^2
gen ln_wage = ln(wage)
gen wage_hour = wage/hours
egen meanwage = mean(wage)
gen dum =(wage>meanwage)
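The above-average dummy can also be built without storing the mean in a constant column. A minimal sketch, assuming the name dum2 (hypothetical, chosen so it does not clash with dum): summarize leaves the mean in r(mean), and the if qualifier keeps missing wages from being coded as 1, since Stata treats missing values as larger than any number.

```stata
sysuse nlsw88, clear
* Mean wage taken from r(mean) instead of an egen-generated column
summarize wage, meanonly
generate byte dum2 = (wage > r(mean)) if !missing(wage)
tabulate dum2
```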
3. Histogram and kernel density plot of ttl_exp
histogram ttl_exp
graph export $Out\His_ttl.png, replace // replace allows the script to be rerun
kdensity ttl_exp
graph export $Out\Kendi_ttl.png, replace
The results are shown in the figures:


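The two plots can also be combined in one figure. A short sketch using the kdensity option of histogram, assuming ttl_exp is the variable of interest as stated in the heading; the graph name ttl_exp_hist is an illustrative choice:

```stata
sysuse nlsw88, clear
* Histogram of total work experience with a kernel density curve overlaid
histogram ttl_exp, kdensity name(ttl_exp_hist, replace)
```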
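4. Statistics by industry
4.1 Number of observations in each industry
Approach: running sum shows that industry has missing values while age does not, so tabulate industry, sum(age) can be used; the rightmost column (Freq.) then gives the number of observations in each industry.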
tabulate industry, sum(age)
// Output
            |   Summary of Age in current year
   Industry |        Mean   Std. dev.       Freq.
------------+------------------------------------
  Ag/Forest |   39.941176   3.1517969          17
     Mining |       37.25   3.8622101           4
  Construct |    38.62069   2.8587434          29
  Manufactu |   38.989101     3.10383         367
  Transport |   39.277778   3.3688186          90
  Wholesale |   39.288288   3.0999049         333
  Finance/I |   38.828125   3.1384527         192
  Business/ |   38.732558   2.8837248          86
   Personal |   39.237113   3.0508646          97
  Entertain |   40.117647   3.2187411          17
  Professio |   39.239078   3.0356132         824
  Public ad |   39.159091   2.8400658         176
------------+------------------------------------
      Total |   39.146057   3.0614786       2,232
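The counts do not depend on age; a one-way table gives them directly. A minimal sketch with built-in commands, where the missing option adds a row for observations whose industry is missing:

```stata
sysuse nlsw88, clear
* One-way frequency table; the missing option shows the missing-industry row
tabulate industry, missing
* Number of observations with a nonmissing industry
count if !missing(industry)
```

The final count should agree with the Total row of the table above.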
4.2 Average wage and other statistics for women in each industry
Approach: use tabstat to summarize by group.
tabstat wage hours age, by(industry) stat(mean)
// Output
        industry |      wage     hours       age
-----------------+------------------------------
Ag/Forestry/Fish |  5.621121  34.47059  39.94118
          Mining |  15.34959        40     37.25
    Construction |  7.564934  35.65517  38.62069
   Manufacturing |  7.501578  40.89373   38.9891
Transport/Comm/U |  11.44335  39.85556  39.27778
Wholesale/Retail |  6.125897  35.24699  39.28829
Finance/Ins/Real |  9.843174  38.51563  38.82813
 Business/Repair |   7.51579  33.15116  38.73256
Personal service |  4.401093  32.09375  39.23711
Entertainment/Re |  6.724409  34.35294  40.11765
Professional ser |  7.871186  36.71655  39.23908
Public administr |  9.148407  38.54545  39.15909
-----------------+------------------------------
           Total |  7.783463  37.23205  39.14606
------------------------------------------------
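4.3 Tabulate the shares of white, black, and other-race women in each industry
Approach: for each industry, first generate the race shares indwhite/indblack/indother, then run tabstat (this seems a bit cumbersome).
bysort industry : egen indnum = count(industry) // number of observations in each industry
bysort industry : egen indwhite = count(industry) if race==1 // number of white women in the industry
replace indwhite = indwhite/indnum // share of white women
bysort industry : egen indblack = count(industry) if race==2 // number of black women in the industry
replace indblack = indblack/indnum // share of black women
bysort industry : egen indother = count(industry) if race==3 // number of other-race women in the industry
replace indother = indother/indnum // share of other-race women
tabstat indwhite indblack indother, by(industry) stat(mean) f(%9.4f)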
Output (a . means the industry has no observations of that race):
        industry |  indwhite  indblack  indother
-----------------+------------------------------
Ag/Forestry/Fish |    0.7647    0.2353         .
          Mining |    1.0000         .         .
    Construction |    0.8276    0.1379    0.0345
   Manufacturing |    0.6240    0.3651    0.0109
Transport/Comm/U |    0.6889    0.3000    0.0111
Wholesale/Retail |    0.8018    0.1982         .
Finance/Ins/Real |    0.8594    0.1302    0.0104
 Business/Repair |    0.7442    0.2326    0.0233
Personal service |    0.5258    0.4639    0.0103
Entertainment/Re |    0.8235    0.1765         .
Professional ser |    0.7476    0.2391    0.0133
Public administr |    0.6705    0.3068    0.0227
-----------------+------------------------------
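A less cumbersome route is possible with built-in commands. The sketch below is an alternative, not the approach used above: a two-way tabulate with row percentages, followed by an indicator-based version in which the mean of a 0/1 variable is the share, so empty cells show 0 rather than a missing value. The variable names white, black, and other are illustrative.

```stata
sysuse nlsw88, clear
* Row percentages: each industry's row sums to 100
tabulate industry race, row nofreq
* Indicator version: the mean of a 0/1 indicator is a share
generate byte white = (race == 1) if !missing(industry)
generate byte black = (race == 2) if !missing(industry)
generate byte other = (race == 3) if !missing(industry)
tabstat white black other, by(industry) statistics(mean) format(%9.4f)
```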
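5. Use the label define and label values commands to define labels for the values of the race variable
label define race 1 "White" 2 "Black" 3 "Other"
label values race race
The result is shown in the figure: race now carries value labels.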

6. Converting a continuous variable into a categorical variable
gen G_age=(age<=37)
replace G_age=2 if age>37 & age<=42
replace G_age=3 if age>42
label define G_age 1 "37 or younger" 2 "38 to 42" 3 "43 or older"
label values G_age G_age
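The same grouping can be written in one step with recode. A minimal sketch, assuming age is stored as whole years (as it is in nlsw88); G_age2 is a hypothetical name used so the result does not overwrite G_age:

```stata
sysuse nlsw88, clear
* recode maps the integer age ranges to groups and attaches value labels
recode age (min/37 = 1 "37 or younger") (38/42 = 2 "38 to 42") ///
           (43/max = 3 "43 or older"), generate(G_age2)
tabulate G_age2
```

Unlike the gen/replace sequence, recode leaves missing ages missing instead of assigning them to the top group.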
7. Wage distribution
7.1 Using kdensity
set scheme white_tableau
twoway (kdensity wage if race == 1, color(gs10)) ///
(kdensity wage if race == 2, color(emerald)), ///
legend(order(1 "White" 2 "Black")) ///
xtitle(wage) ytitle(density)
graph export $Out\kdensity.png, replace
The result is as follows:

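Analysis
① In terms of the average wage, white women earn more on average.
② In terms of the distribution, black women's wages are more concentrated, with the density peaking at around $5 per hour.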
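Both readings of the density plot can be checked numerically. A short sketch with tabstat: the mean addresses point ①, while the standard deviation and interquartile range give a rough measure of how concentrated each distribution is for point ②.

```stata
sysuse nlsw88, clear
* Location and spread of hourly wage for white (1) and black (2) women
tabstat wage if inlist(race, 1, 2), by(race) ///
    statistics(n mean median sd iqr) format(%9.2f)
```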
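7.2 Use a bar chart to show how the wages (wage) of white (race==1) and black (race==2) women are distributed across industries (industry).
Approach: first generate a categorical variable for wage bands, then compute the share of observations in each wage band within each industry, and finally use betterbar to draw a bar chart grouped by the two classifications.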
gen wage_1 = (wage<5) // categorical variable for wage bands
replace wage_1 = 2 if wage>=5 & wage<10
replace wage_1 = 3 if wage>=10 & wage<20
replace wage_1 = 4 if wage>=20
label define wage_1 1 "<5$/h" 2 "5-10$/h" 3 "10-20$/h" 4 ">=20$/h"
label values wage_1 wage_1
capture drop indnum // indnum may already exist from section 4.3
bysort industry : egen indnum = count(industry) // number of observations in each industry
bysort industry wage_1: egen wagepct = total(1/indnum) // share of each wage band within the industry
global pct `" 0 "0%" .25 "25%" .5 "50%" .75 "75%" 1 "100%" "'
betterbar wagepct, over(wage_1) by(industry) xlabel(${pct}) bar pct
graph export $Out\betterbar.png, replace
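The chart above does not split the bars by race. One possible way to bring race into the picture, sketched here with the built-in graph hbar rather than betterbar, is to plot the mean wage of white and black women side by side within each industry. This is an illustrative alternative and compares means rather than wage-band shares; the graph name wage_ind_race is an assumption.

```stata
sysuse nlsw88, clear
* Mean hourly wage by industry, with separate bars for white (1) and black (2) women
graph hbar (mean) wage if inlist(race, 1, 2), ///
    over(race) over(industry) ///
    ytitle("Mean hourly wage") name(wage_ind_race, replace)
```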
