HW01

0. Initialization

 global path "D:\Stata17_working\HW01"  // define the working directory
 global D0    "$path\data0"
 global D1    "$path\data1"
 global Out  "$path\out"       // output: figures and tables
 cd "$D1"
 sysuse nlsw88.dta, clear

1. Descriptive statistics for nlsw88.dta using sum2docx

sum2docx age grade wage hours ttl_exp tenure using $Out\Table01.docx, ///
replace stats(N mean(%9.2f) sd min(%9.0g) median(%9.0g) max(%9.0g))
shellout $Out\Table01.docx , replace

The result is shown in the figure:

2. Generating variables

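gen age2 = age^2                // age squared
gen ln_wage = ln(wage)          // log of wage
gen wage_hour = wage/hours      // wage divided by hours (missing where hours is missing)
egen meanwage = mean(wage)      // sample mean of wage
gen dum = (wage > meanwage)     // 1 if wage is above the sample mean, 0 otherwise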

3. Histogram and kernel density plot of ttl_exp

histogram ttl_exp
graph export $Out\His_ttl.png
kdensity ttl_exp
graph export $Out\Kendi_ttl.png

The result is shown in the figure:

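4. Statistics by industry

4.1 Number of observations in each industry

Approach: running sum shows that industry has missing values while age has none. So we can run tabulate industry, sum(age); the rightmost column (Freq.) then gives the number of observations in each industry.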

tabulate industry, sum(age)
// Output
            |   Summary of Age in current year
   Industry |        Mean   Std. dev.       Freq.
------------+------------------------------------
  Ag/Forest |   39.941176   3.1517969          17
     Mining |       37.25   3.8622101           4
  Construct |    38.62069   2.8587434          29
  Manufactu |   38.989101     3.10383         367
  Transport |   39.277778   3.3688186          90
  Wholesale |   39.288288   3.0999049         333
  Finance/I |   38.828125   3.1384527         192
  Business/ |   38.732558   2.8837248          86
  Personal  |   39.237113   3.0508646          97
  Entertain |   40.117647   3.2187411          17
  Professio |   39.239078   3.0356132         824
  Public ad |   39.159091   2.8400658         176
------------+------------------------------------
      Total |   39.146057   3.0614786       2,232
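
The missing-value check described in the approach above can be sketched with built-in commands. This is a minimal sketch, assuming nlsw88.dta is still loaded; it is only one of several ways to confirm which variables have gaps.

```stata
* Report missing-value counts; only variables that have missing values are listed
misstable summarize industry age

* Frequency table of industry that also shows the missing category
tabulate industry, missing
```

With the missing option, tabulate counts every observation, so its total should exceed the 2,232 shown above by the number of observations with no industry recorded.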

4.2 Mean wage, hours, and age of women in each industry

Approach: use tabstat to compute statistics by group

tabstat wage hours age, by(industry) stat(mean)
// Output
        industry |      wage     hours       age
-----------------+------------------------------
Ag/Forestry/Fish |  5.621121  34.47059  39.94118
          Mining |  15.34959        40     37.25
    Construction |  7.564934  35.65517  38.62069
   Manufacturing |  7.501578  40.89373   38.9891
Transport/Comm/U |  11.44335  39.85556  39.27778
Wholesale/Retail |  6.125897  35.24699  39.28829
Finance/Ins/Real |  9.843174  38.51563  38.82813
Business/Repair  |   7.51579  33.15116  38.73256
Personal service |  4.401093  32.09375  39.23711
Entertainment/Re |  6.724409  34.35294  40.11765
Professional ser |  7.871186  36.71655  39.23908
Public administr |  9.148407  38.54545  39.15909
-----------------+------------------------------
           Total |  7.783463  37.23205  39.14606
------------------------------------------------

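4.3 Tabulate the shares of White, Black, and other-race women in each industry

Approach: for each industry, first generate the share of each race (indwhite/indblack/indother), then use tabstat (this seems a bit cumbersome)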

bysort industry : egen indnum = count(industry)  // number of observations in each industry

bysort industry : egen indwhite = count(industry) if race==1 // number of White women in the industry
replace indwhite = indwhite/indnum // share of White women

bysort industry : egen indblack = count(industry) if race==2 // number of Black women in the industry
replace indblack = indblack/indnum // share of Black women

bysort industry : egen indother = count(industry) if race==3 // number of other-race women in the industry
replace indother = indother/indnum // share of other-race women

tabstat indwhite indblack indother, by(industry) stat(mean) f(%9.4f)

Output:

        industry |  indwhite  indblack  indother
-----------------+------------------------------
Ag/Forestry/Fish |    0.7647    0.2353         .
          Mining |    1.0000         .         .
    Construction |    0.8276    0.1379    0.0345
   Manufacturing |    0.6240    0.3651    0.0109
Transport/Comm/U |    0.6889    0.3000    0.0111
Wholesale/Retail |    0.8018    0.1982         .
Finance/Ins/Real |    0.8594    0.1302    0.0104
Business/Repair  |    0.7442    0.2326    0.0233
Personal service |    0.5258    0.4639    0.0103
Entertainment/Re |    0.8235    0.1765         .
Professional ser |    0.7476    0.2391    0.0133
Public administr |    0.6705    0.3068    0.0227
-----------------+------------------------------
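
Since the approach above is described as cumbersome, a shorter route is sketched here for comparison. It assumes the same nlsw88.dta data and uses only the built-in two-way tabulate; it is an alternative, not the method used to produce the table above.

```stata
* Two-way table of industry by race:
*   row    -> report each cell as a percentage of its row (industry) total
*   nofreq -> suppress the raw counts so only the percentages are shown
tabulate industry race, row nofreq
```

This reports percentages rather than proportions, and an industry with no women of a given race shows 0.00 instead of a missing value.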

5. Use the label define and label values commands to label the values of the race variable

label define race 1 "White" 2 "Black" 3 "Other"
label values race race

The result is shown in the figure: race now carries value labels

6. Converting a continuous variable to a categorical variable

gen G_age=(age<=37)                    // 1 if age is 37 or younger, 0 otherwise
replace G_age=2 if age>37 & age<=42
replace G_age=3 if age>42
label define G_age 1 "37 or younger" 2 "38 to 42" 3 "43 or older"
label values G_age G_age
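
The same grouping can also be written in one step with recode. This is a minimal sketch, assuming age holds whole years and G_age has already been created by the commands above; the variable name G_age_alt is hypothetical and used only for the comparison.

```stata
* Build the three age groups and their value labels in a single command
* (G_age_alt is a hypothetical name chosen for this illustration)
recode age (min/37 = 1 "37 or younger") ///
           (38/42  = 2 "38 to 42")      ///
           (43/max = 3 "43 or older"),  ///
           generate(G_age_alt)

* Cross-check against the step-by-step version: all observations should lie on the diagonal
tabulate G_age G_age_alt
```

recode defines the value label together with the new variable, so the separate label define and label values steps are not needed.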

7. Wage distribution

7.1 Using kdensity

set scheme white_tableau
twoway (kdensity wage if race == 1, color(gs10))     ///
       (kdensity wage if race == 2, color(emerald)), ///
       legend(order(1 "White" 2 "Black"))            ///
       xtitle(wage) ytitle(density)
graph export $Out\kdensity.png

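The result is as follows:

Analysis

① In terms of average wage, White women earn more on average.

② In terms of distribution, Black women's wages are more concentrated, with the highest density at around 5 dollars per hour.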

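7.2 Use a bar chart to show how the wages (wage) of White (race==1) and Black (race==2) women are distributed across industries (industry).

Approach: first generate a categorical variable for wage bands, then compute the share of observations in each wage band within each industry, and finally use betterbar to draw a bar chart grouped by both classifications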

gen wage_1 = (wage<5) // categorical variable for wage bands
replace wage_1 = 2 if wage>=5 & wage<10
replace wage_1 = 3 if wage>=10 & wage<20
replace wage_1 = 4 if wage>=20
label define wage_1 1 "<5$/h" 2 "5-10$/h" 3 "10-20$/h" 4 ">=20$/h"
label values wage_1 wage_1

capture drop indnum // indnum may already exist from section 4.3
bysort industry : egen indnum = count(industry)  // number of observations in each industry
bysort industry wage_1: egen wagepct = total(1/indnum) // share of each wage band within each industry

global  pct `" 0 "0%" .25 "25%" .5 "50%" .75 "75%" 1 "100%" "'

betterbar wagepct, over(wage_1) by(industry) xlabel(${pct}) bar pct
graph export $Out\betterbar.png
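
The betterbar call above groups by wage band and industry but does not separate the two races named in the task. One possible way to bring race in, using only the built-in graph hbar, is sketched below. It is an assumption-laden alternative: it compares mean wage rather than the wage-band shares, and it is not the chart produced above.

```stata
* Mean wage of White (race==1) and Black (race==2) women within each industry
*   over(race) over(industry) -> race bars nested inside each industry
*   asyvars                   -> give each race its own color and a legend entry
graph hbar (mean) wage if inlist(race, 1, 2), ///
    over(race) over(industry) asyvars         ///
    ytitle("Mean wage")
```

Horizontal bars are used so the industry names stay readable; the graph is not exported here, so no existing file in $Out is overwritten.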