使用Stata获取与处理数据

Hua Peng@StataCorp

2020 Stata 中国用户大会

https://huapeng01016.github.io/china2020/

数据的获取

import delimited

. local date = "08-10-2020"

. import delimited "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_
> covid_19_daily_reports/`date'.csv", clear
(14 vars, 3,940 obs)

. describe

Contains data
  obs:         3,940                          
 vars:            14                          
-------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------
fips            long    %12.0g                FIPS
admin2          str41   %41s                  Admin2
province_state  str40   %40s                  Province_State
country_region  str32   %32s                  Country_Region
last_update     str19   %19s                  Last_Update
lat             float   %9.0g                 Lat
long_           float   %9.0g                 Long_
confirmed       long    %12.0g                Confirmed
deaths          long    %12.0g                Deaths
recovered       long    %12.0g                Recovered
active          long    %12.0g                Active
combined_key    str55   %55s                  Combined_Key
incidence_rate  float   %9.0g                 Incidence_Rate
casefatality_~o float   %9.0g                 Case-Fatality_Ratio
-------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

import excel

. import excel "./data/广东数据/广东省新冠肺炎疫情基本情况统计表_1595314944557.xlsx", clear
(8 vars, 27 obs)

. describe

Contains data
  obs:            27                          
 vars:             8                          
-------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------
A               str78   %78s                  
B               str9    %9s                   
C               str12   %12s                  
D               str12   %12s                  
E               str6    %9s                   
F               str6    %9s                   
G               str12   %12s                  
H               str13   %13s                  
-------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

import spss

. import spss using "./data/manipulate.sav", clear
(7 vars, 100 obs)

. list in 1/5

     +-----------------------------------------------------------------------------------------------------+
     | ID   Sex   FundCode            ArrivalDate              EmergDate          DischargeDate   Length~y |
     |-----------------------------------------------------------------------------------------------------|
  1. |  1     M        HOS   10-Mar-2005 03:04:00   10-Mar-2005 05:09:00   16-Mar-2005 13:00:00          6 |
  2. |  2     M        HOS   27-Aug-2004 12:32:00   27-Aug-2004 13:10:00   27-Aug-2004 14:00:00          0 |
  3. |  3     M        HOS   19-Feb-2005 19:18:00   20-Feb-2005 03:39:00   25-Feb-2005 01:00:00          5 |
  4. |  4     M        TAC   24-Sep-2007 09:35:00   24-Sep-2007 11:38:00   24-Sep-2007 17:05:00          0 |
  5. |  5     M        TAC   19-Jan-2009 08:39:00   19-Jan-2009 22:06:00   29-Jan-2009 13:01:00         10 |
     +-----------------------------------------------------------------------------------------------------+

import sas

. import sas using "./data/psam_p30.sas7bdat", clear
(283 vars, 10,113 obs)

. describe

Contains data
  obs:        10,113                          PSAM_P30
 vars:           283                          
-------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------
RT              str1    %1s                   Record type
SERIALNO        str9    %9s                   Housing unit/GQ person serial number
SPORDER         byte    %010.2g               Person key after swapping
PUMA            str5    %5s                   Puma Code
ST              str2    %2s                   State of current residence
ADJINC          str7    %7s                   Adjustment factor for income and earnings dollar amounts
PWGTP           int     %010.2g               PUMS person weight
AGEP            byte    %010.2g               PUMS Age
CIT             str1    %1s                   Citizenship
CITWP           int     %010.2g               PUMS Year of naturalization write-in
COW             str1    %1s                   Class of worker
DDRS            str1    %1s                   Difficulty dressing
DEAR            str1    %1s                   Hearing difficulty
DEYE            str1    %1s                   Vision difficulty
DOUT            str1    %1s                   Difficulty going out
DPHY            str1    %1s                   Physical difficulty
DRAT            str1    %1s                   Disability rating
DRATX           str1    %1s                   Disability rating checkbox
DREM            str1    %1s                   Difficulty remembering
ENG             str1    %1s                   English ability
FER             str1    %1s                   Fertility
GCL             str1    %1s                   Grandchildren living in house
GCM             str1    %1s                   Months responsible for grandchildren
GCR             str1    %1s                   Responsible for grandchildren
HINS1           str1    %1s                   Health insurance through employer/union
HINS2           str1    %1s                   Health insurance purchased directly
HINS3           str1    %1s                   Health insurance through Medicare
HINS4           str1    %1s                   Health Insurance through Medicaid/means-tested public coverage
HINS5           str1    %1s                   Health Insurance through TRICARE/military health coverage
HINS6           str1    %1s                   Health Insurance through VA Health Care
HINS7           str1    %1s                   Health insurance through Indian Health Service
INTP            long    %010.2g               PUMS Interest, net rental, etc. income
JWMNP           int     %010.2g               PUMS Minutes to work
JWRIP           byte    %010.2g               PUMS Total riders
JWTR            str2    %2s                   Transportation to work
LANX            str1    %1s                   Speaks another language at home
MAR             str1    %1s                   Marital status
MARHD           str1    %1s                   Divorced in the past 12 months
MARHM           str1    %1s                   Married in the past 12 months
MARHT           str1    %1s                   Times married
MARHW           str1    %1s                   Widowed in the past 12 months
MARHYP          int     %010.2g               PUMS Year last married
MIG             str1    %1s                   Mobility status
MIL             str1    %1s                   Served in Armed Forces
MLPA            str1    %1s                   Active duty -- SEP2001 or later
MLPB            str1    %1s                   Active duty -- AUG1990 to AUG2001
MLPCD           str1    %1s                   Active Duty -- MAY 1975 to JUL 1990
MLPE            str1    %1s                   Active duty -- Vietnam era
MLPFG           str1    %1s                   Active Duty -- FEB 1955 to JUL 1964
MLPH            str1    %1s                   Active duty -- Korean War
MLPI            str1    %1s                   Active duty -- JAN1947 to JUN1950
MLPJ            str1    %1s                   Active duty -- World War II
MLPK            str1    %1s                   Active duty -- NOV1941 or earlier
NWAB            str1    %1s                   Temporarily absent from work
NWAV            str1    %1s                   Available for Work
NWLA            str1    %1s                   On layoff
NWLK            str1    %1s                   Looking for work
NWRE            str1    %1s                   Informed of recall
OIP             long    %010.2g               PUMS Other income amount
PAP             int     %010.2g               PUMS Cash Public Assistance Income
RELP            str2    %2s                   PUMS Relationship to Reference Person
RETP            long    %010.2g               PUMS Retirement income
SCH             str1    %1s                   School enrollment
SCHG            str2    %2s                   Grade level attending
SCHL            str2    %2s                   Educational attainment
SEMP            long    %010.2g               PUMS Self-employment income
SEX             str1    %1s                   Sex
SSIP            int     %010.2g               PUMS Supplemental security income
SSP             long    %010.2g               PUMS Social Security or Railroad Retirement Income
WAGP            long    %010.2g               PUMS Wages/salary income
WKHP            byte    %010.2g               PUMS Hours worked per week
WKL             str1    %1s                   When last worked
WKW             str1    %1s                   Weeks worked past 12 months
WRK             str1    %1s                   Worked last week
YOEP            int     %010.2g               PUMS Year of entry
ANC             str1    %1s                   Ancestry categorization
ANC1P           str3    %3s                   First ancestry 5% code
ANC2P           str3    %3s                   Second ancestry 5% code
DECADE          str1    %1s                   Decade of entry
DIS             str1    %1s                   Disability Recode
DRIVESP         str1    %1s                   Number of vehicles calculated from JWRI
ESP             str1    %1s                   Employment status of parents
ESR             str1    %1s                   Employment Status Recode
FOD1P           str4    %4s                   First Field of Degree 5% code
FOD2P           str4    %4s                   Second Field of Degree 5% code
HICOV           str1    %1s                   Any health insurance coverage
HISP            str2    %2s                   Hispanic recode
INDP            str4    %4s                   Industry recode 5%
JWAP            str3    %3s                   Time of arrival at work categorization
JWDP            str3    %3s                   Time of departure for work categorization
LANP            str3    %3s                   Other language recode 5%
MIGPUMA         str5    %5s                   Migration PUMA
MIGSP           str3    %3s                   Migration state/foreign country recode 5%
MSP             str1    %1s                   Married -- spouse present/ spouse absent
NAICSP          str8    %8s                   NAICS industry code 5%
NATIVITY        str1    %1s                   Nativity
NOP             str1    %1s                   Nativity of Parent
OC              str1    %1s                   Own child
OCCP            str4    %4s                   Occupation recode 5%
PAOC            str1    %1s                   Presence and age of own children
PERNP           long    %010.2g               PUMS Persons earnings (signed)
PINCP           long    %010.2g               PUMS Persons income (signed)
POBP            str3    %3s                   Place of birth 5% code
POVPIP          int     %010.2g               PUMS Poverty index
POWPUMA         str5    %5s                   Place of work PUMA
POWSP           str3    %3s                   Place of work state/fgn countyy recode 5%
PRIVCOV         str1    %1s                   Private health insurance
PUBCOV          str1    %1s                   Public Coverage
QTRBIR          str1    %1s                   Quarter of birth
RAC1P           str1    %1s                   Race1 recode
RAC2P           str2    %2s                   Race2 recode
RAC3P           str3    %3s                   Race3 recode
RACAIAN         str1    %1s                   Race includes AIAN
RACASN          str1    %1s                   Race includes Asian
RACBLK          str1    %1s                   Race includes Black
RACNH           str1    %1s                   Race includes NH
RACNUM          str1    %1s                   Number of race groups represented
RACPI           str1    %1s                   Race includes PI
RACSOR          str1    %1s                   Race includes Other race
RACWHT          str1    %1s                   Race includes White
RC              str1    %1s                   Related child
SCIENGP         str1    %1s                   PUMS Field of Degree Science and Engineering Flag
SCIENGRLP       str1    %1s                   PUMS Field of Degree Science and Engineering Related Flag
SFN             str1    %1s                   Subfamily Number
SFR             str1    %1s                   Subfamily Relationship
SOCP            str6    %6s                   SOC occupation code 5%
VPS             str2    %2s                   Veterans period of service
WAOB            str1    %1s                   World area of birth
FAGEP           str1    %1s                   PUMS Age allocation flag
FANCP           str1    %1s                   PUMS Ancestry Code allocation flag
FCITP           str1    %1s                   PUMS Citizenship allocation flag
FCITWP          str1    %1s                   PUMS Year of naturalization write-in allocation flag
FCOWP           str1    %1s                   PUMS Class of Worker allocation flag
FDDRSP          str1    %1s                   PUMS Difficulty dressing allocation flag
FDEARP          str1    %1s                   PUMS Hearing difficulty allocation
FDEYEP          str1    %1s                   PUMS Vision difficulty allocation
FDISP           str1    %1s                   PUMS Disability Recode allocation flag
FDOUTP          str1    %1s                   PUMS Going Out Difficulty allocation flag
FDPHYP          str1    %1s                   PUMS Physical Activity Difficulty allocation
FDRATP          str1    %1s                   PUMS Disability rating allocation flag
FDRATXP         str1    %1s                   PUMS Disability rating checkbox allocat flag
FDREMP          str1    %1s                   PUMS Remembering Difficulty allocation flag
FENGP           str1    %1s                   PUMS English Ability allocation flag
FESRP           str1    %1s                   PUMS Employment Status Recode allocation flag
FFERP           str1    %1s                   PUMS Fertility allocation flag
FFODP           str1    %1s                   PUMS Field of degree allocation flag
FGCLP           str1    %1s                   PUMS Grandchildren Living at House allocation flag
FGCMP           str1    %1s                   PUMS Months Responsible for Grandchildren allocation flag
FGCRP           str1    %1s                   PUMS Responsibility for Grandchildren allocation flag
FHINS1P         str1    %1s                   PUMS Health insurance allocation flag #1
FHINS2P         str1    %1s                   PUMS Health insurance allocation flag #2
FHINS3C         str1    %1s                   Medicare coverage given through the eligibility coverage edit
FHINS3P         str1    %1s                   PUMS Health insurance allocation flag #3
FHINS4C         str1    %1s                   Medicaid coverage given through the eligibility coverage edit
FHINS4P         str1    %1s                   PUMS Health insurance allocation flag #4
FHINS5C         str1    %1s                   TRICARE coverage given through the eligibility coverage edit
FHINS5P         str1    %1s                   PUMS Health insurance allocation flag #5
FHINS6P         str1    %1s                   PUMS Health insurance allocation flag #6
FHINS7P         str1    %1s                   PUMS Health insurance allocation flag #7
FHISP           str1    %1s                   PUMS Hispanic Origin allocation flag
FINDP           str1    %1s                   PUMS Industry allocation flag
FINTP           str1    %1s                   PUMS Interest allocation flag
FJWDP           str1    %1s                   PUMS Time of Departure for Work allocation flag
FJWMNP          str1    %1s                   PUMS Travel Time to Work allocation flag
FJWRIP          str1    %1s                   PUMS Vehicle Occupancy allocation flag
FJWTRP          str1    %1s                   PUMS Means of Transportation to Work allocation flag
FLANP           str1    %1s                   PUMS Language Spoken allocation flag
FLANXP          str1    %1s                   PUMS Non-English Language allocation flag
FMARHDP         str1    %1s                   PUMS Divorced in past 12 months allocation flag
FMARHMP         str1    %1s                   PUMS Married in past 12 months allocation flag
FMARHTP         str1    %1s                   PUMS Times married allocation flag
FMARHWP         str1    %1s                   PUMS Widowed in past 12 months allocation flag
FMARHYP         str1    %1s                   PUMS Year last married allocation flag
FMARP           str1    %1s                   PUMS Marital Status allocation flag
FMIGP           str1    %1s                   PUMS Mobility Status allocation flag
FMIGSP          str1    %1s                   PUMS Migration State allocation flag
FMILPP          str1    %1s                   PUMS Period of Service allocation flag
FMILSP          str1    %1s                   PUMS Veteran Status allocation flag
FOCCP           str1    %1s                   PUMS Occupation allocation flag
FOIP            str1    %1s                   PUMS Other Income allocation flag
FPAP            str1    %1s                   PUMS Public Assistance Income allocation flag
FPERNP          str1    %1s                   PUMS Total Persons Earnings allocation flag
FPINCP          str1    %1s                   PUMS Total Persons Income allocation flag
FPOBP           str1    %1s                   PUMS Place of Birth allocation flag
FPOWSP          str1    %1s                   PUMS Place of Work State allocation flag
FPRIVCOVP       str1    %1s                   PUMS Private Health Insurance Coverage allocation flag
FPUBCOVP        str1    %1s                   PUMS Public Health Insurance Coverage allocation flag
FRACP           str1    %1s                   PUMS Race allocation flag
FRELP           str1    %1s                   PUMS Relationship allocation flag
FRETP           str1    %1s                   PUMS Retirement Income allocation flag
FSCHGP          str1    %1s                   PUMS Grade Attending allocation flag
FSCHLP          str1    %1s                   PUMS Educational Attainment allocation flag
FSCHP           str1    %1s                   PUMS School Enrollment allocation flag
FSEMP           str1    %1s                   PUMS Self-employment Income allocation flag
FSEXP           str1    %1s                   PUMS Sex allocation flag
FSSIP           str1    %1s                   PUMS Suplemental Security Income allocation flag
FSSP            str1    %1s                   PUMS Social Security Income allocation flag
FWAGP           str1    %1s                   PUMS Wages/Salary Income allocation flag
FWKHP           str1    %1s                   PUMS Hours Worked Per Week allocation flag
FWKLP           str1    %1s                   PUMS When Last Worked allocation flag
FWKWP           str1    %1s                   PUMS Weeks Worked Past 12 Months allocation flag
FWRKP           str1    %1s                   PUMS Worked last week flag
FYOEP           str1    %1s                   PUMS Year of Entry allocation flag
pwgtp1          int     %010.2g               
pwgtp2          int     %010.2g               
pwgtp3          int     %010.2g               
pwgtp4          int     %010.2g               
pwgtp5          int     %010.2g               
pwgtp6          int     %010.2g               
pwgtp7          int     %010.2g               
pwgtp8          int     %010.2g               
pwgtp9          int     %010.2g               
pwgtp10         int     %010.2g               
pwgtp11         int     %010.2g               
pwgtp12         int     %010.2g               
pwgtp13         int     %010.2g               
pwgtp14         int     %010.2g               
pwgtp15         int     %010.2g               
pwgtp16         int     %010.2g               
pwgtp17         int     %010.2g               
pwgtp18         int     %010.2g               
pwgtp19         int     %010.2g               
pwgtp20         int     %010.2g               
pwgtp21         int     %010.2g               
pwgtp22         int     %010.2g               
pwgtp23         int     %010.2g               
pwgtp24         int     %010.2g               
pwgtp25         int     %010.2g               
pwgtp26         int     %010.2g               
pwgtp27         int     %010.2g               
pwgtp28         int     %010.2g               
pwgtp29         int     %010.2g               
pwgtp30         int     %010.2g               
pwgtp31         int     %010.2g               
pwgtp32         int     %010.2g               
pwgtp33         int     %010.2g               
pwgtp34         int     %010.2g               
pwgtp35         int     %010.2g               
pwgtp36         int     %010.2g               
pwgtp37         int     %010.2g               
pwgtp38         int     %010.2g               
pwgtp39         int     %010.2g               
pwgtp40         int     %010.2g               
pwgtp41         int     %010.2g               
pwgtp42         int     %010.2g               
pwgtp43         int     %010.2g               
pwgtp44         int     %010.2g               
pwgtp45         int     %010.2g               
pwgtp46         int     %010.2g               
pwgtp47         int     %010.2g               
pwgtp48         int     %010.2g               
pwgtp49         int     %010.2g               
pwgtp50         int     %010.2g               
pwgtp51         int     %010.2g               
pwgtp52         int     %010.2g               
pwgtp53         int     %010.2g               
pwgtp54         int     %010.2g               
pwgtp55         int     %010.2g               
pwgtp56         int     %010.2g               
pwgtp57         int     %010.2g               
pwgtp58         int     %010.2g               
pwgtp59         int     %010.2g               
pwgtp60         int     %010.2g               
pwgtp61         int     %010.2g               
pwgtp62         int     %010.2g               
pwgtp63         int     %010.2g               
pwgtp64         int     %010.2g               
pwgtp65         int     %010.2g               
pwgtp66         int     %010.2g               
pwgtp67         int     %010.2g               
pwgtp68         int     %010.2g               
pwgtp69         int     %010.2g               
pwgtp70         int     %010.2g               
pwgtp71         int     %010.2g               
pwgtp72         int     %010.2g               
pwgtp73         int     %010.2g               
pwgtp74         int     %010.2g               
pwgtp75         int     %010.2g               
pwgtp76         int     %010.2g               
pwgtp77         int     %010.2g               
pwgtp78         int     %010.2g               
pwgtp79         int     %010.2g               
pwgtp80         int     %010.2g               
-------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

get data from pandas using sfi

. python:
----------------------------------------------- python (type end to exit) -----------------------------------
>>> import pandas as pd
>>> data = pd.read_html("https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/index.html")
>>> df = data[4]
>>> df.head()
                                        Unnamed: 0  ... Non-Hispanic White
0      Proportion of hospitalized COVID-NET cases1  ...              31.4%
1  Proportion of population in COVID-NET catchment  ...              58.5%
2                               Prevalence ratios2  ...                0.5

[3 rows x 6 columns]
>>> t = df.values.tolist()
>>> end
-------------------------------------------------------------------------------------------------------------

生成Stata dataset

clear
quietly python:
from sfi import Data
Data.addObs(len(t))
stata: gen desc = ""
stata: gen indian = ""
stata: gen balck = ""
stata: gen hisp = ""
stata: gen asian = ""
stata: gen white = ""
Data.store(None, range(len(t)), t)
end

. list, clean string(22)

       desc                       indian   balck    hisp   asian   white  
  1.   Proportion of hospital..     1.3%   33.0%   23.1%    5.0%   31.4%  
  2.   Proportion of populati..     0.7%   17.9%   14.1%    8.9%   58.5%  
  3.   Prevalence ratios2            1.9     1.8     1.6     0.6     0.5  

Covid-19数据的获取与显示

Covid-19数据

. local date = "07-30-2020"

. import delimited "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_
> covid_19_daily_reports/`date'.csv", clear
(14 vars, 3,935 obs)

. save ./data/covid_`date'.dta, replace
file ./data/covid_07-30-2020.dta saved

. desc

Contains data from ./data/covid_07-30-2020.dta
  obs:         3,935                          
 vars:            14                          22 Aug 2020 08:49
-------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------
fips            long    %12.0g                FIPS
admin2          str41   %41s                  Admin2
province_state  str40   %40s                  Province_State
country_region  str32   %32s                  Country_Region
last_update     str19   %19s                  Last_Update
lat             float   %9.0g                 Lat
long_           float   %9.0g                 Long_
confirmed       long    %12.0g                Confirmed
deaths          long    %12.0g                Deaths
recovered       long    %12.0g                Recovered
active          long    %12.0g                Active
combined_key    str55   %55s                  Combined_Key
incidence_rate  float   %9.0g                 Incidence_Rate
casefatality_~o float   %9.0g                 Case-Fatality_Ratio
-------------------------------------------------------------------------------------------------------------
Sorted by: 

获得与生成US county shape data

cd data
copy https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip ///
    cb_2018_us_county_500k.zip
unzipfile cb_2018_us_county_500k.zip

spshape2dta ./cb_2018_us_county_500k/cb_2018_us_county_500k.shp,    /// 
    saving(usacounties) replace
use usacounties.dta, clear
generate fips = real(GEOID)
save usacounties.dta, replace
cd ..

US county数据

. desc

Contains data from ./data/usacounties.dta
  obs:         3,233                          
 vars:            13                          12 Aug 2020 10:13
-------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------
_ID             int     %12.0g                Spatial-unit ID
_CX             double  %10.0g                x-coordinate of area centroid
_CY             double  %10.0g                y-coordinate of area centroid
STATEFP         str2    %9s                   STATEFP
COUNTYFP        str3    %9s                   COUNTYFP
COUNTYNS        str8    %9s                   COUNTYNS
AFFGEOID        str14   %14s                  AFFGEOID
GEOID           str5    %9s                   GEOID
NAME            str21   %21s                  NAME
LSAD            str2    %9s                   LSAD
ALAND           double  %14.0f                ALAND
AWATER          double  %14.0f                AWATER
fips            float   %9.0g                 
-------------------------------------------------------------------------------------------------------------
Sorted by: _ID

. list in 1/5

     +-------------------------------------------------------------------------------------------------+
  1. | _ID |        _CX |       _CY | STATEFP | COUNTYFP | COUNTYNS |       AFFGEOID | GEOID |    NAME |
     |   1 | -88.999256 | 37.058482 |      21 |      007 | 00516850 | 0500000US21007 | 21007 | Ballard |
     |-------------------------------------------------------------------------------------------------|
     |        LSAD        |             ALAND         |          AWATER         |         fips         |
     |          06        |         639387454         |        69473325         |        21007         |
     +-------------------------------------------------------------------------------------------------+

     +-------------------------------------------------------------------------------------------------+
  2. | _ID |        _CX |       _CY | STATEFP | COUNTYFP | COUNTYNS |       AFFGEOID | GEOID |    NAME |
     |   2 | -84.217151 | 38.206735 |      21 |      017 | 00516855 | 0500000US21017 | 21017 | Bourbon |
     |-------------------------------------------------------------------------------------------------|
     |        LSAD        |             ALAND         |          AWATER         |         fips         |
     |          06        |         750439351         |         4829777         |        21017         |
     +-------------------------------------------------------------------------------------------------+

     +-------------------------------------------------------------------------------------------------+
  3. | _ID |        _CX |       _CY | STATEFP | COUNTYFP | COUNTYNS |       AFFGEOID | GEOID |    NAME |
     |   3 | -86.681623 | 37.207285 |      21 |      031 | 00516862 | 0500000US21031 | 21031 |  Butler |
     |-------------------------------------------------------------------------------------------------|
     |        LSAD        |             ALAND         |          AWATER         |         fips         |
     |          06        |        1103571974         |        13943044         |        21031         |
     +-------------------------------------------------------------------------------------------------+

     +-------------------------------------------------------------------------------------------------+
  4. | _ID |        _CX |       _CY | STATEFP | COUNTYFP | COUNTYNS |       AFFGEOID | GEOID |    NAME |
     |   4 | -83.964311 | 37.692444 |      21 |      065 | 00516879 | 0500000US21065 | 21065 |  Estill |
     |-------------------------------------------------------------------------------------------------|
     |        LSAD        |             ALAND         |          AWATER         |         fips         |
     |          06        |         655509930         |         6516335         |        21065         |
     +-------------------------------------------------------------------------------------------------+

     +-------------------------------------------------------------------------------------------------+
  5. | _ID |        _CX |       _CY | STATEFP | COUNTYFP | COUNTYNS |       AFFGEOID | GEOID |    NAME |
     |   5 | -83.696656 | 38.370118 |      21 |      069 | 00516881 | 0500000US21069 | 21069 | Fleming |
     |-------------------------------------------------------------------------------------------------|
     |        LSAD        |             ALAND         |          AWATER         |         fips         |
     |          06        |         902727151         |         7182793         |        21069         |
     +-------------------------------------------------------------------------------------------------+

人口数据

. import delimited https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2
> 019-alldata.csv, clear
(164 vars, 3,193 obs)

. 
. generate fips = state*1000 + county

. desc

Contains data
  obs:         3,193                          
 vars:           165                          
-------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------
sumlev          byte    %8.0g                 SUMLEV
region          byte    %8.0g                 REGION
division        byte    %8.0g                 DIVISION
state           byte    %8.0g                 STATE
county          int     %8.0g                 COUNTY
stname          str20   %20s                  STNAME
ctyname         str33   %33s                  CTYNAME
census2010pop   long    %12.0g                CENSUS2010POP
estimatesb~2010 long    %12.0g                ESTIMATESBASE2010
popestimate2010 long    %12.0g                POPESTIMATE2010
popestimate2011 long    %12.0g                POPESTIMATE2011
popestimate2012 long    %12.0g                POPESTIMATE2012
popestimate2013 long    %12.0g                POPESTIMATE2013
popestimate2014 long    %12.0g                POPESTIMATE2014
popestimate2015 long    %12.0g                POPESTIMATE2015
popestimate2016 long    %12.0g                POPESTIMATE2016
popestimate2017 long    %12.0g                POPESTIMATE2017
popestimate2018 long    %12.0g                POPESTIMATE2018
popestimate2019 long    %12.0g                POPESTIMATE2019
npopchg_2010    long    %12.0g                NPOPCHG_2010
npopchg_2011    long    %12.0g                NPOPCHG_2011
npopchg_2012    long    %12.0g                NPOPCHG_2012
npopchg_2013    long    %12.0g                NPOPCHG_2013
npopchg_2014    long    %12.0g                NPOPCHG_2014
npopchg_2015    long    %12.0g                NPOPCHG_2015
npopchg_2016    long    %12.0g                NPOPCHG_2016
npopchg_2017    long    %12.0g                NPOPCHG_2017
npopchg_2018    long    %12.0g                NPOPCHG_2018
npopchg_2019    long    %12.0g                NPOPCHG_2019
births2010      long    %12.0g                BIRTHS2010
births2011      long    %12.0g                BIRTHS2011
births2012      long    %12.0g                BIRTHS2012
births2013      long    %12.0g                BIRTHS2013
births2014      long    %12.0g                BIRTHS2014
births2015      long    %12.0g                BIRTHS2015
births2016      long    %12.0g                BIRTHS2016
births2017      long    %12.0g                BIRTHS2017
births2018      long    %12.0g                BIRTHS2018
births2019      long    %12.0g                BIRTHS2019
deaths2010      long    %12.0g                DEATHS2010
deaths2011      long    %12.0g                DEATHS2011
deaths2012      long    %12.0g                DEATHS2012
deaths2013      long    %12.0g                DEATHS2013
deaths2014      long    %12.0g                DEATHS2014
deaths2015      long    %12.0g                DEATHS2015
deaths2016      long    %12.0g                DEATHS2016
deaths2017      long    %12.0g                DEATHS2017
deaths2018      long    %12.0g                DEATHS2018
deaths2019      long    %12.0g                DEATHS2019
naturalinc2010  long    %12.0g                NATURALINC2010
naturalinc2011  long    %12.0g                NATURALINC2011
naturalinc2012  long    %12.0g                NATURALINC2012
naturalinc2013  long    %12.0g                NATURALINC2013
naturalinc2014  long    %12.0g                NATURALINC2014
naturalinc2015  long    %12.0g                NATURALINC2015
naturalinc2016  long    %12.0g                NATURALINC2016
naturalinc2017  long    %12.0g                NATURALINC2017
naturalinc2018  long    %12.0g                NATURALINC2018
naturalinc2019  long    %12.0g                NATURALINC2019
internatio~2010 int     %8.0g                 INTERNATIONALMIG2010
internatio~2011 long    %12.0g                INTERNATIONALMIG2011
internatio~2012 long    %12.0g                INTERNATIONALMIG2012
internatio~2013 long    %12.0g                INTERNATIONALMIG2013
internatio~2014 long    %12.0g                INTERNATIONALMIG2014
internatio~2015 long    %12.0g                INTERNATIONALMIG2015
internatio~2016 long    %12.0g                INTERNATIONALMIG2016
internatio~2017 long    %12.0g                INTERNATIONALMIG2017
internatio~2018 long    %12.0g                INTERNATIONALMIG2018
internatio~2019 long    %12.0g                INTERNATIONALMIG2019
domesticmig2010 int     %8.0g                 DOMESTICMIG2010
domesticmig2011 long    %12.0g                DOMESTICMIG2011
domesticmig2012 long    %12.0g                DOMESTICMIG2012
domesticmig2013 long    %12.0g                DOMESTICMIG2013
domesticmig2014 long    %12.0g                DOMESTICMIG2014
domesticmig2015 long    %12.0g                DOMESTICMIG2015
domesticmig2016 long    %12.0g                DOMESTICMIG2016
domesticmig2017 long    %12.0g                DOMESTICMIG2017
domesticmig2018 long    %12.0g                DOMESTICMIG2018
domesticmig2019 long    %12.0g                DOMESTICMIG2019
netmig2010      long    %12.0g                NETMIG2010
netmig2011      long    %12.0g                NETMIG2011
netmig2012      long    %12.0g                NETMIG2012
netmig2013      long    %12.0g                NETMIG2013
netmig2014      long    %12.0g                NETMIG2014
netmig2015      long    %12.0g                NETMIG2015
netmig2016      long    %12.0g                NETMIG2016
netmig2017      long    %12.0g                NETMIG2017
netmig2018      long    %12.0g                NETMIG2018
netmig2019      long    %12.0g                NETMIG2019
residual2010    int     %8.0g                 RESIDUAL2010
residual2011    int     %8.0g                 RESIDUAL2011
residual2012    int     %8.0g                 RESIDUAL2012
residual2013    int     %8.0g                 RESIDUAL2013
residual2014    int     %8.0g                 RESIDUAL2014
residual2015    int     %8.0g                 RESIDUAL2015
residual2016    int     %8.0g                 RESIDUAL2016
residual2017    int     %8.0g                 RESIDUAL2017
residual2018    int     %8.0g                 RESIDUAL2018
residual2019    int     %8.0g                 RESIDUAL2019
gqestimatesba~0 long    %12.0g                GQESTIMATESBASE2010
gqestimates2010 long    %12.0g                GQESTIMATES2010
gqestimates2011 long    %12.0g                GQESTIMATES2011
gqestimates2012 long    %12.0g                GQESTIMATES2012
gqestimates2013 long    %12.0g                GQESTIMATES2013
gqestimates2014 long    %12.0g                GQESTIMATES2014
gqestimates2015 long    %12.0g                GQESTIMATES2015
gqestimates2016 long    %12.0g                GQESTIMATES2016
gqestimates2017 long    %12.0g                GQESTIMATES2017
gqestimates2018 long    %12.0g                GQESTIMATES2018
gqestimates2019 long    %12.0g                GQESTIMATES2019
rbirth2011      float   %9.0g                 RBIRTH2011
rbirth2012      float   %9.0g                 RBIRTH2012
rbirth2013      float   %9.0g                 RBIRTH2013
rbirth2014      float   %9.0g                 RBIRTH2014
rbirth2015      float   %9.0g                 RBIRTH2015
rbirth2016      float   %9.0g                 RBIRTH2016
rbirth2017      float   %9.0g                 RBIRTH2017
rbirth2018      float   %9.0g                 RBIRTH2018
rbirth2019      float   %9.0g                 RBIRTH2019
rdeath2011      float   %9.0g                 RDEATH2011
rdeath2012      float   %9.0g                 RDEATH2012
rdeath2013      float   %9.0g                 RDEATH2013
rdeath2014      float   %9.0g                 RDEATH2014
rdeath2015      float   %9.0g                 RDEATH2015
rdeath2016      float   %9.0g                 RDEATH2016
rdeath2017      float   %9.0g                 RDEATH2017
rdeath2018      float   %9.0g                 RDEATH2018
rdeath2019      float   %9.0g                 RDEATH2019
rnaturalinc2011 float   %9.0g                 RNATURALINC2011
rnaturalinc2012 float   %9.0g                 RNATURALINC2012
rnaturalinc2013 float   %9.0g                 RNATURALINC2013
rnaturalinc2014 float   %9.0g                 RNATURALINC2014
rnaturalinc2015 float   %9.0g                 RNATURALINC2015
rnaturalinc2016 float   %9.0g                 RNATURALINC2016
rnaturalinc2017 float   %9.0g                 RNATURALINC2017
rnaturalinc2018 float   %9.0g                 RNATURALINC2018
rnaturalinc2019 float   %9.0g                 RNATURALINC2019
rinternati~2011 float   %9.0g                 RINTERNATIONALMIG2011
rinternati~2012 float   %9.0g                 RINTERNATIONALMIG2012
rinternati~2013 float   %9.0g                 RINTERNATIONALMIG2013
rinternati~2014 float   %9.0g                 RINTERNATIONALMIG2014
rinternati~2015 float   %9.0g                 RINTERNATIONALMIG2015
rinternati~2016 float   %9.0g                 RINTERNATIONALMIG2016
rinternati~2017 float   %9.0g                 RINTERNATIONALMIG2017
rinternati~2018 float   %9.0g                 RINTERNATIONALMIG2018
rinternati~2019 float   %9.0g                 RINTERNATIONALMIG2019
rdomesticm~2011 float   %9.0g                 RDOMESTICMIG2011
rdomesticm~2012 float   %9.0g                 RDOMESTICMIG2012
rdomesticm~2013 float   %9.0g                 RDOMESTICMIG2013
rdomesticm~2014 float   %9.0g                 RDOMESTICMIG2014
rdomesticm~2015 float   %9.0g                 RDOMESTICMIG2015
rdomesticm~2016 float   %9.0g                 RDOMESTICMIG2016
rdomesticm~2017 float   %9.0g                 RDOMESTICMIG2017
rdomesticm~2018 float   %9.0g                 RDOMESTICMIG2018
rdomesticm~2019 float   %9.0g                 RDOMESTICMIG2019
rnetmig2011     float   %9.0g                 RNETMIG2011
rnetmig2012     float   %9.0g                 RNETMIG2012
rnetmig2013     float   %9.0g                 RNETMIG2013
rnetmig2014     float   %9.0g                 RNETMIG2014
rnetmig2015     float   %9.0g                 RNETMIG2015
rnetmig2016     float   %9.0g                 RNETMIG2016
rnetmig2017     float   %9.0g                 RNETMIG2017
rnetmig2018     float   %9.0g                 RNETMIG2018
rnetmig2019     float   %9.0g                 RNETMIG2019
fips            float   %9.0g                 
-------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. * save ./data/census_popn.dta, replace

merge 数据

use ./data/covid_`date'.dta, clear
drop if fips >= .
merge 1:1 fips using data/usacounties.dta
keep if _merge == 3
drop _merge

merge 1:1 fips using data/census_popn

generate confirmed_adj = 100000*(confirmed/popestimate2019)
label var confirmed_adj "Cases per 100,000"
format %16.0fc confirmed_adj

grmap 显示数据

grmap, activate
drop if province_state == "Alaska" | province_state == "Hawaii" | _ID >= .
spset, modify shpfile(usacounties_shp)
grmap confirmed_adj, clnumber(7)

grmap结果

covid19 map
covid19 map

使用Python获取与显示数据

World数据

local date = "07-29-2020"
import delimited "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/`date'.csv", ///
    encoding(utf-8) clear

capture confirm variable country_region
if _rc > 0 {
    capture confirm variable countryregion
    if _rc > 0 {
di as error "variable countr_region or countryregion required"      
    }
    
    rename countryregion country_region
}
    
keep country_region confirmed   
sort country_region
by country_region : gen confirmed_total = sum(confirmed)
by country_region : keep if _n == _N

keep country_region confirmed_total
export delimited _confirmed_world_`date'.csv, replace 

数据改进

_pctile confirmed_total, percentiles(25 75)
local r25 = int(r(r1)) 
local r75 = int(r(r2)) 

显示World数据

python:
import numpy as np
import pandas as pd
df = pd.read_csv("_confirmed_world_`date'.csv")
df.head()
end

python:
from urllib.request import urlopen
import numpy as np
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)


import plotly.express as px
fig = px.choropleth(df, geojson=counties, locations='country_region', 
                        locationmode='country names',
                        color='confirmed_total',
                        hover_data=['country_region', 'confirmed_total'],
                        color_continuous_scale='ylorrd',
                        range_color = [`r25', `r75'],
                        scope="world",
                        labels={
                            'country_region':'country region',
                            'confirmed_total':'confirmed cases'
                        }
                    )


fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# fig.write_html("test.html")
fig.show()
end

World结果

do-file中使用参数

local 0 ",`0'"
di "`0'"

syntax [anything] , [state(string asis) date(string asis) png(string asis) show]

Stata中的日期

. 
. display %td 0
01jan1960

. display %td 0
01jan1960

. display %tdDD/NN/CCYY 0
01/01/1960

. 
. local year = 2020

. local month = 7

. local day = 30

. 
. local td = date("`month'/`day'/`year'", "MDY")

. display `td'
22126

. display %td `td'
30jul2020

.  

美国数据

使用Python

. local date = "07-30-2020"

. python:
----------------------------------------------- python (type end to exit) -----------------------------------
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv("https://raw.githubusercontent.com/"\
...         "CSSEGISandData/COVID-19/master/csse_covid_19_data/"\
...         "csse_covid_19_daily_reports/`date'.csv",\
...         dtype={"fips" : np.int32})
>>> df.columns = df.columns.str.lower()
>>> df = df.loc[df['country_region'] == "US"]
>>> df.head()
      fips     admin2  ... incidence_rate case-fatality_ratio
0  45001.0  Abbeville  ...    1161.984751            2.456140
1  22001.0     Acadia  ...    3705.375131            3.088299
2  51001.0   Accomack  ...    3323.431118            1.396648
3  16001.0        Ada  ...    1594.935079            0.716053
4  19001.0      Adair  ...     279.642058            0.000000

[5 rows x 14 columns]
>>> end
-------------------------------------------------------------------------------------------------------------

使用plotly显示数据

python:
from urllib.request import urlopen
import numpy as np
import json
with urlopen("https://raw.githubusercontent.com/"\
    "plotly/datasets/master/geojson-counties-fips.json") as response:
    counties = json.load(response)

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/"\
    "CSSEGISandData/COVID-19/master/csse_covid_19_data/"\
    "csse_covid_19_daily_reports/`date'.csv",\
    dtype={"fips" : np.int32})
df.columns = df.columns.str.lower()
df = df.loc[df['country_region'] == "US"]
import plotly.express as px
fig = px.choropleth(df, geojson=counties, locations='fips', 
                        color='confirmed',
                        hover_data=['combined_key', 'confirmed'],
                        color_continuous_scale='Inferno',
                        range_color = [100, 5000],
                        scope="usa",
                        labels={
                            'combined_key':'localtion',
                            'confirmed':'confirmed cases'
                        }
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
# fig.write_html("./output/`date'-`state'.html")
end

US结果

使用人口数据调整病例数

. local date = "07-30-2020"

. 
. import delimited https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2
> 019-alldata.csv, clear
(164 vars, 3,193 obs)

. generate fips = state*1000 + county

. save data/census_popn.dta, replace
file data/census_popn.dta saved

. import delimited "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_
> covid_19_daily_reports/`date'.csv", clear
(14 vars, 3,935 obs)

. merge m:1 fips using data/census_popn.dta
(note: variable fips was long, now double to accommodate using data's values)

    Result                           # of obs.
    -----------------------------------------
    not matched                           934
        from master                       838  (_merge==1)
        from using                         96  (_merge==2)

    matched                             3,097  (_merge==3)
    -----------------------------------------

. keep if _merge==3
(934 observations deleted)

. drop _merge

. generate confirmed_adj = 100000*(confirmed/popestimate2019)

. list combined_key confirmed_adj in 1/5

     +---------------------------------+
     |         combined_key   confir~j |
     |---------------------------------|
  1. | Autauga, Alabama, US   1723.675 |
  2. | Baldwin, Alabama, US   1336.714 |
  3. | Barbour, Alabama, US   2183.424 |
  4. |    Bibb, Alabama, US   1536.126 |
  5. |  Blount, Alabama, US   1137.896 |
     +---------------------------------+

. 
. save data/covid19_pop_adj.dta, replace
file data/covid19_pop_adj.dta saved

end

显示人口数据调整病例数do-file

生成动画

version 16

cscript

local start =  date("04/01/2020", "MDY")
local end = date(c(current_date), "DMY") - 1

forval i = `start'/`end' {
    local date : di %tdNN-DD-CCYY `i'
    local j = `i' - `start'
    do covid19_us state(Texas) date(`date') png(t`j')
}

local count = `end' - `start'

python:
import imageio as io

with io.get_writer('../output/texas.gif', mode='I', duration=0.5) as writer:
    for i in range(0, `count', 1):
        image = io.imread("../output/t"+str(i)+".png")
        writer.append_data(image)
writer.close()
end

Texas结果

texas.gif
texas.gif

广东Covid-19数据

Excel文件的格式问题

. import excel "./data/广东数据/广东省新冠肺炎疫情基本情况统计表_1595314944557.xlsx", clear
(8 vars, 27 obs)

. list in 1/5

     +------------------------------------------------------------------------------------------------------+
  1. |                                                                A |    B |        C |        D |    E |
     |                                 广东省新冠肺炎疫情基本情况统计表 |      |          |          |      |
     |------------------------------------------------------------------------------------------------------|
     |                F             |                    G             |                     H              |
     |                              |                                  |                                    |
     +------------------------------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------------------------------+
  2. |                                                                A |    B |        C |        D |    E |
     | 新增:2020年7月20日  00:00-24:00   累计:截至2020年7月20日 24:00 |      |          |          |      |
     |------------------------------------------------------------------------------------------------------|
     |                F             |                    G             |                     H              |
     |                              |                                  |                                    |
     +------------------------------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------------------------------+
  3. |                                                                A |    B |        C |        D |    E |
     |                                                             序号 | 地市 | 确诊病例 |          |      |
     |------------------------------------------------------------------------------------------------------|
     |                F             |                    G             |                     H              |
     |                              |             在院病例             |                                    |
     +------------------------------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------------------------------+
  4. |                                                                A |    B |        C |        D |    E |
     |                                                                  |      |          |          |      |
     |------------------------------------------------------------------------------------------------------|
     |                F             |                    G             |                     H              |
     |                              |                                  |                                    |
     +------------------------------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------------------------------+
  5. |                                                                A |    B |        C |        D |    E |
     |                                                                  |      |     累计 | 境外输入 | 境内 |
     |------------------------------------------------------------------------------------------------------|
     |                F             |                    G             |                     H              |
     |             新增             |             在院总数             |             危重/重症              |
     +------------------------------------------------------------------------------------------------------+

列举文件

local files : dir "./广东数据" files "*.xlsx"
foreach file in `files' {
        qui import excel "./广东数据/`file'", clear
        /* other steps */
}

中文日期

if ustrword(A[2], 4) == "年" {
    local year = ustrword(A[2], 3)  
}

if ustrword(A[2], 6) == "月" {
    local month = ustrword(A[2], 5) 
}

if ustrword(A[2], 8) == "日" {
    local day = ustrword(A[2], 7)   
}

if "`year'" == "" | "`month'" == "" | "`day'" == "" {
di as error "can not find date!"
exit 198
}

local date = date("`month'-`day'-`year'", "MDY")

遍历每个变量

ds
local varlist = "`r(varlist)'"
forval i=1/`varno' {
    local var : word `i' of `varlist'
    cap assert `var' >= .
    if _rc == 0 {
        qui drop `var'
    }
}

完整do-file

quietly forval i=1/`obsno' {
    use ../data/fileinfo.dta, clear
    local date = date[`i']
    local file = file[`i']

    import excel "../data/广东数据/`file'", cellrange(B7:H27) clear
    gen int date =  `date'
    format %td date
    rename B 地市
    rename C 确诊病例_累计
    rename D 确诊病例_境外输入
    rename E 确诊病例_境内
    rename F 在院病例_新增
    rename G 在院病例_在院总数
    rename H 在院病例_危重重症
    
    append using ../data/广东数据.dta
    save ../data/广东数据.dta, replace
}

完整结果

. use ./data/广东数据.dta, clear

. desc

Contains data from ./data/广东数据.dta
  obs:           462                          
 vars:             8                          17 Aug 2020 15:51
-------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------
地市            str20   %20s                  
确诊病例_累计   int     %10.0g                
确诊病例_境~入  int     %10.0g                
确诊病例_境内   int     %10.0g                
在院病例_新增   int     %10.0g                
在院病例_在~数  int     %10.0g                
在院病例_危~症  int     %10.0g                
date            int     %td                   
-------------------------------------------------------------------------------------------------------------
Sorted by: 地市  date

. list in 1/5

     +--------------------------------------------------------------------------------+
     |   地市   确诊~计   确诊~入   确诊~内   在院~增   在院~数   在院~症        date |
     |--------------------------------------------------------------------------------|
  1. | 东莞市       100         1        99         .         .         .   20jul2020 |
  2. | 东莞市       100         1        99         .         .         .   21jul2020 |
  3. | 东莞市       100         1        99         .         .         .   22jul2020 |
  4. | 东莞市       100         1        99         .         .         .   23jul2020 |
  5. | 东莞市       100         1        99         .         .         .   24jul2020 |
     +--------------------------------------------------------------------------------+

谢谢!

Post-credits…

TODO

  • https://www.stata.com/training/webinar_series/covid-19/