Stata Python 融合应用

Hua Peng@StataCorp

2019 Stata 中国用户大会

https://huapeng01016.github.io/china2019/

Stata 16与Python的紧密结合

嵌入与运行Python程序
互动式执行Python
在do-file中定义与运行Python程序
在ado-file中定义与运行Python程序
Python与Stata通过Stata Function Interface (sfi)互动

互动式执行Python

Hello World!

. python:
----------------------------------------------- python (type end to exit) -------------------------
>>> print('Hello World!')
Hello World!
>>> end
---------------------------------------------------------------------------------------------------

for 循环

Stata与其他Python环境一样，输入Python语句需要正确使用“缩进”。

. python:
----------------------------------------------- python (type end to exit) -------------------------
>>> sum = 0
>>> for i in range(7):
...     sum = sum + i
>>> print(sum)
21
>>> end
---------------------------------------------------------------------------------------------------

sfi

. python:
----------------------------------------------- python (type end to exit) -------------------------
>>> from functools import reduce
>>> from sfi import Data, Macro
>>> 
>>> stata: quietly sysuse auto, clear
>>> 
>>> sum = reduce((lambda x, y: x + y), Data.get(var='price'))
>>> 
>>> Macro.setLocal('sum', str(sum))
>>> end
---------------------------------------------------------------------------------------------------

. display "sum of var price is : `sum'"
sum of var price is : 456229

使用Python模块

Pandas
Numpy
BeautifulSoup, lxml
Matplotlib
Scikit-Learn, Tensorflow, Keras
NLTK,jieba

三维曲面图

导入Python模块

. python:
----------------------------------------------- python (type end to exit) -------------------------
>>> import numpy as np
>>> from sfi import Platform
>>> 
>>> import matplotlib
>>> if Platform.isWindows():
...         matplotlib.use('TkAgg')
... 
>>> import matplotlib.pyplot as plt
>>> from mpl_toolkits import mplot3d
>>> from sfi import Data
>>> end
---------------------------------------------------------------------------------------------------

使用sfi.Data导入数据

. use https://www.stata-press.com/data/r16/sandstone, clear
(Subsea elevation of Lamont sandstone in an area of Ohio)

. * Use sfi to get data from Stata
. python:
----------------------------------------------- python (type end to exit) -------------------------
>>> D = np.array(Data.get("northing easting depth"))
>>> end
---------------------------------------------------------------------------------------------------

使用三角网画图

python:
ax = plt.axes(projection='3d')
ax.xaxis.xticks(np.arange(60000, 90001, step=10000))
ax.yaxis.yticks(np.arange(30000, 50001, step=5000))
ax.plot_trisurf(D[:,0], D[:,1], D[:,2], cmap='viridis', edgecolor='none')
plt.savefig("sandstone.png")
end

结果

改变颜色和视角

python:
ax.plot_trisurf(D[:,0], D[:,1], D[:,2],
    cmap=plt.cm.Spectral, edgecolor='none')
ax.view_init(30, 60)
plt.savefig("sandstone1.png")
end

动画 (do-file)

网络数据的抓取

使用pandas获取表格

抓取Nasdaq 100 stock tickers

python:
import pandas as pd
data = pd.read_html("https://en.wikipedia.org/wiki/NASDAQ-100")
df = data[2]
df = df.drop(df.index[0])
t = df.values.tolist()
end

生成Stata dataset

python:
from sfi import Data
Data.addObs(len(t))
stata: gen company = ""
stata: gen ticker = ""
Data.store(None, range(len(t)), t)
end

. list in 1/5, clean

                       company   ticker  
  1.                Adobe Inc.     ADBE  
  2.    Advanced Micro Devices      AMD  
  3.   Alexion Pharmaceuticals     ALXN  
  4.    Align Technology, Inc.     ALGN  
  5.   Alphabet Inc. (Class A)    GOOGL

使用lxml分解HTML

使用Python script获得Nasdaq 100股票具体信息, 例如ATVI.

调用Python文件和输入参数

use nas100ticker, clear
quietly describe
frame create detail
forvalues i = 1/`r(N)' {
    local a = ticker[`i']
    local b detail
    python script nas1detail.py, args(`a' `b')
    sleep 100
}
frame detail : save nasd100detail.dta, replace

. list ticker open_price open_date close_price close_date in 1/5, clean

       ticker   open_price       open_date   close_pr~e      close_date  
  1.    GOOGL   $ 1,168.43   Aug. 15, 2019   $ 1,164.25   Aug. 14, 2019  
  2.     GOOG   $ 1,164.32   Aug. 15, 2019   $ 1,164.29   Aug. 14, 2019  
  3.     AMZN      $ 1,783   Aug. 15, 2019   $ 1,762.96   Aug. 14, 2019  
  4.      AAL      $ 26.20   Aug. 15, 2019      $ 26.10   Aug. 14, 2019  
  5.     AMGN        $ 200   Aug. 15, 2019     $ 198.87   Aug. 14, 2019

Support Vector Machine (SVM)

pysvm (ado file)

program pysvm
    version 16
    syntax varlist, predict(name)
    gettoken label feature : varlist
    python: dosvm("`label'", "`feature'", "`predict'")
end

Python部分ado file

python:
from sfi import Data
import numpy as np
from sklearn.svm import SVC

def dosvm(label, features, predict):
    X = np.array(Data.get(features))
    y = np.array(Data.get(label))

    svc_clf = SVC(gamma='auto')
    svc_clf.fit(X, y)

    y_pred = svc_clf.predict(X)

    Data.addVarByte(predict)
    Data.store(predict, None, y_pred)

end

用auto dataset测试

. sysuse auto, clear
(1978 Automobile Data)

. pysvm foreign mpg price, predict(for2)

比较结果

. label values for2 origin

. tabulate foreign for2, nokey

           |         for2
  Car type |  Domestic    Foreign |     Total
-----------+----------------------+----------
  Domestic |        52          0 |        52 
   Foreign |         0         22 |        22 
-----------+----------------------+----------
     Total |        52         22 |        74

改进版

. pysvm2 foreign mpg price if runiform() <= 0.2
note: training finished successfully

. pysvm2predict for2

. label values for2 origin

. tabulate foreign for2, nokey

           |         for2
  Car type |  Domestic    Foreign |     Total
-----------+----------------------+----------
  Domestic |        52          0 |        52 
   Foreign |        16          6 |        22 
-----------+----------------------+----------
     Total |        68          6 |        74

训练程序(pysvm2.ado)

program pysvm2
    version 16
    syntax varlist(min=3) [if] [in]
    gettoken label features : varlist
    marksample touse
    qui count if `touse'
    if r(N) == 0 {
        di as error "no observations"
        exit 2000
    }

    qui summarize `label' if `touse'
    if r(min) >= r(max) {
        di as error "outcome does not vary"
        exit 2000
    }

    quietly python: dosvm2("`label'", "`features'", "`touse'")
    di as text "note: training finished successfully"
end

Python部分pysvm2.ado

python:
import sys
from sfi import Data, Macro
import numpy as np
from sklearn.svm import SVC
import __main__

def dosvm2(label, features, select):
    X = np.array(Data.get(features, selectvar=select))
    y = np.array(Data.get(label, selectvar=select))

    svc_clf = SVC(gamma='auto')
    svc_clf.fit(X, y)

    __main__.svc_clf = svc_clf
    Macro.setGlobal('e(svm_features)', features)
    return svc_clf
end

预测程序(pysvm2predict.ado)

program pysvm2predict
    version 16
    syntax anything [if] [in]

    gettoken newvar rest : anything
    if "`rest'" != "" {
        exit 198
    }
    confirm new variable `newvar'
    marksample touse
    qui count if `touse'
    if r(N) == 0 {
        di as text "zero observations"
        exit 2000
    }

    qui replace `touse' = _n if `touse' != 0    
    python: dosvmpredict2("`newvar'", "`touse'")
end

Python部分pysvm2predict.ado

python:
import sys
from sfi import Data, Macro
import numpy as np
from sklearn.svm import SVC
from __main__ import svc_clf

def dosvmpredict2(predict, select):
    features = select + " "+ Macro.getGlobal('e(svm_features)')
    X = np.array(Data.get(features, selectvar=select))

    y_pred = svc_clf.predict(X[:,1:])
    y1 = np.reshape(y_pred, (-1,1))

    y = np.concatenate((X, y1), axis=1)
    Data.addVarDouble(predict)
    dim = y.shape[0]
    j = y.shape[1]-1
    for i in range(dim):
        Data.storeAt(predict, y[i,0]-1, y[i,j])

end

ado程序间传递Python实例

In pysvm2.ado ado code:

...
import __main__
...
__main__.svc_clf = svc_clf
...

To access svc_clf in Python routines in ado files:

...
from __main__ import svc_clf
...

工具

python query
python describe
python set exec
python search

词语分析

jieba中文分词

url = "http://www.uone-tech.cn/news/stata16.html"    
html = requests.get(url) 
html.encoding='utf-8' 
text = BeautifulSoup(html.text).get_text() 

import jieba       
words = jieba.lcut(text)

英文词频

url = "https://www.stata.com/new-in-stata/python-integration/"    
html = requests.get(url)  
text = BeautifulSoup(html.text).get_text() 

from wordcloud import WordCloud

wordcloud = WordCloud(max_font_size=75, 
    max_words=100, 
    background_color="white").generate(text)

谢谢!

Post-credits…

sfi details and examples
Stata Python documentation
Stata Python integration
The talk is made with Stata markdown and dynpandoc
wordcloud do-file

Stata Python 融合应用

Hua Peng@StataCorp

2019 Stata 中国用户大会

https://huapeng01016.github.io/china2019/

Stata 16与Python的紧密结合

互动式执行Python

Hello World!

for 循环

sfi

更多sfi

使用Python模块

三维曲面图

导入Python模块

使用sfi.Data导入数据

使用三角网画图

结果

改变颜色和视角

动画 (do-file)

网络数据的抓取

使用pandas获取表格

生成Stata dataset

使用lxml分解HTML

调用Python文件和输入参数

Support Vector Machine (SVM)

pysvm (ado file)

Python部分ado file

用auto dataset测试

比较结果

改进版

训练程序(pysvm2.ado)

Python部分pysvm2.ado

预测程序(pysvm2predict.ado)

Python部分pysvm2predict.ado

ado程序间传递Python实例

工具

词语分析

jieba中文分词

英文词频

谢谢!

Post-credits…