Consider the dataset containing a list of addresses,
input str200 address
"1601 E NASA Pkwy, Houston, TX 77058"
"1000 George Bush Drive West College Station, TX 77845"
"150 West 65th Street, New York, NY 10023"
"1600 Pennsylvania Avenue NW, Washington, DC 20500"
"One Shields Avenue, Davis, CA 95616"
end
We may use the following regular expression to retrieve the ZIP codes from the addresses.
gen zip = ustrregexs(0) if ustrregexm(address, "\d{5}$")
If we only want the ZIP codes from addresses in Texas, a positive lookbehind (?<=TX\s)
could be used.
gen zip_tx = ustrregexs(0) if ustrregexm(address, "(?<=TX\s)\d{5}$")
A negative lookbehind (?<!TX\s)
could be used to retrieve the ZIP codes from addresses NOT in Texas.
gen zip_not_tx = ustrregexs(0) if ustrregexm(address, "(?<!TX\s)\d{5}$")
The dataset now looks like:
. list
+-----------------------------------------------------------------------------------+
| address zip zip_tx zip_no~x |
|-----------------------------------------------------------------------------------|
1. | 1601 E NASA Pkwy, Houston, TX 77058 77058 77058 |
2. | 1000 George Bush Drive West College Station, TX 77845 77845 77845 |
3. | 150 West 65th Street, New York, NY 10023 10023 10023 |
4. | 1600 Pennsylvania Avenue NW, Washington, DC 20500 20500 20500 |
5. | One Shields Avenue, Davis, CA 95616 95616 95616 |
+-----------------------------------------------------------------------------------+
Stata has four regular expression functions based on the ICU. Asjad Naqvi has a wonderful tutorial about their usage.