In China, the term “city” can refer to a county, prefecture, or
province. This ambiguity creates challenges for researchers, who often
struggle to convert between regional names and their corresponding
geocodes, especially in datasets that span many years. The central
government periodically changes or eliminates administrative unit names,
further complicating these conversions (国家统计局 2022a). Inspired by
Vincent Arel-Bundock’s countrycode package, we developed regioncode to
streamline the conversion of Chinese region names and codes for the
years 1986–2022.
The Chinese government assigns a unique geocode to each county, prefecture, and province, and regularly adjusts and updates these “administrative division codes” to support national and regional development plans (民政部 2022). These adjustments, however, challenge researchers who conduct longitudinal studies or merge geospatial data from different years. For example, when inconsistencies exist between map data and statistical data, any attempt to render the statistical information on a map of China can produce errors.
regioncode provides a one-step solution to these challenges. The package
converts between formal names, common names, and administrative division
codes for Chinese provinces and prefectures across more than thirty
years (1986–2022). Apart from code conversion, the package also supports
matching administrative divisions with their economic and linguistic
characteristics.
To install the released version from CRAN:
install.packages("regioncode")
To install the development version from GitHub:
remotes::install_github("sammo3182/regioncode")
We demonstrate the basic functionality of regioncode using a sample
dataset randomly drawn from Wang’s (2020) China Corruption
Investigations Dataset. The package uses code to denote administrative
division codes and name for the formal names of regions. It can convert
between these two formats.
A user simply provides a character vector of names or a numeric vector
of geocodes to the function and specifies the desired output with the
convert_to argument. The following example converts geocodes from 2019
to their 1989 equivalents. Users must set the year_from argument to the
correct reference year and then use the year_to and convert_to arguments
to specify the target year and output format.
library(regioncode)
library(tibble) # provides tibble() for the comparison table
data("corruption")
# Conversion to the 1989 version
regioncode(
  data_input = corruption$prefecture_id,
  convert_to = "code", # default setting
  year_from = 2019,
  year_to = 1989
)
## [1] 370100 329001 310227 420500 452200 433000 350300 512500 460025 420600
# Comparison
tibble(
  code2019 = corruption$prefecture_id,
  code1989 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "code", # default setting
    year_from = 2019,
    year_to = 1989
  ),
  name2019 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "name", # names for the same year
    year_from = 2019,
    year_to = 2019
  ),
  name1989 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "name", # names as of 1989
    year_from = 2019,
    year_to = 1989
  )
)
## # A tibble: 10 × 4
## code2019 code1989 name2019 name1989
## <dbl> <dbl> <chr> <chr>
## 1 370100 370100 济南市 济南市
## 2 321200 329001 泰州市 泰州市
## 3 310117 310227 松江区 松江县
## 4 420500 420500 宜昌市 宜昌市
## 5 451300 452200 来宾市 柳州地区
## 6 431200 433000 怀化市 怀化地区
## 7 350300 350300 莆田市 莆田市
## 8 511500 512500 宜宾市 宜宾地区
## 9 469021 460025 定安县 定安县
## 10 420600 420600 襄阳市 襄樊市
Note that if a region geocoded in 1989 was later absorbed into a new
region by 2019, the package uses the new region’s geocode. If a single
large area was later divided into several smaller regions, the package
aligns the new codes with the first new region based on the ascending
numerical order of their geocodes.[1]
regioncode automatically identifies the input format, treating numeric
vectors as geocodes and character vectors as names. The following
example demonstrates converting various input types into different
output formats:
## # A tibble: 10 × 2
## id name
## <dbl> <chr>
## 1 370100 济南市
## 2 321200 泰州市
## 3 310117 松江区
## 4 420500 宜昌市
## 5 451300 来宾市
## 6 431200 怀化市
## 7 350300 莆田市
## 8 511500 宜宾市
## 9 469021 定安县
## 10 420600 襄阳市
# Codes to name
regioncode(
  data_input = corruption$prefecture_id,
  convert_to = "name",
  year_from = 2019,
  year_to = 1989
)
## [1] "济南市" "泰州市" "松江县" "宜昌市" "柳州地区" "怀化地区"
## [7] "莆田市" "宜宾地区" "定安县" "襄樊市"
# Name to codes of the same year
regioncode(
  data_input = corruption$prefecture,
  convert_to = "code",
  year_from = 2019,
  year_to = 2019
)
## [1] 370100 321200 310117 420500 451300 431200 350300 511500 469021 420600
# Name to name of a different year
regioncode(
  data_input = corruption$prefecture,
  convert_to = "name",
  year_from = 2019,
  year_to = 1989
)
## [1] "济南市" "泰州市" "松江县" "宜昌市" "柳州地区" "怀化地区"
## [7] "莆田市" "宜宾地区" "定安县" "襄樊市"
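The tie-breaking rule described earlier, under which a region that was later divided is matched to the new region with the smallest geocode, can be sketched in base R. This is a minimal illustration rather than the package's internal code: the crosswalk table, its column names, and every code in it are hypothetical.

```r
# Hypothetical crosswalk: old region 100100 was split into three new regions,
# while old region 100200 maps to a single new region.
# All codes and column names are invented for illustration.
crosswalk <- data.frame(
  old_code = c(100100, 100100, 100100, 100200),
  new_code = c(200300, 200100, 200200, 200400)
)

# Keep, for each old code, the new code that comes first in ascending order
first_match <- aggregate(new_code ~ old_code, data = crosswalk, FUN = min)
first_match
```

Taking the minimum per old code gives a deterministic one-to-one match, at the cost of ignoring the other regions created by the split.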
The regioncode package also provides specialized functions for more
complex data and diverse research needs.
Datasets often record geographic information using incomplete names
that omit the administrative level, such as “北京” for “北京市” or
“内蒙” for “内蒙古自治区.” To handle this type of data, a user can set
the incomplete_name argument to TRUE. regioncode can perform the
conversion as long as at least two characters are available to identify
the city or province. In the following example, we shorten 70% of the
city names in the input vector to demonstrate how regioncode resolves
this issue:
## [1] "济南市" "泰州市" "松江区" "宜昌市" "来宾市" "怀化市" "莆田市" "宜宾市"
## [9] "定安县" "襄阳市"
fake_incomplete <- corruption$prefecture
# Randomly pick 7 of the 10 names and keep only their first two characters
index_incomplete <- sample(seq_along(corruption$prefecture), 7)
fake_incomplete[index_incomplete] <- fake_incomplete[index_incomplete] |>
  substr(start = 1, stop = 2)
fake_incomplete
## [1] "济南" "泰州" "松江" "宜昌" "来宾" "怀化市" "莆田市" "宜宾市"
## [9] "定安" "襄阳"
# Conversion to full names in 2008
regioncode(
  data_input = fake_incomplete,
  convert_to = "name",
  year_from = 2019,
  year_to = 2008,
  incomplete_name = TRUE
)
## [1] "济南市" "泰州市" "松江区" "宜昌市" "来宾市" "怀化市" "莆田市" "宜宾市"
## [9] "定安县" "襄樊市"
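One way to picture the matching of incomplete names is as a prefix lookup against a list of full names. The sketch below is only an assumption about the general idea, not the method regioncode actually uses; the helper complete_name() and the small lookup vector are hypothetical.

```r
# Hypothetical lookup of full names (taken from the sample data above)
full_names <- c("济南市", "泰州市", "松江区", "宜昌市", "怀化市")

# Hypothetical helper: return the full name that starts with the short name,
# or NA when there is no match or the match is ambiguous
complete_name <- function(short, candidates) {
  hits <- candidates[startsWith(candidates, short)]
  if (length(hits) == 1) hits else NA_character_
}

short_names <- c("济南", "松江", "怀化市", "北京")
completed <- vapply(short_names, complete_name, character(1),
                    candidates = full_names, USE.NAMES = FALSE)
completed
```

Returning NA for ambiguous prefixes is a deliberate choice in this sketch: a wrong match would silently corrupt a merge, whereas a missing value is easy to spot.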
In China, municipalities (“直辖市”) are geographically cities but
function administratively as provinces. Different datasets may
categorize them differently, with some treating them as equivalent to
prefectures. The regioncode package includes the zhixiashi argument to
manage this distinction. The default setting, FALSE, treats
municipalities as provinces. When set to TRUE, the function treats them
as prefectures and uses their provincial codes as their geocodes. The
following example demonstrates this functionality using a character
vector containing the names of municipalities, their districts, and a
prefecture:
names_municipality <- c(
  "北京市", # Beijing, a municipality
  "海淀区", # A district of Beijing
  "上海市", # Shanghai, a municipality
  "静安区", # A district of Shanghai
  "济南市" # A prefecture of Shandong
)
# When `zhixiashi` is FALSE, the municipalities themselves return NA
regioncode(
  data_input = names_municipality,
  year_from = 2019,
  year_to = 2019,
  convert_to = "code",
  zhixiashi = FALSE
)
## [1] NA 110108 NA 310106 370100
# When `zhixiashi` is TRUE, municipalities are recognized
regioncode(
  data_input = names_municipality,
  year_from = 2019,
  year_to = 2019,
  convert_to = "code",
  zhixiashi = TRUE
)
## [1] 110000 110108 310000 310106 370100
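The codes in the output above follow the six-digit layout of China's administrative division codes, in which the first two digits identify the province-level unit, the middle two the prefecture, and the last two the county or district. As a small sketch that is independent of regioncode, the helper below (a hypothetical name) recovers the provincial code from any geocode, which is also the code a municipality receives when zhixiashi = TRUE.

```r
# Hypothetical helper: zero out the last four digits of a six-digit geocode
to_province_code <- function(code) {
  (code %/% 10000) * 10000
}

# A Beijing district, a Shanghai district, and a Shandong prefecture
province_codes <- to_province_code(c(110108, 310106, 370100))
province_codes
```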
The Statistical Yearbook of Urban and Rural Construction classifies Chinese cities into different tiers based largely on population (国家统计局 2022b). A four-tier system existed from 1989 to 2014, after which the government expanded it to a seven-tier system, as detailed in the following table:
Criterion | Population | Rank |
---|---|---|
Old (1989) | > 1 million | 特大城市 |
Old (1989) | 500,000 ~ 1 million | 大城市 |
Old (1989) | 200,000 ~ 500,000 | 中等城市 |
Old (1989) | < 200,000 | 小城市 |
New (2014) | > 10 million | 超大城市 |
New (2014) | 5 million ~ 10 million | 特大城市 |
New (2014) | 3 million ~ 5 million | I型大城市 |
New (2014) | 1 million ~ 3 million | II型大城市 |
New (2014) | 500,000 ~ 1 million | 中等城市 |
New (2014) | 200,000 ~ 500,000 | I型小城市 |
New (2014) | < 200,000 | II型小城市 |
The regioncode function can return the population-based rank of a city
for a given year. Users can perform this conversion by setting
convert_to = "rank". The function applies the old ranking system for
years up to and including 1989 and the new system for all subsequent
years. If a city’s population data is unavailable in the official
sources, the function returns NA. The following example compares the
city ranks generated from the same input vector but for different years:
tibble(
  city = corruption$prefecture,
  rank1989 = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "rank"
  ),
  rank2014 = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 2014,
    convert_to = "rank"
  )
)
## # A tibble: 10 × 3
## city rank1989 rank2014
## <chr> <chr> <chr>
## 1 济南市 特大城市 I型大城市
## 2 泰州市 小城市 II型大城市
## 3 松江区 特大城市 超大城市
## 4 宜昌市 中等城市 II型大城市
## 5 来宾市 <NA> 中等城市
## 6 怀化市 小城市 I型小城市
## 7 莆田市 小城市 II型大城市
## 8 宜宾市 中等城市 II型大城市
## 9 定安县 <NA> <NA>
## 10 襄阳市 中等城市 II型大城市
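The 2014 thresholds in the table above can be written as a simple classification with cut(). This sketch is not how regioncode derives ranks; the helper name is hypothetical, and assigning a population that sits exactly on a threshold to the higher tier is an assumption, since the table does not say how boundaries are handled.

```r
# Hypothetical helper: classify urban population under the 2014 seven-tier system
rank_2014 <- function(population) {
  tiers <- cut(
    population,
    breaks = c(0, 2e5, 5e5, 1e6, 3e6, 5e6, 1e7, Inf),
    labels = c("II型小城市", "I型小城市", "中等城市",
               "II型大城市", "I型大城市", "特大城市", "超大城市"),
    right = FALSE # assumption: a value on a threshold joins the higher tier
  )
  as.character(tiers)
}

example_ranks <- rank_2014(c(150000, 800000, 4e6, 1.2e7))
example_ranks
```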
Pinyin provides a phonetic romanization of Chinese characters, and some
datasets store region names in pinyin. While regioncode defaults to
outputting names in Chinese characters, users can obtain pinyin output
by setting the to_pinyin argument to TRUE. This functionality is
integrated from the pinyin package developed by Peng Zhao and Qu Cheng.
The function also produces the correct romanization for regions with
unique spellings, such as Shanxi versus Shaanxi, Inner Mongolia, and the
special administrative regions. This pinyin conversion works for
official names, incomplete names, and administrative area outputs. The
following example shows how the function works for different requests:
tibble(
  city = corruption$prefecture,
  cityPY = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "name",
    to_pinyin = TRUE
  ),
  areaPY = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "area",
    to_pinyin = TRUE
  )
)
## # A tibble: 10 × 3
## city cityPY areaPY
## <chr> <chr> <chr>
## 1 济南市 ji_nan hua_dong
## 2 泰州市 tai_zhou hua_dong
## 3 松江区 song_jiang hua_dong
## 4 宜昌市 yi_chang hua_zhong
## 5 来宾市 liu_zhou hua_nan
## 6 怀化市 huai_hua hua_zhong
## 7 莆田市 pu_tian hua_dong
## 8 宜宾市 yi_bin xi_nan
## 9 定安县 ding_an hua_nan
## 10 襄阳市 xiang_fan hua_zhong
# Regions with special spelling
regioncode(
  data_input = c("山西", "陕西", "内蒙古", "香港", "澳门"),
  year_from = 2019,
  year_to = 2008,
  convert_to = "name",
  incomplete_name = TRUE,
  province = TRUE,
  to_pinyin = TRUE
)
## <NA> 陕西 内蒙 香港
## "shan_xi" "shaan_xi" "inner_mongolia" "hong_kong"
## 澳门
## "macao"
The regioncode function also converts data at the provincial level. By
setting the province argument to TRUE, users can convert provincial
geocodes and names. Since Chinese provinces have standard abbreviations,
users can convert these abbreviations to other data types by setting the
convert_to argument to abbreTocode, abbreToname, or abbreToarea. To
convert names or codes to abbreviations, a user can set
convert_to = "abbre". The following example demonstrates converting a
vector of provincial geocodes into their official names and
abbreviations:
tibble(
  province = corruption$province_id,
  prov_name = regioncode(
    data_input = corruption$province_id,
    convert_to = "name",
    year_from = 2019,
    year_to = 1989,
    province = TRUE
  ),
  prov_abbre = regioncode(
    data_input = corruption$province_id,
    convert_to = "codeToabbre",
    year_from = 2019,
    year_to = 1989,
    province = TRUE
  )
)
## # A tibble: 10 × 3
## province prov_name prov_abbre
## <dbl> <chr> <chr>
## 1 370000 山东省 鲁
## 2 320000 江苏省 苏
## 3 310000 上海市 沪
## 4 420000 湖北省 鄂
## 5 450000 广西壮族自治区 桂
## 6 430000 湖南省 湘
## 7 350000 福建省 闽
## 8 510000 四川省 川
## 9 460000 海南省 琼
## 10 420000 湖北省 鄂
regioncode can also convert geographic units beyond the provincial level
into two larger categories: administrative areas and linguistic zones.
For social, political, and military purposes, China divides its territory into seven major administrative areas (孙平 2020):
Region | Provincial-level Administrative Unit |
---|---|
华北 | 北京市, 天津市, 山西省, 河北省, 内蒙古自治区 |
东北 | 黑龙江省, 吉林省, 辽宁省 |
华东 | 上海市, 江苏省, 浙江省, 安徽省, 福建省, 台湾省, 江西省, 山东省 |
华中 | 河南省, 湖北省, 湖南省 |
华南 | 广东省, 海南省, 广西壮族自治区, 香港特别行政区, 澳门特别行政区 |
西南 | 重庆市, 四川省, 贵州省, 云南省, 西藏自治区 |
西北 | 陕西省, 甘肃省, 青海省, 宁夏回族自治区, 新疆维吾尔自治区 |
Users who need to identify the administrative area for a given
prefecture or province can do so with regioncode. The function converts
regional codes and names (for both prefectures and provinces) into their
corresponding administrative area when the convert_to argument is set to
"area":
regioncode(
  data_input = corruption$prefecture,
  year_from = 2019,
  year_to = 1989,
  convert_to = "area"
)
## [1] "华东" "华东" "华东" "华中" "华南" "华中" "华东" "西南" "华南" "华中"
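Conceptually, this conversion is a lookup from a province to its area. A minimal base R sketch of that idea, using a few rows of the table above, follows; the vector name area_of is hypothetical and this is not the package's internal data structure.

```r
# Hypothetical lookup built from part of the table above
area_of <- c(
  "山东省" = "华东", "江苏省" = "华东", "上海市" = "华东",
  "湖北省" = "华中", "湖南省" = "华中",
  "四川省" = "西南", "海南省" = "华南"
)

provinces <- c("山东省", "湖北省", "四川省", "河北省")
# Provinces missing from the lookup (here 河北省) give NA
areas <- unname(area_of[provinces])
areas
```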
China has numerous dialects, and their distribution presents challenges
for economic, political, and sociological analyses. A single dialect may
span several prefectures or even multiple provinces. For political and
sociolinguistic researchers, regioncode can identify the approximate
linguistic zone for a given geocode or prefectural name. Following the
1987 Language Atlas of China, the package provides two levels of
linguistic zone identification: dialect groups (dia_group, “方言大类”)
and dialect sub-groups (dia_sub_group, “分区片”). Note that when
province = TRUE, conversions are only possible to the dialect group
level.
The following example converts the sample data to both dialect groups
and sub-groups. Because China’s linguistic distribution is too complex
and dynamic to measure precisely at the prefectural level, the
linguistic zone output from regioncode should be used for reference
only, not for rigorous linguistic research.
tibble(
  city = corruption$prefecture,
  dialectGroup = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    to_dialect = "dia_group"
  ),
  dialectSubGroup = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    to_dialect = "dia_sub_group"
  )
)
## # A tibble: 10 × 3
## city dialectGroup dialectSubGroup
## <chr> <chr> <chr>
## 1 济南市 冀鲁官话 沧惠片-1,石济片-8
## 2 泰州市 江淮官话 泰如片-1
## 3 松江区 吴语 太湖片-1
## 4 宜昌市 西南官话 成渝片-3,成渝片-9
## 5 来宾市 西南官话 桂柳片-10
## 6 怀化市 湘语 岑江片-2,吉溆片-3,娄邵片-1,黔北片-3,长益片-3
## 7 莆田市 莆仙区 莆仙区-4
## 8 宜宾市 西南官话 灌赤片-10
## 9 定安县 琼文区 府城片-1
## 10 襄阳市 西南官话 鄂北片-10
regioncode provides a convenient tool for converting Chinese
administrative division codes and official names and for performing
other specialized conversions. We are actively developing the package
and plan to add more administrative levels and richer data in future
versions. We welcome collaboration, and users can direct any questions,
comments, or bug reports to our GitHub Issues page.
We extend our appreciation to LI Ruizhe, ZHU Meng, SHI Yuyang, XU Yujia, PAN Yuxin, TIAN Haiting, SHAO Weihang, CHEN Yuanqian, and LIU Xueyan for their contributions to data collection and code development for this package.
HU Yue
Department of Political Science,
Tsinghua University,
Email: yuehu@tsinghua.edu.cn
Website: https://www.drhuyue.site
YE Xinyi
Department of Political Science,
Tsinghua University,
Email: yexy23@mails.tsinghua.edu.cn
[1] Users may notice that regioncode sometimes outputs provincially
administered counties (“直辖县”). We include some of these units to
minimize missing data, although they remain county-level administrative
units. Current resources do not permit coverage of all such counties, an
issue we plan to address in the future (GitHub issue #54). We welcome
users to contribute a pull request or contact us to help resolve this
issue.