regioncode: One-Step Solution for Chinese Region Conversions

HU Yue, YE Xinyi

2025-07-05

In China, the term “city” can refer to a county, prefecture, or province. This ambiguity creates challenges for researchers, who often struggle to convert regional names and their corresponding geocodes, especially in datasets that span many years. The central government periodically changes or eliminates administrative unit names, further complicating these conversions (国家统计局 2022a). Inspired by Vincent Arel-Bundock’s countrycode package, we developed regioncode to streamline the conversion of Chinese region names and codes for the years 1986 – 2022.

Why regioncode?

The Chinese government assigns a unique geocode to each county, prefecture, and province, and regularly adjusts and updates these “administrative division codes” to support national and regional development plans (民政部 2022). These adjustments, however, challenge researchers who conduct longitudinal studies or merge geospatial data from different years. For example, when map data and statistical data use codes from different years, any attempt to render the statistical information on a map of China can produce errors.

regioncode provides a one-step solution to these challenges. The package converts formal names, common names, and administrative division codes for Chinese provinces and prefectures across more than thirty years (1986 – 2022). Apart from code conversion, the package also supports matching administrative divisions with their economic and linguistic characteristics.

Installation

To install:
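A minimal installation sketch, assuming the package is published on CRAN under the name regioncode. The GitHub repository path in the commented lines is an assumption, not taken from this text, and should be checked against the project page before use:

```r
# Released version, assuming the package is available on CRAN
install.packages("regioncode")

# Development version; the repository path below is an assumption
# install.packages("remotes")
# remotes::install_github("sammo3182/regioncode")
```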

Basic Usage

We demonstrate the basic functionality of regioncode using a sample dataset randomly drawn from Wang’s (2020) China’s Corruption Investigations Dataset. In its arguments, the package uses "code" to denote administrative division codes and "name" to denote the formal names of regions, and it can convert between the two.

A user simply provides a character vector of names or a numeric vector of geocodes to the function and specifies the desired output with the convert_to argument. The following example converts geocodes from 2019 to their 1989 equivalents. Users must set the year_from argument to the correct reference year and then use the year_to and convert_to arguments to specify the target year and output format.

library(regioncode)
library(tibble) # provides tibble(), used in the comparisons below

data("corruption")

# Conversion to the 1989 version
regioncode(
  data_input = corruption$prefecture_id,
  convert_to = "code", # default setting
  year_from = 2019,
  year_to = 1989
)
##  [1] 370100 329001 310227 420500 452200 433000 350300 512500 460025 420600
# Comparison
tibble(
  code2019 = corruption$prefecture_id,
  code1989 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "code", # default setting
    year_from = 2019,
    year_to = 1989
  ),
  name2019 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "name", # return names instead of codes
    year_from = 2019,
    year_to = 2019
  ),
  name1989 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "name", # return names instead of codes
    year_from = 2019,
    year_to = 1989
  )
)
## # A tibble: 10 × 4
##    code2019 code1989 name2019 name1989
##       <dbl>    <dbl> <chr>    <chr>   
##  1   370100   370100 济南市   济南市  
##  2   321200   329001 泰州市   泰州市  
##  3   310117   310227 松江区   松江县  
##  4   420500   420500 宜昌市   宜昌市  
##  5   451300   452200 来宾市   柳州地区
##  6   431200   433000 怀化市   怀化地区
##  7   350300   350300 莆田市   莆田市  
##  8   511500   512500 宜宾市   宜宾地区
##  9   469021   460025 定安县   定安县  
## 10   420600   420600 襄阳市   襄樊市
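The comparison above also suggests why harmonizing codes matters before merging data from different years. The following base R sketch is an illustration under stated assumptions: the three code pairs are copied from the table above, while the `records2019` and `stats1989` data frames and their values are hypothetical.

```r
# Code pairs copied from the comparison table above
crosswalk <- data.frame(
  code2019 = c(370100, 321200, 451300),
  code1989 = c(370100, 329001, 452200)
)

# Hypothetical records keyed by 2019 codes
records2019 <- data.frame(
  code2019 = c(370100, 321200, 451300),
  cases = c(12, 7, 3)
)

# Hypothetical statistics keyed by 1989 codes
stats1989 <- data.frame(
  code1989 = c(370100, 329001, 452200),
  gdp_index = c(1.0, 0.6, 0.4)
)

# Naive merge on raw codes: only the unchanged code (370100) matches
naive <- merge(records2019, stats1989, by.x = "code2019", by.y = "code1989")
nrow(naive)

# Merge after attaching the 1989 equivalents: all three rows match
harmonized <- merge(
  merge(records2019, crosswalk, by = "code2019"),
  stats1989,
  by = "code1989"
)
nrow(harmonized)
```

In practice, the `crosswalk` step would be replaced by a call to regioncode with the appropriate year_from and year_to values.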

Note that if a region geocoded in 1989 was later absorbed into a new region by 2019, the package uses the new region’s geocode. If a single large area was later divided into several smaller regions, the package aligns the new codes with the first new region based on the ascending numerical order of their geocodes.[1]

regioncode automatically identifies the input format, treating numeric vectors as geocodes and character vectors as names. The following example demonstrates converting various input types into different output formats:

# Original name
tibble(
  id = corruption$prefecture_id,
  name = corruption$prefecture
)
## # A tibble: 10 × 2
##        id name  
##     <dbl> <chr> 
##  1 370100 济南市
##  2 321200 泰州市
##  3 310117 松江区
##  4 420500 宜昌市
##  5 451300 来宾市
##  6 431200 怀化市
##  7 350300 莆田市
##  8 511500 宜宾市
##  9 469021 定安县
## 10 420600 襄阳市
# Codes to name
regioncode(
  data_input = corruption$prefecture_id,
  convert_to = "name",
  year_from = 2019,
  year_to = 1989
)
##  [1] "济南市"   "泰州市"   "松江县"   "宜昌市"   "柳州地区" "怀化地区"
##  [7] "莆田市"   "宜宾地区" "定安县"   "襄樊市"
# Name to codes of the same year
regioncode(
  data_input = corruption$prefecture,
  convert_to = "code",
  year_from = 2019,
  year_to = 2019
)
##  [1] 370100 321200 310117 420500 451300 431200 350300 511500 469021 420600
# Name to name of a different year
regioncode(
  data_input = corruption$prefecture,
  convert_to = "name",
  year_from = 2019,
  year_to = 1989
)
##  [1] "济南市"   "泰州市"   "松江县"   "宜昌市"   "柳州地区" "怀化地区"
##  [7] "莆田市"   "宜宾地区" "定安县"   "襄樊市"
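Because the input format is inferred from the vector type, geocodes stored as text (a common result of reading CSV files) would be read as names. The sketch below is a simplified stand-in for that detection rule, taking the description above at face value; `detect_input` is a hypothetical helper and not part of the package.

```r
# Simplified stand-in for the type-based detection described above;
# detect_input() is hypothetical and not the package's internal code.
detect_input <- function(x) {
  if (is.numeric(x)) {
    "code"
  } else if (is.character(x)) {
    "name"
  } else {
    stop("data_input should be a numeric or character vector")
  }
}

# Geocodes that were read in as character strings
codes_as_text <- c("370100", "321200")

detect_input(codes_as_text)             # "name": strings are read as names
detect_input(as.numeric(codes_as_text)) # "code": numbers are read as geocodes
```

Converting such columns with `as.numeric()` before passing them to regioncode avoids the mismatch.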

Advanced Applications

The regioncode package also provides specialized functions for more complex data and diverse research needs, including:

  1. Conversion from/to incomplete names.
  2. Different handling of municipalities.
  3. Return of population-based city ranks.
  4. Return of pinyin format of outputs.
  5. Conversion of provincial data.
  6. Return of administrative areas.
  7. Return of linguistic zones.

Incomplete Naming of Prefectures

Datasets often record geographic information using incomplete names that omit the administrative level, such as “北京” for “北京市” or “内蒙” for “内蒙古自治区.” To handle this type of data, a user can set the incomplete_name argument to TRUE. regioncode can perform the conversion as long as at least two characters are available to identify the city or province. In the following example, we shorten 70% of the city names in the input vector to demonstrate how regioncode resolves this issue:

# Original full names
corruption$prefecture
##  [1] "济南市" "泰州市" "松江区" "宜昌市" "来宾市" "怀化市" "莆田市" "宜宾市"
##  [9] "定安县" "襄阳市"
fake_incomplete <- corruption$prefecture

index_incomplete <- sample(seq_along(corruption$prefecture), 7)

fake_incomplete[index_incomplete] <- fake_incomplete[index_incomplete] |>
  substr(start = 1, stop = 2)

fake_incomplete
##  [1] "济南"   "泰州"   "松江"   "宜昌"   "来宾"   "怀化市" "莆田市" "宜宾市"
##  [9] "定安"   "襄阳"
# Conversion to full names in 2008
regioncode(
  data_input = fake_incomplete,
  convert_to = "name",
  year_from = 2019,
  year_to = 2008,
  incomplete_name = TRUE
)
##  [1] "济南市" "泰州市" "松江区" "宜昌市" "来宾市" "怀化市" "莆田市" "宜宾市"
##  [9] "定安县" "襄樊市"
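One way to picture how a shortened name can be resolved is prefix matching against a list of full names. The sketch below is a base R illustration only, not the algorithm regioncode uses; `match_incomplete` and the small candidate list are hypothetical.

```r
# Hypothetical list of full names to match against
full_names <- c("济南市", "泰州市", "松江区", "怀化市")

# Return the full name that starts with each input string;
# NA when there is no match or when the match is ambiguous.
match_incomplete <- function(x, candidates) {
  vapply(
    x,
    function(s) {
      hit <- candidates[startsWith(candidates, s)]
      if (length(hit) == 1) hit else NA_character_
    },
    character(1),
    USE.NAMES = FALSE
  )
}

match_incomplete(c("济南", "怀化市", "北"), full_names)
```

The third input has no counterpart in the candidate list, so the function returns NA for it rather than guessing.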

Municipalities

In China, municipalities (“直辖市”) are geographically cities but function administratively as provinces. Different datasets may categorize them differently, with some treating them as equivalent to prefectures. The regioncode package includes the zhixiashi argument to manage this distinction. The default setting, FALSE, treats municipalities as provinces. When set to TRUE, the function treats them as prefectures and uses their provincial codes as their geocodes. The following example demonstrates this functionality using a character vector containing the names of municipalities, their districts, and a prefecture:

names_municipality <- c(
  "北京市", # Beijing, a municipality
  "海淀区", # A district of Beijing
  "上海市", # Shanghai, a municipality
  "静安区", # A district of Shanghai
  "济南市"  # Jinan, a prefecture of Shandong
)

# When `zhixiashi` is FALSE, the municipalities themselves return NA;
# their districts and ordinary prefectures are still recognized
regioncode(
  data_input = names_municipality,
  year_from = 2019,
  year_to = 2019,
  convert_to = "code",
  zhixiashi = FALSE
)
## [1]     NA 110108     NA 310106 370100
# When `zhixiashi` is TRUE, municipalities are recognized
regioncode(
  data_input = names_municipality,
  year_from = 2019,
  year_to = 2019,
  convert_to = "code",
  zhixiashi = TRUE
)
## [1] 110000 110108 310000 310106 370100
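Before choosing a value for zhixiashi, it can help to check whether the input contains municipality names at all. The following base R sketch assumes the four municipalities that existed in 2019; the vector names are illustrative.

```r
# The four municipalities as of 2019
municipalities <- c("北京市", "天津市", "上海市", "重庆市")

names_municipality <- c("北京市", "海淀区", "上海市", "静安区", "济南市")

# Flag the entries that are municipalities themselves
is_municipality <- names_municipality %in% municipalities
is_municipality

# TRUE means the default zhixiashi = FALSE would leave some entries as NA
any(is_municipality)
```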

City Ranking

The Statistical Yearbook of Urban and Rural Construction classifies Chinese cities into different tiers based largely on population (国家统计局 2022b). A four-tier system existed from 1989 to 2014, after which the government expanded it to a seven-tier system, as detailed in the following table:

Criterion    Population               Rank
Old (1989)   > 1 million              特大城市
             500,000 ~ 1 million      大城市
             200,000 ~ 500,000        中等城市
             < 200,000                小城市
New (2014)   > 10 million             超大城市
             5 million ~ 10 million   特大城市
             3 million ~ 5 million    I型大城市
             1 million ~ 3 million    II型大城市
             500,000 ~ 1 million      中等城市
             200,000 ~ 500,000        I型小城市
             < 200,000                II型小城市

The regioncode function can return the population-based rank of a city for a given year. Users can perform this conversion by setting convert_to = "rank". The function applies the old ranking system for years up to and including 1989 and the new system for all subsequent years. If a city’s population data is unavailable in the official sources, the function returns NA. The following example compares the city ranks generated from the same input vector but for different years:

tibble(
  city = corruption$prefecture,
  rank1989 = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "rank"
  ),
  rank2014 = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 2014,
    convert_to = "rank"
  )
)
## # A tibble: 10 × 3
##    city   rank1989 rank2014  
##    <chr>  <chr>    <chr>     
##  1 济南市 特大城市 I型大城市 
##  2 泰州市 小城市   II型大城市
##  3 松江区 特大城市 超大城市  
##  4 宜昌市 中等城市 II型大城市
##  5 来宾市 <NA>     中等城市  
##  6 怀化市 小城市   I型小城市 
##  7 莆田市 小城市   II型大城市
##  8 宜宾市 中等城市 II型大城市
##  9 定安县 <NA>     <NA>      
## 10 襄阳市 中等城市 II型大城市
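For readers who hold their own population figures, the new (2014) thresholds can be applied directly with `cut()`. This is a minimal sketch rather than the package's method: `rank_new` is a hypothetical helper, the population values are invented, and treating each lower bound as inclusive is an assumption because the table does not say how boundary values are assigned.

```r
# Classify populations with the new (2014) seven-tier thresholds.
# Assumption: intervals are closed on the left, e.g. exactly 1 million
# falls into the 1 million ~ 3 million tier.
rank_new <- function(population) {
  tiers <- cut(
    population,
    breaks = c(-Inf, 2e5, 5e5, 1e6, 3e6, 5e6, 1e7, Inf),
    labels = c(
      "II型小城市", "I型小城市", "中等城市",
      "II型大城市", "I型大城市", "特大城市", "超大城市"
    ),
    right = FALSE
  )
  as.character(tiers)
}

# Invented population figures for illustration
rank_new(c(150000, 2500000, 12000000))
```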

Pinyin

Pinyin provides a phonetic romanization of Chinese characters, and some datasets store region names in pinyin. While regioncode defaults to outputting names in Chinese characters, users can obtain pinyin output by setting the to_pinyin argument to TRUE. This functionality is integrated from the pinyin package developed by Peng Zhao and Qu Cheng. The function also produces the correct romanization for regions with unique spellings, such as Shanxi versus Shaanxi, Inner Mongolia, and the special administrative regions. This pinyin conversion works for official names, incomplete names, and administrative area outputs. The following example shows how the function works for different requests:

tibble(
  city = corruption$prefecture,
  cityPY = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "name",
    to_pinyin = TRUE
  ),
  areaPY = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "area",
    to_pinyin = TRUE
  )
)
## # A tibble: 10 × 3
##    city   cityPY     areaPY   
##    <chr>  <chr>      <chr>    
##  1 济南市 ji_nan     hua_dong 
##  2 泰州市 tai_zhou   hua_dong 
##  3 松江区 song_jiang hua_dong 
##  4 宜昌市 yi_chang   hua_zhong
##  5 来宾市 liu_zhou   hua_nan  
##  6 怀化市 huai_hua   hua_zhong
##  7 莆田市 pu_tian    hua_dong 
##  8 宜宾市 yi_bin     xi_nan   
##  9 定安县 ding_an    hua_nan  
## 10 襄阳市 xiang_fan  hua_zhong
# Regions with special spelling
regioncode(
  data_input = c("山西", "陕西", "内蒙古", "香港", "澳门"),
  year_from = 2019,
  year_to = 2008,
  convert_to = "name",
  incomplete_name = TRUE,
  province = TRUE,
  to_pinyin = TRUE
)
##             <NA>             陕西             内蒙             香港 
##        "shan_xi"       "shaan_xi" "inner_mongolia"      "hong_kong" 
##             澳门 
##          "macao"
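The special spellings can be pictured as an exception table that is consulted before the default syllable-by-syllable romanization. The sketch below is a base R illustration, not the internals of regioncode or the pinyin package: `romanize` is a hypothetical helper, the exception values are copied from the output above, and the default romanizations are supplied by hand.

```r
# Exceptions copied from the output above
special_spelling <- c(
  "陕西" = "shaan_xi",
  "内蒙古" = "inner_mongolia",
  "香港" = "hong_kong",
  "澳门" = "macao"
)

# Use the exception when one exists, otherwise keep the default spelling
romanize <- function(region, default) {
  is_special <- region %in% names(special_spelling)
  out <- default
  out[is_special] <- unname(special_spelling[region[is_special]])
  out
}

romanize(
  region = c("山西", "陕西", "济南"),
  default = c("shan_xi", "shan_xi", "ji_nan")
)
```

Without the exception table, 山西 and 陕西 would share one spelling, which is why the distinction matters when pinyin output is used as a merge key.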

Provinces

The regioncode function also converts data at the provincial level. By setting the province argument to TRUE, users can convert provincial geocodes and names. Since Chinese provinces have standard abbreviations, users can convert these abbreviations to other data types by setting the convert_to argument to "abbreTocode", "abbreToname", or "abbreToarea". To convert names or codes to abbreviations, a user can set convert_to = "abbre". The following example demonstrates converting a vector of provincial geocodes into their official names and abbreviations:

tibble(
  province = corruption$province_id,
  prov_name = regioncode(
    data_input = corruption$province_id,
    convert_to = "name",
    year_from = 2019,
    year_to = 1989,
    province = TRUE
  ),
  prov_abbre = regioncode(
    data_input = corruption$province_id,
    convert_to = "abbre",
    year_from = 2019,
    year_to = 1989,
    province = TRUE
  )
)
## # A tibble: 10 × 3
##    province prov_name      prov_abbre
##       <dbl> <chr>          <chr>     
##  1   370000 山东省         鲁        
##  2   320000 江苏省         苏        
##  3   310000 上海市         沪        
##  4   420000 湖北省         鄂        
##  5   450000 广西壮族自治区 桂        
##  6   430000 湖南省         湘        
##  7   350000 福建省         闽        
##  8   510000 四川省         川        
##  9   460000 海南省         琼        
## 10   420000 湖北省         鄂
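When a dataset contains only prefectural codes, the provincial codes needed for these conversions can be derived arithmetically. In the sample data each provincial code equals the first two digits of the prefectural code followed by four zeros (for example, 370100 and 370000). The sketch below assumes six-digit numeric codes of that form.

```r
# Prefectural codes taken from the sample data shown earlier
prefecture_id <- c(370100, 321200, 310117, 420500)

# Keep the first two digits and pad with four zeros
province_id <- prefecture_id %/% 10000 * 10000
province_id
```

The result can then be passed to regioncode with province = TRUE.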

Geographic Units Beyond Provinces

regioncode can also map prefectures and provinces to two kinds of larger geographic units: administrative areas and linguistic zones.

Administrative Area

For social, political, and military purposes, China divides its territory into seven major administrative areas (孙平 2020):

Region                  Provincial-level Administrative Units
华北 (North China)      北京市, 天津市, 山西省, 河北省, 内蒙古自治区
东北 (Northeast China)  黑龙江省, 吉林省, 辽宁省
华东 (East China)       上海市, 江苏省, 浙江省, 安徽省, 福建省, 台湾省, 江西省, 山东省
华中 (Central China)    河南省, 湖北省, 湖南省
华南 (South China)      广东省, 海南省, 广西壮族自治区, 香港特别行政区, 澳门特别行政区
西南 (Southwest China)  重庆市, 四川省, 贵州省, 云南省, 西藏自治区
西北 (Northwest China)  陕西省, 甘肃省, 青海省, 宁夏回族自治区, 新疆维吾尔自治区

Users who need to identify the administrative area for a given prefecture or province can do so with regioncode. Setting the convert_to argument to "area" converts regional codes and names (for both prefectures and provinces) into their corresponding administrative areas:

regioncode(
  data_input = corruption$prefecture,
  year_from = 2019,
  year_to = 1989,
  convert_to = "area"
)
##  [1] "华东" "华东" "华东" "华中" "华南" "华中" "华东" "西南" "华南" "华中"
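The area table above can also be written as a plain lookup, which is handy for checking results or for data that already carries provincial names. This is a base R sketch built from that table, not the package's internal data; `areas` and `area_of` are illustrative names.

```r
# The seven administrative areas, as listed in the table above
areas <- list(
  "华北" = c("北京市", "天津市", "山西省", "河北省", "内蒙古自治区"),
  "东北" = c("黑龙江省", "吉林省", "辽宁省"),
  "华东" = c("上海市", "江苏省", "浙江省", "安徽省", "福建省", "台湾省", "江西省", "山东省"),
  "华中" = c("河南省", "湖北省", "湖南省"),
  "华南" = c("广东省", "海南省", "广西壮族自治区", "香港特别行政区", "澳门特别行政区"),
  "西南" = c("重庆市", "四川省", "贵州省", "云南省", "西藏自治区"),
  "西北" = c("陕西省", "甘肃省", "青海省", "宁夏回族自治区", "新疆维吾尔自治区")
)

# Invert the list into a named vector: province name -> area
area_of <- setNames(
  rep(names(areas), lengths(areas)),
  unlist(areas, use.names = FALSE)
)

unname(area_of[c("山东省", "湖北省", "海南省")])
```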

Linguistic Zone

China has numerous dialects, and their geographic distribution complicates economic, political, and sociological analyses: a single dialect may span several prefectures or even multiple provinces. For political and sociolinguistic researchers, regioncode can identify the approximate linguistic zone for a given geocode or prefectural name. Following the 1987 Language Atlas of China, the package provides two levels of linguistic zone identification through the to_dialect argument: dialect groups (dia_group, “方言大类”) and dialect sub-groups (dia_sub_group, “分区片”). Note that when province = TRUE, conversion is possible only to the dialect group level.

The following example converts the sample data to both dialect groups and sub-groups. Because China’s linguistic distribution is too complex and dynamic to measure precisely at the prefectural level, the linguistic zone output from regioncode should be used for reference only, not for rigorous linguistic research.

tibble(
  city = corruption$prefecture,
  dialectGroup = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    to_dialect = "dia_group"
  ),
  dialectSubGroup = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    to_dialect = "dia_sub_group"
  )
)
## # A tibble: 10 × 3
##    city   dialectGroup dialectSubGroup                             
##    <chr>  <chr>        <chr>                                       
##  1 济南市 冀鲁官话     沧惠片-1,石济片-8                           
##  2 泰州市 江淮官话     泰如片-1                                    
##  3 松江区 吴语         太湖片-1                                    
##  4 宜昌市 西南官话     成渝片-3,成渝片-9                           
##  5 来宾市 西南官话     桂柳片-10                                   
##  6 怀化市 湘语         岑江片-2,吉溆片-3,娄邵片-1,黔北片-3,长益片-3
##  7 莆田市 莆仙区       莆仙区-4                                    
##  8 宜宾市 西南官话     灌赤片-10                                   
##  9 定安县 琼文区       府城片-1                                    
## 10 襄阳市 西南官话     鄂北片-10
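The sub-group output packs several entries into one comma-separated string per city. For tabulation it may be easier to split these into one row per entry. The base R sketch below uses two strings copied from the output above; it only separates the label from the trailing number and makes no assumption about what that number measures, since the text does not define it.

```r
# Two sub-group strings copied from the output above
sub_groups <- c(
  "济南市" = "沧惠片-1,石济片-8",
  "泰州市" = "泰如片-1"
)

# Split each string at the commas; the list keeps the city names
parts <- strsplit(sub_groups, ",", fixed = TRUE)

# One row per city and entry
long <- data.frame(
  city = rep(names(parts), lengths(parts)),
  entry = unlist(parts, use.names = FALSE)
)

# Separate the sub-group label from the trailing number
long$sub_group <- sub("-[0-9]+$", "", long$entry)
long$number <- as.integer(sub("^.*-", "", long$entry))
long
```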

Conclusion

regioncode provides a convenient tool for converting Chinese administrative division codes and official names and for performing other specialized conversions. We are actively developing the package and plan to add more administrative levels and richer data in future versions. We welcome collaboration, and users can direct any questions, comments, or bug reports to our GitHub Issues page.

We extend our appreciation to LI Ruizhe, ZHU Meng, SHI Yuyang, XU Yujia, PAN Yuxin, TIAN Haiting, SHAO Weihang, CHEN Yuanqian, and LIU Xueyan for their contributions to data collection and code development for this package.

References

Wang, Yuhua. 2020. “China’s Corruption Investigations Dataset.” Harvard Dataverse. https://doi.org/10.7910/DVN/9QZRAD.
国家统计局. 2022a. “关于更新全国统计用区划代码和城乡划分代码的公告.” 中华人民共和国国家统计局.
———, ed. 2022b. 中国统计年鉴2022(附光盘) 中国城乡建设统计年鉴2021(2022年新书). 中国统计年鉴. 中国统计出版社. https://item.jd.com/10038568378953.html.
孙平. 2020. “把握新时代行政区划优化设置的着力点 - 中华人民共和国民政部.” 中国社会报, December 14, 2020.
民政部. 2022. “2021年中华人民共和国行政区划代码.” 中华人民共和国民政部.

Affiliation

HU Yue

Department of Political Science,
Tsinghua University,
Email:
Website: https://www.drhuyue.site

YE Xinyi

Department of Political Science,
Tsinghua University,
Email:


  1. Users may notice that regioncode sometimes outputs provincially administered counties (“直辖县”). We include some of these units to minimize missing data, although they remain county-level administrative units. Current resources do not permit coverage of all such counties, an issue we plan to address in the future (GitHub issue #54). Users are welcome to contribute a pull request or contact us to help resolve this issue.