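Elasticsearch (GEO) Data Ingestion and Spatial Search

Introduction to Elasticsearch

What is Elasticsearch?

Elasticsearch is an open-source, distributed, RESTful search and analytics engine that serves a growing range of use cases.

About this article

This article covers writing GEO data into Elasticsearch (ES) and running spatial queries against it. The ES version used is 7.3.1.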

Data preparation

Use the fishnet (grid) tool in QGIS to cut the area of interest into cells, and export the resulting grid as GeoJSON.
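For readers without QGIS at hand, a grid of the same general shape can be generated in plain Python. The sketch below is only an illustration: the function name `make_fishnet`, the extent, and the cell size are assumptions, not the values used for the article's data. It emits polygon features with an `id` property, which is the structure the ingestion code in this article reads.

```python
import json


def make_fishnet(min_lon, min_lat, max_lon, max_lat, cell_size):
    """Return a GeoJSON FeatureCollection of square cells covering the extent."""
    n_cols = int(round((max_lon - min_lon) / cell_size))
    n_rows = int(round((max_lat - min_lat) / cell_size))
    features = []
    fid = 0
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = min_lon + col * cell_size
            y0 = min_lat + row * cell_size
            x1 = x0 + cell_size
            y1 = y0 + cell_size
            # Closed exterior ring, counter-clockwise, first point repeated at the end.
            ring = [[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]
            features.append({
                "type": "Feature",
                "properties": {"id": fid},
                "geometry": {"type": "Polygon", "coordinates": [ring]},
            })
            fid += 1
    return {"type": "FeatureCollection", "features": features}


# Hypothetical extent and cell size (degrees), chosen only for the example.
grid = make_fishnet(109.0, 18.0, 109.5, 18.5, 0.1)
print(len(grid["features"]))
print(json.dumps(grid["features"][0])[:80])
```

Writing the returned dict with `json.dump` gives a file that can be loaded the same way as a GeoJSON export from QGIS.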

Create the index and set the mapping

def set_mapping(es, index_name="content_engine", doc_type_name="en", my_mapping=None):
    # Default mapping: a geo_shape geometry plus a numeric id.
    if my_mapping is None:
        my_mapping = {
            "properties": {
                "location": {"type": "geo_shape"},
                "id": {"type": "long"}
            }
        }
    # Drop any existing index first; ignore 400/404 so a missing index is not an error.
    es.indices.delete(index=index_name, ignore=[400, 404])
    print("delete_index")
    create_index = es.indices.create(index=index_name)
    # A typed mapping is passed here, so include_type_name=True is required in ES 7.x.
    mapping_index = es.indices.put_mapping(index=index_name, doc_type=doc_type_name,
                                           body=my_mapping, include_type_name=True)
    print("create_index")
    if not create_index.get("acknowledged") or not mapping_index.get("acknowledged"):
        print("Index creation failed...")
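The point queries later in this article (`geo_distance`, `geo_bounding_box`, and the geo aggregations) run against a separate `gis_point` index. In ES 7.3 those operate on `geo_point` fields, whereas the mapping above declares `geo_shape`. The article does not show the mapping used for `gis_point`; the sketch below assumes it differs only in the field type. `build_geo_mapping` is a hypothetical helper, not part of the Elasticsearch client.

```python
def build_geo_mapping(geo_type="geo_shape"):
    """Build a mapping body whose "location" field has the given geo type."""
    if geo_type not in ("geo_point", "geo_shape"):
        raise ValueError("geo_type must be 'geo_point' or 'geo_shape'")
    return {
        "properties": {
            "location": {"type": geo_type},
            "id": {"type": "long"},
        }
    }


# Assumed mapping for the point index and the mapping shown above for polygons.
point_mapping = build_geo_mapping("geo_point")
shape_mapping = build_geo_mapping()
print(point_mapping["properties"]["location"])
print(shape_mapping["properties"]["location"])
```

The resulting dicts have the same shape as the `my_mapping` body used above.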

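Data ingestion

Data is written with multiprocessing and elasticsearch.helpers.bulk. The features are split into batches of roughly ten thousand documents (9,500 in the code below), any remainder forms a final batch, and the batches are then written in parallel by a pool of worker processes. The point data and the polygon data, 4,731,254 records each, were written this way. Multiple cores, an SSD, and a suitable batch size all speed up ingestion noticeably: with these measures, more than four million points or polygons can be written in about three minutes.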

import json
import multiprocessing

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

ip = "127.0.0.1:9200"  # ES host; same address as in the query scripts below
BATCH_SIZE = 9500      # documents per bulk request

def mp_worker(actions):
    # Each worker process opens its own connection.
    es = Elasticsearch(hosts=[ip], timeout=5000)
    # Every action already carries its own "_index", so none is passed here.
    success, _ = bulk(es, actions, raise_on_error=True)
    return success

def mp_handler(input_file, index_name, doc_type_name="en"):
    with open(input_file, 'rb') as f:
        features = json.load(f)["features"]
    batches = []
    actions = []
    for feature in features:
        actions.append({
            "_index": index_name,
            "_type": doc_type_name,
            "_source": {
                "id": feature["properties"]["id"],
                "location": {
                    "type": "polygon",
                    "coordinates": feature["geometry"]["coordinates"]
                }
            }
        })
        if len(actions) == BATCH_SIZE:
            batches.append(actions)
            actions = []
    if actions:  # the remainder forms the last batch
        batches.append(actions)
    count = sum(len(batch) for batch in batches)
    del features
    print('read all %s data ' % count)
    # Write the batches in parallel with four worker processes.
    with multiprocessing.Pool(4) as p:
        written = sum(p.imap(mp_worker, batches))
    print('write all %s data ' % written)
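The batching step can also be expressed as a small generator, which keeps the grouping logic separate from the code that builds the actions. This is a minimal sketch rather than the article's code; `chunked` is a hypothetical helper, and the batch size of 9,500 simply mirrors the loop above.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield consecutive lists of at most `size` items; the last may be shorter."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Stand-in for a list of bulk actions: 20,001 dummy items.
batches = list(chunked(range(20001), 9500))
print([len(b) for b in batches])
```

Because `Pool.imap` accepts any iterable, a generator like this could be handed to it in place of the pre-built list.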

GEO (point): querying points within n km and selecting by bounding box

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import time

starttime = time.time()
_index = "gis_point"
_doc_type = "20190824"
ip = "127.0.0.1:9200"

# Select points within n km of a location.
# geo_distance requires "location" to be mapped as geo_point in this index.
_body = {
    "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "geo_distance": {
                    "distance": "9km",
                    "location": {
                        "lat": 18.1098857850465471,
                        "lon": 109.1271036098896730
                    }
                }
            }
        }
    }
}

# Select points inside a bounding box.
# Note: top_left lon is greater than bottom_right lon here; check the corner order against your data.
# _body={
#   "query": {
#     "geo_bounding_box": {
#       "location": {
#         "top_left": {
#           "lat": 18.4748659238899933,
#           "lon": 109.0007435371629470
#         },
#         "bottom_right": {
#           "lat": 18.1098857850465471,
#           "lon": 105.1271036098896730
#         }
#       }
#     }
#   }
# }

es = Elasticsearch(hosts=[ip], timeout=5000)
# scan() scrolls through every hit instead of returning only the first page.
scanResp = scan(es, query=_body, scroll="10m", index=_index, timeout="10m")
for resp in scanResp:
    print(resp)
endtime = time.time()
print(endtime - starttime)
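To sanity-check the hits returned by a `geo_distance` filter, the great-circle distance can be recomputed on the client. The sketch below uses the haversine formula with a mean Earth radius of 6,371,008.8 m; that radius, the function name, and the candidate point are assumptions for illustration, and Elasticsearch's own distance calculation may differ slightly.

```python
import math

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


center = (18.1098857850465471, 109.1271036098896730)  # query center from the article
candidate = (18.15, 109.15)                           # hypothetical hit
distance = haversine_m(center[0], center[1], candidate[0], candidate[1])
print(distance, distance <= 9000)
```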

GEO (shape): selecting by extent

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import time

starttime = time.time()
_index = "gis"
_doc_type = "20190823"
ip = "127.0.0.1:9200"

# envelope format, [[minlon,maxlat],[maxlon,minlat]]
_body = {
    "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "geo_shape": {
                    "location": {
                        "shape": {
                            "type": "envelope",
                            "coordinates": [[108.987103609889, 18.474865923889993], [109.003537162947, 18.40988578504]]
                        },
                        "relation": "within"
                    }
                }
            }
        }
    }
}

es = Elasticsearch(hosts=[ip], timeout=5000)
scanResp = scan(es, query=_body, scroll="1m", index=_index, timeout="1m")
for resp in scanResp:
    print(resp)
endtime = time.time()
print(endtime - starttime)
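The envelope corner order, upper-left then lower-right, is easy to get wrong when the extent comes from a tool that reports (min x, min y, max x, max y). The hypothetical helper below, which is not an Elasticsearch API, does the reordering and rejects an inverted latitude range; the extent passed to it is the one used in the query above.

```python
def es_envelope(min_lon, min_lat, max_lon, max_lat):
    """Build a geo_shape envelope: [[min_lon, max_lat], [max_lon, min_lat]]."""
    if min_lat > max_lat:
        raise ValueError("min_lat must not exceed max_lat")
    return {
        "type": "envelope",
        "coordinates": [[min_lon, max_lat], [max_lon, min_lat]],
    }


shape = es_envelope(108.987103609889, 18.40988578504,
                    109.003537162947, 18.474865923889993)
print(shape["coordinates"])
```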

GEO (point): distance aggregation

from elasticsearch import Elasticsearch
import time

starttime = time.time()
_index = "gis_point"
_doc_type = "20190824"
ip = "127.0.0.1:9200"

# Distance aggregation: count points per ring around the origin (ranges are in meters by default)
_body = {
    "aggs": {
        "rings_around_amsterdam": {
            "geo_distance": {
                "field": "location",
                "origin": "18.1098857850465471,109.1271036098896730",
                "ranges": [
                    {"to": 100000},
                    {"from": 100000, "to": 300000},
                    {"from": 300000}
                ]
            }
        }
    }
}

es = Elasticsearch(hosts=[ip], timeout=5000)
scanResp = es.search(body=_body, index=_index)
for i in scanResp['aggregations']['rings_around_amsterdam']['buckets']:
    print(i)
endtime = time.time()
print(endtime - starttime)
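The three ranges above split distances, measured in meters by default, into under 100 km, 100 km to 300 km, and beyond 300 km. As a rough client-side illustration of that bucketing, assuming each range includes its `from` value and excludes its `to` value as in Elasticsearch range aggregations, the hypothetical function below counts a list of distances per ring. The sample distances are made up.

```python
def bucket_distances(distances_m, ranges):
    """Count distances per range; "from" is inclusive, "to" is exclusive."""
    counts = [0] * len(ranges)
    for d in distances_m:
        for i, r in enumerate(ranges):
            lower = r.get("from", 0)
            upper = r.get("to", float("inf"))
            if lower <= d < upper:
                counts[i] += 1
    return counts


ranges = [{"to": 100000}, {"from": 100000, "to": 300000}, {"from": 300000}]
sample = [50.0, 99999.9, 100000.0, 250000.0, 300000.0, 1200000.0]  # made-up distances
print(bucket_distances(sample, ranges))
```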

Centroid aggregation

# Continues from the distance-aggregation script above (reuses Elasticsearch, ip, and _index).
_body = {
    "aggs": {
        "centroid": {
            "geo_centroid": {
                "field": "location"
            }
        }
    }
}

es = Elasticsearch(hosts=[ip], timeout=5000)
scanResp = es.search(body=_body, index=_index)
print(scanResp['aggregations'])

Bounds aggregation

# Continues from the distance-aggregation script above (reuses Elasticsearch, ip, and _index).
_body = {
    "aggs": {
        "viewport": {
            "geo_bounds": {
                "field": "location"
            }
        }
    }
}

es = Elasticsearch(hosts=[ip], timeout=5000)
scanResp = es.search(body=_body, index=_index)
print(scanResp['aggregations']['viewport'])

Geohash aggregation

# Continues from the distance-aggregation script above (reuses Elasticsearch, ip, and _index).
# Low-precision aggregation; precision is the geohash length
_body = {
    "aggregations": {
        "large-grid": {
            "geohash_grid": {
                "field": "location",
                "precision": 3
            }
        }
    }
}

# High-precision aggregation: a bounding-box filter combined with a geohash aggregation.
# Note: top_left lon is greater than bottom_right lon here; check the corner order against your data.
# _body = {
#     "aggregations": {
#         "zoomed-in": {
#             "filter": {
#                 "geo_bounding_box": {
#                     "location": {
#                         "top_left": "18.4748659238899933,109.0007435371629470",
#                         "bottom_right": "18.4698857850465471,108.9971036098896730"
#                     }
#                 }
#             },
#             "aggregations": {
#                 "zoom1": {
#                     "geohash_grid": {
#                         "field": "location",
#                         "precision": 7
#                     }
#                 }
#             }
#         }
#     }
# }

es = Elasticsearch(hosts=[ip], timeout=5000)
scanResp = es.search(body=_body, index=_index)
for i in scanResp['aggregations']['large-grid']['buckets']:
    print(i)
# for i in scanResp['aggregations']['zoomed-in']['zoom1']['buckets']:
#     print(i)

Tile (geotile) aggregation

# Continues from the distance-aggregation script above (reuses Elasticsearch, ip, and _index).
# Low-precision tile aggregation; precision is the zoom level
_body = {
    "aggregations": {
        "large-grid": {
            "geotile_grid": {
                "field": "location",
                "precision": 8
            }
        }
    }
}

# High-precision tile aggregation: a bounding-box filter combined with a tile aggregation.
# Note: top_left lon is greater than bottom_right lon here; check the corner order against your data.
# _body={
#     "aggregations" : {
#         "zoomed-in" : {
#             "filter" : {
#                 "geo_bounding_box" : {
#                     "location" : {
#                         "top_left": "18.4748659238899933,109.0007435371629470",
#                         "bottom_right": "18.4698857850465471,108.9991036098896730"
#                     }
#                 }
#             },
#             "aggregations":{
#                 "zoom1":{
#                     "geotile_grid" : {
#                         "field": "location",
#                         "precision": 18
#                     }
#                 }
#             }
#         }
#     }
# }

es = Elasticsearch(hosts=[ip], timeout=5000)
scanResp = es.search(body=_body, index=_index)
for i in scanResp['aggregations']['large-grid']['buckets']:
    print(i)
# for i in scanResp['aggregations']['zoomed-in']['zoom1']['buckets']:
#     print(i)
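`geotile_grid` buckets correspond to zoom/x/y map tiles, so the precision is the zoom level and each extra level splits a tile into four. The sketch below uses the usual Web Mercator tile formula to compute the tile containing a point; `tile_key` is a hypothetical helper, and the `zoom/x/y` key format is an assumption based on how the aggregation labels its buckets.

```python
import math


def tile_key(lat, lon, zoom):
    """Return the "zoom/x/y" key of the Web Mercator tile containing the point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still fall into the last tile.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return "%d/%d/%d" % (zoom, x, y)


# Query origin used earlier in the article, at the two precisions from this section.
print(tile_key(18.1098857850465471, 109.1271036098896730, 8))
print(tile_key(18.1098857850465471, 109.1271036098896730, 18))
```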

Comparing equivalent operations in Elasticsearch and PostGIS

Nearest-point query in PostGIS

SELECT id, geom,
       ST_DistanceSphere(geom, 'SRID=4326;POINT(109.1681036098896730 18.1299957850465471)'::geometry)
FROM h5
ORDER BY geom <-> 'SRID=4326;POINT(109.1681036098896730 18.1299957850465471)'::geometry
LIMIT 1

Nearest-point query in Elasticsearch

from elasticsearch import Elasticsearch
import time

starttime = time.time()
_index = "gis_point"
_doc_type = "20190824"
ip = "127.0.0.1:9200"

# Sort all points by distance to [lon, lat] and keep only the closest one.
_body = {
    "sort": [
        {
            "_geo_distance": {
                "unit": "m",
                "order": "asc",
                "location": [
                    109.1681036098896730,
                    18.1299957850465471
                ],
                "distance_type": "arc",
                "mode": "min",
                "ignore_unmapped": True
            }
        }
    ],
    "from": 0,
    "size": 1,
    "query": {
        "bool": {
            "must": {
                "match_all": {}
            }
        }
    }
}

es = Elasticsearch(hosts=[ip], timeout=5000)
scanResp = es.search(body=_body, index=_index)
endtime = time.time()
print(endtime - starttime)

Range query in PostGIS

SELECT id, geom, fid
FROM public."California"
WHERE ST_Intersects(geom, ST_MakeEnvelope(-117.987103609889, 33.40988578504, -117.003537162947, 33.494865923889993, 4326))

The same extent in Elasticsearch envelope order: [-117.987103609889, 33.494865923889993], [-117.003537162947, 33.40988578504]

Range query in Elasticsearch

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import time

starttime = time.time()
_index = "gis_california"
ip = "127.0.0.1:9200"

# envelope format, [[minlon,maxlat],[maxlon,minlat]]
_body = {
    "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "geo_shape": {
                    "geom": {
                        "shape": {
                            "type": "envelope",
                            "coordinates": [[-117.987103609889, 33.494865923889993], [-117.003537162947, 33.40988578504]]
                        },
                        "relation": "INTERSECTS"
                    }
                }
            }
        }
    }
}

es = Elasticsearch(hosts=[ip], timeout=5000)
scanResp = scan(es, query=_body, scroll="1m", index=_index, timeout="1m")
# Iterate over every hit so the timing covers fetching the full result set.
i = 0
for resp in scanResp:
    i = i + 1
print(i)
endtime = time.time()
print(endtime - starttime)

In both scenarios (the nearest-point query and the range query), PostGIS performed better in these tests.
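The scripts above each take a single wall-clock measurement, and for the scan-based scripts that measurement includes client-side iteration over the results. A more repeatable way to compare the two systems is to run each query several times and look at the median. The sketch below is one way to do that; `time_query` and the stand-in workload are hypothetical, and in practice the callable would wrap the `es.search` call or the SQL execution.

```python
import statistics
import time


def time_query(run, repeats=5):
    """Call `run` `repeats` times; return (median seconds, all samples)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


# Stand-in workload; replace the lambda with a function that executes the query.
median_s, samples = time_query(lambda: sum(range(100000)), repeats=5)
print("median: %.6f s over %d runs" % (median_s, len(samples)))
```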
