Searching for special characters in Elasticsearch

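When searching for special characters in Elasticsearch, the usual approach is to put a backslash ('\') in front of the character. With the standard analyzer, however, the search still does not behave as expected.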
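As a rough illustration of the backslash escaping mentioned above, here is a minimal Python sketch that escapes the characters reserved by the query_string syntax. The helper name escape_query_string is hypothetical and not part of any Elasticsearch client. It assumes the reserved-character list from the query_string documentation, and it drops < and > because those cannot be escaped at all.

```python
# Characters reserved by the query_string syntax.
# "&&" and "||" are covered by escaping each "&" and "|" on its own.
RESERVED = set('+-=&|!(){}[]^"~*?:\\/')


def escape_query_string(text: str) -> str:
    """Hypothetical helper: backslash-escape query_string reserved characters."""
    out = []
    for ch in text:
        if ch in "<>":
            continue  # "<" and ">" cannot be escaped, so they are removed
        if ch in RESERVED:
            out.append("\\")
        out.append(ch)
    return "".join(out)


print(escape_query_string("(홍길동)"))
```

When the escaped string is placed in a JSON request body, each backslash has to be escaped once more for JSON; serializing the body with json.dumps takes care of that.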

Let's confirm this with the example below.

Create a test-special-character index with a name field
Set the analyzer to standard

Create the index

PUT test-special-character
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}

Now index two documents (the values mean "dev team (Hong Gil-dong)" and "dev team #Hong Gil-dong").

POST /test-special-character/_doc/1 
{ 
    "name":"개발팀 (홍길동)" 
}

POST /test-special-character/_doc/2  
{  
    "name":"개발팀 #홍길동"  
}

Searching for "(홍길동)" with a match query, we would expect exactly one hit.

GET test-special-character/_search
{
  "query": {
    "match": {
      "name": "(홍길동)"
    }
  }
}

The result is as follows.

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.36464313,
    "hits" : [
      {
        "_index" : "test-special-character",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.36464313,
        "_source" : {
          "name" : "개발팀 (홍길동)"
        }
      },
      {
        "_index" : "test-special-character",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.36464313,
        "_source" : {
          "name" : "개발팀 #홍길동"
        }
      }
    ]
  }
}

Contrary to expectations, two hits come back.

This happens because the standard analyzer strips special characters when it indexes the text, and the same analysis is applied to the search string.

Running the text above through the _analyze API:

GET test-special-character/_analyze
{
  "analyzer": "standard", 
  "text": ["개발팀 (홍길동)"]
}

gives the following result.

{
  "tokens" : [
    {
      "token" : "개발팀",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "홍길동",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<HANGUL>",
      "position" : 1
    }
  ]
}

"개발팀 #홍길동" produces the same tokens.
With the standard analyzer, special characters are dropped before the text is indexed.

[analyzer] standard
개발팀 (홍길동) ==> 개발팀, 홍길동
개발팀 #홍길동 ==> 개발팀, 홍길동
[search analyzer] standard
(홍길동) ==> 홍길동

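So what happens if we set the analyzer to whitespace? The analyzer of an existing field cannot be changed in place, so delete the index first and then create it again.

DELETE test-special-character

PUT test-special-character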
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}

After indexing the same two documents again, run the search:

GET test-special-character/_search
{
  "query": {
    "match": {
      "name": "(홍길동)"
    }
  }
}

This time we get the desired result (one hit).

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_index" : "test-special-character",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931471,
        "_source" : {
          "name" : "개발팀 (홍길동)"
        }
      }
    ]
  }
}

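This works because the whitespace analyzer splits text only on whitespace characters (spaces, tabs, line breaks), so special characters stay part of each token.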
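To make the difference between the two analyzers concrete, here is a small Python sketch that imitates them outside Elasticsearch. It is only an approximation under stated assumptions: standard_like uses the regex \w+ as a stand-in for the standard analyzer (the real one follows Unicode text segmentation rules), whitespace_like uses str.split, and match_hits imitates a match query with the default OR operator. All three function names are hypothetical.

```python
import re


def standard_like(text):
    # Assumption: runs of word characters, lowercased, approximate the
    # tokens the standard analyzer would emit for this simple input.
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def whitespace_like(text):
    # The whitespace analyzer splits on whitespace only and keeps punctuation.
    return text.split()


def match_hits(docs, query, analyze):
    # A match query (default OR operator) hits a document when at least one
    # analyzed query token equals one of the document's indexed tokens.
    query_tokens = set(analyze(query))
    return [doc_id for doc_id, text in docs.items()
            if query_tokens & set(analyze(text))]


DOCS = {"1": "개발팀 (홍길동)", "2": "개발팀 #홍길동"}

print(match_hits(DOCS, "(홍길동)", standard_like))    # ['1', '2']
print(match_hits(DOCS, "(홍길동)", whitespace_like))  # ['1']
```

With the standard-like tokenizer the query collapses to the single token 홍길동, which both documents contain; with whitespace splitting the query token stays (홍길동), which only document 1 contains.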

Then how do we find every term that contains "홍길동" in this setup?
Use a wildcard query.

GET test-special-character/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*홍길동*"
      }
    }
  }
}

Wildcard queries, especially patterns that start with *, can cause performance problems, but unless the data set is very large this should not matter much.
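The following Python sketch illustrates why the pattern needs the surrounding * characters. It assumes that a wildcard query is compared against each whole indexed term, that only * and ? act as wildcards, and that the terms are the whitespace-split tokens from the example above. The function names wildcard_to_regex and wildcard_hits are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    # Only "*" (any sequence of characters) and "?" (any single character)
    # are wildcards; every other character is matched literally.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_hits(docs, pattern):
    # A document hits when the pattern matches one of its terms in full.
    rx = wildcard_to_regex(pattern)
    return [doc_id for doc_id, text in docs.items()
            if any(rx.fullmatch(term) for term in text.split())]


DOCS = {"1": "개발팀 (홍길동)", "2": "개발팀 #홍길동"}

print(wildcard_hits(DOCS, "*홍길동*"))  # ['1', '2']
print(wildcard_hits(DOCS, "홍길동"))    # []
```

Without the wildcards, the pattern 홍길동 matches neither (홍길동) nor #홍길동, because the whole term has to match.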
