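A few tests on chat room-message data structures in Elasticsearch

These are notes from working out how to model chat room data in Elasticsearch. To find a suitable data structure, the test code was actually written with Spring.

Overview

A chat room typically consists of many messages. As these messages keep accumulating and reach the order of 100 million to 1 billion, problems start to appear. There are many techniques for dealing with such problems, but this post covers only the data structure.

Elasticsearch does not support joins across indices. This is a performance decision: keeping denormalized (duplicated) data and searching it directly is considered faster than joining at query time. With that characteristic in mind, I reviewed the requirements below.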

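Requirements

Message inserts must be fast.
Message searches must be fast.
It must be possible to search for the list of chat rooms that contain a particular message. (Not a list of messages.)

If the requirement were plain message search, none of the structures below would be needed: store one document per message and search those. Here, however, the search has to run against all the messages inside a chat room and return the room itself.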
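The difference between the two kinds of search can be sketched in memory, without Elasticsearch: a message search returns every matching message, while the third requirement wants each matching room exactly once. The class and method names below are hypothetical and only illustrate the requirement.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// In-memory illustration only; no Elasticsearch involved.
final class RoomSearchSketch {

    static final class Msg {
        final String roomId;
        final String text;

        Msg(String roomId, String text) {
            this.roomId = roomId;
            this.text = text;
        }
    }

    // Return each room that has at least one matching message, once.
    static Set<String> roomsContaining(List<Msg> messages, String keyword) {
        Set<String> roomIds = new LinkedHashSet<>();
        for (Msg m : messages) {
            if (m.text.contains(keyword)) {
                roomIds.add(m.roomId);
            }
        }
        return roomIds;
    }

    public static void main(String[] args) {
        List<Msg> messages = Arrays.asList(
            new Msg("room-1", "see you at school"),
            new Msg("room-1", "school starts at nine"),
            new Msg("room-2", "lunch?"));
        // Two messages match, but the room is reported once
        System.out.println(roomsContaining(messages, "school")); // [room-1]
    }
}
```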

Implementation options

The following data structures are worth considering in Elasticsearch.

parent-child (room and message are stored in one index and linked with a join field)
1. room: 1 document; message: 1 document each
message as a nested object (one index in which each room holds its messages as nested objects)
1. room: 1 document
2. the room's messages are stored inside it as nested objects
message as an object (one index in which each room holds its messages as plain objects)
1. room: 1 document
2. the room's messages are stored inside it as objects

Expressed as data structures, the options above look like this.

1. parent-child

// Both kinds of document are stored in a single index; only the structure differs

// Room document
{
  "roomId": "room-1",
  "name": "Chat room 1"
}
// Message document
{
  "roomId": "room-1",
  "messageId": "message-1",
  "message": "Message 1",
  "joinField": {
    "name": "message",
    "parent": "room-1"
  }
}

A room with one message results in 2 documents.

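2. nested object type

{
  "roomId": "room-1",
  "name": "Chat room 1",
  "messages": [
    {
      "messageId": "message-1",
      "message": "Message 1"
    }
  ]
}

A room with one message results in 1 document at the API level. (Internally, each nested object is indexed as a separate hidden document, which matters for insert speed later on.)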

3. object type

{
  "roomId": "room-1",
  "name": "Chat room 1",
  "messages": [
    {
      "messageId": "message-1",
      "message": "Message 1"
    }
  ]
}

A room with one message results in 1 document, the same as option 2.
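To make the document counts concrete, here is a small sketch of how many documents each layout produces for a given number of rooms and messages per room. The class and method names are hypothetical. The nested figure counts the hidden Lucene document that Elasticsearch indexes for each nested object, which never shows up as a separate search hit.

```java
// Hypothetical helper: how many documents each layout produces.
final class DocCountSketch {

    // parent-child: one document per room plus one per message
    static long parentChild(long rooms, long messagesPerRoom) {
        return rooms + rooms * messagesPerRoom;
    }

    // nested: one visible document per room; internally each nested
    // object is also indexed as its own hidden Lucene document
    static long nestedLucene(long rooms, long messagesPerRoom) {
        return rooms + rooms * messagesPerRoom;
    }

    // object: messages are flattened into the room document itself
    static long objectType(long rooms, long messagesPerRoom) {
        return rooms;
    }

    public static void main(String[] args) {
        System.out.println(parentChild(1, 1));  // 2
        System.out.println(nestedLucene(1, 1)); // 2
        System.out.println(objectType(1, 1));   // 1
    }
}
```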

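Corpus data

Test data has to be generated, but filling the index with meaningless data could skew the results, so the data should at least resemble real messages. In addition, if the same text is inserted over and over, Elasticsearch's inverted index gains no new terms; only the counts for the same few terms grow, so the resulting index does not look like a realistic one.

So I downloaded a corpus made up of real user-written messages and used it as input.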
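The point about repeated data can be seen with a toy inverted index. This is only a sketch: whitespace splitting stands in for a real analyzer, and the class name is hypothetical. Repeating one message leaves the term dictionary tiny, while varied messages grow it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> ids of the documents containing it.
final class InvertedIndexSketch {

    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Whitespace tokenization stands in for a real analyzer
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> repeated = Arrays.asList("hello world", "hello world", "hello world");
        List<String> varied = Arrays.asList("hello world", "see you at school", "class starts soon");
        System.out.println(build(repeated).size()); // 2 distinct terms
        System.out.println(build(varied).size());   // 9 distinct terms
    }
}
```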

Korean-language corpora can be downloaded from the links below. (100 million entries were prepared.)

https://www.kaggle.com/datasets?search=korean
https://github.com/e9t/nsmc

Environment

Elasticsearch version: 7.17.5 (docker)
Java version: 1.8
Spring Boot version: 2.7.4
Elasticsearch shard count: 1
Elasticsearch location: remote
Message analyzer: standard analyzer

Data model setup

1. parent-child

First, let's create documents using the parent-child structure that Elasticsearch supports.

Elasticsearch does not support joins between indices. In other words, there is no equivalent of an equi-join between two RDB tables.
Instead, it supports a parent-child relationship within a single index, so that is what we build on. (Roughly comparable to a self-join in an RDB.)

RoomParentChildDoc structure

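import lombok.Getter;
import lombok.NoArgsConstructor;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;
import org.springframework.data.elasticsearch.annotations.JoinTypeRelation;
import org.springframework.data.elasticsearch.annotations.JoinTypeRelations;
import org.springframework.data.elasticsearch.annotations.Routing;
import org.springframework.data.elasticsearch.core.join.JoinField;

@Document(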
    indexName = "room-parent-child"
)
@NoArgsConstructor
@Getter
@Routing("roomId")
public class RoomParentChildDoc {

    @Id
    @Field(type = FieldType.Keyword)
    private String id;

    @Field(type = FieldType.Keyword)
    private String roomId;

    @Field(type = FieldType.Text)
    private String name;

    @Field(type = FieldType.Long)
    private Long createdAt;

    @Field(type = FieldType.Long)
    private Long messageId;

    @Field(type = FieldType.Text)
    private String message;

    @JoinTypeRelations(
        relations = {
            @JoinTypeRelation(parent = "room", children = "message")
        }
    )
    private JoinField<String> joinField;

    public RoomParentChildDoc(Room room) { // constructor for a room (parent) document
        this.id = room.getRoomId();
        this.roomId = room.getRoomId();
        this.name = room.getName();
        this.createdAt = room.getCreatedAt();
        this.joinField = new JoinField<>("room");
    }

    public RoomParentChildDoc(Message message) { // constructor for a message (child) document
        this.id = message.getMessageId();
        this.roomId = message.getRoomId();
        this.messageId = message.getMessageId();
        this.message = message.getMessage();
        this.joinField = new JoinField<>("message", message.getRoomId()); // use roomId as the parent id
    }
}

Document structure

Defining the document class as above produces the following index mapping.

{
  "room-parent-child" : {
    "mappings" : {
      "properties" : {
        "_class" : {
          "type" : "keyword",
          "index" : false,
          "doc_values" : false
        },
        "createdAt" : {
          "type" : "long"
        },
        "id" : {
          "type" : "keyword"
        },
        "joinField" : {
          "type" : "join",
          "eager_global_ordinals" : true,
          "relations" : {
            "room" : "message"
          }
        },
        "message" : {
          "type" : "text"
        },
        "messageId" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text"
        },
        "roomId" : {
          "type" : "keyword"
        }
      }
    }
  }
}

In the mapping above, there is a joinField linking room and message, and a room : message relation has been created.

Also, the room fields (roomId, name, createdAt) and the message fields (messageId, message) are all defined in the same mapping.

Creating a chat room and adding one message produces 2 documents.

Inserting test data

Let's insert data for the test.

Chat rooms: 1,000,000
Messages per chat room: 100
Total messages: 100,000,000 (100 million)

Inserting messages one at a time takes too long, so they need to be inserted in bulk.
For that, the documents are created with the saveAll method of ElasticsearchRepository.

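import java.util.List;
import java.util.stream.Collectors;

import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class RoomService {

    private final RoomRepository roomRepository;

    // Index room (parent) documents in bulk
    public void register(List<Room> rooms) {
        List<RoomParentChildDoc> roomDocs = rooms.stream().map(RoomParentChildDoc::new).collect(Collectors.toList());
        roomRepository.saveAll(roomDocs);
    }

    // Index message (child) documents in bulk
    public void registerMessage(List<Message> messages) {
        List<RoomParentChildDoc> roomDocs = messages.stream().map(RoomParentChildDoc::new).collect(Collectors.toList());
        roomRepository.saveAll(roomDocs);
    }
}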
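With 100 million messages, handing everything to a single saveAll call is not practical, so the input has to be cut into batches first. A minimal sketch of such chunking follows; the class name and the idea of a fixed batch size are assumptions rather than part of the original test code.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSketch {

    // Split a list into consecutive chunks of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the view so each batch is independent of the source list
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }
}
```

Each chunk would then be passed to registerMessage (and so to saveAll) in turn. The right batch size depends on document size and cluster capacity and is something to measure rather than assume.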

Actual data

After running it, you can confirm that data like the following has been indexed.

{
  "took" : 41,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "room-parent-child",
        "_type" : "_doc",
        "_id" : "35b9416b-8ce4-4433-9b39-31859ae67c2d",
        "_score" : 14.764528,
        "_routing" : "35b9416b-8ce4-4433-9b39-31859ae67c2d",
        "_source" : {
          "_class" : "com.example.elastic.parentchild.doc.RoomParentChildDoc",
          "id" : "35b9416b-8ce4-4433-9b39-31859ae67c2d",
          "roomId" : "35b9416b-8ce4-4433-9b39-31859ae67c2d",
          "name" : "room-35b9416b-8ce4-4433-9b39-31859ae67c2d",
          "createdAt" : 1674714286611,
          "joinField" : {
            "name" : "room"
          }
        }
      },
      {
        "_index" : "room-parent-child",
        "_type" : "_doc",
        "_id" : "db734524-26bb-4453-a4dd-4bcface48f74",
        "_score" : 1.0,
        "_routing" : "35b9416b-8ce4-4433-9b39-31859ae67c2d",
        "_source" : {
          "_class" : "com.example.elastic.parentchild.doc.RoomParentChildDoc",
          "id" : "db734524-26bb-4453-a4dd-4bcface48f74",
          "roomId" : "35b9416b-8ce4-4433-9b39-31859ae67c2d",
          "messageId" : 1,
          "message" : "졸업 예정서도 졸업 전에 받는다",
          "joinField" : {
            "name" : "message",
            "parent" : "35b9416b-8ce4-4433-9b39-31859ae67c2d"
          }
        }
      }
      ...
    ]
  }
}
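In the hits above, the message document's _routing is the id of its parent room. Elasticsearch requires a parent and its children to be on the same shard, so child documents are indexed with the parent id as the routing value; here the @Routing("roomId") annotation supplies it. The sketch below illustrates the idea only: String.hashCode is a stand-in for Elasticsearch's real routing hash, and the class name and shard count are hypothetical.

```java
final class RoutingSketch {

    // Simplified shard selection: hash(routing) mod number of primary shards.
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        String roomId = "35b9416b-8ce4-4433-9b39-31859ae67c2d";
        int shards = 5; // hypothetical; the test environment in this post uses 1 shard

        // Parent and child both route by roomId, so they land on the same shard
        int parentShard = shardFor(roomId, shards);
        int childShard = shardFor(roomId, shards);
        System.out.println(parentShard == childShard); // true
    }
}
```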

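Message insert speed

Inserting 100 chat rooms into Elasticsearch was timed as follows. (chat rooms: 100, messages per room: 100, total messages: 10,000)

elapsed : 1553(ms)

Creating 100 chat rooms and 10,000 messages takes about 1.5 seconds.

Message search speed

Let's look up the chat rooms that contain a particular message. Rooms and messages are in a parent-child relationship, and Elasticsearch provides a query for exactly this: has_child, which works much like EXISTS in SQL. With has_child you can find the parent documents whose children match a given condition.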

GET room-parent-child/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "has_child": {
            "query": {
              "match": {
                "message": {
                  "query": "학교"
                }
              }
            },
            "type": "message",
            "score_mode": "avg"
          }
        }
      ]
    }
  }
}

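The query above returns in under 1 second. That is only because there is not much data yet.

With has_child, Elasticsearch first finds the child documents (messages) that match and then resolves each of them to its parent chat room, so the amount of work grows with the number of matching children.
CPU usage during the search rises as well, so this approach is inherently weak on large data sets.

Narrowing the scope with a filter condition inside has_child can improve performance, but for large data sets it is still better to avoid has_child where possible.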
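As one way to narrow the scope, the inner query of has_child can be wrapped in a bool query with a filter clause. The sketch below builds such a request body as a string. Restricting the search to a known set of room ids (for example, the rooms a user has joined) is an assumption made for illustration, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class HasChildFilterSketch {

    // Minimal JSON string escaping (quotes and backslashes only)
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // has_child whose inner query is narrowed by a terms filter on roomId
    static String build(String keyword, List<String> roomIds) {
        String ids = roomIds.stream()
            .map(HasChildFilterSketch::quote)
            .collect(Collectors.joining(","));
        return "{\"query\":{\"has_child\":{\"type\":\"message\",\"query\":{\"bool\":{"
            + "\"must\":[{\"match\":{\"message\":" + quote(keyword) + "}}],"
            + "\"filter\":[{\"terms\":{\"roomId\":[" + ids + "]}}]"
            + "}}}}}";
    }

    public static void main(String[] args) {
        // Request body for GET room-parent-child/_search
        System.out.println(build("학교", Arrays.asList("room-1", "room-2")));
    }
}
```

The filter clause does not contribute to scoring and can be cached, which is why it is the usual place for conditions that only restrict the candidate set.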

2. nested object

Second, let's create documents with the messages stored as nested objects.

Elasticsearch provides the nested type to make up for a shortcoming of the default object type.

RoomNestedDoc structure

@Document(
    indexName = "room-nested"
)
@NoArgsConstructor
@Getter
public class RoomNestedDoc {

    @Id
    @Field(type = FieldType.Keyword)
    private String id;

    @Field(type = FieldType.Keyword)
    private String roomId;

    @Field(type = FieldType.Text)
    private String name;

    @Field(type = FieldType.Long)
    private Long createdAt;

    @Field(type = FieldType.Nested)
    private List<MessageField> messages;

    public RoomNestedDoc(RoomNested room) { // constructor for a room with its messages
        this.id = room.getRoomId();
        this.roomId = room.getRoomId();
        this.name = room.getName();
        this.createdAt = room.getCreatedAt();
        this.messages = room.getMessages().stream().map(MessageField::new).collect(Collectors.toList());
    }

    public RoomNestedDoc(Message message) {
        this.id = message.getRoomId();
        this.messages = Arrays.asList(new MessageField(message));
    }

    @Override
    public String toString() {
        return JsonUtil.toJson(this);
    }
}

MessageField is defined as follows.

@NoArgsConstructor
@Getter
public class MessageField {

    @Field(type = FieldType.Long)
    private long messageId;

    @Field(type = FieldType.Text)
    private String message;

    public MessageField(Message message) {
        this.messageId = message.getMessageId();
        this.message = message.getMessage();
    }
}

Document structure

Defining the document class as above produces the following index mapping.

{
  "room-nested" : {
    "mappings" : {
      "properties" : {
        "_class" : {
          "type" : "keyword",
          "index" : false,
          "doc_values" : false
        },
        "createdAt" : {
          "type" : "long"
        },
        "id" : {
          "type" : "keyword"
        },
        "messages" : {
          "type" : "nested",
          "properties" : {
            "_class" : {
              "type" : "keyword",
              "index" : false,
              "doc_values" : false
            },
            "message" : {
              "type" : "text"
            },
            "messageId" : {
              "type" : "long"
            },
            "roomId" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        },
        "name" : {
          "type" : "text"
        },
        "roomId" : {
          "type" : "keyword"
        }
      }
    }
  }
}

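Inserting test data

Let's insert data for the test.

Chat rooms: 1,000,000
Messages per chat room: 100
Total messages: 100,000,000 (100 million)

Inserting messages one at a time takes too long, so they need to be inserted in bulk.

With the nested type, a new message has to be appended to the room's existing messages array, so it cannot simply be written with save, which would replace the whole document. Instead, only the new message is appended to the nested field, using a painless script that adds it through ctx._source.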

@Component
@RequiredArgsConstructor
public class NestedMessageElasticClient {

    private final RestHighLevelClient restHighLevelClient;

    public void createMessage(List<Message> messages) {
        BulkRequest bulkRequest = new BulkRequest();

        messages.stream()
            .map(this::addMessageRequest)
            .forEach(bulkRequest::add);

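        try {
            BulkResponse response = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
            // Individual items can fail even when the bulk call itself succeeds
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }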
    }

    private UpdateRequest addMessageRequest(Message message) {
        RoomNestedDoc roomDoc = new RoomNestedDoc(message);
        Map<String, Object> params = getParams(Collections.singletonMap("message", message));
        // Painless script that appends the new message to the existing messages array
        String source = "ctx._source.messages.add(params.message)";
        Script script = new Script(ScriptType.INLINE, "painless", source, params);

        return new UpdateRequest("room-nested", message.getRoomId())
            .script(script)
            .upsert(roomDoc.toString(), XContentType.JSON)
            .retryOnConflict(3);
    }

    private Map<String, Object> getParams(Map<String, Object> paramMap) {
        Map<String, Object> map = new HashMap<>();
        for (String key : paramMap.keySet()) {
            map.put(key, paramMap.get(key));
        }

        return XContentHelper.convertToMap(XContentFactory.xContent(XContentType.JSON), JsonUtil.toJson(map), false);
    }
}
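For reference, each UpdateRequest built above corresponds roughly to a pair of lines in the body of the _bulk REST API: an action line, and a line carrying the script and the upsert document. The sketch below builds that pair as strings. It is an illustration of the wire format only: the JSON escaping is simplified, the upsert document contains just id and messages, the field order the client actually emits may differ, and the class and method names are hypothetical.

```java
final class BulkUpdateLineSketch {

    // Minimal JSON string escaping (quotes and backslashes only)
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Action line: which document to update, and how often to retry on version conflicts
    static String actionLine(String index, String roomId) {
        return "{\"update\":{\"_index\":" + quote(index)
            + ",\"_id\":" + quote(roomId)
            + ",\"retry_on_conflict\":3}}";
    }

    // Body line: the painless script with its params, plus the document to
    // insert when the room does not exist yet
    static String bodyLine(String roomId, long messageId, String message) {
        String msg = "{\"messageId\":" + messageId + ",\"message\":" + quote(message) + "}";
        return "{\"script\":{\"lang\":\"painless\","
            + "\"source\":\"ctx._source.messages.add(params.message)\","
            + "\"params\":{\"message\":" + msg + "}},"
            + "\"upsert\":{\"id\":" + quote(roomId) + ",\"messages\":[" + msg + "]}}";
    }

    // One update operation in newline-delimited JSON, as _bulk expects
    static String bulkEntry(String index, String roomId, long messageId, String message) {
        return actionLine(index, roomId) + "\n" + bodyLine(roomId, messageId, message) + "\n";
    }

    public static void main(String[] args) {
        System.out.print(bulkEntry("room-nested", "room-1", 1L, "Message 1"));
    }
}
```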

Actual data

After running it, you can confirm that data like the following has been indexed.

{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "room-nested",
        "_type" : "_doc",
        "_id" : "cb452b5d-c92c-4dc5-82b2-994f9995faa8",
        "_score" : 1.0,
        "_source" : {
          "_class" : "com.example.elastic.nested.doc.RoomNestedDoc",
          "id" : "cb452b5d-c92c-4dc5-82b2-994f9995faa8",
          "roomId" : "cb452b5d-c92c-4dc5-82b2-994f9995faa8",
          "name" : "room-cb452b5d-c92c-4dc5-82b2-994f9995faa8",
          "createdAt" : 1674729786402,
          "messages" : [
            [
              {
                "messageId" : 0,
                "message" : "초 중반에는 지루하다 그냥 중후반때부터 좀 볼만했음",
                "roomId" : "601c9fad-2d9a-4c73-b08e-32f6520c7522"
              },
              {
                "messageId" : 1,
                "message" : "역시 권상우와 유지태 두 남자는 정말 멋진 훈남이다.7년이 지난 지금도 다시 보고싶은 영화다.",
                "roomId" : "601c9fad-2d9a-4c73-b08e-32f6520c7522"
              },
              ...
            ]
          ]
        }
      }
    ]
  }
}

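Message insert speed

Inserting 100 chat rooms into Elasticsearch takes about 10 seconds. (chat rooms: 100, messages per room: 100, total messages: 10,000)

elapsed : 10052(ms)

Every 100 chat rooms take 10 seconds or more. This comes from how nested objects are stored: a document that contains one nested object is indexed as 2 documents internally (the root document plus one hidden document for the nested object). When one more nested object is added, the existing 2 documents are deleted and 3 new documents are created.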
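The cost of this rewrite-on-append behavior can be estimated with simple arithmetic. Assuming messages are appended one at a time, the k-th append re-indexes the root document plus k nested documents, so n appends write n(n + 3) / 2 documents in total. The sketch below compares that with parent-child; it counts document writes only, ignores merges and refreshes, and uses hypothetical names.

```java
final class NestedAppendCostSketch {

    // Documents written when messages are appended to one room one at a time:
    // the k-th append re-indexes the root plus k nested documents (k + 1 in total).
    static long nestedWrites(int messagesPerRoom) {
        long total = 0;
        for (int k = 1; k <= messagesPerRoom; k++) {
            total += k + 1;
        }
        return total; // equals n * (n + 3) / 2
    }

    // parent-child: one room document plus one document per message
    static long parentChildWrites(int messagesPerRoom) {
        return 1L + messagesPerRoom;
    }

    public static void main(String[] args) {
        System.out.println(nestedWrites(100));      // 5150
        System.out.println(parentChildWrites(100)); // 101
    }
}
```

For 100 messages per room this is roughly a 50-fold difference in documents written, which is in line with nested inserts being several times slower in the measurement above.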

Message search speed

Let's search for chat rooms whose messages contain a particular keyword.

To search nested objects, the nested query must be used.

GET room-nested/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "messages",
            "query": {
              "match": {
                "messages.message": "학교"
              }
            }
          }
        }
      ]
    }
  }
}

The query above returns in under 1 second.

3. object

Third, let's create documents with the messages stored as plain objects.

The setup is the same as for nested objects. The only difference is that the List<MessageField> field is mapped as Object.

RoomObjectDoc structure

@Document(
    indexName = "room-object"
)
@NoArgsConstructor
@Getter
public class RoomObjectDoc {

    @Id
    @Field(type = FieldType.Keyword)
    private String id;

    @Field(type = FieldType.Keyword)
    private String roomId;

    @Field(type = FieldType.Text)
    private String name;

    @Field(type = FieldType.Long)
    private Long createdAt;

    @Field(type = FieldType.Object)
    private List<MessageField> messages;

    public RoomObjectDoc(RoomObject room) { // constructor for a room with its messages
        this.id = room.getRoomId();
        this.roomId = room.getRoomId();
        this.name = room.getName();
        this.createdAt = room.getCreatedAt();
        this.messages = room.getMessages().stream().map(MessageField::new).collect(Collectors.toList());
    }

    public RoomObjectDoc(Message message) {
        this.id = message.getRoomId();
        this.messages = Arrays.asList(new MessageField(message));
    }

    @Override
    public String toString() {
        return JsonUtil.toJson(this);
    }
}

MessageField is the same as in the nested object case.

Document structure

Defining the document class as above produces the following index mapping.

{
  "room-object" : {
    "mappings" : {
      "properties" : {
        "_class" : {
          "type" : "keyword",
          "index" : false,
          "doc_values" : false
        },
        "createdAt" : {
          "type" : "long"
        },
        "id" : {
          "type" : "keyword"
        },
        "messages" : {
          "properties" : {
            "_class" : {
              "type" : "keyword",
              "index" : false,
              "doc_values" : false
            },
            "message" : {
              "type" : "text"
            },
            "messageId" : {
              "type" : "long"
            }
          }
        },
        "name" : {
          "type" : "text"
        },
        "roomId" : {
          "type" : "keyword"
        }
      }
    }
  }
}

Inserting test data

Let's insert data for the test.

Chat rooms: 1,000,000
Messages per chat room: 100
Total messages: 100,000,000 (100 million)

Messages are inserted the same way as with nested objects.

@Component
@RequiredArgsConstructor
public class ObjectMessageElasticClient {

    private final RestHighLevelClient restHighLevelClient;

    public void createMessage(List<Message> messages) {
        BulkRequest bulkRequest = new BulkRequest();

        messages.stream()
            .map(this::addMessageRequest)
            .forEach(bulkRequest::add);

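        try {
            BulkResponse response = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
            // Individual items can fail even when the bulk call itself succeeds
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }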
    }

    private UpdateRequest addMessageRequest(Message message) {
        RoomObjectDoc roomDoc = new RoomObjectDoc(message);
        Map<String, Object> params = getParams(Collections.singletonMap("message", message));
        // Painless script that appends the new message to the existing messages array
        String source = "ctx._source.messages.add(params.message)";
        Script script = new Script(ScriptType.INLINE, "painless", source, params);

        return new UpdateRequest("room-object", message.getRoomId())
            .script(script)
            .upsert(roomDoc.toString(), XContentType.JSON)
            .retryOnConflict(3);
    }


    private Map<String, Object> getParams(Map<String, Object> paramMap) {
        Map<String, Object> map = new HashMap<>();
        for (String key : paramMap.keySet()) {
            map.put(key, paramMap.get(key));
        }

        return XContentHelper.convertToMap(XContentFactory.xContent(XContentType.JSON), JsonUtil.toJson(map), false);
    }
}

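Actual data

The stored data has the same structure as in the nested object case.

Message insert speed

Inserting 100 chat rooms into Elasticsearch takes about 8 to 9 seconds. (chat rooms: 100, messages per room: 100, total messages: 10,000)

elapsed : 8715(ms)

Unlike nested objects, the object type produces only 1 document even when the document contains an object. When another object is added, it is simply appended to the array. As a result, the values of all objects in the array end up in the same fields, and a search cannot tell which values belonged to which object.

A search pitfall with the object type

Suppose the users "홍길동" (Hong Gil-dong) and "이승엽" (Lee Seung-yeop) are stored split into family name and given name, as below. Searching for "이길동" (Lee Gil-dong) should obviously find nothing, but with the object type it matches. An array of objects is flattened into multi-valued fields (firstName: ["홍", "이"], lastName: ["길동", "승엽"]), so the link between the values of one object is lost.

That is, a search for firstName = "이" and lastName = "길동" matches because each condition is satisfied by some object in the array, not necessarily the same one. (With nested objects this works correctly, i.e. nothing is found.)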

[{ 
  "lastName": "길동", "firstName": "홍"
}, { 
  "lastName": "승엽", "firstName": "이"
}]
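The two matching behaviors can be reproduced in memory. This is only a sketch of the semantics, not of how Elasticsearch implements them, and the class and method names are hypothetical: the object-style match checks each field against all elements independently, while the nested-style match requires both conditions to hold within one element.

```java
import java.util.Arrays;
import java.util.List;

final class ObjectVsNestedSketch {

    static final class Name {
        final String firstName;
        final String lastName;

        Name(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // object type: values are flattened per field, so each condition may be
    // satisfied by a different element of the array
    static boolean objectMatch(List<Name> names, String first, String last) {
        boolean firstFound = names.stream().anyMatch(n -> n.firstName.equals(first));
        boolean lastFound = names.stream().anyMatch(n -> n.lastName.equals(last));
        return firstFound && lastFound;
    }

    // nested type: both conditions must hold within the same element
    static boolean nestedMatch(List<Name> names, String first, String last) {
        return names.stream()
            .anyMatch(n -> n.firstName.equals(first) && n.lastName.equals(last));
    }

    public static void main(String[] args) {
        List<Name> names = Arrays.asList(new Name("홍", "길동"), new Name("이", "승엽"));
        System.out.println(objectMatch(names, "이", "길동")); // true (false positive)
        System.out.println(nestedMatch(names, "이", "길동")); // false
    }
}
```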

Message search speed

Let's look up the chat rooms that contain a particular message. With the object type you can search directly with match, without going through a nested query.

GET room-object/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "messages.message": "학교"
          }
        }
      ]
    }
  }
}

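The query above takes 2 to 3 seconds.

Summary

If the goal were message search, indexing each message as its own document would be enough. Here the goal is to find the chat rooms that contain a particular message, which is why the three structures above were considered.
With the nested object and object types, the resulting score differs depending on the operator (or, and) used in the search, so decide carefully how to query.
No single option is unconditionally the fastest; each of the three has its own strengths and weaknesses, so it is best to pick the structure that fits the situation.

(Chat rooms: 100,000, messages: 100,000,000)

| Structure | Indexing time (per 100 chat rooms) | Search time (chat rooms containing a message; query: 2 words, "학교 수업") | Elasticsearch document count / size | Notes |
| --- | --- | --- | --- | --- |
| parent-child | 1.5 s | about 1 s | Documents: 100,100,000; size: 34.4G | Pros: fast inserts. Cons: search gets slower as the data grows and the query gets longer. |
| nested object | 10 s | about 1 s | Documents: 100,100,000; size: 18.3G | Pros: individual messages stay distinguishable, so varied conditions can be searched. Cons: slow message inserts; more storage used; frequent index merges. |
| object | 7~8 s | 3 s | Documents: 1,000,000; size: 17.4G | Pros: faster inserts than nested object. Cons: individual objects cannot be told apart; an and search within a single message is not possible. |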
