Best practices for result content in Amazon CloudSearch (or Solr, Elasticsearch)

We are asking because we want to migrate an application from Solr to Amazon CloudSearch, and we would like to understand the recommended approach before we proceed. Note that Amazon CloudSearch is built on Apache Solr.


Question:

Is it true that the recommended practice is to retrieve only an ID from the search engine and then fetch the metadata from the database? I am concerned about the performance impact.


Solution 1:


In my view, it is generally best to store and retrieve as few fields as possible, ideally only the ID, unless a feature such as highlighting specifically requires more.

As your index grows, storing extra data can slow down search. Nothing loads faster than no data at all. In addition, retrieving objects by ID should be a cheap operation in your primary data store.

Using an ORM in your application lets you reuse the same domain model consistently throughout, which is extremely valuable.

Returning values straight from your search engine can be useful. However, without a very compelling reason, I would hesitate to split my domain logic and bypass the ORM by relying on the search engine as a primary data store.
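
As a minimal sketch of the ID-only pattern this answer recommends: the search request asks Solr for the `id` field only (via the `fl` parameter), and the application then loads the full rows from the primary store in one query, preserving the search ranking. The base URL, the `products` table, and its columns are hypothetical, and SQLite stands in for whatever database or ORM you actually use.

```python
import sqlite3
from urllib.parse import urlencode


def build_id_only_query(base_url, text, rows=10):
    """Build a Solr select URL that requests only the id field (fl=id)."""
    params = {"q": text, "fl": "id", "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def hydrate(conn, ids):
    """Load full rows for the given ids in one query, keeping search order.

    The products table and its columns are assumptions for illustration.
    """
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, price FROM products WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database returns rows in arbitrary order, so re-apply the ranking.
    # Ids missing from the database (stale index entries) are skipped.
    return [by_id[i] for i in ids if i in by_id]
```

Fetching all IDs with a single `IN (...)` query, rather than one query per result, keeps the extra round trip to the database at one per search page.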


Solution 2:


In my opinion, fetching the search results and their data in a single call would perform considerably better than retrieving only the IDs and then making a separate database call to get the metadata for those IDs. Furthermore, Solr and Elasticsearch provide built-in caching, which makes subsequent queries faster, whereas for the database you may need to add a separate caching solution.
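
A rough sketch of the single-call approach, assuming the fields you need are stored in the index: list them in Solr's `fl` parameter and read them straight from the `response.docs` array of the JSON response. The base URL, the field names, and the sample response values below are made up for illustration.

```python
import json
from urllib.parse import urlencode


def build_stored_fields_query(base_url, text, fields, rows=10):
    """Build a Solr select URL that returns the listed stored fields."""
    params = {"q": text, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def docs_from_response(raw_json):
    """Extract the returned documents from a Solr JSON response body."""
    return json.loads(raw_json)["response"]["docs"]


# Sample body in the shape of a Solr JSON response (values are invented).
sample_body = json.dumps({
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "3", "title": "Green shoe", "price": 39.0},
            {"id": "1", "title": "Red shoe", "price": 49.0},
        ],
    }
})

for doc in docs_from_response(sample_body):
    # Everything needed to render a result is already in the search response.
    print(doc["id"], doc["title"], doc["price"])
```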


Solution 3:

It depends on your particular situation.

There are cases where what you describe holds true. For example, Etsy used to take this approach. Their reasoning was that they had a highly efficient MySQL cluster that they were proficient at managing. The MySQL cluster performed so well that having Solr return only the ID was sufficient for them.

However, you could be in a completely different situation, where fetching data from the database takes longer than storing everything you need in Solr and querying Solr alone.


Solution 4:

Based on my observations, Solr's performance suffers when retrieving results under two conditions: either highlighting is turned on, or the fields being retrieved are very large, which adds overhead for network transfer and serialization/deserialization. In such cases, it may be more efficient to retrieve these fields asynchronously from the database.
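
One way to sketch the asynchronous retrieval mentioned above is to keep the large fields out of the search response and load them from the database in a background thread while the rest of the page is prepared. The `BODIES` dictionary is a hypothetical stand-in for the database, and `fetch_bodies` and `search_then_enrich` are illustrative names, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the primary database: id -> large text field.
BODIES = {1: "long description A", 2: "long description B"}


def fetch_bodies(ids):
    """Pretend database lookup that returns the large field for each id."""
    return {i: BODIES[i] for i in ids if i in BODIES}


def search_then_enrich(docs):
    """Attach large fields to small search documents (id and title only)."""
    ids = [d["id"] for d in docs]
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the database lookup in the background.
        future = pool.submit(fetch_bodies, ids)
        # Work that does not need the large fields can proceed meanwhile.
        titles = [d["title"] for d in docs]
        # Block only at the point where the large fields are required.
        bodies = future.result()
    enriched = [{**d, "body": bodies.get(d["id"])} for d in docs]
    return titles, enriched
```

In a web application the same idea often takes the form of rendering the result list first and loading the large fields in a follow-up request.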
