Solr approaches to re-indexing a large document corpus

We are looking for recommendations on systematically re-indexing an ever-growing corpus of documents in Solr (tens of millions now, hundreds of millions in less than a year) without taking the currently running index down. Re-indexing is needed on a periodic basis because:

  • New search features over the existing corpus require additional schema fields, which we can't always anticipate in advance.
  • The corpus is indexed across multiple shards. When it grows past a certain threshold, we need to create more shards and re-balance documents evenly across all of them (which SolrCloud does not yet seem to support).
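
To illustrate why adding shards tends to force a full re-index, here is a minimal sketch. It assumes documents are routed to shards by a hash of their ID modulo the shard count; the question does not say how documents are actually assigned, and the function names `shard_for` and `fraction_moved` are hypothetical.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard using a stable hash (assumed routing scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(doc_ids, old_shards: int, new_shards: int) -> float:
    """Fraction of documents whose shard changes when the shard count changes."""
    doc_ids = list(doc_ids)
    moved = sum(
        1 for d in doc_ids if shard_for(d, old_shards) != shard_for(d, new_shards)
    )
    return moved / len(doc_ids)


# Made-up document IDs, only for demonstration.
ids = [f"doc-{i}" for i in range(10_000)]
moved = fraction_moved(ids, old_shards=4, new_shards=5)
print(f"{moved:.0%} of documents change shard when going from 4 to 5 shards")
```

Under this assumption most documents land on a different shard after the change, so growing the cluster amounts to re-indexing nearly the whole corpus.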

The current index receives very frequent updates and additions, which need to be available for search within minutes. Therefore, approaches where the corpus is re-indexed offline in a batch don't really work: by the time the batch finishes, new documents will have arrived that it does not include.

The approaches we are looking into at the moment are:

  • Create a new cluster of shards and batch re-index there while the old cluster is still available for searching. New documents that are not part of the re-indexed batch are sent to both the old cluster and the new cluster. When ready to switch, point the load balancer to the new cluster.
  • Use CoreAdmin: spawn a new core per shard and send the re-indexed batch to the new cores. New documents that are not part of the re-indexed batch are sent to both the old cores and the new cores. When ready to switch, use CoreAdmin to dynamically swap cores.
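
Both approaches rely on writing new documents to the old and the new index at the same time, and the second ends with a CoreAdmin swap. A minimal sketch of those two pieces follows. The `DualWriter` class, the `send` callable, the host names, and the core names are hypothetical; only the `action=SWAP` request with its `core` and `other` parameters comes from Solr's CoreAdmin API.

```python
from urllib.parse import urlencode


class DualWriter:
    """Send every live update to both the old and the new index."""

    def __init__(self, send, old_url: str, new_url: str):
        # send is a callable(url, doc); in production it would do an HTTP POST.
        self.send = send
        self.old_url = old_url
        self.new_url = new_url

    def index(self, doc: dict) -> None:
        # The old index keeps serving searches, so it must stay current.
        self.send(self.old_url, doc)
        # The new index receives the same update alongside the batch re-index.
        self.send(self.new_url, doc)


def swap_url(solr_base: str, live_core: str, rebuilt_core: str) -> str:
    """Build a CoreAdmin SWAP request that exchanges the names of two cores."""
    query = urlencode({"action": "SWAP", "core": live_core, "other": rebuilt_core})
    return f"{solr_base}/admin/cores?{query}"


# Record the calls instead of hitting the network, to show the behaviour.
sent = []
writer = DualWriter(
    lambda url, doc: sent.append((url, doc["id"])),
    old_url="http://old-cluster:8983/solr/update",
    new_url="http://new-cluster:8983/solr/update",
)
writer.index({"id": "42", "title": "new document"})
print(sent)
print(swap_url("http://localhost:8983/solr", "shard1", "shard1_rebuild"))
```

This sketch does not handle ordering. If the batch job writes an older copy of a document after a live update has already reached the new index, the older copy replaces it, so the batch would need to skip or version-check such documents.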

We'd appreciate it if folks could either confirm or poke holes in either or both of these approaches. Is one more appropriate than the other? Or are we completely off? Thank you in advance.

-------------Problems Reply------------

This may not be applicable to you guys, but I'll offer my approach to this problem.

Our Solr setup is currently a single core. We'll be adding more cores in the future, but the overwhelming majority of the data is written to a single core.

With this in mind, sharding wasn't really applicable to us. I looked into distributed searches - cutting up the data and having different slices of it running on different servers. This, to me, just seemed to complicate things too much. It would make backup/restores more difficult and you end up losing out on certain features when performing distributed searches.

The approach we ended up going with was a very simple clustered master/slave setup.

Each cluster consists of a master database and two Solr slaves that are load balanced. All new data is written to the master database, and the slaves are configured to sync new data every 5 minutes. Under normal circumstances this is a very nice setup. Re-indexing operations occur on the master, and while this is happening the slaves can still be read from.

When a major re-indexing operation is happening, I remove one slave from the load balancer and turn off polling on the other. The customer-facing slave therefore stops syncing with the master, while the offline slave keeps receiving updates. Once the re-index is complete and the offline slave is in sync, I add it back to the load balancer, remove the other slave from the load balancer, and re-configure it to sync with the master.
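
The rotation described above can be sketched as an ordered list of steps. This assumes polling is toggled through the `disablepoll` and `enablepoll` commands of Solr's replication handler; the reply does not say how polling was turned off. The slave URLs, the wording of the manual steps, and the function names are hypothetical.

```python
def replication_url(slave_base: str, command: str) -> str:
    """Build a replication handler request for a slave (e.g. disablepoll)."""
    return f"{slave_base}/replication?command={command}"


def reindex_plan(serving_slave: str, offline_slave: str) -> list:
    """Ordered (action, target) steps for a major re-index, as in the reply."""
    return [
        ("remove from load balancer", offline_slave),
        # Freeze the customer-facing slave on its current index.
        ("GET", replication_url(serving_slave, "disablepoll")),
        ("run re-index on master", None),
        # The offline slave is still polling, so it picks up the new index.
        ("wait until in sync with master", offline_slave),
        ("add to load balancer", offline_slave),
        ("remove from load balancer", serving_slave),
        # Let the previously frozen slave catch up with the master.
        ("GET", replication_url(serving_slave, "enablepoll")),
    ]


plan = reindex_plan("http://slave1:8983/solr", "http://slave2:8983/solr")
for action, target in plan:
    print(action, target or "")
```

The plan stops where the reply stops. Presumably the second slave returns to the load balancer once it has caught up, but that step is not stated.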

So far this has worked very well. We currently have around 5 million documents in our database and this number will scale much higher across multiple clusters.

Hope this helps!

Category:indexing Views:3 Time:2011-05-10

