Lucene nightly indexing benchmark

Known changes:

A (2011-04-25): Switched from a traditional spinning-magnets hard drive (Western Digital Caviar Green, 1TB) to a 240 GB OCZ Vertex III SSD; this change gave a small increase in indexing rate, drastically reduced variance on the NRT reopen time (NRT is IO intensive), and didn't affect query performance (which is expected since the postings are small enough to fit into the OS's IO cache).
B (2011-05-02): Concurrent flushing, a major improvement to Lucene, was committed. Before this change, flushing a segment in IndexWriter was single-threaded and blocked all other indexing threads; after this change, each indexing thread flushes its own segment without blocking indexing of other threads. On highly concurrent hardware (the machine running these tests has 24 cores) this can result in a tremendous increase in Lucene's indexing throughput. See this post for details.
Some queries did get slower, because the index now has more segments. Unfortunately, the index produced by concurrent flushing will vary, night to night, in how many segments it contains, so this is a further source of noise in the search results.
C (2011-05-06): Changed how I build the index used for searching, to only use one thread. This results in exactly the same index structure (same segments, same docs per segment) from night to night, to avoid the added noise from change B.
D (2011-05-07): Increased number of indexing threads from 6 to 20 and dropped the IndexWriter RAM buffer from 512 MB to 350 MB. See this post for details.
E (2011-05-11): Added TermQuery, sorting by date/time and title fields.
F (2011-05-14): Added TermQuery, grouping by fields with 100, 10K, 1M unique values.
G (2011-06-03): Added Term (bgroup) and Term (bgroup, 1pass) using the BlockGroupingCollector for grouping into 1M unique groups.
H (2011-06-26): Switched to MemoryCodec for the primary-key 'id' field so that lookups (either for PKLookup test or for deletions during reopen in the NRT test) are fast, with no IO. Also switched to NRTCachingDirectory for the NRT test, so that small new segments are written only in RAM.
I (2011-07-04): Switched from Java 1.6.0_21 to 1.6.0_26
J (2011-07-11): LUCENE-3233: fast SynonymFilter using an FST, including an optimization to the FST representation allowing array arcs even when some arcs have large outputs; this resulted in a good speedup for MemoryCodec, which also speeds up the primary key lookup performance.
K (2011-07-22): LUCENE-3328: If all clauses of a BooleanQuery are MUST and are TermQuery then create a specialized scorer for scoring this common case.
L (2011-07-30): Switched back to Java 1.6.0_21 from 1.6.0_26 because _26 would sometimes deadlock threads.
M (2011-08-20): LUCENE-3030: cutover to more efficient BlockTree terms dict.
N (2011-09-22): LUCENE-3215: more efficient scoring for sloppy PhraseQuery.
O (2011-11-30): LUCENE-3584: make postings bulk API codec-private
P (2011-12-07): Switched to Java 1.7.0_01
Q (2011-12-16): LUCENE-3648: JIT optimizations to Lucene40 DocsEnum
R (2012-01-30): LUCENE-2858: Split IndexReader in AtomicReader and CompositeReader
S (2012-03-18): LUCENE-3738: Be consistent about negative vInt/vLong
V (2012-05-06): LUCENE-4024: FuzzyQuery never does edit distance > 2
W (2012-05-15): LUCENE-4024: (rev 1338668) fixed ob1 bug causing FuzzyQ(1) to be TermQuery
T (2012-05-25): LUCENE-4062: new aligned packed-bits implementations
U (2012-05-28): Disable Java's compressed OOPS (-XX:-UseCompressedOops), and LUCENE-4055: refactor SegmentInfos/FieldInfos
X (2012-06-02): Re-enable Java's compressed OOPS
Y (2012-06-06): Switched to Java 1.7.0_04
Z (2012-06-26): Fixed silly performance bug in PKLookupTask.java
AA (2012-10-06): Stopped overclocking the computer running benchmarks.
AB (2012-10-15): LUCENE-4446: switch to BlockPostingsFormat
AC (2012-12-10): LUCENE-4598: small optimizations to facet aggregation
AD (2013-01-11): LUCENE-4620: IntEncoder/Decoder bulk API
AE (2013-01-17): Facet performance improvements: LUCENE-4686, LUCENE-4620, LUCENE-4602
AF (2013-01-21): Facet performance improvements: LUCENE-4600
AG (2013-01-24): Switched to NO_PARENTS faceting
AH (2013-02-07): DocValues improvements (LUCENE-4547) and facets API improvements (LUCENE-4757)
AI (2013-02-12): LUCENE-4764: new Facet42DocValuesFormat for faster but more RAM-consuming DocValues
AJ (2013-02-22): LUCENE-4791: optimize ConjunctionTermScorer to use skipping on first term
AK (2013-03-14): LUCENE-4607: add DISI/Spans.cost
AL (2013-05-03): LUCENE-4946: SorterTemplate
AM (2013-06-20): LUCENE-5063: compress int and long FieldCache entries
AN (2013-07-31): LUCENE-5140: recover slowdown in span queries and exact phrase query
AO (2013-09-10): Switched to Java 1.7.0_40
AP (2013-11-09): Switched to DirectDocValuesFormat for the Date facets field.
AQ (2014-02-06): LUCENE-5425: performance improvement for FixedBitSet.iterator
AW (2014-03-11): LUCENE-5487: add BulkScorer
AR (2014-04-05): LUCENE-527: LeafCollector (made CachingCollector slower)
AS (2014-04-25): Upgraded to Ubuntu 14.04 LTS (kernel 3.13.0-32-generic #57)
AT (2014-06-10): Switched from DirectDVFormat to Lucene's default for Date facet field
AU (2014-07-25): Disabled transparent huge pages
AV (2014-08-30): Re-enabled transparent huge pages
AX (2014-11-01): LUCENE-6030: norms compression
AY (2014-11-22): Upgrade from java 1.7.0_55-b13 to java 1.8.0_20-ea-b05
AZ (2015-01-15): LUCENE-6179: remove out-of-order scoring
BA (2015-01-19): LUCENE-6184: BooleanScorer better deals with sparse clauses
BB (2015-02-13): LUCENE-6198: Two phase intersection (approximations are not needed by any query in this benchmark, but the change refactored ConjunctionScorer a bit)
BD (2015-03-02): LUCENE-6320: Speed up CheckIndex
BE (2015-03-06): Upgrade JDK from 1.8.0_25-b17 to 1.8.0_40-b25
BF (2015-04-02): LUCENE-6308: span queries support two-phased iteration
BG (2015-04-04): LUCENE-5879: add auto-prefix terms
BH (2015-06-24): LUCENE-6548: some optimizations to block tree intersect
BI (2015-09-15): LUCENE-6789 switch to BM25 scoring by default
BJ (2015-10-05): Randomize what time of day benchmark runs
BK (2015-12-02): Upgrade to beast2 (72 cores, 256 GB RAM)
BL (2015-12-10): LUCENE-6919: Change the Scorer API to expose an iterator instead of extending DocIdSetIterator
BM (2015-12-14): LUCENE-6917: Change from LegacyNumericRangeQuery to DimensionalRangeQuery
BN (2016-05-23): Fix silly benchmark bottlenecks and re-tune for high indexing throughput
BO (2016-05-25): Fix another benchmark bottleneck for 1 KB docs (but this added a bug in TermDateFacets, fixed on 10/18)
BP (2016-06-13): LUCENE-7330: Speed up conjunctions
BR (2016-07-16): Upgrade beast2 kernel from 4.4.x to 4.6.x
BS (2016-08-17): Upgrade beast2 kernel from 4.6.x to 4.7.0
BT (2016-09-04): Upgrade beast2 kernel from 4.7.0 to 4.7.2
BU (2016-09-21): LUCENE-7407: Change doc values from random access to iterator API
BV (2016-10-18): Fix silly TermDateFacets bug causing single date facet to be indexed for all docs, added on 5/25
BW (2016-10-24): LUCENE-7462: give doc values an advanceExact API
BX (2016-10-26): LUCENE-7519: optimize computing browse-only facets, LUCENE-7489: Remove one layer of abstraction in binary doc values and single-valued numerics
BY (2016-10-27): Re-enable transparent huge pages in Linux
BZ (2016-10-31): LUCENE-7135: This issue accidentally caused FSDirectory.open to use NIOFSDirectory instead of MMapDirectory for e.g. CheckIndex
CA (2016-11-02): LUCENE-7135: Fixed this issue so we use MMapDirectory again
CB (2017-01-18): LUCENE-7641: Speed up point ranges that match most documents
CC (2017-10-27): LUCENE-7997: BM25 to use doubles instead of floats
CD (2018-01-31): LUCENE-4198: Allow codecs to index term impacts
CE (2018-02-20): LUCENE-8153: CheckIndex spends less time checking impacts
CF (2018-05-02): LUCENE-8279: CheckIndex now cross-checks terms with norms
CG (2018-05-11): Primary key now indexed with the default codec instead of the specialized Memory postings format
CH (2018-05-25): LUCENE-8312: Leverage impacts for SynonymQuery (introduced regression for non-scoring term queries)
CI (2018-08-07): LUCENE-8312: Fixed regression with non-scoring term queries
CJ (2018-08-07): LUCENE-8060: Stop counting total hits by default
CK (2018-08-18): LUCENE-8448: Propagate min competitive score to sub clauses
CL (2018-11-19): LUCENE-8464: ConstantScoreScorer now implements setMinCompetitiveScore
CM (2019-04-23): Switched to OpenJDK 11
CN (2019-04-30): Switched GC back to ParallelGC (away from default G1GC)
CO (2019-05-06): LUCENE-8781: FST lookup performance has been improved in many cases by encoding Arcs using full-sized arrays with gaps
CP (2019-05-24): LUCENE-8770: Two-phase support in conjunctions
CQ (2019-07-02): LUCENE-8901: Load freq blocks lazily
CR (2019-07-09): LUCENE-8311: Compute impacts for phrase queries
CS (2019-09-26): LUCENE-8980: Blocktree seekExact now checks min-max range of the segment
CT (2019-10-14): LUCENE-8920: Disable direct addressing of arcs.
CU (2019-11-14): LUCENE-8920: Re-enable direct addressing of arcs.
CV (2019-11-19): LUCENE-9027: SIMD decompression of postings.
CW (2019-11-21): LUCENE-9056: Fewer conditionals in #nextDoc/#advance
CX (2020-01-13): Switch to OpenJDK 13
CY (2020-01-14): Switch to OpenJDK 12
CZ (2020-01-17): Move invariant checks of CompetitiveImpactAccumulator under an assert
DA (2020-01-24): LUCENE-4702: compress suffix bytes in terms dictionary
DB (2020-02-18): LUCENE-9211: Adding compression to BinaryDocValues storage
DC (2020-09-09): LUCENE-9511: Include StoredFieldsWriter in DWPT accounting
DD (2020-10-27): LUCENE-9280: optimization to skip non-competitive documents when sorting by field by indexing benchmark datetime field as both points and doc values
DE (2020-11-06): Move to new beast 3 Ryzen Threadripper 3990X hardware for all nightly benchmarks: 64 cores (128 with hyperthreading), 256 GB RAM, 960 GB Intel 905P Optane, Linux 5.9.2-arch1-1, still OpenJDK 12.0.2+10
DF (2020-11-14): LUCENE-9378: Configurable compression for binary doc values
DG (2020-12-04): Switch to JDK 15 (from JDK 12)
DH (2020-12-09 10:00:13): Add KNN vectors to indexing metrics
DI (2020-12-10): LUCENE-9626: switch to native arrays for HNSW ANN vector search
DJ (2020-12-28): LUCENE-9644: add diversity to HNSW (ANN search) neighbor selection, apparently yielding performance gains to VectorSearch task
DL (2021-01-07): LUCENE-9652: add dedicated method, DataInput.readLEFloats, to read float[] from DataInput, optimizing HNSW KNN
DK (2021-01-09 13:35:50): increase max number of concurrent merges from 3 to 12 for indexing tasks; increase Indexer heap from 8 to 32 GB
DM (2021-01-24 17:25:07): enable BinaryDocValues compression in taxonomy index
DN (2021-01-28): LUCENE-9695: WTF somehow this bug fix hurt vector indexing throughput?
DO (2021-02-25): Upgrade beast3 to Arch Linux 5.11.1
DP (2021-03-14 08:23:12): Move vectors indexing to dedicated (separate) indexing task
DQ (2021-06-24 00:03:16): LUCENE-9613: Create blocks for ords when it helps Lucene80DocValuesFormat
DR (2021-08-24 00:03:22): LUCENE-5309: specialize single-valued SortedSetDocValues faceting
DS (2021-08-26 07:26:00): LUCENE-10067: specialize SSDV ordinal decode
DT (2021-08-27 10:46:59): Upgrade to JDK 16.0.2+7
DU (2021-09-01 00:03:16): LUCENE-9662: CheckIndex should be concurrent
DV (2021-09-03 00:03:25): Use 16 concurrent threads for CheckIndex
DW (2021-09-25 00:03:25): LUCENE-10109: Increase default beam width from 16 to 100
DX (2021-10-05 12:21:08): Upgrade Linux kernel from 5.13.12 to 5.14.8
DY (2021-10-19 08:14:33): Upgrade to JDK17+35, and pass -release to ecj linting
EA (2021-11-24 18:04:23): LUCENE-10062: switch to storing taxonomy Facet ordinals from custom encoding in BINARY DV field, to SSDV field
EC (2022-01-03 18:03:13): LUCENE-10346: specialize single-valued doc values during taxonomy facet counting
ED (2022-01-19 10:17:11): Upgrade arch linux 5.15.10 -> 5.16.1; LUCENE-10375: Speed up HNSW merge by writing combined vector data
EE (2022-01-26 18:03:08): LUCENE-10054: Make HnswGraph hierarchical
EF (2022-02-18 07:54:59): LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls
EG (2022-02-18 07:54:59): LUCENE-10408 Better encoding of doc Ids in vectors
EH (2022-02-25 18:03:10): LUCENE-10421: Use fixed seed for HNSW search
EI (2022-03-23 18:03:07): LUCENE-10481: FacetsCollector sets ScoreMode.COMPLETE_NO_SCORES when scores are not needed
EJ (2022-04-21 18:03:04): LUCENE-10517: specialize SSDV pure-browse facets for some cases
EK (2022-05-12 18:02:51): LUCENE-10527: Use 2*maxConn for last layer in HNSW
EL (2022-05-19): GITHUB#11610: Prevent pathological O(N^2) merges
EM (2022-06-07 18:02:50): LUCENE-10078: Enable merge-on-refresh by default
EN (2022-07-04 18:02:46): LUCENE-10480: Use BMM scorer for 2 clauses disjunction
EO (2022-07-08 18:02:52): LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches
EP (2022-07-13 18:02:33): #1010: Specialize ordinal encoding for SortedSetDocValues
EQ (2022-07-23 06:00:31): LUCENE-10592: Build HNSW graph during indexing
ER (2022-07-30 18:02:50): LUCENE-10633: Dynamic pruning for queries sorted by SORTED(_SET) field
ES (2022-08-10): GITHUB#1017: Add new ShapeDocValuesField for LatLonShape and XYShape
ET (2022-09-28): GITHUB#11824: recover performance regression of LatLonPoint#newPolygonQuery
EU (2022-09-29): Update all Arch Linux packages, including kernel from 5.17.5 -> 5.19.12
EV (2022-09-30): Switch from OpenJDK 17.0.1+12 -> standard Arch Linux OpenJDK 17.0.4.1+1
EW (2022-10-06): luceneutil#192: increase the default topN from 10 to 100
EX (2022-10-19): Fix CombinedFieldsQuery tasks: https://github.com/mikemccand/luceneutil/commit/56729cf341a443fb81148dd25d3d49cb88bc72e8
EY (2022-10-26): Upgrade to OpenJDK 19.0.1+10
EZ (2022-11-03): Add --enable-preview command-line JVM flag to test new Panama-based MMapDirectory implementation
FA (2022-11-16): GITHUB#11939: fix bug of incorrect cost after upgradeToBitSet in DocIdSetBuilder class
FB (2023-03-01): GITHUB#12055: Better skipping for multi-term queries with a FILTER rewrite
FC (2023-03-14): GITHUB#12198: Reduced contention and increased thread affinity
FD (2023-04-04): Remove --enable-preview JVM flag
FE (2023-04-09): Upgrade Linux kernel 5.19.12 -> 6.2.9 plus all packages
FF (2023-04-11): Upgrade Java 19.0.2+7 to 20+36-2344
FG (2023-05-12): Allocate one NeighborQueue per search for results #12255
FH (2023-05-20): Revert allocate one NeighborQueue per search for results #12255
FI (2023-05-29): Use the incubating Panama API to accelerate KNN indexing and searching #12311
FJ (2023-06-22): Implement top-level disjunctions as a bulk scorer
FK (2023-07-20): Add BS1 optimization to MaxScoreBulkScorer; also increase vectors dimensions to 768: https://github.com/mikemccand/luceneutil/commit/ce6901ca05f3b944144e3a1e47e1efcc13a30e33
FL (2023-07-28): Initialize facet counting data structures lazily (https://github.com/apache/lucene/commit/179b45bc23e4496278b7058811577b66ef3af77d); but this also incorrectly shifted which facet tasks are executed, making results incomparable: https://github.com/mikemccand/luceneutil/issues/226
FM (2023-08-03 06:31:03): Fix non-determinism in nightly benchmarks for vector and taxo facets tasks: https://github.com/mikemccand/luceneutil/issues/226, reduce overhead of BooleanScorer in non-scoring mode: https://github.com/apache/lucene/pull/12475, and reduce overhead of intersecting the scorer with the collector's competitive iterator: https://github.com/apache/lucene/pull/12481
FN (2023-08-11): Optimized counts on disjunctive queries https://github.com/apache/lucene/pull/12415
FO (2023-09-15): No longer propagate min competitive scores through the query tree https://github.com/apache/lucene/pull/12490
FP (2023-09-28): Specialized bulk scorer for conjunctions https://github.com/apache/lucene/pull/12382
FQ (2023-10-04): Reduced FST block size for BlockTreeTermsWriter https://github.com/apache/lucene/pull/12604
FR (2023-10-10): Better output prefix sharing in terms index https://github.com/apache/lucene/pull/12631
FS (2023-10-13): Lazy decoding of frequencies in BlockImpactsDocsEnum https://github.com/apache/lucene/pull/12668
FT (2023-10-17): Specialize BlockImpactsDocsEnum#nextDoc() https://github.com/apache/lucene/pull/12670
FU (2023-10-28): Intersect clauses in disjunctions when block max scores require multiple clauses to match https://github.com/apache/lucene/pull/12589
FV (2023-11-02): Add a specialized bulk scorer for regular conjunctions https://github.com/apache/lucene/pull/12719
FW (2023-11-09): Remove patching for doc blocks https://github.com/apache/lucene/pull/12741
FX (2023-11-13): Also index int8 quantized HNSW vectors; this introduced a bug where quantized usage was flipped, which was fixed on 2023-12-04
FY (2023-11-21): Switch tail postings to group-varint https://github.com/apache/lucene/pull/12782
FZ (2023-11-23): Skip decoding tail freqs when they are not needed https://github.com/apache/lucene/pull/12832
GA (2023-12-04): Correct int8 quantized vectors usage, which had been flipped by the 2023-11-13 change
GB (2023-12-14): Change CheckIndex level back to 2 https://github.com/mikemccand/luceneutil/pull/251
GC (2023-12-24): Move group-varint encoding/decoding logic to DataOutput/DataInput https://github.com/apache/lucene/pull/12841
GD (2024-01-16): Override #readVInt and #readVLong for ByteBufferDataInput https://github.com/apache/lucene/pull/592
GE (2024-02-02): Optimize counts on two clause term disjunctions https://github.com/apache/lucene/pull/13036
GF (2024-02-07): Speedup concurrent multi-segment HNSW graph search by exchanging the global top scores collected so far across segments https://github.com/apache/lucene/pull/12962
GG (2024-03-01): Upgrade to OpenJDK 21
GH (2024-03-04): Upgrade from JDK 21+35 -> 21.0.2+13, and OS from 6.4.1-arch1-1 to 6.7.8-arch1-1
GI (2024-03-26): Break point estimation when threshold exceeded https://github.com/apache/lucene/pull/13199
GJ (2024-04-01): #11888: optimize terms dictionary lookups when all terms have the same suffix length, which helps primary key lookup for fixed-length fields
GL (2024-05-20): Disjunction as CompetitiveIterator for numeric dynamic pruning https://github.com/apache/lucene/pull/13221
GM (2024-06-24): Enable intra-query concurrency across 8 threads
GN (2024-07-05): TaskExecutor should not fork unnecessarily https://github.com/apache/lucene/pull/13472
GO (2024-07-08): Replace AtomicLong with LongAdder in HitsThresholdChecker https://github.com/apache/lucene/pull/13546
GP (2024-07-19): Stop requiring MaxScoreBulkScorer's outer window from having at least INNER_WINDOW_SIZE docs https://github.com/apache/lucene/pull/13582
GQ (2024-07-24): Further reduce the search concurrency overhead https://github.com/apache/lucene/pull/13606, Run search tasks in IndexSearcher's executor https://github.com/mikemccand/luceneutil/pull/286
GR (2024-07-25): Bump the window size of disjunctions from 2,048 to 4,096 https://github.com/apache/lucene/pull/13605
GS (2024-07-31): Move to 2 levels of skip data, inlined in postings lists https://github.com/apache/lucene/pull/13585
GT (2024-08-25): Switch to 768 dimension Cohere vectors (from MPNet 768 dimensions)
GU (2024-08-28): Speed up advancing within a block https://github.com/apache/lucene/pull/13692
GV (2024-09-16): Revert: Speed up advancing within a block https://github.com/apache/lucene/pull/13692
GW (2024-10-03): Speedup GlobalHitsThresholdChecker a little https://github.com/apache/lucene/pull/13836
GX (2024-10-07): Speedup MaxScoreCache.computeMaxScore https://github.com/apache/lucene/pull/13865
GY (2024-10-09): Speedup OrderedIntervalsSource https://github.com/apache/lucene/pull/13871
GZ (2024-10-12): Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer https://github.com/apache/lucene/pull/13894, Lazy initialize ForDeltaUtil and ForUtil in Lucene912PostingsReader https://github.com/apache/lucene/pull/13885
HA (2024-10-14): Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader https://github.com/apache/lucene/pull/13901
HB (2024-10-21): Make BooleanScorer work on top of Scorers rather than BulkScorers https://github.com/apache/lucene/pull/13931, Speedup OrderedIntervalsSource some more https://github.com/apache/lucene/pull/13937
HC (2024-10-22): Introduce a heuristic to amortize the per-window overhead in MaxScoreBulkScorer https://github.com/apache/lucene/pull/13941
HD (2024-10-26): Remove LeafSimScorer abstraction https://github.com/apache/lucene/pull/13957
HE (2024-10-28): Remove HitsThresholdChecker https://github.com/apache/lucene/pull/13943
HF (2024-10-30): Speed up advancing within a block https://github.com/apache/lucene/pull/13958
HG (2024-11-01): Move postings back to int[] to take advantage of having more lanes per vector https://github.com/apache/lucene/pull/13968
HH (2024-11-01): Make CombinedFieldQuery eligible for WAND/MAXSCORE. https://github.com/apache/lucene/pull/13999 Only consider clauses whose cost is less than the lead cost to compute block boundaries in WANDScorer. https://github.com/apache/lucene/pull/14003
HI (2024-11-22): Make CombinedFieldQuery eligible for WAND/MAXSCORE. https://github.com/apache/lucene/pull/13999 Only consider clauses whose cost is less than the lead cost to compute block boundaries in WANDScorer. https://github.com/apache/lucene/pull/14003
HJ (2024-11-25): Stop using SlowImpactsEnum for terms whose docFreq is less than 128. https://github.com/apache/lucene/pull/14017
HK (2024-11-26): Make WANDScorer compute scores on the fly. https://github.com/apache/lucene/pull/14021
HL (2024-11-27): Run filtered disjunctions with MaxScoreBulkScorer. https://github.com/apache/lucene/pull/14014
HM (2024-11-29): Make inlining decisions a bit more predictable in our main queries https://github.com/apache/lucene/pull/14023
HN (2024-12-02): Speed up PostingsEnum when reading positions. https://github.com/apache/lucene/pull/14032
HP (2024-12-06): Introduce a BulkScorer for DisjunctionMaxQuery. https://github.com/apache/lucene/pull/14040
HQ (2024-12-17): Speed up advancing on the disjunction iterator https://github.com/apache/lucene/pull/14052
HR (2024-12-18): Let DocIdSetIterator optimize loading into a FixedBitSet https://github.com/apache/lucene/pull/14069
HS (2024-12-19): Use the new loadIntoBitSet API to speed up dense conjunctions https://github.com/apache/lucene/pull/14080
HT (2024-12-20): Add JVM command-line options -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+PreserveFramePointer so Linux perf command can see all C function calls
HU (2024-12-26): Optimize BitSetIterator#intoBitSet https://github.com/apache/lucene/pull/14083
HV (2025-01-09): Remove JVM command-line options -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+PreserveFramePointer
HW (2025-01-14): Encode dense blocks of postings as bit sets. https://github.com/apache/lucene/pull/14133
HX (2025-01-23): Add a bias towards bit set encoding. https://github.com/apache/lucene/pull/14155
HY (2025-01-25): Not maintain docBufferUpTo when only docs needed. https://github.com/apache/lucene/pull/14164
HZ (2025-02-09): Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves. https://github.com/apache/lucene/pull/14176
IA (2025-02-13): Change the filtered HNSW search algorithm to one based on ACORN-1. https://github.com/apache/lucene/pull/14160
IB (2025-02-27): Use DenseConjunctionBulkScorer for single queries sometimes. https://github.com/apache/lucene/pull/14293
IC (2025-03-13): Speed up scoring conjunctions a bit. https://github.com/apache/lucene/pull/14345
ID (2025-03-16): Decode doc ids in BKD leaves with auto-vectorized loops. https://github.com/apache/lucene/pull/14203
IE (2025-03-20): Implement bulk adding methods for dynamic pruning. https://github.com/apache/lucene/pull/14365
IF (2025-04-14): A specialized Trie for Block Tree Index. https://github.com/apache/lucene/pull/14333
IG (2025-05-01): Kernel upgrade from 6.12.4 -> 6.14.4, which included changing HZ (how many time slices per sec the Linux scheduler creates) from 300 (Arch Linux) to 1000, likely the cause of our slowdown
IH (2025-05-13): Regression from optimistic KNN query -- see https://github.com/apache/lucene/issues/14671
II (2025-05-15): Downgraded from Java 24 to Java 23, and Lucene to #612f0da4a4ce3a133b40402a87ec5cf7eeb290cc, but perf regression remains
IJ (2025-05-17): Downgraded Linux kernel from 6.14.4 back to 6.12.4, recovering lost performance (temporarily)
IK (2025-05-22): Speed up exhaustive evaluation https://github.com/apache/lucene/pull/14679
IL (2025-05-29): Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion https://github.com/apache/lucene/pull/14701, Speed up TermQuery https://github.com/apache/lucene/pull/14709
IM (2025-06-02): Keep evaluating conjunction one doc-at-a-time until dynamic pruning kicks in. https://github.com/apache/lucene/pull/14739
IN (2025-06-03): Merge DocAndFreqBuffer and DocAndScoreBuffer https://github.com/apache/lucene/pull/14748
IO (2025-06-05): Swap out some simple PriorityQueue subclasses for one using a Comparator https://github.com/apache/lucene/pull/14705
IP (2025-06-06): Respect minCompetitiveScore in BlockMaxConjunctionBulkScorer https://github.com/apache/lucene/pull/14751
IQ (2025-06-13): Upgrade Linux kernel from 6.14.4 to 6.15.2, curiously impacting many queries
IS (2025-06-26): Implement nextDocAndScores on CombinedFieldQuery and ConstantScoreQuery https://github.com/apache/lucene/pull/14834 https://github.com/apache/lucene/pull/14772
IT (2025-06-30): Using BatchScoreBulkScorer on CombinedFieldQuery https://github.com/apache/lucene/pull/14854
IU (2025-07-01): Use LessThan here rather than Comparator for some key PriorityQueues https://github.com/apache/lucene/pull/14871
IV (2025-07-07): Use branchless way to speedup filterCompetitiveHits https://github.com/apache/lucene/pull/14906
IW (2025-07-13): Optimize bitset to array https://github.com/apache/lucene/pull/14935
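
Several of the later entries above (HW on 2025-01-14, HX on 2025-01-23 and IW on 2025-07-13) concern storing dense blocks of postings as bit sets and turning a bit set back into an array of doc IDs. The following is a minimal, hypothetical sketch of that general idea using only the Java standard library. The class and method names (BitSetToArraySketch, encode, decode) are made up for illustration, and this is not the code from the linked Lucene pull requests.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and layout are assumptions, not Lucene's implementation. */
final class BitSetToArraySketch {

  /** Sets one bit per doc ID, at offset (doc - base), in an array of 64-bit words. */
  static long[] encode(int[] sortedDocIds, int base, int numBits) {
    long[] words = new long[(numBits + 63) >>> 6];
    for (int doc : sortedDocIds) {
      int offset = doc - base;
      words[offset >>> 6] |= 1L << (offset & 63);
    }
    return words;
  }

  /** Converts the bit set back into a sorted array of doc IDs. */
  static int[] decode(long[] words, int base) {
    int count = 0;
    for (long word : words) {
      count += Long.bitCount(word);
    }
    int[] docs = new int[count];
    int upTo = 0;
    for (int i = 0; i < words.length; i++) {
      long word = words[i];
      while (word != 0) {
        // The index of the lowest set bit gives the next doc ID in this word.
        docs[upTo++] = base + (i << 6) + Long.numberOfTrailingZeros(word);
        word &= word - 1; // clear the lowest set bit
      }
    }
    return docs;
  }

  public static void main(String[] args) {
    int[] docs = {1000, 1001, 1003, 1064, 1127};
    long[] words = encode(docs, 1000, 128);
    System.out.println(Arrays.toString(decode(words, 1000)));
  }
}
```

In this sketch a block of 128 consecutive doc IDs always costs two 64-bit words no matter how many of them are set, which is why a bit set can be attractive when a block is dense.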

Notes:

Test does not wait for merges on close (calls IW.close(false))
Analyzer is StandardAnalyzer, but we index all stop words
Test indexes full Wikipedia English XML export (1/15/2011), from a pre-created line file (one document per line), on a different drive from the one that stores the index
36 indexing threads
2048 MB RAM buffer
Java command-line: /usr/lib/jvm/java-24-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp
Java version: openjdk version "24.0.1" 2025-04-15 OpenJDK Runtime Environment (build 24.0.1) OpenJDK 64-Bit Server VM (build 24.0.1, mixed mode, sharing)
OS: Linux beast3.mikemccandless.com 6.15.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 10 Jun 2025 21:32:33 +0000 x86_64 GNU/Linux
CPU: 2 Xeon X5680, overclocked @ 4.0 Ghz (total 24 cores = 2 CPU * 6 core * 2 hyperthreads)
IO: index stored on 240 GB OCZ Vertex 3, starting on 4/25/2011 (previously on a traditional spinning-magnets hard drive (Western Digital Caviar Green, 1TB))
Source code: Indexer.java
All graphs are interactive Dygraphs
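
The notes above describe several indexing threads consuming a pre-created line file with one document per line. As a rough, hypothetical illustration of that setup (this is not the benchmark's Indexer.java, and the names LineFileFeedSketch and indexAll are assumptions), the threads can share one reader and each pull the next line until the file is exhausted. The actual indexing call is replaced by a counter here so the sketch needs only the Java standard library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; a counter stands in for real indexing work. */
final class LineFileFeedSketch {

  /** Runs numThreads workers that each pull lines (one document per line) from a shared reader. */
  static int indexAll(BufferedReader lines, int numThreads) {
    AtomicInteger docCount = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    for (int i = 0; i < numThreads; i++) {
      pool.execute(() -> {
        try {
          while (nextLine(lines) != null) {
            // Placeholder: a real indexer would parse the line and add a document here.
            docCount.incrementAndGet();
          }
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for indexing threads", e);
    }
    return docCount.get();
  }

  /** Hands out one line at a time so that no two threads receive the same document. */
  private static String nextLine(BufferedReader lines) throws IOException {
    synchronized (lines) {
      return lines.readLine();
    }
  }

  public static void main(String[] args) {
    // A StringReader stands in for the line file in this sketch.
    BufferedReader lines = new BufferedReader(new StringReader("doc one\ndoc two\ndoc three\n"));
    System.out.println("indexed " + indexAll(lines, 4) + " docs");
  }
}
```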


[last updated: 2025-07-22 09:06:55.885753; send questions to Mike McCandless]