Stockfish

mirror of https://github.com/HChaZZY/Stockfish.git synced 2025-12-26 12:06:22 +08:00

Author	SHA1	Message	Date
MinetaS	b976f0a101	Move DotProd code into optimized affine layer This patch moves the DotProd code into the propagation function which has sequential access optimization. To prove the speedup, the comparison is done without the sparse layer. With the sparse layer the effect is marginal (GCC 0.3%, LLVM/Clang 0.1%). For both tests, binary is compiled with GCC 14.1. Each test had 50 runs. Sparse layer included: ``` speedup = +0.0030 P(speedup > 0) = 1.0000 ``` Sparse layer excluded: ``` speedup = +0.0561 P(speedup > 0) = 1.0000 ``` closes https://github.com/official-stockfish/Stockfish/pull/5520 No functional change	2024-08-03 09:42:03 +02:00
mstembera	21ba32af6d	Remove m512_hadd128x16_interleave() This functionality is no longer used anywhere. closes https://github.com/official-stockfish/Stockfish/pull/5357 No functional change	2024-06-05 21:07:07 +02:00
cj5716	27eb49a221	Simplify ClippedReLU Removes some max calls Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation) Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes #5261 No functional change	2024-05-21 07:58:16 +02:00
mstembera	32e46fc47f	Remove some outdated SIMD functions Since https://github.com/official-stockfish/Stockfish/pull/4391 the x2 SIMD functions no longer serve any useful purpose. Passed non-regression STC: https://tests.stockfishchess.org/tests/view/659cf42579aa8af82b966d55 LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 67392 W: 17222 L: 17037 D: 33133 Ptnml(0-2): 207, 7668, 17762, 7851, 208 closes https://github.com/official-stockfish/Stockfish/pull/4974 No functional change	2024-01-17 18:04:29 +01:00
Disservin	444f03ee95	Update copyright year closes https://github.com/official-stockfish/Stockfish/pull/4954 No functional change	2024-01-04 15:47:10 +01:00
FauziAkram	833a2e2bc0	Cleanup comments Tests used to derive some Elo worth comments: https://tests.stockfishchess.org/tests/view/656a7f4e136acbc573555a31 https://tests.stockfishchess.org/tests/view/6585fb455457644dc984620f closes https://github.com/official-stockfish/Stockfish/pull/4945 No functional change	2023-12-31 19:54:27 +01:00
Joost VandeVondele	ec02714b62	Cleanup comments and some code reorg. passed STC: https://tests.stockfishchess.org/tests/view/6536dc7dcc309ae83955b04d LLR: 2.93 (-2.94,2.94) <-1.75,0.25> Total: 58048 W: 14693 L: 14501 D: 28854 Ptnml(0-2): 200, 6399, 15595, 6669, 161 closes https://github.com/official-stockfish/Stockfish/pull/4846 No functional change	2023-10-24 17:43:05 +02:00
cj5716	d6a5c2b085	Small formatting improvements Changes some C style casts to C++ style, and fixes some incorrect comments and variable names. closes #4845 No functional change	2023-10-24 17:42:13 +02:00
Disservin	2d0237db3f	add clang-format This introduces clang-format to enforce a consistent code style for Stockfish. Having a documented and consistent style across the code will make contributing easier for new developers, and will make larger changes to the codebase easier to make. To facilitate formatting, this PR includes a Makefile target (`make format`) to format the code, this requires clang-format (version 17 currently) to be installed locally. Installing clang-format is straightforward on most OS and distros (e.g. with https://apt.llvm.org/, brew install clang-format, etc), as this is part of quite commonly used suite of tools and compilers (llvm / clang). Additionally, a CI action is present that will verify if the code requires formatting, and comment on the PR as needed. Initially, correct formatting is not required, it will be done by maintainers as part of the merge or in later commits, but obviously this is encouraged. fixes https://github.com/official-stockfish/Stockfish/issues/3608 closes https://github.com/official-stockfish/Stockfish/pull/4790 Co-Authored-By: Joost VandeVondele <Joost.VandeVondele@gmail.com>	2023-10-22 16:06:27 +02:00
mstembera	8a912951de	Remove handcrafted MMX code too small a benefit to maintain this old target closes https://github.com/official-stockfish/Stockfish/pull/4804 No functional change	2023-10-08 07:37:01 +02:00
cj5716	fce4cc1829	Make casting styles consistent Make casting styles consistent with the rest of the code. closes https://github.com/official-stockfish/Stockfish/pull/4793 No functional change	2023-09-22 19:14:29 +02:00
mstembera	97f706ecc1	Sparse impl of affine_transform_non_ssse3() deal with the general case About a 8.6% speedup (for general arch) Results for 200 tests for each version: Base Test Diff Mean 141741 153998 -12257 StDev 2990 3042 3742 p-value: 0.999 speedup: 0.086 closes https://github.com/official-stockfish/Stockfish/pull/4786 No functional change	2023-09-22 19:03:47 +02:00
Tomasz Sobczyk	1461d861c8	Prevent usage of AVX-512 for the last layer. Add more static checks regarding the SIMD width match. STC: https://tests.stockfishchess.org/tests/view/64f5c568a9bc5a78c669e70e LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 125216 W: 31756 L: 31636 D: 61824 Ptnml(0-2): 327, 13993, 33848, 14113, 327 Fixes a bug introduced in `2f2f45f`, where with AVX-512 the weights and input to the last layer were being read out of bounds. Now AVX-512 is only used for the layers it can be used for. Additional static assertions have been added to prevent more errors like this in the future. closes https://github.com/official-stockfish/Stockfish/pull/4773 No functional change	2023-09-11 22:11:30 +02:00
Disservin	3c0e86a91e	Cleanup includes Reorder a few includes, include "position.h" where it was previously missing and apply include-what-you-use suggestions. Also make the order of the includes consistent, in the following way: 1. Related header (for .cpp files) 2. A blank line 3. C/C++ headers 4. A blank line 5. All other header files closes https://github.com/official-stockfish/Stockfish/pull/4763 fixes https://github.com/official-stockfish/Stockfish/issues/4707 No functional change	2023-09-03 08:24:51 +02:00
Gian-Carlo Pascutto	c6f62363a6	Simplify Square Clipped ReLU code. Squared numbers are never negative, so barring any wraparound there is no need to clamp to 0. From reading the code, there's no obvious way to get wraparound, so the entire operation can be simplified away. Updated original truncated code comments to be sensible. Verified by running ./stockfish bench 128 1 24 and by the following test: STC: https://tests.stockfishchess.org/tests/view/64da4db95b17f7c21c0eabe7 LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 60224 W: 15425 L: 15236 D: 29563 Ptnml(0-2): 195, 6576, 16382, 6763, 196 closes https://github.com/official-stockfish/Stockfish/pull/4751 No functional change	2023-08-22 11:14:19 +02:00
Matthies	9abef246a9	Allow compilation on Raspi (for ARMv8) Current master fails to compile for ARMv8 on Raspi cause gcc (version 10.2.1) does not like to cast between signed and unsigned vector types. This patch fixes it by using unsigned vector pointer for ARM to avoid implicite cast. closes https://github.com/official-stockfish/Stockfish/pull/4752 No functional change	2023-08-22 10:43:51 +02:00
Joost VandeVondele	8192945870	Improve testing coverage, remove unused code a) Add further tests to CI to cover most features. This uncovered a potential race in case setoption was sent between two searches. As the UCI protocol requires this sent to be went the engine is not searching, setoption now ensures that this is the case. b) Remove some unused code closes https://github.com/official-stockfish/Stockfish/pull/4730 No functional change	2023-08-11 19:27:46 +02:00
AndrovT	a6d9a302b8	Implement AffineTransformSparseInput for armv8 Implements AffineTransformSparseInput layer for the NNUE evaluation for the armv8 and armv8-dotprod architectures. We measured some nice speed improvements via 10 runs of our benchmark: armv8, Cortex-X1 : 18.5% speed-up armv8, Cortex-A76 : 13.2% speed-up armv8-dotprod, Cortex-X1 : 27.1% speed-up armv8-dotprod, Cortex-A76 : 12.1% speed-up armv8, Cortex-A72, Raspberry Pi 4 : 8.2% speed-up (thanks Torom!) closes https://github.com/official-stockfish/Stockfish/pull/4719 No functional change	2023-08-06 21:22:37 +02:00
mstembera	cb22520a9c	Remove unused return type from propagate() Also make two get_weight_index() static methods constexpr, for consistency with the other static get_hash_value() method right above. Tested for speed by user Torom (thanks). closes https://github.com/official-stockfish/Stockfish/pull/4708 No functional change	2023-07-28 23:24:42 +02:00
mstembera	1444837887	Remove inline assembly closes https://github.com/official-stockfish/Stockfish/pull/4698 No functional change	2023-07-19 21:39:51 +02:00
AndrovT	a42ab95e1f	remove large input specialization Removes unused large input specialization for dense affine transform. It has been obsolete since #4612 was merged. closes https://github.com/official-stockfish/Stockfish/pull/4684 No functional change	2023-07-16 15:12:21 +02:00
mstembera	529d3be8e2	More simplifications and cleanup in affine_transform_sparse_input.h closes https://github.com/official-stockfish/Stockfish/pull/4677 No functional change	2023-07-13 08:20:33 +02:00
mstembera	f8e65d82eb	Simplify away lookup_count. https://tests.stockfishchess.org/tests/view/64a3c1a93ee09aa549c53167 LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 32832 W: 8497 L: 8280 D: 16055 Ptnml(0-2): 80, 3544, 8967, 3729, 96 closes https://github.com/official-stockfish/Stockfish/pull/4662 No functional change	2023-07-06 23:02:11 +02:00
mstembera	80564bcfcd	Simplify lookup_count and clean up pieces(). https://github.com/official-stockfish/Stockfish/pull/4656 No functional change	2023-07-03 18:20:10 +02:00
Andreas Matthies	7922e07af8	Fix for MSVC compilation. MSVC needs two more explicit casts to compile new affine_transform_sparse_input. See https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_castsi256_ps and https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_castsi128_ps closes https://github.com/official-stockfish/Stockfish/pull/4616 No functional change	2023-06-13 08:45:25 +02:00
AndrovT	38e61663d8	Use block sparse input for the first layer. Use block sparse input for the first fully connected layer on architectures with at least SSSE3. Depending on the CPU architecture, this yields a speedup of up to 10%, e.g. ``` Result of 100 runs of 'bench 16 1 13 default depth NNUE' base (...ockfish-base) = 959345 +/- 7477 test (...ckfish-patch) = 1054340 +/- 9640 diff = +94995 +/- 3999 speedup = +0.0990 P(speedup > 0) = 1.0000 CPU: 8 x AMD Ryzen 7 5700U with Radeon Graphics Hyperthreading: on ``` Passed STC: https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c LLR: 2.93 (-2.94,2.94) <0.00,2.00> Total: 8864 W: 2479 L: 2223 D: 4162 Ptnml(0-2): 13, 829, 2504, 1061, 25 This commit includes a net with reordered weights, to increase the likelihood of block sparse inputs, but otherwise equivalent to the previous master net (nn-ea57bea57e32.nnue). Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running bench 16 1 13 varied_1000.epd depth NNUE on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3. closes https://github.com/official-stockfish/Stockfish/pull/4612 No functional change	2023-06-12 20:41:27 +02:00
Maxim Masiutin	2f2f45f9f4	Made two specializations for affine transform easier to understand. Added AVX-512 for the specialization for small inputs closes https://github.com/official-stockfish/Stockfish/pull/4502 No functional change	2023-04-10 09:29:52 +02:00
Sebastian Buchwald	43108a6198	Reuse existing functions to read/write array of network parameters closes https://github.com/official-stockfish/Stockfish/pull/4463 No functional change	2023-03-29 21:36:27 +02:00
Sebastian Buchwald	77dfcbedce	Remove unused macros closes https://github.com/official-stockfish/Stockfish/pull/4397 No functional change	2023-02-23 13:24:37 +01:00
Sebastian Buchwald	b4ad3a3c4b	Add support for ARM dot product instructions The sdot instruction computes (and accumulates) a signed dot product, which is quite handy for Stockfish's NNUE code. The instruction is optional for Armv8.2 and Armv8.3, and mandatory for Armv8.4 and above. The commit adds a new 'arm-dotprod' architecture with enabled dot product support. It also enables dot product support for the existing 'apple-silicon' architecture, which is at least Armv8.5. The following local speed test was performed on an Apple M1 with ARCH=apple-silicon. I had to remove CPU pinning from the benchmark script. However, the results were still consistent: Checking both binaries against themselves reported a speedup of +0.0000 and +0.0005, respectively. ``` Result of 100 runs ================== base (...ish.037ef3e1) = 1917997 +/- 7152 test (...fish.dotprod) = 2159682 +/- 9066 diff = +241684 +/- 2923 speedup = +0.1260 P(speedup > 0) = 1.0000 CPU: 10 x arm Hyperthreading: off ``` Fixes #4193 closes https://github.com/official-stockfish/Stockfish/pull/4400 No functional change	2023-02-23 13:22:03 +01:00
MinetaS	2c36d1e7e7	Fix overflow in add_dpbusd_epi32x2 This patch fixes 16bit overflow in _add_dpbusd_epi32x2 functions, that can be triggered in rare cases depending on the NNUE weights. While the code leads to some slowdown on affected architectures (most notably avx2), the fix is simpler than some of the other options discussed in https://github.com/official-stockfish/Stockfish/pull/4394 Code suggested by Sopel97. Result of "bench 4096 1 30 default depth nnue": \| Architecture \| master \| patch (gcc) \| patch (clang) \| \|---------------------\|-----------\|-------------\|---------------\| \| x86-64-vnni512 \| 762122798 \| 762122798 \| 762122798 \| \| x86-64-avx512 \| 769723503 \| 762122798 \| 762122798 \| \| x86-64-bmi2 \| 769723503 \| 762122798 \| 762122798 \| \| x86-64-ssse3 \| 769723503 \| 762122798 \| 762122798 \| \| x86-64 \| 762122798 \| 762122798 \| 762122798 \| Following architectures will experience ~4% slowdown due to an additional instruction in the middle of hot path: x86-64-avx512 * x86-64-bmi2 * x86-64-avx2 * x86-64-sse41-popcnt (x86-64-modern) * x86-64-ssse3 * x86-32-sse41-popcnt This patch clearly loses Elo against master with both STC and LTC. Failed non-regression STC (256bit fix only): LLR: -2.95 (-2.94,2.94) <-1.75,0.25> Total: 33528 W: 8769 L: 9049 D: 15710 Ptnml(0-2): 96, 3616, 9600, 3376, 76 https://tests.stockfishchess.org/tests/view/63e6a5b44299542b1e26a485 60+0.6 @ 30000 games: Elo: -1.67 +-1.7 (95%) LOS: 2.8% Total: 30000 W: 7848 L: 7992 D: 14160 Ptnml(0-2): 12, 2847, 9436, 2683, 22 nElo: -3.84 +-3.9 (95%) PairsRatio: 0.95 https://tests.stockfishchess.org/tests/view/63e7ac716d0e1db55f35a660 However, a test against nn-a3dc078bafc7.nnue, which is the latest "safe" network not causing the bug, passed with regular bounds. Passed STC: LLR: 2.94 (-2.94,2.94) <0.00,2.00> Total: 160456 W: 42658 L: 42175 D: 75623 Ptnml(0-2): 487, 17638, 43469, 18173, 461 https://tests.stockfishchess.org/tests/view/63e89836d62a5d02b0fa82c8 closes https://github.com/official-stockfish/Stockfish/pull/4391 closes https://github.com/official-stockfish/Stockfish/pull/4394 No functional change	2023-02-18 13:23:18 +01:00
Sebastian Buchwald	2f67409506	Remove redundant const qualifiers The const qualifiers are already implied by the constexpr qualifiers. closes https://github.com/official-stockfish/Stockfish/pull/4359 No functional change	2023-01-28 16:49:27 +01:00
Sebastian Buchwald	da5bcec481	Fix asm modifiers in add_dpbusd_epi32x2 implementations The accumulator should be an earlyclobber because it is written before all input operands are read. Otherwise, the asm code computes a wrong result if the accumulator shares a register with one of the other input operands (which happens if we pass in the same expression for the accumulator and the operand). Closes https://github.com/official-stockfish/Stockfish/pull/4339 No functional change	2023-01-22 10:51:02 +01:00
Sebastian Buchwald	4f4e652eca	Avoid unnecessary string copies closes https://github.com/official-stockfish/Stockfish/pull/4326 also fixes typo, closes https://github.com/official-stockfish/Stockfish/pull/4332 No functional change	2023-01-09 20:32:58 +01:00
Sebastian Buchwald	b60f9cc451	Update copyright years Happy New Year! closes https://github.com/official-stockfish/Stockfish/pull/4315 No functional change	2023-01-02 19:07:38 +01:00
mstembera	82bb21dc7a	Optimize AVX2 path in NNUE evaluation always selecting AffineTransform specialization for small inputs. A related patch was tested as Initially tested as a simplification STC https://tests.stockfishchess.org/tests/view/6317c3f437f41b13973d6dff LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 58072 W: 15619 L: 15425 D: 27028 Ptnml(0-2): 241, 6191, 15992, 6357, 255 Elo gain speedup test STC https://tests.stockfishchess.org/tests/view/63181c1b37f41b13973d79dc LLR: 2.94 (-2.94,2.94) <0.00,2.00> Total: 184496 W: 49922 L: 49401 D: 85173 Ptnml(0-2): 851, 19397, 51208, 19964, 828 and this patch gained in testing speedup = +0.0071 P(speedup > 0) = 1.0000 on CPU: 16 x AMD Ryzen 9 3950X closes https://github.com/official-stockfish/Stockfish/pull/4158 No functional change	2022-09-11 14:19:57 +02:00
Tomasz Sobczyk	c079acc26f	Update NNUE architecture to SFNNv5. Update network to nn-3c0aa92af1da.nnue. Architecture changes: Duplicated activation after the 1024->15 layer with squared crelu (so 15->15*2). As proposed by vondele. Trainer changes: Added bias to L1 factorization, which was previously missing (no measurable improvement but at least neutral in principle) For retraining linearly reduce lambda parameter from 1.0 at epoch 0 to 0.75 at epoch 800. reduce max_skipping_rate from 15 to 10 (compared to vondele's outstanding PR) Note: This network was trained with a ~0.8% error in quantization regarding the newly added activation function. This will be fixed in the released trainer version. Expect a trainer PR tomorrow. Note: The inference implementation cuts a corner to merge results from two activation functions. This could possibly be resolved nicer in the future. AVX2 implementation likely not necessary, but NEON is missing. First training session invocation: python3 train.py \ ../nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ ../nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 8 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --max_epochs=400 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Second training session invocation: python3 train.py \ ../nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ ../nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 8 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --start-lambda=1.0 \ --end-lambda=0.75 \ --gamma=0.995 \ --lr=4.375e-4 \ --max_epochs=800 \ --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp367/nn-exp367-run3-epoch399.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Passed STC: LLR: 2.95 (-2.94,2.94) <0.00,2.50> Total: 27288 W: 7445 L: 7178 D: 12665 Ptnml(0-2): 159, 3002, 7054, 3271, 158 https://tests.stockfishchess.org/tests/view/627e8c001919125939623644 Passed LTC: LLR: 2.95 (-2.94,2.94) <0.50,3.00> Total: 21792 W: 5969 L: 5727 D: 10096 Ptnml(0-2): 25, 2152, 6294, 2406, 19 https://tests.stockfishchess.org/tests/view/627f2a855734b18b2e2ece47 closes https://github.com/official-stockfish/Stockfish/pull/4020 Bench: 6481017	2022-05-14 12:47:22 +02:00
mstembera	5f781d366e	Clean up and simplify some nnue code. Remove some unnecessary code and it's execution during inference. Also the change on line 49 in nnue_architecture.h results in a more efficient SIMD code path through ClippedReLU::propagate(). passed STC: https://tests.stockfishchess.org/tests/view/6217d3bfda649bba32ef25d5 LLR: 2.94 (-2.94,2.94) <-2.25,0.25> Total: 12056 W: 3281 L: 3092 D: 5683 Ptnml(0-2): 55, 1213, 3312, 1384, 64 passed STC SMP: https://tests.stockfishchess.org/tests/view/6217f344da649bba32ef295e LLR: 2.94 (-2.94,2.94) <-2.25,0.25> Total: 27376 W: 7295 L: 7137 D: 12944 Ptnml(0-2): 52, 2859, 7715, 3003, 59 closes https://github.com/official-stockfish/Stockfish/pull/3944 No functional change bench: 6820724	2022-02-25 08:37:57 +01:00
Tomasz Sobczyk	cb9c2594fc	Update architecture to "SFNNv4". Update network to nn-6877cd24400e.nnue. Architecture: The diagram of the "SFNNv4" architecture: https://user-images.githubusercontent.com/8037982/153455685-cbe3a038-e158-4481-844d-9d5fccf5c33a.png The most important architectural changes are the following: * 1024x2 [activated] neurons are pairwise, elementwise multiplied (not quite pairwise due to implementation details, see diagram), which introduces a non-linearity that exhibits similar benefits to previously tested sigmoid activation (quantmoid4), while being slightly faster. * The following layer has therefore 2x less inputs, which we compensate by having 2 more outputs. It is possible that reducing the number of outputs might be beneficial (as we had it as low as 8 before). The layer is now 1024->16. * The 16 outputs are split into 15 and 1. The 1-wide output is added to the network output (after some necessary scaling due to quantization differences). The 15-wide is activated and follows the usual path through a set of linear layers. The additional 1-wide output is at least neutral, but has shown a slightly positive trend in training compared to networks without it (all 16 outputs through the usual path), and allows possibly an additional stage of lazy evaluation to be introduced in the future. Additionally, the inference code was rewritten and no longer uses a recursive implementation. This was necessitated by the splitting of the 16-wide intermediate result into two, which was impossible to do with the old implementation with ugly hacks. This is hopefully overall for the better. First session: The first session was training a network from scratch (random initialization). The exact trainer used was slightly different (older) from the one used in the second session, but it should not have a measurable effect. The purpose of this session is to establish a strong network base for the second session. Small deviations in strength do not harm the learnability in the second session. The training was done using the following command: python3 train.py \ /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 4 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --gamma=0.992 \ --lr=8.75e-4 \ --max_epochs=400 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Every 20th net was saved and its playing strength measured against some baseline at 25k nodes per move with pure NNUE evaluation (modified binary). The exact setup is not important as long as it's consistent. The purpose is to sift good candidates from bad ones. The dataset can be found https://drive.google.com/file/d/1UQdZN_LWQ265spwTBwDKo0t1WjSJKvWY/view Second session: The second training session was done starting from the best network (as determined by strength testing) from the first session. It is important that it's resumed from a .pt model and NOT a .ckpt model. The conversion can be performed directly using serialize.py The LR schedule was modified to use gamma=0.995 instead of gamma=0.992 and LR=4.375e-4 instead of LR=8.75e-4 to flatten the LR curve and allow for longer training. The training was then running for 800 epochs instead of 400 (though it's possibly mostly noise after around epoch 600). The training was done using the following command: The training was done using the following command: python3 train.py \ /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 4 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --gamma=0.995 \ --lr=4.375e-4 \ --max_epochs=800 \ --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp295/nn-epoch399.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$run_id In particular note that we now use lambda=1.0 instead of lambda=0.8 (previous nets), because tests show that WDL-skipping introduced by vondele performs better with lambda=1.0. Nets were being saved every 20th epoch. In total 16 runs were made with these settings and the best nets chosen according to playing strength at 25k nodes per move with pure NNUE evaluation - these are the 4 nets that have been put on fishtest. The dataset can be found either at ftp://ftp.chessdb.cn/pub/sopel/data_sf/T60T70wIsRightFarseerT60T74T75T76.binpack in its entirety (download might be painfully slow because hosted in China) or can be assembled in the following way: Get the `5640ad48ae/script/interleave_binpacks.py` script. Download T60T70wIsRightFarseer.binpack https://drive.google.com/file/d/1_sQoWBl31WAxNXma2v45004CIVltytP8/view Download farseerT74.binpack http://trainingdata.farseer.org/T74-May13-End.7z Download farseerT75.binpack http://trainingdata.farseer.org/T75-June3rd-End.7z Download farseerT76.binpack http://trainingdata.farseer.org/T76-Nov10th-End.7z Run python3 interleave_binpacks.py T60T70wIsRightFarseer.binpack farseerT74.binpack farseerT75.binpack farseerT76.binpack T60T70wIsRightFarseerT60T74T75T76.binpack Tests: STC: https://tests.stockfishchess.org/tests/view/6203fb85d71106ed12a407b7 LLR: 2.94 (-2.94,2.94) <0.00,2.50> Total: 16952 W: 4775 L: 4521 D: 7656 Ptnml(0-2): 133, 1818, 4318, 2076, 131 LTC: https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85 LLR: 2.94 (-2.94,2.94) <0.50,3.00> Total: 14944 W: 4138 L: 3907 D: 6899 Ptnml(0-2): 21, 1499, 4202, 1728, 22 closes https://github.com/official-stockfish/Stockfish/pull/3927 Bench: 4919707	2022-02-10 19:54:31 +01:00
Brad Knox	ad926d34c0	Update copyright years Happy New Year! closes https://github.com/official-stockfish/Stockfish/pull/3881 No functional change	2022-01-06 15:45:45 +01:00
Tomasz Sobczyk	4766dfc395	Optimize FT activation and affine transform for NEON. This patch optimizes the NEON implementation in two ways. The activation layer after the feature transformer is rewritten to make it easier for the compiler to see through dependencies and unroll. This in itself is a minimal, but a positive improvement. Other architectures could benefit from this too in the future. This is not an algorithmic change. The affine transform for large matrices (first layer after FT) on NEON now utilizes the same optimized code path as >=SSSE3, which makes the memory accesses more sequential and makes better use of the available registers, which allows for code that has longer dependency chains. Benchmarks from Redshift#161, profile-build with apple clang george@Georges-MacBook-Air nets % ./stockfish-b82d93 bench 2>&1 \| tail -4 (current master) =========================== Total time (ms) : 2167 Nodes searched : 4667742 Nodes/second : 2154011 george@Georges-MacBook-Air nets % ./stockfish-7377b8 bench 2>&1 \| tail -4 (this patch) =========================== Total time (ms) : 1842 Nodes searched : 4667742 Nodes/second : 2534061 This is a solid 18% improvement overall, larger in a bench with NNUE-only, not mixed. Improvement is also observed on armv7-neon (Raspberry Pi, and older phones), around 5% speedup. No changes for architectures other than NEON. closes https://github.com/official-stockfish/Stockfish/pull/3837 No functional changes.	2021-12-07 18:08:54 +01:00
Tomasz Sobczyk	18dcf1f097	Optimize and tidy up affine transform code. The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a mature approach to make the affine transform code faster, more readable, and easier to maintain should the layer sizes change again. The following changes were made: * ClippedReLU always produces a multiple of 32 outputs. This is about as good of a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite. * All self-contained simd helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693, https://godbolt.org/z/da76fY1n7 * AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in both is significantly reduced, as these two are impossible to nicely combine into one. 1) The first specialization is for cases when there's >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. Furthermore, it has higher theoretical throughput because less loads are needed in the hot path, requiring only a fixed amount of instructions for horizontal additions at the end, which are amortized by the large number of inputs. 2) The second specialization is made to handle smaller layers where performance is still necessary but edge cases need to be handled. AVX512 implementation for this was ommited by mistake, a remnant from the temporary implementation for the new... This could be easily reintroduced if needed. A slightly more detailed description of both implementations is in the code. Overall it should be a minor speedup, as shown on fishtest: passed STC: LLR: 2.96 (-2.94,2.94) <-0.50,2.50> Total: 51520 W: 4074 L: 3888 D: 43558 Ptnml(0-2): 111, 3136, 19097, 3288, 128 and various tests shown in the pull request closes https://github.com/official-stockfish/Stockfish/pull/3663 No functional change	2021-08-20 08:50:25 +02:00
Tomasz Sobczyk	d61d38586e	New NNUE architecture and net Introduces a new NNUE network architecture and associated network parameters The summary of the changes: * Position for each perspective mirrored such that the king is on e..h files. Cuts the feature transformer size in half, while preserving enough knowledge to be good. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.b40q4rb1w7on. * The number of neurons after the feature transformer increased two-fold, to 1024x2. This is possibly mostly due to the now very optimized feature transformer update code. * The number of neurons after the second layer is reduced from 16 to 8, to reduce the speed impact. This, perhaps surprisingly, doesn't harm the strength much. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.6qkocr97fezq The AffineTransform code did not work out-of-the box with the smaller number of neurons after the second layer, so some temporary changes have been made to add a special case for InputDimensions == 8. Also additional 0 padding is added to the output for some archs that cannot process inputs by <=8 (SSE2, NEON). VNNI uses an implementation that can keep all outputs in the registers while reducing the number of loads by 3 for each 16 inputs, thanks to the reduced number of output neurons. However GCC is particularily bad at optimization here (and perhaps why the current way the affine transform is done even passed sprt) (see https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit# for details) and more work will be done on this in the following days. I expect the current VNNI implementation to be improved and extended to other architectures. The network was trained with a slightly modified version of the pytorch trainer (https://github.com/glinscott/nnue-pytorch); the changes are in https://github.com/glinscott/nnue-pytorch/pull/143 The training utilized 2 datasets. dataset A - https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing dataset B - as described in `ba01f4b954` The training process was as following: train on dataset A for 350 epochs, take the best net in terms of elo at 20k nodes per move (it's fine to take anything from later stages of training). convert the .ckpt to .pt --resume-from-model from the .pt file, train on dataset B for <600 epochs, take the best net. Lambda=0.8, applied before the loss function. The first training command: python3 train.py \ ../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \ ../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \ --gpus "$3," \ --threads 1 \ --num-workers 1 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --smart-fen-skipping \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --max_epochs=600 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 The second training command: python3 serialize.py \ --features=HalfKAv2_hm^ \ ../nnue-pytorch-training/experiment_131/run_6/default/version_0/checkpoints/epoch-499.ckpt \ ../nnue-pytorch-training/experiment_$1/base/base.pt python3 train.py \ ../nnue-pytorch-training/data/michael_commit_b94a65.binpack \ ../nnue-pytorch-training/data/michael_commit_b94a65.binpack \ --gpus "$3," \ --threads 1 \ --num-workers 1 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --smart-fen-skipping \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=0.8 \ --max_epochs=600 \ --resume-from-model ../nnue-pytorch-training/experiment_$1/base/base.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 STC: https://tests.stockfishchess.org/tests/view/611120b32a8a49ac5be798c4 LLR: 2.97 (-2.94,2.94) <-0.50,2.50> Total: 22480 W: 2434 L: 2251 D: 17795 Ptnml(0-2): 101, 1736, 7410, 1865, 128 LTC: https://tests.stockfishchess.org/tests/view/611152b32a8a49ac5be798ea LLR: 2.93 (-2.94,2.94) <0.50,3.50> Total: 9776 W: 442 L: 333 D: 9001 Ptnml(0-2): 5, 295, 4180, 402, 6 closes https://github.com/official-stockfish/Stockfish/pull/3646 bench: 5189338	2021-08-15 12:05:43 +02:00
Tomasz Sobczyk	26edf9534a	Avoid unnecessary stores in the affine transform This patch improves the codegen in the AffineTransform::forward function for architectures >=SSSE3. Current code works directly on memory and the compiler cannot see that the stores through outptr do not alias the loads through weights and input32. The solution implemented is to perform the affine transform with local variables as accumulators and only store the result to memory at the end. The number of accumulators required is OutputDimensions / OutputSimdWidth, which means that for the 1024->16 affine transform it requires 4 registers with SSSE3, 2 with AVX2, 1 with AVX512. It also cuts the number of stores required by NumRegs * 256 for each node evaluated. The local accumulators are expected to be assigned to registers, but even if this cannot be done in some case due to register pressure it will help the compiler to see that there is no aliasing between the loads and stores and may still result in better codegen. See https://godbolt.org/z/59aTKbbYc for codegen comparison. passed STC: LLR: 2.94 (-2.94,2.94) <-0.50,2.50> Total: 140328 W: 10635 L: 10358 D: 119335 Ptnml(0-2): 302, 8339, 52636, 8554, 333 closes https://github.com/official-stockfish/Stockfish/pull/3634 No functional change	2021-07-30 17:15:52 +02:00
Tomasz Sobczyk	e8d64af123	New NNUE architecture and net Introduces a new NNUE network architecture and associated network parameters, as obtained by a new pytorch trainer. The network is already very strong at short TC, without regression at longer TC, and has potential for further improvements. https://tests.stockfishchess.org/tests/view/60a159c65085663412d0921d TC: 10s+0.1s, 1 thread ELO: 21.74 +-3.4 (95%) LOS: 100.0% Total: 10000 W: 1559 L: 934 D: 7507 Ptnml(0-2): 38, 701, 2972, 1176, 113 https://tests.stockfishchess.org/tests/view/60a187005085663412d0925b TC: 60s+0.6s, 1 thread ELO: 5.85 +-1.7 (95%) LOS: 100.0% Total: 20000 W: 1381 L: 1044 D: 17575 Ptnml(0-2): 27, 885, 7864, 1172, 52 https://tests.stockfishchess.org/tests/view/60a2beede229097940a03806 TC: 20s+0.2s, 8 threads LLR: 2.93 (-2.94,2.94) <0.50,3.50> Total: 34272 W: 1610 L: 1452 D: 31210 Ptnml(0-2): 30, 1285, 14350, 1439, 32 https://tests.stockfishchess.org/tests/view/60a2d687e229097940a03c72 TC: 60s+0.6s, 8 threads LLR: 2.94 (-2.94,2.94) <-2.50,0.50> Total: 45544 W: 1262 L: 1214 D: 43068 Ptnml(0-2): 12, 1129, 20442, 1177, 12 The network has been trained (by vondele) using the https://github.com/glinscott/nnue-pytorch/ trainer (started by glinscott), specifically the branch https://github.com/Sopel97/nnue-pytorch/tree/experiment_56. The data used are in 64 billion positions (193GB total) generated and scored with the current master net d8: https://drive.google.com/file/d/1hOOYSDKgOOp38ZmD0N4DV82TOLHzjUiF/view?usp=sharing d9: https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing d10: https://drive.google.com/file/d/1ZC5upzBYMmMj1gMYCkt6rCxQG0GnO3Kk/view?usp=sharing fishtest_d9: https://drive.google.com/file/d/1GQHt0oNgKaHazwJFTRbXhlCN3FbUedFq/view?usp=sharing This network also contains a few architectural changes with respect to the current master: Size changed from 256x2-32-32-1 to 512x2-16-32-1 ~15-20% slower ~2x larger adds a special path for 16 valued ClippedReLU fixes affine transform code for 16 inputs/outputs, buy using InputDimensions instead of PaddedInputDimensions this is safe now because the inputs are processed in groups of 4 in the current affine transform code The feature set changed from HalfKP to HalfKAv2 Includes information about the kings like HalfKA Packs king features better, resulting in 8% size reduction compared to HalfKA The board is flipped for the black's perspective, instead of rotated like in the current master PSQT values for each feature the feature transformer now outputs a part that is fowarded directly to the output and allows learning piece values more directly than the previous network architecture. The effect is visible for high imbalance positions, where the current master network outputs evaluations skewed towards zero. 8 PSQT values per feature, chosen based on (popcount(pos.pieces()) - 1) / 4 initialized to classical material values on the start of the training 8 subnetworks (512x2->16->32->1), chosen based on (popcount(pos.pieces()) - 1) / 4 only one subnetwork is evaluated for any position, no or marginal speed loss A diagram of the network is available: https://user-images.githubusercontent.com/8037982/118656988-553a1700-b7eb-11eb-82ef-56a11cbebbf2.png A more complete description: https://github.com/glinscott/nnue-pytorch/blob/master/docs/nnue.md closes https://github.com/official-stockfish/Stockfish/pull/3474 Bench: 3806488	2021-05-18 18:06:23 +02:00
Stéphane Nicolet	f90274d8ce	Small clean-ups - Comment for Countemove pruning -> Continuation history - Fix comment in input_slice.h - Shorter lines in Makefile - Comment for scale factor - Fix comment for pinners in see_ge() - Change Thread.id() signature to size_t - Trailing space in reprosearch.sh - Add Douglas Matos Gomes to the AUTHORS file - Introduce comment for undo_null_move() - Use Stockfish coding style for export_net() - Change date in AUTHORS file closes https://github.com/official-stockfish/Stockfish/pull/3416 No functional change	2021-05-17 10:47:14 +02:00
Tomasz Sobczyk	58054fd0fa	Exporting the currently loaded network file This PR adds an ability to export any currently loaded network. The export_net command now takes an optional filename parameter. If the loaded net is not the embedded net the filename parameter is required. Two changes were required to support this: * the "architecture" string, which is really just a some kind of description in the net, is now saved into netDescription on load and correctly saved on export. * the AffineTransform scrambles weights for some architectures and sparsifies them, such that retrieving the index is hard. This is solved by having a temporary scrambled<->unscrambled index lookup table when loading the network, and the actual index is saved for each individual weight that makes it to canSaturate16. This increases the size of the canSaturate16 entries by 6 bytes. closes https://github.com/official-stockfish/Stockfish/pull/3456 No functional change	2021-05-11 19:36:11 +02:00
Tomasz Sobczyk	fbbd4adc3c	Unify naming convention of the NNUE code matches the rest of the stockfish code base closes https://github.com/official-stockfish/Stockfish/pull/3437 No functional change	2021-04-24 12:49:29 +02:00
Stéphane Nicolet	83eac08e75	Small cleanups (march 2021) With help of @BM123499, @mstembera, @gvreuls, @noobpwnftw and @Fanael Thanks! Closes https://github.com/official-stockfish/Stockfish/pull/3405 No functional change	2021-03-24 17:11:06 +01:00
Dieter Dobbelaere	7ffae17f85	Add Stockfish namespace. fixes #3350 and is a small cleanup that might make it easier to use SF in separate projects, like a NNUE trainer or similar. closes https://github.com/official-stockfish/Stockfish/pull/3370 No functional change.	2021-03-07 14:26:54 +01:00

1 2

70 Commits