For some systems std::aligned_alloc is not available even if c++17 is specified.
Wrap the function and use specific solutions.
Update macosx-version-min to the required minimum.
No functional change.
for users that set the needed privilige "Lock Pages in Memory"
large pages will be automatically enabled (see Readme.md).
This expert setting might improve speed, 5% - 30%, depending
on the hardware, the number of threads and hash size. More for
large hashes, large number of threads and NUMA. If the operating
system can not allocate large pages (easier after a reboot), default
allocation is used automatically. The engine log provides details.
closes https://github.com/official-stockfish/Stockfish/pull/2656
fixes https://github.com/official-stockfish/Stockfish/issues/2619
No functional change
There is an ambiguity between global and std clamp implementations when compiling in c++17,
and on certain toolchains that are not strictly conforming to c++11.
This is solved by putting our clamp implementation in a namespace.
closes https://github.com/official-stockfish/Stockfish/pull/2572
No functional change.
Align the TT allocation by 2M to make it huge page friendly and advise the
kernel to use huge pages.
Benchmarks on my i7-8700K (6C/12T) box: (3 runs per bench per config)
vanilla (nps) hugepages (nps) avg
==================================================================================
bench | 3012490 3024364 3036331 3071052 3067544 3071052 +1.5%
bench 16 12 20 | 19237932 19050166 19085315 19266346 19207025 19548758 +1.1%
bench 16384 12 20 | 18182313 18371581 18336838 19381275 19738012 19620225 +7.0%
On my box, huge pages have a significant perf impact when using a big
hash size. They also speed up TT initialization big time:
vanilla (s) huge pages (s) speed-up
=======================================================================
time stockfish bench 16384 1 1 | 5.37 1.48 3.6x
In practice, huge pages with auto-defrag may always be enabled in the
system, in which case this patch has no effect. This
depends on the values in /sys/kernel/mm/transparent_hugepage/enabled
and /sys/kernel/mm/transparent_hugepage/defrag.
closes https://github.com/official-stockfish/Stockfish/pull/2463
No functional change
This patch shows a description of the compiler used to compile Stockfish,
when starting from the console.
Usage:
```
./stockfish
compiler
```
Example of output:
```
Stockfish 120120 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
Compiled by clang++ 9.0.0 on Apple
__VERSION__ macro expands to: 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.38)
```
No functional change
Revert the previous patch now that the binary for the super-final
of TCEC season 16 has been sent.
Maybe the feature of showing the name of compiler will be added to the
master branch in the future. But we may use a cleaner way to code it, see
some ideas using the Makefile approach at the end of pull request #2327 :
https://github.com/official-stockfish/Stockfish/pull/2327
Bench: 3618154
This patch shows a description of the compiler used to compile Stockfish,
when starting from the console.
Usage:
```
./stockfish
compiler
```
Example of output:
```
Stockfish 240919 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
Compiled by clang++ 9.0.0 on Apple
__VERSION__ macro expands to: 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.38)
```
No functional change
Preparation commit for the upcoming Stockfish 10 version, giving a chance to catch last minute feature bugs and evaluation regression during the one-week code freeze period. Also changing the copyright dates to include 2019.
No functional change
This is a patch to fix issue #1498, switching the time management variables
to 64 bits to avoid overflow of time variables after 25 days.
There was a bug in Stockfish 9 causing the output to be wrong after
2^31 milliseconds search. Here is a long run from the starting position:
info depth 64 seldepth 87 multipv 1 score cp 23 nodes 13928920239402
nps 0 tbhits 0 time -504995523 pv g1f3 d7d5 d2d4 g8f6 c2c4 d5c4 e2e3 e7e6 f1c4
c7c5 e1g1 b8c6 d4c5 d8d1 f1d1 f8c5 c4e2 e8g8 a2a3 c5e7 b2b4 f8d8 b1d2 b7b6 c1b2
c8b7 a1c1 a8c8 c1c2 c6e5 d1c1 c8c2 c1c2 e5f3 d2f3 a7a5 b4b5 e7c5 f3d4 d8c8 d4b3
c5d6 c2c8 b7c8 b3d2 c8b7 d2c4 d6c5 e2f3 b7d5 f3d5 e6d5 c4e5 a5a4 e5d3 f6e4 d3c5
e4c5 b2d4 c5e4 d4b6 e4d6 g2g4 d6b5 b6c5 b5c7 g1g2 c7e6 c5d6 g7g6
We check at compile time that the TimePoint type is exactly 64 bits long for
the compiler (TimePoint is our alias in Stockfish for std::chrono::milliseconds
-- it is a signed integer type of at least 45 bits according to the C++ standard,
but will most probably be implemented as a 64 bits signed integer on modern
compilers), and we use this TimePoint type consistently across the code.
Bug report by user "fischerandom" on the TCEC chat (thanks), and the
patch includes code and suggestions by user "WOnder93" and Ronald de Man.
Fixes issue: https://github.com/official-stockfish/Stockfish/issues/1498
Closes pull request: https://github.com/official-stockfish/Stockfish/pull/1510
No functional change.
Under Windows it is not possible for a process to run on more than one
logical processor group. This usually means to be limited to use max 64
cores. To overcome this, some special platform specific API should be
called to set group affinity for each thread. Original code from Texel by
Peter sterlund.
Tested by Jean-Paul Vael on a Xeon E7-8890 v4 with 88 threads and confimed
speed up between 44 and 88 threads is about 30%, as expected.
No functional change.
Allow to specifiy the log file name, this comes
handy in case of self-matches so that each SF
instance writes into a different log file.
No functional change.
Funny enough, gcc __builtin_prefetch() expects
already a void*, instead Windows's _mm_prefetch()
requires a char*.
The patch allows to remove ugly casts from caller
sites.
No functional change.
Import C++11 branch from:
https://github.com/mcostalba/Stockfish/tree/c++11
The version imported is teh last one as of today:
6670e93e50
Branch is fully equivalent with master but syzygy
tablebases that are missing (but will be added with
next commit).
bench: 8080602
This patch replaces RKISS by a simpler and faster PRNG, xorshift64* proposed
by S. Vigna (2014). It is extremely simple, has a large enough period for
Stockfish's needs (2^64), requires no warming-up (allowing such code to be
removed), and offers slightly better randomness than MT19937.
Paper: http://xorshift.di.unimi.it/
Reference source code (public domain):
http://xorshift.di.unimi.it/xorshift64star.c
The patch also simplifies how init_magics() searches for magics:
- Old logic: seed the PRNG always with the same seed,
then use optimized bit rotations to tailor the RNG sequence per rank.
- New logic: seed the PRNG with an optimized seed per rank.
This has two advantages:
1. Less code and less computation to perform during magics search (not ROTL).
2. More choices for random sequence tuning. The old logic only let us choose
from 4096 bit rotation pairs. With the new one, we can look for the best seeds
among 2^64 values. Indeed, the set of seeds[][] provided in the patch reduces
the effort needed to find the magics:
64-bit SF:
Old logic -> 5,783,789 rand64() calls needed to find the magics
New logic -> 4,420,086 calls
32-bit SF:
Old logic -> 2,175,518 calls
New logic -> 1,895,955 calls
In the 64-bit case, init_magics() take 25 ms less to complete (Intel Core i5).
Finally, when playing with strength handicap, non-determinism is achieved
by setting the seed of the static RNG only once. Afterwards, there is no need
to skip output values.
The bench only changes because the Zobrist keys are now different (since they
are random numbers straight out of the PRNG).
The RNG seed has been carefully chosen so that the
resulting Zobrist keys are particularly well-behaved:
1. All triplets of XORed keys are unique, implying that it
would take at least 7 keys to find a 64-bit collision
(test suggested by ceebo)
2. All pairs of XORed keys are unique modulo 2^32
3. The cardinality of { (key1 ^ key2) >> 48 } is as close
as possible to the maximum (65536)
Point 2 aims at ensuring a good distribution among the bits
that determine an TT entry's cluster, likewise point 3
among the bits that form the TT entry's key16 inside a
cluster.
Details:
Bitset card(key1^key2)
------ ---------------
RKISS
key16 64894 = 99.020% of theoretical maximum
low18 180117 = 99.293%
low32 305362 = 99.997%
Xorshift64*, old seed
key16 64918 = 99.057%
low18 179994 = 99.225%
low32 305350 = 99.993%
Xorshift64*, new seed
key16 65027 = 99.223%
low18 181118 = 99.845%
low32 305371 = 100.000%
Bench: 9324905
Resolves#148
Set threads number always to 1 at startup and let the
user explicitly to chose the number of threads.
Also preserve the useful behavior of automatically set
"Min Split Depth" according to the requested threads,
indeed this parameter is too technical for a casual user,
so, when left to zero, we set it on a sensible value.
No functional change
And #ifdef instead of #if defined
This is more standard form (see for example iostream file).
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>