Improve performance on NUMA systems
Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.

This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all the machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. The old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr, etc.

Windows 7 and Windows 10 are partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.

-----------------

A new UCI option `NumaPolicy` is introduced. It can take the following values:

```
system     - gathers NUMA node information from the system (lscpu or the Windows API)
             and binds each thread to a single NUMA node
none       - assumes there is 1 NUMA node, never binds threads
auto       - the default value; depends on the number of set threads and NUMA nodes,
             and only enables binding on multinode systems when the number of threads
             reaches a threshold (dependent on node size and count)
[[custom]] -
             // ':'-separated numa nodes
             // ','-separated cpu indices
             // supports "first-last" range syntax for cpu indices,
             //   for example '0-15,32-47:16-31,48-63'
```

Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.

The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it strives to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.

Special care is taken that the maximum memory usage on systems that do not require memory replication stays as before, that is, unnecessary copies are avoided.

On Linux the process's processor affinity is respected. This means that if you, for example, use taskset to restrict Stockfish to a single NUMA node, then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.

-----------------

We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on Linux, or appropriate custom allocators on Windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on the first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on Linux, but results may vary.

MacOS is not supported, because AFAIK it's not affected, and an implementation would be problematic anyway.

Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022, NUMA nodes are split such that they cannot span processor groups, because before Windows 11/Server 2022 it's not possible to set a thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinities spanning processor groups, so this splitting is not done and the behaviour is pretty much like on Linux.

Linux is supported **without** a libnuma requirement; `lscpu` is expected to be available.

-----------------

Passed 60+1 @ 256t 16000MB hash:
https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
```
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0
```

Passed SMP STC:
https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
```
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57
```

Passed STC:
https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
```
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416
```

fixes #5253
closes https://github.com/official-stockfish/Stockfish/pull/5285

No functional change
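To make the `[[custom]]` syntax above concrete, here is a small, self-contained C++ sketch that parses such a policy string into per-node CPU sets. It only illustrates the documented format and is not the parser shipped in this patch; the names `CpuIndex` and `parse_custom_numa_policy` are invented for the example, and error handling is omitted.

```cpp
// Illustrative sketch only: parses a "custom" NumaPolicy string of the form
// "<cpus>:<cpus>:..." where <cpus> is a ','-separated list of CPU indices
// that may use "first-last" ranges, e.g. "0-15,32-47:16-31,48-63".
// The names CpuIndex and parse_custom_numa_policy are invented for this example.
#include <cstddef>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using CpuIndex = std::size_t;

std::vector<std::set<CpuIndex>> parse_custom_numa_policy(const std::string& s) {
    std::vector<std::set<CpuIndex>> nodes;
    std::istringstream              nodeStream(s);
    std::string                     nodeSpec;
    while (std::getline(nodeStream, nodeSpec, ':'))  // ':' separates NUMA nodes
    {
        std::set<CpuIndex> cpus;
        std::istringstream cpuStream(nodeSpec);
        std::string        token;
        while (std::getline(cpuStream, token, ','))  // ',' separates CPU entries
        {
            const auto     dash  = token.find('-');
            const CpuIndex first = std::stoull(token.substr(0, dash));
            const CpuIndex last =
              dash == std::string::npos ? first : std::stoull(token.substr(dash + 1));
            for (CpuIndex c = first; c <= last; ++c)  // expand "first-last" ranges
                cpus.insert(c);
        }
        nodes.push_back(std::move(cpus));
    }
    return nodes;
}

int main() {
    // Hypothetical 2-node layout: CPUs 0-15,32-47 on node 0 and 16-31,48-63 on node 1.
    const auto nodes = parse_custom_numa_policy("0-15,32-47:16-31,48-63");
    for (std::size_t n = 0; n < nodes.size(); ++n)
        std::cout << "node " << n << ": " << nodes[n].size() << " cpus\n";
}
```

Run on the string from the option description, this prints 32 CPUs for each of the two nodes.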
committed by Joost VandeVondele
parent b0287dcb1c
commit a169c78b6d

Changed file: src/thread.h (91 changed lines)
src/thread.h

```diff
@@ -26,10 +26,12 @@
 #include <memory>
 #include <mutex>
 #include <vector>
+#include <functional>

 #include "position.h"
 #include "search.h"
 #include "thread_win32_osx.h"
+#include "numa.h"

 namespace Stockfish {

@@ -37,6 +39,32 @@ namespace Stockfish {
 class OptionsMap;
 using Value = int;

+// Sometimes we don't want to actually bind the threads, but the recipient still
+// needs to think it runs on *some* NUMA node, such that it can access structures
+// that rely on NUMA node knowledge. This class encapsulates this optional process
+// such that the recipient does not need to know whether the binding happened or not.
+class OptionalThreadToNumaNodeBinder {
+   public:
+    OptionalThreadToNumaNodeBinder(NumaIndex n) :
+        numaConfig(nullptr),
+        numaId(n) {}
+
+    OptionalThreadToNumaNodeBinder(const NumaConfig& cfg, NumaIndex n) :
+        numaConfig(&cfg),
+        numaId(n) {}
+
+    NumaReplicatedAccessToken operator()() const {
+        if (numaConfig != nullptr)
+            return numaConfig->bind_current_thread_to_numa_node(numaId);
+        else
+            return NumaReplicatedAccessToken(numaId);
+    }
+
+   private:
+    const NumaConfig* numaConfig;
+    NumaIndex         numaId;
+};
+
 // Abstraction of a thread. It contains a pointer to the worker and a native thread.
 // After construction, the native thread is started with idle_loop()
 // waiting for a signal to start searching.
@@ -44,22 +72,35 @@ using Value = int;
 // the search is finished, it goes back to idle_loop() waiting for a new signal.
 class Thread {
    public:
-    Thread(Search::SharedState&, std::unique_ptr<Search::ISearchManager>, size_t);
+    Thread(Search::SharedState&,
+           std::unique_ptr<Search::ISearchManager>,
+           size_t,
+           OptionalThreadToNumaNodeBinder);
     virtual ~Thread();

-    void idle_loop();
-    void start_searching();
+    void idle_loop();
+    void start_searching();
+    void clear_worker();
+    void run_custom_job(std::function<void()> f);
+
+    // Thread has been slightly altered to allow running custom jobs, so
+    // this name is no longer correct. However, this class (and ThreadPool)
+    // require further work to make them properly generic while maintaining
+    // appropriate specificity regarding search, from the point of view of an
+    // outside user, so renaming of this function is left for whenever that happens.
     void wait_for_search_finished();
     size_t id() const { return idx; }

     std::unique_ptr<Search::Worker> worker;
+    std::function<void()>           jobFunc;

    private:
-    std::mutex              mutex;
-    std::condition_variable cv;
-    size_t                  idx, nthreads;
-    bool                    exit = false, searching = true;  // Set before starting std::thread
-    NativeThread            stdThread;
+    std::mutex                mutex;
+    std::condition_variable   cv;
+    size_t                    idx, nthreads;
+    bool                      exit = false, searching = true;  // Set before starting std::thread
+    NativeThread              stdThread;
+    NumaReplicatedAccessToken numaAccessToken;
 };


@@ -67,31 +108,44 @@ class Thread {
 // parking and, most importantly, launching a thread. All the access to threads
 // is done through this class.
 class ThreadPool {
-
    public:
+    ThreadPool() {}
+
     ~ThreadPool() {
         // destroy any existing thread(s)
         if (threads.size() > 0)
         {
             main_thread()->wait_for_search_finished();

-            while (threads.size() > 0)
-                delete threads.back(), threads.pop_back();
+            threads.clear();
         }
     }

-    void start_thinking(const OptionsMap&, Position&, StateListPtr&, Search::LimitsType);
-    void clear();
-    void set(Search::SharedState, const Search::SearchManager::UpdateContext&);
+    ThreadPool(const ThreadPool&)            = delete;
+    ThreadPool(ThreadPool&&)                 = delete;
+
+    ThreadPool& operator=(const ThreadPool&) = delete;
+    ThreadPool& operator=(ThreadPool&&)      = delete;
+
+    void   start_thinking(const OptionsMap&, Position&, StateListPtr&, Search::LimitsType);
+    void   run_on_thread(size_t threadId, std::function<void()> f);
+    void   wait_on_thread(size_t threadId);
+    size_t num_threads() const;
+    void   clear();
+    void   set(const NumaConfig& numaConfig,
+               Search::SharedState,
+               const Search::SearchManager::UpdateContext&);

     Search::SearchManager* main_manager();
-    Thread*                main_thread() const { return threads.front(); }
+    Thread*                main_thread() const { return threads.front().get(); }
     uint64_t               nodes_searched() const;
     uint64_t               tb_hits() const;
     Thread*                get_best_thread() const;
     void                   start_searching();
     void                   wait_for_search_finished() const;

+    std::vector<size_t> get_bound_thread_count_by_numa_node() const;
+
     std::atomic_bool stop, abortedSearch, increaseDepth;

     auto cbegin() const noexcept { return threads.cbegin(); }
@@ -102,13 +156,14 @@ class ThreadPool {
     auto empty() const noexcept { return threads.empty(); }

    private:
-    StateListPtr         setupStates;
-    std::vector<Thread*> threads;
+    StateListPtr                         setupStates;
+    std::vector<std::unique_ptr<Thread>> threads;
+    std::vector<NumaIndex>               boundThreadToNumaNode;

     uint64_t accumulate(std::atomic<uint64_t> Search::Worker::*member) const {

         uint64_t sum = 0;
-        for (Thread* th : threads)
+        for (auto&& th : threads)
             sum += (th->worker.get()->*member).load(std::memory_order_relaxed);
         return sum;
     }
```
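The diff references types that live in the new `numa.h` (NumaIndex, NumaConfig, NumaReplicatedAccessToken) and are not part of this hunk. As a conceptual, stand-alone sketch of the idea those types implement, one replica of a read-only resource per NUMA node selected through a small access token, consider the following; the `Replicated` and `AccessToken` names are invented here and do not match Stockfish's actual API, which additionally handles thread binding and first-touch page placement.

```cpp
// Minimal sketch of per-node replication with an access token. Illustration of
// the concept only; Stockfish's real NumaReplicated/NumaConfig classes live in
// numa.h and have a different, richer interface.
#include <cstddef>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

using NumaIndex = std::size_t;

// Token handed to a thread after (optionally) binding it to a node; it only
// remembers which replica that thread should read from.
struct AccessToken {
    NumaIndex node = 0;
};

template<typename T>
class Replicated {
   public:
    // Create one copy per node. With a first-touch policy, each copy would be
    // made by a thread already bound to the target node so its pages land there.
    Replicated(const T& original, std::size_t numNodes) {
        for (std::size_t n = 0; n < numNodes; ++n)
            copies.push_back(std::make_unique<T>(original));
    }

    const T& get(AccessToken token) const { return *copies[token.node]; }

   private:
    std::vector<std::unique_ptr<T>> copies;  // one replica per NUMA node
};

int main() {
    // Pretend the string is a large network; replicate it across 2 nodes.
    Replicated<std::string> weights("big NNUE weights", 2);

    AccessToken tokenNode0{0}, tokenNode1{1};
    std::cout << weights.get(tokenNode0) << '\n';
    std::cout << weights.get(tokenNode1) << '\n';
}
```

In the real patch, the token stored in `Thread::numaAccessToken` is produced by `OptionalThreadToNumaNodeBinder` during thread construction, so each search thread reads the NNUE weights replica local to its node (or a default replica when no binding is performed).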