ok, we need most bang for buck on worker nodes. 1g network, roughly 8 cores at 2.4ghz
so call it 19 ghz
bang per buck is tough if you don't limit the power budget to something sensible, as refurbished high-end enterprise gear will get you quite far
ha
So that ~20ghz bullshit number I just made up will get about 1MH/s and about max out the gigE link
so that is the top end
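sanity check on that, assuming each hash needs about 6 field reads of 16 bytes each served over the wire (the per-hash read count is my assumption for the math):

$$10^6~\text{H/s} \times 6 \times 16~\text{B} \approx 96~\text{MB/s} \approx 0.77~\text{Gb/s}$$

plus protocol overhead, so yeah, that's a gigE link full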
that can be done fanless
Thinking something like that. ~$500
maybe even load 3 or 4 gb of the field in local memory
why the ryzen?
also that power supply is way overspecced for that, you'll not be in the efficient range of that thing
good
why not ryzen?
seems to be a good price point
meh, i'd go for a sub-$100 cpu for that
it's still dual channel memory
the limitation I am working around is not the memory channel in this case
also the gpu seems unnecessary if you can just boot those off pxe
mostly for console on initial setup
if I end up building more than one, I can move it around
i'm an autodiscovery fabric kinda thinker when it comes to clustering stuff like that
it should figure out on its own what it is and what it should do
figure out the setup once and just hook more things into the network
agreed
current plan is we have a few nodes that can store and serve field 8 from ram
put those on 10g switch ports
then hang a bunch of 1g workers off the switch
i'm still not figuring out how you get enough data into the crunching part so the cpu makes any sense
would need boxes and arrows kinda representations
Have you read my arktika notes on marshalling for network calls?
yes, but the cpu still reads off ram
sure, and the ram is plenty fast. ddr4 is like 40GB/s per channel or something
you're read-count limited, and if you marshall into 4k messages you still get up to a factor-of-256 improvement there
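roughly this shape, as a sketch (class and method names made up here, not arktika's actual API):

```java
// Hypothetical sketch of the marshalling idea, not arktika's actual API.
// Instead of one network round trip per 16-byte field word, collect up
// to 256 word offsets and ship them as a single ~4k request/response.
import java.util.ArrayList;
import java.util.List;

public class BatchedFieldClient {
  static final int WORD_SIZE = 16;     // bytes per field read
  static final int BATCH_WORDS = 256;  // 256 * 16 B = 4 KiB payload

  private final List<Long> pending = new ArrayList<>();

  // Queue a word offset; flush once a full 4k batch has accumulated.
  void queueRead(long wordOffset) {
    pending.add(wordOffset);
    if (pending.size() == BATCH_WORDS) flush();
  }

  void flush() {
    if (pending.isEmpty()) return;
    // One request carries all offsets, one response carries all words:
    // up to 256x fewer round trips than issuing reads one by one.
    byte[] words = sendBatch(pending); // hypothetical network call
    pending.clear();
  }

  // Placeholder for the actual network I/O.
  byte[] sendBatch(List<Long> offsets) {
    return new byte[offsets.size() * WORD_SIZE];
  }

  public static void main(String[] args) {
    BatchedFieldClient c = new BatchedFieldClient();
    for (long i = 0; i < 1000; i++) c.queueRead(i * WORD_SIZE);
    c.flush(); // ship the final partial batch
  }
}
```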
I am absolutely CPU limited currently :wink:
I have lots of bandwidth and data
based on the shares you post, i'm not really seeing that
hehe, true
I should switch back to field 7 for now
i'm not even sure how one would metrify if any snowblossom mining is cpu limited or not
the limiting factor is the 'it is not aliens' of astronomy for all things computers
I have a mode for that
I have an arktika layer that returns random results
to see how fast you could mine if you were never waiting for data
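roughly this, as an illustration (not the actual arktika class):

```java
// Illustrative sketch of the idea, not arktika's actual layer code:
// satisfy every field read instantly with random bytes, so the hash
// loop never waits on data and you measure the pure compute ceiling.
import java.util.concurrent.ThreadLocalRandom;

public class RandomResultLayer {
  // Same shape as a real field read: offset in, 16-byte word out.
  byte[] readWord(long wordOffset) {
    byte[] word = new byte[16];
    ThreadLocalRandom.current().nextBytes(word);
    return word;
  }

  public static void main(String[] args) {
    byte[] w = new RandomResultLayer().readWord(42);
    System.out.println("got " + w.length + " random bytes with zero wait");
  }
}
```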
i'm aware of the fake layer, yes
but that's missing a lot of pipeline blockers vs. real mining
sure. I get similar results if I do network mining from a machine with the field in memory.
i'm mostly trying to think of what happens on the micro ops level on a cpu when mining
magic
i'm still pegging you as the aliens guy on that one
on bursts i can see that happening as there can be a neat queue to consume off of
but it's still a flowthrough problem so eventually both sides hit the read size issue?
unless there is something fancy in how a cpu directly consumes stuff as interrupts off the network interface on some higher end networking gear in the mix, i'm not figuring out how it misses the ram read density limits on either side
could be i'm missing something the lower level data caches do when consuming off a neater queue
yeah, I have no idea
I just see what works
if i did not misattribute your miners, you were peaking at 300k..400k on the pool, which matches what two ram channels should be doing
that was me having trouble with /dev/shm inefficiencies vs direct in-process memory
have you recently metrified fscache vs. memfield?
once I switch back to field 7, you should see me go to about 1mh/s, which maxes out my 1g network
between the memory node and the workers
i'm not seeing performance differences on just reading off the disk when the field is cached vs. fiddling with memfield
and it's way less of a hassle not to bother with the heap size nonsense
probably cause you are CPU bound :wink:
was pushing 8MH/s last time i metrified, of course possible
reading as fast as I can from /dev/shm on my ddr2 box, was about 18GB/s
reading from direct in process ram was about 120GB/s
that could be something strange about that system, I don't know
that beast is a sparc, so most bets off for comparability
also solaris
but cat the chunks to /dev/null once and see what you get 'off the disk'?
i'm left curious
I'll check that out
not sure why that didn't occur to me
also for emulating precache just dd with a blocksize and a count to /dev/null
or even easier with a chunked field
also xargs -P for doing that in parallel
with dd you get a fun sequential read disk benchmark on top
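and if you want the same measurement from inside the jvm, something like this (path and block size made up):

```java
// Rough in-process equivalent of `dd ... of=/dev/null`: sequentially
// read a field chunk with a fixed block size and report throughput.
// The default path and block size here are made up; point it at a real
// chunk file.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadBench {
  public static void main(String[] args) throws IOException {
    Path chunk = Path.of(args.length > 0 ? args[0] : "/dev/shm/field/chunk.0");
    ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MiB blocks
    long bytes = 0, start = System.nanoTime();
    try (FileChannel ch = FileChannel.open(chunk, StandardOpenOption.READ)) {
      int n;
      while ((n = ch.read(buf)) > 0) { bytes += n; buf.clear(); }
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%d bytes in %.2f s = %.2f GB/s%n", bytes, secs, bytes / secs / 1e9);
  }
}
```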
I'll try it in a bit, I need to head out
looking into this a bit more out of curiosity, it could be that the magic sauce for arktika is the translation lookaside buffer being parallel by default
i have a curious hunch i've been benchmarking with set associative caches in between
but i'm still not ultimately seeing how anything gets around a 128 byte fetch taking about 40ns
unless something makes those 128 bytes luckier than they ought to be
where queues and fetch ordering might actually play a role
and/or pipelining execution order trickery and micro op fusions
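to put rough numbers on where that leaves you: one outstanding miss is worth

$$\frac{128~\text{B}}{40~\text{ns}} = 3.2~\text{GB/s}$$

so to see anything like the ~40 GB/s per channel figure from earlier, something (prefetchers, fill buffers, or just many threads) has to keep $40 / 3.2 \approx 13$ misses in flight at all times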
m68k was so simple in comparison to the modern madness
what is 128 bytes?
yeah, it's probably the uncore global queue stuff shuffling things around https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
a dram access is 128 bytes
ah
the network message will be a tight 8k of data; if that all gets loaded into l2 cache or something it should be good
but really, I don't know
looking at the newer generations of that document, i'm seriously losing track to how the hell stuff works under the hood https://www.intel.com/content/dam/www/public/us/en/documents/manuals/xeon-e5-2600-v2-uncore-manual.pdf
phew, they wrote one for dummies, let's see if one can still play catch-up https://software.intel.com/en-us/articles/how-memory-is-accessed
hm, 256 byte reads
that'd mean i'm off by a factor of two on both the latency and read size, but i'm glad that works out about the same all the way through ddr2 to ddr4 :stuck_out_tongue:
`Before 3D XPoint™ DIMMs, interleaving was done per one or two cache lines (64 bytes or 128 bytes), but DIMM non-volatile memory characteristics motivated a change to every four cache lines (256 bytes). So now four adjacent cache lines go to the same channel, and then the next set of four cache lines go to the next channel.` ok that explains
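so the channel mapping is just modular arithmetic on 256-byte granules, something like (two channels assumed for illustration):

```java
// Sketch of the interleaving quoted above: with 256-byte granularity,
// four adjacent 64-byte cache lines land on one channel before the
// mapping moves to the next. Two channels assumed for illustration.
public class ChannelMap {
  static final int GRANULE = 256;  // 4 cache lines of 64 B each
  static final int CHANNELS = 2;

  static int channelOf(long physAddr) {
    return (int) ((physAddr / GRANULE) % CHANNELS);
  }

  public static void main(String[] args) {
    for (long addr = 0; addr < 1024; addr += 64)
      System.out.printf("cache line @ %4d -> channel %d%n", addr, channelOf(addr));
  }
}
```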
oh nifty, they actually rolled their own performance counters for this stuff https://software.intel.com/en-us/vtune-amplifier-help-getting-started
@Fireduck you got really lucky that every new memory tech improvement is essentially eaten away by the CPUs wanting to do larger and larger fetches per default as 'no one does small fetches anymore' :smile:
and the page cache should perform the same as memfield as far as i can tell
also intel is clever with its performance tips as every so often there's `Use More of the Hardware` in there :stuck_out_tongue:
this is reading almost like an arktika mining guide `The extra hardware may be an additional core, an additional processor, or an additional motherboard. ` :smile:
Yeah, who the hell wants 16 bytes?
damn, i think this might actually be why people on windows are struggling https://stackoverflow.com/questions/44355217/why-doesnt-the-jvm-emit-prefetch-instructions-on-windows-x86...
the intel guides tell you to either throw more threads at fetching data or make all fetches through explicit `_mm_prefetch`, and the jvm does not do that on windows
more threads should accomplish the same, though, which explains why people have to ramp up a lot more and still not quite get there
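which in jvm terms would look something like this (sizes and thread counts made up):

```java
// Sketch of the "more threads" workaround: with no software prefetch
// emitted by the JVM on windows, the only way to keep several memory
// fetches in flight is several threads each parked on their own miss.
import java.util.concurrent.ThreadLocalRandom;

public class ParallelFetch {
  public static void main(String[] args) throws InterruptedException {
    final byte[] field = new byte[1 << 28]; // 256 MiB stand-in for an in-RAM field
    int threads = Runtime.getRuntime().availableProcessors() * 4; // oversubscribe
    Thread[] pool = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      pool[t] = new Thread(() -> {
        long sink = 0;
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < 10_000_000; i++) {
          // each random read is likely a fresh cache miss; many threads
          // keep many misses outstanding, doing prefetch's job by brute force
          sink += field[rnd.nextInt(field.length)];
        }
        System.out.println("sink=" + sink); // keep the loop from being optimized away
      });
      pool[t].start();
    }
    for (Thread th : pool) th.join();
  }
}
```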
Interesting
heh, you can hamstring your hardware by turning that off on the hardware level, did not know :stuck_out_tongue: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
i guess that'll help debugging ghosts in the machine
though you'd have to roll your own efi script between the bios handoff and your boot loader to do that
or roll your own efi
oh god
nope, someone figured out a way around it https://01.org/msr-tools
and intel did it itself apparently :smile:
but as the jvm added the prefetch stuff in 2001, i suppose no one is around to answer why they omitted windows and why no one ever got back around to it https://bugs.openjdk.java.net/browse/JDK-4453409
but i guess one is way off the happy path when looking into things where the relevant assembly comes warning-posted with underscore prefixes