2018-11-20 21:54:48
ok, we need most bang for buck on worker nodes. 1g network, roughly 8 cores at 2.4ghz
Fireduck
2018-11-20 21:55:21
so call it 19 ghz
Fireduck
2018-11-20 22:09:49
bang per buck is tough if you do not limit the power budget to something sensible, as refurbished high-end enterprise gear will get you quite far
Rotonen
2018-11-20 22:10:04
ha
Fireduck
2018-11-20 22:10:25
So that ~20ghz bullshit number I just made up will get about 1MH/s and about max out the gigE link
Fireduck
2018-11-20 22:10:28
so that is the top end
Fireduck
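(A rough sanity check on that top end, assuming Snowblossom's six 16-byte field reads per hash attempt; the read count is an assumption here:)

$$1\,\mathrm{MH/s} \times 6 \times 16\,\mathrm{B} \approx 96\,\mathrm{MB/s}$$

which, before protocol overhead, sits just under the ~118 MB/s of usable payload on a gigE link, so the two limits do land in about the same place.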
2018-11-20 22:11:13
that can be done fanless
Rotonen
2018-11-20 22:12:05
Thinking something like that. ~$500
Fireduck
2018-11-20 22:15:30
maybe even load 3 or 4 gb of the field in local memory
Fireduck
2018-11-20 22:30:37
why the ryzen?
Rotonen
2018-11-20 22:31:31
also that power supply is way overspec for that, you'll not be in the efficient range of that thing
Rotonen
2018-11-20 22:31:56
good
Fireduck
2018-11-20 22:32:03
why not ryzen?
Fireduck
2018-11-20 22:32:09
seems to be a good price point
Fireduck
2018-11-20 22:32:23
meh, i'd go for a sub-$100 cpu for that
Rotonen
2018-11-20 22:32:36
it's still dual channel memory
Rotonen
2018-11-20 22:32:52
the limitation I am working around is not the memory channel in this case
Fireduck
2018-11-20 22:33:36
also the gpu seems unnecessary if you can just boot those off pxe
Rotonen
2018-11-20 22:33:55
mostly for console on initial setup
Fireduck
2018-11-20 22:34:04
if I end up building more than one, I can move it around
Fireduck
2018-11-20 22:34:26
i'm an autodiscovery fabric kinda thinker when it comes to clustering stuff like that
Rotonen
2018-11-20 22:34:41
it should figure out on its own what it is and what it should do
Rotonen
2018-11-20 22:34:59
figure out the setup once and just hook more things into the network
Rotonen
2018-11-20 22:35:06
agreed
Fireduck
2018-11-20 22:35:34
current plan is we have a few nodes that can store and serve field 8 from ram
Fireduck
2018-11-20 22:35:42
put those on 10g switch ports
Fireduck
2018-11-20 22:35:52
then hang a bunch of 1g workers off the switch
Fireduck
2018-11-20 22:36:06
i'm still not figuring out how you get enough data into the crunching part so the cpu makes any sense
Rotonen
2018-11-20 22:36:18
would need boxes and arrows kinda representations
Rotonen
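(A rough boxes-and-arrows sketch of the topology described above; node counts are illustrative:)

```
 +---------------------------+
 | memory nodes (field 8 in  |
 | RAM, served via arktika)  |
 +-------------+-------------+
               | 10G ports
        +------+------+
        |   switch    |
        +-+----+----+-+
          |1G  |1G  |1G
       worker worker worker ...
```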
2018-11-20 22:36:35
Have you read my arktika notes on marshalling for network calls?
Fireduck
2018-11-20 22:36:46
yes, but the cpu still reads off ram
Rotonen
2018-11-20 22:37:13
sure, and the ram is plenty fast. ddr4 is like 40GB/s per channel or something
Fireduck
2018-11-20 22:37:43
you're read-count limited, and if you marshal into 4k you still get up to a factor of 256 improvement there
Rotonen
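(The arithmetic behind that ceiling, assuming 16-byte word reads batched into 4 KiB messages:)

$$\frac{4096\,\mathrm{B}}{16\,\mathrm{B}} = 256$$

i.e. one 4k transfer can carry up to 256 individual word reads.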
2018-11-20 22:38:06
I am absolutely CPU limited currently :wink:
Fireduck
2018-11-20 22:38:12
I have lots of bandwidth and data
Fireduck
2018-11-20 22:39:08
based on the shares you post, i'm not really seeing that
Rotonen
2018-11-20 22:39:58
hehe, true
Fireduck
2018-11-20 22:40:24
I should switch back to field 7 for now
Fireduck
2018-11-20 22:41:20
i'm not even sure how one would metrify if any snowblossom mining is cpu limited or not
Rotonen
2018-11-20 22:41:36
the limiting factor is the 'it is not aliens' of astronomy for all things computers
Rotonen
2018-11-20 22:41:57
I have a mode for that
Fireduck
2018-11-20 22:42:10
I have an arktika layer that returns random results
Fireduck
2018-11-20 22:42:19
to see how fast you could mine if you were never waiting for data
Fireduck
2018-11-20 22:42:41
i'm aware of the fake layer, yes
Rotonen
2018-11-20 22:43:06
but that's missing a lot of pipeline blockers vs. real mining
Rotonen
2018-11-20 22:43:38
sure. I get similar results if I do network mining from a machine with the field in memory.
Fireduck
2018-11-20 22:43:43
i'm mostly trying to think of what happens on the micro ops level on a cpu when mining
Rotonen
2018-11-20 22:43:50
magic
Fireduck
2018-11-20 22:44:07
i'm still pegging you as the aliens guy on that one
Rotonen
2018-11-20 22:44:44
on bursts i can see that happening as there can be a neat queue to consume off of
Rotonen
2018-11-20 22:45:00
but it's still a flow-through problem, so eventually both sides hit the read-size issue?
Rotonen
2018-11-20 22:46:13
unless there is something fancy in how a cpu directly consumes stuff as interrupts off the network interface, with some higher-end networking gear in the mix, i'm not figuring out how it misses the ram read density limits on either side
Rotonen
2018-11-20 22:50:16
could be i'm missing something the lower level data caches do when consuming off a neater queue
Rotonen
2018-11-20 22:52:07
yeah, I have no idea
Fireduck
2018-11-20 22:52:11
I just see what works
Fireduck
2018-11-20 22:52:46
if i did not misattribute your miners, you were peaking at 300k .. 400k on the pool, which matches what two ram channels should be doing
Rotonen
2018-11-20 22:53:17
that was me having trouble with /dev/shm inefficiencies vs direct in-process memory
Fireduck
2018-11-20 22:53:40
have you recently metrified fscache vs. memfield?
Rotonen
2018-11-20 22:53:41
once I switch back to field 7, you should see me go to about 1mh/s, which maxes out my 1g network
Fireduck
2018-11-20 22:53:52
between the memory node and the workers
Fireduck
2018-11-20 22:54:31
i'm not seeing performance differences on just reading off the disk when the field is cached vs. fiddling with memfield
Rotonen
2018-11-20 22:54:48
and way less of a hassle to not have to bother with the heap size nonsense
Rotonen
2018-11-20 22:54:52
probably cause you're CPU bound :wink:
Fireduck
2018-11-20 22:55:11
was pushing 8MH/s last time i metrified, of course possible
Rotonen
2018-11-20 22:55:20
reading as fast as I can from /dev/shm on my ddr2 box was about 18GB/s
Fireduck
2018-11-20 22:55:29
reading from direct in process ram was about 120GB/s
Fireduck
2018-11-20 22:55:50
that could be something strange about that system, I don't know
Fireduck
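(A minimal sketch of the /dev/shm side of that comparison; GNU dd and the field path are assumptions, and the in-process number needs a harness inside the miner itself, e.g. reading via mmap:)

```
# stream the tmpfs copy of the field once; dd reports throughput at the end
dd if=/dev/shm/snowblossom.7.snow of=/dev/null bs=1M
```

The gap between the two numbers then includes the extra copy out of the tmpfs pages on each read(), not just raw DRAM speed.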
2018-11-20 22:56:05
that beast is a sparc, so most bets are off for comparability
Rotonen
2018-11-20 22:56:12
also solaris
Rotonen
2018-11-20 22:57:06
but cat the chunks to /dev/null once and see what you get 'off the disk'?
Rotonen
2018-11-20 22:57:09
i'm left curious
Rotonen
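(The suggested test, with a hypothetical chunked-field layout under `fields/`:)

```
# read every chunk once and see what 'off the disk' delivers;
# on Linux, drop the page cache first for a true cold run:
#   echo 3 | sudo tee /proc/sys/vm/drop_caches
time cat fields/snowblossom.8/*.snow > /dev/null
```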
2018-11-20 22:59:55
I'll check that out
Fireduck
2018-11-20 23:00:13
not sure why that didn't occur to me
Fireduck
2018-11-20 23:00:43
also for emulating precache just dd with a blocksize and a count to /dev/null
Rotonen
2018-11-20 23:00:50
or even easier with a chunked field
Rotonen
2018-11-20 23:00:58
also xargs -P for doing that in parallel
Rotonen
2018-11-20 23:01:45
with dd you get a fun sequential read disk benchmark on top
Rotonen
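(A sketch of that dd/xargs variant; paths, block size, and the parallelism level are assumptions:)

```
# emulate precache: read the first 1 GiB (4k x 262144) of each chunk,
# 8 chunks in parallel; each dd prints its own throughput on exit
ls fields/snowblossom.8/*.snow | \
  xargs -P 8 -I{} dd if={} of=/dev/null bs=4k count=262144
```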
2018-11-20 23:03:10
I'll try it in a bit, I need to head out
Fireduck
2018-11-20 23:14:09
looking into this out of curiosity a bit more, it could be the magic sauce for arktika is the translation lookaside buffer being parallel by default
Rotonen
2018-11-20 23:16:08
i have a curious hunch i've been benchmarking with set associative caches in between
Rotonen
2018-11-20 23:24:15
but i'm still not ultimately seeing how anything gets around a 128 byte fetch taking about 40ns
Rotonen
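(Taking those figures at face value, a single serialized fetch stream tops out at:)

$$\frac{128\,\mathrm{B}}{40\,\mathrm{ns}} = 3.2\,\mathrm{GB/s}$$

so any measured bandwidth above that implies many fetches in flight at once, which is where the queueing and prefetch speculation below comes in.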
2018-11-20 23:26:36
unless something makes those 128 bytes luckier than they ought to be
Rotonen
2018-11-20 23:26:56
where queues and fetch ordering might actually play a role
Rotonen
2018-11-20 23:27:28
and/or pipelining execution order trickery and micro op fusions
Rotonen
2018-11-20 23:28:00
m68k was so simple in comparison to the modern madness
Rotonen
2018-11-20 23:31:01
what is 128 bytes?
Fireduck
2018-11-20 23:31:05
yeah, it's probably the uncore global queue stuff shuffling things around
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Rotonen
2018-11-20 23:31:09
a dram access is 128 bytes
Rotonen
2018-11-20 23:31:19
ah
Fireduck
2018-11-20 23:31:52
the network message will be a tight 8k of data; if that all gets loaded into l2 cache or something it should be good
Fireduck
2018-11-20 23:32:54
but really, I don't know
Fireduck
2018-11-20 23:32:56
looking at the newer generations of that document, i'm seriously losing track of how the hell stuff works under the hood
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/xeon-e5-2600-v2-uncore-manual.pdf
Rotonen
2018-11-20 23:34:41
phew, they wrote one for dummies, let's see if one can still play catch-up
https://software.intel.com/en-us/articles/how-memory-is-accessed
Rotonen
2018-11-20 23:35:31
hm, 256 byte reads
Rotonen
2018-11-20 23:35:59
that'd mean i'm off by a factor of two on both the latency and read size, but i'm glad that works out about the same all the way through ddr2 to ddr4 :stuck_out_tongue:
Rotonen
2018-11-20 23:38:43
`Before 3D XPoint™ DIMMs, interleaving was done per one or two cache lines (64 bytes or 128 bytes), but DIMM non-volatile memory characteristics motivated a change to every four cache lines (256 bytes). So now four adjacent cache lines go to the same channel, and then the next set of four cache lines go to the next channel.`
ok that explains
Rotonen
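(That is, the interleave granularity is four cache lines:)

$$4 \times 64\,\mathrm{B} = 256\,\mathrm{B}$$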
2018-11-20 23:41:27
oh nifty, they actually rolled their own performance counters for this stuff
https://software.intel.com/en-us/vtune-amplifier-help-getting-started
Rotonen
2018-11-20 23:44:16
@Fireduck you got really lucky that every new memory tech improvement is essentially eaten away by the CPUs wanting to do larger and larger fetches by default as 'no one does small fetches anymore' :smile:
Rotonen
2018-11-20 23:45:15
and the page cache should perform the same as memfield as far as i can tell
Rotonen
2018-11-20 23:46:14
also intel is clever with its performance tips as every so often there's `Use More of the Hardware` in there :stuck_out_tongue:
Rotonen
2018-11-20 23:46:43
this is reading almost like an arktika mining guide `The extra hardware may be an additional core, an additional processor, or an additional motherboard. ` :smile:
Rotonen
2018-11-20 23:47:36
Yeah, who the hell wants 16 bytes?
Fireduck
2018-11-20 23:49:16
damn, i think this might actually be why people on windows are struggling
https://stackoverflow.com/questions/44355217/why-doesnt-the-jvm-emit-prefetch-instructions-on-windows-x86...
Rotonen
2018-11-20 23:50:12
the intel guides tell you either to throw more threads at fetching data or to make all fetches through explicit `_mm_prefetch`, and the jvm does not do that on windows
Rotonen
2018-11-20 23:50:33
more threads should do the same, but that explains why people have to ramp up a lot more and still not quite get there
Rotonen
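(One way to check what the local JVM actually exposes; HotSpot flag names vary by version:)

```
# list prefetch-related HotSpot flags and their current values
java -XX:+PrintFlagsFinal -version | grep -i prefetch
```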
2018-11-20 23:51:16
Interesting
Fireduck
2018-11-20 23:53:21
heh, you can hamstring your hardware by turning that off on the hardware level, did not know :stuck_out_tongue:
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
Rotonen
2018-11-20 23:53:31
i guess that'll help debugging ghosts in the machine
Rotonen
2018-11-20 23:54:19
though you'd have to roll your own efi script between the bios handoff and your boot loader to do that
Rotonen
2018-11-20 23:54:23
or roll your own efi
Rotonen
2018-11-20 23:54:59
oh god
Fireduck
2018-11-20 23:55:07
nope, someone figured out a way around it
https://01.org/msr-tools
Rotonen
2018-11-20 23:55:26
and intel did it itself apparently :smile:
Rotonen
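(Per the Intel disclosure article linked above, MSR 0x1A4 holds the four prefetcher-disable bits on Nehalem through Broadwell, so with msr-tools the experiment looks roughly like this:)

```
sudo modprobe msr          # expose /dev/cpu/*/msr
sudo wrmsr -a 0x1a4 0xf    # set bits 0-3: disable all four hw prefetchers
sudo rdmsr -a 0x1a4        # verify on every core
sudo wrmsr -a 0x1a4 0x0    # re-enable
```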
2018-11-20 23:56:49
but as the jvm added the prefetch stuff in 2001, i suppose no one is around to answer why they omitted windows and why no one has gotten back around to it
https://bugs.openjdk.java.net/browse/JDK-4453409
Rotonen
2018-11-20 23:58:16
but i guess one is way off the happy path when looking into things where the relevant assembly is warning-posted with underscore prefixing
Rotonen