ok, we need most bang for buck on worker nodes. 1g network, roughly 8 cores at 2.4ghz
so call it 19 ghz
bang per buck is tough if you don't limit the power budget to something sensible, as refurbished high-end enterprise gear will get you quite far
ha
So that ~20ghz bullshit number I just made up will get about 1MH/s and about max out the gigE link
so that is the top end
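sanity check on that, assuming each hash needs about 6 field reads of 16 bytes each served over the wire (the per-hash read count is my assumption for the math):

$$10^6~\text{H/s} \times 6 \times 16~\text{B} \approx 96~\text{MB/s} \approx 0.77~\text{Gb/s}$$

plus protocol overhead, so yeah, that's a gigE link full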
that can be done fanless
Thinking something like that. ~$500
maybe even load 3 or 4 gb of the field in local memory
why the ryzen?
also that power supply is way overspecced for that, you'll not be in the efficient range of that thing
good
why not ryzen?
seems to be a good price point
meh, i'd go for a sub-$100 cpu for that
it's still dual channel memory
the limitation I am working around is not the memory channel in this case
also the gpu seems unnecessary if you can just boot those off pxe
mostly for console on initial setup
if I end up building more than one, I can move it around
i'm an autodiscovery fabric kinda thinker when it comes to clustering stuff like that
it should figure out on its own what it is and what it should do
figure out the setup once and just hook more things into the network
agreed
current plan is we have a few nodes that can store and serve field 8 from ram
put those on 10g switch ports
then hang a bunch of 1g workers off the switch
i'm still not figuring out how you get enough data into the crunching part so the cpu makes any sense
would need boxes and arrows kinda representations
Have you read my arktika notes on marshalling for network calls?
yes, but the cpu still reads off ram
sure, and the ram is plenty fast. ddr4 is like 40GB/s per channel or something
you're read-count limited, and if you marshall into 4k messages you still get up to a factor-of-256 improvement there
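roughly this shape, as a sketch (class and method names made up here, not arktika's actual API):

```java
// Hypothetical sketch of the marshalling idea, not arktika's actual API.
// Instead of one network round trip per 16-byte field word, collect up
// to 256 word offsets and ship them as a single ~4k request/response.
import java.util.ArrayList;
import java.util.List;

public class BatchedFieldClient {
  static final int WORD_SIZE = 16;     // bytes per field read
  static final int BATCH_WORDS = 256;  // 256 * 16 B = 4 KiB payload

  private final List<Long> pending = new ArrayList<>();

  // Queue a word offset; flush once a full 4k batch has accumulated.
  void queueRead(long wordOffset) {
    pending.add(wordOffset);
    if (pending.size() == BATCH_WORDS) flush();
  }

  void flush() {
    if (pending.isEmpty()) return;
    // One request carries all offsets, one response carries all words:
    // up to 256x fewer round trips than issuing reads one by one.
    byte[] words = sendBatch(pending); // hypothetical network call
    pending.clear();
  }

  // Placeholder for the actual network I/O.
  byte[] sendBatch(List<Long> offsets) {
    return new byte[offsets.size() * WORD_SIZE];
  }

  public static void main(String[] args) {
    BatchedFieldClient c = new BatchedFieldClient();
    for (long i = 0; i < 1000; i++) c.queueRead(i * WORD_SIZE);
    c.flush(); // ship the final partial batch
  }
}
```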
I am absolutely CPU limited currently :wink:
I have lots of bandwidth and data
based on the shares you post, i'm not really seeing that
hehe, true
I should switch back to field 7 for now
i'm not even sure how one would metrify if any snowblossom mining is cpu limited or not
the limiting factor is the 'it is not aliens' of astronomy for all things computers
I have a mode for that
I have an arktika layer that returns random results
to see how fast you could mine if you were never waiting for data
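roughly this, as an illustration (not the actual arktika class):

```java
// Illustrative sketch of the idea, not arktika's actual layer code:
// satisfy every field read instantly with random bytes, so the hash
// loop never waits on data and you measure the pure compute ceiling.
import java.util.concurrent.ThreadLocalRandom;

public class RandomResultLayer {
  // Same shape as a real field read: offset in, 16-byte word out.
  byte[] readWord(long wordOffset) {
    byte[] word = new byte[16];
    ThreadLocalRandom.current().nextBytes(word);
    return word;
  }

  public static void main(String[] args) {
    byte[] w = new RandomResultLayer().readWord(42);
    System.out.println("got " + w.length + " random bytes with zero wait");
  }
}
```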
i'm aware of the fake layer, yes
but that's missing a lot of pipeline blockers vs. real mining
sure. I get similar results if I do network mining from a machine with the field in memory.
i'm mostly trying to think of what happens on the micro ops level on a cpu when mining
magic
i'm still pegging you as the aliens guy on that one
on bursts i can see that happening as there can be a neat queue to consume off of
but it's still a flowthrough problem so eventually both sides hit the read size issue?
unless there is something fancy in how a cpu directly consumes stuff as interrupts off the network interface on some higher end networking gear in the mix, i'm not figuring out how it misses the ram read density limits on either side
could be i'm missing something the lower level data caches do when consuming off a neater queue
yeah, I have no idea
I just see what works
if i did not misattribute your miners, you were peaking at 300k..400k on the pool, which matches what two ram channels should be doing
that was me having trouble with /dev/shm inefficiencies vs direct in-process memory
have you recently metrified fscache vs. memfield?
once I switch back to field 7, you should see me go to about 1mh/s, which maxes out my 1g network
between the memory node and the workers
i'm not seeing performance differences on just reading off the disk when the field is cached vs. fiddling with memfield
and it's way less of a hassle not to bother with the heap size nonsense
probably cause you are CPU bound :wink:
was pushing 8MH/s last time i metrified, of course possible
reading as fast as I can from /dev/shm on my ddr2 box, was about 18GB/s
reading from direct in process ram was about 120GB/s
that could be something strange about that system, I don't know
that beast is a sparc, so most bets off for comparability
also solaris
but cat the chunks to /dev/null once and see what you get 'off the disk'?
i'm left curious
I'll check that out
not sure why that didn't occur to me
also for emulating precache just dd with a blocksize and a count to /dev/null
or even easier with a chunked field
also xargs -P for doing that in parallel
with dd you get a fun sequential read disk benchmark on top
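and if you want the same measurement from inside the jvm, something like this (path and block size made up):

```java
// Rough in-process equivalent of `dd ... of=/dev/null`: sequentially
// read a field chunk with a fixed block size and report throughput.
// The default path and block size here are made up; point it at a real
// chunk file.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadBench {
  public static void main(String[] args) throws IOException {
    Path chunk = Path.of(args.length > 0 ? args[0] : "/dev/shm/field/chunk.0");
    ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MiB blocks
    long bytes = 0, start = System.nanoTime();
    try (FileChannel ch = FileChannel.open(chunk, StandardOpenOption.READ)) {
      int n;
      while ((n = ch.read(buf)) > 0) { bytes += n; buf.clear(); }
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%d bytes in %.2f s = %.2f GB/s%n", bytes, secs, bytes / secs / 1e9);
  }
}
```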
I'll try it in a bit, I need to head out
looking into this a bit more out of curiosity, it could be that the magic sauce for arktika is the translation lookaside buffer being parallel by default
i have a curious hunch i've been benchmarking with set associative caches in between
but i'm still not ultimately seeing how anything gets around a 128 byte fetch taking about 40ns
unless something makes those 128 bytes luckier than they ought to be
where queues and fetch ordering might actually play a role
and/or pipelining execution order trickery and micro op fusions
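to put rough numbers on where that leaves you: one outstanding miss is worth

$$\frac{128~\text{B}}{40~\text{ns}} = 3.2~\text{GB/s}$$

so to see anything like the ~40 GB/s per channel figure from earlier, something (prefetchers, fill buffers, or just many threads) has to keep $40 / 3.2 \approx 13$ misses in flight at all times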
m68k was so simple in comparison to the modern madness
what is 128 bytes?
yeah, it's probably the uncore global queue stuff shuffling things around https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
a dram access is 128 bytes
ah
the network message will be a tight 8k of data; if that all gets loaded into l2 cache or something it should be good
but really, I don't know
looking at the newer generations of that document, i'm seriously losing track to how the hell stuff works under the hood https://www.intel.com/content/dam/www/public/us/en/documents/manuals/xeon-e5-2600-v2-uncore-manual.pdf
phew, they wrote one for dummies, let's see if one can still play catch-up https://software.intel.com/en-us/articles/how-memory-is-accessed
hm, 256 byte reads
that'd mean i'm off by a factor of two on both the latency and read size, but i'm glad that works out about the same all the way through ddr2 to ddr4 :stuck_out_tongue:
`Before 3D XPoint™ DIMMs, interleaving was done per one or two cache lines (64 bytes or 128 bytes), but DIMM non-volatile memory characteristics motivated a change to every four cache lines (256 bytes). So now four adjacent cache lines go to the same channel, and then the next set of four cache lines go to the next channel.` ok that explains
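so the channel mapping is just modular arithmetic on 256-byte granules, something like (two channels assumed for illustration):

```java
// Sketch of the interleaving quoted above: with 256-byte granularity,
// four adjacent 64-byte cache lines land on one channel before the
// mapping moves to the next. Two channels assumed for illustration.
public class ChannelMap {
  static final int GRANULE = 256;  // 4 cache lines of 64 B each
  static final int CHANNELS = 2;

  static int channelOf(long physAddr) {
    return (int) ((physAddr / GRANULE) % CHANNELS);
  }

  public static void main(String[] args) {
    for (long addr = 0; addr < 1024; addr += 64)
      System.out.printf("cache line @ %4d -> channel %d%n", addr, channelOf(addr));
  }
}
```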
oh nifty, they actually rolled their own performance counters for this stuff https://software.intel.com/en-us/vtune-amplifier-help-getting-started
@Fireduck you got really lucky that every new memory tech improvement is essentially eaten away by the CPUs wanting to do larger and larger fetches per default as 'no one does small fetches anymore' :smile:
and the page cache should perform the same as memfield as far as i can tell
also intel is clever with its performance tips as every so often there's `Use More of the Hardware` in there :stuck_out_tongue:
this is reading almost like an arktika mining guide `The extra hardware may be an additional core, an additional processor, or an additional motherboard. ` :smile:
Yeah, who the hell wants 16 bytes?
damn, i think this might actually be why people on windows are struggling https://stackoverflow.com/questions/44355217/why-doesnt-the-jvm-emit-prefetch-instructions-on-windows-x86...
the intel guides tell you to either throw more threads at fetching data or make all fetches through explicit `_mm_prefetch`, and the jvm does not do that on windows
more threads should accomplish the same, though, which explains why people have to ramp up a lot more and still not quite get there
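which in jvm terms would look something like this (sizes and thread counts made up):

```java
// Sketch of the "more threads" workaround: with no software prefetch
// emitted by the JVM on windows, the only way to keep several memory
// fetches in flight is several threads each parked on their own miss.
import java.util.concurrent.ThreadLocalRandom;

public class ParallelFetch {
  public static void main(String[] args) throws InterruptedException {
    final byte[] field = new byte[1 << 28]; // 256 MiB stand-in for an in-RAM field
    int threads = Runtime.getRuntime().availableProcessors() * 4; // oversubscribe
    Thread[] pool = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      pool[t] = new Thread(() -> {
        long sink = 0;
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < 10_000_000; i++) {
          // each random read is likely a fresh cache miss; many threads
          // keep many misses outstanding, doing prefetch's job by brute force
          sink += field[rnd.nextInt(field.length)];
        }
        System.out.println("sink=" + sink); // keep the loop from being optimized away
      });
      pool[t].start();
    }
    for (Thread th : pool) th.join();
  }
}
```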
Interesting
heh, you can hamstring your hardware by turning that off on the hardware level, did not know :stuck_out_tongue: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
i guess that'll help debugging ghosts in the machine
though you'd have to roll your own efi script between the bios handoff and your boot loader to do that
or roll your own efi
oh god
nope, someone figured out a way around it https://01.org/msr-tools
and intel did it itself apparently :smile:
but as the jvm added the prefetch stuff in 2001, i suppose no one is around to answer why they omitted windows and why no one ever got back around to it https://bugs.openjdk.java.net/browse/JDK-4453409
but i guess one is way off the happy path when looking into things where the relevant assembly comes warning-posted with underscore prefixes