aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-01-23hwmon: (w83627ehf) make sensor_dev_attr_##_name variables staticChen Zhou1-28/+28
Fix sparse warning: drivers/hwmon/w83627ehf.c:1202:1: warning: symbol 'sensor_dev_attr_pwm1_target' was not declared. Should it be static? and many more similar messages. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Chen Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected] [groeck: Dropped all but one log message from description] Signed-off-by: Guenter Roeck <[email protected]>
2020-01-23hwmon: (pmbus) Detect if chip is write protectedGuenter Roeck3-1/+33
If a chip is write protected, we can not change any limits, and we can not clear status flags. This may be the reason why clearing status flags is reported to not work for some chips. Detect the condition in the pmbus core. If the chip is write protected, set limit attributes as read-only, and set the flag indicating that the status flag should be ignored. Signed-off-by: Guenter Roeck <[email protected]>
2020-01-23hwmon: Driver for MAX31730Guenter Roeck5-1/+497
MAX31730 is a 3-Channel Remote Temperature Sensor. Signed-off-by: Guenter Roeck <[email protected]>
2020-01-23hwmon: Add support for enable attributes to hwmon coreGuenter Roeck2-4/+22
The hwmon ABI supports enable attributes since commit fb41a710f84e ("hwmon: Document the sensor enable attribute"), but did not add support for those attributes to the hwmon core. Do that now. Since the enable attributes are logically the most important attributes, they are added as first attribute to the attribute list. Move hwmon_in_enable from last to first place for consistency. Signed-off-by: Guenter Roeck <[email protected]>
2020-01-23hwmon: (w83627ehf) convert to with_info interfaceDr. David Alan Gilbert1-799/+630
Convert the old hwmon_device_register code to devm_hwmon_device_register_with_info. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Guenter Roeck <[email protected]>
2020-01-23hwmon: (pmbus/ucd9000) Add support for UCD90320 Power SequencerJim Wright3-17/+40
Add support for the UCD90320 chip and its expanded set of GPIO pins. Signed-off-by: Jim Wright <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Guenter Roeck <[email protected]>
2020-01-23dt-bindings: hwmon/pmbus: Add ti,ucd90320 power sequencerJim Wright1-0/+45
Document the UCD90320 device tree binding. Signed-off-by: Jim Wright <[email protected]> Reviewed-by: Rob Herring <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Guenter Roeck <[email protected]>
2020-01-23hwmon: Add intrusion templatesDr. David Alan Gilbert2-1/+16
Add templates for intrusion%d_alarm and intrusion%d_beep. Note, these start at 0. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Guenter Roeck <[email protected]>
2020-01-23net_sched: fix datalen for ematchCong Wang1-1/+1
syzbot reported an out-of-bound access in em_nbyte. As initially analyzed by Eric, this is because em_nbyte sets its own em->datalen in em_nbyte_change() other than the one specified by user, but this value gets overwritten later by its caller tcf_em_validate(). We should leave em->datalen untouched to respect their choices. I audit all the in-tree ematch users, all of those implement ->change() set em->datalen, so we can just avoid setting it twice in this case. Reported-and-tested-by: [email protected] Reported-by: [email protected] Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: Eric Dumazet <[email protected]> Signed-off-by: Cong Wang <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23Merge branch 'Fixes-for-SONIC-ethernet-driver'David S. Miller2-162/+262
Finn Thain says: ==================== Fixes for SONIC ethernet driver Various SONIC driver problems have become apparent over the years, including tx watchdog timeouts, lost packets and duplicated packets. The problems are mostly caused by bugs in buffer handling, locking and (re-)initialization code. This patch series resolves these problems. This series has been tested on National Semiconductor hardware (macsonic), qemu-system-m68k (macsonic) and qemu-system-mips64el (jazzsonic). The emulated dp8393x device used in QEMU also has bugs. I have fixed the bugs that I know of in a series of patches at, https://github.com/fthain/qemu/commits/sonic Changed since v1: - Minor revisions as described in commit logs. - Deferred net-next patches. Changed since v2: - Minor revisions as described in commit logs. ==================== Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Prevent tx watchdog timeoutFinn Thain1-4/+13
Section 5.5.3.2 of the datasheet says, If FIFO Underrun, Byte Count Mismatch, Excessive Collision, or Excessive Deferral (if enabled) errors occur, transmission ceases. In this situation, the chip asserts a TXER interrupt rather than TXDN. But the handler for the TXDN is the only way that the transmit queue gets restarted. Hence, an aborted transmission can result in a watchdog timeout. This problem can be reproduced on congested link, as that can result in excessive transmitter collisions. Another way to reproduce this is with a FIFO Underrun, which may be caused by DMA latency. In event of a TXER interrupt, prevent a watchdog timeout by restarting transmission. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Fix CAM initializationFinn Thain1-9/+12
Section 4.3.1 of the datasheet says, This bit [TXP] must not be set if a Load CAM operation is in progress (LCAM is set). The SONIC will lock up if both bits are set simultaneously. Testing has shown that the driver sometimes attempts to set LCAM while TXP is set. Avoid this by waiting for command completion before and after giving the LCAM command. After issuing the Load CAM command, poll for !SONIC_CR_LCAM rather than SONIC_INT_LCD, because the SONIC_CR_TXP bit can't be used until !SONIC_CR_LCAM. When in reset mode, take the opportunity to reset the CAM Enable register. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Fix command register usageFinn Thain1-15/+3
There are several issues relating to command register usage during chip initialization. Firstly, the SONIC sometimes comes out of software reset with the Start Timer bit set. This gets logged as, macsonic macsonic eth0: sonic_init: status=24, i=101 Avoid this by giving the Stop Timer command earlier than later. Secondly, the loop that waits for the Read RRA command to complete has the break condition inverted. That's why the for loop iterates until its termination condition. Call the helper for this instead. Finally, give the Receiver Enable command after clearing interrupts, not before, to avoid the possibility of losing an interrupt. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Quiesce SONIC before re-initializing descriptor memoryFinn Thain2-0/+28
Make sure the SONIC's DMA engine is idle before altering the transmit and receive descriptors. Add a helper for this as it will be needed again. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Fix receive buffer replenishmentFinn Thain2-63/+105
As soon as the driver is finished with a receive buffer it allocs a new one and overwrites the corresponding RRA entry with a new buffer pointer. Problem is, the buffer pointer is split across two word-sized registers. It can't be updated in one atomic store. So this operation races with the chip while it stores received packets and advances its RRP register. This could result in memory corruption by a DMA write. Avoid this problem by adding buffers only at the location given by the RWP register, in accordance with the National Semiconductor datasheet. Re-factor this code into separate functions to calculate a RRA pointer and to update the RWP. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Improve receive descriptor status flag checkFinn Thain1-10/+5
After sonic_tx_timeout() calls sonic_init(), it can happen that sonic_rx() will subsequently encounter a receive descriptor with no flags set. Remove the comment that says that this can't happen. When giving a receive descriptor to the SONIC, clear the descriptor status field. That way, any rx descriptor with flags set can only be a newly received packet. Don't process a descriptor without the LPKT bit set. The buffer is still in use by the SONIC. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Avoid needless receive descriptor EOL flag updatesFinn Thain1-6/+15
The while loop in sonic_rx() traverses the rx descriptor ring. It stops when it reaches a descriptor that the SONIC has not used. Each iteration advances the EOL flag so the SONIC can keep using more descriptors. Therefore, the while loop has no definite termination condition. The algorithm described in the National Semiconductor literature is quite different. It consumes descriptors up to the one with its EOL flag set (which will also have its "in use" flag set). All freed descriptors are then returned to the ring at once, by adjusting the EOL flags (and link pointers). Adopt the algorithm from datasheet as it's simpler, terminates quickly and avoids a lot of pointless descriptor EOL flag changes. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Fix receive buffer handlingFinn Thain2-7/+33
The SONIC can sometimes advance its rx buffer pointer (RRP register) without advancing its rx descriptor pointer (CRDA register). As a result the index of the current rx descriptor may not equal that of the current rx buffer. The driver mistakenly assumes that they are always equal. This assumption leads to incorrect packet lengths and possible packet duplication. Avoid this by calling a new function to locate the buffer corresponding to a given descriptor. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Fix interface error stats collectionFinn Thain2-14/+8
The tx_aborted_errors statistic should count packets flagged with EXD, EXC, FU, or BCM bits because those bits denote an aborted transmission. That corresponds to the bitmask 0x0446, not 0x0642. Use macros for these constants to avoid mistakes. Better to leave out FIFO Underruns (FU) as there's a separate counter for that purpose. Don't lump all these errors in with the general tx_errors counter as that's used for tx timeout events. On the rx side, don't count RDE and RBAE interrupts as dropped packets. These interrupts don't indicate a lost packet, just a lack of resources. When a lack of resources results in a lost packet, this gets reported in the rx_missed_errors counter (along with RFO events). Don't double-count rx_frame_errors and rx_crc_errors. Don't use the general rx_errors counter for events that already have special counters. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Use MMIO accessorsFinn Thain1-8/+8
The driver accesses descriptor memory which is simultaneously accessed by the chip, so the compiler must not be allowed to re-order CPU accesses. sonic_buf_get() used 'volatile' to prevent that. sonic_buf_put() should have done so too but was overlooked. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Clear interrupt flags immediatelyFinn Thain1-22/+6
The chip can change a packet's descriptor status flags at any time. However, an active interrupt flag gets cleared rather late. This allows a race condition that could theoretically lose an interrupt. Fix this by clearing asserted interrupt flags immediately. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/sonic: Add mutual exclusion for accessing shared stateFinn Thain2-14/+36
The netif_stop_queue() call in sonic_send_packet() races with the netif_wake_queue() call in sonic_interrupt(). This causes issues like "NETDEV WATCHDOG: eth0 (macsonic): transmit queue 0 timed out". Fix this by disabling interrupts when accessing tx_skb[] and next_tx. Update a comment to clarify the synchronization properties. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23net: fsl/fman: rename IF_MODE_XGMII to IF_MODE_10GMadalin Bucur1-2/+2
As the only 10G PHY interface type defined at the moment the code was developed was XGMII, although the PHY interface mode used was not XGMII, XGMII was used in the code to denote 10G. This patch renames the 10G interface mode to remove the ambiguity. Signed-off-by: Madalin Bucur <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23Merge branch 'net-fsl-fman-address-erratum-A011043'David S. Miller20-1/+37
Madalin Bucur says: ==================== net: fsl/fman: address erratum A011043 This addresses a HW erratum on some QorIQ DPAA devices. MDIO reads to internal PCS registers may result in having the MDIO_CFG[MDIO_RD_ER] bit set, even when there is no error and read data (MDIO_DATA[MDIO_DATA]) is correct. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. When the issue was present, one could see such errors during boot: mdio_bus ffe4e5000: Error while reading PHY0 reg at 3.3 ==================== Signed-off-by: David S. Miller <[email protected]>
2020-01-23net/fsl: treat fsl,erratum-a011043Madalin Bucur1-1/+6
When fsl,erratum-a011043 is set, adjust for erratum A011043: MDIO reads to internal PCS registers may result in having the MDIO_CFG[MDIO_RD_ER] bit set, even when there is no error and read data (MDIO_DATA[MDIO_DATA]) is correct. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. Signed-off-by: Madalin Bucur <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23powerpc/fsl/dts: add fsl,erratum-a011043Madalin Bucur18-0/+18
Add fsl,erratum-a011043 to internal MDIO buses. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. Signed-off-by: Madalin Bucur <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23dt-bindings: net: add fsl,erratum-a011043Madalin Bucur1-0/+13
Add an entry for erratum A011043: the MDIO_CFG[MDIO_RD_ER] bit may be falsely set when reading internal PCS registers. MDIO reads to internal PCS registers may result in having the MDIO_CFG[MDIO_RD_ER] bit set, even when there is no error and read data (MDIO_DATA[MDIO_DATA]) is correct. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. Signed-off-by: Madalin Bucur <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23qlcnic: Fix CPU soft lockup while collecting firmware dumpManish Chopra2-0/+3
Driver while collecting firmware dump takes longer time to collect/process some of the firmware dump entries/memories. Bigger capture masks makes it worse as it results in larger amount of data being collected and results in CPU soft lockup. Place cond_resched() in some of the driver flows that are expectedly time consuming to relinquish the CPU to avoid CPU soft lockup panic. Signed-off-by: Shahed Shaikh <[email protected]> Tested-by: Yonggen Xu <[email protected]> Signed-off-by: Manish Chopra <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-01-23Merge tag 'xarray-5.5' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds4-59/+175
Pull XArray fixes from Matthew Wilcox: "Primarily bugfixes, mostly around handling index wrap-around correctly. A couple of doc fixes and adding missing APIs. I had an oops live on stage at linux.conf.au this year, and it turned out to be a bug in xas_find() which I can't prove isn't triggerable in the current codebase. Then in looking for the bug, I spotted two more bugs. The bots have had a few days to chew on this with no problems reported, and it passes the test-suite (which now has more tests to make sure these problems don't come back)" * tag 'xarray-5.5' of git://git.infradead.org/users/willy/linux-dax: XArray: Add xa_for_each_range XArray: Fix xas_find returning too many entries XArray: Fix xa_find_after with multi-index entries XArray: Fix infinite loop with entry at ULONG_MAX XArray: Add wrappers for nested spinlocks XArray: Improve documentation of search marks XArray: Fix xas_pause at ULONG_MAX
2020-01-23Merge tag 'trace-v5.5-rc6-2' of ↵Linus Torvalds8-71/+163
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Various tracing fixes: - Fix a function comparison warning for a xen trace event macro - Fix a double perf_event linking to a trace_uprobe_filter for multiple events - Fix suspicious RCU warnings in trace event code for using list_for_each_entry_rcu() when the "_rcu" portion wasn't needed. - Fix a bug in the histogram code when using the same variable - Fix a NULL pointer dereference when tracefs lockdown enabled and calling trace_set_default_clock() - A fix to a bug found with the double perf_event linking patch" * tag 'trace-v5.5-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/uprobe: Fix to make trace_uprobe_filter alignment safe tracing: Do not set trace clock if tracefs lockdown is in effect tracing: Fix histogram code when expression has same var as value tracing: trigger: Replace unneeded RCU-list traversals tracing/uprobe: Fix double perf_event linking on multiprobe uprobe tracing: xen: Ordered comparison of function pointers
2020-01-23Merge tag 'ceph-for-5.5-rc8' of https://github.com/ceph/ceph-clientLinus Torvalds1-2/+6
Pull ceph fix from Ilya Dryomov: "A fix for a potential use-after-free from Jeff, marked for stable" * tag 'ceph-for-5.5-rc8' of https://github.com/ceph/ceph-client: ceph: hold extra reference to r_parent over life of request
2020-01-23Merge tag 'pm-5.5-rc8' of ↵Linus Torvalds1-10/+10
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Prevent the kernel from crashing during resume from hibernation if free pages contain leftover data from the restore kernel and init_on_free is set (Alexander Potapenko)" * tag 'pm-5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: fix crashes with init_on_free=1
2020-01-23Merge tag 'pci-v5.5-fixes-2' of ↵Linus Torvalds1-6/+13
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fix from Bjorn Helgaas: "Mark ATS as broken on AMD Navi14 GPU rev 0xc5 (Alex Deucher)" * tag 'pci-v5.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: Mark AMD Navi14 GPU rev 0xc5 ATS as broken
2020-01-23partitions/ldm: fix spelling mistake "to" -> "too"Colin Ian King1-1/+1
There is a spelling mistake in a ldm_error message. Fix it. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: reap from tail of c->btree_cache in bch_mca_scan()Coly Li1-5/+5
When shrink btree node cache from c->btree_cache in bch_mca_scan(), no matter the selected node is reaped or not, it will be rotated from the head to the tail of c->btree_cache list. But in bcache journal code, when flushing the btree nodes with oldest journal entry, btree nodes are iterated and slected from the tail of c->btree_cache list in btree_flush_write(). The list_rotate_left() in bch_mca_scan() will make btree_flush_write() iterate more nodes in c->btree_list in reverse order. This patch just reaps the selected btree node cache, and not move it from the head to the tail of c->btree_cache list. Then bch_mca_scan() will not mess up c->btree_cache list to btree_flush_write(). Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()Coly Li1-3/+3
In order to skip the most recently freed btree node cahce, currently in bch_mca_scan() the first 3 caches in c->btree_cache_freeable list are skipped when shrinking bcache node caches in bch_mca_scan(). The related code in bch_mca_scan() is, 737 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) { 738 if (nr <= 0) 739 goto out; 740 741 if (++i > 3 && 742 !mca_reap(b, 0, false)) { lines free cache memory 746 } 747 nr--; 748 } The problem is, if virtual memory code calls bch_mca_scan() and the calculated 'nr' is 1 or 2, then in the above loop, nothing will be shunk. In such case, if slub/slab manager calls bch_mca_scan() for many times with small scan number, it does not help to shrink cache memory and just wasts CPU cycles. This patch just selects btree node caches from tail of the c->btree_cache_freeable list, then the newly freed host cache can still be allocated by mca_alloc(), and at least 1 node can be shunk. Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: remove member accessed from struct btreeColy Li2-8/+2
The member 'accessed' of struct btree is used in bch_mca_scan() when shrinking btree node caches. The original idea is, if b->accessed is set, clean it and look at next btree node cache from c->btree_cache list, and only shrink the caches whose b->accessed is cleaned. Then only cold btree node cache will be shrunk. But when I/O pressure is high, it is very probably that b->accessed of a btree node cache will be set again in bch_btree_node_get() before bch_mca_scan() selects it again. Then there is no chance for bch_mca_scan() to shrink enough memory back to slub or slab system. This patch removes member accessed from struct btree, then once a btree node ache is selected, it will be immediately shunk. By this change, bch_mca_scan() may release btree node cahce more efficiently. Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: print written and keys in trace_bcache_btree_writeGuoju Fang1-1/+2
It's useful to dump written block and keys on btree write, this patch add them into trace_bcache_btree_write. Signed-off-by: Guoju Fang <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: avoid unnecessary btree nodes flushing in btree_flush_write()Coly Li1-5/+75
the commit 91be66e1318f ("bcache: performance improvement for btree_flush_write()") was an effort to flushing btree node with oldest btree node faster in following methods, - Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot of clean btree nodes. - Take c->btree_cache as a LRU-like list, aggressively flushing all dirty nodes from tail of c->btree_cache util the btree node with oldest journal entry is flushed. This is to reduce the time of holding c->bucket_lock. Guoju Fang and Shuang Li reported that they observe unexptected extra write I/Os on cache device after applying the above patch. Guoju Fang provideed more detailed diagnose information that the aggressive btree nodes flushing may cause 10x more btree nodes to flush in his workload. He points out when system memory is large enough to hold all btree nodes in memory, c->btree_cache is not a LRU-like list any more. Then the btree node with oldest journal entry is very probably not- close to the tail of c->btree_cache list. In such situation much more dirty btree nodes will be aggressively flushed before the target node is flushed. When slow SATA SSD is used as cache device, such over- aggressive flushing behavior will cause performance regression. After spending a lot of time on debug and diagnose, I find the real condition is more complicated, aggressive flushing dirty btree nodes from tail of c->btree_cache list is not a good solution. - When all btree nodes are cached in memory, c->btree_cache is not a LRU-like list, the btree nodes with oldest journal entry won't be close to the tail of the list. - There can be hundreds dirty btree nodes reference the oldest journal entry, before flushing all the nodes the oldest journal entry cannot be reclaimed. When the above two conditions mixed together, a simply flushing from tail of c->btree_cache list is really NOT a good idea. Fortunately there is still chance to make btree_flush_write() work better. Here is how this patch avoids unnecessary btree nodes flushing, - Only acquire c->journal.lock when getting oldest journal entry of fifo c->journal.pin. In rested locations check the journal entries locklessly, so their values can be changed on other cores in parallel. - In loop list_for_each_entry_safe_reverse(), checking latest front point of fifo c->journal.pin. If it is different from the original point which we get with locking c->journal.lock, it means the oldest journal entry is reclaim on other cores. At this moment, all selected dirty nodes recorded in array btree_nodes[] are all flushed and clean on other CPU cores, it is unncessary to iterate c->btree_cache any longer. Just quit the list_for_each_entry_safe_reverse() loop and the following for-loop will skip all the selected clean nodes. - Find a proper time to quit the list_for_each_entry_safe_reverse() loop. Check the refcount value of orignial fifo front point, if the value is larger than selected node number of btree_nodes[], it means more matching btree nodes should be scanned. Otherwise it means no more matching btee nodes in rest of c->btree_cache list, the loop can be quit. If the original oldest journal entry is reclaimed and fifo front point is updated, the refcount of original fifo front point will be 0, then the loop will be quit too. - Not hold c->bucket_lock too long time. c->bucket_lock is also required for space allocation for cached data, hold it for too long time will block regular I/O requests. When iterating list c->btree_cache, even there are a lot of maching btree nodes, in order to not holding c->bucket_lock for too long time, only BTREE_FLUSH_NR nodes are selected and to flush in following for-loop. With this patch, only btree nodes referencing oldest journal entry are flushed to cache device, no aggressive flushing for unnecessary btree node any more. And in order to avoid blocking regluar I/O requests, each time when btree_flush_write() called, at most only BTREE_FLUSH_NR btree nodes are selected to flush, even there are more maching btree nodes in list c->btree_cache. At last, one more thing to explain: Why it is safe to read front point of c->journal.pin without holding c->journal.lock inside the list_for_each_entry_safe_reverse() loop ? Here is my answer: When reading the front point of fifo c->journal.pin, we don't need to know the exact value of front point, we just want to check whether the value is different from the original front point (which is accurate value because we get it while c->jouranl.lock is held). For such purpose, it works as expected without holding c->journal.lock. Even the front point is changed on other CPU core and not updated to local core, and current iterating btree node has identical journal entry local as original fetched fifo front point, it is still safe. Because after holding mutex b->write_lock (with memory barrier) this btree node can be found as clean and skipped, the loop will quite latter when iterate on next node of list c->btree_cache. Fixes: 91be66e1318f ("bcache: performance improvement for btree_flush_write()") Reported-by: Guoju Fang <[email protected]> Reported-by: Shuang Li <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: add code comments for state->pool in __btree_sort()Coly Li1-0/+5
To explain the pages allocated from mempool state->pool can be swapped in __btree_sort(), because state->pool is a page pool, which allocates pages by alloc_pages() indeed. Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23lib: crc64: include <linux/crc64.h> for 'crc64_be'Ben Dooks (Codethink)1-0/+1
The crc64_be() is declared in <linux/crc64.h> so include this where the symbol is defined to avoid the following warning: lib/crc64.c:43:12: warning: symbol 'crc64_be' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: use read_cache_page_gfp to read the superblockChristoph Hellwig2-9/+8
Avoid a pointless dependency on buffer heads in bcache by simply open coding reading a single page. Also add a SB_OFFSET define for the byte offset of the superblock instead of using magic numbers. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: store a pointer to the on-disk sb in the cache and cached_dev structuresChristoph Hellwig2-19/+17
This allows to properly build the superblock bio including the offset in the page using the normal bio helpers. This fixes writing the superblock for page sizes larger than 4k where the sb write bio would need an offset in the bio_vec. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: return a pointer to the on-disk sb from read_superChristoph Hellwig1-11/+11
Returning the properly typed actual data structure insteaf of the containing struct page will save the callers some work going forward. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: transfer the sb_page reference to register_{bdev,cache}Christoph Hellwig1-15/+6
Avoid an extra reference count roundtrip by transferring the sb_page ownership to the lower level register helpers. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: fix use-after-free in register_bcache()Coly Li1-1/+2
The patch "bcache: rework error unwinding in register_bcache" introduces a use-after-free regression in register_bcache(). Here are current code, 2510 out_free_path: 2511 kfree(path); 2512 out_module_put: 2513 module_put(THIS_MODULE); 2514 out: 2515 pr_info("error %s: %s", path, err); 2516 return ret; If some error happens and the above code path is executed, at line 2511 path is released, but referenced at line 2515. Then KASAN reports a use- after-free error message. This patch changes line 2515 in the following way to fix the problem, 2515 pr_info("error %s: %s", path?path:"", err); Signed-off-by: Coly Li <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: properly initialize 'path' and 'err' in register_bcache()Coly Li1-1/+3
Patch "bcache: rework error unwinding in register_bcache" from Christoph Hellwig changes the local variables 'path' and 'err' in undefined initial state. If the code in register_bcache() jumps to label 'out:' or 'out_module_put:' by goto, these two variables might be reference with undefined value by the following line, out_module_put: module_put(THIS_MODULE); out: pr_info("error %s: %s", path, err); return ret; Therefore this patch initializes these two local variables properly in register_bcache() to avoid such issue. Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: rework error unwinding in register_bcacheChristoph Hellwig1-30/+45
Split the successful and error return path, and use one goto label for each resource to unwind. This also fixes some small errors like leaking the module reference count in the reboot case (which seems entirely harmless) or printing the wrong warning messages for early failures. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: use a separate data structure for the on-disk super blockChristoph Hellwig2-3/+54
Split out an on-disk version struct cache_sb with the proper endianness annotations. This fixes a fair chunk of sparse warnings, but there are some left due to the way the checksum is defined. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-23bcache: cached_dev_free needs to put the sb pageLiang Chen1-0/+3
Same as cache device, the buffer page needs to be put while freeing cached_dev. Otherwise a page would be leaked every time a cached_dev is stopped. Signed-off-by: Liang Chen <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>