summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2012-04-20blkcg: make request_queue bypassing on allocationTejun Heo1-12/+25
With the previous change to guarantee bypass visiblity for RCU read lock regions, entering bypass mode involves non-trivial overhead and future changes are scheduled to make use of bypass mode during init path. Combined it may end up adding noticeable delay during boot. This patch makes request_queue start its life in bypass mode, which is ended on queue init completion at the end of blk_init_allocated_queue(), and updates blk_queue_bypass_start() such that draining and RCU synchronization are performed only when the queue actually enters bypass mode. This avoids unnecessarily switching in and out of bypass mode during init avoiding the overhead and any nasty surprises which may step from leaving bypass mode on half-initialized queues. The boot time overhead was pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20blkcg: make sure blkg_lookup() returns %NULL if @q is bypassingTejun Heo2-19/+46
Currently, blkg_lookup() doesn't check @q bypass state. This patch updates blk_queue_bypass_start() to do synchronize_rcu() before returning and updates blkg_lookup() to check blk_queue_bypass() and return %NULL if bypassing. This ensures blkg_lookup() returns %NULL if @q is bypassing. This is to guarantee that nobody is accessing policy data while @q is bypassing, which is necessary to allow replacing blkio_cgroup->pd[] in place on policy [de]activation. v2: Added more comments explaining bypass guarantees as suggested by Vivek. v3: Added more comments explaining why there's no synchronize_rcu() in blk_cleanup_queue() as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20blkcg: make blkg_conf_prep() take @pol and return with queue lock heldTejun Heo4-10/+14
Add @pol to blkg_conf_prep() and let it return with queue lock held (to be released by blkg_conf_finish()). Note that @pol isn't used yet. This is to prepare for per-queue policy activation and doesn't cause any visible difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20blkcg: remove static policy ID enumsTejun Heo4-40/+63
Remove BLKIO_POLICY_* enums and let blkio_policy_register() allocate @pol->plid dynamically on registration. The maximum number of blkcg policies which can be registered at the same time is defined by BLKCG_MAX_POLS constant added to include/linux/blkdev.h. Note that blkio_policy_register() now may fail. Policy init functions updated accordingly and unnecessary ifdefs removed from cfq_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20blkcg: use @pol instead of @plid in update_root_blkg_pd() and ↵Tejun Heo4-21/+23
blkcg_print_blkgs() The two functions were taking "enum blkio_policy_id plid". Make them take "const struct blkio_policy_type *pol" instead. This is to prepare for per-queue policy activation and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20blkcg: kill blkio_list and replace blkio_list_lock with a mutexTejun Heo2-16/+17
With blkio_policy[], blkio_list is redundant and hinders with per-queue policy activation. Remove it. Also, replace blkio_list_lock with a mutex blkcg_pol_mutex and let it protect the whole [un]registration. This is to prepare for per-queue policy activation and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20cfq: fix build breakage & warningsTejun Heo2-11/+10
* CFQ_WEIGHT_* defined inside CONFIG_BLK_CGROUP causes cfq-iosched.c compile failure when the config is disabled. Move it outside the ifdef block. * Dummy cfqg_stats_*() definitions were lacking inline modifiers causing unused functions warning if !CONFIG_CFQ_GROUP_IOSCHED. Add them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-13Merge branch 'for-3.4/core' of git://git.kernel.dk/linux-blockLinus Torvalds3-5/+12
Pull block core bits from Jens Axboe: "It's a nice and quiet round this time, since most of the tricky stuff has been pushed to 3.5 to give it more time to mature. After a few hectic block IO core changes for 3.3 and 3.2, I'm quite happy with a slow round. Really minor stuff in here, the only real functional change is making the auto-unplug threshold a per-queue entity. The threshold is set so that it's low enough that we don't hold off IO for too long, but still big enough to get a nice benefit from the batched insert (and hence queue lock cost reduction). For raid configurations, this currently breaks down." * 'for-3.4/core' of git://git.kernel.dk/linux-block: block: make auto block plug flush threshold per-disk based Documentation: Add sysfs ABI change for cfq's target latency. block: Make cfq_target_latency tunable through sysfs. block: use lockdep_assert_held for queue locking block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNEL
2012-04-06block: make auto block plug flush threshold per-disk basedShaohua Li1-1/+2
We do auto block plug flush to reduce latency, the threshold is 16 requests. This works well if the task is accessing one or two drives. The problem is if the task is accessing a raid 0 device and the raid disk number is big, say 8 or 16, 16/8 = 2 or 16/16=1, we will have heavy lock contention. This patch makes the threshold per-disk based. The latency should be still ok accessing one or two drives. The setup with application accessing a lot of drives in the meantime uaually is big machine, avoiding lock contention is more important, because any contention will actually increase latency. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-01blkcg: drop BLKCG_STAT_{PRIV|POL|OFF} macrosTejun Heo4-84/+72
Now that all stat handling code lives in policy implementations, there's no need to encode policy ID in cft->private. * Export blkcg_prfill_[rw]stat() from blkcg, remove blkcg_print_[rw]stat(), and implement cfqg_print_[rw]stat() which use hard-code BLKIO_POLICY_PROP. * Use cft->private for offset of the target field directly and drop BLKCG_STAT_{PRIV|POL|OFF}(). Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: pass around pd->pdata instead of pd itself in prfill functionsTejun Heo4-41/+33
Now that all conf and stat fields are moved into policy specific blkio_policy_data->pdata areas, there's no reason to use blkio_policy_data itself in prfill functions. Pass around @pd->pdata instead of @pd. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: move blkio_group_conf->iops and ->bps to blk-throttleTejun Heo2-103/+58
blkio_cgroup_conf->iops and ->bps are owned by blk-throttle and has no reason to be defined in blkcg core. Drop them and let conf setting functions directly manipulate throtl_grp->bps[] and ->iops[]. This makes blkio_group_conf empty. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: move blkio_group_conf->weight to cfqTejun Heo3-50/+45
blkio_group_conf->weight is owned by cfq and has no reason to be defined in blkcg core. Replace it with cfq_group->dev_weight and let conf setting functions directly set it. If dev_weight is zero, the cfqg doesn't have device specific weight configured. Also, rename BLKIO_WEIGHT_* constants to CFQ_WEIGHT_* and rename blkio_cgroup->weight to blkio_cgroup->cfq_weight. We eventually want per-policy storage in blkio_cgroup but just mark the ownership of the field for now. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: move blkio_group_stats_cpu and friends to blk-throttle.cTejun Heo3-125/+114
blkio_group_stats_cpu is used only by blk-throtl and has no reason to be defined in blkcg core. * Move blkio_group_stats_cpu to blk-throttle.c and rename it to tg_stats_cpu. * blkg_policy_data->stats_cpu is replaced with throtl_grp->stats_cpu. prfill functions updated accordingly. * All related macros / functions are renamed so that they have tg_ prefix and the unnecessary @pol arguments are dropped. * Per-cpu stats allocation code is also moved from blk-cgroup.c to blk-throttle.c and gets simplified to only deal with BLKIO_POLICY_THROTL. percpu stat free is performed by the exit method throtl_exit_blkio_group(). * throtl_reset_group_stats() implemented for blkio_reset_group_stats_fn method so that tg->stats_cpu can be reset. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: move blkio_group_stats to cfq-iosched.cTejun Heo3-278/+193
blkio_group_stats contains only fields used by cfq and has no reason to be defined in blkcg core. * Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats. * blkg_policy_data->stats is replaced with cfq_group->stats. blkg_prfill_[rw]stat() are updated to use offset against pd->pdata instead. * All related macros / functions are renamed so that they have cfqg_ prefix and the unnecessary @pol arguments are dropped. * All stat functions now take cfq_group * instead of blkio_group *. * lockdep assertion on queue lock dropped. Elevator runs under queue lock by default. There isn't much to be gained by adding lockdep assertions at stat function level. * cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method so that cfqg->stats can be reset. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: add blkio_policy_ops operations for exit and stat resetTejun Heo2-4/+16
Add blkio_policy_ops->blkio_exit_group_fn() and ->blkio_reset_group_stats_fn(). These will be used to further modularize blkcg policy implementation. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: cfq doesn't need per-cpu dispatch statsTejun Heo4-95/+48
blkio_group_stats_cpu is used to count dispatch stats using per-cpu counters. This is used by both blk-throtl and cfq-iosched but the sharing is rather silly. * cfq-iosched doesn't need per-cpu dispatch stats. cfq always updates those stats while holding queue_lock. * blk-throtl needs per-cpu dispatch stats but only service_bytes and serviced. It doesn't make use of sectors. This patch makes cfq add and use global stats for service_bytes, serviced and sectors, removes per-cpu sectors counter and moves per-cpu stat printing code to blk-throttle.c. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: move statistics update code to policiesTejun Heo4-397/+259
As with conf/stats file handling code, there's no reason for stat update code to live in blkcg core with policies calling into update them. The current organization is both inflexible and complex. This patch moves stat update code to specific policies. All blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP stats are collapsed into their cfq_blkiocg_update_*_stats() counterparts. blkiocg_update_dispatch_stats() is used by both policies and duplicated as throtl_update_dispatch_stats() and cfq_blkiocg_update_dispatch_stats(). This will be cleaned up later. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01cfq: collapse cfq.h into cfq-iosched.cTejun Heo2-119/+113
block/cfq.h contains some functions which interact with blkcg; however, this is only part of it and cfq-iosched.c already has quite some #ifdef CONFIG_CFQ_GROUP_IOSCHED. With conf/stat handling being moved to specific policies, having these relay functions isolated in cfq.h doesn't make much sense. Collapse cfq.h into cfq-iosched.c for now. Let's split blkcg support properly later if necessary. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: move conf/stat file handling code to policiesTejun Heo4-420/+333
blkcg conf/stat handling is convoluted in that details which belong to specific policy implementations are all out in blkcg core and then policies hook into core layer to access and manipulate confs and stats. This sadly achieves both inflexibility (confs/stats can't be modified without messing with blkcg core) and complexity (all the call-ins and call-backs). The previous patches restructured conf and stat handling code such that they can be separated out. This patch relocates the file handling part. All conf/stat file handling code which belongs to BLKIO_POLICY_PROP is moved to cfq-iosched.c and all BKLIO_POLICY_THROTL code to blk-throtl.c. The move is verbatim except for blkio_update_group_{weight|bps|iops}() callbacks which relays conf changes to policies. The configuration settings are handled in policies themselves so the relaying isn't necessary. Conf setting functions are modified to directly call per-policy update functions and the relaying mechanism is dropped. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: implement blkio_policy_type->cftypesTejun Heo2-0/+7
Add blkiop->cftypes which is added and removed together with the policy. This will be used to move conf/stat handling to the policies. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: export conf/stat helpers to prepare for reorganizationTejun Heo2-27/+52
conf/stat handling is about to be moved to policy implementation from blkcg core. Export conf/stat helpers from blkcg core so that blk-throttle and cfq-iosched can use them. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: simplify blkg_conf_prep()Tejun Heo1-54/+10
blkg_conf_prep() implements "MAJ:MIN VAL" parsing manually, which is unnecessary. Just use sscanf("%u:%u %llu"). This might not reject some malformed input (extra input at the end) but we don't care. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: restructure blkio_group configruation settingTejun Heo2-140/+147
As part of userland interface restructuring, this patch updates per-blkio_group configuration setting. Instead of funneling everything through a master function which has hard-coded cases for each config file it may handle, the common part is factored into blkg_conf_prep() and blkg_conf_finish() and different configuration setters are implemented using the helpers. While this doesn't result in immediate LOC reduction, this enables further cleanups and more modular implementation. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: restructure configuration printingTejun Heo2-104/+55
Similarly to the previous stat restructuring, this patch restructures conf printing code such that, * Conf printing uses the same helpers as stat. * Printing function doesn't require hardcoded switching on the config being printed. Note that this isn't complete yet for throttle confs. The next patch will convert setting for these confs and will complete the transition. * Printing uses read_seq_string callback (other methods will be phased out). Note that blkio_group_conf.iops[2] is changed to u64 so that they can be manipulated with the same functions. This is transitional and will go away later. After this patch, per-device configurations - weight, bps and iops - use __blkg_prfill_u64() for printing which uses white space as delimiter instead of tab. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: drop blkiocg_file_write_u64()Tejun Heo1-28/+7
blkiocg_file_write_u64() has single switch case. Drop blkiocg_file_write_u64(), rename blkio_weight_write() to blkcg_set_weight() and use it directly for .write_u64 callback. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: restructure statistics printingTejun Heo2-374/+243
blkcg stats handling is a mess. None of the stats has much to do with blkcg core but they are all implemented in blkcg core. Code sharing is achieved by mixing common code with hard-coded cases for each stat counter. This patch restructures statistics printing such that * Common logic exists as helper functions and specific print functions use the helpers to implement specific cases. * Printing functions serving multiple counters don't require hardcoded switching on specific counters. * Printing uses read_seq_string callback (other methods will be phased out). This change enables further cleanups and relocating stats code to the policy implementation it belongs to. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: introduce blkg_stat and blkg_rwstatTejun Heo2-207/+293
blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: BLKIO_STAT_CPU_SECTORS doesn't have subcountersTejun Heo1-3/+6
BLKIO_STAT_CPU_SECTORS doesn't need read/write/sync/async subcounters and is counted by blkio_group_stats_cpu->sectors; however, it still holds a member in blkio_group_stats_cpu->stat_arr_cpu. Rearrange stat_type_cpu and define BLKIO_STAT_CPU_ARR_NR and use it for stat_arr_cpu[] size so that only SERVICE_BYTES and SERVICED have subcounters. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: remove unused @pol and @plid parametersTejun Heo4-16/+9
@pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no longer necessary. Drop them. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01block: Make cfq_target_latency tunable through sysfs.Tao Ma1-2/+8
In cfq, when we calculate a time slice for a process(or a cfqq to be precise), we have to consider the cfq_target_latency so that all the sync request have an estimated latency(300ms) and it is controlled by cfq_target_latency. But in some hadoop test, we have found that if there are many processes doing sequential read(24 for example), the throughput is bad because every process can only work for about 25ms and the cfqq is switched. That leads to a higher disk seek. We can achive the good throughput by setting low_latency=0, but then some read's latency is too much for the application. So this patch makes cfq_target_latency tunable through sysfs so that we can tune it and find some magic number which is not bad for both the throughput and the read latency. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-01Merge branch 'for-3.5' of ../cgroup into block/for-3.5/core-mergedTejun Heo5-65/+35
cgroup/for-3.5 contains the following changes which blk-cgroup needs to proceed with the on-going cleanup. * Dynamic addition and removal of cftypes to make config/stat file handling modular for policies. * cgroup removal update to not wait for css references to drain to fix blkcg removal hang caused by cfq caching cfqgs. Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the following conflicts in block/blk-cgroup.c. * 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks" conflicts with blkiocg_pre_destroy() addition and blkiocg_attach() removal. Resolved by removing @subsys from all subsys methods. * 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in controllers" conflicts with ->pre_destroy() and ->attach() updates and removal of modular config. Resolved by dropping forward declarations of the methods and applying updates to the relocated blkio_subsys. * 4baf6e3325 "cgroup: convert all non-memcg controllers to the new cftype interface" builds upon the previous item. Resolved by adding ->base_cftypes to the relocated blkio_subsys. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01cgroup: convert all non-memcg controllers to the new cftype interfaceTejun Heo1-7/+2
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio, net_cls and device controllers to use the new cftype based interface. Termination entry is added to cftype arrays and populate callbacks are replaced with cgroup_subsys->base_cftypes initializations. This is functionally identical transformation. There shouldn't be any visible behavior change. memcg is rather special and will be converted separately. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Vivek Goyal <vgoyal@redhat.com>
2012-04-01cgroup: relocate cftype and cgroup_subsys definitions in controllersTejun Heo1-22/+16
blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol unnecessarily define cftype array and cgroup_subsys structures at the top of the file, which is unconventional and necessiates forward declaration of methods. This patch relocates those below the definitions of the methods and removes the forward declarations. Note that forward declaration of tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup(). This will be removed soon by another patch. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-03-30block: use lockdep_assert_held for queue lockingAndi Kleen1-1/+1
Instead of an ugly open coded variant. Cc: axboe@kernel.dk Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-29blkcg: change a spin_lock() to spin_lock_irq()Dan Carpenter1-1/+1
Smatch complains that we re-enable IRQs twice. It looks like we forgot to disable them here on the spin_trylock() failure path. This was added in 9f13ef678e "blkcg: use double locking instead of RCU for blkg synchronization". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>` Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-23cfq: fix cfqg ref handling when BLK_CGROUP && !CFQ_GROUP_IOSCHEDTejun Heo1-17/+35
When BLK_CGROUP is enabled but CFQ_GROUP_IOSCHED is, cfq ends up calling blkg_get/put() on dummy cfqg leading to the following crash. BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0 IP: [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc6-work+ #125 Bochs Bochs RIP: 0010:[<ffffffff813d44d8>] [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 RSP: 0018:ffff88001f9dfd80 EFLAGS: 00010046 RAX: ffff88001aefbbf0 RBX: ffff88001aeedbf0 RCX: 0000000000000100 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820ffd40 RBP: ffff88001f9dfdd0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000009 R14: ffff88001aefbc30 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000000b0 CR3: 000000000206f000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper/0 (pid: 1, threadinfo ffff88001f9de000, task ffff88001f9dc040) Stack: ffff88001aeedbf0 ffff88001aefbdb0 ffff88001aef1548 ffff88001aefbbf0 ffff88001f9dfdd0 ffff88001aef1548 ffffffff820d6320 ffffffff8165ce30 ffffffff82c555e0 ffff88001aeebbf0 ffff88001f9dfe00 ffffffff813b0507 Call Trace: [<ffffffff813b0507>] elevator_init+0xd7/0x140 [<ffffffff813b83d5>] blk_init_allocated_queue+0x125/0x150 [<ffffffff813b94d3>] blk_init_queue_node+0x43/0x80 [<ffffffff813b9523>] blk_init_queue+0x13/0x20 [<ffffffff821aec00>] floppy_init+0x82/0xec7 [<ffffffff810001d2>] do_one_initcall+0x42/0x170 [<ffffffff821835fc>] kernel_init+0xcb/0x14f [<ffffffff81b40b24>] kernel_thread_helper+0x4/0x10 Code: 00 e8 1d 9e 76 00 48 8b 43 48 48 85 c0 48 89 83 28 03 00 00 74 07 4c 8b a0 10 ff ff ff 8b 15 b0 2e d0 00 85 d2 0f 85 49 01 00 00 <41> 8b 84 24 b0 00 00 00 85 c0 0f 8e 8c 01 00 00 83 e8 01 85 c0 RIP [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 Because cfq's blkcg support has a on/off switch, CFQ_GROUP_IOSCHED, separate from BLK_CGROUP, blkg access through cfqg needs to be conditioned on it. * Make blkg_to_cfqg() and cfqg_to_blkg() conditioned on CFQ_GROUP_IOSCHED. If disabled, they always return %NULL. * Introduce cfqg_get() and cfqg_put() conditioned on CFQ_GROUP_IOSCHED. If disabled, they are noops. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-23block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNELDan Carpenter1-1/+1
We should use the GFP flags that the caller specified instead of picking our own. All the callers specify GFP_KERNEL so this doesn't make a difference to how the kernel runs, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20Merge branch 'for-3.4' of ↵Linus Torvalds1-14/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Out of the 8 commits, one fixes a long-standing locking issue around tasklist walking and others are cleanups." * 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list cgroup: Remove wrong comment on cgroup_enable_task_cg_list() cgroup: remove cgroup_subsys argument from callbacks cgroup: remove extra calls to find_existing_css_set cgroup: replace tasklist_lock with rcu_read_lock cgroup: simplify double-check locking in cgroup_attach_proc cgroup: move struct cgroup_pidlist out from the header file cgroup: remove cgroup_attach_task_current_cg()
2012-03-20Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2-24/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes for v3.4 from Ingo Molnar * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) printk: Make it compile with !CONFIG_PRINTK sched/x86: Fix overflow in cyc2ns_offset sched: Fix nohz load accounting -- again! sched: Update yield() docs printk/sched: Introduce special printk_sched() for those awkward moments sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer sched: Cleanup cpu_active madness sched: Fix load-balance wreckage sched: Clean up parameter passing of proc_sched_autogroup_set_nice() sched: Ditch per cgroup task lists for load-balancing sched: Rename load-balancing fields sched: Move load-balancing arguments into helper struct sched/rt: Do not submit new work when PI-blocked sched/rt: Prevent idle task boosting sched/wait: Add __wake_up_all_locked() API sched/rt: Document scheduler related skip-resched-check sites sched/rt: Use schedule_preempt_disabled() sched/rt: Add schedule_preempt_disabled() sched/rt: Do not throttle when PI boosting sched/rt: Keep period timer ticking when rt throttling is active ...
2012-03-20block: remove ioc_*_changed()Tejun Heo2-87/+0
After the previous patch to cfq, there's no ioc_get_changed() user left. This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all related stuff. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20cfq: don't use icq_get_changed()Tejun Heo1-23/+40
cfq caches the associated cfqq's for a given cic. The cache needs to be flushed if the cic's ioprio or blkcg has changed. It is currently done by requiring the changing action to set the respective ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(), which involves iterating through all the affected icqs. All cfq wants to know is whether ioprio and/or blkcg have changed since the last flush and can be easily achieved by just remembering the current ioprio and blkcg ID in cic. This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to use the remembered value instead, and updates cfq_set_request() path such that, instead of using icq_get_changed(), the current values are compared against the remembered ones and trigger appropriate flush action if not. Condition tests are moved inside both _changed functions which are now named check_ioprio_changed() and check_blkcg_changed(). ioprio.h::task_ioprio*() can't be used anymore and replaced with open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20cfq: pass around cfq_io_cq instead of io_contextTejun Heo1-22/+17
Now that io_cq is managed by block core and guaranteed to exist for any in-flight request, it is easier and carries more information to pass around cfq_io_cq than io_context. This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and cfq_get_queue() to take @cic instead of @ioc. This change removes a duplicate cfq_cic_lookup() from cfq_find_alloc_queue(). This change enables the use of cic-cached ioprio in the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: add blkcg->idTejun Heo2-0/+6
Add 64bit unique id to blkcg. This will be used by policies which want blkcg identity test to tell whether the associated blkcg has changed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: remove blkio_group->stats_lockTejun Heo2-109/+103
With recent plug merge updates, all non-percpu stat updates happen under queue_lock making stats_lock unnecessary to synchronize stat updates. The only synchronization necessary is stat reading, which can be done using u64_stats_sync instead. This patch removes blkio_group->stats_lock and adds blkio_group_stats->syncp for reader synchronization. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: restructure blkio_get_stat()Tejun Heo2-50/+56
Restructure blkio_get_stat() to prepare for removal of stats_lock. * Define BLKIO_STAT_ARR_NR explicitly to denote which stats have subtypes instead of using BLKIO_STAT_QUEUED. * Separate out stat acquisition and printing. After this, there are only two users of blkio_fill_stat(). Just open code it. * The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1. There's no need to subtract one. Use MAX_KEY_LEN consistently. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: simplify stat resetTejun Heo2-57/+37
blkiocg_reset_stats() implements stat reset for blkio.reset_stats cgroupfs file. This feature is very unconventional and something which shouldn't have been merged. It's only useful when there's only one user or tool looking at the stats. As soon as multiple users and/or tools are involved, it becomes useless as resetting disrupts other usages. There are very good reasons why all other stats expect readers to read values at the start and end of a period and subtract to determine delta over the period. The implementation is rather complex - some fields shouldn't be cleared and it saves some fields, resets whole and restores for some reason. Reset of percpu stats is also racy. The comment points to 64bit store atomicity for the reason but even without that stores for zero can simply race with other CPUs doing RMW and get clobbered. Simplify reset by * Clear selectively instead of resetting and restoring. * Grouping debug stat fields to be reset and using memset() over them. * Not caring about stats_lock. * Using memset() to reset percpu stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: don't use percpu for merged statsTejun Heo2-23/+9
With recent plug merge updates, merged stats are no longer called for plug merges and now only updated while holding queue_lock. As stats_lock is scheduled to be removed, there's no reason to use percpu for merged stats. Don't use percpu for merged stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: alloc per cpu stats from worker thread in a delayed mannerVivek Goyal2-40/+91
Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-14Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds4-79/+158
Pull block fixes from Jens Axboe: "Been sitting on this for a while, but lets get this out the door. This fixes various important bugs for 3.3 final, along with a few more trivial ones. Please pull!" * 'for-linus' of git://git.kernel.dk/linux-block: block: fix ioc leak in put_io_context block, sx8: fix pointer math issue getting fw version Block: use a freezable workqueue for disk-event polling drivers/block/DAC960: fix -Wuninitialized warning drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning block: fix __blkdev_get and add_disk race condition block: Fix setting bio flags in drivers (sd_dif/floppy) block: Fix NULL pointer dereference in sd_revalidate_disk block: exit_io_context() should call elevator_exit_icq_fn() block: simplify ioc_release_fn() block: replace icq->changed with icq->flags