Some More Digging
As described here I’m still having performance issues with my Gentoo 64bit installation. I’ve not yet switched back to 32bit.
In the hope, that someone might make some sense of my findings, I’ve dug a bit more. First of all, there is definitely a hardware dependency in all this. I’ve carried all the binary packages from my system to a different system and basically duplicated my setup on an older unused 4 core AMD Opteron box (dual processor and Promise chip set), the only difference being, that the kernel is configured for generic x86_64 while at home I’m using Core 2. I don’t see any problems at all. Audacity slices through a large MP3 file like a hot knife through butter.
Then yesterday late at night I had the idea to profile my problem scenario (with OProfile). That is, what I did today. I recompiled a kernel with CONFIG_OPROFILE and CONFIG_ARCH_SUPPORTS_OPROFILE enabled and did two measurements: letting Audacity write the audio data into the root fs, which took more than 8 minutes, and letting Audacity write into a loopback mounted EXT3 fs. The second try took under 2 minutes. Each time I profiled the period until loading into Audacity was complete.
Here is some global profiling data from executing simply opreport.
Good run: Slow run:
========= =========
samples| %| samples| %|
---------------- ----------------
1053609 52.9598 libmad.so.0.2.1 1058074 55.8550 libmad.so.0.2.1
490038 24.6318 vmlinux 341606 18.0332 vmlinux
267461 13.4440 audacity 274106 14.4699 audacity
66449 3.3401 libc-2.6.1.so 68979 3.6414 libc-2.6.1.so
11774 0.5918 libwx_base-2.6.so.0.4.0 18725 0.9885 ld-2.6.1.so
10877 0.5467 nvidia_drv.so 14354 0.7577 nvidia_drv.so
9789 0.4920 Xorg 13594 0.7176 libqt-mt.so.3.3.8
9521 0.4786 libqt-mt.so.3.3.8 13309 0.7026 nvidia
8391 0.4218 ld-2.6.1.so 11974 0.6321 Xorg
7520 0.3780 libwx_gtk2_core-2.6.so.0.4.0 11206 0.5916 libwx_base-2.6.so.0.4.0
6909 0.3473 nvidia 7709 0.4070 libpthread-2.6.1.so
6320 0.3177 libpthread-2.6.1.so 7363 0.3887 libwx_gtk2_core-2.6.so.0.4.0
6016 0.3024 oprofiled 5935 0.3133 libglib-2.0.so.0.1400.6
4669 0.2347 libgobject-2.0.so.0.1400.6 5882 0.3105 libgobject-2.0.so.0.1400.6
4266 0.2144 libglib-2.0.so.0.1400.6 5793 0.3058 oprofiled
4142 0.2082 libwfb.so 5757 0.3039 libwfb.so
3512 0.1765 libgdk-x11-2.0.so.0.1200.8 4191 0.2212 libgdk-x11-2.0.so.0.1200.8
2683 0.1349 ahci 2787 0.1471 libX11.so.6.2.0
2402 0.1207 libX11.so.6.2.0 2704 0.1427 ahci
Admittedly I don’t completely understand the implications of the displayed numbers, but there is a significant difference in the numbers for vmlinux and the numbers for ld-2.6.1.so differ by roughly 50%.
The next lines are the output of opreport --symbols --image-path=/lib/modules/2.6.24...., just to show, what the output looks like:
samples % image name app name symbol name
1053609 52.9598 libmad.so.0.2.1 libmad.so.0.2.1 (no symbols)
267461 13.4440 audacity audacity (no symbols)
66449 3.3401 libc-2.6.1.so libc-2.6.1.so (no symbols)
49271 2.4766 vmlinux vmlinux transfer_none
26277 1.3208 vmlinux vmlinux __copy_user_nocache
To condense the listing a bit I egrepped after 'ld-2|libc|vmlinux' to remove the application data. First some lines from the “good” run:
Good run:
=========
samples % image name app name symbol name
66449 3.3401 libc-2.6.1.so libc-2.6.1.so (no symbols)
49271 2.4766 vmlinux vmlinux transfer_none
26277 1.3208 vmlinux vmlinux __copy_user_nocache
19446 0.9775 vmlinux vmlinux sub_preempt_count
16331 0.8209 vmlinux vmlinux add_preempt_count
11737 0.5900 vmlinux vmlinux journal_add_journal_head
9007 0.4527 vmlinux vmlinux kmem_cache_alloc
8541 0.4293 vmlinux vmlinux __wake_up_bit
8391 0.4218 ld-2.6.1.so ld-2.6.1.so (no symbols)
7693 0.3867 vmlinux vmlinux journal_put_journal_head
7411 0.3725 vmlinux vmlinux debug_smp_processor_id
7129 0.3583 vmlinux vmlinux __find_get_block
6928 0.3482 vmlinux vmlinux ext3_mark_iloc_dirty
6575 0.3305 vmlinux vmlinux get_page_from_freelist
6209 0.3121 vmlinux vmlinux do_get_write_access
5575 0.2802 vmlinux vmlinux kmem_cache_free
5427 0.2728 vmlinux vmlinux end_buffer_async_write
5362 0.2695 vmlinux vmlinux __link_path_walk
4877 0.2451 vmlinux vmlinux journal_commit_transaction
4847 0.2436 vmlinux vmlinux start_this_handle
4342 0.2183 vmlinux vmlinux __d_lookup
4298 0.2160 vmlinux vmlinux journal_dirty_metadata
4136 0.2079 vmlinux vmlinux copy_user_generic_string
3972 0.1997 vmlinux vmlinux journal_dirty_data
3856 0.1938 vmlinux vmlinux bit_waitqueue
3834 0.1927 vmlinux vmlinux mwait_idle
And then from the “slow” run:
Slow run:
=========
samples % image name app name symbol name
68979 3.6414 libc-2.6.1.so libc-2.6.1.so (no symbols)
24052 1.2697 vmlinux vmlinux __copy_user_nocache
18725 0.9885 ld-2.6.1.so ld-2.6.1.so (no symbols)
14598 0.7706 vmlinux vmlinux sub_preempt_count
11806 0.6232 vmlinux vmlinux add_preempt_count
8444 0.4458 vmlinux vmlinux journal_add_journal_head
7830 0.4133 vmlinux vmlinux mwait_idle
6452 0.3406 vmlinux vmlinux ext3_mark_iloc_dirty
6231 0.3289 vmlinux vmlinux do_get_write_access
5925 0.3128 vmlinux vmlinux read_tsc
5436 0.2870 vmlinux vmlinux journal_put_journal_head
5317 0.2807 vmlinux vmlinux vsnprintf
5293 0.2794 vmlinux vmlinux __link_path_walk
5121 0.2703 vmlinux vmlinux __find_get_block
4955 0.2616 vmlinux vmlinux kmem_cache_alloc
4771 0.2519 vmlinux vmlinux journal_dirty_metadata
4698 0.2480 vmlinux vmlinux __wake_up_bit
4384 0.2314 vmlinux vmlinux __d_lookup
4144 0.2188 vmlinux vmlinux debug_smp_processor_id
4112 0.2171 vmlinux vmlinux copy_user_generic_string
3656 0.1930 vmlinux vmlinux journal_cancel_revoke
3274 0.1728 vmlinux vmlinux page_fault
3269 0.1726 vmlinux vmlinux get_page_from_freelist
3050 0.1610 vmlinux vmlinux __ext3_get_inode_loc
Is anybody out there, who can make sense out of this data. What again catches the eye is a roughly 50% difference for mwait_idle and that the good run has a good bit of samples for transfer_none.
I guess, I’ll wait for 2.6.25 until I finally decide to switch back to 32bit.
