Archive for April 2008

 
 

Some More Digging

As described here I’m still having performance issues with my Gentoo 64bit installation. I’ve not yet switched back to 32bit.

In the hope, that someone might make some sense of my findings, I’ve dug a bit more. First of all, there is definitely a hardware dependency in all this. I’ve carried all the binary packages from my system to a different system and basically duplicated my setup on an older unused 4 core AMD Opteron box (dual processor and Promise chip set), the only difference being, that the kernel is configured for generic x86_64 while at home I’m using Core 2. I don’t see any problems at all. Audacity slices through a large MP3 file like a hot knife through butter.

Then yesterday late at night I had the idea to profile my problem scenario (with OProfile). That is, what I did today. I recompiled a kernel with CONFIG_OPROFILE and CONFIG_ARCH_SUPPORTS_OPROFILE enabled and did two measurements: letting Audacity write the audio data into the root fs, which took more than 8 minutes, and letting Audacity write into a loopback mounted EXT3 fs. The second try took under 2 minutes. Each time I profiled the period until loading into Audacity was complete.

Here is some global profiling data from executing simply opreport.

Good run:                                    Slow run:
=========                                    =========
samples|      %|                             samples|      %|
----------------                             ----------------
1053609 52.9598 libmad.so.0.2.1              1058074 55.8550 libmad.so.0.2.1
 490038 24.6318 vmlinux                       341606 18.0332 vmlinux
 267461 13.4440 audacity                      274106 14.4699 audacity
  66449  3.3401 libc-2.6.1.so                  68979  3.6414 libc-2.6.1.so
  11774  0.5918 libwx_base-2.6.so.0.4.0        18725  0.9885 ld-2.6.1.so
  10877  0.5467 nvidia_drv.so                  14354  0.7577 nvidia_drv.so
   9789  0.4920 Xorg                           13594  0.7176 libqt-mt.so.3.3.8
   9521  0.4786 libqt-mt.so.3.3.8              13309  0.7026 nvidia
   8391  0.4218 ld-2.6.1.so                    11974  0.6321 Xorg
   7520  0.3780 libwx_gtk2_core-2.6.so.0.4.0   11206  0.5916 libwx_base-2.6.so.0.4.0
   6909  0.3473 nvidia                          7709  0.4070 libpthread-2.6.1.so
   6320  0.3177 libpthread-2.6.1.so             7363  0.3887 libwx_gtk2_core-2.6.so.0.4.0
   6016  0.3024 oprofiled                       5935  0.3133 libglib-2.0.so.0.1400.6
   4669  0.2347 libgobject-2.0.so.0.1400.6      5882  0.3105 libgobject-2.0.so.0.1400.6
   4266  0.2144 libglib-2.0.so.0.1400.6         5793  0.3058 oprofiled
   4142  0.2082 libwfb.so                       5757  0.3039 libwfb.so
   3512  0.1765 libgdk-x11-2.0.so.0.1200.8      4191  0.2212 libgdk-x11-2.0.so.0.1200.8
   2683  0.1349 ahci                            2787  0.1471 libX11.so.6.2.0
   2402  0.1207 libX11.so.6.2.0                 2704  0.1427 ahci

Admittedly I don’t completely understand the implications of the displayed numbers, but there is a significant difference in the numbers for vmlinux and the numbers for ld-2.6.1.so differ by roughly 50%.

The next lines are the output of opreport --symbols --image-path=/lib/modules/2.6.24...., just to show, what the output looks like:

samples  %        image name               app name                 symbol name
1053609  52.9598  libmad.so.0.2.1          libmad.so.0.2.1          (no symbols)
267461   13.4440  audacity                 audacity                 (no symbols)
66449     3.3401  libc-2.6.1.so            libc-2.6.1.so            (no symbols)
49271     2.4766  vmlinux                  vmlinux                  transfer_none
26277     1.3208  vmlinux                  vmlinux                  __copy_user_nocache

To condense the listing a bit I egrepped after 'ld-2|libc|vmlinux' to remove the application data. First some lines from the “good” run:

Good run:
=========
samples   %       image name      app name        symbol name
66449     3.3401  libc-2.6.1.so   libc-2.6.1.so   (no symbols)
49271     2.4766  vmlinux         vmlinux         transfer_none
26277     1.3208  vmlinux         vmlinux         __copy_user_nocache
19446     0.9775  vmlinux         vmlinux         sub_preempt_count
16331     0.8209  vmlinux         vmlinux         add_preempt_count
11737     0.5900  vmlinux         vmlinux         journal_add_journal_head
9007      0.4527  vmlinux         vmlinux         kmem_cache_alloc
8541      0.4293  vmlinux         vmlinux         __wake_up_bit
8391      0.4218  ld-2.6.1.so     ld-2.6.1.so     (no symbols)
7693      0.3867  vmlinux         vmlinux         journal_put_journal_head
7411      0.3725  vmlinux         vmlinux         debug_smp_processor_id
7129      0.3583  vmlinux         vmlinux         __find_get_block
6928      0.3482  vmlinux         vmlinux         ext3_mark_iloc_dirty
6575      0.3305  vmlinux         vmlinux         get_page_from_freelist
6209      0.3121  vmlinux         vmlinux         do_get_write_access
5575      0.2802  vmlinux         vmlinux         kmem_cache_free
5427      0.2728  vmlinux         vmlinux         end_buffer_async_write
5362      0.2695  vmlinux         vmlinux         __link_path_walk
4877      0.2451  vmlinux         vmlinux         journal_commit_transaction
4847      0.2436  vmlinux         vmlinux         start_this_handle
4342      0.2183  vmlinux         vmlinux         __d_lookup
4298      0.2160  vmlinux         vmlinux         journal_dirty_metadata
4136      0.2079  vmlinux         vmlinux         copy_user_generic_string
3972      0.1997  vmlinux         vmlinux         journal_dirty_data
3856      0.1938  vmlinux         vmlinux         bit_waitqueue
3834      0.1927  vmlinux         vmlinux         mwait_idle

And then from the “slow” run:

Slow run:
=========
samples   %       image name      app name        symbol name
68979     3.6414  libc-2.6.1.so   libc-2.6.1.so   (no symbols)
24052     1.2697  vmlinux         vmlinux         __copy_user_nocache
18725     0.9885  ld-2.6.1.so     ld-2.6.1.so     (no symbols)
14598     0.7706  vmlinux         vmlinux         sub_preempt_count
11806     0.6232  vmlinux         vmlinux         add_preempt_count
8444      0.4458  vmlinux         vmlinux         journal_add_journal_head
7830      0.4133  vmlinux         vmlinux         mwait_idle
6452      0.3406  vmlinux         vmlinux         ext3_mark_iloc_dirty
6231      0.3289  vmlinux         vmlinux         do_get_write_access
5925      0.3128  vmlinux         vmlinux         read_tsc
5436      0.2870  vmlinux         vmlinux         journal_put_journal_head
5317      0.2807  vmlinux         vmlinux         vsnprintf
5293      0.2794  vmlinux         vmlinux         __link_path_walk
5121      0.2703  vmlinux         vmlinux         __find_get_block
4955      0.2616  vmlinux         vmlinux         kmem_cache_alloc
4771      0.2519  vmlinux         vmlinux         journal_dirty_metadata
4698      0.2480  vmlinux         vmlinux         __wake_up_bit
4384      0.2314  vmlinux         vmlinux         __d_lookup
4144      0.2188  vmlinux         vmlinux         debug_smp_processor_id
4112      0.2171  vmlinux         vmlinux         copy_user_generic_string
3656      0.1930  vmlinux         vmlinux         journal_cancel_revoke
3274      0.1728  vmlinux         vmlinux         page_fault
3269      0.1726  vmlinux         vmlinux         get_page_from_freelist
3050      0.1610  vmlinux         vmlinux         __ext3_get_inode_loc

Is anybody out there, who can make sense out of this data. What again catches the eye is a roughly 50% difference for mwait_idle and that the good run has a good bit of samples for transfer_none.

I guess, I’ll wait for 2.6.25 until I finally decide to switch back to 32bit.