Disclaimer: The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Oracle Solaris 11 Tech Days 2012
Monday, January 23. 2012
In February a series of events on Solaris 11 will take place in Germany and Switzerland. The events promise to be technically very interesting, as the speakers are all deeply versed in the subject. Since the container guide I probably don't have to say anything more about Detlef Drewanz, who takes part in all events, nor about Uli Gräf (who speaks at some, but not all locations). Christian Ritzka and Elke Freymann are proven experts on OpsCenter. And yes .. I'm giving a talk as well, about data management in Solaris 11. And as I've already seen the question twice: The event is free of charge.
You will find the exact agenda with the speakers at the individual locations and a way to register on the event pages:

Simulating the cloud - a practical example.
Wednesday, December 28. 2011
Work in Progress - this entry will change often in the next days and weeks
A few days^H^H^H^Hweeks ago, I wrote about simulating the cloud that is most often tagged with the name "network" or "intranet" and sometimes "internet". This would not be c0t0d0s0.org without an article explaining how you can configure this. This article explains how you simulate a complete network on a single host, with routers, switches, dynamic routing protocols and so on.

Scope

First I want to set the expectations right. I don't want to simulate a cloud in the sense of cloud computing here. I'm thinking about something more complex: I'm talking about the simulation of that cloud which often hides a lot of complexities and traps in architectural diagrams.

A word of caution first

This article uses an invisible feature. You don't see that it's there, because it isn't in the man page and it isn't in the help output of the dladm command. But it's there: the dladm create-simnet and modify-simnet subcommands. As it's undocumented, I assume it can disappear without any notice, because officially it's not there. Don't complain here when it disappears, and don't complain at Oracle. Of course there is no support. You know the game. Consider it an artifact, a diagnostic socket labeled "Only for factory use". Consider it the test wiring that exists in every technical product and is only used for testing before the product leaves the factory. Never ever use it in production.

Why am I writing about this "feature" here? Because it's useful, and because there is a multitude of hints that this function exists. All of them are public. The zonestat documentation at docs.oracle.com mentions a "simnet" type, and from there you are just a Google search away from PSARC case 2009/200. The source code at src.opensolaris.org shows it as well. From there it's just curiosity to find out everything else that is used in this text.

About this article

I first stumbled over this command when I searched for something in the dladm source at src.opensolaris.org.
A month ago my former colleague Brian Utterback reminded me of this and I thought "let's check if this is still working". And to my astonishment it still worked. Writing this article takes virtually forever. Because of my broken ankle I took painkillers, and they made me somewhat drowsy. This drowsiness slowed down everything. Thus I decided to write this article under your observation to finally get it out of the door. Thus it's work in progress.

simnet

I just wrote about simnet. What are simnets? For in-depth information I just want to point you to the PSARC case. It's available in the caselog on opensolaris.org. But in short: Simnets are simulated networks. It's a mechanism to test networking protocols. And in this example we will use it exactly for this purpose: testing networking.

Okay, let's assume you are the admin of FUBAR Inc. You want to recreate your network in a box. You have offices in Hamburg, London, Singapore, New York and San Francisco. In each office you have a multi-legged router connected to a switch for the internal network with servers and clients; the other interfaces of the router connect to the other routers. As an image says more than 1000 words, I will just summarize the network with this figure.

Configuring it

Of course the routers and the servers will be zones. However we have to recreate the network topology as well. And that's the point where we use the simnet non-feature. We need the switches in our offices first. Those are really easy to configure.

Now I need some switch ports. At first I create some switch ports in order to connect the switch to the router.

Now I create some additional switch ports to connect servers.

Ports meant for the bridge are nice, however they should be connected to the bridge.

Let's now create all the interfaces we need for the routers.

And of course we need interfaces for all the servers.

Now we have to create logical cables … lots of them. At first the routers with their switches.
Uff … on the networking side this is all. The active configuration should look something like this ...

Zone Creation

Okay, now we have to create the zones. We create a lot of control files first. With these control files we will feed zonecfg later on. I created the /opt/cloudsimulation/zones directory to hold them. Of course it's useful to have a dedicated ZFS filesystem, so that the zone creation process can simply copy the data needed by a zone by creating a clone of a filesystem.
For those who are wondering about the sfo and sin IATA shorthands that I've used instead of the long names as for the other "cities": Quagga doesn't seem to like interface names longer than 16 characters.

Okay. Now we have to create all the zones. That's easy. As I said, I will just feed the control files into zonecfg with the -f option.

Okay, at first we install the template zone. We do a full install here, and that's pretty much its only purpose … to have one installed baseline zone as the starting point for all other zones. This may take a while. Depending on your system you may opt for a coffee or two. We never boot this one, it's just there to ease the next steps.

Okay, now we prepare the real zones. You don't have to do the next steps, however they relieve you from logging into each zone and going through the same dialog windows. We will use a simple trick to circumvent the need to go through the sysconfig dialog in each router: You can create an XML file containing the necessary data and pass it to the cloning of the zone. Important: I want to make the resulting XML file as generic as possible, thus I won't configure networking via this process, albeit this is possible. As it's a CUI, I will guide you through this dialog with some pictures.

After leaving the last screen, you should get a file with content similar to this:

Before you ask, the password for radmin and root is n0mn0mn0m. And jamphfhn just stands for "just a meaningless placeholder for hostname".

Okay, I will create another template zone. This is because a routing zone will have some special properties that a zone acting as a server doesn't need, and I don't want such properties in the server zones. At first I just take the template.xml file and substitute the hostname. I could simply do it via vi, but for a tutorial a simple shell line is more efficient.

I use the newly created file as an input for the zone clone command.
As the system just creates a ZFS clone, the command should return after a short period of time. Now we can log into the console of the zone with zlogin.

I wrote earlier that the template for the router contains some additional stuff. At first I need a telnet client. It will become obvious later on why I need it:

Okay, now let's install Quagga. Quagga is a suite of daemons that implement dynamic routing protocols:

Okay, now we have to configure some basics that are the same for all the routers in the network. At first we activate forwarding. With this activation, you enable the operating system to accept packets on one interface and forward them out on another.

ipv4-routing tells the system to start up routing protocol daemons. When you have a default router configured it's disabled; when there isn't one, this setting is enabled by default.

Okay, now we have to do some Quagga configuration. I want to use Quagga with OSPF, so there are two important services for me: zebra and ospf. Zebra is the layer that the Quagga suite uses to interact with the system. Why is it called zebra? I assume it's history: the old GNU routing protocol daemon suite was called Zebra, and Quagga is the follow-on project, as Zebra is now a defunct software development project.

What do we configure here? Both daemons offer a command line for interaction with the daemon. We configure both to react only to connections from 127.0.0.1 (aka localhost). The zebra daemon has its console on port 2602, the ospf daemon listens on port 2601. And these two ports are the reason we need telnet on our routers: You access the consoles via telnet.

With this command we tell Solaris to use OSPF as the routing protocol for IPv4 purposes.

Now we have to activate the new setting.

You should now get some weird SMF error messages that some services couldn't start up. That's normal, because there are no configuration files available for the Quagga suite yet. Don't think about it, just shut the zone down now.
Okay, now we have derived our template for the router zones from the generic template for zones. We use this template for installing all the router zones now.

Okay, I just wrote about Quagga config files. I want to prepare them now, so that I can simply copy them into the zones before starting them up and thus avoid the error messages. We need a lot of them.
Put something like this into the file /opt/cloudsimulation/zones/londonsrv1

Switches for Hamburg-MAN

Simulating that cloud
Saturday, December 10. 2011
In the past I wrote quite often about a thing that I call systemic features: features that start to fit together seamlessly and create possibilities greater than the sum of the features. One of these systemic features is the simulation of the cloud. I'm not talking about the thing that most people associate with the word cloud (the grid with a credit card checkout).
It's not new: I talked about this in mid-November at the DOAG conference in Nuremberg. And I've been playing around with this at customers and privately for a while now. Many customers have networks as large and as complex as the internet part of a smaller country perhaps 15 years ago. The interesting question is: How can you test your application for its resilience against failures in this cloud-shaped icon? How does your application react when your network is doing its high availability magic? And interestingly Solaris 11 can help you here. The thoughts behind this are pretty simple.
When I combine all these features, I can set up a vast array of zones doing nothing else than taking each incoming packet on an interface, routing it in a multitude of ways between each other, and sending it out on an outgoing interface. Even when the systems in your environment are placed in many separate networks of your network, you can still use a system with many networking cards or something called server-on-a-stick (a single high-bandwidth connection to a VLAN-trunking capable switch, using the switch ports as a fan-out). So in order to emulate a complex corporate network, all I have to do is configure a lot of etherstubs, configure many vnics, replicate the physical bandwidths with the maxbw setting on the vnics, set up a lot of zones, perhaps translate the ACLs of the routers into firewall rules for the firewall functionality of Solaris, install the routing daemons and configure them similar to the configuration of the routers (with regard to timeouts and so on).

Now I can test how my applications react when the network starts to converge to a new topology because of the failure of some lines. I can test to which topology my network will converge after a line outage (which is nothing more than a deny-all firewall rule). I can test the impact when the network converges in such a way that my traffic flows over a 2 MBit/s line instead of a 155 MBit/s line. For even more complex failure modes I can even use the htbx driver to introduce additional latencies, packet drops or packet reordering as shown in this article.

In essence you can emulate your complete internal network in a single box, and with Zones and Crossbow in Solaris 11 the overhead is so low (in the end it is still just one kernel) that you can really emulate the reality and not a simplified view, as you don't have to emulate via separate hardware or many independent operating system instances in virtual machines.
In the end you could simply use a single Solaris system, put it between all your test systems and use this Solaris system as an emulation device for your corporate network. It's simulating the cloud-shaped icon in your architectural diagrams.
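To make the convergence question a bit more tangible, here is a toy model in Python. It is only a sketch and not what the routing daemons do internally: the three routers r1 to r3 and their links are hypothetical, and the cost formula (an assumed reference bandwidth divided by the link bandwidth) merely borrows the usual OSPF convention. It computes the cheapest path, then takes one line away and computes it again.

```python
import heapq

REFERENCE_BW_MBIT = 1000  # assumed reference bandwidth for the cost formula


def cost(bandwidth_mbit):
    """OSPF-style cost: faster links are cheaper (minimum cost 1)."""
    return max(1, REFERENCE_BW_MBIT // bandwidth_mbit)


def shortest_path(links, src, dst):
    """Dijkstra over an undirected graph given as {(a, b): bandwidth_mbit}."""
    graph = {}
    for (a, b), bw in links.items():
        graph.setdefault(a, []).append((b, cost(bw)))
        graph.setdefault(b, []).append((a, cost(bw)))
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == dst:
            return total, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in seen:
                heapq.heappush(queue, (total + link_cost, neighbour, path + [neighbour]))
    return None  # no route left


# Hypothetical topology: a fast 155 Mbit/s path and a slow 2 Mbit/s backup line.
links = {
    ("r1", "r2"): 155,
    ("r2", "r3"): 155,
    ("r1", "r3"): 2,
}

before = shortest_path(links, "r1", "r3")

# A line outage: drop the r2-r3 link and let the "network" converge again.
failed = {pair: bw for pair, bw in links.items() if pair != ("r2", "r3")}
after = shortest_path(failed, "r1", "r3")

print("before outage:", before)
print("after outage: ", after)
```

In this made-up topology the traffic moves from the two fast hops to the direct 2 Mbit/s line after the outage, which is exactly the kind of situation you want to test your application against.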
Posted by Joerg Moellenkamp in English, Solaris, Sun/Oracle at 09:45
10 years of ZFS
Wednesday, November 2. 2011
ZFS celebrated its 10th birthday on October 31st. So whatever you plan as a filesystem to kill ZFS ... it may take a while.

It's facepalm time ...
Thursday, October 27. 2011
Surely you've noticed that my blog was down for a few days, and with it all services on the system. The problem that led to this situation was a really dumb one. Perhaps this article is more a story about not thinking about a failure mode just because it's not a problem under your preferred operating system (or to be exact: it was a problem before Solaris 10, but afterwards it was solved). And most of all it's a story about being totally problem-blind in the first moment.
Perhaps I should explain first that c0t0d0s0.org isn't run with Solaris; it uses this-other-unixoid-operating-system in a well-known non-commercial variant. That's the dirty secret of c0t0d0s0.org. There is no technical reason for it, but web serving and mail can be done by any operating system, and thus I used the operating system with ubiquitous availability at almost all providers of dedicated servers. I'm able to migrate the server from one dedicated server provider to another within 2 hours including moving the data, and I did this three times in the past (from 1&1 to Hetzner, and two times within Hetzner). This has saved quite significant money until now, and that's the basic reason why I don't want donations (and if you did donate, I would pass this money on to kiva.org).

Hetzner has reasonably priced dedicated servers and I had no problems in the past, however they have one important shortfall: no serial console in the standard product. When you need a console, you have to make a support call and they connect one. As you need it really seldom, that's okay. As I found out later: With this serial console I would have recognized the problem within a minute and fixed it in a second. However: The console was exactly the thing that I didn't have at my disposal at this moment. So it was a lot harder to find out what had happened.

However I wanted my server back as soon as possible (for personal reasons I was only able to start the recovery in the evening, and as I have a job to do I could only do the further stuff in the evenings as well), and thus I just reimaged the server after keeping a copy of the logfiles. I have a quite extensive backup regimen with very regular rsyncs and database replication on my server at home, thus I knew I would perhaps just lose an hour or some minutes of data, and that was okay for me. Additionally I was able to mount the disks of the non-working installation and to copy the delta of mails between the last backup and the last mail in the queue to my backup.
What had happened: At 10:something my server provider had a large power outage. The UPS didn't take over as planned and thus a lot of servers rebooted. One of them was mine. Damned … but that's the basic reason why I'm a fan of proper enterprise architecture and not of some singular availability features, no matter what marketing tells you. Real availability is hard work and often expensive. But: When you really bet your business on IT, you need an architecture that is even capable of covering a UPS that proves to be not so uninterruptible. The availability feature UPS may fail (and did fail in my case), but a proper enterprise architecture keeps your service up and running. Even more important: With a proper enterprise architecture you don't need the feature UPS for availability reasons at all, because your service can survive the outage of some parts. Perhaps you want the UPS for other reasons like "don't want the hassle of bringing up all the systems again". But with such an architecture you don't need it to keep your business running. With a properly planned enterprise architecture with servers at two separate sites on different power grids you may forget about the UPS, because a UPS won't help you with a prolonged power outage anyway, for example because of a region-wide blackout. An outage that may take out the connectivity as well, as it's not that improbable that your local carrier has the same power problem.

Okay: After a while my system worked again and thus I had time to find out what had happened. I knew that the system was still responding to pings, thus I knew the kernel of this-other-unixoid-operating-system in a well-known non-commercial variant was working. Looking into the logfiles I saw a complete bootup of the kernel, and some of the daemons were starting up .... like acpid for example. However I couldn't log in via SSH. There were no signs of an ssh daemon startup in the logfiles. The apache was in a half-reacting state.
Port 80 was open but it didn't react to HTTP commands. From this I concluded: The kernel and the boot configuration are okay. The bring-up of the services is at least partially working, because otherwise it wouldn't start services at all. And as it reacted on the network, there must have been at least a working boot of the services mandated by rcS.d, as otherwise there would be no networking. The problem must be in apache, which is frozen halfway. And for some other reason ssh isn't started at all. There must have been a major fsckup in the startup of the services.

As I had no console, as explained before, I needed to conclude from the leftovers what had happened. And now I was a little bit puzzled. 5-6 years ago I would have recognized this problem in an instant (because sometimes I've produced ... well ... suboptimal startup scripts) … but today it took a while, because I hadn't fallen prey to such a problem for 5-6 years. It took me a lot more thought to work out what might have happened. When you do one operating system for a living and one as a hobby, you tend to project your mindset of one onto the other, and you don't do justice to this other OS.

As you all may know, Solaris ditched init.d with Solaris 10 in order to introduce SMF (not to forget the equally important features like the contract filesystem and the Fault Management Architecture). One of the nice properties of SMF is that services that aren't interdependent will be started in parallel without waiting for one another. This has two advantages: First, the system can start up much faster; second, a service that is not able to start up can't block the startup of the rest (short of services needed by all others). The init.d concept is different. All services are started in sequence. The sequence is numerical and then alphabetical. That is of course slower, but more important … depending on the way you write your script, a script or binary hanging or waiting for user interaction can block the startup.
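The difference between the two startup models can be sketched in a few lines of Python. This is a toy model only, neither init nor SMF: the rc script names and the dependency graph are hypothetical, and a "hanging" service is simply one that never reports back.

```python
def start_sequential(scripts, hanging):
    """init.d style: run scripts in sorted order; a hanging one blocks the rest."""
    started = []
    for name in sorted(scripts):
        if name in hanging:
            break  # this script never returns, so nothing after it is run
        started.append(name)
    return started


def start_with_dependencies(services, hanging):
    """SMF-like idea: a service starts once everything it depends on is up."""
    started = set()
    progress = True
    while progress:
        progress = False
        for name, deps in services.items():
            if name in started or name in hanging:
                continue
            if all(dep in started for dep in deps):
                started.add(name)
                progress = True
    return sorted(started)


# Hypothetical rc script names with the same sequence number, so the
# alphabetical part of the ordering decides.
scripts = ["S20ssh", "S20apache2", "S20acpid"]
print(start_sequential(scripts, hanging={"S20apache2"}))

# Hypothetical dependency graph: neither apache2 nor ssh depends on the other.
services = {"acpid": [], "network": [], "apache2": ["network"], "ssh": ["network"]}
print(start_with_dependencies(services, hanging={"apache2"}))
```

In the sequential model everything sorted after the hanging script stays down, while in the dependency model only the hanging service itself (and whatever depends on it) is missing.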
The variant of this-other-unixoid-operating-system is using init.d. And it's quite easy to block a service, for example by integrating a new SSL key and certificate. My key had a password, and apache was asking for this password in order to start up. Exactly this happened. Acpi started up because it was started before Apache (guess what: ACpi is before APache in alphabetical order, and way before Ssh). This is the basic reason why you strip the password from your key. Guess what I did last week: I put a new key and certificate on my server and I forgot to strip the password from it. And in my version of this-other-unixoid-operating-system the ssh daemon is started after the apache daemon. When Apache waits for something, you won't get SSH. Damned … it's facepalm time.

Basically I fell prey to a beginner's error because I'm working with an operating system that reacts totally differently in such situations. On Solaris such a situation just doesn't matter at all … you get at least your ssh login, and the system non-availability is just a service non-availability you can fix within a second. However, given the init.d system of this-other-unixoid-operating-system the outcome was somewhat more problematic. However: What a dumb error on my side ... However: The reinstallation wasn't that bad ... the system could use a reinstallation because of some tests and experiments. So it was worth the work in the evenings. And on the other side: Who had the glorious idea to start apache before ssh?

That said, newer variants of this-other-unixoid-operating-system have a different mechanism to start up all the services. My heating control is running on a BeagleBoard-xM at the moment, using a really current variant of this-other-unixoid-operating-system released just a few days ago. It uses this new startup mechanism. And it's just my unimportant personal preference, which doesn't matter: But I don't like it. And I have a lot of reasons for it.
It looks by far too much designed for desktop needs. However my dislike would require an article I won't write in this blog nowadays. But as I wrote: That's my personal opinion and it doesn't matter. However it's really important that this-other-unixoid-operating-system gets away from the old init.d mechanism to something more current. I think in 2011 every operating system deserves something more functional, something better than init.d … init.d is simple and well understood, however it creates classes of problems that are unnecessary today. Especially in order to keep die-hard Solaris admins from falling prey to such a beginner's error, because such problems are part of their distant past. And now I will start to cut holes for my eyes in the brown paper bag for my head.

Nice example for the power of boot environments
Tuesday, October 4. 2011
There is a nice example of the power of boot environments. Boot environments are something like snapshots of your operating system installation made writeable. As you may already assume, they are based on ZFS snapshots and the clone functionality. This is possible due to the usage of ZFS as the root filesystem.
So: Please don't try this at home. If you try it, don't try it on any Solaris 11 Express installation of any value. But don't try it. I don't want to hear any story that you've deleted your ERP system by accident because you used the wrong terminal window. Leave that to trained professional stunt admins with the right equipment (Solaris 11 Express).

Assume you have a system configured with all your applications, and everything is running fine. So you think it would be nice to have something like a frozen state of this situation. No problem. This command will do the trick.

When you reboot your system you will see it as a new entry in the grub menu. Okay, but boot into the old environment starting with "Oracle Solaris ..." first by selecting it in the grub menu (it should be already selected, unless you have used beadm activate already).

Now I will drop the atomic bomb on your installation.

Essentially we've just nuked the installation. After a moment the system should just freeze. Reset the system and boot again via grub into the boot environment starting with "Oracle Solaris ...":

Okay ... on a normal system this would send you to the tapes. With Solaris 11: Reset the system. Boot into the boot environment "rescuenet" by selecting it in grub. Tada! Just creating a boot environment with a single command after a config change may save your butt later .... and btw ... this even works in zones ... they know the concept of boot environments, too.
Posted by Joerg Moellenkamp in English, Solaris, Sun/Oracle at 20:06
How to activate IPoIB Connected mode in Solaris 10 Update 9
Monday, October 3. 2011
Just a short hint: The What's New document of Solaris 10 Update 9 states that support for IPoIB Connected Mode has been added in the release. However you have to search a bit for information on how to activate it. The necessary step is documented in the manpage for the ibd driver. Let's assume you have two instances of the ibd driver running (ibd0 and ibd1). In this case you have to change one line at the end of the /kernel/drv/ibd.conf file to enable_rc=1,1; and reload the ibd driver or reboot the system. After that your ibd devices should show an MTU size of 65520 bytes instead of 2044.

PS: The process for Solaris 11 is better, as you just use dladm for it. However connected mode is the default there anyway. In Solaris 10 unreliable datagram was kept as the default, as one of the rules in Solaris is that you have to opt in to such changes between updates.

Hunting red herrings
Monday, August 15. 2011
Sometimes you “know” the problem from the first moment. But sometimes your feeling in the gut results in something that is perceived as a large change, so you have to find the smoking gun, the undeniable proof for your hypothesis.
This is the story of such a search. It started with a telephone call from a colleague. He got my name from another colleague. An Oracle database was running on a Solaris system, with the datafiles and logs located on a Veritas File System. The customer saw massive delays (in the range of hundreds of seconds) when executing certain commands. One of the commands was "truncate table".

A hypothesis - but the proof?

And in the beginning it started with a red herring.

In this case the thread is trying to execute something on a semaphore, but it wasn't able to do so. However the semtimedop is timebombed. When the timeout is reached without being able to execute on the semaphore, it terminates with error 11. All the timeouts were consistent with the waiting time seen from the perspective of the SQL commands. Obviously the customer and other involved parties were tempted to see this as the problem, but already thought that this might be just the harbinger of bad news. And after a short look into the truss files, I was pretty sure that they were right with their doubts. It was just the harbinger of bad news.

After a short amount of research I suspected that we were talking about a locking problem here. There was just a problem: vxfs. For one, I have seldom worked with it, thus it's not really my center of expertise.

One point that diverted the attention of the customer from the locking stuff is a small but important difference: The customer knew that Oracle likes Direct I/O. With UFS, "Direct I/O" does a little bit more than just making the I/O direct by disabling buffering. It also removes the inode r/w lock mandated by POSIX rules. The customer knew that about UFS Direct I/O and thus activated Direct I/O on vxfs.
And thus I found lines like

/oracle/importantdatabase/oradata1 on /dev/vx/dsk/importantdatabase/oradata1 read/write/setuid/devices/mincache=direct/convosync=direct/delaylog/largefiles/ioerror=mwdisable/mntlock=VCS/dev=51836b0 on Thu Mar 17 20:14:11 2011

However I still suspected a lock contention problem, and had a reason for it: Direct I/O isn't the same in vxfs as it is in UFS. In vxfs Direct I/O is really just the direct part. It doesn't enable concurrent I/O (I'll explain that moniker later) to a file. The removal of the inode r/w lock isn't part of the feature. You have to use either Quick I/O (QIO) or the ODM module for vxfs. As neither feature was activated, that was the moment where I told the customer "Hey, choose the ODM module for vxfs or QIO, activate it and the problem should go away". Both remove that lock contention and thus are of big help in getting better Oracle performance when using vxfs. Just to remove a misunderstanding: ODM (Oracle Disk Manager) is an API of Oracle, not of Veritas. Oracle's dNFS (Direct NFS) is implemented via an ODM module as well.

The problem: You used to pay for both on vxfs, neither of them is really cheap, and before making the change the customer wanted to know that I was right with my diagnosis (according to the release notes, ODM and QIO are now part of the SF except in Basic).

I wrote of two problems, but have only written about one so far. Normally, inode rwlock contention problems are quite easy to find. But not in this case. vxfs is different from UFS in a multitude of ways. It doesn't use the locking primitives of Solaris but has its own instead. And thus all values reported by my preferred diagnosis tools were pretty useless. Damned … how should you find problems when your instruments can't show the problems? Without instrumentation, troubleshooting is just guesswork and experience.
At this point a question on a mail alias (it's great to have people on internal aliases who have forgotten more about Solaris than I know) and some research via Google yielded the same result within a few minutes: vxfs`vx_rwsleep_rec_lock is the function waiting on/implementing the POSIX inode rw lock. Now I was back in the game and I was able to use all the nice things of the operating system I prefer.

Digging in the dirt

I asked the customer to put a dtrace script into a script that is executed at the moment of the wait:

The result was interesting, as it clearly showed a peak of 307 events in the range from 34359738368 nanoseconds (34.36 seconds) to 68719476735 nanoseconds (68.72 seconds). This was especially interesting as the same dtrace script didn't show such a peak during times when the system ran flawlessly.

Okay ... well ... next step ... which parts of the system were executing this vxfs`vx_rwsleep_rec_lock function? I could have used dtrace for this task as well, but I wanted some additional insight in one step. Thus I used a nice little command of the modular debugger in Solaris:

# echo "::threadlist -v" | mdb -k

The output is quite long on a loaded Solaris system. It prints something like this for each thread:

I hate multi-line outputs when searching for patterns. There is nothing better than two monitors, a terminal stretched across both and the two gooey grep implementations on the front side of your skull. But this works best if one event is in just one line. So I did some grep/sed-fu on it.

Each thread is now in a single line. Yeah … perhaps there is a more elegant way to do this, but that was the first that came to my mind.

Just a quick check: At the moment of the hang, 1008 processes were in vxfs`vx_rwsleep_rec_lock. That was interesting. Even more interesting was the list of commands that had threads in the mentioned function.
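A short aside on the odd-looking range of the histogram quoted above: 34359738368 is exactly 2^35 and 68719476735 is 2^36 - 1. That is what a power-of-two histogram produces, for example the quantize() aggregation of DTrace; as the script itself isn't shown here, that it used quantize() is my assumption. The bucketing can be sketched in Python like this (the 40 second wait is a made-up sample value):

```python
def quantize_bucket(value_ns):
    """Return the (low, high) bounds of the power-of-two bucket holding value_ns."""
    if value_ns < 1:
        raise ValueError("this sketch only handles positive values")
    # The highest set bit of the value determines the lower bucket bound.
    low = 1 << (value_ns.bit_length() - 1)
    return low, 2 * low - 1


low, high = quantize_bucket(40_000_000_000)  # a hypothetical 40 second wait
print(low, high)
print(round(low / 1e9, 2), round(high / 1e9, 2))
```

So every wait between roughly 34 and 69 seconds ends up in the same bucket, which is coarse, but more than enough to see a peak that isn't there when the system runs flawlessly.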
It's column 10 in the threadlist in its concatenated form.

When you dig further down into the large heap of data: Of all the threads belonging to ora_dbwriter3_importantdatabase, just seven weren't in the vx_rwsleep_rec_lock function.

At that moment I thought: That isn't a smoking gun, that's a smoking howitzer.

An attempt to explain

Most threads executing this function are part of the database writers. When you think about it, that's not so astonishing, especially when you think about the nature of an rwlock. First: There is a rwlock for each inode in a filesystem. Its function: Multiple readers can get the lock and so they can read concurrently from the file, but just one writer is able to hold it and thus to write into the file. Equally important: You can't write to the file as long as one or more readers are in the codepath protected by the rwlock for this file, and no one can read from the file as long as there is a writer in the protected codepath.

In really basic rwlock implementations this can lead to writer starvation, as it's hard for the writer to get the lock: All readers have to relinquish the rwlock, and no new readers should start before the writer can get the lock. For this reason, the Solaris threads implementation tends to favour writers over readers. However when you have many writers, it may take a long time before the backlog of writes has been executed. Blindly preferring writers is not a solution either, because then readers would starve, which is even more problematic because reads are always synchronous by nature. As I wrote elsewhere: While a system can choose the time of a physical write to some extent, it can't choose the time of a read. A function won't execute as long as the data isn't available. But that's out of the scope of this article. For the capability to write and read in parallel to a file the name concurrent I/O was coined.

I just wrote that it can take a moment before the backlog of writes has been executed.
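The rwlock semantics described above can be sketched in Python. To be clear: this is a minimal, writer-preferring toy lock built on threading.Condition, not the Solaris or vxfs implementation, and the eight "database writer" threads are just a made-up workload to show that writers are serialized.

```python
import threading


class WriterPreferringRWLock:
    """Many readers or one writer; waiting writers hold back new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0          # readers currently inside
        self._writer = False       # True while a writer is inside
        self._waiting_writers = 0  # writers queued for the lock

    def acquire_read(self):
        with self._cond:
            # New readers wait while a writer is active or queued.
            while self._writer or self._waiting_writers:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            # A writer needs the lock exclusively: no readers, no other writer.
            while self._writer or self._readers:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


lock = WriterPreferringRWLock()
stats = {"inside": 0, "max_inside": 0}


def db_writer():
    lock.acquire_write()
    try:
        # Only one writer can be in here at a time, however many are queued.
        stats["inside"] += 1
        stats["max_inside"] = max(stats["max_inside"], stats["inside"])
        stats["inside"] -= 1
    finally:
        lock.release_write()


threads = [threading.Thread(target=db_writer) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print("writers inside the protected code at the same time:", stats["max_inside"])
```

However many writer threads you start, only one of them is ever inside the protected section; all the others wait, just like the database writer threads in the thread list.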
In this case it was even worse: the inode r/w lock adds insult to injury, because it basically limits you to a single write I/O operation at a time per file, no matter how many HBAs or how many disks you have in your system. And now you've made a while out of a moment. This holds even when the changes in the file are totally unrelated, e.g. changing a block belonging to the user table stored in it and another block in the article database, or reading a block containing the customer database into the SGA while writing the new salary for the promoted assistant. You can't do this in parallel due to the inode rw lock. And with many updates in your workload it's not that astonishing that an increasing number of database writer threads start twiddling their thumbs while waiting for their turn to write to the file.

You may ask yourself why the heck there is such a mechanism. The r/w lock is mandatory in order to be POSIX compliant. You need it to ensure write ordering and consistent reads when updates occur in parallel to reads. Obviously you really want such protection when working with files. However, especially with databases, a file is just a container for a large heap of things. Independent things. And then things are different. For this reason there were some developments in the database realm to get rid of the inode rwlock and put this mechanism elsewhere. Oracle allows you to use a raw disk, so it has to do the consistent read and write ordering stuff anyway, and as it’s aware of the inner structure of the heaps of data, it can do it with a much finer granularity than just per inode and thus per file. The inode r/w lock is just a bottleneck without any use in this case. For this reason direct I/O on UFS, for example, offers a mode that removes the lock. It's not that the write ordering or the consistency protections are gone.
They are just in a layer that knows more about the structure inside the file and thus can do a better job of it. vxfs knows similar mechanisms: QIO and ODM don't have such inode-wise locking. They work differently from UFS direct I/O, but as an earlier chancellor of the Federal Republic of Germany said: Outcome matters.

One question was still open. Why was this problem reproducible with a "TRUNCATE TABLE" command? That’s pretty easy, however you have to dig deep into the internals of Oracle. When Oracle executes a TRUNCATE TABLE command, it checkpoints the database. In such a situation it writes all dirty blocks from the SGA into the database datafiles. This must be done for recovery purposes. Such checkpointing may trigger a storm of writes via the database writer, especially when you have an SGA with a lot of dirty blocks. The checkpoint has to complete before the TRUNCATE TABLE executes. And so we end with another red herring: it's not the TRUNCATE TABLE command that was slow ... it's the checkpoint occurring before it. You can check this pretty easily: when a "TRUNCATE TABLE" takes too long for your taste, trigger a checkpoint manually and do the TRUNCATE TABLE directly afterwards. TRUNCATE TABLE still does a checkpoint, but as you've already cleaned the dirty buffers out of the SGA, it doesn't have to do much writing. It should run much faster now.

Conclusion

In the end I had to tell the customer that in essence everything works as designed. It would be a bug if the system acted just a little bit differently. However, that's seldom the answer a customer wants.

So: the solution for the issue? It's as old as it's easy. Get rid of the inode rwlock. Get concurrent I/O: either by using raw disks, by using ASM, by using UFS with direct I/O, or by using ODM or QIO for vxfs. I can only reiterate something I've already said: when you put your Oracle database files into a filesystem, you want to use direct I/O and concurrent I/O!
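To put rough numbers on the argument above, here is a minimal back-of-the-envelope sketch in Python. It is only a toy model and not a measurement from the customer system: the number of pending writes, the service time per write and the number of disks are assumptions chosen for illustration, and the function names are hypothetical. It compares a file whose writes are serialized by the per-inode rwlock with the same writes spread over several independent disks once the lock is out of the way.

```python
import math


def write_time_with_inode_lock(num_writes: int, service_ms: float) -> float:
    """All writes to one file are serialized: one writer holds the rwlock at a time."""
    return num_writes * service_ms


def write_time_concurrent(num_writes: int, service_ms: float, parallel_paths: int) -> float:
    """Without the inode lock, up to parallel_paths writes are in flight at once."""
    return math.ceil(num_writes / parallel_paths) * service_ms


# Assumed, illustrative values (not taken from the case described above).
pending_writes = 1000   # dirty blocks the database writers want to flush
service_ms = 5.0        # time one write I/O takes
disks = 16              # independent disks behind the file

serialized = write_time_with_inode_lock(pending_writes, service_ms)
concurrent = write_time_concurrent(pending_writes, service_ms, disks)

print(f"with inode rwlock: {serialized:.0f} ms")
print(f"concurrent I/O:    {concurrent:.0f} ms")
```

The model ignores readers, caching and controller effects entirely. It only shows why adding disks doesn't help as long as every write to the file has to wait for its turn at the same lock.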
Migrating your notebook from a smaller to a larger disk
Thursday, July 21. 2011
My colleague Christophe Pauliat - Principal Sales Consultant at Oracle - came up with a really nifty way to migrate his Solaris-based notebook from a smaller disk to a larger one. I will copy his mail verbatim here, because I think it's extremely useful. It somewhat resembles the "workaround" for ZFS resizing, however Christophe takes this significantly further and does it for boot disks.
The autoexpand property does a large part of the trick. The size of a mirrored pool is always the size of the smallest disk. When you have an 80 GB and a 500 GB disk, the size of the pool is 80 GB. Remove the 80 GB disk: the smallest disk is now 500 GB, and the size of the pool is 500 GB as well, as long as you've activated autoexpand.

A little change of queues
Wednesday, July 6. 2011
An overwhelming number of ZFS installations work with just a bunch of disks, perhaps in a JBOD or in the server itself. However, there are installations that use disk arrays with RAID controllers. Some of those installations even use a single LUN. I don’t think that this is a good idea (for example because ZFS without redundancy can only detect corruption, but not repair it), but that’s a different story I don’t want to discuss here.
There is a slight change in the default parameters of ZFS in Update 9. It’s related to the parameter zfs:zfs_vdev_max_pending. This parameter controls how many I/O requests can be pending per vdev. For example, when you have 100 disks visible from your OS with a zfs:zfs_vdev_max_pending of 2, you have 200 requests outstanding at maximum. When you have 100 disks hidden behind your storage controller showing just a single LUN, you will have – you guessed it – 2 pending requests at maximum.

You may think that you could increase the queue depth without end, but as usual this is a game of tradeoffs and not that easy: deeper queues may increase the latency of the commands. Experience showed that a certain queue depth delivered the best performance on most installations. However, the installed landscape changes and sometimes you have to adjust things. Exactly this happened a while ago in OpenSolaris, and it seems that this change has moved into Solaris. The default for zfs:zfs_vdev_max_pending is 10 at the moment. You can check this on your system: the value shows up as 0xa, which is 10 in decimal. And this is a wise choice for most implementations out there.

But it was different on older versions. I checked it on U7, and I asked my Twitter/Facebook contacts to make a quick check on U8, as I was too lazy to install it: the value there is 0x23, which is 35 in decimal, and 35 was the default up to Update 8 of Solaris 10. So essentially the queues are less deep than before. For JBODs this is most often a good thing, as each vdev and thus each LUN has its own queue of 10 pending I/Os. For a single LUN hiding many disks, sometimes not.

So how do you change it back to the old value? You can change it dynamically, and to make the change boot-persistent you have to add a line to /etc/system. Sometimes an even higher value may be indicated when a very large number of disks behind your controller forms a single LUN.

How do you know if this decreased queue depth is a problem for you at all?
The command iostat will help you. If the column actv is at or near the value of zfs:zfs_vdev_max_pending, it’s worth a try. Otherwise not.
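The arithmetic behind this post is easy to replay. The following Python sketch is only an illustration: the helper names are hypothetical, and the 90 percent threshold standing in for "at or near" is my assumption, not a rule from ZFS or iostat. It converts the two hexadecimal defaults, computes the maximum number of outstanding requests for the JBOD and the single-LUN case, and checks an actv reading against the configured limit.

```python
def max_outstanding(vdevs: int, max_pending: int) -> int:
    """Upper bound of pending I/O requests: one queue of max_pending per vdev."""
    return vdevs * max_pending


def queue_looks_saturated(actv: float, max_pending: int, threshold: float = 0.9) -> bool:
    """True if the actv column of iostat is at or near the configured limit.

    The threshold of 0.9 is an assumption made for this sketch.
    """
    return actv >= threshold * max_pending


old_default = int("0x23", 16)   # default up to Update 8
new_default = int("0xa", 16)    # default in Update 9

print("old default:", old_default)
print("new default:", new_default)

# 100 disks visible as 100 vdevs versus 100 disks hidden behind a single LUN,
# both with the example setting of 2 from the text.
print("JBOD, 100 vdevs:", max_outstanding(100, 2))
print("single LUN:     ", max_outstanding(1, 2))

# Hypothetical actv readings against the new default.
print("actv 9.8:", queue_looks_saturated(9.8, new_default))
print("actv 3.0:", queue_looks_saturated(3.0, new_default))
```

With a single LUN the limit applies to the whole array at once, which is why the lower default can hurt there while it is harmless for a JBOD.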
Posted by Joerg Moellenkamp in English, Solaris, Sun/Oracle at 11:50
New Hivemind - Powermanagement
Sunday, July 3. 2011
I was asked in a comment, if Solaris supports power management with the processor in the HP N36L microserver. The answer is yes.
The best way to check this is via kstat. If kstat shows multiple frequencies as supported frequencies, Solaris supports power management for the processor. However, before Solaris really uses it, you have to configure power management. First add or change the relevant lines in /etc/power.conf. Afterwards run the command pmconfig once. Now keep the system idle for 10 seconds and check the frequencies the system runs at.

Result of the "How long do you wait before Solaris 11 gets on your prod systems?"
Thursday, June 23. 2011
Posted by Joerg Moellenkamp in English, Solaris, Sun/Oracle at 16:20
Talk "Deduplication" at Froscon
Monday, June 6. 2011
Yesterday I received word that my talk "Deduplication" has been accepted for Froscon 2011. This year Froscon takes place on August 20/21, 2011 at Hochschule Bonn-Rhein-Sieg. I will give a completely reworked version of my deduplication talk from the GUUG there. In many places it is more general and not as ZFS-centric, in other places however much more ZFS-centric, as I want to explain some concepts by means of the ZFS source code.
Posted by Joerg Moellenkamp in English, Solaris, The IT Business at 09:57
Solaris 10/11x hardware compatibility list
Thursday, June 2. 2011
The Solaris 10/11 Express Hardware compatibility list has a new home: It's now part of the OTN and available at http://www.oracle.com/webfolder/technetwork/hcl/index.html
Posted by Joerg Moellenkamp in English, Solaris, Sun/Oracle at 21:40
SNIA releases testing spec for SSD
Tuesday, May 24. 2011
In customer projects, I'm talking more and more frequently about solid state disks and their performance. From this point of view, the release of a testing specification for SSD performance is very welcome, as today you often end up comparing apples with pears and pears with potatoes. It looks like the SSD market is getting more mature.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License



