Thursday, November 26, 2009

Papers on CUDA

Some new scientific papers from the GPU computing community. By the way, I highly recommend adding the GPGPU feed to your RSS reader if you're interested in this subject. That's where I get most of my news.

PyCuda is an open-source framework written in Python for generating CUDA code. It isolates you from the tedious details of CUDA, and most programmers would find Python much more comfortable to work with. There is a scientific paper on PyCuda: GPU Run-Time Code Generation for High-Performance Computing. There are lots of tutorials for beginners on the net, just ask your search engine. I even found one in Polish.

CheCUDA is a checkpoint/restart tool for CUDA applications. Checkpointing is a well-known method in long-running scientific computations: you periodically write the state to disk, and if your software crashes, the system needs to be restarted or the job has to be migrated to another machine, you don't have to start from the beginning. Until now, there was no checkpointing tool for GPU computing. See this paper by Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi for details.
Blogged with the Flock Browser

Tuesday, November 24, 2009

...flock together

First experimental post from Flock.

Time to get more Web 2.0. I'm not one to fall for the latest buzzword, but this one is five years old, so I might give it a chance.


Friday, November 20, 2009

Data visualization 101

For the last few weeks I've been reading up on data visualization in my spare time. The concept is ubiquitous: it's used in business (reports and presentations), science and technology, newspapers, magazines, even daily life: look at your cellphone, it has battery and signal indicators. I bet you can see at least ten examples of a chart, map or timeline from the place where you sit (and that's without going to another website). The reason is obvious: we humans are much better at processing graphical than numerical data. There's a high chance that someday you'll have to prepare a chart of some sort.

Yet visualization is often misunderstood and applied poorly. Take this spreadsheet chart:

The standard template has too much clutter. Notice the grey background reducing contrast. That makes heavy grid lines necessary, but they're yet another form of visual clutter. It also looks dull. Most people who want to make their graphs more interesting do it by applying 3D effects and gradients, which makes them even less readable.

What's the right solution? If you need precision (e.g. in a scientific paper), follow these simple guidelines:
- Reduce clutter. Remove backgrounds, borders, as much grid as you can - imagine you're trying to save on ink.
- In a bar graph, try both spacing the bars apart and keeping them close together, and see what works best. There's no rule.
- If you only have one series of data, remove the legend. If you have more, consider labeling directly on the graph.
- Colors should mean something. Use them to distinguish one series from another or an outlier from the rest of the samples. Or use color intensity to show value. Don't ever make every sample in a series a different color for no reason.
Surprisingly, it'll look better, and it'll definitely make it easier to compare values.

The example here is only about 10 seconds of clicking away from the default. Although it's far from perfect, it's already much better.

If you value aesthetics more than precision (e.g. for a newspaper or advertising), use anything but the standard template. Consider infographics: want to chart the real estate market? Use a drawing of a house for your graph. Or use maps, possibly with satellite imaging or in 3D isometric view. The further you go from the old, boring chart, the better it'll look. But don't overdo it or the readers won't grasp your idea. There's a similar problem in science - some forms of visualization work great for people with the right background (e.g. in statistics), but others won't understand them at all.



Want to know more?
www.juiceanalytics.com/writing/ - Great practical advice on designing graphs, presentations and dashboards.
www.visualcomplexity.com/vc/ - Lots of stunning examples.
www.edwardtufte.com/tufte/ - A renowned expert on data visualization, inventor of the sparkline, author of many articles and books.
Handbook of Data Visualization - More on the scientific side, but the first chapters offer loads of general knowledge.
Visualizing data - Next one on my reading list (or, more likely, my glancing-through list). Judging by the table of contents, that should have been my starting point.

Monday, November 16, 2009

VirtualBox 3.1.0 adds live migration

The new VirtualBox, now in beta, adds live migration (and calls it teleportation for no apparent reason). This is a standard feature for server hypervisors like Xen and KVM, but something new for desktop products. Correct me if I'm wrong, but I think VirtualBox is the first. This might not be as useful on a workstation as it is on servers, but I can see some uses: collaboration on developing a system or moving the system from development to production.

Another important feature in this release is the redesigned snapshot system. It is now possible to restore to any snapshot instead of only the last one. This, on the other hand, is really useful. I have a Windows guest and instead of reinstalling once every few months, I just revert to the last good state. Saves me hours on installing software, updating and configuring.

There's also improved hardware support, including 2D acceleration for Windows guests. Note it's a beta release with some known problems (ironically, in the area of snapshots). I've just upgraded my setup (Linux 64-bit hosts without VT, Windows XP guest) and it works fine, but your mileage may vary.

Tuesday, November 10, 2009

Eucalyptus 1.6.1 available

The new release of Eucalyptus has some interesting features. At last it's possible to manage a cloud consisting of multiple clusters. Support for dynamic DNS and the monitoring tools Nagios and Ganglia was added. It also fixes numerous bugs and promises better stability. See the changelog for more details. I haven't had time to upgrade my setup yet (I'd have to upgrade Debian first), but it sure looks interesting.

Thursday, November 5, 2009

XenServer opensourced

I've already written about Citrix opensourcing XenServer. The source code for XAPI is now released. Xen developers are working on providing better interoperability between both versions of Xen. Join if you can! See Developer's Guide and XAPI mailing list for details.

I look forward to the day when I can ditch XenCenter and use libvirt-based software to manage all virtualization platforms.

Sunday, November 1, 2009

Ant for system administration

Ant is a popular build tool, especially for Java development. What most people don't realize is that it can be used to automate other tasks. Lately I started using it for system administration purposes. Don't get me wrong, I'm not saying it should replace good old-fashioned scripts. On the contrary, I estimate that I use simple bash scripts for 90% of the tasks, real languages like Python or PHP for 9%, and the remaining 1% is everything else, including Ant. And even these jobs I could do with another tool. But if a 10-line buildfile does the job of a 100-line Python script, I'll go with the simpler solution.


What sysadmin tasks can you do with Ant?

Ant has support for:
- File operations, like copy, delete, move, mkdir, chown, chmod. Useful as a part of a more complicated task. If you only need file operations, a shell script is way easier to write.
- Archives: the usual formats like tar, gzip, bzip2, zip and a few others, including rpm.
- Network-related: sending email, remote execution with SSH and telnet, FTP and SCP client, HTTP client (with ant-contrib)
- Revision control systems: CVS built-in, many (if not all) others with a plugin
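
To give an idea of how these tasks combine, here is a minimal sketch that copies a configuration directory, tightens the permissions on the copy and packs it into a gzipped tarball. The project name, the target name and both paths are made up for illustration, so adjust them to your own layout.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project name="backupexample" default="backup">

  <!-- Hypothetical paths, not taken from any real setup. -->
  <property name="src.dir" value="/etc/myapp" />
  <property name="work.dir" value="/tmp/myapp-backup" />

  <target name="backup">
    <!-- Start from a clean working directory. -->
    <delete dir="${work.dir}" />
    <mkdir dir="${work.dir}/conf" />

    <!-- Copy the files, skipping editor backups, CVS/SVN directories and the like. -->
    <copy todir="${work.dir}/conf">
      <fileset dir="${src.dir}" defaultexcludes="yes" />
    </copy>

    <!-- Remove group and other permissions from the copied files (Unix-like systems only). -->
    <chmod dir="${work.dir}/conf" perm="go-rwx" includes="**/*" />

    <!-- Pack the copy into a gzip-compressed tar archive next to it. -->
    <tar destfile="${work.dir}/conf.tar.gz"
         basedir="${work.dir}/conf"
         compression="gzip" />
  </target>

</project>
```

Save it as build.xml and run ant in the same directory. For something this small a shell script would still be shorter; the buildfile starts to pay off once you add remote steps.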


When do I use Ant?

I found two tasks at which Ant excels. Both are often needed in modern environments - with virtualization and cloud platforms in place, one administrator often controls hundreds of systems.

Uploading files to remote systems

Ant supports FTP and SCP. Nothing special, except it's easy to send only modified files. Obviously, with an ordinary script you can read the timestamps, store them somewhere and compare them yourself, but it's more tedious. Here's an FTP example:

<?xml version="1.0" encoding="UTF-8"?>
<project name="whatever" default="ftpexample">

  <target name="ftpexample">
    <!-- Upload the contents of /source/directory, sending only new or changed files. -->
    <ftp server="www.example.com"
         userid="admin"
         password="YourVerySecretPassword"
         passive="yes"
         depends="yes"
         verbose="yes"
         remotedir="/wherever/I/want"
         binary="yes">
      <fileset dir="/source/directory" defaultexcludes="yes" />
    </ftp>
  </target>

</project>



Let's go through the example, skipping obvious lines.
- depends="yes" - only send modified files. Yes, it's that simple.
- verbose="yes" - print the names of the transferred files.
- defaultexcludes="yes" - exclude backups (filenames ending with ~), CVS and SVN directories and the like.
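
For SCP, a rough equivalent might look like the sketch below. I'm not aware of a depends attribute on the scp task, so this version assumes you can get a similar effect with the modified selector, which keeps a cache of file checksums (by default in cache.properties in the project's base directory) and selects only files whose content has changed since the last run. The host, user, key path and directories are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project name="scpexample" default="upload">

  <target name="upload">
    <!-- Placeholder host, user and paths; trust="yes" accepts unknown host keys. -->
    <scp todir="exampleuser@www.example.com:/wherever/I/want"
         keyfile="${user.home}/.ssh/id_rsa"
         trust="yes"
         verbose="yes">
      <fileset dir="/source/directory" defaultexcludes="yes">
        <!-- Select only files whose content changed since the previous run. -->
        <modified />
      </fileset>
    </scp>
  </target>

</project>
```

One thing to watch: the selector compares against its own local cache, not against the remote directory, so if a transfer fails halfway the cache may already consider those files sent. Deleting cache.properties forces a full upload on the next run.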


Running remote commands on multiple systems

Ant can run tasks in parallel and track dependencies between tasks (i.e. before running task2, make sure task1 has finished). Ant can also run commands remotely with SSH or telnet. Combine the two and you get a simple way of writing distributed glue code. As an example, we'll check the number of messages in the Postfix deferred queue on several servers.

<?xml version="1.0" encoding="UTF-8"?>
<project name="remoteexample" default="all">

  <target name="checkservers">
    <!-- Query both servers at the same time; each result lands in its own property. -->
    <parallel>
      <sshexec host="server1.example.com"
               username="exampleuser"
               trust="yes"
               keyfile="${user.home}/.ssh/id_rsa"
               outputproperty="out1"
               command="find /var/spool/postfix/deferred -type f -print | wc -l" />

      <sshexec host="server2.example.com"
               username="exampleuser"
               trust="yes"
               keyfile="${user.home}/.ssh/id_rsa"
               outputproperty="out2"
               command="find /var/spool/postfix/deferred -type f -print | wc -l" />
    </parallel>
  </target>

  <!-- Runs only after checkservers has finished. -->
  <target name="all" depends="checkservers">
    <echo message="${out1}" />
    <echo message="${out2}" />
  </target>

</project>



Two caveats:
- Ant uses its own SSH implementation rather than the system ssh client, so if you run it on Linux, it will not follow your usual configuration in ~/.ssh. You need to provide all required parameters in the buildfile, like the path to the keyfile in the example above.
- If you run the above example, you'll notice the output looks like this:
all:
[echo] find /var/spool/postfix/deferred -type f -print | wc -l : 2
[echo] find /var/spool/postfix/deferred -type f -print | wc -l : 0

The command you ran is part of the outputproperty. It's a known bug in Ant; local exec doesn't behave that way. It was fixed recently, but the new build is not available yet. If you want to process the output (e.g. sum the values), you either have to filter the command out or compile Ant from Subversion. I tried both ways, and filtering in a shell script is the easier one.
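
If you'd rather keep the filtering inside the buildfile, something along these lines might work. It's only a sketch: it assumes the property has exactly the "command : number" form shown above, and it uses a hard-coded stand-in for out1 so that it can be tried without any servers. The property name count1 is made up.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project name="filterexample" default="show">

  <!-- Stand-in for the value sshexec stored in out1 in the example above. -->
  <property name="out1"
            value="find /var/spool/postfix/deferred -type f -print | wc -l : 2" />

  <target name="show">
    <!-- Copy out1 into count1, dropping everything up to and including the last " : ". -->
    <loadresource property="count1">
      <propertyresource name="out1" />
      <filterchain>
        <tokenfilter>
          <replaceregex pattern="^.* : " replace="" />
        </tokenfilter>
        <striplinebreaks />
      </filterchain>
    </loadresource>
    <echo message="${count1}" />
  </target>

</project>
```

The intent is for count1 to hold just the number. Once a fixed Ant build is out, the whole loadresource block should become unnecessary.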


The downsides of Ant

  • Ant requires Java. Not a problem if you need it anyway, but a JRE is a little too big to install just for one or two scripts.
  • An Ant buildfile (build.xml by default) has a completely different syntax than a script. You don't write a sequence of commands. Instead, you declare what you want to achieve and what it requires, using simple XML syntax. Still, it only takes an hour or two to grasp the concept and start writing buildfiles.
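
To illustrate the declarative style, here is a tiny buildfile with made-up target names. None of the targets calls another one. Each only states what it depends on, and Ant works out the order.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project name="declarative" default="report">

  <!-- Shared prerequisite of collect and cleanup. -->
  <target name="init">
    <echo message="init" />
  </target>

  <target name="collect" depends="init">
    <echo message="collect" />
  </target>

  <target name="cleanup" depends="init">
    <echo message="cleanup" />
  </target>

  <!-- The default target lists its prerequisites; Ant resolves the order. -->
  <target name="report" depends="collect,cleanup">
    <echo message="report" />
  </target>

</project>
```

Running ant with no arguments executes init, collect, cleanup and report, in that order. Note that init runs only once even though two targets depend on it.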