
yEd is a gem of a graph editor that makes it very easy to create diagrams and flowcharts. I used Inkscape for this purpose in the past, and had to do a lot of manual alignment. yEd does it all auto-magically with a very intuitive interface. The only bad thing about it is that it’s not open source, although it does come free of charge.

yEd can import graphs from a variety of formats, one of which is GraphML. The other day I had a graph in Python (NetworkX), and I wanted to lay it out nicely for printing. yEd’s layout functions surpass anything I’ve seen Graphviz do, so I decided to export the graph to GraphML and load it into yEd. This proved more difficult than I anticipated, but only because I didn’t know where to look. Hopefully this post will save you some time.

Let’s generate a graph, give the nodes some labels, and export the graph to a GraphML file:

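import networkx as nx

# Random directed graph: 10 nodes, each possible edge present with probability 0.13
graph = nx.gnp_random_graph(10, 0.13, directed=True)
for node in graph.nodes():
    # Current NetworkX spells this graph.nodes[...]; the older graph.node[...] no longer exists
    graph.nodes[node]['label'] = "node %d" % (node + 1)
nx.write_graphml(graph, "random.graphml")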

Now open the random.graphml file in yEd. All you will see is a square. This is because all the nodes are on top of each other and have no label. Don’t despair.

First, let’s recover those labels. If you open the GraphML file in a text editor, you will see that NetworkX was smart enough to export the node labels, but yEd was not smart enough to realize they were labels. In fact, they became properties of the nodes in yEd, which you can see by right-clicking on a node, clicking Properties, and then selecting the Data tab.
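If you would rather check this from Python than from a text editor, here is a minimal sketch using only the standard library. The SAMPLE string is a hand-written fragment in the general shape of a GraphML export (an assumption for illustration, not a copy of a real file), and node_data is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# GraphML elements live in this XML namespace
NS = "{http://graphml.graphdrawing.org/xmlns}"

# Hand-written illustrative fragment, not the output of an actual export
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="label" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="0"><data key="d0">node 1</data></node>
    <node id="1"><data key="d0">node 2</data></node>
    <edge source="0" target="1"/>
  </graph>
</graphml>"""


def node_data(graphml_text):
    """Map each node id to a dict of its named data attributes."""
    root = ET.fromstring(graphml_text)
    # <key> elements declare which data ids belong to which attribute name
    names = {
        key.get("id"): key.get("attr.name")
        for key in root.iter(NS + "key")
        if key.get("for") == "node"
    }
    result = {}
    for node in root.iter(NS + "node"):
        result[node.get("id")] = {
            names[data.get("key")]: data.text
            for data in node.findall(NS + "data")
            if data.get("key") in names
        }
    return result


print(node_data(SAMPLE))
```

To run it on the real export, pass open("random.graphml").read() instead of SAMPLE. The labels show up as plain data attributes, which is why yEd files them under the Data tab rather than treating them as labels.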

And now we come to the part that took me a long time to figure out. We need to map the “label” property imported from the GraphML file onto the internal property that yEd uses for labels. To do this, click Edit, Properties Mapper. Click the little plus sign under Configurations, and then ‘New Configuration for Nodes.’ Now click the plus sign next to ‘Mappings’. If you’re lucky, yEd will figure out what you’re trying to do and automatically select the right mapping. Your window should look like this:

Now select ‘Fit Node to Label’ if you want the nodes to be resized to fit the labels, then click OK. (If you forgot to select ‘Fit Node to Label,’ you can do it later by going to Tools, Fit Node to Label.) You should see the labels now:

But the nodes are still on top of each other. To fix that, use one of the algorithms under the Layout menu. Final result:

Ta-da! If you save the modified graph in GraphML format and open the file in a text editor, you will see that reverse-engineering the format that yEd uses to store labels would not be easy.

I will never be care-free again. What a sad thing to realize.

These past few days, as I was brainstorming for my senior thesis (a year-long endeavor), I found myself wishing that the work would just stop, and let me relax for a week or two knowing that there are no looming deadlines. This is how it used to be in high school and the first few semesters of college: All my work had an absolute deadline — the end of the semester. When the break came, I was done, and work wouldn’t start again until the next semester. I could relax, knowing that I had no responsibilities until classes started again. I could survive several days without checking my email. I could work on crafts stuff without feeling guilty. My life was structured in alternating periods of work and no-work.

I really miss that.

The work I do now is not like that at all. Research just drags on and on. I’m still trying to finish up some stuff from last summer, and next June I’ll be presenting some work that was supposed to be finished one year before. Even though I want to be done with it, the work just keeps coming back; it never ends. If I don’t do so well on a problem set, I’ll just start fresh on a new one next week. But with research, I have to live with the results and the crappy code and the thorny questions for a long time, and there are no semester cutoffs.

The consequence is that I can’t really relax during the breaks, because I always have work that is not finished. I check my email daily, in case something important (work-related) comes up. I feel guilty getting up at 12 and doing origami, because I could be working on my thesis. I end up mixing work and play all day, and then feeling dissatisfied at the end of the day because I was neither productive nor relaxed. My break got shorter by a day, but my pile of work did not get any smaller.

The depressing thing is that grad school will be like this too. The work will leak from one year to the next, never completely finished. And a real job would be the same, I’m sure. The periods I miss, with no work and no responsibilities, are not coming back. I’ll never be care-free again, and I need to find a way to relax and rest even knowing that the work will drag on forever.

I’ll take a quiet life,
A handshake of carbon monoxide,
No alarms and no surprises please.

I just discovered this today: With the gnupg.vim plugin, Vim can edit GPG-encrypted files transparently. So if a file has a .gpg extension, Vim will automatically decrypt it upon opening, and re-encrypt it upon saving. Awesome! I no longer need my clunky script that did this by dumping the cleartext to a temporary file…

This is a rant about how expensive it is to apply for grad school.

Let’s do the math:

$160 to take the GRE General test (which tests high-school-level math with no calculus, ridiculous English vocabulary that only literary people would use, and your ability to write a bullshit essay as fast as possible.)

$140 to take the GRE Subject test (The CS test was very broad; it had questions on everything from networking and operating systems to algorithms and programming languages and RSA encryption.)

$23 x N to send GRE score reports to N schools, assuming you don’t use the free four that you get when taking either test (They are sent electronically, so why do they cost so much?!)

$3 x N to send official transcripts to N schools (This probably covers the cost of printing and mailing. Thank you Tufts for not being greedy, although you definitely compensate in other ways. Random fact: some schools don’t want an official transcript until you’re admitted; some others want two copies for some reason.)

$90 x N average application fee for N schools (Nearly all schools have a higher fee for international applicants. And you can only maybe qualify for a fee waiver if you’re a US citizen / resident.)

My total for N=9 schools was $1,236. Holy shit. My bank account is weeping now :( This does not include the psychological costs of lost sleep, ignoring your friends for an entire semester, pounding away at your statement of purpose until your wrists hurt, and constant feelings of inadequacy / anxiety / mild panic. All of it to become… this? Hmm.
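For anyone who wants to plug in their own N, here is the list above as a small function. It is only a sketch that takes the listed prices at face value; application_cost and its parameter names are made up for illustration, and free_reports is my assumption about how the free score reports would be counted.

```python
def application_cost(n_schools, gre_general=160, gre_subject=140,
                     score_report=23, transcript=3, avg_app_fee=90,
                     free_reports=0):
    """Estimate the total cost in dollars of applying to n_schools schools."""
    # Score reports are only paid for beyond the free ones
    paid_reports = max(n_schools - free_reports, 0)
    return (gre_general + gre_subject
            + score_report * paid_reports
            + transcript * n_schools
            + avg_app_fee * n_schools)


print(application_cost(9))                  # 1344
print(application_cost(9, free_reports=4))  # 1252
```

Neither figure matches the $1,236 total exactly, which is to be expected: the $90 application fee is only an average, so the real per-school fees add up differently.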

BibTeX is a reference management system often used together with the LaTeX typesetting system. Today I wanted to find out if I had any unused references in my BibTeX file. There doesn’t seem to be an easy way to do this. Luckily, a combination of tools did it. This is a quick brain dump of what I did. (If you find an easier way to do this, let me know.)

My bibtex file is paper.bib. This contains all (used and unused) references. My paper is in paper.tex. When first compiling the paper, latex creates paper.aux. This intermediary file contains entries only for the references that the paper actually cites.

1) Dump keys for all (used and unused) references, and sort them:

bib2bib paper.bib -ob /dev/null -oc /dev/stdout |sort >all

2) Dump keys for used references, and sort them:

aux2bib paper.aux |bib2bib -ob /dev/null -oc /dev/stdout |sort >used

3) List keys which are in all but not in used:

diff --old-line-format=%L --unchanged-line-format= all used

You can consult the manual pages for bib2bib, aux2bib, and diff to see what the parameters above do. The commands bib2bib and aux2bib can be found in the bibtex2html package.
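If bibtex2html is not installed, the same comparison can be roughed out in Python. This is a sketch, not a BibTeX parser: it assumes entries look like @type{key, ...} and that the .aux file records citations as \citation{key1,key2} lines. The sample keys and the function names are invented for illustration.

```python
import re

# These @-blocks are not bibliography entries, so they have no citation key
NON_ENTRY_TYPES = {"string", "preamble", "comment"}


def bib_keys(bib_text):
    """Return the set of entry keys defined in BibTeX source text."""
    keys = set()
    for kind, key in re.findall(r"@(\w+)\s*\{\s*([^,\s{}]+)\s*,", bib_text):
        if kind.lower() not in NON_ENTRY_TYPES:
            keys.add(key)
    return keys


def cited_keys(aux_text):
    """Return the set of keys recorded in \\citation{...} lines of a .aux file."""
    keys = set()
    for group in re.findall(r"\\citation\{([^}]*)\}", aux_text):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def unused_keys(bib_text, aux_text):
    """List keys that are defined in the .bib text but never cited."""
    cited = cited_keys(aux_text)
    if "*" in cited:  # \nocite{*} pulls in every entry
        return []
    return sorted(bib_keys(bib_text) - cited)


# Made-up sample data standing in for paper.bib and paper.aux
SAMPLE_BIB = """
@string{jex = "Journal of Examples"}
@article{alpha2001, title = {First placeholder}, journal = jex}
@book{beta2002,
  title = {Second placeholder},
}
@misc{gamma2003, note = {never cited}}
"""
SAMPLE_AUX = r"""
\relax
\citation{alpha2001}
\citation{beta2002,alpha2001}
\bibdata{paper}
"""

print(unused_keys(SAMPLE_BIB, SAMPLE_AUX))  # ['gamma2003']
```

For real files, read paper.bib and paper.aux with open(...).read() and pass the two strings in. Known gaps: entries reached only through crossref will be reported as unused, and keys are compared case-sensitively.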

Imagine putting error messages on a spectrum, according to how easy to understand they are. “404” lies on the cryptic end. “Page not found” lies somewhere in the middle. I think the middle is a good trade-off between designer effort and user satisfaction. But what lies on the other end? This:

We recognize that our website used to present a challenge, and that many people have memorized the path through the maze or bookmarked the information they need. Unfortunately, due to the new organization of our website content, those trails of breadcrumbs and bookmarks will no longer work. We apologize for “moving the cheese” at the end of the maze, but we think you’ll have a much easier time finding the information you need.

Our website content has been organized into a number of related categories, listed below. Please Contact us if you need any further information

The above is a real error message I got on this page, and from a government agency no less. Someone is either over-zealous, or getting paid by the keystroke… The worst problem is that you don’t immediately realize this is an error page, so you waste time trying to make sense of that paragraph… Yay usability.

By the way, the Internet Archive can be a great resource when you need to dig up a dead link. If the website is “important” enough (I wonder according to which criterion), the Internet Archive will probably have a stored copy. I’ve found my article here. Another trick for recently-removed content, or for content that’s temporarily down, is to search Google’s cache (one, two).
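Looking up a dead link can also be scripted. The sketch below targets the Internet Archive's Wayback availability endpoint as I understand it; the endpoint URL and the shape of the JSON reply are assumptions you should check against the Archive's own documentation, and the function names and sample reply are mine.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback availability API
API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Extract the archived URL from a JSON reply, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_archived_copy(page_url, timestamp=None):
    """Ask the archive for the snapshot closest to timestamp (needs network)."""
    with urlopen(availability_url(page_url, timestamp), timeout=10) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))


# Offline demonstration with a hand-written reply in the assumed format
SAMPLE_REPLY = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/2009/http://example.com/",
            "timestamp": "2009",
            "status": "200",
        }
    }
})
print(closest_snapshot(SAMPLE_REPLY))
```

Calling find_archived_copy("example.com/some/dead/page") would do the real lookup; it is left out of the demonstration so the snippet runs without a network connection.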

Sometimes you want to save a local copy of an entire website, either because you want to use it offline, or because you only have access to it temporarily. You could open each page in a browser and save it, but that’s tedious. There are nice crawlers out there like HTTrack, which will save an entire website for you, and even tweak the links so that they work on the local version. Unfortunately, such crawlers do not handle JavaScript. If your website uses Ajax to load its content, you’re out of luck. Your downloaded copy will contain unexecuted calls like “fetchContent()”, instead of the actual stuff that you would see in a browser.
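To see why, here is roughly what a static crawler sees. This sketch uses Python's built-in HTML parser on a made-up page; LinkCollector and the page contents are illustrative, and real crawlers like HTTrack are of course far more elaborate.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, the way a static crawler finds pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Made-up page: the real content would be inserted by fetchContent() at runtime
PAGE = """<html><body>
<a href="/about.html">About</a>
<div id="content"></div>
<script>fetchContent();</script>
</body></html>"""

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)  # ['/about.html']
```

The parser never runs the script, so whatever fetchContent() would have put into the empty div (including any links to further pages) simply does not exist as far as the crawler is concerned.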

Wouldn’t it be great if a crawler could execute the JavaScript on a page, and save it after all the content has been filled in? It turns out that some folks at MIT have already thought about this. Crowbar is a headless Firefox-like browser, running on top of Mozilla’s XULRunner. You point it to a URL, it loads it, executes the JavaScript, and gives you the resulting page. It even seems to handle cookies. It’s like the full-fledged Firefox running without any screen output, which is exactly what you’d want for web scraping!

After I stopped jumping up and down with excitement, I played with Crowbar a little bit. The idea is great, but the tool itself has some disappointing flaws. First, it hangs if you point it to a URL that is not an HTML page (an image, for example). Why would you want to fetch images with Crowbar? Because some evil websites won’t deliver them unless you have the appropriate “session id” cookie. Which brings me to the second flaw: Even though Crowbar handles cookies, there is no way to get at them from the outside. I’ve found cookies.sqlite in ~/.crowbar/profile-name/, but the database is locked and inaccessible while Crowbar is running. (And even if I could open it, it probably wouldn’t store session cookies.) The third flaw is more subtle: When faced with pages that have non-ASCII characters, represented with two bytes in UTF-8, Crowbar seems to silently drop the first byte. This gives you corrupted data, leading to hours of keyboard-smashing frustration.
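To make the third flaw concrete, here is a small simulation of the symptom. It is not Crowbar's code, only my reconstruction of what "dropping the first byte" of a two-byte UTF-8 character does to the data; drop_lead_bytes is a hypothetical helper.

```python
def drop_lead_bytes(data):
    """Remove the lead byte of every two-byte UTF-8 sequence in data."""
    # Lead bytes of two-byte sequences have the bit pattern 110xxxxx
    return bytes(b for b in data if (b & 0xE0) != 0xC0)


original = "café".encode("utf-8")
damaged = drop_lead_bytes(original)

print(original)  # b'caf\xc3\xa9'
print(damaged)   # b'caf\xa9'

# The orphaned continuation byte is no longer valid UTF-8
print(damaged.decode("utf-8", errors="replace"))  # 'caf\ufffd'

# Read as Latin-1 it decodes without error, but to the wrong character
print(damaged.decode("latin-1"))  # 'caf©'
```

Nothing raises an error unless you decode strictly, so the corruption is easy to miss until the wrong characters turn up in your scraped data.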

The last SVN commit in Crowbar was in June 2008, so I am not too optimistic about seeing these bugs fixed. I also don’t know enough JavaScript and XUL to do it myself. Still, I think that using a headless browser for scraping is a great idea. I first stumbled upon Crowbar via a post by Jabba Laci. I have since found a few related tools that might be more useful:

I haven’t tried any of these three, but I hold high hopes for the one that uses QtWebKit from within Python. If I understand it correctly, you should be able to get full control over the browser, and peek at cookies, HTTP headers, and anything else you might want. Finally, here are some other random resources that might be useful:

  • Mechanize, a programmable headless web browser for Python. It doesn’t handle JavaScript, but it does handle cookies, and it has a nice interface for filling out forms. No JavaScript means this is much lighter than running a full (albeit headless) browser.
  • Tools to handle broken HTML: LXML, BeautifulSoup, html5lib.
  • An interesting blog about web scraping. Also a python library by the same guy.
  • Apparently there is a fair bit of money and controversy around web scraping.

I began this post talking about crawlers, but then focused on scraping a single page with a JavaScript-enabled headless browser. I don’t know about any existing crawlers that support JavaScript / Ajax this way. One problem is that you can’t tell when the scripts on a page have finished running. (Crowbar just waits a predetermined amount of time before delivering a snapshot of the page’s contents.) Anyway, a JavaScript-enabled crawler sounds like an interesting project :) sudo give me free time…
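One possible heuristic for the "when is the page done?" problem is to poll the page and stop once its contents have stopped changing, instead of waiting a fixed time. To be clear, this is not what Crowbar does; it is a sketch of an alternative. The get_snapshot callable is a stand-in for whatever your headless browser offers for reading the current DOM, and all names and default values here are assumptions.

```python
import time


def wait_until_stable(get_snapshot, interval=0.5, stable_checks=3,
                      timeout=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_snapshot() until it returns the same value stable_checks
    times in a row, or until timeout seconds have passed.

    Returns the last snapshot seen.
    """
    deadline = clock() + timeout
    last = get_snapshot()
    unchanged = 0
    while unchanged < stable_checks and clock() < deadline:
        sleep(interval)
        current = get_snapshot()
        if current == last:
            unchanged += 1
        else:
            # The page changed: remember it and restart the count
            unchanged = 0
            last = current
    return last


# Fake "browser" whose content grows twice and then settles
states = iter(["<p>a</p>", "<p>a</p><p>b</p>", "<p>a</p><p>b</p><p>c</p>"])
final_state = "<p>a</p><p>b</p><p>c</p>"


def fake_snapshot():
    return next(states, final_state)


result = wait_until_stable(fake_snapshot, sleep=lambda seconds: None)
print(result)
```

This still cannot distinguish a finished page from one that is merely slow, and a page with a ticking clock or rotating ads will never look stable, which is why the timeout stays in as a fallback.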

I shouldn’t do this, but I pulled you out for a moment to give you a hint.

in-the-way

So, my little Amélie, you don’t have bones of glass. You can take life’s knocks. If you let this chance pass by, then in time it is your heart that will become as dry and brittle as my skeleton. So go on, for heaven’s sake!

maximize

Credits: Humor-sans font by ch00f, based on the handwriting on xkcd.

Post hidden and mangled, after it has been pointed out to me how it can be misinterpreted as more than a silly writing exercise… Yay self-censorship :( I failed to get across the point that these are (based on) things I’ve been a reluctant witness to, and not things I’ve invented…
