
In one of my classes this semester, we programmed little agents in an artificial world to do (hopefully) interesting things. My group focused on exploring disease spreading patterns. We picked a few variables: How fast does the disease spread? Are the agents attracted to one another? Can the agents accurately observe whether other agents are sick? And so on. We built a pretty simple simulation of this artificial world using the excellent MASON simulator. Here’s an example of what it looks like:

The green circles are food. The other circles are agents, which are either sick (red) or healthy (blue). Other agents perceive an agent as sick only if it has a smaller dot next to it. The bar above each agent shows its energy; the agent dies if its energy drops to zero. Agents can eat food to gain energy, and they burn more energy per step when they are sick.

Overall, the simulation we built was rather unsatisfying. In this post I’d like to dwell on the reasons why.


Our library has a bunch of really nice book scanners, which let you save your scan to a USB flash drive, or email the scan to yourself. The scanners are operated by a computer with a touch screen. A normal user probably wouldn’t even notice that there is a “real” computer in there, because the touch interface is pretty straightforward to use. That is, until it fails with a helpful error message (“This program has encountered a problem and needs to close”) and dies, taking with it everything I’ve scanned so far.

But that’s not the end of the story. When the touch-screen UI dies, it exposes a standard Windows XP desktop underneath. And guess what? It’s running under the Administrator account. Okay, keep that in mind in case you want to have some fun with it later.

So I reboot the machine, and the touch-screen UI starts again. I try Alt+Tab, Win+R, Win+D, Ctrl+Shift+Esc; nothing works. I guess the software swallows any keyboard shortcuts to prevent such curious exploration. Oh well. I try to scan my papers again, and again the UI crashes. Fiddlesticks.

Another reboot. This time I hit Win+R before the touch-screen UI is fully loaded, and start an instance of cmd. Sure enough, now I can Alt+Tab between the UI and cmd, and I can launch anything I want from cmd. Let’s take a look at the C:\ drive. Whoa, 200 MB free out of 50 GB? That might be why it was crashing. So I look around and find the temp directory for the scanner UI, and sure enough it’s full of temp files, gigabytes of them. They look like raw bitmaps from previous scans. I delete them all; the scanner UI works again, and I am able to finish my work.

Lessons learned:

  1. The machine runs on an unprotected Windows admin account, wide open for whatever abuse you might want to throw at it.
  2. The scanner UI prevents simple attempts to get around it (like Win+R), but the machine is still vulnerable in the small time window after Windows has finished booting, but before the UI auto-starts.
  3. The UI seems to store raw copies of everything it ever scans. Massive privacy hole. (It’s possible that it deletes the temp files if the scan is successful, and only leaves them when it fails, but that’s still pretty bad.)
  4. In so many years of desktop computing, we are still caught by surprise when the disk gets full. (A few years back, KDE would fail to log me in without any explanation. It turned out my /home/ was full. Even Mars rovers have failed when their disks got full.) There has to be a better way to handle things. Temp directories that are wiped on reboot are a good first step. Watchdogs might work for desktop users, but not on an embedded system. What else could we do?
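One modest answer is for the application itself to check free space before it writes, and to fail early with a clear message instead of crashing. Here is a minimal sketch in Python, assuming a hypothetical helper named has_room_for and an arbitrary 500 MB safety reserve; neither comes from the scanner software.

```python
import shutil

def has_room_for(path, needed_bytes, reserve_bytes=500 * 1024**2):
    """Return True if writing needed_bytes under path still leaves reserve_bytes free.

    The function name and the default reserve are illustrative assumptions.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free - needed_bytes >= reserve_bytes

if __name__ == "__main__":
    scan_size = 30 * 1024**2  # pretend one raw scan takes about 30 MB
    if has_room_for(".", scan_size):
        print("OK to write the scan")
    else:
        print("Disk almost full: clean up temp files before scanning")
```

A check like this does not prevent the disk from filling up, but it turns a mysterious crash into an error message that someone can act on.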

On Linux, ulimit allows you to limit the resources that a process can use. Two use cases:

  1. You have a program that sometimes runs out of memory, slowing your computer down to a crawl. You can use ulimit -v to limit the amount of memory that processes in a shell can use. If a process tries to allocate more memory than that, the allocation will fail and the program will usually abort.
  2. You have a program with a deep recursion, which segfaults with the default stack limit of 8M. You can use ulimit -s to increase the allowed stack size.

There are many more limits you can set; type help ulimit in bash to list them. You can find out the current limits by typing ulimit -a.

Two gotchas that I always forget about:

  1. You may try to limit the memory usage of a process by setting the maximum resident set size (ulimit -m). This has no effect on Linux. man setrlimit says it used to work only in ancient versions. You should limit the maximum amount of virtual memory (ulimit -v) instead.
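  2. ulimit has hard limits and soft limits. The soft limit is the one that is actually enforced, and it can be moved up and down as long as it stays at or below the hard limit. An unprivileged user can lower a hard limit but cannot raise it again, so you can shoot yourself in the foot if you set your hard limit too low. I recommend using soft limits only. Set them with, for example, ulimit -Sv, and query them with ulimit -Sa.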
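The same soft limit can also be set from a program instead of a shell. The sketch below uses Python's standard resource module to cap the virtual memory of a single child process, which is roughly what ulimit -Sv does for a shell. The function name run_with_soft_memory_limit is made up for this example, and it assumes Linux. Note that bash's ulimit -v counts in units of 1024 bytes, while setrlimit counts in bytes.

```python
import resource
import subprocess
import sys

def run_with_soft_memory_limit(cmd, max_bytes):
    """Run cmd with a soft cap on virtual memory, roughly like `ulimit -Sv`."""
    def apply_limit():
        # Runs in the child just before exec, so the parent's limits are untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        # The soft limit may never exceed the hard limit; leave the hard limit alone.
        soft = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return subprocess.run(cmd, preexec_fn=apply_limit, capture_output=True, text=True)

if __name__ == "__main__":
    # Ask a child Python interpreter to report the soft limit it sees.
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_AS)[0])"
    result = run_with_soft_memory_limit([sys.executable, "-c", probe], 2 * 1024**3)
    print("child soft limit (bytes):", result.stdout.strip())
```

Because only the soft limit changes, the child (or anything it starts) can still raise it back up to the hard limit, which matches the "soft limits only" advice above.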

Happy hacking!

yEd is a gem of a graph editor that makes it very easy to create diagrams and flowcharts. I used Inkscape for this purpose in the past, and had to do a lot of manual alignment. yEd does it all auto-magically with a very intuitive interface. The only bad thing about it is that it’s not open source, although it does come free of charge.

yEd can import graphs from a variety of formats, one of which is GraphML. The other day I had a graph in Python (NetworkX), and I wanted to lay it out nicely for printing. yEd’s layout functions surpass anything I’ve seen Graphviz do, so I decided to export the graph to GraphML and load it into yEd. This proved more difficult than I anticipated, but only because I didn’t know where to look. Hopefully this post will save you some time.

Let’s generate a graph, give the nodes some labels, and export the graph to a GraphML file:

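import networkx as nx

# Random directed graph: 10 nodes, each possible edge present with probability 0.13.
graph = nx.gnp_random_graph(10, 0.13, directed=True)
for node in graph.nodes():
    # graph.nodes[...] replaces the older graph.node[...], which newer NetworkX releases no longer provide.
    graph.nodes[node]['label'] = "node %d" % (node + 1)
nx.write_graphml(graph, "random.graphml")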

Now open the random.graphml file in yEd. All you will see is a square. This is because all the nodes are on top of each other and have no label. Don’t despair.

First, let’s recover those labels. If you open the GraphML file in a text editor, you will see that NetworkX was smart enough to export the node labels, but yEd was not smart enough to realize they were labels. In fact, they became properties of the nodes in yEd, which you can see by right-clicking on a node, clicking Properties, and then selecting the Data tab.
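If you would rather check this from Python than from a text editor, the sketch below lists the generic data attributes attached to each node. To stay self-contained it parses a small hand-written GraphML snippet (SAMPLE) that imitates the general shape of such a file; it is not the literal output of NetworkX, and the helper name node_attributes is made up for this example.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"
NS = {"g": GRAPHML_NS}

# Hand-written stand-in for an exported file (an assumption, not real NetworkX output).
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="label" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="0"><data key="d0">node 1</data></node>
    <node id="1"><data key="d0">node 2</data></node>
    <edge source="0" target="1"/>
  </graph>
</graphml>"""

def node_attributes(graphml_text):
    """Map each node id to its {attribute name: value} pairs."""
    root = ET.fromstring(graphml_text)
    # <key> elements declare what each data key means, e.g. d0 -> "label".
    key_names = {
        key.get("id"): key.get("attr.name")
        for key in root.findall("g:key", NS)
        if key.get("for") == "node"
    }
    attributes = {}
    for node in root.iter("{%s}node" % GRAPHML_NS):
        attributes[node.get("id")] = {
            key_names[data.get("key")]: data.text
            for data in node.findall("g:data", NS)
            if data.get("key") in key_names
        }
    return attributes

if __name__ == "__main__":
    for node_id, attrs in node_attributes(SAMPLE).items():
        print(node_id, attrs)
```

The label is stored as an ordinary named attribute, with nothing that marks it as the text to draw on the node, which is consistent with yEd treating it as a plain property.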

And now we come to the part that took me a long time to figure out. We need to map the “Label” property imported from the GraphML file onto the internal property that yEd uses for labels. To do this, click Edit, Properties Mapper. Click the little plus sign under Configurations, and then ‘New Configuration for Nodes.’ Now click the plus sign next to ‘Mappings’. If you’re lucky, yEd should figure out what you’re trying to do, and automatically select the right mapping. Your window should look like this:

Now select ‘Fit Node to Label’ if you want the nodes to be resized to fit the labels, then click OK. (If you forgot to select ‘Fit Node to Label,’ you can do it later by going to Tools, Fit Node to Label.) You should see the labels now:

But the nodes are still on top of each other. To fix that, use one of the algorithms under the Layout menu. Final result:

Ta-da! If you save the modified graph in GraphML format and open the file in a text editor, you will see that reverse-engineering the format that yEd uses to store labels would not be easy.

I will never be care-free again. What a sad thing to realize.

These past few days, as I was brainstorming for my senior thesis (a year-long endeavor), I found myself wishing that the work would just stop, and let me relax for a week or two knowing that there are no looming deadlines. This is how it used to be in high school and the first few semesters of college: All my work had an absolute deadline — the end of the semester. When the break came, I was done, and work wouldn’t start again until the next semester. I could relax, knowing that I had no responsibilities until classes started again. I could survive several days without checking my email. I could work on crafts stuff without feeling guilty. My life was structured in alternating periods of work and no-work.

I really miss that.

The work I do now is not like that at all. Research just drags on and on. I’m still trying to finish up some stuff from last summer, and next June I’ll be presenting some work that was supposed to be finished one year before. Even though I want to be done with it, the work just keeps coming back; it never ends. If I don’t do so well on a problem set, I’ll just start fresh on a new one next week. But with research, I have to live with the results and the crappy code and the thorny questions for a long time, and there are no semester cutoffs.

The consequence is that I can’t really relax during the breaks, because I always have work that is not finished. I check my email daily, in case something important (work-related) comes up. I feel guilty getting up at 12 and doing origami, because I could be working on my thesis. I end up mixing work and play all day, and then feeling dissatisfied at the end of the day because I was neither productive nor relaxed. My break got shorter by a day, but my pile of work did not get any smaller.

The depressing thing is that grad school will be like this too. The work will leak from one year to the next, never completely finished. And a real job would be the same, I’m sure. The periods I miss, with no work and no responsibilities, are not coming back. I’ll never be care-free again, and I need to find a way to relax and rest even knowing that the work will drag on forever.

I’ll take a quiet life,
A handshake of carbon monoxide,
No alarms and no surprises please.

I just discovered this today: With the gnupg.vim plugin, Vim can edit GPG-encrypted files transparently. So if a file has a .gpg extension, Vim will automatically decrypt it upon opening, and re-encrypt it upon saving. Awesome! I no longer need my clunky script that did this by dumping the cleartext to a temporary file…

This is a rant about how expensive it is to apply for grad school.

Let’s do the math:

$160 to take the GRE General test (which tests high-school-level math with no calculus, ridiculous English vocabulary that only literary people would use, and your ability to write a bullshit essay as fast as possible.)

$140 to take the GRE Subject test (The CS test was very broad; it had questions on everything from networking and operating systems to algorithms and programming languages and RSA encryption.)

$23 x N to send GRE score reports to N schools, assuming you don’t use the free four that you get when taking either test (They are sent electronically, so why do they cost so much?!)

$3 x N to send official transcripts to N schools (This probably covers the cost of printing and mailing. Thank you Tufts for not being greedy, although you definitely compensate in other ways. Random fact: some schools don’t want an official transcript until you’re admitted; some others want two copies for some reason.)

$90 x N average application fee for N schools (Nearly all schools have a higher fee for international applicants. And you can only maybe qualify for a fee waiver if you’re a US citizen / resident.)

My total for N=9 schools was $1,236. Holy shit. My bank account is weeping now :( This does not include the psychological costs of lost sleep, ignoring your friends for an entire semester, pounding away at your statement of purpose until your wrists hurt, and constant feelings of inadequacy / anxiety / mild panic. All of it to become… this? Hmm.

BibTeX is a reference management system often used together with the LaTeX typesetting system. Today I wanted to find out if I had any unused references in my BibTeX file. There doesn’t seem to be an easy way to do this. Luckily, a combination of tools did it. This is a quick brain dump of what I did. (If you find an easier way to do this, let me know.)

My bibtex file is paper.bib. This contains all (used and unused) references. My paper is in paper.tex. When first compiling the paper, latex creates paper.aux. This intermediary file contains entries only for the references that the paper actually cites.

1) Dump keys for all (used and unused) references, and sort them:

bib2bib paper.bib -ob /dev/null -oc /dev/stdout |sort >all

2) Dump keys for used references, and sort them:

aux2bib paper.aux |bib2bib -ob /dev/null -oc /dev/stdout |sort >used

3) List keys which are in all but not in used:

diff --old-line-format='%L' --unchanged-line-format= --new-line-format= all used

You can consult the manual pages for bib2bib, aux2bib, and diff to see what the parameters above do. The commands bib2bib and aux2bib can be found in the bibtex2html package.
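If bibtex2html is not installed, the same comparison can be approximated in a few lines of Python. This is a rough sketch rather than a replacement for the commands above: it relies on the fact that LaTeX records each cited key in the .aux file as a \citation{...} line, and it finds entry keys in the .bib file with a simple regular expression that will not cope with every BibTeX file. The function names are made up for this example.

```python
import re

def cited_keys(aux_text):
    """Collect keys from \\citation{key1,key2} lines in a .aux file."""
    keys = set()
    for group in re.findall(r"\\citation\{([^}]*)\}", aux_text):
        keys.update(key.strip() for key in group.split(",") if key.strip())
    return keys

def bib_keys(bib_text):
    """Collect entry keys such as @article{knuth84, ...} from a .bib file."""
    pattern = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
    skipped = {"string", "preamble", "comment"}  # these are not references
    return {key for kind, key in pattern.findall(bib_text) if kind.lower() not in skipped}

def unused_keys(bib_text, aux_text):
    """Keys defined in the .bib file but never cited in the paper."""
    return sorted(bib_keys(bib_text) - cited_keys(aux_text))

if __name__ == "__main__":
    with open("paper.bib") as bib, open("paper.aux") as aux:
        for key in unused_keys(bib.read(), aux.read()):
            print(key)
```

One caveat: a \nocite{*} in the paper shows up as \citation{*} in the .aux file, and this sketch does not treat it specially, so every entry would still be reported as unused.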

Imagine putting error messages on a spectrum, according to how easy to understand they are. “404” lies on the cryptic end. “Page not found” lies somewhere in the middle. I think the middle is a good trade-off between designer effort and user satisfaction. But what lies on the other end? This:

We recognize that our website used to present a challenge, and that many people have memorized the path through the maze or bookmarked the information they need. Unfortunately, due to the new organization of our website content, those trails of breadcrumbs and bookmarks will no longer work. We apologize for “moving the cheese” at the end of the maze, but we think you’ll have a much easier time finding the information you need.

Our website content has been organized into a number of related categories, listed below. Please Contact us if you need any further information

The above is a real error message I got on this page, and from a government agency no less. Someone is either over-zealous, or getting paid by the keystroke… The worst problem is that you don’t immediately realize this is an error page, so you waste time trying to make sense of that paragraph… Yay usability.

By the way, the Internet Archive can be a great resource when you need to dig up a dead link. If the website is “important” enough (I wonder according to which criterion), the Internet Archive will probably have a stored copy. I’ve found my article here. Another trick for recently-removed content, or for content that’s temporarily down, is to search Google’s cache (one, two).

Sometimes you want to save a local copy of an entire website, either because you want to use it offline, or because you only have access to it temporarily. You could open each page in a browser and save it, but that’s tedious. There are nice crawlers out there like HTTrack, which will save an entire website for you, and even tweak the links so that they work on the local version. Unfortunately, such crawlers do not handle JavaScript. If your website uses Ajax to load its content, you’re out of luck. Your downloaded copy will contain unexecuted calls like “fetchContent()”, instead of the actual stuff that you would see in a browser.

Wouldn’t it be great if a crawler could execute the JavaScript on a page, and save it after all the content has been filled in? It turns out that some folks at MIT have already thought about this. Crowbar is a headless Firefox-like browser, running on top of Mozilla’s XULRunner. You point it to a URL, it loads it, executes the JavaScript, and gives you the resulting page. It even seems to handle cookies. It’s like the full-fledged Firefox running without any screen output, which is exactly what you’d want for web scraping!

After I stopped jumping up and down with excitement, I played with Crowbar a little bit. The idea is great, but the tool itself has some disappointing flaws. First, it hangs if you point it to a URL that is not an HTML page (an image, for example). Why would you want to fetch images with Crowbar? Because some evil websites won’t deliver them unless you have the appropriate “session id” cookie. Which brings me to the second flaw: Even though Crowbar handles cookies, there is no way to get at them from the outside. I’ve found cookies.sqlite in ~/.crowbar/profile-name/, but the database is locked and inaccessible while Crowbar is running. (And even if I could open it, it probably wouldn’t store session cookies.) The third flaw is more subtle: When faced with pages that have non-ASCII characters, represented with two bytes in UTF-8, Crowbar seems to silently drop the first byte. This gives you corrupted data, leading to hours of keyboard-smashing frustration.
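To make the third flaw concrete, here is a small Python illustration of what losing the first byte of a two-byte UTF-8 sequence does to text. It only imitates the symptom; I don't know where in Crowbar the byte actually gets lost, and the function name drop_two_byte_leads is made up for this example.

```python
def drop_two_byte_leads(data):
    """Remove the lead byte (0xC2..0xDF) of every two-byte UTF-8 sequence."""
    out = bytearray()
    for byte in data:
        if 0xC2 <= byte <= 0xDF:
            continue  # simulate the silently dropped first byte
        out.append(byte)
    return bytes(out)

if __name__ == "__main__":
    original = "café".encode("utf-8")       # 'é' is encoded as the two bytes C3 A9
    damaged = drop_two_byte_leads(original)
    print(original, damaged)
    try:
        damaged.decode("utf-8")
    except UnicodeDecodeError as error:
        print("no longer valid UTF-8:", error)
    # A lenient decoder substitutes U+FFFD for the orphaned continuation byte.
    print(damaged.decode("utf-8", errors="replace"))
```

The orphaned second byte is not valid UTF-8 on its own, so the original character cannot be recovered with certainty after the fact.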

The last SVN commit in Crowbar was in June 2008, so I am not too optimistic about seeing these bugs fixed. I also don’t know enough JavaScript and XUL to do it myself. Still, I think that using a headless browser for scraping is a great idea. I first stumbled upon Crowbar via a post by Jabba Laci. I have since found a few related tools that might be more useful:

I haven’t tried any of these three, but I hold high hopes for the one that uses QtWebKit from within Python. If I understand it correctly, you should be able to get full access over the browser, and peek at cookies, HTTP headers, and anything else you might want. Finally, here are some other random resources that might be useful:

  • Mechanize, a programmable headless web browser for Python. It doesn’t handle JavaScript, but it does handle cookies, and it has a nice interface for filling out forms. No JavaScript means this is much lighter than running a full (albeit headless) browser.
  • Tools to handle broken HTML: LXML, BeautifulSoup, html5lib.
  • An interesting blog about web scraping. Also a python library by the same guy.
  • Apparently there is a fair bit of money and controversy around web scraping.

I began this post talking about crawlers, but then focused on scraping a single page with a JavaScript-enabled headless browser. I don’t know about any existing crawlers that support JavaScript / Ajax this way. One problem is that you can’t tell when the scripts on a page have finished running. (Crowbar just waits a predetermined amount of time before delivering a snapshot of the page’s contents.) Anyway, a JavaScript-enabled crawler sounds like an interesting project :) sudo give me free time…
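One heuristic for the "when are the scripts done?" problem is to poll the page and stop once its contents have not changed for a few polls in a row, with a timeout as a safety net. The sketch below is plain Python and is not tied to any browser: get_snapshot is a hypothetical callable that would return the current page source from whatever headless browser you use, and the default interval, poll count, and timeout are arbitrary.

```python
import time

def wait_until_stable(get_snapshot, interval=0.5, stable_polls=3, timeout=30.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_snapshot() until it returns the same value stable_polls times in a row.

    Returns the last snapshot seen, whether it settled or the timeout expired.
    """
    deadline = clock() + timeout
    last = get_snapshot()
    unchanged = 0
    while unchanged < stable_polls and clock() < deadline:
        sleep(interval)
        current = get_snapshot()
        if current == last:
            unchanged += 1
        else:
            unchanged = 0  # the page is still changing; start counting again
            last = current
    return last

if __name__ == "__main__":
    # Fake "page" that grows for a few polls and then settles.
    states = iter(["<p>", "<p>a", "<p>ab</p>"])
    current = [""]

    def fake_snapshot():
        current[0] = next(states, current[0])
        return current[0]

    print(wait_until_stable(fake_snapshot, interval=0.01))
```

This is still a guess: a page that updates a clock every second never settles, and a slow Ajax call can look like silence, so the timeout and the poll count need tuning per site.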