<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Please insert some dreams to continue...</title>
	<atom:link href="http://thirld.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://thirld.com/blog</link>
	<description>a sometimes-technical, sometimes-personal blog</description>
	<lastBuildDate>Mon, 30 Apr 2012 23:57:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Thoughts about Artificial-Life Simulations</title>
		<link>http://thirld.com/blog/2012/04/30/thoughts-about-artificial-life-simulations/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=thoughts-about-artificial-life-simulations</link>
		<comments>http://thirld.com/blog/2012/04/30/thoughts-about-artificial-life-simulations/#comments</comments>
		<pubDate>Mon, 30 Apr 2012 23:57:37 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[random]]></category>
		<category><![CDATA[reflections]]></category>
		<category><![CDATA[agent]]></category>
		<category><![CDATA[artificial life]]></category>
		<category><![CDATA[costs]]></category>
		<category><![CDATA[disease spreading]]></category>
		<category><![CDATA[fitness]]></category>
		<category><![CDATA[mason]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[parameters]]></category>
		<category><![CDATA[rant]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[simulation]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=151</guid>
		<description><![CDATA[In one of my classes this semester, we programmed little agents in an artificial world to do (hopefully) interesting things. My group focused on exploring disease spreading patterns. We picked a few variables: How fast does the disease spread? Are the agents attracted to one another? Can the agents accurately observe whether other agents are [...]]]></description>
			<content:encoded><![CDATA[<p>In one of my classes this semester, we programmed little agents in an artificial world to do (hopefully) interesting things. My group focused on exploring disease spreading patterns. We picked a few variables: How fast does the disease spread? Are the agents attracted to one another? Can the agents accurately observe whether other agents are sick? And so on. We built a pretty simple simulation of this artificial world using the excellent <a href="http://cs.gmu.edu/~eclab/projects/mason/">MASON</a> simulator. Here&#8217;s an example of what it looks like:</p>
<p><iframe width="480" height="360" src="http://www.youtube.com/embed/ROAYUJs_nOg" frameborder="0" allowfullscreen></iframe></p>
<p>The green circles are food. The other circles are agents, who are sick (red) or healthy (blue). An agent is perceived as sick by other agents only if it has a smaller dot next to it. The bar above each agent shows its energy; the agent dies if its energy drops to zero. Agents can eat food to get energy, and they burn more energy per step if they are sick.</p>
<p>Overall, the simulation we built was rather unsatisfying. In this post I&#8217;d like to dwell on the reasons why.</p>
<p><span id="more-151"></span></p>
<p>We ended up having a lot of parameters (constants), especially for our flocking behavior. How much are agents attracted to food? How much are they repelled by sick agents? How much randomness do we add in their motion? We had about seven parameters just for the agents&#8217; motion, not including simulation-level parameters such as disease type and observability. With this many parameters, there was no systematic way to find good settings for all of them. Instead, we ended up tweaking them until the simulation &#8220;looked right&#8221;. This was like groping around in the dark, with no clear idea of what we were looking for, and with no way to tell which parameter values were better. Often, adjusting the parameters gave us one behavior we wanted but broke several others that had been working before.</p>
<p>This explosion in the number of parameters seems to be unavoidable in any non-trivial simulation. How might we overcome it? When we don&#8217;t know the value of a parameter, there should be a way to find a good value automatically. This is where machine learning and search (including evolutionary algorithms) come in. But in order to use any optimization technique, we would need to define what &#8220;good&#8221; means &#8212; what effect the optimizer should strive to produce. Could we take inspiration from nature, and try to maximize some evolutionary criterion, like the number of offspring an agent has? That would require giving our agents the ability to reproduce. And even in nature, it isn&#8217;t clear what the &#8220;fitness function&#8221; is. For example, it is not clear that <a href="http://www.cs.berkeley.edu/~christos/papers/MixabilityTheory-1.pdf">sexual reproduction</a> is better than asexual reproduction, if we are trying to maximize fitness. Yet nearly all complex organisms reproduce sexually rather than asexually. Another example (that Dan Dennett likes to give): Going to college actually reduces your &#8220;fitness&#8221; &#8212; you will have, on average, fewer children than someone who hasn&#8217;t gone to college. Yet most of us think that going to college is a good idea ;-) To sum things up, nature is complicated, and it is not clear what &#8220;fitness function&#8221; evolution is optimizing. Going back to our simulation, we could just optimize some criterion of our choice, instead of trying to emulate nature. But then we might miss out on whatever interesting behaviors could emerge if we had chosen a different criterion to optimize.</p>
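<p>To make the search idea concrete, here is a minimal sketch of about the simplest optimizer there is, random search. It assumes a hypothetical <code>score</code> function that runs the simulation with a given set of parameters and returns a number where higher means better; choosing that function is exactly the open problem described above. The parameter names and ranges are made up for illustration.</p>

```python
import random


def random_search(score, bounds, trials=100, seed=None):
    """Sample parameter settings uniformly at random and keep the best one.

    score:  hypothetical function mapping a dict of parameters to a number
            (higher is better), e.g. by running the simulation once.
    bounds: dict mapping each parameter name to a (low, high) range.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        # Draw every parameter independently from its allowed range.
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Toy stand-in for a simulation: pretend 0.3 is the ideal food attraction.
def toy_score(params):
    return -(params["food_attraction"] - 0.3) ** 2


best, value = random_search(toy_score, {"food_attraction": (0.0, 1.0)},
                            trials=200, seed=0)
print(best, value)
```

<p>Random search scales poorly as the number of parameters grows, which is why evolutionary algorithms and other smarter methods are attractive, but it shows how little machinery is needed once a scoring function exists.</p>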
<p>Another pet peeve of mine is that we had to hard code the agents&#8217; desires, such as the desire to look for food. It would have been more satisfying to give the agents a basic desire (survive!), and let them figure out what to do to achieve it. But how? It seems impossible without giving the agents a general ability to learn, which is beyond what machine learning can do today. In nature, we have basic needs like thirst and hunger, and we also have more complicated drives, like ambition. These needs and drives are not computed in a nice &#8220;Needs-and-Drives&#8221; module in our brain &#8212; they are the result of a <a href="https://secure.wikimedia.org/wikipedia/en/wiki/Descartes%27_Error">complicated interplay</a> of forces all throughout our bodies. It is very hard to tell what should be &#8220;hard coded&#8221; in the architecture of an agent, and what should be left for the agent to learn on its own.</p>
<p>Just like we hard coded desires, we also hard coded living costs, such as how much energy the agents burned in each time step. It would have been more satisfying if the agents consumed just as much energy as they needed for the actions they were performing. Computation in nature has its costs: A bigger brain needs more energy to run, so you don&#8217;t get compute power for free. Reflexes bypass the brain, so latency matters. These aspects are hard to capture in a simulation where time moves in discrete steps. I have a vague idea of charging agents based on the computations they perform (number of instructions executed; bytes of memory used), but it is far from something I could sit down and implement.</p>
<p>So what needs to happen for <a href="https://secure.wikimedia.org/wikipedia/en/wiki/Artificial_life">artificial life</a> simulations to become more than simple toys?</p>
<ul>
<li>Find a way to navigate the huge parameter space automatically.</li>
<li>Figure out a meaningful fitness function.</li>
<li>Make the costs incurred by agents more realistic.</li>
<li>Find a more realistic way to represent time.</li>
</ul>
<p>I&#8217;d love to read more about this and see what other people have come up with.</p>
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2012/04/30/thoughts-about-artificial-life-simulations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Adventures of a Library Book Scanner</title>
		<link>http://thirld.com/blog/2012/02/20/adventures-of-a-library-book-scanner/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=adventures-of-a-library-book-scanner</link>
		<comments>http://thirld.com/blog/2012/02/20/adventures-of-a-library-book-scanner/#comments</comments>
		<pubDate>Mon, 20 Feb 2012 23:49:49 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[random]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[full hard disk]]></category>
		<category><![CDATA[mars rover]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[scanner]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=126</guid>
		<description><![CDATA[Our library has a bunch of really nice book scanners, which let you save your scan to a USB flash drive, or email the scan to yourself. The scanners are operated by a computer with a touch screen. A normal user probably wouldn&#8217;t even notice that there is a &#8220;real&#8221; computer in there, because the [...]]]></description>
			<content:encoded><![CDATA[<p>Our library has a bunch of really nice book scanners, which let you save your scan to a USB flash drive, or email the scan to yourself. The scanners are operated by a computer with a touch screen. A normal user probably wouldn&#8217;t even notice that there is a &#8220;real&#8221; computer in there, because the touch interface is pretty straightforward to use. That is, until it fails with a helpful error message (&#8220;This program has encountered a problem and needs to close&#8221;) and dies, taking with it everything I&#8217;ve scanned so far.</p>
<p>But that&#8217;s not the end of the story. When the touch-screen UI dies, it exposes a standard Windows XP desktop underneath. And guess what? It&#8217;s running under the Administrator account. Okay, keep that in mind in case you want to have some fun with it later.</p>
<p>So I reboot the machine, and the touch-screen UI starts again. I try Alt+Tab, Win+R, Win+D, Ctrl+Shift+Esc; nothing works. I guess the software swallows any keyboard shortcuts to prevent such curious exploration. Oh well. I try to scan my papers again, and again the UI crashes. Fiddlesticks.</p>
<p>Another reboot. This time I hit Win+R before the touch-screen UI is fully loaded, and start an instance of cmd. Sure enough, now I can Alt+Tab between the UI and cmd, and I can launch anything I want from cmd. Let&#8217;s take a look at the C:\ drive. Whoa, 200 MB free out of 50 GB? That might be why it was crashing. So I look around and find the temp directory for the scanner UI, and sure enough it&#8217;s full of temp files, gigabytes of them. They look like raw bitmaps from previous scans. I delete them all; the scanner UI works again, and I am able to finish my work.</p>
<p>Lessons learned:</p>
<ol>
<li>The machine runs on an unprotected Windows admin account, wide open for whatever abuse you might want to throw at it.</li>
<li>The scanner UI prevents simple attempts to get around it (like Win+R), but the machine is still vulnerable in the small time window after Windows has finished booting, but before the UI auto-starts.</li>
<li>The UI seems to store raw copies of everything it ever scans. Massive privacy hole. (It&#8217;s possible that it deletes the temp files if the scan is successful, and only leaves them when it fails, but that&#8217;s still pretty bad.)</li>
<li>After so many years of desktop computing, we are still caught by surprise when the disk gets full. (A few years back, KDE would fail to log me in without any explanation. It turned out my /home/ was full. Even <a href="http://web.archive.org/web/20110719212649/http://www.planetary.org/blog/article/00000702/">Mars rovers have failed</a> when their disks got full.) There <em>has</em> to be a better way to handle things. Temp directories that are wiped on reboot are a good first step. Watchdogs might work for desktop users, but not on an embedded system. What else could we do?</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2012/02/20/adventures-of-a-library-book-scanner/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Things to Remember when Using ulimit</title>
		<link>http://thirld.com/blog/2012/02/09/things-to-remember-when-using-ulimit/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=things-to-remember-when-using-ulimit</link>
		<comments>http://thirld.com/blog/2012/02/09/things-to-remember-when-using-ulimit/#comments</comments>
		<pubDate>Thu, 09 Feb 2012 16:35:49 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[gotcha]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[stack]]></category>
		<category><![CDATA[ulimit]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=128</guid>
		<description><![CDATA[On Linux, ulimit allows you to limit the resources that a process can use. Two use cases: You have a program that sometimes runs out of memory, slowing your computer down to a crawl. You can use ulimit -v to limit the amount of memory that processes in a shell can use. If a process [...]]]></description>
			<content:encoded><![CDATA[<p>On Linux, <code>ulimit</code> allows you to limit the resources that a process can use. Two use cases:</p>
<ol>
<li>You have a program that sometimes runs out of memory, slowing your computer down to a crawl. You can use <code>ulimit -v</code> to limit the amount of memory that processes in a shell can use. If a process tries to allocate more memory than that, the allocation will fail and the program will usually abort.</li>
<li>You have a program with a deep recursion, which segfaults with the default stack limit of 8M. You can use <code>ulimit -s</code> to increase the allowed stack size.</li>
</ol>
<p>There are many more limits you can set; type <code>help ulimit</code> in bash to list them. You can find out the current limits by typing <code>ulimit -a</code>.</p>
<p>Two gotchas that I always forget about:</p>
<ol>
<li>You may try to limit the memory usage of a process by setting the maximum resident set size (<code>ulimit -m</code>). This has no effect on modern Linux: according to <code>man setrlimit</code>, the limit was only enforced by old 2.4-series kernels. Limit the maximum amount of virtual memory (<code>ulimit -v</code>) instead.</li>
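<li><code>ulimit</code> has hard limits and soft limits. The soft limit is the one that is actually enforced; the hard limit is the ceiling up to which the soft limit may be raised. An unprivileged user can lower a hard limit but cannot raise it again, so you can shoot yourself in the foot by setting it too low. In bash, <code>ulimit</code> without <code>-S</code> or <code>-H</code> sets both limits at once, so I recommend setting soft limits explicitly. Set them with, for example, <code>ulimit -Sv</code>, and query them with <code>ulimit -Sa</code>.</li>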
</ol>
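<p>The same soft/hard distinction is visible from Python through the standard <code>resource</code> module. Here is a minimal sketch, assuming a Unix-like system; the helper name <code>set_soft_limit</code> is made up for this example. It changes only the soft limit and hands the existing hard limit back unchanged, which is what <code>ulimit -S</code> does.</p>

```python
import resource


def set_soft_limit(kind, new_soft):
    """Change only the soft limit for `kind`; the hard limit stays as it is.

    Returns the previous (soft, hard) pair so the caller can restore it.
    """
    old_soft, hard = resource.getrlimit(kind)
    # Passing the unchanged hard limit back mirrors `ulimit -S`.
    resource.setrlimit(kind, (new_soft, hard))
    return old_soft, hard


# Example: disable core dumps via the soft limit (like `ulimit -Sc 0`),
# then raise the soft limit back to its previous value.
previous_soft, hard = set_soft_limit(resource.RLIMIT_CORE, 0)
print("core soft limit now:", resource.getrlimit(resource.RLIMIT_CORE)[0])
set_soft_limit(resource.RLIMIT_CORE, previous_soft)
```

<p>For the memory case, <code>set_soft_limit(resource.RLIMIT_AS, 2 * 1024**3)</code> corresponds roughly to <code>ulimit -Sv 2097152</code>. Note the units: <code>resource</code> counts bytes, while bash counts <code>ulimit -v</code> in units of 1024 bytes.</p>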
<p>Happy hacking!</p>
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2012/02/09/things-to-remember-when-using-ulimit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making yEd Import Node Labels from GraphML Files</title>
		<link>http://thirld.com/blog/2012/01/31/making-yed-import-labels-from-graphml-files/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=making-yed-import-labels-from-graphml-files</link>
		<comments>http://thirld.com/blog/2012/01/31/making-yed-import-labels-from-graphml-files/#comments</comments>
		<pubDate>Wed, 01 Feb 2012 02:54:54 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graphml]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[label]]></category>
		<category><![CDATA[networkx]]></category>
		<category><![CDATA[node]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[trick]]></category>
		<category><![CDATA[yed]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=117</guid>
		<description><![CDATA[yEd is a gem of a graph editor that makes it very easy to create diagrams and flowcharts. I used Inkscape for this purpose in the past, and had to do a lot of manual alignment. yEd does it all auto-magically with a very intuitive interface. The only bad thing about it is that it&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.yworks.com/en/products_yed_about.html">yEd</a> is a gem of a graph editor that makes it very easy to create diagrams and flowcharts. I used Inkscape for this purpose in the past, and had to do a lot of manual alignment. yEd does it all auto-magically with a very intuitive interface. The only bad thing about it is that it&#8217;s not open source, although it does come free of charge.</p>
<p>yEd can import graphs from a variety of formats, one of which is <a href="http://graphml.graphdrawing.org/">GraphML</a>. The other day I had a graph in Python (<a href="http://networkx.lanl.gov/">NetworkX</a>), and I wanted to lay it out nicely for printing. yEd&#8217;s layout functions surpass anything I&#8217;ve seen Graphviz do, so I decided to export the graph to GraphML and load it into yEd. This proved more difficult than I anticipated, but only because I didn&#8217;t know where to look. Hopefully this post will save you some time.</p>
<p>Let&#8217;s generate a graph, give the nodes some labels, and export the graph to a GraphML file:</p>
<pre>
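import networkx as nx

# Random directed graph with 10 nodes and edge probability 0.13.
graph = nx.gnp_random_graph(10, 0.13, directed=True)
# Store a 'label' attribute on each node; it is exported as node data.
# (graph.nodes[...] needs NetworkX 2.0+; older versions used graph.node[...].)
for node in graph.nodes():
    graph.nodes[node]['label'] = "node %d" % (node + 1)
nx.write_graphml(graph, "random.graphml")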
</pre>
<p>Now open the <code>random.graphml</code> file in yEd. All you will see is a square. This is because all the nodes are on top of each other and have no label. Don&#8217;t despair.</p>
<p><img src="http://thirld.com/blog/wp-content/uploads/2012/01/square.png" alt="" title="square" width="284" height="217" class="aligncenter size-full wp-image-118" /></p>
<p>First, let&#8217;s recover those labels. If you open the GraphML file in a text editor, you will see that NetworkX was smart enough to export the node labels, but yEd was not smart enough to realize they were labels. In fact, they became properties of the nodes in yEd, which you can see by right-clicking on a node, clicking Properties, and then selecting the Data tab.</p>
<p><img src="http://thirld.com/blog/wp-content/uploads/2012/01/properties.png" alt="" title="properties" width="323" height="218" class="aligncenter size-full wp-image-119" /></p>
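<p>If you would rather check the exported attributes without a text editor, the standard library can list them. The following is a minimal sketch: the namespace URI is the standard GraphML one, but <code>SAMPLE</code> is a hand-written stand-in in the general shape of a GraphML export, not actual NetworkX output, and the function name <code>node_attributes</code> is made up for this example.</p>

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hand-written sample in the shape of a GraphML export (not real NetworkX output).
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="label" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="0"><data key="d0">node 1</data></node>
    <node id="1"><data key="d0">node 2</data></node>
    <edge source="0" target="1"/>
  </graph>
</graphml>
"""


def node_attributes(graphml):
    """Map each node id to its {attribute name: value} data entries."""
    root = ET.fromstring(graphml)
    # <key> elements translate short ids such as "d0" into attribute names.
    names = {k.get("id"): k.get("attr.name")
             for k in root.findall("g:key", GRAPHML_NS)}
    result = {}
    for node in root.findall(".//g:node", GRAPHML_NS):
        result[node.get("id")] = {
            names.get(d.get("key"), d.get("key")): d.text
            for d in node.findall("g:data", GRAPHML_NS)
        }
    return result


print(node_attributes(SAMPLE))
```

<p>For the real file, pass its contents as bytes: <code>node_attributes(open("random.graphml", "rb").read())</code>. Each node should show a <code>label</code> entry, which confirms that the labels made it into the file as plain node data.</p>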
<p>And now we come to the part that took me a long time to figure out. We need to map the &#8220;Label&#8221; property imported from the GraphML file onto the internal property that yEd uses for labels. To do this, click Edit, Properties Mapper. Click the little plus sign under Configurations, and then &#8216;New Configuration for Nodes.&#8217; Now click the plus sign next to &#8216;Mappings&#8217;. If you&#8217;re lucky, yEd should figure out what you&#8217;re trying to do, and automatically select the right mapping. Your window should look like this (click for larger version):</p>
<p><a href="http://thirld.com/blog/wp-content/uploads/2012/01/mapping.png"><img src="http://thirld.com/blog/wp-content/uploads/2012/01/mapping-300x138.png" alt="" title="mapping" width="300" height="138" class="aligncenter size-medium wp-image-120" /></a></p>
<p>Now select &#8216;Fit Node to Label&#8217; if you want the nodes to be resized to fit the labels, then click OK. (If you forgot to select &#8216;Fit Node to Label,&#8217; you can do it later by going to Tools, Fit Node to Label.) You should see the labels now:</p>
<p><img src="http://thirld.com/blog/wp-content/uploads/2012/01/labels.png" alt="" title="labels" width="288" height="194" class="aligncenter size-full wp-image-121" /></p>
<p>But the nodes are still on top of each other. To fix that, use one of the algorithms under the Layout menu. Final result:</p>
<p><img src="http://thirld.com/blog/wp-content/uploads/2012/01/final.png" alt="" title="final" width="284" height="258" class="aligncenter size-full wp-image-122" /></p>
<p>Ta-da! If you save the modified graph in GraphML format and open the file in a text editor, you will see that reverse-engineering the format that yEd uses to store labels would not be easy.</p>
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2012/01/31/making-yed-import-labels-from-graphml-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I Will Never Be Care-Free Again</title>
		<link>http://thirld.com/blog/2011/12/30/i-will-never-be-care-free-again/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=i-will-never-be-care-free-again</link>
		<comments>http://thirld.com/blog/2011/12/30/i-will-never-be-care-free-again/#comments</comments>
		<pubDate>Sat, 31 Dec 2011 03:54:41 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[reflections]]></category>
		<category><![CDATA[break]]></category>
		<category><![CDATA[nostalgia]]></category>
		<category><![CDATA[relax]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[rest]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=113</guid>
		<description><![CDATA[I will never be care-free again. What a sad thing to realize. These past few days, as I was brainstorming for my senior thesis (a year-long endeavor), I found myself wishing that the work would just stop, and let me relax for a week or two knowing that there are no looming deadlines. This is [...]]]></description>
			<content:encoded><![CDATA[<p>I will never be care-free again. What a sad thing to realize.</p>
<p>These past few days, as I was brainstorming for my senior thesis (a year-long endeavor), I found myself wishing that the work would just stop, and let me relax for a week or two knowing that there are no looming deadlines. This is how it used to be in high school and the first few semesters of college: All my work had an absolute deadline &#8212; the end of the semester. When the break came, I was done, and work wouldn&#8217;t start again until the next semester. I could relax, knowing that I had no responsibilities until classes started again. I could survive several days without checking my email. I could work on crafts stuff without feeling guilty. My life was structured in alternating periods of work and no-work.</p>
<p>I really miss that.</p>
<p>The work I do now is not like that at all. Research just drags on and on. I&#8217;m still trying to finish up some stuff from last summer, and next June I&#8217;ll be presenting some work that was supposed to be finished one year before. Even though I want to be done with it, the work just keeps coming back; it never ends. If I don&#8217;t do so well on a problem set, I&#8217;ll just start fresh on a new one next week. But with research, I have to live with the results and the crappy code and the thorny questions for a long time, and there are no semester cutoffs.</p>
<p>The consequence is that I can&#8217;t really relax during the breaks, because I always have work that is not finished. I check my email daily, in case something important (work-related) comes up. I feel guilty getting up at 12 and doing origami, because I could be working on my thesis. I end up mixing work and play all day, and then feeling dissatisfied at the end of the day because I was neither productive nor relaxed. My break got shorter by a day, but my pile of work did not get any smaller.</p>
<p>The depressing thing is that grad school will be like this too. The work will leak from one year to the next, never completely finished. And a real job would be the same, I&#8217;m sure. The periods I miss, with no work and no responsibilities, are not coming back. I&#8217;ll never be care-free again, and I need to find a way to relax and rest even knowing that the work will drag on forever.</p>
<p style="text-align: center;"><em><a href="http://www.youtube.com/watch?v=u5CVsCnxyXg">I&#8217;ll take a quiet life,<br />
A handshake of carbon monoxide,<br />
No alarms and no surprises please.</a></em></p>
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2011/12/30/i-will-never-be-care-free-again/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Awesome: Edit Encrypted Files Transparently in Vim</title>
		<link>http://thirld.com/blog/2011/12/17/awesome-edit-encrypted-files-transparently-in-vim/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=awesome-edit-encrypted-files-transparently-in-vim</link>
		<comments>http://thirld.com/blog/2011/12/17/awesome-edit-encrypted-files-transparently-in-vim/#comments</comments>
		<pubDate>Sun, 18 Dec 2011 00:20:39 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[random]]></category>
		<category><![CDATA[awesome]]></category>
		<category><![CDATA[encryption]]></category>
		<category><![CDATA[gnupg]]></category>
		<category><![CDATA[gpg]]></category>
		<category><![CDATA[plugin]]></category>
		<category><![CDATA[vim]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=109</guid>
		<description><![CDATA[I just discovered this today: With the gnupg.vim plugin, Vim can edit GPG-encrypted files transparently. So if a file has a .gpg extension, Vim will automatically decrypt it upon opening, and re-encrypt it upon saving. Awesome! I no longer need my clunky script that did this by dumping the cleartext to a temporary file&#8230;]]></description>
			<content:encoded><![CDATA[<p>I just discovered this today: With the <a href="http://www.vim.org/scripts/script.php?script_id=3645">gnupg.vim</a> plugin, Vim can edit GPG-encrypted files transparently. So if a file has a <code>.gpg</code> extension, Vim will automatically decrypt it upon opening, and re-encrypt it upon saving. Awesome! I no longer need my clunky script that did this by dumping the cleartext to a temporary file&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2011/12/17/awesome-edit-encrypted-files-transparently-in-vim/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Cost of Applying to Grad School</title>
		<link>http://thirld.com/blog/2011/12/15/the-cost-of-applying-to-grad-school/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-cost-of-applying-to-grad-school</link>
		<comments>http://thirld.com/blog/2011/12/15/the-cost-of-applying-to-grad-school/#comments</comments>
		<pubDate>Fri, 16 Dec 2011 04:45:36 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[reflections]]></category>
		<category><![CDATA[cost]]></category>
		<category><![CDATA[grad school]]></category>
		<category><![CDATA[gre]]></category>
		<category><![CDATA[money]]></category>
		<category><![CDATA[rant]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=102</guid>
		<description><![CDATA[This is a rant about how expensive it is to apply for grad school. Let&#8217;s do the math: $160 to take the GRE General test (which tests high-school-level math with no calculus, ridiculous English vocabulary that only literary people would use, and your ability to write a bullshit essay as fast as possible.) $140 to [...]]]></description>
			<content:encoded><![CDATA[<p>This is a rant about how expensive it is to apply for grad school.</p>
<p>Let&#8217;s do the math:</p>
<p>$160 to take the GRE General test (which tests high-school-level math with no calculus, ridiculous English vocabulary that only literary people would use, and your ability to write a bullshit essay as fast as possible.)</p>
<p>$140 to take the GRE Subject test (The CS test was very broad; it had questions on everything from networking and operating systems to algorithms and programming languages and RSA encryption.)</p>
<p>$23 x N to send GRE score reports to N schools, assuming you don&#8217;t use the free four that you get when taking either test (They are sent <em>electronically</em>, so why do they cost so much?!)</p>
<p>$3 x N to send official transcripts to N schools (This probably covers the cost of printing and mailing. Thank you Tufts for not being greedy, although you definitely compensate in other ways. Random fact: some schools don&#8217;t want an official transcript until you&#8217;re admitted; some others want two copies for some reason.)</p>
<p>$90 x N average application fee for N schools (Nearly all schools have a higher fee for international applicants. And you can only <em>maybe</em> qualify for a fee waiver if you&#8217;re a US citizen / resident.)</p>
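<p>The rates above can be turned into a quick estimator. This is only a sketch: the function name and the <code>free_reports</code> parameter are made up for illustration, and since $90 is only an average application fee, the result is a rough estimate rather than an exact bill.</p>

```python
def application_cost(n_schools, free_reports=0, avg_app_fee=90):
    """Estimated total in dollars for applying to n_schools."""
    gre_general = 160
    gre_subject = 140
    # Score reports beyond the free ones cost $23 each.
    paid_reports = max(0, n_schools - free_reports)
    return (gre_general + gre_subject
            + 23 * paid_reports        # GRE score reports
            + 3 * n_schools            # official transcripts
            + avg_app_fee * n_schools)  # application fees


print(application_cost(9))                  # no free score reports used
print(application_cost(9, free_reports=4))  # four free score reports used
```

<p>For nine schools this gives $1,344 without the free score reports and $1,252 with four of them.</p>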
<p>My total for N=9 schools was $1,236. Holy shit. My bank account is weeping now :( This does not include the psychological costs of lost sleep, ignoring your friends for an entire semester, pounding away at your statement of purpose until your wrists hurt, and constant feelings of inadequacy / anxiety / mild panic. All of it to become&#8230; <a href="http://www.phdcomics.com/comics/archive.php?comicid=1436">this</a>? Hmm.</p>
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2011/12/15/the-cost-of-applying-to-grad-school/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Find Unused BibTeX Entries</title>
		<link>http://thirld.com/blog/2011/11/20/how-to-find-unused-bibtex-entries/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-find-unused-bibtex-entries</link>
		<comments>http://thirld.com/blog/2011/11/20/how-to-find-unused-bibtex-entries/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 03:27:30 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[bibtex]]></category>
		<category><![CDATA[diff]]></category>
		<category><![CDATA[hack]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[latex]]></category>
		<category><![CDATA[unix]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=97</guid>
		<description><![CDATA[BibTeX is a reference management system often used together with the LaTeX typesetting system. Today I wanted to find out if I had any unused references in my BibTeX file. There doesn&#8217;t seem to be an easy way to do this. Luckily, a combination of tools did it. This is a quick brain dump of [...]]]></description>
			<content:encoded><![CDATA[<p><a href="https://secure.wikimedia.org/wikipedia/en/wiki/BibTeX">BibTeX</a> is a reference management system often used together with the <a href="https://secure.wikimedia.org/wikipedia/en/wiki/LaTeX">LaTeX</a> typesetting system. Today I wanted to find out if I had any unused references in my BibTeX file. There doesn&#8217;t seem to be an easy way to do this. Luckily, a combination of tools did it. This is a quick brain dump of what I did. (If you find an easier way to do this, let me know.)</p>
<p>My BibTeX file is <code>paper.bib</code>. This contains all (used and unused) references. My paper is in <code>paper.tex</code>. When first compiling the paper, LaTeX creates <code>paper.aux</code>. This intermediate file contains entries only for the references that the paper actually cites.</p>
<p>1) Dump keys for all (used and unused) references, and sort them:</p>
<p><code>bib2bib paper.bib -ob /dev/null -oc /dev/stdout |sort >all</code></p>
<p>2) Dump keys for used references, and sort them:</p>
<p><code>aux2bib paper.aux |bib2bib -ob /dev/null -oc /dev/stdout |sort >used</code></p>
<p>3) List keys which are in <code>all</code> but not in <code>used</code>:</p>
<p><code>diff --old-line-format=%L --unchanged-line-format= all used</code></p>
<p>You can consult the manual pages for bib2bib, aux2bib, and diff to see what the parameters above do. The commands <code>bib2bib</code> and <code>aux2bib</code> can be found in the <a href="http://www.lri.fr/~filliatr/bibtex2html/">bibtex2html</a> package.</p>
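<p>If bibtex2html is not available, the same three steps can be roughly approximated in a few lines of Python. This is only a sketch, and it rests on assumptions that are not part of the recipe above: it assumes a classic BibTeX workflow in which LaTeX writes one <code>\citation{...}</code> line per citation into <code>paper.aux</code>, and it finds entry keys with a simple regular expression rather than a real BibTeX parser. The function names are made up for illustration.</p>

```python
import re

# Start of a BibTeX entry such as "@article{knuth84,": captures type and key.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s{}]+)\s*,")
# "\citation{key1,key2}" lines in the .aux file (assumed classic BibTeX workflow).
CITATION_RE = re.compile(r"\\citation\{([^}]*)\}")
# Blocks that look like entries but do not define a reference.
NON_ENTRY_TYPES = {"comment", "string", "preamble"}


def bib_keys(bib_text):
    """Return the set of entry keys defined in the text of a .bib file."""
    return {
        key
        for entry_type, key in ENTRY_RE.findall(bib_text)
        if entry_type.lower() not in NON_ENTRY_TYPES
    }


def cited_keys(aux_text):
    """Return the set of keys cited in the text of a .aux file."""
    keys = set()
    for group in CITATION_RE.findall(aux_text):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def unused_keys(bib_text, aux_text):
    """Return the keys that are defined but never cited, sorted."""
    return sorted(bib_keys(bib_text) - cited_keys(aux_text))
```

<p>To use it, read <code>paper.bib</code> and <code>paper.aux</code> into strings and pass them to <code>unused_keys</code>. A regular expression will not cope with every legal BibTeX file, so treat the output as a hint and double-check before deleting anything.</p>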
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2011/11/20/how-to-find-unused-bibtex-entries/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Where&#8217;s My Cheese?</title>
		<link>http://thirld.com/blog/2011/11/12/wheres-my-cheese/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=wheres-my-cheese</link>
		<comments>http://thirld.com/blog/2011/11/12/wheres-my-cheese/#comments</comments>
		<pubDate>Sun, 13 Nov 2011 03:53:04 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[random]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[cheese]]></category>
		<category><![CDATA[dead link]]></category>
		<category><![CDATA[error message]]></category>
		<category><![CDATA[google cache]]></category>
		<category><![CDATA[internet archive]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=86</guid>
		<description><![CDATA[Imagine putting error messages on a spectrum, according to how easy they are to understand. &#8220;404&#8221; lies on the cryptic end. &#8220;Page not found&#8221; lies somewhere in the middle. I think the middle is a good trade-off between designer effort and user satisfaction. But what lies on the other end? This: We recognize that our [...]]]></description>
			<content:encoded><![CDATA[<p>Imagine putting error messages on a spectrum, according to how easy they are to understand. &#8220;404&#8221; lies on the cryptic end. &#8220;Page not found&#8221; lies somewhere in the middle. I think the middle is a good trade-off between designer effort and user satisfaction. But what lies on the other end? This:</p>
<p><a href="http://thirld.com/blog/wp-content/uploads/2011/11/cheese.jpg"><img src="http://thirld.com/blog/wp-content/uploads/2011/11/cheese.jpg" alt="" title="cheese" width="222" height="278" class="aligncenter size-full wp-image-87" /></a></p>
<blockquote><p>
We recognize that our website used to present a challenge, and that many people have memorized the path through the maze or bookmarked the information they need. Unfortunately, due to the new organization of our website content, those trails of breadcrumbs and bookmarks will no longer work. We apologize for &#8220;moving the cheese&#8221; at the end of the maze, but we think you&#8217;ll have a much easier time finding the information you need.</p>
<p>Our website content has been organized into a number of related categories, listed below. Please Contact us if you need any further information
</p></blockquote>
<p>The above is a real error message I got on <a href="http://www.ntsb.gov/Pressrel/2007/070710b.htm">this page</a>, and from a government agency no less. Someone is either over-zealous, or getting paid by the keystroke&#8230; The worst problem is that you don&#8217;t immediately realize this is an error page, so you waste time trying to make sense of that paragraph&#8230; Yay usability.</p>
<p>By the way, the <a href="http://www.archive.org/index.php">Internet Archive</a> can be a great resource when you need to dig up a dead link. If the website is &#8220;important&#8221; enough (I wonder according to which criterion), the Internet Archive will probably have a stored copy. I&#8217;ve found my article <a href="http://web.archive.org/web/20090902202510/http://ntsb.gov/Pressrel/2007/070710b.htm">here</a>. Another trick for recently-removed content, or for content that&#8217;s temporarily down, is to search Google&#8217;s cache (<a href="http://www.google.com/support/forum/p/Web%20Search/thread?tid=73b6a5e00db594bf&#038;hl=en">one</a>, <a href="http://www.google.com/support/forum/p/Web+Search/thread?tid=3340c5b01f83f283&#038;hl=en">two</a>).</p>
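<p>The archived link above follows a simple pattern: the Wayback Machine prefix, a timestamp, and then the original URL. Here is a tiny sketch that builds such a link. The function name is made up, and the pattern is only inferred from the link above, so take it as an illustration rather than documented behaviour.</p>

```python
def wayback_url(original_url, timestamp):
    """Build a Wayback Machine link for original_url.

    timestamp is a YYYYMMDDhhmmss string, as in the archived link above.
    The Wayback Machine redirects to the closest snapshot it has.
    """
    return "http://web.archive.org/web/{}/{}".format(timestamp, original_url)


# Rebuilds the archived link from the timestamp and the original address.
link = wayback_url("http://ntsb.gov/Pressrel/2007/070710b.htm", "20090902202510")
```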
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2011/11/12/wheres-my-cheese/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Scraping Ajax websites with Crowbar</title>
		<link>http://thirld.com/blog/2011/10/26/scraping-ajax-websites-with-crowbar/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=scraping-ajax-websites-with-crowbar</link>
		<comments>http://thirld.com/blog/2011/10/26/scraping-ajax-websites-with-crowbar/#comments</comments>
		<pubDate>Thu, 27 Oct 2011 03:38:58 +0000</pubDate>
		<dc:creator>cberzan</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[ajax]]></category>
		<category><![CDATA[browser]]></category>
		<category><![CDATA[crawl]]></category>
		<category><![CDATA[crowbar]]></category>
		<category><![CDATA[headless]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[scrape]]></category>

		<guid isPermaLink="false">http://thirld.com/blog/?p=79</guid>
		<description><![CDATA[Sometimes you want to save a local copy of an entire website, either because you want to use it offline, or because you only have access to it temporarily. You could open each page in a browser and save it, but that&#8217;s tedious. There are nice crawlers out there like HTTrack, which will save an [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes you want to save a local copy of an entire website, either because you want to use it offline, or because you only have access to it temporarily. You could open each page in a browser and save it, but that&#8217;s tedious. There are nice crawlers out there like <a href="http://www.httrack.com/">HTTrack</a>, which will save an entire website for you, and even tweak the links so that they work on the local version. Unfortunately, such crawlers do not handle JavaScript. If your website uses Ajax to load its content, you&#8217;re out of luck. Your downloaded copy will contain unexecuted calls like &#8220;fetchContent()&#8221;, instead of the actual stuff that you would see in a browser.</p>
<p>Wouldn&#8217;t it be great if a crawler could execute the JavaScript on a page, and save it <em>after</em> all the content has been filled in? It turns out that some folks at MIT have already thought about this. <a href="http://simile.mit.edu/wiki/Crowbar">Crowbar</a> is a headless Firefox-like browser, running on top of Mozilla&#8217;s XULRunner. You point it to a URL, it loads it, executes the JavaScript, and gives you the resulting page. It even seems to handle cookies. It&#8217;s like the full-fledged Firefox running without any screen output, which is exactly what you&#8217;d want for web scraping!</p>
<p>After I stopped jumping up and down with excitement, I played with Crowbar a little bit. The idea is great, but the tool itself has some disappointing flaws. First, it hangs if you point it to a URL that is not an HTML page (an image, for example). Why would you want to fetch images with Crowbar? Because some evil websites won&#8217;t deliver them unless you have the appropriate &#8220;session id&#8221; cookie. Which brings me to the second flaw: Even though Crowbar handles cookies, there is no way to get at them from the outside. I&#8217;ve found <code>cookies.sqlite</code> in <code>~/.crowbar/profile-name/</code>, but the database is locked and inaccessible while Crowbar is running. (And even if I could open it, it probably wouldn&#8217;t store <a href="https://secure.wikimedia.org/wikipedia/en/wiki/Session_cookie#Session_cookie">session cookies</a>.) The third flaw is more subtle: When faced with pages that have non-ASCII characters, represented with two bytes in UTF-8, Crowbar seems to silently drop the first byte. This gives you corrupted data, leading to hours of keyboard-smashing frustration.</p>
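<p>To see why the third flaw is so nasty, here is a small simulation of what I think is happening. This is not Crowbar code, and the function is hypothetical: it simply removes the lead byte of every two-byte UTF-8 sequence, which matches the symptom described above.</p>

```python
def drop_two_byte_leads(data):
    """Simulate the corruption described above on UTF-8 encoded bytes.

    Lead bytes of two-byte UTF-8 sequences lie in 0xC2..0xDF, a range
    that no other byte of valid UTF-8 text falls into.
    """
    return bytes(b for b in data if b not in range(0xC2, 0xE0))


original = "caf\u00e9"                      # "café"; é takes two bytes in UTF-8
encoded = original.encode("utf-8")          # b'caf\xc3\xa9'
corrupted = drop_two_byte_leads(encoded)    # b'caf\xa9'
# The orphaned continuation byte is no longer valid UTF-8, so a lenient
# decoder substitutes U+FFFD (the replacement character) for it.
recovered = corrupted.decode("utf-8", errors="replace")
```

<p>The original character cannot be recovered from the remaining byte alone, because several different lead bytes can precede the same continuation byte.</p>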
<p>The last SVN commit in Crowbar was in June 2008, so I am not too optimistic about seeing these bugs fixed. I also don&#8217;t know enough JavaScript and XUL to do it myself. Still, I think that using a headless browser for scraping is a great idea. I first stumbled upon Crowbar via <a href="http://ubuntuincident.wordpress.com/2011/04/15/scraping-ajax-web-pages/">a post by Jabba Laci</a>. I have since found a few related tools that might be more useful:</p>
<ul>
<li><a href="http://www.phantomjs.org/">PhantomJS</a>, a headless WebKit-based browser with a JavaScript API.</li>
<li><a href="http://blog.sitescraper.net/2010/06/scraping-javascript-webpages-in-python.html">Using QtWebKit directly from Python</a> to scrape pages. </li>
<li><a href="http://seleniumhq.org/">Selenium</a>, which automates browsers for testing. I bet it could be used for scraping, too.</li>
</ul>
<p>I haven&#8217;t tried any of these three, but I have high hopes for the one that uses QtWebKit from within Python. If I understand it correctly, you should get full access to the browser, and be able to peek at cookies, HTTP headers, and anything else you might want. Finally, here are some other random resources that might be useful:</p>
<ul>
<li><a href="http://wwwsearch.sourceforge.net/mechanize/">Mechanize</a>, a programmable headless web browser for Python. It doesn&#8217;t handle JavaScript, but it does handle cookies, and it has a nice interface for filling out forms. No JavaScript means this is much lighter than running a full (albeit headless) browser.</li>
<li>Tools to handle broken HTML: <a href="http://lxml.de/lxmlhtml.html">lxml</a>, <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, <a href="http://code.google.com/p/html5lib/">html5lib</a>.</li>
<li><a href="http://blog.sitescraper.net/">An interesting blog</a> about web scraping. Also <a href="http://code.google.com/p/webscraping/">a Python library</a> by the same guy.</li>
<li>Apparently there is a fair bit of money and <a href="http://en.wikipedia.org/wiki/Web_scraping#Legal_issues">controversy</a> around web scraping.</li>
</ul>
<p>I began this post talking about crawlers, but then focused on scraping a single page with a JavaScript-enabled headless browser. I don&#8217;t know of any existing <em>crawlers</em> that support JavaScript / Ajax this way. One problem is that you can&#8217;t tell when the scripts on a page have finished running. (Crowbar just waits a predetermined amount of time before delivering a snapshot of the page&#8217;s contents.) Anyway, a JavaScript-enabled crawler sounds like an interesting project :) sudo give me free time&#8230;</p>
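<p>One possible workaround for the &#8220;when is the page done?&#8221; problem is to poll the page until it stops changing, instead of waiting a fixed amount of time. Below is a minimal sketch of that idea. It is a heuristic of my own, not something Crowbar does, and <code>get_snapshot</code> is a hypothetical callable standing in for whatever your headless browser offers to return the current HTML.</p>

```python
import time


def wait_until_stable(get_snapshot, interval=0.5, timeout=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_snapshot() until two consecutive results match.

    get_snapshot is a hypothetical callable returning the page's current
    HTML as a string. Returns (snapshot, stable), where stable is False
    if the timeout expired before the page stopped changing.
    """
    deadline = clock() + timeout
    previous = get_snapshot()
    while True:
        sleep(interval)
        current = get_snapshot()
        if current == previous:
            # Nothing changed during one full interval: assume we are done.
            return current, True
        if clock() >= deadline:
            # Still changing, but we have run out of patience.
            return current, False
        previous = current
```

<p>This can still be fooled. A slow Ajax call can make the page look finished too early, and a page with a ticking clock never settles, which is why the timeout is there.</p>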
]]></content:encoded>
			<wfw:commentRss>http://thirld.com/blog/2011/10/26/scraping-ajax-websites-with-crowbar/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

