There are a variety of EM simulation schemes in wide use, with different strengths and weaknesses. For free-space antennas at radio frequencies, where dielectrics are simple and metals are excellent conductors, integral-equation schemes such as the method of moments (MoM) win. At optical frequencies, particularly when metal is involved, partial differential equation (PDE) methods are generally better. The two most common PDE schemes are the finite element method (FEM) and finite-difference time-domain (FDTD). The antenna-coupled tunnel junction work required simulations with very fine resolution (1 nm) in some places, to represent plasmons and metal surface discontinuities, plus a very large simulation domain, at least 5 μm square by 20 μm long. This requires multiprocessor capability and subgridding, i.e., different places in the simulation domain having different cell sizes. Subgridding is a natural strength of FEM, but presents a challenge for FDTD, which naturally likes uniform cubical grids. On the other hand, FEM mesh generation can be very time consuming, and FEM doesn't clusterize as well as FDTD and is much harder to get correct.
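For the curious, here's a minimal sketch of what a single FDTD update step looks like on a uniform 2-D Yee grid (TMz polarization). Everything here is illustrative, not POEMS code; it just shows why FDTD likes uniform cubical cells: every voxel gets the same local stencil, and that uniformity is exactly what subgridding breaks.

    #include <vector>

    // Minimal 2-D TMz FDTD update on a uniform grid (illustrative only).
    // Ez lives at cell centers; Hx, Hy on cell edges (standard Yee layout).
    struct Grid {
        int nx, ny;
        double dx, dt;               // one cell size -> one stencil everywhere
        std::vector<double> Ez, Hx, Hy;
        Grid(int nx_, int ny_, double dx_, double dt_)
            : nx(nx_), ny(ny_), dx(dx_), dt(dt_),
              Ez(nx_ * ny_), Hx(nx_ * ny_), Hy(nx_ * ny_) {}
        double& at(std::vector<double>& f, int i, int j) { return f[i + nx * j]; }
    };

    void step(Grid& g, double eps0, double mu0) {
        // H update: local curl of E, one neighbor in each direction
        for (int j = 0; j < g.ny - 1; ++j)
            for (int i = 0; i < g.nx - 1; ++i) {
                g.at(g.Hx, i, j) -= (g.dt / (mu0 * g.dx)) *
                    (g.at(g.Ez, i, j + 1) - g.at(g.Ez, i, j));
                g.at(g.Hy, i, j) += (g.dt / (mu0 * g.dx)) *
                    (g.at(g.Ez, i + 1, j) - g.at(g.Ez, i, j));
            }
        // E update: local curl of H
        for (int j = 1; j < g.ny; ++j)
            for (int i = 1; i < g.nx; ++i)
                g.at(g.Ez, i, j) += (g.dt / (eps0 * g.dx)) *
                    ((g.at(g.Hy, i, j) - g.at(g.Hy, i - 1, j)) -
                     (g.at(g.Hx, i, j) - g.at(g.Hx, i, j - 1)));
    }

At a fine/coarse grid boundary, the stencil needs neighbor values interpolated in both space and time, which is why FDTD subgridding is tricky to make both stable and accurate.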
Lens design programs and circuit simulators have optimization capability: given a decent starting point (which can sometimes be hard to find), they will automatically adjust the lens prescription or circuit constants to achieve the best performance by some criterion the user sets.
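As a caricature of what such an optimizer does (nothing here reflects any particular program's internals; the names and the merit function are made up), here's a minimal coordinate-descent loop: perturb one parameter at a time and keep any change that improves a user-supplied figure of merit.

    #include <vector>
    #include <functional>

    // Minimal coordinate-descent optimizer (illustrative sketch only).
    // merit() is the user's figure of merit, e.g. coupling efficiency;
    // higher is better. The starting point p must already be "decent".
    std::vector<double> optimize(std::vector<double> p,
                                 std::function<double(const std::vector<double>&)> merit,
                                 double step = 0.1, int iters = 100) {
        double best = merit(p);
        for (int it = 0; it < iters; ++it) {
            for (size_t k = 0; k < p.size(); ++k) {
                for (double sign : {+1.0, -1.0}) {
                    std::vector<double> trial = p;
                    trial[k] += sign * step;       // nudge one parameter
                    double m = merit(trial);
                    if (m > best) { best = m; p = trial; }  // keep improvements
                }
            }
            step *= 0.7;  // shrink the steps as we converge
        }
        return p;
    }

Real optimizers are smarter about step sizes and search directions, but the shape is the same: an objective function wrapped around a full simulation, evaluated many times, which is why raw engine speed matters so much.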
POEMS is a very capable FDTD simulator that brings this optimizing capability to the full EM world. I've mostly been using it to design waveguide-coupled antennas, but it's good for many sorts of EM problems. The current version uses either my own clusterized FDTD code or (for verification) the well-tested but very much slower Berkeley TEMPEST 6.0 FDTD code as its number-crunching engine. Here's the manual.
The POEMS clusterized FDTD engine currently runs on EOI's AMD servers in an ad-hoc cluster with about 1 TFLOPS. Besides the original ACTJ devices, we've used it to design fluorescence-detection biochips for DNA sequencing, IR imaging systems, and coherent lidar. Its previous incarnations ran on a 24-way SMP and a 7-node, 14-processor Opteron cluster.
Scaling performance on these small SMPs and clusters is excellent, with less than 30% deviation from linear scaling of the single-machine version. In multicore boxes, the deviation is due to some issues with the Linux thread scheduler; in clusters, there is also communications latency over even the fastest Ethernet connections. The reason my code is so fast is that it precomputes a strategy, which allows a very clean inner loop iterating over a list of 1-D arrays of identical voxels. Like most FDTD codes, TEMPEST instead has a big switch statement inside triply nested loops to decide what to do at each voxel on each iteration, which makes optimization and caching much more difficult.
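To make that concrete, here's a caricature of the two dispatch styles; it's illustrative C++, not either program's actual source, and the stand-in coeff() takes the place of the real per-material update stencil.

    #include <vector>
    #include <cstddef>

    enum class VoxType { Vacuum, Dielectric, Metal /*, PML, ... */ };

    // Stand-in for the real per-material update arithmetic.
    inline double coeff(VoxType t) {
        switch (t) {
            case VoxType::Vacuum:     return 1.00;
            case VoxType::Dielectric: return 0.50;
            case VoxType::Metal:      return 0.05;
        }
        return 0.0;
    }

    // Style 1 (TEMPEST-like): decide what to do at every voxel, every
    // timestep. The branch sits inside the hot loop.
    void update_switchy(const std::vector<VoxType>& type,
                        std::vector<double>& field) {
        for (std::size_t v = 0; v < field.size(); ++v)
            field[v] *= coeff(type[v]);        // branch per voxel
    }

    // Style 2 (strategy precomputation): find runs of identical voxels
    // once, up front, then iterate over a list of homogeneous 1-D spans.
    struct Run { VoxType kind; std::size_t start, len; };

    std::vector<Run> precompute_strategy(const std::vector<VoxType>& type) {
        std::vector<Run> runs;
        for (std::size_t v = 0; v < type.size(); ) {
            std::size_t s = v;
            while (v < type.size() && type[v] == type[s]) ++v;
            runs.push_back({type[s], s, v - s});
        }
        return runs;
    }

    void update_runs(const std::vector<Run>& runs,
                     std::vector<double>& field) {
        for (const Run& r : runs) {
            const double c = coeff(r.kind);    // one dispatch per run
            for (std::size_t v = r.start; v < r.start + r.len; ++v)
                field[v] *= c;                 // clean, branch-free inner loop
        }
    }

With the runs precomputed, the dispatch happens once per run instead of once per voxel, so the compiler can vectorize the inner loop and the working set stays cache-friendly.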
For more details, here's the manual. Send me an e-mail if you have a problem that might benefit from POEMS.