Intro Mind Notes, Week 8: Vision
(HMW, Ch. 4, especially pp. 211-214, 242-261, and 268-284)

A. Why the Problem of Vision is Hard

  1. People think vision is easy. All that has to be done, they think, is for the nervous system to get a picture to the brain; then the brain just sees what is going on. But that idea buys into the fallacy of the homunculus. Images from the external world are projected (upside down) onto the retina at the back of the eyeball. This retinal image can be considered a vector or array of values, one value for each rod and cone on the retina. This representation is a far cry from what cognition needs to navigate the world. To be of use, the brain needs a representation of a three-dimensional world filled with objects.
  2. The representation cognition needs would allow us to distinguish one object from another and to appreciate their positions, motions, sizes, shapes, and textures, even though lighting in the environment is variable, we ourselves may be moving, and objects present themselves from many different points of view.
  3. The problem of vision is to explain the mechanism that transforms the retinal array into an object-level representation that can be stored in memory and processed by other cognitive systems. Pinker assumes this representation is symbolic, that is, written in mentalese.
  4. The problem of recovering the three-dimensional scene from retinal arrays is not, strictly speaking, solvable: many different 3-D scenes can project exactly the same retinal image. This is why the eye can be fooled by illusions. But the visual system manages to do a reasonably good job nonetheless by depending on basic assumptions about the world of 3-D objects. For example, very few objects can increase or decrease their sizes like balloons. Since so few objects that humans encounter do this, the visual system has come to depend on the assumption that objects stay pretty much the same size. This allows it to infer that if an object's image becomes larger in the visual field, the object is moving closer, not actually growing. (A sketch of the underlying geometry follows this list.)
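
     The following is a minimal sketch (in Python, with invented numbers; the focal length is only roughly that of the human eye) of the pinhole-projection geometry that makes the size-constancy assumption useful: image size grows in proportion to object size and shrinks in proportion to distance.

        # Pinhole projection: image size = focal_length * object_size / distance.
        # Focal length roughly that of the eye (~17 mm); all lengths in metres.
        def image_size(object_size, distance, focal_length=0.017):
            return focal_length * object_size / distance

        person_height = 1.8  # a 1.8 m person
        for d in (8.0, 4.0, 2.0):  # the person halves their distance twice
            print(f"distance {d:4.1f} m -> image {image_size(person_height, d) * 1000:.2f} mm")

        # The image doubles each time the distance halves. Assuming the object
        # keeps its real size, the visual system can read growth as approach.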

B. The Nature of Low-Level Visual Processing

  1. A fundamental question about vision is the extent to which higher cognitive processes, such as goals, expectations, attention, reasoning, and conceptual structures, influence the transformation from retinal to cognitive-level representations. Although these so-called top-down factors are clearly important, a lot of visual processing can safely be studied from the bottom up, leaving top-down considerations aside.
  2. Anatomical study of the visual system gives important clues as to how it works. Visual processing is massively parallel and local. Local means that processing at one point of the image depends, for the most part, only on the activity of nearby points.
  3. We also know that visual processing is modular. There are different maps in the visual system specially designed to resolve different parts of the problem. Examples: edge detection, motion, depth. Each of these may be further subdivided. For depth information, for example, we have several different systems based on different sources of depth information: stereopsis, the differences between the images in the two eyes; motion, since the size of an object's image changes as the object recedes or advances; and overlap, where depth is indicated by one object blocking the view of another. (A sketch of the stereo geometry follows this list.)
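
     The following is a minimal sketch (illustrative numbers only, not Pinker's presentation) of why stereopsis carries depth information: with the two eyes a fixed distance apart, the disparity between the two retinal images shrinks as objects get farther away, so disparity can be inverted into depth.

        # Standard pinhole stereo relation: depth = focal_length * baseline / disparity.
        # Baseline ~ distance between the eyes; all numbers rough and in metres.
        def depth_from_disparity(disparity, baseline=0.065, focal_length=0.017):
            return focal_length * baseline / disparity

        for disparity_mm in (1.0, 0.5, 0.1):
            depth = depth_from_disparity(disparity_mm / 1000)
            print(f"disparity {disparity_mm} mm -> depth {depth:.1f} m")

        # Bigger disparity means a nearer object: the difference between the two
        # eyes' images is therefore a usable source of depth information.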

C. David Marr's Theory of Vision

  1. David Marr, in his book Vision, presented one of the most influential theories of vision of the early 1980s. His theory has been an inspiration for the computational theory of mind. His main idea is that the function of the visual system is to convert images projected onto our retinas into representations of the world written in mentalese. The process starts with the retinal images from the two eyes and proceeds through a number of different levels of representation.
  2. Grey-Level Representation. This is just the raw output from the rods and cones on the retina. It is a vector of values indicating the activity of each retinal neuron.
  3. Zero Crossing Map. On Marr's theory, the first job the visual system has to do is detect edges, or object boundaries, in the grey-level representation. Usually an object boundary corresponds to a quick change in the intensity of light. Intensity changes can be computed by taking differences between adjoining pixels in the image. (For math mavens, this is the first derivative.) If we then look at changes in those changes (the second derivative), the places where intensity was changing fastest in the original will correspond to points where the new image changes sign (zero crossings). The new image will look something like an outline version of the old one. Since neurons calculate weighted sums, and weights can be negative, arrays of neurons can easily compute differences in activity between neighboring neurons. Marr worked out neural nets that can compute the zero-crossing image, and so showed how the visual system can find some of the information it needs to locate object boundaries. (A sketch of this computation follows this list.)
  4. Primal Sketch. Knowing where edges are is not enough. The orientation of these edges must be computed, and junction points between edges, such as Ts and Ls, must be found, for these provide important clues about overlap. The primal sketch represents these important features along with the edges.
  5. 2.5-D Sketch. But the primal sketch is not enough, because edges can arise from changes in lighting due to the angle at which we see a surface. The problem is compounded by the fact that lighting need not be uniform across the surface, and by the fact that the surface of an object can be "painted" in different colors. (See the diagram on p. 242.) On pages 242-255 Pinker does a beautiful job of explaining how separate systems (demons), each computing how light coming to the eye is affected by one of these factors, can cooperate to compute the nature of the object in the real world. The 2.5-D sketch keeps track of all information about edges in the visual field, distinguishing changes in color on the surface of the object, changes due to changes in the angle of the surface, and changes due to lighting. (See the diagram on p. 260.)
  6. Frame Neutral Sketch. But the 2.5-D sketch is not enough. Our bodies and eyes are constantly on the move, so the image on our eyeballs is constantly changing. The eyes constantly flit (in saccades) from one spot in the scene to another, and these motions are essential to effective vision. The detection of motion of the visual scene across the retinal array must be suppressed during a saccade. Think how much jitter there would be if you made a video while moving the camera the way your eyes move. The brain needs to distinguish what is moving in the real world from changes in the images it receives that are due to body and eye motion. The frame neutral sketch represents only the changes in the world, separating them from changes due to eye and body motion.
  7. 3-D Sketch. The 3-D sketch is the final representation computed by the visual system. It provides a three-dimensional representation of the objects in the world, allowing us to recognize what those objects are even though they look very different from different points of view. On Marr's theory this is done by representing the object as a nested hierarchy of cylinder-like parts: Human: Head, Body; Body: Trunk, Arms, Legs; Arm: Upper arm, Forearm, Hand; Hand: Palm, Fingers.
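
     The following is a minimal sketch (one-dimensional for clarity; Marr's actual operator works over smoothed two-dimensional images) of the zero-crossing computation in item 3: difference neighboring intensities once, difference again, and mark where the result changes sign.

        import numpy as np

        # An intensity profile with one soft edge in the middle (made-up values).
        intensity = np.array([10, 10, 10, 11, 40, 70, 71, 71, 71], dtype=float)

        first_diff = np.diff(intensity)    # rate of change (first derivative)
        second_diff = np.diff(first_diff)  # change of the change (second derivative)

        # A zero crossing is where the second difference flips sign:
        # the middle of the edge in the original profile.
        signs = np.sign(second_diff)
        crossings = np.where(signs[:-1] * signs[1:] < 0)[0] + 1

        print("second difference:", second_diff)
        print("zero crossing near index:", crossings)

        # Each difference here is a weighted sum with one positive and one negative
        # weight -- exactly the kind of computation an array of neurons can perform.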

D. Extensions of Marr's Theory

  1. On Biederman's theory, a more complex set of basic shapes (called geons) is used for object recognition: cones, cylinders, cubical shapes, and the distortions of these that result from changing the length-to-width ratio and the shape of the center line. (See the picture on p. 270.) Biederman believes that representations of objects we can identify are stored in the form of something like sentences, listing the components that form the object along with their attachment points. In short, the recognition of objects is like the recognition of a sentence, for the object is composed of geons the way a sentence is composed of words. But how are the components themselves recognized? By the boundaries between them, which are typically concave and rather sharply sloped inwards. (Consider the "joints" of the Michelin man for an exaggerated version of the idea.) (A sketch of such a part-based description follows this list.)
  2. But geons can't entirely explain our ability to recognize objects from many different viewpoints. To do that, it would seem that the brain would have to store a representation of each object as seen from every possible viewpoint. True, the up-down axis is used as a major default assumption about how objects are aligned, and this may simplify the process of object identification. Violations of this alignment cause errors in identification (as NASA designers know well). However, we can still identify objects when they are upside down or sideways.
  3. There is good evidence that the human visual system also has an ability to mentally rotate 3-D representations to help in object identification. This would vastly reduce the number of representations needed for an object to be recognized. Pinker discusses some of his own work on this topic on pp. 279-284.
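
     The following is a minimal sketch (hypothetical data structures, not Biederman's own formalism) of a sentence-like structural description: an object is stored as a list of geons plus the relations at which they attach, and that description stays the same however the object is oriented in view.

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class Geon:
            shape: str        # e.g. "cylinder", "cone", "brick"
            elongated: bool   # one permitted distortion: the length-to-width ratio

        @dataclass(frozen=True)
        class Attachment:
            part: Geon
            to: Geon
            relation: str     # e.g. "side-to-side", "end-to-end"

        # A crude mug: an elongated, curved cylinder (the handle) attached
        # to the side of a squat cylinder (the body).
        body = Geon("cylinder", elongated=False)
        handle = Geon("curved cylinder", elongated=True)
        mug = [Attachment(handle, body, "side-to-side")]

        # Recognition, on this story, is matching a parts-plus-relations
        # description, which does not change as the viewpoint changes.
        print(mug)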

E. Face Recognition

  1. Geon theory cannot be the whole story of object recognition. It is likely that we have other visual systems for recognizing natural objects like trees and mountains, which cannot easily be represented as combinations of geons.
  2. One case where we have clear evidence of this is in the recognition of faces. The evidence for a special module for face recognition comes from brain injury patients who are (pretty much) normal in all other visual recognition tasks but who simply cannot recognize faces. Other patients can recognize faces but lack the ability to recognize other objects.

F. Treisman's Theory of Attention (See HMW pp. 140-142)

  1. Treisman's thesis is that basic visual features are computed in parallel and feed information to a higher-level process responsible for binding features together. This second stage is carried out serially by an attention mechanism.
  2. We can develop evidence for this theory by presenting images with target shapes surrounded by distractors. If the reaction time for identifying the targets is fast and does not depend on the number of distractors, we infer a basic parallel process. If the reaction time grows with the number of distractors, we infer a serial process that involves attending to one thing after another in the scene. (A toy simulation of both signatures follows this list.)
  3. For example, the letters L and T are built from the same elements in the same orientations and differ only in how the elements are conjoined. Recognition of these targets depends on attention: the differences between them do not just "pop out." However, if you examine a field of |s and /s, where the only difference is orientation, the difference is immediately and easily apparent.
  4. Basic features include orientation, brightness, and curvature. A discrimination that requires conjoining features (white triangles and black squares vs. black triangles and white squares) is extremely difficult and takes tedious one-by-one inspection.
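
     The following is a toy model (all timing constants invented) of the two search signatures: a parallel, pre-attentive feature search costs roughly constant time, while a serial conjunction search inspects items one at a time, on average half of them before finding the target.

        import random

        BASE_MS = 400      # time for everything other than the search itself
        PER_ITEM_MS = 50   # cost of attending to one item in a serial search

        def feature_search_rt(n_distractors):
            return BASE_MS + random.gauss(0, 10)  # flat: items checked in parallel

        def conjunction_search_rt(n_distractors):
            inspected = (n_distractors + 1) / 2   # expected inspections before the target
            return BASE_MS + PER_ITEM_MS * inspected + random.gauss(0, 10)

        for n in (4, 8, 16, 32):
            print(f"{n:2d} distractors: feature {feature_search_rt(n):5.0f} ms, "
                  f"conjunction {conjunction_search_rt(n):5.0f} ms")

        # A flat line diagnoses a parallel, pre-attentive feature; a line that
        # rises with display size diagnoses serial, attention-demanding binding.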

G. Top-Down vs. Bottom-Up

  1. Marr and many other researchers have tried to create theories of vision in which the processing from retina to brain does not require higher-level information to identify the object (for example, the knowledge that animals have four legs, or that the sky is above us and is blue or grey).
  2. Clearly there are instances where higher-level information is required to resolve the ambiguities in a scene. For example, the same shape can be read as an N in horizontally written text and as a Z in vertically written text.
  3. But to what extent does vision rely on top-down processing? Consider the Kanizsa Triangle. (It is in HMW, p. 259.) Here we see an image of a triangle hovering above the scene, but there is no luminance difference on the two sides of its edges to allow us to pick out the boundary. Why do we perceive the edge? Perhaps conceptual knowledge about how objects occlude one another helps us. However, there is some evidence that this phenomenon is very low level. For example, monkey studies show that cells early in visual processing already respond to these illusory "edges." So a bottom-up explanation of the Kanizsa triangle effect may be more likely.

H. Imagination and Imagery

  1. Some cognitive scientists have championed the view that imagination is a separate cognitive ability that provides an alternative to the symbolic processing story. Brains might contain a special graphic processor along with (or instead of) a symbolic processor. What advantages would this graphic processor bring?
  2. One important idea is that visual imagery carries much more information than symbolic representations are capable of. A picture is worth a thousand words. Consulting an image makes things obvious that we would otherwise have to think out. There are a number of skills, such as finding things, planning errands, trying out ways of building things such as bridges, and explaining continental drift, where an ability to imagine the various objects, actions, and likely outcomes is extremely helpful. We can literally see in our mind's eye the things we should avoid doing when we imagine a course of events. Imagination gives us foresight. It also allows us to adapt ahead of time. For example, just by imagining a task, an athlete can train herself to improve.
  3. There is excellent evidence that language understanding and reasoning are based on metaphors which are in turn founded on visual imagery. For example, top and up mean better or stronger ("top of his game"), while down and bottom mean worse or weaker ("in the pits"). If I imagine that A is to the left of B and B to the left of C, I instantly "see" that A is to the left of C.
  4. There is also evidence that the imagination of mental pictures, rather than symbolic representations, is crucial to certain cognitive abilities. Kosslyn had people imagine moving attention from one point to another on a map. The time to move attention was proportional to the distance on the map, suggesting that attention actually "moves" from point to point in an image-like space. In another experiment, subjects asked whether two images matched apparently used a mental rotation technique to solve the task, for the time it took to reach a solution depended on the angle through which one image would have to be rotated to align it with the other. (A toy illustration follows below.)
  5. Experiments with PET and rCBF scanning show that visual imagination differs from other cognitive tasks in that it activates visual areas of the brain. Research with brain-damaged patients has shown that brain damage can selectively impair the ability to perform mental rotation.
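
     The following is a toy illustration (invented numbers, not Kosslyn's actual data) of the logic of the scanning result: if attention literally traverses an image, response time should be a linear function of map distance, and a least-squares fit should recover a positive slope.

        import numpy as np

        # Simulated scanning times: ~35 ms per cm plus a fixed cost, with noise.
        distance_cm = np.array([2, 4, 6, 8, 10, 12], dtype=float)
        rt_ms = 500 + 35 * distance_cm + np.random.default_rng(0).normal(0, 15, 6)

        slope, intercept = np.polyfit(distance_cm, rt_ms, 1)
        print(f"RT ~ {intercept:.0f} ms + {slope:.1f} ms per cm scanned")

        # The same linear logic applies to mental rotation: solution time grows
        # with the angle through which the image must be turned into alignment.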