
Artificial Intelligence Machine Learning (AIML) Contest
Rules and Evaluation

Imagine how the brain of a baby learns (i.e., develops) in real time while the baby's body acts.  The brain of the baby receives a real-time sensory stream and a real-time motoric stream.  The baby can perform at each time instant, well before it has learned everything it needs for life.

Each Contest entry must run a single general-purpose learning engine for all five Types of streams described below.  During the first year of the Contest, however, only a single sensory modality will be considered for each artificial brain, with each Type and each training stream corresponding to an artificial life.  A Contest interface will pass the dimension information of each Type to each Contest artificial brain.  This single learning engine will then be trained and tested on a "life-long" training-and-testing stream.   The stream can correspond to the visual modality (a "life-long" stream of image frames), the auditory modality (a "life-long" stream of sound waves), or the key-stroke modality for natural languages (a "life-long" stream of characters and punctuation).   Each input frame in the stream is frame-synchronized with a motoric input/test frame that is properly delayed, since the motor output from a brain lags the corresponding sensory frame because the brain needs time to react.  Each training-and-testing stream contains largely training frames interlaced with some testing frames.   The BMI courses have taught, and will continue to teach, how to understand, model, and implement brain-like machine learning.   The Contest Workshop will teach all Contest participants how to understand, use, and improve the supplied general-purpose learning engines.
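The frame-synchronized training-and-testing loop described above can be sketched as follows.  All names here are hypothetical (the toy EchoBrain merely stands in for a real learning engine, and the loop is not the official Contest Interface):

```python
FREE = "*"  # marker for an action-free (testing) motoric frame

class EchoBrain:
    """Toy stand-in for a learning engine: repeats the last supervised action."""
    def __init__(self):
        self.last_action = 0

    def act(self, x):
        return self.last_action

    def learn(self, x, z):
        self.last_action = z

def run_life(brain, stream, truth):
    """One pass through a training-and-testing stream.

    stream: list of frame-synchronized (sensory, motoric) pairs;
            the motoric entry is FREE on testing frames.
    truth:  ground-truth actions, used only to score testing frames.
    Returns the average error rate over all testing time instances."""
    test_errors = []
    for (x, z), z_true in zip(stream, truth):
        z_pred = brain.act(x)           # the brain reacts to the sensory frame
        if z == FREE:                   # testing frame: no supervision given
            test_errors.append(int(z_pred != z_true))
        else:                           # training frame: motor is supervised
            brain.learn(x, z)
    return sum(test_errors) / len(test_errors)
```

Note that training and testing frames are interlaced in a single life-long pass, so performance is measured while learning is still under way.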

Contest entries are submitted via the Internet, so no travel is required.   The Contest will provide the software interface and data streams for training-and-testing.   Organizers of the Contest are ineligible to be team members of any entry.

The International Conference on Brain-Mind (ICBM) will feature Contest score announcement, sponsor rewards, and team presentations.

The performance criteria are highly integrated:

The lowest average error rate given a limited training-and-testing experience and a limited size of the learning network.

 Specifically, the criteria are: 

  1. Given the training-and-testing stream, the average error rate over all test time instances throughout the stream.
    1. This requires quick learning at early time instances and no degradation at later time instances.  
    2. Each learner's life must be successful through development.  Namely, trying different "genes" or initial guesses will lead to poor performance.  The local-minimum problems that are typical in traditional neural networks must be solved for the highly nonlinear learning problems in the Contest.  Arguably, "genes" are for development/learning, not directly for intelligence.    
    3. The curse-of-dimensionality problem must be solved, since the contestants are not allowed to handcraft skull-internal features, such as convolution features, because each learning engine must be applied to different sensory modalities and different tasks that are not fully known at programming time.  Convolution and max-pooling, such as those in Cresceptron, are not effective for large templates and are not suited for other sensory modalities.
    4. The skull-internal representations must emerge, free from (handcrafted) symbolic representations, because both the modality and the tasks to learn are unknown during the programming time.   Namely, the Contest tests machine learning methods that are fully automatic inside the "skull" through the machine learner's life.
  2. The temporal block size b=1 for sensory processing, because the next sensory frame depends on the current action (e.g., what you see in the next sensory frame depends on how you move now).
    1. The training and testing environment is given as a long training-and-testing stream with discrete time instances.
    2. The initial state/action is given, like the initial context of a life before development/learning.
    3. Each motor at a time instance can be either action-supervised (for training) or action-free (for testing and self-taught learning), as specified by the environment (a supplied pattern for motor supervision, or the symbol * otherwise).
  3. The number of neurons in the hidden (skull-internal) area of network is limited.
    1. Because the Contest is not meant to compare the power of the computer hardware that the Contestants use, the number of hidden neurons is given for each training-and-testing stream.   It would be unfair to compare performance against a learning network that is twice as large.
    2. For the same reason, each neuron in the hidden (skull-internal) area can update only once between two consecutive time instances.  It would be unfair to compare performance against another network that iterates its skull-internal areas more times between two consecutive time instances.  This requirement is meant for real-time learners as environmental events unfold.
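The one-pass hidden-update constraint in Criterion 3 can be illustrated with a toy network.  The weight matrices, the tanh nonlinearity, and the function name are assumptions for illustration only, not the mechanism of any supplied learning engine:

```python
import numpy as np

def one_pass_update(W_hidden, W_motor, x, n_hidden):
    """Compute the motor response for one time instance.

    Each of the n_hidden skull-internal neurons updates exactly once
    between consecutive time instances; no internal iteration occurs."""
    h = np.tanh(W_hidden @ x)        # single update of all hidden neurons
    assert h.shape[0] == n_hidden    # resource limit: fixed hidden-area size
    z = W_motor @ h                  # motor output from the single pass
    return z
```

A network that ran this update several times per frame, or that used more than the allotted n_hidden neurons, would violate the fairness constraints stated above.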


Within each stream, the following five types of substreams (each containing multiple tasks and subtasks, skills and subskills) will be trained and tested at different time instances.  The contestants do not know what type each time instance corresponds to.    An interface program will be provided that reads each stream and measures the motor errors.  The source code of this interface program is also provided to the Contestants, who can use it to debug and improve their entries, but the error-measure parts of the interface program must not be modified, so that all contestant entries are measured by the same criteria.   Contestants are allowed to browse through the supplied streams to decide what additional motors to define and teach, if there is such a need.

  • Type 1: Spatially non-attentive and non-temporal streams

    Many components of a sensory frame are related to the next motoric frame (e.g., the object of interest almost fills the entire image and the next motoric frame contains the object type).  Non-temporal here means that a single frame is sufficient to decide the next motor frame.   This is similar to monolithic pattern classification (e.g., image classification). But past experience is useful for later learning within the same training-and-testing stream.


  • Type 2: Spatially attentive and non-temporal streams

    A relatively small number of components of a sensory frame are related to the next motoric frames (e.g., the car to be recognized and detected is in a large cluttered street scene where the next motoric frames should contain the location, type, and scale of the attended car).  Type 2 is a spatial generalization of Type 1.  This is like object recognition and detection from cluttered dynamic scenes conducted concurrently (where the next motoric frames provide desired actions).   Each sensory frame is not segmented but internal automatic segmentation needs to be learned.  Namely, skills to find which image patch is related to the action in the motoric frame need to be gradually learned from earlier learning and refined in later learning within the same stream.   The early attention skills can be learned from motor vector (supervised learning) and/or through reinforcement learning (pain and sweet signals in sensory frames).  The motoric frames may contain action-supervision signals and the sensory frames may contain components for reinforcement signals (rewards or punishment components like pain receptors and sweet receptors).   The contents in each sensorimotor frame signal what learning modes are needed.   For example, a supplied action in a motoric vector calls for supervised learning, a supplied pain signal in a sensory vector calls for reinforcement learning, and the presence of both calls for a combination of supervised learning and reinforcement learning.


  • Type 3: Spatially non-attentive and temporal streams

    Each motoric frame is a function of not only the last sensory frame but also an unknown number of earlier sensory frames.
    Each motoric frame corresponds to the temporal state/action.  Type 3 is a temporal generalization of Type 1.  This is like recognizing sentences from a TV screen where the TV screen presents one letter at a time.  Again, past experience is useful for later learning (e.g., learning individual letters and punctuations, individual words, individual phrases, individual sentences, etc. progressively, through a single long stream).


  • Type 4: Spatially attentive and temporal streams

    Each motoric frame is related to parts of recent sensory frames.  Type 4 is the temporal generalization of Type 2 and the spatial generalization of Type 3.  An example is recognizing and detecting the intent of a car moving in a cluttered scene.   Again, earlier experience is useful for later learning (e.g., motion direction, motion patterns, object type, object location, object orientation, etc.).


  • Type 5: Generalization that requires a certain amount of autonomous thinking

    The actions in the motoric frame require the system to invent rules and use such rules on the fly within the same (long) training-and-testing stream. Type 5 is the thinking generalization of Type 4.  Classical conditioning, instrumental conditioning, autonomous reasoning, and autonomous planning are examples.
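The learning-mode rule stated for Type 2, and used throughout the Types above, can be sketched as a simple dispatch: a supplied action calls for supervised learning, a supplied pain or sweet signal calls for reinforcement learning, and both together call for a combination.  The function and mode names below are illustrative, not part of the official interface:

```python
def learning_mode(motor_supervised, pain_or_sweet_present):
    """Select the learning mode from the contents of a sensorimotor frame.

    motor_supervised:      True if the motoric vector supplies an action.
    pain_or_sweet_present: True if the sensory vector carries a
                           punishment or reward component."""
    if motor_supervised and pain_or_sweet_present:
        return "supervised+reinforcement"
    if motor_supervised:
        return "supervised"
    if pain_or_sweet_present:
        return "reinforcement"
    return "free"   # action-free frame: testing or self-taught learning
```

The key point is that the frame contents themselves signal the mode; no external task label tells the learner when to switch.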


Practice streams for training-and-testing will be provided by the Contest early on.   For the Contest, each entry is required to run through a Contest Interface, which records the performance in real time.  The frame rate is around 10 Hz in real time, but each entry may run slower in virtual time.  A GPU is recommended but not required.   Information about the computer architecture should be provided.   Spatial and temporal computational complexities are considered in Criteria (3) and (4).

Data:  The training-and-testing streams will be provided.  Many machine learning techniques are designed for off-line, batch training, batch testing, and specific tasks.  They must be modified to take the official training-and-testing streams for online training and testing.  Each stream consists of a single sequence of many time frames; each time frame i contains a sensory frame X[i] and a motoric frame Z[i].   Each motoric frame may include both training data points and testing data points.   If a motoric frame is marked * (free), it is a testing frame, absent of training data.   Namely, each stream is a synchronized sensorimotor sequence (X[i], Z[i]), i = 0, 1, 2, … n, where X[i] and Z[i] are the sensory vector (e.g., image) and action vector (state) at time i, both non-symbolic (numeric vectors) to promote fully automatic machine learning. Z[i] includes binary components that represent abstract concepts of a spatiotemporal event (e.g., the location concept, type concept, or state concept of a sentence).  X[i] may include specified components as punishments and rewards for action Z[i-1] or a few frames earlier (not delayed so much as to be confused with earlier actions).  There are two types of Z[i]'s, supervised and free; free Z[i]'s are motor vectors for testing.  Each Z[i] consists of a number of concept zones [e.g., Z = (ZT, ZL, ZS), where ZT, ZL, ZS represent the type zone, location zone, and scale zone, respectively, for the attended object].  Within each zone, only one neuron fires at value 1 and all other neurons take value 0. Within each stream, skills learned at early i's are useful for later learning at later i's.
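The concept-zone layout of a motoric frame Z[i] can be sketched as follows.  The zone names, zone sizes, and function name are illustrative examples only, not the official stream format:

```python
import numpy as np

# Example zone layout for a motoric vector Z = (ZT, ZL, ZS):
# a type zone, a location zone, and a scale zone (sizes are hypothetical).
ZONES = {"type": 5, "location": 9, "scale": 3}

def decode_motor(z):
    """Decode a motoric vector whose zones are each one-hot.

    Within each zone, exactly one neuron fires at value 1 and all
    other neurons take value 0; the firing index is the concept value."""
    out, start = {}, 0
    for name, size in ZONES.items():
        zone = z[start:start + size]
        assert zone.sum() == 1, "exactly one neuron per zone may fire"
        out[name] = int(np.argmax(zone))   # index of the firing neuron
        start += size
    return out
```

For example, a 17-dimensional vector with ones at positions 2, 9, and 15 decodes to type 2, location 4, and scale 1 under this assumed layout.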

Contest 2018: The source program of a learning engine, Developmental Network 2 (DN-2), is provided and discussed during the Contest Workshop.   Each Contest participant modifies the way the supplied learning engine is plugged in, and possibly also the learning engine itself, to enhance performance and construct his own learning agent.  He also receives a Contest Interface program that reads the Contest data streams and accepts the outputs from the learning agent to measure performance on the data streams.

There are three data streams for the Contest: visual, auditory, and text.  A complete Contest entry must submit results for all three data streams.   For each of the three data streams, the following information is provided and measured:

  1. The maximum number n of hidden neurons allowed in the learning agent.  Namely, the computational resource is limited in terms of the number of hidden neurons. Only a one-pass update of the n neurons in the hidden area is allowed at each time instance. The actual real time used by the learning agent is recorded but not used in computing the contest scores, because it depends on the actual computers used.
  2. Within each data stream, which time instances are action-supervised.   All other time instances are action-testing instances.  In other words, the training environment for the learning agent is provided.   A better learning agent performs better given the same training experience and the allowed computational resource in terms of the maximum number of hidden neurons.
  3. The performance measured for each data stream is the average action error among all testing time instances and across all agent motors (e.g., for the visual stream, the motors correspond, respectively, to the currently attended image patch, the type of the currently attended object in the attended image patch, and the current heading direction for navigation, while the current navigation goal is supervised through the GPS motor).
  4. Depending on whether the sensory input contains irrelevant backgrounds, not all three data streams are equally sensitive to attention.  In the first year, 2016, of the Contest, the visual modality stream requires attention because the input image contains cluttered scenes, but the attention in the action tells which part of the input image is more relevant to the heading action.  The use of this attention information is essential for the heading performance, because an attention-free agent gave only poor performance, as each sensed entire image contains objects that are irrelevant to the GPS goal (Re: Zejia Zheng and Juyang Weng, CVVT 2016 at CVPR).  In the auditory stream for the 2016 Contest, each sensory input at every time instance includes only one speaker's sound (instead of multiple speakers speaking simultaneously).  Likewise, in the text stream for the Contest, each sensory input at each time instance includes only one word (instead of all the text on a page). Therefore, attention in the auditory stream and the text stream is less of an issue than in the visual stream. Autonomous covert attention is essential in developing the agent's "thinking" skills.

The submission of each Contest entry must include the following items, sent via email to castrog4@msu.edu with a CC to weng@cse.msu.edu and the required Subject Line: AIML Contest Submission version i by (name).  Otherwise, the submission is incomplete.

  1. A report (maximum 2 pages) stating the contestant's major work to enhance performance based on the supplied material, including the objective, the method, the references, the results, comments, and conclusions.
  2. The source program used to generate the Contest result for all three sensory modalities, so that the Contest organizers can reproduce the corresponding contestant-submitted output files.  All modified parts of the source program should be explicitly marked with "MBC:" (Modified By Contestant) and should be commented well enough for the Contest organizers to understand.   Only a single learning agent is allowed for all the training-and-testing streams: visual, auditory, and text.   The stream-specific information is only used for defining the sensors, the effectors (motors), and the maximum number of hidden neurons.
  3. The complete output file (as a .txt file) from the learning agent for all three streams, including every time instance, marked from 1 until the end of the stream, and the motor action produced at each time instance by the learning agent for all its motors, using the same motor representation as taught (see above).   The output file must also include the trace of the performance measure at every test instance, as well as the corresponding statistics of the error measures computed by the supplied plug-in program. The error-measure plug-in program is open in the finally supplied source program but must not be modified by the contestant.  Such error information is useful for the contestants to debug their programs.