The Second WBA Hackathon Report

The Second Whole Brain Architecture Hackathon was held from October 8th to 10th, 2016, under the theme of “Communal Making of Cognitive Architecture,” aiming for higher autonomy (CFP) (Video).
Eleven teams participated, applying advanced topics such as “the accumulator model,” “the free energy principle,” “predictive coding,” “continual learning,” and “deep deterministic policy gradient.” They surveyed papers and platforms and aimed for something ‘cool,’ ‘intriguing,’ or ‘brain-like.’ Mentors and judges joined discussions on topics of interest such as a biosphere begetting altruistic behavior, the forward/inverse models of cerebellar control, the implementation of intentions through sequence learning, models of others, multi-agent learning, affordance, meta-cognition, and body intelligence.
Four teams were nominated for the second round of judging at the 30th Anniversary event of the Japanese Society for Artificial Intelligence in November.

Summary

  • Date: October 8-10, 2016
  • Venue: Raiohsha, Keio University Hiyoshi campus, Yokohama, Japan
  • Organizer: The Whole Brain Architecture Initiative (Operated by the WBAI Supporters)
  • Co-Organizer: SIG-AGI of the Japanese Society for Artificial Intelligence
  • Sponsors: Nextremer Co., Ltd., Furuya Accounting Office, and Brains Consulting, Inc.
  • In cooperation with Sakura Internet
  • In collaboration with Dwango AI Lab.
  • Supported by
    • Grant-in-Aid for Scientific Research on Innovative Areas “Comparison and Fusion of Artificial Intelligence and Brain Science”
    • MEXT Grant: Post K “The Whole Brain Simulation and Brain-like AI”
  • Teams
    • Team Osawa: Masahiko Osawa, Daiki Shimada, & Yuta Ashihara
      Mentor: Hiroki Kurashige (U. of Electro-Communications)
      Suppressing Boosting with Noh-Gazebo-ROS-Gym Integration
      The 2nd Round: ASCA: Cognitive Architecture Suppressing Behavior based on Accumulator Model
    • Team Ochiai: Koji Ochiai & Taku Tsuzuki
      Mentor: Koichi Takahashi (Riken)
      Attention Control by Free Energy
    • Team Morimoto: Toshiaki Morimoto & Mitsugu Ootaki
      Mentor: Shigeyuki Oba (Kyoto University)
      Abnormality Detection in Time Series with PredNet
      The 2nd Round: “Risk Aversion” Cognitive Architecture
    • Team Noguchi: Yuki Noguchi
      Mentor: Masahiro Suzuki (University of Tokyo)
      Continual Learning in Multiple Game Play
    • Team Nao: Takatoshi Nao, Masaru Kanai, Keiichiro Miyamoto, & Shogo Naka
      Mentor: Masayoshi Nakamura (Dwango)
      Survival and Evolution of Super-A-Life
    • Team Takahashi: Tomomi Takahashi, Takumi Takahashi, & Seigo Matsuo
      Mentor: Tadahiro Taniguchi (Ritsumeikan University)
      Fight Learning in a Random Terrain via Agent Interaction
    • Team Omasa: Takamitsu Omasa, Naoyuki Nemoto, Junya Kuwada, & Naoyuki Sakai
      Mentor: Hiroshi Yamakawa (Dwango)
      Past×Present×Future: Memory of Past and Prediction of Future
    • Team Ohto: Yasunori Ohto, Sho Naka, Toru Kawakami, & Daisuke Ishii
      Mentor: Taichi Iki (Nextremer)
      Memory
    • Team Kato: Takuma Kato, Koki Teraoka, & Shouta Nakagawa
      Mentor: Sachiyo Arai (Chiba University)
      Learning Hunting by Multi-Agents
    • Team Hashimoto: Yutaka Hashimoto
      Mentor: Michihiko Ueno (Dwango)
      Representing a Virtual Shibainu
    • Team Shimomura: Takuji Shimomura, Shuichi Sasaki, Yuki Sasaki, & Hiroki Sato
      Mentor: Yasuo Katano
      Handling Cups and Glasses Correctly
  • Prizes
    • Sugoi Prize (for being cool)
    • Nouppoi Prize (for being neuroscientific)
    • Omoroi Prize (for being intriguing)
    • Brains Consulting Prize
    • Nextremer Prize
    • Furuya Accounting Office WBA Special Prize
    • Fighting Spirit Award

Team Activities

Team Osawa:

Nominated for the Second Round (the Final Winner)
Sugoi Prize, Nouppoi Prize

Suppressing Boosting with Noh-Gazebo-ROS-Gym Integration

The team proposed a hierarchical “accumulator model” that arbitrates among modules using a model of prefrontal accumulator neurons, and implemented it as “ASCA: the Cognitive Architecture Suppressing Behavior based on the Accumulator Model.”
Decision making in the brain is thought to be carried out by disinhibiting the action to be taken from among enumerated and inhibited feasible options. Many recent studies of decision making with reinforcement learning use end-to-end learning with a single module and do not employ the idea of inhibition/disinhibition. The team therefore proposed reinforcement learning with multiple modules governed by inhibition/disinhibition. Among the brain regions carrying out inhibition/disinhibition, such as the basal ganglia and prefrontal cortex, the team focused on the prefrontal area, which functions in higher-level arbitration. The accumulator is a decision-making algorithm that carries out an action when the accumulated ‘evidence’ for it surpasses a threshold. Theories referred to include neurons in brain regions acting as accumulators [Mazurek-Shadlen 2003] [Hanks-Brody 2015], the accumulator model of spontaneous motion [Schurger-Dehaene 2012], and the role of prefrontal accumulators in initiating spontaneous motion selection [Soon-Haynes 2008].
The ASCA cycle includes: 1) obtaining the first-person depth image from the environment (input to the recognition module), 2) recognizing the coordinates of specific objects in the depth image (input to the reflex module; via template matching), 3) parallel action selection by a) the recognition module (action selection with a Deep Q-Network), b) the reflex module (rule-based action selection using the detected objects), and c) the static module (random action selection or no action), and 4) decision making by the arbitration module (the accumulator model). The modules are hierarchized so that an active upper module suppresses all lower modules. Finally, ROS converts the target locomotion speed calculated by the arbitration module into the control signal given to the environment, while reward signals are sent from the environment to the recognition module.
The software environment includes Gazebo (a dynamics simulator working with ROS), ROS (Robot Operating System), and Noh (a learning platform for brain-like cognitive architecture).
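As a rough illustration of the arbitration step, here is a minimal sketch of an accumulator-based arbiter in which an active upper module suppresses the lower ones; the module ordering, leak, and threshold values are hypothetical and greatly simplified from ASCA itself.

```python
import numpy as np

class Accumulator:
    """Arbitrates hierarchized modules with an accumulator-style rule.

    Each module proposes an action with an 'evidence' value; evidence is
    integrated over time and a module wins when its accumulator crosses
    the threshold. An active upper module suppresses all lower modules.
    """
    def __init__(self, module_names, threshold=1.0, leak=0.9):
        self.names = module_names          # ordered from upper to lower
        self.threshold = threshold
        self.leak = leak                   # decay of accumulated evidence
        self.levels = np.zeros(len(module_names))

    def step(self, evidences):
        """evidences: per-module evidence values, in the same order as names."""
        self.levels = self.leak * self.levels + np.asarray(evidences, dtype=float)
        for i, name in enumerate(self.names):
            if self.levels[i] >= self.threshold:
                self.levels[i + 1:] = 0.0   # active upper module suppresses lower ones
                self.levels[i] = 0.0        # reset the winner after it fires
                return name                 # this module's proposed action is executed
        return None                         # no module has crossed the threshold yet
```

For example, with modules ordered as ["recognition", "reflex", "static"], calling step() once per cognition cycle would return the module whose proposed action is passed on to ROS, or None while evidence is still accumulating.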
A comment from the floor: since parallel situation processing has a long history, referring to work on multi-agent active learning could be helpful.
Source Code
[1] First Round: https://github.com/iwawomaru/SAO
[2] Second Round: https://github.com/iwawomaru/SUSANoh

Team Ochiai:

Nominated for the Second Round
Sugoi Prize, Nextremer Prize

Attention Control by Free Energy

The team regarded Karl Friston’s Free Energy Principle as a framework in which an agent learns to decide its actions in the environment, and implemented the principle with an artificial neural network (ANN) so that both the internal state and behavior are determined to minimize the agent’s own free energy.
Compared to reinforcement learning, the Free Energy Principle has the advantage that it can be applied to a wide range of phenomena without an arbitrarily engineered reward. However, Friston’s implementation can only be applied to simple problems, and the team thought it could be scaled to large and complicated problems with an ANN. The variational autoencoder (VAE) is a neural-network formulation of variational Bayes in which free energy is minimized. The team therefore decided to add active inference to the VAE, regarding the VAE as a network that minimizes free energy by adjusting its internal parameters, and letting it generate motion so as to minimize free energy as well.
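For reference, the variational free energy minimized by a VAE (the negative evidence lower bound) is commonly written as below; this is the textbook form, not necessarily the team’s exact objective:

$$ F(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[-\log p_\theta(x \mid z)\right] + D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right) $$

Perception minimizes F(x) over the encoder/decoder parameters φ and θ; in active inference, actions are additionally chosen so that expected future observations keep this quantity small, which is the role of the behavior generator described below.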

Specifically, an agent that solves a gaze-movement task was implemented. The environment was a space (a torus) in which MNIST images are arranged and observed through pixel-unit movements. The agent was implemented as a VAE plus a behavior generator. The result was that gaze movement stopped as learning progressed. To solve this problem, the team investigated (after the Hackathon) the epistemic value for exploratory behavior and decided to let a multimodal VAE learn the samples and calculate the epistemic value from its hidden layer. It was confirmed that the epistemic value was high at places with high value for observation.

Team Morimoto:

Nominated for the Second Round
Omoroi Prize

Abnormality Detection in Time Series with PredNet

The team judged that predictive coding would be useful for its “risk aversion” cognitive architecture hypothesis, and ran experiments with PredNet, an algorithmic model of predictive coding, to verify whether it could model the relevant brain functions. Specifically, an experiment on detecting the anomaly of a reverse-running (wrong-way) vehicle was carried out.
Human intelligence is a decision mechanism that responds to stimuli from the outside world, and a cognitive architecture is a model of it. The team first aimed at a mouse-level cognitive architecture and chose “risk aversion” as the task. Here, risk means a state in which a penalty is given, and fear means the feeling that occurs when risk is predicted. There are low and high roads of fear processing, the difference being whether or not they pass through the sensory cortex. Ref. J. LeDoux, The Emotional Brain (1998)
Next, since fear is an emotion, the team adopted the definition of emotion as “a value calculation system for action decision.” Ref. Masayoshi Toda, Emotion (1992, in Japanese)
As the cognitive architecture, a predictor was devised considering the flow from a stimulus to the sensory thalamus and on to the sensory cortex, amygdala, frontal lobe, or nucleus accumbens. PredNet was adopted as the predictor, since risk can be determined only if it is predicted. Ref. Takashi Omori, “An attempt to model emotions as a value system” (2016, in Japanese)
Predictive coding is a hypothesis about brain function in which feedback connections convey predictions of lower-level neural activity, and feedforward connections convey the error between the predicted and actual activity. Ref. Kenji Doya, “Calculation Mechanism of the Brain” (2005, in Japanese); Lotter et al., “Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning” (2016)
For anomaly detection on specific objects with PredNet, a point where the prediction error exceeds a threshold is defined as abnormal; this was tested with reverse-running vehicles. Taking segmentation and attention into consideration, the task was set as detecting the anomaly of reverse-running cars. Specifically, the background was learned from scenes of the road alone, the motion of cars was learned from scenes of the road with cars, and the model was tested on videos of reverse-running cars. As a result, anomalies could be properly detected, though non-abnormal situations were sometimes judged as abnormal.
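As a minimal illustration of the thresholding step described above (assuming a hypothetical predict_next wrapper around a trained PredNet; this is a sketch, not the team’s code):

```python
import numpy as np

def detect_anomalies(frames, predict_next, threshold):
    """Flag frames whose PredNet prediction error exceeds a threshold.

    frames:       video frames as an array of shape (T, H, W, C)
    predict_next: callable returning the model's prediction of frame t+1
                  given frames[:t+1] (e.g. a trained PredNet wrapper)
    threshold:    error level above which a frame counts as abnormal
    """
    flags = []
    for t in range(len(frames) - 1):
        predicted = predict_next(frames[: t + 1])
        error = np.mean((frames[t + 1].astype(float) - predicted) ** 2)
        flags.append(error > threshold)
    return flags
```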

Team Noguchi:

Nominated for the Second Round
Omoroi Prize

Continual Learning in Multiple Game Play

Continual learning was chosen as the base for a cognitive architecture that retains acquired representations/knowledge and reuses them for new tasks. A3C (Asynchronous Advantage Actor-Critic) was used as the reinforcement learning model, and PNN (Progressive Neural Networks) was implemented as the method for continual learning.
AGI needs the ability to reuse knowledge acquired in the past. Although reinforcement learning has greatly improved thanks to the flexible representation learning of deep learning, there are still many things that cannot be learned sufficiently. A typical neural network suffers from catastrophic interference, forgetting the past when learning something new. Human beings can, to some extent, associate new information with existing knowledge without forgetting important knowledge and memories, thereby making learning faster (transfer learning). To realize this ability, A3C and PNN were used.
In the Hackathon, LIS was used, in which multiple games are played in a virtual environment. The team implemented PNN, which reuses knowledge, compared its learning speed with a baseline, and showed the effectiveness of continual learning. A3C learns with an actor-critic method by running agents in parallel; experience replay is no longer necessary and the results are better than DQN. In this experiment, eight agents ran in parallel. PNN has the advantage that knowledge learned in the past is not lost and can be used for new tasks by adding columns as the number of tasks increases. However, it also has drawbacks: parameters grow with the number of tasks, and the timing of adding columns is fixed.
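A minimal sketch of the column idea in PNN is shown below (in PyTorch, for illustration only; the layer sizes and names are hypothetical, and this is not the team’s implementation, which is linked under Source Code):

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """One column of a Progressive Neural Network (simplified sketch).

    Earlier columns are frozen; the new column receives lateral inputs
    from their hidden activations so past knowledge can be reused.
    """
    def __init__(self, obs_dim, n_actions, hidden=64, n_prev_columns=0):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden)
        # lateral adapters from each previous column's hidden layer
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(n_prev_columns)]
        )
        self.policy = nn.Linear(hidden, n_actions)

    def forward(self, obs, prev_hiddens=()):
        h = torch.relu(self.fc1(obs))
        for lateral, prev_h in zip(self.laterals, prev_hiddens):
            h = h + torch.relu(lateral(prev_h))   # reuse features from old columns
        return self.policy(h), h                  # action logits and hidden state
```

When a new game begins, the previously trained columns are frozen (e.g. with requires_grad_(False)) and only the new column and its lateral adapters are trained, which is why past knowledge is preserved while the parameter count grows with each task.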
The results showed that the model reusing acquired knowledge learned faster than a model that had learned only one game.
Reference
[1] A3C: https://arxiv.org/abs/1602.01783
[2] PNN (Progressive Neural Networks): https://arxiv.org/abs/1606.04671
Source Code
[1] https://github.com/seann999/progressive_a3c

Team Nao:

Furuya Accounting Office WBA Special Prize

Survival and Evolution of Super-A-Life

The team created a super artificial life that recognizes the external environment, its own condition, and the influence of irregularly received forces, and that learns its external shape and behavior to adapt to the environment. Behavior learning used actor-critic DDPG (Deep Deterministic Policy Gradient) instead of DQN, and adaptation to the environment was learned with a genetic algorithm (GA) on the server.
With the task of learning new behavior at hand, the team studied actor-critic DDPG, which handles continuous and high-dimensional action spaces, through the paper “Continuous Control with Deep Reinforcement Learning” and work on AlphaGo, having learned that DQN handles only discrete and low-dimensional action spaces. In the experiment, a reward increase was observed with an agent in which only the DQN of the LIS sample was replaced, and learning was verified for an increased variety of motions. The team then worked on adapting the agent to the environment with a GA, expecting that reward would increase as the agent’s form evolved and its survival probability rose. The GA took as input the scale (X, Y, Z) of each unit as the gene together with the reward, and output genes manipulated by the GA. Gene manipulation included selection by evaluation values, crossover of the selected genes, mutation, and the generation and substitution of new units. As the constraints from the terrain were not sufficient, a combat rule was added: when agents collide, the larger agent damages the smaller one.
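A minimal sketch of the kind of GA generation step described above (selection, crossover, and mutation over per-unit scale genes); the mutation rate, elitism, and scale ranges are hypothetical:

```python
import random

def evolve(population, rewards, mutation_rate=0.1, elite=2):
    """One GA generation over morphology genes.

    population: list of genes; each gene is a list of (sx, sy, sz) scales,
                one triple per body unit of the agent
    rewards:    accumulated reward (fitness) for each gene, same order
    """
    # selection: rank genes by reward and keep the best as parents
    ranked = [g for _, g in sorted(zip(rewards, population),
                                   key=lambda pair: pair[0], reverse=True)]
    parents = ranked[:max(elite, 2)]

    next_gen = [list(g) for g in parents[:elite]]          # elitism
    while len(next_gen) < len(population):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, len(a))                  # one-point crossover
        child = a[:cut] + b[cut:]
        child = [tuple(s * random.uniform(0.9, 1.1)        # mutate unit scales
                       if random.random() < mutation_rate else s
                       for s in unit)
                 for unit in child]
        next_gen.append(child)
    return next_gen
```

Each gene would be evaluated by running the corresponding agent in the environment and passing its accumulated reward in as the fitness.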
From the floor, there was a comment that it would be interesting to study the adequacy of the entire biosphere through the emergence of altruistic behavior.
Source Code
[1] https://github.com/nao0811ta/AILifeWorld

Team Takahashi:

Brains Consulting Prize, Fighting Spirit Award

Fight Learning in a Random Terrain via Agent Interaction

The team conceived a cognitive architecture in which tank-agents repeatedly engage one another and learn, aiming to realize fights that look spectacular and non-mechanical to human beings. The team chose the topic wanting to gain an intuition of what it is like to be a brain or a human.
The system consists of four tank-agents, items (obtaining which benefits an agent in the fight), and ten obstacles (placed randomly on a 6×6 grid). The tank-agents have seven types of actions: doing nothing, going forward, going backward, turning left, turning right, regular firing, and special firing (only available after getting an item). The rules: a tank-agent is destroyed by three hits of regular fire or by one hit of special fire; the game ends when only one tank-agent is left or on time-out. Rewards: positive rewards for hitting or destroying an enemy with fire or for getting an item; negative rewards for being hit or destroyed.
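A minimal sketch of how such an action set and reward scheme might be encoded for the DQN; the event names and reward magnitudes here are hypothetical, not taken from the team’s code:

```python
# hypothetical reward table for the tank fight described above
REWARDS = {
    "hit_enemy":      1.0,
    "destroy_enemy":  5.0,
    "get_item":       1.0,
    "got_hit":       -1.0,
    "destroyed":     -5.0,
}

ACTIONS = ["noop", "forward", "backward", "turn_left",
           "turn_right", "fire", "special_fire"]  # special_fire requires an item

def step_reward(events):
    """Sum the rewards for the events observed in one environment step."""
    return sum(REWARDS[e] for e in events)
```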
The experiment had three phases: three pre-trained tank-agents fighting a human-operated agent; three tank-agents learning; and three trained tank-agents fighting a human-operated agent. The human-operated agent won against tank-agents that had been trained for about four hours.
The team was awarded the Brains Consulting Prize because the system actually ran.
From the floor, there were comments that it would be interesting to make the tank-agents mimic the human-operated agent, and that it would be helpful to devise the rewards or to study inverse reinforcement learning (to estimate the reward space of the human operator).
Source Code
[1] https://github.com/MatsuoSeigo/TankDqn

Team Omasa:

Fighting Spirit Award

Past×Present×Future: Memory of Past and Prediction of Future

A cognitive architecture that learns behavior by giving reward for memory + prediction was proposed. LISv2 was modified into a system that feeds image data from Unity into a prediction stream (PredNet + AlexNet) and a memory stream (AlexNet + a memory unit) in parallel and merges them for learning.
The prediction stream: the current image ⇒ PredNet → predicted image ⇒ AlexNet → features of the predicted image ⇒ fully-connected module. The memory stream: the current image ⇒ AlexNet → current features ⇒ memory unit ⇒ fully-connected module. MQN was used as the memory unit; it stores encoded data of past images and encoded key information from past images, creates a key from the key information and the current context, retrieves the past information selected with that key, and passes on both the current and past information.
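A minimal sketch of the key-based read step of such a memory unit (a simplification of MQN-style retrieval; the shapes and names are illustrative only):

```python
import numpy as np

def mqn_read(memory_keys, memory_values, context):
    """Soft key-based retrieval over past observations (simplified sketch).

    memory_keys:   (T, d) key encodings of past images
    memory_values: (T, d) value encodings of past images
    context:       (d,)   current context vector used as the query
    Returns the attention-weighted past information to be merged
    with the current features.
    """
    scores = memory_keys @ context              # similarity to each memory slot
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax attention over the past
    return weights @ memory_values              # retrieved past information
```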
It was regrettable that only the prediction stream could be implemented within the time limit.
Following comments from the floor, an improvement of the memory unit into FRMQN with LSTM and feedback is planned. There were also notable comments: that the approach can be seen as sequence learning combining memory and prediction, that it would be interesting to realize volition, that the agent’s own behavior could also be predicted, that PredNet should learn only when the agent behaves appropriately, and that it is desirable to have a drive (purpose) for prediction.
Reference
[1] Control of Memory, Active Perception, and Action in Minecraft
https://arxiv.org/pdf/1605.09128v1.pdf

Team Ohto:

Fighting Spirit Award

Memory

With the idea of learning from organisms, an architecture that memorizes the causal relationships of episodes and acts on those memorized relationships was created. Specifically, an escape game was conceived. The environment is simple: a LIS agent is confined on an asteroid; when it stands in front of the door, the door opens, and the agent goes out through the door and receives a reward.
The team’s policy was to “learn from babies”: babies learn from the environment, create internal models from experience, and get interested when the environment differs from their internal model. The LIS strategy mimics this: the agent memorizes the environment, creates an internal model, compares the environment with the model, takes normal action when the prediction is met, and changes its behavior when the prediction differs from the environment.
Specifically, the internal model was trained so as to reduce the difference between the internal model and the environmental input, and it predicts how the environment changes with behavior. If the prediction is met, the agent takes normal search actions (right/left rotation or going forward); otherwise it gets “interested” (for example, a change in the environment such as the door opening leads to a change of action upon recognizing the environmental change). Behavioral changes in right/left rotation were implemented as an increase of the stride in response to the difference between the internal model and the external environment. For “go forward,” it was implemented so that the agent escapes through the door once a tendency toward the door is generated. The software hierarchy was Brain, LIS Client, and the environment (Unity).
Four models were constructed: LIS Original, LIS + Memory, LIS + Interest, and LIS + Memory + Interest. LIS Original converts the environmental input to feature vectors with a CNN, moves the LIS agent in the environment, and determines its action (right/left rotation or go forward) (CNN, Linear, and Action). LIS + Memory uses an LSTM to memorize internal states; to associate passing through the door with the door-opening switch, the LSTM was added with reference to the paper “Deep Recurrent Q-Learning for Partially Observable MDPs.” In LIS + Interest, the environment and the internal model are compared, and when they differ (i.e., when the expected result at time t differs from the expected result at time t+1), behavior is changed. In LIS + Memory + Interest, an LSTM was also added to the interest model.
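A minimal sketch of the “interest” trigger described for LIS + Interest (the distance measure and threshold are hypothetical):

```python
import numpy as np

def interest_triggered(expected_t, expected_t1, threshold=0.1):
    """Compare consecutive predictions of the internal model.

    expected_t, expected_t1: feature vectors the internal model expected
                             at time t and t+1
    Returns True when the mismatch is large enough that the agent should
    switch from normal search behavior to 'interested' behavior.
    """
    mismatch = np.linalg.norm(expected_t1 - expected_t)
    return mismatch > threshold
```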
In the experimental results: LIS Original, which persevered toward the switch that opens the door, could not cross the gap to pass through the door; LIS + Memory could open the door and go straight out of it, but when entering diagonally it often thrust into the wall; LIS + Interest explored until the door opened, went out once it did, and its interest rose sharply after going out; LIS + Memory + Interest seemed to be learning the environment, but could not compute all the way to the escape within the resource limits.
As a future task, the team plans to try adjusting and learning the hyperparameters with genetic algorithms in an evolutionary manner.

Team Kato:

Fighting Spirit Award

Learning Hunting by Multi-Agents

Thinking about hunting situations in which two hunters collaborate led to the proposal of a cognitive architecture for learning hunting with multiple agents.
In the environment, agents gain reward by approaching the food, the food flees from the agents, and the agents’ learning target is to chase down the food together as a pair (learning collaborative behavior). The food chooses its direction randomly and veers away as an agent gets near.
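A minimal sketch of the food’s movement rule as described (the step size and avoidance radius are hypothetical):

```python
import numpy as np

def food_step(food_pos, agent_positions, step=1.0, avoid_radius=3.0):
    """Move the food: random direction, veering away from nearby agents."""
    direction = np.random.randn(2)                  # random heading by default
    nearest = min(agent_positions, key=lambda a: np.linalg.norm(a - food_pos))
    if np.linalg.norm(nearest - food_pos) < avoid_radius:
        direction = food_pos - nearest              # flee from the closest agent
    direction /= (np.linalg.norm(direction) + 1e-8)
    return food_pos + step * direction
```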
As a result, a single agent could not catch the food (like a gorilla chasing a banana) and learning did not seem to progress. With two gorillas, information was processed with a Deep Q-Network on a GPU, but learning time ran out before the evaluation.
For future prospects, the team will consider reward distribution rules, implementing differences in agent abilities, and a mechanism for the agents to grasp and learn each other’s positions.
Comments from the floor included: “A study on this kind of task was published by Prof. Omori of WBAI; it could be referred to in the context of the model of others” and “There are numerous studies on multi-agent systems, such as distributed Q-learning, but a gold mine may lie there as there has been no conclusive research. It would be nice if past research were revived through experiments with, e.g., deep learning.”

Team Hashimoto:

Fighting Spirit Award

Representing a Virtual Shibainu

The aim was to recognize the real world and couple that recognition with real-world behavior through HoloLens combined with AR/MR and AI.
The plan was that, in the environment, a shibainu (Shiba Inu dog) recognizes the field, such as the existence of the floor and walls, and learns not to hit the wall through negative reward, while virtual food is given via a sensor to gain positive reward. Viewing the shibainu through HoloLens was given up due to the lack of a basic development library for HoloLens applications.
LISv2 and BriCA were used as the AI framework. The final product was an Android application in which LISv2 recognizes the field with Android plus an AR engine. BriCA was built on a Mac, and, as interaction with the external world, a world-first virtual feeder was made with a microcomputer (connected via ESP8266/MQTT).
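As an illustration of the kind of feeder trigger such a setup could use (the broker address, topic, and payload are hypothetical, and this is not the team’s code), a single MQTT publish with the paho-mqtt helper might look like this:

```python
import paho.mqtt.publish as publish

# hypothetical broker address and topic; the ESP8266 on the feeder side
# would subscribe to the same topic and dispense virtual food per message
publish.single("shibainu/feeder", payload="feed",
               hostname="192.168.0.10", port=1883)
```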
The author felt the importance of ‘warmth’ in AI-human interaction, and found that he could watch the shibainu moving around forever and that it was simply fun to see the interaction.

Team Shimomura:

Fighting Spirit Award

Handling Cups and Glasses Correctly

A cognitive architecture based on findings about the cerebellum, hippocampus, neocortex, and basal ganglia was conceived, aiming, for example, to realize graceful, beautiful, and quick thirst quenching.
How does a human being come to act unconsciously? For example, in the case of drinking water from a glass or coffee from a cup, you may have learned from your parents how to drink or how to hold the container, observed adult behavior, and/or acted repeatedly. Rewards for unconscious behavior may be “getting more skillful (joy),” “less time (more free time),” “less energy (conservation),” and “less stress (on the mind).” Thus, learning in the simulation plan was defined in terms of “least energy,” “least time,” and “making a form,” from the standpoint of learning from behavior.
In recognizing “cups, glasses, and other objects,” the position, kind, and temperature of the container and the kind and quantity of the liquid would be discerned at a glance. In “reaching” and “grasping,” the agent would extend its arm, decide where on the container to grasp, shape its hand, and grasp the container so as not to spill the liquid. In “moving the target,” it would move the container to its lips without spilling and put it to the lips. In “attaining the goal,” it would drink the liquid and stop when it has had the necessary amount. Learning was designed to attain the goal by adjusting the entire sequence of steps. Goals include the merging of visual and somatic senses with the feeling of “becoming the self one desires to be,” setting the initial parameters of the simulation, finding commonality with other unconscious activities, and discovering new learning parameters.
There were two comments from the floor. One was that cerebellar control has many precise models, such as the forward/inverse model, in which the cerebellum models the musculoskeletal system by combining a forward model, which checks the adequacy of cerebral output via feedback, with an inverse model, which substitutes for the function of the cerebral cortex; there is also the MOSAIC model, which multiplexes forward/inverse model pairs, and a survey of these topics could be interesting. The other was that while the topic touches on much-disputed terms such as affordance, meta-cognition, and body intelligence, since there is no implementation yet, proceeding step by step with technologies and brain models would be the key.

Related URL: