PogamutUT2004


Bots standing around doing nothing

Posted by jacob.schrum on Sun 21 of Nov, 2010 19:40 CET

I'm running experiments evolving bots, which means that I run several evaluations
in sequence. I always have to kill all the bots, stop the server, restart the server, and
then reintroduce the bots. This seems to work fine for a while, but after several evaluations
I will check in on the match and see that all of the bots are standing still, or if they are
moving they make a move, then stop, then move, then stop. It's like the bots aren't receiving
commands as often as they should.

Looking at the output from the simulation, I think the problem might be related to the following
exception (and surrounding log messages):

___

(Hunter MEDKIT) WARNING 12:01:57.937 Unregistering JMX components.
(Hunter SEE ITEM) WARNING 12:01:57.926 Component SocketConnection[SocketConnectionAddresslocalhost:3000,connected:false) has stopped.
(Hunter ITEMS) WARNING 12:01:57.925 Component Raycasting has stopped.
(Hunter ITEMS) WARNING 12:01:57.938 Component Action has stopped.
(Hunter SEE ITEM) WARNING 12:01:57.927 Component UT2004Parser has stopped.
(Hunter ENGAGE) SEVERE 12:01:57.939 Thread 100: afterLogicException() exception.
ComponentNotRunningExceptionUT2004SyncLockableWorldView: In state: STOPPING.
at cz.cuni.amis.pogamut.base.communication.worldview.impl.EventDrivenWorldView.notify(EventDrivenWorldView.java:207)
at cz.cuni.amis.pogamut.base3d.worldview.impl.BatchAwareWorldView.notify(BatchAwareWorldView.java:83)
at cz.cuni.amis.pogamut.ut2004.communication.worldview.UT2004SyncLockableWorldView.processBatch(UT2004SyncLockableWorldView.java:206)
at cz.cuni.amis.pogamut.ut2004.communication.worldview.UT2004SyncLockableWorldView.processBatches(UT2004SyncLockableWorldView.java:193)
at cz.cuni.amis.pogamut.ut2004.communication.worldview.UT2004SyncLockableWorldView.unlock(UT2004SyncLockableWorldView.java:166)
at cz.cuni.amis.pogamut.ut2004.agent.module.logic.SyncUT2004BotLogic.afterLogicException(SyncUT2004BotLogic.java:79)
at cz.cuni.amis.pogamut.base.agent.module.LogicModule$LogicRunner.run(LogicModule.java:385)
at java.lang.Thread.run(Thread.java:619)

(Hunter SEE ITEM) WARNING 12:01:58.555 Thread 99: stopping the thread, received ComponentNotRunningException from UT2004SyncLockableWorldView.
(Hunter ITEMS) WARNING 12:01:57.939 Component AdvancedLocomotion has stopped.
(Hunter ENGAGE) WARNING 12:01:58.556 Thread 100: stopping the thread, received ComponentNotRunningException from SocketConnection-Writer.
(UCC) INFO 12:01:58.555 ID16 In: State: Dead, BeginState()

___

Because the exception is being thrown in a thread I didn't create, I don't really know how to deal with this. It seems like I can't possibly catch and deal with it without editing the Pogamut source, which I'd rather not do.

Related to this: If I'm completely restarting the server and all bots in it, then why would the bots have a problem being sluggish? I would think that killing everything and starting over would give me a fresh start and make things run like normal.

Posted by jakub.gemrot on Sun 21 of Nov, 2010 21:20 CET

Hi!

Cool, it seems you're using agents in a very interesting way ... I think I will need a bit more information :-)

1) How did you manage to reach thread number 99?

2) The agents you're restarting ... does it mean you're starting the same agent instance again and again? Or are you creating new agent instances after you shut down the server?

3) The log you're receiving is not wrong ... or at least it should not be, as it matches the scenario described below. I know it is confusing because it is an exception, but multi-threaded synchronization always results in weird situations ... What the exception depicts is this:
a) an agent is being stopped
b) which means the UT2004SyncLockableWorldView must stop as well
c) but the logic thread must stop too
d) unfortunately the logic is working with UT2004SyncLockableWorldView, and it happens that it must call one of the world view's methods during cleanup ...
e) no big deal: the logic thread is stopping too, so it lets the world view take care of itself and just terminates

4) So the question is: when does the exception show up? If it happens only when you're stopping your agent, it is completely OK.

5) Is it possible to send me the code where you set up your agents, start the server, etc.? Something might catch my eye. jakub.gemrot at gmail.com

6) How many bots are you using? UT2004 can't handle too many bots. Once you connect around 16 bots to the server (or fewer when the map is large), they will start to behave exactly the way you're describing (taking turns in command execution).

Cheers!

Jimmy

P.S.: please be patient, it will probably take some time to solve the issue

Posted by jacob.schrum on Mon 22 of Nov, 2010 04:02 CET

Here are the answers:

1) When I was initially troubleshooting, I was getting an error associated with calling bot.kill() and bot.stop() from within the bot. The error said that the kill command should be run from a Thread other than the one running the bot. Therefore, I created the following class:
---------------------------------
public class BotKiller extends Thread {

    private final BaseBot b;

    public BotKiller(BaseBot b) {
        this.b = b;
    }

    @Override
    public void run() {
        System.out.println("Killing " + b);
        //b.bot.kill();
        b.bot.stop();
        System.out.println("Done killing " + b);
    }
}
--------------------------------
As you can see from the comment, I've stopped using kill() and only use stop(), so running a separate Thread may no longer be necessary. BaseBot is a super class of all of my bot classes.

2) I completely shut down the server and kill/stop all of the bots. The reason for this is that I want to guarantee a fresh start. I've noticed that if I simply reload bots into the same server instance, there is sometimes a problem on the UT side where the bot's body lingers in the game after it has been disconnected. However, simply replacing the "brains" of constantly running bots might be a viable option.

3) I'm not entirely sure this is the source of the problem. However, when I watch several separate evaluations, they mostly seem to be running fine, until after the point where I see an error (I have tracked down and fixed previous errors/exceptions before this one). After the error, if I log in to watch the bots, then their movement is stunted.

4) I'm pretty sure this error only happens when the bot is being stopped.

5) I want to mess around with the code a bit before I send it off. I'm still trying some ideas on my own. However, there is one snippet I'd like you to look at. I'm not using the StoryControlServer class yet. I'm still running things my own way, simply because I've been using it for so long; I've been able to make stuff work before by using small populations and periodically resuming from a checkpoint, but I want to use larger populations, and this error is preventing me from finishing a single generation. Anyway, here is how I start and restart the server:
-------------------------------
MyUCCWrapper.MyUCCWrapperConf config = new MyUCCWrapper.MyUCCWrapperConf();
/* setup config */
MyUCCWrapper ucc = new MyUCCWrapper(config);

UT2004Server server = ucc.getUTServer();
System.out.println("Confirming empty server");
while (server.getAgents().size() > 0) {
    try {
        System.out.println("NOT EMPTY! RESET!");
        server.kill();
        Process p = ucc.getProcess();
        ucc.stop();
        p.destroy();
        synchronized (this) {
            this.wait(1000);
        }
    } catch (InterruptedException ex) {
        ex.printStackTrace();
        System.out.println("Wait to reset interrupted");
    } catch (Exception e) {
        System.out.println("Mysterious exception");
        e.printStackTrace();
        System.out.println(e);
    } finally {
        ucc = new MyUCCWrapper(config);
        server = ucc.getUTServer();
    }
}
//Server was launched?
System.out.println("Launch bots on empty server");

try {
    /* This line uses the MultipleBotLauncher to launch my bot and five Hunters.
     * It keeps track of the length of evaluation, and launches Threads to kill/stop all bots
     * when the eval time is up.
     */
    HunterNetwork.launchBot((TWEANNController) descriptor.controllers0, Hunter.class, 5);
} finally {
    while (true) {
        try {
            server.stop(); // An exception here would skip the break
            break;
        } catch (ComponentCantStopException ex) {
            System.out.println("SERVER COMPONENT CAN'T STOP! TRY AGAIN!");
            ex.printStackTrace();
        }
    }
    //server.kill();
    try {
        System.out.println("Server should be stopped");
        //server.stop();
        Process p = ucc.getProcess();
        ucc.stop();
        p.destroy();
    } finally {
        try {
            synchronized (this) {
                this.wait(1000);
            }
        } catch (InterruptedException ex) {
            System.out.println("Post-server-stop wait interrupted");
            Logger.getLogger(LocalBaseExperimentExecutorImpl.class.getName()).log(Level.SEVERE, null, ex);
            ex.printStackTrace();
        }
    }
}
------------------------------
6) As indicated in the comment above, each evaluation involves my bot and 5 Hunters for a total of 6.

I'll wait for your reply to this post to see if you have any more insights. If we still can't figure it out, I'll send you all of my code to muck through.

-Jacob

Posted by jakub.gemrot on Mon 22 of Nov, 2010 08:04 CET

Hi!

1)

You might want to wrap bot.stop() in a try/catch and call bot.kill() anyway if an exception is thrown ... stop() will not kill() the agent on its own if some exception happens during stopping (probably weird behavior ... I'm not sure about that).

2)

No other clue though :-(
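
A minimal sketch of the stop-then-kill fallback from point 1, assuming only that stop() may throw and that kill() is the forceful alternative. The `Stoppable` interface and the `SafeShutdown` class are hypothetical stand-ins for the bot object; they are not Pogamut types.

```java
final class SafeShutdown {

    /** Hypothetical stand-in for an object offering a graceful stop() and a forceful kill(). */
    interface Stoppable {
        void stop();
        void kill();
    }

    /**
     * Tries a graceful stop first and falls back to kill() if stop() throws.
     * Returns true when the graceful stop succeeded, false when kill() was needed.
     */
    static boolean stopOrKill(Stoppable agent) {
        try {
            agent.stop();
            return true;
        } catch (RuntimeException e) {
            System.err.println("stop() failed (" + e + "), falling back to kill()");
            agent.kill();
            return false;
        }
    }
}
```

In the BotKiller thread shown earlier, the body of run() could delegate to a helper like this instead of calling stop() directly.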

-----------

Things you might try to check:

a) whenever the bots start taking turns, check whether there is more than one ucc.exe process running in your OS

b) how much CPU power is used by ucc.exe when the problem occurs? Does it consume a whole CPU core? (UT2004 is single-threaded)

c.i) what about the JVM: does it have enough heap? You might want to profile your JVM (the NetBeans profiler should suffice) to check whether the JVM heap has been totally consumed (in that case, garbage collection would take most of the JVM's CPU time, resulting in bad performance of any code you have)

c.ii) or just try giving the JVM more heap straight away and check whether that gives you more time before the bots start to lag
http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html
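
As a cheap alternative to a profiler, heap usage can be logged between evaluations using the standard `Runtime` API. This is only a sketch: the `HeapReport` class name and the labels are made up for illustration. A larger heap is requested with the standard `-Xmx` JVM option, for example `java -Xmx1g ...`.

```java
final class HeapReport {

    /** Fraction of the maximum heap currently in use, between 0 and 1. */
    static double usedFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    /** One-line summary suitable for printing after each evaluation. */
    static String describe(String label) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        long used = (rt.totalMemory() - rt.freeMemory()) / mb;
        return label + ": used " + used + " MB of max " + (rt.maxMemory() / mb) + " MB";
    }
}
```

If the used fraction keeps climbing from one evaluation to the next and never drops back, something is probably holding on to objects from earlier evaluations.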

Best,
Jimmy

Posted by jacob.schrum on Wed 24 of Nov, 2010 17:56 CET

Well, I've done a lot of coding, and have the following to report. My original code completely reset the server and reconnected new bots for each evaluation. Then I made a version of the code that kept the same server running all the time, and only killed and reconnected the bots each time. This version would reset the server if it was not empty at the beginning of an eval, due to problems disconnecting. Then I made a third version of the code that uses the same server and bots all of the time, but replaces the "genome" of the running bot during execution.

All versions of this code eventually have the same problem, which is that all of the bots end up just standing mostly still. Sometimes they'll move, but generally not. In all versions of the code, there is only one instance of ucc.exe ever running at a time. The fact that the error occurs under all conditions leads me to believe I'm having memory issues as suggested, but I've had some trouble running the profiler because it causes the code to crash before anything interesting happens.

Since this error seems to have arisen mainly as a result of increasing the population size, it's possible that I'm simply caching too much information about the population each generation, and this is slowing things down. However, there is one other possibility that I want to ask the Pogamut team about:

Even when I use the same bots each time, I use a special class to collect information about bot performance to help compute the fitness function. This class defines several listeners. For example:

--------------------------------
getAgent().agentmemory.world.addEventListener(PlayerKilled.class, killListener = new IWorldEventListener<PlayerKilled>() {

    public void notify(PlayerKilled killed) {
        //System.out.println("Player killed");
        // someone was killed
        if (killed.getKiller().equals(getAgent().getMemory().getAgentID())) {
            // our agent killed him
            frags++;
            damageDone.put(killed.getId().getStringId(), 0);
            hitShots++;
        }
    }
});
--------------------------------

A possible problem is that I may be flooding the system with listeners, so I would like to know a bit more about how they work. For example, in the case where I kill the bots each time and restart them, I would assume that all associated listeners are destroyed. However, if I reuse the same bots each time, but keep reinstantiating my stats collecting class, then I might be duplicating the existing listeners. Does this sound like it could be an issue? What's the best way to check for existing listeners? There are several available methods, but they don't seem to have complete JavaDocs.
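
The duplication concern can be illustrated without Pogamut. In the sketch below, `EventSource`, `Listener` and `StatsCollector` are hypothetical stand-ins (not Pogamut classes) for the world view, its listener interface and the stats-collecting class. The collector keeps a handle to the listener it registered and removes it in `dispose()`, so re-instantiating the collector on a reused bot does not pile up listeners.

```java
import java.util.ArrayList;
import java.util.List;

final class ListenerLifecycle {

    /** Hypothetical listener type standing in for a world-view event listener. */
    interface Listener {
        void onEvent(String event);
    }

    /** Hypothetical event source that keeps listeners until they are explicitly removed. */
    static final class EventSource {
        private final List<Listener> listeners = new ArrayList<Listener>();

        void add(Listener l) {
            if (!listeners.contains(l)) {   // guard against registering the same instance twice
                listeners.add(l);
            }
        }

        void remove(Listener l) {
            listeners.remove(l);
        }

        int count() {
            return listeners.size();
        }

        void fire(String event) {
            // iterate over a copy so listeners may unregister while being notified
            for (Listener l : new ArrayList<Listener>(listeners)) {
                l.onEvent(event);
            }
        }
    }

    /** Per-evaluation statistics collector that unregisters its own listener. */
    static final class StatsCollector {
        private final EventSource source;
        private final Listener killListener;
        private int frags;

        StatsCollector(EventSource source) {
            this.source = source;
            this.killListener = new Listener() {
                public void onEvent(String event) {
                    if ("kill".equals(event)) {
                        frags++;
                    }
                }
            };
            source.add(killListener);
        }

        int frags() {
            return frags;
        }

        /** Call once the evaluation is over, before a new collector is created. */
        void dispose() {
            source.remove(killListener);
        }
    }
}
```

Without the `dispose()` call, every new collector would add one more listener to the same source, and each old collector would keep counting events for evaluations it no longer belongs to.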

Posted by michal.bida on Wed 24 of Nov, 2010 20:04 CET

When I was doing multiple runs of my bots with Pogamut I was always restarting the server completely - that means I even shut down the ucc.exe and restarted it again - you can do it with UCCWrapper now.

In the end I had a .bat file that started the server, ran my bots, killed the bots, killed the server, and repeated this in a loop for X runs.

I am not sure whether this bug may be hidden in ucc. Is it possible to connect to the server after multiple runs of your bots have joined and left? Isn't the server flooded?

best,
michal

Posted by michal.bida on Wed 24 of Nov, 2010 20:06 CET

The code of my bat file:

:: Automatically runs the emotion experiment
::
@ECHO ON

SET LOOP=0

::run another experiment
ECHO RUNNING EXPERIMENT %LOOP%

ECHO STARTING UT SERVER
START F:\Hry\UT2004\scenario_server.bat

SLEEP 40

f:
cd F:\Temp\Experiments_new01\
START F:\Temp\Pogamut\branches\devel\project\addons\core\EmotionModel\dist\EmotionalScenario.jar
SLEEP 650

:LOOP
::KILL previous process
TASKKILL /F /IM "UCC.exe"
TASKKILL /F /IM "javaw.exe"
::wait a bit
SLEEP 10

ECHO STARTING UT SERVER
START F:\Hry\UT2004\scenario_server.bat

SLEEP 40

::run another experiment
ECHO RUNNING EXPERIMENT %LOOP%
START F:\Temp\Pogamut\branches\devel\project\addons\core\EmotionModel\dist\EmotionalScenario.jar
:: sleep 11 minutes
SLEEP 650

SET /A LOOP=%LOOP%+1
IF NOT "%LOOP%" == "50" GOTO LOOP


SET LOOP=

:END

Posted by jakub.gemrot on Thu 25 of Nov, 2010 08:24 CET

Hi!

Have you tried running the JVM with extra memory? That would mean you have to start your main class manually, which is a bit tedious, but it can be done.

Regarding listeners: I haven't touched that topic (inside the code) yet. When I was writing the world view, I could not decide whether it is a good thing to remove all the listeners whenever the world view is shut down, so I decided not to drop them. It might be possible that you're flooding the system with listeners, but that does not explain why your bots are suffocating, because you're listening to the PlayerKilled message, which is not sent that often. Nevertheless, you might want to review your code regarding listeners and attach them only if they do not exist.

An alternative is to work with the EventReact / EventReactOnce / ObjectEventReact / ObjectEventReactOnce classes, which you can create once inside "prepareBot" and then disable() / enable() whenever you like.

The last option you might try is to use annotations for listeners.

But either way, I personally do not think this is the source of the problem.

I think all this leads to making some workaround ... I suggest that you:

1) create code for saving/loading genomes / population of your bots
2) perform only N iterations of your GA in your main method
3) create a bat file that always starts a new JVM (with extra memory, something like 1GB), which continues where the last JVM finished
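
Steps 1 and 2 could be sketched with plain Java serialization. Everything here is an assumption made for illustration: the `Checkpoint` class, the file layout, and above all the representation of a genome as a `double[]`, which a real neural-network genome would replace with its own serializable type.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;

final class Checkpoint {

    /** What one JVM run hands over to the next: generation counter plus population. */
    static final class State {
        final int generation;
        final ArrayList<double[]> population;

        State(int generation, ArrayList<double[]> population) {
            this.generation = generation;
            this.population = population;
        }
    }

    /** Writes the generation number and the population to the given file. */
    static void save(File file, int generation, ArrayList<double[]> population) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeInt(generation);
            out.writeObject(population);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back what save() wrote, in the same order. */
    @SuppressWarnings("unchecked")
    static State load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            int generation = in.readInt();
            ArrayList<double[]> population = (ArrayList<double[]>) in.readObject();
            return new State(generation, population);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Checkpoint written by incompatible code", e);
        }
    }
}
```

A main method built on this would load the checkpoint if the file exists, run N generations, save, and exit; the bat file from step 3 then simply starts the same command again.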

But if you can share the code of your project, I can offer to test it on our side (which is usually the best thing to do, as we can observe it and discuss the strange Pogamut behavior in our team).

Best regards,
Jakub

Posted by jacob.schrum on Fri 26 of Nov, 2010 06:36 CET

Thank you for all of the help and suggestions. I think I may have finally fixed the problem, but we'll see how well things are running after a few days' worth of evolution.

The problem basically had to do with my special data collecting class that I was using to compute fitness values. This class maintained references to lots of the bot's internal classes after the bot finished running. So even though I was completely done with a given bot, I still had (via a long chain) references to its AgentInfo, Players, Senses, PathPlanner, etc. classes. This made for a lot of leaking memory.

I also had references to listeners that no longer needed to be used. The one listener example I mentioned above was just an example. I had listeners for several types of messages, including Self messages, which are very frequent. The listeners themselves may not have been interfering, but they were another memory leak, so I fixed that.
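
One way to avoid this kind of leak, sketched here under assumptions, is to copy the few numbers the fitness function needs into a small plain object and then drop the reference to the bot. `Bot`, `Result` and `Collector` are hypothetical names for illustration; the real collector and bot classes will look different.

```java
final class FitnessSnapshot {

    /** Hypothetical heavyweight bot with internal state that should not outlive the evaluation. */
    static final class Bot {
        final byte[] internalState = new byte[1024];
        int frags;
        int deaths;
    }

    /** Plain value object: holds only primitives, no reference back to the bot. */
    static final class Result {
        final int frags;
        final int deaths;

        Result(int frags, int deaths) {
            this.frags = frags;
            this.deaths = deaths;
        }
    }

    /** Collector that releases the bot as soon as the evaluation is finished. */
    static final class Collector {
        private Bot bot;

        Collector(Bot bot) {
            this.bot = bot;
        }

        /** Copies the needed values out and clears the reference so the bot can be collected. */
        Result finish() {
            Result result = new Result(bot.frags, bot.deaths);
            bot = null;
            return result;
        }

        boolean holdsBot() {
            return bot != null;
        }
    }
}
```

Only the `Result` objects then need to be kept for the whole generation, regardless of how large the population is.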

If after all this things still don't work, I'll go ahead and send in my code.

Posted by jakub.gemrot on Sat 27 of Nov, 2010 11:26 CET

Cool! I think you've found the source of your problems!

If it is somehow possible, I would suggest using WeakReference(s) in any code of yours that references the Pogamut bot's internal data structures ... Actually, the listeners are held through weak references, so you do not need to remove them from the bot manually: because they are only weakly referenced, they do not prevent the bot from being GCed.

So you might want to have something like:
WeakReference<AgentInfo> agentInfoRef = new WeakReference<AgentInfo>(bot.getAgentInfo());
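
When reading through a weak reference, the referent may already be gone, so every access has to check for null. A minimal sketch of that pattern using only `java.lang.ref.WeakReference`; `AgentData` is a hypothetical placeholder for whatever bot-internal object is being referenced.

```java
import java.lang.ref.WeakReference;

final class WeakRefDemo {

    /** Hypothetical placeholder for a bot-internal data structure. */
    static final class AgentData {
        final String name;

        AgentData(String name) {
            this.name = name;
        }
    }

    /** Reads through the weak reference, handling the case where the referent is gone. */
    static String describe(WeakReference<AgentData> ref) {
        AgentData data = ref.get();   // strong reference only for the duration of this call
        return data == null ? "<collected>" : data.name;
    }
}
```

The important detail is to call get() once and keep the result in a local variable, because two separate get() calls may see different answers if a garbage collection happens in between.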

Good luck!

Jimmy
