It's actually interesting to watch it without sound so you can concentrate on the visual changes and contrast (you do hit the right moments lip-sync-wise during the pauses and the breathing).
I understand the tone of the clip and the emotional continuity, but when I watched it the first time without sound (soooo sleepy... moving slowly - you should see me type :) ), I felt like the overall facial expression was pretty much the same throughout. Look at the very beginning and the very end: head down, eyes off to the side, head tilted away a bit, and the middle part is just a softer version of that. I do like how the body changes direction in the middle; that works well.
Maybe she could be insecure and timid at the beginning (like you have it), but after she turns she's a bit more concentrated, probing and serious? It would work with the sound and would be emotionally and visually a bit more contrasty. Something so that the very first and very last frames feel different (especially facially).
Hope that makes sense...