Input data size could be a user hint or based off of previous run or hadoop file or Hive stastitics ala Spark's own cost-based-query-planner.