OSDN Git Service

habanalabs/gaudi: Fix collective wait bug
authorfarah kassabri <fkassabri@habana.ai>
Wed, 3 Nov 2021 11:15:55 +0000 (13:15 +0200)
committerOded Gabbay <ogabbay@kernel.org>
Sun, 26 Dec 2021 06:59:05 +0000 (08:59 +0200)
In Signaling-From-Graph case, the driver didn't set the hw_sob pointer
at the right place, which is needed for the cs completion
check prior to start sending all the master/slaves jobs to device.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
drivers/misc/habanalabs/gaudi/gaudi.c

index b481436..a9e279b 100644 (file)
@@ -1276,6 +1276,7 @@ static int gaudi_collective_wait_init_cs(struct hl_cs *cs)
                container_of(cs->signal_fence, struct hl_cs_compl, base_fence);
        struct hl_cs_compl *cs_cmpl =
                container_of(cs->fence, struct hl_cs_compl, base_fence);
+       struct hl_cs_encaps_sig_handle *handle = cs->encaps_sig_hdl;
        struct gaudi_collective_properties *cprop;
        u32 stream, queue_id, sob_group_offset;
        struct gaudi_device *gaudi;
@@ -1288,10 +1289,16 @@ static int gaudi_collective_wait_init_cs(struct hl_cs *cs)
        gaudi = hdev->asic_specific;
        cprop = &gaudi->collective_props;
 
-       /* In encaps signals case the SOB info will be retrieved from
-        * the handle in gaudi_collective_slave_init_job.
-        */
-       if (!cs->encaps_signals) {
+       if (cs->encaps_signals) {
+               cs_cmpl->hw_sob = handle->hw_sob;
+               /* at this checkpoint we only need the hw_sob pointer
+                * for the completion check before start going over the jobs
+                * of the master/slaves, the sob_value will be taken later on
+                * in gaudi_collective_slave_init_job depends on each
+                * job wait offset value.
+                */
+               cs_cmpl->sob_val = 0;
+       } else {
                /* copy the SOB id and value of the signal CS */
                cs_cmpl->hw_sob = signal_cs_cmpl->hw_sob;
                cs_cmpl->sob_val = signal_cs_cmpl->sob_val;